* Bug in kernel 2.6.31, Slow wb_kupdate writeout
@ 2009-07-28 19:11 Chad Talbott
2009-07-28 21:49 ` Martin Bligh
2009-07-30 21:39 ` Jens Axboe
0 siblings, 2 replies; 32+ messages in thread
From: Chad Talbott @ 2009-07-28 19:11 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, wfg, Martin Bligh, Michael Rubin, Andrew Morton,
sandeen
I run a simple workload on a 4GB machine which dirties a few largish
inodes like so:
# seq 10 | xargs -P0 -n1 -i\{} dd if=/dev/zero of=/tmp/dump\{}
bs=1024k count=100
While the dds are running data is written out at disk speed. However,
once the dds have run to completion and exited there is ~500MB of
dirty memory left. Background writeout then takes about 3 more
minutes to clean memory at only ~3.3MB/s. When I explicitly sync, I
can see that the disk is capable of 40MB/s, which finishes off the
files in ~10s. [1]
An interesting recent-ish change is "writeback: speed up writeback of
big dirty files." When I revert the change to __sync_single_inode the
problem appears to go away and background writeout proceeds at disk
speed. Interestingly, that code is in the git commit [2], but not in
the post to LKML. [3] This may not be the fix, but it makes this
test behave better.
Thanks,
Chad
[1] I've plotted the dirty memory from /proc/meminfo and disk write
speed from iostat at
http://sites.google.com/site/cwtlinux/2-6-31-writeback-bug
[2] git commit:
http://mirror.celinuxforum.org/gitstat/commit-detail.php?commit=8bc3be2751b4f74ab90a446da1912fd8204d53f7
[3] LKML post: http://marc.info/?l=linux-kernel&m=119131601130372&w=2
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
2009-07-28 19:11 Bug in kernel 2.6.31, Slow wb_kupdate writeout Chad Talbott
@ 2009-07-28 21:49 ` Martin Bligh
2009-07-29 7:15 ` Martin Bligh
2009-07-30 21:39 ` Jens Axboe
1 sibling, 1 reply; 32+ messages in thread
From: Martin Bligh @ 2009-07-28 21:49 UTC (permalink / raw)
To: Chad Talbott
Cc: linux-kernel, linux-mm, wfg, Michael Rubin, Andrew Morton,
sandeen
> An interesting recent-ish change is "writeback: speed up writeback of
> big dirty files." When I revert the change to __sync_single_inode the
> problem appears to go away and background writeout proceeds at disk
> speed. Interestingly, that code is in the git commit [2], but not in
> the post to LKML. [3] This may not be the fix, but it makes this
> test behave better.
I'm fairly sure this is not fixing the root cause - but putting the inode at
the head rather than the tail of the queue keeps the error from starving
wb_kupdate for nearly so long. As long as we keep the queue full, the bug is hidden.
M.
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
2009-07-28 21:49 ` Martin Bligh
@ 2009-07-29 7:15 ` Martin Bligh
2009-07-29 11:43 ` Wu Fengguang
0 siblings, 1 reply; 32+ messages in thread
From: Martin Bligh @ 2009-07-29 7:15 UTC (permalink / raw)
To: Chad Talbott
Cc: linux-kernel, linux-mm, wfg, Michael Rubin, Andrew Morton,
sandeen, Michael Davidson
On Tue, Jul 28, 2009 at 2:49 PM, Martin Bligh<mbligh@google.com> wrote:
>> An interesting recent-ish change is "writeback: speed up writeback of
>> big dirty files." When I revert the change to __sync_single_inode the
>> problem appears to go away and background writeout proceeds at disk
>> speed. Interestingly, that code is in the git commit [2], but not in
>> the post to LKML. [3] This may not be the fix, but it makes this
>> test behave better.
>
> I'm fairly sure this is not fixing the root cause - but putting it at the head
> rather than the tail of the queue causes the error not to starve wb_kupdate
> for nearly so long - as long as we keep the queue full, the bug is hidden.
OK, it seems this is the root cause - I wasn't clear why all the pages weren't
being written back, and thought there was another bug. What happens is
we go into write_cache_pages, and stuff the disk queue with as much as
we can put into it, and then inevitably hit the congestion limit.
Then we back out to __sync_single_inode, who says "huh, you didn't manage
to write your whole slice", and penalizes the poor blameless inode in question
by putting it back into the penalty box for 30s.
This results in very lumpy I/O writeback at 5s intervals, and very
poor throughput.
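For reference, the two queueing primitives involved look roughly like this
in 2.6.31-era fs/fs-writeback.c (a from-memory sketch, not a verbatim
quote):

static void requeue_io(struct inode *inode)
{
	/* keep the inode on deck: it gets more writeout as soon as
	 * the queue uncongests */
	list_move(&inode->i_list, &inode->i_sb->s_more_io);
}

static void redirty_tail(struct inode *inode)
{
	struct super_block *sb = inode->i_sb;

	/*
	 * May refresh dirtied_when, so the inode sits out a whole new
	 * expire interval - the "penalty box" described above.
	 */
	if (!list_empty(&sb->s_dirty)) {
		struct inode *tail_inode;

		tail_inode = list_entry(sb->s_dirty.next, struct inode, i_list);
		if (time_before(inode->dirtied_when, tail_inode->dirtied_when))
			inode->dirtied_when = jiffies;
	}
	list_move(&inode->i_list, &sb->s_dirty);
}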
The patch below is inline and probably text-munged; it's for RFC only. I'll
test it more thoroughly tomorrow. As for the comment about starving other writes,
I believe requeue_io moves it from s_io to s_more_io which should at least
allow some progress of other files.
--- linux-2.6.30/fs/fs-writeback.c.old 2009-07-29 00:08:29.000000000 -0700
+++ linux-2.6.30/fs/fs-writeback.c 2009-07-29 00:11:28.000000000 -0700
@@ -322,46 +322,11 @@ __sync_single_inode(struct inode *inode,
/*
* We didn't write back all the pages. nfs_writepages()
* sometimes bales out without doing anything. Redirty
- * the inode; Move it from s_io onto s_more_io/s_dirty.
+ * the inode; Move it from s_io onto s_more_io. It
+ * may well have just encountered congestion
*/
- /*
- * akpm: if the caller was the kupdate function we put
- * this inode at the head of s_dirty so it gets first
- * consideration. Otherwise, move it to the tail, for
- * the reasons described there. I'm not really sure
- * how much sense this makes. Presumably I had a good
- * reasons for doing it this way, and I'd rather not
- * muck with it at present.
- */
- if (wbc->for_kupdate) {
- /*
- * For the kupdate function we move the inode
- * to s_more_io so it will get more writeout as
- * soon as the queue becomes uncongested.
- */
- inode->i_state |= I_DIRTY_PAGES;
- if (wbc->nr_to_write <= 0) {
- /*
- * slice used up: queue for next turn
- */
- requeue_io(inode);
- } else {
- /*
- * somehow blocked: retry later
- */
- redirty_tail(inode);
- }
- } else {
- /*
- * Otherwise fully redirty the inode so that
- * other inodes on this superblock will get some
- * writeout. Otherwise heavy writing to one
- * file would indefinitely suspend writeout of
- * all the other files.
- */
- inode->i_state |= I_DIRTY_PAGES;
- redirty_tail(inode);
- }
+ inode->i_state |= I_DIRTY_PAGES;
+ requeue_io(inode);
} else if (inode->i_state & I_DIRTY) {
/*
* Someone redirtied the inode while were writing back
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
2009-07-29 7:15 ` Martin Bligh
@ 2009-07-29 11:43 ` Wu Fengguang
2009-07-29 14:11 ` Martin Bligh
2009-07-30 0:19 ` Martin Bligh
0 siblings, 2 replies; 32+ messages in thread
From: Wu Fengguang @ 2009-07-29 11:43 UTC (permalink / raw)
To: Martin Bligh
Cc: Chad Talbott, linux-kernel, linux-mm, Michael Rubin,
Andrew Morton, sandeen, Michael Davidson
On Wed, Jul 29, 2009 at 12:15:48AM -0700, Martin Bligh wrote:
> On Tue, Jul 28, 2009 at 2:49 PM, Martin Bligh<mbligh@google.com> wrote:
> >> An interesting recent-ish change is "writeback: speed up writeback of
> >> big dirty files." When I revert the change to __sync_single_inode the
> >> problem appears to go away and background writeout proceeds at disk
> >> speed. Interestingly, that code is in the git commit [2], but not in
> >> the post to LKML. [3] This may not be the fix, but it makes this
> >> test behave better.
> >
> > I'm fairly sure this is not fixing the root cause - but putting it at the head
> > rather than the tail of the queue causes the error not to starve wb_kupdate
> > for nearly so long - as long as we keep the queue full, the bug is hidden.
>
> OK, it seems this is the root cause - I wasn't clear why all the pages weren't
> being written back, and thought there was another bug. What happens is
> we go into write_cache_pages, and stuff the disk queue with as much as
> we can put into it, and then inevitably hit the congestion limit.
>
> Then we back out to __sync_single_inode, who says "huh, you didn't manage
> to write your whole slice", and penalizes the poor blameless inode in question
> by putting it back into the penalty box for 30s.
>
> This results in very lumpy I/O writeback at 5s intervals, and very
> poor throughput.
You are right, so let's fix the congestion case. Your analysis would
make a perfect changelog :)
> Patch below is inline and probably text munged, but is for RFC only.
> I'll test it
> more thoroughly tomorrow. As for the comment about starving other writes,
> I believe requeue_io moves it from s_io to s_more_io which should at least
> allow some progress of other files.
>
> --- linux-2.6.30/fs/fs-writeback.c.old 2009-07-29 00:08:29.000000000 -0700
> +++ linux-2.6.30/fs/fs-writeback.c 2009-07-29 00:11:28.000000000 -0700
> @@ -322,46 +322,11 @@ __sync_single_inode(struct inode *inode,
> /*
> * We didn't write back all the pages. nfs_writepages()
> * sometimes bales out without doing anything. Redirty
[snip]
> - if (wbc->nr_to_write <= 0) {
> - /*
> - * slice used up: queue for next turn
> - */
> - requeue_io(inode);
> - } else {
> - /*
> - * somehow blocked: retry later
> - */
> - redirty_tail(inode);
Removing this line can be dangerous - we'll probably go into busy
waiting (I have tried that long ago).
Chad, can you try this small patch? Thank you.
Thanks,
Fengguang
---
fs/fs-writeback.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
--- mm.orig/fs/fs-writeback.c
+++ mm/fs/fs-writeback.c
@@ -325,7 +325,8 @@ __sync_single_inode(struct inode *inode,
* soon as the queue becomes uncongested.
*/
inode->i_state |= I_DIRTY_PAGES;
- if (wbc->nr_to_write <= 0) {
+ if (wbc->nr_to_write <= 0 ||
+ wbc->encountered_congestion) {
/*
* slice used up: queue for next turn
*/
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
2009-07-29 11:43 ` Wu Fengguang
@ 2009-07-29 14:11 ` Martin Bligh
2009-07-30 1:06 ` Wu Fengguang
2009-07-30 0:19 ` Martin Bligh
1 sibling, 1 reply; 32+ messages in thread
From: Martin Bligh @ 2009-07-29 14:11 UTC (permalink / raw)
To: Wu Fengguang
Cc: Chad Talbott, linux-kernel, linux-mm, Michael Rubin,
Andrew Morton, sandeen, Michael Davidson
> --- mm.orig/fs/fs-writeback.c
> +++ mm/fs/fs-writeback.c
> @@ -325,7 +325,8 @@ __sync_single_inode(struct inode *inode,
> * soon as the queue becomes uncongested.
> */
> inode->i_state |= I_DIRTY_PAGES;
> - if (wbc->nr_to_write <= 0) {
> + if (wbc->nr_to_write <= 0 ||
> + wbc->encountered_congestion) {
> /*
> * slice used up: queue for next turn
> */
>
That's not sufficient - it only fixes the problem in the wb_kupdate path. If you want
to be more conservative, how about we do this?
--- linux-2.6.30/fs/fs-writeback.c.old 2009-07-29 00:08:29.000000000 -0700
+++ linux-2.6.30/fs/fs-writeback.c 2009-07-29 07:08:48.000000000 -0700
@@ -323,43 +323,14 @@ __sync_single_inode(struct inode *inode,
* We didn't write back all the pages. nfs_writepages()
* sometimes bales out without doing anything. Redirty
* the inode; Move it from s_io onto s_more_io/s_dirty.
+ * It may well have just encountered congestion
*/
- /*
- * akpm: if the caller was the kupdate function we put
- * this inode at the head of s_dirty so it gets first
- * consideration. Otherwise, move it to the tail, for
- * the reasons described there. I'm not really sure
- * how much sense this makes. Presumably I had a good
- * reasons for doing it this way, and I'd rather not
- * muck with it at present.
- */
- if (wbc->for_kupdate) {
- /*
- * For the kupdate function we move the inode
- * to s_more_io so it will get more writeout as
- * soon as the queue becomes uncongested.
- */
- inode->i_state |= I_DIRTY_PAGES;
- if (wbc->nr_to_write <= 0) {
- /*
- * slice used up: queue for next turn
- */
- requeue_io(inode);
- } else {
- /*
- * somehow blocked: retry later
- */
- redirty_tail(inode);
- }
- } else {
- /*
- * Otherwise fully redirty the inode so that
- * other inodes on this superblock will get some
- * writeout. Otherwise heavy writing to one
- * file would indefinitely suspend writeout of
- * all the other files.
- */
- inode->i_state |= I_DIRTY_PAGES;
+ inode->i_state |= I_DIRTY_PAGES;
+ if (wbc->nr_to_write <= 0 || /* slice used up */
+ wbc->encountered_congestion)
+ requeue_io(inode);
+ else {
+ /* somehow blocked: retry later */
redirty_tail(inode);
}
} else if (inode->i_state & I_DIRTY) {
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
2009-07-29 11:43 ` Wu Fengguang
2009-07-29 14:11 ` Martin Bligh
@ 2009-07-30 0:19 ` Martin Bligh
2009-07-30 1:28 ` Martin Bligh
2009-07-30 1:49 ` Wu Fengguang
1 sibling, 2 replies; 32+ messages in thread
From: Martin Bligh @ 2009-07-30 0:19 UTC (permalink / raw)
To: Wu Fengguang
Cc: Chad Talbott, linux-kernel, linux-mm, Michael Rubin,
Andrew Morton, sandeen, Michael Davidson
BTW, can you explain this code at the bottom of generic_sync_sb_inodes
for me?
if (wbc->nr_to_write <= 0) {
wbc->more_io = 1;
break;
}
I don't understand why we are setting more_io here? AFAICS, more_io
means there's more stuff to write ... I would think we'd set this if
nr_to_write was > 0 ?
Or just have the section below brought up above this
break check and do:
if (!list_empty(&sb->s_more_io) || !list_empty(&sb->s_io))
wbc->more_io = 1;
Am I just misunderstanding the intent of more_io ?
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
2009-07-29 14:11 ` Martin Bligh
@ 2009-07-30 1:06 ` Wu Fengguang
2009-07-30 1:12 ` Martin Bligh
0 siblings, 1 reply; 32+ messages in thread
From: Wu Fengguang @ 2009-07-30 1:06 UTC (permalink / raw)
To: Martin Bligh
Cc: Chad Talbott, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
Michael Rubin, Andrew Morton, sandeen@redhat.com,
Michael Davidson
On Wed, Jul 29, 2009 at 10:11:10PM +0800, Martin Bligh wrote:
> > --- mm.orig/fs/fs-writeback.c
> > +++ mm/fs/fs-writeback.c
> > @@ -325,7 +325,8 @@ __sync_single_inode(struct inode *inode,
> > * soon as the queue becomes uncongested.
> > */
> > inode->i_state |= I_DIRTY_PAGES;
> > - if (wbc->nr_to_write <= 0) {
> > + if (wbc->nr_to_write <= 0 ||
> > + wbc->encountered_congestion) {
> > /*
> > * slice used up: queue for next turn
> > */
> >
>
> That's not sufficient - it only fixes the problem in the wb_kupdate path. If you want
> to be more conservative, how about we do this?
I agree on the unification of the kupdate and sync paths. In fact I had a
patch for doing this, and I'd recommend doing it in two patches:
one to fix the congestion case, another to do the code unification.
The sync path doesn't care whether requeue_io() or redirty_tail() is
used, because it disregards the time stamps totally - only the order of
inodes matters (i.e. starvation), which is the same for requeue_io()/redirty_tail().
Thanks,
Fengguang
> --- linux-2.6.30/fs/fs-writeback.c.old 2009-07-29 00:08:29.000000000 -0700
> +++ linux-2.6.30/fs/fs-writeback.c 2009-07-29 07:08:48.000000000 -0700
> @@ -323,43 +323,14 @@ __sync_single_inode(struct inode *inode,
> * We didn't write back all the pages. nfs_writepages()
> * sometimes bales out without doing anything. Redirty
> * the inode; Move it from s_io onto s_more_io/s_dirty.
> + * It may well have just encountered congestion
> */
> - /*
> - * akpm: if the caller was the kupdate function we put
> - * this inode at the head of s_dirty so it gets first
> - * consideration. Otherwise, move it to the tail, for
> - * the reasons described there. I'm not really sure
> - * how much sense this makes. Presumably I had a good
> - * reasons for doing it this way, and I'd rather not
> - * muck with it at present.
> - */
> - if (wbc->for_kupdate) {
> - /*
> - * For the kupdate function we move the inode
> - * to s_more_io so it will get more writeout as
> - * soon as the queue becomes uncongested.
> - */
> - inode->i_state |= I_DIRTY_PAGES;
> - if (wbc->nr_to_write <= 0) {
> - /*
> - * slice used up: queue for next turn
> - */
> - requeue_io(inode);
> - } else {
> - /*
> - * somehow blocked: retry later
> - */
> - redirty_tail(inode);
> - }
> - } else {
> - /*
> - * Otherwise fully redirty the inode so that
> - * other inodes on this superblock will get some
> - * writeout. Otherwise heavy writing to one
> - * file would indefinitely suspend writeout of
> - * all the other files.
> - */
> - inode->i_state |= I_DIRTY_PAGES;
> + inode->i_state |= I_DIRTY_PAGES;
> + if (wbc->nr_to_write <= 0 || /* slice used up */
> + wbc->encountered_congestion)
> + requeue_io(inode);
> + else {
> + /* somehow blocked: retry later */
> redirty_tail(inode);
> }
> } else if (inode->i_state & I_DIRTY) {
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
2009-07-30 1:06 ` Wu Fengguang
@ 2009-07-30 1:12 ` Martin Bligh
2009-07-30 1:57 ` Wu Fengguang
0 siblings, 1 reply; 32+ messages in thread
From: Martin Bligh @ 2009-07-30 1:12 UTC (permalink / raw)
To: Wu Fengguang
Cc: Chad Talbott, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
Michael Rubin, Andrew Morton, sandeen@redhat.com,
Michael Davidson
> I agree on the unification of the kupdate and sync paths. In fact I had a
> patch for doing this, and I'd recommend doing it in two patches:
> one to fix the congestion case, another to do the code unification.
>
> The sync path doesn't care whether requeue_io() or redirty_tail() is
> used, because it disregards the time stamps totally - only the order of
> inodes matters (i.e. starvation), which is the same for requeue_io()/redirty_tail().
But, as I understand it, both paths share the same lists, so we still have
to be consistent?
Also, you set flags like more_io higher up in sync_sb_inodes() based on
whether there's anything in the s_more_io queue, so it still seems to have
some effect to me?
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
2009-07-30 0:19 ` Martin Bligh
@ 2009-07-30 1:28 ` Martin Bligh
2009-07-30 2:09 ` Wu Fengguang
2009-07-30 1:49 ` Wu Fengguang
1 sibling, 1 reply; 32+ messages in thread
From: Martin Bligh @ 2009-07-30 1:28 UTC (permalink / raw)
To: Wu Fengguang
Cc: Chad Talbott, linux-kernel, linux-mm, Michael Rubin,
Andrew Morton, sandeen, Michael Davidson
On Wed, Jul 29, 2009 at 5:19 PM, Martin Bligh<mbligh@google.com> wrote:
> BTW, can you explain this code at the bottom of generic_sync_sb_inodes
> for me?
>
> if (wbc->nr_to_write <= 0) {
> wbc->more_io = 1;
> break;
> }
>
> I don't understand why we are setting more_io here? AFAICS, more_io
> means there's more stuff to write ... I would think we'd set this if
> nr_to_write was > 0 ?
>
> Or just have the section below brought up above this
> break check and do:
>
> if (!list_empty(&sb->s_more_io) || !list_empty(&sb->s_io))
> wbc->more_io = 1;
>
> Am I just misunderstanding the intent of more_io ?
I am thinking along the lines of:
@@ -638,13 +609,11 @@ sync_sb_inodes(struct super_block *sb, s
iput(inode);
cond_resched();
spin_lock(&inode_lock);
- if (wbc->nr_to_write <= 0) {
- wbc->more_io = 1;
+ if (wbc->nr_to_write <= 0)
break;
- }
- if (!list_empty(&sb->s_more_io))
- wbc->more_io = 1;
}
+ if (!list_empty(&sb->s_more_io) || !list_empty(&sb->s_io))
+ wbc->more_io = 1;
return; /* Leave any unwritten inodes on s_io */
}
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
2009-07-30 0:19 ` Martin Bligh
2009-07-30 1:28 ` Martin Bligh
@ 2009-07-30 1:49 ` Wu Fengguang
1 sibling, 0 replies; 32+ messages in thread
From: Wu Fengguang @ 2009-07-30 1:49 UTC (permalink / raw)
To: Martin Bligh
Cc: Chad Talbott, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
Michael Rubin, Andrew Morton, sandeen@redhat.com,
Michael Davidson
On Thu, Jul 30, 2009 at 08:19:34AM +0800, Martin Bligh wrote:
> BTW, can you explain this code at the bottom of generic_sync_sb_inodes
> for me?
>
> if (wbc->nr_to_write <= 0) {
> wbc->more_io = 1;
> break;
> }
>
> I don't understand why we are setting more_io here? AFAICS, more_io
> means there's more stuff to write ... I would think we'd set this if
> nr_to_write was > 0 ?
That's true: wbc.nr_to_write will always be set to MAX_WRITEBACK_PAGES
by wb_writeback() before entering generic_sync_sb_inodes().
So wbc.nr_to_write <= 0 indicates we were interrupted because the quota was
used up, and should revisit generic_sync_sb_inodes() to check for more io
(which will _normally_ find more dirty pages to write).
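For context, here is approximately what the wb_kupdate() loop consuming
these flags looks like in 2.6.31 (an abbreviated sketch; the exact code may
differ slightly):

	while (nr_to_write > 0) {
		wbc.more_io = 0;
		wbc.encountered_congestion = 0;
		wbc.nr_to_write = MAX_WRITEBACK_PAGES;	/* refilled each pass */
		writeback_inodes(&wbc);
		if (wbc.nr_to_write > 0) {
			/* wrote less than a full slice */
			if (wbc.encountered_congestion || wbc.more_io)
				congestion_wait(BLK_RW_ASYNC, HZ/10);
			else
				break;	/* all the old data is written */
		}
		nr_to_write -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
	}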
> Or just have the section below brought up above this
> break check and do:
>
> if (!list_empty(&sb->s_more_io) || !list_empty(&sb->s_io))
> wbc->more_io = 1;
>
> Am I just misunderstanding the intent of more_io ?
It should be OK. I agree on the change if it makes the logic more
straightforward.
Thanks,
Fengguang
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
2009-07-30 1:12 ` Martin Bligh
@ 2009-07-30 1:57 ` Wu Fengguang
2009-07-30 2:59 ` Martin Bligh
0 siblings, 1 reply; 32+ messages in thread
From: Wu Fengguang @ 2009-07-30 1:57 UTC (permalink / raw)
To: Martin Bligh
Cc: Chad Talbott, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
Michael Rubin, Andrew Morton, sandeen@redhat.com,
Michael Davidson
On Thu, Jul 30, 2009 at 09:12:26AM +0800, Martin Bligh wrote:
> > I agree on the unification of the kupdate and sync paths. In fact I had a
> > patch for doing this, and I'd recommend doing it in two patches:
> > one to fix the congestion case, another to do the code unification.
> >
> > The sync path doesn't care whether requeue_io() or redirty_tail() is
> > used, because it disregards the time stamps totally - only the order of
> > inodes matters (i.e. starvation), which is the same for requeue_io()/redirty_tail().
>
> But, as I understand it, both paths share the same lists, so we still have
> to be consistent?
Then let's first unify the code, then fix the congestion case? :)
> Also, you set flags like more_io higher up in sync_sb_inodes() based on
> whether there's anything in s_more_io queue, so it still seems to have
> some effect to me?
Yes, maybe in some rare cases.
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
2009-07-30 1:28 ` Martin Bligh
@ 2009-07-30 2:09 ` Wu Fengguang
2009-07-30 2:57 ` Martin Bligh
0 siblings, 1 reply; 32+ messages in thread
From: Wu Fengguang @ 2009-07-30 2:09 UTC (permalink / raw)
To: Martin Bligh
Cc: Chad Talbott, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
Michael Rubin, Andrew Morton, sandeen@redhat.com,
Michael Davidson
On Thu, Jul 30, 2009 at 09:28:07AM +0800, Martin Bligh wrote:
> On Wed, Jul 29, 2009 at 5:19 PM, Martin Bligh<mbligh@google.com> wrote:
> > BTW, can you explain this code at the bottom of generic_sync_sb_inodes
> > for me?
> >
> > if (wbc->nr_to_write <= 0) {
> > wbc->more_io = 1;
> > break;
> > }
> >
> > I don't understand why we are setting more_io here? AFAICS, more_io
> > means there's more stuff to write ... I would think we'd set this if
> > nr_to_write was > 0 ?
> >
> > Or just have the section below brought up above this
> > break check and do:
> >
> > if (!list_empty(&sb->s_more_io) || !list_empty(&sb->s_io))
> > wbc->more_io = 1;
> >
> > Am I just misunderstanding the intent of more_io ?
>
> I am thinking along the lines of:
On closer look I found this line:
if (inode_dirtied_after(inode, start))
break;
In this case "list_empty(&sb->s_io)" is not a good criterion:
here we are breaking away for some other reason, and shall
not touch wbc.more_io.
So let's stick with the current code?
Thanks,
Fengguang
> @@ -638,13 +609,11 @@ sync_sb_inodes(struct super_block *sb, s
> iput(inode);
> cond_resched();
> spin_lock(&inode_lock);
> - if (wbc->nr_to_write <= 0) {
> - wbc->more_io = 1;
> + if (wbc->nr_to_write <= 0)
> break;
> - }
> - if (!list_empty(&sb->s_more_io))
> - wbc->more_io = 1;
> }
> + if (!list_empty(&sb->s_more_io) || !list_empty(&sb->s_io))
> + wbc->more_io = 1;
> return; /* Leave any unwritten inodes on s_io */
> }
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
2009-07-30 2:09 ` Wu Fengguang
@ 2009-07-30 2:57 ` Martin Bligh
2009-07-30 3:19 ` Wu Fengguang
0 siblings, 1 reply; 32+ messages in thread
From: Martin Bligh @ 2009-07-30 2:57 UTC (permalink / raw)
To: Wu Fengguang
Cc: Chad Talbott, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
Michael Rubin, Andrew Morton, sandeen@redhat.com,
Michael Davidson
> On closer looks I found this line:
>
> if (inode_dirtied_after(inode, start))
> break;
Ah, OK.
> In this case "list_empty(&sb->s_io)" is not a good criteria:
> here we are breaking away for some other reasons, and shall
> not touch wbc.more_io.
>
> So let's stick with the current code?
Well, I see two problems. One is that we set more_io based on
whether s_more_io is empty or not before we finish the loop.
I can't see how this can be correct, especially as there can be
other concurrent writers. So somehow we need to check when
we exit the loop, not during it.
The other is that we're saying we are setting more_io when
nr_to_write is <=0 ... but we only really check it when
nr_to_write is > 0 ... I can't see how this can be useful?
I'll admit there is one corner case when pages_skipped is set
from one of the branches, but I am really not sure what the
intended logic is here, given the above?
In the case where we hit the inode_dirtied_after break
condition, is it bad to set more_io ? There is more to do
on that inode after all. Is there a definition somewhere for
exactly what the more_io flag means?
M.
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
2009-07-30 1:57 ` Wu Fengguang
@ 2009-07-30 2:59 ` Martin Bligh
2009-07-30 4:08 ` Wu Fengguang
0 siblings, 1 reply; 32+ messages in thread
From: Martin Bligh @ 2009-07-30 2:59 UTC (permalink / raw)
To: Wu Fengguang
Cc: Chad Talbott, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
Michael Rubin, Andrew Morton, sandeen@redhat.com,
Michael Davidson
On Wed, Jul 29, 2009 at 6:57 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> On Thu, Jul 30, 2009 at 09:12:26AM +0800, Martin Bligh wrote:
>> > I agree on the unification of the kupdate and sync paths. In fact I had a
>> > patch for doing this, and I'd recommend doing it in two patches:
>> > one to fix the congestion case, another to do the code unification.
>> >
>> > The sync path doesn't care whether requeue_io() or redirty_tail() is
>> > used, because it disregards the time stamps totally - only the order of
>> > inodes matters (i.e. starvation), which is the same for requeue_io()/redirty_tail().
>>
>> But, as I understand it, both paths share the same lists, so we still have
>> to be consistent?
>
> Then let's first unify the code, then fix the congestion case? :)
OK, I will send it out as separate patches. I am just finishing up the testing
first.
M.
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
2009-07-30 2:57 ` Martin Bligh
@ 2009-07-30 3:19 ` Wu Fengguang
2009-07-30 20:33 ` Martin Bligh
0 siblings, 1 reply; 32+ messages in thread
From: Wu Fengguang @ 2009-07-30 3:19 UTC (permalink / raw)
To: Martin Bligh
Cc: Chad Talbott, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
Michael Rubin, Andrew Morton, sandeen@redhat.com,
Michael Davidson
On Thu, Jul 30, 2009 at 10:57:35AM +0800, Martin Bligh wrote:
> > On closer looks I found this line:
> >
> > if (inode_dirtied_after(inode, start))
> > break;
>
> Ah, OK.
>
> > In this case "list_empty(&sb->s_io)" is not a good criterion:
> > here we are breaking away for some other reason, and shall
> > not touch wbc.more_io.
> >
> > So let's stick with the current code?
>
> Well, I see two problems. One is that we set more_io based on
> whether s_more_io is empty or not before we finish the loop.
> I can't see how this can be correct, especially as there can be
> other concurrent writers. So somehow we need to check when
> we exit the loop, not during it.
It is correct inside the loop, however with some overhead.
We put it inside the loop because sometimes the whole filesystem is
skipped, and we shall not set more_io on it whether or not s_more_io
is empty.
> The other is that we're saying we are setting more_io when
> nr_to_write is <=0 ... but we only really check it when
> nr_to_write is > 0 ... I can't see how this can be useful?
That's the caller's fault - I guess the logic was changed a bit by
Jens in linux-next. I noticed this just now. It shall be fixed.
> I'll admit there is one corner case when pages_skipped is set
> from one of the branches, but I am really not sure what the
> intended logic is here, given the above?
>
> In the case where we hit the inode_dirtied_after break
> condition, is it bad to set more_io ? There is more to do
> on that inode after all. Is there a definition somewhere for
> exactly what the more_io flag means?
"More dirty pages to be put to io"?
The exact semantics of more_io is determined by the caller,
which used to be (in 2.6.31):
background_writeout():
if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
/* Wrote less than expected */
if (wbc.encountered_congestion || wbc.more_io)
congestion_wait(BLK_RW_ASYNC, HZ/10);
else
break;
}
wb_kupdate() is the same except that it does not check pages_skipped.
Note that in 2.6.31, more_io is not used at all for sync().
Thanks,
Fengguang
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
2009-07-30 2:59 ` Martin Bligh
@ 2009-07-30 4:08 ` Wu Fengguang
2009-07-30 19:55 ` Martin Bligh
0 siblings, 1 reply; 32+ messages in thread
From: Wu Fengguang @ 2009-07-30 4:08 UTC (permalink / raw)
To: Martin Bligh
Cc: Chad Talbott, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
Michael Rubin, Andrew Morton, sandeen@redhat.com,
Michael Davidson
On Thu, Jul 30, 2009 at 10:59:09AM +0800, Martin Bligh wrote:
> On Wed, Jul 29, 2009 at 6:57 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> > On Thu, Jul 30, 2009 at 09:12:26AM +0800, Martin Bligh wrote:
> >> > I agree on the unification of kupdate and sync paths. In fact I had a
> >> > patch for doing this. And I'd recommend to do it in two patches:
> >> > one to fix the congestion case, another to do the code unification.
> >> >
> >> > The sync path don't care whether requeue_io() or redirty_tail() is
> >> > used, because they disregard the time stamps totally - only order of
> >> > inodes matters (ie. starvation), which is same for requeue_io()/redirty_tail().
> >>
> >> But, as I understand it, both paths share the same lists, so we still have
> >> to be consistent?
> >
> > Then let's first unify the code, then fix the congestion case? :)
>
> OK, I will send it out as separate patches. I am just finishing up the testing
> first.
Note that this is a simple fix that may have suboptimal write performance.
Here is the old reasoning:
http://lkml.org/lkml/2009/3/28/235
Thanks,
Fengguang
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
2009-07-30 4:08 ` Wu Fengguang
@ 2009-07-30 19:55 ` Martin Bligh
2009-08-01 2:02 ` Wu Fengguang
0 siblings, 1 reply; 32+ messages in thread
From: Martin Bligh @ 2009-07-30 19:55 UTC (permalink / raw)
To: Wu Fengguang
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Michael Rubin,
sandeen@redhat.com, Michael Davidson, Andrew Morton,
Peter Zijlstra
> Note that this is a simple fix that may have suboptimal write performance.
> Here is the old reasoning:
>
> http://lkml.org/lkml/2009/3/28/235
The other thing I've been experimenting with is to disable the per-page
check in write_cache_pages, ie:
if (wbc->nonblocking && bdi_write_congested(bdi)) {
wb_stats_inc(WB_STATS_WCP_SECTION_CONG);
wbc->encountered_congestion = 1;
/* done = 1; */
This treats the congestion limits as soft, but encourages us to write
back in larger, more efficient chunks. If that's not going to scare
people unduly, I can submit that as well.
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
2009-07-30 3:19 ` Wu Fengguang
@ 2009-07-30 20:33 ` Martin Bligh
2009-08-01 2:58 ` Wu Fengguang
2009-08-01 4:10 ` Wu Fengguang
0 siblings, 2 replies; 32+ messages in thread
From: Martin Bligh @ 2009-07-30 20:33 UTC (permalink / raw)
To: Wu Fengguang
Cc: Chad Talbott, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
Michael Rubin, Andrew Morton, sandeen@redhat.com,
Michael Davidson
(BTW: background ... I'm not picking through this code for fun, I'm
trying to debug writeback problems introduced in our new kernel
that are affecting Google production workloads ;-))
>> Well, I see two problems. One is that we set more_io based on
>> whether s_more_io is empty or not before we finish the loop.
>> I can't see how this can be correct, especially as there can be
>> other concurrent writers. So somehow we need to check when
>> we exit the loop, not during it.
>
> It is correct inside the loop, however with some overheads.
>
> We put it inside the loop because sometimes the whole filesystem is
> skipped and we shall not set more_io on them whether or not s_more_io
> is empty.
My point was that you're setting more_io based on a condition
at a point in time that isn't when you return to the caller.
By the time you return to the caller (after several more loop
iterations), that condition may no longer be true.
One other way to address that would be to only set it if we're
about to fall off the end of the loop, i.e. change it to:
if (!list_empty(&sb->s_more_io) && list_empty(&sb->s_io))
wbc->more_io = 1;
>> The other is that we're saying we are setting more_io when
>> nr_to_write is <=0 ... but we only really check it when
>> nr_to_write is > 0 ... I can't see how this can be useful?
>
> That's the caller's fault - I guess the logic was changed a bit by
> Jens in linux-next. I noticed this just now. It shall be fixed.
I am guessing you're setting more_io here because we're stopping
because our slice expired, presumably without us completing
all the io there was to do? That doesn't seem entirely accurate,
we could have finished all the pending IO (particularly given that
we can go over nr_to_write somewhat and send it negative).
Hence, I thought that checking whether s_more_io and s_io were
empty at the time of return might be a more accurate check,
but on the other hand they are shared lists.
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
2009-07-28 19:11 Bug in kernel 2.6.31, Slow wb_kupdate writeout Chad Talbott
2009-07-28 21:49 ` Martin Bligh
@ 2009-07-30 21:39 ` Jens Axboe
2009-07-30 22:01 ` Martin Bligh
1 sibling, 1 reply; 32+ messages in thread
From: Jens Axboe @ 2009-07-30 21:39 UTC (permalink / raw)
To: Chad Talbott
Cc: linux-kernel, linux-mm, wfg, Martin Bligh, Michael Rubin,
Andrew Morton, sandeen
On Tue, Jul 28 2009, Chad Talbott wrote:
> I run a simple workload on a 4GB machine which dirties a few largish
> inodes like so:
>
> # seq 10 | xargs -P0 -n1 -i\{} dd if=/dev/zero of=/tmp/dump\{}
> bs=1024k count=100
>
> While the dds are running data is written out at disk speed. However,
> once the dds have run to completion and exited there is ~500MB of
> dirty memory left. Background writeout then takes about 3 more
> minutes to clean memory at only ~3.3MB/s. When I explicitly sync, I
> can see that the disk is capable of 40MB/s, which finishes off the
> files in ~10s. [1]
>
> An interesting recent-ish change is "writeback: speed up writeback of
> big dirty files." When I revert the change to __sync_single_inode the
> problem appears to go away and background writeout proceeds at disk
> speed. Interestingly, that code is in the git commit [2], but not in
> the post to LKML. [3] This may not be the fix, but it makes this
> test behave better.
Can I talk you into trying the per-bdi writeback patchset? I just tried
your test on a 16GB machine, and the dd's finish immediately since it
won't trip the writeout at that percentage of dirty memory. The 1GB of
dirty memory is flushed when it gets too old, 30 seconds later in two
chunks of writeout running at disk speed.
http://lkml.org/lkml/2009/7/30/302
You can either use the git branch, or just download
http://kernel.dk/writeback-v13.patch
and apply that to -git (or -rc4) directly.
--
Jens Axboe
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
2009-07-30 21:39 ` Jens Axboe
@ 2009-07-30 22:01 ` Martin Bligh
2009-07-30 22:17 ` Jens Axboe
0 siblings, 1 reply; 32+ messages in thread
From: Martin Bligh @ 2009-07-30 22:01 UTC (permalink / raw)
To: Jens Axboe
Cc: Chad Talbott, linux-kernel, linux-mm, wfg, Michael Rubin,
Andrew Morton, sandeen
On Thu, Jul 30, 2009 at 2:39 PM, Jens Axboe<jens.axboe@oracle.com> wrote:
> On Tue, Jul 28 2009, Chad Talbott wrote:
>> I run a simple workload on a 4GB machine which dirties a few largish
>> inodes like so:
>>
>> # seq 10 | xargs -P0 -n1 -i\{} dd if=/dev/zero of=/tmp/dump\{}
>> bs=1024k count=100
>>
>> While the dds are running data is written out at disk speed. However,
>> once the dds have run to completion and exited there is ~500MB of
>> dirty memory left. Background writeout then takes about 3 more
>> minutes to clean memory at only ~3.3MB/s. When I explicitly sync, I
>> can see that the disk is capable of 40MB/s, which finishes off the
>> files in ~10s. [1]
>>
>> An interesting recent-ish change is "writeback: speed up writeback of
>> big dirty files." When I revert the change to __sync_single_inode the
>> problem appears to go away and background writeout proceeds at disk
>> speed. Interestingly, that code is in the git commit [2], but not in
>> the post to LKML. [3] This may not be the fix, but it makes this
>> test behave better.
>
> Can I talk you into trying the per-bdi writeback patchset? I just tried
> your test on a 16GB machine, and the dd's finish immediately since it
> won't trip the writeout at that percentage of dirty memory. The 1GB of
> dirty memory is flushed when it gets too old, 30 seconds later in two
> chunks of writeout running at disk speed.
How big did you make the dds? It has to be writing more data than
you have RAM, or it's not going to do anything much interesting ;-)
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
2009-07-30 22:01 ` Martin Bligh
@ 2009-07-30 22:17 ` Jens Axboe
2009-07-30 22:34 ` Martin Bligh
0 siblings, 1 reply; 32+ messages in thread
From: Jens Axboe @ 2009-07-30 22:17 UTC (permalink / raw)
To: Martin Bligh
Cc: Chad Talbott, linux-kernel, linux-mm, wfg, Michael Rubin,
Andrew Morton, sandeen
On Thu, Jul 30 2009, Martin Bligh wrote:
> On Thu, Jul 30, 2009 at 2:39 PM, Jens Axboe<jens.axboe@oracle.com> wrote:
> > On Tue, Jul 28 2009, Chad Talbott wrote:
> >> I run a simple workload on a 4GB machine which dirties a few largish
> >> inodes like so:
> >>
> >> # seq 10 | xargs -P0 -n1 -i\{} dd if=/dev/zero of=/tmp/dump\{}
> >> bs=1024k count=100
> >>
> >> While the dds are running data is written out at disk speed. However,
> >> once the dds have run to completion and exited there is ~500MB of
> >> dirty memory left. Background writeout then takes about 3 more
> >> minutes to clean memory at only ~3.3MB/s. When I explicitly sync, I
> >> can see that the disk is capable of 40MB/s, which finishes off the
> >> files in ~10s. [1]
> >>
> >> An interesting recent-ish change is "writeback: speed up writeback of
> >> big dirty files." When I revert the change to __sync_single_inode the
> >> problem appears to go away and background writeout proceeds at disk
> >> speed. Interestingly, that code is in the git commit [2], but not in
> >> the post to LKML. [3] This may not be the fix, but it makes this
> >> test behave better.
> >
> > Can I talk you into trying the per-bdi writeback patchset? I just tried
> > your test on a 16GB machine, and the dd's finish immediately since it
> > won't trip the writeout at that percentage of dirty memory. The 1GB of
> > dirty memory is flushed when it gets too old, 30 seconds later in two
> > chunks of writeout running at disk speed.
>
> How big did you make the dds? It has to be writing more data than
> you have RAM, or it's not going to do anything much interesting ;-)
The test case above on a 4G machine is only generating 1G of dirty data.
I ran the same test case on the 16G, resulting in only background
writeout. The relevant bit here being that the background writeout
finished quickly, writing at disk speed.
I re-ran the same test, but using 300 100MB files instead. While the
dd's are running, we are going at ~80MB/sec (this is disk speed, it's an
x25-m). When the dd's are done, it continues doing 80MB/sec for 10
seconds or so. Then the remainder (about 2G) is written in bursts at
disk speeds, but with some time in between.
--
Jens Axboe
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
2009-07-30 22:17 ` Jens Axboe
@ 2009-07-30 22:34 ` Martin Bligh
2009-07-30 22:43 ` Jens Axboe
2009-08-01 4:02 ` Wu Fengguang
0 siblings, 2 replies; 32+ messages in thread
From: Martin Bligh @ 2009-07-30 22:34 UTC (permalink / raw)
To: Jens Axboe
Cc: Chad Talbott, linux-kernel, linux-mm, wfg, Michael Rubin,
Andrew Morton, sandeen
> The test case above on a 4G machine is only generating 1G of dirty data.
> I ran the same test case on the 16G, resulting in only background
> writeout. The relevant bit here being that the background writeout
> finished quickly, writing at disk speed.
>
> I re-ran the same test, but using 300 100MB files instead. While the
> dd's are running, we are going at ~80MB/sec (this is disk speed, it's an
> x25-m). When the dd's are done, it continues doing 80MB/sec for 10
> seconds or so. Then the remainder (about 2G) is written in bursts at
> disk speeds, but with some time in between.
OK, I think the test case is sensitive to how many files you have - if
we punt them to the back of the list, and yet we still have 299 other
ones, it may well be able to keep the disk spinning despite the bug
I outlined. Try using 30 1GB files?
Though it doesn't seem to happen with just one dd streamer, and
I don't see why the bug doesn't trigger in that case either.
I believe the bugfix is correct independent of any bdi changes?
M.
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
2009-07-30 22:34 ` Martin Bligh
@ 2009-07-30 22:43 ` Jens Axboe
2009-07-30 22:48 ` Martin Bligh
2009-08-01 4:02 ` Wu Fengguang
1 sibling, 1 reply; 32+ messages in thread
From: Jens Axboe @ 2009-07-30 22:43 UTC (permalink / raw)
To: Martin Bligh
Cc: Chad Talbott, linux-kernel, linux-mm, wfg, Michael Rubin,
Andrew Morton, sandeen
On Thu, Jul 30 2009, Martin Bligh wrote:
> > The test case above on a 4G machine is only generating 1G of dirty data.
> > I ran the same test case on the 16G, resulting in only background
> > writeout. The relevant bit here being that the background writeout
> > finished quickly, writing at disk speed.
> >
> > I re-ran the same test, but using 300 100MB files instead. While the
> > dd's are running, we are going at ~80MB/sec (this is disk speed, it's an
> > x25-m). When the dd's are done, it continues doing 80MB/sec for 10
> > seconds or so. Then the remainder (about 2G) is written in bursts at
> > disk speeds, but with some time in between.
>
> OK, I think the test case is sensitive to how many files you have - if
> we punt them to the back of the list, and yet we still have 299 other
> ones, it may well be able to keep the disk spinning despite the bug
>> I outlined. Try using 30 1GB files?
If this disk starts spinning, then we have bigger bugs :-)
>
> Though it doesn't seem to happen with just one dd streamer, and
> I don't see why the bug doesn't trigger in that case either.
>
> I believe the bugfix is correct independent of any bdi changes?
Yeah I think so too, I'll run some more tests on this tomorrow and
verify it there as well.
--
Jens Axboe
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
2009-07-30 22:43 ` Jens Axboe
@ 2009-07-30 22:48 ` Martin Bligh
2009-07-31 7:50 ` Peter Zijlstra
2009-08-01 4:03 ` Wu Fengguang
0 siblings, 2 replies; 32+ messages in thread
From: Martin Bligh @ 2009-07-30 22:48 UTC (permalink / raw)
To: Jens Axboe
Cc: Chad Talbott, linux-kernel, linux-mm, wfg, Michael Rubin, sandeen,
Andrew Morton, Peter Zijlstra
On Thu, Jul 30, 2009 at 3:43 PM, Jens Axboe<jens.axboe@oracle.com> wrote:
> On Thu, Jul 30 2009, Martin Bligh wrote:
>> > The test case above on a 4G machine is only generating 1G of dirty data.
>> > I ran the same test case on the 16G, resulting in only background
>> > writeout. The relevant bit here being that the background writeout
>> > finished quickly, writing at disk speed.
>> >
>> > I re-ran the same test, but using 300 100MB files instead. While the
>> > dd's are running, we are going at ~80MB/sec (this is disk speed, it's an
>> > x25-m). When the dd's are done, it continues doing 80MB/sec for 10
>> > seconds or so. Then the remainder (about 2G) is written in bursts at
>> > disk speeds, but with some time in between.
>>
>> OK, I think the test case is sensitive to how many files you have - if
>> we punt them to the back of the list, and yet we still have 299 other
>> ones, it may well be able to keep the disk spinning despite the bug
>> I outlined. Try using 30 1GB files?
>
> If this disk starts spinning, then we have bigger bugs :-)
>>
>> Though it doesn't seem to happen with just one dd streamer, and
>> I don't see why the bug doesn't trigger in that case either.
>>
>> I believe the bugfix is correct independent of any bdi changes?
>
> Yeah I think so too, I'll run some more tests on this tomorrow and
> verify it there as well.
There's another issue I was discussing with Peter Z. earlier that the
bdi changes might help with - if you look at where the dirty pages
get to, they are capped hard at the average of the dirty and
background thresholds, meaning we can only dirty about half the
pages we should be able to. That does very slowly go away when
the bdi limit catches up, but it seems to start at 0, and its progress
seems glacially slow (at least if you're impatient ;-))
This seems to affect some of our workloads badly when they have
a sharp spike in dirty data to one device, they get throttled heavily
when they wouldn't have before the per-bdi dirty limits.
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
2009-07-30 22:48 ` Martin Bligh
@ 2009-07-31 7:50 ` Peter Zijlstra
2009-08-01 4:03 ` Wu Fengguang
1 sibling, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2009-07-31 7:50 UTC (permalink / raw)
To: Martin Bligh
Cc: Jens Axboe, Chad Talbott, linux-kernel, linux-mm, wfg,
Michael Rubin, sandeen, Andrew Morton
On Thu, 2009-07-30 at 15:48 -0700, Martin Bligh wrote:
> There's another issue I was discussing with Peter Z. earlier that the
> bdi changes might help with - if you look at where the dirty pages
> get to, they are capped hard at the average of the dirty and
> background thresholds, meaning we can only dirty about half the
> pages we should be able to. That does very slowly go away when
> the bdi limit catches up, but it seems to start at 0, and its progress
> seems glacially slow (at least if you're impatient ;-))
>
> This seems to affect some of our workloads badly when they have
> a sharp spike in dirty data to one device, they get throttled heavily
> when they wouldn't have before the per-bdi dirty limits.
Right, currently that adjustment period is about the same order as
writing out the full dirty page capacity. If your system has unbalanced
memory vs io capacity this might indeed end up being glacial.
I've been considering making it a sublinear function wrt the memory
size, so that larger machines get less and therefore adjust faster.
Something like the below perhaps -- the alternative is yet another
sysctl :/
Not sure how the sqrt works out on a wide variety of machines though...
we'll have to test and see.
---
mm/page-writeback.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 81627eb..64aa140 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -152,7 +152,7 @@ static int calc_period_shift(void)
else
dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
100;
- return 2 + ilog2(dirty_total - 1);
+ return 2 + ilog2(int_sqrt(dirty_total) - 1);
}
/*
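To illustrate the magnitude (example numbers of mine, not measured):
assuming dirty_total = 2^20 pages (~4GB worth of dirty limit),

	current:  2 + ilog2(2^20 - 1)           = 2 + 19 = 21
	proposed: 2 + ilog2(int_sqrt(2^20) - 1) = 2 + ilog2(1023) = 2 + 9 = 11

so the adaptation period (1 << shift completions) shrinks by a factor of
roughly 2^10 on a machine that size.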
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
2009-07-30 19:55 ` Martin Bligh
@ 2009-08-01 2:02 ` Wu Fengguang
0 siblings, 0 replies; 32+ messages in thread
From: Wu Fengguang @ 2009-08-01 2:02 UTC (permalink / raw)
To: Martin Bligh
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Michael Rubin,
sandeen@redhat.com, Michael Davidson, Andrew Morton,
Peter Zijlstra
[-- Attachment #1: Type: text/plain, Size: 1560 bytes --]
On Fri, Jul 31, 2009 at 03:55:44AM +0800, Martin Bligh wrote:
> > Note that this is a simple fix that may have suboptimal write performance.
> > Here is an old reasoning:
> >
> > http://lkml.org/lkml/2009/3/28/235
>
> The other thing I've been experimenting with is to disable the per-page
> check in write_cache_pages, ie:
>
> if (wbc->nonblocking && bdi_write_congested(bdi)) {
> wb_stats_inc(WB_STATS_WCP_SECTION_CONG);
> wbc->encountered_congestion = 1;
> /* done = 1; */
>
> This treats the congestion limits as soft, but encourages us to write
> back in larger, more efficient chunks. If that's not going to scare
> people unduly, I can submit that as well.
This risks hitting the hard limit (nr_requests), and blocking everyone,
including the ones with higher priority (i.e. kswapd).
On the other hand, the simple fix in previous mails won't necessarily
behave too sub-optimally - it's only a potential problem. There is a window of
(1/16)*(nr_requests)*request_size (= 128*256KB/16 = 2MB) between the
congestion-on and congestion-off states. So at best we can inject
a big 2MB chunk into the async write queue once it becomes uncongested.
I have a writeback debug patch that can help find out how
that works out in your real world workloads (by monitoring
nr_to_write). You can also try doubling the ratio (1/16) in
blk_queue_congestion_threshold(), to see how an increased
congestion-on-off window may help.
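For reference, the thresholds in question come from
blk_queue_congestion_threshold() in block/blk-core.c, which in 2.6.31 looks
roughly like this (from-memory sketch):

static void blk_queue_congestion_threshold(struct request_queue *q)
{
	int nr;

	/* congestion signals on at roughly 7/8 of nr_requests... */
	nr = q->nr_requests - (q->nr_requests / 8) + 1;
	if (nr > q->nr_requests)
		nr = q->nr_requests;
	q->nr_congestion_on = nr;

	/* ...and off again around 13/16, giving the ~1/16 hysteresis window */
	nr = q->nr_requests - (q->nr_requests / 8) - (q->nr_requests / 16) - 1;
	if (nr < 1)
		nr = 1;
	q->nr_congestion_off = nr;
}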
Thanks,
Fengguang
[-- Attachment #2: writeback-debug-2.6.31.patch --]
[-- Type: text/x-diff, Size: 2713 bytes --]
mm/page-writeback.c | 38 ++++++++++++++++++++++++++++++++++++++
1 file changed, 38 insertions(+)
--- sound-2.6.orig/mm/page-writeback.c
+++ sound-2.6/mm/page-writeback.c
@@ -116,6 +116,33 @@ EXPORT_SYMBOL(laptop_mode);
 
 /* End of sysctl-exported parameters */
 
+#define writeback_debug_report(n, wbc) do { \
+	__writeback_debug_report(n, wbc, __FILE__, __LINE__, __FUNCTION__); \
+} while (0)
+
+void print_writeback_control(struct writeback_control *wbc)
+{
+	printk(KERN_DEBUG
+	       "global dirty %lu writeback %lu nfs %lu "
+	       "flags %c%c towrite %ld skipped %ld\n",
+	       global_page_state(NR_FILE_DIRTY),
+	       global_page_state(NR_WRITEBACK),
+	       global_page_state(NR_UNSTABLE_NFS),
+	       wbc->encountered_congestion ? 'C':'_',
+	       wbc->more_io ? 'M':'_',
+	       wbc->nr_to_write,
+	       wbc->pages_skipped);
+}
+
+void __writeback_debug_report(long n, struct writeback_control *wbc,
+			      const char *file, int line, const char *func)
+{
+	printk(KERN_DEBUG "%s %d %s: %s(%d) %ld\n",
+	       file, line, func,
+	       current->comm, current->pid,
+	       n);
+	print_writeback_control(wbc);
+}
+
 static void background_writeout(unsigned long _min_pages);
 
 /*
@@ -550,6 +577,7 @@ static void balance_dirty_pages(struct a
 			pages_written += write_chunk - wbc.nr_to_write;
 			get_dirty_limits(&background_thresh, &dirty_thresh,
 					 &bdi_thresh, bdi);
+			writeback_debug_report(pages_written, &wbc);
 		}
 
 		/*
@@ -576,6 +604,7 @@ static void balance_dirty_pages(struct a
 			break;		/* We've done our duty */
 
 		congestion_wait(BLK_RW_ASYNC, HZ/10);
+		writeback_debug_report(-pages_written, &wbc);
 	}
 
 	if (bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh &&
@@ -670,6 +699,11 @@ void throttle_vm_writeout(gfp_t gfp_mask
 		    global_page_state(NR_WRITEBACK) <= dirty_thresh)
 			break;
 		congestion_wait(BLK_RW_ASYNC, HZ/10);
+		printk(KERN_DEBUG "throttle_vm_writeout: "
+		       "congestion_wait on %lu+%lu > %lu\n",
+		       global_page_state(NR_UNSTABLE_NFS),
+		       global_page_state(NR_WRITEBACK),
+		       dirty_thresh);
 
 		/*
 		 * The caller might hold locks which can prevent IO completion
@@ -719,7 +753,9 @@ static void background_writeout(unsigned
 			else
 				break;
 		}
+		writeback_debug_report(min_pages, &wbc);
 	}
+	writeback_debug_report(min_pages, &wbc);
 }
 
 /*
@@ -792,7 +828,9 @@ static void wb_kupdate(unsigned long arg
 				break;	/* All the old data is written */
 		}
 		nr_to_write -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
+		writeback_debug_report(nr_to_write, &wbc);
 	}
+	writeback_debug_report(nr_to_write, &wbc);
 	if (time_before(next_jif, jiffies + HZ))
 		next_jif = jiffies + HZ;
 	if (dirty_writeback_interval)
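With the patch applied, each writeback_debug_report() should emit a
pair of lines along these lines (the values here are made up for
illustration; the first line is the file/line/function/task header,
the second comes from print_writeback_control()):

mm/page-writeback.c 580 balance_dirty_pages: dd(4321) 512
global dirty 51655 writeback 17020 nfs 0 flags C_ towrite 1012 skipped 0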
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
2009-07-30 20:33 ` Martin Bligh
@ 2009-08-01 2:58 ` Wu Fengguang
2009-08-01 4:10 ` Wu Fengguang
1 sibling, 0 replies; 32+ messages in thread
From: Wu Fengguang @ 2009-08-01 2:58 UTC (permalink / raw)
To: Martin Bligh
Cc: Chad Talbott, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
Michael Rubin, Andrew Morton, sandeen@redhat.com,
Michael Davidson
On Fri, Jul 31, 2009 at 04:33:09AM +0800, Martin Bligh wrote:
> (BTW: background ... I'm not picking through this code for fun, I'm
> trying to debug writeback problems introduced in our new kernel
> that are affecting Google production workloads ;-))
>
> >> Well, I see two problems. One is that we set more_io based on
> >> whether s_more_io is empty or not before we finish the loop.
> >> I can't see how this can be correct, especially as there can be
> >> other concurrent writers. So somehow we need to check when
> >> we exit the loop, not during it.
> >
> > It is correct inside the loop, however with some overheads.
> >
> > We put it inside the loop because sometimes the whole filesystem is
> > skipped and we shall not set more_io on them whether or not s_more_io
> > is empty.
>
> My point was that you're setting more_io based on a condition
> at a point in time that isn't when you return to the caller.
>
> By the time you return to the caller (after several more loop
> iterations), that condition may no longer be true.
You are right in that sense. Sorry, my claim of correctness is somewhat
biased: we normally care a lot about aborting early, and don't mind one
extra trip over the superblocks. Such extra trips should be rare enough;
I'd be surprised if you observed many of them in real workloads.
> One other way to address that would be to set it only if we're
> about to fall off the end of the loop, i.e. change it to:
>
> if (!list_empty(&sb->s_more_io) && list_empty(&sb->s_io))
> 	wbc->more_io = 1;
Leave more_io=0 when there are more inodes in s_io to be worked on?
I cannot understand that, and suspect we are talking about an imaginary
problem on this point ;)
> >> The other is that we're saying we are setting more_io when
> >> nr_to_write is <=0 ... but we only really check it when
> >> nr_to_write is > 0 ... I can't see how this can be useful?
> >
> > That's the caller's fault - I guess the logic was changed a bit by
> > Jens in linux-next. I noticed this just now. It shall be fixed.
>
> I am guessing you're setting more_io here because we're stopping
> because our slice expired, presumably without us completing
> all the IO there was to do? That doesn't seem entirely accurate;
> we could have finished all the pending IO (particularly given that
> we can go over nr_to_write somewhat and send it negative).
> Hence, I thought that checking whether s_more_io and s_io were
> empty at the time of return might be a more accurate check,
> but on the other hand they are shared lists.
Yes, the current more_io logic is not entirely accurate, but I doubt we
can gain much, and the improvement cannot be done trivially (not in the
lines of code, but in the analyses and tests involved).
Anyway, if you would take the time to push forward a patch reducing the
overhead of the possible extra trips, I'll take the time to review it ;)
Thanks,
Fengguang
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
2009-07-30 22:34 ` Martin Bligh
2009-07-30 22:43 ` Jens Axboe
@ 2009-08-01 4:02 ` Wu Fengguang
1 sibling, 0 replies; 32+ messages in thread
From: Wu Fengguang @ 2009-08-01 4:02 UTC (permalink / raw)
To: Martin Bligh
Cc: Jens Axboe, Chad Talbott, linux-kernel, linux-mm, Michael Rubin,
Andrew Morton, sandeen
On Thu, Jul 30, 2009 at 03:34:12PM -0700, Martin Bligh wrote:
> > The test case above on a 4G machine is only generating 1G of dirty data.
> > I ran the same test case on the 16G, resulting in only background
> > writeout. The relevant bit here being that the background writeout
> > finished quickly, writing at disk speed.
> >
> > I re-ran the same test, but using 300 100MB files instead. While the
> > dd's are running, we are going at ~80MB/sec (this is disk speed, it's an
> > x25-m). When the dd's are done, it continues doing 80MB/sec for 10
> > seconds or so. Then the remainder (about 2G) is written in bursts at
> > disk speeds, but with some time in between.
>
> OK, I think the test case is sensitive to how many files you have - if
> we punt them to the back of the list, and yet we still have 299 other
> ones, it may well be able to keep the disk spinning despite the bug
> I outlined. Try using 30 1GB files?
>
> Though it doesn't seem to happen with just one dd streamer, and
> I don't see why the bug doesn't trigger in that case either.
I guess the bug is not related to the number of dd streamers, but to
whether there is a stream of newly dirtied inodes (atime dirtiness
would be enough): wb_kupdate() itself won't give up on congestion, but
redirty_tail() will refresh the inode's dirty time if there are newly
dirtied inodes in front of it. And we cannot claim this to be a bug of
the list-based redirty_tail(), since we call it in the belief that the
inode is somehow blocked. In this manner redirty_tail() can refresh the
inode's dirty time (and therefore delay its writeback by up to 30s)
at will.
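To make that concrete, redirty_tail() in 2.6.31's fs/fs-writeback.c
reads roughly as follows (quoted from memory; treat it as a sketch):

static void redirty_tail(struct inode *inode)
{
	struct super_block *sb = inode->i_sb;

	if (!list_empty(&sb->s_dirty)) {
		struct inode *tail_inode;

		tail_inode = list_entry(sb->s_dirty.next, struct inode,
					i_list);
		if (time_before(inode->dirtied_when,
				tail_inode->dirtied_when))
			inode->dirtied_when = jiffies;	/* the refresh */
	}
	list_move(&inode->i_list, &sb->s_dirty);
}

So whenever some other inode has been dirtied more recently, the punted
inode's dirtied_when is refreshed to now to keep s_dirty sorted, which
is exactly the delay described above.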
> I believe the bugfix is correct independent of any bdi changes?
Agreed.
Thanks,
Fengguang
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
2009-07-30 22:48 ` Martin Bligh
2009-07-31 7:50 ` Peter Zijlstra
@ 2009-08-01 4:03 ` Wu Fengguang
2009-08-01 4:53 ` Wu Fengguang
1 sibling, 1 reply; 32+ messages in thread
From: Wu Fengguang @ 2009-08-01 4:03 UTC (permalink / raw)
To: Martin Bligh
Cc: Jens Axboe, Chad Talbott, linux-kernel, linux-mm, Michael Rubin,
sandeen, Andrew Morton, Peter Zijlstra
On Thu, Jul 30, 2009 at 03:48:02PM -0700, Martin Bligh wrote:
> On Thu, Jul 30, 2009 at 3:43 PM, Jens Axboe<jens.axboe@oracle.com> wrote:
> > On Thu, Jul 30 2009, Martin Bligh wrote:
> >> > The test case above on a 4G machine is only generating 1G of dirty data.
> >> > I ran the same test case on the 16G, resulting in only background
> >> > writeout. The relevant bit here being that the background writeout
> >> > finished quickly, writing at disk speed.
> >> >
> >> > I re-ran the same test, but using 300 100MB files instead. While the
> >> > dd's are running, we are going at ~80MB/sec (this is disk speed, it's an
> >> > x25-m). When the dd's are done, it continues doing 80MB/sec for 10
> >> > seconds or so. Then the remainder (about 2G) is written in bursts at
> >> > disk speeds, but with some time in between.
> >>
> >> OK, I think the test case is sensitive to how many files you have - if
> >> we punt them to the back of the list, and yet we still have 299 other
> >> ones, it may well be able to keep the disk spinning despite the bug
> >> I outlined. Try using 30 1GB files?
> >
> > If this disk starts spinning, then we have bigger bugs :-)
> >>
> >> Though it doesn't seem to happen with just one dd streamer, and
> >> I don't see why the bug doesn't trigger in that case either.
> >>
> >> I believe the bugfix is correct independent of any bdi changes?
> >
> > Yeah I think so too, I'll run some more tests on this tomorrow and
> > verify it there as well.
>
> There's another issue I was discussing with Peter Z. earlier that the
> bdi changes might help with - if you look at where the dirty pages
> get to, they are capped hard at the average of the dirty and
> background thresholds, meaning we can only dirty about half the
> pages we should be able to. That does very slowly go away when
> the bdi limit catches up, but it seems to start at 0, and its progress
> seems glacially slow (at least if you're impatient ;-))
You mean the dirty limit will start from
(dirty_ratio+background_ratio)/2 = 15% and grow toward dirty_ratio = 20%
at a very slow pace? I did observe such curves long ago,
but they do not always show up, as in the mini benchmark below.
> This seems to affect some of our workloads badly when they have
> a sharp spike in dirty data to one device, they get throttled heavily
> when they wouldn't have before the per-bdi dirty limits.
Here is a single dd on my laptop with 4G memory, kernel 2.6.30.
root /home/wfg# echo 10 > /proc/sys/vm/dirty_ratio
root /home/wfg# echo 20 > /proc/sys/vm/dirty_background_ratio
wfg ~% dd if=/dev/zero of=/opt/vm/10G bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 12.7143 s, 82.5 MB/s
output of vmmon:
nr_dirty nr_writeback
0 0
0 0
56795 0
51655 17020
52071 17511
51648 16898
51655 16485
52369 17425
51648 16930
51470 16809
52630 17267
51287 16634
51260 16641
51310 16903
51281 16379
46073 11169
46086 0
46089 0
3132 9657
21 17677
3 14107
14 2
0 0
0 0
In this case nr_dirty stays almost constant throughout.
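When the slow ramp does show up, a toy user-space model makes the
glacial pace visible. This is my simplification, not kernel code: it
treats the per-bdi writeout share as a simple moving average with the
2^19-completion period from calc_period_shift(), where the real kernel
uses floating proportions:

#include <stdio.h>

int main(void)
{
	double period = 1 << 19;      /* decay period, in completed pages */
	double frac = 0.0;            /* this bdi's estimated writeout share */
	double dirty_thresh = 200000; /* global dirty threshold, in pages */
	long done;

	/* a previously idle device suddenly does all the writeback */
	for (done = 1; done <= 1000000; done++) {
		frac += (1.0 - frac) / period;
		if (done % 200000 == 0)
			printf("%7ld pages written: share %.2f -> "
			       "bdi_thresh ~ %6.0f pages\n",
			       done, frac, frac * dirty_thresh);
	}
	return 0;
}

Even after a million written-back pages (~4GB), the estimated share --
and with it the per-bdi threshold -- is still about 15% short of its
steady-state value in this model.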
Thanks,
Fengguang
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
2009-07-30 20:33 ` Martin Bligh
2009-08-01 2:58 ` Wu Fengguang
@ 2009-08-01 4:10 ` Wu Fengguang
1 sibling, 0 replies; 32+ messages in thread
From: Wu Fengguang @ 2009-08-01 4:10 UTC (permalink / raw)
To: Martin Bligh
Cc: Chad Talbott, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
Michael Rubin, Andrew Morton, sandeen@redhat.com,
Michael Davidson
On Fri, Jul 31, 2009 at 04:33:09AM +0800, Martin Bligh wrote:
> (BTW: background ... I'm not picking through this code for fun, I'm
> trying to debug writeback problems introduced in our new kernel
> that are affecting Google production workloads ;-))
>
> >> Well, I see two problems. One is that we set more_io based on
> >> whether s_more_io is empty or not before we finish the loop.
> >> I can't see how this can be correct, especially as there can be
> >> other concurrent writers. So somehow we need to check when
> >> we exit the loop, not during it.
> >
> > It is correct inside the loop, however with some overheads.
> >
> > We put it inside the loop because sometimes the whole filesystem is
> > skipped and we shall not set more_io on them whether or not s_more_io
> > is empty.
>
> My point was that you're setting more_io based on a condition
> at a point in time that isn't when you return to the caller.
>
> By the time you return to the caller (after several more loop
> iterations), that condition may no longer be true.
>
> One other way to address that would be to set it only if we're
> about to fall off the end of the loop, i.e. change it to:
>
> if (!list_empty(&sb->s_more_io) && list_empty(&sb->s_io))
> 	wbc->more_io = 1;
Ah, I see it (as the patch below); looks reasonable to me.
Thanks,
Fengguang
---
fs/fs-writeback.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
--- sound-2.6.orig/fs/fs-writeback.c
+++ sound-2.6/fs/fs-writeback.c
@@ -544,9 +544,9 @@ void generic_sync_sb_inodes(struct super
 			wbc->more_io = 1;
 			break;
 		}
-		if (!list_empty(&sb->s_more_io))
-			wbc->more_io = 1;
 	}
+	if (!list_empty(&sb->s_more_io) && list_empty(&sb->s_io))
+		wbc->more_io = 1;
 
 	if (sync) {
 		struct inode *inode, *old_inode = NULL;
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
2009-08-01 4:03 ` Wu Fengguang
@ 2009-08-01 4:53 ` Wu Fengguang
2009-08-01 5:03 ` Wu Fengguang
0 siblings, 1 reply; 32+ messages in thread
From: Wu Fengguang @ 2009-08-01 4:53 UTC (permalink / raw)
To: Martin Bligh
Cc: Jens Axboe, Chad Talbott, linux-kernel, linux-mm, Michael Rubin,
sandeen, Andrew Morton, Peter Zijlstra
On Sat, Aug 01, 2009 at 12:03:13PM +0800, Wu Fengguang wrote:
> On Thu, Jul 30, 2009 at 03:48:02PM -0700, Martin Bligh wrote:
> > On Thu, Jul 30, 2009 at 3:43 PM, Jens Axboe<jens.axboe@oracle.com> wrote:
> > > On Thu, Jul 30 2009, Martin Bligh wrote:
> > >> > The test case above on a 4G machine is only generating 1G of dirty data.
> > >> > I ran the same test case on the 16G, resulting in only background
> > >> > writeout. The relevant bit here being that the background writeout
> > >> > finished quickly, writing at disk speed.
> > >> >
> > >> > I re-ran the same test, but using 300 100MB files instead. While the
> > >> > dd's are running, we are going at ~80MB/sec (this is disk speed, it's an
> > >> > x25-m). When the dd's are done, it continues doing 80MB/sec for 10
> > >> > seconds or so. Then the remainder (about 2G) is written in bursts at
> > >> > disk speeds, but with some time in between.
> > >>
> > >> OK, I think the test case is sensitive to how many files you have - if
> > >> we punt them to the back of the list, and yet we still have 299 other
> > >> ones, it may well be able to keep the disk spinning despite the bug
> > >> I outlined. Try using 30 1GB files?
> > >
> > > If this disk starts spinning, then we have bigger bugs :-)
> > >>
> > >> Though it doesn't seem to happen with just one dd streamer, and
> > >> I don't see why the bug doesn't trigger in that case either.
> > >>
> > >> I believe the bugfix is correct independent of any bdi changes?
> > >
> > > Yeah I think so too, I'll run some more tests on this tomorrow and
> > > verify it there as well.
> >
> > There's another issue I was discussing with Peter Z. earlier that the
> > bdi changes might help with - if you look at where the dirty pages
> > get to, they are capped hard at the average of the dirty and
> > background thresholds, meaning we can only dirty about half the
> > pages we should be able to. That does very slowly go away when
> > the bdi limit catches up, but it seems to start at 0, and its progress
> > seems glacially slow (at least if you're impatient ;-))
>
> You mean the dirty limit will start from
> (dirty_ratio+background_ratio)/2 = 15% and grow toward dirty_ratio = 20%
> at a very slow pace? I did observe such curves long ago,
> but they do not always show up, as in the mini benchmark below.
>
> > This seems to affect some of our workloads badly when they have
> > a sharp spike in dirty data to one device, they get throttled heavily
> > when they wouldn't have before the per-bdi dirty limits.
>
> Here is a single dd on my laptop with 4G memory, kernel 2.6.30.
>
> root /home/wfg# echo 10 > /proc/sys/vm/dirty_ratio
> root /home/wfg# echo 20 > /proc/sys/vm/dirty_background_ratio
>
> wfg ~% dd if=/dev/zero of=/opt/vm/10G bs=1M count=1000
> 1000+0 records in
> 1000+0 records out
> 1048576000 bytes (1.0 GB) copied, 12.7143 s, 82.5 MB/s
>
> output of vmmon:
>
> nr_dirty nr_writeback
> 0 0
> 0 0
> 56795 0
> 51655 17020
> 52071 17511
> 51648 16898
> 51655 16485
> 52369 17425
> 51648 16930
> 51470 16809
> 52630 17267
> 51287 16634
> 51260 16641
> 51310 16903
> 51281 16379
> 46073 11169
> 46086 0
> 46089 0
> 3132 9657
> 21 17677
> 3 14107
> 14 2
> 0 0
> 0 0
>
> In this case nr_dirty stays almost constant.
I can see the growth when I increased the dd size to 2GB,
and the dd throughput decreased from 82.5MB/s to 60.9MB/s.
wfg ~% dd if=/dev/zero of=/opt/vm/10G bs=1M count=2000
2000+0 records in
2000+0 records out
2097152000 bytes (2.1 GB) copied, 34.4114 s, 60.9 MB/s
nr_dirty nr_writeback
0 0
44980 0
49929 20353
49929 20353
49189 17822
54556 14852
49191 17717
52455 15501
49903 19330
50077 17293
50040 19111
52097 7040
52656 16797
53361 19455
53551 16999
57599 16396
55165 6801
57626 16534
56193 18795
57888 16655
57740 18818
65759 11304
60015 19842
61136 16618
62166 17429
62160 16782
62036 11907
59237 13715
61991 18561
66256 15111
60574 17551
17926 17930
17919 17057
17919 16379
11 13717
11470 4606
2 913
2 0
10 0
10 0
0 0
0 0
But when I redid the above test after dropping all of the ~3GB of
caches, the dirty limit again seemed to remain constant.
# echo 1 > /proc/sys/vm/drop_caches
wfg ~% dd if=/dev/zero of=/opt/vm/10G bs=1M count=2000
2000+0 records in
2000+0 records out
2097152000 bytes (2.1 GB) copied, 33.3299 s, 62.9 MB/s
nr_dirty nr_writeback
0 0
76425 10825
66255 17302
69942 15865
65332 17305
71207 14605
69957 15380
65901 18960
66365 16233
66040 17041
66042 16378
66434 2169
67606 17143
68660 17195
67613 16514
67366 17415
65784 4620
69053 16831
66037 17033
64601 19936
64629 16922
70459 9227
66673 17789
65638 20102
65166 17662
66255 16286
69821 11264
82247 4113
64012 18060
29585 17920
5872 16653
5872 14197
25422 1913
5884 16658
0 12027
2 26
2 0
2 0
Thanks,
Fengguang
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
2009-08-01 4:53 ` Wu Fengguang
@ 2009-08-01 5:03 ` Wu Fengguang
0 siblings, 0 replies; 32+ messages in thread
From: Wu Fengguang @ 2009-08-01 5:03 UTC (permalink / raw)
To: Martin Bligh
Cc: Jens Axboe, Chad Talbott, linux-kernel, linux-mm, Michael Rubin,
sandeen, Andrew Morton, Peter Zijlstra
On Sat, Aug 01, 2009 at 12:53:46PM +0800, Wu Fengguang wrote:
> On Sat, Aug 01, 2009 at 12:03:13PM +0800, Wu Fengguang wrote:
> I can see the growth when I increased the dd size to 2GB,
> and the dd throughput decreased from 82.5MB/s to 60.9MB/s.
The raw disk write throughput seems to be 60MB/s:
wfg ~% dd if=/dev/zero of=/opt/vm/200M bs=1M count=200 oflag=direct
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 3.48137 s, 60.2 MB/s
read throughput is a bit better:
wfg ~% dd of=/dev/null if=/opt/vm/200M bs=1M count=200 iflag=direct
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 2.66606 s, 78.7 MB/s
# hdparm -tT /dev/hda
/dev/hda:
Timing cached reads: 10370 MB in 1.99 seconds = 5213.70 MB/sec
Timing buffered disk reads: 216 MB in 3.03 seconds = 71.22 MB/sec
And sync writes are pretty slow:
wfg ~% dd if=/dev/zero of=/opt/vm/200M bs=1M count=200 oflag=sync
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 10.4741 s, 20.0 MB/s
Thanks,
Fengguang