* Bug in kernel 2.6.31, Slow wb_kupdate writeout
@ 2009-07-28 19:11 Chad Talbott
2009-07-28 21:49 ` Martin Bligh
2009-07-30 21:39 ` Jens Axboe
0 siblings, 2 replies; 32+ messages in thread
From: Chad Talbott @ 2009-07-28 19:11 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, wfg, Martin Bligh, Michael Rubin, Andrew Morton,
sandeen
I run a simple workload on a 4GB machine which dirties a few largish
inodes like so:
# seq 10 | xargs -P0 -n1 -i\{} dd if=/dev/zero of=/tmp/dump\{}
bs=1024k count=100
While the dds are running, data is written out at disk speed. However,
once the dds have run to completion and exited, there is ~500MB of
dirty memory left. Background writeout then takes about 3 more
minutes to clean memory at only ~3.3MB/s. When I explicitly sync, I
can see that the disk is capable of 40MB/s, which finishes off the
files in ~10s. [1]
An interesting recent-ish change is "writeback: speed up writeback of
big dirty files." When I revert the change to __sync_single_inode the
problem appears to go away and background writeout proceeds at disk
speed. Interestingly, that code is in the git commit [2], but not in
the post to LKML. [3]  This may not be the fix, but it makes this
test behave better.
Thanks,
Chad
[1] I've plotted the dirty memory from /proc/meminfo and disk write
speed from iostat at
http://sites.google.com/site/cwtlinux/2-6-31-writeback-bug
[2] git commit:
http://mirror.celinuxforum.org/gitstat/commit-detail.php?commit=8bc3be2751b4f74ab90a446da1912fd8204d53f7
[3] LKML post: http://marc.info/?l=linux-kernel&m=119131601130372&w=2
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-28 19:11 Bug in kernel 2.6.31, Slow wb_kupdate writeout Chad Talbott
@ 2009-07-28 21:49 ` Martin Bligh
  2009-07-29  7:15   ` Martin Bligh
  2009-07-30 21:39 ` Jens Axboe
  1 sibling, 1 reply; 32+ messages in thread
From: Martin Bligh @ 2009-07-28 21:49 UTC (permalink / raw)
To: Chad Talbott
Cc: linux-kernel, linux-mm, wfg, Michael Rubin, Andrew Morton, sandeen

> An interesting recent-ish change is "writeback: speed up writeback of
> big dirty files."  When I revert the change to __sync_single_inode the
> problem appears to go away and background writeout proceeds at disk
> speed.  Interestingly, that code is in the git commit [2], but not in
> the post to LKML. [3]  This may not be the fix, but it makes this
> test behave better.

I'm fairly sure this is not fixing the root cause - but putting it at
the head rather than the tail of the queue causes the error not to
starve wb_kupdate for nearly so long - as long as we keep the queue
full, the bug is hidden.

M.
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-28 21:49 ` Martin Bligh
@ 2009-07-29  7:15   ` Martin Bligh
  2009-07-29 11:43     ` Wu Fengguang
  0 siblings, 1 reply; 32+ messages in thread
From: Martin Bligh @ 2009-07-29 7:15 UTC (permalink / raw)
To: Chad Talbott
Cc: linux-kernel, linux-mm, wfg, Michael Rubin, Andrew Morton,
    sandeen, Michael Davidson

On Tue, Jul 28, 2009 at 2:49 PM, Martin Bligh<mbligh@google.com> wrote:
>> An interesting recent-ish change is "writeback: speed up writeback of
>> big dirty files."  When I revert the change to __sync_single_inode the
>> problem appears to go away and background writeout proceeds at disk
>> speed.  Interestingly, that code is in the git commit [2], but not in
>> the post to LKML. [3]  This may not be the fix, but it makes this
>> test behave better.
>
> I'm fairly sure this is not fixing the root cause - but putting it at the head
> rather than the tail of the queue causes the error not to starve wb_kupdate
> for nearly so long - as long as we keep the queue full, the bug is hidden.

OK, it seems this is the root cause - I wasn't clear why all the pages
weren't being written back, and thought there was another bug. What
happens is we go into write_cache_pages, and stuff the disk queue with
as much as we can put into it, and then inevitably hit the congestion
limit.

Then we back out to __sync_single_inode, who says "huh, you didn't
manage to write your whole slice", and penalizes the poor blameless
inode in question by putting it back into the penalty box for 30s.

This results in very lumpy I/O writeback at 5s intervals, and very
poor throughput.

Patch below is inline and probably text munged, but is for RFC only.
I'll test it more thoroughly tomorrow. As for the comment about
starving other writes, I believe requeue_io moves it from s_io to
s_more_io, which should at least allow some progress of other files.

--- linux-2.6.30/fs/fs-writeback.c.old  2009-07-29 00:08:29.000000000 -0700
+++ linux-2.6.30/fs/fs-writeback.c      2009-07-29 00:11:28.000000000 -0700
@@ -322,46 +322,11 @@ __sync_single_inode(struct inode *inode,
 		/*
 		 * We didn't write back all the pages.  nfs_writepages()
 		 * sometimes bales out without doing anything. Redirty
-		 * the inode; Move it from s_io onto s_more_io/s_dirty.
+		 * the inode; Move it from s_io onto s_more_io. It
+		 * may well have just encountered congestion
 		 */
-		/*
-		 * akpm: if the caller was the kupdate function we put
-		 * this inode at the head of s_dirty so it gets first
-		 * consideration.  Otherwise, move it to the tail, for
-		 * the reasons described there.  I'm not really sure
-		 * how much sense this makes.  Presumably I had a good
-		 * reasons for doing it this way, and I'd rather not
-		 * muck with it at present.
-		 */
-		if (wbc->for_kupdate) {
-			/*
-			 * For the kupdate function we move the inode
-			 * to s_more_io so it will get more writeout as
-			 * soon as the queue becomes uncongested.
-			 */
-			inode->i_state |= I_DIRTY_PAGES;
-			if (wbc->nr_to_write <= 0) {
-				/*
-				 * slice used up: queue for next turn
-				 */
-				requeue_io(inode);
-			} else {
-				/*
-				 * somehow blocked: retry later
-				 */
-				redirty_tail(inode);
-			}
-		} else {
-			/*
-			 * Otherwise fully redirty the inode so that
-			 * other inodes on this superblock will get some
-			 * writeout.  Otherwise heavy writing to one
-			 * file would indefinitely suspend writeout of
-			 * all the other files.
-			 */
-			inode->i_state |= I_DIRTY_PAGES;
-			redirty_tail(inode);
-		}
+		inode->i_state |= I_DIRTY_PAGES;
+		requeue_io(inode);
 	} else if (inode->i_state & I_DIRTY) {
 		/*
 		 * Someone redirtied the inode while were writing back
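
For readers following along without a tree handy, the two queueing
helpers the thread keeps contrasting behave roughly as below in the
2.6.30-era fs/fs-writeback.c. This is a paraphrased sketch, not a
verbatim copy; the list and field names match that tree:

	/* Paraphrased sketch of 2.6.30-era fs/fs-writeback.c helpers */
	static void redirty_tail(struct inode *inode)
	{
		struct super_block *sb = inode->i_sb;

		/*
		 * Re-stamp dirtied_when so the inode queues behind the
		 * youngest entry on s_dirty - it then sits out a full
		 * dirty expire interval (~30s) before kupdate will
		 * consider it again.
		 */
		if (!list_empty(&sb->s_dirty)) {
			struct inode *tail;

			tail = list_entry(sb->s_dirty.next,
					  struct inode, i_list);
			if (time_before(inode->dirtied_when,
					tail->dirtied_when))
				inode->dirtied_when = jiffies;
		}
		list_move(&inode->i_list, &sb->s_dirty);
	}

	static void requeue_io(struct inode *inode)
	{
		/* Stay on s_more_io: retried on the very next pass. */
		list_move(&inode->i_list, &inode->i_sb->s_more_io);
	}

The re-stamping in redirty_tail() is what puts a congested-but-blameless
inode into the ~30-second penalty box described above, while requeue_io()
merely defers it to the next writeback pass.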
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-29  7:15 ` Martin Bligh
@ 2009-07-29 11:43   ` Wu Fengguang
  2009-07-29 14:11     ` Martin Bligh
  2009-07-30  0:19     ` Martin Bligh
  0 siblings, 2 replies; 32+ messages in thread
From: Wu Fengguang @ 2009-07-29 11:43 UTC (permalink / raw)
To: Martin Bligh
Cc: Chad Talbott, linux-kernel, linux-mm, Michael Rubin,
    Andrew Morton, sandeen, Michael Davidson

On Wed, Jul 29, 2009 at 12:15:48AM -0700, Martin Bligh wrote:
> On Tue, Jul 28, 2009 at 2:49 PM, Martin Bligh<mbligh@google.com> wrote:
> >> An interesting recent-ish change is "writeback: speed up writeback of
> >> big dirty files."  When I revert the change to __sync_single_inode the
> >> problem appears to go away and background writeout proceeds at disk
> >> speed.  Interestingly, that code is in the git commit [2], but not in
> >> the post to LKML. [3]  This may not be the fix, but it makes this
> >> test behave better.
> >
> > I'm fairly sure this is not fixing the root cause - but putting it at the head
> > rather than the tail of the queue causes the error not to starve wb_kupdate
> > for nearly so long - as long as we keep the queue full, the bug is hidden.
>
> OK, it seems this is the root cause - I wasn't clear why all the pages
> weren't being written back, and thought there was another bug. What
> happens is we go into write_cache_pages, and stuff the disk queue with
> as much as we can put into it, and then inevitably hit the congestion
> limit.
>
> Then we back out to __sync_single_inode, who says "huh, you didn't
> manage to write your whole slice", and penalizes the poor blameless
> inode in question by putting it back into the penalty box for 30s.
>
> This results in very lumpy I/O writeback at 5s intervals, and very
> poor throughput.

You are right, so let's fix the congestion case. Your analysis would
make a perfect changelog :)

> Patch below is inline and probably text munged, but is for RFC only.
> I'll test it more thoroughly tomorrow. As for the comment about
> starving other writes, I believe requeue_io moves it from s_io to
> s_more_io, which should at least allow some progress of other files.
>
> --- linux-2.6.30/fs/fs-writeback.c.old  2009-07-29 00:08:29.000000000 -0700
> +++ linux-2.6.30/fs/fs-writeback.c      2009-07-29 00:11:28.000000000 -0700
> @@ -322,46 +322,11 @@ __sync_single_inode(struct inode *inode,
>  		/*
>  		 * We didn't write back all the pages.  nfs_writepages()
>  		 * sometimes bales out without doing anything. Redirty
[snip]
> -			if (wbc->nr_to_write <= 0) {
> -				/*
> -				 * slice used up: queue for next turn
> -				 */
> -				requeue_io(inode);
> -			} else {
> -				/*
> -				 * somehow blocked: retry later
> -				 */
> -				redirty_tail(inode);

Removing this line can be dangerous - we'll probably go into busy
waiting (I have tried that long ago).

Chad, can you try this small patch? Thank you.

Thanks,
Fengguang
---
 fs/fs-writeback.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- mm.orig/fs/fs-writeback.c
+++ mm/fs/fs-writeback.c
@@ -325,7 +325,8 @@ __sync_single_inode(struct inode *inode,
 			 * soon as the queue becomes uncongested.
 			 */
 			inode->i_state |= I_DIRTY_PAGES;
-			if (wbc->nr_to_write <= 0) {
+			if (wbc->nr_to_write <= 0 ||
+			    wbc->encountered_congestion) {
 				/*
 				 * slice used up: queue for next turn
 				 */
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-29 11:43 ` Wu Fengguang
@ 2009-07-29 14:11   ` Martin Bligh
  2009-07-30  1:06     ` Wu Fengguang
  1 sibling, 1 reply; 32+ messages in thread
From: Martin Bligh @ 2009-07-29 14:11 UTC (permalink / raw)
To: Wu Fengguang
Cc: Chad Talbott, linux-kernel, linux-mm, Michael Rubin,
    Andrew Morton, sandeen, Michael Davidson

> --- mm.orig/fs/fs-writeback.c
> +++ mm/fs/fs-writeback.c
> @@ -325,7 +325,8 @@ __sync_single_inode(struct inode *inode,
>  			 * soon as the queue becomes uncongested.
>  			 */
>  			inode->i_state |= I_DIRTY_PAGES;
> -			if (wbc->nr_to_write <= 0) {
> +			if (wbc->nr_to_write <= 0 ||
> +			    wbc->encountered_congestion) {
>  				/*
>  				 * slice used up: queue for next turn
>  				 */
>

That's not sufficient - it only fixes the problem in the wb_kupdate
path. If you want to be more conservative, how about we do this?

--- linux-2.6.30/fs/fs-writeback.c.old  2009-07-29 00:08:29.000000000 -0700
+++ linux-2.6.30/fs/fs-writeback.c      2009-07-29 07:08:48.000000000 -0700
@@ -323,43 +323,14 @@ __sync_single_inode(struct inode *inode,
 		 * We didn't write back all the pages.  nfs_writepages()
 		 * sometimes bales out without doing anything. Redirty
 		 * the inode; Move it from s_io onto s_more_io/s_dirty.
+		 * It may well have just encountered congestion
 		 */
-		/*
-		 * akpm: if the caller was the kupdate function we put
-		 * this inode at the head of s_dirty so it gets first
-		 * consideration.  Otherwise, move it to the tail, for
-		 * the reasons described there.  I'm not really sure
-		 * how much sense this makes.  Presumably I had a good
-		 * reasons for doing it this way, and I'd rather not
-		 * muck with it at present.
-		 */
-		if (wbc->for_kupdate) {
-			/*
-			 * For the kupdate function we move the inode
-			 * to s_more_io so it will get more writeout as
-			 * soon as the queue becomes uncongested.
-			 */
-			inode->i_state |= I_DIRTY_PAGES;
-			if (wbc->nr_to_write <= 0) {
-				/*
-				 * slice used up: queue for next turn
-				 */
-				requeue_io(inode);
-			} else {
-				/*
-				 * somehow blocked: retry later
-				 */
-				redirty_tail(inode);
-			}
-		} else {
-			/*
-			 * Otherwise fully redirty the inode so that
-			 * other inodes on this superblock will get some
-			 * writeout.  Otherwise heavy writing to one
-			 * file would indefinitely suspend writeout of
-			 * all the other files.
-			 */
-			inode->i_state |= I_DIRTY_PAGES;
+		inode->i_state |= I_DIRTY_PAGES;
+		if (wbc->nr_to_write <= 0 ||	/* slice used up */
+		    wbc->encountered_congestion)
+			requeue_io(inode);
+		else {
+			/* somehow blocked: retry later */
 			redirty_tail(inode);
 		}
 	} else if (inode->i_state & I_DIRTY) {
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-29 14:11 ` Martin Bligh
@ 2009-07-30  1:06   ` Wu Fengguang
  2009-07-30  1:12     ` Martin Bligh
  0 siblings, 1 reply; 32+ messages in thread
From: Wu Fengguang @ 2009-07-30 1:06 UTC (permalink / raw)
To: Martin Bligh
Cc: Chad Talbott, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    Michael Rubin, Andrew Morton, sandeen@redhat.com, Michael Davidson

On Wed, Jul 29, 2009 at 10:11:10PM +0800, Martin Bligh wrote:
> > --- mm.orig/fs/fs-writeback.c
> > +++ mm/fs/fs-writeback.c
> > @@ -325,7 +325,8 @@ __sync_single_inode(struct inode *inode,
> >  			 * soon as the queue becomes uncongested.
> >  			 */
> >  			inode->i_state |= I_DIRTY_PAGES;
> > -			if (wbc->nr_to_write <= 0) {
> > +			if (wbc->nr_to_write <= 0 ||
> > +			    wbc->encountered_congestion) {
> >  				/*
> >  				 * slice used up: queue for next turn
> >  				 */
> >
>
> That's not sufficient - it only fixes the problem in the wb_kupdate
> path. If you want to be more conservative, how about we do this?

I agree on the unification of the kupdate and sync paths. In fact I had
a patch for doing this. And I'd recommend to do it in two patches:
one to fix the congestion case, another to do the code unification.

The sync path doesn't care whether requeue_io() or redirty_tail() is
used, because it disregards the time stamps totally - only the order of
inodes matters (ie. starvation), which is the same for
requeue_io()/redirty_tail().

Thanks,
Fengguang

> --- linux-2.6.30/fs/fs-writeback.c.old  2009-07-29 00:08:29.000000000 -0700
> +++ linux-2.6.30/fs/fs-writeback.c      2009-07-29 07:08:48.000000000 -0700
> @@ -323,43 +323,14 @@ __sync_single_inode(struct inode *inode,
>  		 * We didn't write back all the pages.  nfs_writepages()
>  		 * sometimes bales out without doing anything. Redirty
>  		 * the inode; Move it from s_io onto s_more_io/s_dirty.
> +		 * It may well have just encountered congestion
>  		 */
> -		/*
> -		 * akpm: if the caller was the kupdate function we put
> -		 * this inode at the head of s_dirty so it gets first
> -		 * consideration.  Otherwise, move it to the tail, for
> -		 * the reasons described there.  I'm not really sure
> -		 * how much sense this makes.  Presumably I had a good
> -		 * reasons for doing it this way, and I'd rather not
> -		 * muck with it at present.
> -		 */
> -		if (wbc->for_kupdate) {
> -			/*
> -			 * For the kupdate function we move the inode
> -			 * to s_more_io so it will get more writeout as
> -			 * soon as the queue becomes uncongested.
> -			 */
> -			inode->i_state |= I_DIRTY_PAGES;
> -			if (wbc->nr_to_write <= 0) {
> -				/*
> -				 * slice used up: queue for next turn
> -				 */
> -				requeue_io(inode);
> -			} else {
> -				/*
> -				 * somehow blocked: retry later
> -				 */
> -				redirty_tail(inode);
> -			}
> -		} else {
> -			/*
> -			 * Otherwise fully redirty the inode so that
> -			 * other inodes on this superblock will get some
> -			 * writeout.  Otherwise heavy writing to one
> -			 * file would indefinitely suspend writeout of
> -			 * all the other files.
> -			 */
> -			inode->i_state |= I_DIRTY_PAGES;
> +		inode->i_state |= I_DIRTY_PAGES;
> +		if (wbc->nr_to_write <= 0 ||	/* slice used up */
> +		    wbc->encountered_congestion)
> +			requeue_io(inode);
> +		else {
> +			/* somehow blocked: retry later */
>  			redirty_tail(inode);
>  		}
>  	} else if (inode->i_state & I_DIRTY) {
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-30  1:06 ` Wu Fengguang
@ 2009-07-30  1:12   ` Martin Bligh
  2009-07-30  1:57     ` Wu Fengguang
  0 siblings, 1 reply; 32+ messages in thread
From: Martin Bligh @ 2009-07-30 1:12 UTC (permalink / raw)
To: Wu Fengguang
Cc: Chad Talbott, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    Michael Rubin, Andrew Morton, sandeen@redhat.com, Michael Davidson

> I agree on the unification of the kupdate and sync paths. In fact I had
> a patch for doing this. And I'd recommend to do it in two patches:
> one to fix the congestion case, another to do the code unification.
>
> The sync path doesn't care whether requeue_io() or redirty_tail() is
> used, because it disregards the time stamps totally - only the order of
> inodes matters (ie. starvation), which is the same for
> requeue_io()/redirty_tail().

But, as I understand it, both paths share the same lists, so we still
have to be consistent?

Also, you set flags like more_io higher up in sync_sb_inodes() based on
whether there's anything in the s_more_io queue, so it still seems to
have some effect to me?
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-30  1:12 ` Martin Bligh
@ 2009-07-30  1:57   ` Wu Fengguang
  2009-07-30  2:59     ` Martin Bligh
  0 siblings, 1 reply; 32+ messages in thread
From: Wu Fengguang @ 2009-07-30 1:57 UTC (permalink / raw)
To: Martin Bligh
Cc: Chad Talbott, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    Michael Rubin, Andrew Morton, sandeen@redhat.com, Michael Davidson

On Thu, Jul 30, 2009 at 09:12:26AM +0800, Martin Bligh wrote:
> > I agree on the unification of the kupdate and sync paths. In fact I had
> > a patch for doing this. And I'd recommend to do it in two patches:
> > one to fix the congestion case, another to do the code unification.
> >
> > The sync path doesn't care whether requeue_io() or redirty_tail() is
> > used, because it disregards the time stamps totally - only the order of
> > inodes matters (ie. starvation), which is the same for
> > requeue_io()/redirty_tail().
>
> But, as I understand it, both paths share the same lists, so we still
> have to be consistent?

Then let's first unify the code, then fix the congestion case? :)

> Also, you set flags like more_io higher up in sync_sb_inodes() based on
> whether there's anything in the s_more_io queue, so it still seems to
> have some effect to me?

Yes, maybe in some rare cases.
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-30  1:57 ` Wu Fengguang
@ 2009-07-30  2:59   ` Martin Bligh
  2009-07-30  4:08     ` Wu Fengguang
  0 siblings, 1 reply; 32+ messages in thread
From: Martin Bligh @ 2009-07-30 2:59 UTC (permalink / raw)
To: Wu Fengguang
Cc: Chad Talbott, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    Michael Rubin, Andrew Morton, sandeen@redhat.com, Michael Davidson

On Wed, Jul 29, 2009 at 6:57 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> On Thu, Jul 30, 2009 at 09:12:26AM +0800, Martin Bligh wrote:
>> > I agree on the unification of the kupdate and sync paths. In fact I had
>> > a patch for doing this. And I'd recommend to do it in two patches:
>> > one to fix the congestion case, another to do the code unification.
>> >
>> > The sync path doesn't care whether requeue_io() or redirty_tail() is
>> > used, because it disregards the time stamps totally - only the order of
>> > inodes matters (ie. starvation), which is the same for
>> > requeue_io()/redirty_tail().
>>
>> But, as I understand it, both paths share the same lists, so we still
>> have to be consistent?
>
> Then let's first unify the code, then fix the congestion case? :)

OK, I will send it out as separate patches. I am just finishing up the
testing first.

M.
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-30  2:59 ` Martin Bligh
@ 2009-07-30  4:08   ` Wu Fengguang
  2009-07-30 19:55     ` Martin Bligh
  0 siblings, 1 reply; 32+ messages in thread
From: Wu Fengguang @ 2009-07-30 4:08 UTC (permalink / raw)
To: Martin Bligh
Cc: Chad Talbott, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    Michael Rubin, Andrew Morton, sandeen@redhat.com, Michael Davidson

On Thu, Jul 30, 2009 at 10:59:09AM +0800, Martin Bligh wrote:
> On Wed, Jul 29, 2009 at 6:57 PM, Wu Fengguang<fengguang.wu@intel.com> wrote:
> > On Thu, Jul 30, 2009 at 09:12:26AM +0800, Martin Bligh wrote:
> >> > I agree on the unification of the kupdate and sync paths. In fact I had
> >> > a patch for doing this. And I'd recommend to do it in two patches:
> >> > one to fix the congestion case, another to do the code unification.
> >> >
> >> > The sync path doesn't care whether requeue_io() or redirty_tail() is
> >> > used, because it disregards the time stamps totally - only the order of
> >> > inodes matters (ie. starvation), which is the same for
> >> > requeue_io()/redirty_tail().
> >>
> >> But, as I understand it, both paths share the same lists, so we still
> >> have to be consistent?
> >
> > Then let's first unify the code, then fix the congestion case? :)
>
> OK, I will send it out as separate patches. I am just finishing up the
> testing first.

Note that this is a simple fix that may have suboptimal write performance.
Here is an old reasoning:

        http://lkml.org/lkml/2009/3/28/235

Thanks,
Fengguang
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-30  4:08 ` Wu Fengguang
@ 2009-07-30 19:55   ` Martin Bligh
  2009-08-01  2:02     ` Wu Fengguang
  0 siblings, 1 reply; 32+ messages in thread
From: Martin Bligh @ 2009-07-30 19:55 UTC (permalink / raw)
To: Wu Fengguang
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Michael Rubin,
    sandeen@redhat.com, Michael Davidson, Andrew Morton, Peter Zijlstra

> Note that this is a simple fix that may have suboptimal write performance.
> Here is an old reasoning:
>
>         http://lkml.org/lkml/2009/3/28/235

The other thing I've been experimenting with is to disable the per-page
check in write_cache_pages, ie:

		if (wbc->nonblocking && bdi_write_congested(bdi)) {
			wb_stats_inc(WB_STATS_WCP_SECTION_CONG);
			wbc->encountered_congestion = 1;
			/* done = 1; */

This treats the congestion limits as soft, but encourages us to write
back in larger, more efficient chunks. If that's not going to scare
people unduly, I can submit that as well.
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-30 19:55 ` Martin Bligh
@ 2009-08-01  2:02   ` Wu Fengguang
  0 siblings, 0 replies; 32+ messages in thread
From: Wu Fengguang @ 2009-08-01 2:02 UTC (permalink / raw)
To: Martin Bligh
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Michael Rubin,
    sandeen@redhat.com, Michael Davidson, Andrew Morton, Peter Zijlstra

[-- Attachment #1: Type: text/plain, Size: 1560 bytes --]

On Fri, Jul 31, 2009 at 03:55:44AM +0800, Martin Bligh wrote:
> > Note that this is a simple fix that may have suboptimal write performance.
> > Here is an old reasoning:
> >
> >         http://lkml.org/lkml/2009/3/28/235
>
> The other thing I've been experimenting with is to disable the per-page
> check in write_cache_pages, ie:
>
> 		if (wbc->nonblocking && bdi_write_congested(bdi)) {
> 			wb_stats_inc(WB_STATS_WCP_SECTION_CONG);
> 			wbc->encountered_congestion = 1;
> 			/* done = 1; */
>
> This treats the congestion limits as soft, but encourages us to write
> back in larger, more efficient chunks. If that's not going to scare
> people unduly, I can submit that as well.

This risks hitting the hard limit (nr_requests), and blocking everyone,
including the ones with higher priority (ie. kswapd).

On the other hand, the simple fix in previous mails won't necessarily
behave too sub-optimally - it's only a potential problem. There is a
window of (1/16)*(nr_requests)*request_size (= 128*256KB/16 = 4MB)
between the congestion-on and congestion-off states. So at best we can
inject a big 4MB chunk into the async write queue once it becomes
uncongested.

I have a writeback debug patch that can help find out how that works
out in your real world workloads (by monitoring nr_to_write). You can
also try doubling the ratio (1/16) in blk_queue_congestion_threshold(),
to see how an increased congestion-on-off window may help.

Thanks,
Fengguang

[-- Attachment #2: writeback-debug-2.6.31.patch --]
[-- Type: text/x-diff, Size: 2713 bytes --]

 mm/page-writeback.c |   38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

--- sound-2.6.orig/mm/page-writeback.c
+++ sound-2.6/mm/page-writeback.c
@@ -116,6 +116,33 @@ EXPORT_SYMBOL(laptop_mode);

 /* End of sysctl-exported parameters */

+#define writeback_debug_report(n, wbc) do {				\
+	__writeback_debug_report(n, wbc, __FILE__, __LINE__, __FUNCTION__); \
+} while (0)
+
+void print_writeback_control(struct writeback_control *wbc)
+{
+	printk(KERN_DEBUG
+	       "global dirty %lu writeback %lu nfs %lu "
+	       "flags %c%c towrite %ld skipped %ld\n",
+	       global_page_state(NR_FILE_DIRTY),
+	       global_page_state(NR_WRITEBACK),
+	       global_page_state(NR_UNSTABLE_NFS),
+	       wbc->encountered_congestion ? 'C':'_',
+	       wbc->more_io ? 'M':'_',
+	       wbc->nr_to_write,
+	       wbc->pages_skipped);
+}
+
+void __writeback_debug_report(long n, struct writeback_control *wbc,
+		const char *file, int line, const char *func)
+{
+	printk(KERN_DEBUG "%s %d %s: %s(%d) %ld\n",
+	       file, line, func,
+	       current->comm, current->pid,
+	       n);
+	print_writeback_control(wbc);
+}

 static void background_writeout(unsigned long _min_pages);

@@ -550,6 +577,7 @@ static void balance_dirty_pages(struct a
 		pages_written += write_chunk - wbc.nr_to_write;
 		get_dirty_limits(&background_thresh, &dirty_thresh,
 			       &bdi_thresh, bdi);
+		writeback_debug_report(pages_written, &wbc);
 	}

 	/*
@@ -576,6 +604,7 @@ static void balance_dirty_pages(struct a
 			break;		/* We've done our duty */

 		congestion_wait(BLK_RW_ASYNC, HZ/10);
+		writeback_debug_report(-pages_written, &wbc);
 	}

 	if (bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh &&
@@ -670,6 +699,11 @@ void throttle_vm_writeout(gfp_t gfp_mask
 		    global_page_state(NR_WRITEBACK) <= dirty_thresh)
 			break;
 		congestion_wait(BLK_RW_ASYNC, HZ/10);
+		printk(KERN_DEBUG "throttle_vm_writeout: "
+		       "congestion_wait on %lu+%lu > %lu\n",
+		       global_page_state(NR_UNSTABLE_NFS),
+		       global_page_state(NR_WRITEBACK),
+		       dirty_thresh);

 		/*
 		 * The caller might hold locks which can prevent IO completion
@@ -719,7 +753,9 @@ static void background_writeout(unsigned
 			else
 				break;
 		}
+		writeback_debug_report(min_pages, &wbc);
 	}
+	writeback_debug_report(min_pages, &wbc);
 }

 /*
@@ -792,7 +828,9 @@ static void wb_kupdate(unsigned long arg
 			break;	/* All the old data is written */
 		}
 		nr_to_write -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
+		writeback_debug_report(nr_to_write, &wbc);
 	}
+	writeback_debug_report(nr_to_write, &wbc);
 	if (time_before(next_jif, jiffies + HZ))
 		next_jif = jiffies + HZ;
 	if (dirty_writeback_interval)
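
The hysteresis window Fengguang describes can be sanity-checked against
the block layer's threshold arithmetic. The sketch below assumes the
2.6.31-era blk_queue_congestion_threshold() formulas and a 256KB average
request size - both are assumptions, and with these particular numbers
the window works out nearer 2MB; larger requests widen it proportionally:

	#include <stdio.h>

	int main(void)
	{
		int nr_requests = 128;	/* default queue depth (assumed) */

		/* Assumed 2.6.31-era blk_queue_congestion_threshold()
		 * arithmetic: congestion is reported once allocated
		 * requests climb past "on", and cleared only after the
		 * queue drains below "off". */
		int on  = nr_requests - nr_requests / 8 + 1;	     /* 113 */
		int off = nr_requests - nr_requests / 8
				      - nr_requests / 16;	     /* 104 */
		int window = on - off;		/* ~nr_requests/16 requests */

		/* 256KB per request is purely illustrative. */
		printf("on=%d off=%d window=%d requests (~%d KB)\n",
		       on, off, window, window * 256);
		return 0;
	}

Doubling the 1/16 ratio, as suggested above, doubles that window and
therefore the size of the chunk that can be injected each time the
queue goes uncongested.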
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-29 11:43 ` Wu Fengguang
  2009-07-29 14:11   ` Martin Bligh
@ 2009-07-30  0:19   ` Martin Bligh
  2009-07-30  1:28     ` Martin Bligh
  2009-07-30  1:49     ` Wu Fengguang
  1 sibling, 2 replies; 32+ messages in thread
From: Martin Bligh @ 2009-07-30 0:19 UTC (permalink / raw)
To: Wu Fengguang
Cc: Chad Talbott, linux-kernel, linux-mm, Michael Rubin,
    Andrew Morton, sandeen, Michael Davidson

BTW, can you explain this code at the bottom of generic_sync_sb_inodes
for me?

		if (wbc->nr_to_write <= 0) {
			wbc->more_io = 1;
			break;
		}

I don't understand why we are setting more_io here? AFAICS, more_io
means there's more stuff to write ... I would think we'd set this if
nr_to_write was > 0 ?

Or just have the section below brought up above this break check and do:

	if (!list_empty(&sb->s_more_io) || !list_empty(&sb->s_io))
		wbc->more_io = 1;

Am I just misunderstanding the intent of more_io ?
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-30  0:19 ` Martin Bligh
@ 2009-07-30  1:28   ` Martin Bligh
  2009-07-30  2:09     ` Wu Fengguang
  0 siblings, 1 reply; 32+ messages in thread
From: Martin Bligh @ 2009-07-30 1:28 UTC (permalink / raw)
To: Wu Fengguang
Cc: Chad Talbott, linux-kernel, linux-mm, Michael Rubin,
    Andrew Morton, sandeen, Michael Davidson

On Wed, Jul 29, 2009 at 5:19 PM, Martin Bligh<mbligh@google.com> wrote:
> BTW, can you explain this code at the bottom of generic_sync_sb_inodes
> for me?
>
> 		if (wbc->nr_to_write <= 0) {
> 			wbc->more_io = 1;
> 			break;
> 		}
>
> I don't understand why we are setting more_io here? AFAICS, more_io
> means there's more stuff to write ... I would think we'd set this if
> nr_to_write was > 0 ?
>
> Or just have the section below brought up above this break check and do:
>
> 	if (!list_empty(&sb->s_more_io) || !list_empty(&sb->s_io))
> 		wbc->more_io = 1;
>
> Am I just misunderstanding the intent of more_io ?

I am thinking along the lines of:

@@ -638,13 +609,11 @@ sync_sb_inodes(struct super_block *sb, s
 		iput(inode);
 		cond_resched();
 		spin_lock(&inode_lock);
-		if (wbc->nr_to_write <= 0) {
-			wbc->more_io = 1;
+		if (wbc->nr_to_write <= 0)
 			break;
-		}
-		if (!list_empty(&sb->s_more_io))
-			wbc->more_io = 1;
 	}
+	if (!list_empty(&sb->s_more_io) || !list_empty(&sb->s_io))
+		wbc->more_io = 1;
 	return;		/* Leave any unwritten inodes on s_io */
 }
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-30  1:28 ` Martin Bligh
@ 2009-07-30  2:09   ` Wu Fengguang
  2009-07-30  2:57     ` Martin Bligh
  0 siblings, 1 reply; 32+ messages in thread
From: Wu Fengguang @ 2009-07-30 2:09 UTC (permalink / raw)
To: Martin Bligh
Cc: Chad Talbott, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    Michael Rubin, Andrew Morton, sandeen@redhat.com, Michael Davidson

On Thu, Jul 30, 2009 at 09:28:07AM +0800, Martin Bligh wrote:
> On Wed, Jul 29, 2009 at 5:19 PM, Martin Bligh<mbligh@google.com> wrote:
> > BTW, can you explain this code at the bottom of generic_sync_sb_inodes
> > for me?
> >
> > 		if (wbc->nr_to_write <= 0) {
> > 			wbc->more_io = 1;
> > 			break;
> > 		}
> >
> > I don't understand why we are setting more_io here? AFAICS, more_io
> > means there's more stuff to write ... I would think we'd set this if
> > nr_to_write was > 0 ?
> >
> > Or just have the section below brought up above this break check and do:
> >
> > 	if (!list_empty(&sb->s_more_io) || !list_empty(&sb->s_io))
> > 		wbc->more_io = 1;
> >
> > Am I just misunderstanding the intent of more_io ?
>
> I am thinking along the lines of:

On closer look I found this line:

		if (inode_dirtied_after(inode, start))
			break;

In this case "list_empty(&sb->s_io)" is not a good criterion: here we
are breaking away for some other reason, and shall not touch
wbc.more_io.

So let's stick with the current code?

Thanks,
Fengguang

> @@ -638,13 +609,11 @@ sync_sb_inodes(struct super_block *sb, s
>  		iput(inode);
>  		cond_resched();
>  		spin_lock(&inode_lock);
> -		if (wbc->nr_to_write <= 0) {
> -			wbc->more_io = 1;
> +		if (wbc->nr_to_write <= 0)
>  			break;
> -		}
> -		if (!list_empty(&sb->s_more_io))
> -			wbc->more_io = 1;
>  	}
> +	if (!list_empty(&sb->s_more_io) || !list_empty(&sb->s_io))
> +		wbc->more_io = 1;
>  	return;		/* Leave any unwritten inodes on s_io */
>  }
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-30  2:09 ` Wu Fengguang
@ 2009-07-30  2:57   ` Martin Bligh
  2009-07-30  3:19     ` Wu Fengguang
  0 siblings, 1 reply; 32+ messages in thread
From: Martin Bligh @ 2009-07-30 2:57 UTC (permalink / raw)
To: Wu Fengguang
Cc: Chad Talbott, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    Michael Rubin, Andrew Morton, sandeen@redhat.com, Michael Davidson

> On closer look I found this line:
>
> 		if (inode_dirtied_after(inode, start))
> 			break;

Ah, OK.

> In this case "list_empty(&sb->s_io)" is not a good criterion: here we
> are breaking away for some other reason, and shall not touch
> wbc.more_io.
>
> So let's stick with the current code?

Well, I see two problems. One is that we set more_io based on
whether s_more_io is empty or not before we finish the loop.
I can't see how this can be correct, especially as there can be
other concurrent writers. So somehow we need to check when
we exit the loop, not during it.

The other is that we're saying we are setting more_io when
nr_to_write is <= 0 ... but we only really check it when
nr_to_write is > 0 ... I can't see how this can be useful?
I'll admit there is one corner case when pages_skipped is set
from one of the branches, but I am really not sure what the
intended logic is here, given the above?

In the case where we hit the inode_dirtied_after break
condition, is it bad to set more_io ? There is more to do
on that inode after all. Is there a definition somewhere for
exactly what the more_io flag means?

M.
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-30  2:57 ` Martin Bligh
@ 2009-07-30  3:19   ` Wu Fengguang
  2009-07-30 20:33     ` Martin Bligh
  0 siblings, 1 reply; 32+ messages in thread
From: Wu Fengguang @ 2009-07-30 3:19 UTC (permalink / raw)
To: Martin Bligh
Cc: Chad Talbott, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    Michael Rubin, Andrew Morton, sandeen@redhat.com, Michael Davidson

On Thu, Jul 30, 2009 at 10:57:35AM +0800, Martin Bligh wrote:
> > On closer look I found this line:
> >
> > 		if (inode_dirtied_after(inode, start))
> > 			break;
>
> Ah, OK.
>
> > In this case "list_empty(&sb->s_io)" is not a good criterion: here we
> > are breaking away for some other reason, and shall not touch
> > wbc.more_io.
> >
> > So let's stick with the current code?
>
> Well, I see two problems. One is that we set more_io based on
> whether s_more_io is empty or not before we finish the loop.
> I can't see how this can be correct, especially as there can be
> other concurrent writers. So somehow we need to check when
> we exit the loop, not during it.

It is correct inside the loop, however with some overheads.

We put it inside the loop because sometimes the whole filesystem is
skipped and we shall not set more_io on them whether or not s_more_io
is empty.

> The other is that we're saying we are setting more_io when
> nr_to_write is <= 0 ... but we only really check it when
> nr_to_write is > 0 ... I can't see how this can be useful?

That's the caller's fault - I guess the logic was changed a bit by
Jens in linux-next. I noticed this just now. It shall be fixed.

> I'll admit there is one corner case when pages_skipped is set
> from one of the branches, but I am really not sure what the
> intended logic is here, given the above?
>
> In the case where we hit the inode_dirtied_after break
> condition, is it bad to set more_io ? There is more to do
> on that inode after all. Is there a definition somewhere for
> exactly what the more_io flag means?

"More dirty pages to be put to io"?

The exact semantics of more_io is determined by the caller, which used
to be (in 2.6.31):

background_writeout():

		if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
			/* Wrote less than expected */
			if (wbc.encountered_congestion || wbc.more_io)
				congestion_wait(BLK_RW_ASYNC, HZ/10);
			else
				break;
		}

wb_kupdate() is the same except that it does not check pages_skipped.

Note that in 2.6.31, more_io is not used at all for sync().

Thanks,
Fengguang
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-30  3:19 ` Wu Fengguang
@ 2009-07-30 20:33   ` Martin Bligh
  2009-08-01  2:58     ` Wu Fengguang
  2009-08-01  4:10     ` Wu Fengguang
  0 siblings, 2 replies; 32+ messages in thread
From: Martin Bligh @ 2009-07-30 20:33 UTC (permalink / raw)
To: Wu Fengguang
Cc: Chad Talbott, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    Michael Rubin, Andrew Morton, sandeen@redhat.com, Michael Davidson

(BTW: background ... I'm not picking through this code for fun, I'm
trying to debug writeback problems introduced in our new kernel
that are affecting Google production workloads ;-))

>> Well, I see two problems. One is that we set more_io based on
>> whether s_more_io is empty or not before we finish the loop.
>> I can't see how this can be correct, especially as there can be
>> other concurrent writers. So somehow we need to check when
>> we exit the loop, not during it.
>
> It is correct inside the loop, however with some overheads.
>
> We put it inside the loop because sometimes the whole filesystem is
> skipped and we shall not set more_io on them whether or not s_more_io
> is empty.

My point was that you're setting more_io based on a condition
at a point in time that isn't when you return to the caller.

By the time you return to the caller (after several more loop
iterations), that condition may no longer be true.

One other way to address that would be to set it only if we're
about to fall off the end of the loop, ie change it to:

	if (!list_empty(&sb->s_more_io) && list_empty(&sb->s_io))
		wbc->more_io = 1;

>> The other is that we're saying we are setting more_io when
>> nr_to_write is <= 0 ... but we only really check it when
>> nr_to_write is > 0 ... I can't see how this can be useful?
>
> That's the caller's fault - I guess the logic was changed a bit by
> Jens in linux-next. I noticed this just now. It shall be fixed.

I am guessing you're setting more_io here because we're stopping
because our slice expired, presumably without us completing
all the io there was to do? That doesn't seem entirely accurate,
we could have finished all the pending IO (particularly given that
we can go over nr_to_write somewhat and send it negative).
Hence, I thought that checking whether s_more_io and s_io were
empty at the time of return might be a more accurate check,
but on the other hand they are shared lists.
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-30 20:33 ` Martin Bligh
@ 2009-08-01  2:58   ` Wu Fengguang
  0 siblings, 0 replies; 32+ messages in thread
From: Wu Fengguang @ 2009-08-01 2:58 UTC (permalink / raw)
To: Martin Bligh
Cc: Chad Talbott, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    Michael Rubin, Andrew Morton, sandeen@redhat.com, Michael Davidson

On Fri, Jul 31, 2009 at 04:33:09AM +0800, Martin Bligh wrote:
> (BTW: background ... I'm not picking through this code for fun, I'm
> trying to debug writeback problems introduced in our new kernel
> that are affecting Google production workloads ;-))
>
> >> Well, I see two problems. One is that we set more_io based on
> >> whether s_more_io is empty or not before we finish the loop.
> >> I can't see how this can be correct, especially as there can be
> >> other concurrent writers. So somehow we need to check when
> >> we exit the loop, not during it.
> >
> > It is correct inside the loop, however with some overheads.
> >
> > We put it inside the loop because sometimes the whole filesystem is
> > skipped and we shall not set more_io on them whether or not s_more_io
> > is empty.
>
> My point was that you're setting more_io based on a condition
> at a point in time that isn't when you return to the caller.
>
> By the time you return to the caller (after several more loop
> iterations), that condition may no longer be true.

You are right in that sense. Sorry that my claim of correctness is
somehow biased: we normally care much about early abortion, and don't
mind one extra trip over the superblocks. And the extra trip should
be rare enough. I'd be surprised if you observed many of them in real
workloads.

> One other way to address that would be to set it only if we're
> about to fall off the end of the loop, ie change it to:
>
> 	if (!list_empty(&sb->s_more_io) && list_empty(&sb->s_io))
> 		wbc->more_io = 1;

Let more_io=0 when there are more inodes in s_io to be worked on?
I cannot understand it, and suspect we are talking about an imaginary
problem on this point ;)

> >> The other is that we're saying we are setting more_io when
> >> nr_to_write is <= 0 ... but we only really check it when
> >> nr_to_write is > 0 ... I can't see how this can be useful?
> >
> > That's the caller's fault - I guess the logic was changed a bit by
> > Jens in linux-next. I noticed this just now. It shall be fixed.
>
> I am guessing you're setting more_io here because we're stopping
> because our slice expired, presumably without us completing
> all the io there was to do? That doesn't seem entirely accurate,
> we could have finished all the pending IO (particularly given that
> we can go over nr_to_write somewhat and send it negative).
> Hence, I thought that checking whether s_more_io and s_io were
> empty at the time of return might be a more accurate check,
> but on the other hand they are shared lists.

Yes the current more_io logic is not entirely accurate, but I doubt
we can gain much and that the improvement can be done trivially (not
the line of code, but the analysis and tests involved).

Anyway if you would take the time to push forward a patch for reducing
the overheads of possible extra trips, I'll take the time to review
it ;)

Thanks,
Fengguang
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-30 20:33 ` Martin Bligh
  2009-08-01  2:58   ` Wu Fengguang
@ 2009-08-01  4:10   ` Wu Fengguang
  1 sibling, 0 replies; 32+ messages in thread
From: Wu Fengguang @ 2009-08-01 4:10 UTC (permalink / raw)
To: Martin Bligh
Cc: Chad Talbott, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    Michael Rubin, Andrew Morton, sandeen@redhat.com, Michael Davidson

On Fri, Jul 31, 2009 at 04:33:09AM +0800, Martin Bligh wrote:
> (BTW: background ... I'm not picking through this code for fun, I'm
> trying to debug writeback problems introduced in our new kernel
> that are affecting Google production workloads ;-))
>
> >> Well, I see two problems. One is that we set more_io based on
> >> whether s_more_io is empty or not before we finish the loop.
> >> I can't see how this can be correct, especially as there can be
> >> other concurrent writers. So somehow we need to check when
> >> we exit the loop, not during it.
> >
> > It is correct inside the loop, however with some overheads.
> >
> > We put it inside the loop because sometimes the whole filesystem is
> > skipped and we shall not set more_io on them whether or not s_more_io
> > is empty.
>
> My point was that you're setting more_io based on a condition
> at a point in time that isn't when you return to the caller.
>
> By the time you return to the caller (after several more loop
> iterations), that condition may no longer be true.
>
> One other way to address that would be to set it only if we're
> about to fall off the end of the loop, ie change it to:
>
> 	if (!list_empty(&sb->s_more_io) && list_empty(&sb->s_io))
> 		wbc->more_io = 1;

Ah I see it (as the below patch), looks reasonable to me.

Thanks,
Fengguang
---
 fs/fs-writeback.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- sound-2.6.orig/fs/fs-writeback.c
+++ sound-2.6/fs/fs-writeback.c
@@ -544,9 +544,9 @@ void generic_sync_sb_inodes(struct super
 			wbc->more_io = 1;
 			break;
 		}
-		if (!list_empty(&sb->s_more_io))
-			wbc->more_io = 1;
 	}
+	if (!list_empty(&sb->s_more_io) && list_empty(&sb->s_io))
+		wbc->more_io = 1;

 	if (sync) {
 		struct inode *inode, *old_inode = NULL;
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-30  0:19 ` Martin Bligh
  2009-07-30  1:28   ` Martin Bligh
@ 2009-07-30  1:49   ` Wu Fengguang
  1 sibling, 0 replies; 32+ messages in thread
From: Wu Fengguang @ 2009-07-30 1:49 UTC (permalink / raw)
To: Martin Bligh
Cc: Chad Talbott, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    Michael Rubin, Andrew Morton, sandeen@redhat.com, Michael Davidson

On Thu, Jul 30, 2009 at 08:19:34AM +0800, Martin Bligh wrote:
> BTW, can you explain this code at the bottom of generic_sync_sb_inodes
> for me?
>
> 		if (wbc->nr_to_write <= 0) {
> 			wbc->more_io = 1;
> 			break;
> 		}
>
> I don't understand why we are setting more_io here? AFAICS, more_io
> means there's more stuff to write ... I would think we'd set this if
> nr_to_write was > 0 ?

That's true: wbc.nr_to_write will always be set to MAX_WRITEBACK_PAGES
by wb_writeback() before entering generic_sync_sb_inodes(). So
wbc.nr_to_write <= 0 indicates we are interrupted by the quota and
should revisit generic_sync_sb_inodes() to check for more io (which
will _normally_ find more dirty pages to write).

> Or just have the section below brought up above this break check and do:
>
> 	if (!list_empty(&sb->s_more_io) || !list_empty(&sb->s_io))
> 		wbc->more_io = 1;
>
> Am I just misunderstanding the intent of more_io ?

It should be OK. I agree on the change if it makes the logic more
straightforward.

Thanks,
Fengguang
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-28 19:11 Bug in kernel 2.6.31, Slow wb_kupdate writeout Chad Talbott
  2009-07-28 21:49 ` Martin Bligh
@ 2009-07-30 21:39 ` Jens Axboe
  2009-07-30 22:01   ` Martin Bligh
  1 sibling, 1 reply; 32+ messages in thread
From: Jens Axboe @ 2009-07-30 21:39 UTC (permalink / raw)
To: Chad Talbott
Cc: linux-kernel, linux-mm, wfg, Martin Bligh, Michael Rubin,
    Andrew Morton, sandeen

On Tue, Jul 28 2009, Chad Talbott wrote:
> I run a simple workload on a 4GB machine which dirties a few largish
> inodes like so:
>
> # seq 10 | xargs -P0 -n1 -i\{} dd if=/dev/zero of=/tmp/dump\{}
> bs=1024k count=100
>
> While the dds are running, data is written out at disk speed. However,
> once the dds have run to completion and exited, there is ~500MB of
> dirty memory left. Background writeout then takes about 3 more
> minutes to clean memory at only ~3.3MB/s. When I explicitly sync, I
> can see that the disk is capable of 40MB/s, which finishes off the
> files in ~10s. [1]
>
> An interesting recent-ish change is "writeback: speed up writeback of
> big dirty files."  When I revert the change to __sync_single_inode the
> problem appears to go away and background writeout proceeds at disk
> speed.  Interestingly, that code is in the git commit [2], but not in
> the post to LKML. [3]  This may not be the fix, but it makes this
> test behave better.

Can I talk you into trying the per-bdi writeback patchset? I just tried
your test on a 16gb machine, and the dd's finish immediately since it
won't trip the writeout at that percentage of dirty memory. The 1GB of
dirty memory is flushed when it gets too old, 30 seconds later in two
chunks of writeout running at disk speed.

http://lkml.org/lkml/2009/7/30/302

You can either use the git branch, or just download

http://kernel.dk/writeback-v13.patch

and apply that to -git (or -rc4) directly.

--
Jens Axboe
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-30 21:39 ` Jens Axboe
@ 2009-07-30 22:01   ` Martin Bligh
  2009-07-30 22:17     ` Jens Axboe
  0 siblings, 1 reply; 32+ messages in thread
From: Martin Bligh @ 2009-07-30 22:01 UTC (permalink / raw)
To: Jens Axboe
Cc: Chad Talbott, linux-kernel, linux-mm, wfg, Michael Rubin,
    Andrew Morton, sandeen

On Thu, Jul 30, 2009 at 2:39 PM, Jens Axboe<jens.axboe@oracle.com> wrote:
> On Tue, Jul 28 2009, Chad Talbott wrote:
>> I run a simple workload on a 4GB machine which dirties a few largish
>> inodes like so:
>>
>> # seq 10 | xargs -P0 -n1 -i\{} dd if=/dev/zero of=/tmp/dump\{}
>> bs=1024k count=100
>>
>> While the dds are running, data is written out at disk speed. However,
>> once the dds have run to completion and exited, there is ~500MB of
>> dirty memory left. Background writeout then takes about 3 more
>> minutes to clean memory at only ~3.3MB/s. When I explicitly sync, I
>> can see that the disk is capable of 40MB/s, which finishes off the
>> files in ~10s. [1]
>
> Can I talk you into trying the per-bdi writeback patchset? I just tried
> your test on a 16gb machine, and the dd's finish immediately since it
> won't trip the writeout at that percentage of dirty memory. The 1GB of
> dirty memory is flushed when it gets too old, 30 seconds later in two
> chunks of writeout running at disk speed.

How big did you make the dds? It has to be writing more data than
you have RAM, or it's not going to do anything much interesting ;-)
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-30 22:01 ` Martin Bligh
@ 2009-07-30 22:17   ` Jens Axboe
  2009-07-30 22:34     ` Martin Bligh
  0 siblings, 1 reply; 32+ messages in thread
From: Jens Axboe @ 2009-07-30 22:17 UTC (permalink / raw)
To: Martin Bligh
Cc: Chad Talbott, linux-kernel, linux-mm, wfg, Michael Rubin,
    Andrew Morton, sandeen

On Thu, Jul 30 2009, Martin Bligh wrote:
> On Thu, Jul 30, 2009 at 2:39 PM, Jens Axboe<jens.axboe@oracle.com> wrote:
> > Can I talk you into trying the per-bdi writeback patchset? I just tried
> > your test on a 16gb machine, and the dd's finish immediately since it
> > won't trip the writeout at that percentage of dirty memory. The 1GB of
> > dirty memory is flushed when it gets too old, 30 seconds later in two
> > chunks of writeout running at disk speed.
>
> How big did you make the dds? It has to be writing more data than
> you have RAM, or it's not going to do anything much interesting ;-)

The test case above on a 4G machine is only generating 1G of dirty data.
I ran the same test case on the 16G, resulting in only background
writeout. The relevant bit here being that the background writeout
finished quickly, writing at disk speed.

I re-ran the same test, but using 300 100MB files instead. While the
dd's are running, we are going at ~80MB/sec (this is disk speed, it's an
x25-m). When the dd's are done, it continues doing 80MB/sec for 10
seconds or so. Then the remainder (about 2G) is written in bursts at
disk speeds, but with some time in between.

--
Jens Axboe
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-30 22:17 ` Jens Axboe
@ 2009-07-30 22:34   ` Martin Bligh
  2009-07-30 22:43     ` Jens Axboe
  2009-08-01  4:02     ` Wu Fengguang
  0 siblings, 2 replies; 32+ messages in thread
From: Martin Bligh @ 2009-07-30 22:34 UTC (permalink / raw)
To: Jens Axboe
Cc: Chad Talbott, linux-kernel, linux-mm, wfg, Michael Rubin,
    Andrew Morton, sandeen

> The test case above on a 4G machine is only generating 1G of dirty data.
> I ran the same test case on the 16G, resulting in only background
> writeout. The relevant bit here being that the background writeout
> finished quickly, writing at disk speed.
>
> I re-ran the same test, but using 300 100MB files instead. While the
> dd's are running, we are going at ~80MB/sec (this is disk speed, it's an
> x25-m). When the dd's are done, it continues doing 80MB/sec for 10
> seconds or so. Then the remainder (about 2G) is written in bursts at
> disk speeds, but with some time in between.

OK, I think the test case is sensitive to how many files you have - if
we punt them to the back of the list, and yet we still have 299 other
ones, it may well be able to keep the disk spinning despite the bug
I outlined. Try using 30 1GB files?

Though it doesn't seem to happen with just one dd streamer, and
I don't see why the bug doesn't trigger in that case either.

I believe the bugfix is correct independent of any bdi changes?

M.
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-30 22:34 ` Martin Bligh
@ 2009-07-30 22:43   ` Jens Axboe
  2009-07-30 22:48     ` Martin Bligh
  2009-08-01  4:02   ` Wu Fengguang
  1 sibling, 1 reply; 32+ messages in thread
From: Jens Axboe @ 2009-07-30 22:43 UTC (permalink / raw)
To: Martin Bligh
Cc: Chad Talbott, linux-kernel, linux-mm, wfg, Michael Rubin,
    Andrew Morton, sandeen

On Thu, Jul 30 2009, Martin Bligh wrote:
> > The test case above on a 4G machine is only generating 1G of dirty data.
> > I ran the same test case on the 16G machine, resulting in only background
> > writeout. The relevant bit here being that the background writeout
> > finished quickly, writing at disk speed.
> >
> > I re-ran the same test, but using 300 100MB files instead. While the
> > dd's are running, we are going at ~80MB/sec (this is disk speed, it's an
> > x25-m). When the dd's are done, it continues doing 80MB/sec for 10
> > seconds or so. Then the remainder (about 2G) is written in bursts at
> > disk speeds, but with some time in between.
>
> OK, I think the test case is sensitive to how many files you have - if
> we punt them to the back of the list, and yet we still have 299 other
> ones, it may well be able to keep the disk spinning despite the bug
> I outlined. Try using 30 1GB files?

If this disk starts spinning, then we have bigger bugs :-)

> Though it doesn't seem to happen with just one dd streamer, and
> I don't see why the bug doesn't trigger in that case either.
>
> I believe the bugfix is correct independent of any bdi changes?

Yeah I think so too, I'll run some more tests on this tomorrow and
verify it there as well.

-- 
Jens Axboe
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-30 22:43 ` Jens Axboe
@ 2009-07-30 22:48   ` Martin Bligh
  2009-07-31  7:50     ` Peter Zijlstra
  2009-08-01  4:03     ` Wu Fengguang
  0 siblings, 2 replies; 32+ messages in thread
From: Martin Bligh @ 2009-07-30 22:48 UTC (permalink / raw)
To: Jens Axboe
Cc: Chad Talbott, linux-kernel, linux-mm, wfg, Michael Rubin,
    sandeen, Andrew Morton, Peter Zijlstra

On Thu, Jul 30, 2009 at 3:43 PM, Jens Axboe<jens.axboe@oracle.com> wrote:
> On Thu, Jul 30 2009, Martin Bligh wrote:
>> > The test case above on a 4G machine is only generating 1G of dirty data.
>> > I ran the same test case on the 16G machine, resulting in only background
>> > writeout. The relevant bit here being that the background writeout
>> > finished quickly, writing at disk speed.
>> >
>> > I re-ran the same test, but using 300 100MB files instead. While the
>> > dd's are running, we are going at ~80MB/sec (this is disk speed, it's an
>> > x25-m). When the dd's are done, it continues doing 80MB/sec for 10
>> > seconds or so. Then the remainder (about 2G) is written in bursts at
>> > disk speeds, but with some time in between.
>>
>> OK, I think the test case is sensitive to how many files you have - if
>> we punt them to the back of the list, and yet we still have 299 other
>> ones, it may well be able to keep the disk spinning despite the bug
>> I outlined. Try using 30 1GB files?
>
> If this disk starts spinning, then we have bigger bugs :-)
>>
>> Though it doesn't seem to happen with just one dd streamer, and
>> I don't see why the bug doesn't trigger in that case either.
>>
>> I believe the bugfix is correct independent of any bdi changes?
>
> Yeah I think so too, I'll run some more tests on this tomorrow and
> verify it there as well.

There's another issue I was discussing with Peter Z. earlier that the
bdi changes might help with - if you look at where the dirty pages
get to, they are capped hard at the average of the dirty and
background thresholds, meaning we can only dirty about half the
pages we should be able to. That does very slowly go away when
the bdi limit catches up, but it seems to start at 0, and its progress
seems glacially slow (at least if you're impatient ;-))

This seems to affect some of our workloads badly when they have
a sharp spike in dirty data to one device; they get throttled heavily
when they wouldn't have been before the per-bdi dirty limits.
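The hard cap Martin describes comes from the per-BDI share of the global
dirty threshold: each backing device is only allowed the fraction of the
limit that matches its recent share of completed writeout, so a device that
suddenly turns busy starts near zero and has to earn its allowance. Below is
a simplified userspace sketch of that proportion; the function and variable
names are invented stand-ins for the 2.6.30-era get_dirty_limits() /
bdi_writeout_fraction() machinery, not the kernel code itself.

	#include <stdio.h>

	/*
	 * Sketch: a BDI's dirty threshold as a fraction of the global one.
	 * "completions" stand in for the kernel's floating-average writeout
	 * counters; the real code uses fixed-point proportions (struct
	 * prop_local), which this deliberately glosses over.
	 */
	static unsigned long bdi_dirty_limit(unsigned long dirty_thresh,
					     unsigned long bdi_completions,
					     unsigned long total_completions)
	{
		if (total_completions == 0)
			return 0;	/* a fresh BDI starts with no share */
		return (unsigned long)((unsigned long long)dirty_thresh *
				       bdi_completions / total_completions);
	}

	int main(void)
	{
		unsigned long dirty_thresh = 204800;	/* say, 800MB in pages */

		/* just after a burst starts, this BDI has done little writeout */
		printf("early: %lu pages\n",
		       bdi_dirty_limit(dirty_thresh, 10, 1000));
		/* once its share of completions catches up, it gets the limit */
		printf("late:  %lu pages\n",
		       bdi_dirty_limit(dirty_thresh, 990, 1000));
		return 0;
	}

With the real floating-average counters, catching up takes on the order of
writing out the full dirty capacity, which is the adjustment period Peter
quantifies in the next message.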
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-30 22:48 ` Martin Bligh
@ 2009-07-31  7:50   ` Peter Zijlstra
  0 siblings, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2009-07-31 7:50 UTC (permalink / raw)
To: Martin Bligh
Cc: Jens Axboe, Chad Talbott, linux-kernel, linux-mm, wfg,
    Michael Rubin, sandeen, Andrew Morton

On Thu, 2009-07-30 at 15:48 -0700, Martin Bligh wrote:
> There's another issue I was discussing with Peter Z. earlier that the
> bdi changes might help with - if you look at where the dirty pages
> get to, they are capped hard at the average of the dirty and
> background thresholds, meaning we can only dirty about half the
> pages we should be able to. That does very slowly go away when
> the bdi limit catches up, but it seems to start at 0, and its progress
> seems glacially slow (at least if you're impatient ;-))
>
> This seems to affect some of our workloads badly when they have
> a sharp spike in dirty data to one device; they get throttled heavily
> when they wouldn't have been before the per-bdi dirty limits.

Right, currently that adjustment period is about the same order as
writing out the full dirty page capacity. If your system has unbalanced
memory vs IO capacity this might indeed end up being glacial.

I've been considering making it a sublinear function wrt the memory
size, so that larger machines get less and therefore adjust faster.

Something like the below perhaps -- the alternative is yet another
sysctl :/

Not sure how the sqrt works out on a wide variety of machines though...
we'll have to test and see.

---
 mm/page-writeback.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 81627eb..64aa140 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -152,7 +152,7 @@ static int calc_period_shift(void)
 	else
 		dirty_total = (vm_dirty_ratio *
 			       determine_dirtyable_memory()) / 100;
-	return 2 + ilog2(dirty_total - 1);
+	return 2 + ilog2(int_sqrt(dirty_total) - 1);
 }
 
 /*
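To get a feel for what the int_sqrt() change does to the adaptation period,
here is a small userspace calculation; ilog2() and int_sqrt() are
reimplemented to mimic the kernel helpers, and the dirty_total page counts
are illustrative guesses (roughly 20% of a 4GB and of a 64GB machine in 4KB
pages), not numbers taken from this thread.

	#include <stdio.h>

	/* userspace mimic of the kernel's ilog2(): floor(log2(v)) */
	static int ilog2_ul(unsigned long v)
	{
		int log = -1;

		while (v) {
			v >>= 1;
			log++;
		}
		return log;
	}

	/* userspace mimic of the kernel's int_sqrt(): floor(sqrt(v)) */
	static unsigned long int_sqrt_ul(unsigned long v)
	{
		unsigned long r = 0;

		while ((r + 1) * (r + 1) <= v)
			r++;
		return r;
	}

	int main(void)
	{
		/* dirty_total in 4KB pages: ~20% of 4GB and ~20% of 64GB */
		unsigned long totals[] = { 209715, 3355443 };

		for (int i = 0; i < 2; i++) {
			unsigned long t = totals[i];

			printf("dirty_total=%7lu  old shift=%2d  new shift=%2d\n",
			       t, 2 + ilog2_ul(t - 1),
			       2 + ilog2_ul(int_sqrt_ul(t) - 1));
		}
		return 0;
	}

This prints old shifts of 19 and 23 against new shifts of 10 and 12.
Roughly speaking, the floating proportions age over about 2^shift writeout
completions, so shrinking the shift lets a freshly busy BDI earn its share
of the dirty limit orders of magnitude faster.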
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-30 22:48 ` Martin Bligh
  2009-07-31  7:50   ` Peter Zijlstra
@ 2009-08-01  4:03   ` Wu Fengguang
  2009-08-01  4:53     ` Wu Fengguang
  1 sibling, 1 reply; 32+ messages in thread
From: Wu Fengguang @ 2009-08-01 4:03 UTC (permalink / raw)
To: Martin Bligh
Cc: Jens Axboe, Chad Talbott, linux-kernel, linux-mm, Michael Rubin,
    sandeen, Andrew Morton, Peter Zijlstra

On Thu, Jul 30, 2009 at 03:48:02PM -0700, Martin Bligh wrote:
> On Thu, Jul 30, 2009 at 3:43 PM, Jens Axboe<jens.axboe@oracle.com> wrote:
> > On Thu, Jul 30 2009, Martin Bligh wrote:
> >> > The test case above on a 4G machine is only generating 1G of dirty data.
> >> > I ran the same test case on the 16G machine, resulting in only background
> >> > writeout. The relevant bit here being that the background writeout
> >> > finished quickly, writing at disk speed.
> >> >
> >> > I re-ran the same test, but using 300 100MB files instead. While the
> >> > dd's are running, we are going at ~80MB/sec (this is disk speed, it's an
> >> > x25-m). When the dd's are done, it continues doing 80MB/sec for 10
> >> > seconds or so. Then the remainder (about 2G) is written in bursts at
> >> > disk speeds, but with some time in between.
> >>
> >> OK, I think the test case is sensitive to how many files you have - if
> >> we punt them to the back of the list, and yet we still have 299 other
> >> ones, it may well be able to keep the disk spinning despite the bug
> >> I outlined. Try using 30 1GB files?
> >
> > If this disk starts spinning, then we have bigger bugs :-)
> >>
> >> Though it doesn't seem to happen with just one dd streamer, and
> >> I don't see why the bug doesn't trigger in that case either.
> >>
> >> I believe the bugfix is correct independent of any bdi changes?
> >
> > Yeah I think so too, I'll run some more tests on this tomorrow and
> > verify it there as well.
>
> There's another issue I was discussing with Peter Z. earlier that the
> bdi changes might help with - if you look at where the dirty pages
> get to, they are capped hard at the average of the dirty and
> background thresholds, meaning we can only dirty about half the
> pages we should be able to. That does very slowly go away when
> the bdi limit catches up, but it seems to start at 0, and its progress
> seems glacially slow (at least if you're impatient ;-))

You mean the dirty limit will start from
(dirty_ratio+background_ratio)/2 = 15% and grow toward
dirty_ratio = 20% at a very slow pace? I did observe such curves
long ago, but it does not always show up, as in the mini bench below.

> This seems to affect some of our workloads badly when they have
> a sharp spike in dirty data to one device; they get throttled heavily
> when they wouldn't have been before the per-bdi dirty limits.

Here is a single dd on my laptop with 4G memory, kernel 2.6.30.

root /home/wfg# echo 10 > /proc/sys/vm/dirty_ratio
root /home/wfg# echo 20 > /proc/sys/vm/dirty_background_ratio

wfg ~% dd if=/dev/zero of=/opt/vm/10G bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 12.7143 s, 82.5 MB/s

output of vmmon:

 nr_dirty  nr_writeback
        0             0
        0             0
    56795             0
    51655         17020
    52071         17511
    51648         16898
    51655         16485
    52369         17425
    51648         16930
    51470         16809
    52630         17267
    51287         16634
    51260         16641
    51310         16903
    51281         16379
    46073         11169
    46086             0
    46089             0
     3132          9657
       21         17677
        3         14107
       14             2
        0             0
        0             0

In this case nr_dirty stays almost constant.

Thanks,
Fengguang
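vmmon is not a standard tool; it appears to be a local monitoring script of
Wu's. A minimal stand-in that prints the same two counters once a second
could look like this (assuming, as the field names suggest, that the values
come from /proc/vmstat):

	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		char line[128];
		unsigned long v, dirty, writeback;

		printf("%10s %13s\n", "nr_dirty", "nr_writeback");
		for (;;) {
			FILE *f = fopen("/proc/vmstat", "r");

			if (!f)
				return 1;
			dirty = writeback = 0;
			/* pick out just the two counters we care about */
			while (fgets(line, sizeof(line), f)) {
				if (sscanf(line, "nr_dirty %lu", &v) == 1)
					dirty = v;
				else if (sscanf(line, "nr_writeback %lu", &v) == 1)
					writeback = v;
			}
			fclose(f);
			printf("%10lu %13lu\n", dirty, writeback);
			sleep(1);
		}
	}

Run alongside the dd, this reproduces the two-column traces shown in these
mails.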
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-08-01  4:03 ` Wu Fengguang
@ 2009-08-01  4:53   ` Wu Fengguang
  2009-08-01  5:03     ` Wu Fengguang
  0 siblings, 1 reply; 32+ messages in thread
From: Wu Fengguang @ 2009-08-01 4:53 UTC (permalink / raw)
To: Martin Bligh
Cc: Jens Axboe, Chad Talbott, linux-kernel, linux-mm, Michael Rubin,
    sandeen, Andrew Morton, Peter Zijlstra

On Sat, Aug 01, 2009 at 12:03:13PM +0800, Wu Fengguang wrote:
> On Thu, Jul 30, 2009 at 03:48:02PM -0700, Martin Bligh wrote:
> > On Thu, Jul 30, 2009 at 3:43 PM, Jens Axboe<jens.axboe@oracle.com> wrote:
> > > On Thu, Jul 30 2009, Martin Bligh wrote:
> > >> > The test case above on a 4G machine is only generating 1G of dirty data.
> > >> > I ran the same test case on the 16G machine, resulting in only background
> > >> > writeout. The relevant bit here being that the background writeout
> > >> > finished quickly, writing at disk speed.
> > >> >
> > >> > I re-ran the same test, but using 300 100MB files instead. While the
> > >> > dd's are running, we are going at ~80MB/sec (this is disk speed, it's an
> > >> > x25-m). When the dd's are done, it continues doing 80MB/sec for 10
> > >> > seconds or so. Then the remainder (about 2G) is written in bursts at
> > >> > disk speeds, but with some time in between.
> > >>
> > >> OK, I think the test case is sensitive to how many files you have - if
> > >> we punt them to the back of the list, and yet we still have 299 other
> > >> ones, it may well be able to keep the disk spinning despite the bug
> > >> I outlined. Try using 30 1GB files?
> > >
> > > If this disk starts spinning, then we have bigger bugs :-)
> > >>
> > >> Though it doesn't seem to happen with just one dd streamer, and
> > >> I don't see why the bug doesn't trigger in that case either.
> > >>
> > >> I believe the bugfix is correct independent of any bdi changes?
> > >
> > > Yeah I think so too, I'll run some more tests on this tomorrow and
> > > verify it there as well.
> >
> > There's another issue I was discussing with Peter Z. earlier that the
> > bdi changes might help with - if you look at where the dirty pages
> > get to, they are capped hard at the average of the dirty and
> > background thresholds, meaning we can only dirty about half the
> > pages we should be able to. That does very slowly go away when
> > the bdi limit catches up, but it seems to start at 0, and its progress
> > seems glacially slow (at least if you're impatient ;-))
>
> You mean the dirty limit will start from
> (dirty_ratio+background_ratio)/2 = 15% and grow toward
> dirty_ratio = 20% at a very slow pace? I did observe such curves
> long ago, but it does not always show up, as in the mini bench below.
>
> > This seems to affect some of our workloads badly when they have
> > a sharp spike in dirty data to one device; they get throttled heavily
> > when they wouldn't have been before the per-bdi dirty limits.
>
> Here is a single dd on my laptop with 4G memory, kernel 2.6.30.
>
> root /home/wfg# echo 10 > /proc/sys/vm/dirty_ratio
> root /home/wfg# echo 20 > /proc/sys/vm/dirty_background_ratio
>
> wfg ~% dd if=/dev/zero of=/opt/vm/10G bs=1M count=1000
> 1000+0 records in
> 1000+0 records out
> 1048576000 bytes (1.0 GB) copied, 12.7143 s, 82.5 MB/s
>
> output of vmmon:
>
>  nr_dirty  nr_writeback
>         0             0
>         0             0
>     56795             0
>     51655         17020
>     52071         17511
>     51648         16898
>     51655         16485
>     52369         17425
>     51648         16930
>     51470         16809
>     52630         17267
>     51287         16634
>     51260         16641
>     51310         16903
>     51281         16379
>     46073         11169
>     46086             0
>     46089             0
>      3132          9657
>        21         17677
>         3         14107
>        14             2
>         0             0
>         0             0
>
> In this case nr_dirty stays almost constant.

I can see the growth when I increased the dd size to 2GB;
the dd throughput decreased from 82.5MB/s to 60.9MB/s.

wfg ~% dd if=/dev/zero of=/opt/vm/10G bs=1M count=2000
2000+0 records in
2000+0 records out
2097152000 bytes (2.1 GB) copied, 34.4114 s, 60.9 MB/s

 nr_dirty  nr_writeback
        0             0
    44980             0
    49929         20353
    49929         20353
    49189         17822
    54556         14852
    49191         17717
    52455         15501
    49903         19330
    50077         17293
    50040         19111
    52097          7040
    52656         16797
    53361         19455
    53551         16999
    57599         16396
    55165          6801
    57626         16534
    56193         18795
    57888         16655
    57740         18818
    65759         11304
    60015         19842
    61136         16618
    62166         17429
    62160         16782
    62036         11907
    59237         13715
    61991         18561
    66256         15111
    60574         17551
    17926         17930
    17919         17057
    17919         16379
       11         13717
    11470          4606
        2           913
        2             0
       10             0
       10             0
        0             0
        0             0

But when I redid the above test after dropping all the ~3GB of caches,
the dirty limit again seems to remain constant.

# echo 1 > /proc/sys/vm/drop_caches

wfg ~% dd if=/dev/zero of=/opt/vm/10G bs=1M count=2000
2000+0 records in
2000+0 records out
2097152000 bytes (2.1 GB) copied, 33.3299 s, 62.9 MB/s

 nr_dirty  nr_writeback
        0             0
    76425         10825
    66255         17302
    69942         15865
    65332         17305
    71207         14605
    69957         15380
    65901         18960
    66365         16233
    66040         17041
    66042         16378
    66434          2169
    67606         17143
    68660         17195
    67613         16514
    67366         17415
    65784          4620
    69053         16831
    66037         17033
    64601         19936
    64629         16922
    70459          9227
    66673         17789
    65638         20102
    65166         17662
    66255         16286
    69821         11264
    82247          4113
    64012         18060
    29585         17920
     5872         16653
     5872         14197
    25422          1913
     5884         16658
        0         12027
        2            26
        2             0
        2             0

Thanks,
Fengguang
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-08-01  4:53 ` Wu Fengguang
@ 2009-08-01  5:03   ` Wu Fengguang
  0 siblings, 0 replies; 32+ messages in thread
From: Wu Fengguang @ 2009-08-01 5:03 UTC (permalink / raw)
To: Martin Bligh
Cc: Jens Axboe, Chad Talbott, linux-kernel, linux-mm, Michael Rubin,
    sandeen, Andrew Morton, Peter Zijlstra

On Sat, Aug 01, 2009 at 12:53:46PM +0800, Wu Fengguang wrote:
> On Sat, Aug 01, 2009 at 12:03:13PM +0800, Wu Fengguang wrote:
> I can see the growth when I increased the dd size to 2GB;
> the dd throughput decreased from 82.5MB/s to 60.9MB/s.

The raw disk write throughput seems to be 60MB/s:

wfg ~% dd if=/dev/zero of=/opt/vm/200M bs=1M count=200 oflag=direct
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 3.48137 s, 60.2 MB/s

read throughput is a bit better:

wfg ~% dd of=/dev/null if=/opt/vm/200M bs=1M count=200 iflag=direct
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 2.66606 s, 78.7 MB/s

# hdparm -tT /dev/hda

/dev/hda:
 Timing cached reads:   10370 MB in  1.99 seconds = 5213.70 MB/sec
 Timing buffered disk reads:  216 MB in  3.03 seconds = 71.22 MB/sec

And sync writes are pretty slow:

wfg ~% dd if=/dev/zero of=/opt/vm/200M bs=1M count=200 oflag=sync
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 10.4741 s, 20.0 MB/s

Thanks,
Fengguang
* Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
  2009-07-30 22:34 ` Martin Bligh
  2009-07-30 22:43   ` Jens Axboe
@ 2009-08-01  4:02   ` Wu Fengguang
  1 sibling, 0 replies; 32+ messages in thread
From: Wu Fengguang @ 2009-08-01 4:02 UTC (permalink / raw)
To: Martin Bligh
Cc: Jens Axboe, Chad Talbott, linux-kernel, linux-mm, Michael Rubin,
    Andrew Morton, sandeen

On Thu, Jul 30, 2009 at 03:34:12PM -0700, Martin Bligh wrote:
> > The test case above on a 4G machine is only generating 1G of dirty data.
> > I ran the same test case on the 16G machine, resulting in only background
> > writeout. The relevant bit here being that the background writeout
> > finished quickly, writing at disk speed.
> >
> > I re-ran the same test, but using 300 100MB files instead. While the
> > dd's are running, we are going at ~80MB/sec (this is disk speed, it's an
> > x25-m). When the dd's are done, it continues doing 80MB/sec for 10
> > seconds or so. Then the remainder (about 2G) is written in bursts at
> > disk speeds, but with some time in between.
>
> OK, I think the test case is sensitive to how many files you have - if
> we punt them to the back of the list, and yet we still have 299 other
> ones, it may well be able to keep the disk spinning despite the bug
> I outlined. Try using 30 1GB files?
>
> Though it doesn't seem to happen with just one dd streamer, and
> I don't see why the bug doesn't trigger in that case either.

I guess the bug is not related to the number of dd streamers, but to
whether there is a stream of newly dirtied inodes (atime dirtiness
would be enough): wb_kupdate() itself won't give up on congestion,
but redirty_tail() will refresh the inode dirty time if there are
newly dirtied inodes in front of it.

And we cannot claim this to be a bug of the list-based redirty_tail(),
since we call it with the belief that the inode is somehow blocked.
In this manner redirty_tail() can refresh the inode's dirty time (and
therefore delay its writeback for up to 30s) at will.

> I believe the bugfix is correct independent of any bdi changes?

Agreed.

Thanks,
Fengguang
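For reference on the mechanism Fengguang describes: redirty_tail() in
2.6.30-era fs/fs-writeback.c reads roughly as below (a from-memory sketch,
so treat it as illustrative rather than a verbatim quote of the source). If
the inode at the tail of s_dirty carries a newer dirtied_when, the requeued
inode's timestamp is refreshed to the current jiffies, restarting its ~30s
expiry clock:

	/* sketch of 2.6.30-era fs/fs-writeback.c:redirty_tail() */
	static void redirty_tail(struct inode *inode)
	{
		struct super_block *sb = inode->i_sb;

		if (!list_empty(&sb->s_dirty)) {
			struct inode *tail_inode;

			tail_inode = list_entry(sb->s_dirty.next,
						struct inode, i_list);
			/*
			 * Keep dirtied_when monotonic along s_dirty: if a
			 * newer inode sits ahead, pretend this one was
			 * dirtied just now.
			 */
			if (time_before(inode->dirtied_when,
					tail_inode->dirtied_when))
				inode->dirtied_when = jiffies;
		}
		list_move(&inode->i_list, &sb->s_dirty);
	}

This is why a steady trickle of newly dirtied (even just atime-dirtied)
inodes can keep deferring writeback of a blocked inode indefinitely.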
end of thread, other threads:[~2009-08-01  5:03 UTC | newest]

Thread overview: 32+ messages:
2009-07-28 19:11 Bug in kernel 2.6.31, Slow wb_kupdate writeout Chad Talbott
2009-07-28 21:49 ` Martin Bligh
2009-07-29  7:15   ` Martin Bligh
2009-07-29 11:43     ` Wu Fengguang
2009-07-29 14:11       ` Martin Bligh
2009-07-30  1:06         ` Wu Fengguang
2009-07-30  1:12           ` Martin Bligh
2009-07-30  1:57             ` Wu Fengguang
2009-07-30  2:59               ` Martin Bligh
2009-07-30  4:08                 ` Wu Fengguang
2009-07-30 19:55                   ` Martin Bligh
2009-08-01  2:02                     ` Wu Fengguang
2009-07-30  0:19       ` Martin Bligh
2009-07-30  1:28         ` Martin Bligh
2009-07-30  2:09           ` Wu Fengguang
2009-07-30  2:57             ` Martin Bligh
2009-07-30  3:19               ` Wu Fengguang
2009-07-30 20:33                 ` Martin Bligh
2009-08-01  2:58                   ` Wu Fengguang
2009-08-01  4:10                     ` Wu Fengguang
2009-07-30  1:49           ` Wu Fengguang
2009-07-30 21:39 ` Jens Axboe
2009-07-30 22:01   ` Martin Bligh
2009-07-30 22:17     ` Jens Axboe
2009-07-30 22:34       ` Martin Bligh
2009-07-30 22:43         ` Jens Axboe
2009-07-30 22:48           ` Martin Bligh
2009-07-31  7:50             ` Peter Zijlstra
2009-08-01  4:03             ` Wu Fengguang
2009-08-01  4:53               ` Wu Fengguang
2009-08-01  5:03                 ` Wu Fengguang
2009-08-01  4:02         ` Wu Fengguang