Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Wu Fengguang <fengguang.wu@intel.com>
To: Martin Bligh <mbligh@google.com>
Cc: Chad Talbott <ctalbott@google.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Michael Rubin <mrubin@google.com>,
	Andrew Morton <akpm@google.com>,
	sandeen@redhat.com, Michael Davidson <md@google.com>
Subject: Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
Date: Wed, 29 Jul 2009 19:43:22 +0800	[thread overview]
Message-ID: <20090729114322.GA9335@localhost> (raw)
In-Reply-To: <33307c790907290015m1e6b5666x9c0014cdaf5ed08@mail.gmail.com>

On Wed, Jul 29, 2009 at 12:15:48AM -0700, Martin Bligh wrote:
> On Tue, Jul 28, 2009 at 2:49 PM, Martin Bligh<mbligh@google.com> wrote:
> >> An interesting recent-ish change is "writeback: speed up writeback of
> >> big dirty files."  When I revert the change to __sync_single_inode the
> >> problem appears to go away and background writeout proceeds at disk
> >> speed.  Interestingly, that code is in the git commit [2], but not in
> >> the post to LKML. [3]  This is may not be the fix, but it makes this
> >> test behave better.
> >
> > I'm fairly sure this is not fixing the root cause - but putting it at the head
> > rather than the tail of the queue causes the error not to starve wb_kupdate
> > for nearly so long - as long as we keep the queue full, the bug is hidden.
> 
> OK, it seems this is the root cause - I wasn't clear why all the pages weren't
> being written back, and thought there was another bug. What happens is
> we go into write_cache_pages, and stuff the disk queue with as much as
> we can put into it, and then inevitably hit the congestion limit.
> 
> Then we back out to __sync_single_inode, who says "huh, you didn't manage
> to write your whole slice", and penalizes the poor blameless inode in question
> by putting it back into the penalty box for 30s.
> 
> This results in very lumpy I/O writeback at 5s intervals, and very
> poor throughput.

You are right, so let's fix the congestion case. Your analysis would
be perfect changelog :)

> Patch below is inline and probably text munged, but is for RFC only.
> I'll test it
> more thoroughly tomorrow. As for the comment about starving other writes,
> I believe requeue_io moves it from s_io to s_more_io which should at least
> allow some progress of other files.
> 
> --- linux-2.6.30/fs/fs-writeback.c.old  2009-07-29 00:08:29.000000000 -0700
> +++ linux-2.6.30/fs/fs-writeback.c      2009-07-29 00:11:28.000000000 -0700
> @@ -322,46 +322,11 @@ __sync_single_inode(struct inode *inode,
>                         /*
>                          * We didn't write back all the pages.  nfs_writepages()
>                          * sometimes bales out without doing anything. Redirty
[snip]
> -                               if (wbc->nr_to_write <= 0) {
> -                                       /*
> -                                        * slice used up: queue for next turn
> -                                        */
> -                                       requeue_io(inode);
> -                               } else {
> -                                       /*
> -                                        * somehow blocked: retry later
> -                                        */
> -                                       redirty_tail(inode);

Removing this line can be dangerous - we'll probably go into buzy
waiting (I have tried that long long ago).

Chad, can you try this small patch? Thank you.

Thanks,
Fengguang
---
 fs/fs-writeback.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- mm.orig/fs/fs-writeback.c
+++ mm/fs/fs-writeback.c
@@ -325,7 +325,8 @@ __sync_single_inode(struct inode *inode,
 				 * soon as the queue becomes uncongested.
 				 */
 				inode->i_state |= I_DIRTY_PAGES;
-				if (wbc->nr_to_write <= 0) {
+				if (wbc->nr_to_write <= 0 ||
+				    wbc->encountered_congestion) {
 					/*
 					 * slice used up: queue for next turn
 					 */

WARNING: multiple messages have this Message-ID (diff)

From: Wu Fengguang <fengguang.wu@intel.com>
To: Martin Bligh <mbligh@google.com>
Cc: Chad Talbott <ctalbott@google.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Michael Rubin <mrubin@google.com>,
	Andrew Morton <akpm@google.com>,
	sandeen@redhat.com, Michael Davidson <md@google.com>
Subject: Re: Bug in kernel 2.6.31, Slow wb_kupdate writeout
Date: Wed, 29 Jul 2009 19:43:22 +0800	[thread overview]
Message-ID: <20090729114322.GA9335@localhost> (raw)
In-Reply-To: <33307c790907290015m1e6b5666x9c0014cdaf5ed08@mail.gmail.com>

On Wed, Jul 29, 2009 at 12:15:48AM -0700, Martin Bligh wrote:
> On Tue, Jul 28, 2009 at 2:49 PM, Martin Bligh<mbligh@google.com> wrote:
> >> An interesting recent-ish change is "writeback: speed up writeback of
> >> big dirty files." A When I revert the change to __sync_single_inode the
> >> problem appears to go away and background writeout proceeds at disk
> >> speed. A Interestingly, that code is in the git commit [2], but not in
> >> the post to LKML. [3] A This is may not be the fix, but it makes this
> >> test behave better.
> >
> > I'm fairly sure this is not fixing the root cause - but putting it at the head
> > rather than the tail of the queue causes the error not to starve wb_kupdate
> > for nearly so long - as long as we keep the queue full, the bug is hidden.
> 
> OK, it seems this is the root cause - I wasn't clear why all the pages weren't
> being written back, and thought there was another bug. What happens is
> we go into write_cache_pages, and stuff the disk queue with as much as
> we can put into it, and then inevitably hit the congestion limit.
> 
> Then we back out to __sync_single_inode, who says "huh, you didn't manage
> to write your whole slice", and penalizes the poor blameless inode in question
> by putting it back into the penalty box for 30s.
> 
> This results in very lumpy I/O writeback at 5s intervals, and very
> poor throughput.

You are right, so let's fix the congestion case. Your analysis would
be perfect changelog :)

> Patch below is inline and probably text munged, but is for RFC only.
> I'll test it
> more thoroughly tomorrow. As for the comment about starving other writes,
> I believe requeue_io moves it from s_io to s_more_io which should at least
> allow some progress of other files.
> 
> --- linux-2.6.30/fs/fs-writeback.c.old  2009-07-29 00:08:29.000000000 -0700
> +++ linux-2.6.30/fs/fs-writeback.c      2009-07-29 00:11:28.000000000 -0700
> @@ -322,46 +322,11 @@ __sync_single_inode(struct inode *inode,
>                         /*
>                          * We didn't write back all the pages.  nfs_writepages()
>                          * sometimes bales out without doing anything. Redirty
[snip]
> -                               if (wbc->nr_to_write <= 0) {
> -                                       /*
> -                                        * slice used up: queue for next turn
> -                                        */
> -                                       requeue_io(inode);
> -                               } else {
> -                                       /*
> -                                        * somehow blocked: retry later
> -                                        */
> -                                       redirty_tail(inode);

Removing this line can be dangerous - we'll probably go into buzy
waiting (I have tried that long long ago).

Chad, can you try this small patch? Thank you.

Thanks,
Fengguang
---
 fs/fs-writeback.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- mm.orig/fs/fs-writeback.c
+++ mm/fs/fs-writeback.c
@@ -325,7 +325,8 @@ __sync_single_inode(struct inode *inode,
 				 * soon as the queue becomes uncongested.
 				 */
 				inode->i_state |= I_DIRTY_PAGES;
-				if (wbc->nr_to_write <= 0) {
+				if (wbc->nr_to_write <= 0 ||
+				    wbc->encountered_congestion) {
 					/*
 					 * slice used up: queue for next turn
 					 */

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2009-07-29 11:43 UTC|newest]

Thread overview: 64+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-07-28 19:11 Bug in kernel 2.6.31, Slow wb_kupdate writeout Chad Talbott
2009-07-28 19:11 ` Chad Talbott
2009-07-28 21:49 ` Martin Bligh
2009-07-28 21:49   ` Martin Bligh
2009-07-29  7:15   ` Martin Bligh
2009-07-29  7:15     ` Martin Bligh
2009-07-29 11:43     ` Wu Fengguang [this message]
2009-07-29 11:43       ` Wu Fengguang
2009-07-29 14:11       ` Martin Bligh
2009-07-29 14:11         ` Martin Bligh
2009-07-30  1:06         ` Wu Fengguang
2009-07-30  1:06           ` Wu Fengguang
2009-07-30  1:12           ` Martin Bligh
2009-07-30  1:12             ` Martin Bligh
2009-07-30  1:57             ` Wu Fengguang
2009-07-30  1:57               ` Wu Fengguang
2009-07-30  2:59               ` Martin Bligh
2009-07-30  2:59                 ` Martin Bligh
2009-07-30  4:08                 ` Wu Fengguang
2009-07-30  4:08                   ` Wu Fengguang
2009-07-30 19:55                   ` Martin Bligh
2009-07-30 19:55                     ` Martin Bligh
2009-08-01  2:02                     ` Wu Fengguang
2009-08-01  2:02                       ` Wu Fengguang
2009-07-30  0:19       ` Martin Bligh
2009-07-30  0:19         ` Martin Bligh
2009-07-30  1:28         ` Martin Bligh
2009-07-30  1:28           ` Martin Bligh
2009-07-30  2:09           ` Wu Fengguang
2009-07-30  2:09             ` Wu Fengguang
2009-07-30  2:57             ` Martin Bligh
2009-07-30  2:57               ` Martin Bligh
2009-07-30  3:19               ` Wu Fengguang
2009-07-30  3:19                 ` Wu Fengguang
2009-07-30 20:33                 ` Martin Bligh
2009-07-30 20:33                   ` Martin Bligh
2009-08-01  2:58                   ` Wu Fengguang
2009-08-01  2:58                     ` Wu Fengguang
2009-08-01  4:10                   ` Wu Fengguang
2009-08-01  4:10                     ` Wu Fengguang
2009-07-30  1:49         ` Wu Fengguang
2009-07-30  1:49           ` Wu Fengguang
2009-07-30 21:39 ` Jens Axboe
2009-07-30 21:39   ` Jens Axboe
2009-07-30 22:01   ` Martin Bligh
2009-07-30 22:01     ` Martin Bligh
2009-07-30 22:17     ` Jens Axboe
2009-07-30 22:17       ` Jens Axboe
2009-07-30 22:34       ` Martin Bligh
2009-07-30 22:34         ` Martin Bligh
2009-07-30 22:43         ` Jens Axboe
2009-07-30 22:43           ` Jens Axboe
2009-07-30 22:48           ` Martin Bligh
2009-07-30 22:48             ` Martin Bligh
2009-07-31  7:50             ` Peter Zijlstra
2009-07-31  7:50               ` Peter Zijlstra
2009-08-01  4:03             ` Wu Fengguang
2009-08-01  4:03               ` Wu Fengguang
2009-08-01  4:53               ` Wu Fengguang
2009-08-01  4:53                 ` Wu Fengguang
2009-08-01  5:03                 ` Wu Fengguang
2009-08-01  5:03                   ` Wu Fengguang
2009-08-01  4:02         ` Wu Fengguang
2009-08-01  4:02           ` Wu Fengguang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20090729114322.GA9335@localhost \
    --to=fengguang.wu@intel.com \
    --cc=akpm@google.com \
    --cc=ctalbott@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mbligh@google.com \
    --cc=md@google.com \
    --cc=mrubin@google.com \
    --cc=sandeen@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.