From: Wu Fengguang <fengguang.wu@intel.com>
To: Peter Zijlstra <a.p.zijlstra@chello.nl>,
Peter Staubach <staubach@redhat.com>,
Myklebust Trond <Trond.Myklebust@netapp.com>
Cc: Jan Kara <jack@suse.cz>,
Andrew Morton <akpm@linux-foundation.org>,
Theodore Tso <tytso@mit.edu>,
Christoph Hellwig <hch@infradead.org>,
Dave Chinner <david@fromorbit.com>,
Chris Mason <chris.mason@oracle.com>,
"Li, Shaohua" <shaohua.li@intel.com>,
"jens.axboe@oracle.com" <jens.axboe@oracle.com>,
Nick Piggin <npiggin@suse.de>,
"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
Richard Kennedy <richard@rsk.demon.co.uk>,
LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 01/45] writeback: reduce calls to global_page_state in balance_dirty_pages()
Date: Wed, 14 Oct 2009 09:38:32 +0800
Message-ID: <20091014013832.GA11882@localhost>
In-Reply-To: <1255458499.8967.711.camel@laptop>
On Wed, Oct 14, 2009 at 02:28:19AM +0800, Peter Zijlstra wrote:
> On Tue, 2009-10-13 at 20:12 +0200, Jan Kara wrote:
> > >         for (;;) {
> > >                 nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> > >                                         global_page_state(NR_UNSTABLE_NFS);
> > >                 nr_writeback = global_page_state(NR_WRITEBACK) +
> > >                                global_page_state(NR_WRITEBACK_TEMP);
> > >
> > >                 global_dirty_thresh(&background_thresh, &dirty_thresh);
> > >
> > >                 /*
> > >                  * Throttle it only when the background writeback cannot
> > >                  * catch-up. This avoids (excessively) small writeouts
> > >                  * when the bdi limits are ramping up.
> > >                  */
> > >                 if (nr_reclaimable + nr_writeback <
> > >                                 (background_thresh + dirty_thresh) / 2)
> > >                         break;
> > >
> > >                 bdi_thresh = bdi_dirty_thresh(bdi, dirty_thresh);
> > >
> > >                 /*
> > >                  * In order to avoid the stacked BDI deadlock we need
> > >                  * to ensure we accurately count the 'dirty' pages when
> > >                  * the threshold is low.
> > >                  *
> > >                  * Otherwise it would be possible to get thresh+n pages
> > >                  * reported dirty, even though there are thresh-m pages
> > >                  * actually dirty; with m+n sitting in the percpu
> > >                  * deltas.
> > >                  */
> > >                 if (bdi_thresh < 2*bdi_stat_error(bdi)) {
> > >                         bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
> > >                         bdi_nr_writeback = bdi_stat_sum(bdi, BDI_WRITEBACK);
> > >                 } else {
> > >                         bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
> > >                         bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
> > >                 }
> > >
> > >                 /*
> > >                  * The bdi thresh is somehow "soft" limit derived from the
> > >                  * global "hard" limit. The former helps to prevent heavy IO
> > >                  * bdi or process from holding back light ones; The latter is
> > >                  * the last resort safeguard.
> > >                  */
> > >                 dirty_exceeded =
> > >                         (bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh)
> > >                         || (nr_reclaimable + nr_writeback >= dirty_thresh);
> > >
> > >                 if (!dirty_exceeded)
> > >                         break;
> > >
> > >                 bdi->dirty_exceed_time = jiffies;
> > >
> > >                 bdi_writeback_wait(bdi, write_chunk);
> > Hmm, probably you've discussed this in some other email but why do we
> > cycle in this loop until we get below dirty limit? We used to leave the
> > loop after writing write_chunk... So the time we spend in
> > balance_dirty_pages() is no longer limited, right?
Right, this is a legitimate concern.
> Wu was saying that without the loop nr_writeback wasn't limited, but
> since bdi_writeback_wakeup() is driven from writeout completion, I'm not
> sure how again that was so.
Let me summarize the ideas :)
There are two cases:
- there is no bdi or block IO queue to limit nr_writeback
  This must be fixed. It either lets nr_writeback grow up to dirty_thresh
  (with the loop), squeezing nr_dirty, or grow out of control entirely
  (without the loop). The current state is: the nr_writeback wait
  queue for NFS is there; the one for btrfs is still missing.
- there is an nr_writeback limit, but it is larger than dirty_thresh
  In this case nr_dirty will be close to 0 regardless of the loop.
  The loop helps to keep
          nr_dirty + nr_writeback + nr_unstable < dirty_thresh
  Without the loop, the "real" dirty threshold would be larger
  (determined by the nr_writeback limit). A rough sketch of the two
  behaviors follows below.
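To make the trade-off concrete, here is a rough, untested sketch
(simplified names; writeback_some_pages() is just a placeholder I made
up for the writeback call, not a real kernel function) contrasting the
old "leave after write_chunk" exit with the looping behaviour:

        /* Old behaviour: time spent in balance_dirty_pages() is bounded. */
        for (;;) {
                if (nr_reclaimable + nr_writeback < dirty_thresh)
                        break;
                pages_written += writeback_some_pages(write_chunk);
                if (pages_written >= write_chunk)
                        break;  /* chunk done, but nr_writeback may keep growing */
                schedule_timeout_interruptible(pause);
        }

        /* With the loop: the only exit is dropping below the threshold. */
        for (;;) {
                if (nr_reclaimable + nr_writeback < dirty_thresh)
                        break;
                bdi_writeback_wait(bdi, write_chunk);   /* wait for writeout completion */
        }

The first form bounds the time we are throttled but not the writeback
totals; the second bounds the totals but, as Jan points out, not the
time.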
> We can move all of bdi_dirty to bdi_writeout, if the bdi writeout queue
> permits, but it cannot grow beyond the total limit, since we're actually
> waiting for writeout completion.
Yes, this explains the second case. It's a trade-off: the nr_writeback
limit cannot be trusted on small-memory systems, so we do the loop to
enforce dirty_thresh, which unfortunately can hurt responsiveness on
all systems through prolonged wait times.
We could possibly test (nr_dirty < nr_writeback). If that holds, the
nr_writeback limit is probably too large for the loop to be worth it.
That still doesn't address the nr_dirty=0 problem for small-memory
systems, but it should be acceptable since their nr_dirty will be
small anyway.
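Very roughly, and untested, something like this inside the loop (where
nr_dirty would stand for global_page_state(NR_FILE_DIRTY)):

                /*
                 * If writeback pages already dominate the dirty pages, the
                 * queue's own nr_writeback limit is effectively doing the
                 * throttling, so fall back to the old behaviour and leave
                 * once our chunk is written, instead of looping until we
                 * drop below dirty_thresh.
                 */
                if (nr_dirty < nr_writeback && pages_written >= write_chunk)
                        break;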
> Possibly unstable is peculiar.
nr_unstable can also go wild. I saw (in current linux-next, with the
debug patch below) balance_dirty_pages() sleeping for >30s waiting for
the NFS nr_unstable count to drop, i.e. waiting for the dirty inode to
be _expired_ and written to disk on the server.
It's a general uncoordinated double-caching problem for NFS (and maybe
other filesystems).
Thanks,
Fengguang
---
[ 45.614799] balance_dirty_pages sleeped 228ms
[ 45.954821] balance_dirty_pages sleeped 324ms
[ 46.294874] balance_dirty_pages sleeped 324ms
[ 46.638810] balance_dirty_pages sleeped 328ms
[ 46.670769] balance_dirty_pages sleeped 28ms
[ 46.802779] balance_dirty_pages sleeped 128ms
[ 46.934788] balance_dirty_pages sleeped 124ms
[ 47.066778] balance_dirty_pages sleeped 124ms
[ 47.198774] balance_dirty_pages sleeped 128ms
[ 47.330763] balance_dirty_pages sleeped 124ms
[ 47.462768] balance_dirty_pages sleeped 128ms
[ 47.594768] balance_dirty_pages sleeped 124ms
[ 47.662763] balance_dirty_pages sleeped 60ms
[ 47.798781] balance_dirty_pages sleeped 132ms
[ 47.871435] balance_dirty_pages sleeped 64ms
[ 48.002749] balance_dirty_pages sleeped 124ms
[ 48.138787] balance_dirty_pages sleeped 132ms
[ 48.270824] balance_dirty_pages sleeped 124ms
[ 48.410762] balance_dirty_pages sleeped 128ms
[ 48.542758] balance_dirty_pages sleeped 128ms
[ 48.678786] balance_dirty_pages sleeped 132ms
[ 48.810781] balance_dirty_pages sleeped 124ms
[ 48.946755] balance_dirty_pages sleeped 124ms
[ 49.182753] balance_dirty_pages sleeped 228ms
[ 49.318773] balance_dirty_pages sleeped 128ms
[ 49.666784] balance_dirty_pages sleeped 324ms
[ 49.914774] balance_dirty_pages sleeped 228ms
[ 79.998354] balance_dirty_pages sleeped 30068ms
[ 80.062346] balance_dirty_pages sleeped 60ms
[ 80.290414] balance_dirty_pages sleeped 224ms
[ 80.542413] balance_dirty_pages sleeped 228ms
[ 80.782384] balance_dirty_pages sleeped 228ms
[ 81.142379] balance_dirty_pages sleeped 336ms
[ 116.005926] balance_dirty_pages sleeped 34852ms
[ 141.049584] balance_dirty_pages sleeped 25040ms
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/page-writeback.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
--- linux.orig/mm/page-writeback.c 2009-10-09 10:22:58.000000000 +0800
+++ linux/mm/page-writeback.c 2009-10-09 10:31:53.000000000 +0800
@@ -490,6 +490,7 @@ static void balance_dirty_pages(struct a
         unsigned long bdi_thresh;
         unsigned long pages_written = 0;
         unsigned long pause = 1;
+        unsigned long start = jiffies;
         struct backing_dev_info *bdi = mapping->backing_dev_info;
@@ -566,7 +567,8 @@ static void balance_dirty_pages(struct a
                 if (pages_written >= write_chunk)
                         break;          /* We've done our duty */
-                schedule_timeout_interruptible(pause);
+                __set_current_state(TASK_INTERRUPTIBLE);
+                io_schedule_timeout(pause);
                 /*
                  * Increase the delay for each loop, up to our previous
@@ -577,6 +579,9 @@ static void balance_dirty_pages(struct a
                         pause = HZ / 10;
         }
+        if (pause > 1)
+                printk("balance_dirty_pages sleeped %lums\n", (jiffies - start) * 1000/HZ);
+
         if (bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh &&
                         bdi->dirty_exceeded)
                 bdi->dirty_exceeded = 0;