From: Wu Fengguang <fengguang.wu@intel.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Theodore Tso <tytso@mit.edu>,
Christoph Hellwig <hch@infradead.org>,
Dave Chinner <david@fromorbit.com>,
Chris Mason <chris.mason@oracle.com>,
Peter Zijlstra <a.p.zijlstra@chello.nl>,
"Li Shaohua" <shaohua.li@intel.com>,
"Myklebust Trond" <Trond.Myklebust@netapp.com>,
"jens.axboe@oracle.com" <jens.axboe@oracle.com>,
Jan Kara <jack@suse.cz>, Nick Piggin <npiggin@suse.de>,
<linux-fsdevel@vger.kernel.org>,
Wu Fengguang <fengguang.wu@intel.com>
Cc: LKML <linux-kernel@vger.kernel.org>
Subject: [PATCH 18/45] writeback: introduce wait queue for balance_dirty_pages()
Date: Wed, 07 Oct 2009 15:38:36 +0800
Message-ID: <20091007074903.422089703@intel.com>
In-Reply-To: 20091007073818.318088777@intel.com
[-- Attachment #1: writeback-balance-wait-queue.patch --]
[-- Type: text/plain, Size: 9191 bytes --]
As proposed by Chris, Dave and Jan, let balance_dirty_pages() wait for
the per-bdi flusher to write back enough pages for it, instead of
starting foreground writeback by itself. This brings two benefits:

- avoid concurrent writeback of multiple inodes (Dave Chinner)

  If every throttled writer thread starts foreground writeback, there
  are N IO submitters working on at least N different inodes at the
  same time. They end up issuing N different sets of IO with
  potentially zero locality to each other, which greatly lowers
  elevator sort/merge efficiency, so the disk seeks all over the place
  to service the different sets of IO.

  OTOH, if there is only one submission thread, it doesn't jump between
  inodes in the same way when congestion clears - it keeps writing to
  the same inode, so large related chunks of sequential IO are issued
  to the disk. This is more efficient than the foreground writeback
  above because the elevator works better and the disk seeks less.

- remove one constraint on moving towards a huge per-file nr_to_write

  The write_chunk used by balance_dirty_pages() should be small enough
  to prevent user-noticeable one-shot latency, i.e. each sleep/wait
  inside balance_dirty_pages() shall be short. When it starts its own
  writeback, it must therefore specify a small nr_to_write. The throttle
  wait queue removes this dependency as a side effect.

CC: Chris Mason <chris.mason@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
CC: Jan Kara <jack@suse.cz>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
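
For readers who want to play with the idea outside the kernel, here is a
minimal single-threaded userspace sketch of the throttle wait queue
described above. It is only an illustration, not the patch itself: the
names tq_init(), tq_wait(), tq_wakeup(), tq_page_written(), struct waiter
and MAX_WAITERS are hypothetical, a fixed ring buffer stands in for the
list_head FIFO, a plain "woken" flag stands in for the completion, and
there is no locking or atomics. It assumes every waiter asks for at least
one page.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors DIRTY_THROTTLE_PAGES_STOP: a countdown at or above this value
 * means "nobody is throttled", so page completions skip the decrement. */
#define THROTTLE_PAGES_STOP (1L << 22)
#define MAX_WAITERS 16              /* hypothetical fixed capacity */

struct waiter {
	long nr_pages;              /* pages to be written before wakeup */
	int woken;                  /* stand-in for struct completion */
};

struct throttle_queue {
	struct waiter *fifo[MAX_WAITERS]; /* ring buffer, oldest at head */
	size_t head, count;
	long throttle_pages;        /* countdown for the head waiter only */
};

static void tq_init(struct throttle_queue *q)
{
	q->head = 0;
	q->count = 0;
	q->throttle_pages = THROTTLE_PAGES_STOP * 2;
}

/* Register a throttled task. Returns 0 on success, -1 on bad input or
 * when the queue is full. A real caller would now sleep until woken. */
static int tq_wait(struct throttle_queue *q, struct waiter *w, long nr_pages)
{
	if (nr_pages <= 0 || q->count == MAX_WAITERS)
		return -1;
	w->nr_pages = nr_pages;
	w->woken = 0;
	if (q->count == 0)          /* first waiter arms the countdown */
		q->throttle_pages = nr_pages;
	q->fifo[(q->head + q->count) % MAX_WAITERS] = w;
	q->count++;
	return 0;
}

/* Wake the oldest waiter and re-arm the countdown for the next one.
 * Returns 1 if more waiters remain, 0 otherwise. */
static int tq_wakeup(struct throttle_queue *q)
{
	if (q->count > 0) {
		q->fifo[q->head]->woken = 1;
		q->head = (q->head + 1) % MAX_WAITERS;
		q->count--;
	}
	if (q->count > 0) {
		q->throttle_pages = q->fifo[q->head]->nr_pages;
		return 1;
	}
	q->throttle_pages = THROTTLE_PAGES_STOP * 2;
	return 0;
}

/* Called once per completed page, like the hook in __bdi_writeout_inc(). */
static void tq_page_written(struct throttle_queue *q)
{
	if (q->throttle_pages < THROTTLE_PAGES_STOP &&
	    --q->throttle_pages == 0)
		tq_wakeup(q);
}
```

With two waiters registered for 3 and 2 pages, the first is released
after the third call to tq_page_written() and the second after two more;
once the queue drains, the countdown is parked at THROTTLE_PAGES_STOP * 2
and further page completions leave it untouched.
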
fs/fs-writeback.c | 71 ++++++++++++++++++++++++++++++++++
include/linux/backing-dev.h | 15 +++++++
mm/backing-dev.c | 4 +
mm/page-writeback.c | 53 ++++++-------------------
4 files changed, 103 insertions(+), 40 deletions(-)
--- linux.orig/mm/page-writeback.c 2009-10-06 23:38:30.000000000 +0800
+++ linux/mm/page-writeback.c 2009-10-06 23:38:43.000000000 +0800
@@ -218,6 +218,15 @@ static inline void __bdi_writeout_inc(st
{
__prop_inc_percpu_max(&vm_completions, &bdi->completions,
bdi->max_prop_frac);
+
+ /*
+ * The DIRTY_THROTTLE_PAGES_STOP test is an optional optimization, so
+ * it's OK to be racy. We set DIRTY_THROTTLE_PAGES_STOP*2 in other
+ * places to reduce the race possibility.
+ */
+ if (atomic_read(&bdi->throttle_pages) < DIRTY_THROTTLE_PAGES_STOP &&
+ atomic_dec_and_test(&bdi->throttle_pages))
+ bdi_writeback_wakeup(bdi);
}
void bdi_writeout_inc(struct backing_dev_info *bdi)
@@ -458,20 +467,10 @@ static void balance_dirty_pages(struct a
unsigned long background_thresh;
unsigned long dirty_thresh;
unsigned long bdi_thresh;
- unsigned long pages_written = 0;
- unsigned long pause = 1;
int dirty_exceeded;
struct backing_dev_info *bdi = mapping->backing_dev_info;
for (;;) {
- struct writeback_control wbc = {
- .bdi = bdi,
- .sync_mode = WB_SYNC_NONE,
- .older_than_this = NULL,
- .nr_to_write = write_chunk,
- .range_cyclic = 1,
- };
-
nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
global_page_state(NR_UNSTABLE_NFS);
nr_writeback = global_page_state(NR_WRITEBACK) +
@@ -518,39 +517,13 @@ static void balance_dirty_pages(struct a
if (!bdi->dirty_exceeded)
bdi->dirty_exceeded = 1;
- /* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
- * Unstable writes are a feature of certain networked
- * filesystems (i.e. NFS) in which data may have been
- * written to the server's write cache, but has not yet
- * been flushed to permanent storage.
- * Only move pages to writeback if this bdi is over its
- * threshold otherwise wait until the disk writes catch
- * up.
- */
- if (bdi_nr_reclaimable > bdi_thresh) {
- writeback_inodes_wbc(&wbc);
- pages_written += write_chunk - wbc.nr_to_write;
- /* don't wait if we've done enough */
- if (pages_written >= write_chunk)
- break;
- }
- schedule_timeout_interruptible(pause);
-
- /*
- * Increase the delay for each loop, up to our previous
- * default of taking a 100ms nap.
- */
- pause <<= 1;
- if (pause > HZ / 10)
- pause = HZ / 10;
+ bdi_writeback_wait(bdi, write_chunk);
+ break;
}
if (!dirty_exceeded && bdi->dirty_exceeded)
bdi->dirty_exceeded = 0;
- if (writeback_in_progress(bdi))
- return;
-
/*
* In laptop mode, we wait until hitting the higher threshold before
* starting background writeout, and then write out all the way down
@@ -559,8 +532,8 @@ static void balance_dirty_pages(struct a
* In normal mode, we start background writeout at the lower
* background_thresh, to keep the amount of dirty memory low.
*/
- if ((laptop_mode && pages_written) ||
- (!laptop_mode && (nr_reclaimable > background_thresh)))
+ if (!laptop_mode && (nr_reclaimable > background_thresh) &&
+ can_submit_background_writeback(bdi))
bdi_start_writeback(bdi, NULL, 0);
}
--- linux.orig/include/linux/backing-dev.h 2009-10-06 23:38:43.000000000 +0800
+++ linux/include/linux/backing-dev.h 2009-10-06 23:38:43.000000000 +0800
@@ -86,6 +86,13 @@ struct backing_dev_info {
struct list_head work_list;
+ /*
+ * dirtier process throttling
+ */
+ spinlock_t throttle_lock;
+ struct list_head throttle_list; /* nr to sync for each task */
+ atomic_t throttle_pages; /* nr to sync for head task */
+
struct device *dev;
#ifdef CONFIG_DEBUG_FS
@@ -99,6 +106,12 @@ struct backing_dev_info {
*/
#define WB_FLAG_BACKGROUND_WORK 30
+/*
+ * when no task is throttled, set throttle_pages to larger than this,
+ * to avoid unnecessary atomic decreases.
+ */
+#define DIRTY_THROTTLE_PAGES_STOP (1 << 22)
+
int bdi_init(struct backing_dev_info *bdi);
void bdi_destroy(struct backing_dev_info *bdi);
@@ -110,6 +123,8 @@ void bdi_start_writeback(struct backing_
long nr_pages);
int bdi_writeback_task(struct bdi_writeback *wb);
int bdi_has_dirty_io(struct backing_dev_info *bdi);
+int bdi_writeback_wakeup(struct backing_dev_info *bdi);
+void bdi_writeback_wait(struct backing_dev_info *bdi, long nr_pages);
extern spinlock_t bdi_lock;
extern struct list_head bdi_list;
--- linux.orig/fs/fs-writeback.c 2009-10-06 23:38:43.000000000 +0800
+++ linux/fs/fs-writeback.c 2009-10-06 23:38:43.000000000 +0800
@@ -25,6 +25,7 @@
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <linux/buffer_head.h>
+#include <linux/completion.h>
#include "internal.h"
#define inode_to_bdi(inode) ((inode)->i_mapping->backing_dev_info)
@@ -265,6 +266,72 @@ void bdi_start_writeback(struct backing_
bdi_alloc_queue_work(bdi, &args);
}
+struct dirty_throttle_task {
+ long nr_pages;
+ struct list_head list;
+ struct completion complete;
+};
+
+void bdi_writeback_wait(struct backing_dev_info *bdi, long nr_pages)
+{
+ struct dirty_throttle_task tt = {
+ .nr_pages = nr_pages,
+ .complete = COMPLETION_INITIALIZER_ONSTACK(tt.complete),
+ };
+ unsigned long flags;
+
+ /*
+ * register throttle pages
+ */
+ spin_lock_irqsave(&bdi->throttle_lock, flags);
+ if (list_empty(&bdi->throttle_list))
+ atomic_set(&bdi->throttle_pages, nr_pages);
+ list_add(&tt.list, &bdi->throttle_list);
+ spin_unlock_irqrestore(&bdi->throttle_lock, flags);
+
+ /*
+ * make sure we will be woke up by someone
+ */
+ if (can_submit_background_writeback(bdi))
+ bdi_start_writeback(bdi, NULL, 0);
+
+ wait_for_completion(&tt.complete);
+}
+
+/*
+ * return 1 if there are more waiting tasks.
+ */
+int bdi_writeback_wakeup(struct backing_dev_info *bdi)
+{
+ struct dirty_throttle_task *tt;
+ unsigned long flags;
+
+ spin_lock_irqsave(&bdi->throttle_lock, flags);
+ /*
+ * remove and wakeup head task
+ */
+ if (!list_empty(&bdi->throttle_list)) {
+ tt = list_entry(bdi->throttle_list.prev,
+ struct dirty_throttle_task, list);
+ list_del(&tt->list);
+ complete(&tt->complete);
+ }
+ /*
+ * update throttle pages
+ */
+ if (!list_empty(&bdi->throttle_list)) {
+ tt = list_entry(bdi->throttle_list.prev,
+ struct dirty_throttle_task, list);
+ atomic_set(&bdi->throttle_pages, tt->nr_pages);
+ } else {
+ tt = NULL;
+ atomic_set(&bdi->throttle_pages, DIRTY_THROTTLE_PAGES_STOP * 2);
+ }
+ spin_unlock_irqrestore(&bdi->throttle_lock, flags);
+
+ return tt != NULL;
+}
+
/*
* Redirty an inode: set its when-it-was dirtied timestamp and move it to the
* furthest end of its superblock's dirty-inode list.
@@ -760,6 +827,10 @@ static long wb_writeback(struct bdi_writ
spin_unlock(&inode_lock);
}
+ if (args->for_background)
+ while (bdi_writeback_wakeup(wb->bdi))
+ ; /* unthrottle all tasks */
+
return wrote;
}
--- linux.orig/mm/backing-dev.c 2009-10-06 23:37:47.000000000 +0800
+++ linux/mm/backing-dev.c 2009-10-06 23:38:43.000000000 +0800
@@ -646,6 +646,10 @@ int bdi_init(struct backing_dev_info *bd
bdi->wb_mask = 1;
bdi->wb_cnt = 1;
+ spin_lock_init(&bdi->throttle_lock);
+ INIT_LIST_HEAD(&bdi->throttle_list);
+ atomic_set(&bdi->throttle_pages, DIRTY_THROTTLE_PAGES_STOP * 2);
+
for (i = 0; i < NR_BDI_STAT_ITEMS; i++) {
err = percpu_counter_init(&bdi->bdi_stat[i], 0);
if (err)