From: Wu Fengguang <fengguang.wu@intel.com>
To: Mel Gorman <mel@csn.ul.ie>
Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
Dave Chinner <david@fromorbit.com>,
Chris Mason <chris.mason@oracle.com>,
Nick Piggin <npiggin@suse.de>, Rik van Riel <riel@redhat.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Christoph Hellwig <hch@infradead.org>,
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
Andrew Morton <akpm@linux-foundation.org>,
Andrea Arcangeli <aarcange@redhat.com>
Subject: Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
Date: Tue, 27 Jul 2010 23:21:47 +0800 [thread overview]
Message-ID: <20100727152147.GB5184@localhost> (raw)
In-Reply-To: <20100727143805.GB5300@csn.ul.ie>
On Tue, Jul 27, 2010 at 10:38:05PM +0800, Mel Gorman wrote:
> On Tue, Jul 27, 2010 at 10:24:13PM +0800, Wu Fengguang wrote:
> > On Tue, Jul 27, 2010 at 09:35:13PM +0800, Mel Gorman wrote:
> > > On Mon, Jul 26, 2010 at 09:10:08PM +0800, Wu Fengguang wrote:
> > > > On Mon, Jul 26, 2010 at 08:57:17PM +0800, Mel Gorman wrote:
> > > > > On Mon, Jul 26, 2010 at 07:27:09PM +0800, Wu Fengguang wrote:
> > > > > > > > > @@ -933,13 +934,16 @@ keep_dirty:
> > > > > > > > > VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > > + /*
> > > > > > > > > + * If reclaim is encountering dirty pages, it may be because
> > > > > > > > > + * dirty pages are reaching the end of the LRU even though
> > > > > > > > > + * the dirty_ratio may be satisified. In this case, wake
> > > > > > > > > + * flusher threads to pro-actively clean some pages
> > > > > > > > > + */
> > > > > > > > > + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty + nr_dirty / 2);
> > > > > > > >
> > > > > > > > Ah it's very possible that nr_dirty==0 here! Then you are hitting the
> > > > > > > > number of dirty pages down to 0 whether or not pageout() is called.
> > > > > > > >
> > > > > > >
> > > > > > > True, this has been fixed to only wakeup flusher threads when this is
> > > > > > > the file LRU, dirty pages have been encountered and the caller has
> > > > > > > sc->may_writepage.
> > > > > >
> > > > > > OK.
> > > > > >
> > > > > > > > Another minor issue is, the passed (nr_dirty + nr_dirty / 2) is
> > > > > > > > normally a small number, much smaller than MAX_WRITEBACK_PAGES.
> > > > > > > > The flusher will sync at least MAX_WRITEBACK_PAGES pages, this is good
> > > > > > > > for efficiency.
> > > > > > > > And it seems good to let the flusher write much more
> > > > > > > > than nr_dirty pages to safeguard a reasonable large
> > > > > > > > vmscan-head-to-first-dirty-LRU-page margin. So it would be enough to
> > > > > > > > update the comments.
> > > > > > > >
> > > > > > >
> > > > > > > Ok, the reasoning had been to flush a number of pages that was related
> > > > > > > to the scanning rate but if that is inefficient for the flusher, I'll
> > > > > > > use MAX_WRITEBACK_PAGES.
> > > > > >
> > > > > > It would be better to pass something like (nr_dirty * N).
> > > > > > MAX_WRITEBACK_PAGES may be increased to 128MB in the future, which is
> > > > > > obviously too large as a parameter. When the batch size is increased
> > > > > > to 128MB, the writeback code may be improved somehow to not exceed the
> > > > > > nr_pages limit too much.
> > > > > >
> > > > >
> > > > > What might be a useful value for N? 1.5 appears to work reasonably well
> > > > > to create a window of writeback ahead of the scanner but it's a bit
> > > > > arbitrary.
> > > >
> > > > I'd recommend N to be a large value. It's no longer relevant now since
> > > > we'll call the flusher to sync some range containing the target page.
> > > > The flusher will then choose an N large enough (eg. 4MB) for efficient
> > > > IO. It needs to be a large value, otherwise the vmscan code will
> > > > quickly run into dirty pages again..
> > > >
> > >
> > > Ok, I took the 4MB at face value to be a "reasonable amount that should
> > > not cause congestion".
> >
> > Under memory pressure, the disk should be busy/congested anyway.
>
> Not necessarily. It could be streaming reads where pages are being added
> to the LRU quickly but not necessarily dominated by dirty pages. Due to the
> scanning rate, a dirty page may be encountered but it could be rare.
Right.
> > The big 4MB adds much work, however many of the pages may need to be
> > synced in the near future anyway. It also requires more time to do
> > the bigger IO, hence adding some latency, however the latency should
> > be a small factor comparing to the IO queue time (which will be long
> > for a busy disk).
> >
> > Overall expectation is, the more efficient IO, the more progress :)
> >
>
> Ok.
>
> > > The end result is
> > >
> > > #define MAX_WRITEBACK (4194304UL >> PAGE_SHIFT)
> > > #define WRITEBACK_FACTOR (MAX_WRITEBACK / SWAP_CLUSTER_MAX)
> > > static inline long nr_writeback_pages(unsigned long nr_dirty)
> > > {
> > > return laptop_mode ? 0 :
> > > min(MAX_WRITEBACK, (nr_dirty * WRITEBACK_FACTOR));
> > > }
> > >
> > > nr_writeback_pages(nr_dirty) is what gets passed to
> > > wakeup_flusher_threads(). Does that seem sensible?
> >
> > If you plan to keep wakeup_flusher_threads(), a simpler form may be
> > sufficient, eg.
> >
> > laptop_mode ? 0 : (nr_dirty * 16)
> >
>
> I plan to keep wakeup_flusher_threads() for now. I didn't go with 16 because
> while nr_dirty will usually be < SWAP_CLUSTER_MAX, it might not be due to lumpy
> reclaim. I wanted to firmly bound how much writeback was being requested -
> hence the mild complexity.
OK.
> > On top of this, we may write another patch to convert the
> > wakeup_flusher_threads(bdi, nr_pages) call to some
> > bdi_start_inode_writeback(inode, offset) call, to start more oriented
> > writeback.
> >
>
> I did a first pass at optimising based on prioritising inodes related to
> dirty pages. It's incredibly primitive and I have to sit down and see
> how the entire of writeback is put together to improve on it. Maybe
> you'll spot something simple or see if it's the totally wrong direction.
> Patch is below.
The simplest style may be
struct writeback_control wbc = {
.sync_mode = WB_SYNC_NONE,
.nr_to_write = MAX_WRITEBACK_PAGES,
};
mapping->writeback_index = offset;
return do_writepages(mapping, &wbc);
But sure there will be many details to handle.
> > When talking the 4MB optimization, I was referring to the internal
> > implementation of bdi_start_inode_writeback(). Sorry for the missing
> > context in the previous email.
> >
>
> No worries, I was assuming it was something in mainline I didn't know
> yet :)
>
> > It may need a big patch to implement bdi_start_inode_writeback().
> > Would you like to try it, or leave the task to me?
> >
>
> If you send me a patch, I can try it out but it's not my highest
> priority right now. I'm still looking to get writeback-from-reclaim down
> to a reasonable level without causing a large amount of churn.
OK. That's already a great work.
> Here is the first pass anyway at kicking wakeup_flusher_threads() for
> inodes belonging to a list of pages. You'll note that I do nothing with
> page offset because I didn't spot a simple way of taking that
> information into account. It's also horrible from a locking perspective.
> So far, it's testing has been "it didn't crash".
It seems a neat way to prioritize the inodes with a new flag
I_DIRTY_RECLAIM. However it may require vastly different
implementation when considering the offset. I'll try to work up a
prototype tomorrow.
Thanks,
Fengguang
> ==== CUT HERE ====
> writeback: Prioritise dirty inodes encountered by reclaim for background flushing
>
> It is preferable that as few dirty pages as possible are dispatched for
> cleaning from the page reclaim path. When dirty pages are encountered by
> page reclaim, this patch marks the inodes that they should be dispatched
> immediately. When the background flusher runs, it moves such inodes immediately
> to the dispatch queue regardless of inode age.
>
> This is an early prototype. It could be optimised to not regularly take
> the inode lock repeatedly and ideally the page offset would also be
> taken into account.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
> fs/fs-writeback.c | 52 ++++++++++++++++++++++++++++++++++++++++++++-
> include/linux/fs.h | 5 ++-
> include/linux/writeback.h | 1 +
> mm/vmscan.c | 6 +++-
> 4 files changed, 59 insertions(+), 5 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 5a3c764..27a8b75 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -221,7 +221,7 @@ static void move_expired_inodes(struct list_head *delaying_queue,
> LIST_HEAD(tmp);
> struct list_head *pos, *node;
> struct super_block *sb = NULL;
> - struct inode *inode;
> + struct inode *inode, *tinode;
> int do_sb_sort = 0;
>
> if (wbc->for_kupdate || wbc->for_background) {
> @@ -229,6 +229,14 @@ static void move_expired_inodes(struct list_head *delaying_queue,
> older_than_this = jiffies - expire_interval;
> }
>
> + /* Move inodes reclaim found at end of LRU to dispatch queue */
> + list_for_each_entry_safe(inode, tinode, delaying_queue, i_list) {
> + if (inode->i_state & I_DIRTY_RECLAIM) {
> + inode->i_state &= ~I_DIRTY_RECLAIM;
> + list_move(&inode->i_list, &tmp);
> + }
> + }
> +
> while (!list_empty(delaying_queue)) {
> inode = list_entry(delaying_queue->prev, struct inode, i_list);
> if (expire_interval &&
> @@ -906,6 +914,48 @@ void wakeup_flusher_threads(long nr_pages)
> rcu_read_unlock();
> }
>
> +/*
> + * Similar to wakeup_flusher_threads except prioritise inodes contained
> + * in the page_list regardless of age
> + */
> +void wakeup_flusher_threads_pages(long nr_pages, struct list_head *page_list)
> +{
> + struct page *page;
> + struct address_space *mapping;
> + struct inode *inode;
> +
> + list_for_each_entry(page, page_list, lru) {
> + if (!PageDirty(page))
> + continue;
> +
> + lock_page(page);
> + mapping = page_mapping(page);
> + if (!mapping || mapping == &swapper_space)
> + goto unlock;
> +
> + /*
> + * Test outside the lock to see as if it is already set, taking
> + * the inode lock is a waste and the inode should be pinned by
> + * the lock_page
> + */
> + inode = page->mapping->host;
> + if (inode->i_state & I_DIRTY_RECLAIM)
> + goto unlock;
> +
> + /*
> + * XXX: Yuck, has to be a way of batching this by not requiring
> + * the page lock to pin the inode
> + */
> + spin_lock(&inode_lock);
> + inode->i_state |= I_DIRTY_RECLAIM;
> + spin_unlock(&inode_lock);
> +unlock:
> + unlock_page(page);
> + }
> +
> + wakeup_flusher_threads(nr_pages);
> +}
> +
> static noinline void block_dump___mark_inode_dirty(struct inode *inode)
> {
> if (inode->i_ino || strcmp(inode->i_sb->s_id, "bdev")) {
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index e29f0ed..8836698 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1585,8 +1585,8 @@ struct super_operations {
> /*
> * Inode state bits. Protected by inode_lock.
> *
> - * Three bits determine the dirty state of the inode, I_DIRTY_SYNC,
> - * I_DIRTY_DATASYNC and I_DIRTY_PAGES.
> + * Four bits determine the dirty state of the inode, I_DIRTY_SYNC,
> + * I_DIRTY_DATASYNC, I_DIRTY_PAGES and I_DIRTY_RECLAIM.
> *
> * Four bits define the lifetime of an inode. Initially, inodes are I_NEW,
> * until that flag is cleared. I_WILL_FREE, I_FREEING and I_CLEAR are set at
> @@ -1633,6 +1633,7 @@ struct super_operations {
> #define I_DIRTY_SYNC 1
> #define I_DIRTY_DATASYNC 2
> #define I_DIRTY_PAGES 4
> +#define I_DIRTY_RECLAIM 256
> #define __I_NEW 3
> #define I_NEW (1 << __I_NEW)
> #define I_WILL_FREE 16
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index 494edd6..73a4df2 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -64,6 +64,7 @@ void writeback_inodes_wb(struct bdi_writeback *wb,
> struct writeback_control *wbc);
> long wb_do_writeback(struct bdi_writeback *wb, int force_wait);
> void wakeup_flusher_threads(long nr_pages);
> +void wakeup_flusher_threads_pages(long nr_pages, struct list_head *page_list);
>
> /* writeback.h requires fs.h; it, too, is not included from here. */
> static inline void wait_on_inode(struct inode *inode)
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b66d1f5..bad1abf 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -901,7 +901,8 @@ keep:
> * laptop mode avoiding disk spin-ups
> */
> if (file && nr_dirty_seen && sc->may_writepage)
> - wakeup_flusher_threads(nr_writeback_pages(nr_dirty));
> + wakeup_flusher_threads_pages(nr_writeback_pages(nr_dirty),
> + page_list);
>
> *nr_still_dirty = nr_dirty;
> count_vm_events(PGACTIVATE, pgactivate);
> @@ -1368,7 +1369,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
> list_add(&page->lru, &putback_list);
> }
>
> - wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty);
> + wakeup_flusher_threads_pages(laptop_mode ? 0 : nr_dirty,
> + &page_list);
> congestion_wait(BLK_RW_ASYNC, HZ/10);
>
> /*
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
prev parent reply other threads:[~2010-07-27 15:21 UTC|newest]
Thread overview: 87+ messages / expand[flat|nested] mbox.gz Atom feed top
2010-07-19 13:11 [PATCH 0/8] Reduce writeback from page reclaim context V4 Mel Gorman
2010-07-19 13:11 ` [PATCH 1/8] vmscan: tracing: Roll up of patches currently in mmotm Mel Gorman
2010-07-19 13:11 ` [PATCH 2/8] vmscan: tracing: Update trace event to track if page reclaim IO is for anon or file pages Mel Gorman
2010-07-19 13:24 ` Rik van Riel
2010-07-19 14:15 ` Christoph Hellwig
2010-07-19 14:24 ` Mel Gorman
2010-07-19 14:26 ` Christoph Hellwig
2010-07-19 13:11 ` [PATCH 3/8] vmscan: tracing: Update post-processing script to distinguish between anon and file IO from page reclaim Mel Gorman
2010-07-19 13:32 ` Rik van Riel
2010-07-19 13:11 ` [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim Mel Gorman
2010-07-19 14:19 ` Christoph Hellwig
2010-07-19 14:26 ` Mel Gorman
2010-07-19 18:25 ` Rik van Riel
2010-07-19 22:14 ` Johannes Weiner
2010-07-20 13:45 ` Mel Gorman
2010-07-20 22:02 ` Johannes Weiner
2010-07-21 11:36 ` Johannes Weiner
2010-07-21 11:52 ` Mel Gorman
2010-07-21 12:01 ` KAMEZAWA Hiroyuki
2010-07-21 14:27 ` Mel Gorman
2010-07-21 23:57 ` KAMEZAWA Hiroyuki
2010-07-22 9:19 ` Mel Gorman
2010-07-22 9:22 ` KAMEZAWA Hiroyuki
2010-07-21 13:04 ` Johannes Weiner
2010-07-21 13:38 ` Mel Gorman
2010-07-21 14:28 ` Johannes Weiner
2010-07-21 14:31 ` Mel Gorman
2010-07-21 14:39 ` Johannes Weiner
2010-07-21 15:06 ` Mel Gorman
2010-07-26 8:29 ` Wu Fengguang
2010-07-26 9:12 ` Mel Gorman
2010-07-26 11:19 ` Wu Fengguang
2010-07-26 12:53 ` Mel Gorman
2010-07-26 13:03 ` Wu Fengguang
2010-07-19 13:11 ` [PATCH 5/8] fs,btrfs: Allow kswapd to writeback pages Mel Gorman
2010-07-19 18:27 ` Rik van Riel
2010-07-19 13:11 ` [PATCH 6/8] fs,xfs: " Mel Gorman
2010-07-19 14:20 ` Christoph Hellwig
2010-07-19 14:43 ` Mel Gorman
2010-07-19 13:11 ` [PATCH 7/8] writeback: sync old inodes first in background writeback Mel Gorman
2010-07-19 14:21 ` Christoph Hellwig
2010-07-19 14:40 ` Mel Gorman
2010-07-19 14:48 ` Christoph Hellwig
2010-07-22 8:52 ` Wu Fengguang
2010-07-22 9:02 ` Wu Fengguang
2010-07-22 9:21 ` Wu Fengguang
2010-07-22 10:48 ` Mel Gorman
2010-07-23 9:45 ` Wu Fengguang
2010-07-23 10:57 ` Mel Gorman
2010-07-23 11:49 ` Wu Fengguang
2010-07-23 12:20 ` Wu Fengguang
2010-07-25 10:43 ` KOSAKI Motohiro
2010-07-25 12:03 ` Minchan Kim
2010-07-26 3:27 ` Wu Fengguang
2010-07-26 4:11 ` Minchan Kim
2010-07-26 4:37 ` Wu Fengguang
2010-07-26 16:30 ` Minchan Kim
2010-07-26 22:48 ` Wu Fengguang
2010-07-26 3:08 ` Wu Fengguang
2010-07-26 3:11 ` Rik van Riel
2010-07-26 3:17 ` Wu Fengguang
2010-07-22 15:34 ` Minchan Kim
2010-07-23 11:59 ` Wu Fengguang
2010-07-22 9:42 ` Mel Gorman
2010-07-23 8:33 ` Wu Fengguang
2010-07-22 1:13 ` Wu Fengguang
2010-07-19 18:43 ` Rik van Riel
2010-07-19 13:11 ` [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages Mel Gorman
2010-07-19 14:23 ` Christoph Hellwig
2010-07-19 14:37 ` Mel Gorman
2010-07-19 22:48 ` Johannes Weiner
2010-07-20 14:10 ` Mel Gorman
2010-07-20 22:05 ` Johannes Weiner
2010-07-19 18:59 ` Rik van Riel
2010-07-19 22:26 ` Johannes Weiner
2010-07-26 7:28 ` Wu Fengguang
2010-07-26 9:26 ` Mel Gorman
2010-07-26 11:27 ` Wu Fengguang
2010-07-26 12:57 ` Mel Gorman
2010-07-26 13:10 ` Wu Fengguang
2010-07-27 13:35 ` Mel Gorman
2010-07-27 14:24 ` Wu Fengguang
2010-07-27 14:34 ` Wu Fengguang
2010-07-27 14:40 ` Mel Gorman
2010-07-27 14:55 ` Wu Fengguang
2010-07-27 14:38 ` Mel Gorman
2010-07-27 15:21 ` Wu Fengguang [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20100727152147.GB5184@localhost \
--to=fengguang.wu@intel.com \
--cc=aarcange@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=chris.mason@oracle.com \
--cc=david@fromorbit.com \
--cc=hannes@cmpxchg.org \
--cc=hch@infradead.org \
--cc=kamezawa.hiroyu@jp.fujitsu.com \
--cc=kosaki.motohiro@jp.fujitsu.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mel@csn.ul.ie \
--cc=npiggin@suse.de \
--cc=riel@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).