Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Wu Fengguang <fengguang.wu@intel.com>
To: Mel Gorman <mel@csn.ul.ie>
Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	Dave Chinner <david@fromorbit.com>,
	Chris Mason <chris.mason@oracle.com>,
	Nick Piggin <npiggin@suse.de>, Rik van Riel <riel@redhat.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Christoph Hellwig <hch@infradead.org>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
	KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Andrea Arcangeli <aarcange@redhat.com>
Subject: Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
Date: Tue, 27 Jul 2010 23:21:47 +0800	[thread overview]
Message-ID: <20100727152147.GB5184@localhost> (raw)
In-Reply-To: <20100727143805.GB5300@csn.ul.ie>

On Tue, Jul 27, 2010 at 10:38:05PM +0800, Mel Gorman wrote:
> On Tue, Jul 27, 2010 at 10:24:13PM +0800, Wu Fengguang wrote:
> > On Tue, Jul 27, 2010 at 09:35:13PM +0800, Mel Gorman wrote:
> > > On Mon, Jul 26, 2010 at 09:10:08PM +0800, Wu Fengguang wrote:
> > > > On Mon, Jul 26, 2010 at 08:57:17PM +0800, Mel Gorman wrote:
> > > > > On Mon, Jul 26, 2010 at 07:27:09PM +0800, Wu Fengguang wrote:
> > > > > > > > > @@ -933,13 +934,16 @@ keep_dirty:
> > > > > > > > >  		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
> > > > > > > > >  	}
> > > > > > > > >  
> > > > > > > > > +	/*
> > > > > > > > > +	 * If reclaim is encountering dirty pages, it may be because
> > > > > > > > > +	 * dirty pages are reaching the end of the LRU even though
> > > > > > > > > +	 * the dirty_ratio may be satisified. In this case, wake
> > > > > > > > > +	 * flusher threads to pro-actively clean some pages
> > > > > > > > > +	 */
> > > > > > > > > +	wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty + nr_dirty / 2);
> > > > > > > > 
> > > > > > > > Ah it's very possible that nr_dirty==0 here! Then you are hitting the
> > > > > > > > number of dirty pages down to 0 whether or not pageout() is called.
> > > > > > > > 
> > > > > > > 
> > > > > > > True, this has been fixed to only wakeup flusher threads when this is
> > > > > > > the file LRU, dirty pages have been encountered and the caller has
> > > > > > > sc->may_writepage.
> > > > > > 
> > > > > > OK.
> > > > > > 
> > > > > > > > Another minor issue is, the passed (nr_dirty + nr_dirty / 2) is
> > > > > > > > normally a small number, much smaller than MAX_WRITEBACK_PAGES.
> > > > > > > > The flusher will sync at least MAX_WRITEBACK_PAGES pages, this is good
> > > > > > > > for efficiency.
> > > > > > > > And it seems good to let the flusher write much more
> > > > > > > > than nr_dirty pages to safeguard a reasonable large
> > > > > > > > vmscan-head-to-first-dirty-LRU-page margin. So it would be enough to
> > > > > > > > update the comments.
> > > > > > > > 
> > > > > > > 
> > > > > > > Ok, the reasoning had been to flush a number of pages that was related
> > > > > > > to the scanning rate but if that is inefficient for the flusher, I'll
> > > > > > > use MAX_WRITEBACK_PAGES.
> > > > > > 
> > > > > > It would be better to pass something like (nr_dirty * N).
> > > > > > MAX_WRITEBACK_PAGES may be increased to 128MB in the future, which is
> > > > > > obviously too large as a parameter. When the batch size is increased
> > > > > > to 128MB, the writeback code may be improved somehow to not exceed the
> > > > > > nr_pages limit too much.
> > > > > > 
> > > > > 
> > > > > What might be a useful value for N? 1.5 appears to work reasonably well
> > > > > to create a window of writeback ahead of the scanner but it's a bit
> > > > > arbitrary.
> > > > 
> > > > I'd recommend N to be a large value. It's no longer relevant now since
> > > > we'll call the flusher to sync some range containing the target page.
> > > > The flusher will then choose an N large enough (eg. 4MB) for efficient
> > > > IO. It needs to be a large value, otherwise the vmscan code will
> > > > quickly run into dirty pages again..
> > > > 
> > > 
> > > Ok, I took the 4MB at face value to be a "reasonable amount that should
> > > not cause congestion".
> > 
> > Under memory pressure, the disk should be busy/congested anyway.
> 
> Not necessarily. It could be streaming reads where pages are being added
> to the LRU quickly but not necessarily dominated by dirty pages. Due to the
> scanning rate, a dirty page may be encountered but it could be rare.

Right.

> > The big 4MB adds much work, however many of the pages may need to be
> > synced in the near future anyway. It also requires more time to do
> > the bigger IO, hence adding some latency, however the latency should
> > be a small factor comparing to the IO queue time (which will be long
> > for a busy disk).
> > 
> > Overall expectation is, the more efficient IO, the more progress :)
> > 
> 
> Ok.
> 
> > > The end result is
> > > 
> > > #define MAX_WRITEBACK (4194304UL >> PAGE_SHIFT)
> > > #define WRITEBACK_FACTOR (MAX_WRITEBACK / SWAP_CLUSTER_MAX)
> > > static inline long nr_writeback_pages(unsigned long nr_dirty)
> > > {
> > >         return laptop_mode ? 0 :
> > >                         min(MAX_WRITEBACK, (nr_dirty * WRITEBACK_FACTOR));
> > > }
> > > 
> > > nr_writeback_pages(nr_dirty) is what gets passed to
> > > wakeup_flusher_threads(). Does that seem sensible?
> > 
> > If you plan to keep wakeup_flusher_threads(), a simpler form may be
> > sufficient, eg.
> > 
> >         laptop_mode ? 0 : (nr_dirty * 16)
> > 
> 
> I plan to keep wakeup_flusher_threads() for now. I didn't go with 16 because
> while nr_dirty will usually be < SWAP_CLUSTER_MAX, it might not be due to lumpy
> reclaim. I wanted to firmly bound how much writeback was being requested -
> hence the mild complexity.

OK.

> > On top of this, we may write another patch to convert the
> > wakeup_flusher_threads(bdi, nr_pages) call to some
> > bdi_start_inode_writeback(inode, offset) call, to start more oriented
> > writeback.
> > 
> 
> I did a first pass at optimising based on prioritising inodes related to
> dirty pages. It's incredibly primitive and I have to sit down and see
> how the entire of writeback is put together to improve on it. Maybe
> you'll spot something simple or see if it's the totally wrong direction.
> Patch is below.

The simplest style may be

        struct writeback_control wbc = {
                .sync_mode = WB_SYNC_NONE,
                .nr_to_write = MAX_WRITEBACK_PAGES,
        };
       
        mapping->writeback_index = offset;
        return do_writepages(mapping, &wbc);

But sure there will be many details to handle.

> > When talking the 4MB optimization, I was referring to the internal
> > implementation of bdi_start_inode_writeback(). Sorry for the missing
> > context in the previous email.
> > 
> 
> No worries, I was assuming it was something in mainline I didn't know
> yet :)
> 
> > It may need a big patch to implement bdi_start_inode_writeback().
> > Would you like to try it, or leave the task to me?
> > 
> 
> If you send me a patch, I can try it out but it's not my highest
> priority right now. I'm still looking to get writeback-from-reclaim down
> to a reasonable level without causing a large amount of churn.

OK. That's already a great work.
 
> Here is the first pass anyway at kicking wakeup_flusher_threads() for
> inodes belonging to a list of pages. You'll note that I do nothing with
> page offset because I didn't spot a simple way of taking that
> information into account. It's also horrible from a locking perspective.
> So far, it's testing has been "it didn't crash".

It seems a neat way to prioritize the inodes with a new flag
I_DIRTY_RECLAIM. However it may require vastly different
implementation when considering the offset. I'll try to work up a
prototype tomorrow.

Thanks,
Fengguang

 
> ==== CUT HERE ====
> writeback: Prioritise dirty inodes encountered by reclaim for background flushing
> 
> It is preferable that as few dirty pages as possible are dispatched for
> cleaning from the page reclaim path. When dirty pages are encountered by
> page reclaim, this patch marks the inodes that they should be dispatched
> immediately. When the background flusher runs, it moves such inodes immediately
> to the dispatch queue regardless of inode age.
> 
> This is an early prototype. It could be optimised to not regularly take
> the inode lock repeatedly and ideally the page offset would also be
> taken into account.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
>  fs/fs-writeback.c         |   52 ++++++++++++++++++++++++++++++++++++++++++++-
>  include/linux/fs.h        |    5 ++-
>  include/linux/writeback.h |    1 +
>  mm/vmscan.c               |    6 +++-
>  4 files changed, 59 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 5a3c764..27a8b75 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -221,7 +221,7 @@ static void move_expired_inodes(struct list_head *delaying_queue,
>  	LIST_HEAD(tmp);
>  	struct list_head *pos, *node;
>  	struct super_block *sb = NULL;
> -	struct inode *inode;
> +	struct inode *inode, *tinode;
>  	int do_sb_sort = 0;
>  
>  	if (wbc->for_kupdate || wbc->for_background) {
> @@ -229,6 +229,14 @@ static void move_expired_inodes(struct list_head *delaying_queue,
>  		older_than_this = jiffies - expire_interval;
>  	}
>  
> +	/* Move inodes reclaim found at end of LRU to dispatch queue */
> +	list_for_each_entry_safe(inode, tinode, delaying_queue, i_list) {
> +		if (inode->i_state & I_DIRTY_RECLAIM) {
> +			inode->i_state &= ~I_DIRTY_RECLAIM;
> +			list_move(&inode->i_list, &tmp);
> +		}
> +	}
> +
>  	while (!list_empty(delaying_queue)) {
>  		inode = list_entry(delaying_queue->prev, struct inode, i_list);
>  		if (expire_interval &&
> @@ -906,6 +914,48 @@ void wakeup_flusher_threads(long nr_pages)
>  	rcu_read_unlock();
>  }
>  
> +/*
> + * Similar to wakeup_flusher_threads except prioritise inodes contained
> + * in the page_list regardless of age
> + */
> +void wakeup_flusher_threads_pages(long nr_pages, struct list_head *page_list)
> +{
> +	struct page *page;
> +	struct address_space *mapping;
> +	struct inode *inode;
> +
> +	list_for_each_entry(page, page_list, lru) {
> +		if (!PageDirty(page))
> +			continue;
> +
> +		lock_page(page);
> +		mapping = page_mapping(page);
> +		if (!mapping || mapping == &swapper_space)
> +			goto unlock;
> +
> +		/*
> +		 * Test outside the lock to see as if it is already set, taking
> +		 * the inode lock is a waste and the inode should be pinned by
> +		 * the lock_page
> +		 */
> +		inode = page->mapping->host;
> +		if (inode->i_state & I_DIRTY_RECLAIM)
> +			goto unlock;
> +
> +		/*
> +		 * XXX: Yuck, has to be a way of batching this by not requiring
> +		 * 	the page lock to pin the inode
> +		 */
> +		spin_lock(&inode_lock);
> +		inode->i_state |= I_DIRTY_RECLAIM;
> +		spin_unlock(&inode_lock);
> +unlock:
> +		unlock_page(page);
> +	}
> +
> +	wakeup_flusher_threads(nr_pages);
> +}
> +
>  static noinline void block_dump___mark_inode_dirty(struct inode *inode)
>  {
>  	if (inode->i_ino || strcmp(inode->i_sb->s_id, "bdev")) {
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index e29f0ed..8836698 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1585,8 +1585,8 @@ struct super_operations {
>  /*
>   * Inode state bits.  Protected by inode_lock.
>   *
> - * Three bits determine the dirty state of the inode, I_DIRTY_SYNC,
> - * I_DIRTY_DATASYNC and I_DIRTY_PAGES.
> + * Four bits determine the dirty state of the inode, I_DIRTY_SYNC,
> + * I_DIRTY_DATASYNC, I_DIRTY_PAGES and I_DIRTY_RECLAIM.
>   *
>   * Four bits define the lifetime of an inode.  Initially, inodes are I_NEW,
>   * until that flag is cleared.  I_WILL_FREE, I_FREEING and I_CLEAR are set at
> @@ -1633,6 +1633,7 @@ struct super_operations {
>  #define I_DIRTY_SYNC		1
>  #define I_DIRTY_DATASYNC	2
>  #define I_DIRTY_PAGES		4
> +#define I_DIRTY_RECLAIM		256
>  #define __I_NEW			3
>  #define I_NEW			(1 << __I_NEW)
>  #define I_WILL_FREE		16
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index 494edd6..73a4df2 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -64,6 +64,7 @@ void writeback_inodes_wb(struct bdi_writeback *wb,
>  		struct writeback_control *wbc);
>  long wb_do_writeback(struct bdi_writeback *wb, int force_wait);
>  void wakeup_flusher_threads(long nr_pages);
> +void wakeup_flusher_threads_pages(long nr_pages, struct list_head *page_list);
>  
>  /* writeback.h requires fs.h; it, too, is not included from here. */
>  static inline void wait_on_inode(struct inode *inode)
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b66d1f5..bad1abf 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -901,7 +901,8 @@ keep:
>  	 * laptop mode avoiding disk spin-ups
>  	 */
>  	if (file && nr_dirty_seen && sc->may_writepage)
> -		wakeup_flusher_threads(nr_writeback_pages(nr_dirty));
> +		wakeup_flusher_threads_pages(nr_writeback_pages(nr_dirty),
> +					page_list);
>  
>  	*nr_still_dirty = nr_dirty;
>  	count_vm_events(PGACTIVATE, pgactivate);
> @@ -1368,7 +1369,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
>  				list_add(&page->lru, &putback_list);
>  			}
>  
> -			wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty);
> +			wakeup_flusher_threads_pages(laptop_mode ? 0 : nr_dirty,
> +								&page_list);
>  			congestion_wait(BLK_RW_ASYNC, HZ/10);
>  
>  			/*

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

     prev parent reply	other threads:[~2010-07-27 15:21 UTC|newest]

Thread overview: 87+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-07-19 13:11 [PATCH 0/8] Reduce writeback from page reclaim context V4 Mel Gorman
2010-07-19 13:11 ` [PATCH 1/8] vmscan: tracing: Roll up of patches currently in mmotm Mel Gorman
2010-07-19 13:11 ` [PATCH 2/8] vmscan: tracing: Update trace event to track if page reclaim IO is for anon or file pages Mel Gorman
2010-07-19 13:24   ` Rik van Riel
2010-07-19 14:15   ` Christoph Hellwig
2010-07-19 14:24     ` Mel Gorman
2010-07-19 14:26       ` Christoph Hellwig
2010-07-19 13:11 ` [PATCH 3/8] vmscan: tracing: Update post-processing script to distinguish between anon and file IO from page reclaim Mel Gorman
2010-07-19 13:32   ` Rik van Riel
2010-07-19 13:11 ` [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim Mel Gorman
2010-07-19 14:19   ` Christoph Hellwig
2010-07-19 14:26     ` Mel Gorman
2010-07-19 18:25   ` Rik van Riel
2010-07-19 22:14   ` Johannes Weiner
2010-07-20 13:45     ` Mel Gorman
2010-07-20 22:02       ` Johannes Weiner
2010-07-21 11:36         ` Johannes Weiner
2010-07-21 11:52         ` Mel Gorman
2010-07-21 12:01           ` KAMEZAWA Hiroyuki
2010-07-21 14:27             ` Mel Gorman
2010-07-21 23:57               ` KAMEZAWA Hiroyuki
2010-07-22  9:19                 ` Mel Gorman
2010-07-22  9:22                   ` KAMEZAWA Hiroyuki
2010-07-21 13:04           ` Johannes Weiner
2010-07-21 13:38             ` Mel Gorman
2010-07-21 14:28               ` Johannes Weiner
2010-07-21 14:31                 ` Mel Gorman
2010-07-21 14:39                   ` Johannes Weiner
2010-07-21 15:06                     ` Mel Gorman
2010-07-26  8:29               ` Wu Fengguang
2010-07-26  9:12                 ` Mel Gorman
2010-07-26 11:19                   ` Wu Fengguang
2010-07-26 12:53                     ` Mel Gorman
2010-07-26 13:03                       ` Wu Fengguang
2010-07-19 13:11 ` [PATCH 5/8] fs,btrfs: Allow kswapd to writeback pages Mel Gorman
2010-07-19 18:27   ` Rik van Riel
2010-07-19 13:11 ` [PATCH 6/8] fs,xfs: " Mel Gorman
2010-07-19 14:20   ` Christoph Hellwig
2010-07-19 14:43     ` Mel Gorman
2010-07-19 13:11 ` [PATCH 7/8] writeback: sync old inodes first in background writeback Mel Gorman
2010-07-19 14:21   ` Christoph Hellwig
2010-07-19 14:40     ` Mel Gorman
2010-07-19 14:48       ` Christoph Hellwig
2010-07-22  8:52       ` Wu Fengguang
2010-07-22  9:02         ` Wu Fengguang
2010-07-22  9:21         ` Wu Fengguang
2010-07-22 10:48           ` Mel Gorman
2010-07-23  9:45             ` Wu Fengguang
2010-07-23 10:57               ` Mel Gorman
2010-07-23 11:49                 ` Wu Fengguang
2010-07-23 12:20                   ` Wu Fengguang
2010-07-25 10:43                 ` KOSAKI Motohiro
2010-07-25 12:03                   ` Minchan Kim
2010-07-26  3:27                     ` Wu Fengguang
2010-07-26  4:11                       ` Minchan Kim
2010-07-26  4:37                         ` Wu Fengguang
2010-07-26 16:30                           ` Minchan Kim
2010-07-26 22:48                             ` Wu Fengguang
2010-07-26  3:08                   ` Wu Fengguang
2010-07-26  3:11                     ` Rik van Riel
2010-07-26  3:17                       ` Wu Fengguang
2010-07-22 15:34           ` Minchan Kim
2010-07-23 11:59             ` Wu Fengguang
2010-07-22  9:42         ` Mel Gorman
2010-07-23  8:33           ` Wu Fengguang
2010-07-22  1:13     ` Wu Fengguang
2010-07-19 18:43   ` Rik van Riel
2010-07-19 13:11 ` [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages Mel Gorman
2010-07-19 14:23   ` Christoph Hellwig
2010-07-19 14:37     ` Mel Gorman
2010-07-19 22:48       ` Johannes Weiner
2010-07-20 14:10         ` Mel Gorman
2010-07-20 22:05           ` Johannes Weiner
2010-07-19 18:59   ` Rik van Riel
2010-07-19 22:26   ` Johannes Weiner
2010-07-26  7:28   ` Wu Fengguang
2010-07-26  9:26     ` Mel Gorman
2010-07-26 11:27       ` Wu Fengguang
2010-07-26 12:57         ` Mel Gorman
2010-07-26 13:10           ` Wu Fengguang
2010-07-27 13:35             ` Mel Gorman
2010-07-27 14:24               ` Wu Fengguang
2010-07-27 14:34                 ` Wu Fengguang
2010-07-27 14:40                   ` Mel Gorman
2010-07-27 14:55                     ` Wu Fengguang
2010-07-27 14:38                 ` Mel Gorman
2010-07-27 15:21                   ` Wu Fengguang [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20100727152147.GB5184@localhost \
    --to=fengguang.wu@intel.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=chris.mason@oracle.com \
    --cc=david@fromorbit.com \
    --cc=hannes@cmpxchg.org \
    --cc=hch@infradead.org \
    --cc=kamezawa.hiroyu@jp.fujitsu.com \
    --cc=kosaki.motohiro@jp.fujitsu.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mel@csn.ul.ie \
    --cc=npiggin@suse.de \
    --cc=riel@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).