Re: [PATCH 5/6] writeback: sync expired inodes first in background writeback

From: Wu Fengguang <fengguang.wu@intel.com>
To: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Mel Gorman <mel@linux.vnet.ibm.com>,
	Dave Chinner <david@fromorbit.com>,
	Rik van Riel <riel@redhat.com>, Mel Gorman <mel@csn.ul.ie>,
	Itaru Kitayama <kitayama@cl.bb4u.ne.jp>,
	Minchan Kim <minchan.kim@gmail.com>,
	Linux Memory Management List <linux-mm@kvack.org>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 5/6] writeback: sync expired inodes first in background writeback
Date: Tue, 26 Apr 2011 21:51:30 +0800
Message-ID: <20110426135130.GA5719@localhost>
In-Reply-To: <20110426121751.GB5114@quack.suse.cz>

On Tue, Apr 26, 2011 at 08:17:51PM +0800, Jan Kara wrote:
> On Sun 24-04-11 11:15:31, Wu Fengguang wrote:
> > > One of the many requirements for writeback is that if userspace is
> > > continually dirtying pages in a particular file, that shouldn't cause
> > > the kupdate function to concentrate on that file's newly-dirtied pages,
> > > neglecting pages from other files which were less-recently dirtied. 
> > > (and dirty nodes, etc).
> > 
> > Sadly, I do find old pages that the flusher never gets a chance to
> > catch and write out.
>   What kind of load do you use?

Sorry, I was just thinking it through and came up with a _theoretical_ case.

> > In the below case, if the task dirties pages fast enough at the end of
> > file, writeback_index will never get a chance to wrap back. There may
> > be various variations of this case.
> > 
> > file head
> > [          ***                        ==>***************]==>
> >            old pages          writeback_index            fresh dirties
> > 
> > Ironically the current kernel relies on pageout() to catch these
> > old pages, which is not only inefficient, but also not reliable.
> > If a full LRU walk takes an hour, the old pages may stay dirtied
> > for an hour.
>   Well, the kupdate behavior has always been just a best-effort thing. We
> always tried to handle well common cases but didn't try to solve all of
> them. Unless we want to track dirty-age of every page (which we don't
> want because it's too expensive), there is really no way to make syncing
> of old pages 100% working for all the cases unless we do data-integrity
> type of writeback for the whole inode - but that could create new problems
> with stalling other files for too long I suspect.

Yeah, it's a hard problem in general. The flusher naturally works in
a coarse-grained way.
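
To make the theoretical case above concrete, here is a toy userspace
model of it. It is only a sketch under stated assumptions: struct
toy_file, the toy_* helpers, the page counts and the per-round rates
are all invented for illustration and are not kernel code. It models
cyclic writeback that resumes at writeback_index and wraps to the file
head only when the scan runs off the end of the file.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_PAGES 4096

/* Toy model of one file: dirty[i] says whether page i is dirty. */
struct toy_file {
	bool dirty[MAX_PAGES];
	size_t nr_pages;        /* current file size in pages */
	size_t writeback_index; /* where cyclic writeback resumes */
};

/* Append n freshly dirtied pages at the end of the file. */
static void toy_append(struct toy_file *f, size_t n)
{
	assert(f->nr_pages + n <= MAX_PAGES);
	while (n--)
		f->dirty[f->nr_pages++] = true;
}

/*
 * One cyclic writeback pass with a budget of nr_to_write pages: start
 * at writeback_index, clean dirty pages, and wrap back to page 0 only
 * when the scan reaches the end of the file.
 */
static void toy_writeback(struct toy_file *f, size_t nr_to_write)
{
	size_t i = f->writeback_index;

	while (nr_to_write && i < f->nr_pages) {
		if (f->dirty[i]) {
			f->dirty[i] = false;
			nr_to_write--;
		}
		i++;
	}
	f->writeback_index = (i >= f->nr_pages) ? 0 : i;
}

/*
 * Dirty one "old" page near the file head, put writeback_index past
 * it, then alternate appends and writeback passes. Returns true if
 * the old page is still dirty after the given number of rounds.
 */
static bool toy_old_page_starved(size_t append_rate, size_t wb_rate,
				 size_t rounds)
{
	static struct toy_file f;

	memset(&f, 0, sizeof(f));
	f.nr_pages = 100;
	f.dirty[10] = true;     /* the old page */
	f.writeback_index = 50; /* already past the old page */

	while (rounds--) {
		toy_append(&f, append_rate);
		toy_writeback(&f, wb_rate);
	}
	return f.dirty[10];
}
```

In this model, when the appender dirties more pages per round than the
writeback budget allows, writeback_index never reaches the end of the
file, so it never wraps and the old page stays dirty. When the budget
exceeds the append rate, the index wraps and the old page is written
on the next pass.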

> > We may have to do (conditional) tagged ->writepages to safeguard users
> > from losing data they'd expect to have been written hours ago.
>   Well, if the file is continuously written (and in your case it must be
> even continuously grown) I'd be content if we handle well the common case of
> linear append (that happens for log files etc.). If we can do well for more
> cases, even better but I'd be cautious not to disrupt some other more
> common cases.

I sketched a patch (totally untested) that should prevent any kind
of starvation inside an inode. Would this be too heavyweight?

Thanks,
Fengguang
---
Subject: writeback: livelock prevention inside actively dirtied files
Date: Tue Apr 26 21:35:47 CST 2011

- refresh dirtied_when on every full writeback_index cycle
  (pages may be skipped under WB_SYNC_NONE, but that is fine as long as
  they are retried in the next cycle)

- do a tagged sync when writeback_index has not cycled for too long
  (the arbitrary 60s threshold may add page tagging overhead on systems
  with a large dirty threshold but slow storage)

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c       |    1 +
 include/linux/fs.h      |    1 +
 include/linux/pagemap.h |   16 ++++++++++++++++
 mm/page-writeback.c     |   24 ++++++++++++++++++------
 4 files changed, 36 insertions(+), 6 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2011-04-26 21:26:28.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-04-26 21:26:39.000000000 +0800
@@ -1110,6 +1110,7 @@ void __mark_inode_dirty(struct inode *in
 			spin_unlock(&inode->i_lock);
 			spin_lock(&bdi->wb.list_lock);
 			inode->dirtied_when = jiffies;
+			inode->i_mapping->writeback_cycle_time = jiffies;
 			list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
 			spin_unlock(&bdi->wb.list_lock);
 
--- linux-next.orig/include/linux/fs.h	2011-04-26 21:26:28.000000000 +0800
+++ linux-next/include/linux/fs.h	2011-04-26 21:26:39.000000000 +0800
@@ -639,6 +639,7 @@ struct address_space {
 	unsigned int		truncate_count;	/* Cover race condition with truncate */
 	unsigned long		nrpages;	/* number of total pages */
 	pgoff_t			writeback_index;/* writeback starts here */
+	unsigned long		writeback_cycle_time;
 	const struct address_space_operations *a_ops;	/* methods */
 	unsigned long		flags;		/* error bits/gfp mask */
 	struct backing_dev_info *backing_dev_info; /* device readahead, etc */
--- linux-next.orig/mm/page-writeback.c	2011-04-26 21:26:28.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-04-26 21:33:47.000000000 +0800
@@ -835,6 +835,9 @@ void tag_pages_for_writeback(struct addr
 		cond_resched();
 		/* We check 'start' to handle wrapping when end == ~0UL */
 	} while (tagged >= WRITEBACK_TAG_BATCH && start);
+
+	mapping_set_tagged_sync(mapping);
+	mapping->writeback_cycle_time = jiffies;
 }
 EXPORT_SYMBOL(tag_pages_for_writeback);
 
@@ -872,7 +875,7 @@ int write_cache_pages(struct address_spa
 	pgoff_t end;		/* Inclusive */
 	pgoff_t done_index;
 	int range_whole = 0;
-	int tag;
+	int tag = PAGECACHE_TAG_DIRTY;
 
 	pagevec_init(&pvec, 0);
 	if (wbc->range_cyclic) {
@@ -884,13 +887,19 @@ int write_cache_pages(struct address_spa
 		if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
 			range_whole = 1;
 	}
-	if (wbc->sync_mode == WB_SYNC_ALL)
-		tag = PAGECACHE_TAG_TOWRITE;
-	else
-		tag = PAGECACHE_TAG_DIRTY;
+	if (!index)
+		mapping->writeback_cycle_time = jiffies;
 
-	if (wbc->sync_mode == WB_SYNC_ALL)
+	if (wbc->sync_mode == WB_SYNC_ALL ||
+	    (!mapping_tagged_sync(mapping) &&
+	     jiffies - mapping->host->dirtied_when > 60 * HZ)) {
 		tag_pages_for_writeback(mapping, index, end);
+		tag = PAGECACHE_TAG_TOWRITE;
+	}
+
+	if (mapping_tagged_sync(mapping))
+		tag = PAGECACHE_TAG_TOWRITE;
+
 	done_index = index;
 	while (!done && (index <= end)) {
 		int i;
@@ -899,6 +908,9 @@ int write_cache_pages(struct address_spa
 			      min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1);
 		if (nr_pages == 0) {
 			done_index = 0;
+			mapping->host->dirtied_when = mapping->writeback_cycle_time;
+			if (tag == PAGECACHE_TAG_TOWRITE)
+				mapping_clear_tagged_sync(mapping);
 			break;
 		}
 
--- linux-next.orig/include/linux/pagemap.h	2011-04-26 21:26:28.000000000 +0800
+++ linux-next/include/linux/pagemap.h	2011-04-26 21:46:38.000000000 +0800
@@ -24,6 +24,7 @@ enum mapping_flags {
 	AS_ENOSPC	= __GFP_BITS_SHIFT + 1,	/* ENOSPC on async write */
 	AS_MM_ALL_LOCKS	= __GFP_BITS_SHIFT + 2,	/* under mm_take_all_locks() */
 	AS_UNEVICTABLE	= __GFP_BITS_SHIFT + 3,	/* e.g., ramdisk, SHM_LOCK */
+	AS_TAGGED_SYNC	= __GFP_BITS_SHIFT + 4,	/* sync only tagged pages */
 };
 
 static inline void mapping_set_error(struct address_space *mapping, int error)
@@ -53,6 +54,21 @@ static inline int mapping_unevictable(st
 	return !!mapping;
 }
 
+static inline void mapping_set_tagged_sync(struct address_space *mapping)
+{
+	set_bit(AS_TAGGED_SYNC, &mapping->flags);
+}
+
+static inline void mapping_clear_tagged_sync(struct address_space *mapping)
+{
+	clear_bit(AS_TAGGED_SYNC, &mapping->flags);
+}
+
+static inline int mapping_tagged_sync(struct address_space *mapping)
+{
+	return test_bit(AS_TAGGED_SYNC, &mapping->flags);
+}
+
 static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
 {
 	return (__force gfp_t)mapping->flags & __GFP_BITS_MASK;
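
The tag selection that the mm/page-writeback.c hunks add to
write_cache_pages() can be modelled in userspace as below. This is a
sketch only, not the kernel implementation: struct toy_mapping, the
toy_* functions, TOY_HZ and the tick values are hypothetical names
made up for illustration, and the call to tag_pages_for_writeback() is
reduced to setting a flag and recording a timestamp.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_tag { TOY_TAG_DIRTY, TOY_TAG_TOWRITE };
enum toy_sync_mode { TOY_SYNC_NONE, TOY_SYNC_ALL };

#define TOY_HZ    100UL           /* assumed ticks per second */
#define TOY_STALE (60 * TOY_HZ)   /* the patch's 60s threshold */

struct toy_mapping {
	bool tagged_sync;                   /* models AS_TAGGED_SYNC */
	unsigned long dirtied_when;         /* models inode->dirtied_when */
	unsigned long writeback_cycle_time; /* models the new field */
};

/*
 * Pick the tag for one write_cache_pages() call. A tagged sweep starts
 * for WB_SYNC_ALL, or when no sweep is in progress and the inode has
 * been dirty for more than 60s. Once started, the sweep stays in
 * effect for later calls until toy_sweep_done() clears it.
 */
static enum toy_tag toy_pick_tag(struct toy_mapping *m,
				 enum toy_sync_mode mode, unsigned long now)
{
	enum toy_tag tag = TOY_TAG_DIRTY;

	assert(m);
	/* Unsigned subtraction keeps the age check valid across wraparound. */
	if (mode == TOY_SYNC_ALL ||
	    (!m->tagged_sync && now - m->dirtied_when > TOY_STALE)) {
		/* Stands in for tag_pages_for_writeback(). */
		m->tagged_sync = true;
		m->writeback_cycle_time = now;
		tag = TOY_TAG_TOWRITE;
	}
	if (m->tagged_sync)
		tag = TOY_TAG_TOWRITE;
	return tag;
}

/*
 * Models the nr_pages == 0 branch: the scan ran out of pages, so the
 * cycle is complete. Refresh dirtied_when and end any tagged sweep.
 */
static void toy_sweep_done(struct toy_mapping *m, enum toy_tag tag)
{
	m->dirtied_when = m->writeback_cycle_time;
	if (tag == TOY_TAG_TOWRITE)
		m->tagged_sync = false;
}
```

The point of the sticky flag in this model is that a WB_SYNC_NONE
sweep, which may be split across many calls by nr_to_write, keeps
writing only the pages tagged at the start, so fresh dirties at the
end of the file cannot extend it indefinitely.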
