linux-mm.kvack.org archive mirror
* [PATCH 0/6] [RFC] writeback: try to write older pages first
@ 2010-07-22  5:09 Wu Fengguang
  2010-07-22  5:09 ` [PATCH 1/6] writeback: pass writeback_control down to move_expired_inodes() Wu Fengguang
                   ` (7 more replies)
  0 siblings, 8 replies; 62+ messages in thread
From: Wu Fengguang @ 2010-07-22  5:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Chinner, Christoph Hellwig, Mel Gorman, Chris Mason,
	Jens Axboe, Wu Fengguang, LKML, linux-fsdevel, linux-mm

Andrew,

The basic way of avoiding pageout() is to make the flusher sync inodes in the
right order. The oldest dirty inodes contain the oldest pages. The smaller the
inode, the stronger the correlation between the inode's dirty time and its
pages' dirty times. So for small dirty inodes, syncing in order of inode dirty
time is able to avoid pageout(). If pageout() is still triggered frequently in
this case, the 30s dirty expire time may be too long and could be shrunk
adaptively; or it may be a stressed memcg list whose dirty inodes/pages are
harder to track.

For a large dirty inode, it may flush lots of newly dirtied pages _after_
syncing the expired pages. This is the normal case for a single-stream
sequential dirtier, where older pages sit at lower offsets.  In this case we
shall not insist on syncing the whole large dirty inode before considering the
other small dirty inodes. That risks wasting time syncing 1GB of freshly
dirtied pages before syncing the other N*1MB of expired dirty pages that are
approaching the end of the LRU list and hence pageout().

For a large dirty inode, it may also flush lots of newly dirtied pages _before_
hitting the desired old ones, in which case it helps for pageout() to do some
clustered writeback, and/or set mapping->writeback_index to help the flusher
focus on old pages.

For a large dirty inode, it may also have intermixed old and new dirty pages.
In this case we need to make sure the inode is queued for IO before some of
its pages hit pageout(). Adaptive dirty expire time helps here.

OK, end of the vapour ideas. As for this patchset, it fixes the current
kupdate/background writeback priority:

- the kupdate/background writeback shall include newly expired inodes at each
  queue_io() time, as the large inodes left over from previous writeback rounds
  are likely to have a lower density of old pages.

- the background writeback shall consider expired inodes first, just like the
  kupdate writeback

Thanks,
Fengguang

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: dont@kvack.org

^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH 1/6] writeback: pass writeback_control down to move_expired_inodes()
  2010-07-22  5:09 [PATCH 0/6] [RFC] writeback: try to write older pages first Wu Fengguang
@ 2010-07-22  5:09 ` Wu Fengguang
  2010-07-23 18:16   ` Jan Kara
                     ` (2 more replies)
  2010-07-22  5:09 ` [PATCH 2/6] writeback: the kupdate expire timestamp should be a moving target Wu Fengguang
                   ` (6 subsequent siblings)
  7 siblings, 3 replies; 62+ messages in thread
From: Wu Fengguang @ 2010-07-22  5:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Chinner, Wu Fengguang, Christoph Hellwig, Mel Gorman,
	Chris Mason, Jens Axboe, LKML, linux-fsdevel, linux-mm

[-- Attachment #1: writeback-pass-wbc-to-queue_io.patch --]
[-- Type: text/plain, Size: 2683 bytes --]

This is to prepare for moving the dirty expire policy to move_expired_inodes().
No behavior change.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2010-07-21 20:12:38.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-07-21 20:14:38.000000000 +0800
@@ -213,8 +213,8 @@ static bool inode_dirtied_after(struct i
  * Move expired dirty inodes from @delaying_queue to @dispatch_queue.
  */
 static void move_expired_inodes(struct list_head *delaying_queue,
-			       struct list_head *dispatch_queue,
-				unsigned long *older_than_this)
+				struct list_head *dispatch_queue,
+				struct writeback_control *wbc)
 {
 	LIST_HEAD(tmp);
 	struct list_head *pos, *node;
@@ -224,8 +224,8 @@ static void move_expired_inodes(struct l
 
 	while (!list_empty(delaying_queue)) {
 		inode = list_entry(delaying_queue->prev, struct inode, i_list);
-		if (older_than_this &&
-		    inode_dirtied_after(inode, *older_than_this))
+		if (wbc->older_than_this &&
+		    inode_dirtied_after(inode, *wbc->older_than_this))
 			break;
 		if (sb && sb != inode->i_sb)
 			do_sb_sort = 1;
@@ -257,10 +257,10 @@ static void move_expired_inodes(struct l
  *                 => b_more_io inodes
  *                 => remaining inodes in b_io => (dequeue for sync)
  */
-static void queue_io(struct bdi_writeback *wb, unsigned long *older_than_this)
+static void queue_io(struct bdi_writeback *wb, struct writeback_control *wbc)
 {
 	list_splice_init(&wb->b_more_io, &wb->b_io);
-	move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this);
+	move_expired_inodes(&wb->b_dirty, &wb->b_io, wbc);
 }
 
 static int write_inode(struct inode *inode, struct writeback_control *wbc)
@@ -519,7 +519,7 @@ void writeback_inodes_wb(struct bdi_writ
 	wbc->wb_start = jiffies; /* livelock avoidance */
 	spin_lock(&inode_lock);
 	if (!wbc->for_kupdate || list_empty(&wb->b_io))
-		queue_io(wb, wbc->older_than_this);
+		queue_io(wb, wbc);
 
 	while (!list_empty(&wb->b_io)) {
 		struct inode *inode = list_entry(wb->b_io.prev,
@@ -548,7 +548,7 @@ static void __writeback_inodes_sb(struct
 	wbc->wb_start = jiffies; /* livelock avoidance */
 	spin_lock(&inode_lock);
 	if (!wbc->for_kupdate || list_empty(&wb->b_io))
-		queue_io(wb, wbc->older_than_this);
+		queue_io(wb, wbc);
 	writeback_sb_inodes(sb, wb, wbc, true);
 	spin_unlock(&inode_lock);
 }




* [PATCH 2/6] writeback: the kupdate expire timestamp should be a moving target
  2010-07-22  5:09 [PATCH 0/6] [RFC] writeback: try to write older pages first Wu Fengguang
  2010-07-22  5:09 ` [PATCH 1/6] writeback: pass writeback_control down to move_expired_inodes() Wu Fengguang
@ 2010-07-22  5:09 ` Wu Fengguang
  2010-07-23 18:17   ` Jan Kara
                     ` (2 more replies)
  2010-07-22  5:09 ` [PATCH 3/6] writeback: kill writeback_control.more_io Wu Fengguang
                   ` (5 subsequent siblings)
  7 siblings, 3 replies; 62+ messages in thread
From: Wu Fengguang @ 2010-07-22  5:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Chinner, Jan Kara, Wu Fengguang, Christoph Hellwig,
	Mel Gorman, Chris Mason, Jens Axboe, LKML, linux-fsdevel,
	linux-mm

[-- Attachment #1: writeback-remove-older_than_this.patch --]
[-- Type: text/plain, Size: 6470 bytes --]

Dynamically compute the dirty expire timestamp at queue_io() time.
Also remove writeback_control.older_than_this, which is no longer used.

writeback_control.older_than_this used to be determined on entrance to
the kupdate writeback work. This _static_ timestamp may go stale if the
kupdate work runs on and on. The flusher may then get stuck with some old
busy inodes, never considering newly expired inodes thereafter.

This has two possible problems:

- It is unfair for a large dirty inode to delay (for a long time) the
  writeback of small dirty inodes.

- As time goes by, the large and busy dirty inode may contain only
  _freshly_ dirtied pages. Ignoring newly expired dirty inodes risks
  delaying the expired dirty pages until they reach the end of the LRU
  lists, triggering the very bad pageout(). Nevertheless this patch
  merely addresses part of the problem.

CC: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c                |   24 +++++++++---------------
 include/linux/writeback.h        |    2 --
 include/trace/events/writeback.h |    6 +-----
 mm/backing-dev.c                 |    1 -
 mm/page-writeback.c              |    1 -
 5 files changed, 10 insertions(+), 24 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2010-07-21 22:20:01.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-07-22 11:23:27.000000000 +0800
@@ -216,16 +216,23 @@ static void move_expired_inodes(struct l
 				struct list_head *dispatch_queue,
 				struct writeback_control *wbc)
 {
+	unsigned long expire_interval = 0;
+	unsigned long older_than_this;
 	LIST_HEAD(tmp);
 	struct list_head *pos, *node;
 	struct super_block *sb = NULL;
 	struct inode *inode;
 	int do_sb_sort = 0;
 
+	if (wbc->for_kupdate) {
+		expire_interval = msecs_to_jiffies(dirty_expire_interval * 10);
+		older_than_this = jiffies - expire_interval;
+	}
+
 	while (!list_empty(delaying_queue)) {
 		inode = list_entry(delaying_queue->prev, struct inode, i_list);
-		if (wbc->older_than_this &&
-		    inode_dirtied_after(inode, *wbc->older_than_this))
+		if (expire_interval &&
+		    inode_dirtied_after(inode, older_than_this))
 			break;
 		if (sb && sb != inode->i_sb)
 			do_sb_sort = 1;
@@ -583,29 +590,19 @@ static inline bool over_bground_thresh(v
  * Try to run once per dirty_writeback_interval.  But if a writeback event
  * takes longer than a dirty_writeback_interval interval, then leave a
  * one-second gap.
- *
- * older_than_this takes precedence over nr_to_write.  So we'll only write back
- * all dirty pages if they are all attached to "old" mappings.
  */
 static long wb_writeback(struct bdi_writeback *wb,
 			 struct wb_writeback_work *work)
 {
 	struct writeback_control wbc = {
 		.sync_mode		= work->sync_mode,
-		.older_than_this	= NULL,
 		.for_kupdate		= work->for_kupdate,
 		.for_background		= work->for_background,
 		.range_cyclic		= work->range_cyclic,
 	};
-	unsigned long oldest_jif;
 	long wrote = 0;
 	struct inode *inode;
 
-	if (wbc.for_kupdate) {
-		wbc.older_than_this = &oldest_jif;
-		oldest_jif = jiffies -
-				msecs_to_jiffies(dirty_expire_interval * 10);
-	}
 	if (!wbc.range_cyclic) {
 		wbc.range_start = 0;
 		wbc.range_end = LLONG_MAX;
@@ -998,9 +995,6 @@ EXPORT_SYMBOL(__mark_inode_dirty);
  * Write out a superblock's list of dirty inodes.  A wait will be performed
  * upon no inodes, all inodes or the final one, depending upon sync_mode.
  *
- * If older_than_this is non-NULL, then only write out inodes which
- * had their first dirtying at a time earlier than *older_than_this.
- *
  * If `bdi' is non-zero then we're being asked to writeback a specific queue.
  * This function assumes that the blockdev superblock's inodes are backed by
  * a variety of queues, so all inodes are searched.  For other superblocks,
--- linux-next.orig/include/linux/writeback.h	2010-07-21 22:20:02.000000000 +0800
+++ linux-next/include/linux/writeback.h	2010-07-22 11:23:27.000000000 +0800
@@ -28,8 +28,6 @@ enum writeback_sync_modes {
  */
 struct writeback_control {
 	enum writeback_sync_modes sync_mode;
-	unsigned long *older_than_this;	/* If !NULL, only write back inodes
-					   older than this */
 	unsigned long wb_start;         /* Time writeback_inodes_wb was
 					   called. This is needed to avoid
 					   extra jobs and livelock */
--- linux-next.orig/include/trace/events/writeback.h	2010-07-21 22:20:02.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2010-07-22 11:23:27.000000000 +0800
@@ -100,7 +100,6 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__field(int, for_reclaim)
 		__field(int, range_cyclic)
 		__field(int, more_io)
-		__field(unsigned long, older_than_this)
 		__field(long, range_start)
 		__field(long, range_end)
 	),
@@ -115,14 +114,12 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__entry->for_reclaim	= wbc->for_reclaim;
 		__entry->range_cyclic	= wbc->range_cyclic;
 		__entry->more_io	= wbc->more_io;
-		__entry->older_than_this = wbc->older_than_this ?
-						*wbc->older_than_this : 0;
 		__entry->range_start	= (long)wbc->range_start;
 		__entry->range_end	= (long)wbc->range_end;
 	),
 
 	TP_printk("bdi %s: towrt=%ld skip=%ld mode=%d kupd=%d "
-		"bgrd=%d reclm=%d cyclic=%d more=%d older=0x%lx "
+		"bgrd=%d reclm=%d cyclic=%d more=%d "
 		"start=0x%lx end=0x%lx",
 		__entry->name,
 		__entry->nr_to_write,
@@ -133,7 +130,6 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__entry->for_reclaim,
 		__entry->range_cyclic,
 		__entry->more_io,
-		__entry->older_than_this,
 		__entry->range_start,
 		__entry->range_end)
 )
--- linux-next.orig/mm/page-writeback.c	2010-07-21 22:20:02.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-07-21 22:20:03.000000000 +0800
@@ -482,7 +482,6 @@ static void balance_dirty_pages(struct a
 	for (;;) {
 		struct writeback_control wbc = {
 			.sync_mode	= WB_SYNC_NONE,
-			.older_than_this = NULL,
 			.nr_to_write	= write_chunk,
 			.range_cyclic	= 1,
 		};
--- linux-next.orig/mm/backing-dev.c	2010-07-22 11:23:34.000000000 +0800
+++ linux-next/mm/backing-dev.c	2010-07-22 11:23:39.000000000 +0800
@@ -271,7 +271,6 @@ static void bdi_flush_io(struct backing_
 {
 	struct writeback_control wbc = {
 		.sync_mode		= WB_SYNC_NONE,
-		.older_than_this	= NULL,
 		.range_cyclic		= 1,
 		.nr_to_write		= 1024,
 	};




* [PATCH 3/6] writeback: kill writeback_control.more_io
  2010-07-22  5:09 [PATCH 0/6] [RFC] writeback: try to write older pages first Wu Fengguang
  2010-07-22  5:09 ` [PATCH 1/6] writeback: pass writeback_control down to move_expired_inodes() Wu Fengguang
  2010-07-22  5:09 ` [PATCH 2/6] writeback: the kupdate expire timestamp should be a moving target Wu Fengguang
@ 2010-07-22  5:09 ` Wu Fengguang
  2010-07-23 18:24   ` Jan Kara
                     ` (2 more replies)
  2010-07-22  5:09 ` [PATCH 4/6] writeback: sync expired inodes first in background writeback Wu Fengguang
                   ` (4 subsequent siblings)
  7 siblings, 3 replies; 62+ messages in thread
From: Wu Fengguang @ 2010-07-22  5:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Chinner, Wu Fengguang, Christoph Hellwig, Mel Gorman,
	Chris Mason, Jens Axboe, LKML, linux-fsdevel, linux-mm

[-- Attachment #1: writeback-kill-more_io.patch --]
[-- Type: text/plain, Size: 3213 bytes --]

When wbc.more_io was first introduced, it indicated whether there was
at least one superblock whose s_more_io contained more IO work. Now with
per-bdi writeback, it can be replaced with a simple b_more_io test.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c                |    9 ++-------
 include/linux/writeback.h        |    1 -
 include/trace/events/writeback.h |    5 +----
 3 files changed, 3 insertions(+), 12 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2010-07-22 11:23:27.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-07-22 12:56:42.000000000 +0800
@@ -507,12 +507,8 @@ static int writeback_sb_inodes(struct su
 		iput(inode);
 		cond_resched();
 		spin_lock(&inode_lock);
-		if (wbc->nr_to_write <= 0) {
-			wbc->more_io = 1;
+		if (wbc->nr_to_write <= 0)
 			return 1;
-		}
-		if (!list_empty(&wb->b_more_io))
-			wbc->more_io = 1;
 	}
 	/* b_io is empty */
 	return 1;
@@ -622,7 +618,6 @@ static long wb_writeback(struct bdi_writ
 		if (work->for_background && !over_bground_thresh())
 			break;
 
-		wbc.more_io = 0;
 		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
 		wbc.pages_skipped = 0;
 
@@ -644,7 +639,7 @@ static long wb_writeback(struct bdi_writ
 		/*
 		 * Didn't write everything and we don't have more IO, bail
 		 */
-		if (!wbc.more_io)
+		if (list_empty(&wb->b_more_io))
 			break;
 		/*
 		 * Did we write something? Try for more
--- linux-next.orig/include/linux/writeback.h	2010-07-22 11:23:27.000000000 +0800
+++ linux-next/include/linux/writeback.h	2010-07-22 11:24:46.000000000 +0800
@@ -49,7 +49,6 @@ struct writeback_control {
 	unsigned for_background:1;	/* A background writeback */
 	unsigned for_reclaim:1;		/* Invoked from the page allocator */
 	unsigned range_cyclic:1;	/* range_start is cyclic */
-	unsigned more_io:1;		/* more io to be dispatched */
 };
 
 /*
--- linux-next.orig/include/trace/events/writeback.h	2010-07-22 11:23:27.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2010-07-22 11:24:46.000000000 +0800
@@ -99,7 +99,6 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__field(int, for_background)
 		__field(int, for_reclaim)
 		__field(int, range_cyclic)
-		__field(int, more_io)
 		__field(long, range_start)
 		__field(long, range_end)
 	),
@@ -113,13 +112,12 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__entry->for_background	= wbc->for_background;
 		__entry->for_reclaim	= wbc->for_reclaim;
 		__entry->range_cyclic	= wbc->range_cyclic;
-		__entry->more_io	= wbc->more_io;
 		__entry->range_start	= (long)wbc->range_start;
 		__entry->range_end	= (long)wbc->range_end;
 	),
 
 	TP_printk("bdi %s: towrt=%ld skip=%ld mode=%d kupd=%d "
-		"bgrd=%d reclm=%d cyclic=%d more=%d "
+		"bgrd=%d reclm=%d cyclic=%d "
 		"start=0x%lx end=0x%lx",
 		__entry->name,
 		__entry->nr_to_write,
@@ -129,7 +127,6 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__entry->for_background,
 		__entry->for_reclaim,
 		__entry->range_cyclic,
-		__entry->more_io,
 		__entry->range_start,
 		__entry->range_end)
 )




* [PATCH 4/6] writeback: sync expired inodes first in background writeback
  2010-07-22  5:09 [PATCH 0/6] [RFC] writeback: try to write older pages first Wu Fengguang
                   ` (2 preceding siblings ...)
  2010-07-22  5:09 ` [PATCH 3/6] writeback: kill writeback_control.more_io Wu Fengguang
@ 2010-07-22  5:09 ` Wu Fengguang
  2010-07-23 18:15   ` Jan Kara
  2010-07-26 10:57   ` Mel Gorman
  2010-07-22  5:09 ` [PATCH 5/6] writeback: try more writeback as long as something was written Wu Fengguang
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 62+ messages in thread
From: Wu Fengguang @ 2010-07-22  5:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Chinner, Jan Kara, Wu Fengguang, Christoph Hellwig,
	Mel Gorman, Chris Mason, Jens Axboe, LKML, linux-fsdevel,
	linux-mm

[-- Attachment #1: writeback-expired-for-background.patch --]
[-- Type: text/plain, Size: 2449 bytes --]

A background flush work may run forever. So it's reasonable for it to
mimic the kupdate behavior of syncing old/expired inodes first.

The policy is
- enqueue all newly expired inodes at each queue_io() time
- retry with a halved expire interval until some inodes are found to sync

CC: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2010-07-22 12:56:42.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-07-22 13:07:51.000000000 +0800
@@ -217,14 +217,14 @@ static void move_expired_inodes(struct l
 				struct writeback_control *wbc)
 {
 	unsigned long expire_interval = 0;
-	unsigned long older_than_this;
+	unsigned long older_than_this = 0; /* reset to kill gcc warning */
 	LIST_HEAD(tmp);
 	struct list_head *pos, *node;
 	struct super_block *sb = NULL;
 	struct inode *inode;
 	int do_sb_sort = 0;
 
-	if (wbc->for_kupdate) {
+	if (wbc->for_kupdate || wbc->for_background) {
 		expire_interval = msecs_to_jiffies(dirty_expire_interval * 10);
 		older_than_this = jiffies - expire_interval;
 	}
@@ -232,8 +232,15 @@ static void move_expired_inodes(struct l
 	while (!list_empty(delaying_queue)) {
 		inode = list_entry(delaying_queue->prev, struct inode, i_list);
 		if (expire_interval &&
-		    inode_dirtied_after(inode, older_than_this))
-			break;
+		    inode_dirtied_after(inode, older_than_this)) {
+			if (wbc->for_background &&
+			    list_empty(dispatch_queue) && list_empty(&tmp)) {
+				expire_interval >>= 1;
+				older_than_this = jiffies - expire_interval;
+				continue;
+			} else
+				break;
+		}
 		if (sb && sb != inode->i_sb)
 			do_sb_sort = 1;
 		sb = inode->i_sb;
@@ -521,7 +528,8 @@ void writeback_inodes_wb(struct bdi_writ
 
 	wbc->wb_start = jiffies; /* livelock avoidance */
 	spin_lock(&inode_lock);
-	if (!wbc->for_kupdate || list_empty(&wb->b_io))
+
+	if (!(wbc->for_kupdate || wbc->for_background) || list_empty(&wb->b_io))
 		queue_io(wb, wbc);
 
 	while (!list_empty(&wb->b_io)) {
@@ -550,7 +558,7 @@ static void __writeback_inodes_sb(struct
 
 	wbc->wb_start = jiffies; /* livelock avoidance */
 	spin_lock(&inode_lock);
-	if (!wbc->for_kupdate || list_empty(&wb->b_io))
+	if (!(wbc->for_kupdate || wbc->for_background) || list_empty(&wb->b_io))
 		queue_io(wb, wbc);
 	writeback_sb_inodes(sb, wb, wbc, true);
 	spin_unlock(&inode_lock);




* [PATCH 5/6] writeback: try more writeback as long as something was written
  2010-07-22  5:09 [PATCH 0/6] [RFC] writeback: try to write older pages first Wu Fengguang
                   ` (3 preceding siblings ...)
  2010-07-22  5:09 ` [PATCH 4/6] writeback: sync expired inodes first in background writeback Wu Fengguang
@ 2010-07-22  5:09 ` Wu Fengguang
  2010-07-23 17:39   ` Jan Kara
  2010-07-26 11:01   ` Mel Gorman
  2010-07-22  5:09 ` [PATCH 6/6] writeback: introduce writeback_control.inodes_written Wu Fengguang
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 62+ messages in thread
From: Wu Fengguang @ 2010-07-22  5:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Chinner, Wu Fengguang, Christoph Hellwig, Mel Gorman,
	Chris Mason, Jens Axboe, LKML, linux-fsdevel, linux-mm

[-- Attachment #1: writeback-background-retry.patch --]
[-- Type: text/plain, Size: 2493 bytes --]

writeback_inodes_wb()/__writeback_inodes_sb() are not aggressive in that
they only populate b_io when necessary, at entrance time. When the queued
set of inodes has all been synced, they just return, possibly with
wbc.nr_to_write > 0.

For kupdate and background writeback, more eligible inodes may be
sitting in b_dirty when the current set of b_io inodes is completed. So
it is necessary to try another round of writeback as long as we made some
progress in this round. When there are no more eligible inodes, no more
inodes will be enqueued in queue_io(), hence nothing could/will be
synced and we may safely bail.

This will livelock sync when there are heavy dirtiers. However in that case
sync will already be livelocked w/o this patch, as the current livelock
avoidance code is virtually a no-op (for one thing, wb_start should be
set statically at sync start time and be used in move_expired_inodes()).
The sync livelock problem will be addressed in other patches.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2010-07-22 13:07:51.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-07-22 13:07:54.000000000 +0800
@@ -640,20 +640,23 @@ static long wb_writeback(struct bdi_writ
 		wrote += MAX_WRITEBACK_PAGES - wbc.nr_to_write;
 
 		/*
-		 * If we consumed everything, see if we have more
+		 * Did we write something? Try for more
+		 *
+		 * This is needed _before_ the b_more_io test because the
+		 * background writeback moves inodes to b_io and works on
+		 * them in batches (in order to sync old pages first).  The
+		 * completion of the current batch does not necessarily mean
+		 * the overall work is done.
 		 */
-		if (wbc.nr_to_write <= 0)
+		if (wbc.nr_to_write < MAX_WRITEBACK_PAGES)
 			continue;
+
 		/*
-		 * Didn't write everything and we don't have more IO, bail
+		 * Nothing written and no more inodes for IO, bail
 		 */
 		if (list_empty(&wb->b_more_io))
 			break;
-		/*
-		 * Did we write something? Try for more
-		 */
-		if (wbc.nr_to_write < MAX_WRITEBACK_PAGES)
-			continue;
+
 		/*
 		 * Nothing written. Wait for some inode to
 		 * become available for writeback. Otherwise




* [PATCH 6/6] writeback: introduce writeback_control.inodes_written
  2010-07-22  5:09 [PATCH 0/6] [RFC] writeback: try to write older pages first Wu Fengguang
                   ` (4 preceding siblings ...)
  2010-07-22  5:09 ` [PATCH 5/6] writeback: try more writeback as long as something was written Wu Fengguang
@ 2010-07-22  5:09 ` Wu Fengguang
  2010-07-26 11:04   ` Mel Gorman
  2010-07-23 10:24 ` [PATCH 0/6] [RFC] writeback: try to write older pages first Mel Gorman
  2010-07-26 10:28 ` Itaru Kitayama
  7 siblings, 1 reply; 62+ messages in thread
From: Wu Fengguang @ 2010-07-22  5:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Chinner, Wu Fengguang, Christoph Hellwig, Mel Gorman,
	Chris Mason, Jens Axboe, LKML, linux-fsdevel, linux-mm

[-- Attachment #1: writeback-inodes_written.patch --]
[-- Type: text/plain, Size: 2028 bytes --]

Introduce writeback_control.inodes_written to count successful
->write_inode() calls.  A non-zero value means some progress was made
on writeback, in which case more writeback will be tried.

This prevents aborting a background writeback work prematurely when
the current set of inodes for IO happens to be metadata-only dirty.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c         |    5 +++++
 include/linux/writeback.h |    1 +
 2 files changed, 6 insertions(+)

--- linux-next.orig/fs/fs-writeback.c	2010-07-22 13:07:54.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-07-22 13:07:58.000000000 +0800
@@ -379,6 +379,8 @@ writeback_single_inode(struct inode *ino
 		int err = write_inode(inode, wbc);
 		if (ret == 0)
 			ret = err;
+		if (!err)
+			wbc->inodes_written++;
 	}
 
 	spin_lock(&inode_lock);
@@ -628,6 +630,7 @@ static long wb_writeback(struct bdi_writ
 
 		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
 		wbc.pages_skipped = 0;
+		wbc.inodes_written = 0;
 
 		trace_wbc_writeback_start(&wbc, wb->bdi);
 		if (work->sb)
@@ -650,6 +653,8 @@ static long wb_writeback(struct bdi_writ
 		 */
 		if (wbc.nr_to_write < MAX_WRITEBACK_PAGES)
 			continue;
+		if (wbc.inodes_written)
+			continue;
 
 		/*
 		 * Nothing written and no more inodes for IO, bail
--- linux-next.orig/include/linux/writeback.h	2010-07-22 11:24:46.000000000 +0800
+++ linux-next/include/linux/writeback.h	2010-07-22 13:07:58.000000000 +0800
@@ -34,6 +34,7 @@ struct writeback_control {
 	long nr_to_write;		/* Write this many pages, and decrement
 					   this for each page written */
 	long pages_skipped;		/* Pages which were not written */
+	long inodes_written;		/* Number of inodes(metadata) synced */
 
 	/*
 	 * For a_ops->writepages(): is start or end are non-zero then this is




* Re: [PATCH 0/6] [RFC] writeback: try to write older pages first
  2010-07-22  5:09 [PATCH 0/6] [RFC] writeback: try to write older pages first Wu Fengguang
                   ` (5 preceding siblings ...)
  2010-07-22  5:09 ` [PATCH 6/6] writeback: introduce writeback_control.inodes_written Wu Fengguang
@ 2010-07-23 10:24 ` Mel Gorman
  2010-07-26  7:18   ` Wu Fengguang
  2010-07-26 10:28 ` Itaru Kitayama
  7 siblings, 1 reply; 62+ messages in thread
From: Mel Gorman @ 2010-07-23 10:24 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Dave Chinner, Christoph Hellwig, Chris Mason,
	Jens Axboe, LKML, linux-fsdevel, linux-mm

I queued these up for testing yesterday before starting a review. For
anyone watching, the following patches are pre-requisites from
linux-next if one wants to test against 2.6.35-rc5. I did this because I
wanted to test as few changes as possible.

a75db72d30a6402f4b1d841af3b4ce43682d0ac4 writeback: remove wb_list 
2225753c10aef6af9c764a295b71d11bc483c4d6 writeback: merge bdi_writeback_task and bdi_start_fn
aab24fcf6f5ccf0e8de3cc333559bddf9a46f11e writeback: Initial tracing support
f689fba23f3819e3e0bc237c104f2ec25decc219 writeback: Add tracing to balance_dirty_pages
ca43586868b49eb5a07d895708e4d257e2df814e simplify checks for I_CLEAR/I_FREEING

I applied your series on top of this and fired it up. The ordering of
patch application was still the same:

tracing
no direct writeback
Wu's patches and Christoph's pre-reqs from linux-next
Kick flusher threads when dirty pages applied

With them applied, btrfs failed to build but if it builds for you, it
just means I didn't bring a required patch from linux-next. I was
testing against XFS so I didn't dig too deep.

On Thu, Jul 22, 2010 at 01:09:28PM +0800, Wu Fengguang wrote:
> 
> The basic way of avoiding pageout() is to make the flusher sync inodes in the
> right order. Oldest dirty inodes contains oldest pages. The smaller inode it
> is, the more correlation between inode dirty time and its pages' dirty time.
> So for small dirty inodes, syncing in the order of inode dirty time is able to
> avoid pageout(). If pageout() is still triggered frequently in this case, the
> 30s dirty expire time may be too long and could be shrinked adaptively; or it
> may be a stressed memcg list whose dirty inodes/pages are more hard to track.
> 

Have you confirmed this theory with the trace points? It makes perfect
sense and is very rational but proof is a plus. I'm guessing you have
some decent writeback-related tests that might be of use. Mine have a
big mix of anon and file writeback so it's not as clear-cut.

Monitoring it isn't hard. Mount debugfs, enable the vmscan tracepoints
and read trace_pipe. To reduce interference, I always pipe it
through gzip and do post-processing afterwards offline with the script
included in Documentation/

Here is what I got from sysbench on x86-64 (other machines hours away)


SYSBENCH FTrace Reclaim Statistics
                    traceonly-v5r6         nodirect-v5r7      flusholdest-v5r7     flushforward-v5r7
Direct reclaims                                683        785        670        938 
Direct reclaim pages scanned                199776     161195     200400     166639 
Direct reclaim write file async I/O          64802          0          0          0 
Direct reclaim write anon async I/O           1009        419       1184      11390 
Direct reclaim write file sync I/O              18          0          0          0 
Direct reclaim write anon sync I/O               0          0          0          0 
Wake kswapd requests                        685360     697255     691009     864602 
Kswapd wakeups                                1596       1517       1517       1545 
Kswapd pages scanned                      17527865   16817554   16816510   15032525 
Kswapd reclaim write file async I/O         888082     618123     649167     147903 
Kswapd reclaim write anon async I/O         229724     229123     233639     243561 
Kswapd reclaim write file sync I/O               0          0          0          0 
Kswapd reclaim write anon sync I/O               0          0          0          0 
Time stalled direct reclaim (ms)             32.79      22.47      19.75       6.34 
Time kswapd awake (ms)                     2192.03    2165.17    2112.73    2055.90 

User/Sys Time Running Test (seconds)         663.3    656.37    664.14    654.63
Percentage Time Spent Direct Reclaim         0.00%     0.00%     0.00%     0.00%
Total Elapsed Time (seconds)               6703.22   6468.78   6472.69   6479.62
Percentage Time kswapd Awake                 0.03%     0.00%     0.00%     0.00%

Flush oldest actually increased the number of pages written back by
kswapd, but the anon writeback is also high as swap is involved. Kicking
flusher threads also helps a lot. It helps less than the previous release
because I noticed I was kicking flusher threads for both anon and file
dirty pages, which is cheating. It's now only waking the threads for
file pages. It's still an 84% reduction overall, so nothing to sneeze at.

What the patch did do was reduce the time stalled in direct reclaim and
the time kswapd spent awake, so it still might be going in the right
direction. I don't have a feel for how much the writeback figures change
between runs because they take so long to run.

STRESS-HIGHALLOC FTrace Reclaim Statistics
                  stress-highalloc      stress-highalloc      stress-highalloc      stress-highalloc
                    traceonly-v5r6         nodirect-v5r7      flusholdest-v5r7     flushforward-v5r7
Direct reclaims                               1221       1284       1127       1252 
Direct reclaim pages scanned                146220     186156     142075     140617 
Direct reclaim write file async I/O           3433          0          0          0 
Direct reclaim write anon async I/O          25238      28758      23940      23247 
Direct reclaim write file sync I/O            3095          0          0          0 
Direct reclaim write anon sync I/O           10911     305579     281824     246251 
Wake kswapd requests                          1193       1196       1088       1209 
Kswapd wakeups                                 805        824        758        804 
Kswapd pages scanned                      30953364   52621368   42722498   30945547 
Kswapd reclaim write file async I/O         898087     241135     570467      54319 
Kswapd reclaim write anon async I/O        2278607    2201894    1885741    1949170 
Kswapd reclaim write file sync I/O               0          0          0          0 
Kswapd reclaim write anon sync I/O               0          0          0          0 
Time stalled direct reclaim (ms)           8567.29    6628.83    6520.39    6947.23 
Time kswapd awake (ms)                     5847.60    3589.43    3900.74   15837.59 

User/Sys Time Running Test (seconds)       2824.76   2833.05   2833.26   2830.46
Percentage Time Spent Direct Reclaim         0.25%     0.00%     0.00%     0.00%
Total Elapsed Time (seconds)              10920.14   9021.17   8872.06   9301.86
Percentage Time kswapd Awake                 0.15%     0.00%     0.00%     0.00%

Same here, the number of pages written back by kswapd increased, but
again anon writeback was a big factor. Kicking threads when dirty pages
are encountered still helps a lot, with a 94% reduction of pages written
back overall.

Also, your patch really helped the time spent stalled by direct reclaim,
and kswapd was awake a lot less, with tests completing far faster.

Overall, I still think your series is a big help (although I don't know if
the patches in linux-next are also making a difference), but it's not actually
reducing the pages encountered by direct reclaim. Maybe that is because
the tests were making more forward progress and so scanning faster. The
sysbench performance results are too varied to draw conclusions from, but it
did slightly improve the success rate of high-order allocations.

The flush-forward patches would appear to be a requirement. Christoph
first described them as a band-aid, but he didn't chuck rocks at me when
the patch was actually released. Right now, I'm leaning towards pushing
it and judging by the Swear Meter how good/bad others think it is. So far
it's: me pro, Rik pro, Christoph maybe.

> For a large dirty inode, it may flush lots of newly dirtied pages _after_
> syncing the expired pages. This is the normal case for a single-stream
> sequential dirtier, where older pages are in lower offsets.  In this case we
> shall not insist on syncing the whole large dirty inode before considering the
> other small dirty inodes. This risks wasting time syncing 1GB freshly dirtied
> pages before syncing the other N*1MB expired dirty pages who are approaching
> the end of the LRU list and hence pageout().
> 

Intuitively, this makes a lot of sense.

> For a large dirty inode, it may also flush lots of newly dirtied pages _before_
> hitting the desired old ones, in which case it helps for pageout() to do some
> clustered writeback, and/or set mapping->writeback_index to help the flusher
> focus on old pages.
> 

Will put this idea on the maybe pile.

> For a large dirty inode, it may also have intermixed old and new dirty pages.
> In this case we need to make sure the inode is queued for IO before some of
> its pages hit pageout(). Adaptive dirty expire time helps here.
> 
> OK, end of the vapour ideas. As for this patchset, it fixes the current
> kupdate/background writeback priority:
> 
> - the kupdate/background writeback shall include newly expired inodes at each
>   queue_io() time, as the large inodes left over from previous writeback rounds
>   are likely to have less density of old pages.
> 
> - the background writeback shall consider expired inodes first, just like the
>   kupdate writeback
> 

I haven't actually reviewed these. I got testing kicked off first
because it didn't require brains :)

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 5/6] writeback: try more writeback as long as something was written
  2010-07-22  5:09 ` [PATCH 5/6] writeback: try more writeback as long as something was written Wu Fengguang
@ 2010-07-23 17:39   ` Jan Kara
  2010-07-26 12:39     ` Wu Fengguang
  2010-07-26 11:01   ` Mel Gorman
  1 sibling, 1 reply; 62+ messages in thread
From: Jan Kara @ 2010-07-23 17:39 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Dave Chinner, Christoph Hellwig, Mel Gorman,
	Chris Mason, Jens Axboe, LKML, linux-fsdevel, linux-mm

On Thu 22-07-10 13:09:33, Wu Fengguang wrote:
> writeback_inodes_wb()/__writeback_inodes_sb() are not aggressive in that
> they only populate b_io when necessary at entrance time. When the queued
> set of inodes are all synced, they just return, possibly with
> wbc.nr_to_write > 0.
> 
> For kupdate and background writeback, there may be more eligible inodes
> sitting in b_dirty when the current set of b_io inodes are completed. So
> it is necessary to try another round of writeback as long as we made some
> progress in this round. When there are no more eligible inodes, no more
> inodes will be enqueued in queue_io(), hence nothing could/will be
> synced and we may safely bail.
> 
> This will livelock sync when there are heavy dirtiers. However in that case
> sync will already be livelocked w/o this patch, as the current livelock
> avoidance code is virtually a no-op (for one thing, wb_start should be
> set statically at sync start time and be used in move_expired_inodes()).
> The sync livelock problem will be addressed in other patches.
  Hmm, any reason why you don't solve this problem by just removing the
condition before queue_io()? It would also make the logic simpler - always
queue all inodes that are eligible for writeback...

								Honza


> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/fs-writeback.c |   19 +++++++++++--------
>  1 file changed, 11 insertions(+), 8 deletions(-)
> 
> --- linux-next.orig/fs/fs-writeback.c	2010-07-22 13:07:51.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2010-07-22 13:07:54.000000000 +0800
> @@ -640,20 +640,23 @@ static long wb_writeback(struct bdi_writ
>  		wrote += MAX_WRITEBACK_PAGES - wbc.nr_to_write;
>  
>  		/*
> -		 * If we consumed everything, see if we have more
> +		 * Did we write something? Try for more
> +		 *
> +		 * This is needed _before_ the b_more_io test because the
> +		 * background writeback moves inodes to b_io and works on
> +		 * them in batches (in order to sync old pages first).  The
> +		 * completion of the current batch does not necessarily mean
> +		 * the overall work is done.
>  		 */
> -		if (wbc.nr_to_write <= 0)
> +		if (wbc.nr_to_write < MAX_WRITEBACK_PAGES)
>  			continue;
> +
>  		/*
> -		 * Didn't write everything and we don't have more IO, bail
> +		 * Nothing written and no more inodes for IO, bail
>  		 */
>  		if (list_empty(&wb->b_more_io))
>  			break;
> -		/*
> -		 * Did we write something? Try for more
> -		 */
> -		if (wbc.nr_to_write < MAX_WRITEBACK_PAGES)
> -			continue;
> +
>  		/*
>  		 * Nothing written. Wait for some inode to
>  		 * become available for writeback. Otherwise
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 4/6] writeback: sync expired inodes first in background writeback
  2010-07-22  5:09 ` [PATCH 4/6] writeback: sync expired inodes first in background writeback Wu Fengguang
@ 2010-07-23 18:15   ` Jan Kara
  2010-07-26 11:51     ` Wu Fengguang
  2010-07-26 10:57   ` Mel Gorman
  1 sibling, 1 reply; 62+ messages in thread
From: Jan Kara @ 2010-07-23 18:15 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Dave Chinner, Jan Kara, Christoph Hellwig,
	Mel Gorman, Chris Mason, Jens Axboe, LKML, linux-fsdevel,
	linux-mm

On Thu 22-07-10 13:09:32, Wu Fengguang wrote:
> A background flush work may run for ever. So it's reasonable for it to
> mimic the kupdate behavior of syncing old/expired inodes first.
> 
> The policy is
> - enqueue all newly expired inodes at each queue_io() time
> - retry with a halved expire interval until we get some inodes to sync
  Hmm, this logic looks a bit arbitrary to me. What I actually don't like
very much about this is that when there aren't inodes older than, say, 2
seconds, you'll end up queueing just the inodes between 2s and 1s. So I'd
rather just queue inodes older than the limit and, if there are none, just
queue all the other dirty inodes.
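
The difference between the two policies can be sketched with a toy model (plain Python over inode dirty ages in seconds; the function names and the 1ms floor are illustrative, not taken from the patch):

```python
def queue_expired_halving(dirty_ages, expire=30.0):
    """The patch's policy: if nothing has expired, halve the
    interval and retry until at least one inode qualifies."""
    interval = expire
    while interval > 1e-3:            # illustrative floor to terminate
        picked = [age for age in dirty_ages if age >= interval]
        if picked:
            return picked
        interval /= 2
    return list(dirty_ages)

def queue_expired_or_all(dirty_ages, expire=30.0):
    """The counter-proposal: queue inodes older than the limit,
    or simply all dirty inodes when none are old enough."""
    picked = [age for age in dirty_ages if age >= expire]
    return picked or list(dirty_ages)

# With inode ages 2.0s and 1.5s and a 30s limit, halving stops at
# 1.875s and queues only the 2.0s inode (the objection above),
# while the fallback policy queues both.
print(queue_expired_halving([2.0, 1.5]))   # -> [2.0]
print(queue_expired_or_all([2.0, 1.5]))    # -> [2.0, 1.5]
```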

								Honza

> CC: Jan Kara <jack@suse.cz>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/fs-writeback.c |   20 ++++++++++++++------
>  1 file changed, 14 insertions(+), 6 deletions(-)
> 
> --- linux-next.orig/fs/fs-writeback.c	2010-07-22 12:56:42.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2010-07-22 13:07:51.000000000 +0800
> @@ -217,14 +217,14 @@ static void move_expired_inodes(struct l
>  				struct writeback_control *wbc)
>  {
>  	unsigned long expire_interval = 0;
> -	unsigned long older_than_this;
> +	unsigned long older_than_this = 0; /* reset to kill gcc warning */
>  	LIST_HEAD(tmp);
>  	struct list_head *pos, *node;
>  	struct super_block *sb = NULL;
>  	struct inode *inode;
>  	int do_sb_sort = 0;
>  
> -	if (wbc->for_kupdate) {
> +	if (wbc->for_kupdate || wbc->for_background) {
>  		expire_interval = msecs_to_jiffies(dirty_expire_interval * 10);
>  		older_than_this = jiffies - expire_interval;
>  	}
> @@ -232,8 +232,15 @@ static void move_expired_inodes(struct l
>  	while (!list_empty(delaying_queue)) {
>  		inode = list_entry(delaying_queue->prev, struct inode, i_list);
>  		if (expire_interval &&
> -		    inode_dirtied_after(inode, older_than_this))
> -			break;
> +		    inode_dirtied_after(inode, older_than_this)) {
> +			if (wbc->for_background &&
> +			    list_empty(dispatch_queue) && list_empty(&tmp)) {
> +				expire_interval >>= 1;
> +				older_than_this = jiffies - expire_interval;
> +				continue;
> +			} else
> +				break;
> +		}
>  		if (sb && sb != inode->i_sb)
>  			do_sb_sort = 1;
>  		sb = inode->i_sb;
> @@ -521,7 +528,8 @@ void writeback_inodes_wb(struct bdi_writ
>  
>  	wbc->wb_start = jiffies; /* livelock avoidance */
>  	spin_lock(&inode_lock);
> -	if (!wbc->for_kupdate || list_empty(&wb->b_io))
> +
> +	if (!(wbc->for_kupdate || wbc->for_background) || list_empty(&wb->b_io))
>  		queue_io(wb, wbc);
>  
>  	while (!list_empty(&wb->b_io)) {
> @@ -550,7 +558,7 @@ static void __writeback_inodes_sb(struct
>  
>  	wbc->wb_start = jiffies; /* livelock avoidance */
>  	spin_lock(&inode_lock);
> -	if (!wbc->for_kupdate || list_empty(&wb->b_io))
> +	if (!(wbc->for_kupdate || wbc->for_background) || list_empty(&wb->b_io))
>  		queue_io(wb, wbc);
>  	writeback_sb_inodes(sb, wb, wbc, true);
>  	spin_unlock(&inode_lock);
> 
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 1/6] writeback: pass writeback_control down to move_expired_inodes()
  2010-07-22  5:09 ` [PATCH 1/6] writeback: pass writeback_control down to move_expired_inodes() Wu Fengguang
@ 2010-07-23 18:16   ` Jan Kara
  2010-07-26 10:44   ` Mel Gorman
  2010-08-01 15:23   ` Minchan Kim
  2 siblings, 0 replies; 62+ messages in thread
From: Jan Kara @ 2010-07-23 18:16 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Dave Chinner, Christoph Hellwig, Mel Gorman,
	Chris Mason, Jens Axboe, LKML, linux-fsdevel, linux-mm

On Thu 22-07-10 13:09:29, Wu Fengguang wrote:
> This is to prepare for moving the dirty expire policy to move_expired_inodes().
> No behavior change.
  Looks OK.

Acked-by: Jan Kara <jack@suse.cz>

> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/fs-writeback.c |   16 ++++++++--------
>  1 file changed, 8 insertions(+), 8 deletions(-)
> 
> --- linux-next.orig/fs/fs-writeback.c	2010-07-21 20:12:38.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2010-07-21 20:14:38.000000000 +0800
> @@ -213,8 +213,8 @@ static bool inode_dirtied_after(struct i
>   * Move expired dirty inodes from @delaying_queue to @dispatch_queue.
>   */
>  static void move_expired_inodes(struct list_head *delaying_queue,
> -			       struct list_head *dispatch_queue,
> -				unsigned long *older_than_this)
> +				struct list_head *dispatch_queue,
> +				struct writeback_control *wbc)
>  {
>  	LIST_HEAD(tmp);
>  	struct list_head *pos, *node;
> @@ -224,8 +224,8 @@ static void move_expired_inodes(struct l
>  
>  	while (!list_empty(delaying_queue)) {
>  		inode = list_entry(delaying_queue->prev, struct inode, i_list);
> -		if (older_than_this &&
> -		    inode_dirtied_after(inode, *older_than_this))
> +		if (wbc->older_than_this &&
> +		    inode_dirtied_after(inode, *wbc->older_than_this))
>  			break;
>  		if (sb && sb != inode->i_sb)
>  			do_sb_sort = 1;
> @@ -257,10 +257,10 @@ static void move_expired_inodes(struct l
>   *                 => b_more_io inodes
>   *                 => remaining inodes in b_io => (dequeue for sync)
>   */
> -static void queue_io(struct bdi_writeback *wb, unsigned long *older_than_this)
> +static void queue_io(struct bdi_writeback *wb, struct writeback_control *wbc)
>  {
>  	list_splice_init(&wb->b_more_io, &wb->b_io);
> -	move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this);
> +	move_expired_inodes(&wb->b_dirty, &wb->b_io, wbc);
>  }
>  
>  static int write_inode(struct inode *inode, struct writeback_control *wbc)
> @@ -519,7 +519,7 @@ void writeback_inodes_wb(struct bdi_writ
>  	wbc->wb_start = jiffies; /* livelock avoidance */
>  	spin_lock(&inode_lock);
>  	if (!wbc->for_kupdate || list_empty(&wb->b_io))
> -		queue_io(wb, wbc->older_than_this);
> +		queue_io(wb, wbc);
>  
>  	while (!list_empty(&wb->b_io)) {
>  		struct inode *inode = list_entry(wb->b_io.prev,
> @@ -548,7 +548,7 @@ static void __writeback_inodes_sb(struct
>  	wbc->wb_start = jiffies; /* livelock avoidance */
>  	spin_lock(&inode_lock);
>  	if (!wbc->for_kupdate || list_empty(&wb->b_io))
> -		queue_io(wb, wbc->older_than_this);
> +		queue_io(wb, wbc);
>  	writeback_sb_inodes(sb, wb, wbc, true);
>  	spin_unlock(&inode_lock);
>  }
> 
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 2/6] writeback: the kupdate expire timestamp should be a moving target
  2010-07-22  5:09 ` [PATCH 2/6] writeback: the kupdate expire timestamp should be a moving target Wu Fengguang
@ 2010-07-23 18:17   ` Jan Kara
  2010-07-26 10:52   ` Mel Gorman
  2010-08-01 15:29   ` Minchan Kim
  2 siblings, 0 replies; 62+ messages in thread
From: Jan Kara @ 2010-07-23 18:17 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Dave Chinner, Jan Kara, Christoph Hellwig,
	Mel Gorman, Chris Mason, Jens Axboe, LKML, linux-fsdevel,
	linux-mm

On Thu 22-07-10 13:09:30, Wu Fengguang wrote:
> Dynamically compute the dirty expire timestamp at queue_io() time.
> Also remove writeback_control.older_than_this which is no longer used.
> 
> writeback_control.older_than_this used to be determined at entrance to
> the kupdate writeback work. This _static_ timestamp may go stale if the
> kupdate work runs on and on. The flusher may then get stuck with some old
> busy inodes, never considering newly expired inodes thereafter.
  This seems to make sense. The patch looks fine as well.

Acked-by: Jan Kara <jack@suse.cz>

								Honza
> 
> This has two possible problems:
> 
> - It is unfair for a large dirty inode to delay (for a long time) the
>   writeback of small dirty inodes.
> 
> - As time goes by, the large and busy dirty inode may contain only
>   _freshly_ dirtied pages. Ignoring newly expired dirty inodes risks
>   delaying the expired dirty pages to the end of LRU lists, triggering
>   the very bad pageout(). Nevertheless this patch merely addresses part
>   of the problem.
> 
> CC: Jan Kara <jack@suse.cz>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/fs-writeback.c                |   24 +++++++++---------------
>  include/linux/writeback.h        |    2 --
>  include/trace/events/writeback.h |    6 +-----
>  mm/backing-dev.c                 |    1 -
>  mm/page-writeback.c              |    1 -
>  5 files changed, 10 insertions(+), 24 deletions(-)
> 
> --- linux-next.orig/fs/fs-writeback.c	2010-07-21 22:20:01.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2010-07-22 11:23:27.000000000 +0800
> @@ -216,16 +216,23 @@ static void move_expired_inodes(struct l
>  				struct list_head *dispatch_queue,
>  				struct writeback_control *wbc)
>  {
> +	unsigned long expire_interval = 0;
> +	unsigned long older_than_this;
>  	LIST_HEAD(tmp);
>  	struct list_head *pos, *node;
>  	struct super_block *sb = NULL;
>  	struct inode *inode;
>  	int do_sb_sort = 0;
>  
> +	if (wbc->for_kupdate) {
> +		expire_interval = msecs_to_jiffies(dirty_expire_interval * 10);
> +		older_than_this = jiffies - expire_interval;
> +	}
> +
>  	while (!list_empty(delaying_queue)) {
>  		inode = list_entry(delaying_queue->prev, struct inode, i_list);
> -		if (wbc->older_than_this &&
> -		    inode_dirtied_after(inode, *wbc->older_than_this))
> +		if (expire_interval &&
> +		    inode_dirtied_after(inode, older_than_this))
>  			break;
>  		if (sb && sb != inode->i_sb)
>  			do_sb_sort = 1;
> @@ -583,29 +590,19 @@ static inline bool over_bground_thresh(v
>   * Try to run once per dirty_writeback_interval.  But if a writeback event
>   * takes longer than a dirty_writeback_interval interval, then leave a
>   * one-second gap.
> - *
> - * older_than_this takes precedence over nr_to_write.  So we'll only write back
> - * all dirty pages if they are all attached to "old" mappings.
>   */
>  static long wb_writeback(struct bdi_writeback *wb,
>  			 struct wb_writeback_work *work)
>  {
>  	struct writeback_control wbc = {
>  		.sync_mode		= work->sync_mode,
> -		.older_than_this	= NULL,
>  		.for_kupdate		= work->for_kupdate,
>  		.for_background		= work->for_background,
>  		.range_cyclic		= work->range_cyclic,
>  	};
> -	unsigned long oldest_jif;
>  	long wrote = 0;
>  	struct inode *inode;
>  
> -	if (wbc.for_kupdate) {
> -		wbc.older_than_this = &oldest_jif;
> -		oldest_jif = jiffies -
> -				msecs_to_jiffies(dirty_expire_interval * 10);
> -	}
>  	if (!wbc.range_cyclic) {
>  		wbc.range_start = 0;
>  		wbc.range_end = LLONG_MAX;
> @@ -998,9 +995,6 @@ EXPORT_SYMBOL(__mark_inode_dirty);
>   * Write out a superblock's list of dirty inodes.  A wait will be performed
>   * upon no inodes, all inodes or the final one, depending upon sync_mode.
>   *
> - * If older_than_this is non-NULL, then only write out inodes which
> - * had their first dirtying at a time earlier than *older_than_this.
> - *
>   * If `bdi' is non-zero then we're being asked to writeback a specific queue.
>   * This function assumes that the blockdev superblock's inodes are backed by
>   * a variety of queues, so all inodes are searched.  For other superblocks,
> --- linux-next.orig/include/linux/writeback.h	2010-07-21 22:20:02.000000000 +0800
> +++ linux-next/include/linux/writeback.h	2010-07-22 11:23:27.000000000 +0800
> @@ -28,8 +28,6 @@ enum writeback_sync_modes {
>   */
>  struct writeback_control {
>  	enum writeback_sync_modes sync_mode;
> -	unsigned long *older_than_this;	/* If !NULL, only write back inodes
> -					   older than this */
>  	unsigned long wb_start;         /* Time writeback_inodes_wb was
>  					   called. This is needed to avoid
>  					   extra jobs and livelock */
> --- linux-next.orig/include/trace/events/writeback.h	2010-07-21 22:20:02.000000000 +0800
> +++ linux-next/include/trace/events/writeback.h	2010-07-22 11:23:27.000000000 +0800
> @@ -100,7 +100,6 @@ DECLARE_EVENT_CLASS(wbc_class,
>  		__field(int, for_reclaim)
>  		__field(int, range_cyclic)
>  		__field(int, more_io)
> -		__field(unsigned long, older_than_this)
>  		__field(long, range_start)
>  		__field(long, range_end)
>  	),
> @@ -115,14 +114,12 @@ DECLARE_EVENT_CLASS(wbc_class,
>  		__entry->for_reclaim	= wbc->for_reclaim;
>  		__entry->range_cyclic	= wbc->range_cyclic;
>  		__entry->more_io	= wbc->more_io;
> -		__entry->older_than_this = wbc->older_than_this ?
> -						*wbc->older_than_this : 0;
>  		__entry->range_start	= (long)wbc->range_start;
>  		__entry->range_end	= (long)wbc->range_end;
>  	),
>  
>  	TP_printk("bdi %s: towrt=%ld skip=%ld mode=%d kupd=%d "
> -		"bgrd=%d reclm=%d cyclic=%d more=%d older=0x%lx "
> +		"bgrd=%d reclm=%d cyclic=%d more=%d "
>  		"start=0x%lx end=0x%lx",
>  		__entry->name,
>  		__entry->nr_to_write,
> @@ -133,7 +130,6 @@ DECLARE_EVENT_CLASS(wbc_class,
>  		__entry->for_reclaim,
>  		__entry->range_cyclic,
>  		__entry->more_io,
> -		__entry->older_than_this,
>  		__entry->range_start,
>  		__entry->range_end)
>  )
> --- linux-next.orig/mm/page-writeback.c	2010-07-21 22:20:02.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2010-07-21 22:20:03.000000000 +0800
> @@ -482,7 +482,6 @@ static void balance_dirty_pages(struct a
>  	for (;;) {
>  		struct writeback_control wbc = {
>  			.sync_mode	= WB_SYNC_NONE,
> -			.older_than_this = NULL,
>  			.nr_to_write	= write_chunk,
>  			.range_cyclic	= 1,
>  		};
> --- linux-next.orig/mm/backing-dev.c	2010-07-22 11:23:34.000000000 +0800
> +++ linux-next/mm/backing-dev.c	2010-07-22 11:23:39.000000000 +0800
> @@ -271,7 +271,6 @@ static void bdi_flush_io(struct backing_
>  {
>  	struct writeback_control wbc = {
>  		.sync_mode		= WB_SYNC_NONE,
> -		.older_than_this	= NULL,
>  		.range_cyclic		= 1,
>  		.nr_to_write		= 1024,
>  	};
> 
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 3/6] writeback: kill writeback_control.more_io
  2010-07-22  5:09 ` [PATCH 3/6] writeback: kill writeback_control.more_io Wu Fengguang
@ 2010-07-23 18:24   ` Jan Kara
  2010-07-26 10:53   ` Mel Gorman
  2010-08-01 15:34   ` Minchan Kim
  2 siblings, 0 replies; 62+ messages in thread
From: Jan Kara @ 2010-07-23 18:24 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Dave Chinner, Christoph Hellwig, Mel Gorman,
	Chris Mason, Jens Axboe, LKML, linux-fsdevel, linux-mm

On Thu 22-07-10 13:09:31, Wu Fengguang wrote:
> When wbc.more_io was first introduced, it indicated whether there was
> at least one superblock whose s_more_io contains more IO work. Now with
> the per-bdi writeback, it can be replaced with a simple b_more_io test.
  Looks fine.

Acked-by: Jan Kara <jack@suse.cz>

> 
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/fs-writeback.c                |    9 ++-------
>  include/linux/writeback.h        |    1 -
>  include/trace/events/writeback.h |    5 +----
>  3 files changed, 3 insertions(+), 12 deletions(-)
> 
> --- linux-next.orig/fs/fs-writeback.c	2010-07-22 11:23:27.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2010-07-22 12:56:42.000000000 +0800
> @@ -507,12 +507,8 @@ static int writeback_sb_inodes(struct su
>  		iput(inode);
>  		cond_resched();
>  		spin_lock(&inode_lock);
> -		if (wbc->nr_to_write <= 0) {
> -			wbc->more_io = 1;
> +		if (wbc->nr_to_write <= 0)
>  			return 1;
> -		}
> -		if (!list_empty(&wb->b_more_io))
> -			wbc->more_io = 1;
>  	}
>  	/* b_io is empty */
>  	return 1;
> @@ -622,7 +618,6 @@ static long wb_writeback(struct bdi_writ
>  		if (work->for_background && !over_bground_thresh())
>  			break;
>  
> -		wbc.more_io = 0;
>  		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
>  		wbc.pages_skipped = 0;
>  
> @@ -644,7 +639,7 @@ static long wb_writeback(struct bdi_writ
>  		/*
>  		 * Didn't write everything and we don't have more IO, bail
>  		 */
> -		if (!wbc.more_io)
> +		if (list_empty(&wb->b_more_io))
>  			break;
>  		/*
>  		 * Did we write something? Try for more
> --- linux-next.orig/include/linux/writeback.h	2010-07-22 11:23:27.000000000 +0800
> +++ linux-next/include/linux/writeback.h	2010-07-22 11:24:46.000000000 +0800
> @@ -49,7 +49,6 @@ struct writeback_control {
>  	unsigned for_background:1;	/* A background writeback */
>  	unsigned for_reclaim:1;		/* Invoked from the page allocator */
>  	unsigned range_cyclic:1;	/* range_start is cyclic */
> -	unsigned more_io:1;		/* more io to be dispatched */
>  };
>  
>  /*
> --- linux-next.orig/include/trace/events/writeback.h	2010-07-22 11:23:27.000000000 +0800
> +++ linux-next/include/trace/events/writeback.h	2010-07-22 11:24:46.000000000 +0800
> @@ -99,7 +99,6 @@ DECLARE_EVENT_CLASS(wbc_class,
>  		__field(int, for_background)
>  		__field(int, for_reclaim)
>  		__field(int, range_cyclic)
> -		__field(int, more_io)
>  		__field(long, range_start)
>  		__field(long, range_end)
>  	),
> @@ -113,13 +112,12 @@ DECLARE_EVENT_CLASS(wbc_class,
>  		__entry->for_background	= wbc->for_background;
>  		__entry->for_reclaim	= wbc->for_reclaim;
>  		__entry->range_cyclic	= wbc->range_cyclic;
> -		__entry->more_io	= wbc->more_io;
>  		__entry->range_start	= (long)wbc->range_start;
>  		__entry->range_end	= (long)wbc->range_end;
>  	),
>  
>  	TP_printk("bdi %s: towrt=%ld skip=%ld mode=%d kupd=%d "
> -		"bgrd=%d reclm=%d cyclic=%d more=%d "
> +		"bgrd=%d reclm=%d cyclic=%d "
>  		"start=0x%lx end=0x%lx",
>  		__entry->name,
>  		__entry->nr_to_write,
> @@ -129,7 +127,6 @@ DECLARE_EVENT_CLASS(wbc_class,
>  		__entry->for_background,
>  		__entry->for_reclaim,
>  		__entry->range_cyclic,
> -		__entry->more_io,
>  		__entry->range_start,
>  		__entry->range_end)
>  )
> 
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 0/6] [RFC] writeback: try to write older pages first
  2010-07-23 10:24 ` [PATCH 0/6] [RFC] writeback: try to write older pages first Mel Gorman
@ 2010-07-26  7:18   ` Wu Fengguang
  2010-07-26 10:42     ` Mel Gorman
  0 siblings, 1 reply; 62+ messages in thread
From: Wu Fengguang @ 2010-07-26  7:18 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Dave Chinner, Christoph Hellwig, Chris Mason,
	Jens Axboe, LKML, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org

> On Thu, Jul 22, 2010 at 01:09:28PM +0800, Wu Fengguang wrote:
> > 
> > The basic way of avoiding pageout() is to make the flusher sync inodes in the
> > right order. Oldest dirty inodes contains oldest pages. The smaller inode it
> > is, the more correlation between inode dirty time and its pages' dirty time.
> > So for small dirty inodes, syncing in the order of inode dirty time is able to
> > avoid pageout(). If pageout() is still triggered frequently in this case, the
> > 30s dirty expire time may be too long and could be shrunk adaptively; or it
> > may be a stressed memcg list whose dirty inodes/pages are harder to track.
> > 
> 
> Have you confirmed this theory with the trace points? It makes perfect
> sense and is very rational but proof is a plus.

The proof would be simple.

On average, it takes longer to dirty a large file than a small one.

For example, when uploading files to a file server with 1MB/s
throughput, it will take 10s for a 10MB file and 30s for a 30MB file.
This is the common case.

Another case is a fast dirtier. It may take 10ms to dirty a 100MB
file and 10s to dirty a 1GB file -- the latter gets dirty-throttled
down to the much lower IO throughput once there are too many dirty
pages. The opposite may happen, but this case is the more likely one.
If both are throttled, it degenerates to the file server case above.

So large files tend to contain dirty pages of more varied age.
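
The file-server numbers above amount to a one-line model (a toy illustration of the argument, not measured data):

```python
def dirty_age_spread_seconds(file_mb, throughput_mb_s):
    """Toy model: writing a file sequentially at a fixed throughput,
    the first page is dirtied file_mb/throughput seconds before the
    last, so the dirty-age spread grows linearly with file size."""
    return file_mb / throughput_mb_s

# The 1MB/s file-server example from above:
print(dirty_age_spread_seconds(10, 1))  # 10MB file -> 10.0s spread
print(dirty_age_spread_seconds(30, 1))  # 30MB file -> 30.0s spread
```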

> I'm guessing you have
> some decent writeback-related tests that might be of use. Mine have a
> big mix of anon and file writeback so it's not as clear-cut.

A neat trick is to run your test with `swapoff -a` :)

Seriously, I have no scripts to monitor pageout() calls yet.
I'll explore ways to test it.

> Monitoring it isn't hard. Mount debugfs, enable the vmscan tracepoints
> and read the tracing_pipe. To reduce interference, I always pipe it
> through gzip and do post-processing afterwards offline with the script
> included in Documentation/

Thanks for the tip!
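
For the record, the procedure Mel describes boils down to something like
the following (standard ftrace paths; the post-processing script he
mentions lives under Documentation/ and is run offline afterwards):

```shell
# mount debugfs and enable all vmscan tracepoints
mount -t debugfs none /sys/kernel/debug
echo 1 > /sys/kernel/debug/tracing/events/vmscan/enable

# read trace_pipe, compressing on the fly to reduce interference;
# post-process the gzipped capture offline later
cat /sys/kernel/debug/tracing/trace_pipe | gzip > vmscan-trace.gz
```

This needs root and a kernel built with CONFIG_TRACEPOINTS and the
vmscan trace events.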

> Here is what I got from sysbench on x86-64 (other machines hours away)
> 
> 
> SYSBENCH FTrace Reclaim Statistics
>                     traceonly-v5r6         nodirect-v5r7      flusholdest-v5r7     flushforward-v5r7
> Direct reclaims                                683        785        670        938 
> Direct reclaim pages scanned                199776     161195     200400     166639 
> Direct reclaim write file async I/O          64802          0          0          0 
> Direct reclaim write anon async I/O           1009        419       1184      11390 
> Direct reclaim write file sync I/O              18          0          0          0 
> Direct reclaim write anon sync I/O               0          0          0          0 
> Wake kswapd requests                        685360     697255     691009     864602 
> Kswapd wakeups                                1596       1517       1517       1545 
> Kswapd pages scanned                      17527865   16817554   16816510   15032525 
> Kswapd reclaim write file async I/O         888082     618123     649167     147903 
> Kswapd reclaim write anon async I/O         229724     229123     233639     243561 
> Kswapd reclaim write file sync I/O               0          0          0          0 
> Kswapd reclaim write anon sync I/O               0          0          0          0 

> Time stalled direct reclaim (ms)             32.79      22.47      19.75       6.34 
> Time kswapd awake (ms)                     2192.03    2165.17    2112.73    2055.90 

I noticed that $total_direct_latency is divided by 1000 before
printing the above lines, so the unit should be seconds?

> User/Sys Time Running Test (seconds)         663.3    656.37    664.14    654.63
> Percentage Time Spent Direct Reclaim         0.00%     0.00%     0.00%     0.00%
> Total Elapsed Time (seconds)               6703.22   6468.78   6472.69   6479.62
> Percentage Time kswapd Awake                 0.03%     0.00%     0.00%     0.00%

I don't see the code for generating the "Percentage" lines. And the
numbers seem too small to be true.

> Flush oldest actually increased the number of pages written back by
> kswapd but the anon writeback is also high as swap is involved. Kicking
> flusher threads also helps a lot. It helps less than the previous release
> flusher threads also helps a lot. It helps less than previous released
> because I noticed I was kicking flusher threads for both anon and file
> dirty pages which is cheating. It's now only waking the threads for
> file. It's still a reduction of 84% overall so nothing to sneeze at.
> 
> What the patch did do was reduce time stalled in direct reclaim and time
> kswapd spent awake so it still might be going the right direction. I
> don't have a feeling for how much the writeback figures change between
> runs because they take so long to run.
> 
> STRESS-HIGHALLOC FTrace Reclaim Statistics
>                   stress-highalloc      stress-highalloc      stress-highalloc      stress-highalloc
>                     traceonly-v5r6         nodirect-v5r7      flusholdest-v5r7     flushforward-v5r7
> Direct reclaims                               1221       1284       1127       1252 
> Direct reclaim pages scanned                146220     186156     142075     140617 
> Direct reclaim write file async I/O           3433          0          0          0 
> Direct reclaim write anon async I/O          25238      28758      23940      23247 
> Direct reclaim write file sync I/O            3095          0          0          0 
> Direct reclaim write anon sync I/O           10911     305579     281824     246251 
> Wake kswapd requests                          1193       1196       1088       1209 
> Kswapd wakeups                                 805        824        758        804 
> Kswapd pages scanned                      30953364   52621368   42722498   30945547 
> Kswapd reclaim write file async I/O         898087     241135     570467      54319 
> Kswapd reclaim write anon async I/O        2278607    2201894    1885741    1949170 
> Kswapd reclaim write file sync I/O               0          0          0          0 
> Kswapd reclaim write anon sync I/O               0          0          0          0 
> Time stalled direct reclaim (ms)           8567.29    6628.83    6520.39    6947.23 
> Time kswapd awake (ms)                     5847.60    3589.43    3900.74   15837.59 
> 
> User/Sys Time Running Test (seconds)       2824.76   2833.05   2833.26   2830.46
> Percentage Time Spent Direct Reclaim         0.25%     0.00%     0.00%     0.00%
> Total Elapsed Time (seconds)              10920.14   9021.17   8872.06   9301.86
> Percentage Time kswapd Awake                 0.15%     0.00%     0.00%     0.00%
> 
> Same here, the number of pages written back by kswapd increased but
> again anon writeback was a big factor. Kicking threads when dirty pages
> are encountered still helps a lot with a 94% reduction of pages written
> back overall.

That is impressive! So it definitely helps to reduce the total number of
dirty pages under memory pressure.

> Also, your patch really helped the time spent stalled by direct reclaim
> and kswapd was awake a lot less, with tests completing far faster.

Thanks. So it does improve the dirty page layout in the LRU lists.

> Overall, I still think your series is a big help (although I don't know if
> the patches in linux-next are also making a difference) but it's not actually
> reducing the pages encountered by direct reclaim. Maybe that is because
> the tests were making more forward progress and so scanning faster. The
> sysbench performance results are too varied to draw conclusions from but it
> did slightly improve the success rate of high-order allocations.
> 
> The flush-forward patches would appear to be a requirement. Christoph
> first described them as a band-aid but he didn't chuck rocks at me when
> the patch was actually released. Right now, I'm leaning towards pushing
> it and judge by the Swear Meter how good/bad others think it is. So far
> it's, me pro, Rik pro, Christoph maybe.

Sorry for the delay, I'll help review it.

> > For a large dirty inode, it may flush lots of newly dirtied pages _after_
> > syncing the expired pages. This is the normal case for a single-stream
> > sequential dirtier, where older pages are in lower offsets.  In this case we
> > shall not insist on syncing the whole large dirty inode before considering the
> > other small dirty inodes. This risks wasting time syncing 1GB freshly dirtied
> > pages before syncing the other N*1MB expired dirty pages who are approaching
> > the end of the LRU list and hence pageout().
> > 
> 
> Intuitively, this makes a lot of sense.
> 
> > For a large dirty inode, it may also flush lots of newly dirtied pages _before_
> > hitting the desired old ones, in which case it helps for pageout() to do some
> > clustered writeback, and/or set mapping->writeback_index to help the flusher
> > focus on old pages.
> > 
> 
> Will put this idea on the maybe pile.
> 
> > For a large dirty inode, it may also have intermixed old and new dirty pages.
> > In this case we need to make sure the inode is queued for IO before some of
> > its pages hit pageout(). Adaptive dirty expire time helps here.
> > 
> > OK, end of the vapour ideas. As for this patchset, it fixes the current
> > kupdate/background writeback priority:
> > 
> > - the kupdate/background writeback shall include newly expired inodes at each
> >   queue_io() time, as the large inodes left over from previous writeback rounds
> >   are likely to have less density of old pages.
> > 
> > - the background writeback shall consider expired inodes first, just like the
> >   kupdate writeback
> > 
> 
> I haven't actually reviewed these. I got testing kicked off first
> because it didn't require brains :)

Thanks all the same!

Thanks,
Fengguang

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 0/6] [RFC] writeback: try to write older pages first
  2010-07-22  5:09 [PATCH 0/6] [RFC] writeback: try to write older pages first Wu Fengguang
                   ` (6 preceding siblings ...)
  2010-07-23 10:24 ` [PATCH 0/6] [RFC] writeback: try to write older pages first Mel Gorman
@ 2010-07-26 10:28 ` Itaru Kitayama
  2010-07-26 11:47   ` Wu Fengguang
  7 siblings, 1 reply; 62+ messages in thread
From: Itaru Kitayama @ 2010-07-26 10:28 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Dave Chinner, Christoph Hellwig, Mel Gorman,
	Chris Mason, Jens Axboe, LKML, linux-fsdevel, linux-mm

Hi,
Here's a touch-up patch on top of your changes against the latest
mmotm.

Signed-off-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
---
 fs/btrfs/extent_io.c        |    2 --
 include/trace/events/ext4.h |    5 +----
 2 files changed, 1 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index cb9af26..b494dee 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2586,7 +2586,6 @@ int extent_write_full_page(struct extent_io_tree *tree, struct page *page,
        };
        struct writeback_control wbc_writepages = {
                .sync_mode      = wbc->sync_mode,
-               .older_than_this = NULL,
                .nr_to_write    = 64,
                .range_start    = page_offset(page) + PAGE_CACHE_SIZE,
                .range_end      = (loff_t)-1,
@@ -2619,7 +2618,6 @@ int extent_write_locked_range(struct extent_io_tree *tree, struct inode *inode,
        };
        struct writeback_control wbc_writepages = {
                .sync_mode      = mode,
-               .older_than_this = NULL,
                .nr_to_write    = nr_pages * 2,
                .range_start    = start,
                .range_end      = end + 1,
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index f3865c7..099598b 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -305,7 +305,6 @@ TRACE_EVENT(ext4_da_writepages_result,
                __field(        int,    ret                     )
                __field(        int,    pages_written           )
                __field(        long,   pages_skipped           )
-               __field(        char,   more_io                 )       
                __field(       pgoff_t, writeback_index         )
        ),
 
@@ -315,15 +314,13 @@ TRACE_EVENT(ext4_da_writepages_result,
                __entry->ret            = ret;
                __entry->pages_written  = pages_written;
                __entry->pages_skipped  = wbc->pages_skipped;
-               __entry->more_io        = wbc->more_io;
                __entry->writeback_index = inode->i_mapping->writeback_index;
        ),
 
-       TP_printk("dev %s ino %lu ret %d pages_written %d pages_skipped %ld more_io %d writeback_index %lu",
+       TP_printk("dev %s ino %lu ret %d pages_written %d pages_skipped %ld writeback_index %lu",
                  jbd2_dev_to_name(__entry->dev),
                  (unsigned long) __entry->ino, __entry->ret,
                  __entry->pages_written, __entry->pages_skipped,
-                 __entry->more_io,
                  (unsigned long) __entry->writeback_index)
 );
 
-- 
1.7.1.1


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: [PATCH 0/6] [RFC] writeback: try to write older pages first
  2010-07-26  7:18   ` Wu Fengguang
@ 2010-07-26 10:42     ` Mel Gorman
  0 siblings, 0 replies; 62+ messages in thread
From: Mel Gorman @ 2010-07-26 10:42 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Dave Chinner, Christoph Hellwig, Chris Mason,
	Jens Axboe, LKML, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org

On Mon, Jul 26, 2010 at 03:18:03PM +0800, Wu Fengguang wrote:
> > On Thu, Jul 22, 2010 at 01:09:28PM +0800, Wu Fengguang wrote:
> > > 
> > > The basic way of avoiding pageout() is to make the flusher sync inodes in the
> > > right order. Oldest dirty inodes contains oldest pages. The smaller inode it
> > > is, the more correlation between inode dirty time and its pages' dirty time.
> > > So for small dirty inodes, syncing in the order of inode dirty time is able to
> > > avoid pageout(). If pageout() is still triggered frequently in this case, the
> > > 30s dirty expire time may be too long and could be shrinked adaptively; or it
> > > may be a stressed memcg list whose dirty inodes/pages are more hard to track.
> > > 
> > 
> > Have you confirmed this theory with the trace points? It makes perfect
> > sense and is very rational but proof is a plus.
> 
> The proof would be simple.
> 
> On average, it takes longer to dirty a large file than a small one.
> 
> For example, when uploading files to a file server with 1MB/s
> throughput, it will take 10s to dirty a 10MB file and 30s for a 30MB
> file. This is the common case.
> 
> Another case is some fast dirtier. It may take 10ms to dirty a 100MB
> file and 10s to dirty a 1GB file -- the latter is dirty throttled to
> a much lower IO throughput because of its many dirty pages. The
> opposite may happen, but this scenario is the more likely one. If both
> are throttled, it degenerates to the above file-server case.
> 
> So large files tend to contain dirty pages of more varied ages.
> 

Ok.

> > I'm guessing you have
> > some decent writeback-related tests that might be of use. Mine have a
> > big mix of anon and file writeback so it's not as clear-cut.
> 
> A neat trick is to run your test with `swapoff -a` :)
> 

Good point.

> Seriously I have no scripts to monitor pageout() calls.
> I'll explore ways to test it.
> 

I'll see about running tests with swapoff.

> > Monitoring it isn't hard. Mount debugfs, enable the vmscan tracepoints
> > and read the tracing_pipe. To reduce interference, I always pipe it
> > through gzip and do post-processing afterwards offline with the script
> > included in Documentation/
> 
> Thanks for the tip!
> 
> > Here is what I got from sysbench on x86-64 (other machines hours away)
> > 
> > 
> > SYSBENCH FTrace Reclaim Statistics
> >                     traceonly-v5r6         nodirect-v5r7      flusholdest-v5r7     flushforward-v5r7
> > Direct reclaims                                683        785        670        938 
> > Direct reclaim pages scanned                199776     161195     200400     166639 
> > Direct reclaim write file async I/O          64802          0          0          0 
> > Direct reclaim write anon async I/O           1009        419       1184      11390 
> > Direct reclaim write file sync I/O              18          0          0          0 
> > Direct reclaim write anon sync I/O               0          0          0          0 
> > Wake kswapd requests                        685360     697255     691009     864602 
> > Kswapd wakeups                                1596       1517       1517       1545 
> > Kswapd pages scanned                      17527865   16817554   16816510   15032525 
> > Kswapd reclaim write file async I/O         888082     618123     649167     147903 
> > Kswapd reclaim write anon async I/O         229724     229123     233639     243561 
> > Kswapd reclaim write file sync I/O               0          0          0          0 
> > Kswapd reclaim write anon sync I/O               0          0          0          0 
> 
> > Time stalled direct reclaim (ms)             32.79      22.47      19.75       6.34 
> > Time kswapd awake (ms)                     2192.03    2165.17    2112.73    2055.90 
> 
> I noticed that $total_direct_latency is divided by 1000 before
> printing the above lines, so the unit should be seconds?
> 

Correct. That figure was generated by another post-processing script that
creates the table. It got the units wrong, so the percentage-time lines are
wrong and the time-kswapd-awake lines say ms when they should say seconds.
Sorry about that.

> > User/Sys Time Running Test (seconds)         663.3    656.37    664.14    654.63
> > Percentage Time Spent Direct Reclaim         0.00%     0.00%     0.00%     0.00%
> > Total Elapsed Time (seconds)               6703.22   6468.78   6472.69   6479.62
> > Percentage Time kswapd Awake                 0.03%     0.00%     0.00%     0.00%
> 
> I don't see the code for generating the "Percentage" lines. And the
> numbers seem too small to be true.
> 

The code is in a table-generation script that had access to data on the
length of time the test ran. I had been ignoring the percentage-time lines
since early on, so I missed the error.

The percentage of time spent on direct reclaim is

direct_reclaim*100/(user_time+sys_time+stalled_time)

A report based on a corrected script looks like

                    traceonly-v5r6         nodirect-v5r9      flusholdest-v5r9     flushforward-v5r9
Direct reclaims                                683        528        808        943 
Direct reclaim pages scanned                199776     298562     125991      83325 
Direct reclaim write file async I/O          64802          0          0          0 
Direct reclaim write anon async I/O           1009       3340        926       2227 
Direct reclaim write file sync I/O              18          0          0          0 
Direct reclaim write anon sync I/O               0          0          0          0 
Wake kswapd requests                        685360     522123     763448     827895 
Kswapd wakeups                                1596       1538       1452       1565 
Kswapd pages scanned                      17527865   17020235   16367809   15415022 
Kswapd reclaim write file async I/O         888082     869540     536427      89004 
Kswapd reclaim write anon async I/O         229724     262934     253396     215861 
Kswapd reclaim write file sync I/O               0          0          0          0 
Kswapd reclaim write anon sync I/O               0          0          0          0 
Time stalled direct reclaim (seconds)        32.79      23.46      20.70       7.01 
Time kswapd awake (seconds)                2192.03    2172.22    2117.82    2166.53 

User/Sys Time Running Test (seconds)         663.3    644.43    637.34    680.53
Percentage Time Spent Direct Reclaim         4.71%     3.51%     3.15%     1.02%
Total Elapsed Time (seconds)               6703.22   6477.95   6503.39   6781.90
Percentage Time kswapd Awake                32.70%    33.53%    32.56%    31.95%

> > Flush oldest actually increased the number of pages written back by
> > kswapd but the anon writeback is also high as swap is involved. Kicking
> > flusher threads also helps a lot. It helps less than the previous release
> > because I noticed I was kicking flusher threads for both anon and file
> > dirty pages which is cheating. It's now only waking the threads for
> > file. It's still a reduction of 84% overall so nothing to sneeze at.
> > 
> > What the patch did do was reduce time stalled in direct reclaim and time
> > kswapd spent awake so it still might be going the right direction. I
> > don't have a feeling for how much the writeback figures change between
> > runs because they take so long to run.
> > 
> > STRESS-HIGHALLOC FTrace Reclaim Statistics
> >                   stress-highalloc      stress-highalloc      stress-highalloc      stress-highalloc
> >                     traceonly-v5r6         nodirect-v5r7      flusholdest-v5r7     flushforward-v5r7
> > Direct reclaims                               1221       1284       1127       1252 
> > Direct reclaim pages scanned                146220     186156     142075     140617 
> > Direct reclaim write file async I/O           3433          0          0          0 
> > Direct reclaim write anon async I/O          25238      28758      23940      23247 
> > Direct reclaim write file sync I/O            3095          0          0          0 
> > Direct reclaim write anon sync I/O           10911     305579     281824     246251 
> > Wake kswapd requests                          1193       1196       1088       1209 
> > Kswapd wakeups                                 805        824        758        804 
> > Kswapd pages scanned                      30953364   52621368   42722498   30945547 
> > Kswapd reclaim write file async I/O         898087     241135     570467      54319 
> > Kswapd reclaim write anon async I/O        2278607    2201894    1885741    1949170 
> > Kswapd reclaim write file sync I/O               0          0          0          0 
> > Kswapd reclaim write anon sync I/O               0          0          0          0 
> > Time stalled direct reclaim (ms)           8567.29    6628.83    6520.39    6947.23 
> > Time kswapd awake (ms)                     5847.60    3589.43    3900.74   15837.59 
> > 
> > User/Sys Time Running Test (seconds)       2824.76   2833.05   2833.26   2830.46
> > Percentage Time Spent Direct Reclaim         0.25%     0.00%     0.00%     0.00%
> > Total Elapsed Time (seconds)              10920.14   9021.17   8872.06   9301.86
> > Percentage Time kswapd Awake                 0.15%     0.00%     0.00%     0.00%
> > 
> > Same here, the number of pages written back by kswapd increased but
> > again anon writeback was a big factor. Kicking threads when dirty pages
> > are encountered still helps a lot with a 94% reduction of pages written
> > back overall.
> 
> That is impressive! So it definitely helps to reduce the total number of
> dirty pages under memory pressure.
> 

Yes.

> > Also, your patch really helped the time spent stalled by direct reclaim
> > and kswapd was awake a lot less, with tests completing far faster.
> 
> Thanks. So it does improve the dirty page layout in the LRU lists.
> 

It would appear to.

> > Overall, I still think your series is a big help (although I don't know if
> > the patches in linux-next are also making a difference) but it's not actually
> > reducing the pages encountered by direct reclaim. Maybe that is because
> > the tests were making more forward progress and so scanning faster. The
> > sysbench performance results are too varied to draw conclusions from but it
> > did slightly improve the success rate of high-order allocations.
> > 
> > The flush-forward patches would appear to be a requirement. Christoph
> > first described them as a band-aid but he didn't chuck rocks at me when
> > the patch was actually released. Right now, I'm leaning towards pushing
> > it and judge by the Swear Meter how good/bad others think it is. So far
> > it's, me pro, Rik pro, Christoph maybe.
> 
> Sorry for the delay, I'll help review it.
> 

Don't be sorry, I still haven't reviewed the writeback patches.

> > > For a large dirty inode, it may flush lots of newly dirtied pages _after_
> > > syncing the expired pages. This is the normal case for a single-stream
> > > sequential dirtier, where older pages are in lower offsets.  In this case we
> > > shall not insist on syncing the whole large dirty inode before considering the
> > > other small dirty inodes. This risks wasting time syncing 1GB freshly dirtied
> > > pages before syncing the other N*1MB expired dirty pages who are approaching
> > > the end of the LRU list and hence pageout().
> > > 
> > 
> > Intuitively, this makes a lot of sense.
> > 
> > > For a large dirty inode, it may also flush lots of newly dirtied pages _before_
> > > hitting the desired old ones, in which case it helps for pageout() to do some
> > > clustered writeback, and/or set mapping->writeback_index to help the flusher
> > > focus on old pages.
> > > 
> > 
> > Will put this idea on the maybe pile.
> > 
> > > For a large dirty inode, it may also have intermixed old and new dirty pages.
> > > In this case we need to make sure the inode is queued for IO before some of
> > > its pages hit pageout(). Adaptive dirty expire time helps here.
> > > 
> > > OK, end of the vapour ideas. As for this patchset, it fixes the current
> > > kupdate/background writeback priority:
> > > 
> > > - the kupdate/background writeback shall include newly expired inodes at each
> > >   queue_io() time, as the large inodes left over from previous writeback rounds
> > >   are likely to have less density of old pages.
> > > 
> > > - the background writeback shall consider expired inodes first, just like the
> > >   kupdate writeback
> > > 
> > 
> > I haven't actually reviewed these. I got testing kicked off first
> > because it didn't require brains :)
> 
> Thanks all the same!
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 1/6] writeback: pass writeback_control down to move_expired_inodes()
  2010-07-22  5:09 ` [PATCH 1/6] writeback: pass writeback_control down to move_expired_inodes() Wu Fengguang
  2010-07-23 18:16   ` Jan Kara
@ 2010-07-26 10:44   ` Mel Gorman
  2010-08-01 15:23   ` Minchan Kim
  2 siblings, 0 replies; 62+ messages in thread
From: Mel Gorman @ 2010-07-26 10:44 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Dave Chinner, Christoph Hellwig, Chris Mason,
	Jens Axboe, LKML, linux-fsdevel, linux-mm

On Thu, Jul 22, 2010 at 01:09:29PM +0800, Wu Fengguang wrote:
> This is to prepare for moving the dirty expire policy to move_expired_inodes().
> No behavior change.
> 
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>

Can't see any problem.

Acked-by: Mel Gorman <mel@csn.ul.ie>

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 2/6] writeback: the kupdate expire timestamp should be a moving target
  2010-07-22  5:09 ` [PATCH 2/6] writeback: the kupdate expire timestamp should be a moving target Wu Fengguang
  2010-07-23 18:17   ` Jan Kara
@ 2010-07-26 10:52   ` Mel Gorman
  2010-07-26 11:32     ` Wu Fengguang
  2010-08-01 15:29   ` Minchan Kim
  2 siblings, 1 reply; 62+ messages in thread
From: Mel Gorman @ 2010-07-26 10:52 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Dave Chinner, Jan Kara, Christoph Hellwig,
	Chris Mason, Jens Axboe, LKML, linux-fsdevel, linux-mm

On Thu, Jul 22, 2010 at 01:09:30PM +0800, Wu Fengguang wrote:
> Dynamicly compute the dirty expire timestamp at queue_io() time.
> Also remove writeback_control.older_than_this which is no longer used.
> 
> writeback_control.older_than_this used to be determined at entrance to
> the kupdate writeback work. This _static_ timestamp may go stale if the
> kupdate work runs on and on. The flusher may then stuck with some old
> busy inodes, never considering newly expired inodes thereafter.
> 
> This has two possible problems:
> 
> - It is unfair for a large dirty inode to delay (for a long time) the
>   writeback of small dirty inodes.
> 
> - As time goes by, the large and busy dirty inode may contain only
>   _freshly_ dirtied pages. Ignoring newly expired dirty inodes risks
>   delaying the expired dirty pages to the end of LRU lists, triggering
>   the very bad pageout(). Neverthless this patch merely addresses part
>   of the problem.
> 
> CC: Jan Kara <jack@suse.cz>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>

Again, makes sense and I can't see a problem. There are some word-smithing
issues in the changelog, such as Dynamicly -> Dynamically and
s/writeback_control.older_than_this used/writeback_control.older_than_this is used/,
but other than that.

Acked-by: Mel Gorman <mel@csn.ul.ie>

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 3/6] writeback: kill writeback_control.more_io
  2010-07-22  5:09 ` [PATCH 3/6] writeback: kill writeback_control.more_io Wu Fengguang
  2010-07-23 18:24   ` Jan Kara
@ 2010-07-26 10:53   ` Mel Gorman
  2010-08-01 15:34   ` Minchan Kim
  2 siblings, 0 replies; 62+ messages in thread
From: Mel Gorman @ 2010-07-26 10:53 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Dave Chinner, Christoph Hellwig, Chris Mason,
	Jens Axboe, LKML, linux-fsdevel, linux-mm

On Thu, Jul 22, 2010 at 01:09:31PM +0800, Wu Fengguang wrote:
> When wbc.more_io was first introduced, it indicated whether there was
> at least one superblock whose s_more_io contained more IO work. Now with
> the per-bdi writeback, it can be replaced with a simple b_more_io test.
> 
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>

Acked-by: Mel Gorman <mel@csn.ul.ie>

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 4/6] writeback: sync expired inodes first in background writeback
  2010-07-22  5:09 ` [PATCH 4/6] writeback: sync expired inodes first in background writeback Wu Fengguang
  2010-07-23 18:15   ` Jan Kara
@ 2010-07-26 10:57   ` Mel Gorman
  2010-07-26 12:00     ` Wu Fengguang
  2010-07-26 12:56     ` Wu Fengguang
  1 sibling, 2 replies; 62+ messages in thread
From: Mel Gorman @ 2010-07-26 10:57 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Dave Chinner, Jan Kara, Christoph Hellwig,
	Chris Mason, Jens Axboe, LKML, linux-fsdevel, linux-mm

On Thu, Jul 22, 2010 at 01:09:32PM +0800, Wu Fengguang wrote:
> A background flush work may run forever, so it's reasonable for it to
> mimic the kupdate behavior of syncing old/expired inodes first.
> 
> The policy is
> - enqueue all newly expired inodes at each queue_io() time
> - retry with a halved expire interval until some inodes are found to sync
> 
> CC: Jan Kara <jack@suse.cz>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>

Ok, intuitively this would appear to tie into pageout where we want
older inodes to be cleaned first by background flushers to limit the
number of dirty pages encountered by page reclaim. If this is accurate,
it should be detailed in the changelog.

> ---
>  fs/fs-writeback.c |   20 ++++++++++++++------
>  1 file changed, 14 insertions(+), 6 deletions(-)
> 
> --- linux-next.orig/fs/fs-writeback.c	2010-07-22 12:56:42.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2010-07-22 13:07:51.000000000 +0800
> @@ -217,14 +217,14 @@ static void move_expired_inodes(struct l
>  				struct writeback_control *wbc)
>  {
>  	unsigned long expire_interval = 0;
> -	unsigned long older_than_this;
> +	unsigned long older_than_this = 0; /* reset to kill gcc warning */
>  	LIST_HEAD(tmp);
>  	struct list_head *pos, *node;
>  	struct super_block *sb = NULL;
>  	struct inode *inode;
>  	int do_sb_sort = 0;
>  
> -	if (wbc->for_kupdate) {
> +	if (wbc->for_kupdate || wbc->for_background) {
>  		expire_interval = msecs_to_jiffies(dirty_expire_interval * 10);
>  		older_than_this = jiffies - expire_interval;
>  	}
> @@ -232,8 +232,15 @@ static void move_expired_inodes(struct l
>  	while (!list_empty(delaying_queue)) {
>  		inode = list_entry(delaying_queue->prev, struct inode, i_list);
>  		if (expire_interval &&
> -		    inode_dirtied_after(inode, older_than_this))
> -			break;
> +		    inode_dirtied_after(inode, older_than_this)) {
> +			if (wbc->for_background &&
> +			    list_empty(dispatch_queue) && list_empty(&tmp)) {
> +				expire_interval >>= 1;
> +				older_than_this = jiffies - expire_interval;
> +				continue;
> +			} else
> +				break;
> +		}

This needs a comment.

I think what it is saying is that if background flush is active but no
inodes are old enough, consider newer inodes. This is on the assumption
that page reclaim has encountered dirty pages and the dirty inodes are
still too young.
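That reading matches the patch: the age threshold is only relaxed when the candidate list would otherwise come up empty. A rough user-space model of the loop (hypothetical names, not the kernel code; jiffies wraparound ignored for simplicity) behaves like this:

```c
#include <assert.h>

/*
 * User-space sketch of the move_expired_inodes() policy in this patch.
 * dirtied_when[] is ordered oldest-first (jiffies timestamps). For
 * background writeback the expire interval is halved until at least
 * one inode qualifies (or the interval reaches 0, which accepts all).
 * Returns the number of inodes moved to the dispatch list.
 */
static int pick_expired(const unsigned long *dirtied_when, int n,
                        unsigned long now, unsigned long expire_interval,
                        int for_background, int *picked)
{
    unsigned long older_than_this = now - expire_interval;
    int count = 0;
    int i = 0;

    while (i < n) {
        if (expire_interval && dirtied_when[i] > older_than_this) {
            if (for_background && count == 0) {
                expire_interval >>= 1;  /* relax the age threshold */
                older_than_this = now - expire_interval;
                continue;               /* re-test the same inode */
            }
            break;  /* remaining inodes are even younger */
        }
        picked[count++] = i++;
    }
    return count;
}
```

For example, with now = 1000 and a 300-jiffy expiry, a kupdate-style pass over inodes dirtied at {600, 900, 950} takes only the first, while a background pass over {800, 900} (none expired) keeps halving the interval until the oldest inode qualifies.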

>  		if (sb && sb != inode->i_sb)
>  			do_sb_sort = 1;
>  		sb = inode->i_sb;
> @@ -521,7 +528,8 @@ void writeback_inodes_wb(struct bdi_writ
>  
>  	wbc->wb_start = jiffies; /* livelock avoidance */
>  	spin_lock(&inode_lock);
> -	if (!wbc->for_kupdate || list_empty(&wb->b_io))
> +
> +	if (!(wbc->for_kupdate || wbc->for_background) || list_empty(&wb->b_io))
>  		queue_io(wb, wbc);
>  
>  	while (!list_empty(&wb->b_io)) {
> @@ -550,7 +558,7 @@ static void __writeback_inodes_sb(struct
>  
>  	wbc->wb_start = jiffies; /* livelock avoidance */
>  	spin_lock(&inode_lock);
> -	if (!wbc->for_kupdate || list_empty(&wb->b_io))
> +	if (!(wbc->for_kupdate || wbc->for_background) || list_empty(&wb->b_io))
>  		queue_io(wb, wbc);
>  	writeback_sb_inodes(sb, wb, wbc, true);
>  	spin_unlock(&inode_lock);
> 
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 5/6] writeback: try more writeback as long as something was written
  2010-07-22  5:09 ` [PATCH 5/6] writeback: try more writeback as long as something was written Wu Fengguang
  2010-07-23 17:39   ` Jan Kara
@ 2010-07-26 11:01   ` Mel Gorman
  2010-07-26 11:39     ` Wu Fengguang
  1 sibling, 1 reply; 62+ messages in thread
From: Mel Gorman @ 2010-07-26 11:01 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Dave Chinner, Christoph Hellwig, Chris Mason,
	Jens Axboe, LKML, linux-fsdevel, linux-mm

On Thu, Jul 22, 2010 at 01:09:33PM +0800, Wu Fengguang wrote:
> writeback_inodes_wb()/__writeback_inodes_sb() are not aggressive in that
> they only populate b_io when necessary at entrance time. When the queued
> set of inodes are all synced, they just return, possibly with
> wbc.nr_to_write > 0.
> 
> For kupdate and background writeback, there may be more eligible inodes
> sitting in b_dirty when the current set of b_io inodes are completed. So
> it is necessary to try another round of writeback as long as we made some
> progress in this round. When there are no more eligible inodes, no more
> inodes will be enqueued in queue_io(), hence nothing could/will be
> synced and we may safely bail.
> 
> This will livelock sync when there are heavy dirtiers. However in that case
> sync will already be livelocked w/o this patch, as the current livelock
> avoidance code is virtually a no-op (for one thing, wb_time should be
> set statically at sync start time and be used in move_expired_inodes()).
> The sync livelock problem will be addressed in other patches.
> 

There does seem to be a livelock issue. During iozone, I see messages in
the console log with this series applied that look like

[ 1687.132034] INFO: task iozone:21225 blocked for more than 120 seconds.
[ 1687.211425] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1687.305204] iozone        D ffff880001b13640     0 21225  21108 0x00000000
[ 1687.387677]  ffff880037419d48 0000000000000082 0000000000000348 0000000000013640
[ 1687.476594]  ffff880037419fd8 ffff880037419fd8 ffff880065892da0 0000000000013640
[ 1687.565512]  0000000000013640 0000000000013640 ffff880065892da0 ffff88007f411510
[ 1687.654431] Call Trace:
[ 1687.683663]  [<ffffffff81002996>] ? ftrace_call+0x5/0x2b
[ 1687.747204]  [<ffffffff812d8f67>] schedule_timeout+0x2d/0x214
[ 1687.815947]  [<ffffffff81002996>] ? ftrace_call+0x5/0x2b
[ 1687.879489]  [<ffffffff812d8527>] wait_for_common+0xd2/0x14a
[ 1687.947195]  [<ffffffff8103ef1e>] ? default_wake_function+0x0/0x14
[ 1688.021132]  [<ffffffff81002996>] ? ftrace_call+0x5/0x2b
[ 1688.084680]  [<ffffffff811160f0>] ? sync_one_sb+0x0/0x22
[ 1688.148223]  [<ffffffff812d8657>] wait_for_completion+0x1d/0x1f
[ 1688.219051]  [<ffffffff811121c4>] sync_inodes_sb+0x92/0x14c
[ 1688.285710]  [<ffffffff811160f0>] ? sync_one_sb+0x0/0x22
[ 1688.349249]  [<ffffffff811160b9>] __sync_filesystem+0x4c/0x83
[ 1688.417995]  [<ffffffff81116110>] sync_one_sb+0x20/0x22
[ 1688.480505]  [<ffffffff810f6a23>] iterate_supers+0x66/0xa4
[ 1688.546124]  [<ffffffff81116157>] sys_sync+0x45/0x5c
[ 1688.605509]  [<ffffffff81002c72>] system_call_fastpath+0x16/0x1b

Similar messages do not appear without the patch. iozone does complete though
and the performance figures are not affected. Should I be worried?

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 6/6] writeback: introduce writeback_control.inodes_written
  2010-07-22  5:09 ` [PATCH 6/6] writeback: introduce writeback_control.inodes_written Wu Fengguang
@ 2010-07-26 11:04   ` Mel Gorman
  0 siblings, 0 replies; 62+ messages in thread
From: Mel Gorman @ 2010-07-26 11:04 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Dave Chinner, Christoph Hellwig, Chris Mason,
	Jens Axboe, LKML, linux-fsdevel, linux-mm

On Thu, Jul 22, 2010 at 01:09:34PM +0800, Wu Fengguang wrote:
> Introduce writeback_control.inodes_written to count successful
> ->write_inode() calls.  A non-zero value means there is some
> progress on writeback, in which case more writeback will be tried.
> 
> This prevents aborting a background writeback work prematurely when
> the current set of inodes for IO happens to be metadata-only dirty.
> 
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>

Seems reasonable.

Acked-by: Mel Gorman <mel@csn.ul.ie>
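The retry rule this counter feeds into can be sketched in a few lines (illustrative names, not the kernel structures): a writeback round made progress if it consumed page quota or flushed inode metadata, so another round is worthwhile.

```c
#include <assert.h>

/*
 * Illustrative model of the patch 5/6 retry rule: nr_to_write counts
 * the remaining page quota before and after a round, inodes_written
 * counts successful ->write_inode() calls. Metadata-only dirty inodes
 * consume no page quota, so without inodes_written the loop would
 * falsely conclude that no progress was made.
 */
struct wb_round {
    long nr_to_write_before;
    long nr_to_write_after;
    long inodes_written;
};

static int made_progress(const struct wb_round *r)
{
    return r->nr_to_write_after < r->nr_to_write_before ||
           r->inodes_written > 0;
}
```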

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 2/6] writeback: the kupdate expire timestamp should be a moving target
  2010-07-26 10:52   ` Mel Gorman
@ 2010-07-26 11:32     ` Wu Fengguang
  0 siblings, 0 replies; 62+ messages in thread
From: Wu Fengguang @ 2010-07-26 11:32 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Dave Chinner, Jan Kara, Christoph Hellwig,
	Chris Mason, Jens Axboe, LKML, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org

On Mon, Jul 26, 2010 at 06:52:00PM +0800, Mel Gorman wrote:
> On Thu, Jul 22, 2010 at 01:09:30PM +0800, Wu Fengguang wrote:
> > Dynamicly compute the dirty expire timestamp at queue_io() time.
> > Also remove writeback_control.older_than_this which is no longer used.
> > 
> > writeback_control.older_than_this used to be determined at entrance to
> > the kupdate writeback work. This _static_ timestamp may go stale if the
> > kupdate work runs on and on. The flusher may then get stuck with some old
> > busy inodes, never considering newly expired inodes thereafter.
> > 
> > This has two possible problems:
> > 
> > - It is unfair for a large dirty inode to delay (for a long time) the
> >   writeback of small dirty inodes.
> > 
> > - As time goes by, the large and busy dirty inode may contain only
> >   _freshly_ dirtied pages. Ignoring newly expired dirty inodes risks
> >   delaying the expired dirty pages to the end of LRU lists, triggering
> >   the very bad pageout(). Nevertheless this patch merely addresses part
> >   of the problem.
> > 
> > CC: Jan Kara <jack@suse.cz>
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> 
Again, makes sense and I can't see a problem. There are some word-smithing
issues in the changelog such as Dynamicly -> Dynamically and

Hah forgot to enable spell checking.

> s/writeback_control.older_than_this used/writeback_control.older_than_this is used/

It's "used to", my god.

> but other than that.
> 
> Acked-by: Mel Gorman <mel@csn.ul.ie>

Thanks,
Fengguang


* Re: [PATCH 5/6] writeback: try more writeback as long as something was written
  2010-07-26 11:01   ` Mel Gorman
@ 2010-07-26 11:39     ` Wu Fengguang
  0 siblings, 0 replies; 62+ messages in thread
From: Wu Fengguang @ 2010-07-26 11:39 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Dave Chinner, Christoph Hellwig, Chris Mason,
	Jens Axboe, LKML, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org

On Mon, Jul 26, 2010 at 07:01:25PM +0800, Mel Gorman wrote:
> On Thu, Jul 22, 2010 at 01:09:33PM +0800, Wu Fengguang wrote:
> > writeback_inodes_wb()/__writeback_inodes_sb() are not aggressive in that
> > they only populate b_io when necessary at entrance time. When the queued
> > set of inodes are all synced, they just return, possibly with
> > wbc.nr_to_write > 0.
> > 
> > For kupdate and background writeback, there may be more eligible inodes
> > sitting in b_dirty when the current set of b_io inodes are completed. So
> > it is necessary to try another round of writeback as long as we made some
> > progress in this round. When there are no more eligible inodes, no more
> > inodes will be enqueued in queue_io(), hence nothing could/will be
> > synced and we may safely bail.
> > 
> > This will livelock sync when there are heavy dirtiers. However in that case
> > sync will already be livelocked w/o this patch, as the current livelock
> > avoidance code is virtually a no-op (for one thing, wb_time should be
> > set statically at sync start time and be used in move_expired_inodes()).
> > The sync livelock problem will be addressed in other patches.
> > 
> 
> There does seem to be a livelock issue. During iozone, I see messages in
> the console log with this series applied that look like
> 
> [ 1687.132034] INFO: task iozone:21225 blocked for more than 120 seconds.
> [ 1687.211425] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 1687.305204] iozone        D ffff880001b13640     0 21225  21108 0x00000000
> [ 1687.387677]  ffff880037419d48 0000000000000082 0000000000000348 0000000000013640
> [ 1687.476594]  ffff880037419fd8 ffff880037419fd8 ffff880065892da0 0000000000013640
> [ 1687.565512]  0000000000013640 0000000000013640 ffff880065892da0 ffff88007f411510
> [ 1687.654431] Call Trace:
> [ 1687.683663]  [<ffffffff81002996>] ? ftrace_call+0x5/0x2b
> [ 1687.747204]  [<ffffffff812d8f67>] schedule_timeout+0x2d/0x214
> [ 1687.815947]  [<ffffffff81002996>] ? ftrace_call+0x5/0x2b
> [ 1687.879489]  [<ffffffff812d8527>] wait_for_common+0xd2/0x14a
> [ 1687.947195]  [<ffffffff8103ef1e>] ? default_wake_function+0x0/0x14
> [ 1688.021132]  [<ffffffff81002996>] ? ftrace_call+0x5/0x2b
> [ 1688.084680]  [<ffffffff811160f0>] ? sync_one_sb+0x0/0x22
> [ 1688.148223]  [<ffffffff812d8657>] wait_for_completion+0x1d/0x1f
> [ 1688.219051]  [<ffffffff811121c4>] sync_inodes_sb+0x92/0x14c
> [ 1688.285710]  [<ffffffff811160f0>] ? sync_one_sb+0x0/0x22
> [ 1688.349249]  [<ffffffff811160b9>] __sync_filesystem+0x4c/0x83
> [ 1688.417995]  [<ffffffff81116110>] sync_one_sb+0x20/0x22
> [ 1688.480505]  [<ffffffff810f6a23>] iterate_supers+0x66/0xa4
> [ 1688.546124]  [<ffffffff81116157>] sys_sync+0x45/0x5c
> [ 1688.605509]  [<ffffffff81002c72>] system_call_fastpath+0x16/0x1b
> 
> Similar messages do not appear without the patch. iozone does complete though
> and the performance figures are not affected. Should I be worried?

The patch does add a bit more livelock possibility. But don't worry,
I'll fix that.

Thanks,
Fengguang


* Re: [PATCH 0/6] [RFC] writeback: try to write older pages first
  2010-07-26 10:28 ` Itaru Kitayama
@ 2010-07-26 11:47   ` Wu Fengguang
  0 siblings, 0 replies; 62+ messages in thread
From: Wu Fengguang @ 2010-07-26 11:47 UTC (permalink / raw)
  To: Itaru Kitayama
  Cc: Andrew Morton, Dave Chinner, Christoph Hellwig, Mel Gorman,
	Chris Mason, Jens Axboe, LKML, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org

> Here's a touch up patch on top of your changes against the latest
> mmotm.
>
> Signed-off-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>

Applied, Thanks!


* Re: [PATCH 4/6] writeback: sync expired inodes first in background writeback
  2010-07-23 18:15   ` Jan Kara
@ 2010-07-26 11:51     ` Wu Fengguang
  2010-07-26 12:12       ` Jan Kara
  0 siblings, 1 reply; 62+ messages in thread
From: Wu Fengguang @ 2010-07-26 11:51 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Dave Chinner, Christoph Hellwig, Mel Gorman,
	Chris Mason, Jens Axboe, LKML, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org

On Sat, Jul 24, 2010 at 02:15:21AM +0800, Jan Kara wrote:
> On Thu 22-07-10 13:09:32, Wu Fengguang wrote:
> > A background flush work may run forever. So it's reasonable for it to
> > mimic the kupdate behavior of syncing old/expired inodes first.
> > 
> > The policy is
> > - enqueue all newly expired inodes at each queue_io() time
> > - retry with halved expire interval until some inodes are found to sync
>   Hmm, this logic looks a bit arbitrary to me. What I actually don't like
> very much about this is that when there aren't inodes older than, say, 2
> seconds, you'll end up queueing just inodes between 2s and 1s. So I'd
> rather just queue inodes older than the limit and if there are none, just
> queue all other dirty inodes.

You are proposing

-				expire_interval >>= 1;
+				expire_interval = 0;

IMO this does not really simplify code or concept. If we can get the
"smoother" behavior in original patch without extra cost, why not? 

Thanks,
Fengguang


> > CC: Jan Kara <jack@suse.cz>
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> >  fs/fs-writeback.c |   20 ++++++++++++++------
> >  1 file changed, 14 insertions(+), 6 deletions(-)
> > 
> > --- linux-next.orig/fs/fs-writeback.c	2010-07-22 12:56:42.000000000 +0800
> > +++ linux-next/fs/fs-writeback.c	2010-07-22 13:07:51.000000000 +0800
> > @@ -217,14 +217,14 @@ static void move_expired_inodes(struct l
> >  				struct writeback_control *wbc)
> >  {
> >  	unsigned long expire_interval = 0;
> > -	unsigned long older_than_this;
> > +	unsigned long older_than_this = 0; /* reset to kill gcc warning */
> >  	LIST_HEAD(tmp);
> >  	struct list_head *pos, *node;
> >  	struct super_block *sb = NULL;
> >  	struct inode *inode;
> >  	int do_sb_sort = 0;
> >  
> > -	if (wbc->for_kupdate) {
> > +	if (wbc->for_kupdate || wbc->for_background) {
> >  		expire_interval = msecs_to_jiffies(dirty_expire_interval * 10);
> >  		older_than_this = jiffies - expire_interval;
> >  	}
> > @@ -232,8 +232,15 @@ static void move_expired_inodes(struct l
> >  	while (!list_empty(delaying_queue)) {
> >  		inode = list_entry(delaying_queue->prev, struct inode, i_list);
> >  		if (expire_interval &&
> > -		    inode_dirtied_after(inode, older_than_this))
> > -			break;
> > +		    inode_dirtied_after(inode, older_than_this)) {
> > +			if (wbc->for_background &&
> > +			    list_empty(dispatch_queue) && list_empty(&tmp)) {
> > +				expire_interval >>= 1;
> > +				older_than_this = jiffies - expire_interval;
> > +				continue;
> > +			} else
> > +				break;
> > +		}
> >  		if (sb && sb != inode->i_sb)
> >  			do_sb_sort = 1;
> >  		sb = inode->i_sb;
> > @@ -521,7 +528,8 @@ void writeback_inodes_wb(struct bdi_writ
> >  
> >  	wbc->wb_start = jiffies; /* livelock avoidance */
> >  	spin_lock(&inode_lock);
> > -	if (!wbc->for_kupdate || list_empty(&wb->b_io))
> > +
> > +	if (!(wbc->for_kupdate || wbc->for_background) || list_empty(&wb->b_io))
> >  		queue_io(wb, wbc);
> >  
> >  	while (!list_empty(&wb->b_io)) {
> > @@ -550,7 +558,7 @@ static void __writeback_inodes_sb(struct
> >  
> >  	wbc->wb_start = jiffies; /* livelock avoidance */
> >  	spin_lock(&inode_lock);
> > -	if (!wbc->for_kupdate || list_empty(&wb->b_io))
> > +	if (!(wbc->for_kupdate || wbc->for_background) || list_empty(&wb->b_io))
> >  		queue_io(wb, wbc);
> >  	writeback_sb_inodes(sb, wb, wbc, true);
> >  	spin_unlock(&inode_lock);
> > 
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR


* Re: [PATCH 4/6] writeback: sync expired inodes first in background writeback
  2010-07-26 10:57   ` Mel Gorman
@ 2010-07-26 12:00     ` Wu Fengguang
  2010-07-26 12:20       ` Jan Kara
  2010-07-26 12:56     ` Wu Fengguang
  1 sibling, 1 reply; 62+ messages in thread
From: Wu Fengguang @ 2010-07-26 12:00 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Dave Chinner, Jan Kara, Christoph Hellwig,
	Chris Mason, Jens Axboe, LKML, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org

On Mon, Jul 26, 2010 at 06:57:37PM +0800, Mel Gorman wrote:
> On Thu, Jul 22, 2010 at 01:09:32PM +0800, Wu Fengguang wrote:
> > A background flush work may run for ever. So it's reasonable for it to
> > mimic the kupdate behavior of syncing old/expired inodes first.
> > 
> > The policy is
> > - enqueue all newly expired inodes at each queue_io() time
> > - retry with halfed expire interval until get some inodes to sync
> > 
> > CC: Jan Kara <jack@suse.cz>
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> 
> Ok, intuitively this would appear to tie into pageout where we want
> older inodes to be cleaned first by background flushers to limit the
> number of dirty pages encountered by page reclaim. If this is accurate,
> it should be detailed in the changelog.

Good suggestion. I'll add these lines:

This is to help reduce the number of dirty pages encountered by page
reclaim, e.g. the pageout() calls. Normally older inodes contain older
dirty pages, which are closer to the end of the LRU lists. So
syncing older inodes first helps reduce the dirty pages reached by
the page reclaim code.

Thanks,
Fengguang


* Re: [PATCH 4/6] writeback: sync expired inodes first in background writeback
  2010-07-26 11:51     ` Wu Fengguang
@ 2010-07-26 12:12       ` Jan Kara
  2010-07-26 12:29         ` Wu Fengguang
  0 siblings, 1 reply; 62+ messages in thread
From: Jan Kara @ 2010-07-26 12:12 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Andrew Morton, Dave Chinner, Christoph Hellwig,
	Mel Gorman, Chris Mason, Jens Axboe, LKML,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org

On Mon 26-07-10 19:51:53, Wu Fengguang wrote:
> On Sat, Jul 24, 2010 at 02:15:21AM +0800, Jan Kara wrote:
> > On Thu 22-07-10 13:09:32, Wu Fengguang wrote:
> > > A background flush work may run forever. So it's reasonable for it to
> > > mimic the kupdate behavior of syncing old/expired inodes first.
> > > 
> > > The policy is
> > > - enqueue all newly expired inodes at each queue_io() time
> > > - retry with halved expire interval until some inodes are found to sync
> >   Hmm, this logic looks a bit arbitrary to me. What I actually don't like
> > very much about this is that when there aren't inodes older than, say, 2
> > seconds, you'll end up queueing just inodes between 2s and 1s. So I'd
> > rather just queue inodes older than the limit and if there are none, just
> > queue all other dirty inodes.
> 
> You are proposing
> 
> -				expire_interval >>= 1;
> +				expire_interval = 0;
> 
> IMO this does not really simplify code or concept. If we can get the
> "smoother" behavior in original patch without extra cost, why not? 
  I agree there's no substantial code simplification. But I see a
substantial "behavior" simplification (just two sweeps instead of 10 or
so). But I don't really insist on the two sweeps, it's just that I don't
see a justification for the exponential backoff here... I mean, what's the
point if the interval we queue gets really small? Why not just use
expire_interval/2 as a step if you want smoother behavior?
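The worry about repeated halving can be quantified: starting from the default 30-second expiry, the backoff may re-scan the list head many times before the interval bottoms out at zero. A tiny model (illustrative only, not kernel code):

```c
#include <assert.h>

/*
 * Count how many halvings it takes for an expire interval (in jiffies)
 * to reach zero, i.e. the worst-case number of extra threshold
 * relaxations the exponential backoff can perform on an empty sweep.
 */
static int backoff_sweeps(unsigned long expire_jiffies)
{
    int sweeps = 0;

    while (expire_jiffies) {
        expire_jiffies >>= 1;
        sweeps++;
    }
    return sweeps;
}
```

With HZ = 1000 the default 30s expiry is 30000 jiffies, giving up to 15 halvings in the worst case, versus exactly one extra sweep for a "drop the filter entirely" fallback.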

								Honza
> > > CC: Jan Kara <jack@suse.cz>
> > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > > ---
> > >  fs/fs-writeback.c |   20 ++++++++++++++------
> > >  1 file changed, 14 insertions(+), 6 deletions(-)
> > > 
> > > --- linux-next.orig/fs/fs-writeback.c	2010-07-22 12:56:42.000000000 +0800
> > > +++ linux-next/fs/fs-writeback.c	2010-07-22 13:07:51.000000000 +0800
> > > @@ -217,14 +217,14 @@ static void move_expired_inodes(struct l
> > >  				struct writeback_control *wbc)
> > >  {
> > >  	unsigned long expire_interval = 0;
> > > -	unsigned long older_than_this;
> > > +	unsigned long older_than_this = 0; /* reset to kill gcc warning */
> > >  	LIST_HEAD(tmp);
> > >  	struct list_head *pos, *node;
> > >  	struct super_block *sb = NULL;
> > >  	struct inode *inode;
> > >  	int do_sb_sort = 0;
> > >  
> > > -	if (wbc->for_kupdate) {
> > > +	if (wbc->for_kupdate || wbc->for_background) {
> > >  		expire_interval = msecs_to_jiffies(dirty_expire_interval * 10);
> > >  		older_than_this = jiffies - expire_interval;
> > >  	}
> > > @@ -232,8 +232,15 @@ static void move_expired_inodes(struct l
> > >  	while (!list_empty(delaying_queue)) {
> > >  		inode = list_entry(delaying_queue->prev, struct inode, i_list);
> > >  		if (expire_interval &&
> > > -		    inode_dirtied_after(inode, older_than_this))
> > > -			break;
> > > +		    inode_dirtied_after(inode, older_than_this)) {
> > > +			if (wbc->for_background &&
> > > +			    list_empty(dispatch_queue) && list_empty(&tmp)) {
> > > +				expire_interval >>= 1;
> > > +				older_than_this = jiffies - expire_interval;
> > > +				continue;
> > > +			} else
> > > +				break;
> > > +		}
> > >  		if (sb && sb != inode->i_sb)
> > >  			do_sb_sort = 1;
> > >  		sb = inode->i_sb;
> > > @@ -521,7 +528,8 @@ void writeback_inodes_wb(struct bdi_writ
> > >  
> > >  	wbc->wb_start = jiffies; /* livelock avoidance */
> > >  	spin_lock(&inode_lock);
> > > -	if (!wbc->for_kupdate || list_empty(&wb->b_io))
> > > +
> > > +	if (!(wbc->for_kupdate || wbc->for_background) || list_empty(&wb->b_io))
> > >  		queue_io(wb, wbc);
> > >  
> > >  	while (!list_empty(&wb->b_io)) {
> > > @@ -550,7 +558,7 @@ static void __writeback_inodes_sb(struct
> > >  
> > >  	wbc->wb_start = jiffies; /* livelock avoidance */
> > >  	spin_lock(&inode_lock);
> > > -	if (!wbc->for_kupdate || list_empty(&wb->b_io))
> > > +	if (!(wbc->for_kupdate || wbc->for_background) || list_empty(&wb->b_io))
> > >  		queue_io(wb, wbc);
> > >  	writeback_sb_inodes(sb, wb, wbc, true);
> > >  	spin_unlock(&inode_lock);
> > > 
> > > 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > -- 
> > Jan Kara <jack@suse.cz>
> > SUSE Labs, CR
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCH 4/6] writeback: sync expired inodes first in background writeback
  2010-07-26 12:00     ` Wu Fengguang
@ 2010-07-26 12:20       ` Jan Kara
  2010-07-26 12:31         ` Wu Fengguang
  0 siblings, 1 reply; 62+ messages in thread
From: Jan Kara @ 2010-07-26 12:20 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Mel Gorman, Andrew Morton, Dave Chinner, Jan Kara,
	Christoph Hellwig, Chris Mason, Jens Axboe, LKML,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org

On Mon 26-07-10 20:00:11, Wu Fengguang wrote:
> On Mon, Jul 26, 2010 at 06:57:37PM +0800, Mel Gorman wrote:
> > On Thu, Jul 22, 2010 at 01:09:32PM +0800, Wu Fengguang wrote:
> > > A background flush work may run forever. So it's reasonable for it to
> > > mimic the kupdate behavior of syncing old/expired inodes first.
> > > 
> > > The policy is
> > > - enqueue all newly expired inodes at each queue_io() time
> > > - retry with halved expire interval until some inodes are found to sync
> > > 
> > > CC: Jan Kara <jack@suse.cz>
> > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > 
> > Ok, intuitively this would appear to tie into pageout where we want
> > older inodes to be cleaned first by background flushers to limit the
> > number of dirty pages encountered by page reclaim. If this is accurate,
> > it should be detailed in the changelog.
> 
> Good suggestion. I'll add these lines:
> 
> This is to help reduce the number of dirty pages encountered by page
> reclaim, e.g. the pageout() calls. Normally older inodes contain older
> dirty pages, which are closer to the end of the LRU lists. So
  Well, this kind of implicitly assumes that once a page is written, it
doesn't get accessed anymore, right? Which I imagine is often true but
not for all workloads... Anyway I think this behavior is a good start,
also because it is kind of natural for users to see "old" files written
first.

> syncing older inodes first helps reduce the dirty pages reached by
> the page reclaim code.

								Honza  
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCH 4/6] writeback: sync expired inodes first in background writeback
  2010-07-26 12:12       ` Jan Kara
@ 2010-07-26 12:29         ` Wu Fengguang
  0 siblings, 0 replies; 62+ messages in thread
From: Wu Fengguang @ 2010-07-26 12:29 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Dave Chinner, Christoph Hellwig, Mel Gorman,
	Chris Mason, LKML, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org

On Mon, Jul 26, 2010 at 08:12:59PM +0800, Jan Kara wrote:
> On Mon 26-07-10 19:51:53, Wu Fengguang wrote:
> > On Sat, Jul 24, 2010 at 02:15:21AM +0800, Jan Kara wrote:
> > > On Thu 22-07-10 13:09:32, Wu Fengguang wrote:
> > > > A background flush work may run forever. So it's reasonable for it to
> > > > mimic the kupdate behavior of syncing old/expired inodes first.
> > > > 
> > > > The policy is
> > > > - enqueue all newly expired inodes at each queue_io() time
> > > > - retry with halved expire interval until some inodes are found to sync
> > >   Hmm, this logic looks a bit arbitrary to me. What I actually don't like
> > > very much about this is that when there aren't inodes older than, say, 2
> > > seconds, you'll end up queueing just inodes between 2s and 1s. So I'd
> > > rather just queue inodes older than the limit and if there are none, just
> > > queue all other dirty inodes.
> > 
> > You are proposing
> > 
> > -				expire_interval >>= 1;
> > +				expire_interval = 0;
> > 
> > IMO this does not really simplify code or concept. If we can get the
> > "smoother" behavior in original patch without extra cost, why not? 
>   I agree there's no substantial code simplification. But I see a
> substantial "behavior" simplification (just two sweeps instead of 10 or
> so). But I don't really insist on the two sweeps, it's just that I don't
> see a justification for the exponential backoff here... I mean, what's the
> point if the interval we queue gets really small? Why not just use
> expire_interval/2 as a step if you want smoother behavior?

Yeah, the _non-linear_ backoff is not good. You have a point about the
behavior simplification, and it does remove one line. So I'll follow
your way.

Thanks,
Fengguang
---
Subject: writeback: sync expired inodes first in background writeback
From: Wu Fengguang <fengguang.wu@intel.com>
Date: Wed Jul 21 20:11:53 CST 2010

A background flush work may run forever. So it's reasonable for it to
mimic the kupdate behavior of syncing old/expired inodes first.

The policy is
- enqueue all newly expired inodes at each queue_io() time
- enqueue all dirty inodes if there are no more expired inodes to sync

This will help reduce the number of dirty pages encountered by page
reclaim, e.g. the pageout() calls. Normally older inodes contain older
dirty pages, which are closer to the end of the LRU lists. So
syncing older inodes first helps reduce the dirty pages reached by
the page reclaim code.

CC: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2010-07-26 20:19:01.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-07-26 20:25:01.000000000 +0800
@@ -217,14 +217,14 @@ static void move_expired_inodes(struct l
 				struct writeback_control *wbc)
 {
 	unsigned long expire_interval = 0;
-	unsigned long older_than_this;
+	unsigned long older_than_this = 0; /* reset to kill gcc warning */
 	LIST_HEAD(tmp);
 	struct list_head *pos, *node;
 	struct super_block *sb = NULL;
 	struct inode *inode;
 	int do_sb_sort = 0;
 
-	if (wbc->for_kupdate) {
+	if (wbc->for_kupdate || wbc->for_background) {
 		expire_interval = msecs_to_jiffies(dirty_expire_interval * 10);
 		older_than_this = jiffies - expire_interval;
 	}
@@ -232,8 +232,14 @@ static void move_expired_inodes(struct l
 	while (!list_empty(delaying_queue)) {
 		inode = list_entry(delaying_queue->prev, struct inode, i_list);
 		if (expire_interval &&
-		    inode_dirtied_after(inode, older_than_this))
-			break;
+		    inode_dirtied_after(inode, older_than_this)) {
+			if (wbc->for_background &&
+			    list_empty(dispatch_queue) && list_empty(&tmp)) {
+				expire_interval = 0;
+				continue;
+			} else
+				break;
+		}
 		if (sb && sb != inode->i_sb)
 			do_sb_sort = 1;
 		sb = inode->i_sb;
@@ -521,7 +527,8 @@ void writeback_inodes_wb(struct bdi_writ
 
 	wbc->wb_start = jiffies; /* livelock avoidance */
 	spin_lock(&inode_lock);
-	if (!wbc->for_kupdate || list_empty(&wb->b_io))
+
+	if (!(wbc->for_kupdate || wbc->for_background) || list_empty(&wb->b_io))
 		queue_io(wb, wbc);
 
 	while (!list_empty(&wb->b_io)) {
@@ -550,7 +557,7 @@ static void __writeback_inodes_sb(struct
 
 	wbc->wb_start = jiffies; /* livelock avoidance */
 	spin_lock(&inode_lock);
-	if (!wbc->for_kupdate || list_empty(&wb->b_io))
+	if (!(wbc->for_kupdate || wbc->for_background) || list_empty(&wb->b_io))
 		queue_io(wb, wbc);
 	writeback_sb_inodes(sb, wb, wbc, true);
 	spin_unlock(&inode_lock);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 4/6] writeback: sync expired inodes first in background writeback
  2010-07-26 12:20       ` Jan Kara
@ 2010-07-26 12:31         ` Wu Fengguang
  2010-07-26 12:39           ` Jan Kara
  0 siblings, 1 reply; 62+ messages in thread
From: Wu Fengguang @ 2010-07-26 12:31 UTC (permalink / raw)
  To: Jan Kara
  Cc: Mel Gorman, Andrew Morton, Dave Chinner, Christoph Hellwig,
	Chris Mason, Jens Axboe, LKML, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org

On Mon, Jul 26, 2010 at 08:20:54PM +0800, Jan Kara wrote:
> On Mon 26-07-10 20:00:11, Wu Fengguang wrote:
> > On Mon, Jul 26, 2010 at 06:57:37PM +0800, Mel Gorman wrote:
> > > On Thu, Jul 22, 2010 at 01:09:32PM +0800, Wu Fengguang wrote:
> > > > A background flush work may run for ever. So it's reasonable for it to
> > > > mimic the kupdate behavior of syncing old/expired inodes first.
> > > > 
> > > > The policy is
> > > > - enqueue all newly expired inodes at each queue_io() time
> > > > - retry with halfed expire interval until get some inodes to sync
> > > > 
> > > > CC: Jan Kara <jack@suse.cz>
> > > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > > 
> > > Ok, intuitively this would appear to tie into pageout where we want
> > > older inodes to be cleaned first by background flushers to limit the
> > > number of dirty pages encountered by page reclaim. If this is accurate,
> > > it should be detailed in the changelog.
> > 
> > Good suggestion. I'll add these lines:
> > 
> > This is to help reduce the number of dirty pages encountered by page
> > reclaim, eg. the pageout() calls. Normally older inodes contain older
> > dirty pages, which are more close to the end of the LRU lists. So
>   Well, this kind of implicitely assumes that once page is written, it
> doesn't get accessed anymore, right?

No, this patch is not evicting the page :)

> Which I imagine is often true but
> not for all workloads... Anyway I think this behavior is a good start
> also because it is kind of natural to users to see "old" files written
> first.

Thanks,
Fengguang

> > syncing older inodes first helps reducing the dirty pages reached by
> > the page reclaim code.
> 
> 								Honza  
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR


* Re: [PATCH 4/6] writeback: sync expired inodes first in background writeback
  2010-07-26 12:31         ` Wu Fengguang
@ 2010-07-26 12:39           ` Jan Kara
  2010-07-26 12:47             ` Wu Fengguang
  0 siblings, 1 reply; 62+ messages in thread
From: Jan Kara @ 2010-07-26 12:39 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Mel Gorman, Andrew Morton, Dave Chinner,
	Christoph Hellwig, Chris Mason, Jens Axboe, LKML,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org

On Mon 26-07-10 20:31:41, Wu Fengguang wrote:
> On Mon, Jul 26, 2010 at 08:20:54PM +0800, Jan Kara wrote:
> > On Mon 26-07-10 20:00:11, Wu Fengguang wrote:
> > > On Mon, Jul 26, 2010 at 06:57:37PM +0800, Mel Gorman wrote:
> > > > On Thu, Jul 22, 2010 at 01:09:32PM +0800, Wu Fengguang wrote:
> > > > > A background flush work may run for ever. So it's reasonable for it to
> > > > > mimic the kupdate behavior of syncing old/expired inodes first.
> > > > > 
> > > > > The policy is
> > > > > - enqueue all newly expired inodes at each queue_io() time
> > > > > - retry with halfed expire interval until get some inodes to sync
> > > > > 
> > > > > CC: Jan Kara <jack@suse.cz>
> > > > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > > > 
> > > > Ok, intuitively this would appear to tie into pageout where we want
> > > > older inodes to be cleaned first by background flushers to limit the
> > > > number of dirty pages encountered by page reclaim. If this is accurate,
> > > > it should be detailed in the changelog.
> > > 
> > > Good suggestion. I'll add these lines:
> > > 
> > > This is to help reduce the number of dirty pages encountered by page
> > > reclaim, eg. the pageout() calls. Normally older inodes contain older
> > > dirty pages, which are more close to the end of the LRU lists. So
> >   Well, this kind of implicitely assumes that once page is written, it
> > doesn't get accessed anymore, right?
> 
> No, this patch is not evicting the page :)
  Sorry, I probably wasn't clear enough :) I meant: the claim that "older
inodes contain older dirty pages, which are more close to the end of the
LRU lists" assumes that once a page is written it doesn't get accessed
again. For example, files which get continual random access (like DB files)
can have a rather old dirtied_when, but some of their pages are accessed
quite often...

> > Which I imagine is often true but
> > not for all workloads... Anyway I think this behavior is a good start
> > also because it is kind of natural to users to see "old" files written
> > first.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCH 5/6] writeback: try more writeback as long as something was written
  2010-07-23 17:39   ` Jan Kara
@ 2010-07-26 12:39     ` Wu Fengguang
  0 siblings, 0 replies; 62+ messages in thread
From: Wu Fengguang @ 2010-07-26 12:39 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Dave Chinner, Christoph Hellwig, Mel Gorman,
	Chris Mason, Jens Axboe, LKML, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org

On Sat, Jul 24, 2010 at 01:39:54AM +0800, Jan Kara wrote:
> On Thu 22-07-10 13:09:33, Wu Fengguang wrote:
> > writeback_inodes_wb()/__writeback_inodes_sb() are not agressive in that
> > they only populate b_io when necessary at entrance time. When the queued
> > set of inodes are all synced, they just return, possibly with
> > wbc.nr_to_write > 0.
> > 
> > For kupdate and background writeback, there may be more eligible inodes
> > sitting in b_dirty when the current set of b_io inodes are completed. So
> > it is necessary to try another round of writeback as long as we made some
> > progress in this round. When there are no more eligible inodes, no more
> > inodes will be enqueued in queue_io(), hence nothing could/will be
> > synced and we may safely bail.
> > 
> > This will livelock sync when there are heavy dirtiers. However in that case
> > sync will already be livelocked w/o this patch, as the current livelock
> > avoidance code is virtually a no-op (for one thing, wb_time should be
> > set statically at sync start time and be used in move_expired_inodes()).
> > The sync livelock problem will be addressed in other patches.
>   Hmm, any reason why you don't solve this problem by just removing the
> condition before queue_io()? It would also make the logic simpler - always

Yeah, I'll remove the condition before queue_io() in the coming sync
livelock patchset. This patchset does the below instead; though awkward,
it avoids unnecessary behavior changes for the non-background cases.

-       if (!wbc->for_kupdate || list_empty(&wb->b_io))
+       if (!(wbc->for_kupdate || wbc->for_background) || list_empty(&wb->b_io))
                queue_io(wb, wbc);


> queue all inodes that are eligible for writeback...

Thanks,
Fengguang


* Re: [PATCH 4/6] writeback: sync expired inodes first in background writeback
  2010-07-26 12:39           ` Jan Kara
@ 2010-07-26 12:47             ` Wu Fengguang
  0 siblings, 0 replies; 62+ messages in thread
From: Wu Fengguang @ 2010-07-26 12:47 UTC (permalink / raw)
  To: Jan Kara
  Cc: Mel Gorman, Andrew Morton, Dave Chinner, Christoph Hellwig,
	Chris Mason, Jens Axboe, LKML, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org

On Mon, Jul 26, 2010 at 08:39:07PM +0800, Jan Kara wrote:
> On Mon 26-07-10 20:31:41, Wu Fengguang wrote:
> > On Mon, Jul 26, 2010 at 08:20:54PM +0800, Jan Kara wrote:
> > > On Mon 26-07-10 20:00:11, Wu Fengguang wrote:
> > > > On Mon, Jul 26, 2010 at 06:57:37PM +0800, Mel Gorman wrote:
> > > > > On Thu, Jul 22, 2010 at 01:09:32PM +0800, Wu Fengguang wrote:
> > > > > > A background flush work may run for ever. So it's reasonable for it to
> > > > > > mimic the kupdate behavior of syncing old/expired inodes first.
> > > > > > 
> > > > > > The policy is
> > > > > > - enqueue all newly expired inodes at each queue_io() time
> > > > > > - retry with halfed expire interval until get some inodes to sync
> > > > > > 
> > > > > > CC: Jan Kara <jack@suse.cz>
> > > > > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > > > > 
> > > > > Ok, intuitively this would appear to tie into pageout where we want
> > > > > older inodes to be cleaned first by background flushers to limit the
> > > > > number of dirty pages encountered by page reclaim. If this is accurate,
> > > > > it should be detailed in the changelog.
> > > > 
> > > > Good suggestion. I'll add these lines:
> > > > 
> > > > This is to help reduce the number of dirty pages encountered by page
> > > > reclaim, eg. the pageout() calls. Normally older inodes contain older
> > > > dirty pages, which are more close to the end of the LRU lists. So
> > >   Well, this kind of implicitely assumes that once page is written, it
> > > doesn't get accessed anymore, right?
> > 
> > No, this patch is not evicting the page :)
>   Sorry, I probably wasn't clear enough :) I meant: The claim that "older
> inodes contain older dirty pages, which are more close to the end of the
> LRU lists" assumes that once page is written it doesn't get accessed
> again. For example files which get continual random access (like DB files)
> can have rather old dirtied_when but some of their pages are accessed quite
> often...

Ah yes. That leads to another fact: smaller inodes tend to show a
stronger correlation between inode dirty age and page dirty age.

This is one of the reasons not to sync a huge dirty inode in one shot.
Instead of

        sync  1G for inode A
        sync 10M for inode B
        sync 10M for inode C
        sync 10M for inode D

It's better to

        sync 128M for inode A
        sync  10M for inode B
        sync  10M for inode C
        sync  10M for inode D
        sync 128M for inode A
        sync 128M for inode A
        sync 128M for inode A
        sync  10M for inode E (newly expired)
        sync 128M for inode A
        ...

Thanks,
Fengguang


* Re: [PATCH 4/6] writeback: sync expired inodes first in background writeback
  2010-07-26 10:57   ` Mel Gorman
  2010-07-26 12:00     ` Wu Fengguang
@ 2010-07-26 12:56     ` Wu Fengguang
  2010-07-26 12:59       ` Mel Gorman
  1 sibling, 1 reply; 62+ messages in thread
From: Wu Fengguang @ 2010-07-26 12:56 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Dave Chinner, Jan Kara, Christoph Hellwig,
	Chris Mason, Jens Axboe, LKML, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org

> > @@ -232,8 +232,15 @@ static void move_expired_inodes(struct l
> >  	while (!list_empty(delaying_queue)) {
> >  		inode = list_entry(delaying_queue->prev, struct inode, i_list);
> >  		if (expire_interval &&
> > -		    inode_dirtied_after(inode, older_than_this))
> > -			break;
> > +		    inode_dirtied_after(inode, older_than_this)) {
> > +			if (wbc->for_background &&
> > +			    list_empty(dispatch_queue) && list_empty(&tmp)) {
> > +				expire_interval >>= 1;
> > +				older_than_this = jiffies - expire_interval;
> > +				continue;
> > +			} else
> > +				break;
> > +		}
> 
> This needs a comment.
> 
> I think what it is saying is that if background flush is active but no
> inodes are old enough, consider newer inodes. This is on the assumption
> that page reclaim has encountered dirty pages and the dirty inodes are
> still too young.

Yes this should be commented. How about this one?

@@ -232,8 +232,20 @@ static void move_expired_inodes(struct l
        while (!list_empty(delaying_queue)) {
                inode = list_entry(delaying_queue->prev, struct inode, i_list);
                if (expire_interval &&
-                   inode_dirtied_after(inode, older_than_this))
+                   inode_dirtied_after(inode, older_than_this)) {
+                       /*
+                        * background writeback will start with expired inodes,
+                        * and then fresh inodes. This order helps reducing
+                        * the number of dirty pages reaching the end of LRU
+                        * lists and cause trouble to the page reclaim.
+                        */
+                       if (wbc->for_background &&
+                           list_empty(dispatch_queue) && list_empty(&tmp)) {
+                               expire_interval = 0;
+                               continue;
+                       }
                        break;
+               }
                if (sb && sb != inode->i_sb)
                        do_sb_sort = 1;
                sb = inode->i_sb;

Thanks,
Fengguang


* Re: [PATCH 4/6] writeback: sync expired inodes first in background writeback
  2010-07-26 12:56     ` Wu Fengguang
@ 2010-07-26 12:59       ` Mel Gorman
  2010-07-26 13:11         ` Wu Fengguang
  0 siblings, 1 reply; 62+ messages in thread
From: Mel Gorman @ 2010-07-26 12:59 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Dave Chinner, Jan Kara, Christoph Hellwig,
	Chris Mason, Jens Axboe, LKML, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org

On Mon, Jul 26, 2010 at 08:56:35PM +0800, Wu Fengguang wrote:
> > > @@ -232,8 +232,15 @@ static void move_expired_inodes(struct l
> > >  	while (!list_empty(delaying_queue)) {
> > >  		inode = list_entry(delaying_queue->prev, struct inode, i_list);
> > >  		if (expire_interval &&
> > > -		    inode_dirtied_after(inode, older_than_this))
> > > -			break;
> > > +		    inode_dirtied_after(inode, older_than_this)) {
> > > +			if (wbc->for_background &&
> > > +			    list_empty(dispatch_queue) && list_empty(&tmp)) {
> > > +				expire_interval >>= 1;
> > > +				older_than_this = jiffies - expire_interval;
> > > +				continue;
> > > +			} else
> > > +				break;
> > > +		}
> > 
> > This needs a comment.
> > 
> > I think what it is saying is that if background flush is active but no
> > inodes are old enough, consider newer inodes. This is on the assumption
> > that page reclaim has encountered dirty pages and the dirty inodes are
> > still too young.
> 
> Yes this should be commented. How about this one?
> 
> @@ -232,8 +232,20 @@ static void move_expired_inodes(struct l
>         while (!list_empty(delaying_queue)) {
>                 inode = list_entry(delaying_queue->prev, struct inode, i_list);
>                 if (expire_interval &&
> -                   inode_dirtied_after(inode, older_than_this))
> +                   inode_dirtied_after(inode, older_than_this)) {
> +                       /*
> +                        * background writeback will start with expired inodes,
> +                        * and then fresh inodes. This order helps reducing
> +                        * the number of dirty pages reaching the end of LRU
> +                        * lists and cause trouble to the page reclaim.
> +                        */

s/reducing/reduce/

Otherwise, it's enough detail to know what is going on. Thanks

Thanks

> +                       if (wbc->for_background &&
> +                           list_empty(dispatch_queue) && list_empty(&tmp)) {
> +                               expire_interval = 0;
> +                               continue;
> +                       }
>                         break;
> +               }
>                 if (sb && sb != inode->i_sb)
>                         do_sb_sort = 1;
>                 sb = inode->i_sb;
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 4/6] writeback: sync expired inodes first in background writeback
  2010-07-26 12:59       ` Mel Gorman
@ 2010-07-26 13:11         ` Wu Fengguang
  2010-07-27  9:45           ` Mel Gorman
  2010-08-01 15:15           ` Minchan Kim
  0 siblings, 2 replies; 62+ messages in thread
From: Wu Fengguang @ 2010-07-26 13:11 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Dave Chinner, Jan Kara, Christoph Hellwig,
	Chris Mason, Jens Axboe, LKML, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org

On Mon, Jul 26, 2010 at 08:59:55PM +0800, Mel Gorman wrote:
> On Mon, Jul 26, 2010 at 08:56:35PM +0800, Wu Fengguang wrote:
> > > > @@ -232,8 +232,15 @@ static void move_expired_inodes(struct l
> > > >  	while (!list_empty(delaying_queue)) {
> > > >  		inode = list_entry(delaying_queue->prev, struct inode, i_list);
> > > >  		if (expire_interval &&
> > > > -		    inode_dirtied_after(inode, older_than_this))
> > > > -			break;
> > > > +		    inode_dirtied_after(inode, older_than_this)) {
> > > > +			if (wbc->for_background &&
> > > > +			    list_empty(dispatch_queue) && list_empty(&tmp)) {
> > > > +				expire_interval >>= 1;
> > > > +				older_than_this = jiffies - expire_interval;
> > > > +				continue;
> > > > +			} else
> > > > +				break;
> > > > +		}
> > > 
> > > This needs a comment.
> > > 
> > > I think what it is saying is that if background flush is active but no
> > > inodes are old enough, consider newer inodes. This is on the assumption
> > > that page reclaim has encountered dirty pages and the dirty inodes are
> > > still too young.
> > 
> > Yes this should be commented. How about this one?
> > 
> > @@ -232,8 +232,20 @@ static void move_expired_inodes(struct l
> >         while (!list_empty(delaying_queue)) {
> >                 inode = list_entry(delaying_queue->prev, struct inode, i_list);
> >                 if (expire_interval &&
> > -                   inode_dirtied_after(inode, older_than_this))
> > +                   inode_dirtied_after(inode, older_than_this)) {
> > +                       /*
> > +                        * background writeback will start with expired inodes,
> > +                        * and then fresh inodes. This order helps reducing
> > +                        * the number of dirty pages reaching the end of LRU
> > +                        * lists and cause trouble to the page reclaim.
> > +                        */
> 
> s/reducing/reduce/
> 
> Otherwise, it's enough detail to know what is going on. Thanks

Thanks. Here is the updated patch.
---
Subject: writeback: sync expired inodes first in background writeback
From: Wu Fengguang <fengguang.wu@intel.com>
Date: Wed Jul 21 20:11:53 CST 2010

A background flush work may run forever, so it's reasonable for it to
mimic the kupdate behavior of syncing old/expired inodes first.

The policy is
- enqueue all newly expired inodes at each queue_io() time
- enqueue all dirty inodes if there are no more expired inodes to sync

This will help reduce the number of dirty pages encountered by page
reclaim, e.g. the pageout() calls. Normally older inodes contain older
dirty pages, which are closer to the end of the LRU lists. So syncing
older inodes first helps reduce the number of dirty pages reached by
the page reclaim code.

CC: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   23 ++++++++++++++++++-----
 1 file changed, 18 insertions(+), 5 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2010-07-26 20:19:01.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-07-26 21:10:42.000000000 +0800
@@ -217,14 +217,14 @@ static void move_expired_inodes(struct l
 				struct writeback_control *wbc)
 {
 	unsigned long expire_interval = 0;
-	unsigned long older_than_this;
+	unsigned long older_than_this = 0; /* reset to kill gcc warning */
 	LIST_HEAD(tmp);
 	struct list_head *pos, *node;
 	struct super_block *sb = NULL;
 	struct inode *inode;
 	int do_sb_sort = 0;
 
-	if (wbc->for_kupdate) {
+	if (wbc->for_kupdate || wbc->for_background) {
 		expire_interval = msecs_to_jiffies(dirty_expire_interval * 10);
 		older_than_this = jiffies - expire_interval;
 	}
@@ -232,8 +232,20 @@ static void move_expired_inodes(struct l
 	while (!list_empty(delaying_queue)) {
 		inode = list_entry(delaying_queue->prev, struct inode, i_list);
 		if (expire_interval &&
-		    inode_dirtied_after(inode, older_than_this))
+		    inode_dirtied_after(inode, older_than_this)) {
+			/*
+			 * background writeback will start with expired inodes,
+			 * and then fresh inodes. This order helps reduce the
+			 * number of dirty pages reaching the end of LRU lists
+			 * and causing trouble to the page reclaim.
+			 */
+			if (wbc->for_background &&
+			    list_empty(dispatch_queue) && list_empty(&tmp)) {
+				expire_interval = 0;
+				continue;
+			}
 			break;
+		}
 		if (sb && sb != inode->i_sb)
 			do_sb_sort = 1;
 		sb = inode->i_sb;
@@ -521,7 +533,8 @@ void writeback_inodes_wb(struct bdi_writ
 
 	wbc->wb_start = jiffies; /* livelock avoidance */
 	spin_lock(&inode_lock);
-	if (!wbc->for_kupdate || list_empty(&wb->b_io))
+
+	if (!(wbc->for_kupdate || wbc->for_background) || list_empty(&wb->b_io))
 		queue_io(wb, wbc);
 
 	while (!list_empty(&wb->b_io)) {
@@ -550,7 +563,7 @@ static void __writeback_inodes_sb(struct
 
 	wbc->wb_start = jiffies; /* livelock avoidance */
 	spin_lock(&inode_lock);
-	if (!wbc->for_kupdate || list_empty(&wb->b_io))
+	if (!(wbc->for_kupdate || wbc->for_background) || list_empty(&wb->b_io))
 		queue_io(wb, wbc);
 	writeback_sb_inodes(sb, wb, wbc, true);
 	spin_unlock(&inode_lock);


* Re: [PATCH 4/6] writeback: sync expired inodes first in background writeback
  2010-07-26 13:11         ` Wu Fengguang
@ 2010-07-27  9:45           ` Mel Gorman
  2010-08-01 15:15           ` Minchan Kim
  1 sibling, 0 replies; 62+ messages in thread
From: Mel Gorman @ 2010-07-27  9:45 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Dave Chinner, Jan Kara, Christoph Hellwig,
	Chris Mason, Jens Axboe, LKML, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org

On Mon, Jul 26, 2010 at 09:11:52PM +0800, Wu Fengguang wrote:
> On Mon, Jul 26, 2010 at 08:59:55PM +0800, Mel Gorman wrote:
> > On Mon, Jul 26, 2010 at 08:56:35PM +0800, Wu Fengguang wrote:
> > > > > @@ -232,8 +232,15 @@ static void move_expired_inodes(struct l
> > > > >  	while (!list_empty(delaying_queue)) {
> > > > >  		inode = list_entry(delaying_queue->prev, struct inode, i_list);
> > > > >  		if (expire_interval &&
> > > > > -		    inode_dirtied_after(inode, older_than_this))
> > > > > -			break;
> > > > > +		    inode_dirtied_after(inode, older_than_this)) {
> > > > > +			if (wbc->for_background &&
> > > > > +			    list_empty(dispatch_queue) && list_empty(&tmp)) {
> > > > > +				expire_interval >>= 1;
> > > > > +				older_than_this = jiffies - expire_interval;
> > > > > +				continue;
> > > > > +			} else
> > > > > +				break;
> > > > > +		}
> > > > 
> > > > This needs a comment.
> > > > 
> > > > I think what it is saying is that if background flush is active but no
> > > > inodes are old enough, consider newer inodes. This is on the assumption
> > > > that page reclaim has encountered dirty pages and the dirty inodes are
> > > > still too young.
> > > 
> > > Yes this should be commented. How about this one?
> > > 
> > > @@ -232,8 +232,20 @@ static void move_expired_inodes(struct l
> > >         while (!list_empty(delaying_queue)) {
> > >                 inode = list_entry(delaying_queue->prev, struct inode, i_list);
> > >                 if (expire_interval &&
> > > -                   inode_dirtied_after(inode, older_than_this))
> > > +                   inode_dirtied_after(inode, older_than_this)) {
> > > +                       /*
> > > +                        * background writeback will start with expired inodes,
> > > +                        * and then fresh inodes. This order helps reducing
> > > +                        * the number of dirty pages reaching the end of LRU
> > > +                        * lists and cause trouble to the page reclaim.
> > > +                        */
> > 
> > s/reducing/reduce/
> > 
> > Otherwise, it's enough detail to know what is going on. Thanks
> 
> Thanks. Here is the updated patch.
> ---
> Subject: writeback: sync expired inodes first in background writeback
> From: Wu Fengguang <fengguang.wu@intel.com>
> Date: Wed Jul 21 20:11:53 CST 2010
> 
> A background flush work may run for ever. So it's reasonable for it to
> mimic the kupdate behavior of syncing old/expired inodes first.
> 
> The policy is
> - enqueue all newly expired inodes at each queue_io() time
> - enqueue all dirty inodes if there are no more expired inodes to sync
> 
> This will help reduce the number of dirty pages encountered by page
> reclaim, eg. the pageout() calls. Normally older inodes contain older
> dirty pages, which are more close to the end of the LRU lists. So
> syncing older inodes first helps reducing the dirty pages reached by
> the page reclaim code.
> 
> CC: Jan Kara <jack@suse.cz>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>

Acked-by: Mel Gorman <mel@csn.ul.ie>

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 4/6] writeback: sync expired inodes first in background writeback
  2010-07-26 13:11         ` Wu Fengguang
  2010-07-27  9:45           ` Mel Gorman
@ 2010-08-01 15:15           ` Minchan Kim
  1 sibling, 0 replies; 62+ messages in thread
From: Minchan Kim @ 2010-08-01 15:15 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Mel Gorman, Andrew Morton, Dave Chinner, Jan Kara,
	Christoph Hellwig, Chris Mason, Jens Axboe, LKML,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org

Hi Wu, 

> Subject: writeback: sync expired inodes first in background writeback
> From: Wu Fengguang <fengguang.wu@intel.com>
> Date: Wed Jul 21 20:11:53 CST 2010
> 
> A background flush work may run for ever. So it's reasonable for it to
> mimic the kupdate behavior of syncing old/expired inodes first.
> 
> The policy is
> - enqueue all newly expired inodes at each queue_io() time
> - enqueue all dirty inodes if there are no more expired inodes to sync
> 
> This will help reduce the number of dirty pages encountered by page
> reclaim, eg. the pageout() calls. Normally older inodes contain older
> dirty pages, which are more close to the end of the LRU lists. So
> syncing older inodes first helps reducing the dirty pages reached by
> the page reclaim code.
> 
> CC: Jan Kara <jack@suse.cz>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/fs-writeback.c |   23 ++++++++++++++++++-----
>  1 file changed, 18 insertions(+), 5 deletions(-)
> 
> --- linux-next.orig/fs/fs-writeback.c	2010-07-26 20:19:01.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2010-07-26 21:10:42.000000000 +0800
> @@ -217,14 +217,14 @@ static void move_expired_inodes(struct l
>  				struct writeback_control *wbc)
>  {
>  	unsigned long expire_interval = 0;
> -	unsigned long older_than_this;
> +	unsigned long older_than_this = 0; /* reset to kill gcc warning */

Maybe I am rather late.

Nitpick: using uninitialized_var() would be more consistent. :)

I haven't followed this patch series closely, but it is a fundamental way
to go for reducing pageout.
-- 
Kind regards,
Minchan Kim


* Re: [PATCH 1/6] writeback: pass writeback_control down to move_expired_inodes()
  2010-07-22  5:09 ` [PATCH 1/6] writeback: pass writeback_control down to move_expired_inodes() Wu Fengguang
  2010-07-23 18:16   ` Jan Kara
  2010-07-26 10:44   ` Mel Gorman
@ 2010-08-01 15:23   ` Minchan Kim
  2 siblings, 0 replies; 62+ messages in thread
From: Minchan Kim @ 2010-08-01 15:23 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Dave Chinner, Christoph Hellwig, Mel Gorman,
	Chris Mason, Jens Axboe, LKML, linux-fsdevel, linux-mm

On Thu, Jul 22, 2010 at 01:09:29PM +0800, Wu Fengguang wrote:
> This is to prepare for moving the dirty expire policy to move_expired_inodes().
> No behavior change.
> 
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

-- 
Kind regards,
Minchan Kim


* Re: [PATCH 2/6] writeback: the kupdate expire timestamp should be a moving target
  2010-07-22  5:09 ` [PATCH 2/6] writeback: the kupdate expire timestamp should be a moving target Wu Fengguang
  2010-07-23 18:17   ` Jan Kara
  2010-07-26 10:52   ` Mel Gorman
@ 2010-08-01 15:29   ` Minchan Kim
  2 siblings, 0 replies; 62+ messages in thread
From: Minchan Kim @ 2010-08-01 15:29 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Dave Chinner, Jan Kara, Christoph Hellwig,
	Mel Gorman, Chris Mason, Jens Axboe, LKML, linux-fsdevel,
	linux-mm

On Thu, Jul 22, 2010 at 01:09:30PM +0800, Wu Fengguang wrote:
> Dynamically compute the dirty expire timestamp at queue_io() time.
> Also remove writeback_control.older_than_this which is no longer used.
> 
> writeback_control.older_than_this used to be determined at entrance to
> the kupdate writeback work. This _static_ timestamp may go stale if the
> kupdate work runs on and on. The flusher may then get stuck with some old
> busy inodes, never considering newly expired inodes thereafter.
> 
> This has two possible problems:
> 
> - It is unfair for a large dirty inode to delay (for a long time) the
>   writeback of small dirty inodes.
> 
> - As time goes by, the large and busy dirty inode may contain only
>   _freshly_ dirtied pages. Ignoring newly expired dirty inodes risks
>   delaying the expired dirty pages to the end of LRU lists, triggering
>   the very bad pageout(). Nevertheless this patch merely addresses part
>   of the problem.
> 
> CC: Jan Kara <jack@suse.cz>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/fs-writeback.c                |   24 +++++++++---------------
>  include/linux/writeback.h        |    2 --
>  include/trace/events/writeback.h |    6 +-----
>  mm/backing-dev.c                 |    1 -
>  mm/page-writeback.c              |    1 -
>  5 files changed, 10 insertions(+), 24 deletions(-)
> 
> --- linux-next.orig/fs/fs-writeback.c	2010-07-21 22:20:01.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2010-07-22 11:23:27.000000000 +0800
> @@ -216,16 +216,23 @@ static void move_expired_inodes(struct l
>  				struct list_head *dispatch_queue,
>  				struct writeback_control *wbc)
>  {
> +	unsigned long expire_interval = 0;
> +	unsigned long older_than_this;
>  	LIST_HEAD(tmp);
>  	struct list_head *pos, *node;
>  	struct super_block *sb = NULL;
>  	struct inode *inode;
>  	int do_sb_sort = 0;
>  
> +	if (wbc->for_kupdate) {
> +		expire_interval = msecs_to_jiffies(dirty_expire_interval * 10);
> +		older_than_this = jiffies - expire_interval;
> +	}
> +
>  	while (!list_empty(delaying_queue)) {
>  		inode = list_entry(delaying_queue->prev, struct inode, i_list);
> -		if (wbc->older_than_this &&
> -		    inode_dirtied_after(inode, *wbc->older_than_this))
> +		if (expire_interval &&
> +		    inode_dirtied_after(inode, older_than_this))
>  			break;
>  		if (sb && sb != inode->i_sb)
>  			do_sb_sort = 1;
> @@ -583,29 +590,19 @@ static inline bool over_bground_thresh(v
>   * Try to run once per dirty_writeback_interval.  But if a writeback event
>   * takes longer than a dirty_writeback_interval interval, then leave a
>   * one-second gap.
> - *
> - * older_than_this takes precedence over nr_to_write.  So we'll only write back
> - * all dirty pages if they are all attached to "old" mappings.
>   */
>  static long wb_writeback(struct bdi_writeback *wb,
>  			 struct wb_writeback_work *work)
>  {
>  	struct writeback_control wbc = {
>  		.sync_mode		= work->sync_mode,
> -		.older_than_this	= NULL,
>  		.for_kupdate		= work->for_kupdate,
>  		.for_background		= work->for_background,
>  		.range_cyclic		= work->range_cyclic,
>  	};
> -	unsigned long oldest_jif;
>  	long wrote = 0;
>  	struct inode *inode;
>  
> -	if (wbc.for_kupdate) {
> -		wbc.older_than_this = &oldest_jif;
> -		oldest_jif = jiffies -
> -				msecs_to_jiffies(dirty_expire_interval * 10);
> -	}
>  	if (!wbc.range_cyclic) {
>  		wbc.range_start = 0;
>  		wbc.range_end = LLONG_MAX;
> @@ -998,9 +995,6 @@ EXPORT_SYMBOL(__mark_inode_dirty);
>   * Write out a superblock's list of dirty inodes.  A wait will be performed
>   * upon no inodes, all inodes or the final one, depending upon sync_mode.
>   *
> - * If older_than_this is non-NULL, then only write out inodes which
> - * had their first dirtying at a time earlier than *older_than_this.
> - *
>   * If `bdi' is non-zero then we're being asked to writeback a specific queue.
>   * This function assumes that the blockdev superblock's inodes are backed by
>   * a variety of queues, so all inodes are searched.  For other superblocks,
> --- linux-next.orig/include/linux/writeback.h	2010-07-21 22:20:02.000000000 +0800
> +++ linux-next/include/linux/writeback.h	2010-07-22 11:23:27.000000000 +0800
> @@ -28,8 +28,6 @@ enum writeback_sync_modes {
>   */
>  struct writeback_control {
>  	enum writeback_sync_modes sync_mode;
> -	unsigned long *older_than_this;	/* If !NULL, only write back inodes
> -					   older than this */
>  	unsigned long wb_start;         /* Time writeback_inodes_wb was
>  					   called. This is needed to avoid
>  					   extra jobs and livelock */

In addition, we should remove older_than_this in btrfs and reiser4.
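The static-versus-moving distinction is easy to model outside the kernel (a
toy sketch with a made-up HZ and a user-space msecs_to_jiffies(), not the
real fs/fs-writeback.c code): recomputing the cutoff at each queue_io() call
lets inodes that expire while the kupdate work keeps running be picked up by
the next batch.

```c
/* Toy model: an inode is eligible once its dirty timestamp falls at or
 * before the cutoff, and the cutoff advances as jiffies advances. */
#define HZ 100UL

static unsigned long msecs_to_jiffies(unsigned long ms)
{
	return ms * HZ / 1000;
}

static const unsigned long dirty_expire_interval = 3000;	/* centisecs */

static unsigned long older_than_this(unsigned long jiffies)
{
	/* recomputed per queue_io() call, hence a moving target */
	return jiffies - msecs_to_jiffies(dirty_expire_interval * 10);
}

static int inode_expired(unsigned long dirtied_when, unsigned long jiffies)
{
	return dirtied_when <= older_than_this(jiffies);
}
```

With a cutoff computed once at entry, an inode dirtied just before that
snapshot stays "fresh" forever; with the moving target it expires as soon as
30 seconds have passed.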

-- 
Kind regards,
Minchan Kim


* Re: [PATCH 3/6] writeback: kill writeback_control.more_io
  2010-07-22  5:09 ` [PATCH 3/6] writeback: kill writeback_control.more_io Wu Fengguang
  2010-07-23 18:24   ` Jan Kara
  2010-07-26 10:53   ` Mel Gorman
@ 2010-08-01 15:34   ` Minchan Kim
  2010-08-05 14:50     ` Wu Fengguang
  2 siblings, 1 reply; 62+ messages in thread
From: Minchan Kim @ 2010-08-01 15:34 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Dave Chinner, Christoph Hellwig, Mel Gorman,
	Chris Mason, Jens Axboe, LKML, linux-fsdevel, linux-mm

On Thu, Jul 22, 2010 at 01:09:31PM +0800, Wu Fengguang wrote:
> When wbc.more_io was first introduced, it indicated whether there was
> at least one superblock whose s_more_io contained more IO work. Now with
> the per-bdi writeback, it can be replaced with a simple b_more_io test.
> 
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/fs-writeback.c                |    9 ++-------
>  include/linux/writeback.h        |    1 -
>  include/trace/events/writeback.h |    5 +----
>  3 files changed, 3 insertions(+), 12 deletions(-)
> 
> --- linux-next.orig/fs/fs-writeback.c	2010-07-22 11:23:27.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2010-07-22 12:56:42.000000000 +0800
> @@ -507,12 +507,8 @@ static int writeback_sb_inodes(struct su
>  		iput(inode);
>  		cond_resched();
>  		spin_lock(&inode_lock);
> -		if (wbc->nr_to_write <= 0) {
> -			wbc->more_io = 1;
> +		if (wbc->nr_to_write <= 0)
>  			return 1;
> -		}
> -		if (!list_empty(&wb->b_more_io))
> -			wbc->more_io = 1;
>  	}
>  	/* b_io is empty */
>  	return 1;
> @@ -622,7 +618,6 @@ static long wb_writeback(struct bdi_writ
>  		if (work->for_background && !over_bground_thresh())
>  			break;
>  
> -		wbc.more_io = 0;
>  		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
>  		wbc.pages_skipped = 0;
>  
> @@ -644,7 +639,7 @@ static long wb_writeback(struct bdi_writ
>  		/*
>  		 * Didn't write everything and we don't have more IO, bail
>  		 */
> -		if (!wbc.more_io)
> +		if (list_empty(&wb->b_more_io))
>  			break;
>  		/*
>  		 * Did we write something? Try for more
> --- linux-next.orig/include/linux/writeback.h	2010-07-22 11:23:27.000000000 +0800
> +++ linux-next/include/linux/writeback.h	2010-07-22 11:24:46.000000000 +0800
> @@ -49,7 +49,6 @@ struct writeback_control {
>  	unsigned for_background:1;	/* A background writeback */
>  	unsigned for_reclaim:1;		/* Invoked from the page allocator */
>  	unsigned range_cyclic:1;	/* range_start is cyclic */
> -	unsigned more_io:1;		/* more io to be dispatched */
>  };
>  
>  /*
> --- linux-next.orig/include/trace/events/writeback.h	2010-07-22 11:23:27.000000000 +0800
> +++ linux-next/include/trace/events/writeback.h	2010-07-22 11:24:46.000000000 +0800
> @@ -99,7 +99,6 @@ DECLARE_EVENT_CLASS(wbc_class,
>  		__field(int, for_background)
>  		__field(int, for_reclaim)
>  		__field(int, range_cyclic)
> -		__field(int, more_io)
>  		__field(long, range_start)
>  		__field(long, range_end)
>  	),
> @@ -113,13 +112,12 @@ DECLARE_EVENT_CLASS(wbc_class,
>  		__entry->for_background	= wbc->for_background;
>  		__entry->for_reclaim	= wbc->for_reclaim;
>  		__entry->range_cyclic	= wbc->range_cyclic;
> -		__entry->more_io	= wbc->more_io;
>  		__entry->range_start	= (long)wbc->range_start;
>  		__entry->range_end	= (long)wbc->range_end;
>  	),
>  
>  	TP_printk("bdi %s: towrt=%ld skip=%ld mode=%d kupd=%d "
> -		"bgrd=%d reclm=%d cyclic=%d more=%d "
> +		"bgrd=%d reclm=%d cyclic=%d "
>  		"start=0x%lx end=0x%lx",
>  		__entry->name,
>  		__entry->nr_to_write,
> @@ -129,7 +127,6 @@ DECLARE_EVENT_CLASS(wbc_class,
>  		__entry->for_background,
>  		__entry->for_reclaim,
>  		__entry->range_cyclic,
> -		__entry->more_io,
>  		__entry->range_start,
>  		__entry->range_end)
>  )
> 
> 
> --

include/trace/events/ext4.h also has a more_io field.
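As a side note, the substance of the patch shows up in miniature with a
user-space list (a sketch mirroring include/linux/list.h; the b_more_io name
is just illustrative): once writeback is per-bdi, "is there more IO" is simply
whether b_more_io is non-empty, so the cached wbc.more_io flag is redundant.

```c
/* Minimal circular list, in the style of include/linux/list.h */
struct list_head { struct list_head *next, *prev; };

static void INIT_LIST_HEAD(struct list_head *h) { h->next = h->prev = h; }

static int list_empty(const struct list_head *h) { return h->next == h; }

static void list_add_tail(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

/* With the patch, this test replaces the wbc.more_io flag */
static int more_io(const struct list_head *b_more_io)
{
	return !list_empty(b_more_io);
}
```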

-- 
Kind regards,
Minchan Kim


* Re: [PATCH 3/6] writeback: kill writeback_control.more_io
  2010-08-01 15:34   ` Minchan Kim
@ 2010-08-05 14:50     ` Wu Fengguang
  2010-08-05 14:55       ` Wu Fengguang
  2010-08-05 14:56       ` Minchan Kim
  0 siblings, 2 replies; 62+ messages in thread
From: Wu Fengguang @ 2010-08-05 14:50 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Dave Chinner, Christoph Hellwig, Mel Gorman,
	Chris Mason, Jens Axboe, LKML, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org

> include/trace/events/ext4.h also have more_io field. 

I didn't find it in linux-next. What's your kernel version?

Thanks,
Fengguang


* Re: [PATCH 3/6] writeback: kill writeback_control.more_io
  2010-08-05 14:50     ` Wu Fengguang
@ 2010-08-05 14:55       ` Wu Fengguang
  2010-08-05 14:56       ` Minchan Kim
  1 sibling, 0 replies; 62+ messages in thread
From: Wu Fengguang @ 2010-08-05 14:55 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Dave Chinner, Christoph Hellwig, Mel Gorman,
	Chris Mason, Jens Axboe, LKML, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org

On Thu, Aug 05, 2010 at 10:50:53PM +0800, Wu Fengguang wrote:
> > include/trace/events/ext4.h also have more_io field. 
> 
> I didn't find it in linux-next. What's your kernel version?

Oh it's in mmotm :)

Thanks,
Fengguang


* Re: [PATCH 3/6] writeback: kill writeback_control.more_io
  2010-08-05 14:50     ` Wu Fengguang
  2010-08-05 14:55       ` Wu Fengguang
@ 2010-08-05 14:56       ` Minchan Kim
  2010-08-05 15:26         ` Wu Fengguang
  1 sibling, 1 reply; 62+ messages in thread
From: Minchan Kim @ 2010-08-05 14:56 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Dave Chinner, Christoph Hellwig, Mel Gorman,
	Chris Mason, Jens Axboe, LKML, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org

On Thu, Aug 05, 2010 at 10:50:53PM +0800, Wu Fengguang wrote:
> > include/trace/events/ext4.h also have more_io field. 
> 
> I didn't find it in linux-next. What's your kernel version?

I used mmotm-07-29. 

> 
> Thanks,
> Fengguang

-- 
Kind regards,
Minchan Kim


* Re: [PATCH 3/6] writeback: kill writeback_control.more_io
  2010-08-05 14:56       ` Minchan Kim
@ 2010-08-05 15:26         ` Wu Fengguang
  0 siblings, 0 replies; 62+ messages in thread
From: Wu Fengguang @ 2010-08-05 15:26 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Dave Chinner, Christoph Hellwig, Mel Gorman,
	Chris Mason, Jens Axboe, LKML, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org

On Thu, Aug 05, 2010 at 10:56:06PM +0800, Minchan Kim wrote:
> On Thu, Aug 05, 2010 at 10:50:53PM +0800, Wu Fengguang wrote:
> > > include/trace/events/ext4.h also have more_io field. 
> > 
> > I didn't find it in linux-next. What's your kernel version?
> 
> I used mmotm-07-29. 

Heh it's in linux-next too -- I didn't find the field because the
chunk to remove it slipped into a previous patch..

Thanks,
Fengguang


* [PATCH 5/6] writeback: try more writeback as long as something was written
  2011-04-19  3:00 [PATCH 0/6] writeback: moving expire targets for background/kupdate works Wu Fengguang
@ 2011-04-19  3:00 ` Wu Fengguang
  2011-04-19 10:20   ` Jan Kara
  0 siblings, 1 reply; 62+ messages in thread
From: Wu Fengguang @ 2011-04-19  3:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Mel Gorman, Wu Fengguang, Dave Chinner, Trond Myklebust,
	Itaru Kitayama, Minchan Kim, LKML, linux-fsdevel,
	Linux Memory Management List

[-- Attachment #1: writeback-background-retry.patch --]
[-- Type: text/plain, Size: 3069 bytes --]

writeback_inodes_wb()/__writeback_inodes_sb() are not aggressive in that
they only populate possibly a subset of eligible inodes into b_io at
entrance time. When the queued set of inodes is all synced, they just
return, possibly with all queued inode pages written but still
wbc.nr_to_write > 0.

For kupdate and background writeback, there may be more eligible inodes
sitting in b_dirty when the current set of b_io inodes is completed. So
it is necessary to try another round of writeback as long as we made some
progress in this round. When there are no more eligible inodes, no more
inodes will be enqueued in queue_io(), hence nothing could/will be
synced and we may safely bail.

Jan raised the concern

	I'm just afraid that in some pathological cases this could
	result in bad writeback pattern - like if there is some process
	which manages to dirty just a few pages while we are doing
	writeout, this looping could result in writing just a few pages
	in each round which is bad for fragmentation etc.

However it requires really strong timing to make that (continuously)
happen.  In practice it's very hard to produce such a pattern even if
it's possible in theory. I actually tried to write 1 page per 1ms with
this command

	write-and-fsync -n10000 -S 1000 -c 4096 /fs/test

and do sync(1) at the same time. The sync completes quickly on ext4,
xfs, btrfs. The readers could try other write-and-sleep patterns and
check if it can block sync for longer time.

CC: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2011-04-19 10:18:30.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-04-19 10:18:31.000000000 +0800
@@ -750,23 +750,23 @@ static long wb_writeback(struct bdi_writ
 		wrote += write_chunk - wbc.nr_to_write;
 
 		/*
-		 * If we consumed everything, see if we have more
+		 * Did we write something? Try for more
+		 *
+		 * Dirty inodes are moved to b_io for writeback in batches.
+		 * The completion of the current batch does not necessarily
+		 * mean the overall work is done. So we keep looping as long
+		 * as made some progress on cleaning pages or inodes.
 		 */
-		if (wbc.nr_to_write <= 0)
+		if (wbc.nr_to_write < write_chunk)
 			continue;
 		if (wbc.inodes_cleaned)
 			continue;
 		/*
-		 * Didn't write everything and we don't have more IO, bail
+		 * No more inodes for IO, bail
 		 */
 		if (!wbc.more_io)
 			break;
 		/*
-		 * Did we write something? Try for more
-		 */
-		if (wbc.nr_to_write < write_chunk)
-			continue;
-		/*
 		 * Nothing written. Wait for some inode to
 		 * become available for writeback. Otherwise
 		 * we'll just busyloop.



* Re: [PATCH 5/6] writeback: try more writeback as long as something was written
  2011-04-19  3:00 ` [PATCH 5/6] writeback: try more writeback as long as something was written Wu Fengguang
@ 2011-04-19 10:20   ` Jan Kara
  2011-04-19 11:16     ` Wu Fengguang
  0 siblings, 1 reply; 62+ messages in thread
From: Jan Kara @ 2011-04-19 10:20 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Mel Gorman, Dave Chinner,
	Trond Myklebust, Itaru Kitayama, Minchan Kim, LKML, linux-fsdevel,
	Linux Memory Management List

On Tue 19-04-11 11:00:08, Wu Fengguang wrote:
> writeback_inodes_wb()/__writeback_inodes_sb() are not aggressive in that
> they only populate possibly a subset of elegible inodes into b_io at
> entrance time. When the queued set of inodes are all synced, they just
> return, possibly with all queued inode pages written but still
> wbc.nr_to_write > 0.
> 
> For kupdate and background writeback, there may be more eligible inodes
> sitting in b_dirty when the current set of b_io inodes are completed. So
> it is necessary to try another round of writeback as long as we made some
> progress in this round. When there are no more eligible inodes, no more
> inodes will be enqueued in queue_io(), hence nothing could/will be
> synced and we may safely bail.
  Let me understand your concern here: You are afraid that if we do
for_background or for_kupdate writeback and we write less than
MAX_WRITEBACK_PAGES, we stop doing writeback although there could be more
inodes to write at the time we are stopping writeback - the two realistic
cases I can think of are:
a) when inodes just freshly expired during writeback
b) when bdi has less than MAX_WRITEBACK_PAGES of dirty data but we are over
  background threshold due to data on some other bdi. And then while we are
  doing writeback someone does dirtying at our bdi.
Or do you see some other case as well?

The a) case does not seem like a big issue to me after your changes to
move_expired_inodes(). The b) case maybe but do you think it will make any
difference? 

								Honza
> 
> Jan raised the concern
> 
> 	I'm just afraid that in some pathological cases this could
> 	result in bad writeback pattern - like if there is some process
> 	which manages to dirty just a few pages while we are doing
> 	writeout, this looping could result in writing just a few pages
> 	in each round which is bad for fragmentation etc.
> 
> However it requires really strong timing to make that to (continuously)
> happen.  In practice it's very hard to produce such a pattern even if
> it's possible in theory. I actually tried to write 1 page per 1ms with
> this command
> 
> 	write-and-fsync -n10000 -S 1000 -c 4096 /fs/test
> 
> and do sync(1) at the same time. The sync completes quickly on ext4,
> xfs, btrfs. The readers could try other write-and-sleep patterns and
> check if it can block sync for longer time.
> 
> CC: Jan Kara <jack@suse.cz>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/fs-writeback.c |   16 ++++++++--------
>  1 file changed, 8 insertions(+), 8 deletions(-)
> 
> --- linux-next.orig/fs/fs-writeback.c	2011-04-19 10:18:30.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2011-04-19 10:18:31.000000000 +0800
> @@ -750,23 +750,23 @@ static long wb_writeback(struct bdi_writ
>  		wrote += write_chunk - wbc.nr_to_write;
>  
>  		/*
> -		 * If we consumed everything, see if we have more
> +		 * Did we write something? Try for more
> +		 *
> +		 * Dirty inodes are moved to b_io for writeback in batches.
> +		 * The completion of the current batch does not necessarily
> +		 * mean the overall work is done. So we keep looping as long
> +		 * as made some progress on cleaning pages or inodes.
>  		 */
> -		if (wbc.nr_to_write <= 0)
> +		if (wbc.nr_to_write < write_chunk)
>  			continue;
>  		if (wbc.inodes_cleaned)
>  			continue;
>  		/*
> -		 * Didn't write everything and we don't have more IO, bail
> +		 * No more inodes for IO, bail
>  		 */
>  		if (!wbc.more_io)
>  			break;
>  		/*
> -		 * Did we write something? Try for more
> -		 */
> -		if (wbc.nr_to_write < write_chunk)
> -			continue;
> -		/*
>  		 * Nothing written. Wait for some inode to
>  		 * become available for writeback. Otherwise
>  		 * we'll just busyloop.
> 
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCH 5/6] writeback: try more writeback as long as something was written
  2011-04-19 10:20   ` Jan Kara
@ 2011-04-19 11:16     ` Wu Fengguang
  2011-04-19 21:10       ` Jan Kara
  0 siblings, 1 reply; 62+ messages in thread
From: Wu Fengguang @ 2011-04-19 11:16 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Mel Gorman, Dave Chinner, Trond Myklebust,
	Itaru Kitayama, Minchan Kim, LKML, linux-fsdevel@vger.kernel.org,
	Linux Memory Management List

On Tue, Apr 19, 2011 at 06:20:16PM +0800, Jan Kara wrote:
> On Tue 19-04-11 11:00:08, Wu Fengguang wrote:
> > writeback_inodes_wb()/__writeback_inodes_sb() are not aggressive in that
> > they only populate possibly a subset of elegible inodes into b_io at
> > entrance time. When the queued set of inodes are all synced, they just
> > return, possibly with all queued inode pages written but still
> > wbc.nr_to_write > 0.
> > 
> > For kupdate and background writeback, there may be more eligible inodes
> > sitting in b_dirty when the current set of b_io inodes are completed. So
> > it is necessary to try another round of writeback as long as we made some
> > progress in this round. When there are no more eligible inodes, no more
> > inodes will be enqueued in queue_io(), hence nothing could/will be
> > synced and we may safely bail.
>   Let me understand your concern here: You are afraid that if we do
> for_background or for_kupdate writeback and we write less than
> MAX_WRITEBACK_PAGES, we stop doing writeback although there could be more
> inodes to write at the time we are stopping writeback - the two realistic

Yes.

> cases I can think of are:
> a) when inodes just freshly expired during writeback
> b) when bdi has less than MAX_WRITEBACK_PAGES of dirty data but we are over
>   background threshold due to data on some other bdi. And then while we are
>   doing writeback someone does dirtying at our bdi.
> Or do you see some other case as well?
> 
> The a) case does not seem like a big issue to me after your changes to

Yeah (a) is not an issue with kupdate writeback.

> move_expired_inodes(). The b) case maybe but do you think it will make any
> difference? 

(b) also seems weird. What I have in mind is this for_background case.
Imagine 100 inodes

        i0, i1, i2, ..., i90, i91, ..., i99

At queue_io() time, i90-i99 happen to be expired and moved to b_io for
IO. When finished successfully, if their total size is less than
MAX_WRITEBACK_PAGES, nr_to_write will be > 0. Then wb_writeback() will
quit the background work (w/o this patch) while it's still over
background threshold.

This will be a fairly normal/frequent case I guess.
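The early quit described above can be put in a few lines (a toy model with
made-up numbers; write_chunk plays the role of MAX_WRITEBACK_PAGES and the
function is a simplification of one wb_writeback() iteration, not kernel
code):

```c
/* Given how many pages the just-finished batch wrote, does background
 * writeback loop for another batch?  Before the patch it retried only
 * when the whole chunk was consumed; after it, any progress is enough. */
static int try_more(long write_chunk, long pages_written, int patched)
{
	long nr_to_write = write_chunk - pages_written;

	if (patched)
		return nr_to_write < write_chunk;	/* wrote something */
	return nr_to_write <= 0;			/* chunk fully consumed */
}
```

For the i90-i99 case: a 100-page expired batch out of a 1024-page chunk makes
the unpatched predicate false, ending the background work while still over
the threshold, whereas the patched one keeps going.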

Thanks,
Fengguang

> 								Honza
> > 
> > Jan raised the concern
> > 
> > 	I'm just afraid that in some pathological cases this could
> > 	result in bad writeback pattern - like if there is some process
> > 	which manages to dirty just a few pages while we are doing
> > 	writeout, this looping could result in writing just a few pages
> > 	in each round which is bad for fragmentation etc.
> > 
> > However it requires really strong timing to make that to (continuously)
> > happen.  In practice it's very hard to produce such a pattern even if
> > it's possible in theory. I actually tried to write 1 page per 1ms with
> > this command
> > 
> > 	write-and-fsync -n10000 -S 1000 -c 4096 /fs/test
> > 
> > and do sync(1) at the same time. The sync completes quickly on ext4,
> > xfs, btrfs. The readers could try other write-and-sleep patterns and
> > check if it can block sync for longer time.
> > 
> > CC: Jan Kara <jack@suse.cz>
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> >  fs/fs-writeback.c |   16 ++++++++--------
> >  1 file changed, 8 insertions(+), 8 deletions(-)
> > 
> > --- linux-next.orig/fs/fs-writeback.c	2011-04-19 10:18:30.000000000 +0800
> > +++ linux-next/fs/fs-writeback.c	2011-04-19 10:18:31.000000000 +0800
> > @@ -750,23 +750,23 @@ static long wb_writeback(struct bdi_writ
> >  		wrote += write_chunk - wbc.nr_to_write;
> >  
> >  		/*
> > -		 * If we consumed everything, see if we have more
> > +		 * Did we write something? Try for more
> > +		 *
> > +		 * Dirty inodes are moved to b_io for writeback in batches.
> > +		 * The completion of the current batch does not necessarily
> > +		 * mean the overall work is done. So we keep looping as long
> > +		 * as made some progress on cleaning pages or inodes.
> >  		 */
> > -		if (wbc.nr_to_write <= 0)
> > +		if (wbc.nr_to_write < write_chunk)
> >  			continue;
> >  		if (wbc.inodes_cleaned)
> >  			continue;
> >  		/*
> > -		 * Didn't write everything and we don't have more IO, bail
> > +		 * No more inodes for IO, bail
> >  		 */
> >  		if (!wbc.more_io)
> >  			break;
> >  		/*
> > -		 * Did we write something? Try for more
> > -		 */
> > -		if (wbc.nr_to_write < write_chunk)
> > -			continue;
> > -		/*
> >  		 * Nothing written. Wait for some inode to
> >  		 * become available for writeback. Otherwise
> >  		 * we'll just busyloop.
> > 
> > 
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR


* Re: [PATCH 5/6] writeback: try more writeback as long as something was written
  2011-04-19 11:16     ` Wu Fengguang
@ 2011-04-19 21:10       ` Jan Kara
  2011-04-20  7:50         ` Wu Fengguang
  0 siblings, 1 reply; 62+ messages in thread
From: Jan Kara @ 2011-04-19 21:10 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Andrew Morton, Mel Gorman, Dave Chinner,
	Trond Myklebust, Itaru Kitayama, Minchan Kim, LKML,
	linux-fsdevel@vger.kernel.org, Linux Memory Management List

On Tue 19-04-11 19:16:01, Wu Fengguang wrote:
> On Tue, Apr 19, 2011 at 06:20:16PM +0800, Jan Kara wrote:
> > On Tue 19-04-11 11:00:08, Wu Fengguang wrote:
> > > writeback_inodes_wb()/__writeback_inodes_sb() are not aggressive in that
> > > they only populate possibly a subset of elegible inodes into b_io at
> > > entrance time. When the queued set of inodes are all synced, they just
> > > return, possibly with all queued inode pages written but still
> > > wbc.nr_to_write > 0.
> > > 
> > > For kupdate and background writeback, there may be more eligible inodes
> > > sitting in b_dirty when the current set of b_io inodes are completed. So
> > > it is necessary to try another round of writeback as long as we made some
> > > progress in this round. When there are no more eligible inodes, no more
> > > inodes will be enqueued in queue_io(), hence nothing could/will be
> > > synced and we may safely bail.
> >   Let me understand your concern here: You are afraid that if we do
> > for_background or for_kupdate writeback and we write less than
> > MAX_WRITEBACK_PAGES, we stop doing writeback although there could be more
> > inodes to write at the time we are stopping writeback - the two realistic
> 
> Yes.
> 
> > cases I can think of are:
> > a) when inodes just freshly expired during writeback
> > b) when bdi has less than MAX_WRITEBACK_PAGES of dirty data but we are over
> >   background threshold due to data on some other bdi. And then while we are
> >   doing writeback someone does dirtying at our bdi.
> > Or do you see some other case as well?
> > 
> > The a) case does not seem like a big issue to me after your changes to
> 
> Yeah (a) is not an issue with kupdate writeback.
> 
> > move_expired_inodes(). The b) case maybe but do you think it will make any
> > difference? 
> 
> (b) seems also weird. What in my mind is this for_background case.
> Imagine 100 inodes
> 
>         i0, i1, i2, ..., i90, i91, i99
> 
> At queue_io() time, i90-i99 happen to be expired and moved to s_io for
> IO. When finished successfully, if their total size is less than
> MAX_WRITEBACK_PAGES, nr_to_write will be > 0. Then wb_writeback() will
> quit the background work (w/o this patch) while it's still over
> background threshold.
> 
> This will be a fairly normal/frequent case I guess.
  Ah OK, I see. I missed this case your patch set has added. Also your
changes of
        if (!wbc->for_kupdate || list_empty(&wb->b_io))
to
	if (list_empty(&wb->b_io))
are going to cause more cases when we'd hit nr_to_write > 0 (e.g. when one
pass of b_io does not write all the inodes so some are left in b_io list
and then next call to writeback finds these inodes there but there's less
than MAX_WRITEBACK_PAGES in them). Frankly, it makes me like the above
change even less. I'd rather see writeback_inodes_wb /
__writeback_inodes_sb always work on a fresh set of inodes which is
initialized whenever we enter these functions. It just seems less
surprising to me...
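The queue_io() condition quoted above likewise reduces to a small predicate
(a sketch; the flag names follow the mail, and "patched" marks the proposed
behavior):

```c
/* Should queue_io() refill b_io from b_dirty?  Before the change, every
 * non-kupdate pass refilled; after it, only an empty b_io is refilled,
 * so inodes left over from the previous pass are retried first -- which
 * is what can leave nr_to_write > 0 with only a small leftover set. */
static int should_refill(int for_kupdate, int b_io_empty, int patched)
{
	if (patched)
		return b_io_empty;
	return !for_kupdate || b_io_empty;
}
```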

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCH 5/6] writeback: try more writeback as long as something was written
  2011-04-19 21:10       ` Jan Kara
@ 2011-04-20  7:50         ` Wu Fengguang
  2011-04-20 15:22           ` Jan Kara
  0 siblings, 1 reply; 62+ messages in thread
From: Wu Fengguang @ 2011-04-20  7:50 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Mel Gorman, Dave Chinner, Trond Myklebust,
	Itaru Kitayama, Minchan Kim, LKML, linux-fsdevel@vger.kernel.org,
	Linux Memory Management List

On Wed, Apr 20, 2011 at 05:10:08AM +0800, Jan Kara wrote:
> On Tue 19-04-11 19:16:01, Wu Fengguang wrote:
> > On Tue, Apr 19, 2011 at 06:20:16PM +0800, Jan Kara wrote:
> > > On Tue 19-04-11 11:00:08, Wu Fengguang wrote:
> > > > writeback_inodes_wb()/__writeback_inodes_sb() are not aggressive in that
> > > > they may populate only a subset of the eligible inodes into b_io at
> > > > entrance time. When the queued set of inodes is all synced, they just
> > > > return, possibly with all queued inode pages written but still
> > > > wbc.nr_to_write > 0.
> > > > 
> > > > For kupdate and background writeback, there may be more eligible inodes
> > > > sitting in b_dirty when the current set of b_io inodes are completed. So
> > > > it is necessary to try another round of writeback as long as we made some
> > > > progress in this round. When there are no more eligible inodes, no more
> > > > inodes will be enqueued in queue_io(), hence nothing could/will be
> > > > synced and we may safely bail.
> > >   Let me understand your concern here: You are afraid that if we do
> > > for_background or for_kupdate writeback and we write less than
> > > MAX_WRITEBACK_PAGES, we stop doing writeback although there could be more
> > > inodes to write at the time we are stopping writeback - the two realistic
> > 
> > Yes.
> > 
> > > cases I can think of are:
> > > a) when inodes just freshly expired during writeback
> > > b) when bdi has less than MAX_WRITEBACK_PAGES of dirty data but we are over
> > >   background threshold due to data on some other bdi. And then while we are
> > >   doing writeback someone does dirtying at our bdi.
> > > Or do you see some other case as well?
> > > 
> > > The a) case does not seem like a big issue to me after your changes to
> > 
> > Yeah (a) is not an issue with kupdate writeback.
> > 
> > > move_expired_inodes(). The b) case maybe but do you think it will make any
> > > difference? 
> > 
> > (b) seems also weird. What in my mind is this for_background case.
> > Imagine 100 inodes
> > 
> >         i0, i1, i2, ..., i90, i91, ..., i99
> > 
> > At queue_io() time, i90-i99 happen to be expired and moved to s_io for
> > IO. When finished successfully, if their total size is less than
> > MAX_WRITEBACK_PAGES, nr_to_write will be > 0. Then wb_writeback() will
> > quit the background work (w/o this patch) while it's still over
> > background threshold.
> > 
> > This will be a fairly normal/frequent case I guess.
>   Ah OK, I see. I missed this case your patch set has added. Also your
> changes of
>         if (!wbc->for_kupdate || list_empty(&wb->b_io))
> to
> 	if (list_empty(&wb->b_io))
> are going to cause more cases when we'd hit nr_to_write > 0 (e.g. when one
> pass of b_io does not write all the inodes so some are left in b_io list
> and then next call to writeback finds these inodes there but there's less
> than MAX_WRITEBACK_PAGES in them).

Yes. It's exactly the more aggressive retry logic in wb_writeback()
that allows me to comfortably kill that !wbc->for_kupdate test :)

> Frankly, it makes me like the above change even less. I'd rather see
> writeback_inodes_wb / __writeback_inodes_sb always work on a fresh
> set of inodes which is initialized whenever we enter these
> functions. It just seems less surprising to me...

The old aggressive enqueue policy is an ad-hoc workaround to prevent the
background work from missing some inodes and quitting early. Now that we
have the complete solution, why not kill it for more consistent code and
behavior? And get better performance numbers :)

Thanks,
Fengguang



* Re: [PATCH 5/6] writeback: try more writeback as long as something was written
  2011-04-20  7:50         ` Wu Fengguang
@ 2011-04-20 15:22           ` Jan Kara
  2011-04-21  3:33             ` Wu Fengguang
  0 siblings, 1 reply; 62+ messages in thread
From: Jan Kara @ 2011-04-20 15:22 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Andrew Morton, Mel Gorman, Dave Chinner,
	Trond Myklebust, Itaru Kitayama, Minchan Kim, LKML,
	linux-fsdevel@vger.kernel.org, Linux Memory Management List

On Wed 20-04-11 15:50:53, Wu Fengguang wrote:
> > > >   Let me understand your concern here: You are afraid that if we do
> > > > for_background or for_kupdate writeback and we write less than
> > > > MAX_WRITEBACK_PAGES, we stop doing writeback although there could be more
> > > > inodes to write at the time we are stopping writeback - the two realistic
> > > 
> > > Yes.
> > > 
> > > > cases I can think of are:
> > > > a) when inodes just freshly expired during writeback
> > > > b) when bdi has less than MAX_WRITEBACK_PAGES of dirty data but we are over
> > > >   background threshold due to data on some other bdi. And then while we are
> > > >   doing writeback someone does dirtying at our bdi.
> > > > Or do you see some other case as well?
> > > > 
> > > > The a) case does not seem like a big issue to me after your changes to
> > > 
> > > Yeah (a) is not an issue with kupdate writeback.
> > > 
> > > > move_expired_inodes(). The b) case maybe but do you think it will make any
> > > > difference? 
> > > 
> > > (b) seems also weird. What in my mind is this for_background case.
> > > Imagine 100 inodes
> > > 
> > >         i0, i1, i2, ..., i90, i91, ..., i99
> > > 
> > > At queue_io() time, i90-i99 happen to be expired and moved to s_io for
> > > IO. When finished successfully, if their total size is less than
> > > MAX_WRITEBACK_PAGES, nr_to_write will be > 0. Then wb_writeback() will
> > > quit the background work (w/o this patch) while it's still over
> > > background threshold.
> > > 
> > > This will be a fairly normal/frequent case I guess.
> >   Ah OK, I see. I missed this case your patch set has added. Also your
> > changes of
> >         if (!wbc->for_kupdate || list_empty(&wb->b_io))
> > to
> > 	if (list_empty(&wb->b_io))
> > are going to cause more cases when we'd hit nr_to_write > 0 (e.g. when one
> > pass of b_io does not write all the inodes so some are left in b_io list
> > and then next call to writeback finds these inodes there but there's less
> > than MAX_WRITEBACK_PAGES in them).
> 
> Yes. It's exactly the more aggressive retry logic in wb_writeback()
> that allows me to comfortably kill that !wbc->for_kupdate test :)
> 
> > Frankly, it makes me like the above change even less. I'd rather see
> > writeback_inodes_wb / __writeback_inodes_sb always work on a fresh
> > set of inodes which is initialized whenever we enter these
> > functions. It just seems less surprising to me...
> 
> The old aggressive enqueue policy is an ad-hoc workaround to prevent the
> background work from missing some inodes and quitting early. Now that we
> have the complete solution, why not kill it for more consistent code and
> behavior? And get better performance numbers :)
  BTW, have you understood why you get better numbers? What are we doing
better with this changed logic?

I've thought about it and also about Dave's analysis. Now I think it's OK to
not add new inodes to b_io when it's not empty. But what I still don't like
is that the emptiness / non-emptiness of b_io carries hidden internal
state - callers of writeback_inodes_wb() shouldn't have to know or care
about such subtleties (__writeback_inodes_sb() is an internal function so I
don't care about that one too much).

So I'd prefer writeback_inodes_wb() (and also __writeback_inodes_sb() but
that's not too important) to do something like:
	int requeued = 0;
requeue:
	if (list_empty(&wb->b_io)) {
		queue_io(wb, wbc->older_than_this);
		requeued = 1;
	}
	while (!list_empty(&wb->b_io)) {
		... do stuff ...
	}
	if (wbc->nr_to_write > 0 && !requeued)
		goto requeue;

Because if you don't do this, you have to make a similar change in all the
callers of writeback_inodes_wb() (OK, there are just three, but still).

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR



* Re: [PATCH 5/6] writeback: try more writeback as long as something was written
  2011-04-20 15:22           ` Jan Kara
@ 2011-04-21  3:33             ` Wu Fengguang
  2011-04-21  4:39               ` Christoph Hellwig
  2011-04-21  7:09               ` Dave Chinner
  0 siblings, 2 replies; 62+ messages in thread
From: Wu Fengguang @ 2011-04-21  3:33 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Mel Gorman, Dave Chinner, Trond Myklebust,
	Itaru Kitayama, Minchan Kim, LKML, linux-fsdevel@vger.kernel.org,
	Linux Memory Management List

[-- Attachment #1: Type: text/plain, Size: 7882 bytes --]

On Wed, Apr 20, 2011 at 11:22:11PM +0800, Jan Kara wrote:
> On Wed 20-04-11 15:50:53, Wu Fengguang wrote:
> > > > >   Let me understand your concern here: You are afraid that if we do
> > > > > for_background or for_kupdate writeback and we write less than
> > > > > MAX_WRITEBACK_PAGES, we stop doing writeback although there could be more
> > > > > inodes to write at the time we are stopping writeback - the two realistic
> > > > 
> > > > Yes.
> > > > 
> > > > > cases I can think of are:
> > > > > a) when inodes just freshly expired during writeback
> > > > > b) when bdi has less than MAX_WRITEBACK_PAGES of dirty data but we are over
> > > > >   background threshold due to data on some other bdi. And then while we are
> > > > >   doing writeback someone does dirtying at our bdi.
> > > > > Or do you see some other case as well?
> > > > > 
> > > > > The a) case does not seem like a big issue to me after your changes to
> > > > 
> > > > Yeah (a) is not an issue with kupdate writeback.
> > > > 
> > > > > move_expired_inodes(). The b) case maybe but do you think it will make any
> > > > > difference? 
> > > > 
> > > > (b) seems also weird. What in my mind is this for_background case.
> > > > Imagine 100 inodes
> > > > 
> > > >         i0, i1, i2, ..., i90, i91, ..., i99
> > > > 
> > > > At queue_io() time, i90-i99 happen to be expired and moved to s_io for
> > > > IO. When finished successfully, if their total size is less than
> > > > MAX_WRITEBACK_PAGES, nr_to_write will be > 0. Then wb_writeback() will
> > > > quit the background work (w/o this patch) while it's still over
> > > > background threshold.
> > > > 
> > > > This will be a fairly normal/frequent case I guess.
> > >   Ah OK, I see. I missed this case your patch set has added. Also your
> > > changes of
> > >         if (!wbc->for_kupdate || list_empty(&wb->b_io))
> > > to
> > > 	if (list_empty(&wb->b_io))
> > > are going to cause more cases when we'd hit nr_to_write > 0 (e.g. when one
> > > pass of b_io does not write all the inodes so some are left in b_io list
> > > and then next call to writeback finds these inodes there but there's less
> > > than MAX_WRITEBACK_PAGES in them).
> > 
> > Yes. It's exactly the more aggressive retry logic in wb_writeback()
> > that allows me to comfortably kill that !wbc->for_kupdate test :)
> > 
> > > Frankly, it makes me like the above change even less. I'd rather see
> > > writeback_inodes_wb / __writeback_inodes_sb always work on a fresh
> > > set of inodes which is initialized whenever we enter these
> > > functions. It just seems less surprising to me...
> > 
> > The old aggressive enqueue policy is an ad-hoc workaround to prevent the
> > background work from missing some inodes and quitting early. Now that we
> > have the complete solution, why not kill it for more consistent code and
> > behavior? And get better performance numbers :)
>   BTW, have you understood why you get better numbers? What are we doing
> better with this changed logic?

Good question. I'm also puzzled to find it running consistently better on
4MB, 32MB and 128MB write chunk sizes, with/without the IO-less and
larger chunk size patches.

It's not about pageout(), because I see "nr_vmscan_write 0" in
/proc/vmstat in the tests.

It's not about the full vs. remaining chunk size -- that may have helped
the vanilla kernel, but "writeback: make nr_to_write a per-file limit",
part of the large chunk size patches, already guarantees each file
will get the full chunk size.

I collected the writeback_single_inode() traces (patch attached for
your reference) over several test runs, and found many more
I_DIRTY_PAGES entries after the patchset. Dave, do you know why so many
I_DIRTY_PAGES flags (or radix tags) remain after the XFS ->writepages()
call, even for small files?

wfg /tmp% g -c I_DIRTY_PAGES trace-*
trace-moving-expire-1:28213
trace-no-moving-expire:6684

wfg /tmp% g -c I_DIRTY_DATASYNC trace-*
trace-moving-expire-1:179
trace-no-moving-expire:193

wfg /tmp% g -c I_DIRTY_SYNC trace-* 
trace-moving-expire-1:29394
trace-no-moving-expire:31593

wfg /tmp% wc -l trace-*
   81108 trace-moving-expire-1
   68562 trace-no-moving-expire

wfg /tmp% head trace-*
==> trace-moving-expire-1 <==
# tracer: nop
#
#           TASK-PID    CPU#    TIMESTAMP  FUNCTION
#              | |       |          |         |
           <...>-2982  [000]   633.671746: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=1177 wrote=1025 to_write=-1 index=21525
           <...>-2982  [000]   633.672704: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=1178 wrote=1025 to_write=-1 index=22550
           <...>-2982  [000]   633.673638: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=1179 wrote=1025 to_write=-1 index=23575
           <...>-2982  [000]   633.674573: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=1180 wrote=1025 to_write=-1 index=24600
           <...>-2982  [000]   633.880621: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=1387 wrote=1025 to_write=-1 index=25625
           <...>-2982  [000]   633.881345: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=1388 wrote=1025 to_write=-1 index=26650

==> trace-no-moving-expire <==
# tracer: nop
#
#           TASK-PID    CPU#    TIMESTAMP  FUNCTION
#              | |       |          |         |
           <...>-2233  [006]   311.175491: writeback_single_inode: bdi 0:15: ino=1574019 state=I_DIRTY_DATASYNC|I_REFERENCED age=0 wrote=0 to_write=1024 index=0
           <...>-2233  [006]   311.175495: writeback_single_inode: bdi 0:15: ino=1536569 state=I_DIRTY_DATASYNC|I_REFERENCED age=0 wrote=0 to_write=1024 index=0
           <...>-2233  [006]   311.175498: writeback_single_inode: bdi 0:15: ino=1534002 state=I_DIRTY_DATASYNC|I_REFERENCED age=0 wrote=0 to_write=1024 index=0
           <...>-2233  [006]   311.175515: writeback_single_inode: bdi 0:15: ino=1574042 state=I_DIRTY_DATASYNC age=25000 wrote=1 to_write=1023 index=0
           <...>-2233  [006]   311.175522: writeback_single_inode: bdi 0:15: ino=1574028 state=I_DIRTY_DATASYNC age=25000 wrote=1 to_write=1022 index=137685
           <...>-2233  [006]   311.175524: writeback_single_inode: bdi 0:15: ino=1574024 state=I_DIRTY_DATASYNC age=25000 wrote=0 to_write=1022 index=0

> I've thought about it and also about Dave's analysis. Now I think it's OK to
> not add new inodes to b_io when it's not empty. But what I still don't like
> is that the emptiness / non-emptiness of b_io carries hidden internal
> state - callers of writeback_inodes_wb() shouldn't have to know or care
> about such subtleties (__writeback_inodes_sb() is an internal function so I
> don't care about that one too much).

That's why we liked the v1 implementation :)

> So I'd prefer writeback_inodes_wb() (and also __writeback_inodes_sb() but
> that's not too important) to do something like:
> 	int requeued = 0;
> requeue:
> 	if (list_empty(&wb->b_io)) {
> 		queue_io(wb, wbc->older_than_this);
> 		requeued = 1;
> 	}
> 	while (!list_empty(&wb->b_io)) {
> 		... do stuff ...
> 	}
> 	if (wbc->nr_to_write > 0 && !requeued)
> 		goto requeue;

But that change must be coupled with the older_than_this switch, and
doing it here both loses the wbc visibility and scatters the policy
around...

> Because if you don't do this, you have to do similar change to all the
> callers of writeback_inodes_wb() (Ok, there are just three but still).

I found only one other caller, bdi_flush_io(), and it sets
older_than_this to NULL. In fact wb_writeback() is the only user of
older_than_this, originally for the kupdate work and now also for the
background work.

Basically we only need the retry when the policy switches, so it makes
sense to do it entirely in either wb_writeback() or
move_expired_inodes()?

Thanks,
Fengguang

[-- Attachment #2: writeback-trace-writeback_single_inode.patch --]
[-- Type: text/x-diff, Size: 3552 bytes --]

Subject: writeback: trace writeback_single_inode
Date: Wed Dec 01 17:33:37 CST 2010

It is valuable to know how the dirty inodes are iterated and their IO size.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c                |   12 +++---
 include/trace/events/writeback.h |   56 +++++++++++++++++++++++++++++
 2 files changed, 63 insertions(+), 5 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2011-04-13 17:18:19.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-04-13 17:18:20.000000000 +0800
@@ -347,7 +347,7 @@ writeback_single_inode(struct inode *ino
 {
 	struct address_space *mapping = inode->i_mapping;
 	long per_file_limit = wbc->per_file_limit;
-	long uninitialized_var(nr_to_write);
+	long nr_to_write = wbc->nr_to_write;
 	unsigned dirty;
 	int ret;
 
@@ -370,7 +370,8 @@ writeback_single_inode(struct inode *ino
 		 */
 		if (wbc->sync_mode != WB_SYNC_ALL) {
 			requeue_io(inode);
-			return 0;
+			ret = 0;
+			goto out;
 		}
 
 		/*
@@ -387,10 +388,8 @@ writeback_single_inode(struct inode *ino
 	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_wb_list_lock);
 
-	if (per_file_limit) {
-		nr_to_write = wbc->nr_to_write;
+	if (per_file_limit)
 		wbc->nr_to_write = per_file_limit;
-	}
 
 	ret = do_writepages(mapping, wbc);
 
@@ -467,6 +466,9 @@ writeback_single_inode(struct inode *ino
 		}
 	}
 	inode_sync_complete(inode);
+out:
+	trace_writeback_single_inode(inode, wbc,
+				     nr_to_write - wbc->nr_to_write);
 	return ret;
 }
 
--- linux-next.orig/include/trace/events/writeback.h	2011-04-13 17:18:18.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2011-04-13 17:18:20.000000000 +0800
@@ -10,6 +10,19 @@
 
 struct wb_writeback_work;
 
+#define show_inode_state(state)					\
+	__print_flags(state, "|",				\
+		{I_DIRTY_SYNC,		"I_DIRTY_SYNC"},	\
+		{I_DIRTY_DATASYNC,	"I_DIRTY_DATASYNC"},	\
+		{I_DIRTY_PAGES,		"I_DIRTY_PAGES"},	\
+		{I_NEW,			"I_NEW"},		\
+		{I_WILL_FREE,		"I_WILL_FREE"},		\
+		{I_FREEING,		"I_FREEING"},		\
+		{I_CLEAR,		"I_CLEAR"},		\
+		{I_SYNC,		"I_SYNC"},		\
+		{I_REFERENCED,		"I_REFERENCED"}		\
+		)
+
 DECLARE_EVENT_CLASS(writeback_work_class,
 	TP_PROTO(struct backing_dev_info *bdi, struct wb_writeback_work *work),
 	TP_ARGS(bdi, work),
@@ -149,6 +162,49 @@ DEFINE_WBC_EVENT(wbc_writeback_written);
 DEFINE_WBC_EVENT(wbc_writeback_wait);
 DEFINE_WBC_EVENT(wbc_writepage);
 
+TRACE_EVENT(writeback_single_inode,
+
+	TP_PROTO(struct inode *inode,
+		 struct writeback_control *wbc,
+		 unsigned long wrote
+	),
+
+	TP_ARGS(inode, wbc, wrote),
+
+	TP_STRUCT__entry(
+		__array(char, name, 32)
+		__field(unsigned long, ino)
+		__field(unsigned long, state)
+		__field(unsigned long, age)
+		__field(unsigned long, wrote)
+		__field(long, nr_to_write)
+		__field(unsigned long, writeback_index)
+	),
+
+	TP_fast_assign(
+		strncpy(__entry->name,
+			dev_name(inode->i_mapping->backing_dev_info->dev), 32);
+		__entry->ino		= inode->i_ino;
+		__entry->state		= inode->i_state;
+		__entry->age		= (jiffies - inode->dirtied_when) *
+								1000 / HZ;
+		__entry->wrote		= wrote;
+		__entry->nr_to_write	= wbc->nr_to_write;
+		__entry->writeback_index = inode->i_mapping->writeback_index;
+	),
+
+	TP_printk("bdi %s: ino=%lu state=%s age=%lu "
+		  "wrote=%lu to_write=%ld index=%lu",
+		  __entry->name,
+		  __entry->ino,
+		  show_inode_state(__entry->state),
+		  __entry->age,
+		  __entry->wrote,
+		  __entry->nr_to_write,
+		  __entry->writeback_index
+	)
+);
+
 #define KBps(x)			((x) << (PAGE_SHIFT - 10))
 
 TRACE_EVENT(dirty_ratelimit,


* Re: [PATCH 5/6] writeback: try more writeback as long as something was written
  2011-04-21  3:33             ` Wu Fengguang
@ 2011-04-21  4:39               ` Christoph Hellwig
  2011-04-21  6:05                 ` Wu Fengguang
  2011-04-21  7:09               ` Dave Chinner
  1 sibling, 1 reply; 62+ messages in thread
From: Christoph Hellwig @ 2011-04-21  4:39 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Andrew Morton, Mel Gorman, Dave Chinner,
	Trond Myklebust, Itaru Kitayama, Minchan Kim, LKML,
	linux-fsdevel@vger.kernel.org, Linux Memory Management List

On Thu, Apr 21, 2011 at 11:33:25AM +0800, Wu Fengguang wrote:
> I collected the writeback_single_inode() traces (patch attached for
> your reference) over several test runs, and found many more
> I_DIRTY_PAGES entries after the patchset. Dave, do you know why so many
> I_DIRTY_PAGES flags (or radix tags) remain after the XFS ->writepages()
> call, even for small files?

What is your definition of a small file?  As soon as it has multiple
extents or holes there's absolutely no way to clean it with a single
writepage call.  Also XFS tries to operate as non-blocking as possible
if the non-blocking flag is set in the wbc, but that flag actually
seems to be dead these days.



* Re: [PATCH 5/6] writeback: try more writeback as long as something was written
  2011-04-21  4:39               ` Christoph Hellwig
@ 2011-04-21  6:05                 ` Wu Fengguang
  2011-04-21 16:41                   ` Jan Kara
  0 siblings, 1 reply; 62+ messages in thread
From: Wu Fengguang @ 2011-04-21  6:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, Andrew Morton, Mel Gorman, Dave Chinner,
	Trond Myklebust, Itaru Kitayama, Minchan Kim, LKML,
	linux-fsdevel@vger.kernel.org, Linux Memory Management List

On Thu, Apr 21, 2011 at 12:39:40PM +0800, Christoph Hellwig wrote:
> On Thu, Apr 21, 2011 at 11:33:25AM +0800, Wu Fengguang wrote:
> > I collected the writeback_single_inode() traces (patch attached for
> > your reference) over several test runs, and found many more
> > I_DIRTY_PAGES entries after the patchset. Dave, do you know why so many
> > I_DIRTY_PAGES flags (or radix tags) remain after the XFS ->writepages()
> > call, even for small files?
> 
> What is your definition of a small file?  As soon as it has multiple
> extents or holes there's absolutely no way to clean it with a single
> writepage call.

The workload is writing a kernel source tree to XFS. You can see in the
trace below that it often leaves more dirty pages behind (indicated by
the I_DIRTY_PAGES flag) after writing as few as one page (indicated by
the wrote=1 field).

> Also XFS tries to operate as non-blocking as possible
> if the non-blocking flag is set in the wbc, but that flag actually
> seems to be dead these days.

Yeah.

Thanks,
Fengguang
---
wfg /tmp% head -300 trace-dt7-moving-expire-xfs
# tracer: nop
#
#           TASK-PID    CPU#    TIMESTAMP  FUNCTION
#              | |       |          |         |
            init-1     [004]  5291.655631: writeback_single_inode: bdi 0:15: ino=1574069 state= age=6 wrote=2 to_write=9223372036854775805 index=179837
            init-1     [004]  5291.657137: writeback_single_inode: bdi 0:15: ino=1574069 state= age=7 wrote=0 to_write=9223372036854775807 index=0
            init-1     [004]  5291.657141: writeback_single_inode: bdi 0:15: ino=1574069 state= age=7 wrote=0 to_write=9223372036854775807 index=0
            init-1     [004]  5291.659716: writeback_single_inode: bdi 0:15: ino=1574069 state= age=3 wrote=1 to_write=9223372036854775806 index=179837
##### CPU 6 buffer started ####
           getty-3417  [006]  5291.661265: writeback_single_inode: bdi 0:15: ino=1574069 state= age=4 wrote=0 to_write=9223372036854775807 index=0
           getty-3417  [006]  5291.661269: writeback_single_inode: bdi 0:15: ino=1574069 state= age=4 wrote=0 to_write=9223372036854775807 index=0
           getty-3417  [006]  5291.663963: writeback_single_inode: bdi 0:15: ino=1574069 state= age=3 wrote=1 to_write=9223372036854775806 index=179837
       flush-8:0-3402  [006]  5291.903857: writeback_single_inode: bdi 8:0: ino=131 state=I_DIRTY_SYNC|I_DIRTY_DATASYNC|I_DIRTY_PAGES age=323 wrote=4097 to_write=-1 index=0
       flush-8:0-3402  [006]  5291.919833: writeback_single_inode: bdi 8:0: ino=133 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4095 index=0
       flush-8:0-3402  [006]  5291.919876: writeback_single_inode: bdi 8:0: ino=134 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=4093 index=1
       flush-8:0-3402  [006]  5291.919913: writeback_single_inode: bdi 8:0: ino=135 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=5 to_write=4088 index=4
       flush-8:0-3402  [006]  5291.919969: writeback_single_inode: bdi 8:0: ino=136 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=23 to_write=4065 index=13
       flush-8:0-3402  [006]  5291.920008: writeback_single_inode: bdi 8:0: ino=134217857 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4064 index=0
       flush-8:0-3402  [006]  5291.920049: writeback_single_inode: bdi 8:0: ino=134217858 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=4 to_write=4060 index=3
       flush-8:0-3402  [006]  5291.920087: writeback_single_inode: bdi 8:0: ino=268628417 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4059 index=0
       flush-8:0-3402  [006]  5291.920128: writeback_single_inode: bdi 8:0: ino=402653313 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4058 index=0
       flush-8:0-3402  [006]  5291.920160: writeback_single_inode: bdi 8:0: ino=402653314 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4057 index=0
       flush-8:0-3402  [006]  5291.920194: writeback_single_inode: bdi 8:0: ino=402653315 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4056 index=0
       flush-8:0-3402  [006]  5291.920225: writeback_single_inode: bdi 8:0: ino=402653316 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4055 index=0
       flush-8:0-3402  [006]  5291.920260: writeback_single_inode: bdi 8:0: ino=138 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4054 index=0
       flush-8:0-3402  [006]  5291.920291: writeback_single_inode: bdi 8:0: ino=139 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4053 index=0
       flush-8:0-3402  [006]  5291.920325: writeback_single_inode: bdi 8:0: ino=140 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4052 index=0
       flush-8:0-3402  [006]  5291.920356: writeback_single_inode: bdi 8:0: ino=141 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4051 index=0
       flush-8:0-3402  [006]  5291.920393: writeback_single_inode: bdi 8:0: ino=134217860 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4050 index=0
       flush-8:0-3402  [006]  5291.920425: writeback_single_inode: bdi 8:0: ino=134217861 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4049 index=0
       flush-8:0-3402  [006]  5291.920458: writeback_single_inode: bdi 8:0: ino=134217862 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4048 index=0
       flush-8:0-3402  [006]  5291.920489: writeback_single_inode: bdi 8:0: ino=134217863 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4047 index=0
       flush-8:0-3402  [006]  5291.920524: writeback_single_inode: bdi 8:0: ino=134217864 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=4045 index=1
       flush-8:0-3402  [006]  5291.920556: writeback_single_inode: bdi 8:0: ino=134217865 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4044 index=0
       flush-8:0-3402  [006]  5291.920589: writeback_single_inode: bdi 8:0: ino=134217866 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4043 index=0
       flush-8:0-3402  [006]  5291.920620: writeback_single_inode: bdi 8:0: ino=134217867 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4042 index=0
       flush-8:0-3402  [006]  5291.920653: writeback_single_inode: bdi 8:0: ino=134217868 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4041 index=0
       flush-8:0-3402  [006]  5291.920718: writeback_single_inode: bdi 8:0: ino=134217869 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4040 index=0
       flush-8:0-3402  [006]  5291.920758: writeback_single_inode: bdi 8:0: ino=268628419 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4039 index=0
       flush-8:0-3402  [006]  5291.920790: writeback_single_inode: bdi 8:0: ino=268628420 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4038 index=0
       flush-8:0-3402  [006]  5291.920823: writeback_single_inode: bdi 8:0: ino=268628421 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4037 index=0
       flush-8:0-3402  [006]  5291.920855: writeback_single_inode: bdi 8:0: ino=268628422 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4036 index=0
       flush-8:0-3402  [006]  5291.920890: writeback_single_inode: bdi 8:0: ino=268628423 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4035 index=0
       flush-8:0-3402  [006]  5291.920924: writeback_single_inode: bdi 8:0: ino=268628424 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=4033 index=1
       flush-8:0-3402  [006]  5291.920957: writeback_single_inode: bdi 8:0: ino=268628425 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4032 index=0
       flush-8:0-3402  [006]  5291.920988: writeback_single_inode: bdi 8:0: ino=268628426 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4031 index=0
       flush-8:0-3402  [006]  5291.921021: writeback_single_inode: bdi 8:0: ino=268628427 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4030 index=0
       flush-8:0-3402  [006]  5291.921054: writeback_single_inode: bdi 8:0: ino=268628428 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=4028 index=1
       flush-8:0-3402  [006]  5291.921091: writeback_single_inode: bdi 8:0: ino=268628429 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4027 index=0
       flush-8:0-3402  [006]  5291.921122: writeback_single_inode: bdi 8:0: ino=268628430 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4026 index=0
       flush-8:0-3402  [006]  5291.921155: writeback_single_inode: bdi 8:0: ino=268628431 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4025 index=0
       flush-8:0-3402  [006]  5291.921188: writeback_single_inode: bdi 8:0: ino=268628432 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=4023 index=1
       flush-8:0-3402  [006]  5291.921224: writeback_single_inode: bdi 8:0: ino=268628433 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4022 index=0
       flush-8:0-3402  [006]  5291.921256: writeback_single_inode: bdi 8:0: ino=268628434 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4021 index=0
       flush-8:0-3402  [006]  5291.921289: writeback_single_inode: bdi 8:0: ino=268628435 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4020 index=0
       flush-8:0-3402  [006]  5291.921320: writeback_single_inode: bdi 8:0: ino=268628436 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4019 index=0
       flush-8:0-3402  [006]  5291.921354: writeback_single_inode: bdi 8:0: ino=268628437 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4018 index=0
       flush-8:0-3402  [006]  5291.921385: writeback_single_inode: bdi 8:0: ino=268628438 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4017 index=0
       flush-8:0-3402  [006]  5291.921421: writeback_single_inode: bdi 8:0: ino=268628439 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4016 index=0
       flush-8:0-3402  [006]  5291.921453: writeback_single_inode: bdi 8:0: ino=268628440 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4015 index=0
       flush-8:0-3402  [006]  5291.921487: writeback_single_inode: bdi 8:0: ino=268628441 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4014 index=0
       flush-8:0-3402  [006]  5291.921518: writeback_single_inode: bdi 8:0: ino=268628442 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4013 index=0
       flush-8:0-3402  [006]  5291.921552: writeback_single_inode: bdi 8:0: ino=268628443 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4012 index=0
       flush-8:0-3402  [006]  5291.921586: writeback_single_inode: bdi 8:0: ino=268628444 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=4009 index=2
       flush-8:0-3402  [006]  5291.921622: writeback_single_inode: bdi 8:0: ino=268628445 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=4007 index=1
       flush-8:0-3402  [006]  5291.921653: writeback_single_inode: bdi 8:0: ino=268628446 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4006 index=0
       flush-8:0-3402  [006]  5291.921709: writeback_single_inode: bdi 8:0: ino=268628447 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4005 index=0
       flush-8:0-3402  [006]  5291.921742: writeback_single_inode: bdi 8:0: ino=268628448 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4004 index=0
       flush-8:0-3402  [006]  5291.921775: writeback_single_inode: bdi 8:0: ino=268628449 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4003 index=0
       flush-8:0-3402  [006]  5291.921807: writeback_single_inode: bdi 8:0: ino=268628450 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4002 index=0
       flush-8:0-3402  [006]  5291.921840: writeback_single_inode: bdi 8:0: ino=268628451 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=4001 index=0
       flush-8:0-3402  [006]  5291.921874: writeback_single_inode: bdi 8:0: ino=268628452 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3999 index=1
       flush-8:0-3402  [006]  5291.921909: writeback_single_inode: bdi 8:0: ino=268628453 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3997 index=1
       flush-8:0-3402  [006]  5291.921940: writeback_single_inode: bdi 8:0: ino=268628454 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3996 index=0
       flush-8:0-3402  [006]  5291.921974: writeback_single_inode: bdi 8:0: ino=268628455 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3995 index=0
       flush-8:0-3402  [006]  5291.922005: writeback_single_inode: bdi 8:0: ino=268628456 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3994 index=0
       flush-8:0-3402  [006]  5291.922044: writeback_single_inode: bdi 8:0: ino=268628457 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3992 index=1
       flush-8:0-3402  [006]  5291.922077: writeback_single_inode: bdi 8:0: ino=268628458 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3990 index=1
       flush-8:0-3402  [006]  5291.922116: writeback_single_inode: bdi 8:0: ino=268628459 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3988 index=1
       flush-8:0-3402  [006]  5291.922149: writeback_single_inode: bdi 8:0: ino=268628460 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3986 index=1
       flush-8:0-3402  [006]  5291.922182: writeback_single_inode: bdi 8:0: ino=268628461 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3985 index=0
       flush-8:0-3402  [006]  5291.922213: writeback_single_inode: bdi 8:0: ino=268628462 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3984 index=0
       flush-8:0-3402  [006]  5291.922246: writeback_single_inode: bdi 8:0: ino=268628463 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3983 index=0
       flush-8:0-3402  [006]  5291.922277: writeback_single_inode: bdi 8:0: ino=268628464 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3982 index=0
       flush-8:0-3402  [006]  5291.922310: writeback_single_inode: bdi 8:0: ino=268628465 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3981 index=0
       flush-8:0-3402  [006]  5291.922341: writeback_single_inode: bdi 8:0: ino=268628466 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3980 index=0
       flush-8:0-3402  [006]  5291.922375: writeback_single_inode: bdi 8:0: ino=268628467 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3979 index=0
       flush-8:0-3402  [006]  5291.922406: writeback_single_inode: bdi 8:0: ino=268628468 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3978 index=0
       flush-8:0-3402  [006]  5291.922439: writeback_single_inode: bdi 8:0: ino=268628469 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3977 index=0
       flush-8:0-3402  [006]  5291.922474: writeback_single_inode: bdi 8:0: ino=268628470 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=5 to_write=3972 index=4
       flush-8:0-3402  [006]  5291.922508: writeback_single_inode: bdi 8:0: ino=268628471 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3971 index=0
       flush-8:0-3402  [006]  5291.922539: writeback_single_inode: bdi 8:0: ino=268628472 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3970 index=0
       flush-8:0-3402  [006]  5291.922572: writeback_single_inode: bdi 8:0: ino=268628473 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3969 index=0
       flush-8:0-3402  [006]  5291.922603: writeback_single_inode: bdi 8:0: ino=268628474 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3968 index=0
       flush-8:0-3402  [006]  5291.922636: writeback_single_inode: bdi 8:0: ino=268628475 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3967 index=0
       flush-8:0-3402  [006]  5291.922673: writeback_single_inode: bdi 8:0: ino=268628476 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3966 index=0
       flush-8:0-3402  [006]  5291.922709: writeback_single_inode: bdi 8:0: ino=268628477 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3965 index=0
       flush-8:0-3402  [006]  5291.922741: writeback_single_inode: bdi 8:0: ino=268628478 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3964 index=0
       flush-8:0-3402  [006]  5291.922777: writeback_single_inode: bdi 8:0: ino=268628479 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3963 index=0
       flush-8:0-3402  [006]  5291.922810: writeback_single_inode: bdi 8:0: ino=268628480 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3961 index=1
       flush-8:0-3402  [006]  5291.922850: writeback_single_inode: bdi 8:0: ino=268628481 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3960 index=0
       flush-8:0-3402  [006]  5291.922882: writeback_single_inode: bdi 8:0: ino=268628482 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3959 index=0
       flush-8:0-3402  [006]  5291.922915: writeback_single_inode: bdi 8:0: ino=268628483 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3958 index=0
       flush-8:0-3402  [006]  5291.922946: writeback_single_inode: bdi 8:0: ino=268628484 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3957 index=0
       flush-8:0-3402  [006]  5291.922980: writeback_single_inode: bdi 8:0: ino=268628485 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3956 index=0
       flush-8:0-3402  [006]  5291.923015: writeback_single_inode: bdi 8:0: ino=134217870 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3953 index=2
       flush-8:0-3402  [006]  5291.923052: writeback_single_inode: bdi 8:0: ino=134217871 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=4 to_write=3949 index=3
       flush-8:0-3402  [006]  5291.923090: writeback_single_inode: bdi 8:0: ino=134217872 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=8 to_write=3941 index=7
       flush-8:0-3402  [006]  5291.923129: writeback_single_inode: bdi 8:0: ino=134217873 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=7 to_write=3934 index=6
       flush-8:0-3402  [006]  5291.923167: writeback_single_inode: bdi 8:0: ino=134217874 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=7 to_write=3927 index=6
       flush-8:0-3402  [006]  5291.923202: writeback_single_inode: bdi 8:0: ino=134217875 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3925 index=1
       flush-8:0-3402  [006]  5291.923234: writeback_single_inode: bdi 8:0: ino=134217876 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3924 index=0
       flush-8:0-3402  [006]  5291.923268: writeback_single_inode: bdi 8:0: ino=402653318 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3923 index=0
       flush-8:0-3402  [006]  5291.923305: writeback_single_inode: bdi 8:0: ino=402653319 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=6 to_write=3917 index=5
       flush-8:0-3402  [006]  5291.923341: writeback_single_inode: bdi 8:0: ino=402653320 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3915 index=1
       flush-8:0-3402  [006]  5291.923372: writeback_single_inode: bdi 8:0: ino=402653321 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3914 index=0
       flush-8:0-3402  [006]  5291.923410: writeback_single_inode: bdi 8:0: ino=402653322 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=4 to_write=3910 index=3
       flush-8:0-3402  [006]  5291.923444: writeback_single_inode: bdi 8:0: ino=402653323 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=4 to_write=3906 index=3
       flush-8:0-3402  [006]  5291.923483: writeback_single_inode: bdi 8:0: ino=402653324 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3903 index=2
       flush-8:0-3402  [006]  5291.923521: writeback_single_inode: bdi 8:0: ino=402653325 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=8 to_write=3895 index=7
       flush-8:0-3402  [006]  5291.923556: writeback_single_inode: bdi 8:0: ino=143 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3894 index=0
       flush-8:0-3402  [006]  5291.923595: writeback_single_inode: bdi 8:0: ino=144 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=10 to_write=3884 index=9
       flush-8:0-3402  [006]  5291.923630: writeback_single_inode: bdi 8:0: ino=145 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3882 index=1
       flush-8:0-3402  [006]  5291.923673: writeback_single_inode: bdi 8:0: ino=146 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=7 to_write=3875 index=6
       flush-8:0-3402  [006]  5291.923711: writeback_single_inode: bdi 8:0: ino=147 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3874 index=0
       flush-8:0-3402  [006]  5291.923746: writeback_single_inode: bdi 8:0: ino=148 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=4 to_write=3870 index=3
       flush-8:0-3402  [006]  5291.923780: writeback_single_inode: bdi 8:0: ino=149 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3869 index=0
       flush-8:0-3402  [006]  5291.923817: writeback_single_inode: bdi 8:0: ino=150 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=6 to_write=3863 index=5
       flush-8:0-3402  [006]  5291.923852: writeback_single_inode: bdi 8:0: ino=151 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3860 index=2
       flush-8:0-3402  [006]  5291.923887: writeback_single_inode: bdi 8:0: ino=152 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=4 to_write=3856 index=3
       flush-8:0-3402  [006]  5291.923931: writeback_single_inode: bdi 8:0: ino=153 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=13 to_write=3843 index=12
       flush-8:0-3402  [006]  5291.923964: writeback_single_inode: bdi 8:0: ino=154 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3841 index=1
       flush-8:0-3402  [006]  5291.924014: writeback_single_inode: bdi 8:0: ino=155 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=18 to_write=3823 index=13
       flush-8:0-3402  [006]  5291.924045: writeback_single_inode: bdi 8:0: ino=156 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3822 index=0
       flush-8:0-3402  [006]  5291.924092: writeback_single_inode: bdi 8:0: ino=157 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=13 to_write=3809 index=12
       flush-8:0-3402  [006]  5291.924127: writeback_single_inode: bdi 8:0: ino=402653326 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=4 to_write=3805 index=3
       flush-8:0-3402  [006]  5291.924167: writeback_single_inode: bdi 8:0: ino=402653327 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=8 to_write=3797 index=7
       flush-8:0-3402  [006]  5291.924203: writeback_single_inode: bdi 8:0: ino=402653328 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=5 to_write=3792 index=4
       flush-8:0-3402  [006]  5291.924242: writeback_single_inode: bdi 8:0: ino=402653329 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3789 index=2
       flush-8:0-3402  [006]  5291.924282: writeback_single_inode: bdi 8:0: ino=402653330 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=11 to_write=3778 index=10
       flush-8:0-3402  [006]  5291.924330: writeback_single_inode: bdi 8:0: ino=402653331 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=17 to_write=3761 index=13
       flush-8:0-3402  [006]  5291.924370: writeback_single_inode: bdi 8:0: ino=402653332 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=11 to_write=3750 index=10
       flush-8:0-3402  [006]  5291.924413: writeback_single_inode: bdi 8:0: ino=402653333 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=12 to_write=3738 index=11
       flush-8:0-3402  [006]  5291.924446: writeback_single_inode: bdi 8:0: ino=402653334 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3735 index=2
       flush-8:0-3402  [006]  5291.924483: writeback_single_inode: bdi 8:0: ino=402653335 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=4 to_write=3731 index=3
       flush-8:0-3402  [006]  5291.924513: writeback_single_inode: bdi 8:0: ino=402653336 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3730 index=0
       flush-8:0-3402  [006]  5291.924554: writeback_single_inode: bdi 8:0: ino=402653337 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=8 to_write=3722 index=7
       flush-8:0-3402  [006]  5291.924588: writeback_single_inode: bdi 8:0: ino=402653338 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3719 index=2
       flush-8:0-3402  [006]  5291.924626: writeback_single_inode: bdi 8:0: ino=402653339 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3717 index=1
       flush-8:0-3402  [006]  5291.924679: writeback_single_inode: bdi 8:0: ino=402653340 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=12 to_write=3705 index=11
       flush-8:0-3402  [006]  5291.924719: writeback_single_inode: bdi 8:0: ino=402653341 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3704 index=0
       flush-8:0-3402  [006]  5291.924751: writeback_single_inode: bdi 8:0: ino=402653342 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3702 index=1
       flush-8:0-3402  [006]  5291.924787: writeback_single_inode: bdi 8:0: ino=402653343 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3699 index=2
       flush-8:0-3402  [006]  5291.924820: writeback_single_inode: bdi 8:0: ino=402653344 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3697 index=1
       flush-8:0-3402  [006]  5291.924856: writeback_single_inode: bdi 8:0: ino=402653345 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=4 to_write=3693 index=3
       flush-8:0-3402  [006]  5291.924888: writeback_single_inode: bdi 8:0: ino=402653346 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3692 index=0
       flush-8:0-3402  [006]  5291.924921: writeback_single_inode: bdi 8:0: ino=402653347 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3691 index=0
       flush-8:0-3402  [006]  5291.924952: writeback_single_inode: bdi 8:0: ino=402653348 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3690 index=0
       flush-8:0-3402  [006]  5291.924995: writeback_single_inode: bdi 8:0: ino=402653349 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=9 to_write=3681 index=8
       flush-8:0-3402  [006]  5291.925035: writeback_single_inode: bdi 8:0: ino=402653350 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=10 to_write=3671 index=9
       flush-8:0-3402  [006]  5291.925070: writeback_single_inode: bdi 8:0: ino=134217878 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3670 index=0
       flush-8:0-3402  [006]  5291.925103: writeback_single_inode: bdi 8:0: ino=134217879 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3668 index=1
       flush-8:0-3402  [006]  5291.925140: writeback_single_inode: bdi 8:0: ino=134217880 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=5 to_write=3663 index=4
       flush-8:0-3402  [006]  5291.925181: writeback_single_inode: bdi 8:0: ino=134217881 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=12 to_write=3651 index=11
       flush-8:0-3402  [006]  5291.925235: writeback_single_inode: bdi 8:0: ino=134217882 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=24 to_write=3627 index=13
       flush-8:0-3402  [006]  5291.925283: writeback_single_inode: bdi 8:0: ino=134217883 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=20 to_write=3607 index=13
       flush-8:0-3402  [006]  5291.925319: writeback_single_inode: bdi 8:0: ino=134217884 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3605 index=1
       flush-8:0-3402  [006]  5291.925351: writeback_single_inode: bdi 8:0: ino=134217885 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3603 index=1
       flush-8:0-3402  [006]  5291.925386: writeback_single_inode: bdi 8:0: ino=134217886 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3601 index=1
       flush-8:0-3402  [006]  5291.925417: writeback_single_inode: bdi 8:0: ino=134217887 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3600 index=0
       flush-8:0-3402  [006]  5291.925450: writeback_single_inode: bdi 8:0: ino=134217888 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3599 index=0
       flush-8:0-3402  [006]  5291.925481: writeback_single_inode: bdi 8:0: ino=134217889 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3598 index=0
       flush-8:0-3402  [006]  5291.925519: writeback_single_inode: bdi 8:0: ino=134217890 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3596 index=1
       flush-8:0-3402  [006]  5291.925552: writeback_single_inode: bdi 8:0: ino=134217891 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3594 index=1
       flush-8:0-3402  [006]  5291.925594: writeback_single_inode: bdi 8:0: ino=134217892 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=5 to_write=3589 index=4
       flush-8:0-3402  [006]  5291.925626: writeback_single_inode: bdi 8:0: ino=134217893 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3588 index=0
       flush-8:0-3402  [006]  5291.925669: writeback_single_inode: bdi 8:0: ino=134217894 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=4 to_write=3584 index=3
       flush-8:0-3402  [006]  5291.925703: writeback_single_inode: bdi 8:0: ino=134217895 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3582 index=1
       flush-8:0-3402  [006]  5291.925746: writeback_single_inode: bdi 8:0: ino=134217896 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=8 to_write=3574 index=7
       flush-8:0-3402  [006]  5291.925777: writeback_single_inode: bdi 8:0: ino=134217897 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3573 index=0
       flush-8:0-3402  [006]  5291.925813: writeback_single_inode: bdi 8:0: ino=134217898 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3571 index=1
       flush-8:0-3402  [006]  5291.925850: writeback_single_inode: bdi 8:0: ino=134217899 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=7 to_write=3564 index=6
       flush-8:0-3402  [006]  5291.925891: writeback_single_inode: bdi 8:0: ino=134217900 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=7 to_write=3557 index=6
       flush-8:0-3402  [006]  5291.925925: writeback_single_inode: bdi 8:0: ino=134217901 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3554 index=2
       flush-8:0-3402  [006]  5291.925965: writeback_single_inode: bdi 8:0: ino=134217902 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=7 to_write=3547 index=6
       flush-8:0-3402  [006]  5291.925999: writeback_single_inode: bdi 8:0: ino=134217903 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3544 index=2
       flush-8:0-3402  [006]  5291.926033: writeback_single_inode: bdi 8:0: ino=134217904 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3543 index=0
       flush-8:0-3402  [006]  5291.926065: writeback_single_inode: bdi 8:0: ino=134217905 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3541 index=1
       flush-8:0-3402  [006]  5291.926100: writeback_single_inode: bdi 8:0: ino=134217906 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3539 index=1
       flush-8:0-3402  [006]  5291.926131: writeback_single_inode: bdi 8:0: ino=134217907 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3538 index=0
       flush-8:0-3402  [006]  5291.926164: writeback_single_inode: bdi 8:0: ino=134217908 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3537 index=0
       flush-8:0-3402  [006]  5291.926197: writeback_single_inode: bdi 8:0: ino=134217909 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3535 index=1
       flush-8:0-3402  [006]  5291.926232: writeback_single_inode: bdi 8:0: ino=134217910 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3533 index=1
       flush-8:0-3402  [006]  5291.926264: writeback_single_inode: bdi 8:0: ino=134217911 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3531 index=1
       flush-8:0-3402  [006]  5291.926298: writeback_single_inode: bdi 8:0: ino=134217912 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3530 index=0
       flush-8:0-3402  [006]  5291.926338: writeback_single_inode: bdi 8:0: ino=134217913 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=11 to_write=3519 index=10
       flush-8:0-3402  [006]  5291.926376: writeback_single_inode: bdi 8:0: ino=134217914 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3517 index=1
       flush-8:0-3402  [006]  5291.926411: writeback_single_inode: bdi 8:0: ino=134217915 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3514 index=2
       flush-8:0-3402  [006]  5291.926450: writeback_single_inode: bdi 8:0: ino=134217916 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3511 index=2
       flush-8:0-3402  [006]  5291.926482: writeback_single_inode: bdi 8:0: ino=134217917 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3510 index=0
       flush-8:0-3402  [006]  5291.926516: writeback_single_inode: bdi 8:0: ino=134217918 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3508 index=1
       flush-8:0-3402  [006]  5291.926549: writeback_single_inode: bdi 8:0: ino=134217919 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3506 index=1
       flush-8:0-3402  [006]  5291.926594: writeback_single_inode: bdi 8:0: ino=134217984 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=9 to_write=3497 index=8
       flush-8:0-3402  [006]  5291.926627: writeback_single_inode: bdi 8:0: ino=134217985 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3494 index=2
       flush-8:0-3402  [006]  5291.926667: writeback_single_inode: bdi 8:0: ino=134217986 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3493 index=0
       flush-8:0-3402  [006]  5291.926699: writeback_single_inode: bdi 8:0: ino=134217987 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3492 index=0
       flush-8:0-3402  [006]  5291.926732: writeback_single_inode: bdi 8:0: ino=134217988 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3491 index=0
       flush-8:0-3402  [006]  5291.926763: writeback_single_inode: bdi 8:0: ino=134217989 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3490 index=0
       flush-8:0-3402  [006]  5291.926796: writeback_single_inode: bdi 8:0: ino=134217990 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3489 index=0
       flush-8:0-3402  [006]  5291.926827: writeback_single_inode: bdi 8:0: ino=134217991 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3488 index=0
       flush-8:0-3402  [006]  5291.926862: writeback_single_inode: bdi 8:0: ino=134217992 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3486 index=1
       flush-8:0-3402  [006]  5291.926895: writeback_single_inode: bdi 8:0: ino=134217993 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3484 index=1
       flush-8:0-3402  [006]  5291.926928: writeback_single_inode: bdi 8:0: ino=134217994 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3483 index=0
       flush-8:0-3402  [006]  5291.926961: writeback_single_inode: bdi 8:0: ino=134217995 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3482 index=0
       flush-8:0-3402  [006]  5291.926996: writeback_single_inode: bdi 8:0: ino=134217996 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3480 index=1
       flush-8:0-3402  [006]  5291.927029: writeback_single_inode: bdi 8:0: ino=134217997 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3478 index=1
       flush-8:0-3402  [006]  5291.927064: writeback_single_inode: bdi 8:0: ino=134217998 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3476 index=1
       flush-8:0-3402  [006]  5291.927096: writeback_single_inode: bdi 8:0: ino=134217999 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3474 index=1
       flush-8:0-3402  [006]  5291.927135: writeback_single_inode: bdi 8:0: ino=134218000 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3472 index=1
       flush-8:0-3402  [006]  5291.927168: writeback_single_inode: bdi 8:0: ino=134218001 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3470 index=1
       flush-8:0-3402  [006]  5291.927205: writeback_single_inode: bdi 8:0: ino=134218002 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3468 index=1
       flush-8:0-3402  [006]  5291.927247: writeback_single_inode: bdi 8:0: ino=134218003 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=8 to_write=3460 index=7
       flush-8:0-3402  [006]  5291.927285: writeback_single_inode: bdi 8:0: ino=134218004 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=4 to_write=3456 index=3
       flush-8:0-3402  [006]  5291.927320: writeback_single_inode: bdi 8:0: ino=134218005 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=4 to_write=3452 index=3
       flush-8:0-3402  [006]  5291.927355: writeback_single_inode: bdi 8:0: ino=134218006 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3450 index=1
       flush-8:0-3402  [006]  5291.927388: writeback_single_inode: bdi 8:0: ino=134218007 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3448 index=1
       flush-8:0-3402  [006]  5291.927421: writeback_single_inode: bdi 8:0: ino=134218008 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3447 index=0
       flush-8:0-3402  [006]  5291.927454: writeback_single_inode: bdi 8:0: ino=134218009 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3445 index=1
       flush-8:0-3402  [006]  5291.927487: writeback_single_inode: bdi 8:0: ino=134218010 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3444 index=0
       flush-8:0-3402  [006]  5291.927518: writeback_single_inode: bdi 8:0: ino=134218011 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3443 index=0
       flush-8:0-3402  [006]  5291.927555: writeback_single_inode: bdi 8:0: ino=134218012 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3441 index=1
       flush-8:0-3402  [006]  5291.927604: writeback_single_inode: bdi 8:0: ino=134218013 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=22 to_write=3419 index=13
       flush-8:0-3402  [006]  5291.927639: writeback_single_inode: bdi 8:0: ino=134218014 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3417 index=1
       flush-8:0-3402  [006]  5291.927681: writeback_single_inode: bdi 8:0: ino=134218015 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3414 index=2
       flush-8:0-3402  [006]  5291.927717: writeback_single_inode: bdi 8:0: ino=134218016 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3411 index=2
       flush-8:0-3402  [006]  5291.927747: writeback_single_inode: bdi 8:0: ino=134218017 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3410 index=0
       flush-8:0-3402  [006]  5291.927782: writeback_single_inode: bdi 8:0: ino=134218018 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3408 index=1
       flush-8:0-3402  [006]  5291.927815: writeback_single_inode: bdi 8:0: ino=134218019 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3406 index=1
       flush-8:0-3402  [006]  5291.927852: writeback_single_inode: bdi 8:0: ino=134218020 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3404 index=1
       flush-8:0-3402  [006]  5291.927885: writeback_single_inode: bdi 8:0: ino=134218021 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3401 index=2
       flush-8:0-3402  [006]  5291.927921: writeback_single_inode: bdi 8:0: ino=134218022 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3398 index=2
       flush-8:0-3402  [006]  5291.927952: writeback_single_inode: bdi 8:0: ino=134218023 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3397 index=0
       flush-8:0-3402  [006]  5291.927986: writeback_single_inode: bdi 8:0: ino=134218024 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3396 index=0
       flush-8:0-3402  [006]  5291.928020: writeback_single_inode: bdi 8:0: ino=134218025 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3393 index=2
       flush-8:0-3402  [006]  5291.928058: writeback_single_inode: bdi 8:0: ino=134218026 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3391 index=1
       flush-8:0-3402  [006]  5291.928093: writeback_single_inode: bdi 8:0: ino=134218027 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=4 to_write=3387 index=3
       flush-8:0-3402  [006]  5291.928130: writeback_single_inode: bdi 8:0: ino=134218028 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3385 index=1
       flush-8:0-3402  [006]  5291.928162: writeback_single_inode: bdi 8:0: ino=134218029 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3383 index=1
       flush-8:0-3402  [006]  5291.928197: writeback_single_inode: bdi 8:0: ino=134218030 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3381 index=1
       flush-8:0-3402  [006]  5291.928228: writeback_single_inode: bdi 8:0: ino=134218031 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3380 index=0
       flush-8:0-3402  [006]  5291.928262: writeback_single_inode: bdi 8:0: ino=134218032 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3379 index=0
       flush-8:0-3402  [006]  5291.928294: writeback_single_inode: bdi 8:0: ino=134218033 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3377 index=1
       flush-8:0-3402  [006]  5291.928333: writeback_single_inode: bdi 8:0: ino=134218034 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3375 index=1
       flush-8:0-3402  [006]  5291.928367: writeback_single_inode: bdi 8:0: ino=134218035 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3372 index=2
       flush-8:0-3402  [006]  5291.928408: writeback_single_inode: bdi 8:0: ino=134218036 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=5 to_write=3367 index=4
       flush-8:0-3402  [006]  5291.928441: writeback_single_inode: bdi 8:0: ino=134218037 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3365 index=1
       flush-8:0-3402  [006]  5291.928476: writeback_single_inode: bdi 8:0: ino=134218038 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3363 index=1
       flush-8:0-3402  [006]  5291.928507: writeback_single_inode: bdi 8:0: ino=134218039 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3362 index=0
       flush-8:0-3402  [006]  5291.928545: writeback_single_inode: bdi 8:0: ino=134218040 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3360 index=1
       flush-8:0-3402  [006]  5291.928579: writeback_single_inode: bdi 8:0: ino=134218041 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3357 index=2
       flush-8:0-3402  [006]  5291.928613: writeback_single_inode: bdi 8:0: ino=134218042 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3356 index=0
       flush-8:0-3402  [006]  5291.928653: writeback_single_inode: bdi 8:0: ino=134218043 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3353 index=2
       flush-8:0-3402  [006]  5291.928690: writeback_single_inode: bdi 8:0: ino=134218044 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3351 index=1
       flush-8:0-3402  [006]  5291.928724: writeback_single_inode: bdi 8:0: ino=134218045 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3348 index=2
       flush-8:0-3402  [006]  5291.928758: writeback_single_inode: bdi 8:0: ino=134218046 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3347 index=0
       flush-8:0-3402  [006]  5291.928793: writeback_single_inode: bdi 8:0: ino=134218047 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=5 to_write=3342 index=4
       flush-8:0-3402  [006]  5291.928826: writeback_single_inode: bdi 8:0: ino=134218048 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3341 index=0
       flush-8:0-3402  [006]  5291.928858: writeback_single_inode: bdi 8:0: ino=134218049 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3340 index=0
       flush-8:0-3402  [006]  5291.928902: writeback_single_inode: bdi 8:0: ino=134218050 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3338 index=1
       flush-8:0-3402  [006]  5291.928934: writeback_single_inode: bdi 8:0: ino=134218051 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3337 index=0
       flush-8:0-3402  [006]  5291.928967: writeback_single_inode: bdi 8:0: ino=134218052 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3336 index=0
       flush-8:0-3402  [006]  5291.929001: writeback_single_inode: bdi 8:0: ino=134218053 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3333 index=2
       flush-8:0-3402  [006]  5291.929039: writeback_single_inode: bdi 8:0: ino=134218054 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=5 to_write=3328 index=4
       flush-8:0-3402  [006]  5291.929070: writeback_single_inode: bdi 8:0: ino=134218055 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3327 index=0
       flush-8:0-3402  [006]  5291.929105: writeback_single_inode: bdi 8:0: ino=134218056 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3325 index=1
       flush-8:0-3402  [006]  5291.929137: writeback_single_inode: bdi 8:0: ino=134218057 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3324 index=0
       flush-8:0-3402  [006]  5291.929170: writeback_single_inode: bdi 8:0: ino=134218058 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3323 index=0
       flush-8:0-3402  [006]  5291.929201: writeback_single_inode: bdi 8:0: ino=134218059 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3322 index=0
       flush-8:0-3402  [006]  5291.929279: writeback_single_inode: bdi 8:0: ino=402653351 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=51 to_write=3271 index=13
       flush-8:0-3402  [006]  5291.929314: writeback_single_inode: bdi 8:0: ino=402653352 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=5 to_write=3266 index=4
       flush-8:0-3402  [006]  5291.929352: writeback_single_inode: bdi 8:0: ino=402653353 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=4 to_write=3262 index=3
       flush-8:0-3402  [006]  5291.929389: writeback_single_inode: bdi 8:0: ino=134218060 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=7 to_write=3255 index=6
       flush-8:0-3402  [006]  5291.929430: writeback_single_inode: bdi 8:0: ino=134218061 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=8 to_write=3247 index=7
       flush-8:0-3402  [006]  5291.929461: writeback_single_inode: bdi 8:0: ino=134218062 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3246 index=0
       flush-8:0-3402  [006]  5291.929495: writeback_single_inode: bdi 8:0: ino=134218063 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3245 index=0
       flush-8:0-3402  [006]  5291.929526: writeback_single_inode: bdi 8:0: ino=134218064 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3244 index=0
       flush-8:0-3402  [006]  5291.929559: writeback_single_inode: bdi 8:0: ino=134218065 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3243 index=0
       flush-8:0-3402  [006]  5291.929594: writeback_single_inode: bdi 8:0: ino=134218066 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=4 to_write=3239 index=3
       flush-8:0-3402  [006]  5291.929629: writeback_single_inode: bdi 8:0: ino=268628487 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3238 index=0
       flush-8:0-3402  [006]  5291.929671: writeback_single_inode: bdi 8:0: ino=268628488 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=4 to_write=3234 index=3
       flush-8:0-3402  [006]  5291.929709: writeback_single_inode: bdi 8:0: ino=268628489 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3231 index=2
       flush-8:0-3402  [006]  5291.929744: writeback_single_inode: bdi 8:0: ino=268628490 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=5 to_write=3226 index=4
       flush-8:0-3402  [006]  5291.929780: writeback_single_inode: bdi 8:0: ino=268628491 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3225 index=0
       flush-8:0-3402  [006]  5291.929817: writeback_single_inode: bdi 8:0: ino=268628492 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=7 to_write=3218 index=6
       flush-8:0-3402  [006]  5291.929853: writeback_single_inode: bdi 8:0: ino=268628493 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=3 to_write=3215 index=2
       flush-8:0-3402  [006]  5291.929885: writeback_single_inode: bdi 8:0: ino=402653355 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=1 to_write=3214 index=0
       flush-8:0-3402  [006]  5291.929919: writeback_single_inode: bdi 8:0: ino=402653356 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3212 index=1
       flush-8:0-3402  [006]  5291.929956: writeback_single_inode: bdi 8:0: ino=402653357 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=7 to_write=3205 index=6
       flush-8:0-3402  [006]  5291.929994: writeback_single_inode: bdi 8:0: ino=402653358 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3203 index=1
       flush-8:0-3402  [006]  5291.930027: writeback_single_inode: bdi 8:0: ino=402653359 state=I_DIRTY_SYNC|I_DIRTY_PAGES age=0 wrote=2 to_write=3201 index=1
wfg /tmp%

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 5/6] writeback: try more writeback as long as something was written
  2011-04-21  3:33             ` Wu Fengguang
  2011-04-21  4:39               ` Christoph Hellwig
@ 2011-04-21  7:09               ` Dave Chinner
  2011-04-21  7:14                 ` Christoph Hellwig
  1 sibling, 1 reply; 62+ messages in thread
From: Dave Chinner @ 2011-04-21  7:09 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Andrew Morton, Mel Gorman, Trond Myklebust,
	Itaru Kitayama, Minchan Kim, LKML, linux-fsdevel@vger.kernel.org,
	Linux Memory Management List

On Thu, Apr 21, 2011 at 11:33:25AM +0800, Wu Fengguang wrote:
> I collected the writeback_single_inode() traces (patch attached for
> your reference) each for several test runs, and find much more
> I_DIRTY_PAGES after patchset. Dave, do you know why there are so many
> I_DIRTY_PAGES (or radix tag) remained after the XFS ->writepages() call,
> even for small files?
> 
> wfg /tmp% g -c I_DIRTY_PAGES trace-*
> trace-moving-expire-1:28213
> trace-no-moving-expire:6684
> 
> wfg /tmp% g -c I_DIRTY_DATASYNC trace-*
> trace-moving-expire-1:179
> trace-no-moving-expire:193
> 
> wfg /tmp% g -c I_DIRTY_SYNC trace-* 
> trace-moving-expire-1:29394
> trace-no-moving-expire:31593
> 
> wfg /tmp% wc -l trace-*
>    81108 trace-moving-expire-1
>    68562 trace-no-moving-expire

Likely just timing. When IO completes and updates the inode IO size,
XFS calls mark_inode_dirty() again to ensure that the metadata that
was changed gets written out at a later point in time.
Hence every single file that is created by the test will be marked
dirty again after the first write has returned and disappeared.

Why do you see different numbers? It's timing dependent, based on IO
completion rates - if you have a fast disk the IO completion can
occur before write_inode() is called, and so the inode can be written
and the dirty page state removed in the one writeback_single_inode()
call...

That's my initial guess without looking at it in any real detail,
anyway.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 5/6] writeback: try more writeback as long as something was written
  2011-04-21  7:09               ` Dave Chinner
@ 2011-04-21  7:14                 ` Christoph Hellwig
  2011-04-21  7:52                   ` Dave Chinner
  0 siblings, 1 reply; 62+ messages in thread
From: Christoph Hellwig @ 2011-04-21  7:14 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Wu Fengguang, Jan Kara, Andrew Morton, Mel Gorman,
	Trond Myklebust, Itaru Kitayama, Minchan Kim, LKML,
	linux-fsdevel@vger.kernel.org, Linux Memory Management List

On Thu, Apr 21, 2011 at 05:09:47PM +1000, Dave Chinner wrote:
> Likely just timing. When IO completes and updates the inode IO size,
> XFS calls mark_inode_dirty() again to ensure that the metadata that
> was changed gets written out at a later point in time.
> Hence every single file that is created by the test will be marked
> dirty again after the first write has returned and disappeared.
> 
> Why do you see different numbers? It's timing dependent, based on IO
> completion rates - if you have a fast disk the IO completion can
> occur before write_inode() is called, and so the inode can be written
> and the dirty page state removed in the one writeback_single_inode()
> call...
> 
> That's my initial guess without looking at it in any real detail,
> anyway.

We shouldn't have I_DIRTY_PAGES set for that case, as we only redirty
metadata.  But we're actually doing a xfs_mark_inode_dirty, which
dirties all of I_DIRTY, which includes I_DIRTY_PAGES.  I guess it
should change to

	__mark_inode_dirty(inode, I_DIRTY_SYNC | I_DIRTY_DATASYNC);


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 5/6] writeback: try more writeback as long as something was written
  2011-04-21  7:14                 ` Christoph Hellwig
@ 2011-04-21  7:52                   ` Dave Chinner
  2011-04-21  8:00                     ` Christoph Hellwig
  0 siblings, 1 reply; 62+ messages in thread
From: Dave Chinner @ 2011-04-21  7:52 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Wu Fengguang, Jan Kara, Andrew Morton, Mel Gorman,
	Trond Myklebust, Itaru Kitayama, Minchan Kim, LKML,
	linux-fsdevel@vger.kernel.org, Linux Memory Management List

On Thu, Apr 21, 2011 at 03:14:26AM -0400, Christoph Hellwig wrote:
> On Thu, Apr 21, 2011 at 05:09:47PM +1000, Dave Chinner wrote:
> > Likely just timing. When IO completes and updates the inode IO size,
> > XFS calls mark_inode_dirty() again to ensure that the metadata that
> > was changed gets written out at a later point in time.
> > Hence every single file that is created by the test will be marked
> > dirty again after the first write has returned and disappeared.
> > 
> > Why do you see different numbers? It's timing dependent, based on IO
> > completion rates - if you have a fast disk the IO completion can
> > occur before write_inode() is called, and so the inode can be written
> > and the dirty page state removed in the one writeback_single_inode()
> > call...
> > 
> > That's my initial guess without looking at it in any real detail,
> > anyway.
> 
> We shouldn't have I_DIRTY_PAGES set for that case, as we only redirty
> metadata.  But we're actually doing a xfs_mark_inode_dirty, which
> dirties all of I_DIRTY, which includes I_DIRTY_PAGES.  I guess it
> should change to
> 
> 	__mark_inode_dirty(inode, I_DIRTY_SYNC | I_DIRTY_DATASYNC);

Probably should. Using xfs_mark_inode_dirty_sync() might be the best
thing to do.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 5/6] writeback: try more writeback as long as something was written
  2011-04-21  7:52                   ` Dave Chinner
@ 2011-04-21  8:00                     ` Christoph Hellwig
  0 siblings, 0 replies; 62+ messages in thread
From: Christoph Hellwig @ 2011-04-21  8:00 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Wu Fengguang, Jan Kara, Andrew Morton,
	Mel Gorman, Trond Myklebust, Itaru Kitayama, Minchan Kim, LKML,
	linux-fsdevel@vger.kernel.org, Linux Memory Management List

On Thu, Apr 21, 2011 at 05:52:58PM +1000, Dave Chinner wrote:
> > We shouldn't have I_DIRTY_PAGES set for that case, as we only redirty
> > metadata.  But we're actually doing a xfs_mark_inode_dirty, which
> > dirties all of I_DIRTY, which includes I_DIRTY_PAGES.  I guess it
> > should change to
> > 
> > 	__mark_inode_dirty(inode, I_DIRTY_SYNC | I_DIRTY_DATASYNC);
> 
> Probably should. Using xfs_mark_inode_dirty_sync() might be the best
> thing to do.

That's not correct either - we need to set I_DIRTY_DATASYNC so that it
gets caught by fdatasync and not just fsync.

But thinking about it I'm actually not sure we need it at all.  We already
wait for the i_iocount to go to zero both in fsync and ->sync_fs, which will
catch pending I/O completions even without any VFS dirty state.  So just
marking the inode dirty (as I_DIRTY_SYNC | I_DIRTY_DATASYNC) on I/O
completion should be enough these days.


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 5/6] writeback: try more writeback as long as something was written
  2011-04-21  6:05                 ` Wu Fengguang
@ 2011-04-21 16:41                   ` Jan Kara
  2011-04-22  2:32                     ` Wu Fengguang
  0 siblings, 1 reply; 62+ messages in thread
From: Jan Kara @ 2011-04-21 16:41 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Christoph Hellwig, Jan Kara, Andrew Morton, Mel Gorman,
	Dave Chinner, Trond Myklebust, Itaru Kitayama, Minchan Kim, LKML,
	linux-fsdevel@vger.kernel.org, Linux Memory Management List

On Thu 21-04-11 14:05:56, Wu Fengguang wrote:
> On Thu, Apr 21, 2011 at 12:39:40PM +0800, Christoph Hellwig wrote:
> > On Thu, Apr 21, 2011 at 11:33:25AM +0800, Wu Fengguang wrote:
> > > I collected the writeback_single_inode() traces (patch attached for
> > > your reference) each for several test runs, and find much more
> > > I_DIRTY_PAGES after patchset. Dave, do you know why there are so many
> > > I_DIRTY_PAGES (or radix tag) remained after the XFS ->writepages() call,
> > > even for small files?
> > 
> > What is your definition of a small file?  As soon as it has multiple
> > extents or holes there's absolutely no way to clean it with a single
> > writepage call.
> 
> It's writing a kernel source tree to XFS. You can find in the below
> trace that it often leaves more dirty pages behind (indicated by the
I_DIRTY_PAGES flag) after writing as few as 1 page (indicated by the
> wrote=1 field).
  As Dave said, it's probably just a race, since XFS redirties the inode on
IO completion. I think the inodes are simply small: they have only a few
dirty pages, so there isn't much to write, and they get written and
redirtied before you check the I_DIRTY flags. You could use the radix tree
dirty tag to verify whether there are really dirty pages or not...

  BTW a quick check of kernel tree shows the following distribution of
sizes (in KB):
  Count KB  Cumulative Percent
    257 0   0.9%
  13309 4   45%
   5553 8   63%
   2997 12  73%
   1879 16  80%
   1275 20  83%
    987 24  87%
    685 28  89%
    540 32  91%
    387 36  ...
    309 40
    264 44
    249 48
    170 52
    143 56
    144 60
    132 64
    100 68
    ...
Total 30155

And the distribution of your 'wrote=xxx' roughly corresponds to this...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 5/6] writeback: try more writeback as long as something was written
  2011-04-21 16:41                   ` Jan Kara
@ 2011-04-22  2:32                     ` Wu Fengguang
  2011-04-22 21:23                       ` Jan Kara
  0 siblings, 1 reply; 62+ messages in thread
From: Wu Fengguang @ 2011-04-22  2:32 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, Andrew Morton, Mel Gorman, Dave Chinner,
	Trond Myklebust, Itaru Kitayama, Minchan Kim, LKML,
	linux-fsdevel@vger.kernel.org, Linux Memory Management List

On Fri, Apr 22, 2011 at 12:41:54AM +0800, Jan Kara wrote:
> On Thu 21-04-11 14:05:56, Wu Fengguang wrote:
> > On Thu, Apr 21, 2011 at 12:39:40PM +0800, Christoph Hellwig wrote:
> > > On Thu, Apr 21, 2011 at 11:33:25AM +0800, Wu Fengguang wrote:
> > > > I collected the writeback_single_inode() traces (patch attached for
> > > > your reference) each for several test runs, and find much more
> > > > I_DIRTY_PAGES after patchset. Dave, do you know why there are so many
> > > > I_DIRTY_PAGES (or radix tag) remained after the XFS ->writepages() call,
> > > > even for small files?
> > > 
> > > What is your definition of a small file?  As soon as it has multiple
> > > extents or holes there's absolutely no way to clean it with a single
> > > writepage call.
> > 
> > It's writing a kernel source tree to XFS. You can find in the below
> > trace that it often leaves more dirty pages behind (indicated by the
> > I_DIRTY_PAGES flag) after writing as few as 1 page (indicated by the
> > wrote=1 field).
>   As Dave said, it's probably just a race since XFS redirties the inode on
> IO completion. So I think the inodes are just small so they have only a few
> dirty pages so you don't have much to write and they are written and
> redirtied before you check the I_DIRTY flags. You could use radix tree
> dirty tag to verify whether there are really dirty pages or not...

Yeah, Dave and Christoph root-caused it in the other email -- XFS sets
I_DIRTY, which accidentally sets I_DIRTY_PAGES. We can safely bet there
are no real dirty pages -- otherwise it would have shown up as a
performance regression.

>   BTW a quick check of kernel tree shows the following distribution of
> sizes (in KB):
>   Count KB  Cumulative Percent
>     257 0   0.9%
>   13309 4   45%
>    5553 8   63%
>    2997 12  73%
>    1879 16  80%
>    1275 20  83%
>     987 24  87%
>     685 28  89%
>     540 32  91%
>     387 36  ...
>     309 40
>     264 44
>     249 48
>     170 52
>     143 56
>     144 60
>     132 64
>     100 68
>     ...
> Total 30155
> 
> And the distribution of your 'wrote=xxx' roughly corresponds to this...

Nice numbers! How do you manage to account them? :)

Thanks,
Fengguang


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 5/6] writeback: try more writeback as long as something was written
  2011-04-22  2:32                     ` Wu Fengguang
@ 2011-04-22 21:23                       ` Jan Kara
  0 siblings, 0 replies; 62+ messages in thread
From: Jan Kara @ 2011-04-22 21:23 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Christoph Hellwig, Andrew Morton, Mel Gorman,
	Dave Chinner, Trond Myklebust, Itaru Kitayama, Minchan Kim, LKML,
	linux-fsdevel@vger.kernel.org, Linux Memory Management List

On Fri 22-04-11 10:32:26, Wu Fengguang wrote:
> On Fri, Apr 22, 2011 at 12:41:54AM +0800, Jan Kara wrote:
> > On Thu 21-04-11 14:05:56, Wu Fengguang wrote:
> > > On Thu, Apr 21, 2011 at 12:39:40PM +0800, Christoph Hellwig wrote:
> > > > On Thu, Apr 21, 2011 at 11:33:25AM +0800, Wu Fengguang wrote:
> > > > > I collected the writeback_single_inode() traces (patch attached for
> > > > > your reference) each for several test runs, and find much more
> > > > > I_DIRTY_PAGES after patchset. Dave, do you know why there are so many
> > > > > I_DIRTY_PAGES (or radix tag) remained after the XFS ->writepages() call,
> > > > > even for small files?
> > > > 
> > > > What is your definition of a small file?  As soon as it has multiple
> > > > extents or holes there's absolutely no way to clean it with a single
> > > > writepage call.
> > > 
> > > It's writing a kernel source tree to XFS. You can find in the below
> > > trace that it often leaves more dirty pages behind (indicated by the
> > > I_DIRTY_PAGES flag) after writing as few as 1 page (indicated by the
> > > wrote=1 field).
> >   As Dave said, it's probably just a race since XFS redirties the inode on
> > IO completion. So I think the inodes are just small so they have only a few
> > dirty pages so you don't have much to write and they are written and
> > redirtied before you check the I_DIRTY flags. You could use radix tree
> > dirty tag to verify whether there are really dirty pages or not...
> 
> Yeah, Dave and Christoph root caused it in the other email -- XFS sets
> I_DIRTY which accidentally sets I_DIRTY_PAGES. We can safely bet there
> are no real dirty pages -- otherwise it would have turned up as
> performance regressions.
  Yes, but then the question of what we actually do better is still open,
right? :) I'm really curious what it could be, because especially in your
copy-kernel case it should not make much difference - except maybe if we
occasionally managed to block on PageLock behind the writing thread and now
we don't because we queue the inode later, but I find that highly unlikely.

> >   BTW a quick check of kernel tree shows the following distribution of
> > sizes (in KB):
> >   Count KB  Cumulative Percent
> >     257 0   0.9%
> >   13309 4   45%
> >    5553 8   63%
> >    2997 12  73%
> >    1879 16  80%
> >    1275 20  83%
> >     987 24  87%
> >     685 28  89%
> >     540 32  91%
> >     387 36  ...
> >     309 40
> >     264 44
> >     249 48
> >     170 52
> >     143 56
> >     144 60
> >     132 64
> >     100 68
> >     ...
> > Total 30155
> > 
> > And the distribution of your 'wrote=xxx' roughly corresponds to this...
> 
> Nice numbers! How do you manage to account them? :)
  Easy shell command (and I hand-computed the percentages because I was too
lazy to write a script for that):
find . -type f -name "*.[ch]" -exec du {} \; | cut -d '	' -f 1 |
sort -n | uniq -c

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 62+ messages in thread

end of thread, other threads:[~2011-04-22 21:23 UTC | newest]

Thread overview: 62+ messages
-- links below jump to the message on this page --
2010-07-22  5:09 [PATCH 0/6] [RFC] writeback: try to write older pages first Wu Fengguang
2010-07-22  5:09 ` [PATCH 1/6] writeback: pass writeback_control down to move_expired_inodes() Wu Fengguang
2010-07-23 18:16   ` Jan Kara
2010-07-26 10:44   ` Mel Gorman
2010-08-01 15:23   ` Minchan Kim
2010-07-22  5:09 ` [PATCH 2/6] writeback: the kupdate expire timestamp should be a moving target Wu Fengguang
2010-07-23 18:17   ` Jan Kara
2010-07-26 10:52   ` Mel Gorman
2010-07-26 11:32     ` Wu Fengguang
2010-08-01 15:29   ` Minchan Kim
2010-07-22  5:09 ` [PATCH 3/6] writeback: kill writeback_control.more_io Wu Fengguang
2010-07-23 18:24   ` Jan Kara
2010-07-26 10:53   ` Mel Gorman
2010-08-01 15:34   ` Minchan Kim
2010-08-05 14:50     ` Wu Fengguang
2010-08-05 14:55       ` Wu Fengguang
2010-08-05 14:56       ` Minchan Kim
2010-08-05 15:26         ` Wu Fengguang
2010-07-22  5:09 ` [PATCH 4/6] writeback: sync expired inodes first in background writeback Wu Fengguang
2010-07-23 18:15   ` Jan Kara
2010-07-26 11:51     ` Wu Fengguang
2010-07-26 12:12       ` Jan Kara
2010-07-26 12:29         ` Wu Fengguang
2010-07-26 10:57   ` Mel Gorman
2010-07-26 12:00     ` Wu Fengguang
2010-07-26 12:20       ` Jan Kara
2010-07-26 12:31         ` Wu Fengguang
2010-07-26 12:39           ` Jan Kara
2010-07-26 12:47             ` Wu Fengguang
2010-07-26 12:56     ` Wu Fengguang
2010-07-26 12:59       ` Mel Gorman
2010-07-26 13:11         ` Wu Fengguang
2010-07-27  9:45           ` Mel Gorman
2010-08-01 15:15           ` Minchan Kim
2010-07-22  5:09 ` [PATCH 5/6] writeback: try more writeback as long as something was written Wu Fengguang
2010-07-23 17:39   ` Jan Kara
2010-07-26 12:39     ` Wu Fengguang
2010-07-26 11:01   ` Mel Gorman
2010-07-26 11:39     ` Wu Fengguang
2010-07-22  5:09 ` [PATCH 6/6] writeback: introduce writeback_control.inodes_written Wu Fengguang
2010-07-26 11:04   ` Mel Gorman
2010-07-23 10:24 ` [PATCH 0/6] [RFC] writeback: try to write older pages first Mel Gorman
2010-07-26  7:18   ` Wu Fengguang
2010-07-26 10:42     ` Mel Gorman
2010-07-26 10:28 ` Itaru Kitayama
2010-07-26 11:47   ` Wu Fengguang
  -- strict thread matches above, loose matches on Subject: below --
2011-04-19  3:00 [PATCH 0/6] writeback: moving expire targets for background/kupdate works Wu Fengguang
2011-04-19  3:00 ` [PATCH 5/6] writeback: try more writeback as long as something was written Wu Fengguang
2011-04-19 10:20   ` Jan Kara
2011-04-19 11:16     ` Wu Fengguang
2011-04-19 21:10       ` Jan Kara
2011-04-20  7:50         ` Wu Fengguang
2011-04-20 15:22           ` Jan Kara
2011-04-21  3:33             ` Wu Fengguang
2011-04-21  4:39               ` Christoph Hellwig
2011-04-21  6:05                 ` Wu Fengguang
2011-04-21 16:41                   ` Jan Kara
2011-04-22  2:32                     ` Wu Fengguang
2011-04-22 21:23                       ` Jan Kara
2011-04-21  7:09               ` Dave Chinner
2011-04-21  7:14                 ` Christoph Hellwig
2011-04-21  7:52                   ` Dave Chinner
2011-04-21  8:00                     ` Christoph Hellwig
