linux-fsdevel.vger.kernel.org archive mirror
* [PATCH 0/5]  [RFC] transfer ASYNC vmscan writeback IO to the flusher threads
@ 2010-07-29 11:51 Wu Fengguang
  2010-07-29 11:51 ` [PATCH 1/5] writeback: introduce wbc.for_sync to cover the two sync stages Wu Fengguang
                   ` (6 more replies)
  0 siblings, 7 replies; 30+ messages in thread
From: Wu Fengguang @ 2010-07-29 11:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wu Fengguang, LKML, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin,
	Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrea Arcangeli, Mel Gorman,
	Minchan Kim

Andrew,

It's possible to transfer ASYNC vmscan writeback IOs to the flusher threads.
This simple patchset shows the basic idea. Since it's a big behavior change,
there are inevitably lots of details to sort out. I don't know where it will
go after tests and discussions, so the patches are intentionally kept simple.

sync livelock avoidance (more is needed to be complete, but this is the minimum required for the last two patches)
	[PATCH 1/5] writeback: introduce wbc.for_sync to cover the two sync stages
	[PATCH 2/5] writeback: stop periodic/background work on seeing sync works
	[PATCH 3/5] writeback: prevent sync livelock with the sync_after timestamp

let the flusher threads do ASYNC writeback for pageout()
	[PATCH 4/5] writeback: introduce bdi_start_inode_writeback()
	[PATCH 5/5] vmscan: transfer async file writeback to the flusher

The last two patches are the meat; they depend on the first three patches to
kick the background writeback work, so that the for_reclaim writeback can be
serviced in a timely manner.
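
For reference, the core hook is small. Condensed from patch 5, pageout()
hands ASYNC file writeback to the flusher instead of calling ->writepage()
directly from vmscan:

	if (sync_writeback == PAGEOUT_IO_ASYNC &&
	    page_is_file_cache(page)) {
		/* let the flusher clean this page (and its neighbors) */
		bdi_start_inode_writeback(mapping->host, page->index);
		return PAGE_KEEP;
	}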

Comments are welcome!

Thanks,
Fengguang


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 1/5] writeback: introduce wbc.for_sync to cover the two sync stages
  2010-07-29 11:51 [PATCH 0/5] [RFC] transfer ASYNC vmscan writeback IO to the flusher threads Wu Fengguang
@ 2010-07-29 11:51 ` Wu Fengguang
  2010-07-29 15:04   ` Jan Kara
  2010-07-29 11:51 ` [PATCH 2/5] writeback: stop periodic/background work on seeing sync works Wu Fengguang
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 30+ messages in thread
From: Wu Fengguang @ 2010-07-29 11:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wu Fengguang, LKML, Jan Kara, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin,
	Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrea Arcangeli, Mel Gorman,
	Minchan Kim

[-- Attachment #1: writeback-for-sync.patch --]
[-- Type: text/plain, Size: 2990 bytes --]

sync() is performed in two stages: the WB_SYNC_NONE sync and the
WB_SYNC_ALL sync. It is necessary to tag both stages with wbc.for_sync,
so as to prevent either of them from being livelocked.

The basic livelock scheme will be based on the sync_after timestamp:
inodes dirtied after it won't be queued for IO. The timestamp could be
recorded as early as the sync() call; this patch lazily sets it in
writeback_inodes_sb()/sync_inodes_sb(). This stops the livelock, but
may do more work than necessary.

Note that writeback_inodes_sb() is not called only by sync(); all
callers are treated the same because they need the same livelock
prevention.
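
To illustrate where the timestamp is headed (the actual check lands in
patch 3 of this series), the queueing guard is intended to look like:

	/* only expire inodes dirtied before the sync was requested */
	if (wbc->for_sync) {
		expire_interval = 1;
		older_than_this = wbc->sync_after;
	}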

CC: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c         |   21 ++++++++++++---------
 include/linux/writeback.h |    1 +
 2 files changed, 13 insertions(+), 9 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2010-07-28 17:05:17.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-07-28 21:21:31.000000000 +0800
@@ -36,6 +36,8 @@ struct wb_writeback_work {
 	long nr_pages;
 	struct super_block *sb;
 	enum writeback_sync_modes sync_mode;
+	unsigned long sync_after;
+	unsigned int for_sync:1;
 	unsigned int for_kupdate:1;
 	unsigned int range_cyclic:1;
 	unsigned int for_background:1;
@@ -1086,20 +1090,17 @@ static void wait_sb_inodes(struct super_
  */
 void writeback_inodes_sb(struct super_block *sb)
 {
-	unsigned long nr_dirty = global_page_state(NR_FILE_DIRTY);
-	unsigned long nr_unstable = global_page_state(NR_UNSTABLE_NFS);
 	DECLARE_COMPLETION_ONSTACK(done);
 	struct wb_writeback_work work = {
 		.sb		= sb,
 		.sync_mode	= WB_SYNC_NONE,
+		.for_sync	= 1,
+		.sync_after	= jiffies,
 		.done		= &done,
 	};
 
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
-	work.nr_pages = nr_dirty + nr_unstable +
-			(inodes_stat.nr_inodes - inodes_stat.nr_unused);
-
 	bdi_queue_work(sb->s_bdi, &work);
 	wait_for_completion(&done);
 }
@@ -1137,6 +1138,8 @@ void sync_inodes_sb(struct super_block *
 	struct wb_writeback_work work = {
 		.sb		= sb,
 		.sync_mode	= WB_SYNC_ALL,
+		.for_sync	= 1,
+		.sync_after	= jiffies,
 		.nr_pages	= LONG_MAX,
 		.range_cyclic	= 0,
 		.done		= &done,
--- linux-next.orig/include/linux/writeback.h	2010-07-28 17:05:17.000000000 +0800
+++ linux-next/include/linux/writeback.h	2010-07-28 21:24:54.000000000 +0800
@@ -48,6 +48,7 @@ struct writeback_control {
 	unsigned encountered_congestion:1; /* An output: a queue is full */
 	unsigned for_kupdate:1;		/* A kupdate writeback */
 	unsigned for_background:1;	/* A background writeback */
+	unsigned for_sync:1;		/* A writeback for sync */
 	unsigned for_reclaim:1;		/* Invoked from the page allocator */
 	unsigned range_cyclic:1;	/* range_start is cyclic */
 };



^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 2/5] writeback: stop periodic/background work on seeing sync works
  2010-07-29 11:51 [PATCH 0/5] [RFC] transfer ASYNC vmscan writeback IO to the flusher threads Wu Fengguang
  2010-07-29 11:51 ` [PATCH 1/5] writeback: introduce wbc.for_sync to cover the two sync stages Wu Fengguang
@ 2010-07-29 11:51 ` Wu Fengguang
  2010-07-29 16:20   ` Jan Kara
  2010-07-29 11:51 ` [PATCH 3/5] writeback: prevent sync livelock with the sync_after timestamp Wu Fengguang
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 30+ messages in thread
From: Wu Fengguang @ 2010-07-29 11:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wu Fengguang, LKML, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin,
	Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrea Arcangeli, Mel Gorman,
	Minchan Kim

[-- Attachment #1: writeback-sync-pending.patch --]
[-- Type: text/plain, Size: 2679 bytes --]

The periodic/background writeback can run forever. So when any
sync work is enqueued, increase bdi->sync_works to notify the
active non-sync works to exit. Non-sync works queued after sync
works won't be affected.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c           |   13 +++++++++++++
 include/linux/backing-dev.h |    6 ++++++
 mm/backing-dev.c            |    1 +
 3 files changed, 20 insertions(+)

--- linux-next.orig/fs/fs-writeback.c	2010-07-29 17:13:23.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-07-29 17:13:49.000000000 +0800
@@ -80,6 +80,8 @@ static void bdi_queue_work(struct backin
 
 	spin_lock(&bdi->wb_lock);
 	list_add_tail(&work->list, &bdi->work_list);
+	if (work->for_sync)
+		atomic_inc(&bdi->wb.sync_works);
 	spin_unlock(&bdi->wb_lock);
 
 	/*
@@ -633,6 +635,14 @@ static long wb_writeback(struct bdi_writ
 			break;
 
 		/*
+		 * background/periodic works can run forever; abort on
+		 * seeing any pending sync work to avoid livelocking it.
+		 */
+		if (atomic_read(&wb->sync_works) &&
+		    (work->for_background || work->for_kupdate))
+			break;
+
+		/*
 		 * For background writeout, stop when we are below the
 		 * background dirty threshold
 		 */
@@ -765,6 +775,9 @@ long wb_do_writeback(struct bdi_writebac
 
 		wrote += wb_writeback(wb, work);
 
+		if (work->for_sync)
+			atomic_dec(&wb->sync_works);
+
 		/*
 		 * Notify the caller of completion if this is a synchronous
 		 * work item, otherwise just free it.
--- linux-next.orig/include/linux/backing-dev.h	2010-07-29 17:13:23.000000000 +0800
+++ linux-next/include/linux/backing-dev.h	2010-07-29 17:13:31.000000000 +0800
@@ -50,6 +50,12 @@ struct bdi_writeback {
 
 	unsigned long last_old_flush;		/* last old data flush */
 
+	/*
+	 * sync works queued, background works shall abort on seeing this,
+	 * to prevent livelocking the sync works
+	 */
+	atomic_t sync_works;
+
 	struct task_struct	*task;		/* writeback task */
 	struct list_head	b_dirty;	/* dirty inodes */
 	struct list_head	b_io;		/* parked for writeback */
--- linux-next.orig/mm/backing-dev.c	2010-07-29 17:13:23.000000000 +0800
+++ linux-next/mm/backing-dev.c	2010-07-29 17:13:31.000000000 +0800
@@ -257,6 +257,7 @@ static void bdi_wb_init(struct bdi_write
 
 	wb->bdi = bdi;
 	wb->last_old_flush = jiffies;
+	atomic_set(&wb->sync_works, 0);
 	INIT_LIST_HEAD(&wb->b_dirty);
 	INIT_LIST_HEAD(&wb->b_io);
 	INIT_LIST_HEAD(&wb->b_more_io);



^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 3/5] writeback: prevent sync livelock with the sync_after timestamp
  2010-07-29 11:51 [PATCH 0/5] [RFC] transfer ASYNC vmscan writeback IO to the flusher threads Wu Fengguang
  2010-07-29 11:51 ` [PATCH 1/5] writeback: introduce wbc.for_sync to cover the two sync stages Wu Fengguang
  2010-07-29 11:51 ` [PATCH 2/5] writeback: stop periodic/background work on seeing sync works Wu Fengguang
@ 2010-07-29 11:51 ` Wu Fengguang
  2010-07-29 15:02   ` Jan Kara
  2010-07-29 11:51 ` [PATCH 4/5] writeback: introduce bdi_start_inode_writeback() Wu Fengguang
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 30+ messages in thread
From: Wu Fengguang @ 2010-07-29 11:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wu Fengguang, LKML, Jan Kara, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin,
	Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrea Arcangeli, Mel Gorman,
	Minchan Kim

[-- Attachment #1: writeback-sync-pending-start_time.patch --]
[-- Type: text/plain, Size: 3143 bytes --]

The start time in writeback_inodes_wb() is not very useful because it
slips forward at each invocation. Preferably one _constant_ timestamp
should be used from the beginning to cover the whole sync() work.

The newly dirtied inodes are now guarded at queue_io() time instead of
at b_io walk time. This is more natural: non-empty b_io/b_more_io
means "more work pending".

The timestamp is now grabbed at the sync work submission time, and may
be further optimized to the initial sync() call time.

CC: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c         |   16 ++++++----------
 include/linux/writeback.h |    4 ++--
 2 files changed, 8 insertions(+), 12 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2010-07-29 17:13:49.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-07-29 17:13:58.000000000 +0800
@@ -228,6 +228,10 @@ static void move_expired_inodes(struct l
 	struct inode *inode;
 	int do_sb_sort = 0;
 
+	if (wbc->for_sync) {
+		expire_interval = 1;
+		older_than_this = wbc->sync_after;
+	}
 	if (wbc->for_kupdate || wbc->for_background) {
 		expire_interval = msecs_to_jiffies(dirty_expire_interval * 10);
 		older_than_this = jiffies - expire_interval;
@@ -507,12 +511,6 @@ static int writeback_sb_inodes(struct su
 			requeue_io(inode);
 			continue;
 		}
-		/*
-		 * Was this inode dirtied after sync_sb_inodes was called?
-		 * This keeps sync from extra jobs and livelock.
-		 */
-		if (inode_dirtied_after(inode, wbc->wb_start))
-			return 1;
 
 		BUG_ON(inode->i_state & I_FREEING);
 		__iget(inode);
@@ -541,10 +539,9 @@ void writeback_inodes_wb(struct bdi_writ
 {
 	int ret = 0;
 
-	wbc->wb_start = jiffies; /* livelock avoidance */
 	spin_lock(&inode_lock);
 
-	if (!(wbc->for_kupdate || wbc->for_background) || list_empty(&wb->b_io))
+	if (list_empty(&wb->b_io))
 		queue_io(wb, wbc);
 
 	while (!list_empty(&wb->b_io)) {
@@ -571,9 +568,8 @@ static void __writeback_inodes_sb(struct
 {
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
-	wbc->wb_start = jiffies; /* livelock avoidance */
 	spin_lock(&inode_lock);
-	if (!(wbc->for_kupdate || wbc->for_background) || list_empty(&wb->b_io))
+	if (list_empty(&wb->b_io))
 		queue_io(wb, wbc);
 	writeback_sb_inodes(sb, wb, wbc, true);
 	spin_unlock(&inode_lock);
--- linux-next.orig/include/linux/writeback.h	2010-07-29 17:13:18.000000000 +0800
+++ linux-next/include/linux/writeback.h	2010-07-29 17:13:58.000000000 +0800
@@ -28,8 +28,8 @@ enum writeback_sync_modes {
  */
 struct writeback_control {
 	enum writeback_sync_modes sync_mode;
-	unsigned long wb_start;         /* Time writeback_inodes_wb was
-					   called. This is needed to avoid
+	unsigned long sync_after;	/* Only sync inodes dirtied after this
+					   timestamp. This is needed to avoid
 					   extra jobs and livelock */
 	long nr_to_write;		/* Write this many pages, and decrement
 					   this for each page written */



^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 4/5] writeback: introduce bdi_start_inode_writeback()
  2010-07-29 11:51 [PATCH 0/5] [RFC] transfer ASYNC vmscan writeback IO to the flusher threads Wu Fengguang
                   ` (2 preceding siblings ...)
  2010-07-29 11:51 ` [PATCH 3/5] writeback: prevent sync livelock with the sync_after timestamp Wu Fengguang
@ 2010-07-29 11:51 ` Wu Fengguang
  2010-07-29 11:51 ` [PATCH 5/5] vmscan: transfer async file writeback to the flusher Wu Fengguang
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 30+ messages in thread
From: Wu Fengguang @ 2010-07-29 11:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wu Fengguang, LKML, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin,
	Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrea Arcangeli, Mel Gorman,
	Minchan Kim

[-- Attachment #1: writeback-bdi_start_inode_writeback.patch --]
[-- Type: text/plain, Size: 5350 bytes --]

This is to transfer dirty pages encountered in page reclaim to the
flusher threads for writeback.

The flusher will piggyback more dirty pages for IO, yielding more
efficient IO.

To avoid memory allocations in the page reclaim path, a mempool is
created.
TODO: more adaptive mempool size.

Background works will be kicked to clean the pages under reclaim ASAP.
TODO: sync_works is temporarily reused for convenience.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c           |   69 ++++++++++++++++++++++++++++++++--
 include/linux/backing-dev.h |    1 
 2 files changed, 66 insertions(+), 4 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2010-07-29 17:13:58.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-07-29 17:49:05.000000000 +0800
@@ -35,12 +35,15 @@
 struct wb_writeback_work {
 	long nr_pages;
 	struct super_block *sb;
+	struct inode *inode;
+	pgoff_t offset;
 	enum writeback_sync_modes sync_mode;
 	unsigned long sync_after;
 	unsigned int for_sync:1;
 	unsigned int for_kupdate:1;
 	unsigned int range_cyclic:1;
 	unsigned int for_background:1;
+	unsigned int for_reclaim:1;
 
 	struct list_head list;		/* pending work list */
 	struct completion *done;	/* set if the caller waits */
@@ -61,6 +64,27 @@ struct wb_writeback_work {
  */
 int nr_pdflush_threads;
 
+static mempool_t *wb_work_mempool;
+
+static void *wb_work_alloc(gfp_t gfp_mask, void *pool_data)
+{
+	/*
+	 * bdi_start_inode_writeback() may be called on page reclaim
+	 */
+	if (current->flags & PF_MEMALLOC)
+		return NULL;
+
+	return kmalloc(sizeof(struct wb_writeback_work), gfp_mask);
+}
+
+static __init int wb_work_init(void)
+{
+	wb_work_mempool = mempool_create(10240, /* XXX: better number */
+					 wb_work_alloc, mempool_kfree, NULL);
+	return wb_work_mempool ? 0 : -ENOMEM;
+}
+fs_initcall(wb_work_init);
+
 /**
  * writeback_in_progress - determine whether there is writeback in progress
  * @bdi: the device's backing_dev_info structure.
@@ -80,7 +104,7 @@ static void bdi_queue_work(struct backin
 
 	spin_lock(&bdi->wb_lock);
 	list_add_tail(&work->list, &bdi->work_list);
-	if (work->for_sync)
+	if (work->for_sync || work->for_reclaim)
 		atomic_inc(&bdi->wb.sync_works);
 	spin_unlock(&bdi->wb_lock);
 
@@ -109,7 +133,7 @@ __bdi_start_writeback(struct backing_dev
 	 * This is WB_SYNC_NONE writeback, so if allocation fails just
 	 * wakeup the thread for old dirty data writeback
 	 */
-	work = kzalloc(sizeof(*work), GFP_ATOMIC);
+	work = mempool_alloc(wb_work_mempool, GFP_NOWAIT);
 	if (!work) {
 		if (bdi->wb.task) {
 			trace_writeback_nowork(bdi);
@@ -118,6 +142,7 @@ __bdi_start_writeback(struct backing_dev
 		return;
 	}
 
+	memset(work, 0, sizeof(*work));
 	work->sync_mode	= WB_SYNC_NONE;
 	work->nr_pages	= nr_pages;
 	work->range_cyclic = range_cyclic;
@@ -156,6 +181,26 @@ void bdi_start_background_writeback(stru
 	__bdi_start_writeback(bdi, LONG_MAX, true, true);
 }
 
+void bdi_start_inode_writeback(struct inode *inode, pgoff_t offset)
+{
+	struct wb_writeback_work *work;
+
+	if (!igrab(inode))
+		return;
+
+	work = mempool_alloc(wb_work_mempool, GFP_NOWAIT);
+	if (!work) {
+		/* don't leak the reference taken by igrab() */
+		iput(inode);
+		return;
+	}
+
+	memset(work, 0, sizeof(*work));
+	work->sync_mode		= WB_SYNC_NONE;
+	work->inode		= inode;
+	work->offset		= offset;
+	work->for_reclaim	= 1;
+
+	bdi_queue_work(inode->i_sb->s_bdi, work);
+}
+
 /*
  * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
  * furthest end of its superblock's dirty-inode list.
@@ -618,6 +663,22 @@ static long wb_writeback(struct bdi_writ
 	long wrote = 0;
 	struct inode *inode;
 
+	if (work->for_reclaim) {
+		struct page *page = find_get_page(work->inode->i_mapping,
+						  work->offset);
+		/* __filemap_fdatawrite_range() takes byte offsets */
+		wrote = __filemap_fdatawrite_range( /* XXX: write around */
+					work->inode->i_mapping,
+					(loff_t)work->offset << PAGE_CACHE_SHIFT,
+					((loff_t)(work->offset + MAX_WRITEBACK_PAGES)
+						<< PAGE_CACHE_SHIFT) - 1,
+					WB_SYNC_NONE);
+		if (page && PageWriteback(page))
+			SetPageReclaim(page);
+		if (page)
+			page_cache_release(page);
+		iput(work->inode);
+		return wrote;
+	}
+
 	if (!wbc.range_cyclic) {
 		wbc.range_start = 0;
 		wbc.range_end = LLONG_MAX;
@@ -771,7 +832,7 @@ long wb_do_writeback(struct bdi_writebac
 
 		wrote += wb_writeback(wb, work);
 
-		if (work->for_sync)
+		if (work->for_sync || work->for_reclaim)
 			atomic_dec(&wb->sync_works);
 
 		/*
@@ -781,7 +842,7 @@ long wb_do_writeback(struct bdi_writebac
 		if (work->done)
 			complete(work->done);
 		else
-			kfree(work);
+			mempool_free(work, wb_work_mempool);
 	}
 
 	/*
--- linux-next.orig/include/linux/backing-dev.h	2010-07-29 17:13:31.000000000 +0800
+++ linux-next/include/linux/backing-dev.h	2010-07-29 17:47:58.000000000 +0800
@@ -108,6 +108,7 @@ void bdi_unregister(struct backing_dev_i
 int bdi_setup_and_register(struct backing_dev_info *, char *, unsigned int);
 void bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages);
 void bdi_start_background_writeback(struct backing_dev_info *bdi);
+void bdi_start_inode_writeback(struct inode *inode, pgoff_t offset);
 int bdi_writeback_thread(void *data);
 int bdi_has_dirty_io(struct backing_dev_info *bdi);
 void bdi_arm_supers_timer(void);



^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 5/5] vmscan: transfer async file writeback to the flusher
  2010-07-29 11:51 [PATCH 0/5] [RFC] transfer ASYNC vmscan writeback IO to the flusher threads Wu Fengguang
                   ` (3 preceding siblings ...)
  2010-07-29 11:51 ` [PATCH 4/5] writeback: introduce bdi_start_inode_writeback() Wu Fengguang
@ 2010-07-29 11:51 ` Wu Fengguang
  2010-07-29 16:09 ` [PATCH 0/5] [RFC] transfer ASYNC vmscan writeback IO to the flusher threads Jan Kara
  2010-07-29 23:23 ` Dave Chinner
  6 siblings, 0 replies; 30+ messages in thread
From: Wu Fengguang @ 2010-07-29 11:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wu Fengguang, LKML, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin,
	Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrea Arcangeli, Mel Gorman,
	Minchan Kim

[-- Attachment #1: vmscan-writeback-inode-page.patch --]
[-- Type: text/plain, Size: 1349 bytes --]

This relays all ASYNC file writeback IOs to the flusher threads.
The less frequent SYNC pageout()s will work as before (as a last resort).

It's a minimal prototype implementation that barely runs without
panicking. It will likely need a lot more work to become stable.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/vmscan.c |    8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

--- linux-next.orig/mm/vmscan.c	2010-07-29 17:07:07.000000000 +0800
+++ linux-next/mm/vmscan.c	2010-07-29 17:09:16.000000000 +0800
@@ -379,6 +379,13 @@ static pageout_t pageout(struct page *pa
 	}
 	if (mapping->a_ops->writepage == NULL)
 		return PAGE_ACTIVATE;
+
+	if (sync_writeback == PAGEOUT_IO_ASYNC &&
+	    page_is_file_cache(page)) {
+		bdi_start_inode_writeback(mapping->host, page->index);
+		return PAGE_KEEP;
+	}
+
 	if (!may_write_to_queue(mapping->backing_dev_info))
 		return PAGE_KEEP;
 
@@ -1366,7 +1373,6 @@ shrink_inactive_list(unsigned long nr_to
 				list_add(&page->lru, &putback_list);
 			}
 
-			wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty);
 			congestion_wait(BLK_RW_ASYNC, HZ/10);
 
 			/*



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/5] writeback: prevent sync livelock with the sync_after timestamp
  2010-07-29 11:51 ` [PATCH 3/5] writeback: prevent sync livelock with the sync_after timestamp Wu Fengguang
@ 2010-07-29 15:02   ` Jan Kara
  2010-07-30  5:17     ` Wu Fengguang
  0 siblings, 1 reply; 30+ messages in thread
From: Jan Kara @ 2010-07-29 15:02 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, LKML, Jan Kara, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin,
	Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrea Arcangeli, Mel Gorman,
	Minchan Kim

  Hi Fengguang,

On Thu 29-07-10 19:51:45, Wu Fengguang wrote:
> The start time in writeback_inodes_wb() is not very useful because it
> slips forward at each invocation. Preferably one _constant_ timestamp
> should be used from the beginning to cover the whole sync() work.
> 
> The newly dirtied inodes are now guarded at queue_io() time instead of
> at b_io walk time. This is more natural: non-empty b_io/b_more_io
> means "more work pending".
> 
> The timestamp is now grabbed at the sync work submission time, and may
> be further optimized to the initial sync() call time.
  The patch seems to have some issues...

> +	if (wbc->for_sync) {
  For example, this is never set. You only set work->for_sync.

> +		expire_interval = 1;
> +		older_than_this = wbc->sync_after;
  And sync_after is never set either???

> -	if (!(wbc->for_kupdate || wbc->for_background) || list_empty(&wb->b_io))
> +	if (list_empty(&wb->b_io))
>  		queue_io(wb, wbc);
  And what is the purpose of this? It looks like an unrelated change to me.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/5] writeback: introduce wbc.for_sync to cover the two sync stages
  2010-07-29 11:51 ` [PATCH 1/5] writeback: introduce wbc.for_sync to cover the two sync stages Wu Fengguang
@ 2010-07-29 15:04   ` Jan Kara
  2010-07-30  5:10     ` Wu Fengguang
  0 siblings, 1 reply; 30+ messages in thread
From: Jan Kara @ 2010-07-29 15:04 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, LKML, Jan Kara, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin,
	Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrea Arcangeli, Mel Gorman,
	Minchan Kim

On Thu 29-07-10 19:51:43, Wu Fengguang wrote:
> sync() is performed in two stages: the WB_SYNC_NONE sync and the
> WB_SYNC_ALL sync. It is necessary to tag both stages with wbc.for_sync,
> so as to prevent either of them from being livelocked.
> 
> The basic livelock scheme will be based on the sync_after timestamp:
> inodes dirtied after it won't be queued for IO. The timestamp could be
> recorded as early as the sync() call; this patch lazily sets it in
> writeback_inodes_sb()/sync_inodes_sb(). This stops the livelock, but
> may do more work than necessary.
> 
> Note that writeback_inodes_sb() is not called only by sync(); all
> callers are treated the same because they need the same livelock
> prevention.
  OK, but the patch does nothing by itself, does it? I'd prefer it if the
fields you introduce were actually used in this patch.

								Honza

> CC: Jan Kara <jack@suse.cz>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/fs-writeback.c         |   21 ++++++++++++---------
>  include/linux/writeback.h |    1 +
>  2 files changed, 13 insertions(+), 9 deletions(-)
> 
> --- linux-next.orig/fs/fs-writeback.c	2010-07-28 17:05:17.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2010-07-28 21:21:31.000000000 +0800
> @@ -36,6 +36,8 @@ struct wb_writeback_work {
>  	long nr_pages;
>  	struct super_block *sb;
>  	enum writeback_sync_modes sync_mode;
> +	unsigned long sync_after;
> +	unsigned int for_sync:1;
>  	unsigned int for_kupdate:1;
>  	unsigned int range_cyclic:1;
>  	unsigned int for_background:1;
> @@ -1086,20 +1090,17 @@ static void wait_sb_inodes(struct super_
>   */
>  void writeback_inodes_sb(struct super_block *sb)
>  {
> -	unsigned long nr_dirty = global_page_state(NR_FILE_DIRTY);
> -	unsigned long nr_unstable = global_page_state(NR_UNSTABLE_NFS);
>  	DECLARE_COMPLETION_ONSTACK(done);
>  	struct wb_writeback_work work = {
>  		.sb		= sb,
>  		.sync_mode	= WB_SYNC_NONE,
> +		.for_sync	= 1,
> +		.sync_after	= jiffies,
>  		.done		= &done,
>  	};
>  
>  	WARN_ON(!rwsem_is_locked(&sb->s_umount));
>  
> -	work.nr_pages = nr_dirty + nr_unstable +
> -			(inodes_stat.nr_inodes - inodes_stat.nr_unused);
> -
>  	bdi_queue_work(sb->s_bdi, &work);
>  	wait_for_completion(&done);
>  }
> @@ -1137,6 +1138,8 @@ void sync_inodes_sb(struct super_block *
>  	struct wb_writeback_work work = {
>  		.sb		= sb,
>  		.sync_mode	= WB_SYNC_ALL,
> +		.for_sync	= 1,
> +		.sync_after	= jiffies,
>  		.nr_pages	= LONG_MAX,
>  		.range_cyclic	= 0,
>  		.done		= &done,
> --- linux-next.orig/include/linux/writeback.h	2010-07-28 17:05:17.000000000 +0800
> +++ linux-next/include/linux/writeback.h	2010-07-28 21:24:54.000000000 +0800
> @@ -48,6 +48,7 @@ struct writeback_control {
>  	unsigned encountered_congestion:1; /* An output: a queue is full */
>  	unsigned for_kupdate:1;		/* A kupdate writeback */
>  	unsigned for_background:1;	/* A background writeback */
> +	unsigned for_sync:1;		/* A writeback for sync */
>  	unsigned for_reclaim:1;		/* Invoked from the page allocator */
>  	unsigned range_cyclic:1;	/* range_start is cyclic */
>  };
> 
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/5]  [RFC] transfer ASYNC vmscan writeback IO to the flusher threads
  2010-07-29 11:51 [PATCH 0/5] [RFC] transfer ASYNC vmscan writeback IO to the flusher threads Wu Fengguang
                   ` (4 preceding siblings ...)
  2010-07-29 11:51 ` [PATCH 5/5] vmscan: transfer async file writeback to the flusher Wu Fengguang
@ 2010-07-29 16:09 ` Jan Kara
  2010-07-30  5:34   ` Wu Fengguang
  2010-07-29 23:23 ` Dave Chinner
  6 siblings, 1 reply; 30+ messages in thread
From: Jan Kara @ 2010-07-29 16:09 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, LKML, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin,
	Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrea Arcangeli, Mel Gorman,
	Minchan Kim

On Thu 29-07-10 19:51:42, Wu Fengguang wrote:
> Andrew,
> 
> It's possible to transfer ASYNC vmscan writeback IOs to the flusher threads.
> This simple patchset shows the basic idea. Since it's a big behavior change,
> there are inevitably lots of details to sort out. I don't know where it will
> go after tests and discussions, so the patches are intentionally kept simple.
> 
> sync livelock avoidance (more is needed to be complete, but this is the minimum required for the last two patches)
> 	[PATCH 1/5] writeback: introduce wbc.for_sync to cover the two sync stages
> 	[PATCH 2/5] writeback: stop periodic/background work on seeing sync works
> 	[PATCH 3/5] writeback: prevent sync livelock with the sync_after timestamp
  Well, essentially any WB_SYNC_NONE writeback is still livelockable if you
just grow a file constantly. So your changes are a step in the right
direction but won't fix the issue completely. What we could do to fix the
issue completely would be to set wbc->nr_to_write to LONG_MAX before
writing an inode for sync, and use my page-tagging livelock avoidance for
this case (it wouldn't have the possible performance issue because we are
going to write the whole inode anyway).
  I can write the patch but frankly there are so many patches floating
around that I'm not sure what I should base it on...
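
  A rough sketch of the page-tagging idea (PAGECACHE_TAG_TOWRITE is an
assumed tag name here, and locking/error corner cases are elided; this is
a sketch, not the actual patch):

	/*
	 * Pass 1: snapshot - tag every page that is dirty right now.
	 * Pass 2: write back only tagged pages, so pages dirtied during
	 * the walk are skipped and a constantly growing file cannot
	 * livelock the sync.
	 */
	static void tag_and_write_pages(struct address_space *mapping)
	{
		struct pagevec pvec;
		pgoff_t index = 0;
		int i;

		pagevec_init(&pvec, 0);
		while (pagevec_lookup_tag(&pvec, mapping, &index,
					  PAGECACHE_TAG_DIRTY, PAGEVEC_SIZE)) {
			spin_lock_irq(&mapping->tree_lock);
			for (i = 0; i < pagevec_count(&pvec); i++)
				radix_tree_tag_set(&mapping->page_tree,
						   pvec.pages[i]->index,
						   PAGECACHE_TAG_TOWRITE);
			spin_unlock_irq(&mapping->tree_lock);
			pagevec_release(&pvec);
		}

		index = 0;
		while (pagevec_lookup_tag(&pvec, mapping, &index,
					  PAGECACHE_TAG_TOWRITE, PAGEVEC_SIZE)) {
			for (i = 0; i < pagevec_count(&pvec); i++) {
				/* write_one_page() wants the page locked;
				 * clearing TOWRITE on writeout is elided */
				lock_page(pvec.pages[i]);
				write_one_page(pvec.pages[i], 0);
			}
			pagevec_release(&pvec);
		}
	}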

								Honza

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/5] writeback: stop periodic/background work on seeing sync works
  2010-07-29 11:51 ` [PATCH 2/5] writeback: stop periodic/background work on seeing sync works Wu Fengguang
@ 2010-07-29 16:20   ` Jan Kara
  2010-07-30  4:03     ` Wu Fengguang
  0 siblings, 1 reply; 30+ messages in thread
From: Jan Kara @ 2010-07-29 16:20 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, LKML, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin,
	Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrea Arcangeli, Mel Gorman,
	Minchan Kim

On Thu 29-07-10 19:51:44, Wu Fengguang wrote:
> The periodic/background writeback can run forever. So when any
> sync work is enqueued, increase bdi->sync_works to notify the
> active non-sync works to exit. Non-sync works queued after sync
> works won't be affected.
  Hmm, wouldn't it be simpler to just make the for_kupdate and
for_background works always yield when there's some other work to do
(they are livelockable by the very definition of their targets) and make
sure any other work isn't livelockable? The only downside is that
non-livelockable work cannot be "fair" in the sense that we cannot switch
inodes after writing MAX_WRITEBACK_PAGES.
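
  A minimal sketch of that yield check, assuming it sits in the
wb_writeback() loop (my reading of the proposal, not an actual patch):

	/* livelockable works give way to whatever is queued behind them */
	if ((work->for_background || work->for_kupdate) &&
	    !list_empty(&wb->bdi->work_list))
		break;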
  I even had a patch for this but it's already outdated by now. But I
can refresh it if we decide this is the way to go.

								Honza

> 
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/fs-writeback.c           |   13 +++++++++++++
>  include/linux/backing-dev.h |    6 ++++++
>  mm/backing-dev.c            |    1 +
>  3 files changed, 20 insertions(+)
> 
> --- linux-next.orig/fs/fs-writeback.c	2010-07-29 17:13:23.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2010-07-29 17:13:49.000000000 +0800
> @@ -80,6 +80,8 @@ static void bdi_queue_work(struct backin
>  
>  	spin_lock(&bdi->wb_lock);
>  	list_add_tail(&work->list, &bdi->work_list);
> +	if (work->for_sync)
> +		atomic_inc(&bdi->wb.sync_works);
>  	spin_unlock(&bdi->wb_lock);
>  
>  	/*
> @@ -633,6 +635,14 @@ static long wb_writeback(struct bdi_writ
>  			break;
>  
>  		/*
> +		 * background/periodic works can run forever; abort on
> +		 * seeing any pending sync work to avoid livelocking it.
> +		 */
> +		if (atomic_read(&wb->sync_works) &&
> +		    (work->for_background || work->for_kupdate))
> +			break;
> +
> +		/*
>  		 * For background writeout, stop when we are below the
>  		 * background dirty threshold
>  		 */
> @@ -765,6 +775,9 @@ long wb_do_writeback(struct bdi_writebac
>  
>  		wrote += wb_writeback(wb, work);
>  
> +		if (work->for_sync)
> +			atomic_dec(&wb->sync_works);
> +
>  		/*
>  		 * Notify the caller of completion if this is a synchronous
>  		 * work item, otherwise just free it.
> --- linux-next.orig/include/linux/backing-dev.h	2010-07-29 17:13:23.000000000 +0800
> +++ linux-next/include/linux/backing-dev.h	2010-07-29 17:13:31.000000000 +0800
> @@ -50,6 +50,12 @@ struct bdi_writeback {
>  
>  	unsigned long last_old_flush;		/* last old data flush */
>  
> +	/*
> +	 * sync works queued, background works shall abort on seeing this,
> +	 * to prevent livelocking the sync works
> +	 */
> +	atomic_t sync_works;
> +
>  	struct task_struct	*task;		/* writeback task */
>  	struct list_head	b_dirty;	/* dirty inodes */
>  	struct list_head	b_io;		/* parked for writeback */
> --- linux-next.orig/mm/backing-dev.c	2010-07-29 17:13:23.000000000 +0800
> +++ linux-next/mm/backing-dev.c	2010-07-29 17:13:31.000000000 +0800
> @@ -257,6 +257,7 @@ static void bdi_wb_init(struct bdi_write
>  
>  	wb->bdi = bdi;
>  	wb->last_old_flush = jiffies;
> +	atomic_set(&wb->sync_works, 0);
>  	INIT_LIST_HEAD(&wb->b_dirty);
>  	INIT_LIST_HEAD(&wb->b_io);
>  	INIT_LIST_HEAD(&wb->b_more_io);
> 
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/5]  [RFC] transfer ASYNC vmscan writeback IO to the flusher threads
  2010-07-29 11:51 [PATCH 0/5] [RFC] transfer ASYNC vmscan writeback IO to the flusher threads Wu Fengguang
                   ` (5 preceding siblings ...)
  2010-07-29 16:09 ` [PATCH 0/5] [RFC] transfer ASYNC vmscan writeback IO to the flusher threads Jan Kara
@ 2010-07-29 23:23 ` Dave Chinner
  2010-07-30  7:58   ` Wu Fengguang
  6 siblings, 1 reply; 30+ messages in thread
From: Dave Chinner @ 2010-07-29 23:23 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, LKML, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Andrea Arcangeli, Mel Gorman, Minchan Kim

On Thu, Jul 29, 2010 at 07:51:42PM +0800, Wu Fengguang wrote:
> Andrew,
> 
> It's possible to transfer ASYNC vmscan writeback IOs to the flusher threads.
> This simple patchset shows the basic idea. Since it's a big behavior change,
> there are inevitably lots of details to sort out. I don't know where it will
> go after tests and discussions, so the patches are intentionally kept simple.
> 
> sync livelock avoidance (more is needed to be complete, but this is the minimum required for the last two patches)
> 	[PATCH 1/5] writeback: introduce wbc.for_sync to cover the two sync stages
> 	[PATCH 2/5] writeback: stop periodic/background work on seeing sync works
> 	[PATCH 3/5] writeback: prevent sync livelock with the sync_after timestamp
> 
> let the flusher threads do ASYNC writeback for pageout()
> 	[PATCH 4/5] writeback: introduce bdi_start_inode_writeback()
> 	[PATCH 5/5] vmscan: transfer async file writeback to the flusher

I really do not like this - all it does is transfer random page writeback
from vmscan to the flusher threads rather than avoiding random page
writeback altogether. Random page writeback is nasty - just say no.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/5] writeback: stop periodic/background work on seeing sync works
  2010-07-29 16:20   ` Jan Kara
@ 2010-07-30  4:03     ` Wu Fengguang
  2010-08-02 20:51       ` Jan Kara
  0 siblings, 1 reply; 30+ messages in thread
From: Wu Fengguang @ 2010-07-30  4:03 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, LKML, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin,
	Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrea Arcangeli, Mel Gorman,
	Minchan Kim

On Fri, Jul 30, 2010 at 12:20:27AM +0800, Jan Kara wrote:
> On Thu 29-07-10 19:51:44, Wu Fengguang wrote:
> > The periodic/background writeback can run forever. So when any
> > sync work is enqueued, increase bdi->sync_works to notify the
> > active non-sync works to exit. Non-sync works queued after sync
> > works won't be affected.
>   Hmm, wouldn't it be simpler to just make the for_kupdate and
> for_background works always yield when there's some other work to do
> (they are livelockable by the very definition of their targets) and
> make sure any other work isn't livelockable?

Good idea!

> The only downside is that
> non-livelockable work cannot be "fair" in the sense that we cannot switch
> inodes after writing MAX_WRITEBACK_PAGES.

Cannot switch inodes _before_ finishing the current
MAX_WRITEBACK_PAGES batch?

>   I even had a patch for this but it's already outdated by now. But I
> can refresh it if we decide this is the way to go.

I'm very interested in your old patch; would you post it? Let's see
which one is easier to work with :)

Thanks,
Fengguang

> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> >  fs/fs-writeback.c           |   13 +++++++++++++
> >  include/linux/backing-dev.h |    6 ++++++
> >  mm/backing-dev.c            |    1 +
> >  3 files changed, 20 insertions(+)
> > 
> > --- linux-next.orig/fs/fs-writeback.c	2010-07-29 17:13:23.000000000 +0800
> > +++ linux-next/fs/fs-writeback.c	2010-07-29 17:13:49.000000000 +0800
> > @@ -80,6 +80,8 @@ static void bdi_queue_work(struct backin
> >  
> >  	spin_lock(&bdi->wb_lock);
> >  	list_add_tail(&work->list, &bdi->work_list);
> > +	if (work->for_sync)
> > +		atomic_inc(&bdi->wb.sync_works);
> >  	spin_unlock(&bdi->wb_lock);
> >  
> >  	/*
> > @@ -633,6 +635,14 @@ static long wb_writeback(struct bdi_writ
> >  			break;
> >  
> >  		/*
> > +		 * background/periodic works can run forever; abort on
> > +		 * seeing any pending sync work to avoid livelocking it.
> > +		 */
> > +		if (atomic_read(&wb->sync_works) &&
> > +		    (work->for_background || work->for_kupdate))
> > +			break;
> > +
> > +		/*
> >  		 * For background writeout, stop when we are below the
> >  		 * background dirty threshold
> >  		 */
> > @@ -765,6 +775,9 @@ long wb_do_writeback(struct bdi_writebac
> >  
> >  		wrote += wb_writeback(wb, work);
> >  
> > +		if (work->for_sync)
> > +			atomic_dec(&wb->sync_works);
> > +
> >  		/*
> >  		 * Notify the caller of completion if this is a synchronous
> >  		 * work item, otherwise just free it.
> > --- linux-next.orig/include/linux/backing-dev.h	2010-07-29 17:13:23.000000000 +0800
> > +++ linux-next/include/linux/backing-dev.h	2010-07-29 17:13:31.000000000 +0800
> > @@ -50,6 +50,12 @@ struct bdi_writeback {
> >  
> >  	unsigned long last_old_flush;		/* last old data flush */
> >  
> > +	/*
> > +	 * sync works queued, background works shall abort on seeing this,
> > +	 * to prevent livelocking the sync works
> > +	 */
> > +	atomic_t sync_works;
> > +
> >  	struct task_struct	*task;		/* writeback task */
> >  	struct list_head	b_dirty;	/* dirty inodes */
> >  	struct list_head	b_io;		/* parked for writeback */
> > --- linux-next.orig/mm/backing-dev.c	2010-07-29 17:13:23.000000000 +0800
> > +++ linux-next/mm/backing-dev.c	2010-07-29 17:13:31.000000000 +0800
> > @@ -257,6 +257,7 @@ static void bdi_wb_init(struct bdi_write
> >  
> >  	wb->bdi = bdi;
> >  	wb->last_old_flush = jiffies;
> > +	atomic_set(&wb->sync_works, 0);
> >  	INIT_LIST_HEAD(&wb->b_dirty);
> >  	INIT_LIST_HEAD(&wb->b_io);
> >  	INIT_LIST_HEAD(&wb->b_more_io);
> > 
> > 
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/5] writeback: introduce wbc.for_sync to cover the two sync stages
  2010-07-29 15:04   ` Jan Kara
@ 2010-07-30  5:10     ` Wu Fengguang
  0 siblings, 0 replies; 30+ messages in thread
From: Wu Fengguang @ 2010-07-30  5:10 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, LKML, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin,
	Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrea Arcangeli, Mel Gorman,
	Minchan Kim

On Thu, Jul 29, 2010 at 11:04:13PM +0800, Jan Kara wrote:
> On Thu 29-07-10 19:51:43, Wu Fengguang wrote:
> > sync() is performed in two stages: the WB_SYNC_NONE sync and the
> > WB_SYNC_ALL sync. It is necessary to tag both stages with wbc.for_sync,
> > so as to prevent either of them from being livelocked.
> > 
> > The basic livelock scheme will be based on the sync_after timestamp:
> > inodes dirtied after it won't be queued for IO. The timestamp could be
> > recorded as early as the sync() call; this patch lazily sets it in
> > writeback_inodes_sb()/sync_inodes_sb(). This stops the livelock, but
> > may do more work than necessary.
> > 
> > Note that writeback_inodes_sb() is not called only by sync(); all
> > callers are treated the same because they need the same livelock
> > prevention.

>   OK, but the patch does nothing by itself, does it? I'd prefer it if the
> fields you introduce were actually used in this patch.

OK, I'll merge it with the third patch.

Thanks,
Fengguang


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/5] writeback: prevent sync livelock with the sync_after timestamp
  2010-07-29 15:02   ` Jan Kara
@ 2010-07-30  5:17     ` Wu Fengguang
  0 siblings, 0 replies; 30+ messages in thread
From: Wu Fengguang @ 2010-07-30  5:17 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, LKML, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin,
	Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrea Arcangeli, Mel Gorman,
	Minchan Kim

On Thu, Jul 29, 2010 at 11:02:41PM +0800, Jan Kara wrote:
>   Hi Fengguang,
> 
> On Thu 29-07-10 19:51:45, Wu Fengguang wrote:
> > The start time in writeback_inodes_wb() is not very useful because it
> > slips forward at each invocation. Preferably one _constant_ timestamp
> > should be used from the beginning to cover the whole sync() work.
> > 
> > The newly dirtied inodes are now guarded at queue_io() time instead of
> > at b_io walk time. This is more natural: non-empty b_io/b_more_io
> > means "more work pending".
> > 
> > The timestamp is now grabbed at the sync work submission time, and may
> > be further optimized to the initial sync() call time.
>   The patch seems to have some issues...
> 
> > +	if (wbc->for_sync) {
>   For example, this is never set. You only set work->for_sync.

Ah right.

> > +		expire_interval = 1;
> > +		older_than_this = wbc->sync_after;
>   And sync_after is never set either???

Sorry, I must have lost some chunk when rebasing the patch...

> > -	if (!(wbc->for_kupdate || wbc->for_background) || list_empty(&wb->b_io))
> > +	if (list_empty(&wb->b_io))
> >  		queue_io(wb, wbc);
>   And what is the purpose of this? It looks as an unrelated change to me.

Yes, it's not tightly related. It may be simpler to do

 -	if (!wbc->for_kupdate || list_empty(&wb->b_io))
 +	if (list_empty(&wb->b_io))

in the previous patch "writeback: sync expired inodes first in
background writeback".

Thanks,
Fengguang


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/5]  [RFC] transfer ASYNC vmscan writeback IO to the flusher threads
  2010-07-29 16:09 ` [PATCH 0/5] [RFC] transfer ASYNC vmscan writeback IO to the flusher threads Jan Kara
@ 2010-07-30  5:34   ` Wu Fengguang
  0 siblings, 0 replies; 30+ messages in thread
From: Wu Fengguang @ 2010-07-30  5:34 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, LKML, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin,
	Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrea Arcangeli, Mel Gorman,
	Minchan Kim

On Fri, Jul 30, 2010 at 12:09:47AM +0800, Jan Kara wrote:
> On Thu 29-07-10 19:51:42, Wu Fengguang wrote:
> > Andrew,
> > 
> > It's possible to transfer ASYNC vmscan writeback IOs to the flusher threads.
> > This simple patchset shows the basic idea. Since it's a big behavior change,
> > there are inevitably lots of details to sort out. I don't know where it will
> > go after tests and discussions, so the patches are intentionally kept simple.
> > 
> > sync livelock avoidance (more is needed to be complete, but this is the minimum required for the last two patches)
> > 	[PATCH 1/5] writeback: introduce wbc.for_sync to cover the two sync stages
> > 	[PATCH 2/5] writeback: stop periodic/background work on seeing sync works
> > 	[PATCH 3/5] writeback: prevent sync livelock with the sync_after timestamp
>   Well, essentially any WB_SYNC_NONE writeback is still livelockable if you
> just grow a file constantly. So your changes are a step in the right
> direction but won't fix the issue completely.

Right. We have complementary patches to prevent livelocks both inside
a file and among files.

> What we could do to fix the issue completely would be to set
> wbc->nr_to_write to LONG_MAX before writing an inode for sync, and use
> my page-tagging livelock avoidance for this case (it wouldn't have the
> possible performance issue because we are going to write the whole
> inode anyway).

Yeah, your patches are good for avoiding livelock on a single busy file.
I didn't forget them :)

>   I can write the patch but frankly there are so many patches floating
> around that I'm not sure what I should base it on...

I'm confused too. It may take some time to quiet down...

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/5]  [RFC] transfer ASYNC vmscan writeback IO to the flusher threads
  2010-07-29 23:23 ` Dave Chinner
@ 2010-07-30  7:58   ` Wu Fengguang
  2010-07-30  9:22     ` KOSAKI Motohiro
  2010-07-30 11:12     ` Dave Chinner
  0 siblings, 2 replies; 30+ messages in thread
From: Wu Fengguang @ 2010-07-30  7:58 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrew Morton, LKML, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Andrea Arcangeli, Mel Gorman, Minchan Kim

On Fri, Jul 30, 2010 at 07:23:30AM +0800, Dave Chinner wrote:
> On Thu, Jul 29, 2010 at 07:51:42PM +0800, Wu Fengguang wrote:
> > Andrew,
> > 
> > It's possible to transfer ASYNC vmscan writeback IOs to the flusher threads.
> > This simple patchset shows the basic idea. Since it's a big behavior change,
> > there are inevitably lots of details to sort out. I don't know where it will
> > go after tests and discussions, so the patches are intentionally kept simple.
> > 
> > sync livelock avoidance (more is needed to be complete, but this is the minimum required for the last two patches)
> > 	[PATCH 1/5] writeback: introduce wbc.for_sync to cover the two sync stages
> > 	[PATCH 2/5] writeback: stop periodic/background work on seeing sync works
> > 	[PATCH 3/5] writeback: prevent sync livelock with the sync_after timestamp
> > 
> > let the flusher threads do ASYNC writeback for pageout()
> > 	[PATCH 4/5] writeback: introduce bdi_start_inode_writeback()
> > 	[PATCH 5/5] vmscan: transfer async file writeback to the flusher
> 
> I really do not like this - all it does is transfer random page writeback
> from vmscan to the flusher threads rather than avoiding random page
> writeback altogether. Random page writeback is nasty - just say no.

There are cases where we have to do pageout().

- a stressed memcg with lots of dirty pages
- a large NUMA system whose nodes have unbalanced vmscan rate and dirty pages

In the above cases, the whole system may not be that stressed,
except for some local LRU list being busily scanned.  If the local
memory stress leads to lots of pageout() calls, it could bring down the
whole system by congesting the disks with many small seeky IOs.

It may be overkill to push global writeback (ie. it's silly to sync
1GB of dirty data because there is a small stressed 100MB LRU list).
The obvious solution is to keep the pageout() calls but make them more
IO efficient by doing write-around at the same time.  The write-around
pages will likely be in the same stressed LRU list, hence will help
page reclaim as well.
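
A minimal sketch of such a write-around (the window placement is an
arbitrary assumption; note that __filemap_fdatawrite_range() takes byte
offsets, not page indexes):

	/* clean a chunk of pages around the reclaim target so the
	 * resulting IO is sequential instead of single-page seeky */
	static int reclaim_write_around(struct address_space *mapping,
					pgoff_t offset)
	{
		pgoff_t start = offset > MAX_WRITEBACK_PAGES / 2 ?
				offset - MAX_WRITEBACK_PAGES / 2 : 0;

		return __filemap_fdatawrite_range(mapping,
				(loff_t)start << PAGE_CACHE_SHIFT,
				((loff_t)(start + MAX_WRITEBACK_PAGES)
					<< PAGE_CACHE_SHIFT) - 1,
				WB_SYNC_NONE);
	}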

Transferring ASYNC work to the flushers helps the kswapd-vs-flusher
priority problem too. Currently kswapd/direct reclaim either has to
skip dirty pages on congestion, or risk being blocked in
get_request_wait(); neither is a good option. However, the use of
bdi_start_inode_writeback() does call for a good vmscan throttling
scheme to prevent falsely OOMing before the flusher is able to clean
the transferred pages. This would be tricky.

If the system is globally memory stressed and runs into pageout(), we
can safely kick the flusher threads for more writeback. There are 3
possible schemes:

- to kick writeback for N pages, e.g. the existing wakeup_flusher_threads() calls

- to lower dirty_expire_interval, e.g. to enqueue the current inode
  (the one that contains the dirty page being pageout()ed) _plus_ all
  older inodes for writeback. This can be done when servicing the
  for_reclaim writeback work (see the sketch after this list).

- to lower dirty throttle limit (trying to find a criterion...)
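
A rough sketch of the second scheme (assumption-heavy; "work" being the
for_reclaim work item under service):

	/* expire every inode at least as old as the one under reclaim,
	 * so the flusher cleans them all in one pass */
	expire_interval = 1;
	older_than_this = work->inode->dirtied_when;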

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/5]  [RFC] transfer ASYNC vmscan writeback IO to the flusher threads
  2010-07-30  7:58   ` Wu Fengguang
@ 2010-07-30  9:22     ` KOSAKI Motohiro
  2010-07-30 12:25       ` Wu Fengguang
  2010-07-30 11:12     ` Dave Chinner
  1 sibling, 1 reply; 30+ messages in thread
From: KOSAKI Motohiro @ 2010-07-30  9:22 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: kosaki.motohiro, Dave Chinner, Andrew Morton, LKML,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, Andrea Arcangeli, Mel Gorman, Minchan Kim

> On Fri, Jul 30, 2010 at 07:23:30AM +0800, Dave Chinner wrote:
> > On Thu, Jul 29, 2010 at 07:51:42PM +0800, Wu Fengguang wrote:
> > > Andrew,
> > > 
> > > It's possible to transfer ASYNC vmscan writeback IOs to the flusher threads.
> > > This simple patchset shows the basic idea. Since it's a big behavior change,
> > > there are inevitably lots of details to sort out. I don't know where it will
> > > go after tests and discussions, so the patches are intentionally kept simple.
> > > 
> > > sync livelock avoidance (more is needed to be complete, but this is the minimum required for the last two patches)
> > > 	[PATCH 1/5] writeback: introduce wbc.for_sync to cover the two sync stages
> > > 	[PATCH 2/5] writeback: stop periodic/background work on seeing sync works
> > > 	[PATCH 3/5] writeback: prevent sync livelock with the sync_after timestamp
> > > 
> > > let the flusher threads do ASYNC writeback for pageout()
> > > 	[PATCH 4/5] writeback: introduce bdi_start_inode_writeback()
> > > 	[PATCH 5/5] vmscan: transfer async file writeback to the flusher
> > 
> > I really do not like this - all it does is transfer random page writeback
> > from vmscan to the flusher threads rather than avoiding random page
> > writeback altogether. Random page writeback is nasty - just say no.
> 
> There are cases where we have to do pageout().
> 
> - a stressed memcg with lots of dirty pages
> - a large NUMA system whose nodes have unbalanced vmscan rate and dirty pages

- 32-bit highmem systems too

Can you please see the following commit? It describes the current design.




commit c4e2d7ddde9693a4c05da7afd485db02c27a7a09
Author: akpm <akpm>
Date:   Sun Dec 22 01:07:33 2002 +0000

    [PATCH] Give kswapd writeback higher priority than pdflush

    The `low latency page reclaim' design works by preventing page
    allocators from blocking on request queues (and by preventing them from
    blocking against writeback of individual pages, but that is immaterial
    here).

    This has a problem under some situations.  pdflush (or a write(2)
    caller) could be saturating the queue with highmem pages.  This
    prevents anyone from writing back ZONE_NORMAL pages.  We end up doing
    enormous amounts of scanning.

    A test case is to mmap(MAP_SHARED) almost all of a 4G machine's memory,
    then kill the mmapping applications.  The machine instantly goes from
    0% of memory dirty to 95% or more.  pdflush kicks in and starts writing
    the least-recently-dirtied pages, which are all highmem.  The queue is
    congested so nobody will write back ZONE_NORMAL pages.  kswapd chews
    50% of the CPU scanning past dirty ZONE_NORMAL pages and page reclaim
    efficiency (pages_reclaimed/pages_scanned) falls to 2%.

    So this patch changes the policy for kswapd.  kswapd may use all of a
    request queue, and is prepared to block on request queues.

    What will now happen in the above scenario is:

    1: The page allocator scans some pages, fails to reclaim enough
       memory and takes a nap in blk_congestion_wait().

    2: kswapd() will scan the ZONE_NORMAL LRU and will start writing
       back pages.  (These pages will be rotated to the tail of the
       inactive list at IO-completion interrupt time).

       This writeback will saturate the queue with ZONE_NORMAL pages.
       Conveniently, pdflush will avoid the congested queues.  So we end up
       writing the correct pages.

    In this test, kswapd CPU utilisation falls from 50% to 2%, page reclaim
    efficiency rises from 2% to 40% and things are generally a lot happier.


    The downside is that kswapd may now do a lot less page reclaim,
    increasing page allocation latency, causing more direct reclaim,
    increasing lock contention in the VM, etc.  But I have not been able to
    demonstrate that in testing.


    The other problem is that there is only one kswapd, and there are lots
    of disks.  That is a generic problem - without being able to co-opt
    user processes we don't have enough threads to keep lots of disks saturated.

    One fix for this would be to add an additional "really congested"
    threshold in the request queues, so kswapd can still perform
    nonblocking writeout.  This gives kswapd priority over pdflush while
    allowing kswapd to feed many disk queues.  I doubt if this will be
    called for.

    BKrev: 3e051055aitHp3bZBPSqmq21KGs5aQ
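
For reference, that policy survives into this era's mm/vmscan.c as
may_write_to_queue() (sketched here roughly from memory, so treat the
details as approximate): kswapd runs with PF_SWAPWRITE set and is
therefore allowed to write even to a congested queue, while other
reclaimers back off.

static int may_write_to_queue(struct backing_dev_info *bdi)
{
	if (current->flags & PF_SWAPWRITE)	/* kswapd: may block on the queue */
		return 1;
	if (!bdi_write_congested(bdi))		/* queue not congested: go ahead */
		return 1;
	if (bdi == current->backing_dev_info)	/* the congestion is our own doing */
		return 1;
	return 0;
}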




* Re: [PATCH 0/5]  [RFC] transfer ASYNC vmscan writeback IO to the flusher threads
  2010-07-30  7:58   ` Wu Fengguang
  2010-07-30  9:22     ` KOSAKI Motohiro
@ 2010-07-30 11:12     ` Dave Chinner
  2010-07-30 13:18       ` Wu Fengguang
  1 sibling, 1 reply; 30+ messages in thread
From: Dave Chinner @ 2010-07-30 11:12 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, LKML, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Andrea Arcangeli, Mel Gorman, Minchan Kim

On Fri, Jul 30, 2010 at 03:58:19PM +0800, Wu Fengguang wrote:
> On Fri, Jul 30, 2010 at 07:23:30AM +0800, Dave Chinner wrote:
> > On Thu, Jul 29, 2010 at 07:51:42PM +0800, Wu Fengguang wrote:
> > > Andrew,
> > > 
> > > It's possible to transfer ASYNC vmscan writeback IOs to the flusher threads.
> > > This simple patchset shows the basic idea. Since it's a big behavior change,
> > > there are inevitably lots of details to sort out. I don't know where it will
> > > go after tests and discussions, so the patches are intentionally kept simple.
> > > 
> > > sync livelock avoidance (needs more to be complete, but this is the minimum required for the last two patches)
> > > 	[PATCH 1/5] writeback: introduce wbc.for_sync to cover the two sync stages
> > > 	[PATCH 2/5] writeback: stop periodic/background work on seeing sync works
> > > 	[PATCH 3/5] writeback: prevent sync livelock with the sync_after timestamp
> > > 
> > > let the flusher threads do ASYNC writeback for pageout()
> > > 	[PATCH 4/5] writeback: introduce bdi_start_inode_writeback()
> > > 	[PATCH 5/5] vmscan: transfer async file writeback to the flusher
> > 
> > I really do not like this - all it does is transfer random page writeback
> > from vmscan to the flusher threads rather than avoiding random page
> > writeback altogether. Random page writeback is nasty - just say no.
> 
> There are cases we have to do pageout().
> 
> - a stressed memcg with lots of dirty pages
> - a large NUMA system whose nodes have unbalanced vmscan rate and dirty pages
> 
> In the above cases, the whole system may not be that stressed,
> except for some local LRU list being busily scanned.  If the local
> memory stress leads to lots of pageout(), it could bring down the whole
> system by congesting the disks with many small, seeky IOs.
> 
> It may be overkill to push global writeback (i.e. it's silly to sync
> 1GB of dirty data because there is a small stressed 100MB LRU list).

No it isn't. Dirty pages have to be cleaned sometime, and if reclaim has
a need to clean pages, we may as well start cleaning them all.
Kicking background writeback is effectively just starting work we
have already delayed into the future a little bit earlier than we
otherwise would have.

Doing this is only going to hurt performance if the same pages are
being frequently dirtied, but the changes to flush expired inodes
first in background writeback should avoid the worst of that
behaviour. Further, the more clean pages we have, the faster
subsequent memory reclaims are going to free up pages....


> The
> obvious solution is to keep the pageout() calls and make them more IO
> wise by doing write-around at the same time.  The write-around pages
> will likely be in the same stressed LRU list, hence will do good for
> page reclaim as well.

You've kind of already done that by telling it to write back 1024
pages starting with a specific page. However, the big problem with
this is that it assumes that the inode has contiguous dirty pages in
the cache.  That assumption falls down in many cases, e.g. when you
are writing lots of small files like kernel trees contain, and so
you still end up with random IO patterns coming out of reclaim.

> Transferring ASYNC work to the flushers helps the
> kswapd-vs-flusher priority problem too. Currently the
> kswapd/direct reclaim either have to skip dirty pages on
> congestion, or risk being blocked in get_request_wait(); neither
> is a good option. However, the use of
> bdi_start_inode_writeback() does ask for a good vmscan throttling
> scheme to prevent falsely OOMing before the flusher is able to
> clean the transferred pages. This would be tricky.

I have no problem with that aspect of the patch - my issue is that it
does nothing to prevent the problem that causes excessive congestion
in the first place...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 0/5]  [RFC] transfer ASYNC vmscan writeback IO to the flusher threads
  2010-07-30  9:22     ` KOSAKI Motohiro
@ 2010-07-30 12:25       ` Wu Fengguang
  0 siblings, 0 replies; 30+ messages in thread
From: Wu Fengguang @ 2010-07-30 12:25 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Dave Chinner, Andrew Morton, LKML, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	Andrea Arcangeli, Mel Gorman, Minchan Kim

> > There are cases we have to do pageout().
> > 
> > - a stressed memcg with lots of dirty pages
> > - a large NUMA system whose nodes have unbalanced vmscan rate and dirty pages
> 
> - a 32-bit highmem system too

Ah yes!

> Can you please see the following commit? It describes the current design.

Good stuff. Thanks.

Thanks,
Fengguang



* Re: [PATCH 0/5]  [RFC] transfer ASYNC vmscan writeback IO to the flusher threads
  2010-07-30 11:12     ` Dave Chinner
@ 2010-07-30 13:18       ` Wu Fengguang
  0 siblings, 0 replies; 30+ messages in thread
From: Wu Fengguang @ 2010-07-30 13:18 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrew Morton, LKML, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Andrea Arcangeli, Mel Gorman, Minchan Kim

On Fri, Jul 30, 2010 at 07:12:44PM +0800, Dave Chinner wrote:
> On Fri, Jul 30, 2010 at 03:58:19PM +0800, Wu Fengguang wrote:
> > On Fri, Jul 30, 2010 at 07:23:30AM +0800, Dave Chinner wrote:
> > > On Thu, Jul 29, 2010 at 07:51:42PM +0800, Wu Fengguang wrote:
> > > > Andrew,
> > > > 
> > > > It's possible to transfer ASYNC vmscan writeback IOs to the flusher threads.
> > > > This simple patchset shows the basic idea. Since it's a big behavior change,
> > > > there are inevitably lots of details to sort out. I don't know where it will
> > > > go after tests and discussions, so the patches are intentionally kept simple.
> > > > 
> > > > sync livelock avoidance (needs more to be complete, but this is the minimum required for the last two patches)
> > > > 	[PATCH 1/5] writeback: introduce wbc.for_sync to cover the two sync stages
> > > > 	[PATCH 2/5] writeback: stop periodic/background work on seeing sync works
> > > > 	[PATCH 3/5] writeback: prevent sync livelock with the sync_after timestamp
> > > > 
> > > > let the flusher threads do ASYNC writeback for pageout()
> > > > 	[PATCH 4/5] writeback: introduce bdi_start_inode_writeback()
> > > > 	[PATCH 5/5] vmscan: transfer async file writeback to the flusher
> > > 
> > > I really do not like this - all it does is transfer random page writeback
> > > from vmscan to the flusher threads rather than avoiding random page
> > > writeback altogether. Random page writeback is nasty - just say no.
> > 
> > There are cases we have to do pageout().
> > 
> > - a stressed memcg with lots of dirty pages
> > - a large NUMA system whose nodes have unbalanced vmscan rate and dirty pages
> > 
> > In the above cases, the whole system may not be that stressed,
> > except for some local LRU list being busily scanned.  If the local
> > memory stress leads to lots of pageout(), it could bring down the whole
> > system by congesting the disks with many small, seeky IOs.
> > 
> > It may be overkill to push global writeback (i.e. it's silly to sync
> > 1GB of dirty data because there is a small stressed 100MB LRU list).
> 
> No it isn't. Dirty pages have to be cleaned sometime, and if reclaim has
> a need to clean pages, we may as well start cleaning them all.
> Kicking background writeback is effectively just starting work we
> have already delayed into the future a little bit earlier than we
> otherwise would have.
> 
> Doing this is only going to hurt performance if the same pages are
> being frequently dirtied, but the changes to flush expired inodes
> first in background writeback should avoid the worst of that
> behaviour. Further, the more clean pages we have, the faster
> subsequent memory reclaims are going to free up pages....

You have some points here; the data has to be synced anyway, sooner
or later.

However it still helps to clean the right data first. With
write-around, we may get clean pages in the stressed LRU in 10ms.
Blindly syncing the global inodes...maybe after 10s if unlucky.

So pageout() is still good to have/keep. But sure we need to improve it
(transfer work to the flusher, do write-around, throttle) as well as
reduce it (kick global writeback and knock down global dirty pages).

> > The
> > obvious solution is to keep the pageout() calls and make them more IO
> > wise by doing write-around at the same time.  The write-around pages
> > will likely be in the same stressed LRU list, hence will do good for
> > page reclaim as well.
> 
> You've kind of already done that by telling it to write back 1024
> pages starting with a specific page. However, the big problem with
> this is that it assumes that the inode has contiguous dirty pages in

Right. We could use .writeback_index/.nr_to_write instead of
.range_start/.range_end as the writeback parameters. It's a bit racy
to use mapping->writeback_index though.

> the cache.  That assumption falls down in many cases, e.g. when you
> are writing lots of small files like kernel trees contain, and so
> you still end up with random IO patterns coming out of reclaim.

Small files lead to random IO anyway? You may mean .offset=1, so the
dirty page 0 will be left out. I do have a plan to do write-around
to cover such issues, since it would be a very common case. Imagine the
dirty page at offset N lies in the Normal zone and N+1 in the DMA32 zone.
If DMA32 is scanned slightly before Normal, then we get page N+1
first, while actually we should start with page N.
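
To make the write-around idea concrete, here is a minimal sketch of how
pageout() could hand the flusher a window of pages around the one vmscan
found, using the real struct writeback_control range fields; the window
size, the helper name and the centering logic are illustrative, not from
the patchset. Note that with a centered window, even if vmscan hands
over page N+1 first, page N is still covered.

#define PAGEOUT_WINDOW_PAGES	1024	/* illustrative window size */

static void setup_pageout_write_around(struct writeback_control *wbc,
				       pgoff_t index)
{
	/* center the window on the page vmscan found */
	pgoff_t start = index > PAGEOUT_WINDOW_PAGES / 2 ?
			index - PAGEOUT_WINDOW_PAGES / 2 : 0;

	wbc->sync_mode   = WB_SYNC_NONE;
	wbc->for_reclaim = 1;
	wbc->nr_to_write = PAGEOUT_WINDOW_PAGES;
	wbc->range_start = (loff_t)start << PAGE_CACHE_SHIFT;
	wbc->range_end   = ((loff_t)(start + PAGEOUT_WINDOW_PAGES)
				<< PAGE_CACHE_SHIFT) - 1;
}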

> > Transferring ASYNC work to the flushers helps the
> > kswapd-vs-flusher priority problem too. Currently the
> > kswapd/direct reclaim either have to skip dirty pages on
> > congestion, or risk being blocked in get_request_wait(); neither
> > is a good option. However, the use of
> > bdi_start_inode_writeback() does ask for a good vmscan throttling
> > scheme to prevent falsely OOMing before the flusher is able to
> > clean the transferred pages. This would be tricky.
> 
> I have no problem with that aspect of the patch - my issue is that it
> does nothing to prevent the problem that causes excessive congestion
> in the first place...

No problem. It's merely the first step, stay tuned :)

Thanks,
Fengguang


* Re: [PATCH 2/5] writeback: stop periodic/background work on seeing sync works
  2010-07-30  4:03     ` Wu Fengguang
@ 2010-08-02 20:51       ` Jan Kara
  2010-08-03  3:01         ` Wu Fengguang
  0 siblings, 1 reply; 30+ messages in thread
From: Jan Kara @ 2010-08-02 20:51 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Andrew Morton, LKML, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin,
	Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrea Arcangeli, Mel Gorman,
	Minchan Kim

[-- Attachment #1: Type: text/plain, Size: 1682 bytes --]

On Fri 30-07-10 12:03:06, Wu Fengguang wrote:
> On Fri, Jul 30, 2010 at 12:20:27AM +0800, Jan Kara wrote:
> > On Thu 29-07-10 19:51:44, Wu Fengguang wrote:
> > > The periodic/background writeback can run forever. So when any
> > > sync work is enqueued, increase bdi->sync_works to notify the
> > > active non-sync works to exit. Non-sync works queued after sync
> > > works won't be affected.
> >   Hmm, wouldn't it be simpler logic to just make for_kupdate and
> > for_background work always yield when there's some other work to do (as
> > they are livelockable from the definition of the target they have) and
> > make sure any other work isn't livelockable?
> 
> Good idea!
> 
> > The only downside is that
> > non-livelockable work cannot be "fair" in the sense that we cannot switch
> > inodes after writing MAX_WRITEBACK_PAGES.
> 
> Cannot switch inodes _before_ finishing with the current
> MAX_WRITEBACK_PAGES batch? 
  Well, even after writing all those MAX_WRITEBACK_PAGES. Because what you
want to do in a non-livelockable work is: take inode, write it, never look at
it again for this work. Because if you later return to the inode, it can
have newer dirty pages and thus you cannot really avoid livelock. Of
course, this all assumes .nr_to_write isn't set to something small. That
avoids the livelock as well.
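
Schematically (an editorial sketch, not actual fs-writeback.c code;
locking, refcounting and error handling omitted), a non-livelockable
work makes a single pass over a snapshot of the dirty list:

static void writeback_one_pass(struct bdi_writeback *wb,
			       struct writeback_control *wbc)
{
	struct inode *inode, *next;

	/* snapshot the dirty list; inodes dirtied from now on are ignored */
	list_splice_init(&wb->b_dirty, &wb->b_io);

	list_for_each_entry_safe(inode, next, &wb->b_io, i_list) {
		writeback_single_inode(inode, wbc);
		/* done with this inode for this work, even if redirtied */
	}
}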

> >   I even had a patch for this but it's already outdated by now. But I
> > can refresh it if we decide this is the way to go.
> 
> I'm very interested in your old patch, would you post it? Let's see
> which one is easier to work with :)
  OK, attached is the patch. I've rebased it against 2.6.35.

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

[-- Attachment #2: 0001-mm-Stop-background-writeback-if-there-is-other-work-.patch --]
[-- Type: text/x-patch, Size: 1349 bytes --]

>From a6df0d4db148f983fe756df4791409db28dff459 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Mon, 2 Aug 2010 22:30:25 +0200
Subject: [PATCH] mm: Stop background writeback if there is other work queued for the thread

Background writeback and kupdate-style writeback are easily livelockable
(from a definition of their target). This is inconvenient because it can
make sync(1) stall forever waiting on its queued work to be finished.
Fix the problem by interrupting background and kupdate writeback if there
is some other work to do. We can return to them after completing all the
queued work.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c |    8 ++++++++
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index d5be169..542471e 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -633,6 +633,14 @@ static long wb_writeback(struct bdi_writeback *wb,
 			break;
 
 		/*
+		 * Background writeout and kupdate-style writeback are
+		 * easily livelockable. Stop them if there is other work
+		 * to do so that e.g. sync can proceed.
+		 */
+		if ((work->for_background || work->for_kupdate) &&
+		    !list_empty(&wb->bdi->work_list))
+			break;
+		/*
 		 * For background writeout, stop when we are below the
 		 * background dirty threshold
 		 */
-- 
1.6.4.2



* Re: [PATCH 2/5] writeback: stop periodic/background work on seeing sync works
  2010-08-02 20:51       ` Jan Kara
@ 2010-08-03  3:01         ` Wu Fengguang
  2010-08-03 10:55           ` Jan Kara
  0 siblings, 1 reply; 30+ messages in thread
From: Wu Fengguang @ 2010-08-03  3:01 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, LKML, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin,
	Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrea Arcangeli, Mel Gorman,
	Minchan Kim

On Tue, Aug 03, 2010 at 04:51:52AM +0800, Jan Kara wrote:
> On Fri 30-07-10 12:03:06, Wu Fengguang wrote:
> > On Fri, Jul 30, 2010 at 12:20:27AM +0800, Jan Kara wrote:
> > > On Thu 29-07-10 19:51:44, Wu Fengguang wrote:
> > > > The periodic/background writeback can run forever. So when any
> > > > sync work is enqueued, increase bdi->sync_works to notify the
> > > > active non-sync works to exit. Non-sync works queued after sync
> > > > works won't be affected.
> > >   Hmm, wouldn't it be simpler logic to just make for_kupdate and
> > > for_background work always yield when there's some other work to do (as
> > > they are livelockable from the definition of the target they have) and
> > > make sure any other work isn't livelockable?
> > 
> > Good idea!
> > 
> > > The only downside is that
> > > non-livelockable work cannot be "fair" in the sense that we cannot switch
> > > inodes after writing MAX_WRITEBACK_PAGES.
> > 
> > Cannot switch inodes _before_ finishing with the current
> > MAX_WRITEBACK_PAGES batch? 
>   Well, even after writing all those MAX_WRITEBACK_PAGES. Because what you
> want to do in a non-livelockable work is: take inode, write it, never look at
> it again for this work. Because if you later return to the inode, it can
> have newer dirty pages and thus you cannot really avoid livelock. Of
> course, this all assumes .nr_to_write isn't set to something small. That
> avoids the livelock as well.

I do have a poor man's solution that can handle this case.
https://kerneltrap.org/mailarchive/linux-fsdevel/2009/10/7/6476473/thread
It may do more extra work, but will stop livelocks in theory.

A related question is, what if some for_reclaim works get enqueued?
Shall we postpone the sync work as well? The global sync is not likely
to hit the dirty pages in a small memcg, or may take a long time. It
seems not a high priority task though.

> > >   I even had a patch for this but it's already outdated by now. But I
> > > can refresh it if we decide this is the way to go.
> > 
> > I'm very interested in your old patch, would you post it? Let's see
> > which one is easier to work with :)
>   OK, attached is the patch. I've rebased it against 2.6.35.
> 									Honza
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR

> From a6df0d4db148f983fe756df4791409db28dff459 Mon Sep 17 00:00:00 2001
> From: Jan Kara <jack@suse.cz>
> Date: Mon, 2 Aug 2010 22:30:25 +0200
> Subject: [PATCH] mm: Stop background writeback if there is other work queued for the thread
> 
> Background writeback and kupdate-style writeback are easily livelockable
> (from a definition of their target). This is inconvenient because it can
> make sync(1) stall forever waiting on its queued work to be finished.
> Fix the problem by interrupting background and kupdate writeback if there
> is some other work to do. We can return to them after completing all the
> queued work.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/fs-writeback.c |    8 ++++++++
>  1 files changed, 8 insertions(+), 0 deletions(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index d5be169..542471e 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -633,6 +633,14 @@ static long wb_writeback(struct bdi_writeback *wb,
>  			break;
>  
>  		/*
> +		 * Background writeout and kupdate-style writeback are
> +		 * easily livelockable. Stop them if there is other work
> +		 * to do so that e.g. sync can proceed.
> +		 */
> +		if ((work->for_background || work->for_kupdate) &&
> +		    !list_empty(&wb->bdi->work_list))
> +			break;
> +		/*

I like it. It's much simpler.

Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>


Thanks,
Fengguang


* Re: [PATCH 2/5] writeback: stop periodic/background work on seeing sync works
  2010-08-03  3:01         ` Wu Fengguang
@ 2010-08-03 10:55           ` Jan Kara
  2010-08-03 12:39             ` Jan Kara
  2010-08-03 14:36             ` Wu Fengguang
  0 siblings, 2 replies; 30+ messages in thread
From: Jan Kara @ 2010-08-03 10:55 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Andrew Morton, LKML, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin,
	Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrea Arcangeli, Mel Gorman,
	Minchan Kim

On Tue 03-08-10 11:01:25, Wu Fengguang wrote:
> On Tue, Aug 03, 2010 at 04:51:52AM +0800, Jan Kara wrote:
> > On Fri 30-07-10 12:03:06, Wu Fengguang wrote:
> > > On Fri, Jul 30, 2010 at 12:20:27AM +0800, Jan Kara wrote:
> > > > On Thu 29-07-10 19:51:44, Wu Fengguang wrote:
> > > > > The periodic/background writeback can run forever. So when any
> > > > > sync work is enqueued, increase bdi->sync_works to notify the
> > > > > active non-sync works to exit. Non-sync works queued after sync
> > > > > works won't be affected.
> > > >   Hmm, wouldn't it be simpler logic to just make for_kupdate and
> > > > for_background work always yield when there's some other work to do (as
> > > > they are livelockable from the definition of the target they have) and
> > > > make sure any other work isn't livelockable?
> > > 
> > > Good idea!
> > > 
> > > > The only downside is that
> > > > non-livelockable work cannot be "fair" in the sense that we cannot switch
> > > > inodes after writing MAX_WRITEBACK_PAGES.
> > > 
> > > > Cannot switch inodes _before_ finishing with the current
> > > MAX_WRITEBACK_PAGES batch? 
> >   Well, even after writing all those MAX_WRITEBACK_PAGES. Because what you
> > want to do in a non-livelockable work is: take inode, write it, never look at
> > it again for this work. Because if you later return to the inode, it can
> > have newer dirty pages and thus you cannot really avoid livelock. Of
> > course, this all assumes .nr_to_write isn't set to something small. That
> > avoids the livelock as well.
> 
> I do have a poor man's solution that can handle this case.
> https://kerneltrap.org/mailarchive/linux-fsdevel/2009/10/7/6476473/thread
> > It may do more extra work, but will stop livelocks in theory.
  So I don't think sync work on its own is a problem. There we can just
give up any fairness and just go inode by inode. IMHO it's much simpler that
way. The remaining types of work we have are "for_reclaim" and then ones
triggered by filesystems to get rid of delayed allocated data. These cases
can easily have well defined and low nr_to_write so they wouldn't be
livelockable either. What do you think?
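
As a hedged illustration of such a bounded work item (field names follow
this era's struct wb_writeback_args in fs/fs-writeback.c; the
.for_reclaim flag is hypothetical, per patches 4-5/5):

	struct wb_writeback_args args = {
		.nr_pages	= MAX_WRITEBACK_PAGES,	/* hard bound: must terminate */
		.sync_mode	= WB_SYNC_NONE,
		.range_cyclic	= 1,
		/* .for_reclaim = 1,	hypothetical flag from this patchset */
	};
	/* queued for the flusher thread, e.g. via this era's
	 * bdi_alloc_queue_work(bdi, &args) */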

> A related question is, what if some for_reclaim works get enqueued?
> Shall we postpone the sync work as well? The global sync is not likely
> > to hit the dirty pages in a small memcg, or may take a long time. It
> seems not a high priority task though.
  I see some incentive to do this but the simple thing with for_background
and for_kupdate work is that they are essentially stateless and so they
can be easily (and automatically) restarted. It would be really hard to
implement something like this for sync and still avoid livelocks.

> > > >   I even had a patch for this but it's already outdated by now. But I
> > > > can refresh it if we decide this is the way to go.
> > > 
> > > I'm very interested in your old patch, would you post it? Let's see
> > > which one is easier to work with :)
> >   OK, attached is the patch. I've rebased it against 2.6.35.
> > 									Honza
> > -- 
> > Jan Kara <jack@suse.cz>
> > SUSE Labs, CR
> 
> > [...]
> 
> I like it. It's much simpler.
> 
> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
  Thanks. I think I'll try to get this merged via Jens' tree in this
merge window.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCH 2/5] writeback: stop periodic/background work on seeing sync works
  2010-08-03 10:55           ` Jan Kara
@ 2010-08-03 12:39             ` Jan Kara
  2010-08-03 12:59               ` Wu Fengguang
  2010-08-03 14:36             ` Wu Fengguang
  1 sibling, 1 reply; 30+ messages in thread
From: Jan Kara @ 2010-08-03 12:39 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Andrew Morton, LKML, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin,
	Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrea Arcangeli, Mel Gorman,
	Minchan Kim

[-- Attachment #1: Type: text/plain, Size: 2521 bytes --]

On Tue 03-08-10 12:55:20, Jan Kara wrote:
> [...]
>   So I don't think sync work on it's own is a problem. There we can just
> give up any fairness and just go inode by inode. IMHO it's much simpler that
> way. The remaining types of work we have are "for_reclaim" and then ones
> triggered by filesystems to get rid of delayed allocated data. These cases
> can easily have well defined and low nr_to_write so they wouldn't be
> livelockable either. What do you think?
  Fengguang, how about merging also the attached simple patch together with
my fix? With these two patches, I'm not able to trigger any sync livelock
while without one of them I hit them quite easily...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

[-- Attachment #2: 0001-mm-Avoid-resetting-wb_start-after-each-writeback-ro.patch --]
[-- Type: text/x-patch, Size: 1877 bytes --]

>From e4b708115825bca6a1020eed2356e2aab0567e3a Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Tue, 3 Aug 2010 10:35:02 +0000
Subject: [PATCH] mm: Avoid resetting wb_start after each writeback round

WB_SYNC_NONE writeback is done in rounds of 1024 pages so that we don't write
out some huge inode for too long while starving writeout of other inodes. To
avoid livelocks, we record the time we started writeback in wbc->wb_start and do
not write out inodes which were dirtied after this time. But currently,
writeback_inodes_wb() resets wb_start each time it is called thus effectively
invalidating this logic and making any WB_SYNC_NONE writeback prone to
livelocks.

This patch makes sure wb_start is set only once when we start writeback.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 6bdc924..aa59394 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -530,7 +530,8 @@ void writeback_inodes_wb(struct bdi_writeback *wb,
 {
 	int ret = 0;
 
-	wbc->wb_start = jiffies; /* livelock avoidance */
+	if (!wbc->wb_start)
+		wbc->wb_start = jiffies; /* livelock avoidance */
 	spin_lock(&inode_lock);
 	if (!wbc->for_kupdate || list_empty(&wb->b_io))
 		queue_io(wb, wbc->older_than_this);
@@ -559,7 +560,6 @@ static void __writeback_inodes_sb(struct super_block *sb,
 {
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
-	wbc->wb_start = jiffies; /* livelock avoidance */
 	spin_lock(&inode_lock);
 	if (!wbc->for_kupdate || list_empty(&wb->b_io))
 		queue_io(wb, wbc->older_than_this);
@@ -625,6 +625,7 @@ static long wb_writeback(struct bdi_writeback *wb,
 		wbc.range_end = LLONG_MAX;
 	}
 
+	wbc.wb_start = jiffies; /* livelock avoidance */
 	for (;;) {
 		/*
 		 * Stop writeback when nr_pages has been consumed
-- 
1.6.0.2
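
For context, the livelock test this patch makes effective is the
existing inode_dirtied_after() helper in fs/fs-writeback.c, which the
queueing code applies against wbc->wb_start (reproduced roughly from
memory):

static bool inode_dirtied_after(struct inode *inode, unsigned long t)
{
	bool ret = time_after(inode->dirtied_when, t);
#ifndef CONFIG_64BIT
	/*
	 * On 32-bit, jiffies wraps: a dirtied_when seemingly in the
	 * future is stale, so do not treat it as "dirtied after t".
	 */
	ret = ret && time_before_eq(inode->dirtied_when, jiffies);
#endif
	return ret;
}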



* Re: [PATCH 2/5] writeback: stop periodic/background work on seeing sync works
  2010-08-03 12:39             ` Jan Kara
@ 2010-08-03 12:59               ` Wu Fengguang
  2010-08-03 13:18                 ` Jan Kara
  2010-08-03 13:22                 ` Wu Fengguang
  0 siblings, 2 replies; 30+ messages in thread
From: Wu Fengguang @ 2010-08-03 12:59 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, LKML, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin,
	Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrea Arcangeli, Mel Gorman,
	Minchan Kim

On Tue, Aug 03, 2010 at 08:39:22PM +0800, Jan Kara wrote:
> On Tue 03-08-10 12:55:20, Jan Kara wrote:
> > On Tue 03-08-10 11:01:25, Wu Fengguang wrote:
> > > On Tue, Aug 03, 2010 at 04:51:52AM +0800, Jan Kara wrote:
> > > > On Fri 30-07-10 12:03:06, Wu Fengguang wrote:
> > > > > On Fri, Jul 30, 2010 at 12:20:27AM +0800, Jan Kara wrote:
> > > > > > On Thu 29-07-10 19:51:44, Wu Fengguang wrote:
> > > > > > > The periodic/background writeback can run forever. So when any
> > > > > > > sync work is enqueued, increase bdi->sync_works to notify the
> > > > > > > active non-sync works to exit. Non-sync works queued after sync
> > > > > > > works won't be affected.
> > > > > >   Hmm, wouldn't it be simpler logic to just make for_kupdate and
> > > > > > for_background work always yield when there's some other work to do (as
> > > > > > they are livelockable from the definition of the target they have) and
> > > > > > make sure any other work isn't livelockable?
> > > > > 
> > > > > Good idea!
> > > > > 
> > > > > > The only downside is that
> > > > > > non-livelockable work cannot be "fair" in the sense that we cannot switch
> > > > > > inodes after writing MAX_WRITEBACK_PAGES.
> > > > > 
> > > > > Cannot switch inodes _before_ finishing with the current
> > > > > MAX_WRITEBACK_PAGES batch? 
> > > >   Well, even after writing all those MAX_WRITEBACK_PAGES. Because what you
> > > > want to do in a non-livelockable work is: take inode, write it, never look at
> > > > it again for this work. Because if you later return to the inode, it can
> > > > have newer dirty pages and thus you cannot really avoid livelock. Of
> > > > course, this all assumes .nr_to_write isn't set to something small. That
> > > > avoids the livelock as well.
> > > 
> > > I do have a poor man's solution that can handle this case.
> > > https://kerneltrap.org/mailarchive/linux-fsdevel/2009/10/7/6476473/thread
> > > It may do more extra work, but will stop livelocks in theory.
> >   So I don't think sync work on its own is a problem. There we can just
> > give up any fairness and just go inode by inode. IMHO it's much simpler that
> > way. The remaining types of work we have are "for_reclaim" and then ones
> > triggered by filesystems to get rid of delayed allocated data. These cases
> > can easily have well defined and low nr_to_write so they wouldn't be
> > livelockable either. What do you think?
>   Fengguang, how about merging also the attached simple patch together with
> my fix? With these two patches, I'm not able to trigger any sync livelock
> while without one of them I hit them quite easily...

This looks OK. However, note that redirty_tail() can modify
dirtied_when unexpectedly. So the more we rely on wb_start, the higher
the possibility that an inode is (wrongly) skipped by sync. I have a
bunch of patches to remove redirty_tail(); however, they may not be
good candidates for 2.6.36.
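
To make the concern concrete, this is roughly what redirty_tail() looks
like in this era (quoted from memory, details approximate); note the
jiffies refresh, which can push dirtied_when past wb_start and make
WB_SYNC_NONE writeback skip the inode:

static void redirty_tail(struct inode *inode)
{
	struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;

	if (!list_empty(&wb->b_dirty)) {
		struct inode *tail;

		tail = list_entry(wb->b_dirty.next, struct inode, i_list);
		if (time_before(inode->dirtied_when, tail->dirtied_when))
			inode->dirtied_when = jiffies;	/* the refresh in question */
	}
	list_move(&inode->i_list, &wb->b_dirty);
}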

Thanks,
Fengguang




* Re: [PATCH 2/5] writeback: stop periodic/background work on seeing sync works
  2010-08-03 12:59               ` Wu Fengguang
@ 2010-08-03 13:18                 ` Jan Kara
  2010-08-03 13:22                 ` Wu Fengguang
  1 sibling, 0 replies; 30+ messages in thread
From: Jan Kara @ 2010-08-03 13:18 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Andrew Morton, LKML, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin,
	Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrea Arcangeli, Mel Gorman,
	Minchan Kim

On Tue 03-08-10 20:59:24, Wu Fengguang wrote:
> On Tue, Aug 03, 2010 at 08:39:22PM +0800, Jan Kara wrote:
> > On Tue 03-08-10 12:55:20, Jan Kara wrote:
> > > On Tue 03-08-10 11:01:25, Wu Fengguang wrote:
> > > > On Tue, Aug 03, 2010 at 04:51:52AM +0800, Jan Kara wrote:
> > > > > On Fri 30-07-10 12:03:06, Wu Fengguang wrote:
> > > > > > On Fri, Jul 30, 2010 at 12:20:27AM +0800, Jan Kara wrote:
> > > > > > > On Thu 29-07-10 19:51:44, Wu Fengguang wrote:
> > > > > > > > The periodic/background writeback can run forever. So when any
> > > > > > > > sync work is enqueued, increase bdi->sync_works to notify the
> > > > > > > > active non-sync works to exit. Non-sync works queued after sync
> > > > > > > > works won't be affected.
> > > > > > >   Hmm, wouldn't it be simpler logic to just make for_kupdate and
> > > > > > > for_background work always yield when there's some other work to do (as
> > > > > > > they are livelockable from the definition of the target they have) and
> > > > > > > make sure any other work isn't livelockable?
> > > > > > 
> > > > > > Good idea!
> > > > > > 
> > > > > > > The only downside is that
> > > > > > > non-livelockable work cannot be "fair" in the sense that we cannot switch
> > > > > > > inodes after writing MAX_WRITEBACK_PAGES.
> > > > > > 
> > > > > > Cannot switch inodes _before_ finishing with the current
> > > > > > MAX_WRITEBACK_PAGES batch? 
> > > > >   Well, even after writing all those MAX_WRITEBACK_PAGES. Because what you
> > > > > want to do in a non-livelockable work is: take inode, write it, never look at
> > > > > it again for this work. Because if you later return to the inode, it can
> > > > > have newer dirty pages and thus you cannot really avoid livelock. Of
> > > > > course, this all assumes .nr_to_write isn't set to something small. That
> > > > > avoids the livelock as well.
> > > > 
> > > > I do have a poor man's solution that can handle this case.
> > > > https://kerneltrap.org/mailarchive/linux-fsdevel/2009/10/7/6476473/thread
> > > > It may do more extra work, but will stop livelocks in theory.
> > >   So I don't think sync work on its own is a problem. There we can just
> > > give up any fairness and just go inode by inode. IMHO it's much simpler that
> > > way. The remaining types of work we have are "for_reclaim" and then ones
> > > triggered by filesystems to get rid of delayed allocated data. These cases
> > > can easily have well defined and low nr_to_write so they wouldn't be
> > > livelockable either. What do you think?
> >   Fengguang, how about merging also the attached simple patch together with
> > my fix? With these two patches, I'm not able to trigger any sync livelock
> > while without one of them I hit them quite easily...
> 
> This looks OK. However, note that redirty_tail() can modify
> dirtied_when unexpectedly. So the more we rely on wb_start, the higher
> the possibility that an inode is (wrongly) skipped by sync. I have a
> bunch of patches to remove redirty_tail(); however, they may not be
> good candidates for 2.6.36.
  Yes, I'm aware of this. But if I'm right, after your changes to the
logic in writeback_single_inode() Andrew has in his tree, we use
requeue_io() in case an inode still has any dirty pages. Thus after these
patches we should be mostly fine. Shouldn't we?

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCH 2/5] writeback: stop periodic/background work on seeing sync works
  2010-08-03 12:59               ` Wu Fengguang
  2010-08-03 13:18                 ` Jan Kara
@ 2010-08-03 13:22                 ` Wu Fengguang
  2010-08-03 13:44                   ` Wu Fengguang
  1 sibling, 1 reply; 30+ messages in thread
From: Wu Fengguang @ 2010-08-03 13:22 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, LKML, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin,
	Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrea Arcangeli, Mel Gorman,
	Minchan Kim

> >   Fengguang, how about merging also the attached simple patch together with
> > my fix? With these two patches, I'm not able to trigger any sync livelock
> > while without one of them I hit them quite easily...
> 
> This looks OK. However, note that redirty_tail() can modify
> dirtied_when unexpectedly. So the more we rely on wb_start, the higher
> the possibility that an inode is (wrongly) skipped by sync. I have a
> bunch of patches to remove redirty_tail(); however, they may not be
> good candidates for 2.6.36.

It looks like setting wb_start at the beginning of
writeback_inodes_wb() won't be easily affected by redirty_tail().

So

Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>

Thanks,
Fengguang


* Re: [PATCH 2/5] writeback: stop periodic/background work on seeing sync works
  2010-08-03 13:22                 ` Wu Fengguang
@ 2010-08-03 13:44                   ` Wu Fengguang
  2010-08-03 13:48                     ` Wu Fengguang
  0 siblings, 1 reply; 30+ messages in thread
From: Wu Fengguang @ 2010-08-03 13:44 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, LKML, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin,
	Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrea Arcangeli, Mel Gorman,
	Minchan Kim

On Tue, Aug 03, 2010 at 09:22:16PM +0800, Wu Fengguang wrote:
> > >   Fengguang, how about merging also the attached simple patch together with
> > > my fix? With these two patches, I'm not able to trigger any sync livelock
> > > while without one of them I hit them quite easily...
> > 
> > This looks OK. However, note that redirty_tail() can modify
> > dirtied_when unexpectedly. So the more we rely on wb_start, the higher
> > the possibility that an inode is (wrongly) skipped by sync. I have a
> > bunch of patches to remove redirty_tail(); however, they may not be
> > good candidates for 2.6.36.
> 
> It looks like setting wb_start at the beginning of
> writeback_inodes_wb() won't be easily affected by redirty_tail().

Except for this redirty_tail(), which may mess up the dirtied_when
ordering in b_dirty and later on break the assumption of
inode_dirtied_after(inode, wbc->wb_start).

It can be replaced by a requeue_io() for now.  Christoph mentioned a
patchset to introduce sb->s_wb, which should be a better solution.

Thanks,
Fengguang

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index a178828..e56e68b 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -457,12 +457,7 @@ int generic_writeback_sb_inodes(struct super_block *sb,
 
 		if (inode->i_sb != sb) {
 			if (only_this_sb) {
-				/*
-				 * We only want to write back data for this
-				 * superblock, move all inodes not belonging
-				 * to it back onto the dirty list.
-				 */
-				redirty_tail(inode);
+				requeue_io(inode);
 				continue;
 			}
 


* Re: [PATCH 2/5] writeback: stop periodic/background work on seeing sync works
  2010-08-03 13:44                   ` Wu Fengguang
@ 2010-08-03 13:48                     ` Wu Fengguang
  0 siblings, 0 replies; 30+ messages in thread
From: Wu Fengguang @ 2010-08-03 13:48 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, LKML, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin,
	Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrea Arcangeli, Mel Gorman,
	Minchan Kim

On Tue, Aug 03, 2010 at 09:44:31PM +0800, Wu Fengguang wrote:
> On Tue, Aug 03, 2010 at 09:22:16PM +0800, Wu Fengguang wrote:
> > > >   Fengguang, how about merging also the attached simple patch together with
> > > > my fix? With these two patches, I'm not able to trigger any sync livelock
> > > > while without one of them I hit them quite easily...
> > > 
> > > This looks OK. However, note that redirty_tail() can modify
> > > dirtied_when unexpectedly. So the more we rely on wb_start, the higher
> > > the possibility that an inode is (wrongly) skipped by sync. I have a
> > > bunch of patches to remove redirty_tail(); however, they may not be
> > > good candidates for 2.6.36.
> > 
> > It looks like setting wb_start at the beginning of
> > writeback_inodes_wb() won't be easily affected by redirty_tail().
> 
> Except for this redirty_tail(), which may mess up the dirtied_when
> ordering in b_dirty and later on break the assumption of
> inode_dirtied_after(inode, wbc->wb_start).

Oh well, it seems it is still OK (it does sound crazy). Please ignore
the patch below, sorry for the noise.

Thanks,
Fengguang

> It can be replaced by a requeue_io() for now.  Christoph mentioned a
> patchset to introduce sb->s_wb, which should be a better solution.
> 
> Thanks,
> Fengguang
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index a178828..e56e68b 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -457,12 +457,7 @@ int generic_writeback_sb_inodes(struct super_block *sb,
>  
>  		if (inode->i_sb != sb) {
>  			if (only_this_sb) {
> -				/*
> -				 * We only want to write back data for this
> -				 * superblock, move all inodes not belonging
> -				 * to it back onto the dirty list.
> -				 */
> -				redirty_tail(inode);
> +				requeue_io(inode);
>  				continue;
>  			}
>  


* Re: [PATCH 2/5] writeback: stop periodic/background work on seeing sync works
  2010-08-03 10:55           ` Jan Kara
  2010-08-03 12:39             ` Jan Kara
@ 2010-08-03 14:36             ` Wu Fengguang
  1 sibling, 0 replies; 30+ messages in thread
From: Wu Fengguang @ 2010-08-03 14:36 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, LKML, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin,
	Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrea Arcangeli, Mel Gorman,
	Minchan Kim

On Tue, Aug 03, 2010 at 06:55:20PM +0800, Jan Kara wrote:
> On Tue 03-08-10 11:01:25, Wu Fengguang wrote:
> > On Tue, Aug 03, 2010 at 04:51:52AM +0800, Jan Kara wrote:
> > > On Fri 30-07-10 12:03:06, Wu Fengguang wrote:
> > > > On Fri, Jul 30, 2010 at 12:20:27AM +0800, Jan Kara wrote:
> > > > > On Thu 29-07-10 19:51:44, Wu Fengguang wrote:
> > > > > > The periodic/background writeback can run forever. So when any
> > > > > > sync work is enqueued, increase bdi->sync_works to notify the
> > > > > > active non-sync works to exit. Non-sync works queued after sync
> > > > > > works won't be affected.
> > > > >   Hmm, wouldn't it be simpler logic to just make for_kupdate and
> > > > > for_background work always yield when there's some other work to do (as
> > > > > they are livelockable from the definition of the target they have) and
> > > > > make sure any other work isn't livelockable?
> > > > 
> > > > Good idea!
> > > > 
> > > > > The only downside is that
> > > > > non-livelockable work cannot be "fair" in the sense that we cannot switch
> > > > > inodes after writing MAX_WRITEBACK_PAGES.
> > > > 
> > > > Cannot switch inodes _before_ finishing with the current
> > > > MAX_WRITEBACK_PAGES batch? 
> > >   Well, even after writing all those MAX_WRITEBACK_PAGES. Because what you
> > > want to do in a non-livelockable work is: take inode, write it, never look at
> > > it again for this work. Because if you later return to the inode, it can
> > > have newer dirty pages and thus you cannot really avoid livelock. Of
> > > course, this all assumes .nr_to_write isn't set to something small. That
> > > avoids the livelock as well.
> > 
> > I do have a poor man's solution that can handle this case.
> > https://kerneltrap.org/mailarchive/linux-fsdevel/2009/10/7/6476473/thread
> > It may do more extra work, but will stop livelocks in theory.
>   So I don't think sync work on its own is a problem. There we can just
> give up any fairness and just go inode by inode. IMHO it's much simpler that
> way.

I would like to reserve my opinion here. IMHO small files had
better get synced first :)

> The remaining types of work we have are "for_reclaim" and then ones
> triggered by filesystems to get rid of delayed allocated data. These cases
> can easily have well defined and low nr_to_write so they wouldn't be
> livelockable either. What do you think?

Right, for_reclaim works won't livelock by themselves, since they will be
bounded by either nr_to_write or some range. They may be delayed for a
while by large sync or nr_pages works, though.

> > A related question is, what if some for_reclaim works get enqueued?
> > Shall we postpone the sync work as well? The global sync is not likely
> > to hit the dirty pages in a small memcg, or may take a long time. It
> > seems not a high priority task though.
>   I see some incentive to do this but the simple thing with for_background
> and for_kupdate work is that they are essentially stateless and so they
> can be easily (and automatically) restarted. It would be really hard to
> implement something like this for sync and still avoid livelocks.

So let's ignore the issue for now.

Thanks,
Fengguang


End of thread (newest message: 2010-08-03 14:36 UTC)

Thread overview: 30+ messages
2010-07-29 11:51 [PATCH 0/5] [RFC] transfer ASYNC vmscan writeback IO to the flusher threads Wu Fengguang
2010-07-29 11:51 ` [PATCH 1/5] writeback: introduce wbc.for_sync to cover the two sync stages Wu Fengguang
2010-07-29 15:04   ` Jan Kara
2010-07-30  5:10     ` Wu Fengguang
2010-07-29 11:51 ` [PATCH 2/5] writeback: stop periodic/background work on seeing sync works Wu Fengguang
2010-07-29 16:20   ` Jan Kara
2010-07-30  4:03     ` Wu Fengguang
2010-08-02 20:51       ` Jan Kara
2010-08-03  3:01         ` Wu Fengguang
2010-08-03 10:55           ` Jan Kara
2010-08-03 12:39             ` Jan Kara
2010-08-03 12:59               ` Wu Fengguang
2010-08-03 13:18                 ` Jan Kara
2010-08-03 13:22                 ` Wu Fengguang
2010-08-03 13:44                   ` Wu Fengguang
2010-08-03 13:48                     ` Wu Fengguang
2010-08-03 14:36             ` Wu Fengguang
2010-07-29 11:51 ` [PATCH 3/5] writeback: prevent sync livelock with the sync_after timestamp Wu Fengguang
2010-07-29 15:02   ` Jan Kara
2010-07-30  5:17     ` Wu Fengguang
2010-07-29 11:51 ` [PATCH 4/5] writeback: introduce bdi_start_inode_writeback() Wu Fengguang
2010-07-29 11:51 ` [PATCH 5/5] vmscan: transfer async file writeback to the flusher Wu Fengguang
2010-07-29 16:09 ` [PATCH 0/5] [RFC] transfer ASYNC vmscan writeback IO to the flusher threads Jan Kara
2010-07-30  5:34   ` Wu Fengguang
2010-07-29 23:23 ` Dave Chinner
2010-07-30  7:58   ` Wu Fengguang
2010-07-30  9:22     ` KOSAKI Motohiro
2010-07-30 12:25       ` Wu Fengguang
2010-07-30 11:12     ` Dave Chinner
2010-07-30 13:18       ` Wu Fengguang
