* [PATCH 01/11] ntfs: remove old debug check for dirty data in ntfs_put_super()
From: Jens Axboe @ 2009-05-27 9:41 UTC
To: linux-kernel, linux-fsdevel
Cc: chris.mason, david, hch, akpm, jack, yanmin_zhang, richard,
damien.wyart, Jens Axboe
This should not trigger anymore, so kill it.
Acked-by: Anton Altaparmakov <aia21@cam.ac.uk>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
fs/ntfs/super.c | 33 +++------------------------------
1 files changed, 3 insertions(+), 30 deletions(-)
diff --git a/fs/ntfs/super.c b/fs/ntfs/super.c
index f76951d..3fc03bd 100644
--- a/fs/ntfs/super.c
+++ b/fs/ntfs/super.c
@@ -2373,39 +2373,12 @@ static void ntfs_put_super(struct super_block *sb)
vol->mftmirr_ino = NULL;
}
/*
- * If any dirty inodes are left, throw away all mft data page cache
- * pages to allow a clean umount. This should never happen any more
- * due to mft.c::ntfs_mft_writepage() cleaning all the dirty pages as
- * the underlying mft records are written out and cleaned. If it does,
- * happen anyway, we want to know...
+ * We should have no dirty inodes left, due to
+ * mft.c::ntfs_mft_writepage() cleaning all the dirty pages as
+ * the underlying mft records are written out and cleaned.
*/
ntfs_commit_inode(vol->mft_ino);
write_inode_now(vol->mft_ino, 1);
- if (sb_has_dirty_inodes(sb)) {
- const char *s1, *s2;
-
- mutex_lock(&vol->mft_ino->i_mutex);
- truncate_inode_pages(vol->mft_ino->i_mapping, 0);
- mutex_unlock(&vol->mft_ino->i_mutex);
- write_inode_now(vol->mft_ino, 1);
- if (sb_has_dirty_inodes(sb)) {
- static const char *_s1 = "inodes";
- static const char *_s2 = "";
- s1 = _s1;
- s2 = _s2;
- } else {
- static const char *_s1 = "mft pages";
- static const char *_s2 = "They have been thrown "
- "away. ";
- s1 = _s1;
- s2 = _s2;
- }
- ntfs_error(sb, "Dirty %s found at umount time. %sYou should "
- "run chkdsk. Please email "
- "linux-ntfs-dev@lists.sourceforge.net and say "
- "that you saw this message. Thank you.", s1,
- s2);
- }
#endif /* NTFS_RW */
iput(vol->mft_ino);
--
1.6.3.rc0.1.gf800
* [PATCH 02/11] btrfs: properly register fs backing device
From: Jens Axboe @ 2009-05-27 9:41 UTC
To: linux-kernel, linux-fsdevel
Cc: chris.mason, david, hch, akpm, jack, yanmin_zhang, richard,
damien.wyart, Jens Axboe
btrfs assigns this bdi to all inodes on that file system, so make
sure it's registered. This isn't really important now, but will be
when we put dirty inodes there. Even now, we miss the stats when the
bdi isn't visible.
This also fixes the missing check of the bdi_init() return value, and the
bogus inheritance of the ->capabilities flags from the default bdi.
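For reference, a minimal sketch of the init/register ordering the new
setup_bdi() follows, against the 2.6.30-era API from <linux/backing-dev.h>;
the example_* naming is illustrative, only the bdi_* calls, BDI_CAP_MAP_COPY
and default_backing_dev_info come from that API:

	static int example_setup_bdi(struct backing_dev_info *bdi)
	{
		int err;

		bdi->capabilities = BDI_CAP_MAP_COPY;	/* don't inherit default caps */

		err = bdi_init(bdi);			/* percpu counters, ratios */
		if (err)
			return err;

		err = bdi_register(bdi, NULL, "examplefs");
		if (err)
			return err;

		bdi->ra_pages = default_backing_dev_info.ra_pages;
		return 0;
	}

If either step fails, the caller unwinds with bdi_destroy(), which also
unregisters a registered bdi; that is what the new fail_bdi label in
open_ctree() does below.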
Acked-by: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
fs/btrfs/disk-io.c | 23 ++++++++++++++++++-----
1 files changed, 18 insertions(+), 5 deletions(-)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 4b0ea0b..2dc19c9 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1345,12 +1345,24 @@ static void btrfs_unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
free_extent_map(em);
}
+/*
+ * If this fails, caller must call bdi_destroy() to get rid of the
+ * bdi again.
+ */
static int setup_bdi(struct btrfs_fs_info *info, struct backing_dev_info *bdi)
{
- bdi_init(bdi);
+ int err;
+
+ bdi->capabilities = BDI_CAP_MAP_COPY;
+ err = bdi_init(bdi);
+ if (err)
+ return err;
+
+ err = bdi_register(bdi, NULL, "btrfs");
+ if (err)
+ return err;
+
bdi->ra_pages = default_backing_dev_info.ra_pages;
- bdi->state = 0;
- bdi->capabilities = default_backing_dev_info.capabilities;
bdi->unplug_io_fn = btrfs_unplug_io_fn;
bdi->unplug_io_data = info;
bdi->congested_fn = btrfs_congested_fn;
@@ -1574,7 +1586,8 @@ struct btrfs_root *open_ctree(struct super_block *sb,
fs_info->sb = sb;
fs_info->max_extent = (u64)-1;
fs_info->max_inline = 8192 * 1024;
- setup_bdi(fs_info, &fs_info->bdi);
+ if (setup_bdi(fs_info, &fs_info->bdi))
+ goto fail_bdi;
fs_info->btree_inode = new_inode(sb);
fs_info->btree_inode->i_ino = 1;
fs_info->btree_inode->i_nlink = 1;
@@ -1931,8 +1944,8 @@ fail_iput:
btrfs_close_devices(fs_info->fs_devices);
btrfs_mapping_tree_free(&fs_info->mapping_tree);
+fail_bdi:
bdi_destroy(&fs_info->bdi);
-
fail:
kfree(extent_root);
kfree(tree_root);
--
1.6.3.rc0.1.gf800
* [PATCH 03/11] writeback: move dirty inodes from super_block to backing_dev_info
From: Jens Axboe @ 2009-05-27 9:41 UTC
To: linux-kernel, linux-fsdevel
Cc: chris.mason, david, hch, akpm, jack, yanmin_zhang, richard,
damien.wyart, Jens Axboe
This is a first step toward introducing per-bdi flusher threads. There should
be no change in behaviour, although sb_has_dirty_inodes() is now ridiculously
expensive, since with per-bdi dirty lists there is no cheap way to answer that
question for a super_block. That is not a huge problem, as the function is
deleted in subsequent patches.
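To make the cost concrete, here is the shape of the change in miniature
(distilled from the diff below rather than new code; the example_* name is
illustrative):

	#define inode_to_bdi(inode)	((inode)->i_mapping->backing_dev_info)

	static void example_mark_dirty(struct inode *inode)
	{
		/* before: list_move(&inode->i_list, &inode->i_sb->s_dirty); */
		list_move(&inode->i_list, &inode_to_bdi(inode)->b_dirty);
	}

Answering "does this super_block have dirty inodes?" therefore now means
walking every bdi on bdi_list and scanning its b_dirty, b_io and b_more_io
lists for an inode whose ->i_sb matches, under inode_lock.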
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
fs/fs-writeback.c | 196 +++++++++++++++++++++++++++---------------
fs/super.c | 3 -
include/linux/backing-dev.h | 9 ++
include/linux/fs.h | 5 +-
mm/backing-dev.c | 24 +++++
mm/page-writeback.c | 11 +--
6 files changed, 164 insertions(+), 84 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 91013ff..1137408 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -25,6 +25,7 @@
#include <linux/buffer_head.h>
#include "internal.h"
+#define inode_to_bdi(inode) ((inode)->i_mapping->backing_dev_info)
/**
* writeback_acquire - attempt to get exclusive writeback access to a device
@@ -158,12 +159,13 @@ void __mark_inode_dirty(struct inode *inode, int flags)
goto out;
/*
- * If the inode was already on s_dirty/s_io/s_more_io, don't
- * reposition it (that would break s_dirty time-ordering).
+ * If the inode was already on b_dirty/b_io/b_more_io, don't
+ * reposition it (that would break b_dirty time-ordering).
*/
if (!was_dirty) {
inode->dirtied_when = jiffies;
- list_move(&inode->i_list, &sb->s_dirty);
+ list_move(&inode->i_list,
+ &inode_to_bdi(inode)->b_dirty);
}
}
out:
@@ -184,31 +186,30 @@ static int write_inode(struct inode *inode, int sync)
* furthest end of its superblock's dirty-inode list.
*
* Before stamping the inode's ->dirtied_when, we check to see whether it is
- * already the most-recently-dirtied inode on the s_dirty list. If that is
+ * already the most-recently-dirtied inode on the b_dirty list. If that is
* the case then the inode must have been redirtied while it was being written
* out and we don't reset its dirtied_when.
*/
static void redirty_tail(struct inode *inode)
{
- struct super_block *sb = inode->i_sb;
+ struct backing_dev_info *bdi = inode_to_bdi(inode);
- if (!list_empty(&sb->s_dirty)) {
- struct inode *tail_inode;
+ if (!list_empty(&bdi->b_dirty)) {
+ struct inode *tail;
- tail_inode = list_entry(sb->s_dirty.next, struct inode, i_list);
- if (time_before(inode->dirtied_when,
- tail_inode->dirtied_when))
+ tail = list_entry(bdi->b_dirty.next, struct inode, i_list);
+ if (time_before(inode->dirtied_when, tail->dirtied_when))
inode->dirtied_when = jiffies;
}
- list_move(&inode->i_list, &sb->s_dirty);
+ list_move(&inode->i_list, &bdi->b_dirty);
}
/*
- * requeue inode for re-scanning after sb->s_io list is exhausted.
+ * requeue inode for re-scanning after bdi->b_io list is exhausted.
*/
static void requeue_io(struct inode *inode)
{
- list_move(&inode->i_list, &inode->i_sb->s_more_io);
+ list_move(&inode->i_list, &inode_to_bdi(inode)->b_more_io);
}
static void inode_sync_complete(struct inode *inode)
@@ -255,18 +256,50 @@ static void move_expired_inodes(struct list_head *delaying_queue,
/*
* Queue all expired dirty inodes for io, eldest first.
*/
-static void queue_io(struct super_block *sb,
- unsigned long *older_than_this)
+static void queue_io(struct backing_dev_info *bdi,
+ unsigned long *older_than_this)
+{
+ list_splice_init(&bdi->b_more_io, bdi->b_io.prev);
+ move_expired_inodes(&bdi->b_dirty, &bdi->b_io, older_than_this);
+}
+
+static int sb_on_inode_list(struct super_block *sb, struct list_head *list)
{
- list_splice_init(&sb->s_more_io, sb->s_io.prev);
- move_expired_inodes(&sb->s_dirty, &sb->s_io, older_than_this);
+ struct inode *inode;
+ int ret = 0;
+
+ spin_lock(&inode_lock);
+ list_for_each_entry(inode, list, i_list) {
+ if (inode->i_sb == sb) {
+ ret = 1;
+ break;
+ }
+ }
+ spin_unlock(&inode_lock);
+ return ret;
}
int sb_has_dirty_inodes(struct super_block *sb)
{
- return !list_empty(&sb->s_dirty) ||
- !list_empty(&sb->s_io) ||
- !list_empty(&sb->s_more_io);
+ struct backing_dev_info *bdi;
+ int ret = 0;
+
+ /*
+ * This is REALLY expensive right now, but it'll go away
+ * when the bdi writeback is introduced
+ */
+ mutex_lock(&bdi_lock);
+ list_for_each_entry(bdi, &bdi_list, bdi_list) {
+ if (sb_on_inode_list(sb, &bdi->b_dirty) ||
+ sb_on_inode_list(sb, &bdi->b_io) ||
+ sb_on_inode_list(sb, &bdi->b_more_io)) {
+ ret = 1;
+ break;
+ }
+ }
+ mutex_unlock(&bdi_lock);
+
+ return ret;
}
EXPORT_SYMBOL(sb_has_dirty_inodes);
@@ -322,11 +355,11 @@ __sync_single_inode(struct inode *inode, struct writeback_control *wbc)
/*
* We didn't write back all the pages. nfs_writepages()
* sometimes bales out without doing anything. Redirty
- * the inode; Move it from s_io onto s_more_io/s_dirty.
+ * the inode; Move it from b_io onto b_more_io/b_dirty.
*/
/*
* akpm: if the caller was the kupdate function we put
- * this inode at the head of s_dirty so it gets first
+ * this inode at the head of b_dirty so it gets first
* consideration. Otherwise, move it to the tail, for
* the reasons described there. I'm not really sure
* how much sense this makes. Presumably I had a good
@@ -336,7 +369,7 @@ __sync_single_inode(struct inode *inode, struct writeback_control *wbc)
if (wbc->for_kupdate) {
/*
* For the kupdate function we move the inode
- * to s_more_io so it will get more writeout as
+ * to b_more_io so it will get more writeout as
* soon as the queue becomes uncongested.
*/
inode->i_state |= I_DIRTY_PAGES;
@@ -402,10 +435,10 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
if ((wbc->sync_mode != WB_SYNC_ALL) && (inode->i_state & I_SYNC)) {
/*
* We're skipping this inode because it's locked, and we're not
- * doing writeback-for-data-integrity. Move it to s_more_io so
- * that writeback can proceed with the other inodes on s_io.
+ * doing writeback-for-data-integrity. Move it to b_more_io so
+ * that writeback can proceed with the other inodes on b_io.
* We'll have another go at writing back this inode when we
- * completed a full scan of s_io.
+ * completed a full scan of b_io.
*/
requeue_io(inode);
return 0;
@@ -428,51 +461,34 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
return __sync_single_inode(inode, wbc);
}
-/*
- * Write out a superblock's list of dirty inodes. A wait will be performed
- * upon no inodes, all inodes or the final one, depending upon sync_mode.
- *
- * If older_than_this is non-NULL, then only write out inodes which
- * had their first dirtying at a time earlier than *older_than_this.
- *
- * If we're a pdflush thread, then implement pdflush collision avoidance
- * against the entire list.
- *
- * If `bdi' is non-zero then we're being asked to writeback a specific queue.
- * This function assumes that the blockdev superblock's inodes are backed by
- * a variety of queues, so all inodes are searched. For other superblocks,
- * assume that all inodes are backed by the same queue.
- *
- * FIXME: this linear search could get expensive with many fileystems. But
- * how to fix? We need to go from an address_space to all inodes which share
- * a queue with that address_space. (Easy: have a global "dirty superblocks"
- * list).
- *
- * The inodes to be written are parked on sb->s_io. They are moved back onto
- * sb->s_dirty as they are selected for writing. This way, none can be missed
- * on the writer throttling path, and we get decent balancing between many
- * throttled threads: we don't want them all piling up on inode_sync_wait.
- */
-void generic_sync_sb_inodes(struct super_block *sb,
- struct writeback_control *wbc)
+static void generic_sync_bdi_inodes(struct backing_dev_info *bdi,
+ struct writeback_control *wbc,
+ struct super_block *sb,
+ int is_blkdev_sb)
{
const unsigned long start = jiffies; /* livelock avoidance */
- int sync = wbc->sync_mode == WB_SYNC_ALL;
spin_lock(&inode_lock);
- if (!wbc->for_kupdate || list_empty(&sb->s_io))
- queue_io(sb, wbc->older_than_this);
- while (!list_empty(&sb->s_io)) {
- struct inode *inode = list_entry(sb->s_io.prev,
+ if (!wbc->for_kupdate || list_empty(&bdi->b_io))
+ queue_io(bdi, wbc->older_than_this);
+
+ while (!list_empty(&bdi->b_io)) {
+ struct inode *inode = list_entry(bdi->b_io.prev,
struct inode, i_list);
- struct address_space *mapping = inode->i_mapping;
- struct backing_dev_info *bdi = mapping->backing_dev_info;
long pages_skipped;
+ /*
+ * super block given and doesn't match, skip this inode
+ */
+ if (sb && sb != inode->i_sb) {
+ redirty_tail(inode);
+ continue;
+ }
+
if (!bdi_cap_writeback_dirty(bdi)) {
redirty_tail(inode);
- if (sb_is_blkdev_sb(sb)) {
+ if (is_blkdev_sb) {
/*
* Dirty memory-backed blockdev: the ramdisk
* driver does this. Skip just this inode
@@ -494,14 +510,14 @@ void generic_sync_sb_inodes(struct super_block *sb,
if (wbc->nonblocking && bdi_write_congested(bdi)) {
wbc->encountered_congestion = 1;
- if (!sb_is_blkdev_sb(sb))
+ if (!is_blkdev_sb)
break; /* Skip a congested fs */
requeue_io(inode);
continue; /* Skip a congested blockdev */
}
if (wbc->bdi && bdi != wbc->bdi) {
- if (!sb_is_blkdev_sb(sb))
+ if (!is_blkdev_sb)
break; /* fs has the wrong queue */
requeue_io(inode);
continue; /* blockdev has wrong queue */
@@ -539,13 +555,55 @@ void generic_sync_sb_inodes(struct super_block *sb,
wbc->more_io = 1;
break;
}
- if (!list_empty(&sb->s_more_io))
+ if (!list_empty(&bdi->b_more_io))
wbc->more_io = 1;
}
- if (sync) {
+ spin_unlock(&inode_lock);
+ /* Leave any unwritten inodes on b_io */
+}
+
+/*
+ * Write out a superblock's list of dirty inodes. A wait will be performed
+ * upon no inodes, all inodes or the final one, depending upon sync_mode.
+ *
+ * If older_than_this is non-NULL, then only write out inodes which
+ * had their first dirtying at a time earlier than *older_than_this.
+ *
+ * If we're a pdlfush thread, then implement pdflush collision avoidance
+ * against the entire list.
+ *
+ * If `bdi' is non-zero then we're being asked to writeback a specific queue.
+ * This function assumes that the blockdev superblock's inodes are backed by
+ * a variety of queues, so all inodes are searched. For other superblocks,
+ * assume that all inodes are backed by the same queue.
+ *
+ * FIXME: this linear search could get expensive with many fileystems. But
+ * how to fix? We need to go from an address_space to all inodes which share
+ * a queue with that address_space. (Easy: have a global "dirty superblocks"
+ * list).
+ *
+ * The inodes to be written are parked on bdi->b_io. They are moved back onto
+ * bdi->b_dirty as they are selected for writing. This way, none can be missed
+ * on the writer throttling path, and we get decent balancing between many
+ * throttled threads: we don't want them all piling up on inode_sync_wait.
+ */
+void generic_sync_sb_inodes(struct super_block *sb,
+ struct writeback_control *wbc)
+{
+ const int is_blkdev_sb = sb_is_blkdev_sb(sb);
+ struct backing_dev_info *bdi;
+
+ mutex_lock(&bdi_lock);
+ list_for_each_entry(bdi, &bdi_list, bdi_list)
+ generic_sync_bdi_inodes(bdi, wbc, sb, is_blkdev_sb);
+ mutex_unlock(&bdi_lock);
+
+ if (wbc->sync_mode == WB_SYNC_ALL) {
struct inode *inode, *old_inode = NULL;
+ spin_lock(&inode_lock);
+
/*
* Data integrity sync. Must wait for all pages under writeback,
* because there may have been pages dirtied before our sync
@@ -583,10 +641,8 @@ void generic_sync_sb_inodes(struct super_block *sb,
}
spin_unlock(&inode_lock);
iput(old_inode);
- } else
- spin_unlock(&inode_lock);
+ }
- return; /* Leave any unwritten inodes on s_io */
}
EXPORT_SYMBOL_GPL(generic_sync_sb_inodes);
@@ -601,8 +657,8 @@ static void sync_sb_inodes(struct super_block *sb,
*
* Note:
* We don't need to grab a reference to superblock here. If it has non-empty
- * ->s_dirty it's hadn't been killed yet and kill_super() won't proceed
- * past sync_inodes_sb() until the ->s_dirty/s_io/s_more_io lists are all
+ * ->b_dirty it's hadn't been killed yet and kill_super() won't proceed
+ * past sync_inodes_sb() until the ->b_dirty/b_io/b_more_io lists are all
* empty. Since __sync_single_inode() regains inode_lock before it finally moves
* inode from superblock lists we are OK.
*
diff --git a/fs/super.c b/fs/super.c
index 1943fdf..76dd5b2 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -64,9 +64,6 @@ static struct super_block *alloc_super(struct file_system_type *type)
s = NULL;
goto out;
}
- INIT_LIST_HEAD(&s->s_dirty);
- INIT_LIST_HEAD(&s->s_io);
- INIT_LIST_HEAD(&s->s_more_io);
INIT_LIST_HEAD(&s->s_files);
INIT_LIST_HEAD(&s->s_instances);
INIT_HLIST_HEAD(&s->s_anon);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 0ec2c59..8719c87 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -40,6 +40,8 @@ enum bdi_stat_item {
#define BDI_STAT_BATCH (8*(1+ilog2(nr_cpu_ids)))
struct backing_dev_info {
+ struct list_head bdi_list;
+
unsigned long ra_pages; /* max readahead in PAGE_CACHE_SIZE units */
unsigned long state; /* Always use atomic bitops on this */
unsigned int capabilities; /* Device capabilities */
@@ -58,6 +60,10 @@ struct backing_dev_info {
struct device *dev;
+ struct list_head b_dirty; /* dirty inodes */
+ struct list_head b_io; /* parked for writeback */
+ struct list_head b_more_io; /* parked for more writeback */
+
#ifdef CONFIG_DEBUG_FS
struct dentry *debug_dir;
struct dentry *debug_stats;
@@ -72,6 +78,9 @@ int bdi_register(struct backing_dev_info *bdi, struct device *parent,
int bdi_register_dev(struct backing_dev_info *bdi, dev_t dev);
void bdi_unregister(struct backing_dev_info *bdi);
+extern struct mutex bdi_lock;
+extern struct list_head bdi_list;
+
static inline void __add_bdi_stat(struct backing_dev_info *bdi,
enum bdi_stat_item item, s64 amount)
{
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3b534e5..6b475d4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -712,7 +712,7 @@ static inline int mapping_writably_mapped(struct address_space *mapping)
struct inode {
struct hlist_node i_hash;
- struct list_head i_list;
+ struct list_head i_list; /* backing dev IO list */
struct list_head i_sb_list;
struct list_head i_dentry;
unsigned long i_ino;
@@ -1329,9 +1329,6 @@ struct super_block {
struct xattr_handler **s_xattr;
struct list_head s_inodes; /* all inodes */
- struct list_head s_dirty; /* dirty inodes */
- struct list_head s_io; /* parked for writeback */
- struct list_head s_more_io; /* parked for more writeback */
struct hlist_head s_anon; /* anonymous dentries for (nfs) exporting */
struct list_head s_files;
/* s_dentry_lru and s_nr_dentry_unused are protected by dcache_lock */
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 493b468..de0bbfe 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -22,6 +22,8 @@ struct backing_dev_info default_backing_dev_info = {
EXPORT_SYMBOL_GPL(default_backing_dev_info);
static struct class *bdi_class;
+DEFINE_MUTEX(bdi_lock);
+LIST_HEAD(bdi_list);
#ifdef CONFIG_DEBUG_FS
#include <linux/debugfs.h>
@@ -211,6 +213,10 @@ int bdi_register(struct backing_dev_info *bdi, struct device *parent,
goto exit;
}
+ mutex_lock(&bdi_lock);
+ list_add_tail(&bdi->bdi_list, &bdi_list);
+ mutex_unlock(&bdi_lock);
+
bdi->dev = dev;
bdi_debug_register(bdi, dev_name(dev));
@@ -225,9 +231,17 @@ int bdi_register_dev(struct backing_dev_info *bdi, dev_t dev)
}
EXPORT_SYMBOL(bdi_register_dev);
+static void bdi_remove_from_list(struct backing_dev_info *bdi)
+{
+ mutex_lock(&bdi_lock);
+ list_del(&bdi->bdi_list);
+ mutex_unlock(&bdi_lock);
+}
+
void bdi_unregister(struct backing_dev_info *bdi)
{
if (bdi->dev) {
+ bdi_remove_from_list(bdi);
bdi_debug_unregister(bdi);
device_unregister(bdi->dev);
bdi->dev = NULL;
@@ -245,6 +259,10 @@ int bdi_init(struct backing_dev_info *bdi)
bdi->min_ratio = 0;
bdi->max_ratio = 100;
bdi->max_prop_frac = PROP_FRAC_BASE;
+ INIT_LIST_HEAD(&bdi->bdi_list);
+ INIT_LIST_HEAD(&bdi->b_io);
+ INIT_LIST_HEAD(&bdi->b_dirty);
+ INIT_LIST_HEAD(&bdi->b_more_io);
for (i = 0; i < NR_BDI_STAT_ITEMS; i++) {
err = percpu_counter_init(&bdi->bdi_stat[i], 0);
@@ -259,6 +277,8 @@ int bdi_init(struct backing_dev_info *bdi)
err:
while (i--)
percpu_counter_destroy(&bdi->bdi_stat[i]);
+
+ bdi_remove_from_list(bdi);
}
return err;
@@ -269,6 +289,10 @@ void bdi_destroy(struct backing_dev_info *bdi)
{
int i;
+ WARN_ON(!list_empty(&bdi->b_dirty));
+ WARN_ON(!list_empty(&bdi->b_io));
+ WARN_ON(!list_empty(&bdi->b_more_io));
+
bdi_unregister(bdi);
for (i = 0; i < NR_BDI_STAT_ITEMS; i++)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index bb553c3..7c44314 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -319,15 +319,13 @@ static void task_dirty_limit(struct task_struct *tsk, long *pdirty)
/*
*
*/
-static DEFINE_SPINLOCK(bdi_lock);
static unsigned int bdi_min_ratio;
int bdi_set_min_ratio(struct backing_dev_info *bdi, unsigned int min_ratio)
{
int ret = 0;
- unsigned long flags;
- spin_lock_irqsave(&bdi_lock, flags);
+ mutex_lock(&bdi_lock);
if (min_ratio > bdi->max_ratio) {
ret = -EINVAL;
} else {
@@ -339,27 +337,26 @@ int bdi_set_min_ratio(struct backing_dev_info *bdi, unsigned int min_ratio)
ret = -EINVAL;
}
}
- spin_unlock_irqrestore(&bdi_lock, flags);
+ mutex_unlock(&bdi_lock);
return ret;
}
int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned max_ratio)
{
- unsigned long flags;
int ret = 0;
if (max_ratio > 100)
return -EINVAL;
- spin_lock_irqsave(&bdi_lock, flags);
+ mutex_lock(&bdi_lock);
if (bdi->min_ratio > max_ratio) {
ret = -EINVAL;
} else {
bdi->max_ratio = max_ratio;
bdi->max_prop_frac = (PROP_FRAC_BASE * max_ratio) / 100;
}
- spin_unlock_irqrestore(&bdi_lock, flags);
+ mutex_unlock(&bdi_lock);
return ret;
}
--
1.6.3.rc0.1.gf800
* [PATCH 04/11] writeback: switch to per-bdi threads for flushing data
From: Jens Axboe @ 2009-05-27 9:41 UTC
To: linux-kernel, linux-fsdevel
Cc: chris.mason, david, hch, akpm, jack, yanmin_zhang, richard,
damien.wyart, Jens Axboe
This gets rid of pdflush for bdi writeout and kupdated style cleaning.
This is an experiment to see if we get better writeout behaviour with
per-bdi flushing. Some initial tests look pretty encouraging. A sample
ffsb workload that does random writes to files is about 8% faster here
on a simple SATA drive during the benchmark phase. File layout also looks
a lot smoother in vmstat:
r b swpd free buff cache si so bi bo in cs us sy id wa
0 1 0 608848 2652 375372 0 0 0 71024 604 24 1 10 48 42
0 1 0 549644 2712 433736 0 0 0 60692 505 27 1 8 48 44
1 0 0 476928 2784 505192 0 0 4 29540 553 24 0 9 53 37
0 1 0 457972 2808 524008 0 0 0 54876 331 16 0 4 38 58
0 1 0 366128 2928 614284 0 0 4 92168 710 58 0 13 53 34
0 1 0 295092 3000 684140 0 0 0 62924 572 23 0 9 53 37
0 1 0 236592 3064 741704 0 0 4 58256 523 17 0 8 48 44
0 1 0 165608 3132 811464 0 0 0 57460 560 21 0 8 54 38
0 1 0 102952 3200 873164 0 0 4 74748 540 29 1 10 48 41
0 1 0 48604 3252 926472 0 0 0 53248 469 29 0 7 47 45
where vanilla tends to fluctuate a lot in the creation phase:
r b swpd free buff cache si so bi bo in cs us sy id wa
1 1 0 678716 5792 303380 0 0 0 74064 565 50 1 11 52 36
1 0 0 662488 5864 319396 0 0 4 352 302 329 0 2 47 51
0 1 0 599312 5924 381468 0 0 0 78164 516 55 0 9 51 40
0 1 0 519952 6008 459516 0 0 4 78156 622 56 1 11 52 37
1 1 0 436640 6092 541632 0 0 0 82244 622 54 0 11 48 41
0 1 0 436640 6092 541660 0 0 0 8 152 39 0 0 51 49
0 1 0 332224 6200 644252 0 0 4 102800 728 46 1 13 49 36
1 0 0 274492 6260 701056 0 0 4 12328 459 49 0 7 50 43
0 1 0 211220 6324 763356 0 0 0 106940 515 37 1 10 51 39
1 0 0 160412 6376 813468 0 0 0 8224 415 43 0 6 49 45
1 1 0 85980 6452 886556 0 0 4 113516 575 39 1 11 54 34
0 2 0 85968 6452 886620 0 0 0 1640 158 211 0 0 46 54
So apart from seemingly behaving better for buffered writeout, this also
allows us to potentially have more than one bdi thread flushing out data.
This may be useful for NUMA-type setups.
A 10-disk test with btrfs performs 26% faster with per-bdi flushing. Other
tests are pending.
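In condensed form, the control flow of the new per-bdi thread looks like the
sketch below (distilled from bdi_writeback_task(); freezing and wait-queue
bookkeeping omitted, and wait_for_timeout_or_wakeup() is a stand-in for the
prepare_to_wait()/schedule_timeout() sequence):

	static int bdi_writeback_task_sketch(struct backing_dev_info *bdi)
	{
		while (!kthread_should_stop()) {
			/* sleep for dirty_writeback_interval, or until woken */
			wait_for_timeout_or_wakeup(bdi);

			if (writeback_acquire(bdi)) {
				/* timer expired: kupdated-style aging writeout */
				bdi_kupdated(bdi);
			} else {
				/*
				 * bdi_start_writeback() already holds BDI_pdflush
				 * and filled bdi->wb_arg: pdflush-style writeout
				 */
				bdi_pdflush(bdi);
			}
			writeback_release(bdi);
		}
		return 0;
	}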
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
fs/buffer.c | 2 +-
fs/fs-writeback.c | 313 ++++++++++++++++++++++++++-----------------
fs/sync.c | 2 +-
include/linux/backing-dev.h | 29 ++++
include/linux/fs.h | 3 +-
include/linux/writeback.h | 2 +-
mm/backing-dev.c | 189 +++++++++++++++++++++++++-
mm/page-writeback.c | 140 +------------------
mm/vmscan.c | 2 +-
9 files changed, 415 insertions(+), 267 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index aed2977..14f0802 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -281,7 +281,7 @@ static void free_more_memory(void)
struct zone *zone;
int nid;
- wakeup_pdflush(1024);
+ wakeup_flusher_threads(1024);
yield();
for_each_online_node(nid) {
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 1137408..5d99b12 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -19,6 +19,8 @@
#include <linux/sched.h>
#include <linux/fs.h>
#include <linux/mm.h>
+#include <linux/kthread.h>
+#include <linux/freezer.h>
#include <linux/writeback.h>
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
@@ -61,10 +63,190 @@ int writeback_in_progress(struct backing_dev_info *bdi)
*/
static void writeback_release(struct backing_dev_info *bdi)
{
- BUG_ON(!writeback_in_progress(bdi));
+ WARN_ON_ONCE(!writeback_in_progress(bdi));
+ bdi->wb_arg.nr_pages = 0;
+ bdi->wb_arg.sb = NULL;
clear_bit(BDI_pdflush, &bdi->state);
}
+int bdi_start_writeback(struct backing_dev_info *bdi, struct super_block *sb,
+ long nr_pages, enum writeback_sync_modes sync_mode)
+{
+ /*
+ * This only happens the first time someone kicks this bdi, so put
+ * it out-of-line.
+ */
+ if (unlikely(!bdi->task)) {
+ bdi_add_default_flusher_task(bdi);
+ return 1;
+ }
+
+ if (writeback_acquire(bdi)) {
+ bdi->wb_arg.nr_pages = nr_pages;
+ bdi->wb_arg.sb = sb;
+ bdi->wb_arg.sync_mode = sync_mode;
+ /*
+ * make above store seen before the task is woken
+ */
+ smp_mb();
+ wake_up(&bdi->wait);
+ }
+
+ return 0;
+}
+
+/*
+ * The maximum number of pages to writeout in a single bdi flush/kupdate
+ * operation. We do this so we don't hold I_SYNC against an inode for
+ * enormous amounts of time, which would block a userspace task which has
+ * been forced to throttle against that inode. Also, the code reevaluates
+ * the dirty each time it has written this many pages.
+ */
+#define MAX_WRITEBACK_PAGES 1024
+
+/*
+ * Periodic writeback of "old" data.
+ *
+ * Define "old": the first time one of an inode's pages is dirtied, we mark the
+ * dirtying-time in the inode's address_space. So this periodic writeback code
+ * just walks the superblock inode list, writing back any inodes which are
+ * older than a specific point in time.
+ *
+ * Try to run once per dirty_writeback_interval. But if a writeback event
+ * takes longer than a dirty_writeback_interval interval, then leave a
+ * one-second gap.
+ *
+ * older_than_this takes precedence over nr_to_write. So we'll only write back
+ * all dirty pages if they are all attached to "old" mappings.
+ */
+static void bdi_kupdated(struct backing_dev_info *bdi)
+{
+ unsigned long oldest_jif;
+ long nr_to_write;
+ struct writeback_control wbc = {
+ .bdi = bdi,
+ .sync_mode = WB_SYNC_NONE,
+ .older_than_this = &oldest_jif,
+ .nr_to_write = 0,
+ .for_kupdate = 1,
+ .range_cyclic = 1,
+ };
+
+ oldest_jif = jiffies - msecs_to_jiffies(dirty_expire_interval * 10);
+
+ nr_to_write = global_page_state(NR_FILE_DIRTY) +
+ global_page_state(NR_UNSTABLE_NFS) +
+ (inodes_stat.nr_inodes - inodes_stat.nr_unused);
+
+ while (nr_to_write > 0) {
+ wbc.more_io = 0;
+ wbc.encountered_congestion = 0;
+ wbc.nr_to_write = MAX_WRITEBACK_PAGES;
+ generic_sync_bdi_inodes(NULL, &wbc);
+ if (wbc.nr_to_write > 0)
+ break; /* All the old data is written */
+ nr_to_write -= MAX_WRITEBACK_PAGES;
+ }
+}
+
+static inline bool over_bground_thresh(void)
+{
+ unsigned long background_thresh, dirty_thresh;
+
+ get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
+
+ return (global_page_state(NR_FILE_DIRTY) +
+ global_page_state(NR_UNSTABLE_NFS) >= background_thresh);
+}
+
+static void bdi_pdflush(struct backing_dev_info *bdi)
+{
+ struct writeback_control wbc = {
+ .bdi = bdi,
+ .sync_mode = bdi->wb_arg.sync_mode,
+ .older_than_this = NULL,
+ .range_cyclic = 1,
+ };
+ long nr_pages = bdi->wb_arg.nr_pages;
+
+ for (;;) {
+ if (wbc.sync_mode == WB_SYNC_NONE && nr_pages <= 0 &&
+ !over_bground_thresh())
+ break;
+
+ wbc.more_io = 0;
+ wbc.encountered_congestion = 0;
+ wbc.nr_to_write = MAX_WRITEBACK_PAGES;
+ wbc.pages_skipped = 0;
+ generic_sync_bdi_inodes(bdi->wb_arg.sb, &wbc);
+ nr_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
+ /*
+ * If we ran out of stuff to write, bail unless more_io got set
+ */
+ if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
+ if (wbc.more_io)
+ continue;
+ break;
+ }
+ }
+}
+
+/*
+ * Handle writeback of dirty data for the device backed by this bdi. Also
+ * wakes up periodically and does kupdated style flushing.
+ */
+int bdi_writeback_task(struct backing_dev_info *bdi)
+{
+ while (!kthread_should_stop()) {
+ unsigned long wait_jiffies;
+ DEFINE_WAIT(wait);
+
+ prepare_to_wait(&bdi->wait, &wait, TASK_INTERRUPTIBLE);
+ wait_jiffies = msecs_to_jiffies(dirty_writeback_interval * 10);
+ schedule_timeout(wait_jiffies);
+ try_to_freeze();
+
+ /*
+ * We get here in two cases:
+ *
+ * schedule_timeout() returned because the dirty writeback
+ * interval has elapsed. If that happens, we will be able
+ * to acquire the writeback lock and will proceed to do
+ * kupdated style writeout.
+ *
+ * Someone called bdi_start_writeback(), which will acquire
+ * the writeback lock. This means our writeback_acquire()
+ * below will fail and we call into bdi_pdflush() for
+ * pdflush style writeout.
+ *
+ */
+ if (writeback_acquire(bdi))
+ bdi_kupdated(bdi);
+ else
+ bdi_pdflush(bdi);
+
+ writeback_release(bdi);
+ finish_wait(&bdi->wait, &wait);
+ }
+
+ return 0;
+}
+
+void bdi_writeback_all(struct super_block *sb, struct writeback_control *wbc)
+{
+ struct backing_dev_info *bdi, *tmp;
+
+ mutex_lock(&bdi_lock);
+
+ list_for_each_entry_safe(bdi, tmp, &bdi_list, bdi_list) {
+ if (!bdi_has_dirty_io(bdi))
+ continue;
+ bdi_start_writeback(bdi, sb, wbc->nr_to_write, wbc->sync_mode);
+ }
+
+ mutex_unlock(&bdi_lock);
+}
+
/**
* __mark_inode_dirty - internal function
* @inode: inode to mark
@@ -263,46 +445,6 @@ static void queue_io(struct backing_dev_info *bdi,
move_expired_inodes(&bdi->b_dirty, &bdi->b_io, older_than_this);
}
-static int sb_on_inode_list(struct super_block *sb, struct list_head *list)
-{
- struct inode *inode;
- int ret = 0;
-
- spin_lock(&inode_lock);
- list_for_each_entry(inode, list, i_list) {
- if (inode->i_sb == sb) {
- ret = 1;
- break;
- }
- }
- spin_unlock(&inode_lock);
- return ret;
-}
-
-int sb_has_dirty_inodes(struct super_block *sb)
-{
- struct backing_dev_info *bdi;
- int ret = 0;
-
- /*
- * This is REALLY expensive right now, but it'll go away
- * when the bdi writeback is introduced
- */
- mutex_lock(&bdi_lock);
- list_for_each_entry(bdi, &bdi_list, bdi_list) {
- if (sb_on_inode_list(sb, &bdi->b_dirty) ||
- sb_on_inode_list(sb, &bdi->b_io) ||
- sb_on_inode_list(sb, &bdi->b_more_io)) {
- ret = 1;
- break;
- }
- }
- mutex_unlock(&bdi_lock);
-
- return ret;
-}
-EXPORT_SYMBOL(sb_has_dirty_inodes);
-
/*
* Write a single inode's dirty pages and inode data out to disk.
* If `wait' is set, wait on the writeout.
@@ -461,11 +603,11 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
return __sync_single_inode(inode, wbc);
}
-static void generic_sync_bdi_inodes(struct backing_dev_info *bdi,
- struct writeback_control *wbc,
- struct super_block *sb,
- int is_blkdev_sb)
+void generic_sync_bdi_inodes(struct super_block *sb,
+ struct writeback_control *wbc)
{
+ const int is_blkdev_sb = sb_is_blkdev_sb(sb);
+ struct backing_dev_info *bdi = wbc->bdi;
const unsigned long start = jiffies; /* livelock avoidance */
spin_lock(&inode_lock);
@@ -516,13 +658,6 @@ static void generic_sync_bdi_inodes(struct backing_dev_info *bdi,
continue; /* Skip a congested blockdev */
}
- if (wbc->bdi && bdi != wbc->bdi) {
- if (!is_blkdev_sb)
- break; /* fs has the wrong queue */
- requeue_io(inode);
- continue; /* blockdev has wrong queue */
- }
-
/*
* Was this inode dirtied after sync_sb_inodes was called?
* This keeps sync from extra jobs and livelock.
@@ -530,16 +665,10 @@ static void generic_sync_bdi_inodes(struct backing_dev_info *bdi,
if (inode_dirtied_after(inode, start))
break;
- /* Is another pdflush already flushing this queue? */
- if (current_is_pdflush() && !writeback_acquire(bdi))
- break;
-
BUG_ON(inode->i_state & I_FREEING);
__iget(inode);
pages_skipped = wbc->pages_skipped;
__writeback_single_inode(inode, wbc);
- if (current_is_pdflush())
- writeback_release(bdi);
if (wbc->pages_skipped != pages_skipped) {
/*
* writeback is not making progress due to locked
@@ -578,11 +707,6 @@ static void generic_sync_bdi_inodes(struct backing_dev_info *bdi,
* a variety of queues, so all inodes are searched. For other superblocks,
* assume that all inodes are backed by the same queue.
*
- * FIXME: this linear search could get expensive with many fileystems. But
- * how to fix? We need to go from an address_space to all inodes which share
- * a queue with that address_space. (Easy: have a global "dirty superblocks"
- * list).
- *
* The inodes to be written are parked on bdi->b_io. They are moved back onto
* bdi->b_dirty as they are selected for writing. This way, none can be missed
* on the writer throttling path, and we get decent balancing between many
@@ -591,13 +715,10 @@ static void generic_sync_bdi_inodes(struct backing_dev_info *bdi,
void generic_sync_sb_inodes(struct super_block *sb,
struct writeback_control *wbc)
{
- const int is_blkdev_sb = sb_is_blkdev_sb(sb);
- struct backing_dev_info *bdi;
-
- mutex_lock(&bdi_lock);
- list_for_each_entry(bdi, &bdi_list, bdi_list)
- generic_sync_bdi_inodes(bdi, wbc, sb, is_blkdev_sb);
- mutex_unlock(&bdi_lock);
+ if (wbc->bdi)
+ generic_sync_bdi_inodes(sb, wbc);
+ else
+ bdi_writeback_all(sb, wbc);
if (wbc->sync_mode == WB_SYNC_ALL) {
struct inode *inode, *old_inode = NULL;
@@ -653,58 +774,6 @@ static void sync_sb_inodes(struct super_block *sb,
}
/*
- * Start writeback of dirty pagecache data against all unlocked inodes.
- *
- * Note:
- * We don't need to grab a reference to superblock here. If it has non-empty
- * ->b_dirty it's hadn't been killed yet and kill_super() won't proceed
- * past sync_inodes_sb() until the ->b_dirty/b_io/b_more_io lists are all
- * empty. Since __sync_single_inode() regains inode_lock before it finally moves
- * inode from superblock lists we are OK.
- *
- * If `older_than_this' is non-zero then only flush inodes which have a
- * flushtime older than *older_than_this.
- *
- * If `bdi' is non-zero then we will scan the first inode against each
- * superblock until we find the matching ones. One group will be the dirty
- * inodes against a filesystem. Then when we hit the dummy blockdev superblock,
- * sync_sb_inodes will seekout the blockdev which matches `bdi'. Maybe not
- * super-efficient but we're about to do a ton of I/O...
- */
-void
-writeback_inodes(struct writeback_control *wbc)
-{
- struct super_block *sb;
-
- might_sleep();
- spin_lock(&sb_lock);
-restart:
- list_for_each_entry_reverse(sb, &super_blocks, s_list) {
- if (sb_has_dirty_inodes(sb)) {
- /* we're making our own get_super here */
- sb->s_count++;
- spin_unlock(&sb_lock);
- /*
- * If we can't get the readlock, there's no sense in
- * waiting around, most of the time the FS is going to
- * be unmounted by the time it is released.
- */
- if (down_read_trylock(&sb->s_umount)) {
- if (sb->s_root)
- sync_sb_inodes(sb, wbc);
- up_read(&sb->s_umount);
- }
- spin_lock(&sb_lock);
- if (__put_super_and_need_restart(sb))
- goto restart;
- }
- if (wbc->nr_to_write <= 0)
- break;
- }
- spin_unlock(&sb_lock);
-}
-
-/*
* writeback and wait upon the filesystem's dirty inodes. The caller will
* do this in two passes - one to write, and one to wait.
*
diff --git a/fs/sync.c b/fs/sync.c
index 7abc65f..3887f10 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -23,7 +23,7 @@
*/
static void do_sync(unsigned long wait)
{
- wakeup_pdflush(0);
+ wakeup_flusher_threads(0);
sync_inodes(0); /* All mappings, inodes and their blockdevs */
vfs_dq_sync(NULL);
sync_supers(); /* Write the superblocks */
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 8719c87..9f040a9 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -13,6 +13,7 @@
#include <linux/proportions.h>
#include <linux/kernel.h>
#include <linux/fs.h>
+#include <linux/writeback.h>
#include <asm/atomic.h>
struct page;
@@ -24,6 +25,7 @@ struct dentry;
*/
enum bdi_state {
BDI_pdflush, /* A pdflush thread is working this device */
+ BDI_pending, /* On its way to being activated */
BDI_async_congested, /* The async (write) queue is getting full */
BDI_sync_congested, /* The sync queue is getting full */
BDI_unused, /* Available bits start here */
@@ -39,6 +41,12 @@ enum bdi_stat_item {
#define BDI_STAT_BATCH (8*(1+ilog2(nr_cpu_ids)))
+struct bdi_writeback_arg {
+ unsigned long nr_pages;
+ struct super_block *sb;
+ enum writeback_sync_modes sync_mode;
+};
+
struct backing_dev_info {
struct list_head bdi_list;
@@ -60,6 +68,9 @@ struct backing_dev_info {
struct device *dev;
+ struct task_struct *task; /* writeback task */
+ wait_queue_head_t wait;
+ struct bdi_writeback_arg wb_arg; /* protected by BDI_pdflush */
struct list_head b_dirty; /* dirty inodes */
struct list_head b_io; /* parked for writeback */
struct list_head b_more_io; /* parked for more writeback */
@@ -77,10 +88,22 @@ int bdi_register(struct backing_dev_info *bdi, struct device *parent,
const char *fmt, ...);
int bdi_register_dev(struct backing_dev_info *bdi, dev_t dev);
void bdi_unregister(struct backing_dev_info *bdi);
+int bdi_start_writeback(struct backing_dev_info *bdi, struct super_block *sb,
+ long nr_pages, enum writeback_sync_modes sync_mode);
+int bdi_writeback_task(struct backing_dev_info *bdi);
+void bdi_writeback_all(struct super_block *sb, struct writeback_control *wbc);
+void bdi_add_default_flusher_task(struct backing_dev_info *bdi);
extern struct mutex bdi_lock;
extern struct list_head bdi_list;
+static inline int bdi_has_dirty_io(struct backing_dev_info *bdi)
+{
+ return !list_empty(&bdi->b_dirty) ||
+ !list_empty(&bdi->b_io) ||
+ !list_empty(&bdi->b_more_io);
+}
+
static inline void __add_bdi_stat(struct backing_dev_info *bdi,
enum bdi_stat_item item, s64 amount)
{
@@ -196,6 +219,7 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio);
#define BDI_CAP_EXEC_MAP 0x00000040
#define BDI_CAP_NO_ACCT_WB 0x00000080
#define BDI_CAP_SWAP_BACKED 0x00000100
+#define BDI_CAP_FLUSH_FORKER 0x00000200
#define BDI_CAP_VMFLAGS \
(BDI_CAP_READ_MAP | BDI_CAP_WRITE_MAP | BDI_CAP_EXEC_MAP)
@@ -265,6 +289,11 @@ static inline bool bdi_cap_swap_backed(struct backing_dev_info *bdi)
return bdi->capabilities & BDI_CAP_SWAP_BACKED;
}
+static inline bool bdi_cap_flush_forker(struct backing_dev_info *bdi)
+{
+ return bdi->capabilities & BDI_CAP_FLUSH_FORKER;
+}
+
static inline bool mapping_cap_writeback_dirty(struct address_space *mapping)
{
return bdi_cap_writeback_dirty(mapping->backing_dev_info);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6b475d4..ecdc544 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2063,6 +2063,8 @@ extern int invalidate_inode_pages2_range(struct address_space *mapping,
pgoff_t start, pgoff_t end);
extern void generic_sync_sb_inodes(struct super_block *sb,
struct writeback_control *wbc);
+extern void generic_sync_bdi_inodes(struct super_block *sb,
+ struct writeback_control *);
extern int write_inode_now(struct inode *, int);
extern int filemap_fdatawrite(struct address_space *);
extern int filemap_flush(struct address_space *);
@@ -2180,7 +2182,6 @@ extern int bdev_read_only(struct block_device *);
extern int set_blocksize(struct block_device *, int);
extern int sb_set_blocksize(struct super_block *, int);
extern int sb_min_blocksize(struct super_block *, int);
-extern int sb_has_dirty_inodes(struct super_block *);
extern int generic_file_mmap(struct file *, struct vm_area_struct *);
extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 9344547..a8e9f78 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -99,7 +99,7 @@ static inline void inode_sync_wait(struct inode *inode)
/*
* mm/page-writeback.c
*/
-int wakeup_pdflush(long nr_pages);
+void wakeup_flusher_threads(long nr_pages);
void laptop_io_completion(void);
void laptop_sync_completion(void);
void throttle_vm_writeout(gfp_t gfp_mask);
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index de0bbfe..0df8079 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -1,8 +1,11 @@
#include <linux/wait.h>
#include <linux/backing-dev.h>
+#include <linux/kthread.h>
+#include <linux/freezer.h>
#include <linux/fs.h>
#include <linux/pagemap.h>
+#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/module.h>
#include <linux/writeback.h>
@@ -16,7 +19,7 @@ EXPORT_SYMBOL(default_unplug_io_fn);
struct backing_dev_info default_backing_dev_info = {
.ra_pages = VM_MAX_READAHEAD * 1024 / PAGE_CACHE_SIZE,
.state = 0,
- .capabilities = BDI_CAP_MAP_COPY,
+ .capabilities = BDI_CAP_MAP_COPY | BDI_CAP_FLUSH_FORKER,
.unplug_io_fn = default_unplug_io_fn,
};
EXPORT_SYMBOL_GPL(default_backing_dev_info);
@@ -24,6 +27,7 @@ EXPORT_SYMBOL_GPL(default_backing_dev_info);
static struct class *bdi_class;
DEFINE_MUTEX(bdi_lock);
LIST_HEAD(bdi_list);
+LIST_HEAD(bdi_pending_list);
#ifdef CONFIG_DEBUG_FS
#include <linux/debugfs.h>
@@ -195,6 +199,143 @@ static int __init default_bdi_init(void)
}
subsys_initcall(default_bdi_init);
+static int bdi_start_fn(void *ptr)
+{
+ struct backing_dev_info *bdi = ptr;
+ struct task_struct *tsk = current;
+
+ /*
+ * Add us to the active bdi_list
+ */
+ mutex_lock(&bdi_lock);
+ list_add(&bdi->bdi_list, &bdi_list);
+ mutex_unlock(&bdi_lock);
+
+ tsk->flags |= PF_FLUSHER | PF_SWAPWRITE;
+ set_freezable();
+
+ /*
+ * Our parent may run at a different priority, just set us to normal
+ */
+ set_user_nice(tsk, 0);
+
+ /*
+ * Clear pending bit and wakeup anybody waiting to tear us down
+ */
+ clear_bit(BDI_pending, &bdi->state);
+ smp_mb__after_clear_bit();
+ wake_up_bit(&bdi->state, BDI_pending);
+
+ return bdi_writeback_task(bdi);
+}
+
+static void bdi_flush_io(struct backing_dev_info *bdi)
+{
+ struct writeback_control wbc = {
+ .bdi = bdi,
+ .sync_mode = WB_SYNC_NONE,
+ .older_than_this = NULL,
+ .range_cyclic = 1,
+ .nr_to_write = 1024,
+ };
+
+ generic_sync_bdi_inodes(NULL, &wbc);
+}
+
+static int bdi_forker_task(void *ptr)
+{
+ struct backing_dev_info *me = ptr;
+ DEFINE_WAIT(wait);
+
+ for (;;) {
+ struct backing_dev_info *bdi, *tmp;
+
+ /*
+ * Do this periodically, like kupdated() did before.
+ */
+ sync_supers();
+
+ /*
+ * Temporary measure, we want to make sure we don't see
+ * dirty data on the default backing_dev_info
+ */
+ if (bdi_has_dirty_io(me))
+ bdi_flush_io(me);
+
+ prepare_to_wait(&me->wait, &wait, TASK_INTERRUPTIBLE);
+
+ mutex_lock(&bdi_lock);
+
+ /*
+ * Check if any existing bdi's have dirty data without
+ * a thread registered. If so, set that up.
+ */
+ list_for_each_entry_safe(bdi, tmp, &bdi_list, bdi_list) {
+ if (bdi->task || !bdi_has_dirty_io(bdi))
+ continue;
+
+ bdi_add_default_flusher_task(bdi);
+ }
+
+ if (list_empty(&bdi_pending_list)) {
+ unsigned long wait;
+
+ mutex_unlock(&bdi_lock);
+ wait = msecs_to_jiffies(dirty_writeback_interval * 10);
+ schedule_timeout(wait);
+ try_to_freeze();
+ continue;
+ }
+
+ /*
+ * This is our real job - check for pending entries in
+ * bdi_pending_list, and create the tasks that got added
+ */
+ bdi = list_entry(bdi_pending_list.next, struct backing_dev_info,
+ bdi_list);
+ list_del_init(&bdi->bdi_list);
+ mutex_unlock(&bdi_lock);
+
+ BUG_ON(bdi->task);
+
+ bdi->task = kthread_run(bdi_start_fn, bdi, "bdi-%s",
+ dev_name(bdi->dev));
+ /*
+ * If task creation fails, then readd the bdi to
+ * the pending list and force writeout of the bdi
+ * from this forker thread. That will free some memory
+ * and we can try again.
+ */
+ if (!bdi->task) {
+ /*
+ * Add this 'bdi' to the back, so we get
+ * a chance to flush other bdi's to free
+ * memory.
+ */
+ mutex_lock(&bdi_lock);
+ list_add_tail(&bdi->bdi_list, &bdi_pending_list);
+ mutex_unlock(&bdi_lock);
+
+ bdi_flush_io(bdi);
+ }
+ }
+
+ finish_wait(&me->wait, &wait);
+ return 0;
+}
+
+void bdi_add_default_flusher_task(struct backing_dev_info *bdi)
+{
+ if (test_and_set_bit(BDI_pending, &bdi->state))
+ return;
+
+ mutex_lock(&bdi_lock);
+ list_move_tail(&bdi->bdi_list, &bdi_pending_list);
+ mutex_unlock(&bdi_lock);
+
+ wake_up(&default_backing_dev_info.wait);
+}
+
int bdi_register(struct backing_dev_info *bdi, struct device *parent,
const char *fmt, ...)
{
@@ -218,8 +359,25 @@ int bdi_register(struct backing_dev_info *bdi, struct device *parent,
mutex_unlock(&bdi_lock);
bdi->dev = dev;
- bdi_debug_register(bdi, dev_name(dev));
+ /*
+ * Just start the forker thread for our default backing_dev_info,
+ * and add other bdi's to the list. They will get a thread created
+ * on-demand when they need it.
+ */
+ if (bdi_cap_flush_forker(bdi)) {
+ bdi->task = kthread_run(bdi_forker_task, bdi, "bdi-%s",
+ dev_name(dev));
+ if (!bdi->task) {
+ mutex_lock(&bdi_lock);
+ list_del(&bdi->bdi_list);
+ mutex_unlock(&bdi_lock);
+ ret = -ENOMEM;
+ goto exit;
+ }
+ }
+
+ bdi_debug_register(bdi, dev_name(dev));
exit:
return ret;
}
@@ -231,8 +389,19 @@ int bdi_register_dev(struct backing_dev_info *bdi, dev_t dev)
}
EXPORT_SYMBOL(bdi_register_dev);
-static void bdi_remove_from_list(struct backing_dev_info *bdi)
+static int sched_wait(void *word)
+{
+ schedule();
+ return 0;
+}
+
+static void bdi_wb_shutdown(struct backing_dev_info *bdi)
{
+ /*
+ * If setup is pending, wait for that to complete first
+ */
+ wait_on_bit(&bdi->state, BDI_pending, sched_wait, TASK_UNINTERRUPTIBLE);
+
mutex_lock(&bdi_lock);
list_del(&bdi->bdi_list);
mutex_unlock(&bdi_lock);
@@ -241,7 +410,13 @@ static void bdi_remove_from_list(struct backing_dev_info *bdi)
void bdi_unregister(struct backing_dev_info *bdi)
{
if (bdi->dev) {
- bdi_remove_from_list(bdi);
+ if (!bdi_cap_flush_forker(bdi)) {
+ bdi_wb_shutdown(bdi);
+ if (bdi->task) {
+ kthread_stop(bdi->task);
+ bdi->task = NULL;
+ }
+ }
bdi_debug_unregister(bdi);
device_unregister(bdi->dev);
bdi->dev = NULL;
@@ -251,14 +426,14 @@ EXPORT_SYMBOL(bdi_unregister);
int bdi_init(struct backing_dev_info *bdi)
{
- int i;
- int err;
+ int i, err;
bdi->dev = NULL;
bdi->min_ratio = 0;
bdi->max_ratio = 100;
bdi->max_prop_frac = PROP_FRAC_BASE;
+ init_waitqueue_head(&bdi->wait);
INIT_LIST_HEAD(&bdi->bdi_list);
INIT_LIST_HEAD(&bdi->b_io);
INIT_LIST_HEAD(&bdi->b_dirty);
@@ -277,8 +452,6 @@ int bdi_init(struct backing_dev_info *bdi)
err:
while (i--)
percpu_counter_destroy(&bdi->bdi_stat[i]);
-
- bdi_remove_from_list(bdi);
}
return err;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 7c44314..46c62b0 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -36,15 +36,6 @@
#include <linux/pagevec.h>
/*
- * The maximum number of pages to writeout in a single bdflush/kupdate
- * operation. We do this so we don't hold I_SYNC against an inode for
- * enormous amounts of time, which would block a userspace task which has
- * been forced to throttle against that inode. Also, the code reevaluates
- * the dirty each time it has written this many pages.
- */
-#define MAX_WRITEBACK_PAGES 1024
-
-/*
* After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
* will look to see if it needs to force writeback or throttling.
*/
@@ -117,8 +108,6 @@ EXPORT_SYMBOL(laptop_mode);
/* End of sysctl-exported parameters */
-static void background_writeout(unsigned long _min_pages);
-
/*
* Scale the writeback cache size proportional to the relative writeout speeds.
*
@@ -539,7 +528,7 @@ static void balance_dirty_pages(struct address_space *mapping)
* been flushed to permanent storage.
*/
if (bdi_nr_reclaimable) {
- writeback_inodes(&wbc);
+ generic_sync_bdi_inodes(NULL, &wbc);
pages_written += write_chunk - wbc.nr_to_write;
get_dirty_limits(&background_thresh, &dirty_thresh,
&bdi_thresh, bdi);
@@ -590,7 +579,7 @@ static void balance_dirty_pages(struct address_space *mapping)
(!laptop_mode && (global_page_state(NR_FILE_DIRTY)
+ global_page_state(NR_UNSTABLE_NFS)
> background_thresh)))
- pdflush_operation(background_writeout, 0);
+ bdi_start_writeback(bdi, NULL, 0, WB_SYNC_NONE);
}
void set_page_dirty_balance(struct page *page, int page_mkwrite)
@@ -675,152 +664,41 @@ void throttle_vm_writeout(gfp_t gfp_mask)
}
/*
- * writeback at least _min_pages, and keep writing until the amount of dirty
- * memory is less than the background threshold, or until we're all clean.
+ * Start writeback of `nr_pages' pages. If `nr_pages' is zero, write back
+ * the whole world.
*/
-static void background_writeout(unsigned long _min_pages)
+void wakeup_flusher_threads(long nr_pages)
{
- long min_pages = _min_pages;
struct writeback_control wbc = {
- .bdi = NULL,
.sync_mode = WB_SYNC_NONE,
.older_than_this = NULL,
- .nr_to_write = 0,
- .nonblocking = 1,
.range_cyclic = 1,
};
- for ( ; ; ) {
- unsigned long background_thresh;
- unsigned long dirty_thresh;
-
- get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
- if (global_page_state(NR_FILE_DIRTY) +
- global_page_state(NR_UNSTABLE_NFS) < background_thresh
- && min_pages <= 0)
- break;
- wbc.more_io = 0;
- wbc.encountered_congestion = 0;
- wbc.nr_to_write = MAX_WRITEBACK_PAGES;
- wbc.pages_skipped = 0;
- writeback_inodes(&wbc);
- min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
- if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
- /* Wrote less than expected */
- if (wbc.encountered_congestion || wbc.more_io)
- congestion_wait(WRITE, HZ/10);
- else
- break;
- }
- }
-}
-
-/*
- * Start writeback of `nr_pages' pages. If `nr_pages' is zero, write back
- * the whole world. Returns 0 if a pdflush thread was dispatched. Returns
- * -1 if all pdflush threads were busy.
- */
-int wakeup_pdflush(long nr_pages)
-{
if (nr_pages == 0)
nr_pages = global_page_state(NR_FILE_DIRTY) +
global_page_state(NR_UNSTABLE_NFS);
- return pdflush_operation(background_writeout, nr_pages);
+ wbc.nr_to_write = nr_pages;
+ bdi_writeback_all(NULL, &wbc);
}
-static void wb_timer_fn(unsigned long unused);
static void laptop_timer_fn(unsigned long unused);
-static DEFINE_TIMER(wb_timer, wb_timer_fn, 0, 0);
static DEFINE_TIMER(laptop_mode_wb_timer, laptop_timer_fn, 0, 0);
/*
- * Periodic writeback of "old" data.
- *
- * Define "old": the first time one of an inode's pages is dirtied, we mark the
- * dirtying-time in the inode's address_space. So this periodic writeback code
- * just walks the superblock inode list, writing back any inodes which are
- * older than a specific point in time.
- *
- * Try to run once per dirty_writeback_interval. But if a writeback event
- * takes longer than a dirty_writeback_interval interval, then leave a
- * one-second gap.
- *
- * older_than_this takes precedence over nr_to_write. So we'll only write back
- * all dirty pages if they are all attached to "old" mappings.
- */
-static void wb_kupdate(unsigned long arg)
-{
- unsigned long oldest_jif;
- unsigned long start_jif;
- unsigned long next_jif;
- long nr_to_write;
- struct writeback_control wbc = {
- .bdi = NULL,
- .sync_mode = WB_SYNC_NONE,
- .older_than_this = &oldest_jif,
- .nr_to_write = 0,
- .nonblocking = 1,
- .for_kupdate = 1,
- .range_cyclic = 1,
- };
-
- sync_supers();
-
- oldest_jif = jiffies - msecs_to_jiffies(dirty_expire_interval * 10);
- start_jif = jiffies;
- next_jif = start_jif + msecs_to_jiffies(dirty_writeback_interval * 10);
- nr_to_write = global_page_state(NR_FILE_DIRTY) +
- global_page_state(NR_UNSTABLE_NFS) +
- (inodes_stat.nr_inodes - inodes_stat.nr_unused);
- while (nr_to_write > 0) {
- wbc.more_io = 0;
- wbc.encountered_congestion = 0;
- wbc.nr_to_write = MAX_WRITEBACK_PAGES;
- writeback_inodes(&wbc);
- if (wbc.nr_to_write > 0) {
- if (wbc.encountered_congestion || wbc.more_io)
- congestion_wait(WRITE, HZ/10);
- else
- break; /* All the old data is written */
- }
- nr_to_write -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
- }
- if (time_before(next_jif, jiffies + HZ))
- next_jif = jiffies + HZ;
- if (dirty_writeback_interval)
- mod_timer(&wb_timer, next_jif);
-}
-
-/*
* sysctl handler for /proc/sys/vm/dirty_writeback_centisecs
*/
int dirty_writeback_centisecs_handler(ctl_table *table, int write,
struct file *file, void __user *buffer, size_t *length, loff_t *ppos)
{
proc_dointvec(table, write, file, buffer, length, ppos);
- if (dirty_writeback_interval)
- mod_timer(&wb_timer, jiffies +
- msecs_to_jiffies(dirty_writeback_interval * 10));
- else
- del_timer(&wb_timer);
return 0;
}
-static void wb_timer_fn(unsigned long unused)
-{
- if (pdflush_operation(wb_kupdate, 0) < 0)
- mod_timer(&wb_timer, jiffies + HZ); /* delay 1 second */
-}
-
-static void laptop_flush(unsigned long unused)
-{
- sys_sync();
-}
-
static void laptop_timer_fn(unsigned long unused)
{
- pdflush_operation(laptop_flush, 0);
+ wakeup_flusher_threads(0);
}
/*
@@ -903,8 +781,6 @@ void __init page_writeback_init(void)
{
int shift;
- mod_timer(&wb_timer,
- jiffies + msecs_to_jiffies(dirty_writeback_interval * 10));
writeback_set_ratelimit();
register_cpu_notifier(&ratelimit_nb);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5fa3eda..e37fd38 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1654,7 +1654,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
*/
if (total_scanned > sc->swap_cluster_max +
sc->swap_cluster_max / 2) {
- wakeup_pdflush(laptop_mode ? 0 : total_scanned);
+ wakeup_flusher_threads(laptop_mode ? 0 : total_scanned);
sc->may_writepage = 1;
}
--
1.6.3.rc0.1.gf800
* Re: [PATCH 04/11] writeback: switch to per-bdi threads for flushing data
From: Peter Zijlstra @ 2009-05-27 11:11 UTC
To: Jens Axboe
Cc: linux-kernel, linux-fsdevel, chris.mason, david, hch, akpm, jack,
yanmin_zhang, richard, damien.wyart
On Wed, 2009-05-27 at 11:41 +0200, Jens Axboe wrote:
> + if (writeback_acquire(bdi)) {
> + bdi->wb_arg.nr_pages = nr_pages;
> + bdi->wb_arg.sb = sb;
> + bdi->wb_arg.sync_mode = sync_mode;
> + /*
> + * make above store seen before the task is woken
> + */
> + smp_mb();
> + wake_up(&bdi->wait);
> + }
wake_up() implies a wmb() when we indeed do a wakeup, is that
sufficient?
> +int bdi_writeback_task(struct backing_dev_info *bdi)
> +{
> + while (!kthread_should_stop()) {
> + unsigned long wait_jiffies;
> + DEFINE_WAIT(wait);
> +
> + prepare_to_wait(&bdi->wait, &wait, TASK_INTERRUPTIBLE);
> + wait_jiffies = msecs_to_jiffies(dirty_writeback_interval * 10);
> + schedule_timeout(wait_jiffies);
> + try_to_freeze();
> +
> + /*
> + * We get here in two cases:
> + *
> + * schedule_timeout() returned because the dirty writeback
> + * interval has elapsed. If that happens, we will be able
> + * to acquire the writeback lock and will proceed to do
> + * kupdated style writeout.
> + *
> + * Someone called bdi_start_writeback(), which will acquire
> + * the writeback lock. This means our writeback_acquire()
> + * below will fail and we call into bdi_pdflush() for
> + * pdflush style writeout.
> + *
> + */
> + if (writeback_acquire(bdi))
> + bdi_kupdated(bdi);
> + else
> + bdi_pdflush(bdi);
> +
> + writeback_release(bdi);
> + finish_wait(&bdi->wait, &wait);
> + }
> +
> + return 0;
> +}
the unpaired writeback_release() wrt writeback_acquire() looks odd.
Also the prepare/finish wait bits seem oddly out of place. Are there
really multiple waiters on bdi->wait? The above wake_up() seems to
suggest not, since it directly modifies bdi state instead of queueing
work.
^ permalink raw reply [flat|nested] 41+ messages in thread* Re: [PATCH 04/11] writeback: switch to per-bdi threads for flushing data
2009-05-27 11:11 ` Peter Zijlstra
@ 2009-05-27 11:24 ` Jens Axboe
0 siblings, 0 replies; 41+ messages in thread
From: Jens Axboe @ 2009-05-27 11:24 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, linux-fsdevel, chris.mason, david, hch, akpm, jack,
yanmin_zhang, richard, damien.wyart
On Wed, May 27 2009, Peter Zijlstra wrote:
> On Wed, 2009-05-27 at 11:41 +0200, Jens Axboe wrote:
>
> > + if (writeback_acquire(bdi)) {
> > + bdi->wb_arg.nr_pages = nr_pages;
> > + bdi->wb_arg.sb = sb;
> > + bdi->wb_arg.sync_mode = sync_mode;
> > + /*
> > + * make above store seen before the task is woken
> > + */
> > + smp_mb();
> > + wake_up(&bdi->wait);
> > + }
>
> wake_up() implies a wmb() when we indeed do a wakeup, is that
> sufficient?
That is sufficient. I'll kill the explicit barrier in the next revision,
seeing as this is just an intermediate step, no harm done.
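In sketch form, the pairing being relied on looks roughly like this
(condensed from the patch code quoted above, not verbatim):

        /* waker: bdi_start_writeback() */
        bdi->wb_arg.nr_pages = nr_pages;  /* describe the work */
        bdi->wb_arg.sb = sb;
        wake_up(&bdi->wait);              /* barrier implied when a sleeper
                                             is actually woken */

        /* waiter: bdi_writeback_task() */
        prepare_to_wait(&bdi->wait, &wait, TASK_INTERRUPTIBLE);
        schedule_timeout(wait_jiffies);   /* woken by the wake_up() above */
        /* loads of bdi->wb_arg.* now see the stores made before wake_up() */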
> > +int bdi_writeback_task(struct backing_dev_info *bdi)
> > +{
> > + while (!kthread_should_stop()) {
> > + unsigned long wait_jiffies;
> > + DEFINE_WAIT(wait);
> > +
> > + prepare_to_wait(&bdi->wait, &wait, TASK_INTERRUPTIBLE);
> > + wait_jiffies = msecs_to_jiffies(dirty_writeback_interval * 10);
> > + schedule_timeout(wait_jiffies);
> > + try_to_freeze();
> > +
> > + /*
> > + * We get here in two cases:
> > + *
> > + * schedule_timeout() returned because the dirty writeback
> > + * interval has elapsed. If that happens, we will be able
> > + * to acquire the writeback lock and will proceed to do
> > + * kupdated style writeout.
> > + *
> > + * Someone called bdi_start_writeback(), which will acquire
> > + * the writeback lock. This means our writeback_acquire()
> > + * below will fail and we call into bdi_pdflush() for
> > + * pdflush style writeout.
> > + *
> > + */
> > + if (writeback_acquire(bdi))
> > + bdi_kupdated(bdi);
> > + else
> > + bdi_pdflush(bdi);
> > +
> > + writeback_release(bdi);
> > + finish_wait(&bdi->wait, &wait);
> > + }
> > +
> > + return 0;
> > +}
>
> the unpaired writeback_release() wrt writeback_acquire() looks odd.
Did you read the comment? :-)
> Also the prepare/finish wait bits seem oddly out of place. Are there
> really multiple waiters on bdi->wait? The above wake_up() seems to
> suggest not, since it directly modifies bdi state instead of queueing
> work.
Intermediate step; further along it should be clearer.
--
Jens Axboe
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 04/11] writeback: switch to per-bdi threads for flushing data
2009-05-27 9:41 ` [PATCH 04/11] writeback: switch to per-bdi threads for flushing data Jens Axboe
2009-05-27 11:11 ` Peter Zijlstra
@ 2009-05-27 15:14 ` Jan Kara
2009-05-27 17:50 ` Jens Axboe
1 sibling, 1 reply; 41+ messages in thread
From: Jan Kara @ 2009-05-27 15:14 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-kernel, linux-fsdevel, chris.mason, david, hch, akpm, jack,
yanmin_zhang, richard, damien.wyart
The patch set seems easier to read now. Thanks for cleaning it up.
> +void bdi_writeback_all(struct super_block *sb, struct writeback_control *wbc)
> +{
> + struct backing_dev_info *bdi, *tmp;
> +
> + mutex_lock(&bdi_lock);
> +
> + list_for_each_entry_safe(bdi, tmp, &bdi_list, bdi_list) {
> + if (!bdi_has_dirty_io(bdi))
> + continue;
> + bdi_start_writeback(bdi, sb, wbc->nr_to_write, wbc->sync_mode);
> + }
> +
> + mutex_unlock(&bdi_lock);
> +}
> +
Looking at this function, I've realized that wbc->nr_to_write has a somewhat
odd meaning here. Each BDI will be kicked to write nr_to_write pages, which
is not what it originally meant. I don't think it really matters, but we
should keep this in mind...
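To put rough numbers on it (purely illustrative): with nr_to_write = 1024
and four registered bdis,

        old meaning: at most 1024 pages written for the whole pass
        new meaning: each bdi is asked for 1024 pages, so up to
                     4 * 1024 = 4096 pages in total

i.e. the total now scales with the number of bdis instead of being one
global budget.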
> @@ -591,13 +715,10 @@ static void generic_sync_bdi_inodes(struct backing_dev_info *bdi,
> void generic_sync_sb_inodes(struct super_block *sb,
> struct writeback_control *wbc)
> {
> - const int is_blkdev_sb = sb_is_blkdev_sb(sb);
> - struct backing_dev_info *bdi;
> -
> - mutex_lock(&bdi_lock);
> - list_for_each_entry(bdi, &bdi_list, bdi_list)
> - generic_sync_bdi_inodes(bdi, wbc, sb, is_blkdev_sb);
> - mutex_unlock(&bdi_lock);
> + if (wbc->bdi)
> + generic_sync_bdi_inodes(sb, wbc);
> + else
> + bdi_writeback_all(sb, wbc);
I guess this asynchrony is just transient...
> +static int bdi_forker_task(void *ptr)
> +{
> + struct backing_dev_info *me = ptr;
> + DEFINE_WAIT(wait);
> +
> + for (;;) {
> + struct backing_dev_info *bdi, *tmp;
> +
> + /*
> + * Do this periodically, like kupdated() did before.
> + */
> + sync_supers();
Ugh, this looks nasty. Moreover I'm afraid of forker_task() getting stuck
(and thus not being able to start new threads) in sync_supers() when some
fs is busy and another one needs a flusher thread created...
Why not just have a separate thread for this? I know we have lots of
kernel threads already but this one seems like a useful one... Or do you
plan on getting rid of this completely sometime in the near future and
syncing supers also from the per-bdi threads (which would make a lot of
sense to me)?
> +
> + /*
> + * Temporary measure, we want to make sure we don't see
> + * dirty data on the default backing_dev_info
> + */
> + if (bdi_has_dirty_io(me))
> + bdi_flush_io(me);
> +
> + prepare_to_wait(&me->wait, &wait, TASK_INTERRUPTIBLE);
> +
> + mutex_lock(&bdi_lock);
> +
> + /*
> + * Check if any existing bdi's have dirty data without
> + * a thread registered. If so, set that up.
> + */
> + list_for_each_entry_safe(bdi, tmp, &bdi_list, bdi_list) {
> + if (bdi->task || !bdi_has_dirty_io(bdi))
> + continue;
> +
> + bdi_add_default_flusher_task(bdi);
> + }
> +
> + if (list_empty(&bdi_pending_list)) {
> + unsigned long wait;
> +
> + mutex_unlock(&bdi_lock);
> + wait = msecs_to_jiffies(dirty_writeback_interval * 10);
> + schedule_timeout(wait);
> + try_to_freeze();
> + continue;
> + }
> +
> + /*
> + * This is our real job - check for pending entries in
> + * bdi_pending_list, and create the tasks that got added
> + */
> + bdi = list_entry(bdi_pending_list.next, struct backing_dev_info,
> + bdi_list);
> + list_del_init(&bdi->bdi_list);
> + mutex_unlock(&bdi_lock);
> +
> + BUG_ON(bdi->task);
> +
> + bdi->task = kthread_run(bdi_start_fn, bdi, "bdi-%s",
> + dev_name(bdi->dev));
> + /*
> + * If task creation fails, then readd the bdi to
> + * the pending list and force writeout of the bdi
> + * from this forker thread. That will free some memory
> + * and we can try again.
> + */
> + if (!bdi->task) {
> + /*
> + * Add this 'bdi' to the back, so we get
> + * a chance to flush other bdi's to free
> + * memory.
> + */
> + mutex_lock(&bdi_lock);
> + list_add_tail(&bdi->bdi_list, &bdi_pending_list);
> + mutex_unlock(&bdi_lock);
> +
> + bdi_flush_io(bdi);
> + }
> + }
> +
> + finish_wait(&me->wait, &wait);
> + return 0;
> +}
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 41+ messages in thread* Re: [PATCH 04/11] writeback: switch to per-bdi threads for flushing data
2009-05-27 15:14 ` Jan Kara
@ 2009-05-27 17:50 ` Jens Axboe
2009-05-28 14:45 ` Jan Kara
0 siblings, 1 reply; 41+ messages in thread
From: Jens Axboe @ 2009-05-27 17:50 UTC (permalink / raw)
To: Jan Kara
Cc: linux-kernel, linux-fsdevel, chris.mason, david, hch, akpm,
yanmin_zhang, richard, damien.wyart
On Wed, May 27 2009, Jan Kara wrote:
> The patch set seems easier to read now. Thanks for cleaning it up.
No problem. The issue is mainly that I have to maintain these
intermediate steps, and as code gets added and bugs fixed, things have
to be shuffled back and forth. Now that things are stabilizing more,
it's easier.
> > +void bdi_writeback_all(struct super_block *sb, struct writeback_control *wbc)
> > +{
> > + struct backing_dev_info *bdi, *tmp;
> > +
> > + mutex_lock(&bdi_lock);
> > +
> > + list_for_each_entry_safe(bdi, tmp, &bdi_list, bdi_list) {
> > + if (!bdi_has_dirty_io(bdi))
> > + continue;
> > + bdi_start_writeback(bdi, sb, wbc->nr_to_write, wbc->sync_mode);
> > + }
> > +
> > + mutex_unlock(&bdi_lock);
> > +}
> > +
> Looking at this function, I've realized that wbc->nr_to_write has a bit
> silly meaning here. Each BDI will be kicked to write nr_to_write pages
> which is not what it used to mean originally. I don't think it really matters
> but we should have this in mind...
Yes, I know about that difference. I don't think it matters a whole lot,
since we typically just use MAX_WRITEBACK_PAGES, which is only 4MB of IO.
And in the case of writing back the world, we'll just come up short on each
bdi.
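(For reference, assuming the usual MAX_WRITEBACK_PAGES of 1024 and 4KB
pages, that is 1024 pages * 4KB = 4MB per writeback chunk.)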
> > @@ -591,13 +715,10 @@ static void generic_sync_bdi_inodes(struct backing_dev_info *bdi,
> > void generic_sync_sb_inodes(struct super_block *sb,
> > struct writeback_control *wbc)
> > {
> > - const int is_blkdev_sb = sb_is_blkdev_sb(sb);
> > - struct backing_dev_info *bdi;
> > -
> > - mutex_lock(&bdi_lock);
> > - list_for_each_entry(bdi, &bdi_list, bdi_list)
> > - generic_sync_bdi_inodes(bdi, wbc, sb, is_blkdev_sb);
> > - mutex_unlock(&bdi_lock);
> > + if (wbc->bdi)
> > + generic_sync_bdi_inodes(sb, wbc);
> > + else
> > + bdi_writeback_all(sb, wbc);
> I guess this asynchrony is just transient...
Right, if it bothers you, I can fix that up too :-)
> > +static int bdi_forker_task(void *ptr)
> > +{
> > + struct backing_dev_info *me = ptr;
> > + DEFINE_WAIT(wait);
> > +
> > + for (;;) {
> > + struct backing_dev_info *bdi, *tmp;
> > +
> > + /*
> > + * Do this periodically, like kupdated() did before.
> > + */
> > + sync_supers();
> Ugh, this looks nasty. Moreover I'm afraid of forker_task() getting stuck
> (and thus not being able to start new threads) in sync_supers() when some
> fs is busy and another one needs a flusher thread created...
> Why not just have a separate thread for this? I know we have lots of
> kernel threads already but this one seems like a useful one... Or do you
> plan on getting rid of this completely sometime in the near future and
> syncing supers also from the per-bdi threads (which would make a lot of
> sense to me)?
It's ugly, and I think this is precisely what Ted hit. He's in umount,
has ->s_umount sem held and waiting for IO.
So there's definitely trouble brewing there. As a short term solution, a
separate thread will do. Longer term, the sync_supers_bdi() type setup I
mentioned earlier would probably be the best. But once we start dealing
with the super blocks, we have to be more careful with referencing.
Which we also discussed in a previous mail :-)
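Spelled out, the chain being worried about is roughly the following (a
plausible reconstruction, not a verified trace):

        umount:        holds sb->s_umount, waits for writeback IO to finish
        forker thread: calls sync_supers(), blocks on the busy/unmounting fs
                       -> it can no longer fork the per-bdi flusher thread
                       -> the IO that umount is waiting for is never issued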
--
Jens Axboe
^ permalink raw reply [flat|nested] 41+ messages in thread* Re: [PATCH 04/11] writeback: switch to per-bdi threads for flushing data
2009-05-27 17:50 ` Jens Axboe
@ 2009-05-28 14:45 ` Jan Kara
0 siblings, 0 replies; 41+ messages in thread
From: Jan Kara @ 2009-05-28 14:45 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-kernel, linux-fsdevel, chris.mason, david, hch, akpm,
yanmin_zhang, richard, damien.wyart
On Wed 27-05-09 19:50:19, Jens Axboe wrote:
> > > +static int bdi_forker_task(void *ptr)
> > > +{
> > > + struct backing_dev_info *me = ptr;
> > > + DEFINE_WAIT(wait);
> > > +
> > > + for (;;) {
> > > + struct backing_dev_info *bdi, *tmp;
> > > +
> > > + /*
> > > + * Do this periodically, like kupdated() did before.
> > > + */
> > > + sync_supers();
> > Ugh, this looks nasty. Moreover I'm afraid of forker_task() getting stuck
> > (and thus not being able to start new threads) in sync_supers() when some
> > fs is busy and another one needs a flusher thread created...
> > Why not just have a separate thread for this? I know we have lots of
> > kernel threads already but this one seems like a useful one... Or do you
> > plan on getting rid of this completely sometime in the near future and
> > syncing supers also from the per-bdi threads (which would make a lot of
> > sense to me)?
>
> It's ugly, and I think this is precisely what Ted hit. He's in umount,
> has ->s_umount sem held and waiting for IO.
I've looked into this a bit more because it was still nagging in the back
of my mind, and I think there indeed is a race (although your sync writeback
waiting has now hidden it). The problem is the following:
bdi flusher threads live independently of whether the filesystem is mounted
or not. So it can happen that bdi_kupdated() or bdi_pdflush() runs in
parallel with umount running in another thread. That should not really be
allowed to happen, because either
1) umount can fail with EBUSY because generic_sync_bdi_inodes() holds
a reference to an inode, or
2) we race more subtly and get to call __writeback_single_inode() after
the filesystem has been unmounted (put_super() has been called).
So I believe you simply have to deal with superblock references and the
umount semaphore in your patches...
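A rough interleaving for case 2) (illustrative only, not taken from the
patches):

        per-bdi flusher: generic_sync_wb_inodes() picks an inode off
                         wb->b_dirty and starts writing it back
        umount:          proceeds and completes; put_super() tears the
                         fs down
        per-bdi flusher: __writeback_single_inode() now touches per-fs
                         state that is already gone

(in case 1), umount instead trips over the inode reference held by the
flusher and fails with -EBUSY)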
> So there's definitely trouble brewing there. As a short term solution, a
> separate thread will do. Longer term, the sync_supers_bdi() type setup I
> mentioned earlier would probably be the best. But once we start dealing
> with the super blocks, we have to be more careful with referencing.
> Which we also discussed in a previous mail :-)
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 41+ messages in thread
* [PATCH 05/11] writeback: get rid of pdflush completely
2009-05-27 9:41 [PATCH 0/11] Per-bdi writeback flusher threads v8 Jens Axboe
` (3 preceding siblings ...)
2009-05-27 9:41 ` [PATCH 04/11] writeback: switch to per-bdi threads for flushing data Jens Axboe
@ 2009-05-27 9:41 ` Jens Axboe
2009-05-27 9:41 ` [PATCH 06/11] writeback: separate the flushing state/task from the bdi Jens Axboe
` (7 subsequent siblings)
12 siblings, 0 replies; 41+ messages in thread
From: Jens Axboe @ 2009-05-27 9:41 UTC (permalink / raw)
To: linux-kernel, linux-fsdevel
Cc: chris.mason, david, hch, akpm, jack, yanmin_zhang, richard,
damien.wyart, Jens Axboe
It is now unused, so kill it off.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
fs/fs-writeback.c | 5 +
include/linux/writeback.h | 12 --
mm/Makefile | 2 +-
mm/pdflush.c | 269 ---------------------------------------------
4 files changed, 6 insertions(+), 282 deletions(-)
delete mode 100644 mm/pdflush.c
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 5d99b12..3b748e7 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -29,6 +29,11 @@
#define inode_to_bdi(inode) ((inode)->i_mapping->backing_dev_info)
+/*
+ * We don't actually have pdflush, but this one is exported through /proc...
+ */
+int nr_pdflush_threads;
+
/**
* writeback_acquire - attempt to get exclusive writeback access to a device
* @bdi: the device's backing_dev_info structure
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index a8e9f78..baf04a9 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -14,17 +14,6 @@ extern struct list_head inode_in_use;
extern struct list_head inode_unused;
/*
- * Yes, writeback.h requires sched.h
- * No, sched.h is not included from here.
- */
-static inline int task_is_pdflush(struct task_struct *task)
-{
- return task->flags & PF_FLUSHER;
-}
-
-#define current_is_pdflush() task_is_pdflush(current)
-
-/*
* fs/fs-writeback.c
*/
enum writeback_sync_modes {
@@ -151,7 +140,6 @@ balance_dirty_pages_ratelimited(struct address_space *mapping)
typedef int (*writepage_t)(struct page *page, struct writeback_control *wbc,
void *data);
-int pdflush_operation(void (*fn)(unsigned long), unsigned long arg0);
int generic_writepages(struct address_space *mapping,
struct writeback_control *wbc);
int write_cache_pages(struct address_space *mapping,
diff --git a/mm/Makefile b/mm/Makefile
index ec73c68..2adb811 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -8,7 +8,7 @@ mmu-$(CONFIG_MMU) := fremap.o highmem.o madvise.o memory.o mincore.o \
vmalloc.o
obj-y := bootmem.o filemap.o mempool.o oom_kill.o fadvise.o \
- maccess.o page_alloc.o page-writeback.o pdflush.o \
+ maccess.o page_alloc.o page-writeback.o \
readahead.o swap.o truncate.o vmscan.o shmem.o \
prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \
page_isolation.o mm_init.o $(mmu-y)
diff --git a/mm/pdflush.c b/mm/pdflush.c
deleted file mode 100644
index 235ac44..0000000
--- a/mm/pdflush.c
+++ /dev/null
@@ -1,269 +0,0 @@
-/*
- * mm/pdflush.c - worker threads for writing back filesystem data
- *
- * Copyright (C) 2002, Linus Torvalds.
- *
- * 09Apr2002 Andrew Morton
- * Initial version
- * 29Feb2004 kaos@sgi.com
- * Move worker thread creation to kthread to avoid chewing
- * up stack space with nested calls to kernel_thread.
- */
-
-#include <linux/sched.h>
-#include <linux/list.h>
-#include <linux/signal.h>
-#include <linux/spinlock.h>
-#include <linux/gfp.h>
-#include <linux/init.h>
-#include <linux/module.h>
-#include <linux/fs.h> /* Needed by writeback.h */
-#include <linux/writeback.h> /* Prototypes pdflush_operation() */
-#include <linux/kthread.h>
-#include <linux/cpuset.h>
-#include <linux/freezer.h>
-
-
-/*
- * Minimum and maximum number of pdflush instances
- */
-#define MIN_PDFLUSH_THREADS 2
-#define MAX_PDFLUSH_THREADS 8
-
-static void start_one_pdflush_thread(void);
-
-
-/*
- * The pdflush threads are worker threads for writing back dirty data.
- * Ideally, we'd like one thread per active disk spindle. But the disk
- * topology is very hard to divine at this level. Instead, we take
- * care in various places to prevent more than one pdflush thread from
- * performing writeback against a single filesystem. pdflush threads
- * have the PF_FLUSHER flag set in current->flags to aid in this.
- */
-
-/*
- * All the pdflush threads. Protected by pdflush_lock
- */
-static LIST_HEAD(pdflush_list);
-static DEFINE_SPINLOCK(pdflush_lock);
-
-/*
- * The count of currently-running pdflush threads. Protected
- * by pdflush_lock.
- *
- * Readable by sysctl, but not writable. Published to userspace at
- * /proc/sys/vm/nr_pdflush_threads.
- */
-int nr_pdflush_threads = 0;
-
-/*
- * The time at which the pdflush thread pool last went empty
- */
-static unsigned long last_empty_jifs;
-
-/*
- * The pdflush thread.
- *
- * Thread pool management algorithm:
- *
- * - The minimum and maximum number of pdflush instances are bound
- * by MIN_PDFLUSH_THREADS and MAX_PDFLUSH_THREADS.
- *
- * - If there have been no idle pdflush instances for 1 second, create
- * a new one.
- *
- * - If the least-recently-went-to-sleep pdflush thread has been asleep
- * for more than one second, terminate a thread.
- */
-
-/*
- * A structure for passing work to a pdflush thread. Also for passing
- * state information between pdflush threads. Protected by pdflush_lock.
- */
-struct pdflush_work {
- struct task_struct *who; /* The thread */
- void (*fn)(unsigned long); /* A callback function */
- unsigned long arg0; /* An argument to the callback */
- struct list_head list; /* On pdflush_list, when idle */
- unsigned long when_i_went_to_sleep;
-};
-
-static int __pdflush(struct pdflush_work *my_work)
-{
- current->flags |= PF_FLUSHER | PF_SWAPWRITE;
- set_freezable();
- my_work->fn = NULL;
- my_work->who = current;
- INIT_LIST_HEAD(&my_work->list);
-
- spin_lock_irq(&pdflush_lock);
- for ( ; ; ) {
- struct pdflush_work *pdf;
-
- set_current_state(TASK_INTERRUPTIBLE);
- list_move(&my_work->list, &pdflush_list);
- my_work->when_i_went_to_sleep = jiffies;
- spin_unlock_irq(&pdflush_lock);
- schedule();
- try_to_freeze();
- spin_lock_irq(&pdflush_lock);
- if (!list_empty(&my_work->list)) {
- /*
- * Someone woke us up, but without removing our control
- * structure from the global list. swsusp will do this
- * in try_to_freeze()->refrigerator(). Handle it.
- */
- my_work->fn = NULL;
- continue;
- }
- if (my_work->fn == NULL) {
- printk("pdflush: bogus wakeup\n");
- continue;
- }
- spin_unlock_irq(&pdflush_lock);
-
- (*my_work->fn)(my_work->arg0);
-
- spin_lock_irq(&pdflush_lock);
-
- /*
- * Thread creation: For how long have there been zero
- * available threads?
- *
- * To throttle creation, we reset last_empty_jifs.
- */
- if (time_after(jiffies, last_empty_jifs + 1 * HZ)) {
- if (list_empty(&pdflush_list)) {
- if (nr_pdflush_threads < MAX_PDFLUSH_THREADS) {
- last_empty_jifs = jiffies;
- nr_pdflush_threads++;
- spin_unlock_irq(&pdflush_lock);
- start_one_pdflush_thread();
- spin_lock_irq(&pdflush_lock);
- }
- }
- }
-
- my_work->fn = NULL;
-
- /*
- * Thread destruction: For how long has the sleepiest
- * thread slept?
- */
- if (list_empty(&pdflush_list))
- continue;
- if (nr_pdflush_threads <= MIN_PDFLUSH_THREADS)
- continue;
- pdf = list_entry(pdflush_list.prev, struct pdflush_work, list);
- if (time_after(jiffies, pdf->when_i_went_to_sleep + 1 * HZ)) {
- /* Limit exit rate */
- pdf->when_i_went_to_sleep = jiffies;
- break; /* exeunt */
- }
- }
- nr_pdflush_threads--;
- spin_unlock_irq(&pdflush_lock);
- return 0;
-}
-
-/*
- * Of course, my_work wants to be just a local in __pdflush(). It is
- * separated out in this manner to hopefully prevent the compiler from
- * performing unfortunate optimisations against the auto variables. Because
- * these are visible to other tasks and CPUs. (No problem has actually
- * been observed. This is just paranoia).
- */
-static int pdflush(void *dummy)
-{
- struct pdflush_work my_work;
- cpumask_var_t cpus_allowed;
-
- /*
- * Since the caller doesn't even check kthread_run() worked, let's not
- * freak out too much if this fails.
- */
- if (!alloc_cpumask_var(&cpus_allowed, GFP_KERNEL)) {
- printk(KERN_WARNING "pdflush failed to allocate cpumask\n");
- return 0;
- }
-
- /*
- * pdflush can spend a lot of time doing encryption via dm-crypt. We
- * don't want to do that at keventd's priority.
- */
- set_user_nice(current, 0);
-
- /*
- * Some configs put our parent kthread in a limited cpuset,
- * which kthread() overrides, forcing cpus_allowed == cpu_all_mask.
- * Our needs are more modest - cut back to our cpusets cpus_allowed.
- * This is needed as pdflush's are dynamically created and destroyed.
- * The boottime pdflush's are easily placed w/o these 2 lines.
- */
- cpuset_cpus_allowed(current, cpus_allowed);
- set_cpus_allowed_ptr(current, cpus_allowed);
- free_cpumask_var(cpus_allowed);
-
- return __pdflush(&my_work);
-}
-
-/*
- * Attempt to wake up a pdflush thread, and get it to do some work for you.
- * Returns zero if it indeed managed to find a worker thread, and passed your
- * payload to it.
- */
-int pdflush_operation(void (*fn)(unsigned long), unsigned long arg0)
-{
- unsigned long flags;
- int ret = 0;
-
- BUG_ON(fn == NULL); /* Hard to diagnose if it's deferred */
-
- spin_lock_irqsave(&pdflush_lock, flags);
- if (list_empty(&pdflush_list)) {
- ret = -1;
- } else {
- struct pdflush_work *pdf;
-
- pdf = list_entry(pdflush_list.next, struct pdflush_work, list);
- list_del_init(&pdf->list);
- if (list_empty(&pdflush_list))
- last_empty_jifs = jiffies;
- pdf->fn = fn;
- pdf->arg0 = arg0;
- wake_up_process(pdf->who);
- }
- spin_unlock_irqrestore(&pdflush_lock, flags);
-
- return ret;
-}
-
-static void start_one_pdflush_thread(void)
-{
- struct task_struct *k;
-
- k = kthread_run(pdflush, NULL, "pdflush");
- if (unlikely(IS_ERR(k))) {
- spin_lock_irq(&pdflush_lock);
- nr_pdflush_threads--;
- spin_unlock_irq(&pdflush_lock);
- }
-}
-
-static int __init pdflush_init(void)
-{
- int i;
-
- /*
- * Pre-set nr_pdflush_threads... If we fail to create,
- * the count will be decremented.
- */
- nr_pdflush_threads = MIN_PDFLUSH_THREADS;
-
- for (i = 0; i < MIN_PDFLUSH_THREADS; i++)
- start_one_pdflush_thread();
- return 0;
-}
-
-module_init(pdflush_init);
--
1.6.3.rc0.1.gf800
^ permalink raw reply related [flat|nested] 41+ messages in thread* [PATCH 06/11] writeback: separate the flushing state/task from the bdi
2009-05-27 9:41 [PATCH 0/11] Per-bdi writeback flusher threads v8 Jens Axboe
` (4 preceding siblings ...)
2009-05-27 9:41 ` [PATCH 05/11] writeback: get rid of pdflush completely Jens Axboe
@ 2009-05-27 9:41 ` Jens Axboe
2009-05-27 9:41 ` [PATCH 07/11] writeback: support > 1 flusher thread per bdi Jens Axboe
` (6 subsequent siblings)
12 siblings, 0 replies; 41+ messages in thread
From: Jens Axboe @ 2009-05-27 9:41 UTC (permalink / raw)
To: linux-kernel, linux-fsdevel
Cc: chris.mason, david, hch, akpm, jack, yanmin_zhang, richard,
damien.wyart, Jens Axboe
Add a struct bdi_writeback for tracking and handling dirty IO. This
is in preparation for adding > 1 flusher task per bdi.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
fs/fs-writeback.c | 145 ++++++++++++++++++++++++++----------------
include/linux/backing-dev.h | 40 +++++++-----
mm/backing-dev.c | 128 ++++++++++++++++++++++++++++++--------
3 files changed, 215 insertions(+), 98 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 3b748e7..a238480 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -46,9 +46,11 @@ int nr_pdflush_threads;
* unless they implement their own. Which is somewhat inefficient, as this
* may prevent concurrent writeback against multiple devices.
*/
-static int writeback_acquire(struct backing_dev_info *bdi)
+static int writeback_acquire(struct bdi_writeback *wb)
{
- return !test_and_set_bit(BDI_pdflush, &bdi->state);
+ struct backing_dev_info *bdi = wb->bdi;
+
+ return !test_and_set_bit(wb->nr, &bdi->wb_active);
}
/**
@@ -59,19 +61,40 @@ static int writeback_acquire(struct backing_dev_info *bdi)
*/
int writeback_in_progress(struct backing_dev_info *bdi)
{
- return test_bit(BDI_pdflush, &bdi->state);
+ return bdi->wb_active != 0;
}
/**
* writeback_release - relinquish exclusive writeback access against a device.
* @bdi: the device's backing_dev_info structure
*/
-static void writeback_release(struct backing_dev_info *bdi)
+static void writeback_release(struct bdi_writeback *wb)
{
- WARN_ON_ONCE(!writeback_in_progress(bdi));
- bdi->wb_arg.nr_pages = 0;
- bdi->wb_arg.sb = NULL;
- clear_bit(BDI_pdflush, &bdi->state);
+ struct backing_dev_info *bdi = wb->bdi;
+
+ wb->nr_pages = 0;
+ wb->sb = NULL;
+ clear_bit(wb->nr, &bdi->wb_active);
+}
+
+static void wb_start_writeback(struct bdi_writeback *wb, struct super_block *sb,
+ long nr_pages,
+ enum writeback_sync_modes sync_mode)
+{
+ if (!wb_has_dirty_io(wb))
+ return;
+
+ if (writeback_acquire(wb)) {
+ wb->nr_pages = nr_pages;
+ wb->sb = sb;
+ wb->sync_mode = sync_mode;
+
+ /*
+ * make above store seen before the task is woken
+ */
+ smp_mb();
+ wake_up(&wb->wait);
+ }
}
int bdi_start_writeback(struct backing_dev_info *bdi, struct super_block *sb,
@@ -81,22 +104,12 @@ int bdi_start_writeback(struct backing_dev_info *bdi, struct super_block *sb,
* This only happens the first time someone kicks this bdi, so put
* it out-of-line.
*/
- if (unlikely(!bdi->task)) {
+ if (unlikely(!bdi->wb.task)) {
bdi_add_default_flusher_task(bdi);
return 1;
}
- if (writeback_acquire(bdi)) {
- bdi->wb_arg.nr_pages = nr_pages;
- bdi->wb_arg.sb = sb;
- bdi->wb_arg.sync_mode = sync_mode;
- /*
- * make above store seen before the task is woken
- */
- smp_mb();
- wake_up(&bdi->wait);
- }
-
+ wb_start_writeback(&bdi->wb, sb, nr_pages, sync_mode);
return 0;
}
@@ -124,12 +137,12 @@ int bdi_start_writeback(struct backing_dev_info *bdi, struct super_block *sb,
* older_than_this takes precedence over nr_to_write. So we'll only write back
* all dirty pages if they are all attached to "old" mappings.
*/
-static void bdi_kupdated(struct backing_dev_info *bdi)
+static void wb_kupdated(struct bdi_writeback *wb)
{
unsigned long oldest_jif;
long nr_to_write;
struct writeback_control wbc = {
- .bdi = bdi,
+ .bdi = wb->bdi,
.sync_mode = WB_SYNC_NONE,
.older_than_this = &oldest_jif,
.nr_to_write = 0,
@@ -164,15 +177,19 @@ static inline bool over_bground_thresh(void)
global_page_state(NR_UNSTABLE_NFS) >= background_thresh);
}
-static void bdi_pdflush(struct backing_dev_info *bdi)
+static void generic_sync_wb_inodes(struct bdi_writeback *wb,
+ struct super_block *sb,
+ struct writeback_control *wbc);
+
+static void wb_writeback(struct bdi_writeback *wb)
{
struct writeback_control wbc = {
- .bdi = bdi,
- .sync_mode = bdi->wb_arg.sync_mode,
+ .bdi = wb->bdi,
+ .sync_mode = wb->sync_mode,
.older_than_this = NULL,
.range_cyclic = 1,
};
- long nr_pages = bdi->wb_arg.nr_pages;
+ long nr_pages = wb->nr_pages;
for (;;) {
if (wbc.sync_mode == WB_SYNC_NONE && nr_pages <= 0 &&
@@ -183,7 +200,7 @@ static void bdi_pdflush(struct backing_dev_info *bdi)
wbc.encountered_congestion = 0;
wbc.nr_to_write = MAX_WRITEBACK_PAGES;
wbc.pages_skipped = 0;
- generic_sync_bdi_inodes(bdi->wb_arg.sb, &wbc);
+ generic_sync_wb_inodes(wb, wb->sb, &wbc);
nr_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
/*
* If we ran out of stuff to write, bail unless more_io got set
@@ -200,13 +217,13 @@ static void bdi_pdflush(struct backing_dev_info *bdi)
* Handle writeback of dirty data for the device backed by this bdi. Also
* wakes up periodically and does kupdated style flushing.
*/
-int bdi_writeback_task(struct backing_dev_info *bdi)
+int bdi_writeback_task(struct bdi_writeback *wb)
{
while (!kthread_should_stop()) {
unsigned long wait_jiffies;
DEFINE_WAIT(wait);
- prepare_to_wait(&bdi->wait, &wait, TASK_INTERRUPTIBLE);
+ prepare_to_wait(&wb->wait, &wait, TASK_INTERRUPTIBLE);
wait_jiffies = msecs_to_jiffies(dirty_writeback_interval * 10);
schedule_timeout(wait_jiffies);
try_to_freeze();
@@ -225,13 +242,13 @@ int bdi_writeback_task(struct backing_dev_info *bdi)
* pdflush style writeout.
*
*/
- if (writeback_acquire(bdi))
- bdi_kupdated(bdi);
+ if (writeback_acquire(wb))
+ wb_kupdated(wb);
else
- bdi_pdflush(bdi);
+ wb_writeback(wb);
- writeback_release(bdi);
- finish_wait(&bdi->wait, &wait);
+ writeback_release(wb);
+ finish_wait(&wb->wait, &wait);
}
return 0;
@@ -252,6 +269,14 @@ void bdi_writeback_all(struct super_block *sb, struct writeback_control *wbc)
mutex_unlock(&bdi_lock);
}
+/*
+ * We have only a single wb per bdi, so just return that.
+ */
+static inline struct bdi_writeback *inode_get_wb(struct inode *inode)
+{
+ return &inode_to_bdi(inode)->wb;
+}
+
/**
* __mark_inode_dirty - internal function
* @inode: inode to mark
@@ -350,9 +375,10 @@ void __mark_inode_dirty(struct inode *inode, int flags)
* reposition it (that would break b_dirty time-ordering).
*/
if (!was_dirty) {
+ struct bdi_writeback *wb = inode_get_wb(inode);
+
inode->dirtied_when = jiffies;
- list_move(&inode->i_list,
- &inode_to_bdi(inode)->b_dirty);
+ list_move(&inode->i_list, &wb->b_dirty);
}
}
out:
@@ -379,16 +405,16 @@ static int write_inode(struct inode *inode, int sync)
*/
static void redirty_tail(struct inode *inode)
{
- struct backing_dev_info *bdi = inode_to_bdi(inode);
+ struct bdi_writeback *wb = inode_get_wb(inode);
- if (!list_empty(&bdi->b_dirty)) {
+ if (!list_empty(&wb->b_dirty)) {
struct inode *tail;
- tail = list_entry(bdi->b_dirty.next, struct inode, i_list);
+ tail = list_entry(wb->b_dirty.next, struct inode, i_list);
if (time_before(inode->dirtied_when, tail->dirtied_when))
inode->dirtied_when = jiffies;
}
- list_move(&inode->i_list, &bdi->b_dirty);
+ list_move(&inode->i_list, &wb->b_dirty);
}
/*
@@ -396,7 +422,9 @@ static void redirty_tail(struct inode *inode)
*/
static void requeue_io(struct inode *inode)
{
- list_move(&inode->i_list, &inode_to_bdi(inode)->b_more_io);
+ struct bdi_writeback *wb = inode_get_wb(inode);
+
+ list_move(&inode->i_list, &wb->b_more_io);
}
static void inode_sync_complete(struct inode *inode)
@@ -443,11 +471,10 @@ static void move_expired_inodes(struct list_head *delaying_queue,
/*
* Queue all expired dirty inodes for io, eldest first.
*/
-static void queue_io(struct backing_dev_info *bdi,
- unsigned long *older_than_this)
+static void queue_io(struct bdi_writeback *wb, unsigned long *older_than_this)
{
- list_splice_init(&bdi->b_more_io, bdi->b_io.prev);
- move_expired_inodes(&bdi->b_dirty, &bdi->b_io, older_than_this);
+ list_splice_init(&wb->b_more_io, wb->b_io.prev);
+ move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this);
}
/*
@@ -608,20 +635,20 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
return __sync_single_inode(inode, wbc);
}
-void generic_sync_bdi_inodes(struct super_block *sb,
- struct writeback_control *wbc)
+static void generic_sync_wb_inodes(struct bdi_writeback *wb,
+ struct super_block *sb,
+ struct writeback_control *wbc)
{
const int is_blkdev_sb = sb_is_blkdev_sb(sb);
- struct backing_dev_info *bdi = wbc->bdi;
const unsigned long start = jiffies; /* livelock avoidance */
spin_lock(&inode_lock);
- if (!wbc->for_kupdate || list_empty(&bdi->b_io))
- queue_io(bdi, wbc->older_than_this);
+ if (!wbc->for_kupdate || list_empty(&wb->b_io))
+ queue_io(wb, wbc->older_than_this);
- while (!list_empty(&bdi->b_io)) {
- struct inode *inode = list_entry(bdi->b_io.prev,
+ while (!list_empty(&wb->b_io)) {
+ struct inode *inode = list_entry(wb->b_io.prev,
struct inode, i_list);
long pages_skipped;
@@ -633,7 +660,7 @@ void generic_sync_bdi_inodes(struct super_block *sb,
continue;
}
- if (!bdi_cap_writeback_dirty(bdi)) {
+ if (!bdi_cap_writeback_dirty(wb->bdi)) {
redirty_tail(inode);
if (is_blkdev_sb) {
/*
@@ -655,7 +682,7 @@ void generic_sync_bdi_inodes(struct super_block *sb,
continue;
}
- if (wbc->nonblocking && bdi_write_congested(bdi)) {
+ if (wbc->nonblocking && bdi_write_congested(wb->bdi)) {
wbc->encountered_congestion = 1;
if (!is_blkdev_sb)
break; /* Skip a congested fs */
@@ -689,7 +716,7 @@ void generic_sync_bdi_inodes(struct super_block *sb,
wbc->more_io = 1;
break;
}
- if (!list_empty(&bdi->b_more_io))
+ if (!list_empty(&wb->b_more_io))
wbc->more_io = 1;
}
@@ -697,6 +724,14 @@ void generic_sync_bdi_inodes(struct super_block *sb,
/* Leave any unwritten inodes on b_io */
}
+void generic_sync_bdi_inodes(struct super_block *sb,
+ struct writeback_control *wbc)
+{
+ struct backing_dev_info *bdi = wbc->bdi;
+
+ generic_sync_wb_inodes(&bdi->wb, sb, wbc);
+}
+
/*
* Write out a superblock's list of dirty inodes. A wait will be performed
* upon no inodes, all inodes or the final one, depending upon sync_mode.
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 9f040a9..4acc64e 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -24,8 +24,8 @@ struct dentry;
* Bits in backing_dev_info.state
*/
enum bdi_state {
- BDI_pdflush, /* A pdflush thread is working this device */
BDI_pending, /* On its way to being activated */
+ BDI_wb_alloc, /* Default embedded wb allocated */
BDI_async_congested, /* The async (write) queue is getting full */
BDI_sync_congested, /* The sync queue is getting full */
BDI_unused, /* Available bits start here */
@@ -41,15 +41,23 @@ enum bdi_stat_item {
#define BDI_STAT_BATCH (8*(1+ilog2(nr_cpu_ids)))
-struct bdi_writeback_arg {
- unsigned long nr_pages;
- struct super_block *sb;
+struct bdi_writeback {
+ struct backing_dev_info *bdi; /* our parent bdi */
+ unsigned int nr;
+
+ struct task_struct *task; /* writeback task */
+ wait_queue_head_t wait;
+ struct list_head b_dirty; /* dirty inodes */
+ struct list_head b_io; /* parked for writeback */
+ struct list_head b_more_io; /* parked for more writeback */
+
+ unsigned long nr_pages;
+ struct super_block *sb;
enum writeback_sync_modes sync_mode;
};
struct backing_dev_info {
struct list_head bdi_list;
-
unsigned long ra_pages; /* max readahead in PAGE_CACHE_SIZE units */
unsigned long state; /* Always use atomic bitops on this */
unsigned int capabilities; /* Device capabilities */
@@ -66,14 +74,11 @@ struct backing_dev_info {
unsigned int min_ratio;
unsigned int max_ratio, max_prop_frac;
- struct device *dev;
+ struct bdi_writeback wb; /* default writeback info for this bdi */
+ unsigned long wb_active; /* bitmap of active tasks */
+ unsigned long wb_mask; /* number of registered tasks */
- struct task_struct *task; /* writeback task */
- wait_queue_head_t wait;
- struct bdi_writeback_arg wb_arg; /* protected by BDI_pdflush */
- struct list_head b_dirty; /* dirty inodes */
- struct list_head b_io; /* parked for writeback */
- struct list_head b_more_io; /* parked for more writeback */
+ struct device *dev;
#ifdef CONFIG_DEBUG_FS
struct dentry *debug_dir;
@@ -90,18 +95,19 @@ int bdi_register_dev(struct backing_dev_info *bdi, dev_t dev);
void bdi_unregister(struct backing_dev_info *bdi);
int bdi_start_writeback(struct backing_dev_info *bdi, struct super_block *sb,
long nr_pages, enum writeback_sync_modes sync_mode);
-int bdi_writeback_task(struct backing_dev_info *bdi);
+int bdi_writeback_task(struct bdi_writeback *wb);
void bdi_writeback_all(struct super_block *sb, struct writeback_control *wbc);
void bdi_add_default_flusher_task(struct backing_dev_info *bdi);
+int bdi_has_dirty_io(struct backing_dev_info *bdi);
extern struct mutex bdi_lock;
extern struct list_head bdi_list;
-static inline int bdi_has_dirty_io(struct backing_dev_info *bdi)
+static inline int wb_has_dirty_io(struct bdi_writeback *wb)
{
- return !list_empty(&bdi->b_dirty) ||
- !list_empty(&bdi->b_io) ||
- !list_empty(&bdi->b_more_io);
+ return !list_empty(&wb->b_dirty) ||
+ !list_empty(&wb->b_io) ||
+ !list_empty(&wb->b_more_io);
}
static inline void __add_bdi_stat(struct backing_dev_info *bdi,
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 0df8079..28c6a7d 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -199,10 +199,46 @@ static int __init default_bdi_init(void)
}
subsys_initcall(default_bdi_init);
+static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
+{
+ memset(wb, 0, sizeof(*wb));
+
+ wb->bdi = bdi;
+ init_waitqueue_head(&wb->wait);
+ INIT_LIST_HEAD(&wb->b_dirty);
+ INIT_LIST_HEAD(&wb->b_io);
+ INIT_LIST_HEAD(&wb->b_more_io);
+}
+
+static int wb_assign_nr(struct backing_dev_info *bdi, struct bdi_writeback *wb)
+{
+ set_bit(0, &bdi->wb_mask);
+ wb->nr = 0;
+ return 0;
+}
+
+static void bdi_put_wb(struct backing_dev_info *bdi, struct bdi_writeback *wb)
+{
+ clear_bit(wb->nr, &bdi->wb_mask);
+ clear_bit(BDI_wb_alloc, &bdi->state);
+}
+
+static struct bdi_writeback *bdi_new_wb(struct backing_dev_info *bdi)
+{
+ struct bdi_writeback *wb;
+
+ set_bit(BDI_wb_alloc, &bdi->state);
+ wb = &bdi->wb;
+ wb_assign_nr(bdi, wb);
+ return wb;
+}
+
static int bdi_start_fn(void *ptr)
{
- struct backing_dev_info *bdi = ptr;
+ struct bdi_writeback *wb = ptr;
+ struct backing_dev_info *bdi = wb->bdi;
struct task_struct *tsk = current;
+ int ret;
/*
* Add us to the active bdi_list
@@ -226,7 +262,15 @@ static int bdi_start_fn(void *ptr)
smp_mb__after_clear_bit();
wake_up_bit(&bdi->state, BDI_pending);
- return bdi_writeback_task(bdi);
+ ret = bdi_writeback_task(wb);
+
+ bdi_put_wb(bdi, wb);
+ return ret;
+}
+
+int bdi_has_dirty_io(struct backing_dev_info *bdi)
+{
+ return wb_has_dirty_io(&bdi->wb);
}
static void bdi_flush_io(struct backing_dev_info *bdi)
@@ -244,11 +288,12 @@ static void bdi_flush_io(struct backing_dev_info *bdi)
static int bdi_forker_task(void *ptr)
{
- struct backing_dev_info *me = ptr;
+ struct bdi_writeback *me = ptr;
DEFINE_WAIT(wait);
for (;;) {
struct backing_dev_info *bdi, *tmp;
+ struct bdi_writeback *wb;
/*
* Do this periodically, like kupdated() did before.
@@ -259,8 +304,8 @@ static int bdi_forker_task(void *ptr)
* Temporary measure, we want to make sure we don't see
* dirty data on the default backing_dev_info
*/
- if (bdi_has_dirty_io(me))
- bdi_flush_io(me);
+ if (wb_has_dirty_io(me))
+ bdi_flush_io(me->bdi);
prepare_to_wait(&me->wait, &wait, TASK_INTERRUPTIBLE);
@@ -271,7 +316,7 @@ static int bdi_forker_task(void *ptr)
* a thread registered. If so, set that up.
*/
list_for_each_entry_safe(bdi, tmp, &bdi_list, bdi_list) {
- if (bdi->task || !bdi_has_dirty_io(bdi))
+ if (bdi->wb.task || !bdi_has_dirty_io(bdi))
continue;
bdi_add_default_flusher_task(bdi);
@@ -296,17 +341,22 @@ static int bdi_forker_task(void *ptr)
list_del_init(&bdi->bdi_list);
mutex_unlock(&bdi_lock);
- BUG_ON(bdi->task);
+ wb = bdi_new_wb(bdi);
+ if (!wb)
+ goto readd_flush;
- bdi->task = kthread_run(bdi_start_fn, bdi, "bdi-%s",
+ wb->task = kthread_run(bdi_start_fn, wb, "bdi-%s",
dev_name(bdi->dev));
+
/*
* If task creation fails, then readd the bdi to
* the pending list and force writeout of the bdi
* from this forker thread. That will free some memory
* and we can try again.
*/
- if (!bdi->task) {
+ if (!wb->task) {
+ bdi_put_wb(bdi, wb);
+readd_flush:
/*
* Add this 'bdi' to the back, so we get
* a chance to flush other bdi's to free
@@ -324,8 +374,18 @@ static int bdi_forker_task(void *ptr)
return 0;
}
+/*
+ * Add a new flusher task that gets created for any bdi
+ * that has dirty data pending writeout
+ */
void bdi_add_default_flusher_task(struct backing_dev_info *bdi)
{
+ if (!bdi_cap_writeback_dirty(bdi))
+ return;
+
+ /*
+ * Someone already marked this pending for task creation
+ */
if (test_and_set_bit(BDI_pending, &bdi->state))
return;
@@ -333,7 +393,7 @@ void bdi_add_default_flusher_task(struct backing_dev_info *bdi)
list_move_tail(&bdi->bdi_list, &bdi_pending_list);
mutex_unlock(&bdi_lock);
- wake_up(&default_backing_dev_info.wait);
+ wake_up(&default_backing_dev_info.wb.wait);
}
int bdi_register(struct backing_dev_info *bdi, struct device *parent,
@@ -366,13 +426,23 @@ int bdi_register(struct backing_dev_info *bdi, struct device *parent,
* on-demand when they need it.
*/
if (bdi_cap_flush_forker(bdi)) {
- bdi->task = kthread_run(bdi_forker_task, bdi, "bdi-%s",
+ struct bdi_writeback *wb;
+
+ wb = bdi_new_wb(bdi);
+ if (!wb) {
+ ret = -ENOMEM;
+ goto remove_err;
+ }
+
+ wb->task = kthread_run(bdi_forker_task, wb, "bdi-%s",
dev_name(dev));
- if (!bdi->task) {
+ if (!wb->task) {
+ bdi_put_wb(bdi, wb);
+ ret = -ENOMEM;
+remove_err:
mutex_lock(&bdi_lock);
list_del(&bdi->bdi_list);
mutex_unlock(&bdi_lock);
- ret = -ENOMEM;
goto exit;
}
}
@@ -395,28 +465,37 @@ static int sched_wait(void *word)
return 0;
}
+/*
+ * Remove bdi from global list and shutdown any threads we have running
+ */
static void bdi_wb_shutdown(struct backing_dev_info *bdi)
{
+ if (!bdi_cap_writeback_dirty(bdi))
+ return;
+
/*
* If setup is pending, wait for that to complete first
*/
wait_on_bit(&bdi->state, BDI_pending, sched_wait, TASK_UNINTERRUPTIBLE);
+ /*
+ * Make sure nobody finds us on the bdi_list anymore
+ */
mutex_lock(&bdi_lock);
list_del(&bdi->bdi_list);
mutex_unlock(&bdi_lock);
+
+ /*
+ * Finally, kill the kernel thread
+ */
+ kthread_stop(bdi->wb.task);
}
void bdi_unregister(struct backing_dev_info *bdi)
{
if (bdi->dev) {
- if (!bdi_cap_flush_forker(bdi)) {
+ if (!bdi_cap_flush_forker(bdi))
bdi_wb_shutdown(bdi);
- if (bdi->task) {
- kthread_stop(bdi->task);
- bdi->task = NULL;
- }
- }
bdi_debug_unregister(bdi);
device_unregister(bdi->dev);
bdi->dev = NULL;
@@ -433,11 +512,10 @@ int bdi_init(struct backing_dev_info *bdi)
bdi->min_ratio = 0;
bdi->max_ratio = 100;
bdi->max_prop_frac = PROP_FRAC_BASE;
- init_waitqueue_head(&bdi->wait);
INIT_LIST_HEAD(&bdi->bdi_list);
- INIT_LIST_HEAD(&bdi->b_io);
- INIT_LIST_HEAD(&bdi->b_dirty);
- INIT_LIST_HEAD(&bdi->b_more_io);
+ bdi->wb_mask = bdi->wb_active = 0;
+
+ bdi_wb_init(&bdi->wb, bdi);
for (i = 0; i < NR_BDI_STAT_ITEMS; i++) {
err = percpu_counter_init(&bdi->bdi_stat[i], 0);
@@ -462,9 +540,7 @@ void bdi_destroy(struct backing_dev_info *bdi)
{
int i;
- WARN_ON(!list_empty(&bdi->b_dirty));
- WARN_ON(!list_empty(&bdi->b_io));
- WARN_ON(!list_empty(&bdi->b_more_io));
+ WARN_ON(bdi_has_dirty_io(bdi));
bdi_unregister(bdi);
--
1.6.3.rc0.1.gf800
^ permalink raw reply related [flat|nested] 41+ messages in thread* [PATCH 07/11] writeback: support > 1 flusher thread per bdi
2009-05-27 9:41 [PATCH 0/11] Per-bdi writeback flusher threads v8 Jens Axboe
` (5 preceding siblings ...)
2009-05-27 9:41 ` [PATCH 06/11] writeback: separate the flushing state/task from the bdi Jens Axboe
@ 2009-05-27 9:41 ` Jens Axboe
2009-05-28 9:27 ` Jan Kara
2009-05-27 9:41 ` [PATCH 08/11] writeback: allow sleepy exit of default writeback task Jens Axboe
` (5 subsequent siblings)
12 siblings, 1 reply; 41+ messages in thread
From: Jens Axboe @ 2009-05-27 9:41 UTC (permalink / raw)
To: linux-kernel, linux-fsdevel
Cc: chris.mason, david, hch, akpm, jack, yanmin_zhang, richard,
damien.wyart, Jens Axboe
Build on the bdi_writeback support by allowing registration of
more than 1 flusher thread. File systems can call bdi_add_flusher_task(bdi)
to add more flusher threads to the device. If they do so, they must also
provide a super_operations function to return the appropriate bdi_writeback
struct for any given inode.
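A rough sketch of what that glue could look like in a filesystem follows;
the hook name, the hashing helper and the fs-private details are
illustrative assumptions, not taken verbatim from this patch:

        /* illustrative only -- names are assumptions */
        static struct bdi_writeback *myfs_inode_get_wb(struct inode *inode)
        {
                struct backing_dev_info *bdi =
                                inode->i_mapping->backing_dev_info;

                /*
                 * Hash the inode over whatever flusher threads were
                 * registered via bdi_add_flusher_task(); the lookup
                 * helper itself is not shown here.
                 */
                return myfs_hash_inode_to_wb(bdi, inode->i_ino);
        }

        static const struct super_operations myfs_super_ops = {
                /* ... the usual ops ... */
                .inode_get_wb   = myfs_inode_get_wb,    /* assumed name */
        };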
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
fs/fs-writeback.c | 447 +++++++++++++++++++++++++++++++++++--------
include/linux/backing-dev.h | 34 +++-
include/linux/fs.h | 3 +
include/linux/writeback.h | 1 +
mm/backing-dev.c | 242 +++++++++++++++++++-----
5 files changed, 592 insertions(+), 135 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index a238480..92b3df7 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -34,83 +34,249 @@
*/
int nr_pdflush_threads;
-/**
- * writeback_acquire - attempt to get exclusive writeback access to a device
- * @bdi: the device's backing_dev_info structure
- *
- * It is a waste of resources to have more than one pdflush thread blocked on
- * a single request queue. Exclusion at the request_queue level is obtained
- * via a flag in the request_queue's backing_dev_info.state.
- *
- * Non-request_queue-backed address_spaces will share default_backing_dev_info,
- * unless they implement their own. Which is somewhat inefficient, as this
- * may prevent concurrent writeback against multiple devices.
+static void generic_sync_wb_inodes(struct bdi_writeback *wb,
+ struct super_block *sb,
+ struct writeback_control *wbc);
+
+/*
+ * Work items for the bdi_writeback threads
*/
-static int writeback_acquire(struct bdi_writeback *wb)
+struct bdi_work {
+ struct list_head list;
+ struct list_head wait_list;
+ struct rcu_head rcu_head;
+
+ unsigned long seen;
+ atomic_t pending;
+
+ unsigned long sb_data;
+ unsigned long nr_pages;
+ enum writeback_sync_modes sync_mode;
+
+ unsigned long state;
+};
+
+static struct super_block *bdi_work_sb(struct bdi_work *work)
{
- struct backing_dev_info *bdi = wb->bdi;
+ return (struct super_block *) (work->sb_data & ~1UL);
+}
+
+static inline bool bdi_work_on_stack(struct bdi_work *work)
+{
+ return work->sb_data & 1UL;
+}
- return !test_and_set_bit(wb->nr, &bdi->wb_active);
+static inline void bdi_work_init(struct bdi_work *work, struct super_block *sb,
+ unsigned long nr_pages,
+ enum writeback_sync_modes sync_mode)
+{
+ INIT_RCU_HEAD(&work->rcu_head);
+ work->sb_data = (unsigned long) sb;
+ work->nr_pages = nr_pages;
+ work->sync_mode = sync_mode;
+ work->state = 1;
+}
+
+static inline void bdi_work_init_on_stack(struct bdi_work *work,
+ struct super_block *sb,
+ unsigned long nr_pages,
+ enum writeback_sync_modes sync_mode)
+{
+ bdi_work_init(work, sb, nr_pages, sync_mode);
+ work->sb_data |= 1UL;
}
/**
* writeback_in_progress - determine whether there is writeback in progress
* @bdi: the device's backing_dev_info structure.
*
- * Determine whether there is writeback in progress against a backing device.
+ * Determine whether there is writeback waiting to be handled against a
+ * backing device.
*/
int writeback_in_progress(struct backing_dev_info *bdi)
{
- return bdi->wb_active != 0;
+ return !list_empty(&bdi->work_list);
}
-/**
- * writeback_release - relinquish exclusive writeback access against a device.
- * @bdi: the device's backing_dev_info structure
- */
-static void writeback_release(struct bdi_writeback *wb)
+static void bdi_work_clear(struct bdi_work *work)
{
- struct backing_dev_info *bdi = wb->bdi;
+ clear_bit(0, &work->state);
+ smp_mb__after_clear_bit();
+ wake_up_bit(&work->state, 0);
+}
- wb->nr_pages = 0;
- wb->sb = NULL;
- clear_bit(wb->nr, &bdi->wb_active);
+static void bdi_work_free(struct rcu_head *head)
+{
+ struct bdi_work *work = container_of(head, struct bdi_work, rcu_head);
+
+ if (!bdi_work_on_stack(work))
+ kfree(work);
+ else
+ bdi_work_clear(work);
}
-static void wb_start_writeback(struct bdi_writeback *wb, struct super_block *sb,
- long nr_pages,
- enum writeback_sync_modes sync_mode)
+static void wb_work_complete(struct bdi_work *work)
{
- if (!wb_has_dirty_io(wb))
- return;
+ const enum writeback_sync_modes sync_mode = work->sync_mode;
+
+ /*
+ * For allocated work, we can clear the done/seen bit right here.
+ * For on-stack work, we need to postpone both the clear and free
+ * to after the RCU grace period, since the stack could be invalidated
+ * as soon as bdi_work_clear() has done the wakeup.
+ */
+ if (!bdi_work_on_stack(work))
+ bdi_work_clear(work);
+ else if (sync_mode == WB_SYNC_NONE)
+ call_rcu(&work->rcu_head, bdi_work_free);
+}
+
+static void wb_clear_pending(struct bdi_writeback *wb, struct bdi_work *work)
+{
+ /*
+ * The caller has retrieved the work arguments from this work,
+ * drop our reference. If this is the last ref, delete and free it
+ */
+ if (atomic_dec_and_test(&work->pending)) {
+ struct backing_dev_info *bdi = wb->bdi;
+
+ spin_lock(&bdi->wb_lock);
+ list_del_rcu(&work->list);
+ spin_unlock(&bdi->wb_lock);
+
+ wb_work_complete(work);
+ }
+}
+
+static void wb_start_writeback(struct bdi_writeback *wb, struct bdi_work *work)
+{
+ /*
+ * If we failed allocating the bdi work item, wake up the wb thread
+ * always. As a safety precaution, it'll flush out everything
+ */
+ if (!wb_has_dirty_io(wb) && work)
+ wb_clear_pending(wb, work);
+ else
+ wake_up(&wb->wait);
+}
- if (writeback_acquire(wb)) {
- wb->nr_pages = nr_pages;
- wb->sb = sb;
- wb->sync_mode = sync_mode;
+static void bdi_queue_work(struct backing_dev_info *bdi, struct bdi_work *work)
+{
+ if (work) {
+ work->seen = bdi->wb_mask;
+ BUG_ON(!work->seen);
+ atomic_set(&work->pending, bdi->wb_cnt);
+ BUG_ON(!bdi->wb_cnt);
/*
- * make above store seen before the task is woken
+ * Make sure stores are seen before it appears on the list
*/
smp_mb();
- wake_up(&wb->wait);
+
+ spin_lock(&bdi->wb_lock);
+ list_add_tail_rcu(&work->list, &bdi->work_list);
+ spin_unlock(&bdi->wb_lock);
}
}
-int bdi_start_writeback(struct backing_dev_info *bdi, struct super_block *sb,
- long nr_pages, enum writeback_sync_modes sync_mode)
+static void bdi_sched_work(struct backing_dev_info *bdi, struct bdi_work *work)
+{
+ if (!bdi_wblist_needs_lock(bdi))
+ wb_start_writeback(&bdi->wb, work);
+ else {
+ struct bdi_writeback *wb;
+ int idx;
+
+ idx = srcu_read_lock(&bdi->srcu);
+
+ list_for_each_entry_rcu(wb, &bdi->wb_list, list)
+ wb_start_writeback(wb, work);
+
+ srcu_read_unlock(&bdi->srcu, idx);
+ }
+}
+
+static void __bdi_start_work(struct backing_dev_info *bdi,
+ struct bdi_work *work)
{
/*
- * This only happens the first time someone kicks this bdi, so put
- * it out-of-line.
+ * If the default thread isn't there, make sure we add it. When
+ * it gets created and wakes up, we'll run this work.
*/
- if (unlikely(!bdi->wb.task)) {
+ if (unlikely(list_empty_careful(&bdi->wb_list)))
bdi_add_default_flusher_task(bdi);
- return 1;
+ else
+ bdi_sched_work(bdi, work);
+}
+
+static void bdi_start_work(struct backing_dev_info *bdi, struct bdi_work *work)
+{
+ /*
+ * If the default thread isn't there, make sure we add it. When
+ * it gets created and wakes up, we'll run this work.
+ */
+ if (unlikely(list_empty_careful(&bdi->wb_list))) {
+ mutex_lock(&bdi_lock);
+ bdi_add_default_flusher_task(bdi);
+ mutex_unlock(&bdi_lock);
+ } else
+ bdi_sched_work(bdi, work);
+}
+
+/*
+ * Used for on-stack allocated work items. The caller needs to wait until
+ * the wb threads have acked the work before it's safe to continue.
+ */
+static void bdi_wait_on_work_clear(struct bdi_work *work)
+{
+ wait_on_bit(&work->state, 0, bdi_sched_wait, TASK_UNINTERRUPTIBLE);
+}
+
+static struct bdi_work *bdi_alloc_work(struct super_block *sb, long nr_pages,
+ enum writeback_sync_modes sync_mode)
+{
+ struct bdi_work *work;
+
+ work = kmalloc(sizeof(*work), GFP_ATOMIC);
+ if (work)
+ bdi_work_init(work, sb, nr_pages, sync_mode);
+
+ return work;
+}
+
+void bdi_start_writeback(struct backing_dev_info *bdi, struct super_block *sb,
+ long nr_pages, enum writeback_sync_modes sync_mode)
+{
+ const bool must_wait = sync_mode == WB_SYNC_ALL;
+ struct bdi_work work_stack, *work = NULL;
+
+ if (!must_wait)
+ work = bdi_alloc_work(sb, nr_pages, sync_mode);
+
+ if (!work) {
+ work = &work_stack;
+ bdi_work_init_on_stack(work, sb, nr_pages, sync_mode);
}
- wb_start_writeback(&bdi->wb, sb, nr_pages, sync_mode);
- return 0;
+ bdi_queue_work(bdi, work);
+ bdi_start_work(bdi, work);
+
+ /*
+ * If the sync mode is WB_SYNC_ALL, block waiting for the work to
+ * complete. If not, we only need to wait for the work to be started,
+ * if we allocated it on-stack. We use the same mechanism, if the
+ * wait bit is set in the bdi_work struct, then threads will not
+ * clear pending until after they are done.
+ *
+ * Note that work == &work_stack if must_wait is true, so we don't
+ * need to do call_rcu() here ever, since the completion path will
+ * have done that for us.
+ */
+ if (must_wait || work == &work_stack) {
+ bdi_wait_on_work_clear(work);
+ if (work != &work_stack)
+ call_rcu(&work->rcu_head, bdi_work_free);
+ }
}
/*
@@ -160,7 +326,7 @@ static void wb_kupdated(struct bdi_writeback *wb)
wbc.more_io = 0;
wbc.encountered_congestion = 0;
wbc.nr_to_write = MAX_WRITEBACK_PAGES;
- generic_sync_bdi_inodes(NULL, &wbc);
+ generic_sync_wb_inodes(wb, NULL, &wbc);
if (wbc.nr_to_write > 0)
break; /* All the old data is written */
nr_to_write -= MAX_WRITEBACK_PAGES;
@@ -177,22 +343,19 @@ static inline bool over_bground_thresh(void)
global_page_state(NR_UNSTABLE_NFS) >= background_thresh);
}
-static void generic_sync_wb_inodes(struct bdi_writeback *wb,
- struct super_block *sb,
- struct writeback_control *wbc);
-
-static void wb_writeback(struct bdi_writeback *wb)
+static void __wb_writeback(struct bdi_writeback *wb, long nr_pages,
+ struct super_block *sb,
+ enum writeback_sync_modes sync_mode)
{
struct writeback_control wbc = {
.bdi = wb->bdi,
- .sync_mode = wb->sync_mode,
+ .sync_mode = sync_mode,
.older_than_this = NULL,
.range_cyclic = 1,
};
- long nr_pages = wb->nr_pages;
for (;;) {
- if (wbc.sync_mode == WB_SYNC_NONE && nr_pages <= 0 &&
+ if (sync_mode == WB_SYNC_NONE && nr_pages <= 0 &&
!over_bground_thresh())
break;
@@ -200,7 +363,7 @@ static void wb_writeback(struct bdi_writeback *wb)
wbc.encountered_congestion = 0;
wbc.nr_to_write = MAX_WRITEBACK_PAGES;
wbc.pages_skipped = 0;
- generic_sync_wb_inodes(wb, wb->sb, &wbc);
+ generic_sync_wb_inodes(wb, sb, &wbc);
nr_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
/*
* If we ran out of stuff to write, bail unless more_io got set
@@ -214,67 +377,171 @@ static void wb_writeback(struct bdi_writeback *wb)
}
/*
+ * Return the next bdi_work struct that hasn't been processed by this
+ * wb thread yet
+ */
+static struct bdi_work *get_next_work_item(struct backing_dev_info *bdi,
+ struct bdi_writeback *wb)
+{
+ struct bdi_work *work, *ret = NULL;
+
+ rcu_read_lock();
+
+ list_for_each_entry_rcu(work, &bdi->work_list, list) {
+ if (!test_and_clear_bit(wb->nr, &work->seen))
+ continue;
+
+ ret = work;
+ break;
+ }
+
+ rcu_read_unlock();
+ return ret;
+}
+
+/*
+ * Retrieve work items and do the writeback they describe
+ */
+static void wb_writeback(struct bdi_writeback *wb)
+{
+ struct backing_dev_info *bdi = wb->bdi;
+ struct bdi_work *work;
+
+ while ((work = get_next_work_item(bdi, wb)) != NULL) {
+ struct super_block *sb = bdi_work_sb(work);
+ long nr_pages = work->nr_pages;
+ enum writeback_sync_modes sync_mode = work->sync_mode;
+
+ /*
+ * If this isn't a data integrity operation, just notify
+ * that we have seen this work and we are now starting it.
+ */
+ if (sync_mode == WB_SYNC_NONE)
+ wb_clear_pending(wb, work);
+
+ __wb_writeback(wb, nr_pages, sb, sync_mode);
+
+ /*
+ * This is a data integrity writeback, so only do the
+ * notification when we have completed the work.
+ */
+ if (sync_mode == WB_SYNC_ALL)
+ wb_clear_pending(wb, work);
+ }
+}
+
+/*
+ * This will be inlined in bdi_writeback_task() once we get rid of any
+ * dirty inodes on the default_backing_dev_info
+ */
+void wb_do_writeback(struct bdi_writeback *wb)
+{
+ /*
+ * We get here in two cases:
+ *
+ * schedule_timeout() returned because the dirty writeback
+ * interval has elapsed. If that happens, the work item list
+ * will be empty and we will proceed to do kupdated style writeout.
+ *
+ * Someone called bdi_start_writeback(), which put one/more work
+ * items on the work_list. Process those.
+ */
+ if (list_empty(&wb->bdi->work_list))
+ wb_kupdated(wb);
+ else
+ wb_writeback(wb);
+}
+
+/*
* Handle writeback of dirty data for the device backed by this bdi. Also
* wakes up periodically and does kupdated style flushing.
*/
int bdi_writeback_task(struct bdi_writeback *wb)
{
+ DEFINE_WAIT(wait);
+
while (!kthread_should_stop()) {
unsigned long wait_jiffies;
- DEFINE_WAIT(wait);
+
+ wb_do_writeback(wb);
prepare_to_wait(&wb->wait, &wait, TASK_INTERRUPTIBLE);
wait_jiffies = msecs_to_jiffies(dirty_writeback_interval * 10);
schedule_timeout(wait_jiffies);
try_to_freeze();
-
- /*
- * We get here in two cases:
- *
- * schedule_timeout() returned because the dirty writeback
- * interval has elapsed. If that happens, we will be able
- * to acquire the writeback lock and will proceed to do
- * kupdated style writeout.
- *
- * Someone called bdi_start_writeback(), which will acquire
- * the writeback lock. This means our writeback_acquire()
- * below will fail and we call into bdi_pdflush() for
- * pdflush style writeout.
- *
- */
- if (writeback_acquire(wb))
- wb_kupdated(wb);
- else
- wb_writeback(wb);
-
- writeback_release(wb);
- finish_wait(&wb->wait, &wait);
}
+ finish_wait(&wb->wait, &wait);
return 0;
}
+/*
+ * Schedule writeback for all backing devices. Expensive! If this is a data
+ * integrity operation, writeback will be complete when this returns. If
+ * we are simply called for WB_SYNC_NONE, then writeback will merely be
+ * scheduled to run.
+ */
void bdi_writeback_all(struct super_block *sb, struct writeback_control *wbc)
{
+ const bool must_wait = wbc->sync_mode == WB_SYNC_ALL;
struct backing_dev_info *bdi, *tmp;
+ struct bdi_work *work;
+ LIST_HEAD(list);
mutex_lock(&bdi_lock);
list_for_each_entry_safe(bdi, tmp, &bdi_list, bdi_list) {
+ struct bdi_work *work;
+
if (!bdi_has_dirty_io(bdi))
continue;
- bdi_start_writeback(bdi, sb, wbc->nr_to_write, wbc->sync_mode);
+
+ /*
+ * If work allocation fails, do the writes inline. An
+ * alternative approach would be to fall back to an on-stack
+ * allocation of work. For that we need to drop the bdi_lock
+ * and restart the scan afterwards, though.
+ */
+ work = bdi_alloc_work(sb, wbc->nr_to_write, wbc->sync_mode);
+ if (!work) {
+ wbc->bdi = bdi;
+ generic_sync_bdi_inodes(sb, wbc);
+ continue;
+ }
+ if (must_wait)
+ list_add_tail(&work->wait_list, &list);
+
+ bdi_queue_work(bdi, work);
+ __bdi_start_work(bdi, work);
}
mutex_unlock(&bdi_lock);
+
+ /*
+ * If this is for WB_SYNC_ALL, wait for pending work to complete
+ * before returning.
+ */
+ while (!list_empty(&list)) {
+ work = list_entry(list.next, struct bdi_work, wait_list);
+ list_del(&work->wait_list);
+ bdi_wait_on_work_clear(work);
+ call_rcu(&work->rcu_head, bdi_work_free);
+ }
}
/*
- * We have only a single wb per bdi, so just return that.
+ * If the filesystem didn't provide a way to map an inode to a dedicated
+ * flusher thread, it doesn't support more than 1 thread. So we know it's
+ * the default thread, return that.
*/
static inline struct bdi_writeback *inode_get_wb(struct inode *inode)
{
- return &inode_to_bdi(inode)->wb;
+ const struct super_operations *sop = inode->i_sb->s_op;
+
+ if (!sop->inode_get_wb)
+ return &inode_to_bdi(inode)->wb;
+
+ return sop->inode_get_wb(inode);
}
/**
@@ -728,8 +995,24 @@ void generic_sync_bdi_inodes(struct super_block *sb,
struct writeback_control *wbc)
{
struct backing_dev_info *bdi = wbc->bdi;
+ struct bdi_writeback *wb;
- generic_sync_wb_inodes(&bdi->wb, sb, wbc);
+ /*
+ * Common case is just a single wb thread and that is embedded in
+ * the bdi, so it doesn't need locking
+ */
+ if (!bdi_wblist_needs_lock(bdi))
+ generic_sync_wb_inodes(&bdi->wb, sb, wbc);
+ else {
+ int idx;
+
+ idx = srcu_read_lock(&bdi->srcu);
+
+ list_for_each_entry_rcu(wb, &bdi->wb_list, list)
+ generic_sync_wb_inodes(wb, sb, wbc);
+
+ srcu_read_unlock(&bdi->srcu, idx);
+ }
}
/*
@@ -756,7 +1039,7 @@ void generic_sync_sb_inodes(struct super_block *sb,
struct writeback_control *wbc)
{
if (wbc->bdi)
- generic_sync_bdi_inodes(sb, wbc);
+ bdi_start_writeback(wbc->bdi, sb, wbc->nr_to_write, wbc->sync_mode);
else
bdi_writeback_all(sb, wbc);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 4acc64e..352d47d 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -13,6 +13,8 @@
#include <linux/proportions.h>
#include <linux/kernel.h>
#include <linux/fs.h>
+#include <linux/sched.h>
+#include <linux/srcu.h>
#include <linux/writeback.h>
#include <asm/atomic.h>
@@ -26,6 +28,7 @@ struct dentry;
enum bdi_state {
BDI_pending, /* On its way to being activated */
BDI_wb_alloc, /* Default embedded wb allocated */
+ BDI_wblist_lock, /* bdi->wb_list now needs locking */
BDI_async_congested, /* The async (write) queue is getting full */
BDI_sync_congested, /* The sync queue is getting full */
BDI_unused, /* Available bits start here */
@@ -42,6 +45,8 @@ enum bdi_stat_item {
#define BDI_STAT_BATCH (8*(1+ilog2(nr_cpu_ids)))
struct bdi_writeback {
+ struct list_head list; /* hangs off the bdi */
+
struct backing_dev_info *bdi; /* our parent bdi */
unsigned int nr;
@@ -50,13 +55,12 @@ struct bdi_writeback {
struct list_head b_dirty; /* dirty inodes */
struct list_head b_io; /* parked for writeback */
struct list_head b_more_io; /* parked for more writeback */
-
- unsigned long nr_pages;
- struct super_block *sb;
- enum writeback_sync_modes sync_mode;
};
+#define BDI_MAX_FLUSHERS 32
+
struct backing_dev_info {
+ struct srcu_struct srcu; /* for wb_list read side protection */
struct list_head bdi_list;
unsigned long ra_pages; /* max readahead in PAGE_CACHE_SIZE units */
unsigned long state; /* Always use atomic bitops on this */
@@ -75,8 +79,12 @@ struct backing_dev_info {
unsigned int max_ratio, max_prop_frac;
struct bdi_writeback wb; /* default writeback info for this bdi */
- unsigned long wb_active; /* bitmap of active tasks */
- unsigned long wb_mask; /* number of registered tasks */
+ spinlock_t wb_lock; /* protects update side of wb_list */
+ struct list_head wb_list; /* the flusher threads hanging off this bdi */
+ unsigned long wb_mask; /* bitmask of registered tasks */
+ unsigned int wb_cnt; /* number of registered tasks */
+
+ struct list_head work_list;
struct device *dev;
@@ -93,16 +101,22 @@ int bdi_register(struct backing_dev_info *bdi, struct device *parent,
const char *fmt, ...);
int bdi_register_dev(struct backing_dev_info *bdi, dev_t dev);
void bdi_unregister(struct backing_dev_info *bdi);
-int bdi_start_writeback(struct backing_dev_info *bdi, struct super_block *sb,
+void bdi_start_writeback(struct backing_dev_info *bdi, struct super_block *sb,
long nr_pages, enum writeback_sync_modes sync_mode);
int bdi_writeback_task(struct bdi_writeback *wb);
void bdi_writeback_all(struct super_block *sb, struct writeback_control *wbc);
void bdi_add_default_flusher_task(struct backing_dev_info *bdi);
+void bdi_add_flusher_task(struct backing_dev_info *bdi);
int bdi_has_dirty_io(struct backing_dev_info *bdi);
extern struct mutex bdi_lock;
extern struct list_head bdi_list;
+static inline int bdi_wblist_needs_lock(struct backing_dev_info *bdi)
+{
+ return test_bit(BDI_wblist_lock, &bdi->state);
+}
+
static inline int wb_has_dirty_io(struct bdi_writeback *wb)
{
return !list_empty(&wb->b_dirty) ||
@@ -315,4 +329,10 @@ static inline bool mapping_cap_swap_backed(struct address_space *mapping)
return bdi_cap_swap_backed(mapping->backing_dev_info);
}
+static inline int bdi_sched_wait(void *word)
+{
+ schedule();
+ return 0;
+}
+
#endif /* _LINUX_BACKING_DEV_H */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ecdc544..d3bda5d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1550,11 +1550,14 @@ extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
extern ssize_t vfs_writev(struct file *, const struct iovec __user *,
unsigned long, loff_t *);
+struct bdi_writeback;
+
struct super_operations {
struct inode *(*alloc_inode)(struct super_block *sb);
void (*destroy_inode)(struct inode *);
void (*dirty_inode) (struct inode *);
+ struct bdi_writeback *(*inode_get_wb) (struct inode *);
int (*write_inode) (struct inode *, int);
void (*drop_inode) (struct inode *);
void (*delete_inode) (struct inode *);
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index baf04a9..e414702 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -69,6 +69,7 @@ void writeback_inodes(struct writeback_control *wbc);
int inode_wait(void *);
void sync_inodes_sb(struct super_block *, int wait);
void sync_inodes(int wait);
+void wb_do_writeback(struct bdi_writeback *wb);
/* writeback.h requires fs.h; it, too, is not included from here. */
static inline void wait_on_inode(struct inode *inode)
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 28c6a7d..ce40890 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -199,7 +199,47 @@ static int __init default_bdi_init(void)
}
subsys_initcall(default_bdi_init);
-static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
+static int wb_assign_nr(struct backing_dev_info *bdi, struct bdi_writeback *wb)
+{
+ unsigned long mask = BDI_MAX_FLUSHERS - 1;
+ unsigned int nr;
+
+ do {
+ if ((bdi->wb_mask & mask) == mask)
+ return 1;
+
+ nr = find_first_zero_bit(&bdi->wb_mask, BDI_MAX_FLUSHERS);
+ } while (test_and_set_bit(nr, &bdi->wb_mask));
+
+ wb->nr = nr;
+
+ spin_lock(&bdi->wb_lock);
+ bdi->wb_cnt++;
+ spin_unlock(&bdi->wb_lock);
+
+ return 0;
+}
+
+static void bdi_put_wb(struct backing_dev_info *bdi, struct bdi_writeback *wb)
+{
+ /*
+ * If this is the default wb thread exiting, leave the bit set
+ * in the wb mask as we set that before it's created as well. This
+ * is done to make sure that assigned work with no thread has at
+ * least one recipient.
+ */
+ if (wb == &bdi->wb)
+ clear_bit(BDI_wb_alloc, &bdi->state);
+ else {
+ clear_bit(wb->nr, &bdi->wb_mask);
+ kfree(wb);
+ spin_lock(&bdi->wb_lock);
+ bdi->wb_cnt--;
+ spin_unlock(&bdi->wb_lock);
+ }
+}
+
+static int bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
{
memset(wb, 0, sizeof(*wb));
@@ -208,44 +248,52 @@ static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
INIT_LIST_HEAD(&wb->b_dirty);
INIT_LIST_HEAD(&wb->b_io);
INIT_LIST_HEAD(&wb->b_more_io);
-}
-
-static int wb_assign_nr(struct backing_dev_info *bdi, struct bdi_writeback *wb)
-{
- set_bit(0, &bdi->wb_mask);
- wb->nr = 0;
- return 0;
-}
-static void bdi_put_wb(struct backing_dev_info *bdi, struct bdi_writeback *wb)
-{
- clear_bit(wb->nr, &bdi->wb_mask);
- clear_bit(BDI_wb_alloc, &bdi->state);
+ return wb_assign_nr(bdi, wb);
}
static struct bdi_writeback *bdi_new_wb(struct backing_dev_info *bdi)
{
struct bdi_writeback *wb;
- set_bit(BDI_wb_alloc, &bdi->state);
- wb = &bdi->wb;
- wb_assign_nr(bdi, wb);
+ /*
+ * Default bdi->wb is already assigned, so just return it
+ */
+ if (!test_and_set_bit(BDI_wb_alloc, &bdi->state))
+ wb = &bdi->wb;
+ else {
+ wb = kmalloc(sizeof(struct bdi_writeback), GFP_KERNEL);
+ if (wb) {
+ if (bdi_wb_init(wb, bdi)) {
+ kfree(wb);
+ wb = NULL;
+ }
+ }
+ }
+
return wb;
}
-static int bdi_start_fn(void *ptr)
+static void bdi_task_init(struct backing_dev_info *bdi,
+ struct bdi_writeback *wb)
{
- struct bdi_writeback *wb = ptr;
- struct backing_dev_info *bdi = wb->bdi;
struct task_struct *tsk = current;
- int ret;
+ int was_empty;
/*
- * Add us to the active bdi_list
+ * Add us to the bdi's wb_list. If we are adding threads beyond
+ * the default embedded bdi_writeback, then we need to start using
+ * proper locking. Check the list for empty first, then set the
+ * BDI_wblist_lock flag if there's > 1 entry on the list now
*/
- mutex_lock(&bdi_lock);
- list_add(&bdi->bdi_list, &bdi_list);
- mutex_unlock(&bdi_lock);
+ spin_lock(&bdi->wb_lock);
+
+ was_empty = list_empty(&bdi->wb_list);
+ list_add_tail_rcu(&wb->list, &bdi->wb_list);
+ if (!was_empty)
+ set_bit(BDI_wblist_lock, &bdi->state);
+
+ spin_unlock(&bdi->wb_lock);
tsk->flags |= PF_FLUSHER | PF_SWAPWRITE;
set_freezable();
@@ -254,6 +302,22 @@ static int bdi_start_fn(void *ptr)
* Our parent may run at a different priority, just set us to normal
*/
set_user_nice(tsk, 0);
+}
+
+static int bdi_start_fn(void *ptr)
+{
+ struct bdi_writeback *wb = ptr;
+ struct backing_dev_info *bdi = wb->bdi;
+ int ret;
+
+ /*
+ * Add us to the active bdi_list
+ */
+ mutex_lock(&bdi_lock);
+ list_add(&bdi->bdi_list, &bdi_list);
+ mutex_unlock(&bdi_lock);
+
+ bdi_task_init(bdi, wb);
/*
* Clear pending bit and wakeup anybody waiting to tear us down
@@ -264,13 +328,44 @@ static int bdi_start_fn(void *ptr)
ret = bdi_writeback_task(wb);
+ /*
+ * Remove us from the list
+ */
+ spin_lock(&bdi->wb_lock);
+ list_del_rcu(&wb->list);
+ spin_unlock(&bdi->wb_lock);
+
+ /*
+ * wait for rcu grace period to end, so we can free wb
+ */
+ synchronize_srcu(&bdi->srcu);
+
bdi_put_wb(bdi, wb);
return ret;
}
int bdi_has_dirty_io(struct backing_dev_info *bdi)
{
- return wb_has_dirty_io(&bdi->wb);
+ struct bdi_writeback *wb;
+ int ret = 0;
+
+ if (!bdi_wblist_needs_lock(bdi))
+ ret = wb_has_dirty_io(&bdi->wb);
+ else {
+ int idx;
+
+ idx = srcu_read_lock(&bdi->srcu);
+
+ list_for_each_entry_rcu(wb, &bdi->wb_list, list) {
+ ret = wb_has_dirty_io(wb);
+ if (ret)
+ break;
+ }
+
+ srcu_read_unlock(&bdi->srcu, idx);
+ }
+
+ return ret;
}
static void bdi_flush_io(struct backing_dev_info *bdi)
@@ -291,6 +386,8 @@ static int bdi_forker_task(void *ptr)
struct bdi_writeback *me = ptr;
DEFINE_WAIT(wait);
+ bdi_task_init(me->bdi, me);
+
for (;;) {
struct backing_dev_info *bdi, *tmp;
struct bdi_writeback *wb;
@@ -304,8 +401,8 @@ static int bdi_forker_task(void *ptr)
* Temporary measure, we want to make sure we don't see
* dirty data on the default backing_dev_info
*/
- if (wb_has_dirty_io(me))
- bdi_flush_io(me->bdi);
+ if (wb_has_dirty_io(me) || !list_empty(&me->bdi->work_list))
+ wb_do_writeback(me);
prepare_to_wait(&me->wait, &wait, TASK_INTERRUPTIBLE);
@@ -375,27 +472,70 @@ readd_flush:
}
/*
- * Add a new flusher task that gets created for any bdi
- * that has dirty data pending writeout
+ * bdi_lock held on entry
*/
-void bdi_add_default_flusher_task(struct backing_dev_info *bdi)
+static void bdi_add_one_flusher_task(struct backing_dev_info *bdi,
+ int(*func)(struct backing_dev_info *))
{
if (!bdi_cap_writeback_dirty(bdi))
return;
/*
- * Someone already marked this pending for task creation
+ * Check with the helper whether to proceed adding a task. Will only
+ * abort if two or more simultaneous calls to
+ * bdi_add_default_flusher_task() occurred; further additions will block
+ * waiting for previous additions to finish.
*/
- if (test_and_set_bit(BDI_pending, &bdi->state))
- return;
+ if (!func(bdi)) {
+ list_move_tail(&bdi->bdi_list, &bdi_pending_list);
- mutex_lock(&bdi_lock);
- list_move_tail(&bdi->bdi_list, &bdi_pending_list);
+ /*
+ * We are now on the pending list, wake up bdi_forker_task()
+ * to finish the job and add us back to the active bdi_list
+ */
+ wake_up(&default_backing_dev_info.wb.wait);
+ }
+}
+
+static int flusher_add_helper_block(struct backing_dev_info *bdi)
+{
mutex_unlock(&bdi_lock);
+ wait_on_bit_lock(&bdi->state, BDI_pending, bdi_sched_wait,
+ TASK_UNINTERRUPTIBLE);
+ mutex_lock(&bdi_lock);
+ return 0;
+}
- wake_up(&default_backing_dev_info.wb.wait);
+static int flusher_add_helper_test(struct backing_dev_info *bdi)
+{
+ return test_and_set_bit(BDI_pending, &bdi->state);
+}
+
+/*
+ * Add the default flusher task that gets created for any bdi
+ * that has dirty data pending writeout
+ */
+void bdi_add_default_flusher_task(struct backing_dev_info *bdi)
+{
+ bdi_add_one_flusher_task(bdi, flusher_add_helper_test);
}
+/**
+ * bdi_add_flusher_task - add one more flusher task to this @bdi
+ * @bdi: the bdi
+ *
+ * Add an additional flusher task to this @bdi. Will block waiting on
+ * previous additions, if any.
+ *
+ */
+void bdi_add_flusher_task(struct backing_dev_info *bdi)
+{
+ mutex_lock(&bdi_lock);
+ bdi_add_one_flusher_task(bdi, flusher_add_helper_block);
+ mutex_unlock(&bdi_lock);
+}
+EXPORT_SYMBOL(bdi_add_flusher_task);
+
int bdi_register(struct backing_dev_info *bdi, struct device *parent,
const char *fmt, ...)
{
@@ -459,24 +599,21 @@ int bdi_register_dev(struct backing_dev_info *bdi, dev_t dev)
}
EXPORT_SYMBOL(bdi_register_dev);
-static int sched_wait(void *word)
-{
- schedule();
- return 0;
-}
-
/*
* Remove bdi from global list and shutdown any threads we have running
*/
static void bdi_wb_shutdown(struct backing_dev_info *bdi)
{
+ struct bdi_writeback *wb;
+
if (!bdi_cap_writeback_dirty(bdi))
return;
/*
* If setup is pending, wait for that to complete first
*/
- wait_on_bit(&bdi->state, BDI_pending, sched_wait, TASK_UNINTERRUPTIBLE);
+ wait_on_bit(&bdi->state, BDI_pending, bdi_sched_wait,
+ TASK_UNINTERRUPTIBLE);
/*
* Make sure nobody finds us on the bdi_list anymore
@@ -486,9 +623,11 @@ static void bdi_wb_shutdown(struct backing_dev_info *bdi)
mutex_unlock(&bdi_lock);
/*
- * Finally, kill the kernel thread
+ * Finally, kill the kernel threads. We don't need to be RCU
+ * safe anymore, since the bdi is gone from visibility.
*/
- kthread_stop(bdi->wb.task);
+ list_for_each_entry(wb, &bdi->wb_list, list)
+ kthread_stop(wb->task);
}
void bdi_unregister(struct backing_dev_info *bdi)
@@ -512,8 +651,12 @@ int bdi_init(struct backing_dev_info *bdi)
bdi->min_ratio = 0;
bdi->max_ratio = 100;
bdi->max_prop_frac = PROP_FRAC_BASE;
+ spin_lock_init(&bdi->wb_lock);
+ bdi->wb_mask = 0;
+ bdi->wb_cnt = 0;
INIT_LIST_HEAD(&bdi->bdi_list);
- bdi->wb_mask = bdi->wb_active = 0;
+ INIT_LIST_HEAD(&bdi->wb_list);
+ INIT_LIST_HEAD(&bdi->work_list);
bdi_wb_init(&bdi->wb, bdi);
@@ -523,10 +666,15 @@ int bdi_init(struct backing_dev_info *bdi)
goto err;
}
+ err = init_srcu_struct(&bdi->srcu);
+ if (err)
+ goto err;
+
bdi->dirty_exceeded = 0;
err = prop_local_init_percpu(&bdi->completions);
if (err) {
+ cleanup_srcu_struct(&bdi->srcu);
err:
while (i--)
percpu_counter_destroy(&bdi->bdi_stat[i]);
@@ -544,6 +692,8 @@ void bdi_destroy(struct backing_dev_info *bdi)
bdi_unregister(bdi);
+ cleanup_srcu_struct(&bdi->srcu);
+
for (i = 0; i < NR_BDI_STAT_ITEMS; i++)
percpu_counter_destroy(&bdi->bdi_stat[i]);
--
1.6.3.rc0.1.gf800
^ permalink raw reply related [flat|nested] 41+ messages in thread
* Re: [PATCH 07/11] writeback: support > 1 flusher thread per bdi
2009-05-27 9:41 ` [PATCH 07/11] writeback: support > 1 flusher thread per bdi Jens Axboe
@ 2009-05-28 9:27 ` Jan Kara
2009-05-28 10:40 ` Jens Axboe
0 siblings, 1 reply; 41+ messages in thread
From: Jan Kara @ 2009-05-28 9:27 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-kernel, linux-fsdevel, chris.mason, david, hch, akpm, jack,
yanmin_zhang, richard, damien.wyart
On Wed 27-05-09 11:41:48, Jens Axboe wrote:
> Build on the bdi_writeback support by allowing registration of
> more than 1 flusher thread. File systems can call bdi_add_flusher_task(bdi)
> to add more flusher threads to the device. If they do so, they must also
> provide a super_operations function to return the suitable bdi_writeback
> struct from any given inode.
Isn't this patch in fact doing two different things? Looking at the
changes it implements more flusher threads per BDI as you write above but
it also does all that sync-work handling which isn't quite trivial and
seems to be a separate thing...
> +static struct bdi_work *bdi_alloc_work(struct super_block *sb, long nr_pages,
> + enum writeback_sync_modes sync_mode)
> +{
> + struct bdi_work *work;
> +
> + work = kmalloc(sizeof(*work), GFP_ATOMIC);
Why is this GFP_ATOMIC? Wouldn't GFP_NOFS be enough?
But for now, GFP_ATOMIC may be fine since it exercises the
"allocation failed" path more :).
> + if (work)
> + bdi_work_init(work, sb, nr_pages, sync_mode);
> +
> + return work;
> +}
> +
> +void bdi_start_writeback(struct backing_dev_info *bdi, struct super_block *sb,
> + long nr_pages, enum writeback_sync_modes sync_mode)
> +{
> + const bool must_wait = sync_mode == WB_SYNC_ALL;
> + struct bdi_work work_stack, *work = NULL;
> +
> + if (!must_wait)
> + work = bdi_alloc_work(sb, nr_pages, sync_mode);
> +
> + if (!work) {
> + work = &work_stack;
> + bdi_work_init_on_stack(work, sb, nr_pages, sync_mode);
> }
>
> - wb_start_writeback(&bdi->wb, sb, nr_pages, sync_mode);
> - return 0;
> + bdi_queue_work(bdi, work);
> + bdi_start_work(bdi, work);
> +
> + /*
> + * If the sync mode is WB_SYNC_ALL, block waiting for the work to
> + * complete. If not, we only need to wait for the work to be started,
> + * if we allocated it on-stack. We use the same mechanism, if the
> + * wait bit is set in the bdi_work struct, then threads will not
> + * clear pending until after they are done.
> + *
> + * Note that work == &work_stack if must_wait is true, so we don't
> + * need to do call_rcu() here ever, since the completion path will
> + * have done that for us.
> + */
> + if (must_wait || work == &work_stack) {
> + bdi_wait_on_work_clear(work);
> + if (work != &work_stack)
> + call_rcu(&work->rcu_head, bdi_work_free);
> + }
> }
Looking into this, it seems a bit complex with all that on_stack, sync /
nosync thing. Wouldn't a simple refcounting scheme be more clear? Alloc /
init_work_on_stack gets a reference from bdi_work, queue_work gets another
reference passed later to flusher thread. We drop the reference when we
leave bdi_start_writeback() and when flusher thread is done with the work.
When refcount hits zero, work struct is freed (when work is on stack, we
just never drop the last reference)...
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 07/11] writeback: support > 1 flusher thread per bdi
2009-05-28 9:27 ` Jan Kara
@ 2009-05-28 10:40 ` Jens Axboe
2009-05-28 12:43 ` Jan Kara
0 siblings, 1 reply; 41+ messages in thread
From: Jens Axboe @ 2009-05-28 10:40 UTC (permalink / raw)
To: Jan Kara
Cc: linux-kernel, linux-fsdevel, chris.mason, david, hch, akpm,
yanmin_zhang, richard, damien.wyart
On Thu, May 28 2009, Jan Kara wrote:
> On Wed 27-05-09 11:41:48, Jens Axboe wrote:
> > Build on the bdi_writeback support by allowing registration of
> > more than 1 flusher thread. File systems can call bdi_add_flusher_task(bdi)
> > to add more flusher threads to the device. If they do so, they must also
> > provide a super_operations function to return the suitable bdi_writeback
> > struct from any given inode.
> Isn't this patch in fact doing two different things? Looking at the
> changes it implements more flusher threads per BDI as you write above but
> it also does all that sync-work handling which isn't quite trivial and
> seems to be a separate thing...
Yes, but the separated work is a pre-requisite for multi thread support.
But I guess it would clean it up as a patchset if I added that with the
previous patch, that would save me from having to add the
writeback_acquire()/release changes. But I still have to do those, so
perhaps not... Unless you feel strongly about it, I'd prefer to leave it
as-is.
> > +static struct bdi_work *bdi_alloc_work(struct super_block *sb, long nr_pages,
> > + enum writeback_sync_modes sync_mode)
> > +{
> > + struct bdi_work *work;
> > +
> > + work = kmalloc(sizeof(*work), GFP_ATOMIC);
> Why is this GFP_ATOMIC? Wouldn't GFP_NOFS be enough?
> But for now, GFP_ATOMIC may be fine since it exercises the
> "allocation failed" path more :).
GFP_NOFS/NOIO should be ok. I prefer GFP_ATOMIC since we never enter any
reclaim, so better safe than sorry. We still have to handle failure
either way, the rate of failure may be different though.
> > + if (work)
> > + bdi_work_init(work, sb, nr_pages, sync_mode);
> > +
> > + return work;
> > +}
> > +
> > +void bdi_start_writeback(struct backing_dev_info *bdi, struct super_block *sb,
> > + long nr_pages, enum writeback_sync_modes sync_mode)
> > +{
> > + const bool must_wait = sync_mode == WB_SYNC_ALL;
> > + struct bdi_work work_stack, *work = NULL;
> > +
> > + if (!must_wait)
> > + work = bdi_alloc_work(sb, nr_pages, sync_mode);
> > +
> > + if (!work) {
> > + work = &work_stack;
> > + bdi_work_init_on_stack(work, sb, nr_pages, sync_mode);
> > }
> >
> > - wb_start_writeback(&bdi->wb, sb, nr_pages, sync_mode);
> > - return 0;
> > + bdi_queue_work(bdi, work);
> > + bdi_start_work(bdi, work);
> > +
> > + /*
> > + * If the sync mode is WB_SYNC_ALL, block waiting for the work to
> > + * complete. If not, we only need to wait for the work to be started,
> > + * if we allocated it on-stack. We use the same mechanism, if the
> > + * wait bit is set in the bdi_work struct, then threads will not
> > + * clear pending until after they are done.
> > + *
> > + * Note that work == &work_stack if must_wait is true, so we don't
> > + * need to do call_rcu() here ever, since the completion path will
> > + * have done that for us.
> > + */
> > + if (must_wait || work == &work_stack) {
> > + bdi_wait_on_work_clear(work);
> > + if (work != &work_stack)
> > + call_rcu(&work->rcu_head, bdi_work_free);
> > + }
> > }
> Looking into this, it seems a bit complex with all that on_stack, sync /
> nosync thing. Wouldn't a simple refcounting scheme be more clear? Alloc /
> init_work_on_stack gets a reference from bdi_work, queue_work gets another
> reference passed later to flusher thread. We drop the reference when we
> leave bdi_start_writeback() and when flusher thread is done with the work.
> When refcount hits zero, work struct is freed (when work is on stack, we
> just never drop the last reference)...
It wouldn't change the complexity of the stack vs non-stack at all,
since you have to do the same checks for when it's safe to proceed. And
having the single bit there with the hash bit wait queues makes that bit
easier.
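For reference, the wait/wake side is just the usual bit waitqueue pairing,
roughly like below (untested sketch; the field and bit names are placeholders,
the real helpers are in the patch):

static void bdi_work_clear(struct bdi_work *work)
{
	clear_bit(WS_USED_B, &work->state);
	smp_mb__after_clear_bit();
	/*
	 * Hashed bit waitqueues: no per-work waitqueue needed, just
	 * wake whoever is sleeping on this bit.
	 */
	wake_up_bit(&work->state, WS_USED_B);
}

static void bdi_wait_on_work_clear(struct bdi_work *work)
{
	wait_on_bit(&work->state, WS_USED_B, bdi_sched_wait,
		    TASK_UNINTERRUPTIBLE);
}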
--
Jens Axboe
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 07/11] writeback: support > 1 flusher thread per bdi
2009-05-28 10:40 ` Jens Axboe
@ 2009-05-28 12:43 ` Jan Kara
2009-05-28 12:53 ` Jens Axboe
0 siblings, 1 reply; 41+ messages in thread
From: Jan Kara @ 2009-05-28 12:43 UTC (permalink / raw)
To: Jens Axboe
Cc: Jan Kara, linux-kernel, linux-fsdevel, chris.mason, david, hch,
akpm, yanmin_zhang, richard, damien.wyart
On Thu 28-05-09 12:40:23, Jens Axboe wrote:
> On Thu, May 28 2009, Jan Kara wrote:
> > On Wed 27-05-09 11:41:48, Jens Axboe wrote:
> > > Build on the bdi_writeback support by allowing registration of
> > > more than 1 flusher thread. File systems can call bdi_add_flusher_task(bdi)
> > > to add more flusher threads to the device. If they do so, they must also
> > > provide a super_operations function to return the suitable bdi_writeback
> > > struct from any given inode.
> > Isn't this patch in fact doing two different things? Looking at the
> > changes it implements more flusher threads per BDI as you write above but
> > it also does all that sync-work handling which isn't quite trivial and
> > seems to be a separate thing...
>
> Yes, but the separated work is a pre-requisite for multi thread support.
> But I guess it would clean it up as a patchset if I added that with the
> previous patch, that would save me from having to add the
> writeback_acquire()/release changes. But I still have to do those, so
> perhaps not... Unless you feel strongly about it, I'd prefer to leave it
> as-is.
OK.
> > > +static struct bdi_work *bdi_alloc_work(struct super_block *sb, long nr_pages,
> > > + enum writeback_sync_modes sync_mode)
> > > +{
> > > + struct bdi_work *work;
> > > +
> > > + work = kmalloc(sizeof(*work), GFP_ATOMIC);
> > Why is this GFP_ATOMIC? Wouldn't GFP_NOFS be enough?
> > But for now, GFP_ATOMIC may be fine since it exercises the
> > "allocation failed" path more :).
>
> GFP_NOFS/NOIO should be ok. I prefer GFP_ATOMIC since we never enter any
> reclaim, so better safe than sorry. We still have to handle failure
> either way, the rate of failure may be different though.
Yes, it was mainly about the rate of failure since if the allocation
fails, we have to synchronously wait for writeout to complete...
> > > + if (work)
> > > + bdi_work_init(work, sb, nr_pages, sync_mode);
> > > +
> > > + return work;
> > > +}
> > > +
> > > +void bdi_start_writeback(struct backing_dev_info *bdi, struct super_block *sb,
> > > + long nr_pages, enum writeback_sync_modes sync_mode)
> > > +{
> > > + const bool must_wait = sync_mode == WB_SYNC_ALL;
> > > + struct bdi_work work_stack, *work = NULL;
> > > +
> > > + if (!must_wait)
> > > + work = bdi_alloc_work(sb, nr_pages, sync_mode);
> > > +
> > > + if (!work) {
> > > + work = &work_stack;
> > > + bdi_work_init_on_stack(work, sb, nr_pages, sync_mode);
> > > }
> > >
> > > - wb_start_writeback(&bdi->wb, sb, nr_pages, sync_mode);
> > > - return 0;
> > > + bdi_queue_work(bdi, work);
> > > + bdi_start_work(bdi, work);
> > > +
> > > + /*
> > > + * If the sync mode is WB_SYNC_ALL, block waiting for the work to
> > > + * complete. If not, we only need to wait for the work to be started,
> > > + * if we allocated it on-stack. We use the same mechanism, if the
> > > + * wait bit is set in the bdi_work struct, then threads will not
> > > + * clear pending until after they are done.
> > > + *
> > > + * Note that work == &work_stack if must_wait is true, so we don't
> > > + * need to do call_rcu() here ever, since the completion path will
> > > + * have done that for us.
> > > + */
> > > + if (must_wait || work == &work_stack) {
> > > + bdi_wait_on_work_clear(work);
> > > + if (work != &work_stack)
> > > + call_rcu(&work->rcu_head, bdi_work_free);
> > > + }
> > > }
> > Looking into this, it seems a bit complex with all that on_stack, sync /
> > nosync thing. Wouldn't a simple refcounting scheme be more clear? Alloc /
> > init_work_on_stack gets a reference from bdi_work, queue_work gets another
> > reference passed later to flusher thread. We drop the reference when we
> > leave bdi_start_writeback() and when flusher thread is done with the work.
> > When refcount hits zero, work struct is freed (when work is on stack, we
> > just never drop the last reference)...
>
> It wouldn't change the complexity of the stack vs non-stack at all,
> since you have to do the same checks for when it's safe to proceed. And
> having the single bit there with the hash bit wait queues makes that bit
> easier.
I think it would be simpler. Look:
static void bdi_work_free(struct rcu_head *head)
{
struct bdi_work *work = container_of(head, struct bdi_work, rcu_head);
kfree(work);
}
static void bdi_put_work(struct bdi_work *work)
{
if (atomic_dec_and_test(&work->count))
call_rcu(&work->rcu_head, bdi_work_free);
}
static void wb_work_complete(struct bdi_work *work)
{
bdi_work_clear(work);
bdi_put_work(work);
}
void bdi_start_writeback(...)
{
...
if (must_wait || work == &work_stack)
bdi_wait_on_work_clear(work);
if (work != &work_stack)
bdi_put_work(work);
}
IMO much easier to read...
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 07/11] writeback: support > 1 flusher thread per bdi
2009-05-28 12:43 ` Jan Kara
@ 2009-05-28 12:53 ` Jens Axboe
2009-05-28 13:58 ` Jan Kara
0 siblings, 1 reply; 41+ messages in thread
From: Jens Axboe @ 2009-05-28 12:53 UTC (permalink / raw)
To: Jan Kara
Cc: linux-kernel, linux-fsdevel, chris.mason, david, hch, akpm,
yanmin_zhang, richard, damien.wyart
On Thu, May 28 2009, Jan Kara wrote:
> > > Looking into this, it seems a bit complex with all that on_stack, sync /
> > > nosync thing. Wouldn't a simple refcounting scheme be more clear? Alloc /
> > > init_work_on_stack gets a reference from bdi_work, queue_work gets another
> > > reference passed later to flusher thread. We drop the reference when we
> > > leave bdi_start_writeback() and when flusher thread is done with the work.
> > > When refcount hits zero, work struct is freed (when work is on stack, we
> > > just never drop the last reference)...
> >
> > It wouldn't change the complexity of the stack vs non-stack at all,
> > since you have to do the same checks for when it's safe to proceed. And
> > having the single bit there with the hash bit wait queues makes that bit
> > easier.
> I think it would be simpler. Look:
> static void bdi_work_free(struct rcu_head *head)
> {
> struct bdi_work *work = container_of(head, struct bdi_work, rcu_head);
>
> kfree(work);
> }
>
> static void bdi_put_work(struct bdi_work *work)
> {
> if (atomic_dec_and_test(&work->count))
> call_rcu(&work->rcu_head, bdi_work_free);
> }
>
> static void wb_work_complete(struct bdi_work *work)
> {
> bdi_work_clear(work);
> bdi_put_work(work);
> }
>
> void bdi_start_writeback(...)
> {
> ...
> if (must_wait || work == &work_stack)
> bdi_wait_on_work_clear(work);
> if (work != &work_stack)
> bdi_put_work(work);
> }
>
> IMO much easier to read...
And that doesn't work, since you cannot exit after clearing the on-stack work
before RCU has quiesced. The bdi_work could still be browsable by other
threads under rcu_read_lock(); just like you defer the kfree(), you have
to defer the bdi_work_clear() for on-stack work.
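So even with a refcount, the completion path ends up looking something like
this (untested, bdi_work_on_stack() is just pseudo here):

static void bdi_work_on_stack_free(struct rcu_head *head)
{
	struct bdi_work *work = container_of(head, struct bdi_work, rcu_head);

	/*
	 * Only now, after the grace period, may we wake the waiter and
	 * let the on-stack work go out of scope.
	 */
	bdi_work_clear(work);
}

static void wb_work_complete(struct bdi_work *work)
{
	if (!bdi_work_on_stack(work)) {
		/* heap allocated work, waiters may proceed right away */
		bdi_work_clear(work);
		bdi_put_work(work);
	} else {
		/* on-stack work, even the clear must wait for RCU */
		call_rcu(&work->rcu_head, bdi_work_on_stack_free);
	}
}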
--
Jens Axboe
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 07/11] writeback: support > 1 flusher thread per bdi
2009-05-28 12:53 ` Jens Axboe
@ 2009-05-28 13:58 ` Jan Kara
0 siblings, 0 replies; 41+ messages in thread
From: Jan Kara @ 2009-05-28 13:58 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-kernel, linux-fsdevel, chris.mason, david, hch, akpm,
yanmin_zhang, richard, damien.wyart
On Thu 28-05-09 14:53:08, Jens Axboe wrote:
> On Thu, May 28 2009, Jan Kara wrote:
> > > > Looking into this, it seems a bit complex with all that on_stack, sync /
> > > > nosync thing. Wouldn't a simple refcounting scheme be more clear? Alloc /
> > > > init_work_on_stack gets a reference from bdi_work, queue_work gets another
> > > > reference passed later to flusher thread. We drop the reference when we
> > > > leave bdi_start_writeback() and when flusher thread is done with the work.
> > > > When refcount hits zero, work struct is freed (when work is on stack, we
> > > > just never drop the last reference)...
> > >
> > > It wouldn't change the complexity of the stack vs non-stack at all,
> > > since you have to do the same checks for when it's safe to proceed. And
> > > having the single bit there with the hash bit wait queues makes that bit
> > > easier.
> > I think it would be simpler. Look:
> > static void bdi_work_free(struct rcu_head *head)
> > {
> > struct bdi_work *work = container_of(head, struct bdi_work, rcu_head);
> >
> > kfree(work);
> > }
> >
> > static void bdi_put_work(struct bdi_work *work)
> > {
> > if (atomic_dec_and_test(&work->count))
> > call_rcu(&work->rcu_head, bdi_work_free);
> > }
> >
> > static void wb_work_complete(struct bdi_work *work)
> > {
> > bdi_work_clear(work);
> > bdi_put_work(work);
> > }
> >
> > void bdi_start_writeback(...)
> > {
> > ...
> > if (must_wait || work == &work_stack)
> > bdi_wait_on_work_clear(work);
> > if (work != &work_stack)
> > bdi_put_work(work);
> > }
> >
> > IMO much easier to read...
>
> And that doesn't work, since you cannot exit after clearing the on-stack work
> before RCU has quiesced. The bdi_work could still be browsable by other
> threads under rcu_read_lock(); just like you defer the kfree(), you have
> to defer the bdi_work_clear() for on-stack work.
Right. My fault. Sorry for the noise.
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 41+ messages in thread
* [PATCH 08/11] writeback: allow sleepy exit of default writeback task
2009-05-27 9:41 [PATCH 0/11] Per-bdi writeback flusher threads v8 Jens Axboe
` (6 preceding siblings ...)
2009-05-27 9:41 ` [PATCH 07/11] writeback: support > 1 flusher thread per bdi Jens Axboe
@ 2009-05-27 9:41 ` Jens Axboe
2009-05-27 9:41 ` [PATCH 09/11] writeback: add some debug inode list counters to bdi stats Jens Axboe
` (4 subsequent siblings)
12 siblings, 0 replies; 41+ messages in thread
From: Jens Axboe @ 2009-05-27 9:41 UTC (permalink / raw)
To: linux-kernel, linux-fsdevel
Cc: chris.mason, david, hch, akpm, jack, yanmin_zhang, richard,
damien.wyart, Jens Axboe
Since we do lazy create of default writeback tasks for a bdi, we can
allow sleepy exit if it has been completely idle for 5 minutes.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
fs/fs-writeback.c | 52 ++++++++++++++++++++++++++++++++++--------
include/linux/backing-dev.h | 5 ++++
include/linux/writeback.h | 2 +-
3 files changed, 48 insertions(+), 11 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 92b3df7..8ccf270 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -303,10 +303,10 @@ void bdi_start_writeback(struct backing_dev_info *bdi, struct super_block *sb,
* older_than_this takes precedence over nr_to_write. So we'll only write back
* all dirty pages if they are all attached to "old" mappings.
*/
-static void wb_kupdated(struct bdi_writeback *wb)
+static long wb_kupdated(struct bdi_writeback *wb)
{
unsigned long oldest_jif;
- long nr_to_write;
+ long nr_to_write, wrote = 0;
struct writeback_control wbc = {
.bdi = wb->bdi,
.sync_mode = WB_SYNC_NONE,
@@ -327,10 +327,13 @@ static void wb_kupdated(struct bdi_writeback *wb)
wbc.encountered_congestion = 0;
wbc.nr_to_write = MAX_WRITEBACK_PAGES;
generic_sync_wb_inodes(wb, NULL, &wbc);
+ wrote += MAX_WRITEBACK_PAGES - wbc.nr_to_write;
if (wbc.nr_to_write > 0)
break; /* All the old data is written */
nr_to_write -= MAX_WRITEBACK_PAGES;
}
+
+ return wrote;
}
static inline bool over_bground_thresh(void)
@@ -343,7 +346,7 @@ static inline bool over_bground_thresh(void)
global_page_state(NR_UNSTABLE_NFS) >= background_thresh);
}
-static void __wb_writeback(struct bdi_writeback *wb, long nr_pages,
+static long __wb_writeback(struct bdi_writeback *wb, long nr_pages,
struct super_block *sb,
enum writeback_sync_modes sync_mode)
{
@@ -353,6 +356,7 @@ static void __wb_writeback(struct bdi_writeback *wb, long nr_pages,
.older_than_this = NULL,
.range_cyclic = 1,
};
+ long wrote = 0;
for (;;) {
if (sync_mode == WB_SYNC_NONE && nr_pages <= 0 &&
@@ -365,6 +369,7 @@ static void __wb_writeback(struct bdi_writeback *wb, long nr_pages,
wbc.pages_skipped = 0;
generic_sync_wb_inodes(wb, sb, &wbc);
nr_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
+ wrote += MAX_WRITEBACK_PAGES - wbc.nr_to_write;
/*
* If we ran out of stuff to write, bail unless more_io got set
*/
@@ -374,6 +379,8 @@ static void __wb_writeback(struct bdi_writeback *wb, long nr_pages,
break;
}
}
+
+ return wrote;
}
/*
@@ -402,10 +409,11 @@ static struct bdi_work *get_next_work_item(struct backing_dev_info *bdi,
/*
* Retrieve work items and do the writeback they describe
*/
-static void wb_writeback(struct bdi_writeback *wb)
+static long wb_writeback(struct bdi_writeback *wb)
{
struct backing_dev_info *bdi = wb->bdi;
struct bdi_work *work;
+ long wrote = 0;
while ((work = get_next_work_item(bdi, wb)) != NULL) {
struct super_block *sb = bdi_work_sb(work);
@@ -419,7 +427,7 @@ static void wb_writeback(struct bdi_writeback *wb)
if (sync_mode == WB_SYNC_NONE)
wb_clear_pending(wb, work);
- __wb_writeback(wb, nr_pages, sb, sync_mode);
+ wrote += __wb_writeback(wb, nr_pages, sb, sync_mode);
/*
* This is a data integrity writeback, so only do the
@@ -428,14 +436,18 @@ static void wb_writeback(struct bdi_writeback *wb)
if (sync_mode == WB_SYNC_ALL)
wb_clear_pending(wb, work);
}
+
+ return wrote;
}
/*
* This will be inlined in bdi_writeback_task() once we get rid of any
* dirty inodes on the default_backing_dev_info
*/
-void wb_do_writeback(struct bdi_writeback *wb)
+long wb_do_writeback(struct bdi_writeback *wb)
{
+ long wrote;
+
/*
* We get here in two cases:
*
@@ -447,9 +459,11 @@ void wb_do_writeback(struct bdi_writeback *wb)
* items on the work_list. Process those.
*/
if (list_empty(&wb->bdi->work_list))
- wb_kupdated(wb);
+ wrote = wb_kupdated(wb);
else
- wb_writeback(wb);
+ wrote = wb_writeback(wb);
+
+ return wrote;
}
/*
@@ -458,12 +472,30 @@ void wb_do_writeback(struct bdi_writeback *wb)
*/
int bdi_writeback_task(struct bdi_writeback *wb)
{
+ unsigned long last_active = jiffies;
+ unsigned long wait_jiffies = -1UL;
+ long pages_written;
DEFINE_WAIT(wait);
while (!kthread_should_stop()) {
- unsigned long wait_jiffies;
- wb_do_writeback(wb);
+ pages_written = wb_do_writeback(wb);
+
+ if (pages_written)
+ last_active = jiffies;
+ else if (wait_jiffies != -1UL) {
+ unsigned long max_idle;
+
+ /*
+ * Longest period of inactivity that we tolerate. If we
+ * see dirty data again later, the task will get
+ * recreated automatically.
+ */
+ max_idle = max(5UL * 60 * HZ, wait_jiffies);
+ if (time_after(jiffies, max_idle + last_active) &&
+ wb_is_default_task(wb))
+ break;
+ }
prepare_to_wait(&wb->wait, &wait, TASK_INTERRUPTIBLE);
wait_jiffies = msecs_to_jiffies(dirty_writeback_interval * 10);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 352d47d..f6d1b05 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -112,6 +112,11 @@ int bdi_has_dirty_io(struct backing_dev_info *bdi);
extern struct mutex bdi_lock;
extern struct list_head bdi_list;
+static inline int wb_is_default_task(struct bdi_writeback *wb)
+{
+ return wb == &wb->bdi->wb;
+}
+
static inline int bdi_wblist_needs_lock(struct backing_dev_info *bdi)
{
return test_bit(BDI_wblist_lock, &bdi->state);
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index e414702..30e318b 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -69,7 +69,7 @@ void writeback_inodes(struct writeback_control *wbc);
int inode_wait(void *);
void sync_inodes_sb(struct super_block *, int wait);
void sync_inodes(int wait);
-void wb_do_writeback(struct bdi_writeback *wb);
+long wb_do_writeback(struct bdi_writeback *wb);
/* writeback.h requires fs.h; it, too, is not included from here. */
static inline void wait_on_inode(struct inode *inode)
--
1.6.3.rc0.1.gf800
^ permalink raw reply related [flat|nested] 41+ messages in thread
* [PATCH 09/11] writeback: add some debug inode list counters to bdi stats
2009-05-27 9:41 [PATCH 0/11] Per-bdi writeback flusher threads v8 Jens Axboe
` (7 preceding siblings ...)
2009-05-27 9:41 ` [PATCH 08/11] writeback: allow sleepy exit of default writeback task Jens Axboe
@ 2009-05-27 9:41 ` Jens Axboe
2009-05-27 9:41 ` [PATCH 10/11] writeback: add name to backing_dev_info Jens Axboe
` (3 subsequent siblings)
12 siblings, 0 replies; 41+ messages in thread
From: Jens Axboe @ 2009-05-27 9:41 UTC (permalink / raw)
To: linux-kernel, linux-fsdevel
Cc: chris.mason, david, hch, akpm, jack, yanmin_zhang, richard,
damien.wyart, Jens Axboe
Add some debug entries to be able to inspect the internal state of
the writeback details.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
mm/backing-dev.c | 38 ++++++++++++++++++++++++++++++++++----
1 files changed, 34 insertions(+), 4 deletions(-)
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index ce40890..9ed6522 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -43,9 +43,29 @@ static void bdi_debug_init(void)
static int bdi_debug_stats_show(struct seq_file *m, void *v)
{
struct backing_dev_info *bdi = m->private;
+ struct bdi_writeback *wb;
unsigned long background_thresh;
unsigned long dirty_thresh;
unsigned long bdi_thresh;
+ unsigned long nr_dirty, nr_io, nr_more_io, nr_wb;
+ struct inode *inode;
+
+ /*
+ * inode lock is enough here, the bdi->wb_list is protected by
+ * RCU on the reader side
+ */
+ nr_wb = nr_dirty = nr_io = nr_more_io = 0;
+ spin_lock(&inode_lock);
+ list_for_each_entry(wb, &bdi->wb_list, list) {
+ nr_wb++;
+ list_for_each_entry(inode, &wb->b_dirty, i_list)
+ nr_dirty++;
+ list_for_each_entry(inode, &wb->b_io, i_list)
+ nr_io++;
+ list_for_each_entry(inode, &wb->b_more_io, i_list)
+ nr_more_io++;
+ }
+ spin_unlock(&inode_lock);
get_dirty_limits(&background_thresh, &dirty_thresh, &bdi_thresh, bdi);
@@ -55,12 +75,22 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
"BdiReclaimable: %8lu kB\n"
"BdiDirtyThresh: %8lu kB\n"
"DirtyThresh: %8lu kB\n"
- "BackgroundThresh: %8lu kB\n",
+ "BackgroundThresh: %8lu kB\n"
+ "WriteBack threads:%8lu\n"
+ "b_dirty: %8lu\n"
+ "b_io: %8lu\n"
+ "b_more_io: %8lu\n"
+ "bdi_list: %8u\n"
+ "state: %8lx\n"
+ "wb_mask: %8lx\n"
+ "wb_list: %8u\n"
+ "wb_cnt: %8u\n",
(unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
(unsigned long) K(bdi_stat(bdi, BDI_RECLAIMABLE)),
- K(bdi_thresh),
- K(dirty_thresh),
- K(background_thresh));
+ K(bdi_thresh), K(dirty_thresh),
+ K(background_thresh), nr_wb, nr_dirty, nr_io, nr_more_io,
+ !list_empty(&bdi->bdi_list), bdi->state, bdi->wb_mask,
+ !list_empty(&bdi->wb_list), bdi->wb_cnt);
#undef K
return 0;
--
1.6.3.rc0.1.gf800
^ permalink raw reply related [flat|nested] 41+ messages in thread
* [PATCH 10/11] writeback: add name to backing_dev_info
2009-05-27 9:41 [PATCH 0/11] Per-bdi writeback flusher threads v8 Jens Axboe
` (8 preceding siblings ...)
2009-05-27 9:41 ` [PATCH 09/11] writeback: add some debug inode list counters to bdi stats Jens Axboe
@ 2009-05-27 9:41 ` Jens Axboe
2009-05-27 9:41 ` [PATCH 11/11] writeback: check for registered bdi in flusher add and inode dirty Jens Axboe
` (2 subsequent siblings)
12 siblings, 0 replies; 41+ messages in thread
From: Jens Axboe @ 2009-05-27 9:41 UTC (permalink / raw)
To: linux-kernel, linux-fsdevel
Cc: chris.mason, david, hch, akpm, jack, yanmin_zhang, richard,
damien.wyart, Jens Axboe
This enables us to track who does what and print info. Its main use
is catching dirty inodes on the default_backing_dev_info, so we can
fix that up.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
block/blk-core.c | 1 +
drivers/block/aoe/aoeblk.c | 1 +
drivers/char/mem.c | 1 +
fs/btrfs/disk-io.c | 1 +
fs/char_dev.c | 1 +
fs/configfs/inode.c | 1 +
fs/fuse/inode.c | 1 +
fs/hugetlbfs/inode.c | 1 +
fs/nfs/client.c | 1 +
fs/ocfs2/dlm/dlmfs.c | 1 +
fs/ramfs/inode.c | 1 +
fs/sysfs/inode.c | 1 +
fs/ubifs/super.c | 1 +
include/linux/backing-dev.h | 2 ++
kernel/cgroup.c | 1 +
mm/backing-dev.c | 1 +
mm/swap_state.c | 1 +
17 files changed, 18 insertions(+), 0 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index c89883b..d3f18b5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -517,6 +517,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
q->backing_dev_info.unplug_io_fn = blk_backing_dev_unplug;
q->backing_dev_info.unplug_io_data = q;
+ q->backing_dev_info.name = "block";
err = bdi_init(&q->backing_dev_info);
if (err) {
kmem_cache_free(blk_requestq_cachep, q);
diff --git a/drivers/block/aoe/aoeblk.c b/drivers/block/aoe/aoeblk.c
index 2307a27..0efb8fc 100644
--- a/drivers/block/aoe/aoeblk.c
+++ b/drivers/block/aoe/aoeblk.c
@@ -265,6 +265,7 @@ aoeblk_gdalloc(void *vp)
}
blk_queue_make_request(&d->blkq, aoeblk_make_request);
+ d->blkq.backing_dev_info.name = "aoe";
if (bdi_init(&d->blkq.backing_dev_info))
goto err_mempool;
spin_lock_irqsave(&d->lock, flags);
diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index 8f05c38..3b38093 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -820,6 +820,7 @@ static const struct file_operations zero_fops = {
* - permits private mappings, "copies" are taken of the source of zeros
*/
static struct backing_dev_info zero_bdi = {
+ .name = "char/mem",
.capabilities = BDI_CAP_MAP_COPY,
};
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 2dc19c9..eff2a82 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1353,6 +1353,7 @@ static int setup_bdi(struct btrfs_fs_info *info, struct backing_dev_info *bdi)
{
int err;
+ bdi->name = "btrfs";
bdi->capabilities = BDI_CAP_MAP_COPY;
err = bdi_init(bdi);
if (err)
diff --git a/fs/char_dev.c b/fs/char_dev.c
index 38f7122..350ef9c 100644
--- a/fs/char_dev.c
+++ b/fs/char_dev.c
@@ -32,6 +32,7 @@
* - no readahead or I/O queue unplugging required
*/
struct backing_dev_info directly_mappable_cdev_bdi = {
+ .name = "char",
.capabilities = (
#ifdef CONFIG_MMU
/* permit private copies of the data to be taken */
diff --git a/fs/configfs/inode.c b/fs/configfs/inode.c
index 5d349d3..9a266cd 100644
--- a/fs/configfs/inode.c
+++ b/fs/configfs/inode.c
@@ -46,6 +46,7 @@ static const struct address_space_operations configfs_aops = {
};
static struct backing_dev_info configfs_backing_dev_info = {
+ .name = "configfs",
.ra_pages = 0, /* No readahead */
.capabilities = BDI_CAP_NO_ACCT_AND_WRITEBACK,
};
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 91f7c85..e5e8b03 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -484,6 +484,7 @@ int fuse_conn_init(struct fuse_conn *fc, struct super_block *sb)
INIT_LIST_HEAD(&fc->bg_queue);
INIT_LIST_HEAD(&fc->entry);
atomic_set(&fc->num_waiting, 0);
+ fc->bdi.name = "fuse";
fc->bdi.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
fc->bdi.unplug_io_fn = default_unplug_io_fn;
/* fuse does it's own writeback accounting */
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index c1462d4..db1e537 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -43,6 +43,7 @@ static const struct inode_operations hugetlbfs_dir_inode_operations;
static const struct inode_operations hugetlbfs_inode_operations;
static struct backing_dev_info hugetlbfs_backing_dev_info = {
+ .name = "hugetlbfs",
.ra_pages = 0, /* No readahead */
.capabilities = BDI_CAP_NO_ACCT_AND_WRITEBACK,
};
diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index 75c9cd2..3a26d06 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -836,6 +836,7 @@ static void nfs_server_set_fsinfo(struct nfs_server *server, struct nfs_fsinfo *
server->rsize = NFS_MAX_FILE_IO_SIZE;
server->rpages = (server->rsize + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+ server->backing_dev_info.name = "nfs";
server->backing_dev_info.ra_pages = server->rpages * NFS_MAX_READAHEAD;
if (server->wsize > max_rpc_payload)
diff --git a/fs/ocfs2/dlm/dlmfs.c b/fs/ocfs2/dlm/dlmfs.c
index 1c9efb4..02bf178 100644
--- a/fs/ocfs2/dlm/dlmfs.c
+++ b/fs/ocfs2/dlm/dlmfs.c
@@ -325,6 +325,7 @@ clear_fields:
}
static struct backing_dev_info dlmfs_backing_dev_info = {
+ .name = "ocfs2-dlmfs",
.ra_pages = 0, /* No readahead */
.capabilities = BDI_CAP_NO_ACCT_AND_WRITEBACK,
};
diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
index 3a6b193..5a24199 100644
--- a/fs/ramfs/inode.c
+++ b/fs/ramfs/inode.c
@@ -46,6 +46,7 @@ static const struct super_operations ramfs_ops;
static const struct inode_operations ramfs_dir_inode_operations;
static struct backing_dev_info ramfs_backing_dev_info = {
+ .name = "ramfs",
.ra_pages = 0, /* No readahead */
.capabilities = BDI_CAP_NO_ACCT_AND_WRITEBACK |
BDI_CAP_MAP_DIRECT | BDI_CAP_MAP_COPY |
diff --git a/fs/sysfs/inode.c b/fs/sysfs/inode.c
index 555f0ff..e57f98e 100644
--- a/fs/sysfs/inode.c
+++ b/fs/sysfs/inode.c
@@ -29,6 +29,7 @@ static const struct address_space_operations sysfs_aops = {
};
static struct backing_dev_info sysfs_backing_dev_info = {
+ .name = "sysfs",
.ra_pages = 0, /* No readahead */
.capabilities = BDI_CAP_NO_ACCT_AND_WRITEBACK,
};
diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
index e9f7a75..2349e2c 100644
--- a/fs/ubifs/super.c
+++ b/fs/ubifs/super.c
@@ -1923,6 +1923,7 @@ static int ubifs_fill_super(struct super_block *sb, void *data, int silent)
*
* Read-ahead will be disabled because @c->bdi.ra_pages is 0.
*/
+ c->bdi.name = "ubifs";
c->bdi.capabilities = BDI_CAP_MAP_COPY;
c->bdi.unplug_io_fn = default_unplug_io_fn;
err = bdi_init(&c->bdi);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index f6d1b05..e6c7f3d 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -70,6 +70,8 @@ struct backing_dev_info {
void (*unplug_io_fn)(struct backing_dev_info *, struct page *);
void *unplug_io_data;
+ char *name;
+
struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
struct prop_local_percpu completions;
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index a7267bf..0863c5f 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -598,6 +598,7 @@ static struct inode_operations cgroup_dir_inode_operations;
static struct file_operations proc_cgroupstats_operations;
static struct backing_dev_info cgroup_backing_dev_info = {
+ .name = "cgroup",
.capabilities = BDI_CAP_NO_ACCT_AND_WRITEBACK,
};
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 9ed6522..24ebee9 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -17,6 +17,7 @@ void default_unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
EXPORT_SYMBOL(default_unplug_io_fn);
struct backing_dev_info default_backing_dev_info = {
+ .name = "default",
.ra_pages = VM_MAX_READAHEAD * 1024 / PAGE_CACHE_SIZE,
.state = 0,
.capabilities = BDI_CAP_MAP_COPY | BDI_CAP_FLUSH_FORKER,
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 3ecea98..323da00 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -34,6 +34,7 @@ static const struct address_space_operations swap_aops = {
};
static struct backing_dev_info swap_backing_dev_info = {
+ .name = "swap",
.capabilities = BDI_CAP_NO_ACCT_AND_WRITEBACK | BDI_CAP_SWAP_BACKED,
.unplug_io_fn = swap_unplug_io_fn,
};
--
1.6.3.rc0.1.gf800
^ permalink raw reply related [flat|nested] 41+ messages in thread
* [PATCH 11/11] writeback: check for registered bdi in flusher add and inode dirty
2009-05-27 9:41 [PATCH 0/11] Per-bdi writeback flusher threads v8 Jens Axboe
` (9 preceding siblings ...)
2009-05-27 9:41 ` [PATCH 10/11] writeback: add name to backing_dev_info Jens Axboe
@ 2009-05-27 9:41 ` Jens Axboe
2009-05-27 12:41 ` [PATCH 0/11] Per-bdi writeback flusher threads v8 Richard Kennedy
2009-05-27 14:47 ` Theodore Tso
12 siblings, 0 replies; 41+ messages in thread
From: Jens Axboe @ 2009-05-27 9:41 UTC (permalink / raw)
To: linux-kernel, linux-fsdevel
Cc: chris.mason, david, hch, akpm, jack, yanmin_zhang, richard,
damien.wyart, Jens Axboe
Also a debugging aid. We want to catch dirty inodes being added to
backing devices that don't do writeback.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
fs/fs-writeback.c | 7 +++++++
include/linux/backing-dev.h | 1 +
mm/backing-dev.c | 6 ++++++
3 files changed, 14 insertions(+), 0 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 8ccf270..18e4b25 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -675,6 +675,13 @@ void __mark_inode_dirty(struct inode *inode, int flags)
*/
if (!was_dirty) {
struct bdi_writeback *wb = inode_get_wb(inode);
+ struct backing_dev_info *bdi = wb->bdi;
+
+ if (bdi_cap_writeback_dirty(bdi) &&
+ !test_bit(BDI_registered, &bdi->state)) {
+ WARN_ON(1);
+ printk("bdi-%s not registered\n", bdi->name);
+ }
inode->dirtied_when = jiffies;
list_move(&inode->i_list, &wb->b_dirty);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index e6c7f3d..679cfb8 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -31,6 +31,7 @@ enum bdi_state {
BDI_wblist_lock, /* bdi->wb_list now needs locking */
BDI_async_congested, /* The async (write) queue is getting full */
BDI_sync_congested, /* The sync queue is getting full */
+ BDI_registered, /* bdi_register() was done */
BDI_unused, /* Available bits start here */
};
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 24ebee9..f71588c 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -511,6 +511,11 @@ static void bdi_add_one_flusher_task(struct backing_dev_info *bdi,
if (!bdi_cap_writeback_dirty(bdi))
return;
+ if (WARN_ON(!test_bit(BDI_registered, &bdi->state))) {
+ printk("bdi %p/%s is not registered!\n", bdi, bdi->name);
+ return;
+ }
+
/*
* Check with the helper whether to proceed adding a task. Will only
* abort if two or more simultaneous calls to
@@ -619,6 +624,7 @@ remove_err:
}
bdi_debug_register(bdi, dev_name(dev));
+ set_bit(BDI_registered, &bdi->state);
exit:
return ret;
}
--
1.6.3.rc0.1.gf800
^ permalink raw reply related [flat|nested] 41+ messages in thread
* Re: [PATCH 0/11] Per-bdi writeback flusher threads v8
2009-05-27 9:41 [PATCH 0/11] Per-bdi writeback flusher threads v8 Jens Axboe
` (10 preceding siblings ...)
2009-05-27 9:41 ` [PATCH 11/11] writeback: check for registered bdi in flusher add and inode dirty Jens Axboe
@ 2009-05-27 12:41 ` Richard Kennedy
2009-05-27 12:47 ` Jens Axboe
2009-05-27 14:47 ` Theodore Tso
12 siblings, 1 reply; 41+ messages in thread
From: Richard Kennedy @ 2009-05-27 12:41 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-kernel, linux-fsdevel, chris.mason, david, hch, akpm, jack,
yanmin_zhang, damien.wyart
On Wed, 2009-05-27 at 11:41 +0200, Jens Axboe wrote:
> Hi,
>
> Here's the 8th version of the writeback patches. Changes since v7:
>
> - Fold the "include default_backing_dev_info in writeback" patch into
> the core, we should just do it from the beginning.
> - More series cleanup, I think it should be mostly complete now. No
> hunks are split between patches now (things like comments for
> functions added earlier, and so on).
> - Fix hang with calling bdi_wait_on_work_clear() inside the bdi_lock
> mutex when the default wb thread had exited.
> - Fix hang with queuing work on exited wb thread, it would have no
> recipients since the default wb thread wrongly cleared the bit
> from the register mask on exit. It must be persistent, which is
> why it gets initialized on bdi_register() already.
>
> For ease of patching, I've put the full diff here:
>
> http://kernel.dk/writeback-v8.patch
>
> and also stored this in a writeback-v8 branch that will not change,
> you can pull that into Linus tree from here:
>
> git://git.kernel.dk/linux-2.6-block.git writeback-v8
>
> b/block/blk-core.c | 1
> b/drivers/block/aoe/aoeblk.c | 1
> b/drivers/char/mem.c | 1
> b/fs/btrfs/disk-io.c | 24 -
> b/fs/buffer.c | 2
> b/fs/char_dev.c | 1
> b/fs/configfs/inode.c | 1
> b/fs/fs-writeback.c | 807 +++++++++++++++++++++++++++-------
> b/fs/fuse/inode.c | 1
> b/fs/hugetlbfs/inode.c | 1
> b/fs/nfs/client.c | 1
> b/fs/ntfs/super.c | 33 -
> b/fs/ocfs2/dlm/dlmfs.c | 1
> b/fs/ramfs/inode.c | 1
> b/fs/super.c | 3
> b/fs/sync.c | 2
> b/fs/sysfs/inode.c | 1
> b/fs/ubifs/super.c | 1
> b/include/linux/backing-dev.h | 74 +++
> b/include/linux/fs.h | 11
> b/include/linux/writeback.h | 15
> b/kernel/cgroup.c | 1
> b/mm/Makefile | 2
> b/mm/backing-dev.c | 476 +++++++++++++++++++-
> b/mm/page-writeback.c | 151 ------
> b/mm/swap_state.c | 1
> b/mm/vmscan.c | 2
> mm/pdflush.c | 269 -----------
> 28 files changed, 1248 insertions(+), 637 deletions(-)
>
Hi Jens,
This is working nicely for me. It's successfully built a kernel with no
problems at all, so you've fixed the earlier problem I had.
Early fio tests writing to 2 disks at the same time indicate that it's
faster too. But there's a fair amount of variability so I will run more
tests & see what the trend really is.
regards
Richard
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/11] Per-bdi writeback flusher threads v8
2009-05-27 12:41 ` [PATCH 0/11] Per-bdi writeback flusher threads v8 Richard Kennedy
@ 2009-05-27 12:47 ` Jens Axboe
0 siblings, 0 replies; 41+ messages in thread
From: Jens Axboe @ 2009-05-27 12:47 UTC (permalink / raw)
To: Richard Kennedy
Cc: linux-kernel, linux-fsdevel, chris.mason, david, hch, akpm, jack,
yanmin_zhang, damien.wyart
On Wed, May 27 2009, Richard Kennedy wrote:
> On Wed, 2009-05-27 at 11:41 +0200, Jens Axboe wrote:
> > Hi,
> >
> > Here's the 8th version of the writeback patches. Changes since v7:
> >
> > - Fold the "include default_backing_dev_info in writeback" patch into
> > the core, we should just do it from the beginning.
> > - More series cleanup, I think it should be mostly complete now. No
> > hunks are split between patches now (things like comments for
> > functions added earlier, and so on).
> > - Fix hang with calling bdi_wait_on_work_clear() inside the bdi_lock
> > mutex when the default wb thread had exited.
> > - Fix hang with queuing work on exited wb thread, it would have no
> > recipients since the default wb thread wrongly cleared the bit
> > from the register mask on exit. It must be persistent, which is
> > why it gets initialized on bdi_register() already.
> >
> > For ease of patching, I've put the full diff here:
> >
> > http://kernel.dk/writeback-v8.patch
> >
> > and also stored this in a writeback-v8 branch that will not change,
> > you can pull that into Linus tree from here:
> >
> > git://git.kernel.dk/linux-2.6-block.git writeback-v8
> >
> > b/block/blk-core.c | 1
> > b/drivers/block/aoe/aoeblk.c | 1
> > b/drivers/char/mem.c | 1
> > b/fs/btrfs/disk-io.c | 24 -
> > b/fs/buffer.c | 2
> > b/fs/char_dev.c | 1
> > b/fs/configfs/inode.c | 1
> > b/fs/fs-writeback.c | 807 +++++++++++++++++++++++++++-------
> > b/fs/fuse/inode.c | 1
> > b/fs/hugetlbfs/inode.c | 1
> > b/fs/nfs/client.c | 1
> > b/fs/ntfs/super.c | 33 -
> > b/fs/ocfs2/dlm/dlmfs.c | 1
> > b/fs/ramfs/inode.c | 1
> > b/fs/super.c | 3
> > b/fs/sync.c | 2
> > b/fs/sysfs/inode.c | 1
> > b/fs/ubifs/super.c | 1
> > b/include/linux/backing-dev.h | 74 +++
> > b/include/linux/fs.h | 11
> > b/include/linux/writeback.h | 15
> > b/kernel/cgroup.c | 1
> > b/mm/Makefile | 2
> > b/mm/backing-dev.c | 476 +++++++++++++++++++-
> > b/mm/page-writeback.c | 151 ------
> > b/mm/swap_state.c | 1
> > b/mm/vmscan.c | 2
> > mm/pdflush.c | 269 -----------
> > 28 files changed, 1248 insertions(+), 637 deletions(-)
> >
> Hi Jens,
>
> This is working nicely for me. It's successfully built a kernel with no
> problems at all, so you've fixed the earlier problem I had.
>
> Early fio tests writing to 2 disks at the same time indicate that it's
> faster too. But there's a fair amount of variability so I will run more
> tests & see what the trend really is.
That's good to hear, thanks for testing! Do let me know if you run into
anything suspicious later on.
--
Jens Axboe
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/11] Per-bdi writeback flusher threads v8
2009-05-27 9:41 [PATCH 0/11] Per-bdi writeback flusher threads v8 Jens Axboe
` (11 preceding siblings ...)
2009-05-27 12:41 ` [PATCH 0/11] Per-bdi writeback flusher threads v8 Richard Kennedy
@ 2009-05-27 14:47 ` Theodore Tso
2009-05-27 15:05 ` Jens Axboe
2009-05-27 17:53 ` Theodore Tso
12 siblings, 2 replies; 41+ messages in thread
From: Theodore Tso @ 2009-05-27 14:47 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-kernel, linux-fsdevel, chris.mason, david, hch, akpm, jack,
yanmin_zhang, richard, damien.wyart
Hi Jens,
FYI, just for yucks I tried to merge your bdi patches and the ext4
patch queue to make sure there were no patch conflicts, and started a
quick regression run. I got the following soft lockup report when
running "fsstress -d /mnt/tmp -s 12345 -S -p 20 -n 1000".
I'll retry the test with your stock writeback-v8 git branch w/o any
ext4 patches planned for the next mainline merge window to see if I get the
same soft lockup, but I thought I should give you an early heads up.
- Ted
[46283.324099] kjournald2 starting: pid 23258, dev dm-4:8, commit interval 5 seconds
[46283.326503] EXT4 FS on dm-4, internal journal on dm-4:8
[46283.326523] EXT4-fs: delayed allocation enabled
[46283.326538] EXT4-fs: file extents enabled
[46283.336368] EXT4-fs: mballoc enabled
[46283.337053] EXT4-fs: mounted filesystem dm-4 with ordered data mode
[46440.889702] INFO: task umount:22890 blocked for more than 120 seconds.
[46440.889715] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[46440.889725] umount D 00002a0d 2288 22890 19517
[46440.889742] df5b0db8 00000046 d8a93c7f 00002a0d c0be7148 c0e5cbc8 c0163ebd c0a78700
[46440.889770] c0a78700 df5b0d74 c0164e28 f145a980 f145abfc c2d13700 00000000 d8ad1b4f
[46440.889797] 00002a0d c0165031 00000006 f145a980 c05e9dd4 00000002 df5b0d9c f145abfc
[46440.889823] Call Trace:
[46440.889844] [<c0163ebd>] ? lock_release_holdtime+0x30/0x131
[46440.889856] [<c0164e28>] ? mark_lock+0x1e/0x1e4
[46440.889868] [<c0165031>] ? mark_held_locks+0x43/0x5b
[46440.889882] [<c05e9dd4>] ? _spin_unlock_irqrestore+0x3c/0x48
[46440.889894] [<c01652ba>] ? trace_hardirqs_on+0xb/0xd
[46440.889909] [<c05e823f>] schedule+0x8/0x17
[46440.889922] [<c01d7031>] bdi_sched_wait+0x8/0xc
[46440.889933] [<c05e8728>] __wait_on_bit+0x36/0x5d
[46440.889944] [<c01d7029>] ? bdi_sched_wait+0x0/0xc
[46440.889956] [<c05e87fa>] out_of_line_wait_on_bit+0xab/0xb3
[46440.889968] [<c01d7029>] ? bdi_sched_wait+0x0/0xc
[46440.889981] [<c01577ae>] ? wake_bit_function+0x0/0x43
[46440.889993] [<c01d61b6>] wait_on_bit+0x20/0x2c
[46440.890005] [<c01d6d2e>] bdi_writeback_all+0x161/0x18e
[46440.890019] [<c0199f63>] ? wait_on_page_writeback_range+0x9d/0xdc
[46440.890032] [<c01d6e6f>] generic_sync_sb_inodes+0x2f/0xcc
[46440.890044] [<c01d6f7a>] sync_inodes_sb+0x6e/0x76
[46440.890057] [<c01c1aa0>] __fsync_super+0x63/0x66
[46440.890069] [<c01c1aae>] fsync_super+0xb/0x19
[46440.890080] [<c01c1d16>] generic_shutdown_super+0x1c/0xde
[46440.890092] [<c01c1df5>] kill_block_super+0x1d/0x31
[46440.890104] [<c01f0acd>] ? vfs_quota_off+0x0/0x12
[46440.890115] [<c01c2350>] deactivate_super+0x57/0x6b
[46440.890128] [<c01d217e>] mntput_no_expire+0xca/0xfb
[46440.890141] [<c01d265b>] sys_umount+0x28f/0x2b4
[46440.890153] [<c01d268d>] sys_oldumount+0xd/0xf
[46440.890166] [<c011c264>] sysenter_do_call+0x12/0x38
[46440.890176] 1 lock held by umount/22890:
[46440.890182] #0: (&type->s_umount_key#31){++++..}, at: [<c01c234b>] deactivate_super+0x52/0x6b
[46560.889079] INFO: task umount:22890 blocked for more than 120 seconds.
[46560.889097] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[46560.889113] umount D 00002a0d 2288 22890 19517
[46560.889140] df5b0db8 00000046 d8a93c7f 00002a0d c0be7148 c0e5cbc8 c0163ebd c0a78700
[46560.889181] c0a78700 df5b0d74 c0164e28 f145a980 f145abfc c2d13700 00000000 d8ad1b4f
[46560.889221] 00002a0d c0165031 00000006 f145a980 c05e9dd4 00000002 df5b0d9c f145abfc
[46560.889260] Call Trace:
[46560.889291] [<c0163ebd>] ? lock_release_holdtime+0x30/0x131
[46560.889309] [<c0164e28>] ? mark_lock+0x1e/0x1e4
[46560.889327] [<c0165031>] ? mark_held_locks+0x43/0x5b
[46560.889348] [<c05e9dd4>] ? _spin_unlock_irqrestore+0x3c/0x48
[46560.889366] [<c01652ba>] ? trace_hardirqs_on+0xb/0xd
[46560.889387] [<c05e823f>] schedule+0x8/0x17
[46560.889406] [<c01d7031>] bdi_sched_wait+0x8/0xc
[46560.889422] [<c05e8728>] __wait_on_bit+0x36/0x5d
[46560.889439] [<c01d7029>] ? bdi_sched_wait+0x0/0xc
[46560.889457] [<c05e87fa>] out_of_line_wait_on_bit+0xab/0xb3
[46560.889475] [<c01d7029>] ? bdi_sched_wait+0x0/0xc
[46560.889495] [<c01577ae>] ? wake_bit_function+0x0/0x43
[46560.889512] [<c01d61b6>] wait_on_bit+0x20/0x2c
[46560.889605] [<c01d6d2e>] bdi_writeback_all+0x161/0x18e
[46560.889635] [<c0199f63>] ? wait_on_page_writeback_range+0x9d/0xdc
[46560.889664] [<c01d6e6f>] generic_sync_sb_inodes+0x2f/0xcc
[46560.889689] [<c01d6f7a>] sync_inodes_sb+0x6e/0x76
[46560.889709] [<c01c1aa0>] __fsync_super+0x63/0x66
[46560.889726] [<c01c1aae>] fsync_super+0xb/0x19
[46560.889743] [<c01c1d16>] generic_shutdown_super+0x1c/0xde
[46560.889761] [<c01c1df5>] kill_block_super+0x1d/0x31
[46560.889780] [<c01f0acd>] ? vfs_quota_off+0x0/0x12
[46560.889797] [<c01c2350>] deactivate_super+0x57/0x6b
[46560.889816] [<c01d217e>] mntput_no_expire+0xca/0xfb
[46560.889835] [<c01d265b>] sys_umount+0x28f/0x2b4
[46560.889855] [<c01d268d>] sys_oldumount+0xd/0xf
[46560.889874] [<c011c264>] sysenter_do_call+0x12/0x38
[46560.889889] 1 lock held by umount/22890:
[46560.889899] #0: (&type->s_umount_key#31){++++..}, at: [<c01c234b>] deactivate_super+0x52/0x6b
[46680.889028] INFO: task umount:22890 blocked for more than 120 seconds.
[46680.889046] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[46680.889061] umount D 00002a0d 2288 22890 19517
[46680.889088] df5b0db8 00000046 d8a93c7f 00002a0d c0be7148 c0e5cbc8 c0163ebd c0a78700
[46680.889130] c0a78700 df5b0d74 c0164e28 f145a980 f145abfc c2d13700 00000000 d8ad1b4f
[46680.889169] 00002a0d c0165031 00000006 f145a980 c05e9dd4 00000002 df5b0d9c f145abfc
[46680.889209] Call Trace:
[46680.889240] [<c0163ebd>] ? lock_release_holdtime+0x30/0x131
[46680.889259] [<c0164e28>] ? mark_lock+0x1e/0x1e4
[46680.889276] [<c0165031>] ? mark_held_locks+0x43/0x5b
[46680.889297] [<c05e9dd4>] ? _spin_unlock_irqrestore+0x3c/0x48
[46680.889315] [<c01652ba>] ? trace_hardirqs_on+0xb/0xd
[46680.889336] [<c05e823f>] schedule+0x8/0x17
[46680.889355] [<c01d7031>] bdi_sched_wait+0x8/0xc
[46680.889372] [<c05e8728>] __wait_on_bit+0x36/0x5d
[46680.889389] [<c01d7029>] ? bdi_sched_wait+0x0/0xc
[46680.889407] [<c05e87fa>] out_of_line_wait_on_bit+0xab/0xb3
[46680.889425] [<c01d7029>] ? bdi_sched_wait+0x0/0xc
[46680.889444] [<c01577ae>] ? wake_bit_function+0x0/0x43
[46680.889462] [<c01d61b6>] wait_on_bit+0x20/0x2c
[46680.889480] [<c01d6d2e>] bdi_writeback_all+0x161/0x18e
[46680.889501] [<c0199f63>] ? wait_on_page_writeback_range+0x9d/0xdc
[46680.889605] [<c01d6e6f>] generic_sync_sb_inodes+0x2f/0xcc
[46680.889633] [<c01d6f7a>] sync_inodes_sb+0x6e/0x76
[46680.889662] [<c01c1aa0>] __fsync_super+0x63/0x66
[46680.889685] [<c01c1aae>] fsync_super+0xb/0x19
[46680.889711] [<c01c1d16>] generic_shutdown_super+0x1c/0xde
[46680.889737] [<c01c1df5>] kill_block_super+0x1d/0x31
[46680.889766] [<c01f0acd>] ? vfs_quota_off+0x0/0x12
[46680.889793] [<c01c2350>] deactivate_super+0x57/0x6b
[46680.889824] [<c01d217e>] mntput_no_expire+0xca/0xfb
[46680.889853] [<c01d265b>] sys_umount+0x28f/0x2b4
[46680.889884] [<c01d268d>] sys_oldumount+0xd/0xf
[46680.889913] [<c011c264>] sysenter_do_call+0x12/0x38
[46680.889936] 1 lock held by umount/22890:
[46680.889950] #0: (&type->s_umount_key#31){++++..}, at: [<c01c234b>] deactivate_super+0x52/0x6b
[46800.889776] INFO: task umount:22890 blocked for more than 120 seconds.
[46800.889793] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[46800.889809] umount D 00002a0d 2288 22890 19517
[46800.889836] df5b0db8 00000046 d8a93c7f 00002a0d c0be7148 c0e5cbc8 c0163ebd c0a78700
[46800.889877] c0a78700 df5b0d74 c0164e28 f145a980 f145abfc c2d13700 00000000 d8ad1b4f
[46800.889917] 00002a0d c0165031 00000006 f145a980 c05e9dd4 00000002 df5b0d9c f145abfc
[46800.889956] Call Trace:
[46800.889987] [<c0163ebd>] ? lock_release_holdtime+0x30/0x131
[46800.890006] [<c0164e28>] ? mark_lock+0x1e/0x1e4
[46800.890023] [<c0165031>] ? mark_held_locks+0x43/0x5b
[46800.890044] [<c05e9dd4>] ? _spin_unlock_irqrestore+0x3c/0x48
[46800.890062] [<c01652ba>] ? trace_hardirqs_on+0xb/0xd
[46800.890083] [<c05e823f>] schedule+0x8/0x17
[46800.890102] [<c01d7031>] bdi_sched_wait+0x8/0xc
[46800.890119] [<c05e8728>] __wait_on_bit+0x36/0x5d
[46800.890136] [<c01d7029>] ? bdi_sched_wait+0x0/0xc
[46800.890153] [<c05e87fa>] out_of_line_wait_on_bit+0xab/0xb3
[46800.890171] [<c01d7029>] ? bdi_sched_wait+0x0/0xc
[46800.890191] [<c01577ae>] ? wake_bit_function+0x0/0x43
[46800.890209] [<c01d61b6>] wait_on_bit+0x20/0x2c
[46800.890227] [<c01d6d2e>] bdi_writeback_all+0x161/0x18e
[46800.890248] [<c0199f63>] ? wait_on_page_writeback_range+0x9d/0xdc
[46800.890268] [<c01d6e6f>] generic_sync_sb_inodes+0x2f/0xcc
[46800.890286] [<c01d6f7a>] sync_inodes_sb+0x6e/0x76
[46800.890306] [<c01c1aa0>] __fsync_super+0x63/0x66
[46800.890323] [<c01c1aae>] fsync_super+0xb/0x19
[46800.890340] [<c01c1d16>] generic_shutdown_super+0x1c/0xde
[46800.890357] [<c01c1df5>] kill_block_super+0x1d/0x31
[46800.890376] [<c01f0acd>] ? vfs_quota_off+0x0/0x12
[46800.890393] [<c01c2350>] deactivate_super+0x57/0x6b
[46800.890413] [<c01d217e>] mntput_no_expire+0xca/0xfb
[46800.890431] [<c01d265b>] sys_umount+0x28f/0x2b4
[46800.890451] [<c01d268d>] sys_oldumount+0xd/0xf
[46800.890470] [<c011c264>] sysenter_do_call+0x12/0x38
[46800.890485] 1 lock held by umount/22890:
[46800.890494] #0: (&type->s_umount_key#31){++++..}, at: [<c01c234b>] deactivate_super+0x52/0x6b
[46920.888925] INFO: task umount:22890 blocked for more than 120 seconds.
[46920.888943] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[46920.888959] umount D 00002a0d 2288 22890 19517
[46920.888986] df5b0db8 00000046 d8a93c7f 00002a0d c0be7148 c0e5cbc8 c0163ebd c0a78700
[46920.889027] c0a78700 df5b0d74 c0164e28 f145a980 f145abfc c2d13700 00000000 d8ad1b4f
[46920.889067] 00002a0d c0165031 00000006 f145a980 c05e9dd4 00000002 df5b0d9c f145abfc
[46920.889106] Call Trace:
[46920.889137] [<c0163ebd>] ? lock_release_holdtime+0x30/0x131
[46920.889155] [<c0164e28>] ? mark_lock+0x1e/0x1e4
[46920.889173] [<c0165031>] ? mark_held_locks+0x43/0x5b
[46920.889194] [<c05e9dd4>] ? _spin_unlock_irqrestore+0x3c/0x48
[46920.889212] [<c01652ba>] ? trace_hardirqs_on+0xb/0xd
[46920.889233] [<c05e823f>] schedule+0x8/0x17
[46920.889252] [<c01d7031>] bdi_sched_wait+0x8/0xc
[46920.889268] [<c05e8728>] __wait_on_bit+0x36/0x5d
[46920.889286] [<c01d7029>] ? bdi_sched_wait+0x0/0xc
[46920.889303] [<c05e87fa>] out_of_line_wait_on_bit+0xab/0xb3
[46920.889321] [<c01d7029>] ? bdi_sched_wait+0x0/0xc
[46920.889341] [<c01577ae>] ? wake_bit_function+0x0/0x43
[46920.889359] [<c01d61b6>] wait_on_bit+0x20/0x2c
[46920.889376] [<c01d6d2e>] bdi_writeback_all+0x161/0x18e
[46920.889397] [<c0199f63>] ? wait_on_page_writeback_range+0x9d/0xdc
[46920.889418] [<c01d6e6f>] generic_sync_sb_inodes+0x2f/0xcc
[46920.889436] [<c01d6f7a>] sync_inodes_sb+0x6e/0x76
[46920.889456] [<c01c1aa0>] __fsync_super+0x63/0x66
[46920.889472] [<c01c1aae>] fsync_super+0xb/0x19
[46920.889567] [<c01c1d16>] generic_shutdown_super+0x1c/0xde
[46920.889594] [<c01c1df5>] kill_block_super+0x1d/0x31
[46920.889623] [<c01f0acd>] ? vfs_quota_off+0x0/0x12
[46920.889647] [<c01c2350>] deactivate_super+0x57/0x6b
[46920.889675] [<c01d217e>] mntput_no_expire+0xca/0xfb
[46920.889703] [<c01d265b>] sys_umount+0x28f/0x2b4
[46920.889734] [<c01d268d>] sys_oldumount+0xd/0xf
[46920.889762] [<c011c264>] sysenter_do_call+0x12/0x38
[46920.889786] 1 lock held by umount/22890:
[46920.889800] #0: (&type->s_umount_key#31){++++..}, at: [<c01c234b>] deactivate_super+0x52/0x6b
[46920.889866] INFO: task fsstress:23920 blocked for more than 120 seconds.
[46920.889886] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[46920.889906] fsstress D 00002a89 2032 23920 23919
[46920.889942] e8968f20 00000046 9fb7a303 00002a89 00000000 f5815c6c e8968ed0 c0a78700
[46920.890005] c0a78700 c0e5c7a0 e89604f0 e8960000 e896027c c2e61700 00000001 9fc1cb54
[46920.890070] 00002a89 00000000 e8960000 00000007 e8968f04 c0165031 00000006 e896027c
[46920.890133] Call Trace:
[46920.890163] [<c0165031>] ? mark_held_locks+0x43/0x5b
[46920.890191] [<c05e9d8f>] ? _spin_unlock_irq+0x22/0x2b
[46920.890219] [<c016528e>] ? trace_hardirqs_on_caller+0x103/0x124
[46920.890251] [<c05e823f>] schedule+0x8/0x17
[46920.890277] [<c05e9a8c>] rwsem_down_failed_common+0x6d/0x84
[46920.890306] [<c05e9ae3>] rwsem_down_read_failed+0x1d/0x26
[46920.890335] [<c05e9b27>] call_rwsem_down_read_failed+0x7/0xc
[46920.890363] [<c05e90e5>] ? down_read+0x5e/0x6f
[46920.890390] [<c01d6fb6>] __sync_inodes+0x34/0x89
[46920.890418] [<c01d7018>] sync_inodes+0xd/0x1e
[46920.890444] [<c01d96fd>] do_sync+0x14/0x5a
[46920.890472] [<c01d9767>] sys_sync+0xd/0x12
[46920.890498] [<c011c264>] sysenter_do_call+0x12/0x38
[46920.890521] 1 lock held by fsstress/23920:
[46920.890535] #0: (&type->s_umount_key#31){++++..}, at: [<c01d6fb6>] __sync_inodes+0x34/0x89
[46920.890594] INFO: task fsstress:23921 blocked for more than 120 seconds.
[46920.890613] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[46920.890633] fsstress D 00002a89 2212 23921 23919
[46920.890670] f14abf20 00000046 adcb3cde 00002a89 00000000 f5815c6c f14abed0 c0a78700
[46920.890732] c0a78700 c0e5c7a0 e8964330 e8963e40 e89640bc c2d13700 00000000 adf1f741
[46920.890796] 00002a89 00000000 e8963e40 00000007 f14abf04 c0165031 00000006 e89640bc
[46920.890858] Call Trace:
[46920.890887] [<c0165031>] ? mark_held_locks+0x43/0x5b
[46920.890915] [<c05e9d8f>] ? _spin_unlock_irq+0x22/0x2b
[46920.890943] [<c016528e>] ? trace_hardirqs_on_caller+0x103/0x124
[46920.890975] [<c05e823f>] schedule+0x8/0x17
[46920.891001] [<c05e9a8c>] rwsem_down_failed_common+0x6d/0x84
[46920.891030] [<c05e9ae3>] rwsem_down_read_failed+0x1d/0x26
[46920.891059] [<c05e9b27>] call_rwsem_down_read_failed+0x7/0xc
[46920.891087] [<c05e90e5>] ? down_read+0x5e/0x6f
[46920.891114] [<c01d6fb6>] __sync_inodes+0x34/0x89
[46920.891142] [<c01d7018>] sync_inodes+0xd/0x1e
[46920.891168] [<c01d96fd>] do_sync+0x14/0x5a
[46920.891195] [<c01d9767>] sys_sync+0xd/0x12
[46920.891221] [<c011c264>] sysenter_do_call+0x12/0x38
[46920.891244] 1 lock held by fsstress/23921:
[46920.891258] #0: (&type->s_umount_key#31){++++..}, at: [<c01d6fb6>] __sync_inodes+0x34/0x89
[46920.891318] INFO: task fsstress:23922 blocked for more than 120 seconds.
[46920.891336] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[46920.891356] fsstress D 00002a89 2508 23922 23919
[46920.891393] e89b1f20 00000046 9f55d644 00002a89 00000000 00000001 00000001 c0a78700
[46920.891456] c0a78700 c0e5c7a0 00962e70 e8962980 e8962bfc c2d13700 00000000 9f589dea
[46920.891522] 00002a89 00000000 e8962980 00000007 e89b1f04 c0165031 00000006 e8962bfc
[46920.891584] Call Trace:
[46920.891613] [<c0165031>] ? mark_held_locks+0x43/0x5b
[46920.891641] [<c05e9d8f>] ? _spin_unlock_irq+0x22/0x2b
[46920.891669] [<c016528e>] ? trace_hardirqs_on_caller+0x103/0x124
[46920.891700] [<c05e823f>] schedule+0x8/0x17
[46920.891727] [<c05e9a8c>] rwsem_down_failed_common+0x6d/0x84
[46920.891756] [<c05e9ae3>] rwsem_down_read_failed+0x1d/0x26
[46920.891785] [<c05e9b27>] call_rwsem_down_read_failed+0x7/0xc
[46920.891812] [<c05e90e5>] ? down_read+0x5e/0x6f
[46920.891839] [<c01d6fb6>] __sync_inodes+0x34/0x89
[46920.891867] [<c01d7018>] sync_inodes+0xd/0x1e
[46920.891893] [<c01d96fd>] do_sync+0x14/0x5a
[46920.891921] [<c01d9767>] sys_sync+0xd/0x12
[46920.891947] [<c011c264>] sysenter_do_call+0x12/0x38
[46920.891969] 1 lock held by fsstress/23922:
[46920.891983] #0: (&type->s_umount_key#31){++++..}, at: [<c01d6fb6>] __sync_inodes+0x34/0x89
[46920.892038] INFO: task fsstress:23923 blocked for more than 120 seconds.
[46920.892055] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[46920.892075] fsstress D 00002a89 2524 23923 23919
[46920.892113] e89f8f20 00000046 9f62c022 00002a89 00000000 f5815c6c e89f8ed0 c0a78700
[46920.892176] c0a78700 c0e5c7a0 f75299b0 f75294c0 f752973c c2e61700 00000001 9f6e75d6
[46920.892238] 00002a89 00000000 f75294c0 00000007 e89f8f04 c0165031 00000006 f752973c
[46920.892302] Call Trace:
[46920.892332] [<c0165031>] ? mark_held_locks+0x43/0x5b
[46920.892362] [<c05e9d8f>] ? _spin_unlock_irq+0x22/0x2b
[46920.892390] [<c016528e>] ? trace_hardirqs_on_caller+0x103/0x124
[46920.892423] [<c05e823f>] schedule+0x8/0x17
[46920.892450] [<c05e9a8c>] rwsem_down_failed_common+0x6d/0x84
[46920.892478] [<c05e9ae3>] rwsem_down_read_failed+0x1d/0x26
[46920.892507] [<c05e9b27>] call_rwsem_down_read_failed+0x7/0xc
[46920.892535] [<c05e90e5>] ? down_read+0x5e/0x6f
[46920.892562] [<c01d6fb6>] __sync_inodes+0x34/0x89
[46920.892589] [<c01d7018>] sync_inodes+0xd/0x1e
[46920.892616] [<c01d96fd>] do_sync+0x14/0x5a
[46920.892643] [<c01d9767>] sys_sync+0xd/0x12
[46920.892669] [<c011c264>] sysenter_do_call+0x12/0x38
[46920.892692] 1 lock held by fsstress/23923:
[46920.892706] #0: (&type->s_umount_key#31){++++..}, at: [<c01d6fb6>] __sync_inodes+0x34/0x89
[46920.892765] INFO: task fsstress:23924 blocked for more than 120 seconds.
[46920.892784] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[46920.892804] fsstress D 00002a89 2472 23924 23919
[46920.892911] f1495f20 00000046 a24d3c01 00002a89 00000000 f5815c6c f1495ed0 c0a78700
[46920.892977] c0a78700 c0e5c7a0 f5994330 f5993e40 f59940bc c2e61700 00000001 a287d873
[46920.893039] 00002a89 00000000 f5993e40 00000007 f1495f04 c0165031 00000006 f59940bc
[46920.893101] Call Trace:
[46920.893132] [<c0165031>] ? mark_held_locks+0x43/0x5b
[46920.893161] [<c05e9d8f>] ? _spin_unlock_irq+0x22/0x2b
[46920.893189] [<c016528e>] ? trace_hardirqs_on_caller+0x103/0x124
[46920.893219] [<c05e823f>] schedule+0x8/0x17
[46920.893247] [<c05e9a8c>] rwsem_down_failed_common+0x6d/0x84
[46920.893276] [<c05e9ae3>] rwsem_down_read_failed+0x1d/0x26
[46920.893307] [<c05e9b27>] call_rwsem_down_read_failed+0x7/0xc
[46920.893333] [<c05e90e5>] ? down_read+0x5e/0x6f
[46920.893361] [<c01d6fb6>] __sync_inodes+0x34/0x89
[46920.893388] [<c01d7018>] sync_inodes+0xd/0x1e
[46920.893416] [<c01d96fd>] do_sync+0x14/0x5a
[46920.893442] [<c01d9767>] sys_sync+0xd/0x12
[46920.893470] [<c011c264>] sysenter_do_call+0x12/0x38
[46920.893492] 1 lock held by fsstress/23924:
[46920.893508] #0: (&type->s_umount_key#31){++++..}, at: [<c01d6fb6>] __sync_inodes+0x34/0x89
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/11] Per-bdi writeback flusher threads v8
2009-05-27 14:47 ` Theodore Tso
@ 2009-05-27 15:05 ` Jens Axboe
2009-05-27 17:53 ` Theodore Tso
1 sibling, 0 replies; 41+ messages in thread
From: Jens Axboe @ 2009-05-27 15:05 UTC (permalink / raw)
To: Theodore Tso
Cc: linux-kernel, linux-fsdevel, chris.mason, david, hch, akpm, jack,
yanmin_zhang, richard, damien.wyart
On Wed, May 27 2009, Theodore Tso wrote:
> Hi Jens,
>
> FYI, just for yucks I tried to merge your bdi patches and the ext4
> patch queue to make sure there were no patch conflicts, and started a
> quick regression run. I got the following soft lockup report when
> running "fsstress -d /mnt/tmp -s 12345 -S -p 20 -n 1000".
>
> I'll retry the test with your stock writeback-v8 git branch w/o any
> ext4 patches planned for the next mainline merge window to see if I get the
> same soft lockup, but I thought I should give you an early heads up.
It looks like it's the writeback patches, from a quick glance. Did it
get stuck, or did it still make progress? Thanks for testing, please let
me know if it triggers with just wb-v8. Can you send me your .config,
too? I'm assuming no special ext4 mount options, right? I used ext4 as
well for testing, using -o barrier=0 for most of the runs.
Regardless, I'll try this test and poke a bit at it and find out what
the issue is.
--
Jens Axboe
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/11] Per-bdi writeback flusher threads v8
2009-05-27 14:47 ` Theodore Tso
2009-05-27 15:05 ` Jens Axboe
@ 2009-05-27 17:53 ` Theodore Tso
2009-05-27 17:57 ` Jens Axboe
2009-05-27 17:58 ` Theodore Tso
1 sibling, 2 replies; 41+ messages in thread
From: Theodore Tso @ 2009-05-27 17:53 UTC (permalink / raw)
To: Jens Axboe, linux-kernel, linux-fsdevel, chris.mason, david, hch,
akpm
On Wed, May 27, 2009 at 10:47:54AM -0400, Theodore Tso wrote:
>
> I'll retry the test with your stock writeback-v8 git branch w/o any
> ext4 patches planned for the next mainline merge window to see if I get the
> same soft lockup, but I thought I should give you an early heads up.
Confirmed. I had to run fsstress twice, but I was able to trigger a
soft hangup with just the per-bdi v8 patches using ext4.
With ext3, fsstress didn't cause a soft lockup while it was running
--- but after the test, when I tried to unmount the filesystem,
/sbin/umount hung:
[ 2040.893469] INFO: task umount:7154 blocked for more than 120 seconds.
[ 2040.893487] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2040.893503] umount D 000001ba 2600 7154 5885
[ 2040.893531] ec408db8 00000046 ba2bff0b 000001ba c0be7148 c0e68bc8 c0163ebd c0a78700
[ 2040.893572] c0a78700 ec408d74 c0164e28 e95c0000 e95c027c c2d13700 00000000 ba2d9a13
[ 2040.893612] 000001ba c0165031 00000006 e95c0000 c05e9594 00000002 ec408d9c e95c027c
[ 2040.893652] Call Trace:
[ 2040.893683] [<c0163ebd>] ? lock_release_holdtime+0x30/0x131
[ 2040.893702] [<c0164e28>] ? mark_lock+0x1e/0x1e4
[ 2040.893720] [<c0165031>] ? mark_held_locks+0x43/0x5b
[ 2040.893742] [<c05e9594>] ? _spin_unlock_irqrestore+0x3c/0x48
[ 2040.893761] [<c01652ba>] ? trace_hardirqs_on+0xb/0xd
[ 2040.893782] [<c05e79ff>] schedule+0x8/0x17
[ 2040.893801] [<c01d7009>] bdi_sched_wait+0x8/0xc
[ 2040.893818] [<c05e7ee8>] __wait_on_bit+0x36/0x5d
[ 2040.893836] [<c01d7001>] ? bdi_sched_wait+0x0/0xc
[ 2040.893854] [<c05e7fba>] out_of_line_wait_on_bit+0xab/0xb3
[ 2040.893872] [<c01d7001>] ? bdi_sched_wait+0x0/0xc
[ 2040.893892] [<c01577ae>] ? wake_bit_function+0x0/0x43
[ 2040.893911] [<c01d618e>] wait_on_bit+0x20/0x2c
[ 2040.893929] [<c01d6d06>] bdi_writeback_all+0x161/0x18e
[ 2040.893951] [<c0199f63>] ? wait_on_page_writeback_range+0x9d/0xdc
[ 2040.894052] [<c01d6e47>] generic_sync_sb_inodes+0x2f/0xcc
[ 2040.894079] [<c01d6f52>] sync_inodes_sb+0x6e/0x76
[ 2040.894107] [<c01c1aa0>] __fsync_super+0x63/0x66
[ 2040.894131] [<c01c1aae>] fsync_super+0xb/0x19
[ 2040.894149] [<c01c1d16>] generic_shutdown_super+0x1c/0xde
[ 2040.894167] [<c01c1df5>] kill_block_super+0x1d/0x31
[ 2040.894186] [<c01f0a85>] ? vfs_quota_off+0x0/0x12
[ 2040.894204] [<c01c2350>] deactivate_super+0x57/0x6b
[ 2040.894223] [<c01d2156>] mntput_no_expire+0xca/0xfb
[ 2040.894242] [<c01d2633>] sys_umount+0x28f/0x2b4
[ 2040.894262] [<c01d2665>] sys_oldumount+0xd/0xf
[ 2040.894281] [<c011c264>] sysenter_do_call+0x12/0x38
[ 2040.894297] 1 lock held by umount/7154:
[ 2040.894307] #0: (&type->s_umount_key#31){++++..}, at: [<c01c234b>] deactivate_super+0x52/0x6b
Given that the ext4 hangs were also related to s_umount being taken by
sync_inodes(), there seems to be something going on there:
[ 3720.900031] INFO: task fsstress:8487 blocked for more than 120 seconds.
[ 3720.900041] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3720.900049] fsstress D 00000330 2060 8487 8484
[ 3720.900063] eddcde38 00000046 633117b6 00000330 c0be7148 c0e58ba8 c0163ebd c0a78700
[ 3720.900084] c0a78700 eddcddf4 c0164e28 f45a0000 f45a027c c2d13700 00000000 63320800
[ 3720.900104] 00000330 c0165031 00000006 f45a0000 c05e9594 00000002 eddcde1c f45a027c
[ 3720.900124] Call Trace:
[ 3720.900142] [<c0163ebd>] ? lock_release_holdtime+0x30/0x131
[ 3720.900151] [<c0164e28>] ? mark_lock+0x1e/0x1e4
[ 3720.900160] [<c0165031>] ? mark_held_locks+0x43/0x5b
[ 3720.900172] [<c05e9594>] ? _spin_unlock_irqrestore+0x3c/0x48
[ 3720.900181] [<c01652ba>] ? trace_hardirqs_on+0xb/0xd
[ 3720.900192] [<c05e79ff>] schedule+0x8/0x17
[ 3720.900202] [<c01d7009>] bdi_sched_wait+0x8/0xc
[ 3720.900211] [<c05e7ee8>] __wait_on_bit+0x36/0x5d
[ 3720.900220] [<c01d7001>] ? bdi_sched_wait+0x0/0xc
[ 3720.900229] [<c05e7fba>] out_of_line_wait_on_bit+0xab/0xb3
[ 3720.900238] [<c01d7001>] ? bdi_sched_wait+0x0/0xc
[ 3720.900248] [<c01577ae>] ? wake_bit_function+0x0/0x43
[ 3720.900258] [<c01d618e>] wait_on_bit+0x20/0x2c
[ 3720.900267] [<c01d6d06>] bdi_writeback_all+0x161/0x18e
[ 3720.900277] [<c01d6e47>] generic_sync_sb_inodes+0x2f/0xcc
[ 3720.900287] [<c01d6f52>] sync_inodes_sb+0x6e/0x76
[ 3720.900297] [<c01d6f9d>] __sync_inodes+0x43/0x89
[ 3720.900306] [<c01d6ffe>] sync_inodes+0x1b/0x1e
[ 3720.900314] [<c01d96f9>] do_sync+0x38/0x5a
[ 3720.900323] [<c01d973f>] sys_sync+0xd/0x12
[ 3720.900333] [<c011c264>] sysenter_do_call+0x12/0x38
[ 3720.900340] 1 lock held by fsstress/8487:
[ 3720.900345] #0: (&type->s_umount_key#15){++++..}, at: [<c01d6f8e>] __sync_inodes+0x34/0x89
- Ted
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/11] Per-bdi writeback flusher threads v8
2009-05-27 17:53 ` Theodore Tso
@ 2009-05-27 17:57 ` Jens Axboe
2009-05-27 17:58 ` Theodore Tso
1 sibling, 0 replies; 41+ messages in thread
From: Jens Axboe @ 2009-05-27 17:57 UTC (permalink / raw)
To: Theodore Tso
Cc: linux-kernel, linux-fsdevel, chris.mason, david, hch, akpm, jack,
yanmin_zhang, richard, damien.wyart
On Wed, May 27 2009, Theodore Tso wrote:
> On Wed, May 27, 2009 at 10:47:54AM -0400, Theodore Tso wrote:
> >
> > I'll retry the test with your stock writeback-v8 git branch w/o any
> > ext4 patches planned for the next mainline merge window to see if I get the
> > same soft lockup, but I thought I should give you an early heads up.
>
> Confirmed. I had to run fsstress twice, but I was able to trigger a
> soft hangup with just the per-bdi v8 patches using ext4.
>
> With ext3, fsstress didn't cause a soft lockup while it was running
> --- but after the test, when I tried to unmount the filesystem,
> /sbin/umount hung:
>
> [ 2040.893469] INFO: task umount:7154 blocked for more than 120 seconds.
> [ 2040.893487] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 2040.893503] umount D 000001ba 2600 7154 5885
> [ 2040.893531] ec408db8 00000046 ba2bff0b 000001ba c0be7148 c0e68bc8 c0163ebd c0a78700
> [ 2040.893572] c0a78700 ec408d74 c0164e28 e95c0000 e95c027c c2d13700 00000000 ba2d9a13
> [ 2040.893612] 000001ba c0165031 00000006 e95c0000 c05e9594 00000002 ec408d9c e95c027c
> [ 2040.893652] Call Trace:
> [ 2040.893683] [<c0163ebd>] ? lock_release_holdtime+0x30/0x131
> [ 2040.893702] [<c0164e28>] ? mark_lock+0x1e/0x1e4
> [ 2040.893720] [<c0165031>] ? mark_held_locks+0x43/0x5b
> [ 2040.893742] [<c05e9594>] ? _spin_unlock_irqrestore+0x3c/0x48
> [ 2040.893761] [<c01652ba>] ? trace_hardirqs_on+0xb/0xd
> [ 2040.893782] [<c05e79ff>] schedule+0x8/0x17
> [ 2040.893801] [<c01d7009>] bdi_sched_wait+0x8/0xc
> [ 2040.893818] [<c05e7ee8>] __wait_on_bit+0x36/0x5d
> [ 2040.893836] [<c01d7001>] ? bdi_sched_wait+0x0/0xc
> [ 2040.893854] [<c05e7fba>] out_of_line_wait_on_bit+0xab/0xb3
> [ 2040.893872] [<c01d7001>] ? bdi_sched_wait+0x0/0xc
> [ 2040.893892] [<c01577ae>] ? wake_bit_function+0x0/0x43
> [ 2040.893911] [<c01d618e>] wait_on_bit+0x20/0x2c
> [ 2040.893929] [<c01d6d06>] bdi_writeback_all+0x161/0x18e
> [ 2040.893951] [<c0199f63>] ? wait_on_page_writeback_range+0x9d/0xdc
> [ 2040.894052] [<c01d6e47>] generic_sync_sb_inodes+0x2f/0xcc
> [ 2040.894079] [<c01d6f52>] sync_inodes_sb+0x6e/0x76
> [ 2040.894107] [<c01c1aa0>] __fsync_super+0x63/0x66
> [ 2040.894131] [<c01c1aae>] fsync_super+0xb/0x19
> [ 2040.894149] [<c01c1d16>] generic_shutdown_super+0x1c/0xde
> [ 2040.894167] [<c01c1df5>] kill_block_super+0x1d/0x31
> [ 2040.894186] [<c01f0a85>] ? vfs_quota_off+0x0/0x12
> [ 2040.894204] [<c01c2350>] deactivate_super+0x57/0x6b
> [ 2040.894223] [<c01d2156>] mntput_no_expire+0xca/0xfb
> [ 2040.894242] [<c01d2633>] sys_umount+0x28f/0x2b4
> [ 2040.894262] [<c01d2665>] sys_oldumount+0xd/0xf
> [ 2040.894281] [<c011c264>] sysenter_do_call+0x12/0x38
> [ 2040.894297] 1 lock held by umount/7154:
> [ 2040.894307] #0: (&type->s_umount_key#31){++++..}, at: [<c01c234b>] deactivate_super+0x52/0x6b
>
>
> Given that the ext4 hangs were also related to s_umount being taken by
> sync_inodes(), there seems to be something going on there:
You didn't happen to catch a sysrq-t of the bdi-* threads as well, did
you? That would confirm the suspicion on this bug, but I'm pretty sure I
know what it is (see the Jan Kara reply). I'll move the super sync to a
silly thread for now, then we can later take care of that with per-bdi
super syncing instead.
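[Editor's note: a rough sketch, assuming the diagnosis above is right, of the blocking pattern the traces show: the syncing task takes s_umount, hands work to the per-bdi flusher, and then sleeps on a bit until the flusher clears it. If nothing ever services the work (exited thread, unregistered bdi), the waiter sleeps forever while holding s_umount, and every later sync()/umount piles up on that semaphore. The bdi_queue_work() helper and the pending bit are hypothetical stand-ins, not the actual fs-writeback.c code; bdi_sched_wait() is shown with the trivial body the traces imply.]

	#include <linux/fs.h>
	#include <linux/sched.h>
	#include <linux/wait.h>

	/* hypothetical stand-in for queueing work to the per-bdi flusher thread */
	extern void bdi_queue_work(struct super_block *sb, unsigned long *pending);

	static int bdi_sched_wait(void *word)
	{
		schedule();
		return 0;
	}

	static void sync_one_sb(struct super_block *sb, unsigned long *pending)
	{
		down_read(&sb->s_umount);	/* umount takes this for writing */

		bdi_queue_work(sb, pending);	/* hand the work to the flusher */

		/*
		 * If no flusher thread ever clears the bit, we sleep here
		 * forever while still holding s_umount, and every subsequent
		 * sync() blocks in down_read(&sb->s_umount).
		 */
		wait_on_bit(pending, 0, bdi_sched_wait, TASK_UNINTERRUPTIBLE);

		up_read(&sb->s_umount);
	}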
--
Jens Axboe
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/11] Per-bdi writeback flusher threads v8
2009-05-27 17:53 ` Theodore Tso
2009-05-27 17:57 ` Jens Axboe
@ 2009-05-27 17:58 ` Theodore Tso
2009-05-27 18:14 ` Jens Axboe
1 sibling, 1 reply; 41+ messages in thread
From: Theodore Tso @ 2009-05-27 17:58 UTC (permalink / raw)
To: Jens Axboe, linux-kernel, linux-fsdevel, chris.mason, david, hch,
akpm
On Wed, May 27, 2009 at 01:53:53PM -0400, Theodore Tso wrote:
> On Wed, May 27, 2009 at 10:47:54AM -0400, Theodore Tso wrote:
> >
> > I'll retry the test with your stock writeback-v8 git branch w/o any
> > ext4 patches planned for the next mainline merge window to see if I get the
> > same soft lockup, but I thought I should give you an early heads up.
>
> Confirmed. I had to run fsstress twice, but I was able to trigger a
> soft hangup with just the per-bdi v8 patches using ext4.
As you requested, here's the .config file which I used. This was on a
Lenovo S10 (N270 Atom dual-core CPU, 1.5 gigs of memory, 5400 rpm hdd).
- Ted
#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.30-rc7
# Tue May 26 13:39:41 2009
#
# CONFIG_64BIT is not set
CONFIG_X86_32=y
# CONFIG_X86_64 is not set
CONFIG_X86=y
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/i386_defconfig"
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_FAST_CMPXCHG_LOCAL=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_GENERIC_GPIO=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
# CONFIG_RWSEM_GENERIC_SPINLOCK is not set
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_ARCH_HAS_CPU_IDLE_WAIT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
# CONFIG_GENERIC_TIME_VSYSCALL is not set
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_DEFAULT_IDLE=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_HAVE_DYNAMIC_PER_CPU_AREA=y
# CONFIG_HAVE_CPUMASK_OF_CPU_MAP is not set
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
# CONFIG_ZONE_DMA32 is not set
CONFIG_ARCH_POPULATES_NODE_MAP=y
# CONFIG_AUDIT_ARCH is not set
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_HARDIRQS_NO__DO_IRQ=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_USE_GENERIC_SMP_HELPERS=y
CONFIG_X86_32_SMP=y
CONFIG_X86_HT=y
CONFIG_X86_TRAMPOLINE=y
CONFIG_X86_32_LAZY_GS=y
CONFIG_KTIME_SCALAR=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
#
# General setup
#
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
CONFIG_BSD_PROCESS_ACCT=y
CONFIG_BSD_PROCESS_ACCT_V3=y
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
CONFIG_TASK_XACCT=y
CONFIG_TASK_IO_ACCOUNTING=y
CONFIG_AUDIT=y
CONFIG_AUDITSYSCALL=y
CONFIG_AUDIT_TREE=y
#
# RCU Subsystem
#
CONFIG_CLASSIC_RCU=y
# CONFIG_TREE_RCU is not set
# CONFIG_PREEMPT_RCU is not set
# CONFIG_TREE_RCU_TRACE is not set
# CONFIG_PREEMPT_RCU_TRACE is not set
# CONFIG_IKCONFIG is not set
CONFIG_LOG_BUF_SHIFT=17
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
# CONFIG_GROUP_SCHED is not set
CONFIG_CGROUPS=y
CONFIG_CGROUP_DEBUG=y
CONFIG_CGROUP_NS=y
CONFIG_CGROUP_FREEZER=y
CONFIG_CGROUP_DEVICE=y
CONFIG_CPUSETS=y
CONFIG_PROC_PID_CPUSET=y
CONFIG_CGROUP_CPUACCT=y
CONFIG_RESOURCE_COUNTERS=y
# CONFIG_CGROUP_MEM_RES_CTLR is not set
CONFIG_SYSFS_DEPRECATED=y
CONFIG_SYSFS_DEPRECATED_V2=y
CONFIG_RELAY=y
CONFIG_NAMESPACES=y
CONFIG_UTS_NS=y
CONFIG_IPC_NS=y
CONFIG_USER_NS=y
CONFIG_PID_NS=y
# CONFIG_NET_NS is not set
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_RD_GZIP=y
CONFIG_RD_BZIP2=y
CONFIG_RD_LZMA=y
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
CONFIG_SYSCTL=y
CONFIG_ANON_INODES=y
# CONFIG_EMBEDDED is not set
CONFIG_UID16=y
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y
# CONFIG_KALLSYMS_EXTRA_PASS is not set
CONFIG_STRIP_ASM_SYMS=y
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_PCSPKR_PLATFORM=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_AIO=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_PCI_QUIRKS=y
CONFIG_SLUB_DEBUG=y
CONFIG_COMPAT_BRK=y
# CONFIG_SLAB is not set
CONFIG_SLUB=y
# CONFIG_SLOB is not set
CONFIG_PROFILING=y
CONFIG_TRACEPOINTS=y
CONFIG_MARKERS=y
# CONFIG_OPROFILE is not set
CONFIG_HAVE_OPROFILE=y
CONFIG_KPROBES=y
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y
CONFIG_KRETPROBES=y
CONFIG_HAVE_IOREMAP_PROT=y
CONFIG_HAVE_KPROBES=y
CONFIG_HAVE_KRETPROBES=y
CONFIG_HAVE_ARCH_TRACEHOOK=y
CONFIG_HAVE_DMA_API_DEBUG=y
# CONFIG_SLOW_WORK is not set
CONFIG_HAVE_GENERIC_DMA_COHERENT=y
CONFIG_SLABINFO=y
CONFIG_RT_MUTEXES=y
CONFIG_BASE_SMALL=0
CONFIG_MODULES=y
CONFIG_MODULE_FORCE_LOAD=y
CONFIG_MODULE_UNLOAD=y
CONFIG_MODULE_FORCE_UNLOAD=y
# CONFIG_MODVERSIONS is not set
CONFIG_MODULE_SRCVERSION_ALL=y
CONFIG_STOP_MACHINE=y
CONFIG_BLOCK=y
CONFIG_LBD=y
CONFIG_BLK_DEV_BSG=y
CONFIG_BLK_DEV_INTEGRITY=y
#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_AS=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
# CONFIG_DEFAULT_AS is not set
# CONFIG_DEFAULT_DEADLINE is not set
CONFIG_DEFAULT_CFQ=y
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="cfq"
CONFIG_PREEMPT_NOTIFIERS=y
CONFIG_FREEZER=y
#
# Processor type and features
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y
CONFIG_GENERIC_CLOCKEVENTS_BUILD=y
CONFIG_SMP=y
# CONFIG_SPARSE_IRQ is not set
CONFIG_X86_MPPARSE=y
# CONFIG_X86_BIGSMP is not set
CONFIG_X86_EXTENDED_PLATFORM=y
# CONFIG_X86_ELAN is not set
# CONFIG_X86_RDC321X is not set
# CONFIG_X86_32_NON_STANDARD is not set
# CONFIG_SCHED_OMIT_FRAME_POINTER is not set
CONFIG_PARAVIRT_GUEST=y
# CONFIG_XEN is not set
CONFIG_VMI=y
CONFIG_KVM_CLOCK=y
CONFIG_KVM_GUEST=y
CONFIG_PARAVIRT=y
# CONFIG_PARAVIRT_SPINLOCKS is not set
CONFIG_PARAVIRT_CLOCK=y
# CONFIG_PARAVIRT_DEBUG is not set
# CONFIG_MEMTEST is not set
# CONFIG_M386 is not set
# CONFIG_M486 is not set
# CONFIG_M586 is not set
# CONFIG_M586TSC is not set
# CONFIG_M586MMX is not set
# CONFIG_M686 is not set
# CONFIG_MPENTIUMII is not set
# CONFIG_MPENTIUMIII is not set
# CONFIG_MPENTIUMM is not set
# CONFIG_MPENTIUM4 is not set
# CONFIG_MK6 is not set
# CONFIG_MK7 is not set
# CONFIG_MK8 is not set
# CONFIG_MCRUSOE is not set
# CONFIG_MEFFICEON is not set
# CONFIG_MWINCHIPC6 is not set
# CONFIG_MWINCHIP3D is not set
# CONFIG_MGEODEGX1 is not set
# CONFIG_MGEODE_LX is not set
# CONFIG_MCYRIXIII is not set
# CONFIG_MVIAC3_2 is not set
# CONFIG_MVIAC7 is not set
# CONFIG_MPSC is not set
CONFIG_MCORE2=y
# CONFIG_GENERIC_CPU is not set
CONFIG_X86_GENERIC=y
CONFIG_X86_CPU=y
CONFIG_X86_L1_CACHE_BYTES=64
CONFIG_X86_INTERNODE_CACHE_BYTES=64
CONFIG_X86_CMPXCHG=y
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_X86_XADD=y
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_INVLPG=y
CONFIG_X86_BSWAP=y
CONFIG_X86_POPAD_OK=y
CONFIG_X86_INTEL_USERCOPY=y
CONFIG_X86_USE_PPRO_CHECKSUM=y
CONFIG_X86_TSC=y
CONFIG_X86_CMPXCHG64=y
CONFIG_X86_CMOV=y
CONFIG_X86_MINIMUM_CPU_FAMILY=4
CONFIG_X86_DEBUGCTLMSR=y
CONFIG_CPU_SUP_INTEL=y
CONFIG_CPU_SUP_CYRIX_32=y
CONFIG_CPU_SUP_AMD=y
CONFIG_CPU_SUP_CENTAUR=y
CONFIG_CPU_SUP_TRANSMETA_32=y
CONFIG_CPU_SUP_UMC_32=y
# CONFIG_X86_DS is not set
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_DMI=y
# CONFIG_IOMMU_HELPER is not set
# CONFIG_IOMMU_API is not set
CONFIG_NR_CPUS=8
CONFIG_SCHED_SMT=y
CONFIG_SCHED_MC=y
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
# CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS is not set
CONFIG_X86_MCE=y
# CONFIG_X86_MCE_NONFATAL is not set
# CONFIG_X86_MCE_P4THERMAL is not set
CONFIG_VM86=y
# CONFIG_TOSHIBA is not set
# CONFIG_I8K is not set
CONFIG_X86_REBOOTFIXUPS=y
# CONFIG_MICROCODE is not set
# CONFIG_X86_MSR is not set
CONFIG_X86_CPUID=y
CONFIG_X86_CPU_DEBUG=y
# CONFIG_NOHIGHMEM is not set
# CONFIG_HIGHMEM4G is not set
CONFIG_HIGHMEM64G=y
CONFIG_PAGE_OFFSET=0xC0000000
CONFIG_HIGHMEM=y
CONFIG_X86_PAE=y
CONFIG_ARCH_PHYS_ADDR_T_64BIT=y
CONFIG_ARCH_FLATMEM_ENABLE=y
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_SELECT_MEMORY_MODEL=y
CONFIG_SELECT_MEMORY_MODEL=y
CONFIG_FLATMEM_MANUAL=y
# CONFIG_DISCONTIGMEM_MANUAL is not set
# CONFIG_SPARSEMEM_MANUAL is not set
CONFIG_FLATMEM=y
CONFIG_FLAT_NODE_MEM_MAP=y
CONFIG_SPARSEMEM_STATIC=y
CONFIG_PAGEFLAGS_EXTENDED=y
CONFIG_SPLIT_PTLOCK_CPUS=4
CONFIG_PHYS_ADDR_T_64BIT=y
CONFIG_ZONE_DMA_FLAG=1
CONFIG_BOUNCE=y
CONFIG_VIRT_TO_BUS=y
CONFIG_UNEVICTABLE_LRU=y
CONFIG_HAVE_MLOCK=y
CONFIG_HAVE_MLOCKED_PAGE_BIT=y
CONFIG_MMU_NOTIFIER=y
CONFIG_HIGHPTE=y
# CONFIG_X86_CHECK_BIOS_CORRUPTION is not set
CONFIG_X86_RESERVE_LOW_64K=y
# CONFIG_MATH_EMULATION is not set
CONFIG_MTRR=y
# CONFIG_MTRR_SANITIZER is not set
CONFIG_X86_PAT=y
CONFIG_EFI=y
CONFIG_SECCOMP=y
# CONFIG_CC_STACKPROTECTOR is not set
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
CONFIG_HZ_300=y
# CONFIG_HZ_1000 is not set
CONFIG_HZ=300
CONFIG_SCHED_HRTICK=y
CONFIG_KEXEC=y
CONFIG_CRASH_DUMP=y
# CONFIG_KEXEC_JUMP is not set
CONFIG_PHYSICAL_START=0x100000
CONFIG_RELOCATABLE=y
CONFIG_PHYSICAL_ALIGN=0x100000
CONFIG_HOTPLUG_CPU=y
CONFIG_COMPAT_VDSO=y
# CONFIG_CMDLINE_BOOL is not set
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
#
# Power management and ACPI options
#
CONFIG_PM=y
CONFIG_PM_DEBUG=y
# CONFIG_PM_VERBOSE is not set
CONFIG_CAN_PM_TRACE=y
# CONFIG_PM_TRACE_RTC is not set
CONFIG_PM_SLEEP_SMP=y
CONFIG_PM_SLEEP=y
CONFIG_SUSPEND=y
CONFIG_SUSPEND_FREEZER=y
CONFIG_HIBERNATION=y
CONFIG_PM_STD_PARTITION=""
CONFIG_ACPI=y
CONFIG_ACPI_SLEEP=y
CONFIG_ACPI_PROCFS=y
CONFIG_ACPI_PROCFS_POWER=y
CONFIG_ACPI_SYSFS_POWER=y
CONFIG_ACPI_PROC_EVENT=y
CONFIG_ACPI_AC=y
CONFIG_ACPI_BATTERY=y
CONFIG_ACPI_BUTTON=y
# CONFIG_ACPI_VIDEO is not set
CONFIG_ACPI_FAN=y
CONFIG_ACPI_DOCK=y
CONFIG_ACPI_PROCESSOR=y
CONFIG_ACPI_HOTPLUG_CPU=y
CONFIG_ACPI_THERMAL=y
# CONFIG_ACPI_CUSTOM_DSDT is not set
CONFIG_ACPI_BLACKLIST_YEAR=2000
# CONFIG_ACPI_DEBUG is not set
CONFIG_ACPI_PCI_SLOT=y
CONFIG_X86_PM_TIMER=y
CONFIG_ACPI_CONTAINER=y
CONFIG_ACPI_SBS=y
# CONFIG_APM is not set
#
# CPU Frequency scaling
#
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_TABLE=y
CONFIG_CPU_FREQ_DEBUG=y
CONFIG_CPU_FREQ_STAT=y
CONFIG_CPU_FREQ_STAT_DETAILS=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set
CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
CONFIG_CPU_FREQ_GOV_POWERSAVE=y
CONFIG_CPU_FREQ_GOV_USERSPACE=y
CONFIG_CPU_FREQ_GOV_ONDEMAND=y
CONFIG_CPU_FREQ_GOV_CONSERVATIVE=y
#
# CPUFreq processor drivers
#
CONFIG_X86_ACPI_CPUFREQ=y
# CONFIG_X86_POWERNOW_K6 is not set
# CONFIG_X86_POWERNOW_K7 is not set
# CONFIG_X86_POWERNOW_K8 is not set
# CONFIG_X86_GX_SUSPMOD is not set
# CONFIG_X86_SPEEDSTEP_CENTRINO is not set
# CONFIG_X86_SPEEDSTEP_ICH is not set
# CONFIG_X86_SPEEDSTEP_SMI is not set
# CONFIG_X86_P4_CLOCKMOD is not set
# CONFIG_X86_CPUFREQ_NFORCE2 is not set
# CONFIG_X86_LONGRUN is not set
# CONFIG_X86_LONGHAUL is not set
# CONFIG_X86_E_POWERSAVER is not set
#
# shared options
#
# CONFIG_X86_SPEEDSTEP_LIB is not set
CONFIG_CPU_IDLE=y
CONFIG_CPU_IDLE_GOV_LADDER=y
CONFIG_CPU_IDLE_GOV_MENU=y
#
# Bus options (PCI etc.)
#
CONFIG_PCI=y
# CONFIG_PCI_GOBIOS is not set
# CONFIG_PCI_GOMMCONFIG is not set
# CONFIG_PCI_GODIRECT is not set
# CONFIG_PCI_GOOLPC is not set
CONFIG_PCI_GOANY=y
CONFIG_PCI_BIOS=y
CONFIG_PCI_DIRECT=y
CONFIG_PCI_MMCONFIG=y
CONFIG_PCI_DOMAINS=y
# CONFIG_DMAR is not set
CONFIG_PCIEPORTBUS=y
# CONFIG_HOTPLUG_PCI_PCIE is not set
CONFIG_PCIEAER=y
# CONFIG_PCIEASPM is not set
CONFIG_ARCH_SUPPORTS_MSI=y
CONFIG_PCI_MSI=y
CONFIG_PCI_LEGACY=y
# CONFIG_PCI_DEBUG is not set
# CONFIG_PCI_STUB is not set
CONFIG_HT_IRQ=y
# CONFIG_PCI_IOV is not set
CONFIG_ISA_DMA_API=y
CONFIG_ISA=y
CONFIG_EISA=y
CONFIG_EISA_VLB_PRIMING=y
CONFIG_EISA_PCI_EISA=y
CONFIG_EISA_VIRTUAL_ROOT=y
CONFIG_EISA_NAMES=y
CONFIG_MCA=y
CONFIG_MCA_LEGACY=y
CONFIG_MCA_PROC_FS=y
# CONFIG_SCx200 is not set
# CONFIG_OLPC is not set
CONFIG_PCCARD=y
CONFIG_PCMCIA_DEBUG=y
# CONFIG_PCMCIA is not set
CONFIG_CARDBUS=y
#
# PC-card bridges
#
CONFIG_YENTA=y
CONFIG_YENTA_O2=y
CONFIG_YENTA_RICOH=y
CONFIG_YENTA_TI=y
CONFIG_YENTA_ENE_TUNE=y
CONFIG_YENTA_TOSHIBA=y
CONFIG_PCMCIA_PROBE=y
CONFIG_PCCARD_NONSTATIC=y
CONFIG_HOTPLUG_PCI=y
# CONFIG_HOTPLUG_PCI_FAKE is not set
# CONFIG_HOTPLUG_PCI_COMPAQ is not set
# CONFIG_HOTPLUG_PCI_IBM is not set
# CONFIG_HOTPLUG_PCI_ACPI is not set
CONFIG_HOTPLUG_PCI_CPCI=y
# CONFIG_HOTPLUG_PCI_CPCI_ZT5550 is not set
# CONFIG_HOTPLUG_PCI_CPCI_GENERIC is not set
CONFIG_HOTPLUG_PCI_SHPC=y
#
# Executable file formats / Emulations
#
CONFIG_BINFMT_ELF=y
# CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS is not set
CONFIG_HAVE_AOUT=y
# CONFIG_BINFMT_AOUT is not set
# CONFIG_BINFMT_MISC is not set
CONFIG_HAVE_ATOMIC_IOMAP=y
CONFIG_NET=y
#
# Networking options
#
CONFIG_PACKET=y
CONFIG_PACKET_MMAP=y
CONFIG_UNIX=y
CONFIG_XFRM=y
# CONFIG_XFRM_USER is not set
# CONFIG_XFRM_SUB_POLICY is not set
# CONFIG_XFRM_MIGRATE is not set
CONFIG_XFRM_STATISTICS=y
CONFIG_NET_KEY=y
# CONFIG_NET_KEY_MIGRATE is not set
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
CONFIG_IP_ADVANCED_ROUTER=y
CONFIG_ASK_IP_FIB_HASH=y
# CONFIG_IP_FIB_TRIE is not set
CONFIG_IP_FIB_HASH=y
CONFIG_IP_MULTIPLE_TABLES=y
CONFIG_IP_ROUTE_MULTIPATH=y
CONFIG_IP_ROUTE_VERBOSE=y
# CONFIG_IP_PNP is not set
# CONFIG_NET_IPIP is not set
# CONFIG_NET_IPGRE is not set
CONFIG_IP_MROUTE=y
CONFIG_IP_PIMSM_V1=y
CONFIG_IP_PIMSM_V2=y
# CONFIG_ARPD is not set
CONFIG_SYN_COOKIES=y
# CONFIG_INET_AH is not set
CONFIG_INET_ESP=y
# CONFIG_INET_IPCOMP is not set
# CONFIG_INET_XFRM_TUNNEL is not set
# CONFIG_INET_TUNNEL is not set
# CONFIG_INET_XFRM_MODE_TRANSPORT is not set
CONFIG_INET_XFRM_MODE_TUNNEL=y
# CONFIG_INET_XFRM_MODE_BEET is not set
CONFIG_INET_LRO=y
CONFIG_INET_DIAG=y
CONFIG_INET_TCP_DIAG=y
# CONFIG_TCP_CONG_ADVANCED is not set
CONFIG_TCP_CONG_CUBIC=y
CONFIG_DEFAULT_TCP_CONG="cubic"
CONFIG_TCP_MD5SIG=y
# CONFIG_IPV6 is not set
# CONFIG_NETLABEL is not set
CONFIG_NETWORK_SECMARK=y
CONFIG_NETFILTER=y
# CONFIG_NETFILTER_DEBUG is not set
CONFIG_NETFILTER_ADVANCED=y
#
# Core Netfilter Configuration
#
# CONFIG_NETFILTER_NETLINK_QUEUE is not set
# CONFIG_NETFILTER_NETLINK_LOG is not set
# CONFIG_NF_CONNTRACK is not set
CONFIG_NETFILTER_XTABLES=y
# CONFIG_NETFILTER_XT_TARGET_CLASSIFY is not set
# CONFIG_NETFILTER_XT_TARGET_LED is not set
# CONFIG_NETFILTER_XT_TARGET_MARK is not set
# CONFIG_NETFILTER_XT_TARGET_NFLOG is not set
# CONFIG_NETFILTER_XT_TARGET_NFQUEUE is not set
# CONFIG_NETFILTER_XT_TARGET_RATEEST is not set
# CONFIG_NETFILTER_XT_TARGET_SECMARK is not set
# CONFIG_NETFILTER_XT_TARGET_TCPMSS is not set
# CONFIG_NETFILTER_XT_MATCH_COMMENT is not set
# CONFIG_NETFILTER_XT_MATCH_DCCP is not set
# CONFIG_NETFILTER_XT_MATCH_DSCP is not set
# CONFIG_NETFILTER_XT_MATCH_ESP is not set
# CONFIG_NETFILTER_XT_MATCH_HASHLIMIT is not set
# CONFIG_NETFILTER_XT_MATCH_HL is not set
# CONFIG_NETFILTER_XT_MATCH_IPRANGE is not set
# CONFIG_NETFILTER_XT_MATCH_LENGTH is not set
# CONFIG_NETFILTER_XT_MATCH_LIMIT is not set
# CONFIG_NETFILTER_XT_MATCH_MAC is not set
# CONFIG_NETFILTER_XT_MATCH_MARK is not set
# CONFIG_NETFILTER_XT_MATCH_MULTIPORT is not set
# CONFIG_NETFILTER_XT_MATCH_OWNER is not set
# CONFIG_NETFILTER_XT_MATCH_POLICY is not set
# CONFIG_NETFILTER_XT_MATCH_PKTTYPE is not set
# CONFIG_NETFILTER_XT_MATCH_QUOTA is not set
# CONFIG_NETFILTER_XT_MATCH_RATEEST is not set
# CONFIG_NETFILTER_XT_MATCH_REALM is not set
# CONFIG_NETFILTER_XT_MATCH_RECENT is not set
# CONFIG_NETFILTER_XT_MATCH_SCTP is not set
# CONFIG_NETFILTER_XT_MATCH_STATISTIC is not set
# CONFIG_NETFILTER_XT_MATCH_STRING is not set
# CONFIG_NETFILTER_XT_MATCH_TCPMSS is not set
# CONFIG_NETFILTER_XT_MATCH_TIME is not set
# CONFIG_NETFILTER_XT_MATCH_U32 is not set
# CONFIG_IP_VS is not set
#
# IP: Netfilter Configuration
#
# CONFIG_NF_DEFRAG_IPV4 is not set
# CONFIG_IP_NF_QUEUE is not set
CONFIG_IP_NF_IPTABLES=y
# CONFIG_IP_NF_MATCH_ADDRTYPE is not set
# CONFIG_IP_NF_MATCH_AH is not set
# CONFIG_IP_NF_MATCH_ECN is not set
# CONFIG_IP_NF_MATCH_TTL is not set
CONFIG_IP_NF_FILTER=y
# CONFIG_IP_NF_TARGET_REJECT is not set
# CONFIG_IP_NF_TARGET_LOG is not set
# CONFIG_IP_NF_TARGET_ULOG is not set
# CONFIG_IP_NF_MANGLE is not set
# CONFIG_IP_NF_TARGET_TTL is not set
# CONFIG_IP_NF_RAW is not set
# CONFIG_IP_NF_SECURITY is not set
# CONFIG_IP_NF_ARPTABLES is not set
# CONFIG_IP_DCCP is not set
# CONFIG_IP_SCTP is not set
# CONFIG_TIPC is not set
CONFIG_ATM=y
CONFIG_ATM_CLIP=y
# CONFIG_ATM_CLIP_NO_ICMP is not set
# CONFIG_ATM_LANE is not set
# CONFIG_ATM_BR2684 is not set
# CONFIG_BRIDGE is not set
# CONFIG_NET_DSA is not set
# CONFIG_VLAN_8021Q is not set
# CONFIG_DECNET is not set
CONFIG_LLC=y
# CONFIG_LLC2 is not set
# CONFIG_IPX is not set
# CONFIG_ATALK is not set
# CONFIG_X25 is not set
# CONFIG_LAPB is not set
# CONFIG_ECONET is not set
# CONFIG_WAN_ROUTER is not set
# CONFIG_PHONET is not set
CONFIG_NET_SCHED=y
#
# Queueing/Scheduling
#
# CONFIG_NET_SCH_CBQ is not set
# CONFIG_NET_SCH_HTB is not set
# CONFIG_NET_SCH_HFSC is not set
# CONFIG_NET_SCH_ATM is not set
# CONFIG_NET_SCH_PRIO is not set
# CONFIG_NET_SCH_MULTIQ is not set
# CONFIG_NET_SCH_RED is not set
# CONFIG_NET_SCH_SFQ is not set
# CONFIG_NET_SCH_TEQL is not set
# CONFIG_NET_SCH_TBF is not set
# CONFIG_NET_SCH_GRED is not set
# CONFIG_NET_SCH_DSMARK is not set
# CONFIG_NET_SCH_NETEM is not set
# CONFIG_NET_SCH_DRR is not set
# CONFIG_NET_SCH_INGRESS is not set
#
# Classification
#
CONFIG_NET_CLS=y
# CONFIG_NET_CLS_BASIC is not set
# CONFIG_NET_CLS_TCINDEX is not set
# CONFIG_NET_CLS_ROUTE4 is not set
# CONFIG_NET_CLS_FW is not set
# CONFIG_NET_CLS_U32 is not set
# CONFIG_NET_CLS_RSVP is not set
# CONFIG_NET_CLS_RSVP6 is not set
# CONFIG_NET_CLS_FLOW is not set
# CONFIG_NET_CLS_CGROUP is not set
CONFIG_NET_EMATCH=y
CONFIG_NET_EMATCH_STACK=32
# CONFIG_NET_EMATCH_CMP is not set
# CONFIG_NET_EMATCH_NBYTE is not set
# CONFIG_NET_EMATCH_U32 is not set
# CONFIG_NET_EMATCH_META is not set
# CONFIG_NET_EMATCH_TEXT is not set
CONFIG_NET_CLS_ACT=y
# CONFIG_NET_ACT_POLICE is not set
# CONFIG_NET_ACT_GACT is not set
# CONFIG_NET_ACT_MIRRED is not set
# CONFIG_NET_ACT_IPT is not set
# CONFIG_NET_ACT_NAT is not set
# CONFIG_NET_ACT_PEDIT is not set
# CONFIG_NET_ACT_SIMP is not set
# CONFIG_NET_ACT_SKBEDIT is not set
CONFIG_NET_SCH_FIFO=y
# CONFIG_DCB is not set
#
# Network testing
#
# CONFIG_NET_PKTGEN is not set
# CONFIG_NET_TCPPROBE is not set
# CONFIG_NET_DROP_MONITOR is not set
CONFIG_HAMRADIO=y
#
# Packet Radio protocols
#
# CONFIG_AX25 is not set
# CONFIG_CAN is not set
# CONFIG_IRDA is not set
CONFIG_BT=y
CONFIG_BT_L2CAP=y
# CONFIG_BT_SCO is not set
CONFIG_BT_RFCOMM=y
CONFIG_BT_RFCOMM_TTY=y
# CONFIG_BT_BNEP is not set
# CONFIG_BT_HIDP is not set
#
# Bluetooth device drivers
#
CONFIG_BT_HCIBTUSB=y
# CONFIG_BT_HCIBTSDIO is not set
# CONFIG_BT_HCIUART is not set
# CONFIG_BT_HCIBCM203X is not set
# CONFIG_BT_HCIBPA10X is not set
# CONFIG_BT_HCIBFUSB is not set
# CONFIG_BT_HCIVHCI is not set
# CONFIG_AF_RXRPC is not set
CONFIG_FIB_RULES=y
CONFIG_WIRELESS=y
CONFIG_CFG80211=y
# CONFIG_CFG80211_REG_DEBUG is not set
CONFIG_WIRELESS_OLD_REGULATORY=y
CONFIG_WIRELESS_EXT=y
CONFIG_WIRELESS_EXT_SYSFS=y
CONFIG_LIB80211=y
# CONFIG_LIB80211_DEBUG is not set
CONFIG_MAC80211=y
#
# Rate control algorithm selection
#
CONFIG_MAC80211_RC_MINSTREL=y
# CONFIG_MAC80211_RC_DEFAULT_PID is not set
CONFIG_MAC80211_RC_DEFAULT_MINSTREL=y
CONFIG_MAC80211_RC_DEFAULT="minstrel"
CONFIG_MAC80211_MESH=y
CONFIG_MAC80211_LEDS=y
CONFIG_MAC80211_DEBUGFS=y
CONFIG_MAC80211_DEBUG_MENU=y
# CONFIG_MAC80211_DEBUG_PACKET_ALIGNMENT is not set
# CONFIG_MAC80211_NOINLINE is not set
# CONFIG_MAC80211_VERBOSE_DEBUG is not set
# CONFIG_MAC80211_HT_DEBUG is not set
# CONFIG_MAC80211_TKIP_DEBUG is not set
# CONFIG_MAC80211_IBSS_DEBUG is not set
# CONFIG_MAC80211_VERBOSE_PS_DEBUG is not set
# CONFIG_MAC80211_VERBOSE_MPL_DEBUG is not set
# CONFIG_MAC80211_DEBUG_COUNTERS is not set
# CONFIG_WIMAX is not set
CONFIG_RFKILL=y
# CONFIG_RFKILL_INPUT is not set
CONFIG_RFKILL_LEDS=y
# CONFIG_NET_9P is not set
#
# Device Drivers
#
#
# Generic Driver Options
#
CONFIG_UEVENT_HELPER_PATH="/sbin/hotplug"
CONFIG_STANDALONE=y
CONFIG_PREVENT_FIRMWARE_BUILD=y
CONFIG_FW_LOADER=y
CONFIG_FIRMWARE_IN_KERNEL=y
CONFIG_EXTRA_FIRMWARE=""
# CONFIG_DEBUG_DRIVER is not set
# CONFIG_DEBUG_DEVRES is not set
# CONFIG_SYS_HYPERVISOR is not set
# CONFIG_CONNECTOR is not set
# CONFIG_MTD is not set
CONFIG_PARPORT=y
CONFIG_PARPORT_PC=y
# CONFIG_PARPORT_SERIAL is not set
CONFIG_PARPORT_PC_FIFO=y
# CONFIG_PARPORT_PC_SUPERIO is not set
# CONFIG_PARPORT_GSC is not set
# CONFIG_PARPORT_AX88796 is not set
CONFIG_PARPORT_1284=y
CONFIG_PNP=y
CONFIG_PNP_DEBUG_MESSAGES=y
#
# Protocols
#
CONFIG_ISAPNP=y
CONFIG_PNPBIOS=y
CONFIG_PNPBIOS_PROC_FS=y
CONFIG_PNPACPI=y
CONFIG_BLK_DEV=y
# CONFIG_BLK_DEV_FD is not set
# CONFIG_BLK_DEV_XD is not set
# CONFIG_PARIDE is not set
# CONFIG_BLK_CPQ_DA is not set
# CONFIG_BLK_CPQ_CISS_DA is not set
# CONFIG_BLK_DEV_DAC960 is not set
# CONFIG_BLK_DEV_UMEM is not set
# CONFIG_BLK_DEV_COW_COMMON is not set
CONFIG_BLK_DEV_LOOP=y
# CONFIG_BLK_DEV_CRYPTOLOOP is not set
# CONFIG_BLK_DEV_NBD is not set
# CONFIG_BLK_DEV_SX8 is not set
# CONFIG_BLK_DEV_UB is not set
CONFIG_BLK_DEV_RAM=y
CONFIG_BLK_DEV_RAM_COUNT=16
CONFIG_BLK_DEV_RAM_SIZE=65536
# CONFIG_BLK_DEV_XIP is not set
# CONFIG_CDROM_PKTCDVD is not set
# CONFIG_ATA_OVER_ETH is not set
# CONFIG_BLK_DEV_HD is not set
CONFIG_MISC_DEVICES=y
# CONFIG_IBM_ASM is not set
# CONFIG_PHANTOM is not set
# CONFIG_SGI_IOC4 is not set
# CONFIG_TIFM_CORE is not set
# CONFIG_ICS932S401 is not set
# CONFIG_ENCLOSURE_SERVICES is not set
# CONFIG_HP_ILO is not set
# CONFIG_ISL29003 is not set
# CONFIG_C2PORT is not set
#
# EEPROM support
#
# CONFIG_EEPROM_AT24 is not set
# CONFIG_EEPROM_AT25 is not set
# CONFIG_EEPROM_LEGACY is not set
# CONFIG_EEPROM_93CX6 is not set
CONFIG_HAVE_IDE=y
CONFIG_IDE=y
#
# Please see Documentation/ide/ide.txt for help/info on IDE drives
#
CONFIG_IDE_ATAPI=y
# CONFIG_BLK_DEV_IDE_SATA is not set
CONFIG_IDE_GD=y
CONFIG_IDE_GD_ATA=y
# CONFIG_IDE_GD_ATAPI is not set
# CONFIG_BLK_DEV_DELKIN is not set
CONFIG_BLK_DEV_IDECD=y
CONFIG_BLK_DEV_IDECD_VERBOSE_ERRORS=y
# CONFIG_BLK_DEV_IDETAPE is not set
# CONFIG_BLK_DEV_IDEACPI is not set
# CONFIG_IDE_TASK_IOCTL is not set
CONFIG_IDE_PROC_FS=y
#
# IDE chipset support/bugfixes
#
# CONFIG_IDE_GENERIC is not set
# CONFIG_BLK_DEV_PLATFORM is not set
# CONFIG_BLK_DEV_CMD640 is not set
# CONFIG_BLK_DEV_IDEPNP is not set
#
# PCI IDE chipsets support
#
# CONFIG_BLK_DEV_GENERIC is not set
# CONFIG_BLK_DEV_OPTI621 is not set
# CONFIG_BLK_DEV_RZ1000 is not set
# CONFIG_BLK_DEV_AEC62XX is not set
# CONFIG_BLK_DEV_ALI15X3 is not set
# CONFIG_BLK_DEV_AMD74XX is not set
# CONFIG_BLK_DEV_ATIIXP is not set
# CONFIG_BLK_DEV_CMD64X is not set
# CONFIG_BLK_DEV_TRIFLEX is not set
# CONFIG_BLK_DEV_CS5520 is not set
# CONFIG_BLK_DEV_CS5530 is not set
# CONFIG_BLK_DEV_CS5535 is not set
# CONFIG_BLK_DEV_CS5536 is not set
# CONFIG_BLK_DEV_HPT366 is not set
# CONFIG_BLK_DEV_JMICRON is not set
# CONFIG_BLK_DEV_SC1200 is not set
# CONFIG_BLK_DEV_PIIX is not set
# CONFIG_BLK_DEV_IT8172 is not set
# CONFIG_BLK_DEV_IT8213 is not set
# CONFIG_BLK_DEV_IT821X is not set
# CONFIG_BLK_DEV_NS87415 is not set
# CONFIG_BLK_DEV_PDC202XX_OLD is not set
# CONFIG_BLK_DEV_PDC202XX_NEW is not set
# CONFIG_BLK_DEV_SVWKS is not set
# CONFIG_BLK_DEV_SIIMAGE is not set
# CONFIG_BLK_DEV_SIS5513 is not set
# CONFIG_BLK_DEV_SLC90E66 is not set
# CONFIG_BLK_DEV_TRM290 is not set
# CONFIG_BLK_DEV_VIA82CXXX is not set
# CONFIG_BLK_DEV_TC86C001 is not set
#
# Other IDE chipsets support
#
#
# Note: most of these also require special kernel boot parameters
#
# CONFIG_BLK_DEV_4DRIVES is not set
# CONFIG_BLK_DEV_ALI14XX is not set
# CONFIG_BLK_DEV_DTC2278 is not set
# CONFIG_BLK_DEV_HT6560B is not set
# CONFIG_BLK_DEV_QD65XX is not set
# CONFIG_BLK_DEV_UMC8672 is not set
# CONFIG_BLK_DEV_IDEDMA is not set
#
# SCSI device support
#
# CONFIG_RAID_ATTRS is not set
CONFIG_SCSI=y
CONFIG_SCSI_DMA=y
# CONFIG_SCSI_TGT is not set
# CONFIG_SCSI_NETLINK is not set
CONFIG_SCSI_PROC_FS=y
#
# SCSI support type (disk, tape, CD-ROM)
#
CONFIG_BLK_DEV_SD=y
# CONFIG_CHR_DEV_ST is not set
# CONFIG_CHR_DEV_OSST is not set
CONFIG_BLK_DEV_SR=y
CONFIG_BLK_DEV_SR_VENDOR=y
CONFIG_CHR_DEV_SG=y
# CONFIG_CHR_DEV_SCH is not set
#
# Some SCSI devices (e.g. CD jukebox) support multiple LUNs
#
CONFIG_SCSI_MULTI_LUN=y
CONFIG_SCSI_CONSTANTS=y
CONFIG_SCSI_LOGGING=y
# CONFIG_SCSI_SCAN_ASYNC is not set
CONFIG_SCSI_WAIT_SCAN=m
#
# SCSI Transports
#
# CONFIG_SCSI_SPI_ATTRS is not set
# CONFIG_SCSI_FC_ATTRS is not set
# CONFIG_SCSI_ISCSI_ATTRS is not set
# CONFIG_SCSI_SAS_ATTRS is not set
# CONFIG_SCSI_SAS_LIBSAS is not set
# CONFIG_SCSI_SRP_ATTRS is not set
CONFIG_SCSI_LOWLEVEL=y
# CONFIG_ISCSI_TCP is not set
# CONFIG_SCSI_CXGB3_ISCSI is not set
# CONFIG_BLK_DEV_3W_XXXX_RAID is not set
# CONFIG_SCSI_3W_9XXX is not set
# CONFIG_SCSI_7000FASST is not set
# CONFIG_SCSI_ACARD is not set
# CONFIG_SCSI_AHA152X is not set
# CONFIG_SCSI_AHA1542 is not set
# CONFIG_SCSI_AHA1740 is not set
# CONFIG_SCSI_AACRAID is not set
# CONFIG_SCSI_AIC7XXX is not set
# CONFIG_SCSI_AIC7XXX_OLD is not set
# CONFIG_SCSI_AIC79XX is not set
# CONFIG_SCSI_AIC94XX is not set
# CONFIG_SCSI_DPT_I2O is not set
# CONFIG_SCSI_ADVANSYS is not set
# CONFIG_SCSI_IN2000 is not set
# CONFIG_SCSI_ARCMSR is not set
CONFIG_MEGARAID_NEWGEN=y
# CONFIG_MEGARAID_MM is not set
# CONFIG_MEGARAID_LEGACY is not set
# CONFIG_MEGARAID_SAS is not set
# CONFIG_SCSI_MPT2SAS is not set
# CONFIG_SCSI_HPTIOP is not set
# CONFIG_SCSI_BUSLOGIC is not set
# CONFIG_LIBFC is not set
# CONFIG_LIBFCOE is not set
# CONFIG_FCOE is not set
# CONFIG_FCOE_FNIC is not set
# CONFIG_SCSI_DMX3191D is not set
# CONFIG_SCSI_DTC3280 is not set
# CONFIG_SCSI_EATA is not set
# CONFIG_SCSI_FUTURE_DOMAIN is not set
# CONFIG_SCSI_FD_MCS is not set
# CONFIG_SCSI_GDTH is not set
# CONFIG_SCSI_GENERIC_NCR5380 is not set
# CONFIG_SCSI_GENERIC_NCR5380_MMIO is not set
# CONFIG_SCSI_IBMMCA is not set
# CONFIG_SCSI_IPS is not set
# CONFIG_SCSI_INITIO is not set
# CONFIG_SCSI_INIA100 is not set
# CONFIG_SCSI_PPA is not set
# CONFIG_SCSI_IMM is not set
# CONFIG_SCSI_MVSAS is not set
# CONFIG_SCSI_NCR53C406A is not set
# CONFIG_SCSI_NCR_D700 is not set
# CONFIG_SCSI_STEX is not set
# CONFIG_SCSI_SYM53C8XX_2 is not set
# CONFIG_SCSI_IPR is not set
# CONFIG_SCSI_NCR_Q720 is not set
# CONFIG_SCSI_PAS16 is not set
# CONFIG_SCSI_QLOGIC_FAS is not set
# CONFIG_SCSI_QLOGIC_1280 is not set
# CONFIG_SCSI_QLA_FC is not set
# CONFIG_SCSI_QLA_ISCSI is not set
# CONFIG_SCSI_LPFC is not set
# CONFIG_SCSI_SIM710 is not set
# CONFIG_SCSI_SYM53C416 is not set
# CONFIG_SCSI_DC395x is not set
# CONFIG_SCSI_DC390T is not set
# CONFIG_SCSI_T128 is not set
# CONFIG_SCSI_U14_34F is not set
# CONFIG_SCSI_ULTRASTOR is not set
# CONFIG_SCSI_NSP32 is not set
# CONFIG_SCSI_DEBUG is not set
# CONFIG_SCSI_SRP is not set
# CONFIG_SCSI_DH is not set
# CONFIG_SCSI_OSD_INITIATOR is not set
CONFIG_ATA=y
# CONFIG_ATA_NONSTANDARD is not set
CONFIG_ATA_ACPI=y
CONFIG_SATA_PMP=y
CONFIG_SATA_AHCI=y
# CONFIG_SATA_SIL24 is not set
CONFIG_ATA_SFF=y
# CONFIG_SATA_SVW is not set
CONFIG_ATA_PIIX=y
# CONFIG_SATA_MV is not set
# CONFIG_SATA_NV is not set
# CONFIG_PDC_ADMA is not set
# CONFIG_SATA_QSTOR is not set
# CONFIG_SATA_PROMISE is not set
# CONFIG_SATA_SX4 is not set
# CONFIG_SATA_SIL is not set
# CONFIG_SATA_SIS is not set
# CONFIG_SATA_ULI is not set
# CONFIG_SATA_VIA is not set
# CONFIG_SATA_VITESSE is not set
# CONFIG_SATA_INIC162X is not set
# CONFIG_PATA_ACPI is not set
# CONFIG_PATA_ALI is not set
# CONFIG_PATA_AMD is not set
# CONFIG_PATA_ARTOP is not set
# CONFIG_PATA_ATIIXP is not set
# CONFIG_PATA_CMD640_PCI is not set
# CONFIG_PATA_CMD64X is not set
# CONFIG_PATA_CS5520 is not set
# CONFIG_PATA_CS5530 is not set
# CONFIG_PATA_CS5535 is not set
# CONFIG_PATA_CS5536 is not set
# CONFIG_PATA_CYPRESS is not set
# CONFIG_PATA_EFAR is not set
CONFIG_ATA_GENERIC=y
# CONFIG_PATA_HPT366 is not set
# CONFIG_PATA_HPT37X is not set
# CONFIG_PATA_HPT3X2N is not set
# CONFIG_PATA_HPT3X3 is not set
# CONFIG_PATA_ISAPNP is not set
# CONFIG_PATA_IT821X is not set
# CONFIG_PATA_IT8213 is not set
# CONFIG_PATA_JMICRON is not set
# CONFIG_PATA_LEGACY is not set
# CONFIG_PATA_TRIFLEX is not set
# CONFIG_PATA_MARVELL is not set
# CONFIG_PATA_MPIIX is not set
# CONFIG_PATA_OLDPIIX is not set
# CONFIG_PATA_NETCELL is not set
# CONFIG_PATA_NINJA32 is not set
# CONFIG_PATA_NS87410 is not set
# CONFIG_PATA_NS87415 is not set
# CONFIG_PATA_OPTI is not set
# CONFIG_PATA_OPTIDMA is not set
# CONFIG_PATA_PDC_OLD is not set
# CONFIG_PATA_QDI is not set
# CONFIG_PATA_RADISYS is not set
# CONFIG_PATA_RZ1000 is not set
# CONFIG_PATA_SC1200 is not set
# CONFIG_PATA_SERVERWORKS is not set
# CONFIG_PATA_PDC2027X is not set
# CONFIG_PATA_SIL680 is not set
# CONFIG_PATA_SIS is not set
# CONFIG_PATA_VIA is not set
# CONFIG_PATA_WINBOND is not set
# CONFIG_PATA_WINBOND_VLB is not set
# CONFIG_PATA_SCH is not set
CONFIG_MD=y
# CONFIG_BLK_DEV_MD is not set
CONFIG_BLK_DEV_DM=y
# CONFIG_DM_DEBUG is not set
# CONFIG_DM_CRYPT is not set
CONFIG_DM_SNAPSHOT=y
CONFIG_DM_MIRROR=y
# CONFIG_DM_ZERO is not set
# CONFIG_DM_MULTIPATH is not set
# CONFIG_DM_DELAY is not set
CONFIG_DM_UEVENT=y
CONFIG_FUSION=y
# CONFIG_FUSION_SPI is not set
# CONFIG_FUSION_FC is not set
# CONFIG_FUSION_SAS is not set
CONFIG_FUSION_MAX_SGE=128
CONFIG_FUSION_LOGGING=y
#
# IEEE 1394 (FireWire) support
#
#
# Enable only one of the two stacks, unless you know what you are doing
#
# CONFIG_FIREWIRE is not set
# CONFIG_IEEE1394 is not set
# CONFIG_I2O is not set
CONFIG_MACINTOSH_DRIVERS=y
CONFIG_MAC_EMUMOUSEBTN=y
CONFIG_NETDEVICES=y
CONFIG_COMPAT_NET_DEV_OPS=y
# CONFIG_IFB is not set
CONFIG_DUMMY=y
# CONFIG_BONDING is not set
# CONFIG_MACVLAN is not set
# CONFIG_EQUALIZER is not set
CONFIG_TUN=y
# CONFIG_VETH is not set
# CONFIG_NET_SB1000 is not set
# CONFIG_ARCNET is not set
CONFIG_PHYLIB=y
#
# MII PHY device drivers
#
# CONFIG_MARVELL_PHY is not set
# CONFIG_DAVICOM_PHY is not set
# CONFIG_QSEMI_PHY is not set
# CONFIG_LXT_PHY is not set
# CONFIG_CICADA_PHY is not set
# CONFIG_VITESSE_PHY is not set
# CONFIG_SMSC_PHY is not set
# CONFIG_BROADCOM_PHY is not set
# CONFIG_ICPLUS_PHY is not set
# CONFIG_REALTEK_PHY is not set
# CONFIG_NATIONAL_PHY is not set
# CONFIG_STE10XP is not set
# CONFIG_LSI_ET1011C_PHY is not set
# CONFIG_FIXED_PHY is not set
# CONFIG_MDIO_BITBANG is not set
CONFIG_NET_ETHERNET=y
CONFIG_MII=y
# CONFIG_HAPPYMEAL is not set
# CONFIG_SUNGEM is not set
# CONFIG_CASSINI is not set
CONFIG_NET_VENDOR_3COM=y
# CONFIG_EL1 is not set
# CONFIG_EL2 is not set
# CONFIG_ELPLUS is not set
# CONFIG_EL16 is not set
# CONFIG_EL3 is not set
# CONFIG_3C515 is not set
# CONFIG_ELMC is not set
# CONFIG_ELMC_II is not set
# CONFIG_VORTEX is not set
# CONFIG_TYPHOON is not set
# CONFIG_LANCE is not set
CONFIG_NET_VENDOR_SMC=y
# CONFIG_WD80x3 is not set
# CONFIG_ULTRAMCA is not set
# CONFIG_ULTRA is not set
# CONFIG_ULTRA32 is not set
# CONFIG_SMC9194 is not set
# CONFIG_ENC28J60 is not set
# CONFIG_ETHOC is not set
CONFIG_NET_VENDOR_RACAL=y
# CONFIG_NI52 is not set
# CONFIG_NI65 is not set
# CONFIG_DNET is not set
CONFIG_NET_TULIP=y
# CONFIG_DE2104X is not set
# CONFIG_TULIP is not set
# CONFIG_DE4X5 is not set
# CONFIG_WINBOND_840 is not set
# CONFIG_DM9102 is not set
# CONFIG_ULI526X is not set
# CONFIG_PCMCIA_XIRCOM is not set
# CONFIG_AT1700 is not set
# CONFIG_DEPCA is not set
# CONFIG_HP100 is not set
CONFIG_NET_ISA=y
# CONFIG_E2100 is not set
# CONFIG_EWRK3 is not set
# CONFIG_EEXPRESS is not set
# CONFIG_EEXPRESS_PRO is not set
# CONFIG_HPLAN_PLUS is not set
# CONFIG_HPLAN is not set
# CONFIG_LP486E is not set
# CONFIG_ETH16I is not set
# CONFIG_NE2000 is not set
# CONFIG_ZNET is not set
# CONFIG_SEEQ8005 is not set
# CONFIG_NE2_MCA is not set
# CONFIG_IBMLANA is not set
# CONFIG_IBM_NEW_EMAC_ZMII is not set
# CONFIG_IBM_NEW_EMAC_RGMII is not set
# CONFIG_IBM_NEW_EMAC_TAH is not set
# CONFIG_IBM_NEW_EMAC_EMAC4 is not set
# CONFIG_IBM_NEW_EMAC_NO_FLOW_CTRL is not set
# CONFIG_IBM_NEW_EMAC_MAL_CLR_ICINTSTAT is not set
# CONFIG_IBM_NEW_EMAC_MAL_COMMON_ERR is not set
CONFIG_NET_PCI=y
# CONFIG_PCNET32 is not set
# CONFIG_AMD8111_ETH is not set
# CONFIG_ADAPTEC_STARFIRE is not set
# CONFIG_AC3200 is not set
# CONFIG_APRICOT is not set
# CONFIG_B44 is not set
# CONFIG_FORCEDETH is not set
# CONFIG_CS89x0 is not set
# CONFIG_E100 is not set
# CONFIG_LNE390 is not set
# CONFIG_FEALNX is not set
# CONFIG_NATSEMI is not set
# CONFIG_NE2K_PCI is not set
# CONFIG_NE3210 is not set
# CONFIG_ES3210 is not set
# CONFIG_8139CP is not set
# CONFIG_8139TOO is not set
# CONFIG_R6040 is not set
# CONFIG_SIS900 is not set
# CONFIG_EPIC100 is not set
# CONFIG_SMSC9420 is not set
# CONFIG_SUNDANCE is not set
# CONFIG_TLAN is not set
# CONFIG_VIA_RHINE is not set
# CONFIG_SC92031 is not set
CONFIG_NET_POCKET=y
# CONFIG_ATP is not set
# CONFIG_DE600 is not set
# CONFIG_DE620 is not set
# CONFIG_ATL2 is not set
CONFIG_NETDEV_1000=y
# CONFIG_ACENIC is not set
# CONFIG_DL2K is not set
# CONFIG_E1000 is not set
CONFIG_E1000E=y
# CONFIG_IP1000 is not set
# CONFIG_IGB is not set
# CONFIG_IGBVF is not set
# CONFIG_NS83820 is not set
# CONFIG_HAMACHI is not set
# CONFIG_YELLOWFIN is not set
# CONFIG_R8169 is not set
# CONFIG_SIS190 is not set
# CONFIG_SKGE is not set
# CONFIG_SKY2 is not set
# CONFIG_VIA_VELOCITY is not set
CONFIG_TIGON3=y
# CONFIG_BNX2 is not set
# CONFIG_QLA3XXX is not set
# CONFIG_ATL1 is not set
# CONFIG_ATL1E is not set
# CONFIG_ATL1C is not set
# CONFIG_JME is not set
CONFIG_NETDEV_10000=y
# CONFIG_CHELSIO_T1 is not set
CONFIG_CHELSIO_T3_DEPENDS=y
# CONFIG_CHELSIO_T3 is not set
# CONFIG_ENIC is not set
# CONFIG_IXGBE is not set
# CONFIG_IXGB is not set
# CONFIG_S2IO is not set
# CONFIG_VXGE is not set
# CONFIG_MYRI10GE is not set
# CONFIG_NETXEN_NIC is not set
# CONFIG_NIU is not set
# CONFIG_MLX4_EN is not set
# CONFIG_MLX4_CORE is not set
# CONFIG_TEHUTI is not set
# CONFIG_BNX2X is not set
# CONFIG_QLGE is not set
# CONFIG_SFC is not set
# CONFIG_BE2NET is not set
CONFIG_TR=y
# CONFIG_IBMTR is not set
# CONFIG_IBMOL is not set
# CONFIG_IBMLS is not set
# CONFIG_3C359 is not set
# CONFIG_TMS380TR is not set
# CONFIG_SMCTR is not set
#
# Wireless LAN
#
CONFIG_WLAN_PRE80211=y
# CONFIG_STRIP is not set
# CONFIG_ARLAN is not set
# CONFIG_WAVELAN is not set
CONFIG_WLAN_80211=y
# CONFIG_LIBERTAS is not set
# CONFIG_LIBERTAS_THINFIRM is not set
# CONFIG_AIRO is not set
# CONFIG_ATMEL is not set
# CONFIG_AT76C50X_USB is not set
# CONFIG_PRISM54 is not set
# CONFIG_USB_ZD1201 is not set
# CONFIG_USB_NET_RNDIS_WLAN is not set
# CONFIG_RTL8180 is not set
# CONFIG_RTL8187 is not set
# CONFIG_ADM8211 is not set
# CONFIG_MAC80211_HWSIM is not set
# CONFIG_MWL8K is not set
# CONFIG_P54_COMMON is not set
# CONFIG_ATH5K is not set
# CONFIG_ATH9K is not set
# CONFIG_AR9170_USB is not set
# CONFIG_IPW2100 is not set
# CONFIG_IPW2200 is not set
CONFIG_IWLWIFI=y
CONFIG_IWLWIFI_LEDS=y
CONFIG_IWLWIFI_RFKILL=y
CONFIG_IWLWIFI_SPECTRUM_MEASUREMENT=y
CONFIG_IWLWIFI_DEBUG=y
CONFIG_IWLWIFI_DEBUGFS=y
CONFIG_IWLAGN=y
CONFIG_IWL4965=y
CONFIG_IWL5000=y
# CONFIG_IWL3945 is not set
# CONFIG_HOSTAP is not set
# CONFIG_B43 is not set
# CONFIG_B43LEGACY is not set
# CONFIG_ZD1211RW is not set
# CONFIG_RT2X00 is not set
# CONFIG_HERMES is not set
#
# Enable WiMAX (Networking options) to see the WiMAX drivers
#
#
# USB Network Adapters
#
# CONFIG_USB_CATC is not set
# CONFIG_USB_KAWETH is not set
# CONFIG_USB_PEGASUS is not set
# CONFIG_USB_RTL8150 is not set
# CONFIG_USB_USBNET is not set
# CONFIG_USB_HSO is not set
CONFIG_WAN=y
# CONFIG_HDLC is not set
# CONFIG_DLCI is not set
# CONFIG_SBNI is not set
CONFIG_ATM_DRIVERS=y
# CONFIG_ATM_DUMMY is not set
# CONFIG_ATM_TCP is not set
# CONFIG_ATM_LANAI is not set
# CONFIG_ATM_ENI is not set
# CONFIG_ATM_FIRESTREAM is not set
# CONFIG_ATM_ZATM is not set
# CONFIG_ATM_NICSTAR is not set
# CONFIG_ATM_IDT77252 is not set
# CONFIG_ATM_AMBASSADOR is not set
# CONFIG_ATM_HORIZON is not set
# CONFIG_ATM_IA is not set
# CONFIG_ATM_FORE200E is not set
# CONFIG_ATM_HE is not set
# CONFIG_ATM_SOLOS is not set
CONFIG_FDDI=y
# CONFIG_DEFXX is not set
# CONFIG_SKFP is not set
CONFIG_HIPPI=y
# CONFIG_ROADRUNNER is not set
# CONFIG_PLIP is not set
CONFIG_PPP=y
CONFIG_PPP_MULTILINK=y
CONFIG_PPP_FILTER=y
CONFIG_PPP_ASYNC=y
# CONFIG_PPP_SYNC_TTY is not set
CONFIG_PPP_DEFLATE=y
CONFIG_PPP_BSDCOMP=y
# CONFIG_PPP_MPPE is not set
# CONFIG_PPPOE is not set
# CONFIG_PPPOATM is not set
# CONFIG_PPPOL2TP is not set
# CONFIG_SLIP is not set
CONFIG_SLHC=y
CONFIG_NET_FC=y
# CONFIG_NETCONSOLE is not set
# CONFIG_NETPOLL is not set
# CONFIG_NET_POLL_CONTROLLER is not set
CONFIG_ISDN=y
# CONFIG_MISDN is not set
# CONFIG_ISDN_I4L is not set
# CONFIG_ISDN_CAPI is not set
# CONFIG_PHONE is not set
#
# Input device support
#
CONFIG_INPUT=y
CONFIG_INPUT_FF_MEMLESS=m
# CONFIG_INPUT_POLLDEV is not set
#
# Userland interfaces
#
CONFIG_INPUT_MOUSEDEV=y
CONFIG_INPUT_MOUSEDEV_PSAUX=y
CONFIG_INPUT_MOUSEDEV_SCREEN_X=1024
CONFIG_INPUT_MOUSEDEV_SCREEN_Y=768
# CONFIG_INPUT_JOYDEV is not set
CONFIG_INPUT_EVDEV=y
# CONFIG_INPUT_EVBUG is not set
#
# Input Device Drivers
#
CONFIG_INPUT_KEYBOARD=y
CONFIG_KEYBOARD_ATKBD=y
# CONFIG_KEYBOARD_SUNKBD is not set
# CONFIG_KEYBOARD_LKKBD is not set
# CONFIG_KEYBOARD_XTKBD is not set
# CONFIG_KEYBOARD_NEWTON is not set
# CONFIG_KEYBOARD_STOWAWAY is not set
# CONFIG_KEYBOARD_GPIO is not set
CONFIG_INPUT_MOUSE=y
CONFIG_MOUSE_PS2=y
CONFIG_MOUSE_PS2_ALPS=y
CONFIG_MOUSE_PS2_LOGIPS2PP=y
CONFIG_MOUSE_PS2_SYNAPTICS=y
CONFIG_MOUSE_PS2_LIFEBOOK=y
CONFIG_MOUSE_PS2_TRACKPOINT=y
# CONFIG_MOUSE_PS2_ELANTECH is not set
# CONFIG_MOUSE_PS2_TOUCHKIT is not set
# CONFIG_MOUSE_SERIAL is not set
# CONFIG_MOUSE_APPLETOUCH is not set
# CONFIG_MOUSE_BCM5974 is not set
# CONFIG_MOUSE_INPORT is not set
# CONFIG_MOUSE_LOGIBM is not set
# CONFIG_MOUSE_PC110PAD is not set
# CONFIG_MOUSE_VSXXXAA is not set
# CONFIG_MOUSE_GPIO is not set
CONFIG_INPUT_JOYSTICK=y
# CONFIG_JOYSTICK_ANALOG is not set
# CONFIG_JOYSTICK_A3D is not set
# CONFIG_JOYSTICK_ADI is not set
# CONFIG_JOYSTICK_COBRA is not set
# CONFIG_JOYSTICK_GF2K is not set
# CONFIG_JOYSTICK_GRIP is not set
# CONFIG_JOYSTICK_GRIP_MP is not set
# CONFIG_JOYSTICK_GUILLEMOT is not set
# CONFIG_JOYSTICK_INTERACT is not set
# CONFIG_JOYSTICK_SIDEWINDER is not set
# CONFIG_JOYSTICK_TMDC is not set
# CONFIG_JOYSTICK_IFORCE is not set
# CONFIG_JOYSTICK_WARRIOR is not set
# CONFIG_JOYSTICK_MAGELLAN is not set
# CONFIG_JOYSTICK_SPACEORB is not set
# CONFIG_JOYSTICK_SPACEBALL is not set
# CONFIG_JOYSTICK_STINGER is not set
# CONFIG_JOYSTICK_TWIDJOY is not set
# CONFIG_JOYSTICK_ZHENHUA is not set
# CONFIG_JOYSTICK_DB9 is not set
# CONFIG_JOYSTICK_GAMECON is not set
# CONFIG_JOYSTICK_TURBOGRAFX is not set
# CONFIG_JOYSTICK_JOYDUMP is not set
# CONFIG_JOYSTICK_XPAD is not set
# CONFIG_JOYSTICK_WALKERA0701 is not set
CONFIG_INPUT_TABLET=y
# CONFIG_TABLET_USB_ACECAD is not set
# CONFIG_TABLET_USB_AIPTEK is not set
# CONFIG_TABLET_USB_GTCO is not set
# CONFIG_TABLET_USB_KBTAB is not set
# CONFIG_TABLET_USB_WACOM is not set
CONFIG_INPUT_TOUCHSCREEN=y
# CONFIG_TOUCHSCREEN_ADS7846 is not set
# CONFIG_TOUCHSCREEN_AD7877 is not set
# CONFIG_TOUCHSCREEN_AD7879_I2C is not set
# CONFIG_TOUCHSCREEN_AD7879_SPI is not set
# CONFIG_TOUCHSCREEN_AD7879 is not set
# CONFIG_TOUCHSCREEN_FUJITSU is not set
# CONFIG_TOUCHSCREEN_GUNZE is not set
# CONFIG_TOUCHSCREEN_ELO is not set
# CONFIG_TOUCHSCREEN_WACOM_W8001 is not set
# CONFIG_TOUCHSCREEN_MTOUCH is not set
# CONFIG_TOUCHSCREEN_INEXIO is not set
# CONFIG_TOUCHSCREEN_MK712 is not set
# CONFIG_TOUCHSCREEN_HTCPEN is not set
# CONFIG_TOUCHSCREEN_PENMOUNT is not set
# CONFIG_TOUCHSCREEN_TOUCHRIGHT is not set
# CONFIG_TOUCHSCREEN_TOUCHWIN is not set
# CONFIG_TOUCHSCREEN_USB_COMPOSITE is not set
# CONFIG_TOUCHSCREEN_TOUCHIT213 is not set
# CONFIG_TOUCHSCREEN_TSC2007 is not set
CONFIG_INPUT_MISC=y
# CONFIG_INPUT_PCSPKR is not set
# CONFIG_INPUT_APANEL is not set
# CONFIG_INPUT_WISTRON_BTNS is not set
# CONFIG_INPUT_ATLAS_BTNS is not set
# CONFIG_INPUT_ATI_REMOTE is not set
# CONFIG_INPUT_ATI_REMOTE2 is not set
# CONFIG_INPUT_KEYSPAN_REMOTE is not set
# CONFIG_INPUT_POWERMATE is not set
# CONFIG_INPUT_YEALINK is not set
# CONFIG_INPUT_CM109 is not set
CONFIG_INPUT_UINPUT=y
# CONFIG_INPUT_GPIO_ROTARY_ENCODER is not set
#
# Hardware I/O ports
#
CONFIG_SERIO=y
CONFIG_SERIO_I8042=y
# CONFIG_SERIO_SERPORT is not set
# CONFIG_SERIO_CT82C710 is not set
# CONFIG_SERIO_PARKBD is not set
# CONFIG_SERIO_PCIPS2 is not set
CONFIG_SERIO_LIBPS2=y
CONFIG_SERIO_RAW=y
# CONFIG_GAMEPORT is not set
#
# Character devices
#
CONFIG_VT=y
CONFIG_CONSOLE_TRANSLATIONS=y
CONFIG_VT_CONSOLE=y
CONFIG_HW_CONSOLE=y
CONFIG_VT_HW_CONSOLE_BINDING=y
CONFIG_DEVKMEM=y
CONFIG_SERIAL_NONSTANDARD=y
# CONFIG_COMPUTONE is not set
# CONFIG_ROCKETPORT is not set
# CONFIG_CYCLADES is not set
# CONFIG_DIGIEPCA is not set
# CONFIG_MOXA_INTELLIO is not set
# CONFIG_MOXA_SMARTIO is not set
# CONFIG_ISI is not set
# CONFIG_SYNCLINK is not set
# CONFIG_SYNCLINKMP is not set
# CONFIG_SYNCLINK_GT is not set
# CONFIG_N_HDLC is not set
# CONFIG_RISCOM8 is not set
# CONFIG_SPECIALIX is not set
# CONFIG_SX is not set
# CONFIG_RIO is not set
CONFIG_STALDRV=y
# CONFIG_STALLION is not set
# CONFIG_ISTALLION is not set
# CONFIG_NOZOMI is not set
#
# Serial drivers
#
CONFIG_SERIAL_8250=y
CONFIG_SERIAL_8250_CONSOLE=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_SERIAL_8250_PCI=y
CONFIG_SERIAL_8250_PNP=y
CONFIG_SERIAL_8250_NR_UARTS=48
CONFIG_SERIAL_8250_RUNTIME_UARTS=4
CONFIG_SERIAL_8250_EXTENDED=y
CONFIG_SERIAL_8250_MANY_PORTS=y
# CONFIG_SERIAL_8250_FOURPORT is not set
# CONFIG_SERIAL_8250_ACCENT is not set
# CONFIG_SERIAL_8250_BOCA is not set
# CONFIG_SERIAL_8250_EXAR_ST16C554 is not set
# CONFIG_SERIAL_8250_HUB6 is not set
CONFIG_SERIAL_8250_SHARE_IRQ=y
# CONFIG_SERIAL_8250_DETECT_IRQ is not set
CONFIG_SERIAL_8250_RSA=y
# CONFIG_SERIAL_8250_MCA is not set
#
# Non-8250 serial port support
#
# CONFIG_SERIAL_MAX3100 is not set
CONFIG_SERIAL_CORE=y
CONFIG_SERIAL_CORE_CONSOLE=y
# CONFIG_SERIAL_JSM is not set
CONFIG_UNIX98_PTYS=y
# CONFIG_DEVPTS_MULTIPLE_INSTANCES is not set
CONFIG_LEGACY_PTYS=y
CONFIG_LEGACY_PTY_COUNT=256
CONFIG_PRINTER=y
# CONFIG_LP_CONSOLE is not set
# CONFIG_PPDEV is not set
# CONFIG_IPMI_HANDLER is not set
CONFIG_HW_RANDOM=y
# CONFIG_HW_RANDOM_TIMERIOMEM is not set
# CONFIG_HW_RANDOM_INTEL is not set
# CONFIG_HW_RANDOM_AMD is not set
# CONFIG_HW_RANDOM_GEODE is not set
# CONFIG_HW_RANDOM_VIA is not set
CONFIG_NVRAM=y
CONFIG_RTC=y
# CONFIG_DTLK is not set
# CONFIG_R3964 is not set
# CONFIG_APPLICOM is not set
# CONFIG_SONYPI is not set
# CONFIG_MWAVE is not set
# CONFIG_PC8736x_GPIO is not set
# CONFIG_NSC_GPIO is not set
# CONFIG_CS5535_GPIO is not set
# CONFIG_RAW_DRIVER is not set
CONFIG_HPET=y
CONFIG_HPET_MMAP=y
# CONFIG_HANGCHECK_TIMER is not set
# CONFIG_TCG_TPM is not set
# CONFIG_TELCLOCK is not set
CONFIG_DEVPORT=y
CONFIG_I2C=y
CONFIG_I2C_BOARDINFO=y
# CONFIG_I2C_CHARDEV is not set
# CONFIG_I2C_HELPER_AUTO is not set
#
# I2C Algorithms
#
CONFIG_I2C_ALGOBIT=y
# CONFIG_I2C_ALGOPCF is not set
# CONFIG_I2C_ALGOPCA is not set
#
# I2C Hardware Bus support
#
#
# PC SMBus host controller drivers
#
# CONFIG_I2C_ALI1535 is not set
# CONFIG_I2C_ALI1563 is not set
# CONFIG_I2C_ALI15X3 is not set
# CONFIG_I2C_AMD756 is not set
# CONFIG_I2C_AMD8111 is not set
# CONFIG_I2C_I801 is not set
CONFIG_I2C_ISCH=y
# CONFIG_I2C_PIIX4 is not set
# CONFIG_I2C_NFORCE2 is not set
# CONFIG_I2C_SIS5595 is not set
# CONFIG_I2C_SIS630 is not set
# CONFIG_I2C_SIS96X is not set
# CONFIG_I2C_VIA is not set
# CONFIG_I2C_VIAPRO is not set
#
# I2C system bus drivers (mostly embedded / system-on-chip)
#
# CONFIG_I2C_GPIO is not set
# CONFIG_I2C_OCORES is not set
# CONFIG_I2C_SIMTEC is not set
#
# External I2C/SMBus adapter drivers
#
# CONFIG_I2C_PARPORT is not set
# CONFIG_I2C_PARPORT_LIGHT is not set
# CONFIG_I2C_TAOS_EVM is not set
# CONFIG_I2C_TINY_USB is not set
#
# Graphics adapter I2C/DDC channel drivers
#
# CONFIG_I2C_VOODOO3 is not set
#
# Other I2C/SMBus bus drivers
#
# CONFIG_I2C_PCA_ISA is not set
# CONFIG_I2C_PCA_PLATFORM is not set
# CONFIG_I2C_STUB is not set
# CONFIG_SCx200_ACB is not set
#
# Miscellaneous I2C Chip support
#
# CONFIG_DS1682 is not set
# CONFIG_SENSORS_PCF8574 is not set
# CONFIG_PCF8575 is not set
# CONFIG_SENSORS_PCA9539 is not set
# CONFIG_SENSORS_MAX6875 is not set
# CONFIG_SENSORS_TSL2550 is not set
# CONFIG_I2C_DEBUG_CORE is not set
# CONFIG_I2C_DEBUG_ALGO is not set
# CONFIG_I2C_DEBUG_BUS is not set
# CONFIG_I2C_DEBUG_CHIP is not set
CONFIG_SPI=y
# CONFIG_SPI_DEBUG is not set
CONFIG_SPI_MASTER=y
#
# SPI Master Controller Drivers
#
# CONFIG_SPI_BITBANG is not set
# CONFIG_SPI_BUTTERFLY is not set
# CONFIG_SPI_GPIO is not set
# CONFIG_SPI_LM70_LLP is not set
#
# SPI Protocol Masters
#
# CONFIG_SPI_SPIDEV is not set
# CONFIG_SPI_TLE62X0 is not set
CONFIG_ARCH_WANT_OPTIONAL_GPIOLIB=y
CONFIG_GPIOLIB=y
# CONFIG_DEBUG_GPIO is not set
# CONFIG_GPIO_SYSFS is not set
#
# Memory mapped GPIO expanders:
#
#
# I2C GPIO expanders:
#
# CONFIG_GPIO_MAX732X is not set
# CONFIG_GPIO_PCA953X is not set
# CONFIG_GPIO_PCF857X is not set
#
# PCI GPIO expanders:
#
# CONFIG_GPIO_BT8XX is not set
#
# SPI GPIO expanders:
#
# CONFIG_GPIO_MAX7301 is not set
# CONFIG_GPIO_MCP23S08 is not set
# CONFIG_W1 is not set
CONFIG_POWER_SUPPLY=y
CONFIG_POWER_SUPPLY_DEBUG=y
# CONFIG_PDA_POWER is not set
# CONFIG_BATTERY_DS2760 is not set
# CONFIG_BATTERY_BQ27x00 is not set
CONFIG_HWMON=y
# CONFIG_HWMON_VID is not set
# CONFIG_SENSORS_ABITUGURU is not set
# CONFIG_SENSORS_ABITUGURU3 is not set
# CONFIG_SENSORS_AD7414 is not set
# CONFIG_SENSORS_AD7418 is not set
# CONFIG_SENSORS_ADCXX is not set
# CONFIG_SENSORS_ADM1021 is not set
# CONFIG_SENSORS_ADM1025 is not set
# CONFIG_SENSORS_ADM1026 is not set
# CONFIG_SENSORS_ADM1029 is not set
# CONFIG_SENSORS_ADM1031 is not set
# CONFIG_SENSORS_ADM9240 is not set
# CONFIG_SENSORS_ADT7462 is not set
# CONFIG_SENSORS_ADT7470 is not set
# CONFIG_SENSORS_ADT7473 is not set
# CONFIG_SENSORS_ADT7475 is not set
# CONFIG_SENSORS_K8TEMP is not set
# CONFIG_SENSORS_ASB100 is not set
# CONFIG_SENSORS_ATK0110 is not set
# CONFIG_SENSORS_ATXP1 is not set
# CONFIG_SENSORS_DS1621 is not set
# CONFIG_SENSORS_I5K_AMB is not set
# CONFIG_SENSORS_F71805F is not set
# CONFIG_SENSORS_F71882FG is not set
# CONFIG_SENSORS_F75375S is not set
# CONFIG_SENSORS_FSCHER is not set
# CONFIG_SENSORS_FSCPOS is not set
# CONFIG_SENSORS_FSCHMD is not set
# CONFIG_SENSORS_G760A is not set
# CONFIG_SENSORS_GL518SM is not set
# CONFIG_SENSORS_GL520SM is not set
# CONFIG_SENSORS_CORETEMP is not set
# CONFIG_SENSORS_IT87 is not set
# CONFIG_SENSORS_LM63 is not set
# CONFIG_SENSORS_LM70 is not set
# CONFIG_SENSORS_LM75 is not set
# CONFIG_SENSORS_LM77 is not set
# CONFIG_SENSORS_LM78 is not set
# CONFIG_SENSORS_LM80 is not set
# CONFIG_SENSORS_LM83 is not set
# CONFIG_SENSORS_LM85 is not set
# CONFIG_SENSORS_LM87 is not set
# CONFIG_SENSORS_LM90 is not set
# CONFIG_SENSORS_LM92 is not set
# CONFIG_SENSORS_LM93 is not set
# CONFIG_SENSORS_LTC4215 is not set
# CONFIG_SENSORS_LTC4245 is not set
# CONFIG_SENSORS_LM95241 is not set
# CONFIG_SENSORS_MAX1111 is not set
# CONFIG_SENSORS_MAX1619 is not set
# CONFIG_SENSORS_MAX6650 is not set
# CONFIG_SENSORS_PC87360 is not set
# CONFIG_SENSORS_PC87427 is not set
# CONFIG_SENSORS_PCF8591 is not set
# CONFIG_SENSORS_SHT15 is not set
# CONFIG_SENSORS_SIS5595 is not set
# CONFIG_SENSORS_DME1737 is not set
# CONFIG_SENSORS_SMSC47M1 is not set
# CONFIG_SENSORS_SMSC47M192 is not set
# CONFIG_SENSORS_SMSC47B397 is not set
# CONFIG_SENSORS_ADS7828 is not set
# CONFIG_SENSORS_THMC50 is not set
# CONFIG_SENSORS_VIA686A is not set
# CONFIG_SENSORS_VT1211 is not set
# CONFIG_SENSORS_VT8231 is not set
# CONFIG_SENSORS_W83781D is not set
# CONFIG_SENSORS_W83791D is not set
# CONFIG_SENSORS_W83792D is not set
# CONFIG_SENSORS_W83793 is not set
# CONFIG_SENSORS_W83L785TS is not set
# CONFIG_SENSORS_W83L786NG is not set
# CONFIG_SENSORS_W83627HF is not set
# CONFIG_SENSORS_W83627EHF is not set
# CONFIG_SENSORS_HDAPS is not set
# CONFIG_SENSORS_LIS3LV02D is not set
# CONFIG_SENSORS_APPLESMC is not set
# CONFIG_HWMON_DEBUG_CHIP is not set
CONFIG_THERMAL=y
# CONFIG_THERMAL_HWMON is not set
CONFIG_WATCHDOG=y
# CONFIG_WATCHDOG_NOWAYOUT is not set
#
# Watchdog Device Drivers
#
# CONFIG_SOFT_WATCHDOG is not set
# CONFIG_ACQUIRE_WDT is not set
# CONFIG_ADVANTECH_WDT is not set
# CONFIG_ALIM1535_WDT is not set
# CONFIG_ALIM7101_WDT is not set
# CONFIG_SC520_WDT is not set
# CONFIG_EUROTECH_WDT is not set
# CONFIG_IB700_WDT is not set
# CONFIG_IBMASR is not set
# CONFIG_WAFER_WDT is not set
# CONFIG_I6300ESB_WDT is not set
CONFIG_ITCO_WDT=y
CONFIG_ITCO_VENDOR_SUPPORT=y
# CONFIG_IT8712F_WDT is not set
# CONFIG_IT87_WDT is not set
# CONFIG_HP_WATCHDOG is not set
# CONFIG_SC1200_WDT is not set
# CONFIG_PC87413_WDT is not set
# CONFIG_60XX_WDT is not set
# CONFIG_SBC8360_WDT is not set
# CONFIG_SBC7240_WDT is not set
# CONFIG_CPU5_WDT is not set
# CONFIG_SMSC_SCH311X_WDT is not set
# CONFIG_SMSC37B787_WDT is not set
# CONFIG_W83627HF_WDT is not set
# CONFIG_W83697HF_WDT is not set
# CONFIG_W83697UG_WDT is not set
# CONFIG_W83877F_WDT is not set
# CONFIG_W83977F_WDT is not set
# CONFIG_MACHZ_WDT is not set
# CONFIG_SBC_EPX_C3_WATCHDOG is not set
#
# ISA-based Watchdog Cards
#
# CONFIG_PCWATCHDOG is not set
# CONFIG_MIXCOMWD is not set
# CONFIG_WDT is not set
#
# PCI-based Watchdog Cards
#
# CONFIG_PCIPCWATCHDOG is not set
# CONFIG_WDTPCI is not set
#
# USB-based Watchdog Cards
#
# CONFIG_USBPCWATCHDOG is not set
CONFIG_SSB_POSSIBLE=y
#
# Sonics Silicon Backplane
#
# CONFIG_SSB is not set
#
# Multifunction device drivers
#
# CONFIG_MFD_CORE is not set
# CONFIG_MFD_SM501 is not set
# CONFIG_HTC_PASIC3 is not set
# CONFIG_TPS65010 is not set
# CONFIG_TWL4030_CORE is not set
# CONFIG_MFD_TMIO is not set
# CONFIG_PMIC_DA903X is not set
# CONFIG_MFD_WM8400 is not set
# CONFIG_MFD_WM8350_I2C is not set
# CONFIG_MFD_PCF50633 is not set
# CONFIG_REGULATOR is not set
#
# Multimedia devices
#
#
# Multimedia core support
#
# CONFIG_VIDEO_DEV is not set
# CONFIG_DVB_CORE is not set
# CONFIG_VIDEO_MEDIA is not set
#
# Multimedia drivers
#
CONFIG_DAB=y
# CONFIG_USB_DABUSB is not set
#
# Graphics support
#
CONFIG_AGP=y
# CONFIG_AGP_ALI is not set
# CONFIG_AGP_ATI is not set
# CONFIG_AGP_AMD is not set
# CONFIG_AGP_AMD64 is not set
CONFIG_AGP_INTEL=y
# CONFIG_AGP_NVIDIA is not set
# CONFIG_AGP_SIS is not set
# CONFIG_AGP_SWORKS is not set
# CONFIG_AGP_VIA is not set
# CONFIG_AGP_EFFICEON is not set
CONFIG_DRM=y
# CONFIG_DRM_TDFX is not set
# CONFIG_DRM_R128 is not set
# CONFIG_DRM_RADEON is not set
# CONFIG_DRM_I810 is not set
# CONFIG_DRM_I830 is not set
CONFIG_DRM_I915=y
# CONFIG_DRM_I915_KMS is not set
# CONFIG_DRM_MGA is not set
# CONFIG_DRM_SIS is not set
# CONFIG_DRM_VIA is not set
# CONFIG_DRM_SAVAGE is not set
# CONFIG_VGASTATE is not set
CONFIG_VIDEO_OUTPUT_CONTROL=y
CONFIG_FB=y
CONFIG_FIRMWARE_EDID=y
# CONFIG_FB_DDC is not set
# CONFIG_FB_BOOT_VESA_SUPPORT is not set
CONFIG_FB_CFB_FILLRECT=y
CONFIG_FB_CFB_COPYAREA=y
CONFIG_FB_CFB_IMAGEBLIT=y
# CONFIG_FB_CFB_REV_PIXELS_IN_BYTE is not set
# CONFIG_FB_SYS_FILLRECT is not set
# CONFIG_FB_SYS_COPYAREA is not set
# CONFIG_FB_SYS_IMAGEBLIT is not set
# CONFIG_FB_FOREIGN_ENDIAN is not set
# CONFIG_FB_SYS_FOPS is not set
# CONFIG_FB_SVGALIB is not set
# CONFIG_FB_MACMODES is not set
# CONFIG_FB_BACKLIGHT is not set
CONFIG_FB_MODE_HELPERS=y
CONFIG_FB_TILEBLITTING=y
#
# Frame buffer hardware drivers
#
# CONFIG_FB_CIRRUS is not set
# CONFIG_FB_PM2 is not set
# CONFIG_FB_CYBER2000 is not set
# CONFIG_FB_ARC is not set
# CONFIG_FB_ASILIANT is not set
# CONFIG_FB_IMSTT is not set
# CONFIG_FB_VGA16 is not set
# CONFIG_FB_VESA is not set
CONFIG_FB_EFI=y
# CONFIG_FB_N411 is not set
# CONFIG_FB_HGA is not set
# CONFIG_FB_S1D13XXX is not set
# CONFIG_FB_NVIDIA is not set
# CONFIG_FB_RIVA is not set
# CONFIG_FB_I810 is not set
# CONFIG_FB_LE80578 is not set
# CONFIG_FB_INTEL is not set
# CONFIG_FB_MATROX is not set
# CONFIG_FB_RADEON is not set
# CONFIG_FB_ATY128 is not set
# CONFIG_FB_ATY is not set
# CONFIG_FB_S3 is not set
# CONFIG_FB_SAVAGE is not set
# CONFIG_FB_SIS is not set
# CONFIG_FB_VIA is not set
# CONFIG_FB_NEOMAGIC is not set
# CONFIG_FB_KYRO is not set
# CONFIG_FB_3DFX is not set
# CONFIG_FB_VOODOO1 is not set
# CONFIG_FB_VT8623 is not set
# CONFIG_FB_TRIDENT is not set
# CONFIG_FB_ARK is not set
# CONFIG_FB_PM3 is not set
# CONFIG_FB_CARMINE is not set
CONFIG_FB_GEODE=y
# CONFIG_FB_GEODE_LX is not set
# CONFIG_FB_GEODE_GX is not set
# CONFIG_FB_GEODE_GX1 is not set
# CONFIG_FB_VIRTUAL is not set
# CONFIG_FB_METRONOME is not set
# CONFIG_FB_MB862XX is not set
# CONFIG_FB_BROADSHEET is not set
CONFIG_BACKLIGHT_LCD_SUPPORT=y
# CONFIG_LCD_CLASS_DEVICE is not set
CONFIG_BACKLIGHT_CLASS_DEVICE=y
# CONFIG_BACKLIGHT_GENERIC is not set
# CONFIG_BACKLIGHT_PROGEAR is not set
# CONFIG_BACKLIGHT_MBP_NVIDIA is not set
# CONFIG_BACKLIGHT_SAHARA is not set
#
# Display device support
#
# CONFIG_DISPLAY_SUPPORT is not set
#
# Console display driver support
#
CONFIG_VGA_CONSOLE=y
CONFIG_VGACON_SOFT_SCROLLBACK=y
CONFIG_VGACON_SOFT_SCROLLBACK_SIZE=256
# CONFIG_MDA_CONSOLE is not set
CONFIG_DUMMY_CONSOLE=y
CONFIG_FRAMEBUFFER_CONSOLE=y
# CONFIG_FRAMEBUFFER_CONSOLE_DETECT_PRIMARY is not set
# CONFIG_FRAMEBUFFER_CONSOLE_ROTATION is not set
# CONFIG_FONTS is not set
CONFIG_FONT_8x8=y
CONFIG_FONT_8x16=y
# CONFIG_LOGO is not set
CONFIG_SOUND=y
CONFIG_SOUND_OSS_CORE=y
CONFIG_SND=y
CONFIG_SND_TIMER=y
CONFIG_SND_PCM=y
CONFIG_SND_HWDEP=y
CONFIG_SND_RAWMIDI=y
CONFIG_SND_JACK=y
CONFIG_SND_SEQUENCER=y
CONFIG_SND_SEQ_DUMMY=y
CONFIG_SND_OSSEMUL=y
CONFIG_SND_MIXER_OSS=y
CONFIG_SND_PCM_OSS=y
CONFIG_SND_PCM_OSS_PLUGINS=y
CONFIG_SND_SEQUENCER_OSS=y
# CONFIG_SND_HRTIMER is not set
# CONFIG_SND_RTCTIMER is not set
CONFIG_SND_DYNAMIC_MINORS=y
CONFIG_SND_SUPPORT_OLD_API=y
CONFIG_SND_VERBOSE_PROCFS=y
# CONFIG_SND_VERBOSE_PRINTK is not set
# CONFIG_SND_DEBUG is not set
CONFIG_SND_VMASTER=y
CONFIG_SND_DRIVERS=y
# CONFIG_SND_PCSP is not set
# CONFIG_SND_DUMMY is not set
CONFIG_SND_VIRMIDI=y
# CONFIG_SND_MTPAV is not set
# CONFIG_SND_MTS64 is not set
# CONFIG_SND_SERIAL_U16550 is not set
# CONFIG_SND_MPU401 is not set
# CONFIG_SND_PORTMAN2X4 is not set
CONFIG_SND_ISA=y
# CONFIG_SND_ADLIB is not set
# CONFIG_SND_AD1816A is not set
# CONFIG_SND_AD1848 is not set
# CONFIG_SND_ALS100 is not set
# CONFIG_SND_AZT2320 is not set
# CONFIG_SND_CMI8330 is not set
# CONFIG_SND_CS4231 is not set
# CONFIG_SND_CS4236 is not set
# CONFIG_SND_DT019X is not set
# CONFIG_SND_ES968 is not set
# CONFIG_SND_ES1688 is not set
# CONFIG_SND_ES18XX is not set
# CONFIG_SND_SC6000 is not set
# CONFIG_SND_GUSCLASSIC is not set
# CONFIG_SND_GUSEXTREME is not set
# CONFIG_SND_GUSMAX is not set
# CONFIG_SND_INTERWAVE is not set
# CONFIG_SND_INTERWAVE_STB is not set
# CONFIG_SND_OPL3SA2 is not set
# CONFIG_SND_OPTI92X_AD1848 is not set
# CONFIG_SND_OPTI92X_CS4231 is not set
# CONFIG_SND_OPTI93X is not set
# CONFIG_SND_MIRO is not set
# CONFIG_SND_SB8 is not set
# CONFIG_SND_SB16 is not set
# CONFIG_SND_SBAWE is not set
# CONFIG_SND_SGALAXY is not set
# CONFIG_SND_SSCAPE is not set
# CONFIG_SND_WAVEFRONT is not set
# CONFIG_SND_MSND_PINNACLE is not set
# CONFIG_SND_MSND_CLASSIC is not set
CONFIG_SND_PCI=y
# CONFIG_SND_AD1889 is not set
# CONFIG_SND_ALS300 is not set
# CONFIG_SND_ALS4000 is not set
# CONFIG_SND_ALI5451 is not set
# CONFIG_SND_ATIIXP is not set
# CONFIG_SND_ATIIXP_MODEM is not set
# CONFIG_SND_AU8810 is not set
# CONFIG_SND_AU8820 is not set
# CONFIG_SND_AU8830 is not set
# CONFIG_SND_AW2 is not set
# CONFIG_SND_AZT3328 is not set
# CONFIG_SND_BT87X is not set
# CONFIG_SND_CA0106 is not set
# CONFIG_SND_CMIPCI is not set
# CONFIG_SND_OXYGEN is not set
# CONFIG_SND_CS4281 is not set
# CONFIG_SND_CS46XX is not set
# CONFIG_SND_CS5530 is not set
# CONFIG_SND_CS5535AUDIO is not set
# CONFIG_SND_DARLA20 is not set
# CONFIG_SND_GINA20 is not set
# CONFIG_SND_LAYLA20 is not set
# CONFIG_SND_DARLA24 is not set
# CONFIG_SND_GINA24 is not set
# CONFIG_SND_LAYLA24 is not set
# CONFIG_SND_MONA is not set
# CONFIG_SND_MIA is not set
# CONFIG_SND_ECHO3G is not set
# CONFIG_SND_INDIGO is not set
# CONFIG_SND_INDIGOIO is not set
# CONFIG_SND_INDIGODJ is not set
# CONFIG_SND_INDIGOIOX is not set
# CONFIG_SND_INDIGODJX is not set
# CONFIG_SND_EMU10K1 is not set
# CONFIG_SND_EMU10K1X is not set
# CONFIG_SND_ENS1370 is not set
# CONFIG_SND_ENS1371 is not set
# CONFIG_SND_ES1938 is not set
# CONFIG_SND_ES1968 is not set
# CONFIG_SND_FM801 is not set
CONFIG_SND_HDA_INTEL=y
CONFIG_SND_HDA_HWDEP=y
CONFIG_SND_HDA_RECONFIG=y
# CONFIG_SND_HDA_INPUT_BEEP is not set
CONFIG_SND_HDA_CODEC_REALTEK=y
CONFIG_SND_HDA_CODEC_ANALOG=y
CONFIG_SND_HDA_CODEC_SIGMATEL=y
CONFIG_SND_HDA_CODEC_VIA=y
CONFIG_SND_HDA_CODEC_ATIHDMI=y
CONFIG_SND_HDA_CODEC_NVHDMI=y
CONFIG_SND_HDA_CODEC_INTELHDMI=y
CONFIG_SND_HDA_ELD=y
CONFIG_SND_HDA_CODEC_CONEXANT=y
CONFIG_SND_HDA_CODEC_CMEDIA=y
CONFIG_SND_HDA_CODEC_SI3054=y
CONFIG_SND_HDA_GENERIC=y
CONFIG_SND_HDA_POWER_SAVE=y
CONFIG_SND_HDA_POWER_SAVE_DEFAULT=10
# CONFIG_SND_HDSP is not set
# CONFIG_SND_HDSPM is not set
# CONFIG_SND_HIFIER is not set
# CONFIG_SND_ICE1712 is not set
# CONFIG_SND_ICE1724 is not set
# CONFIG_SND_INTEL8X0 is not set
# CONFIG_SND_INTEL8X0M is not set
# CONFIG_SND_KORG1212 is not set
# CONFIG_SND_MAESTRO3 is not set
# CONFIG_SND_MIXART is not set
# CONFIG_SND_NM256 is not set
# CONFIG_SND_PCXHR is not set
# CONFIG_SND_RIPTIDE is not set
# CONFIG_SND_RME32 is not set
# CONFIG_SND_RME96 is not set
# CONFIG_SND_RME9652 is not set
# CONFIG_SND_SIS7019 is not set
# CONFIG_SND_SONICVIBES is not set
# CONFIG_SND_TRIDENT is not set
# CONFIG_SND_VIA82XX is not set
# CONFIG_SND_VIA82XX_MODEM is not set
# CONFIG_SND_VIRTUOSO is not set
# CONFIG_SND_VX222 is not set
# CONFIG_SND_YMFPCI is not set
CONFIG_SND_SPI=y
CONFIG_SND_USB=y
# CONFIG_SND_USB_AUDIO is not set
# CONFIG_SND_USB_USX2Y is not set
# CONFIG_SND_USB_CAIAQ is not set
# CONFIG_SND_USB_US122L is not set
# CONFIG_SND_SOC is not set
# CONFIG_SOUND_PRIME is not set
CONFIG_HID_SUPPORT=y
CONFIG_HID=y
CONFIG_HID_DEBUG=y
# CONFIG_HIDRAW is not set
#
# USB Input Devices
#
CONFIG_USB_HID=y
# CONFIG_HID_PID is not set
# CONFIG_USB_HIDDEV is not set
#
# Special HID drivers
#
CONFIG_HID_A4TECH=y
CONFIG_HID_APPLE=y
CONFIG_HID_BELKIN=y
CONFIG_HID_CHERRY=y
CONFIG_HID_CHICONY=y
CONFIG_HID_CYPRESS=y
# CONFIG_DRAGONRISE_FF is not set
CONFIG_HID_EZKEY=y
CONFIG_HID_KYE=y
CONFIG_HID_GYRATION=y
CONFIG_HID_KENSINGTON=y
CONFIG_HID_LOGITECH=y
# CONFIG_LOGITECH_FF is not set
# CONFIG_LOGIRUMBLEPAD2_FF is not set
CONFIG_HID_MICROSOFT=y
CONFIG_HID_MONTEREY=y
CONFIG_HID_NTRIG=y
CONFIG_HID_PANTHERLORD=y
# CONFIG_PANTHERLORD_FF is not set
CONFIG_HID_PETALYNX=y
CONFIG_HID_SAMSUNG=y
CONFIG_HID_SONY=y
CONFIG_HID_SUNPLUS=y
# CONFIG_GREENASIA_FF is not set
CONFIG_HID_TOPSEED=y
# CONFIG_THRUSTMASTER_FF is not set
# CONFIG_ZEROPLUS_FF is not set
CONFIG_USB_SUPPORT=y
CONFIG_USB_ARCH_HAS_HCD=y
CONFIG_USB_ARCH_HAS_OHCI=y
CONFIG_USB_ARCH_HAS_EHCI=y
CONFIG_USB=y
# CONFIG_USB_DEBUG is not set
CONFIG_USB_ANNOUNCE_NEW_DEVICES=y
#
# Miscellaneous USB options
#
CONFIG_USB_DEVICEFS=y
CONFIG_USB_DEVICE_CLASS=y
# CONFIG_USB_DYNAMIC_MINORS is not set
CONFIG_USB_SUSPEND=y
# CONFIG_USB_OTG is not set
CONFIG_USB_MON=y
# CONFIG_USB_WUSB is not set
# CONFIG_USB_WUSB_CBAF is not set
#
# USB Host Controller Drivers
#
# CONFIG_USB_C67X00_HCD is not set
CONFIG_USB_EHCI_HCD=y
CONFIG_USB_EHCI_ROOT_HUB_TT=y
CONFIG_USB_EHCI_TT_NEWSCHED=y
# CONFIG_USB_OXU210HP_HCD is not set
# CONFIG_USB_ISP116X_HCD is not set
# CONFIG_USB_ISP1760_HCD is not set
CONFIG_USB_OHCI_HCD=y
# CONFIG_USB_OHCI_BIG_ENDIAN_DESC is not set
# CONFIG_USB_OHCI_BIG_ENDIAN_MMIO is not set
CONFIG_USB_OHCI_LITTLE_ENDIAN=y
CONFIG_USB_UHCI_HCD=y
# CONFIG_USB_SL811_HCD is not set
# CONFIG_USB_R8A66597_HCD is not set
# CONFIG_USB_WHCI_HCD is not set
# CONFIG_USB_HWA_HCD is not set
#
# USB Device Class drivers
#
CONFIG_USB_ACM=y
CONFIG_USB_PRINTER=y
CONFIG_USB_WDM=y
# CONFIG_USB_TMC is not set
#
# NOTE: USB_STORAGE depends on SCSI but BLK_DEV_SD may
#
#
# also be needed; see USB_STORAGE Help for more info
#
CONFIG_USB_STORAGE=y
# CONFIG_USB_STORAGE_DEBUG is not set
# CONFIG_USB_STORAGE_DATAFAB is not set
# CONFIG_USB_STORAGE_FREECOM is not set
CONFIG_USB_STORAGE_ISD200=y
# CONFIG_USB_STORAGE_USBAT is not set
# CONFIG_USB_STORAGE_SDDR09 is not set
# CONFIG_USB_STORAGE_SDDR55 is not set
# CONFIG_USB_STORAGE_JUMPSHOT is not set
# CONFIG_USB_STORAGE_ALAUDA is not set
# CONFIG_USB_STORAGE_ONETOUCH is not set
# CONFIG_USB_STORAGE_KARMA is not set
# CONFIG_USB_STORAGE_CYPRESS_ATACB is not set
CONFIG_USB_LIBUSUAL=y
#
# USB Imaging devices
#
# CONFIG_USB_MDC800 is not set
# CONFIG_USB_MICROTEK is not set
#
# USB port drivers
#
# CONFIG_USB_USS720 is not set
CONFIG_USB_SERIAL=y
# CONFIG_USB_SERIAL_CONSOLE is not set
CONFIG_USB_EZUSB=y
CONFIG_USB_SERIAL_GENERIC=y
# CONFIG_USB_SERIAL_AIRCABLE is not set
# CONFIG_USB_SERIAL_ARK3116 is not set
# CONFIG_USB_SERIAL_BELKIN is not set
# CONFIG_USB_SERIAL_CH341 is not set
# CONFIG_USB_SERIAL_WHITEHEAT is not set
# CONFIG_USB_SERIAL_DIGI_ACCELEPORT is not set
# CONFIG_USB_SERIAL_CP210X is not set
# CONFIG_USB_SERIAL_CYPRESS_M8 is not set
# CONFIG_USB_SERIAL_EMPEG is not set
# CONFIG_USB_SERIAL_FTDI_SIO is not set
# CONFIG_USB_SERIAL_FUNSOFT is not set
# CONFIG_USB_SERIAL_VISOR is not set
# CONFIG_USB_SERIAL_IPAQ is not set
# CONFIG_USB_SERIAL_IR is not set
# CONFIG_USB_SERIAL_EDGEPORT is not set
# CONFIG_USB_SERIAL_EDGEPORT_TI is not set
# CONFIG_USB_SERIAL_GARMIN is not set
# CONFIG_USB_SERIAL_IPW is not set
# CONFIG_USB_SERIAL_IUU is not set
# CONFIG_USB_SERIAL_KEYSPAN_PDA is not set
# CONFIG_USB_SERIAL_KEYSPAN is not set
# CONFIG_USB_SERIAL_KLSI is not set
# CONFIG_USB_SERIAL_KOBIL_SCT is not set
# CONFIG_USB_SERIAL_MCT_U232 is not set
# CONFIG_USB_SERIAL_MOS7720 is not set
# CONFIG_USB_SERIAL_MOS7840 is not set
# CONFIG_USB_SERIAL_MOTOROLA is not set
# CONFIG_USB_SERIAL_NAVMAN is not set
# CONFIG_USB_SERIAL_PL2303 is not set
# CONFIG_USB_SERIAL_OTI6858 is not set
# CONFIG_USB_SERIAL_QUALCOMM is not set
# CONFIG_USB_SERIAL_SPCP8X5 is not set
# CONFIG_USB_SERIAL_HP4X is not set
# CONFIG_USB_SERIAL_SAFE is not set
# CONFIG_USB_SERIAL_SIEMENS_MPI is not set
CONFIG_USB_SERIAL_SIERRAWIRELESS=y
# CONFIG_USB_SERIAL_SYMBOL is not set
# CONFIG_USB_SERIAL_TI is not set
# CONFIG_USB_SERIAL_CYBERJACK is not set
# CONFIG_USB_SERIAL_XIRCOM is not set
# CONFIG_USB_SERIAL_OPTION is not set
# CONFIG_USB_SERIAL_OMNINET is not set
# CONFIG_USB_SERIAL_OPTICON is not set
# CONFIG_USB_SERIAL_DEBUG is not set
#
# USB Miscellaneous drivers
#
# CONFIG_USB_EMI62 is not set
# CONFIG_USB_EMI26 is not set
# CONFIG_USB_ADUTUX is not set
# CONFIG_USB_SEVSEG is not set
# CONFIG_USB_RIO500 is not set
# CONFIG_USB_LEGOTOWER is not set
# CONFIG_USB_LCD is not set
CONFIG_USB_BERRY_CHARGE=y
# CONFIG_USB_LED is not set
# CONFIG_USB_CYPRESS_CY7C63 is not set
# CONFIG_USB_CYTHERM is not set
# CONFIG_USB_IDMOUSE is not set
# CONFIG_USB_FTDI_ELAN is not set
# CONFIG_USB_APPLEDISPLAY is not set
# CONFIG_USB_SISUSBVGA is not set
# CONFIG_USB_LD is not set
# CONFIG_USB_TRANCEVIBRATOR is not set
# CONFIG_USB_IOWARRIOR is not set
# CONFIG_USB_TEST is not set
# CONFIG_USB_ISIGHTFW is not set
# CONFIG_USB_VST is not set
# CONFIG_USB_ATM is not set
# CONFIG_USB_GADGET is not set
#
# OTG and related infrastructure
#
# CONFIG_USB_GPIO_VBUS is not set
# CONFIG_NOP_USB_XCEIV is not set
# CONFIG_UWB is not set
CONFIG_MMC=y
# CONFIG_MMC_DEBUG is not set
# CONFIG_MMC_UNSAFE_RESUME is not set
#
# MMC/SD/SDIO Card Drivers
#
CONFIG_MMC_BLOCK=y
CONFIG_MMC_BLOCK_BOUNCE=y
# CONFIG_SDIO_UART is not set
# CONFIG_MMC_TEST is not set
#
# MMC/SD/SDIO Host Controller Drivers
#
CONFIG_MMC_SDHCI=y
CONFIG_MMC_SDHCI_PCI=y
# CONFIG_MMC_RICOH_MMC is not set
# CONFIG_MMC_WBSD is not set
# CONFIG_MMC_TIFM_SD is not set
# CONFIG_MEMSTICK is not set
CONFIG_NEW_LEDS=y
CONFIG_LEDS_CLASS=y
#
# LED drivers
#
# CONFIG_LEDS_ALIX2 is not set
# CONFIG_LEDS_PCA9532 is not set
# CONFIG_LEDS_GPIO is not set
# CONFIG_LEDS_LP5521 is not set
# CONFIG_LEDS_CLEVO_MAIL is not set
# CONFIG_LEDS_PCA955X is not set
# CONFIG_LEDS_DAC124S085 is not set
# CONFIG_LEDS_BD2802 is not set
#
# LED Triggers
#
CONFIG_LEDS_TRIGGERS=y
CONFIG_LEDS_TRIGGER_TIMER=y
# CONFIG_LEDS_TRIGGER_IDE_DISK is not set
CONFIG_LEDS_TRIGGER_HEARTBEAT=y
# CONFIG_LEDS_TRIGGER_BACKLIGHT is not set
# CONFIG_LEDS_TRIGGER_GPIO is not set
CONFIG_LEDS_TRIGGER_DEFAULT_ON=y
#
# iptables trigger is under Netfilter config (LED target)
#
# CONFIG_ACCESSIBILITY is not set
# CONFIG_INFINIBAND is not set
CONFIG_EDAC=y
#
# Reporting subsystems
#
# CONFIG_EDAC_DEBUG is not set
# CONFIG_EDAC_MM_EDAC is not set
# CONFIG_RTC_CLASS is not set
CONFIG_AUXDISPLAY=y
# CONFIG_KS0108 is not set
# CONFIG_UIO is not set
# CONFIG_STAGING is not set
CONFIG_X86_PLATFORM_DEVICES=y
# CONFIG_ACER_WMI is not set
# CONFIG_ASUS_LAPTOP is not set
# CONFIG_DELL_WMI is not set
# CONFIG_FUJITSU_LAPTOP is not set
# CONFIG_TC1100_WMI is not set
# CONFIG_HP_WMI is not set
# CONFIG_MSI_LAPTOP is not set
# CONFIG_PANASONIC_LAPTOP is not set
# CONFIG_COMPAL_LAPTOP is not set
# CONFIG_SONY_LAPTOP is not set
CONFIG_THINKPAD_ACPI=y
# CONFIG_THINKPAD_ACPI_DEBUGFACILITIES is not set
# CONFIG_THINKPAD_ACPI_DEBUG is not set
# CONFIG_THINKPAD_ACPI_UNSAFE_LEDS is not set
CONFIG_THINKPAD_ACPI_BAY=y
CONFIG_THINKPAD_ACPI_VIDEO=y
CONFIG_THINKPAD_ACPI_HOTKEY_POLL=y
# CONFIG_INTEL_MENLOW is not set
# CONFIG_EEEPC_LAPTOP is not set
CONFIG_ACPI_WMI=y
# CONFIG_ACPI_ASUS is not set
# CONFIG_ACPI_TOSHIBA is not set
#
# Firmware Drivers
#
# CONFIG_EDD is not set
CONFIG_FIRMWARE_MEMMAP=y
CONFIG_EFI_VARS=y
# CONFIG_DELL_RBU is not set
# CONFIG_DCDBAS is not set
CONFIG_DMIID=y
CONFIG_ISCSI_IBFT_FIND=y
# CONFIG_ISCSI_IBFT is not set
#
# File systems
#
# CONFIG_EXT2_FS is not set
CONFIG_EXT3_FS=y
# CONFIG_EXT3_DEFAULTS_TO_ORDERED is not set
CONFIG_EXT3_FS_XATTR=y
CONFIG_EXT3_FS_POSIX_ACL=y
CONFIG_EXT3_FS_SECURITY=y
CONFIG_EXT4_FS=y
CONFIG_EXT4DEV_COMPAT=y
CONFIG_EXT4_FS_XATTR=y
CONFIG_EXT4_FS_POSIX_ACL=y
CONFIG_EXT4_FS_SECURITY=y
CONFIG_JBD=y
CONFIG_JBD_DEBUG=y
CONFIG_JBD2=y
CONFIG_JBD2_DEBUG=y
CONFIG_FS_MBCACHE=y
# CONFIG_REISERFS_FS is not set
# CONFIG_JFS_FS is not set
CONFIG_FS_POSIX_ACL=y
CONFIG_FILE_LOCKING=y
# CONFIG_XFS_FS is not set
# CONFIG_GFS2_FS is not set
# CONFIG_OCFS2_FS is not set
# CONFIG_BTRFS_FS is not set
CONFIG_DNOTIFY=y
CONFIG_INOTIFY=y
CONFIG_INOTIFY_USER=y
CONFIG_QUOTA=y
CONFIG_QUOTA_NETLINK_INTERFACE=y
CONFIG_PRINT_QUOTA_WARNING=y
# CONFIG_QFMT_V1 is not set
# CONFIG_QFMT_V2 is not set
CONFIG_QUOTACTL=y
# CONFIG_AUTOFS_FS is not set
CONFIG_AUTOFS4_FS=y
CONFIG_FUSE_FS=y
#
# Caches
#
# CONFIG_FSCACHE is not set
#
# CD-ROM/DVD Filesystems
#
CONFIG_ISO9660_FS=y
CONFIG_JOLIET=y
CONFIG_ZISOFS=y
CONFIG_UDF_FS=y
CONFIG_UDF_NLS=y
#
# DOS/FAT/NT Filesystems
#
CONFIG_FAT_FS=y
# CONFIG_MSDOS_FS is not set
CONFIG_VFAT_FS=y
CONFIG_FAT_DEFAULT_CODEPAGE=437
CONFIG_FAT_DEFAULT_IOCHARSET="iso8859-1"
CONFIG_NTFS_FS=y
# CONFIG_NTFS_DEBUG is not set
CONFIG_NTFS_RW=y
#
# Pseudo filesystems
#
CONFIG_PROC_FS=y
CONFIG_PROC_KCORE=y
CONFIG_PROC_VMCORE=y
CONFIG_PROC_SYSCTL=y
CONFIG_PROC_PAGE_MONITOR=y
CONFIG_SYSFS=y
CONFIG_TMPFS=y
# CONFIG_TMPFS_POSIX_ACL is not set
CONFIG_HUGETLBFS=y
CONFIG_HUGETLB_PAGE=y
# CONFIG_CONFIGFS_FS is not set
CONFIG_MISC_FILESYSTEMS=y
# CONFIG_ADFS_FS is not set
# CONFIG_AFFS_FS is not set
CONFIG_ECRYPT_FS=y
# CONFIG_HFS_FS is not set
# CONFIG_HFSPLUS_FS is not set
# CONFIG_BEFS_FS is not set
# CONFIG_BFS_FS is not set
# CONFIG_EFS_FS is not set
CONFIG_CRAMFS=y
# CONFIG_SQUASHFS is not set
# CONFIG_VXFS_FS is not set
# CONFIG_MINIX_FS is not set
# CONFIG_OMFS_FS is not set
# CONFIG_HPFS_FS is not set
# CONFIG_QNX4FS_FS is not set
# CONFIG_ROMFS_FS is not set
# CONFIG_SYSV_FS is not set
# CONFIG_UFS_FS is not set
# CONFIG_NILFS2_FS is not set
CONFIG_NETWORK_FILESYSTEMS=y
CONFIG_NFS_FS=y
CONFIG_NFS_V3=y
# CONFIG_NFS_V3_ACL is not set
CONFIG_NFS_V4=y
CONFIG_NFSD=y
CONFIG_NFSD_V3=y
# CONFIG_NFSD_V3_ACL is not set
CONFIG_NFSD_V4=y
CONFIG_LOCKD=y
CONFIG_LOCKD_V4=y
CONFIG_EXPORTFS=y
CONFIG_NFS_COMMON=y
CONFIG_SUNRPC=y
CONFIG_SUNRPC_GSS=y
CONFIG_RPCSEC_GSS_KRB5=y
# CONFIG_RPCSEC_GSS_SPKM3 is not set
# CONFIG_SMB_FS is not set
# CONFIG_CIFS is not set
# CONFIG_NCP_FS is not set
# CONFIG_CODA_FS is not set
# CONFIG_AFS_FS is not set
#
# Partition Types
#
CONFIG_PARTITION_ADVANCED=y
# CONFIG_ACORN_PARTITION is not set
# CONFIG_OSF_PARTITION is not set
# CONFIG_AMIGA_PARTITION is not set
CONFIG_ATARI_PARTITION=y
CONFIG_MAC_PARTITION=y
CONFIG_MSDOS_PARTITION=y
CONFIG_BSD_DISKLABEL=y
CONFIG_MINIX_SUBPARTITION=y
CONFIG_SOLARIS_X86_PARTITION=y
CONFIG_UNIXWARE_DISKLABEL=y
CONFIG_LDM_PARTITION=y
# CONFIG_LDM_DEBUG is not set
CONFIG_SGI_PARTITION=y
# CONFIG_ULTRIX_PARTITION is not set
CONFIG_SUN_PARTITION=y
# CONFIG_KARMA_PARTITION is not set
CONFIG_EFI_PARTITION=y
CONFIG_SYSV68_PARTITION=y
CONFIG_NLS=y
CONFIG_NLS_DEFAULT="cp437"
CONFIG_NLS_CODEPAGE_437=y
# CONFIG_NLS_CODEPAGE_737 is not set
# CONFIG_NLS_CODEPAGE_775 is not set
CONFIG_NLS_CODEPAGE_850=y
# CONFIG_NLS_CODEPAGE_852 is not set
# CONFIG_NLS_CODEPAGE_855 is not set
# CONFIG_NLS_CODEPAGE_857 is not set
# CONFIG_NLS_CODEPAGE_860 is not set
# CONFIG_NLS_CODEPAGE_861 is not set
# CONFIG_NLS_CODEPAGE_862 is not set
# CONFIG_NLS_CODEPAGE_863 is not set
# CONFIG_NLS_CODEPAGE_864 is not set
# CONFIG_NLS_CODEPAGE_865 is not set
# CONFIG_NLS_CODEPAGE_866 is not set
# CONFIG_NLS_CODEPAGE_869 is not set
# CONFIG_NLS_CODEPAGE_936 is not set
# CONFIG_NLS_CODEPAGE_950 is not set
# CONFIG_NLS_CODEPAGE_932 is not set
# CONFIG_NLS_CODEPAGE_949 is not set
# CONFIG_NLS_CODEPAGE_874 is not set
# CONFIG_NLS_ISO8859_8 is not set
# CONFIG_NLS_CODEPAGE_1250 is not set
# CONFIG_NLS_CODEPAGE_1251 is not set
CONFIG_NLS_ASCII=y
CONFIG_NLS_ISO8859_1=y
# CONFIG_NLS_ISO8859_2 is not set
# CONFIG_NLS_ISO8859_3 is not set
# CONFIG_NLS_ISO8859_4 is not set
# CONFIG_NLS_ISO8859_5 is not set
# CONFIG_NLS_ISO8859_6 is not set
# CONFIG_NLS_ISO8859_7 is not set
# CONFIG_NLS_ISO8859_9 is not set
# CONFIG_NLS_ISO8859_13 is not set
# CONFIG_NLS_ISO8859_14 is not set
# CONFIG_NLS_ISO8859_15 is not set
# CONFIG_NLS_KOI8_R is not set
# CONFIG_NLS_KOI8_U is not set
# CONFIG_NLS_UTF8 is not set
# CONFIG_DLM is not set
#
# Kernel hacking
#
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
CONFIG_PRINTK_TIME=y
CONFIG_ENABLE_WARN_DEPRECATED=y
# CONFIG_ENABLE_MUST_CHECK is not set
CONFIG_FRAME_WARN=1024
CONFIG_MAGIC_SYSRQ=y
CONFIG_UNUSED_SYMBOLS=y
CONFIG_DEBUG_FS=y
# CONFIG_HEADERS_CHECK is not set
CONFIG_DEBUG_KERNEL=y
# CONFIG_DEBUG_SHIRQ is not set
CONFIG_DETECT_SOFTLOCKUP=y
# CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC is not set
CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE=0
CONFIG_DETECT_HUNG_TASK=y
# CONFIG_BOOTPARAM_HUNG_TASK_PANIC is not set
CONFIG_BOOTPARAM_HUNG_TASK_PANIC_VALUE=0
CONFIG_SCHED_DEBUG=y
CONFIG_SCHEDSTATS=y
CONFIG_TIMER_STATS=y
# CONFIG_DEBUG_OBJECTS is not set
# CONFIG_SLUB_DEBUG_ON is not set
# CONFIG_SLUB_STATS is not set
# CONFIG_DEBUG_RT_MUTEXES is not set
# CONFIG_RT_MUTEX_TESTER is not set
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_PROVE_LOCKING=y
CONFIG_LOCKDEP=y
CONFIG_LOCK_STAT=y
# CONFIG_DEBUG_LOCKDEP is not set
CONFIG_TRACE_IRQFLAGS=y
CONFIG_DEBUG_SPINLOCK_SLEEP=y
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
CONFIG_STACKTRACE=y
# CONFIG_DEBUG_KOBJECT is not set
# CONFIG_DEBUG_HIGHMEM is not set
CONFIG_DEBUG_BUGVERBOSE=y
# CONFIG_DEBUG_INFO is not set
# CONFIG_DEBUG_VM is not set
# CONFIG_DEBUG_VIRTUAL is not set
# CONFIG_DEBUG_WRITECOUNT is not set
CONFIG_DEBUG_MEMORY_INIT=y
# CONFIG_DEBUG_LIST is not set
# CONFIG_DEBUG_SG is not set
# CONFIG_DEBUG_NOTIFIERS is not set
CONFIG_ARCH_WANT_FRAME_POINTERS=y
CONFIG_FRAME_POINTER=y
# CONFIG_BOOT_PRINTK_DELAY is not set
# CONFIG_RCU_TORTURE_TEST is not set
# CONFIG_RCU_CPU_STALL_DETECTOR is not set
# CONFIG_KPROBES_SANITY_TEST is not set
# CONFIG_BACKTRACE_SELF_TEST is not set
# CONFIG_DEBUG_BLOCK_EXT_DEVT is not set
# CONFIG_LKDTM is not set
# CONFIG_FAULT_INJECTION is not set
CONFIG_LATENCYTOP=y
CONFIG_SYSCTL_SYSCALL_CHECK=y
CONFIG_DEBUG_PAGEALLOC=y
CONFIG_USER_STACKTRACE_SUPPORT=y
CONFIG_NOP_TRACER=y
CONFIG_HAVE_FUNCTION_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y
CONFIG_HAVE_FUNCTION_TRACE_MCOUNT_TEST=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y
CONFIG_HAVE_FTRACE_SYSCALLS=y
CONFIG_TRACER_MAX_TRACE=y
CONFIG_RING_BUFFER=y
CONFIG_TRACING=y
CONFIG_TRACING_SUPPORT=y
#
# Tracers
#
# CONFIG_FUNCTION_TRACER is not set
# CONFIG_IRQSOFF_TRACER is not set
CONFIG_SYSPROF_TRACER=y
CONFIG_SCHED_TRACER=y
CONFIG_CONTEXT_SWITCH_TRACER=y
CONFIG_EVENT_TRACER=y
CONFIG_FTRACE_SYSCALLS=y
# CONFIG_BOOT_TRACER is not set
# CONFIG_TRACE_BRANCH_PROFILING is not set
CONFIG_POWER_TRACER=y
# CONFIG_STACK_TRACER is not set
# CONFIG_KMEMTRACE is not set
# CONFIG_WORKQUEUE_TRACER is not set
CONFIG_BLK_DEV_IO_TRACE=y
# CONFIG_FTRACE_STARTUP_TEST is not set
# CONFIG_MMIOTRACE is not set
CONFIG_PROVIDE_OHCI1394_DMA_INIT=y
CONFIG_DYNAMIC_DEBUG=y
# CONFIG_DMA_API_DEBUG is not set
CONFIG_SAMPLES=y
# CONFIG_SAMPLE_MARKERS is not set
# CONFIG_SAMPLE_TRACEPOINTS is not set
# CONFIG_SAMPLE_KOBJECT is not set
# CONFIG_SAMPLE_KPROBES is not set
CONFIG_HAVE_ARCH_KGDB=y
# CONFIG_KGDB is not set
# CONFIG_STRICT_DEVMEM is not set
CONFIG_X86_VERBOSE_BOOTUP=y
CONFIG_EARLY_PRINTK=y
# CONFIG_EARLY_PRINTK_DBGP is not set
CONFIG_DEBUG_STACKOVERFLOW=y
CONFIG_DEBUG_STACK_USAGE=y
# CONFIG_DEBUG_PER_CPU_MAPS is not set
# CONFIG_X86_PTDUMP is not set
CONFIG_DEBUG_RODATA=y
# CONFIG_DEBUG_RODATA_TEST is not set
# CONFIG_DEBUG_NX_TEST is not set
CONFIG_4KSTACKS=y
CONFIG_DOUBLEFAULT=y
CONFIG_HAVE_MMIOTRACE_SUPPORT=y
CONFIG_IO_DELAY_TYPE_0X80=0
CONFIG_IO_DELAY_TYPE_0XED=1
CONFIG_IO_DELAY_TYPE_UDELAY=2
CONFIG_IO_DELAY_TYPE_NONE=3
CONFIG_IO_DELAY_0X80=y
# CONFIG_IO_DELAY_0XED is not set
# CONFIG_IO_DELAY_UDELAY is not set
# CONFIG_IO_DELAY_NONE is not set
CONFIG_DEFAULT_IO_DELAY_TYPE=0
CONFIG_DEBUG_BOOT_PARAMS=y
# CONFIG_CPA_DEBUG is not set
CONFIG_OPTIMIZE_INLINING=y
#
# Security options
#
CONFIG_KEYS=y
CONFIG_KEYS_DEBUG_PROC_KEYS=y
CONFIG_SECURITY=y
# CONFIG_SECURITYFS is not set
CONFIG_SECURITY_NETWORK=y
# CONFIG_SECURITY_NETWORK_XFRM is not set
# CONFIG_SECURITY_PATH is not set
CONFIG_SECURITY_FILE_CAPABILITIES=y
# CONFIG_SECURITY_ROOTPLUG is not set
CONFIG_SECURITY_DEFAULT_MMAP_MIN_ADDR=0
CONFIG_SECURITY_SELINUX=y
CONFIG_SECURITY_SELINUX_BOOTPARAM=y
CONFIG_SECURITY_SELINUX_BOOTPARAM_VALUE=0
CONFIG_SECURITY_SELINUX_DISABLE=y
CONFIG_SECURITY_SELINUX_DEVELOP=y
CONFIG_SECURITY_SELINUX_AVC_STATS=y
CONFIG_SECURITY_SELINUX_CHECKREQPROT_VALUE=1
# CONFIG_SECURITY_SELINUX_POLICYDB_VERSION_MAX is not set
# CONFIG_SECURITY_TOMOYO is not set
# CONFIG_IMA is not set
CONFIG_CRYPTO=y
#
# Crypto core or helper
#
# CONFIG_CRYPTO_FIPS is not set
CONFIG_CRYPTO_ALGAPI=y
CONFIG_CRYPTO_ALGAPI2=y
CONFIG_CRYPTO_AEAD=y
CONFIG_CRYPTO_AEAD2=y
CONFIG_CRYPTO_BLKCIPHER=y
CONFIG_CRYPTO_BLKCIPHER2=y
CONFIG_CRYPTO_HASH=y
CONFIG_CRYPTO_HASH2=y
CONFIG_CRYPTO_RNG=y
CONFIG_CRYPTO_RNG2=y
CONFIG_CRYPTO_PCOMP=y
CONFIG_CRYPTO_MANAGER=y
CONFIG_CRYPTO_MANAGER2=y
# CONFIG_CRYPTO_GF128MUL is not set
CONFIG_CRYPTO_NULL=y
CONFIG_CRYPTO_WORKQUEUE=y
# CONFIG_CRYPTO_CRYPTD is not set
CONFIG_CRYPTO_AUTHENC=y
# CONFIG_CRYPTO_TEST is not set
#
# Authenticated Encryption with Associated Data
#
# CONFIG_CRYPTO_CCM is not set
# CONFIG_CRYPTO_GCM is not set
CONFIG_CRYPTO_SEQIV=y
#
# Block modes
#
CONFIG_CRYPTO_CBC=y
CONFIG_CRYPTO_CTR=y
# CONFIG_CRYPTO_CTS is not set
CONFIG_CRYPTO_ECB=y
# CONFIG_CRYPTO_LRW is not set
# CONFIG_CRYPTO_PCBC is not set
# CONFIG_CRYPTO_XTS is not set
#
# Hash modes
#
CONFIG_CRYPTO_HMAC=y
CONFIG_CRYPTO_XCBC=y
#
# Digest
#
# CONFIG_CRYPTO_CRC32C is not set
# CONFIG_CRYPTO_CRC32C_INTEL is not set
# CONFIG_CRYPTO_MD4 is not set
CONFIG_CRYPTO_MD5=y
CONFIG_CRYPTO_MICHAEL_MIC=y
# CONFIG_CRYPTO_RMD128 is not set
CONFIG_CRYPTO_RMD160=y
# CONFIG_CRYPTO_RMD256 is not set
# CONFIG_CRYPTO_RMD320 is not set
CONFIG_CRYPTO_SHA1=y
CONFIG_CRYPTO_SHA256=y
# CONFIG_CRYPTO_SHA512 is not set
# CONFIG_CRYPTO_TGR192 is not set
# CONFIG_CRYPTO_WP512 is not set
#
# Ciphers
#
CONFIG_CRYPTO_AES=y
CONFIG_CRYPTO_AES_586=y
# CONFIG_CRYPTO_ANUBIS is not set
CONFIG_CRYPTO_ARC4=y
CONFIG_CRYPTO_BLOWFISH=y
CONFIG_CRYPTO_CAMELLIA=y
# CONFIG_CRYPTO_CAST5 is not set
# CONFIG_CRYPTO_CAST6 is not set
CONFIG_CRYPTO_DES=y
# CONFIG_CRYPTO_FCRYPT is not set
# CONFIG_CRYPTO_KHAZAD is not set
# CONFIG_CRYPTO_SALSA20 is not set
# CONFIG_CRYPTO_SALSA20_586 is not set
# CONFIG_CRYPTO_SEED is not set
CONFIG_CRYPTO_SERPENT=y
# CONFIG_CRYPTO_TEA is not set
CONFIG_CRYPTO_TWOFISH=y
CONFIG_CRYPTO_TWOFISH_COMMON=y
# CONFIG_CRYPTO_TWOFISH_586 is not set
#
# Compression
#
CONFIG_CRYPTO_DEFLATE=y
# CONFIG_CRYPTO_ZLIB is not set
# CONFIG_CRYPTO_LZO is not set
#
# Random Number Generation
#
# CONFIG_CRYPTO_ANSI_CPRNG is not set
CONFIG_CRYPTO_HW=y
CONFIG_CRYPTO_DEV_PADLOCK=y
# CONFIG_CRYPTO_DEV_PADLOCK_AES is not set
# CONFIG_CRYPTO_DEV_PADLOCK_SHA is not set
# CONFIG_CRYPTO_DEV_GEODE is not set
# CONFIG_CRYPTO_DEV_HIFN_795X is not set
CONFIG_HAVE_KVM=y
CONFIG_HAVE_KVM_IRQCHIP=y
CONFIG_VIRTUALIZATION=y
CONFIG_KVM=y
CONFIG_KVM_INTEL=y
# CONFIG_KVM_AMD is not set
CONFIG_KVM_TRACE=y
# CONFIG_VIRTIO_PCI is not set
# CONFIG_VIRTIO_BALLOON is not set
CONFIG_BINARY_PRINTF=y
#
# Library routines
#
CONFIG_BITREVERSE=y
CONFIG_GENERIC_FIND_FIRST_BIT=y
CONFIG_GENERIC_FIND_NEXT_BIT=y
CONFIG_GENERIC_FIND_LAST_BIT=y
CONFIG_CRC_CCITT=y
CONFIG_CRC16=y
CONFIG_CRC_T10DIF=y
CONFIG_CRC_ITU_T=y
CONFIG_CRC32=y
# CONFIG_CRC7 is not set
# CONFIG_LIBCRC32C is not set
CONFIG_AUDIT_GENERIC=y
CONFIG_ZLIB_INFLATE=y
CONFIG_ZLIB_DEFLATE=y
CONFIG_DECOMPRESS_GZIP=y
CONFIG_DECOMPRESS_BZIP2=y
CONFIG_DECOMPRESS_LZMA=y
CONFIG_HAS_IOMEM=y
CONFIG_HAS_IOPORT=y
CONFIG_HAS_DMA=y
CONFIG_NLATTR=y
^ permalink raw reply [flat|nested] 41+ messages in thread

* Re: [PATCH 0/11] Per-bdi writeback flusher threads v8
2009-05-27 17:58 ` Theodore Tso
@ 2009-05-27 18:14 ` Jens Axboe
2009-05-27 19:15 ` Jens Axboe
0 siblings, 1 reply; 41+ messages in thread
From: Jens Axboe @ 2009-05-27 18:14 UTC (permalink / raw)
To: Theodore Tso
Cc: linux-kernel, linux-fsdevel, chris.mason, david, hch, akpm, jack,
yanmin_zhang, richard, damien.wyart
On Wed, May 27 2009, Theodore Tso wrote:
> On Wed, May 27, 2009 at 01:53:53PM -0400, Theodore Tso wrote:
> > On Wed, May 27, 2009 at 10:47:54AM -0400, Theodore Tso wrote:
> > >
> > > I'll retry the test with your stock writeback-v8 git branch w/o any
> > > ext4 patches planned for the next mainline merge window to see if I get the
> > > same soft lockup, but I thought I should give you an early heads up.
> >
> > Confirmed. I had to run fsstress twice, but I was able to trigger a
> > soft lockup with just the per-bdi v8 patches using ext4.
>
> As you requested, here's the .config file which I used. This was on a
> Lenovo S10 (N270 Atom dual-core CPU, 1.5 gigs of memory, 5400 rpm hdd).
If you have time, can you rerun with this little patch? It moves the
super sync to a separate thread. Thanks!
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index f71588c..8b30f29 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -218,10 +218,14 @@ static __init int bdi_class_init(void)
}
postcore_initcall(bdi_class_init);
+static int bdi_sync_supers(void *unused);
+
static int __init default_bdi_init(void)
{
int err;
+ kthread_run(bdi_sync_supers, NULL, "bdi-super");
+
err = bdi_init(&default_backing_dev_info);
if (!err)
bdi_register(&default_backing_dev_info, NULL, "default");
@@ -412,6 +416,20 @@ static void bdi_flush_io(struct backing_dev_info *bdi)
generic_sync_bdi_inodes(NULL, &wbc);
}
+static int bdi_sync_supers(void *unused)
+{
+ while (!kthread_should_stop()) {
+ schedule_timeout(dirty_expire_interval * 10);
+
+ /*
+ * Do this periodically, like kupdated() did before.
+ */
+ sync_supers();
+ }
+
+ return 0;
+}
+
static int bdi_forker_task(void *ptr)
{
struct bdi_writeback *me = ptr;
@@ -424,11 +442,6 @@ static int bdi_forker_task(void *ptr)
struct bdi_writeback *wb;
/*
- * Do this periodically, like kupdated() did before.
- */
- sync_supers();
-
- /*
* Temporary measure, we want to make sure we don't see
* dirty data on the default backing_dev_info
*/
--
Jens Axboe
^ permalink raw reply related [flat|nested] 41+ messages in thread
* Re: [PATCH 0/11] Per-bdi writeback flusher threads v8
2009-05-27 18:14 ` Jens Axboe
@ 2009-05-27 19:15 ` Jens Axboe
2009-05-27 19:45 ` Jens Axboe
0 siblings, 1 reply; 41+ messages in thread
From: Jens Axboe @ 2009-05-27 19:15 UTC (permalink / raw)
To: Theodore Tso
Cc: linux-kernel, linux-fsdevel, chris.mason, david, hch, akpm, jack,
yanmin_zhang, richard, damien.wyart
On Wed, May 27 2009, Jens Axboe wrote:
> On Wed, May 27 2009, Theodore Tso wrote:
> > On Wed, May 27, 2009 at 01:53:53PM -0400, Theodore Tso wrote:
> > > On Wed, May 27, 2009 at 10:47:54AM -0400, Theodore Tso wrote:
> > > >
> > > > I'll retry the test with your stock writeback-v8 git branch w/o any
> > > > ext4 patches planned for the next mainline merge window to see if I get the
> > > > same soft lockup, but I thought I should give you an early heads up.
> > >
> > > Confirmed. I had to run fsstress twice, but I was able to trigger a
> > > soft lockup with just the per-bdi v8 patches using ext4.
> >
> > As you requested, here's the .config file which I used. This was on a
> > Lenovo S10 (N270 Atom dual-core CPU, 1.5 gigs of memory, 5400 rpm hdd).
>
> If you have time, can you rerun with this little patch? It moves the
> super sync to a separate thread. Thanks!
>
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index f71588c..8b30f29 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -218,10 +218,14 @@ static __init int bdi_class_init(void)
> }
> postcore_initcall(bdi_class_init);
>
> +static int bdi_sync_supers(void *unused);
> +
> static int __init default_bdi_init(void)
> {
> int err;
>
> + kthread_run(bdi_sync_supers, NULL, "bdi-super");
> +
> err = bdi_init(&default_backing_dev_info);
> if (!err)
> bdi_register(&default_backing_dev_info, NULL, "default");
> @@ -412,6 +416,20 @@ static void bdi_flush_io(struct backing_dev_info *bdi)
> generic_sync_bdi_inodes(NULL, &wbc);
> }
>
> +static int bdi_sync_supers(void *unused)
> +{
> + while (!kthread_should_stop()) {
> + schedule_timeout(dirty_expire_interval * 10);
Woops, that should be
schedule_timeout(msecs_to_jiffies(dirty_expire_interval * 10));
of course, otherwise we sync quite a lot :-)
--
Jens Axboe
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/11] Per-bdi writeback flusher threads v8
2009-05-27 19:15 ` Jens Axboe
@ 2009-05-27 19:45 ` Jens Axboe
2009-05-28 0:49 ` Theodore Tso
0 siblings, 1 reply; 41+ messages in thread
From: Jens Axboe @ 2009-05-27 19:45 UTC (permalink / raw)
To: Theodore Tso
Cc: linux-kernel, linux-fsdevel, chris.mason, david, hch, akpm, jack,
yanmin_zhang, richard, damien.wyart
On Wed, May 27 2009, Jens Axboe wrote:
> On Wed, May 27 2009, Jens Axboe wrote:
> > On Wed, May 27 2009, Theodore Tso wrote:
> > > On Wed, May 27, 2009 at 01:53:53PM -0400, Theodore Tso wrote:
> > > > On Wed, May 27, 2009 at 10:47:54AM -0400, Theodore Tso wrote:
> > > > >
> > > > > I'll retry the test with your stock writeback-v8 git branch w/o any
> > > > > ext4 patches planned for the next mainline merge window to see if I get the
> > > > > same soft lockup, but I thought I should give you an early heads up.
> > > >
> > > > Confirmed. I had to run fsstress twice, but I was able to trigger a
> > > > soft lockup with just the per-bdi v8 patches using ext4.
> > >
> > > As you requested, here's the .config file which I used. This was on a
> > > Lenovo S10 (N270 Atom dual-core CPU, 1.5 gigs of memory, 5400 rpm hdd).
> >
> > If you have time, can you rerun with this little patch? It moves the
> > super sync to a separate thread. Thanks!
This one has been tested good, where good means that it boots and
functions normally at least. Whether it fixes your issue, that would be
interesting to know :-)
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index f71588c..cc246df 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -218,10 +218,14 @@ static __init int bdi_class_init(void)
}
postcore_initcall(bdi_class_init);
+static int bdi_sync_supers(void *unused);
+
static int __init default_bdi_init(void)
{
int err;
+ kthread_run(bdi_sync_supers, NULL, "bdi-super");
+
err = bdi_init(&default_backing_dev_info);
if (!err)
bdi_register(&default_backing_dev_info, NULL, "default");
@@ -412,6 +416,23 @@ static void bdi_flush_io(struct backing_dev_info *bdi)
generic_sync_bdi_inodes(NULL, &wbc);
}
+static int bdi_sync_supers(void *unused)
+{
+ set_user_nice(current, 0);
+
+ while (!kthread_should_stop()) {
+ set_current_state(TASK_INTERRUPTIBLE);
+ schedule_timeout(msecs_to_jiffies(dirty_expire_interval * 10));
+
+ /*
+ * Do this periodically, like kupdated() did before.
+ */
+ sync_supers();
+ }
+
+ return 0;
+}
+
static int bdi_forker_task(void *ptr)
{
struct bdi_writeback *me = ptr;
@@ -424,11 +445,6 @@ static int bdi_forker_task(void *ptr)
struct bdi_writeback *wb;
/*
- * Do this periodically, like kupdated() did before.
- */
- sync_supers();
-
- /*
* Temporary measure, we want to make sure we don't see
* dirty data on the default backing_dev_info
*/
--
Jens Axboe
^ permalink raw reply related [flat|nested] 41+ messages in thread
* Re: [PATCH 0/11] Per-bdi writeback flusher threads v8
2009-05-27 19:45 ` Jens Axboe
@ 2009-05-28 0:49 ` Theodore Tso
2009-05-28 9:28 ` Jan Kara
0 siblings, 1 reply; 41+ messages in thread
From: Theodore Tso @ 2009-05-28 0:49 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-kernel, linux-fsdevel, chris.mason, david, hch, akpm, jack,
yanmin_zhang, richard, damien.wyart
[-- Attachment #1: Type: text/plain, Size: 633 bytes --]
On Wed, May 27, 2009 at 09:45:43PM +0200, Jens Axboe wrote:
>
> This one has been tested good, where good means that it boots and
> functions normally at least. Whether it fixes your issue, that would be
> interesting to know :-)
>
Unfortunately, it doesn't seem to have. Here's a dmesg with the
softlockup report and the sysrq-t output. Unfortunately the dmesg
file is too big for LKML, so I've compressed it so you can get the
whole thing.
There's also a lockdep warning which fsx triggered.
Same .config as last time, except I bumped up CONFIG_LOG_BUF_SHIFT to
19 to make sure dmesg could capture everything.
- Ted
[-- Attachment #2: dmesg-with-patch.gz --]
[-- Type: application/octet-stream, Size: 41877 bytes --]
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/11] Per-bdi writeback flusher threads v8
2009-05-28 0:49 ` Theodore Tso
@ 2009-05-28 9:28 ` Jan Kara
2009-05-28 9:36 ` Jens Axboe
2009-05-28 15:23 ` Eric W. Biederman
0 siblings, 2 replies; 41+ messages in thread
From: Jan Kara @ 2009-05-28 9:28 UTC (permalink / raw)
To: Theodore Tso, Jens Axboe, linux-kernel, linux-fsdevel,
chris.mason, david, hch
Cc: Alex Chiang, Eric W. Biederman
On Wed 27-05-09 20:49:59, Theodore Tso wrote:
> On Wed, May 27, 2009 at 09:45:43PM +0200, Jens Axboe wrote:
> >
> > This one has been tested good, where good means that it boots and
> > functions normally at least. Whether it fixes your issue, that would be
> > interesting to know :-)
> >
>
> Unfortunately, it doesn't seem to have. Here's a dmesg with the
> softlockup report and the sysrq-t output. Unfortunately the dmesg
> file is too big for LKML, so I've compressed it so you can get the
> whole thing.
Everybody waits for sys_sync() to complete and they never seem to be
woken up. Jens, wb_work_complete() seems a bit fishy - who does
wb_clear_work() in sync_mode == WB_SYNC_ALL which is on stack?
> There's also a lockdep warning which fsx triggered.
The lockdep warning is definitely unrelated. It's really a possible
deadlock, although not quite probable. IMHO the problem is that
sysfs_mutex gets above mmap_sem due to code in sysfs_readdir which calls
filldir() which may cause page fault. At the same time it gets quite low
on the lock stack because filesystems call sysfs functions from their
internal functions (in this case ext4_put_super) holding quite some locks.
Adding a few CC's for this.
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/11] Per-bdi writeback flusher threads v8
2009-05-28 9:28 ` Jan Kara
@ 2009-05-28 9:36 ` Jens Axboe
2009-05-28 15:23 ` Eric W. Biederman
1 sibling, 0 replies; 41+ messages in thread
From: Jens Axboe @ 2009-05-28 9:36 UTC (permalink / raw)
To: Jan Kara
Cc: Theodore Tso, linux-kernel, linux-fsdevel, chris.mason, david,
hch, akpm, yanmin_zhang, richard, damien.wyart, Alex Chiang,
Eric W. Biederman
On Thu, May 28 2009, Jan Kara wrote:
> On Wed 27-05-09 20:49:59, Theodore Tso wrote:
> > On Wed, May 27, 2009 at 09:45:43PM +0200, Jens Axboe wrote:
> > >
> > > This one has been tested good, where good means that it boots and
> > > functions normally at least. Whether it fixes your issue, that would be
> > > interesting to know :-)
> > >
> >
> > Unfortunately, it doesn't seem to have. Here's a dmesg with the
> > softlockup report and the sysrq-t output. Unfortunately the dmesg
> > file is too big for LKML, so I've compressed it so you can get the
> > whole thing.
> Everybody waits for sys_sync() to complete and they never seem to be
> woken up. Jens, wb_work_complete() seems a bit fishy - who does
> wb_clear_work() in sync_mode == WB_SYNC_ALL which is on stack?
It is tricky, I looked through that several times. I think
wb_work_complete() should do:
if (!bdi_work_on_stack(work))
bdi_work_clear(work);
if (sync_mode == WB_SYNC_NONE || bdi_work_on_stack(work))
call_rcu(&work->rcu_head, bdi_work_free);
to have bdi_work_clear() called for the on-stack work. I'll double check
this and roll out a new release, once tested. I haven't been able to
reproduce Ted's problem yet, but perhaps my bdi_alloc() test failures
haven't triggered for WB_SYNC_ALL yet.
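For context, here is a minimal sketch of how that change might look wrapped
up in wb_work_complete() -- assuming a struct bdi_work that carries the
sync_mode and an rcu_head as the snippet above implies, and assuming that
bdi_work_free() performs the clear for the on-stack case once the RCU grace
period has passed:

static void wb_work_complete(struct bdi_work *work)
{
	const enum writeback_sync_modes sync_mode = work->sync_mode;

	/*
	 * Allocated work can be cleared (waking any waiter) right here.
	 * On-stack work is instead left for bdi_work_free(), called via
	 * call_rcu() below, so its clear happens after the grace period.
	 */
	if (!bdi_work_on_stack(work))
		bdi_work_clear(work);

	if (sync_mode == WB_SYNC_NONE || bdi_work_on_stack(work))
		call_rcu(&work->rcu_head, bdi_work_free);
}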
> > There's also a lockdep warning which fsx triggered.
> The lockdep warning is definitely unrelated. It's really a possible
> deadlock, although not quite probable. IMHO the problem is that
> sysfs_mutex gets above mmap_sem due to code in sysfs_readdir which calls
> filldir() which may cause page fault. At the same time it gets quite low
> on the lock stack because filesystems call sysfs functions from their
> internal functions (in this case ext4_put_super) holding quite some locks.
> Adding a few CC's for this.
Thanks, I didn't see any bdi related stuff in there either.
--
Jens Axboe
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/11] Per-bdi writeback flusher threads v8
2009-05-28 9:28 ` Jan Kara
2009-05-28 9:36 ` Jens Axboe
@ 2009-05-28 15:23 ` Eric W. Biederman
2009-05-28 19:32 ` Theodore Tso
1 sibling, 1 reply; 41+ messages in thread
From: Eric W. Biederman @ 2009-05-28 15:23 UTC (permalink / raw)
To: Jan Kara
Cc: Theodore Tso, Jens Axboe, linux-kernel, linux-fsdevel,
chris.mason, david, hch, akpm, yanmin_zhang, richard,
damien.wyart, Alex Chiang, Eric W. Biederman
Jan Kara <jack@suse.cz> writes:
> On Wed 27-05-09 20:49:59, Theodore Tso wrote:
>> On Wed, May 27, 2009 at 09:45:43PM +0200, Jens Axboe wrote:
>> >
>> > This one has been tested good, where good means that it boots and
>> > functions normally at least. Whether it fixes your issue, that would be
>> > interesting to know :-)
>> >
>>
>> Unfortunately, it doesn't seem to have. Here's a dmesg with the
>> softlockup report and the sysrq-t output. Unfortunately the dmesg
>> file is too big for LKML, so I've compressed it so you can get the
>> whole thing.
> Everybody waits for sys_sync() to complete and they never seem to be
> woken up. Jens, wb_work_complete() seems a bit fishy - who does
> wb_clear_work() in sync_mode == WB_SYNC_ALL which is on stack?
>
>> There's also a lockdep warning which fsx triggered.
> The lockdep warning is definitely unrelated. It's really a possible
> deadlock, although not quite probable. IMHO the problem is that
> sysfs_mutex gets above mmap_sem due to code in sysfs_readdir which calls
> filldir() which may cause page fault. At the same time it gets quite low
> on the lock stack because filesystems call sysfs functions from their
> internal functions (in this case ext4_put_super) holding quite some locks.
> Adding a few CC's for this.
Interesting.
I thought the network stack was the only piece of code silly enough
to hold locks while deleting sysfs files.
Holding any lock while deleting objects from sysfs, sysctl or proc,
is asking for serious mischief, and unfixable from the fs side.
The usual problem is that lockdep doesn't yet understand
sysfs_deactivate which waits for any running sysfs operations to
complete before it deletes the sysfs files.
Which means any lock you hold in a show or store method can deadlock
with any lock you hold while deleting from sysfs.
ext4 appears lock loose and fancy free in its show and store methods
so it might be ok except for this issue of mmap_sem vs sysfs_mutex.
But apparently even that isn't enough to get rid of the requirement
to not hold locks when deleting objects.
Eric
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/11] Per-bdi writeback flusher threads v8
2009-05-28 15:23 ` Eric W. Biederman
@ 2009-05-28 19:32 ` Theodore Tso
2009-05-28 19:38 ` Christoph Hellwig
0 siblings, 1 reply; 41+ messages in thread
From: Theodore Tso @ 2009-05-28 19:32 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Jan Kara, Jens Axboe, linux-kernel, linux-fsdevel, chris.mason,
david, hch, akpm, yanmin_zhang, richard, damien.wyart,
Alex Chiang, Eric W. Biederman
On Thu, May 28, 2009 at 08:23:28AM -0700, Eric W. Biederman wrote:
> I thought the network stack was the only piece of code silly enough
> to hold locks while deleting sysfs files.
>
> Holding any lock while deleting objects from sysfs, sysctl or proc,
> is asking for serious mischief, and unfixable from the fs side.
>
> The usual problem is that lockdep doesn't yet understand
> sysfs_deactivate which waits for any running sysfs operations to
> complete before it deletes the sysfs files.
>
> Which means any lock you hold in a show or store method can deadlock
> with any lock you hold while deleting from sysfs.
>
> ext4 appears lock loose and fancy free in its show and store methods
> so it might be ok except for this issue of mmap_sem vs sysfs_mutex.
> But apparently even that isn't enough to get rid of the requirement
> to not hold locks when deleting objects.
I ran into this problem, and I attempted to work around the fact that
the core VFS layer calls the fs's put_super() while holding the BKL
and the super block lock as follows at the very end of ext4's
put_super() function:
/*
* Now that we are completely done shutting down the
* superblock, we need to actually destroy the kobject.
*/
unlock_kernel();
unlock_super(sb);
kobject_put(&sbi->s_kobj);
wait_for_completion(&sbi->s_kobj_unregister);
lock_super(sb);
lock_kernel();
I'm pretty sure that after the first call to invalidate_inodes() in
fs/super.c's generic_shutdown_super(), we really don't need to hold
the BKL or the superblock lock (and let the filesystems' low-level
write_super() and put_super() take the lock if they really need it),
but we probably need to take a closer look at this to make sure it's
true for all filesystems. (IIRC, I think Christoph was looking to
clean up lock_super(); at least with respect to the write_super call.
I don't know what his plans are regarding the BKL and put_super(),
though.)
Fundamentally, right now it's rather painful to use sysfs from within
a filesystem driver, since the generic VFS layer doesn't have any
explicit kobject support functionality.
- Ted
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 0/11] Per-bdi writeback flusher threads v8
2009-05-28 19:32 ` Theodore Tso
@ 2009-05-28 19:38 ` Christoph Hellwig
0 siblings, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2009-05-28 19:38 UTC (permalink / raw)
To: Theodore Tso, Eric W. Biederman, Jan Kara, Jens Axboe,
linux-kernel, linux-fsdevel
On Thu, May 28, 2009 at 03:32:46PM -0400, Theodore Tso wrote:
>
> I'm pretty sure that after the first call to invalidate_inodes() in
> fs/super.c's generic_shutdown_super(), we really don't need to hold
> the BKL or the superblock lock (and let the filesystems' low-level
> write_super(0 and put_super() take the lock if they really need it),
> but we probably need to take a closer look at this to make sure it's
> true for all filesystems. (IIRC, I think Christoph was looking to
> clean up lock_super(); at least with respect to the write_super call.
> I don't know what his plans regarding the BKL and put_super(),
> though.)
Both the BKL and lock_super are not held for ->put_super anymore in
the vfs tree. In fact in the vfs tree there are no callers of
lock_super in the VFS anymore.
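Assuming that vfs-tree change, the workaround quoted earlier could presumably
shrink to just the kobject teardown at the end of ext4_put_super() -- a sketch
under that assumption, not the actual ext4 code:

	/*
	 * With the VFS no longer holding the BKL or the superblock lock
	 * across ->put_super(), no unlock/relock dance is needed around
	 * the kobject teardown.
	 */
	kobject_put(&sbi->s_kobj);
	wait_for_completion(&sbi->s_kobj_unregister);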
^ permalink raw reply [flat|nested] 41+ messages in thread
* [PATCH 04/11] writeback: switch to per-bdi threads for flushing data
2009-05-28 11:46 [PATCH 0/11] Per-bdi writeback flusher threads v9 Jens Axboe
@ 2009-05-28 11:46 ` Jens Axboe
2009-05-28 14:13 ` Artem Bityutskiy
0 siblings, 1 reply; 41+ messages in thread
From: Jens Axboe @ 2009-05-28 11:46 UTC (permalink / raw)
To: linux-kernel, linux-fsdevel, tytso
Cc: chris.mason, david, hch, akpm, jack, yanmin_zhang, richard,
damien.wyart, Jens Axboe
This gets rid of pdflush for bdi writeout and kupdated style cleaning.
This is an experiment to see if we get better writeout behaviour with
per-bdi flushing. Some initial tests look pretty encouraging. A sample
ffsb workload that does random writes to files is about 8% faster here
on a simple SATA drive during the benchmark phase. File layout also seems
a LOT more smooth in vmstat:
r b swpd free buff cache si so bi bo in cs us sy id wa
0 1 0 608848 2652 375372 0 0 0 71024 604 24 1 10 48 42
0 1 0 549644 2712 433736 0 0 0 60692 505 27 1 8 48 44
1 0 0 476928 2784 505192 0 0 4 29540 553 24 0 9 53 37
0 1 0 457972 2808 524008 0 0 0 54876 331 16 0 4 38 58
0 1 0 366128 2928 614284 0 0 4 92168 710 58 0 13 53 34
0 1 0 295092 3000 684140 0 0 0 62924 572 23 0 9 53 37
0 1 0 236592 3064 741704 0 0 4 58256 523 17 0 8 48 44
0 1 0 165608 3132 811464 0 0 0 57460 560 21 0 8 54 38
0 1 0 102952 3200 873164 0 0 4 74748 540 29 1 10 48 41
0 1 0 48604 3252 926472 0 0 0 53248 469 29 0 7 47 45
where vanilla tends to fluctuate a lot in the creation phase:
r b swpd free buff cache si so bi bo in cs us sy id wa
1 1 0 678716 5792 303380 0 0 0 74064 565 50 1 11 52 36
1 0 0 662488 5864 319396 0 0 4 352 302 329 0 2 47 51
0 1 0 599312 5924 381468 0 0 0 78164 516 55 0 9 51 40
0 1 0 519952 6008 459516 0 0 4 78156 622 56 1 11 52 37
1 1 0 436640 6092 541632 0 0 0 82244 622 54 0 11 48 41
0 1 0 436640 6092 541660 0 0 0 8 152 39 0 0 51 49
0 1 0 332224 6200 644252 0 0 4 102800 728 46 1 13 49 36
1 0 0 274492 6260 701056 0 0 4 12328 459 49 0 7 50 43
0 1 0 211220 6324 763356 0 0 0 106940 515 37 1 10 51 39
1 0 0 160412 6376 813468 0 0 0 8224 415 43 0 6 49 45
1 1 0 85980 6452 886556 0 0 4 113516 575 39 1 11 54 34
0 2 0 85968 6452 886620 0 0 0 1640 158 211 0 0 46 54
So apart from seemingly behaving better for buffered writeout, this also
allows us to potentially have more than one bdi thread flushing out data.
This may be useful for NUMA type setups.
A 10 disk test with btrfs performs 26% faster with per-bdi flushing. Other
tests pending. mmap heavy writing also improves considerably.
A separate thread is added to sync the super blocks. In the long term,
adding sync_supers_bdi() functionality could get rid of this thread again.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
fs/buffer.c | 2 +-
fs/fs-writeback.c | 309 ++++++++++++++++++++++++++-----------------
fs/sync.c | 2 +-
include/linux/backing-dev.h | 28 ++++
include/linux/fs.h | 3 +-
include/linux/writeback.h | 2 +-
mm/backing-dev.c | 231 +++++++++++++++++++++++++++++++-
mm/page-writeback.c | 140 +------------------
mm/vmscan.c | 2 +-
9 files changed, 452 insertions(+), 267 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index aed2977..14f0802 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -281,7 +281,7 @@ static void free_more_memory(void)
struct zone *zone;
int nid;
- wakeup_pdflush(1024);
+ wakeup_flusher_threads(1024);
yield();
for_each_online_node(nid) {
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 1137408..aa0b560 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -19,6 +19,8 @@
#include <linux/sched.h>
#include <linux/fs.h>
#include <linux/mm.h>
+#include <linux/kthread.h>
+#include <linux/freezer.h>
#include <linux/writeback.h>
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
@@ -61,10 +63,186 @@ int writeback_in_progress(struct backing_dev_info *bdi)
*/
static void writeback_release(struct backing_dev_info *bdi)
{
- BUG_ON(!writeback_in_progress(bdi));
+ WARN_ON_ONCE(!writeback_in_progress(bdi));
+ bdi->wb_arg.nr_pages = 0;
+ bdi->wb_arg.sb = NULL;
clear_bit(BDI_pdflush, &bdi->state);
}
+int bdi_start_writeback(struct backing_dev_info *bdi, struct super_block *sb,
+ long nr_pages, enum writeback_sync_modes sync_mode)
+{
+ /*
+ * This only happens the first time someone kicks this bdi, so put
+ * it out-of-line.
+ */
+ if (unlikely(!bdi->task)) {
+ bdi_add_default_flusher_task(bdi);
+ return 1;
+ }
+
+ if (writeback_acquire(bdi)) {
+ bdi->wb_arg.nr_pages = nr_pages;
+ bdi->wb_arg.sb = sb;
+ bdi->wb_arg.sync_mode = sync_mode;
+
+ if (bdi->task)
+ wake_up_process(bdi->task);
+ }
+
+ return 0;
+}
+
+/*
+ * The maximum number of pages to writeout in a single bdi flush/kupdate
+ * operation. We do this so we don't hold I_SYNC against an inode for
+ * enormous amounts of time, which would block a userspace task which has
+ * been forced to throttle against that inode. Also, the code reevaluates
+ * the dirty each time it has written this many pages.
+ */
+#define MAX_WRITEBACK_PAGES 1024
+
+/*
+ * Periodic writeback of "old" data.
+ *
+ * Define "old": the first time one of an inode's pages is dirtied, we mark the
+ * dirtying-time in the inode's address_space. So this periodic writeback code
+ * just walks the superblock inode list, writing back any inodes which are
+ * older than a specific point in time.
+ *
+ * Try to run once per dirty_writeback_interval. But if a writeback event
+ * takes longer than a dirty_writeback_interval interval, then leave a
+ * one-second gap.
+ *
+ * older_than_this takes precedence over nr_to_write. So we'll only write back
+ * all dirty pages if they are all attached to "old" mappings.
+ */
+static void bdi_kupdated(struct backing_dev_info *bdi)
+{
+ unsigned long oldest_jif;
+ long nr_to_write;
+ struct writeback_control wbc = {
+ .bdi = bdi,
+ .sync_mode = WB_SYNC_NONE,
+ .older_than_this = &oldest_jif,
+ .nr_to_write = 0,
+ .for_kupdate = 1,
+ .range_cyclic = 1,
+ };
+
+ oldest_jif = jiffies - msecs_to_jiffies(dirty_expire_interval * 10);
+
+ nr_to_write = global_page_state(NR_FILE_DIRTY) +
+ global_page_state(NR_UNSTABLE_NFS) +
+ (inodes_stat.nr_inodes - inodes_stat.nr_unused);
+
+ while (nr_to_write > 0) {
+ wbc.more_io = 0;
+ wbc.encountered_congestion = 0;
+ wbc.nr_to_write = MAX_WRITEBACK_PAGES;
+ generic_sync_bdi_inodes(NULL, &wbc);
+ if (wbc.nr_to_write > 0)
+ break; /* All the old data is written */
+ nr_to_write -= MAX_WRITEBACK_PAGES;
+ }
+}
+
+static inline bool over_bground_thresh(void)
+{
+ unsigned long background_thresh, dirty_thresh;
+
+ get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
+
+ return (global_page_state(NR_FILE_DIRTY) +
+ global_page_state(NR_UNSTABLE_NFS) >= background_thresh);
+}
+
+static void bdi_pdflush(struct backing_dev_info *bdi)
+{
+ struct writeback_control wbc = {
+ .bdi = bdi,
+ .sync_mode = bdi->wb_arg.sync_mode,
+ .older_than_this = NULL,
+ .range_cyclic = 1,
+ };
+ long nr_pages = bdi->wb_arg.nr_pages;
+
+ for (;;) {
+ if (wbc.sync_mode == WB_SYNC_NONE && nr_pages <= 0 &&
+ !over_bground_thresh())
+ break;
+
+ wbc.more_io = 0;
+ wbc.encountered_congestion = 0;
+ wbc.nr_to_write = MAX_WRITEBACK_PAGES;
+ wbc.pages_skipped = 0;
+ generic_sync_bdi_inodes(bdi->wb_arg.sb, &wbc);
+ nr_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
+ /*
+ * If we ran out of stuff to write, bail unless more_io got set
+ */
+ if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
+ if (wbc.more_io)
+ continue;
+ break;
+ }
+ }
+}
+
+/*
+ * Handle writeback of dirty data for the device backed by this bdi. Also
+ * wakes up periodically and does kupdated style flushing.
+ */
+int bdi_writeback_task(struct backing_dev_info *bdi)
+{
+ while (!kthread_should_stop()) {
+ unsigned long wait_jiffies;
+
+ wait_jiffies = msecs_to_jiffies(dirty_writeback_interval * 10);
+ set_current_state(TASK_INTERRUPTIBLE);
+ schedule_timeout(wait_jiffies);
+ try_to_freeze();
+
+ /*
+ * We get here in two cases:
+ *
+ * schedule_timeout() returned because the dirty writeback
+ * interval has elapsed. If that happens, we will be able
+ * to acquire the writeback lock and will proceed to do
+ * kupdated style writeout.
+ *
+ * Someone called bdi_start_writeback(), which will acquire
+ * the writeback lock. This means our writeback_acquire()
+ * below will fail and we call into bdi_pdflush() for
+ * pdflush style writeout.
+ *
+ */
+ if (writeback_acquire(bdi))
+ bdi_kupdated(bdi);
+ else
+ bdi_pdflush(bdi);
+
+ writeback_release(bdi);
+ }
+
+ return 0;
+}
+
+void bdi_writeback_all(struct super_block *sb, struct writeback_control *wbc)
+{
+ struct backing_dev_info *bdi, *tmp;
+
+ mutex_lock(&bdi_lock);
+
+ list_for_each_entry_safe(bdi, tmp, &bdi_list, bdi_list) {
+ if (!bdi_has_dirty_io(bdi))
+ continue;
+ bdi_start_writeback(bdi, sb, wbc->nr_to_write, wbc->sync_mode);
+ }
+
+ mutex_unlock(&bdi_lock);
+}
+
/**
* __mark_inode_dirty - internal function
* @inode: inode to mark
@@ -263,46 +441,6 @@ static void queue_io(struct backing_dev_info *bdi,
move_expired_inodes(&bdi->b_dirty, &bdi->b_io, older_than_this);
}
-static int sb_on_inode_list(struct super_block *sb, struct list_head *list)
-{
- struct inode *inode;
- int ret = 0;
-
- spin_lock(&inode_lock);
- list_for_each_entry(inode, list, i_list) {
- if (inode->i_sb == sb) {
- ret = 1;
- break;
- }
- }
- spin_unlock(&inode_lock);
- return ret;
-}
-
-int sb_has_dirty_inodes(struct super_block *sb)
-{
- struct backing_dev_info *bdi;
- int ret = 0;
-
- /*
- * This is REALLY expensive right now, but it'll go away
- * when the bdi writeback is introduced
- */
- mutex_lock(&bdi_lock);
- list_for_each_entry(bdi, &bdi_list, bdi_list) {
- if (sb_on_inode_list(sb, &bdi->b_dirty) ||
- sb_on_inode_list(sb, &bdi->b_io) ||
- sb_on_inode_list(sb, &bdi->b_more_io)) {
- ret = 1;
- break;
- }
- }
- mutex_unlock(&bdi_lock);
-
- return ret;
-}
-EXPORT_SYMBOL(sb_has_dirty_inodes);
-
/*
* Write a single inode's dirty pages and inode data out to disk.
* If `wait' is set, wait on the writeout.
@@ -461,11 +599,11 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
return __sync_single_inode(inode, wbc);
}
-static void generic_sync_bdi_inodes(struct backing_dev_info *bdi,
- struct writeback_control *wbc,
- struct super_block *sb,
- int is_blkdev_sb)
+void generic_sync_bdi_inodes(struct super_block *sb,
+ struct writeback_control *wbc)
{
+ const int is_blkdev_sb = sb_is_blkdev_sb(sb);
+ struct backing_dev_info *bdi = wbc->bdi;
const unsigned long start = jiffies; /* livelock avoidance */
spin_lock(&inode_lock);
@@ -516,13 +654,6 @@ static void generic_sync_bdi_inodes(struct backing_dev_info *bdi,
continue; /* Skip a congested blockdev */
}
- if (wbc->bdi && bdi != wbc->bdi) {
- if (!is_blkdev_sb)
- break; /* fs has the wrong queue */
- requeue_io(inode);
- continue; /* blockdev has wrong queue */
- }
-
/*
* Was this inode dirtied after sync_sb_inodes was called?
* This keeps sync from extra jobs and livelock.
@@ -530,16 +661,10 @@ static void generic_sync_bdi_inodes(struct backing_dev_info *bdi,
if (inode_dirtied_after(inode, start))
break;
- /* Is another pdflush already flushing this queue? */
- if (current_is_pdflush() && !writeback_acquire(bdi))
- break;
-
BUG_ON(inode->i_state & I_FREEING);
__iget(inode);
pages_skipped = wbc->pages_skipped;
__writeback_single_inode(inode, wbc);
- if (current_is_pdflush())
- writeback_release(bdi);
if (wbc->pages_skipped != pages_skipped) {
/*
* writeback is not making progress due to locked
@@ -578,11 +703,6 @@ static void generic_sync_bdi_inodes(struct backing_dev_info *bdi,
* a variety of queues, so all inodes are searched. For other superblocks,
* assume that all inodes are backed by the same queue.
*
- * FIXME: this linear search could get expensive with many fileystems. But
- * how to fix? We need to go from an address_space to all inodes which share
- * a queue with that address_space. (Easy: have a global "dirty superblocks"
- * list).
- *
* The inodes to be written are parked on bdi->b_io. They are moved back onto
* bdi->b_dirty as they are selected for writing. This way, none can be missed
* on the writer throttling path, and we get decent balancing between many
@@ -591,13 +711,10 @@ static void generic_sync_bdi_inodes(struct backing_dev_info *bdi,
void generic_sync_sb_inodes(struct super_block *sb,
struct writeback_control *wbc)
{
- const int is_blkdev_sb = sb_is_blkdev_sb(sb);
- struct backing_dev_info *bdi;
-
- mutex_lock(&bdi_lock);
- list_for_each_entry(bdi, &bdi_list, bdi_list)
- generic_sync_bdi_inodes(bdi, wbc, sb, is_blkdev_sb);
- mutex_unlock(&bdi_lock);
+ if (wbc->bdi)
+ generic_sync_bdi_inodes(sb, wbc);
+ else
+ bdi_writeback_all(sb, wbc);
if (wbc->sync_mode == WB_SYNC_ALL) {
struct inode *inode, *old_inode = NULL;
@@ -653,58 +770,6 @@ static void sync_sb_inodes(struct super_block *sb,
}
/*
- * Start writeback of dirty pagecache data against all unlocked inodes.
- *
- * Note:
- * We don't need to grab a reference to superblock here. If it has non-empty
- * ->b_dirty it's hadn't been killed yet and kill_super() won't proceed
- * past sync_inodes_sb() until the ->b_dirty/b_io/b_more_io lists are all
- * empty. Since __sync_single_inode() regains inode_lock before it finally moves
- * inode from superblock lists we are OK.
- *
- * If `older_than_this' is non-zero then only flush inodes which have a
- * flushtime older than *older_than_this.
- *
- * If `bdi' is non-zero then we will scan the first inode against each
- * superblock until we find the matching ones. One group will be the dirty
- * inodes against a filesystem. Then when we hit the dummy blockdev superblock,
- * sync_sb_inodes will seekout the blockdev which matches `bdi'. Maybe not
- * super-efficient but we're about to do a ton of I/O...
- */
-void
-writeback_inodes(struct writeback_control *wbc)
-{
- struct super_block *sb;
-
- might_sleep();
- spin_lock(&sb_lock);
-restart:
- list_for_each_entry_reverse(sb, &super_blocks, s_list) {
- if (sb_has_dirty_inodes(sb)) {
- /* we're making our own get_super here */
- sb->s_count++;
- spin_unlock(&sb_lock);
- /*
- * If we can't get the readlock, there's no sense in
- * waiting around, most of the time the FS is going to
- * be unmounted by the time it is released.
- */
- if (down_read_trylock(&sb->s_umount)) {
- if (sb->s_root)
- sync_sb_inodes(sb, wbc);
- up_read(&sb->s_umount);
- }
- spin_lock(&sb_lock);
- if (__put_super_and_need_restart(sb))
- goto restart;
- }
- if (wbc->nr_to_write <= 0)
- break;
- }
- spin_unlock(&sb_lock);
-}
-
-/*
* writeback and wait upon the filesystem's dirty inodes. The caller will
* do this in two passes - one to write, and one to wait.
*
diff --git a/fs/sync.c b/fs/sync.c
index 7abc65f..3887f10 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -23,7 +23,7 @@
*/
static void do_sync(unsigned long wait)
{
- wakeup_pdflush(0);
+ wakeup_flusher_threads(0);
sync_inodes(0); /* All mappings, inodes and their blockdevs */
vfs_dq_sync(NULL);
sync_supers(); /* Write the superblocks */
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 8719c87..4a312e9 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -13,6 +13,7 @@
#include <linux/proportions.h>
#include <linux/kernel.h>
#include <linux/fs.h>
+#include <linux/writeback.h>
#include <asm/atomic.h>
struct page;
@@ -24,6 +25,7 @@ struct dentry;
*/
enum bdi_state {
BDI_pdflush, /* A pdflush thread is working this device */
+ BDI_pending, /* On its way to being activated */
BDI_async_congested, /* The async (write) queue is getting full */
BDI_sync_congested, /* The sync queue is getting full */
BDI_unused, /* Available bits start here */
@@ -39,6 +41,12 @@ enum bdi_stat_item {
#define BDI_STAT_BATCH (8*(1+ilog2(nr_cpu_ids)))
+struct bdi_writeback_arg {
+ unsigned long nr_pages;
+ struct super_block *sb;
+ enum writeback_sync_modes sync_mode;
+};
+
struct backing_dev_info {
struct list_head bdi_list;
@@ -60,6 +68,8 @@ struct backing_dev_info {
struct device *dev;
+ struct task_struct *task; /* writeback task */
+ struct bdi_writeback_arg wb_arg; /* protected by BDI_pdflush */
struct list_head b_dirty; /* dirty inodes */
struct list_head b_io; /* parked for writeback */
struct list_head b_more_io; /* parked for more writeback */
@@ -77,10 +87,22 @@ int bdi_register(struct backing_dev_info *bdi, struct device *parent,
const char *fmt, ...);
int bdi_register_dev(struct backing_dev_info *bdi, dev_t dev);
void bdi_unregister(struct backing_dev_info *bdi);
+int bdi_start_writeback(struct backing_dev_info *bdi, struct super_block *sb,
+ long nr_pages, enum writeback_sync_modes sync_mode);
+int bdi_writeback_task(struct backing_dev_info *bdi);
+void bdi_writeback_all(struct super_block *sb, struct writeback_control *wbc);
+void bdi_add_default_flusher_task(struct backing_dev_info *bdi);
extern struct mutex bdi_lock;
extern struct list_head bdi_list;
+static inline int bdi_has_dirty_io(struct backing_dev_info *bdi)
+{
+ return !list_empty(&bdi->b_dirty) ||
+ !list_empty(&bdi->b_io) ||
+ !list_empty(&bdi->b_more_io);
+}
+
static inline void __add_bdi_stat(struct backing_dev_info *bdi,
enum bdi_stat_item item, s64 amount)
{
@@ -196,6 +218,7 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio);
#define BDI_CAP_EXEC_MAP 0x00000040
#define BDI_CAP_NO_ACCT_WB 0x00000080
#define BDI_CAP_SWAP_BACKED 0x00000100
+#define BDI_CAP_FLUSH_FORKER 0x00000200
#define BDI_CAP_VMFLAGS \
(BDI_CAP_READ_MAP | BDI_CAP_WRITE_MAP | BDI_CAP_EXEC_MAP)
@@ -265,6 +288,11 @@ static inline bool bdi_cap_swap_backed(struct backing_dev_info *bdi)
return bdi->capabilities & BDI_CAP_SWAP_BACKED;
}
+static inline bool bdi_cap_flush_forker(struct backing_dev_info *bdi)
+{
+ return bdi->capabilities & BDI_CAP_FLUSH_FORKER;
+}
+
static inline bool mapping_cap_writeback_dirty(struct address_space *mapping)
{
return bdi_cap_writeback_dirty(mapping->backing_dev_info);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6b475d4..ecdc544 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2063,6 +2063,8 @@ extern int invalidate_inode_pages2_range(struct address_space *mapping,
pgoff_t start, pgoff_t end);
extern void generic_sync_sb_inodes(struct super_block *sb,
struct writeback_control *wbc);
+extern void generic_sync_bdi_inodes(struct super_block *sb,
+ struct writeback_control *);
extern int write_inode_now(struct inode *, int);
extern int filemap_fdatawrite(struct address_space *);
extern int filemap_flush(struct address_space *);
@@ -2180,7 +2182,6 @@ extern int bdev_read_only(struct block_device *);
extern int set_blocksize(struct block_device *, int);
extern int sb_set_blocksize(struct super_block *, int);
extern int sb_min_blocksize(struct super_block *, int);
-extern int sb_has_dirty_inodes(struct super_block *);
extern int generic_file_mmap(struct file *, struct vm_area_struct *);
extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 9344547..a8e9f78 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -99,7 +99,7 @@ static inline void inode_sync_wait(struct inode *inode)
/*
* mm/page-writeback.c
*/
-int wakeup_pdflush(long nr_pages);
+void wakeup_flusher_threads(long nr_pages);
void laptop_io_completion(void);
void laptop_sync_completion(void);
void throttle_vm_writeout(gfp_t gfp_mask);
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index de0bbfe..3dbfc76 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -1,8 +1,11 @@
#include <linux/wait.h>
#include <linux/backing-dev.h>
+#include <linux/kthread.h>
+#include <linux/freezer.h>
#include <linux/fs.h>
#include <linux/pagemap.h>
+#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/module.h>
#include <linux/writeback.h>
@@ -16,7 +19,7 @@ EXPORT_SYMBOL(default_unplug_io_fn);
struct backing_dev_info default_backing_dev_info = {
.ra_pages = VM_MAX_READAHEAD * 1024 / PAGE_CACHE_SIZE,
.state = 0,
- .capabilities = BDI_CAP_MAP_COPY,
+ .capabilities = BDI_CAP_MAP_COPY | BDI_CAP_FLUSH_FORKER,
.unplug_io_fn = default_unplug_io_fn,
};
EXPORT_SYMBOL_GPL(default_backing_dev_info);
@@ -24,6 +27,14 @@ EXPORT_SYMBOL_GPL(default_backing_dev_info);
static struct class *bdi_class;
DEFINE_MUTEX(bdi_lock);
LIST_HEAD(bdi_list);
+LIST_HEAD(bdi_pending_list);
+
+static struct task_struct *sync_supers_tsk;
+static struct timer_list sync_supers_timer;
+
+static int bdi_sync_supers(void *);
+static void sync_supers_timer_fn(unsigned long);
+static void arm_supers_timer(void);
#ifdef CONFIG_DEBUG_FS
#include <linux/debugfs.h>
@@ -187,6 +198,13 @@ static int __init default_bdi_init(void)
{
int err;
+ sync_supers_tsk = kthread_run(bdi_sync_supers, NULL, "sync_supers");
+ BUG_ON(!sync_supers_tsk);
+
+ init_timer(&sync_supers_timer);
+ setup_timer(&sync_supers_timer, sync_supers_timer_fn, 0);
+ arm_supers_timer();
+
err = bdi_init(&default_backing_dev_info);
if (!err)
bdi_register(&default_backing_dev_info, NULL, "default");
@@ -195,6 +213,172 @@ static int __init default_bdi_init(void)
}
subsys_initcall(default_bdi_init);
+static int bdi_start_fn(void *ptr)
+{
+ struct backing_dev_info *bdi = ptr;
+ struct task_struct *tsk = current;
+
+ /*
+ * Add us to the active bdi_list
+ */
+ mutex_lock(&bdi_lock);
+ list_add(&bdi->bdi_list, &bdi_list);
+ mutex_unlock(&bdi_lock);
+
+ tsk->flags |= PF_FLUSHER | PF_SWAPWRITE;
+ set_freezable();
+
+ /*
+ * Our parent may run at a different priority, just set us to normal
+ */
+ set_user_nice(tsk, 0);
+
+ /*
+ * Clear pending bit and wakeup anybody waiting to tear us down
+ */
+ clear_bit(BDI_pending, &bdi->state);
+ smp_mb__after_clear_bit();
+ wake_up_bit(&bdi->state, BDI_pending);
+
+ return bdi_writeback_task(bdi);
+}
+
+static void bdi_flush_io(struct backing_dev_info *bdi)
+{
+ struct writeback_control wbc = {
+ .bdi = bdi,
+ .sync_mode = WB_SYNC_NONE,
+ .older_than_this = NULL,
+ .range_cyclic = 1,
+ .nr_to_write = 1024,
+ };
+
+ generic_sync_bdi_inodes(NULL, &wbc);
+}
+
+/*
+ * kupdated() used to do this. We cannot do it from the bdi_forker_task()
+ * or we risk deadlocking on ->s_umount. The longer term solution would be
+ * to implement sync_supers_bdi() or similar and simply do it from the
+ * bdi writeback tasks individually.
+ */
+static int bdi_sync_supers(void *unused)
+{
+ set_user_nice(current, 0);
+
+ while (!kthread_should_stop()) {
+ set_current_state(TASK_INTERRUPTIBLE);
+ schedule();
+
+ /*
+ * Do this periodically, like kupdated() did before.
+ */
+ sync_supers();
+ }
+
+ return 0;
+}
+
+static void arm_supers_timer(void)
+{
+ unsigned long next;
+
+ next = msecs_to_jiffies(dirty_writeback_interval * 10) + jiffies;
+ mod_timer(&sync_supers_timer, round_jiffies_up(next));
+}
+
+static void sync_supers_timer_fn(unsigned long unused)
+{
+ wake_up_process(sync_supers_tsk);
+ arm_supers_timer();
+}
+
+static int bdi_forker_task(void *ptr)
+{
+ struct backing_dev_info *me = ptr;
+
+ for (;;) {
+ struct backing_dev_info *bdi, *tmp;
+
+ /*
+ * Temporary measure, we want to make sure we don't see
+ * dirty data on the default backing_dev_info
+ */
+ if (bdi_has_dirty_io(me))
+ bdi_flush_io(me);
+
+ mutex_lock(&bdi_lock);
+
+ /*
+ * Check if any existing bdi's have dirty data without
+ * a thread registered. If so, set that up.
+ */
+ list_for_each_entry_safe(bdi, tmp, &bdi_list, bdi_list) {
+ if (bdi->task || !bdi_has_dirty_io(bdi))
+ continue;
+
+ bdi_add_default_flusher_task(bdi);
+ }
+
+ if (list_empty(&bdi_pending_list)) {
+ unsigned long wait;
+
+ mutex_unlock(&bdi_lock);
+ wait = msecs_to_jiffies(dirty_writeback_interval * 10);
+ set_current_state(TASK_INTERRUPTIBLE);
+ schedule_timeout(wait);
+ try_to_freeze();
+ continue;
+ }
+
+ /*
+ * This is our real job - check for pending entries in
+ * bdi_pending_list, and create the tasks that got added
+ */
+ bdi = list_entry(bdi_pending_list.next, struct backing_dev_info,
+ bdi_list);
+ list_del_init(&bdi->bdi_list);
+ mutex_unlock(&bdi_lock);
+
+ BUG_ON(bdi->task);
+
+ bdi->task = kthread_run(bdi_start_fn, bdi, "bdi-%s",
+ dev_name(bdi->dev));
+ /*
+ * If task creation fails, then readd the bdi to
+ * the pending list and force writeout of the bdi
+ * from this forker thread. That will free some memory
+ * and we can try again.
+ */
+ if (!bdi->task) {
+ /*
+ * Add this 'bdi' to the back, so we get
+ * a chance to flush other bdi's to free
+ * memory.
+ */
+ mutex_lock(&bdi_lock);
+ list_add_tail(&bdi->bdi_list, &bdi_pending_list);
+ mutex_unlock(&bdi_lock);
+
+ bdi_flush_io(bdi);
+ }
+ }
+
+ return 0;
+}
+
+void bdi_add_default_flusher_task(struct backing_dev_info *bdi)
+{
+ if (test_and_set_bit(BDI_pending, &bdi->state))
+ return;
+
+ mutex_lock(&bdi_lock);
+ list_move_tail(&bdi->bdi_list, &bdi_pending_list);
+ mutex_unlock(&bdi_lock);
+
+ wake_up_process(default_backing_dev_info.task);
+}
+
int bdi_register(struct backing_dev_info *bdi, struct device *parent,
const char *fmt, ...)
{
@@ -218,8 +402,25 @@ int bdi_register(struct backing_dev_info *bdi, struct device *parent,
mutex_unlock(&bdi_lock);
bdi->dev = dev;
- bdi_debug_register(bdi, dev_name(dev));
+ /*
+ * Just start the forker thread for our default backing_dev_info,
+ * and add other bdi's to the list. They will get a thread created
+ * on-demand when they need it.
+ */
+ if (bdi_cap_flush_forker(bdi)) {
+ bdi->task = kthread_run(bdi_forker_task, bdi, "bdi-%s",
+ dev_name(dev));
+ if (!bdi->task) {
+ mutex_lock(&bdi_lock);
+ list_del(&bdi->bdi_list);
+ mutex_unlock(&bdi_lock);
+ ret = -ENOMEM;
+ goto exit;
+ }
+ }
+
+ bdi_debug_register(bdi, dev_name(dev));
exit:
return ret;
}
@@ -231,8 +432,19 @@ int bdi_register_dev(struct backing_dev_info *bdi, dev_t dev)
}
EXPORT_SYMBOL(bdi_register_dev);
-static void bdi_remove_from_list(struct backing_dev_info *bdi)
+static int sched_wait(void *word)
{
+ schedule();
+ return 0;
+}
+
+static void bdi_wb_shutdown(struct backing_dev_info *bdi)
+{
+ /*
+ * If setup is pending, wait for that to complete first
+ */
+ wait_on_bit(&bdi->state, BDI_pending, sched_wait, TASK_UNINTERRUPTIBLE);
+
mutex_lock(&bdi_lock);
list_del(&bdi->bdi_list);
mutex_unlock(&bdi_lock);
@@ -241,7 +453,13 @@ static void bdi_remove_from_list(struct backing_dev_info *bdi)
void bdi_unregister(struct backing_dev_info *bdi)
{
if (bdi->dev) {
- bdi_remove_from_list(bdi);
+ if (!bdi_cap_flush_forker(bdi)) {
+ bdi_wb_shutdown(bdi);
+ if (bdi->task) {
+ kthread_stop(bdi->task);
+ bdi->task = NULL;
+ }
+ }
bdi_debug_unregister(bdi);
device_unregister(bdi->dev);
bdi->dev = NULL;
@@ -251,8 +469,7 @@ EXPORT_SYMBOL(bdi_unregister);
int bdi_init(struct backing_dev_info *bdi)
{
- int i;
- int err;
+ int i, err;
bdi->dev = NULL;
@@ -277,8 +494,6 @@ int bdi_init(struct backing_dev_info *bdi)
err:
while (i--)
percpu_counter_destroy(&bdi->bdi_stat[i]);
-
- bdi_remove_from_list(bdi);
}
return err;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 7c44314..46c62b0 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -36,15 +36,6 @@
#include <linux/pagevec.h>
/*
- * The maximum number of pages to writeout in a single bdflush/kupdate
- * operation. We do this so we don't hold I_SYNC against an inode for
- * enormous amounts of time, which would block a userspace task which has
- * been forced to throttle against that inode. Also, the code reevaluates
- * the dirty each time it has written this many pages.
- */
-#define MAX_WRITEBACK_PAGES 1024
-
-/*
* After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
* will look to see if it needs to force writeback or throttling.
*/
@@ -117,8 +108,6 @@ EXPORT_SYMBOL(laptop_mode);
/* End of sysctl-exported parameters */
-static void background_writeout(unsigned long _min_pages);
-
/*
* Scale the writeback cache size proportional to the relative writeout speeds.
*
@@ -539,7 +528,7 @@ static void balance_dirty_pages(struct address_space *mapping)
* been flushed to permanent storage.
*/
if (bdi_nr_reclaimable) {
- writeback_inodes(&wbc);
+ generic_sync_bdi_inodes(NULL, &wbc);
pages_written += write_chunk - wbc.nr_to_write;
get_dirty_limits(&background_thresh, &dirty_thresh,
&bdi_thresh, bdi);
@@ -590,7 +579,7 @@ static void balance_dirty_pages(struct address_space *mapping)
(!laptop_mode && (global_page_state(NR_FILE_DIRTY)
+ global_page_state(NR_UNSTABLE_NFS)
> background_thresh)))
- pdflush_operation(background_writeout, 0);
+ bdi_start_writeback(bdi, NULL, 0, WB_SYNC_NONE);
}
void set_page_dirty_balance(struct page *page, int page_mkwrite)
@@ -675,152 +664,41 @@ void throttle_vm_writeout(gfp_t gfp_mask)
}
/*
- * writeback at least _min_pages, and keep writing until the amount of dirty
- * memory is less than the background threshold, or until we're all clean.
+ * Start writeback of `nr_pages' pages. If `nr_pages' is zero, write back
+ * the whole world.
*/
-static void background_writeout(unsigned long _min_pages)
+void wakeup_flusher_threads(long nr_pages)
{
- long min_pages = _min_pages;
struct writeback_control wbc = {
- .bdi = NULL,
.sync_mode = WB_SYNC_NONE,
.older_than_this = NULL,
- .nr_to_write = 0,
- .nonblocking = 1,
.range_cyclic = 1,
};
- for ( ; ; ) {
- unsigned long background_thresh;
- unsigned long dirty_thresh;
-
- get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
- if (global_page_state(NR_FILE_DIRTY) +
- global_page_state(NR_UNSTABLE_NFS) < background_thresh
- && min_pages <= 0)
- break;
- wbc.more_io = 0;
- wbc.encountered_congestion = 0;
- wbc.nr_to_write = MAX_WRITEBACK_PAGES;
- wbc.pages_skipped = 0;
- writeback_inodes(&wbc);
- min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
- if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
- /* Wrote less than expected */
- if (wbc.encountered_congestion || wbc.more_io)
- congestion_wait(WRITE, HZ/10);
- else
- break;
- }
- }
-}
-
-/*
- * Start writeback of `nr_pages' pages. If `nr_pages' is zero, write back
- * the whole world. Returns 0 if a pdflush thread was dispatched. Returns
- * -1 if all pdflush threads were busy.
- */
-int wakeup_pdflush(long nr_pages)
-{
if (nr_pages == 0)
nr_pages = global_page_state(NR_FILE_DIRTY) +
global_page_state(NR_UNSTABLE_NFS);
- return pdflush_operation(background_writeout, nr_pages);
+ wbc.nr_to_write = nr_pages;
+ bdi_writeback_all(NULL, &wbc);
}
-static void wb_timer_fn(unsigned long unused);
static void laptop_timer_fn(unsigned long unused);
-static DEFINE_TIMER(wb_timer, wb_timer_fn, 0, 0);
static DEFINE_TIMER(laptop_mode_wb_timer, laptop_timer_fn, 0, 0);
/*
- * Periodic writeback of "old" data.
- *
- * Define "old": the first time one of an inode's pages is dirtied, we mark the
- * dirtying-time in the inode's address_space. So this periodic writeback code
- * just walks the superblock inode list, writing back any inodes which are
- * older than a specific point in time.
- *
- * Try to run once per dirty_writeback_interval. But if a writeback event
- * takes longer than a dirty_writeback_interval interval, then leave a
- * one-second gap.
- *
- * older_than_this takes precedence over nr_to_write. So we'll only write back
- * all dirty pages if they are all attached to "old" mappings.
- */
-static void wb_kupdate(unsigned long arg)
-{
- unsigned long oldest_jif;
- unsigned long start_jif;
- unsigned long next_jif;
- long nr_to_write;
- struct writeback_control wbc = {
- .bdi = NULL,
- .sync_mode = WB_SYNC_NONE,
- .older_than_this = &oldest_jif,
- .nr_to_write = 0,
- .nonblocking = 1,
- .for_kupdate = 1,
- .range_cyclic = 1,
- };
-
- sync_supers();
-
- oldest_jif = jiffies - msecs_to_jiffies(dirty_expire_interval * 10);
- start_jif = jiffies;
- next_jif = start_jif + msecs_to_jiffies(dirty_writeback_interval * 10);
- nr_to_write = global_page_state(NR_FILE_DIRTY) +
- global_page_state(NR_UNSTABLE_NFS) +
- (inodes_stat.nr_inodes - inodes_stat.nr_unused);
- while (nr_to_write > 0) {
- wbc.more_io = 0;
- wbc.encountered_congestion = 0;
- wbc.nr_to_write = MAX_WRITEBACK_PAGES;
- writeback_inodes(&wbc);
- if (wbc.nr_to_write > 0) {
- if (wbc.encountered_congestion || wbc.more_io)
- congestion_wait(WRITE, HZ/10);
- else
- break; /* All the old data is written */
- }
- nr_to_write -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
- }
- if (time_before(next_jif, jiffies + HZ))
- next_jif = jiffies + HZ;
- if (dirty_writeback_interval)
- mod_timer(&wb_timer, next_jif);
-}
-
-/*
* sysctl handler for /proc/sys/vm/dirty_writeback_centisecs
*/
int dirty_writeback_centisecs_handler(ctl_table *table, int write,
struct file *file, void __user *buffer, size_t *length, loff_t *ppos)
{
proc_dointvec(table, write, file, buffer, length, ppos);
- if (dirty_writeback_interval)
- mod_timer(&wb_timer, jiffies +
- msecs_to_jiffies(dirty_writeback_interval * 10));
- else
- del_timer(&wb_timer);
return 0;
}
-static void wb_timer_fn(unsigned long unused)
-{
- if (pdflush_operation(wb_kupdate, 0) < 0)
- mod_timer(&wb_timer, jiffies + HZ); /* delay 1 second */
-}
-
-static void laptop_flush(unsigned long unused)
-{
- sys_sync();
-}
-
static void laptop_timer_fn(unsigned long unused)
{
- pdflush_operation(laptop_flush, 0);
+ wakeup_flusher_threads(0);
}
/*
@@ -903,8 +781,6 @@ void __init page_writeback_init(void)
{
int shift;
- mod_timer(&wb_timer,
- jiffies + msecs_to_jiffies(dirty_writeback_interval * 10));
writeback_set_ratelimit();
register_cpu_notifier(&ratelimit_nb);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5fa3eda..e37fd38 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1654,7 +1654,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
*/
if (total_scanned > sc->swap_cluster_max +
sc->swap_cluster_max / 2) {
- wakeup_pdflush(laptop_mode ? 0 : total_scanned);
+ wakeup_flusher_threads(laptop_mode ? 0 : total_scanned);
sc->may_writepage = 1;
}
--
1.6.3.rc0.1.gf800
^ permalink raw reply related [flat|nested] 41+ messages in thread* Re: [PATCH 04/11] writeback: switch to per-bdi threads for flushing data
2009-05-28 11:46 ` [PATCH 04/11] writeback: switch to per-bdi threads for flushing data Jens Axboe
@ 2009-05-28 14:13 ` Artem Bityutskiy
2009-05-28 22:28 ` Jens Axboe
0 siblings, 1 reply; 41+ messages in thread
From: Artem Bityutskiy @ 2009-05-28 14:13 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-kernel, linux-fsdevel, tytso, chris.mason, david, hch, akpm,
jack, yanmin_zhang, richard, damien.wyart
Jens Axboe wrote:
> +#define BDI_CAP_FLUSH_FORKER 0x00000200
Would it please be possible to add a comment about
what this flag is, and whether it is for internal
usage or not? It is not immediately obvious to me.
Artem.
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH 04/11] writeback: switch to per-bdi threads for flushing data
2009-05-28 14:13 ` Artem Bityutskiy
@ 2009-05-28 22:28 ` Jens Axboe
0 siblings, 0 replies; 41+ messages in thread
From: Jens Axboe @ 2009-05-28 22:28 UTC (permalink / raw)
To: Artem Bityutskiy
Cc: linux-kernel, linux-fsdevel, tytso, chris.mason, david, hch, akpm,
jack, yanmin_zhang, richard, damien.wyart
On Thu, May 28 2009, Artem Bityutskiy wrote:
> Jens Axboe wrote:
>> +#define BDI_CAP_FLUSH_FORKER 0x00000200
>
> Would it please be possible to add a comment about
> what this flag is, and whether it is for internal
> usage or not? It is not immediately obvious to me.
It's internal; probably I should just replace it with a check for
&default_backing_dev_info. If not, I'll add a comment.
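If it does become a check against &default_backing_dev_info, the helper could
presumably collapse into a plain pointer comparison -- a sketch under that
assumption, not a committed change:

static inline bool bdi_cap_flush_forker(struct backing_dev_info *bdi)
{
	/*
	 * Only the default bdi runs the forker task that spawns the
	 * per-bdi flusher threads, so comparing against it is enough.
	 */
	return bdi == &default_backing_dev_info;
}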
--
Jens Axboe
^ permalink raw reply [flat|nested] 41+ messages in thread