[PATCH 0/8] Sync and VFS scalability improvements V2

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/8] Sync and VFS scalability improvements V2
@ 2015-03-20 17:14 Josef Bacik
  2015-03-20 17:14 ` [PATCH 1/8] writeback: plug writeback at a high level Josef Bacik
                   ` (7 more replies)
  0 siblings, 8 replies; 24+ messages in thread
From: Josef Bacik @ 2015-03-20 17:14 UTC (permalink / raw)
  To: linux-fsdevel, david, viro, jack, linux-kernel

Here is the update to the patches based on comments from Al and Jan.  The
changes since V1 are

- Dropped the IOP_NOTHASHED patch and replaced it wit the hlist_fake patch as
  per Al's suggestion.
- Dropped "inode: convert per-sb inode list to a list_lru" since Jan has found
  some issues with it and it doesn't seem as awesome as we want.
- Fixed "inode: don't softlockup when evicting inodes" since it was based on the
  list_lru patch.
- Added my SOB to all the patches as per Jan's suggestion and added his relevant
  Reviewed-by's.

I'm getting new performance numbers for these patches now, hopefully that stuff
will finish soon and I can get them posted.  I wanted to get the updated
patchset up as quickly as possible since we're getting close to the merge window
and I'm getting ready to disappear into vacation for a little bit.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 1/8] writeback: plug writeback at a high level
  2015-03-20 17:14 [PATCH 0/8] Sync and VFS scalability improvements V2 Josef Bacik
@ 2015-03-20 17:14 ` Josef Bacik
  2015-03-20 17:14 ` [PATCH 2/8] inode: add hlist_fake to avoid the inode hash lock in evict Josef Bacik
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: Josef Bacik @ 2015-03-20 17:14 UTC (permalink / raw)
  To: linux-fsdevel, david, viro, jack, linux-kernel; +Cc: Dave Chinner

From: Dave Chinner <dchinner@redhat.com>

Doing writeback on lots of little files causes terrible IOPS storms
because of the per-mapping writeback plugging we do. This
essentially causes imeediate dispatch of IO for each mapping,
regardless of the context in which writeback is occurring.

IOWs, running a concurrent write-lots-of-small 4k files using fsmark
on XFS results in a huge number of IOPS being issued for data
writes.  Metadata writes are sorted and plugged at a high level by
XFS, so aggregate nicely into large IOs. However, data writeback IOs
are dispatched in individual 4k IOs, even when the blocks of two
consecutively written files are adjacent.

Test VM: 8p, 8GB RAM, 4xSSD in RAID0, 100TB sparse XFS filesystem,
metadata CRCs enabled.

Kernel: 3.10-rc5 + xfsdev + my 3.11 xfs queue (~70 patches)

Test:

$ ./fs_mark  -D  10000  -S0  -n  10000  -s  4096  -L  120  -d
/mnt/scratch/0  -d  /mnt/scratch/1  -d  /mnt/scratch/2  -d
/mnt/scratch/3  -d  /mnt/scratch/4  -d  /mnt/scratch/5  -d
/mnt/scratch/6  -d  /mnt/scratch/7

Result:

		wall	sys	create rate	Physical write IO
		time	CPU	(avg files/s)	 IOPS	Bandwidth
		-----	-----	------------	------	---------
unpatched	6m56s	15m47s	24,000+/-500	26,000	130MB/s
patched		5m06s	13m28s	32,800+/-600	 1,500	180MB/s
improvement	-26.44%	-14.68%	  +36.67%	-94.23%	+38.46%

If I use zero length files, this workload at about 500 IOPS, so
plugging drops the data IOs from roughly 25,500/s to 1000/s.
3 lines of code, 35% better throughput for 15% less CPU.

The benefits of plugging at this layer are likely to be higher for
spinning media as the IO patterns for this workload are going make a
much bigger difference on high IO latency devices.....

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index e907052..a9ff2b7 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -659,7 +659,9 @@ static long writeback_sb_inodes(struct super_block *sb,
 	unsigned long start_time = jiffies;
 	long write_chunk;
 	long wrote = 0;  /* count both pages and inodes */
+	struct blk_plug plug;

+	blk_start_plug(&plug);
 	while (!list_empty(&wb->b_io)) {
 		struct inode *inode = wb_inode(wb->b_io.prev);

@@ -756,6 +758,7 @@ static long writeback_sb_inodes(struct super_block *sb,
 				break;
 		}
 	}
+	blk_finish_plug(&plug);
 	return wrote;
 }

-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 2/8] inode: add hlist_fake to avoid the inode hash lock in evict
  2015-03-20 17:14 [PATCH 0/8] Sync and VFS scalability improvements V2 Josef Bacik
  2015-03-20 17:14 ` [PATCH 1/8] writeback: plug writeback at a high level Josef Bacik
@ 2015-03-20 17:14 ` Josef Bacik
  2015-03-20 17:14 ` [PATCH 3/8] inode: convert inode_sb_list_lock to per-sb Josef Bacik
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: Josef Bacik @ 2015-03-20 17:14 UTC (permalink / raw)
  To: linux-fsdevel, david, viro, jack, linux-kernel

Some filesystems don't use the VFS inode hash and fake the fact they
are hashed so that all the writeback code works correctly. However,
this means the evict() path still tries to remove the inode from the
hash, meaning that the inode_hash_lock() needs to be taken
unnecessarily. Hence under certain workloads the inode_hash_lock can
be contended even if the inode is never actually hashed.

To avoid this add hlist_fake to test if the inode isn't actually
hashed to avoid taking the hash lock on inodes that have never been
hashed.  Based on Dave Chinner's

inode: add IOP_NOTHASHED to avoid inode hash lock in evict

basd on Al's suggestions.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 include/linux/fs.h   | 2 +-
 include/linux/list.h | 5 +++++
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index b4d71b5..31da757 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2528,7 +2528,7 @@ static inline void insert_inode_hash(struct inode *inode)
 extern void __remove_inode_hash(struct inode *);
 static inline void remove_inode_hash(struct inode *inode)
 {
-	if (!inode_unhashed(inode))
+	if (!inode_unhashed(inode) && !hlist_fake(&inode->i_hash))
 		__remove_inode_hash(inode);
 }
 
diff --git a/include/linux/list.h b/include/linux/list.h
index feb773c..3e3e64a 100644
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -672,6 +672,11 @@ static inline void hlist_add_fake(struct hlist_node *n)
 	n->pprev = &n->next;
 }
 
+static inline bool hlist_fake(struct hlist_node *h)
+{
+	return h->pprev == &h->next;
+}
+
 /*
  * Move a list from one list head to another. Fixup the pprev
  * reference of the first entry if it exists.
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 3/8] inode: convert inode_sb_list_lock to per-sb
  2015-03-20 17:14 [PATCH 0/8] Sync and VFS scalability improvements V2 Josef Bacik
  2015-03-20 17:14 ` [PATCH 1/8] writeback: plug writeback at a high level Josef Bacik
  2015-03-20 17:14 ` [PATCH 2/8] inode: add hlist_fake to avoid the inode hash lock in evict Josef Bacik
@ 2015-03-20 17:14 ` Josef Bacik
  2015-03-20 17:14 ` [PATCH 4/8] sync: serialise per-superblock sync operations Josef Bacik
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: Josef Bacik @ 2015-03-20 17:14 UTC (permalink / raw)
  To: linux-fsdevel, david, viro, jack, linux-kernel; +Cc: Dave Chinner

From: Dave Chinner <dchinner@redhat.com>

The process of reducing contention on per-superblock inode lists
starts with moving the locking to match the per-superblock inode
list. This takes the global lock out of the picture and reduces the
contention problems to within a single filesystem. This doesn't get
rid of contention as the locks still have global CPU scope, but it
does isolate operations on different superblocks form each other.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/block_dev.c                   | 12 ++++++------
 fs/drop_caches.c                 | 10 ++++++----
 fs/fs-writeback.c                | 12 ++++++------
 fs/inode.c                       | 28 +++++++++++++---------------
 fs/internal.h                    |  1 -
 fs/notify/inode_mark.c           | 20 ++++++++++----------
 fs/quota/dquot.c                 | 16 ++++++++--------
 fs/super.c                       |  3 ++-
 include/linux/fs.h               |  5 ++++-
 include/linux/fsnotify_backend.h |  4 ++--
 10 files changed, 57 insertions(+), 54 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 975266b..2eb2436 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1753,7 +1753,7 @@ void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
 {
 	struct inode *inode, *old_inode = NULL;
 
-	spin_lock(&inode_sb_list_lock);
+	spin_lock(&blockdev_superblock->s_inode_list_lock);
 	list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list) {
 		struct address_space *mapping = inode->i_mapping;
 
@@ -1765,13 +1765,13 @@ void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&inode_sb_list_lock);
+		spin_unlock(&blockdev_superblock->s_inode_list_lock);
 		/*
 		 * We hold a reference to 'inode' so it couldn't have been
 		 * removed from s_inodes list while we dropped the
-		 * inode_sb_list_lock.  We cannot iput the inode now as we can
+		 * s_inode_list_lock  We cannot iput the inode now as we can
 		 * be holding the last reference and we cannot iput it under
-		 * inode_sb_list_lock. So we keep the reference and iput it
+		 * s_inode_list_lock. So we keep the reference and iput it
 		 * later.
 		 */
 		iput(old_inode);
@@ -1779,8 +1779,8 @@ void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
 
 		func(I_BDEV(inode), arg);
 
-		spin_lock(&inode_sb_list_lock);
+		spin_lock(&blockdev_superblock->s_inode_list_lock);
 	}
-	spin_unlock(&inode_sb_list_lock);
+	spin_unlock(&blockdev_superblock->s_inode_list_lock);
 	iput(old_inode);
 }
diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index 5718cb9..d72d52b 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -17,7 +17,7 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
 {
 	struct inode *inode, *toput_inode = NULL;
 
-	spin_lock(&inode_sb_list_lock);
+	spin_lock(&sb->s_inode_list_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		spin_lock(&inode->i_lock);
 		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
@@ -27,13 +27,15 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&inode_sb_list_lock);
+		spin_unlock(&sb->s_inode_list_lock);
+
 		invalidate_mapping_pages(inode->i_mapping, 0, -1);
 		iput(toput_inode);
 		toput_inode = inode;
-		spin_lock(&inode_sb_list_lock);
+
+		spin_lock(&sb->s_inode_list_lock);
 	}
-	spin_unlock(&inode_sb_list_lock);
+	spin_unlock(&sb->s_inode_list_lock);
 	iput(toput_inode);
 }
 
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index a9ff2b7..9780f6c 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1298,7 +1298,7 @@ static void wait_sb_inodes(struct super_block *sb)
 	 */
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
-	spin_lock(&inode_sb_list_lock);
+	spin_lock(&sb->s_inode_list_lock);
 
 	/*
 	 * Data integrity sync. Must wait for all pages under writeback,
@@ -1318,14 +1318,14 @@ static void wait_sb_inodes(struct super_block *sb)
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&inode_sb_list_lock);
+		spin_unlock(&sb->s_inode_list_lock);
 
 		/*
 		 * We hold a reference to 'inode' so it couldn't have been
 		 * removed from s_inodes list while we dropped the
-		 * inode_sb_list_lock.  We cannot iput the inode now as we can
+		 * s_inode_list_lock.  We cannot iput the inode now as we can
 		 * be holding the last reference and we cannot iput it under
-		 * inode_sb_list_lock. So we keep the reference and iput it
+		 * s_inode_list_lock. So we keep the reference and iput it
 		 * later.
 		 */
 		iput(old_inode);
@@ -1335,9 +1335,9 @@ static void wait_sb_inodes(struct super_block *sb)
 
 		cond_resched();
 
-		spin_lock(&inode_sb_list_lock);
+		spin_lock(&sb->s_inode_list_lock);
 	}
-	spin_unlock(&inode_sb_list_lock);
+	spin_unlock(&sb->s_inode_list_lock);
 	iput(old_inode);
 }
 
diff --git a/fs/inode.c b/fs/inode.c
index f00b16f..bd61843 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -28,8 +28,8 @@
  *   inode->i_state, inode->i_hash, __iget()
  * Inode LRU list locks protect:
  *   inode->i_sb->s_inode_lru, inode->i_lru
- * inode_sb_list_lock protects:
- *   sb->s_inodes, inode->i_sb_list
+ * inode->i_sb->s_inode_list_lock protects:
+ *   inode->i_sb->s_inodes, inode->i_sb_list
  * bdi->wb.list_lock protects:
  *   bdi->wb.b_{dirty,io,more_io,dirty_time}, inode->i_wb_list
  * inode_hash_lock protects:
@@ -37,7 +37,7 @@
  *
  * Lock ordering:
  *
- * inode_sb_list_lock
+ * inode->i_sb->s_inode_list_lock
  *   inode->i_lock
  *     Inode LRU list locks
  *
@@ -45,7 +45,7 @@
  *   inode->i_lock
  *
  * inode_hash_lock
- *   inode_sb_list_lock
+ *   inode->i_sb->s_inode_list_lock
  *   inode->i_lock
  *
  * iunique_lock
@@ -57,8 +57,6 @@ static unsigned int i_hash_shift __read_mostly;
 static struct hlist_head *inode_hashtable __read_mostly;
 static __cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_hash_lock);
 
-__cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_sb_list_lock);
-
 /*
  * Empty aops. Can be used for the cases where the user does not
  * define any of the address_space operations.
@@ -424,18 +422,18 @@ static void inode_lru_list_del(struct inode *inode)
  */
 void inode_sb_list_add(struct inode *inode)
 {
-	spin_lock(&inode_sb_list_lock);
+	spin_lock(&inode->i_sb->s_inode_list_lock);
 	list_add(&inode->i_sb_list, &inode->i_sb->s_inodes);
-	spin_unlock(&inode_sb_list_lock);
+	spin_unlock(&inode->i_sb->s_inode_list_lock);
 }
 EXPORT_SYMBOL_GPL(inode_sb_list_add);
 
 static inline void inode_sb_list_del(struct inode *inode)
 {
 	if (!list_empty(&inode->i_sb_list)) {
-		spin_lock(&inode_sb_list_lock);
+		spin_lock(&inode->i_sb->s_inode_list_lock);
 		list_del_init(&inode->i_sb_list);
-		spin_unlock(&inode_sb_list_lock);
+		spin_unlock(&inode->i_sb->s_inode_list_lock);
 	}
 }
 
@@ -592,7 +590,7 @@ void evict_inodes(struct super_block *sb)
 	struct inode *inode, *next;
 	LIST_HEAD(dispose);
 
-	spin_lock(&inode_sb_list_lock);
+	spin_lock(&sb->s_inode_list_lock);
 	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
 		if (atomic_read(&inode->i_count))
 			continue;
@@ -608,7 +606,7 @@ void evict_inodes(struct super_block *sb)
 		spin_unlock(&inode->i_lock);
 		list_add(&inode->i_lru, &dispose);
 	}
-	spin_unlock(&inode_sb_list_lock);
+	spin_unlock(&sb->s_inode_list_lock);
 
 	dispose_list(&dispose);
 }
@@ -629,7 +627,7 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
 	struct inode *inode, *next;
 	LIST_HEAD(dispose);
 
-	spin_lock(&inode_sb_list_lock);
+	spin_lock(&sb->s_inode_list_lock);
 	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
 		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
@@ -652,7 +650,7 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
 		spin_unlock(&inode->i_lock);
 		list_add(&inode->i_lru, &dispose);
 	}
-	spin_unlock(&inode_sb_list_lock);
+	spin_unlock(&sb->s_inode_list_lock);
 
 	dispose_list(&dispose);
 
@@ -884,7 +882,7 @@ struct inode *new_inode(struct super_block *sb)
 {
 	struct inode *inode;
 
-	spin_lock_prefetch(&inode_sb_list_lock);
+	spin_lock_prefetch(&sb->s_inode_list_lock);
 
 	inode = new_inode_pseudo(sb);
 	if (inode)
diff --git a/fs/internal.h b/fs/internal.h
index 01dce1d..b6c350d 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -111,7 +111,6 @@ extern int open_check_o_direct(struct file *f);
 /*
  * inode.c
  */
-extern spinlock_t inode_sb_list_lock;
 extern long prune_icache_sb(struct super_block *sb, struct shrink_control *sc);
 extern void inode_add_lru(struct inode *inode);
 
diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
index 3daf513..a4e1a8f 100644
--- a/fs/notify/inode_mark.c
+++ b/fs/notify/inode_mark.c
@@ -163,17 +163,17 @@ int fsnotify_add_inode_mark(struct fsnotify_mark *mark,
 
 /**
  * fsnotify_unmount_inodes - an sb is unmounting.  handle any watched inodes.
- * @list: list of inodes being unmounted (sb->s_inodes)
+ * @sb: superblock being unmounted.
  *
  * Called during unmount with no locks held, so needs to be safe against
- * concurrent modifiers. We temporarily drop inode_sb_list_lock and CAN block.
+ * concurrent modifiers. We temporarily drop sb->s_inode_list_lock and CAN block.
  */
-void fsnotify_unmount_inodes(struct list_head *list)
+void fsnotify_unmount_inodes(struct super_block *sb)
 {
 	struct inode *inode, *next_i, *need_iput = NULL;
 
-	spin_lock(&inode_sb_list_lock);
-	list_for_each_entry_safe(inode, next_i, list, i_sb_list) {
+	spin_lock(&sb->s_inode_list_lock);
+	list_for_each_entry_safe(inode, next_i, &sb->s_inodes, i_sb_list) {
 		struct inode *need_iput_tmp;
 
 		/*
@@ -209,7 +209,7 @@ void fsnotify_unmount_inodes(struct list_head *list)
 		spin_unlock(&inode->i_lock);
 
 		/* In case the dropping of a reference would nuke next_i. */
-		while (&next_i->i_sb_list != list) {
+		while (&next_i->i_sb_list != &sb->s_inodes) {
 			spin_lock(&next_i->i_lock);
 			if (!(next_i->i_state & (I_FREEING | I_WILL_FREE)) &&
 						atomic_read(&next_i->i_count)) {
@@ -224,12 +224,12 @@ void fsnotify_unmount_inodes(struct list_head *list)
 		}
 
 		/*
-		 * We can safely drop inode_sb_list_lock here because either
+		 * We can safely drop s_inode_list_lock here because either
 		 * we actually hold references on both inode and next_i or
 		 * end of list.  Also no new inodes will be added since the
 		 * umount has begun.
 		 */
-		spin_unlock(&inode_sb_list_lock);
+		spin_unlock(&sb->s_inode_list_lock);
 
 		if (need_iput_tmp)
 			iput(need_iput_tmp);
@@ -241,7 +241,7 @@ void fsnotify_unmount_inodes(struct list_head *list)
 
 		iput(inode);
 
-		spin_lock(&inode_sb_list_lock);
+		spin_lock(&sb->s_inode_list_lock);
 	}
-	spin_unlock(&inode_sb_list_lock);
+	spin_unlock(&sb->s_inode_list_lock);
 }
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index 0ccd4ba..68c7ae3 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -920,7 +920,7 @@ static void add_dquot_ref(struct super_block *sb, int type)
 	int reserved = 0;
 #endif
 
-	spin_lock(&inode_sb_list_lock);
+	spin_lock(&sb->s_inode_list_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		spin_lock(&inode->i_lock);
 		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
@@ -931,7 +931,7 @@ static void add_dquot_ref(struct super_block *sb, int type)
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&inode_sb_list_lock);
+		spin_unlock(&sb->s_inode_list_lock);
 
 #ifdef CONFIG_QUOTA_DEBUG
 		if (unlikely(inode_get_rsv_space(inode) > 0))
@@ -943,15 +943,15 @@ static void add_dquot_ref(struct super_block *sb, int type)
 		/*
 		 * We hold a reference to 'inode' so it couldn't have been
 		 * removed from s_inodes list while we dropped the
-		 * inode_sb_list_lock We cannot iput the inode now as we can be
+		 * s_inode_list_lock. We cannot iput the inode now as we can be
 		 * holding the last reference and we cannot iput it under
-		 * inode_sb_list_lock. So we keep the reference and iput it
+		 * s_inode_list_lock. So we keep the reference and iput it
 		 * later.
 		 */
 		old_inode = inode;
-		spin_lock(&inode_sb_list_lock);
+		spin_lock(&sb->s_inode_list_lock);
 	}
-	spin_unlock(&inode_sb_list_lock);
+	spin_unlock(&sb->s_inode_list_lock);
 	iput(old_inode);
 
 #ifdef CONFIG_QUOTA_DEBUG
@@ -1019,7 +1019,7 @@ static void remove_dquot_ref(struct super_block *sb, int type,
 	struct inode *inode;
 	int reserved = 0;
 
-	spin_lock(&inode_sb_list_lock);
+	spin_lock(&sb->s_inode_list_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		/*
 		 *  We have to scan also I_NEW inodes because they can already
@@ -1035,7 +1035,7 @@ static void remove_dquot_ref(struct super_block *sb, int type,
 		}
 		spin_unlock(&dq_data_lock);
 	}
-	spin_unlock(&inode_sb_list_lock);
+	spin_unlock(&sb->s_inode_list_lock);
 #ifdef CONFIG_QUOTA_DEBUG
 	if (reserved) {
 		printk(KERN_WARNING "VFS (%s): Writes happened after quota"
diff --git a/fs/super.c b/fs/super.c
index 2b7dc90..85d6a62 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -191,6 +191,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 	INIT_HLIST_NODE(&s->s_instances);
 	INIT_HLIST_BL_HEAD(&s->s_anon);
 	INIT_LIST_HEAD(&s->s_inodes);
+	spin_lock_init(&s->s_inode_list_lock);
 
 	if (list_lru_init_memcg(&s->s_dentry_lru))
 		goto fail;
@@ -399,7 +400,7 @@ void generic_shutdown_super(struct super_block *sb)
 		sync_filesystem(sb);
 		sb->s_flags &= ~MS_ACTIVE;
 
-		fsnotify_unmount_inodes(&sb->s_inodes);
+		fsnotify_unmount_inodes(sb);
 
 		evict_inodes(sb);
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 31da757..35953b5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1244,7 +1244,6 @@ struct super_block {
 #endif
 	const struct xattr_handler **s_xattr;
 
-	struct list_head	s_inodes;	/* all inodes */
 	struct hlist_bl_head	s_anon;		/* anonymous dentries for (nfs) exporting */
 	struct list_head	s_mounts;	/* list of mounts; _not_ for fs use */
 	struct block_device	*s_bdev;
@@ -1315,6 +1314,10 @@ struct super_block {
 	 * Indicates how deep in a filesystem stack this SB is
 	 */
 	int s_stack_depth;
+
+	/* s_inode_list_lock protects s_inodes */
+	spinlock_t		s_inode_list_lock ____cacheline_aligned_in_smp;
+	struct list_head	s_inodes;	/* all inodes */
 };
 
 extern struct timespec current_fs_time(struct super_block *sb);
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index 0f313f9..236cbc4 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -359,7 +359,7 @@ extern void fsnotify_clear_marks_by_group_flags(struct fsnotify_group *group, un
 extern void fsnotify_clear_marks_by_group(struct fsnotify_group *group);
 extern void fsnotify_get_mark(struct fsnotify_mark *mark);
 extern void fsnotify_put_mark(struct fsnotify_mark *mark);
-extern void fsnotify_unmount_inodes(struct list_head *list);
+extern void fsnotify_unmount_inodes(struct super_block *sb);
 
 /* put here because inotify does some weird stuff when destroying watches */
 extern void fsnotify_init_event(struct fsnotify_event *event,
@@ -395,7 +395,7 @@ static inline u32 fsnotify_get_cookie(void)
 	return 0;
 }
 
-static inline void fsnotify_unmount_inodes(struct list_head *list)
+static inline void fsnotify_unmount_inodes(struct super_block *sb)
 {}
 
 #endif	/* CONFIG_FSNOTIFY */
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 4/8] sync: serialise per-superblock sync operations
  2015-03-20 17:14 [PATCH 0/8] Sync and VFS scalability improvements V2 Josef Bacik
                   ` (2 preceding siblings ...)
  2015-03-20 17:14 ` [PATCH 3/8] inode: convert inode_sb_list_lock to per-sb Josef Bacik
@ 2015-03-20 17:14 ` Josef Bacik
  2015-03-20 17:14 ` [PATCH 5/8] inode: rename i_wb_list to i_io_list Josef Bacik
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: Josef Bacik @ 2015-03-20 17:14 UTC (permalink / raw)
  To: linux-fsdevel, david, viro, jack, linux-kernel; +Cc: Dave Chinner

From: Dave Chinner <dchinner@redhat.com>

When competing sync(2) calls walk the same filesystem, they need to
walk the list of inodes on the superblock to find all the inodes
that we need to wait for IO completion on. However, when multiple
wait_sb_inodes() calls do this at the same time, they contend on the
the inode_sb_list_lock and the contention causes system wide
slowdowns. In effect, concurrent sync(2) calls can take longer and
burn more CPU than if they were serialised.

Stop the worst of the contention by adding a per-sb mutex to wrap
around wait_sb_inodes() so that we only execute one sync(2) IO
completion walk per superblock superblock at a time and hence avoid
contention being triggered by concurrent sync(2) calls.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c  | 11 +++++++++++
 fs/super.c         |  1 +
 include/linux/fs.h |  2 ++
 3 files changed, 14 insertions(+)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 9780f6c..dcfe21c 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1288,6 +1288,15 @@ out_unlock_inode:
 }
 EXPORT_SYMBOL(__mark_inode_dirty);
 
+/*
+ * The @s_sync_lock is used to serialise concurrent sync operations
+ * to avoid lock contention problems with concurrent wait_sb_inodes() calls.
+ * Concurrent callers will block on the s_sync_lock rather than doing contending
+ * walks. The queueing maintains sync(2) required behaviour as all the IO that
+ * has been issued up to the time this function is enter is guaranteed to be
+ * completed by the time we have gained the lock and waited for all IO that is
+ * in progress regardless of the order callers are granted the lock.
+ */
 static void wait_sb_inodes(struct super_block *sb)
 {
 	struct inode *inode, *old_inode = NULL;
@@ -1298,6 +1307,7 @@ static void wait_sb_inodes(struct super_block *sb)
 	 */
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
+	mutex_lock(&sb->s_sync_lock);
 	spin_lock(&sb->s_inode_list_lock);
 
 	/*
@@ -1339,6 +1349,7 @@ static void wait_sb_inodes(struct super_block *sb)
 	}
 	spin_unlock(&sb->s_inode_list_lock);
 	iput(old_inode);
+	mutex_unlock(&sb->s_sync_lock);
 }
 
 /**
diff --git a/fs/super.c b/fs/super.c
index 85d6a62..6a05d94 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -190,6 +190,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 	s->s_flags = flags;
 	INIT_HLIST_NODE(&s->s_instances);
 	INIT_HLIST_BL_HEAD(&s->s_anon);
+	mutex_init(&s->s_sync_lock);
 	INIT_LIST_HEAD(&s->s_inodes);
 	spin_lock_init(&s->s_inode_list_lock);
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 35953b5..10cdf24 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1310,6 +1310,8 @@ struct super_block {
 	struct list_lru		s_inode_lru ____cacheline_aligned_in_smp;
 	struct rcu_head		rcu;
 
+	struct mutex		s_sync_lock;	/* sync serialisation lock */
+
 	/*
 	 * Indicates how deep in a filesystem stack this SB is
 	 */
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 5/8] inode: rename i_wb_list to i_io_list
  2015-03-20 17:14 [PATCH 0/8] Sync and VFS scalability improvements V2 Josef Bacik
                   ` (3 preceding siblings ...)
  2015-03-20 17:14 ` [PATCH 4/8] sync: serialise per-superblock sync operations Josef Bacik
@ 2015-03-20 17:14 ` Josef Bacik
  2015-03-20 17:14 ` [PATCH 6/8] bdi: add a new writeback list for sync Josef Bacik
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: Josef Bacik @ 2015-03-20 17:14 UTC (permalink / raw)
  To: linux-fsdevel, david, viro, jack, linux-kernel; +Cc: Dave Chinner

From: Dave Chinner <dchinner@redhat.com>

There's a small consistency problem between the inode and writeback
naming. Writeback calls the "for IO" inode queues b_io and
b_more_io, but the inode calls these the "writeback list" or
i_wb_list. This makes it hard to an new "under writeback" list to
the inode, or call it an "under IO" list on the bdi because either
way we'll have writeback on IO and IO on writeback and it'll just be
confusing. I'm getting confused just writing this!

So, rename the inode "for IO" list variable to i_io_list so we can
add a new "writeback list" in a subsequent patch.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c  | 20 ++++++++++----------
 fs/inode.c         |  6 +++---
 include/linux/fs.h |  2 +-
 mm/backing-dev.c   |  8 ++++----
 4 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index dcfe21c..cabebde 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -84,7 +84,7 @@ EXPORT_SYMBOL_GPL(inode_to_bdi);
 
 static inline struct inode *wb_inode(struct list_head *head)
 {
-	return list_entry(head, struct inode, i_wb_list);
+	return list_entry(head, struct inode, i_io_list);
 }
 
 /*
@@ -193,7 +193,7 @@ void inode_wb_list_del(struct inode *inode)
 	struct backing_dev_info *bdi = inode_to_bdi(inode);
 
 	spin_lock(&bdi->wb.list_lock);
-	list_del_init(&inode->i_wb_list);
+	list_del_init(&inode->i_io_list);
 	spin_unlock(&bdi->wb.list_lock);
 }
 
@@ -216,7 +216,7 @@ static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
 		if (time_before(inode->dirtied_when, tail->dirtied_when))
 			inode->dirtied_when = jiffies;
 	}
-	list_move(&inode->i_wb_list, &wb->b_dirty);
+	list_move(&inode->i_io_list, &wb->b_dirty);
 }
 
 /*
@@ -225,7 +225,7 @@ static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
 static void requeue_io(struct inode *inode, struct bdi_writeback *wb)
 {
 	assert_spin_locked(&wb->list_lock);
-	list_move(&inode->i_wb_list, &wb->b_more_io);
+	list_move(&inode->i_io_list, &wb->b_more_io);
 }
 
 static void inode_sync_complete(struct inode *inode)
@@ -284,7 +284,7 @@ static int move_expired_inodes(struct list_head *delaying_queue,
 		if (older_than_this &&
 		    inode_dirtied_after(inode, *older_than_this))
 			break;
-		list_move(&inode->i_wb_list, &tmp);
+		list_move(&inode->i_io_list, &tmp);
 		moved++;
 		if (flags & EXPIRE_DIRTY_ATIME)
 			set_bit(__I_DIRTY_TIME_EXPIRED, &inode->i_state);
@@ -307,7 +307,7 @@ static int move_expired_inodes(struct list_head *delaying_queue,
 		list_for_each_prev_safe(pos, node, &tmp) {
 			inode = wb_inode(pos);
 			if (inode->i_sb == sb)
-				list_move(&inode->i_wb_list, dispatch_queue);
+				list_move(&inode->i_io_list, dispatch_queue);
 		}
 	}
 out:
@@ -458,10 +458,10 @@ static void requeue_inode(struct inode *inode, struct bdi_writeback *wb,
 		 */
 		redirty_tail(inode, wb);
 	} else if (inode->i_state & I_DIRTY_TIME) {
-		list_move(&inode->i_wb_list, &wb->b_dirty_time);
+		list_move(&inode->i_io_list, &wb->b_dirty_time);
 	} else {
 		/* The inode is clean. Remove from writeback lists. */
-		list_del_init(&inode->i_wb_list);
+		list_del_init(&inode->i_io_list);
 	}
 }
 
@@ -598,7 +598,7 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 	 * touch it. See comment above for explanation.
 	 */
 	if (!(inode->i_state & I_DIRTY_ALL))
-		list_del_init(&inode->i_wb_list);
+		list_del_init(&inode->i_io_list);
 	spin_unlock(&wb->list_lock);
 	inode_sync_complete(inode);
 out:
@@ -1272,7 +1272,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 			}
 
 			inode->dirtied_when = jiffies;
-			list_move(&inode->i_wb_list, dirtytime ?
+			list_move(&inode->i_io_list, dirtytime ?
 				  &bdi->wb.b_dirty_time : &bdi->wb.b_dirty);
 			spin_unlock(&bdi->wb.list_lock);
 			trace_writeback_dirty_inode_enqueue(inode);
diff --git a/fs/inode.c b/fs/inode.c
index bd61843..e0f2a60 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -31,7 +31,7 @@
  * inode->i_sb->s_inode_list_lock protects:
  *   inode->i_sb->s_inodes, inode->i_sb_list
  * bdi->wb.list_lock protects:
- *   bdi->wb.b_{dirty,io,more_io,dirty_time}, inode->i_wb_list
+ *   bdi->wb.b_{dirty,io,more_io,dirty_time}, inode->i_io_list
  * inode_hash_lock protects:
  *   inode_hashtable, inode->i_hash
  *
@@ -355,7 +355,7 @@ void inode_init_once(struct inode *inode)
 	memset(inode, 0, sizeof(*inode));
 	INIT_HLIST_NODE(&inode->i_hash);
 	INIT_LIST_HEAD(&inode->i_devices);
-	INIT_LIST_HEAD(&inode->i_wb_list);
+	INIT_LIST_HEAD(&inode->i_io_list);
 	INIT_LIST_HEAD(&inode->i_lru);
 	address_space_init_once(&inode->i_data);
 	i_size_ordered_init(inode);
@@ -523,7 +523,7 @@ static void evict(struct inode *inode)
 	BUG_ON(!(inode->i_state & I_FREEING));
 	BUG_ON(!list_empty(&inode->i_lru));
 
-	if (!list_empty(&inode->i_wb_list))
+	if (!list_empty(&inode->i_io_list))
 		inode_wb_list_del(inode);
 
 	inode_sb_list_del(inode);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 10cdf24..2a28bbd 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -606,7 +606,7 @@ struct inode {
 	unsigned long		dirtied_when;	/* jiffies of first dirtying */
 
 	struct hlist_node	i_hash;
-	struct list_head	i_wb_list;	/* backing dev IO list */
+	struct list_head	i_io_list;	/* backing dev IO list */
 	struct list_head	i_lru;		/* inode LRU list */
 	struct list_head	i_sb_list;
 	union {
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 6dc4580..6e7a644 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -54,13 +54,13 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
 
 	nr_dirty = nr_io = nr_more_io = nr_dirty_time = 0;
 	spin_lock(&wb->list_lock);
-	list_for_each_entry(inode, &wb->b_dirty, i_wb_list)
+	list_for_each_entry(inode, &wb->b_dirty, i_io_list)
 		nr_dirty++;
-	list_for_each_entry(inode, &wb->b_io, i_wb_list)
+	list_for_each_entry(inode, &wb->b_io, i_io_list)
 		nr_io++;
-	list_for_each_entry(inode, &wb->b_more_io, i_wb_list)
+	list_for_each_entry(inode, &wb->b_more_io, i_io_list)
 		nr_more_io++;
-	list_for_each_entry(inode, &wb->b_dirty_time, i_wb_list)
+	list_for_each_entry(inode, &wb->b_dirty_time, i_io_list)
 		if (inode->i_state & I_DIRTY_TIME)
 			nr_dirty_time++;
 	spin_unlock(&wb->list_lock);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 6/8] bdi: add a new writeback list for sync
  2015-03-20 17:14 [PATCH 0/8] Sync and VFS scalability improvements V2 Josef Bacik
                   ` (4 preceding siblings ...)
  2015-03-20 17:14 ` [PATCH 5/8] inode: rename i_wb_list to i_io_list Josef Bacik
@ 2015-03-20 17:14 ` Josef Bacik
  2015-03-20 17:14 ` [PATCH 7/8] writeback: periodically trim the writeback list Josef Bacik
  2015-03-20 17:14 ` [PATCH 8/8] inode: don't softlockup when evicting inodes Josef Bacik
  7 siblings, 0 replies; 24+ messages in thread
From: Josef Bacik @ 2015-03-20 17:14 UTC (permalink / raw)
  To: linux-fsdevel, david, viro, jack, linux-kernel; +Cc: Dave Chinner

From: Dave Chinner <dchinner@redhat.com>

wait_sb_inodes() current does a walk of all inodes in the filesystem
to find dirty one to wait on during sync. This is highly
inefficient and wastes a lot of CPU when there are lots of clean
cached inodes that we don't need to wait on.

To avoid this "all inode" walk, we need to track inodes that are
currently under writeback that we need to wait for. We do this by
adding inodes to a writeback list on the bdi when the mapping is
first tagged as having pages under writeback.  wait_sb_inodes() can
then walk this list of "inodes under IO" and wait specifically just
for the inodes that the current sync(2) needs to wait for.

To avoid needing all the realted locking to be safe against
interrupts, Jan Kara suggested that we be lazy about removal from
the writeback list. That is, we don't remove inodes from the
writeback list on IO completion, but do it directly during a
wait_sb_inodes() walk.

This means that the a rare sync(2) call will have some work to do
skipping clean inodes However, in the current problem case of
concurrent sync workloads, concurrent wait_sb_inodes() calls only
walk the very recently dispatched inodes and hence should have very
little work to do.

This also means that we have to remove the inodes from the writeback
list during eviction. Do this in inode_wait_for_writeback() once
all writeback on the inode is complete.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/fs-writeback.c           | 115 ++++++++++++++++++++++++++++++++++++++------
 fs/inode.c                  |   1 +
 include/linux/backing-dev.h |   3 ++
 include/linux/fs.h          |   1 +
 mm/backing-dev.c            |   1 +
 mm/page-writeback.c         |  14 ++++++
 6 files changed, 119 insertions(+), 16 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index cabebde..82b0f43 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -198,6 +198,34 @@ void inode_wb_list_del(struct inode *inode)
 }
 
 /*
+ * mark an inode as under writeback on the given bdi
+ */
+void bdi_mark_inode_writeback(struct backing_dev_info *bdi, struct inode *inode)
+{
+	WARN_ON_ONCE(bdi != inode_to_bdi(inode));
+	if (list_empty(&inode->i_wb_list)) {
+		spin_lock(&bdi->wb.list_lock);
+		if (list_empty(&inode->i_wb_list))
+			list_add_tail(&inode->i_wb_list, &bdi->wb.b_writeback);
+		spin_unlock(&bdi->wb.list_lock);
+	}
+}
+
+/*
+ * clear an inode as under writeback on the given bdi
+ */
+static void bdi_clear_inode_writeback(struct backing_dev_info *bdi,
+				      struct inode *inode)
+{
+	WARN_ON_ONCE(bdi != inode_to_bdi(inode));
+	if (!list_empty(&inode->i_wb_list)) {
+		spin_lock(&bdi->wb.list_lock);
+		list_del_init(&inode->i_wb_list);
+		spin_unlock(&bdi->wb.list_lock);
+	}
+}
+
+/*
  * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
  * furthest end of its superblock's dirty-inode list.
  *
@@ -371,13 +399,23 @@ static void __inode_wait_for_writeback(struct inode *inode)
 }
 
 /*
- * Wait for writeback on an inode to complete. Caller must have inode pinned.
+ * Wait for writeback on an inode to complete during eviction. Caller must have
+ * inode pinned.
  */
 void inode_wait_for_writeback(struct inode *inode)
 {
+	BUG_ON(!(inode->i_state & I_FREEING));
+
 	spin_lock(&inode->i_lock);
 	__inode_wait_for_writeback(inode);
 	spin_unlock(&inode->i_lock);
+
+	/*
+	 * bd_inode's will have already had the bdev free'd so don't bother
+	 * doing the bdi_clear_inode_writeback.
+	 */
+	if (!sb_is_blkdev_sb(inode->i_sb))
+		bdi_clear_inode_writeback(inode_to_bdi(inode), inode);
 }
 
 /*
@@ -1299,7 +1337,9 @@ EXPORT_SYMBOL(__mark_inode_dirty);
  */
 static void wait_sb_inodes(struct super_block *sb)
 {
-	struct inode *inode, *old_inode = NULL;
+	struct backing_dev_info *bdi = sb->s_bdi;
+	struct inode *old_inode = NULL;
+	LIST_HEAD(sync_list);
 
 	/*
 	 * We need to be protected against the filesystem going from
@@ -1307,28 +1347,59 @@ static void wait_sb_inodes(struct super_block *sb)
 	 */
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
-	mutex_lock(&sb->s_sync_lock);
-	spin_lock(&sb->s_inode_list_lock);
-
 	/*
-	 * Data integrity sync. Must wait for all pages under writeback,
-	 * because there may have been pages dirtied before our sync
-	 * call, but which had writeout started before we write it out.
-	 * In which case, the inode may not be on the dirty list, but
-	 * we still have to wait for that writeout.
+	 * Data integrity sync. Must wait for all pages under writeback, because
+	 * there may have been pages dirtied before our sync call, but which had
+	 * writeout started before we write it out.  In which case, the inode
+	 * may not be on the dirty list, but we still have to wait for that
+	 * writeout.
+	 *
+	 * To avoid syncing inodes put under IO after we have started here,
+	 * splice the io list to a temporary list head and walk that. Newly
+	 * dirtied inodes will go onto the primary list so we won't wait for
+	 * them. This is safe to do as we are serialised by the s_sync_lock,
+	 * so we'll complete processing the complete list before the next
+	 * sync operation repeats the splice-and-walk process.
+	 *
+	 * Inodes that have pages under writeback after we've finished the wait
+	 * may or may not be on the primary list. They had pages under put IO
+	 * after we started our wait, so we need to make sure the next sync or
+	 * IO completion treats them correctly, Move them back to the primary
+	 * list and restart the walk.
+	 *
+	 * Inodes that are clean after we have waited for them don't belong on
+	 * any list, so simply remove them from the sync list and move onwards.
 	 */
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	mutex_lock(&sb->s_sync_lock);
+	spin_lock(&bdi->wb.list_lock);
+	list_splice_init(&bdi->wb.b_writeback, &sync_list);
+
+	while (!list_empty(&sync_list)) {
+		struct inode *inode = list_first_entry(&sync_list, struct inode,
+						       i_wb_list);
 		struct address_space *mapping = inode->i_mapping;
 
+		/*
+		 * We are lazy on IO completion and don't remove inodes from the
+		 * list when they are clean. Detect that immediately and skip
+		 * inodes we don't ahve to wait on.
+		 */
+		if (!mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
+			list_del_init(&inode->i_wb_list);
+			cond_resched_lock(&bdi->wb.list_lock);
+			continue;
+		}
+
 		spin_lock(&inode->i_lock);
-		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
-		    (mapping->nrpages == 0)) {
+		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
+                        list_move(&inode->i_wb_list, &bdi->wb.b_writeback);
 			spin_unlock(&inode->i_lock);
+			cond_resched_lock(&bdi->wb.list_lock);
 			continue;
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
+		spin_unlock(&bdi->wb.list_lock);
 
 		/*
 		 * We hold a reference to 'inode' so it couldn't have been
@@ -1345,9 +1416,21 @@ static void wait_sb_inodes(struct super_block *sb)
 
 		cond_resched();
 
-		spin_lock(&sb->s_inode_list_lock);
+		/*
+		 * the inode has been written back now, so check whether we
+		 * still have pages under IO and move it back to the primary
+		 * list if necessary
+		 */
+		spin_lock_irq(&mapping->tree_lock);
+		spin_lock(&bdi->wb.list_lock);
+		if (mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
+			WARN_ON(list_empty(&inode->i_wb_list));
+			list_move(&inode->i_wb_list, &bdi->wb.b_writeback);
+                } else
+			list_del_init(&inode->i_wb_list);
+		spin_unlock_irq(&mapping->tree_lock);
 	}
-	spin_unlock(&sb->s_inode_list_lock);
+	spin_unlock(&bdi->wb.list_lock);
 	iput(old_inode);
 	mutex_unlock(&sb->s_sync_lock);
 }
diff --git a/fs/inode.c b/fs/inode.c
index e0f2a60..b961e5a 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -356,6 +356,7 @@ void inode_init_once(struct inode *inode)
 	INIT_HLIST_NODE(&inode->i_hash);
 	INIT_LIST_HEAD(&inode->i_devices);
 	INIT_LIST_HEAD(&inode->i_io_list);
+	INIT_LIST_HEAD(&inode->i_wb_list);
 	INIT_LIST_HEAD(&inode->i_lru);
 	address_space_init_once(&inode->i_data);
 	i_size_ordered_init(inode);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index aff923a..12d8224 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -56,6 +56,7 @@ struct bdi_writeback {
 	struct list_head b_io;		/* parked for writeback */
 	struct list_head b_more_io;	/* parked for more writeback */
 	struct list_head b_dirty_time;	/* time stamps are dirty */
+	struct list_head b_writeback;	/* inodes under writeback */
 	spinlock_t list_lock;		/* protects the b_* lists */
 };
 
@@ -124,6 +125,8 @@ void bdi_start_background_writeback(struct backing_dev_info *bdi);
 void bdi_writeback_workfn(struct work_struct *work);
 int bdi_has_dirty_io(struct backing_dev_info *bdi);
 void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi);
+void bdi_mark_inode_writeback(struct backing_dev_info *bdi,
+			      struct inode *inode);
 
 extern spinlock_t bdi_lock;
 extern struct list_head bdi_list;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2a28bbd..bdf0735 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -609,6 +609,7 @@ struct inode {
 	struct list_head	i_io_list;	/* backing dev IO list */
 	struct list_head	i_lru;		/* inode LRU list */
 	struct list_head	i_sb_list;
+	struct list_head	i_wb_list;	/* backing dev writeback list */
 	union {
 		struct hlist_head	i_dentry;
 		struct rcu_head		i_rcu;
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 6e7a644..df31958 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -386,6 +386,7 @@ static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
 	INIT_LIST_HEAD(&wb->b_io);
 	INIT_LIST_HEAD(&wb->b_more_io);
 	INIT_LIST_HEAD(&wb->b_dirty_time);
+	INIT_LIST_HEAD(&wb->b_writeback);
 	spin_lock_init(&wb->list_lock);
 	INIT_DELAYED_WORK(&wb->dwork, bdi_writeback_workfn);
 }
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 45e187b..4f21498 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2359,11 +2359,25 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
 		spin_lock_irqsave(&mapping->tree_lock, flags);
 		ret = TestSetPageWriteback(page);
 		if (!ret) {
+			bool on_wblist;
+
+			on_wblist = mapping_tagged(mapping,
+						   PAGECACHE_TAG_WRITEBACK);
+
 			radix_tree_tag_set(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
 			if (bdi_cap_account_writeback(bdi))
 				__inc_bdi_stat(bdi, BDI_WRITEBACK);
+
+			/*
+			 * we can come through here when swapping anonymous
+			 * pages, so we don't necessarily have an inode to
+			 * track for sync. Inodes are remove lazily, so there is
+			 * no equivalent in test_clear_page_writeback().
+			 */
+			if (!on_wblist && mapping->host)
+				bdi_mark_inode_writeback(bdi, mapping->host);
 		}
 		if (!PageDirty(page))
 			radix_tree_tag_clear(&mapping->page_tree,
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 7/8] writeback: periodically trim the writeback list
  2015-03-20 17:14 [PATCH 0/8] Sync and VFS scalability improvements V2 Josef Bacik
                   ` (5 preceding siblings ...)
  2015-03-20 17:14 ` [PATCH 6/8] bdi: add a new writeback list for sync Josef Bacik
@ 2015-03-20 17:14 ` Josef Bacik
  2015-03-20 18:49   ` [PATCH 7/8] bdi: add a new writeback list for sync V3 Josef Bacik
  2015-03-20 17:14 ` [PATCH 8/8] inode: don't softlockup when evicting inodes Josef Bacik
  7 siblings, 1 reply; 24+ messages in thread
From: Josef Bacik @ 2015-03-20 17:14 UTC (permalink / raw)
  To: linux-fsdevel, david, viro, jack, linux-kernel; +Cc: Dave Chinner

From: Dave Chinner <dchinner@redhat.com>

Inodes are removed lazily from the bdi writeback list, so in the
absence of sync(2) work inodes will build up on the bdi writback
list even though they are no longer under IO. Use the periodic
kupdate work check to remove inodes no longer under IO from the
writeback list.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 82b0f43..aa0de0f 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1037,6 +1037,23 @@ static long wb_check_background_flush(struct bdi_writeback *wb)
 	return 0;
 }
 
+/*
+ * clean out writeback list for all inodes that don't have IO in progress
+ */
+static void wb_trim_writeback_list(struct bdi_writeback *wb)
+{
+	struct inode *inode;
+	struct inode *tmp;
+
+	spin_lock(&wb->list_lock);
+	list_for_each_entry_safe(inode, tmp, &wb->b_writeback, i_wb_list) {
+		if (!mapping_tagged(inode->i_mapping, PAGECACHE_TAG_WRITEBACK))
+			list_del_init(&inode->i_wb_list);
+	}
+	spin_unlock(&wb->list_lock);
+
+}
+
 static long wb_check_old_data_flush(struct bdi_writeback *wb)
 {
 	unsigned long expired;
@@ -1053,6 +1070,8 @@ static long wb_check_old_data_flush(struct bdi_writeback *wb)
 	if (time_before(jiffies, expired))
 		return 0;
 
+	wb_trim_writeback_list(wb);
+
 	wb->last_old_flush = jiffies;
 	nr_pages = get_nr_dirty_pages();
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 7/8] bdi: add a new writeback list for sync V3
  2015-03-20 17:14 ` [PATCH 7/8] writeback: periodically trim the writeback list Josef Bacik
@ 2015-03-20 18:49   ` Josef Bacik
  2015-04-01  8:44     ` Jan Kara
  2015-04-01  8:46     ` Jan Kara
  0 siblings, 2 replies; 24+ messages in thread
From: Josef Bacik @ 2015-03-20 18:49 UTC (permalink / raw)
  To: linux-fsdevel, david, viro, jack, linux-kernel; +Cc: Dave Chinner

From: Dave Chinner <dchinner@redhat.com>

wait_sb_inodes() current does a walk of all inodes in the filesystem
to find dirty one to wait on during sync. This is highly
inefficient and wastes a lot of CPU when there are lots of clean
cached inodes that we don't need to wait on.

To avoid this "all inode" walk, we need to track inodes that are
currently under writeback that we need to wait for. We do this by
adding inodes to a writeback list on the bdi when the mapping is
first tagged as having pages under writeback.  wait_sb_inodes() can
then walk this list of "inodes under IO" and wait specifically just
for the inodes that the current sync(2) needs to wait for.

To avoid needing all the realted locking to be safe against
interrupts, Jan Kara suggested that we be lazy about removal from
the writeback list. That is, we don't remove inodes from the
writeback list on IO completion, but do it directly during a
wait_sb_inodes() walk.

This means that the a rare sync(2) call will have some work to do
skipping clean inodes However, in the current problem case of
concurrent sync workloads, concurrent wait_sb_inodes() calls only
walk the very recently dispatched inodes and hence should have very
little work to do.

This also means that we have to remove the inodes from the writeback
list during eviction. Do this in inode_wait_for_writeback() once
all writeback on the inode is complete.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
---
V2->V3:
-noticed a lockdep warning because of mapping->tree_lock-->wb.list_lock
 dependancy and the need for tree_lock to be under IRQ.  I re-arranged it so the
 dependancy is now wb.list_lock-->mapping->tree_lock and it seems to be ok, Jan
 could you look hard at this?

 fs/fs-writeback.c           | 115 ++++++++++++++++++++++++++++++++++++++------
 fs/inode.c                  |   1 +
 include/linux/backing-dev.h |   3 ++
 include/linux/fs.h          |   1 +
 mm/backing-dev.c            |   1 +
 mm/page-writeback.c         |  19 ++++++++
 6 files changed, 124 insertions(+), 16 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index cabebde..e70d2ea 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -198,6 +198,34 @@ void inode_wb_list_del(struct inode *inode)
 }
 
 /*
+ * mark an inode as under writeback on the given bdi
+ */
+void bdi_mark_inode_writeback(struct backing_dev_info *bdi, struct inode *inode)
+{
+	WARN_ON_ONCE(bdi != inode_to_bdi(inode));
+	if (list_empty(&inode->i_wb_list)) {
+		spin_lock(&bdi->wb.list_lock);
+		if (list_empty(&inode->i_wb_list))
+			list_add_tail(&inode->i_wb_list, &bdi->wb.b_writeback);
+		spin_unlock(&bdi->wb.list_lock);
+	}
+}
+
+/*
+ * clear an inode as under writeback on the given bdi
+ */
+static void bdi_clear_inode_writeback(struct backing_dev_info *bdi,
+				      struct inode *inode)
+{
+	WARN_ON_ONCE(bdi != inode_to_bdi(inode));
+	if (!list_empty(&inode->i_wb_list)) {
+		spin_lock(&bdi->wb.list_lock);
+		list_del_init(&inode->i_wb_list);
+		spin_unlock(&bdi->wb.list_lock);
+	}
+}
+
+/*
  * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
  * furthest end of its superblock's dirty-inode list.
  *
@@ -371,13 +399,23 @@ static void __inode_wait_for_writeback(struct inode *inode)
 }
 
 /*
- * Wait for writeback on an inode to complete. Caller must have inode pinned.
+ * Wait for writeback on an inode to complete during eviction. Caller must have
+ * inode pinned.
  */
 void inode_wait_for_writeback(struct inode *inode)
 {
+	BUG_ON(!(inode->i_state & I_FREEING));
+
 	spin_lock(&inode->i_lock);
 	__inode_wait_for_writeback(inode);
 	spin_unlock(&inode->i_lock);
+
+	/*
+	 * bd_inode's will have already had the bdev free'd so don't bother
+	 * doing the bdi_clear_inode_writeback.
+	 */
+	if (!sb_is_blkdev_sb(inode->i_sb))
+		bdi_clear_inode_writeback(inode_to_bdi(inode), inode);
 }
 
 /*
@@ -1299,7 +1337,9 @@ EXPORT_SYMBOL(__mark_inode_dirty);
  */
 static void wait_sb_inodes(struct super_block *sb)
 {
-	struct inode *inode, *old_inode = NULL;
+	struct backing_dev_info *bdi = sb->s_bdi;
+	struct inode *old_inode = NULL;
+	LIST_HEAD(sync_list);
 
 	/*
 	 * We need to be protected against the filesystem going from
@@ -1307,28 +1347,59 @@ static void wait_sb_inodes(struct super_block *sb)
 	 */
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
-	mutex_lock(&sb->s_sync_lock);
-	spin_lock(&sb->s_inode_list_lock);
-
 	/*
-	 * Data integrity sync. Must wait for all pages under writeback,
-	 * because there may have been pages dirtied before our sync
-	 * call, but which had writeout started before we write it out.
-	 * In which case, the inode may not be on the dirty list, but
-	 * we still have to wait for that writeout.
+	 * Data integrity sync. Must wait for all pages under writeback, because
+	 * there may have been pages dirtied before our sync call, but which had
+	 * writeout started before we write it out.  In which case, the inode
+	 * may not be on the dirty list, but we still have to wait for that
+	 * writeout.
+	 *
+	 * To avoid syncing inodes put under IO after we have started here,
+	 * splice the io list to a temporary list head and walk that. Newly
+	 * dirtied inodes will go onto the primary list so we won't wait for
+	 * them. This is safe to do as we are serialised by the s_sync_lock,
+	 * so we'll complete processing the complete list before the next
+	 * sync operation repeats the splice-and-walk process.
+	 *
+	 * Inodes that have pages under writeback after we've finished the wait
+	 * may or may not be on the primary list. They had pages under put IO
+	 * after we started our wait, so we need to make sure the next sync or
+	 * IO completion treats them correctly, Move them back to the primary
+	 * list and restart the walk.
+	 *
+	 * Inodes that are clean after we have waited for them don't belong on
+	 * any list, so simply remove them from the sync list and move onwards.
 	 */
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	mutex_lock(&sb->s_sync_lock);
+	spin_lock(&bdi->wb.list_lock);
+	list_splice_init(&bdi->wb.b_writeback, &sync_list);
+
+	while (!list_empty(&sync_list)) {
+		struct inode *inode = list_first_entry(&sync_list, struct inode,
+						       i_wb_list);
 		struct address_space *mapping = inode->i_mapping;
 
+		/*
+		 * We are lazy on IO completion and don't remove inodes from the
+		 * list when they are clean. Detect that immediately and skip
+		 * inodes we don't ahve to wait on.
+		 */
+		if (!mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
+			list_del_init(&inode->i_wb_list);
+			cond_resched_lock(&bdi->wb.list_lock);
+			continue;
+		}
+
 		spin_lock(&inode->i_lock);
-		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
-		    (mapping->nrpages == 0)) {
+		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
+			list_move(&inode->i_wb_list, &bdi->wb.b_writeback);
 			spin_unlock(&inode->i_lock);
+			cond_resched_lock(&bdi->wb.list_lock);
 			continue;
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
+		spin_unlock(&bdi->wb.list_lock);
 
 		/*
 		 * We hold a reference to 'inode' so it couldn't have been
@@ -1345,9 +1416,21 @@ static void wait_sb_inodes(struct super_block *sb)
 
 		cond_resched();
 
-		spin_lock(&sb->s_inode_list_lock);
+		/*
+		 * the inode has been written back now, so check whether we
+		 * still have pages under IO and move it back to the primary
+		 * list if necessary.
+		 */
+		spin_lock(&bdi->wb.list_lock);
+		spin_lock_irq(&mapping->tree_lock);
+		if (mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
+			WARN_ON(list_empty(&inode->i_wb_list));
+			list_move(&inode->i_wb_list, &bdi->wb.b_writeback);
+                } else
+			list_del_init(&inode->i_wb_list);
+		spin_unlock_irq(&mapping->tree_lock);
 	}
-	spin_unlock(&sb->s_inode_list_lock);
+	spin_unlock(&bdi->wb.list_lock);
 	iput(old_inode);
 	mutex_unlock(&sb->s_sync_lock);
 }
diff --git a/fs/inode.c b/fs/inode.c
index e0f2a60..b961e5a 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -356,6 +356,7 @@ void inode_init_once(struct inode *inode)
 	INIT_HLIST_NODE(&inode->i_hash);
 	INIT_LIST_HEAD(&inode->i_devices);
 	INIT_LIST_HEAD(&inode->i_io_list);
+	INIT_LIST_HEAD(&inode->i_wb_list);
 	INIT_LIST_HEAD(&inode->i_lru);
 	address_space_init_once(&inode->i_data);
 	i_size_ordered_init(inode);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index aff923a..12d8224 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -56,6 +56,7 @@ struct bdi_writeback {
 	struct list_head b_io;		/* parked for writeback */
 	struct list_head b_more_io;	/* parked for more writeback */
 	struct list_head b_dirty_time;	/* time stamps are dirty */
+	struct list_head b_writeback;	/* inodes under writeback */
 	spinlock_t list_lock;		/* protects the b_* lists */
 };
 
@@ -124,6 +125,8 @@ void bdi_start_background_writeback(struct backing_dev_info *bdi);
 void bdi_writeback_workfn(struct work_struct *work);
 int bdi_has_dirty_io(struct backing_dev_info *bdi);
 void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi);
+void bdi_mark_inode_writeback(struct backing_dev_info *bdi,
+			      struct inode *inode);
 
 extern spinlock_t bdi_lock;
 extern struct list_head bdi_list;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2a28bbd..bdf0735 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -609,6 +609,7 @@ struct inode {
 	struct list_head	i_io_list;	/* backing dev IO list */
 	struct list_head	i_lru;		/* inode LRU list */
 	struct list_head	i_sb_list;
+	struct list_head	i_wb_list;	/* backing dev writeback list */
 	union {
 		struct hlist_head	i_dentry;
 		struct rcu_head		i_rcu;
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 6e7a644..df31958 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -386,6 +386,7 @@ static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
 	INIT_LIST_HEAD(&wb->b_io);
 	INIT_LIST_HEAD(&wb->b_more_io);
 	INIT_LIST_HEAD(&wb->b_dirty_time);
+	INIT_LIST_HEAD(&wb->b_writeback);
 	spin_lock_init(&wb->list_lock);
 	INIT_DELAYED_WORK(&wb->dwork, bdi_writeback_workfn);
 }
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 45e187b..b1f1d48 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2355,10 +2355,14 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
 	if (mapping) {
 		struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
 		unsigned long flags;
+		bool on_wblist = true;
 
 		spin_lock_irqsave(&mapping->tree_lock, flags);
 		ret = TestSetPageWriteback(page);
 		if (!ret) {
+			on_wblist = mapping_tagged(mapping,
+						   PAGECACHE_TAG_WRITEBACK);
+
 			radix_tree_tag_set(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
@@ -2374,6 +2378,21 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
 						page_index(page),
 						PAGECACHE_TAG_TOWRITE);
 		spin_unlock_irqrestore(&mapping->tree_lock, flags);
+
+		/*
+		 * we can come through here when swapping anonymous pages, so
+		 * we don't necessarily have an inode to track for sync. Inodes
+		 * are remove lazily, so there is no equivalent in
+		 * test_clear_page_writeback().
+		 *
+		 * We do this outside the mapping->tree_lock because we don't
+		 * want to force the bdi list lock to have to be used with irq's
+		 * disabled.  Anybody who cares about synchronization here
+		 * should take the bdi's list lock and then take the inodes
+		 * mapping tree_lock.
+		 */
+		if (!on_wblist && mapping->host)
+			bdi_mark_inode_writeback(bdi, mapping->host);
 	} else {
 		ret = TestSetPageWriteback(page);
 	}
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 7/8] bdi: add a new writeback list for sync V3
  2015-03-20 18:49   ` [PATCH 7/8] bdi: add a new writeback list for sync V3 Josef Bacik
@ 2015-04-01  8:44     ` Jan Kara
  2015-04-01  8:46     ` Jan Kara
  1 sibling, 0 replies; 24+ messages in thread
From: Jan Kara @ 2015-04-01  8:44 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-fsdevel, david, viro, jack, linux-kernel, Dave Chinner

  Hi Josef,

On Fri 20-03-15 14:49:18, Josef Bacik wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> wait_sb_inodes() current does a walk of all inodes in the filesystem
> to find dirty one to wait on during sync. This is highly
> inefficient and wastes a lot of CPU when there are lots of clean
> cached inodes that we don't need to wait on.
> 
> To avoid this "all inode" walk, we need to track inodes that are
> currently under writeback that we need to wait for. We do this by
> adding inodes to a writeback list on the bdi when the mapping is
> first tagged as having pages under writeback.  wait_sb_inodes() can
> then walk this list of "inodes under IO" and wait specifically just
> for the inodes that the current sync(2) needs to wait for.
> 
> To avoid needing all the realted locking to be safe against
> interrupts, Jan Kara suggested that we be lazy about removal from
> the writeback list. That is, we don't remove inodes from the
> writeback list on IO completion, but do it directly during a
> wait_sb_inodes() walk.
> 
> This means that the a rare sync(2) call will have some work to do
> skipping clean inodes However, in the current problem case of
> concurrent sync workloads, concurrent wait_sb_inodes() calls only
> walk the very recently dispatched inodes and hence should have very
> little work to do.
> 
> This also means that we have to remove the inodes from the writeback
> list during eviction. Do this in inode_wait_for_writeback() once
> all writeback on the inode is complete.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Signed-off-by: Josef Bacik <jbacik@fb.com>
> ---
> V2->V3:
> -noticed a lockdep warning because of mapping->tree_lock-->wb.list_lock
>  dependancy and the need for tree_lock to be under IRQ.  I re-arranged it so the
>  dependancy is now wb.list_lock-->mapping->tree_lock and it seems to be ok, Jan
>  could you look hard at this?
> 
...
> @@ -1299,7 +1337,9 @@ EXPORT_SYMBOL(__mark_inode_dirty);
>   */
>  static void wait_sb_inodes(struct super_block *sb)
>  {
> -	struct inode *inode, *old_inode = NULL;
> +	struct backing_dev_info *bdi = sb->s_bdi;
> +	struct inode *old_inode = NULL;
> +	LIST_HEAD(sync_list);
>  
>  	/*
>  	 * We need to be protected against the filesystem going from
> @@ -1307,28 +1347,59 @@ static void wait_sb_inodes(struct super_block *sb)
>  	 */
>  	WARN_ON(!rwsem_is_locked(&sb->s_umount));
>  
> -	mutex_lock(&sb->s_sync_lock);
> -	spin_lock(&sb->s_inode_list_lock);
> -
>  	/*
> -	 * Data integrity sync. Must wait for all pages under writeback,
> -	 * because there may have been pages dirtied before our sync
> -	 * call, but which had writeout started before we write it out.
> -	 * In which case, the inode may not be on the dirty list, but
> -	 * we still have to wait for that writeout.
> +	 * Data integrity sync. Must wait for all pages under writeback, because
> +	 * there may have been pages dirtied before our sync call, but which had
> +	 * writeout started before we write it out.  In which case, the inode
> +	 * may not be on the dirty list, but we still have to wait for that
> +	 * writeout.
> +	 *
> +	 * To avoid syncing inodes put under IO after we have started here,
> +	 * splice the io list to a temporary list head and walk that. Newly
> +	 * dirtied inodes will go onto the primary list so we won't wait for
> +	 * them. This is safe to do as we are serialised by the s_sync_lock,
> +	 * so we'll complete processing the complete list before the next
> +	 * sync operation repeats the splice-and-walk process.
> +	 *
> +	 * Inodes that have pages under writeback after we've finished the wait
> +	 * may or may not be on the primary list. They had pages under put IO
> +	 * after we started our wait, so we need to make sure the next sync or
> +	 * IO completion treats them correctly, Move them back to the primary
> +	 * list and restart the walk.
> +	 *
> +	 * Inodes that are clean after we have waited for them don't belong on
> +	 * any list, so simply remove them from the sync list and move onwards.
>  	 */
> -	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> +	mutex_lock(&sb->s_sync_lock);
> +	spin_lock(&bdi->wb.list_lock);
> +	list_splice_init(&bdi->wb.b_writeback, &sync_list);
> +
> +	while (!list_empty(&sync_list)) {
> +		struct inode *inode = list_first_entry(&sync_list, struct inode,
> +						       i_wb_list);
>  		struct address_space *mapping = inode->i_mapping;
>  
> +		/*
> +		 * We are lazy on IO completion and don't remove inodes from the
> +		 * list when they are clean. Detect that immediately and skip
> +		 * inodes we don't ahve to wait on.
> +		 */
> +		if (!mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
> +			list_del_init(&inode->i_wb_list);
> +			cond_resched_lock(&bdi->wb.list_lock);
> +			continue;
> +		}
> +
>  		spin_lock(&inode->i_lock);
> -		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
> -		    (mapping->nrpages == 0)) {
> +		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
> +			list_move(&inode->i_wb_list, &bdi->wb.b_writeback);
>  			spin_unlock(&inode->i_lock);
> +			cond_resched_lock(&bdi->wb.list_lock);
>  			continue;
>  		}
>  		__iget(inode);
>  		spin_unlock(&inode->i_lock);
> -		spin_unlock(&sb->s_inode_list_lock);
> +		spin_unlock(&bdi->wb.list_lock);
  Hum, so you seem to keep that dance with inode references around (keeping
refs to current and prev inode). However that isn't necessary anymore once
you always pull the first inode from the list. So just grab reference to
the inode before dropping i_lock and put it when you are done with it...

> @@ -1345,9 +1416,21 @@ static void wait_sb_inodes(struct super_block *sb)
>  
>  		cond_resched();
>  
> -		spin_lock(&sb->s_inode_list_lock);
> +		/*
> +		 * the inode has been written back now, so check whether we
> +		 * still have pages under IO and move it back to the primary
> +		 * list if necessary.
> +		 */
> +		spin_lock(&bdi->wb.list_lock);
> +		spin_lock_irq(&mapping->tree_lock);
  Please comment that tree_lock is really necessary here. We need it
because bdi_mark_inode_writeback() needn't have bothered with grabbing
list_lock and may have exited without doing anything when it saw inode on
our spliced list.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 7/8] bdi: add a new writeback list for sync V3
  2015-03-20 18:49   ` [PATCH 7/8] bdi: add a new writeback list for sync V3 Josef Bacik
  2015-04-01  8:44     ` Jan Kara
@ 2015-04-01  8:46     ` Jan Kara
  1 sibling, 0 replies; 24+ messages in thread
From: Jan Kara @ 2015-04-01  8:46 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-fsdevel, david, viro, jack, linux-kernel, Dave Chinner

  BTW, this should have been patch 6/8, not 7/8...

								Honza
On Fri 20-03-15 14:49:18, Josef Bacik wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> wait_sb_inodes() current does a walk of all inodes in the filesystem
> to find dirty one to wait on during sync. This is highly
> inefficient and wastes a lot of CPU when there are lots of clean
> cached inodes that we don't need to wait on.
> 
> To avoid this "all inode" walk, we need to track inodes that are
> currently under writeback that we need to wait for. We do this by
> adding inodes to a writeback list on the bdi when the mapping is
> first tagged as having pages under writeback.  wait_sb_inodes() can
> then walk this list of "inodes under IO" and wait specifically just
> for the inodes that the current sync(2) needs to wait for.
> 
> To avoid needing all the realted locking to be safe against
> interrupts, Jan Kara suggested that we be lazy about removal from
> the writeback list. That is, we don't remove inodes from the
> writeback list on IO completion, but do it directly during a
> wait_sb_inodes() walk.
> 
> This means that the a rare sync(2) call will have some work to do
> skipping clean inodes However, in the current problem case of
> concurrent sync workloads, concurrent wait_sb_inodes() calls only
> walk the very recently dispatched inodes and hence should have very
> little work to do.
> 
> This also means that we have to remove the inodes from the writeback
> list during eviction. Do this in inode_wait_for_writeback() once
> all writeback on the inode is complete.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Signed-off-by: Josef Bacik <jbacik@fb.com>
> ---
> V2->V3:
> -noticed a lockdep warning because of mapping->tree_lock-->wb.list_lock
>  dependancy and the need for tree_lock to be under IRQ.  I re-arranged it so the
>  dependancy is now wb.list_lock-->mapping->tree_lock and it seems to be ok, Jan
>  could you look hard at this?
> 
>  fs/fs-writeback.c           | 115 ++++++++++++++++++++++++++++++++++++++------
>  fs/inode.c                  |   1 +
>  include/linux/backing-dev.h |   3 ++
>  include/linux/fs.h          |   1 +
>  mm/backing-dev.c            |   1 +
>  mm/page-writeback.c         |  19 ++++++++
>  6 files changed, 124 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index cabebde..e70d2ea 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -198,6 +198,34 @@ void inode_wb_list_del(struct inode *inode)
>  }
>  
>  /*
> + * mark an inode as under writeback on the given bdi
> + */
> +void bdi_mark_inode_writeback(struct backing_dev_info *bdi, struct inode *inode)
> +{
> +	WARN_ON_ONCE(bdi != inode_to_bdi(inode));
> +	if (list_empty(&inode->i_wb_list)) {
> +		spin_lock(&bdi->wb.list_lock);
> +		if (list_empty(&inode->i_wb_list))
> +			list_add_tail(&inode->i_wb_list, &bdi->wb.b_writeback);
> +		spin_unlock(&bdi->wb.list_lock);
> +	}
> +}
> +
> +/*
> + * clear an inode as under writeback on the given bdi
> + */
> +static void bdi_clear_inode_writeback(struct backing_dev_info *bdi,
> +				      struct inode *inode)
> +{
> +	WARN_ON_ONCE(bdi != inode_to_bdi(inode));
> +	if (!list_empty(&inode->i_wb_list)) {
> +		spin_lock(&bdi->wb.list_lock);
> +		list_del_init(&inode->i_wb_list);
> +		spin_unlock(&bdi->wb.list_lock);
> +	}
> +}
> +
> +/*
>   * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
>   * furthest end of its superblock's dirty-inode list.
>   *
> @@ -371,13 +399,23 @@ static void __inode_wait_for_writeback(struct inode *inode)
>  }
>  
>  /*
> - * Wait for writeback on an inode to complete. Caller must have inode pinned.
> + * Wait for writeback on an inode to complete during eviction. Caller must have
> + * inode pinned.
>   */
>  void inode_wait_for_writeback(struct inode *inode)
>  {
> +	BUG_ON(!(inode->i_state & I_FREEING));
> +
>  	spin_lock(&inode->i_lock);
>  	__inode_wait_for_writeback(inode);
>  	spin_unlock(&inode->i_lock);
> +
> +	/*
> +	 * bd_inode's will have already had the bdev free'd so don't bother
> +	 * doing the bdi_clear_inode_writeback.
> +	 */
> +	if (!sb_is_blkdev_sb(inode->i_sb))
> +		bdi_clear_inode_writeback(inode_to_bdi(inode), inode);
>  }
>  
>  /*
> @@ -1299,7 +1337,9 @@ EXPORT_SYMBOL(__mark_inode_dirty);
>   */
>  static void wait_sb_inodes(struct super_block *sb)
>  {
> -	struct inode *inode, *old_inode = NULL;
> +	struct backing_dev_info *bdi = sb->s_bdi;
> +	struct inode *old_inode = NULL;
> +	LIST_HEAD(sync_list);
>  
>  	/*
>  	 * We need to be protected against the filesystem going from
> @@ -1307,28 +1347,59 @@ static void wait_sb_inodes(struct super_block *sb)
>  	 */
>  	WARN_ON(!rwsem_is_locked(&sb->s_umount));
>  
> -	mutex_lock(&sb->s_sync_lock);
> -	spin_lock(&sb->s_inode_list_lock);
> -
>  	/*
> -	 * Data integrity sync. Must wait for all pages under writeback,
> -	 * because there may have been pages dirtied before our sync
> -	 * call, but which had writeout started before we write it out.
> -	 * In which case, the inode may not be on the dirty list, but
> -	 * we still have to wait for that writeout.
> +	 * Data integrity sync. Must wait for all pages under writeback, because
> +	 * there may have been pages dirtied before our sync call, but which had
> +	 * writeout started before we write it out.  In which case, the inode
> +	 * may not be on the dirty list, but we still have to wait for that
> +	 * writeout.
> +	 *
> +	 * To avoid syncing inodes put under IO after we have started here,
> +	 * splice the io list to a temporary list head and walk that. Newly
> +	 * dirtied inodes will go onto the primary list so we won't wait for
> +	 * them. This is safe to do as we are serialised by the s_sync_lock,
> +	 * so we'll complete processing the complete list before the next
> +	 * sync operation repeats the splice-and-walk process.
> +	 *
> +	 * Inodes that have pages under writeback after we've finished the wait
> +	 * may or may not be on the primary list. They had pages under put IO
> +	 * after we started our wait, so we need to make sure the next sync or
> +	 * IO completion treats them correctly, Move them back to the primary
> +	 * list and restart the walk.
> +	 *
> +	 * Inodes that are clean after we have waited for them don't belong on
> +	 * any list, so simply remove them from the sync list and move onwards.
>  	 */
> -	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> +	mutex_lock(&sb->s_sync_lock);
> +	spin_lock(&bdi->wb.list_lock);
> +	list_splice_init(&bdi->wb.b_writeback, &sync_list);
> +
> +	while (!list_empty(&sync_list)) {
> +		struct inode *inode = list_first_entry(&sync_list, struct inode,
> +						       i_wb_list);
>  		struct address_space *mapping = inode->i_mapping;
>  
> +		/*
> +		 * We are lazy on IO completion and don't remove inodes from the
> +		 * list when they are clean. Detect that immediately and skip
> +		 * inodes we don't ahve to wait on.
> +		 */
> +		if (!mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
> +			list_del_init(&inode->i_wb_list);
> +			cond_resched_lock(&bdi->wb.list_lock);
> +			continue;
> +		}
> +
>  		spin_lock(&inode->i_lock);
> -		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
> -		    (mapping->nrpages == 0)) {
> +		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
> +			list_move(&inode->i_wb_list, &bdi->wb.b_writeback);
>  			spin_unlock(&inode->i_lock);
> +			cond_resched_lock(&bdi->wb.list_lock);
>  			continue;
>  		}
>  		__iget(inode);
>  		spin_unlock(&inode->i_lock);
> -		spin_unlock(&sb->s_inode_list_lock);
> +		spin_unlock(&bdi->wb.list_lock);
>  
>  		/*
>  		 * We hold a reference to 'inode' so it couldn't have been
> @@ -1345,9 +1416,21 @@ static void wait_sb_inodes(struct super_block *sb)
>  
>  		cond_resched();
>  
> -		spin_lock(&sb->s_inode_list_lock);
> +		/*
> +		 * the inode has been written back now, so check whether we
> +		 * still have pages under IO and move it back to the primary
> +		 * list if necessary.
> +		 */
> +		spin_lock(&bdi->wb.list_lock);
> +		spin_lock_irq(&mapping->tree_lock);
> +		if (mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
> +			WARN_ON(list_empty(&inode->i_wb_list));
> +			list_move(&inode->i_wb_list, &bdi->wb.b_writeback);
> +                } else
> +			list_del_init(&inode->i_wb_list);
> +		spin_unlock_irq(&mapping->tree_lock);
>  	}
> -	spin_unlock(&sb->s_inode_list_lock);
> +	spin_unlock(&bdi->wb.list_lock);
>  	iput(old_inode);
>  	mutex_unlock(&sb->s_sync_lock);
>  }
> diff --git a/fs/inode.c b/fs/inode.c
> index e0f2a60..b961e5a 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -356,6 +356,7 @@ void inode_init_once(struct inode *inode)
>  	INIT_HLIST_NODE(&inode->i_hash);
>  	INIT_LIST_HEAD(&inode->i_devices);
>  	INIT_LIST_HEAD(&inode->i_io_list);
> +	INIT_LIST_HEAD(&inode->i_wb_list);
>  	INIT_LIST_HEAD(&inode->i_lru);
>  	address_space_init_once(&inode->i_data);
>  	i_size_ordered_init(inode);
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index aff923a..12d8224 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -56,6 +56,7 @@ struct bdi_writeback {
>  	struct list_head b_io;		/* parked for writeback */
>  	struct list_head b_more_io;	/* parked for more writeback */
>  	struct list_head b_dirty_time;	/* time stamps are dirty */
> +	struct list_head b_writeback;	/* inodes under writeback */
>  	spinlock_t list_lock;		/* protects the b_* lists */
>  };
>  
> @@ -124,6 +125,8 @@ void bdi_start_background_writeback(struct backing_dev_info *bdi);
>  void bdi_writeback_workfn(struct work_struct *work);
>  int bdi_has_dirty_io(struct backing_dev_info *bdi);
>  void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi);
> +void bdi_mark_inode_writeback(struct backing_dev_info *bdi,
> +			      struct inode *inode);
>  
>  extern spinlock_t bdi_lock;
>  extern struct list_head bdi_list;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 2a28bbd..bdf0735 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -609,6 +609,7 @@ struct inode {
>  	struct list_head	i_io_list;	/* backing dev IO list */
>  	struct list_head	i_lru;		/* inode LRU list */
>  	struct list_head	i_sb_list;
> +	struct list_head	i_wb_list;	/* backing dev writeback list */
>  	union {
>  		struct hlist_head	i_dentry;
>  		struct rcu_head		i_rcu;
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 6e7a644..df31958 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -386,6 +386,7 @@ static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
>  	INIT_LIST_HEAD(&wb->b_io);
>  	INIT_LIST_HEAD(&wb->b_more_io);
>  	INIT_LIST_HEAD(&wb->b_dirty_time);
> +	INIT_LIST_HEAD(&wb->b_writeback);
>  	spin_lock_init(&wb->list_lock);
>  	INIT_DELAYED_WORK(&wb->dwork, bdi_writeback_workfn);
>  }
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 45e187b..b1f1d48 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2355,10 +2355,14 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
>  	if (mapping) {
>  		struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
>  		unsigned long flags;
> +		bool on_wblist = true;
>  
>  		spin_lock_irqsave(&mapping->tree_lock, flags);
>  		ret = TestSetPageWriteback(page);
>  		if (!ret) {
> +			on_wblist = mapping_tagged(mapping,
> +						   PAGECACHE_TAG_WRITEBACK);
> +
>  			radix_tree_tag_set(&mapping->page_tree,
>  						page_index(page),
>  						PAGECACHE_TAG_WRITEBACK);
> @@ -2374,6 +2378,21 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
>  						page_index(page),
>  						PAGECACHE_TAG_TOWRITE);
>  		spin_unlock_irqrestore(&mapping->tree_lock, flags);
> +
> +		/*
> +		 * we can come through here when swapping anonymous pages, so
> +		 * we don't necessarily have an inode to track for sync. Inodes
> +		 * are remove lazily, so there is no equivalent in
> +		 * test_clear_page_writeback().
> +		 *
> +		 * We do this outside the mapping->tree_lock because we don't
> +		 * want to force the bdi list lock to have to be used with irq's
> +		 * disabled.  Anybody who cares about synchronization here
> +		 * should take the bdi's list lock and then take the inodes
> +		 * mapping tree_lock.
> +		 */
> +		if (!on_wblist && mapping->host)
> +			bdi_mark_inode_writeback(bdi, mapping->host);
>  	} else {
>  		ret = TestSetPageWriteback(page);
>  	}
> -- 
> 1.8.3.1
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 8/8] inode: don't softlockup when evicting inodes
  2015-03-20 17:14 [PATCH 0/8] Sync and VFS scalability improvements V2 Josef Bacik
                   ` (6 preceding siblings ...)
  2015-03-20 17:14 ` [PATCH 7/8] writeback: periodically trim the writeback list Josef Bacik
@ 2015-03-20 17:14 ` Josef Bacik
  2015-04-01  8:05   ` Jan Kara
  7 siblings, 1 reply; 24+ messages in thread
From: Josef Bacik @ 2015-03-20 17:14 UTC (permalink / raw)
  To: linux-fsdevel, david, viro, jack, linux-kernel

On a box with a lot of ram (148gb) I can make the box softlockup after running
an fs_mark job that creates hundreds of millions of empty files.  This is
because we never generate enough memory pressure to keep the number of inodes on
our unused list low, so when we go to unmount we have to evict ~100 million
inodes.  This makes one processor a very unhappy person, so add a cond_resched()
in dispose_list() and cond_resched_lock() in the eviction isolation function to
combat this.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/inode.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/inode.c b/fs/inode.c
index b961e5a..c58dbd3 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -574,6 +574,7 @@ static void dispose_list(struct list_head *head)
 		list_del_init(&inode->i_lru);
 
 		evict(inode);
+		cond_resched();
 	}
 }
 
@@ -592,6 +593,7 @@ void evict_inodes(struct super_block *sb)
 	LIST_HEAD(dispose);
 
 	spin_lock(&sb->s_inode_list_lock);
+again:
 	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
 		if (atomic_read(&inode->i_count))
 			continue;
@@ -606,6 +608,14 @@ void evict_inodes(struct super_block *sb)
 		inode_lru_list_del(inode);
 		spin_unlock(&inode->i_lock);
 		list_add(&inode->i_lru, &dispose);
+
+		/*
+		 * We can have a ton of inodes to evict at unmount time given
+		 * enough memory, check to see if we need to go to sleep for a
+		 * bit so we don't livelock.
+		 */
+		if (cond_resched_lock(&sb->s_inode_list_lock))
+			goto again;
 	}
 	spin_unlock(&sb->s_inode_list_lock);
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 8/8] inode: don't softlockup when evicting inodes
  2015-03-20 17:14 ` [PATCH 8/8] inode: don't softlockup when evicting inodes Josef Bacik
@ 2015-04-01  8:05   ` Jan Kara
  2015-04-07 15:03     ` Josef Bacik
  0 siblings, 1 reply; 24+ messages in thread
From: Jan Kara @ 2015-04-01  8:05 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-fsdevel, david, viro, jack, linux-kernel

  Sorry for a late reply. I was ill last week...

On Fri 20-03-15 13:14:16, Josef Bacik wrote:
> On a box with a lot of ram (148gb) I can make the box softlockup after running
> an fs_mark job that creates hundreds of millions of empty files.  This is
> because we never generate enough memory pressure to keep the number of inodes on
> our unused list low, so when we go to unmount we have to evict ~100 million
> inodes.  This makes one processor a very unhappy person, so add a cond_resched()
> in dispose_list() and cond_resched_lock() in the eviction isolation function to
> combat this.  Thanks,
> 
> Signed-off-by: Josef Bacik <jbacik@fb.com>
> ---
>  fs/inode.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index b961e5a..c58dbd3 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -574,6 +574,7 @@ static void dispose_list(struct list_head *head)
>  		list_del_init(&inode->i_lru);
>  
>  		evict(inode);
> +		cond_resched();
  Fine.

>  	}
>  }
>  
> @@ -592,6 +593,7 @@ void evict_inodes(struct super_block *sb)
>  	LIST_HEAD(dispose);
>  
>  	spin_lock(&sb->s_inode_list_lock);
> +again:
>  	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
>  		if (atomic_read(&inode->i_count))
>  			continue;
> @@ -606,6 +608,14 @@ void evict_inodes(struct super_block *sb)
>  		inode_lru_list_del(inode);
>  		spin_unlock(&inode->i_lock);
>  		list_add(&inode->i_lru, &dispose);
> +
> +		/*
> +		 * We can have a ton of inodes to evict at unmount time given
> +		 * enough memory, check to see if we need to go to sleep for a
> +		 * bit so we don't livelock.
> +		 */
> +		if (cond_resched_lock(&sb->s_inode_list_lock))
> +			goto again;
  Not so fine. How this is ever guaranteed to finish? We don't move inodes
from the i_sb_list in this loop so if we ever take 'goto again' we just
start doing all the work from the beginning...

What needs to happen is that if we need to resched, we drop
sb->s_inode_list_lock, call dispose_list(&dispose) and *then* restart from
the beginning since we have freed all the inodes that we isolated...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 8/8] inode: don't softlockup when evicting inodes
  2015-04-01  8:05   ` Jan Kara
@ 2015-04-07 15:03     ` Josef Bacik
  0 siblings, 0 replies; 24+ messages in thread
From: Josef Bacik @ 2015-04-07 15:03 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, david, viro, linux-kernel

On 04/01/2015 04:05 AM, Jan Kara wrote:
>    Sorry for a late reply. I was ill last week...
>

That's ok, I was on vacation for the last two weeks ;).

> On Fri 20-03-15 13:14:16, Josef Bacik wrote:
>> On a box with a lot of ram (148gb) I can make the box softlockup after running
>> an fs_mark job that creates hundreds of millions of empty files.  This is
>> because we never generate enough memory pressure to keep the number of inodes on
>> our unused list low, so when we go to unmount we have to evict ~100 million
>> inodes.  This makes one processor a very unhappy person, so add a cond_resched()
>> in dispose_list() and cond_resched_lock() in the eviction isolation function to
>> combat this.  Thanks,
>>
>> Signed-off-by: Josef Bacik <jbacik@fb.com>
>> ---
>>   fs/inode.c | 10 ++++++++++
>>   1 file changed, 10 insertions(+)
>>
>> diff --git a/fs/inode.c b/fs/inode.c
>> index b961e5a..c58dbd3 100644
>> --- a/fs/inode.c
>> +++ b/fs/inode.c
>> @@ -574,6 +574,7 @@ static void dispose_list(struct list_head *head)
>>   		list_del_init(&inode->i_lru);
>>
>>   		evict(inode);
>> +		cond_resched();
>    Fine.
>
>>   	}
>>   }
>>
>> @@ -592,6 +593,7 @@ void evict_inodes(struct super_block *sb)
>>   	LIST_HEAD(dispose);
>>
>>   	spin_lock(&sb->s_inode_list_lock);
>> +again:
>>   	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
>>   		if (atomic_read(&inode->i_count))
>>   			continue;
>> @@ -606,6 +608,14 @@ void evict_inodes(struct super_block *sb)
>>   		inode_lru_list_del(inode);
>>   		spin_unlock(&inode->i_lock);
>>   		list_add(&inode->i_lru, &dispose);
>> +
>> +		/*
>> +		 * We can have a ton of inodes to evict at unmount time given
>> +		 * enough memory, check to see if we need to go to sleep for a
>> +		 * bit so we don't livelock.
>> +		 */
>> +		if (cond_resched_lock(&sb->s_inode_list_lock))
>> +			goto again;
>    Not so fine. How this is ever guaranteed to finish? We don't move inodes
> from the i_sb_list in this loop so if we ever take 'goto again' we just
> start doing all the work from the beginning...
>
> What needs to happen is that if we need to resched, we drop
> sb->s_inode_list_lock, call dispose_list(&dispose) and *then* restart from
> the beginning since we have freed all the inodes that we isolated...
>

Ooops, good point.  I'll get this fixed up, thanks,

Josef

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 0/7] super block scalabilit patches V3
@ 2015-06-11 19:41 Josef Bacik
  2015-06-11 19:41 ` [PATCH 6/8] bdi: add a new writeback list for sync Josef Bacik
  0 siblings, 1 reply; 24+ messages in thread
From: Josef Bacik @ 2015-06-11 19:41 UTC (permalink / raw)
  To: linux-fsdevel, kernel-team, viro, hch, jack, david

Here are the cleaned up versions of Dave Chinners super block scalability
patches.  I've been testing them locally for a while and they are pretty solid.
They fix a few big issues, such as the global inode list and soft lockups on
boxes on unmount that have lots of inodes in cache.  Al if you would consider
pulling these in that would be great, you can pull from here

git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next.git superblock-scaling

Thanks,

Josef

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 6/8] bdi: add a new writeback list for sync
  2015-06-11 19:41 [PATCH 0/7] super block scalabilit patches V3 Josef Bacik
@ 2015-06-11 19:41 ` Josef Bacik
  2015-06-15 14:12   ` Jan Kara
  0 siblings, 1 reply; 24+ messages in thread
From: Josef Bacik @ 2015-06-11 19:41 UTC (permalink / raw)
  To: linux-fsdevel, kernel-team, viro, hch, jack, david; +Cc: Dave Chinner

From: Dave Chinner <dchinner@redhat.com>

wait_sb_inodes() current does a walk of all inodes in the filesystem
to find dirty one to wait on during sync. This is highly
inefficient and wastes a lot of CPU when there are lots of clean
cached inodes that we don't need to wait on.

To avoid this "all inode" walk, we need to track inodes that are
currently under writeback that we need to wait for. We do this by
adding inodes to a writeback list on the bdi when the mapping is
first tagged as having pages under writeback.  wait_sb_inodes() can
then walk this list of "inodes under IO" and wait specifically just
for the inodes that the current sync(2) needs to wait for.

To avoid needing all the realted locking to be safe against
interrupts, Jan Kara suggested that we be lazy about removal from
the writeback list. That is, we don't remove inodes from the
writeback list on IO completion, but do it directly during a
wait_sb_inodes() walk.

This means that the a rare sync(2) call will have some work to do
skipping clean inodes However, in the current problem case of
concurrent sync workloads, concurrent wait_sb_inodes() calls only
walk the very recently dispatched inodes and hence should have very
little work to do.

This also means that we have to remove the inodes from the writeback
list during eviction. Do this in inode_wait_for_writeback() once
all writeback on the inode is complete.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/fs-writeback.c           | 131 ++++++++++++++++++++++++++++++++++----------
 fs/inode.c                  |   1 +
 include/linux/backing-dev.h |   3 +
 include/linux/fs.h          |   1 +
 mm/backing-dev.c            |   1 +
 mm/page-writeback.c         |  14 +++++
 6 files changed, 122 insertions(+), 29 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index aa72536..3f5b2ff 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -210,6 +210,34 @@ void inode_wb_list_del(struct inode *inode)
 }
 
 /*
+ * mark an inode as under writeback on the given bdi
+ */
+void bdi_mark_inode_writeback(struct backing_dev_info *bdi, struct inode *inode)
+{
+	WARN_ON_ONCE(bdi != inode_to_bdi(inode));
+	if (list_empty(&inode->i_wb_list)) {
+		spin_lock(&bdi->wb.list_lock);
+		if (list_empty(&inode->i_wb_list))
+			list_add_tail(&inode->i_wb_list, &bdi->wb.b_writeback);
+		spin_unlock(&bdi->wb.list_lock);
+	}
+}
+
+/*
+ * clear an inode as under writeback on the given bdi
+ */
+static void bdi_clear_inode_writeback(struct backing_dev_info *bdi,
+				      struct inode *inode)
+{
+	WARN_ON_ONCE(bdi != inode_to_bdi(inode));
+	if (!list_empty(&inode->i_wb_list)) {
+		spin_lock(&bdi->wb.list_lock);
+		list_del_init(&inode->i_wb_list);
+		spin_unlock(&bdi->wb.list_lock);
+	}
+}
+
+/*
  * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
  * furthest end of its superblock's dirty-inode list.
  *
@@ -383,13 +411,23 @@ static void __inode_wait_for_writeback(struct inode *inode)
 }
 
 /*
- * Wait for writeback on an inode to complete. Caller must have inode pinned.
+ * Wait for writeback on an inode to complete during eviction. Caller must have
+ * inode pinned.
  */
 void inode_wait_for_writeback(struct inode *inode)
 {
+	BUG_ON(!(inode->i_state & I_FREEING));
+
 	spin_lock(&inode->i_lock);
 	__inode_wait_for_writeback(inode);
 	spin_unlock(&inode->i_lock);
+
+	/*
+	 * bd_inode's will have already had the bdev free'd so don't bother
+	 * doing the bdi_clear_inode_writeback.
+	 */
+	if (!sb_is_blkdev_sb(inode->i_sb))
+		bdi_clear_inode_writeback(inode_to_bdi(inode), inode);
 }
 
 /*
@@ -1372,7 +1410,8 @@ EXPORT_SYMBOL(__mark_inode_dirty);
  */
 static void wait_sb_inodes(struct super_block *sb)
 {
-	struct inode *inode, *old_inode = NULL;
+	struct backing_dev_info *bdi = sb->s_bdi;
+	LIST_HEAD(sync_list);
 
 	/*
 	 * We need to be protected against the filesystem going from
@@ -1380,48 +1419,82 @@ static void wait_sb_inodes(struct super_block *sb)
 	 */
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
-	mutex_lock(&sb->s_sync_lock);
-	spin_lock(&sb->s_inode_list_lock);
-
 	/*
-	 * Data integrity sync. Must wait for all pages under writeback,
-	 * because there may have been pages dirtied before our sync
-	 * call, but which had writeout started before we write it out.
-	 * In which case, the inode may not be on the dirty list, but
-	 * we still have to wait for that writeout.
+	 * Data integrity sync. Must wait for all pages under writeback, because
+	 * there may have been pages dirtied before our sync call, but which had
+	 * writeout started before we write it out.  In which case, the inode
+	 * may not be on the dirty list, but we still have to wait for that
+	 * writeout.
+	 *
+	 * To avoid syncing inodes put under IO after we have started here,
+	 * splice the io list to a temporary list head and walk that. Newly
+	 * dirtied inodes will go onto the primary list so we won't wait for
+	 * them. This is safe to do as we are serialised by the s_sync_lock,
+	 * so we'll complete processing the complete list before the next
+	 * sync operation repeats the splice-and-walk process.
+	 *
+	 * Inodes that have pages under writeback after we've finished the wait
+	 * may or may not be on the primary list. They had pages under put IO
+	 * after we started our wait, so we need to make sure the next sync or
+	 * IO completion treats them correctly, Move them back to the primary
+	 * list and restart the walk.
+	 *
+	 * Inodes that are clean after we have waited for them don't belong on
+	 * any list, so simply remove them from the sync list and move onwards.
 	 */
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	mutex_lock(&sb->s_sync_lock);
+	spin_lock(&bdi->wb.list_lock);
+	list_splice_init(&bdi->wb.b_writeback, &sync_list);
+
+	while (!list_empty(&sync_list)) {
+		struct inode *inode = list_first_entry(&sync_list, struct inode,
+						       i_wb_list);
 		struct address_space *mapping = inode->i_mapping;
 
+		/*
+		 * We are lazy on IO completion and don't remove inodes from the
+		 * list when they are clean. Detect that immediately and skip
+		 * inodes we don't ahve to wait on.
+		 */
+		if (!mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
+			list_del_init(&inode->i_wb_list);
+			cond_resched_lock(&bdi->wb.list_lock);
+			continue;
+		}
+
 		spin_lock(&inode->i_lock);
-		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
-		    (mapping->nrpages == 0)) {
+		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
+                        list_move(&inode->i_wb_list, &bdi->wb.b_writeback);
 			spin_unlock(&inode->i_lock);
+			cond_resched_lock(&bdi->wb.list_lock);
 			continue;
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
-
-		/*
-		 * We hold a reference to 'inode' so it couldn't have been
-		 * removed from s_inodes list while we dropped the
-		 * s_inode_list_lock.  We cannot iput the inode now as we can
-		 * be holding the last reference and we cannot iput it under
-		 * s_inode_list_lock. So we keep the reference and iput it
-		 * later.
-		 */
-		iput(old_inode);
-		old_inode = inode;
+		spin_unlock(&bdi->wb.list_lock);
 
 		filemap_fdatawait(mapping);
-
 		cond_resched();
 
-		spin_lock(&sb->s_inode_list_lock);
+		/*
+		 * the inode has been written back now, so check whether we
+		 * still have pages under IO and move it back to the primary
+		 * list if necessary.  We really need the mapping->tree_lock
+		 * here because bdi_mark_inode_writeback may have not done
+		 * anything because we were on the spliced list and we need to
+		 * check TAG_WRITEBACK.
+		 */
+		spin_lock_irq(&mapping->tree_lock);
+		spin_lock(&bdi->wb.list_lock);
+		if (mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
+			WARN_ON(list_empty(&inode->i_wb_list));
+			list_move(&inode->i_wb_list, &bdi->wb.b_writeback);
+                } else
+			list_del_init(&inode->i_wb_list);
+		spin_unlock_irq(&mapping->tree_lock);
+		iput(inode);
 	}
-	spin_unlock(&sb->s_inode_list_lock);
-	iput(old_inode);
+	spin_unlock(&bdi->wb.list_lock);
 	mutex_unlock(&sb->s_sync_lock);
 }
 
diff --git a/fs/inode.c b/fs/inode.c
index c76a575..d1e6598 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -356,6 +356,7 @@ void inode_init_once(struct inode *inode)
 	INIT_HLIST_NODE(&inode->i_hash);
 	INIT_LIST_HEAD(&inode->i_devices);
 	INIT_LIST_HEAD(&inode->i_io_list);
+	INIT_LIST_HEAD(&inode->i_wb_list);
 	INIT_LIST_HEAD(&inode->i_lru);
 	address_space_init_once(&inode->i_data);
 	i_size_ordered_init(inode);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index aff923a..12d8224 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -56,6 +56,7 @@ struct bdi_writeback {
 	struct list_head b_io;		/* parked for writeback */
 	struct list_head b_more_io;	/* parked for more writeback */
 	struct list_head b_dirty_time;	/* time stamps are dirty */
+	struct list_head b_writeback;	/* inodes under writeback */
 	spinlock_t list_lock;		/* protects the b_* lists */
 };
 
@@ -124,6 +125,8 @@ void bdi_start_background_writeback(struct backing_dev_info *bdi);
 void bdi_writeback_workfn(struct work_struct *work);
 int bdi_has_dirty_io(struct backing_dev_info *bdi);
 void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi);
+void bdi_mark_inode_writeback(struct backing_dev_info *bdi,
+			      struct inode *inode);
 
 extern spinlock_t bdi_lock;
 extern struct list_head bdi_list;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f4d56cc..f47792f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -637,6 +637,7 @@ struct inode {
 	struct list_head	i_io_list;	/* backing dev IO list */
 	struct list_head	i_lru;		/* inode LRU list */
 	struct list_head	i_sb_list;
+	struct list_head	i_wb_list;	/* backing dev writeback list */
 	union {
 		struct hlist_head	i_dentry;
 		struct rcu_head		i_rcu;
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 6e7a644..df31958 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -386,6 +386,7 @@ static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
 	INIT_LIST_HEAD(&wb->b_io);
 	INIT_LIST_HEAD(&wb->b_more_io);
 	INIT_LIST_HEAD(&wb->b_dirty_time);
+	INIT_LIST_HEAD(&wb->b_writeback);
 	spin_lock_init(&wb->list_lock);
 	INIT_DELAYED_WORK(&wb->dwork, bdi_writeback_workfn);
 }
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index eb59f7e..f3751d1 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2382,11 +2382,25 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
 		spin_lock_irqsave(&mapping->tree_lock, flags);
 		ret = TestSetPageWriteback(page);
 		if (!ret) {
+			bool on_wblist;
+
+			on_wblist = mapping_tagged(mapping,
+						   PAGECACHE_TAG_WRITEBACK);
+
 			radix_tree_tag_set(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
 			if (bdi_cap_account_writeback(bdi))
 				__inc_bdi_stat(bdi, BDI_WRITEBACK);
+
+			/*
+			 * we can come through here when swapping anonymous
+			 * pages, so we don't necessarily have an inode to
+			 * track for sync. Inodes are remove lazily, so there is
+			 * no equivalent in test_clear_page_writeback().
+			 */
+			if (!on_wblist && mapping->host)
+				bdi_mark_inode_writeback(bdi, mapping->host);
 		}
 		if (!PageDirty(page))
 			radix_tree_tag_clear(&mapping->page_tree,
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 6/8] bdi: add a new writeback list for sync
  2015-06-11 19:41 ` [PATCH 6/8] bdi: add a new writeback list for sync Josef Bacik
@ 2015-06-15 14:12   ` Jan Kara
  2015-06-16 15:42     ` Josef Bacik
  0 siblings, 1 reply; 24+ messages in thread
From: Jan Kara @ 2015-06-15 14:12 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-fsdevel, kernel-team, viro, hch, jack, david, Dave Chinner

On Thu 11-06-15 15:41:11, Josef Bacik wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> wait_sb_inodes() current does a walk of all inodes in the filesystem
> to find dirty one to wait on during sync. This is highly
> inefficient and wastes a lot of CPU when there are lots of clean
> cached inodes that we don't need to wait on.
> 
> To avoid this "all inode" walk, we need to track inodes that are
> currently under writeback that we need to wait for. We do this by
> adding inodes to a writeback list on the bdi when the mapping is
> first tagged as having pages under writeback.  wait_sb_inodes() can
> then walk this list of "inodes under IO" and wait specifically just
> for the inodes that the current sync(2) needs to wait for.
> 
> To avoid needing all the realted locking to be safe against
> interrupts, Jan Kara suggested that we be lazy about removal from
> the writeback list. That is, we don't remove inodes from the
> writeback list on IO completion, but do it directly during a
> wait_sb_inodes() walk.
> 
> This means that the a rare sync(2) call will have some work to do
> skipping clean inodes However, in the current problem case of
> concurrent sync workloads, concurrent wait_sb_inodes() calls only
> walk the very recently dispatched inodes and hence should have very
> little work to do.
> 
> This also means that we have to remove the inodes from the writeback
> list during eviction. Do this in inode_wait_for_writeback() once
> all writeback on the inode is complete.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Signed-off-by: Josef Bacik <jbacik@fb.com>

I don't think you noticed comments I did when you last posted this
series:

>  /*
> - * Wait for writeback on an inode to complete. Caller must have inode pinned.
> + * Wait for writeback on an inode to complete during eviction. Caller must have
> + * inode pinned.
>   */
>  void inode_wait_for_writeback(struct inode *inode)
>  {
> +	BUG_ON(!(inode->i_state & I_FREEING));
> +
>  	spin_lock(&inode->i_lock);
>  	__inode_wait_for_writeback(inode);
>  	spin_unlock(&inode->i_lock);
> +
> +	/*
> +	 * bd_inode's will have already had the bdev free'd so don't bother
> +	 * doing the bdi_clear_inode_writeback.
> +	 */
> +	if (!sb_is_blkdev_sb(inode->i_sb))
> +		bdi_clear_inode_writeback(inode_to_bdi(inode), inode);
>  }

Why do we bother with not putting bdev inode back?

...
>  		spin_lock(&inode->i_lock);
> -		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
> -		    (mapping->nrpages == 0)) {
> +		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
> +                        list_move(&inode->i_wb_list, &bdi->wb.b_writeback);

Indenting is broken here...

>  			spin_unlock(&inode->i_lock);
> +			cond_resched_lock(&bdi->wb.list_lock);
>  			continue;
>  		}
>  		__iget(inode);
>  		spin_unlock(&inode->i_lock);


								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 6/8] bdi: add a new writeback list for sync
  2015-06-15 14:12   ` Jan Kara
@ 2015-06-16 15:42     ` Josef Bacik
  2015-06-17 10:34       ` Jan Kara
  0 siblings, 1 reply; 24+ messages in thread
From: Josef Bacik @ 2015-06-16 15:42 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, kernel-team, viro, hch, david, Dave Chinner

On 06/15/2015 07:12 AM, Jan Kara wrote:
> On Thu 11-06-15 15:41:11, Josef Bacik wrote:
>> From: Dave Chinner <dchinner@redhat.com>
>>
>> wait_sb_inodes() current does a walk of all inodes in the filesystem
>> to find dirty one to wait on during sync. This is highly
>> inefficient and wastes a lot of CPU when there are lots of clean
>> cached inodes that we don't need to wait on.
>>
>> To avoid this "all inode" walk, we need to track inodes that are
>> currently under writeback that we need to wait for. We do this by
>> adding inodes to a writeback list on the bdi when the mapping is
>> first tagged as having pages under writeback.  wait_sb_inodes() can
>> then walk this list of "inodes under IO" and wait specifically just
>> for the inodes that the current sync(2) needs to wait for.
>>
>> To avoid needing all the realted locking to be safe against
>> interrupts, Jan Kara suggested that we be lazy about removal from
>> the writeback list. That is, we don't remove inodes from the
>> writeback list on IO completion, but do it directly during a
>> wait_sb_inodes() walk.
>>
>> This means that the a rare sync(2) call will have some work to do
>> skipping clean inodes However, in the current problem case of
>> concurrent sync workloads, concurrent wait_sb_inodes() calls only
>> walk the very recently dispatched inodes and hence should have very
>> little work to do.
>>
>> This also means that we have to remove the inodes from the writeback
>> list during eviction. Do this in inode_wait_for_writeback() once
>> all writeback on the inode is complete.
>>
>> Signed-off-by: Dave Chinner <dchinner@redhat.com>
>> Signed-off-by: Josef Bacik <jbacik@fb.com>
>
> I don't think you noticed comments I did when you last posted this
> series:

You're right I missed them completely, sorry.

>
>>   /*
>> - * Wait for writeback on an inode to complete. Caller must have inode pinned.
>> + * Wait for writeback on an inode to complete during eviction. Caller must have
>> + * inode pinned.
>>    */
>>   void inode_wait_for_writeback(struct inode *inode)
>>   {
>> +	BUG_ON(!(inode->i_state & I_FREEING));
>> +
>>   	spin_lock(&inode->i_lock);
>>   	__inode_wait_for_writeback(inode);
>>   	spin_unlock(&inode->i_lock);
>> +
>> +	/*
>> +	 * bd_inode's will have already had the bdev free'd so don't bother
>> +	 * doing the bdi_clear_inode_writeback.
>> +	 */
>> +	if (!sb_is_blkdev_sb(inode->i_sb))
>> +		bdi_clear_inode_writeback(inode_to_bdi(inode), inode);
>>   }
>
> Why do we bother with not putting bdev inode back?
>

My memory is rusty on this, but if the inode is the inode for a bdev we 
will have already free'd the bdev at this point and we get a null 
pointer deref, so this just skips that bit.  I'll fix up the indenting. 
  Thanks,

Josef

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 6/8] bdi: add a new writeback list for sync
  2015-06-16 15:42     ` Josef Bacik
@ 2015-06-17 10:34       ` Jan Kara
  2015-06-17 17:55         ` Josef Bacik
  0 siblings, 1 reply; 24+ messages in thread
From: Jan Kara @ 2015-06-17 10:34 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Jan Kara, linux-fsdevel, kernel-team, viro, hch, david,
	Dave Chinner

On Tue 16-06-15 08:42:27, Josef Bacik wrote:
> >>  /*
> >>- * Wait for writeback on an inode to complete. Caller must have inode pinned.
> >>+ * Wait for writeback on an inode to complete during eviction. Caller must have
> >>+ * inode pinned.
> >>   */
> >>  void inode_wait_for_writeback(struct inode *inode)
> >>  {
> >>+	BUG_ON(!(inode->i_state & I_FREEING));
> >>+
> >>  	spin_lock(&inode->i_lock);
> >>  	__inode_wait_for_writeback(inode);
> >>  	spin_unlock(&inode->i_lock);
> >>+
> >>+	/*
> >>+	 * bd_inode's will have already had the bdev free'd so don't bother
> >>+	 * doing the bdi_clear_inode_writeback.
> >>+	 */
> >>+	if (!sb_is_blkdev_sb(inode->i_sb))
> >>+		bdi_clear_inode_writeback(inode_to_bdi(inode), inode);
> >>  }
> >
> >Why do we bother with not putting bdev inode back?
> >
> 
> My memory is rusty on this, but if the inode is the inode for a bdev
> we will have already free'd the bdev at this point and we get a null
> pointer deref, so this just skips that bit.

Ah, the reason likely is that bdev->bd_disk is NULL (already cleaned up in
__blkdev_put()) at this moment and thus bdev_get_queue() called from
inode_to_bdi() will oops. Can you please add these details to the comment?
It's a bit subtle...

Also we shouldn't have any pages in the block device mapping anymore
because of the work done in __blkdev_put() (and thus inode shouldn't be in
the writeback list) but I'd be calmer if we asserted
list_empty(&inode->i_wb_list). Can you please add that? Thanks!

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 6/8] bdi: add a new writeback list for sync
  2015-06-17 10:34       ` Jan Kara
@ 2015-06-17 17:55         ` Josef Bacik
  2015-06-18  9:28           ` Jan Kara
  0 siblings, 1 reply; 24+ messages in thread
From: Josef Bacik @ 2015-06-17 17:55 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, kernel-team, viro, hch, david, Dave Chinner

On 06/17/2015 03:34 AM, Jan Kara wrote:
> On Tue 16-06-15 08:42:27, Josef Bacik wrote:
>>>>   /*
>>>> - * Wait for writeback on an inode to complete. Caller must have inode pinned.
>>>> + * Wait for writeback on an inode to complete during eviction. Caller must have
>>>> + * inode pinned.
>>>>    */
>>>>   void inode_wait_for_writeback(struct inode *inode)
>>>>   {
>>>> +	BUG_ON(!(inode->i_state & I_FREEING));
>>>> +
>>>>   	spin_lock(&inode->i_lock);
>>>>   	__inode_wait_for_writeback(inode);
>>>>   	spin_unlock(&inode->i_lock);
>>>> +
>>>> +	/*
>>>> +	 * bd_inode's will have already had the bdev free'd so don't bother
>>>> +	 * doing the bdi_clear_inode_writeback.
>>>> +	 */
>>>> +	if (!sb_is_blkdev_sb(inode->i_sb))
>>>> +		bdi_clear_inode_writeback(inode_to_bdi(inode), inode);
>>>>   }
>>>
>>> Why do we bother with not putting bdev inode back?
>>>
>>
>> My memory is rusty on this, but if the inode is the inode for a bdev
>> we will have already free'd the bdev at this point and we get a null
>> pointer deref, so this just skips that bit.
>
> Ah, the reason likely is that bdev->bd_disk is NULL (already cleaned up in
> __blkdev_put()) at this moment and thus bdev_get_queue() called from
> inode_to_bdi() will oops. Can you please add these details to the comment?
> It's a bit subtle...
>
> Also we shouldn't have any pages in the block device mapping anymore
> because of the work done in __blkdev_put() (and thus inode shouldn't be in
> the writeback list) but I'd be calmer if we asserted
> list_empty(&inode->i_wb_list). Can you please add that? Thanks!

Won't it trip if we never sync before we drop the device tho?  So we 
write some stuff to the block device, it gets written out, we then drop 
the device for whatever reason and boom, hit BUG_ON(&inode->i_wb_list) 
because we're still on the writeback list even though it doesn't matter 
because this disk is going away.  Just an untested theory, what do you 
think?  If it is possible I suppose I could just add the clear'ing bit 
for the bd inode to before we drop bd_disk if that would make you happy. 
  Thanks,

Josef


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 6/8] bdi: add a new writeback list for sync
  2015-06-17 17:55         ` Josef Bacik
@ 2015-06-18  9:28           ` Jan Kara
  0 siblings, 0 replies; 24+ messages in thread
From: Jan Kara @ 2015-06-18  9:28 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Jan Kara, linux-fsdevel, kernel-team, viro, hch, david,
	Dave Chinner

On Wed 17-06-15 10:55:58, Josef Bacik wrote:
> On 06/17/2015 03:34 AM, Jan Kara wrote:
> >On Tue 16-06-15 08:42:27, Josef Bacik wrote:
> >>>>  /*
> >>>>- * Wait for writeback on an inode to complete. Caller must have inode pinned.
> >>>>+ * Wait for writeback on an inode to complete during eviction. Caller must have
> >>>>+ * inode pinned.
> >>>>   */
> >>>>  void inode_wait_for_writeback(struct inode *inode)
> >>>>  {
> >>>>+	BUG_ON(!(inode->i_state & I_FREEING));
> >>>>+
> >>>>  	spin_lock(&inode->i_lock);
> >>>>  	__inode_wait_for_writeback(inode);
> >>>>  	spin_unlock(&inode->i_lock);
> >>>>+
> >>>>+	/*
> >>>>+	 * bd_inode's will have already had the bdev free'd so don't bother
> >>>>+	 * doing the bdi_clear_inode_writeback.
> >>>>+	 */
> >>>>+	if (!sb_is_blkdev_sb(inode->i_sb))
> >>>>+		bdi_clear_inode_writeback(inode_to_bdi(inode), inode);
> >>>>  }
> >>>
> >>>Why do we bother with not putting bdev inode back?
> >>>
> >>
> >>My memory is rusty on this, but if the inode is the inode for a bdev
> >>we will have already free'd the bdev at this point and we get a null
> >>pointer deref, so this just skips that bit.
> >
> >Ah, the reason likely is that bdev->bd_disk is NULL (already cleaned up in
> >__blkdev_put()) at this moment and thus bdev_get_queue() called from
> >inode_to_bdi() will oops. Can you please add these details to the comment?
> >It's a bit subtle...
> >
> >Also we shouldn't have any pages in the block device mapping anymore
> >because of the work done in __blkdev_put() (and thus inode shouldn't be in
> >the writeback list) but I'd be calmer if we asserted
> >list_empty(&inode->i_wb_list). Can you please add that? Thanks!
> 
> Won't it trip if we never sync before we drop the device tho?  So we
> write some stuff to the block device, it gets written out, we then
> drop the device for whatever reason and boom, hit
> BUG_ON(&inode->i_wb_list) because we're still on the writeback list
> even though it doesn't matter because this disk is going away.  Just
> an untested theory, what do you think?

Well, we are going to free the inode. If it is still linked into
the writeback list, we are in trouble as the list will now contain freed
element. So better BUG_ON earlier than chase memory corruption later.

Calling bdi_clear_inode_writeback() in kill_bdev() is probably a good idea
and we can then just have the assertion in evict() to make sure nothing
went wrong. OK?

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 0/8] super block scalability patches V4
@ 2015-06-24  3:23 Josef Bacik
  2015-06-24  3:24 ` [PATCH 6/8] bdi: add a new writeback list for sync Josef Bacik
  0 siblings, 1 reply; 24+ messages in thread
From: Josef Bacik @ 2015-06-24  3:23 UTC (permalink / raw)
  To: linux-fsdevel, kernel-team, viro, hch, david, jack

This is the latest round of patches, it includes all the new reviewed-by's, the
fixed up 'bdi: add a new writeback list for sync' patch that has Jan's
suggestions included and Dave's patch folded in to address the problem he saw in
testing.  I've rebased these on Linus's master from this evening.  Al would you
please consider pulling

git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next.git superblock-scaling

Thanks,

Josef

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 6/8] bdi: add a new writeback list for sync
  2015-06-24  3:23 [PATCH 0/8] super block scalability patches V4 Josef Bacik
@ 2015-06-24  3:24 ` Josef Bacik
  2015-12-09 18:40   ` Brian Foster
  0 siblings, 1 reply; 24+ messages in thread
From: Josef Bacik @ 2015-06-24  3:24 UTC (permalink / raw)
  To: linux-fsdevel, kernel-team, viro, hch, david, jack; +Cc: Dave Chinner

From: Dave Chinner <dchinner@redhat.com>

wait_sb_inodes() current does a walk of all inodes in the filesystem
to find dirty one to wait on during sync. This is highly
inefficient and wastes a lot of CPU when there are lots of clean
cached inodes that we don't need to wait on.

To avoid this "all inode" walk, we need to track inodes that are
currently under writeback that we need to wait for. We do this by
adding inodes to a writeback list on the bdi when the mapping is
first tagged as having pages under writeback.  wait_sb_inodes() can
then walk this list of "inodes under IO" and wait specifically just
for the inodes that the current sync(2) needs to wait for.

To avoid needing all the realted locking to be safe against
interrupts, Jan Kara suggested that we be lazy about removal from
the writeback list. That is, we don't remove inodes from the
writeback list on IO completion, but do it directly during a
wait_sb_inodes() walk.

This means that the a rare sync(2) call will have some work to do
skipping clean inodes However, in the current problem case of
concurrent sync workloads, concurrent wait_sb_inodes() calls only
walk the very recently dispatched inodes and hence should have very
little work to do.

This also means that we have to remove the inodes from the writeback
list during eviction. Do this in inode_wait_for_writeback() once
all writeback on the inode is complete.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Tested-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/block_dev.c              |   2 +
 fs/fs-writeback.c           | 148 +++++++++++++++++++++++++++++++++++---------
 fs/inode.c                  |   1 +
 include/linux/backing-dev.h |   5 ++
 include/linux/fs.h          |   1 +
 mm/backing-dev.c            |   1 +
 mm/page-writeback.c         |  14 +++++
 7 files changed, 144 insertions(+), 28 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index f2a89be..2726ff1 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -64,6 +64,8 @@ void kill_bdev(struct block_device *bdev)
 {
 	struct address_space *mapping = bdev->bd_inode->i_mapping;
 
+	bdi_clear_inode_writeback(inode_to_bdi(bdev->bd_inode),
+				  bdev->bd_inode);
 	if (mapping->nrpages == 0 && mapping->nrshadows == 0)
 		return;
 
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index aa72536..5c005ad 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -210,6 +210,34 @@ void inode_wb_list_del(struct inode *inode)
 }
 
 /*
+ * mark an inode as under writeback on the given bdi
+ */
+void bdi_mark_inode_writeback(struct backing_dev_info *bdi, struct inode *inode)
+{
+	WARN_ON_ONCE(bdi != inode_to_bdi(inode));
+	if (list_empty(&inode->i_wb_list)) {
+		spin_lock(&bdi->wb.list_lock);
+		if (list_empty(&inode->i_wb_list))
+			list_add_tail(&inode->i_wb_list, &bdi->wb.b_writeback);
+		spin_unlock(&bdi->wb.list_lock);
+	}
+}
+
+/*
+ * clear an inode as under writeback on the given bdi
+ */
+void bdi_clear_inode_writeback(struct backing_dev_info *bdi,
+			       struct inode *inode)
+{
+	WARN_ON_ONCE(bdi != inode_to_bdi(inode));
+	if (!list_empty(&inode->i_wb_list)) {
+		spin_lock(&bdi->wb.list_lock);
+		list_del_init(&inode->i_wb_list);
+		spin_unlock(&bdi->wb.list_lock);
+	}
+}
+
+/*
  * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
  * furthest end of its superblock's dirty-inode list.
  *
@@ -383,13 +411,28 @@ static void __inode_wait_for_writeback(struct inode *inode)
 }
 
 /*
- * Wait for writeback on an inode to complete. Caller must have inode pinned.
+ * Wait for writeback on an inode to complete during eviction. Caller must have
+ * inode pinned.
  */
 void inode_wait_for_writeback(struct inode *inode)
 {
+	BUG_ON(!(inode->i_state & I_FREEING));
+
 	spin_lock(&inode->i_lock);
 	__inode_wait_for_writeback(inode);
 	spin_unlock(&inode->i_lock);
+
+	/*
+	 * For a bd_inode when we do inode_to_bdi we'll want to get the bdev for
+	 * the inode and then deref bdev->bd_disk, which at this point has been
+	 * set to NULL, so we would panic.  At the point we are dropping our
+	 * bd_inode we won't have any pages under writeback on the device so
+	 * this is safe.  But just in case we'll assert to make sure we don't
+	 * screw this up.
+	 */
+	if (!sb_is_blkdev_sb(inode->i_sb))
+		bdi_clear_inode_writeback(inode_to_bdi(inode), inode);
+	BUG_ON(!list_empty(&inode->i_wb_list));
 }
 
 /*
@@ -1372,7 +1415,9 @@ EXPORT_SYMBOL(__mark_inode_dirty);
  */
 static void wait_sb_inodes(struct super_block *sb)
 {
-	struct inode *inode, *old_inode = NULL;
+	struct backing_dev_info *bdi = sb->s_bdi;
+	LIST_HEAD(sync_list);
+	struct inode *iput_inode = NULL;
 
 	/*
 	 * We need to be protected against the filesystem going from
@@ -1380,48 +1425,95 @@ static void wait_sb_inodes(struct super_block *sb)
 	 */
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
-	mutex_lock(&sb->s_sync_lock);
-	spin_lock(&sb->s_inode_list_lock);
-
 	/*
-	 * Data integrity sync. Must wait for all pages under writeback,
-	 * because there may have been pages dirtied before our sync
-	 * call, but which had writeout started before we write it out.
-	 * In which case, the inode may not be on the dirty list, but
-	 * we still have to wait for that writeout.
+	 * Data integrity sync. Must wait for all pages under writeback, because
+	 * there may have been pages dirtied before our sync call, but which had
+	 * writeout started before we write it out.  In which case, the inode
+	 * may not be on the dirty list, but we still have to wait for that
+	 * writeout.
+	 *
+	 * To avoid syncing inodes put under IO after we have started here,
+	 * splice the io list to a temporary list head and walk that. Newly
+	 * dirtied inodes will go onto the primary list so we won't wait for
+	 * them. This is safe to do as we are serialised by the s_sync_lock,
+	 * so we'll complete processing the complete list before the next
+	 * sync operation repeats the splice-and-walk process.
+	 *
+	 * Inodes that have pages under writeback after we've finished the wait
+	 * may or may not be on the primary list. They had pages under put IO
+	 * after we started our wait, so we need to make sure the next sync or
+	 * IO completion treats them correctly, Move them back to the primary
+	 * list and restart the walk.
+	 *
+	 * Inodes that are clean after we have waited for them don't belong on
+	 * any list, so simply remove them from the sync list and move onwards.
 	 */
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	mutex_lock(&sb->s_sync_lock);
+	spin_lock(&bdi->wb.list_lock);
+	list_splice_init(&bdi->wb.b_writeback, &sync_list);
+
+	while (!list_empty(&sync_list)) {
+		struct inode *inode = list_first_entry(&sync_list, struct inode,
+						       i_wb_list);
 		struct address_space *mapping = inode->i_mapping;
 
+		/*
+		 * We are lazy on IO completion and don't remove inodes from the
+		 * list when they are clean. Detect that immediately and skip
+		 * inodes we don't ahve to wait on.
+		 */
+		if (!mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
+			list_del_init(&inode->i_wb_list);
+			cond_resched_lock(&bdi->wb.list_lock);
+			continue;
+		}
+
 		spin_lock(&inode->i_lock);
-		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
-		    (mapping->nrpages == 0)) {
+		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
+			list_move(&inode->i_wb_list, &bdi->wb.b_writeback);
 			spin_unlock(&inode->i_lock);
+			cond_resched_lock(&bdi->wb.list_lock);
 			continue;
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
+		spin_unlock(&bdi->wb.list_lock);
 
-		/*
-		 * We hold a reference to 'inode' so it couldn't have been
-		 * removed from s_inodes list while we dropped the
-		 * s_inode_list_lock.  We cannot iput the inode now as we can
-		 * be holding the last reference and we cannot iput it under
-		 * s_inode_list_lock. So we keep the reference and iput it
-		 * later.
-		 */
-		iput(old_inode);
-		old_inode = inode;
+		if (iput_inode)
+			iput(iput_inode);
 
 		filemap_fdatawait(mapping);
-
 		cond_resched();
 
-		spin_lock(&sb->s_inode_list_lock);
+		/*
+		 * the inode has been written back now, so check whether we
+		 * still have pages under IO and move it back to the primary
+		 * list if necessary.  We really need the mapping->tree_lock
+		 * here because bdi_mark_inode_writeback may have not done
+		 * anything because we were on the spliced list and we need to
+		 * check TAG_WRITEBACK.
+		 */
+		spin_lock_irq(&mapping->tree_lock);
+		spin_lock(&bdi->wb.list_lock);
+		if (mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
+			WARN_ON(list_empty(&inode->i_wb_list));
+			list_move(&inode->i_wb_list, &bdi->wb.b_writeback);
+                } else
+			list_del_init(&inode->i_wb_list);
+		spin_unlock_irq(&mapping->tree_lock);
+
+		/*
+		 * can't iput inode while holding the wb.list_lock. Save it for
+		 * the next time through the loop when we drop all our spin
+		 * locks.
+		 */
+		iput_inode = inode;
 	}
-	spin_unlock(&sb->s_inode_list_lock);
-	iput(old_inode);
+	spin_unlock(&bdi->wb.list_lock);
+
+	if (iput_inode)
+		iput(iput_inode);
+
 	mutex_unlock(&sb->s_sync_lock);
 }
 
diff --git a/fs/inode.c b/fs/inode.c
index 770d684..8f00557 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -357,6 +357,7 @@ void inode_init_once(struct inode *inode)
 	INIT_HLIST_NODE(&inode->i_hash);
 	INIT_LIST_HEAD(&inode->i_devices);
 	INIT_LIST_HEAD(&inode->i_io_list);
+	INIT_LIST_HEAD(&inode->i_wb_list);
 	INIT_LIST_HEAD(&inode->i_lru);
 	address_space_init_once(&inode->i_data);
 	i_size_ordered_init(inode);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index d87d8ec..0d5bca2 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -56,6 +56,7 @@ struct bdi_writeback {
 	struct list_head b_io;		/* parked for writeback */
 	struct list_head b_more_io;	/* parked for more writeback */
 	struct list_head b_dirty_time;	/* time stamps are dirty */
+	struct list_head b_writeback;	/* inodes under writeback */
 	spinlock_t list_lock;		/* protects the b_* lists */
 };
 
@@ -123,6 +124,10 @@ void bdi_start_background_writeback(struct backing_dev_info *bdi);
 void bdi_writeback_workfn(struct work_struct *work);
 int bdi_has_dirty_io(struct backing_dev_info *bdi);
 void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi);
+void bdi_mark_inode_writeback(struct backing_dev_info *bdi,
+			      struct inode *inode);
+void bdi_clear_inode_writeback(struct backing_dev_info *bdi,
+			       struct inode *inode);
 
 extern spinlock_t bdi_lock;
 extern struct list_head bdi_list;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6dd2ab2..d3baa08 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -636,6 +636,7 @@ struct inode {
 	struct list_head	i_io_list;	/* backing dev IO list */
 	struct list_head	i_lru;		/* inode LRU list */
 	struct list_head	i_sb_list;
+	struct list_head	i_wb_list;	/* backing dev writeback list */
 	union {
 		struct hlist_head	i_dentry;
 		struct rcu_head		i_rcu;
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index e9ed047..ea39301 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -369,6 +369,7 @@ static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
 	INIT_LIST_HEAD(&wb->b_io);
 	INIT_LIST_HEAD(&wb->b_more_io);
 	INIT_LIST_HEAD(&wb->b_dirty_time);
+	INIT_LIST_HEAD(&wb->b_writeback);
 	spin_lock_init(&wb->list_lock);
 	INIT_DELAYED_WORK(&wb->dwork, bdi_writeback_workfn);
 }
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index eb59f7e..f3751d1 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2382,11 +2382,25 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
 		spin_lock_irqsave(&mapping->tree_lock, flags);
 		ret = TestSetPageWriteback(page);
 		if (!ret) {
+			bool on_wblist;
+
+			on_wblist = mapping_tagged(mapping,
+						   PAGECACHE_TAG_WRITEBACK);
+
 			radix_tree_tag_set(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
 			if (bdi_cap_account_writeback(bdi))
 				__inc_bdi_stat(bdi, BDI_WRITEBACK);
+
+			/*
+			 * we can come through here when swapping anonymous
+			 * pages, so we don't necessarily have an inode to
+			 * track for sync. Inodes are remove lazily, so there is
+			 * no equivalent in test_clear_page_writeback().
+			 */
+			if (!on_wblist && mapping->host)
+				bdi_mark_inode_writeback(bdi, mapping->host);
 		}
 		if (!PageDirty(page))
 			radix_tree_tag_clear(&mapping->page_tree,
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 6/8] bdi: add a new writeback list for sync
  2015-06-24  3:24 ` [PATCH 6/8] bdi: add a new writeback list for sync Josef Bacik
@ 2015-12-09 18:40   ` Brian Foster
  2015-12-10 10:08     ` Jan Kara
  0 siblings, 1 reply; 24+ messages in thread
From: Brian Foster @ 2015-12-09 18:40 UTC (permalink / raw)
  To: Josef Bacik
  Cc: linux-fsdevel, kernel-team, viro, hch, david, jack, Dave Chinner

On Tue, Jun 23, 2015 at 08:24:00PM -0700, Josef Bacik wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> wait_sb_inodes() current does a walk of all inodes in the filesystem
> to find dirty one to wait on during sync. This is highly
> inefficient and wastes a lot of CPU when there are lots of clean
> cached inodes that we don't need to wait on.
> 
> To avoid this "all inode" walk, we need to track inodes that are
> currently under writeback that we need to wait for. We do this by
> adding inodes to a writeback list on the bdi when the mapping is
> first tagged as having pages under writeback.  wait_sb_inodes() can
> then walk this list of "inodes under IO" and wait specifically just
> for the inodes that the current sync(2) needs to wait for.
> 
> To avoid needing all the realted locking to be safe against
> interrupts, Jan Kara suggested that we be lazy about removal from
> the writeback list. That is, we don't remove inodes from the
> writeback list on IO completion, but do it directly during a
> wait_sb_inodes() walk.
> 
> This means that the a rare sync(2) call will have some work to do
> skipping clean inodes However, in the current problem case of
> concurrent sync workloads, concurrent wait_sb_inodes() calls only
> walk the very recently dispatched inodes and hence should have very
> little work to do.
> 
> This also means that we have to remove the inodes from the writeback
> list during eviction. Do this in inode_wait_for_writeback() once
> all writeback on the inode is complete.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Signed-off-by: Josef Bacik <jbacik@fb.com>
> Tested-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Jan Kara <jack@suse.cz>
> ---

Hi all,

My understanding is this (and the subsequent) patch were never merged
due to conflicts with the cgroup aware writeback stuff that had been
merged. I think the fundamental approach of this patch still solves the
original problem, the writeback inodes are just always tracked on the
root bdi_writeback rather than the per-cgroup bdi_writeback the inode
was associated with before writeback.

Any thoughts on that are appreciated. Otherwise, I'm trying to rebase
this and I've hit a lockdep splat [1] that suggests the locking for the
writeback list needs to be tweaked one way or another. Note that lockdep
fires with this original version as well, so I don't think this is a new
issue.

I _think_ this means that list_lock either needs to be taken outside of
mapping->tree_lock or we need to create a new irq safe spinlock specific
to the b_writeback list. The former looks like it requires some ugliness
in __test_set_page_writeback() to acquire list_lock first, but only when
list_empty(&inode->i_wb_list) so we don't grab it for every page. The
latter obviously means a new irq-disabling lock (which is already done
in parts of wait_sb_inode(), fwiw). The latter might also facilitate
further cleanups like pulling b_writeback out of bdi_writeback and up
into backing_dev_info since we really only need one instance of the
list, but I'm not sure if that's preferable..

Thoughts?

Brian

[1] lockdep splat:

[ INFO: possible irq lock inversion dependency detected ]
4.4.0-rc4+ #54 Not tainted
---------------------------------------------------------
swapper/0/0 just changed the state of lock:
(&(&mapping->tree_lock)->rlock){-.....}, at: [<ffffffff811d25ed>] test_clear_page_writeback+0x5d/0x210
but this lock took another, HARDIRQ-unsafe lock in the past:
(&(&wb->list_lock)->rlock){+.+...}#012#012and interrupts could create inverse lock ordering between them.
#012other info that might help us debug this:
Possible interrupt unsafe locking scenario:
      CPU0                    CPU1
      ----                    ----
 lock(&(&wb->list_lock)->rlock);
                              local_irq_disable();
                              lock(&(&mapping->tree_lock)->rlock);
                              lock(&(&wb->list_lock)->rlock);
 <Interrupt>
   lock(&(&mapping->tree_lock)->rlock);
#012 *** DEADLOCK ***
no locks held by swapper/0/0.
#012the shortest dependencies between 2nd lock and 1st lock:
-> (&(&wb->list_lock)->rlock){+.+...} ops: 194 {
   HARDIRQ-ON-W at:
                     [<ffffffff810f9959>] __lock_acquire+0x819/0x1a50
                     [<ffffffff810fb54e>] lock_acquire+0xce/0x1c0
                     [<ffffffff817e3df8>] _raw_spin_lock+0x38/0x50
                     [<ffffffff8126b7a9>] __mark_inode_dirty+0x3d9/0x5e0
                     [<ffffffff8125a859>] generic_update_time+0x79/0xd0
                     [<ffffffff8125c638>] touch_atime+0xa8/0xd0
                     [<ffffffff812519c3>] iterate_dir+0xe3/0x130
                     [<ffffffff81251ecd>] SyS_getdents+0x9d/0x130
                     [<ffffffff817e4a72>] entry_SYSCALL_64_fastpath+0x12/0x76
   SOFTIRQ-ON-W at:
                     [<ffffffff810f998a>] __lock_acquire+0x84a/0x1a50
                     [<ffffffff810fb54e>] lock_acquire+0xce/0x1c0
                     [<ffffffff817e3df8>] _raw_spin_lock+0x38/0x50
                     [<ffffffff8126b7a9>] __mark_inode_dirty+0x3d9/0x5e0
                     [<ffffffff8125a859>] generic_update_time+0x79/0xd0
                     [<ffffffff8125c638>] touch_atime+0xa8/0xd0
                     [<ffffffff812519c3>] iterate_dir+0xe3/0x130
                     [<ffffffff81251ecd>] SyS_getdents+0x9d/0x130
                     [<ffffffff817e4a72>] entry_SYSCALL_64_fastpath+0x12/0x76
   INITIAL USE at:
                    [<ffffffff810f9538>] __lock_acquire+0x3f8/0x1a50
                    [<ffffffff810fb54e>] lock_acquire+0xce/0x1c0
                    [<ffffffff817e3df8>] _raw_spin_lock+0x38/0x50
                    [<ffffffff8126b7a9>] __mark_inode_dirty+0x3d9/0x5e0
                    [<ffffffff8125a859>] generic_update_time+0x79/0xd0
                    [<ffffffff8125c638>] touch_atime+0xa8/0xd0
                    [<ffffffff812519c3>] iterate_dir+0xe3/0x130
                    [<ffffffff81251ecd>] SyS_getdents+0x9d/0x130
                    [<ffffffff817e4a72>] entry_SYSCALL_64_fastpath+0x12/0x76
 }
 ... key      at: [<ffffffff82c7faa0>] __key.37388+0x0/0x8
 ... acquired at:
  [<ffffffff810fb54e>] lock_acquire+0xce/0x1c0
  [<ffffffff817e3df8>] _raw_spin_lock+0x38/0x50
  [<ffffffff812703a7>] bdi_mark_inode_writeback+0x67/0xd0
  [<ffffffff811d19b2>] __test_set_page_writeback+0x112/0x200
  [<ffffffff8127bcdc>] __block_write_full_page.constprop.44+0xec/0x400
  [<ffffffff8127c0ee>] block_write_full_page+0xfe/0x190
  [<ffffffff8127cec8>] blkdev_writepage+0x18/0x20
  [<ffffffff811cede6>] __writepage+0x16/0x40
  [<ffffffff811cfc65>] write_cache_pages+0x225/0x5f0
  [<ffffffff811d0084>] generic_writepages+0x54/0x80
  [<ffffffff811d24e1>] do_writepages+0x21/0x30
  [<ffffffff811c4b40>] __filemap_fdatawrite_range+0x80/0xb0
  [<ffffffff811c4c7d>] filemap_write_and_wait_range+0x2d/0x70
  [<ffffffff8127ca8b>] blkdev_fsync+0x1b/0x50
  [<ffffffff81274a9b>] vfs_fsync_range+0x4b/0xb0
  [<ffffffff81274b5d>] do_fsync+0x3d/0x70
  [<ffffffff81274e10>] SyS_fsync+0x10/0x20
  [<ffffffff817e4a72>] entry_SYSCALL_64_fastpath+0x12/0x76

-> (&(&mapping->tree_lock)->rlock){-.....} ops: 16821 {
  IN-HARDIRQ-W at:
                   [<ffffffff810f9ad9>] __lock_acquire+0x999/0x1a50
                   [<ffffffff810fb54e>] lock_acquire+0xce/0x1c0
                   [<ffffffff817e4810>] _raw_spin_lock_irqsave+0x50/0x70
                   [<ffffffff811d25ed>] test_clear_page_writeback+0x5d/0x210
                   [<ffffffff811c2b2f>] end_page_writeback+0x1f/0xa0
                   [<ffffffff81278df3>] end_buffer_async_write+0xd3/0x1a0
                   [<ffffffff81277828>] end_bio_bh_io_sync+0x28/0x40
                   [<ffffffff813a329f>] bio_endio+0x3f/0x60
                   [<ffffffff8162d4d9>] dec_pending+0x149/0x310
                   [<ffffffff8162e1d6>] clone_endio+0x76/0xe0
                   [<ffffffff813a329f>] bio_endio+0x3f/0x60
                   [<ffffffff813ab512>] blk_update_request+0xb2/0x3a0
                   [<ffffffff813b63ca>] blk_mq_end_request+0x1a/0x70
                   [<ffffffffa009e82f>] virtblk_request_done+0x3f/0x70 [virtio_blk]
                   [<ffffffff813b4ea3>] __blk_mq_complete_request_remote+0x13/0x20
                   [<ffffffff81137d9f>] flush_smp_call_function_queue+0x5f/0x160
                   [<ffffffff811388d3>] generic_smp_call_function_single_interrupt+0x13/0x60
                   [<ffffffff810530a7>] smp_call_function_single_interrupt+0x27/0x40
                   [<ffffffff817e604c>] call_function_single_interrupt+0x8c/0xa0
                   [<ffffffff81026703>] default_idle+0x23/0x150
                   [<ffffffff8102702f>] arch_cpu_idle+0xf/0x20
                   [<ffffffff810efd8a>] default_idle_call+0x2a/0x40
                   [<ffffffff810f0184>] cpu_startup_entry+0x384/0x3f0
                   [<ffffffff817d6aaa>] rest_init+0x13a/0x140
                   [<ffffffff81d6b01f>] start_kernel+0x49e/0x4bf
                   [<ffffffff81d6a339>] x86_64_start_reservations+0x2a/0x2c
                   [<ffffffff81d6a480>] x86_64_start_kernel+0x145/0x168
  INITIAL USE at:
                  [<ffffffff810f9538>] __lock_acquire+0x3f8/0x1a50
                  [<ffffffff810fb54e>] lock_acquire+0xce/0x1c0
                  [<ffffffff817e3fa4>] _raw_spin_lock_irq+0x44/0x60
                  [<ffffffff811c303c>] __add_to_page_cache_locked+0xbc/0x340
                  [<ffffffff811c3317>] add_to_page_cache_lru+0x37/0x90
                  [<ffffffff811c3dc7>] pagecache_get_page+0xb7/0x200
                  [<ffffffff811c4139>] grab_cache_page_write_begin+0x29/0x40
                  [<ffffffff81269848>] simple_write_begin+0x28/0x1c0
                  [<ffffffff811c2a01>] generic_perform_write+0xd1/0x1e0
                  [<ffffffff811c55f2>] __generic_file_write_iter+0x1a2/0x1e0
                  [<ffffffff811c5718>] generic_file_write_iter+0xe8/0x1e0
                  [<ffffffff8123c409>] __vfs_write+0xc9/0x110
                  [<ffffffff8123ca99>] vfs_write+0xa9/0x1a0
                  [<ffffffff8123d778>] SyS_write+0x58/0xd0
                  [<ffffffff81d6c3b1>] xwrite+0x29/0x5c
                  [<ffffffff81d6c40e>] do_copy+0x2a/0xb8
                  [<ffffffff81d6c151>] write_buffer+0x23/0x34
                  [<ffffffff81d6c9e6>] unpack_to_rootfs+0xfd/0x294
                  [<ffffffff81d6cbd9>] populate_rootfs+0x5c/0x108
                  [<ffffffff81002123>] do_one_initcall+0xb3/0x200
                  [<ffffffff81d6b22e>] kernel_init_freeable+0x1ee/0x28d
                  [<ffffffff817d6abe>] kernel_init+0xe/0xe0
                  [<ffffffff817e4ddf>] ret_from_fork+0x3f/0x70
}
... key      at: [<ffffffff82ca2760>] __key.39560+0x0/0x8
... acquired at:
  [<ffffffff810f7c1b>] check_usage_forwards+0x15b/0x160
  [<ffffffff810f88c3>] mark_lock+0x333/0x610
  [<ffffffff810f9ad9>] __lock_acquire+0x999/0x1a50
  [<ffffffff810fb54e>] lock_acquire+0xce/0x1c0
  [<ffffffff817e4810>] _raw_spin_lock_irqsave+0x50/0x70
  [<ffffffff811d25ed>] test_clear_page_writeback+0x5d/0x210
  [<ffffffff811c2b2f>] end_page_writeback+0x1f/0xa0
  [<ffffffff81278df3>] end_buffer_async_write+0xd3/0x1a0
  [<ffffffff81277828>] end_bio_bh_io_sync+0x28/0x40
  [<ffffffff813a329f>] bio_endio+0x3f/0x60
  [<ffffffff8162d4d9>] dec_pending+0x149/0x310
  [<ffffffff8162e1d6>] clone_endio+0x76/0xe0
  [<ffffffff813a329f>] bio_endio+0x3f/0x60
  [<ffffffff813ab512>] blk_update_request+0xb2/0x3a0
  [<ffffffff813b63ca>] blk_mq_end_request+0x1a/0x70
  [<ffffffffa009e82f>] virtblk_request_done+0x3f/0x70 [virtio_blk]
  [<ffffffff813b4ea3>] __blk_mq_complete_request_remote+0x13/0x20
  [<ffffffff81137d9f>] flush_smp_call_function_queue+0x5f/0x160
  [<ffffffff811388d3>] generic_smp_call_function_single_interrupt+0x13/0x60
  [<ffffffff810530a7>] smp_call_function_single_interrupt+0x27/0x40
  [<ffffffff817e604c>] call_function_single_interrupt+0x8c/0xa0
  [<ffffffff81026703>] default_idle+0x23/0x150
  [<ffffffff8102702f>] arch_cpu_idle+0xf/0x20
  [<ffffffff810efd8a>] default_idle_call+0x2a/0x40
  [<ffffffff810f0184>] cpu_startup_entry+0x384/0x3f0
  [<ffffffff817d6aaa>] rest_init+0x13a/0x140
  [<ffffffff81d6b01f>] start_kernel+0x49e/0x4bf
  [<ffffffff81d6a339>] x86_64_start_reservations+0x2a/0x2c
  [<ffffffff81d6a480>] x86_64_start_kernel+0x145/0x168

#012stack backtrace:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.0-rc4+ #54
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
0000000000000000 06776aa479e380f6 ffff88011fc03ad0 ffffffff813dd019
ffffffff82a0f120 ffff88011fc03b10 ffffffff811c184d ffffffff81c10be8
ffffffff81c10be8 ffffffff81c10500 ffffffff81a56510 0000000000000000
Call Trace:
<IRQ>  [<ffffffff813dd019>] dump_stack+0x4b/0x72
[<ffffffff811c184d>] print_irq_inversion_bug.part.38+0x1a4/0x1b0
[<ffffffff810f7c1b>] check_usage_forwards+0x15b/0x160
[<ffffffff810f88c3>] mark_lock+0x333/0x610
[<ffffffff810f7ac0>] ? check_usage_backwards+0x160/0x160
[<ffffffff810f9ad9>] __lock_acquire+0x999/0x1a50
[<ffffffff810fb54e>] lock_acquire+0xce/0x1c0
[<ffffffff811d25ed>] ? test_clear_page_writeback+0x5d/0x210
[<ffffffff817e4810>] _raw_spin_lock_irqsave+0x50/0x70
[<ffffffff811d25ed>] ? test_clear_page_writeback+0x5d/0x210
[<ffffffff811d25ed>] test_clear_page_writeback+0x5d/0x210
[<ffffffff811c2b2f>] end_page_writeback+0x1f/0xa0
[<ffffffff81278df3>] end_buffer_async_write+0xd3/0x1a0
[<ffffffff81277828>] end_bio_bh_io_sync+0x28/0x40
[<ffffffff813a329f>] bio_endio+0x3f/0x60
[<ffffffff8162d4d9>] dec_pending+0x149/0x310
[<ffffffff811c6809>] ? mempool_free+0x29/0x80
[<ffffffff8162e1d6>] clone_endio+0x76/0xe0
[<ffffffff813a329f>] bio_endio+0x3f/0x60
[<ffffffff813ab512>] blk_update_request+0xb2/0x3a0
[<ffffffff813b4e90>] ? blkdev_issue_zeroout+0xf0/0xf0
[<ffffffff813b63ca>] blk_mq_end_request+0x1a/0x70
[<ffffffffa009e82f>] virtblk_request_done+0x3f/0x70 [virtio_blk]
[<ffffffff813b4ea3>] __blk_mq_complete_request_remote+0x13/0x20
[<ffffffff81137d9f>] flush_smp_call_function_queue+0x5f/0x160
[<ffffffff811388d3>] generic_smp_call_function_single_interrupt+0x13/0x60
[<ffffffff810530a7>] smp_call_function_single_interrupt+0x27/0x40
[<ffffffff817e604c>] call_function_single_interrupt+0x8c/0xa0
<EOI>  [<ffffffff810653a6>] ? native_safe_halt+0x6/0x10
[<ffffffff81026703>] default_idle+0x23/0x150
[<ffffffff8102702f>] arch_cpu_idle+0xf/0x20
[<ffffffff810efd8a>] default_idle_call+0x2a/0x40
[<ffffffff810f0184>] cpu_startup_entry+0x384/0x3f0
[<ffffffff817d6aaa>] rest_init+0x13a/0x140
[<ffffffff81d6b01f>] start_kernel+0x49e/0x4bf
[<ffffffff81d6a120>] ? early_idt_handler_array+0x120/0x120
[<ffffffff81d6a339>] x86_64_start_reservations+0x2a/0x2c
[<ffffffff81d6a480>] x86_64_start_kernel+0x145/0x168

>  fs/block_dev.c              |   2 +
>  fs/fs-writeback.c           | 148 +++++++++++++++++++++++++++++++++++---------
>  fs/inode.c                  |   1 +
>  include/linux/backing-dev.h |   5 ++
>  include/linux/fs.h          |   1 +
>  mm/backing-dev.c            |   1 +
>  mm/page-writeback.c         |  14 +++++
>  7 files changed, 144 insertions(+), 28 deletions(-)
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index f2a89be..2726ff1 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -64,6 +64,8 @@ void kill_bdev(struct block_device *bdev)
>  {
>  	struct address_space *mapping = bdev->bd_inode->i_mapping;
>  
> +	bdi_clear_inode_writeback(inode_to_bdi(bdev->bd_inode),
> +				  bdev->bd_inode);
>  	if (mapping->nrpages == 0 && mapping->nrshadows == 0)
>  		return;
>  
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index aa72536..5c005ad 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -210,6 +210,34 @@ void inode_wb_list_del(struct inode *inode)
>  }
>  
>  /*
> + * mark an inode as under writeback on the given bdi
> + */
> +void bdi_mark_inode_writeback(struct backing_dev_info *bdi, struct inode *inode)
> +{
> +	WARN_ON_ONCE(bdi != inode_to_bdi(inode));
> +	if (list_empty(&inode->i_wb_list)) {
> +		spin_lock(&bdi->wb.list_lock);
> +		if (list_empty(&inode->i_wb_list))
> +			list_add_tail(&inode->i_wb_list, &bdi->wb.b_writeback);
> +		spin_unlock(&bdi->wb.list_lock);
> +	}
> +}
> +
> +/*
> + * clear an inode as under writeback on the given bdi
> + */
> +void bdi_clear_inode_writeback(struct backing_dev_info *bdi,
> +			       struct inode *inode)
> +{
> +	WARN_ON_ONCE(bdi != inode_to_bdi(inode));
> +	if (!list_empty(&inode->i_wb_list)) {
> +		spin_lock(&bdi->wb.list_lock);
> +		list_del_init(&inode->i_wb_list);
> +		spin_unlock(&bdi->wb.list_lock);
> +	}
> +}
> +
> +/*
>   * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
>   * furthest end of its superblock's dirty-inode list.
>   *
> @@ -383,13 +411,28 @@ static void __inode_wait_for_writeback(struct inode *inode)
>  }
>  
>  /*
> - * Wait for writeback on an inode to complete. Caller must have inode pinned.
> + * Wait for writeback on an inode to complete during eviction. Caller must have
> + * inode pinned.
>   */
>  void inode_wait_for_writeback(struct inode *inode)
>  {
> +	BUG_ON(!(inode->i_state & I_FREEING));
> +
>  	spin_lock(&inode->i_lock);
>  	__inode_wait_for_writeback(inode);
>  	spin_unlock(&inode->i_lock);
> +
> +	/*
> +	 * For a bd_inode when we do inode_to_bdi we'll want to get the bdev for
> +	 * the inode and then deref bdev->bd_disk, which at this point has been
> +	 * set to NULL, so we would panic.  At the point we are dropping our
> +	 * bd_inode we won't have any pages under writeback on the device so
> +	 * this is safe.  But just in case we'll assert to make sure we don't
> +	 * screw this up.
> +	 */
> +	if (!sb_is_blkdev_sb(inode->i_sb))
> +		bdi_clear_inode_writeback(inode_to_bdi(inode), inode);
> +	BUG_ON(!list_empty(&inode->i_wb_list));
>  }
>  
>  /*
> @@ -1372,7 +1415,9 @@ EXPORT_SYMBOL(__mark_inode_dirty);
>   */
>  static void wait_sb_inodes(struct super_block *sb)
>  {
> -	struct inode *inode, *old_inode = NULL;
> +	struct backing_dev_info *bdi = sb->s_bdi;
> +	LIST_HEAD(sync_list);
> +	struct inode *iput_inode = NULL;
>  
>  	/*
>  	 * We need to be protected against the filesystem going from
> @@ -1380,48 +1425,95 @@ static void wait_sb_inodes(struct super_block *sb)
>  	 */
>  	WARN_ON(!rwsem_is_locked(&sb->s_umount));
>  
> -	mutex_lock(&sb->s_sync_lock);
> -	spin_lock(&sb->s_inode_list_lock);
> -
>  	/*
> -	 * Data integrity sync. Must wait for all pages under writeback,
> -	 * because there may have been pages dirtied before our sync
> -	 * call, but which had writeout started before we write it out.
> -	 * In which case, the inode may not be on the dirty list, but
> -	 * we still have to wait for that writeout.
> +	 * Data integrity sync. Must wait for all pages under writeback, because
> +	 * there may have been pages dirtied before our sync call, but which had
> +	 * writeout started before we write it out.  In which case, the inode
> +	 * may not be on the dirty list, but we still have to wait for that
> +	 * writeout.
> +	 *
> +	 * To avoid syncing inodes put under IO after we have started here,
> +	 * splice the io list to a temporary list head and walk that. Newly
> +	 * dirtied inodes will go onto the primary list so we won't wait for
> +	 * them. This is safe to do as we are serialised by the s_sync_lock,
> +	 * so we'll complete processing the complete list before the next
> +	 * sync operation repeats the splice-and-walk process.
> +	 *
> +	 * Inodes that have pages under writeback after we've finished the wait
> +	 * may or may not be on the primary list. They had pages under put IO
> +	 * after we started our wait, so we need to make sure the next sync or
> +	 * IO completion treats them correctly, Move them back to the primary
> +	 * list and restart the walk.
> +	 *
> +	 * Inodes that are clean after we have waited for them don't belong on
> +	 * any list, so simply remove them from the sync list and move onwards.
>  	 */
> -	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> +	mutex_lock(&sb->s_sync_lock);
> +	spin_lock(&bdi->wb.list_lock);
> +	list_splice_init(&bdi->wb.b_writeback, &sync_list);
> +
> +	while (!list_empty(&sync_list)) {
> +		struct inode *inode = list_first_entry(&sync_list, struct inode,
> +						       i_wb_list);
>  		struct address_space *mapping = inode->i_mapping;
>  
> +		/*
> +		 * We are lazy on IO completion and don't remove inodes from the
> +		 * list when they are clean. Detect that immediately and skip
> +		 * inodes we don't ahve to wait on.
> +		 */
> +		if (!mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
> +			list_del_init(&inode->i_wb_list);
> +			cond_resched_lock(&bdi->wb.list_lock);
> +			continue;
> +		}
> +
>  		spin_lock(&inode->i_lock);
> -		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
> -		    (mapping->nrpages == 0)) {
> +		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
> +			list_move(&inode->i_wb_list, &bdi->wb.b_writeback);
>  			spin_unlock(&inode->i_lock);
> +			cond_resched_lock(&bdi->wb.list_lock);
>  			continue;
>  		}
>  		__iget(inode);
>  		spin_unlock(&inode->i_lock);
> -		spin_unlock(&sb->s_inode_list_lock);
> +		spin_unlock(&bdi->wb.list_lock);
>  
> -		/*
> -		 * We hold a reference to 'inode' so it couldn't have been
> -		 * removed from s_inodes list while we dropped the
> -		 * s_inode_list_lock.  We cannot iput the inode now as we can
> -		 * be holding the last reference and we cannot iput it under
> -		 * s_inode_list_lock. So we keep the reference and iput it
> -		 * later.
> -		 */
> -		iput(old_inode);
> -		old_inode = inode;
> +		if (iput_inode)
> +			iput(iput_inode);
>  
>  		filemap_fdatawait(mapping);
> -
>  		cond_resched();
>  
> -		spin_lock(&sb->s_inode_list_lock);
> +		/*
> +		 * the inode has been written back now, so check whether we
> +		 * still have pages under IO and move it back to the primary
> +		 * list if necessary.  We really need the mapping->tree_lock
> +		 * here because bdi_mark_inode_writeback may have not done
> +		 * anything because we were on the spliced list and we need to
> +		 * check TAG_WRITEBACK.
> +		 */
> +		spin_lock_irq(&mapping->tree_lock);
> +		spin_lock(&bdi->wb.list_lock);
> +		if (mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
> +			WARN_ON(list_empty(&inode->i_wb_list));
> +			list_move(&inode->i_wb_list, &bdi->wb.b_writeback);
> +                } else
> +			list_del_init(&inode->i_wb_list);
> +		spin_unlock_irq(&mapping->tree_lock);
> +
> +		/*
> +		 * can't iput inode while holding the wb.list_lock. Save it for
> +		 * the next time through the loop when we drop all our spin
> +		 * locks.
> +		 */
> +		iput_inode = inode;
>  	}
> -	spin_unlock(&sb->s_inode_list_lock);
> -	iput(old_inode);
> +	spin_unlock(&bdi->wb.list_lock);
> +
> +	if (iput_inode)
> +		iput(iput_inode);
> +
>  	mutex_unlock(&sb->s_sync_lock);
>  }
>  
> diff --git a/fs/inode.c b/fs/inode.c
> index 770d684..8f00557 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -357,6 +357,7 @@ void inode_init_once(struct inode *inode)
>  	INIT_HLIST_NODE(&inode->i_hash);
>  	INIT_LIST_HEAD(&inode->i_devices);
>  	INIT_LIST_HEAD(&inode->i_io_list);
> +	INIT_LIST_HEAD(&inode->i_wb_list);
>  	INIT_LIST_HEAD(&inode->i_lru);
>  	address_space_init_once(&inode->i_data);
>  	i_size_ordered_init(inode);
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index d87d8ec..0d5bca2 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -56,6 +56,7 @@ struct bdi_writeback {
>  	struct list_head b_io;		/* parked for writeback */
>  	struct list_head b_more_io;	/* parked for more writeback */
>  	struct list_head b_dirty_time;	/* time stamps are dirty */
> +	struct list_head b_writeback;	/* inodes under writeback */
>  	spinlock_t list_lock;		/* protects the b_* lists */
>  };
>  
> @@ -123,6 +124,10 @@ void bdi_start_background_writeback(struct backing_dev_info *bdi);
>  void bdi_writeback_workfn(struct work_struct *work);
>  int bdi_has_dirty_io(struct backing_dev_info *bdi);
>  void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi);
> +void bdi_mark_inode_writeback(struct backing_dev_info *bdi,
> +			      struct inode *inode);
> +void bdi_clear_inode_writeback(struct backing_dev_info *bdi,
> +			       struct inode *inode);
>  
>  extern spinlock_t bdi_lock;
>  extern struct list_head bdi_list;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 6dd2ab2..d3baa08 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -636,6 +636,7 @@ struct inode {
>  	struct list_head	i_io_list;	/* backing dev IO list */
>  	struct list_head	i_lru;		/* inode LRU list */
>  	struct list_head	i_sb_list;
> +	struct list_head	i_wb_list;	/* backing dev writeback list */
>  	union {
>  		struct hlist_head	i_dentry;
>  		struct rcu_head		i_rcu;
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index e9ed047..ea39301 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -369,6 +369,7 @@ static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
>  	INIT_LIST_HEAD(&wb->b_io);
>  	INIT_LIST_HEAD(&wb->b_more_io);
>  	INIT_LIST_HEAD(&wb->b_dirty_time);
> +	INIT_LIST_HEAD(&wb->b_writeback);
>  	spin_lock_init(&wb->list_lock);
>  	INIT_DELAYED_WORK(&wb->dwork, bdi_writeback_workfn);
>  }
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index eb59f7e..f3751d1 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2382,11 +2382,25 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
>  		spin_lock_irqsave(&mapping->tree_lock, flags);
>  		ret = TestSetPageWriteback(page);
>  		if (!ret) {
> +			bool on_wblist;
> +
> +			on_wblist = mapping_tagged(mapping,
> +						   PAGECACHE_TAG_WRITEBACK);
> +
>  			radix_tree_tag_set(&mapping->page_tree,
>  						page_index(page),
>  						PAGECACHE_TAG_WRITEBACK);
>  			if (bdi_cap_account_writeback(bdi))
>  				__inc_bdi_stat(bdi, BDI_WRITEBACK);
> +
> +			/*
> +			 * we can come through here when swapping anonymous
> +			 * pages, so we don't necessarily have an inode to
> +			 * track for sync. Inodes are remove lazily, so there is
> +			 * no equivalent in test_clear_page_writeback().
> +			 */
> +			if (!on_wblist && mapping->host)
> +				bdi_mark_inode_writeback(bdi, mapping->host);
>  		}
>  		if (!PageDirty(page))
>  			radix_tree_tag_clear(&mapping->page_tree,
> -- 
> 2.1.0
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 6/8] bdi: add a new writeback list for sync
  2015-12-09 18:40   ` Brian Foster
@ 2015-12-10 10:08     ` Jan Kara
  2015-12-11 14:37       ` Brian Foster
  0 siblings, 1 reply; 24+ messages in thread
From: Jan Kara @ 2015-12-10 10:08 UTC (permalink / raw)
  To: Brian Foster
  Cc: Josef Bacik, linux-fsdevel, kernel-team, viro, hch, david, jack,
	Dave Chinner

On Wed 09-12-15 13:40:30, Brian Foster wrote:
> On Tue, Jun 23, 2015 at 08:24:00PM -0700, Josef Bacik wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > wait_sb_inodes() current does a walk of all inodes in the filesystem
> > to find dirty one to wait on during sync. This is highly
> > inefficient and wastes a lot of CPU when there are lots of clean
> > cached inodes that we don't need to wait on.
> > 
> > To avoid this "all inode" walk, we need to track inodes that are
> > currently under writeback that we need to wait for. We do this by
> > adding inodes to a writeback list on the bdi when the mapping is
> > first tagged as having pages under writeback.  wait_sb_inodes() can
> > then walk this list of "inodes under IO" and wait specifically just
> > for the inodes that the current sync(2) needs to wait for.
> > 
> > To avoid needing all the realted locking to be safe against
> > interrupts, Jan Kara suggested that we be lazy about removal from
> > the writeback list. That is, we don't remove inodes from the
> > writeback list on IO completion, but do it directly during a
> > wait_sb_inodes() walk.
> > 
> > This means that the a rare sync(2) call will have some work to do
> > skipping clean inodes However, in the current problem case of
> > concurrent sync workloads, concurrent wait_sb_inodes() calls only
> > walk the very recently dispatched inodes and hence should have very
> > little work to do.
> > 
> > This also means that we have to remove the inodes from the writeback
> > list during eviction. Do this in inode_wait_for_writeback() once
> > all writeback on the inode is complete.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > Signed-off-by: Josef Bacik <jbacik@fb.com>
> > Tested-by: Dave Chinner <dchinner@redhat.com>
> > Reviewed-by: Jan Kara <jack@suse.cz>
> > ---

  Hi,

> My understanding is this (and the subsequent) patch were never merged
> due to conflicts with the cgroup aware writeback stuff that had been
> merged.

Yes. Thanks for bringing this up. At that point I thought Josef will take
care of rebasing the patches and it fell out of my radar.

> I think the fundamental approach of this patch still solves the
> original problem, the writeback inodes are just always tracked on the
> root bdi_writeback rather than the per-cgroup bdi_writeback the inode
> was associated with before writeback.

I was thinking about this and realized that the original patch actually has
a flaw. In case there are several partitions on a bdi and sync_fs() would
be called for two filesystems on the same bdi, then one of those sync_fs()
calls would splice the whole b_writeback list into its private list in
wait_sb_inodes() and the other sync_fs() call would return too early because
it would find b_writeback list empty.

Given that we need the list of inodes under writeback only for sync(2) /
sync_fs(2) purposes it would work the best to track those inodes in the
per-superblock list. We would miss block device inode this way but that one
is waited-on in a special way anyway (in __sync_blockdev() /
fdatawait_one_bdev()). Also that way any strange interaction and confusion
with cgroup writeback is avoided.

> Any thoughts on that are appreciated. Otherwise, I'm trying to rebase
> this and I've hit a lockdep splat [1] that suggests the locking for the
> writeback list needs to be tweaked one way or another. Note that lockdep
> fires with this original version as well, so I don't think this is a new
> issue.

Yeah, it looks that the issue has been there forever. When we have per-sb
list then we need a different lock to protect the list anyway. So having a
dedicated lock for the list which would be irqsave would be the easiest
solution. But note that in that case the locking of i_lock in
wait_inodes_sb() has to be dealt with as well since making i_lock irqsafe
is not an option.

I think we could avoid having the dedicated lock irqsafe if we moved
bdi_mark_inode_writeback() call from under mapping->tree_lock (to happen
later in __test_set_page_writeback()). However that needs some careful
thinking about how the interactions of __test_set_page_writeback() with the
list handling in wait_inodes_sb() work out.

								Honza

> [1] lockdep splat:
> 
> [ INFO: possible irq lock inversion dependency detected ]
> 4.4.0-rc4+ #54 Not tainted
> ---------------------------------------------------------
> swapper/0/0 just changed the state of lock:
> (&(&mapping->tree_lock)->rlock){-.....}, at: [<ffffffff811d25ed>] test_clear_page_writeback+0x5d/0x210
> but this lock took another, HARDIRQ-unsafe lock in the past:
> (&(&wb->list_lock)->rlock){+.+...}#012#012and interrupts could create inverse lock ordering between them.
> #012other info that might help us debug this:
> Possible interrupt unsafe locking scenario:
>       CPU0                    CPU1
>       ----                    ----
>  lock(&(&wb->list_lock)->rlock);
>                               local_irq_disable();
>                               lock(&(&mapping->tree_lock)->rlock);
>                               lock(&(&wb->list_lock)->rlock);
>  <Interrupt>
>    lock(&(&mapping->tree_lock)->rlock);
> #012 *** DEADLOCK ***
> no locks held by swapper/0/0.
> #012the shortest dependencies between 2nd lock and 1st lock:
> -> (&(&wb->list_lock)->rlock){+.+...} ops: 194 {
>    HARDIRQ-ON-W at:
>                      [<ffffffff810f9959>] __lock_acquire+0x819/0x1a50
>                      [<ffffffff810fb54e>] lock_acquire+0xce/0x1c0
>                      [<ffffffff817e3df8>] _raw_spin_lock+0x38/0x50
>                      [<ffffffff8126b7a9>] __mark_inode_dirty+0x3d9/0x5e0
>                      [<ffffffff8125a859>] generic_update_time+0x79/0xd0
>                      [<ffffffff8125c638>] touch_atime+0xa8/0xd0
>                      [<ffffffff812519c3>] iterate_dir+0xe3/0x130
>                      [<ffffffff81251ecd>] SyS_getdents+0x9d/0x130
>                      [<ffffffff817e4a72>] entry_SYSCALL_64_fastpath+0x12/0x76
>    SOFTIRQ-ON-W at:
>                      [<ffffffff810f998a>] __lock_acquire+0x84a/0x1a50
>                      [<ffffffff810fb54e>] lock_acquire+0xce/0x1c0
>                      [<ffffffff817e3df8>] _raw_spin_lock+0x38/0x50
>                      [<ffffffff8126b7a9>] __mark_inode_dirty+0x3d9/0x5e0
>                      [<ffffffff8125a859>] generic_update_time+0x79/0xd0
>                      [<ffffffff8125c638>] touch_atime+0xa8/0xd0
>                      [<ffffffff812519c3>] iterate_dir+0xe3/0x130
>                      [<ffffffff81251ecd>] SyS_getdents+0x9d/0x130
>                      [<ffffffff817e4a72>] entry_SYSCALL_64_fastpath+0x12/0x76
>    INITIAL USE at:
>                     [<ffffffff810f9538>] __lock_acquire+0x3f8/0x1a50
>                     [<ffffffff810fb54e>] lock_acquire+0xce/0x1c0
>                     [<ffffffff817e3df8>] _raw_spin_lock+0x38/0x50
>                     [<ffffffff8126b7a9>] __mark_inode_dirty+0x3d9/0x5e0
>                     [<ffffffff8125a859>] generic_update_time+0x79/0xd0
>                     [<ffffffff8125c638>] touch_atime+0xa8/0xd0
>                     [<ffffffff812519c3>] iterate_dir+0xe3/0x130
>                     [<ffffffff81251ecd>] SyS_getdents+0x9d/0x130
>                     [<ffffffff817e4a72>] entry_SYSCALL_64_fastpath+0x12/0x76
>  }
>  ... key      at: [<ffffffff82c7faa0>] __key.37388+0x0/0x8
>  ... acquired at:
>   [<ffffffff810fb54e>] lock_acquire+0xce/0x1c0
>   [<ffffffff817e3df8>] _raw_spin_lock+0x38/0x50
>   [<ffffffff812703a7>] bdi_mark_inode_writeback+0x67/0xd0
>   [<ffffffff811d19b2>] __test_set_page_writeback+0x112/0x200
>   [<ffffffff8127bcdc>] __block_write_full_page.constprop.44+0xec/0x400
>   [<ffffffff8127c0ee>] block_write_full_page+0xfe/0x190
>   [<ffffffff8127cec8>] blkdev_writepage+0x18/0x20
>   [<ffffffff811cede6>] __writepage+0x16/0x40
>   [<ffffffff811cfc65>] write_cache_pages+0x225/0x5f0
>   [<ffffffff811d0084>] generic_writepages+0x54/0x80
>   [<ffffffff811d24e1>] do_writepages+0x21/0x30
>   [<ffffffff811c4b40>] __filemap_fdatawrite_range+0x80/0xb0
>   [<ffffffff811c4c7d>] filemap_write_and_wait_range+0x2d/0x70
>   [<ffffffff8127ca8b>] blkdev_fsync+0x1b/0x50
>   [<ffffffff81274a9b>] vfs_fsync_range+0x4b/0xb0
>   [<ffffffff81274b5d>] do_fsync+0x3d/0x70
>   [<ffffffff81274e10>] SyS_fsync+0x10/0x20
>   [<ffffffff817e4a72>] entry_SYSCALL_64_fastpath+0x12/0x76
> 
> -> (&(&mapping->tree_lock)->rlock){-.....} ops: 16821 {
>   IN-HARDIRQ-W at:
>                    [<ffffffff810f9ad9>] __lock_acquire+0x999/0x1a50
>                    [<ffffffff810fb54e>] lock_acquire+0xce/0x1c0
>                    [<ffffffff817e4810>] _raw_spin_lock_irqsave+0x50/0x70
>                    [<ffffffff811d25ed>] test_clear_page_writeback+0x5d/0x210
>                    [<ffffffff811c2b2f>] end_page_writeback+0x1f/0xa0
>                    [<ffffffff81278df3>] end_buffer_async_write+0xd3/0x1a0
>                    [<ffffffff81277828>] end_bio_bh_io_sync+0x28/0x40
>                    [<ffffffff813a329f>] bio_endio+0x3f/0x60
>                    [<ffffffff8162d4d9>] dec_pending+0x149/0x310
>                    [<ffffffff8162e1d6>] clone_endio+0x76/0xe0
>                    [<ffffffff813a329f>] bio_endio+0x3f/0x60
>                    [<ffffffff813ab512>] blk_update_request+0xb2/0x3a0
>                    [<ffffffff813b63ca>] blk_mq_end_request+0x1a/0x70
>                    [<ffffffffa009e82f>] virtblk_request_done+0x3f/0x70 [virtio_blk]
>                    [<ffffffff813b4ea3>] __blk_mq_complete_request_remote+0x13/0x20
>                    [<ffffffff81137d9f>] flush_smp_call_function_queue+0x5f/0x160
>                    [<ffffffff811388d3>] generic_smp_call_function_single_interrupt+0x13/0x60
>                    [<ffffffff810530a7>] smp_call_function_single_interrupt+0x27/0x40
>                    [<ffffffff817e604c>] call_function_single_interrupt+0x8c/0xa0
>                    [<ffffffff81026703>] default_idle+0x23/0x150
>                    [<ffffffff8102702f>] arch_cpu_idle+0xf/0x20
>                    [<ffffffff810efd8a>] default_idle_call+0x2a/0x40
>                    [<ffffffff810f0184>] cpu_startup_entry+0x384/0x3f0
>                    [<ffffffff817d6aaa>] rest_init+0x13a/0x140
>                    [<ffffffff81d6b01f>] start_kernel+0x49e/0x4bf
>                    [<ffffffff81d6a339>] x86_64_start_reservations+0x2a/0x2c
>                    [<ffffffff81d6a480>] x86_64_start_kernel+0x145/0x168
>   INITIAL USE at:
>                   [<ffffffff810f9538>] __lock_acquire+0x3f8/0x1a50
>                   [<ffffffff810fb54e>] lock_acquire+0xce/0x1c0
>                   [<ffffffff817e3fa4>] _raw_spin_lock_irq+0x44/0x60
>                   [<ffffffff811c303c>] __add_to_page_cache_locked+0xbc/0x340
>                   [<ffffffff811c3317>] add_to_page_cache_lru+0x37/0x90
>                   [<ffffffff811c3dc7>] pagecache_get_page+0xb7/0x200
>                   [<ffffffff811c4139>] grab_cache_page_write_begin+0x29/0x40
>                   [<ffffffff81269848>] simple_write_begin+0x28/0x1c0
>                   [<ffffffff811c2a01>] generic_perform_write+0xd1/0x1e0
>                   [<ffffffff811c55f2>] __generic_file_write_iter+0x1a2/0x1e0
>                   [<ffffffff811c5718>] generic_file_write_iter+0xe8/0x1e0
>                   [<ffffffff8123c409>] __vfs_write+0xc9/0x110
>                   [<ffffffff8123ca99>] vfs_write+0xa9/0x1a0
>                   [<ffffffff8123d778>] SyS_write+0x58/0xd0
>                   [<ffffffff81d6c3b1>] xwrite+0x29/0x5c
>                   [<ffffffff81d6c40e>] do_copy+0x2a/0xb8
>                   [<ffffffff81d6c151>] write_buffer+0x23/0x34
>                   [<ffffffff81d6c9e6>] unpack_to_rootfs+0xfd/0x294
>                   [<ffffffff81d6cbd9>] populate_rootfs+0x5c/0x108
>                   [<ffffffff81002123>] do_one_initcall+0xb3/0x200
>                   [<ffffffff81d6b22e>] kernel_init_freeable+0x1ee/0x28d
>                   [<ffffffff817d6abe>] kernel_init+0xe/0xe0
>                   [<ffffffff817e4ddf>] ret_from_fork+0x3f/0x70
> }
> ... key      at: [<ffffffff82ca2760>] __key.39560+0x0/0x8
> ... acquired at:
>   [<ffffffff810f7c1b>] check_usage_forwards+0x15b/0x160
>   [<ffffffff810f88c3>] mark_lock+0x333/0x610
>   [<ffffffff810f9ad9>] __lock_acquire+0x999/0x1a50
>   [<ffffffff810fb54e>] lock_acquire+0xce/0x1c0
>   [<ffffffff817e4810>] _raw_spin_lock_irqsave+0x50/0x70
>   [<ffffffff811d25ed>] test_clear_page_writeback+0x5d/0x210
>   [<ffffffff811c2b2f>] end_page_writeback+0x1f/0xa0
>   [<ffffffff81278df3>] end_buffer_async_write+0xd3/0x1a0
>   [<ffffffff81277828>] end_bio_bh_io_sync+0x28/0x40
>   [<ffffffff813a329f>] bio_endio+0x3f/0x60
>   [<ffffffff8162d4d9>] dec_pending+0x149/0x310
>   [<ffffffff8162e1d6>] clone_endio+0x76/0xe0
>   [<ffffffff813a329f>] bio_endio+0x3f/0x60
>   [<ffffffff813ab512>] blk_update_request+0xb2/0x3a0
>   [<ffffffff813b63ca>] blk_mq_end_request+0x1a/0x70
>   [<ffffffffa009e82f>] virtblk_request_done+0x3f/0x70 [virtio_blk]
>   [<ffffffff813b4ea3>] __blk_mq_complete_request_remote+0x13/0x20
>   [<ffffffff81137d9f>] flush_smp_call_function_queue+0x5f/0x160
>   [<ffffffff811388d3>] generic_smp_call_function_single_interrupt+0x13/0x60
>   [<ffffffff810530a7>] smp_call_function_single_interrupt+0x27/0x40
>   [<ffffffff817e604c>] call_function_single_interrupt+0x8c/0xa0
>   [<ffffffff81026703>] default_idle+0x23/0x150
>   [<ffffffff8102702f>] arch_cpu_idle+0xf/0x20
>   [<ffffffff810efd8a>] default_idle_call+0x2a/0x40
>   [<ffffffff810f0184>] cpu_startup_entry+0x384/0x3f0
>   [<ffffffff817d6aaa>] rest_init+0x13a/0x140
>   [<ffffffff81d6b01f>] start_kernel+0x49e/0x4bf
>   [<ffffffff81d6a339>] x86_64_start_reservations+0x2a/0x2c
>   [<ffffffff81d6a480>] x86_64_start_kernel+0x145/0x168
> 
> #012stack backtrace:
> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.0-rc4+ #54
> Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> 0000000000000000 06776aa479e380f6 ffff88011fc03ad0 ffffffff813dd019
> ffffffff82a0f120 ffff88011fc03b10 ffffffff811c184d ffffffff81c10be8
> ffffffff81c10be8 ffffffff81c10500 ffffffff81a56510 0000000000000000
> Call Trace:
> <IRQ>  [<ffffffff813dd019>] dump_stack+0x4b/0x72
> [<ffffffff811c184d>] print_irq_inversion_bug.part.38+0x1a4/0x1b0
> [<ffffffff810f7c1b>] check_usage_forwards+0x15b/0x160
> [<ffffffff810f88c3>] mark_lock+0x333/0x610
> [<ffffffff810f7ac0>] ? check_usage_backwards+0x160/0x160
> [<ffffffff810f9ad9>] __lock_acquire+0x999/0x1a50
> [<ffffffff810fb54e>] lock_acquire+0xce/0x1c0
> [<ffffffff811d25ed>] ? test_clear_page_writeback+0x5d/0x210
> [<ffffffff817e4810>] _raw_spin_lock_irqsave+0x50/0x70
> [<ffffffff811d25ed>] ? test_clear_page_writeback+0x5d/0x210
> [<ffffffff811d25ed>] test_clear_page_writeback+0x5d/0x210
> [<ffffffff811c2b2f>] end_page_writeback+0x1f/0xa0
> [<ffffffff81278df3>] end_buffer_async_write+0xd3/0x1a0
> [<ffffffff81277828>] end_bio_bh_io_sync+0x28/0x40
> [<ffffffff813a329f>] bio_endio+0x3f/0x60
> [<ffffffff8162d4d9>] dec_pending+0x149/0x310
> [<ffffffff811c6809>] ? mempool_free+0x29/0x80
> [<ffffffff8162e1d6>] clone_endio+0x76/0xe0
> [<ffffffff813a329f>] bio_endio+0x3f/0x60
> [<ffffffff813ab512>] blk_update_request+0xb2/0x3a0
> [<ffffffff813b4e90>] ? blkdev_issue_zeroout+0xf0/0xf0
> [<ffffffff813b63ca>] blk_mq_end_request+0x1a/0x70
> [<ffffffffa009e82f>] virtblk_request_done+0x3f/0x70 [virtio_blk]
> [<ffffffff813b4ea3>] __blk_mq_complete_request_remote+0x13/0x20
> [<ffffffff81137d9f>] flush_smp_call_function_queue+0x5f/0x160
> [<ffffffff811388d3>] generic_smp_call_function_single_interrupt+0x13/0x60
> [<ffffffff810530a7>] smp_call_function_single_interrupt+0x27/0x40
> [<ffffffff817e604c>] call_function_single_interrupt+0x8c/0xa0
> <EOI>  [<ffffffff810653a6>] ? native_safe_halt+0x6/0x10
> [<ffffffff81026703>] default_idle+0x23/0x150
> [<ffffffff8102702f>] arch_cpu_idle+0xf/0x20
> [<ffffffff810efd8a>] default_idle_call+0x2a/0x40
> [<ffffffff810f0184>] cpu_startup_entry+0x384/0x3f0
> [<ffffffff817d6aaa>] rest_init+0x13a/0x140
> [<ffffffff81d6b01f>] start_kernel+0x49e/0x4bf
> [<ffffffff81d6a120>] ? early_idt_handler_array+0x120/0x120
> [<ffffffff81d6a339>] x86_64_start_reservations+0x2a/0x2c
> [<ffffffff81d6a480>] x86_64_start_kernel+0x145/0x168
> 
> >  fs/block_dev.c              |   2 +
> >  fs/fs-writeback.c           | 148 +++++++++++++++++++++++++++++++++++---------
> >  fs/inode.c                  |   1 +
> >  include/linux/backing-dev.h |   5 ++
> >  include/linux/fs.h          |   1 +
> >  mm/backing-dev.c            |   1 +
> >  mm/page-writeback.c         |  14 +++++
> >  7 files changed, 144 insertions(+), 28 deletions(-)
> > 
> > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > index f2a89be..2726ff1 100644
> > --- a/fs/block_dev.c
> > +++ b/fs/block_dev.c
> > @@ -64,6 +64,8 @@ void kill_bdev(struct block_device *bdev)
> >  {
> >  	struct address_space *mapping = bdev->bd_inode->i_mapping;
> >  
> > +	bdi_clear_inode_writeback(inode_to_bdi(bdev->bd_inode),
> > +				  bdev->bd_inode);
> >  	if (mapping->nrpages == 0 && mapping->nrshadows == 0)
> >  		return;
> >  
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index aa72536..5c005ad 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -210,6 +210,34 @@ void inode_wb_list_del(struct inode *inode)
> >  }
> >  
> >  /*
> > + * mark an inode as under writeback on the given bdi
> > + */
> > +void bdi_mark_inode_writeback(struct backing_dev_info *bdi, struct inode *inode)
> > +{
> > +	WARN_ON_ONCE(bdi != inode_to_bdi(inode));
> > +	if (list_empty(&inode->i_wb_list)) {
> > +		spin_lock(&bdi->wb.list_lock);
> > +		if (list_empty(&inode->i_wb_list))
> > +			list_add_tail(&inode->i_wb_list, &bdi->wb.b_writeback);
> > +		spin_unlock(&bdi->wb.list_lock);
> > +	}
> > +}
> > +
> > +/*
> > + * clear an inode as under writeback on the given bdi
> > + */
> > +void bdi_clear_inode_writeback(struct backing_dev_info *bdi,
> > +			       struct inode *inode)
> > +{
> > +	WARN_ON_ONCE(bdi != inode_to_bdi(inode));
> > +	if (!list_empty(&inode->i_wb_list)) {
> > +		spin_lock(&bdi->wb.list_lock);
> > +		list_del_init(&inode->i_wb_list);
> > +		spin_unlock(&bdi->wb.list_lock);
> > +	}
> > +}
> > +
> > +/*
> >   * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
> >   * furthest end of its superblock's dirty-inode list.
> >   *
> > @@ -383,13 +411,28 @@ static void __inode_wait_for_writeback(struct inode *inode)
> >  }
> >  
> >  /*
> > - * Wait for writeback on an inode to complete. Caller must have inode pinned.
> > + * Wait for writeback on an inode to complete during eviction. Caller must have
> > + * inode pinned.
> >   */
> >  void inode_wait_for_writeback(struct inode *inode)
> >  {
> > +	BUG_ON(!(inode->i_state & I_FREEING));
> > +
> >  	spin_lock(&inode->i_lock);
> >  	__inode_wait_for_writeback(inode);
> >  	spin_unlock(&inode->i_lock);
> > +
> > +	/*
> > +	 * For a bd_inode when we do inode_to_bdi we'll want to get the bdev for
> > +	 * the inode and then deref bdev->bd_disk, which at this point has been
> > +	 * set to NULL, so we would panic.  At the point we are dropping our
> > +	 * bd_inode we won't have any pages under writeback on the device so
> > +	 * this is safe.  But just in case we'll assert to make sure we don't
> > +	 * screw this up.
> > +	 */
> > +	if (!sb_is_blkdev_sb(inode->i_sb))
> > +		bdi_clear_inode_writeback(inode_to_bdi(inode), inode);
> > +	BUG_ON(!list_empty(&inode->i_wb_list));
> >  }
> >  
> >  /*
> > @@ -1372,7 +1415,9 @@ EXPORT_SYMBOL(__mark_inode_dirty);
> >   */
> >  static void wait_sb_inodes(struct super_block *sb)
> >  {
> > -	struct inode *inode, *old_inode = NULL;
> > +	struct backing_dev_info *bdi = sb->s_bdi;
> > +	LIST_HEAD(sync_list);
> > +	struct inode *iput_inode = NULL;
> >  
> >  	/*
> >  	 * We need to be protected against the filesystem going from
> > @@ -1380,48 +1425,95 @@ static void wait_sb_inodes(struct super_block *sb)
> >  	 */
> >  	WARN_ON(!rwsem_is_locked(&sb->s_umount));
> >  
> > -	mutex_lock(&sb->s_sync_lock);
> > -	spin_lock(&sb->s_inode_list_lock);
> > -
> >  	/*
> > -	 * Data integrity sync. Must wait for all pages under writeback,
> > -	 * because there may have been pages dirtied before our sync
> > -	 * call, but which had writeout started before we write it out.
> > -	 * In which case, the inode may not be on the dirty list, but
> > -	 * we still have to wait for that writeout.
> > +	 * Data integrity sync. Must wait for all pages under writeback, because
> > +	 * there may have been pages dirtied before our sync call, but which had
> > +	 * writeout started before we write it out.  In which case, the inode
> > +	 * may not be on the dirty list, but we still have to wait for that
> > +	 * writeout.
> > +	 *
> > +	 * To avoid syncing inodes put under IO after we have started here,
> > +	 * splice the io list to a temporary list head and walk that. Newly
> > +	 * dirtied inodes will go onto the primary list so we won't wait for
> > +	 * them. This is safe to do as we are serialised by the s_sync_lock,
> > +	 * so we'll complete processing the complete list before the next
> > +	 * sync operation repeats the splice-and-walk process.
> > +	 *
> > +	 * Inodes that have pages under writeback after we've finished the wait
> > +	 * may or may not be on the primary list. They had pages under put IO
> > +	 * after we started our wait, so we need to make sure the next sync or
> > +	 * IO completion treats them correctly, Move them back to the primary
> > +	 * list and restart the walk.
> > +	 *
> > +	 * Inodes that are clean after we have waited for them don't belong on
> > +	 * any list, so simply remove them from the sync list and move onwards.
> >  	 */
> > -	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> > +	mutex_lock(&sb->s_sync_lock);
> > +	spin_lock(&bdi->wb.list_lock);
> > +	list_splice_init(&bdi->wb.b_writeback, &sync_list);
> > +
> > +	while (!list_empty(&sync_list)) {
> > +		struct inode *inode = list_first_entry(&sync_list, struct inode,
> > +						       i_wb_list);
> >  		struct address_space *mapping = inode->i_mapping;
> >  
> > +		/*
> > +		 * We are lazy on IO completion and don't remove inodes from the
> > +		 * list when they are clean. Detect that immediately and skip
> > +		 * inodes we don't ahve to wait on.
> > +		 */
> > +		if (!mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
> > +			list_del_init(&inode->i_wb_list);
> > +			cond_resched_lock(&bdi->wb.list_lock);
> > +			continue;
> > +		}
> > +
> >  		spin_lock(&inode->i_lock);
> > -		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
> > -		    (mapping->nrpages == 0)) {
> > +		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
> > +			list_move(&inode->i_wb_list, &bdi->wb.b_writeback);
> >  			spin_unlock(&inode->i_lock);
> > +			cond_resched_lock(&bdi->wb.list_lock);
> >  			continue;
> >  		}
> >  		__iget(inode);
> >  		spin_unlock(&inode->i_lock);
> > -		spin_unlock(&sb->s_inode_list_lock);
> > +		spin_unlock(&bdi->wb.list_lock);
> >  
> > -		/*
> > -		 * We hold a reference to 'inode' so it couldn't have been
> > -		 * removed from s_inodes list while we dropped the
> > -		 * s_inode_list_lock.  We cannot iput the inode now as we can
> > -		 * be holding the last reference and we cannot iput it under
> > -		 * s_inode_list_lock. So we keep the reference and iput it
> > -		 * later.
> > -		 */
> > -		iput(old_inode);
> > -		old_inode = inode;
> > +		if (iput_inode)
> > +			iput(iput_inode);
> >  
> >  		filemap_fdatawait(mapping);
> > -
> >  		cond_resched();
> >  
> > -		spin_lock(&sb->s_inode_list_lock);
> > +		/*
> > +		 * the inode has been written back now, so check whether we
> > +		 * still have pages under IO and move it back to the primary
> > +		 * list if necessary.  We really need the mapping->tree_lock
> > +		 * here because bdi_mark_inode_writeback may have not done
> > +		 * anything because we were on the spliced list and we need to
> > +		 * check TAG_WRITEBACK.
> > +		 */
> > +		spin_lock_irq(&mapping->tree_lock);
> > +		spin_lock(&bdi->wb.list_lock);
> > +		if (mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
> > +			WARN_ON(list_empty(&inode->i_wb_list));
> > +			list_move(&inode->i_wb_list, &bdi->wb.b_writeback);
> > +                } else
> > +			list_del_init(&inode->i_wb_list);
> > +		spin_unlock_irq(&mapping->tree_lock);
> > +
> > +		/*
> > +		 * can't iput inode while holding the wb.list_lock. Save it for
> > +		 * the next time through the loop when we drop all our spin
> > +		 * locks.
> > +		 */
> > +		iput_inode = inode;
> >  	}
> > -	spin_unlock(&sb->s_inode_list_lock);
> > -	iput(old_inode);
> > +	spin_unlock(&bdi->wb.list_lock);
> > +
> > +	if (iput_inode)
> > +		iput(iput_inode);
> > +
> >  	mutex_unlock(&sb->s_sync_lock);
> >  }
> >  
> > diff --git a/fs/inode.c b/fs/inode.c
> > index 770d684..8f00557 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -357,6 +357,7 @@ void inode_init_once(struct inode *inode)
> >  	INIT_HLIST_NODE(&inode->i_hash);
> >  	INIT_LIST_HEAD(&inode->i_devices);
> >  	INIT_LIST_HEAD(&inode->i_io_list);
> > +	INIT_LIST_HEAD(&inode->i_wb_list);
> >  	INIT_LIST_HEAD(&inode->i_lru);
> >  	address_space_init_once(&inode->i_data);
> >  	i_size_ordered_init(inode);
> > diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> > index d87d8ec..0d5bca2 100644
> > --- a/include/linux/backing-dev.h
> > +++ b/include/linux/backing-dev.h
> > @@ -56,6 +56,7 @@ struct bdi_writeback {
> >  	struct list_head b_io;		/* parked for writeback */
> >  	struct list_head b_more_io;	/* parked for more writeback */
> >  	struct list_head b_dirty_time;	/* time stamps are dirty */
> > +	struct list_head b_writeback;	/* inodes under writeback */
> >  	spinlock_t list_lock;		/* protects the b_* lists */
> >  };
> >  
> > @@ -123,6 +124,10 @@ void bdi_start_background_writeback(struct backing_dev_info *bdi);
> >  void bdi_writeback_workfn(struct work_struct *work);
> >  int bdi_has_dirty_io(struct backing_dev_info *bdi);
> >  void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi);
> > +void bdi_mark_inode_writeback(struct backing_dev_info *bdi,
> > +			      struct inode *inode);
> > +void bdi_clear_inode_writeback(struct backing_dev_info *bdi,
> > +			       struct inode *inode);
> >  
> >  extern spinlock_t bdi_lock;
> >  extern struct list_head bdi_list;
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 6dd2ab2..d3baa08 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -636,6 +636,7 @@ struct inode {
> >  	struct list_head	i_io_list;	/* backing dev IO list */
> >  	struct list_head	i_lru;		/* inode LRU list */
> >  	struct list_head	i_sb_list;
> > +	struct list_head	i_wb_list;	/* backing dev writeback list */
> >  	union {
> >  		struct hlist_head	i_dentry;
> >  		struct rcu_head		i_rcu;
> > diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> > index e9ed047..ea39301 100644
> > --- a/mm/backing-dev.c
> > +++ b/mm/backing-dev.c
> > @@ -369,6 +369,7 @@ static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
> >  	INIT_LIST_HEAD(&wb->b_io);
> >  	INIT_LIST_HEAD(&wb->b_more_io);
> >  	INIT_LIST_HEAD(&wb->b_dirty_time);
> > +	INIT_LIST_HEAD(&wb->b_writeback);
> >  	spin_lock_init(&wb->list_lock);
> >  	INIT_DELAYED_WORK(&wb->dwork, bdi_writeback_workfn);
> >  }
> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index eb59f7e..f3751d1 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -2382,11 +2382,25 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
> >  		spin_lock_irqsave(&mapping->tree_lock, flags);
> >  		ret = TestSetPageWriteback(page);
> >  		if (!ret) {
> > +			bool on_wblist;
> > +
> > +			on_wblist = mapping_tagged(mapping,
> > +						   PAGECACHE_TAG_WRITEBACK);
> > +
> >  			radix_tree_tag_set(&mapping->page_tree,
> >  						page_index(page),
> >  						PAGECACHE_TAG_WRITEBACK);
> >  			if (bdi_cap_account_writeback(bdi))
> >  				__inc_bdi_stat(bdi, BDI_WRITEBACK);
> > +
> > +			/*
> > +			 * we can come through here when swapping anonymous
> > +			 * pages, so we don't necessarily have an inode to
> > +			 * track for sync. Inodes are remove lazily, so there is
> > +			 * no equivalent in test_clear_page_writeback().
> > +			 */
> > +			if (!on_wblist && mapping->host)
> > +				bdi_mark_inode_writeback(bdi, mapping->host);
> >  		}
> >  		if (!PageDirty(page))
> >  			radix_tree_tag_clear(&mapping->page_tree,
> > -- 
> > 2.1.0
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 6/8] bdi: add a new writeback list for sync
  2015-12-10 10:08     ` Jan Kara
@ 2015-12-11 14:37       ` Brian Foster
  0 siblings, 0 replies; 24+ messages in thread
From: Brian Foster @ 2015-12-11 14:37 UTC (permalink / raw)
  To: Jan Kara
  Cc: Josef Bacik, linux-fsdevel, kernel-team, viro, hch, david,
	Dave Chinner

On Thu, Dec 10, 2015 at 11:08:20AM +0100, Jan Kara wrote:
> On Wed 09-12-15 13:40:30, Brian Foster wrote:
> > On Tue, Jun 23, 2015 at 08:24:00PM -0700, Josef Bacik wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > wait_sb_inodes() current does a walk of all inodes in the filesystem
> > > to find dirty one to wait on during sync. This is highly
> > > inefficient and wastes a lot of CPU when there are lots of clean
> > > cached inodes that we don't need to wait on.
> > > 
> > > To avoid this "all inode" walk, we need to track inodes that are
> > > currently under writeback that we need to wait for. We do this by
> > > adding inodes to a writeback list on the bdi when the mapping is
> > > first tagged as having pages under writeback.  wait_sb_inodes() can
> > > then walk this list of "inodes under IO" and wait specifically just
> > > for the inodes that the current sync(2) needs to wait for.
> > > 
> > > To avoid needing all the realted locking to be safe against
> > > interrupts, Jan Kara suggested that we be lazy about removal from
> > > the writeback list. That is, we don't remove inodes from the
> > > writeback list on IO completion, but do it directly during a
> > > wait_sb_inodes() walk.
> > > 
> > > This means that the a rare sync(2) call will have some work to do
> > > skipping clean inodes However, in the current problem case of
> > > concurrent sync workloads, concurrent wait_sb_inodes() calls only
> > > walk the very recently dispatched inodes and hence should have very
> > > little work to do.
> > > 
> > > This also means that we have to remove the inodes from the writeback
> > > list during eviction. Do this in inode_wait_for_writeback() once
> > > all writeback on the inode is complete.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > Signed-off-by: Josef Bacik <jbacik@fb.com>
> > > Tested-by: Dave Chinner <dchinner@redhat.com>
> > > Reviewed-by: Jan Kara <jack@suse.cz>
> > > ---
> 
>   Hi,
> 
> > My understanding is this (and the subsequent) patch were never merged
> > due to conflicts with the cgroup aware writeback stuff that had been
> > merged.
> 
> Yes. Thanks for bringing this up. At that point I thought Josef will take
> care of rebasing the patches and it fell out of my radar.
> 
> > I think the fundamental approach of this patch still solves the
> > original problem, the writeback inodes are just always tracked on the
> > root bdi_writeback rather than the per-cgroup bdi_writeback the inode
> > was associated with before writeback.
> 
> I was thinking about this and realized that the original patch actually has
> a flaw. In case there are several partitions on a bdi and sync_fs() would
> be called for two filesystems on the same bdi, then one of those sync_fs()
> calls would splice the whole b_writeback list into its private list in
> wait_sb_inodes() and the other sync_fs() call would return too early because
> it would find b_writeback list empty.
> 

Hmm, yeah that does sound like a problem...

> Given that we need the list of inodes under writeback only for sync(2) /
> sync_fs(2) purposes it would work the best to track those inodes in the
> per-superblock list. We would miss block device inode this way but that one
> is waited-on in a special way anyway (in __sync_blockdev() /
> fdatawait_one_bdev()). Also that way any strange interaction and confusion
> with cgroup writeback is avoided.
> 

Agree, that sounds like a more clean approach on first thought. We kill
the more wasteful per-bdi_writeback list since it would only be used by
the root structure embedded in backing_dev_info and isolates the code to
dealing with sync. I'll have a closer look at this.

> > Any thoughts on that are appreciated. Otherwise, I'm trying to rebase
> > this and I've hit a lockdep splat [1] that suggests the locking for the
> > writeback list needs to be tweaked one way or another. Note that lockdep
> > fires with this original version as well, so I don't think this is a new
> > issue.
> 
> Yeah, it looks that the issue has been there forever. When we have per-sb
> list then we need a different lock to protect the list anyway. So having a
> dedicated lock for the list which would be irqsave would be the easiest
> solution. But note that in that case the locking of i_lock in
> wait_inodes_sb() has to be dealt with as well since making i_lock irqsafe
> is not an option.
> 

Ok...

> I think we could avoid having the dedicated lock irqsafe if we moved
> bdi_mark_inode_writeback() call from under mapping->tree_lock (to happen
> later in __test_set_page_writeback()). However that needs some careful
> thinking about how the interactions of __test_set_page_writeback() with the
> list handling in wait_inodes_sb() work out.
> 

That thought crossed my mind briefly as well. It looked like it could be
racy once we drop mapping->tree_lock, but actually I wonder if the lazy
list removal technique helps us get away with that (e.g., it's expected
that the writeback list will be populated with non-writeback inodes for
some time as it is).

Anyways, I can take a closer look at this as well. It could be a couple
weeks or so before I get back to this, but I'll try to incorporate the
above once I do. Thanks for the feedback! :)

Brian

> 								Honza
> 
> > [1] lockdep splat:
> > 
> > [ INFO: possible irq lock inversion dependency detected ]
> > 4.4.0-rc4+ #54 Not tainted
> > ---------------------------------------------------------
> > swapper/0/0 just changed the state of lock:
> > (&(&mapping->tree_lock)->rlock){-.....}, at: [<ffffffff811d25ed>] test_clear_page_writeback+0x5d/0x210
> > but this lock took another, HARDIRQ-unsafe lock in the past:
> > (&(&wb->list_lock)->rlock){+.+...}#012#012and interrupts could create inverse lock ordering between them.
> > #012other info that might help us debug this:
> > Possible interrupt unsafe locking scenario:
> >       CPU0                    CPU1
> >       ----                    ----
> >  lock(&(&wb->list_lock)->rlock);
> >                               local_irq_disable();
> >                               lock(&(&mapping->tree_lock)->rlock);
> >                               lock(&(&wb->list_lock)->rlock);
> >  <Interrupt>
> >    lock(&(&mapping->tree_lock)->rlock);
> > #012 *** DEADLOCK ***
> > no locks held by swapper/0/0.
> > #012the shortest dependencies between 2nd lock and 1st lock:
> > -> (&(&wb->list_lock)->rlock){+.+...} ops: 194 {
> >    HARDIRQ-ON-W at:
> >                      [<ffffffff810f9959>] __lock_acquire+0x819/0x1a50
> >                      [<ffffffff810fb54e>] lock_acquire+0xce/0x1c0
> >                      [<ffffffff817e3df8>] _raw_spin_lock+0x38/0x50
> >                      [<ffffffff8126b7a9>] __mark_inode_dirty+0x3d9/0x5e0
> >                      [<ffffffff8125a859>] generic_update_time+0x79/0xd0
> >                      [<ffffffff8125c638>] touch_atime+0xa8/0xd0
> >                      [<ffffffff812519c3>] iterate_dir+0xe3/0x130
> >                      [<ffffffff81251ecd>] SyS_getdents+0x9d/0x130
> >                      [<ffffffff817e4a72>] entry_SYSCALL_64_fastpath+0x12/0x76
> >    SOFTIRQ-ON-W at:
> >                      [<ffffffff810f998a>] __lock_acquire+0x84a/0x1a50
> >                      [<ffffffff810fb54e>] lock_acquire+0xce/0x1c0
> >                      [<ffffffff817e3df8>] _raw_spin_lock+0x38/0x50
> >                      [<ffffffff8126b7a9>] __mark_inode_dirty+0x3d9/0x5e0
> >                      [<ffffffff8125a859>] generic_update_time+0x79/0xd0
> >                      [<ffffffff8125c638>] touch_atime+0xa8/0xd0
> >                      [<ffffffff812519c3>] iterate_dir+0xe3/0x130
> >                      [<ffffffff81251ecd>] SyS_getdents+0x9d/0x130
> >                      [<ffffffff817e4a72>] entry_SYSCALL_64_fastpath+0x12/0x76
> >    INITIAL USE at:
> >                     [<ffffffff810f9538>] __lock_acquire+0x3f8/0x1a50
> >                     [<ffffffff810fb54e>] lock_acquire+0xce/0x1c0
> >                     [<ffffffff817e3df8>] _raw_spin_lock+0x38/0x50
> >                     [<ffffffff8126b7a9>] __mark_inode_dirty+0x3d9/0x5e0
> >                     [<ffffffff8125a859>] generic_update_time+0x79/0xd0
> >                     [<ffffffff8125c638>] touch_atime+0xa8/0xd0
> >                     [<ffffffff812519c3>] iterate_dir+0xe3/0x130
> >                     [<ffffffff81251ecd>] SyS_getdents+0x9d/0x130
> >                     [<ffffffff817e4a72>] entry_SYSCALL_64_fastpath+0x12/0x76
> >  }
> >  ... key      at: [<ffffffff82c7faa0>] __key.37388+0x0/0x8
> >  ... acquired at:
> >   [<ffffffff810fb54e>] lock_acquire+0xce/0x1c0
> >   [<ffffffff817e3df8>] _raw_spin_lock+0x38/0x50
> >   [<ffffffff812703a7>] bdi_mark_inode_writeback+0x67/0xd0
> >   [<ffffffff811d19b2>] __test_set_page_writeback+0x112/0x200
> >   [<ffffffff8127bcdc>] __block_write_full_page.constprop.44+0xec/0x400
> >   [<ffffffff8127c0ee>] block_write_full_page+0xfe/0x190
> >   [<ffffffff8127cec8>] blkdev_writepage+0x18/0x20
> >   [<ffffffff811cede6>] __writepage+0x16/0x40
> >   [<ffffffff811cfc65>] write_cache_pages+0x225/0x5f0
> >   [<ffffffff811d0084>] generic_writepages+0x54/0x80
> >   [<ffffffff811d24e1>] do_writepages+0x21/0x30
> >   [<ffffffff811c4b40>] __filemap_fdatawrite_range+0x80/0xb0
> >   [<ffffffff811c4c7d>] filemap_write_and_wait_range+0x2d/0x70
> >   [<ffffffff8127ca8b>] blkdev_fsync+0x1b/0x50
> >   [<ffffffff81274a9b>] vfs_fsync_range+0x4b/0xb0
> >   [<ffffffff81274b5d>] do_fsync+0x3d/0x70
> >   [<ffffffff81274e10>] SyS_fsync+0x10/0x20
> >   [<ffffffff817e4a72>] entry_SYSCALL_64_fastpath+0x12/0x76
> > 
> > -> (&(&mapping->tree_lock)->rlock){-.....} ops: 16821 {
> >   IN-HARDIRQ-W at:
> >                    [<ffffffff810f9ad9>] __lock_acquire+0x999/0x1a50
> >                    [<ffffffff810fb54e>] lock_acquire+0xce/0x1c0
> >                    [<ffffffff817e4810>] _raw_spin_lock_irqsave+0x50/0x70
> >                    [<ffffffff811d25ed>] test_clear_page_writeback+0x5d/0x210
> >                    [<ffffffff811c2b2f>] end_page_writeback+0x1f/0xa0
> >                    [<ffffffff81278df3>] end_buffer_async_write+0xd3/0x1a0
> >                    [<ffffffff81277828>] end_bio_bh_io_sync+0x28/0x40
> >                    [<ffffffff813a329f>] bio_endio+0x3f/0x60
> >                    [<ffffffff8162d4d9>] dec_pending+0x149/0x310
> >                    [<ffffffff8162e1d6>] clone_endio+0x76/0xe0
> >                    [<ffffffff813a329f>] bio_endio+0x3f/0x60
> >                    [<ffffffff813ab512>] blk_update_request+0xb2/0x3a0
> >                    [<ffffffff813b63ca>] blk_mq_end_request+0x1a/0x70
> >                    [<ffffffffa009e82f>] virtblk_request_done+0x3f/0x70 [virtio_blk]
> >                    [<ffffffff813b4ea3>] __blk_mq_complete_request_remote+0x13/0x20
> >                    [<ffffffff81137d9f>] flush_smp_call_function_queue+0x5f/0x160
> >                    [<ffffffff811388d3>] generic_smp_call_function_single_interrupt+0x13/0x60
> >                    [<ffffffff810530a7>] smp_call_function_single_interrupt+0x27/0x40
> >                    [<ffffffff817e604c>] call_function_single_interrupt+0x8c/0xa0
> >                    [<ffffffff81026703>] default_idle+0x23/0x150
> >                    [<ffffffff8102702f>] arch_cpu_idle+0xf/0x20
> >                    [<ffffffff810efd8a>] default_idle_call+0x2a/0x40
> >                    [<ffffffff810f0184>] cpu_startup_entry+0x384/0x3f0
> >                    [<ffffffff817d6aaa>] rest_init+0x13a/0x140
> >                    [<ffffffff81d6b01f>] start_kernel+0x49e/0x4bf
> >                    [<ffffffff81d6a339>] x86_64_start_reservations+0x2a/0x2c
> >                    [<ffffffff81d6a480>] x86_64_start_kernel+0x145/0x168
> >   INITIAL USE at:
> >                   [<ffffffff810f9538>] __lock_acquire+0x3f8/0x1a50
> >                   [<ffffffff810fb54e>] lock_acquire+0xce/0x1c0
> >                   [<ffffffff817e3fa4>] _raw_spin_lock_irq+0x44/0x60
> >                   [<ffffffff811c303c>] __add_to_page_cache_locked+0xbc/0x340
> >                   [<ffffffff811c3317>] add_to_page_cache_lru+0x37/0x90
> >                   [<ffffffff811c3dc7>] pagecache_get_page+0xb7/0x200
> >                   [<ffffffff811c4139>] grab_cache_page_write_begin+0x29/0x40
> >                   [<ffffffff81269848>] simple_write_begin+0x28/0x1c0
> >                   [<ffffffff811c2a01>] generic_perform_write+0xd1/0x1e0
> >                   [<ffffffff811c55f2>] __generic_file_write_iter+0x1a2/0x1e0
> >                   [<ffffffff811c5718>] generic_file_write_iter+0xe8/0x1e0
> >                   [<ffffffff8123c409>] __vfs_write+0xc9/0x110
> >                   [<ffffffff8123ca99>] vfs_write+0xa9/0x1a0
> >                   [<ffffffff8123d778>] SyS_write+0x58/0xd0
> >                   [<ffffffff81d6c3b1>] xwrite+0x29/0x5c
> >                   [<ffffffff81d6c40e>] do_copy+0x2a/0xb8
> >                   [<ffffffff81d6c151>] write_buffer+0x23/0x34
> >                   [<ffffffff81d6c9e6>] unpack_to_rootfs+0xfd/0x294
> >                   [<ffffffff81d6cbd9>] populate_rootfs+0x5c/0x108
> >                   [<ffffffff81002123>] do_one_initcall+0xb3/0x200
> >                   [<ffffffff81d6b22e>] kernel_init_freeable+0x1ee/0x28d
> >                   [<ffffffff817d6abe>] kernel_init+0xe/0xe0
> >                   [<ffffffff817e4ddf>] ret_from_fork+0x3f/0x70
> > }
> > ... key      at: [<ffffffff82ca2760>] __key.39560+0x0/0x8
> > ... acquired at:
> >   [<ffffffff810f7c1b>] check_usage_forwards+0x15b/0x160
> >   [<ffffffff810f88c3>] mark_lock+0x333/0x610
> >   [<ffffffff810f9ad9>] __lock_acquire+0x999/0x1a50
> >   [<ffffffff810fb54e>] lock_acquire+0xce/0x1c0
> >   [<ffffffff817e4810>] _raw_spin_lock_irqsave+0x50/0x70
> >   [<ffffffff811d25ed>] test_clear_page_writeback+0x5d/0x210
> >   [<ffffffff811c2b2f>] end_page_writeback+0x1f/0xa0
> >   [<ffffffff81278df3>] end_buffer_async_write+0xd3/0x1a0
> >   [<ffffffff81277828>] end_bio_bh_io_sync+0x28/0x40
> >   [<ffffffff813a329f>] bio_endio+0x3f/0x60
> >   [<ffffffff8162d4d9>] dec_pending+0x149/0x310
> >   [<ffffffff8162e1d6>] clone_endio+0x76/0xe0
> >   [<ffffffff813a329f>] bio_endio+0x3f/0x60
> >   [<ffffffff813ab512>] blk_update_request+0xb2/0x3a0
> >   [<ffffffff813b63ca>] blk_mq_end_request+0x1a/0x70
> >   [<ffffffffa009e82f>] virtblk_request_done+0x3f/0x70 [virtio_blk]
> >   [<ffffffff813b4ea3>] __blk_mq_complete_request_remote+0x13/0x20
> >   [<ffffffff81137d9f>] flush_smp_call_function_queue+0x5f/0x160
> >   [<ffffffff811388d3>] generic_smp_call_function_single_interrupt+0x13/0x60
> >   [<ffffffff810530a7>] smp_call_function_single_interrupt+0x27/0x40
> >   [<ffffffff817e604c>] call_function_single_interrupt+0x8c/0xa0
> >   [<ffffffff81026703>] default_idle+0x23/0x150
> >   [<ffffffff8102702f>] arch_cpu_idle+0xf/0x20
> >   [<ffffffff810efd8a>] default_idle_call+0x2a/0x40
> >   [<ffffffff810f0184>] cpu_startup_entry+0x384/0x3f0
> >   [<ffffffff817d6aaa>] rest_init+0x13a/0x140
> >   [<ffffffff81d6b01f>] start_kernel+0x49e/0x4bf
> >   [<ffffffff81d6a339>] x86_64_start_reservations+0x2a/0x2c
> >   [<ffffffff81d6a480>] x86_64_start_kernel+0x145/0x168
> > 
> > #012stack backtrace:
> > CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.0-rc4+ #54
> > Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> > 0000000000000000 06776aa479e380f6 ffff88011fc03ad0 ffffffff813dd019
> > ffffffff82a0f120 ffff88011fc03b10 ffffffff811c184d ffffffff81c10be8
> > ffffffff81c10be8 ffffffff81c10500 ffffffff81a56510 0000000000000000
> > Call Trace:
> > <IRQ>  [<ffffffff813dd019>] dump_stack+0x4b/0x72
> > [<ffffffff811c184d>] print_irq_inversion_bug.part.38+0x1a4/0x1b0
> > [<ffffffff810f7c1b>] check_usage_forwards+0x15b/0x160
> > [<ffffffff810f88c3>] mark_lock+0x333/0x610
> > [<ffffffff810f7ac0>] ? check_usage_backwards+0x160/0x160
> > [<ffffffff810f9ad9>] __lock_acquire+0x999/0x1a50
> > [<ffffffff810fb54e>] lock_acquire+0xce/0x1c0
> > [<ffffffff811d25ed>] ? test_clear_page_writeback+0x5d/0x210
> > [<ffffffff817e4810>] _raw_spin_lock_irqsave+0x50/0x70
> > [<ffffffff811d25ed>] ? test_clear_page_writeback+0x5d/0x210
> > [<ffffffff811d25ed>] test_clear_page_writeback+0x5d/0x210
> > [<ffffffff811c2b2f>] end_page_writeback+0x1f/0xa0
> > [<ffffffff81278df3>] end_buffer_async_write+0xd3/0x1a0
> > [<ffffffff81277828>] end_bio_bh_io_sync+0x28/0x40
> > [<ffffffff813a329f>] bio_endio+0x3f/0x60
> > [<ffffffff8162d4d9>] dec_pending+0x149/0x310
> > [<ffffffff811c6809>] ? mempool_free+0x29/0x80
> > [<ffffffff8162e1d6>] clone_endio+0x76/0xe0
> > [<ffffffff813a329f>] bio_endio+0x3f/0x60
> > [<ffffffff813ab512>] blk_update_request+0xb2/0x3a0
> > [<ffffffff813b4e90>] ? blkdev_issue_zeroout+0xf0/0xf0
> > [<ffffffff813b63ca>] blk_mq_end_request+0x1a/0x70
> > [<ffffffffa009e82f>] virtblk_request_done+0x3f/0x70 [virtio_blk]
> > [<ffffffff813b4ea3>] __blk_mq_complete_request_remote+0x13/0x20
> > [<ffffffff81137d9f>] flush_smp_call_function_queue+0x5f/0x160
> > [<ffffffff811388d3>] generic_smp_call_function_single_interrupt+0x13/0x60
> > [<ffffffff810530a7>] smp_call_function_single_interrupt+0x27/0x40
> > [<ffffffff817e604c>] call_function_single_interrupt+0x8c/0xa0
> > <EOI>  [<ffffffff810653a6>] ? native_safe_halt+0x6/0x10
> > [<ffffffff81026703>] default_idle+0x23/0x150
> > [<ffffffff8102702f>] arch_cpu_idle+0xf/0x20
> > [<ffffffff810efd8a>] default_idle_call+0x2a/0x40
> > [<ffffffff810f0184>] cpu_startup_entry+0x384/0x3f0
> > [<ffffffff817d6aaa>] rest_init+0x13a/0x140
> > [<ffffffff81d6b01f>] start_kernel+0x49e/0x4bf
> > [<ffffffff81d6a120>] ? early_idt_handler_array+0x120/0x120
> > [<ffffffff81d6a339>] x86_64_start_reservations+0x2a/0x2c
> > [<ffffffff81d6a480>] x86_64_start_kernel+0x145/0x168
> > 
> > >  fs/block_dev.c              |   2 +
> > >  fs/fs-writeback.c           | 148 +++++++++++++++++++++++++++++++++++---------
> > >  fs/inode.c                  |   1 +
> > >  include/linux/backing-dev.h |   5 ++
> > >  include/linux/fs.h          |   1 +
> > >  mm/backing-dev.c            |   1 +
> > >  mm/page-writeback.c         |  14 +++++
> > >  7 files changed, 144 insertions(+), 28 deletions(-)
> > > 
> > > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > > index f2a89be..2726ff1 100644
> > > --- a/fs/block_dev.c
> > > +++ b/fs/block_dev.c
> > > @@ -64,6 +64,8 @@ void kill_bdev(struct block_device *bdev)
> > >  {
> > >  	struct address_space *mapping = bdev->bd_inode->i_mapping;
> > >  
> > > +	bdi_clear_inode_writeback(inode_to_bdi(bdev->bd_inode),
> > > +				  bdev->bd_inode);
> > >  	if (mapping->nrpages == 0 && mapping->nrshadows == 0)
> > >  		return;
> > >  
> > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > > index aa72536..5c005ad 100644
> > > --- a/fs/fs-writeback.c
> > > +++ b/fs/fs-writeback.c
> > > @@ -210,6 +210,34 @@ void inode_wb_list_del(struct inode *inode)
> > >  }
> > >  
> > >  /*
> > > + * mark an inode as under writeback on the given bdi
> > > + */
> > > +void bdi_mark_inode_writeback(struct backing_dev_info *bdi, struct inode *inode)
> > > +{
> > > +	WARN_ON_ONCE(bdi != inode_to_bdi(inode));
> > > +	if (list_empty(&inode->i_wb_list)) {
> > > +		spin_lock(&bdi->wb.list_lock);
> > > +		if (list_empty(&inode->i_wb_list))
> > > +			list_add_tail(&inode->i_wb_list, &bdi->wb.b_writeback);
> > > +		spin_unlock(&bdi->wb.list_lock);
> > > +	}
> > > +}
> > > +
> > > +/*
> > > + * clear an inode as under writeback on the given bdi
> > > + */
> > > +void bdi_clear_inode_writeback(struct backing_dev_info *bdi,
> > > +			       struct inode *inode)
> > > +{
> > > +	WARN_ON_ONCE(bdi != inode_to_bdi(inode));
> > > +	if (!list_empty(&inode->i_wb_list)) {
> > > +		spin_lock(&bdi->wb.list_lock);
> > > +		list_del_init(&inode->i_wb_list);
> > > +		spin_unlock(&bdi->wb.list_lock);
> > > +	}
> > > +}
> > > +
> > > +/*
> > >   * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
> > >   * furthest end of its superblock's dirty-inode list.
> > >   *
> > > @@ -383,13 +411,28 @@ static void __inode_wait_for_writeback(struct inode *inode)
> > >  }
> > >  
> > >  /*
> > > - * Wait for writeback on an inode to complete. Caller must have inode pinned.
> > > + * Wait for writeback on an inode to complete during eviction. Caller must have
> > > + * inode pinned.
> > >   */
> > >  void inode_wait_for_writeback(struct inode *inode)
> > >  {
> > > +	BUG_ON(!(inode->i_state & I_FREEING));
> > > +
> > >  	spin_lock(&inode->i_lock);
> > >  	__inode_wait_for_writeback(inode);
> > >  	spin_unlock(&inode->i_lock);
> > > +
> > > +	/*
> > > +	 * For a bd_inode when we do inode_to_bdi we'll want to get the bdev for
> > > +	 * the inode and then deref bdev->bd_disk, which at this point has been
> > > +	 * set to NULL, so we would panic.  At the point we are dropping our
> > > +	 * bd_inode we won't have any pages under writeback on the device so
> > > +	 * this is safe.  But just in case we'll assert to make sure we don't
> > > +	 * screw this up.
> > > +	 */
> > > +	if (!sb_is_blkdev_sb(inode->i_sb))
> > > +		bdi_clear_inode_writeback(inode_to_bdi(inode), inode);
> > > +	BUG_ON(!list_empty(&inode->i_wb_list));
> > >  }
> > >  
> > >  /*
> > > @@ -1372,7 +1415,9 @@ EXPORT_SYMBOL(__mark_inode_dirty);
> > >   */
> > >  static void wait_sb_inodes(struct super_block *sb)
> > >  {
> > > -	struct inode *inode, *old_inode = NULL;
> > > +	struct backing_dev_info *bdi = sb->s_bdi;
> > > +	LIST_HEAD(sync_list);
> > > +	struct inode *iput_inode = NULL;
> > >  
> > >  	/*
> > >  	 * We need to be protected against the filesystem going from
> > > @@ -1380,48 +1425,95 @@ static void wait_sb_inodes(struct super_block *sb)
> > >  	 */
> > >  	WARN_ON(!rwsem_is_locked(&sb->s_umount));
> > >  
> > > -	mutex_lock(&sb->s_sync_lock);
> > > -	spin_lock(&sb->s_inode_list_lock);
> > > -
> > >  	/*
> > > -	 * Data integrity sync. Must wait for all pages under writeback,
> > > -	 * because there may have been pages dirtied before our sync
> > > -	 * call, but which had writeout started before we write it out.
> > > -	 * In which case, the inode may not be on the dirty list, but
> > > -	 * we still have to wait for that writeout.
> > > +	 * Data integrity sync. Must wait for all pages under writeback, because
> > > +	 * there may have been pages dirtied before our sync call, but which had
> > > +	 * writeout started before we write it out.  In which case, the inode
> > > +	 * may not be on the dirty list, but we still have to wait for that
> > > +	 * writeout.
> > > +	 *
> > > +	 * To avoid syncing inodes put under IO after we have started here,
> > > +	 * splice the io list to a temporary list head and walk that. Newly
> > > +	 * dirtied inodes will go onto the primary list so we won't wait for
> > > +	 * them. This is safe to do as we are serialised by the s_sync_lock,
> > > +	 * so we'll complete processing the complete list before the next
> > > +	 * sync operation repeats the splice-and-walk process.
> > > +	 *
> > > +	 * Inodes that have pages under writeback after we've finished the wait
> > > +	 * may or may not be on the primary list. They had pages under put IO
> > > +	 * after we started our wait, so we need to make sure the next sync or
> > > +	 * IO completion treats them correctly, Move them back to the primary
> > > +	 * list and restart the walk.
> > > +	 *
> > > +	 * Inodes that are clean after we have waited for them don't belong on
> > > +	 * any list, so simply remove them from the sync list and move onwards.
> > >  	 */
> > > -	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> > > +	mutex_lock(&sb->s_sync_lock);
> > > +	spin_lock(&bdi->wb.list_lock);
> > > +	list_splice_init(&bdi->wb.b_writeback, &sync_list);
> > > +
> > > +	while (!list_empty(&sync_list)) {
> > > +		struct inode *inode = list_first_entry(&sync_list, struct inode,
> > > +						       i_wb_list);
> > >  		struct address_space *mapping = inode->i_mapping;
> > >  
> > > +		/*
> > > +		 * We are lazy on IO completion and don't remove inodes from the
> > > +		 * list when they are clean. Detect that immediately and skip
> > > +		 * inodes we don't ahve to wait on.
> > > +		 */
> > > +		if (!mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
> > > +			list_del_init(&inode->i_wb_list);
> > > +			cond_resched_lock(&bdi->wb.list_lock);
> > > +			continue;
> > > +		}
> > > +
> > >  		spin_lock(&inode->i_lock);
> > > -		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
> > > -		    (mapping->nrpages == 0)) {
> > > +		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
> > > +			list_move(&inode->i_wb_list, &bdi->wb.b_writeback);
> > >  			spin_unlock(&inode->i_lock);
> > > +			cond_resched_lock(&bdi->wb.list_lock);
> > >  			continue;
> > >  		}
> > >  		__iget(inode);
> > >  		spin_unlock(&inode->i_lock);
> > > -		spin_unlock(&sb->s_inode_list_lock);
> > > +		spin_unlock(&bdi->wb.list_lock);
> > >  
> > > -		/*
> > > -		 * We hold a reference to 'inode' so it couldn't have been
> > > -		 * removed from s_inodes list while we dropped the
> > > -		 * s_inode_list_lock.  We cannot iput the inode now as we can
> > > -		 * be holding the last reference and we cannot iput it under
> > > -		 * s_inode_list_lock. So we keep the reference and iput it
> > > -		 * later.
> > > -		 */
> > > -		iput(old_inode);
> > > -		old_inode = inode;
> > > +		if (iput_inode)
> > > +			iput(iput_inode);
> > >  
> > >  		filemap_fdatawait(mapping);
> > > -
> > >  		cond_resched();
> > >  
> > > -		spin_lock(&sb->s_inode_list_lock);
> > > +		/*
> > > +		 * the inode has been written back now, so check whether we
> > > +		 * still have pages under IO and move it back to the primary
> > > +		 * list if necessary.  We really need the mapping->tree_lock
> > > +		 * here because bdi_mark_inode_writeback may have not done
> > > +		 * anything because we were on the spliced list and we need to
> > > +		 * check TAG_WRITEBACK.
> > > +		 */
> > > +		spin_lock_irq(&mapping->tree_lock);
> > > +		spin_lock(&bdi->wb.list_lock);
> > > +		if (mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
> > > +			WARN_ON(list_empty(&inode->i_wb_list));
> > > +			list_move(&inode->i_wb_list, &bdi->wb.b_writeback);
> > > +                } else
> > > +			list_del_init(&inode->i_wb_list);
> > > +		spin_unlock_irq(&mapping->tree_lock);
> > > +
> > > +		/*
> > > +		 * can't iput inode while holding the wb.list_lock. Save it for
> > > +		 * the next time through the loop when we drop all our spin
> > > +		 * locks.
> > > +		 */
> > > +		iput_inode = inode;
> > >  	}
> > > -	spin_unlock(&sb->s_inode_list_lock);
> > > -	iput(old_inode);
> > > +	spin_unlock(&bdi->wb.list_lock);
> > > +
> > > +	if (iput_inode)
> > > +		iput(iput_inode);
> > > +
> > >  	mutex_unlock(&sb->s_sync_lock);
> > >  }
> > >  
> > > diff --git a/fs/inode.c b/fs/inode.c
> > > index 770d684..8f00557 100644
> > > --- a/fs/inode.c
> > > +++ b/fs/inode.c
> > > @@ -357,6 +357,7 @@ void inode_init_once(struct inode *inode)
> > >  	INIT_HLIST_NODE(&inode->i_hash);
> > >  	INIT_LIST_HEAD(&inode->i_devices);
> > >  	INIT_LIST_HEAD(&inode->i_io_list);
> > > +	INIT_LIST_HEAD(&inode->i_wb_list);
> > >  	INIT_LIST_HEAD(&inode->i_lru);
> > >  	address_space_init_once(&inode->i_data);
> > >  	i_size_ordered_init(inode);
> > > diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> > > index d87d8ec..0d5bca2 100644
> > > --- a/include/linux/backing-dev.h
> > > +++ b/include/linux/backing-dev.h
> > > @@ -56,6 +56,7 @@ struct bdi_writeback {
> > >  	struct list_head b_io;		/* parked for writeback */
> > >  	struct list_head b_more_io;	/* parked for more writeback */
> > >  	struct list_head b_dirty_time;	/* time stamps are dirty */
> > > +	struct list_head b_writeback;	/* inodes under writeback */
> > >  	spinlock_t list_lock;		/* protects the b_* lists */
> > >  };
> > >  
> > > @@ -123,6 +124,10 @@ void bdi_start_background_writeback(struct backing_dev_info *bdi);
> > >  void bdi_writeback_workfn(struct work_struct *work);
> > >  int bdi_has_dirty_io(struct backing_dev_info *bdi);
> > >  void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi);
> > > +void bdi_mark_inode_writeback(struct backing_dev_info *bdi,
> > > +			      struct inode *inode);
> > > +void bdi_clear_inode_writeback(struct backing_dev_info *bdi,
> > > +			       struct inode *inode);
> > >  
> > >  extern spinlock_t bdi_lock;
> > >  extern struct list_head bdi_list;
> > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > index 6dd2ab2..d3baa08 100644
> > > --- a/include/linux/fs.h
> > > +++ b/include/linux/fs.h
> > > @@ -636,6 +636,7 @@ struct inode {
> > >  	struct list_head	i_io_list;	/* backing dev IO list */
> > >  	struct list_head	i_lru;		/* inode LRU list */
> > >  	struct list_head	i_sb_list;
> > > +	struct list_head	i_wb_list;	/* backing dev writeback list */
> > >  	union {
> > >  		struct hlist_head	i_dentry;
> > >  		struct rcu_head		i_rcu;
> > > diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> > > index e9ed047..ea39301 100644
> > > --- a/mm/backing-dev.c
> > > +++ b/mm/backing-dev.c
> > > @@ -369,6 +369,7 @@ static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
> > >  	INIT_LIST_HEAD(&wb->b_io);
> > >  	INIT_LIST_HEAD(&wb->b_more_io);
> > >  	INIT_LIST_HEAD(&wb->b_dirty_time);
> > > +	INIT_LIST_HEAD(&wb->b_writeback);
> > >  	spin_lock_init(&wb->list_lock);
> > >  	INIT_DELAYED_WORK(&wb->dwork, bdi_writeback_workfn);
> > >  }
> > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > > index eb59f7e..f3751d1 100644
> > > --- a/mm/page-writeback.c
> > > +++ b/mm/page-writeback.c
> > > @@ -2382,11 +2382,25 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
> > >  		spin_lock_irqsave(&mapping->tree_lock, flags);
> > >  		ret = TestSetPageWriteback(page);
> > >  		if (!ret) {
> > > +			bool on_wblist;
> > > +
> > > +			on_wblist = mapping_tagged(mapping,
> > > +						   PAGECACHE_TAG_WRITEBACK);
> > > +
> > >  			radix_tree_tag_set(&mapping->page_tree,
> > >  						page_index(page),
> > >  						PAGECACHE_TAG_WRITEBACK);
> > >  			if (bdi_cap_account_writeback(bdi))
> > >  				__inc_bdi_stat(bdi, BDI_WRITEBACK);
> > > +
> > > +			/*
> > > +			 * we can come through here when swapping anonymous
> > > +			 * pages, so we don't necessarily have an inode to
> > > +			 * track for sync. Inodes are remove lazily, so there is
> > > +			 * no equivalent in test_clear_page_writeback().
> > > +			 */
> > > +			if (!on_wblist && mapping->host)
> > > +				bdi_mark_inode_writeback(bdi, mapping->host);
> > >  		}
> > >  		if (!PageDirty(page))
> > >  			radix_tree_tag_clear(&mapping->page_tree,
> > > -- 
> > > 2.1.0
> > > 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2015-12-11 14:37 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-03-20 17:14 [PATCH 0/8] Sync and VFS scalability improvements V2 Josef Bacik
2015-03-20 17:14 ` [PATCH 1/8] writeback: plug writeback at a high level Josef Bacik
2015-03-20 17:14 ` [PATCH 2/8] inode: add hlist_fake to avoid the inode hash lock in evict Josef Bacik
2015-03-20 17:14 ` [PATCH 3/8] inode: convert inode_sb_list_lock to per-sb Josef Bacik
2015-03-20 17:14 ` [PATCH 4/8] sync: serialise per-superblock sync operations Josef Bacik
2015-03-20 17:14 ` [PATCH 5/8] inode: rename i_wb_list to i_io_list Josef Bacik
2015-03-20 17:14 ` [PATCH 6/8] bdi: add a new writeback list for sync Josef Bacik
2015-03-20 17:14 ` [PATCH 7/8] writeback: periodically trim the writeback list Josef Bacik
2015-03-20 18:49   ` [PATCH 7/8] bdi: add a new writeback list for sync V3 Josef Bacik
2015-04-01  8:44     ` Jan Kara
2015-04-01  8:46     ` Jan Kara
2015-03-20 17:14 ` [PATCH 8/8] inode: don't softlockup when evicting inodes Josef Bacik
2015-04-01  8:05   ` Jan Kara
2015-04-07 15:03     ` Josef Bacik
  -- strict thread matches above, loose matches on Subject: below --
2015-06-11 19:41 [PATCH 0/7] super block scalabilit patches V3 Josef Bacik
2015-06-11 19:41 ` [PATCH 6/8] bdi: add a new writeback list for sync Josef Bacik
2015-06-15 14:12   ` Jan Kara
2015-06-16 15:42     ` Josef Bacik
2015-06-17 10:34       ` Jan Kara
2015-06-17 17:55         ` Josef Bacik
2015-06-18  9:28           ` Jan Kara
2015-06-24  3:23 [PATCH 0/8] super block scalability patches V4 Josef Bacik
2015-06-24  3:24 ` [PATCH 6/8] bdi: add a new writeback list for sync Josef Bacik
2015-12-09 18:40   ` Brian Foster
2015-12-10 10:08     ` Jan Kara
2015-12-11 14:37       ` Brian Foster

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).