- * [PATCH 01/45] writeback: add struct dirty_context
  2015-01-06 21:25 [PATCHSET RFC block/for-next] writeback: cgroup writeback support Tejun Heo
@ 2015-01-06 21:25 ` Tejun Heo
  2015-01-06 21:25 ` [PATCH 02/45] writeback: add {CONFIG|BDI_CAP|FS}_CGROUP_WRITEBACK Tejun Heo
                   ` (44 subsequent siblings)
  45 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2015-01-06 21:25 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, jack, hch, hannes, linux-fsdevel, vgoyal, lizefan,
	cgroups, linux-mm, mhocko, clm, fengguang.wu, david, Tejun Heo

Add struct dirty_context and make page and inode dirty paths use it as
the parameter carrier.  dirty_context currently hosts ->page,
->mapping and ->inode and is initialized by init_dirty_inode_context()
or init_dirty_page_context() for non-data inode and data page dirtying
respectively.

For non-data dirtying, mark_inode_dirty_dctx() is added and
__mark_inode_dirty() is made a simple wrapper on top of it as
__mark_inode_dirty() has quite a few users.  For page dirtying,
account_page_dirtied() is updated to take dirty_context so that both
the inode and page dirtying can use the same dirty_context.

This currently doesn't make any functional difference but cgroup
writeback support will add more fields to the struct and use them to
share context between page and inode dirtying.

An include of backing-dev-defs.h is added to fs.h and mm.h for
dirty_context and the now unnecessary explicit declaration of
backing_dev_info is removed from fs.h.

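
The calling convention described above can be tried out in userspace.  The
following is only a rough sketch under stated assumptions: the struct inode,
struct address_space and struct page definitions are hypothetical
stand-ins reduced to the fields the example needs, assert() stands in for
BUG_ON(), locking is omitted, and legacy_mark_inode_dirty() is an illustrative
name for the role __mark_inode_dirty() plays after this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace stand-ins for the kernel types. */
struct inode { int dirty_flags; };
struct address_space { struct inode *host; };
struct page { struct address_space *mapping; };

/* Mirrors the three fields the patch introduces. */
struct dirty_context {
	struct page *page;
	struct inode *inode;
	struct address_space *mapping;
};

/* Data page dirtying: page, mapping and owning inode are all recorded. */
void init_dirty_page_context(struct dirty_context *dctx, struct page *page,
			     struct address_space *mapping)
{
	dctx->page = page;
	dctx->inode = mapping->host;
	dctx->mapping = page->mapping;
	assert(dctx->mapping == mapping);	/* stands in for BUG_ON() */
}

/* Non-data dirtying: only the inode is set, everything else is zeroed. */
void init_dirty_inode_context(struct dirty_context *dctx, struct inode *inode)
{
	memset(dctx, 0, sizeof(*dctx));
	dctx->inode = inode;
}

/* The worker takes the context instead of a bare inode. */
void mark_inode_dirty_dctx(struct dirty_context *dctx, int flags)
{
	dctx->inode->dirty_flags |= flags;
}

/* Existing callers keep passing an inode; the wrapper builds the context. */
void legacy_mark_inode_dirty(struct inode *inode, int flags)
{
	struct dirty_context dctx;

	init_dirty_inode_context(&dctx, inode);
	mark_inode_dirty_dctx(&dctx, flags);
}
```

A page dirtying path would call init_dirty_page_context() once and hand the
same context to both the page accounting and mark_inode_dirty_dctx(), which
is what lets later fields be shared between the two steps.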
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
---
 fs/buffer.c                      |  9 ++++---
 fs/fs-writeback.c                | 56 +++++++++++++++++++++++++++++++++++++---
 fs/xfs/xfs_aops.c                |  7 +++--
 include/linux/backing-dev-defs.h | 10 +++++++
 include/linux/backing-dev.h      |  4 +++
 include/linux/fs.h               |  3 ++-
 include/linux/mm.h               |  3 ++-
 mm/page-writeback.c              | 14 ++++++----
 8 files changed, 91 insertions(+), 15 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index 20805db..2dab7dd 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -26,6 +26,7 @@
 #include <linux/slab.h>
 #include <linux/capability.h>
 #include <linux/blkdev.h>
+#include <linux/backing-dev.h>
 #include <linux/file.h>
 #include <linux/quotaops.h>
 #include <linux/highmem.h>
@@ -627,17 +628,19 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
 static void __set_page_dirty(struct page *page,
 		struct address_space *mapping, int warn)
 {
+	struct dirty_context dctx;
 	unsigned long flags;
 
 	spin_lock_irqsave(&mapping->tree_lock, flags);
-	if (page->mapping) {	/* Race with truncate? */
+	init_dirty_page_context(&dctx, page, mapping);
+	if (dctx.mapping) {	/* Race with truncate? */
 		WARN_ON_ONCE(warn && !PageUptodate(page));
-		account_page_dirtied(page, mapping);
+		account_page_dirtied(&dctx);
 		radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 	}
 	spin_unlock_irqrestore(&mapping->tree_lock, flags);
-	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+	mark_inode_dirty_dctx(&dctx, I_DIRTY_PAGES);
 }
 
 /*
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 5130895..97c92b3 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -106,6 +106,46 @@ out_unlock:
 	spin_unlock_bh(&wb->work_lock);
 }
 
+/**
+ * init_dirty_page_context - init dirty_context for page dirtying
+ * @dctx: dirty_context to initialize
+ * @page: page to be dirtied
+ *
+ * @page is about to be dirtied, prepare @dctx accordingly.  Must be called
+ * with @mapping->tree_lock held.  The inode dirtying due to @page dirtying
+ * should use the same @dctx.
+ *
+ * @mapping may have been obtained before the lock was acquired and
+ * @dctx->mapping can be set to NULL even if @mapping isn't if truncate
+ * took place in-between.  @dctx->inode is always set to @mapping->host.
+ */
+void init_dirty_page_context(struct dirty_context *dctx, struct page *page,
+			     struct address_space *mapping)
+{
+	lockdep_assert_held(&mapping->tree_lock);
+
+	dctx->page = page;
+	dctx->inode = mapping->host;
+	dctx->mapping = page_mapping(page);
+
+	BUG_ON(dctx->mapping != mapping);
+}
+EXPORT_SYMBOL_GPL(init_dirty_page_context);
+
+/**
+ * init_dirty_inode_context - init dirty_context for inode dirtying
+ * @dctx: dirty_context to initialize
+ * @inode: inode to be dirtied
+ *
+ * @inode is about to be dirtied w/o a page belonging to it being dirtied,
+ * prepare @dctx accordingly.
+ */
+void init_dirty_inode_context(struct dirty_context *dctx, struct inode *inode)
+{
+	memset(dctx, 0, sizeof(*dctx));
+	dctx->inode = inode;
+}
+
 static void __wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
 				 bool range_cyclic, enum wb_reason reason)
 {
@@ -1107,8 +1147,8 @@ static noinline void block_dump___mark_inode_dirty(struct inode *inode)
 }
 
 /**
- *	__mark_inode_dirty -	internal function
- *	@inode: inode to mark
+ *	mark_inode_dirty_dctx -	internal function
+ *	@dctx: dirty_context containing the target inode
  *	@flags: what kind of dirty (i.e. I_DIRTY_SYNC)
  *	Mark an inode as dirty. Callers should use mark_inode_dirty or
  *  	mark_inode_dirty_sync.
@@ -1130,8 +1170,9 @@ static noinline void block_dump___mark_inode_dirty(struct inode *inode)
  * page->mapping->host, so the page-dirtying time is recorded in the internal
  * blockdev inode.
  */
-void __mark_inode_dirty(struct inode *inode, int flags)
+void mark_inode_dirty_dctx(struct dirty_context *dctx, int flags)
 {
+	struct inode *inode = dctx->inode;
 	struct super_block *sb = inode->i_sb;
 	struct backing_dev_info *bdi = NULL;
 
@@ -1222,6 +1263,15 @@ out_unlock_inode:
 	spin_unlock(&inode->i_lock);
 
 }
+EXPORT_SYMBOL(mark_inode_dirty_dctx);
+
+void __mark_inode_dirty(struct inode *inode, int flags)
+{
+	struct dirty_context dctx;
+
+	init_dirty_inode_context(&dctx, inode);
+	mark_inode_dirty_dctx(&dctx, flags);
+}
 EXPORT_SYMBOL(__mark_inode_dirty);
 
 static void wait_sb_inodes(struct super_block *sb)
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 18e2f3b..fb94975 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -36,6 +36,7 @@
 #include <linux/mpage.h>
 #include <linux/pagevec.h>
 #include <linux/writeback.h>
+#include <linux/backing-dev.h>
 
 void
 xfs_count_page_state(
@@ -1814,17 +1815,19 @@ xfs_vm_set_page_dirty(
 
 	if (newly_dirty) {
 		/* sigh - __set_page_dirty() is static, so copy it here, too */
+		struct dirty_context dctx;
 		unsigned long flags;
 
 		spin_lock_irqsave(&mapping->tree_lock, flags);
+		init_dirty_page_context(&dctx, page, mapping);
 		if (page->mapping) {	/* Race with truncate? */
 			WARN_ON_ONCE(!PageUptodate(page));
-			account_page_dirtied(page, mapping);
+			account_page_dirtied(&dctx);
 			radix_tree_tag_set(&mapping->page_tree,
 					page_index(page), PAGECACHE_TAG_DIRTY);
 		}
 		spin_unlock_irqrestore(&mapping->tree_lock, flags);
-		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+		mark_inode_dirty_dctx(&dctx, I_DIRTY_PAGES);
 	}
 	return newly_dirty;
 }
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 2874d83..bf20ef1 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -94,6 +94,16 @@ struct backing_dev_info {
 #endif
 };
 
+/*
+ * The following structure carries context used during page and inode
+ * dirtying.  Should be initialized with init_dirty_{inode|page}_context().
+ */
+struct dirty_context {
+	struct page		*page;
+	struct inode		*inode;
+	struct address_space	*mapping;
+};
+
 enum {
 	BLK_RW_ASYNC	= 0,
 	BLK_RW_SYNC	= 1,
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 3c6fd34..34fe620 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -263,4 +263,8 @@ static inline struct backing_dev_info *inode_to_bdi(struct inode *inode)
 	return sb->s_bdi;
 }
 
+void init_dirty_page_context(struct dirty_context *dctx, struct page *page,
+			     struct address_space *mapping);
+void init_dirty_inode_context(struct dirty_context *dctx, struct inode *inode);
+
 #endif		/* _LINUX_BACKING_DEV_H */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 8639770..9b63758 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -30,6 +30,7 @@
 #include <linux/lockdep.h>
 #include <linux/percpu-rwsem.h>
 #include <linux/blk_types.h>
+#include <linux/backing-dev-defs.h>
 
 #include <asm/byteorder.h>
 #include <uapi/linux/fs.h>
@@ -394,7 +395,6 @@ int pagecache_write_end(struct file *, struct address_space *mapping,
 				loff_t pos, unsigned len, unsigned copied,
 				struct page *page, void *fsdata);
 
-struct backing_dev_info;
 struct address_space {
 	struct inode		*host;		/* owner: inode, block_device */
 	struct radix_tree_root	page_tree;	/* radix tree of all pages */
@@ -1749,6 +1749,7 @@ struct super_operations {
 
 #define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)
 
+extern void mark_inode_dirty_dctx(struct dirty_context *dctx, int flags);
 extern void __mark_inode_dirty(struct inode *, int);
 static inline void mark_inode_dirty(struct inode *inode)
 {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0c15841..825acb8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -20,6 +20,7 @@
 #include <linux/shrinker.h>
 #include <linux/resource.h>
 #include <linux/page_ext.h>
+#include <linux/backing-dev-defs.h>
 
 struct mempolicy;
 struct anon_vma;
@@ -1250,7 +1251,7 @@ int __set_page_dirty_nobuffers(struct page *page);
 int __set_page_dirty_no_writeback(struct page *page);
 int redirty_page_for_writepage(struct writeback_control *wbc,
 				struct page *page);
-void account_page_dirtied(struct page *page, struct address_space *mapping);
+void account_page_dirtied(struct dirty_context *dctx);
 int set_page_dirty(struct page *page);
 int set_page_dirty_lock(struct page *page);
 int clear_page_dirty_for_io(struct page *page);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 0632a43..0e35ff4 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2090,8 +2090,11 @@ int __set_page_dirty_no_writeback(struct page *page)
  * Helper function for set_page_dirty family.
  * NOTE: This relies on being atomic wrt interrupts.
  */
-void account_page_dirtied(struct page *page, struct address_space *mapping)
+void account_page_dirtied(struct dirty_context *dctx)
 {
+	struct page *page = dctx->page;
+	struct address_space *mapping = dctx->mapping;
+
 	trace_writeback_dirty_page(page, mapping);
 
 	if (!mapping_cap_account_dirty(mapping))
@@ -2123,21 +2126,22 @@ int __set_page_dirty_nobuffers(struct page *page)
 {
 	if (!TestSetPageDirty(page)) {
 		struct address_space *mapping = page_mapping(page);
+		struct dirty_context dctx;
 		unsigned long flags;
 
 		if (!mapping)
 			return 1;
 
 		spin_lock_irqsave(&mapping->tree_lock, flags);
-		BUG_ON(page_mapping(page) != mapping);
+		init_dirty_page_context(&dctx, page, mapping);
 		WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
-		account_page_dirtied(page, mapping);
+		account_page_dirtied(&dctx);
 		radix_tree_tag_set(&mapping->page_tree, page_index(page),
 				   PAGECACHE_TAG_DIRTY);
 		spin_unlock_irqrestore(&mapping->tree_lock, flags);
-		if (mapping->host) {
+		if (dctx.inode) {
 			/* !PageAnon && !swapper_space */
-			__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+			mark_inode_dirty_dctx(&dctx, I_DIRTY_PAGES);
 		}
 		return 1;
 	}
-- 
2.1.0
- * [PATCH 02/45] writeback: add {CONFIG|BDI_CAP|FS}_CGROUP_WRITEBACK
From: Tejun Heo @ 2015-01-06 21:25 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, jack, hch, hannes, linux-fsdevel, vgoyal, lizefan,
	cgroups, linux-mm, mhocko, clm, fengguang.wu, david, Tejun Heo

cgroup writeback requires support from both the bdi and filesystem
sides.  Add BDI_CAP_CGROUP_WRITEBACK and FS_CGROUP_WRITEBACK to indicate
support and enable BDI_CAP_CGROUP_WRITEBACK on block-based bdi's by
default.  Also, define CONFIG_CGROUP_WRITEBACK which is enabled if
both MEMCG and BLK_CGROUP are enabled.

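
The "both sides must opt in" rule can be illustrated with a small userspace
sketch.  The flag values are the ones used in this patch; the structure
layouts are hypothetical and flattened (the real check goes through
inode->i_sb->s_type), and the dirty-accounting test is written out as the
assumption that accounting is on unless BDI_CAP_NO_ACCT_DIRTY is set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define BDI_CAP_NO_ACCT_DIRTY		0x00000001
#define BDI_CAP_CGROUP_WRITEBACK	0x00000800	/* value from this patch */
#define FS_CGROUP_WRITEBACK		32		/* value from this patch */

/* Hypothetical, flattened stand-ins for the kernel structures. */
struct backing_dev_info { unsigned int capabilities; };
struct file_system_type { int fs_flags; };
struct inode { struct file_system_type *fs_type; };
struct address_space {
	struct backing_dev_info *bdi;
	struct inode *host;
};

/* True only if the bdi accounts dirty pages and both sides set their flag. */
bool mapping_cgwb_enabled(const struct address_space *mapping)
{
	const struct backing_dev_info *bdi = mapping->bdi;
	const struct inode *inode = mapping->host;

	assert(bdi != NULL);

	return !(bdi->capabilities & BDI_CAP_NO_ACCT_DIRTY) &&
	       (bdi->capabilities & BDI_CAP_CGROUP_WRITEBACK) &&
	       inode != NULL &&
	       (inode->fs_type->fs_flags & FS_CGROUP_WRITEBACK);
}
```

With this shape, a filesystem that has not been audited for cgroup writeback
stays on the root path even when it sits on a capable block device.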
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
---
 block/blk-core.c            |  3 ++-
 include/linux/backing-dev.h | 31 +++++++++++++++++++++++++++++++
 include/linux/fs.h          |  1 +
 init/Kconfig                |  5 +++++
 4 files changed, 39 insertions(+), 1 deletion(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index 723e4a3..ff4d2f8 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -606,7 +606,8 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 
 	q->backing_dev_info.ra_pages =
 			(VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
-	q->backing_dev_info.capabilities = BDI_CAP_MAP_COPY;
+	q->backing_dev_info.capabilities = BDI_CAP_MAP_COPY |
+					   BDI_CAP_CGROUP_WRITEBACK;
 	q->backing_dev_info.name = "block";
 	q->node = node_id;
 
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 34fe620..68c2fd7 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -146,6 +146,8 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio);
  * BDI_CAP_SWAP_BACKED:    Count shmem/tmpfs objects as swap-backed.
  *
  * BDI_CAP_STRICTLIMIT:    Keep number of dirty pages below bdi threshold.
+ *
+ * BDI_CAP_CGROUP_WRITEBACK: Supports cgroup-aware writeback.
  */
 #define BDI_CAP_NO_ACCT_DIRTY	0x00000001
 #define BDI_CAP_NO_WRITEBACK	0x00000002
@@ -158,6 +160,7 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio);
 #define BDI_CAP_SWAP_BACKED	0x00000100
 #define BDI_CAP_STABLE_WRITES	0x00000200
 #define BDI_CAP_STRICTLIMIT	0x00000400
+#define BDI_CAP_CGROUP_WRITEBACK 0x00000800
 
 #define BDI_CAP_VMFLAGS \
 	(BDI_CAP_READ_MAP | BDI_CAP_WRITE_MAP | BDI_CAP_EXEC_MAP)
@@ -267,4 +270,32 @@ void init_dirty_page_context(struct dirty_context *dctx, struct page *page,
 			     struct address_space *mapping);
 void init_dirty_inode_context(struct dirty_context *dctx, struct inode *inode);
 
+#ifdef CONFIG_CGROUP_WRITEBACK
+
+/**
+ * mapping_cgwb_enabled - test whether cgroup writeback is enabled on a mapping
+ * @mapping: address_space of interest
+ *
+ * cgroup writeback requires support from both the bdi and filesystem.
+ * Test whether @mapping has both.
+ */
+static inline bool mapping_cgwb_enabled(struct address_space *mapping)
+{
+	struct backing_dev_info *bdi = mapping->backing_dev_info;
+	struct inode *inode = mapping->host;
+
+	return bdi_cap_account_dirty(bdi) &&
+		(bdi->capabilities & BDI_CAP_CGROUP_WRITEBACK) &&
+		inode && (inode->i_sb->s_type->fs_flags & FS_CGROUP_WRITEBACK);
+}
+
+#else	/* CONFIG_CGROUP_WRITEBACK */
+
+static inline bool mapping_cgwb_enabled(struct address_space *mapping)
+{
+	return false;
+}
+
+#endif	/* CONFIG_CGROUP_WRITEBACK */
+
 #endif		/* _LINUX_BACKING_DEV_H */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9b63758..2f3df6a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1818,6 +1818,7 @@ struct file_system_type {
 #define FS_HAS_SUBTYPE		4
 #define FS_USERNS_MOUNT		8	/* Can be mounted by userns root */
 #define FS_USERNS_DEV_MOUNT	16 /* A userns mount does not imply MNT_NODEV */
+#define FS_CGROUP_WRITEBACK	32	/* Supports cgroup-aware writeback */
 #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename() internally. */
 	struct dentry *(*mount) (struct file_system_type *, int,
 		       const char *, void *);
diff --git a/init/Kconfig b/init/Kconfig
index 005d239..3fb9a53 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1122,6 +1122,11 @@ config DEBUG_BLK_CGROUP
 	Enable some debugging help. Currently it exports additional stat
 	files in a cgroup which can be useful for debugging.
 
+config CGROUP_WRITEBACK
+	bool
+	depends on MEMCG && BLK_CGROUP
+	default y
+
 endif # CGROUPS
 
 config CHECKPOINT_RESTORE
-- 
2.1.0
- * [PATCH 03/45] memcg: encode page_cgflags in the lower bits of page->mem_cgroup
From: Tejun Heo @ 2015-01-06 21:25 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, jack, hch, hannes, linux-fsdevel, vgoyal, lizefan,
	cgroups, linux-mm, mhocko, clm, fengguang.wu, david, Tejun Heo

cgroup writeback support will require several bits per page.  They can
easily be encoded in the lower bits of page->mem_cgroup, which only
requires increasing the alignment of struct mem_cgroup.  This patch

* Converts page->mem_cgroup to unsigned long so that nobody
  dereferences it directly and bit operations are easier.

* Introduces enum page_cgflags which will list the flags.  It
  currently only defines PCG_FLAGS_BITS and PCG_FLAGS_MASK.  The
  former is used to align struct mem_cgroup accordingly.

* Adds and applies two accessors - page_memcg_is_set() and
  page_memcg().

With PCG_FLAGS_BITS at zero, this patch shouldn't introduce any
noticeable functional changes.

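
The low-bit encoding is easier to see with a non-zero flag width.  The sketch
below is a userspace illustration only: it assumes four flag bits purely to
make the masking visible (this patch sets PCG_FLAGS_BITS to 0), uses the
GCC/Clang aligned attribute in place of the kernel's __aligned(), and the
page_set_memcg() and page_cgflags() helpers are hypothetical names that this
patch does not define.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: four flag bits; the patch itself uses 0. */
enum {
	PCG_FLAGS_BITS = 4,
	PCG_FLAGS_MASK = (1UL << PCG_FLAGS_BITS) - 1,
};

/* Aligning the struct guarantees the low PCG_FLAGS_BITS address bits are 0. */
struct mem_cgroup {
	int id;
} __attribute__((aligned(1UL << PCG_FLAGS_BITS)));

/* An integer rather than a pointer, so nobody dereferences it directly. */
struct page {
	uintptr_t mem_cgroup;	/* pointer bits | flag bits */
};

bool page_memcg_is_set(const struct page *page)
{
	return page->mem_cgroup != 0;
}

/* Strip the flag bits to recover the pointer. */
struct mem_cgroup *page_memcg(const struct page *page)
{
	return (struct mem_cgroup *)(page->mem_cgroup &
				     ~(uintptr_t)PCG_FLAGS_MASK);
}

/* Hypothetical helper: read back only the flag bits. */
unsigned int page_cgflags(const struct page *page)
{
	return (unsigned int)(page->mem_cgroup & PCG_FLAGS_MASK);
}

/* Hypothetical helper: store pointer and flags in one word. */
void page_set_memcg(struct page *page, struct mem_cgroup *memcg,
		    unsigned int flags)
{
	/* The pointer must not overlap the flag bits. */
	assert(((uintptr_t)memcg & PCG_FLAGS_MASK) == 0);
	page->mem_cgroup = (uintptr_t)memcg | (flags & PCG_FLAGS_MASK);
}
```

Clearing the word to 0 drops the pointer and the flags together, which matches
the rule that the flags are only meaningful while the pointer is set.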
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
---
 include/linux/mm_types.h |  3 +-
 mm/debug.c               |  2 +-
 mm/memcontrol.c          | 84 ++++++++++++++++++++++++++++++++----------------
 3 files changed, 59 insertions(+), 30 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 97f2bb3..7f6c5ef 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -21,7 +21,6 @@
 #define AT_VECTOR_SIZE (2*(AT_VECTOR_SIZE_ARCH + AT_VECTOR_SIZE_BASE + 1))
 
 struct address_space;
-struct mem_cgroup;
 
 #define USE_SPLIT_PTE_PTLOCKS	(NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS)
 #define USE_SPLIT_PMD_PTLOCKS	(USE_SPLIT_PTE_PTLOCKS && \
@@ -176,7 +175,7 @@ struct page {
 	};
 
 #ifdef CONFIG_MEMCG
-	struct mem_cgroup *mem_cgroup;
+	unsigned long mem_cgroup;
 #endif
 
 	/*
diff --git a/mm/debug.c b/mm/debug.c
index 0e58f32..94d91f9 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -97,7 +97,7 @@ void dump_page_badflags(struct page *page, const char *reason,
 	}
 #ifdef CONFIG_MEMCG
 	if (page->mem_cgroup)
-		pr_alert("page->mem_cgroup:%p\n", page->mem_cgroup);
+		pr_alert("page->mem_cgroup:%p\n", (void *)page->mem_cgroup);
 #endif
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 202e386..3ab3f04 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -87,6 +87,35 @@ static int really_do_swap_account __initdata;
 #define do_swap_account		0
 #endif
 
+/*
+ * Lower bits of page->mem_cgroup encode the following flags.  Use
+ * page_memcg*() and page_cgflags*() to access the pointer and flags
+ * respectively.  The flags can be used only while the pointer is set and
+ * are cleared together with it.
+ */
+enum page_cgflags {
+	PCG_FLAGS_BITS		= 0,
+	PCG_FLAGS_MASK		= ((1UL << PCG_FLAGS_BITS) - 1),
+};
+
+/* struct mem_cgroup should be accordingly aligned */
+#define MEM_CGROUP_ALIGN						\
+	((1UL << PCG_FLAGS_BITS) >= __alignof__(unsigned long long) ?	\
+	 (1UL << PCG_FLAGS_BITS) : __alignof__(unsigned long long))
+
+static bool page_memcg_is_set(struct page *page)
+{
+	if (page->mem_cgroup) {
+		WARN_ON_ONCE(!(page->mem_cgroup & ~PCG_FLAGS_MASK));
+		return true;
+	}
+	return false;
+}
+
+static struct mem_cgroup *page_memcg(struct page *page)
+{
+	return (void *)(page->mem_cgroup & ~PCG_FLAGS_MASK);
+}
 
 static const char * const mem_cgroup_stat_names[] = {
 	"cache",
@@ -362,7 +391,7 @@ struct mem_cgroup {
 
 	struct mem_cgroup_per_node *nodeinfo[0];
 	/* WARNING: nodeinfo must be the last member here */
-};
+} __aligned(MEM_CGROUP_ALIGN);
 
 #ifdef CONFIG_MEMCG_KMEM
 static bool memcg_kmem_is_active(struct mem_cgroup *memcg)
@@ -1268,7 +1297,7 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone)
 		goto out;
 	}
 
-	memcg = page->mem_cgroup;
+	memcg = page_memcg(page);
 	/*
 	 * Swapcache readahead pages are added to the LRU - and
 	 * possibly migrated - before they are charged.
@@ -2011,7 +2040,7 @@ struct mem_cgroup *mem_cgroup_begin_page_stat(struct page *page)
 	if (mem_cgroup_disabled())
 		return NULL;
 again:
-	memcg = page->mem_cgroup;
+	memcg = page_memcg(page);
 	if (unlikely(!memcg))
 		return NULL;
 
@@ -2019,7 +2048,7 @@ again:
 		return memcg;
 
 	spin_lock_irqsave(&memcg->move_lock, flags);
-	if (memcg != page->mem_cgroup) {
+	if (memcg != page_memcg(page)) {
 		spin_unlock_irqrestore(&memcg->move_lock, flags);
 		goto again;
 	}
@@ -2401,7 +2430,7 @@ struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 
-	memcg = page->mem_cgroup;
+	memcg = page_memcg(page);
 	if (memcg) {
 		if (!css_tryget_online(&memcg->css))
 			memcg = NULL;
@@ -2453,7 +2482,7 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg,
 {
 	int isolated;
 
-	VM_BUG_ON_PAGE(page->mem_cgroup, page);
+	VM_BUG_ON_PAGE(page_memcg_is_set(page), page);
 
 	/*
 	 * In some cases, SwapCache and FUSE(splice_buf->radixtree), the page
@@ -2476,7 +2505,7 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg,
 	 * - a page cache insertion, a swapin fault, or a migration
 	 *   have the page locked
 	 */
-	page->mem_cgroup = memcg;
+	page->mem_cgroup = (unsigned long)memcg;
 
 	if (lrucare)
 		unlock_page_lru(page, isolated);
@@ -2751,12 +2780,12 @@ void __memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg,
 		memcg_uncharge_kmem(memcg, 1 << order);
 		return;
 	}
-	page->mem_cgroup = memcg;
+	page->mem_cgroup = (unsigned long)memcg;
 }
 
 void __memcg_kmem_uncharge_pages(struct page *page, int order)
 {
-	struct mem_cgroup *memcg = page->mem_cgroup;
+	struct mem_cgroup *memcg = page_memcg(page);
 
 	if (!memcg)
 		return;
@@ -2764,7 +2793,7 @@ void __memcg_kmem_uncharge_pages(struct page *page, int order)
 	VM_BUG_ON_PAGE(mem_cgroup_is_root(memcg), page);
 
 	memcg_uncharge_kmem(memcg, 1 << order);
-	page->mem_cgroup = NULL;
+	page->mem_cgroup = 0;
 }
 #endif /* CONFIG_MEMCG_KMEM */
 
@@ -2778,15 +2807,16 @@ void __memcg_kmem_uncharge_pages(struct page *page, int order)
  */
 void mem_cgroup_split_huge_fixup(struct page *head)
 {
+	struct mem_cgroup *memcg = page_memcg(head);
 	int i;
 
 	if (mem_cgroup_disabled())
 		return;
 
 	for (i = 1; i < HPAGE_PMD_NR; i++)
-		head[i].mem_cgroup = head->mem_cgroup;
+		head[i].mem_cgroup = (unsigned long)memcg;
 
-	__this_cpu_sub(head->mem_cgroup->stat->count[MEM_CGROUP_STAT_RSS_HUGE],
+	__this_cpu_sub(memcg->stat->count[MEM_CGROUP_STAT_RSS_HUGE],
 		       HPAGE_PMD_NR);
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
@@ -2834,7 +2864,7 @@ static int mem_cgroup_move_account(struct page *page,
 		goto out;
 
 	ret = -EINVAL;
-	if (page->mem_cgroup != from)
+	if (page_memcg(page) != from)
 		goto out_unlock;
 
 	spin_lock_irqsave(&from->move_lock, flags);
@@ -2860,7 +2890,7 @@ static int mem_cgroup_move_account(struct page *page,
 	 */
 
 	/* caller should have done css_get */
-	page->mem_cgroup = to;
+	page->mem_cgroup = (unsigned long)to;
 	spin_unlock_irqrestore(&from->move_lock, flags);
 
 	ret = 0;
@@ -4838,7 +4868,7 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
 		 * mem_cgroup_move_account() checks the page is valid or
 		 * not under LRU exclusion.
 		 */
-		if (page->mem_cgroup == mc.from) {
+		if (page_memcg(page) == mc.from) {
 			ret = MC_TARGET_PAGE;
 			if (target)
 				target->page = page;
@@ -4872,7 +4902,7 @@ static enum mc_target_type get_mctgt_type_thp(struct vm_area_struct *vma,
 	VM_BUG_ON_PAGE(!page || !PageHead(page), page);
 	if (!move_anon())
 		return ret;
-	if (page->mem_cgroup == mc.from) {
+	if (page_memcg(page) == mc.from) {
 		ret = MC_TARGET_PAGE;
 		if (target) {
 			get_page(page);
@@ -5316,7 +5346,7 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 	if (!do_swap_account)
 		return;
 
-	memcg = page->mem_cgroup;
+	memcg = page_memcg(page);
 
 	/* Readahead page, never charged */
 	if (!memcg)
@@ -5326,7 +5356,7 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 	VM_BUG_ON_PAGE(oldid, page);
 	mem_cgroup_swap_statistics(memcg, true);
 
-	page->mem_cgroup = NULL;
+	page->mem_cgroup = 0;
 
 	if (!mem_cgroup_is_root(memcg))
 		page_counter_uncharge(&memcg->memory, 1);
@@ -5400,7 +5430,7 @@ int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
 		 * the page lock, which serializes swap cache removal, which
 		 * in turn serializes uncharging.
 		 */
-		if (page->mem_cgroup)
+		if (page_memcg_is_set(page))
 			goto out;
 	}
 
@@ -5560,7 +5590,7 @@ static void uncharge_list(struct list_head *page_list)
 		VM_BUG_ON_PAGE(PageLRU(page), page);
 		VM_BUG_ON_PAGE(page_count(page), page);
 
-		if (!page->mem_cgroup)
+		if (!page_memcg_is_set(page))
 			continue;
 
 		/*
@@ -5569,13 +5599,13 @@ static void uncharge_list(struct list_head *page_list)
 		 * exclusive access to the page.
 		 */
 
-		if (memcg != page->mem_cgroup) {
+		if (memcg != page_memcg(page)) {
 			if (memcg) {
 				uncharge_batch(memcg, pgpgout, nr_anon, nr_file,
 					       nr_huge, page);
 				pgpgout = nr_anon = nr_file = nr_huge = 0;
 			}
-			memcg = page->mem_cgroup;
+			memcg = page_memcg(page);
 		}
 
 		if (PageTransHuge(page)) {
@@ -5589,7 +5619,7 @@ static void uncharge_list(struct list_head *page_list)
 		else
 			nr_file += nr_pages;
 
-		page->mem_cgroup = NULL;
+		page->mem_cgroup = 0;
 
 		pgpgout++;
 	} while (next != page_list);
@@ -5612,7 +5642,7 @@ void mem_cgroup_uncharge(struct page *page)
 		return;
 
 	/* Don't touch page->lru of any random page, pre-check: */
-	if (!page->mem_cgroup)
+	if (!page_memcg_is_set(page))
 		return;
 
 	INIT_LIST_HEAD(&page->lru);
@@ -5663,7 +5693,7 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage,
 		return;
 
 	/* Page cache replacement: new page already charged? */
-	if (newpage->mem_cgroup)
+	if (page_memcg_is_set(newpage))
 		return;
 
 	/*
@@ -5672,14 +5702,14 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage,
 	 * uncharged page when the PFN walker finds a page that
 	 * reclaim just put back on the LRU but has not released yet.
 	 */
-	memcg = oldpage->mem_cgroup;
+	memcg = page_memcg(oldpage);
 	if (!memcg)
 		return;
 
 	if (lrucare)
 		lock_page_lru(oldpage, &isolated);
 
-	oldpage->mem_cgroup = NULL;
+	oldpage->mem_cgroup = 0;
 
 	if (lrucare)
 		unlock_page_lru(oldpage, isolated);
-- 
2.1.0
- * [PATCH 04/45] memcg, writeback: implement memcg_blkcg_ptr
From: Tejun Heo @ 2015-01-06 21:25 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, jack, hch, hannes, linux-fsdevel, vgoyal, lizefan,
	cgroups, linux-mm, mhocko, clm, fengguang.wu, david, Tejun Heo

One of the challenges in implementing cgroup writeback support is
determining the blkcg to attribute each page to.  We can add a
per-page pointer pointing to the blkcg of the dirtier similar to
page->mem_cgroup; however, this is quite a bit of overhead for
information which mostly duplicates the existing page->mem_cgroup.

When a page is charged to a memcg, the page is attributed to the
memcg.  Writeback is tied to memory pressure which is determined by
memcg membership, so it makes inherent sense to attribute writeback
of a page to the blkcg which corresponds to the page's memcg.

If we assume that memcg and blkcg are always enabled and disabled
together, this can be trivially implemented by adding a pointer to the
corresponding blkcg to each memcg and using that whenever issuing
writeback IO; however, on the unified hierarchy, the controllers can
be enabled and disabled separately, meaning that the corresponding
blkcg for a given memcg may change dynamically.  To accommodate this,
two reference counted colored pointers are used.  The active pointer
is used whenever dirtying a new page.  When the association changes,
the other pointer is updated to point to the new blkcg and becomes
active while the currently active one becomes inactive and is drained.
This way, each page can point to the blkcg it's associated with using,
theoretically, just a single bit selecting between the two blkcg
pointers of its memcg while still allowing dynamic change of the
corresponding blkcg for the memcg.

In practice, we end up having to use four bits - two separate
associations, each occupying two bits.  Each association takes
two bits because we must always be able to fall back to the root memcg
due to, for example, memory allocation failure, and we need two
separate associations for the dirtied and writeback phases to prevent
pages being repeatedly redirtied while being written from indefinitely
delaying inactive pointer draining.  Refer to comments in
mm/memcontrol.c for more details.

The page to blkcg association established by this patch will be used
as the basis of cgroup writeback support.

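
The two-color idea can be modelled in a few lines of userspace C.  This is a
deliberately simplified sketch and not the code in mm/memcontrol.c: every name
in it is hypothetical, plain counters stand in for the reference counts, there
is no locking, and it models only the theoretical single color bit.  The root
fallback and the separate dirty and writeback associations that push the real
encoding to four bits are left out.

```c
#include <assert.h>
#include <stddef.h>

struct blkcg { const char *name; };

/* Hypothetical model: one memcg holds two "colored" blkcg slots. */
struct memcg_blkcg {
	struct blkcg *ptr[2];	/* one slot per color */
	long nr_pages[2];	/* pages still referencing each slot */
	int active;		/* color handed to newly dirtied pages */
};

/* Dirtying a page records the active color and pins that slot. */
int memcg_blkcg_attach(struct memcg_blkcg *mb)
{
	int color = mb->active;

	mb->nr_pages[color]++;
	return color;
}

/* Resolve a page's color bit back to the blkcg it was dirtied against. */
struct blkcg *memcg_blkcg_lookup(const struct memcg_blkcg *mb, int color)
{
	return mb->ptr[color];
}

/* Drop the reference a page held on its slot. */
void memcg_blkcg_detach(struct memcg_blkcg *mb, int color)
{
	assert(mb->nr_pages[color] > 0);
	mb->nr_pages[color]--;
}

/*
 * Association change: install the new blkcg in the inactive slot and
 * flip the active color.  Returns -1 while the inactive slot still has
 * pages, i.e. it has not drained yet.
 */
int memcg_blkcg_switch(struct memcg_blkcg *mb, struct blkcg *new_blkcg)
{
	int inactive = !mb->active;

	if (mb->nr_pages[inactive] != 0)
		return -1;

	mb->ptr[inactive] = new_blkcg;
	mb->active = inactive;
	return 0;
}
```

Pages dirtied before a switch keep resolving to the old blkcg through their
color, while new pages pick up the new one.  The -1 case is where a fallback
is needed in practice, which is one reason a single bit is not enough.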
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c           |  32 ++++
 include/linux/backing-dev.h |   1 +
 include/linux/memcontrol.h  |  56 ++++++
 mm/filemap.c                |   1 +
 mm/memcontrol.c             | 438 +++++++++++++++++++++++++++++++++++++++++++-
 mm/page-writeback.c         |   6 +-
 mm/truncate.c               |   1 +
 7 files changed, 532 insertions(+), 3 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 97c92b3..138a5ea 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -106,6 +106,36 @@ out_unlock:
 	spin_unlock_bh(&wb->work_lock);
 }
 
+#ifdef CONFIG_CGROUP_WRITEBACK
+
+/**
+ * init_cgwb_dirty_page_context - init cgwb part of dirty_context
+ * @dctx: dirty_context being initialized
+ *
+ * @dctx is being initialized by init_dirty_page_context().  Initialize
+ * cgroup writeback part of it.
+ */
+static void init_cgwb_dirty_page_context(struct dirty_context *dctx)
+{
+	/* cgroup writeback requires support from both the bdi and filesystem */
+	if (!mapping_cgwb_enabled(dctx->mapping))
+		goto force_root;
+
+	page_blkcg_attach_dirty(dctx->page);
+	return;
+
+force_root:
+	page_blkcg_force_root_dirty(dctx->page);
+}
+
+#else	/* CONFIG_CGROUP_WRITEBACK */
+
+static void init_cgwb_dirty_page_context(struct dirty_context *dctx)
+{
+}
+
+#endif	/* CONFIG_CGROUP_WRITEBACK */
+
 /**
  * init_dirty_page_context - init dirty_context for page dirtying
  * @dctx: dirty_context to initialize
@@ -129,6 +159,8 @@ void init_dirty_page_context(struct dirty_context *dctx, struct page *page,
 	dctx->mapping = page_mapping(page);
 
 	BUG_ON(dctx->mapping != mapping);
+
+	init_cgwb_dirty_page_context(dctx);
 }
 EXPORT_SYMBOL_GPL(init_dirty_page_context);
 
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 68c2fd7..7a20cff 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -12,6 +12,7 @@
 #include <linux/fs.h>
 #include <linux/sched.h>
 #include <linux/writeback.h>
+#include <linux/memcontrol.h>
 
 #include <linux/backing-dev-defs.h>
 
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 04d3c20..27dad0b 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -23,6 +23,8 @@
 #include <linux/vm_event_item.h>
 #include <linux/hardirq.h>
 #include <linux/jump_label.h>
+#include <linux/percpu-refcount.h>
+#include <linux/blk-cgroup.h>
 
 struct mem_cgroup;
 struct page;
@@ -536,5 +538,59 @@ static inline void memcg_kmem_put_cache(struct kmem_cache *cachep)
 {
 }
 #endif /* CONFIG_MEMCG_KMEM */
+
+#ifdef CONFIG_CGROUP_WRITEBACK
+
+struct cgroup_subsys_state *page_blkcg_dirty(struct page *page);
+struct cgroup_subsys_state *page_blkcg_wb(struct page *page);
+struct cgroup_subsys_state *page_blkcg_attach_dirty(struct page *page);
+struct cgroup_subsys_state *page_blkcg_attach_wb(struct page *page);
+void page_blkcg_detach_dirty(struct page *page);
+void page_blkcg_detach_wb(struct page *page);
+void page_blkcg_force_root_dirty(struct page *page);
+void page_blkcg_force_root_wb(struct page *page);
+
+#else /* CONFIG_CGROUP_WRITEBACK */
+
+static inline struct cgroup_subsys_state *page_blkcg_dirty(struct page *page)
+{
+	return blkcg_root_css;
+}
+
+static inline struct cgroup_subsys_state *page_blkcg_wb(struct page *page)
+{
+	return blkcg_root_css;
+}
+
+static inline struct cgroup_subsys_state *
+page_blkcg_attach_dirty(struct page *page)
+{
+	return blkcg_root_css;
+}
+
+static inline struct cgroup_subsys_state *
+page_blkcg_attach_wb(struct page *page)
+{
+	return blkcg_root_css;
+}
+
+static inline void page_blkcg_detach_dirty(struct page *page)
+{
+}
+
+static inline void page_blkcg_detach_wb(struct page *page)
+{
+}
+
+static inline void page_blkcg_force_root_dirty(struct page *page)
+{
+}
+
+static inline void page_blkcg_force_root_wb(struct page *page)
+{
+}
+
+#endif /* CONFIG_CGROUP_WRITEBACK */
+
 #endif /* _LINUX_MEMCONTROL_H */
 
diff --git a/mm/filemap.c b/mm/filemap.c
index fdb4288..98a6675 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -212,6 +212,7 @@ void __delete_from_page_cache(struct page *page, void *shadow)
 	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
 		dec_zone_page_state(page, NR_FILE_DIRTY);
 		dec_wb_stat(&mapping->backing_dev_info->wb, WB_RECLAIMABLE);
+		page_blkcg_detach_dirty(page);
 	}
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3ab3f04..aa0812b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -56,6 +56,7 @@
 #include <linux/oom.h>
 #include <linux/lockdep.h>
 #include <linux/file.h>
+#include <linux/blkdev.h>
 #include "internal.h"
 #include <net/sock.h>
 #include <net/ip.h>
@@ -94,7 +95,23 @@ static int really_do_swap_account __initdata;
  * are cleared together with it.
  */
 enum page_cgflags {
+#ifdef CONFIG_CGROUP_WRITEBACK
+	/*
+	 * Flags to associate a page with blkcgs.  There are two
+	 * associations - one for dirtying, the other for writeback.  If
+	 * VALID is clear, the root blkcg is used.  If not, the COLOR bit
+	 * indexes page_memcg(page)->blkcg_ptr[].  The COLOR bits must
+	 * immediately follow the corresponding VALID bits.  See
+	 * memcg_blkcg_ptr implementation for more info.
+	 */
+	PCG_BLKCG_DIRTY_VALID	= 1UL << 0,
+	PCG_BLKCG_DIRTY_COLOR	= 1UL << 1,
+	PCG_BLKCG_WB_VALID	= 1UL << 2,
+	PCG_BLKCG_WB_COLOR	= 1UL << 3,
+	PCG_FLAGS_BITS		= 4,
+#else
 	PCG_FLAGS_BITS		= 0,
+#endif
 	PCG_FLAGS_MASK		= ((1UL << PCG_FLAGS_BITS) - 1),
 };
 
@@ -294,6 +311,15 @@ struct mem_cgroup_event {
 static void mem_cgroup_threshold(struct mem_cgroup *memcg);
 static void mem_cgroup_oom_notify(struct mem_cgroup *memcg);
 
+#ifdef CONFIG_CGROUP_WRITEBACK
+struct memcg_blkcg_ptr {
+	struct cgroup_subsys_state	*css;
+	struct percpu_ref		ref;
+	struct mem_cgroup		*memcg;
+	struct work_struct		release_work;
+};
+#endif
+
 /*
  * The memory controller data structure. The memory controller controls both
  * page cache and RSS per cgroup. We would eventually like to provide
@@ -385,6 +411,11 @@ struct mem_cgroup {
 	atomic_t	numainfo_updating;
 #endif
 
+#ifdef CONFIG_CGROUP_WRITEBACK
+	int				blkcg_color;
+	struct memcg_blkcg_ptr		blkcg_ptr[2];
+#endif
+
 	/* List of events which userspace want to receive */
 	struct list_head event_list;
 	spinlock_t event_list_lock;
@@ -3418,6 +3449,397 @@ static int memcg_update_kmem_limit(struct mem_cgroup *memcg,
 }
 #endif /* CONFIG_MEMCG_KMEM */
 
+#ifdef CONFIG_CGROUP_WRITEBACK
+
+/*
+ * memcg_blkcg_ptr implementation.
+ *
+ * A charged page is associated with a memcg.  When the page gets dirtied
+ * and written back, we want to associate the dirtying and writeback to the
+ * matching blkcg so that the writeback IOs can be attributed to the
+ * originating cgroup and controlled accordingly.
+ *
+ * As adding another per-page pointer to track blkcg is undesirable and the
+ * matching effective blkcg for a given memcg is mostly static, it makes
+ * sense to track a page's associated blkcg through its associated memcg.
+ *
+ * If the relationship between memcg and blkcg were static, this would be
+ * trivial - a single pointer, e.g. page_memcg(page)->blkcg_css, would
+ * suffice; however, the corresponding blkcg may change as controllers are
+ * enabled and disabled.  To accommodate this, pointer coloring is used.
+ *
+ * There are two reference counted pointers and one of the two is the
+ * active one.  When a new page needs to be associated with its blkcg, the
+ * reference count for the active color pointer is incremented and the
+ * blkcg it points to is used.  The page only needs to record the one bit
+ * color to determine the associated blkcg.  If the corresponding blkcg
+ * changes, the other pointer is updated to point to the new blkcg and the
+ * active color is flipped.  The now inactive color pointer is maintained
+ * until all referencing pages are drained.
+ *
+ * Note that this two pointer scheme can only accommodate a single on-going
+ * blkcg change.  If the matching effective blkcg changes again while the
+ * inactive pointer is still being drained from the previous round, the new
+ * update is delayed until the draining is complete.
+ *
+ * A page needs to be associated with its blkcg while it's dirty and being
+ * written back.  The page may be re-dirtied repeatedly while being written
+ * back, which can prevent blkcg pointer draining from making progress as
+ * the reference the page holds may never be put.  To break this possible
+ * live-lock, a page uses two separate pointer colors for dirtying and
+ * writeback where the latter inherits the former on writeback start.
+ * Splitting the writeback association from the dirty one ensures that the
+ * current color is guaranteed to drain after a single writeback cycle as
+ * new dirtying can always take on the current active color.
+ */
+
+static DEFINE_MUTEX(memcg_blkcg_ptr_mutex);
+static DECLARE_WAIT_QUEUE_HEAD(memcg_blkcg_ptr_waitq);
+
+/**
+ * memcg_blkcg_ptr_update_locked - update the memcg blkcg ptr
+ * @memcg: target mem_cgroup
+ *
+ * This function is called when the matching effective blkcg of @memcg may
+ * have changed.  If different from the currently active pointer and the
+ * inactive one is free, this function updates the inactive pointer to
+ * point to the corresponding blkcg, flips the active color and starts
+ * draining the previous one.
+ *
+ * This function should be called under memcg_blkcg_ptr_mutex.
+ */
+static void memcg_blkcg_ptr_update_locked(struct mem_cgroup *memcg)
+{
+	int cur_color = memcg->blkcg_color;
+	int next_color = !cur_color;
+	struct memcg_blkcg_ptr *cur_ptr = &memcg->blkcg_ptr[cur_color];
+	struct memcg_blkcg_ptr *next_ptr = &memcg->blkcg_ptr[next_color];
+	struct cgroup_subsys_state *blkcg_css;
+
+	lockdep_assert_held(&memcg_blkcg_ptr_mutex);
+
+	/*
+	 * Negative cur_color indicates that @memcg is defunct and no more
+	 * pointer update can happen till the previous one is complete.
+	 */
+	if (cur_color < 0 || next_ptr->css)
+		return;
+
+	/* acquire current matching blkcg and see whether update is needed */
+	blkcg_css = cgroup_get_e_css(memcg->css.cgroup, &blkio_cgrp_subsys);
+	if (blkcg_css == cur_ptr->css) {
+		css_put(blkcg_css);
+		return;
+	}
+
+	/* init the next ptr, flip the color and start draining the prev */
+	next_ptr->css = blkcg_css;
+	percpu_ref_reinit(&next_ptr->ref);
+	memcg->blkcg_color = next_color;
+
+	if (cur_ptr->css)
+		percpu_ref_kill(&cur_ptr->ref);
+}
+
+static void memcg_blkcg_ptr_update(struct mem_cgroup *memcg)
+{
+	mutex_lock(&memcg_blkcg_ptr_mutex);
+	memcg_blkcg_ptr_update_locked(memcg);
+	mutex_unlock(&memcg_blkcg_ptr_mutex);
+}
+
+static void memcg_blkcg_ref_release_workfn(struct work_struct *work)
+{
+	struct memcg_blkcg_ptr *ptr =
+		container_of(work, struct memcg_blkcg_ptr, release_work);
+	struct mem_cgroup *memcg = ptr->memcg;
+
+	/* @ptr just finished draining, put the blkcg it was pointing to */
+	css_put(ptr->css);
+
+	/*
+	 * Mark @ptr free and try updating as previous update attempts may
+	 * have been delayed because @ptr was occupied.
+	 */
+	mutex_lock(&memcg_blkcg_ptr_mutex);
+	ptr->css = NULL;
+	memcg_blkcg_ptr_update_locked(memcg);
+	mutex_unlock(&memcg_blkcg_ptr_mutex);
+
+	wake_up_all(&memcg_blkcg_ptr_waitq);
+}
+
+static void memcg_blkcg_ref_release(struct percpu_ref *ref)
+{
+	struct memcg_blkcg_ptr *ptr =
+		container_of(ref, struct memcg_blkcg_ptr, ref);
+
+	schedule_work(&ptr->release_work);
+}
+
+static int memcg_blkcg_ptr_init(struct mem_cgroup *memcg)
+{
+	int i, ret;
+
+	/*
+	 * The first ptr_update always flips the color.  Let's init w/ 1 so
+	 * that we start with 0 after the initial ptr_update.
+	 */
+	memcg->blkcg_color = 1;
+
+	for (i = 0; i < ARRAY_SIZE(memcg->blkcg_ptr); i++) {
+		struct memcg_blkcg_ptr *ptr = &memcg->blkcg_ptr[i];
+
+		/* start dead, ptr_update will reinit the next ptr */
+		ret = percpu_ref_init(&ptr->ref, memcg_blkcg_ref_release,
+				      PERCPU_REF_INIT_DEAD, GFP_KERNEL);
+		if (ret) {
+			while (--i >= 0)
+				percpu_ref_exit(&memcg->blkcg_ptr[i].ref);
+			return ret;
+		}
+
+		ptr->memcg = memcg;
+		INIT_WORK(&ptr->release_work, memcg_blkcg_ref_release_workfn);
+	}
+
+	return 0;
+}
+
+static void memcg_blkcg_ptr_exit(struct mem_cgroup *memcg)
+{
+	int i;
+
+	mutex_lock(&memcg_blkcg_ptr_mutex);
+
+	/* disable further ptr_update */
+	memcg->blkcg_color = -1;
+
+	for (i = 0; i < ARRAY_SIZE(memcg->blkcg_ptr); i++) {
+		struct memcg_blkcg_ptr *ptr = &memcg->blkcg_ptr[i];
+
+		/* force start draining and wait for its completion */
+		if (!percpu_ref_is_dying(&ptr->ref))
+			percpu_ref_kill(&ptr->ref);
+		if (ptr->css) {
+			mutex_unlock(&memcg_blkcg_ptr_mutex);
+			wait_event(memcg_blkcg_ptr_waitq, !ptr->css);
+			mutex_lock(&memcg_blkcg_ptr_mutex);
+		}
+		percpu_ref_exit(&ptr->ref);
+	}
+
+	mutex_unlock(&memcg_blkcg_ptr_mutex);
+}
+
+static __always_inline struct cgroup_subsys_state *
+page_blkcg(struct page *page, unsigned int valid_flag, unsigned int color_flag)
+{
+	struct mem_cgroup *memcg = page_memcg(page);
+	unsigned long memcg_v = page->mem_cgroup;
+	int color;
+
+	if (!(memcg_v & valid_flag))
+		return blkcg_root_css;
+
+	color = (bool)(memcg_v & color_flag);
+	return memcg->blkcg_ptr[color].css;
+}
+
+/**
+ * page_blkcg_dirty - the blkcg a page is associated with for dirtying
+ * @page: page in question
+ */
+struct cgroup_subsys_state *page_blkcg_dirty(struct page *page)
+{
+	return page_blkcg(page, PCG_BLKCG_DIRTY_VALID, PCG_BLKCG_DIRTY_COLOR);
+}
+
+/**
+ * page_blkcg_wb - the blkcg a page is associated with for writeback
+ * @page: page in question
+ */
+struct cgroup_subsys_state *page_blkcg_wb(struct page *page)
+{
+	return page_blkcg(page, PCG_BLKCG_WB_VALID, PCG_BLKCG_WB_COLOR);
+}
+
+static __always_inline void page_cgflags_update(struct page *page,
+						unsigned long cgflags_mask,
+						unsigned long cgflags_target)
+{
+	unsigned long memcg_v = page->mem_cgroup;
+
+	WARN_ON_ONCE(!memcg_v);
+
+	/* dirty the cacheline only when necessary */
+	if ((memcg_v & cgflags_mask) != cgflags_target)
+		page->mem_cgroup = (memcg_v & ~cgflags_mask) | cgflags_target;
+}
+
+/**
+ * page_blkcg_attach_dirty - associate a page with its dirtying blkcg
+ * @page: target page
+ *
+ * This function is to be called when @page is being newly dirtied and
+ * makes the active corresponding blkcg of its memcg its dirty blkcg.  This
+ * blkcg can be retrieved using page_blkcg_dirty().
+ */
+struct cgroup_subsys_state *page_blkcg_attach_dirty(struct page *page)
+{
+	struct mem_cgroup *memcg = page_memcg(page);
+	struct memcg_blkcg_ptr *ptr;
+	int color;
+
+	while (true) {
+		color = memcg->blkcg_color;
+		ptr = &memcg->blkcg_ptr[color];
+		if (ptr->css == blkcg_root_css)
+			goto root_css;
+		if (likely(percpu_ref_tryget(&ptr->ref)))
+			break;
+		cpu_relax();
+	}
+
+	page_cgflags_update(page, PCG_BLKCG_DIRTY_VALID | PCG_BLKCG_DIRTY_COLOR,
+			    PCG_BLKCG_DIRTY_VALID |
+			    (color ? PCG_BLKCG_DIRTY_COLOR : 0));
+	return ptr->css;
+root_css:
+	page_cgflags_update(page, PCG_BLKCG_DIRTY_VALID, 0);
+	return blkcg_root_css;
+}
+
+/**
+ * page_blkcg_attach_wb - associate a page with its writeback blkcg
+ * @page: target page
+ *
+ * This function is to be called when @page is about to be written back and
+ * makes its dirty blkcg its writeback blkcg.  This blkcg can be retrieved
+ * using page_blkcg_wb().
+ */
+struct cgroup_subsys_state *page_blkcg_attach_wb(struct page *page)
+{
+	unsigned long memcg_v = page->mem_cgroup;
+	struct mem_cgroup *memcg = page_memcg(page);
+	struct memcg_blkcg_ptr *ptr;
+	int color;
+
+	if (!(memcg_v & PCG_BLKCG_DIRTY_VALID))
+		goto root_css;
+
+	/*
+	 * Inherit @page's dirty color.  @page's dirty color is already
+	 * detached at this point, and, if the associated memcg's active
+	 * color has flipped, the page may already have been redirtied with
+	 * a different color or the blkcg_ptr released.
+	 *
+	 * If @page has been redirtied to a different blkcg before
+	 * writeback starts on the previous dirty state, the writeback is
+	 * attributed to the new blkcg.  The race window isn't huge and
+	 * charging to the new blkcg isn't strictly wrong as @page got
+	 * redirtied to the new blkcg after all.
+	 *
+	 * If @page's dirty blkcg_ptr already got released, we fall back to
+	 * the root blkcg.  This can only happen if blkcg_ptr's reference
+	 * count reached zero since the put of @page's dirty reference
+	 * making it highly unlikely to happen to more than a few pages.
+	 *
+	 * We may attach some pages to the wrong blkcg across memcg-blkcg
+	 * correspondence change but such changes are rare to begin with
+	 * and the number of pages we may misattribute is pretty limited.
+	 */
+	color = (bool)(memcg_v & PCG_BLKCG_DIRTY_COLOR);
+	ptr = &memcg->blkcg_ptr[color];
+	if (unlikely(!percpu_ref_tryget(&ptr->ref)))
+		goto root_css;
+
+	page_cgflags_update(page, PCG_BLKCG_WB_VALID | PCG_BLKCG_WB_COLOR,
+			    PCG_BLKCG_WB_VALID |
+			    (color ? PCG_BLKCG_WB_COLOR : 0));
+	return ptr->css;
+root_css:
+	page_cgflags_update(page, PCG_BLKCG_WB_VALID, 0);
+	return blkcg_root_css;
+}
+
+static __always_inline void
+page_blkcg_detach(struct page *page, unsigned valid_flag, unsigned color_flag)
+{
+	unsigned long memcg_v = page->mem_cgroup;
+	struct mem_cgroup *memcg = page_memcg(page);
+	int color;
+
+	if (!(memcg_v & valid_flag))
+		return;
+
+	color = (bool)(memcg_v & color_flag);
+	percpu_ref_put(&memcg->blkcg_ptr[color].ref);
+}
+
+/**
+ * page_blkcg_detach_dirty - disassociate a page from its dirty blkcg
+ * @page: target page
+ *
+ * Put @page's current dirty blkcg association.  This function must be
+ * called before @page's dirtiness is cleared.
+ */
+void page_blkcg_detach_dirty(struct page *page)
+{
+	page_blkcg_detach(page, PCG_BLKCG_DIRTY_VALID, PCG_BLKCG_DIRTY_COLOR);
+}
+
+/**
+ * page_blkcg_detach_wb - disassociate a page from its writeback blkcg
+ * @page: target page
+ *
+ * Put @page's current writeback blkcg association.  This function must be
+ * called before @page's writeback is complete.
+ */
+void page_blkcg_detach_wb(struct page *page)
+{
+	page_blkcg_detach(page, PCG_BLKCG_WB_VALID, PCG_BLKCG_WB_COLOR);
+}
+
+/**
+ * page_blkcg_force_root_dirty - force a page's dirty blkcg to be the root one
+ * @page: target page
+ *
+ * The caller must ensure that @page doesn't have dirty blkcg attached.
+ */
+void page_blkcg_force_root_dirty(struct page *page)
+{
+	page_cgflags_update(page, PCG_BLKCG_DIRTY_VALID, 0);
+}
+
+/**
+ * page_blkcg_force_root_wb - force a page's writeback blkcg to be the root one
+ * @page: target page
+ *
+ * The caller must ensure that @page doesn't have writeback blkcg attached.
+ */
+void page_blkcg_force_root_wb(struct page *page)
+{
+	page_cgflags_update(page, PCG_BLKCG_WB_VALID, 0);
+}
+
+#else	/* CONFIG_CGROUP_WRITEBACK */
+
+static void memcg_blkcg_ptr_update(struct mem_cgroup *memcg)
+{
+}
+
+static int memcg_blkcg_ptr_init(struct mem_cgroup *memcg)
+{
+	return 0;
+}
+
+static void memcg_blkcg_ptr_exit(struct mem_cgroup *memcg)
+{
+}
+
+#endif	/* CONFIG_CGROUP_WRITEBACK */
+
 /*
  * The user of this function is...
  * RES_LIMIT.
@@ -4560,6 +4982,10 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 		if (alloc_mem_cgroup_per_zone_info(memcg, node))
 			goto free_out;
 
+	error = memcg_blkcg_ptr_init(memcg);
+	if (error)
+		goto free_out;
+
 	/* root ? */
 	if (parent_css == NULL) {
 		root_mem_cgroup = memcg;
@@ -4599,7 +5025,7 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
 		return -ENOSPC;
 
 	if (!parent)
-		return 0;
+		goto done;
 
 	mutex_lock(&memcg_create_mutex);
 
@@ -4642,7 +5068,8 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
 	 * reading the memcg members.
 	 */
 	smp_store_release(&memcg->initialized, 1);
-
+done:
+	memcg_blkcg_ptr_update(memcg);
 	return 0;
 }
 
@@ -4671,6 +5098,7 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
 
 	memcg_destroy_kmem(memcg);
+	memcg_blkcg_ptr_exit(memcg);
 	__mem_cgroup_free(memcg);
 }
 
@@ -4697,6 +5125,11 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css)
 	memcg->soft_limit = PAGE_COUNTER_MAX;
 }
 
+static void mem_cgroup_css_e_css_changed(struct cgroup_subsys_state *css)
+{
+	memcg_blkcg_ptr_update(mem_cgroup_from_css(css));
+}
+
 #ifdef CONFIG_MMU
 /* Handlers for move charge at task migration. */
 static int mem_cgroup_do_precharge(unsigned long count)
@@ -5288,6 +5721,7 @@ struct cgroup_subsys memory_cgrp_subsys = {
 	.css_offline = mem_cgroup_css_offline,
 	.css_free = mem_cgroup_css_free,
 	.css_reset = mem_cgroup_css_reset,
+	.css_e_css_changed = mem_cgroup_css_e_css_changed,
 	.can_attach = mem_cgroup_can_attach,
 	.cancel_attach = mem_cgroup_cancel_attach,
 	.attach = mem_cgroup_move_task,
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 0e35ff4..72a0edf 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2303,6 +2303,7 @@ int clear_page_dirty_for_io(struct page *page)
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 			dec_wb_stat(&mapping->backing_dev_info->wb,
 				    WB_RECLAIMABLE);
+			page_blkcg_detach_dirty(page);
 			return 1;
 		}
 		return 0;
@@ -2330,6 +2331,7 @@ int test_clear_page_writeback(struct page *page)
 						PAGECACHE_TAG_WRITEBACK);
 			if (bdi_cap_account_writeback(bdi)) {
 				__dec_wb_stat(&bdi->wb, WB_WRITEBACK);
+				page_blkcg_detach_wb(page);
 				__wb_writeout_inc(&bdi->wb);
 			}
 		}
@@ -2363,8 +2365,10 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
 			radix_tree_tag_set(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
-			if (bdi_cap_account_writeback(bdi))
+			if (bdi_cap_account_writeback(bdi)) {
 				__inc_wb_stat(&bdi->wb, WB_WRITEBACK);
+				page_blkcg_attach_wb(page);
+			}
 		}
 		if (!PageDirty(page))
 			radix_tree_tag_clear(&mapping->page_tree,
diff --git a/mm/truncate.c b/mm/truncate.c
index 3fcd662..caae624 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -116,6 +116,7 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
 				    WB_RECLAIMABLE);
 			if (account_size)
 				task_io_account_cancelled_write(account_size);
+			page_blkcg_detach_dirty(page);
 		}
 	}
 }
-- 
2.1.0
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
- * [PATCH 05/45] writeback: make backing_dev_info host cgroup-specific bdi_writebacks
  2015-01-06 21:25 [PATCHSET RFC block/for-next] writeback: cgroup writeback support Tejun Heo
                   ` (3 preceding siblings ...)
  2015-01-06 21:25 ` [PATCH 04/45] memcg, writeback: implement memcg_blkcg_ptr Tejun Heo
@ 2015-01-06 21:25 ` Tejun Heo
  2015-01-06 21:25 ` [PATCH 06/45] writeback, blkcg: associate each blkcg_gq with the corresponding bdi_writeback Tejun Heo
                   ` (40 subsequent siblings)
  45 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2015-01-06 21:25 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, jack, hch, hannes, linux-fsdevel, vgoyal, lizefan,
	cgroups, linux-mm, mhocko, clm, fengguang.wu, david, Tejun Heo
For the planned cgroup writeback support, on each bdi
(backing_dev_info), each cgroup will be served by a separate wb
(bdi_writeback).  This patch updates bdi so that a bdi can host
multiple wbs (bdi_writebacks).
bdi->wb remains unchanged and will keep serving the root cgroup.
cgwb's (cgroup wb's) for non-root cgroups are created on-demand or
looked up during init_cgwb_dirty_page_context() according to the dirty
blkcg of the page being dirtied.  Each cgwb is indexed on
bdi->cgwb_tree by its blkcg id.
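The lookup-then-create retry pattern can be sketched in user space
roughly as follows.  This is only a model under simplifying
assumptions: a small fixed array replaces the radix tree, there is no
locking or RCU, and struct model_bdi, struct model_wb, the fail_alloc
test hook and the model_*() helpers are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

#define MODEL_MAX_ID 8

struct model_wb {
	int blkcg_id;
};

struct model_bdi {
	struct model_wb root;			/* serves the root blkcg, id 0 */
	struct model_wb *cgwb[MODEL_MAX_ID];	/* stand-in for cgwb_tree */
	int fail_alloc;				/* hook to simulate -ENOMEM */
};

/* Root always maps to the embedded wb; others are looked up by id. */
static struct model_wb *model_lookup(struct model_bdi *bdi, int id)
{
	if (id == 0)
		return &bdi->root;
	if (id < 0 || id >= MODEL_MAX_ID)
		return NULL;
	return bdi->cgwb[id];
}

/* Returns 0 if an entry for @id exists afterwards, -errno otherwise. */
static int model_create(struct model_bdi *bdi, int id)
{
	struct model_wb *wb;

	if (id <= 0 || id >= MODEL_MAX_ID)
		return -EINVAL;
	if (bdi->fail_alloc)
		return -ENOMEM;

	wb = calloc(1, sizeof(*wb));
	if (!wb)
		return -ENOMEM;
	wb->blkcg_id = id;

	if (bdi->cgwb[id]) {
		/* someone else installed one first; let the caller re-lookup */
		free(wb);
		return 0;
	}
	bdi->cgwb[id] = wb;
	return 0;
}

/* Keep looking up until found, giving up only when creation fails. */
static struct model_wb *model_lookup_create(struct model_bdi *bdi, int id)
{
	struct model_wb *wb;

	do {
		wb = model_lookup(bdi, id);
		if (wb)
			return wb;
	} while (!model_create(bdi, id));

	return NULL;
}
```

A NULL return is what lets the caller fall back to the root wb, which
is what init_cgwb_dirty_page_context() does in this patch.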
Once dirty_context is initialized for a page, the page's wb can be
looked up using page_cgwb_{dirty|wb}() while the page is dirty or
under writeback respectively.  Once created, a cgwb is destroyed iff
either its associated bdi or blkcg is destroyed, meaning that as long
as a page is dirty or under writeback, its associated cgwb is
accessible without further locking.
dirty_context grew a new field ->wb which caches the selected wb and
account_page_dirtied() is updated to use that instead of
unconditionally using bdi->wb.
Currently, none of the filesystems has FS_CGROUP_WRITEBACK and all
pages will keep being associated with bdi->wb.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
---
 block/blk-cgroup.c               |  11 ++-
 fs/fs-writeback.c                |  19 +++-
 include/linux/backing-dev-defs.h |  17 +++-
 include/linux/backing-dev.h      | 123 +++++++++++++++++++++++++
 include/linux/blk-cgroup.h       |   4 +
 mm/backing-dev.c                 | 189 +++++++++++++++++++++++++++++++++++++++
 mm/page-writeback.c              |   4 +-
 7 files changed, 361 insertions(+), 6 deletions(-)
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 9e0fe38..8bebaa9 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -15,6 +15,7 @@
 #include <linux/module.h>
 #include <linux/err.h>
 #include <linux/blkdev.h>
+#include <linux/backing-dev.h>
 #include <linux/slab.h>
 #include <linux/genhd.h>
 #include <linux/delay.h>
@@ -813,6 +814,11 @@ static void blkcg_css_offline(struct cgroup_subsys_state *css)
 	spin_unlock_irq(&blkcg->lock);
 }
 
+static void blkcg_css_released(struct cgroup_subsys_state *css)
+{
+	cgwb_blkcg_released(css);
+}
+
 static void blkcg_css_free(struct cgroup_subsys_state *css)
 {
 	struct blkcg *blkcg = css_to_blkcg(css);
@@ -841,7 +847,9 @@ done:
 	spin_lock_init(&blkcg->lock);
 	INIT_RADIX_TREE(&blkcg->blkg_tree, GFP_ATOMIC);
 	INIT_HLIST_HEAD(&blkcg->blkg_list);
-
+#ifdef CONFIG_CGROUP_WRITEBACK
+	INIT_LIST_HEAD(&blkcg->cgwb_list);
+#endif
 	return &blkcg->css;
 }
 
@@ -926,6 +934,7 @@ static int blkcg_can_attach(struct cgroup_subsys_state *css,
 struct cgroup_subsys blkio_cgrp_subsys = {
 	.css_alloc = blkcg_css_alloc,
 	.css_offline = blkcg_css_offline,
+	.css_released = blkcg_css_released,
 	.css_free = blkcg_css_free,
 	.can_attach = blkcg_can_attach,
 	.legacy_cftypes = blkcg_files,
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 138a5ea..3b54835 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -117,21 +117,37 @@ out_unlock:
  */
 static void init_cgwb_dirty_page_context(struct dirty_context *dctx)
 {
+	struct backing_dev_info *bdi = dctx->mapping->backing_dev_info;
+	struct cgroup_subsys_state *blkcg_css;
+
 	/* cgroup writeback requires support from both the bdi and filesystem */
 	if (!mapping_cgwb_enabled(dctx->mapping))
 		goto force_root;
 
-	page_blkcg_attach_dirty(dctx->page);
+	/*
+	 * @dctx->page is a candidate for cgroup writeback and about to be
+	 * dirtied.  Attach the dirty blkcg to the page and pre-allocate
+	 * all resources necessary for cgroup writeback.  On failure, fall
+	 * back to the root blkcg.
+	 */
+	blkcg_css = page_blkcg_attach_dirty(dctx->page);
+	dctx->wb = cgwb_lookup_create(bdi, blkcg_css);
+	if (!dctx->wb) {
+		page_blkcg_detach_dirty(dctx->page);
+		goto force_root;
+	}
 	return;
 
 force_root:
 	page_blkcg_force_root_dirty(dctx->page);
+	dctx->wb = &bdi->wb;
 }
 
 #else	/* CONFIG_CGROUP_WRITEBACK */
 
 static void init_cgwb_dirty_page_context(struct dirty_context *dctx)
 {
+	dctx->wb = &dctx->mapping->backing_dev_info->wb;
 }
 
 #endif	/* CONFIG_CGROUP_WRITEBACK */
@@ -176,6 +192,7 @@ void init_dirty_inode_context(struct dirty_context *dctx, struct inode *inode)
 {
 	memset(dctx, 0, sizeof(*dctx));
 	dctx->inode = inode;
+	dctx->wb = &inode_to_bdi(inode)->wb;
 }
 
 static void __wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index bf20ef1..511066f 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -2,6 +2,7 @@
 #define __LINUX_BACKING_DEV_DEFS_H
 
 #include <linux/list.h>
+#include <linux/radix-tree.h>
 #include <linux/spinlock.h>
 #include <linux/percpu_counter.h>
 #include <linux/flex_proportions.h>
@@ -68,6 +69,15 @@ struct bdi_writeback {
 	spinlock_t work_lock;		/* protects work_list & dwork scheduling */
 	struct list_head work_list;
 	struct delayed_work dwork;	/* work item used for writeback */
+
+#ifdef CONFIG_CGROUP_WRITEBACK
+	struct cgroup_subsys_state *blkcg_css; /* the blkcg we belong to */
+	struct list_head blkcg_node;	/* anchored at blkcg->cgwb_list */
+	union {
+		struct list_head shutdown_node;
+		struct rcu_head rcu;
+	};
+#endif
 };
 
 struct backing_dev_info {
@@ -82,8 +92,10 @@ struct backing_dev_info {
 	unsigned int min_ratio;
 	unsigned int max_ratio, max_prop_frac;
 
-	struct bdi_writeback wb;  /* default writeback info for this bdi */
-
+	struct bdi_writeback wb; /* the root writeback info for this bdi */
+#ifdef CONFIG_CGROUP_WRITEBACK
+	struct radix_tree_root cgwb_tree; /* radix tree of !root cgroup wbs */
+#endif
 	struct device *dev;
 
 	struct timer_list laptop_mode_wb_timer;
@@ -102,6 +114,7 @@ struct dirty_context {
 	struct page		*page;
 	struct inode		*inode;
 	struct address_space	*mapping;
+	struct bdi_writeback	*wb;
 };
 
 enum {
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 7a20cff..3722796 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -13,6 +13,7 @@
 #include <linux/sched.h>
 #include <linux/writeback.h>
 #include <linux/memcontrol.h>
+#include <linux/blk-cgroup.h>
 
 #include <linux/backing-dev-defs.h>
 
@@ -273,6 +274,10 @@ void init_dirty_inode_context(struct dirty_context *dctx, struct inode *inode);
 
 #ifdef CONFIG_CGROUP_WRITEBACK
 
+void cgwb_blkcg_released(struct cgroup_subsys_state *blkcg_css);
+int __cgwb_create(struct backing_dev_info *bdi,
+		  struct cgroup_subsys_state *blkcg_css);
+
 /**
  * mapping_cgwb_enabled - test whether cgroup writeback is enabled on a mapping
  * @mapping: address_space of interest
@@ -290,6 +295,97 @@ static inline bool mapping_cgwb_enabled(struct address_space *mapping)
 		inode && (inode->i_sb->s_type->fs_flags & FS_CGROUP_WRITEBACK);
 }
 
+/**
+ * cgwb_lookup - lookup cgwb for a given blkcg on a bdi
+ * @bdi: target bdi
+ * @blkcg_css: target blkcg
+ *
+ * Look up the cgwb (cgroup bdi_writeback) for @blkcg_css on @bdi.  The
+ * returned cgwb is accessible as long as @bdi and @blkcg_css stay alive.
+ *
+ * Returns the pointer to the found cgwb on success, NULL on failure.
+ */
+static inline struct bdi_writeback *
+cgwb_lookup(struct backing_dev_info *bdi, struct cgroup_subsys_state *blkcg_css)
+{
+	struct bdi_writeback *cgwb;
+
+	if (blkcg_css == blkcg_root_css)
+		return &bdi->wb;
+
+	/*
+	 * RCU locking protects the radix tree itself.  The looked up cgwb
+	 * is protected by the caller ensuring that @bdi and the blkcg w/
+	 * @blkcg_css are alive.
+	 */
+	rcu_read_lock();
+	cgwb = radix_tree_lookup(&bdi->cgwb_tree, blkcg_css->id);
+	rcu_read_unlock();
+	return cgwb;
+}
+
+/**
+ * cgwb_lookup_create - try to lookup cgwb and create one if not found
+ * @bdi: target bdi
+ * @blkcg_css: cgroup_subsys_state of the target blkcg
+ *
+ * Try to look up the cgwb (cgroup bdi_writeback) for the blkcg with
+ * @blkcg_css on @bdi.  If it doesn't exist, try to create one.  This
+ * function can be called under any context without locking as long as @bdi
+ * and @blkcg_css are kept alive.  See cgwb_lookup() for details.
+ *
+ * Returns the pointer to the found cgwb on success, NULL if such cgwb
+ * doesn't exist and creation failed due to memory pressure.
+ */
+static inline struct bdi_writeback *
+cgwb_lookup_create(struct backing_dev_info *bdi,
+		   struct cgroup_subsys_state *blkcg_css)
+{
+	struct bdi_writeback *wb;
+
+	do {
+		wb = cgwb_lookup(bdi, blkcg_css);
+		if (wb)
+			return wb;
+	} while (!__cgwb_create(bdi, blkcg_css));
+
+	return NULL;
+}
+
+/**
+ * page_cgwb_dirty - lookup the dirty cgwb of a page
+ * @page: target page
+ *
+ * Returns the dirty cgwb (cgroup bdi_writeback) of @page.  The returned
+ * wb is accessible as long as @page is dirty.
+ */
+static inline struct bdi_writeback *page_cgwb_dirty(struct page *page)
+{
+	struct backing_dev_info *bdi = page->mapping->backing_dev_info;
+	struct bdi_writeback *wb = cgwb_lookup(bdi, page_blkcg_dirty(page));
+
+	if (WARN_ON_ONCE(!wb))
+		return &bdi->wb;
+	return wb;
+}
+
+/**
+ * page_cgwb_wb - lookup the writeback cgwb of a page
+ * @page: target page
+ *
+ * Returns the writeback cgwb (cgroup bdi_writeback) of @page.  The
+ * returned wb is accessible as long as @page is under writeback.
+ */
+static inline struct bdi_writeback *page_cgwb_wb(struct page *page)
+{
+	struct backing_dev_info *bdi = page->mapping->backing_dev_info;
+	struct bdi_writeback *wb = cgwb_lookup(bdi, page_blkcg_wb(page));
+
+	if (WARN_ON_ONCE(!wb))
+		return &bdi->wb;
+	return wb;
+}
+
 #else	/* CONFIG_CGROUP_WRITEBACK */
 
 static inline bool mapping_cgwb_enabled(struct address_space *mapping)
@@ -297,6 +393,33 @@ static inline bool mapping_cgwb_enabled(struct address_space *mapping)
 	return false;
 }
 
+static inline void cgwb_blkcg_released(struct cgroup_subsys_state *blkcg_css)
+{
+}
+
+static inline struct bdi_writeback *
+cgwb_lookup(struct backing_dev_info *bdi, struct cgroup_subsys_state *blkcg_css)
+{
+	return &bdi->wb;
+}
+
+static inline struct bdi_writeback *
+cgwb_lookup_create(struct backing_dev_info *bdi,
+		   struct cgroup_subsys_state *blkcg_css)
+{
+	return &bdi->wb;
+}
+
+static inline struct bdi_writeback *page_cgwb_dirty(struct page *page)
+{
+	return &page->mapping->backing_dev_info->wb;
+}
+
+static inline struct bdi_writeback *page_cgwb_wb(struct page *page)
+{
+	return &page->mapping->backing_dev_info->wb;
+}
+
 #endif	/* CONFIG_CGROUP_WRITEBACK */
 
 #endif		/* _LINUX_BACKING_DEV_H */
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 4dc643f..3033eb1 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -53,6 +53,10 @@ struct blkcg {
 	/* TODO: per-policy storage in blkcg */
 	unsigned int			cfq_weight;	/* belongs to cfq */
 	unsigned int			cfq_leaf_weight;
+
+#ifdef CONFIG_CGROUP_WRITEBACK
+	struct list_head		cgwb_list;
+#endif
 };
 
 struct blkg_stat {
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 1c9b70e..c6dda82 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -440,6 +440,192 @@ static void wb_exit(struct bdi_writeback *wb)
 	fprop_local_destroy_percpu(&wb->completions);
 }
 
+#ifdef CONFIG_CGROUP_WRITEBACK
+
+/*
+ * cgwb_lock protects bdi->cgwb_tree and blkcg->cgwb_list where the former
+ * is also RCU protected.  cgwb_shutdown_mutex synchronizes shutdown
+ * attempts from bdi and blkcg destructions.  For details, see
+ * cgwb_shutdown_prepare/commit().
+ */
+static DEFINE_SPINLOCK(cgwb_lock);
+static DEFINE_MUTEX(cgwb_shutdown_mutex);
+
+int __cgwb_create(struct backing_dev_info *bdi,
+		  struct cgroup_subsys_state *blkcg_css)
+{
+	struct blkcg *blkcg = css_to_blkcg(blkcg_css);
+	struct bdi_writeback *wb;
+	unsigned long flags;
+	int ret;
+
+	wb = kzalloc(sizeof(*wb), GFP_ATOMIC);
+	if (!wb)
+		return -ENOMEM;
+
+	ret = wb_init(wb, bdi, GFP_ATOMIC);
+	if (ret) {
+		kfree(wb);
+		return -ENOMEM;
+	}
+
+	wb->blkcg_css = blkcg_css;
+	set_bit(WB_registered, &wb->state); /* cgwbs are always registered */
+
+	ret = -ENODEV;
+	spin_lock_irqsave(&cgwb_lock, flags);
+	/* the root wb determines the registered state of the whole bdi */
+	if (test_bit(WB_registered, &bdi->wb.state)) {
+		/* we might have raced w/ another instance of this function */
+		ret = radix_tree_insert(&bdi->cgwb_tree, blkcg_css->id, wb);
+		if (!ret)
+			list_add_tail(&wb->blkcg_node, &blkcg->cgwb_list);
+	}
+	spin_unlock_irqrestore(&cgwb_lock, flags);
+	if (ret) {
+		wb_exit(wb);
+		if (ret != -EEXIST)
+			return ret;
+	}
+	return 0;
+}
+
+/**
+ * cgwb_shutdown_prepare - prepare to shutdown a cgwb
+ * @wb: cgwb to be shutdown
+ * @to_shutdown: list to queue @wb on
+ *
+ * This function is called to queue @wb for shutdown on @to_shutdown.  The
+ * bdi_writeback indexes use the cgwb_lock spinlock but wb_shutdown() needs
+ * process context, so this function can be called while holding cgwb_lock
+ * and cgwb_shutdown_mutex to queue cgwbs for shutdown.  Once all target
+ * cgwbs are queued, the caller should release cgwb_lock and invoke
+ * cgwb_shutdown_commit().
+ */
+static void cgwb_shutdown_prepare(struct bdi_writeback *wb,
+				  struct list_head *to_shutdown)
+{
+	lockdep_assert_held(&cgwb_lock);
+	lockdep_assert_held(&cgwb_shutdown_mutex);
+
+	WARN_ON(!test_bit(WB_registered, &wb->state));
+	clear_bit(WB_registered, &wb->state);
+	list_add_tail(&wb->shutdown_node, to_shutdown);
+}
+
+/**
+ * cgwb_shutdown_commit - commit cgwb shutdowns
+ * @to_shutdown: list of cgwbs to shutdown
+ *
+ * This function is called after @to_shutdown is built by calls to
+ * cgwb_shutdown_prepare() and cgwb_lock is released.  It invokes
+ * wb_shutdown() on all cgwbs on the list.  bdi and blkcg may try to
+ * shutdown the same cgwbs and should wait till completion if shutdown is
+ * initiated by the other.  This synchronization is achieved through
+ * cgwb_shutdown_mutex which should have been acquired before the
+ * cgwb_shutdown_prepare() invocations.
+ */
+static void cgwb_shutdown_commit(struct list_head *to_shutdown)
+{
+	struct bdi_writeback *wb;
+
+	lockdep_assert_held(&cgwb_shutdown_mutex);
+
+	list_for_each_entry(wb, to_shutdown, shutdown_node)
+		wb_shutdown(wb);
+}
+
+static void cgwb_exit(struct bdi_writeback *wb)
+{
+	WARN_ON(!radix_tree_delete(&wb->bdi->cgwb_tree, wb->blkcg_css->id));
+	list_del(&wb->blkcg_node);
+	wb_exit(wb);
+	kfree_rcu(wb, rcu);
+}
+
+static void cgwb_bdi_init(struct backing_dev_info *bdi)
+{
+	bdi->wb.blkcg_css = blkcg_root_css;
+	INIT_RADIX_TREE(&bdi->cgwb_tree, GFP_ATOMIC);
+}
+
+/**
+ * cgwb_bdi_shutdown - @bdi is being shut down, shut down all cgwbs
+ * @bdi: bdi being shut down
+ */
+static void cgwb_bdi_shutdown(struct backing_dev_info *bdi)
+{
+	LIST_HEAD(to_shutdown);
+	struct radix_tree_iter iter;
+	void **slot;
+
+	WARN_ON(test_bit(WB_registered, &bdi->wb.state));
+
+	mutex_lock(&cgwb_shutdown_mutex);
+	spin_lock_irq(&cgwb_lock);
+
+	radix_tree_for_each_slot(slot, &bdi->cgwb_tree, &iter, 0)
+		cgwb_shutdown_prepare(*slot, &to_shutdown);
+
+	spin_unlock_irq(&cgwb_lock);
+	cgwb_shutdown_commit(&to_shutdown);
+	mutex_unlock(&cgwb_shutdown_mutex);
+}
+
+/**
+ * cgwb_bdi_exit - @bdi is exiting, exit all its cgwbs
+ * @bdi: bdi being exited
+ */
+static void cgwb_bdi_exit(struct backing_dev_info *bdi)
+{
+	LIST_HEAD(to_free);
+	struct radix_tree_iter iter;
+	void **slot;
+
+	spin_lock_irq(&cgwb_lock);
+	radix_tree_for_each_slot(slot, &bdi->cgwb_tree, &iter, 0) {
+		struct bdi_writeback *wb = *slot;
+
+		WARN_ON(test_bit(WB_registered, &wb->state));
+		cgwb_exit(wb);
+	}
+	spin_unlock_irq(&cgwb_lock);
+}
+
+/**
+ * cgwb_blkcg_released - a blkcg is being destroyed, release all matching cgwbs
+ * @blkcg_css: blkcg being destroyed
+ */
+void cgwb_blkcg_released(struct cgroup_subsys_state *blkcg_css)
+{
+	LIST_HEAD(to_shutdown);
+	struct blkcg *blkcg = css_to_blkcg(blkcg_css);
+	struct bdi_writeback *wb, *next;
+
+	mutex_lock(&cgwb_shutdown_mutex);
+	spin_lock_irq(&cgwb_lock);
+
+	list_for_each_entry_safe(wb, next, &blkcg->cgwb_list, blkcg_node)
+		cgwb_shutdown_prepare(wb, &to_shutdown);
+
+	spin_unlock_irq(&cgwb_lock);
+	cgwb_shutdown_commit(&to_shutdown);
+	mutex_unlock(&cgwb_shutdown_mutex);
+
+	spin_lock_irq(&cgwb_lock);
+	list_for_each_entry_safe(wb, next, &blkcg->cgwb_list, blkcg_node)
+		cgwb_exit(wb);
+	spin_unlock_irq(&cgwb_lock);
+}
+
+#else	/* CONFIG_CGROUP_WRITEBACK */
+
+static void cgwb_bdi_init(struct backing_dev_info *bdi) { }
+static void cgwb_bdi_shutdown(struct backing_dev_info *bdi) { }
+static void cgwb_bdi_exit(struct backing_dev_info *bdi) { }
+
+#endif	/* CONFIG_CGROUP_WRITEBACK */
+
 int bdi_init(struct backing_dev_info *bdi)
 {
 	int err;
@@ -455,6 +641,7 @@ int bdi_init(struct backing_dev_info *bdi)
 	if (err)
 		return err;
 
+	cgwb_bdi_init(bdi);
 	return 0;
 }
 EXPORT_SYMBOL(bdi_init);
@@ -532,6 +719,7 @@ void bdi_unregister(struct backing_dev_info *bdi)
 			/* make sure nobody finds us on the bdi_list anymore */
 			bdi_remove_from_list(bdi);
 			wb_shutdown(&bdi->wb);
+			cgwb_bdi_shutdown(bdi);
 		}
 
 		bdi_debug_unregister(bdi);
@@ -544,6 +732,7 @@ EXPORT_SYMBOL(bdi_unregister);
 void bdi_destroy(struct backing_dev_info *bdi)
 {
 	bdi_unregister(bdi);
+	cgwb_bdi_exit(bdi);
 	wb_exit(&bdi->wb);
 }
 EXPORT_SYMBOL(bdi_destroy);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 72a0edf..6475504 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2102,8 +2102,8 @@ void account_page_dirtied(struct dirty_context *dctx)
 
 	__inc_zone_page_state(page, NR_FILE_DIRTY);
 	__inc_zone_page_state(page, NR_DIRTIED);
-	__inc_wb_stat(&mapping->backing_dev_info->wb, WB_RECLAIMABLE);
-	__inc_wb_stat(&mapping->backing_dev_info->wb, WB_DIRTIED);
+	__inc_wb_stat(dctx->wb, WB_RECLAIMABLE);
+	__inc_wb_stat(dctx->wb, WB_DIRTIED);
 	task_io_account_write(PAGE_CACHE_SIZE);
 	current->nr_dirtied++;
 	this_cpu_inc(bdp_ratelimits);
-- 
2.1.0
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
- * [PATCH 06/45] writeback, blkcg: associate each blkcg_gq with the corresponding bdi_writeback
  2015-01-06 21:25 [PATCHSET RFC block/for-next] writeback: cgroup writeback support Tejun Heo
                   ` (4 preceding siblings ...)
  2015-01-06 21:25 ` [PATCH 05/45] writeback: make backing_dev_info host cgroup-specific bdi_writebacks Tejun Heo
@ 2015-01-06 21:25 ` Tejun Heo
  2015-01-06 21:25 ` [PATCH 07/45] writeback: attribute stats to the matching per-cgroup bdi_writeback Tejun Heo
                   ` (39 subsequent siblings)
  45 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2015-01-06 21:25 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, jack, hch, hannes, linux-fsdevel, vgoyal, lizefan,
	cgroups, linux-mm, mhocko, clm, fengguang.wu, david, Tejun Heo
A blkg (blkcg_gq) can be congested and decongested independently from
other blkgs on the same request_queue.  Accordingly, for cgroup
writeback support, the congestion status at bdi (backing_dev_info)
should be split to per-cgroup wb's (bdi_writeback's) and updated
separately from matching blkg's.
This patch prepares by adding blkg->wb and associating a blkg with its
matching per-cgroup wb on creation.
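
The association step can be illustrated with a small userspace model.
This is only a sketch under stated assumptions: the *_model types, the
fixed-size array standing in for bdi->cgwb_tree, and MAX_CSS are all
hypothetical, and no locking or RCU is modelled.  It shows the
lookup-then-create retry loop and a blkg recording the resulting wb.

```c
#include <stddef.h>
#include <stdlib.h>

#define MAX_CSS 8	/* hypothetical cap on cgroup ids in this model */

struct wb_model { int css_id; };

struct bdi_model {
	struct wb_model *tree[MAX_CSS];	/* stands in for bdi->cgwb_tree */
};

struct blkg_model {
	int css_id;
	struct wb_model *wb;		/* stands in for blkg->wb */
};

/* Find the wb for @css_id (0 <= css_id < MAX_CSS), NULL if not created yet. */
struct wb_model *wb_lookup(struct bdi_model *bdi, int css_id)
{
	return bdi->tree[css_id];
}

/*
 * Try to create a wb.  Returns 0 on success or if a wb already exists
 * (the case where the kernel would see -EEXIST), -1 if allocation fails.
 */
int wb_create(struct bdi_model *bdi, int css_id)
{
	struct wb_model *wb = calloc(1, sizeof(*wb));

	if (!wb)
		return -1;
	if (bdi->tree[css_id]) {
		free(wb);		/* somebody else won, keep theirs */
		return 0;
	}
	wb->css_id = css_id;
	bdi->tree[css_id] = wb;
	return 0;
}

/* Look up, create on miss, and look up again until one of them settles. */
struct wb_model *wb_lookup_create(struct bdi_model *bdi, int css_id)
{
	struct wb_model *wb;

	do {
		wb = wb_lookup(bdi, css_id);
		if (wb)
			return wb;
	} while (!wb_create(bdi, css_id));

	return NULL;
}

/* Resolve the wb first, then record it in the blkg; -1 if that fails. */
int blkg_init(struct blkg_model *blkg, struct bdi_model *bdi, int css_id)
{
	struct wb_model *wb = wb_lookup_create(bdi, css_id);

	if (!wb)
		return -1;
	blkg->css_id = css_id;
	blkg->wb = wb;
	return 0;
}
```

In this model, two blkgs initialized with the same css_id end up
pointing at the same wb, which is the property the congestion
propagation in later patches relies on.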
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-cgroup.c         | 15 +++++++++++++++
 include/linux/blk-cgroup.h |  6 ++++++
 2 files changed, 21 insertions(+)
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 8bebaa9..6fe085c 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -182,6 +182,7 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
 				    struct blkcg_gq *new_blkg)
 {
 	struct blkcg_gq *blkg;
+	struct bdi_writeback *wb;
 	int i, ret;
 
 	WARN_ON_ONCE(!rcu_read_lock_held());
@@ -193,6 +194,19 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
 		goto err_free_blkg;
 	}
 
+	/*
+	 * Once created, @wb will stay alive longer than @blkg.  @wb is
+	 * destroyed iff either its bdi or @blkcg is destroyed.  The bdi is
+	 * part of the request_queue and will outlive @blkg, and, while
+	 * @blkcg is being brought down, @wb will be destroyed last in
+	 * ->css_released().
+	 */
+	wb = cgwb_lookup_create(&q->backing_dev_info, &blkcg->css);
+	if (!wb) {
+		ret = -ENOMEM;
+		goto err_free_blkg;
+	}
+
 	/* allocate */
 	if (!new_blkg) {
 		new_blkg = blkg_alloc(blkcg, q, GFP_ATOMIC);
@@ -202,6 +216,7 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
 		}
 	}
 	blkg = new_blkg;
+	blkg->wb = wb;
 
 	/* link parent */
 	if (blkcg_parent(blkcg)) {
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 3033eb1..97ceee3 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -99,6 +99,12 @@ struct blkcg_gq {
 	struct hlist_node		blkcg_node;
 	struct blkcg			*blkcg;
 
+	/*
+	 * Each blkg gets congested separately and the congestion state is
+	 * propagated to the matching cgroup wb.
+	 */
+	struct bdi_writeback		*wb;
+
 	/* all non-root blkcg_gq's are guaranteed to have access to parent */
 	struct blkcg_gq			*parent;
 
-- 
2.1.0
- * [PATCH 07/45] writeback: attribute stats to the matching per-cgroup bdi_writeback
  2015-01-06 21:25 [PATCHSET RFC block/for-next] writeback: cgroup writeback support Tejun Heo
                   ` (5 preceding siblings ...)
  2015-01-06 21:25 ` [PATCH 06/45] writeback, blkcg: associate each blkcg_gq with the corresponding bdi_writeback Tejun Heo
@ 2015-01-06 21:25 ` Tejun Heo
  2015-01-06 21:25 ` [PATCH 08/45] writeback: let balance_dirty_pages() work on the matching cgroup bdi_writeback Tejun Heo
                   ` (38 subsequent siblings)
  45 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2015-01-06 21:25 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, jack, hch, hannes, linux-fsdevel, vgoyal, lizefan,
	cgroups, linux-mm, mhocko, clm, fengguang.wu, david, Tejun Heo
Until now, all WB_* stats were accounted against the root wb
(bdi_writeback).  Now that multiple wb (bdi_writeback) support is in
place, let's attribute the stats to the respective per-cgroup wb's.
WB_RECLAIMABLE and WB_DIRTIED are attributed to the page's dirty cgwb
(per-cgroup wb) and WB_WRITEBACK to writeback cgwb.
__test_set_page_writeback() is updated so that dirty cgwb association
takes place before WB_WRITEBACK increment so that the latter can make
use of the association.
As no filesystem has FS_CGROUP_WRITEBACK yet, this doesn't lead to
visible behavior differences.
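
The accounting rule can be sketched in userspace as follows.  This is a
simplified model, not the kernel code: the *_model types, the STAT_*
names, and passing the wb explicitly to the attach steps are assumptions
made for illustration.  It shows each stat moving against the wb the
page is associated with, and the writeback association being made before
the WB_WRITEBACK increment.

```c
#include <stddef.h>

enum { STAT_RECLAIMABLE, STAT_DIRTIED, STAT_WRITEBACK, NR_STATS };

struct wb_model { long stat[NR_STATS]; };

struct page_model {
	struct wb_model *dirty_wb;	/* valid while the page is dirty */
	struct wb_model *wb_wb;		/* valid while under writeback */
};

/* Dirtying: associate the dirty wb and charge both dirty-side stats. */
void page_set_dirty(struct page_model *page, struct wb_model *wb)
{
	page->dirty_wb = wb;
	wb->stat[STAT_RECLAIMABLE]++;
	wb->stat[STAT_DIRTIED]++;
}

/* Cleaning for IO: uncharge the dirty wb, then drop the association. */
void page_clear_dirty_for_io(struct page_model *page)
{
	page->dirty_wb->stat[STAT_RECLAIMABLE]--;
	page->dirty_wb = NULL;
}

/* Writeback start: attach first so that the increment can find the wb. */
void page_start_writeback(struct page_model *page, struct wb_model *wb)
{
	page->wb_wb = wb;
	page->wb_wb->stat[STAT_WRITEBACK]++;
}

/* Writeback end: uncharge the writeback wb, then detach. */
void page_end_writeback(struct page_model *page)
{
	page->wb_wb->stat[STAT_WRITEBACK]--;
	page->wb_wb = NULL;
}
```

If the increment in page_start_writeback() ran before the attach, the
lookup would have nothing to resolve, which is why the order of the two
statements is swapped in __test_set_page_writeback().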
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
---
 mm/filemap.c        |  2 +-
 mm/page-writeback.c | 18 ++++++++++++------
 mm/truncate.c       |  3 +--
 3 files changed, 14 insertions(+), 9 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index 98a6675..faa577d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -211,7 +211,7 @@ void __delete_from_page_cache(struct page *page, void *shadow)
 	 */
 	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
 		dec_zone_page_state(page, NR_FILE_DIRTY);
-		dec_wb_stat(&mapping->backing_dev_info->wb, WB_RECLAIMABLE);
+		dec_wb_stat(page_cgwb_dirty(page), WB_RECLAIMABLE);
 		page_blkcg_detach_dirty(page);
 	}
 }
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 6475504..d1fea3a 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2159,10 +2159,13 @@ EXPORT_SYMBOL(__set_page_dirty_nobuffers);
 void account_page_redirty(struct page *page)
 {
 	struct address_space *mapping = page->mapping;
+
 	if (mapping && mapping_cap_account_dirty(mapping)) {
+		struct bdi_writeback *wb = page_cgwb_dirty(page);
+
 		current->nr_dirtied--;
 		dec_zone_page_state(page, NR_DIRTIED);
-		dec_wb_stat(&mapping->backing_dev_info->wb, WB_DIRTIED);
+		dec_wb_stat(wb, WB_DIRTIED);
 	}
 }
 EXPORT_SYMBOL(account_page_redirty);
@@ -2300,9 +2303,10 @@ int clear_page_dirty_for_io(struct page *page)
 		 * exclusion.
 		 */
 		if (TestClearPageDirty(page)) {
+			struct bdi_writeback *wb = page_cgwb_dirty(page);
+
 			dec_zone_page_state(page, NR_FILE_DIRTY);
-			dec_wb_stat(&mapping->backing_dev_info->wb,
-				    WB_RECLAIMABLE);
+			dec_wb_stat(wb, WB_RECLAIMABLE);
 			page_blkcg_detach_dirty(page);
 			return 1;
 		}
@@ -2330,9 +2334,11 @@ int test_clear_page_writeback(struct page *page)
 						page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
 			if (bdi_cap_account_writeback(bdi)) {
-				__dec_wb_stat(&bdi->wb, WB_WRITEBACK);
+				struct bdi_writeback *wb = page_cgwb_wb(page);
+
+				__dec_wb_stat(wb, WB_WRITEBACK);
 				page_blkcg_detach_wb(page);
-				__wb_writeout_inc(&bdi->wb);
+				__wb_writeout_inc(wb);
 			}
 		}
 		spin_unlock_irqrestore(&mapping->tree_lock, flags);
@@ -2366,8 +2372,8 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
 						page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
 			if (bdi_cap_account_writeback(bdi)) {
-				__inc_wb_stat(&bdi->wb, WB_WRITEBACK);
 				page_blkcg_attach_wb(page);
+				__inc_wb_stat(page_cgwb_wb(page), WB_WRITEBACK);
 			}
 		}
 		if (!PageDirty(page))
diff --git a/mm/truncate.c b/mm/truncate.c
index caae624..1658e34 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -112,8 +112,7 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
 		struct address_space *mapping = page->mapping;
 		if (mapping && mapping_cap_account_dirty(mapping)) {
 			dec_zone_page_state(page, NR_FILE_DIRTY);
-			dec_wb_stat(&mapping->backing_dev_info->wb,
-				    WB_RECLAIMABLE);
+			dec_wb_stat(page_cgwb_dirty(page), WB_RECLAIMABLE);
 			if (account_size)
 				task_io_account_cancelled_write(account_size);
 			page_blkcg_detach_dirty(page);
-- 
2.1.0
- * [PATCH 08/45] writeback: let balance_dirty_pages() work on the matching cgroup bdi_writeback
  2015-01-06 21:25 [PATCHSET RFC block/for-next] writeback: cgroup writeback support Tejun Heo
                   ` (6 preceding siblings ...)
  2015-01-06 21:25 ` [PATCH 07/45] writeback: attribute stats to the matching per-cgroup bdi_writeback Tejun Heo
@ 2015-01-06 21:25 ` Tejun Heo
  2015-01-06 21:25 ` [PATCH 09/45] writeback: make congestion functions per bdi_writeback Tejun Heo
                   ` (37 subsequent siblings)
  45 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2015-01-06 21:25 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, jack, hch, hannes, linux-fsdevel, vgoyal, lizefan,
	cgroups, linux-mm, mhocko, clm, fengguang.wu, david, Tejun Heo
Currently, balance_dirty_pages() always works on bdi->wb.  This patch
updates it to work on the cgwb (cgroup bdi_writeback) matching the
blkcg of the current task as that's what the pages are being dirtied
against.
balance_dirty_pages_ratelimited() now pins the current blkcg and looks
up the matching cgwb and passes it to balance_dirty_pages().  The
pinning is necessary to ensure that the cgwb stays alive while the
function is executing as a cgwb's lifetime is determined by its bdi
and blkcg.
As no filesystem has FS_CGROUP_WRITEBACK yet, this doesn't lead to
visible behavior differences.
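
The pin, lookup, fall back, and unpin sequence can be modelled in
userspace roughly as below.  All names here (the *_model types,
css_model_get/put, throttle_current, MAX_CSS) are hypothetical and the
plain integer refcount is a stand-in for the real css reference count.

```c
#include <stddef.h>

#define MAX_CSS 8	/* hypothetical cap on cgroup ids in this model */

struct css_model { int id; int refcnt; };
struct wb_model { int id; };

struct bdi_model {
	struct wb_model root_wb;		/* stands in for bdi->wb */
	struct wb_model *cgwb[MAX_CSS];		/* per-cgroup wbs, by css id */
	int cgwb_enabled;			/* mapping_cgwb_enabled() */
};

struct css_model *css_model_get(struct css_model *css)
{
	css->refcnt++;
	return css;
}

void css_model_put(struct css_model *css)
{
	css->refcnt--;
}

/*
 * Pick the wb to throttle against.  The returned pointer is only for
 * inspection in this model; real code must not use the cgwb after the
 * pin has been dropped.
 */
struct wb_model *throttle_current(struct bdi_model *bdi,
				  struct css_model *task_css)
{
	struct css_model *css = NULL;
	struct wb_model *wb = NULL;

	if (bdi->cgwb_enabled) {
		css = css_model_get(task_css);	/* keeps the cgwb alive */
		wb = bdi->cgwb[css->id];
	}
	if (!wb)
		wb = &bdi->root_wb;		/* no cgwb: use the root wb */

	/* the ratelimit check and balance_dirty_pages() would run here */

	if (css)
		css_model_put(css);
	return wb;
}
```

The reference count is unchanged across the call in every path, and the
root wb is used both when cgroup writeback is disabled and when no cgwb
exists yet for the task's blkcg.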
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
---
 mm/page-writeback.c | 21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index d1fea3a..b115a57 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1335,6 +1335,7 @@ static inline void wb_dirty_limits(struct bdi_writeback *wb,
  * perform some writeout.
  */
 static void balance_dirty_pages(struct address_space *mapping,
+				struct bdi_writeback *wb,
 				unsigned long pages_dirtied)
 {
 	unsigned long nr_reclaimable;	/* = file_dirty + unstable_nfs */
@@ -1351,7 +1352,6 @@ static void balance_dirty_pages(struct address_space *mapping,
 	unsigned long dirty_ratelimit;
 	unsigned long pos_ratio;
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
-	struct bdi_writeback *wb = &bdi->wb;
 	bool strictlimit = bdi->capabilities & BDI_CAP_STRICTLIMIT;
 	unsigned long start_time = jiffies;
 
@@ -1575,13 +1575,25 @@ DEFINE_PER_CPU(int, dirty_throttle_leaks) = 0;
 void balance_dirty_pages_ratelimited(struct address_space *mapping)
 {
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
-	struct bdi_writeback *wb = &bdi->wb;
+	struct cgroup_subsys_state *blkcg_css = NULL;
+	struct bdi_writeback *wb = NULL;
 	int ratelimit;
 	int *p;
 
 	if (!bdi_cap_account_dirty(bdi))
 		return;
 
+	/*
+	 * Throttle against the cgwb of the current blkcg.  Make sure that
+	 * the cgwb stays alive by pinning the blkcg.
+	 */
+	if (mapping_cgwb_enabled(mapping)) {
+		blkcg_css = task_get_blkcg_css(current);
+		wb = cgwb_lookup(bdi, blkcg_css);
+	}
+	if (!wb)
+		wb = &bdi->wb;
+
 	ratelimit = current->nr_dirtied_pause;
 	if (wb->dirty_exceeded)
 		ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));
@@ -1615,7 +1627,10 @@ void balance_dirty_pages_ratelimited(struct address_space *mapping)
 	preempt_enable();
 
 	if (unlikely(current->nr_dirtied >= ratelimit))
-		balance_dirty_pages(mapping, current->nr_dirtied);
+		balance_dirty_pages(mapping, wb, current->nr_dirtied);
+
+	if (blkcg_css)
+		css_put(blkcg_css);
 }
 EXPORT_SYMBOL(balance_dirty_pages_ratelimited);
 
-- 
2.1.0
- * [PATCH 09/45] writeback: make congestion functions per bdi_writeback
  2015-01-06 21:25 [PATCHSET RFC block/for-next] writeback: cgroup writeback support Tejun Heo
                   ` (7 preceding siblings ...)
  2015-01-06 21:25 ` [PATCH 08/45] writeback: let balance_dirty_pages() work on the matching cgroup bdi_writeback Tejun Heo
@ 2015-01-06 21:25 ` Tejun Heo
  2015-01-06 21:25 ` [PATCH 10/45] writeback, blkcg: restructure blk_{set|clear}_queue_congested() Tejun Heo
                   ` (36 subsequent siblings)
  45 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2015-01-06 21:25 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, jack, hch, hannes, linux-fsdevel, vgoyal, lizefan,
	cgroups, linux-mm, mhocko, clm, fengguang.wu, david, Tejun Heo
Currently, all congestion functions take bdi (backing_dev_info) and
always operate on the root wb (bdi->wb) and the congestion state from
the block layer is propagated only for the root blkcg.  This patch
introduces {set|clear}_wb_congested() and wb_congested() which take
@wb and operate on it.  The bdi counterparts are now wrappers invoking
the wb based functions on @bdi->wb.
While converting clear_bdi_congested() to clear_wb_congested(), the
local variable declaration order between @wqh and @bit is swapped for
cosmetic reasons.
This patch just adds the new wb based functions.  The following
patches will apply them.
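
The split between wb based functions and bdi wrappers can be sketched in
userspace as below.  This is a simplified model with hypothetical names
(*_model suffixes, RW_* and *_CONGESTED_BIT constants); it uses plain
integers where the kernel uses atomic bit operations and atomic_t, and
it leaves out the wait queue wakeup and the congested_fn callback.

```c
enum { RW_ASYNC = 0, RW_SYNC = 1 };
enum { SYNC_CONGESTED_BIT = 1 << 0, ASYNC_CONGESTED_BIT = 1 << 1 };

struct wb_model { unsigned int state; };
struct bdi_model { struct wb_model wb; };	/* wb is the root wb */

/* Number of congested wbs per direction, shared by all wbs. */
int nr_congested[2];

static unsigned int congested_bit(int sync)
{
	return sync ? SYNC_CONGESTED_BIT : ASYNC_CONGESTED_BIT;
}

/* Count a wb only on the 0 -> 1 transition of its congested bit. */
void set_wb_congested_model(struct wb_model *wb, int sync)
{
	unsigned int bit = congested_bit(sync);

	if (!(wb->state & bit)) {
		wb->state |= bit;
		nr_congested[sync]++;
	}
}

/* Uncount a wb only on the 1 -> 0 transition of its congested bit. */
void clear_wb_congested_model(struct wb_model *wb, int sync)
{
	unsigned int bit = congested_bit(sync);

	if (wb->state & bit) {
		wb->state &= ~bit;
		nr_congested[sync]--;
	}
}

int wb_congested_model(const struct wb_model *wb, unsigned int bits)
{
	return (wb->state & bits) != 0;
}

/* The bdi variants only forward to the root wb. */
void set_bdi_congested_model(struct bdi_model *bdi, int sync)
{
	set_wb_congested_model(&bdi->wb, sync);
}

void clear_bdi_congested_model(struct bdi_model *bdi, int sync)
{
	clear_wb_congested_model(&bdi->wb, sync);
}
```

Because the counter only moves on state transitions, repeated set or
clear calls on the same wb leave it unchanged, and the root wb and a
per-cgroup wb each contribute at most one count per direction.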
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
---
 include/linux/backing-dev-defs.h | 14 +++++++++++--
 include/linux/backing-dev.h      | 43 +++++++++++++++++++++++-----------------
 mm/backing-dev.c                 | 22 ++++++++++----------
 3 files changed, 48 insertions(+), 31 deletions(-)
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 511066f..54a3a9c 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -122,7 +122,17 @@ enum {
 	BLK_RW_SYNC	= 1,
 };
 
-void clear_bdi_congested(struct backing_dev_info *bdi, int sync);
-void set_bdi_congested(struct backing_dev_info *bdi, int sync);
+void clear_wb_congested(struct bdi_writeback *wb, int sync);
+void set_wb_congested(struct bdi_writeback *wb, int sync);
+
+static inline void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
+{
+	clear_wb_congested(&bdi->wb, sync);
+}
+
+static inline void set_bdi_congested(struct backing_dev_info *bdi, int sync)
+{
+	set_wb_congested(&bdi->wb, sync);
+}
 
 #endif	/* __LINUX_BACKING_DEV_DEFS_H */
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 3722796..be66668 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -182,27 +182,13 @@ extern struct backing_dev_info noop_backing_dev_info;
 
 int writeback_in_progress(struct backing_dev_info *bdi);
 
-static inline int bdi_congested(struct backing_dev_info *bdi, int bdi_bits)
+static inline int wb_congested(struct bdi_writeback *wb, int bdi_bits)
 {
+	struct backing_dev_info *bdi = wb->bdi;
+
 	if (bdi->congested_fn)
 		return bdi->congested_fn(bdi->congested_data, bdi_bits);
-	return (bdi->wb.state & bdi_bits);
-}
-
-static inline int bdi_read_congested(struct backing_dev_info *bdi)
-{
-	return bdi_congested(bdi, 1 << WB_sync_congested);
-}
-
-static inline int bdi_write_congested(struct backing_dev_info *bdi)
-{
-	return bdi_congested(bdi, 1 << WB_async_congested);
-}
-
-static inline int bdi_rw_congested(struct backing_dev_info *bdi)
-{
-	return bdi_congested(bdi, (1 << WB_sync_congested) |
-				  (1 << WB_async_congested));
+	return wb->state & bdi_bits;
 }
 
 long congestion_wait(int sync, long timeout);
@@ -422,4 +408,25 @@ static inline struct bdi_writeback *page_cgwb_wb(struct page *page)
 
 #endif	/* CONFIG_CGROUP_WRITEBACK */
 
+static inline int bdi_congested(struct backing_dev_info *bdi, int bdi_bits)
+{
+	return wb_congested(&bdi->wb, bdi_bits);
+}
+
+static inline int bdi_read_congested(struct backing_dev_info *bdi)
+{
+	return bdi_congested(bdi, 1 << WB_sync_congested);
+}
+
+static inline int bdi_write_congested(struct backing_dev_info *bdi)
+{
+	return bdi_congested(bdi, 1 << WB_async_congested);
+}
+
+static inline int bdi_rw_congested(struct backing_dev_info *bdi)
+{
+	return bdi_congested(bdi, (1 << WB_sync_congested) |
+				  (1 << WB_async_congested));
+}
+
 #endif		/* _LINUX_BACKING_DEV_H */
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index c6dda82..2851278 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -767,31 +767,31 @@ static wait_queue_head_t congestion_wqh[2] = {
 		__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]),
 		__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
 	};
-static atomic_t nr_bdi_congested[2];
+static atomic_t nr_wb_congested[2];
 
-void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
+void clear_wb_congested(struct bdi_writeback *wb, int sync)
 {
-	enum wb_state bit;
 	wait_queue_head_t *wqh = &congestion_wqh[sync];
+	enum wb_state bit;
 
 	bit = sync ? WB_sync_congested : WB_async_congested;
-	if (test_and_clear_bit(bit, &bdi->wb.state))
-		atomic_dec(&nr_bdi_congested[sync]);
+	if (test_and_clear_bit(bit, &wb->state))
+		atomic_dec(&nr_wb_congested[sync]);
 	smp_mb__after_atomic();
 	if (waitqueue_active(wqh))
 		wake_up(wqh);
 }
-EXPORT_SYMBOL(clear_bdi_congested);
+EXPORT_SYMBOL(clear_wb_congested);
 
-void set_bdi_congested(struct backing_dev_info *bdi, int sync)
+void set_wb_congested(struct bdi_writeback *wb, int sync)
 {
 	enum wb_state bit;
 
 	bit = sync ? WB_sync_congested : WB_async_congested;
-	if (!test_and_set_bit(bit, &bdi->wb.state))
-		atomic_inc(&nr_bdi_congested[sync]);
+	if (!test_and_set_bit(bit, &wb->state))
+		atomic_inc(&nr_wb_congested[sync]);
 }
-EXPORT_SYMBOL(set_bdi_congested);
+EXPORT_SYMBOL(set_wb_congested);
 
 /**
  * congestion_wait - wait for a backing_dev to become uncongested
@@ -850,7 +850,7 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
 	 * encountered in the current zone, yield if necessary instead
 	 * of sleeping on the congestion queue
 	 */
-	if (atomic_read(&nr_bdi_congested[sync]) == 0 ||
+	if (atomic_read(&nr_wb_congested[sync]) == 0 ||
 	    !test_bit(ZONE_CONGESTED, &zone->flags)) {
 		cond_resched();
 
-- 
2.1.0
- * [PATCH 10/45] writeback, blkcg: restructure blk_{set|clear}_queue_congested()
  2015-01-06 21:25 [PATCHSET RFC block/for-next] writeback: cgroup writeback support Tejun Heo
                   ` (8 preceding siblings ...)
  2015-01-06 21:25 ` [PATCH 09/45] writeback: make congestion functions per bdi_writeback Tejun Heo
@ 2015-01-06 21:25 ` Tejun Heo
  2015-01-06 21:25 ` [PATCH 11/45] writeback, blkcg: propagate non-root blkcg congestion state Tejun Heo
                   ` (35 subsequent siblings)
  45 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2015-01-06 21:25 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, jack, hch, hannes, linux-fsdevel, vgoyal, lizefan,
	cgroups, linux-mm, mhocko, clm, fengguang.wu, david, Tejun Heo
blk_{set|clear}_queue_congested() take @q and set or clear,
respectively, the congestion state of its bdi's root wb.  Because bdi
used to be able to handle congestion state only on the root wb, the
callers of those functions tested whether the congestion is on the
root blkcg and skipped if not.
This is cumbersome and makes implementation of per cgroup
bdi_writeback congestion state propagation difficult.  This patch
renames blk_{set|clear}_queue_congested() to
blk_{set|clear}_congested(), and makes them take request_list instead
of request_queue and test whether the specified request_list is the
root one before updating bdi_writeback congestion state.  This makes
the tests in the callers unnecessary and simplifies them.
As there are no external users of these functions, the definitions are
moved from include/linux/blkdev.h to block/blk-core.c.
This patch doesn't introduce any noticeable behavior difference.
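
A userspace sketch of the restructured helpers follows.  The names
(rl_model, queue_model, rl_set_congested and friends) are hypothetical,
and the single bitmask in queue_model stands in for the congestion state
of the bdi's root wb.  It shows the root check living inside the helpers
and the on/off threshold logic that callers can now apply to any
request_list.

```c
struct queue_model;

struct rl_model {
	struct queue_model *q;
	int count[2];			/* allocated requests per direction */
};

struct queue_model {
	struct rl_model root_rl;
	int on_thresh;			/* set congested at or above this */
	int off_thresh;			/* clear congested below this */
	unsigned int congested;		/* one bit per direction, root wb */
};

/* Only the root request_list updates the congestion state for now. */
void rl_set_congested(struct rl_model *rl, int sync)
{
	if (rl != &rl->q->root_rl)
		return;
	rl->q->congested |= 1u << sync;
}

void rl_clear_congested(struct rl_model *rl, int sync)
{
	if (rl != &rl->q->root_rl)
		return;
	rl->q->congested &= ~(1u << sync);
}

/* Callers no longer test for the root request_list themselves. */
void rl_update(struct rl_model *rl, int sync)
{
	if (rl->count[sync] >= rl->q->on_thresh)
		rl_set_congested(rl, sync);
	else if (rl->count[sync] < rl->q->off_thresh)
		rl_clear_congested(rl, sync);
}
```

A count between the two thresholds leaves the state untouched, and a
non-root request_list is ignored however full it is, which matches the
"no noticeable behavior difference" claim above.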
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-core.c       | 62 ++++++++++++++++++++++++++++++--------------------
 include/linux/blkdev.h | 19 ----------------
 2 files changed, 37 insertions(+), 44 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index ff4d2f8..c9a7d6c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -63,6 +63,28 @@ struct kmem_cache *blk_requestq_cachep;
  */
 static struct workqueue_struct *kblockd_workqueue;
 
+static void blk_clear_congested(struct request_list *rl, int sync)
+{
+	if (rl != &rl->q->root_rl)
+		return;
+#ifdef CONFIG_CGROUP_WRITEBACK
+	clear_wb_congested(rl->blkg->wb, sync);
+#else
+	clear_wb_congested(&rl->q->backing_dev_info.wb, sync);
+#endif
+}
+
+static void blk_set_congested(struct request_list *rl, int sync)
+{
+	if (rl != &rl->q->root_rl)
+		return;
+#ifdef CONFIG_CGROUP_WRITEBACK
+	set_wb_congested(rl->blkg->wb, sync);
+#else
+	set_wb_congested(&rl->q->backing_dev_info.wb, sync);
+#endif
+}
+
 void blk_queue_congestion_threshold(struct request_queue *q)
 {
 	int nr;
@@ -828,13 +850,8 @@ static void __freed_request(struct request_list *rl, int sync)
 {
 	struct request_queue *q = rl->q;
 
-	/*
-	 * bdi isn't aware of blkcg yet.  As all async IOs end up root
-	 * blkcg anyway, just use root blkcg state.
-	 */
-	if (rl == &q->root_rl &&
-	    rl->count[sync] < queue_congestion_off_threshold(q))
-		blk_clear_queue_congested(q, sync);
+	if (rl->count[sync] < queue_congestion_off_threshold(q))
+		blk_clear_congested(rl, sync);
 
 	if (rl->count[sync] + 1 <= q->nr_requests) {
 		if (waitqueue_active(&rl->wait[sync]))
@@ -867,25 +884,25 @@ static void freed_request(struct request_list *rl, unsigned int flags)
 int blk_update_nr_requests(struct request_queue *q, unsigned int nr)
 {
 	struct request_list *rl;
+	int on_thresh, off_thresh;
 
 	spin_lock_irq(q->queue_lock);
 	q->nr_requests = nr;
 	blk_queue_congestion_threshold(q);
+	on_thresh = queue_congestion_on_threshold(q);
+	off_thresh = queue_congestion_off_threshold(q);
 
-	/* congestion isn't cgroup aware and follows root blkcg for now */
-	rl = &q->root_rl;
-
-	if (rl->count[BLK_RW_SYNC] >= queue_congestion_on_threshold(q))
-		blk_set_queue_congested(q, BLK_RW_SYNC);
-	else if (rl->count[BLK_RW_SYNC] < queue_congestion_off_threshold(q))
-		blk_clear_queue_congested(q, BLK_RW_SYNC);
+	blk_queue_for_each_rl(rl, q) {
+		if (rl->count[BLK_RW_SYNC] >= on_thresh)
+			blk_set_congested(rl, BLK_RW_SYNC);
+		else if (rl->count[BLK_RW_SYNC] < off_thresh)
+			blk_clear_congested(rl, BLK_RW_SYNC);
 
-	if (rl->count[BLK_RW_ASYNC] >= queue_congestion_on_threshold(q))
-		blk_set_queue_congested(q, BLK_RW_ASYNC);
-	else if (rl->count[BLK_RW_ASYNC] < queue_congestion_off_threshold(q))
-		blk_clear_queue_congested(q, BLK_RW_ASYNC);
+		if (rl->count[BLK_RW_ASYNC] >= on_thresh)
+			blk_set_congested(rl, BLK_RW_ASYNC);
+		else if (rl->count[BLK_RW_ASYNC] < off_thresh)
+			blk_clear_congested(rl, BLK_RW_ASYNC);
 
-	blk_queue_for_each_rl(rl, q) {
 		if (rl->count[BLK_RW_SYNC] >= q->nr_requests) {
 			blk_set_rl_full(rl, BLK_RW_SYNC);
 		} else {
@@ -995,12 +1012,7 @@ static struct request *__get_request(struct request_list *rl, int rw_flags,
 				}
 			}
 		}
-		/*
-		 * bdi isn't aware of blkcg yet.  As all async IOs end up
-		 * root blkcg anyway, just use root blkcg state.
-		 */
-		if (rl == &q->root_rl)
-			blk_set_queue_congested(q, is_sync);
+		blk_set_congested(rl, is_sync);
 	}
 
 	/*
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index fc980a6..b8964a7 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -819,25 +819,6 @@ extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t,
 
 extern void blk_queue_bio(struct request_queue *q, struct bio *bio);
 
-/*
- * A queue has just exitted congestion.  Note this in the global counter of
- * congested queues, and wake up anyone who was waiting for requests to be
- * put back.
- */
-static inline void blk_clear_queue_congested(struct request_queue *q, int sync)
-{
-	clear_bdi_congested(&q->backing_dev_info, sync);
-}
-
-/*
- * A queue has just entered congestion.  Flag that in the queue's VM-visible
- * state flags and increment the global gounter of congested queues.
- */
-static inline void blk_set_queue_congested(struct request_queue *q, int sync)
-{
-	set_bdi_congested(&q->backing_dev_info, sync);
-}
-
 extern void blk_start_queue(struct request_queue *q);
 extern void blk_stop_queue(struct request_queue *q);
 extern void blk_sync_queue(struct request_queue *q);
-- 
2.1.0
- * [PATCH 11/45] writeback, blkcg: propagate non-root blkcg congestion state
  2015-01-06 21:25 [PATCHSET RFC block/for-next] writeback: cgroup writeback support Tejun Heo
                   ` (9 preceding siblings ...)
  2015-01-06 21:25 ` [PATCH 10/45] writeback, blkcg: restructure blk_{set|clear}_queue_congested() Tejun Heo
@ 2015-01-06 21:25 ` Tejun Heo
  2015-01-06 21:25 ` [PATCH 12/45] writeback: implement and use mapping_congested() Tejun Heo
                   ` (34 subsequent siblings)
  45 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2015-01-06 21:25 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, jack, hch, hannes, linux-fsdevel, vgoyal, lizefan,
	cgroups, linux-mm, mhocko, clm, fengguang.wu, david, Tejun Heo
Now that the bdi layer can handle per-cgwb (cgroup bdi_writeback)
congestion state, blk_{set|clear}_congested() can propagate non-root
blkcg congestion state to the matching cgwbs.
This is achieved by dropping the root_rl tests in
blk_{set|clear}_congested() when CONFIG_CGROUP_WRITEBACK is enabled.
The tests are still needed when !CONFIG_CGROUP_WRITEBACK; otherwise,
we'd end up flipping the root blkcg wb's congestion state for events
happening on other blkcgs.
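As an illustration only (not part of the patch), the rule above can be
modeled in userspace C.  All model_* names are hypothetical stand-ins
for the kernel structures, and the cgwb_enabled argument replaces the
CONFIG_CGROUP_WRITEBACK compile-time switch with a runtime flag so that
both behaviors can be exercised side by side.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-ins; not the kernel definitions. */
enum { MODEL_ASYNC = 0, MODEL_SYNC = 1 };

struct model_wb {
	unsigned long congested;	/* bit 0: async, bit 1: sync */
};

struct model_rl {
	struct model_wb *wb;		/* wb of the blkcg owning this rl */
	bool is_root;			/* true for the queue's root_rl */
};

/*
 * Set the congestion bit for @rl.  @cgwb_enabled stands in for
 * CONFIG_CGROUP_WRITEBACK and @root_wb for bdi->wb.
 */
void model_set_congested(struct model_rl *rl, struct model_wb *root_wb,
			 int sync, bool cgwb_enabled)
{
	if (cgwb_enabled)
		rl->wb->congested |= 1UL << sync;	/* per-blkcg state */
	else if (rl->is_root)
		root_wb->congested |= 1UL << sync;	/* root events only */
	/* otherwise: every blkg shares root_wb, so ignore the event */
}
```

With the flag set, an event on a non-root request list touches only
that blkcg's wb; with it cleared, the same event is dropped and only
root_rl events reach the shared root wb.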
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Vivek Goyal <vgoyal@redhat.com>
---
 block/blk-core.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index c9a7d6c..d731f1a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -65,23 +65,26 @@ static struct workqueue_struct *kblockd_workqueue;
 
 static void blk_clear_congested(struct request_list *rl, int sync)
 {
-	if (rl != &rl->q->root_rl)
-		return;
 #ifdef CONFIG_CGROUP_WRITEBACK
 	clear_wb_congested(rl->blkg->wb, sync);
 #else
-	clear_wb_congested(&rl->q->backing_dev_info.wb, sync);
+	/*
+	 * If !CGROUP_WRITEBACK, all blkg's map to bdi->wb and we shouldn't
+	 * flip its congestion state for events on other blkcgs.
+	 */
+	if (rl == &rl->q->root_rl)
+		clear_wb_congested(&rl->q->backing_dev_info.wb, sync);
 #endif
 }
 
 static void blk_set_congested(struct request_list *rl, int sync)
 {
-	if (rl != &rl->q->root_rl)
-		return;
 #ifdef CONFIG_CGROUP_WRITEBACK
 	set_wb_congested(rl->blkg->wb, sync);
 #else
-	set_wb_congested(&rl->q->backing_dev_info.wb, sync);
+	/* see blk_clear_congested() */
+	if (rl == &rl->q->root_rl)
+		set_wb_congested(&rl->q->backing_dev_info.wb, sync);
 #endif
 }
 
-- 
2.1.0
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
- * [PATCH 12/45] writeback: implement and use mapping_congested()
  2015-01-06 21:25 [PATCHSET RFC block/for-next] writeback: cgroup writeback support Tejun Heo
                   ` (10 preceding siblings ...)
  2015-01-06 21:25 ` [PATCH 11/45] writeback, blkcg: propagate non-root blkcg congestion state Tejun Heo
@ 2015-01-06 21:25 ` Tejun Heo
  2015-01-06 21:25 ` [PATCH 13/45] writeback: implement WB_has_dirty_io wb_state flag Tejun Heo
                   ` (33 subsequent siblings)
  45 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2015-01-06 21:25 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, jack, hch, hannes, linux-fsdevel, vgoyal, lizefan,
	cgroups, linux-mm, mhocko, clm, fengguang.wu, david, Tejun Heo
In several places, bdi_congested() and its wrappers are used to
determine whether more IOs should be issued.  With cgroup writeback
support, this question can't be answered solely based on the bdi
(backing_dev_info).  The answer also depends on whether the filesystem
and bdi support cgroup writeback and on the blkcg the asking task
belongs to.
This patch implements mapping_congested() and its wrappers, which take
@mapping and @task and determine the congestion state for the
combination, taking cgroup writeback into account.  The new functions
replace bdi_*congested() calls in places where the query is about a
specific mapping and task.
There are several filesystem users which also fit these criteria but
they should be updated when each filesystem implements cgroup
writeback support.
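The lookup logic described above can be sketched as a small userspace
model.  This is an illustration under stated assumptions, not the
kernel code: the model_* names are hypothetical, a fixed array indexed
by a cgroup id stands in for cgwb_lookup(), and the "cgroup writeback
enabled" property is kept on the bdi here although the patch tests it
per mapping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_ASYNC_CONGESTED	(1 << 0)
#define MODEL_SYNC_CONGESTED	(1 << 1)
#define MODEL_MAX_CGROUPS	4

struct model_wb {
	int congested;			/* mask of MODEL_*_CONGESTED bits */
};

struct model_bdi {
	struct model_wb root;				/* stand-in for bdi->wb */
	struct model_wb *cgwb[MODEL_MAX_CGROUPS];	/* NULL: no cgwb yet */
	bool cgwb_enabled;
};

/* Returns the subset of @bits that is set for the asking cgroup. */
int model_mapping_congested(const struct model_bdi *bdi, int cgroup_id,
			    int bits)
{
	const struct model_wb *wb;

	if (!bdi->cgwb_enabled)
		return bdi->root.congested & bits;

	wb = bdi->cgwb[cgroup_id];	/* stand-in for cgwb_lookup() */
	return wb ? wb->congested & bits : 0;
}
```

As in the patch, a cgroup that has no wb on the bdi yet is reported as
not congested, and the root wb's state is consulted only when cgroup
writeback is not enabled.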
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Vivek Goyal <vgoyal@redhat.com>
---
 fs/fs-writeback.c           | 34 ++++++++++++++++++++++++++++++++++
 include/linux/backing-dev.h | 27 +++++++++++++++++++++++++++
 mm/fadvise.c                |  2 +-
 mm/readahead.c              |  2 +-
 mm/vmscan.c                 | 12 ++++++------
 5 files changed, 69 insertions(+), 8 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 3b54835..43c1fb2 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -109,6 +109,40 @@ out_unlock:
 #ifdef CONFIG_CGROUP_WRITEBACK
 
 /**
+ * mapping_congested - test whether a mapping is congested for a task
+ * @mapping: address space to test for congestion
+ * @task: task to test congestion for
+ * @bdi_bits: mask of WB_[a]sync_congested bits to test
+ *
+ * Tests whether @mapping is congested for @task.  @bdi_bits is the mask of
+ * congestion bits to test and the return value is the mask of set bits.
+ *
+ * If cgroup writeback is enabled for @mapping, its congestion state for
+ * @task is determined by whether the cgwb (cgroup bdi_writeback) for the
+ * blkcg of @task on @mapping->backing_dev_info is congested; otherwise,
+ * the root's congestion state is used.
+ */
+int mapping_congested(struct address_space *mapping,
+		      struct task_struct *task, int bdi_bits)
+{
+	struct backing_dev_info *bdi = mapping->backing_dev_info;
+	struct bdi_writeback *wb;
+	int ret = 0;
+
+	if (!mapping_cgwb_enabled(mapping))
+		return wb_congested(&bdi->wb, bdi_bits);
+
+	rcu_read_lock();
+	wb = cgwb_lookup(bdi, task_css(task, blkio_cgrp_id));
+	if (wb)
+		ret = wb_congested(wb, bdi_bits);
+	rcu_read_unlock();
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(mapping_congested);
+
+/**
  * init_cgwb_dirty_page_context - init cgwb part of dirty_context
  * @dctx: dirty_context being initialized
  *
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index be66668..0b1ac4b 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -263,6 +263,8 @@ void init_dirty_inode_context(struct dirty_context *dctx, struct inode *inode);
 void cgwb_blkcg_released(struct cgroup_subsys_state *blkcg_css);
 int __cgwb_create(struct backing_dev_info *bdi,
 		  struct cgroup_subsys_state *blkcg_css);
+int mapping_congested(struct address_space *mapping, struct task_struct *task,
+		      int bdi_bits);
 
 /**
  * mapping_cgwb_enabled - test whether cgroup writeback is enabled on a mapping
@@ -383,6 +385,12 @@ static inline void cgwb_blkcg_released(struct cgroup_subsys_state *blkcg_css)
 {
 }
 
+static inline int mapping_congested(struct address_space *mapping,
+				    struct task_struct *task, int bdi_bits)
+{
+	return wb_congested(&mapping->backing_dev_info->wb, bdi_bits);
+}
+
 static inline struct bdi_writeback *
 cgwb_lookup(struct backing_dev_info *bdi, struct cgroup_subsys_state *blkcg_css)
 {
@@ -408,6 +416,25 @@ static inline struct bdi_writeback *page_cgwb_wb(struct page *page)
 
 #endif	/* CONFIG_CGROUP_WRITEBACK */
 
+static inline int mapping_read_congested(struct address_space *mapping,
+					 struct task_struct *task)
+{
+	return mapping_congested(mapping, task, 1 << WB_sync_congested);
+}
+
+static inline int mapping_write_congested(struct address_space *mapping,
+					  struct task_struct *task)
+{
+	return mapping_congested(mapping, task, 1 << WB_async_congested);
+}
+
+static inline int mapping_rw_congested(struct address_space *mapping,
+				       struct task_struct *task)
+{
+	return mapping_congested(mapping, task, (1 << WB_sync_congested) |
+						(1 << WB_async_congested));
+}
+
 static inline int bdi_congested(struct backing_dev_info *bdi, int bdi_bits)
 {
 	return wb_congested(&bdi->wb, bdi_bits);
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 2ad7adf..c7347d7 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -113,7 +113,7 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice)
 	case POSIX_FADV_NOREUSE:
 		break;
 	case POSIX_FADV_DONTNEED:
-		if (!bdi_write_congested(mapping->backing_dev_info))
+		if (!mapping_write_congested(mapping, current))
 			__filemap_fdatawrite_range(mapping, offset, endbyte,
 						   WB_SYNC_NONE);
 
diff --git a/mm/readahead.c b/mm/readahead.c
index 17b9172..beb930c 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -541,7 +541,7 @@ page_cache_async_readahead(struct address_space *mapping,
 	/*
 	 * Defer asynchronous read-ahead on IO congestion.
 	 */
-	if (bdi_read_congested(mapping->backing_dev_info))
+	if (mapping_read_congested(mapping, current))
 		return;
 
 	/* do read-ahead */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5e8772b..95b98c7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -411,14 +411,14 @@ static inline int is_page_cache_freeable(struct page *page)
 	return page_count(page) - page_has_private(page) == 2;
 }
 
-static int may_write_to_queue(struct backing_dev_info *bdi,
-			      struct scan_control *sc)
+static int may_write_to_mapping(struct address_space *mapping,
+				struct scan_control *sc)
 {
 	if (current->flags & PF_SWAPWRITE)
 		return 1;
-	if (!bdi_write_congested(bdi))
+	if (!mapping_write_congested(mapping, current))
 		return 1;
-	if (bdi == current->backing_dev_info)
+	if (mapping->backing_dev_info == current->backing_dev_info)
 		return 1;
 	return 0;
 }
@@ -497,7 +497,7 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
 	}
 	if (mapping->a_ops->writepage == NULL)
 		return PAGE_ACTIVATE;
-	if (!may_write_to_queue(mapping->backing_dev_info, sc))
+	if (!may_write_to_mapping(mapping, sc))
 		return PAGE_KEEP;
 
 	if (clear_page_dirty_for_io(page)) {
@@ -885,7 +885,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		 */
 		mapping = page_mapping(page);
 		if (((dirty || writeback) && mapping &&
-		     bdi_write_congested(mapping->backing_dev_info)) ||
+		     mapping_write_congested(mapping, current)) ||
 		    (writeback && PageReclaim(page)))
 			nr_congested++;
 
-- 
2.1.0
- * [PATCH 13/45] writeback: implement WB_has_dirty_io wb_state flag
  2015-01-06 21:25 [PATCHSET RFC block/for-next] writeback: cgroup writeback support Tejun Heo
                   ` (11 preceding siblings ...)
  2015-01-06 21:25 ` [PATCH 12/45] writeback: implement and use mapping_congested() Tejun Heo
@ 2015-01-06 21:25 ` Tejun Heo
  2015-01-06 21:25 ` [PATCH 14/45] writeback: implement backing_dev_info->tot_write_bandwidth Tejun Heo
                   ` (32 subsequent siblings)
  45 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2015-01-06 21:25 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, jack, hch, hannes, linux-fsdevel, vgoyal, lizefan,
	cgroups, linux-mm, mhocko, clm, fengguang.wu, david, Tejun Heo
Currently, wb_has_dirty_io() determines whether a wb (bdi_writeback)
has any dirty inode by testing all three IO lists on each invocation
without actively keeping track.  For cgroup writeback support, a
single bdi will host multiple wb's, each of which will host dirty
inodes separately.  bdi_has_dirty_io(), which currently only
represents the root wb, will need to aggregate has_dirty_io from all
member wb's, which requires tracking transitions in has_dirty_io state
on each wb.
This patch adds inode_wb_list_move_locked() and
inode_wb_list_del_locked() and replaces direct inode->i_wb_list
operations with them.  In addition to the list operations, the two
functions keep track of whether the wb has dirty inodes or not and
record it using the new WB_has_dirty_io flag.
inode_wb_list_move_locked()'s return value indicates whether the wb
had no dirty inodes before.
mark_inode_dirty_dctx() is restructured so that the return value of
inode_wb_list_move_locked() can be used for deciding whether to wake
up the wb.
While at it, change {bdi|wb}_has_dirty_io()'s return values to bool.
These functions were returning 0 and 1 before.  Also, add a comment
explaining the synchronization of wb_state flags.
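The flag tracking described above can be illustrated with a userspace
model.  This is only a sketch: the model_* names are hypothetical,
per-list counters replace the kernel's list_head lists, and locking is
left out entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for ->b_dirty, ->b_io and ->b_more_io. */
enum model_list {
	MODEL_NONE, MODEL_DIRTY, MODEL_IO, MODEL_MORE_IO, MODEL_NR_LISTS
};

struct model_wb {
	int nr[MODEL_NR_LISTS];	/* inodes per list; nr[MODEL_NONE] unused */
	bool has_dirty_io;	/* models the WB_has_dirty_io bit */
};

struct model_inode {
	enum model_list on;	/* list the inode currently sits on */
};

/* Returns true if @wb had no dirty inodes before the move. */
bool model_list_move(struct model_inode *inode, struct model_wb *wb,
		     enum model_list to)
{
	assert(to != MODEL_NONE);
	if (inode->on != MODEL_NONE)
		wb->nr[inode->on]--;
	wb->nr[to]++;
	inode->on = to;

	if (wb->has_dirty_io)
		return false;
	wb->has_dirty_io = true;
	return true;
}

/* Drop @inode from its list; clear the flag once all lists are empty. */
void model_list_del(struct model_inode *inode, struct model_wb *wb)
{
	if (inode->on != MODEL_NONE)
		wb->nr[inode->on]--;
	inode->on = MODEL_NONE;

	if (wb->has_dirty_io && !wb->nr[MODEL_DIRTY] &&
	    !wb->nr[MODEL_IO] && !wb->nr[MODEL_MORE_IO])
		wb->has_dirty_io = false;
}
```

Only the first move onto an empty wb returns true, which is the
transition a caller would use to decide whether to wake up the wb.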
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c                | 88 +++++++++++++++++++++++++++++-----------
 include/linux/backing-dev-defs.h |  8 ++++
 include/linux/backing-dev.h      |  8 ++--
 mm/backing-dev.c                 |  2 +-
 4 files changed, 77 insertions(+), 29 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 43c1fb2..1718f5f 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -291,16 +291,62 @@ void bdi_start_background_writeback(struct backing_dev_info *bdi)
 	wb_wakeup(&bdi->wb);
 }
 
+/**
+ * inode_wb_list_move_locked - move an inode onto a bdi_writeback IO list
+ * @inode: inode to be moved
+ * @wb: target bdi_writeback
+ * @head: one of @wb->b_{dirty|io|more_io}
+ *
+ * Move @inode->i_wb_list to @head of @wb and set %WB_has_dirty_io.
+ * Returns %true if all IO lists were empty before; otherwise, %false.
+ */
+static bool inode_wb_list_move_locked(struct inode *inode,
+				      struct bdi_writeback *wb,
+				      struct list_head *head)
+{
+	assert_spin_locked(&wb->list_lock);
+
+	list_move(&inode->i_wb_list, head);
+
+	if (wb_has_dirty_io(wb)) {
+		return false;
+	} else {
+		set_bit(WB_has_dirty_io, &wb->state);
+		return true;
+	}
+}
+
+/**
+ * inode_wb_list_del_locked - remove an inode from its bdi_writeback IO list
+ * @inode: inode to be removed
+ * @wb: bdi_writeback @inode is being removed from
+ *
+ * Remove @inode which may be on one of @wb->b_{dirty|io|more_io} lists and
+ * clear %WB_has_dirty_io if all are empty afterwards.
+ */
+static void inode_wb_list_del_locked(struct inode *inode,
+				     struct bdi_writeback *wb)
+{
+	assert_spin_locked(&wb->list_lock);
+
+	list_del_init(&inode->i_wb_list);
+
+	if (wb_has_dirty_io(wb) && list_empty(&wb->b_dirty) &&
+	    list_empty(&wb->b_io) && list_empty(&wb->b_more_io))
+		clear_bit(WB_has_dirty_io, &wb->state);
+}
+
 /*
  * Remove the inode from the writeback list it is on.
  */
 void inode_wb_list_del(struct inode *inode)
 {
 	struct backing_dev_info *bdi = inode_to_bdi(inode);
+	struct bdi_writeback *wb = &bdi->wb;
 
-	spin_lock(&bdi->wb.list_lock);
-	list_del_init(&inode->i_wb_list);
-	spin_unlock(&bdi->wb.list_lock);
+	spin_lock(&wb->list_lock);
+	inode_wb_list_del_locked(inode, wb);
+	spin_unlock(&wb->list_lock);
 }
 
 /*
@@ -314,7 +360,6 @@ void inode_wb_list_del(struct inode *inode)
  */
 static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
 {
-	assert_spin_locked(&wb->list_lock);
 	if (!list_empty(&wb->b_dirty)) {
 		struct inode *tail;
 
@@ -322,7 +367,7 @@ static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
 		if (time_before(inode->dirtied_when, tail->dirtied_when))
 			inode->dirtied_when = jiffies;
 	}
-	list_move(&inode->i_wb_list, &wb->b_dirty);
+	inode_wb_list_move_locked(inode, wb, &wb->b_dirty);
 }
 
 /*
@@ -330,8 +375,7 @@ static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
  */
 static void requeue_io(struct inode *inode, struct bdi_writeback *wb)
 {
-	assert_spin_locked(&wb->list_lock);
-	list_move(&inode->i_wb_list, &wb->b_more_io);
+	inode_wb_list_move_locked(inode, wb, &wb->b_more_io);
 }
 
 static void inode_sync_complete(struct inode *inode)
@@ -549,7 +593,7 @@ static void requeue_inode(struct inode *inode, struct bdi_writeback *wb,
 		redirty_tail(inode, wb);
 	} else {
 		/* The inode is clean. Remove from writeback lists. */
-		list_del_init(&inode->i_wb_list);
+		inode_wb_list_del_locked(inode, wb);
 	}
 }
 
@@ -678,7 +722,7 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 	 * touch it. See comment above for explanation.
 	 */
 	if (!(inode->i_state & I_DIRTY))
-		list_del_init(&inode->i_wb_list);
+		inode_wb_list_del_locked(inode, wb);
 	spin_unlock(&wb->list_lock);
 	inode_sync_complete(inode);
 out:
@@ -1319,25 +1363,23 @@ void mark_inode_dirty_dctx(struct dirty_context *dctx, int flags)
 
 			spin_unlock(&inode->i_lock);
 			spin_lock(&bdi->wb.list_lock);
-			if (bdi_cap_writeback_dirty(bdi)) {
-				WARN(!test_bit(WB_registered, &bdi->wb.state),
-				     "bdi-%s not registered\n", bdi->name);
 
-				/*
-				 * If this is the first dirty inode for this
-				 * bdi, we have to wake-up the corresponding
-				 * bdi thread to make sure background
-				 * write-back happens later.
-				 */
-				if (!wb_has_dirty_io(&bdi->wb))
-					wakeup_bdi = true;
-			}
+			WARN(bdi_cap_writeback_dirty(bdi) &&
+			     !test_bit(WB_registered, &bdi->wb.state),
+			     "bdi-%s not registered\n", bdi->name);
 
 			inode->dirtied_when = jiffies;
-			list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
+			wakeup_bdi = inode_wb_list_move_locked(inode, &bdi->wb,
+							      &bdi->wb.b_dirty);
 			spin_unlock(&bdi->wb.list_lock);
 
-			if (wakeup_bdi)
+			/*
+			 * If this is the first dirty inode for this bdi,
+			 * we have to wake-up the corresponding bdi thread
+			 * to make sure background write-back happens
+			 * later.
+			 */
+			if (bdi_cap_writeback_dirty(bdi) && wakeup_bdi)
 				wb_wakeup_delayed(&bdi->wb);
 			return;
 		}
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 54a3a9c..d1c0bf4 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -17,10 +17,18 @@ struct dentry;
  * Bits in bdi_writeback.state
  */
 enum wb_state {
+	/*
+	 * The two congested flags are modified asynchronously and must be
+	 * atomic.  The other flags are protected either by wb->list_lock
+	 * or ->work_lock and don't need to be atomic if placed on separate
+	 * fields.  The extra atomic operations don't really matter here.
+	 * Let's keep them together and use atomic bitops.
+	 */
 	WB_async_congested,	/* The async (write) queue is getting full */
 	WB_sync_congested,	/* The sync queue is getting full */
 	WB_registered,		/* bdi_register() was done */
 	WB_writeback_running,	/* Writeback is in progress */
+	WB_has_dirty_io,	/* Dirty inodes on ->b_{dirty|io|more_io} */
 };
 
 typedef int (congested_fn)(void *, int);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 0b1ac4b..533ff86 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -30,7 +30,7 @@ void bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
 			enum wb_reason reason);
 void bdi_start_background_writeback(struct backing_dev_info *bdi);
 void wb_workfn(struct work_struct *work);
-int bdi_has_dirty_io(struct backing_dev_info *bdi);
+bool bdi_has_dirty_io(struct backing_dev_info *bdi);
 void wb_wakeup_delayed(struct bdi_writeback *wb);
 
 extern spinlock_t bdi_lock;
@@ -38,11 +38,9 @@ extern struct list_head bdi_list;
 
 extern struct workqueue_struct *bdi_wq;
 
-static inline int wb_has_dirty_io(struct bdi_writeback *wb)
+static inline bool wb_has_dirty_io(struct bdi_writeback *wb)
 {
-	return !list_empty(&wb->b_dirty) ||
-	       !list_empty(&wb->b_io) ||
-	       !list_empty(&wb->b_more_io);
+	return test_bit(WB_has_dirty_io, &wb->state);
 }
 
 static inline void __add_wb_stat(struct bdi_writeback *wb,
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 2851278..9d69e7c 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -307,7 +307,7 @@ static int __init default_bdi_init(void)
 }
 subsys_initcall(default_bdi_init);
 
-int bdi_has_dirty_io(struct backing_dev_info *bdi)
+bool bdi_has_dirty_io(struct backing_dev_info *bdi)
 {
 	return wb_has_dirty_io(&bdi->wb);
 }
-- 
2.1.0
- * [PATCH 14/45] writeback: implement backing_dev_info->tot_write_bandwidth
  2015-01-06 21:25 [PATCHSET RFC block/for-next] writeback: cgroup writeback support Tejun Heo
                   ` (12 preceding siblings ...)
  2015-01-06 21:25 ` [PATCH 13/45] writeback: implement WB_has_dirty_io wb_state flag Tejun Heo
@ 2015-01-06 21:25 ` Tejun Heo
  2015-01-06 21:25 ` [PATCH 15/45] writeback: make bdi_has_dirty_io() take multiple bdi_writeback's into account Tejun Heo
                   ` (31 subsequent siblings)
  45 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2015-01-06 21:25 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, jack, hch, hannes, linux-fsdevel, vgoyal, lizefan,
	cgroups, linux-mm, mhocko, clm, fengguang.wu, david, Tejun Heo
cgroup writeback support needs to keep track of the sum of
avg_write_bandwidth of all wb's (bdi_writeback's) with dirty inodes to
distribute write workload.  This patch adds bdi->tot_write_bandwidth
and updates inode_wb_list_move_locked(), inode_wb_list_del_locked()
and wb_update_write_bandwidth() to adjust it as wb's gain and lose
dirty inodes and as their avg_write_bandwidth gets updated.
As the update events are not synchronized with each other,
bdi->tot_write_bandwidth is an atomic_long_t.
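The bookkeeping can be sketched in userspace with C11 atomics.  This is
an illustrative model, not the kernel implementation: the model_* names
are hypothetical, <stdatomic.h> stands in for the kernel's
atomic_long_t helpers, and a single boolean replaces the IO list state.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct model_bdi {
	atomic_long tot_write_bandwidth;	/* sum over dirty wbs */
};

struct model_wb {
	struct model_bdi *bdi;
	long avg_write_bandwidth;
	bool has_dirty_io;
};

/* A wb joins the sum when it becomes dirty and leaves when clean. */
void model_wb_set_dirty(struct model_wb *wb, bool dirty)
{
	if (wb->has_dirty_io == dirty)
		return;
	wb->has_dirty_io = dirty;
	if (dirty)
		atomic_fetch_add(&wb->bdi->tot_write_bandwidth,
				 wb->avg_write_bandwidth);
	else
		atomic_fetch_sub(&wb->bdi->tot_write_bandwidth,
				 wb->avg_write_bandwidth);
}

/* While dirty, only the change in the average is applied to the sum. */
void model_wb_update_avg(struct model_wb *wb, long avg)
{
	if (wb->has_dirty_io)
		atomic_fetch_add(&wb->bdi->tot_write_bandwidth,
				 avg - wb->avg_write_bandwidth);
	wb->avg_write_bandwidth = avg;
}
```

Applying the delta rather than recomputing the sum keeps each update
independent of the other wbs, which is why an atomic counter is enough.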
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c                | 7 ++++++-
 include/linux/backing-dev-defs.h | 2 ++
 mm/page-writeback.c              | 3 +++
 3 files changed, 11 insertions(+), 1 deletion(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 1718f5f..d41728b 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -312,6 +312,8 @@ static bool inode_wb_list_move_locked(struct inode *inode,
 		return false;
 	} else {
 		set_bit(WB_has_dirty_io, &wb->state);
+		atomic_long_add(wb->avg_write_bandwidth,
+				&wb->bdi->tot_write_bandwidth);
 		return true;
 	}
 }
@@ -332,8 +334,11 @@ static void inode_wb_list_del_locked(struct inode *inode,
 	list_del_init(&inode->i_wb_list);
 
 	if (wb_has_dirty_io(wb) && list_empty(&wb->b_dirty) &&
-	    list_empty(&wb->b_io) && list_empty(&wb->b_more_io))
+	    list_empty(&wb->b_io) && list_empty(&wb->b_more_io)) {
 		clear_bit(WB_has_dirty_io, &wb->state);
+		atomic_long_sub(wb->avg_write_bandwidth,
+				&wb->bdi->tot_write_bandwidth);
+	}
 }
 
 /*
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index d1c0bf4..e1f5f08 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -100,6 +100,8 @@ struct backing_dev_info {
 	unsigned int min_ratio;
 	unsigned int max_ratio, max_prop_frac;
 
+	atomic_long_t tot_write_bandwidth; /* sum of active avg_write_bw */
+
 	struct bdi_writeback wb; /* the root writeback info for this bdi */
 #ifdef CONFIG_CGROUP_WRITEBACK
 	struct radix_tree_root cgwb_tree; /* radix tree of !root cgroup wbs */
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index b115a57..176d0fb 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -879,6 +879,9 @@ static void wb_update_write_bandwidth(struct bdi_writeback *wb,
 		avg += (old - avg) >> 3;
 
 out:
+	if (wb_has_dirty_io(wb))
+		atomic_long_add(avg - wb->avg_write_bandwidth,
+				&wb->bdi->tot_write_bandwidth);
 	wb->write_bandwidth = bw;
 	wb->avg_write_bandwidth = avg;
 }
-- 
2.1.0
- * [PATCH 15/45] writeback: make bdi_has_dirty_io() take multiple bdi_writeback's into account
  2015-01-06 21:25 [PATCHSET RFC block/for-next] writeback: cgroup writeback support Tejun Heo
                   ` (13 preceding siblings ...)
  2015-01-06 21:25 ` [PATCH 14/45] writeback: implement backing_dev_info->tot_write_bandwidth Tejun Heo
@ 2015-01-06 21:25 ` Tejun Heo
  2015-01-06 21:25 ` [PATCH 16/45] writeback: don't issue wb_writeback_work if clean Tejun Heo
                   ` (30 subsequent siblings)
  45 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2015-01-06 21:25 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, jack, hch, hannes, linux-fsdevel, vgoyal, lizefan,
	cgroups, linux-mm, mhocko, clm, fengguang.wu, david, Tejun Heo
bdi_has_dirty_io() used to only reflect whether the root wb
(bdi_writeback) has dirty inodes.  For cgroup writeback support, it
needs to take all active wb's into account.  If any wb on the bdi has
dirty inodes, bdi_has_dirty_io() should return true.
To achieve that, as inode_wb_list_{move|del}_locked() now keep track
of the dirty state transition of each wb, the number of dirty wbs
could be counted in the bdi.  However, the bdi is already aggregating
wb->avg_write_bandwidth, and the sum can easily be guaranteed to be
positive whenever there are dirty inodes by ensuring that
wb->avg_write_bandwidth can't dip below 1.  bdi_has_dirty_io() can
then simply test whether bdi->tot_write_bandwidth is zero or not.
While this bumps the value of wb->avg_write_bandwidth to one when it
used to be zero, this shouldn't cause any meaningful behavior
difference.
bdi_has_dirty_io() is made an inline function which tests whether
->tot_write_bandwidth is non-zero.  Also, WARN_ON_ONCE()'s on its
value are added to inode_wb_list_{move|del}_locked().
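A small userspace model shows why the lower bound of 1 matters.  It is
a sketch under assumptions: the model_* names are hypothetical, a plain
long replaces the atomic counter (single-threaded only), and clamping
the initial average in model_wb_init() is an assumption of this model
rather than something the patch shows.

```c
#include <assert.h>
#include <stdbool.h>

struct model_bdi {
	long tot_write_bandwidth;	/* sum of avg over dirty wbs */
};

struct model_wb {
	struct model_bdi *bdi;
	long avg_write_bandwidth;	/* kept >= 1 */
	bool has_dirty_io;
};

static long model_clamp_avg(long avg)
{
	return avg < 1 ? 1 : avg;
}

void model_wb_init(struct model_wb *wb, struct model_bdi *bdi, long avg)
{
	wb->bdi = bdi;
	wb->avg_write_bandwidth = model_clamp_avg(avg);
	wb->has_dirty_io = false;
}

void model_wb_update_avg(struct model_wb *wb, long avg)
{
	avg = model_clamp_avg(avg);
	if (wb->has_dirty_io) {
		wb->bdi->tot_write_bandwidth += avg - wb->avg_write_bandwidth;
		assert(wb->bdi->tot_write_bandwidth > 0);
	}
	wb->avg_write_bandwidth = avg;
}

void model_wb_set_dirty(struct model_wb *wb, bool dirty)
{
	if (wb->has_dirty_io == dirty)
		return;
	wb->has_dirty_io = dirty;
	wb->bdi->tot_write_bandwidth += dirty ? wb->avg_write_bandwidth
					      : -wb->avg_write_bandwidth;
}

/* Non-zero sum means at least one wb has dirty inodes. */
bool model_bdi_has_dirty_io(const struct model_bdi *bdi)
{
	return bdi->tot_write_bandwidth != 0;
}
```

Without the clamp, a dirty wb whose measured average is zero would
contribute nothing to the sum and the bdi would look clean.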
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c                |  5 +++--
 include/linux/backing-dev-defs.h |  8 ++++++--
 include/linux/backing-dev.h      | 10 +++++++++-
 mm/backing-dev.c                 |  5 -----
 mm/page-writeback.c              | 10 +++++++---
 5 files changed, 25 insertions(+), 13 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index d41728b..35d32ad 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -312,6 +312,7 @@ static bool inode_wb_list_move_locked(struct inode *inode,
 		return false;
 	} else {
 		set_bit(WB_has_dirty_io, &wb->state);
+		WARN_ON_ONCE(!wb->avg_write_bandwidth);
 		atomic_long_add(wb->avg_write_bandwidth,
 				&wb->bdi->tot_write_bandwidth);
 		return true;
@@ -336,8 +337,8 @@ static void inode_wb_list_del_locked(struct inode *inode,
 	if (wb_has_dirty_io(wb) && list_empty(&wb->b_dirty) &&
 	    list_empty(&wb->b_io) && list_empty(&wb->b_more_io)) {
 		clear_bit(WB_has_dirty_io, &wb->state);
-		atomic_long_sub(wb->avg_write_bandwidth,
-				&wb->bdi->tot_write_bandwidth);
+		WARN_ON_ONCE(atomic_long_sub_return(wb->avg_write_bandwidth,
+					&wb->bdi->tot_write_bandwidth) < 0);
 	}
 }
 
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index e1f5f08..4ceda83 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -60,7 +60,7 @@ struct bdi_writeback {
 	unsigned long dirtied_stamp;
 	unsigned long written_stamp;	/* pages written at bw_time_stamp */
 	unsigned long write_bandwidth;	/* the estimated write bandwidth */
-	unsigned long avg_write_bandwidth; /* further smoothed write bw */
+	unsigned long avg_write_bandwidth; /* further smoothed write bw, > 0 */
 
 	/*
 	 * The base dirty throttle rate, re-calculated on every 200ms.
@@ -100,7 +100,11 @@ struct backing_dev_info {
 	unsigned int min_ratio;
 	unsigned int max_ratio, max_prop_frac;
 
-	atomic_long_t tot_write_bandwidth; /* sum of active avg_write_bw */
+	/*
+	 * Sum of avg_write_bw of wbs with dirty inodes.  > 0 if there are
+	 * any dirty wbs, which is depended upon by bdi_has_dirty_io().
+	 */
+	atomic_long_t tot_write_bandwidth;
 
 	struct bdi_writeback wb; /* the root writeback info for this bdi */
 #ifdef CONFIG_CGROUP_WRITEBACK
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 533ff86..49feb0b 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -30,7 +30,6 @@ void bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
 			enum wb_reason reason);
 void bdi_start_background_writeback(struct backing_dev_info *bdi);
 void wb_workfn(struct work_struct *work);
-bool bdi_has_dirty_io(struct backing_dev_info *bdi);
 void wb_wakeup_delayed(struct bdi_writeback *wb);
 
 extern spinlock_t bdi_lock;
@@ -43,6 +42,15 @@ static inline bool wb_has_dirty_io(struct bdi_writeback *wb)
 	return test_bit(WB_has_dirty_io, &wb->state);
 }
 
+static inline bool bdi_has_dirty_io(struct backing_dev_info *bdi)
+{
+	/*
+	 * @bdi->tot_write_bandwidth is guaranteed to be > 0 if there are
+	 * any dirty wbs.  See wb_update_write_bandwidth().
+	 */
+	return atomic_long_read(&bdi->tot_write_bandwidth);
+}
+
 static inline void __add_wb_stat(struct bdi_writeback *wb,
 				 enum wb_stat_item item, s64 amount)
 {
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 9d69e7c..b4a5d9b 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -307,11 +307,6 @@ static int __init default_bdi_init(void)
 }
 subsys_initcall(default_bdi_init);
 
-bool bdi_has_dirty_io(struct backing_dev_info *bdi)
-{
-	return wb_has_dirty_io(&bdi->wb);
-}
-
 /*
  * This function is used when the first inode for this wb is marked dirty. It
  * wakes-up the corresponding bdi thread which should then take care of the
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 176d0fb..cc0ce70 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -879,9 +879,13 @@ static void wb_update_write_bandwidth(struct bdi_writeback *wb,
 		avg += (old - avg) >> 3;
 
 out:
-	if (wb_has_dirty_io(wb))
-		atomic_long_add(avg - wb->avg_write_bandwidth,
-				&wb->bdi->tot_write_bandwidth);
+	/* keep avg > 0 to guarantee that tot > 0 if there are dirty wbs */
+	avg = max(avg, 1LU);
+	if (wb_has_dirty_io(wb)) {
+		long delta = avg - wb->avg_write_bandwidth;
+		WARN_ON_ONCE(atomic_long_add_return(delta,
+					&wb->bdi->tot_write_bandwidth) <= 0);
+	}
 	wb->write_bandwidth = bw;
 	wb->avg_write_bandwidth = avg;
 }
-- 
2.1.0
- * [PATCH 16/45] writeback: don't issue wb_writeback_work if clean
From: Tejun Heo @ 2015-01-06 21:25 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, jack, hch, hannes, linux-fsdevel, vgoyal, lizefan,
	cgroups, linux-mm, mhocko, clm, fengguang.wu, david, Tejun Heo
There are several places in fs/fs-writeback.c which queue
wb_writeback_work without checking whether the target wb
(bdi_writeback) has dirty inodes or not.  The only thing
wb_writeback_work does is write back the dirty inodes of the target
wb, so queueing a work item for a clean wb is essentially a noop.
There are some side effects, such as bandwidth stats being updated and
tracepoints being triggered, but these don't affect the operation in
any meaningful way.
This patch makes both writeback_inodes_sb_nr() and sync_inodes_sb()
skip wb_queue_work() if the target bdi is clean.  Also, it moves the
dirtiness check from wakeup_flusher_threads() to
__wb_start_writeback() so that all its callers benefit from the check.
While the overhead incurred by scheduling a noop work isn't currently
significant, the overhead may be higher with cgroup writeback support
as we may end up issuing noop work items to a lot of clean wb's.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
---
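Illustration only, not part of the patch: a minimal user-space sketch of the idea described above, where the dirtiness check lives inside the "start writeback" helper so that every caller gets it for free.  struct wb_model and both functions are hypothetical stand-ins, not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for struct bdi_writeback. */
struct wb_model {
	bool has_dirty_io;	/* models wb_has_dirty_io() */
	int nr_queued;		/* work items queued so far */
};

/* Models the start-writeback helper: true if a work item was queued. */
static bool wb_model_start_writeback(struct wb_model *wb)
{
	assert(wb != NULL);
	if (!wb->has_dirty_io)
		return false;	/* clean: queueing would be a noop */
	wb->nr_queued++;
	return true;
}

/* Models a caller that no longer has to filter out clean wbs itself. */
static int wb_model_wakeup_all(struct wb_model *wbs, size_t n)
{
	int queued = 0;

	for (size_t i = 0; i < n; i++)
		if (wb_model_start_writeback(&wbs[i]))
			queued++;
	return queued;
}
```

With two dirty entries and one clean entry, only the two dirty ones receive a work item; the clean one is skipped without the caller testing it.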
 fs/fs-writeback.c | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 35d32ad..bb8dbe8 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -234,6 +234,9 @@ static void __wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
 {
 	struct wb_writeback_work *work;
 
+	if (!wb_has_dirty_io(wb))
+		return;
+
 	/*
 	 * This is WB_SYNC_NONE writeback, so if allocation fails just
 	 * wakeup the thread for old dirty data writeback
@@ -1249,11 +1252,8 @@ void wakeup_flusher_threads(long nr_pages, enum wb_reason reason)
 		nr_pages = get_nr_dirty_pages();
 
 	rcu_read_lock();
-	list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) {
-		if (!bdi_has_dirty_io(bdi))
-			continue;
+	list_for_each_entry_rcu(bdi, &bdi_list, bdi_list)
 		__wb_start_writeback(&bdi->wb, nr_pages, false, reason);
-	}
 	rcu_read_unlock();
 }
 
@@ -1481,11 +1481,12 @@ void writeback_inodes_sb_nr(struct super_block *sb,
 		.nr_pages		= nr,
 		.reason			= reason,
 	};
+	struct backing_dev_info *bdi = sb->s_bdi;
 
-	if (sb->s_bdi == &noop_backing_dev_info)
+	if (!bdi_has_dirty_io(bdi) || bdi == &noop_backing_dev_info)
 		return;
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
-	wb_queue_work(&sb->s_bdi->wb, &work);
+	wb_queue_work(&bdi->wb, &work);
 	wait_for_completion(&done);
 }
 EXPORT_SYMBOL(writeback_inodes_sb_nr);
@@ -1563,13 +1564,14 @@ void sync_inodes_sb(struct super_block *sb)
 		.reason		= WB_REASON_SYNC,
 		.for_sync	= 1,
 	};
+	struct backing_dev_info *bdi = sb->s_bdi;
 
 	/* Nothing to do? */
-	if (sb->s_bdi == &noop_backing_dev_info)
+	if (!bdi_has_dirty_io(bdi) || bdi == &noop_backing_dev_info)
 		return;
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
-	wb_queue_work(&sb->s_bdi->wb, &work);
+	wb_queue_work(&bdi->wb, &work);
 	wait_for_completion(&done);
 
 	wait_sb_inodes(sb);
-- 
2.1.0
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
- * [PATCH 17/45] writeback: make bdi->min/max_ratio handling cgroup writeback aware
From: Tejun Heo @ 2015-01-06 21:25 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, jack, hch, hannes, linux-fsdevel, vgoyal, lizefan,
	cgroups, linux-mm, mhocko, clm, fengguang.wu, david, Tejun Heo
bdi->min/max_ratio are user-configurable per-bdi knobs which regulate
the dirty limit of each bdi.  For cgroup writeback, they need to be
further distributed across the wb's (bdi_writeback's) belonging to the
configured bdi.
This patch introduces wb_min_max_ratio() which distributes
bdi->min/max_ratio according to a wb's proportion in the total active
bandwidth of its bdi.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
---
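Illustration only, not part of the patch: a minimal user-space sketch of the proportional distribution described above, assuming the bdi-wide minimum and maximum are passed in as separate percentages.  split_min_max_ratio() and struct ratio_pair are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: per-wb min and max ratios, in percent. */
struct ratio_pair {
	unsigned long min;
	unsigned long max;
};

/*
 * Scale the bdi-wide ratios by this wb's share of the total write
 * bandwidth.
 */
static struct ratio_pair split_min_max_ratio(unsigned long bdi_min,
					     unsigned long bdi_max,
					     unsigned long this_bw,
					     unsigned long tot_bw)
{
	uint64_t min = bdi_min;
	uint64_t max = bdi_max;

	assert(bdi_min <= bdi_max);	/* sanity check on the inputs */

	/*
	 * Scale only when this wb's bandwidth is a strict fraction of
	 * the total.  The comparison also rules out a zero total, so
	 * the division below is safe.
	 */
	if (this_bw < tot_bw) {
		min = min * this_bw / tot_bw;
		max = max * this_bw / tot_bw;
	}

	return (struct ratio_pair){ .min = (unsigned long)min,
				    .max = (unsigned long)max };
}
```

For example, with bdi-wide ratios of 10 and 80 and a wb that accounts for 25 out of 100 units of bandwidth, the sketch yields 2 and 20 (integer division truncates 2.5 to 2).  When the wb's bandwidth is not below the total, the bdi-wide values are returned unchanged.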
 mm/page-writeback.c | 46 ++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 42 insertions(+), 4 deletions(-)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index cc0ce70..e1b74d7 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -155,6 +155,42 @@ static unsigned long writeout_period_time = 0;
  */
 #define VM_COMPLETIONS_PERIOD_LEN (3*HZ)
 
+#ifdef CONFIG_CGROUP_WRITEBACK
+
+static void wb_min_max_ratio(struct bdi_writeback *wb,
+			     unsigned long *minp, unsigned long *maxp)
+{
+	unsigned long this_bw = wb->avg_write_bandwidth;
+	unsigned long tot_bw = atomic_long_read(&wb->bdi->tot_write_bandwidth);
+	unsigned long long min = wb->bdi->min_ratio;
+	unsigned long long max = wb->bdi->max_ratio;
+
+	/*
+	 * @wb may already be clean by the time control reaches here and
+	 * the total may not include its bw.
+	 */
+	if (this_bw < tot_bw) {
+		min *= this_bw;
+		max *= this_bw;
+		do_div(min, tot_bw);
+		do_div(max, tot_bw);
+	}
+
+	*minp = min;
+	*maxp = max;
+}
+
+#else	/* CONFIG_CGROUP_WRITEBACK */
+
+static void wb_min_max_ratio(struct bdi_writeback *wb,
+			     unsigned long *minp, unsigned long *maxp)
+{
+	*minp = wb->bdi->min_ratio;
+	*maxp = wb->bdi->max_ratio;
+}
+
+#endif	/* CONFIG_CGROUP_WRITEBACK */
+
 /*
  * In a memory zone, there is a certain amount of pages we consider
  * available for the page cache, which is essentially the number of
@@ -539,9 +575,9 @@ static unsigned long hard_dirty_limit(unsigned long thresh)
  */
 unsigned long wb_dirty_limit(struct bdi_writeback *wb, unsigned long dirty)
 {
-	struct backing_dev_info *bdi = wb->bdi;
 	u64 wb_dirty;
 	long numerator, denominator;
+	unsigned long wb_min_ratio, wb_max_ratio;
 
 	/*
 	 * Calculate this BDI's share of the dirty ratio.
@@ -552,9 +588,11 @@ unsigned long wb_dirty_limit(struct bdi_writeback *wb, unsigned long dirty)
 	wb_dirty *= numerator;
 	do_div(wb_dirty, denominator);
 
-	wb_dirty += (dirty * bdi->min_ratio) / 100;
-	if (wb_dirty > (dirty * bdi->max_ratio) / 100)
-		wb_dirty = dirty * bdi->max_ratio / 100;
+	wb_min_max_ratio(wb, &wb_min_ratio, &wb_max_ratio);
+
+	wb_dirty += (dirty * wb_min_ratio) / 100;
+	if (wb_dirty > (dirty * wb_max_ratio) / 100)
+		wb_dirty = dirty * wb_max_ratio / 100;
 
 	return wb_dirty;
 }
-- 
2.1.0
- * [PATCH 18/45] writeback: implement bdi_for_each_wb()
From: Tejun Heo @ 2015-01-06 21:25 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, jack, hch, hannes, linux-fsdevel, vgoyal, lizefan,
	cgroups, linux-mm, mhocko, clm, fengguang.wu, david, Tejun Heo
bdi_for_each_wb() will be used to implement bdi-wide operations which
should be distributed across all of the bdi's cgroup bdi_writebacks.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
---
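Illustration only, not part of the patch: a user-space sketch that mirrors the shape of the !CONFIG_CGROUP_WRITEBACK variant of the iterator, to show how a walk starting at ID 0 visits the root wb exactly once.  The *_model names are hypothetical stand-ins for the kernel types, and the macro relies on GNU C statement expressions, so it assumes gcc or clang.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified, hypothetical stand-ins for the kernel structures. */
struct wb_model { int id; };
struct bdi_model { struct wb_model wb; };	/* wb is the root wb */

struct wb_iter_model { int next_id; };

/*
 * Yields the embedded root wb once when the walk starts from ID 0 and
 * nothing otherwise.  The loop condition is a statement expression
 * whose value is the pointer just assigned to @wb_cur.
 */
#define bdi_for_each_wb_model(wb_cur, bdi, iter, start_id)		\
	for ((iter)->next_id = (start_id);				\
	     ({ (wb_cur) = !(iter)->next_id++ ? &(bdi)->wb : NULL; }); )

/* Count how many times a walk starting at @start_id sees the root wb. */
static int count_wbs(struct bdi_model *bdi, int start_id)
{
	struct wb_iter_model iter;
	struct wb_model *wb;
	int n = 0;

	assert(bdi != NULL);
	bdi_for_each_wb_model(wb, bdi, &iter, start_id)
		if (wb == &bdi->wb)
			n++;
	return n;
}
```

Starting from ID 0 the body runs once for the root wb; starting from any nonzero ID the body never runs, which matches the "ascending blkcg ID order starting from @start_blkcg_id" contract when the root wb is the only one.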
 include/linux/backing-dev.h | 64 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 64 insertions(+)
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 49feb0b..37c4299 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -380,6 +380,61 @@ static inline struct bdi_writeback *page_cgwb_wb(struct page *page)
 	return wb;
 }
 
+struct wb_iter {
+	int			start_blkcg_id;
+	struct radix_tree_iter	tree_iter;
+	void			**slot;
+};
+
+static inline struct bdi_writeback *__wb_iter_next(struct wb_iter *iter,
+						   struct backing_dev_info *bdi)
+{
+	struct radix_tree_iter *titer = &iter->tree_iter;
+
+	WARN_ON_ONCE(!rcu_read_lock_held());
+
+	if (iter->start_blkcg_id >= 0) {
+		iter->slot = radix_tree_iter_init(titer, iter->start_blkcg_id);
+		iter->start_blkcg_id = -1;
+	} else {
+		iter->slot = radix_tree_next_slot(iter->slot, titer, 0);
+	}
+
+	if (!iter->slot)
+		iter->slot = radix_tree_next_chunk(&bdi->cgwb_tree, titer, 0);
+	if (iter->slot)
+		return *iter->slot;
+	return NULL;
+}
+
+static inline struct bdi_writeback *__wb_iter_init(struct wb_iter *iter,
+						   struct backing_dev_info *bdi,
+						   int start_blkcg_id)
+{
+	iter->start_blkcg_id = start_blkcg_id;
+
+	if (start_blkcg_id)
+		return __wb_iter_next(iter, bdi);
+	else
+		return &bdi->wb;
+}
+
+/**
+ * bdi_for_each_wb - walk all wb's of a bdi in ascending blkcg ID order
+ * @wb_cur: cursor struct bdi_writeback pointer
+ * @bdi: bdi to walk wb's of
+ * @iter: pointer to struct wb_iter to be used as iteration buffer
+ * @start_blkcg_id: blkcg ID to start iteration from
+ *
+ * Iterate @wb_cur through the wb's (bdi_writeback's) of @bdi in ascending
+ * blkcg ID order starting from @start_blkcg_id.  @iter is struct wb_iter
+ * to be used as temp storage during iteration.  rcu_read_lock() must be
+ * held throughout iteration.
+ */
+#define bdi_for_each_wb(wb_cur, bdi, iter, start_blkcg_id)		\
+	for ((wb_cur) = __wb_iter_init(iter, bdi, start_blkcg_id);	\
+	     (wb_cur); (wb_cur) = __wb_iter_next(iter, bdi))
+
 #else	/* CONFIG_CGROUP_WRITEBACK */
 
 static inline bool mapping_cgwb_enabled(struct address_space *mapping)
@@ -420,6 +475,15 @@ static inline struct bdi_writeback *page_cgwb_wb(struct page *page)
 	return &page->mapping->backing_dev_info->wb;
 }
 
+struct wb_iter {
+	int		next_id;
+};
+
+#define bdi_for_each_wb(wb_cur, bdi, iter, start_blkcg_id)		\
+	for ((iter)->next_id = (start_blkcg_id);			\
+	     ({	(wb_cur) = !(iter)->next_id++ ? &(bdi)->wb : NULL;	\
+	     }); )
+
 #endif	/* CONFIG_CGROUP_WRITEBACK */
 
 static inline int mapping_read_congested(struct address_space *mapping,
-- 
2.1.0
- * [PATCH 19/45] writeback: remove bdi_start_writeback()
From: Tejun Heo @ 2015-01-06 21:25 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, jack, hch, hannes, linux-fsdevel, vgoyal, lizefan,
	cgroups, linux-mm, mhocko, clm, fengguang.wu, david, Tejun Heo
bdi_start_writeback() is a thin wrapper on top of
__wb_start_writeback() which is used only by laptop_mode_timer_fn().
This patch removes bdi_start_writeback(), renames
__wb_start_writeback() to wb_start_writeback() and makes
laptop_mode_timer_fn() use it instead.
This doesn't cause any functional difference and will ease making
laptop_mode_timer_fn() cgroup writeback aware.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c           | 24 +++---------------------
 include/linux/backing-dev.h |  4 ++--
 mm/page-writeback.c         |  4 ++--
 3 files changed, 7 insertions(+), 25 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index bb8dbe8..18d8e72 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -229,8 +229,8 @@ void init_dirty_inode_context(struct dirty_context *dctx, struct inode *inode)
 	dctx->wb = &inode_to_bdi(inode)->wb;
 }
 
-static void __wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
-				 bool range_cyclic, enum wb_reason reason)
+void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
+			bool range_cyclic, enum wb_reason reason)
 {
 	struct wb_writeback_work *work;
 
@@ -257,24 +257,6 @@ static void __wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
 }
 
 /**
- * bdi_start_writeback - start writeback
- * @bdi: the backing device to write from
- * @nr_pages: the number of pages to write
- * @reason: reason why some writeback work was initiated
- *
- * Description:
- *   This does WB_SYNC_NONE opportunistic writeback. The IO is only
- *   started when this function returns, we make no guarantees on
- *   completion. Caller need not hold sb s_umount semaphore.
- *
- */
-void bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
-			enum wb_reason reason)
-{
-	__wb_start_writeback(&bdi->wb, nr_pages, true, reason);
-}
-
-/**
  * bdi_start_background_writeback - start background writeback
  * @bdi: the backing device to write from
  *
@@ -1253,7 +1235,7 @@ void wakeup_flusher_threads(long nr_pages, enum wb_reason reason)
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(bdi, &bdi_list, bdi_list)
-		__wb_start_writeback(&bdi->wb, nr_pages, false, reason);
+		wb_start_writeback(&bdi->wb, nr_pages, false, reason);
 	rcu_read_unlock();
 }
 
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 37c4299..c6278ee 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -26,8 +26,8 @@ int bdi_register(struct backing_dev_info *bdi, struct device *parent,
 int bdi_register_dev(struct backing_dev_info *bdi, dev_t dev);
 void bdi_unregister(struct backing_dev_info *bdi);
 int __must_check bdi_setup_and_register(struct backing_dev_info *, char *, unsigned int);
-void bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
-			enum wb_reason reason);
+void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
+			bool range_cyclic, enum wb_reason reason);
 void bdi_start_background_writeback(struct backing_dev_info *bdi);
 void wb_workfn(struct work_struct *work);
 void wb_wakeup_delayed(struct bdi_writeback *wb);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index e1b74d7..18bf51d 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1731,8 +1731,8 @@ void laptop_mode_timer_fn(unsigned long data)
 	 * threshold
 	 */
 	if (bdi_has_dirty_io(&q->backing_dev_info))
-		bdi_start_writeback(&q->backing_dev_info, nr_pages,
-					WB_REASON_LAPTOP_TIMER);
+		wb_start_writeback(&q->backing_dev_info.wb, nr_pages, true,
+				   WB_REASON_LAPTOP_TIMER);
 }
 
 /*
-- 
2.1.0
- * [PATCH 20/45] writeback: make laptop_mode_timer_fn() handle multiple bdi_writeback's
From: Tejun Heo @ 2015-01-06 21:25 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, jack, hch, hannes, linux-fsdevel, vgoyal, lizefan,
	cgroups, linux-mm, mhocko, clm, fengguang.wu, david, Tejun Heo
For cgroup writeback support, each bdi-wide operation should be
distributed to all of the bdi's wb's (bdi_writeback's).
This patch updates laptop_mode_timer_fn() so that it invokes
wb_start_writeback() on all wb's rather than just the root one.  As
the intent is writing out all dirty data, there's no reason to split
the number of pages to write.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
---
 mm/page-writeback.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 18bf51d..190e2a2 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1725,14 +1725,20 @@ void laptop_mode_timer_fn(unsigned long data)
 	struct request_queue *q = (struct request_queue *)data;
 	int nr_pages = global_page_state(NR_FILE_DIRTY) +
 		global_page_state(NR_UNSTABLE_NFS);
+	struct bdi_writeback *wb;
+	struct wb_iter iter;
 
 	/*
 	 * We want to write everything out, not just down to the dirty
 	 * threshold
 	 */
-	if (bdi_has_dirty_io(&q->backing_dev_info))
-		wb_start_writeback(&q->backing_dev_info.wb, nr_pages, true,
-				   WB_REASON_LAPTOP_TIMER);
+	if (!bdi_has_dirty_io(&q->backing_dev_info))
+		return;
+
+	bdi_for_each_wb(wb, &q->backing_dev_info, &iter, 0)
+		if (wb_has_dirty_io(wb))
+			wb_start_writeback(wb, nr_pages, true,
+					   WB_REASON_LAPTOP_TIMER);
 }
 
 /*
-- 
2.1.0
- * [PATCH 21/45] writeback: make writeback_in_progress() take bdi_writeback instead of backing_dev_info
From: Tejun Heo @ 2015-01-06 21:25 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, jack, hch, hannes, linux-fsdevel, vgoyal, lizefan,
	cgroups, linux-mm, mhocko, clm, fengguang.wu, david, Tejun Heo
writeback_in_progress() currently takes @bdi and returns whether
writeback is in progress on its root wb (bdi_writeback).  In
preparation for cgroup writeback support, make it take wb instead.
While at it, make it an inline function.
This patch doesn't make any functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c           | 15 +--------------
 include/linux/backing-dev.h | 12 +++++++++++-
 mm/page-writeback.c         |  4 ++--
 3 files changed, 14 insertions(+), 17 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 18d8e72..6ab113b 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -53,19 +53,6 @@ struct wb_writeback_work {
 	struct completion *done;	/* set if the caller waits */
 };
 
-/**
- * writeback_in_progress - determine whether there is writeback in progress
- * @bdi: the device's backing_dev_info structure.
- *
- * Determine whether there is writeback waiting to be handled against a
- * backing device.
- */
-int writeback_in_progress(struct backing_dev_info *bdi)
-{
-	return test_bit(WB_writeback_running, &bdi->wb.state);
-}
-EXPORT_SYMBOL(writeback_in_progress);
-
 static inline struct inode *wb_inode(struct list_head *head)
 {
 	return list_entry(head, struct inode, i_wb_list);
@@ -1501,7 +1488,7 @@ int try_to_writeback_inodes_sb_nr(struct super_block *sb,
 				  unsigned long nr,
 				  enum wb_reason reason)
 {
-	if (writeback_in_progress(sb->s_bdi))
+	if (writeback_in_progress(&sb->s_bdi->wb))
 		return 1;
 
 	if (!down_read_trylock(&sb->s_umount))
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index c6278ee..953fa01 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -186,7 +186,17 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio);
 extern struct backing_dev_info default_backing_dev_info;
 extern struct backing_dev_info noop_backing_dev_info;
 
-int writeback_in_progress(struct backing_dev_info *bdi);
+/**
+ * writeback_in_progress - determine whether there is writeback in progress
+ * @wb: bdi_writeback of interest
+ *
+ * Determine whether there is writeback waiting to be handled against a
+ * bdi_writeback.
+ */
+static inline bool writeback_in_progress(struct bdi_writeback *wb)
+{
+	return test_bit(WB_writeback_running, &wb->state);
+}
 
 static inline int wb_congested(struct bdi_writeback *wb, int bdi_bits)
 {
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 190e2a2..b250ee2 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1449,7 +1449,7 @@ static void balance_dirty_pages(struct address_space *mapping,
 			break;
 		}
 
-		if (unlikely(!writeback_in_progress(bdi)))
+		if (unlikely(!writeback_in_progress(wb)))
 			bdi_start_background_writeback(bdi);
 
 		if (!strictlimit)
@@ -1568,7 +1568,7 @@ pause:
 	if (!dirty_exceeded && wb->dirty_exceeded)
 		wb->dirty_exceeded = 0;
 
-	if (writeback_in_progress(bdi))
+	if (writeback_in_progress(wb))
 		return;
 
 	/*
-- 
2.1.0
- * [PATCH 44/45] mpage: make __mpage_writepage() honor cgroup writeback
From: Tejun Heo @ 2015-01-06 21:26 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, jack, hch, hannes, linux-fsdevel, vgoyal, lizefan,
	cgroups, linux-mm, mhocko, clm, fengguang.wu, david, Tejun Heo,
	Andrew Morton, Alexander Viro
__mpage_writepage() is used to implement mpage_writepages() which in
turn is used for ->writepages() of various filesystems.  All writeback
logic is now updated to handle cgroup writeback and the block cgroup
to issue IOs for is encoded in writeback_control and can be retrieved
using wbc_blkcg_css(); however, __mpage_writepage() currently ignores
the blkcg indicated by wbc and issues all bio's without explicit blkcg
association.
This patch updates __mpage_writepage() so that the issued bio's are
associated with wbc_blkcg_css().
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
---
 fs/mpage.c | 6 ++++++
 1 file changed, 6 insertions(+)
diff --git a/fs/mpage.c b/fs/mpage.c
index 587c7ed..84921b2 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -595,6 +595,8 @@ page_is_mapped:
 
 alloc_new:
 	if (bio == NULL) {
+		struct cgroup_subsys_state *blkcg_css;
+
 		if (first_unmapped == blocks_per_page) {
 			if (!bdev_write_page(bdev, blocks[0] << (blkbits - 9),
 								page, wbc)) {
@@ -606,6 +608,10 @@ alloc_new:
 				bio_get_nr_vecs(bdev), GFP_NOFS|__GFP_HIGH);
 		if (bio == NULL)
 			goto confused;
+
+		blkcg_css = wbc_blkcg_css(wbc);
+		if (blkcg_css)
+			bio_associate_blkcg(bio, blkcg_css);
 	}
 
 	/*
-- 
2.1.0
- * [PATCH 22/45] writeback: make bdi_start_background_writeback() take bdi_writeback instead of backing_dev_info
From: Tejun Heo @ 2015-01-06 21:25 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, jack, hch, hannes, linux-fsdevel, vgoyal, lizefan,
	cgroups, linux-mm, mhocko, clm, fengguang.wu, david, Tejun Heo
bdi_start_background_writeback() currently takes @bdi and kicks the
root wb (bdi_writeback).  In preparation for cgroup writeback support,
make it take wb instead.
This patch doesn't make any functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c           | 12 ++++++------
 include/linux/backing-dev.h |  2 +-
 mm/page-writeback.c         |  4 ++--
 3 files changed, 9 insertions(+), 9 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 6ab113b..c209ff1 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -244,23 +244,23 @@ void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
 }
 
 /**
- * bdi_start_background_writeback - start background writeback
- * @bdi: the backing device to write from
+ * wb_start_background_writeback - start background writeback
+ * @wb: bdi_writeback to write from
  *
  * Description:
  *   This makes sure WB_SYNC_NONE background writeback happens. When
- *   this function returns, it is only guaranteed that for given BDI
+ *   this function returns, it is only guaranteed that for given wb
  *   some IO is happening if we are over background dirty threshold.
  *   Caller need not hold sb s_umount semaphore.
  */
-void bdi_start_background_writeback(struct backing_dev_info *bdi)
+void wb_start_background_writeback(struct bdi_writeback *wb)
 {
 	/*
 	 * We just wake up the flusher thread. It will perform background
 	 * writeback as soon as there is no other work to do.
 	 */
-	trace_writeback_wake_background(bdi);
-	wb_wakeup(&bdi->wb);
+	trace_writeback_wake_background(wb->bdi);
+	wb_wakeup(wb);
 }
 
 /**
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 953fa01..4cdab7c 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -28,7 +28,7 @@ void bdi_unregister(struct backing_dev_info *bdi);
 int __must_check bdi_setup_and_register(struct backing_dev_info *, char *, unsigned int);
 void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
 			bool range_cyclic, enum wb_reason reason);
-void bdi_start_background_writeback(struct backing_dev_info *bdi);
+void wb_start_background_writeback(struct bdi_writeback *wb);
 void wb_workfn(struct work_struct *work);
 void wb_wakeup_delayed(struct bdi_writeback *wb);
 
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index b250ee2..4cf365c 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1450,7 +1450,7 @@ static void balance_dirty_pages(struct address_space *mapping,
 		}
 
 		if (unlikely(!writeback_in_progress(wb)))
-			bdi_start_background_writeback(bdi);
+			wb_start_background_writeback(wb);
 
 		if (!strictlimit)
 			wb_dirty_limits(wb, dirty_thresh, background_thresh,
@@ -1583,7 +1583,7 @@ pause:
 		return;
 
 	if (nr_reclaimable > background_thresh)
-		bdi_start_background_writeback(bdi);
+		wb_start_background_writeback(wb);
 }
 
 static DEFINE_PER_CPU(int, bdp_ratelimits);
-- 
2.1.0
- * [PATCH 23/45] writeback: make wakeup_flusher_threads() handle multiple bdi_writeback's
From: Tejun Heo @ 2015-01-06 21:26 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, jack, hch, hannes, linux-fsdevel, vgoyal, lizefan,
	cgroups, linux-mm, mhocko, clm, fengguang.wu, david, Tejun Heo
wakeup_flusher_threads() currently only starts writeback on the root
wb (bdi_writeback).  For cgroup writeback support, update the function
to wake up all wbs and distribute the number of pages to write
according to the proportion of each wb's write bandwidth, which is
implemented in wb_split_bdi_pages().
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
---
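Illustration only, not part of the patch: a user-space sketch of the bandwidth-proportional split described above, with the rounding-up division written out by hand.  split_pages_by_bw() is a hypothetical name, and the sketch assumes a non-negative page count whose product with the bandwidth fits in 64 bits.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/*
 * Return the share of @nr_pages for a wb with bandwidth @this_bw out
 * of a total of @tot_bw.
 */
static long split_pages_by_bw(long nr_pages, unsigned long this_bw,
			      unsigned long tot_bw)
{
	assert(nr_pages >= 0);

	/* LONG_MAX means "write everything" and is never split. */
	if (nr_pages == LONG_MAX)
		return LONG_MAX;

	/* No usable proportion: err on the side of writing more. */
	if (!tot_bw || this_bw >= tot_bw)
		return nr_pages;

	/* Round up, so a nonzero share never truncates to zero. */
	uint64_t scaled = (uint64_t)nr_pages * this_bw;
	return (long)((scaled + tot_bw - 1) / tot_bw);
}
```

For 1000 pages, a wb with 30 out of 100 units of bandwidth gets 300 pages, and a wb with 1 out of 3 gets 334.  Because every share is rounded up, the shares of all wbs can add up to slightly more than the requested total, which is consistent with the stated preference for writing more rather than less.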
 fs/fs-writeback.c | 46 ++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 44 insertions(+), 2 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index c209ff1..8bf13e6 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -164,6 +164,34 @@ force_root:
 	dctx->wb = &bdi->wb;
 }
 
+/**
+ * wb_split_bdi_pages - split nr_pages to write according to bandwidth
+ * @wb: target bdi_writeback to split @nr_pages to
+ * @nr_pages: number of pages to write for the whole bdi
+ *
+ * Split @wb's portion of @nr_pages according to @wb's write bandwidth in
+ * relation to the total write bandwidth of all wb's w/ dirty inodes on
+ * @wb->bdi.
+ */
+static long wb_split_bdi_pages(struct bdi_writeback *wb, long nr_pages)
+{
+	unsigned long this_bw = wb->avg_write_bandwidth;
+	unsigned long tot_bw = atomic_long_read(&wb->bdi->tot_write_bandwidth);
+
+	if (nr_pages == LONG_MAX)
+		return LONG_MAX;
+
+	/*
+	 * This may be called on clean wb's and proportional distribution
+	 * may not make sense, just use the original @nr_pages in those
+	 * cases.  In general, we wanna err on the side of writing more.
+	 */
+	if (!tot_bw || this_bw >= tot_bw)
+		return nr_pages;
+	else
+		return DIV_ROUND_UP_ULL((u64)nr_pages * this_bw, tot_bw);
+}
+
 #else	/* CONFIG_CGROUP_WRITEBACK */
 
 static void init_cgwb_dirty_page_context(struct dirty_context *dctx)
@@ -171,6 +199,11 @@ static void init_cgwb_dirty_page_context(struct dirty_context *dctx)
 	dctx->wb = &dctx->mapping->backing_dev_info->wb;
 }
 
+static long wb_split_bdi_pages(struct bdi_writeback *wb, long nr_pages)
+{
+	return nr_pages;
+}
+
 #endif	/* CONFIG_CGROUP_WRITEBACK */
 
 /**
@@ -1221,8 +1254,17 @@ void wakeup_flusher_threads(long nr_pages, enum wb_reason reason)
 		nr_pages = get_nr_dirty_pages();
 
 	rcu_read_lock();
-	list_for_each_entry_rcu(bdi, &bdi_list, bdi_list)
-		wb_start_writeback(&bdi->wb, nr_pages, false, reason);
+	list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) {
+		struct bdi_writeback *wb;
+		struct wb_iter iter;
+
+		if (!bdi_has_dirty_io(bdi))
+			continue;
+
+		bdi_for_each_wb(wb, bdi, &iter, 0)
+			wb_start_writeback(wb, wb_split_bdi_pages(wb, nr_pages),
+					   false, reason);
+	}
 	rcu_read_unlock();
 }
 
-- 
2.1.0
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related	[flat|nested] 54+ messages in thread
- * [PATCH 24/45] writeback: add wb_writeback_work->auto_free
  2015-01-06 21:25 [PATCHSET RFC block/for-next] writeback: cgroup writeback support Tejun Heo
                   ` (22 preceding siblings ...)
  2015-01-06 21:26 ` [PATCH 23/45] writeback: make wakeup_flusher_threads() handle multiple bdi_writeback's Tejun Heo
@ 2015-01-06 21:26 ` Tejun Heo
  2015-01-06 21:26 ` [PATCH 25/45] writeback: implement bdi_wait_for_completion() Tejun Heo
                   ` (21 subsequent siblings)
  45 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2015-01-06 21:26 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, jack, hch, hannes, linux-fsdevel, vgoyal, lizefan,
	cgroups, linux-mm, mhocko, clm, fengguang.wu, david, Tejun Heo
Currently, a wb_writeback_work is freed automatically on completion if
it doesn't have ->done set.  Add wb_writeback_work->auto_free to make
the switch explicit.  This will help cgroup writeback support where
waiting for completion and whether to free automatically don't
necessarily move together.
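
As a rough userspace illustration of why the two decisions are kept separate, the sketch below finishes a work item whose "free on completion" flag and completion pointer are independent. The names `struct work`, `work_alloc()` and `work_finish()` are hypothetical stand-ins, not kernel APIs, and a plain counter replaces the kernel's completion. The point it tries to show is that ->done has to be copied before the item is possibly freed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for wb_writeback_work. */
struct work {
	bool auto_free;	/* free the item once it has been processed */
	int *done;	/* optional completion counter, NULL if nobody waits */
};

/* Allocate a heap work item; returns NULL on allocation failure. */
static struct work *work_alloc(bool auto_free, int *done)
{
	struct work *w = malloc(sizeof(*w));

	if (w) {
		w->auto_free = auto_free;
		w->done = done;
	}
	return w;
}

/*
 * Finish one work item.  ->done is copied to a local first because the
 * item may be freed before the completion is signalled.
 */
static void work_finish(struct work *w)
{
	int *done;

	assert(w != NULL);
	done = w->done;

	if (w->auto_free)
		free(w);
	if (done)
		(*done)++;
}
```

With this shape, a heap item can be both auto-freed and waited upon, while an on-stack item can be neither, which is the combination the old "free unless ->done is set" rule could not express.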
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 8bf13e6..3c012b8 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -47,6 +47,7 @@ struct wb_writeback_work {
 	unsigned int range_cyclic:1;
 	unsigned int for_background:1;
 	unsigned int for_sync:1;	/* sync(2) WB_SYNC_ALL writeback */
+	unsigned int auto_free:1;	/* free on completion */
 	enum wb_reason reason;		/* why was writeback initiated? */
 
 	struct list_head list;		/* pending work list */
@@ -272,6 +273,7 @@ void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
 	work->nr_pages	= nr_pages;
 	work->range_cyclic = range_cyclic;
 	work->reason	= reason;
+	work->auto_free	= 1;
 
 	wb_queue_work(wb, work);
 }
@@ -1173,19 +1175,16 @@ static long wb_do_writeback(struct bdi_writeback *wb)
 
 	set_bit(WB_writeback_running, &wb->state);
 	while ((work = get_next_work_item(wb)) != NULL) {
+		struct completion *done = work->done;
 
 		trace_writeback_exec(wb->bdi, work);
 
 		wrote += wb_writeback(wb, work);
 
-		/*
-		 * Notify the caller of completion if this is a synchronous
-		 * work item, otherwise just free it.
-		 */
-		if (work->done)
-			complete(work->done);
-		else
+		if (work->auto_free)
 			kfree(work);
+		if (done)
+			complete(done);
 	}
 
 	/*
-- 
2.1.0
^ permalink raw reply related	[flat|nested] 54+ messages in thread
- * [PATCH 25/45] writeback: implement bdi_wait_for_completion()
  2015-01-06 21:25 [PATCHSET RFC block/for-next] writeback: cgroup writeback support Tejun Heo
                   ` (23 preceding siblings ...)
  2015-01-06 21:26 ` [PATCH 24/45] writeback: add wb_writeback_work->auto_free Tejun Heo
@ 2015-01-06 21:26 ` Tejun Heo
  2015-01-06 21:26 ` [PATCH 26/45] writeback: implement wb_wait_for_single_work() Tejun Heo
                   ` (20 subsequent siblings)
  45 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2015-01-06 21:26 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, jack, hch, hannes, linux-fsdevel, vgoyal, lizefan,
	cgroups, linux-mm, mhocko, clm, fengguang.wu, david, Tejun Heo
The completion of a wb_writeback_work can be waited upon by setting
its ->done to a struct completion and waiting on it; however, for
cgroup writeback support, it's necessary to issue multiple work items
to multiple bdi_writebacks and wait for the completion of all.
This patch implements wb_completion which can wait for multiple work
items and replaces the struct completion with it.  It can be defined
using DEFINE_WB_COMPLETION_ONSTACK(), used for multiple work items and
waited for by wb_wait_for_completion().
Nobody currently issues multiple work items and this patch doesn't
introduce any behavior changes.
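
The counting scheme can be sketched in userspace C11 as follows. This is only an illustration under stated assumptions: `struct completion_cnt` and the `cc_*()` helpers are hypothetical names, the actual sleeping and waking on a wait queue is left out, and the helpers merely report when a wake-up would be due.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace analogue of struct wb_completion. */
struct completion_cnt {
	atomic_int cnt;
};

/* Start at 1: the issuer holds a count while it is still queueing. */
static void cc_init(struct completion_cnt *cc)
{
	atomic_init(&cc->cnt, 1);
}

/* Called once per work item that is actually queued. */
static void cc_queued(struct completion_cnt *cc)
{
	atomic_fetch_add(&cc->cnt, 1);
}

/* Called when a work item finishes; true means the waiter should be woken. */
static bool cc_finished(struct completion_cnt *cc)
{
	return atomic_fetch_sub(&cc->cnt, 1) == 1;
}

/* Drop the issuer's initial count; true means nothing is left to wait for. */
static bool cc_start_wait(struct completion_cnt *cc)
{
	assert(atomic_load(&cc->cnt) >= 1);
	return atomic_fetch_sub(&cc->cnt, 1) == 1;
}
```

Because the count starts at one, work items that finish while the issuer is still queueing others cannot drive it to zero early; only dropping the initial count in the wait path makes zero reachable.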
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c                | 57 +++++++++++++++++++++++++++++++---------
 include/linux/backing-dev-defs.h |  2 ++
 mm/backing-dev.c                 |  1 +
 3 files changed, 48 insertions(+), 12 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 3c012b8..6527692 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -34,6 +34,10 @@
  */
 #define MIN_WRITEBACK_PAGES	(4096UL >> (PAGE_CACHE_SHIFT - 10))
 
+struct wb_completion {
+	atomic_t		cnt;
+};
+
 /*
  * Passed into wb_writeback(), essentially a subset of writeback_control
  */
@@ -51,9 +55,21 @@ struct wb_writeback_work {
 	enum wb_reason reason;		/* why was writeback initiated? */
 
 	struct list_head list;		/* pending work list */
-	struct completion *done;	/* set if the caller waits */
+	struct wb_completion *done;	/* set if the caller waits */
 };
 
+/*
+ * If one wants to wait for one or more wb_writeback_works, each work's
+ * ->done should be set to a wb_completion defined using the following
+ * macro.  Once all work items are issued with wb_queue_work(), the caller
+ * can wait for the completion of all using wb_wait_for_completion().  Work
+ * items which are waited upon aren't freed automatically on completion.
+ */
+#define DEFINE_WB_COMPLETION_ONSTACK(cmpl)				\
+	struct wb_completion cmpl = {					\
+		.cnt		= ATOMIC_INIT(1),			\
+	}
+
 static inline struct inode *wb_inode(struct list_head *head)
 {
 	return list_entry(head, struct inode, i_wb_list);
@@ -83,17 +99,34 @@ static void wb_queue_work(struct bdi_writeback *wb,
 	trace_writeback_queue(wb->bdi, work);
 
 	spin_lock_bh(&wb->work_lock);
-	if (!test_bit(WB_registered, &wb->state)) {
-		if (work->done)
-			complete(work->done);
+	if (!test_bit(WB_registered, &wb->state))
 		goto out_unlock;
-	}
+	if (work->done)
+		atomic_inc(&work->done->cnt);
 	list_add_tail(&work->list, &wb->work_list);
 	mod_delayed_work(bdi_wq, &wb->dwork, 0);
 out_unlock:
 	spin_unlock_bh(&wb->work_lock);
 }
 
+/**
+ * wb_wait_for_completion - wait for completion of bdi_writeback_works
+ * @bdi: bdi work items were issued to
+ * @done: target wb_completion
+ *
+ * Wait for one or more work items issued to @bdi with their ->done field
+ * set to @done, which should have been defined with
+ * DEFINE_WB_COMPLETION_ONSTACK().  This function returns after all such
+ * work items are completed.  Work items which are waited upon aren't freed
+ * automatically on completion.
+ */
+static void wb_wait_for_completion(struct backing_dev_info *bdi,
+				   struct wb_completion *done)
+{
+	atomic_dec(&done->cnt);		/* put down the initial count */
+	wait_event(bdi->wb_waitq, !atomic_read(&done->cnt));
+}
+
 #ifdef CONFIG_CGROUP_WRITEBACK
 
 /**
@@ -1175,7 +1208,7 @@ static long wb_do_writeback(struct bdi_writeback *wb)
 
 	set_bit(WB_writeback_running, &wb->state);
 	while ((work = get_next_work_item(wb)) != NULL) {
-		struct completion *done = work->done;
+		struct wb_completion *done = work->done;
 
 		trace_writeback_exec(wb->bdi, work);
 
@@ -1183,8 +1216,8 @@ static long wb_do_writeback(struct bdi_writeback *wb)
 
 		if (work->auto_free)
 			kfree(work);
-		if (done)
-			complete(done);
+		if (done && atomic_dec_and_test(&done->cnt))
+			wake_up_all(&wb->bdi->wb_waitq);
 	}
 
 	/*
@@ -1482,7 +1515,7 @@ void writeback_inodes_sb_nr(struct super_block *sb,
 			    unsigned long nr,
 			    enum wb_reason reason)
 {
-	DECLARE_COMPLETION_ONSTACK(done);
+	DEFINE_WB_COMPLETION_ONSTACK(done);
 	struct wb_writeback_work work = {
 		.sb			= sb,
 		.sync_mode		= WB_SYNC_NONE,
@@ -1497,7 +1530,7 @@ void writeback_inodes_sb_nr(struct super_block *sb,
 		return;
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 	wb_queue_work(&bdi->wb, &work);
-	wait_for_completion(&done);
+	wb_wait_for_completion(bdi, &done);
 }
 EXPORT_SYMBOL(writeback_inodes_sb_nr);
 
@@ -1564,7 +1597,7 @@ EXPORT_SYMBOL(try_to_writeback_inodes_sb);
  */
 void sync_inodes_sb(struct super_block *sb)
 {
-	DECLARE_COMPLETION_ONSTACK(done);
+	DEFINE_WB_COMPLETION_ONSTACK(done);
 	struct wb_writeback_work work = {
 		.sb		= sb,
 		.sync_mode	= WB_SYNC_ALL,
@@ -1582,7 +1615,7 @@ void sync_inodes_sb(struct super_block *sb)
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
 	wb_queue_work(&bdi->wb, &work);
-	wait_for_completion(&done);
+	wb_wait_for_completion(bdi, &done);
 
 	wait_sb_inodes(sb);
 }
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 4ceda83..bc1b9e7 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -110,6 +110,8 @@ struct backing_dev_info {
 #ifdef CONFIG_CGROUP_WRITEBACK
 	struct radix_tree_root cgwb_tree; /* radix tree of !root cgroup wbs */
 #endif
+	wait_queue_head_t wb_waitq;
+
 	struct device *dev;
 
 	struct timer_list laptop_mode_wb_timer;
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index b4a5d9b..171fffd 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -631,6 +631,7 @@ int bdi_init(struct backing_dev_info *bdi)
 	bdi->max_ratio = 100;
 	bdi->max_prop_frac = FPROP_FRAC_BASE;
 	INIT_LIST_HEAD(&bdi->bdi_list);
+	init_waitqueue_head(&bdi->wb_waitq);
 
 	err = wb_init(&bdi->wb, bdi, GFP_KERNEL);
 	if (err)
-- 
2.1.0
^ permalink raw reply related	[flat|nested] 54+ messages in thread
- * [PATCH 26/45] writeback: implement wb_wait_for_single_work()
  2015-01-06 21:25 [PATCHSET RFC block/for-next] writeback: cgroup writeback support Tejun Heo
                   ` (24 preceding siblings ...)
  2015-01-06 21:26 ` [PATCH 25/45] writeback: implement bdi_wait_for_completion() Tejun Heo
@ 2015-01-06 21:26 ` Tejun Heo
  2015-01-06 21:26 ` [PATCH 27/45] writeback: restructure try_writeback_inodes_sb[_nr]() Tejun Heo
                   ` (19 subsequent siblings)
  45 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2015-01-06 21:26 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, jack, hch, hannes, linux-fsdevel, vgoyal, lizefan,
	cgroups, linux-mm, mhocko, clm, fengguang.wu, david, Tejun Heo
For cgroup writeback, multiple wb_writeback_work items may need to be
issued to accomplish a single task.  The previous patch updated the
waiting mechanism such that wb_wait_for_completion() can wait for
multiple work items.
Issuing multiple work items involves memory allocation which may fail.
As most writeback operations can't fail or block on memory
allocation, in such cases, we'll fall back to sequential issuing of an
on-stack work item, which would need to be waited upon sequentially.
This patch implements wb_wait_for_single_work() which waits for a
single work item independently from wb_completion waiting so that such
fallback mechanism can be used without getting tangled with the usual
issuing / completion operation.
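
The flag-plus-barrier handshake used for the single-work wait is roughly comparable to a release store paired with an acquire load in userspace C11. The sketch below assumes that analogy; `struct single_work`, its `pages_written` field and the helper names are hypothetical, and polling stands in for sleeping on the wait queue.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an on-stack work item with a result field. */
struct single_work {
	long pages_written;		/* plain data written by the worker */
	atomic_bool single_done;	/* completion flag read by the waiter */
};

static void single_work_init(struct single_work *w)
{
	w->pages_written = 0;
	atomic_init(&w->single_done, false);
}

/* Worker side: write the data, then set the flag with release ordering. */
static void worker_complete(struct single_work *w, long pages)
{
	w->pages_written = pages;
	atomic_store_explicit(&w->single_done, true, memory_order_release);
}

/*
 * Waiter side: an acquire load that observes the flag also observes
 * everything the worker wrote before setting it.  Returns false if the
 * work has not completed yet.
 */
static bool waiter_poll(struct single_work *w, long *pages)
{
	assert(pages != NULL);
	if (!atomic_load_explicit(&w->single_done, memory_order_acquire))
		return false;
	*pages = w->pages_written;
	return true;
}
```

The ordering matters because the waiter owns the on-stack item again as soon as it sees the flag, so any fields the worker touched must be visible by then.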
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c | 47 +++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 45 insertions(+), 2 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 6527692..6889077 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -52,6 +52,8 @@ struct wb_writeback_work {
 	unsigned int for_background:1;
 	unsigned int for_sync:1;	/* sync(2) WB_SYNC_ALL writeback */
 	unsigned int auto_free:1;	/* free on completion */
+	unsigned int single_wait:1;
+	unsigned int single_done:1;
 	enum wb_reason reason;		/* why was writeback initiated? */
 
 	struct list_head list;		/* pending work list */
@@ -99,8 +101,11 @@ static void wb_queue_work(struct bdi_writeback *wb,
 	trace_writeback_queue(wb->bdi, work);
 
 	spin_lock_bh(&wb->work_lock);
-	if (!test_bit(WB_registered, &wb->state))
+	if (!test_bit(WB_registered, &wb->state)) {
+		if (work->single_wait)
+			work->single_done = 1;
 		goto out_unlock;
+	}
 	if (work->done)
 		atomic_inc(&work->done->cnt);
 	list_add_tail(&work->list, &wb->work_list);
@@ -199,6 +204,32 @@ force_root:
 }
 
 /**
+ * wb_wait_for_single_work - wait for completion of a single bdi_writeback_work
+ * @bdi: bdi the work item was issued to
+ * @work: work item to wait for
+ *
+ * Wait for the completion of @work which was issued to one of @bdi's
+ * bdi_writeback's.  The caller must have set @work->single_wait before
+ * issuing it.  This wait operates independently fo
+ * wb_wait_for_completion() and also disables automatic freeing of @work.
+ */
+static void wb_wait_for_single_work(struct backing_dev_info *bdi,
+				    struct wb_writeback_work *work)
+{
+	if (WARN_ON_ONCE(!work->single_wait))
+		return;
+
+	wait_event(bdi->wb_waitq, work->single_done);
+
+	/*
+	 * Paired with smp_wmb() in wb_do_writeback() and ensures that all
+	 * modifications to @work prior to assertion of ->single_done is
+	 * visible to the caller once this function returns.
+	 */
+	smp_rmb();
+}
+
+/**
  * wb_split_bdi_pages - split nr_pages to write according to bandwidth
  * @wb: target bdi_writeback to split @nr_pages to
  * @nr_pages: number of pages to write for the whole bdi
@@ -1209,14 +1240,26 @@ static long wb_do_writeback(struct bdi_writeback *wb)
 	set_bit(WB_writeback_running, &wb->state);
 	while ((work = get_next_work_item(wb)) != NULL) {
 		struct wb_completion *done = work->done;
+		bool need_wake_up = false;
 
 		trace_writeback_exec(wb->bdi, work);
 
 		wrote += wb_writeback(wb, work);
 
-		if (work->auto_free)
+		if (work->single_wait) {
+			WARN_ON_ONCE(work->auto_free);
+			/* paired w/ rmb in wb_wait_for_single_work() */
+			smp_wmb();
+			work->single_done = 1;
+			need_wake_up = true;
+		} else if (work->auto_free) {
 			kfree(work);
+		}
+
 		if (done && atomic_dec_and_test(&done->cnt))
+			need_wake_up = true;
+
+		if (need_wake_up)
 			wake_up_all(&wb->bdi->wb_waitq);
 	}
 
-- 
2.1.0
^ permalink raw reply related	[flat|nested] 54+ messages in thread
- * [PATCH 27/45] writeback: restructure try_writeback_inodes_sb[_nr]()
  2015-01-06 21:25 [PATCHSET RFC block/for-next] writeback: cgroup writeback support Tejun Heo
                   ` (25 preceding siblings ...)
  2015-01-06 21:26 ` [PATCH 26/45] writeback: implement wb_wait_for_single_work() Tejun Heo
@ 2015-01-06 21:26 ` Tejun Heo
  2015-01-06 21:26 ` [PATCH 28/45] writeback: make writeback initiation functions handle multiple bdi_writeback's Tejun Heo
                   ` (18 subsequent siblings)
  45 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2015-01-06 21:26 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, jack, hch, hannes, linux-fsdevel, vgoyal, lizefan,
	cgroups, linux-mm, mhocko, clm, fengguang.wu, david, Tejun Heo
try_to_writeback_inodes_sb_nr() wraps writeback_inodes_sb_nr() so that it
handles s_umount locking and skips if writeback is already in
progress.  The in progress test is performed on the root wb
(bdi_writeback) which isn't sufficient for cgroup writeback support.
The test must be done per-wb.
To prepare for the change, this patch factors out
__writeback_inodes_sb_nr() from writeback_inodes_sb_nr() and adds
@skip_if_busy and moves the in progress test right before queueing the
wb_writeback_work.  try_to_writeback_inodes_sb_nr() now just grabs
s_umount and invokes __writeback_inodes_sb_nr() with asserted
@skip_if_busy.  This way, later addition of multiple wb handling can
skip only the wb's which already have writeback in progress.
This swaps the order between in progress test and s_umount test which
can flip the return value when writeback is in progress and s_umount
is being held by someone else but this shouldn't cause any meaningful
difference.  It's a fringe condition and the return value is an
unsynchronized hint anyway.
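
The change in return value can be checked with a small truth-table sketch. The two functions below are hypothetical models that keep only the ordering of the two tests; the boolean parameters stand in for writeback_in_progress() and for whether down_read_trylock() on s_umount would succeed.

```c
#include <assert.h>
#include <stdbool.h>

/* Old ordering: the in-progress test came first and reported success. */
static bool try_writeback_old(bool in_progress, bool umount_lock_available)
{
	if (in_progress)
		return true;
	if (!umount_lock_available)
		return false;
	return true;
}

/*
 * New ordering: the s_umount trylock decides the return value; the
 * in-progress test only decides whether work gets queued afterwards.
 */
static bool try_writeback_new(bool in_progress, bool umount_lock_available)
{
	if (!umount_lock_available)
		return false;
	(void)in_progress;	/* affects queueing only, not the result */
	return true;
}
```

The two models disagree only when writeback is in progress and s_umount cannot be taken, which is the fringe case described above.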
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c         | 52 ++++++++++++++++++++++++++---------------------
 include/linux/writeback.h |  6 +++---
 2 files changed, 32 insertions(+), 26 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 6889077..008f588 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1544,19 +1544,8 @@ static void wait_sb_inodes(struct super_block *sb)
 	iput(old_inode);
 }
 
-/**
- * writeback_inodes_sb_nr -	writeback dirty inodes from given super_block
- * @sb: the superblock
- * @nr: the number of pages to write
- * @reason: reason why some writeback work initiated
- *
- * Start writeback on some inodes on this super_block. No guarantees are made
- * on how many (if any) will be written, and this function does not wait
- * for IO completion of submitted IO.
- */
-void writeback_inodes_sb_nr(struct super_block *sb,
-			    unsigned long nr,
-			    enum wb_reason reason)
+static void __writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr,
+				     enum wb_reason reason, bool skip_if_busy)
 {
 	DEFINE_WB_COMPLETION_ONSTACK(done);
 	struct wb_writeback_work work = {
@@ -1572,9 +1561,30 @@ void writeback_inodes_sb_nr(struct super_block *sb,
 	if (!bdi_has_dirty_io(bdi) || bdi == &noop_backing_dev_info)
 		return;
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
+
+	if (skip_if_busy && writeback_in_progress(&bdi->wb))
+		return;
+
 	wb_queue_work(&bdi->wb, &work);
 	wb_wait_for_completion(bdi, &done);
 }
+
+/**
+ * writeback_inodes_sb_nr -	writeback dirty inodes from given super_block
+ * @sb: the superblock
+ * @nr: the number of pages to write
+ * @reason: reason why some writeback work initiated
+ *
+ * Start writeback on some inodes on this super_block. No guarantees are made
+ * on how many (if any) will be written, and this function does not wait
+ * for IO completion of submitted IO.
+ */
+void writeback_inodes_sb_nr(struct super_block *sb,
+			    unsigned long nr,
+			    enum wb_reason reason)
+{
+	__writeback_inodes_sb_nr(sb, nr, reason, false);
+}
 EXPORT_SYMBOL(writeback_inodes_sb_nr);
 
 /**
@@ -1601,19 +1611,15 @@ EXPORT_SYMBOL(writeback_inodes_sb);
  * Invoke writeback_inodes_sb_nr if no writeback is currently underway.
  * Returns 1 if writeback was started, 0 if not.
  */
-int try_to_writeback_inodes_sb_nr(struct super_block *sb,
-				  unsigned long nr,
-				  enum wb_reason reason)
+bool try_to_writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr,
+				   enum wb_reason reason)
 {
-	if (writeback_in_progress(&sb->s_bdi->wb))
-		return 1;
-
 	if (!down_read_trylock(&sb->s_umount))
-		return 0;
+		return false;
 
-	writeback_inodes_sb_nr(sb, nr, reason);
+	__writeback_inodes_sb_nr(sb, nr, reason, true);
 	up_read(&sb->s_umount);
-	return 1;
+	return true;
 }
 EXPORT_SYMBOL(try_to_writeback_inodes_sb_nr);
 
@@ -1625,7 +1631,7 @@ EXPORT_SYMBOL(try_to_writeback_inodes_sb_nr);
  * Implement by try_to_writeback_inodes_sb_nr()
  * Returns 1 if writeback was started, 0 if not.
  */
-int try_to_writeback_inodes_sb(struct super_block *sb, enum wb_reason reason)
+bool try_to_writeback_inodes_sb(struct super_block *sb, enum wb_reason reason)
 {
 	return try_to_writeback_inodes_sb_nr(sb, get_nr_dirty_pages(), reason);
 }
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 8e4485f..75349bb 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -93,9 +93,9 @@ struct bdi_writeback;
 void writeback_inodes_sb(struct super_block *, enum wb_reason reason);
 void writeback_inodes_sb_nr(struct super_block *, unsigned long nr,
 							enum wb_reason reason);
-int try_to_writeback_inodes_sb(struct super_block *, enum wb_reason reason);
-int try_to_writeback_inodes_sb_nr(struct super_block *, unsigned long nr,
-				  enum wb_reason reason);
+bool try_to_writeback_inodes_sb(struct super_block *, enum wb_reason reason);
+bool try_to_writeback_inodes_sb_nr(struct super_block *, unsigned long nr,
+				   enum wb_reason reason);
 void sync_inodes_sb(struct super_block *);
 void wakeup_flusher_threads(long nr_pages, enum wb_reason reason);
 void inode_wait_for_writeback(struct inode *inode);
-- 
2.1.0
^ permalink raw reply related	[flat|nested] 54+ messages in thread
- * [PATCH 28/45] writeback: make writeback initiation functions handle multiple bdi_writeback's
  2015-01-06 21:25 [PATCHSET RFC block/for-next] writeback: cgroup writeback support Tejun Heo
                   ` (26 preceding siblings ...)
  2015-01-06 21:26 ` [PATCH 27/45] writeback: restructure try_writeback_inodes_sb[_nr]() Tejun Heo
@ 2015-01-06 21:26 ` Tejun Heo
  2015-01-06 21:26 ` [PATCH 29/45] writeback: move i_wb_list emptiness test into inode_wb_list_del() from its caller Tejun Heo
                   ` (17 subsequent siblings)
  45 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2015-01-06 21:26 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, jack, hch, hannes, linux-fsdevel, vgoyal, lizefan,
	cgroups, linux-mm, mhocko, clm, fengguang.wu, david, Tejun Heo
[try_]writeback_inodes_sb[_nr]() and sync_inodes_sb() currently only
handle dirty inodes on the root wb (bdi_writeback) of the target bdi.
This patch implements bdi_split_work_to_wbs() and uses it to make these
functions handle multiple wb's.
bdi_split_work_to_wbs() takes a base wb_writeback_work, creates
clones of it and issues them to the wb's of the target bdi.  The base
work's nr_pages is distributed using wb_split_bdi_pages() -
ie. according to each wb's write bandwidth's proportion in the bdi.
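
The proportional split can be sketched in userspace as below. `split_pages()` is a hypothetical model of the arithmetic only: it takes the two bandwidth figures as plain parameters instead of reading them from a bdi_writeback, and the round-up division is written out by hand.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/*
 * Return this wb's share of @nr_pages given its bandwidth @this_bw and
 * the total bandwidth @tot_bw, rounding up so that the sketch errs on
 * the side of writing more.
 */
static long split_pages(long nr_pages, unsigned long this_bw,
			unsigned long tot_bw)
{
	uint64_t scaled;

	assert(nr_pages >= 0);

	/* LONG_MAX means "write everything" and is passed through as-is. */
	if (nr_pages == LONG_MAX)
		return LONG_MAX;

	/* No usable proportion: fall back to the full amount. */
	if (tot_bw == 0 || this_bw >= tot_bw)
		return nr_pages;

	scaled = (uint64_t)nr_pages * this_bw;
	return (long)((scaled + tot_bw - 1) / tot_bw);
}
```

For example, a wb with 30 out of 100 units of bandwidth gets 300 of 1000 pages, and a wb with 1 out of 3 units gets 4 of 10 pages because of the round-up.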
Cloning a work item involves memory allocation which may fail.  In such
cases, bdi_split_work_to_wbs() issues the base work directly and waits
for its completion before proceeding to the next wb to guarantee
forward progress and correctness under memory pressure.
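
The clone-or-fall-back decision can be modelled in userspace like this. It is a sketch under assumptions: `clone_or_fallback()` and its `can_alloc` knob (used to simulate allocation failure) are hypothetical, queueing is reduced to handing back the item that would be issued, and `struct work` carries only the fields the decision touches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for wb_writeback_work. */
struct work {
	long nr_pages;
	bool auto_free;		/* freed by the worker on completion */
	bool single_wait;	/* issuer waits for this one item */
	bool single_done;	/* set by the worker when finished */
};

/*
 * Try to clone @base.  @can_alloc is a hypothetical knob that simulates
 * allocation failure.  *@issued receives the item that would be queued.
 * Returns true if a clone was issued, false if @base itself was, in
 * which case the caller must wait for @base before reusing it.
 */
static bool clone_or_fallback(struct work *base, bool can_alloc,
			      struct work **issued)
{
	struct work *w = can_alloc ? malloc(sizeof(*w)) : NULL;

	if (w) {
		*w = *base;
		w->auto_free = true;
		w->single_wait = false;
	} else {
		w = base;
		w->auto_free = false;
		w->single_wait = true;
	}
	w->single_done = false;
	*issued = w;
	return w != base;
}
```

A caller iterating over several targets would keep issuing clones while this returns true and, on false, wait for the base item to finish before moving on to the next target.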
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c | 96 ++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 91 insertions(+), 5 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 008f588..4094d30 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -257,6 +257,80 @@ static long wb_split_bdi_pages(struct bdi_writeback *wb, long nr_pages)
 		return DIV_ROUND_UP_ULL((u64)nr_pages * this_bw, tot_bw);
 }
 
+/**
+ * wb_clone_and_queue_work - clone a wb_writeback_work and issue it to a wb
+ * @wb: target bdi_writeback
+ * @base_work: source wb_writeback_work
+ *
+ * Try to make a clone of @base_work and issue it to @wb.  If cloning
+ * succeeds, %true is returned; otherwise, @base_work is issued directly
+ * and %false is returned.  In the latter case, the caller is required to
+ * wait for @base_work's completion using wb_wait_for_single_work().
+ *
+ * A clone is auto-freed on completion.  @base_work never is.
+ */
+static bool wb_clone_and_queue_work(struct bdi_writeback *wb,
+				    struct wb_writeback_work *base_work)
+{
+	struct wb_writeback_work *work;
+
+	work = kmalloc(sizeof(*work), GFP_ATOMIC);
+	if (work) {
+		*work = *base_work;
+		work->auto_free = 1;
+		work->single_wait = 0;
+	} else {
+		work = base_work;
+		work->auto_free = 0;
+		work->single_wait = 1;
+	}
+	work->single_done = 0;
+	wb_queue_work(wb, work);
+	return work != base_work;
+}
+
+/**
+ * bdi_split_work_to_wbs - split a wb_writeback_work to all wb's of a bdi
+ * @bdi: target backing_dev_info
+ * @base_work: wb_writeback_work to issue
+ * @skip_if_busy: skip wb's which already have writeback in progress
+ *
+ * Split and issue @base_work to all wb's (bdi_writeback's) of @bdi which
+ * have dirty inodes.  If @base_work->nr_page isn't %LONG_MAX, it's
+ * distributed to the busy wbs according to each wb's proportion in the
+ * total active write bandwidth of @bdi.
+ */
+static void bdi_split_work_to_wbs(struct backing_dev_info *bdi,
+				  struct wb_writeback_work *base_work,
+				  bool skip_if_busy)
+{
+	long nr_pages = base_work->nr_pages;
+	int next_blkcg_id = 0;
+	struct bdi_writeback *wb;
+	struct wb_iter iter;
+
+	might_sleep();
+
+	if (!bdi_has_dirty_io(bdi))
+		return;
+restart:
+	rcu_read_lock();
+	bdi_for_each_wb(wb, bdi, &iter, next_blkcg_id) {
+		if (!wb_has_dirty_io(wb) ||
+		    (skip_if_busy && writeback_in_progress(wb)))
+			continue;
+
+		base_work->nr_pages = wb_split_bdi_pages(wb, nr_pages);
+		if (!wb_clone_and_queue_work(wb, base_work)) {
+			next_blkcg_id = wb->blkcg_css->id + 1;
+			rcu_read_unlock();
+			wb_wait_for_single_work(bdi, base_work);
+			goto restart;
+		}
+	}
+	rcu_read_unlock();
+}
+
 #else	/* CONFIG_CGROUP_WRITEBACK */
 
 static void init_cgwb_dirty_page_context(struct dirty_context *dctx)
@@ -269,6 +343,21 @@ static long wb_split_bdi_pages(struct bdi_writeback *wb, long nr_pages)
 	return nr_pages;
 }
 
+static void bdi_split_work_to_wbs(struct backing_dev_info *bdi,
+				  struct wb_writeback_work *base_work,
+				  bool skip_if_busy)
+{
+	might_sleep();
+
+	if (bdi_has_dirty_io(bdi) &&
+	    (!skip_if_busy || !writeback_in_progress(&bdi->wb))) {
+		base_work->auto_free = 0;
+		base_work->single_wait = 0;
+		base_work->single_done = 0;
+		wb_queue_work(&bdi->wb, base_work);
+	}
+}
+
 #endif	/* CONFIG_CGROUP_WRITEBACK */
 
 /**
@@ -1562,10 +1651,7 @@ static void __writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr,
 		return;
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
-	if (skip_if_busy && writeback_in_progress(&bdi->wb))
-		return;
-
-	wb_queue_work(&bdi->wb, &work);
+	bdi_split_work_to_wbs(sb->s_bdi, &work, skip_if_busy);
 	wb_wait_for_completion(bdi, &done);
 }
 
@@ -1663,7 +1749,7 @@ void sync_inodes_sb(struct super_block *sb)
 		return;
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
-	wb_queue_work(&bdi->wb, &work);
+	bdi_split_work_to_wbs(bdi, &work, false);
 	wb_wait_for_completion(bdi, &done);
 
 	wait_sb_inodes(sb);
-- 
2.1.0
^ permalink raw reply related	[flat|nested] 54+ messages in thread
- * [PATCH 29/45] writeback: move i_wb_list emptiness test into inode_wb_list_del() from its caller
  2015-01-06 21:25 [PATCHSET RFC block/for-next] writeback: cgroup writeback support Tejun Heo
                   ` (27 preceding siblings ...)
  2015-01-06 21:26 ` [PATCH 28/45] writeback: make writeback initiation functions handle multiple bdi_writeback's Tejun Heo
@ 2015-01-06 21:26 ` Tejun Heo
  2015-01-06 21:26 ` [PATCH 30/45] vfs, writeback: introduce struct inode_wb_link Tejun Heo
                   ` (16 subsequent siblings)
  45 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2015-01-06 21:26 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, jack, hch, hannes, linux-fsdevel, vgoyal, lizefan,
	cgroups, linux-mm, mhocko, clm, fengguang.wu, david, Tejun Heo
inode_wb_list_del() has one caller, evict(), which tests whether
inode->i_wb_list is empty before invoking the function.  With cgroup
writeback support, an inode may belong to multiple bdi_writeback's
rendering this test incorrect or at least insufficient.  This patch
moves the test into inode_wb_list_del() so that later patches can
update the logic in the function proper.
This does add a function call and jump when a clean inode is being
evicted, but this shouldn't be noticeable and, if it ever is,
making that part inline logic in fs/internal.h is easy.
This patch is pure code reorganization.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c | 3 +++
 fs/inode.c        | 4 +---
 2 files changed, 4 insertions(+), 3 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 4094d30..0fcdfe9 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -510,6 +510,9 @@ void inode_wb_list_del(struct inode *inode)
 	struct backing_dev_info *bdi = inode_to_bdi(inode);
 	struct bdi_writeback *wb = &bdi->wb;
 
+	if (list_empty(&inode->i_wb_list))
+		return;
+
 	spin_lock(&wb->list_lock);
 	inode_wb_list_del_locked(inode, wb);
 	spin_unlock(&wb->list_lock);
diff --git a/fs/inode.c b/fs/inode.c
index aa149e7..7fbfc00 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -536,9 +536,7 @@ static void evict(struct inode *inode)
 	BUG_ON(!(inode->i_state & I_FREEING));
 	BUG_ON(!list_empty(&inode->i_lru));
 
-	if (!list_empty(&inode->i_wb_list))
-		inode_wb_list_del(inode);
-
+	inode_wb_list_del(inode);
 	inode_sb_list_del(inode);
 
 	/*
-- 
2.1.0
^ permalink raw reply related	[flat|nested] 54+ messages in thread
- * [PATCH 30/45] vfs, writeback: introduce struct inode_wb_link
  2015-01-06 21:25 [PATCHSET RFC block/for-next] writeback: cgroup writeback support Tejun Heo
                   ` (28 preceding siblings ...)
  2015-01-06 21:26 ` [PATCH 29/45] writeback: move i_wb_list emptiness test into inode_wb_list_del() from its caller Tejun Heo
@ 2015-01-06 21:26 ` Tejun Heo
  2015-01-06 21:26 ` [PATCH 31/45] vfs, writeback: add inode_wb_link->data point to the associated bdi_writeback Tejun Heo
                   ` (15 subsequent siblings)
  45 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2015-01-06 21:26 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, jack, hch, hannes, linux-fsdevel, vgoyal, lizefan,
	cgroups, linux-mm, mhocko, clm, fengguang.wu, david, Tejun Heo,
	Alexander Viro
An inode may be written to from more than one cgroup and for cgroup
writeback support to work properly the inode needs to be on the dirty
lists of all wb's (bdi_writeback's) corresponding to the dirtying
cgroups so that writeback on each cgroup can keep track of and process
the inode.
As the first step toward enabling linking an inode on multiple wb's, this
patch introduces struct inode_wb_link, which represents the
association between an inode and a wb, and replaces inode->i_wb_list
with ->i_wb_link of this type.
struct inode_wb_link currently contains only struct list_head and the
conversions are mostly equivalent and of trivial nature.  The only
difference at this point is that some functions are converted to take
the pointer to inode->i_wb_link instead of inode and use
container_of() to recover the inode.
Later patches will expand inode_wb_link and its handling to support
linking on multiple wb's.
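
The pointer recovery described above can be sketched in userspace as follows. The struct layouts are cut down to the fields needed for the illustration, `container_of()` is spelled with offsetof() rather than taken from kernel headers, and `iwbl_to_inode()` is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal doubly linked list node, as in the kernel's list_head. */
struct list_head {
	struct list_head *next, *prev;
};

/* Association between an inode and a wb; only the list node for now. */
struct inode_wb_link {
	struct list_head dirty_list;
};

/* Cut-down inode embedding the link in place of the old i_wb_list. */
struct inode {
	unsigned long i_ino;
	struct inode_wb_link i_wb_link;
};

/* Userspace spelling of container_of() built on offsetof(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical helper: recover the inode that embeds @iwbl. */
static struct inode *iwbl_to_inode(struct inode_wb_link *iwbl)
{
	assert(iwbl != NULL);
	return container_of(iwbl, struct inode, i_wb_link);
}

/* Recover the link from its list node, as list_entry() would. */
static struct inode_wb_link *dirty_list_to_iwbl(struct list_head *head)
{
	return container_of(head, struct inode_wb_link, dirty_list);
}
```

Chaining the two helpers goes from a list node on a dirty list back to the owning inode, which is why functions can take the link instead of the inode without losing information.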
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c                | 104 ++++++++++++++++++++++-----------------
 fs/inode.c                       |   2 +-
 include/linux/backing-dev-defs.h |   8 +++
 include/linux/backing-dev.h      |   5 ++
 include/linux/fs.h               |   2 +-
 mm/backing-dev.c                 |   6 +--
 6 files changed, 78 insertions(+), 49 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 0fcdfe9..0a10dd8 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -72,9 +72,9 @@ struct wb_writeback_work {
 		.cnt		= ATOMIC_INIT(1),			\
 	}
 
-static inline struct inode *wb_inode(struct list_head *head)
+static struct inode_wb_link *dirty_list_to_iwbl(struct list_head *head)
 {
-	return list_entry(head, struct inode, i_wb_list);
+	return list_entry(head, struct inode_wb_link, dirty_list);
 }
 
 /*
@@ -452,21 +452,20 @@ void wb_start_background_writeback(struct bdi_writeback *wb)
 }
 
 /**
- * inode_wb_list_move_locked - move an inode onto a bdi_writeback IO list
- * @inode: inode to be moved
+ * iwbl_move_locked - move an inode_wb_link onto a bdi_writeback IO list
+ * @iwbl: inode_wb_link to be moved
  * @wb: target bdi_writeback
  * @head: one of @wb->b_{dirty|io|more_io}
  *
- * Move @inode->i_wb_list to @list of @wb and set %WB_has_dirty_io.
+ * Move @iwbl->dirty_list to @list of @wb and set %WB_has_dirty_io.
  * Returns %true if all IO lists were empty before; otherwise, %false.
  */
-static bool inode_wb_list_move_locked(struct inode *inode,
-				      struct bdi_writeback *wb,
-				      struct list_head *head)
+static bool iwbl_move_locked(struct inode_wb_link *iwbl,
+			     struct bdi_writeback *wb, struct list_head *head)
 {
 	assert_spin_locked(&wb->list_lock);
 
-	list_move(&inode->i_wb_list, head);
+	list_move(&iwbl->dirty_list, head);
 
 	if (wb_has_dirty_io(wb)) {
 		return false;
@@ -480,19 +479,19 @@ static bool inode_wb_list_move_locked(struct inode *inode,
 }
 
 /**
- * inode_wb_list_del_locked - remove an inode from its bdi_writeback IO list
- * @inode: inode to be removed
+ * iwbl_del_locked - remove an inode_wb_link from its bdi_writeback IO list
+ * @iwbl: inode_wb_link to be removed
  * @wb: bdi_writeback @inode is being removed from
  *
- * Remove @inode which may be on one of @wb->b_{dirty|io|more_io} lists and
+ * Remove @iwbl which may be on one of @wb->b_{dirty|io|more_io} lists and
  * clear %WB_has_dirty_io if all are empty afterwards.
  */
-static void inode_wb_list_del_locked(struct inode *inode,
-				     struct bdi_writeback *wb)
+static void iwbl_del_locked(struct inode_wb_link *iwbl,
+			    struct bdi_writeback *wb)
 {
 	assert_spin_locked(&wb->list_lock);
 
-	list_del_init(&inode->i_wb_list);
+	list_del_init(&iwbl->dirty_list);
 
 	if (wb_has_dirty_io(wb) && list_empty(&wb->b_dirty) &&
 	    list_empty(&wb->b_io) && list_empty(&wb->b_more_io)) {
@@ -507,14 +506,15 @@ static void inode_wb_list_del_locked(struct inode *inode,
  */
 void inode_wb_list_del(struct inode *inode)
 {
+	struct inode_wb_link *iwbl = &inode->i_wb_link;
 	struct backing_dev_info *bdi = inode_to_bdi(inode);
 	struct bdi_writeback *wb = &bdi->wb;
 
-	if (list_empty(&inode->i_wb_list))
+	if (list_empty(&iwbl->dirty_list))
 		return;
 
 	spin_lock(&wb->list_lock);
-	inode_wb_list_del_locked(inode, wb);
+	iwbl_del_locked(iwbl, wb);
 	spin_unlock(&wb->list_lock);
 }
 
@@ -527,24 +527,28 @@ void inode_wb_list_del(struct inode *inode)
  * the case then the inode must have been redirtied while it was being written
  * out and we don't reset its dirtied_when.
  */
-static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
+static void redirty_tail(struct inode_wb_link *iwbl, struct bdi_writeback *wb)
 {
+	struct inode *inode = iwbl_to_inode(iwbl);
+
 	if (!list_empty(&wb->b_dirty)) {
+		struct inode_wb_link *tail_iwbl;
 		struct inode *tail;
 
-		tail = wb_inode(wb->b_dirty.next);
+		tail_iwbl = dirty_list_to_iwbl(wb->b_dirty.next);
+		tail = iwbl_to_inode(tail_iwbl);
 		if (time_before(inode->dirtied_when, tail->dirtied_when))
 			inode->dirtied_when = jiffies;
 	}
-	inode_wb_list_move_locked(inode, wb, &wb->b_dirty);
+	iwbl_move_locked(iwbl, wb, &wb->b_dirty);
 }
 
 /*
  * requeue inode for re-scanning after bdi->b_io list is exhausted.
  */
-static void requeue_io(struct inode *inode, struct bdi_writeback *wb)
+static void requeue_io(struct inode_wb_link *iwbl, struct bdi_writeback *wb)
 {
-	inode_wb_list_move_locked(inode, wb, &wb->b_more_io);
+	iwbl_move_locked(iwbl, wb, &wb->b_more_io);
 }
 
 static void inode_sync_complete(struct inode *inode)
@@ -583,16 +587,19 @@ static int move_expired_inodes(struct list_head *delaying_queue,
 	LIST_HEAD(tmp);
 	struct list_head *pos, *node;
 	struct super_block *sb = NULL;
+	struct inode_wb_link *iwbl;
 	struct inode *inode;
 	int do_sb_sort = 0;
 	int moved = 0;
 
 	while (!list_empty(delaying_queue)) {
-		inode = wb_inode(delaying_queue->prev);
+		iwbl = dirty_list_to_iwbl(delaying_queue->prev);
+		inode = iwbl_to_inode(iwbl);
+
 		if (work->older_than_this &&
 		    inode_dirtied_after(inode, *work->older_than_this))
 			break;
-		list_move(&inode->i_wb_list, &tmp);
+		list_move(&iwbl->dirty_list, &tmp);
 		moved++;
 		if (sb_is_blkdev_sb(inode->i_sb))
 			continue;
@@ -609,11 +616,12 @@ static int move_expired_inodes(struct list_head *delaying_queue,
 
 	/* Move inodes from one superblock together */
 	while (!list_empty(&tmp)) {
-		sb = wb_inode(tmp.prev)->i_sb;
+		sb = iwbl_to_inode(dirty_list_to_iwbl(tmp.prev))->i_sb;
 		list_for_each_prev_safe(pos, node, &tmp) {
-			inode = wb_inode(pos);
+			iwbl = dirty_list_to_iwbl(pos);
+			inode = iwbl_to_inode(iwbl);
 			if (inode->i_sb == sb)
-				list_move(&inode->i_wb_list, dispatch_queue);
+				list_move(&iwbl->dirty_list, dispatch_queue);
 		}
 	}
 out:
@@ -711,9 +719,11 @@ static void inode_sleep_on_writeback(struct inode *inode)
  * processes all inodes in writeback lists and requeueing inodes behind flusher
  * thread's back can have unexpected consequences.
  */
-static void requeue_inode(struct inode *inode, struct bdi_writeback *wb,
+static void requeue_inode(struct inode_wb_link *iwbl, struct bdi_writeback *wb,
 			  struct writeback_control *wbc)
 {
+	struct inode *inode = iwbl_to_inode(iwbl);
+
 	if (inode->i_state & I_FREEING)
 		return;
 
@@ -731,7 +741,7 @@ static void requeue_inode(struct inode *inode, struct bdi_writeback *wb,
 		 * writeback is not making progress due to locked
 		 * buffers. Skip this inode for now.
 		 */
-		redirty_tail(inode, wb);
+		redirty_tail(iwbl, wb);
 		return;
 	}
 
@@ -742,7 +752,7 @@ static void requeue_inode(struct inode *inode, struct bdi_writeback *wb,
 		 */
 		if (wbc->nr_to_write <= 0) {
 			/* Slice used up. Queue for next turn. */
-			requeue_io(inode, wb);
+			requeue_io(iwbl, wb);
 		} else {
 			/*
 			 * Writeback blocked by something other than
@@ -751,7 +761,7 @@ static void requeue_inode(struct inode *inode, struct bdi_writeback *wb,
 			 * retrying writeback of the dirty page/inode
 			 * that cannot be performed immediately.
 			 */
-			redirty_tail(inode, wb);
+			redirty_tail(iwbl, wb);
 		}
 	} else if (inode->i_state & I_DIRTY) {
 		/*
@@ -759,10 +769,10 @@ static void requeue_inode(struct inode *inode, struct bdi_writeback *wb,
 		 * such as delayed allocation during submission or metadata
 		 * updates after data IO completion.
 		 */
-		redirty_tail(inode, wb);
+		redirty_tail(iwbl, wb);
 	} else {
 		/* The inode is clean. Remove from writeback lists. */
-		inode_wb_list_del_locked(inode, wb);
+		iwbl_del_locked(iwbl, wb);
 	}
 }
 
@@ -848,6 +858,7 @@ static int
 writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 		       struct writeback_control *wbc)
 {
+	struct inode_wb_link *iwbl = &inode->i_wb_link;
 	int ret = 0;
 
 	spin_lock(&inode->i_lock);
@@ -891,7 +902,7 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 	 * touch it. See comment above for explanation.
 	 */
 	if (!(inode->i_state & I_DIRTY))
-		inode_wb_list_del_locked(inode, wb);
+		iwbl_del_locked(iwbl, wb);
 	spin_unlock(&wb->list_lock);
 	inode_sync_complete(inode);
 out:
@@ -954,7 +965,8 @@ static long writeback_sb_inodes(struct super_block *sb,
 	long wrote = 0;  /* count both pages and inodes */
 
 	while (!list_empty(&wb->b_io)) {
-		struct inode *inode = wb_inode(wb->b_io.prev);
+		struct inode_wb_link *iwbl = dirty_list_to_iwbl(wb->b_io.prev);
+		struct inode *inode = iwbl_to_inode(iwbl);
 
 		if (inode->i_sb != sb) {
 			if (work->sb) {
@@ -963,7 +975,7 @@ static long writeback_sb_inodes(struct super_block *sb,
 				 * superblock, move all inodes not belonging
 				 * to it back onto the dirty list.
 				 */
-				redirty_tail(inode, wb);
+				redirty_tail(iwbl, wb);
 				continue;
 			}
 
@@ -983,7 +995,7 @@ static long writeback_sb_inodes(struct super_block *sb,
 		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
 			spin_unlock(&inode->i_lock);
-			redirty_tail(inode, wb);
+			redirty_tail(iwbl, wb);
 			continue;
 		}
 		if ((inode->i_state & I_SYNC) && wbc.sync_mode != WB_SYNC_ALL) {
@@ -997,7 +1009,7 @@ static long writeback_sb_inodes(struct super_block *sb,
 			 * when we completed a full scan of b_io.
 			 */
 			spin_unlock(&inode->i_lock);
-			requeue_io(inode, wb);
+			requeue_io(iwbl, wb);
 			trace_writeback_sb_inodes_requeue(inode);
 			continue;
 		}
@@ -1034,7 +1046,7 @@ static long writeback_sb_inodes(struct super_block *sb,
 		spin_lock(&inode->i_lock);
 		if (!(inode->i_state & I_DIRTY))
 			wrote++;
-		requeue_inode(inode, wb, &wbc);
+		requeue_inode(iwbl, wb, &wbc);
 		inode_sync_complete(inode);
 		spin_unlock(&inode->i_lock);
 		cond_resched_lock(&wb->list_lock);
@@ -1059,7 +1071,8 @@ static long __writeback_inodes_wb(struct bdi_writeback *wb,
 	long wrote = 0;
 
 	while (!list_empty(&wb->b_io)) {
-		struct inode *inode = wb_inode(wb->b_io.prev);
+		struct inode_wb_link *iwbl = dirty_list_to_iwbl(wb->b_io.prev);
+		struct inode *inode = iwbl_to_inode(iwbl);
 		struct super_block *sb = inode->i_sb;
 
 		if (!grab_super_passive(sb)) {
@@ -1068,7 +1081,7 @@ static long __writeback_inodes_wb(struct bdi_writeback *wb,
 			 * s_umount being grabbed by someone else. Don't use
 			 * requeue_io() to avoid busy retrying the inode/sb.
 			 */
-			redirty_tail(inode, wb);
+			redirty_tail(iwbl, wb);
 			continue;
 		}
 		wrote += writeback_sb_inodes(sb, wb, work);
@@ -1152,6 +1165,7 @@ static long wb_writeback(struct bdi_writeback *wb,
 	unsigned long wb_start = jiffies;
 	long nr_pages = work->nr_pages;
 	unsigned long oldest_jif;
+	struct inode_wb_link *iwbl;
 	struct inode *inode;
 	long progress;
 
@@ -1228,7 +1242,8 @@ static long wb_writeback(struct bdi_writeback *wb,
 		 */
 		if (!list_empty(&wb->b_more_io))  {
 			trace_writeback_wait(wb->bdi, work);
-			inode = wb_inode(wb->b_more_io.prev);
+			iwbl = dirty_list_to_iwbl(wb->b_more_io.prev);
+			inode = iwbl_to_inode(iwbl);
 			spin_lock(&inode->i_lock);
 			spin_unlock(&wb->list_lock);
 			/* This function drops i_lock... */
@@ -1514,6 +1529,7 @@ void mark_inode_dirty_dctx(struct dirty_context *dctx, int flags)
 
 	spin_lock(&inode->i_lock);
 	if ((inode->i_state & flags) != flags) {
+		struct inode_wb_link *iwbl = &inode->i_wb_link;
 		const int was_dirty = inode->i_state & I_DIRTY;
 
 		inode->i_state |= flags;
@@ -1553,8 +1569,8 @@ void mark_inode_dirty_dctx(struct dirty_context *dctx, int flags)
 			     "bdi-%s not registered\n", bdi->name);
 
 			inode->dirtied_when = jiffies;
-			wakeup_bdi = inode_wb_list_move_locked(inode, &bdi->wb,
-							      &bdi->wb.b_dirty);
+			wakeup_bdi = iwbl_move_locked(iwbl, &bdi->wb,
+						      &bdi->wb.b_dirty);
 			spin_unlock(&bdi->wb.list_lock);
 
 			/*
diff --git a/fs/inode.c b/fs/inode.c
index 7fbfc00..7ec49ad 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -369,7 +369,7 @@ void inode_init_once(struct inode *inode)
 	memset(inode, 0, sizeof(*inode));
 	INIT_HLIST_NODE(&inode->i_hash);
 	INIT_LIST_HEAD(&inode->i_devices);
-	INIT_LIST_HEAD(&inode->i_wb_list);
+	INIT_LIST_HEAD(&inode->i_wb_link.dirty_list);
 	INIT_LIST_HEAD(&inode->i_lru);
 	address_space_init_once(&inode->i_data);
 	i_size_ordered_init(inode);
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index bc1b9e7..8bc80bd 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -123,6 +123,14 @@ struct backing_dev_info {
 };
 
 /*
+ * Used to link a dirty inode on a wb (bdi_writeback).  Each inode embeds
+ * one at ->i_wb_link which is used for the root wb.
+ */
+struct inode_wb_link {
+	struct list_head	dirty_list;
+};
+
+/*
  * The following structure carries context used during page and inode
  * dirtying.  Should be initialized with init_dirty_{inode|page}_context().
  */
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 4cdab7c..6ced0f4 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -274,6 +274,11 @@ void init_dirty_page_context(struct dirty_context *dctx, struct page *page,
 			     struct address_space *mapping);
 void init_dirty_inode_context(struct dirty_context *dctx, struct inode *inode);
 
+static inline struct inode *iwbl_to_inode(struct inode_wb_link *iwbl)
+{
+	return container_of(iwbl, struct inode, i_wb_link);
+}
+
 #ifdef CONFIG_CGROUP_WRITEBACK
 
 void cgwb_blkcg_released(struct cgroup_subsys_state *blkcg_css);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2f3df6a..ea0b68f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -610,7 +610,7 @@ struct inode {
 	unsigned long		dirtied_when;	/* jiffies of first dirtying */
 
 	struct hlist_node	i_hash;
-	struct list_head	i_wb_list;	/* backing dev IO list */
+	struct inode_wb_link	i_wb_link;	/* backing dev IO list */
 	struct list_head	i_lru;		/* inode LRU list */
 	struct list_head	i_sb_list;
 	union {
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 171fffd..cc8d21a 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -73,11 +73,11 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
 
 	nr_dirty = nr_io = nr_more_io = 0;
 	spin_lock(&wb->list_lock);
-	list_for_each_entry(inode, &wb->b_dirty, i_wb_list)
+	list_for_each_entry(inode, &wb->b_dirty, i_wb_link.dirty_list)
 		nr_dirty++;
-	list_for_each_entry(inode, &wb->b_io, i_wb_list)
+	list_for_each_entry(inode, &wb->b_io, i_wb_link.dirty_list)
 		nr_io++;
-	list_for_each_entry(inode, &wb->b_more_io, i_wb_list)
+	list_for_each_entry(inode, &wb->b_more_io, i_wb_link.dirty_list)
 		nr_more_io++;
 	spin_unlock(&wb->list_lock);
 
-- 
2.1.0
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
- * [PATCH 31/45] vfs, writeback: add inode_wb_link->data point to the associated bdi_writeback
  2015-01-06 21:25 [PATCHSET RFC block/for-next] writeback: cgroup writeback support Tejun Heo
                   ` (29 preceding siblings ...)
  2015-01-06 21:26 ` [PATCH 30/45] vfs, writeback: introduce struct inode_wb_link Tejun Heo
@ 2015-01-06 21:26 ` Tejun Heo
  2015-01-06 21:26 ` [PATCH 32/45] vfs, writeback: move inode->dirtied_when into inode->i_wb_link Tejun Heo
                   ` (14 subsequent siblings)
  45 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2015-01-06 21:26 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, jack, hch, hannes, linux-fsdevel, vgoyal, lizefan,
	cgroups, linux-mm, mhocko, clm, fengguang.wu, david, Tejun Heo,
	Alexander Viro
When CONFIG_CGROUP_WRITEBACK is enabled, add ->data to iwbl
(inode_wb_link).  It is an unsigned long whose upper bits point to the
associated wb (bdi_writeback) and whose lower bits carry IWBL_* flags,
none of which is defined yet.  iwbl_to_wb() is added to retrieve the
associated wb from an iwbl.
Places which were mapping inode to wb through inode_to_bdi() are
converted to use iwbl_to_wb(&inode->i_wb_link) instead.  ->data is set
by init_i_wb_link() function on inode initialization and when a bdev
inode changes its associated bdi.
When CONFIG_CGROUP_WRITEBACK is enabled, this adds a pointer-sized
field to struct inode.  This patch currently doesn't make any
behavioral difference but will allow associating a single inode with
multiple wb's, which is necessary for cgroup writeback support.
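The pointer-plus-flags packing of ->data can be sketched in userspace as
follows.  This is only an illustration under stated assumptions: the
demo_* names and the two-bit flag width are hypothetical (in this patch
IWBL_FLAGS_BITS is the first enumerator and therefore 0, so the real
mask is empty), and uintptr_t is used here where the patch uses unsigned
long.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag width; the patch itself defines no IWBL_* flags yet. */
#define DEMO_FLAGS_BITS	2
#define DEMO_FLAGS_MASK	((1UL << DEMO_FLAGS_BITS) - 1)

/* Stand-in for bdi_writeback, aligned so the low flag bits are always zero. */
struct demo_wb {
	int id;
} __attribute__((aligned(1UL << DEMO_FLAGS_BITS)));

struct demo_link {
	uintptr_t data;	/* upper bits: demo_wb pointer, lower bits: flags */
};

/* Store the wb pointer; mirrors the role of init_i_wb_link(). */
static inline void demo_link_init(struct demo_link *link, struct demo_wb *wb)
{
	/* Alignment guarantees the pointer does not collide with flag bits. */
	assert(((uintptr_t)wb & DEMO_FLAGS_MASK) == 0);
	link->data = (uintptr_t)wb;
}

/* OR flags into the low bits; bits outside the mask are ignored. */
static inline void demo_link_set_flags(struct demo_link *link,
				       unsigned long flags)
{
	link->data |= flags & DEMO_FLAGS_MASK;
}

static inline unsigned long demo_link_flags(const struct demo_link *link)
{
	return link->data & DEMO_FLAGS_MASK;
}

/* Mask the flags off to get the pointer back; mirrors iwbl_to_wb(). */
static inline struct demo_wb *demo_link_to_wb(const struct demo_link *link)
{
	return (struct demo_wb *)(link->data & ~(uintptr_t)DEMO_FLAGS_MASK);
}
```

The alignment requirement is the reason the patch adds
__aligned(BDI_WRITEBACK_ALIGN) to struct bdi_writeback: every flag bit
reserved in ->data must be guaranteed zero in any valid wb address.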
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
---
 fs/block_dev.c                   |  1 +
 fs/fs-writeback.c                |  8 ++++----
 fs/inode.c                       |  1 +
 include/linux/backing-dev-defs.h | 27 ++++++++++++++++++++++++++-
 include/linux/backing-dev.h      | 30 ++++++++++++++++++++++++++++++
 5 files changed, 62 insertions(+), 5 deletions(-)
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 0413d3f..855f850 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -61,6 +61,7 @@ static void bdev_inode_switch_bdi(struct inode *inode,
 		spin_lock(&inode->i_lock);
 		if (!(inode->i_state & I_DIRTY)) {
 			inode->i_data.backing_dev_info = dst;
+			init_i_wb_link(inode);
 			spin_unlock(&inode->i_lock);
 			return;
 		}
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 0a10dd8..2a5e400 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -507,8 +507,7 @@ static void iwbl_del_locked(struct inode_wb_link *iwbl,
 void inode_wb_list_del(struct inode *inode)
 {
 	struct inode_wb_link *iwbl = &inode->i_wb_link;
-	struct backing_dev_info *bdi = inode_to_bdi(inode);
-	struct bdi_writeback *wb = &bdi->wb;
+	struct bdi_writeback *wb = iwbl_to_wb(iwbl);
 
 	if (list_empty(&iwbl->dirty_list))
 		return;
@@ -1787,7 +1786,7 @@ EXPORT_SYMBOL(sync_inodes_sb);
  */
 int write_inode_now(struct inode *inode, int sync)
 {
-	struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
+	struct bdi_writeback *wb = iwbl_to_wb(&inode->i_wb_link);
 	struct writeback_control wbc = {
 		.nr_to_write = LONG_MAX,
 		.sync_mode = sync ? WB_SYNC_ALL : WB_SYNC_NONE,
@@ -1816,7 +1815,8 @@ EXPORT_SYMBOL(write_inode_now);
  */
 int sync_inode(struct inode *inode, struct writeback_control *wbc)
 {
-	return writeback_single_inode(inode, &inode_to_bdi(inode)->wb, wbc);
+	return writeback_single_inode(inode, iwbl_to_wb(&inode->i_wb_link),
+				      wbc);
 }
 EXPORT_SYMBOL(sync_inode);
 
diff --git a/fs/inode.c b/fs/inode.c
index 7ec49ad..b38d7d6 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -195,6 +195,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
 	inode->i_fsnotify_mask = 0;
 #endif
 
+	init_i_wb_link(inode);
 	this_cpu_inc(nr_inodes);
 
 	return 0;
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 8bc80bd..9720cac 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -43,6 +43,24 @@ enum wb_stat_item {
 
 #define WB_STAT_BATCH (8*(1+ilog2(nr_cpu_ids)))
 
+/*
+ * IWBL_* flags which occupy the lower bits of inode_wb_link->data.  The
+ * upper bits point to bdi_writeback, so the number of these flags
+ * determines the minimum alignment of bdi_writeback.
+ */
+enum {
+	IWBL_FLAGS_BITS,
+	IWBL_FLAGS_MASK		= (1UL << IWBL_FLAGS_BITS) - 1,
+};
+
+/*
+ * Align bdi_writeback so that inode_wb_link->data can carry IWBL_* flags
+ * in the lower bits but don't let it fall below that of ullong.
+ */
+#define BDI_WRITEBACK_ALIGN	\
+	((1UL << IWBL_FLAGS_BITS) > __alignof(unsigned long long) ?	\
+	 (1UL << IWBL_FLAGS_BITS) : __alignof(unsigned long long))
+
 struct bdi_writeback {
 	struct backing_dev_info *bdi;	/* our parent bdi */
 
@@ -86,7 +104,7 @@ struct bdi_writeback {
 		struct rcu_head rcu;
 	};
 #endif
-};
+} __aligned(BDI_WRITEBACK_ALIGN);
 
 struct backing_dev_info {
 	struct list_head bdi_list;
@@ -127,6 +145,13 @@ struct backing_dev_info {
  * one at ->i_wb_link which is used for the root wb.
  */
 struct inode_wb_link {
+#ifdef CONFIG_CGROUP_WRITEBACK
+	/*
+	 * Upper bits point to the associated bdi_writeback.  Lower carry
+	 * IWBL_* flags.  Use iwbl_to_wb() to reach the bdi_writeback.
+	 */
+	unsigned long		data;
+#endif
 	struct list_head	dirty_list;
 };
 
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 6ced0f4..bc69c7f 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -450,6 +450,25 @@ static inline struct bdi_writeback *__wb_iter_init(struct wb_iter *iter,
 	for ((wb_cur) = __wb_iter_init(iter, bdi, start_blkcg_id);	\
 	     (wb_cur); (wb_cur) = __wb_iter_next(iter, bdi))
 
+/**
+ * init_i_wb_link - (re)initialize inode->i_wb_link
+ * @inode: inode of interest
+ *
+ * Initialize @inode->i_wb_link.  Usually invoked on inode initialization.
+ * One special case is the bdev inodes which are associated with different
+ * bdi's over their lifetimes.  This function must be called each time the
+ * associated bdi changes.
+ */
+static inline void init_i_wb_link(struct inode *inode)
+{
+	inode->i_wb_link.data = (unsigned long)&inode_to_bdi(inode)->wb;
+}
+
+static inline struct bdi_writeback *iwbl_to_wb(struct inode_wb_link *iwbl)
+{
+	return (void *)(iwbl->data & ~IWBL_FLAGS_MASK);
+}
+
 #else	/* CONFIG_CGROUP_WRITEBACK */
 
 static inline bool mapping_cgwb_enabled(struct address_space *mapping)
@@ -499,6 +518,17 @@ struct wb_iter {
 	     ({	(wb_cur) = !(iter)->next_id++ ? &(bdi)->wb : NULL;	\
 	     }); )
 
+static inline void init_i_wb_link(struct inode *inode)
+{
+}
+
+static inline struct bdi_writeback *iwbl_to_wb(struct inode_wb_link *iwbl)
+{
+	struct inode *inode = iwbl_to_inode(iwbl);
+
+	return &inode_to_bdi(inode)->wb;
+}
+
 #endif	/* CONFIG_CGROUP_WRITEBACK */
 
 static inline int mapping_read_congested(struct address_space *mapping,
-- 
2.1.0
- * [PATCH 32/45] vfs, writeback: move inode->dirtied_when into inode->i_wb_link
  2015-01-06 21:25 [PATCHSET RFC block/for-next] writeback: cgroup writeback support Tejun Heo
                   ` (30 preceding siblings ...)
  2015-01-06 21:26 ` [PATCH 31/45] vfs, writeback: add inode_wb_link->data point to the associated bdi_writeback Tejun Heo
@ 2015-01-06 21:26 ` Tejun Heo
  2015-01-06 21:26 ` [PATCH 33/45] writeback: minor reorganization of fs/fs-writeback.c Tejun Heo
                   ` (13 subsequent siblings)
  45 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2015-01-06 21:26 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, jack, hch, hannes, linux-fsdevel, vgoyal, lizefan,
	cgroups, linux-mm, mhocko, clm, fengguang.wu, david, Tejun Heo,
	Alexander Viro
With cgroup writeback support, an inode may be dirtied against multiple
wb's (bdi_writeback's) belonging to different cgroups and each should
be tracked separately.  iwbl (inode_wb_link) will be used to establish
the associations between an inode and the wb's that it's dirtied
against.
This patch moves inode->dirtied_when into iwbl so that the dirtied
timestamp can be tracked separately for each associated wb.
Other than relocation of the timestamp field in struct inode, this
doesn't cause any functional changes.
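A rough userspace sketch of why the timestamp belongs on the link rather
than on the inode: once an inode has more than one link, each link can
expire on its own schedule.  The demo_* names are hypothetical, and
demo_time_after() only imitates the wrap-safe comparison idea behind the
kernel's time_after(); it is not the kernel macro.

```c
#include <assert.h>
#include <stdbool.h>

/* Wrap-safe "a is after b", the same idea as the kernel's time_after(). */
static inline bool demo_time_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* Stand-in for inode_wb_link carrying its own dirtied timestamp. */
struct demo_iwbl {
	unsigned long dirtied_when;
};

/* Per-link expiry test, in the spirit of iwbl_dirtied_after(). */
static inline bool demo_dirtied_after(const struct demo_iwbl *iwbl,
				      unsigned long t)
{
	return demo_time_after(iwbl->dirtied_when, t);
}

/* Two links of one inode can expire independently of each other. */
static inline void demo_selfcheck(void)
{
	struct demo_iwbl root = { .dirtied_when = 100 };
	struct demo_iwbl cg = { .dirtied_when = 300 };

	assert(!demo_dirtied_after(&root, 200));	/* old enough to flush */
	assert(demo_dirtied_after(&cg, 200));		/* dirtied too recently */
}
```

With a single timestamp on the inode, the two links in demo_selfcheck()
could not give different answers for the same cutoff.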
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c                | 28 ++++++++++++----------------
 fs/inode.c                       |  2 +-
 include/linux/backing-dev-defs.h |  1 +
 include/linux/fs.h               |  2 --
 include/trace/events/writeback.h |  4 ++--
 5 files changed, 16 insertions(+), 21 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 2a5e400..6851088 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -521,23 +521,19 @@ void inode_wb_list_del(struct inode *inode)
  * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
  * furthest end of its superblock's dirty-inode list.
  *
- * Before stamping the inode's ->dirtied_when, we check to see whether it is
+ * Before stamping the iwbl's ->dirtied_when, we check to see whether it is
  * already the most-recently-dirtied inode on the b_dirty list.  If that is
  * the case then the inode must have been redirtied while it was being written
  * out and we don't reset its dirtied_when.
  */
 static void redirty_tail(struct inode_wb_link *iwbl, struct bdi_writeback *wb)
 {
-	struct inode *inode = iwbl_to_inode(iwbl);
-
 	if (!list_empty(&wb->b_dirty)) {
-		struct inode_wb_link *tail_iwbl;
-		struct inode *tail;
+		struct inode_wb_link *tail;
 
-		tail_iwbl = dirty_list_to_iwbl(wb->b_dirty.next);
-		tail = iwbl_to_inode(tail_iwbl);
-		if (time_before(inode->dirtied_when, tail->dirtied_when))
-			inode->dirtied_when = jiffies;
+		tail = dirty_list_to_iwbl(wb->b_dirty.next);
+		if (time_before(iwbl->dirtied_when, tail->dirtied_when))
+			iwbl->dirtied_when = jiffies;
 	}
 	iwbl_move_locked(iwbl, wb, &wb->b_dirty);
 }
@@ -560,9 +556,9 @@ static void inode_sync_complete(struct inode *inode)
 	wake_up_bit(&inode->i_state, __I_SYNC);
 }
 
-static bool inode_dirtied_after(struct inode *inode, unsigned long t)
+static bool iwbl_dirtied_after(struct inode_wb_link *iwbl, unsigned long t)
 {
-	bool ret = time_after(inode->dirtied_when, t);
+	bool ret = time_after(iwbl->dirtied_when, t);
 #ifndef CONFIG_64BIT
 	/*
 	 * For inodes being constantly redirtied, dirtied_when can get stuck.
@@ -570,7 +566,7 @@ static bool inode_dirtied_after(struct inode *inode, unsigned long t)
 	 * This test is necessary to prevent such wrapped-around relative times
 	 * from permanently stopping the whole bdi writeback.
 	 */
-	ret = ret && time_before_eq(inode->dirtied_when, jiffies);
+	ret = ret && time_before_eq(iwbl->dirtied_when, jiffies);
 #endif
 	return ret;
 }
@@ -596,7 +592,7 @@ static int move_expired_inodes(struct list_head *delaying_queue,
 		inode = iwbl_to_inode(iwbl);
 
 		if (work->older_than_this &&
-		    inode_dirtied_after(inode, *work->older_than_this))
+		    iwbl_dirtied_after(iwbl, *work->older_than_this))
 			break;
 		list_move(&iwbl->dirty_list, &tmp);
 		moved++;
@@ -733,7 +729,7 @@ static void requeue_inode(struct inode_wb_link *iwbl, struct bdi_writeback *wb,
 	 */
 	if ((inode->i_state & I_DIRTY) &&
 	    (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages))
-		inode->dirtied_when = jiffies;
+		iwbl->dirtied_when = jiffies;
 
 	if (wbc->pages_skipped) {
 		/*
@@ -1488,7 +1484,7 @@ static noinline void block_dump___mark_inode_dirty(struct inode *inode)
  * In short, make sure you hash any inodes _before_ you start marking
  * them dirty.
  *
- * Note that for blockdevs, inode->dirtied_when represents the dirtying time of
+ * Note that for blockdevs, iwbl->dirtied_when represents the dirtying time of
  * the block-special inode (/dev/hda1) itself.  And the ->dirtied_when field of
  * the kernel-internal blockdev inode represents the dirtying time of the
  * blockdev's pages.  This is why for I_DIRTY_PAGES we always use
@@ -1567,7 +1563,7 @@ void mark_inode_dirty_dctx(struct dirty_context *dctx, int flags)
 			     !test_bit(WB_registered, &bdi->wb.state),
 			     "bdi-%s not registered\n", bdi->name);
 
-			inode->dirtied_when = jiffies;
+			iwbl->dirtied_when = jiffies;
 			wakeup_bdi = iwbl_move_locked(iwbl, &bdi->wb,
 						      &bdi->wb.b_dirty);
 			spin_unlock(&bdi->wb.list_lock);
diff --git a/fs/inode.c b/fs/inode.c
index b38d7d6..66c9b68 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -152,7 +152,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
 	inode->i_bdev = NULL;
 	inode->i_cdev = NULL;
 	inode->i_rdev = 0;
-	inode->dirtied_when = 0;
+	inode->i_wb_link.dirtied_when = 0;
 
 	if (security_inode_alloc(inode))
 		goto out;
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 9720cac..01f27e3 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -152,6 +152,7 @@ struct inode_wb_link {
 	 */
 	unsigned long		data;
 #endif
+	unsigned long		dirtied_when;
 	struct list_head	dirty_list;
 };
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ea0b68f..fb261b4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -607,8 +607,6 @@ struct inode {
 	unsigned long		i_state;
 	struct mutex		i_mutex;
 
-	unsigned long		dirtied_when;	/* jiffies of first dirtying */
-
 	struct hlist_node	i_hash;
 	struct inode_wb_link	i_wb_link;	/* backing dev IO list */
 	struct list_head	i_lru;		/* inode LRU list */
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 8622b5b..8bc68ac 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -494,7 +494,7 @@ TRACE_EVENT(writeback_sb_inodes_requeue,
 		        dev_name(inode_to_bdi(inode)->dev), 32);
 		__entry->ino		= inode->i_ino;
 		__entry->state		= inode->i_state;
-		__entry->dirtied_when	= inode->dirtied_when;
+		__entry->dirtied_when	= inode->i_wb_link.dirtied_when;
 	),
 
 	TP_printk("bdi %s: ino=%lu state=%s dirtied_when=%lu age=%lu",
@@ -565,7 +565,7 @@ DECLARE_EVENT_CLASS(writeback_single_inode_template,
 			dev_name(inode_to_bdi(inode)->dev), 32);
 		__entry->ino		= inode->i_ino;
 		__entry->state		= inode->i_state;
-		__entry->dirtied_when	= inode->dirtied_when;
+		__entry->dirtied_when	= inode->i_wb_link.dirtied_when;
 		__entry->writeback_index = inode->i_mapping->writeback_index;
 		__entry->nr_to_write	= nr_to_write;
 		__entry->wrote		= nr_to_write - wbc->nr_to_write;
-- 
2.1.0
- * [PATCH 33/45] writeback: minor reorganization of fs/fs-writeback.c
  2015-01-06 21:25 [PATCHSET RFC block/for-next] writeback: cgroup writeback support Tejun Heo
                   ` (31 preceding siblings ...)
  2015-01-06 21:26 ` [PATCH 32/45] vfs, writeback: move inode->dirtied_when into inode->i_wb_link Tejun Heo
@ 2015-01-06 21:26 ` Tejun Heo
  2015-01-06 21:26 ` [PATCH 34/45] vfs, writeback: implement support for multiple inode_wb_link's Tejun Heo
                   ` (12 subsequent siblings)
  45 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2015-01-06 21:26 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, jack, hch, hannes, linux-fsdevel, vgoyal, lizefan,
	cgroups, linux-mm, mhocko, clm, fengguang.wu, david, Tejun Heo
Move iwbl_{move|del}_locked() and __inode_wait_for_writeback() upwards
before the #ifdef CONFIG_CGROUP_WRITEBACK block and make separate
identical copies of inode_sleep_on_writeback() in the #ifdef and
#else branches.  The relocation and the two copies will help the
following cgroup writeback changes.
This is pure reorganization.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c | 199 ++++++++++++++++++++++++++++++------------------------
 1 file changed, 109 insertions(+), 90 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 6851088..ab77ed2 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -132,6 +132,75 @@ static void wb_wait_for_completion(struct backing_dev_info *bdi,
 	wait_event(bdi->wb_waitq, !atomic_read(&done->cnt));
 }
 
+/**
+ * iwbl_move_locked - move an inode_wb_link onto a bdi_writeback IO list
+ * @iwbl: inode_wb_link to be moved
+ * @wb: target bdi_writeback
+ * @head: one of @wb->b_{dirty|io|more_io}
+ *
+ * Move @iwbl->dirty_list to @list of @wb and set %WB_has_dirty_io.
+ * Returns %true if all IO lists were empty before; otherwise, %false.
+ */
+static bool iwbl_move_locked(struct inode_wb_link *iwbl,
+			     struct bdi_writeback *wb, struct list_head *head)
+{
+	assert_spin_locked(&wb->list_lock);
+
+	list_move(&iwbl->dirty_list, head);
+
+	if (wb_has_dirty_io(wb)) {
+		return false;
+	} else {
+		set_bit(WB_has_dirty_io, &wb->state);
+		WARN_ON_ONCE(!wb->avg_write_bandwidth);
+		atomic_long_add(wb->avg_write_bandwidth,
+				&wb->bdi->tot_write_bandwidth);
+		return true;
+	}
+}
+
+/**
+ * iwbl_del_locked - remove an inode_wb_link from its bdi_writeback IO list
+ * @iwbl: inode_wb_link to be removed
+ * @wb: bdi_writeback @inode is being removed from
+ *
+ * Remove @iwbl which may be on one of @wb->b_{dirty|io|more_io} lists and
+ * clear %WB_has_dirty_io if all are empty afterwards.
+ */
+static void iwbl_del_locked(struct inode_wb_link *iwbl,
+			    struct bdi_writeback *wb)
+{
+	assert_spin_locked(&wb->list_lock);
+
+	list_del_init(&iwbl->dirty_list);
+
+	if (wb_has_dirty_io(wb) && list_empty(&wb->b_dirty) &&
+	    list_empty(&wb->b_io) && list_empty(&wb->b_more_io)) {
+		clear_bit(WB_has_dirty_io, &wb->state);
+		WARN_ON_ONCE(atomic_long_sub_return(wb->avg_write_bandwidth,
+					&wb->bdi->tot_write_bandwidth) < 0);
+	}
+}
+
+/*
+ * Wait for writeback on an inode to complete. Called with i_lock held.
+ * Caller must make sure inode cannot go away when we drop i_lock.
+ */
+static void __inode_wait_for_writeback(struct inode *inode)
+	__releases(inode->i_lock)
+	__acquires(inode->i_lock)
+{
+	DEFINE_WAIT_BIT(wq, &inode->i_state, __I_SYNC);
+	wait_queue_head_t *wqh;
+
+	wqh = bit_waitqueue(&inode->i_state, __I_SYNC);
+	while (inode->i_state & I_SYNC) {
+		spin_unlock(&inode->i_lock);
+		__wait_on_bit(wqh, &wq, bit_wait, TASK_UNINTERRUPTIBLE);
+		spin_lock(&inode->i_lock);
+	}
+}
+
 #ifdef CONFIG_CGROUP_WRITEBACK
 
 /**
@@ -331,6 +400,26 @@ restart:
 	rcu_read_unlock();
 }
 
+/*
+ * Sleep until I_SYNC is cleared. This function must be called with i_lock
+ * held and drops it. It is aimed for callers not holding any inode reference
+ * so once i_lock is dropped, inode can go away.
+ */
+static void inode_sleep_on_writeback(struct inode *inode)
+	__releases(inode->i_lock)
+{
+	DEFINE_WAIT(wait);
+	wait_queue_head_t *wqh = bit_waitqueue(&inode->i_state, __I_SYNC);
+	int sleep;
+
+	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+	sleep = inode->i_state & I_SYNC;
+	spin_unlock(&inode->i_lock);
+	if (sleep)
+		schedule();
+	finish_wait(wqh, &wait);
+}
+
 #else	/* CONFIG_CGROUP_WRITEBACK */
 
 static void init_cgwb_dirty_page_context(struct dirty_context *dctx)
@@ -358,6 +447,26 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi,
 	}
 }
 
+/*
+ * Sleep until I_SYNC is cleared. This function must be called with i_lock
+ * held and drops it. It is aimed for callers not holding any inode reference
+ * so once i_lock is dropped, inode can go away.
+ */
+static void inode_sleep_on_writeback(struct inode *inode)
+	__releases(inode->i_lock)
+{
+	DEFINE_WAIT(wait);
+	wait_queue_head_t *wqh = bit_waitqueue(&inode->i_state, __I_SYNC);
+	int sleep;
+
+	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+	sleep = inode->i_state & I_SYNC;
+	spin_unlock(&inode->i_lock);
+	if (sleep)
+		schedule();
+	finish_wait(wqh, &wait);
+}
+
 #endif	/* CONFIG_CGROUP_WRITEBACK */
 
 /**
@@ -451,56 +560,6 @@ void wb_start_background_writeback(struct bdi_writeback *wb)
 	wb_wakeup(wb);
 }
 
-/**
- * iwbl_move_locked - move an inode_wb_link onto a bdi_writeback IO list
- * @iwbl: inode_wb_link to be moved
- * @wb: target bdi_writeback
- * @head: one of @wb->b_{dirty|io|more_io}
- *
- * Move @iwbl->dirty_list to @list of @wb and set %WB_has_dirty_io.
- * Returns %true if all IO lists were empty before; otherwise, %false.
- */
-static bool iwbl_move_locked(struct inode_wb_link *iwbl,
-			     struct bdi_writeback *wb, struct list_head *head)
-{
-	assert_spin_locked(&wb->list_lock);
-
-	list_move(&iwbl->dirty_list, head);
-
-	if (wb_has_dirty_io(wb)) {
-		return false;
-	} else {
-		set_bit(WB_has_dirty_io, &wb->state);
-		WARN_ON_ONCE(!wb->avg_write_bandwidth);
-		atomic_long_add(wb->avg_write_bandwidth,
-				&wb->bdi->tot_write_bandwidth);
-		return true;
-	}
-}
-
-/**
- * iwbl_del_locked - remove an inode_wb_link from its bdi_writeback IO list
- * @iwbl: inode_wb_link to be removed
- * @wb: bdi_writeback @inode is being removed from
- *
- * Remove @iwbl which may be on one of @wb->b_{dirty|io|more_io} lists and
- * clear %WB_has_dirty_io if all are empty afterwards.
- */
-static void iwbl_del_locked(struct inode_wb_link *iwbl,
-			    struct bdi_writeback *wb)
-{
-	assert_spin_locked(&wb->list_lock);
-
-	list_del_init(&iwbl->dirty_list);
-
-	if (wb_has_dirty_io(wb) && list_empty(&wb->b_dirty) &&
-	    list_empty(&wb->b_io) && list_empty(&wb->b_more_io)) {
-		clear_bit(WB_has_dirty_io, &wb->state);
-		WARN_ON_ONCE(atomic_long_sub_return(wb->avg_write_bandwidth,
-					&wb->bdi->tot_write_bandwidth) < 0);
-	}
-}
-
 /*
  * Remove the inode from the writeback list it is on.
  */
@@ -657,26 +716,6 @@ static int write_inode(struct inode *inode, struct writeback_control *wbc)
 }
 
 /*
- * Wait for writeback on an inode to complete. Called with i_lock held.
- * Caller must make sure inode cannot go away when we drop i_lock.
- */
-static void __inode_wait_for_writeback(struct inode *inode)
-	__releases(inode->i_lock)
-	__acquires(inode->i_lock)
-{
-	DEFINE_WAIT_BIT(wq, &inode->i_state, __I_SYNC);
-	wait_queue_head_t *wqh;
-
-	wqh = bit_waitqueue(&inode->i_state, __I_SYNC);
-	while (inode->i_state & I_SYNC) {
-		spin_unlock(&inode->i_lock);
-		__wait_on_bit(wqh, &wq, bit_wait,
-			      TASK_UNINTERRUPTIBLE);
-		spin_lock(&inode->i_lock);
-	}
-}
-
-/*
  * Wait for writeback on an inode to complete. Caller must have inode pinned.
  */
 void inode_wait_for_writeback(struct inode *inode)
@@ -687,26 +726,6 @@ void inode_wait_for_writeback(struct inode *inode)
 }
 
 /*
- * Sleep until I_SYNC is cleared. This function must be called with i_lock
- * held and drops it. It is aimed for callers not holding any inode reference
- * so once i_lock is dropped, inode can go away.
- */
-static void inode_sleep_on_writeback(struct inode *inode)
-	__releases(inode->i_lock)
-{
-	DEFINE_WAIT(wait);
-	wait_queue_head_t *wqh = bit_waitqueue(&inode->i_state, __I_SYNC);
-	int sleep;
-
-	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
-	sleep = inode->i_state & I_SYNC;
-	spin_unlock(&inode->i_lock);
-	if (sleep)
-		schedule();
-	finish_wait(wqh, &wait);
-}
-
-/*
  * Find proper writeback list for the inode depending on its current state and
  * possibly also change of its state while we were doing writeback.  Here we
  * handle things such as livelock prevention or fairness of writeback among
-- 
2.1.0
- * [PATCH 34/45] vfs, writeback: implement support for multiple inode_wb_link's
  2015-01-06 21:25 [PATCHSET RFC block/for-next] writeback: cgroup writeback support Tejun Heo
                   ` (32 preceding siblings ...)
  2015-01-06 21:26 ` [PATCH 33/45] writeback: minor reorganization of fs/fs-writeback.c Tejun Heo
@ 2015-01-06 21:26 ` Tejun Heo
  2015-01-06 21:26 ` [PATCH 35/45] vfs, writeback: implement inode->i_nr_syncs Tejun Heo
                   ` (11 subsequent siblings)
  45 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2015-01-06 21:26 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, jack, hch, hannes, linux-fsdevel, vgoyal, lizefan,
	cgroups, linux-mm, mhocko, clm, fengguang.wu, david, Tejun Heo,
	Alexander Viro
An inode may be written to from more than one cgroup, and for cgroup
writeback support to work properly the inode needs to be on the dirty
lists of all wb's (bdi_writeback's) corresponding to the dirtying
cgroups so that writeback on each cgroup can keep track of and process
the inode.
The previous patches separated out iwbl (inode_wb_link), which is used
to dirty an inode against a wb (bdi_writeback).  Currently, there's
only one embedded iwbl per inode which can be used to dirty the inode
against the root wb.  This patch introduces icgwbl (inode_cgwb_link)
which includes iwbl and can be used to link an inode to a non-root
cgroup wb.
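As a rough userspace illustration of the two link flavours described
above (not the kernel code itself), the sketch below shows why a
root link can be mapped back to its inode with container_of() while a
cgroup link needs an explicit back-pointer.  All type and function
names here (struct wb, struct wb_link, struct cg_link,
link_to_inode()) are simplified stand-ins invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct wb { bool is_root; };		/* stand-in for bdi_writeback */
struct wb_link { struct wb *wb; };	/* stand-in for inode_wb_link */

struct inode {
	int ino;
	struct wb_link root_link;	/* embedded, like inode->i_wb_link */
};

struct cg_link {			/* stand-in for inode_cgwb_link */
	struct wb_link link;		/* embedded generic link */
	struct inode *inode;		/* explicit pointer back to the inode */
};

/* Map a link back to its inode, whichever flavour it is. */
static struct inode *link_to_inode(struct wb_link *link)
{
	assert(link && link->wb);

	/* The root link lives inside the inode itself. */
	if (link->wb->is_root)
		return container_of(link, struct inode, root_link);

	/* A cgroup link is allocated separately and carries a back-pointer. */
	return container_of(link, struct cg_link, link)->inode;
}
```

The point of the sketch is only the dispatch on root vs. non-root; the
real iwbl_to_inode() in this patch decides that by comparing
wb->blkcg_css against blkcg_root_css.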
Each icgwbl points to the associated inode directly and wb through its
iwbl, and is linked on inode->i_cgwb_links and wb->icgwbls.  They're
created on demand and destroyed only when either the associated inode
or wb is destroyed.  When a page is about to be dirtied, the matching
i[cg]wbl is looked up or created and recorded in the newly added
dirty_context->iwbl field.  The next patch will use the field to link
inodes against their matching cgroup wb's.
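The lookup-or-create step with its fallback to the root link can be
sketched in plain C as follows.  This is a simplified model under
stated assumptions: there is no locking or RCU, blkcg ID 0 stands for
the root cgroup, and MAX_LINKS is a hypothetical cap used only to make
the creation-failure path easy to reach.  None of these names exist in
the kernel.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_LINKS 4	/* hypothetical cap, only to model creation failure */

struct wb_link { int blkcg_id; };	/* stand-in for inode_wb_link */

struct inode {
	struct wb_link root_link;		/* blkcg_id 0 == root */
	struct wb_link *cg_links[MAX_LINKS];	/* on-demand cgroup links */
	int nr_links;
};

/* Fast path: find an existing link for @blkcg_id, NULL if none. */
static struct wb_link *link_lookup(struct inode *inode, int blkcg_id)
{
	if (blkcg_id == 0)
		return &inode->root_link;
	for (int i = 0; i < inode->nr_links; i++)
		if (inode->cg_links[i]->blkcg_id == blkcg_id)
			return inode->cg_links[i];
	return NULL;
}

/* Slow path: allocate and register a new cgroup link, NULL on failure. */
static struct wb_link *link_create(struct inode *inode, int blkcg_id)
{
	struct wb_link *link;

	assert(blkcg_id != 0);	/* the root link is never created here */
	if (inode->nr_links == MAX_LINKS)
		return NULL;
	link = malloc(sizeof(*link));
	if (!link)
		return NULL;
	link->blkcg_id = blkcg_id;
	inode->cg_links[inode->nr_links++] = link;
	return link;
}

/* Look up, else create, else fall back to the root link. */
static struct wb_link *link_for_dirtying(struct inode *inode, int blkcg_id)
{
	struct wb_link *link = link_lookup(inode, blkcg_id);

	if (!link)
		link = link_create(inode, blkcg_id);
	if (!link)
		link = &inode->root_link;
	return link;
}
```

In the patch itself the fallback additionally forces the page onto the
root blkcg; the sketch leaves that part out.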
Currently, icgwbls are linked on a linked list on each inode and
linearly looked up on each dirtying attempt, which is an obvious
scalability bottleneck.  We want an RCU-safe balanced tree here but
the kernel doesn't have such an indexing structure in tree yet.  Given
that dirtying the same inode from numerous different cgroups isn't too
frequent, I think the linear list is a bandaid we can use for now but
we really should switch to something proper (e.g. a bonsai tree) soon.
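For illustration only, the per-inode list kept sorted by blkcg ID and
searched linearly could look like the following in userspace C.  The
sketch uses a plain singly linked list with no RCU and no locking, and
struct cg_link, link_lookup() and link_insert_sorted() are names made
up for the example.

```c
#include <assert.h>
#include <stddef.h>

struct cg_link {
	int blkcg_id;		/* sort key */
	struct cg_link *next;
};

/* Linear lookup: O(n) in the number of cgroups dirtying the inode. */
static struct cg_link *link_lookup(struct cg_link *head, int blkcg_id)
{
	for (struct cg_link *pos = head; pos; pos = pos->next)
		if (pos->blkcg_id == blkcg_id)
			return pos;
	return NULL;
}

/*
 * Insert @link keeping the list in ascending blkcg_id order.  If a link
 * with the same ID already exists, it is returned and @link is not added.
 */
static struct cg_link *link_insert_sorted(struct cg_link **head,
					  struct cg_link *link)
{
	struct cg_link **pp = head;

	assert(head && link);

	/* Walk to the first entry whose ID is not smaller than ours. */
	while (*pp && (*pp)->blkcg_id < link->blkcg_id)
		pp = &(*pp)->next;

	if (*pp && (*pp)->blkcg_id == link->blkcg_id)
		return *pp;

	link->next = *pp;
	*pp = link;
	return link;
}
```

Both operations walk the list from the head, which is the cost the
paragraph above calls a bandaid; a balanced tree would replace both
walks with logarithmic ones.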
This patch adds a struct hlist_head to struct inode when
CONFIG_CGROUP_WRITEBACK is enabled, growing it by the size of a pointer.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c                | 121 ++++++++++++++++++++++++++++++++++-----
 fs/inode.c                       |   3 +
 include/linux/backing-dev-defs.h |  34 +++++++++++
 include/linux/backing-dev.h      |  77 +++++++++++++++++++++++--
 include/linux/fs.h               |   3 +
 mm/backing-dev.c                 |  80 ++++++++++++++++++++++++++
 6 files changed, 300 insertions(+), 18 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index ab77ed2..d10c231 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -182,6 +182,18 @@ static void iwbl_del_locked(struct inode_wb_link *iwbl,
 	}
 }
 
+static void iwbl_del(struct inode_wb_link *iwbl)
+{
+	struct bdi_writeback *wb = iwbl_to_wb(iwbl);
+
+	if (list_empty(&iwbl->dirty_list))
+		return;
+
+	spin_lock(&wb->list_lock);
+	iwbl_del_locked(iwbl, wb);
+	spin_unlock(&wb->list_lock);
+}
+
 /*
  * Wait for writeback on an inode to complete. Called with i_lock held.
  * Caller must make sure inode cannot go away when we drop i_lock.
@@ -260,16 +272,34 @@ static void init_cgwb_dirty_page_context(struct dirty_context *dctx)
 	 * back to the root blkcg.
 	 */
 	blkcg_css = page_blkcg_attach_dirty(dctx->page);
-	dctx->wb = cgwb_lookup_create(bdi, blkcg_css);
-	if (!dctx->wb) {
-		page_blkcg_detach_dirty(dctx->page);
-		goto force_root;
+
+	/* if iwbl already exists, wb can be determined from that too */
+	dctx->iwbl = iwbl_lookup(dctx->inode, blkcg_css);
+	if (dctx->iwbl) {
+		dctx->wb = iwbl_to_wb(dctx->iwbl);
+		return;
 	}
+
+	/* slow path, let's create wb and iwbl */
+	dctx->wb = cgwb_lookup_create(bdi, blkcg_css);
+	if (!dctx->wb)
+		goto detach_dirty;
+
+	dctx->iwbl = iwbl_create(dctx->inode, dctx->wb);
+	if (!dctx->iwbl)
+		goto detach_dirty;
+
 	return;
 
+detach_dirty:
+	page_blkcg_detach_dirty(dctx->page);
 force_root:
 	page_blkcg_force_root_dirty(dctx->page);
 	dctx->wb = &bdi->wb;
+	if (dctx->inode)
+		dctx->iwbl = &dctx->inode->i_wb_link;
+	else
+		dctx->iwbl = NULL;
 }
 
 /**
@@ -420,11 +450,78 @@ static void inode_sleep_on_writeback(struct inode *inode)
 	finish_wait(wqh, &wait);
 }
 
+static inline struct inode_cgwb_link *icgwbl_first(struct inode *inode)
+{
+	struct hlist_node *node =
+		rcu_dereference_check(hlist_first_rcu(&inode->i_cgwb_links),
+			lockdep_is_held(&inode_to_bdi(inode)->icgwbls_lock));
+
+	return hlist_entry_safe(node, struct inode_cgwb_link, inode_node);
+}
+
+static inline struct inode_cgwb_link *icgwbl_next(struct inode_cgwb_link *pos,
+						  struct inode *inode)
+{
+	struct hlist_node *node =
+		rcu_dereference_check(hlist_next_rcu(&pos->inode_node),
+			lockdep_is_held(&inode_to_bdi(inode)->icgwbls_lock));
+
+	return hlist_entry_safe(node, struct inode_cgwb_link, inode_node);
+}
+
+/**
+ * inode_for_each_icgwbl - walk all icgwbl's of an inode
+ * @cur: cursor struct inode_cgwb_link pointer
+ * @nxt: temp struct inode_cgwb_link pointer
+ * @inode: inode to walk icgwbl's of
+ *
+ * Walk @inode's icgwbl's (inode_cgwb_link's).  rcu_read_lock() must be
+ * held throughout iteration.
+ */
+#define inode_for_each_icgwbl(cur, nxt, inode)				\
+	for ((cur) = icgwbl_first((inode)),				\
+	     (nxt) = (cur) ? icgwbl_next((cur), (inode)) : NULL;	\
+	     (cur);							\
+	     (cur) = (nxt),						\
+	     (nxt) = (nxt) ? icgwbl_next((nxt), (inode)) : NULL)
+
+static void inode_icgwbls_del(struct inode *inode)
+{
+	LIST_HEAD(to_free);
+	struct backing_dev_info *bdi = inode_to_bdi(inode);
+	struct inode_cgwb_link *icgwbl, *next;
+	unsigned long flags;
+
+	spin_lock_irqsave(&bdi->icgwbls_lock, flags);
+
+	/* I_FREEING must be set here to disallow further iwbl_create() */
+	WARN_ON_ONCE(!(inode->i_state & I_FREEING));
+
+	/*
+	 * We don't wanna nest wb->list_lock under bdi->icgwbls_lock as the
+	 * latter is irq-safe and the former isn't.  Queue icgwbls on
+	 * @to_free and perform iwbl_del() and freeing after releasing
+	 * bdi->icgwbls_lock.
+	 */
+	inode_for_each_icgwbl(icgwbl, next, inode) {
+		hlist_del_rcu(&icgwbl->inode_node);
+		list_move(&icgwbl->wb_node, &to_free);
+	}
+
+	spin_unlock_irqrestore(&bdi->icgwbls_lock, flags);
+
+	list_for_each_entry_safe(icgwbl, next, &to_free, wb_node) {
+		iwbl_del(&icgwbl->iwbl);
+		kfree_rcu(icgwbl, rcu);
+	}
+}
+
 #else	/* CONFIG_CGROUP_WRITEBACK */
 
 static void init_cgwb_dirty_page_context(struct dirty_context *dctx)
 {
 	dctx->wb = &dctx->mapping->backing_dev_info->wb;
+	dctx->iwbl = dctx->inode ? &dctx->inode->i_wb_link : NULL;
 }
 
 static long wb_split_bdi_pages(struct bdi_writeback *wb, long nr_pages)
@@ -467,6 +564,10 @@ static void inode_sleep_on_writeback(struct inode *inode)
 	finish_wait(wqh, &wait);
 }
 
+static void inode_icgwbls_del(struct inode *inode)
+{
+}
+
 #endif	/* CONFIG_CGROUP_WRITEBACK */
 
 /**
@@ -510,6 +611,7 @@ void init_dirty_inode_context(struct dirty_context *dctx, struct inode *inode)
 	memset(dctx, 0, sizeof(*dctx));
 	dctx->inode = inode;
 	dctx->wb = &inode_to_bdi(inode)->wb;
+	dctx->iwbl = &inode->i_wb_link;
 }
 
 void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
@@ -565,15 +667,8 @@ void wb_start_background_writeback(struct bdi_writeback *wb)
  */
 void inode_wb_list_del(struct inode *inode)
 {
-	struct inode_wb_link *iwbl = &inode->i_wb_link;
-	struct bdi_writeback *wb = iwbl_to_wb(iwbl);
-
-	if (list_empty(&iwbl->dirty_list))
-		return;
-
-	spin_lock(&wb->list_lock);
-	iwbl_del_locked(iwbl, wb);
-	spin_unlock(&wb->list_lock);
+	iwbl_del(&inode->i_wb_link);
+	inode_icgwbls_del(inode);
 }
 
 /*
diff --git a/fs/inode.c b/fs/inode.c
index 66c9b68..8a55494 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -374,6 +374,9 @@ void inode_init_once(struct inode *inode)
 	INIT_LIST_HEAD(&inode->i_lru);
 	address_space_init_once(&inode->i_data);
 	i_size_ordered_init(inode);
+#ifdef CONFIG_CGROUP_WRITEBACK
+	INIT_HLIST_HEAD(&inode->i_cgwb_links);
+#endif
 #ifdef CONFIG_FSNOTIFY
 	INIT_HLIST_HEAD(&inode->i_fsnotify_marks);
 #endif
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 01f27e3..e448edc 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -99,6 +99,7 @@ struct bdi_writeback {
 #ifdef CONFIG_CGROUP_WRITEBACK
 	struct cgroup_subsys_state *blkcg_css; /* the blkcg we belong to */
 	struct list_head blkcg_node;	/* anchored at blkcg->wb_list */
+	struct list_head icgwbls;	/* inode_cgwb_links of this wb */
 	union {
 		struct list_head shutdown_node;
 		struct rcu_head rcu;
@@ -127,6 +128,7 @@ struct backing_dev_info {
 	struct bdi_writeback wb; /* the root writeback info for this bdi */
 #ifdef CONFIG_CGROUP_WRITEBACK
 	struct radix_tree_root cgwb_tree; /* radix tree of !root cgroup wbs */
+	spinlock_t icgwbls_lock; /* protects wb->icgwbls and inode->i_cgwb_links */
 #endif
 	wait_queue_head_t wb_waitq;
 
@@ -157,6 +159,37 @@ struct inode_wb_link {
 };
 
 /*
+ * Used to link a dirty inode on a non-root wb (bdi_writeback).  An inode
+ * may have multiple of these as it gets dirtied on non-root wb's.  Linked
+ * on both the inode and wb and destroyed when either goes away.
+ *
+ * TODO: When an inode is being dirtied against a non-root wb, its
+ * ->i_cgwb_links is searched linearly to locate the matching icgwbl
+ * (inode_cgwb_link).  The linear search is a scalability bottleneck but
+ * the kernel currently doesn't have an indexing data structure which would
+ * fit this use case.  A balanced tree which can be walked under RCU read
+ * lock is necessary (e.g. bonsai tree).  Once such indexing data structure
+ * is available, icgwbl should be converted to use that.
+ */
+struct inode_cgwb_link {
+	struct inode_wb_link	iwbl;
+
+	struct inode		*inode;		/* the associated inode */
+
+	/*
+	 * ->inode_node is anchored at inode->i_cgwb_links and ->wb_node at
+	 * bdi_writeback->icgwbls.  Both are write-protected by
+	 * bdi->icgwbls_lock but the former can be traversed under RCU and
+	 * is sorted by the associated blkcg ID to allow traversal
+	 * continuation after dropping RCU read lock.
+	 */
+	struct hlist_node	inode_node;	/* RCU-safe, sorted */
+	struct list_head	wb_node;
+
+	struct rcu_head		rcu;
+};
+
+/*
  * The following structure carries context used during page and inode
  * dirtying.  Should be initialized with init_dirty_{inode|page}_context().
  */
@@ -165,6 +198,7 @@ struct dirty_context {
 	struct inode		*inode;
 	struct address_space	*mapping;
 	struct bdi_writeback	*wb;
+	struct inode_wb_link	*iwbl;
 };
 
 enum {
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index bc69c7f..6c16d10 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -274,16 +274,13 @@ void init_dirty_page_context(struct dirty_context *dctx, struct page *page,
 			     struct address_space *mapping);
 void init_dirty_inode_context(struct dirty_context *dctx, struct inode *inode);
 
-static inline struct inode *iwbl_to_inode(struct inode_wb_link *iwbl)
-{
-	return container_of(iwbl, struct inode, i_wb_link);
-}
-
 #ifdef CONFIG_CGROUP_WRITEBACK
 
 void cgwb_blkcg_released(struct cgroup_subsys_state *blkcg_css);
 int __cgwb_create(struct backing_dev_info *bdi,
 		  struct cgroup_subsys_state *blkcg_css);
+struct inode_wb_link *iwbl_create(struct inode *inode,
+				  struct bdi_writeback *wb);
 int mapping_congested(struct address_space *mapping, struct task_struct *task,
 		      int bdi_bits);
 
@@ -469,6 +466,60 @@ static inline struct bdi_writeback *iwbl_to_wb(struct inode_wb_link *iwbl)
 	return (void *)(iwbl->data & ~IWBL_FLAGS_MASK);
 }
 
+static inline bool iwbl_is_root(struct inode_wb_link *iwbl)
+{
+	struct bdi_writeback *wb = iwbl_to_wb(iwbl);
+
+	return wb->blkcg_css == blkcg_root_css;
+}
+
+static inline struct inode *iwbl_to_inode(struct inode_wb_link *iwbl)
+{
+	if (iwbl_is_root(iwbl)) {
+		return container_of(iwbl, struct inode, i_wb_link);
+	} else {
+		struct inode_cgwb_link *icgwbl =
+			container_of(iwbl, struct inode_cgwb_link, iwbl);
+		return icgwbl->inode;
+	}
+}
+
+/**
+ * iwbl_lookup - lookup iwbl for dirtying an inode against a blkcg_css
+ * @inode: target inode
+ * @blkcg_css: target blkcg_css
+ *
+ * Lookup iwbl (inode_wb_link) for dirtying @inode against @blkcg_css.  If
+ * found, the returned iwbl is associated with the bdi_writeback of
+ * @blkcg_css on @inode's bdi.  If not found, %NULL is returned.
+ *
+ * The returned iwbl remains accessible as long as both @inode and
+ * @blkcg_css are alive.
+ */
+static inline struct inode_wb_link *
+iwbl_lookup(struct inode *inode, struct cgroup_subsys_state *blkcg_css)
+{
+	struct inode_wb_link *iwbl = NULL;
+	struct inode_cgwb_link *icgwbl;
+
+	if (blkcg_css == blkcg_root_css)
+		return &inode->i_wb_link;
+
+	/*
+	 * RCU protects the lookup itself.  Once looked up, the iwbl's
+	 * lifetime is governed by those of @inode and @blkcg_css.
+	 */
+	rcu_read_lock();
+	hlist_for_each_entry_rcu(icgwbl, &inode->i_cgwb_links, inode_node) {
+		if (iwbl_to_wb(&icgwbl->iwbl)->blkcg_css == blkcg_css) {
+			iwbl = &icgwbl->iwbl;
+			break;
+		}
+	}
+	rcu_read_unlock();
+	return iwbl;
+}
+
 #else	/* CONFIG_CGROUP_WRITEBACK */
 
 static inline bool mapping_cgwb_enabled(struct address_space *mapping)
@@ -522,6 +573,11 @@ static inline void init_i_wb_link(struct inode *inode)
 {
 }
 
+static inline struct inode *iwbl_to_inode(struct inode_wb_link *iwbl)
+{
+	return container_of(iwbl, struct inode, i_wb_link);
+}
+
 static inline struct bdi_writeback *iwbl_to_wb(struct inode_wb_link *iwbl)
 {
 	struct inode *inode = iwbl_to_inode(iwbl);
@@ -529,6 +585,17 @@ static inline struct bdi_writeback *iwbl_to_wb(struct inode_wb_link *iwbl)
 	return &inode_to_bdi(inode)->wb;
 }
 
+static inline bool iwbl_is_root(struct inode_wb_link *iwbl)
+{
+	return true;
+}
+
+static inline struct inode_wb_link *
+iwbl_lookup(struct inode *inode, struct cgroup_subsys_state *blkcg_css)
+{
+	return &inode->i_wb_link;
+}
+
 #endif	/* CONFIG_CGROUP_WRITEBACK */
 
 static inline int mapping_read_congested(struct address_space *mapping,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index fb261b4..b394821 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -609,6 +609,9 @@ struct inode {
 
 	struct hlist_node	i_hash;
 	struct inode_wb_link	i_wb_link;	/* backing dev IO list */
+#ifdef CONFIG_CGROUP_WRITEBACK
+	struct hlist_head	i_cgwb_links;	/* sorted inode_cgwb_links */
+#endif
 	struct list_head	i_lru;		/* inode LRU list */
 	struct list_head	i_sb_list;
 	union {
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index cc8d21a..e4db465 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -464,6 +464,7 @@ int __cgwb_create(struct backing_dev_info *bdi,
 		return -ENOMEM;
 	}
 
+	INIT_LIST_HEAD(&wb->icgwbls);
 	wb->blkcg_css = blkcg_css;
 	set_bit(WB_registered, &wb->state); /* cgwbs are always registered */
 
@@ -532,16 +533,31 @@ static void cgwb_shutdown_commit(struct list_head *to_shutdown)
 
 static void cgwb_exit(struct bdi_writeback *wb)
 {
+	struct inode_cgwb_link *icgwbl, *next;
+	unsigned long flags;
+
+	spin_lock_irqsave(&wb->bdi->icgwbls_lock, flags);
+	list_for_each_entry_safe(icgwbl, next, &wb->icgwbls, wb_node) {
+		WARN_ON_ONCE(!list_empty(&icgwbl->iwbl.dirty_list));
+		hlist_del_rcu(&icgwbl->inode_node);
+		list_del(&icgwbl->wb_node);
+		kfree_rcu(icgwbl, rcu);
+	}
+	spin_unlock_irqrestore(&wb->bdi->icgwbls_lock, flags);
+
 	WARN_ON(!radix_tree_delete(&wb->bdi->cgwb_tree, wb->blkcg_css->id));
 	list_del(&wb->blkcg_node);
+
 	wb_exit(wb);
 	kfree_rcu(wb, rcu);
 }
 
 static void cgwb_bdi_init(struct backing_dev_info *bdi)
 {
+	INIT_LIST_HEAD(&bdi->wb.icgwbls);
 	bdi->wb.blkcg_css = blkcg_root_css;
 	INIT_RADIX_TREE(&bdi->cgwb_tree, GFP_ATOMIC);
+	spin_lock_init(&bdi->icgwbls_lock);
 }
 
 /**
@@ -613,6 +629,70 @@ void cgwb_blkcg_released(struct cgroup_subsys_state *blkcg_css)
 	spin_unlock_irq(&cgwb_lock);
 }
 
+/**
+ * iwbl_create - create an inode_cgwb_link
+ * @inode: target inode
+ * @wb: target bdi_writeback
+ *
+ * Try to create an iwbl (inode_wb_link) for dirtying @inode against @wb.
+ * This function can be called under any context without locking as long as
+ * @inode and @wb are kept alive.  See iwbl_lookup() for details.
+ *
+ * Returns the pointer to the created or found iwbl on success, %NULL on
+ * failure.
+ */
+struct inode_wb_link *iwbl_create(struct inode *inode, struct bdi_writeback *wb)
+{
+	struct inode_wb_link *iwbl = NULL;
+	struct inode_cgwb_link *icgwbl;
+	unsigned long flags;
+
+	icgwbl = kzalloc(sizeof(*icgwbl), GFP_ATOMIC);
+	if (!icgwbl)
+		return NULL;
+
+	icgwbl->iwbl.data = (unsigned long)wb;
+	INIT_LIST_HEAD(&icgwbl->iwbl.dirty_list);
+	icgwbl->inode = inode;
+
+	spin_lock_irqsave(&wb->bdi->icgwbls_lock, flags);
+
+	/*
+	 * Testing I_FREEING under icgwbls_lock guarantees that no new
+	 * icgwbl's will be created after inode_icgwbls_del().
+	 */
+	if (inode->i_state & I_FREEING)
+		goto out_unlock;
+
+	iwbl = iwbl_lookup(inode, wb->blkcg_css);
+	if (!iwbl) {
+		struct inode_cgwb_link *prev = NULL, *pos;
+		int blkcg_id = wb->blkcg_css->id;
+
+		/* i_cgwb_links is sorted by blkcg ID */
+		hlist_for_each_entry_rcu(pos, &inode->i_cgwb_links, inode_node) {
+			if (iwbl_to_wb(&pos->iwbl)->blkcg_css->id > blkcg_id)
+				break;
+			prev = pos;
+		}
+		if (prev)
+			hlist_add_behind_rcu(&icgwbl->inode_node,
+					     &prev->inode_node);
+		else
+			hlist_add_head_rcu(&icgwbl->inode_node,
+					   &inode->i_cgwb_links);
+
+		list_add(&icgwbl->wb_node, &wb->icgwbls);
+
+		iwbl = &icgwbl->iwbl;
+		icgwbl = NULL;
+	}
+out_unlock:
+	spin_unlock_irqrestore(&wb->bdi->icgwbls_lock, flags);
+	kfree(icgwbl);
+	return iwbl;
+}
+
 #else	/* CONFIG_CGROUP_WRITEBACK */
 
 static void cgwb_bdi_init(struct backing_dev_info *bdi) { }
-- 
2.1.0
- * [PATCH 35/45] vfs, writeback: implement inode->i_nr_syncs
  2015-01-06 21:25 [PATCHSET RFC block/for-next] writeback: cgroup writeback support Tejun Heo
                   ` (33 preceding siblings ...)
  2015-01-06 21:26 ` [PATCH 34/45] vfs, writeback: implement support for multiple inode_wb_link's Tejun Heo
@ 2015-01-06 21:26 ` Tejun Heo
  2015-01-06 21:26 ` [PATCH 36/45] writeback: dirty inodes against their matching cgroup bdi_writeback's Tejun Heo
                   ` (10 subsequent siblings)
  45 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2015-01-06 21:26 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, jack, hch, hannes, linux-fsdevel, vgoyal, lizefan,
	cgroups, linux-mm, mhocko, clm, fengguang.wu, david, Tejun Heo,
	Alexander Viro
Currently, I_SYNC is used to keep track of whether writeback is in
progress on the inode or not.  With cgroup writeback support, multiple
writebacks could be in progress simultaneously and a single bit in
inode->i_state isn't sufficient.
If CONFIG_CGROUP_WRITEBACK is enabled, this patch makes each iwbl
(inode_wb_link) track whether writeback is in progress using the new
IWBL_SYNC flag on iwbl->data, and inode->i_nr_syncs aggregates the total
number of writebacks in progress on the inode.  New helpers,
iwbl_{test|set|clear}_sync(), iwbl_sync_wakeup() and
iwbl_wait_for_writeback() are added to
manipulate these states and inode_sleep_on_writeback() is converted to
iwbl_sleep_on_writeback().  I_SYNC retains the same meaning - it's set
if any writeback is in progress and cleared if none.
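The relationship between the per-link flag, the per-inode counter and
I_SYNC can be modelled in a few lines of userspace C.  This is a
sketch, not the patch's code: locking and wakeups are omitted, the flag
is a plain bool rather than a bit in iwbl->data, and the struct and
function names are simplified stand-ins.

```c
#include <assert.h>
#include <stdbool.h>

#define I_SYNC 0x1u	/* stand-in for the kernel's I_SYNC state bit */

struct inode {
	unsigned int state;	/* holds I_SYNC */
	int nr_syncs;		/* number of links with writeback in progress */
};

struct wb_link {
	bool sync;		/* stand-in for IWBL_SYNC on iwbl->data */
};

/* Mark writeback in progress on @link; I_SYNC is set while any is. */
static void link_set_sync(struct wb_link *link, struct inode *inode)
{
	assert(!link->sync);
	link->sync = true;
	inode->nr_syncs++;
	inode->state |= I_SYNC;
}

/* Undo link_set_sync(); returns true if this was the last writeback. */
static bool link_clear_sync(struct wb_link *link, struct inode *inode)
{
	bool last;

	assert(link->sync);
	link->sync = false;
	last = (--inode->nr_syncs == 0);
	if (last)
		inode->state &= ~I_SYNC;
	return last;
}
```

With two links in flight, clearing the first leaves I_SYNC set and
only clearing the second drops it, which matches the "set if any,
cleared if none" meaning described above.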
If !CONFIG_CGROUP_WRITEBACK, the helpers simply operate on I_SYNC
directly and there are no behavioral changes compared to before.  When
CONFIG_CGROUP_WRITEBACK is enabled, this adds an atomic_t to struct
inode.  This
competes for the same left over 4 byte slot w/ i_readcount from
CONFIG_IMA on 64 bit, and, as long as CONFIG_IMA isn't enabled, it
doesn't increase the size of struct inode on 64 bit.
This allows keeping track of writeback in-progress state per cgroup
bdi_writeback.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c                | 172 +++++++++++++++++++++++++++++++++------
 include/linux/backing-dev-defs.h |   7 ++
 include/linux/fs.h               |   3 +
 mm/backing-dev.c                 |   1 +
 4 files changed, 160 insertions(+), 23 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index d10c231..df99b5b 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -430,26 +430,113 @@ restart:
 	rcu_read_unlock();
 }
 
+/**
+ * iwbl_test_sync - test whether writeback is in progress on an inode_wb_link
+ * @iwbl: target inode_wb_link
+ *
+ * Test whether writeback is in progress for the inode on the bdi_writeback
+ * specified by @iwbl.  The caller is responsible for synchronization.
+ */
+static bool iwbl_test_sync(struct inode_wb_link *iwbl)
+{
+	return test_bit(IWBL_SYNC, &iwbl->data);
+}
+
+/**
+ * iwbl_set_sync - mark an inode_wb_link that writeback is in progress
+ * @iwbl: target inode_wb_link
+ * @inode: inode @iwbl is associated with
+ *
+ * Mark that writeback is in progress for @inode on the bdi_writeback
+ * specified by @iwbl.  iwbl_test_sync() will return %true on @iwbl and
+ * %I_SYNC is set on @inode while there's any writeback in progress on it.
+ */
+static void iwbl_set_sync(struct inode_wb_link *iwbl, struct inode *inode)
+{
+	lockdep_assert_held(&inode->i_lock);
+	WARN_ON_ONCE(iwbl_test_sync(iwbl));
+
+	set_bit(IWBL_SYNC, &iwbl->data);
+	inode->i_nr_syncs++;
+	inode->i_state |= I_SYNC;
+}
+
+/**
+ * iwbl_clear_sync - undo iwbl_set_sync()
+ * @iwbl: target inode_wb_link
+ * @inode: inode @iwbl is associated with
+ *
+ * Returns %true if this was the last writeback in progress on @inode;
+ * %false, otherwise.
+ */
+static bool iwbl_clear_sync(struct inode_wb_link *iwbl, struct inode *inode)
+{
+	bool sync_complete;
+
+	lockdep_assert_held(&inode->i_lock);
+	WARN_ON_ONCE(!iwbl_test_sync(iwbl));
+
+	clear_bit(IWBL_SYNC, &iwbl->data);
+	sync_complete = !--inode->i_nr_syncs;
+	if (sync_complete)
+		inode->i_state &= ~I_SYNC;
+	return sync_complete;
+}
+
+/**
+ * iwbl_wait_for_writeback - wait for writeback in progress on an inode_wb_link
+ * @iwbl: target inode_wb_link
+ *
+ * Wait for the writeback in progress for the inode on the bdi_writeback
+ * specified by @iwbl.
+ */
+static void iwbl_wait_for_writeback(struct inode_wb_link *iwbl)
+	__releases(inode->i_lock)
+	__acquires(inode->i_lock)
+{
+	struct inode *inode = iwbl_to_inode(iwbl);
+	DEFINE_WAIT_BIT(wq, &iwbl->data, IWBL_SYNC);
+	wait_queue_head_t *wqh;
+
+	lockdep_assert_held(&inode->i_lock);
+
+	wqh = bit_waitqueue(&iwbl->data, IWBL_SYNC);
+	while (test_bit(IWBL_SYNC, &iwbl->data)) {
+		spin_unlock(&inode->i_lock);
+		__wait_on_bit(wqh, &wq, bit_wait, TASK_UNINTERRUPTIBLE);
+		spin_lock(&inode->i_lock);
+	}
+}
+
 /*
  * Sleep until I_SYNC is cleared. This function must be called with i_lock
  * held and drops it. It is aimed for callers not holding any inode reference
  * so once i_lock is dropped, inode can go away.
  */
-static void inode_sleep_on_writeback(struct inode *inode)
-	__releases(inode->i_lock)
+static void iwbl_sleep_on_writeback(struct inode_wb_link *iwbl)
 {
 	DEFINE_WAIT(wait);
-	wait_queue_head_t *wqh = bit_waitqueue(&inode->i_state, __I_SYNC);
+	struct inode *inode = iwbl_to_inode(iwbl);
+	wait_queue_head_t *wqh = bit_waitqueue(&iwbl->data, IWBL_SYNC);
 	int sleep;
 
 	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
-	sleep = inode->i_state & I_SYNC;
+	sleep = test_bit(IWBL_SYNC, &iwbl->data);
 	spin_unlock(&inode->i_lock);
 	if (sleep)
 		schedule();
 	finish_wait(wqh, &wait);
 }
 
+/**
+ * iwbl_sync_wakeup - wakeup iwbl_{wait_for|sleep_on}_writeback() waiter
+ * @iwbl: target inode_wb_link
+ */
+static void iwbl_sync_wakeup(struct inode_wb_link *iwbl)
+{
+	wake_up_bit(&iwbl->data, IWBL_SYNC);
+}
+
 static inline struct inode_cgwb_link *icgwbl_first(struct inode *inode)
 {
 	struct hlist_node *node =
@@ -504,6 +591,7 @@ static void inode_icgwbls_del(struct inode *inode)
 	 * bdi->icgwbls_lock.
 	 */
 	inode_for_each_icgwbl(icgwbl, next, inode) {
+		WARN_ON_ONCE(test_bit(IWBL_SYNC, &icgwbl->iwbl.data));
 		hlist_del_rcu(&icgwbl->inode_node);
 		list_move(&icgwbl->wb_node, &to_free);
 	}
@@ -544,15 +632,39 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi,
 	}
 }
 
+static bool iwbl_test_sync(struct inode_wb_link *iwbl)
+{
+	struct inode *inode = iwbl_to_inode(iwbl);
+
+	return inode->i_state & I_SYNC;
+}
+
+static void iwbl_set_sync(struct inode_wb_link *iwbl, struct inode *inode)
+{
+	inode->i_state |= I_SYNC;
+}
+
+static bool iwbl_clear_sync(struct inode_wb_link *iwbl, struct inode *inode)
+{
+	inode->i_state &= ~I_SYNC;
+	return true;
+}
+
+static void iwbl_wait_for_writeback(struct inode_wb_link *iwbl)
+{
+	__inode_wait_for_writeback(iwbl_to_inode(iwbl));
+}
+
 /*
  * Sleep until I_SYNC is cleared. This function must be called with i_lock
  * held and drops it. It is aimed for callers not holding any inode reference
  * so once i_lock is dropped, inode can go away.
  */
-static void inode_sleep_on_writeback(struct inode *inode)
+static void iwbl_sleep_on_writeback(struct inode_wb_link *iwbl)
 	__releases(inode->i_lock)
 {
 	DEFINE_WAIT(wait);
+	struct inode *inode = iwbl_to_inode(iwbl);
 	wait_queue_head_t *wqh = bit_waitqueue(&inode->i_state, __I_SYNC);
 	int sleep;
 
@@ -564,6 +676,11 @@ static void inode_sleep_on_writeback(struct inode *inode)
 	finish_wait(wqh, &wait);
 }
 
+static void iwbl_sync_wakeup(struct inode_wb_link *iwbl)
+{
+	/* noop, __I_SYNC wakeup is enough */
+}
+
 static void inode_icgwbls_del(struct inode *inode)
 {
 }
@@ -700,14 +817,22 @@ static void requeue_io(struct inode_wb_link *iwbl, struct bdi_writeback *wb)
 	iwbl_move_locked(iwbl, wb, &wb->b_more_io);
 }
 
-static void inode_sync_complete(struct inode *inode)
+static void iwbl_sync_complete(struct inode_wb_link *iwbl)
 {
-	inode->i_state &= ~I_SYNC;
+	struct inode *inode = iwbl_to_inode(iwbl);
+	bool sync_complete;
+
+	sync_complete = iwbl_clear_sync(iwbl, inode);
 	/* If inode is clean an unused, put it into LRU now... */
-	inode_add_lru(inode);
+	if (sync_complete)
+		inode_add_lru(inode);
+
 	/* Waiters must see I_SYNC cleared before being woken up */
 	smp_mb();
-	wake_up_bit(&inode->i_state, __I_SYNC);
+
+	iwbl_sync_wakeup(iwbl);
+	if (sync_complete)
+		wake_up_bit(&inode->i_state, __I_SYNC);
 }
 
 static bool iwbl_dirtied_after(struct inode_wb_link *iwbl, unsigned long t)
@@ -888,17 +1013,18 @@ static void requeue_inode(struct inode_wb_link *iwbl, struct bdi_writeback *wb,
 /*
  * Write out an inode and its dirty pages. Do not update the writeback list
  * linkage. That is left to the caller. The caller is also responsible for
- * setting I_SYNC flag and calling inode_sync_complete() to clear it.
+ * setting I_SYNC flag and calling iwbl_sync_complete() to clear it.
  */
 static int
 __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 {
 	struct address_space *mapping = inode->i_mapping;
+	struct inode_wb_link *iwbl = &inode->i_wb_link;
 	long nr_to_write = wbc->nr_to_write;
 	unsigned dirty;
 	int ret;
 
-	WARN_ON(!(inode->i_state & I_SYNC));
+	WARN_ON(!iwbl_test_sync(iwbl));
 
 	trace_writeback_single_inode_start(inode, wbc, nr_to_write);
 
@@ -976,7 +1102,7 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 	else
 		WARN_ON(inode->i_state & I_WILL_FREE);
 
-	if (inode->i_state & I_SYNC) {
+	if (iwbl_test_sync(iwbl)) {
 		if (wbc->sync_mode != WB_SYNC_ALL)
 			goto out;
 		/*
@@ -984,9 +1110,9 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 		 * inode reference or inode has I_WILL_FREE set, it cannot go
 		 * away under us.
 		 */
-		__inode_wait_for_writeback(inode);
+		iwbl_wait_for_writeback(iwbl);
 	}
-	WARN_ON(inode->i_state & I_SYNC);
+	WARN_ON(iwbl_test_sync(iwbl));
 	/*
 	 * Skip inode if it is clean and we have no outstanding writeback in
 	 * WB_SYNC_ALL mode. We don't want to mess with writeback lists in this
@@ -999,7 +1125,7 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 	    (wbc->sync_mode != WB_SYNC_ALL ||
 	     !mapping_tagged(inode->i_mapping, PAGECACHE_TAG_WRITEBACK)))
 		goto out;
-	inode->i_state |= I_SYNC;
+	iwbl_set_sync(iwbl, inode);
 	spin_unlock(&inode->i_lock);
 
 	ret = __writeback_single_inode(inode, wbc);
@@ -1013,7 +1139,7 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 	if (!(inode->i_state & I_DIRTY))
 		iwbl_del_locked(iwbl, wb);
 	spin_unlock(&wb->list_lock);
-	inode_sync_complete(inode);
+	iwbl_sync_complete(iwbl);
 out:
 	spin_unlock(&inode->i_lock);
 	return ret;
@@ -1107,7 +1233,7 @@ static long writeback_sb_inodes(struct super_block *sb,
 			redirty_tail(iwbl, wb);
 			continue;
 		}
-		if ((inode->i_state & I_SYNC) && wbc.sync_mode != WB_SYNC_ALL) {
+		if (iwbl_test_sync(iwbl) && wbc.sync_mode != WB_SYNC_ALL) {
 			/*
 			 * If this inode is locked for writeback and we are not
 			 * doing writeback-for-data-integrity, move it to
@@ -1129,14 +1255,14 @@ static long writeback_sb_inodes(struct super_block *sb,
 		 * are doing WB_SYNC_NONE writeback. So this catches only the
 		 * WB_SYNC_ALL case.
 		 */
-		if (inode->i_state & I_SYNC) {
+		if (iwbl_test_sync(iwbl)) {
 			/* Wait for I_SYNC. This function drops i_lock... */
-			inode_sleep_on_writeback(inode);
+			iwbl_sleep_on_writeback(iwbl);
 			/* Inode may be gone, start again */
 			spin_lock(&wb->list_lock);
 			continue;
 		}
-		inode->i_state |= I_SYNC;
+		iwbl_set_sync(iwbl, inode);
 		spin_unlock(&inode->i_lock);
 
 		write_chunk = writeback_chunk_size(wb, work);
@@ -1156,7 +1282,7 @@ static long writeback_sb_inodes(struct super_block *sb,
 		if (!(inode->i_state & I_DIRTY))
 			wrote++;
 		requeue_inode(iwbl, wb, &wbc);
-		inode_sync_complete(inode);
+		iwbl_sync_complete(iwbl);
 		spin_unlock(&inode->i_lock);
 		cond_resched_lock(&wb->list_lock);
 		/*
@@ -1356,7 +1482,7 @@ static long wb_writeback(struct bdi_writeback *wb,
 			spin_lock(&inode->i_lock);
 			spin_unlock(&wb->list_lock);
 			/* This function drops i_lock... */
-			inode_sleep_on_writeback(inode);
+			iwbl_sleep_on_writeback(iwbl);
 			spin_lock(&wb->list_lock);
 		}
 	}
@@ -1648,7 +1774,7 @@ void mark_inode_dirty_dctx(struct dirty_context *dctx, int flags)
 		 * The unlocker will place the inode on the appropriate
 		 * superblock list, based upon its state.
 		 */
-		if (inode->i_state & I_SYNC)
+		if (iwbl_test_sync(iwbl))
 			goto out_unlock_inode;
 
 		/*
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index e448edc..e3b18f3 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -47,8 +47,15 @@ enum wb_stat_item {
  * IWBL_* flags which occupy the lower bits of inode_wb_link->data.  The
  * upper bits point to bdi_writeback, so the number of these flags
  * determines the minimum alignment of bdi_writeback.
+ *
+ * IWBL_SYNC
+ *
+ *  Tracks whether writeback is in progress for an iwbl.  If this bit is
+ *  set for any iwbl on an inode, the inode's I_SYNC is set too.
  */
 enum {
+	IWBL_SYNC		= 0,
+
 	IWBL_FLAGS_BITS,
 	IWBL_FLAGS_MASK		= (1UL << IWBL_FLAGS_BITS) - 1,
 };
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b394821..4c22824 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -625,6 +625,9 @@ struct inode {
 #ifdef CONFIG_IMA
 	atomic_t		i_readcount; /* struct files open RO */
 #endif
+#ifdef CONFIG_CGROUP_WRITEBACK
+	unsigned int		i_nr_syncs;
+#endif
 	const struct file_operations	*i_fop;	/* former ->i_op->default_file_ops */
 	struct file_lock	*i_flock;
 	struct address_space	i_data;
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index e4db465..1399ad6 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -539,6 +539,7 @@ static void cgwb_exit(struct bdi_writeback *wb)
 	spin_lock_irqsave(&wb->bdi->icgwbls_lock, flags);
 	list_for_each_entry_safe(icgwbl, next, &wb->icgwbls, wb_node) {
 		WARN_ON_ONCE(!list_empty(&icgwbl->iwbl.dirty_list));
+		WARN_ON_ONCE(test_bit(IWBL_SYNC, &icgwbl->iwbl.data));
 		hlist_del_rcu(&icgwbl->inode_node);
 		list_del(&icgwbl->wb_node);
 		kfree_rcu(icgwbl, rcu);
-- 
2.1.0
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
- * [PATCH 36/45] writeback: dirty inodes against their matching cgroup bdi_writeback's
  2015-01-06 21:25 [PATCHSET RFC block/for-next] writeback: cgroup writeback support Tejun Heo
                   ` (34 preceding siblings ...)
  2015-01-06 21:26 ` [PATCH 35/45] vfs, writeback: implement inode->i_nr_syncs Tejun Heo
@ 2015-01-06 21:26 ` Tejun Heo
  2015-01-06 21:26 ` [PATCH 37/45] writeback: make writeback_control carry the inode_wb_link being served Tejun Heo
                   ` (9 subsequent siblings)
  45 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2015-01-06 21:26 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, jack, hch, hannes, linux-fsdevel, vgoyal, lizefan,
	cgroups, linux-mm, mhocko, clm, fengguang.wu, david, Tejun Heo
mark_inode_dirty_dctx() always dirtied the inode against the root wb
(bdi_writeback).  The previous patches added all the infrastructure
necessary to attribute an inode against the wb of the dirtying cgroup.
On entry to mark_inode_dirty_dctx(), @dctx now carries the matching wb
and iwbl (inode_wb_link).  This patch updates mark_inode_dirty_dctx()
so that it uses the wb and iwbl from @dctx instead of unconditionally
using the root ones.
Currently, none of the filesystems has FS_CGROUP_WRITEBACK and all
pages will keep being dirtied against the root wb.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
---
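Illustration only, not part of the patch: a minimal userspace sketch of
what changes for the caller.  The struct layouts, the `queued` and
`nr_dirty` fields and mark_dirty_sketch() are hypothetical stand-ins;
the only thing taken from the patch is that the wb and iwbl now come
from @dctx instead of being hard-wired to the root ones.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct bdi_writeback {
	int nr_dirty;			/* models the length of wb->b_dirty */
};

struct inode_wb_link {
	bool queued;			/* already on its wb's dirty list? */
};

struct dirty_context {
	struct inode_wb_link *iwbl;	/* link for the dirtying cgroup */
	struct bdi_writeback *wb;	/* wb of the dirtying cgroup */
};

/*
 * Queue dctx->iwbl on dctx->wb rather than on the root wb.  Returns true
 * if this was the first dirty entry on that wb, which is the condition
 * under which the real code considers a delayed flusher wakeup.
 */
bool mark_dirty_sketch(struct dirty_context *dctx)
{
	struct bdi_writeback *wb = dctx->wb;
	bool first;

	if (dctx->iwbl->queued)
		return false;		/* nothing to do */

	first = (wb->nr_dirty == 0);
	dctx->iwbl->queued = true;
	wb->nr_dirty++;
	return first;
}
```

With a dctx pointing at a non-root wb, only that wb's count changes; the
root wb is left untouched.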
 fs/fs-writeback.c | 22 ++++++++++------------
 1 file changed, 10 insertions(+), 12 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index df99b5b..dfcf5dd 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1734,8 +1734,9 @@ static noinline void block_dump___mark_inode_dirty(struct inode *inode)
 void mark_inode_dirty_dctx(struct dirty_context *dctx, int flags)
 {
 	struct inode *inode = dctx->inode;
+	struct inode_wb_link *iwbl = dctx->iwbl;
+	struct bdi_writeback *wb = dctx->wb;
 	struct super_block *sb = inode->i_sb;
-	struct backing_dev_info *bdi = NULL;
 
 	/*
 	 * Don't do this for I_DIRTY_PAGES - that doesn't actually
@@ -1764,7 +1765,6 @@ void mark_inode_dirty_dctx(struct dirty_context *dctx, int flags)
 
 	spin_lock(&inode->i_lock);
 	if ((inode->i_state & flags) != flags) {
-		struct inode_wb_link *iwbl = &inode->i_wb_link;
 		const int was_dirty = inode->i_state & I_DIRTY;
 
 		inode->i_state |= flags;
@@ -1794,19 +1794,17 @@ void mark_inode_dirty_dctx(struct dirty_context *dctx, int flags)
 		 */
 		if (!was_dirty) {
 			bool wakeup_bdi = false;
-			bdi = inode_to_bdi(inode);
 
 			spin_unlock(&inode->i_lock);
-			spin_lock(&bdi->wb.list_lock);
+			spin_lock(&wb->list_lock);
 
-			WARN(bdi_cap_writeback_dirty(bdi) &&
-			     !test_bit(WB_registered, &bdi->wb.state),
-			     "bdi-%s not registered\n", bdi->name);
+			WARN(bdi_cap_writeback_dirty(wb->bdi) &&
+			     !test_bit(WB_registered, &wb->state),
+			     "bdi-%s not registered\n", wb->bdi->name);
 
 			iwbl->dirtied_when = jiffies;
-			wakeup_bdi = iwbl_move_locked(iwbl, &bdi->wb,
-						      &bdi->wb.b_dirty);
-			spin_unlock(&bdi->wb.list_lock);
+			wakeup_bdi = iwbl_move_locked(iwbl, wb, &wb->b_dirty);
+			spin_unlock(&wb->list_lock);
 
 			/*
 			 * If this is the first dirty inode for this bdi,
@@ -1814,8 +1812,8 @@ void mark_inode_dirty_dctx(struct dirty_context *dctx, int flags)
 			 * to make sure background write-back happens
 			 * later.
 			 */
-			if (bdi_cap_writeback_dirty(bdi) && wakeup_bdi)
-				wb_wakeup_delayed(&bdi->wb);
+			if (bdi_cap_writeback_dirty(wb->bdi) && wakeup_bdi)
+				wb_wakeup_delayed(wb);
 			return;
 		}
 	}
-- 
2.1.0
- * [PATCH 37/45] writeback: make writeback_control carry the inode_wb_link being served
  2015-01-06 21:25 [PATCHSET RFC block/for-next] writeback: cgroup writeback support Tejun Heo
                   ` (35 preceding siblings ...)
  2015-01-06 21:26 ` [PATCH 36/45] writeback: dirty inodes against their matching cgroup bdi_writeback's Tejun Heo
@ 2015-01-06 21:26 ` Tejun Heo
  2015-01-06 21:26 ` [PATCH 38/45] writeback: make cyclic writeback cursor cgroup writeback aware Tejun Heo
                   ` (8 subsequent siblings)
  45 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2015-01-06 21:26 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, jack, hch, hannes, linux-fsdevel, vgoyal, lizefan,
	cgroups, linux-mm, mhocko, clm, fengguang.wu, david, Tejun Heo
A wbc (writeback_control) is constructed at the beginning of writeback
and passed throughout the writeback path.  It serves as the control
structure carrying both in and out parameters for writeback of each
inode.  This patch adds wbc->iwbl so that it also carries the cgroup
writeback context.
wbc_set_iwbl(), which is called by writeback_sb_inodes() while kicking
off writeback of each inode, associates the wbc with the iwbl
(inode_wb_link) being serviced.  inode_writeback_iwbl() and
bdi_writeback_wb() are used to determine the iwbl from inode and wb
(bdi_writeback) from bdi being serviced considering wbc->iwbl.  If the
wbc has a specific iwbl associated with it, the iwbl is used to
determine them; otherwise, the root cgroup is assumed.
This allows accessing the current cgroup wb and iwbl being serviced
throughout the writeback path.  __writeback_single_inode(), which used
to assume the root iwbl, is updated to use inode_writeback_iwbl().
writeback_single_inode() now also uses inode_writeback_iwbl(); it
drops @wb and determines it using iwbl_to_wb() instead.  For
clear_page_dirty_for_io() which used to re-lookup the dirty wb using
page_cgwb_dirty(), a new function, clear_page_dirty_for_io_wbc() which
takes additional @wbc and uses bdi_writeback_wb(), is added.
This patch also adds wbc_blkcg_css() which determines whether the
current writeback is for a specific cgroup and which cgroup.  This
will be used by future patches.
This propagates the cgroup writeback context throughout most of the
writeback path so that the cgroup specific data is accessible without
repeating lookups from page blkcg.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
---
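Illustration only, not part of the patch: a userspace sketch of the
fallback rule described above.  The structs are cut down to the fields
involved, and the iwbl carries a plain `wb` pointer here, which is an
assumption made for brevity (the series packs the wb pointer into
iwbl->data and decodes it with iwbl_to_wb()).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct bdi_writeback { int id; };
struct inode_wb_link { struct bdi_writeback *wb; };
struct backing_dev_info { struct bdi_writeback wb; };	/* root wb */
struct inode { struct inode_wb_link i_wb_link; };	/* root iwbl */
struct writeback_control { struct inode_wb_link *iwbl; }; /* NULL: root */

/* Use the iwbl attached to @wbc if there is one, else the root iwbl. */
struct inode_wb_link *
inode_writeback_iwbl(struct inode *inode, struct writeback_control *wbc)
{
	return wbc->iwbl ? wbc->iwbl : &inode->i_wb_link;
}

/* Same rule for the wb: the iwbl's wb if set, else the bdi's root wb. */
struct bdi_writeback *
bdi_writeback_wb(struct backing_dev_info *bdi, struct writeback_control *wbc)
{
	return wbc->iwbl ? wbc->iwbl->wb : &bdi->wb;
}
```

A wbc that never went through wbc_set_iwbl(), such as the on-stack one
in write_inode_now(), therefore keeps resolving to the root iwbl and wb.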
 fs/fs-writeback.c           | 36 ++++++++++++++++++++------
 include/linux/backing-dev.h | 62 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/writeback.h   |  3 +++
 mm/page-writeback.c         | 21 ++++++++++++---
 4 files changed, 111 insertions(+), 11 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index dfcf5dd..562b75f 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -604,6 +604,21 @@ static void inode_icgwbls_del(struct inode *inode)
 	}
 }
 
+/**
+ * wbc_set_iwbl - associate a wbc with an iwbl
+ * @wbc: target writeback_control
+ * @iwbl: inode_wb_link to associate @wbc with
+ *
+ * Writeback for @iwbl is about to be performed with @wbc as the control
+ * structure.  Associate @wbc with @iwbl so that writeback implementation
+ * can retrieve @iwbl from @wbc.
+ */
+static inline void wbc_set_iwbl(struct writeback_control *wbc,
+				struct inode_wb_link *iwbl)
+{
+	wbc->iwbl = iwbl;
+}
+
 #else	/* CONFIG_CGROUP_WRITEBACK */
 
 static void init_cgwb_dirty_page_context(struct dirty_context *dctx)
@@ -685,6 +700,11 @@ static void inode_icgwbls_del(struct inode *inode)
 {
 }
 
+static inline void wbc_set_iwbl(struct writeback_control *wbc,
+				struct inode_wb_link *iwbl)
+{
+}
+
 #endif	/* CONFIG_CGROUP_WRITEBACK */
 
 /**
@@ -1019,7 +1039,7 @@ static int
 __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 {
 	struct address_space *mapping = inode->i_mapping;
-	struct inode_wb_link *iwbl = &inode->i_wb_link;
+	struct inode_wb_link *iwbl = inode_writeback_iwbl(inode, wbc);
 	long nr_to_write = wbc->nr_to_write;
 	unsigned dirty;
 	int ret;
@@ -1090,10 +1110,10 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
  * and does more profound writeback list handling in writeback_sb_inodes().
  */
 static int
-writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
-		       struct writeback_control *wbc)
+writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 {
-	struct inode_wb_link *iwbl = &inode->i_wb_link;
+	struct inode_wb_link *iwbl = inode_writeback_iwbl(inode, wbc);
+	struct bdi_writeback *wb = iwbl_to_wb(iwbl);
 	int ret = 0;
 
 	spin_lock(&inode->i_lock);
@@ -1222,6 +1242,8 @@ static long writeback_sb_inodes(struct super_block *sb,
 			break;
 		}
 
+		wbc_set_iwbl(&wbc, iwbl);
+
 		/*
 		 * Don't bother with new inodes or inodes being freed, first
 		 * kind does not need periodic writeout yet, and for the latter
@@ -2020,7 +2042,6 @@ EXPORT_SYMBOL(sync_inodes_sb);
  */
 int write_inode_now(struct inode *inode, int sync)
 {
-	struct bdi_writeback *wb = iwbl_to_wb(&inode->i_wb_link);
 	struct writeback_control wbc = {
 		.nr_to_write = LONG_MAX,
 		.sync_mode = sync ? WB_SYNC_ALL : WB_SYNC_NONE,
@@ -2032,7 +2053,7 @@ int write_inode_now(struct inode *inode, int sync)
 		wbc.nr_to_write = 0;
 
 	might_sleep();
-	return writeback_single_inode(inode, wb, &wbc);
+	return writeback_single_inode(inode, &wbc);
 }
 EXPORT_SYMBOL(write_inode_now);
 
@@ -2049,8 +2070,7 @@ EXPORT_SYMBOL(write_inode_now);
  */
 int sync_inode(struct inode *inode, struct writeback_control *wbc)
 {
-	return writeback_single_inode(inode, iwbl_to_wb(&inode->i_wb_link),
-				      wbc);
+	return writeback_single_inode(inode, wbc);
 }
 EXPORT_SYMBOL(sync_inode);
 
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 6c16d10..5a163fa 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -520,6 +520,50 @@ iwbl_lookup(struct inode *inode, struct cgroup_subsys_state *blkcg_css)
 	return iwbl;
 }
 
+/**
+ * inode_writeback_iwbl - determine the inode_wb_link under writeback
+ * @inode: inode under writeback
+ * @wbc: writeback_control in effect
+ *
+ * Called from code path which is writing back @inode with @wbc to
+ * determine the iwbl (inode_wb_link) this writeback is for.  Guaranteed to
+ * return a valid iwbl.
+ */
+static inline struct inode_wb_link *
+inode_writeback_iwbl(struct inode *inode, struct writeback_control *wbc)
+{
+	return wbc->iwbl ?: &inode->i_wb_link;
+}
+
+/**
+ * bdi_writeback_wb - determine the bdi_writeback under writeback
+ * @bdi: backing_dev_info under writeback
+ * @wbc: writeback_control in effect
+ *
+ * Called from code path which is writing back @bdi with @wbc to determine
+ * the wb (bdi_writeback) this writeback is for.  Guaranteed to return a
+ * valid wb.
+ */
+static inline struct bdi_writeback *
+bdi_writeback_wb(struct backing_dev_info *bdi, struct writeback_control *wbc)
+{
+	return wbc->iwbl ? iwbl_to_wb(wbc->iwbl) : &bdi->wb;
+}
+
+/**
+ * wbc_blkcg_css - return the blkcg_css associated with a wbc
+ * @wbc: writeback_control of interest
+ *
+ * Return the blkcg_css of the inode_wb_link @wbc is associated with.  If
+ * @wbc hasn't been associated with an iwbl using wbc_set_iwbl(), %NULL is
+ * returned.
+ */
+static inline struct cgroup_subsys_state *
+wbc_blkcg_css(struct writeback_control *wbc)
+{
+	return wbc->iwbl ? iwbl_to_wb(wbc->iwbl)->blkcg_css : NULL;
+}
+
 #else	/* CONFIG_CGROUP_WRITEBACK */
 
 static inline bool mapping_cgwb_enabled(struct address_space *mapping)
@@ -596,6 +640,24 @@ iwbl_lookup(struct inode *inode, struct cgroup_subsys_state *blkcg_css)
 	return &inode->i_wb_link;
 }
 
+static inline struct inode_wb_link *
+inode_writeback_iwbl(struct inode *inode, struct writeback_control *wbc)
+{
+	return &inode->i_wb_link;
+}
+
+static inline struct bdi_writeback *
+bdi_writeback_wb(struct backing_dev_info *bdi, struct writeback_control *wbc)
+{
+	return &bdi->wb;
+}
+
+static inline struct cgroup_subsys_state *
+wbc_blkcg_css(struct writeback_control *wbc)
+{
+	return NULL;
+}
+
 #endif	/* CONFIG_CGROUP_WRITEBACK */
 
 static inline int mapping_read_congested(struct address_space *mapping,
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 75349bb..dad1953 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -84,6 +84,9 @@ struct writeback_control {
 	unsigned for_reclaim:1;		/* Invoked from the page allocator */
 	unsigned range_cyclic:1;	/* range_start is cyclic */
 	unsigned for_sync:1;		/* sync(2) WB_SYNC_ALL writeback */
+#ifdef CONFIG_CGROUP_WRITEBACK
+	struct inode_wb_link *iwbl;	/* iwbl this writeback is for */
+#endif
 };
 
 /*
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 4cf365c..3e31458 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -148,6 +148,9 @@ static struct timer_list writeout_period_timer =
 		TIMER_DEFERRED_INITIALIZER(writeout_period, 0, 0);
 static unsigned long writeout_period_time = 0;
 
+static int clear_page_dirty_for_io_wbc(struct page *page,
+				       struct writeback_control *wbc);
+
 /*
  * Length of period for aging writeout fractions of bdis. This is an
  * arbitrarily chosen number. The longer the period, the slower fractions will
@@ -1993,7 +1996,7 @@ continue_unlock:
 			}
 
 			BUG_ON(PageWriteback(page));
-			if (!clear_page_dirty_for_io(page))
+			if (!clear_page_dirty_for_io_wbc(page, wbc))
 				goto continue_unlock;
 
 			trace_wbc_writepage(wbc, mapping->backing_dev_info);
@@ -2326,7 +2329,8 @@ EXPORT_SYMBOL(set_page_dirty_lock);
  * This incoherency between the page's dirty flag and radix-tree tag is
  * unfortunate, but it only exists while the page is locked.
  */
-int clear_page_dirty_for_io(struct page *page)
+static int clear_page_dirty_for_io_wbc(struct page *page,
+				       struct writeback_control *wbc)
 {
 	struct address_space *mapping = page_mapping(page);
 
@@ -2369,7 +2373,8 @@ int clear_page_dirty_for_io(struct page *page)
 		 * exclusion.
 		 */
 		if (TestClearPageDirty(page)) {
-			struct bdi_writeback *wb = page_cgwb_dirty(page);
+			struct backing_dev_info *bdi = mapping->backing_dev_info;
+			struct bdi_writeback *wb = bdi_writeback_wb(bdi, wbc);
 
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 			dec_wb_stat(wb, WB_RECLAIMABLE);
@@ -2380,6 +2385,16 @@ int clear_page_dirty_for_io(struct page *page)
 	}
 	return TestClearPageDirty(page);
 }
+
+int clear_page_dirty_for_io(struct page *page)
+{
+	struct writeback_control wbc = {
+		.sync_mode = WB_SYNC_ALL,
+		.nr_to_write = 1,
+	};
+
+	return clear_page_dirty_for_io_wbc(page, &wbc);
+}
 EXPORT_SYMBOL(clear_page_dirty_for_io);
 
 int test_clear_page_writeback(struct page *page)
-- 
2.1.0
- * [PATCH 38/45] writeback: make cyclic writeback cursor cgroup writeback aware
  2015-01-06 21:25 [PATCHSET RFC block/for-next] writeback: cgroup writeback support Tejun Heo
                   ` (36 preceding siblings ...)
  2015-01-06 21:26 ` [PATCH 37/45] writeback: make writeback_control carry the inode_wb_link being served Tejun Heo
@ 2015-01-06 21:26 ` Tejun Heo
  2015-01-06 21:26 ` [PATCH 39/45] writeback: make DIRTY_PAGES tracking " Tejun Heo
                   ` (7 subsequent siblings)
  45 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2015-01-06 21:26 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, jack, hch, hannes, linux-fsdevel, vgoyal, lizefan,
	cgroups, linux-mm, mhocko, clm, fengguang.wu, david, Tejun Heo
address_space->writeback_index keeps track of where to write next for
cyclic writebacks.  When cgroup writeback is used, an address_space can
be written back by multiple wb's (bdi_writeback's) and sharing the
cyclic cursor across them doesn't make sense.
This patch adds inode_cgwb_link->writeback_index and introduces and
uses mapping_writeback_index() to determine the writeback cursor
to use.  If the writeback_control in effect indicates that non-root
cgroup writeback is in progress, the matching inode_cgwb_link's
writeback_index is used; otherwise, the mapping one is used.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
---
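Illustration only, not part of the patch: a userspace sketch of the
cursor selection.  The `is_root` flag is a hypothetical stand-in for the
blkcg_css == blkcg_root_css comparison, and container_of() is a reduced
version of the kernel macro without its type checking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned long pgoff_t;

/* Reduced container_of(): step back from a member to its container. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct inode_wb_link { bool is_root; };

struct inode_cgwb_link {
	struct inode_wb_link iwbl;
	pgoff_t writeback_index;	/* per-cgroup cyclic cursor */
};

struct address_space { pgoff_t writeback_index; };	/* root cursor */
struct writeback_control { struct inode_wb_link *iwbl; };

/* Return a pointer to the cursor that this writeback should use. */
pgoff_t *mapping_writeback_index(struct address_space *mapping,
				 struct writeback_control *wbc)
{
	struct inode_wb_link *iwbl = wbc->iwbl;

	if (!iwbl || iwbl->is_root)
		return &mapping->writeback_index;

	return &container_of(iwbl, struct inode_cgwb_link,
			     iwbl)->writeback_index;
}
```

Returning a pointer rather than a value lets write_cache_pages() both
read the starting offset and store done_index through the same handle,
so two cgroups writing back one mapping no longer overwrite each
other's position.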
 include/linux/backing-dev-defs.h |  2 ++
 include/linux/backing-dev.h      | 30 ++++++++++++++++++++++++++++++
 mm/page-writeback.c              |  5 +++--
 3 files changed, 35 insertions(+), 2 deletions(-)
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index e3b18f3..6d64a0e 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -193,6 +193,8 @@ struct inode_cgwb_link {
 	struct hlist_node	inode_node;	/* RCU-safe, sorted */
 	struct list_head	wb_node;
 
+	pgoff_t			writeback_index; /* cyclic writeback cursor */
+
 	struct rcu_head		rcu;
 };
 
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 5a163fa..57dd200 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -551,6 +551,30 @@ bdi_writeback_wb(struct backing_dev_info *bdi, struct writeback_control *wbc)
 }
 
 /**
+ * mapping_writeback_index - determine the cursor for cyclic writeback
+ * @mapping: address_space under writeback
+ * @wbc: writeback_control in effect
+ *
+ * Called from address_space_operations->writepages() implementations to
+ * retrieve the pointer to the cursor variable to use for cyclic
+ * writebacks.  If cgroup writeback is enabled, there's a separate cyclic
+ * cursor for each cgroup writing back @mapping.
+ */
+static inline pgoff_t *mapping_writeback_index(struct address_space *mapping,
+					       struct writeback_control *wbc)
+{
+	struct inode_wb_link *iwbl = wbc->iwbl;
+
+	if (!iwbl || iwbl_to_wb(iwbl)->blkcg_css == blkcg_root_css) {
+		return &mapping->writeback_index;
+	} else {
+		struct inode_cgwb_link *icgwbl =
+			container_of(iwbl, struct inode_cgwb_link, iwbl);
+		return &icgwbl->writeback_index;
+	}
+}
+
+/**
  * wbc_blkcg_css - return the blkcg_css associated with a wbc
  * @wbc: writeback_control of interest
  *
@@ -652,6 +676,12 @@ bdi_writeback_wb(struct backing_dev_info *bdi, struct writeback_control *wbc)
 	return &bdi->wb;
 }
 
+static inline pgoff_t *mapping_writeback_index(struct address_space *mapping,
+					       struct writeback_control *wbc)
+{
+	return &mapping->writeback_index;
+}
+
 static inline struct cgroup_subsys_state *
 wbc_blkcg_css(struct writeback_control *wbc)
 {
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 3e31458..753d76f 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1906,6 +1906,7 @@ int write_cache_pages(struct address_space *mapping,
 	int done = 0;
 	struct pagevec pvec;
 	int nr_pages;
+	pgoff_t *writeback_index_ptr = mapping_writeback_index(mapping, wbc);
 	pgoff_t uninitialized_var(writeback_index);
 	pgoff_t index;
 	pgoff_t end;		/* Inclusive */
@@ -1916,7 +1917,7 @@ int write_cache_pages(struct address_space *mapping,
 
 	pagevec_init(&pvec, 0);
 	if (wbc->range_cyclic) {
-		writeback_index = mapping->writeback_index; /* prev offset */
+		writeback_index = *writeback_index_ptr; /* prev offset */
 		index = writeback_index;
 		if (index == 0)
 			cycled = 1;
@@ -2048,7 +2049,7 @@ continue_unlock:
 		goto retry;
 	}
 	if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0))
-		mapping->writeback_index = done_index;
+		*writeback_index_ptr = done_index;
 
 	return ret;
 }
-- 
2.1.0
- * [PATCH 39/45] writeback: make DIRTY_PAGES tracking cgroup writeback aware
  2015-01-06 21:25 [PATCHSET RFC block/for-next] writeback: cgroup writeback support Tejun Heo
                   ` (37 preceding siblings ...)
  2015-01-06 21:26 ` [PATCH 38/45] writeback: make cyclic writeback cursor cgroup writeback aware Tejun Heo
@ 2015-01-06 21:26 ` Tejun Heo
  2015-01-06 21:26 ` [PATCH 40/45] writeback: make write_cache_pages() " Tejun Heo
                   ` (6 subsequent siblings)
  45 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2015-01-06 21:26 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, jack, hch, hannes, linux-fsdevel, vgoyal, lizefan,
	cgroups, linux-mm, mhocko, clm, fengguang.wu, david, Tejun Heo
I_DIRTY_PAGES on inode->i_state tracks whether its address_space
contains dirty pages.  When cgroup writeback is used, an address_space
can be dirtied against multiple wb's (bdi_writeback's) and we want to
be able to track dirty state per iwbl (inode_wb_link).
This patch adds IWBL_DIRTY_PAGES which tracks whether an iwbl has
dirty pages.  It's set along with I_DIRTY_PAGES when an inode gets
dirtied, but because the radix tree tags can't record which iwbl a page
was dirtied against, whether an iwbl became clean can't be decided by
testing PAGECACHE_TAG_DIRTY.  Instead, it's opportunistically cleared
after a whole address_space writeback and when I_DIRTY_PAGES is
cleared.  This isn't ideal but the cost of inaccuracies should be
reasonable.  See the comment on top of I_DIRTY_PAGES definition for
more info.
Note that non-root iwbl's are only attributed with dirty pages; the
metadata dirtiness - I_DIRTY_SYNC and I_DIRTY_DATASYNC - is always
attributed to the root iwbl.  This means that when an inode gets
dirtied for both metadata and dirty pages from non-root cgroup, it
will dirty both the root iwbl for the metadata and the matching cgroup
iwbl for the dirty pages.
This encapsulates I_DIRTY_* manipulations and testing through new
functions - iwbl_has_enough_dirty(), iwbl_set_dirty() and
iwbl_still_has_dirty_pages() - and introduces another mb which is
paired with the one in mark_inode_dirty_dctx() to interlock
IWBL_DIRTY_PAGES testing and clearing.  Comments for the mb's are
updated to reflect it.
write_cache_pages() is updated to use
mapping_writeback_{maybe|confirm}_whole() to clear IWBL_DIRTY_PAGES
opportunistically.  Filesystems which implement custom writepages
should be updated similarly to support cgroup writeback.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
---
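Illustration only, not part of the patch: a userspace model of the
per-iwbl dirty tracking.  The flag values, the `is_root` and
`dirty_pages` fields and `mapping_has_dirty_tag` are hypothetical
stand-ins (for iwbl_is_root(), IWBL_DIRTY_PAGES and
mapping_tagged(..., PAGECACHE_TAG_DIRTY) respectively); locking and
memory barriers are left out entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag values only; these are not the kernel's. */
#define I_DIRTY_SYNC		(1u << 0)
#define I_DIRTY_DATASYNC	(1u << 1)
#define I_DIRTY_PAGES		(1u << 2)

struct inode {
	unsigned int i_state;
	bool mapping_has_dirty_tag;	/* any dirty page in the mapping? */
};

struct inode_wb_link {
	bool is_root;
	bool dirty_pages;		/* models IWBL_DIRTY_PAGES */
};

/* Are all of @dirty already set, including the per-iwbl pages bit? */
bool iwbl_has_enough_dirty(const struct inode_wb_link *iwbl,
			   const struct inode *inode, unsigned int dirty)
{
	return (inode->i_state & dirty) == dirty &&
	       (!(dirty & I_DIRTY_PAGES) || iwbl->dirty_pages);
}

/* Set @dirty and report whether @iwbl was already dirty beforehand. */
bool iwbl_set_dirty(struct inode_wb_link *iwbl, struct inode *inode,
		    unsigned int dirty)
{
	bool was_dirty = iwbl->dirty_pages;

	/* metadata dirtiness is always attributed to the root iwbl */
	if (iwbl->is_root &&
	    (inode->i_state & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)))
		was_dirty = true;

	inode->i_state |= dirty;
	if (dirty & I_DIRTY_PAGES)
		iwbl->dirty_pages = true;
	return was_dirty;
}

/* May return true spuriously: the tag is shared by every iwbl. */
bool iwbl_still_has_dirty_pages(struct inode_wb_link *iwbl,
				const struct inode *inode)
{
	if (!inode->mapping_has_dirty_tag) {
		iwbl->dirty_pages = false;	/* opportunistic clear */
		return false;
	}
	return iwbl->dirty_pages;
}
```

In this model, dirtying metadata from any cgroup marks only the root
iwbl, while dirtying pages marks the iwbl of the dirtying cgroup; an
inode dirtied both ways from a non-root cgroup ends up queued on two
iwbls.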
 fs/fs-writeback.c                | 186 +++++++++++++++++++++++++++++++--------
 include/linux/backing-dev-defs.h |  17 ++++
 include/linux/backing-dev.h      |  72 +++++++++++++++
 mm/page-writeback.c              |  21 ++++-
 4 files changed, 255 insertions(+), 41 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 562b75f..dbfd0b0 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -619,6 +619,71 @@ static inline void wbc_set_iwbl(struct writeback_control *wbc,
 	wbc->iwbl = iwbl;
 }
 
+/**
+ * iwbl_has_enough_dirty - does an iwbl and its inode have dirty bits already?
+ * @iwbl: inode_wb_link of interest
+ * @inode: inode @iwbl belongs to
+ * @dirty: I_DIRTY_* bits to be set
+ *
+ * @inode is being dirtied with @dirty by @iwbl's cgroup, test whether
+ * @iwbl and @inode already have all the dirty bits set.  Each iwbl has
+ * separate %IWBL_DIRTY_PAGES bit which should be also set if @dirty has
+ * %I_DIRTY_PAGES.
+ */
+static inline bool iwbl_has_enough_dirty(struct inode_wb_link *iwbl,
+					 struct inode *inode, int dirty)
+{
+	return (inode->i_state & dirty) == dirty &&
+		(!(dirty & I_DIRTY_PAGES) ||
+		 test_bit(IWBL_DIRTY_PAGES, &iwbl->data));
+}
+
+/**
+ * iwbl_set_dirty - set dirty bits on an iwbl and its inode
+ * @iwbl: inode_wb_link of interest
+ * @inode: inode @iwbl belongs to
+ * @dirty: I_DIRTY_* bits to be set
+ *
+ * Set @dirty on @iwbl and @inode and return whether @iwbl was already
+ * dirty.  @iwbl only carries the data dirty bit through %IWBL_DIRTY_PAGES.
+ */
+static inline bool iwbl_set_dirty(struct inode_wb_link *iwbl,
+				  struct inode *inode, int dirty)
+{
+	bool was_dirty = test_bit(IWBL_DIRTY_PAGES, &iwbl->data);
+
+	/* metadata dirty bit is always attributed to the root */
+	if (iwbl_is_root(iwbl))
+		was_dirty |= inode->i_state & (I_DIRTY_SYNC | I_DIRTY_DATASYNC);
+
+	inode->i_state |= dirty;
+	if (dirty & I_DIRTY_PAGES)
+		set_bit(IWBL_DIRTY_PAGES, &iwbl->data);
+	return was_dirty;
+}
+
+/**
+ * iwbl_still_has_dirty_pages - does an iwbl have dirty pages after writeback?
+ * @iwbl: inode_wb_link of interest
+ * @inode: inode @iwbl belongs to
+ *
+ * Called from requeue_inode() after writing back @inode for @iwbl to
+ * determine whether @iwbl still has dirty pages and should thus be
+ * requeued.  This function can update IWBL_DIRTY_PAGES and may also
+ * spuriously return true.
+ *
+ * See IWBL_DIRTY_PAGES definition for more info.
+ */
+static inline bool iwbl_still_has_dirty_pages(struct inode_wb_link *iwbl,
+					      struct inode *inode)
+{
+	if (!mapping_tagged(inode->i_mapping, PAGECACHE_TAG_DIRTY)) {
+		clear_bit(IWBL_DIRTY_PAGES, &iwbl->data);
+		return false;
+	}
+	return test_bit(IWBL_DIRTY_PAGES, &iwbl->data);
+}
+
 #else	/* CONFIG_CGROUP_WRITEBACK */
 
 static void init_cgwb_dirty_page_context(struct dirty_context *dctx)
@@ -705,6 +770,27 @@ static inline void wbc_set_iwbl(struct writeback_control *wbc,
 {
 }
 
+static inline bool iwbl_has_enough_dirty(struct inode_wb_link *iwbl,
+					 struct inode *inode, int dirty)
+{
+	return (inode->i_state & dirty) == dirty;
+}
+
+static inline bool iwbl_set_dirty(struct inode_wb_link *iwbl,
+				  struct inode *inode, int dirty)
+{
+	bool was_dirty = inode->i_state & I_DIRTY;
+
+	inode->i_state |= dirty;
+	return was_dirty;
+}
+
+static inline bool iwbl_still_has_dirty_pages(struct inode_wb_link *iwbl,
+					      struct inode *inode)
+{
+	return mapping_tagged(inode->i_mapping, PAGECACHE_TAG_DIRTY);
+}
+
 #endif	/* CONFIG_CGROUP_WRITEBACK */
 
 /**
@@ -999,7 +1085,7 @@ static void requeue_inode(struct inode_wb_link *iwbl, struct bdi_writeback *wb,
 		return;
 	}
 
-	if (mapping_tagged(inode->i_mapping, PAGECACHE_TAG_DIRTY)) {
+	if (iwbl_still_has_dirty_pages(iwbl, inode)) {
 		/*
 		 * We didn't write back all the pages.  nfs_writepages()
 		 * sometimes bales out without doing anything.
@@ -1017,7 +1103,8 @@ static void requeue_inode(struct inode_wb_link *iwbl, struct bdi_writeback *wb,
 			 */
 			redirty_tail(iwbl, wb);
 		}
-	} else if (inode->i_state & I_DIRTY) {
+	} else if (iwbl_is_root(iwbl) &&
+		   (inode->i_state & (I_DIRTY_SYNC | I_DIRTY_DATASYNC))) {
 		/*
 		 * Filesystems can dirty the inode during writeback operations,
 		 * such as delayed allocation during submission or metadata
@@ -1074,14 +1161,14 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 	inode->i_state &= ~I_DIRTY;
 
 	/*
-	 * Paired with smp_mb() in __mark_inode_dirty().  This allows
-	 * __mark_inode_dirty() to test i_state without grabbing i_lock -
-	 * either they see the I_DIRTY bits cleared or we see the dirtied
-	 * inode.
+	 * Paired with smp_mb() in __mark_inode_dirty_dctx().  This allows
+	 * the function to perform iwbl_has_enough_dirty() test without
+	 * grabbing i_lock - either they see the dirty bits cleared or we
+	 * see the dirtied inode.
 	 *
 	 * I_DIRTY_PAGES is always cleared together above even if @mapping
 	 * still has dirty pages.  The flag is reinstated after smp_mb() if
-	 * necessary.  This guarantees that either __mark_inode_dirty()
+	 * necessary to guarantee that either __mark_inode_dirty_dctx()
 	 * sees clear I_DIRTY_PAGES or we see PAGECACHE_TAG_DIRTY.
 	 */
 	smp_mb();
@@ -1729,31 +1816,7 @@ static noinline void block_dump___mark_inode_dirty(struct inode *inode)
 	}
 }
 
-/**
- *	mark_inode_dirty_dctx -	internal function
- *	@dctx: dirty_context containing the target inode
- *	@flags: what kind of dirty (i.e. I_DIRTY_SYNC)
- *	Mark an inode as dirty. Callers should use mark_inode_dirty or
- *  	mark_inode_dirty_sync.
- *
- * Put the inode on the super block's dirty list.
- *
- * CAREFUL! We mark it dirty unconditionally, but move it onto the
- * dirty list only if it is hashed or if it refers to a blockdev.
- * If it was not hashed, it will never be added to the dirty list
- * even if it is later hashed, as it will have been marked dirty already.
- *
- * In short, make sure you hash any inodes _before_ you start marking
- * them dirty.
- *
- * Note that for blockdevs, iwbl->dirtied_when represents the dirtying time of
- * the block-special inode (/dev/hda1) itself.  And the ->dirtied_when field of
- * the kernel-internal blockdev inode represents the dirtying time of the
- * blockdev's pages.  This is why for I_DIRTY_PAGES we always use
- * page->mapping->host, so the page-dirtying time is recorded in the internal
- * blockdev inode.
- */
-void mark_inode_dirty_dctx(struct dirty_context *dctx, int flags)
+static void __mark_inode_dirty_dctx(struct dirty_context *dctx, int flags)
 {
 	struct inode *inode = dctx->inode;
 	struct inode_wb_link *iwbl = dctx->iwbl;
@@ -1774,22 +1837,23 @@ void mark_inode_dirty_dctx(struct dirty_context *dctx, int flags)
 	}
 
 	/*
-	 * Paired with smp_mb() in __writeback_single_inode() for the
-	 * following lockless i_state test.  See there for details.
+	 * Paired with smp_mb()'s in __writeback_single_inode() and
+	 * mapping_writeback_maybe_whole() for the following lockless
+	 * iwbl_has_enough_dirty() test.  See there for details.
 	 */
 	smp_mb();
 
-	if ((inode->i_state & flags) == flags)
+	if (iwbl_has_enough_dirty(iwbl, inode, flags))
 		return;
 
 	if (unlikely(block_dump))
 		block_dump___mark_inode_dirty(inode);
 
 	spin_lock(&inode->i_lock);
-	if ((inode->i_state & flags) != flags) {
-		const int was_dirty = inode->i_state & I_DIRTY;
+	if (!iwbl_has_enough_dirty(iwbl, inode, flags)) {
+		bool was_dirty;
 
-		inode->i_state |= flags;
+		was_dirty = iwbl_set_dirty(iwbl, inode, flags);
 
 		/*
 		 * If the inode is being synced, just update its dirty state.
@@ -1845,6 +1909,52 @@ out_unlock_inode:
 }
 EXPORT_SYMBOL(mark_inode_dirty_dctx);
 
+/**
+ *	mark_inode_dirty_dctx -	internal function
+ *	@dctx: dirty_context containing the target inode
+ *	@flags: what kind of dirty (i.e. I_DIRTY_SYNC)
+ *	Mark an inode as dirty. Callers should use mark_inode_dirty or
+ *  	mark_inode_dirty_sync.
+ *
+ * Put the inode on the super block's dirty list.
+ *
+ * CAREFUL! We mark it dirty unconditionally, but move it onto the
+ * dirty list only if it is hashed or if it refers to a blockdev.
+ * If it was not hashed, it will never be added to the dirty list
+ * even if it is later hashed, as it will have been marked dirty already.
+ *
+ * In short, make sure you hash any inodes _before_ you start marking
+ * them dirty.
+ *
+ * Note that for blockdevs, iwbl->dirtied_when represents the dirtying time of
+ * the block-special inode (/dev/hda1) itself.  And the ->dirtied_when field of
+ * the kernel-internal blockdev inode represents the dirtying time of the
+ * blockdev's pages.  This is why for I_DIRTY_PAGES we always use
+ * page->mapping->host, so the page-dirtying time is recorded in the internal
+ * blockdev inode.
+ */
+void mark_inode_dirty_dctx(struct dirty_context *dctx, int flags)
+{
+	/*
+	 * I_DIRTY_PAGES should dirty @dctx->iwbl but I_DIRTY_[DATA]SYNC
+	 * should always dirty the root iwbl.  If @dctx->iwbl is root, we
+	 * can do both at the same time; otherwise, handle the two dirtying
+	 * separately.
+	 */
+	if (iwbl_is_root(dctx->iwbl) ||
+	    !(flags & (I_DIRTY_SYNC | I_DIRTY_DATASYNC))) {
+		__mark_inode_dirty_dctx(dctx, flags);
+		return;
+	}
+
+	if (flags & I_DIRTY_PAGES)
+		__mark_inode_dirty_dctx(dctx, I_DIRTY_PAGES);
+
+	flags &= ~I_DIRTY_PAGES;
+	if (flags)
+		__mark_inode_dirty(dctx->inode, flags);
+}
+
 void __mark_inode_dirty(struct inode *inode, int flags)
 {
 	struct dirty_context dctx;
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 6d64a0e..5e0381c 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -52,9 +52,26 @@ enum wb_stat_item {
  *
  *  Tracks whether writeback is in progress for an iwbl.  If this bit is
  *  set for any iwbl on an inode, the inode's I_SYNC is set too.
+ *
+ * IWBL_DIRTY_PAGES
+ *
+ *  Tracks whether an iwbl has dirty pages.  This bit is asserted when a
+ *  page is dirtied against it; unfortunately, unlike I_DIRTY_PAGES which
+ *  can be cleared reliably by testing PAGECACHE_TAG_DIRTY after a
+ *  writeback, there's no way to reliably determine whether an iwbl is
+ *  clean and the bit may remain set spuriously w/o dirty pages.
+ *
+ *  mapping_writeback_{maybe|confirm}_whole() are used to opportunistically
+ *  clear the bit if a writeback attempt successfully sweeps all the dirty
+ *  pages.  It also gets cleared when PAGECACHE_TAG_DIRTY test indicates
+ *  that the whole address_space is clean.  While the bit may remain set
+ *  spuriously for a while, the duration of such inaccuracy should be
+ *  reasonably limited as a periodic cyclic writeback triggered on a clean
+ *  iwbl will notice the clean state.
  */
 enum {
 	IWBL_SYNC		= 0,
+	IWBL_DIRTY_PAGES,
 
 	IWBL_FLAGS_BITS,
 	IWBL_FLAGS_MASK		= (1UL << IWBL_FLAGS_BITS) - 1,
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 57dd200..5d919bc 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -575,6 +575,66 @@ static inline pgoff_t *mapping_writeback_index(struct address_space *mapping,
 }
 
 /**
+ * mapping_writeback_maybe_whole - prepare for possible whole writeback
+ * @mapping: address_space under writeback
+ * @wbc: writeback_control in effect
+ *
+ * @mapping is being written back according to @wbc and it may write back
+ * all dirty pages.  This function must be called before such writeback is
+ * started and matched with mapping_writeback_confirm_whole() which is
+ * called after the writeback.  Combined, these two functions can detect
+ * clean condition on the associated inode_wb_link and clear
+ * %IWBL_DIRTY_PAGES on it so that its writeback work can be selectively
+ * turned off even while the inode is dirty on other cgroups.
+ *
+ * See IWBL_DIRTY_PAGES definition for more info.
+ */
+static inline void
+mapping_writeback_maybe_whole(struct address_space *mapping,
+			      struct writeback_control *wbc)
+{
+	struct inode *inode = mapping->host;
+
+	if (!inode)
+		return;
+
+	clear_bit(IWBL_DIRTY_PAGES, &inode_writeback_iwbl(inode, wbc)->data);
+
+	/*
+	 * Paired with smp_mb() in __mark_inode_dirty_dctx().  Clearing
+	 * IWBL_DIRTY_PAGES before the following mb and reinstating it
+	 * later if writeback skips over some pages guarantees that either
+	 * __mark_inode_dirty_dctx() sees clear IWBL_DIRTY_PAGES or we see
+	 * all the dirtied pages.
+	 */
+	smp_mb__after_atomic();
+}
+
+/**
+ * mapping_writeback_confirm_whole - confirm whether whole writeback took place
+ * @mapping: address_space under writeback
+ * @wbc: writeback_control in effect
+ * @wrote_whole: were all pages written out?
+ *
+ * @mapping is being written back according to @wbc and
+ * mapping_writeback_maybe_whole() was called because it could write back
+ * all dirty pages.  The writeback function must call this function to
+ * indicate whether all pages were actually written out or not.  See
+ * mapping_writeback_maybe_whole() for more info.
+ */
+static inline void
+mapping_writeback_confirm_whole(struct address_space *mapping,
+				struct writeback_control *wbc, bool wrote_whole)
+{
+	struct inode *inode = mapping->host;
+
+	if (!inode || wrote_whole)
+		return;
+
+	set_bit(IWBL_DIRTY_PAGES, &inode_writeback_iwbl(inode, wbc)->data);
+}
+
+/**
  * wbc_blkcg_css - return the blkcg_css associated with a wbc
  * @wbc: writeback_control of interest
  *
@@ -682,6 +742,18 @@ static inline pgoff_t *mapping_writeback_index(struct address_space *mapping,
 	return &mapping->writeback_index;
 }
 
+static inline void
+mapping_writeback_maybe_whole(struct address_space *mapping,
+			      struct writeback_control *wbc)
+{
+}
+
+static inline void
+mapping_writeback_confirm_whole(struct address_space *mapping,
+				struct writeback_control *wbc, bool wrote_whole)
+{
+}
+
 static inline struct cgroup_subsys_state *
 wbc_blkcg_css(struct writeback_control *wbc)
 {
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 753d76f..dd15bb3 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1904,6 +1904,7 @@ int write_cache_pages(struct address_space *mapping,
 {
 	int ret = 0;
 	int done = 0;
+	int skipped_dirty = 0;
 	struct pagevec pvec;
 	int nr_pages;
 	pgoff_t *writeback_index_ptr = mapping_writeback_index(mapping, wbc);
@@ -1913,6 +1914,7 @@ int write_cache_pages(struct address_space *mapping,
 	pgoff_t done_index;
 	int cycled;
 	int range_whole = 0;
+	int maybe_whole = 0;
 	int tag;
 
 	pagevec_init(&pvec, 0);
@@ -1924,13 +1926,20 @@ int write_cache_pages(struct address_space *mapping,
 		else
 			cycled = 0;
 		end = -1;
+		maybe_whole = 1;
 	} else {
 		index = wbc->range_start >> PAGE_CACHE_SHIFT;
 		end = wbc->range_end >> PAGE_CACHE_SHIFT;
-		if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
+		if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX) {
 			range_whole = 1;
+			maybe_whole = 1;
+		}
 		cycled = 1; /* ignore range_cyclic tests */
 	}
+
+	if (maybe_whole)
+		mapping_writeback_maybe_whole(mapping, wbc);
+
 	if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
 		tag = PAGECACHE_TAG_TOWRITE;
 	else
@@ -1990,10 +1999,12 @@ continue_unlock:
 			}
 
 			if (PageWriteback(page)) {
-				if (wbc->sync_mode != WB_SYNC_NONE)
+				if (wbc->sync_mode != WB_SYNC_NONE) {
 					wait_on_page_writeback(page);
-				else
+				} else {
+					skipped_dirty = 1;
 					goto continue_unlock;
+				}
 			}
 
 			BUG_ON(PageWriteback(page));
@@ -2004,6 +2015,7 @@ continue_unlock:
 			ret = (*writepage)(page, wbc, data);
 			if (unlikely(ret)) {
 				if (ret == AOP_WRITEPAGE_ACTIVATE) {
+					skipped_dirty = 1;
 					unlock_page(page);
 					ret = 0;
 				} else {
@@ -2051,6 +2063,9 @@ continue_unlock:
 	if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0))
 		*writeback_index_ptr = done_index;
 
+	if (maybe_whole)
+		mapping_writeback_confirm_whole(mapping, wbc,
+						!done && !skipped_dirty);
 	return ret;
 }
 EXPORT_SYMBOL(write_cache_pages);
-- 
2.1.0
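
The mapping_writeback_maybe_whole() / mapping_writeback_confirm_whole()
pairing in the patch above can be illustrated with a small single-threaded
user-space model.  This is only a sketch under stated assumptions, not the
kernel code: struct model_iwbl, struct model_page and model_writeback_whole()
are hypothetical names, the memory barrier is omitted, and early termination
of the sweep (the "done" case in write_cache_pages()) is not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for inode_wb_link and struct page. */
struct model_iwbl {
	bool dirty_pages;		/* models IWBL_DIRTY_PAGES */
};

struct model_page {
	bool dirty;
	bool under_writeback;		/* models PageWriteback() */
};

/*
 * Models a whole-range WB_SYNC_NONE sweep: clear the flag up front,
 * write what can be written, and reinstate the flag if any dirty page
 * had to be skipped.
 */
void model_writeback_whole(struct model_iwbl *iwbl,
			   struct model_page *pages, size_t nr)
{
	bool skipped_dirty = false;

	assert(iwbl && pages);

	/* "maybe_whole" step: optimistically clear before the sweep */
	iwbl->dirty_pages = false;

	for (size_t i = 0; i < nr; i++) {
		if (!pages[i].dirty)
			continue;
		if (pages[i].under_writeback) {
			/* cannot write this one now; remember the skip */
			skipped_dirty = true;
			continue;
		}
		pages[i].dirty = false;	/* "written out" */
	}

	/* "confirm_whole" step: reinstate unless everything was written */
	if (skipped_dirty)
		iwbl->dirty_pages = true;
}
```

In the model the flag ends up clear only when the sweep wrote every dirty
page; any skipped page leaves it set so that a later writeback retries.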
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related	[flat|nested] 54+ messages in thread
- * [PATCH 40/45] writeback: make write_cache_pages() cgroup writeback aware
  2015-01-06 21:25 [PATCHSET RFC block/for-next] writeback: cgroup writeback support Tejun Heo
                   ` (38 preceding siblings ...)
  2015-01-06 21:26 ` [PATCH 39/45] writeback: make DIRTY_PAGES tracking " Tejun Heo
@ 2015-01-06 21:26 ` Tejun Heo
  2015-01-06 21:26 ` [PATCH 41/45] writeback: make __writeback_single_inode() " Tejun Heo
                   ` (5 subsequent siblings)
  45 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2015-01-06 21:26 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, jack, hch, hannes, linux-fsdevel, vgoyal, lizefan,
	cgroups, linux-mm, mhocko, clm, fengguang.wu, david, Tejun Heo
write_cache_pages() is used to implement generic do_writepages().  Up
until now, the function targeted all dirty pages; however, for cgroup
writeback, it needs to be more restrained.  As writeback for each wb
cgroup (bdi_writeback) will be executed separately, do_writepages()
needs to write out only the pages dirtied against the wb being
serviced.
This patch introduces wbc_skip_page() which is used by
write_cache_pages() to determine whether a page should be skipped
because it is dirtied against a different wb.  wbc->iwbl_mismatch is
also added to keep track of whether pages were skipped, which will be
used later.
Filesystems which don't use write_cache_pages() for their
address_space_operations->writepages() should update their ->writepages()
to use wbc_skip_page() directly to support cgroup writeback.
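
As a rough illustration of the skip decision described above, here is a
minimal user-space sketch.  It is not the kernel implementation: the
model_* types are simplified stand-ins, and the dirty_css field is a
hypothetical substitute for what page_blkcg_dirty() returns.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel types; assumptions for illustration. */
struct model_css { int id; };

struct model_page {
	const struct model_css *dirty_css;	/* models page_blkcg_dirty(page) */
};

struct model_wbc {
	const struct model_css *css;	/* models wbc_blkcg_css(); NULL = no filter */
	bool iwbl_mismatch;		/* set once any page is skipped */
};

/* Same shape as wbc_skip_page(): skip pages dirtied by another cgroup. */
bool model_skip_page(struct model_wbc *wbc, const struct model_page *page)
{
	if (wbc->css && wbc->css != page->dirty_css) {
		wbc->iwbl_mismatch = true;
		return true;
	}
	return false;
}

/* Walks @pages and returns how many would be written out for @wbc. */
size_t model_write_pages(struct model_wbc *wbc,
			 const struct model_page *pages, size_t nr)
{
	size_t written = 0;

	assert(wbc && pages);

	for (size_t i = 0; i < nr; i++) {
		if (model_skip_page(wbc, &pages[i]))
			continue;
		written++;
	}
	return written;
}
```

A writeback with no cgroup attached (css == NULL) writes every page and
never sets iwbl_mismatch, which matches the !CONFIG_CGROUP_WRITEBACK stub
always returning false.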
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
---
 include/linux/backing-dev.h | 27 +++++++++++++++++++++++++++
 include/linux/writeback.h   |  1 +
 mm/page-writeback.c         |  3 +++
 3 files changed, 31 insertions(+)
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 5d919bc..173d218 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -648,6 +648,27 @@ wbc_blkcg_css(struct writeback_control *wbc)
 	return wbc->iwbl ? iwbl_to_wb(wbc->iwbl)->blkcg_css : NULL;
 }
 
+/**
+ * wbc_skip_page - determine whether to skip a page during writeback
+ * @wbc: writeback_control in effect
+ * @page: page being considered
+ *
+ * Determine whether @page should be written back during a writeback
+ * controlled by @wbc.  This function also accounts the number of skipped
+ * pages in @wbc and should only be called once per page.
+ */
+static inline bool wbc_skip_page(struct writeback_control *wbc,
+				 struct page *page)
+{
+	struct cgroup_subsys_state *blkcg_css = wbc_blkcg_css(wbc);
+
+	if (blkcg_css && blkcg_css != page_blkcg_dirty(page)) {
+		wbc->iwbl_mismatch = 1;
+		return true;
+	}
+	return false;
+}
+
 #else	/* CONFIG_CGROUP_WRITEBACK */
 
 static inline bool mapping_cgwb_enabled(struct address_space *mapping)
@@ -760,6 +781,12 @@ wbc_blkcg_css(struct writeback_control *wbc)
 	return NULL;
 }
 
+static inline bool wbc_skip_page(struct writeback_control *wbc,
+				 struct page *page)
+{
+	return false;
+}
+
 #endif	/* CONFIG_CGROUP_WRITEBACK */
 
 static inline int mapping_read_congested(struct address_space *mapping,
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index dad1953..a225a33 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -85,6 +85,7 @@ struct writeback_control {
 	unsigned range_cyclic:1;	/* range_start is cyclic */
 	unsigned for_sync:1;		/* sync(2) WB_SYNC_ALL writeback */
 #ifdef CONFIG_CGROUP_WRITEBACK
+	unsigned iwbl_mismatch:1;	/* pages skipped due to iwbl mismatch */
 	struct inode_wb_link *iwbl;	/* iwbl this writeback is for */
 #endif
 };
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index dd15bb3..0edf749 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1977,6 +1977,9 @@ retry:
 
 			done_index = page->index;
 
+			if (wbc_skip_page(wbc, page))
+				continue;
+
 			lock_page(page);
 
 			/*
-- 
2.1.0
^ permalink raw reply related	[flat|nested] 54+ messages in thread
- * [PATCH 41/45] writeback: make __writeback_single_inode() cgroup writeback aware
  2015-01-06 21:25 [PATCHSET RFC block/for-next] writeback: cgroup writeback support Tejun Heo
                   ` (39 preceding siblings ...)
  2015-01-06 21:26 ` [PATCH 40/45] writeback: make write_cache_pages() " Tejun Heo
@ 2015-01-06 21:26 ` Tejun Heo
  2015-01-06 21:26 ` [PATCH 42/45] writeback: make __filemap_fdatawrite_range() cgroup " Tejun Heo
                   ` (4 subsequent siblings)
  45 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2015-01-06 21:26 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, jack, hch, hannes, linux-fsdevel, vgoyal, lizefan,
	cgroups, linux-mm, mhocko, clm, fengguang.wu, david, Tejun Heo
Metadata is always dirtied against the root cgroup and should thus be
written out only by the root cgroup writeback.  This patch updates
__writeback_single_inode() so that it skips writing out metadata if
the writeback is for a non-root cgroup.  wbc_skip_metadata() is added
to decide whether to skip metadata writeback.
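
The dirty-bit handling this implies can be sketched in user space as
follows.  This is a simplified model rather than the patch itself: the
M_DIRTY_* values are illustrative placeholders for the real I_DIRTY_*
constants, struct model_inode is hypothetical, and the for_root parameter
collapses the "wbc->iwbl && !iwbl_is_root(wbc->iwbl)" test into a single
boolean (a writeback without an iwbl counts as root here).

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag values; the real I_DIRTY_* live in <linux/fs.h>. */
enum {
	M_DIRTY_SYNC	 = 1 << 0,
	M_DIRTY_DATASYNC = 1 << 1,
	M_DIRTY_PAGES	 = 1 << 2,
	M_DIRTY		 = M_DIRTY_SYNC | M_DIRTY_DATASYNC | M_DIRTY_PAGES,
};

struct model_inode {
	unsigned int i_state;
};

/*
 * Returns the dirty bits this writeback is responsible for and clears
 * only those bits.  A non-root writeback leaves the metadata bits alone
 * so that the root writeback still sees them later.
 */
unsigned int model_take_dirty(struct model_inode *inode, bool for_root)
{
	bool skip_metadata = !for_root;
	unsigned int dirty;

	assert(inode);

	if (skip_metadata)
		dirty = inode->i_state & M_DIRTY_PAGES;
	else
		dirty = inode->i_state & M_DIRTY;

	inode->i_state &= ~dirty;
	return dirty;
}
```

The point of clearing "~dirty" instead of "~I_DIRTY" is visible in the
model: after a non-root pass the metadata bits remain set on the inode.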
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c | 29 ++++++++++++++++++++++++++---
 1 file changed, 26 insertions(+), 3 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index dbfd0b0..2bb14d5 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -684,6 +684,19 @@ static inline bool iwbl_still_has_dirty_pages(struct inode_wb_link *iwbl,
 	return test_bit(IWBL_DIRTY_PAGES, &iwbl->data);
 }
 
+/**
+ * wbc_skip_metadata - determine whether to skip writing out metadata
+ * @wbc: writeback_control in effect
+ *
+ * Called by __writeback_single_inode() to decide whether to skip writing
+ * out metadata.  Metadata is always dirtied against the root cgroup and
+ * should only be written out by the root.
+ */
+static inline bool wbc_skip_metadata(struct writeback_control *wbc)
+{
+	return wbc->iwbl && !iwbl_is_root(wbc->iwbl);
+}
+
 #else	/* CONFIG_CGROUP_WRITEBACK */
 
 static void init_cgwb_dirty_page_context(struct dirty_context *dctx)
@@ -791,6 +804,11 @@ static inline bool iwbl_still_has_dirty_pages(struct inode_wb_link *iwbl,
 	return mapping_tagged(inode->i_mapping, PAGECACHE_TAG_DIRTY);
 }
 
+static inline bool wbc_skip_metadata(struct writeback_control *wbc)
+{
+	return false;
+}
+
 #endif	/* CONFIG_CGROUP_WRITEBACK */
 
 /**
@@ -1128,6 +1146,7 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 	struct address_space *mapping = inode->i_mapping;
 	struct inode_wb_link *iwbl = inode_writeback_iwbl(inode, wbc);
 	long nr_to_write = wbc->nr_to_write;
+	bool skip_metadata = wbc_skip_metadata(wbc);
 	unsigned dirty;
 	int ret;
 
@@ -1144,7 +1163,7 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 	 * separate, external IO completion path and ->sync_fs for guaranteeing
 	 * inode metadata is written back correctly.
 	 */
-	if (wbc->sync_mode == WB_SYNC_ALL && !wbc->for_sync) {
+	if (wbc->sync_mode == WB_SYNC_ALL && !wbc->for_sync && !skip_metadata) {
 		int err = filemap_fdatawait(mapping);
 		if (ret == 0)
 			ret = err;
@@ -1157,8 +1176,12 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 	 */
 	spin_lock(&inode->i_lock);
 
-	dirty = inode->i_state & I_DIRTY;
-	inode->i_state &= ~I_DIRTY;
+	if (skip_metadata)
+		dirty = inode->i_state & I_DIRTY_PAGES;
+	else
+		dirty = inode->i_state & I_DIRTY;
+
+	inode->i_state &= ~dirty;
 
 	/*
 	 * Paired with smp_mb() in __mark_inode_dirty_dctx().  This allows
-- 
2.1.0
^ permalink raw reply related	[flat|nested] 54+ messages in thread
- * [PATCH 42/45] writeback: make __filemap_fdatawrite_range() cgroup writeback aware
  2015-01-06 21:25 [PATCHSET RFC block/for-next] writeback: cgroup writeback support Tejun Heo
                   ` (40 preceding siblings ...)
  2015-01-06 21:26 ` [PATCH 41/45] writeback: make __writeback_single_inode() " Tejun Heo
@ 2015-01-06 21:26 ` Tejun Heo
  2015-01-06 21:26 ` [PATCH 43/45] buffer, writeback: make __block_write_full_page() honor cgroup writeback Tejun Heo
                   ` (3 subsequent siblings)
  45 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2015-01-06 21:26 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, jack, hch, hannes, linux-fsdevel, vgoyal, lizefan,
	cgroups, linux-mm, mhocko, clm, fengguang.wu, david, Tejun Heo
__filemap_fdatawrite_range() and its friends are used, among other
things, to implement fsync and coordinate buffered and direct IOs.
The function directly invokes do_writepages() bypassing the usual fs
writeback mechanism and thus currently doesn't respect cgroup
writeback.
This patch adds wb_writeback_work->mapping[_range_{start|end}] which
instruct a wb_writeback_work item to execute do_writepages() on a
single mapping.  A new function cgwb_do_writepages() is added which
splits do_writepages() across all dirtying wb's (bdi_writeback's)
using this new work type.  __filemap_fdatawrite_range() is updated to
use cgwb_do_writepages() instead of do_writepages().
cgwb_do_writepages() first tries direct do_writepages() on the current
blkcg as it's likely that the calling cgroup is trying to flush pages
that it dirtied.  If that doesn't write out all pages, it issues
single-mapping work items to all wb's with dirty pages on the target
mapping.  It currently doesn't distribute wbc->nr_to_write according
to the bandwidth proportion of each wb.  If this ever becomes
necessary, implementing it shouldn't be too difficult.
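
The two-phase strategy (direct attempt first, then fan-out) can be modeled
in user space roughly as below.  This is a sketch under assumptions, not
the patch: cgroups are plain integer ids in [0, MODEL_MAX_CGROUPS), work
items run synchronously instead of being queued to a wb, and scanning for
dirty pages stands in for the IWBL_DIRTY_PAGES test.  All model_* names
are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_CGROUPS 8

struct model_page {
	int owner;	/* id of the cgroup which dirtied the page */
	bool dirty;
};

/*
 * Writes out the dirty pages owned by @cg.  Returns true if dirty pages
 * belonging to other cgroups were skipped (models wbc->iwbl_mismatch).
 */
bool model_writepages(struct model_page *pages, size_t nr, int cg)
{
	bool mismatch = false;

	for (size_t i = 0; i < nr; i++) {
		assert(pages[i].owner >= 0 &&
		       pages[i].owner < MODEL_MAX_CGROUPS);
		if (!pages[i].dirty)
			continue;
		if (pages[i].owner != cg) {
			mismatch = true;
			continue;
		}
		pages[i].dirty = false;
	}
	return mismatch;
}

/* Returns the number of writeback passes issued. */
int model_cgwb_writepages(struct model_page *pages, size_t nr, int current_cg)
{
	int passes = 1;

	/* phase 1: the caller most likely dirtied these pages itself */
	if (!model_writepages(pages, nr, current_cg))
		return passes;

	/* phase 2: one pass per cgroup which still has dirty pages */
	for (int cg = 0; cg < MODEL_MAX_CGROUPS; cg++) {
		bool has_dirty = false;

		for (size_t i = 0; i < nr; i++)
			if (pages[i].dirty && pages[i].owner == cg)
				has_dirty = true;
		if (!has_dirty)
			continue;

		model_writepages(pages, nr, cg);
		passes++;
	}
	return passes;
}
```

When the caller owns every dirty page, only the direct pass runs, which is
the common case the commit message is optimizing for.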
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c           | 149 +++++++++++++++++++++++++++++++++++++++++++-
 include/linux/backing-dev.h |   8 +++
 mm/filemap.c                |   2 +-
 3 files changed, 157 insertions(+), 2 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 2bb14d5..cea13fe 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -36,6 +36,9 @@
 
 struct wb_completion {
 	atomic_t		cnt;
+#ifdef CONFIG_CGROUP_WRITEBACK
+	int			mapping_ret;	/* used by works w/ ->mapping */
+#endif
 };
 
 /*
@@ -43,8 +46,19 @@ struct wb_completion {
  */
 struct wb_writeback_work {
 	long nr_pages;
+
 	struct super_block *sb;
 	unsigned long *older_than_this;
+
+	/*
+	 * If ->mapping is set, only that mapping is written out using
+	 * do_writepages().  ->mapping_range_{start|end} are meaningful
+	 * only in such cases.
+	 */
+	struct address_space *mapping;
+	loff_t mapping_range_start;
+	loff_t mapping_range_end;
+
 	enum writeback_sync_modes sync_mode;
 	unsigned int tagged_writepages:1;
 	unsigned int for_kupdate:1;
@@ -697,6 +711,133 @@ static inline bool wbc_skip_metadata(struct writeback_control *wbc)
 	return wbc->iwbl && !iwbl_is_root(wbc->iwbl);
 }
 
+static bool cgwb_do_writepages_split_work(struct inode_wb_link *iwbl,
+					  struct wb_writeback_work *base_work)
+{
+	struct bdi_writeback *wb = iwbl_to_wb(iwbl);
+
+	/* if DIRTY_PAGES isn't visible yet, neither is the dirty data */
+	if (!test_bit(IWBL_DIRTY_PAGES, &iwbl->data))
+		return true;
+
+	return wb_clone_and_queue_work(wb, base_work);
+}
+
+/**
+ * cgwb_do_writepages - cgroup-aware do_writepages()
+ * @mapping: address_space to write out
+ * @wbc: writeback_control in effect
+ *
+ * Write out pages from @mapping according to @wbc.  This function expects
+ * @mapping to be a file backed one.  If cgroup writeback is enabled, the
+ * writes are distributed across the cgroups which dirtied the pages;
+ * otherwise, this is equivalent to do_writepages().  Returns 0 on success,
+ * -errno on failure.
+ */
+int cgwb_do_writepages(struct address_space *mapping,
+		       struct writeback_control *wbc)
+{
+	DEFINE_WB_COMPLETION_ONSTACK(done);
+	struct inode *inode = mapping->host;
+	struct backing_dev_info *bdi = inode_to_bdi(inode);
+	struct wb_writeback_work base_work = {
+		.mapping		= mapping,
+		.mapping_range_start	= wbc->range_start,
+		.mapping_range_end	= wbc->range_end,
+		.nr_pages		= wbc->nr_to_write,
+		.sync_mode		= wbc->sync_mode,
+		.tagged_writepages	= wbc->tagged_writepages,
+		.for_kupdate		= wbc->for_kupdate,
+		.range_cyclic		= wbc->range_cyclic,
+		.for_background		= wbc->for_background,
+		.for_sync		= wbc->for_sync,
+		.done			= &done,
+	};
+	struct cgroup_subsys_state *blkcg_css;
+	struct inode_wb_link *iwbl;
+	struct inode_cgwb_link *icgwbl, *n;
+	int last_blkcg_id = 0, ret;
+
+	/*
+	 * The caller is likely flushing the pages it dirtied.  First look
+	 * up the current iwbl and perform do_writepages() directly on it.
+	 * If no page is skipped due to mismatching cgroup, there's nothing
+	 * more to do.
+	 */
+	blkcg_css = task_get_css(current, blkio_cgrp_id);
+	iwbl = iwbl_lookup(inode, blkcg_css);
+	if (iwbl) {
+		wbc_set_iwbl(wbc, iwbl);
+		wbc->iwbl_mismatch = 0;
+
+		ret = do_writepages(mapping, wbc);
+
+		css_put(blkcg_css);
+		if (ret || !wbc->iwbl_mismatch)
+			return ret;
+	} else {
+		css_put(blkcg_css);
+	}
+
+	/*
+	 * Split writes to all dirty iwbl's.  We don't yet implement
+	 * bandwidth-proportional distribution of nr_pages as the only
+	 * current caller, __filemap_fdatawrite_range(), always sets it to
+	 * LONG_MAX.  Implementing proportional distribution would require
+	 * a preparatory pass over dirty iwbl's to calculate the total write
+	 * bandwidth of the involved wb's.
+	 */
+	WARN_ON_ONCE(base_work.nr_pages != LONG_MAX);
+
+	if (!cgwb_do_writepages_split_work(&inode->i_wb_link, &base_work))
+		wb_wait_for_single_work(bdi, &base_work);
+restart_split:
+	rcu_read_lock();
+	inode_for_each_icgwbl(icgwbl, n, inode) {
+		struct inode_wb_link *iwbl = &icgwbl->iwbl;
+		int blkcg_id = iwbl_to_wb(iwbl)->blkcg_css->id;
+
+		if (blkcg_id <= last_blkcg_id)
+			continue;
+
+		if (!cgwb_do_writepages_split_work(iwbl, &base_work)) {
+			rcu_read_unlock();
+			wb_wait_for_single_work(bdi, &base_work);
+			goto restart_split;
+		}
+		last_blkcg_id = blkcg_id;
+	}
+	rcu_read_unlock();
+
+	wb_wait_for_completion(bdi, &done);
+	return done.mapping_ret;
+}
+
+static bool maybe_writeback_single_mapping(struct wb_writeback_work *work)
+{
+	struct wb_completion *done = work->done;
+	struct writeback_control wbc = {
+		.range_start		= work->mapping_range_start,
+		.range_end		= work->mapping_range_end,
+		.nr_to_write		= work->nr_pages,
+		.sync_mode		= work->sync_mode,
+		.tagged_writepages	= work->tagged_writepages,
+		.for_kupdate		= work->for_kupdate,
+		.range_cyclic		= work->range_cyclic,
+		.for_background		= work->for_background,
+		.for_sync		= work->for_sync,
+	};
+	int ret;
+
+	if (!work->mapping)
+		return false;
+
+	ret = do_writepages(work->mapping, &wbc);
+	if (done && ret)
+		done->mapping_ret = ret;
+	return true;
+}
+
 #else	/* CONFIG_CGROUP_WRITEBACK */
 
 static void init_cgwb_dirty_page_context(struct dirty_context *dctx)
@@ -809,6 +950,11 @@ static inline bool wbc_skip_metadata(struct writeback_control *wbc)
 	return false;
 }
 
+static bool maybe_writeback_single_mapping(struct wb_writeback_work *work)
+{
+	return false;
+}
+
 #endif	/* CONFIG_CGROUP_WRITEBACK */
 
 /**
@@ -1718,7 +1864,8 @@ static long wb_do_writeback(struct bdi_writeback *wb)
 
 		trace_writeback_exec(wb->bdi, work);
 
-		wrote += wb_writeback(wb, work);
+		if (!maybe_writeback_single_mapping(work))
+			wrote += wb_writeback(wb, work);
 
 		if (work->single_wait) {
 			WARN_ON_ONCE(work->auto_free);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 173d218..2456efb 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -281,6 +281,8 @@ int __cgwb_create(struct backing_dev_info *bdi,
 		  struct cgroup_subsys_state *blkcg_css);
 struct inode_wb_link *iwbl_create(struct inode *inode,
 				  struct bdi_writeback *wb);
+int cgwb_do_writepages(struct address_space *mapping,
+		       struct writeback_control *wbc);
 int mapping_congested(struct address_space *mapping, struct task_struct *task,
 		      int bdi_bits);
 
@@ -787,6 +789,12 @@ static inline bool wbc_skip_page(struct writeback_control *wbc,
 	return false;
 }
 
+static inline int cgwb_do_writepages(struct address_space *mapping,
+				     struct writeback_control *wbc)
+{
+	return do_writepages(mapping, wbc);
+}
+
 #endif	/* CONFIG_CGROUP_WRITEBACK */
 
 static inline int mapping_read_congested(struct address_space *mapping,
diff --git a/mm/filemap.c b/mm/filemap.c
index faa577d..e858cd1 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -284,7 +284,7 @@ int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
 	if (!mapping_cap_writeback_dirty(mapping))
 		return 0;
 
-	ret = do_writepages(mapping, &wbc);
+	ret = cgwb_do_writepages(mapping, &wbc);
 	return ret;
 }
 
-- 
2.1.0
^ permalink raw reply related	[flat|nested] 54+ messages in thread
- * [PATCH 43/45] buffer, writeback: make __block_write_full_page() honor cgroup writeback
  2015-01-06 21:25 [PATCHSET RFC block/for-next] writeback: cgroup writeback support Tejun Heo
                   ` (41 preceding siblings ...)
  2015-01-06 21:26 ` [PATCH 42/45] writeback: make __filemap_fdatawrite_range() cgroup " Tejun Heo
@ 2015-01-06 21:26 ` Tejun Heo
  2015-01-06 21:26 ` [PATCH 45/45] ext2: enable cgroup writeback support Tejun Heo
                   ` (2 subsequent siblings)
  45 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2015-01-06 21:26 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, jack, hch, hannes, linux-fsdevel, vgoyal, lizefan,
	cgroups, linux-mm, mhocko, clm, fengguang.wu, david, Tejun Heo,
	Andrew Morton
[__]block_write_full_page() is used to implement ->writepage in
various filesystems.  All writeback logic is now updated to handle
cgroup writeback and the block cgroup to issue IOs for is encoded in
writeback_control and can be retrieved using wbc_blkcg_css(); however,
[__]block_write_full_page() currently ignores the blkcg indicated by
wbc and issues all bio's without explicit blkcg association.
This patch adds submit_bh_blkcg() which associates the bio with the
specified blkio cgroup before issuing and uses it in
__block_write_full_page() so that the issued bio's are associated with
wbc_blkcg_css().
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 fs/buffer.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index 2dab7dd..1377346 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -45,6 +45,9 @@
 #include <trace/events/block.h>
 
 static int fsync_buffers_list(spinlock_t *lock, struct list_head *list);
+static int submit_bh_blkcg(int rw, struct buffer_head *bh,
+			   unsigned long bio_flags,
+			   struct cgroup_subsys_state *blkcg_css);
 
 #define BH_ENTRY(list) list_entry((list), struct buffer_head, b_assoc_buffers)
 
@@ -1777,7 +1780,7 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
 	do {
 		struct buffer_head *next = bh->b_this_page;
 		if (buffer_async_write(bh)) {
-			submit_bh(write_op, bh);
+			submit_bh_blkcg(write_op, bh, 0, wbc_blkcg_css(wbc));
 			nr_underway++;
 		}
 		bh = next;
@@ -1831,7 +1834,7 @@ recover:
 		struct buffer_head *next = bh->b_this_page;
 		if (buffer_async_write(bh)) {
 			clear_buffer_dirty(bh);
-			submit_bh(write_op, bh);
+			submit_bh_blkcg(write_op, bh, 0, wbc_blkcg_css(wbc));
 			nr_underway++;
 		}
 		bh = next;
@@ -3000,7 +3003,9 @@ void guard_bio_eod(int rw, struct bio *bio)
 	}
 }
 
-int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags)
+static int submit_bh_blkcg(int rw, struct buffer_head *bh,
+			   unsigned long bio_flags,
+			   struct cgroup_subsys_state *blkcg_css)
 {
 	struct bio *bio;
 	int ret = 0;
@@ -3023,6 +3028,9 @@ int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags)
 	 */
 	bio = bio_alloc(GFP_NOIO, 1);
 
+	if (blkcg_css)
+		bio_associate_blkcg(bio, blkcg_css);
+
 	bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9);
 	bio->bi_bdev = bh->b_bdev;
 	bio->bi_io_vec[0].bv_page = bh->b_page;
@@ -3053,11 +3061,16 @@ int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags)
 	bio_put(bio);
 	return ret;
 }
+
+int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags)
+{
+	return submit_bh_blkcg(rw, bh, bio_flags, NULL);
+}
 EXPORT_SYMBOL_GPL(_submit_bh);
 
 int submit_bh(int rw, struct buffer_head *bh)
 {
-	return _submit_bh(rw, bh, 0);
+	return submit_bh_blkcg(rw, bh, 0, NULL);
 }
 EXPORT_SYMBOL(submit_bh);
 
-- 
2.1.0
^ permalink raw reply related	[flat|nested] 54+ messages in thread
- * [PATCH 45/45] ext2: enable cgroup writeback support
  2015-01-06 21:25 [PATCHSET RFC block/for-next] writeback: cgroup writeback support Tejun Heo
                   ` (42 preceding siblings ...)
  2015-01-06 21:26 ` [PATCH 43/45] buffer, writeback: make __block_write_full_page() honor cgroup writeback Tejun Heo
@ 2015-01-06 21:26 ` Tejun Heo
  2015-01-06 21:44 ` [PATCHSET RFC block/for-next] writeback: " Tejun Heo
  2015-01-08  9:30 ` Jan Kara
  45 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2015-01-06 21:26 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, jack, hch, hannes, linux-fsdevel, vgoyal, lizefan,
	cgroups, linux-mm, mhocko, clm, fengguang.wu, david, Tejun Heo,
	linux-ext4
Writeback now supports cgroup writeback and the generic writeback,
buffer, libfs, and mpage helpers that ext2 uses are all updated to
work with cgroup writeback.
This patch enables cgroup writeback for ext2 by adding
FS_CGROUP_WRITEBACK to its ->fs_flags.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: linux-ext4@vger.kernel.org
---
 fs/ext2/super.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index ae55fdd..dc3f27e 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -1546,7 +1546,7 @@ static struct file_system_type ext2_fs_type = {
 	.name		= "ext2",
 	.mount		= ext2_mount,
 	.kill_sb	= kill_block_super,
-	.fs_flags	= FS_REQUIRES_DEV,
+	.fs_flags	= FS_REQUIRES_DEV | FS_CGROUP_WRITEBACK,
 };
 MODULE_ALIAS_FS("ext2");
 
-- 
2.1.0
^ permalink raw reply related	[flat|nested] 54+ messages in thread
- * Re: [PATCHSET RFC block/for-next] writeback: cgroup writeback support
  2015-01-06 21:25 [PATCHSET RFC block/for-next] writeback: cgroup writeback support Tejun Heo
                   ` (43 preceding siblings ...)
  2015-01-06 21:26 ` [PATCH 45/45] ext2: enable cgroup writeback support Tejun Heo
@ 2015-01-06 21:44 ` Tejun Heo
  2015-01-07 23:45   ` Dave Chinner
  2015-01-08  9:30 ` Jan Kara
  45 siblings, 1 reply; 54+ messages in thread
From: Tejun Heo @ 2015-01-06 21:44 UTC (permalink / raw)
  To: axboe
  Cc: linux-kernel, jack, hch, hannes, linux-fsdevel, vgoyal, lizefan,
	cgroups, linux-mm, mhocko, clm, fengguang.wu, david
Hello, again.  A bit of addition.
On Tue, Jan 06, 2015 at 04:25:37PM -0500, Tejun Heo wrote:
...
> Overall design
> --------------
What's going on in this patchset is fairly straightforward.  The main
change is that a bdi is split into multiple per-cgroup pieces.  Each
split bdi, represented by bdi_writeback, behaves mostly identically to
how the bdi behaved before.
Complications mostly arise from filesystems and inodes having to deal
with multiple split bdi's instead of one, but those are mostly
straightforward 1:N mapping issues.  It does get tedious here and
there but doesn't complicate the overall picture.  This simplicity
pays off when dealing with interaction issues which would have been
extremely hairy otherwise.  More on this while discussing
balance_dirty_pages.
...
> Missing pieces
> --------------
...
> * balance_dirty_pages currently doesn't consider the task's memcg when
>   calculating the number of dirtyable pages.  This means that tasks in
>   memcg won't have the benefit of smooth background writeback and will
>   bump into direct reclaim all the time.  This has always been like
>   this but with cgroup writeback support, this is also finally
>   fixable.  I'll work on this as the earlier part gets settled.
This has always been a really thorny issue but now that each wb
behaves as an independent writeback domain, this can be solved nicely.
Each cgroup can carry its fraction of the write bandwidth against the
whole system and each task can carry its fraction against its memcg.
balance_dirty_pages can now stagger these two ratios, apply the result
against the memory which *may* be dirtyable to the task's memcg, and
throttle the dirtier accordingly.  This works out exactly as a
straightforward extension of the global logic, which is proven to
work.  The pieces really are falling into place.
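A minimal sketch of the two-level split described above, in userspace
C.  It is only a model under stated assumptions: the function name,
the parts-per-1000 fixed-point representation and the plain
multiplication of the two ratios are hypothetical and are not taken
from the patchset or from balance_dirty_pages itself.

```c
#include <assert.h>
#include <stdint.h>

/* Fractions are expressed in parts per 1000 to avoid floating point. */
#define FRAC_SCALE 1000u

/*
 * Hypothetical model, not code from the patchset: combine the cgroup's
 * share of system write bandwidth with the task's share inside its
 * memcg and apply the result to the memory the memcg may dirty.
 */
static uint64_t task_dirty_limit(uint64_t memcg_dirtyable_pages,
				 uint32_t cgroup_bw_frac,
				 uint32_t task_frac)
{
	assert(cgroup_bw_frac <= FRAC_SCALE && task_frac <= FRAC_SCALE);

	/* Staggered ratio, still scaled by FRAC_SCALE * FRAC_SCALE. */
	uint64_t combined = (uint64_t)cgroup_bw_frac * task_frac;

	return memcg_dirtyable_pages * combined / (FRAC_SCALE * FRAC_SCALE);
}
```

In this model a task with half of its memcg's share, in a cgroup with a
quarter of the system write bandwidth, would be allowed one eighth of
the memcg's dirtyable pages before being throttled.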
Thanks.
-- 
tejun
^ permalink raw reply	[flat|nested] 54+ messages in thread
- * Re: [PATCHSET RFC block/for-next] writeback: cgroup writeback support
  2015-01-06 21:44 ` [PATCHSET RFC block/for-next] writeback: " Tejun Heo
@ 2015-01-07 23:45   ` Dave Chinner
  2015-01-09 21:23     ` Tejun Heo
  0 siblings, 1 reply; 54+ messages in thread
From: Dave Chinner @ 2015-01-07 23:45 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe, linux-kernel, jack, hch, hannes, linux-fsdevel, vgoyal,
	lizefan, cgroups, linux-mm, mhocko, clm, fengguang.wu
On Tue, Jan 06, 2015 at 04:44:26PM -0500, Tejun Heo wrote:
> Hello, again.  A bit of addition.
> 
> On Tue, Jan 06, 2015 at 04:25:37PM -0500, Tejun Heo wrote:
> ...
> > Overall design
> > --------------
> 
> What's going on in this patchset is fairly straight forward.  The main
> thing which is happening is that a bdi is being split into multiple
> per-cgroup pieces.  Each split bdi, represented by bdi_writeback,
> behaves mostly identically with how bdi behaved before.
I like the overall direction you've taken, Tejun, but I have
a couple of questions...
> Complications mostly arise from filesystems and inodes having to deal
> with multiple split bdi's instead of one, but those are mostly
> straight-forward 1:N mapping issues.  It does get tedious here and
> there but doesn't complicate the overall picture.
Some filesystems don't track metadata-dirty inode state in the bdi
lists, and instead track that in their own lists (usually deep
inside the journalling subsystem), i.e. I_DIRTY_PAGES is the only
dirty state that is tracked in the VFS. As a result, inode metadata
writeback will still be considered global, but pages won't be. Hence
you might get pages written back quickly, but the inodes are going
to remain dirty and unreclaimable until the filesystem flushes some
time in the future after the journal is committed and the inode
written...
There has also been talk of allowing filesystems to directly track
dirty page state as well - the discussion came out of the way tux3
was tracking and committing delta changes to file data. Now that
hasn't gone anywhere, but I'm wondering what impact this patch set
would have on such proposals?
Similarly, I'm concerned about additional overhead in the writeback
path - we can easily drive the flusher thread to be CPU bound on IO
subsystems that have decent bandwidth (low GB/s), so adding more
overhead to every page we have to flush is going to reduce
performance on these systems. Do you have any idea what impact
just enabling the memcg/blkcg tracking has on writeback performance
and CPU consumption?
A further complication for data writeback is that some filesystems
do their own adjacent page write clustering inside their own
->writepages/->writepage implementations. Both ext4 and XFS do this,
and it makes no sense from a system and filesystem performance
perspective to turn sequential ranges of dirty pages into much
slower, semi-random IO just because the pages belong to different
memcgs. It's not a good idea to compromise bulk writeback
throughput under memory pressure just because different memcgs
write to the same files, so what is going to be the impact of
filesystems ignoring memcg ownership during writeback clustering?
Finally, distros are going to ship with this always enabled, so what
is the overall increase in the size of the struct inode on a 64
bit system with it enabled?
Cheers,
Dave.
-- 
Dave Chinner
david@fromorbit.com
^ permalink raw reply	[flat|nested] 54+ messages in thread 
- * Re: [PATCHSET RFC block/for-next] writeback: cgroup writeback support
  2015-01-07 23:45   ` Dave Chinner
@ 2015-01-09 21:23     ` Tejun Heo
  2015-01-10  0:38       ` Dave Chinner
  0 siblings, 1 reply; 54+ messages in thread
From: Tejun Heo @ 2015-01-09 21:23 UTC (permalink / raw)
  To: Dave Chinner
  Cc: axboe, linux-kernel, jack, hch, hannes, linux-fsdevel, vgoyal,
	lizefan, cgroups, linux-mm, mhocko, clm, fengguang.wu
Hello, Dave.
On Thu, Jan 08, 2015 at 10:45:32AM +1100, Dave Chinner wrote:
> > Complications mostly arise from filesystems and inodes having to deal
> > with multiple split bdi's instead of one, but those are mostly
> > straight-forward 1:N mapping issues.  It does get tedious here and
> > there but doesn't complicate the overall picture.
> 
> Some filesystems don't track metadata-dirty inode state in the bdi
> lists, and instead track that in their own lists (usually deep
> inside the journalling subsystem). i.e. I_DIRTY_PAGES are the only
> dirty state that is tracked in the VFS. i.e. inode metadata
> writeback will still be considered global, but pages won't be. Hence
> you might get pages written back quickly, but the inodes are going
> to remain dirty and unreclaimable until the filesystem flushes some
> time in the future after the journal is committed and the inode
> written...
I'm not sure I'm following.  What the writeback layer provides is cgroup
awareness when dealing with I_DIRTY_PAGES.  Metadata writebacks will
become automatically cgroup-aware to the extent they go through the
regular page dirtying mechanism.  If some don't go through that
channel (e.g. journals shouldn't, to avoid priority inversion), it's
up to the specific filesystem to decide how to handle them.  In most
cases, I imagine they'd be sent down as originating from the root
cgroup, i.e. as system cost.  Specific filesystems can be more
sophisticated, I suppose.
Ultimately, the only thing which matters is with which cgroup a bio is
associated when issued.  What's implemented in this patchset is
propagation of memcg tags for pagecache pages.  If necessary, further
mechanisms can be added, but this should cover the basics.
> There has also been talk of allowing filesystems to directly track
> dirty page state as well - the discussion came out of the way tux3
> was tracking and committing delta changes to file data. Now that
> hasn't gone anywhere, but I'm wondering what impact this patch set
> would have on such proposals?
Would such a filesystem take over the writeback mechanism too?  The
implemented mechanism is fairly modular and the counterparts in each
filesystem should be able to use it the same way the core writeback
code does.  I'm afraid I can't say much without knowing further
details.
> Similarly, I'm concerned about additional overhead in the writeback
> path - we can easily drive the flusher thread to be CPU bound on IO
> subsystems that have decent bandwidth (low GB/s), so adding more
> overhead to every page we have to flush is going to reduce
> performance on these systems. Do you have any idea what impact
> just enabling the memcg/blkcg tracking has on writeback performance
> and CPU consumption?
I measured avg sys+user time of 50 iterations of
  fs_mark -d /mnt/tmp/ -s 104857600 -n 32
on an ext2 on a ramdisk, which should put the hot path part - page
faulting and inode dirtying - under the spotlight.  The case with cgroup
writeback enabled but not used consumes around 1% more CPU time - AVG
6.616 STDEV 0.050 w/o this patchset, AVG 6.682 STDEV 0.046 with.  This
is an extreme case and, while it isn't free, the overhead is fairly low.
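For reference, the "around 1%" figure follows directly from the two
averages quoted above.  The helper below is just that arithmetic
written out; its name is made up for illustration.

```c
#include <assert.h>

/* Relative increase of `with` over `without`, in percent. */
static double overhead_percent(double without, double with)
{
	assert(without > 0.0);
	return (with - without) / without * 100.0;
}
```

Called with 6.616 and 6.682 it evaluates to just under 1%.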
> A further complication for data writeback is that some filesystems
> do their own adjacent page write clustering own inside their own
> ->writepages/->writepage implementations. Both ext4 and XFS do this,
> and it makes no sense from a system and filesystem performance
> perspective to turn sequential ranges of dirty pages into much
> slower, semi-random IO just because the pages belong to different
> memcgs. It's not a good idea to compromise bulk writeback
> throughput under memory pressure just because a different memcgs
> write to the same files, so what is going to be the impact of
> filesystems ignoring memcg ownership during writeback clustering?
I don't think that's a good idea.  Implementing that isn't hard.
Range writeback can simply avoid skipping pages from different
cgroups; however, different cgroups can have vastly different
characteristics.  One may be configured to have a majority of the
available bandwidth while another has to scrape by with a few hundred
KB per second.  Initiating writeout on pages which belong to the former
from the writeback of the latter may cause serious priority inversion
issues.
Maybe we can think of optimizations down the road but I'd strongly
prefer to stick to simple and clear divisions among cgroups.  Also, a
file highly interleaved by multiple cgroups isn't a particularly
likely use case.
> Finally, distros are going to ship with this always enabled, so what
> is the overall increase in the size of the struct inode on a 64
> bit system with it enabled?
This was in the head message, but, to repeat, two more pointers if
!CONFIG_IMA, which is the case for Fedora at least.  If CONFIG_IMA is
enabled, it becomes three pointers.  In my test setup, before the
patchset or with CONFIG_CGROUP_WRITEBACK disabled, struct inode is 544
bytes; with CONFIG_CGROUP_WRITEBACK enabled it's 560.
Thanks.
-- 
tejun
^ permalink raw reply	[flat|nested] 54+ messages in thread 
- * Re: [PATCHSET RFC block/for-next] writeback: cgroup writeback support
  2015-01-09 21:23     ` Tejun Heo
@ 2015-01-10  0:38       ` Dave Chinner
  2015-01-10 15:56         ` Tejun Heo
  0 siblings, 1 reply; 54+ messages in thread
From: Dave Chinner @ 2015-01-10  0:38 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe, linux-kernel, jack, hch, hannes, linux-fsdevel, vgoyal,
	lizefan, cgroups, linux-mm, mhocko, clm, fengguang.wu
On Fri, Jan 09, 2015 at 04:23:36PM -0500, Tejun Heo wrote:
> Hello, Dave.
> 
> On Thu, Jan 08, 2015 at 10:45:32AM +1100, Dave Chinner wrote:
> > > Complications mostly arise from filesystems and inodes having to deal
> > > with multiple split bdi's instead of one, but those are mostly
> > > straight-forward 1:N mapping issues.  It does get tedious here and
> > > there but doesn't complicate the overall picture.
> > 
> > Some filesystems don't track metadata-dirty inode state in the bdi
> > lists, and instead track that in their own lists (usually deep
> > inside the journalling subsystem). i.e. I_DIRTY_PAGES are the only
> > dirty state that is tracked in the VFS. i.e. inode metadata
> > writeback will still be considered global, but pages won't be. Hence
> > you might get pages written back quickly, but the inodes are going
> > to remain dirty and unreclaimable until the filesystem flushes some
> > time in the future after the journal is committed and the inode
> > written...
> 
> I'm not sure I'm following.  What writeback layer provides is cgroup
> awareness when dealing with I_DIRTY_PAGES.  Metadata writebacks will
> become automatically cgroup-aware to the extent they go through
> regular page dirtying mechanism.
That's my point - inode metadata dirtying/writeback doesn't go through
the regular page dirtying mechanisms.
> What's implemented in this patchset is
> propagation of memcg tags for pagecache pages.  If necessary, further
> mechanisms can be added, but this should cover the basics.
Sure, but I'm just pointing out that if you dirty a million inodes
in a memcg (e.g. chown -R), memcg-based writeback will not cause
them to be written...
> > There has also been talk of allowing filesystems to directly track
> > dirty page state as well - the discussion came out of the way tux3
> > was tracking and committing delta changes to file data. Now that
> > hasn't gone anywhere, but I'm wondering what impact this patch set
> > would have on such proposals?
> 
> Would such a filesystem take over writeback mechanism too?
That was the intent.
> The
> implemented mechanism is fairly modular and the counterparts in each
> filesystem should be able to use them the same way the core writeback
> code does.  I'm afraid I can't say much without knowing further
> details.
OK, you haven't said "that's impossible", so that's good enough for
now ;)
> > Similarly, I'm concerned about additional overhead in the writeback
> > path - we can easily drive the flusher thread to be CPU bound on IO
> > subsystems that have decent bandwidth (low GB/s), so adding more
> > overhead to every page we have to flush is going to reduce
> > performance on these systems. Do you have any idea what impact
> > just enabling the memcg/blkcg tracking has on writeback performance
> > and CPU consumption?
> 
> I measured avg sys+user time of 50 iterations of
> 
>   fs_mark -d /mnt/tmp/ -s 104857600 -n 32
> 
> on an ext2 on a ramdisk, which should put the hot path part - page
> faulting and inode dirtying - under spotlight.  cgroup writeback
> enabled but not used case consumes around 1% more cpu time - AVG 6.616
> STDEV 0.050 w/o this patchset, AVG 6.682 STDEV 0.046 with.  This is an
> extreme case and while it isn't free the overhead is fairly low.
What's the throughput for these numbers? CPU usage without any idea
of the number of pages being scanned doesn't tell us a whole lot.
> > A further complication for data writeback is that some filesystems
> > do their own adjacent page write clustering own inside their own
> > ->writepages/->writepage implementations. Both ext4 and XFS do this,
> > and it makes no sense from a system and filesystem performance
> > perspective to turn sequential ranges of dirty pages into much
> > slower, semi-random IO just because the pages belong to different
> > memcgs. It's not a good idea to compromise bulk writeback
> > throughput under memory pressure just because a different memcgs
> > write to the same files, so what is going to be the impact of
> > filesystems ignoring memcg ownership during writeback clustering?
> 
> I don't think that's a good idea.  Implementing that isn't hard.
> Range writeback can simply avoid skipping pages from different
> cgroups; however, different cgroups can have vastly different
> characteristics.  One may be configured to have a majority of the
> available bandwidth while another has to scrap by with few hundreds of
> k's per sec.  Initiating write out on pages which belong to the former
> from the writeback of the latter may cause serious priority inversion
> issues.
I'd suggest that you should provide mechanisms at the block layer
for accounting the pages in the bio to the memcg they belong to,
not make a sweeping directive that filesystems can only write back
pages from one memcg at a time.
If you account for pages to their memcg and decide on bio priority
at bio_add_page() time you would avoid the inversion and cross-cg
accounting problems.  If you do this, the filesystem doesn't need to
care at all what memcg pages belong to; they just do optimal IO to
clean sequential dirty pages and it is accounted and throttled
appropriately by the lower layers.
> Maybe we can think of optimizations down the road but I'd strongly
> prefer to stick to simple and clear divisions among cgroups.  Also, a
> file highly interleaved by multiple cgroups isn't a particularly
> likely use case.
That's true, and that's a further reason why I think we should not
be caring about this case in the filesystem writeback code at all.
Cheers,
Dave.
-- 
Dave Chinner
david@fromorbit.com
^ permalink raw reply	[flat|nested] 54+ messages in thread 
- * Re: [PATCHSET RFC block/for-next] writeback: cgroup writeback support
  2015-01-10  0:38       ` Dave Chinner
@ 2015-01-10 15:56         ` Tejun Heo
  2015-01-10 16:05           ` Tejun Heo
  0 siblings, 1 reply; 54+ messages in thread
From: Tejun Heo @ 2015-01-10 15:56 UTC (permalink / raw)
  To: Dave Chinner
  Cc: axboe, linux-kernel, jack, hch, hannes, linux-fsdevel, vgoyal,
	lizefan, cgroups, linux-mm, mhocko, clm, fengguang.wu
Hey,
On Sat, Jan 10, 2015 at 11:38:19AM +1100, Dave Chinner wrote:
> > What's implemented in this patchset is
> > propagation of memcg tags for pagecache pages.  If necessary, further
> > mechanisms can be added, but this should cover the basics.
> 
> Sure, but I'm just pointing out that if you dirty a million inodes
> in a memcg (e.g. chown -R), memcg-based writeback will not cause
> them to be written...
Sure, in such cases, they'd need further wiring up if they are to be
solved properly.  For some, we'd have to punt to the root cgroup and
charge it as general system cost, but this is no different from other
controllers and at least some of such punting would be inherent in the
nature of the involved activities.
> > I measured avg sys+user time of 50 iterations of
> > 
> >   fs_mark -d /mnt/tmp/ -s 104857600 -n 32
> > 
> > on an ext2 on a ramdisk, which should put the hot path part - page
> > faulting and inode dirtying - under spotlight.  cgroup writeback
> > enabled but not used case consumes around 1% more cpu time - AVG 6.616
> > STDEV 0.050 w/o this patchset, AVG 6.682 STDEV 0.046 with.  This is an
> > extreme case and while it isn't free the overhead is fairly low.
> 
> What's the throughput for these numbers? CPU usage without any idea
> of the number of pages being scanned doesn't tell us a whole lot.
Ah, sorry about that.  Here's the output from one such fs_mark run.
Being run on a ramdisk, it's CPU bound (1.9GHz Opteron).
$ fs_mark -d /mnt/tmp/ -s 104857600 -n 32 -v
[opt ~]# fs_mark -d /mnt/tmp/ -s 104857600 -n 32 -v
#  fs_mark  -d  /mnt/tmp/  -s  104857600  -n  32  -v
#       Version 3.3, 1 thread(s) starting at Sat Jan 10 10:46:13 2015
#       Sync method: INBAND FSYNC: fsync() per file in write loop.
#       Directories:  no subdirectories used
#       File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
#       Files info: size 104857600 bytes, written with an IO size of 16384 bytes per write
#       App overhead is time in microseconds spent in the test not doing file writing related system calls.
#       All system call times are reported in microseconds.
FSUse%        Count         Size    Files/sec     App Overhead        CREAT (Min/Avg/Max)        WRITE (Min/Avg/Max)        FSYNC (Min/Avg/Max)         SYNC (Min/Avg/Max)        CLOSE (Min/Avg/Max)       UNLINK (Min/Avg/Max)
     5           32    104857600          4.6            23204       28       45       54       14       21      364    65986    73892   112279        0        0        0       12       12       13     2647     5777    23842
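To turn the Files/sec column above into a throughput figure, the file
size and file rate can simply be multiplied.  The sketch below does
that; the helper names are made up, and the 4 KiB page size is an
assumption (it is not stated anywhere in this thread).  Since
Files/sec also covers fsync and unlink time, the result is only an
approximation of the writeback rate.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed page size; typical for x86-64 but not stated in the thread. */
#define ASSUMED_PAGE_SIZE 4096u

/* Bytes written per second; Files/sec is passed in tenths (4.6 -> 46). */
static uint64_t bytes_per_sec(uint64_t file_size, uint32_t files_per_sec_x10)
{
	assert(file_size > 0);
	return file_size * files_per_sec_x10 / 10;
}

/* The same rate in pages per second under the assumed page size. */
static uint64_t pages_per_sec(uint64_t file_size, uint32_t files_per_sec_x10)
{
	return bytes_per_sec(file_size, files_per_sec_x10) / ASSUMED_PAGE_SIZE;
}
```

For 104857600-byte files at 4.6 files/sec this works out to
482344960 bytes/sec (460 MiB/s), or 117760 pages per second if pages
are 4 KiB.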
> I'd suggest that you should provide mechanisms at the block layer
> for accounting the pages in the bio to the memcg they belong to,
> not make a sweeping directive that filesystems can only write back
> pages from one memcg at a time.
> 
> If you account for pages to their memcg and decide on bio priority
> at bio_add_page() time you would avoid the inversion and cross-cg
> accounting problems.  If you do this, the filesystem doesn't need to
> care at all what memcg pages belong to; they just do optimal IO to
> clean sequential dirty pages and it is accounted and throttled
> appropriately by the lower layers.
That'd destroy the fundamental feedback mechanism which propagates
pressure from the blkcg-split block device up through writeback and
eventually to the memcg.  This chain of backpressure is why the whole
scheme works.  When a blkcg on a device gets congested, its request
reserve becomes contended, which in turn sets the congestion state on
the channel and blocks further bio submissions until requests complete.
This blocking of bios is the final and ultimate channel of
backpressure propagation.  If you start mixing pages from different
cgroups in a single bio, the only options for handling it in the
lower layer are either splitting it into two separate requests and
finishing the bio only on completion of both, or choosing one victim
cgroup, essentially arbitrarily.  Both can lead to gross priority
inversion in many circumstances.
> > Maybe we can think of optimizations down the road but I'd strongly
> > prefer to stick to simple and clear divisions among cgroups.  Also, a
> > file highly interleaved by multiple cgroups isn't a particularly
> > likely use case.
> 
> That's true, and that's a further reason why I think we should not
> be caring about this case in the filesystem writeback code at all.
I'm afraid I'm not following this logic.  Why would we do something
which isn't straightforward and has a lot of corner cases for the
prospect of optimizing a fringe case?  The only thing filesystem
writeback logic has to do is skip pages which belong to a
different cgroup, just like it'd skip a page which is already under
writeback.  There's nothing complicated about it.  Those pages simply
aren't the target of that writeback instance.
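The skip rule can be modelled in a few lines of userspace C.  This is
a sketch only: struct model_page, should_skip() and write_range() are
hypothetical stand-ins, not the kernel's struct page or the patchset's
wbc_skip_page().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for a page cache page; fields are illustrative. */
struct model_page {
	int  cgroup_id;		/* cgroup that dirtied the page */
	bool dirty;
	bool under_writeback;
};

/* True if a writeback pass for target_cgroup should skip this page. */
static bool should_skip(const struct model_page *page, int target_cgroup)
{
	if (!page->dirty || page->under_writeback)
		return true;
	return page->cgroup_id != target_cgroup;
}

/* Start writeback on eligible pages in [0, n); return how many were picked. */
static size_t write_range(struct model_page *pages, size_t n, int target_cgroup)
{
	size_t picked = 0;

	assert(pages != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (should_skip(&pages[i], target_cgroup))
			continue;
		pages[i].dirty = false;
		pages[i].under_writeback = true;
		picked++;
	}
	return picked;
}
```

A foreign cgroup's page is treated exactly like a page already under
writeback: it is left untouched and stays dirty for the writeback
instance that does target its cgroup.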
Thanks.
-- 
tejun
^ permalink raw reply	[flat|nested] 54+ messages in thread
- * Re: [PATCHSET RFC block/for-next] writeback: cgroup writeback support
  2015-01-10 15:56         ` Tejun Heo
@ 2015-01-10 16:05           ` Tejun Heo
  0 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2015-01-10 16:05 UTC (permalink / raw)
  To: Dave Chinner
  Cc: axboe, linux-kernel, jack, hch, hannes, linux-fsdevel, vgoyal,
	lizefan, cgroups, linux-mm, mhocko, clm, fengguang.wu
On Sat, Jan 10, 2015 at 10:56:53AM -0500, Tejun Heo wrote:
...
> backpressure propagation.  If you start mixing pages from different
> cgroups in a single bio, the only options for handling it from the
> lower layer is either splitting it into two separate requests and
> finish the bio only on completion of both or choosing one victim
> cgroup, essentially arbitrarily, both of which can lead to gross
> priority inversion in many circumstances.
Another aspect to consider here is that cfq-iosched doesn't even issue
IOs from different cgroups at the same time.  It schedules time slices
for different cgroups and at any given time only issues a stream of
IOs from a single cgroup.  This is mainly because it's impossible to
determine how much time the target device takes to process a specific
IO request, especially when it's a write.  The only way we can
approximate the cost with an acceptable level of accuracy is bunching
multiple IOs up and then measuring the time to finish them in groups so
that the deviations can be spread across multiple requests.  This
means that we can't issue IOs belonging to different cgroups at the
same time because we can't account for the division of cost among the
different cgroups.
Thanks.
-- 
tejun
^ permalink raw reply	[flat|nested] 54+ messages in thread 
- * Re: [PATCHSET RFC block/for-next] writeback: cgroup writeback support
  2015-01-06 21:25 [PATCHSET RFC block/for-next] writeback: cgroup writeback support Tejun Heo
                   ` (44 preceding siblings ...)
  2015-01-06 21:44 ` [PATCHSET RFC block/for-next] writeback: " Tejun Heo
@ 2015-01-08  9:30 ` Jan Kara
  2015-01-09 21:36   ` Tejun Heo
  45 siblings, 1 reply; 54+ messages in thread
From: Jan Kara @ 2015-01-08  9:30 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe, linux-kernel, jack, hch, hannes, linux-fsdevel, vgoyal,
	lizefan, cgroups, linux-mm, mhocko, clm, fengguang.wu, david
  Hello,
On Tue 06-01-15 16:25:37, Tejun Heo wrote:
> blkio cgroup (blkcg) is severely crippled in that it can only control
> read and direct write IOs.  blkcg can't tell which cgroup should be
> held responsible for a given writeback IO and charges all of them to
> the root cgroup - all normal write traffic ends up in the root cgroup.
> Although the problem has been identified years ago, mainly because it
> interacts with so many subsystems, it hasn't been solved yet.
> 
> This patchset finally implements cgroup writeback support so that
> writeback of a page is attributed to the corresponding blkcg of the
> memcg that the page belongs to.
> 
> Overall design
> --------------
> 
> * This requires cooperation between memcg and blkcg.  The IOs are
>   charged to the blkcg that the page's memcg corresponds to.  This
>   currently works only on the unified hierarchy.
> 
> * Each memcg maintains reference counted front and back pointers to
>   the corresponding blkcg.  Whenever a page gets dirtied or initiates
>   writeback, it uses the blkcg the front one points to.  The reference
>   counting ensures that the association remains till the page is done
>   and having front and back pointers guarantees that the association
>   can change without being live-locked by pages being continuously
>   dirtied.
> 
> * struct bdi_writeback (wb) was always embedded in struct
>   backing_dev_info (bdi) and the distinction between the two wasn't
>   clear.  This patchset makes wb operate as an independent writeback
>   execution.  bdi->wb is still embedded and serves the root cgroup but
>   other wb's can be associated with a single bdi each serving a
>   non-root wb.
> 
> * All writeback operations are made per-wb instead of per-bdi.
>   bdi-wide operations are split across all member wb's.  If some
>   finite amount needs to be distributed, be it number of pages to
>   writeback or bdi->min/max_ratio, it's distributed according to the
>   bandwidth proportion a wb has in the bdi.
> 
> * Non-root wb's host and write back only dirty pages (I_DIRTY_PAGES).
>   I_DIRTY_[DATA]SYNC is always handled by the root wb.
> 
> * An inode may have pages dirtied by different memcgs, which naturally
>   means that it should be able to be dirtied against multiple wb's.
>   To support linking an inode against multiple wb's, iwbl
>   (inode_wb_link) is introduced.  An inode has multiple iwbl's
>   associated with it if it's dirty against multiple wb's.
  Is the ability for inode to belong to multiple memcgs really worth the
effort? It brings significant complications (see also Dave's email) and
the last time we were discussing cgroup writeback support the demand from
users for this was small... How hard would it be to just start with an
implementation which attaches the inode to the first memcg that dirties it
(and detaches it when inode gets clean)? And implement sharing of inodes
among memcgs only if there's a real demand for it?
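
The simpler scheme asked about above could look roughly like the
following sketch.  It is plain C for illustration only, not kernel
code: struct sketch_inode, NO_OWNER and the helper names are all
hypothetical, and memcgs are reduced to integer ids.  The inode is
owned by whichever memcg dirties it first and is released once its
last dirty page has been cleaned.

```c
#include <assert.h>
#include <stddef.h>

#define NO_OWNER (-1)

/* Hypothetical stand-in for an inode; not the kernel's struct inode. */
struct sketch_inode {
	int    owner_memcg;	/* memcg charged for writeback, or NO_OWNER */
	size_t nr_dirty_pages;	/* dirty pages currently on the inode */
};

static void sketch_inode_init(struct sketch_inode *inode)
{
	inode->owner_memcg = NO_OWNER;
	inode->nr_dirty_pages = 0;
}

/* Dirty one page on behalf of @memcg; returns the memcg that is charged. */
static int sketch_dirty_page(struct sketch_inode *inode, int memcg)
{
	if (inode->owner_memcg == NO_OWNER)
		inode->owner_memcg = memcg;	/* first dirtier wins */
	inode->nr_dirty_pages++;
	return inode->owner_memcg;
}

/* One page finished writeback; detach once the inode is clean. */
static void sketch_page_cleaned(struct sketch_inode *inode)
{
	if (inode->nr_dirty_pages == 0)
		return;
	if (--inode->nr_dirty_pages == 0)
		inode->owner_memcg = NO_OWNER;
}
```

Note that a second memcg dirtying the same inode is charged to the
first one until the inode becomes clean, which is the tradeoff this
scheme accepts.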
								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply	[flat|nested] 54+ messages in thread
- * Re: [PATCHSET RFC block/for-next] writeback: cgroup writeback support
  2015-01-08  9:30 ` Jan Kara
@ 2015-01-09 21:36   ` Tejun Heo
  0 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2015-01-09 21:36 UTC (permalink / raw)
  To: Jan Kara
  Cc: axboe, linux-kernel, hch, hannes, linux-fsdevel, vgoyal, lizefan,
	cgroups, linux-mm, mhocko, clm, fengguang.wu, david
Hello, Jan.
On Thu, Jan 08, 2015 at 10:30:57AM +0100, Jan Kara wrote:
> > * An inode may have pages dirtied by different memcgs, which naturally
> >   means that it should be able to be dirtied against multiple wb's.
> >   To support linking an inode against multiple wb's, iwbl
> >   (inode_wb_link) is introduced.  An inode has multiple iwbl's
> >   associated with it if it's dirty against multiple wb's.
>
>   Is the ability for inode to belong to multiple memcgs really worth the
> effort? It brings significant complications (see also Dave's email) and
> the last time we were discussing cgroup writeback support the demand from
> users for this was small... How hard would it be to just start with an
> implementation which attaches the inode to the first memcg that dirties it
> (and detaches it when inode gets clean)? And implement sharing of inodes
> among memcgs only if there's a real demand for it?
This was something I spent quite some time debating back and forth.
IMO, the complexity added from having to handle dirtying against
multiple cgroups isn't that high in the scheme of things.  It enables
use cases where different regions of an inode are actively shared by
multiple cgroups and more importantly makes unexpected behaviors a lot
less likely by aligning what writeback and blkcg see with memcg's
perception.  As mentioned in the head message, this gives us the
ability to hook up dirty ratio handling correctly for each memcg.
That working properly strongly hinges on everybody involved seeing the
same picture.
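
To make the multiple-wb linking concrete, here is a minimal sketch in
plain C.  It is only an illustration and not the code from the
patchset: struct sketch_iwbl, struct multi_inode, the fixed
SKETCH_MAX_LINKS array and the helper names are assumptions made for
this example, and wb's are reduced to integer ids.  The point is that
the inode carries one link per wb it is dirty against, so each wb sees
only the pages dirtied by its own cgroup.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_LINKS 4

/* Hypothetical analogue of an inode_wb_link (iwbl). */
struct sketch_iwbl {
	int    wb_id;		/* wb the pages are dirty against */
	size_t nr_dirty;	/* dirty pages charged to that wb */
};

/* Hypothetical inode that can be dirty against several wb's at once. */
struct multi_inode {
	struct sketch_iwbl links[SKETCH_MAX_LINKS];
	size_t nr_links;
};

static struct sketch_iwbl *find_link(struct multi_inode *inode, int wb_id)
{
	for (size_t i = 0; i < inode->nr_links; i++)
		if (inode->links[i].wb_id == wb_id)
			return &inode->links[i];
	return NULL;
}

/* Dirty one page against @wb_id; returns 0, or -1 if no link is free. */
static int multi_inode_dirty(struct multi_inode *inode, int wb_id)
{
	struct sketch_iwbl *l = find_link(inode, wb_id);

	if (!l) {
		if (inode->nr_links == SKETCH_MAX_LINKS)
			return -1;
		l = &inode->links[inode->nr_links++];
		l->wb_id = wb_id;
		l->nr_dirty = 0;
	}
	l->nr_dirty++;
	return 0;
}

/* One page of @wb_id was written back; drop the link when it is clean. */
static void multi_inode_clean(struct multi_inode *inode, int wb_id)
{
	struct sketch_iwbl *l = find_link(inode, wb_id);

	if (!l)
		return;
	if (--l->nr_dirty == 0)
		*l = inode->links[--inode->nr_links];	/* swap-remove */
}
```

Dirtying against wb 1, wb 2 and wb 1 again leaves two links, with two
pages on wb 1 and one on wb 2; cleaning the wb 2 page removes only that
link.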
Thanks.
-- 
tejun
^ permalink raw reply	[flat|nested] 54+ messages in thread