linux-fsdevel.vger.kernel.org archive mirror
* [PATCH V3 2/8] Make TestSetPageDirty and dirty page accounting in one func
       [not found] ` <1356455919-14445-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
@ 2012-12-25 17:22   ` Sha Zhengju
  2012-12-28  0:39     ` Kamezawa Hiroyuki
  2013-01-02  9:08     ` Michal Hocko
  0 siblings, 2 replies; 27+ messages in thread
From: Sha Zhengju @ 2012-12-25 17:22 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA
  Cc: dchinner-H+wXaHxf7aLQT0dZR+AlfA, mhocko-AlSwsSmVLrQ,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
	gthelen-hpIqsD4AKlfQT0dZR+AlfA,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w,
	glommer-bzQdu9zFT3WakBO8gow8eQ, Sha Zhengju

From: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>

Commit a8e7d49a ("Fix race in create_empty_buffers() vs __set_page_dirty_buffers()")
extracted TestSetPageDirty from __set_page_dirty, moving it far away from
account_page_dirtied. But it is better to keep the two operations in one single
function, for modularity. So, in order to avoid the potential race mentioned in
commit a8e7d49a, we hold private_lock until __set_page_dirty completes.
It has been confirmed that there is no deadlock between ->private_lock and ->tree_lock.
This is a preparatory patch for the following memcg dirty page accounting patches.
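
For clarity, the locking in __set_page_dirty_buffers() after this patch looks
roughly like the following (a sketch reconstructed from the hunks below):

	spin_lock(&mapping->private_lock);
	if (page_has_buffers(page)) {
		/* ... walk the buffer ring and set_buffer_dirty() each one ... */
	}
	/*
	 * TestSetPageDirty and the dirty accounting now happen together,
	 * while ->private_lock is still held; ->tree_lock nests inside.
	 */
	newly_dirty = __set_page_dirty(page, mapping, 1);
	spin_unlock(&mapping->private_lock);

	return newly_dirty;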


Here are some test numbers from before and after this patch:
Test steps (Mem-4g, ext4):
drop_cache; sync
fio (ioengine=sync/write/buffered/bs=4k/size=1g/numjobs=2/group_reporting/thread)

We ran the test 10 times and took the average:
Before:
write: io=2048.0MB, bw=254117KB/s, iops=63528.9 , runt=  8279msec
lat (usec): min=1 , max=742361 , avg=30.918, stdev=1601.02
After:
write: io=2048.0MB, bw=254044KB/s, iops=63510.3 , runt=  8274.4msec
lat (usec): min=1 , max=856333 , avg=31.043, stdev=1769.32

Note that the impact is small (<1%).


Signed-off-by: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
Reviewed-by: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
---
 fs/buffer.c |   24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index c017a2d..3b032b9 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -609,9 +609,15 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
  * If warn is true, then emit a warning if the page is not uptodate and has
  * not been truncated.
  */
-static void __set_page_dirty(struct page *page,
+static int __set_page_dirty(struct page *page,
 		struct address_space *mapping, int warn)
 {
+	if (unlikely(!mapping))
+		return !TestSetPageDirty(page);
+
+	if (TestSetPageDirty(page))
+		return 0;
+
 	spin_lock_irq(&mapping->tree_lock);
 	if (page->mapping) {	/* Race with truncate? */
 		WARN_ON_ONCE(warn && !PageUptodate(page));
@@ -621,6 +627,8 @@ static void __set_page_dirty(struct page *page,
 	}
 	spin_unlock_irq(&mapping->tree_lock);
 	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+
+	return 1;
 }
 
 /*
@@ -666,11 +674,9 @@ int __set_page_dirty_buffers(struct page *page)
 			bh = bh->b_this_page;
 		} while (bh != head);
 	}
-	newly_dirty = !TestSetPageDirty(page);
+	newly_dirty = __set_page_dirty(page, mapping, 1);
 	spin_unlock(&mapping->private_lock);
 
-	if (newly_dirty)
-		__set_page_dirty(page, mapping, 1);
 	return newly_dirty;
 }
 EXPORT_SYMBOL(__set_page_dirty_buffers);
@@ -1125,14 +1131,8 @@ void mark_buffer_dirty(struct buffer_head *bh)
 			return;
 	}
 
-	if (!test_set_buffer_dirty(bh)) {
-		struct page *page = bh->b_page;
-		if (!TestSetPageDirty(page)) {
-			struct address_space *mapping = page_mapping(page);
-			if (mapping)
-				__set_page_dirty(page, mapping, 0);
-		}
-	}
+	if (!test_set_buffer_dirty(bh))
+		__set_page_dirty(bh->b_page, page_mapping(bh->b_page), 0);
 }
 EXPORT_SYMBOL(mark_buffer_dirty);
 
-- 
1.7.9.5


* [PATCH V3 3/8] use vfs __set_page_dirty interface instead of doing it inside filesystem
       [not found] <1356455919-14445-1-git-send-email-handai.szj@taobao.com>
       [not found] ` <1356455919-14445-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
@ 2012-12-25 17:24 ` Sha Zhengju
       [not found]   ` <1356456261-14579-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
  2012-12-25 17:26 ` [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting Sha Zhengju
  2 siblings, 1 reply; 27+ messages in thread
From: Sha Zhengju @ 2012-12-25 17:24 UTC (permalink / raw)
  To: linux-kernel, cgroups, linux-mm, linux-fsdevel, ceph-devel
  Cc: sage, dchinner, mhocko, akpm, kamezawa.hiroyu, gthelen,
	fengguang.wu, glommer, Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

In the following patches we will treat SetPageDirty and dirty page accounting as one
integrated operation. Filesystems should use the vfs interface directly rather than
open-coding those details.
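
For example, a filesystem's set_page_dirty method can then shrink to roughly
the following (a sketch modelled on the ceph conversion below; the function
name is only illustrative):

	static int example_set_page_dirty(struct page *page)
	{
		struct address_space *mapping = page->mapping;

		/*
		 * __set_page_dirty() now does TestSetPageDirty, the dirty
		 * page accounting and the radix-tree tagging in one place.
		 */
		if (!__set_page_dirty(page, mapping, 1))
			return 0;	/* page was already dirty */

		/* ... filesystem-specific bookkeeping goes here ... */
		return 1;
	}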

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Acked-by: Sage Weil <sage@inktank.com>
---
 fs/buffer.c                 |    3 ++-
 fs/ceph/addr.c              |   20 ++------------------
 include/linux/buffer_head.h |    2 ++
 3 files changed, 6 insertions(+), 19 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 3b032b9..762168a 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -609,7 +609,7 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
  * If warn is true, then emit a warning if the page is not uptodate and has
  * not been truncated.
  */
-static int __set_page_dirty(struct page *page,
+int __set_page_dirty(struct page *page,
 		struct address_space *mapping, int warn)
 {
 	if (unlikely(!mapping))
@@ -630,6 +630,7 @@ static int __set_page_dirty(struct page *page,
 
 	return 1;
 }
+EXPORT_SYMBOL(__set_page_dirty);
 
 /*
  * Add a page to the dirty page list.
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 6690269..f2779b8 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -5,6 +5,7 @@
 #include <linux/mm.h>
 #include <linux/pagemap.h>
 #include <linux/writeback.h>	/* generic_writepages */
+#include <linux/buffer_head.h>
 #include <linux/slab.h>
 #include <linux/pagevec.h>
 #include <linux/task_io_accounting_ops.h>
@@ -73,14 +74,8 @@ static int ceph_set_page_dirty(struct page *page)
 	int undo = 0;
 	struct ceph_snap_context *snapc;
 
-	if (unlikely(!mapping))
-		return !TestSetPageDirty(page);
-
-	if (TestSetPageDirty(page)) {
-		dout("%p set_page_dirty %p idx %lu -- already dirty\n",
-		     mapping->host, page, page->index);
+	if (!__set_page_dirty(page, mapping, 1))
 		return 0;
-	}
 
 	inode = mapping->host;
 	ci = ceph_inode(inode);
@@ -107,14 +102,7 @@ static int ceph_set_page_dirty(struct page *page)
 	     snapc, snapc->seq, snapc->num_snaps);
 	spin_unlock(&ci->i_ceph_lock);
 
-	/* now adjust page */
-	spin_lock_irq(&mapping->tree_lock);
 	if (page->mapping) {	/* Race with truncate? */
-		WARN_ON_ONCE(!PageUptodate(page));
-		account_page_dirtied(page, page->mapping);
-		radix_tree_tag_set(&mapping->page_tree,
-				page_index(page), PAGECACHE_TAG_DIRTY);
-
 		/*
 		 * Reference snap context in page->private.  Also set
 		 * PagePrivate so that we get invalidatepage callback.
@@ -126,14 +114,10 @@ static int ceph_set_page_dirty(struct page *page)
 		undo = 1;
 	}
 
-	spin_unlock_irq(&mapping->tree_lock);
-
 	if (undo)
 		/* whoops, we failed to dirty the page */
 		ceph_put_wrbuffer_cap_refs(ci, 1, snapc);
 
-	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
-
 	BUG_ON(!PageDirty(page));
 	return 1;
 }
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 458f497..0a331a8 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -336,6 +336,8 @@ static inline void lock_buffer(struct buffer_head *bh)
 }
 
 extern int __set_page_dirty_buffers(struct page *page);
+extern int __set_page_dirty(struct page *page,
+		struct address_space *mapping, int warn);
 
 #else /* CONFIG_BLOCK */
 
-- 
1.7.9.5



* [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
       [not found] <1356455919-14445-1-git-send-email-handai.szj@taobao.com>
       [not found] ` <1356455919-14445-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
  2012-12-25 17:24 ` [PATCH V3 3/8] use vfs __set_page_dirty interface instead of doing it inside filesystem Sha Zhengju
@ 2012-12-25 17:26 ` Sha Zhengju
  2013-01-02 10:44   ` Michal Hocko
  2013-01-06 20:07   ` Greg Thelen
  2 siblings, 2 replies; 27+ messages in thread
From: Sha Zhengju @ 2012-12-25 17:26 UTC (permalink / raw)
  To: linux-kernel, cgroups, linux-mm, linux-fsdevel
  Cc: mhocko, akpm, kamezawa.hiroyu, gthelen, fengguang.wu, glommer,
	dchinner, Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

This patch adds memcg routines to count dirty pages, which allows the memory controller
to maintain an accurate view of the amount of its dirty memory and can provide some
information to users while the cgroup's direct reclaim is working.

After Kame's commit 89c06bd5 ("memcg: use new logic for page stat accounting"), we can
use the 'struct page' flag to test page state instead of a per-page_cgroup flag. But memcg
has a feature to move a page from one cgroup to another, which may race between
"move" and "page stat accounting". So, in order to avoid the race, we have designed a
bigger lock:

         mem_cgroup_begin_update_page_stat()
         modify page information        -->(a)
         mem_cgroup_update_page_stat()  -->(b)
         mem_cgroup_end_update_page_stat()
It requires that (a) and (b) (the dirty page accounting) stay close enough together.
In the previous two preparatory patches, we reworked the vfs set-page-dirty routines
and now the interfaces are more explicit:
        incrementing (2):
                __set_page_dirty
                __set_page_dirty_nobuffers
        decrementing (2):
                clear_page_dirty_for_io
                cancel_dirty_page

To prevent the AB/BA deadlock mentioned by Greg Thelen in the previous version
(https://lkml.org/lkml/2012/7/30/227), we adjust the lock order to:
->private_lock --> mapping->tree_lock --> memcg->move_lock.
So we need to take mapping->tree_lock ahead of TestSetPageDirty in __set_page_dirty()
and __set_page_dirty_nobuffers(). But in order to avoid useless spinlock contention,
a preliminary PageDirty() check is added.
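
Concretely, the increment side in __set_page_dirty() then follows this pattern
(a sketch of the fs/buffer.c hunk below):

	if (PageDirty(page))		/* cheap check, avoids taking the locks */
		return 0;

	spin_lock_irq(&mapping->tree_lock);
	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
	if (TestSetPageDirty(page)) {	/* lost the race to another dirtier */
		mem_cgroup_end_update_page_stat(page, &locked, &flags);
		spin_unlock_irq(&mapping->tree_lock);
		return 0;
	}
	/* ... account_page_dirtied(), radix-tree tagging ... */
	mem_cgroup_end_update_page_stat(page, &locked, &flags);
	spin_unlock_irq(&mapping->tree_lock);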


Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Fengguang Wu <fengguang.wu@intel.com>
---
 fs/buffer.c                |   14 +++++++++++++-
 include/linux/memcontrol.h |    1 +
 mm/filemap.c               |   10 ++++++++++
 mm/memcontrol.c            |   29 ++++++++++++++++++++++-------
 mm/page-writeback.c        |   39 ++++++++++++++++++++++++++++++++-------
 mm/truncate.c              |    6 ++++++
 6 files changed, 84 insertions(+), 15 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 762168a..53402d2 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -612,19 +612,31 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
 int __set_page_dirty(struct page *page,
 		struct address_space *mapping, int warn)
 {
+	bool locked;
+	unsigned long flags;
+
 	if (unlikely(!mapping))
 		return !TestSetPageDirty(page);
 
-	if (TestSetPageDirty(page))
+	if (PageDirty(page))
 		return 0;
 
 	spin_lock_irq(&mapping->tree_lock);
+	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
+
+	if (TestSetPageDirty(page)) {
+		mem_cgroup_end_update_page_stat(page, &locked, &flags);
+		spin_unlock_irq(&mapping->tree_lock);
+		return 0;
+	}
+
 	if (page->mapping) {	/* Race with truncate? */
 		WARN_ON_ONCE(warn && !PageUptodate(page));
 		account_page_dirtied(page, mapping);
 		radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 	}
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 	spin_unlock_irq(&mapping->tree_lock);
 	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5421b8a..2685d8a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -44,6 +44,7 @@ enum mem_cgroup_stat_index {
 	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
 	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
 	MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
+	MEM_CGROUP_STAT_FILE_DIRTY,  /* # of dirty pages in page cache */
 	MEM_CGROUP_STAT_NSTATS,
 };
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 83efee7..b589be5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -62,6 +62,11 @@
  *      ->swap_lock		(exclusive_swap_page, others)
  *        ->mapping->tree_lock
  *
+ *    ->private_lock		(__set_page_dirty_buffers)
+ *      ->mapping->tree_lock
+ *        ->memcg->move_lock	(mem_cgroup_begin_update_page_stat->
+ *							move_lock_mem_cgroup)
+ *
  *  ->i_mutex
  *    ->i_mmap_mutex		(truncate->unmap_mapping_range)
  *
@@ -112,6 +117,8 @@
 void __delete_from_page_cache(struct page *page)
 {
 	struct address_space *mapping = page->mapping;
+	bool locked;
+	unsigned long flags;
 
 	/*
 	 * if we're uptodate, flush out into the cleancache, otherwise
@@ -139,10 +146,13 @@ void __delete_from_page_cache(struct page *page)
 	 * Fix it up by doing a final dirty accounting check after
 	 * having removed the page entirely.
 	 */
+	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
+		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
 		dec_zone_page_state(page, NR_FILE_DIRTY);
 		dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
 	}
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 }
 
 /**
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d450c04..c884640 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -95,6 +95,7 @@ static const char * const mem_cgroup_stat_names[] = {
 	"rss",
 	"mapped_file",
 	"swap",
+	"dirty",
 };
 
 enum mem_cgroup_events_index {
@@ -3609,6 +3610,19 @@ void mem_cgroup_split_huge_fixup(struct page *head)
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+static inline
+void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
+					struct mem_cgroup *to,
+					unsigned int nr_pages,
+					enum mem_cgroup_stat_index idx)
+{
+	/* Update stat data for mem_cgroup */
+	preempt_disable();
+	__this_cpu_add(from->stat->count[idx], -nr_pages);
+	__this_cpu_add(to->stat->count[idx], nr_pages);
+	preempt_enable();
+}
+
 /**
  * mem_cgroup_move_account - move account of the page
  * @page: the page
@@ -3654,13 +3668,14 @@ static int mem_cgroup_move_account(struct page *page,
 
 	move_lock_mem_cgroup(from, &flags);
 
-	if (!anon && page_mapped(page)) {
-		/* Update mapped_file data for mem_cgroup */
-		preempt_disable();
-		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
-		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
-		preempt_enable();
-	}
+	if (!anon && page_mapped(page))
+		mem_cgroup_move_account_page_stat(from, to, nr_pages,
+			MEM_CGROUP_STAT_FILE_MAPPED);
+
+	if (PageDirty(page))
+		mem_cgroup_move_account_page_stat(from, to, nr_pages,
+			MEM_CGROUP_STAT_FILE_DIRTY);
+
 	mem_cgroup_charge_statistics(from, anon, -nr_pages);
 
 	/* caller should have done css_get */
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 0713bfb..526ddd7 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1978,11 +1978,17 @@ int __set_page_dirty_no_writeback(struct page *page)
 
 /*
  * Helper function for set_page_dirty family.
+ *
+ * The caller must hold mem_cgroup_begin/end_update_page_stat() lock
+ * while modifying struct page state and accounting dirty pages.
+ * See __set_page_dirty for example.
+ *
  * NOTE: This relies on being atomic wrt interrupts.
  */
 void account_page_dirtied(struct page *page, struct address_space *mapping)
 {
 	if (mapping_cap_account_dirty(mapping)) {
+		mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
 		__inc_zone_page_state(page, NR_FILE_DIRTY);
 		__inc_zone_page_state(page, NR_DIRTIED);
 		__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
@@ -2022,14 +2028,22 @@ EXPORT_SYMBOL(account_page_writeback);
  */
 int __set_page_dirty_nobuffers(struct page *page)
 {
+	bool locked;
+	unsigned long flags;
+	struct address_space *mapping = page_mapping(page);
+
+	if (PageDirty(page))
+		return 0;
+
+	if (unlikely(!mapping))
+		return !TestSetPageDirty(page);
+
+	spin_lock_irq(&mapping->tree_lock);
+	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
+
 	if (!TestSetPageDirty(page)) {
-		struct address_space *mapping = page_mapping(page);
 		struct address_space *mapping2;
 
-		if (!mapping)
-			return 1;
-
-		spin_lock_irq(&mapping->tree_lock);
 		mapping2 = page_mapping(page);
 		if (mapping2) { /* Race with truncate? */
 			BUG_ON(mapping2 != mapping);
@@ -2038,13 +2052,18 @@ int __set_page_dirty_nobuffers(struct page *page)
 			radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 		}
+		mem_cgroup_end_update_page_stat(page, &locked, &flags);
 		spin_unlock_irq(&mapping->tree_lock);
+
 		if (mapping->host) {
 			/* !PageAnon && !swapper_space */
 			__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 		}
 		return 1;
 	}
+
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
+	spin_unlock_irq(&mapping->tree_lock);
 	return 0;
 }
 EXPORT_SYMBOL(__set_page_dirty_nobuffers);
@@ -2160,6 +2179,9 @@ EXPORT_SYMBOL(set_page_dirty_lock);
 int clear_page_dirty_for_io(struct page *page)
 {
 	struct address_space *mapping = page_mapping(page);
+	bool locked;
+	unsigned long flags;
+	int ret = 0;
 
 	BUG_ON(!PageLocked(page));
 
@@ -2201,13 +2223,16 @@ int clear_page_dirty_for_io(struct page *page)
 		 * the desired exclusion. See mm/memory.c:do_wp_page()
 		 * for more comments.
 		 */
+		mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 		if (TestClearPageDirty(page)) {
+			mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 			dec_bdi_stat(mapping->backing_dev_info,
 					BDI_RECLAIMABLE);
-			return 1;
+			ret = 1;
 		}
-		return 0;
+		mem_cgroup_end_update_page_stat(page, &locked, &flags);
+		return ret;
 	}
 	return TestClearPageDirty(page);
 }
diff --git a/mm/truncate.c b/mm/truncate.c
index db1b216..c81e6c4 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -73,9 +73,14 @@ static inline void truncate_partial_page(struct page *page, unsigned partial)
  */
 void cancel_dirty_page(struct page *page, unsigned int account_size)
 {
+	bool locked;
+	unsigned long flags;
+
+	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 	if (TestClearPageDirty(page)) {
 		struct address_space *mapping = page->mapping;
 		if (mapping && mapping_cap_account_dirty(mapping)) {
+			mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 			dec_bdi_stat(mapping->backing_dev_info,
 					BDI_RECLAIMABLE);
@@ -83,6 +88,7 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
 				task_io_account_cancelled_write(account_size);
 		}
 	}
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 }
 EXPORT_SYMBOL(cancel_dirty_page);
 
-- 
1.7.9.5



* Re: [PATCH V3 2/8] Make TestSetPageDirty and dirty page accounting in one func
  2012-12-25 17:22   ` [PATCH V3 2/8] Make TestSetPageDirty and dirty page accounting in one func Sha Zhengju
@ 2012-12-28  0:39     ` Kamezawa Hiroyuki
  2013-01-05  2:34       ` Sha Zhengju
  2013-01-02  9:08     ` Michal Hocko
  1 sibling, 1 reply; 27+ messages in thread
From: Kamezawa Hiroyuki @ 2012-12-28  0:39 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-kernel, cgroups, linux-mm, linux-fsdevel, dchinner, mhocko,
	akpm, gthelen, fengguang.wu, glommer, Sha Zhengju

(2012/12/26 2:22), Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
> 
> Commit a8e7d49a(Fix race in create_empty_buffers() vs __set_page_dirty_buffers())
> extracts TestSetPageDirty from __set_page_dirty and is far away from
> account_page_dirtied. But it's better to make the two operations in one single
> function to keep modular. So in order to avoid the potential race mentioned in
> commit a8e7d49a, we can hold private_lock until __set_page_dirty completes.
> There's no deadlock between ->private_lock and ->tree_lock after confirmation.
> It's a prepare patch for following memcg dirty page accounting patches.
> 
> 
> Here is some test numbers that before/after this patch:
> Test steps(Mem-4g, ext4):
> drop_cache; sync
> fio (ioengine=sync/write/buffered/bs=4k/size=1g/numjobs=2/group_reporting/thread)
> 
> We test it for 10 times and get the average numbers:
> Before:
> write: io=2048.0MB, bw=254117KB/s, iops=63528.9 , runt=  8279msec
> lat (usec): min=1 , max=742361 , avg=30.918, stdev=1601.02
> After:
> write: io=2048.0MB, bw=254044KB/s, iops=63510.3 , runt=  8274.4msec
> lat (usec): min=1 , max=856333 , avg=31.043, stdev=1769.32
> 
> Note that the impact is little(<1%).
> 
> 
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
> Reviewed-by: Michal Hocko <mhocko@suse.cz>

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Hmm... this change should be double-checked by the vfs and I/O guys...

Doesn't the increased hold time of mapping->private_lock affect performance?




* Re: [PATCH V3 3/8] use vfs __set_page_dirty interface instead of doing it inside filesystem
       [not found]   ` <1356456261-14579-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
@ 2012-12-28  0:41     ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 27+ messages in thread
From: Kamezawa Hiroyuki @ 2012-12-28  0:41 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	ceph-devel-u79uwXL29TY76Z2rM5mHXA, sage-BnTBU8nroG7k1uMJSBkQmQ,
	dchinner-H+wXaHxf7aLQT0dZR+AlfA, mhocko-AlSwsSmVLrQ,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	gthelen-hpIqsD4AKlfQT0dZR+AlfA,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w,
	glommer-bzQdu9zFT3WakBO8gow8eQ, Sha Zhengju

(2012/12/26 2:24), Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
> 
> Following we will treat SetPageDirty and dirty page accounting as an integrated
> operation. Filesystems had better use vfs interface directly to avoid those details.
> 
> Signed-off-by: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
> Acked-by: Sage Weil <sage-4GqslpFJ+cxBDgjK7y7TUQ@public.gmane.org>

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>


* Re: [PATCH V3 2/8] Make TestSetPageDirty and dirty page accounting in one func
  2012-12-25 17:22   ` [PATCH V3 2/8] Make TestSetPageDirty and dirty page accounting in one func Sha Zhengju
  2012-12-28  0:39     ` Kamezawa Hiroyuki
@ 2013-01-02  9:08     ` Michal Hocko
       [not found]       ` <20130102090803.GB22160-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  1 sibling, 1 reply; 27+ messages in thread
From: Michal Hocko @ 2013-01-02  9:08 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-kernel, cgroups, linux-mm, linux-fsdevel, dchinner, akpm,
	kamezawa.hiroyu, gthelen, fengguang.wu, glommer, Sha Zhengju

On Wed 26-12-12 01:22:36, Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
> 
> Commit a8e7d49a(Fix race in create_empty_buffers() vs __set_page_dirty_buffers())
> extracts TestSetPageDirty from __set_page_dirty and is far away from
> account_page_dirtied. But it's better to make the two operations in one single
> function to keep modular. So in order to avoid the potential race mentioned in
> commit a8e7d49a, we can hold private_lock until __set_page_dirty completes.
> There's no deadlock between ->private_lock and ->tree_lock after confirmation.

Could you be more specific here? E.g. quote the mm/filemap.c comment I
mentioned during the first round of review?

> It's a prepare patch for following memcg dirty page accounting patches.
> 
> 
> Here is some test numbers that before/after this patch:
> Test steps(Mem-4g, ext4):
> drop_cache; sync
> fio (ioengine=sync/write/buffered/bs=4k/size=1g/numjobs=2/group_reporting/thread)

Could you also add some rationale for why you think this test is relevant?

> We test it for 10 times and get the average numbers:
> Before:
> write: io=2048.0MB, bw=254117KB/s, iops=63528.9 , runt=  8279msec
> lat (usec): min=1 , max=742361 , avg=30.918, stdev=1601.02
> After:
> write: io=2048.0MB, bw=254044KB/s, iops=63510.3 , runt=  8274.4msec
> lat (usec): min=1 , max=856333 , avg=31.043, stdev=1769.32
> 
> Note that the impact is little(<1%).
> [...]

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
  2012-12-25 17:26 ` [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting Sha Zhengju
@ 2013-01-02 10:44   ` Michal Hocko
  2013-01-05  4:48     ` Sha Zhengju
  2013-05-03  9:11     ` Michal Hocko
  2013-01-06 20:07   ` Greg Thelen
  1 sibling, 2 replies; 27+ messages in thread
From: Michal Hocko @ 2013-01-02 10:44 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-kernel, cgroups, linux-mm, linux-fsdevel, akpm,
	kamezawa.hiroyu, gthelen, fengguang.wu, glommer, dchinner,
	Sha Zhengju

On Wed 26-12-12 01:26:07, Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
> 
> This patch adds memcg routines to count dirty pages, which allows memory controller
> to maintain an accurate view of the amount of its dirty memory and can provide some
> info for users while cgroup's direct reclaim is working.

I guess you meant targeted resp. (hard/soft) limit reclaim here,
right? It is true that this is direct reclaim, but it is not clear to me
why the usefulness should be limited to reclaim for users. I would
understand this if the users were in fact in-kernel users.

[...]
> To prevent AB/BA deadlock mentioned by Greg Thelen in previous version
> (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order:
> ->private_lock --> mapping->tree_lock --> memcg->move_lock.
> So we need to make mapping->tree_lock ahead of TestSetPageDirty in __set_page_dirty()
> and __set_page_dirty_nobuffers(). But in order to avoiding useless spinlock contention,
> a prepare PageDirty() checking is added.

But there is another AA deadlock here I believe.
page_remove_rmap
  mem_cgroup_begin_update_page_stat		<<< 1
  set_page_dirty
    __set_page_dirty_buffers
      __set_page_dirty
        mem_cgroup_begin_update_page_stat	<<< 2
	  move_lock_mem_cgroup
	    spin_lock_irqsave(&memcg->move_lock, *flags);

mem_cgroup_begin_update_page_stat is not recursive wrt. locking AFAICS
because we might race with the moving charges:
	CPU0						CPU1					
page_remove_rmap
						mem_cgroup_can_attach
  mem_cgroup_begin_update_page_stat (1)
    rcu_read_lock
						  mem_cgroup_start_move
						    atomic_inc(&memcg_moving)
						    atomic_inc(&memcg->moving_account)
						    synchronize_rcu
    __mem_cgroup_begin_update_page_stat
      mem_cgroup_stolen	<<< TRUE
      move_lock_mem_cgroup
  [...]
        mem_cgroup_begin_update_page_stat (2)
	  __mem_cgroup_begin_update_page_stat
	    mem_cgroup_stolen	  <<< still TRUE
	    move_lock_mem_cgroup  <<< DEADLOCK
  [...]
  mem_cgroup_end_update_page_stat
    rcu_unlock
    						  # wake up from synchronize_rcu
						[...]
						mem_cgroup_move_task
						  mem_cgroup_move_charge
						    walk_page_range
						      mem_cgroup_move_account
						        move_lock_mem_cgroup


Maybe I have missed some other locking which would prevent this from
happening, but the locking relations are really complicated in this area,
so if mem_cgroup_{begin,end}_update_page_stat might be called
recursively then we need a fat comment which justifies that.

[...]
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH V3 2/8] Make TestSetPageDirty and dirty page accounting in one func
  2012-12-28  0:39     ` Kamezawa Hiroyuki
@ 2013-01-05  2:34       ` Sha Zhengju
  0 siblings, 0 replies; 27+ messages in thread
From: Sha Zhengju @ 2013-01-05  2:34 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: linux-kernel, cgroups, linux-mm, linux-fsdevel, dchinner, mhocko,
	akpm, gthelen, fengguang.wu, glommer, Sha Zhengju

Hi Kame,

Sorry for the late response, I'm just back from vacation. : )

On Fri, Dec 28, 2012 at 8:39 AM, Kamezawa Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> (2012/12/26 2:22), Sha Zhengju wrote:
>> From: Sha Zhengju <handai.szj@taobao.com>
>>
>> Commit a8e7d49a(Fix race in create_empty_buffers() vs __set_page_dirty_buffers())
>> extracts TestSetPageDirty from __set_page_dirty and is far away from
>> account_page_dirtied. But it's better to make the two operations in one single
>> function to keep modular. So in order to avoid the potential race mentioned in
>> commit a8e7d49a, we can hold private_lock until __set_page_dirty completes.
>> There's no deadlock between ->private_lock and ->tree_lock after confirmation.
>> It's a prepare patch for following memcg dirty page accounting patches.
>>
>>
>> Here is some test numbers that before/after this patch:
>> Test steps(Mem-4g, ext4):
>> drop_cache; sync
>> fio (ioengine=sync/write/buffered/bs=4k/size=1g/numjobs=2/group_reporting/thread)
>>
>> We test it for 10 times and get the average numbers:
>> Before:
>> write: io=2048.0MB, bw=254117KB/s, iops=63528.9 , runt=  8279msec
>> lat (usec): min=1 , max=742361 , avg=30.918, stdev=1601.02
>> After:
>> write: io=2048.0MB, bw=254044KB/s, iops=63510.3 , runt=  8274.4msec
>> lat (usec): min=1 , max=856333 , avg=31.043, stdev=1769.32
>>
>> Note that the impact is little(<1%).
>>
>>
>> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
>> Reviewed-by: Michal Hocko <mhocko@suse.cz>
>
> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> Hmm,..this change should be double-checked by vfs, I/O guys...
>

Now it seems they haven't paid attention here... I'll push it soon for
more review.

> increasing hold time of mapping->private_lock doesn't affect performance ?
>
>

Yes, as pointed out by Fengguang in the previous round, mapping->private_lock
and mapping->tree_lock are often contended locks; in a dd testcase they show
the top #1 and #2 contention.
So the numbers above try to measure the impact of lock contention with
multiple threads (numjobs=2) writing to the same file in parallel, and it
seems the impact is small (<1%).
I'm not sure if the test case is enough, any advice is welcome! : )


* Re: [PATCH V3 2/8] Make TestSetPageDirty and dirty page accounting in one func
       [not found]       ` <20130102090803.GB22160-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2013-01-05  2:49         ` Sha Zhengju
       [not found]           ` <CAFj3OHUCQkqB2+ky9wxFpkNYcn2=6t9Qd7XFf3RBY0F4Wxyqcg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 27+ messages in thread
From: Sha Zhengju @ 2013-01-05  2:49 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	dchinner-H+wXaHxf7aLQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
	gthelen-hpIqsD4AKlfQT0dZR+AlfA,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w,
	glommer-bzQdu9zFT3WakBO8gow8eQ, Sha Zhengju

Hi Michal,

Sorry for my late response, I'm just back from vacation. : )

On Wed, Jan 2, 2013 at 5:08 PM, Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> wrote:
> On Wed 26-12-12 01:22:36, Sha Zhengju wrote:
>> From: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
>>
>> Commit a8e7d49a(Fix race in create_empty_buffers() vs __set_page_dirty_buffers())
>> extracts TestSetPageDirty from __set_page_dirty and is far away from
>> account_page_dirtied. But it's better to make the two operations in one single
>> function to keep modular. So in order to avoid the potential race mentioned in
>> commit a8e7d49a, we can hold private_lock until __set_page_dirty completes.
>> There's no deadlock between ->private_lock and ->tree_lock after confirmation.
>
> Could you be more specific here? E.g. quote mm/filemap.c comment I have
> mentioned during the first round of review?
>

Okay, sorry for forgetting the comment. I'll add it in the next round.

>> It's a prepare patch for following memcg dirty page accounting patches.
>>
>>
>> Here is some test numbers that before/after this patch:
>> Test steps(Mem-4g, ext4):
>> drop_cache; sync
>> fio (ioengine=sync/write/buffered/bs=4k/size=1g/numjobs=2/group_reporting/thread)
>
> Could also add some rationale why you think this test is relevant?
>

The test aims at measuring the performance impact of lock contention when
writing to the same file in parallel. I'll add the rationale in the next
version too.

Thanks for reviewing!


Regards,
Sha

[...]


* Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
  2013-01-02 10:44   ` Michal Hocko
@ 2013-01-05  4:48     ` Sha Zhengju
  2013-01-06 20:02       ` Hugh Dickins
       [not found]       ` <CAFj3OHXKyMO3gwghiBAmbowvqko-JqLtKroX2kzin1rk=q9tZg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2013-05-03  9:11     ` Michal Hocko
  1 sibling, 2 replies; 27+ messages in thread
From: Sha Zhengju @ 2013-01-05  4:48 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, cgroups, linux-mm, linux-fsdevel, akpm,
	kamezawa.hiroyu, gthelen, fengguang.wu, glommer, dchinner,
	Sha Zhengju

On Wed, Jan 2, 2013 at 6:44 PM, Michal Hocko <mhocko@suse.cz> wrote:
> On Wed 26-12-12 01:26:07, Sha Zhengju wrote:
>> From: Sha Zhengju <handai.szj@taobao.com>
>>
>> This patch adds memcg routines to count dirty pages, which allows memory controller
>> to maintain an accurate view of the amount of its dirty memory and can provide some
>> info for users while cgroup's direct reclaim is working.
>
> I guess you meant targeted resp. (hard/soft) limit reclaim here,
> right? It is true that this is direct reclaim but it is not clear to me

Yes, I meant memcg hard/soft limit reclaim here, which is triggered directly
by allocation and is distinct from background kswapd reclaim (global).

> why the usefulnes should be limitted to the reclaim for users. I would
> understand this if the users was in fact in-kernel users.
>

One of the reasons I'm trying to account the dirty pages is to get a
broader overall view of memory usage, because memcg hard/soft
reclaim may affect the response time of user applications.
Yeah, the beneficiary can be the application administrator or kernel users.  :P

> [...]
>> To prevent AB/BA deadlock mentioned by Greg Thelen in previous version
>> (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order:
>> ->private_lock --> mapping->tree_lock --> memcg->move_lock.
>> So we need to make mapping->tree_lock ahead of TestSetPageDirty in __set_page_dirty()
>> and __set_page_dirty_nobuffers(). But in order to avoiding useless spinlock contention,
>> a prepare PageDirty() checking is added.
>
> But there is another AA deadlock here I believe.
> page_remove_rmap
>   mem_cgroup_begin_update_page_stat             <<< 1
>   set_page_dirty
>     __set_page_dirty_buffers
>       __set_page_dirty
>         mem_cgroup_begin_update_page_stat       <<< 2
>           move_lock_mem_cgroup
>             spin_lock_irqsave(&memcg->move_lock, *flags);
>
> mem_cgroup_begin_update_page_stat is not recursive wrt. locking AFAICS
> because we might race with the moving charges:
>         CPU0                                            CPU1
> page_remove_rmap
>                                                 mem_cgroup_can_attach
>   mem_cgroup_begin_update_page_stat (1)
>     rcu_read_lock
>                                                   mem_cgroup_start_move
>                                                     atomic_inc(&memcg_moving)
>                                                     atomic_inc(&memcg->moving_account)
>                                                     synchronize_rcu
>     __mem_cgroup_begin_update_page_stat
>       mem_cgroup_stolen <<< TRUE
>       move_lock_mem_cgroup
>   [...]
>         mem_cgroup_begin_update_page_stat (2)
>           __mem_cgroup_begin_update_page_stat
>             mem_cgroup_stolen     <<< still TRUE
>             move_lock_mem_cgroup  <<< DEADLOCK
>   [...]
>   mem_cgroup_end_update_page_stat
>     rcu_unlock
>                                                   # wake up from synchronize_rcu
>                                                 [...]
>                                                 mem_cgroup_move_task
>                                                   mem_cgroup_move_charge
>                                                     walk_page_range
>                                                       mem_cgroup_move_account
>                                                         move_lock_mem_cgroup
>
>
> Maybe I have missed some other locking which would prevent this from
> happening but the locking relations are really complicated in this area
> so if mem_cgroup_{begin,end}_update_page_stat might be called
> recursively then we need a fat comment which justifies that.
>

Ohhh... good catch!  I didn't notice there is a recursive call of
mem_cgroup_{begin,end}_update_page_stat() in page_remove_rmap().
The mem_cgroup_{begin,end}_update_page_stat() design has depressed
me a lot recently, as the lock granularity is a little bigger than I thought.
Not only the resource but also some code logic falls inside the locked region,
which may be deadlock prone. The problem will still exist if we try to
account other memcg page stats later, so may I be so bold as to suggest
that we dig into the locking again...

But with regard to the current lock implementation, I doubt whether we can
account MEM_CGROUP_STAT_FILE_{MAPPED, DIRTY} in one breath and just
take move_lock once at the beginning. IMHO we can make
mem_cgroup_{begin,end}_update_page_stat() recursion aware; what I'm
thinking of now is changing memcg->move_lock from the original spinlock
to a rw-spinlock: mem_cgroup_{begin,end}_update_page_stat() takes the
read lock, which makes it reentrant, while the memcg task-moving side
takes the write lock. Then the race may look as follows (a rough sketch
of the rwlock idea is given after the trace):

        CPU0                                            CPU1
page_remove_rmap
                                                mem_cgroup_can_attach
  mem_cgroup_begin_update_page_stat (1)
    rcu_read_lock
                                                  mem_cgroup_start_move
                                                    atomic_inc(&memcg_moving)
                                                    atomic_inc(&memcg->moving_account)
                                                    synchronize_rcu
    __mem_cgroup_begin_update_page_stat
      mem_cgroup_stolen   <<< TRUE
      move_lock_mem_cgroup   <<<< read-spinlock success
  [...]
        mem_cgroup_begin_update_page_stat (2)
          __mem_cgroup_begin_update_page_stat
            mem_cgroup_stolen     <<< still TRUE
            move_lock_mem_cgroup  <<<< read-spinlock success
  [...]
  mem_cgroup_end_update_page_stat     <<< locked = true, unlock
    rcu_unlock
                                                  # wake up from synchronize_rcu
                                                [...]
                                                mem_cgroup_move_task
                                                  mem_cgroup_move_charge
                                                    walk_page_range
                                                      mem_cgroup_move_account
                                                        move_lock_mem_cgroup   <<< write-spinlock
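
Just to illustrate the idea, a rough and untested sketch (the _write variant
name is made up here):

	/* memcg->move_lock becomes an rwlock_t instead of a spinlock_t */
	static void move_lock_mem_cgroup(struct mem_cgroup *memcg,
					 unsigned long *flags)
	{
		/* page stat updaters take the shared side, so nesting is fine */
		read_lock_irqsave(&memcg->move_lock, *flags);
	}

	static void move_lock_mem_cgroup_write(struct mem_cgroup *memcg,
					       unsigned long *flags)
	{
		/* mem_cgroup_move_account() side takes the exclusive side */
		write_lock_irqsave(&memcg->move_lock, *flags);
	}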


AFAICS, the deadlock seems to be avoided by the combination of RCU and the rwlock.
Is there anything I missed?


Thanks,
Sha



* Re: [PATCH V3 2/8] Make TestSetPageDirty and dirty page accounting in one func
       [not found]           ` <CAFj3OHUCQkqB2+ky9wxFpkNYcn2=6t9Qd7XFf3RBY0F4Wxyqcg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-01-05 10:45             ` Michal Hocko
  0 siblings, 0 replies; 27+ messages in thread
From: Michal Hocko @ 2013-01-05 10:45 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	dchinner-H+wXaHxf7aLQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
	gthelen-hpIqsD4AKlfQT0dZR+AlfA,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w,
	glommer-bzQdu9zFT3WakBO8gow8eQ, Sha Zhengju

On Sat 05-01-13 10:49:00, Sha Zhengju wrote:
[...]
> >> Here is some test numbers that before/after this patch:
> >> Test steps(Mem-4g, ext4):
> >> drop_cache; sync
> >> fio (ioengine=sync/write/buffered/bs=4k/size=1g/numjobs=2/group_reporting/thread)
> >
> > Could also add some rationale why you think this test is relevant?
> >
> 
> The test is aiming at finding the impact of performance due to lock
> contention by writing parallel
> to the same file. I'll add the reason next version too.

Please make sure to describe which locks are exercised during that test
and how much.

Thanks
[...]
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
  2013-01-05  4:48     ` Sha Zhengju
@ 2013-01-06 20:02       ` Hugh Dickins
       [not found]         ` <alpine.LNX.2.00.1301061135400.29149-fupSdm12i1nKWymIFiNcPA@public.gmane.org>
       [not found]       ` <CAFj3OHXKyMO3gwghiBAmbowvqko-JqLtKroX2kzin1rk=q9tZg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 27+ messages in thread
From: Hugh Dickins @ 2013-01-06 20:02 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: Michal Hocko, Johannes Weiner, linux-kernel, cgroups, linux-mm,
	linux-fsdevel, akpm, kamezawa.hiroyu, gthelen, fengguang.wu,
	glommer, dchinner, Sha Zhengju

On Sat, 5 Jan 2013, Sha Zhengju wrote:
> On Wed, Jan 2, 2013 at 6:44 PM, Michal Hocko <mhocko@suse.cz> wrote:
> >
> > Maybe I have missed some other locking which would prevent this from
> > happening but the locking relations are really complicated in this area
> > so if mem_cgroup_{begin,end}_update_page_stat might be called
> > recursively then we need a fat comment which justifies that.
> >
> 
> Ohhh...good catching!  I didn't notice there is a recursive call of
> mem_cgroup_{begin,end}_update_page_stat in page_remove_rmap().
> The mem_cgroup_{begin,end}_update_page_stat() design has depressed
> me a lot recently as the lock granularity is a little bigger than I thought.
> Not only the resource but also some code logic is in the range of locking
> which may be deadlock prone. The problem still exists if we are trying to
> add stat account of other memcg page later, may I make bold to suggest
> that we dig into the lock again...

Forgive me, I must confess I'm no more than skimming this thread,
and don't like dumping unsigned-off patches on people; but thought
that on balance it might be more helpful than not if I offer you a
patch I worked on around 3.6-rc2 (but have updated to 3.8-rc2 below).

I too was getting depressed by the constraints imposed by
mem_cgroup_{begin,end}_update_page_stat (good job though Kamezawa-san
did to minimize them), and wanted to replace by something freer, more
RCU-like.  In the end it seemed more effort than it was worth to go
as far as I wanted, but I do think that this is some improvement over
what we currently have, and should deal with your recursion issue.

But if this does appear useful to memcg people, then we really ought
to get it checked over by locking/barrier experts before going further.
I think myself that I've over-barriered it, and could use a little
lighter; but they (Paul McKenney, Peter Zijlstra, Oleg Nesterov come
to mind) will see more clearly, and may just hate the whole thing,
as yet another peculiar lockdep-avoiding hand-crafted locking scheme.
I've not wanted to waste their time on reviewing it, if it's not even
going to be useful to memcg people.

It may be easier to understand if you just apply the patch and look
at the result in mm/memcontrol.c, where I tried to gather the pieces
together in one place and describe them ("These functions mediate...").

Hugh

 include/linux/memcontrol.h |   39 +--
 mm/memcontrol.c            |  375 +++++++++++++++++++++--------------
 mm/rmap.c                  |   20 -
 3 files changed, 257 insertions(+), 177 deletions(-)

--- 3.8-rc2/include/linux/memcontrol.h	2012-12-22 09:43:27.172015571 -0800
+++ linux/include/linux/memcontrol.h	2013-01-02 14:47:47.960394878 -0800
@@ -136,32 +136,28 @@ static inline bool mem_cgroup_disabled(v
 	return false;
 }
 
-void __mem_cgroup_begin_update_page_stat(struct page *page, bool *locked,
-					 unsigned long *flags);
-
+void __mem_cgroup_begin_update_page_stat(struct page *page);
+void __mem_cgroup_end_update_page_stat(void);
 extern atomic_t memcg_moving;
 
 static inline void mem_cgroup_begin_update_page_stat(struct page *page,
-					bool *locked, unsigned long *flags)
+						     bool *clamped)
 {
-	if (mem_cgroup_disabled())
-		return;
-	rcu_read_lock();
-	*locked = false;
-	if (atomic_read(&memcg_moving))
-		__mem_cgroup_begin_update_page_stat(page, locked, flags);
+	preempt_disable();
+	*clamped = false;
+	if (unlikely(atomic_read(&memcg_moving))) {
+		__mem_cgroup_begin_update_page_stat(page);
+		*clamped = true;
+	}
 }
 
-void __mem_cgroup_end_update_page_stat(struct page *page,
-				unsigned long *flags);
 static inline void mem_cgroup_end_update_page_stat(struct page *page,
-					bool *locked, unsigned long *flags)
+						   bool *clamped)
 {
-	if (mem_cgroup_disabled())
-		return;
-	if (*locked)
-		__mem_cgroup_end_update_page_stat(page, flags);
-	rcu_read_unlock();
+	/* We don't currently use the page arg, but keep it for symmetry */
+	if (unlikely(*clamped))
+		__mem_cgroup_end_update_page_stat();
+	preempt_enable();
 }
 
 void mem_cgroup_update_page_stat(struct page *page,
@@ -345,13 +341,16 @@ mem_cgroup_print_oom_info(struct mem_cgr
 }
 
 static inline void mem_cgroup_begin_update_page_stat(struct page *page,
-					bool *locked, unsigned long *flags)
+						     bool *clamped)
 {
+	/* It may be helpful to our callers if the stub behaves the same way */
+	preempt_disable();
 }
 
 static inline void mem_cgroup_end_update_page_stat(struct page *page,
-					bool *locked, unsigned long *flags)
+						   bool *clamped)
 {
+	preempt_enable();
 }
 
 static inline void mem_cgroup_inc_page_stat(struct page *page,
--- 3.8-rc2/mm/memcontrol.c	2012-12-22 09:43:27.628015582 -0800
+++ linux/mm/memcontrol.c	2013-01-02 14:55:36.268406008 -0800
@@ -321,12 +321,7 @@ struct mem_cgroup {
 	 * mem_cgroup ? And what type of charges should we move ?
 	 */
 	unsigned long 	move_charge_at_immigrate;
-	/*
-	 * set > 0 if pages under this cgroup are moving to other cgroup.
-	 */
-	atomic_t	moving_account;
-	/* taken only while moving_account > 0 */
-	spinlock_t	move_lock;
+
 	/*
 	 * percpu counter.
 	 */
@@ -1414,60 +1409,10 @@ int mem_cgroup_swappiness(struct mem_cgr
 }
 
 /*
- * memcg->moving_account is used for checking possibility that some thread is
- * calling move_account(). When a thread on CPU-A starts moving pages under
- * a memcg, other threads should check memcg->moving_account under
- * rcu_read_lock(), like this:
- *
- *         CPU-A                                    CPU-B
- *                                              rcu_read_lock()
- *         memcg->moving_account+1              if (memcg->mocing_account)
- *                                                   take heavy locks.
- *         synchronize_rcu()                    update something.
- *                                              rcu_read_unlock()
- *         start move here.
- */
-
-/* for quick checking without looking up memcg */
-atomic_t memcg_moving __read_mostly;
-
-static void mem_cgroup_start_move(struct mem_cgroup *memcg)
-{
-	atomic_inc(&memcg_moving);
-	atomic_inc(&memcg->moving_account);
-	synchronize_rcu();
-}
-
-static void mem_cgroup_end_move(struct mem_cgroup *memcg)
-{
-	/*
-	 * Now, mem_cgroup_clear_mc() may call this function with NULL.
-	 * We check NULL in callee rather than caller.
-	 */
-	if (memcg) {
-		atomic_dec(&memcg_moving);
-		atomic_dec(&memcg->moving_account);
-	}
-}
-
-/*
- * 2 routines for checking "mem" is under move_account() or not.
- *
- * mem_cgroup_stolen() -  checking whether a cgroup is mc.from or not. This
- *			  is used for avoiding races in accounting.  If true,
- *			  pc->mem_cgroup may be overwritten.
- *
  * mem_cgroup_under_move() - checking a cgroup is mc.from or mc.to or
  *			  under hierarchy of moving cgroups. This is for
- *			  waiting at hith-memory prressure caused by "move".
+ *			  waiting at high memory pressure caused by "move".
  */
-
-static bool mem_cgroup_stolen(struct mem_cgroup *memcg)
-{
-	VM_BUG_ON(!rcu_read_lock_held());
-	return atomic_read(&memcg->moving_account) > 0;
-}
-
 static bool mem_cgroup_under_move(struct mem_cgroup *memcg)
 {
 	struct mem_cgroup *from;
@@ -1506,24 +1451,6 @@ static bool mem_cgroup_wait_acct_move(st
 	return false;
 }
 
-/*
- * Take this lock when
- * - a code tries to modify page's memcg while it's USED.
- * - a code tries to modify page state accounting in a memcg.
- * see mem_cgroup_stolen(), too.
- */
-static void move_lock_mem_cgroup(struct mem_cgroup *memcg,
-				  unsigned long *flags)
-{
-	spin_lock_irqsave(&memcg->move_lock, *flags);
-}
-
-static void move_unlock_mem_cgroup(struct mem_cgroup *memcg,
-				unsigned long *flags)
-{
-	spin_unlock_irqrestore(&memcg->move_lock, *flags);
-}
-
 /**
  * mem_cgroup_print_oom_info: Called from OOM with tasklist_lock held in read mode.
  * @memcg: The memory cgroup that went over limit
@@ -2096,75 +2023,215 @@ static bool mem_cgroup_handle_oom(struct
 }
 
 /*
- * Currently used to update mapped file statistics, but the routine can be
- * generalized to update other statistics as well.
- *
- * Notes: Race condition
- *
- * We usually use page_cgroup_lock() for accessing page_cgroup member but
- * it tends to be costly. But considering some conditions, we doesn't need
- * to do so _always_.
- *
- * Considering "charge", lock_page_cgroup() is not required because all
- * file-stat operations happen after a page is attached to radix-tree. There
- * are no race with "charge".
- *
- * Considering "uncharge", we know that memcg doesn't clear pc->mem_cgroup
- * at "uncharge" intentionally. So, we always see valid pc->mem_cgroup even
- * if there are race with "uncharge". Statistics itself is properly handled
- * by flags.
+ * These functions mediate between the common case of updating memcg stats
+ * when a page transitions from one state to another, and the rare case of
+ * moving a page from one memcg to another.
+ *
+ * A simple example of the updater would be:
+ *	mem_cgroup_begin_update_page_stat(page);
+ *	if (TestClearPageFlag(page))
+ *		mem_cgroup_dec_page_stat(page, NR_FLAG_PAGES);
+ *	mem_cgroup_end_update_page_stat(page);
+ *
+ * An over-simplified example of the mover would be:
+ *	mem_cgroup_begin_move();
+ *	for each page chosen from old_memcg {
+ *		pc = lookup_page_cgroup(page);
+ *		lock_page_cgroup(pc);
+ *		if (trylock_memcg_move(page)) {
+ *			if (PageFlag(page)) {
+ *				mem_cgroup_dec_page_stat(page, NR_FLAG_PAGES);
+ *				pc->mem_cgroup = new_memcg;
+ *				mem_cgroup_inc_page_stat(page, NR_FLAG_PAGES);
+ *			}
+ *			unlock_memcg_move();
+ *			unlock_page_cgroup(pc);
+ *		}
+ *		cond_resched();
+ *	}
+ *	mem_cgroup_end_move();
+ *
+ * Without some kind of serialization between updater and mover, the mover
+ * cannot know whether or not to move one count from old to new memcg stats;
+ * but the serialization must be as lightweight as possible for the updater.
+ *
+ * At present we use two layers of lock avoidance, then spinlock on memcg;
+ * but that already got into (easily avoided) lock hierarchy violation with
+ * the page_cgroup lock; and as dirty writeback stats are added, it gets
+ * into further difficulty with the page cache radix tree lock (and on s390
+ * architecture, page_remove_rmap calls set_page_dirty within its critical
+ * section: perhaps that can be reordered, but if not, it requires nesting).
+ *
+ * We need a mechanism more like rcu_read_lock() for the updater, who then
+ * does not have to worry about lock ordering.  The scheme below is not quite
+ * as light as that: rarely, the updater does have to spin waiting on a mover;
+ * and it is still best for updater to avoid taking page_cgroup lock in its
+ * critical section (though mover drops and retries if necessary, so there is
+ * no actual deadlock).  Testing on 4-way suggests 5% heavier for the mover.
+ */
+
+/*
+ * memcg_moving count is written in advance by movers,
+ * and read by updaters to see if they need to worry further.
+ */
+atomic_t memcg_moving __read_mostly;
+
+/*
+ * Keep it simple: allow only one page to move at a time.  cgroup_mutex
+ * already serializes move_charge_at_immigrate movements, but not writes
+ * to memory.force_empty, nor move-pages-to-parent phase of cgroup rmdir.
  *
- * Considering "move", this is an only case we see a race. To make the race
- * small, we check mm->moving_account and detect there are possibility of race
- * If there is, we take a lock.
+ * memcg_moving_lock guards writes by movers to memcg_moving_page,
+ * which is read by updaters to see if they need to worry about their page.
+ */
+static DEFINE_SPINLOCK(memcg_moving_lock);
+static struct page *memcg_moving_page;
+
+/*
+ * updating_page_stat is written per-cpu by updaters,
+ * and all cpus read by mover to check when safe to proceed with the move.
  */
+static DEFINE_PER_CPU(int, updating_page_stat) = 0;
 
-void __mem_cgroup_begin_update_page_stat(struct page *page,
-				bool *locked, unsigned long *flags)
+/*
+ * Mover calls mem_cgroup_begin_move() before starting on its pages; its
+ * synchronize_rcu() ensures that all updaters will see memcg_moving in time.
+ */
+static void mem_cgroup_begin_move(void)
 {
-	struct mem_cgroup *memcg;
-	struct page_cgroup *pc;
+	get_online_cpus();
+	atomic_inc(&memcg_moving);
+	synchronize_rcu();
+}
+
+static void mem_cgroup_end_move(void)
+{
+	atomic_dec(&memcg_moving);
+	put_online_cpus();
+}
+
+/*
+ * Mover calls trylock_memcg_move(page) before moving stats and changing
+ * ownership of page.  If it fails, mover should drop page_cgroup lock and
+ * any other spinlocks held, cond_resched then try the page again.  This
+ * lets updaters take those locks if unavoidable, though preferably not.
+ */
+static bool trylock_memcg_move(struct page *page)
+{
+	static struct cpumask updating;
+	int try;
+
+	cpumask_copy(&updating, cpu_online_mask);
+	spin_lock(&memcg_moving_lock);
+	memcg_moving_page = page;
 
-	pc = lookup_page_cgroup(page);
-again:
-	memcg = pc->mem_cgroup;
-	if (unlikely(!memcg || !PageCgroupUsed(pc)))
-		return;
 	/*
-	 * If this memory cgroup is not under account moving, we don't
-	 * need to take move_lock_mem_cgroup(). Because we already hold
-	 * rcu_read_lock(), any calls to move_account will be delayed until
-	 * rcu_read_unlock() if mem_cgroup_stolen() == true.
+	 * Make sure that __mem_cgroup_begin_update_page_stat(page) can see
+	 * our memcg_moving_page before it commits to updating_page_stat.
 	 */
-	if (!mem_cgroup_stolen(memcg))
-		return;
+	smp_mb();
 
-	move_lock_mem_cgroup(memcg, flags);
-	if (memcg != pc->mem_cgroup || !PageCgroupUsed(pc)) {
-		move_unlock_mem_cgroup(memcg, flags);
-		goto again;
+	for (try = 0; try < 64; try++) {
+		int updaters = 0;
+		int cpu;
+
+		for_each_cpu(cpu, &updating) {
+			if (ACCESS_ONCE(per_cpu(updating_page_stat, cpu)))
+				updaters++;
+			else
+				cpumask_clear_cpu(cpu, &updating);
+		}
+		if (!updaters)
+			return true;
 	}
-	*locked = true;
+
+	memcg_moving_page = NULL;
+	spin_unlock(&memcg_moving_lock);
+	return false;
 }
 
-void __mem_cgroup_end_update_page_stat(struct page *page, unsigned long *flags)
+static void unlock_memcg_move(void)
 {
-	struct page_cgroup *pc = lookup_page_cgroup(page);
+	memcg_moving_page = NULL;
+	spin_unlock(&memcg_moving_lock);
+}
 
-	/*
-	 * It's guaranteed that pc->mem_cgroup never changes while
-	 * lock is held because a routine modifies pc->mem_cgroup
-	 * should take move_lock_mem_cgroup().
-	 */
-	move_unlock_mem_cgroup(pc->mem_cgroup, flags);
+/*
+ * If memcg_moving, updater calls __mem_cgroup_begin_update_page_stat(page)
+ * (with preemption disabled) to indicate to the next mover that this cpu is
+ * updating a page, or to wait on the mover if it's already moving this page.
+ */
+void __mem_cgroup_begin_update_page_stat(struct page *page)
+{
+	static const int probing = 0x10000;
+	int updating;
+
+	__this_cpu_add(updating_page_stat, probing);
+
+	for (;;) {
+		/*
+		 * Make sure that trylock_memcg_move(page) can see our
+		 * updating_page_stat before we check memcg_moving_page.
+		 *
+		 * We use the special probing value at first so move sees it,
+		 * but nesting and interrupts on this cpu can distinguish it.
+		 */
+		smp_mb();
+
+		if (likely(page != ACCESS_ONCE(memcg_moving_page)))
+			break;
+
+		/*
+		 * We may be nested, we may be serving an interrupt: do not
+		 * hang here if the outer level already went beyond probing.
+		 */
+		updating = __this_cpu_read(updating_page_stat);
+		if (updating & (probing - 1))
+			break;
+
+		__this_cpu_write(updating_page_stat, 0);
+		while (page == ACCESS_ONCE(memcg_moving_page))
+			cpu_relax();
+		__this_cpu_write(updating_page_stat, updating);
+	}
+
+	/* Add one to count and remove temporary probing value */
+	__this_cpu_sub(updating_page_stat, probing - 1);
+}
+
+void __mem_cgroup_end_update_page_stat(void)
+{
+	__this_cpu_dec(updating_page_stat);
+}
+
+/*
+ * Static inline interfaces to the above in include/linux/memcontrol.h:
+ *
+static inline void mem_cgroup_begin_update_page_stat(struct page *page,
+						     bool *clamped)
+{
+	preempt_disable();
+	*clamped = false;
+	if (unlikely(atomic_read(&memcg_moving))) {
+		__mem_cgroup_begin_update_page_stat(page);
+		*clamped = true;
+	}
 }
 
+static inline void mem_cgroup_end_update_page_stat(struct page *page,
+						   bool *clamped)
+{
+	if (unlikely(*clamped))
+		__mem_cgroup_end_update_page_stat();
+	preempt_enable();
+}
+ */
+
 void mem_cgroup_update_page_stat(struct page *page,
 				 enum mem_cgroup_page_stat_item idx, int val)
 {
 	struct mem_cgroup *memcg;
 	struct page_cgroup *pc = lookup_page_cgroup(page);
-	unsigned long uninitialized_var(flags);
 
 	if (mem_cgroup_disabled())
 		return;
@@ -2181,7 +2248,8 @@ void mem_cgroup_update_page_stat(struct
 		BUG();
 	}
 
-	this_cpu_add(memcg->stat->count[idx], val);
+	/* mem_cgroup_begin_update_page_stat() disabled preemption */
+	__this_cpu_add(memcg->stat->count[idx], val);
 }
 
 /*
@@ -3580,7 +3648,6 @@ static int mem_cgroup_move_account(struc
 				   struct mem_cgroup *from,
 				   struct mem_cgroup *to)
 {
-	unsigned long flags;
 	int ret;
 	bool anon = PageAnon(page);
 
@@ -3602,21 +3669,21 @@ static int mem_cgroup_move_account(struc
 	if (!PageCgroupUsed(pc) || pc->mem_cgroup != from)
 		goto unlock;
 
-	move_lock_mem_cgroup(from, &flags);
+	ret = -EAGAIN;
+	if (!trylock_memcg_move(page))
+		goto unlock;
 
 	if (!anon && page_mapped(page)) {
 		/* Update mapped_file data for mem_cgroup */
-		preempt_disable();
 		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
 		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
-		preempt_enable();
 	}
 	mem_cgroup_charge_statistics(from, anon, -nr_pages);
 
 	/* caller should have done css_get */
 	pc->mem_cgroup = to;
 	mem_cgroup_charge_statistics(to, anon, nr_pages);
-	move_unlock_mem_cgroup(from, &flags);
+	unlock_memcg_move();
 	ret = 0;
 unlock:
 	unlock_page_cgroup(pc);
@@ -3675,19 +3742,25 @@ static int mem_cgroup_move_parent(struct
 	 */
 	if (!parent)
 		parent = root_mem_cgroup;
-
+retry:
 	if (nr_pages > 1) {
 		VM_BUG_ON(!PageTransHuge(page));
 		flags = compound_lock_irqsave(page);
 	}
 
-	ret = mem_cgroup_move_account(page, nr_pages,
-				pc, child, parent);
-	if (!ret)
-		__mem_cgroup_cancel_local_charge(child, nr_pages);
+	ret = mem_cgroup_move_account(page, nr_pages, pc, child, parent);
 
 	if (nr_pages > 1)
 		compound_unlock_irqrestore(page, flags);
+
+	if (ret == -EAGAIN) {
+		cond_resched();
+		goto retry;
+	}
+
+	if (!ret)
+		__mem_cgroup_cancel_local_charge(child, nr_pages);
+
 	putback_lru_page(page);
 put:
 	put_page(page);
@@ -4685,7 +4758,7 @@ static void mem_cgroup_reparent_charges(
 		/* This is for making all *used* pages to be on LRU. */
 		lru_add_drain_all();
 		drain_all_stock_sync(memcg);
-		mem_cgroup_start_move(memcg);
+		mem_cgroup_begin_move();
 		for_each_node_state(node, N_MEMORY) {
 			for (zid = 0; zid < MAX_NR_ZONES; zid++) {
 				enum lru_list lru;
@@ -4695,7 +4768,7 @@ static void mem_cgroup_reparent_charges(
 				}
 			}
 		}
-		mem_cgroup_end_move(memcg);
+		mem_cgroup_end_move();
 		memcg_oom_recover(memcg);
 		cond_resched();
 
@@ -6128,7 +6201,6 @@ mem_cgroup_css_alloc(struct cgroup *cont
 	atomic_set(&memcg->refcnt, 1);
 	memcg->move_charge_at_immigrate = 0;
 	mutex_init(&memcg->thresholds_lock);
-	spin_lock_init(&memcg->move_lock);
 
 	error = memcg_init_kmem(memcg, &mem_cgroup_subsys);
 	if (error) {
@@ -6521,7 +6593,8 @@ static void mem_cgroup_clear_mc(void)
 	mc.from = NULL;
 	mc.to = NULL;
 	spin_unlock(&mc.lock);
-	mem_cgroup_end_move(from);
+	if (from)
+		mem_cgroup_end_move();
 }
 
 static int mem_cgroup_can_attach(struct cgroup *cgroup,
@@ -6547,7 +6620,7 @@ static int mem_cgroup_can_attach(struct
 			VM_BUG_ON(mc.precharge);
 			VM_BUG_ON(mc.moved_charge);
 			VM_BUG_ON(mc.moved_swap);
-			mem_cgroup_start_move(from);
+			mem_cgroup_begin_move();
 			spin_lock(&mc.lock);
 			mc.from = from;
 			mc.to = memcg;
@@ -6573,7 +6646,7 @@ static int mem_cgroup_move_charge_pte_ra
 				unsigned long addr, unsigned long end,
 				struct mm_walk *walk)
 {
-	int ret = 0;
+	int ret;
 	struct vm_area_struct *vma = walk->private;
 	pte_t *pte;
 	spinlock_t *ptl;
@@ -6592,6 +6665,8 @@ static int mem_cgroup_move_charge_pte_ra
 	 *    to be unlocked in __split_huge_page_splitting(), where the main
 	 *    part of thp split is not executed yet.
 	 */
+retry:
+	ret = 0;
 	if (pmd_trans_huge_lock(pmd, vma) == 1) {
 		if (mc.precharge < HPAGE_PMD_NR) {
 			spin_unlock(&vma->vm_mm->page_table_lock);
@@ -6602,8 +6677,9 @@ static int mem_cgroup_move_charge_pte_ra
 			page = target.page;
 			if (!isolate_lru_page(page)) {
 				pc = lookup_page_cgroup(page);
-				if (!mem_cgroup_move_account(page, HPAGE_PMD_NR,
-							pc, mc.from, mc.to)) {
+				ret = mem_cgroup_move_account(page,
+					    HPAGE_PMD_NR, pc, mc.from, mc.to);
+				if (!ret) {
 					mc.precharge -= HPAGE_PMD_NR;
 					mc.moved_charge += HPAGE_PMD_NR;
 				}
@@ -6612,12 +6688,14 @@ static int mem_cgroup_move_charge_pte_ra
 			put_page(page);
 		}
 		spin_unlock(&vma->vm_mm->page_table_lock);
+		if (ret == -EAGAIN)
+			goto retry;
 		return 0;
 	}
 
 	if (pmd_trans_unstable(pmd))
 		return 0;
-retry:
+
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
 	for (; addr != end; addr += PAGE_SIZE) {
 		pte_t ptent = *(pte++);
@@ -6632,8 +6710,9 @@ retry:
 			if (isolate_lru_page(page))
 				goto put;
 			pc = lookup_page_cgroup(page);
-			if (!mem_cgroup_move_account(page, 1, pc,
-						     mc.from, mc.to)) {
+			ret = mem_cgroup_move_account(page, 1, pc,
+						      mc.from, mc.to);
+			if (!ret) {
 				mc.precharge--;
 				/* we uncharge from mc.from later. */
 				mc.moved_charge++;
@@ -6653,11 +6732,15 @@ put:			/* get_mctgt_type() gets the page
 		default:
 			break;
 		}
+		if (ret == -EAGAIN)
+			break;
 	}
 	pte_unmap_unlock(pte - 1, ptl);
 	cond_resched();
 
 	if (addr != end) {
+		if (ret == -EAGAIN)
+			goto retry;
 		/*
 		 * We have consumed all precharges we got in can_attach().
 		 * We try charge one by one, but don't do any additional
--- 3.8-rc2/mm/rmap.c	2012-12-22 09:43:27.656015582 -0800
+++ linux/mm/rmap.c	2013-01-02 15:03:46.100417650 -0800
@@ -1107,15 +1107,14 @@ void page_add_new_anon_rmap(struct page
  */
 void page_add_file_rmap(struct page *page)
 {
-	bool locked;
-	unsigned long flags;
+	bool clamped;
 
-	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
+	mem_cgroup_begin_update_page_stat(page, &clamped);
 	if (atomic_inc_and_test(&page->_mapcount)) {
 		__inc_zone_page_state(page, NR_FILE_MAPPED);
 		mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_MAPPED);
 	}
-	mem_cgroup_end_update_page_stat(page, &locked, &flags);
+	mem_cgroup_end_update_page_stat(page, &clamped);
 }
 
 /**
@@ -1128,16 +1127,15 @@ void page_remove_rmap(struct page *page)
 {
 	struct address_space *mapping = page_mapping(page);
 	bool anon = PageAnon(page);
-	bool locked;
-	unsigned long flags;
+	bool uninitialized_var(clamped);
 
 	/*
 	 * The anon case has no mem_cgroup page_stat to update; but may
-	 * uncharge_page() below, where the lock ordering can deadlock if
-	 * we hold the lock against page_stat move: so avoid it on anon.
+	 * uncharge_page() below, when holding page_cgroup lock might force
+	 * a page_stat move to back off temporarily: so avoid it on anon.
 	 */
 	if (!anon)
-		mem_cgroup_begin_update_page_stat(page, &locked, &flags);
+		mem_cgroup_begin_update_page_stat(page, &clamped);
 
 	/* page still mapped by someone else? */
 	if (!atomic_add_negative(-1, &page->_mapcount))
@@ -1182,7 +1180,7 @@ void page_remove_rmap(struct page *page)
 	} else {
 		__dec_zone_page_state(page, NR_FILE_MAPPED);
 		mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_MAPPED);
-		mem_cgroup_end_update_page_stat(page, &locked, &flags);
+		mem_cgroup_end_update_page_stat(page, &clamped);
 	}
 	if (unlikely(PageMlocked(page)))
 		clear_page_mlock(page);
@@ -1198,7 +1196,7 @@ void page_remove_rmap(struct page *page)
 	return;
 out:
 	if (!anon)
-		mem_cgroup_end_update_page_stat(page, &locked, &flags);
+		mem_cgroup_end_update_page_stat(page, &clamped);
 }
 
 /*


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
  2012-12-25 17:26 ` [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting Sha Zhengju
  2013-01-02 10:44   ` Michal Hocko
@ 2013-01-06 20:07   ` Greg Thelen
       [not found]     ` <xr93obh2krcr.fsf-aSPv4SP+Du0KgorLzL7FmE7CuiCeIGUxQQ4Iyu8u01E@public.gmane.org>
  1 sibling, 1 reply; 27+ messages in thread
From: Greg Thelen @ 2013-01-06 20:07 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, mhocko-AlSwsSmVLrQ,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w,
	glommer-bzQdu9zFT3WakBO8gow8eQ, dchinner-H+wXaHxf7aLQT0dZR+AlfA,
	Sha Zhengju

On Tue, Dec 25 2012, Sha Zhengju wrote:

> From: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
>
> This patch adds memcg routines to count dirty pages, which allows memory controller
> to maintain an accurate view of the amount of its dirty memory and can provide some
> info for users while cgroup's direct reclaim is working.
>
> After Kame's commit 89c06bd5(memcg: use new logic for page stat accounting), we can
> use 'struct page' flag to test page state instead of per page_cgroup flag. But memcg
> has a feature to move a page from a cgroup to another one and may have race between
> "move" and "page stat accounting". So in order to avoid the race we have designed a
> bigger lock:
>
>          mem_cgroup_begin_update_page_stat()
>          modify page information        -->(a)
>          mem_cgroup_update_page_stat()  -->(b)
>          mem_cgroup_end_update_page_stat()
> It requires (a) and (b)(dirty pages accounting) can stay close enough.
> In the previous two prepare patches, we have reworked the vfs set page dirty routines
> and now the interfaces are more explicit:
>         incrementing (2):
>                 __set_page_dirty
>                 __set_page_dirty_nobuffers
>         decrementing (2):
>                 clear_page_dirty_for_io
>                 cancel_dirty_page
>
> To prevent AB/BA deadlock mentioned by Greg Thelen in previous version
> (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order:
> ->private_lock --> mapping->tree_lock --> memcg->move_lock.
> So we need to make mapping->tree_lock ahead of TestSetPageDirty in __set_page_dirty()
> and __set_page_dirty_nobuffers(). But in order to avoiding useless spinlock contention,
> a prepare PageDirty() checking is added.
>
>
> Signed-off-by: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu-LdfC7J4mv27QFUHtdCDX3A@public.gmane.org>
> Acked-by: Fengguang Wu <fengguang.wu-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> ---
>  fs/buffer.c                |   14 +++++++++++++-
>  include/linux/memcontrol.h |    1 +
>  mm/filemap.c               |   10 ++++++++++
>  mm/memcontrol.c            |   29 ++++++++++++++++++++++-------
>  mm/page-writeback.c        |   39 ++++++++++++++++++++++++++++++++-------
>  mm/truncate.c              |    6 ++++++
>  6 files changed, 84 insertions(+), 15 deletions(-)

__nilfs_clear_page_dirty() clears PageDirty; does it need modification
for this patch series?

> diff --git a/fs/buffer.c b/fs/buffer.c
> index 762168a..53402d2 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -612,19 +612,31 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
>  int __set_page_dirty(struct page *page,
>  		struct address_space *mapping, int warn)
>  {
> +	bool locked;
> +	unsigned long flags;
> +
>  	if (unlikely(!mapping))
>  		return !TestSetPageDirty(page);
>  
> -	if (TestSetPageDirty(page))
> +	if (PageDirty(page))
>  		return 0;
>  
>  	spin_lock_irq(&mapping->tree_lock);
> +	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
> +
> +	if (TestSetPageDirty(page)) {
> +		mem_cgroup_end_update_page_stat(page, &locked, &flags);
> +		spin_unlock_irq(&mapping->tree_lock);
> +		return 0;
> +	}
> +
>  	if (page->mapping) {	/* Race with truncate? */
>  		WARN_ON_ONCE(warn && !PageUptodate(page));
>  		account_page_dirtied(page, mapping);
>  		radix_tree_tag_set(&mapping->page_tree,
>  				page_index(page), PAGECACHE_TAG_DIRTY);
>  	}
> +	mem_cgroup_end_update_page_stat(page, &locked, &flags);
>  	spin_unlock_irq(&mapping->tree_lock);
>  	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>  
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 5421b8a..2685d8a 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -44,6 +44,7 @@ enum mem_cgroup_stat_index {
>  	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
>  	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
>  	MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
> +	MEM_CGROUP_STAT_FILE_DIRTY,  /* # of dirty pages in page cache */
>  	MEM_CGROUP_STAT_NSTATS,
>  };
>  
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 83efee7..b589be5 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -62,6 +62,11 @@
>   *      ->swap_lock		(exclusive_swap_page, others)
>   *        ->mapping->tree_lock
>   *
> + *    ->private_lock		(__set_page_dirty_buffers)
> + *      ->mapping->tree_lock
> + *        ->memcg->move_lock	(mem_cgroup_begin_update_page_stat->
> + *							move_lock_mem_cgroup)
> + *
>   *  ->i_mutex
>   *    ->i_mmap_mutex		(truncate->unmap_mapping_range)
>   *
> @@ -112,6 +117,8 @@
>  void __delete_from_page_cache(struct page *page)
>  {
>  	struct address_space *mapping = page->mapping;
> +	bool locked;
> +	unsigned long flags;
>  
>  	/*
>  	 * if we're uptodate, flush out into the cleancache, otherwise
> @@ -139,10 +146,13 @@ void __delete_from_page_cache(struct page *page)
>  	 * Fix it up by doing a final dirty accounting check after
>  	 * having removed the page entirely.
>  	 */
> +	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>  	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
> +		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
>  		dec_zone_page_state(page, NR_FILE_DIRTY);
>  		dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
>  	}
> +	mem_cgroup_end_update_page_stat(page, &locked, &flags);
>  }
>  
>  /**
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d450c04..c884640 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -95,6 +95,7 @@ static const char * const mem_cgroup_stat_names[] = {
>  	"rss",
>  	"mapped_file",
>  	"swap",
> +	"dirty",
>  };
>  
>  enum mem_cgroup_events_index {
> @@ -3609,6 +3610,19 @@ void mem_cgroup_split_huge_fixup(struct page *head)
>  }
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  
> +static inline
> +void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
> +					struct mem_cgroup *to,
> +					unsigned int nr_pages,
> +					enum mem_cgroup_stat_index idx)
> +{
> +	/* Update stat data for mem_cgroup */
> +	preempt_disable();
> +	__this_cpu_add(from->stat->count[idx], -nr_pages);

What do you think about adding a WARN_ON_ONCE() here to check for
underflow?  A check might help catch:
a) unresolved races between move accounting and setting/clearing of the
   dirty state.
b) future modifications that mess with PageDirty/Writeback flags without
   considering memcg.

> +	__this_cpu_add(to->stat->count[idx], nr_pages);
> +	preempt_enable();
> +}
> +
>  /**
>   * mem_cgroup_move_account - move account of the page
>   * @page: the page
> @@ -3654,13 +3668,14 @@ static int mem_cgroup_move_account(struct page *page,
>  
>  	move_lock_mem_cgroup(from, &flags);
>  
> -	if (!anon && page_mapped(page)) {
> -		/* Update mapped_file data for mem_cgroup */
> -		preempt_disable();
> -		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> -		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> -		preempt_enable();
> -	}
> +	if (!anon && page_mapped(page))
> +		mem_cgroup_move_account_page_stat(from, to, nr_pages,
> +			MEM_CGROUP_STAT_FILE_MAPPED);
> +
> +	if (PageDirty(page))

Is (!anon && PageDirty(page)) better?  If dirty anon pages are moved
between memcg M1 and M2 I think that we'd mistakenly underflow M1 if it
was not previously accounting for the dirty anon page.
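
For illustration, the guard being suggested would look something like this
in mem_cgroup_move_account() (a sketch only, reusing the helper and the stat
index added by this series):

	/* Dirty anon pages are not accounted as FILE_DIRTY, so move the
	 * dirty count only for file pages, else "from" could underflow. */
	if (!anon && PageDirty(page))
		mem_cgroup_move_account_page_stat(from, to, nr_pages,
				MEM_CGROUP_STAT_FILE_DIRTY);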

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
       [not found]       ` <CAFj3OHXKyMO3gwghiBAmbowvqko-JqLtKroX2kzin1rk=q9tZg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-01-07  7:25         ` Kamezawa Hiroyuki
       [not found]           ` <50EA7860.6030300-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
  0 siblings, 1 reply; 27+ messages in thread
From: Kamezawa Hiroyuki @ 2013-01-07  7:25 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: Michal Hocko, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	gthelen-hpIqsD4AKlfQT0dZR+AlfA,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w,
	glommer-bzQdu9zFT3WakBO8gow8eQ, dchinner-H+wXaHxf7aLQT0dZR+AlfA,
	Sha Zhengju

(2013/01/05 13:48), Sha Zhengju wrote:
> On Wed, Jan 2, 2013 at 6:44 PM, Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> wrote:
>> On Wed 26-12-12 01:26:07, Sha Zhengju wrote:
>>> From: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
>>>
>>> This patch adds memcg routines to count dirty pages, which allows memory controller
>>> to maintain an accurate view of the amount of its dirty memory and can provide some
>>> info for users while cgroup's direct reclaim is working.
>>
>> I guess you meant targeted resp. (hard/soft) limit reclaim here,
>> right? It is true that this is direct reclaim but it is not clear to me
>
> Yes, I meant memcg hard/soft reclaim here which is triggered directly
> by allocation and is distinct from background kswapd reclaim (global).
>
>> why the usefulness should be limited to the reclaim for users. I would
>> understand this if the users were in fact in-kernel users.
>>
>
> One of the reasons I'm trying to account the dirty pages is to get a
> broader overall view of memory usage, because memcg hard/soft
> reclaim may affect the response time of user applications.
> Yeah, the beneficiary can be an application administrator or kernel users.  :P
>
>> [...]
>>> To prevent AB/BA deadlock mentioned by Greg Thelen in previous version
>>> (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order:
>>> ->private_lock --> mapping->tree_lock --> memcg->move_lock.
>>> So we need to make mapping->tree_lock ahead of TestSetPageDirty in __set_page_dirty()
>>> and __set_page_dirty_nobuffers(). But in order to avoiding useless spinlock contention,
>>> a prepare PageDirty() checking is added.
>>
>> But there is another AA deadlock here I believe.
>> page_remove_rmap
>>    mem_cgroup_begin_update_page_stat             <<< 1
>>    set_page_dirty
>>      __set_page_dirty_buffers
>>        __set_page_dirty
>>          mem_cgroup_begin_update_page_stat       <<< 2
>>            move_lock_mem_cgroup
>>              spin_lock_irqsave(&memcg->move_lock, *flags);
>>
>> mem_cgroup_begin_update_page_stat is not recursive wrt. locking AFAICS
>> because we might race with the moving charges:
>>          CPU0                                            CPU1
>> page_remove_rmap
>>                                                  mem_cgroup_can_attach
>>    mem_cgroup_begin_update_page_stat (1)
>>      rcu_read_lock
>>                                                    mem_cgroup_start_move
>>                                                      atomic_inc(&memcg_moving)
>>                                                      atomic_inc(&memcg->moving_account)
>>                                                      synchronize_rcu
>>      __mem_cgroup_begin_update_page_stat
>>        mem_cgroup_stolen <<< TRUE
>>        move_lock_mem_cgroup
>>    [...]
>>          mem_cgroup_begin_update_page_stat (2)
>>            __mem_cgroup_begin_update_page_stat
>>              mem_cgroup_stolen     <<< still TRUE
>>              move_lock_mem_cgroup  <<< DEADLOCK
>>    [...]
>>    mem_cgroup_end_update_page_stat
>>      rcu_unlock
>>                                                    # wake up from synchronize_rcu
>>                                                  [...]
>>                                                  mem_cgroup_move_task
>>                                                    mem_cgroup_move_charge
>>                                                      walk_page_range
>>                                                        mem_cgroup_move_account
>>                                                          move_lock_mem_cgroup
>>
>>
>> Maybe I have missed some other locking which would prevent this from
>> happening but the locking relations are really complicated in this area
>> so if mem_cgroup_{begin,end}_update_page_stat might be called
>> recursively then we need a fat comment which justifies that.
>>
>
> Ohhh...good catch!  I didn't notice there is a recursive call of
> mem_cgroup_{begin,end}_update_page_stat in page_remove_rmap().
> The mem_cgroup_{begin,end}_update_page_stat() design has depressed
> me a lot recently, as the lock granularity is a little bigger than I thought.
> Not only the resources but also some code logic falls inside the locked
> range, which may be deadlock prone. The problem will still exist if we try
> to add stat accounting of other memcg pages later, so may I be so bold as
> to suggest that we dig into the locking again...
>
> But with regard to the current lock implementation, I doubt whether we can
> account MEM_CGROUP_STAT_FILE_{MAPPED, DIRTY} in one breath and just
> take move_lock once at the beginning. IMHO we can make
> mem_cgroup_{begin,end}_update_page_stat() recursion aware, and what I'm
> thinking of now is changing memcg->move_lock from the original spinlock
> to a rw-spinlock:
> mem_cgroup_{begin,end}_update_page_stat() takes the read lock, which makes
> it re-enterable, and the memcg moving task side takes the write lock (a
> rough sketch follows the race diagram below).
> Then the race may be as follows:
>
>          CPU0                                            CPU1
> page_remove_rmap
>                                                  mem_cgroup_can_attach
>    mem_cgroup_begin_update_page_stat (1)
>      rcu_read_lock
>                                                    mem_cgroup_start_move
>                                                      atomic_inc(&memcg_moving)
>                                                      atomic_inc(&memcg->moving_account)
>                                                      synchronize_rcu
>      __mem_cgroup_begin_update_page_stat
>        mem_cgroup_stolen   <<< TRUE
>        move_lock_mem_cgroup   <<<< read-spinlock success
>    [...]
>       mem_cgroup_begin_update_page_stat (2)
>            __mem_cgroup_begin_update_page_stat
>              mem_cgroup_stolen     <<< still TRUE
>              move_lock_mem_cgroup  <<<< read-spinlock success
>
>    [...]
>    mem_cgroup_end_update_page_stat     <<< locked = true, unlock
>      rcu_unlock
>                                                    # wake up from synchronize_rcu
>                                                  [...]
>                                                  mem_cgroup_move_task
>                                                    mem_cgroup_move_charge
>                                                      walk_page_range
>                                                        mem_cgroup_move_account
>                                                          move_lock_mem_cgroup    <<< write-spinlock
>
>
> AFAICS, the deadlock seems to be avoided by the combination of RCU and the
> rwlock.
> Is there anything I've missed?
>
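
A rough sketch of the rw-spinlock variant described above (illustrative only,
with memcg->move_lock changed to an rwlock_t; not code from the posted
series):

static void move_lock_mem_cgroup(struct mem_cgroup *memcg,
				 unsigned long *flags)
{
	/* Updater side: rwlock_t read locks can nest on the same cpu, so a
	 * recursive page_remove_rmap -> set_page_dirty path cannot
	 * self-deadlock. */
	read_lock_irqsave(&memcg->move_lock, *flags);
}

static void move_unlock_mem_cgroup(struct mem_cgroup *memcg,
				   unsigned long *flags)
{
	read_unlock_irqrestore(&memcg->move_lock, *flags);
}

/* Mover side, in mem_cgroup_move_account(): */
	write_lock_irqsave(&from->move_lock, flags);
	/* ... move page stats and pc->mem_cgroup ... */
	write_unlock_irqrestore(&from->move_lock, flags);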

An rwlock will work with the nesting, but it seems ugly to do updates under a
read-lock.

How about something straightforward like this?
==
/*
 * Once a thread takes memcg_move_lock on behalf of a memcg, it can take the
 * lock again for nested calls on the same memcg.
 */
static void move_lock_mem_cgroup(struct mem_cgroup *memcg,
				 unsigned long *flags)
{
	current->memcg_move_lock_nested += 1;
	if (current->memcg_move_lock_nested > 1) {
		VM_BUG_ON(current->move_locked_memcg != memcg);
		return;
	}
	spin_lock_irqsave(&memcg_move_lock, *flags);
	current->move_locked_memcg = memcg;
}

static void move_unlock_mem_cgroup(struct mem_cgroup *memcg,
				   unsigned long *flags)
{
	current->memcg_move_lock_nested -= 1;
	if (!current->memcg_move_lock_nested) {
		current->move_locked_memcg = NULL;
		spin_unlock_irqrestore(&memcg_move_lock, *flags);
	}
}

==
But, hmm, this kind of (ugly) hack may cause trouble as Hugh said.

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
       [not found]         ` <alpine.LNX.2.00.1301061135400.29149-fupSdm12i1nKWymIFiNcPA@public.gmane.org>
@ 2013-01-07  7:49           ` Kamezawa Hiroyuki
  2013-01-09  5:15             ` Hugh Dickins
  2013-01-09 14:35           ` Sha Zhengju
  1 sibling, 1 reply; 27+ messages in thread
From: Kamezawa Hiroyuki @ 2013-01-07  7:49 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Sha Zhengju, Michal Hocko, Johannes Weiner,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	gthelen-hpIqsD4AKlfQT0dZR+AlfA,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w,
	glommer-bzQdu9zFT3WakBO8gow8eQ, dchinner-H+wXaHxf7aLQT0dZR+AlfA,
	Sha Zhengju

(2013/01/07 5:02), Hugh Dickins wrote:
> On Sat, 5 Jan 2013, Sha Zhengju wrote:
>> On Wed, Jan 2, 2013 at 6:44 PM, Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> wrote:
>>>
>>> Maybe I have missed some other locking which would prevent this from
>>> happening but the locking relations are really complicated in this area
>>> so if mem_cgroup_{begin,end}_update_page_stat might be called
>>> recursively then we need a fat comment which justifies that.
>>>
>>
>> Ohhh...good catch!  I didn't notice there is a recursive call of
>> mem_cgroup_{begin,end}_update_page_stat in page_remove_rmap().
>> The mem_cgroup_{begin,end}_update_page_stat() design has depressed
>> me a lot recently, as the lock granularity is a little bigger than I thought.
>> Not only the resources but also some code logic falls inside the locked
>> range, which may be deadlock prone. The problem will still exist if we try
>> to add stat accounting of other memcg pages later, so may I be so bold as
>> to suggest that we dig into the locking again...
>
> Forgive me, I must confess I'm no more than skimming this thread,
> and don't like dumping unsigned-off patches on people; but thought
> that on balance it might be more helpful than not if I offer you a
> patch I worked on around 3.6-rc2 (but have updated to 3.8-rc2 below).
>
> I too was getting depressed by the constraints imposed by
> mem_cgroup_{begin,end}_update_page_stat (good job though Kamezawa-san
> did to minimize them), and wanted to replace by something freer, more
> RCU-like.  In the end it seemed more effort than it was worth to go
> as far as I wanted, but I do think that this is some improvement over
> what we currently have, and should deal with your recursion issue.
>
In what case does this improve performance ?

> But if this does appear useful to memcg people, then we really ought
> to get it checked over by locking/barrier experts before going further.
> I think myself that I've over-barriered it, and could use a little
> lighter; but they (Paul McKenney, Peter Zijlstra, Oleg Nesterov come
> to mind) will see more clearly, and may just hate the whole thing,
> as yet another peculiar lockdep-avoiding hand-crafted locking scheme.
> I've not wanted to waste their time on reviewing it, if it's not even
> going to be useful to memcg people.
>
> It may be easier to understand if you just apply the patch and look
> at the result in mm/memcontrol.c, where I tried to gather the pieces
> together in one place and describe them ("These functions mediate...").
>
> Hugh
>

Hi, this patch seems interesting, but... doesn't this make move_account() very
slow as the number of cpus increases, because of scanning all cpus per page?
And this looks like a reader-can-block-writer percpu rwlock... it's too heavy
on writers if there are many readers.


Thanks,
-Kame


  

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
  2013-01-07  7:49           ` Kamezawa Hiroyuki
@ 2013-01-09  5:15             ` Hugh Dickins
       [not found]               ` <alpine.LNX.2.00.1301082030100.5319-fupSdm12i1nKWymIFiNcPA@public.gmane.org>
  0 siblings, 1 reply; 27+ messages in thread
From: Hugh Dickins @ 2013-01-09  5:15 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: Sha Zhengju, Michal Hocko, Johannes Weiner, linux-kernel, cgroups,
	linux-mm, linux-fsdevel, akpm, gthelen, fengguang.wu, glommer,
	dchinner, Sha Zhengju

On Mon, 7 Jan 2013, Kamezawa Hiroyuki wrote:
> (2013/01/07 5:02), Hugh Dickins wrote:
> > 
> > Forgive me, I must confess I'm no more than skimming this thread,
> > and don't like dumping unsigned-off patches on people; but thought
> > that on balance it might be more helpful than not if I offer you a
> > patch I worked on around 3.6-rc2 (but have updated to 3.8-rc2 below).
> > 
> > I too was getting depressed by the constraints imposed by
> > mem_cgroup_{begin,end}_update_page_stat (good job though Kamezawa-san
> > did to minimize them), and wanted to replace by something freer, more
> > RCU-like.  In the end it seemed more effort than it was worth to go
> > as far as I wanted, but I do think that this is some improvement over
> > what we currently have, and should deal with your recursion issue.
> > 
> In what case does this improve performance ?

Perhaps none.  I was aiming to not degrade performance at the stats
update end, and make it more flexible, so new stats can be updated which
would be problematic today (for lock ordering and recursion reasons).

I've not done any performance measurement on it, and don't have enough
cpus for an interesting report; but if someone thinks it might solve a
problem for them, and has plenty of cpus to test with, please go ahead,
we'd be glad to hear the results.

> Hi, this patch seems interesting, but... doesn't this make move_account() very
> slow as the number of cpus increases, because of scanning all cpus per page?
> And this looks like a reader-can-block-writer percpu rwlock... it's too heavy
> on writers if there are many readers.

I was happy to make the relatively rare move_account end considerably
heavier.  I'll be disappointed if it turns out to be prohibitively
heavy at that end - if we're going to make move_account impossible,
there are much easier ways to achieve that! - but it is a possibility.

Something you might have missed when considering many readers (stats
updaters): the move_account end does not wait for a moment when there
are no readers, that would indeed be a losing strategy; it just waits
for each cpu that's updating page stats to leave that section, so every
cpu is sure to notice and hold off if it then tries to update the page
which is to be moved.  (I may not be explaining that very well!)
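
For concreteness, a dirty-page updater under this scheme might look like the
sketch below (combining the begin/end interface from this patch with the
MEM_CGROUP_STAT_FILE_DIRTY index from Sha's series; radix-tree tagging and
inode dirtying are omitted, so this is not code from either posted patch):

	bool clamped;

	mem_cgroup_begin_update_page_stat(page, &clamped);
	if (!TestSetPageDirty(page)) {
		account_page_dirtied(page, mapping);
		mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
	}
	mem_cgroup_end_update_page_stat(page, &clamped);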

Hugh


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
       [not found]               ` <alpine.LNX.2.00.1301082030100.5319-fupSdm12i1nKWymIFiNcPA@public.gmane.org>
@ 2013-01-09  7:24                 ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 27+ messages in thread
From: Kamezawa Hiroyuki @ 2013-01-09  7:24 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Sha Zhengju, Michal Hocko, Johannes Weiner,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	gthelen-hpIqsD4AKlfQT0dZR+AlfA,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w,
	glommer-bzQdu9zFT3WakBO8gow8eQ, dchinner-H+wXaHxf7aLQT0dZR+AlfA,
	Sha Zhengju

(2013/01/09 14:15), Hugh Dickins wrote:
> On Mon, 7 Jan 2013, Kamezawa Hiroyuki wrote:
>> (2013/01/07 5:02), Hugh Dickins wrote:
>>>
>>> Forgive me, I must confess I'm no more than skimming this thread,
>>> and don't like dumping unsigned-off patches on people; but thought
>>> that on balance it might be more helpful than not if I offer you a
>>> patch I worked on around 3.6-rc2 (but have updated to 3.8-rc2 below).
>>>
>>> I too was getting depressed by the constraints imposed by
>>> mem_cgroup_{begin,end}_update_page_stat (good job though Kamezawa-san
>>> did to minimize them), and wanted to replace by something freer, more
>>> RCU-like.  In the end it seemed more effort than it was worth to go
>>> as far as I wanted, but I do think that this is some improvement over
>>> what we currently have, and should deal with your recursion issue.
>>>
>> In what case does this improve performance ?
>
> Perhaps none.  I was aiming to not degrade performance at the stats
> update end, and make it more flexible, so new stats can be updated which
> would be problematic today (for lock ordering and recursion reasons).
>
> I've not done any performance measurement on it, and don't have enough
> cpus for an interesting report; but if someone thinks it might solve a
> problem for them, and has plenty of cpus to test with, please go ahead,
> we'd be glad to hear the results.
>
>> Hi, this patch seems interesting, but... doesn't this make move_account() very
>> slow as the number of cpus increases, because of scanning all cpus per page?
>> And this looks like a reader-can-block-writer percpu rwlock... it's too heavy
>> on writers if there are many readers.
>
> I was happy to make the relatively rare move_account end considerably
> heavier.  I'll be disappointed if it turns out to be prohibitively
> heavy at that end - if we're going to make move_account impossible,
> there are much easier ways to achieve that! - but it is a possibility.
>

move_account at task-move has been a required feature for NEC, and
Nishimura-san did a good job on it. I'd like to keep it available as much as
possible.

> Something you might have missed when considering many readers (stats
> updaters): the move_account end does not wait for a moment when there
> are no readers, that would indeed be a losing strategy; it just waits
> for each cpu that's updating page stats to leave that section, so every
> cpu is sure to notice and hold off if it then tries to update the page
> which is to be moved.  (I may not be explaining that very well!)
>

Hmm, yeah, maybe I missed something.

BTW, when nested, mem_cgroup_end_update_page_stat() seems to make the counter
go negative.

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
       [not found]     ` <xr93obh2krcr.fsf-aSPv4SP+Du0KgorLzL7FmE7CuiCeIGUxQQ4Iyu8u01E@public.gmane.org>
@ 2013-01-09  9:45       ` Sha Zhengju
  0 siblings, 0 replies; 27+ messages in thread
From: Sha Zhengju @ 2013-01-09  9:45 UTC (permalink / raw)
  To: Greg Thelen
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, mhocko-AlSwsSmVLrQ,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w,
	glommer-bzQdu9zFT3WakBO8gow8eQ, dchinner-H+wXaHxf7aLQT0dZR+AlfA,
	Sha Zhengju

On Mon, Jan 7, 2013 at 4:07 AM, Greg Thelen <gthelen-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
> On Tue, Dec 25 2012, Sha Zhengju wrote:
>
>> From: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
>>
>> This patch adds memcg routines to count dirty pages, which allows memory controller
>> to maintain an accurate view of the amount of its dirty memory and can provide some
>> info for users while cgroup's direct reclaim is working.
>>
>> After Kame's commit 89c06bd5(memcg: use new logic for page stat accounting), we can
>> use 'struct page' flag to test page state instead of per page_cgroup flag. But memcg
>> has a feature to move a page from a cgroup to another one and may have race between
>> "move" and "page stat accounting". So in order to avoid the race we have designed a
>> bigger lock:
>>
>>          mem_cgroup_begin_update_page_stat()
>>          modify page information        -->(a)
>>          mem_cgroup_update_page_stat()  -->(b)
>>          mem_cgroup_end_update_page_stat()
>> It requires (a) and (b)(dirty pages accounting) can stay close enough.
>> In the previous two prepare patches, we have reworked the vfs set page dirty routines
>> and now the interfaces are more explicit:
>>         incrementing (2):
>>                 __set_page_dirty
>>                 __set_page_dirty_nobuffers
>>         decrementing (2):
>>                 clear_page_dirty_for_io
>>                 cancel_dirty_page
>>
>> To prevent AB/BA deadlock mentioned by Greg Thelen in previous version
>> (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order:
>> ->private_lock --> mapping->tree_lock --> memcg->move_lock.
>> So we need to make mapping->tree_lock ahead of TestSetPageDirty in __set_page_dirty()
>> and __set_page_dirty_nobuffers(). But in order to avoiding useless spinlock contention,
>> a prepare PageDirty() checking is added.
>>
>>
>> Signed-off-by: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
>> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu-LdfC7J4mv27QFUHtdCDX3A@public.gmane.org>
>> Acked-by: Fengguang Wu <fengguang.wu-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
>> ---
>>  fs/buffer.c                |   14 +++++++++++++-
>>  include/linux/memcontrol.h |    1 +
>>  mm/filemap.c               |   10 ++++++++++
>>  mm/memcontrol.c            |   29 ++++++++++++++++++++++-------
>>  mm/page-writeback.c        |   39 ++++++++++++++++++++++++++++++++-------
>>  mm/truncate.c              |    6 ++++++
>>  6 files changed, 84 insertions(+), 15 deletions(-)
>
> __nilfs_clear_page_dirty() clears PageDirty, does it need modification
> for this patch series?

It doesn't need to do so.
mem_cgroup_dec/inc_page_stat() goes together with
dec/inc_zone_page_state() to account the memcg page stat.  IMHO we only
have to modify the places where SetPageDirty and
dec/inc_zone_page_state() occur together.
__nilfs_clear_page_dirty() will call clear_page_dirty_for_io(page)
later, which is where the accounting is done.

>> diff --git a/fs/buffer.c b/fs/buffer.c
>> index 762168a..53402d2 100644
>> --- a/fs/buffer.c
>> +++ b/fs/buffer.c
>> @@ -612,19 +612,31 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
>>  int __set_page_dirty(struct page *page,
>>               struct address_space *mapping, int warn)
>>  {
>> +     bool locked;
>> +     unsigned long flags;
>> +
>>       if (unlikely(!mapping))
>>               return !TestSetPageDirty(page);
>>
>> -     if (TestSetPageDirty(page))
>> +     if (PageDirty(page))
>>               return 0;
>>
>>       spin_lock_irq(&mapping->tree_lock);
>> +     mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>> +
>> +     if (TestSetPageDirty(page)) {
>> +             mem_cgroup_end_update_page_stat(page, &locked, &flags);
>> +             spin_unlock_irq(&mapping->tree_lock);
>> +             return 0;
>> +     }
>> +
>>       if (page->mapping) {    /* Race with truncate? */
>>               WARN_ON_ONCE(warn && !PageUptodate(page));
>>               account_page_dirtied(page, mapping);
>>               radix_tree_tag_set(&mapping->page_tree,
>>                               page_index(page), PAGECACHE_TAG_DIRTY);
>>       }
>> +     mem_cgroup_end_update_page_stat(page, &locked, &flags);
>>       spin_unlock_irq(&mapping->tree_lock);
>>       __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 5421b8a..2685d8a 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -44,6 +44,7 @@ enum mem_cgroup_stat_index {
>>       MEM_CGROUP_STAT_RSS,       /* # of pages charged as anon rss */
>>       MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
>>       MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
>> +     MEM_CGROUP_STAT_FILE_DIRTY,  /* # of dirty pages in page cache */
>>       MEM_CGROUP_STAT_NSTATS,
>>  };
>>
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index 83efee7..b589be5 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -62,6 +62,11 @@
>>   *      ->swap_lock          (exclusive_swap_page, others)
>>   *        ->mapping->tree_lock
>>   *
>> + *    ->private_lock         (__set_page_dirty_buffers)
>> + *      ->mapping->tree_lock
>> + *        ->memcg->move_lock (mem_cgroup_begin_update_page_stat->
>> + *                                                   move_lock_mem_cgroup)
>> + *
>>   *  ->i_mutex
>>   *    ->i_mmap_mutex         (truncate->unmap_mapping_range)
>>   *
>> @@ -112,6 +117,8 @@
>>  void __delete_from_page_cache(struct page *page)
>>  {
>>       struct address_space *mapping = page->mapping;
>> +     bool locked;
>> +     unsigned long flags;
>>
>>       /*
>>        * if we're uptodate, flush out into the cleancache, otherwise
>> @@ -139,10 +146,13 @@ void __delete_from_page_cache(struct page *page)
>>        * Fix it up by doing a final dirty accounting check after
>>        * having removed the page entirely.
>>        */
>> +     mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>>       if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
>> +             mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
>>               dec_zone_page_state(page, NR_FILE_DIRTY);
>>               dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
>>       }
>> +     mem_cgroup_end_update_page_stat(page, &locked, &flags);
>>  }
>>
>>  /**
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index d450c04..c884640 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -95,6 +95,7 @@ static const char * const mem_cgroup_stat_names[] = {
>>       "rss",
>>       "mapped_file",
>>       "swap",
>> +     "dirty",
>>  };
>>
>>  enum mem_cgroup_events_index {
>> @@ -3609,6 +3610,19 @@ void mem_cgroup_split_huge_fixup(struct page *head)
>>  }
>>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>>
>> +static inline
>> +void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
>> +                                     struct mem_cgroup *to,
>> +                                     unsigned int nr_pages,
>> +                                     enum mem_cgroup_stat_index idx)
>> +{
>> +     /* Update stat data for mem_cgroup */
>> +     preempt_disable();
>> +     __this_cpu_add(from->stat->count[idx], -nr_pages);
>
> What do you think about adding a WARN_ON_ONCE() here to check for
> underflow?  A check might help catch:
> a) unresolved races between move accounting and setting/clearing of the
>    dirty state.
> b) future modifications that mess with PageDirty/Writeback flags without
>    considering memcg.
To prevent the current memcg deadlock and lock nesting, I'm thinking
about another synchronization proposal for memcg page stat &
move_account. The counter may go negative for a very short period. I'm
not yet sure whether that's okay... maybe I'll send it out in another
thread later...

>
>> +     __this_cpu_add(to->stat->count[idx], nr_pages);
>> +     preempt_enable();
>> +}
>> +
>>  /**
>>   * mem_cgroup_move_account - move account of the page
>>   * @page: the page
>> @@ -3654,13 +3668,14 @@ static int mem_cgroup_move_account(struct page *page,
>>
>>       move_lock_mem_cgroup(from, &flags);
>>
>> -     if (!anon && page_mapped(page)) {
>> -             /* Update mapped_file data for mem_cgroup */
>> -             preempt_disable();
>> -             __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
>> -             __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
>> -             preempt_enable();
>> -     }
>> +     if (!anon && page_mapped(page))
>> +             mem_cgroup_move_account_page_stat(from, to, nr_pages,
>> +                     MEM_CGROUP_STAT_FILE_MAPPED);
>> +
>> +     if (PageDirty(page))
>
> Is (!anon && PageDirty(page)) better?  If dirty anon pages are moved
> between memcg M1 and M2 I think that we'd mistakenly underflow M1 if it
> was not previously accounting for the dirty anon page.
Yeah... A page can be PageAnon and PageDirty simultaneously, but we only
account dirty file pages. Will be updated in the next version.


-- 
Thanks,
Sha

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
       [not found]         ` <alpine.LNX.2.00.1301061135400.29149-fupSdm12i1nKWymIFiNcPA@public.gmane.org>
  2013-01-07  7:49           ` Kamezawa Hiroyuki
@ 2013-01-09 14:35           ` Sha Zhengju
       [not found]             ` <CAFj3OHVUx0bZyEGQU_CApVbgz7SrX3BQ+0U5fRV=En800wv+cQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 27+ messages in thread
From: Sha Zhengju @ 2013-01-09 14:35 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Michal Hocko, Johannes Weiner,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
	gthelen-hpIqsD4AKlfQT0dZR+AlfA,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w,
	glommer-bzQdu9zFT3WakBO8gow8eQ, dchinner-H+wXaHxf7aLQT0dZR+AlfA,
	Sha Zhengju

Hi Hugh,

On Mon, Jan 7, 2013 at 4:02 AM, Hugh Dickins <hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
> On Sat, 5 Jan 2013, Sha Zhengju wrote:
>> On Wed, Jan 2, 2013 at 6:44 PM, Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> wrote:
>> >
>> > Maybe I have missed some other locking which would prevent this from
>> > happening but the locking relations are really complicated in this area
>> > so if mem_cgroup_{begin,end}_update_page_stat might be called
>> > recursively then we need a fat comment which justifies that.
>> >
>>
>> Ohhh...good catch!  I didn't notice there is a recursive call of
>> mem_cgroup_{begin,end}_update_page_stat in page_remove_rmap().
>> The mem_cgroup_{begin,end}_update_page_stat() design has depressed
>> me a lot recently, as the lock granularity is a little bigger than I thought.
>> Not only the resources but also some code logic falls inside the locked
>> range, which may be deadlock prone. The problem will still exist if we try
>> to add stat accounting of other memcg pages later, so may I be so bold as
>> to suggest that we dig into the locking again...
>
> Forgive me, I must confess I'm no more than skimming this thread,
> and don't like dumping unsigned-off patches on people; but thought
> that on balance it might be more helpful than not if I offer you a
> patch I worked on around 3.6-rc2 (but have updated to 3.8-rc2 below).

Thanks for your interest in this matter! I really appreciate your work!

> I too was getting depressed by the constraints imposed by
> mem_cgroup_{begin,end}_update_page_stat (good job though Kamezawa-san
> did to minimize them), and wanted to replace by something freer, more
> RCU-like.  In the end it seemed more effort than it was worth to go
> as far as I wanted, but I do think that this is some improvement over
> what we currently have, and should deal with your recursion issue.
It took me some time to understand the patch. Yeah, it can solve my
recursion issue and also removes some locks (e.g. move_lock). But it does
have a side effect on the move end, which becomes slower. To my
knowledge, each task is forked in the root memcg, and a move happens
while attaching it to a cgroup. So move_account is also a fairly
frequent operation.
Some comments are below.

> But if this does appear useful to memcg people, then we really ought
> to get it checked over by locking/barrier experts before going further.
> I think myself that I've over-barriered it, and could use a little
> lighter; but they (Paul McKenney, Peter Zijlstra, Oleg Nesterov come
> to mind) will see more clearly, and may just hate the whole thing,
> as yet another peculiar lockdep-avoiding hand-crafted locking scheme.
> I've not wanted to waste their time on reviewing it, if it's not even
> going to be useful to memcg people.
>
> It may be easier to understand if you just apply the patch and look
> at the result in mm/memcontrol.c, where I tried to gather the pieces
> together in one place and describe them ("These functions mediate...").
>
> Hugh
>
>  include/linux/memcontrol.h |   39 +--
>  mm/memcontrol.c            |  375 +++++++++++++++++++++--------------
>  mm/rmap.c                  |   20 -
>  3 files changed, 257 insertions(+), 177 deletions(-)
>
> --- 3.8-rc2/include/linux/memcontrol.h  2012-12-22 09:43:27.172015571 -0800
> +++ linux/include/linux/memcontrol.h    2013-01-02 14:47:47.960394878 -0800
> @@ -136,32 +136,28 @@ static inline bool mem_cgroup_disabled(v
>         return false;
>  }
>
> -void __mem_cgroup_begin_update_page_stat(struct page *page, bool *locked,
> -                                        unsigned long *flags);
> -
> +void __mem_cgroup_begin_update_page_stat(struct page *page);
> +void __mem_cgroup_end_update_page_stat(void);
>  extern atomic_t memcg_moving;
>
>  static inline void mem_cgroup_begin_update_page_stat(struct page *page,
> -                                       bool *locked, unsigned long *flags)
> +                                                    bool *clamped)
>  {
> -       if (mem_cgroup_disabled())
> -               return;
> -       rcu_read_lock();
> -       *locked = false;
> -       if (atomic_read(&memcg_moving))
> -               __mem_cgroup_begin_update_page_stat(page, locked, flags);
> +       preempt_disable();

Referring to the synchronize_rcu() in mem_cgroup_begin_move(): is an
rcu_read_lock() missing here?
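I.e., just to make the question concrete (an untested sketch on top of
the hunk above; mem_cgroup_end_update_page_stat() would correspondingly
gain an rcu_read_unlock()):

	static inline void mem_cgroup_begin_update_page_stat(struct page *page,
							     bool *clamped)
	{
		/* pairs with synchronize_rcu() in mem_cgroup_begin_move() */
		rcu_read_lock();
		preempt_disable();
		*clamped = false;
		if (unlikely(atomic_read(&memcg_moving))) {
			__mem_cgroup_begin_update_page_stat(page);
			*clamped = true;
		}
	}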

> +       *clamped = false;
> +       if (unlikely(atomic_read(&memcg_moving))) {
> +               __mem_cgroup_begin_update_page_stat(page);
> +               *clamped = true;
> +       }
>  }
>
> -void __mem_cgroup_end_update_page_stat(struct page *page,
> -                               unsigned long *flags);
>  static inline void mem_cgroup_end_update_page_stat(struct page *page,
> -                                       bool *locked, unsigned long *flags)
> +                                                  bool *clamped)
>  {
> -       if (mem_cgroup_disabled())
> -               return;
> -       if (*locked)
> -               __mem_cgroup_end_update_page_stat(page, flags);
> -       rcu_read_unlock();
> +       /* We don't currently use the page arg, but keep it for symmetry */
> +       if (unlikely(*clamped))
> +               __mem_cgroup_end_update_page_stat();

Ditto. Need rcu_read_unlock()?

> +       preempt_enable();
>  }
>
>  void mem_cgroup_update_page_stat(struct page *page,
> @@ -345,13 +341,16 @@ mem_cgroup_print_oom_info(struct mem_cgr
>  }
>
>  static inline void mem_cgroup_begin_update_page_stat(struct page *page,
> -                                       bool *locked, unsigned long *flags)
> +                                                    bool *clamped)
>  {
> +       /* It may be helpful to our callers if the stub behaves the same way */
> +       preempt_disable();
>  }
>
>  static inline void mem_cgroup_end_update_page_stat(struct page *page,
> -                                       bool *locked, unsigned long *flags)
> +                                                  bool *clamped)
>  {
> +       preempt_enable();
>  }
>
>  static inline void mem_cgroup_inc_page_stat(struct page *page,
> --- 3.8-rc2/mm/memcontrol.c     2012-12-22 09:43:27.628015582 -0800
> +++ linux/mm/memcontrol.c       2013-01-02 14:55:36.268406008 -0800
> @@ -321,12 +321,7 @@ struct mem_cgroup {
>          * mem_cgroup ? And what type of charges should we move ?
>          */
>         unsigned long   move_charge_at_immigrate;
> -       /*
> -        * set > 0 if pages under this cgroup are moving to other cgroup.
> -        */
> -       atomic_t        moving_account;
> -       /* taken only while moving_account > 0 */
> -       spinlock_t      move_lock;
> +
>         /*
>          * percpu counter.
>          */
> @@ -1414,60 +1409,10 @@ int mem_cgroup_swappiness(struct mem_cgr
>  }
>
>  /*
> - * memcg->moving_account is used for checking possibility that some thread is
> - * calling move_account(). When a thread on CPU-A starts moving pages under
> - * a memcg, other threads should check memcg->moving_account under
> - * rcu_read_lock(), like this:
> - *
> - *         CPU-A                                    CPU-B
> - *                                              rcu_read_lock()
> - *         memcg->moving_account+1              if (memcg->mocing_account)
> - *                                                   take heavy locks.
> - *         synchronize_rcu()                    update something.
> - *                                              rcu_read_unlock()
> - *         start move here.
> - */
> -
> -/* for quick checking without looking up memcg */
> -atomic_t memcg_moving __read_mostly;
> -
> -static void mem_cgroup_start_move(struct mem_cgroup *memcg)
> -{
> -       atomic_inc(&memcg_moving);
> -       atomic_inc(&memcg->moving_account);
> -       synchronize_rcu();
> -}
> -
> -static void mem_cgroup_end_move(struct mem_cgroup *memcg)
> -{
> -       /*
> -        * Now, mem_cgroup_clear_mc() may call this function with NULL.
> -        * We check NULL in callee rather than caller.
> -        */
> -       if (memcg) {
> -               atomic_dec(&memcg_moving);
> -               atomic_dec(&memcg->moving_account);
> -       }
> -}
> -
> -/*
> - * 2 routines for checking "mem" is under move_account() or not.
> - *
> - * mem_cgroup_stolen() -  checking whether a cgroup is mc.from or not. This
> - *                       is used for avoiding races in accounting.  If true,
> - *                       pc->mem_cgroup may be overwritten.
> - *
>   * mem_cgroup_under_move() - checking a cgroup is mc.from or mc.to or
>   *                       under hierarchy of moving cgroups. This is for
> - *                       waiting at hith-memory prressure caused by "move".
> + *                       waiting at high memory pressure caused by "move".
>   */
> -
> -static bool mem_cgroup_stolen(struct mem_cgroup *memcg)
> -{
> -       VM_BUG_ON(!rcu_read_lock_held());
> -       return atomic_read(&memcg->moving_account) > 0;
> -}
> -
>  static bool mem_cgroup_under_move(struct mem_cgroup *memcg)
>  {
>         struct mem_cgroup *from;
> @@ -1506,24 +1451,6 @@ static bool mem_cgroup_wait_acct_move(st
>         return false;
>  }
>
> -/*
> - * Take this lock when
> - * - a code tries to modify page's memcg while it's USED.
> - * - a code tries to modify page state accounting in a memcg.
> - * see mem_cgroup_stolen(), too.
> - */
> -static void move_lock_mem_cgroup(struct mem_cgroup *memcg,
> -                                 unsigned long *flags)
> -{
> -       spin_lock_irqsave(&memcg->move_lock, *flags);
> -}
> -
> -static void move_unlock_mem_cgroup(struct mem_cgroup *memcg,
> -                               unsigned long *flags)
> -{
> -       spin_unlock_irqrestore(&memcg->move_lock, *flags);
> -}
> -
>  /**
>   * mem_cgroup_print_oom_info: Called from OOM with tasklist_lock held in read mode.
>   * @memcg: The memory cgroup that went over limit
> @@ -2096,75 +2023,215 @@ static bool mem_cgroup_handle_oom(struct
>  }
>
>  /*
> - * Currently used to update mapped file statistics, but the routine can be
> - * generalized to update other statistics as well.
> - *
> - * Notes: Race condition
> - *
> - * We usually use page_cgroup_lock() for accessing page_cgroup member but
> - * it tends to be costly. But considering some conditions, we doesn't need
> - * to do so _always_.
> - *
> - * Considering "charge", lock_page_cgroup() is not required because all
> - * file-stat operations happen after a page is attached to radix-tree. There
> - * are no race with "charge".
> - *
> - * Considering "uncharge", we know that memcg doesn't clear pc->mem_cgroup
> - * at "uncharge" intentionally. So, we always see valid pc->mem_cgroup even
> - * if there are race with "uncharge". Statistics itself is properly handled
> - * by flags.
> + * These functions mediate between the common case of updating memcg stats
> + * when a page transitions from one state to another, and the rare case of
> + * moving a page from one memcg to another.
> + *
> + * A simple example of the updater would be:
> + *     mem_cgroup_begin_update_page_stat(page);
> + *     if (TestClearPageFlag(page))
> + *             mem_cgroup_dec_page_stat(page, NR_FLAG_PAGES);
> + *     mem_cgroup_end_update_page_stat(page);
> + *
> + * An over-simplified example of the mover would be:
> + *     mem_cgroup_begin_move();
> + *     for each page chosen from old_memcg {
> + *             pc = lookup_page_cgroup(page);
> + *             lock_page_cgroup(pc);
> + *             if (trylock_memcg_move(page)) {
> + *                     if (PageFlag(page)) {
> + *                             mem_cgroup_dec_page_stat(page, NR_FLAG_PAGES);
> + *                             pc->mem_cgroup = new_memcg;
> + *                             mem_cgroup_inc_page_stat(page, NR_FLAG_PAGES);
> + *                     }
> + *                     unlock_memcg_move();
> + *                     unlock_page_cgroup(pc);
> + *             }
> + *             cond_resched();
> + *     }
> + *     mem_cgroup_end_move();
> + *
> + * Without some kind of serialization between updater and mover, the mover
> + * cannot know whether or not to move one count from old to new memcg stats;
> + * but the serialization must be as lightweight as possible for the updater.
> + *
> + * At present we use two layers of lock avoidance, then spinlock on memcg;
> + * but that already got into (easily avoided) lock hierarchy violation with
> + * the page_cgroup lock; and as dirty writeback stats are added, it gets
> + * into further difficulty with the page cache radix tree lock (and on s390
> + * architecture, page_remove_rmap calls set_page_dirty within its critical
> + * section: perhaps that can be reordered, but if not, it requires nesting).
> + *
> + * We need a mechanism more like rcu_read_lock() for the updater, who then
> + * does not have to worry about lock ordering.  The scheme below is not quite
> + * as light as that: rarely, the updater does have to spin waiting on a mover;
> + * and it is still best for updater to avoid taking page_cgroup lock in its
> + * critical section (though mover drops and retries if necessary, so there is
> + * no actual deadlock).  Testing on 4-way suggests 5% heavier for the mover.
> + */
> +
> +/*
> + * memcg_moving count is written in advance by movers,
> + * and read by updaters to see if they need to worry further.
> + */
> +atomic_t memcg_moving __read_mostly;
> +
> +/*
> + * Keep it simple: allow only one page to move at a time.  cgroup_mutex
> + * already serializes move_charge_at_immigrate movements, but not writes
> + * to memory.force_empty, nor move-pages-to-parent phase of cgroup rmdir.
>   *
> - * Considering "move", this is an only case we see a race. To make the race
> - * small, we check mm->moving_account and detect there are possibility of race
> - * If there is, we take a lock.
> + * memcg_moving_lock guards writes by movers to memcg_moving_page,
> + * which is read by updaters to see if they need to worry about their page.
> + */
> +static DEFINE_SPINLOCK(memcg_moving_lock);
> +static struct page *memcg_moving_page;
> +
> +/*
> + * updating_page_stat is written per-cpu by updaters,
> + * and all cpus read by mover to check when safe to proceed with the move.
>   */
> +static DEFINE_PER_CPU(int, updating_page_stat) = 0;
>
> -void __mem_cgroup_begin_update_page_stat(struct page *page,
> -                               bool *locked, unsigned long *flags)
> +/*
> + * Mover calls mem_cgroup_begin_move() before starting on its pages; its
> + * synchronize_rcu() ensures that all updaters will see memcg_moving in time.
> + */
> +static void mem_cgroup_begin_move(void)
>  {
> -       struct mem_cgroup *memcg;
> -       struct page_cgroup *pc;
> +       get_online_cpus();
> +       atomic_inc(&memcg_moving);
> +       synchronize_rcu();
> +}
> +
> +static void mem_cgroup_end_move(void)
> +{
> +       atomic_dec(&memcg_moving);
> +       put_online_cpus();
> +}
> +
> +/*
> + * Mover calls trylock_memcg_move(page) before moving stats and changing
> + * ownership of page.  If it fails, mover should drop page_cgroup lock and
> + * any other spinlocks held, cond_resched then try the page again.  This
> + * lets updaters take those locks if unavoidable, though preferably not.
> + */
> +static bool trylock_memcg_move(struct page *page)
> +{
> +       static struct cpumask updating;
> +       int try;
> +
> +       cpumask_copy(&updating, cpu_online_mask);
> +       spin_lock(&memcg_moving_lock);
By the time we reach here we already hold lock_page_cgroup(), which is
a much heavier lock. But memcg_moving_lock is a single global spinlock
that every memcg doing move_account() will try to take, so several
movers may end up spinning here while still holding their page_cgroup
locks. IMHO this may aggravate the performance decrease.
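For example (a rough sketch of the concern; M1 and M2 stand for two
memcgs being moved concurrently, say a task move racing with
force_empty or the rmdir reparenting, which the comment above says are
not all serialized by cgroup_mutex):

	CPU-A (moving a page of M1)        CPU-B (moving a page of M2)
	lock_page_cgroup(pc1)              lock_page_cgroup(pc2)
	trylock_memcg_move(page1)
	  spin_lock(&memcg_moving_lock)    trylock_memcg_move(page2)
	  polls updating_page_stat,          spin_lock(&memcg_moving_lock)
	  up to 64 rounds                    <- spins here, still holding
	                                        pc2's page_cgroup lock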

> +       memcg_moving_page = page;
>
> -       pc = lookup_page_cgroup(page);
> -again:
> -       memcg = pc->mem_cgroup;
> -       if (unlikely(!memcg || !PageCgroupUsed(pc)))
> -               return;
>         /*
> -        * If this memory cgroup is not under account moving, we don't
> -        * need to take move_lock_mem_cgroup(). Because we already hold
> -        * rcu_read_lock(), any calls to move_account will be delayed until
> -        * rcu_read_unlock() if mem_cgroup_stolen() == true.
> +        * Make sure that __mem_cgroup_begin_update_page_stat(page) can see
> +        * our memcg_moving_page before it commits to updating_page_stat.
>          */
> -       if (!mem_cgroup_stolen(memcg))
> -               return;
> +       smp_mb();
>
> -       move_lock_mem_cgroup(memcg, flags);
> -       if (memcg != pc->mem_cgroup || !PageCgroupUsed(pc)) {
> -               move_unlock_mem_cgroup(memcg, flags);
> -               goto again;
> +       for (try = 0; try < 64; try++) {
> +               int updaters = 0;
> +               int cpu;
> +
> +               for_each_cpu(cpu, &updating) {
> +                       if (ACCESS_ONCE(per_cpu(updating_page_stat, cpu)))
> +                               updaters++;
> +                       else
> +                               cpumask_clear_cpu(cpu, &updating);
> +               }
> +               if (!updaters)
> +                       return true;
>         }
> -       *locked = true;
> +
> +       memcg_moving_page = NULL;
Also need a smp_mb()?

> +       spin_unlock(&memcg_moving_lock);
> +       return false;
>  }
>
> -void __mem_cgroup_end_update_page_stat(struct page *page, unsigned long *flags)
> +static void unlock_memcg_move(void)
>  {
> -       struct page_cgroup *pc = lookup_page_cgroup(page);
> +       memcg_moving_page = NULL;
> +       spin_unlock(&memcg_moving_lock);
> +}
>
> -       /*
> -        * It's guaranteed that pc->mem_cgroup never changes while
> -        * lock is held because a routine modifies pc->mem_cgroup
> -        * should take move_lock_mem_cgroup().
> -        */
> -       move_unlock_mem_cgroup(pc->mem_cgroup, flags);
> +/*
> + * If memcg_moving, updater calls __mem_cgroup_begin_update_page_stat(page)
> + * (with preemption disabled) to indicate to the next mover that this cpu is
> + * updating a page, or to wait on the mover if it's already moving this page.
> + */
> +void __mem_cgroup_begin_update_page_stat(struct page *page)
> +{
> +       static const int probing = 0x10000;
> +       int updating;
> +
> +       __this_cpu_add(updating_page_stat, probing);
> +
> +       for (;;) {
> +               /*
> +                * Make sure that trylock_memcg_move(page) can see our
> +                * updating_page_stat before we check memcg_moving_page.
> +                *
> +                * We use the special probing value at first so move sees it,
> +                * but nesting and interrupts on this cpu can distinguish it.
> +                */
> +               smp_mb();
> +
> +               if (likely(page != ACCESS_ONCE(memcg_moving_page)))
> +                       break;
> +
> +               /*
> +                * We may be nested, we may be serving an interrupt: do not
> +                * hang here if the outer level already went beyond probing.
> +                */
> +               updating = __this_cpu_read(updating_page_stat);
> +               if (updating & (probing - 1))
> +                       break;
> +
> +               __this_cpu_write(updating_page_stat, 0);
> +               while (page == ACCESS_ONCE(memcg_moving_page))
> +                       cpu_relax();
> +               __this_cpu_write(updating_page_stat, updating);
> +       }
> +
> +       /* Add one to count and remove temporary probing value */
> +       __this_cpu_sub(updating_page_stat, probing - 1);
> +}
> +
> +void __mem_cgroup_end_update_page_stat(void)
> +{
> +       __this_cpu_dec(updating_page_stat);
> +}
> +
> +/*
> + * Static inline interfaces to the above in include/linux/memcontrol.h:
> + *
> +static inline void mem_cgroup_begin_update_page_stat(struct page *page,
> +                                                    bool *clamped)
> +{
> +       preempt_disable();
> +       *clamped = false;
> +       if (unlikely(atomic_read(&memcg_moving))) {
> +               __mem_cgroup_begin_update_page_stat(page);
> +               *clamped = true;
> +       }
>  }
>
> +static inline void mem_cgroup_end_update_page_stat(struct page *page,
> +                                                  bool *clamped)
> +{
> +       if (unlikely(*clamped))
> +               __mem_cgroup_end_update_page_stat();
> +       preempt_enable();
> +}
> + */
> +
>  void mem_cgroup_update_page_stat(struct page *page,
>                                  enum mem_cgroup_page_stat_item idx, int val)
>  {
>         struct mem_cgroup *memcg;
>         struct page_cgroup *pc = lookup_page_cgroup(page);
> -       unsigned long uninitialized_var(flags);
>
>         if (mem_cgroup_disabled())
>                 return;
> @@ -2181,7 +2248,8 @@ void mem_cgroup_update_page_stat(struct
>                 BUG();
>         }
>
> -       this_cpu_add(memcg->stat->count[idx], val);
> +       /* mem_cgroup_begin_update_page_stat() disabled preemption */
> +       __this_cpu_add(memcg->stat->count[idx], val);
>  }
>
>  /*
> @@ -3580,7 +3648,6 @@ static int mem_cgroup_move_account(struc
>                                    struct mem_cgroup *from,
>                                    struct mem_cgroup *to)
>  {
> -       unsigned long flags;
>         int ret;
>         bool anon = PageAnon(page);
>
> @@ -3602,21 +3669,21 @@ static int mem_cgroup_move_account(struc
>         if (!PageCgroupUsed(pc) || pc->mem_cgroup != from)
>                 goto unlock;
>
> -       move_lock_mem_cgroup(from, &flags);
> +       ret = -EAGAIN;
> +       if (!trylock_memcg_move(page))
> +               goto unlock;
>
>         if (!anon && page_mapped(page)) {
>                 /* Update mapped_file data for mem_cgroup */
> -               preempt_disable();
>                 __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
>                 __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> -               preempt_enable();
>         }
>         mem_cgroup_charge_statistics(from, anon, -nr_pages);
>
>         /* caller should have done css_get */
>         pc->mem_cgroup = to;
>         mem_cgroup_charge_statistics(to, anon, nr_pages);
> -       move_unlock_mem_cgroup(from, &flags);
> +       unlock_memcg_move();
>         ret = 0;
>  unlock:
>         unlock_page_cgroup(pc);
> @@ -3675,19 +3742,25 @@ static int mem_cgroup_move_parent(struct
>          */
>         if (!parent)
>                 parent = root_mem_cgroup;
> -
> +retry:
>         if (nr_pages > 1) {
>                 VM_BUG_ON(!PageTransHuge(page));
>                 flags = compound_lock_irqsave(page);
>         }
>
> -       ret = mem_cgroup_move_account(page, nr_pages,
> -                               pc, child, parent);
> -       if (!ret)
> -               __mem_cgroup_cancel_local_charge(child, nr_pages);
> +       ret = mem_cgroup_move_account(page, nr_pages, pc, child, parent);
>
>         if (nr_pages > 1)
>                 compound_unlock_irqrestore(page, flags);
> +
> +       if (ret == -EAGAIN) {
> +               cond_resched();
> +               goto retry;
> +       }
> +
> +       if (!ret)
> +               __mem_cgroup_cancel_local_charge(child, nr_pages);
> +
>         putback_lru_page(page);
>  put:
>         put_page(page);
> @@ -4685,7 +4758,7 @@ static void mem_cgroup_reparent_charges(
>                 /* This is for making all *used* pages to be on LRU. */
>                 lru_add_drain_all();
>                 drain_all_stock_sync(memcg);
> -               mem_cgroup_start_move(memcg);
> +               mem_cgroup_begin_move();
>                 for_each_node_state(node, N_MEMORY) {
>                         for (zid = 0; zid < MAX_NR_ZONES; zid++) {
>                                 enum lru_list lru;
> @@ -4695,7 +4768,7 @@ static void mem_cgroup_reparent_charges(
>                                 }
>                         }
>                 }
> -               mem_cgroup_end_move(memcg);
> +               mem_cgroup_end_move();
>                 memcg_oom_recover(memcg);
>                 cond_resched();
>
> @@ -6128,7 +6201,6 @@ mem_cgroup_css_alloc(struct cgroup *cont
>         atomic_set(&memcg->refcnt, 1);
>         memcg->move_charge_at_immigrate = 0;
>         mutex_init(&memcg->thresholds_lock);
> -       spin_lock_init(&memcg->move_lock);
>
>         error = memcg_init_kmem(memcg, &mem_cgroup_subsys);
>         if (error) {
> @@ -6521,7 +6593,8 @@ static void mem_cgroup_clear_mc(void)
>         mc.from = NULL;
>         mc.to = NULL;
>         spin_unlock(&mc.lock);
> -       mem_cgroup_end_move(from);
> +       if (from)
> +               mem_cgroup_end_move();
>  }
>
>  static int mem_cgroup_can_attach(struct cgroup *cgroup,
> @@ -6547,7 +6620,7 @@ static int mem_cgroup_can_attach(struct
>                         VM_BUG_ON(mc.precharge);
>                         VM_BUG_ON(mc.moved_charge);
>                         VM_BUG_ON(mc.moved_swap);
> -                       mem_cgroup_start_move(from);
> +                       mem_cgroup_begin_move();
>                         spin_lock(&mc.lock);
>                         mc.from = from;
>                         mc.to = memcg;
> @@ -6573,7 +6646,7 @@ static int mem_cgroup_move_charge_pte_ra
>                                 unsigned long addr, unsigned long end,
>                                 struct mm_walk *walk)
>  {
> -       int ret = 0;
> +       int ret;
>         struct vm_area_struct *vma = walk->private;
>         pte_t *pte;
>         spinlock_t *ptl;
> @@ -6592,6 +6665,8 @@ static int mem_cgroup_move_charge_pte_ra
>          *    to be unlocked in __split_huge_page_splitting(), where the main
>          *    part of thp split is not executed yet.
>          */
> +retry:
> +       ret = 0;
>         if (pmd_trans_huge_lock(pmd, vma) == 1) {
>                 if (mc.precharge < HPAGE_PMD_NR) {
>                         spin_unlock(&vma->vm_mm->page_table_lock);
> @@ -6602,8 +6677,9 @@ static int mem_cgroup_move_charge_pte_ra
>                         page = target.page;
>                         if (!isolate_lru_page(page)) {
>                                 pc = lookup_page_cgroup(page);
> -                               if (!mem_cgroup_move_account(page, HPAGE_PMD_NR,
> -                                                       pc, mc.from, mc.to)) {
> +                               ret = mem_cgroup_move_account(page,
> +                                           HPAGE_PMD_NR, pc, mc.from, mc.to);
> +                               if (!ret) {
>                                         mc.precharge -= HPAGE_PMD_NR;
>                                         mc.moved_charge += HPAGE_PMD_NR;
>                                 }
> @@ -6612,12 +6688,14 @@ static int mem_cgroup_move_charge_pte_ra
>                         put_page(page);
>                 }
>                 spin_unlock(&vma->vm_mm->page_table_lock);
> +               if (ret == -EAGAIN)
> +                       goto retry;
>                 return 0;
>         }
>
>         if (pmd_trans_unstable(pmd))
>                 return 0;
> -retry:
> +
>         pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>         for (; addr != end; addr += PAGE_SIZE) {
>                 pte_t ptent = *(pte++);
> @@ -6632,8 +6710,9 @@ retry:
>                         if (isolate_lru_page(page))
>                                 goto put;
>                         pc = lookup_page_cgroup(page);
> -                       if (!mem_cgroup_move_account(page, 1, pc,
> -                                                    mc.from, mc.to)) {
> +                       ret = mem_cgroup_move_account(page, 1, pc,
> +                                                     mc.from, mc.to);
> +                       if (!ret) {
>                                 mc.precharge--;
>                                 /* we uncharge from mc.from later. */
>                                 mc.moved_charge++;
> @@ -6653,11 +6732,15 @@ put:                    /* get_mctgt_type() gets the page
>                 default:
>                         break;
>                 }
> +               if (ret == -EAGAIN)
> +                       break;
>         }
>         pte_unmap_unlock(pte - 1, ptl);
>         cond_resched();
>
>         if (addr != end) {
> +               if (ret == -EAGAIN)
> +                       goto retry;
>                 /*
>                  * We have consumed all precharges we got in can_attach().
>                  * We try charge one by one, but don't do any additional
> --- 3.8-rc2/mm/rmap.c   2012-12-22 09:43:27.656015582 -0800
> +++ linux/mm/rmap.c     2013-01-02 15:03:46.100417650 -0800
> @@ -1107,15 +1107,14 @@ void page_add_new_anon_rmap(struct page
>   */
>  void page_add_file_rmap(struct page *page)
>  {
> -       bool locked;
> -       unsigned long flags;
> +       bool clamped;
>
> -       mem_cgroup_begin_update_page_stat(page, &locked, &flags);
> +       mem_cgroup_begin_update_page_stat(page, &clamped);
>         if (atomic_inc_and_test(&page->_mapcount)) {
>                 __inc_zone_page_state(page, NR_FILE_MAPPED);
>                 mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_MAPPED);
>         }
> -       mem_cgroup_end_update_page_stat(page, &locked, &flags);
> +       mem_cgroup_end_update_page_stat(page, &clamped);
>  }
>
>  /**
> @@ -1128,16 +1127,15 @@ void page_remove_rmap(struct page *page)
>  {
>         struct address_space *mapping = page_mapping(page);
>         bool anon = PageAnon(page);
> -       bool locked;
> -       unsigned long flags;
> +       bool uninitialized_var(clamped);
>
>         /*
>          * The anon case has no mem_cgroup page_stat to update; but may
> -        * uncharge_page() below, where the lock ordering can deadlock if
> -        * we hold the lock against page_stat move: so avoid it on anon.
> +        * uncharge_page() below, when holding page_cgroup lock might force
> +        * a page_stat move to back off temporarily: so avoid it on anon.
>          */
>         if (!anon)
> -               mem_cgroup_begin_update_page_stat(page, &locked, &flags);
> +               mem_cgroup_begin_update_page_stat(page, &clamped);
>
>         /* page still mapped by someone else? */
>         if (!atomic_add_negative(-1, &page->_mapcount))
> @@ -1182,7 +1180,7 @@ void page_remove_rmap(struct page *page)
>         } else {
>                 __dec_zone_page_state(page, NR_FILE_MAPPED);
>                 mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_MAPPED);
> -               mem_cgroup_end_update_page_stat(page, &locked, &flags);
> +               mem_cgroup_end_update_page_stat(page, &clamped);
>         }
>         if (unlikely(PageMlocked(page)))
>                 clear_page_mlock(page);
> @@ -1198,7 +1196,7 @@ void page_remove_rmap(struct page *page)
>         return;
>  out:
>         if (!anon)
> -               mem_cgroup_end_update_page_stat(page, &locked, &flags);
> +               mem_cgroup_end_update_page_stat(page, &clamped);
>  }
>
>  /*



-- 
Thanks,
Sha

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
       [not found]             ` <CAFj3OHVUx0bZyEGQU_CApVbgz7SrX3BQ+0U5fRV=En800wv+cQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-01-09 14:47               ` Michal Hocko
  0 siblings, 0 replies; 27+ messages in thread
From: Michal Hocko @ 2013-01-09 14:47 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: Hugh Dickins, Johannes Weiner,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
	gthelen-hpIqsD4AKlfQT0dZR+AlfA,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w,
	glommer-bzQdu9zFT3WakBO8gow8eQ, dchinner-H+wXaHxf7aLQT0dZR+AlfA,
	Sha Zhengju

On Wed 09-01-13 22:35:12, Sha Zhengju wrote:
[...]
> To my knowledge, each task is forked in root memcg, and there's a
> moving while attaching it to a cgroup. So move_account is also a
> frequent behavior to some extent.

Not really. Every fork/exec copies the current group (see
cgroup_fork()) so there is no moving on that path.
[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
       [not found]           ` <50EA7860.6030300-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
@ 2013-01-09 15:02             ` Sha Zhengju
       [not found]               ` <CAFj3OHXMgRG6u2YoM7y5WuPo2ZNA1yPmKRV29FYj9B6Wj_c6Lw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 27+ messages in thread
From: Sha Zhengju @ 2013-01-09 15:02 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: Michal Hocko, Hugh Dickins, Johannes Weiner,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	gthelen-hpIqsD4AKlfQT0dZR+AlfA,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w,
	glommer-bzQdu9zFT3WakBO8gow8eQ, dchinner-H+wXaHxf7aLQT0dZR+AlfA,
	Sha Zhengju

On Mon, Jan 7, 2013 at 3:25 PM, Kamezawa Hiroyuki
<kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org> wrote:
> (2013/01/05 13:48), Sha Zhengju wrote:
>>
>> On Wed, Jan 2, 2013 at 6:44 PM, Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> wrote:
>>>
>>> On Wed 26-12-12 01:26:07, Sha Zhengju wrote:
>>>>
>>>> From: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
>>>>
>>>> This patch adds memcg routines to count dirty pages, which allows memory
>>>> controller
>>>> to maintain an accurate view of the amount of its dirty memory and can
>>>> provide some
>>>> info for users while cgroup's direct reclaim is working.
>>>
>>>
>>> I guess you meant targeted resp. (hard/soft) limit reclaim here,
>>> right? It is true that this is direct reclaim but it is not clear to me
>>
>>
>> Yes, I meant memcg hard/soft reclaim here which is triggered directly
>> by allocation and is distinct from background kswapd reclaim (global).
>>
>>> why the usefulnes should be limitted to the reclaim for users. I would
>>> understand this if the users was in fact in-kernel users.
>>>
>>
>> One of the reasons I'm trying to accounting the dirty pages is to get a
>> more board overall view of memory usages because memcg hard/soft
>> reclaim may have effect on response time of user application.
>> Yeah, the beneficiary can be application administrator or kernel users.
>> :P
>>
>>> [...]
>>>>
>>>> To prevent AB/BA deadlock mentioned by Greg Thelen in previous version
>>>> (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order:
>>>> ->private_lock --> mapping->tree_lock --> memcg->move_lock.
>>>> So we need to make mapping->tree_lock ahead of TestSetPageDirty in
>>>> __set_page_dirty()
>>>> and __set_page_dirty_nobuffers(). But in order to avoiding useless
>>>> spinlock contention,
>>>> a prepare PageDirty() checking is added.
>>>
>>>
>>> But there is another AA deadlock here I believe.
>>> page_remove_rmap
>>>    mem_cgroup_begin_update_page_stat             <<< 1
>>>    set_page_dirty
>>>      __set_page_dirty_buffers
>>>        __set_page_dirty
>>>          mem_cgroup_begin_update_page_stat       <<< 2
>>>            move_lock_mem_cgroup
>>>              spin_lock_irqsave(&memcg->move_lock, *flags);
>>>
>>> mem_cgroup_begin_update_page_stat is not recursive wrt. locking AFAICS
>>> because we might race with the moving charges:
>>>          CPU0                                            CPU1
>>> page_remove_rmap
>>>                                                  mem_cgroup_can_attach
>>>    mem_cgroup_begin_update_page_stat (1)
>>>      rcu_read_lock
>>>                                                    mem_cgroup_start_move
>>>
>>> atomic_inc(&memcg_moving)
>>>
>>> atomic_inc(&memcg->moving_account)
>>>                                                      synchronize_rcu
>>>      __mem_cgroup_begin_update_page_stat
>>>        mem_cgroup_stolen <<< TRUE
>>>        move_lock_mem_cgroup
>>>    [...]
>>>          mem_cgroup_begin_update_page_stat (2)
>>>            __mem_cgroup_begin_update_page_stat
>>>              mem_cgroup_stolen     <<< still TRUE
>>>              move_lock_mem_cgroup  <<< DEADLOCK
>>>    [...]
>>>    mem_cgroup_end_update_page_stat
>>>      rcu_unlock
>>>                                                    # wake up from
>>> synchronize_rcu
>>>                                                  [...]
>>>                                                  mem_cgroup_move_task
>>>                                                    mem_cgroup_move_charge
>>>                                                      walk_page_range
>>>
>>> mem_cgroup_move_account
>>>
>>> move_lock_mem_cgroup
>>>
>>>
>>> Maybe I have missed some other locking which would prevent this from
>>> happening but the locking relations are really complicated in this area
>>> so if mem_cgroup_{begin,end}_update_page_stat might be called
>>> recursively then we need a fat comment which justifies that.
>>>
>>
>> Ohhh...good catching!  I didn't notice there is a recursive call of
>> mem_cgroup_{begin,end}_update_page_stat in page_remove_rmap().
>> The mem_cgroup_{begin,end}_update_page_stat() design has depressed
>> me a lot recently as the lock granularity is a little bigger than I
>> thought.
>> Not only the resource but also some code logic is in the range of locking
>> which may be deadlock prone. The problem still exists if we are trying to
>> add stat account of other memcg page later, may I make bold to suggest
>> that we dig into the lock again...
>>
>> But with regard to the current lock implementation, I doubt if we can we
>> can
>> account MEM_CGROUP_STAT_FILE_{MAPPED, DIRTY} in one breath and just
>> try to get move_lock once in the beginning. IMHO we can make
>> mem_cgroup_{begin,end}_update_page_stat() to recursive aware and what I'm
>> thinking now is changing memcg->move_lock to rw-spinlock from the
>> original spinlock:
>> mem_cgroup_{begin,end}_update_page_stat() try to get the read lock which
>> make it
>> reenterable and memcg moving task side try to get the write spinlock.
>> Then the race may be following:
>>
>>          CPU0                                            CPU1
>> page_remove_rmap
>>                                                  mem_cgroup_can_attach
>>    mem_cgroup_begin_update_page_stat (1)
>>      rcu_read_lock
>>                                                    mem_cgroup_start_move
>>
>> atomic_inc(&memcg_moving)
>>
>> atomic_inc(&memcg->moving_account)
>>                                                      synchronize_rcu
>>      __mem_cgroup_begin_update_page_stat
>>        mem_cgroup_stolen   <<< TRUE
>>        move_lock_mem_cgroup   <<<< read-spinlock success
>>    [...]
>>       mem_cgroup_begin_update_page_stat (2)
>>            __mem_cgroup_begin_update_page_stat
>>              mem_cgroup_stolen     <<< still TRUE
>>              move_lock_mem_cgroup  <<<< read-spinlock success
>>
>>    [...]
>>    mem_cgroup_end_update_page_stat     <<< locked = true, unlock
>>      rcu_unlock
>>                                                    # wake up from
>> synchronize_rcu
>>                                                  [...]
>>                                                  mem_cgroup_move_task
>>                                                    mem_cgroup_move_charge
>>                                                      walk_page_range
>>
>> mem_cgroup_move_account
>>
>> move_lock_mem_cgroup    <<< write-spinlock
>>
>>
>> AFAICS, the deadlock seems to be avoided by both the rcu and rwlock.
>> Is there anything I lost?
>>
>
> rwlock will work with the nest but it seems ugly do updates under read-lock.
>
> How about this straightforward ?
> ==
> /*
>  * Once a thread takes memcg_move_lock() on a memcg, it can take the lock on
>  * the memcg again for nesting calls
>  */
> static void move_lock_mem_cgroup(memcg, flags);
> {
>         current->memcg_move_lock_nested += 1;
>         if (current->memcg_move_lock_nested > 1) {
>                 VM_BUG_ON(current->move_locked_memcg != memcg);
>                 return;
>         }
>         spin_lock_irqsave(&memcg_move_lock, &flags);
>         current->move_locked_memcg = memcg;
> }
>
> static void move_unlock_mem_cgroup(memcg, flags)
> {
>         current->memcg_move_lock_nested -= 1;
>         if (!current->memcg_move_lock_nested) {
>                 current->move_locked_memcg = NULL;
>                 spin_unlock_irqrestore(&memcg_move_lock,flags);
>         }
> }
>
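If I read it right, the nested path Michal pointed out would then go
roughly like this (just a sketch of the control flow, assuming a move
is in progress so the slow path is taken):

	page_remove_rmap()
	  mem_cgroup_begin_update_page_stat()
	    move_lock_mem_cgroup(memcg, flags)     /* nested = 1, takes memcg_move_lock */
	  set_page_dirty()
	    __set_page_dirty_buffers()
	      __set_page_dirty()
	        mem_cgroup_begin_update_page_stat()
	          move_lock_mem_cgroup(memcg, flags)   /* nested = 2, returns early */
	          ...
	        mem_cgroup_end_update_page_stat()
	          move_unlock_mem_cgroup(memcg, flags) /* nested = 1 */
	  ...
	  mem_cgroup_end_update_page_stat()
	    move_unlock_mem_cgroup(memcg, flags)       /* nested = 0, releases the lock */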
Do we need to add two fields (current->memcg_move_lock_nested /
current->move_locked_memcg) to 'struct task_struct'? Is that feasible?

Now I'm thinking about another synchronization proposal for the memcg
page stat updater and move_account(), which seems to handle both the
recursion issue and the deadlock:

             CPU A                                  CPU B

  move_lock_mem_cgroup()
  old_memcg = pc->mem_cgroup
  TestSetPageDirty(page)
  move_unlock_mem_cgroup()
                                           move_lock_mem_cgroup()
                                           if (PageDirty(page)) {
                                                   old_memcg->nr_dirty --
                                                   new_memcg->nr_dirty ++
                                           }
                                           pc->mem_cgroup = new_memcg
                                           move_unlock_mem_cgroup()

  old_memcg->nr_dirty ++


So old_memcg->nr_dirty may go negative for a very short period (due to
the 'old_memcg->nr_dirty --' on CPU B), but it is corrected soon
afterwards by CPU A, and the final nr_dirty figures are correct.
Meanwhile the move_lock only protects saving old_memcg and
TestSetPageDirty() in its critical section, with no unrelated logic
inside, so the lock ordering and deadlock issues are easy to handle.
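In (rough, untested) code, the updater side I have in mind would look
something like this, taking the dirty counter from this series as the
example (names just follow the diagram above):

	/* CPU A: the updater, e.g. in __set_page_dirty(); sketch only */
	move_lock_mem_cgroup();
	old_memcg = pc->mem_cgroup;		/* snapshot the current owner */
	newly_dirty = TestSetPageDirty(page);
	move_unlock_mem_cgroup();

	if (newly_dirty)
		/* outside the lock: a racing move may already have done
		 * old_memcg->nr_dirty-- and new_memcg->nr_dirty++ */
		this_cpu_inc(old_memcg->stat->count[MEM_CGROUP_STAT_FILE_DIRTY]);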

But I'm not sure whether I've missed some race conditions; any comments
are welcome. :)


--
Thanks,
Sha

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
       [not found]               ` <CAFj3OHXMgRG6u2YoM7y5WuPo2ZNA1yPmKRV29FYj9B6Wj_c6Lw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-01-10  2:16                 ` Kamezawa Hiroyuki
  2013-01-10  4:26                   ` Sha Zhengju
  0 siblings, 1 reply; 27+ messages in thread
From: Kamezawa Hiroyuki @ 2013-01-10  2:16 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: Michal Hocko, Hugh Dickins, Johannes Weiner,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	gthelen-hpIqsD4AKlfQT0dZR+AlfA,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w,
	glommer-bzQdu9zFT3WakBO8gow8eQ, dchinner-H+wXaHxf7aLQT0dZR+AlfA,
	Sha Zhengju

(2013/01/10 0:02), Sha Zhengju wrote:
> On Mon, Jan 7, 2013 at 3:25 PM, Kamezawa Hiroyuki
> <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org> wrote:
>> (2013/01/05 13:48), Sha Zhengju wrote:
>>>
>>> On Wed, Jan 2, 2013 at 6:44 PM, Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> wrote:
>>>>
>>>> On Wed 26-12-12 01:26:07, Sha Zhengju wrote:
>>>>>
>>>>> From: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
>>>>>
>>>>> This patch adds memcg routines to count dirty pages, which allows memory
>>>>> controller
>>>>> to maintain an accurate view of the amount of its dirty memory and can
>>>>> provide some
>>>>> info for users while cgroup's direct reclaim is working.
>>>>
>>>>
>>>> I guess you meant targeted resp. (hard/soft) limit reclaim here,
>>>> right? It is true that this is direct reclaim but it is not clear to me
>>>
>>>
>>> Yes, I meant memcg hard/soft reclaim here which is triggered directly
>>> by allocation and is distinct from background kswapd reclaim (global).
>>>
>>>> why the usefulnes should be limitted to the reclaim for users. I would
>>>> understand this if the users was in fact in-kernel users.
>>>>
>>>
>>> One of the reasons I'm trying to accounting the dirty pages is to get a
>>> more board overall view of memory usages because memcg hard/soft
>>> reclaim may have effect on response time of user application.
>>> Yeah, the beneficiary can be application administrator or kernel users.
>>> :P
>>>
>>>> [...]
>>>>>
>>>>> To prevent AB/BA deadlock mentioned by Greg Thelen in previous version
>>>>> (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order:
>>>>> ->private_lock --> mapping->tree_lock --> memcg->move_lock.
>>>>> So we need to make mapping->tree_lock ahead of TestSetPageDirty in
>>>>> __set_page_dirty()
>>>>> and __set_page_dirty_nobuffers(). But in order to avoiding useless
>>>>> spinlock contention,
>>>>> a prepare PageDirty() checking is added.
>>>>
>>>>
>>>> But there is another AA deadlock here I believe.
>>>> page_remove_rmap
>>>>     mem_cgroup_begin_update_page_stat             <<< 1
>>>>     set_page_dirty
>>>>       __set_page_dirty_buffers
>>>>         __set_page_dirty
>>>>           mem_cgroup_begin_update_page_stat       <<< 2
>>>>             move_lock_mem_cgroup
>>>>               spin_lock_irqsave(&memcg->move_lock, *flags);
>>>>
>>>> mem_cgroup_begin_update_page_stat is not recursive wrt. locking AFAICS
>>>> because we might race with the moving charges:
>>>>           CPU0                                            CPU1
>>>> page_remove_rmap
>>>>                                                   mem_cgroup_can_attach
>>>>     mem_cgroup_begin_update_page_stat (1)
>>>>       rcu_read_lock
>>>>                                                     mem_cgroup_start_move
>>>>
>>>> atomic_inc(&memcg_moving)
>>>>
>>>> atomic_inc(&memcg->moving_account)
>>>>                                                       synchronize_rcu
>>>>       __mem_cgroup_begin_update_page_stat
>>>>         mem_cgroup_stolen <<< TRUE
>>>>         move_lock_mem_cgroup
>>>>     [...]
>>>>           mem_cgroup_begin_update_page_stat (2)
>>>>             __mem_cgroup_begin_update_page_stat
>>>>               mem_cgroup_stolen     <<< still TRUE
>>>>               move_lock_mem_cgroup  <<< DEADLOCK
>>>>     [...]
>>>>     mem_cgroup_end_update_page_stat
>>>>       rcu_unlock
>>>>                                                     # wake up from
>>>> synchronize_rcu
>>>>                                                   [...]
>>>>                                                   mem_cgroup_move_task
>>>>                                                     mem_cgroup_move_charge
>>>>                                                       walk_page_range
>>>>
>>>> mem_cgroup_move_account
>>>>
>>>> move_lock_mem_cgroup
>>>>
>>>>
>>>> Maybe I have missed some other locking which would prevent this from
>>>> happening but the locking relations are really complicated in this area
>>>> so if mem_cgroup_{begin,end}_update_page_stat might be called
>>>> recursively then we need a fat comment which justifies that.
>>>>
>>>
>>> Ohhh...good catching!  I didn't notice there is a recursive call of
>>> mem_cgroup_{begin,end}_update_page_stat in page_remove_rmap().
>>> The mem_cgroup_{begin,end}_update_page_stat() design has depressed
>>> me a lot recently as the lock granularity is a little bigger than I
>>> thought.
>>> Not only the resource but also some code logic is in the range of locking
>>> which may be deadlock prone. The problem still exists if we are trying to
>>> add stat account of other memcg page later, may I make bold to suggest
>>> that we dig into the lock again...
>>>
>>> But with regard to the current lock implementation, I doubt if we can we
>>> can
>>> account MEM_CGROUP_STAT_FILE_{MAPPED, DIRTY} in one breath and just
>>> try to get move_lock once in the beginning. IMHO we can make
>>> mem_cgroup_{begin,end}_update_page_stat() to recursive aware and what I'm
>>> thinking now is changing memcg->move_lock to rw-spinlock from the
>>> original spinlock:
>>> mem_cgroup_{begin,end}_update_page_stat() try to get the read lock which
>>> make it
>>> reenterable and memcg moving task side try to get the write spinlock.
>>> Then the race may be following:
>>>
>>>           CPU0                                            CPU1
>>> page_remove_rmap
>>>                                                   mem_cgroup_can_attach
>>>     mem_cgroup_begin_update_page_stat (1)
>>>       rcu_read_lock
>>>                                                     mem_cgroup_start_move
>>>
>>> atomic_inc(&memcg_moving)
>>>
>>> atomic_inc(&memcg->moving_account)
>>>                                                       synchronize_rcu
>>>       __mem_cgroup_begin_update_page_stat
>>>         mem_cgroup_stolen   <<< TRUE
>>>         move_lock_mem_cgroup   <<<< read-spinlock success
>>>     [...]
>>>        mem_cgroup_begin_update_page_stat (2)
>>>             __mem_cgroup_begin_update_page_stat
>>>               mem_cgroup_stolen     <<< still TRUE
>>>               move_lock_mem_cgroup  <<<< read-spinlock success
>>>
>>>     [...]
>>>     mem_cgroup_end_update_page_stat     <<< locked = true, unlock
>>>       rcu_unlock
>>>                                                     # wake up from
>>> synchronize_rcu
>>>                                                   [...]
>>>                                                   mem_cgroup_move_task
>>>                                                     mem_cgroup_move_charge
>>>                                                       walk_page_range
>>>
>>> mem_cgroup_move_account
>>>
>>> move_lock_mem_cgroup    <<< write-spinlock
>>>
>>>
>>> AFAICS, the deadlock seems to be avoided by both the rcu and rwlock.
>>> Is there anything I lost?
>>>
>>
>> rwlock will work with the nest but it seems ugly do updates under read-lock.
>>
>> How about this straightforward ?
>> ==
>> /*
>>   * Once a thread takes memcg_move_lock() on a memcg, it can take the lock on
>>   * the memcg again for nesting calls
>>   */
>> static void move_lock_mem_cgroup(memcg, flags);
>> {
>>          current->memcg_move_lock_nested += 1;
>>          if (current->memcg_move_lock_nested > 1) {
>>                  VM_BUG_ON(current->move_locked_memcg != memcg);
>>                  return;
>>          }
>>          spin_lock_irqsave(&memcg_move_lock, &flags);
>>          current->move_locked_memcg = memcg;
>> }
>>
>> static void move_unlock_mem_cgroup(memcg, flags)
>> {
>>          current->memcg_move_lock_nested -= 1;
>>          if (!current->memcg_move_lock_nested) {
>>                  current->move_locked_memcg = NULL;
>>                  spin_unlock_irqrestore(&memcg_move_lock,flags);
>>          }
>> }
>>
> Does we need to add two
> fields(current->memcg_move_lock_nested/move_locked_memcg) to 'struct
> task'? Is it feasible?
>
> Now I'm thinking about another synchronization proposal for memcg page
> stat updater and move_account, which seems to deal with recursion
> issue and deadlock:
>
>               CPU A                                               CPU B
>
>    move_lock_mem_cgroup
>    old_memcg = pc->mem_cgroup
>    TestSetPageDirty(page)
>    move_unlock_mem_cgroup
>                                                           move_lock_mem_cgroup
>                                                           if (PageDirty)
>
> old_memcg->nr_dirty --
>
> new_memcg->nr_dirty ++
>
> pc->mem_cgroup = new_memcgy
>                                                           move_unlock_mem_cgroup
>
>    old_memcg->nr_dirty ++
>

I'm sorry, I couldn't catch why you call TestSetPageDirty()... and what CPU A / CPU B
are doing. Does CPU A call move_account() and CPU B update the stat? If so, why is
move_account() allowed to set PG_dirty??


>
> So nr_dirty of old_memcg may be minus in a very short
> period('old_memcg->nr_dirty --' by CPU B), but it will be revised soon
> by CPU A. And the final figures of memcg->nr_dirty is correct.

It seems that both new_memcg and old_memcg have an account for the page.
Is that correct?


> Meanwhile the move_lock only protect saving old_memcg and
> TestSetPageDirty in its critical section and without any irrelevant
> logic, so the lock order or deadlock can be handled easily.
>
> But I'm not sure whether I've lost some race conditions, any comments
> are welcomed. : )
>

Sorry I couldn't understand.

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
  2013-01-10  2:16                 ` Kamezawa Hiroyuki
@ 2013-01-10  4:26                   ` Sha Zhengju
       [not found]                     ` <CAFj3OHW=n22veXzR27qfc+10t-nETU=B78NULPXrEDT1S-KsOw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 27+ messages in thread
From: Sha Zhengju @ 2013-01-10  4:26 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: Michal Hocko, Hugh Dickins, Johannes Weiner, linux-kernel,
	cgroups, linux-mm, linux-fsdevel, akpm, gthelen, fengguang.wu,
	glommer, dchinner, Sha Zhengju

On Thu, Jan 10, 2013 at 10:16 AM, Kamezawa Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> (2013/01/10 0:02), Sha Zhengju wrote:
>>
>> On Mon, Jan 7, 2013 at 3:25 PM, Kamezawa Hiroyuki
>> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>>>
>>> (2013/01/05 13:48), Sha Zhengju wrote:
>>>>
>>>>
>>>> On Wed, Jan 2, 2013 at 6:44 PM, Michal Hocko <mhocko@suse.cz> wrote:
>>>>>
>>>>>
>>>>> On Wed 26-12-12 01:26:07, Sha Zhengju wrote:
>>>>>>
>>>>>>
>>>>>> From: Sha Zhengju <handai.szj@taobao.com>
>>>>>>
>>>>>> This patch adds memcg routines to count dirty pages, which allows
>>>>>> memory
>>>>>> controller
>>>>>> to maintain an accurate view of the amount of its dirty memory and can
>>>>>> provide some
>>>>>> info for users while cgroup's direct reclaim is working.
>>>>>
>>>>>
>>>>>
>>>>> I guess you meant targeted resp. (hard/soft) limit reclaim here,
>>>>> right? It is true that this is direct reclaim but it is not clear to me
>>>>
>>>>
>>>>
>>>> Yes, I meant memcg hard/soft reclaim here which is triggered directly
>>>> by allocation and is distinct from background kswapd reclaim (global).
>>>>
>>>>> why the usefulnes should be limitted to the reclaim for users. I would
>>>>> understand this if the users was in fact in-kernel users.
>>>>>
>>>>
>>>> One of the reasons I'm trying to accounting the dirty pages is to get a
>>>> more board overall view of memory usages because memcg hard/soft
>>>> reclaim may have effect on response time of user application.
>>>> Yeah, the beneficiary can be application administrator or kernel users.
>>>> :P
>>>>
>>>>> [...]
>>>>>>
>>>>>>
>>>>>> To prevent AB/BA deadlock mentioned by Greg Thelen in previous version
>>>>>> (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order:
>>>>>> ->private_lock --> mapping->tree_lock --> memcg->move_lock.
>>>>>> So we need to make mapping->tree_lock ahead of TestSetPageDirty in
>>>>>> __set_page_dirty()
>>>>>> and __set_page_dirty_nobuffers(). But in order to avoiding useless
>>>>>> spinlock contention,
>>>>>> a prepare PageDirty() checking is added.
>>>>>
>>>>>
>>>>>
>>>>> But there is another AA deadlock here I believe.
>>>>> page_remove_rmap
>>>>>     mem_cgroup_begin_update_page_stat             <<< 1
>>>>>     set_page_dirty
>>>>>       __set_page_dirty_buffers
>>>>>         __set_page_dirty
>>>>>           mem_cgroup_begin_update_page_stat       <<< 2
>>>>>             move_lock_mem_cgroup
>>>>>               spin_lock_irqsave(&memcg->move_lock, *flags);
>>>>>
>>>>> mem_cgroup_begin_update_page_stat is not recursive wrt. locking AFAICS
>>>>> because we might race with the moving charges:
>>>>>           CPU0                                            CPU1
>>>>> page_remove_rmap
>>>>>                                                   mem_cgroup_can_attach
>>>>>     mem_cgroup_begin_update_page_stat (1)
>>>>>       rcu_read_lock
>>>>>
>>>>> mem_cgroup_start_move
>>>>>
>>>>> atomic_inc(&memcg_moving)
>>>>>
>>>>> atomic_inc(&memcg->moving_account)
>>>>>                                                       synchronize_rcu
>>>>>       __mem_cgroup_begin_update_page_stat
>>>>>         mem_cgroup_stolen <<< TRUE
>>>>>         move_lock_mem_cgroup
>>>>>     [...]
>>>>>           mem_cgroup_begin_update_page_stat (2)
>>>>>             __mem_cgroup_begin_update_page_stat
>>>>>               mem_cgroup_stolen     <<< still TRUE
>>>>>               move_lock_mem_cgroup  <<< DEADLOCK
>>>>>     [...]
>>>>>     mem_cgroup_end_update_page_stat
>>>>>       rcu_unlock
>>>>>                                                     # wake up from
>>>>> synchronize_rcu
>>>>>                                                   [...]
>>>>>                                                   mem_cgroup_move_task
>>>>>
>>>>> mem_cgroup_move_charge
>>>>>                                                       walk_page_range
>>>>>
>>>>> mem_cgroup_move_account
>>>>>
>>>>> move_lock_mem_cgroup
>>>>>
>>>>>
>>>>> Maybe I have missed some other locking which would prevent this from
>>>>> happening but the locking relations are really complicated in this area
>>>>> so if mem_cgroup_{begin,end}_update_page_stat might be called
>>>>> recursively then we need a fat comment which justifies that.
>>>>>
>>>>
>>>> Ohhh...good catching!  I didn't notice there is a recursive call of
>>>> mem_cgroup_{begin,end}_update_page_stat in page_remove_rmap().
>>>> The mem_cgroup_{begin,end}_update_page_stat() design has depressed
>>>> me a lot recently as the lock granularity is a little bigger than I
>>>> thought.
>>>> Not only the resource but also some code logic is in the range of
>>>> locking
>>>> which may be deadlock prone. The problem still exists if we are trying
>>>> to
>>>> add stat account of other memcg page later, may I make bold to suggest
>>>> that we dig into the lock again...
>>>>
>>>> But with regard to the current lock implementation, I doubt if we can we
>>>> can
>>>> account MEM_CGROUP_STAT_FILE_{MAPPED, DIRTY} in one breath and just
>>>> try to get move_lock once in the beginning. IMHO we can make
>>>> mem_cgroup_{begin,end}_update_page_stat() to recursive aware and what
>>>> I'm
>>>> thinking now is changing memcg->move_lock to rw-spinlock from the
>>>> original spinlock:
>>>> mem_cgroup_{begin,end}_update_page_stat() try to get the read lock which
>>>> make it
>>>> reenterable and memcg moving task side try to get the write spinlock.
>>>> Then the race may be following:
>>>>
>>>>           CPU0                                            CPU1
>>>> page_remove_rmap
>>>>                                                   mem_cgroup_can_attach
>>>>     mem_cgroup_begin_update_page_stat (1)
>>>>       rcu_read_lock
>>>>
>>>> mem_cgroup_start_move
>>>>
>>>> atomic_inc(&memcg_moving)
>>>>
>>>> atomic_inc(&memcg->moving_account)
>>>>                                                       synchronize_rcu
>>>>       __mem_cgroup_begin_update_page_stat
>>>>         mem_cgroup_stolen   <<< TRUE
>>>>         move_lock_mem_cgroup   <<<< read-spinlock success
>>>>     [...]
>>>>        mem_cgroup_begin_update_page_stat (2)
>>>>             __mem_cgroup_begin_update_page_stat
>>>>               mem_cgroup_stolen     <<< still TRUE
>>>>               move_lock_mem_cgroup  <<<< read-spinlock success
>>>>
>>>>     [...]
>>>>     mem_cgroup_end_update_page_stat     <<< locked = true, unlock
>>>>       rcu_unlock
>>>>                                                     # wake up from
>>>> synchronize_rcu
>>>>                                                   [...]
>>>>                                                   mem_cgroup_move_task
>>>>
>>>> mem_cgroup_move_charge
>>>>                                                       walk_page_range
>>>>
>>>> mem_cgroup_move_account
>>>>
>>>> move_lock_mem_cgroup    <<< write-spinlock
>>>>
>>>>
>>>> AFAICS, the deadlock seems to be avoided by both the rcu and rwlock.
>>>> Is there anything I lost?
>>>>
>>>
>>> rwlock will work with the nest but it seems ugly do updates under
>>> read-lock.
>>>
>>> How about this straightforward ?
>>> ==
>>> /*
>>>   * Once a thread takes memcg_move_lock() on a memcg, it can take the
>>> lock on
>>>   * the memcg again for nesting calls
>>>   */
>>> static void move_lock_mem_cgroup(memcg, flags);
>>> {
>>>          current->memcg_move_lock_nested += 1;
>>>          if (current->memcg_move_lock_nested > 1) {
>>>                  VM_BUG_ON(current->move_locked_memcg != memcg);
>>>                  return;
>>>          }
>>>          spin_lock_irqsave(&memcg_move_lock, &flags);
>>>          current->move_locked_memcg = memcg;
>>> }
>>>
>>> static void move_unlock_mem_cgroup(memcg, flags)
>>> {
>>>          current->memcg_move_lock_nested -= 1;
>>>          if (!current->memcg_move_lock_nested) {
>>>                  current->move_locked_memcg = NULL;
>>>                  spin_unlock_irqrestore(&memcg_move_lock,flags);
>>>          }
>>> }
>>>
>> Do we need to add two fields
>> (current->memcg_move_lock_nested / current->move_locked_memcg) to
>> 'struct task_struct'? Is that feasible?
>>
>> Now I'm thinking about another synchronization proposal for the memcg
>> page stat updater and move_account, which seems to deal with both the
>> recursion issue and the deadlock:
>>
>>               CPU A                               CPU B
>>
>>    move_lock_mem_cgroup
>>    old_memcg = pc->mem_cgroup
>>    TestSetPageDirty(page)
>>    move_unlock_mem_cgroup
>>                                                   move_lock_mem_cgroup
>>                                                   if (PageDirty)
>>                                                           old_memcg->nr_dirty --
>>                                                           new_memcg->nr_dirty ++
>>                                                   pc->mem_cgroup = new_memcg
>>                                                   move_unlock_mem_cgroup
>>
>>    old_memcg->nr_dirty ++
>>
>
> I'm sorry, I couldn't catch why you call TestSetPageDirty()....and what
> CPU A/CPU B are doing.  CPU A calls move_account() and CPU B updates the
> stat?  If so, why is move_account() allowed to set PG_dirty??
>

Sorry, the layout above was a mess and confusing...
From the beginning: after removing duplicated information like the PCG_*
flags in 'struct page_cgroup' (commit 2ff76f1193), there's a problem
between "move" and "page stat accounting".
Assume CPU-A does "page stat accounting" and CPU-B does "move":

CPU-A                        CPU-B
TestSet PG_dirty
(delay)                 move_lock_mem_cgroup()
                            if (PageDirty(page)) {
                                  old_memcg->nr_dirty --
                                  new_memcg->nr_dirty++
                            }
                            pc->mem_cgroup = new_memcg;
                            move_unlock_mem_cgroup()

move_lock_mem_cgroup()
memcg = pc->mem_cgroup
memcg->nr_dirty++
move_unlock_mem_cgroup()

Here the accounting information of new_memcg may be double-counted. So a
bigger lock is used to solve this problem (commit 89c06bd52f):
         move_lock_mem_cgroup()
         TestSetPageDirty(page)
         update page stats (without any checks)
         move_unlock_mem_cgroup()
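
For reference, a caller under that scheme would look roughly like the sketch
below. mem_cgroup_{begin,end}_update_page_stat() and
mem_cgroup_inc_page_stat() are the helpers already used for FILE_MAPPED in
rmap; the MEM_CGROUP_STAT_FILE_DIRTY index is what this series would add,
and the wrapper name is only a placeholder of mine, not code from the series:
==
/* rough sketch only; assumes the usual mm headers */
static void memcg_account_page_dirty(struct page *page)
{
	bool locked;
	unsigned long flags;

	/* takes memcg->move_lock only while move_account is in flight */
	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
	if (!TestSetPageDirty(page))
		mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
	mem_cgroup_end_update_page_stat(page, &locked, &flags);
}
==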

But this method also has its downsides (e.g. it needs lock nesting). So
I wonder whether the following is able to deal with these issues all
together:
(CPU-A does "page stat accounting" and CPU-B does "move")

             CPU-A                            CPU-B

move_lock_mem_cgroup()
memcg = pc->mem_cgroup
SetPageDirty(page)
move_unlock_mem_cgroup()
                                      move_lock_mem_cgroup()
                                      if (PageDirty) {
                                               old_memcg->nr_dirty --;
                                               new_memcg->nr_dirty ++;
                                       }
                                       pc->mem_cgroup = new_memcg
                                       move_unlock_mem_cgroup()

memcg->nr_dirty ++


For CPU-A, we save pc->mem_cgroup in a temporary variable just before
SetPageDirty inside move_lock, and then update the stats if the page was
newly marked PG_dirty. CPU-B may do the "moving" first, in which case its
"old_memcg->nr_dirty --" makes old_memcg->nr_dirty temporarily wrong, but
CPU-A's "memcg->nr_dirty ++" follows right on its heels and amends the
stats.
However, there is a potential problem that old_memcg->nr_dirty may go
negative for a very short period, but that is not a big issue IMHO.
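
To make the CPU-A side concrete, here is a minimal sketch of what I have in
mind. move_lock_mem_cgroup()/move_unlock_mem_cgroup() and the pc->mem_cgroup
re-check are the existing memcontrol.c pattern; the mem_cgroup_inc_stat()
helper that charges the recorded memcg directly is hypothetical:
==
/* sketch of the CPU-A path only; would sit in mm/memcontrol.c */
static int memcg_set_page_dirty(struct page *page)
{
	struct page_cgroup *pc;
	struct mem_cgroup *memcg;
	unsigned long flags;
	int newly_dirty;

	if (mem_cgroup_disabled())
		return !TestSetPageDirty(page);

	pc = lookup_page_cgroup(page);

	rcu_read_lock();			/* keeps the recorded memcg valid */
again:
	memcg = pc->mem_cgroup;			/* record the current owner */
	move_lock_mem_cgroup(memcg, &flags);
	if (memcg != pc->mem_cgroup) {		/* lost a race with move_account */
		move_unlock_mem_cgroup(memcg, &flags);
		goto again;
	}
	newly_dirty = !TestSetPageDirty(page);
	move_unlock_mem_cgroup(memcg, &flags);

	if (newly_dirty)
		/* hypothetical helper: charge the recorded (possibly old) memcg */
		mem_cgroup_inc_stat(memcg, MEM_CGROUP_STAT_FILE_DIRTY);
	rcu_read_unlock();

	return newly_dirty;
}
==
If CPU-B moves the page right after the unlock, its "old_memcg->nr_dirty --"
and the increment on the recorded memcg above cancel out, as described.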

I hope that is clear.  : )
Thanks!

>
>>
>> So nr_dirty of old_memcg may go negative for a very short
>> period ('old_memcg->nr_dirty --' by CPU B), but it will be corrected soon
>> by CPU A. And the final figure of memcg->nr_dirty is correct.
>
>
> It seems both new_memcg and old_memcg have an account for the page. Is that
> correct?
>
>
>
>> Meanwhile the move_lock only protects saving old_memcg and
>> TestSetPageDirty in its critical section, without any irrelevant
>> logic, so the lock ordering and deadlock issues can be handled easily.
>>
>> But I'm not sure whether I've missed some race conditions; any comments
>> are welcome. : )
>>
>
> Sorry I couldn't understand.
>
> Thanks,
> -Kame
>
>



-- 
Thanks,
Sha


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
       [not found]                     ` <CAFj3OHW=n22veXzR27qfc+10t-nETU=B78NULPXrEDT1S-KsOw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-01-10  5:03                       ` Kamezawa Hiroyuki
       [not found]                         ` <50EE4B84.5080205-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
  0 siblings, 1 reply; 27+ messages in thread
From: Kamezawa Hiroyuki @ 2013-01-10  5:03 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: Michal Hocko, Hugh Dickins, Johannes Weiner,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	gthelen-hpIqsD4AKlfQT0dZR+AlfA,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w,
	glommer-bzQdu9zFT3WakBO8gow8eQ, dchinner-H+wXaHxf7aLQT0dZR+AlfA,
	Sha Zhengju

(2013/01/10 13:26), Sha Zhengju wrote:

> But this method also has its downsides (e.g. it needs lock nesting). So
> I wonder whether the following is able to deal with these issues all
> together:
> (CPU-A does "page stat accounting" and CPU-B does "move")
>
>               CPU-A                            CPU-B
>
> move_lock_mem_cgroup()
> memcg = pc->mem_cgroup
> SetPageDirty(page)
> move_unlock_mem_cgroup()
>                                        move_lock_mem_cgroup()
>                                        if (PageDirty) {
>                                                 old_memcg->nr_dirty --;
>                                                 new_memcg->nr_dirty ++;
>                                         }
>                                         pc->mem_cgroup = new_memcg
>                                         move_unlock_mem_cgroup()
>
> memcg->nr_dirty ++
>
>
> For CPU-A, we save pc->mem_cgroup in a temporary variable just before
> SetPageDirty inside move_lock, and then update the stats if the page was
> newly marked PG_dirty. CPU-B may do the "moving" first, in which case its
> "old_memcg->nr_dirty --" makes old_memcg->nr_dirty temporarily wrong, but
> CPU-A's "memcg->nr_dirty ++" follows right on its heels and amends the
> stats.
> However, there is a potential problem that old_memcg->nr_dirty may go
> negative for a very short period, but that is not a big issue IMHO.
>

IMHO, this will work. Please take care that the recorded memcg does not
become an invalid pointer when you update nr_dirty later.
(Maybe RCU will protect it.)

_If_ this method can handle the "nesting" problem more cleanly and makes
the implementation simpler, please go ahead. To be honest, I'm not sure
how the code will be until seeing the patch. Hmm, why do you write
SetPageDirty() here rather than TestSetPageDirty()....

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
       [not found]                         ` <50EE4B84.5080205-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
@ 2013-01-10  8:28                           ` Sha Zhengju
  0 siblings, 0 replies; 27+ messages in thread
From: Sha Zhengju @ 2013-01-10  8:28 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: Michal Hocko, Hugh Dickins, Johannes Weiner,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	gthelen-hpIqsD4AKlfQT0dZR+AlfA,
	fengguang.wu-ral2JQCrhuEAvxtiuMwx3w,
	glommer-bzQdu9zFT3WakBO8gow8eQ, dchinner-H+wXaHxf7aLQT0dZR+AlfA,
	Sha Zhengju

On Thu, Jan 10, 2013 at 1:03 PM, Kamezawa Hiroyuki
<kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org> wrote:
> (2013/01/10 13:26), Sha Zhengju wrote:
>
>> But this method also has its downsides (e.g. it needs lock nesting). So
>> I wonder whether the following is able to deal with these issues all
>> together:
>> (CPU-A does "page stat accounting" and CPU-B does "move")
>>
>>               CPU-A                            CPU-B
>>
>> move_lock_mem_cgroup()
>> memcg = pc->mem_cgroup
>> SetPageDirty(page)
>> move_unlock_mem_cgroup()
>>                                        move_lock_mem_cgroup()
>>                                        if (PageDirty) {
>>                                                 old_memcg->nr_dirty --;
>>                                                 new_memcg->nr_dirty ++;
>>                                         }
>>                                         pc->mem_cgroup = new_memcg
>>                                         move_unlock_mem_cgroup()
>>
>> memcg->nr_dirty ++
>>
>>
>> For CPU-A, we save pc->mem_cgroup in a temporary variable just before
>> SetPageDirty inside move_lock, and then update the stats if the page was
>> newly marked PG_dirty. CPU-B may do the "moving" first, in which case its
>> "old_memcg->nr_dirty --" makes old_memcg->nr_dirty temporarily wrong, but
>> CPU-A's "memcg->nr_dirty ++" follows right on its heels and amends the
>> stats.
>> However, there is a potential problem that old_memcg->nr_dirty may go
>> negative for a very short period, but that is not a big issue IMHO.
>>
>
> IMHO, this will work. Please take care that the recorded memcg does not
> become an invalid pointer when you update nr_dirty later.
> (Maybe RCU will protect it.)
>
Yes, there are three places that change pc->mem_cgroup: charge, uncharge and
move_account. "charge" has no race with the stat updater, "uncharge" doesn't
reset pc->mem_cgroup directly, and "move_account" is exactly the case we are
handling, so they should do no harm here. Meanwhile, an invalid pointer caused
by cgroup deletion should also be avoided by RCU. But that is only a rough
conclusion from a quick look...
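
Just in case the plain RCU read side turns out not to cover the whole window,
the recorded memcg could also be pinned explicitly. Below is a variant of the
sketch from my earlier mail, with the css_get()/css_put() pair being the only
real difference; the mem_cgroup_inc_stat() helper is still hypothetical, and
whether an extra get/put is acceptable on this path is an open question:
==
/* same shape as before, but pin the recorded memcg across the update */
static int memcg_set_page_dirty_pinned(struct page *page)
{
	struct page_cgroup *pc = lookup_page_cgroup(page);
	struct mem_cgroup *memcg;
	unsigned long flags;
	int newly_dirty;

	rcu_read_lock();
	memcg = pc->mem_cgroup;
	css_get(&memcg->css);		/* reference held across the deferred update */
	rcu_read_unlock();

	move_lock_mem_cgroup(memcg, &flags);
	/* (the pc->mem_cgroup re-check loop is omitted here for brevity) */
	newly_dirty = !TestSetPageDirty(page);
	move_unlock_mem_cgroup(memcg, &flags);

	if (newly_dirty)
		mem_cgroup_inc_stat(memcg, MEM_CGROUP_STAT_FILE_DIRTY);	/* hypothetical */

	css_put(&memcg->css);
	return newly_dirty;
}
==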

> _If_ this method can handle the "nesting" problem more cleanly and makes
> the implementation simpler, please go ahead. To be honest, I'm not sure
> how the code will be until
Okay, later I'll try to propose the patch.

> seeing the patch. Hmm, why do you write SetPageDirty() here rather than
> TestSetPageDirty()....
>
No particular reason...TestSetPageDirty() may be more precise... : )


-- 
Thanks,
Sha

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
  2013-01-02 10:44   ` Michal Hocko
  2013-01-05  4:48     ` Sha Zhengju
@ 2013-05-03  9:11     ` Michal Hocko
  2013-05-03  9:59       ` Sha Zhengju
  1 sibling, 1 reply; 27+ messages in thread
From: Michal Hocko @ 2013-05-03  9:11 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-kernel, cgroups, linux-mm, linux-fsdevel, akpm,
	kamezawa.hiroyu, gthelen, fengguang.wu, glommer, dchinner,
	Sha Zhengju

On Wed 02-01-13 11:44:21, Michal Hocko wrote:
> On Wed 26-12-12 01:26:07, Sha Zhengju wrote:
> > From: Sha Zhengju <handai.szj@taobao.com>
> > 
> > This patch adds memcg routines to count dirty pages, which allows memory controller
> > to maintain an accurate view of the amount of its dirty memory and can provide some
> > info for users while cgroup's direct reclaim is working.
> 
> I guess you meant targeted resp. (hard/soft) limit reclaim here,
> right? It is true that this is direct reclaim but it is not clear to me
> why the usefulness should be limited to the reclaim for users. I would
> understand this if the users were in fact in-kernel users.
> 
> [...]
> > To prevent AB/BA deadlock mentioned by Greg Thelen in previous version
> > (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order:
> > ->private_lock --> mapping->tree_lock --> memcg->move_lock.
> > So we need to make mapping->tree_lock ahead of TestSetPageDirty in __set_page_dirty()
> > and __set_page_dirty_nobuffers(). But in order to avoid useless spinlock contention,
> > a preliminary PageDirty() check is added.
> 
> But there is another AA deadlock here I believe.
> page_remove_rmap
>   mem_cgroup_begin_update_page_stat		<<< 1
>   set_page_dirty
>     __set_page_dirty_buffers
>       __set_page_dirty
>         mem_cgroup_begin_update_page_stat	<<< 2
> 	  move_lock_mem_cgroup
> 	    spin_lock_irqsave(&memcg->move_lock, *flags);

JFYI, since abf09bed (s390/mm: implement software dirty bits) this is no
longer possible. I haven't checked whether there are other cases like
this one, and it would be better if mem_cgroup_begin_update_page_stat
were recursion safe, if that can be done without too many hacks.
I will have a look at this (hopefully) sometime next week.

[...]
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
  2013-05-03  9:11     ` Michal Hocko
@ 2013-05-03  9:59       ` Sha Zhengju
  0 siblings, 0 replies; 27+ messages in thread
From: Sha Zhengju @ 2013-05-03  9:59 UTC (permalink / raw)
  To: Michal Hocko
  Cc: LKML, Cgroups, linux-mm@kvack.org, linux-fsdevel, Andrew Morton,
	KAMEZAWA Hiroyuki, Greg Thelen, Wu Fengguang, Glauber Costa,
	Dave Chinner, Sha Zhengju

On Fri, May 3, 2013 at 5:11 PM, Michal Hocko <mhocko@suse.cz> wrote:
> On Wed 02-01-13 11:44:21, Michal Hocko wrote:
>> On Wed 26-12-12 01:26:07, Sha Zhengju wrote:
>> > From: Sha Zhengju <handai.szj@taobao.com>
>> >
>> > This patch adds memcg routines to count dirty pages, which allows memory controller
>> > to maintain an accurate view of the amount of its dirty memory and can provide some
>> > info for users while cgroup's direct reclaim is working.
>>
>> I guess you meant targeted resp. (hard/soft) limit reclaim here,
>> right? It is true that this is direct reclaim but it is not clear to me
>> why the usefulness should be limited to the reclaim for users. I would
>> understand this if the users were in fact in-kernel users.
>>
>> [...]
>> > To prevent AB/BA deadlock mentioned by Greg Thelen in previous version
>> > (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order:
>> > ->private_lock --> mapping->tree_lock --> memcg->move_lock.
>> > So we need to make mapping->tree_lock ahead of TestSetPageDirty in __set_page_dirty()
>> > and __set_page_dirty_nobuffers(). But in order to avoid useless spinlock contention,
>> > a preliminary PageDirty() check is added.
>>
>> But there is another AA deadlock here I believe.
>> page_remove_rmap
>>   mem_cgroup_begin_update_page_stat           <<< 1
>>   set_page_dirty
>>     __set_page_dirty_buffers
>>       __set_page_dirty
>>         mem_cgroup_begin_update_page_stat     <<< 2
>>         move_lock_mem_cgroup
>>           spin_lock_irqsave(&memcg->move_lock, *flags);
>
> JFYI, since abf09bed (s390/mm: implement software dirty bits) this is no
> longer possible. I haven't checked whether there are other cases like
> this one, and it would be better if mem_cgroup_begin_update_page_stat
> were recursion safe, if that can be done without too many hacks.
> I will have a look at this (hopefully) sometime next week.
>

Hi Michal,


I'm sorry for not being able to return to this problem immediately after LSF/MM.
That is good news. IIRC, it's the only place where we have encountered a
recursion problem in accounting memcg dirty pages. But I'll try to
revive my previous work of simplifying the
mem_cgroup_begin_update_page_stat() locking.
I'll get back to it in the next few days.


--
Thanks,
Sha


^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2013-05-03  9:59 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <1356455919-14445-1-git-send-email-handai.szj@taobao.com>
     [not found] ` <1356455919-14445-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
2012-12-25 17:22   ` [PATCH V3 2/8] Make TestSetPageDirty and dirty page accounting in one func Sha Zhengju
2012-12-28  0:39     ` Kamezawa Hiroyuki
2013-01-05  2:34       ` Sha Zhengju
2013-01-02  9:08     ` Michal Hocko
     [not found]       ` <20130102090803.GB22160-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2013-01-05  2:49         ` Sha Zhengju
     [not found]           ` <CAFj3OHUCQkqB2+ky9wxFpkNYcn2=6t9Qd7XFf3RBY0F4Wxyqcg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-01-05 10:45             ` Michal Hocko
2012-12-25 17:24 ` [PATCH V3 3/8] use vfs __set_page_dirty interface instead of doing it inside filesystem Sha Zhengju
     [not found]   ` <1356456261-14579-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
2012-12-28  0:41     ` Kamezawa Hiroyuki
2012-12-25 17:26 ` [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting Sha Zhengju
2013-01-02 10:44   ` Michal Hocko
2013-01-05  4:48     ` Sha Zhengju
2013-01-06 20:02       ` Hugh Dickins
     [not found]         ` <alpine.LNX.2.00.1301061135400.29149-fupSdm12i1nKWymIFiNcPA@public.gmane.org>
2013-01-07  7:49           ` Kamezawa Hiroyuki
2013-01-09  5:15             ` Hugh Dickins
     [not found]               ` <alpine.LNX.2.00.1301082030100.5319-fupSdm12i1nKWymIFiNcPA@public.gmane.org>
2013-01-09  7:24                 ` Kamezawa Hiroyuki
2013-01-09 14:35           ` Sha Zhengju
     [not found]             ` <CAFj3OHVUx0bZyEGQU_CApVbgz7SrX3BQ+0U5fRV=En800wv+cQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-01-09 14:47               ` Michal Hocko
     [not found]       ` <CAFj3OHXKyMO3gwghiBAmbowvqko-JqLtKroX2kzin1rk=q9tZg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-01-07  7:25         ` Kamezawa Hiroyuki
     [not found]           ` <50EA7860.6030300-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
2013-01-09 15:02             ` Sha Zhengju
     [not found]               ` <CAFj3OHXMgRG6u2YoM7y5WuPo2ZNA1yPmKRV29FYj9B6Wj_c6Lw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-01-10  2:16                 ` Kamezawa Hiroyuki
2013-01-10  4:26                   ` Sha Zhengju
     [not found]                     ` <CAFj3OHW=n22veXzR27qfc+10t-nETU=B78NULPXrEDT1S-KsOw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-01-10  5:03                       ` Kamezawa Hiroyuki
     [not found]                         ` <50EE4B84.5080205-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
2013-01-10  8:28                           ` Sha Zhengju
2013-05-03  9:11     ` Michal Hocko
2013-05-03  9:59       ` Sha Zhengju
2013-01-06 20:07   ` Greg Thelen
     [not found]     ` <xr93obh2krcr.fsf-aSPv4SP+Du0KgorLzL7FmE7CuiCeIGUxQQ4Iyu8u01E@public.gmane.org>
2013-01-09  9:45       ` Sha Zhengju

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).