* [PATCH] memcg: simplify lock of memcg page stat accounting
@ 2013-01-26 11:12 Sha Zhengju
2013-01-28 14:10 ` Michal Hocko
2013-01-29 0:41 ` Kamezawa Hiroyuki
0 siblings, 2 replies; 9+ messages in thread
From: Sha Zhengju @ 2013-01-26 11:12 UTC (permalink / raw)
To: cgroups, linux-mm
Cc: mhocko, kamezawa.hiroyu, akpm, gthelen, hannes, hughd,
Sha Zhengju
From: Sha Zhengju <handai.szj@taobao.com>
After removing duplicated information such as the PCG_* flags in
'struct page_cgroup' (commit 2ff76f1193), there is a race between
"move" and "page stat accounting" (only FILE_MAPPED is supported now,
but other stats will be added in the future).
Assume CPU-A does "page stat accounting" and CPU-B does "move":
        CPU-A                           CPU-B

  TestSet PG_dirty
  (delay)                       move_lock_mem_cgroup()
                                if (PageDirty(page)) {
                                        old_memcg->nr_dirty --
                                        new_memcg->nr_dirty++
                                }
                                pc->mem_cgroup = new_memcg;
                                move_unlock_mem_cgroup()

  move_lock_mem_cgroup()
  memcg = pc->mem_cgroup
  memcg->nr_dirty++
  move_unlock_mem_cgroup()
Here the accounting information of new_memcg may be double-counted. So
we use a bigger lock to solve this problem (commit 89c06bd52f):
  move_lock_mem_cgroup()        <-- mem_cgroup_begin_update_page_stat()
  TestSetPageDirty(page)
  update page stats (without any checks)
  move_unlock_mem_cgroup()      <-- mem_cgroup_end_update_page_stat()
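
In code, the current FILE_MAPPED accounting in page_add_file_rmap
follows this pattern (see the removed lines in the diff below):

	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
	if (atomic_inc_and_test(&page->_mapcount)) {
		__inc_zone_page_state(page, NR_FILE_MAPPED);
		/* page flag change and stat update both sit inside the
		 * move lock */
		mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_MAPPED);
	}
	mem_cgroup_end_update_page_stat(page, &locked, &flags);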
But this method also has its downsides: at present we use two layers of
lock avoidance (memcg_moving and memcg->moving_account) and then a
spinlock on the memcg (see mem_cgroup_begin_update_page_stat()), and the
lock granularity is a little coarse: not only the critical section but
also some surrounding code logic falls inside the locked region, which
may be deadlock prone. As dirty and writeback stats are added, it gets
into further difficulty with the page cache radix tree lock, and it
seems that the lock would require nesting.
(https://lkml.org/lkml/2013/1/2/48)
So, in order to make the locking simpler and clearer and also to avoid
the 'nesting' problem, one choice may be
(CPU-A does "page stat accounting" and CPU-B does "move"):
        CPU-A                           CPU-B

  move_lock_mem_cgroup()
  memcg = pc->mem_cgroup
  TestSetPageDirty(page)
  move_unlock_mem_cgroup()
                                move_lock_mem_cgroup()
                                if (PageDirty) {
                                        old_memcg->nr_dirty --;
                                        new_memcg->nr_dirty ++;
                                }
                                pc->mem_cgroup = new_memcg
                                move_unlock_mem_cgroup()

  memcg->nr_dirty ++
For CPU-A, we save pc->mem_cgroup in a temporary variable just before
TestSetPageDirty, inside the move_lock, and then update the stats if the
page is set PG_dirty successfully. CPU-B may do the "moving" first, in
which case "old_memcg->nr_dirty --" temporarily makes old_memcg->nr_dirty
incorrect, but CPU-A will soon do "memcg->nr_dirty ++" and amend the
stats.
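
With this patch the same accounting site roughly becomes (full diff
below; the stat update moves outside the move lock and uses the memcg
saved under it):

	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
	memcg = try_get_mem_cgroup_from_page(page);	/* save memcg under the lock */
	ret = atomic_inc_and_test(&page->_mapcount);
	mem_cgroup_end_update_page_stat(page, &locked, &flags);

	if (ret) {
		__inc_zone_page_state(page, NR_FILE_MAPPED);
		if (memcg)
			mem_cgroup_inc_page_stat(memcg, MEMCG_NR_FILE_MAPPED);
	}
	if (memcg)
		css_put(mem_cgroup_css(memcg));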
Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
---
include/linux/memcontrol.h | 14 +++++------
mm/memcontrol.c | 8 ++-----
mm/rmap.c | 55 +++++++++++++++++++++++++++++++++-----------
3 files changed, 51 insertions(+), 26 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0108a56..12de53b 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -164,20 +164,20 @@ static inline void mem_cgroup_end_update_page_stat(struct page *page,
rcu_read_unlock();
}
-void mem_cgroup_update_page_stat(struct page *page,
+void mem_cgroup_update_page_stat(struct mem_cgroup *memcg,
enum mem_cgroup_page_stat_item idx,
int val);
-static inline void mem_cgroup_inc_page_stat(struct page *page,
+static inline void mem_cgroup_inc_page_stat(struct mem_cgroup *memcg,
enum mem_cgroup_page_stat_item idx)
{
- mem_cgroup_update_page_stat(page, idx, 1);
+ mem_cgroup_update_page_stat(memcg, idx, 1);
}
-static inline void mem_cgroup_dec_page_stat(struct page *page,
+static inline void mem_cgroup_dec_page_stat(struct mem_cgroup *memcg,
enum mem_cgroup_page_stat_item idx)
{
- mem_cgroup_update_page_stat(page, idx, -1);
+ mem_cgroup_update_page_stat(memcg, idx, -1);
}
unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
@@ -354,12 +354,12 @@ static inline void mem_cgroup_end_update_page_stat(struct page *page,
{
}
-static inline void mem_cgroup_inc_page_stat(struct page *page,
+static inline void mem_cgroup_inc_page_stat(struct mem_cgroup *memcg,
enum mem_cgroup_page_stat_item idx)
{
}
-static inline void mem_cgroup_dec_page_stat(struct page *page,
+static inline void mem_cgroup_dec_page_stat(struct mem_cgroup *memcg,
enum mem_cgroup_page_stat_item idx)
{
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3817460..1b13e43 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2259,18 +2259,14 @@ void __mem_cgroup_end_update_page_stat(struct page *page, unsigned long *flags)
move_unlock_mem_cgroup(pc->mem_cgroup, flags);
}
-void mem_cgroup_update_page_stat(struct page *page,
+void mem_cgroup_update_page_stat(struct mem_cgroup *memcg,
enum mem_cgroup_page_stat_item idx, int val)
{
- struct mem_cgroup *memcg;
- struct page_cgroup *pc = lookup_page_cgroup(page);
- unsigned long uninitialized_var(flags);
if (mem_cgroup_disabled())
return;
- memcg = pc->mem_cgroup;
- if (unlikely(!memcg || !PageCgroupUsed(pc)))
+ if (unlikely(!memcg))
return;
switch (idx) {
diff --git a/mm/rmap.c b/mm/rmap.c
index 59b0dca..0d74c48 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1112,13 +1112,25 @@ void page_add_file_rmap(struct page *page)
{
bool locked;
unsigned long flags;
+ bool ret;
+ struct mem_cgroup *memcg = NULL;
+ struct cgroup_subsys_state *css = NULL;
mem_cgroup_begin_update_page_stat(page, &locked, &flags);
- if (atomic_inc_and_test(&page->_mapcount)) {
+ memcg = try_get_mem_cgroup_from_page(page);
+ ret = atomic_inc_and_test(&page->_mapcount);
+ mem_cgroup_end_update_page_stat(page, &locked, &flags);
+
+ if (ret) {
__inc_zone_page_state(page, NR_FILE_MAPPED);
- mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_MAPPED);
+ if (memcg)
+ mem_cgroup_inc_page_stat(memcg, MEMCG_NR_FILE_MAPPED);
+ }
+
+ if (memcg) {
+ css = mem_cgroup_css(memcg);
+ css_put(css);
}
- mem_cgroup_end_update_page_stat(page, &locked, &flags);
}
/**
@@ -1133,18 +1145,32 @@ void page_remove_rmap(struct page *page)
bool anon = PageAnon(page);
bool locked;
unsigned long flags;
+ struct mem_cgroup *memcg = NULL;
+ struct cgroup_subsys_state *css = NULL;
+ bool ret;
/*
* The anon case has no mem_cgroup page_stat to update; but may
* uncharge_page() below, where the lock ordering can deadlock if
* we hold the lock against page_stat move: so avoid it on anon.
*/
- if (!anon)
+ if (!anon) {
mem_cgroup_begin_update_page_stat(page, &locked, &flags);
+ memcg = try_get_mem_cgroup_from_page(page);
+ if (memcg)
+ css = mem_cgroup_css(memcg);
+ }
+
+ ret = atomic_add_negative(-1, &page->_mapcount);
+ if (!anon)
+ mem_cgroup_end_update_page_stat(page, &locked, &flags);
/* page still mapped by someone else? */
- if (!atomic_add_negative(-1, &page->_mapcount))
- goto out;
+ if (!ret) {
+ if (!anon && memcg)
+ css_put(css);
+ return;
+ }
/*
* Now that the last pte has gone, s390 must transfer dirty
@@ -1173,8 +1199,12 @@ void page_remove_rmap(struct page *page)
* Hugepages are not counted in NR_ANON_PAGES nor NR_FILE_MAPPED
* and not charged by memcg for now.
*/
- if (unlikely(PageHuge(page)))
- goto out;
+ if (unlikely(PageHuge(page))) {
+ if (!anon && memcg)
+ css_put(css);
+ return;
+ }
+
if (anon) {
mem_cgroup_uncharge_page(page);
if (!PageTransHuge(page))
@@ -1184,8 +1214,10 @@ void page_remove_rmap(struct page *page)
NR_ANON_TRANSPARENT_HUGEPAGES);
} else {
__dec_zone_page_state(page, NR_FILE_MAPPED);
- mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_MAPPED);
- mem_cgroup_end_update_page_stat(page, &locked, &flags);
+ if (memcg) {
+ mem_cgroup_dec_page_stat(memcg, MEMCG_NR_FILE_MAPPED);
+ css_put(css);
+ }
}
if (unlikely(PageMlocked(page)))
clear_page_mlock(page);
@@ -1199,9 +1231,6 @@ void page_remove_rmap(struct page *page)
* faster for those pages still in swapcache.
*/
return;
-out:
- if (!anon)
- mem_cgroup_end_update_page_stat(page, &locked, &flags);
}
/*
--
1.7.9.5
* Re: [PATCH] memcg: simplify lock of memcg page stat accounting
2013-01-26 11:12 [PATCH] memcg: simplify lock of memcg page stat accounting Sha Zhengju
@ 2013-01-28 14:10 ` Michal Hocko
2013-01-29 13:44 ` Sha Zhengju
2013-01-29 0:41 ` Kamezawa Hiroyuki
1 sibling, 1 reply; 9+ messages in thread
From: Michal Hocko @ 2013-01-28 14:10 UTC (permalink / raw)
To: Sha Zhengju
Cc: cgroups, linux-mm, kamezawa.hiroyu, akpm, gthelen, hannes, hughd,
Sha Zhengju
Hi,
just a minor comment/suggestion. The patch would be much easier to
review if you split it up into two parts: a preparatory one with the
page->memcg parameter change, and the locking change you are proposing.
On Sat 26-01-13 19:12:36, Sha Zhengju wrote:
[...]
> So in order to make the lock simpler and clearer and also avoid the 'nesting'
> problem, a choice may be:
> (CPU-A does "page stat accounting" and CPU-B does "move")
>
> CPU-A CPU-B
>
> move_lock_mem_cgroup()
> memcg = pc->mem_cgroup
> TestSetPageDirty(page)
> move_unlock_mem_cgroup()
> move_lock_mem_cgroup()
> if (PageDirty) {
> old_memcg->nr_dirty --;
> new_memcg->nr_dirty ++;
> }
> pc->mem_cgroup = new_memcg
> move_unlock_mem_cgroup()
>
> memcg->nr_dirty ++
>
> For CPU-A, we save pc->mem_cgroup in a temporary variable just before
> TestSetPageDirty inside move_lock and then update stats if the page is set
> PG_dirty successfully.
Hmm, the description is a bit confusing. You are talking about
TestSetPageDirty but dirty accounting is not directly handled in the
patch. It took me a bit to figure out that it's actually set_page_dirty
called from page_remove_rmap which matters here. So it is more of a
dependency between MEMCG_NR_FILE_MAPPED and your future (currently
non-existent) MEMCG_NR_FILE_DIRTY accounting that you are preparing for.
set_page_dirty can now take mem_cgroup_{begin,end}_update_page_stat.
> But CPU-B may do "moving" in advance that
> "old_memcg->nr_dirty --" will make old_memcg->nr_dirty incorrect but
> soon CPU-A will do "memcg->nr_dirty ++" finally that amend the stats.
The counter is per-cpu so we are safe wrt. atomic increments and we can
probably tolerate off-by 1 temporal errors (mem_cgroup_read_stat would
need val = min(val, 0);).
I am not sure I like this very much though. It adds additional memcg
reference counting into page_add_file_rmap, which is a hot path already.
I think the accounting side should be as lightweight as possible and the
additional price should be paid by the mover.
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
> ---
> include/linux/memcontrol.h | 14 +++++------
> mm/memcontrol.c | 8 ++-----
> mm/rmap.c | 55 +++++++++++++++++++++++++++++++++-----------
> 3 files changed, 51 insertions(+), 26 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 0108a56..12de53b 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -164,20 +164,20 @@ static inline void mem_cgroup_end_update_page_stat(struct page *page,
> rcu_read_unlock();
> }
>
> -void mem_cgroup_update_page_stat(struct page *page,
> +void mem_cgroup_update_page_stat(struct mem_cgroup *memcg,
> enum mem_cgroup_page_stat_item idx,
> int val);
>
> -static inline void mem_cgroup_inc_page_stat(struct page *page,
> +static inline void mem_cgroup_inc_page_stat(struct mem_cgroup *memcg,
> enum mem_cgroup_page_stat_item idx)
> {
> - mem_cgroup_update_page_stat(page, idx, 1);
> + mem_cgroup_update_page_stat(memcg, idx, 1);
> }
>
> -static inline void mem_cgroup_dec_page_stat(struct page *page,
> +static inline void mem_cgroup_dec_page_stat(struct mem_cgroup *memcg,
> enum mem_cgroup_page_stat_item idx)
> {
> - mem_cgroup_update_page_stat(page, idx, -1);
> + mem_cgroup_update_page_stat(memcg, idx, -1);
> }
>
> unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> @@ -354,12 +354,12 @@ static inline void mem_cgroup_end_update_page_stat(struct page *page,
> {
> }
>
> -static inline void mem_cgroup_inc_page_stat(struct page *page,
> +static inline void mem_cgroup_inc_page_stat(struct mem_cgroup *memcg,
> enum mem_cgroup_page_stat_item idx)
> {
> }
>
> -static inline void mem_cgroup_dec_page_stat(struct page *page,
> +static inline void mem_cgroup_dec_page_stat(struct mem_cgroup *memcg,
> enum mem_cgroup_page_stat_item idx)
> {
> }
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 3817460..1b13e43 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2259,18 +2259,14 @@ void __mem_cgroup_end_update_page_stat(struct page *page, unsigned long *flags)
> move_unlock_mem_cgroup(pc->mem_cgroup, flags);
> }
>
> -void mem_cgroup_update_page_stat(struct page *page,
> +void mem_cgroup_update_page_stat(struct mem_cgroup *memcg,
> enum mem_cgroup_page_stat_item idx, int val)
> {
> - struct mem_cgroup *memcg;
> - struct page_cgroup *pc = lookup_page_cgroup(page);
> - unsigned long uninitialized_var(flags);
>
> if (mem_cgroup_disabled())
> return;
>
> - memcg = pc->mem_cgroup;
> - if (unlikely(!memcg || !PageCgroupUsed(pc)))
> + if (unlikely(!memcg))
> return;
>
> switch (idx) {
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 59b0dca..0d74c48 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1112,13 +1112,25 @@ void page_add_file_rmap(struct page *page)
> {
> bool locked;
> unsigned long flags;
> + bool ret;
> + struct mem_cgroup *memcg = NULL;
> + struct cgroup_subsys_state *css = NULL;
>
> mem_cgroup_begin_update_page_stat(page, &locked, &flags);
> - if (atomic_inc_and_test(&page->_mapcount)) {
> + memcg = try_get_mem_cgroup_from_page(page);
> + ret = atomic_inc_and_test(&page->_mapcount);
> + mem_cgroup_end_update_page_stat(page, &locked, &flags);
> +
> + if (ret) {
> __inc_zone_page_state(page, NR_FILE_MAPPED);
> - mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_MAPPED);
> + if (memcg)
> + mem_cgroup_inc_page_stat(memcg, MEMCG_NR_FILE_MAPPED);
> + }
> +
> + if (memcg) {
> + css = mem_cgroup_css(memcg);
> + css_put(css);
> }
> - mem_cgroup_end_update_page_stat(page, &locked, &flags);
> }
>
> /**
> @@ -1133,18 +1145,32 @@ void page_remove_rmap(struct page *page)
> bool anon = PageAnon(page);
> bool locked;
> unsigned long flags;
> + struct mem_cgroup *memcg = NULL;
> + struct cgroup_subsys_state *css = NULL;
> + bool ret;
>
> /*
> * The anon case has no mem_cgroup page_stat to update; but may
> * uncharge_page() below, where the lock ordering can deadlock if
> * we hold the lock against page_stat move: so avoid it on anon.
> */
> - if (!anon)
> + if (!anon) {
> mem_cgroup_begin_update_page_stat(page, &locked, &flags);
> + memcg = try_get_mem_cgroup_from_page(page);
> + if (memcg)
> + css = mem_cgroup_css(memcg);
> + }
> +
> + ret = atomic_add_negative(-1, &page->_mapcount);
> + if (!anon)
> + mem_cgroup_end_update_page_stat(page, &locked, &flags);
>
> /* page still mapped by someone else? */
> - if (!atomic_add_negative(-1, &page->_mapcount))
> - goto out;
> + if (!ret) {
> + if (!anon && memcg)
> + css_put(css);
> + return;
> + }
>
> /*
> * Now that the last pte has gone, s390 must transfer dirty
> @@ -1173,8 +1199,12 @@ void page_remove_rmap(struct page *page)
> * Hugepages are not counted in NR_ANON_PAGES nor NR_FILE_MAPPED
> * and not charged by memcg for now.
> */
> - if (unlikely(PageHuge(page)))
> - goto out;
> + if (unlikely(PageHuge(page))) {
> + if (!anon && memcg)
> + css_put(css);
> + return;
> + }
> +
> if (anon) {
> mem_cgroup_uncharge_page(page);
> if (!PageTransHuge(page))
> @@ -1184,8 +1214,10 @@ void page_remove_rmap(struct page *page)
> NR_ANON_TRANSPARENT_HUGEPAGES);
> } else {
> __dec_zone_page_state(page, NR_FILE_MAPPED);
> - mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_MAPPED);
> - mem_cgroup_end_update_page_stat(page, &locked, &flags);
> + if (memcg) {
> + mem_cgroup_dec_page_stat(memcg, MEMCG_NR_FILE_MAPPED);
> + css_put(css);
> + }
> }
> if (unlikely(PageMlocked(page)))
> clear_page_mlock(page);
> @@ -1199,9 +1231,6 @@ void page_remove_rmap(struct page *page)
> * faster for those pages still in swapcache.
> */
> return;
> -out:
> - if (!anon)
> - mem_cgroup_end_update_page_stat(page, &locked, &flags);
> }
>
> /*
> --
> 1.7.9.5
>
--
Michal Hocko
SUSE Labs
* Re: [PATCH] memcg: simplify lock of memcg page stat accounting
2013-01-26 11:12 [PATCH] memcg: simplify lock of memcg page stat accounting Sha Zhengju
2013-01-28 14:10 ` Michal Hocko
@ 2013-01-29 0:41 ` Kamezawa Hiroyuki
2013-01-29 10:40 ` Michal Hocko
2013-01-29 15:29 ` Sha Zhengju
1 sibling, 2 replies; 9+ messages in thread
From: Kamezawa Hiroyuki @ 2013-01-29 0:41 UTC (permalink / raw)
To: Sha Zhengju
Cc: cgroups, linux-mm, mhocko, akpm, gthelen, hannes, hughd,
Sha Zhengju
(2013/01/26 20:12), Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
>
> After removing duplicated information like PCG_*
> flags in 'struct page_cgroup'(commit 2ff76f1193), there's a problem
> between "move" and "page stat accounting"(only FILE_MAPPED is supported
> now but other stats will be added in future):
> assume CPU-A does "page stat accounting" and CPU-B does "move"
>
> CPU-A CPU-B
> TestSet PG_dirty
> (delay) move_lock_mem_cgroup()
> if (PageDirty(page)) {
> old_memcg->nr_dirty --
> new_memcg->nr_dirty++
> }
> pc->mem_cgroup = new_memcg;
> move_unlock_mem_cgroup()
>
> move_lock_mem_cgroup()
> memcg = pc->mem_cgroup
> memcg->nr_dirty++
> move_unlock_mem_cgroup()
>
> while accounting information of new_memcg may be double-counted. So we
> use a bigger lock to solve this problem: (commit: 89c06bd52f)
>
> move_lock_mem_cgroup() <-- mem_cgroup_begin_update_page_stat()
> TestSetPageDirty(page)
> update page stats (without any checks)
> move_unlock_mem_cgroup() <-- mem_cgroup_begin_update_page_stat()
>
>
> But this method also has its pros and cons: at present we use two layers
> of lock avoidance(memcg_moving and memcg->moving_account) then spinlock
> on memcg (see mem_cgroup_begin_update_page_stat()), but the lock granularity
> is a little bigger that not only the critical section but also some code
> logic is in the range of locking which may be deadlock prone. As dirty
> writeack stats are added, it gets into further difficulty with the page
> cache radix tree lock and it seems that the lock requires nesting.
> (https://lkml.org/lkml/2013/1/2/48)
>
> So in order to make the lock simpler and clearer and also avoid the 'nesting'
> problem, a choice may be:
> (CPU-A does "page stat accounting" and CPU-B does "move")
>
> CPU-A CPU-B
>
> move_lock_mem_cgroup()
> memcg = pc->mem_cgroup
> TestSetPageDirty(page)
> move_unlock_mem_cgroup()
> move_lock_mem_cgroup()
> if (PageDirty) {
> old_memcg->nr_dirty --;
> new_memcg->nr_dirty ++;
> }
> pc->mem_cgroup = new_memcg
> move_unlock_mem_cgroup()
>
> memcg->nr_dirty ++
>
Hmm. no race with file truncate ?
>
> For CPU-A, we save pc->mem_cgroup in a temporary variable just before
> TestSetPageDirty inside move_lock and then update stats if the page is set
> PG_dirty successfully. But CPU-B may do "moving" in advance that
> "old_memcg->nr_dirty --" will make old_memcg->nr_dirty incorrect but
> soon CPU-A will do "memcg->nr_dirty ++" finally that amend the stats.
>
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
> ---
> include/linux/memcontrol.h | 14 +++++------
> mm/memcontrol.c | 8 ++-----
> mm/rmap.c | 55 +++++++++++++++++++++++++++++++++-----------
> 3 files changed, 51 insertions(+), 26 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 0108a56..12de53b 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -164,20 +164,20 @@ static inline void mem_cgroup_end_update_page_stat(struct page *page,
> rcu_read_unlock();
> }
>
> -void mem_cgroup_update_page_stat(struct page *page,
> +void mem_cgroup_update_page_stat(struct mem_cgroup *memcg,
> enum mem_cgroup_page_stat_item idx,
> int val);
>
> -static inline void mem_cgroup_inc_page_stat(struct page *page,
> +static inline void mem_cgroup_inc_page_stat(struct mem_cgroup *memcg,
> enum mem_cgroup_page_stat_item idx)
> {
> - mem_cgroup_update_page_stat(page, idx, 1);
> + mem_cgroup_update_page_stat(memcg, idx, 1);
> }
>
> -static inline void mem_cgroup_dec_page_stat(struct page *page,
> +static inline void mem_cgroup_dec_page_stat(struct mem_cgroup *memcg,
> enum mem_cgroup_page_stat_item idx)
> {
> - mem_cgroup_update_page_stat(page, idx, -1);
> + mem_cgroup_update_page_stat(memcg, idx, -1);
> }
>
> unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> @@ -354,12 +354,12 @@ static inline void mem_cgroup_end_update_page_stat(struct page *page,
> {
> }
>
> -static inline void mem_cgroup_inc_page_stat(struct page *page,
> +static inline void mem_cgroup_inc_page_stat(struct mem_cgroup *memcg,
> enum mem_cgroup_page_stat_item idx)
> {
> }
>
> -static inline void mem_cgroup_dec_page_stat(struct page *page,
> +static inline void mem_cgroup_dec_page_stat(struct mem_cgroup *memcg,
> enum mem_cgroup_page_stat_item idx)
> {
> }
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 3817460..1b13e43 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2259,18 +2259,14 @@ void __mem_cgroup_end_update_page_stat(struct page *page, unsigned long *flags)
> move_unlock_mem_cgroup(pc->mem_cgroup, flags);
> }
>
> -void mem_cgroup_update_page_stat(struct page *page,
> +void mem_cgroup_update_page_stat(struct mem_cgroup *memcg,
> enum mem_cgroup_page_stat_item idx, int val)
> {
> - struct mem_cgroup *memcg;
> - struct page_cgroup *pc = lookup_page_cgroup(page);
> - unsigned long uninitialized_var(flags);
>
> if (mem_cgroup_disabled())
> return;
>
> - memcg = pc->mem_cgroup;
> - if (unlikely(!memcg || !PageCgroupUsed(pc)))
> + if (unlikely(!memcg))
> return;
I can't catch why you can do accounting without checking PCG_USED.
Could you add comments like:

  * while doing accounting ops, mapping->tree_lock or lock_page() is
    held and we cannot have any race with truncation

etc...
>
> switch (idx) {
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 59b0dca..0d74c48 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1112,13 +1112,25 @@ void page_add_file_rmap(struct page *page)
> {
> bool locked;
> unsigned long flags;
> + bool ret;
> + struct mem_cgroup *memcg = NULL;
> + struct cgroup_subsys_state *css = NULL;
>
> mem_cgroup_begin_update_page_stat(page, &locked, &flags);
> - if (atomic_inc_and_test(&page->_mapcount)) {
> + memcg = try_get_mem_cgroup_from_page(page);
Toooooo heavy ! I can say NACK to this patch only because of this try_get().
To hold memcg alive, rcu_read_lock() will work (as current code does).
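Something along these lines (just a sketch, not exact code) should be
enough for the add side, with RCU rather than a css reference keeping
the memcg stable:

	struct page_cgroup *pc = lookup_page_cgroup(page);

	rcu_read_lock();
	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
	memcg = PageCgroupUsed(pc) ? pc->mem_cgroup : NULL;
	ret = atomic_inc_and_test(&page->_mapcount);
	mem_cgroup_end_update_page_stat(page, &locked, &flags);

	if (ret) {
		__inc_zone_page_state(page, NR_FILE_MAPPED);
		if (memcg)
			mem_cgroup_inc_page_stat(memcg, MEMCG_NR_FILE_MAPPED);
	}
	rcu_read_unlock();	/* RCU, not css_get/css_put, keeps memcg valid */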
BTW, does this patch fix the nested-lock problem ?
Thanks,
-Kame
* Re: [PATCH] memcg: simplify lock of memcg page stat accounting
2013-01-29 0:41 ` Kamezawa Hiroyuki
@ 2013-01-29 10:40 ` Michal Hocko
2013-01-29 15:29 ` Sha Zhengju
1 sibling, 0 replies; 9+ messages in thread
From: Michal Hocko @ 2013-01-29 10:40 UTC (permalink / raw)
To: Kamezawa Hiroyuki
Cc: Sha Zhengju, cgroups, linux-mm, akpm, gthelen, hannes, hughd,
Sha Zhengju
On Tue 29-01-13 09:41:05, KAMEZAWA Hiroyuki wrote:
> (2013/01/26 20:12), Sha Zhengju wrote:
[...]
> > So in order to make the lock simpler and clearer and also avoid the 'nesting'
> > problem, a choice may be:
> > (CPU-A does "page stat accounting" and CPU-B does "move")
> >
> > CPU-A CPU-B
> >
> > move_lock_mem_cgroup()
> > memcg = pc->mem_cgroup
> > TestSetPageDirty(page)
> > move_unlock_mem_cgroup()
> > move_lock_mem_cgroup()
> > if (PageDirty) {
> > old_memcg->nr_dirty --;
> > new_memcg->nr_dirty ++;
> > }
> > pc->mem_cgroup = new_memcg
> > move_unlock_mem_cgroup()
> >
> > memcg->nr_dirty ++
> >
>
> Hmm. no race with file truncate ?
Shouldn't pte lock protect us in page_{add_file,remove}_rmap?
[...]
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 59b0dca..0d74c48 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -1112,13 +1112,25 @@ void page_add_file_rmap(struct page *page)
> > {
> > bool locked;
> > unsigned long flags;
> > + bool ret;
> > + struct mem_cgroup *memcg = NULL;
> > + struct cgroup_subsys_state *css = NULL;
> >
> > mem_cgroup_begin_update_page_stat(page, &locked, &flags);
> > - if (atomic_inc_and_test(&page->_mapcount)) {
> > + memcg = try_get_mem_cgroup_from_page(page);
>
> Toooooo heavy ! I can say NACK to this patch only because of this try_get().
Agreed.
> To hold memcg alive, rcu_read_lock() will work (as current code does).
>
> BTW, does this patch fixes the nested-lock problem ?
Because set_page_dirty is called outside of
mem_cgroup_{begin,end}_update_page_stat. That confused me too.
--
Michal Hocko
SUSE Labs
* Re: [PATCH] memcg: simplify lock of memcg page stat accounting
2013-01-28 14:10 ` Michal Hocko
@ 2013-01-29 13:44 ` Sha Zhengju
2013-01-29 15:19 ` Michal Hocko
0 siblings, 1 reply; 9+ messages in thread
From: Sha Zhengju @ 2013-01-29 13:44 UTC (permalink / raw)
To: Michal Hocko
Cc: cgroups, linux-mm, kamezawa.hiroyu, akpm, gthelen, hannes, hughd,
Sha Zhengju
On Mon, Jan 28, 2013 at 10:10 PM, Michal Hocko <mhocko@suse.cz> wrote:
> Hi,
> just a minor comment/suggestion. The patch would be much more easier to
> review if you split it up into two parts. Preparatory with page->memcg
> parameter change and the locking change you are proposing.
>
OK.
> On Sat 26-01-13 19:12:36, Sha Zhengju wrote:
> [...]
>> So in order to make the lock simpler and clearer and also avoid the 'nesting'
>> problem, a choice may be:
>> (CPU-A does "page stat accounting" and CPU-B does "move")
>>
>> CPU-A CPU-B
>>
>> move_lock_mem_cgroup()
>> memcg = pc->mem_cgroup
>> TestSetPageDirty(page)
>> move_unlock_mem_cgroup()
>> move_lock_mem_cgroup()
>> if (PageDirty) {
>> old_memcg->nr_dirty --;
>> new_memcg->nr_dirty ++;
>> }
>> pc->mem_cgroup = new_memcg
>> move_unlock_mem_cgroup()
>>
>> memcg->nr_dirty ++
>>
>> For CPU-A, we save pc->mem_cgroup in a temporary variable just before
>> TestSetPageDirty inside move_lock and then update stats if the page is set
>> PG_dirty successfully.
>
> Hmm, the description is a bit confising. You are talking about
> TestSetPageDirty but dirty accounting is not directly handled in the
> patch. It took me a bit to figure that it's actually set_page_dirty
> called from page_remove_rmap which matters here. So it is more a
Thanks for reviewing!
Yeah, now I find it improper to take dirty pages as the example in the
commit log. Since dirty page accounting is more complicated than other
stats such as FILE_MAPPED and FILE_WRITEBACK, I tried to make clear the
reason for changing the current locking rule.
> dependency between MEMCG_NR_FILE_MAPPED and your future (currently
> non-existent) MEMCG_NR_FILE_DIRTY accounting that you are preparing.
> set_page_dirty now can take mem_cgroup_{begin,end}_update_page_stat.
>
>> But CPU-B may do "moving" in advance that
>> "old_memcg->nr_dirty --" will make old_memcg->nr_dirty incorrect but
>> soon CPU-A will do "memcg->nr_dirty ++" finally that amend the stats.
>
> The counter is per-cpu so we are safe wrt. atomic increments and we can
> probably tolerate off-by 1 temporal errors (mem_cgroup_read_stat would
> need val = min(val, 0);).
Sorry, I cannot catch the 'min(val, 0)' part.. or do you mean max?
> I am not sure I like this very much though. It adds an additional memcg
> reference counting into page_add_file which is a hot path already.
>
> I think the accounting side should be as lightweight as possible and the
> additional price should be payed by mover.
>
Yes... I thought css_get is not so heavy here, so I used the existing
try_get(). I'll try to use RCU to hold the memcg alive as Kame suggests.
>> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
>> ---
>> include/linux/memcontrol.h | 14 +++++------
>> mm/memcontrol.c | 8 ++-----
>> mm/rmap.c | 55 +++++++++++++++++++++++++++++++++-----------
>> 3 files changed, 51 insertions(+), 26 deletions(-)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 0108a56..12de53b 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -164,20 +164,20 @@ static inline void mem_cgroup_end_update_page_stat(struct page *page,
>> rcu_read_unlock();
>> }
>>
>> -void mem_cgroup_update_page_stat(struct page *page,
>> +void mem_cgroup_update_page_stat(struct mem_cgroup *memcg,
>> enum mem_cgroup_page_stat_item idx,
>> int val);
>>
>> -static inline void mem_cgroup_inc_page_stat(struct page *page,
>> +static inline void mem_cgroup_inc_page_stat(struct mem_cgroup *memcg,
>> enum mem_cgroup_page_stat_item idx)
>> {
>> - mem_cgroup_update_page_stat(page, idx, 1);
>> + mem_cgroup_update_page_stat(memcg, idx, 1);
>> }
>>
>> -static inline void mem_cgroup_dec_page_stat(struct page *page,
>> +static inline void mem_cgroup_dec_page_stat(struct mem_cgroup *memcg,
>> enum mem_cgroup_page_stat_item idx)
>> {
>> - mem_cgroup_update_page_stat(page, idx, -1);
>> + mem_cgroup_update_page_stat(memcg, idx, -1);
>> }
>>
>> unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>> @@ -354,12 +354,12 @@ static inline void mem_cgroup_end_update_page_stat(struct page *page,
>> {
>> }
>>
>> -static inline void mem_cgroup_inc_page_stat(struct page *page,
>> +static inline void mem_cgroup_inc_page_stat(struct mem_cgroup *memcg,
>> enum mem_cgroup_page_stat_item idx)
>> {
>> }
>>
>> -static inline void mem_cgroup_dec_page_stat(struct page *page,
>> +static inline void mem_cgroup_dec_page_stat(struct mem_cgroup *memcg,
>> enum mem_cgroup_page_stat_item idx)
>> {
>> }
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 3817460..1b13e43 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -2259,18 +2259,14 @@ void __mem_cgroup_end_update_page_stat(struct page *page, unsigned long *flags)
>> move_unlock_mem_cgroup(pc->mem_cgroup, flags);
>> }
>>
>> -void mem_cgroup_update_page_stat(struct page *page,
>> +void mem_cgroup_update_page_stat(struct mem_cgroup *memcg,
>> enum mem_cgroup_page_stat_item idx, int val)
>> {
>> - struct mem_cgroup *memcg;
>> - struct page_cgroup *pc = lookup_page_cgroup(page);
>> - unsigned long uninitialized_var(flags);
>>
>> if (mem_cgroup_disabled())
>> return;
>>
>> - memcg = pc->mem_cgroup;
>> - if (unlikely(!memcg || !PageCgroupUsed(pc)))
>> + if (unlikely(!memcg))
>> return;
>>
>> switch (idx) {
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 59b0dca..0d74c48 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1112,13 +1112,25 @@ void page_add_file_rmap(struct page *page)
>> {
>> bool locked;
>> unsigned long flags;
>> + bool ret;
>> + struct mem_cgroup *memcg = NULL;
>> + struct cgroup_subsys_state *css = NULL;
>>
>> mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>> - if (atomic_inc_and_test(&page->_mapcount)) {
>> + memcg = try_get_mem_cgroup_from_page(page);
>> + ret = atomic_inc_and_test(&page->_mapcount);
>> + mem_cgroup_end_update_page_stat(page, &locked, &flags);
>> +
>> + if (ret) {
>> __inc_zone_page_state(page, NR_FILE_MAPPED);
>> - mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_MAPPED);
>> + if (memcg)
>> + mem_cgroup_inc_page_stat(memcg, MEMCG_NR_FILE_MAPPED);
>> + }
>> +
>> + if (memcg) {
>> + css = mem_cgroup_css(memcg);
>> + css_put(css);
>> }
>> - mem_cgroup_end_update_page_stat(page, &locked, &flags);
>> }
>>
>> /**
>> @@ -1133,18 +1145,32 @@ void page_remove_rmap(struct page *page)
>> bool anon = PageAnon(page);
>> bool locked;
>> unsigned long flags;
>> + struct mem_cgroup *memcg = NULL;
>> + struct cgroup_subsys_state *css = NULL;
>> + bool ret;
>>
>> /*
>> * The anon case has no mem_cgroup page_stat to update; but may
>> * uncharge_page() below, where the lock ordering can deadlock if
>> * we hold the lock against page_stat move: so avoid it on anon.
>> */
>> - if (!anon)
>> + if (!anon) {
>> mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>> + memcg = try_get_mem_cgroup_from_page(page);
>> + if (memcg)
>> + css = mem_cgroup_css(memcg);
>> + }
>> +
>> + ret = atomic_add_negative(-1, &page->_mapcount);
>> + if (!anon)
>> + mem_cgroup_end_update_page_stat(page, &locked, &flags);
>>
>> /* page still mapped by someone else? */
>> - if (!atomic_add_negative(-1, &page->_mapcount))
>> - goto out;
>> + if (!ret) {
>> + if (!anon && memcg)
>> + css_put(css);
>> + return;
>> + }
>>
>> /*
>> * Now that the last pte has gone, s390 must transfer dirty
>> @@ -1173,8 +1199,12 @@ void page_remove_rmap(struct page *page)
>> * Hugepages are not counted in NR_ANON_PAGES nor NR_FILE_MAPPED
>> * and not charged by memcg for now.
>> */
>> - if (unlikely(PageHuge(page)))
>> - goto out;
>> + if (unlikely(PageHuge(page))) {
>> + if (!anon && memcg)
>> + css_put(css);
>> + return;
>> + }
>> +
>> if (anon) {
>> mem_cgroup_uncharge_page(page);
>> if (!PageTransHuge(page))
>> @@ -1184,8 +1214,10 @@ void page_remove_rmap(struct page *page)
>> NR_ANON_TRANSPARENT_HUGEPAGES);
>> } else {
>> __dec_zone_page_state(page, NR_FILE_MAPPED);
>> - mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_MAPPED);
>> - mem_cgroup_end_update_page_stat(page, &locked, &flags);
>> + if (memcg) {
>> + mem_cgroup_dec_page_stat(memcg, MEMCG_NR_FILE_MAPPED);
>> + css_put(css);
>> + }
>> }
>> if (unlikely(PageMlocked(page)))
>> clear_page_mlock(page);
>> @@ -1199,9 +1231,6 @@ void page_remove_rmap(struct page *page)
>> * faster for those pages still in swapcache.
>> */
>> return;
>> -out:
>> - if (!anon)
>> - mem_cgroup_end_update_page_stat(page, &locked, &flags);
>> }
>>
>> /*
>> --
>> 1.7.9.5
>>
>
> --
> Michal Hocko
> SUSE Labs
--
Thanks,
Sha
* Re: [PATCH] memcg: simplify lock of memcg page stat accounting
2013-01-29 13:44 ` Sha Zhengju
@ 2013-01-29 15:19 ` Michal Hocko
0 siblings, 0 replies; 9+ messages in thread
From: Michal Hocko @ 2013-01-29 15:19 UTC (permalink / raw)
To: Sha Zhengju
Cc: cgroups, linux-mm, kamezawa.hiroyu, akpm, gthelen, hannes, hughd,
Sha Zhengju
On Tue 29-01-13 21:44:44, Sha Zhengju wrote:
> On Mon, Jan 28, 2013 at 10:10 PM, Michal Hocko <mhocko@suse.cz> wrote:
[...]
> >> But CPU-B may do "moving" in advance that
> >> "old_memcg->nr_dirty --" will make old_memcg->nr_dirty incorrect but
> >> soon CPU-A will do "memcg->nr_dirty ++" finally that amend the stats.
> >
> > The counter is per-cpu so we are safe wrt. atomic increments and we can
> > probably tolerate off-by 1 temporal errors (mem_cgroup_read_stat would
> > need val = min(val, 0);).
>
> Sorry, I cannot catch the 'min(val, 0)' part.. or do you mean max?
Ohh, yeah. Head was telling max, fingers disagreed.
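IOW, something like this on the read side (just a sketch; the hotplug
base counts would be folded in as today):

	static long mem_cgroup_read_stat(struct mem_cgroup *memcg,
					 enum mem_cgroup_stat_index idx)
	{
		long val = 0;
		int cpu;

		for_each_online_cpu(cpu)
			val += per_cpu(memcg->stat->count[idx], cpu);

		/* tolerate temporary negative sums from the reordered
		 * dec-on-old-memcg / inc-on-saved-memcg pair */
		return max(val, 0L);
	}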
--
Michal Hocko
SUSE Labs
* Re: [PATCH] memcg: simplify lock of memcg page stat accounting
2013-01-29 0:41 ` Kamezawa Hiroyuki
2013-01-29 10:40 ` Michal Hocko
@ 2013-01-29 15:29 ` Sha Zhengju
2013-01-30 9:12 ` Michal Hocko
1 sibling, 1 reply; 9+ messages in thread
From: Sha Zhengju @ 2013-01-29 15:29 UTC (permalink / raw)
To: Kamezawa Hiroyuki, Michal Hocko
Cc: cgroups, linux-mm, akpm, gthelen, hannes, hughd, Sha Zhengju
On Tue, Jan 29, 2013 at 8:41 AM, Kamezawa Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> (2013/01/26 20:12), Sha Zhengju wrote:
>> From: Sha Zhengju <handai.szj@taobao.com>
>>
>> After removing duplicated information like PCG_*
>> flags in 'struct page_cgroup'(commit 2ff76f1193), there's a problem
>> between "move" and "page stat accounting"(only FILE_MAPPED is supported
>> now but other stats will be added in future):
>> assume CPU-A does "page stat accounting" and CPU-B does "move"
>>
>> CPU-A CPU-B
>> TestSet PG_dirty
>> (delay) move_lock_mem_cgroup()
>> if (PageDirty(page)) {
>> old_memcg->nr_dirty --
>> new_memcg->nr_dirty++
>> }
>> pc->mem_cgroup = new_memcg;
>> move_unlock_mem_cgroup()
>>
>> move_lock_mem_cgroup()
>> memcg = pc->mem_cgroup
>> memcg->nr_dirty++
>> move_unlock_mem_cgroup()
>>
>> while accounting information of new_memcg may be double-counted. So we
>> use a bigger lock to solve this problem: (commit: 89c06bd52f)
>>
>> move_lock_mem_cgroup() <-- mem_cgroup_begin_update_page_stat()
>> TestSetPageDirty(page)
>> update page stats (without any checks)
>> move_unlock_mem_cgroup() <-- mem_cgroup_begin_update_page_stat()
>>
>>
>> But this method also has its pros and cons: at present we use two layers
>> of lock avoidance(memcg_moving and memcg->moving_account) then spinlock
>> on memcg (see mem_cgroup_begin_update_page_stat()), but the lock granularity
>> is a little bigger that not only the critical section but also some code
>> logic is in the range of locking which may be deadlock prone. As dirty
>> writeack stats are added, it gets into further difficulty with the page
>> cache radix tree lock and it seems that the lock requires nesting.
>> (https://lkml.org/lkml/2013/1/2/48)
>>
>> So in order to make the lock simpler and clearer and also avoid the 'nesting'
>> problem, a choice may be:
>> (CPU-A does "page stat accounting" and CPU-B does "move")
>>
>> CPU-A CPU-B
>>
>> move_lock_mem_cgroup()
>> memcg = pc->mem_cgroup
>> TestSetPageDirty(page)
>> move_unlock_mem_cgroup()
>> move_lock_mem_cgroup()
>> if (PageDirty) {
>> old_memcg->nr_dirty --;
>> new_memcg->nr_dirty ++;
>> }
>> pc->mem_cgroup = new_memcg
>> move_unlock_mem_cgroup()
>>
>> memcg->nr_dirty ++
>>
>
> Hmm. no race with file truncate ?
>
Do you mean "dirty page accounting" racing with truncate? Yes, if
someone else does a truncate and sets page->mapping = NULL just before
CPU-A's 'memcg->nr_dirty ++', then there will be no chance to correct
the figure back. So my rough idea now is to make some small changes to
__set_page_dirty/__set_page_dirty_nobuffers so that SetPageDirty is
done inside ->tree_lock.
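Roughly (just a sketch of that idea, details not worked out yet), the
flag change could be done under the tree_lock, where a racing truncate
is excluded:

	spin_lock_irq(&mapping->tree_lock);
	if (page->mapping) {			/* not truncated */
		if (!TestSetPageDirty(page)) {
			account_page_dirtied(page, mapping);
			radix_tree_tag_set(&mapping->page_tree,
					   page_index(page),
					   PAGECACHE_TAG_DIRTY);
		}
	}
	spin_unlock_irq(&mapping->tree_lock);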
But, in the current code, is there any chance of
mem_cgroup_move_account() racing with truncate such that PageAnon is
false (since page->mapping is cleared) but later in page_remove_rmap()
the new memcg stats get over-decremented...? Call me silly... but I
really get dizzy from those locks now, I need to have a run to refresh
my head... : (
>
>>
>> For CPU-A, we save pc->mem_cgroup in a temporary variable just before
>> TestSetPageDirty inside move_lock and then update stats if the page is set
>> PG_dirty successfully. But CPU-B may do "moving" in advance that
>> "old_memcg->nr_dirty --" will make old_memcg->nr_dirty incorrect but
>> soon CPU-A will do "memcg->nr_dirty ++" finally that amend the stats.
>>
>> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
>> ---
>> include/linux/memcontrol.h | 14 +++++------
>> mm/memcontrol.c | 8 ++-----
>> mm/rmap.c | 55 +++++++++++++++++++++++++++++++++-----------
>> 3 files changed, 51 insertions(+), 26 deletions(-)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 0108a56..12de53b 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -164,20 +164,20 @@ static inline void mem_cgroup_end_update_page_stat(struct page *page,
>> rcu_read_unlock();
>> }
>>
>> -void mem_cgroup_update_page_stat(struct page *page,
>> +void mem_cgroup_update_page_stat(struct mem_cgroup *memcg,
>> enum mem_cgroup_page_stat_item idx,
>> int val);
>>
>> -static inline void mem_cgroup_inc_page_stat(struct page *page,
>> +static inline void mem_cgroup_inc_page_stat(struct mem_cgroup *memcg,
>> enum mem_cgroup_page_stat_item idx)
>> {
>> - mem_cgroup_update_page_stat(page, idx, 1);
>> + mem_cgroup_update_page_stat(memcg, idx, 1);
>> }
>>
>> -static inline void mem_cgroup_dec_page_stat(struct page *page,
>> +static inline void mem_cgroup_dec_page_stat(struct mem_cgroup *memcg,
>> enum mem_cgroup_page_stat_item idx)
>> {
>> - mem_cgroup_update_page_stat(page, idx, -1);
>> + mem_cgroup_update_page_stat(memcg, idx, -1);
>> }
>>
>> unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>> @@ -354,12 +354,12 @@ static inline void mem_cgroup_end_update_page_stat(struct page *page,
>> {
>> }
>>
>> -static inline void mem_cgroup_inc_page_stat(struct page *page,
>> +static inline void mem_cgroup_inc_page_stat(struct mem_cgroup *memcg,
>> enum mem_cgroup_page_stat_item idx)
>> {
>> }
>>
>> -static inline void mem_cgroup_dec_page_stat(struct page *page,
>> +static inline void mem_cgroup_dec_page_stat(struct mem_cgroup *memcg,
>> enum mem_cgroup_page_stat_item idx)
>> {
>> }
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 3817460..1b13e43 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -2259,18 +2259,14 @@ void __mem_cgroup_end_update_page_stat(struct page *page, unsigned long *flags)
>> move_unlock_mem_cgroup(pc->mem_cgroup, flags);
>> }
>>
>> -void mem_cgroup_update_page_stat(struct page *page,
>> +void mem_cgroup_update_page_stat(struct mem_cgroup *memcg,
>> enum mem_cgroup_page_stat_item idx, int val)
>> {
>> - struct mem_cgroup *memcg;
>> - struct page_cgroup *pc = lookup_page_cgroup(page);
>> - unsigned long uninitialized_var(flags);
>>
>> if (mem_cgroup_disabled())
>> return;
>>
>> - memcg = pc->mem_cgroup;
>> - if (unlikely(!memcg || !PageCgroupUsed(pc)))
>> + if (unlikely(!memcg))
>> return;
>
> I can't catch why you can do accounting without checking PCG_USED.
> Could you add comments like
>
> * while accounting ops, mapping->tree_lock() or lock_page() is held
> and we have any race with truncation
> etc...
Yeah... considering stat updates and uncharge, PCG_USED should be
checked here. But another problem rising in my mind is that the three
(page stat accounting, move_account and uncharge) may need
synchronization....
>
>>
>> switch (idx) {
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 59b0dca..0d74c48 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1112,13 +1112,25 @@ void page_add_file_rmap(struct page *page)
>> {
>> bool locked;
>> unsigned long flags;
>> + bool ret;
>> + struct mem_cgroup *memcg = NULL;
>> + struct cgroup_subsys_state *css = NULL;
>>
>> mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>> - if (atomic_inc_and_test(&page->_mapcount)) {
>> + memcg = try_get_mem_cgroup_from_page(page);
>
> Toooooo heavy ! I can say NACK to this patch only because of this try_get().
>
> To hold memcg alive, rcu_read_lock() will work (as current code does).
>
OK, next version will return to its correct path.
> BTW, does this patch fixes the nested-lock problem ?
>
Yes, the lock only protects 'get old memcg' and 'modify page status',
so page_remove_rmap can call set_page_dirty outside of the memcg stat
lock.
Thanks,
Sha
* Re: [PATCH] memcg: simplify lock of memcg page stat accounting
2013-01-29 15:29 ` Sha Zhengju
@ 2013-01-30 9:12 ` Michal Hocko
2013-01-30 14:57 ` Sha Zhengju
0 siblings, 1 reply; 9+ messages in thread
From: Michal Hocko @ 2013-01-30 9:12 UTC (permalink / raw)
To: Sha Zhengju
Cc: Kamezawa Hiroyuki, cgroups, linux-mm, akpm, gthelen, hannes,
hughd, Sha Zhengju
On Tue 29-01-13 23:29:35, Sha Zhengju wrote:
> On Tue, Jan 29, 2013 at 8:41 AM, Kamezawa Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > (2013/01/26 20:12), Sha Zhengju wrote:
> >> From: Sha Zhengju <handai.szj@taobao.com>
> >>
> >> After removing duplicated information like PCG_*
> >> flags in 'struct page_cgroup'(commit 2ff76f1193), there's a problem
> >> between "move" and "page stat accounting"(only FILE_MAPPED is supported
> >> now but other stats will be added in future):
> >> assume CPU-A does "page stat accounting" and CPU-B does "move"
> >>
> >> CPU-A CPU-B
> >> TestSet PG_dirty
> >> (delay) move_lock_mem_cgroup()
> >> if (PageDirty(page)) {
> >> old_memcg->nr_dirty --
> >> new_memcg->nr_dirty++
> >> }
> >> pc->mem_cgroup = new_memcg;
> >> move_unlock_mem_cgroup()
> >>
> >> move_lock_mem_cgroup()
> >> memcg = pc->mem_cgroup
> >> memcg->nr_dirty++
> >> move_unlock_mem_cgroup()
> >>
> >> while accounting information of new_memcg may be double-counted. So we
> >> use a bigger lock to solve this problem: (commit: 89c06bd52f)
> >>
> >> move_lock_mem_cgroup() <-- mem_cgroup_begin_update_page_stat()
> >> TestSetPageDirty(page)
> >> update page stats (without any checks)
> >> move_unlock_mem_cgroup() <-- mem_cgroup_begin_update_page_stat()
> >>
> >>
> >> But this method also has its pros and cons: at present we use two layers
> >> of lock avoidance(memcg_moving and memcg->moving_account) then spinlock
> >> on memcg (see mem_cgroup_begin_update_page_stat()), but the lock granularity
> >> is a little bigger that not only the critical section but also some code
> >> logic is in the range of locking which may be deadlock prone. As dirty
> >> writeack stats are added, it gets into further difficulty with the page
> >> cache radix tree lock and it seems that the lock requires nesting.
> >> (https://lkml.org/lkml/2013/1/2/48)
> >>
> >> So in order to make the lock simpler and clearer and also avoid the 'nesting'
> >> problem, a choice may be:
> >> (CPU-A does "page stat accounting" and CPU-B does "move")
> >>
> >> CPU-A CPU-B
> >>
> >> move_lock_mem_cgroup()
> >> memcg = pc->mem_cgroup
> >> TestSetPageDirty(page)
> >> move_unlock_mem_cgroup()
> >> move_lock_mem_cgroup()
> >> if (PageDirty) {
> >> old_memcg->nr_dirty --;
> >> new_memcg->nr_dirty ++;
> >> }
> >> pc->mem_cgroup = new_memcg
> >> move_unlock_mem_cgroup()
> >>
> >> memcg->nr_dirty ++
> >>
> >
> > Hmm. no race with file truncate ?
> >
>
> Do you mean "dirty page accounting" racing with truncate? Yes, if
> another one do truncate and set page->mapping=NULL just before CPU-A's
> 'memcg->nr_dirty ++', then it'll have no change to correct the figure
> back. So my rough idea now is to have some small changes to
> __set_page_dirty/__set_page_dirty_nobuffers that do SetDirtyPage
> inside ->tree_lock.
>
> But, in current codes, is there any chance that
> mem_cgroup_move_account() racing with truncate that PageAnon is
> false(since page->mapping is cleared) but later in page_remove_rmap()
> the new memcg stats is over decrement...?
We are not checking page->mapping but rather page_mapped() which
checks page->_mapcount and that is protected from races with
mem_cgroup_move_account by mem_cgroup_begin_update_page_stat locking.
Makes sense?
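
For reference, the relevant part of mem_cgroup_move_account() is
roughly:

	if (!anon && page_mapped(page)) {
		/* move FILE_MAPPED from old to new memcg, under move_lock */
		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
	}

i.e. what matters is page_mapped(), not page->mapping.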
> Call me silly...but I really get dizzy by those locks now, need to
> have a run to refresh my head... : (
Yeah, that part is funny for a certain reading of the word funny ;)
--
Michal Hocko
SUSE Labs
* Re: [PATCH] memcg: simplify lock of memcg page stat accounting
2013-01-30 9:12 ` Michal Hocko
@ 2013-01-30 14:57 ` Sha Zhengju
0 siblings, 0 replies; 9+ messages in thread
From: Sha Zhengju @ 2013-01-30 14:57 UTC (permalink / raw)
To: Michal Hocko
Cc: Kamezawa Hiroyuki, cgroups, linux-mm, akpm, gthelen, hannes,
hughd, Sha Zhengju
On Wed, Jan 30, 2013 at 5:12 PM, Michal Hocko <mhocko@suse.cz> wrote:
> On Tue 29-01-13 23:29:35, Sha Zhengju wrote:
>> On Tue, Jan 29, 2013 at 8:41 AM, Kamezawa Hiroyuki
>> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> > (2013/01/26 20:12), Sha Zhengju wrote:
>> >> From: Sha Zhengju <handai.szj@taobao.com>
>> >>
>> >> After removing duplicated information like PCG_*
>> >> flags in 'struct page_cgroup'(commit 2ff76f1193), there's a problem
>> >> between "move" and "page stat accounting"(only FILE_MAPPED is supported
>> >> now but other stats will be added in future):
>> >> assume CPU-A does "page stat accounting" and CPU-B does "move"
>> >>
>> >> CPU-A CPU-B
>> >> TestSet PG_dirty
>> >> (delay) move_lock_mem_cgroup()
>> >> if (PageDirty(page)) {
>> >> old_memcg->nr_dirty --
>> >> new_memcg->nr_dirty++
>> >> }
>> >> pc->mem_cgroup = new_memcg;
>> >> move_unlock_mem_cgroup()
>> >>
>> >> move_lock_mem_cgroup()
>> >> memcg = pc->mem_cgroup
>> >> memcg->nr_dirty++
>> >> move_unlock_mem_cgroup()
>> >>
>> >> while accounting information of new_memcg may be double-counted. So we
>> >> use a bigger lock to solve this problem: (commit: 89c06bd52f)
>> >>
>> >> move_lock_mem_cgroup() <-- mem_cgroup_begin_update_page_stat()
>> >> TestSetPageDirty(page)
>> >> update page stats (without any checks)
>> >> move_unlock_mem_cgroup() <-- mem_cgroup_begin_update_page_stat()
>> >>
>> >>
>> >> But this method also has its pros and cons: at present we use two layers
>> >> of lock avoidance(memcg_moving and memcg->moving_account) then spinlock
>> >> on memcg (see mem_cgroup_begin_update_page_stat()), but the lock granularity
>> >> is a little bigger that not only the critical section but also some code
>> >> logic is in the range of locking which may be deadlock prone. As dirty
>> >> writeack stats are added, it gets into further difficulty with the page
>> >> cache radix tree lock and it seems that the lock requires nesting.
>> >> (https://lkml.org/lkml/2013/1/2/48)
>> >>
>> >> So in order to make the lock simpler and clearer and also avoid the 'nesting'
>> >> problem, a choice may be:
>> >> (CPU-A does "page stat accounting" and CPU-B does "move")
>> >>
>> >> CPU-A CPU-B
>> >>
>> >> move_lock_mem_cgroup()
>> >> memcg = pc->mem_cgroup
>> >> TestSetPageDirty(page)
>> >> move_unlock_mem_cgroup()
>> >> move_lock_mem_cgroup()
>> >> if (PageDirty) {
>> >> old_memcg->nr_dirty --;
>> >> new_memcg->nr_dirty ++;
>> >> }
>> >> pc->mem_cgroup = new_memcg
>> >> move_unlock_mem_cgroup()
>> >>
>> >> memcg->nr_dirty ++
>> >>
>> >
>> > Hmm. no race with file truncate ?
>> >
>>
>> Do you mean "dirty page accounting" racing with truncate? Yes, if
>> another one do truncate and set page->mapping=NULL just before CPU-A's
>> 'memcg->nr_dirty ++', then it'll have no change to correct the figure
>> back. So my rough idea now is to have some small changes to
>> __set_page_dirty/__set_page_dirty_nobuffers that do SetDirtyPage
>> inside ->tree_lock.
>>
>> But, in current codes, is there any chance that
>> mem_cgroup_move_account() racing with truncate that PageAnon is
>> false(since page->mapping is cleared) but later in page_remove_rmap()
>> the new memcg stats is over decrement...?
>
> We are not checking page->mapping but rather page_mapped() which
> checks page->_mapcount and that is protected from races with
> mem_cgroup_move_account by mem_cgroup_begin_update_page_stat locking.
> Makes sense?
Yeah. I think you are right: what matters here is page->_mapcount, and
even if page->mapping is cleared the !anon check is still true. : )
>
>> Call me silly...but I really get dizzy by those locks now, need to
>> have a run to refresh my head... : (
>
> Yeah, that part is funny for a certain reading of the word funny ;)
> --
> Michal Hocko
> SUSE Labs
--
Thanks,
Sha