* [RFC] [PATCH 0/7 v2] memcg: page_cgroup diet
@ 2012-01-13 8:30 KAMEZAWA Hiroyuki
2012-01-13 8:32 ` [RFC] [PATCH 1/7 v2] memcg: remove unnecessary check in mem_cgroup_update_page_stat() KAMEZAWA Hiroyuki
` (6 more replies)
0 siblings, 7 replies; 50+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-01-13 8:30 UTC (permalink / raw)
To: linux-mm@kvack.org
Cc: Ying Han, hugh.dickins@tiscali.co.uk, hannes@cmpxchg.org,
Michal Hocko, cgroups, bsingharora@gmail.com
This is just an RFC, dumping my queue to share it and gather better ideas.
The patch ordering may not be clean. Advice is welcome.
Now, struct page_cgroup is defined as
==
struct page_cgroup {
unsigned long flags;
struct mem_cgroup *mem_cgroup;
};
==
We want to remove ->flags to shrink the structure (and eventually integrate it
into 'struct page'). To do that, we need to remove some flags.
Currently, the flags are defined as
==
PCG_LOCK, /* Lock for pc->mem_cgroup and following bits. */
PCG_CACHE, /* charged as cache */
PCG_USED, /* this object is in use. */
PCG_MIGRATION, /* under page migration */
/* flags for mem_cgroup and file and I/O status */
PCG_MOVE_LOCK, /* For race between move_account v.s. following bits */
PCG_FILE_MAPPED, /* page is accounted as "mapped" */
==
We have 6 bits now.
This patch series removes PCG_CACHE, PCG_MOVE_LOCK and PCG_FILE_MAPPED.
Then, if we use the low 3 bits of ->mem_cgroup for PCG_LOCK, PCG_USED and
PCG_MIGRATION, we can remove pc->flags entirely, I think.
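For illustration, the packing could look roughly like this (a sketch only,
not part of this series; the helper names are hypothetical). It relies on
struct mem_cgroup being at least 8-byte aligned, so the low 3 bits of the
pointer are always zero:
==
/* sketch: keep the remaining flags in the low bits of the pointer */
#define PCG_LOCK_BIT		0x1UL
#define PCG_USED_BIT		0x2UL
#define PCG_MIGRATION_BIT	0x4UL
#define PCG_FLAGS_MASK		0x7UL

static inline struct mem_cgroup *pc_to_mem_cgroup(struct page_cgroup *pc)
{
	/* mask off the flag bits to recover the real pointer */
	return (struct mem_cgroup *)
		((unsigned long)pc->mem_cgroup & ~PCG_FLAGS_MASK);
}

static inline bool pc_test_flag(struct page_cgroup *pc, unsigned long bit)
{
	return ((unsigned long)pc->mem_cgroup & bit) != 0;
}
==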
To remove the flags, this series modifies the page-stat accounting. After this
set, per-memcg page stat accounting will be:
mem_cgroup_begin_update_page_stat() --(A)
modify page status
mem_cgroup_update_page_stat()
mem_cgroup_end_update_page_stat() --(B)
Between (A) and (B), it is guaranteed that the page's pc->mem_cgroup will not
be moved.
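For example, a stat updater such as page_add_file_rmap() then looks
roughly like this (a sketch; patch 4/7 makes this change in mm/rmap.c):
==
void page_add_file_rmap(struct page *page)
{
	/* (A) pin pc->mem_cgroup against a concurrent move_account() */
	bool locked = mem_cgroup_begin_update_page_stat(page);

	if (atomic_inc_and_test(&page->_mapcount)) {
		__inc_zone_page_state(page, NR_FILE_MAPPED);
		mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_MAPPED);
	}
	/* (B) let move_account() proceed again */
	mem_cgroup_end_update_page_stat(page, locked);
}
==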
By this change, move_account() can make use of the page's own information
rather than page_cgroup flags, and we don't have to duplicate state in
page_cgroup. I think this is saner and allows us to add more per-memcg
vmstat counters without any new flags.
I'm still testing, but I don't see any additional overhead so far.
Thanks,
-Kame
* [RFC] [PATCH 1/7 v2] memcg: remove unnecessary check in mem_cgroup_update_page_stat()
2012-01-13 8:30 [RFC] [PATCH 0/7 v2] memcg: page_cgroup diet KAMEZAWA Hiroyuki
@ 2012-01-13 8:32 ` KAMEZAWA Hiroyuki
2012-01-17 15:16 ` Michal Hocko
2012-01-13 8:33 ` [RFC] [PATCH 2/7 v2] memcg: add memory barrier for checking account move KAMEZAWA Hiroyuki
` (5 subsequent siblings)
6 siblings, 1 reply; 50+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-01-13 8:32 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, Ying Han, hugh.dickins@tiscali.co.uk,
hannes@cmpxchg.org, Michal Hocko, cgroups, bsingharora@gmail.com
* [RFC] [PATCH 2/7 v2] memcg: add memory barrier for checking account move.
2012-01-13 8:30 [RFC] [PATCH 0/7 v2] memcg: page_cgroup diet KAMEZAWA Hiroyuki
2012-01-13 8:32 ` [RFC] [PATCH 1/7 v2] memcg: remove unnecessary check in mem_cgroup_update_page_stat() KAMEZAWA Hiroyuki
@ 2012-01-13 8:33 ` KAMEZAWA Hiroyuki
2012-01-17 15:26 ` Michal Hocko
2012-01-13 8:40 ` [RFC] [PATCH 3/7 v2] memcg: remove PCG_MOVE_LOCK flag from pc->flags KAMEZAWA Hiroyuki
` (4 subsequent siblings)
6 siblings, 1 reply; 50+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-01-13 8:33 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, Ying Han, hugh.dickins@tiscali.co.uk,
hannes@cmpxchg.org, Michal Hocko, cgroups, bsingharora@gmail.com
I think this bugfix is needed before going ahead. thoughts?
==
* [RFC] [PATCH 3/7 v2] memcg: remove PCG_MOVE_LOCK flag from pc->flags
2012-01-13 8:30 [RFC] [PATCH 0/7 v2] memcg: page_cgroup diet KAMEZAWA Hiroyuki
2012-01-13 8:32 ` [RFC] [PATCH 1/7 v2] memcg: remove unnecessary check in mem_cgroup_update_page_stat() KAMEZAWA Hiroyuki
2012-01-13 8:33 ` [RFC] [PATCH 2/7 v2] memcg: add memory barrier for checking account move KAMEZAWA Hiroyuki
@ 2012-01-13 8:40 ` KAMEZAWA Hiroyuki
2012-01-16 12:55 ` Kirill A. Shutemov
` (2 more replies)
2012-01-13 8:41 ` [RFC] [PATCH 4/7 v2] memcg: new scheme to update per-memcg page stat accounting KAMEZAWA Hiroyuki
` (3 subsequent siblings)
6 siblings, 3 replies; 50+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-01-13 8:40 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, Ying Han, hugh.dickins@tiscali.co.uk,
hannes@cmpxchg.org, Michal Hocko, cgroups, bsingharora@gmail.com
* [RFC] [PATCH 4/7 v2] memcg: new scheme to update per-memcg page stat accounting.
2012-01-13 8:30 [RFC] [PATCH 0/7 v2] memcg: page_cgroup diet KAMEZAWA Hiroyuki
` (2 preceding siblings ...)
2012-01-13 8:40 ` [RFC] [PATCH 3/7 v2] memcg: remove PCG_MOVE_LOCK flag from pc->flags KAMEZAWA Hiroyuki
@ 2012-01-13 8:41 ` KAMEZAWA Hiroyuki
2012-01-18 16:45 ` Michal Hocko
2012-01-26 19:01 ` Ying Han
2012-01-13 8:42 ` [RFC] [PATCH 5/7 v2] memcg: remove PCG_FILE_MAPPED KAMEZAWA Hiroyuki
` (2 subsequent siblings)
6 siblings, 2 replies; 50+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-01-13 8:41 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, Ying Han, hugh.dickins@tiscali.co.uk,
hannes@cmpxchg.org, Michal Hocko, cgroups, bsingharora@gmail.com
* [RFC] [PATCH 5/7 v2] memcg: remove PCG_FILE_MAPPED
2012-01-13 8:30 [RFC] [PATCH 0/7 v2] memcg: page_cgroup diet KAMEZAWA Hiroyuki
` (3 preceding siblings ...)
2012-01-13 8:41 ` [RFC] [PATCH 4/7 v2] memcg: new scheme to update per-memcg page stat accounting KAMEZAWA Hiroyuki
@ 2012-01-13 8:42 ` KAMEZAWA Hiroyuki
2012-01-19 14:07 ` Michal Hocko
2012-01-13 8:43 ` [RFC] [PATCH 6/7 v2] memcg: remove PCG_CACHE KAMEZAWA Hiroyuki
2012-01-13 8:45 ` [RFC] [PATCH 7/7 v2] memcg: make mem_cgroup_begin_update_stat to use global pcpu KAMEZAWA Hiroyuki
6 siblings, 1 reply; 50+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-01-13 8:42 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, Ying Han, hugh.dickins@tiscali.co.uk,
hannes@cmpxchg.org, Michal Hocko, cgroups, bsingharora@gmail.com
* [RFC] [PATCH 6/7 v2] memcg: remove PCG_CACHE
2012-01-13 8:30 [RFC] [PATCH 0/7 v2] memcg: page_cgroup diet KAMEZAWA Hiroyuki
` (4 preceding siblings ...)
2012-01-13 8:42 ` [RFC] [PATCH 5/7 v2] memcg: remove PCG_FILE_MAPPED KAMEZAWA Hiroyuki
@ 2012-01-13 8:43 ` KAMEZAWA Hiroyuki
2012-01-13 8:45 ` [RFC] [PATCH 7/7 v2] memcg: make mem_cgroup_begin_update_stat to use global pcpu KAMEZAWA Hiroyuki
6 siblings, 0 replies; 50+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-01-13 8:43 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, Ying Han, hugh.dickins@tiscali.co.uk,
hannes@cmpxchg.org, Michal Hocko, cgroups, bsingharora@gmail.com
This patch can be cut out of this series as an independent patch.
==
* [RFC] [PATCH 7/7 v2] memcg: make mem_cgroup_begin_update_stat to use global pcpu.
2012-01-13 8:30 [RFC] [PATCH 0/7 v2] memcg: page_cgroup diet KAMEZAWA Hiroyuki
` (5 preceding siblings ...)
2012-01-13 8:43 ` [RFC] [PATCH 6/7 v2] memcg: remove PCG_CACHE KAMEZAWA Hiroyuki
@ 2012-01-13 8:45 ` KAMEZAWA Hiroyuki
2012-01-19 14:47 ` Michal Hocko
2012-01-20 8:40 ` Greg Thelen
6 siblings, 2 replies; 50+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-01-13 8:45 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, Ying Han, hugh.dickins@tiscali.co.uk,
hannes@cmpxchg.org, Michal Hocko, cgroups, bsingharora@gmail.com
* Re: [RFC] [PATCH 3/7 v2] memcg: remove PCG_MOVE_LOCK flag from pc->flags
2012-01-13 8:40 ` [RFC] [PATCH 3/7 v2] memcg: remove PCG_MOVE_LOCK flag from pc->flags KAMEZAWA Hiroyuki
@ 2012-01-16 12:55 ` Kirill A. Shutemov
2012-01-17 0:22 ` KAMEZAWA Hiroyuki
2012-01-17 16:46 ` Michal Hocko
2012-01-23 22:02 ` Ying Han
2 siblings, 1 reply; 50+ messages in thread
From: Kirill A. Shutemov @ 2012-01-16 12:55 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, Ying Han, hugh.dickins@tiscali.co.uk,
hannes@cmpxchg.org, Michal Hocko, cgroups, bsingharora@gmail.com
On Fri, Jan 13, 2012 at 05:40:19PM +0900, KAMEZAWA Hiroyuki wrote:
>
> From 1008e84d94245b1e7c4d237802ff68ff00757736 Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Date: Thu, 12 Jan 2012 15:53:24 +0900
> Subject: [PATCH 3/7] memcg: remove PCG_MOVE_LOCK flag from pc->flags.
>
> PCG_MOVE_LOCK bit is used as a bit spinlock to avoid races between
> memcg's account moving and page state statistics updates.
>
> Considering that page-statistics updates are a very hot path, this lock is
> taken only when someone is moving accounts (or PageTransHuge()).
> And, now, all account moving between memcgs (by task-move)
> is serialized.
>
> So, it seems too costly to have 1 bit per page for this purpose.
>
> This patch removes PCG_MOVE_LOCK and adds a hashed rwlock array
> instead. This works well enough. Even when we need to
> take the lock, we don't need to disable IRQ in the hot path
> because we use an rwlock.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
...
> +#define NR_MOVE_ACCOUNT_LOCKS (NR_CPUS)
> +#define move_account_hash(page) ((page_to_pfn(page) % NR_MOVE_ACCOUNT_LOCKS))
You still tend to add too many parentheses in macros ;)
--
Kirill A. Shutemov
* Re: [RFC] [PATCH 3/7 v2] memcg: remove PCG_MOVE_LOCK flag from pc->flags
2012-01-16 12:55 ` Kirill A. Shutemov
@ 2012-01-17 0:22 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 50+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-01-17 0:22 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: linux-mm@kvack.org, Ying Han, hugh.dickins@tiscali.co.uk,
hannes@cmpxchg.org, Michal Hocko, cgroups, bsingharora@gmail.com
On Mon, 16 Jan 2012 14:55:26 +0200
"Kirill A. Shutemov" <kirill@shutemov.name> wrote:
> On Fri, Jan 13, 2012 at 05:40:19PM +0900, KAMEZAWA Hiroyuki wrote:
> >
> > From 1008e84d94245b1e7c4d237802ff68ff00757736 Mon Sep 17 00:00:00 2001
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > Date: Thu, 12 Jan 2012 15:53:24 +0900
> > Subject: [PATCH 3/7] memcg: remove PCG_MOVE_LOCK flag from pc->flags.
> >
> > PCG_MOVE_LOCK bit is used as a bit spinlock to avoid races between
> > memcg's account moving and page state statistics updates.
> >
> > Considering that page-statistics updates are a very hot path, this lock is
> > taken only when someone is moving accounts (or PageTransHuge()).
> > And, now, all account moving between memcgs (by task-move)
> > is serialized.
> >
> > So, it seems too costly to have 1 bit per page for this purpose.
> >
> > This patch removes PCG_MOVE_LOCK and adds a hashed rwlock array
> > instead. This works well enough. Even when we need to
> > take the lock, we don't need to disable IRQ in the hot path
> > because we use an rwlock.
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> ...
>
> > +#define NR_MOVE_ACCOUNT_LOCKS (NR_CPUS)
> > +#define move_account_hash(page) ((page_to_pfn(page) % NR_MOVE_ACCOUNT_LOCKS))
>
> You still tend to add too many parentheses in macros ;)
>
Ah, yes. Maybe a bad habit of mine.
Will be fixed in the next patch.
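For reference, the same macros with the redundant parentheses dropped
would simply be:
==
#define NR_MOVE_ACCOUNT_LOCKS	NR_CPUS
#define move_account_hash(page)	(page_to_pfn(page) % NR_MOVE_ACCOUNT_LOCKS)
==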
Thanks,
-Kame
* Re: [RFC] [PATCH 1/7 v2] memcg: remove unnecessary check in mem_cgroup_update_page_stat()
2012-01-13 8:32 ` [RFC] [PATCH 1/7 v2] memcg: remove unnecessary check in mem_cgroup_update_page_stat() KAMEZAWA Hiroyuki
@ 2012-01-17 15:16 ` Michal Hocko
2012-01-17 23:55 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 50+ messages in thread
From: Michal Hocko @ 2012-01-17 15:16 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, Ying Han, hugh.dickins@tiscali.co.uk,
hannes@cmpxchg.org, cgroups, bsingharora@gmail.com
On Fri 13-01-12 17:32:27, KAMEZAWA Hiroyuki wrote:
>
> From 788aebf15f3fa37940e0745cab72547e20683bf2 Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Date: Thu, 12 Jan 2012 16:08:33 +0900
> Subject: [PATCH 1/7] memcg: remove unnecessary check in mem_cgroup_update_page_stat()
>
> commit 10ea69f1182b removes move_lock_page_cgroup() in thp-split path.
> So, this PageTransHuge() check is unnecessary, too.
I do not see a commit like that in the tree. I guess you meant
"memcg: make mem_cgroup_split_huge_fixup() more efficient", which is not
merged yet, right?
>
> Note:
> - considering when mem_cgroup_update_page_stat() is called,
> there will be no race between split_huge_page() and update_page_stat().
> All required locks are held in higher level.
We should never have a THP page in this path in the first place. So why
not change this to VM_BUG_ON(PageTransHuge)?
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
> mm/memcontrol.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 21ba356..08b988d 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1866,7 +1866,7 @@ void mem_cgroup_update_page_stat(struct page *page,
> if (unlikely(!memcg || !PageCgroupUsed(pc)))
> goto out;
> /* pc->mem_cgroup is unstable ? */
> - if (unlikely(mem_cgroup_stealed(memcg)) || PageTransHuge(page)) {
> + if (unlikely(mem_cgroup_stealed(memcg))) {
> /* take a lock against to access pc->mem_cgroup */
> move_lock_page_cgroup(pc, &flags);
> need_unlock = true;
> --
> 1.7.4.1
>
>
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic
* Re: [RFC] [PATCH 2/7 v2] memcg: add memory barrier for checking account move.
2012-01-13 8:33 ` [RFC] [PATCH 2/7 v2] memcg: add memory barrier for checking account move KAMEZAWA Hiroyuki
@ 2012-01-17 15:26 ` Michal Hocko
2012-01-18 0:06 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 50+ messages in thread
From: Michal Hocko @ 2012-01-17 15:26 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, Ying Han, hugh.dickins@tiscali.co.uk,
hannes@cmpxchg.org, cgroups, bsingharora@gmail.com
On Fri 13-01-12 17:33:47, KAMEZAWA Hiroyuki wrote:
> I think this bugfix is needed before going ahead. thoughts?
> ==
> From 2cb491a41782b39aae9f6fe7255b9159ac6c1563 Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Date: Fri, 13 Jan 2012 14:27:20 +0900
> Subject: [PATCH 2/7] memcg: add memory barrier for checking account move.
>
> When move_account() starts, the source memcg's per-cpu variable
> MEM_CGROUP_ON_MOVE is set. The page status update
> routine checks it under rcu_read_lock(), but there is no memory
> barrier. This patch adds one.
OK, this would help enforce that the CPU sees the current value,
but what prevents us from racing with the value update without the
lock? This is as racy as it was before, AFAICS.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
> mm/memcontrol.c | 5 ++++-
> 1 files changed, 4 insertions(+), 1 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 08b988d..9019069 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1256,8 +1256,10 @@ static void mem_cgroup_start_move(struct mem_cgroup *memcg)
>
> get_online_cpus();
> spin_lock(&memcg->pcp_counter_lock);
> - for_each_online_cpu(cpu)
> + for_each_online_cpu(cpu) {
> per_cpu(memcg->stat->count[MEM_CGROUP_ON_MOVE], cpu) += 1;
> + smp_wmb();
> + }
> memcg->nocpu_base.count[MEM_CGROUP_ON_MOVE] += 1;
> spin_unlock(&memcg->pcp_counter_lock);
> put_online_cpus();
> @@ -1294,6 +1296,7 @@ static void mem_cgroup_end_move(struct mem_cgroup *memcg)
> static bool mem_cgroup_stealed(struct mem_cgroup *memcg)
> {
> VM_BUG_ON(!rcu_read_lock_held());
> + smp_rmb();
> return this_cpu_read(memcg->stat->count[MEM_CGROUP_ON_MOVE]) > 0;
> }
>
> --
> 1.7.4.1
>
>
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic
* Re: [RFC] [PATCH 3/7 v2] memcg: remove PCG_MOVE_LOCK flag from pc->flags
2012-01-13 8:40 ` [RFC] [PATCH 3/7 v2] memcg: remove PCG_MOVE_LOCK flag from pc->flags KAMEZAWA Hiroyuki
2012-01-16 12:55 ` Kirill A. Shutemov
@ 2012-01-17 16:46 ` Michal Hocko
2012-01-18 0:12 ` KAMEZAWA Hiroyuki
2012-01-23 22:02 ` Ying Han
2 siblings, 1 reply; 50+ messages in thread
From: Michal Hocko @ 2012-01-17 16:46 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, Ying Han, hugh.dickins@tiscali.co.uk,
hannes@cmpxchg.org, cgroups, bsingharora@gmail.com
On Fri 13-01-12 17:40:19, KAMEZAWA Hiroyuki wrote:
>
> From 1008e84d94245b1e7c4d237802ff68ff00757736 Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Date: Thu, 12 Jan 2012 15:53:24 +0900
> Subject: [PATCH 3/7] memcg: remove PCG_MOVE_LOCK flag from pc->flags.
>
> PCG_MOVE_LOCK bit is used as a bit spinlock to avoid races between
> memcg's account moving and page state statistics updates.
>
> Considering that page-statistics updates are a very hot path, this lock is
> taken only when someone is moving accounts (or PageTransHuge())
Outdated comment? THP is not an issue here.
> And, now, all account moving between memcgs (by task-move)
> is serialized.
>
> So, it seems too costly to have 1 bit per page for this purpose.
>
> This patch removes PCG_MOVE_LOCK and adds a hashed rwlock array
> instead. This works well enough. Even when we need to
> take the lock,
Hmmm, rwlocks are not very popular these days.
Anyway, can we rather make it a (source-)memcg (bit)spinlock instead? We
would reduce false sharing this way and would penalize only pages from
the moving group.
The patch seems to be correct, but I do not like the rwlock part very
much. I would replace it if possible.
[...]
Thanks
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic
* Re: [RFC] [PATCH 1/7 v2] memcg: remove unnecessary check in mem_cgroup_update_page_stat()
2012-01-17 15:16 ` Michal Hocko
@ 2012-01-17 23:55 ` KAMEZAWA Hiroyuki
2012-01-18 13:01 ` Michal Hocko
0 siblings, 1 reply; 50+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-01-17 23:55 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-mm@kvack.org, Ying Han, hugh.dickins@tiscali.co.uk,
hannes@cmpxchg.org, cgroups, bsingharora@gmail.com
On Tue, 17 Jan 2012 16:16:20 +0100
Michal Hocko <mhocko@suse.cz> wrote:
> On Fri 13-01-12 17:32:27, KAMEZAWA Hiroyuki wrote:
> >
> > From 788aebf15f3fa37940e0745cab72547e20683bf2 Mon Sep 17 00:00:00 2001
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > Date: Thu, 12 Jan 2012 16:08:33 +0900
> > Subject: [PATCH 1/7] memcg: remove unnecessary check in mem_cgroup_update_page_stat()
> >
> > commit 10ea69f1182b removes move_lock_page_cgroup() in thp-split path.
> > So, this PageTransHuge() check is unnecessary, too.
>
> I do not see a commit like that in the tree. I guess you meant
> "memcg: make mem_cgroup_split_huge_fixup() more efficient", which is not
> merged yet, right?
>
This commit is in linux-next:
> commit e94c8a9cbce1aee4af9e1285802785481b7f93c5
> Author: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Date: Thu Jan 12 17:18:20 2012 -0800
>
> memcg: make mem_cgroup_split_huge_fixup() more efficient
>
> >
> > Note:
> > - considering when mem_cgroup_update_page_stat() is called,
> > there will be no race between split_huge_page() and update_page_stat().
> > All required locks are held in higher level.
>
> We should never have a THP page in this path in the first place. So why
> not change this to VM_BUG_ON(PageTransHuge)?
>
Ying Han is considering supporting mlock stats.
Thanks,
-Kame
* Re: [RFC] [PATCH 2/7 v2] memcg: add memory barrier for checking account move.
2012-01-17 15:26 ` Michal Hocko
@ 2012-01-18 0:06 ` KAMEZAWA Hiroyuki
2012-01-18 12:37 ` Michal Hocko
0 siblings, 1 reply; 50+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-01-18 0:06 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-mm@kvack.org, Ying Han, hugh.dickins@tiscali.co.uk,
hannes@cmpxchg.org, cgroups, bsingharora@gmail.com
On Tue, 17 Jan 2012 16:26:35 +0100
Michal Hocko <mhocko@suse.cz> wrote:
> On Fri 13-01-12 17:33:47, KAMEZAWA Hiroyuki wrote:
> > I think this bugfix is needed before going ahead. thoughts?
> > ==
> > From 2cb491a41782b39aae9f6fe7255b9159ac6c1563 Mon Sep 17 00:00:00 2001
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > Date: Fri, 13 Jan 2012 14:27:20 +0900
> > Subject: [PATCH 2/7] memcg: add memory barrier for checking account move.
> >
> > When move_account() starts, the source memcg's per-cpu variable
> > MEM_CGROUP_ON_MOVE is set. The page status update
> > routine checks it under rcu_read_lock(), but there is no memory
> > barrier. This patch adds one.
>
> OK, this would help enforce that the CPU sees the current value,
> but what prevents us from racing with the value update without the
> lock? This is as racy as it was before, AFAICS.
>
Hm, do I misunderstand?
==
update                       reference

CPU A                        CPU B
set value                    rcu_read_lock()
smp_wmb()                    smp_rmb()
                             read_value
                             rcu_read_unlock()
synchronize_rcu()
==
I expect:
If synchronize_rcu() is called before rcu_read_lock() => move_lock_xxx will be held.
If synchronize_rcu() is called after rcu_read_lock() => the update will be delayed.
Here, CPU B needs to read the most recently updated value.
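In code form, the intended pairing is (a sketch based on patch 2/7;
synchronize_rcu() is issued by the move path after the flag is set):
==
/* updater side (mem_cgroup_start_move(), CPU A) */
per_cpu(memcg->stat->count[MEM_CGROUP_ON_MOVE], cpu) += 1;
smp_wmb();		/* publish the flag before waiting for readers */
...
synchronize_rcu();	/* wait out all pre-existing rcu read sections */

/* reader side (mem_cgroup_stealed(), CPU B, under rcu_read_lock()) */
smp_rmb();		/* pairs with the smp_wmb() above */
return this_cpu_read(memcg->stat->count[MEM_CGROUP_ON_MOVE]) > 0;
==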
Thanks,
-Kame
* Re: [RFC] [PATCH 3/7 v2] memcg: remove PCG_MOVE_LOCK flag from pc->flags
2012-01-17 16:46 ` Michal Hocko
@ 2012-01-18 0:12 ` KAMEZAWA Hiroyuki
2012-01-18 10:47 ` Michal Hocko
0 siblings, 1 reply; 50+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-01-18 0:12 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-mm@kvack.org, Ying Han, hugh.dickins@tiscali.co.uk,
hannes@cmpxchg.org, cgroups, bsingharora@gmail.com
On Tue, 17 Jan 2012 17:46:05 +0100
Michal Hocko <mhocko@suse.cz> wrote:
> On Fri 13-01-12 17:40:19, KAMEZAWA Hiroyuki wrote:
> >
> > From 1008e84d94245b1e7c4d237802ff68ff00757736 Mon Sep 17 00:00:00 2001
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > Date: Thu, 12 Jan 2012 15:53:24 +0900
> > Subject: [PATCH 3/7] memcg: remove PCG_MOVE_LOCK flag from pc->flags.
> >
> > PCG_MOVE_LOCK bit is used as a bit spinlock to avoid races between
> > memcg's account moving and page state statistics updates.
> >
> > Considering that page-statistics updates are a very hot path, this lock is
> > taken only when someone is moving accounts (or PageTransHuge())
>
> Outdated comment? THP is not an issue here.
>
Ah, sorry. I reordered the patches.
> > And, now, all account moving between memcgs (by task-move)
> > is serialized.
> >
> > So, it seems too costly to have 1 bit per page for this purpose.
> >
> > This patch removes PCG_MOVE_LOCK and adds a hashed rwlock array
> > instead. This works well enough. Even when we need to
> > take the lock,
>
> Hmmm, rwlocks are not very popular these days.
> Anyway, can we rather make it a (source-)memcg (bit)spinlock instead? We
> would reduce false sharing this way and would penalize only pages from
> the moving group.
>
A per-memcg spinlock? The reason I used an rwlock is to avoid disabling
IRQs. This routine will be called from IRQ context (for dirty ratio
support). So, disabling IRQs will be required if we use a spinlock.
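In other words (a sketch; the lock-array name is hypothetical, and only
the writer has to disable IRQs):
==
/* hot path (stat update), may also run in IRQ context: */
read_lock(&move_account_lock[hash]);	/* readers never block readers, so
					 * an IRQ reader interrupting another
					 * reader cannot deadlock */
...
read_unlock(&move_account_lock[hash]);

/* slow path (move_account()), the only writer: */
write_lock_irqsave(&move_account_lock[hash], flags);
...
write_unlock_irqrestore(&move_account_lock[hash], flags);
==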
Thanks,
-Kame
* Re: [RFC] [PATCH 3/7 v2] memcg: remove PCG_MOVE_LOCK flag from pc->flags
2012-01-18 0:12 ` KAMEZAWA Hiroyuki
@ 2012-01-18 10:47 ` Michal Hocko
2012-01-18 23:53 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 50+ messages in thread
From: Michal Hocko @ 2012-01-18 10:47 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, Ying Han, hugh.dickins@tiscali.co.uk,
hannes@cmpxchg.org, cgroups, bsingharora@gmail.com
On Wed 18-01-12 09:12:26, KAMEZAWA Hiroyuki wrote:
> On Tue, 17 Jan 2012 17:46:05 +0100
> Michal Hocko <mhocko@suse.cz> wrote:
>
> > On Fri 13-01-12 17:40:19, KAMEZAWA Hiroyuki wrote:
[...]
> > > This patch removes PCG_MOVE_LOCK and adds a hashed rwlock array
> > > instead. This works well enough. Even when we need to
> > > take the lock,
> >
> > > Hmmm, rwlocks are not very popular these days.
> > > Anyway, can we rather make it a (source-)memcg (bit)spinlock instead? We
> > > would reduce false sharing this way and would penalize only pages from
> > > the moving group.
> >
> A per-memcg spinlock?
Yes
> The reason I used an rwlock is to avoid disabling IRQs. This routine
> will be called from IRQ context (for dirty ratio support). So,
> disabling IRQs will be required if we use a spinlock.
OK, I missed the comment about disabling IRQs. It's true that we do
not have to be afraid of deadlocks if the lock is only taken for
reading from IRQ context, but does the spinlock make a performance
bottleneck? We are talking about the slowpath.
I could see the reason for the read lock when using hashed locks, because
they are global, but if we make the lock per-memcg then we shouldn't
interfere with other updates which are not blocked by the move.
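A rough sketch of the per-memcg variant (hypothetical names, not code
from this series):
==
struct mem_cgroup {
	...
	/* taken only while this memcg is the source of an account move */
	spinlock_t	move_lock;
};

static void move_lock_mem_cgroup(struct mem_cgroup *memcg,
				 unsigned long *flags)
{
	spin_lock_irqsave(&memcg->move_lock, *flags);
}

static void move_unlock_mem_cgroup(struct mem_cgroup *memcg,
				   unsigned long *flags)
{
	spin_unlock_irqrestore(&memcg->move_lock, *flags);
}
==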
> Thanks,
> -Kame
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic
* Re: [RFC] [PATCH 2/7 v2] memcg: add memory barrier for checking account move.
2012-01-18 0:06 ` KAMEZAWA Hiroyuki
@ 2012-01-18 12:37 ` Michal Hocko
2012-01-19 2:17 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 50+ messages in thread
From: Michal Hocko @ 2012-01-18 12:37 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, Ying Han, hugh.dickins@tiscali.co.uk,
hannes@cmpxchg.org, cgroups, bsingharora@gmail.com
On Wed 18-01-12 09:06:56, KAMEZAWA Hiroyuki wrote:
> On Tue, 17 Jan 2012 16:26:35 +0100
> Michal Hocko <mhocko@suse.cz> wrote:
>
> > On Fri 13-01-12 17:33:47, KAMEZAWA Hiroyuki wrote:
> > > I think this bugfix is needed before going ahead. thoughts?
> > > ==
> > > From 2cb491a41782b39aae9f6fe7255b9159ac6c1563 Mon Sep 17 00:00:00 2001
> > > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > Date: Fri, 13 Jan 2012 14:27:20 +0900
> > > Subject: [PATCH 2/7] memcg: add memory barrier for checking account move.
> > >
> > > When move_account() starts, the source memcg's per-cpu variable
> > > MEM_CGROUP_ON_MOVE is set. The page status update
> > > routine checks it under rcu_read_lock(), but there is no memory
> > > barrier. This patch adds one.
> >
> > OK, this would help enforce that the CPU sees the current value,
> > but what prevents us from racing with the value update without the
> > lock? This is as racy as it was before, AFAICS.
> >
>
> Hm, do I misunderstand ?
> ==
> update reference
>
> CPU A CPU B
> set value rcu_read_lock()
> smp_wmb() smp_rmb()
> read_value
> rcu_read_unlock()
> synchronize_rcu().
> ==
> I expect
> If synchronize_rcu() is called before rcu_read_lock() => move_lock_xxx will be held.
> If synchronize_rcu() is called after rcu_read_lock() => update will be delayed.
Ahh, OK, I can see it now. Readers are not that important because it is
actually the updater who is delayed until all preexisting rcu read
sections are finished.
In that case, why do we need both barriers? spin_unlock is a full
barrier, so maybe we just need an smp_rmb before we read the value to
make sure that we do not get a stale value when we start an rcu read
section after synchronize_rcu?
> Here, CPU B needs to read the most recently updated value.
If it reads the old value then it would think that we are not moving and
so we would account to the old group and move it later on, right?
>
> Thanks,
> -Kame
>
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic
* Re: [RFC] [PATCH 1/7 v2] memcg: remove unnecessary check in mem_cgroup_update_page_stat()
2012-01-17 23:55 ` KAMEZAWA Hiroyuki
@ 2012-01-18 13:01 ` Michal Hocko
2012-01-19 2:18 ` KAMEZAWA Hiroyuki
2012-01-19 20:07 ` Ying Han
0 siblings, 2 replies; 50+ messages in thread
From: Michal Hocko @ 2012-01-18 13:01 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, Ying Han, hugh.dickins@tiscali.co.uk,
hannes@cmpxchg.org, cgroups, bsingharora@gmail.com
On Wed 18-01-12 08:55:58, KAMEZAWA Hiroyuki wrote:
> On Tue, 17 Jan 2012 16:16:20 +0100
> Michal Hocko <mhocko@suse.cz> wrote:
>
> > On Fri 13-01-12 17:32:27, KAMEZAWA Hiroyuki wrote:
> > >
> > > From 788aebf15f3fa37940e0745cab72547e20683bf2 Mon Sep 17 00:00:00 2001
> > > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > Date: Thu, 12 Jan 2012 16:08:33 +0900
> > > Subject: [PATCH 1/7] memcg: remove unnecessary check in mem_cgroup_update_page_stat()
> > >
> > > commit 10ea69f1182b removes move_lock_page_cgroup() in thp-split path.
> > > So, this PageTransHuge() check is unnecessary, too.
> >
> > I do not see a commit like that in the tree. I guess you meant
> > "memcg: make mem_cgroup_split_huge_fixup() more efficient", which is not
> > merged yet, right?
> >
>
> This commit is in linux-next.
Referring to commits from linux-next is tricky as it changes all the
time. I guess that the full commit subject should be sufficient.
> > > Note:
> > > - considering when mem_cgroup_update_page_stat() is called,
> > > there will be no race between split_huge_page() and update_page_stat().
> > > All required locks are held in higher level.
> >
> > We should never have a THP page in this path in the first place. So why
> > not change this to VM_BUG_ON(PageTransHuge)?
> >
>
> Ying Han is considering supporting mlock stats.
OK, got it. What about the following updated changelog instead?
===
We do not have to check PageTransHuge in mem_cgroup_update_page_stat
and fall back into the locked accounting because both move charge and thp
split-up are done with compound_lock, so they cannot race. Update vs.
move is protected sufficiently by mem_cgroup_stealed.
PageTransHuge pages shouldn't appear in this code path currently because
we are tracking only file pages at the moment but later we are planning
to track also other pages (e.g. mlocked ones).
===
>
> Thanks,
> -Kame
>
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic
* Re: [RFC] [PATCH 4/7 v2] memcg: new scheme to update per-memcg page stat accounting.
2012-01-13 8:41 ` [RFC] [PATCH 4/7 v2] memcg: new scheme to update per-memcg page stat accounting KAMEZAWA Hiroyuki
@ 2012-01-18 16:45 ` Michal Hocko
2012-01-18 23:58 ` KAMEZAWA Hiroyuki
2012-01-26 19:01 ` Ying Han
1 sibling, 1 reply; 50+ messages in thread
From: Michal Hocko @ 2012-01-18 16:45 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, Ying Han, hugh.dickins@tiscali.co.uk,
hannes@cmpxchg.org, cgroups, bsingharora@gmail.com
On Fri 13-01-12 17:41:38, KAMEZAWA Hiroyuki wrote:
[...]
> (And the Dirty flag has another problem which cannot be handled by a flag
> on page_cgroup.)
Is this important for the patch?
>
> I'd like to remove this flag because
> - In recent discussions, removing pc->flags is our direction.
> - This kind of duplication of flags/status is very bad, and
> it's better to use the status in 'struct page'.
>
> This patch removes page_cgroup's special flag for
> page-state accounting and uses 'struct page's status itself.
The patch doesn't seem to do so. It just enhances the semantics of the
locking for accounting.
>
> This patch adds atomic update support for page statistics accounting
> in memcg. In short, it prevents a page from being moved from one memcg
> to another while updating page status by...
>
> locked = mem_cgroup_begin_update_page_stat(page)
> modify page
> mem_cgroup_update_page_stat(page)
> mem_cgroup_end_update_page_stat(page, locked)
>
> Between begin_update_page_stat() ... end_update_page_stat(),
> the page_cgroup will never be moved to another memcg.
>
> In the usual case, the overhead is rcu_read_lock()/rcu_read_unlock()
> and lookup_page_cgroup().
>
> Note:
> - I am still considering how to reduce the overhead of this scheme.
> Good ideas are welcome.
So the overhead is increased by one lookup_page_cgroup call, right?
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
I think we do not have many other choices than adding some kind of
locking around both the normal and the memcg page updates, so this looks
like a good approach.
> ---
> include/linux/memcontrol.h | 36 ++++++++++++++++++++++++++++++++++
> mm/memcontrol.c | 46 ++++++++++++++++++++++++++-----------------
> mm/rmap.c | 14 +++++++++++-
> 3 files changed, 76 insertions(+), 20 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 4d34356..976b58c 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -141,9 +141,35 @@ static inline bool mem_cgroup_disabled(void)
> return false;
> }
>
> +/*
> + * When we update page->flags, we'll update some memcg's counter.
> + * Unlike vmstat, memcg has per-memcg stats and page-memcg relationship
> + * can be changed while 'struct page' information is updated.
> + * We need to prevent the race by
> + * locked = mem_cgroup_begin_update_page_stat(page)
> + * modify 'page'
> + * mem_cgroup_update_page_stat(page, idx, val)
> + * mem_cgroup_end_update_page_stat(page, locked);
> + */
> +bool __mem_cgroup_begin_update_page_stat(struct page *page);
> +static inline bool mem_cgroup_begin_update_page_stat(struct page *page)
The interface seems a bit strange. I would expect a bool *locked
parameter because this is some kind of cookie. The return value might be
confusing.
> +{
> + if (mem_cgroup_disabled())
> + return false;
> + return __mem_cgroup_begin_update_page_stat(page);
> +}
> void mem_cgroup_update_page_stat(struct page *page,
> enum mem_cgroup_page_stat_item idx,
> int val);
> +void __mem_cgroup_end_update_page_stat(struct page *page, bool locked);
> +static inline void
> +mem_cgroup_end_update_page_stat(struct page *page, bool locked)
> +{
> + if (mem_cgroup_disabled())
> + return;
> + __mem_cgroup_end_update_page_stat(page, locked);
> +}
> +
>
> static inline void mem_cgroup_inc_page_stat(struct page *page,
> enum mem_cgroup_page_stat_item idx)
> @@ -356,6 +382,16 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
> {
> }
>
> +static inline bool mem_cgroup_begin_update_page_stat(struct page *page)
> +{
> + return false;
> +}
> +
> +static inline void
> +mem_cgroup_end_update_page_stat(struct page *page, bool locked)
> +{
> +}
> +
> static inline void mem_cgroup_inc_page_stat(struct page *page,
> enum mem_cgroup_page_stat_item idx)
> {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 61e276f..30ef810 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1912,29 +1912,43 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
> * possibility of race condition. If there is, we take a lock.
> */
>
> +bool __mem_cgroup_begin_update_page_stat(struct page *page)
> +{
> + struct page_cgroup *pc = lookup_page_cgroup(page);
> + bool locked = false;
> + struct mem_cgroup *memcg;
> +
> + rcu_read_lock();
> + memcg = pc->mem_cgroup;
> +
> + if (!memcg || !PageCgroupUsed(pc))
> + goto out;
> + if (mem_cgroup_stealed(memcg)) {
> + mem_cgroup_account_move_rlock(page);
> + locked = true;
> + }
> +out:
> + return locked;
> +}
> +
> +void __mem_cgroup_end_update_page_stat(struct page *page, bool locked)
> +{
> + if (locked)
> + mem_cgroup_account_move_runlock(page);
> + rcu_read_unlock();
> +}
> +
> void mem_cgroup_update_page_stat(struct page *page,
> enum mem_cgroup_page_stat_item idx, int val)
> {
> - struct mem_cgroup *memcg;
> struct page_cgroup *pc = lookup_page_cgroup(page);
> - bool need_unlock = false;
> + struct mem_cgroup *memcg = pc->mem_cgroup;
>
> if (mem_cgroup_disabled())
> return;
>
> - rcu_read_lock();
> - memcg = pc->mem_cgroup;
> if (unlikely(!memcg || !PageCgroupUsed(pc)))
> - goto out;
> - /* pc->mem_cgroup is unstable ? */
> - if (unlikely(mem_cgroup_stealed(memcg))) {
> - /* take a lock against to access pc->mem_cgroup */
> - mem_cgroup_account_move_rlock(page);
> - need_unlock = true;
> - memcg = pc->mem_cgroup;
> - if (!memcg || !PageCgroupUsed(pc))
> - goto out;
> - }
> + return;
>
> switch (idx) {
> case MEMCG_NR_FILE_MAPPED:
> @@ -1950,10 +1964,6 @@ void mem_cgroup_update_page_stat(struct page *page,
>
> this_cpu_add(memcg->stat->count[idx], val);
>
> -out:
> - if (unlikely(need_unlock))
> - mem_cgroup_account_move_runlock(page);
> - rcu_read_unlock();
> return;
> }
> EXPORT_SYMBOL(mem_cgroup_update_page_stat);
We need to export mem_cgroup_{begin,end}_update_page_stat as well.
Btw, why is this exported anyway?
> diff --git a/mm/rmap.c b/mm/rmap.c
> index aa547d4..def60d1 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1150,10 +1150,13 @@ void page_add_new_anon_rmap(struct page *page,
> */
> void page_add_file_rmap(struct page *page)
> {
> + bool locked = mem_cgroup_begin_update_page_stat(page);
> +
> if (atomic_inc_and_test(&page->_mapcount)) {
> __inc_zone_page_state(page, NR_FILE_MAPPED);
> mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_MAPPED);
> }
> + mem_cgroup_end_update_page_stat(page, locked);
> }
>
> /**
> @@ -1164,10 +1167,14 @@ void page_add_file_rmap(struct page *page)
> */
> void page_remove_rmap(struct page *page)
> {
> + bool locked = false;
> +
> + if (!PageAnon(page))
> + locked = mem_cgroup_begin_update_page_stat(page);
Doesn't look nice. We shouldn't care about which pages update stats at
this level...
> +
> /* page still mapped by someone else? */
> if (!atomic_add_negative(-1, &page->_mapcount))
> - return;
> -
> + goto out;
> /*
> * Now that the last pte has gone, s390 must transfer dirty
> * flag from storage key to struct page. We can usually skip
> @@ -1204,6 +1211,9 @@ void page_remove_rmap(struct page *page)
> * Leaving it set also helps swapoff to reinstate ptes
> * faster for those pages still in swapcache.
> */
> +out:
> + if (!PageAnon(page))
> + mem_cgroup_end_update_page_stat(page, locked);
> }
>
> /*
> --
> 1.7.4.1
>
>
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic
* Re: [RFC] [PATCH 3/7 v2] memcg: remove PCG_MOVE_LOCK flag from pc->flags
2012-01-18 10:47 ` Michal Hocko
@ 2012-01-18 23:53 ` KAMEZAWA Hiroyuki
2012-01-23 22:05 ` Ying Han
0 siblings, 1 reply; 50+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-01-18 23:53 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-mm@kvack.org, Ying Han, hugh.dickins@tiscali.co.uk,
hannes@cmpxchg.org, cgroups, bsingharora@gmail.com
On Wed, 18 Jan 2012 11:47:03 +0100
Michal Hocko <mhocko@suse.cz> wrote:
> On Wed 18-01-12 09:12:26, KAMEZAWA Hiroyuki wrote:
> > On Tue, 17 Jan 2012 17:46:05 +0100
> > Michal Hocko <mhocko@suse.cz> wrote:
> >
> > > On Fri 13-01-12 17:40:19, KAMEZAWA Hiroyuki wrote:
> [...]
> > > > This patch removes PCG_MOVE_LOCK and adds a hashed rwlock array
> > > > instead. This works well enough. Even when we need to
> > > > take the lock,
> > >
> > > Hmmm, rwlocks are not very popular these days.
> > > Anyway, can we rather make it a (source-)memcg (bit)spinlock instead? We
> > > would reduce false sharing this way and would penalize only pages from
> > > the moving group.
> > >
> > A per-memcg spinlock?
>
> Yes
>
> > The reason I used an rwlock is to avoid disabling IRQs. This routine
> > will be called from IRQ context (for dirty ratio support). So,
> > disabling IRQs will be required if we use a spinlock.
>
> OK, I missed the comment about disabling IRQs. It's true that we do
> not have to be afraid of deadlocks if the lock is only taken for
> reading from IRQ context, but does the spinlock make a performance
> bottleneck? We are talking about the slowpath.
> I could see the reason for the read lock when using hashed locks, because
> they are global, but if we make the lock per-memcg then we shouldn't
> interfere with other updates which are not blocked by the move.
>
Hm, OK. In the next version, I'll use a per-memcg spinlock (hashed if
necessary).
Thanks,
-Kame
* Re: [RFC] [PATCH 4/7 v2] memcg: new scheme to update per-memcg page stat accounting.
2012-01-18 16:45 ` Michal Hocko
@ 2012-01-18 23:58 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 50+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-01-18 23:58 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-mm@kvack.org, Ying Han, hugh.dickins@tiscali.co.uk,
hannes@cmpxchg.org, cgroups, bsingharora@gmail.com
On Wed, 18 Jan 2012 17:45:58 +0100
Michal Hocko <mhocko@suse.cz> wrote:
> On Fri 13-01-12 17:41:38, KAMEZAWA Hiroyuki wrote:
> [...]
> > (And the Dirty flag has another problem which cannot be handled by a flag
> > on page_cgroup.)
>
> Is this important for the patch?
>
Half of my motivation comes from a problem with the dirty flag.
> >
> > I'd like to remove this flag because
> > - In recent discussions, removing pc->flags is our direction.
> > - This kind of duplication of flags/status is very bad, and
> > it's better to use the status in 'struct page'.
> >
> > This patch removes page_cgroup's special flag for
> > page-state accounting and uses 'struct page's status itself.
>
> The patch doesn't seem to do so. It just enhances the semantics of the
> locking for accounting.
>
Ah, this patch is required for avoiding flags in pc->flags.
> >
> > This patch adds atomic update support for page statistics accounting
> > in memcg. In short, it prevents a page from being moved from one memcg
> > to another while updating page status by...
> >
> > locked = mem_cgroup_begin_update_page_stat(page)
> > modify page
> > mem_cgroup_update_page_stat(page)
> > mem_cgroup_end_update_page_stat(page, locked)
> >
> > Between begin_update_page_stat() ... end_update_page_stat(),
> > the page_cgroup will never be moved to another memcg.
> >
> > In the usual case, the overhead is rcu_read_lock()/rcu_read_unlock()
> > and lookup_page_cgroup().
> >
> > Note:
> > - I am still considering how to reduce the overhead of this scheme.
> > Good ideas are welcome.
>
> So the overhead is increased by one lookup_page_cgroup call, right?
>
Yes, but patch 7/7 decreases that, finally.
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> I think we do not have many other choices than adding some kind of
> locking around both the normal and the memcg page updates, so this looks
> like a good approach.
Thank you.
>
> > ---
> > include/linux/memcontrol.h | 36 ++++++++++++++++++++++++++++++++++
> > mm/memcontrol.c | 46 ++++++++++++++++++++++++++-----------------
> > mm/rmap.c | 14 +++++++++++-
> > 3 files changed, 76 insertions(+), 20 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 4d34356..976b58c 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -141,9 +141,35 @@ static inline bool mem_cgroup_disabled(void)
> > return false;
> > }
> >
> > +/*
> > > + * When we update page->flags, we'll update some memcg's counter.
> > + * Unlike vmstat, memcg has per-memcg stats and page-memcg relationship
> > + * can be changed while 'struct page' information is updated.
> > + * We need to prevent the race by
> > + * locked = mem_cgroup_begin_update_page_stat(page)
> > + * modify 'page'
> > + * mem_cgroup_update_page_stat(page, idx, val)
> > + * mem_cgroup_end_update_page_stat(page, locked);
> > + */
> > +bool __mem_cgroup_begin_update_page_stat(struct page *page);
> > +static inline bool mem_cgroup_begin_update_page_stat(struct page *page)
>
> The interface seems a bit strange. I would expect a bool *locked
> parameter because this is some kind of cookie. The return value might be
> confusing.
>
Hm, OK. Then I'll make this function return void.
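I.e., something like this (a sketch):
==
void mem_cgroup_begin_update_page_stat(struct page *page, bool *locked);
void mem_cgroup_end_update_page_stat(struct page *page, bool *locked);
==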
> > +{
> > + if (mem_cgroup_disabled())
> > + return false;
> > + return __mem_cgroup_begin_update_page_stat(page);
> > +}
> > void mem_cgroup_update_page_stat(struct page *page,
> > enum mem_cgroup_page_stat_item idx,
> > int val);
> > +void __mem_cgroup_end_update_page_stat(struct page *page, bool locked);
> > +static inline void
> > +mem_cgroup_end_update_page_stat(struct page *page, bool locked)
> > +{
> > + if (mem_cgroup_disabled())
> > + return;
> > + __mem_cgroup_end_update_page_stat(page, locked);
> > +}
> > +
> >
> > static inline void mem_cgroup_inc_page_stat(struct page *page,
> > enum mem_cgroup_page_stat_item idx)
> > @@ -356,6 +382,16 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
> > {
> > }
> >
> > +static inline bool mem_cgroup_begin_update_page_stat(struct page *page)
> > +{
> > + return false;
> > +}
> > +
> > +static inline void
> > +mem_cgroup_end_update_page_stat(struct page *page, bool locked)
> > +{
> > +}
> > +
> > static inline void mem_cgroup_inc_page_stat(struct page *page,
> > enum mem_cgroup_page_stat_item idx)
> > {
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 61e276f..30ef810 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1912,29 +1912,43 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
> > * possibility of race condition. If there is, we take a lock.
> > */
> >
> > +bool __mem_cgroup_begin_update_page_stat(struct page *page)
> > +{
> > + struct page_cgroup *pc = lookup_page_cgroup(page);
> > + bool locked = false;
> > + struct mem_cgroup *memcg;
> > +
> > + rcu_read_lock();
> > + memcg = pc->mem_cgroup;
> > +
> > + if (!memcg || !PageCgroupUsed(pc))
> > + goto out;
> > + if (mem_cgroup_stealed(memcg)) {
> > + mem_cgroup_account_move_rlock(page);
> > + locked = true;
> > + }
> > +out:
> > + return locked;
> > +}
> > +
> > +void __mem_cgroup_end_update_page_stat(struct page *page, bool locked)
> > +{
> > + if (locked)
> > + mem_cgroup_account_move_runlock(page);
> > + rcu_read_unlock();
> > +}
> > +
> > void mem_cgroup_update_page_stat(struct page *page,
> > enum mem_cgroup_page_stat_item idx, int val)
> > {
> > - struct mem_cgroup *memcg;
> > struct page_cgroup *pc = lookup_page_cgroup(page);
> > - bool need_unlock = false;
> > + struct mem_cgroup *memcg = pc->mem_cgroup;
> >
> > if (mem_cgroup_disabled())
> > return;
> >
> > - rcu_read_lock();
> > - memcg = pc->mem_cgroup;
> > if (unlikely(!memcg || !PageCgroupUsed(pc)))
> > - goto out;
> > - /* pc->mem_cgroup is unstable ? */
> > - if (unlikely(mem_cgroup_stealed(memcg))) {
> > - /* take a lock against to access pc->mem_cgroup */
> > - mem_cgroup_account_move_rlock(page);
> > - need_unlock = true;
> > - memcg = pc->mem_cgroup;
> > - if (!memcg || !PageCgroupUsed(pc))
> > - goto out;
> > - }
> > + return;
> >
> > switch (idx) {
> > case MEMCG_NR_FILE_MAPPED:
> > @@ -1950,10 +1964,6 @@ void mem_cgroup_update_page_stat(struct page *page,
> >
> > this_cpu_add(memcg->stat->count[idx], val);
> >
> > -out:
> > - if (unlikely(need_unlock))
> > - mem_cgroup_account_move_runlock(page);
> > - rcu_read_unlock();
> > return;
> > }
> > EXPORT_SYMBOL(mem_cgroup_update_page_stat);
>
> We need to export mem_cgroup_{begin,end}_update_page_stat as well.
> Btw, why is this exported anyway?
>
I'm not sure at all... Hmm, for some old reason?
I'll check whether we can remove this EXPORT.
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index aa547d4..def60d1 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -1150,10 +1150,13 @@ void page_add_new_anon_rmap(struct page *page,
> > */
> > void page_add_file_rmap(struct page *page)
> > {
> > + bool locked = mem_cgroup_begin_update_page_stat(page);
> > +
> > if (atomic_inc_and_test(&page->_mapcount)) {
> > __inc_zone_page_state(page, NR_FILE_MAPPED);
> > mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_MAPPED);
> > }
> > + mem_cgroup_end_update_page_stat(page, locked);
> > }
> >
> > /**
> > @@ -1164,10 +1167,14 @@ void page_add_file_rmap(struct page *page)
> > */
> > void page_remove_rmap(struct page *page)
> > {
> > + bool locked = false;
> > +
> > + if (!PageAnon(page))
> > + locked = mem_cgroup_begin_update_page_stat(page);
>
> Doesn't look nice. We shouldn't care about which pages update stats at
> this level...
>
OK, I'll remove this and see whether this optimization is required.
If required, I'll add enough comments.
Thanks,
-Kame
* Re: [RFC] [PATCH 2/7 v2] memcg: add memory barrier for checking account move.
2012-01-18 12:37 ` Michal Hocko
@ 2012-01-19 2:17 ` KAMEZAWA Hiroyuki
2012-01-19 9:28 ` Michal Hocko
2012-01-20 18:08 ` Ying Han
0 siblings, 2 replies; 50+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-01-19 2:17 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-mm@kvack.org, Ying Han, hugh.dickins@tiscali.co.uk,
hannes@cmpxchg.org, cgroups, bsingharora@gmail.com
On Wed, 18 Jan 2012 13:37:59 +0100
Michal Hocko <mhocko@suse.cz> wrote:
> On Wed 18-01-12 09:06:56, KAMEZAWA Hiroyuki wrote:
> > On Tue, 17 Jan 2012 16:26:35 +0100
> > Michal Hocko <mhocko@suse.cz> wrote:
> >
> > > On Fri 13-01-12 17:33:47, KAMEZAWA Hiroyuki wrote:
> > > > I think this bugfix is needed before going ahead. thoughts?
> > > > ==
> > > > From 2cb491a41782b39aae9f6fe7255b9159ac6c1563 Mon Sep 17 00:00:00 2001
> > > > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > > Date: Fri, 13 Jan 2012 14:27:20 +0900
> > > > Subject: [PATCH 2/7] memcg: add memory barrier for checking account move.
> > > >
> > > > When move_account() starts, the source memcg's per-cpu variable
> > > > MEM_CGROUP_ON_MOVE is set. The page status update
> > > > routine checks it under rcu_read_lock(), but there is no memory
> > > > barrier. This patch adds one.
> > >
> > > OK, this would help enforce that the CPU sees the current value,
> > > but what prevents us from racing with the value update without the
> > > lock? This is as racy as it was before, AFAICS.
> > >
> >
> > Hm, do I misunderstand?
> > ==
> > update                       reference
> >
> > CPU A                        CPU B
> > set value                    rcu_read_lock()
> > smp_wmb()                    smp_rmb()
> >                              read_value
> >                              rcu_read_unlock()
> > synchronize_rcu()
> > ==
> > I expect:
> > If synchronize_rcu() is called before rcu_read_lock() => move_lock_xxx will be held.
> > If synchronize_rcu() is called after rcu_read_lock() => the update will be delayed.
>
> Ahh, OK, I can see it now. Readers are not that important because it is
> actually the updater who is delayed until all preexisting rcu read
> sections are finished.
>
> In that case, why do we need both barriers? spin_unlock is a full
> barrier, so maybe we just need an smp_rmb before we read the value to
> make sure that we do not get a stale value when we start an rcu read
> section after synchronize_rcu?
>
I doubt it... Without a barrier, this case can happen:
==
update                       reference

CPU A                        CPU B
set value
synchronize_rcu()            rcu_read_lock()
                             read_value   <= finds the old value
                             rcu_read_unlock()
                             takes no lock
==
> > Here, CPU B needs to read the most recently updated value.
>
> If it reads the old value then it would think that we are not moving and
> so we would account to the old group and move it later on, right?
>
Right. Without move_lock, we're not sure whether pc->mem_cgroup will be
the old one or the new one. This will cause misaccounting.
Thanks,
-Kame
* Re: [RFC] [PATCH 1/7 v2] memcg: remove unnecessary check in mem_cgroup_update_page_stat()
2012-01-18 13:01 ` Michal Hocko
@ 2012-01-19 2:18 ` KAMEZAWA Hiroyuki
2012-01-19 20:07 ` Ying Han
1 sibling, 0 replies; 50+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-01-19 2:18 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-mm@kvack.org, Ying Han, hugh.dickins@tiscali.co.uk,
hannes@cmpxchg.org, cgroups, bsingharora@gmail.com
On Wed, 18 Jan 2012 14:01:02 +0100
Michal Hocko <mhocko@suse.cz> wrote:
> On Wed 18-01-12 08:55:58, KAMEZAWA Hiroyuki wrote:
> > On Tue, 17 Jan 2012 16:16:20 +0100
> > Michal Hocko <mhocko@suse.cz> wrote:
> >
> > > On Fri 13-01-12 17:32:27, KAMEZAWA Hiroyuki wrote:
> > > >
> > > > From 788aebf15f3fa37940e0745cab72547e20683bf2 Mon Sep 17 00:00:00 2001
> > > > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > > Date: Thu, 12 Jan 2012 16:08:33 +0900
> > > > Subject: [PATCH 1/7] memcg: remove unnecessary check in mem_cgroup_update_page_stat()
> > > >
> > > > commit 10ea69f1182b removes move_lock_page_cgroup() in thp-split path.
> > > > So, this PageTransHuge() check is unnecessary, too.
> > >
> > > I do not see commit like that in the tree. I guess you meant
> > > memcg: make mem_cgroup_split_huge_fixup() more efficient which is not
> > > merged yet, right?
> > >
> >
> > This commit in the linux-next.
>
> Referring to commits from linux-next is tricky as it changes all the
> time. I guess that the full commit subject should be sufficient.
>
> > > > Note:
> > > > - considering when mem_cgroup_update_page_stat() is called,
> > > > there will be no race between split_huge_page() and update_page_stat().
> > > > All required locks are held in higher level.
> > >
> > > We should never have THP page in this path in the first place. So why
> > > not changing this to VM_BUG_ON(PageTransHuge).
> > >
> >
> > Ying Han considers to support mlock stat.
>
> OK, got it. What about the following updated changelog instead?
>
> ===
> We do not have to check PageTransHuge in mem_cgroup_update_page_stat
> and fallback into the locked accounting because both move charge and thp
> split up are done with compound_lock so they cannot race. update vs.
> move is protected by the mem_cgroup_stealed sufficiently.
>
> PageTransHuge pages shouldn't appear in this code path currently because
> we are tracking only file pages at the moment but later we are planning
> to track also other pages (e.g. mlocked ones).
> ===
OK, I will use this :) Thank you.
-Kame
* Re: [RFC] [PATCH 2/7 v2] memcg: add memory barrier for checking account move.
2012-01-19 2:17 ` KAMEZAWA Hiroyuki
@ 2012-01-19 9:28 ` Michal Hocko
2012-01-19 23:57 ` KAMEZAWA Hiroyuki
2012-01-20 18:08 ` Ying Han
1 sibling, 1 reply; 50+ messages in thread
From: Michal Hocko @ 2012-01-19 9:28 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, Ying Han, hugh.dickins@tiscali.co.uk,
hannes@cmpxchg.org, cgroups, bsingharora@gmail.com
On Thu 19-01-12 11:17:27, KAMEZAWA Hiroyuki wrote:
> On Wed, 18 Jan 2012 13:37:59 +0100
> Michal Hocko <mhocko@suse.cz> wrote:
>
> > On Wed 18-01-12 09:06:56, KAMEZAWA Hiroyuki wrote:
> > > On Tue, 17 Jan 2012 16:26:35 +0100
> > > Michal Hocko <mhocko@suse.cz> wrote:
> > >
> > > > On Fri 13-01-12 17:33:47, KAMEZAWA Hiroyuki wrote:
> > > > > I think this bugfix is needed before going ahead. thoughts?
> > > > > ==
> > > > > From 2cb491a41782b39aae9f6fe7255b9159ac6c1563 Mon Sep 17 00:00:00 2001
> > > > > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > > > Date: Fri, 13 Jan 2012 14:27:20 +0900
> > > > > Subject: [PATCH 2/7] memcg: add memory barrier for checking account move.
> > > > >
> > > > > At starting move_account(), source memcg's per-cpu variable
> > > > > MEM_CGROUP_ON_MOVE is set. The page status update
> > > > > routine check it under rcu_read_lock(). But there is no memory
> > > > > barrier. This patch adds one.
> > > >
> > > > OK this would help to enforce that the CPU would see the current value
> > > > but what prevents us from the race with the value update without the
> > > > lock? This is as racy as it was before AFAICS.
> > > >
> > >
> > > Hm, do I misunderstand ?
> > > ==
> > > update reference
> > >
> > > CPU A CPU B
> > > set value rcu_read_lock()
> > > smp_wmb() smp_rmb()
> > > read_value
> > > rcu_read_unlock()
> > > synchronize_rcu().
> > > ==
> > > I expect
> > > If synchronize_rcu() is called before rcu_read_lock() => move_lock_xxx will be held.
> > > If synchronize_rcu() is called after rcu_read_lock() => update will be delayed.
> >
> > Ahh, OK I can see it now. Readers are not that important because it is
> > actually the updater who is delayed until all preexisting rcu read
> > sections are finished.
> >
> > In that case. Why do we need both barriers? spin_unlock is a full
> > barrier so maybe we just need smp_rmb before we read value to make sure
> > that we do not get stalled value when we start rcu_read section after
> > synchronize_rcu?
> >
>
> I doubt it. Without the barrier, this case can happen:
>
> ==
> update reference
> CPU A CPU B
> set value
> synchronize_rcu() rcu_read_lock()
> read_value <= find old value
> rcu_read_unlock()
> do no lock
> ==
OK, I have looked at Documentation/memory-barriers.txt again;
spin_unlock is not a full barrier, so we cannot rely on it.
Anyway, could you still do the barrier just once, after all the values are set?
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic
* Re: [RFC] [PATCH 5/7 v2] memcg: remove PCG_FILE_MAPPED
2012-01-13 8:42 ` [RFC] [PATCH 5/7 v2] memcg: remove PCG_FILE_MAPPED KAMEZAWA Hiroyuki
@ 2012-01-19 14:07 ` Michal Hocko
2012-01-26 19:10 ` Ying Han
0 siblings, 1 reply; 50+ messages in thread
From: Michal Hocko @ 2012-01-19 14:07 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, Ying Han, hugh.dickins@tiscali.co.uk,
hannes@cmpxchg.org, cgroups, bsingharora@gmail.com
On Fri 13-01-12 17:42:23, KAMEZAWA Hiroyuki wrote:
> From a9b51d6204d7f8714173c46a306caf413ad25d4e Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Date: Thu, 12 Jan 2012 18:40:26 +0900
> Subject: [PATCH 5/7] memcg: remove PCG_FILE_MAPPED
>
> Because we can update page's status and memcg's statistics without
> race with move_account, this flag is unnecessary.
I would really appreciate a little bit more description ;)
Commit 8725d541 [memcg: fix race in file_mapped accounting] added the
flag to resolve a race where move_account could happen between the page's
mapcount update and the corresponding memcg accounting.
This, however, cannot happen anymore because mem_cgroup_update_page_stat
is always enclosed by mem_cgroup_begin_update_page_stat and
mem_cgroup_end_update_page_stat along with the mapcount update.
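For illustration, the rmap-side pattern would look roughly like this; a
minimal sketch modeled on page_add_file_rmap(), with the memcg function
names following this series:
==
void page_add_file_rmap(struct page *page)
{
        bool locked;

        /* pc->mem_cgroup cannot be moved between begin and end */
        locked = mem_cgroup_begin_update_page_stat(page);
        if (atomic_inc_and_test(&page->_mapcount)) {
                __inc_zone_page_state(page, NR_FILE_MAPPED);
                mem_cgroup_update_page_stat(page, MEMCG_NR_FILE_MAPPED, 1);
        }
        mem_cgroup_end_update_page_stat(page, locked);
}
==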
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Other than that it looks good, and it is nice to see another flag go away.
If you don't mind, I will add my ack along with the patches this one depends
on, once they settle down.
Thanks
> ---
> include/linux/page_cgroup.h | 6 ------
> mm/memcontrol.c | 6 +-----
> 2 files changed, 1 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
> index 5dba799..0b9a48a 100644
> --- a/include/linux/page_cgroup.h
> +++ b/include/linux/page_cgroup.h
> @@ -7,8 +7,6 @@ enum {
> PCG_CACHE, /* charged as cache */
> PCG_USED, /* this object is in use. */
> PCG_MIGRATION, /* under page migration */
> - /* flags for mem_cgroup and file and I/O status */
> - PCG_FILE_MAPPED, /* page is accounted as "mapped" */
> __NR_PCG_FLAGS,
> };
>
> @@ -72,10 +70,6 @@ TESTPCGFLAG(Used, USED)
> CLEARPCGFLAG(Used, USED)
> SETPCGFLAG(Used, USED)
>
> -SETPCGFLAG(FileMapped, FILE_MAPPED)
> -CLEARPCGFLAG(FileMapped, FILE_MAPPED)
> -TESTPCGFLAG(FileMapped, FILE_MAPPED)
> -
> SETPCGFLAG(Migration, MIGRATION)
> CLEARPCGFLAG(Migration, MIGRATION)
> TESTPCGFLAG(Migration, MIGRATION)
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 30ef810..a96800d 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1952,10 +1952,6 @@ void mem_cgroup_update_page_stat(struct page *page,
>
> switch (idx) {
> case MEMCG_NR_FILE_MAPPED:
> - if (val > 0)
> - SetPageCgroupFileMapped(pc);
> - else if (!page_mapped(page))
> - ClearPageCgroupFileMapped(pc);
> idx = MEM_CGROUP_STAT_FILE_MAPPED;
> break;
> default:
> @@ -2606,7 +2602,7 @@ static int mem_cgroup_move_account(struct page *page,
>
> mem_cgroup_account_move_wlock(page, &flags);
>
> - if (PageCgroupFileMapped(pc)) {
> + if (page_mapcount(page)) {
> /* Update mapped_file data for mem_cgroup */
> preempt_disable();
> __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> --
> 1.7.4.1
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic
* Re: [RFC] [PATCH 7/7 v2] memcg: make mem_cgroup_begin_update_stat to use global pcpu.
2012-01-13 8:45 ` [RFC] [PATCH 7/7 v2] memcg: make mem_cgroup_begin_update_stat to use global pcpu KAMEZAWA Hiroyuki
@ 2012-01-19 14:47 ` Michal Hocko
2012-01-20 2:19 ` KAMEZAWA Hiroyuki
2012-01-20 8:40 ` Greg Thelen
1 sibling, 1 reply; 50+ messages in thread
From: Michal Hocko @ 2012-01-19 14:47 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, Ying Han, hugh.dickins@tiscali.co.uk,
hannes@cmpxchg.org, cgroups, bsingharora@gmail.com
On Fri 13-01-12 17:45:10, KAMEZAWA Hiroyuki wrote:
> From 3df71cef5757ee6547916c4952f04a263c1b8ddb Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Date: Fri, 13 Jan 2012 17:07:35 +0900
> Subject: [PATCH 7/7] memcg: make mem_cgroup_begin_update_stat to use global pcpu.
>
> Now, a per-cpu flag to show the memcg is under account moving is
> now implemented as per-memcg-per-cpu.
>
> So, when accessing this, we need to access memcg 1st. But this
> function is called even when status update doesn't occur. Then,
> accessing struct memcg is an overhead in such case.
>
> This patch removes per-cpu-per-memcg MEM_CGROUP_ON_MOVE and add
> per-cpu vairable to do the same work. For per-memcg, atomic
> counter is added. By this, mem_cgroup_begin_update_stat() will
> just access percpu variable in usual case and don't need to find & access
> memcg. This reduces overhead.
I agree that move_account is not a hot path and that we don't have
to optimize for it, but I guess we can do better. If we use a cookie
parameter for
mem_cgroup_{begin,end}_update_stat(struct page *page, unsigned long *cookie)
then we can stash the page_cgroup inside and use the last bit for
'locked'. Then we do not have to call lookup_page_cgroup again in
mem_cgroup_update_page_stat and just replace page with the cookie.
What do you think?
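Roughly, the idea might look like the sketch below; cookie_to_pc() is a
made-up helper, the PageCgroupUsed/NULL checks are omitted for brevity, and
bit 0 is free because page_cgroup is at least word aligned:
==
static inline void
mem_cgroup_begin_update_page_stat(struct page *page, unsigned long *cookie)
{
        struct page_cgroup *pc = lookup_page_cgroup(page);

        *cookie = (unsigned long)pc;
        rcu_read_lock();
        if (pc->mem_cgroup && mem_cgroup_stealed(pc->mem_cgroup)) {
                mem_cgroup_account_move_rlock(page);
                *cookie |= 1UL;         /* remember we took the lock */
        }
}

static inline struct page_cgroup *cookie_to_pc(unsigned long cookie)
{
        return (struct page_cgroup *)(cookie & ~1UL);
}

/*
 * mem_cgroup_update_page_stat() would take the cookie and use
 * cookie_to_pc() instead of calling lookup_page_cgroup() again.
 */

static inline void
mem_cgroup_end_update_page_stat(struct page *page, unsigned long cookie)
{
        if (cookie & 1UL)
                mem_cgroup_account_move_runlock(page);
        rcu_read_unlock();
}
==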
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
> include/linux/memcontrol.h | 16 +++++++++-
> mm/memcontrol.c | 67 +++++++++++++++++++++----------------------
> 2 files changed, 47 insertions(+), 36 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 976b58c..26a4baa 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -151,12 +151,22 @@ static inline bool mem_cgroup_disabled(void)
> * mem_cgroup_update_page_stat(page, idx, val)
> * mem_cgroup_end_update_page_stat(page, locked);
> */
> +DECLARE_PER_CPU(int, mem_cgroup_account_moving);
> +static inline bool any_mem_cgroup_stealed(void)
> +{
> + smp_rmb();
> + return this_cpu_read(mem_cgroup_account_moving) > 0;
> +}
> +
> bool __mem_cgroup_begin_update_page_stat(struct page *page);
> static inline bool mem_cgroup_begin_update_page_stat(struct page *page)
> {
> if (mem_cgroup_disabled())
> return false;
> - return __mem_cgroup_begin_update_page_stat(page);
> + rcu_read_lock();
> + if (unlikely(any_mem_cgroup_stealed()))
> + return __mem_cgroup_begin_update_page_stat(page);
> + return false;
> }
> void mem_cgroup_update_page_stat(struct page *page,
> enum mem_cgroup_page_stat_item idx,
> @@ -167,7 +177,9 @@ mem_cgroup_end_update_page_stat(struct page *page, bool locked)
> {
> if (mem_cgroup_disabled())
> return;
> - __mem_cgroup_end_update_page_stat(page, locked);
> + if (locked)
> + __mem_cgroup_end_update_page_stat(page, locked);
> + rcu_read_unlock();
> }
>
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 8b67ccf..4836e8d 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -89,7 +89,6 @@ enum mem_cgroup_stat_index {
> MEM_CGROUP_STAT_FILE_MAPPED, /* # of pages charged as file rss */
> MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> MEM_CGROUP_STAT_DATA, /* end of data requires synchronization */
> - MEM_CGROUP_ON_MOVE, /* someone is moving account between groups */
> MEM_CGROUP_STAT_NSTATS,
> };
>
> @@ -279,6 +278,8 @@ struct mem_cgroup {
> * mem_cgroup ? And what type of charges should we move ?
> */
> unsigned long move_charge_at_immigrate;
> + /* set when a page under this memcg may be moving to other memcg */
> + atomic_t account_moving;
> /*
> * percpu counter.
> */
> @@ -1250,20 +1251,27 @@ int mem_cgroup_swappiness(struct mem_cgroup *memcg)
> return memcg->swappiness;
> }
>
> +/*
> + * For quick check, for avoiding looking up memcg, system-wide
> + * per-cpu check is provided.
> + */
> +DEFINE_PER_CPU(int, mem_cgroup_account_moving);
> +DEFINE_SPINLOCK(mem_cgroup_stealed_lock);
> +
> static void mem_cgroup_start_move(struct mem_cgroup *memcg)
> {
> int cpu;
>
> get_online_cpus();
> - spin_lock(&memcg->pcp_counter_lock);
> + spin_lock(&mem_cgroup_stealed_lock);
> for_each_online_cpu(cpu) {
> - per_cpu(memcg->stat->count[MEM_CGROUP_ON_MOVE], cpu) += 1;
> + per_cpu(mem_cgroup_account_moving, cpu) += 1;
> smp_wmb();
> }
> - memcg->nocpu_base.count[MEM_CGROUP_ON_MOVE] += 1;
> - spin_unlock(&memcg->pcp_counter_lock);
> + spin_unlock(&mem_cgroup_stealed_lock);
> put_online_cpus();
>
> + atomic_inc(&memcg->account_moving);
> synchronize_rcu();
> }
>
> @@ -1274,11 +1282,12 @@ static void mem_cgroup_end_move(struct mem_cgroup *memcg)
> if (!memcg)
> return;
> get_online_cpus();
> - spin_lock(&memcg->pcp_counter_lock);
> - for_each_online_cpu(cpu)
> - per_cpu(memcg->stat->count[MEM_CGROUP_ON_MOVE], cpu) -= 1;
> - memcg->nocpu_base.count[MEM_CGROUP_ON_MOVE] -= 1;
> - spin_unlock(&memcg->pcp_counter_lock);
> + spin_lock(&mem_cgroup_stealed_lock);
> + for_each_online_cpu(cpu) {
> + per_cpu(mem_cgroup_account_moving, cpu) -= 1;
> + }
> + spin_unlock(&mem_cgroup_stealed_lock);
> + atomic_dec(&memcg->account_moving);
> put_online_cpus();
> }
> /*
> @@ -1296,8 +1305,7 @@ static void mem_cgroup_end_move(struct mem_cgroup *memcg)
> static bool mem_cgroup_stealed(struct mem_cgroup *memcg)
> {
> VM_BUG_ON(!rcu_read_lock_held());
> - smp_rmb();
> - return this_cpu_read(memcg->stat->count[MEM_CGROUP_ON_MOVE]) > 0;
> + return atomic_read(&memcg->account_moving) > 0;
> }
>
> static bool mem_cgroup_under_move(struct mem_cgroup *memcg)
> @@ -1343,10 +1351,9 @@ static bool mem_cgroup_wait_acct_move(struct mem_cgroup *memcg)
> * page satus accounting. To avoid that, we need some locks. In general,
> * ading atomic ops to hot path is very bad. We're using 2 level logic.
> *
> - * When a thread starts moving account information, per-cpu MEM_CGROUP_ON_MOVE
> - * value is set. If MEM_CGROUP_ON_MOVE==0, there are no race and page status
> - * update can be done withou any locks. If MEM_CGROUP_ON_MOVE>0, we use
> - * following hashed rwlocks.
> + * When a thread starts moving account information, memcg->account_moving
> + * value is set. If ==0, there are no race and page status update can be done
> + * without any locks. If account_moving >0, we use following hashed rwlocks.
> * - At updating information, we hold rlock.
> * - When a page is picked up and being moved, wlock is held.
> *
> @@ -1354,7 +1361,7 @@ static bool mem_cgroup_wait_acct_move(struct mem_cgroup *memcg)
> */
>
> /*
> - * This rwlock is accessed only when MEM_CGROUP_ON_MOVE > 0.
> + * This rwlock is accessed only when account_moving > 0.
> */
> #define NR_MOVE_ACCOUNT_LOCKS (NR_CPUS)
> #define move_account_hash(page) ((page_to_pfn(page) % NR_MOVE_ACCOUNT_LOCKS))
> @@ -1907,9 +1914,8 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
> * if there are race with "uncharge". Statistics itself is properly handled
> * by flags.
> *
> - * Considering "move", this is an only case we see a race. To make the race
> - * small, we check MEM_CGROUP_ON_MOVE percpu value and detect there are
> - * possibility of race condition. If there is, we take a lock.
> + * If any_mem_cgroup_stealed() && mem_cgroup_stealed(), there is
> + * a possiblity of race condition and we take a lock.
> */
>
> bool __mem_cgroup_begin_update_page_stat(struct page *page)
> @@ -1918,7 +1924,6 @@ bool __mem_cgroup_begin_update_page_stat(struct page *page)
> bool locked = false;
> struct mem_cgroup *memcg;
>
> - rcu_read_lock();
> memcg = pc->mem_cgroup;
>
> if (!memcg || !PageCgroupUsed(pc))
> @@ -1933,9 +1938,7 @@ out:
>
> void __mem_cgroup_end_update_page_stat(struct page *page, bool locked)
> {
> - if (locked)
> - mem_cgroup_account_move_runlock(page);
> - rcu_read_unlock();
> + mem_cgroup_account_move_runlock(page);
> }
>
> void mem_cgroup_update_page_stat(struct page *page,
> @@ -2133,18 +2136,14 @@ static void mem_cgroup_drain_pcp_counter(struct mem_cgroup *memcg, int cpu)
> per_cpu(memcg->stat->events[i], cpu) = 0;
> memcg->nocpu_base.events[i] += x;
> }
> - /* need to clear ON_MOVE value, works as a kind of lock. */
> - per_cpu(memcg->stat->count[MEM_CGROUP_ON_MOVE], cpu) = 0;
> spin_unlock(&memcg->pcp_counter_lock);
> }
>
> -static void synchronize_mem_cgroup_on_move(struct mem_cgroup *memcg, int cpu)
> +static void synchronize_mem_cgroup_on_move(int cpu)
> {
> - int idx = MEM_CGROUP_ON_MOVE;
> -
> - spin_lock(&memcg->pcp_counter_lock);
> - per_cpu(memcg->stat->count[idx], cpu) = memcg->nocpu_base.count[idx];
> - spin_unlock(&memcg->pcp_counter_lock);
> + spin_lock(&mem_cgroup_stealed_lock);
> + per_cpu(mem_cgroup_account_moving, cpu) = 0;
> + spin_unlock(&mem_cgroup_stealed_lock);
> }
>
> static int __cpuinit memcg_cpu_hotplug_callback(struct notifier_block *nb,
> @@ -2156,8 +2155,7 @@ static int __cpuinit memcg_cpu_hotplug_callback(struct notifier_block *nb,
> struct mem_cgroup *iter;
>
> if ((action == CPU_ONLINE)) {
> - for_each_mem_cgroup(iter)
> - synchronize_mem_cgroup_on_move(iter, cpu);
> + synchronize_mem_cgroup_on_move(cpu);
> return NOTIFY_OK;
> }
>
> @@ -2167,6 +2165,7 @@ static int __cpuinit memcg_cpu_hotplug_callback(struct notifier_block *nb,
> for_each_mem_cgroup(iter)
> mem_cgroup_drain_pcp_counter(iter, cpu);
>
> + per_cpu(mem_cgroup_account_moving, cpu) = 0;
> stock = &per_cpu(memcg_stock, cpu);
> drain_stock(stock);
> return NOTIFY_OK;
> --
> 1.7.4.1
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic
* Re: [RFC] [PATCH 1/7 v2] memcg: remove unnecessary check in mem_cgroup_update_page_stat()
2012-01-18 13:01 ` Michal Hocko
2012-01-19 2:18 ` KAMEZAWA Hiroyuki
@ 2012-01-19 20:07 ` Ying Han
2012-01-20 0:48 ` KAMEZAWA Hiroyuki
1 sibling, 1 reply; 50+ messages in thread
From: Ying Han @ 2012-01-19 20:07 UTC (permalink / raw)
To: Michal Hocko
Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org, hugh.dickins@tiscali.co.uk,
hannes@cmpxchg.org, cgroups, bsingharora@gmail.com
On Wed, Jan 18, 2012 at 5:01 AM, Michal Hocko <mhocko@suse.cz> wrote:
> On Wed 18-01-12 08:55:58, KAMEZAWA Hiroyuki wrote:
>> On Tue, 17 Jan 2012 16:16:20 +0100
>> Michal Hocko <mhocko@suse.cz> wrote:
>>
>> > On Fri 13-01-12 17:32:27, KAMEZAWA Hiroyuki wrote:
>> > >
>> > > From 788aebf15f3fa37940e0745cab72547e20683bf2 Mon Sep 17 00:00:00 2001
>> > > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> > > Date: Thu, 12 Jan 2012 16:08:33 +0900
>> > > Subject: [PATCH 1/7] memcg: remove unnecessary check in mem_cgroup_update_page_stat()
>> > >
>> > > commit 10ea69f1182b removes move_lock_page_cgroup() in thp-split path.
>> > > So, this PageTransHuge() check is unnecessary, too.
>> >
>> > I do not see commit like that in the tree. I guess you meant
>> > memcg: make mem_cgroup_split_huge_fixup() more efficient which is not
>> > merged yet, right?
>> >
>>
>> This commit in the linux-next.
>
> Referring to commits from linux-next is tricky as it changes all the
> time. I guess that the full commit subject should be sufficient.
>
>> > > Note:
>> > > - considering when mem_cgroup_update_page_stat() is called,
>> > > there will be no race between split_huge_page() and update_page_stat().
>> > > All required locks are held in higher level.
>> >
>> > We should never have THP page in this path in the first place. So why
>> > not changing this to VM_BUG_ON(PageTransHuge).
>> >
>>
>> Ying Han considers to support mlock stat.
>
> OK, got it. What about the following updated changelog instead?
>
> ===
> We do not have to check PageTransHuge in mem_cgroup_update_page_stat
> and fallback into the locked accounting because both move charge and thp
One nitpick: should it be "move account" instead of "move charge"?
--Ying
> split up are done with compound_lock so they cannot race. update vs.
> move is protected by the mem_cgroup_stealed sufficiently.
>
> PageTransHuge pages shouldn't appear in this code path currently because
> we are tracking only file pages at the moment but later we are planning
> to track also other pages (e.g. mlocked ones).
> ===
>
>>
>> Thanks,
>> -Kame
>
> --
> Michal Hocko
> SUSE Labs
> SUSE LINUX s.r.o.
> Lihovarska 1060/12
> 190 00 Praha 9
> Czech Republic
* Re: [RFC] [PATCH 2/7 v2] memcg: add memory barrier for checking account move.
2012-01-19 9:28 ` Michal Hocko
@ 2012-01-19 23:57 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 50+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-01-19 23:57 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-mm@kvack.org, Ying Han, hugh.dickins@tiscali.co.uk,
hannes@cmpxchg.org, cgroups, bsingharora@gmail.com
On Thu, 19 Jan 2012 10:28:34 +0100
Michal Hocko <mhocko@suse.cz> wrote:
> On Thu 19-01-12 11:17:27, KAMEZAWA Hiroyuki wrote:
> > On Wed, 18 Jan 2012 13:37:59 +0100
> > Michal Hocko <mhocko@suse.cz> wrote:
> >
> > > On Wed 18-01-12 09:06:56, KAMEZAWA Hiroyuki wrote:
> > > > On Tue, 17 Jan 2012 16:26:35 +0100
> > > > Michal Hocko <mhocko@suse.cz> wrote:
> > > >
> > > > > On Fri 13-01-12 17:33:47, KAMEZAWA Hiroyuki wrote:
> > > > > > I think this bugfix is needed before going ahead. thoughts?
> > > > > > ==
> > > > > > From 2cb491a41782b39aae9f6fe7255b9159ac6c1563 Mon Sep 17 00:00:00 2001
> > > > > > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > > > > Date: Fri, 13 Jan 2012 14:27:20 +0900
> > > > > > Subject: [PATCH 2/7] memcg: add memory barrier for checking account move.
> > > > > >
> > > > > > At starting move_account(), source memcg's per-cpu variable
> > > > > > MEM_CGROUP_ON_MOVE is set. The page status update
> > > > > > routine check it under rcu_read_lock(). But there is no memory
> > > > > > barrier. This patch adds one.
> > > > >
> > > > > OK this would help to enforce that the CPU would see the current value
> > > > > but what prevents us from the race with the value update without the
> > > > > lock? This is as racy as it was before AFAICS.
> > > > >
> > > >
> > > > Hm, do I misunderstand ?
> > > > ==
> > > > update reference
> > > >
> > > > CPU A CPU B
> > > > set value rcu_read_lock()
> > > > smp_wmb() smp_rmb()
> > > > read_value
> > > > rcu_read_unlock()
> > > > synchronize_rcu().
> > > > ==
> > > > I expect
> > > > If synchronize_rcu() is called before rcu_read_lock() => move_lock_xxx will be held.
> > > > If synchronize_rcu() is called after rcu_read_lock() => update will be delayed.
> > >
> > > Ahh, OK I can see it now. Readers are not that important because it is
> > > actually the updater who is delayed until all preexisting rcu read
> > > sections are finished.
> > >
> > > In that case. Why do we need both barriers? spin_unlock is a full
> > > barrier so maybe we just need smp_rmb before we read value to make sure
> > > that we do not get stalled value when we start rcu_read section after
> > > synchronize_rcu?
> > >
> >
> > I doubt it. Without the barrier, this case can happen:
> >
> > ==
> > update reference
> > CPU A CPU B
> > set value
> > synchronize_rcu() rcu_read_lock()
> > read_value <= find old value
> > rcu_read_unlock()
> > do no lock
> > ==
>
> OK, I have looked at Documentation/memory-barriers.txt again and
> spin_unlock is not a full barrier so we cannot rely on it.
> Anyway you still could do the barrier once after we set all values?
OK. I'll move it to after the for_each_cpu() loop.
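A sketch of one possible placement, based on mem_cgroup_start_move() as it
stands before patch 7/7 of this series:
==
static void mem_cgroup_start_move(struct mem_cgroup *memcg)
{
        int cpu;

        get_online_cpus();
        spin_lock(&memcg->pcp_counter_lock);
        for_each_online_cpu(cpu)
                per_cpu(memcg->stat->count[MEM_CGROUP_ON_MOVE], cpu) += 1;
        memcg->nocpu_base.count[MEM_CGROUP_ON_MOVE] += 1;
        smp_wmb();      /* one barrier, after all the values are set */
        spin_unlock(&memcg->pcp_counter_lock);
        put_online_cpus();

        synchronize_rcu();
}
==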
Thanks,
-Kame
* Re: [RFC] [PATCH 1/7 v2] memcg: remove unnecessary check in mem_cgroup_update_page_stat()
2012-01-19 20:07 ` Ying Han
@ 2012-01-20 0:48 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 50+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-01-20 0:48 UTC (permalink / raw)
To: Ying Han
Cc: Michal Hocko, linux-mm@kvack.org, hugh.dickins@tiscali.co.uk,
hannes@cmpxchg.org, cgroups, bsingharora@gmail.com
On Thu, 19 Jan 2012 12:07:22 -0800
Ying Han <yinghan@google.com> wrote:
> On Wed, Jan 18, 2012 at 5:01 AM, Michal Hocko <mhocko@suse.cz> wrote:
> > On Wed 18-01-12 08:55:58, KAMEZAWA Hiroyuki wrote:
> >> On Tue, 17 Jan 2012 16:16:20 +0100
> >> Michal Hocko <mhocko@suse.cz> wrote:
> >>
> >> > On Fri 13-01-12 17:32:27, KAMEZAWA Hiroyuki wrote:
> >> > >
> >> > > From 788aebf15f3fa37940e0745cab72547e20683bf2 Mon Sep 17 00:00:00 2001
> >> > > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >> > > Date: Thu, 12 Jan 2012 16:08:33 +0900
> >> > > Subject: [PATCH 1/7] memcg: remove unnecessary check in mem_cgroup_update_page_stat()
> >> > >
> >> > > commit 10ea69f1182b removes move_lock_page_cgroup() in thp-split path.
> >> > > So, this PageTransHuge() check is unnecessary, too.
> >> >
> >> > I do not see commit like that in the tree. I guess you meant
> >> > memcg: make mem_cgroup_split_huge_fixup() more efficient which is not
> >> > merged yet, right?
> >> >
> >>
> >> This commit in the linux-next.
> >
> > Referring to commits from linux-next is tricky as it changes all the
> > time. I guess that the full commit subject should be sufficient.
> >
> >> > > Note:
> >> > > - considering when mem_cgroup_update_page_stat() is called,
> >> > >   there will be no race between split_huge_page() and update_page_stat().
> >> > >   All required locks are held in higher level.
> >> >
> >> > We should never have THP page in this path in the first place. So why
> >> > not changing this to VM_BUG_ON(PageTransHuge).
> >> >
> >>
> >> Ying Han considers to support mlock stat.
> >
> > OK, got it. What about the following updated changelog instead?
> >
> > ===
> > We do not have to check PageTransHuge in mem_cgroup_update_page_stat
> > and fallback into the locked accounting because both move charge and thp
>
> one nitpick. Should it be "move account" instead of "move charge"?
>
Ah, yes, you're right.
Taking Michal's comment into account, I'll update and post v3.
Thanks,
-Kame
* Re: [RFC] [PATCH 7/7 v2] memcg: make mem_cgroup_begin_update_stat to use global pcpu.
2012-01-19 14:47 ` Michal Hocko
@ 2012-01-20 2:19 ` KAMEZAWA Hiroyuki
2012-01-20 8:38 ` Michal Hocko
0 siblings, 1 reply; 50+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-01-20 2:19 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-mm@kvack.org, Ying Han, hugh.dickins@tiscali.co.uk,
hannes@cmpxchg.org, cgroups, bsingharora@gmail.com
On Thu, 19 Jan 2012 15:47:12 +0100
Michal Hocko <mhocko@suse.cz> wrote:
> On Fri 13-01-12 17:45:10, KAMEZAWA Hiroyuki wrote:
> > From 3df71cef5757ee6547916c4952f04a263c1b8ddb Mon Sep 17 00:00:00 2001
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > Date: Fri, 13 Jan 2012 17:07:35 +0900
> > Subject: [PATCH 7/7] memcg: make mem_cgroup_begin_update_stat to use global pcpu.
> >
> > Now, a per-cpu flag to show the memcg is under account moving is
> > now implemented as per-memcg-per-cpu.
> >
> > So, when accessing this, we need to access memcg 1st. But this
> > function is called even when status update doesn't occur. Then,
> > accessing struct memcg is an overhead in such case.
> >
> > This patch removes per-cpu-per-memcg MEM_CGROUP_ON_MOVE and add
> > per-cpu vairable to do the same work. For per-memcg, atomic
> > counter is added. By this, mem_cgroup_begin_update_stat() will
> > just access percpu variable in usual case and don't need to find & access
> > memcg. This reduces overhead.
>
> I agree that move_account is not a hot path and that we don't have
> to optimize for it, but I guess we can do better. If we use a cookie
> parameter for
> mem_cgroup_{begin,end}_update_stat(struct page *page, unsigned long *cookie)
> then we can stash the page_cgroup inside and use the last bit for
> 'locked'. Then we do not have to call lookup_page_cgroup again in
> mem_cgroup_update_page_stat and just replace page with the cookie.
> What do you think?
>
Because these routines are called as

mem_cgroup_begin_update_stat()
if (condition)
set_page_flag
mem_cgroup_update_stat()
mem_cgroup_end_update_stat()

In an earlier version (not posted), I did so. Now I don't, for two reasons.

1. I wonder whether it's better not to have extra arguments in begin_xxx;
it will be overhead by itself.

2. My work's final purpose is to integrate page_cgroup into struct page.
If I can do that, the lookup_page_cgroup() cost will be almost zero and
we'll finally revert the cookie.

So, can't we keep this update routine simple for a while?
If it finally turns out to be impossible to integrate page_cgroup into
struct page, I'd like to consider the 'cookie' again.

BTW, if we use a spinlock and need to disable IRQs in begin_update_stat(),
we'll need to pass *flags...
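For reference, a hypothetical shape of such an interface; stat_lock_for() is
a made-up helper standing in for whatever per-page lock would be chosen:
==
static inline bool
mem_cgroup_begin_update_page_stat(struct page *page, unsigned long *flags)
{
        if (mem_cgroup_disabled())
                return false;
        rcu_read_lock();
        if (unlikely(any_mem_cgroup_stealed())) {
                spin_lock_irqsave(stat_lock_for(page), *flags);
                return true;
        }
        return false;
}

static inline void
mem_cgroup_end_update_page_stat(struct page *page, bool locked,
                                unsigned long flags)
{
        if (mem_cgroup_disabled())
                return;
        if (locked)
                spin_unlock_irqrestore(stat_lock_for(page), flags);
        rcu_read_unlock();
}
==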
Thanks,
-Kame
* Re: [RFC] [PATCH 7/7 v2] memcg: make mem_cgroup_begin_update_stat to use global pcpu.
2012-01-20 2:19 ` KAMEZAWA Hiroyuki
@ 2012-01-20 8:38 ` Michal Hocko
0 siblings, 0 replies; 50+ messages in thread
From: Michal Hocko @ 2012-01-20 8:38 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, Ying Han, hugh.dickins@tiscali.co.uk,
hannes@cmpxchg.org, cgroups, bsingharora@gmail.com
On Fri 20-01-12 11:19:47, KAMEZAWA Hiroyuki wrote:
> On Thu, 19 Jan 2012 15:47:12 +0100
> Michal Hocko <mhocko@suse.cz> wrote:
>
> > On Fri 13-01-12 17:45:10, KAMEZAWA Hiroyuki wrote:
> > > From 3df71cef5757ee6547916c4952f04a263c1b8ddb Mon Sep 17 00:00:00 2001
> > > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > Date: Fri, 13 Jan 2012 17:07:35 +0900
> > > Subject: [PATCH 7/7] memcg: make mem_cgroup_begin_update_stat to use global pcpu.
> > >
> > > Now, a per-cpu flag to show the memcg is under account moving is
> > > now implemented as per-memcg-per-cpu.
> > >
> > > So, when accessing this, we need to access memcg 1st. But this
> > > function is called even when status update doesn't occur. Then,
> > > accessing struct memcg is an overhead in such case.
> > >
> > > This patch removes per-cpu-per-memcg MEM_CGROUP_ON_MOVE and add
> > > per-cpu vairable to do the same work. For per-memcg, atomic
> > > counter is added. By this, mem_cgroup_begin_update_stat() will
> > > just access percpu variable in usual case and don't need to find & access
> > > memcg. This reduces overhead.
> >
> > I agree that move_account is not a hot path and that we don't have
> > to optimize for it, but I guess we can do better. If we use a cookie
> > parameter for
> > mem_cgroup_{begin,end}_update_stat(struct page *page, unsigned long *cookie)
> > then we can stash the page_cgroup inside and use the last bit for
> > 'locked'. Then we do not have to call lookup_page_cgroup again in
> > mem_cgroup_update_page_stat and just replace page with the cookie.
> > What do you think?
> >
>
> Because these routines are called as
>
> mem_cgroup_begin_update_stat()
> if (condition)
> set_page_flag
> mem_cgroup_update_stat()
> mem_cgroup_end_update_stat()
>
> In an earlier version (not posted), I did so. Now I don't, for two reasons.
>
> 1. I wonder whether it's better not to have extra arguments in begin_xxx;
> it will be overhead by itself.
I am not sure that would be noticeable.
> 2. My work's final purpose is to integrate page_cgroup into struct page.
> If I can do that, the lookup_page_cgroup() cost will be almost zero and
> we'll finally revert the cookie.
OK
> So, can't we keep this update routine simple for a while?
Sure. I was just concerned that the global move-account state might be an
issue because we would get a side-effect interaction between different
cgroups.
> If it finally turns out to be impossible to integrate page_cgroup into
> struct page, I'd like to consider the 'cookie' again.
>
> BTW, if we use a spinlock and need to disable IRQs in begin_update_stat(),
> we'll need to pass *flags...
Right, you said that dirty page accounting needs to be callable from
IRQ context as well.
>
> Thanks,
> -Kame
>
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic
* Re: [RFC] [PATCH 7/7 v2] memcg: make mem_cgroup_begin_update_stat to use global pcpu.
2012-01-13 8:45 ` [RFC] [PATCH 7/7 v2] memcg: make mem_cgroup_begin_update_stat to use global pcpu KAMEZAWA Hiroyuki
2012-01-19 14:47 ` Michal Hocko
@ 2012-01-20 8:40 ` Greg Thelen
2012-01-24 3:18 ` KAMEZAWA Hiroyuki
1 sibling, 1 reply; 50+ messages in thread
From: Greg Thelen @ 2012-01-20 8:40 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, Ying Han, hugh.dickins@tiscali.co.uk,
hannes@cmpxchg.org, Michal Hocko, cgroups, bsingharora@gmail.com
On Fri, Jan 13, 2012 at 12:45 AM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> From 3df71cef5757ee6547916c4952f04a263c1b8ddb Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Date: Fri, 13 Jan 2012 17:07:35 +0900
> Subject: [PATCH 7/7] memcg: make mem_cgroup_begin_update_stat to use global pcpu.
>
> Now, a per-cpu flag to show the memcg is under account moving is
> now implemented as per-memcg-per-cpu.
>
> So, when accessing this, we need to access memcg 1st. But this
> function is called even when status update doesn't occur. Then,
> accessing struct memcg is an overhead in such case.
>
> This patch removes per-cpu-per-memcg MEM_CGROUP_ON_MOVE and add
> per-cpu vairable to do the same work. For per-memcg, atomic
> counter is added. By this, mem_cgroup_begin_update_stat() will
> just access percpu variable in usual case and don't need to find & access
> memcg. This reduces overhead.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
> include/linux/memcontrol.h | 16 +++++++++-
> mm/memcontrol.c | 67 +++++++++++++++++++++----------------------
> 2 files changed, 47 insertions(+), 36 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 976b58c..26a4baa 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -151,12 +151,22 @@ static inline bool mem_cgroup_disabled(void)
> * mem_cgroup_update_page_stat(page, idx, val)
> * mem_cgroup_end_update_page_stat(page, locked);
> */
> +DECLARE_PER_CPU(int, mem_cgroup_account_moving);
> +static inline bool any_mem_cgroup_stealed(void)
> +{
> + smp_rmb();
> + return this_cpu_read(mem_cgroup_account_moving) > 0;
> +}
> +
> bool __mem_cgroup_begin_update_page_stat(struct page *page);
> static inline bool mem_cgroup_begin_update_page_stat(struct page *page)
> {
> if (mem_cgroup_disabled())
> return false;
> - return __mem_cgroup_begin_update_page_stat(page);
> + rcu_read_lock();
> + if (unlikely(any_mem_cgroup_stealed()))
> + return __mem_cgroup_begin_update_page_stat(page);
> + return false;
> }
> void mem_cgroup_update_page_stat(struct page *page,
> enum mem_cgroup_page_stat_item idx,
> @@ -167,7 +177,9 @@ mem_cgroup_end_update_page_stat(struct page *page, bool locked)
> {
> if (mem_cgroup_disabled())
> return;
> - __mem_cgroup_end_update_page_stat(page, locked);
> + if (locked)
> + __mem_cgroup_end_update_page_stat(page, locked);
> + rcu_read_unlock();
> }
>
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 8b67ccf..4836e8d 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -89,7 +89,6 @@ enum mem_cgroup_stat_index {
> MEM_CGROUP_STAT_FILE_MAPPED, /* # of pages charged as file rss */
> MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> MEM_CGROUP_STAT_DATA, /* end of data requires synchronization */
> - MEM_CGROUP_ON_MOVE, /* someone is moving account between groups */
> MEM_CGROUP_STAT_NSTATS,
> };
>
> @@ -279,6 +278,8 @@ struct mem_cgroup {
> * mem_cgroup ? And what type of charges should we move ?
> */
> unsigned long move_charge_at_immigrate;
> + /* set when a page under this memcg may be moving to other memcg */
> + atomic_t account_moving;
> /*
> * percpu counter.
> */
> @@ -1250,20 +1251,27 @@ int mem_cgroup_swappiness(struct mem_cgroup *memcg)
> return memcg->swappiness;
> }
>
> +/*
> + * For quick check, for avoiding looking up memcg, system-wide
> + * per-cpu check is provided.
> + */
> +DEFINE_PER_CPU(int, mem_cgroup_account_moving);
Why is this a per-cpu counter? Can this be a single atomic_t
instead, or does cpu hotplug require per-cpu state? In the common
case, when there is no move in progress, the counter would be
zero and clean in all the cpu caches that need it. When moving pages,
mem_cgroup_start_move() would atomic_inc the counter.
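A sketch of that alternative, assuming cpu hotplug does not actually need
the per-cpu state; mem_cgroup_moving is a made-up name:
==
static atomic_t mem_cgroup_moving = ATOMIC_INIT(0);

static void mem_cgroup_start_move(struct mem_cgroup *memcg)
{
        atomic_inc(&mem_cgroup_moving);
        atomic_inc(&memcg->account_moving);
        synchronize_rcu();
}

static void mem_cgroup_end_move(struct mem_cgroup *memcg)
{
        atomic_dec(&memcg->account_moving);
        atomic_dec(&mem_cgroup_moving);
}

static inline bool any_mem_cgroup_stealed(void)
{
        /* in the common case this cache line stays clean and shared */
        return atomic_read(&mem_cgroup_moving) > 0;
}
==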
> +DEFINE_SPINLOCK(mem_cgroup_stealed_lock);
> +
> static void mem_cgroup_start_move(struct mem_cgroup *memcg)
> {
> int cpu;
>
> get_online_cpus();
> - spin_lock(&memcg->pcp_counter_lock);
> + spin_lock(&mem_cgroup_stealed_lock);
> for_each_online_cpu(cpu) {
> - per_cpu(memcg->stat->count[MEM_CGROUP_ON_MOVE], cpu) += 1;
> + per_cpu(mem_cgroup_account_moving, cpu) += 1;
> smp_wmb();
> }
> - memcg->nocpu_base.count[MEM_CGROUP_ON_MOVE] += 1;
> - spin_unlock(&memcg->pcp_counter_lock);
> + spin_unlock(&mem_cgroup_stealed_lock);
> put_online_cpus();
>
> + atomic_inc(&memcg->account_moving);
> synchronize_rcu();
> }
>
> @@ -1274,11 +1282,12 @@ static void mem_cgroup_end_move(struct mem_cgroup *memcg)
> if (!memcg)
> return;
> get_online_cpus();
> - spin_lock(&memcg->pcp_counter_lock);
> - for_each_online_cpu(cpu)
> - per_cpu(memcg->stat->count[MEM_CGROUP_ON_MOVE], cpu) -= 1;
> - memcg->nocpu_base.count[MEM_CGROUP_ON_MOVE] -= 1;
> - spin_unlock(&memcg->pcp_counter_lock);
> + spin_lock(&mem_cgroup_stealed_lock);
> + for_each_online_cpu(cpu) {
> + per_cpu(mem_cgroup_account_moving, cpu) -= 1;
> + }
> + spin_unlock(&mem_cgroup_stealed_lock);
> + atomic_dec(&memcg->account_moving);
> put_online_cpus();
> }
> /*
> @@ -1296,8 +1305,7 @@ static void mem_cgroup_end_move(struct mem_cgroup *memcg)
> static bool mem_cgroup_stealed(struct mem_cgroup *memcg)
> {
> VM_BUG_ON(!rcu_read_lock_held());
> - smp_rmb();
> - return this_cpu_read(memcg->stat->count[MEM_CGROUP_ON_MOVE]) > 0;
> + return atomic_read(&memcg->account_moving) > 0;
> }
>
> static bool mem_cgroup_under_move(struct mem_cgroup *memcg)
> @@ -1343,10 +1351,9 @@ static bool mem_cgroup_wait_acct_move(struct mem_cgroup *memcg)
> * page satus accounting. To avoid that, we need some locks. In general,
> * ading atomic ops to hot path is very bad. We're using 2 level logic.
> *
> - * When a thread starts moving account information, per-cpu MEM_CGROUP_ON_MOVE
> - * value is set. If MEM_CGROUP_ON_MOVE==0, there are no race and page status
> - * update can be done withou any locks. If MEM_CGROUP_ON_MOVE>0, we use
> - * following hashed rwlocks.
> + * When a thread starts moving account information, memcg->account_moving
> + * value is set. If ==0, there are no race and page status update can be done
> + * without any locks. If account_moving >0, we use following hashed rwlocks.
> * - At updating information, we hold rlock.
> * - When a page is picked up and being moved, wlock is held.
> *
> @@ -1354,7 +1361,7 @@ static bool mem_cgroup_wait_acct_move(struct mem_cgroup *memcg)
> */
>
> /*
> - * This rwlock is accessed only when MEM_CGROUP_ON_MOVE > 0.
> + * This rwlock is accessed only when account_moving > 0.
> */
> #define NR_MOVE_ACCOUNT_LOCKS (NR_CPUS)
> #define move_account_hash(page) ((page_to_pfn(page) % NR_MOVE_ACCOUNT_LOCKS))
> @@ -1907,9 +1914,8 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
> * if there are race with "uncharge". Statistics itself is properly handled
> * by flags.
> *
> - * Considering "move", this is an only case we see a race. To make the race
> - * small, we check MEM_CGROUP_ON_MOVE percpu value and detect there are
> - * possibility of race condition. If there is, we take a lock.
> + * If any_mem_cgroup_stealed() && mem_cgroup_stealed(), there is
> + * a possiblity of race condition and we take a lock.
> */
>
> bool __mem_cgroup_begin_update_page_stat(struct page *page)
> @@ -1918,7 +1924,6 @@ bool __mem_cgroup_begin_update_page_stat(struct page *page)
> bool locked = false;
> struct mem_cgroup *memcg;
>
> - rcu_read_lock();
> memcg = pc->mem_cgroup;
>
> if (!memcg || !PageCgroupUsed(pc))
> @@ -1933,9 +1938,7 @@ out:
>
> void __mem_cgroup_end_update_page_stat(struct page *page, bool locked)
> {
> - if (locked)
> - mem_cgroup_account_move_runlock(page);
> - rcu_read_unlock();
> + mem_cgroup_account_move_runlock(page);
> }
>
> void mem_cgroup_update_page_stat(struct page *page,
> @@ -2133,18 +2136,14 @@ static void mem_cgroup_drain_pcp_counter(struct mem_cgroup *memcg, int cpu)
> per_cpu(memcg->stat->events[i], cpu) = 0;
> memcg->nocpu_base.events[i] += x;
> }
> - /* need to clear ON_MOVE value, works as a kind of lock. */
> - per_cpu(memcg->stat->count[MEM_CGROUP_ON_MOVE], cpu) = 0;
> spin_unlock(&memcg->pcp_counter_lock);
> }
>
> -static void synchronize_mem_cgroup_on_move(struct mem_cgroup *memcg, int cpu)
> +static void synchronize_mem_cgroup_on_move(int cpu)
> {
> - int idx = MEM_CGROUP_ON_MOVE;
> -
> - spin_lock(&memcg->pcp_counter_lock);
> - per_cpu(memcg->stat->count[idx], cpu) = memcg->nocpu_base.count[idx];
> - spin_unlock(&memcg->pcp_counter_lock);
> + spin_lock(&mem_cgroup_stealed_lock);
> + per_cpu(mem_cgroup_account_moving, cpu) = 0;
> + spin_unlock(&mem_cgroup_stealed_lock);
> }
>
> static int __cpuinit memcg_cpu_hotplug_callback(struct notifier_block *nb,
> @@ -2156,8 +2155,7 @@ static int __cpuinit memcg_cpu_hotplug_callback(struct notifier_block *nb,
> struct mem_cgroup *iter;
>
> if ((action == CPU_ONLINE)) {
> - for_each_mem_cgroup(iter)
> - synchronize_mem_cgroup_on_move(iter, cpu);
> + synchronize_mem_cgroup_on_move(cpu);
> return NOTIFY_OK;
> }
>
> @@ -2167,6 +2165,7 @@ static int __cpuinit memcg_cpu_hotplug_callback(struct notifier_block *nb,
> for_each_mem_cgroup(iter)
> mem_cgroup_drain_pcp_counter(iter, cpu);
>
> + per_cpu(mem_cgroup_account_moving, cpu) = 0;
> stock = &per_cpu(memcg_stock, cpu);
> drain_stock(stock);
> return NOTIFY_OK;
> --
> 1.7.4.1
>
* Re: [RFC] [PATCH 2/7 v2] memcg: add memory barrier for checking account move.
2012-01-19 2:17 ` KAMEZAWA Hiroyuki
2012-01-19 9:28 ` Michal Hocko
@ 2012-01-20 18:08 ` Ying Han
2012-01-23 9:04 ` Michal Hocko
1 sibling, 1 reply; 50+ messages in thread
From: Ying Han @ 2012-01-20 18:08 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Michal Hocko, linux-mm@kvack.org, hugh.dickins@tiscali.co.uk,
hannes@cmpxchg.org, cgroups, bsingharora@gmail.com
On Wed, Jan 18, 2012 at 6:17 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Wed, 18 Jan 2012 13:37:59 +0100
> Michal Hocko <mhocko@suse.cz> wrote:
>
>> On Wed 18-01-12 09:06:56, KAMEZAWA Hiroyuki wrote:
>> > On Tue, 17 Jan 2012 16:26:35 +0100
>> > Michal Hocko <mhocko@suse.cz> wrote:
>> >
>> > > On Fri 13-01-12 17:33:47, KAMEZAWA Hiroyuki wrote:
>> > > > I think this bugfix is needed before going ahead. thoughts?
>> > > > ==
>> > > > From 2cb491a41782b39aae9f6fe7255b9159ac6c1563 Mon Sep 17 00:00:00 2001
>> > > > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> > > > Date: Fri, 13 Jan 2012 14:27:20 +0900
>> > > > Subject: [PATCH 2/7] memcg: add memory barrier for checking account move.
>> > > >
>> > > > At starting move_account(), source memcg's per-cpu variable
>> > > > MEM_CGROUP_ON_MOVE is set. The page status update
>> > > > routine check it under rcu_read_lock(). But there is no memory
>> > > > barrier. This patch adds one.
>> > >
>> > > OK this would help to enforce that the CPU would see the current value
>> > > but what prevents us from the race with the value update without the
>> > > lock? This is as racy as it was before AFAICS.
>> > >
>> >
>> > Hm, do I misunderstand ?
>> > ==
>> > update reference
>> >
>> > CPU A CPU B
>> > set value rcu_read_lock()
>> > smp_wmb() smp_rmb()
>> > read_value
>> > rcu_read_unlock()
>> > synchronize_rcu().
>> > ==
>> > I expect
>> > If synchronize_rcu() is called before rcu_read_lock() => move_lock_xxx will be held.
>> > If synchronize_rcu() is called after rcu_read_lock() => update will be delayed.
>>
>> Ahh, OK I can see it now. Readers are not that important because it is
>> actually the updater who is delayed until all preexisting rcu read
>> sections are finished.
>>
>> In that case. Why do we need both barriers? spin_unlock is a full
>> barrier so maybe we just need smp_rmb before we read value to make sure
>> that we do not get stalled value when we start rcu_read section after
>> synchronize_rcu?
>>
>
> I doubt it. Without the barrier, this case can happen:
>
> ==
> update reference
> CPU A CPU B
> set value
> synchronize_rcu() rcu_read_lock()
> read_value <= find old value
> rcu_read_unlock()
> do no lock
> ==
Hi Kame,
Can you help clarify the example above a bit more? Why does
read_value get the old value after synchronize_rcu()?
Sorry for getting into this late.
--Ying
>
>> > Here, cpu B needs to read most recently updated value.
>>
>> If it reads the old value then it would think that we are not moving and
>> so we would account to the old group and move it later on, right?
>>
> Right. Without move_lock, we're not sure whether we'll see the old or the new
> pc->mem_cgroup. This will cause misaccounting.
>
>
> Thanks,
> -Kame
* Re: [RFC] [PATCH 2/7 v2] memcg: add memory barrier for checking account move.
2012-01-20 18:08 ` Ying Han
@ 2012-01-23 9:04 ` Michal Hocko
2012-01-24 3:21 ` KAMEZAWA Hiroyuki
2012-01-24 19:04 ` Ying Han
0 siblings, 2 replies; 50+ messages in thread
From: Michal Hocko @ 2012-01-23 9:04 UTC (permalink / raw)
To: Ying Han
Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org, hugh.dickins@tiscali.co.uk,
hannes@cmpxchg.org, cgroups, bsingharora@gmail.com
On Fri 20-01-12 10:08:44, Ying Han wrote:
> On Wed, Jan 18, 2012 at 6:17 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Wed, 18 Jan 2012 13:37:59 +0100
> > Michal Hocko <mhocko@suse.cz> wrote:
> >
> >> On Wed 18-01-12 09:06:56, KAMEZAWA Hiroyuki wrote:
> >> > On Tue, 17 Jan 2012 16:26:35 +0100
> >> > Michal Hocko <mhocko@suse.cz> wrote:
> >> >
> >> > > On Fri 13-01-12 17:33:47, KAMEZAWA Hiroyuki wrote:
> >> > > > I think this bugfix is needed before going ahead. thoughts?
> >> > > > ==
> >> > > > From 2cb491a41782b39aae9f6fe7255b9159ac6c1563 Mon Sep 17 00:00:00 2001
> >> > > > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >> > > > Date: Fri, 13 Jan 2012 14:27:20 +0900
> >> > > > Subject: [PATCH 2/7] memcg: add memory barrier for checking account move.
> >> > > >
> >> > > > At starting move_account(), source memcg's per-cpu variable
> >> > > > MEM_CGROUP_ON_MOVE is set. The page status update
> >> > > > routine check it under rcu_read_lock(). But there is no memory
> >> > > > barrier. This patch adds one.
> >> > >
> >> > > OK this would help to enforce that the CPU would see the current value
> >> > > but what prevents us from the race with the value update without the
> >> > > lock? This is as racy as it was before AFAICS.
> >> > >
> >> >
> >> > Hm, do I misunderstand ?
> >> > ==
> >> > update reference
> >> >
> >> > CPU A CPU B
> >> > set value rcu_read_lock()
> >> > smp_wmb() smp_rmb()
> >> > read_value
> >> > rcu_read_unlock()
> >> > synchronize_rcu().
> >> > ==
> >> > I expect
> >> > If synchronize_rcu() is called before rcu_read_lock() => move_lock_xxx will be held.
> >> > If synchronize_rcu() is called after rcu_read_lock() => update will be delayed.
> >>
> >> Ahh, OK I can see it now. Readers are not that important because it is
> >> actually the updater who is delayed until all preexisting rcu read
> >> sections are finished.
> >>
> >> In that case. Why do we need both barriers? spin_unlock is a full
> >> barrier so maybe we just need smp_rmb before we read value to make sure
> >> that we do not get stalled value when we start rcu_read section after
> >> synchronize_rcu?
> >>
> >
> > I doubt it. Without the barrier, this case can happen:
> >
> > ==
> > update reference
> > CPU A CPU B
> > set value
> > synchronize_rcu() rcu_read_lock()
> > read_value <= find old value
> > rcu_read_unlock()
> > do no lock
> > ==
>
> Hi Kame,
>
> Can you help to clarify a bit more on the example above? Why
> read_value got the old value after synchronize_rcu().
AFAIU it is because rcu_read_unlock doesn't force any memory barrier
and we synchronize only the updater (with synchronize_rcu), so nothing
guarantees that the value set on CPU A is visible to CPU B.
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic
* Re: [RFC] [PATCH 3/7 v2] memcg: remove PCG_MOVE_LOCK flag from pc->flags
2012-01-13 8:40 ` [RFC] [PATCH 3/7 v2] memcg: remove PCG_MOVE_LOCK flag from pc->flags KAMEZAWA Hiroyuki
2012-01-16 12:55 ` Kirill A. Shutemov
2012-01-17 16:46 ` Michal Hocko
@ 2012-01-23 22:02 ` Ying Han
2012-01-24 4:47 ` KAMEZAWA Hiroyuki
2 siblings, 1 reply; 50+ messages in thread
From: Ying Han @ 2012-01-23 22:02 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, hugh.dickins@tiscali.co.uk,
hannes@cmpxchg.org, Michal Hocko, cgroups, bsingharora@gmail.com
On Fri, Jan 13, 2012 at 12:40 AM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> From 1008e84d94245b1e7c4d237802ff68ff00757736 Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Date: Thu, 12 Jan 2012 15:53:24 +0900
> Subject: [PATCH 3/7] memcg: remove PCG_MOVE_LOCK flag from pc->flags.
>
> PCG_MOVE_LOCK bit is used for bit spinlock for avoiding race between
> memcg's account moving and page state statistics updates.
>
> Considering page-statistics update, very hot path, this lock is
> taken only when someone is moving account (or PageTransHuge())
> And, now, all moving-account between memcgroups (by task-move)
> are serialized.
This might be a side question, but can you clarify the serialization here?
Does it mean that we allow only one task-move at a time, system-wide?
Thanks
--Ying
>
> So, it seems too costly to have 1bit per page for this purpose.
>
> This patch removes PCG_MOVE_LOCK and add hashed rwlock array
> instead of it. This works well enough. Even when we need to
> take the lock, we don't need to disable IRQ in hot path because
> of using rwlock.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
> include/linux/page_cgroup.h | 19 -----------
> mm/memcontrol.c | 72 ++++++++++++++++++++++++++++++++++++++----
> 2 files changed, 65 insertions(+), 26 deletions(-)
>
> diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
> index a2d1177..5dba799 100644
> --- a/include/linux/page_cgroup.h
> +++ b/include/linux/page_cgroup.h
> @@ -8,7 +8,6 @@ enum {
> PCG_USED, /* this object is in use. */
> PCG_MIGRATION, /* under page migration */
> /* flags for mem_cgroup and file and I/O status */
> - PCG_MOVE_LOCK, /* For race between move_account v.s. following bits */
> PCG_FILE_MAPPED, /* page is accounted as "mapped" */
> __NR_PCG_FLAGS,
> };
> @@ -95,24 +94,6 @@ static inline void unlock_page_cgroup(struct page_cgroup *pc)
> bit_spin_unlock(PCG_LOCK, &pc->flags);
> }
>
> -static inline void move_lock_page_cgroup(struct page_cgroup *pc,
> - unsigned long *flags)
> -{
> - /*
> - * We know updates to pc->flags of page cache's stats are from both of
> - * usual context or IRQ context. Disable IRQ to avoid deadlock.
> - */
> - local_irq_save(*flags);
> - bit_spin_lock(PCG_MOVE_LOCK, &pc->flags);
> -}
> -
> -static inline void move_unlock_page_cgroup(struct page_cgroup *pc,
> - unsigned long *flags)
> -{
> - bit_spin_unlock(PCG_MOVE_LOCK, &pc->flags);
> - local_irq_restore(*flags);
> -}
> -
> #else /* CONFIG_CGROUP_MEM_RES_CTLR */
> struct page_cgroup;
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 9019069..61e276f 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1338,6 +1338,65 @@ static bool mem_cgroup_wait_acct_move(struct mem_cgroup *memcg)
> return false;
> }
>
> +/*
> > + * When moving accounting information between cgroups, we'll have a race with
> > + * page status accounting. To avoid that, we need some locks. In general,
> > + * adding atomic ops to a hot path is very bad, so we're using 2-level logic.
> > + *
> > + * When a thread starts moving account information, the per-cpu MEM_CGROUP_ON_MOVE
> > + * value is set. If MEM_CGROUP_ON_MOVE==0, there is no race and the page status
> > + * update can be done without any locks. If MEM_CGROUP_ON_MOVE>0, we use the
> > + * following hashed rwlocks:
> > + * - At updating information, we hold rlock.
> > + * - When a page is picked up and being moved, wlock is held.
> > + *
> > + * This logic works well enough because moving account is not a usual event.
> + */
> +
> +/*
> + * This rwlock is accessed only when MEM_CGROUP_ON_MOVE > 0.
> + */
> +#define NR_MOVE_ACCOUNT_LOCKS (NR_CPUS)
> +#define move_account_hash(page) ((page_to_pfn(page) % NR_MOVE_ACCOUNT_LOCKS))
> +static rwlock_t move_account_locks[NR_MOVE_ACCOUNT_LOCKS];
> +
> +static rwlock_t *__mem_cgroup_account_move_lock(struct page *page)
> +{
> + int hnum = move_account_hash(page);
> +
> + return &move_account_locks[hnum];
> +}
> +
> +static void mem_cgroup_account_move_rlock(struct page *page)
> +{
> + read_lock(__mem_cgroup_account_move_lock(page));
> +}
> +
> +static void mem_cgroup_account_move_runlock(struct page *page)
> +{
> + read_unlock(__mem_cgroup_account_move_lock(page));
> +}
> +
> +static void mem_cgroup_account_move_wlock(struct page *page,
> + unsigned long *flags)
> +{
> + write_lock_irqsave(__mem_cgroup_account_move_lock(page), *flags);
> +}
> +
> +static void mem_cgroup_account_move_wunlock(struct page *page,
> + unsigned long flags)
> +{
> + write_unlock_irqrestore(__mem_cgroup_account_move_lock(page), flags);
> +}
> +
> +static void mem_cgroup_account_move_lock_init(void)
> +{
> + int num;
> +
> + for (num = 0; num < NR_MOVE_ACCOUNT_LOCKS; num++)
> + rwlock_init(&move_account_locks[num]);
> +}
> +
> /**
> * mem_cgroup_print_oom_info: Called from OOM with tasklist_lock held in read mode.
> * @memcg: The memory cgroup that went over limit
> @@ -1859,7 +1918,6 @@ void mem_cgroup_update_page_stat(struct page *page,
> struct mem_cgroup *memcg;
> struct page_cgroup *pc = lookup_page_cgroup(page);
> bool need_unlock = false;
> - unsigned long uninitialized_var(flags);
>
> if (mem_cgroup_disabled())
> return;
> @@ -1871,7 +1929,7 @@ void mem_cgroup_update_page_stat(struct page *page,
> /* pc->mem_cgroup is unstable ? */
> if (unlikely(mem_cgroup_stealed(memcg))) {
> /* take a lock against to access pc->mem_cgroup */
> - move_lock_page_cgroup(pc, &flags);
> + mem_cgroup_account_move_rlock(page);
> need_unlock = true;
> memcg = pc->mem_cgroup;
> if (!memcg || !PageCgroupUsed(pc))
> @@ -1894,7 +1952,7 @@ void mem_cgroup_update_page_stat(struct page *page,
>
> out:
> if (unlikely(need_unlock))
> - move_unlock_page_cgroup(pc, &flags);
> + mem_cgroup_account_move_runlock(page);
> rcu_read_unlock();
> return;
> }
> @@ -2457,8 +2515,7 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg,
>
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>
> -#define PCGF_NOCOPY_AT_SPLIT ((1 << PCG_LOCK) | (1 << PCG_MOVE_LOCK) |\
> - (1 << PCG_MIGRATION))
> +#define PCGF_NOCOPY_AT_SPLIT ((1 << PCG_LOCK) | (1 << PCG_MIGRATION))
> /*
> * Because tail pages are not marked as "used", set it. We're under
> * zone->lru_lock, 'splitting on pmd' and compound_lock.
> @@ -2537,7 +2594,7 @@ static int mem_cgroup_move_account(struct page *page,
> if (!PageCgroupUsed(pc) || pc->mem_cgroup != from)
> goto unlock;
>
> - move_lock_page_cgroup(pc, &flags);
> + mem_cgroup_account_move_wlock(page, &flags);
>
> if (PageCgroupFileMapped(pc)) {
> /* Update mapped_file data for mem_cgroup */
> @@ -2561,7 +2618,7 @@ static int mem_cgroup_move_account(struct page *page,
> * guaranteed that "to" is never removed. So, we don't check rmdir
> * status here.
> */
> - move_unlock_page_cgroup(pc, &flags);
> + mem_cgroup_account_move_wunlock(page, flags);
> ret = 0;
> unlock:
> unlock_page_cgroup(pc);
> @@ -4938,6 +4995,7 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
> INIT_WORK(&stock->work, drain_local_stock);
> }
> hotcpu_notifier(memcg_cpu_hotplug_callback, 0);
> + mem_cgroup_account_move_lock_init();
> } else {
> parent = mem_cgroup_from_cont(cont->parent);
> memcg->use_hierarchy = parent->use_hierarchy;
> --
> 1.7.4.1
>
>
* Re: [RFC] [PATCH 3/7 v2] memcg: remove PCG_MOVE_LOCK flag from pc->flags
2012-01-18 23:53 ` KAMEZAWA Hiroyuki
@ 2012-01-23 22:05 ` Ying Han
2012-01-24 4:59 ` KAMEZAWA Hiroyuki
2012-01-24 8:43 ` Michal Hocko
0 siblings, 2 replies; 50+ messages in thread
From: Ying Han @ 2012-01-23 22:05 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Michal Hocko, linux-mm@kvack.org, hugh.dickins@tiscali.co.uk,
hannes@cmpxchg.org, cgroups, bsingharora@gmail.com
On Wed, Jan 18, 2012 at 3:53 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Wed, 18 Jan 2012 11:47:03 +0100
> Michal Hocko <mhocko@suse.cz> wrote:
>
>> On Wed 18-01-12 09:12:26, KAMEZAWA Hiroyuki wrote:
>> > On Tue, 17 Jan 2012 17:46:05 +0100
>> > Michal Hocko <mhocko@suse.cz> wrote:
>> >
>> > > On Fri 13-01-12 17:40:19, KAMEZAWA Hiroyuki wrote:
>> [...]
>> > > > This patch removes PCG_MOVE_LOCK and add hashed rwlock array
>> > > > instead of it. This works well enough. Even when we need to
>> > > > take the lock,
>> > >
>> > > Hmmm, rwlocks are not popular these days very much.
>> > > Anyway, can we rather make it (source) memcg (bit)spinlock instead. We
>> > > would reduce false sharing this way and would penalize only pages from
>> > > the moving group.
>> > >
>> > per-memcg spinlock ?
>>
>> Yes
>>
>> > The reason I used rwlock() is to avoid disabling IRQ. This routine
>> > will be called by IRQ context (for dirty ratio support). So, IRQ
>> > disable will be required if we use spinlock.
>>
>> OK, I have missed the comment about disabling IRQs. It's true that we do
>> not have to be afraid about deadlocks if the lock is held only for
>> reading from the irq context but does the spinlock makes a performance
>> bottleneck? We are talking about the slowpath.
>> I could see the reason for the read lock when doing hashed locks because
>> they are global but if we make the lock per memcg then we shouldn't
>> interfere with other updates which are not blocked by the move.
>>
>
> Hm, ok. In the next version, I'll use per-memcg spinlock (with hash if necessary)
Just want to make sure I understand it: even if we make the lock
per-memcg, there is still false sharing of pc within one memcg. Do
we need to demonstrate the effect?
Also, I don't get the point of why a spinlock instead of an rwlock in this case?
--Ying
>
> Thanks,
> -Kame
>
>
* Re: [RFC] [PATCH 7/7 v2] memcg: make mem_cgroup_begin_update_stat to use global pcpu.
2012-01-20 8:40 ` Greg Thelen
@ 2012-01-24 3:18 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 50+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-01-24 3:18 UTC (permalink / raw)
To: Greg Thelen
Cc: linux-mm@kvack.org, Ying Han, hugh.dickins@tiscali.co.uk,
hannes@cmpxchg.org, Michal Hocko, cgroups, bsingharora@gmail.com
On Fri, 20 Jan 2012 00:40:34 -0800
Greg Thelen <gthelen@google.com> wrote:
> On Fri, Jan 13, 2012 at 12:45 AM, KAMEZAWA Hiroyuki
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 8b67ccf..4836e8d 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -89,7 +89,6 @@ enum mem_cgroup_stat_index {
> >  	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
> >  	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> >  	MEM_CGROUP_STAT_DATA, /* end of data requires synchronization */
> > -	MEM_CGROUP_ON_MOVE,	/* someone is moving account between groups */
> >  	MEM_CGROUP_STAT_NSTATS,
> >  };
> >
> > @@ -279,6 +278,8 @@ struct mem_cgroup {
> >  	 * mem_cgroup ? And what type of charges should we move ?
> >  	 */
> >  	unsigned long	move_charge_at_immigrate;
> > +	/* set when a page under this memcg may be moving to other memcg */
> > +	atomic_t	account_moving;
> >  	/*
> >  	 * percpu counter.
> >  	 */
> > @@ -1250,20 +1251,27 @@ int mem_cgroup_swappiness(struct mem_cgroup *memcg)
> >  	return memcg->swappiness;
> >  }
> >
> > +/*
> > + * For quick check, for avoiding looking up memcg, system-wide
> > + * per-cpu check is provided.
> > + */
> > +DEFINE_PER_CPU(int, mem_cgroup_account_moving);
>
> Why is this a per-cpu counter? Can this be an single atomic_t
> instead, or does cpu hotplug require per-cpu state? In the common
> case, when there is no move in progress, then the counter would be
> zero and clean in all cpu caches that need it. When moving pages,
> mem_cgroup_start_move() would atomic_inc the counter.
>
Ok, atomic_t will be simple.
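Something like this, maybe (rough sketch, untested; names are tentative):
==
/* system-wide: non-zero while any account moving is in progress */
static atomic_t mem_cgroup_moving_account = ATOMIC_INIT(0);

static void mem_cgroup_start_move_account(void)
{
	atomic_inc(&mem_cgroup_moving_account);
	/* atomic_inc() implies no barrier; order it before moving starts */
	smp_mb__after_atomic_inc();
	synchronize_rcu();
}

static void mem_cgroup_end_move_account(void)
{
	atomic_dec(&mem_cgroup_moving_account);
}

/* page-stat update side, called under rcu_read_lock() */
static bool mem_cgroup_moving(void)
{
	return atomic_read(&mem_cgroup_moving_account) > 0;
}
==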
Thanks,
-Kame
* Re: [RFC] [PATCH 2/7 v2] memcg: add memory barrier for checking account move.
2012-01-23 9:04 ` Michal Hocko
@ 2012-01-24 3:21 ` KAMEZAWA Hiroyuki
2012-01-24 8:49 ` Michal Hocko
2012-01-24 19:04 ` Ying Han
1 sibling, 1 reply; 50+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-01-24 3:21 UTC (permalink / raw)
To: Michal Hocko
Cc: Ying Han, linux-mm@kvack.org, hugh.dickins@tiscali.co.uk,
hannes@cmpxchg.org, cgroups, bsingharora@gmail.com
On Mon, 23 Jan 2012 10:04:36 +0100
Michal Hocko <mhocko@suse.cz> wrote:
> On Fri 20-01-12 10:08:44, Ying Han wrote:
> > On Wed, Jan 18, 2012 at 6:17 PM, KAMEZAWA Hiroyuki
> > <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > On Wed, 18 Jan 2012 13:37:59 +0100
> > > Michal Hocko <mhocko@suse.cz> wrote:
> > >
> > >> On Wed 18-01-12 09:06:56, KAMEZAWA Hiroyuki wrote:
> > >> > On Tue, 17 Jan 2012 16:26:35 +0100
> > >> > Michal Hocko <mhocko@suse.cz> wrote:
> > >> >
> > >> > > On Fri 13-01-12 17:33:47, KAMEZAWA Hiroyuki wrote:
> > >> > > > I think this bugfix is needed before going ahead. thoughts?
> > >> > > > ==
> > >> > > > From 2cb491a41782b39aae9f6fe7255b9159ac6c1563 Mon Sep 17 00:00:00 2001
> > >> > > > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > >> > > > Date: Fri, 13 Jan 2012 14:27:20 +0900
> > >> > > > Subject: [PATCH 2/7] memcg: add memory barrier for checking account move.
> > >> > > >
> > >> > > > At starting move_account(), source memcg's per-cpu variable
> > >> > > > MEM_CGROUP_ON_MOVE is set. The page status update
> > >> > > > routine check it under rcu_read_lock(). But there is no memory
> > >> > > > barrier. This patch adds one.
> > >> > >
> > >> > > OK this would help to enforce that the CPU would see the current value
> > >> > > but what prevents us from the race with the value update without the
> > >> > > lock? This is as racy as it was before AFAICS.
> > >> > >
> > >> >
> > >> > Hm, do I misunderstand ?
> > >> > ==
> > >> >    update                   reference
> > >> >
> > >> >    CPU A                    CPU B
> > >> >  set value               rcu_read_lock()
> > >> >  smp_wmb()               smp_rmb()
> > >> >                          read_value
> > >> >                          rcu_read_unlock()
> > >> >  synchronize_rcu().
> > >> > ==
> > >> > I expect
> > >> > If synchronize_rcu() is called before rcu_read_lock() => move_lock_xxx will be held.
> > >> > If synchronize_rcu() is called after rcu_read_lock() => update will be delayed.
> > >>
> > >> Ahh, OK I can see it now. Readers are not that important because it is
> > >> actually the updater who is delayed until all preexisting rcu read
> > >> sections are finished.
> > >>
> > >> In that case. Why do we need both barriers? spin_unlock is a full
> > >> barrier so maybe we just need smp_rmb before we read value to make sure
> > >> that we do not get stalled value when we start rcu_read section after
> > >> synchronize_rcu?
> > >>
> > >
> > > I doubt .... If no barrier, this case happens
> > >
> > > ==
> > >    update                   reference
> > >    CPU A                    CPU B
> > >    set value
> > >    synchronize_rcu()        rcu_read_lock()
> > >                             read_value <= find old value
> > >                             rcu_read_unlock()
> > >                             do no lock
> > > ==
> >
> > Hi Kame,
> >
> > Can you help to clarify a bit more on the example above? Why
> > read_value got the old value after synchronize_rcu().
>
> AFAIU it is because rcu_read_unlock doesn't force any memory barrier
> and we synchronize only the updater (with synchronize_rcu), so nothing
> guarantees that the value set on CPUA is visible to CPUB.
>
Thank you.
...Finally, I'd like to change this check to an atomic_t rather than the
complicated percpu counter. Hmm, should I do it now?
Thanks,
-Kame
* Re: [RFC] [PATCH 3/7 v2] memcg: remove PCG_MOVE_LOCK flag from pc->flags
2012-01-23 22:02 ` Ying Han
@ 2012-01-24 4:47 ` KAMEZAWA Hiroyuki
2012-01-25 22:48 ` Ying Han
0 siblings, 1 reply; 50+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-01-24 4:47 UTC (permalink / raw)
To: Ying Han
Cc: linux-mm@kvack.org, hugh.dickins@tiscali.co.uk,
hannes@cmpxchg.org, Michal Hocko, cgroups, bsingharora@gmail.com
On Mon, 23 Jan 2012 14:02:48 -0800
Ying Han <yinghan@google.com> wrote:
> On Fri, Jan 13, 2012 at 12:40 AM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >
> > From 1008e84d94245b1e7c4d237802ff68ff00757736 Mon Sep 17 00:00:00 2001
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > Date: Thu, 12 Jan 2012 15:53:24 +0900
> > Subject: [PATCH 3/7] memcg: remove PCG_MOVE_LOCK flag from pc->flags.
> >
> > PCG_MOVE_LOCK bit is used for bit spinlock for avoiding race between
> > memcg's account moving and page state statistics updates.
> >
> > Considering page-statistics update, very hot path, this lock is
> > taken only when someone is moving account (or PageTransHuge())
> > And, now, all moving-account between memcgroups (by task-move)
> > are serialized.
>
> This might be a side question, can you clarify the serialization here?
> Does it mean that we only allow one task-move at a time system-wide?
>
The current implementation has that limit; task moves are serialized by a mutex.
Thanks,
-Kame
* Re: [RFC] [PATCH 3/7 v2] memcg: remove PCG_MOVE_LOCK flag from pc->flags
2012-01-23 22:05 ` Ying Han
@ 2012-01-24 4:59 ` KAMEZAWA Hiroyuki
2012-01-24 8:43 ` Michal Hocko
1 sibling, 0 replies; 50+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-01-24 4:59 UTC (permalink / raw)
To: Ying Han
Cc: Michal Hocko, linux-mm@kvack.org, hugh.dickins@tiscali.co.uk,
hannes@cmpxchg.org, cgroups, bsingharora@gmail.com
On Mon, 23 Jan 2012 14:05:33 -0800
Ying Han <yinghan@google.com> wrote:
> On Wed, Jan 18, 2012 at 3:53 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Wed, 18 Jan 2012 11:47:03 +0100
> > Michal Hocko <mhocko@suse.cz> wrote:
> >
> >> On Wed 18-01-12 09:12:26, KAMEZAWA Hiroyuki wrote:
> >> > On Tue, 17 Jan 2012 17:46:05 +0100
> >> > Michal Hocko <mhocko@suse.cz> wrote:
> >> >
> >> > > On Fri 13-01-12 17:40:19, KAMEZAWA Hiroyuki wrote:
> >> [...]
> >> > > > This patch removes PCG_MOVE_LOCK and add hashed rwlock array
> >> > > > instead of it. This works well enough. Even when we need to
> >> > > > take the lock,
> >> > >
> >> > > Hmmm, rwlocks are not popular these days very much.
> >> > > Anyway, can we rather make it (source) memcg (bit)spinlock instead. We
> >> > > would reduce false sharing this way and would penalize only pages from
> >> > > the moving group.
> >> > >
> >> > per-memcg spinlock ?
> >>
> >> Yes
> >>
> >> > The reason I used rwlock() is to avoid disabling IRQ.  This routine
> >> > will be called by IRQ context (for dirty ratio support).  So, IRQ
> >> > disable will be required if we use spinlock.
> >>
> >> OK, I have missed the comment about disabling IRQs. It's true that we do
> >> not have to be afraid about deadlocks if the lock is held only for
> >> reading from the irq context but does the spinlock makes a performance
> >> bottleneck? We are talking about the slowpath.
> >> I could see the reason for the read lock when doing hashed locks because
> >> they are global but if we make the lock per memcg then we shouldn't
> >> interfere with other updates which are not blocked by the move.
> >>
> >
> > Hm, ok. In the next version, I'll use per-memcg spinlock (with hash if necessary)
>
> Just want to make sure I understand it, even we make the lock
> per-memcg, there is still a false sharing of pc within one memcg. Do
> we need to demonstrate the effect ?
>
Hmm, I'll try some. Account_move occurs when
a) a task is moved to another cgroup
b) a cgroup is removed.
I think checking case a) will be enough because there is no task in a memcg
while it is being removed. Then, I'll measure the performance of file mapping
while moving a task repeatedly. There will be spinlock conflicts.
- I'll consider making the range of the spinlock small.
- I'll consider having a hash of spinlocks, or spinlocks based on page zone
  and type; a rough sketch is below.
(It's easy to make the spinlock per-memcg-per-zone.)
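Rough sketch of the per-memcg hashed spinlock idea (illustrative only;
the field and function names below are tentative, not a real patch):
==
#define MEMCG_MOVE_LOCK_HASH	16

/*
 * The array would live in struct mem_cgroup:
 *	spinlock_t	move_locks[MEMCG_MOVE_LOCK_HASH];
 */
static spinlock_t *mem_cgroup_move_lock(struct mem_cgroup *memcg,
					struct page *page)
{
	int hnum = page_to_pfn(page) % MEMCG_MOVE_LOCK_HASH;

	return &memcg->move_locks[hnum];
}

/* stat-update side: irqsave because updates may come from IRQ context */
static void mem_cgroup_move_lock_page(struct mem_cgroup *memcg,
				      struct page *page, unsigned long *flags)
{
	spin_lock_irqsave(mem_cgroup_move_lock(memcg, page), *flags);
}

static void mem_cgroup_move_unlock_page(struct mem_cgroup *memcg,
					struct page *page, unsigned long flags)
{
	spin_unlock_irqrestore(mem_cgroup_move_lock(memcg, page), flags);
}
==
This way only pages of the moving memcg contend with the mover, and the
hash spreads the contention between the mover and the stat updaters.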
> Also, I don't get the point of why spinlock instead of rwlock in this case?
>
From Documentation/spinlocks.txt:
> NOTE! reader-writer locks require more atomic memory operations than
> simple spinlocks. Unless the reader critical section is long, you
> are better off just using spinlocks.
> NOTE! We are working hard to remove reader-writer spinlocks in most
> cases, so please don't add a new one without consensus. (Instead, see
> Documentation/RCU/rcu.txt for complete information.)
>
I don't have a strong enough motivation to use rwlock.
But if rwlock works well enough compared to spinlocks, it will be a choice.
Thanks,
-Kame
* Re: [RFC] [PATCH 3/7 v2] memcg: remove PCG_MOVE_LOCK flag from pc->flags
2012-01-23 22:05 ` Ying Han
2012-01-24 4:59 ` KAMEZAWA Hiroyuki
@ 2012-01-24 8:43 ` Michal Hocko
2012-01-25 23:07 ` Ying Han
1 sibling, 1 reply; 50+ messages in thread
From: Michal Hocko @ 2012-01-24 8:43 UTC (permalink / raw)
To: Ying Han
Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org, hugh.dickins@tiscali.co.uk,
hannes@cmpxchg.org, cgroups, bsingharora@gmail.com
On Mon 23-01-12 14:05:33, Ying Han wrote:
> On Wed, Jan 18, 2012 at 3:53 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > On Wed, 18 Jan 2012 11:47:03 +0100
> > Michal Hocko <mhocko@suse.cz> wrote:
> >
> >> On Wed 18-01-12 09:12:26, KAMEZAWA Hiroyuki wrote:
> >> > On Tue, 17 Jan 2012 17:46:05 +0100
> >> > Michal Hocko <mhocko@suse.cz> wrote:
> >> >
> >> > > On Fri 13-01-12 17:40:19, KAMEZAWA Hiroyuki wrote:
> >> [...]
> >> > > > This patch removes PCG_MOVE_LOCK and add hashed rwlock array
> >> > > > instead of it. This works well enough. Even when we need to
> >> > > > take the lock,
> >> > >
> >> > > Hmmm, rwlocks are not popular these days very much.
> >> > > Anyway, can we rather make it (source) memcg (bit)spinlock instead. We
> >> > > would reduce false sharing this way and would penalize only pages from
> >> > > the moving group.
> >> > >
> >> > per-memcg spinlock ?
> >>
> >> Yes
> >>
> >> > The reason I used rwlock() is to avoid disabling IRQ. This routine
> >> > will be called by IRQ context (for dirty ratio support). So, IRQ
> >> > disable will be required if we use spinlock.
> >>
> >> OK, I have missed the comment about disabling IRQs. It's true that we do
> >> not have to be afraid about deadlocks if the lock is held only for
> >> reading from the irq context but does the spinlock makes a performance
> >> bottleneck? We are talking about the slowpath.
> >> I could see the reason for the read lock when doing hashed locks because
> >> they are global but if we make the lock per memcg then we shouldn't
> >> interfere with other updates which are not blocked by the move.
> >>
> >
> > Hm, ok. In the next version, I'll use per-memcg spinlock (with hash if necessary)
>
> Just want to make sure I understand it, even we make the lock
> per-memcg, there is still a false sharing of pc within one memcg.
Yes, that is true. I had missed that we might fault in several pages at
once, but this would happen only during task move, right? And that is not
a hot path anyway. Or?
> Do we need to demonstrate the effect ?
>
> Also, I don't get the point of why spinlock instead of rwlock in this case?
spinlock provides fairness, while rwlocks might lead to
starvation.
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic
* Re: [RFC] [PATCH 2/7 v2] memcg: add memory barrier for checking account move.
2012-01-24 3:21 ` KAMEZAWA Hiroyuki
@ 2012-01-24 8:49 ` Michal Hocko
0 siblings, 0 replies; 50+ messages in thread
From: Michal Hocko @ 2012-01-24 8:49 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Ying Han, linux-mm@kvack.org, hugh.dickins@tiscali.co.uk,
hannes@cmpxchg.org, cgroups, bsingharora@gmail.com
On Tue 24-01-12 12:21:20, KAMEZAWA Hiroyuki wrote:
> On Mon, 23 Jan 2012 10:04:36 +0100
> Michal Hocko <mhocko@suse.cz> wrote:
>
> > On Fri 20-01-12 10:08:44, Ying Han wrote:
> > > On Wed, Jan 18, 2012 at 6:17 PM, KAMEZAWA Hiroyuki
> > > <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > > On Wed, 18 Jan 2012 13:37:59 +0100
> > > > Michal Hocko <mhocko@suse.cz> wrote:
> > > >
> > > >> On Wed 18-01-12 09:06:56, KAMEZAWA Hiroyuki wrote:
> > > >> > On Tue, 17 Jan 2012 16:26:35 +0100
> > > >> > Michal Hocko <mhocko@suse.cz> wrote:
> > > >> >
> > > >> > > On Fri 13-01-12 17:33:47, KAMEZAWA Hiroyuki wrote:
> > > >> > > > I think this bugfix is needed before going ahead. thoughts?
> > > >> > > > ==
> > > >> > > > From 2cb491a41782b39aae9f6fe7255b9159ac6c1563 Mon Sep 17 00:00:00 2001
> > > >> > > > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > >> > > > Date: Fri, 13 Jan 2012 14:27:20 +0900
> > > >> > > > Subject: [PATCH 2/7] memcg: add memory barrier for checking account move.
> > > >> > > >
> > > >> > > > At starting move_account(), source memcg's per-cpu variable
> > > >> > > > MEM_CGROUP_ON_MOVE is set. The page status update
> > > >> > > > routine check it under rcu_read_lock(). But there is no memory
> > > >> > > > barrier. This patch adds one.
> > > >> > >
> > > >> > > OK this would help to enforce that the CPU would see the current value
> > > >> > > but what prevents us from the race with the value update without the
> > > >> > > lock? This is as racy as it was before AFAICS.
> > > >> > >
> > > >> >
> > > >> > Hm, do I misunderstand ?
> > > >> > ==
> > > >> >    update                   reference
> > > >> >
> > > >> >    CPU A                    CPU B
> > > >> >  set value               rcu_read_lock()
> > > >> >  smp_wmb()               smp_rmb()
> > > >> >                          read_value
> > > >> >                          rcu_read_unlock()
> > > >> >  synchronize_rcu().
> > > >> > ==
> > > >> > I expect
> > > >> > If synchronize_rcu() is called before rcu_read_lock() => move_lock_xxx will be held.
> > > >> > If synchronize_rcu() is called after rcu_read_lock() => update will be delayed.
> > > >>
> > > >> Ahh, OK I can see it now. Readers are not that important because it is
> > > >> actually the updater who is delayed until all preexisting rcu read
> > > >> sections are finished.
> > > >>
> > > >> In that case. Why do we need both barriers? spin_unlock is a full
> > > >> barrier so maybe we just need smp_rmb before we read value to make sure
> > > >> that we do not get stalled value when we start rcu_read section after
> > > >> synchronize_rcu?
> > > >>
> > > >
> > > > I doubt .... If no barrier, this case happens
> > > >
> > > > ==
> > > >    update                   reference
> > > >    CPU A                    CPU B
> > > >    set value
> > > >    synchronize_rcu()        rcu_read_lock()
> > > >                             read_value <= find old value
> > > >                             rcu_read_unlock()
> > > >                             do no lock
> > > > ==
> > >
> > > Hi Kame,
> > >
> > > Can you help to clarify a bit more on the example above? Why
> > > read_value got the old value after synchronize_rcu().
> >
> > AFAIU it is because rcu_read_unlock doesn't force any memory barrier
> > and we synchronize only the updater (with synchronize_rcu), so nothing
> > guarantees that the value set on CPUA is visible to CPUB.
> >
>
> Thank you.
>
> ...Finally, I'd like to make this check to atomic_t rather than complicated
> percpu counter. Hmm, do it now ?
I thought you wanted to avoid atomics, but you would need a read
barrier on the reader side because only the atomics that return a value
imply a memory barrier IIRC. So it is a question whether atomic is really
simpler...
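E.g. the reader side would still look something like this (sketch only;
the atomic_t name is made up):
==
	rcu_read_lock();
	smp_rmb();	/* atomic_read() implies no barrier */
	if (atomic_read(&memcg_moving) > 0) {
		/* take the move lock and re-check pc->mem_cgroup */
	}
	rcu_read_unlock();
==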
>
> Thanks,
> -Kame
>
>
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic
* Re: [RFC] [PATCH 2/7 v2] memcg: add memory barrier for checking account move.
2012-01-23 9:04 ` Michal Hocko
2012-01-24 3:21 ` KAMEZAWA Hiroyuki
@ 2012-01-24 19:04 ` Ying Han
2012-01-25 11:07 ` Michal Hocko
1 sibling, 1 reply; 50+ messages in thread
From: Ying Han @ 2012-01-24 19:04 UTC (permalink / raw)
To: Michal Hocko
Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org, hugh.dickins@tiscali.co.uk,
hannes@cmpxchg.org, cgroups, bsingharora@gmail.com
On Mon, Jan 23, 2012 at 1:04 AM, Michal Hocko <mhocko@suse.cz> wrote:
> On Fri 20-01-12 10:08:44, Ying Han wrote:
>> On Wed, Jan 18, 2012 at 6:17 PM, KAMEZAWA Hiroyuki
>> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> > On Wed, 18 Jan 2012 13:37:59 +0100
>> > Michal Hocko <mhocko@suse.cz> wrote:
>> >
>> >> On Wed 18-01-12 09:06:56, KAMEZAWA Hiroyuki wrote:
>> >> > On Tue, 17 Jan 2012 16:26:35 +0100
>> >> > Michal Hocko <mhocko@suse.cz> wrote:
>> >> >
>> >> > > On Fri 13-01-12 17:33:47, KAMEZAWA Hiroyuki wrote:
>> >> > > > I think this bugfix is needed before going ahead. thoughts?
>> >> > > > ==
>> >> > > > From 2cb491a41782b39aae9f6fe7255b9159ac6c1563 Mon Sep 17 00:00:00 2001
>> >> > > > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> >> > > > Date: Fri, 13 Jan 2012 14:27:20 +0900
>> >> > > > Subject: [PATCH 2/7] memcg: add memory barrier for checking account move.
>> >> > > >
>> >> > > > At starting move_account(), source memcg's per-cpu variable
>> >> > > > MEM_CGROUP_ON_MOVE is set. The page status update
>> >> > > > routine check it under rcu_read_lock(). But there is no memory
>> >> > > > barrier. This patch adds one.
>> >> > >
>> >> > > OK this would help to enforce that the CPU would see the current value
>> >> > > but what prevents us from the race with the value update without the
>> >> > > lock? This is as racy as it was before AFAICS.
>> >> > >
>> >> >
>> >> > Hm, do I misunderstand ?
>> >> > ==
>> >> >    update                   reference
>> >> >
>> >> >    CPU A                    CPU B
>> >> >  set value               rcu_read_lock()
>> >> >  smp_wmb()               smp_rmb()
>> >> >                          read_value
>> >> >                          rcu_read_unlock()
>> >> >  synchronize_rcu().
>> >> > ==
>> >> > I expect
>> >> > If synchronize_rcu() is called before rcu_read_lock() => move_lock_xxx will be held.
>> >> > If synchronize_rcu() is called after rcu_read_lock() => update will be delayed.
>> >>
>> >> Ahh, OK I can see it now. Readers are not that important because it is
>> >> actually the updater who is delayed until all preexisting rcu read
>> >> sections are finished.
>> >>
>> >> In that case. Why do we need both barriers? spin_unlock is a full
>> >> barrier so maybe we just need smp_rmb before we read value to make sure
>> >> that we do not get stalled value when we start rcu_read section after
>> >> synchronize_rcu?
>> >>
>> >
>> > I doubt .... If no barrier, this case happens
>> >
>> > ==
>> >    update                   reference
>> >    CPU A                    CPU B
>> >    set value
>> >    synchronize_rcu()        rcu_read_lock()
>> >                             read_value <= find old value
>> >                             rcu_read_unlock()
>> >                             do no lock
>> > ==
>>
>> Hi Kame,
>>
>> Can you help to clarify a bit more on the example above? Why
>> read_value got the old value after synchronize_rcu().
>
> AFAIU it is because rcu_read_unlock doesn't force any memory barrier
> and we synchronize only the updater (with synchronize_rcu), so nothing
> guarantees that the value set on CPUA is visible to CPUB.
Thanks, and I might have found a similar comment in
Documentation/RCU/checklist.txt:
"
The various RCU read-side primitives do -not- necessarily contain
memory barriers.
"
So, the read barrier here is to make sure there is no reordering between
the reader and rcu_read_lock(). The same for the write barrier, which
makes sure there is no reordering between the updater and
synchronize_rcu(). The RCU here is to synchronize between the updater
and the reader. If so, why not a change like:

	for_each_online_cpu(cpu)
		per_cpu(memcg->stat->count[MEM_CGROUP_ON_MOVE], cpu) += 1;
+	smp_wmb();

Sorry, the use of the per-cpu variable MEM_CGROUP_ON_MOVE confuses me.
--Ying
* Re: [RFC] [PATCH 2/7 v2] memcg: add memory barrier for checking account move.
2012-01-24 19:04 ` Ying Han
@ 2012-01-25 11:07 ` Michal Hocko
0 siblings, 0 replies; 50+ messages in thread
From: Michal Hocko @ 2012-01-25 11:07 UTC (permalink / raw)
To: Ying Han
Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org, hugh.dickins@tiscali.co.uk,
hannes@cmpxchg.org, cgroups, bsingharora@gmail.com
On Tue 24-01-12 11:04:16, Ying Han wrote:
> On Mon, Jan 23, 2012 at 1:04 AM, Michal Hocko <mhocko@suse.cz> wrote:
> > On Fri 20-01-12 10:08:44, Ying Han wrote:
> >> On Wed, Jan 18, 2012 at 6:17 PM, KAMEZAWA Hiroyuki
> >> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
[...]
> >> > I doubt .... If no barrier, this case happens
> >> >
> >> > ==
> >> >    update                   reference
> >> >    CPU A                    CPU B
> >> >    set value
> >> >    synchronize_rcu()        rcu_read_lock()
> >> >                             read_value <= find old value
> >> >                             rcu_read_unlock()
> >> >                             do no lock
> >> > ==
> >>
> >> Hi Kame,
> >>
> >> Can you help to clarify a bit more on the example above? Why
> >> read_value got the old value after synchronize_rcu().
> >
> > AFAIU it is because rcu_read_unlock doesn't force any memory barrier
> > and we synchronize only the updater (with synchronize_rcu), so nothing
> > guarantees that the value set on CPUA is visible to CPUB.
>
> Thanks, and i might have found similar comment on the
> documentation/rcu/checklist.txt:
> "
> The various RCU read-side primitives do -not- necessarily contain
> memory barriers.
> "
>
> So, the read barrier here is to make sure no reordering between the
> reader and the rcu_read_lock. The same for the write barrier which
> makes sure no reordering between the updater and synchronize_rcu. The
> the rcu here is to synchronize between the updater and reader. If so,
> why not the change like :
>
> 	for_each_online_cpu(cpu)
> 		per_cpu(memcg->stat->count[MEM_CGROUP_ON_MOVE], cpu) += 1;
> +	smp_wmb();
There is a data dependency between the per_cpu update (the above for loop)
and the local read of the per-cpu value on the read side, and IIUC we need
to pair the write barrier with a read barrier before we read the value.
But I might be wrong here (see the SMP BARRIER PAIRING section in
Documentation/memory-barriers.txt).
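To make the pairing explicit (sketch only):
==
	/* updater */				/* reader */
	for_each_online_cpu(cpu)
		per_cpu(...[MEM_CGROUP_ON_MOVE], cpu) += 1;
	smp_wmb();				rcu_read_lock();
	synchronize_rcu();			smp_rmb();
						val = this_cpu_read(...[MEM_CGROUP_ON_MOVE]);
						rcu_read_unlock();
==
i.e. the smp_wmb() after the update has to be paired with an smp_rmb()
before the reader looks at the value; a write barrier alone on the
updater's side does not order anything on the reading CPU.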
> Sorry, the use of per-cpu variable MEM_CGROUP_ON_MOVE does confuse me.
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic
* Re: [RFC] [PATCH 3/7 v2] memcg: remove PCG_MOVE_LOCK flag from pc->flags
2012-01-24 4:47 ` KAMEZAWA Hiroyuki
@ 2012-01-25 22:48 ` Ying Han
0 siblings, 0 replies; 50+ messages in thread
From: Ying Han @ 2012-01-25 22:48 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, hugh.dickins@tiscali.co.uk,
hannes@cmpxchg.org, Michal Hocko, cgroups, bsingharora@gmail.com
On Mon, Jan 23, 2012 at 8:47 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Mon, 23 Jan 2012 14:02:48 -0800
> Ying Han <yinghan@google.com> wrote:
>
>> On Fri, Jan 13, 2012 at 12:40 AM, KAMEZAWA Hiroyuki
>> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> >
>> > From 1008e84d94245b1e7c4d237802ff68ff00757736 Mon Sep 17 00:00:00 2001
>> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> > Date: Thu, 12 Jan 2012 15:53:24 +0900
>> > Subject: [PATCH 3/7] memcg: remove PCG_MOVE_LOCK flag from pc->flags.
>> >
>> > PCG_MOVE_LOCK bit is used for bit spinlock for avoiding race between
>> > memcg's account moving and page state statistics updates.
>> >
>> > Considering page-statistics update, very hot path, this lock is
>> > taken only when someone is moving account (or PageTransHuge())
>> > And, now, all moving-account between memcgroups (by task-move)
>> > are serialized.
>>
>> This might be a side question, can you clarify the serialization here?
>> Does it mean that we only allow one task-move at a time system-wide?
>>
>
> current implementation has that limit by mutex.
Ah, thanks. That is the cgroup_mutex, which serializes
mem_cgroup_move_charge() during task move.
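Roughly, from my reading of the code (a simplified sketch of the call
chain, not exact function signatures):
==
cgroup_attach_task()               /* runs with cgroup_mutex held */
  -> mem_cgroup_can_attach()       /* sets up mc.*, mem_cgroup_start_move() */
  -> mem_cgroup_move_task()        /* mem_cgroup_move_charge() */
                                   /* mem_cgroup_clear_mc() ends the move */
==
so at most one charge-moving operation runs at a time, system-wide.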
--Ying
>
> Thanks,
> -Kame
>
* Re: [RFC] [PATCH 3/7 v2] memcg: remove PCG_MOVE_LOCK flag from pc->flags
2012-01-24 8:43 ` Michal Hocko
@ 2012-01-25 23:07 ` Ying Han
2012-01-26 9:16 ` Michal Hocko
0 siblings, 1 reply; 50+ messages in thread
From: Ying Han @ 2012-01-25 23:07 UTC (permalink / raw)
To: Michal Hocko
Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org, hugh.dickins@tiscali.co.uk,
hannes@cmpxchg.org, cgroups, bsingharora@gmail.com
On Tue, Jan 24, 2012 at 12:43 AM, Michal Hocko <mhocko@suse.cz> wrote:
> On Mon 23-01-12 14:05:33, Ying Han wrote:
>> On Wed, Jan 18, 2012 at 3:53 PM, KAMEZAWA Hiroyuki
>> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> > On Wed, 18 Jan 2012 11:47:03 +0100
>> > Michal Hocko <mhocko@suse.cz> wrote:
>> >
>> >> On Wed 18-01-12 09:12:26, KAMEZAWA Hiroyuki wrote:
>> >> > On Tue, 17 Jan 2012 17:46:05 +0100
>> >> > Michal Hocko <mhocko@suse.cz> wrote:
>> >> >
>> >> > > On Fri 13-01-12 17:40:19, KAMEZAWA Hiroyuki wrote:
>> >> [...]
>> >> > > > This patch removes PCG_MOVE_LOCK and add hashed rwlock array
>> >> > > > instead of it. This works well enough. Even when we need to
>> >> > > > take the lock,
>> >> > >
>> >> > > Hmmm, rwlocks are not popular these days very much.
>> >> > > Anyway, can we rather make it (source) memcg (bit)spinlock instead. We
>> >> > > would reduce false sharing this way and would penalize only pages from
>> >> > > the moving group.
>> >> > >
>> >> > per-memcg spinlock ?
>> >>
>> >> Yes
>> >>
>> >> > The reason I used rwlock() is to avoid disabling IRQ. This routine
>> >> > will be called by IRQ context (for dirty ratio support). So, IRQ
>> >> > disable will be required if we use spinlock.
>> >>
>> >> OK, I have missed the comment about disabling IRQs. It's true that we do
>> >> not have to be afraid about deadlocks if the lock is held only for
>> >> reading from the irq context but does the spinlock makes a performance
>> >> bottleneck? We are talking about the slowpath.
>> >> I could see the reason for the read lock when doing hashed locks because
>> >> they are global but if we make the lock per memcg then we shouldn't
>> >> interfere with other updates which are not blocked by the move.
>> >>
>> >
>> > Hm, ok. In the next version, I'll use per-memcg spinlock (with hash if necessary)
>>
>> Just want to make sure I understand it, even we make the lock
>> per-memcg, there is still a false sharing of pc within one memcg.
>
> Yes that is true. I have missed that we might fault in several pages at
> once but this would happen only during task move, right? And that is not
> a hot path anyway. Or?
I was thinking of the page-statistics update, which is a hot path. If the
moving task and a non-moving task share the same per-memcg lock, won't any
page-statistics update from the non-moving task be blocked? Sorry
if I missed something here :)
>
>> Do we need to demonstrate the effect ?
>>
>> Also, I don't get the point of why spinlock instead of rwlock in this case?
>
> spinlock provides a fairness while with rwlocks might lead to
> starvation.
that is true.
--Ying
* Re: [RFC] [PATCH 3/7 v2] memcg: remove PCG_MOVE_LOCK flag from pc->flags
2012-01-25 23:07 ` Ying Han
@ 2012-01-26 9:16 ` Michal Hocko
0 siblings, 0 replies; 50+ messages in thread
From: Michal Hocko @ 2012-01-26 9:16 UTC (permalink / raw)
To: Ying Han
Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org, hugh.dickins@tiscali.co.uk,
hannes@cmpxchg.org, cgroups, bsingharora@gmail.com
On Wed 25-01-12 15:07:47, Ying Han wrote:
> On Tue, Jan 24, 2012 at 12:43 AM, Michal Hocko <mhocko@suse.cz> wrote:
> > On Mon 23-01-12 14:05:33, Ying Han wrote:
[...]
> >> Just want to make sure I understand it, even we make the lock
> >> per-memcg, there is still a false sharing of pc within one memcg.
> >
> > Yes that is true. I have missed that we might fault in several pages at
> > once but this would happen only during task move, right? And that is not
> > a hot path anyway. Or?
>
> I was thinking of page-statistics update which is hot path. If the
> moving task and non-moving task share the same per-memcg lock, any
> page-statistic update from the non-moving task will be blocked? Sorry
> If i missed something here :)
OK, I finally got your point. I guess there is a plan to reduce the
effect with an array of locks.
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic
* Re: [RFC] [PATCH 4/7 v2] memcg: new scheme to update per-memcg page stat accounting.
2012-01-13 8:41 ` [RFC] [PATCH 4/7 v2] memcg: new scheme to update per-memcg page stat accounting KAMEZAWA Hiroyuki
2012-01-18 16:45 ` Michal Hocko
@ 2012-01-26 19:01 ` Ying Han
1 sibling, 0 replies; 50+ messages in thread
From: Ying Han @ 2012-01-26 19:01 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, hugh.dickins@tiscali.co.uk,
hannes@cmpxchg.org, Michal Hocko, cgroups, bsingharora@gmail.com
On Fri, Jan 13, 2012 at 12:41 AM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> From 08a81022fa6f820a42aa5bf3a24ee07690dfff99 Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Date: Thu, 12 Jan 2012 18:13:32 +0900
> Subject: [PATCH 4/7] memcg: new scheme to update per-memcg page stat accounting.
>
> Now, page status accounting is done by a call mem_cgroup_update_page_stat()
nitpick, /by a call/by calling
> and this function set flags to page_cgroup->flags.
>
> This flag was required because the page's status and the page <=> memcg
> relationship cannot be updated in an atomic way.
I assume we are talking about the PCG_FILE_MAPPED flag; can we make that
specific here?
For example,
> Considering FILE_MAPPED,
>
>      CPU A                          CPU B
>                                pick up a page to be moved.
>  set page_mapcount()=0.
>                                move memcg's FILE_MAPPED stat --(*)
>                                overwrite pc->mem_cgroup
>  modify memcg's FILE_MAPPED --(**)
>
> If we don't have a flag on pc->flags, we'll not move the 'FILE_MAPPED'
> account information at (*) and we'll decrease FILE_MAPPED at (**)
> from the wrong cgroup. We'll see this kind of race when handling the
> dirty, writeback, etc. bits. (And the Dirty flag has another problem
> which cannot be handled by a flag on page_cgroup.)
>
> I'd like to remove this flag because
> - In recent discussions, removing pc->flags is our direction.
> - This kind of duplication of flag/status is very bad and
> it's better to use status in 'struct page'.
>
> This patch is for removing page_cgroup's special flag for
> page-state accounting and for using 'struct page's status itself.
I think this patch itself doesn't remove any pc flags. I believe it is
the following patch which removes the PCG_FILE_MAPPED flag.
>
> This patch adds atomic update support for page statistics accounting
> in memcg. In short, it prevents a page from being moved from a memcg
> to another while updating page status by...
>
> locked = mem_cgroup_begin_update_page_stat(page)
> modify page
> mem_cgroup_update_page_stat(page)
> mem_cgroup_end_update_page_stat(page, locked)
>
> While begin_update_page_stat() ... end_update_page_stat(),
> the page_cgroup will never be moved to another memcg.
This is nice.
In general, the description needs some work; it isn't clear to me
what this patch does at first glance.
--Ying
>
> In the usual case, the overhead is rcu_read_lock(), rcu_read_unlock(),
> and lookup_page_cgroup().
>
> Note:
> - I am still considering how to reduce the overhead of this scheme.
>   Good ideas are welcome.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
> include/linux/memcontrol.h | 36 ++++++++++++++++++++++++++++++++++
> mm/memcontrol.c | 46 ++++++++++++++++++++++++++-----------------
> mm/rmap.c | 14 +++++++++++-
> 3 files changed, 76 insertions(+), 20 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 4d34356..976b58c 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -141,9 +141,35 @@ static inline bool mem_cgroup_disabled(void)
> return false;
> }
>
> +/*
> > + * When we update page->flags, we'll update some memcg's counter.
> + * Unlike vmstat, memcg has per-memcg stats and page-memcg relationship
> + * can be changed while 'struct page' information is updated.
> + * We need to prevent the race by
> + * locked = mem_cgroup_begin_update_page_stat(page)
> + * modify 'page'
> + * mem_cgroup_update_page_stat(page, idx, val)
> + * mem_cgroup_end_update_page_stat(page, locked);
> + */
> +bool __mem_cgroup_begin_update_page_stat(struct page *page);
> +static inline bool mem_cgroup_begin_update_page_stat(struct page *page)
> +{
> + if (mem_cgroup_disabled())
> + return false;
> + return __mem_cgroup_begin_update_page_stat(page);
> +}
> void mem_cgroup_update_page_stat(struct page *page,
> enum mem_cgroup_page_stat_item idx,
> int val);
> +void __mem_cgroup_end_update_page_stat(struct page *page, bool locked);
> +static inline void
> +mem_cgroup_end_update_page_stat(struct page *page, bool locked)
> +{
> + if (mem_cgroup_disabled())
> + return;
> + __mem_cgroup_end_update_page_stat(page, locked);
> +}
> +
>
> static inline void mem_cgroup_inc_page_stat(struct page *page,
> enum mem_cgroup_page_stat_item idx)
> @@ -356,6 +382,16 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
> {
> }
>
> +static inline bool mem_cgroup_begin_update_page_stat(struct page *page)
> +{
> + return false;
> +}
> +
> +static inline void
> +mem_cgroup_end_update_page_stat(struct page *page, bool locked)
> +{
> +}
> +
> static inline void mem_cgroup_inc_page_stat(struct page *page,
> enum mem_cgroup_page_stat_item idx)
> {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 61e276f..30ef810 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1912,29 +1912,43 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
> * possibility of race condition. If there is, we take a lock.
> */
>
> +bool __mem_cgroup_begin_update_page_stat(struct page *page)
> +{
> + struct page_cgroup *pc = lookup_page_cgroup(page);
> + bool locked = false;
> + struct mem_cgroup *memcg;
> +
> + rcu_read_lock();
> + memcg = pc->mem_cgroup;
> +
> + if (!memcg || !PageCgroupUsed(pc))
> + goto out;
> + if (mem_cgroup_stealed(memcg)) {
> + mem_cgroup_account_move_rlock(page);
> + locked = true;
> + }
> +out:
> + return locked;
> +}
> +
> +void __mem_cgroup_end_update_page_stat(struct page *page, bool locked)
> +{
> + if (locked)
> + mem_cgroup_account_move_runlock(page);
> + rcu_read_unlock();
> +}
> +
> void mem_cgroup_update_page_stat(struct page *page,
> enum mem_cgroup_page_stat_item idx, int val)
> {
> - struct mem_cgroup *memcg;
> struct page_cgroup *pc = lookup_page_cgroup(page);
> - bool need_unlock = false;
> + struct mem_cgroup *memcg = pc->mem_cgroup;
>
> if (mem_cgroup_disabled())
> return;
>
> - rcu_read_lock();
> - memcg = pc->mem_cgroup;
> if (unlikely(!memcg || !PageCgroupUsed(pc)))
> - goto out;
> - /* pc->mem_cgroup is unstable ? */
> - if (unlikely(mem_cgroup_stealed(memcg))) {
> - /* take a lock against to access pc->mem_cgroup */
> - mem_cgroup_account_move_rlock(page);
> - need_unlock = true;
> - memcg = pc->mem_cgroup;
> - if (!memcg || !PageCgroupUsed(pc))
> - goto out;
> - }
> + return;
>
> switch (idx) {
> case MEMCG_NR_FILE_MAPPED:
> @@ -1950,10 +1964,6 @@ void mem_cgroup_update_page_stat(struct page *page,
>
> this_cpu_add(memcg->stat->count[idx], val);
>
> -out:
> - if (unlikely(need_unlock))
> - mem_cgroup_account_move_runlock(page);
> - rcu_read_unlock();
> return;
> }
> EXPORT_SYMBOL(mem_cgroup_update_page_stat);
> diff --git a/mm/rmap.c b/mm/rmap.c
> index aa547d4..def60d1 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1150,10 +1150,13 @@ void page_add_new_anon_rmap(struct page *page,
> */
> void page_add_file_rmap(struct page *page)
> {
> + bool locked = mem_cgroup_begin_update_page_stat(page);
> +
> if (atomic_inc_and_test(&page->_mapcount)) {
> __inc_zone_page_state(page, NR_FILE_MAPPED);
> mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_MAPPED);
> }
> + mem_cgroup_end_update_page_stat(page, locked);
> }
>
> /**
> @@ -1164,10 +1167,14 @@ void page_add_file_rmap(struct page *page)
> */
> void page_remove_rmap(struct page *page)
> {
> + bool locked = false;
> +
> + if (!PageAnon(page))
> + locked = mem_cgroup_begin_update_page_stat(page);
> +
> /* page still mapped by someone else? */
> if (!atomic_add_negative(-1, &page->_mapcount))
> - return;
> -
> + goto out;
> /*
> * Now that the last pte has gone, s390 must transfer dirty
> * flag from storage key to struct page. We can usually skip
> @@ -1204,6 +1211,9 @@ void page_remove_rmap(struct page *page)
> * Leaving it set also helps swapoff to reinstate ptes
> * faster for those pages still in swapcache.
> */
> +out:
> + if (!PageAnon(page))
> + mem_cgroup_end_update_page_stat(page, locked);
> }
>
> /*
> --
> 1.7.4.1
>
>
* Re: [RFC] [PATCH 5/7 v2] memcg: remove PCG_FILE_MAPPED
2012-01-19 14:07 ` Michal Hocko
@ 2012-01-26 19:10 ` Ying Han
0 siblings, 0 replies; 50+ messages in thread
From: Ying Han @ 2012-01-26 19:10 UTC (permalink / raw)
To: Michal Hocko
Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org, hugh.dickins@tiscali.co.uk,
hannes@cmpxchg.org, cgroups, bsingharora@gmail.com
On Thu, Jan 19, 2012 at 6:07 AM, Michal Hocko <mhocko@suse.cz> wrote:
> On Fri 13-01-12 17:42:23, KAMEZAWA Hiroyuki wrote:
>> From a9b51d6204d7f8714173c46a306caf413ad25d4e Mon Sep 17 00:00:00 2001
>> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> Date: Thu, 12 Jan 2012 18:40:26 +0900
>> Subject: [PATCH 5/7] memcg: remove PCG_FILE_MAPPED
>>
>> Because we can update page's status and memcg's statistics without
>> race with move_account, this flag is unnecessary.
>
> I would really appreciate a little bit more description ;)
>
> 8725d541 [memcg: fix race in file_mapped accounting] has added the
> flag to resolve a race where a move_account happened between the page's
> mapcount being updated and that update being accounted to the memcg.
> This, however, cannot happen anymore because mem_cgroup_update_page_stat
> is always enclosed by mem_cgroup_begin_update_page_stat and
> mem_cgroup_end_update_page_stat along with the mapcount update.
Agree on better documentation.
>
>>
>> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> Other than that it looks good, and it's nice to see another one go away.
> I will add my ack along with the patches which this depends on once they
> settle down, if you don't mind
>
> Thanks
>
>> ---
>> include/linux/page_cgroup.h | 6 ------
>> mm/memcontrol.c | 6 +-----
>> 2 files changed, 1 insertions(+), 11 deletions(-)
>>
>> diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
>> index 5dba799..0b9a48a 100644
>> --- a/include/linux/page_cgroup.h
>> +++ b/include/linux/page_cgroup.h
>> @@ -7,8 +7,6 @@ enum {
>> PCG_CACHE, /* charged as cache */
>> PCG_USED, /* this object is in use. */
>> PCG_MIGRATION, /* under page migration */
>> - /* flags for mem_cgroup and file and I/O status */
>> - PCG_FILE_MAPPED, /* page is accounted as "mapped" */
>> __NR_PCG_FLAGS,
>> };
>>
>> @@ -72,10 +70,6 @@ TESTPCGFLAG(Used, USED)
>> CLEARPCGFLAG(Used, USED)
>> SETPCGFLAG(Used, USED)
>>
>> -SETPCGFLAG(FileMapped, FILE_MAPPED)
>> -CLEARPCGFLAG(FileMapped, FILE_MAPPED)
>> -TESTPCGFLAG(FileMapped, FILE_MAPPED)
>> -
>> SETPCGFLAG(Migration, MIGRATION)
>> CLEARPCGFLAG(Migration, MIGRATION)
>> TESTPCGFLAG(Migration, MIGRATION)
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 30ef810..a96800d 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -1952,10 +1952,6 @@ void mem_cgroup_update_page_stat(struct page *page,
>>
>> switch (idx) {
>> case MEMCG_NR_FILE_MAPPED:
>> - if (val > 0)
>> - SetPageCgroupFileMapped(pc);
>> - else if (!page_mapped(page))
>> - ClearPageCgroupFileMapped(pc);
>> idx = MEM_CGROUP_STAT_FILE_MAPPED;
>> break;
>> default:
>> @@ -2606,7 +2602,7 @@ static int mem_cgroup_move_account(struct page *page,
>>
>> mem_cgroup_account_move_wlock(page, &flags);
>>
>> - if (PageCgroupFileMapped(pc)) {
>> + if (page_mapcount(page)) {
Does it include anon pages? Should it be:
	if (page_mapped(page) && !PageAnon(page))
--Ying
>> /* Update mapped_file data for mem_cgroup */
>> preempt_disable();
>> __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
>> --
>> 1.7.4.1
>>
>>
end of thread
Thread overview: 50+ messages
2012-01-13 8:30 [RFC] [PATCH 0/7 v2] memcg: page_cgroup diet KAMEZAWA Hiroyuki
2012-01-13 8:32 ` [RFC] [PATCH 1/7 v2] memcg: remove unnecessary check in mem_cgroup_update_page_stat() KAMEZAWA Hiroyuki
2012-01-17 15:16 ` Michal Hocko
2012-01-17 23:55 ` KAMEZAWA Hiroyuki
2012-01-18 13:01 ` Michal Hocko
2012-01-19 2:18 ` KAMEZAWA Hiroyuki
2012-01-19 20:07 ` Ying Han
2012-01-20 0:48 ` KAMEZAWA Hiroyuki
2012-01-13 8:33 ` [RFC] [PATCH 2/7 v2] memcg: add memory barrier for checking account move KAMEZAWA Hiroyuki
2012-01-17 15:26 ` Michal Hocko
2012-01-18 0:06 ` KAMEZAWA Hiroyuki
2012-01-18 12:37 ` Michal Hocko
2012-01-19 2:17 ` KAMEZAWA Hiroyuki
2012-01-19 9:28 ` Michal Hocko
2012-01-19 23:57 ` KAMEZAWA Hiroyuki
2012-01-20 18:08 ` Ying Han
2012-01-23 9:04 ` Michal Hocko
2012-01-24 3:21 ` KAMEZAWA Hiroyuki
2012-01-24 8:49 ` Michal Hocko
2012-01-24 19:04 ` Ying Han
2012-01-25 11:07 ` Michal Hocko
2012-01-13 8:40 ` [RFC] [PATCH 3/7 v2] memcg: remove PCG_MOVE_LOCK flag from pc->flags KAMEZAWA Hiroyuki
2012-01-16 12:55 ` Kirill A. Shutemov
2012-01-17 0:22 ` KAMEZAWA Hiroyuki
2012-01-17 16:46 ` Michal Hocko
2012-01-18 0:12 ` KAMEZAWA Hiroyuki
2012-01-18 10:47 ` Michal Hocko
2012-01-18 23:53 ` KAMEZAWA Hiroyuki
2012-01-23 22:05 ` Ying Han
2012-01-24 4:59 ` KAMEZAWA Hiroyuki
2012-01-24 8:43 ` Michal Hocko
2012-01-25 23:07 ` Ying Han
2012-01-26 9:16 ` Michal Hocko
2012-01-23 22:02 ` Ying Han
2012-01-24 4:47 ` KAMEZAWA Hiroyuki
2012-01-25 22:48 ` Ying Han
2012-01-13 8:41 ` [RFC] [PATCH 4/7 v2] memcg: new scheme to update per-memcg page stat accounting KAMEZAWA Hiroyuki
2012-01-18 16:45 ` Michal Hocko
2012-01-18 23:58 ` KAMEZAWA Hiroyuki
2012-01-26 19:01 ` Ying Han
2012-01-13 8:42 ` [RFC] [PATCH 5/7 v2] memcg: remove PCG_FILE_MAPPED KAMEZAWA Hiroyuki
2012-01-19 14:07 ` Michal Hocko
2012-01-26 19:10 ` Ying Han
2012-01-13 8:43 ` [RFC] [PATCH 6/7 v2] memcg: remove PCG_CACHE KAMEZAWA Hiroyuki
2012-01-13 8:45 ` [RFC] [PATCH 7/7 v2] memcg: make mem_cgroup_begin_update_stat to use global pcpu KAMEZAWA Hiroyuki
2012-01-19 14:47 ` Michal Hocko
2012-01-20 2:19 ` KAMEZAWA Hiroyuki
2012-01-20 8:38 ` Michal Hocko
2012-01-20 8:40 ` Greg Thelen
2012-01-24 3:18 ` KAMEZAWA Hiroyuki