linux-mm.kvack.org archive mirror
* [patch 0/9] mm: memcontrol: naturalize charge lifetime
@ 2014-04-30 20:25 Johannes Weiner
  2014-04-30 20:25 ` [patch 1/9] mm: memcontrol: fold mem_cgroup_do_charge() Johannes Weiner
                   ` (9 more replies)
  0 siblings, 10 replies; 35+ messages in thread
From: Johannes Weiner @ 2014-04-30 20:25 UTC (permalink / raw)
  To: linux-mm; +Cc: Michal Hocko, Hugh Dickins, Tejun Heo, cgroups, linux-kernel

Hi,

these patches rework memcg charge lifetime to integrate more naturally
with the lifetime of user pages.  This drastically simplifies the code
and reduces charging and uncharging overhead.  The most expensive part
of charging and uncharging is the page_cgroup bit spinlock, which is
removed entirely after this series.

Here are the top-10 profile entries of a stress test that reads a 128G
sparse file on a freshly booted box, without a dedicated cgroup
(i.e. executing in the root memcg).  Before:

    15.36%              cat  [kernel.kallsyms]   [k] copy_user_generic_string                  
    13.31%              cat  [kernel.kallsyms]   [k] memset                                    
    11.48%              cat  [kernel.kallsyms]   [k] do_mpage_readpage                         
     4.23%              cat  [kernel.kallsyms]   [k] get_page_from_freelist                    
     2.38%              cat  [kernel.kallsyms]   [k] put_page                                  
     2.32%              cat  [kernel.kallsyms]   [k] __mem_cgroup_commit_charge                
     2.18%          kswapd0  [kernel.kallsyms]   [k] __mem_cgroup_uncharge_common              
     1.92%          kswapd0  [kernel.kallsyms]   [k] shrink_page_list                          
     1.86%              cat  [kernel.kallsyms]   [k] __radix_tree_lookup                       
     1.62%              cat  [kernel.kallsyms]   [k] __pagevec_lru_add_fn                      

And after:

    15.67%           cat  [kernel.kallsyms]   [k] copy_user_generic_string                  
    13.48%           cat  [kernel.kallsyms]   [k] memset                                    
    11.42%           cat  [kernel.kallsyms]   [k] do_mpage_readpage                         
     3.98%           cat  [kernel.kallsyms]   [k] get_page_from_freelist                    
     2.46%           cat  [kernel.kallsyms]   [k] put_page                                  
     2.13%       kswapd0  [kernel.kallsyms]   [k] shrink_page_list                          
     1.88%           cat  [kernel.kallsyms]   [k] __radix_tree_lookup                       
     1.67%           cat  [kernel.kallsyms]   [k] __pagevec_lru_add_fn                      
     1.39%       kswapd0  [kernel.kallsyms]   [k] free_pcppages_bulk                        
     1.30%           cat  [kernel.kallsyms]   [k] kfree                                     

The code also survived some prolonged stress testing with a swapping
workload being moved continuously between memcgs.

My apologies in advance for the reviewability.  I tried to split the
rewrite into more steps, but had to declare the current code
unsalvageable after it took me more than a day to convince myself how
the swap accounting works.  It's probably easiest to read this as
newly written code.

 Documentation/cgroups/memcg_test.txt |  160 +--
 include/linux/memcontrol.h           |   94 +-
 include/linux/page_cgroup.h          |   43 +-
 include/linux/swap.h                 |   15 +-
 kernel/events/uprobes.c              |    1 +
 mm/filemap.c                         |   13 +-
 mm/huge_memory.c                     |   51 +-
 mm/memcontrol.c                      | 1724 ++++++++++++--------------------
 mm/memory.c                          |   41 +-
 mm/migrate.c                         |   46 +-
 mm/rmap.c                            |    6 -
 mm/shmem.c                           |   28 +-
 mm/swap.c                            |   22 +
 mm/swap_state.c                      |    8 +-
 mm/swapfile.c                        |   21 +-
 mm/truncate.c                        |    1 -
 mm/vmscan.c                          |    9 +-
 mm/zswap.c                           |    2 +-
 18 files changed, 833 insertions(+), 1452 deletions(-)


* [patch 1/9] mm: memcontrol: fold mem_cgroup_do_charge()
  2014-04-30 20:25 [patch 0/9] mm: memcontrol: naturalize charge lifetime Johannes Weiner
@ 2014-04-30 20:25 ` Johannes Weiner
  2014-04-30 20:25 ` [patch 2/9] mm: memcontrol: rearrange charging fast path Johannes Weiner
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 35+ messages in thread
From: Johannes Weiner @ 2014-04-30 20:25 UTC (permalink / raw)
  To: linux-mm; +Cc: Michal Hocko, Hugh Dickins, Tejun Heo, cgroups, linux-kernel

This function was split out because mem_cgroup_try_charge() got too
big.  But having essentially one sequence of operations arbitrarily
split in half is not good for reworking the code.  Fold it back in.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c | 166 ++++++++++++++++++++++----------------------------------
 1 file changed, 64 insertions(+), 102 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 29501f040568..75dfeb8fa98b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2574,80 +2574,6 @@ static int memcg_cpu_hotplug_callback(struct notifier_block *nb,
 	return NOTIFY_OK;
 }
 
-
-/* See mem_cgroup_try_charge() for details */
-enum {
-	CHARGE_OK,		/* success */
-	CHARGE_RETRY,		/* need to retry but retry is not bad */
-	CHARGE_NOMEM,		/* we can't do more. return -ENOMEM */
-	CHARGE_WOULDBLOCK,	/* GFP_WAIT wasn't set and no enough res. */
-};
-
-static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
-				unsigned int nr_pages, unsigned int min_pages,
-				bool invoke_oom)
-{
-	unsigned long csize = nr_pages * PAGE_SIZE;
-	struct mem_cgroup *mem_over_limit;
-	struct res_counter *fail_res;
-	unsigned long flags = 0;
-	int ret;
-
-	ret = res_counter_charge(&memcg->res, csize, &fail_res);
-
-	if (likely(!ret)) {
-		if (!do_swap_account)
-			return CHARGE_OK;
-		ret = res_counter_charge(&memcg->memsw, csize, &fail_res);
-		if (likely(!ret))
-			return CHARGE_OK;
-
-		res_counter_uncharge(&memcg->res, csize);
-		mem_over_limit = mem_cgroup_from_res_counter(fail_res, memsw);
-		flags |= MEM_CGROUP_RECLAIM_NOSWAP;
-	} else
-		mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
-	/*
-	 * Never reclaim on behalf of optional batching, retry with a
-	 * single page instead.
-	 */
-	if (nr_pages > min_pages)
-		return CHARGE_RETRY;
-
-	if (!(gfp_mask & __GFP_WAIT))
-		return CHARGE_WOULDBLOCK;
-
-	if (gfp_mask & __GFP_NORETRY)
-		return CHARGE_NOMEM;
-
-	ret = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
-	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
-		return CHARGE_RETRY;
-	/*
-	 * Even though the limit is exceeded at this point, reclaim
-	 * may have been able to free some pages.  Retry the charge
-	 * before killing the task.
-	 *
-	 * Only for regular pages, though: huge pages are rather
-	 * unlikely to succeed so close to the limit, and we fall back
-	 * to regular pages anyway in case of failure.
-	 */
-	if (nr_pages <= (1 << PAGE_ALLOC_COSTLY_ORDER) && ret)
-		return CHARGE_RETRY;
-
-	/*
-	 * At task move, charge accounts can be doubly counted. So, it's
-	 * better to wait until the end of task_move if something is going on.
-	 */
-	if (mem_cgroup_wait_acct_move(mem_over_limit))
-		return CHARGE_RETRY;
-
-	if (invoke_oom)
-		mem_cgroup_oom(mem_over_limit, gfp_mask, get_order(csize));
-
-	return CHARGE_NOMEM;
-}
-
 /**
  * mem_cgroup_try_charge - try charging a memcg
  * @memcg: memcg to charge
@@ -2664,7 +2590,11 @@ static int mem_cgroup_try_charge(struct mem_cgroup *memcg,
 {
 	unsigned int batch = max(CHARGE_BATCH, nr_pages);
 	int nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
-	int ret;
+	struct mem_cgroup *mem_over_limit;
+	struct res_counter *fail_res;
+	unsigned long nr_reclaimed;
+	unsigned long flags = 0;
+	unsigned long long size;
 
 	if (mem_cgroup_is_root(memcg))
 		goto done;
@@ -2683,44 +2613,76 @@ static int mem_cgroup_try_charge(struct mem_cgroup *memcg,
 
 	if (gfp_mask & __GFP_NOFAIL)
 		oom = false;
-again:
+retry:
 	if (consume_stock(memcg, nr_pages))
 		goto done;
 
-	do {
-		bool invoke_oom = oom && !nr_oom_retries;
+	size = batch * PAGE_SIZE;
+	if (!res_counter_charge(&memcg->res, size, &fail_res)) {
+		if (!do_swap_account)
+			goto done_restock;
+		if (!res_counter_charge(&memcg->memsw, size, &fail_res))
+			goto done_restock;
+		res_counter_uncharge(&memcg->res, size);
+		mem_over_limit = mem_cgroup_from_res_counter(fail_res, memsw);
+		flags |= MEM_CGROUP_RECLAIM_NOSWAP;
+	} else
+		mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
 
-		/* If killed, bypass charge */
-		if (fatal_signal_pending(current))
-			goto bypass;
+	if (batch > nr_pages) {
+		batch = nr_pages;
+		goto retry;
+	}
 
-		ret = mem_cgroup_do_charge(memcg, gfp_mask, batch,
-					   nr_pages, invoke_oom);
-		switch (ret) {
-		case CHARGE_OK:
-			break;
-		case CHARGE_RETRY: /* not in OOM situation but retry */
-			batch = nr_pages;
-			goto again;
-		case CHARGE_WOULDBLOCK: /* !__GFP_WAIT */
-			goto nomem;
-		case CHARGE_NOMEM: /* OOM routine works */
-			if (!oom || invoke_oom)
-				goto nomem;
-			nr_oom_retries--;
-			break;
-		}
-	} while (ret != CHARGE_OK);
+	if (!(gfp_mask & __GFP_WAIT))
+		goto nomem;
 
-	if (batch > nr_pages)
-		refill_stock(memcg, batch - nr_pages);
-done:
-	return 0;
+	if (gfp_mask & __GFP_NORETRY)
+		goto nomem;
+
+	nr_reclaimed = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
+
+	if (mem_cgroup_margin(mem_over_limit) >= batch)
+		goto retry;
+	/*
+	 * Even though the limit is exceeded at this point, reclaim
+	 * may have been able to free some pages.  Retry the charge
+	 * before killing the task.
+	 *
+	 * Only for regular pages, though: huge pages are rather
+	 * unlikely to succeed so close to the limit, and we fall back
+	 * to regular pages anyway in case of failure.
+	 */
+	if (nr_reclaimed && batch <= (1 << PAGE_ALLOC_COSTLY_ORDER))
+		goto retry;
+	/*
+	 * At task move, charge accounts can be doubly counted. So, it's
+	 * better to wait until the end of task_move if something is going on.
+	 */
+	if (mem_cgroup_wait_acct_move(mem_over_limit))
+		goto retry;
+
+	if (fatal_signal_pending(current))
+		goto bypass;
+
+	if (!oom)
+		goto nomem;
+
+	if (nr_oom_retries--)
+		goto retry;
+
+	mem_cgroup_oom(mem_over_limit, gfp_mask, get_order(batch));
 nomem:
 	if (!(gfp_mask & __GFP_NOFAIL))
 		return -ENOMEM;
 bypass:
 	return -EINTR;
+
+done_restock:
+	if (batch > nr_pages)
+		refill_stock(memcg, batch - nr_pages);
+done:
+	return 0;
 }
 
 /**
-- 
1.9.2


* [patch 2/9] mm: memcontrol: rearrange charging fast path
  2014-04-30 20:25 [patch 0/9] mm: memcontrol: naturalize charge lifetime Johannes Weiner
  2014-04-30 20:25 ` [patch 1/9] mm: memcontrol: fold mem_cgroup_do_charge() Johannes Weiner
@ 2014-04-30 20:25 ` Johannes Weiner
  2014-05-07 14:33   ` Michal Hocko
  2014-04-30 20:25 ` [patch 3/9] mm: memcontrol: retry reclaim for oom-disabled and __GFP_NOFAIL charges Johannes Weiner
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 35+ messages in thread
From: Johannes Weiner @ 2014-04-30 20:25 UTC (permalink / raw)
  To: linux-mm; +Cc: Michal Hocko, Hugh Dickins, Tejun Heo, cgroups, linux-kernel

The charging path currently starts out with OOM condition checks, even
though OOM is the rarest possible case.

Rearrange this code to run the OOM/task-dying checks only after trying
the percpu charge and the res_counter charge, and bail out before
entering reclaim.  Attempting one more charge does not hurt a dying or
OOM-killed task as much as having to check for these rare conditions
on every single charge attempt hurts everybody else.  Also, only check
__GFP_NOFAIL when the charge would actually fail.
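
For reference, this is roughly how the checks in mem_cgroup_try_charge()
are ordered once this patch is applied, condensed from the resulting
function (the res_counter charging block and the tail of the slow path
are abbreviated as comments):

    retry:
            if (consume_stock(memcg, nr_pages))
                    goto done;
            /* charge a full batch against res (and memsw) -> done_restock */
            if (batch > nr_pages) {
                    batch = nr_pages;
                    goto retry;
            }
            /* only the slow path considers the rare dying/OOM cases */
            if (unlikely(test_thread_flag(TIF_MEMDIE) ||
                         fatal_signal_pending(current)))
                    goto bypass;
            if (unlikely(task_in_memcg_oom(current)))
                    goto nomem;
            if (!(gfp_mask & __GFP_WAIT))
                    goto nomem;
            /* ... reclaim, retry heuristics, __GFP_NOFAIL bypass, OOM ... */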

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c | 31 ++++++++++++++++---------------
 1 file changed, 16 insertions(+), 15 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 75dfeb8fa98b..6ce59146fec7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2598,21 +2598,6 @@ static int mem_cgroup_try_charge(struct mem_cgroup *memcg,
 
 	if (mem_cgroup_is_root(memcg))
 		goto done;
-	/*
-	 * Unlike in global OOM situations, memcg is not in a physical
-	 * memory shortage.  Allow dying and OOM-killed tasks to
-	 * bypass the last charges so that they can exit quickly and
-	 * free their memory.
-	 */
-	if (unlikely(test_thread_flag(TIF_MEMDIE) ||
-		     fatal_signal_pending(current)))
-		goto bypass;
-
-	if (unlikely(task_in_memcg_oom(current)))
-		goto nomem;
-
-	if (gfp_mask & __GFP_NOFAIL)
-		oom = false;
 retry:
 	if (consume_stock(memcg, nr_pages))
 		goto done;
@@ -2634,6 +2619,19 @@ retry:
 		goto retry;
 	}
 
+	/*
+	 * Unlike in global OOM situations, memcg is not in a physical
+	 * memory shortage.  Allow dying and OOM-killed tasks to
+	 * bypass the last charges so that they can exit quickly and
+	 * free their memory.
+	 */
+	if (unlikely(test_thread_flag(TIF_MEMDIE) ||
+		     fatal_signal_pending(current)))
+		goto bypass;
+
+	if (unlikely(task_in_memcg_oom(current)))
+		goto nomem;
+
 	if (!(gfp_mask & __GFP_WAIT))
 		goto nomem;
 
@@ -2662,6 +2660,9 @@ retry:
 	if (mem_cgroup_wait_acct_move(mem_over_limit))
 		goto retry;
 
+	if (gfp_mask & __GFP_NOFAIL)
+		goto bypass;
+
 	if (fatal_signal_pending(current))
 		goto bypass;
 
-- 
1.9.2


* [patch 3/9] mm: memcontrol: retry reclaim for oom-disabled and __GFP_NOFAIL charges
  2014-04-30 20:25 [patch 0/9] mm: memcontrol: naturalize charge lifetime Johannes Weiner
  2014-04-30 20:25 ` [patch 1/9] mm: memcontrol: fold mem_cgroup_do_charge() Johannes Weiner
  2014-04-30 20:25 ` [patch 2/9] mm: memcontrol: rearrange charging fast path Johannes Weiner
@ 2014-04-30 20:25 ` Johannes Weiner
  2014-05-07 14:43   ` Michal Hocko
  2014-04-30 20:25 ` [patch 4/9] mm: memcontrol: catch root bypass in move precharge Johannes Weiner
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 35+ messages in thread
From: Johannes Weiner @ 2014-04-30 20:25 UTC (permalink / raw)
  To: linux-mm; +Cc: Michal Hocko, Hugh Dickins, Tejun Heo, cgroups, linux-kernel

There is no reason why oom-disabled and __GFP_NOFAIL charges should
try to reclaim only once when every other charge tries several times
before giving up.  Make them all retry the same number of times.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6ce59146fec7..c431a30280ac 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2589,7 +2589,7 @@ static int mem_cgroup_try_charge(struct mem_cgroup *memcg,
 				 bool oom)
 {
 	unsigned int batch = max(CHARGE_BATCH, nr_pages);
-	int nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
+	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
 	struct mem_cgroup *mem_over_limit;
 	struct res_counter *fail_res;
 	unsigned long nr_reclaimed;
@@ -2660,6 +2660,9 @@ retry:
 	if (mem_cgroup_wait_acct_move(mem_over_limit))
 		goto retry;
 
+	if (nr_retries--)
+		goto retry;
+
 	if (gfp_mask & __GFP_NOFAIL)
 		goto bypass;
 
@@ -2669,9 +2672,6 @@ retry:
 	if (!oom)
 		goto nomem;
 
-	if (nr_oom_retries--)
-		goto retry;
-
 	mem_cgroup_oom(mem_over_limit, gfp_mask, get_order(batch));
 nomem:
 	if (!(gfp_mask & __GFP_NOFAIL))
-- 
1.9.2


* [patch 4/9] mm: memcontrol: catch root bypass in move precharge
  2014-04-30 20:25 [patch 0/9] mm: memcontrol: naturalize charge lifetime Johannes Weiner
                   ` (2 preceding siblings ...)
  2014-04-30 20:25 ` [patch 3/9] mm: memcontrol: retry reclaim for oom-disabled and __GFP_NOFAIL charges Johannes Weiner
@ 2014-04-30 20:25 ` Johannes Weiner
  2014-05-07 14:55   ` Michal Hocko
  2014-04-30 20:25 ` [patch 5/9] mm: memcontrol: use root_mem_cgroup res_counter Johannes Weiner
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 35+ messages in thread
From: Johannes Weiner @ 2014-04-30 20:25 UTC (permalink / raw)
  To: linux-mm; +Cc: Michal Hocko, Hugh Dickins, Tejun Heo, cgroups, linux-kernel

When mem_cgroup_try_charge() returns -EINTR, it has bypassed the
charge to the root memcg.  But move precharging does not catch this
and treats
this case as if no charge had happened, thus leaking a charge against
root.  Because of an old optimization, the root memcg's res_counter is
not actually charged right now, but it's still an imbalance and
subsequent patches will charge the root memcg again.

Thus, catch those bypasses to the root memcg and properly cancel them
before giving up the move.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c431a30280ac..788be26103f9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6546,8 +6546,9 @@ one_by_one:
 			cond_resched();
 		}
 		ret = mem_cgroup_try_charge(memcg, GFP_KERNEL, 1, false);
+		if (ret == -EINTR)
+			__mem_cgroup_cancel_charge(root_mem_cgroup, 1);
 		if (ret)
-			/* mem_cgroup_clear_mc() will do uncharge later */
 			return ret;
 		mc.precharge++;
 	}
-- 
1.9.2


* [patch 5/9] mm: memcontrol: use root_mem_cgroup res_counter
  2014-04-30 20:25 [patch 0/9] mm: memcontrol: naturalize charge lifetime Johannes Weiner
                   ` (3 preceding siblings ...)
  2014-04-30 20:25 ` [patch 4/9] mm: memcontrol: catch root bypass in move precharge Johannes Weiner
@ 2014-04-30 20:25 ` Johannes Weiner
  2014-05-07 15:14   ` Michal Hocko
  2014-04-30 20:25 ` [patch 6/9] mm: memcontrol: remove ordering between pc->mem_cgroup and PageCgroupUsed Johannes Weiner
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 35+ messages in thread
From: Johannes Weiner @ 2014-04-30 20:25 UTC (permalink / raw)
  To: linux-mm; +Cc: Michal Hocko, Hugh Dickins, Tejun Heo, cgroups, linux-kernel

The root_mem_cgroup res_counter is never charged itself: there is no
limit at the root level anyway, and any statistics are generated on
demand by summing up the counters of all other cgroups.  This was an
optimization to keep down costs on systems that don't create specific
cgroups, but with per-cpu charge caches the res_counter operations do
not even show up in profiles anymore.  Just remove it and simplify
the code.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c | 157 ++++++++++++++++++--------------------------------------
 1 file changed, 49 insertions(+), 108 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 788be26103f9..34407d99262a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2595,9 +2595,8 @@ static int mem_cgroup_try_charge(struct mem_cgroup *memcg,
 	unsigned long nr_reclaimed;
 	unsigned long flags = 0;
 	unsigned long long size;
+	int ret = 0;
 
-	if (mem_cgroup_is_root(memcg))
-		goto done;
 retry:
 	if (consume_stock(memcg, nr_pages))
 		goto done;
@@ -2677,13 +2676,15 @@ nomem:
 	if (!(gfp_mask & __GFP_NOFAIL))
 		return -ENOMEM;
 bypass:
-	return -EINTR;
+	memcg = root_mem_cgroup;
+	ret = -EINTR;
+	goto retry;
 
 done_restock:
 	if (batch > nr_pages)
 		refill_stock(memcg, batch - nr_pages);
 done:
-	return 0;
+	return ret;
 }
 
 /**
@@ -2723,13 +2724,11 @@ static struct mem_cgroup *mem_cgroup_try_charge_mm(struct mm_struct *mm,
 static void __mem_cgroup_cancel_charge(struct mem_cgroup *memcg,
 				       unsigned int nr_pages)
 {
-	if (!mem_cgroup_is_root(memcg)) {
-		unsigned long bytes = nr_pages * PAGE_SIZE;
+	unsigned long bytes = nr_pages * PAGE_SIZE;
 
-		res_counter_uncharge(&memcg->res, bytes);
-		if (do_swap_account)
-			res_counter_uncharge(&memcg->memsw, bytes);
-	}
+	res_counter_uncharge(&memcg->res, bytes);
+	if (do_swap_account)
+		res_counter_uncharge(&memcg->memsw, bytes);
 }
 
 /*
@@ -2741,9 +2740,6 @@ static void __mem_cgroup_cancel_local_charge(struct mem_cgroup *memcg,
 {
 	unsigned long bytes = nr_pages * PAGE_SIZE;
 
-	if (mem_cgroup_is_root(memcg))
-		return;
-
 	res_counter_uncharge_until(&memcg->res, memcg->res.parent, bytes);
 	if (do_swap_account)
 		res_counter_uncharge_until(&memcg->memsw,
@@ -3925,9 +3921,13 @@ int mem_cgroup_charge_file(struct page *page, struct mm_struct *mm,
 	 * Page cache insertions can happen without an actual mm
 	 * context, e.g. during disk probing on boot.
 	 */
-	if (unlikely(!mm))
+	if (unlikely(!mm)) {
 		memcg = root_mem_cgroup;
-	else {
+		ret = mem_cgroup_try_charge(memcg, gfp_mask, 1, true);
+		VM_BUG_ON(ret == -EINTR);
+		if (ret)
+			return ret;
+	} else {
 		memcg = mem_cgroup_try_charge_mm(mm, gfp_mask, 1, true);
 		if (!memcg)
 			return -ENOMEM;
@@ -4083,7 +4083,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype,
 	 * replacement page, so leave it alone when phasing out the
 	 * page that is unused after the migration.
 	 */
-	if (!end_migration && !mem_cgroup_is_root(memcg))
+	if (!end_migration)
 		mem_cgroup_do_uncharge(memcg, nr_pages, ctype);
 
 	return memcg;
@@ -4216,8 +4216,7 @@ void mem_cgroup_uncharge_swap(swp_entry_t ent)
 		 * We uncharge this because swap is freed.
 		 * This memcg can be obsolete one. We avoid calling css_tryget
 		 */
-		if (!mem_cgroup_is_root(memcg))
-			res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
+		res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
 		mem_cgroup_swap_statistics(memcg, false);
 		css_put(&memcg->css);
 	}
@@ -4903,78 +4902,24 @@ out:
 	return retval;
 }
 
-
-static unsigned long mem_cgroup_recursive_stat(struct mem_cgroup *memcg,
-					       enum mem_cgroup_stat_index idx)
-{
-	struct mem_cgroup *iter;
-	long val = 0;
-
-	/* Per-cpu values can be negative, use a signed accumulator */
-	for_each_mem_cgroup_tree(iter, memcg)
-		val += mem_cgroup_read_stat(iter, idx);
-
-	if (val < 0) /* race ? */
-		val = 0;
-	return val;
-}
-
-static inline u64 mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
-{
-	u64 val;
-
-	if (!mem_cgroup_is_root(memcg)) {
-		if (!swap)
-			return res_counter_read_u64(&memcg->res, RES_USAGE);
-		else
-			return res_counter_read_u64(&memcg->memsw, RES_USAGE);
-	}
-
-	/*
-	 * Transparent hugepages are still accounted for in MEM_CGROUP_STAT_RSS
-	 * as well as in MEM_CGROUP_STAT_RSS_HUGE.
-	 */
-	val = mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_CACHE);
-	val += mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_RSS);
-
-	if (swap)
-		val += mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_SWAP);
-
-	return val << PAGE_SHIFT;
-}
-
 static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
-				   struct cftype *cft)
+			       struct cftype *cft)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
-	u64 val;
-	int name;
-	enum res_type type;
-
-	type = MEMFILE_TYPE(cft->private);
-	name = MEMFILE_ATTR(cft->private);
+	enum res_type type = MEMFILE_TYPE(cft->private);
+	int name = MEMFILE_ATTR(cft->private);
 
 	switch (type) {
 	case _MEM:
-		if (name == RES_USAGE)
-			val = mem_cgroup_usage(memcg, false);
-		else
-			val = res_counter_read_u64(&memcg->res, name);
-		break;
+		return res_counter_read_u64(&memcg->res, name);
 	case _MEMSWAP:
-		if (name == RES_USAGE)
-			val = mem_cgroup_usage(memcg, true);
-		else
-			val = res_counter_read_u64(&memcg->memsw, name);
-		break;
+		return res_counter_read_u64(&memcg->memsw, name);
 	case _KMEM:
-		val = res_counter_read_u64(&memcg->kmem, name);
+		return res_counter_read_u64(&memcg->kmem, name);
 		break;
 	default:
 		BUG();
 	}
-
-	return val;
 }
 
 #ifdef CONFIG_MEMCG_KMEM
@@ -5440,7 +5385,10 @@ static void __mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap)
 	if (!t)
 		goto unlock;
 
-	usage = mem_cgroup_usage(memcg, swap);
+	if (!swap)
+		usage = res_counter_read_u64(&memcg->res, RES_USAGE);
+	else
+		usage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
 
 	/*
 	 * current_threshold points to threshold just below or equal to usage.
@@ -5532,15 +5480,15 @@ static int __mem_cgroup_usage_register_event(struct mem_cgroup *memcg,
 
 	mutex_lock(&memcg->thresholds_lock);
 
-	if (type == _MEM)
+	if (type == _MEM) {
 		thresholds = &memcg->thresholds;
-	else if (type == _MEMSWAP)
+		usage = res_counter_read_u64(&memcg->res, RES_USAGE);
+	} else if (type == _MEMSWAP) {
 		thresholds = &memcg->memsw_thresholds;
-	else
+		usage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
+	} else
 		BUG();
 
-	usage = mem_cgroup_usage(memcg, type == _MEMSWAP);
-
 	/* Check if a threshold crossed before adding a new one */
 	if (thresholds->primary)
 		__mem_cgroup_threshold(memcg, type == _MEMSWAP);
@@ -5620,18 +5568,19 @@ static void __mem_cgroup_usage_unregister_event(struct mem_cgroup *memcg,
 	int i, j, size;
 
 	mutex_lock(&memcg->thresholds_lock);
-	if (type == _MEM)
+
+	if (type == _MEM) {
 		thresholds = &memcg->thresholds;
-	else if (type == _MEMSWAP)
+		usage = res_counter_read_u64(&memcg->res, RES_USAGE);
+	} else if (type == _MEMSWAP) {
 		thresholds = &memcg->memsw_thresholds;
-	else
+		usage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
+	} else
 		BUG();
 
 	if (!thresholds->primary)
 		goto unlock;
 
-	usage = mem_cgroup_usage(memcg, type == _MEMSWAP);
-
 	/* Check if a threshold crossed before removing */
 	__mem_cgroup_threshold(memcg, type == _MEMSWAP);
 
@@ -6390,9 +6339,9 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
 		 * core guarantees its existence.
 		 */
 	} else {
-		res_counter_init(&memcg->res, NULL);
-		res_counter_init(&memcg->memsw, NULL);
-		res_counter_init(&memcg->kmem, NULL);
+		res_counter_init(&memcg->res, &root_mem_cgroup->res);
+		res_counter_init(&memcg->memsw, &root_mem_cgroup->memsw);
+		res_counter_init(&memcg->kmem, &root_mem_cgroup->kmem);
 		/*
 		 * Deeper hierachy with use_hierarchy == false doesn't make
 		 * much sense so let cgroup subsystem know about this
@@ -6510,11 +6459,6 @@ static int mem_cgroup_do_precharge(unsigned long count)
 	int batch_count = PRECHARGE_COUNT_AT_ONCE;
 	struct mem_cgroup *memcg = mc.to;
 
-	if (mem_cgroup_is_root(memcg)) {
-		mc.precharge += count;
-		/* we don't need css_get for root */
-		return ret;
-	}
 	/* try to charge at once */
 	if (count > 1) {
 		struct res_counter *dummy;
@@ -6825,21 +6769,18 @@ static void __mem_cgroup_clear_mc(void)
 	/* we must fixup refcnts and charges */
 	if (mc.moved_swap) {
 		/* uncharge swap account from the old cgroup */
-		if (!mem_cgroup_is_root(mc.from))
-			res_counter_uncharge(&mc.from->memsw,
-						PAGE_SIZE * mc.moved_swap);
+		res_counter_uncharge(&mc.from->memsw,
+				     PAGE_SIZE * mc.moved_swap);
 
 		for (i = 0; i < mc.moved_swap; i++)
 			css_put(&mc.from->css);
 
-		if (!mem_cgroup_is_root(mc.to)) {
-			/*
-			 * we charged both to->res and to->memsw, so we should
-			 * uncharge to->res.
-			 */
-			res_counter_uncharge(&mc.to->res,
-						PAGE_SIZE * mc.moved_swap);
-		}
+		/*
+		 * we charged both to->res and to->memsw, so we should
+		 * uncharge to->res.
+		 */
+		res_counter_uncharge(&mc.to->res,
+				     PAGE_SIZE * mc.moved_swap);
 		/* we've already done css_get(mc.to) */
 		mc.moved_swap = 0;
 	}
-- 
1.9.2


* [patch 6/9] mm: memcontrol: remove ordering between pc->mem_cgroup and PageCgroupUsed
  2014-04-30 20:25 [patch 0/9] mm: memcontrol: naturalize charge lifetime Johannes Weiner
                   ` (4 preceding siblings ...)
  2014-04-30 20:25 ` [patch 5/9] mm: memcontrol: use root_mem_cgroup res_counter Johannes Weiner
@ 2014-04-30 20:25 ` Johannes Weiner
  2014-05-23 13:20   ` Michal Hocko
  2014-04-30 20:25 ` [patch 7/9] mm: memcontrol: do not acquire page_cgroup lock for kmem pages Johannes Weiner
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 35+ messages in thread
From: Johannes Weiner @ 2014-04-30 20:25 UTC (permalink / raw)
  To: linux-mm; +Cc: Michal Hocko, Hugh Dickins, Tejun Heo, cgroups, linux-kernel

There is a write barrier between setting pc->mem_cgroup and
PageCgroupUsed, which was added to allow LRU operations to look up the
memcg LRU list of a page without acquiring the page_cgroup lock.  But
ever since 38c5d72f3ebe ("memcg: simplify LRU handling by new rule"),
pages are ensured to be off-LRU while charging, so nobody else is
changing LRU state while pc->mem_cgroup is being written.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c | 9 ---------
 1 file changed, 9 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 34407d99262a..c528ae9ac230 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2823,14 +2823,6 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg,
 	}
 
 	pc->mem_cgroup = memcg;
-	/*
-	 * We access a page_cgroup asynchronously without lock_page_cgroup().
-	 * Especially when a page_cgroup is taken from a page, pc->mem_cgroup
-	 * is accessed after testing USED bit. To make pc->mem_cgroup visible
-	 * before USED bit, we need memory barrier here.
-	 * See mem_cgroup_add_lru_list(), etc.
-	 */
-	smp_wmb();
 	SetPageCgroupUsed(pc);
 
 	if (lrucare) {
@@ -3609,7 +3601,6 @@ void mem_cgroup_split_huge_fixup(struct page *head)
 	for (i = 1; i < HPAGE_PMD_NR; i++) {
 		pc = head_pc + i;
 		pc->mem_cgroup = memcg;
-		smp_wmb();/* see __commit_charge() */
 		pc->flags = head_pc->flags & ~PCGF_NOCOPY_AT_SPLIT;
 	}
 	__this_cpu_sub(memcg->stat->count[MEM_CGROUP_STAT_RSS_HUGE],
-- 
1.9.2


* [patch 7/9] mm: memcontrol: do not acquire page_cgroup lock for kmem pages
  2014-04-30 20:25 [patch 0/9] mm: memcontrol: naturalize charge lifetime Johannes Weiner
                   ` (5 preceding siblings ...)
  2014-04-30 20:25 ` [patch 6/9] mm: memcontrol: remove ordering between pc->mem_cgroup and PageCgroupUsed Johannes Weiner
@ 2014-04-30 20:25 ` Johannes Weiner
  2014-05-23 13:39   ` Michal Hocko
  2014-04-30 20:25 ` [patch 8/9] mm: memcontrol: rewrite charge API Johannes Weiner
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 35+ messages in thread
From: Johannes Weiner @ 2014-04-30 20:25 UTC (permalink / raw)
  To: linux-mm; +Cc: Michal Hocko, Hugh Dickins, Tejun Heo, cgroups, linux-kernel

Kmem page charging and uncharging are serialized by means of exclusive
access to the page.  Do not take the page_cgroup lock and do not set
pc->flags atomically.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c | 16 +++-------------
 1 file changed, 3 insertions(+), 13 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c528ae9ac230..d3961fce1d54 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3535,10 +3535,8 @@ void __memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg,
 	}
 
 	pc = lookup_page_cgroup(page);
-	lock_page_cgroup(pc);
 	pc->mem_cgroup = memcg;
-	SetPageCgroupUsed(pc);
-	unlock_page_cgroup(pc);
+	pc->flags = PCG_USED;
 }
 
 void __memcg_kmem_uncharge_pages(struct page *page, int order)
@@ -3548,19 +3546,11 @@ void __memcg_kmem_uncharge_pages(struct page *page, int order)
 
 
 	pc = lookup_page_cgroup(page);
-	/*
-	 * Fast unlocked return. Theoretically might have changed, have to
-	 * check again after locking.
-	 */
 	if (!PageCgroupUsed(pc))
 		return;
 
-	lock_page_cgroup(pc);
-	if (PageCgroupUsed(pc)) {
-		memcg = pc->mem_cgroup;
-		ClearPageCgroupUsed(pc);
-	}
-	unlock_page_cgroup(pc);
+	memcg = pc->mem_cgroup;
+	pc->flags = 0;
 
 	/*
 	 * We trust that only if there is a memcg associated with the page, it
-- 
1.9.2


* [patch 8/9] mm: memcontrol: rewrite charge API
  2014-04-30 20:25 [patch 0/9] mm: memcontrol: naturalize charge lifetime Johannes Weiner
                   ` (6 preceding siblings ...)
  2014-04-30 20:25 ` [patch 7/9] mm: memcontrol: do not acquire page_cgroup lock for kmem pages Johannes Weiner
@ 2014-04-30 20:25 ` Johannes Weiner
  2014-05-23 14:18   ` Michal Hocko
  2014-05-23 14:54   ` Michal Hocko
  2014-04-30 20:25 ` [patch 9/9] mm: memcontrol: rewrite uncharge API Johannes Weiner
  2014-05-02 11:26 ` [patch 0/9] mm: memcontrol: naturalize charge lifetime Michal Hocko
  9 siblings, 2 replies; 35+ messages in thread
From: Johannes Weiner @ 2014-04-30 20:25 UTC (permalink / raw)
  To: linux-mm; +Cc: Michal Hocko, Hugh Dickins, Tejun Heo, cgroups, linux-kernel

The memcg charge API charges pages before they are rmapped - i.e. have
an actual "type" - and so every callsite needs its own set of charge
and uncharge functions to know what type is being operated on.

Rewrite the charge API to provide a generic set of try_charge(),
commit_charge() and cancel_charge() transaction operations, much like
what's currently done for swap-in:

  mem_cgroup_try_charge() attempts to reserve a charge, reclaiming
  pages from the memcg if necessary.

  mem_cgroup_commit_charge() commits the page to the charge once it
  has a valid page->mapping and PageAnon() reliably tells the type.

  mem_cgroup_cancel_charge() aborts the transaction.

As pages need to be committed after rmap is established but before
they are added to the LRU, page_add_new_anon_rmap() must once again
stop doing LRU additions itself.  Factor the LRU insertion out into
lru_cache_add_active_or_unevictable().
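
To make the calling convention concrete, this is roughly what the
anonymous fault path looks like with the new API, condensed from the
__do_huge_pmd_anonymous_page() hunk below (page table allocation,
locking and statistics updates elided; lrucare is false because a
newly allocated page cannot be on an LRU yet):

    struct mem_cgroup *memcg;

    if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg))
            return VM_FAULT_OOM;

    if (unlikely(!pmd_none(*pmd))) {
            /* lost the race to another fault: undo the reservation */
            mem_cgroup_cancel_charge(page, memcg);
            put_page(page);
    } else {
            page_add_new_anon_rmap(page, vma, haddr);
            mem_cgroup_commit_charge(page, memcg, false);
            lru_cache_add_active_or_unevictable(page, vma);
            /* ... set_pmd_at() etc. ... */
    }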

The order of functions in mm/memcontrol.c is entirely random, so this
new charge interface is implemented at the end of the file, where all
new, cleaned-up, and documented code should go from now on.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 Documentation/cgroups/memcg_test.txt |  32 +-
 include/linux/memcontrol.h           |  53 +--
 include/linux/swap.h                 |   3 +
 kernel/events/uprobes.c              |   1 +
 mm/filemap.c                         |   9 +-
 mm/huge_memory.c                     |  51 ++-
 mm/memcontrol.c                      | 777 ++++++++++++++++-------------------
 mm/memory.c                          |  41 +-
 mm/migrate.c                         |   1 +
 mm/rmap.c                            |   5 -
 mm/shmem.c                           |  24 +-
 mm/swap.c                            |  20 +
 mm/swapfile.c                        |  14 +-
 13 files changed, 479 insertions(+), 552 deletions(-)

diff --git a/Documentation/cgroups/memcg_test.txt b/Documentation/cgroups/memcg_test.txt
index 80ac454704b8..bcf750d3cecd 100644
--- a/Documentation/cgroups/memcg_test.txt
+++ b/Documentation/cgroups/memcg_test.txt
@@ -24,24 +24,7 @@ Please note that implementation details can be changed.
 
    a page/swp_entry may be charged (usage += PAGE_SIZE) at
 
-	mem_cgroup_charge_anon()
-	  Called at new page fault and Copy-On-Write.
-
-	mem_cgroup_try_charge_swapin()
-	  Called at do_swap_page() (page fault on swap entry) and swapoff.
-	  Followed by charge-commit-cancel protocol. (With swap accounting)
-	  At commit, a charge recorded in swap_cgroup is removed.
-
-	mem_cgroup_charge_file()
-	  Called at add_to_page_cache()
-
-	mem_cgroup_cache_charge_swapin()
-	  Called at shmem's swapin.
-
-	mem_cgroup_prepare_migration()
-	  Called before migration. "extra" charge is done and followed by
-	  charge-commit-cancel protocol.
-	  At commit, charge against oldpage or newpage will be committed.
+	mem_cgroup_try_charge()
 
 2. Uncharge
   a page/swp_entry may be uncharged (usage -= PAGE_SIZE) by
@@ -69,19 +52,14 @@ Please note that implementation details can be changed.
 	to new page is committed. At failure, charge to old page is committed.
 
 3. charge-commit-cancel
-	In some case, we can't know this "charge" is valid or not at charging
-	(because of races).
-	To handle such case, there are charge-commit-cancel functions.
-		mem_cgroup_try_charge_XXX
-		mem_cgroup_commit_charge_XXX
-		mem_cgroup_cancel_charge_XXX
-	these are used in swap-in and migration.
+	Memcg pages are charged in two steps:
+		mem_cgroup_try_charge()
+		mem_cgroup_commit_charge() or mem_cgroup_cancel_charge()
 
 	At try_charge(), there are no flags to say "this page is charged".
 	at this point, usage += PAGE_SIZE.
 
-	At commit(), the function checks the page should be charged or not
-	and set flags or avoid charging.(usage -= PAGE_SIZE)
+	At commit(), the page is associated with the memcg.
 
 	At cancel(), simply usage -= PAGE_SIZE.
 
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b569b8be5c5a..5578b07376b7 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -54,28 +54,11 @@ struct mem_cgroup_reclaim_cookie {
 };
 
 #ifdef CONFIG_MEMCG
-/*
- * All "charge" functions with gfp_mask should use GFP_KERNEL or
- * (gfp_mask & GFP_RECLAIM_MASK). In current implementatin, memcg doesn't
- * alloc memory but reclaims memory from all available zones. So, "where I want
- * memory from" bits of gfp_mask has no meaning. So any bits of that field is
- * available but adding a rule is better. charge functions' gfp_mask should
- * be set to GFP_KERNEL or gfp_mask & GFP_RECLAIM_MASK for avoiding ambiguous
- * codes.
- * (Of course, if memcg does memory allocation in future, GFP_KERNEL is sane.)
- */
-
-extern int mem_cgroup_charge_anon(struct page *page, struct mm_struct *mm,
-				gfp_t gfp_mask);
-/* for swap handling */
-extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
-		struct page *page, gfp_t mask, struct mem_cgroup **memcgp);
-extern void mem_cgroup_commit_charge_swapin(struct page *page,
-					struct mem_cgroup *memcg);
-extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg);
-
-extern int mem_cgroup_charge_file(struct page *page, struct mm_struct *mm,
-					gfp_t gfp_mask);
+int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
+			  gfp_t gfp_mask, struct mem_cgroup **memcgp);
+void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
+			      bool lrucare);
+void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg);
 
 struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
 struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
@@ -233,30 +216,22 @@ void mem_cgroup_print_bad_page(struct page *page);
 #else /* CONFIG_MEMCG */
 struct mem_cgroup;
 
-static inline int mem_cgroup_charge_anon(struct page *page,
-					struct mm_struct *mm, gfp_t gfp_mask)
-{
-	return 0;
-}
-
-static inline int mem_cgroup_charge_file(struct page *page,
-					struct mm_struct *mm, gfp_t gfp_mask)
-{
-	return 0;
-}
-
-static inline int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
-		struct page *page, gfp_t gfp_mask, struct mem_cgroup **memcgp)
+static inline int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
+					gfp_t gfp_mask,
+					struct mem_cgroup **memcgp)
 {
+	*memcgp = NULL;
 	return 0;
 }
 
-static inline void mem_cgroup_commit_charge_swapin(struct page *page,
-					  struct mem_cgroup *memcg)
+static inline void mem_cgroup_commit_charge(struct page *page,
+					    struct mem_cgroup *memcg,
+					    bool lrucare)
 {
 }
 
-static inline void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg)
+static inline void mem_cgroup_cancel_charge(struct page *page,
+					    struct mem_cgroup *memcg)
 {
 }
 
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 350711560753..403a8530ee62 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -323,6 +323,9 @@ extern void swap_setup(void);
 
 extern void add_page_to_unevictable_list(struct page *page);
 
+extern void lru_cache_add_active_or_unevictable(struct page *page,
+						struct vm_area_struct *vma);
+
 /**
  * lru_cache_add: add a page to the page lists
  * @page: the page to add
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 04709b66369d..44c508044c1d 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -180,6 +180,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 
 	get_page(kpage);
 	page_add_new_anon_rmap(kpage, vma, addr);
+	lru_cache_add_active_or_unevictable(kpage, vma);
 
 	if (!PageAnon(page)) {
 		dec_mm_counter(mm, MM_FILEPAGES);
diff --git a/mm/filemap.c b/mm/filemap.c
index a82fbe4c9e8e..346c2e178193 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -558,19 +558,19 @@ static int __add_to_page_cache_locked(struct page *page,
 				      pgoff_t offset, gfp_t gfp_mask,
 				      void **shadowp)
 {
+	struct mem_cgroup *memcg;
 	int error;
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(PageSwapBacked(page), page);
 
-	error = mem_cgroup_charge_file(page, current->mm,
-					gfp_mask & GFP_RECLAIM_MASK);
+	error = mem_cgroup_try_charge(page, current->mm, gfp_mask, &memcg);
 	if (error)
 		return error;
 
 	error = radix_tree_maybe_preload(gfp_mask & ~__GFP_HIGHMEM);
 	if (error) {
-		mem_cgroup_uncharge_cache_page(page);
+		mem_cgroup_cancel_charge(page, memcg);
 		return error;
 	}
 
@@ -585,13 +585,14 @@ static int __add_to_page_cache_locked(struct page *page,
 		goto err_insert;
 	__inc_zone_page_state(page, NR_FILE_PAGES);
 	spin_unlock_irq(&mapping->tree_lock);
+	mem_cgroup_commit_charge(page, memcg, false);
 	trace_mm_filemap_add_to_page_cache(page);
 	return 0;
 err_insert:
 	page->mapping = NULL;
 	/* Leave page->index set: truncation relies upon it */
 	spin_unlock_irq(&mapping->tree_lock);
-	mem_cgroup_uncharge_cache_page(page);
+	mem_cgroup_cancel_charge(page, memcg);
 	page_cache_release(page);
 	return error;
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 64635f5278ff..1a22d8b12cf2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -715,13 +715,20 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 					unsigned long haddr, pmd_t *pmd,
 					struct page *page)
 {
+	struct mem_cgroup *memcg;
 	pgtable_t pgtable;
 	spinlock_t *ptl;
 
 	VM_BUG_ON_PAGE(!PageCompound(page), page);
+
+	if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg))
+		return VM_FAULT_OOM;
+
 	pgtable = pte_alloc_one(mm, haddr);
-	if (unlikely(!pgtable))
+	if (unlikely(!pgtable)) {
+		mem_cgroup_cancel_charge(page, memcg);
 		return VM_FAULT_OOM;
+	}
 
 	clear_huge_page(page, haddr, HPAGE_PMD_NR);
 	/*
@@ -734,7 +741,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 	ptl = pmd_lock(mm, pmd);
 	if (unlikely(!pmd_none(*pmd))) {
 		spin_unlock(ptl);
-		mem_cgroup_uncharge_page(page);
+		mem_cgroup_cancel_charge(page, memcg);
 		put_page(page);
 		pte_free(mm, pgtable);
 	} else {
@@ -742,6 +749,8 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 		entry = mk_huge_pmd(page, vma->vm_page_prot);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 		page_add_new_anon_rmap(page, vma, haddr);
+		mem_cgroup_commit_charge(page, memcg, false);
+		lru_cache_add_active_or_unevictable(page, vma);
 		pgtable_trans_huge_deposit(mm, pmd, pgtable);
 		set_pmd_at(mm, haddr, pmd, entry);
 		add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
@@ -827,13 +836,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		count_vm_event(THP_FAULT_FALLBACK);
 		return VM_FAULT_FALLBACK;
 	}
-	if (unlikely(mem_cgroup_charge_anon(page, mm, GFP_KERNEL))) {
-		put_page(page);
-		count_vm_event(THP_FAULT_FALLBACK);
-		return VM_FAULT_FALLBACK;
-	}
 	if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page))) {
-		mem_cgroup_uncharge_page(page);
 		put_page(page);
 		count_vm_event(THP_FAULT_FALLBACK);
 		return VM_FAULT_FALLBACK;
@@ -948,6 +951,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 					struct page *page,
 					unsigned long haddr)
 {
+	struct mem_cgroup *memcg;
 	spinlock_t *ptl;
 	pgtable_t pgtable;
 	pmd_t _pmd;
@@ -968,13 +972,15 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 					       __GFP_OTHER_NODE,
 					       vma, address, page_to_nid(page));
 		if (unlikely(!pages[i] ||
-			     mem_cgroup_charge_anon(pages[i], mm,
-						       GFP_KERNEL))) {
+			     mem_cgroup_try_charge(pages[i], mm, GFP_KERNEL,
+						   &memcg))) {
 			if (pages[i])
 				put_page(pages[i]);
 			mem_cgroup_uncharge_start();
 			while (--i >= 0) {
-				mem_cgroup_uncharge_page(pages[i]);
+				memcg = (void *)page_private(pages[i]);
+				set_page_private(pages[i], 0);
+				mem_cgroup_cancel_charge(pages[i], memcg);
 				put_page(pages[i]);
 			}
 			mem_cgroup_uncharge_end();
@@ -982,6 +988,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 			ret |= VM_FAULT_OOM;
 			goto out;
 		}
+		set_page_private(pages[i], (unsigned long)memcg);
 	}
 
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
@@ -1010,7 +1017,11 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 		pte_t *pte, entry;
 		entry = mk_pte(pages[i], vma->vm_page_prot);
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		memcg = (void *)page_private(pages[i]);
+		set_page_private(pages[i], 0);
 		page_add_new_anon_rmap(pages[i], vma, haddr);
+		mem_cgroup_commit_charge(pages[i], memcg, false);
+		lru_cache_add_active_or_unevictable(pages[i], vma);
 		pte = pte_offset_map(&_pmd, haddr);
 		VM_BUG_ON(!pte_none(*pte));
 		set_pte_at(mm, haddr, pte, entry);
@@ -1036,7 +1047,9 @@ out_free_pages:
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 	mem_cgroup_uncharge_start();
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
-		mem_cgroup_uncharge_page(pages[i]);
+		memcg = (void *)page_private(pages[i]);
+		set_page_private(pages[i], 0);
+		mem_cgroup_cancel_charge(pages[i], memcg);
 		put_page(pages[i]);
 	}
 	mem_cgroup_uncharge_end();
@@ -1050,6 +1063,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	spinlock_t *ptl;
 	int ret = 0;
 	struct page *page = NULL, *new_page;
+	struct mem_cgroup *memcg;
 	unsigned long haddr;
 	unsigned long mmun_start;	/* For mmu_notifiers */
 	unsigned long mmun_end;		/* For mmu_notifiers */
@@ -1101,7 +1115,7 @@ alloc:
 		goto out;
 	}
 
-	if (unlikely(mem_cgroup_charge_anon(new_page, mm, GFP_KERNEL))) {
+	if (unlikely(mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, &memcg))) {
 		put_page(new_page);
 		if (page) {
 			split_huge_page(page);
@@ -1130,7 +1144,7 @@ alloc:
 		put_page(page);
 	if (unlikely(!pmd_same(*pmd, orig_pmd))) {
 		spin_unlock(ptl);
-		mem_cgroup_uncharge_page(new_page);
+		mem_cgroup_cancel_charge(new_page, memcg);
 		put_page(new_page);
 		goto out_mn;
 	} else {
@@ -1139,6 +1153,8 @@ alloc:
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 		pmdp_clear_flush(vma, haddr, pmd);
 		page_add_new_anon_rmap(new_page, vma, haddr);
+		mem_cgroup_commit_charge(new_page, memcg, false);
+		lru_cache_add_active_or_unevictable(new_page, vma);
 		set_pmd_at(mm, haddr, pmd, entry);
 		update_mmu_cache_pmd(vma, address, pmd);
 		if (!page) {
@@ -2349,6 +2365,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	spinlock_t *pmd_ptl, *pte_ptl;
 	int isolated;
 	unsigned long hstart, hend;
+	struct mem_cgroup *memcg;
 	unsigned long mmun_start;	/* For mmu_notifiers */
 	unsigned long mmun_end;		/* For mmu_notifiers */
 
@@ -2359,7 +2376,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	if (!new_page)
 		return;
 
-	if (unlikely(mem_cgroup_charge_anon(new_page, mm, GFP_KERNEL)))
+	if (unlikely(mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, &memcg)))
 		return;
 
 	/*
@@ -2448,6 +2465,8 @@ static void collapse_huge_page(struct mm_struct *mm,
 	spin_lock(pmd_ptl);
 	BUG_ON(!pmd_none(*pmd));
 	page_add_new_anon_rmap(new_page, vma, address);
+	mem_cgroup_commit_charge(new_page, memcg, false);
+	lru_cache_add_active_or_unevictable(new_page, vma);
 	pgtable_trans_huge_deposit(mm, pmd, pgtable);
 	set_pmd_at(mm, address, pmd, _pmd);
 	update_mmu_cache_pmd(vma, address, pmd);
@@ -2461,7 +2480,7 @@ out_up_write:
 	return;
 
 out:
-	mem_cgroup_uncharge_page(new_page);
+	mem_cgroup_cancel_charge(new_page, memcg);
 	goto out_up_write;
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d3961fce1d54..6f48e292ffe7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2574,163 +2574,6 @@ static int memcg_cpu_hotplug_callback(struct notifier_block *nb,
 	return NOTIFY_OK;
 }
 
-/**
- * mem_cgroup_try_charge - try charging a memcg
- * @memcg: memcg to charge
- * @nr_pages: number of pages to charge
- * @oom: trigger OOM if reclaim fails
- *
- * Returns 0 if @memcg was charged successfully, -EINTR if the charge
- * was bypassed to root_mem_cgroup, and -ENOMEM if the charge failed.
- */
-static int mem_cgroup_try_charge(struct mem_cgroup *memcg,
-				 gfp_t gfp_mask,
-				 unsigned int nr_pages,
-				 bool oom)
-{
-	unsigned int batch = max(CHARGE_BATCH, nr_pages);
-	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
-	struct mem_cgroup *mem_over_limit;
-	struct res_counter *fail_res;
-	unsigned long nr_reclaimed;
-	unsigned long flags = 0;
-	unsigned long long size;
-	int ret = 0;
-
-retry:
-	if (consume_stock(memcg, nr_pages))
-		goto done;
-
-	size = batch * PAGE_SIZE;
-	if (!res_counter_charge(&memcg->res, size, &fail_res)) {
-		if (!do_swap_account)
-			goto done_restock;
-		if (!res_counter_charge(&memcg->memsw, size, &fail_res))
-			goto done_restock;
-		res_counter_uncharge(&memcg->res, size);
-		mem_over_limit = mem_cgroup_from_res_counter(fail_res, memsw);
-		flags |= MEM_CGROUP_RECLAIM_NOSWAP;
-	} else
-		mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
-
-	if (batch > nr_pages) {
-		batch = nr_pages;
-		goto retry;
-	}
-
-	/*
-	 * Unlike in global OOM situations, memcg is not in a physical
-	 * memory shortage.  Allow dying and OOM-killed tasks to
-	 * bypass the last charges so that they can exit quickly and
-	 * free their memory.
-	 */
-	if (unlikely(test_thread_flag(TIF_MEMDIE) ||
-		     fatal_signal_pending(current)))
-		goto bypass;
-
-	if (unlikely(task_in_memcg_oom(current)))
-		goto nomem;
-
-	if (!(gfp_mask & __GFP_WAIT))
-		goto nomem;
-
-	if (gfp_mask & __GFP_NORETRY)
-		goto nomem;
-
-	nr_reclaimed = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
-
-	if (mem_cgroup_margin(mem_over_limit) >= batch)
-		goto retry;
-	/*
-	 * Even though the limit is exceeded at this point, reclaim
-	 * may have been able to free some pages.  Retry the charge
-	 * before killing the task.
-	 *
-	 * Only for regular pages, though: huge pages are rather
-	 * unlikely to succeed so close to the limit, and we fall back
-	 * to regular pages anyway in case of failure.
-	 */
-	if (nr_reclaimed && batch <= (1 << PAGE_ALLOC_COSTLY_ORDER))
-		goto retry;
-	/*
-	 * At task move, charge accounts can be doubly counted. So, it's
-	 * better to wait until the end of task_move if something is going on.
-	 */
-	if (mem_cgroup_wait_acct_move(mem_over_limit))
-		goto retry;
-
-	if (nr_retries--)
-		goto retry;
-
-	if (gfp_mask & __GFP_NOFAIL)
-		goto bypass;
-
-	if (fatal_signal_pending(current))
-		goto bypass;
-
-	if (!oom)
-		goto nomem;
-
-	mem_cgroup_oom(mem_over_limit, gfp_mask, get_order(batch));
-nomem:
-	if (!(gfp_mask & __GFP_NOFAIL))
-		return -ENOMEM;
-bypass:
-	memcg = root_mem_cgroup;
-	ret = -EINTR;
-	goto retry;
-
-done_restock:
-	if (batch > nr_pages)
-		refill_stock(memcg, batch - nr_pages);
-done:
-	return ret;
-}
-
-/**
- * mem_cgroup_try_charge_mm - try charging a mm
- * @mm: mm_struct to charge
- * @nr_pages: number of pages to charge
- * @oom: trigger OOM if reclaim fails
- *
- * Returns the charged mem_cgroup associated with the given mm_struct or
- * NULL the charge failed.
- */
-static struct mem_cgroup *mem_cgroup_try_charge_mm(struct mm_struct *mm,
-				 gfp_t gfp_mask,
-				 unsigned int nr_pages,
-				 bool oom)
-
-{
-	struct mem_cgroup *memcg;
-	int ret;
-
-	memcg = get_mem_cgroup_from_mm(mm);
-	ret = mem_cgroup_try_charge(memcg, gfp_mask, nr_pages, oom);
-	css_put(&memcg->css);
-	if (ret == -EINTR)
-		memcg = root_mem_cgroup;
-	else if (ret)
-		memcg = NULL;
-
-	return memcg;
-}
-
-/*
- * Somemtimes we have to undo a charge we got by try_charge().
- * This function is for that and do uncharge, put css's refcnt.
- * gotten by try_charge().
- */
-static void __mem_cgroup_cancel_charge(struct mem_cgroup *memcg,
-				       unsigned int nr_pages)
-{
-	unsigned long bytes = nr_pages * PAGE_SIZE;
-
-	res_counter_uncharge(&memcg->res, bytes);
-	if (do_swap_account)
-		res_counter_uncharge(&memcg->memsw, bytes);
-}
-
 /*
  * Cancel chrages in this cgroup....doesn't propagate to parent cgroup.
  * This is useful when moving usage to parent cgroup.
@@ -2788,69 +2631,6 @@ struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 	return memcg;
 }
 
-static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg,
-				       struct page *page,
-				       unsigned int nr_pages,
-				       enum charge_type ctype,
-				       bool lrucare)
-{
-	struct page_cgroup *pc = lookup_page_cgroup(page);
-	struct zone *uninitialized_var(zone);
-	struct lruvec *lruvec;
-	bool was_on_lru = false;
-	bool anon;
-
-	lock_page_cgroup(pc);
-	VM_BUG_ON_PAGE(PageCgroupUsed(pc), page);
-	/*
-	 * we don't need page_cgroup_lock about tail pages, becase they are not
-	 * accessed by any other context at this point.
-	 */
-
-	/*
-	 * In some cases, SwapCache and FUSE(splice_buf->radixtree), the page
-	 * may already be on some other mem_cgroup's LRU.  Take care of it.
-	 */
-	if (lrucare) {
-		zone = page_zone(page);
-		spin_lock_irq(&zone->lru_lock);
-		if (PageLRU(page)) {
-			lruvec = mem_cgroup_zone_lruvec(zone, pc->mem_cgroup);
-			ClearPageLRU(page);
-			del_page_from_lru_list(page, lruvec, page_lru(page));
-			was_on_lru = true;
-		}
-	}
-
-	pc->mem_cgroup = memcg;
-	SetPageCgroupUsed(pc);
-
-	if (lrucare) {
-		if (was_on_lru) {
-			lruvec = mem_cgroup_zone_lruvec(zone, pc->mem_cgroup);
-			VM_BUG_ON_PAGE(PageLRU(page), page);
-			SetPageLRU(page);
-			add_page_to_lru_list(page, lruvec, page_lru(page));
-		}
-		spin_unlock_irq(&zone->lru_lock);
-	}
-
-	if (ctype == MEM_CGROUP_CHARGE_TYPE_ANON)
-		anon = true;
-	else
-		anon = false;
-
-	mem_cgroup_charge_statistics(memcg, page, anon, nr_pages);
-	unlock_page_cgroup(pc);
-
-	/*
-	 * "charge_statistics" updated event counter. Then, check it.
-	 * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
-	 * if they exceeds softlimit.
-	 */
-	memcg_check_events(memcg, page);
-}
-
 static DEFINE_MUTEX(set_limit_mutex);
 
 #ifdef CONFIG_MEMCG_KMEM
@@ -2895,6 +2675,9 @@ static int mem_cgroup_slabinfo_read(struct seq_file *m, void *v)
 }
 #endif
 
+static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
+		      unsigned int nr_pages, bool oom);
+
 static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
 {
 	struct res_counter *fail_res;
@@ -2904,22 +2687,21 @@ static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
 	if (ret)
 		return ret;
 
-	ret = mem_cgroup_try_charge(memcg, gfp, size >> PAGE_SHIFT,
-				    oom_gfp_allowed(gfp));
+	ret = try_charge(memcg, gfp, size >> PAGE_SHIFT, oom_gfp_allowed(gfp));
 	if (ret == -EINTR)  {
 		/*
-		 * mem_cgroup_try_charge() chosed to bypass to root due to
-		 * OOM kill or fatal signal.  Since our only options are to
-		 * either fail the allocation or charge it to this cgroup, do
-		 * it as a temporary condition. But we can't fail. From a
-		 * kmem/slab perspective, the cache has already been selected,
-		 * by mem_cgroup_kmem_get_cache(), so it is too late to change
+		 * try_charge() chose to bypass to root due to OOM kill or
+		 * fatal signal.  Since our only options are to either fail
+		 * the allocation or charge it to this cgroup, do it as a
+		 * temporary condition. But we can't fail. From a kmem/slab
+		 * perspective, the cache has already been selected, by
+		 * mem_cgroup_kmem_get_cache(), so it is too late to change
 		 * our minds.
 		 *
 		 * This condition will only trigger if the task entered
-		 * memcg_charge_kmem in a sane state, but was OOM-killed during
-		 * mem_cgroup_try_charge() above. Tasks that were already
-		 * dying when the allocation triggers should have been already
+		 * memcg_charge_kmem in a sane state, but was OOM-killed
+		 * during try_charge() above. Tasks that were already dying
+		 * when the allocation triggers should have been already
 		 * directed to the root cgroup in memcontrol.h
 		 */
 		res_counter_charge_nofail(&memcg->res, size, &fail_res);
@@ -3728,193 +3510,17 @@ static int mem_cgroup_move_parent(struct page *page,
 	}
 
 	ret = mem_cgroup_move_account(page, nr_pages,
-				pc, child, parent);
-	if (!ret)
-		__mem_cgroup_cancel_local_charge(child, nr_pages);
-
-	if (nr_pages > 1)
-		compound_unlock_irqrestore(page, flags);
-	putback_lru_page(page);
-put:
-	put_page(page);
-out:
-	return ret;
-}
-
-int mem_cgroup_charge_anon(struct page *page,
-			      struct mm_struct *mm, gfp_t gfp_mask)
-{
-	unsigned int nr_pages = 1;
-	struct mem_cgroup *memcg;
-	bool oom = true;
-
-	if (mem_cgroup_disabled())
-		return 0;
-
-	VM_BUG_ON_PAGE(page_mapped(page), page);
-	VM_BUG_ON_PAGE(page->mapping && !PageAnon(page), page);
-	VM_BUG_ON(!mm);
-
-	if (PageTransHuge(page)) {
-		nr_pages <<= compound_order(page);
-		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
-		/*
-		 * Never OOM-kill a process for a huge page.  The
-		 * fault handler will fall back to regular pages.
-		 */
-		oom = false;
-	}
-
-	memcg = mem_cgroup_try_charge_mm(mm, gfp_mask, nr_pages, oom);
-	if (!memcg)
-		return -ENOMEM;
-	__mem_cgroup_commit_charge(memcg, page, nr_pages,
-				   MEM_CGROUP_CHARGE_TYPE_ANON, false);
-	return 0;
-}
-
-/*
- * While swap-in, try_charge -> commit or cancel, the page is locked.
- * And when try_charge() successfully returns, one refcnt to memcg without
- * struct page_cgroup is acquired. This refcnt will be consumed by
- * "commit()" or removed by "cancel()"
- */
-static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm,
-					  struct page *page,
-					  gfp_t mask,
-					  struct mem_cgroup **memcgp)
-{
-	struct mem_cgroup *memcg = NULL;
-	struct page_cgroup *pc;
-	int ret;
-
-	pc = lookup_page_cgroup(page);
-	/*
-	 * Every swap fault against a single page tries to charge the
-	 * page, bail as early as possible.  shmem_unuse() encounters
-	 * already charged pages, too.  The USED bit is protected by
-	 * the page lock, which serializes swap cache removal, which
-	 * in turn serializes uncharging.
-	 */
-	if (PageCgroupUsed(pc))
-		goto out;
-	if (do_swap_account)
-		memcg = try_get_mem_cgroup_from_page(page);
-	if (!memcg)
-		memcg = get_mem_cgroup_from_mm(mm);
-	ret = mem_cgroup_try_charge(memcg, mask, 1, true);
-	css_put(&memcg->css);
-	if (ret == -EINTR)
-		memcg = root_mem_cgroup;
-	else if (ret)
-		return ret;
-out:
-	*memcgp = memcg;
-	return 0;
-}
-
-int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page,
-				 gfp_t gfp_mask, struct mem_cgroup **memcgp)
-{
-	if (mem_cgroup_disabled()) {
-		*memcgp = NULL;
-		return 0;
-	}
-	/*
-	 * A racing thread's fault, or swapoff, may have already
-	 * updated the pte, and even removed page from swap cache: in
-	 * those cases unuse_pte()'s pte_same() test will fail; but
-	 * there's also a KSM case which does need to charge the page.
-	 */
-	if (!PageSwapCache(page)) {
-		struct mem_cgroup *memcg;
-
-		memcg = mem_cgroup_try_charge_mm(mm, gfp_mask, 1, true);
-		if (!memcg)
-			return -ENOMEM;
-		*memcgp = memcg;
-		return 0;
-	}
-	return __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, memcgp);
-}
-
-void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg)
-{
-	if (mem_cgroup_disabled())
-		return;
-	if (!memcg)
-		return;
-	__mem_cgroup_cancel_charge(memcg, 1);
-}
-
-static void
-__mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *memcg,
-					enum charge_type ctype)
-{
-	if (mem_cgroup_disabled())
-		return;
-	if (!memcg)
-		return;
-
-	__mem_cgroup_commit_charge(memcg, page, 1, ctype, true);
-	/*
-	 * Now swap is on-memory. This means this page may be
-	 * counted both as mem and swap....double count.
-	 * Fix it by uncharging from memsw. Basically, this SwapCache is stable
-	 * under lock_page(). But in do_swap_page()::memory.c, reuse_swap_page()
-	 * may call delete_from_swap_cache() before reach here.
-	 */
-	if (do_swap_account && PageSwapCache(page)) {
-		swp_entry_t ent = {.val = page_private(page)};
-		mem_cgroup_uncharge_swap(ent);
-	}
-}
-
-void mem_cgroup_commit_charge_swapin(struct page *page,
-				     struct mem_cgroup *memcg)
-{
-	__mem_cgroup_commit_charge_swapin(page, memcg,
-					  MEM_CGROUP_CHARGE_TYPE_ANON);
-}
-
-int mem_cgroup_charge_file(struct page *page, struct mm_struct *mm,
-				gfp_t gfp_mask)
-{
-	enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE;
-	struct mem_cgroup *memcg;
-	int ret;
-
-	if (mem_cgroup_disabled())
-		return 0;
-	if (PageCompound(page))
-		return 0;
-
-	if (PageSwapCache(page)) { /* shmem */
-		ret = __mem_cgroup_try_charge_swapin(mm, page,
-						     gfp_mask, &memcg);
-		if (ret)
-			return ret;
-		__mem_cgroup_commit_charge_swapin(page, memcg, type);
-		return 0;
-	}
-
-	/*
-	 * Page cache insertions can happen without an actual mm
-	 * context, e.g. during disk probing on boot.
-	 */
-	if (unlikely(!mm)) {
-		memcg = root_mem_cgroup;
-		ret = mem_cgroup_try_charge(memcg, gfp_mask, 1, true);
-		VM_BUG_ON(ret == -EINTR);
-		if (ret)
-			return ret;
-	} else {
-		memcg = mem_cgroup_try_charge_mm(mm, gfp_mask, 1, true);
-		if (!memcg)
-			return -ENOMEM;
-	}
-	__mem_cgroup_commit_charge(memcg, page, 1, type, false);
-	return 0;
+				pc, child, parent);
+	if (!ret)
+		__mem_cgroup_cancel_local_charge(child, nr_pages);
+
+	if (nr_pages > 1)
+		compound_unlock_irqrestore(page, flags);
+	putback_lru_page(page);
+put:
+	put_page(page);
+out:
+	return ret;
 }
 
 static void mem_cgroup_do_uncharge(struct mem_cgroup *memcg,
@@ -4253,6 +3859,9 @@ static inline int mem_cgroup_move_swap_account(swp_entry_t entry,
 }
 #endif
 
+static void commit_charge(struct page *page, struct mem_cgroup *memcg,
+			  unsigned int nr_pages, bool anon, bool lrucare);
+
 /*
  * Before starting migration, account PAGE_SIZE to mem_cgroup that the old
  * page belongs to.
@@ -4263,7 +3872,6 @@ void mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
 	struct mem_cgroup *memcg = NULL;
 	unsigned int nr_pages = 1;
 	struct page_cgroup *pc;
-	enum charge_type ctype;
 
 	*memcgp = NULL;
 
@@ -4325,16 +3933,12 @@ void mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
 	 * page. In the case new page is migrated but not remapped, new page's
 	 * mapcount will be finally 0 and we call uncharge in end_migration().
 	 */
-	if (PageAnon(page))
-		ctype = MEM_CGROUP_CHARGE_TYPE_ANON;
-	else
-		ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
 	/*
 	 * The page is committed to the memcg, but it's not actually
 	 * charged to the res_counter since we plan on replacing the
 	 * old one and only one page is going to be left afterwards.
 	 */
-	__mem_cgroup_commit_charge(memcg, newpage, nr_pages, ctype, false);
+	commit_charge(newpage, memcg, nr_pages, PageAnon(page), false);
 }
 
 /* remove redundant charge if migration failed*/
@@ -4393,7 +3997,6 @@ void mem_cgroup_replace_page_cache(struct page *oldpage,
 {
 	struct mem_cgroup *memcg = NULL;
 	struct page_cgroup *pc;
-	enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE;
 
 	if (mem_cgroup_disabled())
 		return;
@@ -4419,7 +4022,7 @@ void mem_cgroup_replace_page_cache(struct page *oldpage,
 	 * the newpage may be on LRU(or pagevec for LRU) already. We lock
 	 * LRU while we overwrite pc->mem_cgroup.
 	 */
-	__mem_cgroup_commit_charge(memcg, newpage, 1, type, true);
+	commit_charge(newpage, memcg, 1, false, true);
 }
 
 #ifdef CONFIG_DEBUG_VM
@@ -6434,6 +6037,7 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
 #ifdef CONFIG_MMU
 /* Handlers for move charge at task migration. */
 #define PRECHARGE_COUNT_AT_ONCE	256
+static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages);
 static int mem_cgroup_do_precharge(unsigned long count)
 {
 	int ret = 0;
@@ -6470,9 +6074,9 @@ one_by_one:
 			batch_count = PRECHARGE_COUNT_AT_ONCE;
 			cond_resched();
 		}
-		ret = mem_cgroup_try_charge(memcg, GFP_KERNEL, 1, false);
+		ret = try_charge(memcg, GFP_KERNEL, 1, false);
 		if (ret == -EINTR)
-			__mem_cgroup_cancel_charge(root_mem_cgroup, 1);
+			cancel_charge(root_mem_cgroup, 1);
 		if (ret)
 			return ret;
 		mc.precharge++;
@@ -6736,7 +6340,7 @@ static void __mem_cgroup_clear_mc(void)
 
 	/* we must uncharge all the leftover precharges from mc.to */
 	if (mc.precharge) {
-		__mem_cgroup_cancel_charge(mc.to, mc.precharge);
+		cancel_charge(mc.to, mc.precharge);
 		mc.precharge = 0;
 	}
 	/*
@@ -6744,7 +6348,7 @@ static void __mem_cgroup_clear_mc(void)
 	 * we must uncharge here.
 	 */
 	if (mc.moved_charge) {
-		__mem_cgroup_cancel_charge(mc.from, mc.moved_charge);
+		cancel_charge(mc.from, mc.moved_charge);
 		mc.moved_charge = 0;
 	}
 	/* we must fixup refcnts and charges */
@@ -7070,6 +6674,319 @@ static void __init enable_swap_cgroup(void)
 }
 #endif
 
+static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
+		      unsigned int nr_pages, bool oom)
+{
+	unsigned int batch = max(CHARGE_BATCH, nr_pages);
+	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
+	struct mem_cgroup *mem_over_limit;
+	struct res_counter *fail_res;
+	unsigned long nr_reclaimed;
+	unsigned long flags = 0;
+	unsigned long long size;
+	int ret = 0;
+
+retry:
+	if (consume_stock(memcg, nr_pages))
+		goto done;
+
+	size = batch * PAGE_SIZE;
+	if (!res_counter_charge(&memcg->res, size, &fail_res)) {
+		if (!do_swap_account)
+			goto done_restock;
+		if (!res_counter_charge(&memcg->memsw, size, &fail_res))
+			goto done_restock;
+		res_counter_uncharge(&memcg->res, size);
+		mem_over_limit = mem_cgroup_from_res_counter(fail_res, memsw);
+		flags |= MEM_CGROUP_RECLAIM_NOSWAP;
+	} else
+		mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
+
+	if (batch > nr_pages) {
+		batch = nr_pages;
+		goto retry;
+	}
+
+	/*
+	 * Unlike in global OOM situations, memcg is not in a physical
+	 * memory shortage.  Allow dying and OOM-killed tasks to
+	 * bypass the last charges so that they can exit quickly and
+	 * free their memory.
+	 */
+	if (unlikely(test_thread_flag(TIF_MEMDIE) ||
+		     fatal_signal_pending(current)))
+		goto bypass;
+
+	if (unlikely(task_in_memcg_oom(current)))
+		goto nomem;
+
+	if (!(gfp_mask & __GFP_WAIT))
+		goto nomem;
+
+	if (gfp_mask & __GFP_NORETRY)
+		goto nomem;
+
+	nr_reclaimed = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
+
+	if (mem_cgroup_margin(mem_over_limit) >= batch)
+		goto retry;
+	/*
+	 * Even though the limit is exceeded at this point, reclaim
+	 * may have been able to free some pages.  Retry the charge
+	 * before killing the task.
+	 *
+	 * Only for regular pages, though: huge pages are rather
+	 * unlikely to succeed so close to the limit, and we fall back
+	 * to regular pages anyway in case of failure.
+	 */
+	if (nr_reclaimed && batch <= (1 << PAGE_ALLOC_COSTLY_ORDER))
+		goto retry;
+	/*
+	 * At task move, charge accounts can be doubly counted. So, it's
+	 * better to wait until the end of task_move if something is going on.
+	 */
+	if (mem_cgroup_wait_acct_move(mem_over_limit))
+		goto retry;
+
+	if (nr_retries--)
+		goto retry;
+
+	if (gfp_mask & __GFP_NOFAIL)
+		goto bypass;
+
+	if (fatal_signal_pending(current))
+		goto bypass;
+
+	if (!oom)
+		goto nomem;
+
+	mem_cgroup_oom(mem_over_limit, gfp_mask, get_order(batch));
+nomem:
+	if (!(gfp_mask & __GFP_NOFAIL))
+		return -ENOMEM;
+bypass:
+	memcg = root_mem_cgroup;
+	ret = -EINTR;
+	goto retry;
+
+done_restock:
+	if (batch > nr_pages)
+		refill_stock(memcg, batch - nr_pages);
+done:
+	return ret;
+}
+
+/**
+ * mem_cgroup_try_charge - try charging a page
+ * @page: page to charge
+ * @mm: mm context of the victim
+ * @gfp_mask: reclaim mode
+ * @memcgp: charged memcg return
+ *
+ * Try to charge @page to the memcg that @mm belongs to, reclaiming
+ * pages according to @gfp_mask if necessary.
+ *
+ * Returns 0 on success, with *@memcgp pointing to the charged memcg.
+ * Otherwise, an error code is returned.
+ *
+ * After page->mapping has been set up, the caller must finalize the
+ * charge with mem_cgroup_commit_charge().  Or abort the transaction
+ * with mem_cgroup_cancel_charge() in case page instantiation fails.
+ */
+int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
+			  gfp_t gfp_mask, struct mem_cgroup **memcgp)
+{
+	struct mem_cgroup *memcg = NULL;
+	unsigned int nr_pages = 1;
+	bool oom = true;
+	int ret = 0;
+
+	if (mem_cgroup_disabled())
+		goto out;
+
+	if (PageSwapCache(page)) {
+		struct page_cgroup *pc = lookup_page_cgroup(page);
+		/*
+		 * Every swap fault against a single page tries to charge the
+		 * page, bail as early as possible.  shmem_unuse() encounters
+		 * already charged pages, too.  The USED bit is protected by
+		 * the page lock, which serializes swap cache removal, which
+		 * in turn serializes uncharging.
+		 */
+		if (PageCgroupUsed(pc))
+			goto out;
+	}
+
+	if (PageTransHuge(page)) {
+		nr_pages <<= compound_order(page);
+		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
+		/*
+		 * Never OOM-kill a process for a huge page.  The
+		 * fault handler will fall back to regular pages.
+		 */
+		oom = false;
+	}
+
+	if (do_swap_account && PageSwapCache(page))
+		memcg = try_get_mem_cgroup_from_page(page);
+	if (!memcg) {
+		/*
+		 * Page cache insertions can happen without an actual
+		 * mm context, e.g. during disk probing on boot.
+		 */
+		if (unlikely(!mm)) {
+			memcg = root_mem_cgroup;
+			css_get(&memcg->css);
+		} else
+			memcg = get_mem_cgroup_from_mm(mm);
+	}
+
+	ret = try_charge(memcg, gfp_mask, nr_pages, oom);
+
+	css_put(&memcg->css);
+
+	if (ret == -EINTR) {
+		memcg = root_mem_cgroup;
+		ret = 0;
+	}
+out:
+	*memcgp = memcg;
+	return ret;
+}
+
+static void commit_charge(struct page *page, struct mem_cgroup *memcg,
+			  unsigned int nr_pages, bool anon, bool lrucare)
+{
+	struct page_cgroup *pc = lookup_page_cgroup(page);
+	struct zone *uninitialized_var(zone);
+	bool was_on_lru = false;
+	struct lruvec *lruvec;
+
+	lock_page_cgroup(pc);
+
+	VM_BUG_ON_PAGE(PageCgroupUsed(pc), page);
+	VM_BUG_ON_PAGE(PageLRU(page) && !lrucare, page);
+
+	if (lrucare) {
+		zone = page_zone(page);
+		spin_lock_irq(&zone->lru_lock);
+		if (PageLRU(page)) {
+			lruvec = mem_cgroup_zone_lruvec(zone, pc->mem_cgroup);
+			ClearPageLRU(page);
+			del_page_from_lru_list(page, lruvec, page_lru(page));
+			was_on_lru = true;
+		}
+	}
+
+	pc->mem_cgroup = memcg;
+	SetPageCgroupUsed(pc);
+
+	if (lrucare) {
+		if (was_on_lru) {
+			lruvec = mem_cgroup_zone_lruvec(zone, pc->mem_cgroup);
+			VM_BUG_ON_PAGE(PageLRU(page), page);
+			SetPageLRU(page);
+			add_page_to_lru_list(page, lruvec, page_lru(page));
+		}
+		spin_unlock_irq(&zone->lru_lock);
+	}
+
+	mem_cgroup_charge_statistics(memcg, page, anon, nr_pages);
+	unlock_page_cgroup(pc);
+
+	memcg_check_events(memcg, page);
+}
+
+/**
+ * mem_cgroup_commit_charge - commit a page charge
+ * @page: page to charge
+ * @memcg: memcg to charge the page to
+ * @lrucare: page might be on LRU already
+ *
+ * Finalize a charge transaction started by mem_cgroup_try_charge(),
+ * after page->mapping has been set up.  This must happen atomically
+ * as part of the page instantiation, i.e. under the page table lock
+ * for anonymous pages, under the page lock for page and swap cache.
+ *
+ * In addition, the page must not be on the LRU during the commit, to
+ * prevent racing with task migration.  If it might be, use @lrucare.
+ *
+ * Use mem_cgroup_cancel_charge() to cancel the transaction instead.
+ */
+void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
+			      bool lrucare)
+{
+	unsigned int nr_pages = 1;
+
+	VM_BUG_ON_PAGE(!page->mapping, page);
+	VM_BUG_ON_PAGE(PageLRU(page) && !lrucare, page);
+
+	if (mem_cgroup_disabled())
+		return;
+	/*
+	 * Swap faults will attempt to charge the same page multiple
+	 * times.  But reuse_swap_page() might have removed the page
+	 * from swapcache already, so we can't check PageSwapCache().
+	 */
+	if (!memcg)
+		return;
+
+	if (PageTransHuge(page)) {
+		nr_pages <<= compound_order(page);
+		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
+	}
+
+	commit_charge(page, memcg, nr_pages, PageAnon(page), lrucare);
+
+	if (do_swap_account && PageSwapCache(page)) {
+		swp_entry_t entry = { .val = page_private(page) };
+		/*
+		 * The swap entry might not get freed for a long time,
+		 * let's not wait for it.  The page already received a
+		 * memory+swap charge, drop the swap entry duplicate.
+		 */
+		mem_cgroup_uncharge_swap(entry);
+	}
+}
+
+static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
+{
+	unsigned long bytes = nr_pages * PAGE_SIZE;
+
+	res_counter_uncharge(&memcg->res, bytes);
+	if (do_swap_account)
+		res_counter_uncharge(&memcg->memsw, bytes);
+}
+
+/**
+ * mem_cgroup_cancel_charge - cancel a page charge
+ * @page: page to charge
+ * @memcg: memcg to charge the page to
+ *
+ * Cancel a charge transaction started by mem_cgroup_try_charge().
+ */
+void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg)
+{
+	unsigned int nr_pages = 1;
+
+	if (mem_cgroup_disabled())
+		return;
+	/*
+	 * Swap faults will attempt to charge the same page multiple
+	 * times.  But reuse_swap_page() might have removed the page
+	 * from swapcache already, so we can't check PageSwapCache().
+	 */
+	if (!memcg)
+		return;
+
+	if (PageTransHuge(page)) {
+		nr_pages <<= compound_order(page);
+		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
+	}
+
+	cancel_charge(memcg, nr_pages);
+}
+
 /*
  * subsys_initcall() for memory controller.
  *
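
For illustration, the caller-side pattern described by the try/commit/cancel
kernel-doc above looks roughly like the sketch below; instantiate_page() is a
made-up placeholder for the real page table or page cache insertion, only the
mem_cgroup_*() and lru_cache_add_active_or_unevictable() calls are the
interface introduced here:

#include <linux/memcontrol.h>
#include <linux/mm.h>
#include <linux/swap.h>

/* Sketch only: expected call order, not code added by this series. */
static int charge_new_page(struct page *page, struct mm_struct *mm,
                           struct vm_area_struct *vma, gfp_t gfp)
{
        struct mem_cgroup *memcg;
        int err;

        /* Reserve the charge before the page is visible to anybody. */
        err = mem_cgroup_try_charge(page, mm, gfp, &memcg);
        if (err)
                return err;

        err = instantiate_page(page);  /* hypothetical: sets up page->mapping */
        if (err) {
                /* Instantiation failed, give the reservation back. */
                mem_cgroup_cancel_charge(page, memcg);
                return err;
        }

        /*
         * page->mapping is set up; finalize the charge and only then
         * put the page on the LRU, so the commit can't race with
         * charge moving during task migration.
         */
        mem_cgroup_commit_charge(page, memcg, false);
        lru_cache_add_active_or_unevictable(page, vma);
        return 0;
}
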
diff --git a/mm/memory.c b/mm/memory.c
index d0f0bef3be48..36af46a50fad 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2673,6 +2673,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct page *dirty_page = NULL;
 	unsigned long mmun_start = 0;	/* For mmu_notifiers */
 	unsigned long mmun_end = 0;	/* For mmu_notifiers */
+	struct mem_cgroup *memcg;
 
 	old_page = vm_normal_page(vma, address, orig_pte);
 	if (!old_page) {
@@ -2828,7 +2829,7 @@ gotten:
 	}
 	__SetPageUptodate(new_page);
 
-	if (mem_cgroup_charge_anon(new_page, mm, GFP_KERNEL))
+	if (mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, &memcg))
 		goto oom_free_new;
 
 	mmun_start  = address & PAGE_MASK;
@@ -2858,6 +2859,8 @@ gotten:
 		 */
 		ptep_clear_flush(vma, address, page_table);
 		page_add_new_anon_rmap(new_page, vma, address);
+		mem_cgroup_commit_charge(new_page, memcg, false);
+		lru_cache_add_active_or_unevictable(new_page, vma);
 		/*
 		 * We call the notify macro here because, when using secondary
 		 * mmu page tables (such as kvm shadow page tables), we want the
@@ -2895,7 +2898,7 @@ gotten:
 		new_page = old_page;
 		ret |= VM_FAULT_WRITE;
 	} else
-		mem_cgroup_uncharge_page(new_page);
+		mem_cgroup_cancel_charge(new_page, memcg);
 
 	if (new_page)
 		page_cache_release(new_page);
@@ -3031,10 +3034,10 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	spinlock_t *ptl;
 	struct page *page, *swapcache;
+	struct mem_cgroup *memcg;
 	swp_entry_t entry;
 	pte_t pte;
 	int locked;
-	struct mem_cgroup *ptr;
 	int exclusive = 0;
 	int ret = 0;
 
@@ -3110,7 +3113,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		goto out_page;
 	}
 
-	if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr)) {
+	if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg)) {
 		ret = VM_FAULT_OOM;
 		goto out_page;
 	}
@@ -3135,10 +3138,6 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * while the page is counted on swap but not yet in mapcount i.e.
 	 * before page_add_anon_rmap() and swap_free(); try_to_free_swap()
 	 * must be called after the swap_free(), or it will never succeed.
-	 * Because delete_from_swap_page() may be called by reuse_swap_page(),
-	 * mem_cgroup_commit_charge_swapin() may not be able to find swp_entry
-	 * in page->private. In this case, a record in swap_cgroup  is silently
-	 * discarded at swap_free().
 	 */
 
 	inc_mm_counter_fast(mm, MM_ANONPAGES);
@@ -3154,12 +3153,14 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (pte_swp_soft_dirty(orig_pte))
 		pte = pte_mksoft_dirty(pte);
 	set_pte_at(mm, address, page_table, pte);
-	if (page == swapcache)
+	if (page == swapcache) {
 		do_page_add_anon_rmap(page, vma, address, exclusive);
-	else /* ksm created a completely new copy */
+		mem_cgroup_commit_charge(page, memcg, true);
+	} else { /* ksm created a completely new copy */
 		page_add_new_anon_rmap(page, vma, address);
-	/* It's better to call commit-charge after rmap is established */
-	mem_cgroup_commit_charge_swapin(page, ptr);
+		mem_cgroup_commit_charge(page, memcg, false);
+		lru_cache_add_active_or_unevictable(page, vma);
+	}
 
 	swap_free(entry);
 	if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
@@ -3192,7 +3193,7 @@ unlock:
 out:
 	return ret;
 out_nomap:
-	mem_cgroup_cancel_charge_swapin(ptr);
+	mem_cgroup_cancel_charge(page, memcg);
 	pte_unmap_unlock(page_table, ptl);
 out_page:
 	unlock_page(page);
@@ -3248,6 +3249,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		unsigned long address, pte_t *page_table, pmd_t *pmd,
 		unsigned int flags)
 {
+	struct mem_cgroup *memcg;
 	struct page *page;
 	spinlock_t *ptl;
 	pte_t entry;
@@ -3281,7 +3283,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	 */
 	__SetPageUptodate(page);
 
-	if (mem_cgroup_charge_anon(page, mm, GFP_KERNEL))
+	if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg))
 		goto oom_free_page;
 
 	entry = mk_pte(page, vma->vm_page_prot);
@@ -3294,6 +3296,8 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	inc_mm_counter_fast(mm, MM_ANONPAGES);
 	page_add_new_anon_rmap(page, vma, address);
+	mem_cgroup_commit_charge(page, memcg, false);
+	lru_cache_add_active_or_unevictable(page, vma);
 setpte:
 	set_pte_at(mm, address, page_table, entry);
 
@@ -3303,7 +3307,7 @@ unlock:
 	pte_unmap_unlock(page_table, ptl);
 	return 0;
 release:
-	mem_cgroup_uncharge_page(page);
+	mem_cgroup_cancel_charge(page, memcg);
 	page_cache_release(page);
 	goto unlock;
 oom_free_page:
@@ -3526,6 +3530,7 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
 {
 	struct page *fault_page, *new_page;
+	struct mem_cgroup *memcg;
 	spinlock_t *ptl;
 	pte_t *pte;
 	int ret;
@@ -3537,7 +3542,7 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (!new_page)
 		return VM_FAULT_OOM;
 
-	if (mem_cgroup_charge_anon(new_page, mm, GFP_KERNEL)) {
+	if (mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, &memcg)) {
 		page_cache_release(new_page);
 		return VM_FAULT_OOM;
 	}
@@ -3557,12 +3562,14 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		goto uncharge_out;
 	}
 	do_set_pte(vma, address, new_page, pte, true, true);
+	mem_cgroup_commit_charge(new_page, memcg, false);
+	lru_cache_add_active_or_unevictable(new_page, vma);
 	pte_unmap_unlock(pte, ptl);
 	unlock_page(fault_page);
 	page_cache_release(fault_page);
 	return ret;
 uncharge_out:
-	mem_cgroup_uncharge_page(new_page);
+	mem_cgroup_cancel_charge(new_page, memcg);
 	page_cache_release(new_page);
 	return ret;
 }
diff --git a/mm/migrate.c b/mm/migrate.c
index bed48809e5d0..a88fabd71f87 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1853,6 +1853,7 @@ fail_putback:
 	 */
 	flush_cache_range(vma, mmun_start, mmun_end);
 	page_add_new_anon_rmap(new_page, vma, mmun_start);
+	lru_cache_add_active_or_unevictable(new_page, vma);
 	pmdp_clear_flush(vma, mmun_start, pmd);
 	set_pmd_at(mm, mmun_start, pmd, entry);
 	flush_tlb_range(vma, mmun_start, mmun_end);
diff --git a/mm/rmap.c b/mm/rmap.c
index 9c3e77396d1a..6b6fe5f4ece1 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1024,11 +1024,6 @@ void page_add_new_anon_rmap(struct page *page,
 	__mod_zone_page_state(page_zone(page), NR_ANON_PAGES,
 			hpage_nr_pages(page));
 	__page_set_anon_rmap(page, vma, address, 1);
-	if (!mlocked_vma_newpage(vma, page)) {
-		SetPageActive(page);
-		lru_cache_add(page);
-	} else
-		add_page_to_unevictable_list(page);
 }
 
 /**
diff --git a/mm/shmem.c b/mm/shmem.c
index 8f1a95406bae..f8637acc2dad 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -668,6 +668,7 @@ int shmem_unuse(swp_entry_t swap, struct page *page)
 {
 	struct list_head *this, *next;
 	struct shmem_inode_info *info;
+	struct mem_cgroup *memcg;
 	int found = 0;
 	int error = 0;
 
@@ -683,7 +684,7 @@ int shmem_unuse(swp_entry_t swap, struct page *page)
 	 * the shmem_swaplist_mutex which might hold up shmem_writepage().
 	 * Charged back to the user (not to caller) when swap account is used.
 	 */
-	error = mem_cgroup_charge_file(page, current->mm, GFP_KERNEL);
+	error = mem_cgroup_try_charge(page, current->mm, GFP_KERNEL, &memcg);
 	if (error)
 		goto out;
 	/* No radix_tree_preload: swap entry keeps a place for page in tree */
@@ -701,8 +702,11 @@ int shmem_unuse(swp_entry_t swap, struct page *page)
 	}
 	mutex_unlock(&shmem_swaplist_mutex);
 
-	if (found < 0)
+	if (found < 0) {
 		error = found;
+		mem_cgroup_cancel_charge(page, memcg);
+	} else
+		mem_cgroup_commit_charge(page, memcg, true);
 out:
 	unlock_page(page);
 	page_cache_release(page);
@@ -1005,6 +1009,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 	struct address_space *mapping = inode->i_mapping;
 	struct shmem_inode_info *info;
 	struct shmem_sb_info *sbinfo;
+	struct mem_cgroup *memcg;
 	struct page *page;
 	swp_entry_t swap;
 	int error;
@@ -1080,8 +1085,7 @@ repeat:
 				goto failed;
 		}
 
-		error = mem_cgroup_charge_file(page, current->mm,
-						gfp & GFP_RECLAIM_MASK);
+		error = mem_cgroup_try_charge(page, current->mm, gfp, &memcg);
 		if (!error) {
 			error = shmem_add_to_page_cache(page, mapping, index,
 						gfp, swp_to_radix_entry(swap));
@@ -1097,12 +1101,16 @@ repeat:
 			 * Reset swap.val? No, leave it so "failed" goes back to
 			 * "repeat": reading a hole and writing should succeed.
 			 */
-			if (error)
+			if (error) {
+				mem_cgroup_cancel_charge(page, memcg);
 				delete_from_swap_cache(page);
+			}
 		}
 		if (error)
 			goto failed;
 
+		mem_cgroup_commit_charge(page, memcg, true);
+
 		spin_lock(&info->lock);
 		info->swapped--;
 		shmem_recalc_inode(inode);
@@ -1134,8 +1142,7 @@ repeat:
 
 		SetPageSwapBacked(page);
 		__set_page_locked(page);
-		error = mem_cgroup_charge_file(page, current->mm,
-						gfp & GFP_RECLAIM_MASK);
+		error = mem_cgroup_try_charge(page, current->mm, gfp, &memcg);
 		if (error)
 			goto decused;
 		error = radix_tree_maybe_preload(gfp & GFP_RECLAIM_MASK);
@@ -1145,9 +1152,10 @@ repeat:
 			radix_tree_preload_end();
 		}
 		if (error) {
-			mem_cgroup_uncharge_cache_page(page);
+			mem_cgroup_cancel_charge(page, memcg);
 			goto decused;
 		}
+		mem_cgroup_commit_charge(page, memcg, false);
 		lru_cache_add_anon(page);
 
 		spin_lock(&info->lock);
diff --git a/mm/swap.c b/mm/swap.c
index 9ce43ba4498b..a5bdff331507 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -635,6 +635,26 @@ void add_page_to_unevictable_list(struct page *page)
 	spin_unlock_irq(&zone->lru_lock);
 }
 
+/**
+ * lru_cache_add_active_or_unevictable
+ * @page:  the page to be added to LRU
+ * @vma:   vma in which page is mapped for determining reclaimability
+ *
+ * Place @page on the active or unevictable LRU list, depending on its
+ * evictability.  Note that if the page is not evictable, it goes
+ * directly back onto its zone's unevictable list; it does NOT use a
+ * per-cpu pagevec.
+ */
+void lru_cache_add_active_or_unevictable(struct page *page,
+					 struct vm_area_struct *vma)
+{
+	if (!mlocked_vma_newpage(vma, page)) {
+		SetPageActive(page);
+		lru_cache_add(page);
+	} else
+		add_page_to_unevictable_list(page);
+}
+
 /*
  * If the page can not be invalidated, it is moved to the
  * inactive list to speed up its reclaim.  It is moved to the
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 4a7f7e6992b6..7c57c7256c6e 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1126,15 +1126,14 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 	if (unlikely(!page))
 		return -ENOMEM;
 
-	if (mem_cgroup_try_charge_swapin(vma->vm_mm, page,
-					 GFP_KERNEL, &memcg)) {
+	if (mem_cgroup_try_charge(page, vma->vm_mm, GFP_KERNEL, &memcg)) {
 		ret = -ENOMEM;
 		goto out_nolock;
 	}
 
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
 	if (unlikely(!maybe_same_pte(*pte, swp_entry_to_pte(entry)))) {
-		mem_cgroup_cancel_charge_swapin(memcg);
+		mem_cgroup_cancel_charge(page, memcg);
 		ret = 0;
 		goto out;
 	}
@@ -1144,11 +1143,14 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 	get_page(page);
 	set_pte_at(vma->vm_mm, addr, pte,
 		   pte_mkold(mk_pte(page, vma->vm_page_prot)));
-	if (page == swapcache)
+	if (page == swapcache) {
 		page_add_anon_rmap(page, vma, addr);
-	else /* ksm created a completely new copy */
+		mem_cgroup_commit_charge(page, memcg, true);
+	} else { /* ksm created a completely new copy */
 		page_add_new_anon_rmap(page, vma, addr);
-	mem_cgroup_commit_charge_swapin(page, memcg);
+		mem_cgroup_commit_charge(page, memcg, false);
+		lru_cache_add_active_or_unevictable(page, vma);
+	}
 	swap_free(entry);
 	/*
 	 * Move the page to the active list so it is not
-- 
1.9.2


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [patch 9/9] mm: memcontrol: rewrite uncharge API
  2014-04-30 20:25 [patch 0/9] mm: memcontrol: naturalize charge lifetime Johannes Weiner
                   ` (7 preceding siblings ...)
  2014-04-30 20:25 ` [patch 8/9] mm: memcontrol: rewrite charge API Johannes Weiner
@ 2014-04-30 20:25 ` Johannes Weiner
  2014-05-04 14:32   ` Johannes Weiner
  2014-05-27  7:43   ` Kamezawa Hiroyuki
  2014-05-02 11:26 ` [patch 0/9] mm: memcontrol: naturalize charge lifetime Michal Hocko
  9 siblings, 2 replies; 35+ messages in thread
From: Johannes Weiner @ 2014-04-30 20:25 UTC (permalink / raw)
  To: linux-mm; +Cc: Michal Hocko, Hugh Dickins, Tejun Heo, cgroups, linux-kernel

The memcg uncharging code that is involved towards the end of a page's
lifetime - truncation, reclaim, swapout, migration - is impressively
complicated and fragile.

Because anonymous and file pages were always charged before they had
their page->mapping established, uncharges had to happen when the page
type could be known from the context, as in unmap for anonymous, page
cache removal for file and shmem pages, and swap cache truncation for
swap pages.  However, these operations also happen well before the
page is actually freed, and so a lot of synchronization is necessary:

- On page migration, the old page might be unmapped but then reused,
  so memcg code has to prevent an untimely uncharge in that case.
  Because this code - which should be a simple charge transfer - is so
  special-cased, it is not reusable for replace_page_cache().

- Swap cache truncation happens during both swap-in and swap-out, and
  possibly repeatedly before the page is actually freed.  This means
  that the memcg swapout code is called from many contexts that make
  no sense and it has to figure out the direction from page state to
  make sure memory and memory+swap are always correctly charged.

But now that charged pages always have a page->mapping, introduce
mem_cgroup_uncharge(), which is called after the final put_page(),
when we know for sure that nobody is looking at the page anymore.
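
For illustration, a freeing loop would then look roughly like this sketch;
free_dead_pages() is made up, only the mem_cgroup_uncharge*() calls are the
real interface:

#include <linux/memcontrol.h>
#include <linux/mm.h>

/*
 * Sketch only: by the time pages reach a freeing loop, the last
 * reference is gone, so uncharge needs no page type checks and no
 * page_cgroup locking; start/end merely batch the res_counter updates.
 */
static void free_dead_pages(struct list_head *pages)
{
        struct page *page, *next;

        mem_cgroup_uncharge_start();
        list_for_each_entry_safe(page, next, pages, lru) {
                VM_BUG_ON_PAGE(page_count(page), page);
                list_del(&page->lru);
                mem_cgroup_uncharge(page);
                /* ... hand @page back to the page allocator ... */
        }
        mem_cgroup_uncharge_end();
}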

For page migration, introduce mem_cgroup_migrate(), which is called
after the migration is successful and the new page is fully rmapped.
Because the old page is no longer uncharged after migration, prevent
double charges by decoupling the page's memcg association (PCG_USED
and pc->mem_cgroup) from the page holding an actual charge.  The new
bits PCG_MEM and PCG_MEMSW represent the respective charges and are
transferred to the new page during migration.

mem_cgroup_migrate() is suitable for replace_page_cache() as well.
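
A caller sketch, assuming nothing beyond the mem_cgroup_migrate() prototype
added below; the wrapper and its flag are illustrative only:

#include <linux/memcontrol.h>
#include <linux/mm.h>

/* Sketch only: one call once the copy has succeeded. */
static void memcg_finish_migration(struct page *old, struct page *new,
                                   bool new_may_be_on_lru)
{
        /*
         * @new is fully rmapped; PCG_MEM/PCG_MEMSW and pc->mem_cgroup
         * move over from @old, which later goes through the normal
         * uncharge path with no charge left to drop.  Pass true if
         * @new might already be on the LRU, as replace_page_cache()
         * does.
         */
        mem_cgroup_migrate(old, new, new_may_be_on_lru);
}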

Swap accounting is massively simplified: because the page is no longer
uncharged as early as swap cache deletion, a new mem_cgroup_swapout()
can transfer the page's memory+swap charge (PCG_MEMSW) to the swap
entry before the final put_page() in page reclaim.
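
Roughly, the reclaim side then does the following (a sketch; the helper name
is made up and tree_lock handling around the swap cache removal is elided):

#include <linux/memcontrol.h>
#include <linux/swap.h>
#include <linux/mm.h>

/* Sketch only: tearing down the swap cache entry of a swapped-out page. */
static void swapout_release(struct page *page)
{
        swp_entry_t entry = { .val = page_private(page) };

        /*
         * Move the memory+swap charge (PCG_MEMSW) to the swap entry
         * while the page is still in the swap cache; plain memory is
         * uncharged by mem_cgroup_uncharge() once the last reference
         * is gone.
         */
        mem_cgroup_swapout(page, entry);
        __delete_from_swap_cache(page); /* mapping->tree_lock elided */
        swapcache_free(entry);          /* no longer takes the page */
}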

Finally, because pages are now charged under proper serialization
(anon: exclusive; cache: page lock; swapin: page lock; migration: page
lock), and uncharged under full exclusion, they can not race with
themselves.  Because they are also off-LRU during charge/uncharge,
charge migration can not race with that, either.  Remove the crazily
expensive page_cgroup lock and set pc->flags non-atomically.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 Documentation/cgroups/memcg_test.txt | 128 +------
 include/linux/memcontrol.h           |  49 +--
 include/linux/page_cgroup.h          |  43 +--
 include/linux/swap.h                 |  12 +-
 mm/filemap.c                         |   4 +-
 mm/memcontrol.c                      | 721 ++++++++++++-----------------------
 mm/migrate.c                         |  45 +--
 mm/rmap.c                            |   1 -
 mm/shmem.c                           |   4 +-
 mm/swap.c                            |   2 +
 mm/swap_state.c                      |   8 +-
 mm/swapfile.c                        |   7 +-
 mm/truncate.c                        |   1 -
 mm/vmscan.c                          |   9 +-
 mm/zswap.c                           |   2 +-
 15 files changed, 302 insertions(+), 734 deletions(-)

diff --git a/Documentation/cgroups/memcg_test.txt b/Documentation/cgroups/memcg_test.txt
index bcf750d3cecd..8870b0212150 100644
--- a/Documentation/cgroups/memcg_test.txt
+++ b/Documentation/cgroups/memcg_test.txt
@@ -29,28 +29,13 @@ Please note that implementation details can be changed.
 2. Uncharge
   a page/swp_entry may be uncharged (usage -= PAGE_SIZE) by
 
-	mem_cgroup_uncharge_page()
-	  Called when an anonymous page is fully unmapped. I.e., mapcount goes
-	  to 0. If the page is SwapCache, uncharge is delayed until
-	  mem_cgroup_uncharge_swapcache().
-
-	mem_cgroup_uncharge_cache_page()
-	  Called when a page-cache is deleted from radix-tree. If the page is
-	  SwapCache, uncharge is delayed until mem_cgroup_uncharge_swapcache().
-
-	mem_cgroup_uncharge_swapcache()
-	  Called when SwapCache is removed from radix-tree. The charge itself
-	  is moved to swap_cgroup. (If mem+swap controller is disabled, no
-	  charge to swap occurs.)
+	mem_cgroup_uncharge()
+	  Called when a page's refcount goes down to 0.
 
 	mem_cgroup_uncharge_swap()
 	  Called when swp_entry's refcnt goes down to 0. A charge against swap
 	  disappears.
 
-	mem_cgroup_end_migration(old, new)
-	At success of migration old is uncharged (if necessary), a charge
-	to new page is committed. At failure, charge to old page is committed.
-
 3. charge-commit-cancel
 	Memcg pages are charged in two steps:
 		mem_cgroup_try_charge()
@@ -69,18 +54,6 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
 	Anonymous page is newly allocated at
 		  - page fault into MAP_ANONYMOUS mapping.
 		  - Copy-On-Write.
- 	It is charged right after it's allocated before doing any page table
-	related operations. Of course, it's uncharged when another page is used
-	for the fault address.
-
-	At freeing anonymous page (by exit() or munmap()), zap_pte() is called
-	and pages for ptes are freed one by one.(see mm/memory.c). Uncharges
-	are done at page_remove_rmap() when page_mapcount() goes down to 0.
-
-	Another page freeing is by page-reclaim (vmscan.c) and anonymous
-	pages are swapped out. In this case, the page is marked as
-	PageSwapCache(). uncharge() routine doesn't uncharge the page marked
-	as SwapCache(). It's delayed until __delete_from_swap_cache().
 
 	4.1 Swap-in.
 	At swap-in, the page is taken from swap-cache. There are 2 cases.
@@ -89,41 +62,6 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
 	(b) If the SwapCache has been mapped by processes, it has been
 	    charged already.
 
-	This swap-in is one of the most complicated work. In do_swap_page(),
-	following events occur when pte is unchanged.
-
-	(1) the page (SwapCache) is looked up.
-	(2) lock_page()
-	(3) try_charge_swapin()
-	(4) reuse_swap_page() (may call delete_swap_cache())
-	(5) commit_charge_swapin()
-	(6) swap_free().
-
-	Considering following situation for example.
-
-	(A) The page has not been charged before (2) and reuse_swap_page()
-	    doesn't call delete_from_swap_cache().
-	(B) The page has not been charged before (2) and reuse_swap_page()
-	    calls delete_from_swap_cache().
-	(C) The page has been charged before (2) and reuse_swap_page() doesn't
-	    call delete_from_swap_cache().
-	(D) The page has been charged before (2) and reuse_swap_page() calls
-	    delete_from_swap_cache().
-
-	    memory.usage/memsw.usage changes to this page/swp_entry will be
-	 Case          (A)      (B)       (C)     (D)
-         Event
-       Before (2)     0/ 1     0/ 1      1/ 1    1/ 1
-          ===========================================
-          (3)        +1/+1    +1/+1     +1/+1   +1/+1
-          (4)          -       0/ 0       -     -1/ 0
-          (5)         0/-1     0/ 0     -1/-1    0/ 0
-          (6)          -       0/-1       -      0/-1
-          ===========================================
-       Result         1/ 1     1/ 1      1/ 1    1/ 1
-
-       In any cases, charges to this page should be 1/ 1.
-
 	4.2 Swap-out.
 	At swap-out, typical state transition is below.
 
@@ -136,28 +74,20 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
 	    swp_entry's refcnt -= 1.
 
 
-	At (b), the page is marked as SwapCache and not uncharged.
-	At (d), the page is removed from SwapCache and a charge in page_cgroup
-	is moved to swap_cgroup.
-
 	Finally, at task exit,
 	(e) zap_pte() is called and swp_entry's refcnt -=1 -> 0.
-	Here, a charge in swap_cgroup disappears.
 
 5. Page Cache
    	Page Cache is charged at
 	- add_to_page_cache_locked().
 
-	uncharged at
-	- __remove_from_page_cache().
-
 	The logic is very clear. (About migration, see below)
 	Note: __remove_from_page_cache() is called by remove_from_page_cache()
 	and __remove_mapping().
 
 6. Shmem(tmpfs) Page Cache
-	Memcg's charge/uncharge have special handlers of shmem. The best way
-	to understand shmem's page state transition is to read mm/shmem.c.
+	The best way to understand shmem's page state transition is to read
+	mm/shmem.c.
 	But brief explanation of the behavior of memcg around shmem will be
 	helpful to understand the logic.
 
@@ -170,56 +100,10 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
 	It's charged when...
 	- A new page is added to shmem's radix-tree.
 	- A swp page is read. (move a charge from swap_cgroup to page_cgroup)
-	It's uncharged when
-	- A page is removed from radix-tree and not SwapCache.
-	- When SwapCache is removed, a charge is moved to swap_cgroup.
-	- When swp_entry's refcnt goes down to 0, a charge in swap_cgroup
-	  disappears.
 
 7. Page Migration
-   	One of the most complicated functions is page-migration-handler.
-	Memcg has 2 routines. Assume that we are migrating a page's contents
-	from OLDPAGE to NEWPAGE.
-
-	Usual migration logic is..
-	(a) remove the page from LRU.
-	(b) allocate NEWPAGE (migration target)
-	(c) lock by lock_page().
-	(d) unmap all mappings.
-	(e-1) If necessary, replace entry in radix-tree.
-	(e-2) move contents of a page.
-	(f) map all mappings again.
-	(g) pushback the page to LRU.
-	(-) OLDPAGE will be freed.
-
-	Before (g), memcg should complete all necessary charge/uncharge to
-	NEWPAGE/OLDPAGE.
-
-	The point is....
-	- If OLDPAGE is anonymous, all charges will be dropped at (d) because
-          try_to_unmap() drops all mapcount and the page will not be
-	  SwapCache.
-
-	- If OLDPAGE is SwapCache, charges will be kept at (g) because
-	  __delete_from_swap_cache() isn't called at (e-1)
-
-	- If OLDPAGE is page-cache, charges will be kept at (g) because
-	  remove_from_swap_cache() isn't called at (e-1)
-
-	memcg provides following hooks.
-
-	- mem_cgroup_prepare_migration(OLDPAGE)
-	  Called after (b) to account a charge (usage += PAGE_SIZE) against
-	  memcg which OLDPAGE belongs to.
-
-        - mem_cgroup_end_migration(OLDPAGE, NEWPAGE)
-	  Called after (f) before (g).
-	  If OLDPAGE is used, commit OLDPAGE again. If OLDPAGE is already
-	  charged, a charge by prepare_migration() is automatically canceled.
-	  If NEWPAGE is used, commit NEWPAGE and uncharge OLDPAGE.
-
-	  But zap_pte() (by exit or munmap) can be called while migration,
-	  we have to check if OLDPAGE/NEWPAGE is a valid page after commit().
+
+	mem_cgroup_migrate()
 
 8. LRU
         Each memcg has its own private LRU. Now, its handling is under global
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5578b07376b7..4ef4c2acbc1a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -60,15 +60,17 @@ void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
 			      bool lrucare);
 void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg);
 
-struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
-struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
+void mem_cgroup_uncharge(struct page *page);
+
+/* Batched uncharging */
+void mem_cgroup_uncharge_start(void);
+void mem_cgroup_uncharge_end(void);
 
-/* For coalescing uncharge for reducing memcg' overhead*/
-extern void mem_cgroup_uncharge_start(void);
-extern void mem_cgroup_uncharge_end(void);
+void mem_cgroup_migrate(struct page *oldpage, struct page *newpage,
+			bool lrucare);
 
-extern void mem_cgroup_uncharge_page(struct page *page);
-extern void mem_cgroup_uncharge_cache_page(struct page *page);
+struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
+struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
 
 bool __mem_cgroup_same_or_subtree(const struct mem_cgroup *root_memcg,
 				  struct mem_cgroup *memcg);
@@ -96,12 +98,6 @@ bool mm_match_cgroup(const struct mm_struct *mm, const struct mem_cgroup *memcg)
 
 extern struct cgroup_subsys_state *mem_cgroup_css(struct mem_cgroup *memcg);
 
-extern void
-mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
-			     struct mem_cgroup **memcgp);
-extern void mem_cgroup_end_migration(struct mem_cgroup *memcg,
-	struct page *oldpage, struct page *newpage, bool migration_ok);
-
 struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *,
 				   struct mem_cgroup *,
 				   struct mem_cgroup_reclaim_cookie *);
@@ -116,8 +112,6 @@ unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list);
 void mem_cgroup_update_lru_size(struct lruvec *, enum lru_list, int);
 extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
 					struct task_struct *p);
-extern void mem_cgroup_replace_page_cache(struct page *oldpage,
-					struct page *newpage);
 
 static inline void mem_cgroup_oom_enable(void)
 {
@@ -235,19 +229,21 @@ static inline void mem_cgroup_cancel_charge(struct page *page,
 {
 }
 
-static inline void mem_cgroup_uncharge_start(void)
+static inline void mem_cgroup_uncharge(struct page *page)
 {
 }
 
-static inline void mem_cgroup_uncharge_end(void)
+static inline void mem_cgroup_uncharge_start(void)
 {
 }
 
-static inline void mem_cgroup_uncharge_page(struct page *page)
+static inline void mem_cgroup_uncharge_end(void)
 {
 }
 
-static inline void mem_cgroup_uncharge_cache_page(struct page *page)
+static inline void mem_cgroup_migrate(struct page *oldpage,
+				      struct page *newpage,
+				      bool lrucare)
 {
 }
 
@@ -286,17 +282,6 @@ static inline struct cgroup_subsys_state
 	return NULL;
 }
 
-static inline void
-mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
-			     struct mem_cgroup **memcgp)
-{
-}
-
-static inline void mem_cgroup_end_migration(struct mem_cgroup *memcg,
-		struct page *oldpage, struct page *newpage, bool migration_ok)
-{
-}
-
 static inline struct mem_cgroup *
 mem_cgroup_iter(struct mem_cgroup *root,
 		struct mem_cgroup *prev,
@@ -392,10 +377,6 @@ static inline
 void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx)
 {
 }
-static inline void mem_cgroup_replace_page_cache(struct page *oldpage,
-				struct page *newpage)
-{
-}
 #endif /* CONFIG_MEMCG */
 
 #if !defined(CONFIG_MEMCG) || !defined(CONFIG_DEBUG_VM)
diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 777a524716db..97b5c39a31c8 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -3,9 +3,9 @@
 
 enum {
 	/* flags for mem_cgroup */
-	PCG_LOCK,  /* Lock for pc->mem_cgroup and following bits. */
-	PCG_USED, /* this object is in use. */
-	PCG_MIGRATION, /* under page migration */
+	PCG_USED,	/* This page is charged to a memcg */
+	PCG_MEM,	/* This page holds a memory charge */
+	PCG_MEMSW,	/* This page holds a memory+swap charge */
 	__NR_PCG_FLAGS,
 };
 
@@ -44,42 +44,9 @@ static inline void __init page_cgroup_init(void)
 struct page_cgroup *lookup_page_cgroup(struct page *page);
 struct page *lookup_cgroup_page(struct page_cgroup *pc);
 
-#define TESTPCGFLAG(uname, lname)			\
-static inline int PageCgroup##uname(struct page_cgroup *pc)	\
-	{ return test_bit(PCG_##lname, &pc->flags); }
-
-#define SETPCGFLAG(uname, lname)			\
-static inline void SetPageCgroup##uname(struct page_cgroup *pc)\
-	{ set_bit(PCG_##lname, &pc->flags);  }
-
-#define CLEARPCGFLAG(uname, lname)			\
-static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
-	{ clear_bit(PCG_##lname, &pc->flags);  }
-
-#define TESTCLEARPCGFLAG(uname, lname)			\
-static inline int TestClearPageCgroup##uname(struct page_cgroup *pc)	\
-	{ return test_and_clear_bit(PCG_##lname, &pc->flags);  }
-
-TESTPCGFLAG(Used, USED)
-CLEARPCGFLAG(Used, USED)
-SETPCGFLAG(Used, USED)
-
-SETPCGFLAG(Migration, MIGRATION)
-CLEARPCGFLAG(Migration, MIGRATION)
-TESTPCGFLAG(Migration, MIGRATION)
-
-static inline void lock_page_cgroup(struct page_cgroup *pc)
-{
-	/*
-	 * Don't take this lock in IRQ context.
-	 * This lock is for pc->mem_cgroup, USED, MIGRATION
-	 */
-	bit_spin_lock(PCG_LOCK, &pc->flags);
-}
-
-static inline void unlock_page_cgroup(struct page_cgroup *pc)
+static inline int PageCgroupUsed(struct page_cgroup *pc)
 {
-	bit_spin_unlock(PCG_LOCK, &pc->flags);
+	return test_bit(PCG_USED, &pc->flags);
 }
 
 #else /* CONFIG_MEMCG */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 403a8530ee62..05d2b1cd4f59 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -400,9 +400,13 @@ static inline int mem_cgroup_swappiness(struct mem_cgroup *mem)
 }
 #endif
 #ifdef CONFIG_MEMCG_SWAP
-extern void mem_cgroup_uncharge_swap(swp_entry_t ent);
+extern void mem_cgroup_swapout(struct page *page, swp_entry_t entry);
+extern void mem_cgroup_uncharge_swap(swp_entry_t entry);
 #else
-static inline void mem_cgroup_uncharge_swap(swp_entry_t ent)
+static inline void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
+{
+}
+static inline void mem_cgroup_uncharge_swap(swp_entry_t entry)
 {
 }
 #endif
@@ -462,7 +466,7 @@ extern void swap_shmem_alloc(swp_entry_t);
 extern int swap_duplicate(swp_entry_t);
 extern int swapcache_prepare(swp_entry_t);
 extern void swap_free(swp_entry_t);
-extern void swapcache_free(swp_entry_t, struct page *page);
+extern void swapcache_free(swp_entry_t);
 extern int free_swap_and_cache(swp_entry_t);
 extern int swap_type_of(dev_t, sector_t, struct block_device **);
 extern unsigned int count_swap_pages(int, int);
@@ -526,7 +530,7 @@ static inline void swap_free(swp_entry_t swp)
 {
 }
 
-static inline void swapcache_free(swp_entry_t swp, struct page *page)
+static inline void swapcache_free(swp_entry_t swp)
 {
 }
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 346c2e178193..337fb5e5360c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -233,7 +233,6 @@ void delete_from_page_cache(struct page *page)
 	spin_lock_irq(&mapping->tree_lock);
 	__delete_from_page_cache(page, NULL);
 	spin_unlock_irq(&mapping->tree_lock);
-	mem_cgroup_uncharge_cache_page(page);
 
 	if (freepage)
 		freepage(page);
@@ -499,8 +498,7 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
 		if (PageSwapBacked(new))
 			__inc_zone_page_state(new, NR_SHMEM);
 		spin_unlock_irq(&mapping->tree_lock);
-		/* mem_cgroup codes must not be called under tree_lock */
-		mem_cgroup_replace_page_cache(old, new);
+		mem_cgroup_migrate(old, new, true);
 		radix_tree_preload_end();
 		if (freepage)
 			freepage(old);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6f48e292ffe7..0add8b7b3a6c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -919,13 +919,13 @@ static unsigned long mem_cgroup_read_events(struct mem_cgroup *memcg,
 
 static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
 					 struct page *page,
-					 bool anon, int nr_pages)
+					 int nr_pages)
 {
 	/*
 	 * Here, RSS means 'mapped anon' and anon's SwapCache. Shmem/tmpfs is
 	 * counted as CACHE even if it's on ANON LRU.
 	 */
-	if (anon)
+	if (PageAnon(page))
 		__this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_RSS],
 				nr_pages);
 	else
@@ -1358,20 +1358,6 @@ out:
 	return lruvec;
 }
 
-/*
- * Following LRU functions are allowed to be used without PCG_LOCK.
- * Operations are called by routine of global LRU independently from memcg.
- * What we have to take care of here is validness of pc->mem_cgroup.
- *
- * Changes to pc->mem_cgroup happens when
- * 1. charge
- * 2. moving account
- * In typical case, "charge" is done before add-to-lru. Exception is SwapCache.
- * It is added to LRU before charge.
- * If PCG_USED bit is not set, page_cgroup is not added to this private LRU.
- * When moving account, the page is not on LRU. It's isolated.
- */
-
 /**
  * mem_cgroup_page_lruvec - return lruvec for adding an lru page
  * @page: the page
@@ -2285,22 +2271,14 @@ cleanup:
  *
  * Notes: Race condition
  *
- * We usually use page_cgroup_lock() for accessing page_cgroup member but
- * it tends to be costly. But considering some conditions, we doesn't need
- * to do so _always_.
- *
- * Considering "charge", lock_page_cgroup() is not required because all
- * file-stat operations happen after a page is attached to radix-tree. There
- * are no race with "charge".
+ * Charging occurs during page instantiation, while the page is
+ * unmapped and locked in page migration, or while the page table is
+ * locked in THP migration.  No race is possible.
  *
- * Considering "uncharge", we know that memcg doesn't clear pc->mem_cgroup
- * at "uncharge" intentionally. So, we always see valid pc->mem_cgroup even
- * if there are race with "uncharge". Statistics itself is properly handled
- * by flags.
+ * Uncharge happens to pages with zero references, no race possible.
  *
- * Considering "move", this is an only case we see a race. To make the race
- * small, we check mm->moving_account and detect there are possibility of race
- * If there is, we take a lock.
+ * Charge moving between groups is protected by checking mm->moving
+ * account and taking the move_lock in the slowpath.
  */
 
 void __mem_cgroup_begin_update_page_stat(struct page *page,
@@ -2603,34 +2581,6 @@ static struct mem_cgroup *mem_cgroup_lookup(unsigned short id)
 	return mem_cgroup_from_id(id);
 }
 
-struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
-{
-	struct mem_cgroup *memcg = NULL;
-	struct page_cgroup *pc;
-	unsigned short id;
-	swp_entry_t ent;
-
-	VM_BUG_ON_PAGE(!PageLocked(page), page);
-
-	pc = lookup_page_cgroup(page);
-	lock_page_cgroup(pc);
-	if (PageCgroupUsed(pc)) {
-		memcg = pc->mem_cgroup;
-		if (memcg && !css_tryget(&memcg->css))
-			memcg = NULL;
-	} else if (PageSwapCache(page)) {
-		ent.val = page_private(page);
-		id = lookup_swap_cgroup_id(ent);
-		rcu_read_lock();
-		memcg = mem_cgroup_lookup(id);
-		if (memcg && !css_tryget(&memcg->css))
-			memcg = NULL;
-		rcu_read_unlock();
-	}
-	unlock_page_cgroup(pc);
-	return memcg;
-}
-
 static DEFINE_MUTEX(set_limit_mutex);
 
 #ifdef CONFIG_MEMCG_KMEM
@@ -3352,7 +3302,6 @@ static inline void mem_cgroup_destroy_all_caches(struct mem_cgroup *memcg)
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 
-#define PCGF_NOCOPY_AT_SPLIT (1 << PCG_LOCK | 1 << PCG_MIGRATION)
 /*
  * Because tail pages are not marked as "used", set it. We're under
  * zone->lru_lock, 'splitting on pmd' and compound_lock.
@@ -3373,7 +3322,7 @@ void mem_cgroup_split_huge_fixup(struct page *head)
 	for (i = 1; i < HPAGE_PMD_NR; i++) {
 		pc = head_pc + i;
 		pc->mem_cgroup = memcg;
-		pc->flags = head_pc->flags & ~PCGF_NOCOPY_AT_SPLIT;
+		pc->flags = head_pc->flags;
 	}
 	__this_cpu_sub(memcg->stat->count[MEM_CGROUP_STAT_RSS_HUGE],
 		       HPAGE_PMD_NR);
@@ -3403,7 +3352,6 @@ static int mem_cgroup_move_account(struct page *page,
 {
 	unsigned long flags;
 	int ret;
-	bool anon = PageAnon(page);
 
 	VM_BUG_ON(from == to);
 	VM_BUG_ON_PAGE(PageLRU(page), page);
@@ -3417,15 +3365,13 @@ static int mem_cgroup_move_account(struct page *page,
 	if (nr_pages > 1 && !PageTransHuge(page))
 		goto out;
 
-	lock_page_cgroup(pc);
-
 	ret = -EINVAL;
 	if (!PageCgroupUsed(pc) || pc->mem_cgroup != from)
-		goto unlock;
+		goto out;
 
 	move_lock_mem_cgroup(from, &flags);
 
-	if (!anon && page_mapped(page)) {
+	if (!PageAnon(page) && page_mapped(page)) {
 		__this_cpu_sub(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED],
 			       nr_pages);
 		__this_cpu_add(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED],
@@ -3439,15 +3385,13 @@ static int mem_cgroup_move_account(struct page *page,
 			       nr_pages);
 	}
 
-	mem_cgroup_charge_statistics(from, page, anon, -nr_pages);
+	mem_cgroup_charge_statistics(from, page, -nr_pages);
 
 	/* caller should have done css_get */
 	pc->mem_cgroup = to;
-	mem_cgroup_charge_statistics(to, page, anon, nr_pages);
+	mem_cgroup_charge_statistics(to, page, nr_pages);
 	move_unlock_mem_cgroup(from, &flags);
 	ret = 0;
-unlock:
-	unlock_page_cgroup(pc);
 	/*
 	 * check events
 	 */
@@ -3523,193 +3467,6 @@ out:
 	return ret;
 }
 
-static void mem_cgroup_do_uncharge(struct mem_cgroup *memcg,
-				   unsigned int nr_pages,
-				   const enum charge_type ctype)
-{
-	struct memcg_batch_info *batch = NULL;
-	bool uncharge_memsw = true;
-
-	/* If swapout, usage of swap doesn't decrease */
-	if (!do_swap_account || ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
-		uncharge_memsw = false;
-
-	batch = &current->memcg_batch;
-	/*
-	 * In usual, we do css_get() when we remember memcg pointer.
-	 * But in this case, we keep res->usage until end of a series of
-	 * uncharges. Then, it's ok to ignore memcg's refcnt.
-	 */
-	if (!batch->memcg)
-		batch->memcg = memcg;
-	/*
-	 * do_batch > 0 when unmapping pages or inode invalidate/truncate.
-	 * In those cases, all pages freed continuously can be expected to be in
-	 * the same cgroup and we have chance to coalesce uncharges.
-	 * But we do uncharge one by one if this is killed by OOM(TIF_MEMDIE)
-	 * because we want to do uncharge as soon as possible.
-	 */
-
-	if (!batch->do_batch || test_thread_flag(TIF_MEMDIE))
-		goto direct_uncharge;
-
-	if (nr_pages > 1)
-		goto direct_uncharge;
-
-	/*
-	 * In typical case, batch->memcg == mem. This means we can
-	 * merge a series of uncharges to an uncharge of res_counter.
-	 * If not, we uncharge res_counter ony by one.
-	 */
-	if (batch->memcg != memcg)
-		goto direct_uncharge;
-	/* remember freed charge and uncharge it later */
-	batch->nr_pages++;
-	if (uncharge_memsw)
-		batch->memsw_nr_pages++;
-	return;
-direct_uncharge:
-	res_counter_uncharge(&memcg->res, nr_pages * PAGE_SIZE);
-	if (uncharge_memsw)
-		res_counter_uncharge(&memcg->memsw, nr_pages * PAGE_SIZE);
-	if (unlikely(batch->memcg != memcg))
-		memcg_oom_recover(memcg);
-}
-
-/*
- * uncharge if !page_mapped(page)
- */
-static struct mem_cgroup *
-__mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype,
-			     bool end_migration)
-{
-	struct mem_cgroup *memcg = NULL;
-	unsigned int nr_pages = 1;
-	struct page_cgroup *pc;
-	bool anon;
-
-	if (mem_cgroup_disabled())
-		return NULL;
-
-	if (PageTransHuge(page)) {
-		nr_pages <<= compound_order(page);
-		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
-	}
-	/*
-	 * Check if our page_cgroup is valid
-	 */
-	pc = lookup_page_cgroup(page);
-	if (unlikely(!PageCgroupUsed(pc)))
-		return NULL;
-
-	lock_page_cgroup(pc);
-
-	memcg = pc->mem_cgroup;
-
-	if (!PageCgroupUsed(pc))
-		goto unlock_out;
-
-	anon = PageAnon(page);
-
-	switch (ctype) {
-	case MEM_CGROUP_CHARGE_TYPE_ANON:
-		/*
-		 * Generally PageAnon tells if it's the anon statistics to be
-		 * updated; but sometimes e.g. mem_cgroup_uncharge_page() is
-		 * used before page reached the stage of being marked PageAnon.
-		 */
-		anon = true;
-		/* fallthrough */
-	case MEM_CGROUP_CHARGE_TYPE_DROP:
-		/* See mem_cgroup_prepare_migration() */
-		if (page_mapped(page))
-			goto unlock_out;
-		/*
-		 * Pages under migration may not be uncharged.  But
-		 * end_migration() /must/ be the one uncharging the
-		 * unused post-migration page and so it has to call
-		 * here with the migration bit still set.  See the
-		 * res_counter handling below.
-		 */
-		if (!end_migration && PageCgroupMigration(pc))
-			goto unlock_out;
-		break;
-	case MEM_CGROUP_CHARGE_TYPE_SWAPOUT:
-		if (!PageAnon(page)) {	/* Shared memory */
-			if (page->mapping && !page_is_file_cache(page))
-				goto unlock_out;
-		} else if (page_mapped(page)) /* Anon */
-				goto unlock_out;
-		break;
-	default:
-		break;
-	}
-
-	mem_cgroup_charge_statistics(memcg, page, anon, -nr_pages);
-
-	ClearPageCgroupUsed(pc);
-	/*
-	 * pc->mem_cgroup is not cleared here. It will be accessed when it's
-	 * freed from LRU. This is safe because uncharged page is expected not
-	 * to be reused (freed soon). Exception is SwapCache, it's handled by
-	 * special functions.
-	 */
-
-	unlock_page_cgroup(pc);
-	/*
-	 * even after unlock, we have memcg->res.usage here and this memcg
-	 * will never be freed, so it's safe to call css_get().
-	 */
-	memcg_check_events(memcg, page);
-	if (do_swap_account && ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT) {
-		mem_cgroup_swap_statistics(memcg, true);
-		css_get(&memcg->css);
-	}
-	/*
-	 * Migration does not charge the res_counter for the
-	 * replacement page, so leave it alone when phasing out the
-	 * page that is unused after the migration.
-	 */
-	if (!end_migration)
-		mem_cgroup_do_uncharge(memcg, nr_pages, ctype);
-
-	return memcg;
-
-unlock_out:
-	unlock_page_cgroup(pc);
-	return NULL;
-}
-
-void mem_cgroup_uncharge_page(struct page *page)
-{
-	/* early check. */
-	if (page_mapped(page))
-		return;
-	VM_BUG_ON_PAGE(page->mapping && !PageAnon(page), page);
-	/*
-	 * If the page is in swap cache, uncharge should be deferred
-	 * to the swap path, which also properly accounts swap usage
-	 * and handles memcg lifetime.
-	 *
-	 * Note that this check is not stable and reclaim may add the
-	 * page to swap cache at any time after this.  However, if the
-	 * page is not in swap cache by the time page->mapcount hits
-	 * 0, there won't be any page table references to the swap
-	 * slot, and reclaim will free it and not actually write the
-	 * page to disk.
-	 */
-	if (PageSwapCache(page))
-		return;
-	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_ANON, false);
-}
-
-void mem_cgroup_uncharge_cache_page(struct page *page)
-{
-	VM_BUG_ON_PAGE(page_mapped(page), page);
-	VM_BUG_ON_PAGE(page->mapping, page);
-	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE, false);
-}
-
 /*
  * Batch_start/batch_end is called in unmap_page_range/invlidate/trucate.
  * In that cases, pages are freed continuously and we can expect pages
@@ -3757,59 +3514,7 @@ void mem_cgroup_uncharge_end(void)
 	batch->memcg = NULL;
 }
 
-#ifdef CONFIG_SWAP
-/*
- * called after __delete_from_swap_cache() and drop "page" account.
- * memcg information is recorded to swap_cgroup of "ent"
- */
-void
-mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout)
-{
-	struct mem_cgroup *memcg;
-	int ctype = MEM_CGROUP_CHARGE_TYPE_SWAPOUT;
-
-	if (!swapout) /* this was a swap cache but the swap is unused ! */
-		ctype = MEM_CGROUP_CHARGE_TYPE_DROP;
-
-	memcg = __mem_cgroup_uncharge_common(page, ctype, false);
-
-	/*
-	 * record memcg information,  if swapout && memcg != NULL,
-	 * css_get() was called in uncharge().
-	 */
-	if (do_swap_account && swapout && memcg)
-		swap_cgroup_record(ent, mem_cgroup_id(memcg));
-}
-#endif
-
 #ifdef CONFIG_MEMCG_SWAP
-/*
- * called from swap_entry_free(). remove record in swap_cgroup and
- * uncharge "memsw" account.
- */
-void mem_cgroup_uncharge_swap(swp_entry_t ent)
-{
-	struct mem_cgroup *memcg;
-	unsigned short id;
-
-	if (!do_swap_account)
-		return;
-
-	id = swap_cgroup_record(ent, 0);
-	rcu_read_lock();
-	memcg = mem_cgroup_lookup(id);
-	if (memcg) {
-		/*
-		 * We uncharge this because swap is freed.
-		 * This memcg can be obsolete one. We avoid calling css_tryget
-		 */
-		res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
-		mem_cgroup_swap_statistics(memcg, false);
-		css_put(&memcg->css);
-	}
-	rcu_read_unlock();
-}
-
 /**
  * mem_cgroup_move_swap_account - move swap charge and swap_cgroup's record.
  * @entry: swap entry to be moved
@@ -3859,172 +3564,6 @@ static inline int mem_cgroup_move_swap_account(swp_entry_t entry,
 }
 #endif
 
-static void commit_charge(struct page *page, struct mem_cgroup *memcg,
-			  unsigned int nr_pages, bool anon, bool lrucare);
-
-/*
- * Before starting migration, account PAGE_SIZE to mem_cgroup that the old
- * page belongs to.
- */
-void mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
-				  struct mem_cgroup **memcgp)
-{
-	struct mem_cgroup *memcg = NULL;
-	unsigned int nr_pages = 1;
-	struct page_cgroup *pc;
-
-	*memcgp = NULL;
-
-	if (mem_cgroup_disabled())
-		return;
-
-	if (PageTransHuge(page))
-		nr_pages <<= compound_order(page);
-
-	pc = lookup_page_cgroup(page);
-	lock_page_cgroup(pc);
-	if (PageCgroupUsed(pc)) {
-		memcg = pc->mem_cgroup;
-		css_get(&memcg->css);
-		/*
-		 * At migrating an anonymous page, its mapcount goes down
-		 * to 0 and uncharge() will be called. But, even if it's fully
-		 * unmapped, migration may fail and this page has to be
-		 * charged again. We set MIGRATION flag here and delay uncharge
-		 * until end_migration() is called
-		 *
-		 * Corner Case Thinking
-		 * A)
-		 * When the old page was mapped as Anon and it's unmap-and-freed
-		 * while migration was ongoing.
-		 * If unmap finds the old page, uncharge() of it will be delayed
-		 * until end_migration(). If unmap finds a new page, it's
-		 * uncharged when it make mapcount to be 1->0. If unmap code
-		 * finds swap_migration_entry, the new page will not be mapped
-		 * and end_migration() will find it(mapcount==0).
-		 *
-		 * B)
-		 * When the old page was mapped but migraion fails, the kernel
-		 * remaps it. A charge for it is kept by MIGRATION flag even
-		 * if mapcount goes down to 0. We can do remap successfully
-		 * without charging it again.
-		 *
-		 * C)
-		 * The "old" page is under lock_page() until the end of
-		 * migration, so, the old page itself will not be swapped-out.
-		 * If the new page is swapped out before end_migraton, our
-		 * hook to usual swap-out path will catch the event.
-		 */
-		if (PageAnon(page))
-			SetPageCgroupMigration(pc);
-	}
-	unlock_page_cgroup(pc);
-	/*
-	 * If the page is not charged at this point,
-	 * we return here.
-	 */
-	if (!memcg)
-		return;
-
-	*memcgp = memcg;
-	/*
-	 * We charge new page before it's used/mapped. So, even if unlock_page()
-	 * is called before end_migration, we can catch all events on this new
-	 * page. In the case new page is migrated but not remapped, new page's
-	 * mapcount will be finally 0 and we call uncharge in end_migration().
-	 */
-	/*
-	 * The page is committed to the memcg, but it's not actually
-	 * charged to the res_counter since we plan on replacing the
-	 * old one and only one page is going to be left afterwards.
-	 */
-	commit_charge(newpage, memcg, nr_pages, PageAnon(page), false);
-}
-
-/* remove redundant charge if migration failed*/
-void mem_cgroup_end_migration(struct mem_cgroup *memcg,
-	struct page *oldpage, struct page *newpage, bool migration_ok)
-{
-	struct page *used, *unused;
-	struct page_cgroup *pc;
-	bool anon;
-
-	if (!memcg)
-		return;
-
-	if (!migration_ok) {
-		used = oldpage;
-		unused = newpage;
-	} else {
-		used = newpage;
-		unused = oldpage;
-	}
-	anon = PageAnon(used);
-	__mem_cgroup_uncharge_common(unused,
-				     anon ? MEM_CGROUP_CHARGE_TYPE_ANON
-				     : MEM_CGROUP_CHARGE_TYPE_CACHE,
-				     true);
-	css_put(&memcg->css);
-	/*
-	 * We disallowed uncharge of pages under migration because mapcount
-	 * of the page goes down to zero, temporarly.
-	 * Clear the flag and check the page should be charged.
-	 */
-	pc = lookup_page_cgroup(oldpage);
-	lock_page_cgroup(pc);
-	ClearPageCgroupMigration(pc);
-	unlock_page_cgroup(pc);
-
-	/*
-	 * If a page is a file cache, radix-tree replacement is very atomic
-	 * and we can skip this check. When it was an Anon page, its mapcount
-	 * goes down to 0. But because we added MIGRATION flage, it's not
-	 * uncharged yet. There are several case but page->mapcount check
-	 * and USED bit check in mem_cgroup_uncharge_page() will do enough
-	 * check. (see prepare_charge() also)
-	 */
-	if (anon)
-		mem_cgroup_uncharge_page(used);
-}
-
-/*
- * At replace page cache, newpage is not under any memcg but it's on
- * LRU. So, this function doesn't touch res_counter but handles LRU
- * in correct way. Both pages are locked so we cannot race with uncharge.
- */
-void mem_cgroup_replace_page_cache(struct page *oldpage,
-				  struct page *newpage)
-{
-	struct mem_cgroup *memcg = NULL;
-	struct page_cgroup *pc;
-
-	if (mem_cgroup_disabled())
-		return;
-
-	pc = lookup_page_cgroup(oldpage);
-	/* fix accounting on old pages */
-	lock_page_cgroup(pc);
-	if (PageCgroupUsed(pc)) {
-		memcg = pc->mem_cgroup;
-		mem_cgroup_charge_statistics(memcg, oldpage, false, -1);
-		ClearPageCgroupUsed(pc);
-	}
-	unlock_page_cgroup(pc);
-
-	/*
-	 * When called from shmem_replace_page(), in some cases the
-	 * oldpage has already been charged, and in some cases not.
-	 */
-	if (!memcg)
-		return;
-	/*
-	 * Even if newpage->mapping was NULL before starting replacement,
-	 * the newpage may be on LRU(or pagevec for LRU) already. We lock
-	 * LRU while we overwrite pc->mem_cgroup.
-	 */
-	commit_charge(newpage, memcg, 1, false, true);
-}
-
 #ifdef CONFIG_DEBUG_VM
 static struct page_cgroup *lookup_page_cgroup_used(struct page *page)
 {
@@ -6213,9 +5752,9 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
 	if (page) {
 		pc = lookup_page_cgroup(page);
 		/*
-		 * Do only loose check w/o page_cgroup lock.
-		 * mem_cgroup_move_account() checks the pc is valid or not under
-		 * the lock.
+		 * Do only loose check w/o serialization.
+		 * mem_cgroup_move_account() checks the pc is valid or
+		 * not under LRU exclusion.
 		 */
 		if (PageCgroupUsed(pc) && pc->mem_cgroup == mc.from) {
 			ret = MC_TARGET_PAGE;
@@ -6674,6 +6213,97 @@ static void __init enable_swap_cgroup(void)
 }
 #endif
 
+#ifdef CONFIG_MEMCG_SWAP
+/**
+ * mem_cgroup_swapout - transfer a memsw charge to swap
+ * @page: page whose memsw charge to transfer
+ * @entry: swap entry to move the charge to
+ *
+ * Transfer the memsw charge of @page to @entry.
+ */
+void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
+{
+	struct page_cgroup *pc;
+	unsigned short oldid;
+
+	VM_BUG_ON_PAGE(PageLRU(page), page);
+	VM_BUG_ON_PAGE(page_count(page), page);
+
+	if (!do_swap_account)
+		return;
+
+	pc = lookup_page_cgroup(page);
+
+	oldid = swap_cgroup_record(entry, mem_cgroup_id(pc->mem_cgroup));
+	VM_BUG_ON_PAGE(oldid, page);
+
+	pc->flags &= ~PCG_MEMSW;
+	css_get(&pc->mem_cgroup->css);
+	mem_cgroup_swap_statistics(pc->mem_cgroup, true);
+}
+
+/**
+ * mem_cgroup_uncharge_swap - uncharge a swap entry
+ * @entry: swap entry to uncharge
+ *
+ * Drop the memsw charge associated with @entry.
+ */
+void mem_cgroup_uncharge_swap(swp_entry_t entry)
+{
+	struct mem_cgroup *memcg;
+	unsigned short id;
+
+	if (!do_swap_account)
+		return;
+
+	id = swap_cgroup_record(entry, 0);
+	rcu_read_lock();
+	memcg = mem_cgroup_lookup(id);
+	if (memcg) {
+		res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
+		mem_cgroup_swap_statistics(memcg, false);
+		css_put(&memcg->css);
+	}
+	rcu_read_unlock();
+}
+#endif
+
+/**
+ * try_get_mem_cgroup_from_page - look up page's memcg association
+ * @page: the page
+ *
+ * Look up, get a css reference, and return the memcg that owns @page.
+ *
+ * The page must be locked to prevent racing with swap-in and page
+ * cache charges.  If coming from an unlocked page table, the caller
+ * must ensure the page is on the LRU or this can race with charging.
+ */
+struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
+{
+	struct mem_cgroup *memcg = NULL;
+	struct page_cgroup *pc;
+	unsigned short id;
+	swp_entry_t ent;
+
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+
+	pc = lookup_page_cgroup(page);
+	if (PageCgroupUsed(pc)) {
+		memcg = pc->mem_cgroup;
+		if (memcg && !css_tryget(&memcg->css))
+			memcg = NULL;
+	} else if (PageSwapCache(page)) {
+		ent.val = page_private(page);
+		id = lookup_swap_cgroup_id(ent);
+		rcu_read_lock();
+		memcg = mem_cgroup_lookup(id);
+		if (memcg && !css_tryget(&memcg->css))
+			memcg = NULL;
+		rcu_read_unlock();
+	}
+	return memcg;
+}
+
 static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 		      unsigned int nr_pages, bool oom)
 {
@@ -6855,15 +6485,13 @@ out:
 }
 
 static void commit_charge(struct page *page, struct mem_cgroup *memcg,
-			  unsigned int nr_pages, bool anon, bool lrucare)
+			  unsigned int nr_pages, bool lrucare)
 {
 	struct page_cgroup *pc = lookup_page_cgroup(page);
 	struct zone *uninitialized_var(zone);
 	bool was_on_lru = false;
 	struct lruvec *lruvec;
 
-	lock_page_cgroup(pc);
-
 	VM_BUG_ON_PAGE(PageCgroupUsed(pc), page);
 	VM_BUG_ON_PAGE(PageLRU(page) && !lrucare, page);
 
@@ -6877,9 +6505,22 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg,
 			was_on_lru = true;
 		}
 	}
-
+	/*
+	 * Nobody should be changing or seriously looking at
+	 * pc->mem_cgroup and pc->flags at this point:
+	 *
+	 * - the page is uncharged
+	 *
+	 * - the page is off-LRU
+	 *
+	 * - an anonymous fault has exclusive page access, except for
+	 *   a locked page table
+	 *
+	 * - the page is locked for page cache insertions, swapin
+	 *   faults, and migration.
+	 */
 	pc->mem_cgroup = memcg;
-	SetPageCgroupUsed(pc);
+	pc->flags = PCG_USED | PCG_MEM | PCG_MEMSW;
 
 	if (lrucare) {
 		if (was_on_lru) {
@@ -6891,9 +6532,7 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg,
 		spin_unlock_irq(&zone->lru_lock);
 	}
 
-	mem_cgroup_charge_statistics(memcg, page, anon, nr_pages);
-	unlock_page_cgroup(pc);
-
+	mem_cgroup_charge_statistics(memcg, page, nr_pages);
 	memcg_check_events(memcg, page);
 }
 
@@ -6936,7 +6575,7 @@ void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
 		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
 	}
 
-	commit_charge(page, memcg, nr_pages, PageAnon(page), lrucare);
+	commit_charge(page, memcg, nr_pages, lrucare);
 
 	if (do_swap_account && PageSwapCache(page)) {
 		swp_entry_t entry = { .val = page_private(page) };
@@ -6987,6 +6626,116 @@ void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg)
 	cancel_charge(memcg, nr_pages);
 }
 
+/**
+ * mem_cgroup_uncharge - uncharge a page
+ * @page: page to uncharge
+ *
+ * Uncharge a page previously charged with mem_cgroup_try_charge() and
+ * mem_cgroup_commit_charge().
+ */
+void mem_cgroup_uncharge(struct page *page)
+{
+	struct memcg_batch_info *batch;
+	unsigned int nr_pages = 1;
+	struct mem_cgroup *memcg;
+	struct page_cgroup *pc;
+	unsigned long flags;
+
+	VM_BUG_ON_PAGE(PageLRU(page), page);
+	VM_BUG_ON_PAGE(page_count(page), page);
+
+	if (mem_cgroup_disabled())
+		return;
+
+	pc = lookup_page_cgroup(page);
+
+	/* Every final put_page() ends up here */
+	if (!PageCgroupUsed(pc))
+		return;
+
+	if (PageTransHuge(page)) {
+		nr_pages <<= compound_order(page);
+		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
+	}
+	/*
+	 * Nobody should be changing or seriously looking at
+	 * pc->mem_cgroup and pc->flags at this point, we have fully
+	 * exclusive access to the page.
+	 */
+	memcg = pc->mem_cgroup;
+	flags = pc->flags;
+	pc->flags = 0;
+
+	mem_cgroup_charge_statistics(memcg, page, -nr_pages);
+	memcg_check_events(memcg, page);
+
+	batch = &current->memcg_batch;
+	if (!batch->memcg)
+		batch->memcg = memcg;
+	else if (batch->memcg != memcg)
+		goto uncharge;
+	if (nr_pages > 1)
+		goto uncharge;
+	if (!batch->do_batch)
+		goto uncharge;
+	if (test_thread_flag(TIF_MEMDIE))
+		goto uncharge;
+	if (flags & PCG_MEM)
+		batch->nr_pages++;
+	if (flags & PCG_MEMSW)
+		batch->memsw_nr_pages++;
+	return;
+uncharge:
+	if (flags & PCG_MEM)
+		res_counter_uncharge(&memcg->res, nr_pages * PAGE_SIZE);
+	if (flags & PCG_MEMSW)
+		res_counter_uncharge(&memcg->memsw, nr_pages * PAGE_SIZE);
+	if (batch->memcg != memcg)
+		memcg_oom_recover(memcg);
+}
+
+/**
+ * mem_cgroup_migrate - migrate a charge to another page
+ * @oldpage: currently charged page
+ * @newpage: page to transfer the charge to
+ * @lrucare: page might be on LRU already
+ *
+ * Migrate the charge from @oldpage to @newpage.
+ *
+ * Both pages must be locked, @newpage->mapping must be set up.
+ */
+void mem_cgroup_migrate(struct page *oldpage, struct page *newpage,
+			bool lrucare)
+{
+	unsigned int nr_pages = 1;
+	struct page_cgroup *pc;
+
+	VM_BUG_ON_PAGE(!PageLocked(oldpage), oldpage);
+	VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);
+	VM_BUG_ON_PAGE(PageLRU(oldpage), oldpage);
+	VM_BUG_ON_PAGE(PageLRU(newpage), newpage);
+	VM_BUG_ON_PAGE(PageAnon(oldpage) != PageAnon(newpage), newpage);
+
+	if (mem_cgroup_disabled())
+		return;
+
+	pc = lookup_page_cgroup(oldpage);
+	if (!PageCgroupUsed(pc))
+		return;
+
+	VM_BUG_ON_PAGE(!(pc->flags & PCG_MEM), page);
+	VM_BUG_ON_PAGE(!(pc->flags & PCG_MEMSW), page);
+	pc->flags &= ~(PCG_MEM | PCG_MEMSW);
+
+	if (PageTransHuge(oldpage)) {
+		nr_pages <<= compound_order(oldpage);
+		VM_BUG_ON_PAGE(!PageTransHuge(oldpage), oldpage);
+		VM_BUG_ON_PAGE(!PageTransHuge(newpage), newpage);
+	}
+
+	commit_charge(newpage, pc->mem_cgroup, nr_pages, lrucare);
+}
+
 /*
  * subsys_initcall() for memory controller.
  *
diff --git a/mm/migrate.c b/mm/migrate.c
index a88fabd71f87..80d33e62eb16 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -780,11 +780,14 @@ static int move_to_new_page(struct page *newpage, struct page *page,
 		rc = fallback_migrate_page(mapping, newpage, page, mode);
 
 	if (rc != MIGRATEPAGE_SUCCESS) {
-		newpage->mapping = NULL;
+		if (!PageAnon(newpage))
+			newpage->mapping = NULL;
 	} else {
 		if (remap_swapcache)
 			remove_migration_ptes(page, newpage);
-		page->mapping = NULL;
+		if (!PageAnon(page))
+			page->mapping = NULL;
+		mem_cgroup_migrate(page, newpage, false);
 	}
 
 	unlock_page(newpage);
@@ -797,7 +800,6 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 {
 	int rc = -EAGAIN;
 	int remap_swapcache = 1;
-	struct mem_cgroup *mem;
 	struct anon_vma *anon_vma = NULL;
 
 	if (!trylock_page(page)) {
@@ -823,9 +825,6 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 		lock_page(page);
 	}
 
-	/* charge against new page */
-	mem_cgroup_prepare_migration(page, newpage, &mem);
-
 	if (PageWriteback(page)) {
 		/*
 		 * Only in the case of a full synchronous migration is it
@@ -835,10 +834,10 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 		 */
 		if (mode != MIGRATE_SYNC) {
 			rc = -EBUSY;
-			goto uncharge;
+			goto out_unlock;
 		}
 		if (!force)
-			goto uncharge;
+			goto out_unlock;
 		wait_on_page_writeback(page);
 	}
 	/*
@@ -874,7 +873,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 			 */
 			remap_swapcache = 0;
 		} else {
-			goto uncharge;
+			goto out_unlock;
 		}
 	}
 
@@ -887,7 +886,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 		 * the page migration right away (proteced by page lock).
 		 */
 		rc = balloon_page_migrate(newpage, page, mode);
-		goto uncharge;
+		goto out_unlock;
 	}
 
 	/*
@@ -906,7 +905,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 		VM_BUG_ON_PAGE(PageAnon(page), page);
 		if (page_has_private(page)) {
 			try_to_free_buffers(page);
-			goto uncharge;
+			goto out_unlock;
 		}
 		goto skip_unmap;
 	}
@@ -925,10 +924,7 @@ skip_unmap:
 	if (anon_vma)
 		put_anon_vma(anon_vma);
 
-uncharge:
-	mem_cgroup_end_migration(mem, page, newpage,
-				 (rc == MIGRATEPAGE_SUCCESS ||
-				  rc == MIGRATEPAGE_BALLOON_SUCCESS));
+out_unlock:
 	unlock_page(page);
 out:
 	return rc;
@@ -1764,7 +1760,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	pg_data_t *pgdat = NODE_DATA(node);
 	int isolated = 0;
 	struct page *new_page = NULL;
-	struct mem_cgroup *memcg = NULL;
 	int page_lru = page_is_file_cache(page);
 	unsigned long mmun_start = address & HPAGE_PMD_MASK;
 	unsigned long mmun_end = mmun_start + HPAGE_PMD_SIZE;
@@ -1830,15 +1825,6 @@ fail_putback:
 		goto out_unlock;
 	}
 
-	/*
-	 * Traditional migration needs to prepare the memcg charge
-	 * transaction early to prevent the old page from being
-	 * uncharged when installing migration entries.  Here we can
-	 * save the potential rollback and start the charge transfer
-	 * only when migration is already known to end successfully.
-	 */
-	mem_cgroup_prepare_migration(page, new_page, &memcg);
-
 	orig_entry = *pmd;
 	entry = mk_pmd(new_page, vma->vm_page_prot);
 	entry = pmd_mkhuge(entry);
@@ -1867,14 +1853,11 @@ fail_putback:
 		goto fail_putback;
 	}
 
+	mem_cgroup_migrate(page, new_page, false);
+	lru_cache_add_active_or_unevictable(new_page, vma);
+
 	page_remove_rmap(page);
 
-	/*
-	 * Finish the charge transaction under the page table lock to
-	 * prevent split_huge_page() from dividing up the charge
-	 * before it's fully transferred to the new page.
-	 */
-	mem_cgroup_end_migration(memcg, page, new_page, true);
 	spin_unlock(ptl);
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 6b6fe5f4ece1..ac55c156ba69 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1076,7 +1076,6 @@ void page_remove_rmap(struct page *page)
 	if (unlikely(PageHuge(page)))
 		goto out;
 	if (anon) {
-		mem_cgroup_uncharge_page(page);
 		if (PageTransHuge(page))
 			__dec_zone_page_state(page,
 					      NR_ANON_TRANSPARENT_HUGEPAGES);
diff --git a/mm/shmem.c b/mm/shmem.c
index f8637acc2dad..d2ed1e6f1eaf 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -809,7 +809,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 	}
 
 	mutex_unlock(&shmem_swaplist_mutex);
-	swapcache_free(swap, NULL);
+	swapcache_free(swap);
 redirty:
 	set_page_dirty(page);
 	if (wbc->for_reclaim)
@@ -982,7 +982,7 @@ static int shmem_replace_page(struct page **pagep, gfp_t gfp,
 		 */
 		oldpage = newpage;
 	} else {
-		mem_cgroup_replace_page_cache(oldpage, newpage);
+		mem_cgroup_migrate(oldpage, newpage, false);
 		lru_cache_add_anon(newpage);
 		*pagep = newpage;
 	}
diff --git a/mm/swap.c b/mm/swap.c
index a5bdff331507..37abd8233613 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -62,6 +62,7 @@ static void __page_cache_release(struct page *page)
 		del_page_from_lru_list(page, lruvec, page_off_lru(page));
 		spin_unlock_irqrestore(&zone->lru_lock, flags);
 	}
+	mem_cgroup_uncharge(page);
 }
 
 static void __put_single_page(struct page *page)
@@ -872,6 +873,7 @@ void release_pages(struct page **pages, int nr, int cold)
 			__ClearPageLRU(page);
 			del_page_from_lru_list(page, lruvec, page_off_lru(page));
 		}
+		mem_cgroup_uncharge(page);
 
 		/* Clear Active bit in case of parallel mark_page_accessed */
 		ClearPageActive(page);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index e76ace30d436..c9ca7fe2c571 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -176,7 +176,7 @@ int add_to_swap(struct page *page, struct list_head *list)
 
 	if (unlikely(PageTransHuge(page)))
 		if (unlikely(split_huge_page_to_list(page, list))) {
-			swapcache_free(entry, NULL);
+			swapcache_free(entry);
 			return 0;
 		}
 
@@ -202,7 +202,7 @@ int add_to_swap(struct page *page, struct list_head *list)
 		 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
 		 * clear SWAP_HAS_CACHE flag.
 		 */
-		swapcache_free(entry, NULL);
+		swapcache_free(entry);
 		return 0;
 	}
 }
@@ -225,7 +225,7 @@ void delete_from_swap_cache(struct page *page)
 	__delete_from_swap_cache(page);
 	spin_unlock_irq(&address_space->tree_lock);
 
-	swapcache_free(entry, page);
+	swapcache_free(entry);
 	page_cache_release(page);
 }
 
@@ -386,7 +386,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
 		 * clear SWAP_HAS_CACHE flag.
 		 */
-		swapcache_free(entry, NULL);
+		swapcache_free(entry);
 	} while (err != -ENOMEM);
 
 	if (new_page)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 7c57c7256c6e..67caa7d88308 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -863,16 +863,13 @@ void swap_free(swp_entry_t entry)
 /*
  * Called after dropping swapcache to decrease refcnt to swap entries.
  */
-void swapcache_free(swp_entry_t entry, struct page *page)
+void swapcache_free(swp_entry_t entry)
 {
 	struct swap_info_struct *p;
-	unsigned char count;
 
 	p = swap_info_get(entry);
 	if (p) {
-		count = swap_entry_free(p, entry, SWAP_HAS_CACHE);
-		if (page)
-			mem_cgroup_uncharge_swapcache(page, entry, count != 0);
+		swap_entry_free(p, entry, SWAP_HAS_CACHE);
 		spin_unlock(&p->lock);
 	}
 }
diff --git a/mm/truncate.c b/mm/truncate.c
index e5cc39ab0751..dfb13f839323 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -556,7 +556,6 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page)
 	BUG_ON(page_has_private(page));
 	__delete_from_page_cache(page, NULL);
 	spin_unlock_irq(&mapping->tree_lock);
-	mem_cgroup_uncharge_cache_page(page);
 
 	if (mapping->a_ops->freepage)
 		mapping->a_ops->freepage(page);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9b6497eda806..016661d95f74 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -565,9 +565,10 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 
 	if (PageSwapCache(page)) {
 		swp_entry_t swap = { .val = page_private(page) };
+		mem_cgroup_swapout(page, swap);
 		__delete_from_swap_cache(page);
 		spin_unlock_irq(&mapping->tree_lock);
-		swapcache_free(swap, page);
+		swapcache_free(swap);
 	} else {
 		void (*freepage)(struct page *);
 		void *shadow = NULL;
@@ -588,7 +589,6 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 			shadow = workingset_eviction(mapping, page);
 		__delete_from_page_cache(page, shadow);
 		spin_unlock_irq(&mapping->tree_lock);
-		mem_cgroup_uncharge_cache_page(page);
 
 		if (freepage != NULL)
 			freepage(page);
@@ -1091,6 +1091,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		 */
 		__clear_page_locked(page);
 free_it:
+		mem_cgroup_uncharge(page);
 		nr_reclaimed++;
 
 		/*
@@ -1423,6 +1424,8 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
 			__ClearPageActive(page);
 			del_page_from_lru_list(page, lruvec, lru);
 
+			mem_cgroup_uncharge(page);
+
 			if (unlikely(PageCompound(page))) {
 				spin_unlock_irq(&zone->lru_lock);
 				(*get_compound_page_dtor(page))(page);
@@ -1631,6 +1634,8 @@ static void move_active_pages_to_lru(struct lruvec *lruvec,
 			__ClearPageActive(page);
 			del_page_from_lru_list(page, lruvec, lru);
 
+			mem_cgroup_uncharge(page);
+
 			if (unlikely(PageCompound(page))) {
 				spin_unlock_irq(&zone->lru_lock);
 				(*get_compound_page_dtor(page))(page);
diff --git a/mm/zswap.c b/mm/zswap.c
index aeaef0fb5624..efe018731e08 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -502,7 +502,7 @@ static int zswap_get_swap_cache_page(swp_entry_t entry,
 		 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
 		 * clear SWAP_HAS_CACHE flag.
 		 */
-		swapcache_free(entry, NULL);
+		swapcache_free(entry);
 	} while (err != -ENOMEM);
 
 	if (new_page)
-- 
1.9.2


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [patch 0/9] mm: memcontrol: naturalize charge lifetime
  2014-04-30 20:25 [patch 0/9] mm: memcontrol: naturalize charge lifetime Johannes Weiner
                   ` (8 preceding siblings ...)
  2014-04-30 20:25 ` [patch 9/9] mm: memcontrol: rewrite uncharge API Johannes Weiner
@ 2014-05-02 11:26 ` Michal Hocko
  9 siblings, 0 replies; 35+ messages in thread
From: Michal Hocko @ 2014-05-02 11:26 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: linux-mm, Hugh Dickins, Tejun Heo, cgroups, linux-kernel

On Wed 30-04-14 16:25:34, Johannes Weiner wrote:
[...]
>  Documentation/cgroups/memcg_test.txt |  160 +--
>  include/linux/memcontrol.h           |   94 +-
>  include/linux/page_cgroup.h          |   43 +-
>  include/linux/swap.h                 |   15 +-
>  kernel/events/uprobes.c              |    1 +
>  mm/filemap.c                         |   13 +-
>  mm/huge_memory.c                     |   51 +-
>  mm/memcontrol.c                      | 1724 ++++++++++++--------------------
>  mm/memory.c                          |   41 +-
>  mm/migrate.c                         |   46 +-
>  mm/rmap.c                            |    6 -
>  mm/shmem.c                           |   28 +-
>  mm/swap.c                            |   22 +
>  mm/swap_state.c                      |    8 +-
>  mm/swapfile.c                        |   21 +-
>  mm/truncate.c                        |    1 -
>  mm/vmscan.c                          |    9 +-
>  mm/zswap.c                           |    2 +-
>  18 files changed, 833 insertions(+), 1452 deletions(-)

Impressive! I will get through the series but it will take some time.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch 9/9] mm: memcontrol: rewrite uncharge API
  2014-04-30 20:25 ` [patch 9/9] mm: memcontrol: rewrite uncharge API Johannes Weiner
@ 2014-05-04 14:32   ` Johannes Weiner
  2014-05-27  7:43   ` Kamezawa Hiroyuki
  1 sibling, 0 replies; 35+ messages in thread
From: Johannes Weiner @ 2014-05-04 14:32 UTC (permalink / raw)
  To: linux-mm; +Cc: Michal Hocko, Hugh Dickins, Tejun Heo, cgroups, linux-kernel

On Wed, Apr 30, 2014 at 04:25:43PM -0400, Johannes Weiner wrote:
> The memcg uncharging code that is involved towards the end of a page's
> lifetime - truncation, reclaim, swapout, migration - is impressively
> complicated and fragile.
> 
> Because anonymous and file pages were always charged before they had
> their page->mapping established, uncharges had to happen when the page
> type could be known from the context, as in unmap for anonymous, page
> cache removal for file and shmem pages, and swap cache truncation for
> swap pages.  However, these operations also happen well before the
> page is actually freed, and so a lot of synchronization is necessary:
> 
> - On page migration, the old page might be unmapped but then reused,
>   so memcg code has to prevent an untimely uncharge in that case.
>   Because this code - which should be a simple charge transfer - is so
>   special-cased, it is not reusable for replace_page_cache().
> 
> - Swap cache truncation happens during both swap-in and swap-out, and
>   possibly repeatedly before the page is actually freed.  This means
>   that the memcg swapout code is called from many contexts that make
>   no sense and it has to figure out the direction from page state to
>   make sure memory and memory+swap are always correctly charged.
> 
> But now that charged pages always have a page->mapping, introduce
> mem_cgroup_uncharge(), which is called after the final put_page(),
> when we know for sure that nobody is looking at the page anymore.
> 
> For page migration, introduce mem_cgroup_migrate(), which is called
> after the migration is successful and the new page is fully rmapped.
> Because the old page is no longer uncharged after migration, prevent
> double charges by decoupling the page's memcg association (PCG_USED
> and pc->mem_cgroup) from the page holding an actual charge.  The new
> bits PCG_MEM and PCG_MEMSW represent the respective charges and are
> transferred to the new page during migration.
> 
> mem_cgroup_migrate() is suitable for replace_page_cache() as well.
> 
> Swap accounting is massively simplified: because the page is no longer
> uncharged as early as swap cache deletion, a new mem_cgroup_swapout()
> can transfer the page's memory+swap charge (PCG_MEMSW) to the swap
> entry before the final put_page() in page reclaim.
> 
> Finally, because pages are now charged under proper serialization
> (anon: exclusive; cache: page lock; swapin: page lock; migration: page
> lock), and uncharged under full exclusion, they can not race with
> themselves.  Because they are also off-LRU during charge/uncharge,
> charge migration can not race, with that, either.  Remove the crazily
> expensive the page_cgroup lock and set pc->flags non-atomically.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
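
To illustrate where the new entry points end up, here is a condensed
sketch of the call sites as they look after this patch - paraphrased
from the hunks in this series, with locking and error handling elided:

	/* reclaim swapout, __remove_mapping(): hand memsw over to the entry */
	swp_entry_t swap = { .val = page_private(page) };
	mem_cgroup_swapout(page, swap);
	__delete_from_swap_cache(page);
	...
	swapcache_free(swap);

	/* final put_page(), e.g. __page_cache_release()/release_pages() */
	del_page_from_lru_list(page, lruvec, page_off_lru(page));
	...
	mem_cgroup_uncharge(page);

	/* successful migration, move_to_new_page() */
	remove_migration_ptes(page, newpage);
	...
	mem_cgroup_migrate(page, newpage, false);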

Follow-up fixlets to this change that fell out of more testing in
production and more auditing so far:

- Document mem_cgroup_move_account() exclusion
- Catch uncharged swapin readahead pages in mem_cgroup_swapout()
- Fix DEBUG_VM build after last-minute identifier rename
- Drop duplicate lru_cache_add_active_or_unevictable() in THP migration

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0add8b7b3a6c..f73df16b8115 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3387,6 +3387,12 @@ static int mem_cgroup_move_account(struct page *page,
 
 	mem_cgroup_charge_statistics(from, page, -nr_pages);
 
+	/*
+	 * It is safe to change pc->mem_cgroup here because the page
+	 * is referenced, charged, and isolated - we can't race with
+	 * uncharging, charging, migration, or LRU putback.
+	 */
+
 	/* caller should have done css_get */
 	pc->mem_cgroup = to;
 	mem_cgroup_charge_statistics(to, page, nr_pages);
@@ -6234,6 +6240,12 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 
 	pc = lookup_page_cgroup(page);
 
+	/* Readahead page, never charged */
+	if (!PageCgroupUsed(pc))
+		return;
+
+	VM_BUG_ON_PAGE(!(pc->flags & PCG_MEMSW), page);
+
 	oldid = swap_cgroup_record(entry, mem_cgroup_id(pc->mem_cgroup));
 	VM_BUG_ON_PAGE(oldid, page);
 
@@ -6723,8 +6735,8 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage,
 	if (!PageCgroupUsed(pc))
 		return;
 
-	VM_BUG_ON_PAGE(!(pc->flags & PCG_MEM), page);
-	VM_BUG_ON_PAGE(!(pc->flags & PCG_MEMSW), page);
+	VM_BUG_ON_PAGE(!(pc->flags & PCG_MEM), oldpage);
+	VM_BUG_ON_PAGE(!(pc->flags & PCG_MEMSW), oldpage);
 	pc->flags &= ~(PCG_MEM | PCG_MEMSW);
 
 	if (PageTransHuge(oldpage)) {
diff --git a/mm/migrate.c b/mm/migrate.c
index 80d33e62eb16..afe688021699 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1839,7 +1839,6 @@ fail_putback:
 	 */
 	flush_cache_range(vma, mmun_start, mmun_end);
 	page_add_new_anon_rmap(new_page, vma, mmun_start);
-	lru_cache_add_active_or_unevictable(new_page, vma);
 	pmdp_clear_flush(vma, mmun_start, pmd);
 	set_pmd_at(mm, mmun_start, pmd, entry);
 	flush_tlb_range(vma, mmun_start, mmun_end);


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [patch 2/9] mm: memcontrol: rearrange charging fast path
  2014-04-30 20:25 ` [patch 2/9] mm: memcontrol: rearrange charging fast path Johannes Weiner
@ 2014-05-07 14:33   ` Michal Hocko
  2014-05-08 18:22     ` Johannes Weiner
  0 siblings, 1 reply; 35+ messages in thread
From: Michal Hocko @ 2014-05-07 14:33 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: linux-mm, Hugh Dickins, Tejun Heo, cgroups, linux-kernel

On Wed 30-04-14 16:25:36, Johannes Weiner wrote:
> The charging path currently starts out with OOM condition checks when
> OOM is the rarest possible case.
> 
> Rearrange this code to run OOM/task dying checks only after trying the
> percpu charge and the res_counter charge and bail out before entering
> reclaim.  Attempting a charge does not hurt an (oom-)killed task as
> much as every charge attempt having to check OOM conditions. 

OK, I've never considered those to be measurable but it is true that the
numbers accumulate over time.

So yes, this makes sense.

> Also, only check __GFP_NOFAIL when the charge would actually fail.

OK, but return ENOMEM as pointed below.

> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/memcontrol.c | 31 ++++++++++++++++---------------
>  1 file changed, 16 insertions(+), 15 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 75dfeb8fa98b..6ce59146fec7 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2598,21 +2598,6 @@ static int mem_cgroup_try_charge(struct mem_cgroup *memcg,
>  
>  	if (mem_cgroup_is_root(memcg))
>  		goto done;
> -	/*
> -	 * Unlike in global OOM situations, memcg is not in a physical
> -	 * memory shortage.  Allow dying and OOM-killed tasks to
> -	 * bypass the last charges so that they can exit quickly and
> -	 * free their memory.
> -	 */
> -	if (unlikely(test_thread_flag(TIF_MEMDIE) ||
> -		     fatal_signal_pending(current)))
> -		goto bypass;

This is missing "memcg: do not hang on OOM when killed by userspace OOM
access to memory reserves" - trivial to resolve.

> -
> -	if (unlikely(task_in_memcg_oom(current)))
> -		goto nomem;
> -
> -	if (gfp_mask & __GFP_NOFAIL)
> -		oom = false;
>  retry:
>  	if (consume_stock(memcg, nr_pages))
>  		goto done;
[...]
> @@ -2662,6 +2660,9 @@ retry:
>  	if (mem_cgroup_wait_acct_move(mem_over_limit))
>  		goto retry;
>  
> +	if (gfp_mask & __GFP_NOFAIL)
> +		goto bypass;
> +

This is a behavior change because we have returned ENOMEM previously

>  	if (fatal_signal_pending(current))
>  		goto bypass;
>  
	if (!oom)
		goto nomem;

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch 3/9] mm: memcontrol: retry reclaim for oom-disabled and __GFP_NOFAIL charges
  2014-04-30 20:25 ` [patch 3/9] mm: memcontrol: retry reclaim for oom-disabled and __GFP_NOFAIL charges Johannes Weiner
@ 2014-05-07 14:43   ` Michal Hocko
  2014-05-08 18:28     ` Johannes Weiner
  0 siblings, 1 reply; 35+ messages in thread
From: Michal Hocko @ 2014-05-07 14:43 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: linux-mm, Hugh Dickins, Tejun Heo, cgroups, linux-kernel

On Wed 30-04-14 16:25:37, Johannes Weiner wrote:
> There is no reason why oom-disabled and __GFP_NOFAIL charges should
> try to reclaim only once when every other charge tries several times
> before giving up.  Make them all retry the same number of times.

I guess the idea was that oom disabled (THP) allocation can fall back to
a smaller allocation. I would suspect this would increase latency for
THP page faults.

__GFP_NOFAIL is a different story and it can be retried.
 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/memcontrol.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 6ce59146fec7..c431a30280ac 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2589,7 +2589,7 @@ static int mem_cgroup_try_charge(struct mem_cgroup *memcg,
>  				 bool oom)
>  {
>  	unsigned int batch = max(CHARGE_BATCH, nr_pages);
> -	int nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
> +	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
>  	struct mem_cgroup *mem_over_limit;
>  	struct res_counter *fail_res;
>  	unsigned long nr_reclaimed;
> @@ -2660,6 +2660,9 @@ retry:
>  	if (mem_cgroup_wait_acct_move(mem_over_limit))
>  		goto retry;
>  
> +	if (nr_retries--)
> +		goto retry;
> +
>  	if (gfp_mask & __GFP_NOFAIL)
>  		goto bypass;
>  
> @@ -2669,9 +2672,6 @@ retry:
>  	if (!oom)
>  		goto nomem;
>  
> -	if (nr_oom_retries--)
> -		goto retry;
> -
>  	mem_cgroup_oom(mem_over_limit, gfp_mask, get_order(batch));
>  nomem:
>  	if (!(gfp_mask & __GFP_NOFAIL))
> -- 
> 1.9.2
> 

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch 4/9] mm: memcontrol: catch root bypass in move precharge
  2014-04-30 20:25 ` [patch 4/9] mm: memcontrol: catch root bypass in move precharge Johannes Weiner
@ 2014-05-07 14:55   ` Michal Hocko
  2014-05-08 18:30     ` Johannes Weiner
  0 siblings, 1 reply; 35+ messages in thread
From: Michal Hocko @ 2014-05-07 14:55 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: linux-mm, Hugh Dickins, Tejun Heo, cgroups, linux-kernel

On Wed 30-04-14 16:25:38, Johannes Weiner wrote:
[...]
> @@ -6546,8 +6546,9 @@ one_by_one:
>  			cond_resched();
>  		}
>  		ret = mem_cgroup_try_charge(memcg, GFP_KERNEL, 1, false);
> +		if (ret == -EINTR)
> +			__mem_cgroup_cancel_charge(root_mem_cgroup, 1);
>  		if (ret)
> -			/* mem_cgroup_clear_mc() will do uncharge later */

I would prefer to keep the comment and explain that we will lose the return
code on the way and that is why the cancel on root has to be done here.
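
Something roughly along these lines - the wording is only a sketch:

	ret = mem_cgroup_try_charge(memcg, GFP_KERNEL, 1, false);
	/*
	 * -EINTR means the charge was bypassed to the root cgroup, but
	 * the return code is lost before mem_cgroup_clear_mc() gets to
	 * see it, so the root charge has to be cancelled right here.
	 */
	if (ret == -EINTR)
		__mem_cgroup_cancel_charge(root_mem_cgroup, 1);
	if (ret)
		return ret;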

>  			return ret;
>  		mc.precharge++;
>  	}
> -- 
> 1.9.2
> 

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch 5/9] mm: memcontrol: use root_mem_cgroup res_counter
  2014-04-30 20:25 ` [patch 5/9] mm: memcontrol: use root_mem_cgroup res_counter Johannes Weiner
@ 2014-05-07 15:14   ` Michal Hocko
  0 siblings, 0 replies; 35+ messages in thread
From: Michal Hocko @ 2014-05-07 15:14 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: linux-mm, Hugh Dickins, Tejun Heo, cgroups, linux-kernel

On Wed 30-04-14 16:25:39, Johannes Weiner wrote:
> The root_mem_cgroup res_counter is never charged itself: there is no
> limit at the root level anyway, and any statistics are generated on
> demand by summing up the counters of all other cgroups.  This was an
> optimization to keep down costs on systems that don't create specific
> cgroups, but with per-cpu charge caches the res_counter operations do
> not even show up on in profiles anymore.  Just remove it and simplify
> the code.

It seems that only kmem and the tcp thingy are left but those look much
harder and they are not directly related.

root_mem_cgroup use also seems to be correct.

I have to look at this closer and that will be no sooner than on Monday
as I am off for the rest of the week.
[...]
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch 2/9] mm: memcontrol: rearrange charging fast path
  2014-05-07 14:33   ` Michal Hocko
@ 2014-05-08 18:22     ` Johannes Weiner
  2014-05-12  7:59       ` Michal Hocko
  0 siblings, 1 reply; 35+ messages in thread
From: Johannes Weiner @ 2014-05-08 18:22 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, Hugh Dickins, Tejun Heo, cgroups, linux-kernel

On Wed, May 07, 2014 at 04:33:34PM +0200, Michal Hocko wrote:
> On Wed 30-04-14 16:25:36, Johannes Weiner wrote:
> > The charging path currently starts out with OOM condition checks when
> > OOM is the rarest possible case.
> > 
> > Rearrange this code to run OOM/task dying checks only after trying the
> > percpu charge and the res_counter charge and bail out before entering
> > reclaim.  Attempting a charge does not hurt an (oom-)killed task as
> > much as every charge attempt having to check OOM conditions. 
> 
> OK, I've never considered those to be measurable but it is true that the
> numbers accumulate over time.
> 
> So yes, this makes sense.
> 
> > Also, only check __GFP_NOFAIL when the charge would actually fail.
> 
> OK, but return ENOMEM as pointed below.
> 
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > ---
> >  mm/memcontrol.c | 31 ++++++++++++++++---------------
> >  1 file changed, 16 insertions(+), 15 deletions(-)
> > 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 75dfeb8fa98b..6ce59146fec7 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2598,21 +2598,6 @@ static int mem_cgroup_try_charge(struct mem_cgroup *memcg,
> >  
> >  	if (mem_cgroup_is_root(memcg))
> >  		goto done;
> > -	/*
> > -	 * Unlike in global OOM situations, memcg is not in a physical
> > -	 * memory shortage.  Allow dying and OOM-killed tasks to
> > -	 * bypass the last charges so that they can exit quickly and
> > -	 * free their memory.
> > -	 */
> > -	if (unlikely(test_thread_flag(TIF_MEMDIE) ||
> > -		     fatal_signal_pending(current)))
> > -		goto bypass;
> 
> This is missing "memcg: do not hang on OOM when killed by userspace OOM
> access to memory reserves" - trivial to resolve.

Yep, will rebase before the next submission.

> > -	if (unlikely(task_in_memcg_oom(current)))
> > -		goto nomem;
> > -
> > -	if (gfp_mask & __GFP_NOFAIL)
> > -		oom = false;
> >  retry:
> >  	if (consume_stock(memcg, nr_pages))
> >  		goto done;
> [...]
> > @@ -2662,6 +2660,9 @@ retry:
> >  	if (mem_cgroup_wait_acct_move(mem_over_limit))
> >  		goto retry;
> >  
> > +	if (gfp_mask & __GFP_NOFAIL)
> > +		goto bypass;
> > +
> 
> This is a behavior change because we have returned ENOMEM previously

__GFP_NOFAIL must never return -ENOMEM, or we'd have to rename it ;-)
It just looks like this in the patch, but this is the label code:

nomem:
	if (!(gfp_mask & __GFP_NOFAIL))
		return -ENOMEM;
bypass:
	...
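
So with the reordering the tail of the slow path reads roughly like
this (sketch, a few unrelated lines elided):

	if (gfp_mask & __GFP_NOFAIL)
		goto bypass;

	if (fatal_signal_pending(current))
		goto bypass;

	if (!oom)
		goto nomem;

	if (nr_oom_retries--)
		goto retry;

	mem_cgroup_oom(mem_over_limit, gfp_mask, get_order(batch));
nomem:
	if (!(gfp_mask & __GFP_NOFAIL))
		return -ENOMEM;
bypass:
	...

A __GFP_NOFAIL charge never returns -ENOMEM: it either branches to
bypass directly or falls through the check under the nomem label.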


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch 3/9] mm: memcontrol: retry reclaim for oom-disabled and __GFP_NOFAIL charges
  2014-05-07 14:43   ` Michal Hocko
@ 2014-05-08 18:28     ` Johannes Weiner
  0 siblings, 0 replies; 35+ messages in thread
From: Johannes Weiner @ 2014-05-08 18:28 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, Hugh Dickins, Tejun Heo, cgroups, linux-kernel

On Wed, May 07, 2014 at 04:43:39PM +0200, Michal Hocko wrote:
> On Wed 30-04-14 16:25:37, Johannes Weiner wrote:
> > There is no reason why oom-disabled and __GFP_NOFAIL charges should
> > try to reclaim only once when every other charge tries several times
> > before giving up.  Make them all retry the same number of times.
> 
> I guess the idea was that oom disabled (THP) allocation can fall back to
> a smaller allocation. I would suspect this would increase latency for
> THP page faults.

If it does, we should probably teach THP to use __GFP_NORETRY.

On that note, __GFP_NORETRY is currently useless for memcg because it
has !__GFP_WAIT semantics...  I'll include a fix for that in v2.
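
The direction I have in mind is something like the below - just a
sketch of the idea, not the actual v2 change:

	/* in mem_cgroup_try_charge(), after the first round of reclaim: */
	if (gfp_mask & __GFP_NORETRY)
		goto nomem;

so that __GFP_NORETRY still gets one reclaim pass but skips the retry
loop and the OOM path instead of being treated like !__GFP_WAIT.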


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch 4/9] mm: memcontrol: catch root bypass in move precharge
  2014-05-07 14:55   ` Michal Hocko
@ 2014-05-08 18:30     ` Johannes Weiner
  0 siblings, 0 replies; 35+ messages in thread
From: Johannes Weiner @ 2014-05-08 18:30 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, Hugh Dickins, Tejun Heo, cgroups, linux-kernel

On Wed, May 07, 2014 at 04:55:53PM +0200, Michal Hocko wrote:
> On Wed 30-04-14 16:25:38, Johannes Weiner wrote:
> [...]
> > @@ -6546,8 +6546,9 @@ one_by_one:
> >  			cond_resched();
> >  		}
> >  		ret = mem_cgroup_try_charge(memcg, GFP_KERNEL, 1, false);
> > +		if (ret == -EINTR)
> > +			__mem_cgroup_cancel_charge(root_mem_cgroup, 1);
> >  		if (ret)
> > -			/* mem_cgroup_clear_mc() will do uncharge later */
> 
> I would prefer to keep the comment and explain that we will lose the return
> code on the way and that is why the cancel on root has to be done here.

That makes sense, I'll add an explanation of who is (un)charged when
and where.  Thanks!


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch 2/9] mm: memcontrol: rearrange charging fast path
  2014-05-08 18:22     ` Johannes Weiner
@ 2014-05-12  7:59       ` Michal Hocko
  0 siblings, 0 replies; 35+ messages in thread
From: Michal Hocko @ 2014-05-12  7:59 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: linux-mm, Hugh Dickins, Tejun Heo, cgroups, linux-kernel

On Thu 08-05-14 14:22:24, Johannes Weiner wrote:
> On Wed, May 07, 2014 at 04:33:34PM +0200, Michal Hocko wrote:
> > On Wed 30-04-14 16:25:36, Johannes Weiner wrote:
[...]
> > > -	if (unlikely(task_in_memcg_oom(current)))
> > > -		goto nomem;
> > > -
> > > -	if (gfp_mask & __GFP_NOFAIL)
> > > -		oom = false;
> > >  retry:
> > >  	if (consume_stock(memcg, nr_pages))
> > >  		goto done;
> > [...]
> > > @@ -2662,6 +2660,9 @@ retry:
> > >  	if (mem_cgroup_wait_acct_move(mem_over_limit))
> > >  		goto retry;
> > >  
> > > +	if (gfp_mask & __GFP_NOFAIL)
> > > +		goto bypass;
> > > +
> > 
> > This is a behavior change because we have returned ENOMEM previously
> 
> __GFP_NOFAIL must never return -ENOMEM, or we'd have to rename it ;-)
> It just looks like this in the patch, but this is the label code:
> 
> nomem:
> 	if (!(gfp_mask & __GFP_NOFAIL))
> 		return -ENOMEM;
> bypass:
> 	...

Ouch. Brain fart. Sorry...

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch 6/9] mm: memcontrol: remove ordering between pc->mem_cgroup and PageCgroupUsed
  2014-04-30 20:25 ` [patch 6/9] mm: memcontrol: remove ordering between pc->mem_cgroup and PageCgroupUsed Johannes Weiner
@ 2014-05-23 13:20   ` Michal Hocko
  2014-05-27 19:45     ` Johannes Weiner
  0 siblings, 1 reply; 35+ messages in thread
From: Michal Hocko @ 2014-05-23 13:20 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: linux-mm, Hugh Dickins, Tejun Heo, cgroups, linux-kernel

On Wed 30-04-14 16:25:40, Johannes Weiner wrote:
> There is a write barrier between setting pc->mem_cgroup and
> PageCgroupUsed, which was added to allow LRU operations to lookup the
> memcg LRU list of a page without acquiring the page_cgroup lock.  But
> ever since 38c5d72f3ebe ("memcg: simplify LRU handling by new rule"),
> pages are ensured to be off-LRU while charging, so nobody else is
> changing LRU state while pc->mem_cgroup is being written.

This is quite confusing. Why do we have the lrucare path then?
The code is quite tricky so this deserves a more detailed explanation
IMO.

There are only 3 paths which check both the flag and mem_cgroup (without
page_cgroup_lock): get_mctgt_type* and mem_cgroup_page_lruvec AFAICS.
None of them has an rmb so there was no guarantee about ordering anyway.
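
To spell out the pairing the wmb was supposedly part of - a sketch of
the old charge side against one of those unlocked readers, which never
had a matching rmb:

	/* charge side (old code) */
	pc->mem_cgroup = memcg;
	smp_wmb();
	SetPageCgroupUsed(pc);

	/* unlocked lookup side, e.g. get_mctgt_type() */
	if (PageCgroupUsed(pc) && pc->mem_cgroup == mc.from)
		ret = MC_TARGET_PAGE;	/* no smp_rmb() anywhere here */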

> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Anyway, the change is welcome
Acked-by: Michal Hocko <mhocko@suse.cz>

> ---
>  mm/memcontrol.c | 9 ---------
>  1 file changed, 9 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 34407d99262a..c528ae9ac230 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2823,14 +2823,6 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg,
>  	}
>  
>  	pc->mem_cgroup = memcg;
> -	/*
> -	 * We access a page_cgroup asynchronously without lock_page_cgroup().
> -	 * Especially when a page_cgroup is taken from a page, pc->mem_cgroup
> -	 * is accessed after testing USED bit. To make pc->mem_cgroup visible
> -	 * before USED bit, we need memory barrier here.
> -	 * See mem_cgroup_add_lru_list(), etc.
> -	 */
> -	smp_wmb();
>  	SetPageCgroupUsed(pc);
>  
>  	if (lrucare) {
> @@ -3609,7 +3601,6 @@ void mem_cgroup_split_huge_fixup(struct page *head)
>  	for (i = 1; i < HPAGE_PMD_NR; i++) {
>  		pc = head_pc + i;
>  		pc->mem_cgroup = memcg;
> -		smp_wmb();/* see __commit_charge() */
>  		pc->flags = head_pc->flags & ~PCGF_NOCOPY_AT_SPLIT;
>  	}
>  	__this_cpu_sub(memcg->stat->count[MEM_CGROUP_STAT_RSS_HUGE],
> -- 
> 1.9.2
> 

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch 7/9] mm: memcontrol: do not acquire page_cgroup lock for kmem pages
  2014-04-30 20:25 ` [patch 7/9] mm: memcontrol: do not acquire page_cgroup lock for kmem pages Johannes Weiner
@ 2014-05-23 13:39   ` Michal Hocko
  2014-05-23 13:40     ` Michal Hocko
                       ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: Michal Hocko @ 2014-05-23 13:39 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: linux-mm, Hugh Dickins, Tejun Heo, cgroups, linux-kernel

I am adding Vladimir to CC

On Wed 30-04-14 16:25:41, Johannes Weiner wrote:
> Kmem page charging and uncharging is serialized by means of exclusive
> access to the page.  Do not take the page_cgroup lock and don't set
> pc->flags atomically.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

The patch is correct I just have some comments below.
Anyway
Acked-by: Michal Hocko <mhocko@suse.cz>

> ---
>  mm/memcontrol.c | 16 +++-------------
>  1 file changed, 3 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c528ae9ac230..d3961fce1d54 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3535,10 +3535,8 @@ void __memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg,
>  	}
>  

	/*
	 * given page is newly allocated and invisible to everybody but
	 * the caller so there is no need to use page_cgroup lock nor
	 * SetPageCgroupUsed
	 */

would be helpful?

>  	pc = lookup_page_cgroup(page);
> -	lock_page_cgroup(pc);
>  	pc->mem_cgroup = memcg;
> -	SetPageCgroupUsed(pc);
> -	unlock_page_cgroup(pc);
> +	pc->flags = PCG_USED;
>  }
>  
>  void __memcg_kmem_uncharge_pages(struct page *page, int order)
> @@ -3548,19 +3546,11 @@ void __memcg_kmem_uncharge_pages(struct page *page, int order)
>  
>  
>  	pc = lookup_page_cgroup(page);
> -	/*
> -	 * Fast unlocked return. Theoretically might have changed, have to
> -	 * check again after locking.
> -	 */

This comment has been there since the code was merged. Maybe it was true
at the time, but after "mm: get rid of __GFP_KMEMCG" it is definitely out
of date.  How about something like:

	/*
	 * The page is going away and will be freed, and nobody can see
	 * it anymore, so there is no need to take the page_cgroup lock.
	 */
>  	if (!PageCgroupUsed(pc))
>  		return;
>  
> -	lock_page_cgroup(pc);
> -	if (PageCgroupUsed(pc)) {
> -		memcg = pc->mem_cgroup;
> -		ClearPageCgroupUsed(pc);
> -	}
> -	unlock_page_cgroup(pc);

Maybe add

	WARN_ON_ONCE(pc->flags != PCG_USED);

to check for unexpected flags usage in the kmem path?
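
Just to be explicit about the placement I mean: something like the
following in the post-patch __memcg_kmem_uncharge_pages() (untested
sketch):

	pc = lookup_page_cgroup(page);
	if (!PageCgroupUsed(pc))
		return;

	/* kmem pages should never carry anything beyond the USED bit */
	WARN_ON_ONCE(pc->flags != PCG_USED);

	memcg = pc->mem_cgroup;
	pc->flags = 0;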

> +	memcg = pc->mem_cgroup;
> +	pc->flags = 0;
>  
>  	/*
>  	 * We trust that only if there is a memcg associated with the page, it
> -- 
> 1.9.2
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch 7/9] mm: memcontrol: do not acquire page_cgroup lock for kmem pages
  2014-05-23 13:39   ` Michal Hocko
@ 2014-05-23 13:40     ` Michal Hocko
  2014-05-23 14:29     ` Vladimir Davydov
  2014-05-27 19:53     ` Johannes Weiner
  2 siblings, 0 replies; 35+ messages in thread
From: Michal Hocko @ 2014-05-23 13:40 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Hugh Dickins, Tejun Heo, cgroups, linux-kernel,
	Vladimir Davydov

On Fri 23-05-14 15:39:38, Michal Hocko wrote:
> I am adding Vladimir to CC

And now for real
[...]

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch 8/9] mm: memcontrol: rewrite charge API
  2014-04-30 20:25 ` [patch 8/9] mm: memcontrol: rewrite charge API Johannes Weiner
@ 2014-05-23 14:18   ` Michal Hocko
  2014-05-23 14:54   ` Michal Hocko
  1 sibling, 0 replies; 35+ messages in thread
From: Michal Hocko @ 2014-05-23 14:18 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: linux-mm, Hugh Dickins, Tejun Heo, cgroups, linux-kernel

On Wed 30-04-14 16:25:42, Johannes Weiner wrote:
[...]
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d3961fce1d54..6f48e292ffe7 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2574,163 +2574,6 @@ static int memcg_cpu_hotplug_callback(struct notifier_block *nb,
>  	return NOTIFY_OK;
>  }
>  
> -/**
> - * mem_cgroup_try_charge - try charging a memcg
> - * @memcg: memcg to charge
> - * @nr_pages: number of pages to charge
> - * @oom: trigger OOM if reclaim fails
> - *
> - * Returns 0 if @memcg was charged successfully, -EINTR if the charge
> - * was bypassed to root_mem_cgroup, and -ENOMEM if the charge failed.
> - */
> -static int mem_cgroup_try_charge(struct mem_cgroup *memcg,
> -				 gfp_t gfp_mask,
> -				 unsigned int nr_pages,
> -				 bool oom)

Why haven't you simply renamed mem_cgroup_try_charge to try_charge here?
The code move is really hard to review.
[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch 7/9] mm: memcontrol: do not acquire page_cgroup lock for kmem pages
  2014-05-23 13:39   ` Michal Hocko
  2014-05-23 13:40     ` Michal Hocko
@ 2014-05-23 14:29     ` Vladimir Davydov
  2014-05-27 19:53     ` Johannes Weiner
  2 siblings, 0 replies; 35+ messages in thread
From: Vladimir Davydov @ 2014-05-23 14:29 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, linux-mm, Hugh Dickins, Tejun Heo, cgroups,
	linux-kernel

On Fri, May 23, 2014 at 03:39:38PM +0200, Michal Hocko wrote:
> I am adding Vladimir to CC
> 
> On Wed 30-04-14 16:25:41, Johannes Weiner wrote:
> > Kmem page charging and uncharging is serialized by means of exclusive
> > access to the page.  Do not take the page_cgroup lock and don't set
> > pc->flags atomically.
> > 
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Vladimir Davydov <vdavydov@parallels.com>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch 8/9] mm: memcontrol: rewrite charge API
  2014-04-30 20:25 ` [patch 8/9] mm: memcontrol: rewrite charge API Johannes Weiner
  2014-05-23 14:18   ` Michal Hocko
@ 2014-05-23 14:54   ` Michal Hocko
  2014-05-23 15:18     ` Michal Hocko
  2014-05-27 20:05     ` Johannes Weiner
  1 sibling, 2 replies; 35+ messages in thread
From: Michal Hocko @ 2014-05-23 14:54 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: linux-mm, Hugh Dickins, Tejun Heo, cgroups, linux-kernel

On Wed 30-04-14 16:25:42, Johannes Weiner wrote:
> The memcg charge API charges pages before they are rmapped - i.e. have
> an actual "type" - and so every callsite needs its own set of charge
> and uncharge functions to know what type is being operated on.
> 
> Rewrite the charge API to provide a generic set of try_charge(),
> commit_charge() and cancel_charge() transaction operations, much like
> what's currently done for swap-in:
> 
>   mem_cgroup_try_charge() attempts to reserve a charge, reclaiming
>   pages from the memcg if necessary.
> 
>   mem_cgroup_commit_charge() commits the page to the charge once it
>   has a valid page->mapping and PageAnon() reliably tells the type.
> 
>   mem_cgroup_cancel_charge() aborts the transaction.
> 
> As pages need to be committed after rmap is established but before
> they are added to the LRU, page_add_new_anon_rmap() must stop doing
> LRU additions again.  Factor lru_cache_add_active_or_unevictable().
> 
> The order of functions in mm/memcontrol.c is entirely random, so this
> new charge interface is implemented at the end of the file, where all
> new or cleaned up, and documented code should go from now on.

I would prefer moving them after the refactoring, because reviewing is
much harder this way. That is, if such a move is needed at all.

Anyway, this is definitely not Friday material...

So only a first impression from a quick glance.

size says the code is slightly bigger:
   text    data     bss     dec     hex filename
 487977   84898   45984  618859   9716b mm/built-in.o.7
 488276   84898   45984  619158   97296 mm/built-in.o.8

No biggie though.

It is true that it gets rid of ~80 LOC in memcontrol.c, but it adds some
more outside of memcg. Most of the charging paths didn't get any easier:
they already know the type, and now they also have to make sure they
commit the charge.

But maybe it is just me: now that we have
mem_cgroup_charge_{anon,file,swapin}, the API doesn't look so insane
anymore, so I am not tempted to change it that much.

I will look at this again with a fresh brain on Monday.
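
To make the calling convention concrete for other readers, here is a tiny
userspace sketch of the try/commit/cancel protocol from the changelog
(all names and types invented for illustration; batching, reclaim and OOM
handling ignored):

#include <stdbool.h>
#include <stdio.h>

struct counter {
	long usage;
	long limit;
};

struct page {
	struct counter *memcg;	/* association is made only at commit time */
};

/* try: reserve the charge up front, before the page has a "type" */
static bool try_charge(struct counter *memcg, long nr)
{
	if (memcg->usage + nr > memcg->limit)
		return false;
	memcg->usage += nr;
	return true;
}

/* commit: bind the reserved charge to the now-instantiated page */
static void commit_charge(struct page *page, struct counter *memcg)
{
	page->memcg = memcg;
}

/* cancel: roll the reservation back if instantiation failed */
static void cancel_charge(struct counter *memcg, long nr)
{
	memcg->usage -= nr;
}

int main(void)
{
	struct counter memcg = { .usage = 0, .limit = 4 };
	struct page page = { .memcg = NULL };
	bool rmap_ok = true;	/* stands in for rmap / page cache insertion */

	if (!try_charge(&memcg, 1))
		return 1;	/* VM_FAULT_OOM territory */

	if (rmap_ok)
		commit_charge(&page, &memcg);
	else
		cancel_charge(&memcg, 1);

	printf("usage=%ld charged=%d\n", memcg.usage, page.memcg != NULL);
	return 0;
}

The ordering is the whole point: the reservation is taken while the page
is still typeless, and the binding happens only once rmap (or the page
cache insertion) has succeeded.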

> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  Documentation/cgroups/memcg_test.txt |  32 +-
>  include/linux/memcontrol.h           |  53 +--
>  include/linux/swap.h                 |   3 +
>  kernel/events/uprobes.c              |   1 +
>  mm/filemap.c                         |   9 +-
>  mm/huge_memory.c                     |  51 ++-
>  mm/memcontrol.c                      | 777 ++++++++++++++++-------------------
>  mm/memory.c                          |  41 +-
>  mm/migrate.c                         |   1 +
>  mm/rmap.c                            |   5 -
>  mm/shmem.c                           |  24 +-
>  mm/swap.c                            |  20 +
>  mm/swapfile.c                        |  14 +-
>  13 files changed, 479 insertions(+), 552 deletions(-)
> 
> diff --git a/Documentation/cgroups/memcg_test.txt b/Documentation/cgroups/memcg_test.txt
> index 80ac454704b8..bcf750d3cecd 100644
> --- a/Documentation/cgroups/memcg_test.txt
> +++ b/Documentation/cgroups/memcg_test.txt
> @@ -24,24 +24,7 @@ Please note that implementation details can be changed.
>  
>     a page/swp_entry may be charged (usage += PAGE_SIZE) at
>  
> -	mem_cgroup_charge_anon()
> -	  Called at new page fault and Copy-On-Write.
> -
> -	mem_cgroup_try_charge_swapin()
> -	  Called at do_swap_page() (page fault on swap entry) and swapoff.
> -	  Followed by charge-commit-cancel protocol. (With swap accounting)
> -	  At commit, a charge recorded in swap_cgroup is removed.
> -
> -	mem_cgroup_charge_file()
> -	  Called at add_to_page_cache()
> -
> -	mem_cgroup_cache_charge_swapin()
> -	  Called at shmem's swapin.
> -
> -	mem_cgroup_prepare_migration()
> -	  Called before migration. "extra" charge is done and followed by
> -	  charge-commit-cancel protocol.
> -	  At commit, charge against oldpage or newpage will be committed.
> +	mem_cgroup_try_charge()
>  
>  2. Uncharge
>    a page/swp_entry may be uncharged (usage -= PAGE_SIZE) by
> @@ -69,19 +52,14 @@ Please note that implementation details can be changed.
>  	to new page is committed. At failure, charge to old page is committed.
>  
>  3. charge-commit-cancel
> -	In some case, we can't know this "charge" is valid or not at charging
> -	(because of races).
> -	To handle such case, there are charge-commit-cancel functions.
> -		mem_cgroup_try_charge_XXX
> -		mem_cgroup_commit_charge_XXX
> -		mem_cgroup_cancel_charge_XXX
> -	these are used in swap-in and migration.
> +	Memcg pages are charged in two steps:
> +		mem_cgroup_try_charge()
> +		mem_cgroup_commit_charge() or mem_cgroup_cancel_charge()
>  
>  	At try_charge(), there are no flags to say "this page is charged".
>  	at this point, usage += PAGE_SIZE.
>  
> -	At commit(), the function checks the page should be charged or not
> -	and set flags or avoid charging.(usage -= PAGE_SIZE)
> +	At commit(), the page is associated with the memcg.
>  
>  	At cancel(), simply usage -= PAGE_SIZE.
>  
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index b569b8be5c5a..5578b07376b7 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -54,28 +54,11 @@ struct mem_cgroup_reclaim_cookie {
>  };
>  
>  #ifdef CONFIG_MEMCG
> -/*
> - * All "charge" functions with gfp_mask should use GFP_KERNEL or
> - * (gfp_mask & GFP_RECLAIM_MASK). In current implementatin, memcg doesn't
> - * alloc memory but reclaims memory from all available zones. So, "where I want
> - * memory from" bits of gfp_mask has no meaning. So any bits of that field is
> - * available but adding a rule is better. charge functions' gfp_mask should
> - * be set to GFP_KERNEL or gfp_mask & GFP_RECLAIM_MASK for avoiding ambiguous
> - * codes.
> - * (Of course, if memcg does memory allocation in future, GFP_KERNEL is sane.)
> - */
> -
> -extern int mem_cgroup_charge_anon(struct page *page, struct mm_struct *mm,
> -				gfp_t gfp_mask);
> -/* for swap handling */
> -extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
> -		struct page *page, gfp_t mask, struct mem_cgroup **memcgp);
> -extern void mem_cgroup_commit_charge_swapin(struct page *page,
> -					struct mem_cgroup *memcg);
> -extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg);
> -
> -extern int mem_cgroup_charge_file(struct page *page, struct mm_struct *mm,
> -					gfp_t gfp_mask);
> +int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
> +			  gfp_t gfp_mask, struct mem_cgroup **memcgp);
> +void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
> +			      bool lrucare);
> +void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg);
>  
>  struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
>  struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
> @@ -233,30 +216,22 @@ void mem_cgroup_print_bad_page(struct page *page);
>  #else /* CONFIG_MEMCG */
>  struct mem_cgroup;
>  
> -static inline int mem_cgroup_charge_anon(struct page *page,
> -					struct mm_struct *mm, gfp_t gfp_mask)
> -{
> -	return 0;
> -}
> -
> -static inline int mem_cgroup_charge_file(struct page *page,
> -					struct mm_struct *mm, gfp_t gfp_mask)
> -{
> -	return 0;
> -}
> -
> -static inline int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
> -		struct page *page, gfp_t gfp_mask, struct mem_cgroup **memcgp)
> +static inline int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
> +					gfp_t gfp_mask,
> +					struct mem_cgroup **memcgp)
>  {
> +	*memcgp = NULL;
>  	return 0;
>  }
>  
> -static inline void mem_cgroup_commit_charge_swapin(struct page *page,
> -					  struct mem_cgroup *memcg)
> +static inline void mem_cgroup_commit_charge(struct page *page,
> +					    struct mem_cgroup *memcg,
> +					    bool lrucare)
>  {
>  }
>  
> -static inline void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg)
> +static inline void mem_cgroup_cancel_charge(struct page *page,
> +					    struct mem_cgroup *memcg)
>  {
>  }
>  
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 350711560753..403a8530ee62 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -323,6 +323,9 @@ extern void swap_setup(void);
>  
>  extern void add_page_to_unevictable_list(struct page *page);
>  
> +extern void lru_cache_add_active_or_unevictable(struct page *page,
> +						struct vm_area_struct *vma);
> +
>  /**
>   * lru_cache_add: add a page to the page lists
>   * @page: the page to add
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 04709b66369d..44c508044c1d 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -180,6 +180,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
>  
>  	get_page(kpage);
>  	page_add_new_anon_rmap(kpage, vma, addr);
> +	lru_cache_add_active_or_unevictable(kpage, vma);
>  
>  	if (!PageAnon(page)) {
>  		dec_mm_counter(mm, MM_FILEPAGES);
> diff --git a/mm/filemap.c b/mm/filemap.c
> index a82fbe4c9e8e..346c2e178193 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -558,19 +558,19 @@ static int __add_to_page_cache_locked(struct page *page,
>  				      pgoff_t offset, gfp_t gfp_mask,
>  				      void **shadowp)
>  {
> +	struct mem_cgroup *memcg;
>  	int error;
>  
>  	VM_BUG_ON_PAGE(!PageLocked(page), page);
>  	VM_BUG_ON_PAGE(PageSwapBacked(page), page);
>  
> -	error = mem_cgroup_charge_file(page, current->mm,
> -					gfp_mask & GFP_RECLAIM_MASK);
> +	error = mem_cgroup_try_charge(page, current->mm, gfp_mask, &memcg);
>  	if (error)
>  		return error;
>  
>  	error = radix_tree_maybe_preload(gfp_mask & ~__GFP_HIGHMEM);
>  	if (error) {
> -		mem_cgroup_uncharge_cache_page(page);
> +		mem_cgroup_cancel_charge(page, memcg);
>  		return error;
>  	}
>  
> @@ -585,13 +585,14 @@ static int __add_to_page_cache_locked(struct page *page,
>  		goto err_insert;
>  	__inc_zone_page_state(page, NR_FILE_PAGES);
>  	spin_unlock_irq(&mapping->tree_lock);
> +	mem_cgroup_commit_charge(page, memcg, false);
>  	trace_mm_filemap_add_to_page_cache(page);
>  	return 0;
>  err_insert:
>  	page->mapping = NULL;
>  	/* Leave page->index set: truncation relies upon it */
>  	spin_unlock_irq(&mapping->tree_lock);
> -	mem_cgroup_uncharge_cache_page(page);
> +	mem_cgroup_cancel_charge(page, memcg);
>  	page_cache_release(page);
>  	return error;
>  }
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 64635f5278ff..1a22d8b12cf2 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -715,13 +715,20 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
>  					unsigned long haddr, pmd_t *pmd,
>  					struct page *page)
>  {
> +	struct mem_cgroup *memcg;
>  	pgtable_t pgtable;
>  	spinlock_t *ptl;
>  
>  	VM_BUG_ON_PAGE(!PageCompound(page), page);
> +
> +	if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg))
> +		return VM_FAULT_OOM;
> +
>  	pgtable = pte_alloc_one(mm, haddr);
> -	if (unlikely(!pgtable))
> +	if (unlikely(!pgtable)) {
> +		mem_cgroup_cancel_charge(page, memcg);
>  		return VM_FAULT_OOM;
> +	}
>  
>  	clear_huge_page(page, haddr, HPAGE_PMD_NR);
>  	/*
> @@ -734,7 +741,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
>  	ptl = pmd_lock(mm, pmd);
>  	if (unlikely(!pmd_none(*pmd))) {
>  		spin_unlock(ptl);
> -		mem_cgroup_uncharge_page(page);
> +		mem_cgroup_cancel_charge(page, memcg);
>  		put_page(page);
>  		pte_free(mm, pgtable);
>  	} else {
> @@ -742,6 +749,8 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
>  		entry = mk_huge_pmd(page, vma->vm_page_prot);
>  		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
>  		page_add_new_anon_rmap(page, vma, haddr);
> +		mem_cgroup_commit_charge(page, memcg, false);
> +		lru_cache_add_active_or_unevictable(page, vma);
>  		pgtable_trans_huge_deposit(mm, pmd, pgtable);
>  		set_pmd_at(mm, haddr, pmd, entry);
>  		add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
> @@ -827,13 +836,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		count_vm_event(THP_FAULT_FALLBACK);
>  		return VM_FAULT_FALLBACK;
>  	}
> -	if (unlikely(mem_cgroup_charge_anon(page, mm, GFP_KERNEL))) {
> -		put_page(page);
> -		count_vm_event(THP_FAULT_FALLBACK);
> -		return VM_FAULT_FALLBACK;
> -	}
>  	if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page))) {
> -		mem_cgroup_uncharge_page(page);
>  		put_page(page);
>  		count_vm_event(THP_FAULT_FALLBACK);
>  		return VM_FAULT_FALLBACK;
> @@ -948,6 +951,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
>  					struct page *page,
>  					unsigned long haddr)
>  {
> +	struct mem_cgroup *memcg;
>  	spinlock_t *ptl;
>  	pgtable_t pgtable;
>  	pmd_t _pmd;
> @@ -968,13 +972,15 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
>  					       __GFP_OTHER_NODE,
>  					       vma, address, page_to_nid(page));
>  		if (unlikely(!pages[i] ||
> -			     mem_cgroup_charge_anon(pages[i], mm,
> -						       GFP_KERNEL))) {
> +			     mem_cgroup_try_charge(pages[i], mm, GFP_KERNEL,
> +						   &memcg))) {
>  			if (pages[i])
>  				put_page(pages[i]);
>  			mem_cgroup_uncharge_start();
>  			while (--i >= 0) {
> -				mem_cgroup_uncharge_page(pages[i]);
> +				memcg = (void *)page_private(pages[i]);
> +				set_page_private(pages[i], 0);
> +				mem_cgroup_cancel_charge(pages[i], memcg);
>  				put_page(pages[i]);
>  			}
>  			mem_cgroup_uncharge_end();
> @@ -982,6 +988,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
>  			ret |= VM_FAULT_OOM;
>  			goto out;
>  		}
> +		set_page_private(pages[i], (unsigned long)memcg);
>  	}
>  
>  	for (i = 0; i < HPAGE_PMD_NR; i++) {
> @@ -1010,7 +1017,11 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
>  		pte_t *pte, entry;
>  		entry = mk_pte(pages[i], vma->vm_page_prot);
>  		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> +		memcg = (void *)page_private(pages[i]);
> +		set_page_private(pages[i], 0);
>  		page_add_new_anon_rmap(pages[i], vma, haddr);
> +		mem_cgroup_commit_charge(pages[i], memcg, false);
> +		lru_cache_add_active_or_unevictable(pages[i], vma);
>  		pte = pte_offset_map(&_pmd, haddr);
>  		VM_BUG_ON(!pte_none(*pte));
>  		set_pte_at(mm, haddr, pte, entry);
> @@ -1036,7 +1047,9 @@ out_free_pages:
>  	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
>  	mem_cgroup_uncharge_start();
>  	for (i = 0; i < HPAGE_PMD_NR; i++) {
> -		mem_cgroup_uncharge_page(pages[i]);
> +		memcg = (void *)page_private(pages[i]);
> +		set_page_private(pages[i], 0);
> +		mem_cgroup_cancel_charge(pages[i], memcg);
>  		put_page(pages[i]);
>  	}
>  	mem_cgroup_uncharge_end();
> @@ -1050,6 +1063,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  	spinlock_t *ptl;
>  	int ret = 0;
>  	struct page *page = NULL, *new_page;
> +	struct mem_cgroup *memcg;
>  	unsigned long haddr;
>  	unsigned long mmun_start;	/* For mmu_notifiers */
>  	unsigned long mmun_end;		/* For mmu_notifiers */
> @@ -1101,7 +1115,7 @@ alloc:
>  		goto out;
>  	}
>  
> -	if (unlikely(mem_cgroup_charge_anon(new_page, mm, GFP_KERNEL))) {
> +	if (unlikely(mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, &memcg))) {
>  		put_page(new_page);
>  		if (page) {
>  			split_huge_page(page);
> @@ -1130,7 +1144,7 @@ alloc:
>  		put_page(page);
>  	if (unlikely(!pmd_same(*pmd, orig_pmd))) {
>  		spin_unlock(ptl);
> -		mem_cgroup_uncharge_page(new_page);
> +		mem_cgroup_cancel_charge(new_page, memcg);
>  		put_page(new_page);
>  		goto out_mn;
>  	} else {
> @@ -1139,6 +1153,8 @@ alloc:
>  		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
>  		pmdp_clear_flush(vma, haddr, pmd);
>  		page_add_new_anon_rmap(new_page, vma, haddr);
> +		mem_cgroup_commit_charge(new_page, memcg, false);
> +		lru_cache_add_active_or_unevictable(new_page, vma);
>  		set_pmd_at(mm, haddr, pmd, entry);
>  		update_mmu_cache_pmd(vma, address, pmd);
>  		if (!page) {
> @@ -2349,6 +2365,7 @@ static void collapse_huge_page(struct mm_struct *mm,
>  	spinlock_t *pmd_ptl, *pte_ptl;
>  	int isolated;
>  	unsigned long hstart, hend;
> +	struct mem_cgroup *memcg;
>  	unsigned long mmun_start;	/* For mmu_notifiers */
>  	unsigned long mmun_end;		/* For mmu_notifiers */
>  
> @@ -2359,7 +2376,7 @@ static void collapse_huge_page(struct mm_struct *mm,
>  	if (!new_page)
>  		return;
>  
> -	if (unlikely(mem_cgroup_charge_anon(new_page, mm, GFP_KERNEL)))
> +	if (unlikely(mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, &memcg)))
>  		return;
>  
>  	/*
> @@ -2448,6 +2465,8 @@ static void collapse_huge_page(struct mm_struct *mm,
>  	spin_lock(pmd_ptl);
>  	BUG_ON(!pmd_none(*pmd));
>  	page_add_new_anon_rmap(new_page, vma, address);
> +	mem_cgroup_commit_charge(new_page, memcg, false);
> +	lru_cache_add_active_or_unevictable(new_page, vma);
>  	pgtable_trans_huge_deposit(mm, pmd, pgtable);
>  	set_pmd_at(mm, address, pmd, _pmd);
>  	update_mmu_cache_pmd(vma, address, pmd);
> @@ -2461,7 +2480,7 @@ out_up_write:
>  	return;
>  
>  out:
> -	mem_cgroup_uncharge_page(new_page);
> +	mem_cgroup_cancel_charge(new_page, memcg);
>  	goto out_up_write;
>  }
>  
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d3961fce1d54..6f48e292ffe7 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2574,163 +2574,6 @@ static int memcg_cpu_hotplug_callback(struct notifier_block *nb,
>  	return NOTIFY_OK;
>  }
>  
> -/**
> - * mem_cgroup_try_charge - try charging a memcg
> - * @memcg: memcg to charge
> - * @nr_pages: number of pages to charge
> - * @oom: trigger OOM if reclaim fails
> - *
> - * Returns 0 if @memcg was charged successfully, -EINTR if the charge
> - * was bypassed to root_mem_cgroup, and -ENOMEM if the charge failed.
> - */
> -static int mem_cgroup_try_charge(struct mem_cgroup *memcg,
> -				 gfp_t gfp_mask,
> -				 unsigned int nr_pages,
> -				 bool oom)
> -{
> -	unsigned int batch = max(CHARGE_BATCH, nr_pages);
> -	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> -	struct mem_cgroup *mem_over_limit;
> -	struct res_counter *fail_res;
> -	unsigned long nr_reclaimed;
> -	unsigned long flags = 0;
> -	unsigned long long size;
> -	int ret = 0;
> -
> -retry:
> -	if (consume_stock(memcg, nr_pages))
> -		goto done;
> -
> -	size = batch * PAGE_SIZE;
> -	if (!res_counter_charge(&memcg->res, size, &fail_res)) {
> -		if (!do_swap_account)
> -			goto done_restock;
> -		if (!res_counter_charge(&memcg->memsw, size, &fail_res))
> -			goto done_restock;
> -		res_counter_uncharge(&memcg->res, size);
> -		mem_over_limit = mem_cgroup_from_res_counter(fail_res, memsw);
> -		flags |= MEM_CGROUP_RECLAIM_NOSWAP;
> -	} else
> -		mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
> -
> -	if (batch > nr_pages) {
> -		batch = nr_pages;
> -		goto retry;
> -	}
> -
> -	/*
> -	 * Unlike in global OOM situations, memcg is not in a physical
> -	 * memory shortage.  Allow dying and OOM-killed tasks to
> -	 * bypass the last charges so that they can exit quickly and
> -	 * free their memory.
> -	 */
> -	if (unlikely(test_thread_flag(TIF_MEMDIE) ||
> -		     fatal_signal_pending(current)))
> -		goto bypass;
> -
> -	if (unlikely(task_in_memcg_oom(current)))
> -		goto nomem;
> -
> -	if (!(gfp_mask & __GFP_WAIT))
> -		goto nomem;
> -
> -	if (gfp_mask & __GFP_NORETRY)
> -		goto nomem;
> -
> -	nr_reclaimed = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
> -
> -	if (mem_cgroup_margin(mem_over_limit) >= batch)
> -		goto retry;
> -	/*
> -	 * Even though the limit is exceeded at this point, reclaim
> -	 * may have been able to free some pages.  Retry the charge
> -	 * before killing the task.
> -	 *
> -	 * Only for regular pages, though: huge pages are rather
> -	 * unlikely to succeed so close to the limit, and we fall back
> -	 * to regular pages anyway in case of failure.
> -	 */
> -	if (nr_reclaimed && batch <= (1 << PAGE_ALLOC_COSTLY_ORDER))
> -		goto retry;
> -	/*
> -	 * At task move, charge accounts can be doubly counted. So, it's
> -	 * better to wait until the end of task_move if something is going on.
> -	 */
> -	if (mem_cgroup_wait_acct_move(mem_over_limit))
> -		goto retry;
> -
> -	if (nr_retries--)
> -		goto retry;
> -
> -	if (gfp_mask & __GFP_NOFAIL)
> -		goto bypass;
> -
> -	if (fatal_signal_pending(current))
> -		goto bypass;
> -
> -	if (!oom)
> -		goto nomem;
> -
> -	mem_cgroup_oom(mem_over_limit, gfp_mask, get_order(batch));
> -nomem:
> -	if (!(gfp_mask & __GFP_NOFAIL))
> -		return -ENOMEM;
> -bypass:
> -	memcg = root_mem_cgroup;
> -	ret = -EINTR;
> -	goto retry;
> -
> -done_restock:
> -	if (batch > nr_pages)
> -		refill_stock(memcg, batch - nr_pages);
> -done:
> -	return ret;
> -}
> -
> -/**
> - * mem_cgroup_try_charge_mm - try charging a mm
> - * @mm: mm_struct to charge
> - * @nr_pages: number of pages to charge
> - * @oom: trigger OOM if reclaim fails
> - *
> - * Returns the charged mem_cgroup associated with the given mm_struct or
> - * NULL the charge failed.
> - */
> -static struct mem_cgroup *mem_cgroup_try_charge_mm(struct mm_struct *mm,
> -				 gfp_t gfp_mask,
> -				 unsigned int nr_pages,
> -				 bool oom)
> -
> -{
> -	struct mem_cgroup *memcg;
> -	int ret;
> -
> -	memcg = get_mem_cgroup_from_mm(mm);
> -	ret = mem_cgroup_try_charge(memcg, gfp_mask, nr_pages, oom);
> -	css_put(&memcg->css);
> -	if (ret == -EINTR)
> -		memcg = root_mem_cgroup;
> -	else if (ret)
> -		memcg = NULL;
> -
> -	return memcg;
> -}
> -
> -/*
> - * Somemtimes we have to undo a charge we got by try_charge().
> - * This function is for that and do uncharge, put css's refcnt.
> - * gotten by try_charge().
> - */
> -static void __mem_cgroup_cancel_charge(struct mem_cgroup *memcg,
> -				       unsigned int nr_pages)
> -{
> -	unsigned long bytes = nr_pages * PAGE_SIZE;
> -
> -	res_counter_uncharge(&memcg->res, bytes);
> -	if (do_swap_account)
> -		res_counter_uncharge(&memcg->memsw, bytes);
> -}
> -
>  /*
>   * Cancel chrages in this cgroup....doesn't propagate to parent cgroup.
>   * This is useful when moving usage to parent cgroup.
> @@ -2788,69 +2631,6 @@ struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
>  	return memcg;
>  }
>  
> -static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg,
> -				       struct page *page,
> -				       unsigned int nr_pages,
> -				       enum charge_type ctype,
> -				       bool lrucare)
> -{
> -	struct page_cgroup *pc = lookup_page_cgroup(page);
> -	struct zone *uninitialized_var(zone);
> -	struct lruvec *lruvec;
> -	bool was_on_lru = false;
> -	bool anon;
> -
> -	lock_page_cgroup(pc);
> -	VM_BUG_ON_PAGE(PageCgroupUsed(pc), page);
> -	/*
> -	 * we don't need page_cgroup_lock about tail pages, becase they are not
> -	 * accessed by any other context at this point.
> -	 */
> -
> -	/*
> -	 * In some cases, SwapCache and FUSE(splice_buf->radixtree), the page
> -	 * may already be on some other mem_cgroup's LRU.  Take care of it.
> -	 */
> -	if (lrucare) {
> -		zone = page_zone(page);
> -		spin_lock_irq(&zone->lru_lock);
> -		if (PageLRU(page)) {
> -			lruvec = mem_cgroup_zone_lruvec(zone, pc->mem_cgroup);
> -			ClearPageLRU(page);
> -			del_page_from_lru_list(page, lruvec, page_lru(page));
> -			was_on_lru = true;
> -		}
> -	}
> -
> -	pc->mem_cgroup = memcg;
> -	SetPageCgroupUsed(pc);
> -
> -	if (lrucare) {
> -		if (was_on_lru) {
> -			lruvec = mem_cgroup_zone_lruvec(zone, pc->mem_cgroup);
> -			VM_BUG_ON_PAGE(PageLRU(page), page);
> -			SetPageLRU(page);
> -			add_page_to_lru_list(page, lruvec, page_lru(page));
> -		}
> -		spin_unlock_irq(&zone->lru_lock);
> -	}
> -
> -	if (ctype == MEM_CGROUP_CHARGE_TYPE_ANON)
> -		anon = true;
> -	else
> -		anon = false;
> -
> -	mem_cgroup_charge_statistics(memcg, page, anon, nr_pages);
> -	unlock_page_cgroup(pc);
> -
> -	/*
> -	 * "charge_statistics" updated event counter. Then, check it.
> -	 * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
> -	 * if they exceeds softlimit.
> -	 */
> -	memcg_check_events(memcg, page);
> -}
> -
>  static DEFINE_MUTEX(set_limit_mutex);
>  
>  #ifdef CONFIG_MEMCG_KMEM
> @@ -2895,6 +2675,9 @@ static int mem_cgroup_slabinfo_read(struct seq_file *m, void *v)
>  }
>  #endif
>  
> +static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> +		      unsigned int nr_pages, bool oom);
> +
>  static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
>  {
>  	struct res_counter *fail_res;
> @@ -2904,22 +2687,21 @@ static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
>  	if (ret)
>  		return ret;
>  
> -	ret = mem_cgroup_try_charge(memcg, gfp, size >> PAGE_SHIFT,
> -				    oom_gfp_allowed(gfp));
> +	ret = try_charge(memcg, gfp, size >> PAGE_SHIFT, oom_gfp_allowed(gfp));
>  	if (ret == -EINTR)  {
>  		/*
> -		 * mem_cgroup_try_charge() chosed to bypass to root due to
> -		 * OOM kill or fatal signal.  Since our only options are to
> -		 * either fail the allocation or charge it to this cgroup, do
> -		 * it as a temporary condition. But we can't fail. From a
> -		 * kmem/slab perspective, the cache has already been selected,
> -		 * by mem_cgroup_kmem_get_cache(), so it is too late to change
> +		 * try_charge() chose to bypass to root due to OOM kill or
> +		 * fatal signal.  Since our only options are to either fail
> +		 * the allocation or charge it to this cgroup, do it as a
> +		 * temporary condition. But we can't fail. From a kmem/slab
> +		 * perspective, the cache has already been selected, by
> +		 * mem_cgroup_kmem_get_cache(), so it is too late to change
>  		 * our minds.
>  		 *
>  		 * This condition will only trigger if the task entered
> -		 * memcg_charge_kmem in a sane state, but was OOM-killed during
> -		 * mem_cgroup_try_charge() above. Tasks that were already
> -		 * dying when the allocation triggers should have been already
> +		 * memcg_charge_kmem in a sane state, but was OOM-killed
> +		 * during try_charge() above. Tasks that were already dying
> +		 * when the allocation triggers should have been already
>  		 * directed to the root cgroup in memcontrol.h
>  		 */
>  		res_counter_charge_nofail(&memcg->res, size, &fail_res);
> @@ -3728,193 +3510,17 @@ static int mem_cgroup_move_parent(struct page *page,
>  	}
>  
>  	ret = mem_cgroup_move_account(page, nr_pages,
> -				pc, child, parent);
> -	if (!ret)
> -		__mem_cgroup_cancel_local_charge(child, nr_pages);
> -
> -	if (nr_pages > 1)
> -		compound_unlock_irqrestore(page, flags);
> -	putback_lru_page(page);
> -put:
> -	put_page(page);
> -out:
> -	return ret;
> -}
> -
> -int mem_cgroup_charge_anon(struct page *page,
> -			      struct mm_struct *mm, gfp_t gfp_mask)
> -{
> -	unsigned int nr_pages = 1;
> -	struct mem_cgroup *memcg;
> -	bool oom = true;
> -
> -	if (mem_cgroup_disabled())
> -		return 0;
> -
> -	VM_BUG_ON_PAGE(page_mapped(page), page);
> -	VM_BUG_ON_PAGE(page->mapping && !PageAnon(page), page);
> -	VM_BUG_ON(!mm);
> -
> -	if (PageTransHuge(page)) {
> -		nr_pages <<= compound_order(page);
> -		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
> -		/*
> -		 * Never OOM-kill a process for a huge page.  The
> -		 * fault handler will fall back to regular pages.
> -		 */
> -		oom = false;
> -	}
> -
> -	memcg = mem_cgroup_try_charge_mm(mm, gfp_mask, nr_pages, oom);
> -	if (!memcg)
> -		return -ENOMEM;
> -	__mem_cgroup_commit_charge(memcg, page, nr_pages,
> -				   MEM_CGROUP_CHARGE_TYPE_ANON, false);
> -	return 0;
> -}
> -
> -/*
> - * While swap-in, try_charge -> commit or cancel, the page is locked.
> - * And when try_charge() successfully returns, one refcnt to memcg without
> - * struct page_cgroup is acquired. This refcnt will be consumed by
> - * "commit()" or removed by "cancel()"
> - */
> -static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm,
> -					  struct page *page,
> -					  gfp_t mask,
> -					  struct mem_cgroup **memcgp)
> -{
> -	struct mem_cgroup *memcg = NULL;
> -	struct page_cgroup *pc;
> -	int ret;
> -
> -	pc = lookup_page_cgroup(page);
> -	/*
> -	 * Every swap fault against a single page tries to charge the
> -	 * page, bail as early as possible.  shmem_unuse() encounters
> -	 * already charged pages, too.  The USED bit is protected by
> -	 * the page lock, which serializes swap cache removal, which
> -	 * in turn serializes uncharging.
> -	 */
> -	if (PageCgroupUsed(pc))
> -		goto out;
> -	if (do_swap_account)
> -		memcg = try_get_mem_cgroup_from_page(page);
> -	if (!memcg)
> -		memcg = get_mem_cgroup_from_mm(mm);
> -	ret = mem_cgroup_try_charge(memcg, mask, 1, true);
> -	css_put(&memcg->css);
> -	if (ret == -EINTR)
> -		memcg = root_mem_cgroup;
> -	else if (ret)
> -		return ret;
> -out:
> -	*memcgp = memcg;
> -	return 0;
> -}
> -
> -int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page,
> -				 gfp_t gfp_mask, struct mem_cgroup **memcgp)
> -{
> -	if (mem_cgroup_disabled()) {
> -		*memcgp = NULL;
> -		return 0;
> -	}
> -	/*
> -	 * A racing thread's fault, or swapoff, may have already
> -	 * updated the pte, and even removed page from swap cache: in
> -	 * those cases unuse_pte()'s pte_same() test will fail; but
> -	 * there's also a KSM case which does need to charge the page.
> -	 */
> -	if (!PageSwapCache(page)) {
> -		struct mem_cgroup *memcg;
> -
> -		memcg = mem_cgroup_try_charge_mm(mm, gfp_mask, 1, true);
> -		if (!memcg)
> -			return -ENOMEM;
> -		*memcgp = memcg;
> -		return 0;
> -	}
> -	return __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, memcgp);
> -}
> -
> -void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg)
> -{
> -	if (mem_cgroup_disabled())
> -		return;
> -	if (!memcg)
> -		return;
> -	__mem_cgroup_cancel_charge(memcg, 1);
> -}
> -
> -static void
> -__mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *memcg,
> -					enum charge_type ctype)
> -{
> -	if (mem_cgroup_disabled())
> -		return;
> -	if (!memcg)
> -		return;
> -
> -	__mem_cgroup_commit_charge(memcg, page, 1, ctype, true);
> -	/*
> -	 * Now swap is on-memory. This means this page may be
> -	 * counted both as mem and swap....double count.
> -	 * Fix it by uncharging from memsw. Basically, this SwapCache is stable
> -	 * under lock_page(). But in do_swap_page()::memory.c, reuse_swap_page()
> -	 * may call delete_from_swap_cache() before reach here.
> -	 */
> -	if (do_swap_account && PageSwapCache(page)) {
> -		swp_entry_t ent = {.val = page_private(page)};
> -		mem_cgroup_uncharge_swap(ent);
> -	}
> -}
> -
> -void mem_cgroup_commit_charge_swapin(struct page *page,
> -				     struct mem_cgroup *memcg)
> -{
> -	__mem_cgroup_commit_charge_swapin(page, memcg,
> -					  MEM_CGROUP_CHARGE_TYPE_ANON);
> -}
> -
> -int mem_cgroup_charge_file(struct page *page, struct mm_struct *mm,
> -				gfp_t gfp_mask)
> -{
> -	enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE;
> -	struct mem_cgroup *memcg;
> -	int ret;
> -
> -	if (mem_cgroup_disabled())
> -		return 0;
> -	if (PageCompound(page))
> -		return 0;
> -
> -	if (PageSwapCache(page)) { /* shmem */
> -		ret = __mem_cgroup_try_charge_swapin(mm, page,
> -						     gfp_mask, &memcg);
> -		if (ret)
> -			return ret;
> -		__mem_cgroup_commit_charge_swapin(page, memcg, type);
> -		return 0;
> -	}
> -
> -	/*
> -	 * Page cache insertions can happen without an actual mm
> -	 * context, e.g. during disk probing on boot.
> -	 */
> -	if (unlikely(!mm)) {
> -		memcg = root_mem_cgroup;
> -		ret = mem_cgroup_try_charge(memcg, gfp_mask, 1, true);
> -		VM_BUG_ON(ret == -EINTR);
> -		if (ret)
> -			return ret;
> -	} else {
> -		memcg = mem_cgroup_try_charge_mm(mm, gfp_mask, 1, true);
> -		if (!memcg)
> -			return -ENOMEM;
> -	}
> -	__mem_cgroup_commit_charge(memcg, page, 1, type, false);
> -	return 0;
> +				pc, child, parent);
> +	if (!ret)
> +		__mem_cgroup_cancel_local_charge(child, nr_pages);
> +
> +	if (nr_pages > 1)
> +		compound_unlock_irqrestore(page, flags);
> +	putback_lru_page(page);
> +put:
> +	put_page(page);
> +out:
> +	return ret;
>  }
>  
>  static void mem_cgroup_do_uncharge(struct mem_cgroup *memcg,
> @@ -4253,6 +3859,9 @@ static inline int mem_cgroup_move_swap_account(swp_entry_t entry,
>  }
>  #endif
>  
> +static void commit_charge(struct page *page, struct mem_cgroup *memcg,
> +			  unsigned int nr_pages, bool anon, bool lrucare);
> +
>  /*
>   * Before starting migration, account PAGE_SIZE to mem_cgroup that the old
>   * page belongs to.
> @@ -4263,7 +3872,6 @@ void mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
>  	struct mem_cgroup *memcg = NULL;
>  	unsigned int nr_pages = 1;
>  	struct page_cgroup *pc;
> -	enum charge_type ctype;
>  
>  	*memcgp = NULL;
>  
> @@ -4325,16 +3933,12 @@ void mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
>  	 * page. In the case new page is migrated but not remapped, new page's
>  	 * mapcount will be finally 0 and we call uncharge in end_migration().
>  	 */
> -	if (PageAnon(page))
> -		ctype = MEM_CGROUP_CHARGE_TYPE_ANON;
> -	else
> -		ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
>  	/*
>  	 * The page is committed to the memcg, but it's not actually
>  	 * charged to the res_counter since we plan on replacing the
>  	 * old one and only one page is going to be left afterwards.
>  	 */
> -	__mem_cgroup_commit_charge(memcg, newpage, nr_pages, ctype, false);
> +	commit_charge(newpage, memcg, nr_pages, PageAnon(page), false);
>  }
>  
>  /* remove redundant charge if migration failed*/
> @@ -4393,7 +3997,6 @@ void mem_cgroup_replace_page_cache(struct page *oldpage,
>  {
>  	struct mem_cgroup *memcg = NULL;
>  	struct page_cgroup *pc;
> -	enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE;
>  
>  	if (mem_cgroup_disabled())
>  		return;
> @@ -4419,7 +4022,7 @@ void mem_cgroup_replace_page_cache(struct page *oldpage,
>  	 * the newpage may be on LRU(or pagevec for LRU) already. We lock
>  	 * LRU while we overwrite pc->mem_cgroup.
>  	 */
> -	__mem_cgroup_commit_charge(memcg, newpage, 1, type, true);
> +	commit_charge(newpage, memcg, 1, false, true);
>  }
>  
>  #ifdef CONFIG_DEBUG_VM
> @@ -6434,6 +6037,7 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
>  #ifdef CONFIG_MMU
>  /* Handlers for move charge at task migration. */
>  #define PRECHARGE_COUNT_AT_ONCE	256
> +static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages);
>  static int mem_cgroup_do_precharge(unsigned long count)
>  {
>  	int ret = 0;
> @@ -6470,9 +6074,9 @@ one_by_one:
>  			batch_count = PRECHARGE_COUNT_AT_ONCE;
>  			cond_resched();
>  		}
> -		ret = mem_cgroup_try_charge(memcg, GFP_KERNEL, 1, false);
> +		ret = try_charge(memcg, GFP_KERNEL, 1, false);
>  		if (ret == -EINTR)
> -			__mem_cgroup_cancel_charge(root_mem_cgroup, 1);
> +			cancel_charge(root_mem_cgroup, 1);
>  		if (ret)
>  			return ret;
>  		mc.precharge++;
> @@ -6736,7 +6340,7 @@ static void __mem_cgroup_clear_mc(void)
>  
>  	/* we must uncharge all the leftover precharges from mc.to */
>  	if (mc.precharge) {
> -		__mem_cgroup_cancel_charge(mc.to, mc.precharge);
> +		cancel_charge(mc.to, mc.precharge);
>  		mc.precharge = 0;
>  	}
>  	/*
> @@ -6744,7 +6348,7 @@ static void __mem_cgroup_clear_mc(void)
>  	 * we must uncharge here.
>  	 */
>  	if (mc.moved_charge) {
> -		__mem_cgroup_cancel_charge(mc.from, mc.moved_charge);
> +		cancel_charge(mc.from, mc.moved_charge);
>  		mc.moved_charge = 0;
>  	}
>  	/* we must fixup refcnts and charges */
> @@ -7070,6 +6674,319 @@ static void __init enable_swap_cgroup(void)
>  }
>  #endif
>  
> +static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> +		      unsigned int nr_pages, bool oom)
> +{
> +	unsigned int batch = max(CHARGE_BATCH, nr_pages);
> +	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> +	struct mem_cgroup *mem_over_limit;
> +	struct res_counter *fail_res;
> +	unsigned long nr_reclaimed;
> +	unsigned long flags = 0;
> +	unsigned long long size;
> +	int ret = 0;
> +
> +retry:
> +	if (consume_stock(memcg, nr_pages))
> +		goto done;
> +
> +	size = batch * PAGE_SIZE;
> +	if (!res_counter_charge(&memcg->res, size, &fail_res)) {
> +		if (!do_swap_account)
> +			goto done_restock;
> +		if (!res_counter_charge(&memcg->memsw, size, &fail_res))
> +			goto done_restock;
> +		res_counter_uncharge(&memcg->res, size);
> +		mem_over_limit = mem_cgroup_from_res_counter(fail_res, memsw);
> +		flags |= MEM_CGROUP_RECLAIM_NOSWAP;
> +	} else
> +		mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
> +
> +	if (batch > nr_pages) {
> +		batch = nr_pages;
> +		goto retry;
> +	}
> +
> +	/*
> +	 * Unlike in global OOM situations, memcg is not in a physical
> +	 * memory shortage.  Allow dying and OOM-killed tasks to
> +	 * bypass the last charges so that they can exit quickly and
> +	 * free their memory.
> +	 */
> +	if (unlikely(test_thread_flag(TIF_MEMDIE) ||
> +		     fatal_signal_pending(current)))
> +		goto bypass;
> +
> +	if (unlikely(task_in_memcg_oom(current)))
> +		goto nomem;
> +
> +	if (!(gfp_mask & __GFP_WAIT))
> +		goto nomem;
> +
> +	if (gfp_mask & __GFP_NORETRY)
> +		goto nomem;
> +
> +	nr_reclaimed = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
> +
> +	if (mem_cgroup_margin(mem_over_limit) >= batch)
> +		goto retry;
> +	/*
> +	 * Even though the limit is exceeded at this point, reclaim
> +	 * may have been able to free some pages.  Retry the charge
> +	 * before killing the task.
> +	 *
> +	 * Only for regular pages, though: huge pages are rather
> +	 * unlikely to succeed so close to the limit, and we fall back
> +	 * to regular pages anyway in case of failure.
> +	 */
> +	if (nr_reclaimed && batch <= (1 << PAGE_ALLOC_COSTLY_ORDER))
> +		goto retry;
> +	/*
> +	 * At task move, charge accounts can be doubly counted. So, it's
> +	 * better to wait until the end of task_move if something is going on.
> +	 */
> +	if (mem_cgroup_wait_acct_move(mem_over_limit))
> +		goto retry;
> +
> +	if (nr_retries--)
> +		goto retry;
> +
> +	if (gfp_mask & __GFP_NOFAIL)
> +		goto bypass;
> +
> +	if (fatal_signal_pending(current))
> +		goto bypass;
> +
> +	if (!oom)
> +		goto nomem;
> +
> +	mem_cgroup_oom(mem_over_limit, gfp_mask, get_order(batch));
> +nomem:
> +	if (!(gfp_mask & __GFP_NOFAIL))
> +		return -ENOMEM;
> +bypass:
> +	memcg = root_mem_cgroup;
> +	ret = -EINTR;
> +	goto retry;
> +
> +done_restock:
> +	if (batch > nr_pages)
> +		refill_stock(memcg, batch - nr_pages);
> +done:
> +	return ret;
> +}
> +
> +/**
> + * mem_cgroup_try_charge - try charging a page
> + * @page: page to charge
> + * @mm: mm context of the victim
> + * @gfp_mask: reclaim mode
> + * @memcgp: charged memcg return
> + *
> + * Try to charge @page to the memcg that @mm belongs to, reclaiming
> + * pages according to @gfp_mask if necessary.
> + *
> + * Returns 0 on success, with *@memcgp pointing to the charged memcg.
> + * Otherwise, an error code is returned.
> + *
> + * After page->mapping has been set up, the caller must finalize the
> + * charge with mem_cgroup_commit_charge().  Or abort the transaction
> + * with mem_cgroup_cancel_charge() in case page instantiation fails.
> + */
> +int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
> +			  gfp_t gfp_mask, struct mem_cgroup **memcgp)
> +{
> +	struct mem_cgroup *memcg = NULL;
> +	unsigned int nr_pages = 1;
> +	bool oom = true;
> +	int ret = 0;
> +
> +	if (mem_cgroup_disabled())
> +		goto out;
> +
> +	if (PageSwapCache(page)) {
> +		struct page_cgroup *pc = lookup_page_cgroup(page);
> +		/*
> +		 * Every swap fault against a single page tries to charge the
> +		 * page, bail as early as possible.  shmem_unuse() encounters
> +		 * already charged pages, too.  The USED bit is protected by
> +		 * the page lock, which serializes swap cache removal, which
> +		 * in turn serializes uncharging.
> +		 */
> +		if (PageCgroupUsed(pc))
> +			goto out;
> +	}
> +
> +	if (PageTransHuge(page)) {
> +		nr_pages <<= compound_order(page);
> +		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
> +		/*
> +		 * Never OOM-kill a process for a huge page.  The
> +		 * fault handler will fall back to regular pages.
> +		 */
> +		oom = false;
> +	}
> +
> +	if (do_swap_account && PageSwapCache(page))
> +		memcg = try_get_mem_cgroup_from_page(page);
> +	if (!memcg) {
> +		/*
> +		 * Page cache insertions can happen without an actual
> +		 * mm context, e.g. during disk probing on boot.
> +		 */
> +		if (unlikely(!mm)) {
> +			memcg = root_mem_cgroup;
> +			css_get(&memcg->css);
> +		} else
> +			memcg = get_mem_cgroup_from_mm(mm);
> +	}
> +
> +	ret = try_charge(memcg, gfp_mask, nr_pages, oom);
> +
> +	css_put(&memcg->css);
> +
> +	if (ret == -EINTR) {
> +		memcg = root_mem_cgroup;
> +		ret = 0;
> +	}
> +out:
> +	*memcgp = memcg;
> +	return ret;
> +}
> +
> +static void commit_charge(struct page *page, struct mem_cgroup *memcg,
> +			  unsigned int nr_pages, bool anon, bool lrucare)
> +{
> +	struct page_cgroup *pc = lookup_page_cgroup(page);
> +	struct zone *uninitialized_var(zone);
> +	bool was_on_lru = false;
> +	struct lruvec *lruvec;
> +
> +	lock_page_cgroup(pc);
> +
> +	VM_BUG_ON_PAGE(PageCgroupUsed(pc), page);
> +	VM_BUG_ON_PAGE(PageLRU(page) && !lrucare, page);
> +
> +	if (lrucare) {
> +		zone = page_zone(page);
> +		spin_lock_irq(&zone->lru_lock);
> +		if (PageLRU(page)) {
> +			lruvec = mem_cgroup_zone_lruvec(zone, pc->mem_cgroup);
> +			ClearPageLRU(page);
> +			del_page_from_lru_list(page, lruvec, page_lru(page));
> +			was_on_lru = true;
> +		}
> +	}
> +
> +	pc->mem_cgroup = memcg;
> +	SetPageCgroupUsed(pc);
> +
> +	if (lrucare) {
> +		if (was_on_lru) {
> +			lruvec = mem_cgroup_zone_lruvec(zone, pc->mem_cgroup);
> +			VM_BUG_ON_PAGE(PageLRU(page), page);
> +			SetPageLRU(page);
> +			add_page_to_lru_list(page, lruvec, page_lru(page));
> +		}
> +		spin_unlock_irq(&zone->lru_lock);
> +	}
> +
> +	mem_cgroup_charge_statistics(memcg, page, anon, nr_pages);
> +	unlock_page_cgroup(pc);
> +
> +	memcg_check_events(memcg, page);
> +}
> +
> +/**
> + * mem_cgroup_commit_charge - commit a page charge
> + * @page: page to charge
> + * @memcg: memcg to charge the page to
> + * @lrucare: page might be on LRU already
> + *
> + * Finalize a charge transaction started by mem_cgroup_try_charge(),
> + * after page->mapping has been set up.  This must happen atomically
> + * as part of the page instantiation, i.e. under the page table lock
> + * for anonymous pages, under the page lock for page and swap cache.
> + *
> + * In addition, the page must not be on the LRU during the commit, to
> + * prevent racing with task migration.  If it might be, use @lrucare.
> + *
> + * Use mem_cgroup_cancel_charge() to cancel the transaction instead.
> + */
> +void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
> +			      bool lrucare)
> +{
> +	unsigned int nr_pages = 1;
> +
> +	VM_BUG_ON_PAGE(!page->mapping, page);
> +	VM_BUG_ON_PAGE(PageLRU(page) && !lrucare, page);
> +
> +	if (mem_cgroup_disabled())
> +		return;
> +	/*
> +	 * Swap faults will attempt to charge the same page multiple
> +	 * times.  But reuse_swap_page() might have removed the page
> +	 * from swapcache already, so we can't check PageSwapCache().
> +	 */
> +	if (!memcg)
> +		return;
> +
> +	if (PageTransHuge(page)) {
> +		nr_pages <<= compound_order(page);
> +		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
> +	}
> +
> +	commit_charge(page, memcg, nr_pages, PageAnon(page), lrucare);
> +
> +	if (do_swap_account && PageSwapCache(page)) {
> +		swp_entry_t entry = { .val = page_private(page) };
> +		/*
> +		 * The swap entry might not get freed for a long time,
> +		 * let's not wait for it.  The page already received a
> +		 * memory+swap charge, drop the swap entry duplicate.
> +		 */
> +		mem_cgroup_uncharge_swap(entry);
> +	}
> +}
> +
> +static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
> +{
> +	unsigned long bytes = nr_pages * PAGE_SIZE;
> +
> +	res_counter_uncharge(&memcg->res, bytes);
> +	if (do_swap_account)
> +		res_counter_uncharge(&memcg->memsw, bytes);
> +}
> +
> +/**
> + * mem_cgroup_cancel_charge - cancel a page charge
> + * @page: page to charge
> + * @memcg: memcg to charge the page to
> + *
> + * Cancel a charge transaction started by mem_cgroup_try_charge().
> + */
> +void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg)
> +{
> +	unsigned int nr_pages = 1;
> +
> +	if (mem_cgroup_disabled())
> +		return;
> +	/*
> +	 * Swap faults will attempt to charge the same page multiple
> +	 * times.  But reuse_swap_page() might have removed the page
> +	 * from swapcache already, so we can't check PageSwapCache().
> +	 */
> +	if (!memcg)
> +		return;
> +
> +	if (PageTransHuge(page)) {
> +		nr_pages <<= compound_order(page);
> +		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
> +	}
> +
> +	cancel_charge(memcg, nr_pages);
> +}
> +
>  /*
>   * subsys_initcall() for memory controller.
>   *
> diff --git a/mm/memory.c b/mm/memory.c
> index d0f0bef3be48..36af46a50fad 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2673,6 +2673,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  	struct page *dirty_page = NULL;
>  	unsigned long mmun_start = 0;	/* For mmu_notifiers */
>  	unsigned long mmun_end = 0;	/* For mmu_notifiers */
> +	struct mem_cgroup *memcg;
>  
>  	old_page = vm_normal_page(vma, address, orig_pte);
>  	if (!old_page) {
> @@ -2828,7 +2829,7 @@ gotten:
>  	}
>  	__SetPageUptodate(new_page);
>  
> -	if (mem_cgroup_charge_anon(new_page, mm, GFP_KERNEL))
> +	if (mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, &memcg))
>  		goto oom_free_new;
>  
>  	mmun_start  = address & PAGE_MASK;
> @@ -2858,6 +2859,8 @@ gotten:
>  		 */
>  		ptep_clear_flush(vma, address, page_table);
>  		page_add_new_anon_rmap(new_page, vma, address);
> +		mem_cgroup_commit_charge(new_page, memcg, false);
> +		lru_cache_add_active_or_unevictable(new_page, vma);
>  		/*
>  		 * We call the notify macro here because, when using secondary
>  		 * mmu page tables (such as kvm shadow page tables), we want the
> @@ -2895,7 +2898,7 @@ gotten:
>  		new_page = old_page;
>  		ret |= VM_FAULT_WRITE;
>  	} else
> -		mem_cgroup_uncharge_page(new_page);
> +		mem_cgroup_cancel_charge(new_page, memcg);
>  
>  	if (new_page)
>  		page_cache_release(new_page);
> @@ -3031,10 +3034,10 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  {
>  	spinlock_t *ptl;
>  	struct page *page, *swapcache;
> +	struct mem_cgroup *memcg;
>  	swp_entry_t entry;
>  	pte_t pte;
>  	int locked;
> -	struct mem_cgroup *ptr;
>  	int exclusive = 0;
>  	int ret = 0;
>  
> @@ -3110,7 +3113,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		goto out_page;
>  	}
>  
> -	if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr)) {
> +	if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg)) {
>  		ret = VM_FAULT_OOM;
>  		goto out_page;
>  	}
> @@ -3135,10 +3138,6 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  	 * while the page is counted on swap but not yet in mapcount i.e.
>  	 * before page_add_anon_rmap() and swap_free(); try_to_free_swap()
>  	 * must be called after the swap_free(), or it will never succeed.
> -	 * Because delete_from_swap_page() may be called by reuse_swap_page(),
> -	 * mem_cgroup_commit_charge_swapin() may not be able to find swp_entry
> -	 * in page->private. In this case, a record in swap_cgroup  is silently
> -	 * discarded at swap_free().
>  	 */
>  
>  	inc_mm_counter_fast(mm, MM_ANONPAGES);
> @@ -3154,12 +3153,14 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  	if (pte_swp_soft_dirty(orig_pte))
>  		pte = pte_mksoft_dirty(pte);
>  	set_pte_at(mm, address, page_table, pte);
> -	if (page == swapcache)
> +	if (page == swapcache) {
>  		do_page_add_anon_rmap(page, vma, address, exclusive);
> -	else /* ksm created a completely new copy */
> +		mem_cgroup_commit_charge(page, memcg, true);
> +	} else { /* ksm created a completely new copy */
>  		page_add_new_anon_rmap(page, vma, address);
> -	/* It's better to call commit-charge after rmap is established */
> -	mem_cgroup_commit_charge_swapin(page, ptr);
> +		mem_cgroup_commit_charge(page, memcg, false);
> +		lru_cache_add_active_or_unevictable(page, vma);
> +	}
>  
>  	swap_free(entry);
>  	if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
> @@ -3192,7 +3193,7 @@ unlock:
>  out:
>  	return ret;
>  out_nomap:
> -	mem_cgroup_cancel_charge_swapin(ptr);
> +	mem_cgroup_cancel_charge(page, memcg);
>  	pte_unmap_unlock(page_table, ptl);
>  out_page:
>  	unlock_page(page);
> @@ -3248,6 +3249,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		unsigned long address, pte_t *page_table, pmd_t *pmd,
>  		unsigned int flags)
>  {
> +	struct mem_cgroup *memcg;
>  	struct page *page;
>  	spinlock_t *ptl;
>  	pte_t entry;
> @@ -3281,7 +3283,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  	 */
>  	__SetPageUptodate(page);
>  
> -	if (mem_cgroup_charge_anon(page, mm, GFP_KERNEL))
> +	if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg))
>  		goto oom_free_page;
>  
>  	entry = mk_pte(page, vma->vm_page_prot);
> @@ -3294,6 +3296,8 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  
>  	inc_mm_counter_fast(mm, MM_ANONPAGES);
>  	page_add_new_anon_rmap(page, vma, address);
> +	mem_cgroup_commit_charge(page, memcg, false);
> +	lru_cache_add_active_or_unevictable(page, vma);
>  setpte:
>  	set_pte_at(mm, address, page_table, entry);
>  
> @@ -3303,7 +3307,7 @@ unlock:
>  	pte_unmap_unlock(page_table, ptl);
>  	return 0;
>  release:
> -	mem_cgroup_uncharge_page(page);
> +	mem_cgroup_cancel_charge(page, memcg);
>  	page_cache_release(page);
>  	goto unlock;
>  oom_free_page:
> @@ -3526,6 +3530,7 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  		pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
>  {
>  	struct page *fault_page, *new_page;
> +	struct mem_cgroup *memcg;
>  	spinlock_t *ptl;
>  	pte_t *pte;
>  	int ret;
> @@ -3537,7 +3542,7 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  	if (!new_page)
>  		return VM_FAULT_OOM;
>  
> -	if (mem_cgroup_charge_anon(new_page, mm, GFP_KERNEL)) {
> +	if (mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, &memcg)) {
>  		page_cache_release(new_page);
>  		return VM_FAULT_OOM;
>  	}
> @@ -3557,12 +3562,14 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  		goto uncharge_out;
>  	}
>  	do_set_pte(vma, address, new_page, pte, true, true);
> +	mem_cgroup_commit_charge(new_page, memcg, false);
> +	lru_cache_add_active_or_unevictable(new_page, vma);
>  	pte_unmap_unlock(pte, ptl);
>  	unlock_page(fault_page);
>  	page_cache_release(fault_page);
>  	return ret;
>  uncharge_out:
> -	mem_cgroup_uncharge_page(new_page);
> +	mem_cgroup_cancel_charge(new_page, memcg);
>  	page_cache_release(new_page);
>  	return ret;
>  }
> diff --git a/mm/migrate.c b/mm/migrate.c
> index bed48809e5d0..a88fabd71f87 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1853,6 +1853,7 @@ fail_putback:
>  	 */
>  	flush_cache_range(vma, mmun_start, mmun_end);
>  	page_add_new_anon_rmap(new_page, vma, mmun_start);
> +	lru_cache_add_active_or_unevictable(new_page, vma);
>  	pmdp_clear_flush(vma, mmun_start, pmd);
>  	set_pmd_at(mm, mmun_start, pmd, entry);
>  	flush_tlb_range(vma, mmun_start, mmun_end);
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 9c3e77396d1a..6b6fe5f4ece1 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1024,11 +1024,6 @@ void page_add_new_anon_rmap(struct page *page,
>  	__mod_zone_page_state(page_zone(page), NR_ANON_PAGES,
>  			hpage_nr_pages(page));
>  	__page_set_anon_rmap(page, vma, address, 1);
> -	if (!mlocked_vma_newpage(vma, page)) {
> -		SetPageActive(page);
> -		lru_cache_add(page);
> -	} else
> -		add_page_to_unevictable_list(page);
>  }
>  
>  /**
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 8f1a95406bae..f8637acc2dad 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -668,6 +668,7 @@ int shmem_unuse(swp_entry_t swap, struct page *page)
>  {
>  	struct list_head *this, *next;
>  	struct shmem_inode_info *info;
> +	struct mem_cgroup *memcg;
>  	int found = 0;
>  	int error = 0;
>  
> @@ -683,7 +684,7 @@ int shmem_unuse(swp_entry_t swap, struct page *page)
>  	 * the shmem_swaplist_mutex which might hold up shmem_writepage().
>  	 * Charged back to the user (not to caller) when swap account is used.
>  	 */
> -	error = mem_cgroup_charge_file(page, current->mm, GFP_KERNEL);
> +	error = mem_cgroup_try_charge(page, current->mm, GFP_KERNEL, &memcg);
>  	if (error)
>  		goto out;
>  	/* No radix_tree_preload: swap entry keeps a place for page in tree */
> @@ -701,8 +702,11 @@ int shmem_unuse(swp_entry_t swap, struct page *page)
>  	}
>  	mutex_unlock(&shmem_swaplist_mutex);
>  
> -	if (found < 0)
> +	if (found < 0) {
>  		error = found;
> +		mem_cgroup_cancel_charge(page, memcg);
> +	} else
> +		mem_cgroup_commit_charge(page, memcg, true);
>  out:
>  	unlock_page(page);
>  	page_cache_release(page);
> @@ -1005,6 +1009,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
>  	struct address_space *mapping = inode->i_mapping;
>  	struct shmem_inode_info *info;
>  	struct shmem_sb_info *sbinfo;
> +	struct mem_cgroup *memcg;
>  	struct page *page;
>  	swp_entry_t swap;
>  	int error;
> @@ -1080,8 +1085,7 @@ repeat:
>  				goto failed;
>  		}
>  
> -		error = mem_cgroup_charge_file(page, current->mm,
> -						gfp & GFP_RECLAIM_MASK);
> +		error = mem_cgroup_try_charge(page, current->mm, gfp, &memcg);
>  		if (!error) {
>  			error = shmem_add_to_page_cache(page, mapping, index,
>  						gfp, swp_to_radix_entry(swap));
> @@ -1097,12 +1101,16 @@ repeat:
>  			 * Reset swap.val? No, leave it so "failed" goes back to
>  			 * "repeat": reading a hole and writing should succeed.
>  			 */
> -			if (error)
> +			if (error) {
> +				mem_cgroup_cancel_charge(page, memcg);
>  				delete_from_swap_cache(page);
> +			}
>  		}
>  		if (error)
>  			goto failed;
>  
> +		mem_cgroup_commit_charge(page, memcg, true);
> +
>  		spin_lock(&info->lock);
>  		info->swapped--;
>  		shmem_recalc_inode(inode);
> @@ -1134,8 +1142,7 @@ repeat:
>  
>  		SetPageSwapBacked(page);
>  		__set_page_locked(page);
> -		error = mem_cgroup_charge_file(page, current->mm,
> -						gfp & GFP_RECLAIM_MASK);
> +		error = mem_cgroup_try_charge(page, current->mm, gfp, &memcg);
>  		if (error)
>  			goto decused;
>  		error = radix_tree_maybe_preload(gfp & GFP_RECLAIM_MASK);
> @@ -1145,9 +1152,10 @@ repeat:
>  			radix_tree_preload_end();
>  		}
>  		if (error) {
> -			mem_cgroup_uncharge_cache_page(page);
> +			mem_cgroup_cancel_charge(page, memcg);
>  			goto decused;
>  		}
> +		mem_cgroup_commit_charge(page, memcg, false);
>  		lru_cache_add_anon(page);
>  
>  		spin_lock(&info->lock);
> diff --git a/mm/swap.c b/mm/swap.c
> index 9ce43ba4498b..a5bdff331507 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -635,6 +635,26 @@ void add_page_to_unevictable_list(struct page *page)
>  	spin_unlock_irq(&zone->lru_lock);
>  }
>  
> +/**
> + * lru_cache_add_active_or_unevictable
> + * @page:  the page to be added to LRU
> + * @vma:   vma in which page is mapped for determining reclaimability
> + *
> + * Place @page on the active or unevictable LRU list, depending on its
> + * evictability.  Note that if the page is not evictable, it goes
> + * directly back onto its zone's unevictable list, it does NOT use a
> + * per cpu pagevec.
> + */
> +void lru_cache_add_active_or_unevictable(struct page *page,
> +					 struct vm_area_struct *vma)
> +{
> +	if (!mlocked_vma_newpage(vma, page)) {
> +		SetPageActive(page);
> +		lru_cache_add(page);
> +	} else
> +		add_page_to_unevictable_list(page);
> +}
> +
>  /*
>   * If the page can not be invalidated, it is moved to the
>   * inactive list to speed up its reclaim.  It is moved to the
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 4a7f7e6992b6..7c57c7256c6e 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1126,15 +1126,14 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
>  	if (unlikely(!page))
>  		return -ENOMEM;
>  
> -	if (mem_cgroup_try_charge_swapin(vma->vm_mm, page,
> -					 GFP_KERNEL, &memcg)) {
> +	if (mem_cgroup_try_charge(page, vma->vm_mm, GFP_KERNEL, &memcg)) {
>  		ret = -ENOMEM;
>  		goto out_nolock;
>  	}
>  
>  	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>  	if (unlikely(!maybe_same_pte(*pte, swp_entry_to_pte(entry)))) {
> -		mem_cgroup_cancel_charge_swapin(memcg);
> +		mem_cgroup_cancel_charge(page, memcg);
>  		ret = 0;
>  		goto out;
>  	}
> @@ -1144,11 +1143,14 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
>  	get_page(page);
>  	set_pte_at(vma->vm_mm, addr, pte,
>  		   pte_mkold(mk_pte(page, vma->vm_page_prot)));
> -	if (page == swapcache)
> +	if (page == swapcache) {
>  		page_add_anon_rmap(page, vma, addr);
> -	else /* ksm created a completely new copy */
> +		mem_cgroup_commit_charge(page, memcg, true);
> +	} else { /* ksm created a completely new copy */
>  		page_add_new_anon_rmap(page, vma, addr);
> -	mem_cgroup_commit_charge_swapin(page, memcg);
> +		mem_cgroup_commit_charge(page, memcg, false);
> +		lru_cache_add_active_or_unevictable(page, vma);
> +	}
>  	swap_free(entry);
>  	/*
>  	 * Move the page to the active list so it is not
> -- 
> 1.9.2
> 

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch 8/9] mm: memcontrol: rewrite charge API
  2014-05-23 14:54   ` Michal Hocko
@ 2014-05-23 15:18     ` Michal Hocko
  2014-05-27 20:05     ` Johannes Weiner
  1 sibling, 0 replies; 35+ messages in thread
From: Michal Hocko @ 2014-05-23 15:18 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: linux-mm, Hugh Dickins, Tejun Heo, cgroups, linux-kernel

On Fri 23-05-14 16:54:13, Michal Hocko wrote:
> On Wed 30-04-14 16:25:42, Johannes Weiner wrote:
> > The memcg charge API charges pages before they are rmapped - i.e. have
> > an actual "type" - and so every callsite needs its own set of charge
> > and uncharge functions to know what type is being operated on.
> > 
> > Rewrite the charge API to provide a generic set of try_charge(),
> > commit_charge() and cancel_charge() transaction operations, much like
> > what's currently done for swap-in:
> > 
> >   mem_cgroup_try_charge() attempts to reserve a charge, reclaiming
> >   pages from the memcg if necessary.
> > 
> >   mem_cgroup_commit_charge() commits the page to the charge once it
> >   has a valid page->mapping and PageAnon() reliably tells the type.
> > 
> >   mem_cgroup_cancel_charge() aborts the transaction.
> > 
> > As pages need to be committed after rmap is established but before
> > they are added to the LRU, page_add_new_anon_rmap() must stop doing
> > LRU additions again.  Factor lru_cache_add_active_or_unevictable().
> > 
> > The order of functions in mm/memcontrol.c is entirely random, so this
> > new charge interface is implemented at the end of the file, where all
> > new or cleaned up, and documented code should go from now on.
> 
> I would prefer moving them after refactoring because the reviewing is
> much harder this way. If such moving is needed at all.
> 
> Anyway this is definitely not a Friday material...
> 
> So only a first impression from a quick glance.
> 
> size is saying the code is slightly bigger:
>    text    data     bss     dec     hex filename
>  487977   84898   45984  618859   9716b mm/built-in.o.7
>  488276   84898   45984  619158   97296 mm/built-in.o.8
> 
> No biggie though.
> 
> It is true it gets rid of ~80LOC in memcontrol.c but it adds some more
> outside of memcg. Most of the charging paths didn't get any easier: they
> already know the type and they have to make sure they even commit the
> charge now.
> 
> But maybe it is just me feeling that now that we have
> mem_cgroup_charge_{anon,file,swapin} the API doesn't look so insane
> anymore and so I am not tempted to change it that much.
> 
> I will look at this with a Monday and fresh brain again.

And now that I got to 9/9 it is obvious it helps a lot to clean up the
uncharge path. But I am not in a mental state to dive into this today.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch 9/9] mm: memcontrol: rewrite uncharge API
  2014-04-30 20:25 ` [patch 9/9] mm: memcontrol: rewrite uncharge API Johannes Weiner
  2014-05-04 14:32   ` Johannes Weiner
@ 2014-05-27  7:43   ` Kamezawa Hiroyuki
  2014-05-27 18:59     ` Johannes Weiner
  1 sibling, 1 reply; 35+ messages in thread
From: Kamezawa Hiroyuki @ 2014-05-27  7:43 UTC (permalink / raw)
  To: Johannes Weiner, linux-mm
  Cc: Michal Hocko, Hugh Dickins, Tejun Heo, cgroups, linux-kernel

(2014/05/01 5:25), Johannes Weiner wrote:
> The memcg uncharging code that is involved towards the end of a page's
> lifetime - truncation, reclaim, swapout, migration - is impressively
> complicated and fragile.
> 
> Because anonymous and file pages were always charged before they had
> their page->mapping established, uncharges had to happen when the page
> type could be known from the context, as in unmap for anonymous, page
> cache removal for file and shmem pages, and swap cache truncation for
> swap pages.  However, these operations also happen well before the
> page is actually freed, and so a lot of synchronization is necessary:
> 
> - On page migration, the old page might be unmapped but then reused,
>    so memcg code has to prevent an untimely uncharge in that case.
>    Because this code - which should be a simple charge transfer - is so
>    special-cased, it is not reusable for replace_page_cache().
> 
> - Swap cache truncation happens during both swap-in and swap-out, and
>    possibly repeatedly before the page is actually freed.  This means
>    that the memcg swapout code is called from many contexts that make
>    no sense and it has to figure out the direction from page state to
>    make sure memory and memory+swap are always correctly charged.
> 
> But now that charged pages always have a page->mapping, introduce
> mem_cgroup_uncharge(), which is called after the final put_page(),
> when we know for sure that nobody is looking at the page anymore.
> 
> For page migration, introduce mem_cgroup_migrate(), which is called
> after the migration is successful and the new page is fully rmapped.
> Because the old page is no longer uncharged after migration, prevent
> double charges by decoupling the page's memcg association (PCG_USED
> and pc->mem_cgroup) from the page holding an actual charge.  The new
> bits PCG_MEM and PCG_MEMSW represent the respective charges and are
> transferred to the new page during migration.
> 
> mem_cgroup_migrate() is suitable for replace_page_cache() as well.
> 
> Swap accounting is massively simplified: because the page is no longer
> uncharged as early as swap cache deletion, a new mem_cgroup_swapout()
> can transfer the page's memory+swap charge (PCG_MEMSW) to the swap
> entry before the final put_page() in page reclaim.
> 
> Finally, because pages are now charged under proper serialization
> (anon: exclusive; cache: page lock; swapin: page lock; migration: page
> lock), and uncharged under full exclusion, they can not race with
> themselves.  Because they are also off-LRU during charge/uncharge,
> charge migration can not race, with that, either.  Remove the crazily
> expensive the page_cgroup lock and set pc->flags non-atomically.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

The whole series seems wonderful to me. Thank you.
I'm not sure whether I have enough good eyes now but this seems good.

One thing on my mind is the batched uncharge rework.

Because uncharge() is now done in the final put_page() path, the current
placement of mem_cgroup_uncharge_start()/mem_cgroup_uncharge_end() may
not be good enough.

It may be good for swap.c::release_pages() to have mem_cgroup_uncharge_start()/end().
(and you may then be able to remove unnecessary calls of mem_cgroup_uncharge_start/end())
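
For illustration, a minimal sketch of that placement (the existing loop
body of release_pages() is elided; this is only a sketch, not a patch):

	/* in mm/swap.c::release_pages() */
	mem_cgroup_uncharge_start();
	for (i = 0; i < nr; i++) {
		struct page *page = pages[i];

		if (!put_page_testzero(page))
			continue;
		/* ... existing LRU teardown and freeing; the memcg
		 * charge is now dropped at this final free, so the
		 * res_counter updates get batched between start()/end()
		 */
	}
	mem_cgroup_uncharge_end();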

Thanks,
-Kame




^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch 9/9] mm: memcontrol: rewrite uncharge API
  2014-05-27  7:43   ` Kamezawa Hiroyuki
@ 2014-05-27 18:59     ` Johannes Weiner
  0 siblings, 0 replies; 35+ messages in thread
From: Johannes Weiner @ 2014-05-27 18:59 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: linux-mm, Michal Hocko, Hugh Dickins, Tejun Heo, cgroups,
	linux-kernel

Hi Kame,

it's been a long time, I hope you're doing well.

On Tue, May 27, 2014 at 04:43:28PM +0900, Kamezawa Hiroyuki wrote:
> (2014/05/01 5:25), Johannes Weiner wrote:
> > The memcg uncharging code that is involved towards the end of a page's
> > lifetime - truncation, reclaim, swapout, migration - is impressively
> > complicated and fragile.
> > 
> > Because anonymous and file pages were always charged before they had
> > their page->mapping established, uncharges had to happen when the page
> > type could be known from the context, as in unmap for anonymous, page
> > cache removal for file and shmem pages, and swap cache truncation for
> > swap pages.  However, these operations also happen well before the
> > page is actually freed, and so a lot of synchronization is necessary:
> > 
> > - On page migration, the old page might be unmapped but then reused,
> >    so memcg code has to prevent an untimely uncharge in that case.
> >    Because this code - which should be a simple charge transfer - is so
> >    special-cased, it is not reusable for replace_page_cache().
> > 
> > - Swap cache truncation happens during both swap-in and swap-out, and
> >    possibly repeatedly before the page is actually freed.  This means
> >    that the memcg swapout code is called from many contexts that make
> >    no sense and it has to figure out the direction from page state to
> >    make sure memory and memory+swap are always correctly charged.
> > 
> > But now that charged pages always have a page->mapping, introduce
> > mem_cgroup_uncharge(), which is called after the final put_page(),
> > when we know for sure that nobody is looking at the page anymore.
> > 
> > For page migration, introduce mem_cgroup_migrate(), which is called
> > after the migration is successful and the new page is fully rmapped.
> > Because the old page is no longer uncharged after migration, prevent
> > double charges by decoupling the page's memcg association (PCG_USED
> > and pc->mem_cgroup) from the page holding an actual charge.  The new
> > bits PCG_MEM and PCG_MEMSW represent the respective charges and are
> > transferred to the new page during migration.
> > 
> > mem_cgroup_migrate() is suitable for replace_page_cache() as well.
> > 
> > Swap accounting is massively simplified: because the page is no longer
> > uncharged as early as swap cache deletion, a new mem_cgroup_swapout()
> > can transfer the page's memory+swap charge (PCG_MEMSW) to the swap
> > entry before the final put_page() in page reclaim.
> > 
> > Finally, because pages are now charged under proper serialization
> > (anon: exclusive; cache: page lock; swapin: page lock; migration: page
> > lock), and uncharged under full exclusion, they can not race with
> > themselves.  Because they are also off-LRU during charge/uncharge,
> > charge migration can not race, with that, either.  Remove the crazily
> > expensive the page_cgroup lock and set pc->flags non-atomically.
> > 
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> The whole series seems wonderful to me. Thank you.
> I'm not sure whether I have enough good eyes now but this seems good.

Thank you!

> One thing on my mind is the batched uncharge rework.
> 
> Because uncharge() is now done in the final put_page() path, the current
> placement of mem_cgroup_uncharge_start()/mem_cgroup_uncharge_end() may
> not be good enough.
> 
> It may be good for swap.c::release_pages() to have mem_cgroup_uncharge_start()/end().
> (and you may then be able to remove unnecessary calls of mem_cgroup_uncharge_start/end())

That's a good point.

I pushed the batch calls from all pagevec_release() callers directly
into release_pages(), which is everyone but shrink_page_list().

THP fallback abort used to do real uncharging, but now only does
cancelling, so it's no longer batched - I removed the batch calls
there as well.  Not optimal, but it should be fine in this slowpath.
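
To illustrate the caller side (a sketch only; pvec here is just a local
pagevec for illustration, not a reference to a specific call site):

	/* before, in this series: each pagevec_release() caller
	 * bracketed the call itself
	 */
	mem_cgroup_uncharge_start();
	pagevec_release(&pvec);
	mem_cgroup_uncharge_end();

	/* after: release_pages() batches internally, so callers are
	 * back to a plain
	 */
	pagevec_release(&pvec);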


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch 6/9] mm: memcontrol: remove ordering between pc->mem_cgroup and PageCgroupUsed
  2014-05-23 13:20   ` Michal Hocko
@ 2014-05-27 19:45     ` Johannes Weiner
  2014-05-28 11:31       ` Michal Hocko
  0 siblings, 1 reply; 35+ messages in thread
From: Johannes Weiner @ 2014-05-27 19:45 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, Hugh Dickins, Tejun Heo, cgroups, linux-kernel

On Fri, May 23, 2014 at 03:20:43PM +0200, Michal Hocko wrote:
> On Wed 30-04-14 16:25:40, Johannes Weiner wrote:
> > There is a write barrier between setting pc->mem_cgroup and
> > PageCgroupUsed, which was added to allow LRU operations to lookup the
> > memcg LRU list of a page without acquiring the page_cgroup lock.  But
> > ever since 38c5d72f3ebe ("memcg: simplify LRU handling by new rule"),
> > pages are ensured to be off-LRU while charging, so nobody else is
> > changing LRU state while pc->mem_cgroup is being written.
> 
> This is quite confusing. Why do we have the lrucare path then?

Some charge paths start with the page already on the LRU; lrucare makes
sure it's off during the charge.
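
Illustrative sketch of that lrucare case (condensed, not the actual
implementation): the commit briefly isolates the page so pc->mem_cgroup
can be switched without racing with LRU accounting.

	struct zone *zone = page_zone(page);
	struct lruvec *lruvec;
	bool was_on_lru = false;

	spin_lock_irq(&zone->lru_lock);
	if (PageLRU(page)) {
		lruvec = mem_cgroup_page_lruvec(page, zone);
		ClearPageLRU(page);
		del_page_from_lru_list(page, lruvec, page_lru(page));
		was_on_lru = true;
	}

	lookup_page_cgroup(page)->mem_cgroup = memcg;

	if (was_on_lru) {
		/* look the lruvec up again: it now belongs to the new memcg */
		lruvec = mem_cgroup_page_lruvec(page, zone);
		SetPageLRU(page);
		add_page_to_lru_list(page, lruvec, page_lru(page));
	}
	spin_unlock_irq(&zone->lru_lock);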

> The code is quite tricky so this deserves a more detailed explanation
> IMO.
> 
> There are only 3 paths which check both the flag and mem_cgroup (
> without page_cgroup_lock) get_mctgt_type* and mem_cgroup_page_lruvec AFAICS.
> None of them have rmb so there was no guarantee about ordering anyway.

Yeah, exactly.  As per the changelog, this is a remnant of the way it
used to work but it's no longer needed because of guaranteed off-LRU
state.
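
For reference, the ordering being removed, sketched (assuming the barrier
in question is the smp_wmb() between the two stores):

	pc->mem_cgroup = memcg;
	smp_wmb();		/* was needed for lockless LRU lookups */
	SetPageCgroupUsed(pc);

With the page guaranteed off-LRU while this runs, plain stores are enough
and the barrier can go.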

> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> Anyway, the change is welcome
> Acked-by: Michal Hocko <mhocko@suse.cz>

Thanks!


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch 7/9] mm: memcontrol: do not acquire page_cgroup lock for kmem pages
  2014-05-23 13:39   ` Michal Hocko
  2014-05-23 13:40     ` Michal Hocko
  2014-05-23 14:29     ` Vladimir Davydov
@ 2014-05-27 19:53     ` Johannes Weiner
  2014-05-28 11:33       ` Michal Hocko
  2 siblings, 1 reply; 35+ messages in thread
From: Johannes Weiner @ 2014-05-27 19:53 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, Hugh Dickins, Tejun Heo, cgroups, linux-kernel

On Fri, May 23, 2014 at 03:39:38PM +0200, Michal Hocko wrote:
> I am adding Vladimir to CC
> 
> On Wed 30-04-14 16:25:41, Johannes Weiner wrote:
> > Kmem page charging and uncharging is serialized by means of exclusive
> > access to the page.  Do not take the page_cgroup lock and don't set
> > pc->flags atomically.
> > 
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> The patch is correct I just have some comments below.
> Anyway
> Acked-by: Michal Hocko <mhocko@suse.cz>

Thanks!

> >  mm/memcontrol.c | 16 +++-------------
> >  1 file changed, 3 insertions(+), 13 deletions(-)
> > 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index c528ae9ac230..d3961fce1d54 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -3535,10 +3535,8 @@ void __memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg,
> >  	}
> >  
> 
> 	/*
> 	 * given page is newly allocated and invisible to everybody but
> 	 * the caller so there is no need to use page_cgroup lock nor
> 	 * SetPageCgroupUsed
> 	 */
> 
> would be helpful?

That makes sense; I added the following:

+       /*
+        * The page is freshly allocated and not visible to any
+        * outside callers yet.  Set up pc non-atomically.
+        */

> >  	pc = lookup_page_cgroup(page);
> > -	lock_page_cgroup(pc);
> >  	pc->mem_cgroup = memcg;
> > -	SetPageCgroupUsed(pc);
> > -	unlock_page_cgroup(pc);
> > +	pc->flags = PCG_USED;
> >  }
> >  
> >  void __memcg_kmem_uncharge_pages(struct page *page, int order)
> > @@ -3548,19 +3546,11 @@ void __memcg_kmem_uncharge_pages(struct page *page, int order)
> >  
> >  
> >  	pc = lookup_page_cgroup(page);
> > -	/*
> > -	 * Fast unlocked return. Theoretically might have changed, have to
> > -	 * check again after locking.
> > -	 */
> 
> This comment has been there since the code was merged. Maybe it was true
> at the time, but after "mm: get rid of __GFP_KMEMCG" it is definitely out
> of date.
> 
> 	/*
> 	 * the page is going away and will be freed and nobody can see
> 	 * it anymore so there is no need to take the page_cgroup lock.
> 	 */
> >  	if (!PageCgroupUsed(pc))
> >  		return;
> >  
> > -	lock_page_cgroup(pc);
> > -	if (PageCgroupUsed(pc)) {
> > -		memcg = pc->mem_cgroup;
> > -		ClearPageCgroupUsed(pc);
> > -	}
> > -	unlock_page_cgroup(pc);
> 
> maybe add
> 	WARN_ON_ONCE(pc->flags != PCG_USED);
> 
> to check for an unexpected flags usage in the kmem path?

There is no overlap between page types that use PCG_USED and those
that don't.  What would be the value of adding this?


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch 8/9] mm: memcontrol: rewrite charge API
  2014-05-23 14:54   ` Michal Hocko
  2014-05-23 15:18     ` Michal Hocko
@ 2014-05-27 20:05     ` Johannes Weiner
  2014-05-28 11:37       ` Michal Hocko
  1 sibling, 1 reply; 35+ messages in thread
From: Johannes Weiner @ 2014-05-27 20:05 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, Hugh Dickins, Tejun Heo, cgroups, linux-kernel

On Fri, May 23, 2014 at 04:54:13PM +0200, Michal Hocko wrote:
> On Wed 30-04-14 16:25:42, Johannes Weiner wrote:
> > The memcg charge API charges pages before they are rmapped - i.e. have
> > an actual "type" - and so every callsite needs its own set of charge
> > and uncharge functions to know what type is being operated on.
> > 
> > Rewrite the charge API to provide a generic set of try_charge(),
> > commit_charge() and cancel_charge() transaction operations, much like
> > what's currently done for swap-in:
> > 
> >   mem_cgroup_try_charge() attempts to reserve a charge, reclaiming
> >   pages from the memcg if necessary.
> > 
> >   mem_cgroup_commit_charge() commits the page to the charge once it
> >   has a valid page->mapping and PageAnon() reliably tells the type.
> > 
> >   mem_cgroup_cancel_charge() aborts the transaction.
> > 
> > As pages need to be committed after rmap is established but before
> > they are added to the LRU, page_add_new_anon_rmap() must stop doing
> > LRU additions again.  Factor lru_cache_add_active_or_unevictable().
> > 
> > The order of functions in mm/memcontrol.c is entirely random, so this
> > new charge interface is implemented at the end of the file, where all
> > new or cleaned up, and documented code should go from now on.
> 
> I would prefer moving them after refactoring because the reviewing is
> much harder this way. If such moving is needed at all.

I find it incredibly cumbersome to work with this code because of the
ordering.  Sure, you use the search function of the editor, but you
don't even know whether to look above or below, half of the hits are
forward declarations etc.  Crappy code attracts more crappy code, so I
feel strongly that we clean this up and raise the bar for the future.

As to the ordering: I chose this way because this really is a
fundamental rewrite, and I figured it would be *easier* to read if you
have the entire relevant code show up in the diff.  I.e. try_charge()
is fully included, right next to its API entry function.

If this doesn't work for you - the reviewer - I'm happy to change it
around and move the code separately.

> size is saying the code is slightly bigger:
>    text    data     bss     dec     hex filename
>  487977   84898   45984  618859   9716b mm/built-in.o.7
>  488276   84898   45984  619158   97296 mm/built-in.o.8
> 
> No biggie though.
> 
> It is true it gets rid of ~80LOC in memcontrol.c but it adds some more
> outside of memcg. Most of the charging paths didn't get any easier: they
> already know the type and they have to make sure they even commit the
> charge now.
> 
> But maybe it is just me feeling that now that we have
> mem_cgroup_charge_{anon,file,swapin} the API doesn't look so insane
> anymore and so I am not tempted to change it that much.

I should have been a little clearer in the changelog: this is mainly
to make sure we never commit pages before their rmapping is
established so that not only charging, but also uncharging can be
drastically simplified.
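
Condensed from the do_anonymous_page() hunk in this patch, the ordering
the series enforces looks like this (simplified sketch):

	struct mem_cgroup *memcg;

	if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg))
		goto oom_free_page;			/* reserve, may reclaim */
	/* ... pte setup; a failure in between does
	 *     mem_cgroup_cancel_charge(page, memcg); ... */
	page_add_new_anon_rmap(page, vma, address);	/* rmap gives the page a type */
	mem_cgroup_commit_charge(page, memcg, false);	/* commit only after rmap...  */
	lru_cache_add_active_or_unevictable(page, vma);	/* ...and before the LRU      */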

You already noticed that when looking at the next patch, but I'll make
sure to mention it here as well.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch 6/9] mm: memcontrol: remove ordering between pc->mem_cgroup and PageCgroupUsed
  2014-05-27 19:45     ` Johannes Weiner
@ 2014-05-28 11:31       ` Michal Hocko
  0 siblings, 0 replies; 35+ messages in thread
From: Michal Hocko @ 2014-05-28 11:31 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: linux-mm, Hugh Dickins, Tejun Heo, cgroups, linux-kernel

On Tue 27-05-14 15:45:00, Johannes Weiner wrote:
> On Fri, May 23, 2014 at 03:20:43PM +0200, Michal Hocko wrote:
> > On Wed 30-04-14 16:25:40, Johannes Weiner wrote:
> > > There is a write barrier between setting pc->mem_cgroup and
> > > PageCgroupUsed, which was added to allow LRU operations to lookup the
> > > memcg LRU list of a page without acquiring the page_cgroup lock.  But
> > > ever since 38c5d72f3ebe ("memcg: simplify LRU handling by new rule"),
> > > pages are ensured to be off-LRU while charging, so nobody else is
> > > changing LRU state while pc->mem_cgroup is being written.
> > 
> > This is quite confusing. Why do we have the lrucare path then?
> 
> Some charge paths start with the page on the LRU, lrucare makes sure
> it's off during the charge.

Yeah, I know. I just wanted to point out that the changelog might be
confusing, and so mentioning this aspect would be nice...

> > The code is quite tricky so this deserves a more detailed explanation
> > IMO.
> > 
> > There are only 3 paths which check both the flag and mem_cgroup (
> > without page_cgroup_lock) get_mctgt_type* and mem_cgroup_page_lruvec AFAICS.
> > None of them have rmb so there was no guarantee about ordering anyway.
> 
> Yeah, exactly.  As per the changelog, this is a remnant of the way it
> used to work but it's no longer needed because of guaranteed off-LRU
> state.
> 
> > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > 
> > Anyway, the change is welcome
> > Acked-by: Michal Hocko <mhocko@suse.cz>
> 
> Thanks!

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch 7/9] mm: memcontrol: do not acquire page_cgroup lock for kmem pages
  2014-05-27 19:53     ` Johannes Weiner
@ 2014-05-28 11:33       ` Michal Hocko
  0 siblings, 0 replies; 35+ messages in thread
From: Michal Hocko @ 2014-05-28 11:33 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: linux-mm, Hugh Dickins, Tejun Heo, cgroups, linux-kernel

On Tue 27-05-14 15:53:42, Johannes Weiner wrote:
> On Fri, May 23, 2014 at 03:39:38PM +0200, Michal Hocko wrote:
[...]
> > >  	if (!PageCgroupUsed(pc))
> > >  		return;
> > >  
> > > -	lock_page_cgroup(pc);
> > > -	if (PageCgroupUsed(pc)) {
> > > -		memcg = pc->mem_cgroup;
> > > -		ClearPageCgroupUsed(pc);
> > > -	}
> > > -	unlock_page_cgroup(pc);
> > 
> > maybe add
> > 	WARN_ON_ONCE(pc->flags != PCG_USED);
> > 
> > to check for an unexpected flags usage in the kmem path?
> 
> There is no overlap between page types that use PCG_USED and those
> that don't.  What would be the value of adding this?

I meant it as an early warning that something bad is going on.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [patch 8/9] mm: memcontrol: rewrite charge API
  2014-05-27 20:05     ` Johannes Weiner
@ 2014-05-28 11:37       ` Michal Hocko
  0 siblings, 0 replies; 35+ messages in thread
From: Michal Hocko @ 2014-05-28 11:37 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: linux-mm, Hugh Dickins, Tejun Heo, cgroups, linux-kernel

On Tue 27-05-14 16:05:16, Johannes Weiner wrote:
> On Fri, May 23, 2014 at 04:54:13PM +0200, Michal Hocko wrote:
> > On Wed 30-04-14 16:25:42, Johannes Weiner wrote:
> > > The memcg charge API charges pages before they are rmapped - i.e. have
> > > an actual "type" - and so every callsite needs its own set of charge
> > > and uncharge functions to know what type is being operated on.
> > > 
> > > Rewrite the charge API to provide a generic set of try_charge(),
> > > commit_charge() and cancel_charge() transaction operations, much like
> > > what's currently done for swap-in:
> > > 
> > >   mem_cgroup_try_charge() attempts to reserve a charge, reclaiming
> > >   pages from the memcg if necessary.
> > > 
> > >   mem_cgroup_commit_charge() commits the page to the charge once it
> > >   has a valid page->mapping and PageAnon() reliably tells the type.
> > > 
> > >   mem_cgroup_cancel_charge() aborts the transaction.
> > > 
> > > As pages need to be committed after rmap is established but before
> > > they are added to the LRU, page_add_new_anon_rmap() must stop doing
> > > LRU additions again.  Factor lru_cache_add_active_or_unevictable().
> > > 
> > > The order of functions in mm/memcontrol.c is entirely random, so this
> > > new charge interface is implemented at the end of the file, where all
> > > new or cleaned up, and documented code should go from now on.
> > 
> > I would prefer moving them after refactoring because the reviewing is
> > much harder this way. If such moving is needed at all.
> 
> I find it incredibly cumbersome to work with this code because of the
> ordering.  Sure, you use the search function of the editor, but you
> don't even know whether to look above or below, half of the hits are
> forward declarations etc. 

I tend to use cscope when moving through code so I never considered that
a big hassle.

> Crappy code attracts more crappy code, so I
> feel strongly that we clean this up and raise the bar for the future.

No objection to that. If reorganization helps in that direction then
let's do it. But I would rather do it in a separate patch to have an
easy way to compare the results (e.g. by comparing the generated code).

> As to the ordering: I chose this way because this really is a
> fundamental rewrite, and I figured it would be *easier* to read if you
> have the entire relevant code show up in the diff.  I.e. try_charge()
> is fully included, right next to its API entry function.
> 
> If this doesn't work for you - the reviewer - I'm happy to change it
> around and move the code separately.

Yeah, that would make the review easier. At least for me.
 
> > size is saying the code is slightly bigger:
> >    text    data     bss     dec     hex filename
> >  487977   84898   45984  618859   9716b mm/built-in.o.7
> >  488276   84898   45984  619158   97296 mm/built-in.o.8
> > 
> > No biggie though.
> > 
> > It is true it gets rid of ~80LOC in memcontrol.c but it adds some more
> > outside of memcg. Most of the charging paths didn't get any easier: they
> > already know the type and they have to make sure they even commit the
> > charge now.
> > 
> > But maybe it is just me feeling that now that we have
> > mem_cgroup_charge_{anon,file,swapin} the API doesn't look so insane
> > anymore and so I am not tempted to change it that much.
> 
> I should have been a little clearer in the changelog: this is mainly
> to make sure we never commit pages before their rmapping is
> established so that not only charging, but also uncharging can be
> drastically simplified.
> 
> You already noticed that when looking at the next patch, but I'll make
> sure to mention it here as well.

Thanks!
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2014-05-28 11:37 UTC | newest]

Thread overview: 35+ messages
2014-04-30 20:25 [patch 0/9] mm: memcontrol: naturalize charge lifetime Johannes Weiner
2014-04-30 20:25 ` [patch 1/9] mm: memcontrol: fold mem_cgroup_do_charge() Johannes Weiner
2014-04-30 20:25 ` [patch 2/9] mm: memcontrol: rearrange charging fast path Johannes Weiner
2014-05-07 14:33   ` Michal Hocko
2014-05-08 18:22     ` Johannes Weiner
2014-05-12  7:59       ` Michal Hocko
2014-04-30 20:25 ` [patch 3/9] mm: memcontrol: retry reclaim for oom-disabled and __GFP_NOFAIL charges Johannes Weiner
2014-05-07 14:43   ` Michal Hocko
2014-05-08 18:28     ` Johannes Weiner
2014-04-30 20:25 ` [patch 4/9] mm: memcontrol: catch root bypass in move precharge Johannes Weiner
2014-05-07 14:55   ` Michal Hocko
2014-05-08 18:30     ` Johannes Weiner
2014-04-30 20:25 ` [patch 5/9] mm: memcontrol: use root_mem_cgroup res_counter Johannes Weiner
2014-05-07 15:14   ` Michal Hocko
2014-04-30 20:25 ` [patch 6/9] mm: memcontrol: remove ordering between pc->mem_cgroup and PageCgroupUsed Johannes Weiner
2014-05-23 13:20   ` Michal Hocko
2014-05-27 19:45     ` Johannes Weiner
2014-05-28 11:31       ` Michal Hocko
2014-04-30 20:25 ` [patch 7/9] mm: memcontrol: do not acquire page_cgroup lock for kmem pages Johannes Weiner
2014-05-23 13:39   ` Michal Hocko
2014-05-23 13:40     ` Michal Hocko
2014-05-23 14:29     ` Vladimir Davydov
2014-05-27 19:53     ` Johannes Weiner
2014-05-28 11:33       ` Michal Hocko
2014-04-30 20:25 ` [patch 8/9] mm: memcontrol: rewrite charge API Johannes Weiner
2014-05-23 14:18   ` Michal Hocko
2014-05-23 14:54   ` Michal Hocko
2014-05-23 15:18     ` Michal Hocko
2014-05-27 20:05     ` Johannes Weiner
2014-05-28 11:37       ` Michal Hocko
2014-04-30 20:25 ` [patch 9/9] mm: memcontrol: rewrite uncharge API Johannes Weiner
2014-05-04 14:32   ` Johannes Weiner
2014-05-27  7:43   ` Kamezawa Hiroyuki
2014-05-27 18:59     ` Johannes Weiner
2014-05-02 11:26 ` [patch 0/9] mm: memcontrol: naturalize charge lifetime Michal Hocko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).