* [patch 1/4] mm: memcontrol: use generic direct reclaim code to meet the allocation
From: Johannes Weiner @ 2014-08-08 21:38 UTC
To: linux-mm
Cc: Michal Hocko, Greg Thelen, Vladimir Davydov, Tejun Heo, cgroups,
linux-kernel
Memcg loops around direct reclaim, requesting SWAP_CLUSTER_MAX pages
and checking the status after every invocation, even though the direct
reclaim code is already required, and optimized, to meet higher targets
efficiently.
Pass the proper target to direct reclaim and remove a lot of looping
and reclaim cruft from historic memcg glory. All the callsites still
loop in case reclaim efforts get stolen by concurrent allocations, or
when reclaim has a hard time making progress, but at least they no
longer have to loop just to meet the basic reclaim goal.
This also prepares the memcg reclaim API for use with the planned high
limit, to target any limit excess with a single reclaim invocation.
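For reference, a condensed sketch of the resulting charge slowpath,
simplified from the hunks below (OOM handling, batching, and error
paths omitted):

retry:
	if (!res_counter_charge(&memcg->res, size, &fail_res))
		goto done_restock;		/* fit under the limit */
	mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
	if (!(gfp_mask & __GFP_WAIT))
		goto nomem;
	/* one reclaim pass, sized to the actual allocation */
	nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
						    gfp_mask, may_swap);
	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
		goto retry;
	/* flush the per-cpu charge caches once before giving up */
	if (!drained) {
		drain_all_stock_async(memcg);
		drained = true;
		goto retry;
	}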
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
include/linux/swap.h | 6 ++--
mm/memcontrol.c | 86 +++++++++++-----------------------------------------
mm/vmscan.c | 7 +++--
3 files changed, 25 insertions(+), 74 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 1b72060f093a..f94614a2668a 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -327,8 +327,10 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
gfp_t gfp_mask, nodemask_t *mask);
extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
-extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
- gfp_t gfp_mask, bool noswap);
+extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
+ unsigned long nr_pages,
+ gfp_t gfp_mask,
+ bool may_swap);
extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
gfp_t gfp_mask, bool noswap,
struct zone *zone,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ec4dcf1b9562..4146c0f47ba2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -315,9 +315,6 @@ struct mem_cgroup {
/* OOM-Killer disable */
int oom_kill_disable;
- /* set when res.limit == memsw.limit */
- bool memsw_is_minimum;
-
/* protect arrays of thresholds */
struct mutex thresholds_lock;
@@ -481,14 +478,6 @@ enum res_type {
#define OOM_CONTROL (0)
/*
- * Reclaim flags for mem_cgroup_hierarchical_reclaim
- */
-#define MEM_CGROUP_RECLAIM_NOSWAP_BIT 0x0
-#define MEM_CGROUP_RECLAIM_NOSWAP (1 << MEM_CGROUP_RECLAIM_NOSWAP_BIT)
-#define MEM_CGROUP_RECLAIM_SHRINK_BIT 0x1
-#define MEM_CGROUP_RECLAIM_SHRINK (1 << MEM_CGROUP_RECLAIM_SHRINK_BIT)
-
-/*
* The memcg_create_mutex will be held whenever a new cgroup is created.
* As a consequence, any change that needs to protect against new child cgroups
* appearing has to hold it as well.
@@ -1792,42 +1781,6 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
NULL, "Memory cgroup out of memory");
}
-static unsigned long mem_cgroup_reclaim(struct mem_cgroup *memcg,
- gfp_t gfp_mask,
- unsigned long flags)
-{
- unsigned long total = 0;
- bool noswap = false;
- int loop;
-
- if (flags & MEM_CGROUP_RECLAIM_NOSWAP)
- noswap = true;
- if (!(flags & MEM_CGROUP_RECLAIM_SHRINK) && memcg->memsw_is_minimum)
- noswap = true;
-
- for (loop = 0; loop < MEM_CGROUP_MAX_RECLAIM_LOOPS; loop++) {
- if (loop)
- drain_all_stock_async(memcg);
- total += try_to_free_mem_cgroup_pages(memcg, gfp_mask, noswap);
- /*
- * Allow limit shrinkers, which are triggered directly
- * by userspace, to catch signals and stop reclaim
- * after minimal progress, regardless of the margin.
- */
- if (total && (flags & MEM_CGROUP_RECLAIM_SHRINK))
- break;
- if (mem_cgroup_margin(memcg))
- break;
- /*
- * If nothing was reclaimed after two attempts, there
- * may be no reclaimable pages in this hierarchy.
- */
- if (loop && !total)
- break;
- }
- return total;
-}
-
/**
* test_mem_cgroup_node_reclaimable
* @memcg: the target memcg
@@ -2530,8 +2483,9 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
struct mem_cgroup *mem_over_limit;
struct res_counter *fail_res;
unsigned long nr_reclaimed;
- unsigned long flags = 0;
unsigned long long size;
+ bool may_swap = true;
+ bool drained = false;
int ret = 0;
retry:
@@ -2546,7 +2500,7 @@ retry:
goto done_restock;
res_counter_uncharge(&memcg->res, size);
mem_over_limit = mem_cgroup_from_res_counter(fail_res, memsw);
- flags |= MEM_CGROUP_RECLAIM_NOSWAP;
+ may_swap = false;
} else
mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
@@ -2572,11 +2526,18 @@ retry:
if (!(gfp_mask & __GFP_WAIT))
goto nomem;
- nr_reclaimed = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
+ nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
+ gfp_mask, may_swap);
if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
goto retry;
+ if (!drained) {
+ drain_all_stock_async(memcg);
+ drained = true;
+ goto retry;
+ }
+
if (gfp_mask & __GFP_NORETRY)
goto nomem;
/*
@@ -3707,19 +3668,13 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
enlarge = 1;
ret = res_counter_set_limit(&memcg->res, val);
- if (!ret) {
- if (memswlimit == val)
- memcg->memsw_is_minimum = true;
- else
- memcg->memsw_is_minimum = false;
- }
mutex_unlock(&set_limit_mutex);
if (!ret)
break;
- mem_cgroup_reclaim(memcg, GFP_KERNEL,
- MEM_CGROUP_RECLAIM_SHRINK);
+ try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, true);
+
curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
/* Usage is reduced ? */
if (curusage >= oldusage)
@@ -3766,20 +3721,13 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
if (memswlimit < val)
enlarge = 1;
ret = res_counter_set_limit(&memcg->memsw, val);
- if (!ret) {
- if (memlimit == val)
- memcg->memsw_is_minimum = true;
- else
- memcg->memsw_is_minimum = false;
- }
mutex_unlock(&set_limit_mutex);
if (!ret)
break;
- mem_cgroup_reclaim(memcg, GFP_KERNEL,
- MEM_CGROUP_RECLAIM_NOSWAP |
- MEM_CGROUP_RECLAIM_SHRINK);
+ try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, false);
+
curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
/* Usage is reduced ? */
if (curusage >= oldusage)
@@ -4028,8 +3976,8 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
if (signal_pending(current))
return -EINTR;
- progress = try_to_free_mem_cgroup_pages(memcg, GFP_KERNEL,
- false);
+ progress = try_to_free_mem_cgroup_pages(memcg, 1,
+ GFP_KERNEL, true);
if (!progress) {
nr_retries--;
/* maybe some writeback is necessary */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2836b5373b2e..f1609423821b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2753,21 +2753,22 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *memcg,
}
unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
+ unsigned long nr_pages,
gfp_t gfp_mask,
- bool noswap)
+ bool may_swap)
{
struct zonelist *zonelist;
unsigned long nr_reclaimed;
int nid;
struct scan_control sc = {
- .nr_to_reclaim = SWAP_CLUSTER_MAX,
+ .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
.target_mem_cgroup = memcg,
.priority = DEF_PRIORITY,
.may_writepage = !laptop_mode,
.may_unmap = 1,
- .may_swap = !noswap,
+ .may_swap = may_swap,
};
/*
--
2.0.3
* [patch 2/4] mm: memcontrol: add memory.current and memory.high to default hierarchy
From: Johannes Weiner @ 2014-08-08 21:38 UTC
To: linux-mm
Cc: Michal Hocko, Greg Thelen, Vladimir Davydov, Tejun Heo, cgroups,
linux-kernel
Provide the most fundamental interface necessary for memory cgroups to
partition the machine for concurrent workloads in unified hierarchy:
report the current usage and allow setting an upper limit on it.
The upper limit, set in memory.high, is not a strict OOM limit and is
enforced purely by direct reclaim. This is a deviation from the old
hard upper limit, which history has shown to fail at partitioning a
machine for real workloads in a resource-efficient manner: if chosen
conservatively, the hard limit risks OOM kills; if chosen generously,
memory is underutilized most of the time. As a result, in practice
the limit is mostly used to contain extremes, while balancing of
regular working set fluctuations and cache trimming is left to global
reclaim and the global OOM killer; this creates increasing demand for
complicated cgroup-specific prioritization features in both of them.
The high limit, on the other hand, is a target size limit that is meant
to trim caches and keep consumption at the average working set size
while providing elasticity for peaks. This allows memory cgroups to
be useful for workload packing without relying too much on global VM
interventions, except for parallel peaks or inadequate configurations.
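As an illustration of the intended workflow, a minimal userspace
sketch that sets the target size and then monitors consumption; the
cgroup path is hypothetical and error handling is trimmed:

#include <stdio.h>

int main(void)
{
	/* hypothetical group on a mounted unified hierarchy */
	const char *dir = "/sys/fs/cgroup/job1";
	unsigned long long usage;
	char path[256];
	FILE *f;

	/* conservative working set estimate; reclaim trims toward this */
	snprintf(path, sizeof(path), "%s/memory.high", dir);
	f = fopen(path, "w");
	if (f) {
		fprintf(f, "%llu\n", 512ULL << 20);	/* 512M */
		fclose(f);
	}

	/* watch current consumption to judge high limit excess */
	snprintf(path, sizeof(path), "%s/memory.current", dir);
	f = fopen(path, "r");
	if (f && fscanf(f, "%llu", &usage) == 1)
		printf("usage: %llu bytes\n", usage);
	if (f)
		fclose(f);
	return 0;
}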
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
Documentation/cgroups/unified-hierarchy.txt | 52 +++++++++++++++++
include/linux/res_counter.h | 29 ++++++++++
kernel/res_counter.c | 3 +
mm/memcontrol.c | 89 ++++++++++++++++++++++++++---
4 files changed, 164 insertions(+), 9 deletions(-)
diff --git a/Documentation/cgroups/unified-hierarchy.txt b/Documentation/cgroups/unified-hierarchy.txt
index 4f4563277864..2d91530b8d6c 100644
--- a/Documentation/cgroups/unified-hierarchy.txt
+++ b/Documentation/cgroups/unified-hierarchy.txt
@@ -324,9 +324,61 @@ supported and the interface files "release_agent" and
4-3-3. memory
+Memory cgroups account and limit the memory consumption of cgroups,
+but the current limit semantics make the feature hard to use and
+create problems in existing configurations.
+
+4.3.3.1 No more default hard limit
+
+'memory.limit_in_bytes' is the current upper limit that cannot be
+exceeded under any circumstances. If it cannot be met by direct
+reclaim, the tasks in the cgroup are OOM killed.
+
+While this may look like a valid approach to partition the machine, in
+practice workloads expand and contract during runtime, and it's
+impossible to get the machine-wide configuration right: if users set
+this hard limit conservatively, they are plagued by cgroup-internal
+OOM kills during peaks while memory might be idle (external waste).
+If they set it too generously, precious resources are either unused or
+wasted on old cache (internal waste). Because of that, in practice
+users set the hard limit only to handle extremes and then overcommit
+the machine. This leaves the actual partitioning and group trimming
+to global reclaim and OOM handling, which has led to increasing
+demands for recognizing cgroup policy during global reclaim, and even
+the ability to handle global OOM situations from userspace using
+task-specific memory reserves. All these outcomes and developments
+show the utter failure of hard limits to effectively partition the
+machine for maximum utilization.
+
+When it comes to monitoring cgroup health, 'memory.pressure_level' was
+added for userspace to monitor memory pressure based on group-internal
+reclaim efficiency. But as per the above, group trimming is mostly
+done by global reclaim, and the pressure a group experiences is not
+proportional to its excess. And once internal pressure actually
+builds, the window between onset and an OOM kill can be very short
+with hard limits - by the time internal pressure is reported to
+userspace, it's often too late to intervene before the group goes OOM.
+Both aspects severely limit the ability to monitor cgroup health,
+detect looming OOM situations, and pinpoint offenders.
+
+In unified hierarchy, the primary means of limiting memory consumption
+is 'memory.high'. It's enforced by direct reclaim to trim caches and
+keep the workload lean, but can be exceeded during working set peaks.
+This moves the responsibility of partitioning mostly back to memory
+cgroups, and global handling only engages during concurrent peaks.
+
+Configurations can start out by setting this limit to a conservative
+estimate of the average working set size and then make upward
+adjustments based on monitoring high limit excess, workload
+performance, and the global memory situation.
+
+4.3.3.2 Misc changes
+
- use_hierarchy is on by default and the cgroup file for the flag is
not created.
+- memory.usage_in_bytes is renamed to memory.current to be in line
+ with the new limit naming scheme
5. Planned Changes
diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index 56b7bc32db4f..27394cfdf1fe 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -32,6 +32,10 @@ struct res_counter {
*/
unsigned long long max_usage;
/*
+ * the high limit that creates pressure but can be exceeded
+ */
+ unsigned long long high;
+ /*
* the limit that usage cannot exceed
*/
unsigned long long limit;
@@ -85,6 +89,7 @@ int res_counter_memparse_write_strategy(const char *buf,
enum {
RES_USAGE,
RES_MAX_USAGE,
+ RES_HIGH,
RES_LIMIT,
RES_FAILCNT,
RES_SOFT_LIMIT,
@@ -132,6 +137,19 @@ u64 res_counter_uncharge(struct res_counter *counter, unsigned long val);
u64 res_counter_uncharge_until(struct res_counter *counter,
struct res_counter *top,
unsigned long val);
+
+static inline unsigned long long res_counter_high(struct res_counter *cnt)
+{
+ unsigned long long high = 0;
+ unsigned long flags;
+
+ spin_lock_irqsave(&cnt->lock, flags);
+ if (cnt->usage > cnt->high)
+ high = cnt->usage - cnt->high;
+ spin_unlock_irqrestore(&cnt->lock, flags);
+ return high;
+}
+
/**
* res_counter_margin - calculate chargeable space of a counter
* @cnt: the counter
@@ -193,6 +211,17 @@ static inline void res_counter_reset_failcnt(struct res_counter *cnt)
spin_unlock_irqrestore(&cnt->lock, flags);
}
+static inline int res_counter_set_high(struct res_counter *cnt,
+ unsigned long long high)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&cnt->lock, flags);
+ cnt->high = high;
+ spin_unlock_irqrestore(&cnt->lock, flags);
+ return 0;
+}
+
static inline int res_counter_set_limit(struct res_counter *cnt,
unsigned long long limit)
{
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index e791130f85a7..26a08be49a3d 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -17,6 +17,7 @@
void res_counter_init(struct res_counter *counter, struct res_counter *parent)
{
spin_lock_init(&counter->lock);
+ counter->high = RES_COUNTER_MAX;
counter->limit = RES_COUNTER_MAX;
counter->soft_limit = RES_COUNTER_MAX;
counter->parent = parent;
@@ -130,6 +131,8 @@ res_counter_member(struct res_counter *counter, int member)
return &counter->usage;
case RES_MAX_USAGE:
return &counter->max_usage;
+ case RES_HIGH:
+ return &counter->high;
case RES_LIMIT:
return &counter->limit;
case RES_FAILCNT:
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4146c0f47ba2..81627387fbd7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2481,8 +2481,8 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
unsigned int batch = max(CHARGE_BATCH, nr_pages);
int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
struct mem_cgroup *mem_over_limit;
- struct res_counter *fail_res;
unsigned long nr_reclaimed;
+ struct res_counter *res;
unsigned long long size;
bool may_swap = true;
bool drained = false;
@@ -2493,16 +2493,16 @@ retry:
goto done;
size = batch * PAGE_SIZE;
- if (!res_counter_charge(&memcg->res, size, &fail_res)) {
+ if (!res_counter_charge(&memcg->res, size, &res)) {
if (!do_swap_account)
goto done_restock;
- if (!res_counter_charge(&memcg->memsw, size, &fail_res))
+ if (!res_counter_charge(&memcg->memsw, size, &res))
goto done_restock;
res_counter_uncharge(&memcg->res, size);
- mem_over_limit = mem_cgroup_from_res_counter(fail_res, memsw);
+ mem_over_limit = mem_cgroup_from_res_counter(res, memsw);
may_swap = false;
} else
- mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
+ mem_over_limit = mem_cgroup_from_res_counter(res, res);
if (batch > nr_pages) {
batch = nr_pages;
@@ -2579,6 +2579,21 @@ bypass:
done_restock:
if (batch > nr_pages)
refill_stock(memcg, batch - nr_pages);
+
+ res = &memcg->res;
+ while (res) {
+ unsigned long long high = res_counter_high(res);
+
+ if (high) {
+ unsigned long high_pages = high >> PAGE_SHIFT;
+ struct mem_cgroup *memcg;
+
+ memcg = mem_cgroup_from_res_counter(res, res);
+ try_to_free_mem_cgroup_pages(memcg, high_pages,
+ gfp_mask, true);
+ }
+ res = res->parent;
+ }
done:
return ret;
}
@@ -5141,7 +5156,7 @@ out_kfree:
return ret;
}
-static struct cftype mem_cgroup_files[] = {
+static struct cftype mem_cgroup_legacy_files[] = {
{
.name = "usage_in_bytes",
.private = MEMFILE_PRIVATE(_MEM, RES_USAGE),
@@ -5250,7 +5265,7 @@ static struct cftype mem_cgroup_files[] = {
};
#ifdef CONFIG_MEMCG_SWAP
-static struct cftype memsw_cgroup_files[] = {
+static struct cftype memsw_cgroup_legacy_files[] = {
{
.name = "memsw.usage_in_bytes",
.private = MEMFILE_PRIVATE(_MEMSWAP, RES_USAGE),
@@ -6195,6 +6210,61 @@ static void mem_cgroup_bind(struct cgroup_subsys_state *root_css)
mem_cgroup_from_css(root_css)->use_hierarchy = true;
}
+static u64 memory_current_read(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+ return res_counter_read_u64(&memcg->res, RES_USAGE);
+}
+
+static u64 memory_high_read(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+ return res_counter_read_u64(&memcg->res, RES_HIGH);
+}
+
+static ssize_t memory_high_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+ u64 high;
+ int ret;
+
+ if (mem_cgroup_is_root(memcg))
+ return -EINVAL;
+
+ buf = strim(buf);
+ ret = res_counter_memparse_write_strategy(buf, &high);
+ if (ret)
+ return ret;
+
+ ret = res_counter_set_high(&memcg->res, high);
+ if (ret)
+ return ret;
+
+ high = res_counter_high(&memcg->res);
+ if (high)
+ try_to_free_mem_cgroup_pages(memcg, high >> PAGE_SHIFT,
+ GFP_KERNEL, true);
+
+ return nbytes;
+}
+
+static struct cftype memory_files[] = {
+ {
+ .name = "current",
+ .read_u64 = memory_current_read,
+ },
+ {
+ .name = "high",
+ .read_u64 = memory_high_read,
+ .write = memory_high_write,
+ },
+};
+
struct cgroup_subsys memory_cgrp_subsys = {
.css_alloc = mem_cgroup_css_alloc,
.css_online = mem_cgroup_css_online,
@@ -6205,7 +6275,8 @@ struct cgroup_subsys memory_cgrp_subsys = {
.cancel_attach = mem_cgroup_cancel_attach,
.attach = mem_cgroup_move_task,
.bind = mem_cgroup_bind,
- .legacy_cftypes = mem_cgroup_files,
+ .dfl_cftypes = memory_files,
+ .legacy_cftypes = mem_cgroup_legacy_files,
.early_init = 0,
};
@@ -6223,7 +6294,7 @@ __setup("swapaccount=", enable_swap_account);
static void __init memsw_file_init(void)
{
WARN_ON(cgroup_add_legacy_cftypes(&memory_cgrp_subsys,
- memsw_cgroup_files));
+ memsw_cgroup_legacy_files));
}
static void __init enable_swap_cgroup(void)
--
2.0.3
* [patch 3/4] mm: memcontrol: add memory.max to default hierarchy
From: Johannes Weiner @ 2014-08-08 21:38 UTC
To: linux-mm
Cc: Michal Hocko, Greg Thelen, Vladimir Davydov, Tejun Heo, cgroups,
linux-kernel
In untrusted environments, a strict upper memory limit on a cgroup can
be necessary to protect against bugs or malicious users.
Provide memory.max, a limit that cannot be breached and will trigger
group-internal OOM killing once page reclaim can no longer enforce it.
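A hedged configuration sketch of how the two limits are meant to
compose; write_file() is a hypothetical helper, and the path and
values are illustrative:

#include <stdio.h>

/* hypothetical helper: write a string to a cgroup control file */
static int write_file(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fprintf(f, "%s\n", val);
	return fclose(f);
}

int main(void)
{
	/* target size: reclaim pressure begins here, excess is allowed */
	write_file("/sys/fs/cgroup/job1/memory.high", "1G");
	/* hard ceiling: reclaim, then OOM; runaway protection only */
	write_file("/sys/fs/cgroup/job1/memory.max", "1536M");
	return 0;
}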
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
Documentation/cgroups/unified-hierarchy.txt | 5 +++++
mm/memcontrol.c | 35 +++++++++++++++++++++++++++++
2 files changed, 40 insertions(+)
diff --git a/Documentation/cgroups/unified-hierarchy.txt b/Documentation/cgroups/unified-hierarchy.txt
index 2d91530b8d6c..ef1db728a035 100644
--- a/Documentation/cgroups/unified-hierarchy.txt
+++ b/Documentation/cgroups/unified-hierarchy.txt
@@ -372,6 +372,10 @@ estimate of the average working set size and then make upward
adjustments based on monitoring high limit excess, workload
performance, and the global memory situation.
+In untrusted environments, users may wish to limit the amount of high
+limit excess in order to contain buggy or malicious workloads. For
+that purpose, a hard upper limit can be set through 'memory.max'.
+
4.3.3.2 Misc changes
- use_hierarchy is on by default and the cgroup file for the flag is
@@ -380,6 +384,7 @@ performance, and the global memory situation.
- memory.usage_in_bytes is renamed to memory.current to be in line
with the new limit naming scheme
+
5. Planned Changes
5-1. CAP for resource control
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 81627387fbd7..a69ff21c8a9a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6253,6 +6253,36 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
return nbytes;
}
+static u64 memory_max_read(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+ return res_counter_read_u64(&memcg->res, RES_LIMIT);
+}
+
+static ssize_t memory_max_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+ u64 max;
+ int ret;
+
+ if (mem_cgroup_is_root(memcg))
+ return -EINVAL;
+
+ buf = strim(buf);
+ ret = res_counter_memparse_write_strategy(buf, &max);
+ if (ret)
+ return ret;
+
+ ret = mem_cgroup_resize_limit(memcg, max);
+ if (ret)
+ return ret;
+
+ return nbytes;
+}
+
static struct cftype memory_files[] = {
{
.name = "current",
@@ -6263,6 +6293,11 @@ static struct cftype memory_files[] = {
.read_u64 = memory_high_read,
.write = memory_high_write,
},
+ {
+ .name = "max",
+ .read_u64 = memory_max_read,
+ .write = memory_max_write,
+ },
};
struct cgroup_subsys memory_cgrp_subsys = {
--
2.0.3
* [patch 4/4] mm: memcontrol: add memory.vmstat to default hierarchy
From: Johannes Weiner @ 2014-08-08 21:38 UTC
To: linux-mm
Cc: Michal Hocko, Greg Thelen, Vladimir Davydov, Tejun Heo, cgroups,
linux-kernel
Provide basic per-memcg vmstat-style statistics on LRU sizes,
allocated and freed pages, and major and minor faults.
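For consumers, a minimal sketch of reading the new file from
userspace; the cgroup path is hypothetical and the keys follow the
format emitted below:

#include <stdio.h>
#include <string.h>

int main(void)
{
	/* hypothetical group on a mounted unified hierarchy */
	FILE *f = fopen("/sys/fs/cgroup/job1/memory.vmstat", "r");
	unsigned long long val;
	char key[64];

	if (!f)
		return 1;
	/* each line is "<name> <value>", in the style of /proc/vmstat */
	while (fscanf(f, "%63s %llu", key, &val) == 2) {
		if (!strcmp(key, "pgmajfault"))
			printf("major faults: %llu\n", val);
	}
	fclose(f);
	return 0;
}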
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
Documentation/cgroups/unified-hierarchy.txt | 8 ++++++
mm/memcontrol.c | 40 +++++++++++++++++++++++++++++
2 files changed, 48 insertions(+)
diff --git a/Documentation/cgroups/unified-hierarchy.txt b/Documentation/cgroups/unified-hierarchy.txt
index ef1db728a035..512e9a2b2e06 100644
--- a/Documentation/cgroups/unified-hierarchy.txt
+++ b/Documentation/cgroups/unified-hierarchy.txt
@@ -384,6 +384,14 @@ that purpose, a hard upper limit can be set through 'memory.max'.
- memory.usage_in_bytes is renamed to memory.current to be in line
with the new limit naming scheme
+- memory.stat has been replaced by memory.vmstat, which provides
+ page-based statistics in the style of /proc/vmstat.
+
+ As cgroups are now always hierarchical and no longer allow tasks in
+ intermediate levels, the local state is irrelevant and all
+ statistics represent the state of the entire hierarchy rooted at the
+ given group.
+
5. Planned Changes
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a69ff21c8a9a..4959460fa170 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6283,6 +6283,42 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
return nbytes;
}
+static u64 tree_events(struct mem_cgroup *memcg, int event)
+{
+ struct mem_cgroup *mi;
+ u64 val = 0;
+
+ for_each_mem_cgroup_tree(mi, memcg)
+ val += mem_cgroup_read_events(mi, event);
+ return val;
+}
+
+static int memory_vmstat_show(struct seq_file *m, void *v)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+ struct mem_cgroup *mi;
+ int i;
+
+ for (i = 0; i < NR_LRU_LISTS; i++) {
+ u64 val = 0;
+
+ for_each_mem_cgroup_tree(mi, memcg)
+ val += mem_cgroup_nr_lru_pages(mi, BIT(i));
+ seq_printf(m, "%s %llu\n", vmstat_text[NR_LRU_BASE + i], val);
+ }
+
+ seq_printf(m, "pgalloc %llu\n",
+ tree_events(memcg, MEM_CGROUP_EVENTS_PGPGIN));
+ seq_printf(m, "pgfree %llu\n",
+ tree_events(memcg, MEM_CGROUP_EVENTS_PGPGOUT));
+ seq_printf(m, "pgfault %llu\n",
+ tree_events(memcg, MEM_CGROUP_EVENTS_PGFAULT));
+ seq_printf(m, "pgmajfault %llu\n",
+ tree_events(memcg, MEM_CGROUP_EVENTS_PGMAJFAULT));
+
+ return 0;
+}
+
static struct cftype memory_files[] = {
{
.name = "current",
@@ -6298,6 +6334,10 @@ static struct cftype memory_files[] = {
.read_u64 = memory_max_read,
.write = memory_max_write,
},
+ {
+ .name = "vmstat",
+ .seq_show = memory_vmstat_show,
+ },
};
struct cgroup_subsys memory_cgrp_subsys = {
--
2.0.3