[patch 0/2] mm: memcontrol: default hierarchy interface for memory v2

* [patch 0/2] mm: memcontrol: default hierarchy interface for memory v2
@ 2015-01-20 15:31 Johannes Weiner
  2015-01-20 15:31 ` [patch 1/2] mm: page_counter: pull "-1" handling out of page_counter_memparse() Johannes Weiner
                   ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Johannes Weiner @ 2015-01-20 15:31 UTC
  To: Andrew Morton
  Cc: Michal Hocko, Vladimir Davydov, Greg Thelen, linux-mm, cgroups,
	linux-kernel

Hi Andrew,

these patches changed sufficiently while in -mm that a rebase makes
sense.  The change from using "none" in the configuration files to
"max"/"infinity" requires a do-over of 1/2 and a changelog fix in 2/2.

I folded all increments, both in-tree and the ones still pending, and
credited your seq_puts() checkpatch fix, so these two changes are the
all-encompassing latest versions, and everything else can be dropped.

Thanks!

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

* [patch 1/2] mm: page_counter: pull "-1" handling out of page_counter_memparse()
  2015-01-20 15:31 [patch 0/2] mm: memcontrol: default hierarchy interface for memory v2 Johannes Weiner
@ 2015-01-20 15:31 ` Johannes Weiner
  2015-01-20 16:04   ` Michal Hocko
  2015-01-20 15:31 ` [patch 2/2] mm: memcontrol: default hierarchy interface for memory Johannes Weiner
  2015-01-20 16:57 ` [patch 0/2] mm: memcontrol: default hierarchy interface for memory v2 Michal Hocko
  2 siblings, 1 reply; 20+ messages in thread
From: Johannes Weiner @ 2015-01-20 15:31 UTC
  To: Andrew Morton
  Cc: Michal Hocko, Vladimir Davydov, Greg Thelen, linux-mm, cgroups,
	linux-kernel

The unified hierarchy interface for memory cgroups will no longer use
"-1" to mean maximum possible resource value.  In preparation for
this, make the string an argument and let the caller supply it.
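
As a rough illustration of the idea (not the kernel code), here is a
minimal userspace sketch of a parser whose "no limit" string is chosen
by the caller.  The names sketch_memparse(), SKETCH_PAGE_SIZE and
SKETCH_COUNTER_MAX are hypothetical, and unlike memparse() the sketch
accepts plain byte counts only, without K/M/G suffixes.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Assumed values for illustration; the kernel defines its own. */
#define SKETCH_PAGE_SIZE   4096UL
#define SKETCH_COUNTER_MAX (LONG_MAX / SKETCH_PAGE_SIZE)

/*
 * Userspace model of the reworked parser: the caller passes the
 * string that stands for "maximum possible value" instead of the
 * parser hardcoding "-1".
 */
int sketch_memparse(const char *buf, const char *max,
		    unsigned long *nr_pages)
{
	unsigned long long bytes;
	char *end;

	assert(buf && max && nr_pages);

	/* Exact match against the caller-supplied maximum string. */
	if (!strcmp(buf, max)) {
		*nr_pages = SKETCH_COUNTER_MAX;
		return 0;
	}

	/* strtoull() would accept a sign or whitespace; reject both. */
	if (!isdigit((unsigned char)buf[0]))
		return -EINVAL;

	bytes = strtoull(buf, &end, 10);
	if (*end != '\0')
		return -EINVAL;

	/* Convert bytes to pages and clamp to the counter maximum. */
	if (bytes / SKETCH_PAGE_SIZE > SKETCH_COUNTER_MAX)
		*nr_pages = SKETCH_COUNTER_MAX;
	else
		*nr_pages = bytes / SKETCH_PAGE_SIZE;
	return 0;
}
```

With this shape, a caller passing "-1" keeps the old behavior, while a
caller passing a different string would see "-1" rejected as invalid.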

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/page_counter.h | 3 ++-
 mm/hugetlb_cgroup.c          | 2 +-
 mm/memcontrol.c              | 4 ++--
 mm/page_counter.c            | 7 ++++---
 net/ipv4/tcp_memcontrol.c    | 2 +-
 5 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
index 955421575d16..17fa4f8de3a6 100644
--- a/include/linux/page_counter.h
+++ b/include/linux/page_counter.h
@@ -41,7 +41,8 @@ int page_counter_try_charge(struct page_counter *counter,
 			    struct page_counter **fail);
 void page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages);
 int page_counter_limit(struct page_counter *counter, unsigned long limit);
-int page_counter_memparse(const char *buf, unsigned long *nr_pages);
+int page_counter_memparse(const char *buf, const char *max,
+			  unsigned long *nr_pages);
 
 static inline void page_counter_reset_watermark(struct page_counter *counter)
 {
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index 037e1c00a5b7..6e0057439a46 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -279,7 +279,7 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
 		return -EINVAL;
 
 	buf = strstrip(buf);
-	ret = page_counter_memparse(buf, &nr_pages);
+	ret = page_counter_memparse(buf, "-1", &nr_pages);
 	if (ret)
 		return ret;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 05ad91cda22c..a3592a756ad9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3442,7 +3442,7 @@ static ssize_t mem_cgroup_write(struct kernfs_open_file *of,
 	int ret;
 
 	buf = strstrip(buf);
-	ret = page_counter_memparse(buf, &nr_pages);
+	ret = page_counter_memparse(buf, "-1", &nr_pages);
 	if (ret)
 		return ret;
 
@@ -3814,7 +3814,7 @@ static int __mem_cgroup_usage_register_event(struct mem_cgroup *memcg,
 	unsigned long usage;
 	int i, size, ret;
 
-	ret = page_counter_memparse(args, &threshold);
+	ret = page_counter_memparse(args, "-1", &threshold);
 	if (ret)
 		return ret;
 
diff --git a/mm/page_counter.c b/mm/page_counter.c
index a009574fbba9..11b4beda14ba 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -166,18 +166,19 @@ int page_counter_limit(struct page_counter *counter, unsigned long limit)
 /**
  * page_counter_memparse - memparse() for page counter limits
  * @buf: string to parse
+ * @max: string meaning maximum possible value
  * @nr_pages: returns the result in number of pages
  *
  * Returns -EINVAL, or 0 and @nr_pages on success.  @nr_pages will be
  * limited to %PAGE_COUNTER_MAX.
  */
-int page_counter_memparse(const char *buf, unsigned long *nr_pages)
+int page_counter_memparse(const char *buf, const char *max,
+			  unsigned long *nr_pages)
 {
-	char unlimited[] = "-1";
 	char *end;
 	u64 bytes;
 
-	if (!strncmp(buf, unlimited, sizeof(unlimited))) {
+	if (!strcmp(buf, max)) {
 		*nr_pages = PAGE_COUNTER_MAX;
 		return 0;
 	}
diff --git a/net/ipv4/tcp_memcontrol.c b/net/ipv4/tcp_memcontrol.c
index 272327134a1b..c2a75c6957a1 100644
--- a/net/ipv4/tcp_memcontrol.c
+++ b/net/ipv4/tcp_memcontrol.c
@@ -120,7 +120,7 @@ static ssize_t tcp_cgroup_write(struct kernfs_open_file *of,
 	switch (of_cft(of)->private) {
 	case RES_LIMIT:
 		/* see memcontrol.c */
-		ret = page_counter_memparse(buf, &nr_pages);
+		ret = page_counter_memparse(buf, "-1", &nr_pages);
 		if (ret)
 			break;
 		mutex_lock(&tcp_limit_mutex);
-- 
2.2.0

* Re: [patch 1/2] mm: page_counter: pull "-1" handling out of page_counter_memparse()
  2015-01-20 15:31 ` [patch 1/2] mm: page_counter: pull "-1" handling out of page_counter_memparse() Johannes Weiner
@ 2015-01-20 16:04   ` Michal Hocko
  0 siblings, 0 replies; 20+ messages in thread
From: Michal Hocko @ 2015-01-20 16:04 UTC
  To: Johannes Weiner
  Cc: Andrew Morton, Vladimir Davydov, Greg Thelen, linux-mm, cgroups,
	linux-kernel

On Tue 20-01-15 10:31:54, Johannes Weiner wrote:
> The unified hierarchy interface for memory cgroups will no longer use
> "-1" to mean maximum possible resource value.  In preparation for
> this, make the string an argument and let the caller supply it.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.cz>

Thanks!

-- 
Michal Hocko
SUSE Labs

* [patch 2/2] mm: memcontrol: default hierarchy interface for memory
  2015-01-20 15:31 [patch 0/2] mm: memcontrol: default hierarchy interface for memory v2 Johannes Weiner
  2015-01-20 15:31 ` [patch 1/2] mm: page_counter: pull "-1" handling out of page_counter_memparse() Johannes Weiner
@ 2015-01-20 15:31 ` Johannes Weiner
  2015-01-20 16:31   ` Michal Hocko
  2015-02-23 11:13   ` Sasha Levin
  2015-01-20 16:57 ` [patch 0/2] mm: memcontrol: default hierarchy interface for memory v2 Michal Hocko
  2 siblings, 2 replies; 20+ messages in thread
From: Johannes Weiner @ 2015-01-20 15:31 UTC
  To: Andrew Morton
  Cc: Michal Hocko, Vladimir Davydov, Greg Thelen, linux-mm, cgroups,
	linux-kernel

Introduce the basic control files to account, partition, and limit
memory using cgroups in default hierarchy mode.

This interface versioning allows us to address fundamental design
issues in the existing memory cgroup interface, further explained
below.  The old interface will be maintained indefinitely, but a
clearer model and improved workload performance should encourage
existing users to switch over to the new one eventually.

The control files are thus:

  - memory.current shows the current consumption of the cgroup and its
    descendants, in bytes.

  - memory.low configures the lower end of the cgroup's expected
    memory consumption range.  The kernel considers memory below that
    boundary to be a reserve - the minimum that the workload needs in
    order to make forward progress - and generally avoids reclaiming
    it, unless there is an imminent risk of entering an OOM situation.

  - memory.high configures the upper end of the cgroup's expected
    memory consumption range.  A cgroup whose consumption grows beyond
    this threshold is forced into direct reclaim, to work off the
    excess and to throttle new allocations heavily, but is generally
    allowed to continue and the OOM killer is not invoked.

  - memory.max configures the hard maximum amount of memory that the
    cgroup is allowed to consume before the OOM killer is invoked.

  - memory.events shows event counters that indicate how often the
    cgroup was reclaimed while below memory.low, how often it was
    forced to reclaim excess beyond memory.high, how often it hit
    memory.max, and how often it entered OOM due to memory.max.  This
    allows users to identify configuration problems when observing a
    degradation in workload performance.  An overcommitted system will
    have an increased rate of low boundary breaches, whereas increased
    rates of high limit breaches, maximum hits, or even OOM situations
    will indicate internally overcommitted cgroups.
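
For monitoring, the counters can be read with very little code.  The
following is a minimal userspace sketch that assumes memory.events
prints one "name value" pair per line with the names low, high, max
and oom; struct memcg_events and parse_memory_events() are
hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the four counters. */
struct memcg_events {
	unsigned long low, high, max, oom;
};

/*
 * Parse "name value" lines from the text of a memory.events file.
 * Unknown names are skipped.  Returns the number of counters that
 * were recognized.
 */
int parse_memory_events(const char *text, struct memcg_events *ev)
{
	unsigned long val;
	char key[16];
	int consumed, found = 0;

	assert(text && ev);
	memset(ev, 0, sizeof(*ev));

	/* %n records how many characters this pair consumed. */
	while (sscanf(text, "%15s %lu%n", key, &val, &consumed) == 2) {
		text += consumed;
		if (!strcmp(key, "low"))
			ev->low = val;
		else if (!strcmp(key, "high"))
			ev->high = val;
		else if (!strcmp(key, "max"))
			ev->max = val;
		else if (!strcmp(key, "oom"))
			ev->oom = val;
		else
			continue;
		found++;
	}
	return found;
}
```

Sampling these counters periodically and comparing the deltas is one
way to tell a rising rate of low breaches from a rising rate of high,
max, or oom events.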

For existing users of memory cgroups, the following deviations from
the current interface are worth pointing out and explaining:

  - The original lower boundary, the soft limit, is defined as a limit
    that is unset by default.  As a result, the set of cgroups that
    global reclaim prefers is opt-in, rather than opt-out.  The costs
    for optimizing these mostly negative lookups are so high that the
    implementation, despite its enormous size, does not even provide
    the basic desirable behavior.  First off, the soft limit has no
    hierarchical meaning.  All configured groups are organized in a
    global rbtree and treated like equal peers, regardless of where
    they are located in the hierarchy.  This makes subtree delegation
    impossible.  Second, the soft limit reclaim pass is so aggressive
    that it not only introduces high allocation latencies into the
    system, but also impacts system performance due to overreclaim, to
    the point where the feature becomes self-defeating.

    The memory.low boundary on the other hand is a top-down allocated
    reserve.  A cgroup enjoys reclaim protection when it and all its
    ancestors are below their low boundaries, which makes delegation
    of subtrees possible.  Secondly, new cgroups have no reserve by
    default and in the common case most cgroups are eligible for the
    preferred reclaim pass.  This allows the new low boundary to be
    efficiently implemented with just a minor addition to the generic
    reclaim code, without the need for out-of-band data structures and
    reclaim passes.  Because the generic reclaim code considers all
    cgroups except for the ones running low in the preferred first
    reclaim pass, overreclaim of individual groups is eliminated as
    well, resulting in much better overall workload performance.

  - The original high boundary, the hard limit, is defined as a strict
    limit that cannot budge, even if the OOM killer has to be called.
    But this generally goes against the goal of making the most out of
    the available memory.  The memory consumption of workloads varies
    during runtime, and that requires users to overcommit.  But doing
    that with a strict upper limit requires either a fairly accurate
    prediction of the working set size or adding slack to the limit.
    Since working set size estimation is hard and error prone, and
    getting it wrong results in OOM kills, most users tend to err on
    the side of a looser limit and end up wasting precious resources.

    The memory.high boundary on the other hand can be set much more
    conservatively.  When hit, it throttles allocations by forcing
    them into direct reclaim to work off the excess, but it never
    invokes the OOM killer.  As a result, a high boundary that is
    chosen too aggressively will not terminate the processes, but
    instead it will lead to gradual performance degradation.  The user
    can monitor this and make corrections until the minimal memory
    footprint that still gives acceptable performance is found.

    In extreme cases, with many concurrent allocations and a complete
    breakdown of reclaim progress within the group, the high boundary
    can be exceeded.  But even then it's mostly better to satisfy the
    allocation from the slack available in other groups or the rest of
    the system than to kill the group.  Otherwise, memory.max is there
    to limit this type of spillover and ultimately contain buggy or
    even malicious applications.

  - The original control file names are unwieldy and inconsistent in
    many different ways.  For example, the upper boundary hit count is
    exported in the memory.failcnt file, but an OOM event count has to
    be manually counted by listening to memory.oom_control events, and
    lower boundary / soft limit events have to be counted by first
    setting a threshold for that value and then counting those events.
    Also, usage and limit files encode their units in the filename.
    That makes the filenames very long, even though this is not
    information that a user needs to be reminded of every time they
    type out those names.

    To address these naming issues, as well as to signal clearly that
    the new interface carries a new configuration model, the naming
    conventions in it necessarily differ from the old interface.

  - The original limit files indicate the state of an unset limit with
    a very high number, and a configured limit can be unset by echoing
    -1 into those files.  But that very high number is implementation
    and architecture dependent and not very descriptive.  And while -1
    can be understood as an underflow into the highest possible value,
    -2 or -10M etc. do not work, so it's not consistent.

    memory.low, memory.high, and memory.max will use the string
    "infinity" to indicate and set the highest possible value.
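
The boundary semantics described in the list above can be modeled in
a few lines of userspace C.  This is only a sketch under stated
assumptions: struct group and the three helpers are hypothetical,
reclaim itself is not modeled, and the top-level group is assumed to
have no configurable range.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdio.h>

/* Assumed values for illustration; the kernel defines its own. */
#define SKETCH_PAGE_SIZE   4096UL
#define SKETCH_COUNTER_MAX (LONG_MAX / SKETCH_PAGE_SIZE)

/* Hypothetical stand-in for a memory cgroup; values are in pages. */
struct group {
	struct group *parent;		/* NULL for the top-level group */
	unsigned long usage;
	unsigned long low;		/* 0 means no reserve */
	unsigned long high;		/* SKETCH_COUNTER_MAX means unset */
	unsigned long high_events;
};

/*
 * Reclaim protection: the group and every configurable ancestor must
 * be at or below its low boundary.  The top-level group is never
 * considered low and is not checked as an ancestor.
 */
bool group_is_low(const struct group *g)
{
	assert(g);
	if (!g->parent)
		return false;
	for (; g->parent; g = g->parent)
		if (g->usage > g->low)
			return false;
	return true;
}

/*
 * After a charge, every level that sits above its high boundary
 * records an event and would be sent into reclaim.  Returns the
 * number of levels that were throttled.
 */
unsigned int throttle_above_high(struct group *g)
{
	unsigned int throttled = 0;

	for (; g; g = g->parent) {
		if (g->usage <= g->high)
			continue;
		g->high_events++;
		throttled++;
	}
	return throttled;
}

/* Print a boundary in bytes, or "infinity" when it is unset. */
int format_boundary(char *out, size_t len, unsigned long nr_pages)
{
	if (nr_pages == SKETCH_COUNTER_MAX)
		return snprintf(out, len, "infinity\n");
	return snprintf(out, len, "%llu\n",
			(unsigned long long)nr_pages * SKETCH_PAGE_SIZE);
}
```

Note that a freshly created group with low == 0 and any usage at all
is not protected in this model, which matches the statement that new
cgroups have no reserve.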

[akpm@linux-foundation.org: use seq_puts() for basic strings]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 Documentation/cgroups/unified-hierarchy.txt |  79 ++++++++++
 include/linux/memcontrol.h                  |  32 ++++
 mm/memcontrol.c                             | 229 ++++++++++++++++++++++++++--
 mm/vmscan.c                                 |  22 ++-
 4 files changed, 348 insertions(+), 14 deletions(-)

diff --git a/Documentation/cgroups/unified-hierarchy.txt b/Documentation/cgroups/unified-hierarchy.txt
index 4f4563277864..71daa35ec2d9 100644
--- a/Documentation/cgroups/unified-hierarchy.txt
+++ b/Documentation/cgroups/unified-hierarchy.txt
@@ -327,6 +327,85 @@ supported and the interface files "release_agent" and
 - use_hierarchy is on by default and the cgroup file for the flag is
   not created.
 
+- The original lower boundary, the soft limit, is defined as a limit
+  that is per default unset.  As a result, the set of cgroups that
+  global reclaim prefers is opt-in, rather than opt-out.  The costs
+  for optimizing these mostly negative lookups are so high that the
+  implementation, despite its enormous size, does not even provide the
+  basic desirable behavior.  First off, the soft limit has no
+  hierarchical meaning.  All configured groups are organized in a
+  global rbtree and treated like equal peers, regardless where they
+  are located in the hierarchy.  This makes subtree delegation
+  impossible.  Second, the soft limit reclaim pass is so aggressive
+  that it not just introduces high allocation latencies into the
+  system, but also impacts system performance due to overreclaim, to
+  the point where the feature becomes self-defeating.
+
+  The memory.low boundary on the other hand is a top-down allocated
+  reserve.  A cgroup enjoys reclaim protection when it and all its
+  ancestors are below their low boundaries, which makes delegation of
+  subtrees possible.  Secondly, new cgroups have no reserve per
+  default and in the common case most cgroups are eligible for the
+  preferred reclaim pass.  This allows the new low boundary to be
+  efficiently implemented with just a minor addition to the generic
+  reclaim code, without the need for out-of-band data structures and
+  reclaim passes.  Because the generic reclaim code considers all
+  cgroups except for the ones running low in the preferred first
+  reclaim pass, overreclaim of individual groups is eliminated as
+  well, resulting in much better overall workload performance.
+
+- The original high boundary, the hard limit, is defined as a strict
+  limit that can not budge, even if the OOM killer has to be called.
+  But this generally goes against the goal of making the most out of
+  the available memory.  The memory consumption of workloads varies
+  during runtime, and that requires users to overcommit.  But doing
+  that with a strict upper limit requires either a fairly accurate
+  prediction of the working set size or adding slack to the limit.
+  Since working set size estimation is hard and error prone, and
+  getting it wrong results in OOM kills, most users tend to err on the
+  side of a looser limit and end up wasting precious resources.
+
+  The memory.high boundary on the other hand can be set much more
+  conservatively.  When hit, it throttles allocations by forcing them
+  into direct reclaim to work off the excess, but it never invokes the
+  OOM killer.  As a result, a high boundary that is chosen too
+  aggressively will not terminate the processes, but instead it will
+  lead to gradual performance degradation.  The user can monitor this
+  and make corrections until the minimal memory footprint that still
+  gives acceptable performance is found.
+
+  In extreme cases, with many concurrent allocations and a complete
+  breakdown of reclaim progress within the group, the high boundary
+  can be exceeded.  But even then it's mostly better to satisfy the
+  allocation from the slack available in other groups or the rest of
+  the system than killing the group.  Otherwise, memory.max is there
+  to limit this type of spillover and ultimately contain buggy or even
+  malicious applications.
+
+- The original control file names are unwieldy and inconsistent in
+  many different ways.  For example, the upper boundary hit count is
+  exported in the memory.failcnt file, but an OOM event count has to
+  be manually counted by listening to memory.oom_control events, and
+  lower boundary / soft limit events have to be counted by first
+  setting a threshold for that value and then counting those events.
+  Also, usage and limit files encode their units in the filename.
+  That makes the filenames very long, even though this is not
+  information that a user needs to be reminded of every time they type
+  out those names.
+
+  To address these naming issues, as well as to signal clearly that
+  the new interface carries a new configuration model, the naming
+  conventions in it necessarily differ from the old interface.
+
+- The original limit files indicate the state of an unset limit with a
+  Very High Number, and a configured limit can be unset by echoing -1
+  into those files.  But that very high number is implementation and
+  architecture dependent and not very descriptive.  And while -1 can
+  be understood as an underflow into the highest possible value, -2 or
+  -10M etc. do not work, so it's not consistent.
+
+  memory.low, memory.high, and memory.max will use the string
+  "infinity" to indicate and set the highest possible value.
 
 5. Planned Changes
 
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 76f489fad640..72dff5fb0d0c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -52,7 +52,27 @@ struct mem_cgroup_reclaim_cookie {
 	unsigned int generation;
 };
 
+enum mem_cgroup_events_index {
+	MEM_CGROUP_EVENTS_PGPGIN,	/* # of pages paged in */
+	MEM_CGROUP_EVENTS_PGPGOUT,	/* # of pages paged out */
+	MEM_CGROUP_EVENTS_PGFAULT,	/* # of page-faults */
+	MEM_CGROUP_EVENTS_PGMAJFAULT,	/* # of major page-faults */
+	MEM_CGROUP_EVENTS_NSTATS,
+	/* default hierarchy events */
+	MEMCG_LOW = MEM_CGROUP_EVENTS_NSTATS,
+	MEMCG_HIGH,
+	MEMCG_MAX,
+	MEMCG_OOM,
+	MEMCG_NR_EVENTS,
+};
+
 #ifdef CONFIG_MEMCG
+void mem_cgroup_events(struct mem_cgroup *memcg,
+		       enum mem_cgroup_events_index idx,
+		       unsigned int nr);
+
+bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg);
+
 int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
 			  gfp_t gfp_mask, struct mem_cgroup **memcgp);
 void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
@@ -175,6 +195,18 @@ void mem_cgroup_split_huge_fixup(struct page *head);
 #else /* CONFIG_MEMCG */
 struct mem_cgroup;
 
+static inline void mem_cgroup_events(struct mem_cgroup *memcg,
+				     enum mem_cgroup_events_index idx,
+				     unsigned int nr)
+{
+}
+
+static inline bool mem_cgroup_low(struct mem_cgroup *root,
+				  struct mem_cgroup *memcg)
+{
+	return false;
+}
+
 static inline int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
 					gfp_t gfp_mask,
 					struct mem_cgroup **memcgp)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a3592a756ad9..5730886e3b0e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -97,14 +97,6 @@ static const char * const mem_cgroup_stat_names[] = {
 	"swap",
 };
 
-enum mem_cgroup_events_index {
-	MEM_CGROUP_EVENTS_PGPGIN,	/* # of pages paged in */
-	MEM_CGROUP_EVENTS_PGPGOUT,	/* # of pages paged out */
-	MEM_CGROUP_EVENTS_PGFAULT,	/* # of page-faults */
-	MEM_CGROUP_EVENTS_PGMAJFAULT,	/* # of major page-faults */
-	MEM_CGROUP_EVENTS_NSTATS,
-};
-
 static const char * const mem_cgroup_events_names[] = {
 	"pgpgin",
 	"pgpgout",
@@ -138,7 +130,7 @@ enum mem_cgroup_events_target {
 
 struct mem_cgroup_stat_cpu {
 	long count[MEM_CGROUP_STAT_NSTATS];
-	unsigned long events[MEM_CGROUP_EVENTS_NSTATS];
+	unsigned long events[MEMCG_NR_EVENTS];
 	unsigned long nr_page_events;
 	unsigned long targets[MEM_CGROUP_NTARGETS];
 };
@@ -284,6 +276,10 @@ struct mem_cgroup {
 	struct page_counter memsw;
 	struct page_counter kmem;
 
+	/* Normal memory consumption range */
+	unsigned long low;
+	unsigned long high;
+
 	unsigned long soft_limit;
 
 	/* vmpressure notifications */
@@ -2327,6 +2323,8 @@ retry:
 	if (!(gfp_mask & __GFP_WAIT))
 		goto nomem;
 
+	mem_cgroup_events(mem_over_limit, MEMCG_MAX, 1);
+
 	nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
 						    gfp_mask, may_swap);
 
@@ -2368,6 +2366,8 @@ retry:
 	if (fatal_signal_pending(current))
 		goto bypass;
 
+	mem_cgroup_events(mem_over_limit, MEMCG_OOM, 1);
+
 	mem_cgroup_oom(mem_over_limit, gfp_mask, get_order(nr_pages));
 nomem:
 	if (!(gfp_mask & __GFP_NOFAIL))
@@ -2379,6 +2379,16 @@ done_restock:
 	css_get_many(&memcg->css, batch);
 	if (batch > nr_pages)
 		refill_stock(memcg, batch - nr_pages);
+	/*
+	 * If the hierarchy is above the normal consumption range,
+	 * make the charging task trim their excess contribution.
+	 */
+	do {
+		if (page_counter_read(&memcg->memory) <= memcg->high)
+			continue;
+		mem_cgroup_events(memcg, MEMCG_HIGH, 1);
+		try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true);
+	} while ((memcg = parent_mem_cgroup(memcg)));
 done:
 	return ret;
 }
@@ -4304,7 +4314,7 @@ out_kfree:
 	return ret;
 }
 
-static struct cftype mem_cgroup_files[] = {
+static struct cftype mem_cgroup_legacy_files[] = {
 	{
 		.name = "usage_in_bytes",
 		.private = MEMFILE_PRIVATE(_MEM, RES_USAGE),
@@ -4580,6 +4590,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 	if (parent_css == NULL) {
 		root_mem_cgroup = memcg;
 		page_counter_init(&memcg->memory, NULL);
+		memcg->high = PAGE_COUNTER_MAX;
 		memcg->soft_limit = PAGE_COUNTER_MAX;
 		page_counter_init(&memcg->memsw, NULL);
 		page_counter_init(&memcg->kmem, NULL);
@@ -4625,6 +4636,7 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
 
 	if (parent->use_hierarchy) {
 		page_counter_init(&memcg->memory, &parent->memory);
+		memcg->high = PAGE_COUNTER_MAX;
 		memcg->soft_limit = PAGE_COUNTER_MAX;
 		page_counter_init(&memcg->memsw, &parent->memsw);
 		page_counter_init(&memcg->kmem, &parent->kmem);
@@ -4635,6 +4647,7 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
 		 */
 	} else {
 		page_counter_init(&memcg->memory, NULL);
+		memcg->high = PAGE_COUNTER_MAX;
 		memcg->soft_limit = PAGE_COUNTER_MAX;
 		page_counter_init(&memcg->memsw, NULL);
 		page_counter_init(&memcg->kmem, NULL);
@@ -4710,6 +4723,8 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css)
 	mem_cgroup_resize_limit(memcg, PAGE_COUNTER_MAX);
 	mem_cgroup_resize_memsw_limit(memcg, PAGE_COUNTER_MAX);
 	memcg_update_kmem_limit(memcg, PAGE_COUNTER_MAX);
+	memcg->low = 0;
+	memcg->high = PAGE_COUNTER_MAX;
 	memcg->soft_limit = PAGE_COUNTER_MAX;
 }
 
@@ -5296,6 +5311,147 @@ static void mem_cgroup_bind(struct cgroup_subsys_state *root_css)
 		mem_cgroup_from_css(root_css)->use_hierarchy = true;
 }
 
+static u64 memory_current_read(struct cgroup_subsys_state *css,
+			       struct cftype *cft)
+{
+	return mem_cgroup_usage(mem_cgroup_from_css(css), false);
+}
+
+static int memory_low_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+	unsigned long low = ACCESS_ONCE(memcg->low);
+
+	if (low == PAGE_COUNTER_MAX)
+		seq_puts(m, "infinity\n");
+	else
+		seq_printf(m, "%llu\n", (u64)low * PAGE_SIZE);
+
+	return 0;
+}
+
+static ssize_t memory_low_write(struct kernfs_open_file *of,
+				char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	unsigned long low;
+	int err;
+
+	buf = strstrip(buf);
+	err = page_counter_memparse(buf, "infinity", &low);
+	if (err)
+		return err;
+
+	memcg->low = low;
+
+	return nbytes;
+}
+
+static int memory_high_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+	unsigned long high = ACCESS_ONCE(memcg->high);
+
+	if (high == PAGE_COUNTER_MAX)
+		seq_puts(m, "infinity\n");
+	else
+		seq_printf(m, "%llu\n", (u64)high * PAGE_SIZE);
+
+	return 0;
+}
+
+static ssize_t memory_high_write(struct kernfs_open_file *of,
+				 char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	unsigned long high;
+	int err;
+
+	buf = strstrip(buf);
+	err = page_counter_memparse(buf, "infinity", &high);
+	if (err)
+		return err;
+
+	memcg->high = high;
+
+	return nbytes;
+}
+
+static int memory_max_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+	unsigned long max = ACCESS_ONCE(memcg->memory.limit);
+
+	if (max == PAGE_COUNTER_MAX)
+		seq_puts(m, "infinity\n");
+	else
+		seq_printf(m, "%llu\n", (u64)max * PAGE_SIZE);
+
+	return 0;
+}
+
+static ssize_t memory_max_write(struct kernfs_open_file *of,
+				char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	unsigned long max;
+	int err;
+
+	buf = strstrip(buf);
+	err = page_counter_memparse(buf, "infinity", &max);
+	if (err)
+		return err;
+
+	err = mem_cgroup_resize_limit(memcg, max);
+	if (err)
+		return err;
+
+	return nbytes;
+}
+
+static int memory_events_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+
+	seq_printf(m, "low %lu\n", mem_cgroup_read_events(memcg, MEMCG_LOW));
+	seq_printf(m, "high %lu\n", mem_cgroup_read_events(memcg, MEMCG_HIGH));
+	seq_printf(m, "max %lu\n", mem_cgroup_read_events(memcg, MEMCG_MAX));
+	seq_printf(m, "oom %lu\n", mem_cgroup_read_events(memcg, MEMCG_OOM));
+
+	return 0;
+}
+
+static struct cftype memory_files[] = {
+	{
+		.name = "current",
+		.read_u64 = memory_current_read,
+	},
+	{
+		.name = "low",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = memory_low_show,
+		.write = memory_low_write,
+	},
+	{
+		.name = "high",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = memory_high_show,
+		.write = memory_high_write,
+	},
+	{
+		.name = "max",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = memory_max_show,
+		.write = memory_max_write,
+	},
+	{
+		.name = "events",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = memory_events_show,
+	},
+	{ }	/* terminate */
+};
+
 struct cgroup_subsys memory_cgrp_subsys = {
 	.css_alloc = mem_cgroup_css_alloc,
 	.css_online = mem_cgroup_css_online,
@@ -5306,7 +5462,8 @@ struct cgroup_subsys memory_cgrp_subsys = {
 	.cancel_attach = mem_cgroup_cancel_attach,
 	.attach = mem_cgroup_move_task,
 	.bind = mem_cgroup_bind,
-	.legacy_cftypes = mem_cgroup_files,
+	.dfl_cftypes = memory_files,
+	.legacy_cftypes = mem_cgroup_legacy_files,
 	.early_init = 0,
 };
 
@@ -5341,6 +5498,56 @@ static void __init enable_swap_cgroup(void)
 }
 #endif
 
+/**
+ * mem_cgroup_events - count memory events against a cgroup
+ * @memcg: the memory cgroup
+ * @idx: the event index
+ * @nr: the number of events to account for
+ */
+void mem_cgroup_events(struct mem_cgroup *memcg,
+		       enum mem_cgroup_events_index idx,
+		       unsigned int nr)
+{
+	this_cpu_add(memcg->stat->events[idx], nr);
+}
+
+/**
+ * mem_cgroup_low - check if memory consumption is below the normal range
+ * @root: the highest ancestor to consider
+ * @memcg: the memory cgroup to check
+ *
+ * Returns %true if memory consumption of @memcg, and that of all
+ * configurable ancestors up to @root, is below the normal range.
+ */
+bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg)
+{
+	if (mem_cgroup_disabled())
+		return false;
+
+	/*
+	 * The toplevel group doesn't have a configurable range, so
+	 * it's never low when looked at directly, and it is not
+	 * considered an ancestor when assessing the hierarchy.
+	 */
+
+	if (memcg == root_mem_cgroup)
+		return false;
+
+	if (page_counter_read(&memcg->memory) > memcg->low)
+		return false;
+
+	while (memcg != root) {
+		memcg = parent_mem_cgroup(memcg);
+
+		if (memcg == root_mem_cgroup)
+			break;
+
+		if (page_counter_read(&memcg->memory) > memcg->low)
+			return false;
+	}
+	return true;
+}
+
 #ifdef CONFIG_MEMCG_SWAP
 /**
  * mem_cgroup_swapout - transfer a memsw charge to swap
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b89097185f46..f62ec654d4c5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -91,6 +91,9 @@ struct scan_control {
 	/* Can pages be swapped as part of reclaim? */
 	unsigned int may_swap:1;
 
+	/* Can cgroups be reclaimed below their normal consumption range? */
+	unsigned int may_thrash:1;
+
 	unsigned int hibernation_mode:1;
 
 	/* One of the zones is ready for compaction */
@@ -2333,6 +2336,12 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
 			struct lruvec *lruvec;
 			int swappiness;
 
+			if (mem_cgroup_low(root, memcg)) {
+				if (!sc->may_thrash)
+					continue;
+				mem_cgroup_events(memcg, MEMCG_LOW, 1);
+			}
+
 			lruvec = mem_cgroup_zone_lruvec(zone, memcg);
 			swappiness = mem_cgroup_swappiness(memcg);
 			scanned = sc->nr_scanned;
@@ -2360,8 +2369,7 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
 				mem_cgroup_iter_break(root, memcg);
 				break;
 			}
-			memcg = mem_cgroup_iter(root, memcg, &reclaim);
-		} while (memcg);
+		} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
 
 		/*
 		 * Shrink the slab caches in the same proportion that
@@ -2559,10 +2567,11 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 					  struct scan_control *sc)
 {
+	int initial_priority = sc->priority;
 	unsigned long total_scanned = 0;
 	unsigned long writeback_threshold;
 	bool zones_reclaimable;
-
+retry:
 	delayacct_freepages_start();
 
 	if (global_reclaim(sc))
@@ -2612,6 +2621,13 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 	if (sc->compaction_ready)
 		return 1;
 
+	/* Untapped cgroup reserves?  Don't OOM, retry. */
+	if (!sc->may_thrash) {
+		sc->priority = initial_priority;
+		sc->may_thrash = 1;
+		goto retry;
+	}
+
 	/* Any of the zones still reclaimable?  Don't OOM. */
 	if (zones_reclaimable)
 		return 1;
-- 
2.2.0

^ permalink raw reply related	[flat|nested] 20+ messages in thread
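
The hierarchy walk in mem_cgroup_low() from the patch above can be
modelled in user space to see which groups end up protected.  This is
only a sketch under stated assumptions: struct group and group_is_low()
are hypothetical stand-ins for struct mem_cgroup and mem_cgroup_low(),
a NULL parent stands in for root_mem_cgroup, and the
mem_cgroup_disabled() check is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for struct mem_cgroup. */
struct group {
	struct group *parent;	/* NULL marks the toplevel group */
	unsigned long usage;	/* pages currently charged */
	unsigned long low;	/* lower boundary, in pages */
};

/*
 * Models the walk in mem_cgroup_low(): @grp counts as low only if it
 * and every configurable ancestor up to @root are at or below their
 * low boundary.  The toplevel group has no configurable range, so it
 * is never low itself and is skipped as an ancestor.
 */
static bool group_is_low(const struct group *root, const struct group *grp)
{
	assert(root && grp);

	if (!grp->parent)
		return false;

	if (grp->usage > grp->low)
		return false;

	while (grp != root) {
		grp = grp->parent;

		if (!grp->parent)
			break;

		if (grp->usage > grp->low)
			return false;
	}
	return true;
}
```

With a chain top -> a -> b, group b is reported low only while both a
and b stay within their boundaries; once a exceeds its own boundary, b
loses protection even though b itself is still below its own.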

* Re: [patch 2/2] mm: memcontrol: default hierarchy interface for memory
  2015-01-20 15:31 ` [patch 2/2] mm: memcontrol: default hierarchy interface for memory Johannes Weiner
@ 2015-01-20 16:31   ` Michal Hocko
  2015-02-23 11:13   ` Sasha Levin
  1 sibling, 0 replies; 20+ messages in thread
From: Michal Hocko @ 2015-01-20 16:31 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Vladimir Davydov, Greg Thelen, linux-mm, cgroups,
	linux-kernel

On Tue 20-01-15 10:31:55, Johannes Weiner wrote:
> Introduce the basic control files to account, partition, and limit
> memory using cgroups in default hierarchy mode.
> 
> This interface versioning allows us to address fundamental design
> issues in the existing memory cgroup interface, further explained
> below.  The old interface will be maintained indefinitely, but a
> clearer model and improved workload performance should encourage
> existing users to switch over to the new one eventually.
> 
> The control files are thus:
> 
>   - memory.current shows the current consumption of the cgroup and its
>     descendants, in bytes.
> 
>   - memory.low configures the lower end of the cgroup's expected
>     memory consumption range.  The kernel considers memory below that
>     boundary to be a reserve - the minimum that the workload needs in
>     order to make forward progress - and generally avoids reclaiming
>     it, unless there is an imminent risk of entering an OOM situation.
> 
>   - memory.high configures the upper end of the cgroup's expected
>     memory consumption range.  A cgroup whose consumption grows beyond
>     this threshold is forced into direct reclaim, to work off the
>     excess and to throttle new allocations heavily, but is generally
>     allowed to continue and the OOM killer is not invoked.
> 
>   - memory.max configures the hard maximum amount of memory that the
>     cgroup is allowed to consume before the OOM killer is invoked.
> 
>   - memory.events shows event counters that indicate how often the
>     cgroup was reclaimed while below memory.low, how often it was
>     forced to reclaim excess beyond memory.high, how often it hit
>     memory.max, and how often it entered OOM due to memory.max.  This
>     allows users to identify configuration problems when observing a
>     degradation in workload performance.  An overcommitted system will
>     have an increased rate of low boundary breaches, whereas increased
>     rates of high limit breaches, maximum hits, or even OOM situations
>     will indicate internally overcommitted cgroups.
> 
> For existing users of memory cgroups, the following deviations from
> the current interface are worth pointing out and explaining:
> 
>   - The original lower boundary, the soft limit, is defined as a limit
>     that is per default unset.  As a result, the set of cgroups that
>     global reclaim prefers is opt-in, rather than opt-out.  The costs
>     for optimizing these mostly negative lookups are so high that the
>     implementation, despite its enormous size, does not even provide
>     the basic desirable behavior.  First off, the soft limit has no
>     hierarchical meaning.  All configured groups are organized in a
>     global rbtree and treated like equal peers, regardless of where
>     they are located in the hierarchy.  This makes subtree delegation
>     impossible.  Second, the soft limit reclaim pass is so aggressive
>     that it not just introduces high allocation latencies into the
>     system, but also impacts system performance due to overreclaim, to
>     the point where the feature becomes self-defeating.
> 
>     The memory.low boundary on the other hand is a top-down allocated
>     reserve.  A cgroup enjoys reclaim protection when it and all its
>     ancestors are below their low boundaries, which makes delegation
>     of subtrees possible.  Secondly, new cgroups have no reserve per
>     default and in the common case most cgroups are eligible for the
>     preferred reclaim pass.  This allows the new low boundary to be
>     efficiently implemented with just a minor addition to the generic
>     reclaim code, without the need for out-of-band data structures and
>     reclaim passes.  Because the generic reclaim code considers all
>     cgroups except for the ones running low in the preferred first
>     reclaim pass, overreclaim of individual groups is eliminated as
>     well, resulting in much better overall workload performance.
> 
>   - The original high boundary, the hard limit, is defined as a strict
>     limit that can not budge, even if the OOM killer has to be called.
>     But this generally goes against the goal of making the most out of
>     the available memory.  The memory consumption of workloads varies
>     during runtime, and that requires users to overcommit.  But doing
>     that with a strict upper limit requires either a fairly accurate
>     prediction of the working set size or adding slack to the limit.
>     Since working set size estimation is hard and error prone, and
>     getting it wrong results in OOM kills, most users tend to err on
>     the side of a looser limit and end up wasting precious resources.
> 
>     The memory.high boundary on the other hand can be set much more
>     conservatively.  When hit, it throttles allocations by forcing
>     them into direct reclaim to work off the excess, but it never
>     invokes the OOM killer.  As a result, a high boundary that is
>     chosen too aggressively will not terminate the processes, but
>     instead it will lead to gradual performance degradation.  The user
>     can monitor this and make corrections until the minimal memory
>     footprint that still gives acceptable performance is found.
> 
>     In extreme cases, with many concurrent allocations and a complete
>     breakdown of reclaim progress within the group, the high boundary
>     can be exceeded.  But even then it's mostly better to satisfy the
>     allocation from the slack available in other groups or the rest of
>     the system than killing the group.  Otherwise, memory.max is there
>     to limit this type of spillover and ultimately contain buggy or
>     even malicious applications.
> 
>   - The original control file names are unwieldy and inconsistent in
>     many different ways.  For example, the upper boundary hit count is
>     exported in the memory.failcnt file, but an OOM event count has to
>     be manually counted by listening to memory.oom_control events, and
>     lower boundary / soft limit events have to be counted by first
>     setting a threshold for that value and then counting those events.
>     Also, usage and limit files encode their units in the filename.
>     That makes the filenames very long, even though this is not
>     information that a user needs to be reminded of every time they
>     type out those names.
> 
>     To address these naming issues, as well as to signal clearly that
>     the new interface carries a new configuration model, the naming
>     conventions in it necessarily differ from the old interface.
> 
>   - The original limit files indicate the state of an unset limit with
>     a very high number, and a configured limit can be unset by echoing
>     -1 into those files.  But that very high number is implementation
>     and architecture dependent and not very descriptive.  And while -1
>     can be understood as an underflow into the highest possible value,
>     -2 or -10M etc. do not work, so it's not consistent.
> 
>     memory.low, memory.high, and memory.max will use the string
>     "infinity" to indicate and set the highest possible value.
> 
> [akpm@linux-foundation.org: use seq_puts() for basic strings]
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: Vladimir Davydov <vdavydov@parallels.com>
> Cc: Greg Thelen <gthelen@google.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Acked-by: Michal Hocko <mhocko@suse.cz>

> ---
>  Documentation/cgroups/unified-hierarchy.txt |  79 ++++++++++
>  include/linux/memcontrol.h                  |  32 ++++
>  mm/memcontrol.c                             | 229 ++++++++++++++++++++++++++--
>  mm/vmscan.c                                 |  22 ++-
>  4 files changed, 348 insertions(+), 14 deletions(-)
> 
> diff --git a/Documentation/cgroups/unified-hierarchy.txt b/Documentation/cgroups/unified-hierarchy.txt
> index 4f4563277864..71daa35ec2d9 100644
> --- a/Documentation/cgroups/unified-hierarchy.txt
> +++ b/Documentation/cgroups/unified-hierarchy.txt
> @@ -327,6 +327,85 @@ supported and the interface files "release_agent" and
>  - use_hierarchy is on by default and the cgroup file for the flag is
>    not created.
>  
> +- The original lower boundary, the soft limit, is defined as a limit
> +  that is per default unset.  As a result, the set of cgroups that
> +  global reclaim prefers is opt-in, rather than opt-out.  The costs
> +  for optimizing these mostly negative lookups are so high that the
> +  implementation, despite its enormous size, does not even provide the
> +  basic desirable behavior.  First off, the soft limit has no
> +  hierarchical meaning.  All configured groups are organized in a
> +  global rbtree and treated like equal peers, regardless of where
> +  they are located in the hierarchy.  This makes subtree delegation
> +  impossible.  Second, the soft limit reclaim pass is so aggressive
> +  that it not just introduces high allocation latencies into the
> +  system, but also impacts system performance due to overreclaim, to
> +  the point where the feature becomes self-defeating.
> +
> +  The memory.low boundary on the other hand is a top-down allocated
> +  reserve.  A cgroup enjoys reclaim protection when it and all its
> +  ancestors are below their low boundaries, which makes delegation of
> +  subtrees possible.  Secondly, new cgroups have no reserve per
> +  default and in the common case most cgroups are eligible for the
> +  preferred reclaim pass.  This allows the new low boundary to be
> +  efficiently implemented with just a minor addition to the generic
> +  reclaim code, without the need for out-of-band data structures and
> +  reclaim passes.  Because the generic reclaim code considers all
> +  cgroups except for the ones running low in the preferred first
> +  reclaim pass, overreclaim of individual groups is eliminated as
> +  well, resulting in much better overall workload performance.
> +
> +- The original high boundary, the hard limit, is defined as a strict
> +  limit that can not budge, even if the OOM killer has to be called.
> +  But this generally goes against the goal of making the most out of
> +  the available memory.  The memory consumption of workloads varies
> +  during runtime, and that requires users to overcommit.  But doing
> +  that with a strict upper limit requires either a fairly accurate
> +  prediction of the working set size or adding slack to the limit.
> +  Since working set size estimation is hard and error prone, and
> +  getting it wrong results in OOM kills, most users tend to err on the
> +  side of a looser limit and end up wasting precious resources.
> +
> +  The memory.high boundary on the other hand can be set much more
> +  conservatively.  When hit, it throttles allocations by forcing them
> +  into direct reclaim to work off the excess, but it never invokes the
> +  OOM killer.  As a result, a high boundary that is chosen too
> +  aggressively will not terminate the processes, but instead it will
> +  lead to gradual performance degradation.  The user can monitor this
> +  and make corrections until the minimal memory footprint that still
> +  gives acceptable performance is found.
> +
> +  In extreme cases, with many concurrent allocations and a complete
> +  breakdown of reclaim progress within the group, the high boundary
> +  can be exceeded.  But even then it's mostly better to satisfy the
> +  allocation from the slack available in other groups or the rest of
> +  the system than killing the group.  Otherwise, memory.max is there
> +  to limit this type of spillover and ultimately contain buggy or even
> +  malicious applications.
> +
> +- The original control file names are unwieldy and inconsistent in
> +  many different ways.  For example, the upper boundary hit count is
> +  exported in the memory.failcnt file, but an OOM event count has to
> +  be manually counted by listening to memory.oom_control events, and
> +  lower boundary / soft limit events have to be counted by first
> +  setting a threshold for that value and then counting those events.
> +  Also, usage and limit files encode their units in the filename.
> +  That makes the filenames very long, even though this is not
> +  information that a user needs to be reminded of every time they type
> +  out those names.
> +
> +  To address these naming issues, as well as to signal clearly that
> +  the new interface carries a new configuration model, the naming
> +  conventions in it necessarily differ from the old interface.
> +
> +- The original limit files indicate the state of an unset limit with a
> +  Very High Number, and a configured limit can be unset by echoing -1
> +  into those files.  But that very high number is implementation and
> +  architecture dependent and not very descriptive.  And while -1 can
> +  be understood as an underflow into the highest possible value, -2 or
> +  -10M etc. do not work, so it's not consistent.
> +
> +  memory.low, memory.high, and memory.max will use the string
> +  "infinity" to indicate and set the highest possible value.
>  
>  5. Planned Changes
>  
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 76f489fad640..72dff5fb0d0c 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -52,7 +52,27 @@ struct mem_cgroup_reclaim_cookie {
>  	unsigned int generation;
>  };
>  
> +enum mem_cgroup_events_index {
> +	MEM_CGROUP_EVENTS_PGPGIN,	/* # of pages paged in */
> +	MEM_CGROUP_EVENTS_PGPGOUT,	/* # of pages paged out */
> +	MEM_CGROUP_EVENTS_PGFAULT,	/* # of page-faults */
> +	MEM_CGROUP_EVENTS_PGMAJFAULT,	/* # of major page-faults */
> +	MEM_CGROUP_EVENTS_NSTATS,
> +	/* default hierarchy events */
> +	MEMCG_LOW = MEM_CGROUP_EVENTS_NSTATS,
> +	MEMCG_HIGH,
> +	MEMCG_MAX,
> +	MEMCG_OOM,
> +	MEMCG_NR_EVENTS,
> +};
> +
>  #ifdef CONFIG_MEMCG
> +void mem_cgroup_events(struct mem_cgroup *memcg,
> +		       enum mem_cgroup_events_index idx,
> +		       unsigned int nr);
> +
> +bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg);
> +
>  int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
>  			  gfp_t gfp_mask, struct mem_cgroup **memcgp);
>  void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
> @@ -175,6 +195,18 @@ void mem_cgroup_split_huge_fixup(struct page *head);
>  #else /* CONFIG_MEMCG */
>  struct mem_cgroup;
>  
> +static inline void mem_cgroup_events(struct mem_cgroup *memcg,
> +				     enum mem_cgroup_events_index idx,
> +				     unsigned int nr)
> +{
> +}
> +
> +static inline bool mem_cgroup_low(struct mem_cgroup *root,
> +				  struct mem_cgroup *memcg)
> +{
> +	return false;
> +}
> +
>  static inline int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
>  					gfp_t gfp_mask,
>  					struct mem_cgroup **memcgp)
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index a3592a756ad9..5730886e3b0e 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -97,14 +97,6 @@ static const char * const mem_cgroup_stat_names[] = {
>  	"swap",
>  };
>  
> -enum mem_cgroup_events_index {
> -	MEM_CGROUP_EVENTS_PGPGIN,	/* # of pages paged in */
> -	MEM_CGROUP_EVENTS_PGPGOUT,	/* # of pages paged out */
> -	MEM_CGROUP_EVENTS_PGFAULT,	/* # of page-faults */
> -	MEM_CGROUP_EVENTS_PGMAJFAULT,	/* # of major page-faults */
> -	MEM_CGROUP_EVENTS_NSTATS,
> -};
> -
>  static const char * const mem_cgroup_events_names[] = {
>  	"pgpgin",
>  	"pgpgout",
> @@ -138,7 +130,7 @@ enum mem_cgroup_events_target {
>  
>  struct mem_cgroup_stat_cpu {
>  	long count[MEM_CGROUP_STAT_NSTATS];
> -	unsigned long events[MEM_CGROUP_EVENTS_NSTATS];
> +	unsigned long events[MEMCG_NR_EVENTS];
>  	unsigned long nr_page_events;
>  	unsigned long targets[MEM_CGROUP_NTARGETS];
>  };
> @@ -284,6 +276,10 @@ struct mem_cgroup {
>  	struct page_counter memsw;
>  	struct page_counter kmem;
>  
> +	/* Normal memory consumption range */
> +	unsigned long low;
> +	unsigned long high;
> +
>  	unsigned long soft_limit;
>  
>  	/* vmpressure notifications */
> @@ -2327,6 +2323,8 @@ retry:
>  	if (!(gfp_mask & __GFP_WAIT))
>  		goto nomem;
>  
> +	mem_cgroup_events(mem_over_limit, MEMCG_MAX, 1);
> +
>  	nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
>  						    gfp_mask, may_swap);
>  
> @@ -2368,6 +2366,8 @@ retry:
>  	if (fatal_signal_pending(current))
>  		goto bypass;
>  
> +	mem_cgroup_events(mem_over_limit, MEMCG_OOM, 1);
> +
>  	mem_cgroup_oom(mem_over_limit, gfp_mask, get_order(nr_pages));
>  nomem:
>  	if (!(gfp_mask & __GFP_NOFAIL))
> @@ -2379,6 +2379,16 @@ done_restock:
>  	css_get_many(&memcg->css, batch);
>  	if (batch > nr_pages)
>  		refill_stock(memcg, batch - nr_pages);
> +	/*
> +	 * If the hierarchy is above the normal consumption range,
> +	 * make the charging task trim their excess contribution.
> +	 */
> +	do {
> +		if (page_counter_read(&memcg->memory) <= memcg->high)
> +			continue;
> +		mem_cgroup_events(memcg, MEMCG_HIGH, 1);
> +		try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true);
> +	} while ((memcg = parent_mem_cgroup(memcg)));
>  done:
>  	return ret;
>  }
> @@ -4304,7 +4314,7 @@ out_kfree:
>  	return ret;
>  }
>  
> -static struct cftype mem_cgroup_files[] = {
> +static struct cftype mem_cgroup_legacy_files[] = {
>  	{
>  		.name = "usage_in_bytes",
>  		.private = MEMFILE_PRIVATE(_MEM, RES_USAGE),
> @@ -4580,6 +4590,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
>  	if (parent_css == NULL) {
>  		root_mem_cgroup = memcg;
>  		page_counter_init(&memcg->memory, NULL);
> +		memcg->high = PAGE_COUNTER_MAX;
>  		memcg->soft_limit = PAGE_COUNTER_MAX;
>  		page_counter_init(&memcg->memsw, NULL);
>  		page_counter_init(&memcg->kmem, NULL);
> @@ -4625,6 +4636,7 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
>  
>  	if (parent->use_hierarchy) {
>  		page_counter_init(&memcg->memory, &parent->memory);
> +		memcg->high = PAGE_COUNTER_MAX;
>  		memcg->soft_limit = PAGE_COUNTER_MAX;
>  		page_counter_init(&memcg->memsw, &parent->memsw);
>  		page_counter_init(&memcg->kmem, &parent->kmem);
> @@ -4635,6 +4647,7 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
>  		 */
>  	} else {
>  		page_counter_init(&memcg->memory, NULL);
> +		memcg->high = PAGE_COUNTER_MAX;
>  		memcg->soft_limit = PAGE_COUNTER_MAX;
>  		page_counter_init(&memcg->memsw, NULL);
>  		page_counter_init(&memcg->kmem, NULL);
> @@ -4710,6 +4723,8 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css)
>  	mem_cgroup_resize_limit(memcg, PAGE_COUNTER_MAX);
>  	mem_cgroup_resize_memsw_limit(memcg, PAGE_COUNTER_MAX);
>  	memcg_update_kmem_limit(memcg, PAGE_COUNTER_MAX);
> +	memcg->low = 0;
> +	memcg->high = PAGE_COUNTER_MAX;
>  	memcg->soft_limit = PAGE_COUNTER_MAX;
>  }
>  
> @@ -5296,6 +5311,147 @@ static void mem_cgroup_bind(struct cgroup_subsys_state *root_css)
>  		mem_cgroup_from_css(root_css)->use_hierarchy = true;
>  }
>  
> +static u64 memory_current_read(struct cgroup_subsys_state *css,
> +			       struct cftype *cft)
> +{
> +	return mem_cgroup_usage(mem_cgroup_from_css(css), false);
> +}
> +
> +static int memory_low_show(struct seq_file *m, void *v)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
> +	unsigned long low = ACCESS_ONCE(memcg->low);
> +
> +	if (low == PAGE_COUNTER_MAX)
> +		seq_puts(m, "infinity\n");
> +	else
> +		seq_printf(m, "%llu\n", (u64)low * PAGE_SIZE);
> +
> +	return 0;
> +}
> +
> +static ssize_t memory_low_write(struct kernfs_open_file *of,
> +				char *buf, size_t nbytes, loff_t off)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> +	unsigned long low;
> +	int err;
> +
> +	buf = strstrip(buf);
> +	err = page_counter_memparse(buf, "infinity", &low);
> +	if (err)
> +		return err;
> +
> +	memcg->low = low;
> +
> +	return nbytes;
> +}
> +
> +static int memory_high_show(struct seq_file *m, void *v)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
> +	unsigned long high = ACCESS_ONCE(memcg->high);
> +
> +	if (high == PAGE_COUNTER_MAX)
> +		seq_puts(m, "infinity\n");
> +	else
> +		seq_printf(m, "%llu\n", (u64)high * PAGE_SIZE);
> +
> +	return 0;
> +}
> +
> +static ssize_t memory_high_write(struct kernfs_open_file *of,
> +				 char *buf, size_t nbytes, loff_t off)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> +	unsigned long high;
> +	int err;
> +
> +	buf = strstrip(buf);
> +	err = page_counter_memparse(buf, "infinity", &high);
> +	if (err)
> +		return err;
> +
> +	memcg->high = high;
> +
> +	return nbytes;
> +}
> +
> +static int memory_max_show(struct seq_file *m, void *v)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
> +	unsigned long max = ACCESS_ONCE(memcg->memory.limit);
> +
> +	if (max == PAGE_COUNTER_MAX)
> +		seq_puts(m, "infinity\n");
> +	else
> +		seq_printf(m, "%llu\n", (u64)max * PAGE_SIZE);
> +
> +	return 0;
> +}
> +
> +static ssize_t memory_max_write(struct kernfs_open_file *of,
> +				char *buf, size_t nbytes, loff_t off)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> +	unsigned long max;
> +	int err;
> +
> +	buf = strstrip(buf);
> +	err = page_counter_memparse(buf, "infinity", &max);
> +	if (err)
> +		return err;
> +
> +	err = mem_cgroup_resize_limit(memcg, max);
> +	if (err)
> +		return err;
> +
> +	return nbytes;
> +}
> +
> +static int memory_events_show(struct seq_file *m, void *v)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
> +
> +	seq_printf(m, "low %lu\n", mem_cgroup_read_events(memcg, MEMCG_LOW));
> +	seq_printf(m, "high %lu\n", mem_cgroup_read_events(memcg, MEMCG_HIGH));
> +	seq_printf(m, "max %lu\n", mem_cgroup_read_events(memcg, MEMCG_MAX));
> +	seq_printf(m, "oom %lu\n", mem_cgroup_read_events(memcg, MEMCG_OOM));
> +
> +	return 0;
> +}
> +
> +static struct cftype memory_files[] = {
> +	{
> +		.name = "current",
> +		.read_u64 = memory_current_read,
> +	},
> +	{
> +		.name = "low",
> +		.flags = CFTYPE_NOT_ON_ROOT,
> +		.seq_show = memory_low_show,
> +		.write = memory_low_write,
> +	},
> +	{
> +		.name = "high",
> +		.flags = CFTYPE_NOT_ON_ROOT,
> +		.seq_show = memory_high_show,
> +		.write = memory_high_write,
> +	},
> +	{
> +		.name = "max",
> +		.flags = CFTYPE_NOT_ON_ROOT,
> +		.seq_show = memory_max_show,
> +		.write = memory_max_write,
> +	},
> +	{
> +		.name = "events",
> +		.flags = CFTYPE_NOT_ON_ROOT,
> +		.seq_show = memory_events_show,
> +	},
> +	{ }	/* terminate */
> +};
> +
>  struct cgroup_subsys memory_cgrp_subsys = {
>  	.css_alloc = mem_cgroup_css_alloc,
>  	.css_online = mem_cgroup_css_online,
> @@ -5306,7 +5462,8 @@ struct cgroup_subsys memory_cgrp_subsys = {
>  	.cancel_attach = mem_cgroup_cancel_attach,
>  	.attach = mem_cgroup_move_task,
>  	.bind = mem_cgroup_bind,
> -	.legacy_cftypes = mem_cgroup_files,
> +	.dfl_cftypes = memory_files,
> +	.legacy_cftypes = mem_cgroup_legacy_files,
>  	.early_init = 0,
>  };
>  
> @@ -5341,6 +5498,56 @@ static void __init enable_swap_cgroup(void)
>  }
>  #endif
>  
> +/**
> + * mem_cgroup_events - count memory events against a cgroup
> + * @memcg: the memory cgroup
> + * @idx: the event index
> + * @nr: the number of events to account for
> + */
> +void mem_cgroup_events(struct mem_cgroup *memcg,
> +		       enum mem_cgroup_events_index idx,
> +		       unsigned int nr)
> +{
> +	this_cpu_add(memcg->stat->events[idx], nr);
> +}
> +
> +/**
> + * mem_cgroup_low - check if memory consumption is below the normal range
> + * @root: the highest ancestor to consider
> + * @memcg: the memory cgroup to check
> + *
> + * Returns %true if memory consumption of @memcg, and that of all
> + * configurable ancestors up to @root, is below the normal range.
> + */
> +bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg)
> +{
> +	if (mem_cgroup_disabled())
> +		return false;
> +
> +	/*
> +	 * The toplevel group doesn't have a configurable range, so
> +	 * it's never low when looked at directly, and it is not
> +	 * considered an ancestor when assessing the hierarchy.
> +	 */
> +
> +	if (memcg == root_mem_cgroup)
> +		return false;
> +
> +	if (page_counter_read(&memcg->memory) > memcg->low)
> +		return false;
> +
> +	while (memcg != root) {
> +		memcg = parent_mem_cgroup(memcg);
> +
> +		if (memcg == root_mem_cgroup)
> +			break;
> +
> +		if (page_counter_read(&memcg->memory) > memcg->low)
> +			return false;
> +	}
> +	return true;
> +}
> +
>  #ifdef CONFIG_MEMCG_SWAP
>  /**
>   * mem_cgroup_swapout - transfer a memsw charge to swap
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b89097185f46..f62ec654d4c5 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -91,6 +91,9 @@ struct scan_control {
>  	/* Can pages be swapped as part of reclaim? */
>  	unsigned int may_swap:1;
>  
> +	/* Can cgroups be reclaimed below their normal consumption range? */
> +	unsigned int may_thrash:1;
> +
>  	unsigned int hibernation_mode:1;
>  
>  	/* One of the zones is ready for compaction */
> @@ -2333,6 +2336,12 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
>  			struct lruvec *lruvec;
>  			int swappiness;
>  
> +			if (mem_cgroup_low(root, memcg)) {
> +				if (!sc->may_thrash)
> +					continue;
> +				mem_cgroup_events(memcg, MEMCG_LOW, 1);
> +			}
> +
>  			lruvec = mem_cgroup_zone_lruvec(zone, memcg);
>  			swappiness = mem_cgroup_swappiness(memcg);
>  			scanned = sc->nr_scanned;
> @@ -2360,8 +2369,7 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
>  				mem_cgroup_iter_break(root, memcg);
>  				break;
>  			}
> -			memcg = mem_cgroup_iter(root, memcg, &reclaim);
> -		} while (memcg);
> +		} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
>  
>  		/*
>  		 * Shrink the slab caches in the same proportion that
> @@ -2559,10 +2567,11 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
>  static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>  					  struct scan_control *sc)
>  {
> +	int initial_priority = sc->priority;
>  	unsigned long total_scanned = 0;
>  	unsigned long writeback_threshold;
>  	bool zones_reclaimable;
> -
> +retry:
>  	delayacct_freepages_start();
>  
>  	if (global_reclaim(sc))
> @@ -2612,6 +2621,13 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>  	if (sc->compaction_ready)
>  		return 1;
>  
> +	/* Untapped cgroup reserves?  Don't OOM, retry. */
> +	if (!sc->may_thrash) {
> +		sc->priority = initial_priority;
> +		sc->may_thrash = 1;
> +		goto retry;
> +	}
> +
>  	/* Any of the zones still reclaimable?  Don't OOM. */
>  	if (zones_reclaimable)
>  		return 1;
> -- 
> 2.2.0
> 

-- 
Michal Hocko
SUSE Labs


* Re: [patch 2/2] mm: memcontrol: default hierarchy interface for memory
  2015-01-20 15:31 ` [patch 2/2] mm: memcontrol: default hierarchy interface for memory Johannes Weiner
  2015-01-20 16:31   ` Michal Hocko
@ 2015-02-23 11:13   ` Sasha Levin
  2015-02-23 14:28     ` Michal Hocko
  1 sibling, 1 reply; 20+ messages in thread
From: Sasha Levin @ 2015-02-23 11:13 UTC (permalink / raw)
  To: Johannes Weiner, Andrew Morton
  Cc: Michal Hocko, Vladimir Davydov, Greg Thelen, linux-mm, cgroups,
	linux-kernel

Hi Johannes,

On 01/20/2015 10:31 AM, Johannes Weiner wrote:
> Introduce the basic control files to account, partition, and limit
> memory using cgroups in default hierarchy mode.

I'm seeing the following while fuzzing:

[ 5634.427361] GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP KASAN
[ 5634.430492] Dumping ftrace buffer:
[ 5634.430565]    (ftrace buffer empty)
[ 5634.430565] Modules linked in:
[ 5634.430565] CPU: 0 PID: 3983 Comm: kswapd0 Not tainted 3.19.0-next-20150222-sasha-00045-g8dc7569 #1943
[ 5634.430565] task: ffff88056a7cb000 ti: ffff880568860000 task.ti: ffff880568860000
[ 5634.430565] RIP: mem_cgroup_low (./arch/x86/include/asm/atomic64_64.h:21 include/asm-generic/atomic-long.h:31 include/linux/page_counter.h:34 mm/memcontrol.c:5438)
[ 5634.430565] RSP: 0000:ffff880568867968  EFLAGS: 00010202
[ 5634.430565] RAX: 000000000000001a RBX: 0000000000000000 RCX: 0000000000000000
[ 5634.430565] RDX: 1ffff1000822a3a4 RSI: ffff880041151bd8 RDI: ffff880041151cb8
[ 5634.430565] RBP: ffff880568867998 R08: 0000000000000000 R09: 0000000000000001
[ 5634.430565] R10: ffff880041151bd8 R11: 0000000000000000 R12: 00000000000000d0
[ 5634.430565] R13: dffffc0000000000 R14: ffff8800000237b0 R15: 0000000000000000
[ 5634.430565] FS:  0000000000000000(0000) GS:ffff88091aa00000(0000) knlGS:0000000000000000
[ 5634.430565] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 5634.430565] CR2: 000000000138efd8 CR3: 0000000500078000 CR4: 00000000000007b0
[ 5634.430565] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 5634.430565] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
[ 5634.430565] Stack:
[ 5634.430565]  ffff880568867988 ffff880041151bd8 0000000000000000 ffff880000610000
[ 5634.430565]  ffff880568867d68 dffffc0000000000 ffff880568867b38 ffffffff81a1ac0f
[ 5634.430565]  ffffffff81b875b0 1ffff100ad10cf45 ffff880568867d80 ffff880568867d70
[ 5634.430565] Call Trace:
[ 5634.430565] shrink_zone (mm/vmscan.c:2389)
[ 5634.430565] ? percpu_ref_get_many (include/linux/percpu-refcount.h:270)
[ 5634.430565] ? shrink_lruvec (mm/vmscan.c:2365)
[ 5634.430565] kswapd (mm/vmscan.c:3104 mm/vmscan.c:3276 mm/vmscan.c:3484)
[ 5634.430565] ? debug_check_no_locks_freed (kernel/locking/lockdep.c:3051)
[ 5634.430565] ? mem_cgroup_shrink_node_zone (mm/vmscan.c:3401)
[ 5634.430565] ? __tick_nohz_task_switch (./arch/x86/include/asm/paravirt.h:809 (discriminator 2) kernel/time/tick-sched.c:292 (discriminator 2))
[ 5634.430565] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2554 kernel/locking/lockdep.c:2601)
[ 5634.430565] ? trace_hardirqs_on (kernel/locking/lockdep.c:2609)
[ 5634.430565] ? finish_task_switch (kernel/sched/core.c:2229)
[ 5634.430565] ? finish_task_switch (kernel/sched/sched.h:1058 kernel/sched/core.c:2210)
[ 5634.430565] ? __init_waitqueue_head (kernel/sched/wait.c:292)
[ 5634.430565] ? __schedule (kernel/sched/core.c:2320 kernel/sched/core.c:2778)
[ 5634.430565] ? mem_cgroup_shrink_node_zone (mm/vmscan.c:3401)
[ 5634.430565] ? mem_cgroup_shrink_node_zone (mm/vmscan.c:3401)
[ 5634.430565] kthread (kernel/kthread.c:207)
[ 5634.430565] ? __tick_nohz_task_switch (./arch/x86/include/asm/paravirt.h:809 (discriminator 2) kernel/time/tick-sched.c:292 (discriminator 2))
[ 5634.430565] ? flush_kthread_work (kernel/kthread.c:176)
[ 5634.430565] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2554 kernel/locking/lockdep.c:2601)
[ 5634.430565] ? schedule_tail (kernel/sched/core.c:2268)
[ 5634.430565] ? flush_kthread_work (kernel/kthread.c:176)
[ 5634.430565] ret_from_fork (arch/x86/kernel/entry_64.S:283)
[ 5634.430565] ? flush_kthread_work (kernel/kthread.c:176)
[ 5634.430565] Code: ff 49 39 de 0f 84 bd 00 00 00 49 89 dc 49 81 c4 d0 00 00 00 0f 84 f7 00 00 00 41 f6 c4 07 0f 85 ed 00 00 00 4c 89 e0 48 c1 e8 03 <42> 80 3c 28 00 0f 85 ef 00 00 00 4c 8b a3 d0 00 00 00 48 85 db
All code
========
   0:	ff 49 39             	decl   0x39(%rcx)
   3:	de 0f                	fimul  (%rdi)
   5:	84 bd 00 00 00 49    	test   %bh,0x49000000(%rbp)
   b:	89 dc                	mov    %ebx,%esp
   d:	49 81 c4 d0 00 00 00 	add    $0xd0,%r12
  14:	0f 84 f7 00 00 00    	je     0x111
  1a:	41 f6 c4 07          	test   $0x7,%r12b
  1e:	0f 85 ed 00 00 00    	jne    0x111
  24:	4c 89 e0             	mov    %r12,%rax
  27:	48 c1 e8 03          	shr    $0x3,%rax
  2b:*	42 80 3c 28 00       	cmpb   $0x0,(%rax,%r13,1)		<-- trapping instruction
  30:	0f 85 ef 00 00 00    	jne    0x125
  36:	4c 8b a3 d0 00 00 00 	mov    0xd0(%rbx),%r12
  3d:	48 85 db             	test   %rbx,%rbx
	...

Code starting with the faulting instruction
===========================================
   0:	42 80 3c 28 00       	cmpb   $0x0,(%rax,%r13,1)
   5:	0f 85 ef 00 00 00    	jne    0xfa
   b:	4c 8b a3 d0 00 00 00 	mov    0xd0(%rbx),%r12
  12:	48 85 db             	test   %rbx,%rbx
	...
[ 5634.430565] RIP mem_cgroup_low (./arch/x86/include/asm/atomic64_64.h:21 include/asm-generic/atomic-long.h:31 include/linux/page_counter.h:34 mm/memcontrol.c:5438)
[ 5634.430565]  RSP <ffff880568867968>
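
For anyone decoding the trace above, here is a rough sketch of the arithmetic behind the trapping instruction. It assumes the usual x86-64 KASAN layout (one shadow byte per 8 bytes of memory, shadow offset equal to the value held in R13) and 48-bit virtual addresses; the helper names are made up for illustration and are not kernel functions.

```c
#include <stdint.h>

/* Assumed x86-64 KASAN parameters; the offset matches R13 in the trace. */
#define SKETCH_SHADOW_SCALE_SHIFT 3
#define SKETCH_SHADOW_OFFSET      0xdffffc0000000000ULL

/* Address of the shadow byte that is checked before touching 'addr'. */
uint64_t sketch_shadow_addr(uint64_t addr)
{
	return (addr >> SKETCH_SHADOW_SCALE_SHIFT) + SKETCH_SHADOW_OFFSET;
}

/*
 * With 48-bit virtual addresses (an assumption), an address is canonical
 * only if bits 63..47 are all zero or all one.
 */
int sketch_is_canonical(uint64_t addr)
{
	uint64_t top = addr >> 47;

	return top == 0 || top == 0x1ffff;
}
```

Feeding in R12 = 0xd0 gives 0xdffffc000000001a, which is RAX (0x1a) plus R13 and is not canonical, so the shadow check itself would trap with a general protection fault instead of a page fault. R12 is RBX + 0xd0 with RBX = 0, which suggests a NULL pointer was about to be dereferenced at offset 0xd0.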


Thanks,
Sasha


* Re: [patch 2/2] mm: memcontrol: default hierarchy interface for memory
  2015-02-23 11:13   ` Sasha Levin
@ 2015-02-23 14:28     ` Michal Hocko
  0 siblings, 0 replies; 20+ messages in thread
From: Michal Hocko @ 2015-02-23 14:28 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Johannes Weiner, Andrew Morton, Vladimir Davydov, Greg Thelen,
	linux-mm, cgroups, linux-kernel

On Mon 23-02-15 06:13:52, Sasha Levin wrote:
> Hi Johannes,
> 
> On 01/20/2015 10:31 AM, Johannes Weiner wrote:
> > Introduce the basic control files to account, partition, and limit
> > memory using cgroups in default hierarchy mode.
> 
> I'm seeing the following while fuzzing:

Already fixed by http://marc.info/?l=linux-mm&m=142416201408215&w=2.
Andrew has picked up the patch AFAIR but there is no mmotm tree yet.

> [ 5634.427361] GPF could be caused by NULL-ptr deref or user memory access
> general protection fault: 0000 [#1] SMP KASAN
> [ 5634.430492] Dumping ftrace buffer:
> [ 5634.430565]    (ftrace buffer empty)
> [ 5634.430565] Modules linked in:
> [ 5634.430565] CPU: 0 PID: 3983 Comm: kswapd0 Not tainted 3.19.0-next-20150222-sasha-00045-g8dc7569 #1943
> [ 5634.430565] task: ffff88056a7cb000 ti: ffff880568860000 task.ti: ffff880568860000
> [ 5634.430565] RIP: mem_cgroup_low (./arch/x86/include/asm/atomic64_64.h:21 include/asm-generic/atomic-long.h:31 include/linux/page_counter.h:34 mm/memcontrol.c:5438)
> [ 5634.430565] RSP: 0000:ffff880568867968  EFLAGS: 00010202
> [ 5634.430565] RAX: 000000000000001a RBX: 0000000000000000 RCX: 0000000000000000
> [ 5634.430565] RDX: 1ffff1000822a3a4 RSI: ffff880041151bd8 RDI: ffff880041151cb8
> [ 5634.430565] RBP: ffff880568867998 R08: 0000000000000000 R09: 0000000000000001
> [ 5634.430565] R10: ffff880041151bd8 R11: 0000000000000000 R12: 00000000000000d0
> [ 5634.430565] R13: dffffc0000000000 R14: ffff8800000237b0 R15: 0000000000000000
> [ 5634.430565] FS:  0000000000000000(0000) GS:ffff88091aa00000(0000) knlGS:0000000000000000
> [ 5634.430565] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 5634.430565] CR2: 000000000138efd8 CR3: 0000000500078000 CR4: 00000000000007b0
> [ 5634.430565] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 5634.430565] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
> [ 5634.430565] Stack:
> [ 5634.430565]  ffff880568867988 ffff880041151bd8 0000000000000000 ffff880000610000
> [ 5634.430565]  ffff880568867d68 dffffc0000000000 ffff880568867b38 ffffffff81a1ac0f
> [ 5634.430565]  ffffffff81b875b0 1ffff100ad10cf45 ffff880568867d80 ffff880568867d70
> [ 5634.430565] Call Trace:
> [ 5634.430565] shrink_zone (mm/vmscan.c:2389)
> [ 5634.430565] ? percpu_ref_get_many (include/linux/percpu-refcount.h:270)
> [ 5634.430565] ? shrink_lruvec (mm/vmscan.c:2365)
> [ 5634.430565] kswapd (mm/vmscan.c:3104 mm/vmscan.c:3276 mm/vmscan.c:3484)
> [ 5634.430565] ? debug_check_no_locks_freed (kernel/locking/lockdep.c:3051)
> [ 5634.430565] ? mem_cgroup_shrink_node_zone (mm/vmscan.c:3401)
> [ 5634.430565] ? __tick_nohz_task_switch (./arch/x86/include/asm/paravirt.h:809 (discriminator 2) kernel/time/tick-sched.c:292 (discriminator 2))
> [ 5634.430565] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2554 kernel/locking/lockdep.c:2601)
> [ 5634.430565] ? trace_hardirqs_on (kernel/locking/lockdep.c:2609)
> [ 5634.430565] ? finish_task_switch (kernel/sched/core.c:2229)
> [ 5634.430565] ? finish_task_switch (kernel/sched/sched.h:1058 kernel/sched/core.c:2210)
> [ 5634.430565] ? __init_waitqueue_head (kernel/sched/wait.c:292)
> [ 5634.430565] ? __schedule (kernel/sched/core.c:2320 kernel/sched/core.c:2778)
> [ 5634.430565] ? mem_cgroup_shrink_node_zone (mm/vmscan.c:3401)
> [ 5634.430565] ? mem_cgroup_shrink_node_zone (mm/vmscan.c:3401)
> [ 5634.430565] kthread (kernel/kthread.c:207)
> [ 5634.430565] ? __tick_nohz_task_switch (./arch/x86/include/asm/paravirt.h:809 (discriminator 2) kernel/time/tick-sched.c:292 (discriminator 2))
> [ 5634.430565] ? flush_kthread_work (kernel/kthread.c:176)
> [ 5634.430565] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2554 kernel/locking/lockdep.c:2601)
> [ 5634.430565] ? schedule_tail (kernel/sched/core.c:2268)
> [ 5634.430565] ? flush_kthread_work (kernel/kthread.c:176)
> [ 5634.430565] ret_from_fork (arch/x86/kernel/entry_64.S:283)
> [ 5634.430565] ? flush_kthread_work (kernel/kthread.c:176)
> [ 5634.430565] Code: ff 49 39 de 0f 84 bd 00 00 00 49 89 dc 49 81 c4 d0 00 00 00 0f 84 f7 00 00 00 41 f6 c4 07 0f 85 ed 00 00 00 4c 89 e0 48 c1 e8 03 <42> 80 3c 28 00 0f 85 ef 00 00 00 4c 8b a3 d0 00 00 00 48 85 db
> All code
> ========
>    0:	ff 49 39             	decl   0x39(%rcx)
>    3:	de 0f                	fimul  (%rdi)
>    5:	84 bd 00 00 00 49    	test   %bh,0x49000000(%rbp)
>    b:	89 dc                	mov    %ebx,%esp
>    d:	49 81 c4 d0 00 00 00 	add    $0xd0,%r12
>   14:	0f 84 f7 00 00 00    	je     0x111
>   1a:	41 f6 c4 07          	test   $0x7,%r12b
>   1e:	0f 85 ed 00 00 00    	jne    0x111
>   24:	4c 89 e0             	mov    %r12,%rax
>   27:	48 c1 e8 03          	shr    $0x3,%rax
>   2b:*	42 80 3c 28 00       	cmpb   $0x0,(%rax,%r13,1)		<-- trapping instruction
>   30:	0f 85 ef 00 00 00    	jne    0x125
>   36:	4c 8b a3 d0 00 00 00 	mov    0xd0(%rbx),%r12
>   3d:	48 85 db             	test   %rbx,%rbx
> 	...
> 
> Code starting with the faulting instruction
> ===========================================
>    0:	42 80 3c 28 00       	cmpb   $0x0,(%rax,%r13,1)
>    5:	0f 85 ef 00 00 00    	jne    0xfa
>    b:	4c 8b a3 d0 00 00 00 	mov    0xd0(%rbx),%r12
>   12:	48 85 db             	test   %rbx,%rbx
> 	...
> [ 5634.430565] RIP mem_cgroup_low (./arch/x86/include/asm/atomic64_64.h:21 include/asm-generic/atomic-long.h:31 include/linux/page_counter.h:34 mm/memcontrol.c:5438)
> [ 5634.430565]  RSP <ffff880568867968>
> 
> 
> Thanks,
> Sasha

-- 
Michal Hocko
SUSE Labs


* Re: [patch 0/2] mm: memcontrol: default hierarchy interface for memory v2
  2015-01-20 15:31 [patch 0/2] mm: memcontrol: default hierarchy interface for memory v2 Johannes Weiner
  2015-01-20 15:31 ` [patch 1/2] mm: page_counter: pull "-1" handling out of page_counter_memparse() Johannes Weiner
  2015-01-20 15:31 ` [patch 2/2] mm: memcontrol: default hierarchy interface for memory Johannes Weiner
@ 2015-01-20 16:57 ` Michal Hocko
  2 siblings, 0 replies; 20+ messages in thread
From: Michal Hocko @ 2015-01-20 16:57 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Vladimir Davydov, Greg Thelen, linux-mm, cgroups,
	linux-kernel

And one more thing. Maybe we should consider posting this to linux-api
as well. This is a new user-visible interface and it would be good to
have more eyes looking at possible shortcomings.

On Tue 20-01-15 10:31:53, Johannes Weiner wrote:
> Hi Andrew,
> 
> these patches changed sufficiently while in -mm that a rebase makes
> sense.  The change from using "none" in the configuration files to
> "max"/"infinity" requires a do-over of 1/2 and a changelog fix in 2/2.
> 
> I folded all increments, both in-tree and the ones still pending, and
> credited your seq_puts() checkpatch fix, so these two changes are the
> all-encompassing latest versions, and everything else can be dropped.
> 
> Thanks!
> 

-- 
Michal Hocko
SUSE Labs


* [patch 1/2] mm: page_counter: pull "-1" handling out of page_counter_memparse()
@ 2015-01-09  4:15 Johannes Weiner
  2015-01-09  4:15 ` [patch 2/2] mm: memcontrol: default hierarchy interface for memory Johannes Weiner
  0 siblings, 1 reply; 20+ messages in thread
From: Johannes Weiner @ 2015-01-09  4:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Vladimir Davydov, Greg Thelen, linux-mm, cgroups,
	linux-kernel

It was convenient to have the generic function handle it, as all
callsites agreed.  Subsequent patches will add new user interfaces
that do not want to support the "-1" special string.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/hugetlb_cgroup.c       | 10 +++++++---
 mm/memcontrol.c           | 20 ++++++++++++++------
 mm/page_counter.c         |  6 ------
 net/ipv4/tcp_memcontrol.c | 10 +++++++---
 4 files changed, 28 insertions(+), 18 deletions(-)

diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index 037e1c00a5b7..ee3fc80adba1 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -279,9 +279,13 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
 		return -EINVAL;
 
 	buf = strstrip(buf);
-	ret = page_counter_memparse(buf, &nr_pages);
-	if (ret)
-		return ret;
+	if (!strcmp(buf, "-1")) {
+		nr_pages = PAGE_COUNTER_MAX;
+	} else {
+		ret = page_counter_memparse(buf, &nr_pages);
+		if (ret)
+			return ret;
+	}
 
 	idx = MEMFILE_IDX(of_cft(of)->private);
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 202e3862d564..20486da85750 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3400,9 +3400,13 @@ static ssize_t mem_cgroup_write(struct kernfs_open_file *of,
 	int ret;
 
 	buf = strstrip(buf);
-	ret = page_counter_memparse(buf, &nr_pages);
-	if (ret)
-		return ret;
+	if (!strcmp(buf, "-1")) {
+		nr_pages = PAGE_COUNTER_MAX;
+	} else {
+		ret = page_counter_memparse(buf, &nr_pages);
+		if (ret)
+			return ret;
+	}
 
 	switch (MEMFILE_ATTR(of_cft(of)->private)) {
 	case RES_LIMIT:
@@ -3768,9 +3772,13 @@ static int __mem_cgroup_usage_register_event(struct mem_cgroup *memcg,
 	unsigned long usage;
 	int i, size, ret;
 
-	ret = page_counter_memparse(args, &threshold);
-	if (ret)
-		return ret;
+	if (!strcmp(args, "-1")) {
+		threshold = PAGE_COUNTER_MAX;
+	} else {
+		ret = page_counter_memparse(args, &threshold);
+		if (ret)
+			return ret;
+	}
 
 	mutex_lock(&memcg->thresholds_lock);
 
diff --git a/mm/page_counter.c b/mm/page_counter.c
index a009574fbba9..0d4f9daf68bd 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -173,15 +173,9 @@ int page_counter_limit(struct page_counter *counter, unsigned long limit)
  */
 int page_counter_memparse(const char *buf, unsigned long *nr_pages)
 {
-	char unlimited[] = "-1";
 	char *end;
 	u64 bytes;
 
-	if (!strncmp(buf, unlimited, sizeof(unlimited))) {
-		*nr_pages = PAGE_COUNTER_MAX;
-		return 0;
-	}
-
 	bytes = memparse(buf, &end);
 	if (*end != '\0')
 		return -EINVAL;
diff --git a/net/ipv4/tcp_memcontrol.c b/net/ipv4/tcp_memcontrol.c
index 272327134a1b..a9d9fcb4dc25 100644
--- a/net/ipv4/tcp_memcontrol.c
+++ b/net/ipv4/tcp_memcontrol.c
@@ -120,9 +120,13 @@ static ssize_t tcp_cgroup_write(struct kernfs_open_file *of,
 	switch (of_cft(of)->private) {
 	case RES_LIMIT:
 		/* see memcontrol.c */
-		ret = page_counter_memparse(buf, &nr_pages);
-		if (ret)
-			break;
+		if (!strcmp(buf, "-1")) {
+			nr_pages = PAGE_COUNTER_MAX;
+		} else {
+			ret = page_counter_memparse(buf, &nr_pages);
+			if (ret)
+				break;
+		}
 		mutex_lock(&tcp_limit_mutex);
 		ret = tcp_update_limit(memcg, nr_pages);
 		mutex_unlock(&tcp_limit_mutex);
-- 
2.2.0
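
The call-site pattern this patch introduces can be tried out in user space. The following is only a sketch under stated assumptions: sketch_memparse() is a simplified stand-in for page_counter_memparse() built on strtoull(), and the 4096-byte page size, SKETCH_COUNTER_MAX, and legacy_parse_limit() are hypothetical names and values chosen for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE   4096ULL	/* assumed page size */
/* stand-in for PAGE_COUNTER_MAX */
#define SKETCH_COUNTER_MAX ((unsigned long)(LONG_MAX / SKETCH_PAGE_SIZE))

/* Generic parser: bytes with an optional K/M/G suffix, no "-1" special case. */
int sketch_memparse(const char *buf, unsigned long *nr_pages)
{
	unsigned long long bytes, pages;
	char *end;

	assert(buf && nr_pages);
	/*
	 * Reject a leading sign or whitespace, which strtoull() would
	 * otherwise accept; this is what makes "-1" an error here.
	 */
	if (!isdigit((unsigned char)buf[0]))
		return -EINVAL;

	bytes = strtoull(buf, &end, 0);
	switch (*end) {
	case 'G': case 'g':
		bytes <<= 10;	/* fall through */
	case 'M': case 'm':
		bytes <<= 10;	/* fall through */
	case 'K': case 'k':
		bytes <<= 10;
		end++;
		break;
	default:
		break;
	}
	if (*end != '\0')
		return -EINVAL;

	/* Convert to pages and clamp to the counter maximum. */
	pages = bytes / SKETCH_PAGE_SIZE;
	*nr_pages = pages > SKETCH_COUNTER_MAX ?
		    SKETCH_COUNTER_MAX : (unsigned long)pages;
	return 0;
}

/* Legacy call site: "-1" still means unlimited, handled by the caller. */
int legacy_parse_limit(const char *buf, unsigned long *nr_pages)
{
	if (!strcmp(buf, "-1")) {
		*nr_pages = SKETCH_COUNTER_MAX;
		return 0;
	}
	return sketch_memparse(buf, nr_pages);
}
```

A new interface that does not want the "-1" string would call the generic parser directly and get -EINVAL for it, while legacy files keep their behavior through the wrapper.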


* [patch 2/2] mm: memcontrol: default hierarchy interface for memory
  2015-01-09  4:15 [patch 1/2] mm: page_counter: pull "-1" handling out of page_counter_memparse() Johannes Weiner
@ 2015-01-09  4:15 ` Johannes Weiner
  2015-01-12 23:37   ` Andrew Morton
                     ` (4 more replies)
  0 siblings, 5 replies; 20+ messages in thread
From: Johannes Weiner @ 2015-01-09  4:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Vladimir Davydov, Greg Thelen, linux-mm, cgroups,
	linux-kernel

Introduce the basic control files to account, partition, and limit
memory using cgroups in default hierarchy mode.

This interface versioning allows us to address fundamental design
issues in the existing memory cgroup interface, further explained
below.  The old interface will be maintained indefinitely, but a
clearer model and improved workload performance should encourage
existing users to switch over to the new one eventually.

The control files are thus:

  - memory.current shows the current consumption of the cgroup and its
    descendants, in bytes.

  - memory.low configures the lower end of the cgroup's expected
    memory consumption range.  The kernel considers memory below that
    boundary to be a reserve - the minimum that the workload needs in
    order to make forward progress - and generally avoids reclaiming
    it, unless there is an imminent risk of entering an OOM situation.

  - memory.high configures the upper end of the cgroup's expected
    memory consumption range.  A cgroup whose consumption grows beyond
    this threshold is forced into direct reclaim, to work off the
    excess and to throttle new allocations heavily, but is generally
    allowed to continue and the OOM killer is not invoked.

  - memory.max configures the hard maximum amount of memory that the
    cgroup is allowed to consume before the OOM killer is invoked.

  - memory.events shows event counters that indicate how often the
    cgroup was reclaimed while below memory.low, how often it was
    forced to reclaim excess beyond memory.high, how often it hit
    memory.max, and how often it entered OOM due to memory.max.  This
    allows users to identify configuration problems when observing a
    degradation in workload performance.  An overcommitted system will
    have an increased rate of low boundary breaches, whereas increased
    rates of high limit breaches, maximum hits, or even OOM situations
    will indicate internally overcommitted cgroups.
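
As an illustration of how a monitoring tool might consume memory.events, here is a small user-space sketch. It assumes the "name value" per-line format that memory_events_show() in this patch emits (low, high, max, oom); the struct and function names are hypothetical, and the text is parsed from a string so the example does not depend on any particular cgroup mount point.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the four counters. */
struct sketch_memcg_events {
	unsigned long low, high, max, oom;
};

/*
 * Parse "low N\nhigh N\nmax N\noom N\n" style text.
 * Returns the number of known counters found; unknown keys are skipped.
 */
int sketch_parse_memory_events(const char *text, struct sketch_memcg_events *ev)
{
	char key[16];
	unsigned long val;
	int consumed, found = 0;

	assert(text && ev);
	memset(ev, 0, sizeof(*ev));

	while (sscanf(text, "%15s %lu%n", key, &val, &consumed) == 2) {
		text += consumed;
		if (!strcmp(key, "low"))
			ev->low = val;
		else if (!strcmp(key, "high"))
			ev->high = val;
		else if (!strcmp(key, "max"))
			ev->max = val;
		else if (!strcmp(key, "oom"))
			ev->oom = val;
		else
			continue;
		found++;
	}
	return found;
}
```

Sampling these counters periodically and comparing deltas would follow the reasoning above: a rising low count points at an overcommitted system, while rising high, max, or oom counts point at a cgroup that is itself overcommitted.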

For existing users of memory cgroups, the following deviations from
the current interface are worth pointing out and explaining:

  - The original lower boundary, the soft limit, is defined as a limit
    that is unset by default.  As a result, the set of cgroups that
    global reclaim prefers is opt-in, rather than opt-out.  The costs
    for optimizing these mostly negative lookups are so high that the
    implementation, despite its enormous size, does not even provide
    the basic desirable behavior.  First off, the soft limit has no
    hierarchical meaning.  All configured groups are organized in a
    global rbtree and treated like equal peers, regardless of where
    they are located in the hierarchy.  This makes subtree delegation
    impossible.  Second, the soft limit reclaim pass is so aggressive
    that it not only introduces high allocation latencies into the
    system, but also hurts system performance through overreclaim, to
    the point where the feature becomes self-defeating.

    The memory.low boundary on the other hand is a top-down allocated
    reserve.  A cgroup enjoys reclaim protection when it and all its
    ancestors are below their low boundaries, which makes delegation
    of subtrees possible.  Second, new cgroups have no reserve by
    default, and in the common case most cgroups are eligible for the
    preferred reclaim pass.  This allows the new low boundary to be
    efficiently implemented with just a minor addition to the generic
    reclaim code, without the need for out-of-band data structures and
    reclaim passes.  Because the generic reclaim code considers all
    cgroups except for the ones running low in the preferred first
    reclaim pass, overreclaim of individual groups is eliminated as
    well, resulting in much better overall workload performance.

  - The original high boundary, the hard limit, is defined as a strict
    limit that cannot budge, even if the OOM killer has to be called.
    But this generally goes against the goal of making the most out of
    the available memory.  The memory consumption of workloads varies
    during runtime, and that requires users to overcommit.  But doing
    that with a strict upper limit requires either a fairly accurate
    prediction of the working set size or adding slack to the limit.
    Since working set size estimation is hard and error prone, and
    getting it wrong results in OOM kills, most users tend to err on
    the side of a looser limit and end up wasting precious resources.

    The memory.high boundary on the other hand can be set much more
    conservatively.  When hit, it throttles allocations by forcing
    them into direct reclaim to work off the excess, but it never
    invokes the OOM killer.  As a result, a high boundary that is
    chosen too aggressively will not terminate the processes, but
    instead it will lead to gradual performance degradation.  The user
    can monitor this and make corrections until the minimal memory
    footprint that still gives acceptable performance is found.

    In extreme cases, with many concurrent allocations and a complete
    breakdown of reclaim progress within the group, the high boundary
    can be exceeded.  But even then it's mostly better to satisfy the
    allocation from the slack available in other groups or the rest of
    the system than to kill the group.  Otherwise, memory.max is there
    to limit this type of spillover and ultimately contain buggy or
    even malicious applications.

  - The existing control file names are unwieldy and inconsistent in
    many different ways.  For example, the upper boundary hit count is
    exported in the memory.failcnt file, but an OOM event count has to
    be manually counted by listening to memory.oom_control events, and
    lower boundary / soft limit events have to be counted by first
    setting a threshold for that value and then counting those events.
    Also, usage and limit files encode their units in the filename.
    That makes the filenames very long, even though this is not
    information that a user needs to be reminded of every time they
    type out those names.

    To address these naming issues, as well as to signal clearly that
    the new interface carries a new configuration model, the naming
    conventions in it necessarily differ from the old interface.
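
To make the low and high semantics described above concrete, here is a toy user-space model. It is a sketch only, not the kernel implementation: struct toy_cgroup and both helpers are hypothetical, usage is taken to be hierarchical (a parent's usage includes its children), "below" is read as "at or below", and the root group is assumed never to be protected.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a memory cgroup; all values are in pages. */
struct toy_cgroup {
	struct toy_cgroup *parent;	/* NULL for the root */
	unsigned long usage;		/* this group plus its descendants */
	unsigned long low;		/* lower end of the expected range */
	unsigned long high;		/* upper end of the expected range */
};

/*
 * A group is protected from the first reclaim pass only if it and every
 * ancestor below the root are within their low boundaries.
 */
bool toy_cgroup_low(const struct toy_cgroup *cg)
{
	assert(cg);
	if (!cg->parent)
		return false;	/* assumption: the root has no reserve */

	for (; cg->parent; cg = cg->parent)
		if (cg->usage > cg->low)
			return false;
	return true;
}

/* Pages a charging task would be asked to reclaim after exceeding high. */
unsigned long toy_high_excess(const struct toy_cgroup *cg)
{
	assert(cg);
	return cg->usage > cg->high ? cg->usage - cg->high : 0;
}
```

In this model a child keeps its protection only while its parent also stays within its own reserve, which is what lets a parent hand a subtree to someone else without that subtree being able to claim more protection than the parent was given.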

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/memcontrol.h |  32 ++++++
 mm/memcontrol.c            | 247 +++++++++++++++++++++++++++++++++++++++++++--
 mm/vmscan.c                |  22 +++-
 3 files changed, 287 insertions(+), 14 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 76b4084b8d08..dc04da2e79e0 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -52,7 +52,27 @@ struct mem_cgroup_reclaim_cookie {
 	unsigned int generation;
 };
 
+enum mem_cgroup_events_index {
+	MEM_CGROUP_EVENTS_PGPGIN,	/* # of pages paged in */
+	MEM_CGROUP_EVENTS_PGPGOUT,	/* # of pages paged out */
+	MEM_CGROUP_EVENTS_PGFAULT,	/* # of page-faults */
+	MEM_CGROUP_EVENTS_PGMAJFAULT,	/* # of major page-faults */
+	MEM_CGROUP_EVENTS_NSTATS,
+	/* default hierarchy events */
+	MEMCG_LOW = MEM_CGROUP_EVENTS_NSTATS,
+	MEMCG_HIGH,
+	MEMCG_MAX,
+	MEMCG_OOM,
+	MEMCG_NR_EVENTS,
+};
+
 #ifdef CONFIG_MEMCG
+void mem_cgroup_events(struct mem_cgroup *memcg,
+		       enum mem_cgroup_events_index idx,
+		       unsigned int nr);
+
+bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg);
+
 int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
 			  gfp_t gfp_mask, struct mem_cgroup **memcgp);
 void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
@@ -174,6 +194,18 @@ void mem_cgroup_split_huge_fixup(struct page *head);
 #else /* CONFIG_MEMCG */
 struct mem_cgroup;
 
+static inline void mem_cgroup_events(struct mem_cgroup *memcg,
+				     enum mem_cgroup_events_index idx,
+				     unsigned int nr)
+{
+}
+
+static inline bool mem_cgroup_low(struct mem_cgroup *root,
+				  struct mem_cgroup *memcg)
+{
+	return false;
+}
+
 static inline int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
 					gfp_t gfp_mask,
 					struct mem_cgroup **memcgp)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 20486da85750..fd9e542fc26f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -97,14 +97,6 @@ static const char * const mem_cgroup_stat_names[] = {
 	"swap",
 };
 
-enum mem_cgroup_events_index {
-	MEM_CGROUP_EVENTS_PGPGIN,	/* # of pages paged in */
-	MEM_CGROUP_EVENTS_PGPGOUT,	/* # of pages paged out */
-	MEM_CGROUP_EVENTS_PGFAULT,	/* # of page-faults */
-	MEM_CGROUP_EVENTS_PGMAJFAULT,	/* # of major page-faults */
-	MEM_CGROUP_EVENTS_NSTATS,
-};
-
 static const char * const mem_cgroup_events_names[] = {
 	"pgpgin",
 	"pgpgout",
@@ -138,7 +130,7 @@ enum mem_cgroup_events_target {
 
 struct mem_cgroup_stat_cpu {
 	long count[MEM_CGROUP_STAT_NSTATS];
-	unsigned long events[MEM_CGROUP_EVENTS_NSTATS];
+	unsigned long events[MEMCG_NR_EVENTS];
 	unsigned long nr_page_events;
 	unsigned long targets[MEM_CGROUP_NTARGETS];
 };
@@ -284,6 +276,10 @@ struct mem_cgroup {
 	struct page_counter memsw;
 	struct page_counter kmem;
 
+	/* Normal memory consumption range */
+	unsigned long low;
+	unsigned long high;
+
 	unsigned long soft_limit;
 
 	/* vmpressure notifications */
@@ -2301,6 +2297,8 @@ retry:
 	if (!(gfp_mask & __GFP_WAIT))
 		goto nomem;
 
+	mem_cgroup_events(mem_over_limit, MEMCG_MAX, 1);
+
 	nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
 						    gfp_mask, may_swap);
 
@@ -2342,6 +2340,8 @@ retry:
 	if (fatal_signal_pending(current))
 		goto bypass;
 
+	mem_cgroup_events(mem_over_limit, MEMCG_OOM, 1);
+
 	mem_cgroup_oom(mem_over_limit, gfp_mask, get_order(nr_pages));
 nomem:
 	if (!(gfp_mask & __GFP_NOFAIL))
@@ -2353,6 +2353,22 @@ done_restock:
 	css_get_many(&memcg->css, batch);
 	if (batch > nr_pages)
 		refill_stock(memcg, batch - nr_pages);
+	/*
+	 * If the hierarchy is above the normal consumption range,
+	 * make the charging task trim the excess.
+	 */
+	do {
+		unsigned long nr_pages = page_counter_read(&memcg->memory);
+		unsigned long high = ACCESS_ONCE(memcg->high);
+
+		if (nr_pages > high) {
+			mem_cgroup_events(memcg, MEMCG_HIGH, 1);
+
+			try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
+						     gfp_mask, true);
+		}
+
+	} while ((memcg = parent_mem_cgroup(memcg)));
 done:
 	return ret;
 }
@@ -4266,7 +4282,7 @@ out_kfree:
 	return ret;
 }
 
-static struct cftype mem_cgroup_files[] = {
+static struct cftype mem_cgroup_legacy_files[] = {
 	{
 		.name = "usage_in_bytes",
 		.private = MEMFILE_PRIVATE(_MEM, RES_USAGE),
@@ -4542,6 +4558,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 	if (parent_css == NULL) {
 		root_mem_cgroup = memcg;
 		page_counter_init(&memcg->memory, NULL);
+		memcg->high = PAGE_COUNTER_MAX;
 		memcg->soft_limit = PAGE_COUNTER_MAX;
 		page_counter_init(&memcg->memsw, NULL);
 		page_counter_init(&memcg->kmem, NULL);
@@ -4587,6 +4604,7 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
 
 	if (parent->use_hierarchy) {
 		page_counter_init(&memcg->memory, &parent->memory);
+		memcg->high = PAGE_COUNTER_MAX;
 		memcg->soft_limit = PAGE_COUNTER_MAX;
 		page_counter_init(&memcg->memsw, &parent->memsw);
 		page_counter_init(&memcg->kmem, &parent->kmem);
@@ -4597,6 +4615,7 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
 		 */
 	} else {
 		page_counter_init(&memcg->memory, NULL);
+		memcg->high = PAGE_COUNTER_MAX;
 		memcg->soft_limit = PAGE_COUNTER_MAX;
 		page_counter_init(&memcg->memsw, NULL);
 		page_counter_init(&memcg->kmem, NULL);
@@ -4672,6 +4691,8 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css)
 	mem_cgroup_resize_limit(memcg, PAGE_COUNTER_MAX);
 	mem_cgroup_resize_memsw_limit(memcg, PAGE_COUNTER_MAX);
 	memcg_update_kmem_limit(memcg, PAGE_COUNTER_MAX);
+	memcg->low = 0;
+	memcg->high = PAGE_COUNTER_MAX;
 	memcg->soft_limit = PAGE_COUNTER_MAX;
 }
 
@@ -5260,6 +5281,159 @@ static void mem_cgroup_bind(struct cgroup_subsys_state *root_css)
 		mem_cgroup_from_css(root_css)->use_hierarchy = true;
 }
 
+static u64 memory_current_read(struct cgroup_subsys_state *css,
+			       struct cftype *cft)
+{
+	return mem_cgroup_usage(mem_cgroup_from_css(css), false);
+}
+
+static int memory_low_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+	unsigned long low = ACCESS_ONCE(memcg->low);
+
+	if (low == 0)
+		seq_printf(m, "none\n");
+	else
+		seq_printf(m, "%llu\n", (u64)low * PAGE_SIZE);
+
+	return 0;
+}
+
+static ssize_t memory_low_write(struct kernfs_open_file *of,
+				char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	unsigned long low;
+	int err;
+
+	buf = strstrip(buf);
+	if (!strcmp(buf, "none")) {
+		low = 0;
+	} else {
+		err = page_counter_memparse(buf, &low);
+		if (err)
+			return err;
+	}
+
+	memcg->low = low;
+
+	return nbytes;
+}
+
+static int memory_high_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+	unsigned long high = ACCESS_ONCE(memcg->high);
+
+	if (high == PAGE_COUNTER_MAX)
+		seq_printf(m, "none\n");
+	else
+		seq_printf(m, "%llu\n", (u64)high * PAGE_SIZE);
+
+	return 0;
+}
+
+static ssize_t memory_high_write(struct kernfs_open_file *of,
+				 char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	unsigned long high;
+	int err;
+
+	buf = strstrip(buf);
+	if (!strcmp(buf, "none")) {
+		high = PAGE_COUNTER_MAX;
+	} else {
+		err = page_counter_memparse(buf, &high);
+		if (err)
+			return err;
+	}
+
+	memcg->high = high;
+
+	return nbytes;
+}
+
+static int memory_max_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+	unsigned long max = ACCESS_ONCE(memcg->memory.limit);
+
+	if (max == PAGE_COUNTER_MAX)
+		seq_printf(m, "none\n");
+	else
+		seq_printf(m, "%llu\n", (u64)max * PAGE_SIZE);
+
+	return 0;
+}
+
+static ssize_t memory_max_write(struct kernfs_open_file *of,
+				char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	unsigned long max;
+	int err;
+
+	buf = strstrip(buf);
+	if (!strcmp(buf, "none")) {
+		max = PAGE_COUNTER_MAX;
+	} else {
+		err = page_counter_memparse(buf, &max);
+		if (err)
+			return err;
+	}
+
+	err = mem_cgroup_resize_limit(memcg, max);
+	if (err)
+		return err;
+
+	return nbytes;
+}
+
+static int memory_events_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+
+	seq_printf(m, "low %lu\n", mem_cgroup_read_events(memcg, MEMCG_LOW));
+	seq_printf(m, "high %lu\n", mem_cgroup_read_events(memcg, MEMCG_HIGH));
+	seq_printf(m, "max %lu\n", mem_cgroup_read_events(memcg, MEMCG_MAX));
+	seq_printf(m, "oom %lu\n", mem_cgroup_read_events(memcg, MEMCG_OOM));
+
+	return 0;
+}
+
+static struct cftype memory_files[] = {
+	{
+		.name = "current",
+		.read_u64 = memory_current_read,
+	},
+	{
+		.name = "low",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = memory_low_show,
+		.write = memory_low_write,
+	},
+	{
+		.name = "high",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = memory_high_show,
+		.write = memory_high_write,
+	},
+	{
+		.name = "max",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = memory_max_show,
+		.write = memory_max_write,
+	},
+	{
+		.name = "events",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = memory_events_show,
+	},
+	{ }	/* terminate */
+};
+
 struct cgroup_subsys memory_cgrp_subsys = {
 	.css_alloc = mem_cgroup_css_alloc,
 	.css_online = mem_cgroup_css_online,
@@ -5270,7 +5444,8 @@ struct cgroup_subsys memory_cgrp_subsys = {
 	.cancel_attach = mem_cgroup_cancel_attach,
 	.attach = mem_cgroup_move_task,
 	.bind = mem_cgroup_bind,
-	.legacy_cftypes = mem_cgroup_files,
+	.dfl_cftypes = memory_files,
+	.legacy_cftypes = mem_cgroup_legacy_files,
 	.early_init = 0,
 };
 
@@ -5305,6 +5480,56 @@ static void __init enable_swap_cgroup(void)
 }
 #endif
 
+/**
+ * mem_cgroup_events - count memory events against a cgroup
+ * @memcg: the memory cgroup
+ * @idx: the event index
+ * @nr: the number of events to account for
+ */
+void mem_cgroup_events(struct mem_cgroup *memcg,
+		       enum mem_cgroup_events_index idx,
+		       unsigned int nr)
+{
+	this_cpu_add(memcg->stat->events[idx], nr);
+}
+
+/**
+ * mem_cgroup_low - check if memory consumption is below the normal range
+ * @root: the highest ancestor to consider
+ * @memcg: the memory cgroup to check
+ *
+ * Returns %true if memory consumption of @memcg, and that of all
+ * configurable ancestors up to @root, is below the normal range.
+ */
+bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg)
+{
+	if (mem_cgroup_disabled())
+		return false;
+
+	/*
+	 * The toplevel group doesn't have a configurable range, so
+	 * it's never low when looked at directly, and it is not
+	 * considered an ancestor when assessing the hierarchy.
+	 */
+
+	if (memcg == root_mem_cgroup)
+		return false;
+
+	if (page_counter_read(&memcg->memory) > memcg->low)
+		return false;
+
+	while (memcg != root) {
+		memcg = parent_mem_cgroup(memcg);
+
+		if (memcg == root_mem_cgroup)
+			break;
+
+		if (page_counter_read(&memcg->memory) > memcg->low)
+			return false;
+	}
+	return true;
+}
+
 #ifdef CONFIG_MEMCG_SWAP
 /**
  * mem_cgroup_swapout - transfer a memsw charge to swap
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5e8772b2b9ef..64bf375cbe6f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -91,6 +91,9 @@ struct scan_control {
 	/* Can pages be swapped as part of reclaim? */
 	unsigned int may_swap:1;
 
+	/* Can cgroups be reclaimed below their normal consumption range? */
+	unsigned int may_thrash:1;
+
 	unsigned int hibernation_mode:1;
 
 	/* One of the zones is ready for compaction */
@@ -2322,6 +2325,12 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
 			struct lruvec *lruvec;
 			int swappiness;
 
+			if (mem_cgroup_low(root, memcg)) {
+				if (!sc->may_thrash)
+					continue;
+				mem_cgroup_events(memcg, MEMCG_LOW, 1);
+			}
+
 			lruvec = mem_cgroup_zone_lruvec(zone, memcg);
 			swappiness = mem_cgroup_swappiness(memcg);
 
@@ -2343,8 +2352,7 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
 				mem_cgroup_iter_break(root, memcg);
 				break;
 			}
-			memcg = mem_cgroup_iter(root, memcg, &reclaim);
-		} while (memcg);
+		} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
 
 		/*
 		 * Shrink the slab caches in the same proportion that
@@ -2547,10 +2555,11 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 					  struct scan_control *sc)
 {
+	int initial_priority = sc->priority;
 	unsigned long total_scanned = 0;
 	unsigned long writeback_threshold;
 	bool zones_reclaimable;
-
+retry:
 	delayacct_freepages_start();
 
 	if (global_reclaim(sc))
@@ -2600,6 +2609,13 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 	if (sc->compaction_ready)
 		return 1;
 
+	/* Untapped cgroup reserves?  Don't OOM, retry. */
+	if (!sc->may_thrash) {
+		sc->priority = initial_priority;
+		sc->may_thrash = 1;
+		goto retry;
+	}
+
 	/* Any of the zones still reclaimable?  Don't OOM. */
 	if (zones_reclaimable)
 		return 1;
-- 
2.2.0


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [patch 2/2] mm: memcontrol: default hierarchy interface for memory
  2015-01-09  4:15 ` [patch 2/2] mm: memcontrol: default hierarchy interface for memory Johannes Weiner
@ 2015-01-12 23:37   ` Andrew Morton
  2015-01-13 15:50     ` Johannes Weiner
  2015-01-13 23:20   ` Greg Thelen
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 20+ messages in thread
From: Andrew Morton @ 2015-01-12 23:37 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Vladimir Davydov, Greg Thelen, linux-mm, cgroups,
	linux-kernel

On Thu,  8 Jan 2015 23:15:04 -0500 Johannes Weiner <hannes@cmpxchg.org> wrote:

> Introduce the basic control files to account, partition, and limit
> memory using cgroups in default hierarchy mode.
> 
> This interface versioning allows us to address fundamental design
> issues in the existing memory cgroup interface, further explained
> below.  The old interface will be maintained indefinitely, but a
> clearer model and improved workload performance should encourage
> existing users to switch over to the new one eventually.
> 
> The control files are thus:
> 
>   - memory.current shows the current consumption of the cgroup and its
>     descendants, in bytes.
> 
>   - memory.low configures the lower end of the cgroup's expected
>     memory consumption range.  The kernel considers memory below that
>     boundary to be a reserve - the minimum that the workload needs in
>     order to make forward progress - and generally avoids reclaiming
>     it, unless there is an imminent risk of entering an OOM situation.

The code appears to be ascribing a special meaning to low==0: you can
write "none" to set this.  But I'm not seeing any description of this?

>   - memory.high configures the upper end of the cgroup's expected
>     memory consumption range.  A cgroup whose consumption grows beyond
>     this threshold is forced into direct reclaim, to work off the
>     excess and to throttle new allocations heavily, but is generally
>     allowed to continue and the OOM killer is not invoked.
> 
>   - memory.max configures the hard maximum amount of memory that the
>     cgroup is allowed to consume before the OOM killer is invoked.
> 
>   - memory.events shows event counters that indicate how often the
>     cgroup was reclaimed while below memory.low, how often it was
>     forced to reclaim excess beyond memory.high, how often it hit
>     memory.max, and how often it entered OOM due to memory.max.  This
>     allows users to identify configuration problems when observing a
>     degradation in workload performance.  An overcommitted system will
>     have an increased rate of low boundary breaches, whereas increased
>     rates of high limit breaches, maximum hits, or even OOM situations
>     will indicate internally overcommitted cgroups.
> 
> For existing users of memory cgroups, the following deviations from
> the current interface are worth pointing out and explaining:
> 
>   - The original lower boundary, the soft limit, is defined as a limit
>     that is per default unset.  As a result, the set of cgroups that
>     global reclaim prefers is opt-in, rather than opt-out.  The costs
>     for optimizing these mostly negative lookups are so high that the
>     implementation, despite its enormous size, does not even provide
>     the basic desirable behavior.  First off, the soft limit has no
>     hierarchical meaning.  All configured groups are organized in a
>     global rbtree and treated like equal peers, regardless of where they
>     are located in the hierarchy.  This makes subtree delegation
>     impossible.  Second, the soft limit reclaim pass is so aggressive
>     that it not just introduces high allocation latencies into the
>     system, but also impacts system performance due to overreclaim, to
>     the point where the feature becomes self-defeating.
> 
>     The memory.low boundary on the other hand is a top-down allocated
>     reserve.  A cgroup enjoys reclaim protection when it and all its
>     ancestors are below their low boundaries, which makes delegation
>     of subtrees possible.  Secondly, new cgroups have no reserve per
>     default and in the common case most cgroups are eligible for the
>     preferred reclaim pass.  This allows the new low boundary to be
>     efficiently implemented with just a minor addition to the generic
>     reclaim code, without the need for out-of-band data structures and
>     reclaim passes.  Because the generic reclaim code considers all
>     cgroups except for the ones running low in the preferred first
>     reclaim pass, overreclaim of individual groups is eliminated as
>     well, resulting in much better overall workload performance.
> 
>   - The original high boundary, the hard limit, is defined as a strict
>     limit that can not budge, even if the OOM killer has to be called.
>     But this generally goes against the goal of making the most out of
>     the available memory.  The memory consumption of workloads varies
>     during runtime, and that requires users to overcommit.  But doing
>     that with a strict upper limit requires either a fairly accurate
>     prediction of the working set size or adding slack to the limit.
>     Since working set size estimation is hard and error prone, and
>     getting it wrong results in OOM kills, most users tend to err on
>     the side of a looser limit and end up wasting precious resources.
> 
>     The memory.high boundary on the other hand can be set much more
>     conservatively.  When hit, it throttles allocations by forcing
>     them into direct reclaim to work off the excess, but it never
>     invokes the OOM killer.  As a result, a high boundary that is
>     chosen too aggressively will not terminate the processes, but
>     instead it will lead to gradual performance degradation.  The user
>     can monitor this and make corrections until the minimal memory
>     footprint that still gives acceptable performance is found.
> 
>     In extreme cases, with many concurrent allocations and a complete
>     breakdown of reclaim progress within the group, the high boundary
>     can be exceeded.  But even then it's mostly better to satisfy the
>     allocation from the slack available in other groups or the rest of
>     the system than killing the group.  Otherwise, memory.max is there
>     to limit this type of spillover and ultimately contain buggy or
>     even malicious applications.
> 
>   - The existing control file names are unwieldy and inconsistent in
>     many different ways.  For example, the upper boundary hit count is
>     exported in the memory.failcnt file, but an OOM event count has to
>     be manually counted by listening to memory.oom_control events, and
>     lower boundary / soft limit events have to be counted by first
>     setting a threshold for that value and then counting those events.
>     Also, usage and limit files encode their units in the filename.
>     That makes the filenames very long, even though this is not
>     information that a user needs to be reminded of every time they
>     type out those names.
> 
>     To address these naming issues, as well as to signal clearly that
>     the new interface carries a new configuration model, the naming
>     conventions in it necessarily differ from the old interface.

This all sounds pretty major.  How much trouble is this change likely to
cause existing memcg users?

> include/linux/memcontrol.h |  32 ++++++
> mm/memcontrol.c            | 247 +++++++++++++++++++++++++++++++++++++++++++--
> mm/vmscan.c                |  22 +++-

No Documentation/cgroups/memory.txt?



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [patch 2/2] mm: memcontrol: default hierarchy interface for memory
  2015-01-12 23:37   ` Andrew Morton
@ 2015-01-13 15:50     ` Johannes Weiner
  2015-01-13 20:52       ` Andrew Morton
  0 siblings, 1 reply; 20+ messages in thread
From: Johannes Weiner @ 2015-01-13 15:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Vladimir Davydov, Greg Thelen, linux-mm, cgroups,
	linux-kernel

On Mon, Jan 12, 2015 at 03:37:16PM -0800, Andrew Morton wrote:
> On Thu,  8 Jan 2015 23:15:04 -0500 Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > Introduce the basic control files to account, partition, and limit
> > memory using cgroups in default hierarchy mode.
> > 
> > This interface versioning allows us to address fundamental design
> > issues in the existing memory cgroup interface, further explained
> > below.  The old interface will be maintained indefinitely, but a
> > clearer model and improved workload performance should encourage
> > existing users to switch over to the new one eventually.
> > 
> > The control files are thus:
> > 
> >   - memory.current shows the current consumption of the cgroup and its
> >     descendants, in bytes.
> > 
> >   - memory.low configures the lower end of the cgroup's expected
> >     memory consumption range.  The kernel considers memory below that
> >     boundary to be a reserve - the minimum that the workload needs in
> >     order to make forward progress - and generally avoids reclaiming
> >     it, unless there is an imminent risk of entering an OOM situation.
> 
> The code appears to be ascribing a special meaning to low==0: you can
> write "none" to set this.  But I'm not seeing any description of this?

Ah, yes.

The memory.limit_in_bytes and memory.soft_limit_in_bytes currently
show 18446744073709551615 per default, which is a silly way of saying
"this limit is inactive".  And echoing -1 into the control file is an
even sillier way of setting this state.  So the new interface just
calls this state "none".  Internally, 0 and Very High Number represent
this unconfigured state for memory.low and memory.high, respectively.

I added a bullet point at the end of the changelog below.
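
To make the mapping concrete, here is a minimal userspace sketch, not
the kernel code.  The sketch_ names, the 4096-byte page size and
SKETCH_COUNTER_MAX are assumptions standing in for PAGE_SIZE and
PAGE_COUNTER_MAX, and the input is assumed to be whitespace-stripped
already.  It accepts plain byte counts only.

```c
#include <assert.h>
#include <errno.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel's PAGE_SIZE and PAGE_COUNTER_MAX. */
#define SKETCH_PAGE_SIZE   UINT64_C(4096)
#define SKETCH_COUNTER_MAX (UINT64_MAX / SKETCH_PAGE_SIZE)

/*
 * Parse a boundary as the user would write it.  "none" maps to the
 * caller-supplied unconfigured value (0 for low, SKETCH_COUNTER_MAX
 * for high); anything else must be a decimal byte count, which is
 * rounded down to whole pages.
 */
static int sketch_parse_boundary(const char *buf, uint64_t unset,
				 uint64_t *pages)
{
	unsigned long long bytes;
	char *end;

	assert(buf != NULL && pages != NULL);

	if (strcmp(buf, "none") == 0) {
		*pages = unset;
		return 0;
	}

	/* strtoull() would silently wrap "-1"; insist on a leading digit. */
	if (buf[0] < '0' || buf[0] > '9')
		return -EINVAL;

	errno = 0;
	bytes = strtoull(buf, &end, 10);
	if (errno != 0 || *end != '\0')
		return -EINVAL;

	*pages = bytes / SKETCH_PAGE_SIZE;
	return 0;
}

/* Format a boundary for reading: "none" when unconfigured, else bytes. */
static void sketch_format_boundary(uint64_t pages, uint64_t unset,
				   char *out, size_t len)
{
	assert(out != NULL && len > 0);

	if (pages == unset)
		snprintf(out, len, "none");
	else
		snprintf(out, len, "%" PRIu64, pages * SKETCH_PAGE_SIZE);
}
```

Passing the unconfigured value in as a parameter is what lets one
helper serve both boundaries, since "none" means 0 for memory.low but
the maximum counter value for memory.high.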

> >   - memory.high configures the upper end of the cgroup's expected
> >     memory consumption range.  A cgroup whose consumption grows beyond
> >     this threshold is forced into direct reclaim, to work off the
> >     excess and to throttle new allocations heavily, but is generally
> >     allowed to continue and the OOM killer is not invoked.
> > 
> >   - memory.max configures the hard maximum amount of memory that the
> >     cgroup is allowed to consume before the OOM killer is invoked.
> > 
> >   - memory.events shows event counters that indicate how often the
> >     cgroup was reclaimed while below memory.low, how often it was
> >     forced to reclaim excess beyond memory.high, how often it hit
> >     memory.max, and how often it entered OOM due to memory.max.  This
> >     allows users to identify configuration problems when observing a
> >     degradation in workload performance.  An overcommitted system will
> >     have an increased rate of low boundary breaches, whereas increased
> >     rates of high limit breaches, maximum hits, or even OOM situations
> >     will indicate internally overcommitted cgroups.
> > 
> > For existing users of memory cgroups, the following deviations from
> > the current interface are worth pointing out and explaining:
> > 
> >   - The original lower boundary, the soft limit, is defined as a limit
> >     that is per default unset.  As a result, the set of cgroups that
> >     global reclaim prefers is opt-in, rather than opt-out.  The costs
> >     for optimizing these mostly negative lookups are so high that the
> >     implementation, despite its enormous size, does not even provide
> >     the basic desirable behavior.  First off, the soft limit has no
> >     hierarchical meaning.  All configured groups are organized in a
> >     global rbtree and treated like equal peers, regardless of where they
> >     are located in the hierarchy.  This makes subtree delegation
> >     impossible.  Second, the soft limit reclaim pass is so aggressive
> >     that it not just introduces high allocation latencies into the
> >     system, but also impacts system performance due to overreclaim, to
> >     the point where the feature becomes self-defeating.
> > 
> >     The memory.low boundary on the other hand is a top-down allocated
> >     reserve.  A cgroup enjoys reclaim protection when it and all its
> >     ancestors are below their low boundaries, which makes delegation
> >     of subtrees possible.  Secondly, new cgroups have no reserve per
> >     default and in the common case most cgroups are eligible for the
> >     preferred reclaim pass.  This allows the new low boundary to be
> >     efficiently implemented with just a minor addition to the generic
> >     reclaim code, without the need for out-of-band data structures and
> >     reclaim passes.  Because the generic reclaim code considers all
> >     cgroups except for the ones running low in the preferred first
> >     reclaim pass, overreclaim of individual groups is eliminated as
> >     well, resulting in much better overall workload performance.
> > 
> >   - The original high boundary, the hard limit, is defined as a strict
> >     limit that can not budge, even if the OOM killer has to be called.
> >     But this generally goes against the goal of making the most out of
> >     the available memory.  The memory consumption of workloads varies
> >     during runtime, and that requires users to overcommit.  But doing
> >     that with a strict upper limit requires either a fairly accurate
> >     prediction of the working set size or adding slack to the limit.
> >     Since working set size estimation is hard and error prone, and
> >     getting it wrong results in OOM kills, most users tend to err on
> >     the side of a looser limit and end up wasting precious resources.
> > 
> >     The memory.high boundary on the other hand can be set much more
> >     conservatively.  When hit, it throttles allocations by forcing
> >     them into direct reclaim to work off the excess, but it never
> >     invokes the OOM killer.  As a result, a high boundary that is
> >     chosen too aggressively will not terminate the processes, but
> >     instead it will lead to gradual performance degradation.  The user
> >     can monitor this and make corrections until the minimal memory
> >     footprint that still gives acceptable performance is found.
> > 
> >     In extreme cases, with many concurrent allocations and a complete
> >     breakdown of reclaim progress within the group, the high boundary
> >     can be exceeded.  But even then it's mostly better to satisfy the
> >     allocation from the slack available in other groups or the rest of
> >     the system than killing the group.  Otherwise, memory.max is there
> >     to limit this type of spillover and ultimately contain buggy or
> >     even malicious applications.
> > 
> >   - The existing control file names are unwieldy and inconsistent in
> >     many different ways.  For example, the upper boundary hit count is
> >     exported in the memory.failcnt file, but an OOM event count has to
> >     be manually counted by listening to memory.oom_control events, and
> >     lower boundary / soft limit events have to be counted by first
> >     setting a threshold for that value and then counting those events.
> >     Also, usage and limit files encode their units in the filename.
> >     That makes the filenames very long, even though this is not
> >     information that a user needs to be reminded of every time they
> >     type out those names.
> > 
> >     To address these naming issues, as well as to signal clearly that
> >     the new interface carries a new configuration model, the naming
> >     conventions in it necessarily differ from the old interface.

  - The existing limit files indicate the state of an unset limit with
    a very high number, and a configured limit can be unset by echoing
    -1 into those files.  But that very high number is implementation
    and architecture dependent and not very descriptive.  And while -1
    can be understood as an underflow into the highest possible value,
    -2 or -10M etc. do not work, so it's quite inconsistent.

    memory.low and memory.high will indicate "none" if the boundary is
    not configured, and a configured boundary can be unset by writing
    "none" into these files as well.

Does that sound good?

> This all sounds pretty major.  How much trouble is this change likely to
> cause existing memcg users?

That is actually entirely up to the user in question.

1. The old cgroup interface remains in place as long as there are
users, so, technically, nothing has to change unless they want to.

2. While default settings and behavior of memory.low slightly differ
from memory.soft_limit_in_bytes, the new interface is compatible with
most existing usecases.  Anybody who currently only hard limits can
set memory.max to the same value as memory.limit_in_bytes and be done.
A configuration that uses soft limits should be easy to translate to
memory.low; AFAIK it's already only used for the reserve semantics.

3. That being said, even though they are not forced to, of course we
want users to rethink their approach to machine partitioning because
that is likely going to improve their workload performance and their
memory utilization.  If they go down this route, they have to figure
out the workloads' minimal amount of memory to run acceptably and set
memory.low.  Then they need to figure out the amount of slack they
want to afford each workload during idle times - this trades the
available cache window in the group against startup/allocation latency
in other groups - and set the memory.high accordingly.  The job
launcher/admin can then parcel off the system's memory by considering
some value between the low and high boundary as the average target
size of each group, depending on the desired level of overcommit.
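
As a rough sketch of that last step, the helpers below pick a planning
target between the two boundaries and count how many such groups fit on
a machine.  Both function names and the percentage knob are
hypothetical; nothing like them exists in the patch, and the linear
interpolation is just one plausible way to express "some value between
the low and high boundary".

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: choose a per-group planning target in bytes.
 * overcommit_pct == 0 plans with the full high boundary (no
 * overcommit), overcommit_pct == 100 plans with only the low boundary
 * (maximum overcommit).  Values in between interpolate linearly.
 */
static uint64_t sketch_target_bytes(uint64_t low, uint64_t high,
				    unsigned int overcommit_pct)
{
	uint64_t span;

	assert(low <= high);
	assert(overcommit_pct <= 100);

	span = high - low;
	/* Divide before multiplying so large spans cannot overflow. */
	return high - (span / 100 * overcommit_pct +
		       span % 100 * overcommit_pct / 100);
}

/* Hypothetical helper: how many groups of that size fit on the machine. */
static uint64_t sketch_groups_that_fit(uint64_t system_bytes,
				       uint64_t target_bytes)
{
	assert(target_bytes > 0);
	return system_bytes / target_bytes;
}
```

With a 1G low boundary and a 3G high boundary, a 16G machine would be
planned for 5 groups at 0% overcommit, 8 groups at 50%, and 16 groups
at 100%.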

> > include/linux/memcontrol.h |  32 ++++++
> > mm/memcontrol.c            | 247 +++++++++++++++++++++++++++++++++++++++++++--
> > mm/vmscan.c                |  22 +++-
> 
> No Documentation/cgroups/memory.txt?

That file has a bit of an identity crisis, where interface description
is entangled with irrelevant (and out-of-date) implementation details.

It would be a lot better to have a single cgroup interface document
that covers the generic interface and all available controllers with
consistent language and level of detail.  Better for users, probably
also better for developers to cross check if new interfaces integrate
nicely into the existing model.  I'll kick something off.

Thanks!


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [patch 2/2] mm: memcontrol: default hierarchy interface for memory
  2015-01-13 15:50     ` Johannes Weiner
@ 2015-01-13 20:52       ` Andrew Morton
  2015-01-13 21:44         ` Johannes Weiner
  0 siblings, 1 reply; 20+ messages in thread
From: Andrew Morton @ 2015-01-13 20:52 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Vladimir Davydov, Greg Thelen, linux-mm, cgroups,
	linux-kernel

On Tue, 13 Jan 2015 10:50:40 -0500 Johannes Weiner <hannes@cmpxchg.org> wrote:

> On Mon, Jan 12, 2015 at 03:37:16PM -0800, Andrew Morton wrote:
> > On Thu,  8 Jan 2015 23:15:04 -0500 Johannes Weiner <hannes@cmpxchg.org> wrote:
> > 
> > > Introduce the basic control files to account, partition, and limit
> > > memory using cgroups in default hierarchy mode.
> > > 
> > > This interface versioning allows us to address fundamental design
> > > issues in the existing memory cgroup interface, further explained
> > > below.  The old interface will be maintained indefinitely, but a
> > > clearer model and improved workload performance should encourage
> > > existing users to switch over to the new one eventually.
> > > 
> > > The control files are thus:
> > > 
> > >   - memory.current shows the current consumption of the cgroup and its
> > >     descendants, in bytes.
> > > 
> > >   - memory.low configures the lower end of the cgroup's expected
> > >     memory consumption range.  The kernel considers memory below that
> > >     boundary to be a reserve - the minimum that the workload needs in
> > >     order to make forward progress - and generally avoids reclaiming
> > >     it, unless there is an imminent risk of entering an OOM situation.
> > 
> > The code appears to be ascribing a special meaning to low==0: you can
> > write "none" to set this.  But I'm not seeing any description of this?
> 
> Ah, yes.
> 
> The memory.limit_in_bytes and memory.soft_limit_in_bytes currently
> show 18446744073709551615 per default, which is a silly way of saying
> "this limit is inactive".  And echoing -1 into the control file is an
> even sillier way of setting this state.  So the new interface just
> calls this state "none".  Internally, 0 and Very High Number represent
> this unconfigured state for memory.low and memory.high, respectively.
> 
> I added a bullet point at the end of the changelog below.

Added, thanks.

> > This all sounds pretty major.  How much trouble is this change likely to
> > cause existing memcg users?
> 
> That is actually entirely up to the user in question.
> 
> 1. The old cgroup interface remains in place as long as there are
> users, so, technically, nothing has to change unless they want to.

It would be good to zap the old interface one day.  Maybe we won't ever
be able to, but we should try.  Once this has all settled down and is
documented, how about we add a couple of printk_once's to poke people
in the new direction?


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [patch 2/2] mm: memcontrol: default hierarchy interface for memory
  2015-01-13 20:52       ` Andrew Morton
@ 2015-01-13 21:44         ` Johannes Weiner
  0 siblings, 0 replies; 20+ messages in thread
From: Johannes Weiner @ 2015-01-13 21:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Vladimir Davydov, Greg Thelen, linux-mm, cgroups,
	linux-kernel

On Tue, Jan 13, 2015 at 12:52:58PM -0800, Andrew Morton wrote:
> On Tue, 13 Jan 2015 10:50:40 -0500 Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > On Mon, Jan 12, 2015 at 03:37:16PM -0800, Andrew Morton wrote:
> > > This all sounds pretty major.  How much trouble is this change likely to
> > > cause existing memcg users?
> > 
> > That is actually entirely up to the user in question.
> > 
> > 1. The old cgroup interface remains in place as long as there are
> > users, so, technically, nothing has to change unless they want to.
> 
> It would be good to zap the old interface one day.  Maybe we won't ever
> be able to, but we should try.  Once this has all settled down and is
> documented, how about we add a couple of printk_once's to poke people
> in the new direction?

Yup, sounds good to me.

There are a few missing pieces until the development flag can be
lifted from the new cgroup interface.  I'm currently working on making
swap control available, and I expect Tejun to want to push buffered IO
control into the tree before that as well.

But once we have these things in place and are reasonably sure that
the existing use cases can be fully supported by the new interface,
I'm all for nudging people in its direction.  Yes, we should at some
point try to drop the old one, which is going to reduce the code size
significantly, but even before then we can work toward separating out
the old-only parts of the implementation as much as possible to reduce
development overhead from the new/main code.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [patch 2/2] mm: memcontrol: default hierarchy interface for memory
  2015-01-09  4:15 ` [patch 2/2] mm: memcontrol: default hierarchy interface for memory Johannes Weiner
  2015-01-12 23:37   ` Andrew Morton
@ 2015-01-13 23:20   ` Greg Thelen
  2015-01-14 16:01     ` Johannes Weiner
  2015-01-14 14:28   ` Vladimir Davydov
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 20+ messages in thread
From: Greg Thelen @ 2015-01-13 23:20 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, Vladimir Davydov, linux-mm, cgroups,
	linux-kernel


On Thu, Jan 08 2015, Johannes Weiner wrote:

> Introduce the basic control files to account, partition, and limit
> memory using cgroups in default hierarchy mode.
>
> This interface versioning allows us to address fundamental design
> issues in the existing memory cgroup interface, further explained
> below.  The old interface will be maintained indefinitely, but a
> clearer model and improved workload performance should encourage
> existing users to switch over to the new one eventually.
>
> The control files are thus:
>
>   - memory.current shows the current consumption of the cgroup and its
>     descendants, in bytes.
>
>   - memory.low configures the lower end of the cgroup's expected
>     memory consumption range.  The kernel considers memory below that
>     boundary to be a reserve - the minimum that the workload needs in
>     order to make forward progress - and generally avoids reclaiming
>     it, unless there is an imminent risk of entering an OOM situation.

So this is a try-hard, but no-promises interface.  No complaints.  But I
assume that an eventual extension is a more rigid memory.min which
specifies a minimum working set under which a container would prefer an
oom kill to thrashing.


* Re: [patch 2/2] mm: memcontrol: default hierarchy interface for memory
  2015-01-13 23:20   ` Greg Thelen
@ 2015-01-14 16:01     ` Johannes Weiner
  0 siblings, 0 replies; 20+ messages in thread
From: Johannes Weiner @ 2015-01-14 16:01 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, Michal Hocko, Vladimir Davydov, linux-mm, cgroups,
	linux-kernel

On Tue, Jan 13, 2015 at 03:20:08PM -0800, Greg Thelen wrote:
> 
> On Thu, Jan 08 2015, Johannes Weiner wrote:
> 
> > Introduce the basic control files to account, partition, and limit
> > memory using cgroups in default hierarchy mode.
> >
> > This interface versioning allows us to address fundamental design
> > issues in the existing memory cgroup interface, further explained
> > below.  The old interface will be maintained indefinitely, but a
> > clearer model and improved workload performance should encourage
> > existing users to switch over to the new one eventually.
> >
> > The control files are thus:
> >
> >   - memory.current shows the current consumption of the cgroup and its
> >     descendants, in bytes.
> >
> >   - memory.low configures the lower end of the cgroup's expected
> >     memory consumption range.  The kernel considers memory below that
> >     boundary to be a reserve - the minimum that the workload needs in
> >     order to make forward progress - and generally avoids reclaiming
> >     it, unless there is an imminent risk of entering an OOM situation.
> 
> So this is a try-hard, but no-promises interface.  No complaints.  But I
> assume that an eventual extension is a more rigid memory.min which
> specifies a minimum working set under which a container would prefer an
> oom kill to thrashing.

Yes, memory.min would nicely complement memory.max and I wouldn't be
opposed to adding it.  However, that does require at least some level
of cgroup-awareness in the global OOM killer in order to route kills
meaningfully according to cgroup configuration, which is mainly why I
deferred it in this patch.


* Re: [patch 2/2] mm: memcontrol: default hierarchy interface for memory
  2015-01-09  4:15 ` [patch 2/2] mm: memcontrol: default hierarchy interface for memory Johannes Weiner
  2015-01-12 23:37   ` Andrew Morton
  2015-01-13 23:20   ` Greg Thelen
@ 2015-01-14 14:28   ` Vladimir Davydov
  2015-01-14 15:34   ` Michal Hocko
  2015-01-14 16:17   ` Michal Hocko
  4 siblings, 0 replies; 20+ messages in thread
From: Vladimir Davydov @ 2015-01-14 14:28 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, Greg Thelen, linux-mm, cgroups,
	linux-kernel

On Thu, Jan 08, 2015 at 11:15:04PM -0500, Johannes Weiner wrote:
>   - memory.low configures the lower end of the cgroup's expected
>     memory consumption range.  The kernel considers memory below that
>     boundary to be a reserve - the minimum that the workload needs in
>     order to make forward progress - and generally avoids reclaiming
>     it, unless there is an imminent risk of entering an OOM situation.

AFAICS, if a cgroup cannot be shrunk back to its low limit (e.g.
because it consumes anon memory, and there's no swap), it will get away
with it. Is that considered to be a problem? Are there any plans to fix
it, e.g. by invoking OOM-killer in a cgroup that is above its low limit
if we fail to reclaim from it?

Thanks,
Vladimir


* Re: [patch 2/2] mm: memcontrol: default hierarchy interface for memory
  2015-01-09  4:15 ` [patch 2/2] mm: memcontrol: default hierarchy interface for memory Johannes Weiner
                     ` (2 preceding siblings ...)
  2015-01-14 14:28   ` Vladimir Davydov
@ 2015-01-14 15:34   ` Michal Hocko
  2015-01-14 17:19     ` Johannes Weiner
  2015-01-14 16:17   ` Michal Hocko
  4 siblings, 1 reply; 20+ messages in thread
From: Michal Hocko @ 2015-01-14 15:34 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Vladimir Davydov, Greg Thelen, linux-mm, cgroups,
	linux-kernel

On Thu 08-01-15 23:15:04, Johannes Weiner wrote:
[...]
> @@ -2353,6 +2353,22 @@ done_restock:
>  	css_get_many(&memcg->css, batch);
>  	if (batch > nr_pages)
>  		refill_stock(memcg, batch - nr_pages);
> +	/*
> +	 * If the hierarchy is above the normal consumption range,
> +	 * make the charging task trim the excess.
> +	 */
> +	do {
> +		unsigned long nr_pages = page_counter_read(&memcg->memory);
> +		unsigned long high = ACCESS_ONCE(memcg->high);
> +
> +		if (nr_pages > high) {
> +			mem_cgroup_events(memcg, MEMCG_HIGH, 1);
> +
> +			try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
> +						     gfp_mask, true);
> +		}

As I've said before I am not happy about this. Heavy parallel load
hitting on the limit can generate really high reclaim targets causing
over reclaim and long stalls. Moreover portions of the excess would be
reclaimed multiple times which is not necessary.

I am not entirely happy about reclaiming nr_pages for THP_PAGES already
and this might be much worse, more probable and less predictable.

I believe the target should be capped somehow. nr_pages sounds like a
compromise. It would still throttle the charger and scale much better.
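
A minimal sketch of the capping suggested above, assuming the reclaim
target becomes the smaller of the excess over the high boundary and the
size of the current charge. The helper name high_reclaim_target() and
its parameters are hypothetical and do not appear in the posted patch.

```c
/*
 * Hypothetical helper, not part of the posted patch: pick how many
 * pages a charging task should try to reclaim when the group is
 * above its high boundary.
 *
 * usage      - pages currently charged to the group
 * high       - the group's high boundary, in pages
 * nr_charged - pages added by the charge that triggered the check
 */
static unsigned long high_reclaim_target(unsigned long usage,
					 unsigned long high,
					 unsigned long nr_charged)
{
	unsigned long excess;

	/* Within the normal consumption range: nothing to trim. */
	if (usage <= high)
		return 0;

	excess = usage - high;

	/* Cap the target so one charger never chases the whole excess. */
	return excess < nr_charged ? excess : nr_charged;
}
```

With usage 1000, high 900 and a 32-page charge, this asks for 32 pages
rather than 100, so each concurrent charger would trim at most what it
added instead of all of them targeting the same excess.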

> +
> +	} while ((memcg = parent_mem_cgroup(memcg)));
>  done:
>  	return ret;
>  }
[...]
> +static int memory_events_show(struct seq_file *m, void *v)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
> +
> +	seq_printf(m, "low %lu\n", mem_cgroup_read_events(memcg, MEMCG_LOW));
> +	seq_printf(m, "high %lu\n", mem_cgroup_read_events(memcg, MEMCG_HIGH));
> +	seq_printf(m, "max %lu\n", mem_cgroup_read_events(memcg, MEMCG_MAX));
> +	seq_printf(m, "oom %lu\n", mem_cgroup_read_events(memcg, MEMCG_OOM));
> +
> +	return 0;
> +}

OK, but I really think we need a way of OOM notification for user space
OOM handling as well - including blocking the OOM killer as we have
now.  This is not directly related to this patch so it doesn't have to
happen right now, we should just think about the proper interface if
oom_control is considered not suitable.

[...]
> @@ -2322,6 +2325,12 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
>  			struct lruvec *lruvec;
>  			int swappiness;
>  
> +			if (mem_cgroup_low(root, memcg)) {
> +				if (!sc->may_thrash)
> +					continue;
> +				mem_cgroup_events(memcg, MEMCG_LOW, 1);
> +			}
> +
>  			lruvec = mem_cgroup_zone_lruvec(zone, memcg);
>  			swappiness = mem_cgroup_swappiness(memcg);
>  
> @@ -2343,8 +2352,7 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
>  				mem_cgroup_iter_break(root, memcg);
>  				break;
>  			}
> -			memcg = mem_cgroup_iter(root, memcg, &reclaim);
> -		} while (memcg);
> +		} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));

I had similar code but then I could trigger quick priority drop downs
during parallel reclaim with multiple low limited groups. I've tried to
address that by retrying shrink_zone if it hasn't called shrink_lruvec
at all. Still not ideal because it can livelock theoretically, but I
haven't seen that in my testing.

The following is a quick rebase (not tested yet). I can send it as a
separate patch with a full changelog if you prefer that over merging it
into yours but I think we need it in some form:

diff --git a/mm/vmscan.c b/mm/vmscan.c
index cd4cbc045b79..6cf7d4293976 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2306,6 +2306,7 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
 {
 	unsigned long nr_reclaimed, nr_scanned;
 	bool reclaimable = false;
+	int scanned_memcgs = 0;
 
 	do {
 		struct mem_cgroup *root = sc->target_mem_cgroup;
@@ -2331,6 +2332,7 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
 				mem_cgroup_events(memcg, MEMCG_LOW, 1);
 			}
 
+			scanned_memcgs++;
 			lruvec = mem_cgroup_zone_lruvec(zone, memcg);
 			swappiness = mem_cgroup_swappiness(memcg);
 
@@ -2355,6 +2357,14 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
 		} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
 
 		/*
+		 * reclaimers are racing and only saw low limited memcgs
+		 * so we have to retry the memcg walk.
+		 */
+		if (!scanned_memcgs && (global_reclaim(sc) ||
+					!mem_cgroup_low(root, root)))
+			continue;
+
+		/*
 		 * Shrink the slab caches in the same proportion that
 		 * the eligible LRU pages were scanned.
 		 */
@@ -2380,8 +2390,11 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
 		if (sc->nr_reclaimed - nr_reclaimed)
 			reclaimable = true;
 
-	} while (should_continue_reclaim(zone, sc->nr_reclaimed - nr_reclaimed,
-					 sc->nr_scanned - nr_scanned, sc));
+		if (!should_continue_reclaim(zone,
+					sc->nr_reclaimed - nr_reclaimed,
+					sc->nr_scanned - nr_scanned, sc))
+			break;
+	} while (true);
 
 	return reclaimable;
 }

[...]

Other than that the patch looks OK and I am happy this has moved
forward finally.
-- 
Michal Hocko
SUSE Labs


* Re: [patch 2/2] mm: memcontrol: default hierarchy interface for memory
  2015-01-14 15:34   ` Michal Hocko
@ 2015-01-14 17:19     ` Johannes Weiner
  2015-01-15 17:08       ` Michal Hocko
  0 siblings, 1 reply; 20+ messages in thread
From: Johannes Weiner @ 2015-01-14 17:19 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Vladimir Davydov, Greg Thelen, linux-mm, cgroups,
	linux-kernel

On Wed, Jan 14, 2015 at 04:34:25PM +0100, Michal Hocko wrote:
> On Thu 08-01-15 23:15:04, Johannes Weiner wrote:
> [...]
> > @@ -2353,6 +2353,22 @@ done_restock:
> >  	css_get_many(&memcg->css, batch);
> >  	if (batch > nr_pages)
> >  		refill_stock(memcg, batch - nr_pages);
> > +	/*
> > +	 * If the hierarchy is above the normal consumption range,
> > +	 * make the charging task trim the excess.
> > +	 */
> > +	do {
> > +		unsigned long nr_pages = page_counter_read(&memcg->memory);
> > +		unsigned long high = ACCESS_ONCE(memcg->high);
> > +
> > +		if (nr_pages > high) {
> > +			mem_cgroup_events(memcg, MEMCG_HIGH, 1);
> > +
> > +			try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
> > +						     gfp_mask, true);
> > +		}
> 
> As I've said before I am not happy about this. Heavy parallel load
> hitting on the limit can generate really high reclaim targets causing
> over reclaim and long stalls. Moreover portions of the excess would be
> reclaimed multiple times which is not necessary.
>
> I am not entirely happy about reclaiming nr_pages for THP_PAGES already
> and this might be much worse, more probable and less predictable.
> 
> I believe the target should be capped somehow. nr_pages sounds like a
> compromise. It would still throttle the charger and scale much better.

That's fair enough, I'll experiment with this.

> > +static int memory_events_show(struct seq_file *m, void *v)
> > +{
> > +	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
> > +
> > +	seq_printf(m, "low %lu\n", mem_cgroup_read_events(memcg, MEMCG_LOW));
> > +	seq_printf(m, "high %lu\n", mem_cgroup_read_events(memcg, MEMCG_HIGH));
> > +	seq_printf(m, "max %lu\n", mem_cgroup_read_events(memcg, MEMCG_MAX));
> > +	seq_printf(m, "oom %lu\n", mem_cgroup_read_events(memcg, MEMCG_OOM));
> > +
> > +	return 0;
> > +}
> 
> OK, but I really think we need a way of OOM notification for user space
> OOM handling as well - including blocking the OOM killer as we have
> now.  This is not directly related to this patch so it doesn't have to
> happen right now, we should just think about the proper interface if
> oom_control is considered not suitable.

Yes, I think OOM control should be a separate discussion.

> > @@ -2322,6 +2325,12 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
> >  			struct lruvec *lruvec;
> >  			int swappiness;
> >  
> > +			if (mem_cgroup_low(root, memcg)) {
> > +				if (!sc->may_thrash)
> > +					continue;
> > +				mem_cgroup_events(memcg, MEMCG_LOW, 1);
> > +			}
> > +
> >  			lruvec = mem_cgroup_zone_lruvec(zone, memcg);
> >  			swappiness = mem_cgroup_swappiness(memcg);
> >  
> > @@ -2343,8 +2352,7 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
> >  				mem_cgroup_iter_break(root, memcg);
> >  				break;
> >  			}
> > -			memcg = mem_cgroup_iter(root, memcg, &reclaim);
> > -		} while (memcg);
> > +		} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
> 
> I had similar code but then I could trigger quick priority drop downs
> during parallel reclaim with multiple low limited groups. I've tried to
> address that by retrying shrink_zone if it hasn't called shrink_lruvec
> at all. Still not ideal because it can livelock theoretically, but I
> haven't seen that in my testing.

Do you remember the circumstances and the exact configuration?

I tested this with around 30 containerized kernel build jobs whose low
boundaries pretty much added up to the available physical memory and
never observed this.  That being said, thrashing is an emergency path
and users should really watch the memory.events low counter.  After
all, if global reclaim frequently has to ignore the reserve settings,
what's the point of having them in the first place?
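
As a rough illustration of watching that counter from userspace, the
sketch below pulls one value out of text laid out the way
memory_events_show() in the quoted patch prints it ("low N", "high N",
"max N", "oom N", one per line). The function name events_lookup() is
hypothetical, and reading the file itself is left to the caller.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: find "<key> <number>" at the start of a line in
 * text and store the number in *out.
 * Returns 0 on success, -1 if the key is missing or has no number.
 */
static int events_lookup(const char *text, const char *key,
			 unsigned long *out)
{
	size_t klen = strlen(key);
	const char *line = text;

	while (line && *line) {
		/* Match the whole key, followed by the separating space. */
		if (strncmp(line, key, klen) == 0 && line[klen] == ' ')
			return sscanf(line + klen + 1, "%lu", out) == 1 ? 0 : -1;

		/* Advance to the character after the next newline, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}
```

A monitoring tool could call this periodically on the file contents and
raise an alert when the "low" value keeps growing, since that would mean
reclaim is repeatedly ignoring the reserve settings.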

So while I see that this might burn some cpu cycles when the system is
misconfigured, and that we could definitely be smarter about this, I'm
not convinced we have to rush a workaround before moving ahead with
this patch, especially not one that is prone to livelock the system.

> Other than that the patch looks OK and I am happy this has moved
> forward finally.

Thanks! I'm glad we're getting somewhere as well.


* Re: [patch 2/2] mm: memcontrol: default hierarchy interface for memory
  2015-01-14 17:19     ` Johannes Weiner
@ 2015-01-15 17:08       ` Michal Hocko
  0 siblings, 0 replies; 20+ messages in thread
From: Michal Hocko @ 2015-01-15 17:08 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Vladimir Davydov, Greg Thelen, linux-mm, cgroups,
	linux-kernel

On Wed 14-01-15 12:19:44, Johannes Weiner wrote:
> On Wed, Jan 14, 2015 at 04:34:25PM +0100, Michal Hocko wrote:
> > On Thu 08-01-15 23:15:04, Johannes Weiner wrote:
[...]
> > > @@ -2322,6 +2325,12 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
> > >  			struct lruvec *lruvec;
> > >  			int swappiness;
> > >  
> > > +			if (mem_cgroup_low(root, memcg)) {
> > > +				if (!sc->may_thrash)
> > > +					continue;
> > > +				mem_cgroup_events(memcg, MEMCG_LOW, 1);
> > > +			}
> > > +
> > >  			lruvec = mem_cgroup_zone_lruvec(zone, memcg);
> > >  			swappiness = mem_cgroup_swappiness(memcg);
> > >  
> > > @@ -2343,8 +2352,7 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
> > >  				mem_cgroup_iter_break(root, memcg);
> > >  				break;
> > >  			}
> > > -			memcg = mem_cgroup_iter(root, memcg, &reclaim);
> > > -		} while (memcg);
> > > +		} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
> > 
> > I had similar code but then I could trigger quick priority drop downs
> > during parallel reclaim with multiple low limited groups. I've tried to
> > address that by retrying shrink_zone if it hasn't called shrink_lruvec
> > at all. Still not ideal because it can livelock theoretically, but I
> > haven't seen that in my testing.
> 
> Do you remember the circumstances and the exact configuration?

Well, I was testing a heavy parallel memory-intensive load (a combination
of anon and file) in one memcg, with many (hundreds of) idle memcgs, to
see how much overhead the memcg traversal would cost us. And I
misconfigured it by setting the idle memcgs' low limit to -1 instead of
0. There was nothing running in them.
I noticed more pages reclaimed than expected and also a higher runtime,
which turned out to be related to longer stalls during reclaim rather
than the cost of the memcg reclaim iterator.  Debugging showed that many
direct reclaimers were racing over low-limited groups and dropped to
lower priorities. The race window was apparently much, much smaller than
a noop shrink_lruvec run.

So in a sense this was a misconfigured system, because I do not expect
so many low-limited groups in real life, but there was still something
reclaimable so the machine wasn't really over-committed. So this points
to an issue which might happen, albeit on a smaller scale, if there are
many groups, heavy reclaim and some reclaimers unlucky enough to race
and see only low-limited groups.
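
The situation can be modelled in a few lines, as a sketch under
simplifying assumptions: each reclaimer sees only a slice of the group
list (standing in for reclaimers that split the walk between them), and
a group counts as protected while its usage stays at or below a nonzero
low setting. struct group_model, group_is_low() and scannable_groups()
are hypothetical names, not kernel code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for a memcg; not the kernel's struct mem_cgroup. */
struct group_model {
	unsigned long usage;	/* pages currently charged */
	unsigned long low;	/* pages protected by the low setting */
};

/* Assumption: a group is protected while usage <= a nonzero low. */
static bool group_is_low(const struct group_model *g)
{
	return g->low > 0 && g->usage <= g->low;
}

/*
 * Count the groups in [start, end) that one reclaimer would actually
 * scan. A result of 0 models a walk that skipped every group it saw,
 * so the pass reclaimed nothing even though other groups may still
 * hold reclaimable memory.
 */
static size_t scannable_groups(const struct group_model *groups,
			       size_t start, size_t end, bool may_thrash)
{
	size_t scanned = 0;

	for (size_t i = start; i < end; i++) {
		if (group_is_low(&groups[i]) && !may_thrash)
			continue;	/* protected: skipped on a normal pass */
		scanned++;
	}
	return scanned;
}
```

In this model, a reclaimer whose slice holds only idle groups with an
unlimited low setting gets 0 back and would move on to a lower priority,
while a reclaimer that also sees the busy group gets a nonzero count.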

> I tested this with around 30 containerized kernel build jobs whose low
> boundaries pretty much added up to the available physical memory and
> never observed this.  That being said, thrashing is an emergency path
> and users should really watch the memory.events low counter.  After
> all, if global reclaim frequently has to ignore the reserve settings,
> what's the point of having them in the first place?

Sure, over-committed low limit is a misconfiguration. But this is not
what happened in my testing.

> So while I see that this might burn some cpu cycles when the system is
> misconfigured, and that we could definitely be smarter about this, I'm
> not convinced we have to rush a workaround before moving ahead with
> this patch, especially not one that is prone to livelock the system.

OK, then do not merge it into the original patch, if for nothing else
than for bisectability. I will post a patch separately. I still think we
should consider a way to address it sooner or later because the result
would be non-trivial to debug.

-- 
Michal Hocko
SUSE Labs


* Re: [patch 2/2] mm: memcontrol: default hierarchy interface for memory
  2015-01-09  4:15 ` [patch 2/2] mm: memcontrol: default hierarchy interface for memory Johannes Weiner
                     ` (3 preceding siblings ...)
  2015-01-14 15:34   ` Michal Hocko
@ 2015-01-14 16:17   ` Michal Hocko
  4 siblings, 0 replies; 20+ messages in thread
From: Michal Hocko @ 2015-01-14 16:17 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Vladimir Davydov, Greg Thelen, linux-mm, cgroups,
	linux-kernel

I have overlooked the `none' setting...

On Thu 08-01-15 23:15:04, Johannes Weiner wrote:
[...]
> +static int memory_low_show(struct seq_file *m, void *v)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
> +	unsigned long low = ACCESS_ONCE(memcg->low);
> +
> +	if (low == 0)
> +		seq_printf(m, "none\n");
> +	else
> +		seq_printf(m, "%llu\n", (u64)low * PAGE_SIZE);
> +
> +	return 0;
> +}

This is really confusing. What if somebody wants to protect a group
from being reclaimed? One possible and natural way would be copying the
memory.max value, but then `none' means something else completely.

Besides that, why call 0, which has a clear meaning, by any other name?

Now that I think about the naming, `none' doesn't sound that great for
max or high either, if for nothing else than for the above copy example
(who knows what shows up later). Sure, a huge number is bad as well, for
the reasons you have mentioned in another email. `resource_max' sounds
like a better fit to me. But I am lame at naming.
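
One way to picture the alternative is sketched below, assuming that 0 is
always printed as a plain number and that only the "no limit" value gets
a symbolic name. The token "max", the sentinel SKETCH_LIMIT_MAX, the
fixed SKETCH_PAGE_SIZE and the function format_limit() are all
assumptions made for illustration, not part of the quoted patch.

```c
#include <limits.h>
#include <stdio.h>

#define SKETCH_PAGE_SIZE 4096UL		/* assumed page size */
#define SKETCH_LIMIT_MAX ULONG_MAX	/* hypothetical "no limit" value */

/*
 * Hypothetical formatter: write a limit given in pages into buf as a
 * byte count, or as a symbolic token when the limit is unlimited.
 * Returns what snprintf() returns.
 */
static int format_limit(char *buf, size_t len, unsigned long pages)
{
	/* Only the unlimited value gets a name; 0 stays "0". */
	if (pages == SKETCH_LIMIT_MAX)
		return snprintf(buf, len, "max\n");

	return snprintf(buf, len, "%llu\n",
			(unsigned long long)pages * SKETCH_PAGE_SIZE);
}
```

With the same token used for every unlimited setting, a value read from
one control file could be written to another without changing meaning,
which is the copy example above.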
-- 
Michal Hocko
SUSE Labs


end of thread, other threads:[~2015-02-23 14:28 UTC | newest]

Thread overview: 20+ messages
2015-01-20 15:31 [patch 0/2] mm: memcontrol: default hierarchy interface for memory v2 Johannes Weiner
2015-01-20 15:31 ` [patch 1/2] mm: page_counter: pull "-1" handling out of page_counter_memparse() Johannes Weiner
2015-01-20 16:04   ` Michal Hocko
2015-01-20 15:31 ` [patch 2/2] mm: memcontrol: default hierarchy interface for memory Johannes Weiner
2015-01-20 16:31   ` Michal Hocko
2015-02-23 11:13   ` Sasha Levin
2015-02-23 14:28     ` Michal Hocko
2015-01-20 16:57 ` [patch 0/2] mm: memcontrol: default hierarchy interface for memory v2 Michal Hocko
  -- strict thread matches above, loose matches on Subject: below --
2015-01-09  4:15 [patch 1/2] mm: page_counter: pull "-1" handling out of page_counter_memparse() Johannes Weiner
2015-01-09  4:15 ` [patch 2/2] mm: memcontrol: default hierarchy interface for memory Johannes Weiner
2015-01-12 23:37   ` Andrew Morton
2015-01-13 15:50     ` Johannes Weiner
2015-01-13 20:52       ` Andrew Morton
2015-01-13 21:44         ` Johannes Weiner
2015-01-13 23:20   ` Greg Thelen
2015-01-14 16:01     ` Johannes Weiner
2015-01-14 14:28   ` Vladimir Davydov
2015-01-14 15:34   ` Michal Hocko
2015-01-14 17:19     ` Johannes Weiner
2015-01-15 17:08       ` Michal Hocko
2015-01-14 16:17   ` Michal Hocko
