* [PATCH 1/4] memcg: fix over-high reclaim amount
2015-08-28 15:25 ` Tejun Heo
@ 2015-08-28 15:25 ` Tejun Heo
-1 siblings, 0 replies; 66+ messages in thread
From: Tejun Heo @ 2015-08-28 15:25 UTC (permalink / raw)
To: hannes, mhocko; +Cc: cgroups, linux-mm, vdavydov, kernel-team, Tejun Heo
When memory usage is over the high limit, try_charge() performs direct
reclaim; however, it uses the current charging amount @nr_pages as the
reclamation target which is incorrect as we want to reclaim down to
the high limit. In practice, this doesn't matter all that much
because the minimum target pages that try_to_free_mem_cgroup_pages()
uses is SWAP_CLUSTER_MAX which is rather large.
Fix it by setting the target number of pages to the difference between
the current usage and the high limit.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
mm/memcontrol.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index aacc767..18ecf75 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2078,10 +2078,13 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
* make the charging task trim their excess contribution.
*/
do {
- if (page_counter_read(&memcg->memory) <= memcg->high)
+ unsigned long usage = page_counter_read(&memcg->memory);
+ unsigned long high = ACCESS_ONCE(memcg->high);
+
+ if (usage <= high)
continue;
mem_cgroup_events(memcg, MEMCG_HIGH, 1);
- try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true);
+ try_to_free_mem_cgroup_pages(memcg, high - usage, gfp_mask, true);
} while ((memcg = parent_mem_cgroup(memcg)));
done:
return ret;
--
2.4.3
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related [flat|nested] 66+ messages in thread
* Re: [PATCH 1/4] memcg: fix over-high reclaim amount
2015-08-28 15:25 ` Tejun Heo
(?)
@ 2015-08-28 17:06 ` Michal Hocko
[not found] ` <20150828170612.GA21463-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
-1 siblings, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2015-08-28 17:06 UTC (permalink / raw)
To: Tejun Heo; +Cc: hannes, cgroups, linux-mm, vdavydov, kernel-team
On Fri 28-08-15 11:25:27, Tejun Heo wrote:
> When memory usage is over the high limit, try_charge() performs direct
> reclaim; however, it uses the current charging amount @nr_pages as the
> reclamation target which is incorrect as we want to reclaim down to
> the high limit. In practice, this doesn't matter all that much
> because the minimum target pages that try_to_free_mem_cgroup_pages()
> uses is SWAP_CLUSTER_MAX which is rather large.
>
> Fix it by setting the target number of pages to the difference between
> the current usage and the high limit.
I do not think this is better behavior. If you have parallel charges to
the same memcg then you can easily over-reclaim because everybody
will reclaim the maximum rather than their own contribution.
Sure, we can fail to reclaim the target and slowly grow over the high limit
but that is to be expected. This is not the max limit which cannot be
breached and external memory pressure/reclaim is there to mitigate that.
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
> mm/memcontrol.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index aacc767..18ecf75 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2078,10 +2078,13 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> * make the charging task trim their excess contribution.
> */
> do {
> - if (page_counter_read(&memcg->memory) <= memcg->high)
> + unsigned long usage = page_counter_read(&memcg->memory);
> + unsigned long high = ACCESS_ONCE(memcg->high);
> +
> + if (usage <= high)
> continue;
> mem_cgroup_events(memcg, MEMCG_HIGH, 1);
> - try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true);
> + try_to_free_mem_cgroup_pages(memcg, high - usage, gfp_mask, true);
> } while ((memcg = parent_mem_cgroup(memcg)));
> done:
> return ret;
> --
> 2.4.3
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 66+ messages in thread
* [PATCH 2/4] memcg: flatten task_struct->memcg_oom
2015-08-28 15:25 ` Tejun Heo
@ 2015-08-28 15:25 ` Tejun Heo
-1 siblings, 0 replies; 66+ messages in thread
From: Tejun Heo @ 2015-08-28 15:25 UTC (permalink / raw)
To: hannes, mhocko; +Cc: cgroups, linux-mm, vdavydov, kernel-team, Tejun Heo
task_struct->memcg_oom is a sub-struct containing fields which are
used for async memcg oom handling. Most task_struct fields aren't
packaged this way and it can lead to unnecessary alignment paddings.
This patch flattens it.
* task.memcg_oom.memcg -> task.memcg_in_oom
* task.memcg_oom.gfp_mask -> task.memcg_oom_gfp_mask
* task.memcg_oom.order -> task.memcg_oom_order
* task.memcg_oom.may_oom -> task.memcg_may_oom
In addition, task.memcg_may_oom is relocated to where other bitfields
are which reduces the size of task_struct.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
include/linux/memcontrol.h | 10 +++++-----
include/linux/sched.h | 13 ++++++-------
mm/memcontrol.c | 16 ++++++++--------
3 files changed, 19 insertions(+), 20 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index ad800e6..3d28656 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -407,19 +407,19 @@ void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
static inline void mem_cgroup_oom_enable(void)
{
- WARN_ON(current->memcg_oom.may_oom);
- current->memcg_oom.may_oom = 1;
+ WARN_ON(current->memcg_may_oom);
+ current->memcg_may_oom = 1;
}
static inline void mem_cgroup_oom_disable(void)
{
- WARN_ON(!current->memcg_oom.may_oom);
- current->memcg_oom.may_oom = 0;
+ WARN_ON(!current->memcg_may_oom);
+ current->memcg_may_oom = 0;
}
static inline bool task_in_memcg_oom(struct task_struct *p)
{
- return p->memcg_oom.memcg;
+ return p->memcg_in_oom;
}
bool mem_cgroup_oom_synchronize(bool wait);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a4ab9da..ef73b54 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1451,7 +1451,9 @@ struct task_struct {
unsigned sched_reset_on_fork:1;
unsigned sched_contributes_to_load:1;
unsigned sched_migrated:1;
-
+#ifdef CONFIG_MEMCG
+ unsigned memcg_may_oom:1;
+#endif
#ifdef CONFIG_MEMCG_KMEM
unsigned memcg_kmem_skip_account:1;
#endif
@@ -1782,12 +1784,9 @@ struct task_struct {
unsigned long trace_recursion;
#endif /* CONFIG_TRACING */
#ifdef CONFIG_MEMCG
- struct memcg_oom_info {
- struct mem_cgroup *memcg;
- gfp_t gfp_mask;
- int order;
- unsigned int may_oom:1;
- } memcg_oom;
+ struct mem_cgroup *memcg_in_oom;
+ gfp_t memcg_oom_gfp_mask;
+ int memcg_oom_order;
#endif
#ifdef CONFIG_UPROBES
struct uprobe_task *utask;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 18ecf75..74abb31 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1652,7 +1652,7 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
{
- if (!current->memcg_oom.may_oom)
+ if (!current->memcg_may_oom)
return;
/*
* We are in the middle of the charge context here, so we
@@ -1669,9 +1669,9 @@ static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
* and when we know whether the fault was overall successful.
*/
css_get(&memcg->css);
- current->memcg_oom.memcg = memcg;
- current->memcg_oom.gfp_mask = mask;
- current->memcg_oom.order = order;
+ current->memcg_in_oom = memcg;
+ current->memcg_oom_gfp_mask = mask;
+ current->memcg_oom_order = order;
}
/**
@@ -1693,7 +1693,7 @@ static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
*/
bool mem_cgroup_oom_synchronize(bool handle)
{
- struct mem_cgroup *memcg = current->memcg_oom.memcg;
+ struct mem_cgroup *memcg = current->memcg_in_oom;
struct oom_wait_info owait;
bool locked;
@@ -1721,8 +1721,8 @@ bool mem_cgroup_oom_synchronize(bool handle)
if (locked && !memcg->oom_kill_disable) {
mem_cgroup_unmark_under_oom(memcg);
finish_wait(&memcg_oom_waitq, &owait.wait);
- mem_cgroup_out_of_memory(memcg, current->memcg_oom.gfp_mask,
- current->memcg_oom.order);
+ mem_cgroup_out_of_memory(memcg, current->memcg_oom_gfp_mask,
+ current->memcg_oom_order);
} else {
schedule();
mem_cgroup_unmark_under_oom(memcg);
@@ -1739,7 +1739,7 @@ bool mem_cgroup_oom_synchronize(bool handle)
memcg_oom_recover(memcg);
}
cleanup:
- current->memcg_oom.memcg = NULL;
+ current->memcg_in_oom = NULL;
css_put(&memcg->css);
return true;
}
--
2.4.3
^ permalink raw reply related [flat|nested] 66+ messages in thread
* Re: [PATCH 2/4] memcg: flatten task_struct->memcg_oom
2015-08-28 15:25 ` Tejun Heo
@ 2015-08-28 17:11 ` Michal Hocko
-1 siblings, 0 replies; 66+ messages in thread
From: Michal Hocko @ 2015-08-28 17:11 UTC (permalink / raw)
To: Tejun Heo; +Cc: hannes, cgroups, linux-mm, vdavydov, kernel-team
On Fri 28-08-15 11:25:28, Tejun Heo wrote:
> task_struct->memcg_oom is a sub-struct containing fields which are
> used for async memcg oom handling. Most task_struct fields aren't
> packaged this way and it can lead to unnecessary alignment paddings.
> This patch flattens it.
>
> * task.memcg_oom.memcg -> task.memcg_in_oom
> * task.memcg_oom.gfp_mask -> task.memcg_oom_gfp_mask
> * task.memcg_oom.order -> task.memcg_oom_order
> * task.memcg_oom.may_oom -> task.memcg_may_oom
>
> In addition, task.memcg_may_oom is relocated to where other bitfields
> are which reduces the size of task_struct.
OK we will save 8B AFAICS which probably doesn't make much difference for
this huge structure. But we already have the memcg_kmem_skip_account bit
field there so another one makes sense. That alone would be sufficient
to save those bytes. Regarding the struct, I do not have a strong
opinion. I do not mind removing it.
> Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
> ---
> include/linux/memcontrol.h | 10 +++++-----
> include/linux/sched.h | 13 ++++++-------
> mm/memcontrol.c | 16 ++++++++--------
> 3 files changed, 19 insertions(+), 20 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index ad800e6..3d28656 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -407,19 +407,19 @@ void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
>
> static inline void mem_cgroup_oom_enable(void)
> {
> - WARN_ON(current->memcg_oom.may_oom);
> - current->memcg_oom.may_oom = 1;
> + WARN_ON(current->memcg_may_oom);
> + current->memcg_may_oom = 1;
> }
>
> static inline void mem_cgroup_oom_disable(void)
> {
> - WARN_ON(!current->memcg_oom.may_oom);
> - current->memcg_oom.may_oom = 0;
> + WARN_ON(!current->memcg_may_oom);
> + current->memcg_may_oom = 0;
> }
>
> static inline bool task_in_memcg_oom(struct task_struct *p)
> {
> - return p->memcg_oom.memcg;
> + return p->memcg_in_oom;
> }
>
> bool mem_cgroup_oom_synchronize(bool wait);
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index a4ab9da..ef73b54 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1451,7 +1451,9 @@ struct task_struct {
> unsigned sched_reset_on_fork:1;
> unsigned sched_contributes_to_load:1;
> unsigned sched_migrated:1;
> -
> +#ifdef CONFIG_MEMCG
> + unsigned memcg_may_oom:1;
> +#endif
> #ifdef CONFIG_MEMCG_KMEM
> unsigned memcg_kmem_skip_account:1;
> #endif
> @@ -1782,12 +1784,9 @@ struct task_struct {
> unsigned long trace_recursion;
> #endif /* CONFIG_TRACING */
> #ifdef CONFIG_MEMCG
> - struct memcg_oom_info {
> - struct mem_cgroup *memcg;
> - gfp_t gfp_mask;
> - int order;
> - unsigned int may_oom:1;
> - } memcg_oom;
> + struct mem_cgroup *memcg_in_oom;
> + gfp_t memcg_oom_gfp_mask;
> + int memcg_oom_order;
> #endif
> #ifdef CONFIG_UPROBES
> struct uprobe_task *utask;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 18ecf75..74abb31 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1652,7 +1652,7 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
>
> static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
> {
> - if (!current->memcg_oom.may_oom)
> + if (!current->memcg_may_oom)
> return;
> /*
> * We are in the middle of the charge context here, so we
> @@ -1669,9 +1669,9 @@ static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
> * and when we know whether the fault was overall successful.
> */
> css_get(&memcg->css);
> - current->memcg_oom.memcg = memcg;
> - current->memcg_oom.gfp_mask = mask;
> - current->memcg_oom.order = order;
> + current->memcg_in_oom = memcg;
> + current->memcg_oom_gfp_mask = mask;
> + current->memcg_oom_order = order;
> }
>
> /**
> @@ -1693,7 +1693,7 @@ static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
> */
> bool mem_cgroup_oom_synchronize(bool handle)
> {
> - struct mem_cgroup *memcg = current->memcg_oom.memcg;
> + struct mem_cgroup *memcg = current->memcg_in_oom;
> struct oom_wait_info owait;
> bool locked;
>
> @@ -1721,8 +1721,8 @@ bool mem_cgroup_oom_synchronize(bool handle)
> if (locked && !memcg->oom_kill_disable) {
> mem_cgroup_unmark_under_oom(memcg);
> finish_wait(&memcg_oom_waitq, &owait.wait);
> - mem_cgroup_out_of_memory(memcg, current->memcg_oom.gfp_mask,
> - current->memcg_oom.order);
> + mem_cgroup_out_of_memory(memcg, current->memcg_oom_gfp_mask,
> + current->memcg_oom_order);
> } else {
> schedule();
> mem_cgroup_unmark_under_oom(memcg);
> @@ -1739,7 +1739,7 @@ bool mem_cgroup_oom_synchronize(bool handle)
> memcg_oom_recover(memcg);
> }
> cleanup:
> - current->memcg_oom.memcg = NULL;
> + current->memcg_in_oom = NULL;
> css_put(&memcg->css);
> return true;
> }
> --
> 2.4.3
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 66+ messages in thread
* [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path
2015-08-28 15:25 ` Tejun Heo
@ 2015-08-28 15:25 ` Tejun Heo
-1 siblings, 0 replies; 66+ messages in thread
From: Tejun Heo @ 2015-08-28 15:25 UTC (permalink / raw)
To: hannes, mhocko; +Cc: cgroups, linux-mm, vdavydov, kernel-team, Tejun Heo
Currently, try_charge() tries to reclaim memory directly when the high
limit is breached; however, this has a couple issues.
* try_charge() can be invoked from any in-kernel allocation site and
reclaim path may use considerable amount of stack. This can lead to
stack overflows which are extremely difficult to reproduce.
* If the allocation doesn't have __GFP_WAIT, direct reclaim is
skipped. If a process performs only speculative allocations, it can
blow way past the high limit. This is actually easily reproducible
by simply doing "find /". VFS tries speculative !__GFP_WAIT
allocations first, so as long as there's memory which can be
consumed without blocking, it can keep allocating memory regardless
of the high limit.
This patch makes try_charge() always punt the direct reclaim to the
return-to-userland path. If try_charge() detects that high limit is
breached, it sets current->memcg_over_high to the offending memcg and
schedules execution of mem_cgroup_handle_over_high() which performs
the direct reclaim from the return-to-userland path.
As long as kernel doesn't have a run-away allocation spree, this
should provide enough protection while making kmemcg behave more
consistently.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
include/linux/memcontrol.h | 6 +++++
include/linux/sched.h | 1 +
include/linux/tracehook.h | 3 +++
mm/memcontrol.c | 66 +++++++++++++++++++++++++++++++++++++---------
4 files changed, 64 insertions(+), 12 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 3d28656..8d345a7 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -402,6 +402,8 @@ static inline int mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec)
return inactive * inactive_ratio < active;
}
+void mem_cgroup_handle_over_high(void);
+
void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
struct task_struct *p);
@@ -621,6 +623,10 @@ static inline void mem_cgroup_end_page_stat(struct mem_cgroup *memcg)
{
}
+static inline void mem_cgroup_handle_over_high(void)
+{
+}
+
static inline void mem_cgroup_oom_enable(void)
{
}
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ef73b54..c76b71d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1785,6 +1785,7 @@ struct task_struct {
#endif /* CONFIG_TRACING */
#ifdef CONFIG_MEMCG
struct mem_cgroup *memcg_in_oom;
+ struct mem_cgroup *memcg_over_high; /* reclaim on returning to user */
gfp_t memcg_oom_gfp_mask;
int memcg_oom_order;
#endif
diff --git a/include/linux/tracehook.h b/include/linux/tracehook.h
index 84d4972..26c1521 100644
--- a/include/linux/tracehook.h
+++ b/include/linux/tracehook.h
@@ -50,6 +50,7 @@
#include <linux/ptrace.h>
#include <linux/security.h>
#include <linux/task_work.h>
+#include <linux/memcontrol.h>
struct linux_binprm;
/*
@@ -188,6 +189,8 @@ static inline void tracehook_notify_resume(struct pt_regs *regs)
smp_mb__after_atomic();
if (unlikely(current->task_works))
task_work_run();
+
+ mem_cgroup_handle_over_high();
}
#endif /* <linux/tracehook.h> */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 74abb31..c94b686 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -62,6 +62,7 @@
#include <linux/oom.h>
#include <linux/lockdep.h>
#include <linux/file.h>
+#include <linux/tracehook.h>
#include "internal.h"
#include <net/sock.h>
#include <net/ip.h>
@@ -1963,6 +1964,33 @@ static int memcg_cpu_hotplug_callback(struct notifier_block *nb,
return NOTIFY_OK;
}
+/*
+ * Scheduled by try_charge() to be executed from the userland return path
+ * and reclaims memory over the high limit.
+ */
+void mem_cgroup_handle_over_high(void)
+{
+ struct mem_cgroup *memcg = current->memcg_over_high;
+
+ if (likely(!memcg))
+ return;
+
+ do {
+ unsigned long usage = page_counter_read(&memcg->memory);
+ unsigned long high = ACCESS_ONCE(memcg->high);
+
+ if (usage <= high)
+ continue;
+
+ mem_cgroup_events(memcg, MEMCG_HIGH, 1);
+ try_to_free_mem_cgroup_pages(memcg, usage - high,
+ GFP_KERNEL, true);
+ } while ((memcg = parent_mem_cgroup(memcg)));
+
+ css_put(&current->memcg_over_high->css);
+ current->memcg_over_high = NULL;
+}
+
static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
unsigned int nr_pages)
{
@@ -2071,21 +2099,27 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
css_get_many(&memcg->css, batch);
if (batch > nr_pages)
refill_stock(memcg, batch - nr_pages);
- if (!(gfp_mask & __GFP_WAIT))
- goto done;
+
/*
- * If the hierarchy is above the normal consumption range,
- * make the charging task trim their excess contribution.
+ * If the hierarchy is above the normal consumption range, schedule
+ * direct reclaim on returning to userland. We can perform direct
+ * reclaim here if __GFP_WAIT; however, punting has the benefit of
+ * avoiding surprise high stack usages and it's fine to breach the
+ * high limit temporarily while control stays in kernel.
*/
- do {
- unsigned long usage = page_counter_read(&memcg->memory);
- unsigned long high = ACCESS_ONCE(memcg->high);
+ if (!current->memcg_over_high) {
+ struct mem_cgroup *pos = memcg;
- if (usage <= high)
- continue;
- mem_cgroup_events(memcg, MEMCG_HIGH, 1);
- try_to_free_mem_cgroup_pages(memcg, high - usage, gfp_mask, true);
- } while ((memcg = parent_mem_cgroup(memcg)));
+ do {
+ if (page_counter_read(&pos->memory) > pos->high) {
+ /* make user return path rescan from leaf */
+ css_get(&memcg->css);
+ current->memcg_over_high = memcg;
+ set_notify_resume(current);
+ break;
+ }
+ } while ((pos = parent_mem_cgroup(pos)));
+ }
done:
return ret;
}
@@ -5053,6 +5087,13 @@ static void mem_cgroup_move_task(struct cgroup_subsys_state *css,
}
#endif
+static void mem_cgroup_exit(struct cgroup_subsys_state *css,
+ struct cgroup_subsys_state *old_css,
+ struct task_struct *task)
+{
+ mem_cgroup_handle_over_high();
+}
+
/*
* Cgroup retains root cgroups across [un]mount cycles making it necessary
* to verify whether we're attached to the default hierarchy on each mount
@@ -5223,6 +5264,7 @@ struct cgroup_subsys memory_cgrp_subsys = {
.can_attach = mem_cgroup_can_attach,
.cancel_attach = mem_cgroup_cancel_attach,
.attach = mem_cgroup_move_task,
+ .exit = mem_cgroup_exit,
.bind = mem_cgroup_bind,
.dfl_cftypes = memory_files,
.legacy_cftypes = mem_cgroup_legacy_files,
--
2.4.3
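The deferral scheme in the patch above can be modeled in userspace. The following is a minimal sketch, not kernel code: struct group, over_high, model_charge(), model_reclaim() and model_handle_over_high() are hypothetical stand-ins for struct mem_cgroup, current->memcg_over_high, try_charge(), try_to_free_mem_cgroup_pages() and mem_cgroup_handle_over_high(). It assumes a single task and a reclaim that always frees exactly what it is asked to free.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct mem_cgroup. */
struct group {
	struct group *parent;	/* NULL at the root */
	unsigned long usage;	/* pages charged, including descendants */
	unsigned long high;	/* high limit in pages */
};

/* Stand-in for current->memcg_over_high; this model has one task. */
static struct group *over_high;

/* Charge @nr_pages to @leaf and its ancestors; never reclaims here. */
static void model_charge(struct group *leaf, unsigned long nr_pages)
{
	struct group *pos;

	for (pos = leaf; pos; pos = pos->parent)
		pos->usage += nr_pages;

	if (over_high)
		return;		/* reclaim is already scheduled */

	for (pos = leaf; pos; pos = pos->parent) {
		if (pos->usage > pos->high) {
			/* remember the leaf so the handler rescans from it */
			over_high = leaf;
			break;
		}
	}
}

/*
 * Assumption: reclaim frees exactly @nr_pages, all charged directly to
 * @g, so @g and its ancestors shrink while descendants are untouched.
 */
static void model_reclaim(struct group *g, unsigned long nr_pages)
{
	for (; g; g = g->parent)
		g->usage -= nr_pages;
}

/* Runs on the modeled return-to-userland path. */
static void model_handle_over_high(void)
{
	struct group *g = over_high;

	if (!g)
		return;

	for (; g; g = g->parent) {
		unsigned long usage = g->usage;
		unsigned long high = g->high;

		if (usage <= high)
			continue;
		/* both values are unsigned, so the target is usage - high */
		model_reclaim(g, usage - high);
	}
	over_high = NULL;
}
```

With a leaf whose high limit is 10 pages under a root whose limit is 100, two charges of 8 pages leave the leaf at 16 pages until model_handle_over_high() runs, after which both groups sit at 10. The breach is tolerated only while control stays in the modeled kernel path.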
^ permalink raw reply related [flat|nested] 66+ messages in thread
[parent not found: <1440775530-18630-4-git-send-email-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>]
* Re: [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path
2015-08-28 15:25 ` Tejun Heo
@ 2015-08-28 16:36 ` Vladimir Davydov
-1 siblings, 0 replies; 66+ messages in thread
From: Vladimir Davydov @ 2015-08-28 16:36 UTC (permalink / raw)
To: Tejun Heo
Cc: hannes-druUgvl0LCNAfugRpC6u6w, mhocko-DgEjT+Ai2ygdnm+yROfE0A,
cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
kernel-team-b10kYP2dOMg
Hi Tejun,
On Fri, Aug 28, 2015 at 11:25:29AM -0400, Tejun Heo wrote:
> Currently, try_charge() tries to reclaim memory directly when the high
> limit is breached; however, this has a couple issues.
>
> * try_charge() can be invoked from any in-kernel allocation site and
> reclaim path may use considerable amount of stack. This can lead to
> stack overflows which are extremely difficult to reproduce.
IMO this paragraph does not justify this patch at all, because one will
still invoke direct reclaim from try_charge() on hitting the hard limit.
>
> * If the allocation doesn't have __GFP_WAIT, direct reclaim is
> skipped. If a process performs only speculative allocations, it can
> blow way past the high limit. This is actually easily reproducible
> by simply doing "find /". VFS tries speculative !__GFP_WAIT
> allocations first, so as long as there's memory which can be
> consumed without blocking, it can keep allocating memory regardless
> of the high limit.
I think there shouldn't normally occur a lot of !__GFP_WAIT allocations
in a row - they should still alternate with normal __GFP_WAIT
allocations. Yes, that means we can breach memory.high threshold for a
short period of time, but it isn't a hard limit, so it looks perfectly
fine to me.
I tried to run `find /` over ext4 in a cgroup with memory.high set to
32M and kmem accounting enabled. With such a setup memory.current never
got higher than 33152K, which is only 384K greater than the memory.high.
Which FS did you use?
Thanks,
Vladimir
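The overshoot quoted above can be checked with a few lines of C. This is only a sketch of the arithmetic; overage_kib() is a hypothetical helper, and it assumes that "32M" and "33152K" are binary units (1M = 1024K).

```c
#include <assert.h>

/* Overshoot above memory.high in KiB; 0 when usage is within the limit. */
static unsigned long overage_kib(unsigned long current_kib,
				 unsigned long high_mib)
{
	unsigned long high_kib = high_mib * 1024;

	return current_kib > high_kib ? current_kib - high_kib : 0;
}
```

Under that assumption, overage_kib(33152, 32) evaluates to 384, matching the 384K figure in the message: 32M is 32768K, and 33152K - 32768K = 384K.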
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path
2015-08-28 16:36 ` Vladimir Davydov
@ 2015-08-28 16:48 ` Tejun Heo
-1 siblings, 0 replies; 66+ messages in thread
From: Tejun Heo @ 2015-08-28 16:48 UTC (permalink / raw)
To: Vladimir Davydov
Cc: hannes-druUgvl0LCNAfugRpC6u6w, mhocko-DgEjT+Ai2ygdnm+yROfE0A,
cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
kernel-team-b10kYP2dOMg
Hello, Vladimir.
On Fri, Aug 28, 2015 at 07:36:11PM +0300, Vladimir Davydov wrote:
> > * try_charge() can be invoked from any in-kernel allocation site and
> > reclaim path may use considerable amount of stack. This can lead to
> > stack overflows which are extremely difficult to reproduce.
>
> IMO this paragraph does not justify this patch at all, because one will
> still invoke direct reclaim from try_charge() on hitting the hard limit.
Ah... right, and we can't defer direct reclaim for hard limit.
> > * If the allocation doesn't have __GFP_WAIT, direct reclaim is
> > skipped. If a process performs only speculative allocations, it can
> > blow way past the high limit. This is actually easily reproducible
> > by simply doing "find /". VFS tries speculative !__GFP_WAIT
> > allocations first, so as long as there's memory which can be
> > consumed without blocking, it can keep allocating memory regardless
> > of the high limit.
>
> I think there shouldn't normally occur a lot of !__GFP_WAIT allocations
> in a row - they should still alternate with normal __GFP_WAIT
> allocations. Yes, that means we can breach memory.high threshold for a
> short period of time, but it isn't a hard limit, so it looks perfectly
> fine to me.
>
> I tried to run `find /` over ext4 in a cgroup with memory.high set to
> 32M and kmem accounting enabled. With such a setup memory.current never
> got higher than 33152K, which is only 384K greater than the memory.high.
> Which FS did you use?
ext4. Here, it goes onto happily consuming hundreds of megabytes with
limit set at 32M. We have quite a few places where !__GFP_WAIT
allocations are performed speculatively in hot paths with fallback
slow paths, so this is bound to happen somewhere.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path
2015-08-28 16:48 ` Tejun Heo
(?)
@ 2015-08-28 20:32 ` Vladimir Davydov
2015-08-28 20:44 ` Tejun Heo
2015-08-30 15:52 ` Vladimir Davydov
-1 siblings, 2 replies; 66+ messages in thread
From: Vladimir Davydov @ 2015-08-28 20:32 UTC (permalink / raw)
To: Tejun Heo, Andrew Morton
Cc: hannes, mhocko, cgroups, linux-mm, kernel-team, Joonsoo Kim,
Christoph Lameter, David Rientjes
On Fri, Aug 28, 2015 at 12:48:19PM -0400, Tejun Heo wrote:
...
> > > * If the allocation doesn't have __GFP_WAIT, direct reclaim is
> > > skipped. If a process performs only speculative allocations, it can
> > > blow way past the high limit. This is actually easily reproducible
> > > by simply doing "find /". VFS tries speculative !__GFP_WAIT
> > > allocations first, so as long as there's memory which can be
> > > consumed without blocking, it can keep allocating memory regardless
> > > of the high limit.
> >
> > I think there shouldn't normally occur a lot of !__GFP_WAIT allocations
> > in a row - they should still alternate with normal __GFP_WAIT
> > allocations. Yes, that means we can breach memory.high threshold for a
> > short period of time, but it isn't a hard limit, so it looks perfectly
> > fine to me.
> >
> > I tried to run `find /` over ext4 in a cgroup with memory.high set to
> > 32M and kmem accounting enabled. With such a setup memory.current never
> > got higher than 33152K, which is only 384K greater than the memory.high.
> > Which FS did you use?
>
> ext4. Here, it goes onto happily consuming hundreds of megabytes with
> limit set at 32M. We have quite a few places where !__GFP_WAIT
> allocations are performed speculatively in hot paths with fallback
> slow paths, so this is bound to happen somewhere.
What kind of workload should it be then? `find` will constantly invoke
d_alloc, which issues a GFP_KERNEL allocation and therefore is allowed
to perform reclaim...
OK, I tried to reproduce the issue on the latest mainline kernel and ...
succeeded - memory.current did occasionally jump up to ~55M although
memory.high was set to 32M. Hmm, strange... Started to investigate.
Printed stack traces and found that we don't invoke memcg reclaim on
normal GFP_KERNEL allocations! How is that? The thing is there was a
commit that made SLUB (not VFS or any other kmem user, but core SLUB)
try to allocate high order slab pages w/o __GFP_WAIT for performance
reasons. That broke kmemcg case. Here it goes:
commit 6af3142bed1f520b90f4cdb6cd10bbd16906ce9a
Author: Joonsoo Kim <js1304@gmail.com>
Date: Tue Aug 25 00:03:52 2015 +0000
mm/slub: don't wait for high-order page allocation
I suspect your kernel has this commit included, because w/o it I haven't
managed to catch anything nearly as bad as you describe: the memory.high
excess reached 1-2 Mb at max, but never "hundreds of megabytes". If so,
we'd better fix that instead. Actually, it's worth fixing anyway. What
about the patch below?
---
From: Vladimir Davydov <vdavydov@parallels.com>
Date: Fri, 28 Aug 2015 23:17:19 +0300
Subject: [PATCH] mm/slub: don't bypass memcg reclaim for high-order page
allocation
Commit 6af3142bed1f52 ("mm/slub: don't wait for high-order page
allocation") made allocate_slab() try to allocate high order slab pages
w/o __GFP_WAIT in order to avoid invoking reclaim/compaction when we can
fall back on low order pages. However, it broke kmemcg/memory.high
logic. The latter works as a soft limit: an allocation won't fail if it
is breached, but we call direct reclaim to compensate the excess. W/o
__GFP_WAIT we can't invoke reclaimer and therefore we will just go on,
exceeding memory.high more and more until a normal __GFP_WAIT allocation
is issued.
Since memcg reclaim never triggers compaction, we can pass __GFP_WAIT to
memcg_charge_slab() even on high order page allocations w/o any
performance impact. So let's fix this problem by excluding __GFP_WAIT
only from alloc_pages() while still forwarding it to memcg_charge_slab()
if the context allows.
Fixes: 6af3142bed1f52 ("mm/slub: don't wait for high-order page allocation")
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
diff --git a/mm/slub.c b/mm/slub.c
index e180f8dcd06d..1b9dbad40272 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1333,6 +1333,9 @@ static inline struct page *alloc_slab_page(struct kmem_cache *s,
if (memcg_charge_slab(s, flags, order))
return NULL;
+ if ((flags & __GFP_WAIT) && oo_order(oo) > oo_order(s->min))
+ flags = (flags | __GFP_NOMEMALLOC) & ~__GFP_WAIT;
+
if (node == NUMA_NO_NODE)
page = alloc_pages(flags, order);
else
@@ -1364,8 +1367,6 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
* so we fall-back to the minimum order allocation.
*/
alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY) & ~__GFP_NOFAIL;
- if ((alloc_gfp & __GFP_WAIT) && oo_order(oo) > oo_order(s->min))
- alloc_gfp = (alloc_gfp | __GFP_NOMEMALLOC) & ~__GFP_WAIT;
page = alloc_slab_page(s, alloc_gfp, node, oo);
if (unlikely(!page)) {
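The idea behind the diff above is that the memcg charge and the page allocation see different flag sets. A minimal userspace sketch follows; MODEL_GFP_WAIT, MODEL_GFP_NOMEMALLOC, struct slab_flags and split_slab_flags() are hypothetical names with made-up bit values, not the kernel's definitions, and the patch itself was later withdrawn for rework.

```c
#include <assert.h>

/* Hypothetical bit values; the kernel's real gfp flags differ. */
#define MODEL_GFP_WAIT		0x1u
#define MODEL_GFP_NOMEMALLOC	0x2u

struct slab_flags {
	unsigned int charge;	/* flags seen by the memcg charge */
	unsigned int alloc;	/* flags seen by the page allocator */
};

/*
 * The charge keeps the caller's flags so memcg reclaim may still run.
 * Only the page allocation drops the "may wait" bit, and only for
 * orders above the cache's minimum, where a low-order fallback exists.
 */
static struct slab_flags split_slab_flags(unsigned int flags, int order,
					  int min_order)
{
	struct slab_flags f = { flags, flags };

	if ((flags & MODEL_GFP_WAIT) && order > min_order)
		f.alloc = (flags | MODEL_GFP_NOMEMALLOC) & ~MODEL_GFP_WAIT;
	return f;
}
```

For a waiting caller and an order-3 attempt with a minimum order of 0, the charge side keeps MODEL_GFP_WAIT while the allocation side loses it and gains MODEL_GFP_NOMEMALLOC. A minimum-order attempt leaves both sides unchanged.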
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related [flat|nested] 66+ messages in thread
* Re: [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path
2015-08-28 20:32 ` Vladimir Davydov
@ 2015-08-28 20:44 ` Tejun Heo
2015-08-30 15:52 ` Vladimir Davydov
1 sibling, 0 replies; 66+ messages in thread
From: Tejun Heo @ 2015-08-28 20:44 UTC (permalink / raw)
To: Vladimir Davydov
Cc: Andrew Morton, hannes-druUgvl0LCNAfugRpC6u6w,
mhocko-DgEjT+Ai2ygdnm+yROfE0A, cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg, kernel-team-b10kYP2dOMg,
Joonsoo Kim, Christoph Lameter, David Rientjes
Hello,
On Fri, Aug 28, 2015 at 11:32:31PM +0300, Vladimir Davydov wrote:
> What kind of workload should it be then? `find` will constantly invoke
> d_alloc, which issues a GFP_KERNEL allocation and therefore is allowed
> to perform reclaim...
>
> OK, I tried to reproduce the issue on the latest mainline kernel and ...
> succeeded - memory.current did occasionally jump up to ~55M although
> memory.high was set to 32M. Hmm, strange... Started to investigate.
> Printed stack traces and found that we don't invoke memcg reclaim on
> normal GFP_KERNEL allocations! How is that? The thing is there was a
> commit that made SLUB (not VFS or any other kmem user, but core SLUB)
> try to allocate high order slab pages w/o __GFP_WAIT for performance
> reasons. That broke kmemcg case. Here it goes:
Ah, cool, so it was a bug from slub. Punting to return path still has
some niceties but if we can't consistently get rid of stack
consumption it's not that attractive. Let's revisit it later together
with hard limit reclaim.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path
2015-08-28 20:44 ` Tejun Heo
(?)
@ 2015-08-28 22:06 ` Tejun Heo
[not found] ` <20150828220632.GF11089-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
-1 siblings, 1 reply; 66+ messages in thread
From: Tejun Heo @ 2015-08-28 22:06 UTC (permalink / raw)
To: Vladimir Davydov
Cc: Andrew Morton, hannes, mhocko, cgroups, linux-mm, kernel-team,
Joonsoo Kim, Christoph Lameter, David Rientjes
On Fri, Aug 28, 2015 at 04:44:32PM -0400, Tejun Heo wrote:
> Ah, cool, so it was a bug from slub. Punting to return path still has
> some niceties but if we can't consistently get rid of stack
> consumption it's not that attractive. Let's revisit it later together
> with hard limit reclaim.
So, I can't check right now but I'm pretty sure I was using SLAB on my
test config, so this issue may exist there too.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path
2015-08-28 20:32 ` Vladimir Davydov
2015-08-28 20:44 ` Tejun Heo
@ 2015-08-30 15:52 ` Vladimir Davydov
1 sibling, 0 replies; 66+ messages in thread
From: Vladimir Davydov @ 2015-08-30 15:52 UTC (permalink / raw)
To: Tejun Heo, Andrew Morton
Cc: hannes, mhocko, cgroups, linux-mm, kernel-team, Joonsoo Kim,
Christoph Lameter, David Rientjes
On Fri, Aug 28, 2015 at 11:32:31PM +0300, Vladimir Davydov wrote:
...
> From: Vladimir Davydov <vdavydov@parallels.com>
> Date: Fri, 28 Aug 2015 23:17:19 +0300
> Subject: [PATCH] mm/slub: don't bypass memcg reclaim for high-order page
> allocation
Please ignore this patch. I'll rework and resend.
Thanks,
Vladimir
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path
2015-08-28 15:25 ` Tejun Heo
@ 2015-08-28 17:13 ` Michal Hocko
-1 siblings, 0 replies; 66+ messages in thread
From: Michal Hocko @ 2015-08-28 17:13 UTC (permalink / raw)
To: Tejun Heo
Cc: hannes-druUgvl0LCNAfugRpC6u6w, cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg, vdavydov-bzQdu9zFT3WakBO8gow8eQ,
kernel-team-b10kYP2dOMg
On Fri 28-08-15 11:25:29, Tejun Heo wrote:
> Currently, try_charge() tries to reclaim memory directly when the high
> limit is breached; however, this has a couple issues.
>
> * try_charge() can be invoked from any in-kernel allocation site and
> reclaim path may use considerable amount of stack. This can lead to
> stack overflows which are extremely difficult to reproduce.
This is true but I haven't seen any reports for the stack overflow for
quite some time.
> * If the allocation doesn't have __GFP_WAIT, direct reclaim is
> skipped. If a process performs only speculative allocations, it can
> blow way past the high limit. This is actually easily reproducible
> by simply doing "find /". VFS tries speculative !__GFP_WAIT
> allocations first, so as long as there's memory which can be
> consumed without blocking, it can keep allocating memory regardless
> of the high limit.
It is a bit confusing that you are talking about direct reclaim but in
fact mean high limit reclaim. But yeah, you are right there is no
protection against GFP_NOWAIT allocations there.
> This patch makes try_charge() always punt the direct reclaim to the
> return-to-userland path. If try_charge() detects that high limit is
> breached, it sets current->memcg_over_high to the offending memcg and
> schedules execution of mem_cgroup_handle_over_high() which performs
> the direct reclaim from the return-to-userland path.
OK, this is certainly an attractive idea because of allocation requests
with reduced reclaim capabilities. GFP_NOWAIT is not the only one.
GFP_NOFS would be another. With kmem accounting they are much bigger
problem than with regular page faults/page cache. And having full
GFP_KERNEL reclaim context is definitely nice.
I would just argue that this implementation has the same issue as the
other patch in the series which performs high-usage reclaim. I think
that each task should reclaim only its contribution which is trivial
to account.
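One way to read the per-task accounting suggested above is sketched below in userspace C. It is an interpretation, not code from this thread: struct model_task, struct model_group, contrib_charge() and contrib_return_to_user() are hypothetical names, and reclaim is assumed to free exactly the requested number of pages.

```c
#include <assert.h>

/* Hypothetical per-task state: pages this task charged over the limit. */
struct model_task {
	unsigned long nr_pages_over_high;
};

/* Hypothetical group with a page counter and a high limit. */
struct model_group {
	unsigned long usage;
	unsigned long high;
};

/* Charge @nr_pages; remember them if the group ends up over its limit. */
static void contrib_charge(struct model_task *t, struct model_group *g,
			   unsigned long nr_pages)
{
	g->usage += nr_pages;
	if (g->usage > g->high)
		t->nr_pages_over_high += nr_pages;
}

/*
 * On the modeled return to userland the task reclaims only what it
 * charged while over the limit, not the group's whole excess.
 * Returns the number of pages reclaimed.
 */
static unsigned long contrib_return_to_user(struct model_task *t,
					    struct model_group *g)
{
	unsigned long target = t->nr_pages_over_high;

	t->nr_pages_over_high = 0;
	if (target > g->usage)
		target = g->usage;
	g->usage -= target;
	return target;
}
```

In this model a task that pushes a group from 31 to 35 pages against a limit of 32 is asked to reclaim its own 4 pages, regardless of how much excess other tasks have added.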
> As long as kernel doesn't have a run-away allocation spree, this
> should provide enough protection while making kmemcg behave more
> consistently.
>
> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> ---
> include/linux/memcontrol.h | 6 +++++
> include/linux/sched.h | 1 +
> include/linux/tracehook.h | 3 +++
> mm/memcontrol.c | 66 +++++++++++++++++++++++++++++++++++++---------
> 4 files changed, 64 insertions(+), 12 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 3d28656..8d345a7 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -402,6 +402,8 @@ static inline int mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec)
> return inactive * inactive_ratio < active;
> }
>
> +void mem_cgroup_handle_over_high(void);
> +
> void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
> struct task_struct *p);
>
> @@ -621,6 +623,10 @@ static inline void mem_cgroup_end_page_stat(struct mem_cgroup *memcg)
> {
> }
>
> +static inline void mem_cgroup_handle_over_high(void)
> +{
> +}
> +
> static inline void mem_cgroup_oom_enable(void)
> {
> }
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index ef73b54..c76b71d 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1785,6 +1785,7 @@ struct task_struct {
> #endif /* CONFIG_TRACING */
> #ifdef CONFIG_MEMCG
> struct mem_cgroup *memcg_in_oom;
> + struct mem_cgroup *memcg_over_high; /* reclaim on returning to user */
> gfp_t memcg_oom_gfp_mask;
> int memcg_oom_order;
> #endif
> diff --git a/include/linux/tracehook.h b/include/linux/tracehook.h
> index 84d4972..26c1521 100644
> --- a/include/linux/tracehook.h
> +++ b/include/linux/tracehook.h
> @@ -50,6 +50,7 @@
> #include <linux/ptrace.h>
> #include <linux/security.h>
> #include <linux/task_work.h>
> +#include <linux/memcontrol.h>
> struct linux_binprm;
>
> /*
> @@ -188,6 +189,8 @@ static inline void tracehook_notify_resume(struct pt_regs *regs)
> smp_mb__after_atomic();
> if (unlikely(current->task_works))
> task_work_run();
> +
> + mem_cgroup_handle_over_high();
> }
>
> #endif /* <linux/tracehook.h> */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 74abb31..c94b686 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -62,6 +62,7 @@
> #include <linux/oom.h>
> #include <linux/lockdep.h>
> #include <linux/file.h>
> +#include <linux/tracehook.h>
> #include "internal.h"
> #include <net/sock.h>
> #include <net/ip.h>
> @@ -1963,6 +1964,33 @@ static int memcg_cpu_hotplug_callback(struct notifier_block *nb,
>  	return NOTIFY_OK;
>  }
>
> +/*
> + * Scheduled by try_charge() to be executed from the userland return path
> + * and reclaims memory over the high limit.
> + */
> +void mem_cgroup_handle_over_high(void)
> +{
> +	struct mem_cgroup *memcg = current->memcg_over_high;
> +
> +	if (likely(!memcg))
> +		return;
> +
> +	do {
> +		unsigned long usage = page_counter_read(&memcg->memory);
> +		unsigned long high = ACCESS_ONCE(memcg->high);
> +
> +		if (usage <= high)
> +			continue;
> +
> +		mem_cgroup_events(memcg, MEMCG_HIGH, 1);
> +		try_to_free_mem_cgroup_pages(memcg, usage - high,
> +					     GFP_KERNEL, true);
> +	} while ((memcg = parent_mem_cgroup(memcg)));
> +
> +	css_put(&current->memcg_over_high->css);
> +	current->memcg_over_high = NULL;
> +}
> +
>  static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  		      unsigned int nr_pages)
>  {
> @@ -2071,21 +2099,27 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	css_get_many(&memcg->css, batch);
>  	if (batch > nr_pages)
>  		refill_stock(memcg, batch - nr_pages);
> -	if (!(gfp_mask & __GFP_WAIT))
> -		goto done;
> +
>  	/*
> -	 * If the hierarchy is above the normal consumption range,
> -	 * make the charging task trim their excess contribution.
> +	 * If the hierarchy is above the normal consumption range, schedule
> +	 * direct reclaim on returning to userland. We can perform direct
> +	 * reclaim here if __GFP_WAIT; however, punting has the benefit of
> +	 * avoiding surprise high stack usages and it's fine to breach the
> +	 * high limit temporarily while control stays in kernel.
>  	 */
> -	do {
> -		unsigned long usage = page_counter_read(&memcg->memory);
> -		unsigned long high = ACCESS_ONCE(memcg->high);
> +	if (!current->memcg_over_high) {
> +		struct mem_cgroup *pos = memcg;
>
> -		if (usage <= high)
> -			continue;
> -		mem_cgroup_events(memcg, MEMCG_HIGH, 1);
> -		try_to_free_mem_cgroup_pages(memcg, high - usage, gfp_mask, true);
> -	} while ((memcg = parent_mem_cgroup(memcg)));
> +		do {
> +			if (page_counter_read(&pos->memory) > pos->high) {
> +				/* make user return path rescan from leaf */
> +				css_get(&memcg->css);
> +				current->memcg_over_high = memcg;
> +				set_notify_resume(current);
> +				break;
> +			}
> +		} while ((pos = parent_mem_cgroup(pos)));
> +	}
>  done:
>  	return ret;
>  }
> @@ -5053,6 +5087,13 @@ static void mem_cgroup_move_task(struct cgroup_subsys_state *css,
> }
> #endif
>
> +static void mem_cgroup_exit(struct cgroup_subsys_state *css,
> + struct cgroup_subsys_state *old_css,
> + struct task_struct *task)
> +{
> + mem_cgroup_handle_over_high();
> +}
> +
> /*
> * Cgroup retains root cgroups across [un]mount cycles making it necessary
> * to verify whether we're attached to the default hierarchy on each mount
> @@ -5223,6 +5264,7 @@ struct cgroup_subsys memory_cgrp_subsys = {
> .can_attach = mem_cgroup_can_attach,
> .cancel_attach = mem_cgroup_cancel_attach,
> .attach = mem_cgroup_move_task,
> + .exit = mem_cgroup_exit,
> .bind = mem_cgroup_bind,
> .dfl_cftypes = memory_files,
> .legacy_cftypes = mem_cgroup_legacy_files,
> --
> 2.4.3
--
Michal Hocko
SUSE Labs
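
[Editorial sketch] The deferral that the quoted patch implements can be modelled in plain userspace C, roughly as below. This is a toy model under stated assumptions, not kernel code: `toy_memcg`, `toy_task`, `toy_charge()`, `toy_reclaim()` and `toy_handle_over_high()` are hypothetical names, "reclaim" is reduced to subtracting pages from a counter, and reference counting, events, and the notify-resume hook are left out. It only shows the control flow: the charge path records the breach, and the return-to-userland handler rescans from the leaf and trims each level down to its high limit.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct mem_cgroup: usage and high limit, in pages. */
struct toy_memcg {
	unsigned long usage;
	unsigned long high;
	struct toy_memcg *parent;
};

/* Toy stand-in for the per-task pointer the patch adds to task_struct. */
struct toy_task {
	struct toy_memcg *over_high;	/* set on charge, cleared on "return" */
};

/* Charge pages up the hierarchy; never reclaims, only records a breach. */
static void toy_charge(struct toy_task *t, struct toy_memcg *cg,
		       unsigned long nr_pages)
{
	struct toy_memcg *pos;

	for (pos = cg; pos; pos = pos->parent)
		pos->usage += nr_pages;

	if (t->over_high)
		return;		/* a deferred reclaim is already pending */

	for (pos = cg; pos; pos = pos->parent) {
		if (pos->usage > pos->high) {
			/* remember the leaf so the handler rescans from it */
			t->over_high = cg;
			break;
		}
	}
}

/* Pretend reclaim: drop nr_pages from cg and from all of its ancestors. */
static void toy_reclaim(struct toy_memcg *cg, unsigned long nr_pages)
{
	for (; cg; cg = cg->parent) {
		assert(cg->usage >= nr_pages);
		cg->usage -= nr_pages;
	}
}

/* Run on the simulated return to userland; returns the pages reclaimed. */
static unsigned long toy_handle_over_high(struct toy_task *t)
{
	struct toy_memcg *cg = t->over_high;
	unsigned long total = 0;

	for (; cg; cg = cg->parent) {
		unsigned long excess;

		if (cg->usage <= cg->high)
			continue;

		/* usage > high holds here, so the unsigned subtraction cannot wrap */
		excess = cg->usage - cg->high;
		toy_reclaim(cg, excess);
		total += excess;
	}

	t->over_high = NULL;
	return total;
}
```

Note the operand order in the subtraction: with unsigned counters, `high - usage` on the over-limit path would wrap to a huge target, which is why the handler computes `usage - high` only after the comparison.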
* Re: [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path
2015-08-28 17:13 ` Michal Hocko
@ 2015-08-28 17:56 ` Tejun Heo
From: Tejun Heo @ 2015-08-28 17:56 UTC
To: Michal Hocko
Cc: hannes, cgroups, linux-mm, vdavydov, kernel-team
Hello,
On Fri, Aug 28, 2015 at 07:13:22PM +0200, Michal Hocko wrote:
> On Fri 28-08-15 11:25:29, Tejun Heo wrote:
> > Currently, try_charge() tries to reclaim memory directly when the high
> > limit is breached; however, this has a couple issues.
> >
> > * try_charge() can be invoked from any in-kernel allocation site and
> > reclaim path may use considerable amount of stack. This can lead to
> > stack overflows which are extremely difficult to reproduce.
>
> This is true but I haven't seen any reports for the stack overflow for
> quite some time.
So, this didn't really fix it but xfs had to punt things to workqueues
to avoid stack overflows and IIRC it involved direct reclaim. Maybe
it's too late but it probably is a good idea to punt this from the
source.
> I would just argue that this implementation has the same issue as the
> other patch in the series which performs high-usage reclaim. I think
> that each task should reclaim only its contribution which is trivial
> to account.
Hmm... I'll respond there.
Thanks.
--
tejun
* Re: [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path
2015-08-28 17:13 ` Michal Hocko
@ 2015-08-28 20:45 ` Vladimir Davydov
From: Vladimir Davydov @ 2015-08-28 20:45 UTC
To: Michal Hocko
Cc: Tejun Heo, hannes, cgroups, linux-mm, kernel-team
On Fri, Aug 28, 2015 at 07:13:22PM +0200, Michal Hocko wrote:
...
> > * If the allocation doesn't have __GFP_WAIT, direct reclaim is
> > skipped. If a process performs only speculative allocations, it can
> > blow way past the high limit. This is actually easily reproducible
> > by simply doing "find /". VFS tries speculative !__GFP_WAIT
> > allocations first, so as long as there's memory which can be
> > consumed without blocking, it can keep allocating memory regardless
> > of the high limit.
>
> It is a bit confusing that you are talking about direct reclaim but in
> fact mean high limit reclaim. But yeah, you are right there is no
> protection against GFP_NOWAIT allocations there.
Actually, memory.high by itself *is* the protection against GFP_NOWAIT
allocations, similarly to zone watermarks. W/o it we would have no other
choice but fail a GFP_NOWAIT allocation on hitting memory.max. One
should just set it so that
memory.max - memory.high > [max sum size of !__GFP_WAIT allocations
that can normally occur in a row]
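
[Editorial sketch] The inequality above can be checked mechanically. The following C fragment is a minimal illustration only, assuming all three quantities are expressed in pages; `headroom_ok()` and `largest_safe_high()` are hypothetical helper names, and the burst size is an input that would have to be measured for a given workload.

```c
#include <assert.h>
#include <stdbool.h>

/* True if (max - high) is strictly larger than the expected nowait burst. */
static bool headroom_ok(unsigned long max_pages, unsigned long high_pages,
			unsigned long nowait_burst_pages)
{
	if (high_pages >= max_pages)
		return false;	/* no gap at all; also avoids unsigned wrap */

	return max_pages - high_pages > nowait_burst_pages;
}

/* Largest value of high that still satisfies the inequality. */
static unsigned long largest_safe_high(unsigned long max_pages,
				       unsigned long nowait_burst_pages)
{
	/* otherwise no setting of high can satisfy the rule */
	assert(max_pages > nowait_burst_pages);

	return max_pages - nowait_burst_pages - 1;
}
```

For example, with a maximum of 262144 pages and an assumed burst of 4096 pages, the largest setting that keeps the strict inequality is 258047 pages; 258048 leaves a gap of exactly 4096 and fails the check.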
That being said, currently I don't see any point in making memory.high
!__GFP_WAIT-safe.
Thanks,
Vladimir
* Re: [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path
2015-08-28 20:45 ` Vladimir Davydov
@ 2015-08-28 20:53 ` Tejun Heo
From: Tejun Heo @ 2015-08-28 20:53 UTC
To: Vladimir Davydov
Cc: Michal Hocko, hannes, cgroups, linux-mm, kernel-team
Hello, Vladimir.
On Fri, Aug 28, 2015 at 11:45:54PM +0300, Vladimir Davydov wrote:
> Actually, memory.high by itself *is* the protection against GFP_NOWAIT
> allocations, similarly to zone watermarks. W/o it we would have no other
> choice but fail a GFP_NOWAIT allocation on hitting memory.max. One
> should just set it so that
>
> memory.max - memory.high > [max sum size of !__GFP_WAIT allocations
> that can normally occur in a row]
While this would be true in many cases, I don't think this is the
intention of the two knobs and the space between high and max can be
filled up by anything which can't be reclaimed - e.g. too many dirty /
writeback pages on a slow device or memlocked pages. If it were
really the buffer for GFP_NOWAIT, there's no reason to even make it a
separate knob and we *may* change how over-high reclaim behaves in the
future, so let's please not dig ourselves into something too specific.
> That being said, currently I don't see any point in making memory.high
> !__GFP_WAIT-safe.
Yeah, as long as the blow up can't be triggered consistently, it
should be fine.
Thanks.
--
tejun
* Re: [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path
2015-08-28 20:53 ` Tejun Heo
@ 2015-08-28 21:07 ` Vladimir Davydov
From: Vladimir Davydov @ 2015-08-28 21:07 UTC
To: Tejun Heo
Cc: Michal Hocko, hannes, cgroups, linux-mm, kernel-team
On Fri, Aug 28, 2015 at 04:53:01PM -0400, Tejun Heo wrote:
> On Fri, Aug 28, 2015 at 11:45:54PM +0300, Vladimir Davydov wrote:
> > Actually, memory.high by itself *is* the protection against GFP_NOWAIT
> > allocations, similarly to zone watermarks. W/o it we would have no other
> > choice but fail a GFP_NOWAIT allocation on hitting memory.max. One
> > should just set it so that
> >
> > memory.max - memory.high > [max sum size of !__GFP_WAIT allocations
> > that can normally occur in a row]
>
> While this would be true in many cases, I don't think this is the
> intention of the two knobs and the space between high and max can be
> filled up by anything which can't be reclaimed - e.g. too many dirty /
> writeback pages on a slow device or memlocked pages. If it were
> really the buffer for GFP_NOWAIT, there's no reason to even make it a
> separate knob and we *may* change how over-high reclaim behaves in the
> future, so let's please not dig ourselves into something too specific.
Yep, come to think of it, you're right. One might want to use the
memory.high knob as the protection, because currently it is the only way
to protect kmemcg against GFP_NOWAIT failures, but it looks more like
abusing it :-/
We should probably think about introducing some kind of watermarks that
would trigger memcg reclaim, asynchronous or direct, on exceeding
them.
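
[Editorial sketch] One possible shape for such watermarks, purely as an illustration of the idea and not something proposed in this thread in concrete form: a lower mark that would kick asynchronous reclaim and a higher one that would force direct reclaim. All names below (`toy_reclaim_action`, `toy_check_watermarks()`, `async_wmark`, `direct_wmark`) are hypothetical.

```c
#include <assert.h>

/* What the charge path would do for a given usage level. */
enum toy_reclaim_action {
	TOY_RECLAIM_NONE,	/* below both watermarks */
	TOY_RECLAIM_ASYNC,	/* above the async mark: kick background reclaim */
	TOY_RECLAIM_DIRECT,	/* above the direct mark: reclaim synchronously */
};

/* Classify usage against the two watermarks; all values are in pages. */
static enum toy_reclaim_action toy_check_watermarks(unsigned long usage,
						    unsigned long async_wmark,
						    unsigned long direct_wmark)
{
	/* the async mark is meant to trip first */
	assert(async_wmark <= direct_wmark);

	if (usage > direct_wmark)
		return TOY_RECLAIM_DIRECT;
	if (usage > async_wmark)
		return TOY_RECLAIM_ASYNC;
	return TOY_RECLAIM_NONE;
}
```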
Thanks,
Vladimir
* Re: [PATCH 3/4] memcg: punt high overage reclaim to return-to-userland path
2015-08-28 21:07 ` Vladimir Davydov
@ 2015-08-28 21:14 ` Tejun Heo
From: Tejun Heo @ 2015-08-28 21:14 UTC
To: Vladimir Davydov
Cc: Michal Hocko, hannes, cgroups, linux-mm, kernel-team
Hey,
On Sat, Aug 29, 2015 at 12:07:04AM +0300, Vladimir Davydov wrote:
> We should probably think about introducing some kind of watermarks that
> would trigger memcg reclaim, asynchronous or direct, on exceeding
> them.
Yeah, for max + kmemcg case, we eventually should do something similar
to the global case where we try to kick off async reclaim before we
hit the hard wall. Ultimately, I think punting reclaims to workqueue
or return-path is a good idea anyway, so maybe it can be all part of
the same mechanism. Given that the high limit is the primary control
mechanism on the default hierarchy, it should be fine for now.
Thanks.
--
tejun
* [PATCH 4/4] memcg: always enable kmemcg on the default hierarchy
2015-08-28 15:25 ` Tejun Heo
@ 2015-08-28 15:25 ` Tejun Heo
From: Tejun Heo @ 2015-08-28 15:25 UTC
To: hannes, mhocko
Cc: cgroups, linux-mm, vdavydov, kernel-team, Tejun Heo
On the default hierarchy, all memory consumption will be accounted
together and controlled by the same set of limits. Always enable
kmemcg on the default hierarchy.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
mm/memcontrol.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c94b686..8a5dd01 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4362,6 +4362,13 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
 	if (ret)
 		return ret;
 
+	/* kmem is always accounted together on the default hierarchy */
+	if (cgroup_on_dfl(css->cgroup)) {
+		ret = memcg_activate_kmem(memcg, PAGE_COUNTER_MAX);
+		if (ret)
+			return ret;
+	}
+
 	/*
 	 * Make sure the memcg is initialized: mem_cgroup_iter()
 	 * orders reading memcg->initialized against its callers
--
2.4.3
* Re: [PATCH 4/4] memcg: always enable kmemcg on the default hierarchy
2015-08-28 15:25 ` Tejun Heo
@ 2015-08-28 16:49 ` Vladimir Davydov
From: Vladimir Davydov @ 2015-08-28 16:49 UTC
To: Tejun Heo
Cc: hannes, mhocko, cgroups, linux-mm, kernel-team
On Fri, Aug 28, 2015 at 11:25:30AM -0400, Tejun Heo wrote:
> On the default hierarchy, all memory consumption will be accounted
> together and controlled by the same set of limits. Always enable
> kmemcg on the default hierarchy.
IMO we should introduce a boot-time knob for disabling it, because kmem
accounting is still not perfect; besides, some users might prefer to go
w/o it for performance reasons.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
> mm/memcontrol.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c94b686..8a5dd01 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4362,6 +4362,13 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
>  	if (ret)
>  		return ret;
>
> +	/* kmem is always accounted together on the default hierarchy */
> +	if (cgroup_on_dfl(css->cgroup)) {
> +		ret = memcg_activate_kmem(memcg, PAGE_COUNTER_MAX);
> +		if (ret)
> +			return ret;
> +	}
> +
This is the wrong place for this. The kernel will panic on an attempt to
create a sub memcg, because memcg_init_kmem already enables kmem
accounting in this case. I guess we should add this hunk to
memcg_propagate_kmem instead.
Thanks,
Vladimir
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH 4/4] memcg: always enable kmemcg on the default hierarchy
2015-08-28 16:49 ` Vladimir Davydov
(?)
@ 2015-08-28 16:56 ` Tejun Heo
-1 siblings, 0 replies; 66+ messages in thread
From: Tejun Heo @ 2015-08-28 16:56 UTC (permalink / raw)
To: Vladimir Davydov; +Cc: hannes, mhocko, cgroups, linux-mm, kernel-team
Hello,
On Fri, Aug 28, 2015 at 07:49:18PM +0300, Vladimir Davydov wrote:
> On Fri, Aug 28, 2015 at 11:25:30AM -0400, Tejun Heo wrote:
> > On the default hierarchy, all memory consumption will be accounted
> > together and controlled by the same set of limits. Always enable
> > kmemcg on the default hierarchy.
>
> IMO we should introduce a boot time knob for disabling it, because kmem
> accounting is still not perfect, besides some users might prefer to go
> w/o it for performance reasons.
Yeah, fair enough.
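As a rough illustration of the boot time knob discussed above, here is a
minimal userspace sketch. It is not code from this series: the option
name "nokmem", struct memcg_boot_opts and both helpers are hypothetical
names invented for the example. It only models parsing a comma-separated
option string and using the result to gate default-hierarchy activation.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical boot options; nokmem == true disables kmem accounting. */
struct memcg_boot_opts {
	bool nokmem;
};

/*
 * Parse a comma-separated option string such as "nokmem".
 * Unknown tokens are ignored; only exact token matches count.
 */
void parse_memcg_boot_opts(const char *s, struct memcg_boot_opts *opts)
{
	assert(opts != NULL);
	opts->nokmem = false;

	while (s && *s) {
		size_t len = strcspn(s, ",");	/* length of current token */

		if (len == strlen("nokmem") && strncmp(s, "nokmem", len) == 0)
			opts->nokmem = true;

		s += len;
		if (*s == ',')
			s++;			/* skip the separator */
	}
}

/* Gate for the "always enable on the default hierarchy" path. */
bool should_activate_kmem(bool on_dfl, const struct memcg_boot_opts *opts)
{
	return on_dfl && !opts->nokmem;
}
```

With this shape the default stays opt-out: an empty option string leaves
accounting enabled on the default hierarchy, and passing "nokmem" turns
it off.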
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index c94b686..8a5dd01 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -4362,6 +4362,13 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
> > if (ret)
> > return ret;
> >
> > + /* kmem is always accounted together on the default hierarchy */
> > + if (cgroup_on_dfl(css->cgroup)) {
> > + ret = memcg_activate_kmem(memcg, PAGE_COUNTER_MAX);
> > + if (ret)
> > + return ret;
> > + }
> > +
>
> This is a wrong place for this. The kernel will panic on an attempt to
> create a sub memcg, because memcg_init_kmem already enables kmem
> accounting in this case. I guess we should add this hunk to
> memcg_propagate_kmem instead.
Yeap, bypassing "parent is active" test in memcg_propagate_kmem()
seems like the right thing to do.
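To make the suggestion concrete, here is a minimal userspace model of
the bypass, not the actual kernel change. struct memcg_model and the
model_* functions are stand-ins invented for this sketch. The assert()
models the assumption, implied by the panic report above, that
activating kmem accounting twice on the same memcg is fatal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for struct mem_cgroup; only what the sketch needs. */
struct memcg_model {
	struct memcg_model *parent;
	bool on_dfl;		/* models cgroup_on_dfl() */
	bool kmem_active;	/* models "kmem accounting is active" */
	int activations;	/* how many times activation ran */
};

/* Models memcg_activate_kmem(): activating twice is treated as a bug. */
int model_activate_kmem(struct memcg_model *memcg)
{
	assert(!memcg->kmem_active);	/* assumed sanity check */
	memcg->kmem_active = true;
	memcg->activations++;
	return 0;
}

/*
 * Models memcg_propagate_kmem() with the "parent is active" test
 * bypassed on the default hierarchy, so there is a single activation
 * site instead of a second one in the css_online path.
 */
int model_propagate_kmem(struct memcg_model *memcg)
{
	struct memcg_model *parent = memcg->parent;

	if (!parent)
		return 0;	/* root: nothing to propagate from */

	if (memcg->on_dfl || parent->kmem_active)
		return model_activate_kmem(memcg);

	return 0;
}
```

In this model a child on the default hierarchy is activated exactly once
even though its parent is inactive, while a legacy-hierarchy child with
an inactive parent is left alone.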
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH 4/4] memcg: always enable kmemcg on the default hierarchy
2015-08-28 16:49 ` Vladimir Davydov
@ 2015-08-28 17:14 ` Michal Hocko
-1 siblings, 0 replies; 66+ messages in thread
From: Michal Hocko @ 2015-08-28 17:14 UTC (permalink / raw)
To: Vladimir Davydov
Cc: Tejun Heo, hannes-druUgvl0LCNAfugRpC6u6w,
cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
kernel-team-b10kYP2dOMg
On Fri 28-08-15 19:49:18, Vladimir Davydov wrote:
> On Fri, Aug 28, 2015 at 11:25:30AM -0400, Tejun Heo wrote:
> > On the default hierarchy, all memory consumption will be accounted
> > together and controlled by the same set of limits. Always enable
> > kmemcg on the default hierarchy.
>
> IMO we should introduce a boot time knob for disabling it, because kmem
> accounting is still not perfect, besides some users might prefer to go
> w/o it for performance reasons.
I would even argue for opt-in rather than opt-out.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH 4/4] memcg: always enable kmemcg on the default hierarchy
2015-08-28 17:14 ` Michal Hocko
(?)
@ 2015-08-28 17:41 ` Tejun Heo
2015-09-01 12:44 ` Michal Hocko
-1 siblings, 1 reply; 66+ messages in thread
From: Tejun Heo @ 2015-08-28 17:41 UTC (permalink / raw)
To: Michal Hocko; +Cc: Vladimir Davydov, hannes, cgroups, linux-mm, kernel-team
On Fri, Aug 28, 2015 at 07:14:38PM +0200, Michal Hocko wrote:
> On Fri 28-08-15 19:49:18, Vladimir Davydov wrote:
> > On Fri, Aug 28, 2015 at 11:25:30AM -0400, Tejun Heo wrote:
> > > On the default hierarchy, all memory consumption will be accounted
> > > together and controlled by the same set of limits. Always enable
> > > kmemcg on the default hierarchy.
> >
> > IMO we should introduce a boot time knob for disabling it, because kmem
> > accounting is still not perfect, besides some users might prefer to go
> > w/o it for performance reasons.
>
> I would even argue for opt-in rather than opt-out.
Definitely not. We wanna put all memory consumptions under the same
roof by default.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH 4/4] memcg: always enable kmemcg on the default hierarchy
2015-08-28 17:41 ` Tejun Heo
@ 2015-09-01 12:44 ` Michal Hocko
[not found] ` <20150901124459.GC8810-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
0 siblings, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2015-09-01 12:44 UTC (permalink / raw)
To: Tejun Heo; +Cc: Vladimir Davydov, hannes, cgroups, linux-mm, kernel-team
On Fri 28-08-15 13:41:40, Tejun Heo wrote:
> On Fri, Aug 28, 2015 at 07:14:38PM +0200, Michal Hocko wrote:
> > On Fri 28-08-15 19:49:18, Vladimir Davydov wrote:
> > > On Fri, Aug 28, 2015 at 11:25:30AM -0400, Tejun Heo wrote:
> > > > On the default hierarchy, all memory consumption will be accounted
> > > > together and controlled by the same set of limits. Always enable
> > > > kmemcg on the default hierarchy.
> > >
> > > IMO we should introduce a boot time knob for disabling it, because kmem
> > > accounting is still not perfect, besides some users might prefer to go
> > > w/o it for performance reasons.
> >
> > I would even argue for opt-in rather than opt-out.
>
> Definitely not.
The runtime overhead is not negligible and I do not see why everybody
should be paying that price by default. I can definitely see the reason why
somebody would want to enable the kmem accounting but many users will
probably never care because the kernel footprint would be in the noise
wrt. user memory.
> We wanna put all memory consumptions under the same roof by default.
But I am not sure we will ever achieve this. E.g. hugetlb memory is too
different to be charged together with everything else by default IMO.
Also, all the random drivers that call into the page allocator directly
from user context would need to charge explicitly.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH 4/4] memcg: always enable kmemcg on the default hierarchy
2015-08-28 15:25 ` Tejun Heo
(?)
(?)
@ 2015-11-05 17:30 ` Michal Hocko
-1 siblings, 0 replies; 66+ messages in thread
From: Michal Hocko @ 2015-11-05 17:30 UTC (permalink / raw)
To: Tejun Heo; +Cc: hannes, cgroups, linux-mm, vdavydov, kernel-team
Just for reference: this has been discussed as part of another email
thread here:
http://lkml.kernel.org/r/20151027122647.GG9891%40dhcp22.suse.cz
I am _really_ sorry for hijacking that one. I didn't intend to, but my
remark turned into a full discussion. Had I known it would go that way,
I wouldn't have mentioned it.
On Fri 28-08-15 11:25:30, Tejun Heo wrote:
> On the default hierarchy, all memory consumption will be accounted
> together and controlled by the same set of limits. Always enable
> kmemcg on the default hierarchy.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
> mm/memcontrol.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c94b686..8a5dd01 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4362,6 +4362,13 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
> if (ret)
> return ret;
>
> + /* kmem is always accounted together on the default hierarchy */
> + if (cgroup_on_dfl(css->cgroup)) {
> + ret = memcg_activate_kmem(memcg, PAGE_COUNTER_MAX);
> + if (ret)
> + return ret;
> + }
> +
> /*
> * Make sure the memcg is initialized: mem_cgroup_iter()
> * orders reading memcg->initialized against its callers
> --
> 2.4.3
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 66+ messages in thread