* [patch 00/11] userspace out of memory handling
@ 2014-03-05 3:58 David Rientjes
2014-03-05 3:58 ` [patch 01/11] fork: collapse copy_flags into copy_process David Rientjes
` (12 more replies)
0 siblings, 13 replies; 33+ messages in thread
From: David Rientjes @ 2014-03-05 3:58 UTC (permalink / raw)
To: Andrew Morton
Cc: Johannes Weiner, Michal Hocko, KAMEZAWA Hiroyuki,
Christoph Lameter, Pekka Enberg, Tejun Heo, Mel Gorman,
Oleg Nesterov, Rik van Riel, Jianguo Wu, Tim Hockin, linux-kernel,
linux-mm, cgroups, linux-doc
This patchset implements userspace out of memory handling.
It is based on v3.14-rc5. Individual patches will apply cleanly or you
may pull the entire series from
git://git.kernel.org/pub/scm/linux/kernel/git/rientjes/linux.git mm/oom
When the system or a memcg is oom, processes running on that system or
attached to that memcg cannot allocate memory. It is impossible for a
process to reliably handle the oom condition from userspace.
First, consider only system oom conditions. When memory is completely
depleted and nothing may be reclaimed, the kernel is forced to free some
memory; the only way it can do so is to kill a userspace process. This
happens instantaneously, and userspace can neither enforce its own
policy nor collect information.
On system oom, there may be a hierarchy of memcgs that represent user
jobs, for example. Each job may have a priority independent of its
current memory usage. There is no existing kernel interface to kill the
lowest priority job; with this patchset, userspace can kill the lowest
priority job or allow priorities to change based on whether the job is
using more memory than its pre-defined reservation.
Additionally, users may want to log the condition or debug applications
that are using too much memory. They may wish to collect heap profiles,
or they may be able to free memory without killing a process by
throttling or rate-limiting it.
Interactive users of X window environments may wish to have a dialogue
box appear to determine how to proceed -- it may even allow them shell
access to examine the state of the system while oom.
It's not sufficient to simply restrict all user processes to a subset of
memory and oom handling processes to the remainder via a memcg hierarchy:
kernel memory and other page allocations can easily deplete all memory
that is not charged to a user hierarchy of memory.
This patchset allows userspace to do all of these things by defining a
small memory reserve that is accessible only by processes that are
handling the notification.
Second, consider memcg oom conditions. Processes need no special
knowledge of whether they are attached to the root memcg, where memcg
charging will always succeed, or a child memcg where charging will fail
when the limit has been reached. This allows those processes handling
memcg oom conditions to overcharge the memcg by the amount of reserved
memory. They need not create child memcgs with smaller limits and
attach the userspace oom handler only to the parent; such support would
not allow userspace to handle system oom conditions anyway.
This patchset introduces a standard interface through memcg that allows
both of these conditions to be handled in the same clean way: users
define memory.oom_reserve_in_bytes to define the reserve and this
amount is allowed to be overcharged to the process handling the oom
condition's memcg. If used with the root memcg, this amount is allowed
to be allocated below the per-zone watermarks for root processes that
are handling such conditions (only root may write to
cgroup.event_control for the root memcg).
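For illustration, a rough sketch of what a memcg oom handler might look
like with this interface. The cgroup mount point, the "job" memcg name,
and the 2MB reserve below are assumptions for the example, not part of
the interface:

  #include <fcntl.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/eventfd.h>
  #include <sys/mman.h>
  #include <unistd.h>

  #define MEMCG "/sys/fs/cgroup/memory/job"  /* hypothetical memcg path */

  static void write_file(const char *path, const char *buf)
  {
      int fd = open(path, O_WRONLY);

      if (fd < 0 || write(fd, buf, strlen(buf)) < 0) {
          perror(path);
          exit(1);
      }
      close(fd);
  }

  int main(void)
  {
      uint64_t count;
      char buf[64];
      int efd, ofd;

      /* allow this handler to overcharge the memcg by up to 2MB */
      write_file(MEMCG "/memory.oom_reserve_in_bytes", "2m");

      /* register for oom notifications on the memcg */
      efd = eventfd(0, 0);
      ofd = open(MEMCG "/memory.oom_control", O_RDONLY);
      if (efd < 0 || ofd < 0)
          exit(1);
      snprintf(buf, sizeof(buf), "%d %d", efd, ofd);
      write_file(MEMCG "/cgroup.event_control", buf);

      /* avoid faulting pages from the reserve where possible */
      mlockall(MCL_CURRENT | MCL_FUTURE);

      while (read(efd, &count, sizeof(count)) == sizeof(count)) {
          /* oom: log, collect profiles, kill a task, raise the limit, ... */
      }
      return 0;
  }

The eventfd registration itself is the existing memory.oom_control
notification interface; what this series adds is the reserve, so that the
handler can still make progress once the memcg is actually oom.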
---
Documentation/cgroups/memory.txt | 46 ++++++++-
Documentation/cgroups/resource_counter.txt | 12 +--
Documentation/sysctl/vm.txt | 5 +
arch/m32r/mm/discontig.c | 1 +
include/linux/memcontrol.h | 24 +++++
include/linux/mempolicy.h | 3 +-
include/linux/mmzone.h | 2 +
include/linux/res_counter.h | 16 ++--
include/linux/sched.h | 2 +-
kernel/fork.c | 13 +--
kernel/res_counter.c | 42 ++++++---
mm/memcontrol.c | 144 ++++++++++++++++++++++++++++-
mm/mempolicy.c | 46 ++-------
mm/oom_kill.c | 7 ++
mm/page_alloc.c | 17 +++-
mm/slab.c | 8 +-
mm/slub.c | 2 +-
17 files changed, 292 insertions(+), 98 deletions(-)
--
* [patch 01/11] fork: collapse copy_flags into copy_process
2014-03-05 3:58 [patch 00/11] userspace out of memory handling David Rientjes
@ 2014-03-05 3:58 ` David Rientjes
2014-03-05 3:58 ` [patch 02/11] mm, mempolicy: rename slab_node for clarity David Rientjes
` (11 subsequent siblings)
12 siblings, 0 replies; 33+ messages in thread
From: David Rientjes @ 2014-03-05 3:58 UTC (permalink / raw)
To: Andrew Morton
Cc: Johannes Weiner, Michal Hocko, KAMEZAWA Hiroyuki,
Christoph Lameter, Pekka Enberg, Tejun Heo, Mel Gorman,
Oleg Nesterov, Rik van Riel, Jianguo Wu, Tim Hockin, linux-kernel,
linux-mm, cgroups, linux-doc
copy_flags() does not use its clone_flags formal parameter and can be
collapsed into copy_process() for cleaner code.
Signed-off-by: David Rientjes <rientjes@google.com>
---
kernel/fork.c | 12 ++----------
1 file changed, 2 insertions(+), 10 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1069,15 +1069,6 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
return 0;
}
-static void copy_flags(unsigned long clone_flags, struct task_struct *p)
-{
- unsigned long new_flags = p->flags;
-
- new_flags &= ~(PF_SUPERPRIV | PF_WQ_WORKER);
- new_flags |= PF_FORKNOEXEC;
- p->flags = new_flags;
-}
-
SYSCALL_DEFINE1(set_tid_address, int __user *, tidptr)
{
current->clear_child_tid = tidptr;
@@ -1227,7 +1218,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
goto bad_fork_cleanup_count;
delayacct_tsk_init(p); /* Must remain after dup_task_struct() */
- copy_flags(clone_flags, p);
+ p->flags &= ~(PF_SUPERPRIV | PF_WQ_WORKER);
+ p->flags |= PF_FORKNOEXEC;
INIT_LIST_HEAD(&p->children);
INIT_LIST_HEAD(&p->sibling);
rcu_copy_process(p);
--
* [patch 02/11] mm, mempolicy: rename slab_node for clarity
2014-03-05 3:58 [patch 00/11] userspace out of memory handling David Rientjes
2014-03-05 3:58 ` [patch 01/11] fork: collapse copy_flags into copy_process David Rientjes
@ 2014-03-05 3:58 ` David Rientjes
2014-03-05 3:59 ` [patch 03/11] mm, mempolicy: remove per-process flag David Rientjes
` (10 subsequent siblings)
12 siblings, 0 replies; 33+ messages in thread
From: David Rientjes @ 2014-03-05 3:58 UTC (permalink / raw)
To: Andrew Morton
Cc: Johannes Weiner, Michal Hocko, KAMEZAWA Hiroyuki,
Christoph Lameter, Pekka Enberg, Tejun Heo, Mel Gorman,
Oleg Nesterov, Rik van Riel, Jianguo Wu, Tim Hockin, linux-kernel,
linux-mm, cgroups, linux-doc
slab_node() is actually a mempolicy function, so rename it to
mempolicy_slab_node() to make it clearer that it is used for processes
with mempolicies.
At the same time, clean up its code by saving numa_mem_id() in a local
variable (since we require a node with memory, not just any node) and
remove an obsolete comment that assumes the mempolicy is actually passed
into the function.
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
include/linux/mempolicy.h | 2 +-
mm/mempolicy.c | 15 ++++++---------
mm/slab.c | 4 ++--
mm/slub.c | 2 +-
4 files changed, 10 insertions(+), 13 deletions(-)
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -151,7 +151,7 @@ extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
extern bool init_nodemask_of_mempolicy(nodemask_t *mask);
extern bool mempolicy_nodemask_intersects(struct task_struct *tsk,
const nodemask_t *mask);
-extern unsigned slab_node(void);
+extern unsigned int mempolicy_slab_node(void);
extern enum zone_type policy_zone;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1782,21 +1782,18 @@ static unsigned interleave_nodes(struct mempolicy *policy)
/*
* Depending on the memory policy provide a node from which to allocate the
* next slab entry.
- * @policy must be protected by freeing by the caller. If @policy is
- * the current task's mempolicy, this protection is implicit, as only the
- * task can change it's policy. The system default policy requires no
- * such protection.
*/
-unsigned slab_node(void)
+unsigned int mempolicy_slab_node(void)
{
struct mempolicy *policy;
+ int node = numa_mem_id();
if (in_interrupt())
- return numa_node_id();
+ return node;
policy = current->mempolicy;
if (!policy || policy->flags & MPOL_F_LOCAL)
- return numa_node_id();
+ return node;
switch (policy->mode) {
case MPOL_PREFERRED:
@@ -1816,11 +1813,11 @@ unsigned slab_node(void)
struct zonelist *zonelist;
struct zone *zone;
enum zone_type highest_zoneidx = gfp_zone(GFP_KERNEL);
- zonelist = &NODE_DATA(numa_node_id())->node_zonelists[0];
+ zonelist = &NODE_DATA(node)->node_zonelists[0];
(void)first_zones_zonelist(zonelist, highest_zoneidx,
&policy->v.nodes,
&zone);
- return zone ? zone->node : numa_node_id();
+ return zone ? zone->node : node;
}
default:
diff --git a/mm/slab.c b/mm/slab.c
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3042,7 +3042,7 @@ static void *alternate_node_alloc(struct kmem_cache *cachep, gfp_t flags)
if (cpuset_do_slab_mem_spread() && (cachep->flags & SLAB_MEM_SPREAD))
nid_alloc = cpuset_slab_spread_node();
else if (current->mempolicy)
- nid_alloc = slab_node();
+ nid_alloc = mempolicy_slab_node();
if (nid_alloc != nid_here)
return ____cache_alloc_node(cachep, flags, nid_alloc);
return NULL;
@@ -3074,7 +3074,7 @@ static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
retry_cpuset:
cpuset_mems_cookie = get_mems_allowed();
- zonelist = node_zonelist(slab_node(), flags);
+ zonelist = node_zonelist(mempolicy_slab_node(), flags);
retry:
/*
diff --git a/mm/slub.c b/mm/slub.c
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1685,7 +1685,7 @@ static void *get_any_partial(struct kmem_cache *s, gfp_t flags,
do {
cpuset_mems_cookie = get_mems_allowed();
- zonelist = node_zonelist(slab_node(), flags);
+ zonelist = node_zonelist(mempolicy_slab_node(), flags);
for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
struct kmem_cache_node *n;
--
* [patch 03/11] mm, mempolicy: remove per-process flag
2014-03-05 3:58 [patch 00/11] userspace out of memory handling David Rientjes
2014-03-05 3:58 ` [patch 01/11] fork: collapse copy_flags into copy_process David Rientjes
2014-03-05 3:58 ` [patch 02/11] mm, mempolicy: rename slab_node for clarity David Rientjes
@ 2014-03-05 3:59 ` David Rientjes
2014-03-07 17:20 ` Andi Kleen
2014-03-05 3:59 ` [patch 04/11] mm, memcg: add tunable for oom reserves David Rientjes
` (9 subsequent siblings)
12 siblings, 1 reply; 33+ messages in thread
From: David Rientjes @ 2014-03-05 3:59 UTC (permalink / raw)
To: Andrew Morton
Cc: Johannes Weiner, Michal Hocko, KAMEZAWA Hiroyuki,
Christoph Lameter, Pekka Enberg, Tejun Heo, Mel Gorman,
Oleg Nesterov, Rik van Riel, Jianguo Wu, Tim Hockin, linux-kernel,
linux-mm, cgroups, linux-doc
PF_MEMPOLICY is an unnecessary optimization for CONFIG_SLAB users.
There's no significant performance degradation to checking
current->mempolicy rather than current->flags & PF_MEMPOLICY in the
allocation path, especially since this is considered unlikely().
Running TCP_RR with netperf-2.4.5 through localhost on a 16-cpu machine
with 64GB of memory and without a mempolicy:
threads     before      after
     16    1249409    1244487
     32    1281786    1246783
     48    1239175    1239138
     64    1244642    1241841
     80    1244346    1248918
     96    1266436    1254316
    112    1307398    1312135
    128    1327607    1326502
Per-process flags are a scarce resource, so we should free them up and
make them available whenever possible. We'll be using this bit shortly
for memcg oom reserves.
Signed-off-by: David Rientjes <rientjes@google.com>
---
include/linux/mempolicy.h | 1 -
include/linux/sched.h | 1 -
kernel/fork.c | 1 -
mm/mempolicy.c | 31 -------------------------------
mm/slab.c | 4 ++--
5 files changed, 2 insertions(+), 36 deletions(-)
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -143,7 +143,6 @@ extern void numa_policy_init(void);
extern void mpol_rebind_task(struct task_struct *tsk, const nodemask_t *new,
enum mpol_rebind_step step);
extern void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new);
-extern void mpol_fix_fork_child_flag(struct task_struct *p);
extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
unsigned long addr, gfp_t gfp_flags,
diff --git a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1821,7 +1821,6 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
#define PF_SPREAD_SLAB 0x02000000 /* Spread some slab caches over cpuset */
#define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_allowed */
#define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */
-#define PF_MEMPOLICY 0x10000000 /* Non-default NUMA mempolicy */
#define PF_MUTEX_TESTER 0x20000000 /* Thread belongs to the rt mutex tester */
#define PF_FREEZER_SKIP 0x40000000 /* Freezer should not count it as freezable */
#define PF_SUSPEND_TASK 0x80000000 /* this thread called freeze_processes and should not be frozen */
diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1265,7 +1265,6 @@ static struct task_struct *copy_process(unsigned long clone_flags,
p->mempolicy = NULL;
goto bad_fork_cleanup_cgroup;
}
- mpol_fix_fork_child_flag(p);
#endif
#ifdef CONFIG_CPUSETS
p->cpuset_mem_spread_rotor = NUMA_NO_NODE;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -795,36 +795,6 @@ static int mbind_range(struct mm_struct *mm, unsigned long start,
return err;
}
-/*
- * Update task->flags PF_MEMPOLICY bit: set iff non-default
- * mempolicy. Allows more rapid checking of this (combined perhaps
- * with other PF_* flag bits) on memory allocation hot code paths.
- *
- * If called from outside this file, the task 'p' should -only- be
- * a newly forked child not yet visible on the task list, because
- * manipulating the task flags of a visible task is not safe.
- *
- * The above limitation is why this routine has the funny name
- * mpol_fix_fork_child_flag().
- *
- * It is also safe to call this with a task pointer of current,
- * which the static wrapper mpol_set_task_struct_flag() does,
- * for use within this file.
- */
-
-void mpol_fix_fork_child_flag(struct task_struct *p)
-{
- if (p->mempolicy)
- p->flags |= PF_MEMPOLICY;
- else
- p->flags &= ~PF_MEMPOLICY;
-}
-
-static void mpol_set_task_struct_flag(void)
-{
- mpol_fix_fork_child_flag(current);
-}
-
/* Set the process memory policy */
static long do_set_mempolicy(unsigned short mode, unsigned short flags,
nodemask_t *nodes)
@@ -861,7 +831,6 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags,
}
old = current->mempolicy;
current->mempolicy = new;
- mpol_set_task_struct_flag();
if (new && new->mode == MPOL_INTERLEAVE &&
nodes_weight(new->v.nodes))
current->il_next = first_node(new->v.nodes);
diff --git a/mm/slab.c b/mm/slab.c
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3027,7 +3027,7 @@ out:
#ifdef CONFIG_NUMA
/*
- * Try allocating on another node if PF_SPREAD_SLAB|PF_MEMPOLICY.
+ * Try allocating on another node if PF_SPREAD_SLAB or a mempolicy is set.
*
* If we are in_interrupt, then process context, including cpusets and
* mempolicy, may not apply and should not be used for allocation policy.
@@ -3259,7 +3259,7 @@ __do_cache_alloc(struct kmem_cache *cache, gfp_t flags)
{
void *objp;
- if (unlikely(current->flags & (PF_SPREAD_SLAB | PF_MEMPOLICY))) {
+ if (current->mempolicy || unlikely(current->flags & PF_SPREAD_SLAB)) {
objp = alternate_node_alloc(cache, flags);
if (objp)
goto out;
--
* [patch 04/11] mm, memcg: add tunable for oom reserves
2014-03-05 3:58 [patch 00/11] userspace out of memory handling David Rientjes
` (2 preceding siblings ...)
2014-03-05 3:59 ` [patch 03/11] mm, mempolicy: remove per-process flag David Rientjes
@ 2014-03-05 3:59 ` David Rientjes
2014-03-05 21:17 ` Andrew Morton
2014-03-06 21:04 ` Tejun Heo
2014-03-05 3:59 ` [patch 05/11] res_counter: remove interface for locked charging and uncharging David Rientjes
` (8 subsequent siblings)
12 siblings, 2 replies; 33+ messages in thread
From: David Rientjes @ 2014-03-05 3:59 UTC (permalink / raw)
To: Andrew Morton
Cc: Johannes Weiner, Michal Hocko, KAMEZAWA Hiroyuki,
Christoph Lameter, Pekka Enberg, Tejun Heo, Mel Gorman,
Oleg Nesterov, Rik van Riel, Jianguo Wu, Tim Hockin, linux-kernel,
linux-mm, cgroups, linux-doc
Userspace needs a way to define the amount of memory reserves that
processes handling oom conditions may utilize. This patch adds a per-
memcg oom reserve field and file, memory.oom_reserve_in_bytes, to
manipulate its value.
Attempting to reduce memory.oom_reserve_in_bytes below the amount of
reserve currently in use fails with -EBUSY until some memory is
uncharged.
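As a concrete (illustrative) example of the -EBUSY case: with the limit
set to 128M, an oom reserve of 2M, and current usage at 129M (1M of the
reserve in use), writing any value smaller than 1M to
memory.oom_reserve_in_bytes fails with -EBUSY; it succeeds again once
enough memory has been uncharged that the overcharge no longer exceeds
the new reserve.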
Signed-off-by: David Rientjes <rientjes@google.com>
---
mm/memcontrol.c | 53 +++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 53 insertions(+)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -315,6 +315,9 @@ struct mem_cgroup {
/* OOM-Killer disable */
int oom_kill_disable;
+ /* reserves for handling oom conditions, protected by res.lock */
+ unsigned long long oom_reserve;
+
/* set when res.limit == memsw.limit */
bool memsw_is_minimum;
@@ -5936,6 +5939,51 @@ static int mem_cgroup_oom_control_write(struct cgroup_subsys_state *css,
return 0;
}
+static int mem_cgroup_resize_oom_reserve(struct mem_cgroup *memcg,
+ unsigned long long new_limit)
+{
+ struct res_counter *res = &memcg->res;
+ u64 limit, usage;
+ int ret = 0;
+
+ spin_lock(&res->lock);
+ limit = res->limit;
+ usage = res->usage;
+
+ if (usage > limit && usage - limit > new_limit) {
+ ret = -EBUSY;
+ goto out;
+ }
+
+ memcg->oom_reserve = new_limit;
+out:
+ spin_unlock(&res->lock);
+ return ret;
+}
+
+static u64 mem_cgroup_oom_reserve_read(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ return mem_cgroup_from_css(css)->oom_reserve;
+}
+
+static int mem_cgroup_oom_reserve_write(struct cgroup_subsys_state *css,
+ struct cftype *cft, const char *buffer)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+ unsigned long long val;
+ int ret;
+
+ if (mem_cgroup_is_root(memcg))
+ return -EINVAL;
+
+ ret = res_counter_memparse_write_strategy(buffer, &val);
+ if (ret)
+ return ret;
+
+ return mem_cgroup_resize_oom_reserve(memcg, val);
+}
+
#ifdef CONFIG_MEMCG_KMEM
static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
{
@@ -6291,6 +6339,11 @@ static struct cftype mem_cgroup_files[] = {
.private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL),
},
{
+ .name = "oom_reserve_in_bytes",
+ .read_u64 = mem_cgroup_oom_reserve_read,
+ .write_string = mem_cgroup_oom_reserve_write,
+ },
+ {
.name = "pressure_level",
},
#ifdef CONFIG_NUMA
--
* [patch 05/11] res_counter: remove interface for locked charging and uncharging
2014-03-05 3:58 [patch 00/11] userspace out of memory handling David Rientjes
` (3 preceding siblings ...)
2014-03-05 3:59 ` [patch 04/11] mm, memcg: add tunable for oom reserves David Rientjes
@ 2014-03-05 3:59 ` David Rientjes
2014-03-05 3:59 ` [patch 06/11] res_counter: add interface for maximum nofail charge David Rientjes
` (7 subsequent siblings)
12 siblings, 0 replies; 33+ messages in thread
From: David Rientjes @ 2014-03-05 3:59 UTC (permalink / raw)
To: Andrew Morton
Cc: Johannes Weiner, Michal Hocko, KAMEZAWA Hiroyuki,
Christoph Lameter, Pekka Enberg, Tejun Heo, Mel Gorman,
Oleg Nesterov, Rik van Riel, Jianguo Wu, Tim Hockin, linux-kernel,
linux-mm, cgroups, linux-doc
The res_counter_{charge,uncharge}_locked() variants are not used in the
kernel outside of the resource counter code itself, so remove the
interface.
Signed-off-by: David Rientjes <rientjes@google.com>
---
Documentation/cgroups/resource_counter.txt | 12 ++----------
include/linux/res_counter.h | 6 +-----
kernel/res_counter.c | 23 ++++++++++++-----------
3 files changed, 15 insertions(+), 26 deletions(-)
diff --git a/Documentation/cgroups/resource_counter.txt b/Documentation/cgroups/resource_counter.txt
--- a/Documentation/cgroups/resource_counter.txt
+++ b/Documentation/cgroups/resource_counter.txt
@@ -76,15 +76,7 @@ to work with it.
limit_fail_at parameter is set to the particular res_counter element
where the charging failed.
- d. int res_counter_charge_locked
- (struct res_counter *rc, unsigned long val, bool force)
-
- The same as res_counter_charge(), but it must not acquire/release the
- res_counter->lock internally (it must be called with res_counter->lock
- held). The force parameter indicates whether we can bypass the limit.
-
- e. u64 res_counter_uncharge[_locked]
- (struct res_counter *rc, unsigned long val)
+ d. u64 res_counter_uncharge(struct res_counter *rc, unsigned long val)
When a resource is released (freed) it should be de-accounted
from the resource counter it was accounted to. This is called
@@ -93,7 +85,7 @@ to work with it.
The _locked routines imply that the res_counter->lock is taken.
- f. u64 res_counter_uncharge_until
+ e. u64 res_counter_uncharge_until
(struct res_counter *rc, struct res_counter *top,
unsigned long val)
diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -104,15 +104,13 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent);
* units, e.g. numbers, bytes, Kbytes, etc
*
* returns 0 on success and <0 if the counter->usage will exceed the
- * counter->limit _locked call expects the counter->lock to be taken
+ * counter->limit
*
* charge_nofail works the same, except that it charges the resource
* counter unconditionally, and returns < 0 if the after the current
* charge we are over limit.
*/
-int __must_check res_counter_charge_locked(struct res_counter *counter,
- unsigned long val, bool force);
int __must_check res_counter_charge(struct res_counter *counter,
unsigned long val, struct res_counter **limit_fail_at);
int res_counter_charge_nofail(struct res_counter *counter,
@@ -125,12 +123,10 @@ int res_counter_charge_nofail(struct res_counter *counter,
* @val: the amount of the resource
*
* these calls check for usage underflow and show a warning on the console
- * _locked call expects the counter->lock to be taken
*
* returns the total charges still present in @counter.
*/
-u64 res_counter_uncharge_locked(struct res_counter *counter, unsigned long val);
u64 res_counter_uncharge(struct res_counter *counter, unsigned long val);
u64 res_counter_uncharge_until(struct res_counter *counter,
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -22,8 +22,18 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent)
counter->parent = parent;
}
-int res_counter_charge_locked(struct res_counter *counter, unsigned long val,
- bool force)
+static u64 res_counter_uncharge_locked(struct res_counter *counter,
+ unsigned long val)
+{
+ if (WARN_ON(counter->usage < val))
+ val = counter->usage;
+
+ counter->usage -= val;
+ return counter->usage;
+}
+
+static int res_counter_charge_locked(struct res_counter *counter,
+ unsigned long val, bool force)
{
int ret = 0;
@@ -86,15 +96,6 @@ int res_counter_charge_nofail(struct res_counter *counter, unsigned long val,
return __res_counter_charge(counter, val, limit_fail_at, true);
}
-u64 res_counter_uncharge_locked(struct res_counter *counter, unsigned long val)
-{
- if (WARN_ON(counter->usage < val))
- val = counter->usage;
-
- counter->usage -= val;
- return counter->usage;
-}
-
u64 res_counter_uncharge_until(struct res_counter *counter,
struct res_counter *top,
unsigned long val)
--
* [patch 06/11] res_counter: add interface for maximum nofail charge
2014-03-05 3:58 [patch 00/11] userspace out of memory handling David Rientjes
` (4 preceding siblings ...)
2014-03-05 3:59 ` [patch 05/11] res_counter: remove interface for locked charging and uncharging David Rientjes
@ 2014-03-05 3:59 ` David Rientjes
2014-03-05 3:59 ` [patch 07/11] mm, memcg: allow processes handling oom notifications to access reserves David Rientjes
` (6 subsequent siblings)
12 siblings, 0 replies; 33+ messages in thread
From: David Rientjes @ 2014-03-05 3:59 UTC (permalink / raw)
To: Andrew Morton
Cc: Johannes Weiner, Michal Hocko, KAMEZAWA Hiroyuki,
Christoph Lameter, Pekka Enberg, Tejun Heo, Mel Gorman,
Oleg Nesterov, Rik van Riel, Jianguo Wu, Tim Hockin, linux-kernel,
linux-mm, cgroups, linux-doc
For memcg oom reserves, we'll need a resource counter interface that,
like res_counter_charge_nofail(), does not fail when exceeding the memcg
limit, but only up to a ceiling.
This patch adds res_counter_charge_nofail_max(), which may exceed the
resource counter's limit, but only up to a maximum defined value. If it
fails to charge the resource, it returns -ENOMEM.
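As a worked example (the numbers are illustrative): with a limit of 128M
and max set to a 2M reserve, a charge that would bring usage to 129M
succeeds even though it exceeds the limit, but a further charge that
would bring usage to 131M returns -ENOMEM, since the 3M overcharge would
exceed the 2M maximum; usage is left unchanged in that case.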
Signed-off-by: David Rientjes <rientjes@google.com>
---
include/linux/res_counter.h | 10 +++++++++-
kernel/res_counter.c | 27 +++++++++++++++++++++------
2 files changed, 30 insertions(+), 7 deletions(-)
diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -107,14 +107,22 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent);
* counter->limit
*
* charge_nofail works the same, except that it charges the resource
- * counter unconditionally, and returns < 0 if the after the current
+ * counter unconditionally, and returns < 0 if after the current
* charge we are over limit.
+ *
+ * charge_nofail_max is the same as charge_nofail, except that the
+ * resource counter usage can only exceed the limit by the max
+ * difference. Unlike charge_nofail, charge_nofail_max returns < 0
+ * only if the current charge fails because of the max difference.
*/
int __must_check res_counter_charge(struct res_counter *counter,
unsigned long val, struct res_counter **limit_fail_at);
int res_counter_charge_nofail(struct res_counter *counter,
unsigned long val, struct res_counter **limit_fail_at);
+int res_counter_charge_nofail_max(struct res_counter *counter,
+ unsigned long val, struct res_counter **limit_fail_at,
+ unsigned long max);
/*
* uncharge - tell that some portion of the resource is released
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -33,15 +33,19 @@ static u64 res_counter_uncharge_locked(struct res_counter *counter,
}
static int res_counter_charge_locked(struct res_counter *counter,
- unsigned long val, bool force)
+ unsigned long val, bool force,
+ unsigned long max)
{
int ret = 0;
if (counter->usage + val > counter->limit) {
counter->failcnt++;
- ret = -ENOMEM;
+ if (max == ULONG_MAX)
+ ret = -ENOMEM;
if (!force)
return ret;
+ if (counter->usage + val - counter->limit > max)
+ return -ENOMEM;
}
counter->usage += val;
@@ -51,7 +55,8 @@ static int res_counter_charge_locked(struct res_counter *counter,
}
static int __res_counter_charge(struct res_counter *counter, unsigned long val,
- struct res_counter **limit_fail_at, bool force)
+ struct res_counter **limit_fail_at, bool force,
+ unsigned long max)
{
int ret, r;
unsigned long flags;
@@ -62,7 +67,7 @@ static int __res_counter_charge(struct res_counter *counter, unsigned long val,
local_irq_save(flags);
for (c = counter; c != NULL; c = c->parent) {
spin_lock(&c->lock);
- r = res_counter_charge_locked(c, val, force);
+ r = res_counter_charge_locked(c, val, force, max);
spin_unlock(&c->lock);
if (r < 0 && !ret) {
ret = r;
@@ -87,13 +92,23 @@ static int __res_counter_charge(struct res_counter *counter, unsigned long val,
int res_counter_charge(struct res_counter *counter, unsigned long val,
struct res_counter **limit_fail_at)
{
- return __res_counter_charge(counter, val, limit_fail_at, false);
+ return __res_counter_charge(counter, val, limit_fail_at, false,
+ ULONG_MAX);
}
int res_counter_charge_nofail(struct res_counter *counter, unsigned long val,
struct res_counter **limit_fail_at)
{
- return __res_counter_charge(counter, val, limit_fail_at, true);
+ return __res_counter_charge(counter, val, limit_fail_at, true,
+ ULONG_MAX);
+}
+
+int res_counter_charge_nofail_max(struct res_counter *counter,
+ unsigned long val,
+ struct res_counter **limit_fail_at,
+ unsigned long max)
+{
+ return __res_counter_charge(counter, val, limit_fail_at, true, max);
}
u64 res_counter_uncharge_until(struct res_counter *counter,
--
* [patch 07/11] mm, memcg: allow processes handling oom notifications to access reserves
2014-03-05 3:58 [patch 00/11] userspace out of memory handling David Rientjes
` (5 preceding siblings ...)
2014-03-05 3:59 ` [patch 06/11] res_counter: add interface for maximum nofail charge David Rientjes
@ 2014-03-05 3:59 ` David Rientjes
2014-03-06 21:12 ` Tejun Heo
2014-03-05 3:59 ` [patch 08/11] mm, memcg: add memcg oom reserve documentation David Rientjes
` (5 subsequent siblings)
12 siblings, 1 reply; 33+ messages in thread
From: David Rientjes @ 2014-03-05 3:59 UTC (permalink / raw)
To: Andrew Morton
Cc: Johannes Weiner, Michal Hocko, KAMEZAWA Hiroyuki,
Christoph Lameter, Pekka Enberg, Tejun Heo, Mel Gorman,
Oleg Nesterov, Rik van Riel, Jianguo Wu, Tim Hockin, linux-kernel,
linux-mm, cgroups, linux-doc
Now that a per-process flag is available, define it for processes that
handle userspace oom notifications. This is an optimization to avoid
maintaining a list of such processes attached to a memcg at any given
time and iterating it at charge time.
This flag gets set whenever a process has registered for an oom
notification and is cleared whenever it unregisters.
When memcg reclaim has failed to free any memory, it is necessary for
userspace oom handlers to be able to dip into reserves to fault in their
text, allocate kernel memory to read the "tasks" file, allocate heap, etc.
System oom conditions are not addressed by this patch, but the same per-
process flag can later be used in the page allocator to determine whether
userspace oom handlers should be given access to per-zone memory
reserves, once there is consensus.
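To make the use of the reserve concrete, here is a sketch of what a
handler woken by the notification might do with it -- the memcg path and
the pick-the-last-pid policy are illustrative assumptions, not something
this series prescribes:

  #include <signal.h>
  #include <stdio.h>
  #include <sys/types.h>

  /* return the last pid listed in the memcg's "tasks" file */
  static pid_t pick_victim(const char *tasks_path)
  {
      FILE *f = fopen(tasks_path, "r");
      int pid, victim = -1;

      if (!f)
          return -1;
      while (fscanf(f, "%d", &pid) == 1)  /* faults in heap and text */
          victim = pid;
      fclose(f);
      return victim;
  }

  int main(void)
  {
      pid_t victim = pick_victim("/sys/fs/cgroup/memory/job/tasks");

      if (victim > 0)
          kill(victim, SIGKILL);
      return 0;
  }

Reading the "tasks" file and faulting in the handler's own pages is
exactly the kind of allocation that would otherwise fail or stall once
the memcg is at its limit.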
Signed-off-by: David Rientjes <rientjes@google.com>
---
include/linux/sched.h | 1 +
mm/memcontrol.c | 47 ++++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 47 insertions(+), 1 deletion(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1821,6 +1821,7 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
#define PF_SPREAD_SLAB 0x02000000 /* Spread some slab caches over cpuset */
#define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_allowed */
#define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */
+#define PF_OOM_HANDLER 0x10000000 /* Userspace process handling oom conditions */
#define PF_MUTEX_TESTER 0x20000000 /* Thread belongs to the rt mutex tester */
#define PF_FREEZER_SKIP 0x40000000 /* Freezer should not count it as freezable */
#define PF_SUSPEND_TASK 0x80000000 /* this thread called freeze_processes and should not be frozen */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2633,6 +2633,33 @@ enum {
CHARGE_WOULDBLOCK, /* GFP_WAIT wasn't set and no enough res. */
};
+/*
+ * Processes handling oom conditions are allowed to utilize memory reserves so
+ * that they may handle the condition.
+ */
+static int mem_cgroup_oom_handler_charge(struct mem_cgroup *memcg,
+ unsigned long csize,
+ struct mem_cgroup **mem_over_limit)
+{
+ struct res_counter *fail_res;
+ int ret;
+
+ ret = res_counter_charge_nofail_max(&memcg->res, csize, &fail_res,
+ memcg->oom_reserve);
+ if (!ret && do_swap_account) {
+ ret = res_counter_charge_nofail_max(&memcg->memsw, csize,
+ &fail_res,
+ memcg->oom_reserve);
+ if (ret) {
+ res_counter_uncharge(&memcg->res, csize);
+ *mem_over_limit = mem_cgroup_from_res_counter(fail_res,
+ memsw);
+
+ }
+ }
+ return !ret ? CHARGE_OK : CHARGE_NOMEM;
+}
+
static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
unsigned int nr_pages, unsigned int min_pages,
bool invoke_oom)
@@ -2692,6 +2719,13 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
if (mem_cgroup_wait_acct_move(mem_over_limit))
return CHARGE_RETRY;
+ if (current->flags & PF_OOM_HANDLER) {
+ ret = mem_cgroup_oom_handler_charge(memcg, csize,
+ &mem_over_limit);
+ if (ret == CHARGE_OK)
+ return CHARGE_OK;
+ }
+
if (invoke_oom)
mem_cgroup_oom(mem_over_limit, gfp_mask, get_order(csize));
@@ -2739,7 +2773,8 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
|| fatal_signal_pending(current)))
goto bypass;
- if (unlikely(task_in_memcg_oom(current)))
+ if (unlikely(task_in_memcg_oom(current)) &&
+ !(current->flags & PF_OOM_HANDLER))
goto nomem;
if (gfp_mask & __GFP_NOFAIL)
@@ -5877,6 +5912,11 @@ static int mem_cgroup_oom_register_event(struct mem_cgroup *memcg,
if (!event)
return -ENOMEM;
+ /*
+ * Setting PF_OOM_HANDLER before taking memcg_oom_lock ensures it is
+ * set before getting added to memcg->oom_notify.
+ */
+ current->flags |= PF_OOM_HANDLER;
spin_lock(&memcg_oom_lock);
event->eventfd = eventfd;
@@ -5904,6 +5944,11 @@ static void mem_cgroup_oom_unregister_event(struct mem_cgroup *memcg,
}
}
+ /*
+ * Clearing PF_OOM_HANDLER before dropping memcg_oom_lock ensures it is
+ * cleared before receiving another notification.
+ */
+ current->flags &= ~PF_OOM_HANDLER;
spin_unlock(&memcg_oom_lock);
}
--
* [patch 08/11] mm, memcg: add memcg oom reserve documentation
2014-03-05 3:58 [patch 00/11] userspace out of memory handling David Rientjes
` (6 preceding siblings ...)
2014-03-05 3:59 ` [patch 07/11] mm, memcg: allow processes handling oom notifications to access reserves David Rientjes
@ 2014-03-05 3:59 ` David Rientjes
2014-03-05 3:59 ` [patch 09/11] mm, page_alloc: allow system oom handlers to use memory reserves David Rientjes
` (4 subsequent siblings)
12 siblings, 0 replies; 33+ messages in thread
From: David Rientjes @ 2014-03-05 3:59 UTC (permalink / raw)
To: Andrew Morton
Cc: Johannes Weiner, Michal Hocko, KAMEZAWA Hiroyuki,
Christoph Lameter, Pekka Enberg, Tejun Heo, Mel Gorman,
Oleg Nesterov, Rik van Riel, Jianguo Wu, Tim Hockin, linux-kernel,
linux-mm, cgroups, linux-doc
Add documentation on memcg oom reserves to
Documentation/cgroups/memory.txt and give an example of its usage and
recommended best practices.
Signed-off-by: David Rientjes <rientjes@google.com>
---
Documentation/cgroups/memory.txt | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -71,6 +71,7 @@ Brief summary of control files.
(See sysctl's vm.swappiness)
memory.move_charge_at_immigrate # set/show controls of moving charges
memory.oom_control # set/show oom controls.
+ memory.oom_reserve_in_bytes # set/show limit of oom memory reserves
memory.numa_stat # show the number of memory usage per numa node
memory.kmem.limit_in_bytes # set/show hard limit for kernel memory
@@ -772,6 +773,31 @@ At reading, current status of OOM is shown.
under_oom 0 or 1 (if 1, the memory cgroup is under OOM, tasks may
be stopped.)
+Processes that handle oom conditions in their own memcgs or their child
+memcgs may need to allocate memory themselves to do anything useful,
+including pagefaulting in their text or allocating kernel memory to read
+the memcg "tasks" file. For this reason, memory.oom_reserve_in_bytes is
+provided to specify how much memory processes waiting on
+memory.oom_control can allocate above the memcg limit.
+
+The memcg that the oom handler is attached to is charged for the memory
+that it allocates against its own memory.oom_reserve_in_bytes. This
+memory is therefore only available to processes that are waiting for
+a notification.
+
+For example, if you do
+
+ # echo 2m > memory.oom_reserve_in_bytes
+
+then any process attached to this memcg that is waiting on memcg oom
+notifications anywhere on the system can allocate an additional 2MB
+above memory.limit_in_bytes.
+
+You may still consider doing mlockall(MCL_FUTURE) for processes that
+are waiting on oom notifications to keep this value as minimal as
+possible, or allow it to be large enough so that their text can still
+be pagefaulted in under oom conditions when the value is known.
+
11. Memory Pressure
The pressure level notifications can be used to monitor the memory
--
* [patch 09/11] mm, page_alloc: allow system oom handlers to use memory reserves
2014-03-05 3:58 [patch 00/11] userspace out of memory handling David Rientjes
` (7 preceding siblings ...)
2014-03-05 3:59 ` [patch 08/11] mm, memcg: add memcg oom reserve documentation David Rientjes
@ 2014-03-05 3:59 ` David Rientjes
2014-03-06 21:13 ` Tejun Heo
2014-03-05 3:59 ` [patch 10/11] mm, memcg: add memory.oom_control notification for system oom David Rientjes
` (3 subsequent siblings)
12 siblings, 1 reply; 33+ messages in thread
From: David Rientjes @ 2014-03-05 3:59 UTC (permalink / raw)
To: Andrew Morton
Cc: Johannes Weiner, Michal Hocko, KAMEZAWA Hiroyuki,
Christoph Lameter, Pekka Enberg, Tejun Heo, Mel Gorman,
Oleg Nesterov, Rik van Riel, Jianguo Wu, Tim Hockin, linux-kernel,
linux-mm, cgroups, linux-doc
The root memcg allows unlimited memory charging, so no memory may be
reserved for userspace oom handlers that are responsible for dealing
with system oom conditions.
Instead, this memory must come from per-zone memory reserves. This
allows the memory allocation to succeed, and the memcg charge will
naturally succeed afterwards.
This patch introduces per-zone oom watermarks that aren't really
watermarks in the traditional sense. The oom watermark is the root
memcg's oom reserve proportional to the size of the zone. When a page
allocation is done, the effective watermark is
[min/low/high watermark] - [oom watermark]
For the [min watermark] case, this is effectively the oom reserve.
The low and high watermarks are also adjusted accordingly so that memory
is only allocated from the min reserves when appropriate.
Signed-off-by: David Rientjes <rientjes@google.com>
---
Documentation/cgroups/memory.txt | 9 +++++++++
Documentation/sysctl/vm.txt | 5 +++++
arch/m32r/mm/discontig.c | 1 +
include/linux/memcontrol.h | 13 +++++++++++++
include/linux/mmzone.h | 2 ++
mm/memcontrol.c | 26 +++++++++++++++++++++++++-
mm/page_alloc.c | 17 ++++++++++++++++-
7 files changed, 71 insertions(+), 2 deletions(-)
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -798,6 +798,15 @@ are waiting on oom notifications to keep this value as minimal as
possible, or allow it to be large enough so that their text can still
be pagefaulted in under oom conditions when the value is known.
+For root processes that are responsible for handling system oom
+conditions, this reserve comes from the per-zone watermarks rather than
+exceeding the limit of the root memcg (since the limit of that memcg is
+always infinity). Such processes may allocate into per-zone memory
+reserves proportional to the setting of the root memcg's oom reserve.
+If setting an oom reserve for the root memcg to handle system oom
+conditions, it is recommended that min_free_kbytes (see
+Documentation/sysctl/vm.txt) exceeds this value.
+
11. Memory Pressure
The pressure level notifications can be used to monitor the memory
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -403,6 +403,11 @@ become subtly broken, and prone to deadlock under high loads.
Setting this too high will OOM your machine instantly.
+If root memory controller OOM reserves are configured (see
+Documentation/cgroups/memory.txt), some of this memory may also be
+used for userspace processes that are responsible for handling
+system OOM conditions.
+
=============================================================
min_slab_ratio:
diff --git a/arch/m32r/mm/discontig.c b/arch/m32r/mm/discontig.c
--- a/arch/m32r/mm/discontig.c
+++ b/arch/m32r/mm/discontig.c
@@ -156,6 +156,7 @@ void __init zone_sizes_init(void)
* Use all area of internal RAM.
* see __alloc_pages()
*/
+ NODE_DATA(1)->node_zones->watermark[WMARK_OOM] = 0;
NODE_DATA(1)->node_zones->watermark[WMARK_MIN] = 0;
NODE_DATA(1)->node_zones->watermark[WMARK_LOW] = 0;
NODE_DATA(1)->node_zones->watermark[WMARK_HIGH] = 0;
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -156,6 +156,9 @@ static inline bool task_in_memcg_oom(struct task_struct *p)
bool mem_cgroup_oom_synchronize(bool wait);
+extern bool mem_cgroup_alloc_use_oom_reserve(void);
+extern u64 mem_cgroup_root_oom_reserve(void);
+
#ifdef CONFIG_MEMCG_SWAP
extern int do_swap_account;
#endif
@@ -397,6 +400,16 @@ static inline bool mem_cgroup_oom_synchronize(bool wait)
return false;
}
+static inline bool mem_cgroup_alloc_use_oom_reserve(void)
+{
+ return false;
+}
+
+static inline u64 mem_cgroup_root_oom_reserve(void)
+{
+ return 0;
+}
+
static inline void mem_cgroup_inc_page_stat(struct page *page,
enum mem_cgroup_stat_index idx)
{
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -226,12 +226,14 @@ struct lruvec {
typedef unsigned __bitwise__ isolate_mode_t;
enum zone_watermarks {
+ WMARK_OOM,
WMARK_MIN,
WMARK_LOW,
WMARK_HIGH,
NR_WMARK
};
+#define oom_wmark_pages(z) (z->watermark[WMARK_OOM])
#define min_wmark_pages(z) (z->watermark[WMARK_MIN])
#define low_wmark_pages(z) (z->watermark[WMARK_LOW])
#define high_wmark_pages(z) (z->watermark[WMARK_HIGH])
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6026,7 +6026,31 @@ static int mem_cgroup_oom_reserve_write(struct cgroup_subsys_state *css,
if (ret)
return ret;
- return mem_cgroup_resize_oom_reserve(memcg, val);
+ ret = mem_cgroup_resize_oom_reserve(memcg, val);
+ if (ret)
+ return ret;
+
+ /* Zone oom watermarks need to be reset for root memcg changes */
+ if (memcg == root_mem_cgroup)
+ setup_per_zone_wmarks();
+ return 0;
+}
+
+bool mem_cgroup_alloc_use_oom_reserve(void)
+{
+ bool ret = false;
+
+ rcu_read_lock();
+ if (mem_cgroup_from_task(current) == root_mem_cgroup)
+ ret = true;
+ rcu_read_unlock();
+
+ return ret;
+}
+
+u64 mem_cgroup_root_oom_reserve(void)
+{
+ return root_mem_cgroup->oom_reserve >> PAGE_SHIFT;
}
#ifdef CONFIG_MEMCG_KMEM
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1722,6 +1722,12 @@ bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark,
free_pages);
}
+static bool use_oom_reserves(void)
+{
+ return (current->flags & PF_OOM_HANDLER) && !in_interrupt() &&
+ mem_cgroup_alloc_use_oom_reserve();
+}
+
#ifdef CONFIG_NUMA
/*
* zlc_setup - Setup for "zonelist cache". Uses cached zone data to
@@ -1982,6 +1988,9 @@ zonelist_scan:
goto this_zone_full;
mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
+ if (unlikely(use_oom_reserves()))
+ mark -= min_wmark_pages(zone) - oom_wmark_pages(zone);
+
if (!zone_watermark_ok(zone, order, mark,
classzone_idx, alloc_flags)) {
int ret;
@@ -5595,11 +5604,15 @@ static void __setup_per_zone_wmarks(void)
}
for_each_zone(zone) {
- u64 tmp;
+ u64 tmp, oom;
spin_lock_irqsave(&zone->lock, flags);
tmp = (u64)pages_min * zone->managed_pages;
do_div(tmp, lowmem_pages);
+ oom = mem_cgroup_root_oom_reserve() * zone->managed_pages;
+ do_div(oom, lowmem_pages);
+ if (oom > tmp)
+ oom = tmp;
if (is_highmem(zone)) {
/*
* __GFP_HIGH and PF_MEMALLOC allocations usually don't
@@ -5615,12 +5628,14 @@ static void __setup_per_zone_wmarks(void)
min_pages = zone->managed_pages / 1024;
min_pages = clamp(min_pages, SWAP_CLUSTER_MAX, 128UL);
zone->watermark[WMARK_MIN] = min_pages;
+ zone->watermark[WMARK_OOM] = min_pages;
} else {
/*
* If it's a lowmem zone, reserve a number of pages
* proportionate to the zone's size.
*/
zone->watermark[WMARK_MIN] = tmp;
+ zone->watermark[WMARK_OOM] = oom;
}
zone->watermark[WMARK_LOW] = min_wmark_pages(zone) + (tmp >> 2);
--
* [patch 10/11] mm, memcg: add memory.oom_control notification for system oom
2014-03-05 3:58 [patch 00/11] userspace out of memory handling David Rientjes
` (8 preceding siblings ...)
2014-03-05 3:59 ` [patch 09/11] mm, page_alloc: allow system oom handlers to use memory reserves David Rientjes
@ 2014-03-05 3:59 ` David Rientjes
2014-03-06 21:15 ` Tejun Heo
2014-03-05 3:59 ` [patch 11/11] mm, memcg: allow system oom killer to be disabled David Rientjes
` (2 subsequent siblings)
12 siblings, 1 reply; 33+ messages in thread
From: David Rientjes @ 2014-03-05 3:59 UTC (permalink / raw)
To: Andrew Morton
Cc: Johannes Weiner, Michal Hocko, KAMEZAWA Hiroyuki,
Christoph Lameter, Pekka Enberg, Tejun Heo, Mel Gorman,
Oleg Nesterov, Rik van Riel, Jianguo Wu, Tim Hockin, linux-kernel,
linux-mm, cgroups, linux-doc
Now that processes handling system oom conditions have access to a small
amount of memory reserves, we need a way to notify those processes of
system oom conditions.
When a userspace process waits on the root memcg's memory.oom_control, it
will wake up anytime there is a system oom condition.
This is a special case of oom notification since it doesn't subsequently
notify all memcgs under the root memcg (that is, every memcg on the
system). We don't want to trigger oom handlers that are set aside
specifically for true memcg oom notifications and that, for example,
disable their own oom killers to enforce their own oom policy.
Signed-off-by: David Rientjes <rientjes@google.com>
---
Documentation/cgroups/memory.txt | 11 ++++++-----
include/linux/memcontrol.h | 5 +++++
mm/memcontrol.c | 9 +++++++++
mm/oom_kill.c | 4 ++++
4 files changed, 24 insertions(+), 5 deletions(-)
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -744,18 +744,19 @@ delivery and gets notification when OOM happens.
To register a notifier, an application must:
- create an eventfd using eventfd(2)
- - open memory.oom_control file
+ - open memory.oom_control file for reading
- write string like "<event_fd> <fd of memory.oom_control>" to
cgroup.event_control
-The application will be notified through eventfd when OOM happens.
-OOM notification doesn't work for the root cgroup.
+The application will be notified through eventfd when OOM happens, including
+on system oom when used with the root memcg.
You can disable the OOM-killer by writing "1" to memory.oom_control file, as:
- #echo 1 > memory.oom_control
+ # echo 1 > memory.oom_control
-This operation is only allowed to the top cgroup of a sub-hierarchy.
+This operation is only allowed to the top cgroup of a sub-hierarchy and does
+not include the root memcg.
If OOM-killer is disabled, tasks under cgroup will hang/sleep
in memory cgroup's OOM-waitqueue when they request accountable memory.
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -158,6 +158,7 @@ bool mem_cgroup_oom_synchronize(bool wait);
extern bool mem_cgroup_alloc_use_oom_reserve(void);
extern u64 mem_cgroup_root_oom_reserve(void);
+extern void mem_cgroup_root_oom_notify(void);
#ifdef CONFIG_MEMCG_SWAP
extern int do_swap_account;
@@ -410,6 +411,10 @@ static inline u64 mem_cgroup_root_oom_reserve(void)
return 0;
}
+static inline void mem_cgroup_root_oom_notify(void)
+{
+}
+
static inline void mem_cgroup_inc_page_stat(struct page *page,
enum mem_cgroup_stat_index idx)
{
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5721,6 +5721,15 @@ static void mem_cgroup_oom_notify(struct mem_cgroup *memcg)
mem_cgroup_oom_notify_cb(iter);
}
+/*
+ * Notify any process waiting on the root memcg's memory.oom_control, but do not
+ * notify any child memcgs to avoid triggering their per-memcg oom handlers.
+ */
+void mem_cgroup_root_oom_notify(void)
+{
+ mem_cgroup_oom_notify_cb(root_mem_cgroup);
+}
+
static int __mem_cgroup_usage_register_event(struct mem_cgroup *memcg,
struct eventfd_ctx *eventfd, const char *args, enum res_type type)
{
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -643,6 +643,10 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
return;
}
+ /* Avoid waking up processes for oom kills triggered by sysrq */
+ if (!force_kill)
+ mem_cgroup_root_oom_notify();
+
/*
* Check if there were limitations on the allocation (only relevant for
* NUMA) that may require different handling.
--
* [patch 11/11] mm, memcg: allow system oom killer to be disabled
2014-03-05 3:58 [patch 00/11] userspace out of memory handling David Rientjes
` (9 preceding siblings ...)
2014-03-05 3:59 ` [patch 10/11] mm, memcg: add memory.oom_control notification for system oom David Rientjes
@ 2014-03-05 3:59 ` David Rientjes
2014-03-06 21:15 ` Tejun Heo
2014-03-05 21:17 ` [patch 00/11] userspace out of memory handling Andrew Morton
2014-03-06 20:49 ` Tejun Heo
12 siblings, 1 reply; 33+ messages in thread
From: David Rientjes @ 2014-03-05 3:59 UTC (permalink / raw)
To: Andrew Morton
Cc: Johannes Weiner, Michal Hocko, KAMEZAWA Hiroyuki,
Christoph Lameter, Pekka Enberg, Tejun Heo, Mel Gorman,
Oleg Nesterov, Rik van Riel, Jianguo Wu, Tim Hockin, linux-kernel,
linux-mm, cgroups, linux-doc
Now that system oom conditions can properly be handled from userspace,
allow the oom killer to be disabled. Otherwise, the kernel immediately
kills a process to free memory, while the userspace oom handler may have
a different policy.
Signed-off-by: David Rientjes <rientjes@google.com>
---
Documentation/cgroups/memory.txt | 4 ++--
include/linux/memcontrol.h | 6 ++++++
mm/memcontrol.c | 11 ++++++++---
mm/oom_kill.c | 3 +++
4 files changed, 19 insertions(+), 5 deletions(-)
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -755,8 +755,8 @@ You can disable the OOM-killer by writing "1" to memory.oom_control file, as:
# echo 1 > memory.oom_control
-This operation is only allowed to the top cgroup of a sub-hierarchy and does
-not include the root memcg.
+This operation is only allowed to the top cgroup of a sub-hierarchy. If
+disabled for the root memcg, the system oom killer is disabled.
If OOM-killer is disabled, tasks under cgroup will hang/sleep
in memory cgroup's OOM-waitqueue when they request accountable memory.
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -159,6 +159,7 @@ bool mem_cgroup_oom_synchronize(bool wait);
extern bool mem_cgroup_alloc_use_oom_reserve(void);
extern u64 mem_cgroup_root_oom_reserve(void);
extern void mem_cgroup_root_oom_notify(void);
+extern bool mem_cgroup_root_oom_disable(void);
#ifdef CONFIG_MEMCG_SWAP
extern int do_swap_account;
@@ -415,6 +416,11 @@ static inline void mem_cgroup_root_oom_notify(void)
{
}
+static inline bool mem_cgroup_root_oom_disable(void)
+{
+ return false;
+}
+
static inline void mem_cgroup_inc_page_stat(struct page *page,
enum mem_cgroup_stat_index idx)
{
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5976,13 +5976,13 @@ static int mem_cgroup_oom_control_write(struct cgroup_subsys_state *css,
struct mem_cgroup *memcg = mem_cgroup_from_css(css);
struct mem_cgroup *parent = mem_cgroup_from_css(css_parent(&memcg->css));
- /* cannot set to root cgroup and only 0 and 1 are allowed */
- if (!parent || !((val == 0) || (val == 1)))
+ /* only 0 and 1 are allowed */
+ if (val != !!val)
return -EINVAL;
mutex_lock(&memcg_create_mutex);
/* oom-kill-disable is a flag for subhierarchy. */
- if ((parent->use_hierarchy) || memcg_has_children(memcg)) {
+ if (parent && (parent->use_hierarchy || memcg_has_children(memcg))) {
mutex_unlock(&memcg_create_mutex);
return -EINVAL;
}
@@ -6062,6 +6062,11 @@ u64 mem_cgroup_root_oom_reserve(void)
return root_mem_cgroup->oom_reserve >> PAGE_SHIFT;
}
+bool mem_cgroup_root_oom_disable(void)
+{
+ return root_mem_cgroup->oom_kill_disable;
+}
+
#ifdef CONFIG_MEMCG_KMEM
static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
{
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -656,6 +656,9 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
mpol_mask = (constraint == CONSTRAINT_MEMORY_POLICY) ? nodemask : NULL;
check_panic_on_oom(constraint, gfp_mask, order, mpol_mask);
+ if (mem_cgroup_root_oom_disable())
+ return;
+
if (sysctl_oom_kill_allocating_task && current->mm &&
!oom_unkillable_task(current, NULL, nodemask) &&
current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
--
* Re: [patch 00/11] userspace out of memory handling
2014-03-05 3:58 [patch 00/11] userspace out of memory handling David Rientjes
` (10 preceding siblings ...)
2014-03-05 3:59 ` [patch 11/11] mm, memcg: allow system oom killer to be disabled David Rientjes
@ 2014-03-05 21:17 ` Andrew Morton
2014-03-06 2:52 ` David Rientjes
2014-03-06 20:49 ` Tejun Heo
12 siblings, 1 reply; 33+ messages in thread
From: Andrew Morton @ 2014-03-05 21:17 UTC (permalink / raw)
To: David Rientjes
Cc: Johannes Weiner, Michal Hocko, KAMEZAWA Hiroyuki,
Christoph Lameter, Pekka Enberg, Tejun Heo, Mel Gorman,
Oleg Nesterov, Rik van Riel, Jianguo Wu, Tim Hockin, linux-kernel,
linux-mm, cgroups, linux-doc
On Tue, 4 Mar 2014 19:58:38 -0800 (PST) David Rientjes <rientjes@google.com> wrote:
> This patchset implements userspace out of memory handling.
>
> It is based on v3.14-rc5. Individual patches will apply cleanly or you
> may pull the entire series from
>
> git://git.kernel.org/pub/scm/linux/kernel/git/rientjes/linux.git mm/oom
>
> When the system or a memcg is oom, processes running on that system or
> attached to that memcg cannot allocate memory. It is impossible for a
> process to reliably handle the oom condition from userspace.
>
> First, consider only system oom conditions. When memory is completely
> depleted and nothing may be reclaimed, the kernel is forced to free some
> memory; the only way it can do so is to kill a userspace process. This
> will happen instantaneously and userspace can enforce neither its own
> policy nor collect information.
>
> On system oom, there may be a hierarchy of memcgs that represent user
> jobs, for example. Each job may have a priority independent of their
> current memory usage. There is no existing kernel interface to kill the
> lowest priority job; userspace can now kill the lowest priority job or
> allow priorities to change based on whether the job is using more memory
> than its pre-defined reservation.
>
> Additionally, users may want to log the condition or debug applications
> that are using too much memory. They may wish to collect heap profiles
> or are able to do memory freeing without killing a process by throttling
> or ratelimiting.
>
> Interactive users using X window environments may wish to have a dialogue
> box appear to determine how to proceed -- it may even allow them shell
> access to examine the state of the system while oom.
>
> It's not sufficient to simply restrict all user processes to a subset of
> memory and oom handling processes to the remainder via a memcg hierarchy:
> kernel memory and other page allocations can easily deplete all memory
> that is not charged to a user hierarchy of memory.
>
> This patchset allows userspace to do all of these things by defining a
> small memory reserve that is accessible only by processes that are
> handling the notification.
>
> Second, consider memcg oom conditions. Processes need no special
> knowledge of whether they are attached to the root memcg, where memcg
> charging will always succeed, or a child memcg where charging will fail
> when the limit has been reached. This allows those processes handling
> memcg oom conditions to overcharge the memcg by the amount of reserved
> memory. They need not create child memcgs with smaller limits and
> attach the userspace oom handler only to the parent; such support would
> not allow userspace to handle system oom conditions anyway.
>
> This patchset introduces a standard interface through memcg that allows
> both of these conditions to be handled in the same clean way: users
> define memory.oom_reserve_in_bytes to define the reserve and this
> amount is allowed to be overcharged to the process handling the oom
> condition's memcg. If used with the root memcg, this amount is allowed
> to be allocated below the per-zone watermarks for root processes that
> are handling such conditions (only root may write to
> cgroup.event_control for the root memcg).
If process A is trying to allocate memory, cannot do so, and the
userspace oom-killer is invoked, there must be a means by which process
A waits for the userspace oom-killer's action. And there must be
fallbacks which occur if the userspace oom killer fails to clear the
oom condition, or times out.
Would be interested to see a description of how all this works.
It is unfortunate that this feature is memcg-only. Surely it could
also be used by non-memcg setups. Would like to see at least a
detailed description of how this will all be presented and implemented.
We should aim to make the memcg and non-memcg userspace interfaces and
user-visible behaviour as similar as possible.
Patches 1, 2, 3 and 5 appear to be independent and useful so I think
I'll cherrypick those, OK?
* Re: [patch 04/11] mm, memcg: add tunable for oom reserves
2014-03-05 3:59 ` [patch 04/11] mm, memcg: add tunable for oom reserves David Rientjes
@ 2014-03-05 21:17 ` Andrew Morton
2014-03-06 2:53 ` David Rientjes
2014-03-06 21:04 ` Tejun Heo
1 sibling, 1 reply; 33+ messages in thread
From: Andrew Morton @ 2014-03-05 21:17 UTC (permalink / raw)
To: David Rientjes
Cc: Johannes Weiner, Michal Hocko, KAMEZAWA Hiroyuki,
Christoph Lameter, Pekka Enberg, Tejun Heo, Mel Gorman,
Oleg Nesterov, Rik van Riel, Jianguo Wu, Tim Hockin, linux-kernel,
linux-mm, cgroups, linux-doc
On Tue, 4 Mar 2014 19:59:19 -0800 (PST) David Rientjes <rientjes@google.com> wrote:
> Userspace needs a way to define the amount of memory reserves that
> processes handling oom conditions may utilize. This patch adds a per-
> memcg oom reserve field and file, memory.oom_reserve_in_bytes, to
> manipulate its value.
>
> If currently utilized memory reserves are attempted to be reduced by
> writing a smaller value to memory.oom_reserve_in_bytes, it will fail with
> -EBUSY until some memory is uncharged.
>
> ...
>
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -315,6 +315,9 @@ struct mem_cgroup {
> /* OOM-Killer disable */
> int oom_kill_disable;
>
> + /* reserves for handling oom conditions, protected by res.lock */
> + unsigned long long oom_reserve;
Units? bytes, I assume.
> /* set when res.limit == memsw.limit */
> bool memsw_is_minimum;
>
> @@ -5936,6 +5939,51 @@ static int mem_cgroup_oom_control_write(struct cgroup_subsys_state *css,
> return 0;
> }
>
> +static int mem_cgroup_resize_oom_reserve(struct mem_cgroup *memcg,
> + unsigned long long new_limit)
> +{
> + struct res_counter *res = &memcg->res;
> + u64 limit, usage;
> + int ret = 0;
The code mixes u64's and unsigned long longs in inexplicable ways.
Suggest using u64 throughout.
> + spin_lock(&res->lock);
> + limit = res->limit;
> + usage = res->usage;
> +
> + if (usage > limit && usage - limit > new_limit) {
> + ret = -EBUSY;
> + goto out;
> + }
> +
> + memcg->oom_reserve = new_limit;
> +out:
> + spin_unlock(&res->lock);
> + return ret;
> +}
>
> ...
>
* Re: [patch 00/11] userspace out of memory handling
2014-03-05 21:17 ` [patch 00/11] userspace out of memory handling Andrew Morton
@ 2014-03-06 2:52 ` David Rientjes
2014-03-11 12:03 ` Jianguo Wu
0 siblings, 1 reply; 33+ messages in thread
From: David Rientjes @ 2014-03-06 2:52 UTC (permalink / raw)
To: Andrew Morton
Cc: Johannes Weiner, Michal Hocko, KAMEZAWA Hiroyuki,
Christoph Lameter, Pekka Enberg, Tejun Heo, Mel Gorman,
Oleg Nesterov, Rik van Riel, Jianguo Wu, Tim Hockin, linux-kernel,
linux-mm, cgroups, linux-doc
[-- Attachment #1: Type: TEXT/PLAIN, Size: 5992 bytes --]
On Wed, 5 Mar 2014, Andrew Morton wrote:
> > This patchset introduces a standard interface through memcg that allows
> > both of these conditions to be handled in the same clean way: users
> > define memory.oom_reserve_in_bytes to define the reserve and this
> > amount is allowed to be overcharged to the process handling the oom
> > condition's memcg. If used with the root memcg, this amount is allowed
> > to be allocated below the per-zone watermarks for root processes that
> > are handling such conditions (only root may write to
> > cgroup.event_control for the root memcg).
>
> If process A is trying to allocate memory, cannot do so and the
> userspace oom-killer is invoked, there must be means via which process
> A waits for the userspace oom-killer's action.
It does so by relooping in the page allocator, waiting for memory to be
freed, just as it would if the kernel oom killer were called and process
A were waiting for the oom kill victim, process B, to exit. We don't have
the ability to put it on a waitqueue because we don't touch the freeing
hotpath. The userspace oom handler may not even necessarily kill
anything; it may be able to free its own memory and start throttling
other processes, for example.
> And there must be
> fallbacks which occur if the userspace oom killer fails to clear the
> oom condition, or times out.
>
I agree completely and proposed this before as memory.oom_delay_millisecs
at http://lwn.net/Articles/432226, which we use internally when memory
can't be freed or a memcg's limit cannot be expanded. I guess it makes
more sense alongside the rest of this patchset now; I can add it as an
additional patch next time around.
> Would be interested to see a description of how all this works.
>
There's an article for LWN also being developed on this topic. As
mentioned in that article, I think it would be best to generalize a lot of
the common functions and the eventfd handling entirely into a library.
I've attached an example implementation that just invokes a function to
handle the situation.
For Google's usecase specifically, at the root memcg level (system oom) we
want to do priority based memcg killing. We want to kill from within a
memcg hierarchy that has the lowest priority relative to other memcgs.
This cannot be implemented with /proc/pid/oom_score_adj today. Those
priorities may also change depending on whether a memcg hierarchy is
"overlimit", i.e. its limit has been increased temporarily because it has
hit a memcg oom and additional memory is readily available on the system.
So why not just introduce a memcg tunable that specifies a priority?
Well, it's not that simple. Other users will want to implement different
policies on system oom (think about things like the existing panic_on_oom
or oom_kill_allocating_task sysctls). I introduced oom_kill_allocating_task
originally for SGI because they wanted a fast oom kill rather than an
expensive tasklist scan: the allocating task itself is rather irrelevant;
it was just the unlucky task that was allocating at the moment that oom
was triggered. What's guaranteed is that current in that case will always
free memory from under oom (it's not a member of some other mempolicy or
cpuset that would be needlessly killed). Both sysctls could trivially be
reimplemented in userspace with this feature.
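As an illustration only, the stub handle_oom() in the attached liboom.c
could be replaced with something like the sketch below for the priority
case. The per-memcg "oom.priority" file here is a hypothetical userspace
convention maintained by the job scheduler, not a kernel interface, and
the memory cgroup is assumed to be mounted at /sys/fs/cgroup/memory.

/*
 * Sketch only: priority-based killing from userspace.  "oom.priority" is a
 * hypothetical per-memcg file maintained by the job scheduler (not a kernel
 * interface); "tasks" is the standard cgroup membership file.
 */
#include <dirent.h>
#include <limits.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>

#define MEMCG_ROOT	"/sys/fs/cgroup/memory"

static int read_long(const char *path, long *val)
{
	FILE *f = fopen(path, "r");
	int ret = -1;

	if (!f)
		return -1;
	if (fscanf(f, "%ld", val) == 1)
		ret = 0;
	fclose(f);
	return ret;
}

/* Send SIGKILL to every task attached to the memcg at "dir". */
static void kill_memcg_tasks(const char *dir)
{
	char path[PATH_MAX];
	long pid;
	FILE *f;

	snprintf(path, sizeof(path), "%s/tasks", dir);
	f = fopen(path, "r");
	if (!f)
		return;
	while (fscanf(f, "%ld", &pid) == 1)
		kill((pid_t)pid, SIGKILL);
	fclose(f);
}

/* Replacement for the stub handle_oom() in liboom.c. */
void handle_oom(void)
{
	char victim[PATH_MAX] = "";
	long lowest = LONG_MAX;
	struct dirent *d;
	DIR *root;

	root = opendir(MEMCG_ROOT);
	if (!root)
		return;
	while ((d = readdir(root)) != NULL) {
		char path[PATH_MAX];
		long prio;

		if (d->d_name[0] == '.')
			continue;
		snprintf(path, sizeof(path), MEMCG_ROOT "/%s/oom.priority",
			 d->d_name);
		if (read_long(path, &prio))
			continue;	/* not a job memcg */
		if (prio < lowest) {
			lowest = prio;
			snprintf(victim, sizeof(victim), MEMCG_ROOT "/%s",
				 d->d_name);
		}
	}
	closedir(root);
	if (victim[0])
		kill_memcg_tasks(victim);
}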
I have other customers who don't run in a memcg environment at all: they
simply reattach all processes to root and delete all other memcgs. These
customers are only concerned about system oom conditions and want to do
something "interesting" before a process is killed. Some want to log the
VM statistics as an artifact to examine later, some want to examine heap
profiles, others can start throttling and freeing memory rather than kill
anything. All of this is impossible today because the kernel oom killer
will simply kill something immediately and any stats we collect afterwards
don't represent the oom condition. The heap profiles are lost, throttling
is useless, etc.
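For that logging use case, handle_oom() could be as simple as the sketch
below, which snapshots a few procfs files while the system is still oom
(the output path and file list are illustrative only):

/* Sketch only: collect an artifact at oom time instead of killing. */
#include <stdio.h>
#include <time.h>

static void dump_file(const char *path, FILE *out)
{
	char buf[4096];
	size_t n;
	FILE *in = fopen(path, "r");

	if (!in)
		return;
	fprintf(out, "==> %s <==\n", path);
	while ((n = fread(buf, 1, sizeof(buf), in)) > 0)
		fwrite(buf, 1, n, out);
	fclose(in);
}

/* Alternative handle_oom() for users who only want to log the condition. */
void handle_oom(void)
{
	char name[64];
	FILE *out;

	snprintf(name, sizeof(name), "/var/log/oom-%ld.log", (long)time(NULL));
	out = fopen(name, "w");
	if (!out)
		return;
	dump_file("/proc/meminfo", out);
	dump_file("/proc/vmstat", out);
	dump_file("/proc/slabinfo", out);
	fclose(out);
}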
Jianguo (cc'd) may also have usecases not described here.
> It is unfortunate that this feature is memcg-only. Surely it could
> also be used by non-memcg setups. Would like to see at least a
> detailed description of how this will all be presented and implemented.
> We should aim to make the memcg and non-memcg userspace interfaces and
> user-visible behaviour as similar as possible.
>
It's memcg-only because it can handle both system and memcg oom conditions
with the same clean interface. It would be possible to implement only
system oom condition handling through procfs (a little sloppy since it
needs to register the eventfd), but then a userspace oom handler would need
to determine which interface to use based on whether it was running in a
memcg or non-memcg environment. I implemented this feature with userspace
in mind: I didn't want it to need two different implementations to do the
same thing depending on memcg. The way it is written, a userspace oom
handler does not know (nor need care) whether it is constrained by the
amount of system RAM or by a memcg limit. It can simply write the reserve
to its memcg's memory.oom_reserve_in_bytes, attach to memory.oom_control,
and be done.
This does mean that memcg needs to be enabled for the support, though.
This is already done on most distributions; the cgroup just needs to be
mounted. Would it be better to duplicate the interface in two different
spots depending on CONFIG_MEMCG? I didn't think so, and I think the idea
of a userspace library that takes care of this registration (and mounting,
perhaps) proposed on LWN would be the best of both worlds.
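The one step the attached example does not show is writing the reserve
itself; a minimal sketch, assuming the memory cgroup is mounted at
/sys/fs/cgroup/memory and picking an arbitrary 32MB reserve:

/* Sketch only: configure the handler's reserve before waiting on the eventfd. */
#include <limits.h>
#include <stdio.h>

int set_oom_reserve(const char *memcg, unsigned long long bytes)
{
	char path[PATH_MAX];
	FILE *f;

	snprintf(path, sizeof(path), "%s/memory.oom_reserve_in_bytes", memcg);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%llu", bytes);
	return fclose(f);
}

/* e.g. set_oom_reserve("/sys/fs/cgroup/memory", 32ULL << 20); */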
> Patches 1, 2, 3 and 5 appear to be independent and useful so I think
> I'll cherrypick those, OK?
>
Ok! I'm hoping that the PF_MEMPOLICY bit that is removed in those patches
is at least temporarily reserved for the PF_OOM_HANDLER introduced here; I
removed it purposefully :)
[-- Attachment #2: Type: TEXT/x-csrc; name=liboom.c, Size: 2431 bytes --]
/*
 * liboom.c - example userspace oom handler
 *
 * Registers an eventfd on <memcg>/memory.oom_control through
 * cgroup.event_control and invokes a handler for every notification.
 */
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/eventfd.h>
#include <sys/mman.h>
#include <sys/types.h>

#define STRING_MAX (512)

void handle_oom(void)
{
	printf("notification received\n");
}

/* Block on the eventfd and call the handler each time it fires. */
int wait_oom_notifier(int eventfd_fd, void (*handler)(void))
{
	uint64_t ret;
	int err;

	for (;;) {
		err = read(eventfd_fd, &ret, sizeof(ret));
		if (err != sizeof(ret)) {
			fprintf(stderr, "read()\n");
			return err;
		}
		handler();
	}
}

/*
 * Register an eventfd for <memcg>/memory.oom_control via
 * <memcg>/cgroup.event_control.  Returns the eventfd on success or a
 * negative value on failure.
 */
int register_oom_notifier(const char *memcg)
{
	char path[PATH_MAX];
	char control_string[STRING_MAX];
	int event_control_fd;
	int control_fd;
	int eventfd_fd;
	int err = 0;

	err = snprintf(path, PATH_MAX, "%s/memory.oom_control", memcg);
	if (err < 0) {
		fprintf(stderr, "snprintf()\n");
		goto out;
	}
	control_fd = open(path, O_RDONLY);
	if (control_fd == -1) {
		fprintf(stderr, "open(): %d\n", errno);
		err = -errno;
		goto out;
	}
	eventfd_fd = eventfd(0, 0);
	if (eventfd_fd == -1) {
		fprintf(stderr, "eventfd(): %d\n", errno);
		err = -errno;
		goto out_close_control;
	}
	err = snprintf(control_string, STRING_MAX, "%d %d", eventfd_fd,
		       control_fd);
	if (err < 0) {
		fprintf(stderr, "snprintf()\n");
		goto out_close_eventfd;
	}
	err = snprintf(path, PATH_MAX, "%s/cgroup.event_control", memcg);
	if (err < 0) {
		fprintf(stderr, "snprintf()\n");
		goto out_close_eventfd;
	}
	event_control_fd = open(path, O_WRONLY);
	if (event_control_fd == -1) {
		fprintf(stderr, "open(): %d\n", errno);
		err = -errno;
		goto out_close_eventfd;
	}
	if (write(event_control_fd, control_string,
		  strlen(control_string)) < 0) {
		fprintf(stderr, "write(): %d\n", errno);
		err = -errno;
		close(event_control_fd);
		goto out_close_eventfd;
	}
	close(event_control_fd);
	return eventfd_fd;
out_close_eventfd:
	close(eventfd_fd);
out_close_control:
	close(control_fd);
out:
	return err;
}

int main(int argc, char **argv)
{
	int eventfd_fd;
	int err = 0;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <path>\n", argv[0]);
		return -1;
	}
	/* Lock future mappings so the handler faults in as little as possible while oom. */
	err = mlockall(MCL_FUTURE);
	if (err) {
		fprintf(stderr, "%d\n", errno);
		return -1;
	}
	eventfd_fd = register_oom_notifier(argv[1]);
	if (eventfd_fd < 0) {
		fprintf(stderr, "register_oom_notifier(): %d\n", eventfd_fd);
		err = eventfd_fd;
		goto out;
	}
	err = wait_oom_notifier(eventfd_fd, handle_oom);
	if (err) {
		fprintf(stderr, "wait_oom_notifier()\n");
		goto out;
	}
out:
	munlockall();
	return err;
}
* Re: [patch 04/11] mm, memcg: add tunable for oom reserves
2014-03-05 21:17 ` Andrew Morton
@ 2014-03-06 2:53 ` David Rientjes
0 siblings, 0 replies; 33+ messages in thread
From: David Rientjes @ 2014-03-06 2:53 UTC (permalink / raw)
To: Andrew Morton
Cc: Johannes Weiner, Michal Hocko, KAMEZAWA Hiroyuki,
Christoph Lameter, Pekka Enberg, Tejun Heo, Mel Gorman,
Oleg Nesterov, Rik van Riel, Jianguo Wu, Tim Hockin, linux-kernel,
linux-mm, cgroups, linux-doc
On Wed, 5 Mar 2014, Andrew Morton wrote:
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -315,6 +315,9 @@ struct mem_cgroup {
> > /* OOM-Killer disable */
> > int oom_kill_disable;
> >
> > + /* reserves for handling oom conditions, protected by res.lock */
> > + unsigned long long oom_reserve;
>
> Units? bytes, I assume.
>
Yes, fixed.
> > /* set when res.limit == memsw.limit */
> > bool memsw_is_minimum;
> >
> > @@ -5936,6 +5939,51 @@ static int mem_cgroup_oom_control_write(struct cgroup_subsys_state *css,
> > return 0;
> > }
> >
> > +static int mem_cgroup_resize_oom_reserve(struct mem_cgroup *memcg,
> > + unsigned long long new_limit)
> > +{
> > + struct res_counter *res = &memcg->res;
> > + u64 limit, usage;
> > + int ret = 0;
>
> The code mixes u64's and unsigned long longs in inexplicable ways.
> Suggest using u64 throughout.
>
Ok!
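For instance, the function quoted above would become something like the
following with u64 used throughout (sketch only, untested), and the
oom_reserve field itself would be declared as u64:

static int mem_cgroup_resize_oom_reserve(struct mem_cgroup *memcg,
					 u64 new_limit)
{
	struct res_counter *res = &memcg->res;
	u64 limit, usage;
	int ret = 0;

	spin_lock(&res->lock);
	limit = res->limit;
	usage = res->usage;

	if (usage > limit && usage - limit > new_limit) {
		ret = -EBUSY;
		goto out;
	}

	memcg->oom_reserve = new_limit;
out:
	spin_unlock(&res->lock);
	return ret;
}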
* Re: [patch 00/11] userspace out of memory handling
2014-03-05 3:58 [patch 00/11] userspace out of memory handling David Rientjes
` (11 preceding siblings ...)
2014-03-05 21:17 ` [patch 00/11] userspace out of memory handling Andrew Morton
@ 2014-03-06 20:49 ` Tejun Heo
2014-03-06 20:55 ` David Rientjes
12 siblings, 1 reply; 33+ messages in thread
From: Tejun Heo @ 2014-03-06 20:49 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, Johannes Weiner, Michal Hocko, KAMEZAWA Hiroyuki,
Christoph Lameter, Pekka Enberg, Mel Gorman, Oleg Nesterov,
Rik van Riel, Jianguo Wu, Tim Hockin, linux-kernel, linux-mm,
cgroups, linux-doc
On Tue, Mar 04, 2014 at 07:58:38PM -0800, David Rientjes wrote:
> This patchset implements userspace out of memory handling.
>
> It is based on v3.14-rc5. Individual patches will apply cleanly or you
> may pull the entire series from
>
> git://git.kernel.org/pub/scm/linux/kernel/git/rientjes/linux.git mm/oom
>
> When the system or a memcg is oom, processes running on that system or
> attached to that memcg cannot allocate memory. It is impossible for a
> process to reliably handle the oom condition from userspace.
ISTR the conclusion last time was nack on the whole approach. What
changed between then and now? I can't detect any fundamental changes
from the description.
Thanks.
--
tejun
* Re: [patch 00/11] userspace out of memory handling
2014-03-06 20:49 ` Tejun Heo
@ 2014-03-06 20:55 ` David Rientjes
2014-03-06 20:59 ` Tejun Heo
0 siblings, 1 reply; 33+ messages in thread
From: David Rientjes @ 2014-03-06 20:55 UTC (permalink / raw)
To: Tejun Heo
Cc: Andrew Morton, Johannes Weiner, Michal Hocko, KAMEZAWA Hiroyuki,
Christoph Lameter, Pekka Enberg, Mel Gorman, Oleg Nesterov,
Rik van Riel, Jianguo Wu, Tim Hockin, linux-kernel, linux-mm,
cgroups, linux-doc
On Thu, 6 Mar 2014, Tejun Heo wrote:
> On Tue, Mar 04, 2014 at 07:58:38PM -0800, David Rientjes wrote:
> > This patchset implements userspace out of memory handling.
> >
> > It is based on v3.14-rc5. Individual patches will apply cleanly or you
> > may pull the entire series from
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/rientjes/linux.git mm/oom
> >
> > When the system or a memcg is oom, processes running on that system or
> > attached to that memcg cannot allocate memory. It is impossible for a
> > process to reliably handle the oom condition from userspace.
>
> ISTR the conclusion last time was nack on the whole approach. What
> changed between then and now? I can't detect any fundamental changes
> from the description.
>
This includes system oom handling alongside memcg oom handling. If you
have specific objections, please let us know, thanks!
* Re: [patch 00/11] userspace out of memory handling
2014-03-06 20:55 ` David Rientjes
@ 2014-03-06 20:59 ` Tejun Heo
2014-03-06 21:08 ` David Rientjes
0 siblings, 1 reply; 33+ messages in thread
From: Tejun Heo @ 2014-03-06 20:59 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, Johannes Weiner, Michal Hocko, KAMEZAWA Hiroyuki,
Christoph Lameter, Pekka Enberg, Mel Gorman, Oleg Nesterov,
Rik van Riel, Jianguo Wu, Tim Hockin, linux-kernel, linux-mm,
cgroups, linux-doc
On Thu, Mar 06, 2014 at 12:55:43PM -0800, David Rientjes wrote:
> > ISTR the conclusion last time was nack on the whole approach. What
> > changed between then and now? I can't detect any fundamental changes
> > from the description.
> >
>
> This includes system oom handling alongside memcg oom handling. If you
> have specific objections, please let us know, thanks!
Umm, that wasn't the bulk of the objection, was it? We were discussing
the whole premise of userland oom handling and the conclusion, at
best, was that you couldn't show that it was actually necessary and
most other people disliked the idea. Just changing a part of it and
resubmitting doesn't really change the whole situation. If you want
to continue the discussion on the basic approach, please do continue
that on the original thread so that we don't lose the context. I'm
gonna nack the respective patches so that they don't get picked up by
accident for now.
Thanks.
--
tejun
* Re: [patch 04/11] mm, memcg: add tunable for oom reserves
2014-03-05 3:59 ` [patch 04/11] mm, memcg: add tunable for oom reserves David Rientjes
2014-03-05 21:17 ` Andrew Morton
@ 2014-03-06 21:04 ` Tejun Heo
1 sibling, 0 replies; 33+ messages in thread
From: Tejun Heo @ 2014-03-06 21:04 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, Johannes Weiner, Michal Hocko, KAMEZAWA Hiroyuki,
Christoph Lameter, Pekka Enberg, Mel Gorman, Oleg Nesterov,
Rik van Riel, Jianguo Wu, Tim Hockin, linux-kernel, linux-mm,
cgroups, linux-doc
On Tue, Mar 04, 2014 at 07:59:19PM -0800, David Rientjes wrote:
> Userspace needs a way to define the amount of memory reserves that
> processes handling oom conditions may utilize. This patch adds a per-
> memcg oom reserve field and file, memory.oom_reserve_in_bytes, to
> manipulate its value.
>
> If currently utilized memory reserves are attempted to be reduced by
> writing a smaller value to memory.oom_reserve_in_bytes, it will fail with
> -EBUSY until some memory is uncharged.
>
> Signed-off-by: David Rientjes <rientjes@google.com>
We're completely unsure this is the way we wanna be headed and this is
a huge commitment. For now at least,
Nacked-by: Tejun Heo <tj@kernel.org>
--
tejun
* Re: [patch 00/11] userspace out of memory handling
2014-03-06 20:59 ` Tejun Heo
@ 2014-03-06 21:08 ` David Rientjes
2014-03-06 21:11 ` Tejun Heo
0 siblings, 1 reply; 33+ messages in thread
From: David Rientjes @ 2014-03-06 21:08 UTC (permalink / raw)
To: Tejun Heo
Cc: Andrew Morton, Johannes Weiner, Michal Hocko, KAMEZAWA Hiroyuki,
Christoph Lameter, Pekka Enberg, Mel Gorman, Oleg Nesterov,
Rik van Riel, Jianguo Wu, Tim Hockin, linux-kernel, linux-mm,
cgroups, linux-doc
On Thu, 6 Mar 2014, Tejun Heo wrote:
> > This includes system oom handling alongside memcg oom handling. If you
> > have specific objections, please let us know, thanks!
>
> Umm, that wasn't the bulk of the objection, was it? We were discussing
> the whole premise of userland oom handling and the conclusion, at
> best, was that you couldn't show that it was actually necessary and
> most other people disliked the idea.
I'm not sure how you reach that conclusion: it's necessary because any
process handling the oom condition will need memory to do anything useful.
How else would a process that is handling a system oom condition, for
example, be able to obtain a list of processes, check memory usage, issue
a kill, do any logging, collect heap or smaps samples, or signal processes
to throttle incoming requests without having access to memory itself? The
system is oom.
> Just changing a part of it and
> resubmitting doesn't really change the whole situation. If you want
> to continue the discussion on the basic approach, please do continue
> that on the original thread so that we don't lose the context. I'm
> gonna nack the respective patches so that they don't get picked up by
> accident for now.
>
This is going to be discussed at the LSF/mm conference; I believe it would
be helpful to have an actual complete patchset proposed so that it can be
discussed properly. I feel no need to refer to an older patchset that
would not apply and did not include all the support necessary for handling
oom conditions.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [patch 00/11] userspace out of memory handling
2014-03-06 21:08 ` David Rientjes
@ 2014-03-06 21:11 ` Tejun Heo
2014-03-06 21:23 ` David Rientjes
0 siblings, 1 reply; 33+ messages in thread
From: Tejun Heo @ 2014-03-06 21:11 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, Johannes Weiner, Michal Hocko, KAMEZAWA Hiroyuki,
Christoph Lameter, Pekka Enberg, Mel Gorman, Oleg Nesterov,
Rik van Riel, Jianguo Wu, Tim Hockin, linux-kernel, linux-mm,
cgroups, linux-doc
Hello, David.
On Thu, Mar 06, 2014 at 01:08:10PM -0800, David Rientjes wrote:
> I'm not sure how you reach that conclusion: it's necessary because any
> process handling the oom condition will need memory to do anything useful.
> How else would a process that is handling a system oom condition, for
> example, be able to obtain a list of processes, check memory usage, issue
> a kill, do any logging, collect heap or smaps samples, or signal processes
> to throttle incoming requests without having access to memory itself? The
> system is oom.
We're now just re-starting the whole discussion with all context lost.
How is this a good idea? We talked about all this previously. If you
have something to add, *please* add it there so that other people can
track it too.
> This is going to be discussed at the LSF/mm conference, I believe it would
> be helpful to have an actual complete patchset proposed so that it can be
> discussed properly. I feel no need to refer to an older patchset that
> would not apply and did not include all the support necessary for handling
> oom conditions.
That's completely fine but if that's your intention please at least
prefix the patchset with RFC and explicitly state that no consensus
has been reached (well, it was more like negative consensus from what
I remember) in the description so that it can't be picked up
accidentally.
Thanks.
--
tejun
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [patch 07/11] mm, memcg: allow processes handling oom notifications to access reserves
2014-03-05 3:59 ` [patch 07/11] mm, memcg: allow processes handling oom notifications to access reserves David Rientjes
@ 2014-03-06 21:12 ` Tejun Heo
0 siblings, 0 replies; 33+ messages in thread
From: Tejun Heo @ 2014-03-06 21:12 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, Johannes Weiner, Michal Hocko, KAMEZAWA Hiroyuki,
Christoph Lameter, Pekka Enberg, Mel Gorman, Oleg Nesterov,
Rik van Riel, Jianguo Wu, Tim Hockin, linux-kernel, linux-mm,
cgroups, linux-doc
On Tue, Mar 04, 2014 at 07:59:29PM -0800, David Rientjes wrote:
> Now that a per-process flag is available, define it for processes that
> handle userspace oom notifications. This is an optimization to avoid
> maintaining a list of such processes attached to a memcg at any given time
> and iterating it at charge time.
>
> This flag gets set whenever a process has registered for an oom
> notification and is cleared whenever it unregisters.
>
> When memcg reclaim has failed to free any memory, it is necessary for
> userspace oom handlers to be able to dip into reserves to pagefault text,
> allocate kernel memory to read the "tasks" file, allocate heap, etc.
>
> System oom conditions are not addressed at this time, but the same per-
> process flag can be used in the page allocator to determine if access
> should be given to userspace oom handlers to per-zone memory reserves at
> a later time once there is consensus.
>
> Signed-off-by: David Rientjes <rientjes@google.com>
Until consensus on the whole approach can be reached,
Nacked-by: Tejun Heo <tj@kernel.org>
Thanks.
--
tejun
* Re: [patch 09/11] mm, page_alloc: allow system oom handlers to use memory reserves
2014-03-05 3:59 ` [patch 09/11] mm, page_alloc: allow system oom handlers to use memory reserves David Rientjes
@ 2014-03-06 21:13 ` Tejun Heo
0 siblings, 0 replies; 33+ messages in thread
From: Tejun Heo @ 2014-03-06 21:13 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, Johannes Weiner, Michal Hocko, KAMEZAWA Hiroyuki,
Christoph Lameter, Pekka Enberg, Mel Gorman, Oleg Nesterov,
Rik van Riel, Jianguo Wu, Tim Hockin, linux-kernel, linux-mm,
cgroups, linux-doc
On Tue, Mar 04, 2014 at 07:59:35PM -0800, David Rientjes wrote:
> The root memcg allows unlimited memory charging, so no memory may be
> reserved for userspace oom handlers that are responsible for dealing
> with system oom conditions.
>
> Instead, this memory must come from per-zone memory reserves. This
> allows the memory allocation to succeed, and the memcg charge will
> naturally succeed afterwards.
>
> This patch introduces per-zone oom watermarks that aren't really
> watermarks in the traditional sense. The oom watermark is the root
> memcg's oom reserve proportional to the size of the zone. When a page
> allocation is done, the effective watermark is
>
> [min/low/high watermark] - [oom watermark]
>
> For the [min watermark] case, this is effectively the oom reserve.
> However, it also adjusts the low and high watermark accordingly so
> memory is actually only allocated from min reserves when appropriate.
>
> Signed-off-by: David Rientjes <rientjes@google.com>
Until consensus on the whole approach can be reached,
Nacked-by: Tejun Heo <tj@kernel.org>
Thanks.
--
tejun
* Re: [patch 10/11] mm, memcg: add memory.oom_control notification for system oom
2014-03-05 3:59 ` [patch 10/11] mm, memcg: add memory.oom_control notification for system oom David Rientjes
@ 2014-03-06 21:15 ` Tejun Heo
0 siblings, 0 replies; 33+ messages in thread
From: Tejun Heo @ 2014-03-06 21:15 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, Johannes Weiner, Michal Hocko, KAMEZAWA Hiroyuki,
Christoph Lameter, Pekka Enberg, Mel Gorman, Oleg Nesterov,
Rik van Riel, Jianguo Wu, Tim Hockin, linux-kernel, linux-mm,
cgroups, linux-doc
On Tue, Mar 04, 2014 at 07:59:41PM -0800, David Rientjes wrote:
> Now that processes handling system oom conditions have access to a small
> amount of memory reserves, we need a way to notify those processes of
> system oom conditions.
>
> When a userspace process waits on the root memcg's memory.oom_control, it
> will wake up anytime there is a system oom condition.
>
> This is a special case of oom notifiers since it doesn't subsequently
> notify all memcgs under the root memcg (all memcgs on the system). We
> don't want to trigger those oom handlers which are set aside specifically
> for true memcg oom notifications that disable their own oom killers to
> enforce their own oom policy, for example.
>
> Signed-off-by: David Rientjes <rientjes@google.com>
Until consensus on the whole approach can be reached,
Nacked-by: Tejun Heo <tj@kernel.org>
Thanks.
--
tejun
* Re: [patch 11/11] mm, memcg: allow system oom killer to be disabled
2014-03-05 3:59 ` [patch 11/11] mm, memcg: allow system oom killer to be disabled David Rientjes
@ 2014-03-06 21:15 ` Tejun Heo
0 siblings, 0 replies; 33+ messages in thread
From: Tejun Heo @ 2014-03-06 21:15 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, Johannes Weiner, Michal Hocko, KAMEZAWA Hiroyuki,
Christoph Lameter, Pekka Enberg, Mel Gorman, Oleg Nesterov,
Rik van Riel, Jianguo Wu, Tim Hockin, linux-kernel, linux-mm,
cgroups, linux-doc
On Tue, Mar 04, 2014 at 07:59:46PM -0800, David Rientjes wrote:
> Now that system oom conditions can properly be handled from userspace,
> allow the oom killer to be disabled. Otherwise, the kernel will
> immediately kill a process and memory will be freed. The userspace oom
> handler may have a different policy.
>
> Signed-off-by: David Rientjes <rientjes@google.com>
Until consensus on the whole approach can be reached,
Nacked-by: Tejun Heo <tj@kernel.org>
Thanks.
--
tejun
* Re: [patch 00/11] userspace out of memory handling
2014-03-06 21:11 ` Tejun Heo
@ 2014-03-06 21:23 ` David Rientjes
2014-03-06 21:29 ` Tejun Heo
2014-03-06 21:33 ` Tejun Heo
0 siblings, 2 replies; 33+ messages in thread
From: David Rientjes @ 2014-03-06 21:23 UTC (permalink / raw)
To: Tejun Heo
Cc: Andrew Morton, Johannes Weiner, Michal Hocko, KAMEZAWA Hiroyuki,
Christoph Lameter, Pekka Enberg, Mel Gorman, Oleg Nesterov,
Rik van Riel, Jianguo Wu, Tim Hockin, linux-kernel, linux-mm,
cgroups, linux-doc
On Thu, 6 Mar 2014, Tejun Heo wrote:
> > I'm not sure how you reach that conclusion: it's necessary because any
> > process handling the oom condition will need memory to do anything useful.
> > How else would a process that is handling a system oom condition, for
> > example, be able to obtain a list of processes, check memory usage, issue
> > a kill, do any logging, collect heap or smaps samples, or signal processes
> > to throttle incoming requests without having access to memory itself? The
> > system is oom.
>
> We're now just re-starting the whole discussion with all context lost.
> How is this a good idea? We talked about all this previously. If you
> have something to add, add there *please* so that other people can
> track it too.
>
I'm referring to system oom handling as an example above, in case you
missed my earlier email a few minutes ago: the previous patchset did not
include support for system oom handling. Nothing that I wrote above was
possible with the first patchset. This is the complete support.
> That's completely fine but if that's your intention please at least
> prefix the patchset with RFC and explicitly state that no consensus
> has been reached (well, it was more like negative consensus from what
> I remember) in the description so that it can't be picked up
> accidentally.
>
This patchset provides a solution to a real-world problem that is not
solved with any other patchset. I expect it to be reviewed like any other
patchset; it's not an "RFC" from my perspective: it's a proposal for
inclusion. Don't worry, Andrew is not going to apply anything
accidentally.
* Re: [patch 00/11] userspace out of memory handling
2014-03-06 21:23 ` David Rientjes
@ 2014-03-06 21:29 ` Tejun Heo
2014-03-06 21:33 ` Tejun Heo
1 sibling, 0 replies; 33+ messages in thread
From: Tejun Heo @ 2014-03-06 21:29 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, Johannes Weiner, Michal Hocko, KAMEZAWA Hiroyuki,
Christoph Lameter, Pekka Enberg, Mel Gorman, Oleg Nesterov,
Rik van Riel, Jianguo Wu, Tim Hockin, linux-kernel, linux-mm,
cgroups, linux-doc
On Thu, Mar 06, 2014 at 01:23:57PM -0800, David Rientjes wrote:
> I'm referring to system oom handling as an example above, in case you
> missed my earlier email a few minutes ago: the previous patchset did not
> include support for system oom handling. Nothing that I wrote above was
> possible with the first patchset. This is the complete support.
But we were talking about system oom handling. Yes, the patch didn't
exist back then but the fundamental premises stay unchanged. There's
no point in restarting the whole thread. You can refer to this
patchset from that thread. It's a logical thing to do. We have all
the context there. I don't really understand why you're resisting it.
It doesn't change the basis of the discussion. The issues brought up
before should still be addressed and it only makes sense to retain the
context.
If you have more to add, including the existence of this
implementation, let's please talk in the original thread. It was long
thread with a lot of points raised. Let's please not replay that
whole thread here unnecessarily.
Thanks.
--
tejun
* Re: [patch 00/11] userspace out of memory handling
2014-03-06 21:23 ` David Rientjes
2014-03-06 21:29 ` Tejun Heo
@ 2014-03-06 21:33 ` Tejun Heo
2014-03-07 12:23 ` Michal Hocko
1 sibling, 1 reply; 33+ messages in thread
From: Tejun Heo @ 2014-03-06 21:33 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, Johannes Weiner, Michal Hocko, KAMEZAWA Hiroyuki,
Christoph Lameter, Pekka Enberg, Mel Gorman, Oleg Nesterov,
Rik van Riel, Jianguo Wu, Tim Hockin, linux-kernel, linux-mm,
cgroups, linux-doc
A bit of addition.
On Thu, Mar 06, 2014 at 01:23:57PM -0800, David Rientjes wrote:
> This patchset provides a solution to a real-world problem that is not
> solved with any other patchset. I expect it to be reviewed as any other
> patchset, it's not an "RFC" from my perspective: it's a proposal for
> inclusion. Don't worry, Andrew is not going to apply anything
> accidentally.
I can't force it down your throat, but I feel somewhat uneasy about how
this was posted without any reference to the previous discussion, as if
this were just now being proposed, especially as said discussion wasn't
particularly favorable to this approach. Prefixing RFC or at least
pointing back to the original discussion seems like the courteous thing
to do.
Thanks.
--
tejun
* Re: [patch 00/11] userspace out of memory handling
2014-03-06 21:33 ` Tejun Heo
@ 2014-03-07 12:23 ` Michal Hocko
0 siblings, 0 replies; 33+ messages in thread
From: Michal Hocko @ 2014-03-07 12:23 UTC (permalink / raw)
To: Tejun Heo
Cc: David Rientjes, Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki,
Christoph Lameter, Pekka Enberg, Mel Gorman, Oleg Nesterov,
Rik van Riel, Jianguo Wu, Tim Hockin, linux-kernel, linux-mm,
cgroups, linux-doc
On Thu 06-03-14 16:33:24, Tejun Heo wrote:
> A bit of addition.
>
> On Thu, Mar 06, 2014 at 01:23:57PM -0800, David Rientjes wrote:
> > This patchset provides a solution to a real-world problem that is not
> > solved with any other patchset. I expect it to be reviewed as any other
> > patchset, it's not an "RFC" from my perspective: it's a proposal for
> > inclusion. Don't worry, Andrew is not going to apply anything
> > accidentally.
>
> I can't force it down your throat but I feel somewhat uneasy about how
> this was posted without any reference to the previous discussion as if
> this were just now being proposed especially as the said discussion
> wasn't particularly favorable to this approach. Prefixing RFC or at
> least pointing back to the original discussion seems like the
> courteous thing to do.
Completely agreed! My first impression when I saw the patchset yesterday
was that it was posted for the sake of the upcoming LSF discussion. I was
also curious about the missing RFC. Posting it as a proposal for inclusion
is premature before any conclusion is reached.
--
Michal Hocko
SUSE Labs
* Re: [patch 03/11] mm, mempolicy: remove per-process flag
2014-03-05 3:59 ` [patch 03/11] mm, mempolicy: remove per-process flag David Rientjes
@ 2014-03-07 17:20 ` Andi Kleen
2014-03-07 20:48 ` Andrew Morton
0 siblings, 1 reply; 33+ messages in thread
From: Andi Kleen @ 2014-03-07 17:20 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, Johannes Weiner, Michal Hocko, KAMEZAWA Hiroyuki,
Christoph Lameter, Pekka Enberg, Tejun Heo, Mel Gorman,
Oleg Nesterov, Rik van Riel, Jianguo Wu, Tim Hockin, linux-kernel,
linux-mm, cgroups, linux-doc
David Rientjes <rientjes@google.com> writes:
>
> Per-process flags are a scarce resource so we should free them up
> whenever possible and make them available. We'll be using it shortly for
> memcg oom reserves.
I'm not convinced TCP_RR is a meaningful benchmark for slab.
The shortage seems like an artificial problem.
Just add another flag word to the task_struct? That would seem
to be the obvious way. People will need it sooner or later anyways.
-Andi
--
ak@linux.intel.com -- Speaking for myself only
* Re: [patch 03/11] mm, mempolicy: remove per-process flag
2014-03-07 17:20 ` Andi Kleen
@ 2014-03-07 20:48 ` Andrew Morton
0 siblings, 0 replies; 33+ messages in thread
From: Andrew Morton @ 2014-03-07 20:48 UTC (permalink / raw)
To: Andi Kleen
Cc: David Rientjes, Johannes Weiner, Michal Hocko, KAMEZAWA Hiroyuki,
Christoph Lameter, Pekka Enberg, Tejun Heo, Mel Gorman,
Oleg Nesterov, Rik van Riel, Jianguo Wu, Tim Hockin, linux-kernel,
linux-mm, cgroups, linux-doc
On Fri, 07 Mar 2014 09:20:39 -0800 Andi Kleen <andi@firstfloor.org> wrote:
> David Rientjes <rientjes@google.com> writes:
> >
> > Per-process flags are a scarce resource so we should free them up
> > whenever possible and make them available. We'll be using it shortly for
> > memcg oom reserves.
>
> I'm not convinced TCP_RR is a meaningfull benchmark for slab.
>
> The shortness seems like an artificial problem.
>
> Just add another flag word to the task_struct? That would seem
> to be the obvious way. People will need it sooner or later anyways.
>
This is basically what the patch does:
@@ -3259,7 +3259,7 @@ __do_cache_alloc(struct kmem_cache *cach
{
void *objp;
- if (unlikely(current->flags & (PF_SPREAD_SLAB | PF_MEMPOLICY))) {
+ if (current->mempolicy || unlikely(current->flags & PF_SPREAD_SLAB)) {
objp = alternate_node_alloc(cache, flags);
if (objp)
goto out;
It runs when slab goes into the page allocator for backing store (ie:
relatively rarely). It adds one test-n-branch when a mempolicy is
active and actually removes instructions when no mempolicy is active.
This patch won't be making any difference to anything.
* Re: [patch 00/11] userspace out of memory handling
2014-03-06 2:52 ` David Rientjes
@ 2014-03-11 12:03 ` Jianguo Wu
0 siblings, 0 replies; 33+ messages in thread
From: Jianguo Wu @ 2014-03-11 12:03 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, Johannes Weiner, Michal Hocko, KAMEZAWA Hiroyuki,
Christoph Lameter, Pekka Enberg, Tejun Heo, Mel Gorman,
Oleg Nesterov, Rik van Riel, Tim Hockin, linux-kernel, linux-mm,
cgroups, linux-doc
On 2014/3/6 10:52, David Rientjes wrote:
> On Wed, 5 Mar 2014, Andrew Morton wrote:
>
>>> This patchset introduces a standard interface through memcg that allows
>>> both of these conditions to be handled in the same clean way: users
>>> define memory.oom_reserve_in_bytes to define the reserve and this
>>> amount is allowed to be overcharged to the process handling the oom
>>> condition's memcg. If used with the root memcg, this amount is allowed
>>> to be allocated below the per-zone watermarks for root processes that
>>> are handling such conditions (only root may write to
>>> cgroup.event_control for the root memcg).
>>
>> If process A is trying to allocate memory, cannot do so and the
>> userspace oom-killer is invoked, there must be means via which process
>> A waits for the userspace oom-killer's action.
>
> It does so by relooping in the page allocator waiting for memory to be
> freed just like it would if the kernel oom killer were called and process
> A was waiting for the oom kill victim process B to exit, we don't have the
> ability to put it on a waitqueue because we don't touch the freeing
> hotpath. The userspace oom handler may not even necessarily kill
> anything, it may be able to free its own memory and start throttling other
> processes, for example.
>
>> And there must be
>> fallbacks which occur if the userspace oom killer fails to clear the
>> oom condition, or times out.
>>
>
> I agree completely and proposed this before as memory.oom_delay_millisecs
> at http://lwn.net/Articles/432226 which we use internally when memory
> can't be freed or a memcg's limit cannot be expanded. I guess it makes
> more sense alongside the rest of this patchset now, I can add it as an
> additional patch next time around.
>
>> Would be interested to see a description of how all this works.
>>
>
> There's an article for LWN also being developed on this topic. As
> mentioned in that article, I think it would be best to generalize a lot of
> the common functions and the eventfd handling entirely into a library.
> I've attached an example implementation that just invokes a function to
> handle the situation.
>
> For Google's usecase specifically, at the root memcg level (system oom) we
> want to do priority based memcg killing. We want to kill from within a
> memcg hierarchy that has the lowest priority relative to other memcgs.
> This cannot be implemented with /proc/pid/oom_score_adj today. Those
> priorities may also change depending on whether a memcg hierarchy is
> "overlimit", i.e. its limit has been increased temporarily because it has
> hit a memcg oom and additional memory is readily available on the system.
>
> So why not just introduce a memcg tunable that specifies a priority?
> Well, it's not that simple. Other users will want to implement different
> policies on system oom (think about things like existing panic_on_oom or
> oom_kill_allocating_task sysctls). I introduced oom_kill_allocating_task
> originally for SGI because they wanted a fast oom kill rather than
> expensive tasklist scan: the allocating task itself is rather irrelevant,
> it was just the unlucky task that was allocating at the moment that oom
> was triggered. What's guaranteed is that current in that case will always
> free memory from under oom (it's not a member of some other mempolicy or
> cpuset that would be needlessly killed). Both sysctls could trivially be
> reimplemented in userspace with this feature.
>
> I have other customers who don't run in a memcg environment at all, they
> simply reattach all processes to root and delete all other memcgs. These
> customers are only concerned about system oom conditions and want to do
> something "interesting" before a process is killed. Some want to log the
> VM statistics as an artifact to examine later, some want to examine heap
> profiles, others can start throttling and freeing memory rather than kill
> anything. All of this is impossible today because the kernel oom killer
> will simply kill something immediately and any stats we collect afterwards
> don't represent the oom condition. The heap profiles are lost, throttling
> is useless, etc.
>
> Jianguo (cc'd) may also have usecases not described here.
>
I want to log memory usage, like slabinfo, vmalloc info, page-cache info, etc.,
before killing anything.
>> It is unfortunate that this feature is memcg-only. Surely it could
>> also be used by non-memcg setups. Would like to see at least a
>> detailed description of how this will all be presented and implemented.
>> We should aim to make the memcg and non-memcg userspace interfaces and
>> user-visible behaviour as similar as possible.
>>
>
> It's memcg only because it can handle both system and memcg oom conditions
> with the same clean interface, it would be possible to implement only
> system oom condition handling through procfs (a little sloppy since it
> needs to register the eventfd) but then a userspace oom handler would need
> to determine which interface to use based on whether it was running in a
> memcg or non-memcg environment. I implemented this feature with userspace
> in mind: I didn't want it to need two different implementations to do the
> same thing depending on memcg. The way it is written, a userspace oom
> handler does not know (nor need not care) whether it is constrained by the
> amount of system RAM or a memcg limit. It can simply write the reserve to
> its memcg's memory.oom_reserve_in_bytes, attach to memory.oom_control and
> be done.
>
> This does mean that memcg needs to be enabled for the support, though.
> This is already done on most distributions, the cgroup just needs to be
> mounted. Would it be better to duplicate the interface in two different
> spots depending on CONFIG_MEMCG? I didn't think so, and I think the idea
> of a userspace library that takes care of this registration (and mounting,
> perhaps) proposed on LWN would be the best of both worlds.
>
>> Patches 1, 2, 3 and 5 appear to be independent and useful so I think
>> I'll cherrypick those, OK?
>>
>
> Ok! I'm hoping that the PF_MEMPOLICY bit that is removed in those patches
> is at least temporarily reserved for PF_OOM_HANDLER introduced here, I
> removed it purposefully :)
Thread overview: 33+ messages
2014-03-05 3:58 [patch 00/11] userspace out of memory handling David Rientjes
2014-03-05 3:58 ` [patch 01/11] fork: collapse copy_flags into copy_process David Rientjes
2014-03-05 3:58 ` [patch 02/11] mm, mempolicy: rename slab_node for clarity David Rientjes
2014-03-05 3:59 ` [patch 03/11] mm, mempolicy: remove per-process flag David Rientjes
2014-03-07 17:20 ` Andi Kleen
2014-03-07 20:48 ` Andrew Morton
2014-03-05 3:59 ` [patch 04/11] mm, memcg: add tunable for oom reserves David Rientjes
2014-03-05 21:17 ` Andrew Morton
2014-03-06 2:53 ` David Rientjes
2014-03-06 21:04 ` Tejun Heo
2014-03-05 3:59 ` [patch 05/11] res_counter: remove interface for locked charging and uncharging David Rientjes
2014-03-05 3:59 ` [patch 06/11] res_counter: add interface for maximum nofail charge David Rientjes
2014-03-05 3:59 ` [patch 07/11] mm, memcg: allow processes handling oom notifications to access reserves David Rientjes
2014-03-06 21:12 ` Tejun Heo
2014-03-05 3:59 ` [patch 08/11] mm, memcg: add memcg oom reserve documentation David Rientjes
2014-03-05 3:59 ` [patch 09/11] mm, page_alloc: allow system oom handlers to use memory reserves David Rientjes
2014-03-06 21:13 ` Tejun Heo
2014-03-05 3:59 ` [patch 10/11] mm, memcg: add memory.oom_control notification for system oom David Rientjes
2014-03-06 21:15 ` Tejun Heo
2014-03-05 3:59 ` [patch 11/11] mm, memcg: allow system oom killer to be disabled David Rientjes
2014-03-06 21:15 ` Tejun Heo
2014-03-05 21:17 ` [patch 00/11] userspace out of memory handling Andrew Morton
2014-03-06 2:52 ` David Rientjes
2014-03-11 12:03 ` Jianguo Wu
2014-03-06 20:49 ` Tejun Heo
2014-03-06 20:55 ` David Rientjes
2014-03-06 20:59 ` Tejun Heo
2014-03-06 21:08 ` David Rientjes
2014-03-06 21:11 ` Tejun Heo
2014-03-06 21:23 ` David Rientjes
2014-03-06 21:29 ` Tejun Heo
2014-03-06 21:33 ` Tejun Heo
2014-03-07 12:23 ` Michal Hocko