From: Johannes Weiner <hannes@cmpxchg.org>
To: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
linux-kernel@vger.kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
Andrew Morton <akpm@linux-foundation.org>,
Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>,
Huang Ying <ying.huang@intel.com>,
Andrea Arcangeli <aarcange@redhat.com>,
Dave Chinner <david@fromorbit.com>, Michal Hocko <mhocko@suse.cz>,
Theodore Ts'o <tytso@mit.edu>
Subject: [patch 06/12] mm: oom_kill: simplify OOM killer locking
Date: Wed, 25 Mar 2015 02:17:10 -0400
Message-ID: <1427264236-17249-7-git-send-email-hannes@cmpxchg.org>
In-Reply-To: <1427264236-17249-1-git-send-email-hannes@cmpxchg.org>

The zonelist locking and the oom_sem are two overlapping locks that
are used to serialize global OOM killing against different things.

The historical zonelist locking serializes OOM kills from allocations
with overlapping zonelists against each other to prevent killing more
tasks than necessary in the same memory domain. Only when neither
tasklists nor zonelists from two concurrent OOM kills overlap (tasks
in separate memcgs bound to separate nodes) are OOM kills allowed to
execute in parallel.

The younger oom_sem is a read-write semaphore that serializes OOM
killing against the PM code trying to disable the OOM killer
altogether.

However, the OOM killer is a fairly cold error path; there is really
no reason to optimize for highly performant and concurrent OOM kills.
And the oom_sem is just flat-out redundant.

Replace both locking schemes with a single global mutex serializing
OOM kills regardless of context.
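
For illustration only (not part of this patch): the resulting
serialization is a plain trylock-or-back-off pattern. Below is a
minimal userspace sketch of that pattern, with a pthread mutex
standing in for oom_lock and a hypothetical alloc_slowpath()
standing in for __alloc_pages_may_oom():

	#include <pthread.h>
	#include <stdio.h>
	#include <unistd.h>

	/* Stand-in for the global oom_lock: one OOM kill at a time. */
	static pthread_mutex_t oom_lock = PTHREAD_MUTEX_INITIALIZER;

	static void *alloc_slowpath(void *arg)
	{
		long id = (long)arg;

		if (pthread_mutex_trylock(&oom_lock)) {
			/*
			 * Somebody else holds the lock and is making
			 * progress for us; back off and retry the
			 * allocation instead of killing another task.
			 */
			printf("thread %ld: OOM in progress, backing off\n", id);
			return NULL;
		}
		printf("thread %ld: performing the OOM kill\n", id);
		sleep(1);	/* the cold, slow error path */
		pthread_mutex_unlock(&oom_lock);
		return NULL;
	}

	int main(void)
	{
		pthread_t threads[4];
		long i;

		for (i = 0; i < 4; i++)
			pthread_create(&threads[i], NULL, alloc_slowpath,
				       (void *)i);
		for (i = 0; i < 4; i++)
			pthread_join(threads[i], NULL);
		return 0;
	}

Contexts that have already committed to reclaiming by killing, such
as mem_cgroup_out_of_memory() below, take the lock unconditionally
with mutex_lock() instead of backing off.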
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/oom.h |   5 +--
 mm/memcontrol.c     |  18 +++++---
 mm/oom_kill.c       | 127 +++++++++++-----------------------------------------
 mm/page_alloc.c     |   8 ++--
 4 files changed, 44 insertions(+), 114 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index a8e6a498cbcb..7deecb7bca5e 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -32,6 +32,8 @@ enum oom_scan_t {
/* Thread is the potential origin of an oom condition; kill first on oom */
#define OOM_FLAG_ORIGIN ((__force oom_flags_t)0x1)
+extern struct mutex oom_lock;
+
static inline void set_current_oom_origin(void)
{
current->signal->oom_flags |= OOM_FLAG_ORIGIN;
@@ -60,9 +62,6 @@ extern void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
struct mem_cgroup *memcg, nodemask_t *nodemask,
const char *message);
-extern bool oom_zonelist_trylock(struct zonelist *zonelist, gfp_t gfp_flags);
-extern void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_flags);
-
extern void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
int order, const nodemask_t *nodemask,
struct mem_cgroup *memcg);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index aab5604e0ac4..9f280b9df848 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1530,6 +1530,8 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
unsigned int points = 0;
struct task_struct *chosen = NULL;
+ mutex_lock(&oom_lock);
+
/*
* If current has a pending SIGKILL or is exiting, then automatically
* select it. The goal is to allow it to allocate so that it may
@@ -1537,7 +1539,7 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
*/
if (fatal_signal_pending(current) || task_will_free_mem(current)) {
mark_oom_victim(current);
- return;
+ goto unlock;
}
check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, order, NULL, memcg);
@@ -1564,7 +1566,7 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
mem_cgroup_iter_break(memcg, iter);
if (chosen)
put_task_struct(chosen);
- return;
+ goto unlock;
case OOM_SCAN_OK:
break;
};
@@ -1585,11 +1587,13 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
css_task_iter_end(&it);
}
- if (!chosen)
- return;
- points = chosen_points * 1000 / totalpages;
- oom_kill_process(chosen, gfp_mask, order, points, totalpages, memcg,
- NULL, "Memory cgroup out of memory");
+ if (chosen) {
+ points = chosen_points * 1000 / totalpages;
+ oom_kill_process(chosen, gfp_mask, order, points, totalpages,
+ memcg, NULL, "Memory cgroup out of memory");
+ }
+unlock:
+ mutex_unlock(&oom_lock);
}
#if MAX_NUMNODES > 1
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d3490b019d46..5cfda39b3268 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -42,7 +42,8 @@
int sysctl_panic_on_oom;
int sysctl_oom_kill_allocating_task;
int sysctl_oom_dump_tasks = 1;
-static DEFINE_SPINLOCK(zone_scan_lock);
+
+DEFINE_MUTEX(oom_lock);
#ifdef CONFIG_NUMA
/**
@@ -405,13 +406,12 @@ static atomic_t oom_victims = ATOMIC_INIT(0);
static DECLARE_WAIT_QUEUE_HEAD(oom_victims_wait);
bool oom_killer_disabled __read_mostly;
-static DECLARE_RWSEM(oom_sem);
/**
* mark_oom_victim - mark the given task as OOM victim
* @tsk: task to mark
*
- * Has to be called with oom_sem taken for read and never after
+ * Has to be called with oom_lock held and never after
* oom has been disabled already.
*/
void mark_oom_victim(struct task_struct *tsk)
@@ -460,14 +460,14 @@ bool oom_killer_disable(void)
* Make sure to not race with an ongoing OOM killer
* and that the current is not the victim.
*/
- down_write(&oom_sem);
+ mutex_lock(&oom_lock);
if (test_thread_flag(TIF_MEMDIE)) {
- up_write(&oom_sem);
+ mutex_unlock(&oom_lock);
return false;
}
oom_killer_disabled = true;
- up_write(&oom_sem);
+ mutex_unlock(&oom_lock);
wait_event(oom_victims_wait, !atomic_read(&oom_victims));
@@ -634,52 +634,6 @@ int unregister_oom_notifier(struct notifier_block *nb)
}
EXPORT_SYMBOL_GPL(unregister_oom_notifier);
-/*
- * Try to acquire the OOM killer lock for the zones in zonelist. Returns zero
- * if a parallel OOM killing is already taking place that includes a zone in
- * the zonelist. Otherwise, locks all zones in the zonelist and returns 1.
- */
-bool oom_zonelist_trylock(struct zonelist *zonelist, gfp_t gfp_mask)
-{
- struct zoneref *z;
- struct zone *zone;
- bool ret = true;
-
- spin_lock(&zone_scan_lock);
- for_each_zone_zonelist(zone, z, zonelist, gfp_zone(gfp_mask))
- if (test_bit(ZONE_OOM_LOCKED, &zone->flags)) {
- ret = false;
- goto out;
- }
-
- /*
- * Lock each zone in the zonelist under zone_scan_lock so a parallel
- * call to oom_zonelist_trylock() doesn't succeed when it shouldn't.
- */
- for_each_zone_zonelist(zone, z, zonelist, gfp_zone(gfp_mask))
- set_bit(ZONE_OOM_LOCKED, &zone->flags);
-
-out:
- spin_unlock(&zone_scan_lock);
- return ret;
-}
-
-/*
- * Clears the ZONE_OOM_LOCKED flag for all zones in the zonelist so that failed
- * allocation attempts with zonelists containing them may now recall the OOM
- * killer, if necessary.
- */
-void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask)
-{
- struct zoneref *z;
- struct zone *zone;
-
- spin_lock(&zone_scan_lock);
- for_each_zone_zonelist(zone, z, zonelist, gfp_zone(gfp_mask))
- clear_bit(ZONE_OOM_LOCKED, &zone->flags);
- spin_unlock(&zone_scan_lock);
-}
-
/**
* __out_of_memory - kill the "best" process when we run out of memory
* @zonelist: zonelist pointer
@@ -693,8 +647,8 @@ void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask)
* OR try to be smart about which process to kill. Note that we
* don't have to be perfect here, we just have to be good.
*/
-static void __out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
- int order, nodemask_t *nodemask, bool force_kill)
+bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+ int order, nodemask_t *nodemask, bool force_kill)
{
const nodemask_t *mpol_mask;
struct task_struct *p;
@@ -704,10 +658,13 @@ static void __out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
enum oom_constraint constraint = CONSTRAINT_NONE;
int killed = 0;
+ if (oom_killer_disabled)
+ return false;
+
blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
if (freed > 0)
/* Got some memory back in the last second. */
- return;
+ goto out;
/*
* If current has a pending SIGKILL or is exiting, then automatically
@@ -720,7 +677,7 @@ static void __out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
if (current->mm &&
(fatal_signal_pending(current) || task_will_free_mem(current))) {
mark_oom_victim(current);
- return;
+ goto out;
}
/*
@@ -760,32 +717,8 @@ static void __out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
*/
if (killed)
schedule_timeout_killable(1);
-}
-
-/**
- * out_of_memory - tries to invoke OOM killer.
- * @zonelist: zonelist pointer
- * @gfp_mask: memory allocation flags
- * @order: amount of memory being requested as a power of 2
- * @nodemask: nodemask passed to page allocator
- * @force_kill: true if a task must be killed, even if others are exiting
- *
- * invokes __out_of_memory if the OOM is not disabled by oom_killer_disable()
- * when it returns false. Otherwise returns true.
- */
-bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
- int order, nodemask_t *nodemask, bool force_kill)
-{
- bool ret = false;
-
- down_read(&oom_sem);
- if (!oom_killer_disabled) {
- __out_of_memory(zonelist, gfp_mask, order, nodemask, force_kill);
- ret = true;
- }
- up_read(&oom_sem);
- return ret;
+ return true;
}
/*
@@ -795,27 +728,21 @@ bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
*/
void pagefault_out_of_memory(void)
{
- struct zonelist *zonelist;
-
- down_read(&oom_sem);
if (mem_cgroup_oom_synchronize(true))
- goto unlock;
+ return;
- zonelist = node_zonelist(first_memory_node, GFP_KERNEL);
- if (oom_zonelist_trylock(zonelist, GFP_KERNEL)) {
- if (!oom_killer_disabled)
- __out_of_memory(NULL, 0, 0, NULL, false);
- else
- /*
- * There shouldn't be any user tasks runable while the
- * OOM killer is disabled so the current task has to
- * be a racing OOM victim for which oom_killer_disable()
- * is waiting for.
- */
- WARN_ON(test_thread_flag(TIF_MEMDIE));
+ if (!mutex_trylock(&oom_lock))
+ return;
- oom_zonelist_unlock(zonelist, GFP_KERNEL);
+ if (!out_of_memory(NULL, 0, 0, NULL, false)) {
+ /*
+ * There shouldn't be any user tasks runnable while the
+ * OOM killer is disabled, so the current task has to
+ * be a racing OOM victim for which oom_killer_disable()
+ * is waiting.
+ */
+ WARN_ON(test_thread_flag(TIF_MEMDIE));
}
-unlock:
- up_read(&oom_sem);
+
+ mutex_unlock(&oom_lock);
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 656379820190..9ebc760187ac 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2380,10 +2380,10 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
*did_some_progress = 0;
/*
- * Acquire the per-zone oom lock for each zone. If that
- * fails, somebody else is making progress for us.
+ * Acquire the oom lock. If that fails, somebody else is
+ * making progress for us.
*/
- if (!oom_zonelist_trylock(ac->zonelist, gfp_mask)) {
+ if (!mutex_trylock(&oom_lock)) {
*did_some_progress = 1;
schedule_timeout_uninterruptible(1);
return NULL;
@@ -2428,7 +2428,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
|| WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
*did_some_progress = 1;
out:
- oom_zonelist_unlock(ac->zonelist, gfp_mask);
+ mutex_unlock(&oom_lock);
return page;
}
--
2.3.3