* [PATCH RFC] memcg: close the race window between OOM detection and killing
@ 2015-06-03 3:15 Tejun Heo
2015-06-03 14:44 ` Michal Hocko
0 siblings, 1 reply; 11+ messages in thread
From: Tejun Heo @ 2015-06-03 3:15 UTC (permalink / raw)
To: Johannes Weiner, Michal Hocko
Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg
Hello,
This patch closes the race window by introducing an OOM victim
generation number to detect whether any victim exited between OOM
detection and killing; however, this isn't the prettiest thing in the
world and is nasty in that the memcg OOM mechanism deviates from the
system-wide OOM killer.
The only reason the memcg OOM killer behaves asynchronously (unwinding
the stack and then handling) is memcg userland OOM handling, which may
end up blocking for userland actions while still holding whatever locks
it was holding when it invoked try_charge(), leading to a deadlock.
However, given that userland OOMs are retriable, this doesn't have to
be this complicated.  Waiting synchronously with a timeout in
try_charge() should be enough - in the unlikely cases where forward
progress can't be made, the OOM killing can simply abort waiting and
continue on.  If it is an OOM deadlock which requires the death of more
victims, the OOM condition will trigger again and kill more.

IOW, it'd be cleaner to do everything synchronously while holding
oom_lock with a timeout to get out of rare deadlocks.
What do you think?
Thanks.
----- 8< -----
Memcg OOM killings are done at the end of page fault apart from OOM
detection. This allows the following race condition.
    Task A                  Task B

    OOM detection
                            OOM detection
    OOM kill
    victim exits
                            OOM kill
Task B has no way of knowing that another task has already killed an
OOM victim which proceeded to exit and release memory and will
unnecessarily pick another victim. In highly contended cases, this
can lead to multiple unnecessary chained killings.
This patch closes this race window by adding a per-memcg OOM victim exit
generation number.  Each task snapshots it when trying to charge.  If
an OOM condition is triggered, the kill path compares the remembered
generation against the current value.  If they differ, it indicates
that some victims have exited between the charge attempt and the OOM
kill path, and the task shouldn't pick another victim.
The condition can be reliably triggered with multiple allocating
processes by modifying mem_cgroup_oom_trylock() to retry several times
with a short delay. With the patch applied, memcg OOM correctly
detects the race condition and skips OOM killing to retry the
allocation.
Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
include/linux/memcontrol.h | 9 ++++++-
include/linux/sched.h | 3 +-
mm/memcontrol.c | 52 +++++++++++++++++++++++++++++++++++++++++++++
mm/oom_kill.c | 5 ++++
4 files changed, 66 insertions(+), 3 deletions(-)
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -126,8 +126,8 @@ bool mem_cgroup_lruvec_online(struct lru
 int mem_cgroup_select_victim_node(struct mem_cgroup *memcg);
 unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list);
 void mem_cgroup_update_lru_size(struct lruvec *, enum lru_list, int);
-extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
-				      struct task_struct *p);
+void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p);
+void mem_cgroup_exit_oom_victim(void);
 
 static inline void mem_cgroup_oom_enable(void)
 {
@@ -321,6 +321,11 @@ mem_cgroup_print_oom_info(struct mem_cgr
 {
 }
 
+static inline void
+mem_cgroup_exit_oom_victim(void)
+{
+}
+
 static inline struct mem_cgroup *mem_cgroup_begin_page_stat(struct page *page)
 {
 	return NULL;
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1770,7 +1770,8 @@ struct task_struct {
 		struct mem_cgroup *memcg;
 		gfp_t gfp_mask;
 		int order;
-		unsigned int may_oom:1;
+		u16 oom_exit_gen;
+		u16 may_oom:1;
 	} memcg_oom;
 #endif
 #ifdef CONFIG_UPROBES
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -285,6 +285,21 @@ struct mem_cgroup {
 	 */
 	bool use_hierarchy;
 
+	/*
+	 * Because memcg OOM detection and killing aren't in the same
+	 * critical section, OOM victims might exit between the detection
+	 * and killing steps of another OOMing task, leading to unnecessary
+	 * consecutive OOM killings.  The following counter, bumped
+	 * whenever an OOM victim exits, is used to detect such race
+	 * conditions by testing whether it has changed between detection
+	 * and killing.
+	 *
+	 * Use u16 to avoid bloating task->memcg_oom_info.  While u16 can
+	 * wrap in between, it's highly unlikely and we can afford rare
+	 * inaccuracies.  Protected by oom_lock.
+	 */
+	u16 oom_exit_gen;
+
 	/* protected by memcg_oom_lock */
 	bool oom_lock;
 	int under_oom;
@@ -1490,6 +1505,28 @@ void mem_cgroup_print_oom_info(struct me
 	mutex_unlock(&oom_info_lock);
 }
 
+/**
+ * mem_cgroup_exit_oom_victim - note the exit of an OOM victim
+ *
+ * Called from exit_oom_victim() with oom_lock held.  This is used to bump
+ * memcg->oom_exit_gen which is used to avoid unnecessary chained OOM
+ * killings.
+ */
+void mem_cgroup_exit_oom_victim(void)
+{
+	struct mem_cgroup *memcg;
+
+	lockdep_assert_held(&oom_lock);
+
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(current);
+
+	/* paired with load_acquire in try_charge() */
+	smp_store_release(&memcg->oom_exit_gen, memcg->oom_exit_gen + 1);
+
+	rcu_read_unlock();
+}
+
 /*
  * This function returns the number of memcg under hierarchy tree. Returns
  * 1(self count) if no children.
@@ -1533,6 +1570,13 @@ static void mem_cgroup_out_of_memory(str
 	mutex_lock(&oom_lock);
 
 	/*
+	 * If we raced with OOM victim exits between our charge attempt and
+	 * here, there's no reason to kill more.  Bail out and retry.
+	 */
+	if (current->memcg_oom.oom_exit_gen != memcg->oom_exit_gen)
+		goto unlock;
+
+	/*
 	 * If current has a pending SIGKILL or is exiting, then automatically
 	 * select it.  The goal is to allow it to allocate so that it may
 	 * quickly exit and free its memory.
@@ -2245,6 +2289,14 @@ static int try_charge(struct mem_cgroup
 	if (mem_cgroup_is_root(memcg))
 		goto done;
 retry:
+	/*
+	 * Snapshot the current OOM exit generation number.  The generation
+	 * number has to be updated after memory is released and read
+	 * before charging is attempted.  Use load_acquire paired with
+	 * store_release in mem_cgroup_exit_oom_victim() for ordering.
+	 */
+	current->memcg_oom.oom_exit_gen = smp_load_acquire(&memcg->oom_exit_gen);
+
 	if (consume_stock(memcg, nr_pages))
 		goto done;
 
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -435,10 +435,15 @@ void mark_oom_victim(struct task_struct
  */
 void exit_oom_victim(void)
 {
+	mutex_lock(&oom_lock);
+
+	mem_cgroup_exit_oom_victim();
 	clear_thread_flag(TIF_MEMDIE);
 
 	if (!atomic_dec_return(&oom_victims))
 		wake_up_all(&oom_victims_wait);
+
+	mutex_unlock(&oom_lock);
 }
 
 /**
^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH RFC] memcg: close the race window between OOM detection and killing
  2015-06-03  3:15 [PATCH RFC] memcg: close the race window between OOM detection and killing Tejun Heo
@ 2015-06-03 14:44 ` Michal Hocko
  [not found]   ` <20150603144414.GG16201-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Michal Hocko @ 2015-06-03 14:44 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Johannes Weiner, cgroups, linux-mm

On Wed 03-06-15 12:15:44, Tejun Heo wrote:
> Hello,
>
> This patch closes the race window by introducing an OOM victim
> generation number to detect whether any victim exited between OOM
> detection and killing; however, this isn't the prettiest thing in the
> world and is nasty in that the memcg OOM mechanism deviates from the
> system-wide OOM killer.

The race does exist in the global case as well AFAICS.

__alloc_pages_may_oom
  mutex_trylock
  get_page_from_freelist	# fails
  <preempted>			exit_mm	# releases some memory
  out_of_memory
				exit_oom_victim
  # No TIF_MEMDIE task so kill a new one

The race window might be smaller but it is possible in principle.
Why does it make sense to treat those two in a different way?

> The only reason the memcg OOM killer behaves asynchronously (unwinding
> the stack and then handling) is memcg userland OOM handling, which may
> end up blocking for userland actions while still holding whatever locks
> it was holding when it invoked try_charge(), leading to a deadlock.

This is not the only reason. In-kernel memcg OOM handling needs it
as well. See 3812c8c8f395 ("mm: memcg: do not trap chargers with
full callstack on OOM"). In fact it was the in-kernel case which
triggered this change. We simply cannot wait for OOM with the stack and
all the state the charge is called from.

> However, given that userland OOMs are retriable, this doesn't have to
> be this complicated.
> Waiting synchronously with a timeout in
> try_charge() should be enough - in the unlikely cases where forward
> progress can't be made, the OOM killing can simply abort waiting and
> continue on. If it is an OOM deadlock which requires the death of more
> victims, the OOM condition will trigger again and kill more.
>
> IOW, it'd be cleaner to do everything synchronously while holding
> oom_lock with a timeout to get out of rare deadlocks.

Deadlocks are quite real and we really have to unwind and handle with a
clean stack.

> What do you think?
>
> Thanks.
>
> ----- 8< -----
> Memcg OOM killings are done at the end of page fault, apart from OOM
> detection. This allows the following race condition.
>
>     Task A                  Task B
>
>     OOM detection
>                             OOM detection
>     OOM kill
>     victim exits
>                             OOM kill
>
> Task B has no way of knowing that another task has already killed an
> OOM victim which proceeded to exit and release memory, and will
> unnecessarily pick another victim. In highly contended cases, this
> can lead to multiple unnecessary chained killings.

Yes, I can see this might happen. I haven't seen it in real life,
but I guess such a load can be constructed. The question is whether this
is serious enough to make the code more complicated.

> This patch closes this race window by adding a per-memcg OOM victim exit
> generation number. Each task snapshots it when trying to charge. If
> an OOM condition is triggered, the kill path compares the remembered
> generation against the current value. If they differ, it indicates
> that some victims have exited between the charge attempt and the OOM
> kill path, and the task shouldn't pick another victim.

The idea is good. See comments to the implementation below.

> The condition can be reliably triggered with multiple allocating
> processes by modifying mem_cgroup_oom_trylock() to retry several times
> with a short delay. With the patch applied, memcg OOM correctly
> detects the race condition and skips OOM killing to retry the
> allocation.
Were you able to trigger this even without adding delays?

> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
>  include/linux/memcontrol.h |    9 ++++++-
>  include/linux/sched.h      |    3 +-
>  mm/memcontrol.c            |   52 +++++++++++++++++++++++++++++++++++++++++++++
>  mm/oom_kill.c              |    5 ++++
>  4 files changed, 66 insertions(+), 3 deletions(-)
>
[...]
> +/**
> + * mem_cgroup_exit_oom_victim - note the exit of an OOM victim
> + *
> + * Called from exit_oom_victim() with oom_lock held.  This is used to bump
> + * memcg->oom_exit_gen which is used to avoid unnecessary chained OOM
> + * killings.
> + */
> +void mem_cgroup_exit_oom_victim(void)
> +{
> +	struct mem_cgroup *memcg;
> +
> +	lockdep_assert_held(&oom_lock);
> +
> +	rcu_read_lock();
> +	memcg = mem_cgroup_from_task(current);

The OOM might have happened in a parent memcg and the OOM victim might
be a sibling or wherever in the hierarchy under the OOM memcg.
So you have to use the OOM memcg to track the counter, otherwise the
tasks from other memcgs in the hierarchy racing with the OOM victim
would miss it anyway. You can store the target memcg into the victim
when killing it.

[...]
> @@ -2245,6 +2289,14 @@ static int try_charge(struct mem_cgroup
> 	if (mem_cgroup_is_root(memcg))
> 		goto done;
>  retry:
> +	/*
> +	 * Snapshot the current OOM exit generation number.  The generation
> +	 * number has to be updated after memory is released and read
> +	 * before charging is attempted.  Use load_acquire paired with
> +	 * store_release in mem_cgroup_exit_oom_victim() for ordering.
> +	 */
> +	current->memcg_oom.oom_exit_gen = smp_load_acquire(&memcg->oom_exit_gen);

Same here. You should store the OOM memcg gen count. Ideally hook it
into mem_cgroup_oom.

> +
> 	if (consume_stock(memcg, nr_pages))
> 		goto done;
>
[...]

Thanks!
--
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/

^ permalink raw reply	[flat|nested] 11+ messages in thread
[parent not found: <20150603144414.GG16201-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* Re: [PATCH RFC] memcg: close the race window between OOM detection and killing
  [not found]   ` <20150603144414.GG16201-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2015-06-03 19:36     ` Tejun Heo
  2015-06-04  9:30       ` Michal Hocko
  0 siblings, 1 reply; 11+ messages in thread
From: Tejun Heo @ 2015-06-03 19:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg

Hey, Michal.

On Wed, Jun 03, 2015 at 04:44:14PM +0200, Michal Hocko wrote:
> The race does exist in the global case as well AFAICS.
>
> __alloc_pages_may_oom
>   mutex_trylock
>   get_page_from_freelist	# fails
>   <preempted>			exit_mm	# releases some memory
>   out_of_memory
>				exit_oom_victim
>   # No TIF_MEMDIE task so kill a new one

Hmmm? In -mm, if __alloc_pages_may_oom() fails trylock, it never calls
out_of_memory().

> The race window might be smaller but it is possible in principle.
> Why does it make sense to treat those two in a different way?

The main difference here is that the alloc path does the whole thing
synchronously and thus the OOM detection and killing can be put in the
same critical section, which isn't the case for the memcg OOM handling.

> > The only reason the memcg OOM killer behaves asynchronously (unwinding
> > the stack and then handling) is memcg userland OOM handling, which may
> > end up blocking for userland actions while still holding whatever locks
> > it was holding when it invoked try_charge(), leading to a deadlock.
>
> This is not the only reason. In-kernel memcg OOM handling needs it
> as well. See 3812c8c8f395 ("mm: memcg: do not trap chargers with
> full callstack on OOM"). In fact it was the in-kernel case which
> triggered this change. We simply cannot wait for OOM with the stack and
> all the state the charge is called from.

Why should this be any different from OOM handling from the page
allocator tho?
That can end up in the exact same situation and currently it
won't be able to get out of such a situation - the OOM victim would be
stuck with OOM_SCAN_ABORT and all subsequent OOM invocations wouldn't
do anything due to OOM_SCAN_ABORT.

The solution here seems to be timing out on those waits, ie. pick an
OOM victim, kill it, wait for it to exit for enough seconds. If the
victim doesn't exit, ignore it and repeat the process, which is
guaranteed to make progress no matter what and applicable for both
allocator and memcg OOM handling.

> > IOW, it'd be cleaner to do everything synchronously while holding
> > oom_lock with a timeout to get out of rare deadlocks.
>
> Deadlocks are quite real and we really have to unwind and handle with a
> clean stack.

Yeah, they're real but we can deal with them in a more consistent way
using timeouts.

> > Memcg OOM killings are done at the end of page fault, apart from OOM
> > detection. This allows the following race condition.
> >
> >     Task A                  Task B
> >
> >     OOM detection
> >                             OOM detection
> >     OOM kill
> >     victim exits
> >                             OOM kill
> >
> > Task B has no way of knowing that another task has already killed an
> > OOM victim which proceeded to exit and release memory, and will
> > unnecessarily pick another victim. In highly contended cases, this
> > can lead to multiple unnecessary chained killings.
>
> Yes, I can see this might happen. I haven't seen it in real life,
> but I guess such a load can be constructed. The question is whether this
> is serious enough to make the code more complicated.

Yeah, I do have a bug report and dmesg where multiple processes are
being killed (each one pretty large) when one should have been enough.

...

> > +void mem_cgroup_exit_oom_victim(void)
> > +{
> > +	struct mem_cgroup *memcg;
> > +
> > +	lockdep_assert_held(&oom_lock);
> > +
> > +	rcu_read_lock();
> > +	memcg = mem_cgroup_from_task(current);
>
> The OOM might have happened in a parent memcg and the OOM victim might
> be a sibling or wherever in the hierarchy under the OOM memcg.
> So you have to use the OOM memcg to track the counter, otherwise the
> tasks from other memcgs in the hierarchy racing with the OOM victim
> would miss it anyway. You can store the target memcg into the victim
> when killing it.

In those cases there actually are more than one domain OOMing, but
yeah, the count prolly should propagate all the way to the root.

Geez... I dislike this approach even more. Grabbing the oom lock and
doing everything synchronously with a timeout will be far simpler and
easier to follow.

Thanks.

--
tejun

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [PATCH RFC] memcg: close the race window between OOM detection and killing
  2015-06-03 19:36     ` Tejun Heo
@ 2015-06-04  9:30       ` Michal Hocko
  [not found]         ` <20150604093031.GB4806-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Michal Hocko @ 2015-06-04 9:30 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Johannes Weiner, cgroups, linux-mm

On Thu 04-06-15 04:36:39, Tejun Heo wrote:
> Hey, Michal.
>
> On Wed, Jun 03, 2015 at 04:44:14PM +0200, Michal Hocko wrote:
> > The race does exist in the global case as well AFAICS.
> >
> > __alloc_pages_may_oom
> >   mutex_trylock
> >   get_page_from_freelist	# fails
> >   <preempted>			exit_mm	# releases some memory
> >   out_of_memory
> >				exit_oom_victim
> >   # No TIF_MEMDIE task so kill a new one
>
> Hmmm? In -mm, if __alloc_pages_may_oom() fails trylock, it never calls
> out_of_memory().

Sure, but the oom_lock might be free already. out_of_memory doesn't wait
for the victim to finish. It just does schedule_timeout_killable.

> > The race window might be smaller but it is possible in principle.
> > Why does it make sense to treat those two in a different way?
>
> The main difference here is that the alloc path does the whole thing
> synchronously and thus the OOM detection and killing can be put in the
> same critical section, which isn't the case for the memcg OOM handling.

This is true, but there is still a time window between the last
allocation attempt and out_of_memory when the OOM victim might have
exited and another task would be selected.

> > > The only reason the memcg OOM killer behaves asynchronously (unwinding
> > > the stack and then handling) is memcg userland OOM handling, which may
> > > end up blocking for userland actions while still holding whatever locks
> > > it was holding when it invoked try_charge(), leading to a deadlock.
> >
> > This is not the only reason. In-kernel memcg OOM handling needs it
> > as well. See 3812c8c8f395 ("mm: memcg: do not trap chargers with
> > full callstack on OOM"). In fact it was the in-kernel case which
> > triggered this change. We simply cannot wait for OOM with the stack and
> > all the state the charge is called from.
>
> Why should this be any different from OOM handling from the page
> allocator tho?

Yes, the global OOM is prone to deadlock. This has been discussed a lot
and we still do not have a good answer for that. The primary problem
is that small allocations do not fail and retry indefinitely, so an OOM
victim might be blocked on a lock held by a task which is the allocator.
This is less likely and harder to trigger with standard loads than in a
memcg environment though.

> That can end up in the exact same situation and currently it
> won't be able to get out of such a situation - the OOM victim would be
> stuck with OOM_SCAN_ABORT and all subsequent OOM invocations wouldn't
> do anything due to OOM_SCAN_ABORT.
>
> The solution here seems to be timing out on those waits, ie. pick an
> OOM victim, kill it, wait for it to exit for enough seconds. If the
> victim doesn't exit, ignore it and repeat the process, which is
> guaranteed to make progress no matter what and applicable for both
> allocator and memcg OOM handling.

There have been suggestions to add an OOM timeout and ignore the
previous OOM victim after the timeout expires and select a new
victim. This sounds attractive but this approach has its own problems
(http://marc.info/?l=linux-mm&m=141686814824684&w=2).

I am convinced that a more appropriate solution for this is to not
pretend that small allocations never fail and to start failing them
after the OOM killer is not able to make any progress (GFP_NOFS
allocations would be the first candidate and the easiest one to trigger
deadlocks via i_mutex). Johannes was also suggesting an OOM memory
reserve which would be used for OOM contexts.

Also the OOM killer can be improved to shrink some of the victim's
memory before killing it (e.g. drop private clean pages and their page
tables).

> > > IOW, it'd be cleaner to do everything synchronously while holding
> > > oom_lock with a timeout to get out of rare deadlocks.
> >
> > Deadlocks are quite real and we really have to unwind and handle with a
> > clean stack.
>
> Yeah, they're real but we can deal with them in a more consistent way
> using timeouts.
>
> > > Memcg OOM killings are done at the end of page fault, apart from OOM
> > > detection. This allows the following race condition.
> > >
> > >     Task A                  Task B
> > >
> > >     OOM detection
> > >                             OOM detection
> > >     OOM kill
> > >     victim exits
> > >                             OOM kill
> > >
> > > Task B has no way of knowing that another task has already killed an
> > > OOM victim which proceeded to exit and release memory, and will
> > > unnecessarily pick another victim. In highly contended cases, this
> > > can lead to multiple unnecessary chained killings.
> >
> > Yes, I can see this might happen. I haven't seen it in real life,
> > but I guess such a load can be constructed. The question is whether this
> > is serious enough to make the code more complicated.
>
> Yeah, I do have a bug report and dmesg where multiple processes are
> being killed (each one pretty large) when one should have been enough.

Could you share the oom reports?

> ...
> > > +void mem_cgroup_exit_oom_victim(void)
> > > +{
> > > +	struct mem_cgroup *memcg;
> > > +
> > > +	lockdep_assert_held(&oom_lock);
> > > +
> > > +	rcu_read_lock();
> > > +	memcg = mem_cgroup_from_task(current);
> >
> > The OOM might have happened in a parent memcg and the OOM victim might
> > be a sibling or wherever in the hierarchy under the OOM memcg.
> > So you have to use the OOM memcg to track the counter, otherwise the
> > tasks from other memcgs in the hierarchy racing with the OOM victim
> > would miss it anyway. You can store the target memcg into the victim
> > when killing it.
>
> In those cases there actually are more than one domain OOMing, but
> yeah, the count prolly should propagate all the way to the root.

Not sure I understand what you mean by more OOM domains. There is only
one in a hierarchy which has reached its limit and is not able to
reclaim any charges.

> Geez... I dislike this approach even more. Grabbing the oom lock and
> doing everything synchronously with a timeout will be far simpler and
> easier to follow.

It might sound easier, but it has its own problems...

--
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 11+ messages in thread
[parent not found: <20150604093031.GB4806-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* Re: [PATCH RFC] memcg: close the race window between OOM detection and killing
  [not found]         ` <20150604093031.GB4806-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2015-06-04 19:06           ` Johannes Weiner
  2015-06-05 14:29             ` Michal Hocko
  2015-06-04 19:29           ` Tejun Heo
  1 sibling, 1 reply; 11+ messages in thread
From: Johannes Weiner @ 2015-06-04 19:06 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tejun Heo, David Rientjes, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg

On Thu, Jun 04, 2015 at 11:30:31AM +0200, Michal Hocko wrote:
> There have been suggestions to add an OOM timeout and ignore the
> previous OOM victim after the timeout expires and select a new
> victim. This sounds attractive but this approach has its own problems
> (http://marc.info/?l=linux-mm&m=141686814824684&w=2).

Since this list of concerns has been brought up but never really
addressed, let me give it a shot. From David's email:

: The oom killer timeout is always an attractive remedy to this situation
: and gets proposed quite often. Several problems: (1) you can needlessly
: panic the machine because no other processes are eligible for oom kill
: after declaring that the first oom kill victim cannot make progress,

If we run out of OOM victims that can successfully exit, then we are
genuinely deadlocked. What else is there to do? A panic() is better
than locking up quietly in that case.

: (2) it can lead to unnecessary oom killing if the oom kill victim
: can exit but hasn't been scheduled or is in the process of exiting,

We can set the timeout sufficiently high that this should be a fair
trade-off. Let's say 10 seconds. If your only remaining means of
reclaiming memory, which is waiting for that one task to exit, takes
longer than 10 seconds, aren't you in big enough trouble already? It
seems reasonable to assume that there won't be any more progress. But
even if there were, the machine is in a state bad enough that a second
OOM kill should not be the end of the world.

: (3) you can easily turn the oom killer into a serial oom killer
: since there's no guarantee the next process that is chosen won't be
: affected by the same problem,

Again, this would still be better than deadlocking.

: and (4) this doesn't fix the problem if an oom disabled process is
: wedged trying to allocate memory while holding a mutex that others
: are waiting on.

I don't follow. If another OOM victim is chosen and can exit, the task
that is trying to allocate with the lock held will finish the
allocation and release the lock.

> I am convinced that a more appropriate solution for this is to not
> pretend that small allocations never fail and to start failing them
> after the OOM killer is not able to make any progress (GFP_NOFS
> allocations would be the first candidate and the easiest one to trigger
> deadlocks via i_mutex). Johannes was also suggesting an OOM memory
> reserve which would be used for OOM contexts.

I am no longer convinced we can ever go back to failing smaller
allocations and NOFS allocations. The filesystem people involved in
that discussion have proven completely uncooperative on that subject.

So I think we should make the OOM killer as robust as possible. It's
just unnecessary to deadlock on a single process when there are more
candidates out there that we could try instead. We are already in a
worst-case state; killing more tasks is not going to make it worse.

> Also the OOM killer can be improved to shrink some of the victim's
> memory before killing it (e.g. drop private clean pages and their page
> tables).

That might work too. It's just a bit more complex and I don't really
see the downside of moving on to other victims after a timeout.

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [PATCH RFC] memcg: close the race window between OOM detection and killing
  2015-06-04 19:06           ` Johannes Weiner
@ 2015-06-05 14:29             ` Michal Hocko
  0 siblings, 0 replies; 11+ messages in thread
From: Michal Hocko @ 2015-06-05 14:29 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Tejun Heo, David Rientjes, cgroups, linux-mm

On Thu 04-06-15 15:06:49, Johannes Weiner wrote:
> On Thu, Jun 04, 2015 at 11:30:31AM +0200, Michal Hocko wrote:
> > There have been suggestions to add an OOM timeout and ignore the
> > previous OOM victim after the timeout expires and select a new
> > victim. This sounds attractive but this approach has its own problems
> > (http://marc.info/?l=linux-mm&m=141686814824684&w=2).
>
> Since this list of concerns has been brought up but never really
> addressed, let me give it a shot. From David's email:
[...]
> : (3) you can easily turn the oom killer into a serial oom killer
> : since there's no guarantee the next process that is chosen won't be
> : affected by the same problem,

This is the primary argument I had in mind when nacking the previous
attempt. If we go after another task after the timeout, and that happens
to be just another blocked task, we will not move any further, yet we
have killed another task pointlessly. Tetsuo had a load which generates
hundreds of such tasks all blocked on the same i_mutex. So you would end
up with a basically time-unbounded unresponsive system. If you put the
timeout too low you risk pointless killing, and a large timeout will not
help in the pathological case it is proposed for either.

I can imagine we can enhance the panic_on_oom policy to consider also a
timeout (panic_on_oom_timeout). This would be a much better solution IMO
and a much cleaner policy, because it would be bounded in time. So the
administrator knows what the timeout actually means. It also sounds like
a more appropriate very last resort, because the current panic_on_oom is
just too restricted.

[...]
> > I am convinced that a more appropriate solution for this is to not
> > pretend that small allocations never fail and to start failing them
> > after the OOM killer is not able to make any progress (GFP_NOFS
> > allocations would be the first candidate and the easiest one to trigger
> > deadlocks via i_mutex). Johannes was also suggesting an OOM memory
> > reserve which would be used for OOM contexts.
>
> I am no longer convinced we can ever go back to failing smaller
> allocations and NOFS allocations. The filesystem people involved in
> that discussion have proven completely uncooperative on that subject.

Yeah, this is sad, but I fail to see why this should be a reason to stop
trying to make NOFS behavior more sensible. Those allocations are
clearly restricted and retrying endlessly is simply wrong conceptually.
Maybe this alone will turn out to be sufficient to prevent the deadlock
issues in 99% of cases. panic_on_oom_timeout might be the last resort
for those who want to be really sure that the system will not be
unresponsive for an unbounded amount of time.

> So I think we should make the OOM killer as robust as possible.

I do agree. I just think there are other steps we can take which would
be less disruptive, like giving OOM contexts access to memory reserves.
Page table reclaim would be more complicated, but maybe we do not even
need that.

> It's just unnecessary to deadlock on a single process when there are
> more candidates out there that we could try instead. We are already
> in a worst-case state; killing more tasks is not going to make it
> worse.
>
> > Also the OOM killer can be improved to shrink some of the victim's
> > memory before killing it (e.g. drop private clean pages and their
> > page tables).
>
> That might work too. It's just a bit more complex and I don't really
> see the downside of moving on to other victims after a timeout.

--
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [PATCH RFC] memcg: close the race window between OOM detection and killing
  [not found]         ` <20150604093031.GB4806-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  2015-06-04 19:06           ` Johannes Weiner
@ 2015-06-04 19:29           ` Tejun Heo
  [not found]             ` <20150604192936.GR20091-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
  1 sibling, 1 reply; 11+ messages in thread
From: Tejun Heo @ 2015-06-04 19:29 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg

Hello, Michal.

On Thu, Jun 04, 2015 at 11:30:31AM +0200, Michal Hocko wrote:
> > Hmmm? In -mm, if __alloc_pages_may_oom() fails trylock, it never calls
> > out_of_memory().
>
> Sure, but the oom_lock might be free already. out_of_memory doesn't wait
> for the victim to finish. It just does schedule_timeout_killable.

That doesn't matter because the detection and TIF_MEMDIE assertion are
atomic w.r.t. oom_lock, and TIF_MEMDIE essentially extends the locking
by preventing further OOM kills. Am I missing something?

> > The main difference here is that the alloc path does the whole thing
> > synchronously and thus the OOM detection and killing can be put in the
> > same critical section, which isn't the case for the memcg OOM handling.
>
> This is true, but there is still a time window between the last
> allocation attempt and out_of_memory when the OOM victim might have
> exited and another task would be selected.

Please see above.

> > > This is not the only reason. In-kernel memcg OOM handling needs it
> > > as well. See 3812c8c8f395 ("mm: memcg: do not trap chargers with
> > > full callstack on OOM"). In fact it was the in-kernel case which
> > > triggered this change. We simply cannot wait for OOM with the stack
> > > and all the state the charge is called from.
> >
> > Why should this be any different from OOM handling from the page
> > allocator tho?
>
> Yes, the global OOM is prone to deadlock. This has been discussed a lot
> and we still do not have a good answer for that. The primary problem
> is that small allocations do not fail and retry indefinitely, so an OOM
> victim might be blocked on a lock held by a task which is the allocator.
> This is less likely and harder to trigger with standard loads than in a
> memcg environment though.

Deadlocks from infallible allocations getting interlocked are
different. The OOM killer can't really get around that by itself, but
I'm not talking about those deadlocks, and at the same time they're a
lot less likely. It's about an OOM victim trapped in a deadlock failing
to release memory because someone else is waiting for that memory to be
released while blocking the victim. Sure, the two issues are related,
but once you solve things getting blocked on a single OOM victim, it
becomes a lot less of an issue.

> There have been suggestions to add an OOM timeout and ignore the
> previous OOM victim after the timeout expires and select a new
> victim. This sounds attractive but this approach has its own problems
> (http://marc.info/?l=linux-mm&m=141686814824684&w=2).

Here are the issues the message lists.

(1) you can needlessly panic the machine because no other processes
are eligible for oom kill after declaring that the first oom kill
victim cannot make progress,

This is extremely unlikely unless most processes in the system are
involved in the same deadlock. All processes have SIGKILL pending but
nobody can exit? In such cases, panic prolly isn't such a bad idea. I
mean, where would you go from there?

(2) it can lead to unnecessary oom killing if the oom kill victim can
exit but hasn't been scheduled or is in the process of exiting,

It's a matter of having a reasonable timeout. OOM killing isn't an
exact operation to begin with, and if an OOM victim fails to release
memory in, say, 10s or whatever, finding another target is the right
thing to do.

(3) you can easily turn the oom killer into a serial oom killer since
there's no guarantee the next process that is chosen won't be affected
by the same problem, and

And how is that worse than deadlocking? The OOM killer is a mechanism
to prevent the system from complete lockup at the cost of essentially
randomly butchering its workload. The nasty userland memcg OOM hack
aside, by the time OOM killing has engaged, the system is already at
the end of its rope.

(4) this doesn't fix the problem if an oom disabled process is wedged
trying to allocate memory while holding a mutex that others are waiting

*All* others in the system are waiting on this particular OOM disabled
process and nobody can release any memory? Yeah, panic then.

The arguments in that message aren't really against adding timeouts but
a lot more for wholesale removal of OOM killing. That's an awesome goal
but is way far-fetched at the moment.

> I am convinced that a more appropriate solution for this is to not
> pretend that small allocations never fail and to start failing them
> after the OOM killer is not able to make any progress (GFP_NOFS
> allocations would be the first candidate and the easiest one to trigger
> deadlocks via i_mutex). Johannes was also suggesting an OOM memory
> reserve which would be used for OOM contexts.

I don't follow why you reached such a conclusion. The arguments don't
really make sense to me. Once you accept that the OOM killer is a
sledgehammer rather than a surgical blade, the direction to take seems
pretty obvious to me and it *can't* be a precision mechanism - no
matter what, it's killing a random process with SIGKILL.

> Also the OOM killer can be improved to shrink some of the victim's
> memory before killing it (e.g. drop private clean pages and their
> page tables).

And why would we go to that level of sophistication? Just wait a while
and kill more until it gets unwedged.
That will achieve most effects of being a lot more sophiscated with a lot less complexity and again those minute differences don't matter here. > > Gees... I dislike this approach even more. Grabbng the oom lock and > > doing everything synchronously with timeout will be far simpler and > > easier to follow. > > It might sound easier but it has its own problems... I'm still failing to see what the problems are. Thanks. -- tejun ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH RFC] memcg: close the race window between OOM detection and killing
  [not found] ` <20150604192936.GR20091-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
@ 2015-06-05 14:35 ` Michal Hocko
  [not found] ` <20150605143534.GD26113-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Michal Hocko @ 2015-06-05 14:35 UTC (permalink / raw)
To: Tejun Heo
Cc: Johannes Weiner, cgroups-u79uwXL29TY76Z2rM5mHXA,
    linux-mm-Bw31MaZKKs3YtjvyW6yDsg

On Fri 05-06-15 04:29:36, Tejun Heo wrote:
> Hello, Michal.
>
> On Thu, Jun 04, 2015 at 11:30:31AM +0200, Michal Hocko wrote:
> > > Hmmm? In -mm, if __alloc_page_may_oom() fails trylock, it never calls
> > > out_of_memory().
> >
> > Sure but the oom_lock might be free already. out_of_memory doesn't wait
> > for the victim to finish. It just does schedule_timeout_killable.
>
> That doesn't matter because the detection and TIF_MEMDIE assertion are
> atomic w.r.t. oom_lock and TIF_MEMDIE essentially extends the locking
> by preventing further OOM kills. Am I missing something?

This is true but TIF_MEMDIE releasing is not atomic wrt. the allocation
path. So the oom victim could have released memory and dropped
TIF_MEMDIE but the allocation path hasn't noticed that because it's passed

	/*
	 * Go through the zonelist yet one more time, keep very high watermark
	 * here, this is only to catch a parallel oom killing, we must fail if
	 * we're still under heavy pressure.
	 */
	page = get_page_from_freelist(gfp_mask | __GFP_HARDWALL, order,
					ALLOC_WMARK_HIGH|ALLOC_CPUSET, ac);

and goes on to kill another task because there is no TIF_MEMDIE
anymore.

> > > The main difference here is that the alloc path does the whole thing
> > > synchronously and thus the OOM detection and killing can be put in the
> > > same critical section, which isn't the case for the memcg OOM handling.
> >
> > This is true but there is still a time window between the last
> > allocation attempt and out_of_memory when the OOM victim might have
> > exited and another task would be selected.
>
> Please see above.
>
> > > > This is not the only reason. In-kernel memcg oom handling needs it
> > > > as well. See 3812c8c8f395 ("mm: memcg: do not trap chargers with
> > > > full callstack on OOM"). In fact it was the in-kernel case which
> > > > triggered this change. We simply cannot wait for oom with the stack
> > > > and all the state the charge is called from.
> > >
> > > Why should this be any different from OOM handling from the page
> > > allocator though?
> >
> > Yes, the global OOM is prone to deadlock. This has been discussed a lot
> > and we still do not have a good answer for that. The primary problem
> > is that small allocations do not fail and retry indefinitely, so an OOM
> > victim might be blocked on a lock held by a task which is the allocator.
> > This is less likely and harder to trigger with standard loads than in
> > a memcg environment, though.
>
> Deadlocks from infallible allocations getting interlocked are
> different. The OOM killer can't really get around those by itself, but
> I'm not talking about those deadlocks, and at the same time they're a
> lot less likely. It's about an OOM victim trapped in a deadlock failing
> to release memory because someone else is waiting for that memory to be
> released while blocking the victim.

I thought those would be in the allocator context - which was the
example I've provided. What kind of context do you have in mind?

> Sure, the two issues are related, but once you solve things getting
> blocked on a single OOM victim, it becomes a lot less of an issue.
>
> > There have been suggestions to add an OOM timeout and ignore the
> > previous OOM victim after the timeout expires and select a new
> > victim. This sounds attractive but this approach has its own problems
> > (http://marc.info/?l=linux-mm&m=141686814824684&w=2).
>
> Here are the issues the message lists

Let's focus on discussing those points in reply to Johannes' email.
AFAIU your notes are very much in line with his.

--
Michal Hocko
SUSE Labs

^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH RFC] memcg: close the race window between OOM detection and killing
  [not found] ` <20150605143534.GD26113-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2015-06-05 14:57 ` Tejun Heo
  [not found] ` <20150605145759.GA5946-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Tejun Heo @ 2015-06-05 14:57 UTC (permalink / raw)
To: Michal Hocko
Cc: Johannes Weiner, cgroups-u79uwXL29TY76Z2rM5mHXA,
    linux-mm-Bw31MaZKKs3YtjvyW6yDsg

Hello, Michal.

On Fri, Jun 05, 2015 at 04:35:34PM +0200, Michal Hocko wrote:
> > That doesn't matter because the detection and TIF_MEMDIE assertion are
> > atomic w.r.t. oom_lock and TIF_MEMDIE essentially extends the locking
> > by preventing further OOM kills. Am I missing something?
>
> This is true but TIF_MEMDIE releasing is not atomic wrt. the allocation
> path. So the oom victim could have released memory and dropped

This is splitting hairs. In the vast majority of problem cases, if
anything is gonna be locked up, it's gonna be locked up before releasing
the memory it's holding. Yet again, this is a blunt instrument to
unwedge the system. It's difficult to see the point of aiming for that
level of granularity.

> TIF_MEMDIE but the allocation path hasn't noticed that because it's passed
>
>	/*
>	 * Go through the zonelist yet one more time, keep very high watermark
>	 * here, this is only to catch a parallel oom killing, we must fail if
>	 * we're still under heavy pressure.
>	 */
>	page = get_page_from_freelist(gfp_mask | __GFP_HARDWALL, order,
>					ALLOC_WMARK_HIGH|ALLOC_CPUSET, ac);
>
> and goes on to kill another task because there is no TIF_MEMDIE
> anymore.

Why would this be an issue if we disallow parallel killing?

> > Deadlocks from infallible allocations getting interlocked are
> > different. The OOM killer can't really get around those by itself, but
> > I'm not talking about those deadlocks, and at the same time they're a
> > lot less likely. It's about an OOM victim trapped in a deadlock failing
> > to release memory because someone else is waiting for that memory to be
> > released while blocking the victim.
>
> I thought those would be in the allocator context - which was the
> example I've provided. What kind of context do you have in mind?

Yeah, sure, they'd be in the allocator context holding other resources
which are being waited upon. The first case was deadlock based purely
on memory starvation where NOFAIL allocations interlock with each other
w/o involving other resources.

Thanks.

--
tejun

^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH RFC] memcg: close the race window between OOM detection and killing
  [not found] ` <20150605145759.GA5946-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
@ 2015-06-05 15:21 ` Michal Hocko
  2015-06-06  0:56 ` Tejun Heo
  0 siblings, 1 reply; 11+ messages in thread
From: Michal Hocko @ 2015-06-05 15:21 UTC (permalink / raw)
To: Tejun Heo
Cc: Johannes Weiner, cgroups-u79uwXL29TY76Z2rM5mHXA,
    linux-mm-Bw31MaZKKs3YtjvyW6yDsg

On Fri 05-06-15 23:57:59, Tejun Heo wrote:
> Hello, Michal.
>
> On Fri, Jun 05, 2015 at 04:35:34PM +0200, Michal Hocko wrote:
> > > That doesn't matter because the detection and TIF_MEMDIE assertion are
> > > atomic w.r.t. oom_lock and TIF_MEMDIE essentially extends the locking
> > > by preventing further OOM kills. Am I missing something?
> >
> > This is true but TIF_MEMDIE releasing is not atomic wrt. the allocation
> > path. So the oom victim could have released memory and dropped
>
> This is splitting hairs. In the vast majority of problem cases, if
> anything is gonna be locked up, it's gonna be locked up before releasing
> the memory it's holding. Yet again, this is a blunt instrument to
> unwedge the system. It's difficult to see the point of aiming for that
> level of granularity.

I was just pointing out that the OOM killer is inherently racy even for
the global case. Not sure we are talking about the same thing here.

> > TIF_MEMDIE but the allocation path hasn't noticed that because it's passed
> >
> >	/*
> >	 * Go through the zonelist yet one more time, keep very high watermark
> >	 * here, this is only to catch a parallel oom killing, we must fail if
> >	 * we're still under heavy pressure.
> >	 */
> >	page = get_page_from_freelist(gfp_mask | __GFP_HARDWALL, order,
> >					ALLOC_WMARK_HIGH|ALLOC_CPUSET, ac);
> >
> > and goes on to kill another task because there is no TIF_MEMDIE
> > anymore.
>
> Why would this be an issue if we disallow parallel killing?

I am confused. The whole thread started with fixing a race in memcg and
I was asking about the global case, which is racy currently as well.

> > > Deadlocks from infallible allocations getting interlocked are
> > > different. The OOM killer can't really get around those by itself, but
> > > I'm not talking about those deadlocks, and at the same time they're a
> > > lot less likely. It's about an OOM victim trapped in a deadlock failing
> > > to release memory because someone else is waiting for that memory to be
> > > released while blocking the victim.
> >
> > I thought those would be in the allocator context - which was the
> > example I've provided. What kind of context do you have in mind?
>
> Yeah, sure, they'd be in the allocator context holding other resources
> which are being waited upon. The first case was deadlock based purely
> on memory starvation where NOFAIL allocations interlock with each other
> w/o involving other resources.

OK, I guess we were just talking past each other.

--
Michal Hocko
SUSE Labs

^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH RFC] memcg: close the race window between OOM detection and killing
  2015-06-05 15:21 ` Michal Hocko
@ 2015-06-06  0:56 ` Tejun Heo
  0 siblings, 0 replies; 11+ messages in thread
From: Tejun Heo @ 2015-06-06 0:56 UTC (permalink / raw)
To: Michal Hocko; +Cc: Johannes Weiner, cgroups, linux-mm

Hello, Michal.

On Fri, Jun 05, 2015 at 05:21:35PM +0200, Michal Hocko wrote:
...
> > > TIF_MEMDIE but the allocation path hasn't noticed that because it's passed
> > >
> > >	/*
> > >	 * Go through the zonelist yet one more time, keep very high watermark
> > >	 * here, this is only to catch a parallel oom killing, we must fail if
> > >	 * we're still under heavy pressure.
> > >	 */
> > >	page = get_page_from_freelist(gfp_mask | __GFP_HARDWALL, order,
> > >					ALLOC_WMARK_HIGH|ALLOC_CPUSET, ac);
> > >
> > > and goes on to kill another task because there is no TIF_MEMDIE
> > > anymore.
> >
> > Why would this be an issue if we disallow parallel killing?
>
> I am confused. The whole thread started with fixing a race in memcg and
> I was asking about the global case, which is racy currently as well.

Ah, okay, I thought we were still talking about issues with making
things synchronous. But anyway, the above isn't a synchronization race
per se, which is what the original patch was trying to address for the
memcg OOM path, right?

Thanks.

--
tejun

^ permalink raw reply [flat|nested] 11+ messages in thread
Thread overview: 11+ messages
2015-06-03 3:15 [PATCH RFC] memcg: close the race window between OOM detection and killing Tejun Heo
2015-06-03 14:44 ` Michal Hocko
[not found] ` <20150603144414.GG16201-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2015-06-03 19:36 ` Tejun Heo
2015-06-04 9:30 ` Michal Hocko
[not found] ` <20150604093031.GB4806-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2015-06-04 19:06 ` Johannes Weiner
2015-06-05 14:29 ` Michal Hocko
2015-06-04 19:29 ` Tejun Heo
[not found] ` <20150604192936.GR20091-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
2015-06-05 14:35 ` Michal Hocko
[not found] ` <20150605143534.GD26113-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2015-06-05 14:57 ` Tejun Heo
[not found] ` <20150605145759.GA5946-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
2015-06-05 15:21 ` Michal Hocko
2015-06-06 0:56 ` Tejun Heo