* [RFC PATCH 0/3] memcg OOM notifications and PF_EXITING checks
@ 2014-01-15 15:01 Michal Hocko
2014-01-15 15:01 ` [RFC 1/3] memcg: notify userspace about OOM only when an action is due Michal Hocko
` (2 more replies)
0 siblings, 3 replies; 8+ messages in thread
From: Michal Hocko @ 2014-01-15 15:01 UTC (permalink / raw)
To: linux-mm
Cc: LKML, Johannes Weiner, David Rientjes, KOSAKI Motohiro,
Andrew Morton
Hi,
this is an attempt to restart the discussion about memcg OOM
notifications and break-out conditions.
"memcg: do not hang on OOM when killed by userspace OOM access to memory
reserves", which was the first patch in the series, has already been
merged into the -mm tree
(http://www.ozlabs.org/~akpm/mmotm/broken-out/memcg-do-not-hang-on-oom-when-killed-by-userspace-oom-access-to-memory-reserves.patch)
but it hasn't been acked by either David or Johannes. I would be happy
if we agreed on that one as well.
The first patch in this series implements and extends an idea proposed
by David: do not notify userspace when the OOM killer might back off
and refrain from killing. Johannes was not fond of the idea because it
changes the userspace interface in a subtle way; somebody might be
relying on the notifications as a signal that the memcg is getting into
trouble. It has been argued that memory thresholds and vmpressure
notifications exist for such a use case.
I am in favor of changing the notification and drawing the line for
when to notify at "the kernel or userspace has to perform an action".
That makes sense to me, although it is still racy. Something might have
exited a millisecond after the notification fired, but it is at least
consistent.
The second patch is trivial: it removes the PF_EXITING check for the
current task in mem_cgroup_out_of_memory because it is no longer needed
now that we have the check in the charging path.
The last patch is just an attempt and might be totally wrong. I've
noticed that we are not checking for killed tasks in
mem_cgroup_out_of_memory, which might break use cases where a task was
killed by a vmpressure or thresholds handler but cannot terminate in
time. In that case we should rather not kill something else.
^ permalink raw reply	[flat|nested] 8+ messages in thread

* [RFC 1/3] memcg: notify userspace about OOM only when an action is due
  2014-01-15 15:01 [RFC PATCH 0/3] memcg OOM notifications and PF_EXITING checks Michal Hocko
@ 2014-01-15 15:01 ` Michal Hocko
  2014-01-15 17:56   ` Johannes Weiner
  2014-01-15 15:01 ` [RFC 2/3] memcg: do not check PF_EXITING in mem_cgroup_out_of_memory Michal Hocko
  2014-01-15 15:01 ` [RFC 3/3] memcg,oom: do not check PF_EXITING and do not set TIF_MEMDIE Michal Hocko
  2 siblings, 1 reply; 8+ messages in thread
From: Michal Hocko @ 2014-01-15 15:01 UTC (permalink / raw)
To: linux-mm
Cc: LKML, Johannes Weiner, David Rientjes, KOSAKI Motohiro,
	Andrew Morton

Userspace is currently notified about an OOM condition after reclaim
fails to uncharge any memory in MEM_CGROUP_RECLAIM_RETRIES rounds.
This usually means that the memcg is really in trouble and an OOM
action (taken either by userspace or by the kernel) is due. The kernel
OOM killer, however, bails out and doesn't kill anything if it sees an
already dying/exiting task, in the hope that memory will be released
and the OOM situation resolved.

Therefore it makes sense to notify userspace only after all measures
have really been taken and either a userspace action is required or
the kernel kills a task.

This patch is based on an idea by David Rientjes to not notify
userspace when the current task is killed or late in exiting. The
original patch, however, didn't handle in-kernel OOM killer back-offs,
which this patch implements.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 include/linux/memcontrol.h | 5 +++++
 mm/memcontrol.c            | 9 +++++----
 mm/oom_kill.c              | 3 +++
 3 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index abd0113b6620..8aeb7c441533 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -134,6 +134,7 @@ unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list);
 void mem_cgroup_update_lru_size(struct lruvec *, enum lru_list, int);
 extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
 					struct task_struct *p);
+extern void mem_cgroup_oom_notify(struct mem_cgroup *memcg);
 
 extern void mem_cgroup_replace_page_cache(struct page *oldpage,
 					struct page *newpage);
@@ -369,6 +370,10 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
 {
 }
 
+static inline void mem_cgroup_oom_notify(struct mem_cgroup *memcg)
+{
+}
+
 static inline void mem_cgroup_begin_update_page_stat(struct page *page,
 					bool *locked, unsigned long *flags)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f016d26adfd3..491d368ae488 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2232,15 +2232,16 @@ bool mem_cgroup_oom_synchronize(bool handle)
 
 	locked = mem_cgroup_oom_trylock(memcg);
 
-	if (locked)
-		mem_cgroup_oom_notify(memcg);
-
 	if (locked && !memcg->oom_kill_disable) {
 		mem_cgroup_unmark_under_oom(memcg);
 		finish_wait(&memcg_oom_waitq, &owait.wait);
+		/* calls mem_cgroup_oom_notify if there is a task to kill */
 		mem_cgroup_out_of_memory(memcg, current->memcg_oom.gfp_mask,
 					 current->memcg_oom.order);
 	} else {
+		if (locked && memcg->oom_kill_disable)
+			mem_cgroup_oom_notify(memcg);
+
 		schedule();
 		mem_cgroup_unmark_under_oom(memcg);
 		finish_wait(&memcg_oom_waitq, &owait.wait);
@@ -5620,7 +5621,7 @@ static int mem_cgroup_oom_notify_cb(struct mem_cgroup *memcg)
 	return 0;
 }
 
-static void mem_cgroup_oom_notify(struct mem_cgroup *memcg)
+void mem_cgroup_oom_notify(struct mem_cgroup *memcg)
 {
 	struct mem_cgroup *iter;
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 054ff47c4478..96b97027fc4d 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -476,6 +476,9 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 		victim = p;
 	}
 
+	if (memcg)
+		mem_cgroup_oom_notify(memcg);
+
 	/* mm cannot safely be dereferenced after task_unlock(victim) */
 	mm = victim->mm;
 	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
-- 
1.8.5.2

^ permalink raw reply related	[flat|nested] 8+ messages in thread
* Re: [RFC 1/3] memcg: notify userspace about OOM only when an action is due
  2014-01-15 15:01 ` [RFC 1/3] memcg: notify userspace about OOM only when an action is due Michal Hocko
@ 2014-01-15 17:56   ` Johannes Weiner
  2014-01-15 19:00     ` Michal Hocko
  0 siblings, 1 reply; 8+ messages in thread
From: Johannes Weiner @ 2014-01-15 17:56 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-mm, LKML, David Rientjes, KOSAKI Motohiro, Andrew Morton

On Wed, Jan 15, 2014 at 04:01:06PM +0100, Michal Hocko wrote:
> Userspace is currently notified about an OOM condition after reclaim
> fails to uncharge any memory in MEM_CGROUP_RECLAIM_RETRIES rounds.
> This usually means that the memcg is really in trouble and an OOM
> action (taken either by userspace or by the kernel) is due. The
> kernel OOM killer, however, bails out and doesn't kill anything if it
> sees an already dying/exiting task, in the hope that memory will be
> released and the OOM situation resolved.
>
> Therefore it makes sense to notify userspace only after all measures
> have really been taken and either a userspace action is required or
> the kernel kills a task.
>
> This patch is based on an idea by David Rientjes to not notify
> userspace when the current task is killed or late in exiting. The
> original patch, however, didn't handle in-kernel OOM killer
> back-offs, which this patch implements.
>
> Signed-off-by: Michal Hocko <mhocko@suse.cz>

OOM is a temporary state because any task can exit at a time that is
not under our control and outside our knowledge. That's why the OOM
situation is defined by failing an allocation after a certain number
of reclaim and charge attempts.

As of right now, the OOM sampling window is MEM_CGROUP_RECLAIM_RETRIES
loops of charge attempts and reclaim. If a racing task is exiting and
releasing memory during that window, the charge will succeed fine.

If the sampling window is too short in practice, it will have to be
extended, preferably by increasing MEM_CGROUP_RECLAIM_RETRIES. But a
random task exiting a split second after the sampling window has
closed will always be a possibility, regardless of how long it is.

There is nothing to be gained from this layering violation and it's
mind-boggling that you two still think this is a meaningful change.

Nacked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: [RFC 1/3] memcg: notify userspace about OOM only when an action is due
  2014-01-15 17:56   ` Johannes Weiner
@ 2014-01-15 19:00     ` Michal Hocko
  2014-01-15 20:30       ` Johannes Weiner
  0 siblings, 1 reply; 8+ messages in thread
From: Michal Hocko @ 2014-01-15 19:00 UTC (permalink / raw)
To: Johannes Weiner
Cc: linux-mm, LKML, David Rientjes, KOSAKI Motohiro, Andrew Morton

On Wed 15-01-14 12:56:55, Johannes Weiner wrote:
> On Wed, Jan 15, 2014 at 04:01:06PM +0100, Michal Hocko wrote:
> > Userspace is currently notified about an OOM condition after reclaim
> > fails to uncharge any memory in MEM_CGROUP_RECLAIM_RETRIES rounds.
> > This usually means that the memcg is really in trouble and an OOM
> > action (taken either by userspace or by the kernel) is due. The
> > kernel OOM killer, however, bails out and doesn't kill anything if
> > it sees an already dying/exiting task, in the hope that memory will
> > be released and the OOM situation resolved.
> >
> > Therefore it makes sense to notify userspace only after all measures
> > have really been taken and either a userspace action is required or
> > the kernel kills a task.
> >
> > This patch is based on an idea by David Rientjes to not notify
> > userspace when the current task is killed or late in exiting. The
> > original patch, however, didn't handle in-kernel OOM killer
> > back-offs, which this patch implements.
> >
> > Signed-off-by: Michal Hocko <mhocko@suse.cz>
>
> OOM is a temporary state because any task can exit at a time that is
> not under our control and outside our knowledge. That's why the OOM
> situation is defined by failing an allocation after a certain number
> of reclaim and charge attempts.
>
> As of right now, the OOM sampling window is MEM_CGROUP_RECLAIM_RETRIES
> loops of charge attempts and reclaim. If a racing task is exiting and
> releasing memory during that window, the charge will succeed fine. If
> the sampling window is too short in practice, it will have to be
> extended, preferably by increasing MEM_CGROUP_RECLAIM_RETRIES.

The patch doesn't try to address the above race because that one is
unfixable. I hope that is clear.

It just tries to reduce the burden on the userspace OOM notification
consumers and give them simple semantics: a notification comes only if
an action will be necessary (either the kernel kills something or a
userspace action is expected).

E.g. consider a handler which tries to clean up after the kernel has
handled an OOM and killed something. If the kernel could back off and
refrain from killing anything after the notification has already
fired, then userspace has no practical way to detect that (except for
searching the kernel log for OOM messages, which might get suppressed
due to rate limiting etc. Nothing I would call optimal).

Or do you think that such a use case doesn't make much sense and is an
abuse of the notification interface?

> But a random task exiting a split second after the sampling window has
> closed will always be a possibility, regardless of how long it is.

Agreed, and this is not what the patch is about. If the kernel OOM
killer couldn't back off then I would completely agree with you here.

> There is nothing to be gained from this layering violation and it's
> mind-boggling that you two still think this is a meaningful change.
>
> Nacked-by: Johannes Weiner <hannes@cmpxchg.org>
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: [RFC 1/3] memcg: notify userspace about OOM only when an action is due
  2014-01-15 19:00     ` Michal Hocko
@ 2014-01-15 20:30       ` Johannes Weiner
  2014-01-16 14:10         ` Michal Hocko
  0 siblings, 1 reply; 8+ messages in thread
From: Johannes Weiner @ 2014-01-15 20:30 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-mm, LKML, David Rientjes, KOSAKI Motohiro, Andrew Morton

On Wed, Jan 15, 2014 at 08:00:15PM +0100, Michal Hocko wrote:
> On Wed 15-01-14 12:56:55, Johannes Weiner wrote:
> > On Wed, Jan 15, 2014 at 04:01:06PM +0100, Michal Hocko wrote:
> > > [the changelog of this patch, quoted in full above]
> >
> > OOM is a temporary state because any task can exit at a time that is
> > not under our control and outside our knowledge. That's why the OOM
> > situation is defined by failing an allocation after a certain number
> > of reclaim and charge attempts.
> >
> > As of right now, the OOM sampling window is
> > MEM_CGROUP_RECLAIM_RETRIES loops of charge attempts and reclaim. If
> > a racing task is exiting and releasing memory during that window,
> > the charge will succeed fine. If the sampling window is too short in
> > practice, it will have to be extended, preferably by increasing
> > MEM_CGROUP_RECLAIM_RETRIES.
>
> The patch doesn't try to address the above race because that one is
> unfixable. I hope that is clear.
>
> It just tries to reduce the burden on the userspace OOM notification
> consumers and give them simple semantics: a notification comes only
> if an action will be necessary (either the kernel kills something or
> a userspace action is expected).

I.e. turn the OOM notification into an OOM kill event notification.

> E.g. consider a handler which tries to clean up after the kernel has
> handled an OOM and killed something. If the kernel could back off and
> refrain from killing anything after the notification has already
> fired, then userspace has no practical way to detect that (except for
> searching the kernel log for OOM messages, which might get suppressed
> due to rate limiting etc. Nothing I would call optimal).
>
> Or do you think that such a use case doesn't make much sense and is
> an abuse of the notification interface?

I'm not sure what such a cleanup would be doing; a real-life usecase
would be useful when we are about to change notification semantics.

I've heard "taking down the remaining tasks of the job" before, but
that would be better solved by having the OOM killer operate on
cgroups as single entities instead of taking out individual tasks.

On the other hand, I can see how people use the OOM notification to
monitor system/cgroup health. David argued that vmpressure "critical"
would be the same thing. But first of all, this is not an argument to
change the semantics of an established interface. And secondly, it
only tells you that reclaim is struggling; it doesn't give you the
point of failure (the OOM condition), regardless of what the docs
claim.

So, please, if you need a new interface, make a clear case for it and
then we can discuss whether it's the right way to go. We do the same
for every other user interface, whether it's a syscall, an ioctl, a
procfs file etc. Just taking something existing that is close enough
and skewing the semantics in your favor like this is not okay.

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: [RFC 1/3] memcg: notify userspace about OOM only when an action is due
  2014-01-15 20:30       ` Johannes Weiner
@ 2014-01-16 14:10         ` Michal Hocko
  0 siblings, 0 replies; 8+ messages in thread
From: Michal Hocko @ 2014-01-16 14:10 UTC (permalink / raw)
To: Johannes Weiner
Cc: linux-mm, LKML, David Rientjes, KOSAKI Motohiro, Andrew Morton

On Wed 15-01-14 15:30:47, Johannes Weiner wrote:
> On Wed, Jan 15, 2014 at 08:00:15PM +0100, Michal Hocko wrote:
> > On Wed 15-01-14 12:56:55, Johannes Weiner wrote:
> > > On Wed, Jan 15, 2014 at 04:01:06PM +0100, Michal Hocko wrote:
> > > > [the changelog of this patch, quoted in full above]
> > >
> > > OOM is a temporary state because any task can exit at a time that
> > > is not under our control and outside our knowledge. That's why the
> > > OOM situation is defined by failing an allocation after a certain
> > > number of reclaim and charge attempts.
> > >
> > > As of right now, the OOM sampling window is
> > > MEM_CGROUP_RECLAIM_RETRIES loops of charge attempts and reclaim.
> > > If a racing task is exiting and releasing memory during that
> > > window, the charge will succeed fine. If the sampling window is
> > > too short in practice, it will have to be extended, preferably by
> > > increasing MEM_CGROUP_RECLAIM_RETRIES.
> >
> > The patch doesn't try to address the above race because that one is
> > unfixable. I hope that is clear.
> >
> > It just tries to reduce the burden on the userspace OOM notification
> > consumers and give them simple semantics: a notification comes only
> > if an action will be necessary (either the kernel kills something or
> > a userspace action is expected).
>
> I.e. turn the OOM notification into an OOM kill event notification.

OK, maybe it's just me, but I have always considered OOM -> OOM kill,
because if for some reason we do not need to perform an action then we
are not really OOM (one part of it is the state, the other part is the
action). Maybe that's because for ages you couldn't find out you were
under OOM unless you saw the OOM killer in action (well, memcg has
changed that, but...). I might be wrong here, of course...

> > E.g. consider a handler which tries to clean up after the kernel has
> > handled an OOM and killed something. If the kernel could back off
> > and refrain from killing anything after the notification has already
> > fired, then userspace has no practical way to detect that (except
> > for searching the kernel log for OOM messages, which might get
> > suppressed due to rate limiting etc. Nothing I would call optimal).
> >
> > Or do you think that such a use case doesn't make much sense and is
> > an abuse of the notification interface?
>
> I'm not sure what such a cleanup would be doing; a real-life usecase
> would be useful when we are about to change notification semantics.
>
> I've heard "taking down the remaining tasks of the job" before, but
> that would be better solved by having the OOM killer operate on
> cgroups as single entities instead of taking out individual tasks.

I am not a direct user of the interface myself, but I can imagine that
there might be many clean-up actions to be done. The killed task
receives SIGKILL so it doesn't get any chance to do the cleanup
itself. This might be something like reverting to the last consistent
state of the internal data, or removing temporary files which, for
some reason, had to be visible throughout the process's lifetime, and
many others.

> On the other hand, I can see how people use the OOM notification to
> monitor system/cgroup health. David argued that vmpressure "critical"
> would be the same thing. But first of all, this is not an argument to
> change the semantics of an established interface. And secondly, it
> only tells you that reclaim is struggling; it doesn't give you the
> point of failure (the OOM condition), regardless of what the docs
> claim.
>
> So, please, if you need a new interface, make a clear case for it and
> then we can discuss whether it's the right way to go. We do the same
> for every other user interface, whether it's a syscall, an ioctl, a
> procfs file etc. Just taking something existing that is close enough
> and skewing the semantics in your favor like this is not okay.

Agreed, that's why this has been sent as a request for comments and
discussion. It is sad that the discussion ended before it started...
I realize that the previous one was quite frustrating but maybe we can
do better.

I am not going to push for this very strongly because I believe that
last-second back-offs before the OOM killer fires don't happen all
that often. Do we have any numbers for that, btw? Maybe we should
start by adding a counter, report it in the (memcg) statistics (quick
patch on top of mmotm below), and base our future decisions on those
numbers. Because to be honest, something tells me that the overall
difference will be barely noticeable for most workloads.

Anyway, I liked having the notification tied more tightly to the
action because it makes userspace notifiers easier to implement; they
wouldn't have to worry about back-offs. The semantics are also much
cleaner IMO because you get a notification that the situation is so
bad that the kernel had to resort to an emergency measure.
---
From 0aa4a4450815e78c8d69ea2dd5b4ddf308f68110 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Thu, 16 Jan 2014 14:17:27 +0100
Subject: [PATCH] memcg: collect oom back off statistics

The OOM killer might back off and not kill anything if it believes
that the memcg is not really OOM because there is a task (or tasks)
on the way out which should free memory shortly. We do not, however,
have any data on how effective those heuristics are at preventing
real kills. Let's add a statistic for this event so we can collect
data and draw some educated conclusions from it.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/memcontrol.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f016d26adfd3..b3c839ac6a1d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -100,6 +100,7 @@ enum mem_cgroup_events_index {
 	MEM_CGROUP_EVENTS_PGPGOUT,	/* # of pages paged out */
 	MEM_CGROUP_EVENTS_PGFAULT,	/* # of page-faults */
 	MEM_CGROUP_EVENTS_PGMAJFAULT,	/* # of major page-faults */
+	MEM_CGROUP_EVENTS_OOM_BACK_OFF, /* # of oom killer backoffs */
 	MEM_CGROUP_EVENTS_NSTATS,
 };
 
@@ -108,6 +109,7 @@ static const char * const mem_cgroup_events_names[] = {
 	"pgpgout",
 	"pgfault",
 	"pgmajfault",
+	"oom_back_off",
 };
 
 static const char * const mem_cgroup_lru_names[] = {
@@ -1752,6 +1754,13 @@ static u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
 	return limit;
 }
 
+static void mem_cgroup_inc_oom_backoff(struct mem_cgroup *memcg)
+{
+	preempt_disable();
+	this_cpu_inc(memcg->stat->events[MEM_CGROUP_EVENTS_OOM_BACK_OFF]);
+	preempt_enable();
+}
+
 static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 				     int order)
 {
@@ -1768,6 +1777,7 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 */
 	if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
 		set_thread_flag(TIF_MEMDIE);
+		mem_cgroup_inc_oom_backoff(memcg);
 		return;
 	}
 
@@ -1795,6 +1805,7 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 			mem_cgroup_iter_break(memcg, iter);
 			if (chosen)
 				put_task_struct(chosen);
+			mem_cgroup_inc_oom_backoff(memcg);
 			return;
 		case OOM_SCAN_OK:
 			break;
-- 
1.8.5.2

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 8+ messages in thread
* [RFC 2/3] memcg: do not check PF_EXITING in mem_cgroup_out_of_memory
  2014-01-15 15:01 [RFC PATCH 0/3] memcg OOM notifications and PF_EXITING checks Michal Hocko
  2014-01-15 15:01 ` [RFC 1/3] memcg: notify userspace about OOM only when an action is due Michal Hocko
@ 2014-01-15 15:01 ` Michal Hocko
  2014-01-15 15:01 ` [RFC 3/3] memcg,oom: do not check PF_EXITING and do not set TIF_MEMDIE Michal Hocko
  2 siblings, 0 replies; 8+ messages in thread
From: Michal Hocko @ 2014-01-15 15:01 UTC (permalink / raw)
To: linux-mm
Cc: LKML, Johannes Weiner, David Rientjes, KOSAKI Motohiro,
	Andrew Morton

All tasks with PF_EXITING set skip the charge entirely since ("memcg:
do not hang on OOM when killed by userspace OOM access to memory
reserves") was merged, so the check here is no longer needed.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/memcontrol.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 491d368ae488..97ae5cf12f5e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1766,7 +1766,7 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * select it. The goal is to allow it to allocate so that it may
 	 * quickly exit and free its memory.
 	 */
-	if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
+	if (fatal_signal_pending(current)) {
 		set_thread_flag(TIF_MEMDIE);
 		return;
 	}
-- 
1.8.5.2

^ permalink raw reply related	[flat|nested] 8+ messages in thread
* [RFC 3/3] memcg,oom: do not check PF_EXITING and do not set TIF_MEMDIE
  2014-01-15 15:01 [RFC PATCH 0/3] memcg OOM notifications and PF_EXITING checks Michal Hocko
  2014-01-15 15:01 ` [RFC 1/3] memcg: notify userspace about OOM only when an action is due Michal Hocko
  2014-01-15 15:01 ` [RFC 2/3] memcg: do not check PF_EXITING in mem_cgroup_out_of_memory Michal Hocko
@ 2014-01-15 15:01 ` Michal Hocko
  2 siblings, 0 replies; 8+ messages in thread
From: Michal Hocko @ 2014-01-15 15:01 UTC (permalink / raw)
To: linux-mm
Cc: LKML, Johannes Weiner, David Rientjes, KOSAKI Motohiro,
	Andrew Morton

The memcg OOM handler mimics the global OOM handler heuristics. One of
them is giving a dying task (one with either fatal signals pending or
PF_EXITING set) access to memory reserves via the TIF_MEMDIE flag.
This is not necessary, though, because the memory allocation has
already been done by the time it is charged against a memcg, so we do
not need to abuse the flag.

The fatal_signal_pending check is a bit tricky because the current
task might have been killed during reclaim as an action taken by a
vmpressure/thresholds handler, and we definitely want to prevent an
OOM kill in such situations. The current check is incomplete, though,
because it only covers the current task; oom_scan_process_thread
doesn't check for fatal_signal_pending. oom_scan_process_thread is
shared between the global and memcg OOM killers, so we cannot simply
abort the scan for killed tasks there. We can, instead, move the check
down into mem_cgroup_out_of_memory and break out of the task iteration
loop when a killed task is encountered.

We could check for PF_EXITING as well, but it is dubious whether that
would help much more, as a task should exit quite quickly once it is
scheduled.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/memcontrol.c | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 97ae5cf12f5e..ea9564895f54 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1761,16 +1761,6 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	unsigned int points = 0;
 	struct task_struct *chosen = NULL;
 
-	/*
-	 * If current has a pending SIGKILL or is exiting, then automatically
-	 * select it. The goal is to allow it to allocate so that it may
-	 * quickly exit and free its memory.
-	 */
-	if (fatal_signal_pending(current)) {
-		set_thread_flag(TIF_MEMDIE);
-		return;
-	}
-
 	check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, order, NULL);
 	totalpages = mem_cgroup_get_limit(memcg) >> PAGE_SHIFT ? : 1;
 	for_each_mem_cgroup_tree(iter, memcg) {
@@ -1779,6 +1769,16 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 
 		css_task_iter_start(&iter->css, &it);
 		while ((task = css_task_iter_next(&it))) {
+			/*
+			 * Killed tasks are selected automatically. The goal is
+			 * to give the task some more time to exit and release
+			 * the memory.
+			 * Unlike for the global OOM handler we do not need
+			 * access to memory reserves.
+			 */
+			if (fatal_signal_pending(task))
+				goto abort;
+
 			switch (oom_scan_process_thread(task, totalpages, NULL,
 							false)) {
 			case OOM_SCAN_SELECT:
@@ -1791,6 +1791,7 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 			case OOM_SCAN_CONTINUE:
 				continue;
 			case OOM_SCAN_ABORT:
+abort:
 				css_task_iter_end(&it);
 				mem_cgroup_iter_break(memcg, iter);
 				if (chosen)
-- 
1.8.5.2

^ permalink raw reply related	[flat|nested] 8+ messages in thread
end of thread, other threads:[~2014-01-16 14:10 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --
2014-01-15 15:01 [RFC PATCH 0/3] memcg OOM notifications and PF_EXITING checks Michal Hocko
2014-01-15 15:01 ` [RFC 1/3] memcg: notify userspace about OOM only when an action is due Michal Hocko
2014-01-15 17:56   ` Johannes Weiner
2014-01-15 19:00     ` Michal Hocko
2014-01-15 20:30       ` Johannes Weiner
2014-01-16 14:10         ` Michal Hocko
2014-01-15 15:01 ` [RFC 2/3] memcg: do not check PF_EXITING in mem_cgroup_out_of_memory Michal Hocko
2014-01-15 15:01 ` [RFC 3/3] memcg,oom: do not check PF_EXITING and do not set TIF_MEMDIE Michal Hocko
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox