* [RFC PATCH] mm, oom: move GFP_NOFS check to out_of_memory
@ 2016-03-29 13:27 Michal Hocko
To: linux-mm
Cc: David Rientjes, Johannes Weiner, Tetsuo Handa, Andrew Morton, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

__alloc_pages_may_oom is the central place to decide when out_of_memory
should be invoked. This is a good approach for most checks there
because they are page allocator specific and the allocation fails right
after them.

The notable exception is the GFP_NOFS context, which fakes
did_some_progress and keeps the page allocator looping even though
there couldn't have been any progress from the OOM killer. This patch
doesn't change that behavior, because we are not ready to allow those
allocation requests to fail yet. Instead, the __GFP_FS check is moved
down to out_of_memory and prevents OOM victim selection there. There
are two reasons for that:
	- OOM notifiers might release some memory even from this
	  context, as none of the registered notifiers seems to be FS
	  related
	- this might help a dying thread to get access to memory
	  reserves and move on, which will make the behavior more
	  consistent with the case when the task gets killed from a
	  different context.

Keep a comment in __alloc_pages_may_oom to make sure we do not forget
how GFP_NOFS is special and that we really want to do something about
it.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---

Hi,
I am sending this as an RFC now, even though I think it makes more
sense than what we have right now. Maybe there are some side effects I
do not see, though. A trickier part is the OOM notifier one, because
future notifiers might decide to depend on the FS and then we can lock
up. Is this something to worry about, though? Would such a notifier be
correct at all? I would call it broken, as it would put the OOM killer
out of the way on a contended system, which is a plain bug IMHO.

If this looks like a reasonable approach, I would go on to think about
how we can extend this for the oom_reaper and queue the current thread
for the reaper to free some of the memory.

Any thoughts?

 mm/oom_kill.c   |  4 ++++
 mm/page_alloc.c | 24 ++++++++++--------------
 2 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 86349586eacb..1c2b7a82f0c4 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -876,6 +876,10 @@ bool out_of_memory(struct oom_control *oc)
 		return true;
 	}
 
+	/* The OOM killer does not compensate for IO-less reclaim. */
+	if (!(oc->gfp_mask & __GFP_FS))
+		return true;
+
 	/*
 	 * Check if there were limitations on the allocation (only relevant for
 	 * NUMA) that may require different handling.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1b889dba7bd4..736ea28abfcf 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2872,22 +2872,18 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	/* The OOM killer does not needlessly kill tasks for lowmem */
 	if (ac->high_zoneidx < ZONE_NORMAL)
 		goto out;
-	/* The OOM killer does not compensate for IO-less reclaim */
-	if (!(gfp_mask & __GFP_FS)) {
-		/*
-		 * XXX: Page reclaim didn't yield anything,
-		 * and the OOM killer can't be invoked, but
-		 * keep looping as per tradition.
-		 *
-		 * But do not keep looping if oom_killer_disable()
-		 * was already called, for the system is trying to
-		 * enter a quiescent state during suspend.
-		 */
-		*did_some_progress = !oom_killer_disabled;
-		goto out;
-	}
 	if (pm_suspended_storage())
 		goto out;
+	/*
+	 * XXX: GFP_NOFS allocations should rather fail than rely on
+	 * other requests to make forward progress.
+	 * We are in an unfortunate situation where out_of_memory cannot
+	 * do much for this context but let's try it to at least get
+	 * access to memory reserves if the current task is killed (see
+	 * out_of_memory). Once filesystems are ready to handle allocation
+	 * failures more gracefully we should just bail out here.
+	 */
+
 	/* The OOM killer may not free memory on a specific node */
 	if (gfp_mask & __GFP_THISNODE)
 		goto out;
-- 
2.7.0
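For context, the shortcut in out_of_memory() that the changelog relies
on for letting a dying task access memory reserves sat just above the
newly added check; roughly, abridged from mm/oom_kill.c of that time (a
sketch for orientation, not part of the patch):

	/*
	 * If current has a pending SIGKILL or is exiting, then
	 * automatically select it. The goal is to allow it to allocate
	 * so that it may quickly exit and free its memory.
	 */
	if (current->mm &&
	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
		mark_oom_victim(current);
		return true;
	}

Because both this shortcut and the OOM notifier chain run before the
new !__GFP_FS bail-out, a killed GFP_NOFS allocator can still get
TIF_MEMDIE even though victim selection is skipped.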
* Re: [RFC PATCH] mm, oom: move GFP_NOFS check to out_of_memory
@ 2016-03-29 13:45 Tetsuo Handa
To: mhocko, linux-mm; +Cc: rientjes, hannes, akpm, linux-kernel, mhocko

Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
>
> __alloc_pages_may_oom is the central place to decide when out_of_memory
> should be invoked. This is a good approach for most checks there
> because they are page allocator specific and the allocation fails right
> after them.
>
> The notable exception is the GFP_NOFS context, which fakes
> did_some_progress and keeps the page allocator looping even though
> there couldn't have been any progress from the OOM killer. This patch
> doesn't change that behavior, because we are not ready to allow those
> allocation requests to fail yet. Instead, the __GFP_FS check is moved
> down to out_of_memory and prevents OOM victim selection there. There
> are two reasons for that:
> 	- OOM notifiers might release some memory even from this
> 	  context, as none of the registered notifiers seems to be FS
> 	  related
> 	- this might help a dying thread to get access to memory
> 	  reserves and move on, which will make the behavior more
> 	  consistent with the case when the task gets killed from a
> 	  different context.

Allowing !__GFP_FS allocations to get TIF_MEMDIE by calling the
shortcuts in out_of_memory() would be fine. But I don't like the
direction you want to go.

I don't like failing !__GFP_FS allocations without selecting an OOM
victim
( http://lkml.kernel.org/r/201603252054.ADH30264.OJQFFLMOHFSOVt@I-love.SAKURA.ne.jp ).

Also, I suggested removing all shortcuts by setting TIF_MEMDIE from
oom_kill_process()
( http://lkml.kernel.org/r/1458529634-5951-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp ).

>
> Keep a comment in __alloc_pages_may_oom to make sure we do not forget
> how GFP_NOFS is special and that we really want to do something about
> it.
>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>
> Hi,
> I am sending this as an RFC now, even though I think it makes more
> sense than what we have right now. Maybe there are some side effects I
> do not see, though. A trickier part is the OOM notifier one, because
> future notifiers might decide to depend on the FS and then we can lock
> up. Is this something to worry about, though? Would such a notifier be
> correct at all? I would call it broken, as it would put the OOM killer
> out of the way on a contended system, which is a plain bug IMHO.
>
> If this looks like a reasonable approach, I would go on to think about
> how we can extend this for the oom_reaper and queue the current thread
> for the reaper to free some of the memory.
>
> Any thoughts?
* Re: [RFC PATCH] mm, oom: move GFP_NOFS check to out_of_memory
@ 2016-03-29 14:22 Michal Hocko
To: Tetsuo Handa; +Cc: linux-mm, rientjes, hannes, akpm, linux-kernel

On Tue 29-03-16 22:45:40, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> >
> > __alloc_pages_may_oom is the central place to decide when
> > out_of_memory should be invoked. This is a good approach for most
> > checks there because they are page allocator specific and the
> > allocation fails right after them.
> >
> > The notable exception is the GFP_NOFS context, which fakes
> > did_some_progress and keeps the page allocator looping even though
> > there couldn't have been any progress from the OOM killer. This
> > patch doesn't change that behavior, because we are not ready to
> > allow those allocation requests to fail yet. Instead, the __GFP_FS
> > check is moved down to out_of_memory and prevents OOM victim
> > selection there. There are two reasons for that:
> > 	- OOM notifiers might release some memory even from this
> > 	  context, as none of the registered notifiers seems to be FS
> > 	  related
> > 	- this might help a dying thread to get access to memory
> > 	  reserves and move on, which will make the behavior more
> > 	  consistent with the case when the task gets killed from a
> > 	  different context.
>
> Allowing !__GFP_FS allocations to get TIF_MEMDIE by calling the
> shortcuts in out_of_memory() would be fine. But I don't like the
> direction you want to go.
>
> I don't like failing !__GFP_FS allocations without selecting an OOM
> victim
> ( http://lkml.kernel.org/r/201603252054.ADH30264.OJQFFLMOHFSOVt@I-love.SAKURA.ne.jp ).

I didn't get to read and digest that email yet, but from a quick glance
it doesn't seem to be directly related to this patch. Even if we decide
that the __GFP_FS vs. OOM killer logic is flawed for some reason, this
would build on top of it, as granting access to memory reserves is not
against it.

> Also, I suggested removing all shortcuts by setting TIF_MEMDIE from
> oom_kill_process()
> ( http://lkml.kernel.org/r/1458529634-5951-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp ).

I personally do not like this much. I believe we have already tried to
explain why we have (some of) those shortcuts. They might be too
optimistic and there is room for improvement for sure, but I am not
convinced we can get rid of them that easily.
-- 
Michal Hocko
SUSE Labs
* Re: [RFC PATCH] mm, oom: move GFP_NOFS check to out_of_memory
@ 2016-03-29 15:29 Tetsuo Handa
To: mhocko; +Cc: linux-mm, rientjes, hannes, akpm, linux-kernel

Michal Hocko wrote:
> On Tue 29-03-16 22:45:40, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > From: Michal Hocko <mhocko@suse.com>
> > >
> > > __alloc_pages_may_oom is the central place to decide when
> > > out_of_memory should be invoked. This is a good approach for most
> > > checks there because they are page allocator specific and the
> > > allocation fails right after them.
> > >
> > > The notable exception is the GFP_NOFS context, which fakes
> > > did_some_progress and keeps the page allocator looping even though
> > > there couldn't have been any progress from the OOM killer. This
> > > patch doesn't change that behavior, because we are not ready to
> > > allow those allocation requests to fail yet. Instead, the __GFP_FS
> > > check is moved down to out_of_memory and prevents OOM victim
> > > selection there. There are two reasons for that:
> > > 	- OOM notifiers might release some memory even from this
> > > 	  context, as none of the registered notifiers seems to be FS
> > > 	  related
> > > 	- this might help a dying thread to get access to memory
> > > 	  reserves and move on, which will make the behavior more
> > > 	  consistent with the case when the task gets killed from a
> > > 	  different context.
> >
> > Allowing !__GFP_FS allocations to get TIF_MEMDIE by calling the
> > shortcuts in out_of_memory() would be fine. But I don't like the
> > direction you want to go.
> >
> > I don't like failing !__GFP_FS allocations without selecting an OOM
> > victim
> > ( http://lkml.kernel.org/r/201603252054.ADH30264.OJQFFLMOHFSOVt@I-love.SAKURA.ne.jp ).
>
> I didn't get to read and digest that email yet, but from a quick
> glance it doesn't seem to be directly related to this patch. Even if
> we decide that the __GFP_FS vs. OOM killer logic is flawed for some
> reason, this would build on top of it, as granting access to memory
> reserves is not against it.

I think that removing these shortcuts is better.

> > Also, I suggested removing all shortcuts by setting TIF_MEMDIE from
> > oom_kill_process()
> > ( http://lkml.kernel.org/r/1458529634-5951-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp ).
>
> I personally do not like this much. I believe we have already tried
> to explain why we have (some of) those shortcuts. They might be too
> optimistic and there is room for improvement for sure, but I am not
> convinced we can get rid of them that easily.

These shortcuts are too optimistic. They assume that the target thread
can call exit_oom_victim(), but the reality is that the target task can
get stuck at down_read(&mm->mmap_sem) in exit_mm(). If SIGKILL were
sent to all thread groups sharing that mm, the possibility of the
target thread getting stuck at down_read(&mm->mmap_sem) in exit_mm()
would be significantly reduced.

http://lkml.kernel.org/r/20160329141442.GD4466@dhcp22.suse.cz tried to
let the OOM reaper call exit_oom_victim() on behalf of the target
thread by waking up the OOM reaper. But the OOM reaper won't call
exit_oom_victim(), because the OOM reaper will fail to reap memory
while some thread sharing that mm and holding mm->mmap_sem for write
does not receive SIGKILL if we use these shortcuts. As far as I know,
all existing explanations for why we have these shortcuts ignore the
possibility of such a thread.
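The exit-time dependency Tetsuo describes sits in exit_mm(); roughly,
abridged from kernel/exit.c of that era (a sketch for orientation,
details elided):

	static void exit_mm(struct task_struct *tsk)
	{
		struct mm_struct *mm = tsk->mm;
		...
		/*
		 * Serialize with any possible pending coredump. A
		 * sibling thread sharing this mm and blocked in an
		 * allocation while holding mm->mmap_sem for write keeps
		 * this down_read() waiting forever, so the victim never
		 * reaches the exit_oom_victim() below.
		 */
		down_read(&mm->mmap_sem);
		...
		up_read(&mm->mmap_sem);
		...
		mmput(mm);
		if (test_thread_flag(TIF_MEMDIE))
			exit_oom_victim();
	}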
* Re: [RFC PATCH] mm, oom: move GFP_NOFS check to out_of_memory
@ 2016-03-29 14:14 Michal Hocko
To: linux-mm; +Cc: David Rientjes, Johannes Weiner, Tetsuo Handa, Andrew Morton, LKML

On Tue 29-03-16 15:27:35, Michal Hocko wrote:
[...]
> If this looks like a reasonable approach, I would go on to think about
> how we can extend this for the oom_reaper and queue the current thread
> for the reaper to free some of the memory.

And this is what I came up with (untested yet). Doesn't look too bad to
me:
---
* Re: [RFC PATCH] mm, oom: move GFP_NOFS check to out_of_memory
@ 2016-03-29 22:13 David Rientjes
To: Michal Hocko
Cc: linux-mm, Johannes Weiner, Tetsuo Handa, Andrew Morton, LKML, Michal Hocko

On Tue, 29 Mar 2016, Michal Hocko wrote:

> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 86349586eacb..1c2b7a82f0c4 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -876,6 +876,10 @@ bool out_of_memory(struct oom_control *oc)
>  		return true;
>  	}
>  
> +	/* The OOM killer does not compensate for IO-less reclaim. */
> +	if (!(oc->gfp_mask & __GFP_FS))
> +		return true;
> +
>  	/*
>  	 * Check if there were limitations on the allocation (only relevant for
>  	 * NUMA) that may require different handling.

I don't object to this necessarily, but I think we need input from
those who have taken the time to implement their own OOM notifier, to
see whether they agree. In the past, they would only be called if
reclaim had completely failed; now they can be called in low-memory
situations where reclaim has had very little chance to be successful.
Getting an ack from them would be helpful.

I also think we have discussed this before, but I think the oom
notifier handling should be done in the page allocator proper, i.e. in
__alloc_pages_may_oom(). We can leave out_of_memory() for a clearly
defined purpose: to kill a process when all reclaim has failed.
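For readers unfamiliar with the interface being discussed: an OOM
notifier is registered through register_oom_notifier() and is asked to
release memory right at the top of out_of_memory(). A minimal sketch of
such a user follows; my_driver_shrink_cache() is a hypothetical helper,
not an existing kernel function:

	#include <linux/oom.h>
	#include <linux/notifier.h>

	/* Hypothetical: drop some driver-private cache, return pages freed. */
	static unsigned long my_driver_shrink_cache(void);

	static int my_oom_notify(struct notifier_block *nb,
				 unsigned long unused, void *parm)
	{
		unsigned long *freed = parm;

		/* out_of_memory() skips killing when *freed ends up > 0 */
		*freed += my_driver_shrink_cache();
		return NOTIFY_OK;
	}

	static struct notifier_block my_oom_nb = {
		.notifier_call = my_oom_notify,
	};

	/* somewhere in module init: register_oom_notifier(&my_oom_nb); */

David's point is that, with the patch, such callbacks start firing for
GFP_NOFS allocations much earlier in an allocation's lifetime than
their authors may have assumed.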
* Re: [RFC PATCH] mm, oom: move GFP_NOFS check to out_of_memory
@ 2016-03-30  9:47 Michal Hocko
To: David Rientjes
Cc: linux-mm, Johannes Weiner, Tetsuo Handa, Andrew Morton, LKML

On Tue 29-03-16 15:13:54, David Rientjes wrote:
> On Tue, 29 Mar 2016, Michal Hocko wrote:
>
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 86349586eacb..1c2b7a82f0c4 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -876,6 +876,10 @@ bool out_of_memory(struct oom_control *oc)
> >  		return true;
> >  	}
> >  
> > +	/* The OOM killer does not compensate for IO-less reclaim. */
> > +	if (!(oc->gfp_mask & __GFP_FS))
> > +		return true;
> > +
> >  	/*
> >  	 * Check if there were limitations on the allocation (only relevant for
> >  	 * NUMA) that may require different handling.
>
> I don't object to this necessarily, but I think we need input from
> those who have taken the time to implement their own OOM notifier, to
> see whether they agree. In the past, they would only be called if
> reclaim had completely failed; now they can be called in low-memory
> situations where reclaim has had very little chance to be successful.
> Getting an ack from them would be helpful.

I will make sure to put them on the CC and mention this in the
changelog when I post this next time. I personally think that this
shouldn't make much difference in real life, because GFP_NOFS-only
loads are rare and we should rather help by releasing memory when it is
available than rely on something else to do it for us. Waiting for
Godot is never a good strategy.

> I also think we have discussed this before, but I think the oom
> notifier handling should be done in the page allocator proper, i.e. in
> __alloc_pages_may_oom(). We can leave out_of_memory() for a clearly
> defined purpose: to kill a process when all reclaim has failed.

I vaguely remember there was some issue with that the last time we
discussed it. It was the duplication between the page fault and
allocator paths, AFAIR. Nothing that cannot be handled, though, but the
OOM notifier API is just too ugly to spread outside the OOM proper, I
guess. Why can we not move those users over to the proper shrinker
interface (after it gets extended by a priority of some sort, so that
they release some objects only after we are really in trouble)?
Something for a separate discussion, though...
-- 
Michal Hocko
SUSE Labs
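The shrinker interface Michal alludes to looks roughly like this (a
sketch; my_cache_size()/my_cache_trim() are hypothetical helpers, and
the priority extension he mentions did not exist at the time):

	#include <linux/shrinker.h>

	static unsigned long my_count_objects(struct shrinker *s,
					      struct shrink_control *sc)
	{
		return my_cache_size();	/* freeable objects (hypothetical) */
	}

	static unsigned long my_scan_objects(struct shrinker *s,
					     struct shrink_control *sc)
	{
		/* free up to sc->nr_to_scan objects, return number freed */
		return my_cache_trim(sc->nr_to_scan);	/* hypothetical */
	}

	static struct shrinker my_shrinker = {
		.count_objects	= my_count_objects,
		.scan_objects	= my_scan_objects,
		.seeks		= DEFAULT_SEEKS,
	};

	/* register_shrinker(&my_shrinker); */

Unlike OOM notifiers, shrinkers are called throughout direct reclaim
and kswapd, so memory is released progressively rather than only at the
OOM point.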
* Re: [RFC PATCH] mm, oom: move GFP_NOFS check to out_of_memory
@ 2016-03-30 11:46 Tetsuo Handa
To: mhocko, rientjes; +Cc: linux-mm, hannes, akpm, linux-kernel

Michal Hocko wrote:
> On Tue 29-03-16 15:13:54, David Rientjes wrote:
> > On Tue, 29 Mar 2016, Michal Hocko wrote:
> >
> > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > index 86349586eacb..1c2b7a82f0c4 100644
> > > --- a/mm/oom_kill.c
> > > +++ b/mm/oom_kill.c
> > > @@ -876,6 +876,10 @@ bool out_of_memory(struct oom_control *oc)
> > >  		return true;
> > >  	}
> > >  
> > > +	/* The OOM killer does not compensate for IO-less reclaim. */
> > > +	if (!(oc->gfp_mask & __GFP_FS))
> > > +		return true;
> > > +

This patch will disable pagefault_out_of_memory(), because currently
pagefault_out_of_memory() passes oc->gfp_mask == 0.

Because of the current behavior, calling OOM notifiers from !__GFP_FS
context seems to be safe.

> > >  	/*
> > >  	 * Check if there were limitations on the allocation (only relevant for
> > >  	 * NUMA) that may require different handling.
> >
> > I don't object to this necessarily, but I think we need input from
> > those who have taken the time to implement their own OOM notifier,
> > to see whether they agree. In the past, they would only be called if
> > reclaim had completely failed; now they can be called in low-memory
> > situations where reclaim has had very little chance to be
> > successful. Getting an ack from them would be helpful.
>
> I will make sure to put them on the CC and mention this in the
> changelog when I post this next time. I personally think that this
> shouldn't make much difference in real life, because GFP_NOFS-only
> loads are rare

GFP_NOFS-only loads are rare. But some GFP_KERNEL load which got
TIF_MEMDIE might be waiting for GFP_NOFS or GFP_NOIO loads to make
progress.

I think we are not ready to handle situations where out_of_memory() is
called again, after the current thread got TIF_MEMDIE due to a
__GFP_NOFAIL allocation request, when we ran out of memory reserves. We
should not assume that the victim target thread does not have
TIF_MEMDIE yet. I think we can handle it by making mark_oom_victim()
return a bool and returning via the shortcut only if mark_oom_victim()
successfully set TIF_MEMDIE. Though I don't like the shortcut approach
that lacks a guaranteed unlocking mechanism.

> and we should rather help by releasing memory when it is available
> than rely on something else to do it for us. Waiting for Godot is
> never a good strategy.
>
> > I also think we have discussed this before, but I think the oom
> > notifier handling should be done in the page allocator proper, i.e.
> > in __alloc_pages_may_oom(). We can leave out_of_memory() for a
> > clearly defined purpose: to kill a process when all reclaim has
> > failed.
>
> I vaguely remember there was some issue with that the last time we
> discussed it. It was the duplication between the page fault and
> allocator paths, AFAIR. Nothing that cannot be handled, though, but
> the OOM notifier API is just too ugly to spread outside the OOM
> proper, I guess. Why can we not move those users over to the proper
> shrinker interface (after it gets extended by a priority of some
> sort, so that they release some objects only after we are really in
> trouble)? Something for a separate discussion, though...

Is calling OOM notifiers from SysRq-f what we want?
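For context, the zero gfp_mask Tetsuo points at comes from the page
fault path; roughly, abridged from mm/oom_kill.c of that era (a sketch,
details elided):

	void pagefault_out_of_memory(void)
	{
		struct oom_control oc = {
			.zonelist = NULL,
			.nodemask = NULL,
			.gfp_mask = 0,	/* no allocation context here */
			.order = 0,
		};

		if (mem_cgroup_oom_synchronize(true))
			return;

		if (!mutex_trylock(&oom_lock))
			return;
		out_of_memory(&oc);
		mutex_unlock(&oom_lock);
	}

A plain !(oc->gfp_mask & __GFP_FS) test therefore also fires for this
caller, which is why the follow-up fix later in the thread special-cases
gfp_mask == 0.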
* Re: [RFC PATCH] mm, oom: move GFP_NOFS check to out_of_memory
@ 2016-03-30 12:11 Michal Hocko
To: Tetsuo Handa; +Cc: rientjes, linux-mm, hannes, akpm, linux-kernel

On Wed 30-03-16 20:46:48, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Tue 29-03-16 15:13:54, David Rientjes wrote:
> > > On Tue, 29 Mar 2016, Michal Hocko wrote:
> > >
> > > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > > index 86349586eacb..1c2b7a82f0c4 100644
> > > > --- a/mm/oom_kill.c
> > > > +++ b/mm/oom_kill.c
> > > > @@ -876,6 +876,10 @@ bool out_of_memory(struct oom_control *oc)
> > > >  		return true;
> > > >  	}
> > > >  
> > > > +	/* The OOM killer does not compensate for IO-less reclaim. */
> > > > +	if (!(oc->gfp_mask & __GFP_FS))
> > > > +		return true;
> > > > +
>
> This patch will disable pagefault_out_of_memory(), because currently
> pagefault_out_of_memory() passes oc->gfp_mask == 0.
>
> Because of the current behavior, calling OOM notifiers from !__GFP_FS
> context seems to be safe.

You are right! I have completely missed that and thought we were
providing GFP_KERNEL there. So we have two choices: either we use
GFP_KERNEL (same as we do for sysrq+f) or we special-case
pagefault_out_of_memory in some way. The second option seems safer,
because the gfp_mask otherwise has to contain at least
___GFP_DIRECT_RECLAIM to trigger the OOM path.

> > > >  	/*
> > > >  	 * Check if there were limitations on the allocation (only relevant for
> > > >  	 * NUMA) that may require different handling.
> > >
> > > I don't object to this necessarily, but I think we need input from
> > > those who have taken the time to implement their own OOM notifier,
> > > to see whether they agree. In the past, they would only be called
> > > if reclaim had completely failed; now they can be called in
> > > low-memory situations where reclaim has had very little chance to
> > > be successful. Getting an ack from them would be helpful.
> >
> > I will make sure to put them on the CC and mention this in the
> > changelog when I post this next time. I personally think that this
> > shouldn't make much difference in real life, because GFP_NOFS-only
> > loads are rare
>
> GFP_NOFS-only loads are rare. But some GFP_KERNEL load which got
> TIF_MEMDIE might be waiting for GFP_NOFS or GFP_NOIO loads to make
> progress.

How would that matter to OOM notifiers?

> I think we are not ready to handle situations where out_of_memory()
> is called again, after the current thread got TIF_MEMDIE due to a
> __GFP_NOFAIL allocation request, when we ran out of memory reserves.
> We should not assume that the victim target thread does not have
> TIF_MEMDIE yet. I think we can handle it by making mark_oom_victim()
> return a bool and returning via the shortcut only if
> mark_oom_victim() successfully set TIF_MEMDIE. Though I don't like
> the shortcut approach that lacks a guaranteed unlocking mechanism.

That would lead to a premature follow-up OOM kill when the TIF_MEMDIE
task makes some progress, just not in time.

> > and we should rather help by releasing memory when it is available
> > than rely on something else to do it for us. Waiting for Godot is
> > never a good strategy.
> >
> > > I also think we have discussed this before, but I think the oom
> > > notifier handling should be done in the page allocator proper,
> > > i.e. in __alloc_pages_may_oom(). We can leave out_of_memory() for
> > > a clearly defined purpose: to kill a process when all reclaim has
> > > failed.
> >
> > I vaguely remember there was some issue with that the last time we
> > discussed it. It was the duplication between the page fault and
> > allocator paths, AFAIR. Nothing that cannot be handled, though, but
> > the OOM notifier API is just too ugly to spread outside the OOM
> > proper, I guess. Why can we not move those users over to the proper
> > shrinker interface (after it gets extended by a priority of some
> > sort, so that they release some objects only after we are really in
> > trouble)? Something for a separate discussion, though...
>
> Is calling OOM notifiers from SysRq-f what we want?

I am not really sure about that, to be honest. The semantic is really
weak, but what would be the downside? This operation shouldn't be fatal
and the dropped objects can be reconstructed.
-- 
Michal Hocko
SUSE Labs
* Re: [RFC PATCH] mm, oom: move GFP_NOFS check to out_of_memory
@ 2016-03-31 11:56 Tetsuo Handa
To: mhocko; +Cc: rientjes, linux-mm, hannes, akpm, linux-kernel

Michal Hocko wrote:
> On Wed 30-03-16 20:46:48, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Tue 29-03-16 15:13:54, David Rientjes wrote:
> > > > On Tue, 29 Mar 2016, Michal Hocko wrote:
> > > >
> > > > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > > > index 86349586eacb..1c2b7a82f0c4 100644
> > > > > --- a/mm/oom_kill.c
> > > > > +++ b/mm/oom_kill.c
> > > > > @@ -876,6 +876,10 @@ bool out_of_memory(struct oom_control *oc)
> > > > >  		return true;
> > > > >  	}
> > > > >  
> > > > > +	/* The OOM killer does not compensate for IO-less reclaim. */
> > > > > +	if (!(oc->gfp_mask & __GFP_FS))
> > > > > +		return true;
> > > > > +
> >
> > This patch will disable pagefault_out_of_memory(), because currently
> > pagefault_out_of_memory() passes oc->gfp_mask == 0.
> >
> > Because of the current behavior, calling OOM notifiers from
> > !__GFP_FS context seems to be safe.
>
> You are right! I have completely missed that and thought we were
> providing GFP_KERNEL there. So we have two choices: either we use
> GFP_KERNEL (same as we do for sysrq+f) or we special-case
> pagefault_out_of_memory in some way. The second option seems safer,
> because the gfp_mask otherwise has to contain at least
> ___GFP_DIRECT_RECLAIM to trigger the OOM path.

Oops, I missed that this patch also disables out_of_memory() for
!__GFP_FS && __GFP_NOFAIL allocation requests.

> > I think we are not ready to handle situations where out_of_memory()
> > is called again, after the current thread got TIF_MEMDIE due to a
> > __GFP_NOFAIL allocation request, when we ran out of memory reserves.
> > We should not assume that the victim target thread does not have
> > TIF_MEMDIE yet. I think we can handle it by making mark_oom_victim()
> > return a bool and returning via the shortcut only if
> > mark_oom_victim() successfully set TIF_MEMDIE. Though I don't like
> > the shortcut approach that lacks a guaranteed unlocking mechanism.
>
> That would lead to a premature follow-up OOM kill when the TIF_MEMDIE
> task makes some progress, just not in time.

We can never know whether the OOM killer prematurely killed a victim.
It is possible that get_page_from_freelist() will succeed even if
select_bad_process() did not find a TIF_MEMDIE thread. You said you
don't want to violate the layering
( http://lkml.kernel.org/r/20160129152307.GF32174@dhcp22.suse.cz ).
What we can do is tolerate possible premature OOM killer invocation
using some threshold. You are proposing such a change as the OOM
detection rework, which might also cause premature OOM killer
invocation.

Waiting forever unconditionally (e.g.
http://lkml.kernel.org/r/201602092349.ACG81273.OSVtMJQHLOFOFF@I-love.SAKURA.ne.jp )
is no good. Suppressing OOM killer invocation forever unconditionally
(e.g. deciding based only on !__GFP_FS, or based only on TIF_MEMDIE) is
no good either. Even if we stop returning via the shortcuts by making
mark_oom_victim() return a bool, select_bad_process() will still work
as a hold-off mechanism. By combining that with a timeout (or some
other finite limit) for TIF_MEMDIE, we can tolerate possible premature
OOM killer invocation. It is much better than being OOM-livelocked
forever.
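One way Tetsuo's mark_oom_victim() suggestion could look (a sketch of
the proposed change, based on the era's function body; the function
actually returned void at the time):

	static bool mark_oom_victim(struct task_struct *tsk)
	{
		WARN_ON(oom_killer_disabled);
		/* may race with the memcg OOM path or a previous kill */
		if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
			return false;	/* already a victim, do not shortcut */

		atomic_inc(&tsk->signal->oom_victims);
		__thaw_task(tsk);	/* a frozen victim cannot exit */
		atomic_inc(&oom_victims);
		return true;
	}

The shortcut paths in out_of_memory() would then return early only when
this returns true, so a task that already holds TIF_MEMDIE but cannot
make progress no longer suppresses victim selection forever.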
* Re: [RFC PATCH] mm, oom: move GFP_NOFS check to out_of_memory
@ 2016-03-31 15:11 Michal Hocko
To: Tetsuo Handa; +Cc: rientjes, linux-mm, hannes, akpm, linux-kernel

On Thu 31-03-16 20:56:23, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Wed 30-03-16 20:46:48, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > On Tue 29-03-16 15:13:54, David Rientjes wrote:
> > > > > On Tue, 29 Mar 2016, Michal Hocko wrote:
> > > > >
> > > > > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > > > > index 86349586eacb..1c2b7a82f0c4 100644
> > > > > > --- a/mm/oom_kill.c
> > > > > > +++ b/mm/oom_kill.c
> > > > > > @@ -876,6 +876,10 @@ bool out_of_memory(struct oom_control *oc)
> > > > > >  		return true;
> > > > > >  	}
> > > > > >  
> > > > > > +	/* The OOM killer does not compensate for IO-less reclaim. */
> > > > > > +	if (!(oc->gfp_mask & __GFP_FS))
> > > > > > +		return true;
> > > > > > +
> > >
> > > This patch will disable pagefault_out_of_memory(), because
> > > currently pagefault_out_of_memory() passes oc->gfp_mask == 0.
> > >
> > > Because of the current behavior, calling OOM notifiers from
> > > !__GFP_FS context seems to be safe.
> >
> > You are right! I have completely missed that and thought we were
> > providing GFP_KERNEL there. So we have two choices: either we use
> > GFP_KERNEL (same as we do for sysrq+f) or we special-case
> > pagefault_out_of_memory in some way. The second option seems safer,
> > because the gfp_mask otherwise has to contain at least
> > ___GFP_DIRECT_RECLAIM to trigger the OOM path.
>
> Oops, I missed that this patch also disables out_of_memory() for
> !__GFP_FS && __GFP_NOFAIL allocation requests.

True. The following should take care of that:

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 54aa4ec06889..32d8210b8773 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -882,7 +882,7 @@ bool out_of_memory(struct oom_control *oc)
 	 * make sure exclude 0 mask - all other users should have at least
 	 * ___GFP_DIRECT_RECLAIM to get here.
 	 */
-	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS))
+	if (oc->gfp_mask && !(oc->gfp_mask & (__GFP_FS|__GFP_NOFAIL)))
 		return true;
 
 	/*

Thanks for spotting this!
[...]
-- 
Michal Hocko
SUSE Labs
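To illustrate why the __GFP_NOFAIL exemption matters, consider a
hypothetical filesystem-internal callsite (not a real kernel function,
just the shape of such requests):

	/* must not fail and must not recurse into the filesystem */
	struct foo_item *item = kmalloc(sizeof(*item),
					GFP_NOFS | __GFP_NOFAIL);

The page allocator retries such a request forever. If out_of_memory()
bailed out for every !__GFP_FS request, this allocation could loop
indefinitely with no OOM kill ever freeing memory on its behalf.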
* Re: [RFC PATCH] mm, oom: move GFP_NOFS check to out_of_memory
@ 2016-04-05 11:12 Tetsuo Handa
To: mhocko, linux-mm; +Cc: rientjes, hannes, akpm, linux-kernel, mhocko

I did an OOM torture test using Linux 4.6-rc2 with the kmallocwd patch
on xfs and ext4 filesystems, using the reproducer shown below.

---------- Reproducer start ----------
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/prctl.h>
#include <signal.h>
#include <sys/mman.h>

static char buffer[4096] = { };

static int writer(void *unused)
{
	const int fd = open("/proc/self/exe", O_RDONLY);
	sleep(2);
	while (1) {
		void *ptr = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);
		munmap(ptr, 4096);
	}
	return 0;
}

static int file_io(void *unused)
{
	const int fd = open(buffer, O_WRONLY | O_CREAT | O_APPEND, 0600);
	sleep(2);
	while (write(fd, buffer, sizeof(buffer)) > 0);
	close(fd);
	return 0;
}

int main(int argc, char *argv[])
{
	int i;
	if (chdir("/tmp"))
		return 1;
	for (i = 0; i < 64; i++)
		if (fork() == 0) {
			const int idx = i;
			char buffer2[64] = { };
			const int fd = open("/proc/self/oom_score_adj", O_WRONLY);
			write(fd, "1000", 4);
			close(fd);
			snprintf(buffer, sizeof(buffer), "file_io.%02u", idx);
			prctl(PR_SET_NAME, (unsigned long) buffer, 0, 0, 0);
			for (i = 0; i < 16; i++)
				clone(file_io, malloc(1024) + 1024, CLONE_VM, NULL);
			snprintf(buffer2, sizeof(buffer2), "writer.%02u", idx);
			prctl(PR_SET_NAME, (unsigned long) buffer2, 0, 0, 0);
			for (i = 0; i < 16; i++)
				clone(writer, malloc(1024) + 1024, CLONE_VM, NULL);
			while (1)
				pause();
		}
	{ /* A dummy process for invoking the OOM killer. */
		char *buf = NULL;
		unsigned long i;
		unsigned long size = 0;
		prctl(PR_SET_NAME, (unsigned long) "memeater", 0, 0, 0);
		for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
			char *cp = realloc(buf, size);
			if (!cp) {
				size >>= 1;
				break;
			}
			buf = cp;
		}
		sleep(4);
		for (i = 0; i < size; i += 4096)
			buf[i] = '\0'; /* Will cause OOM due to overcommit */
	}
	kill(-1, SIGKILL);
	return * (char *) NULL; /* Not reached. */
}
---------- Reproducer end ----------

What I can observe under the OOM livelock condition is a three-way
dependency loop.

 (1) An OOM victim (which has TIF_MEMDIE) is unable to make forward
     progress because it is blocked at an unkillable lock, waiting for
     another thread's memory allocation.

 (2) A filesystem writeback work item is unable to make forward
     progress because it is waiting for a GFP_NOFS memory allocation to
     be satisfied, because storage I/O is stalling.

 (3) A disk I/O work item is unable to make forward progress because it
     is waiting for a GFP_NOIO memory allocation to be satisfied,
     because the OOM victim does not release memory but the OOM reaper
     does not unlock TIF_MEMDIE.
Complete log for xfs is at
http://I-love.SAKURA.ne.jp/tmp/serial-20160404.txt.xz
----------
[ 98.749616] Killed process 1424 (file_io.08) total-vm:4332kB, anon-rss:0kB, file-rss:4kB, shmem-rss:0kB
[ 143.136457] MemAlloc-Info: stalling=2 dying=178 exiting=31 victim=1 oom_count=2324984/335679
[ 143.143740] MemAlloc: kswapd0(49) flags=0xa40840 switches=466 uninterruptible
[ 143.149661] kswapd0 D 0000000000000001 0 49 2 0x00000000
[ 143.155312] ffff88003689c6c0 ffff8800368a4000 ffff8800368a38b0 ffff88003c251c10
[ 143.161566] ffff88003c251c28 ffff8800368a39e8 0000000000000001 ffffffff81556fbc
[ 143.167957] ffff88003689c6c0 ffffffff81559108 0000000000000000 ffff88003c251c18
[ 143.174116] Call Trace:
[ 143.176643] [<ffffffff81556fbc>] ? schedule+0x2c/0x80
[ 143.180854] [<ffffffff81559108>] ? rwsem_down_read_failed+0xf8/0x150
[ 143.186358] [<ffffffff810a20b0>] ? wait_woken+0x80/0x80
[ 143.190572] [<ffffffff8126d5e4>] ? call_rwsem_down_read_failed+0x14/0x30
[ 143.196129] [<ffffffff81558a67>] ? down_read+0x17/0x20
[ 143.200356] [<ffffffffa021c19e>] ? xfs_map_blocks+0x7e/0x150 [xfs]
[ 143.205430] [<ffffffffa021cffa>] ? xfs_do_writepage+0x16a/0x510 [xfs]
[ 143.210701] [<ffffffffa021d3d1>] ? xfs_vm_writepage+0x31/0x70 [xfs]
[ 143.215819] [<ffffffff811225f2>] ? pageout.isra.43+0x182/0x230
[ 143.220678] [<ffffffff811239eb>] ? shrink_page_list+0x84b/0xb20
[ 143.225484] [<ffffffff8112444b>] ? shrink_inactive_list+0x20b/0x490
[ 143.230481] [<ffffffff81125071>] ? shrink_zone_memcg+0x5d1/0x790
[ 143.235430] [<ffffffff8117553d>] ? mem_cgroup_iter+0x14d/0x2b0
[ 143.240220] [<ffffffff81125307>] ? shrink_zone+0xd7/0x2f0
[ 143.244725] [<ffffffff811261c6>] ? kswapd+0x406/0x7d0
[ 143.248903] [<ffffffff81125dc0>] ? mem_cgroup_shrink_node_zone+0xb0/0xb0
[ 143.254293] [<ffffffff81083b68>] ? kthread+0xc8/0xe0
[ 143.258401] [<ffffffff8155a502>] ? ret_from_fork+0x22/0x40
[ 143.262846] [<ffffffff81083aa0>] ? kthread_create_on_node+0x1a0/0x1a0
[ 143.267956] MemAlloc: kworker/2:1(61) flags=0x4208860 switches=75880 seq=4 gfp=0x2400000(GFP_NOIO) order=0 delay=39526 uninterruptible
[ 143.277844] kworker/2:1 R running task 0 61 2 0x00000000
[ 143.283598] Workqueue: events_freezable_power_ disk_events_workfn
[ 143.288592] ffff880036940880 ffff88000013c000 ffff88000013b768 ffff88003f64dfc0
[ 143.295797] ffff88000013b700 00000000fffd98ea 0000000000000017 ffffffff81556fbc
[ 143.301706] ffff88003f64dfc0 ffffffff8155965e 0000000000000000 0000000000000286
[ 143.307659] Call Trace:
[ 143.309941] [<ffffffff81556fbc>] ? schedule+0x2c/0x80
[ 143.314292] [<ffffffff8155965e>] ? schedule_timeout+0x11e/0x1c0
[ 143.319045] [<ffffffff810c0270>] ? cascade+0x80/0x80
[ 143.323122] [<ffffffff8112e9f7>] ? wait_iff_congested+0xd7/0x120
[ 143.327887] [<ffffffff810a20b0>] ? wait_woken+0x80/0x80
[ 143.332129] [<ffffffff8112454f>] ? shrink_inactive_list+0x30f/0x490
[ 143.337349] [<ffffffff81125071>] ? shrink_zone_memcg+0x5d1/0x790
[ 143.342071] [<ffffffff8117553d>] ? mem_cgroup_iter+0x14d/0x2b0
[ 143.346651] [<ffffffff81125307>] ? shrink_zone+0xd7/0x2f0
[ 143.350955] [<ffffffff8112586a>] ? do_try_to_free_pages+0x15a/0x3e0
[ 143.355829] [<ffffffff81125b85>] ? try_to_free_pages+0x95/0xc0
[ 143.360409] [<ffffffff8111a38f>] ? __alloc_pages_nodemask+0x63f/0xc40
[ 143.365433] [<ffffffff8115dcef>] ? alloc_pages_current+0x7f/0x100
[ 143.370275] [<ffffffff8123456b>] ? bio_copy_kern+0xbb/0x170
[ 143.374695] [<ffffffff8123d53a>] ? blk_rq_map_kern+0x6a/0x120
[ 143.379295] [<ffffffff81237ca2>] ? blk_get_request+0x72/0xd0
[ 143.383477] [<ffffffff81388cf2>] ? scsi_execute+0x122/0x150
[ 143.388023] [<ffffffff81388df5>] ? scsi_execute_req_flags+0x85/0xf0
[ 143.392883] [<ffffffffa01dd719>] ? sr_check_events+0xb9/0x2b0 [sr_mod]
[ 143.397909] [<ffffffffa01d114f>] ? cdrom_check_events+0xf/0x30 [cdrom]
[ 143.403016] [<ffffffff8124772a>] ? disk_check_events+0x5a/0x140
[ 143.407606] [<ffffffff8107e484>] ? process_one_work+0x134/0x310
[ 143.412245] [<ffffffff8107e77d>] ? worker_thread+0x11d/0x4a0
[ 143.416729] [<ffffffff81556a51>] ? __schedule+0x271/0x7b0
[ 143.421047] [<ffffffff8107e660>] ? process_one_work+0x310/0x310
[ 143.425624] [<ffffffff81083b68>] ? kthread+0xc8/0xe0
[ 143.429595] [<ffffffff8155a502>] ? ret_from_fork+0x22/0x40
[ 143.433897] [<ffffffff81083aa0>] ? kthread_create_on_node+0x1a0/0x1a0
[ 143.440187] MemAlloc: kworker/u128:2(270) flags=0x4a28860 switches=68907 seq=90 gfp=0x2400240(GFP_NOFS|__GFP_NOWARN) order=0 delay=60000 uninterruptible
[ 143.450674] kworker/u128:2 D 0000000000000017 0 270 2 0x00000000
[ 143.456069] Workqueue: writeback wb_workfn (flush-8:0)
[ 143.460752] ffff880036034180 ffff880039ffc000 ffff880039ffae68 ffff88003f66dfc0
[ 143.466560] ffff880039ffae00 00000000fffd99b1 0000000000000017 ffffffff810c041f
[ 143.472246] ffff88003f66dfc0 ffffffff8155965e 0000000000000000 0000000000000286
[ 143.478837] Call Trace:
[ 143.481096] [<ffffffff81556fbc>] ? schedule+0x2c/0x80
[ 143.485192] [<ffffffff8155965e>] ? schedule_timeout+0x11e/0x1c0
[ 143.489958] [<ffffffff810c0270>] ? cascade+0x80/0x80
[ 143.494002] [<ffffffff8112e9f7>] ? wait_iff_congested+0xd7/0x120
[ 143.498750] [<ffffffff810a20b0>] ? wait_woken+0x80/0x80
[ 143.502968] [<ffffffff8112454f>] ? shrink_inactive_list+0x30f/0x490
[ 143.507907] [<ffffffff81125071>] ? shrink_zone_memcg+0x5d1/0x790
[ 143.512611] [<ffffffff8117553d>] ? mem_cgroup_iter+0x14d/0x2b0
[ 143.517315] [<ffffffff81125307>] ? shrink_zone+0xd7/0x2f0
[ 143.521575] [<ffffffff8112586a>] ? do_try_to_free_pages+0x15a/0x3e0
[ 143.526428] [<ffffffff81125b85>] ? try_to_free_pages+0x95/0xc0
[ 143.530957] [<ffffffff8111a38f>] ? __alloc_pages_nodemask+0x63f/0xc40
[ 143.536014] [<ffffffff8115dcef>] ? alloc_pages_current+0x7f/0x100
[ 143.541053] [<ffffffffa02539c2>] ? xfs_buf_allocate_memory+0x16a/0x2a5 [xfs]
[ 143.546614] [<ffffffffa022251b>] ? xfs_buf_get_map+0xeb/0x140 [xfs]
[ 143.551461] [<ffffffffa0222a03>] ? xfs_buf_read_map+0x23/0xd0 [xfs]
[ 143.556319] [<ffffffffa024a827>] ? xfs_trans_read_buf_map+0x87/0x190 [xfs]
[ 143.561610] [<ffffffffa01fdc22>] ? xfs_btree_read_buf_block.constprop.29+0x72/0xc0 [xfs]
[ 143.568068] [<ffffffffa01fdce8>] ? xfs_btree_lookup_get_block+0x78/0xe0 [xfs]
[ 143.573722] [<ffffffffa0202262>] ? xfs_btree_lookup+0xc2/0x570 [xfs]
[ 143.578671] [<ffffffffa01e9712>] ? xfs_alloc_fixup_trees+0x282/0x350 [xfs]
[ 143.583941] [<ffffffffa01eb7af>] ? xfs_alloc_ag_vextent_near+0x55f/0x910 [xfs]
[ 143.589444] [<ffffffffa01ebc55>] ? xfs_alloc_ag_vextent+0xf5/0x120 [xfs]
[ 143.594584] [<ffffffffa01ec72b>] ? xfs_alloc_vextent+0x3bb/0x470 [xfs]
[ 143.599674] [<ffffffffa01f9de7>] ? xfs_bmap_btalloc+0x3d7/0x760 [xfs]
[ 143.604422] [<ffffffffa01fab34>] ? xfs_bmapi_write+0x474/0xa20 [xfs]
[ 143.609329] [<ffffffffa022de73>] ? xfs_iomap_write_allocate+0x163/0x380 [xfs]
[ 143.614804] [<ffffffffa021c255>] ? xfs_map_blocks+0x135/0x150 [xfs]
[ 143.619661] [<ffffffffa021cffa>] ? xfs_do_writepage+0x16a/0x510 [xfs]
[ 143.624496] [<ffffffff8111c9fe>] ? write_cache_pages+0x1ae/0x400
[ 143.629218] [<ffffffffa021ce90>] ? xfs_aops_discard_page+0x130/0x130 [xfs]
[ 143.634413] [<ffffffffa021ccbf>] ? xfs_vm_writepages+0x5f/0xa0 [xfs]
[ 143.639403] [<ffffffff811aa9fc>] ? __writeback_single_inode+0x2c/0x170
[ 143.644474] [<ffffffff811ab013>] ? writeback_sb_inodes+0x223/0x4e0
[ 143.649194] [<ffffffff811ab352>] ? __writeback_inodes_wb+0x82/0xb0
[ 143.654019] [<ffffffff811ab56c>] ? wb_writeback+0x1ec/0x220
[ 143.658215] [<ffffffff811aba5e>] ? wb_workfn+0xde/0x290
[ 143.662373] [<ffffffff8107e484>] ? process_one_work+0x134/0x310
[ 143.667058] [<ffffffff8107e77d>] ? worker_thread+0x11d/0x4a0
[ 143.671623] [<ffffffff81556a51>] ? __schedule+0x271/0x7b0
[ 143.676393] [<ffffffff8107e660>] ? process_one_work+0x310/0x310
[ 143.681168] [<ffffffff81083b68>] ? kthread+0xc8/0xe0
[ 143.685169] [<ffffffff8155a502>] ? ret_from_fork+0x22/0x40
[ 143.689497] [<ffffffff81083aa0>] ? kthread_create_on_node+0x1a0/0x1a0
(...snipped...)
[ 143.791611] MemAlloc: file_io.08(1424) flags=0x400040 switches=1058 uninterruptible dying victim
[ 143.798403] file_io.08 D ffff88003c285d98 0 1424 1 0x00100084
[ 143.803820] ffff88003d36e180 ffff88003d374000 ffff88003d373d80 ffff88003c285d94
[ 143.809802] ffff88003d36e180 00000000ffffffff ffff88003c285d98 ffffffff81556fbc
[ 143.815638] ffff88003c285d90 ffffffff81557255 ffffffff81558604 ffff88003d37fd30
[ 143.821210] Call Trace:
[ 143.823431] [<ffffffff81556fbc>] ? schedule+0x2c/0x80
[ 143.828700] [<ffffffff81557255>] ? schedule_preempt_disabled+0x5/0x10
[ 143.833661] [<ffffffff81558604>] ? __mutex_lock_slowpath+0xb4/0x130
[ 143.838552] [<ffffffff81558696>] ? mutex_lock+0x16/0x25
[ 143.842614] [<ffffffffa022687c>] ? xfs_file_buffered_aio_write+0x5c/0x1e0 [xfs]
[ 143.847945] [<ffffffff810226ad>] ? __switch_to+0x20d/0x3f0
[ 143.852188] [<ffffffffa0226a86>] ? xfs_file_write_iter+0x86/0x140 [xfs]
[ 143.857179] [<ffffffff811838cb>] ? __vfs_write+0xcb/0x100
[ 143.861441] [<ffffffff81184478>] ? vfs_write+0x98/0x190
[ 143.865629] [<ffffffff81556a51>] ? __schedule+0x271/0x7b0
[ 143.869902] [<ffffffff8118583d>] ? SyS_write+0x4d/0xc0
[ 143.874031] [<ffffffff810034a7>] ? do_syscall_64+0x57/0xf0
[ 143.878258] [<ffffffff8155a3a1>] ? entry_SYSCALL64_slow_path+0x25/0x25
(...snipped...)
[ 165.512677] Mem-Info:
[ 165.514925] active_anon:166683 inactive_anon:1640 isolated_anon:0
[ 165.514925] active_file:10870 inactive_file:49863 isolated_file:68
[ 165.514925] unevictable:0 dirty:49806 writeback:112 unstable:0
[ 165.514925] slab_reclaimable:3373 slab_unreclaimable:7156
[ 165.514925] mapped:10566 shmem:1703 pagetables:1606 bounce:0
[ 165.514925] free:1854 free_pcp:130 free_cma:0
[ 165.541474] Node 0 DMA free:3932kB min:60kB low:72kB high:84kB active_anon:7596kB inactive_anon:176kB active_file:328kB inactive_file:976kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:976kB writeback:0kB mapped:404kB shmem:176kB slab_reclaimable:128kB slab_unreclaimable:488kB kernel_stack:144kB pagetables:140kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:8636 all_unreclaimable? yes
[ 165.574685] lowmem_reserve[]: 0 968 968 968
[ 165.578352] Node 0 DMA32 free:3484kB min:3812kB low:4804kB high:5796kB active_anon:659136kB inactive_anon:6384kB active_file:43152kB inactive_file:198476kB unevictable:0kB isolated(anon):0kB isolated(file):272kB present:1032064kB managed:996224kB mlocked:0kB dirty:198248kB writeback:448kB mapped:41860kB shmem:6636kB slab_reclaimable:13364kB slab_unreclaimable:28136kB kernel_stack:7792kB pagetables:6284kB unstable:0kB bounce:0kB free_pcp:520kB local_pcp:216kB free_cma:0kB writeback_tmp:0kB pages_scanned:201090336 all_unreclaimable? yes
[ 165.612568] lowmem_reserve[]: 0 0 0 0
[ 165.615805] Node 0 DMA: 23*4kB (UM) 30*8kB (UM) 21*16kB (U) 6*32kB (U) 4*64kB (U) 2*128kB (U) 0*256kB 3*512kB (UM) 1*1024kB (U) 0*2048kB 0*4096kB = 3932kB
[ 165.626447] Node 0 DMA32: 759*4kB (UE) 54*8kB (U) 1*16kB (U) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3484kB
[ 165.635697] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[ 165.642340] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 165.648792] 62515 total pagecache pages
[ 165.652093] 0 pages in swap cache
[ 165.655325] Swap cache stats: add 0, delete 0, find 0/0
[ 165.659472] Free swap = 0kB
[ 165.662094] Total swap = 0kB
[ 165.664813] 262013 pages RAM
[ 165.667364] 0 pages HighMem/MovableOnly
[ 165.670595] 8981 pages reserved
[ 165.673400] 0 pages cma reserved
[ 165.676333] 0 pages hwpoisoned
[ 165.679103] Showing busy workqueues and worker pools:
[ 165.683077] workqueue events: flags=0x0
[ 165.686367] pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256
[ 165.690779] pending: vmpressure_work_fn
[ 165.694084] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[ 165.698960] pending: vmw_fb_dirty_flush [vmwgfx]
[ 165.703112] workqueue events_freezable_power_: flags=0x84
[ 165.707516] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[ 165.711938] in-flight: 61:disk_events_workfn
[ 165.715500] workqueue writeback: flags=0x4e
[ 165.719068] pwq 128: cpus=0-63 flags=0x4 nice=0 active=2/256
[ 165.723890] in-flight: 270:wb_workfn wb_workfn
[ 165.728443] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=0s workers=4 idle: 209 3311 23
[ 165.734327] pool 128: cpus=0-63 flags=0x4 nice=0 hung=0s workers=5 idle: 6 51 277 276
[ 165.740618] MemAlloc-Info: stalling=2 dying=178 exiting=31 victim=1 oom_count=3071760/430759
----------

Complete log for ext4 is at
http://I-love.SAKURA.ne.jp/tmp/serial-20160405.txt.xz
----------
[ 186.620979] Out of memory: Kill process 4458 (file_io.24) score 997 or sacrifice child
[ 186.627897] Killed process 4458 (file_io.24) total-vm:4336kB, anon-rss:116kB, file-rss:1024kB, shmem-rss:0kB
[ 186.688345] oom_reaper: reaped process 4458 (file_io.24), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
(...snipped...)
[ 187.089562] Killed process 3499 (writer.26) total-vm:4344kB, anon-rss:80kB, file-rss:64kB, shmem-rss:0kB
[ 242.174775] MemAlloc-Info: stalling=9 dying=31 exiting=0 victim=1 oom_count=752788/16556
[ 242.183365] MemAlloc: kswapd0(49) flags=0xa40840 switches=994137
[ 242.188759] kswapd0 R running task 0 49 2 0x00000000
[ 242.195022] ffff88003af2fd20 ffff88003af30000 ffff88003af2fde8 ffff88003f62dfc0
[ 242.201296] ffff88003af2fd80 00000000ffff1d70 ffff88003ffde000 ffffffff81587dec
[ 242.207771] ffff88003f62dfc0 ffffffff8158a48e ffffffff811249c7 0000000000000286
[ 242.213864] Call Trace:
[ 242.216691] [<ffffffff81587dec>] ? schedule+0x2c/0x80
[ 242.221078] [<ffffffff811249c7>] ? shrink_zone+0xd7/0x2f0
shrink_zone+0xd7/0x2f0 [ 242.225342] [<ffffffff810c00a0>] ? cascade+0x80/0x80 [ 242.229333] [<ffffffff81125b89>] ? kswapd+0x709/0x7d0 [ 242.233452] [<ffffffff810a1ee0>] ? wait_woken+0x80/0x80 [ 242.237618] [<ffffffff81125480>] ? mem_cgroup_shrink_node_zone+0xb0/0xb0 [ 242.242602] [<ffffffff81083b18>] ? kthread+0xc8/0xe0 [ 242.246718] [<ffffffff8158b342>] ? ret_from_fork+0x22/0x40 [ 242.250939] [<ffffffff81083a50>] ? kthread_create_on_node+0x1a0/0x1a0 [ 242.257505] MemAlloc: kworker/u128:1(51) flags=0x4a08860 switches=80360 seq=18 gfp=0x2400040(GFP_NOFS) order=0 delay=60000 uninterruptible [ 242.266407] kworker/u128:1 D 0000000000000017 0 51 2 0x00000000 [ 242.272485] Workqueue: writeback wb_workfn (flush-8:0) [ 242.276909] ffff880036814740 ffff88003681c000 ffff88003681b278 ffff88003f64dfc0 [ 242.282635] 00000000a6a32935 00000000ffff1d5e ffff88003681b278 ffffffff81587dec [ 242.288840] ffff88003f64dfc0 ffffffff8158a496 0000000000000000 0000000000000286 [ 242.294612] Call Trace: [ 242.297193] [<ffffffff81587dec>] ? schedule+0x2c/0x80 [ 242.301418] [<ffffffff8158a48e>] ? schedule_timeout+0x11e/0x1c0 [ 242.306184] [<ffffffff810c00a0>] ? cascade+0x80/0x80 [ 242.310356] [<ffffffff8112df97>] ? wait_iff_congested+0xd7/0x120 [ 242.314891] [<ffffffff810a1ee0>] ? wait_woken+0x80/0x80 [ 242.319372] [<ffffffff81123c0f>] ? shrink_inactive_list+0x30f/0x490 [ 242.324695] [<ffffffff81124731>] ? shrink_zone_memcg+0x5d1/0x790 [ 242.329526] [<ffffffff81095f29>] ? check_preempt_wakeup+0x119/0x230 [ 242.334118] [<ffffffff81094d6f>] ? dequeue_entity+0x23f/0x8e0 [ 242.339120] [<ffffffff811249c7>] ? shrink_zone+0xd7/0x2f0 [ 242.343704] [<ffffffff81124f2a>] ? do_try_to_free_pages+0x15a/0x3e0 [ 242.348572] [<ffffffff81125245>] ? try_to_free_pages+0x95/0xc0 [ 242.353213] [<ffffffff81119a4f>] ? __alloc_pages_nodemask+0x63f/0xc40 [ 242.358128] [<ffffffff8115d2df>] ? alloc_pages_current+0x7f/0x100 [ 242.362766] [<ffffffff81110445>] ? pagecache_get_page+0x85/0x240 [ 242.367679] [<ffffffff81228fb7>] ? ext4_mb_load_buddy_gfp+0x357/0x440 [ 242.372621] [<ffffffff8122b599>] ? ext4_mb_regular_allocator+0x169/0x470 [ 242.377834] [<ffffffff81094d6f>] ? dequeue_entity+0x23f/0x8e0 [ 242.382677] [<ffffffff8122d059>] ? ext4_mb_new_blocks+0x369/0x440 [ 242.387572] [<ffffffff81222bc0>] ? ext4_ext_map_blocks+0x10c0/0x1770 [ 242.392153] [<ffffffff8111e373>] ? release_pages+0x243/0x350 [ 242.396704] [<ffffffff81110bb3>] ? find_get_pages_tag+0xd3/0x1b0 [ 242.401379] [<ffffffff81110099>] ? __lock_page+0x49/0xf0 [ 242.405824] [<ffffffff81201412>] ? ext4_map_blocks+0x122/0x510 [ 242.410186] [<ffffffff8120490c>] ? ext4_writepages+0x53c/0xb10 [ 242.414687] [<ffffffff811a968c>] ? __writeback_single_inode+0x2c/0x170 [ 242.419531] [<ffffffff811a9ca3>] ? writeback_sb_inodes+0x223/0x4e0 [ 242.424284] [<ffffffff811a9fe2>] ? __writeback_inodes_wb+0x82/0xb0 [ 242.429196] [<ffffffff811aa1fc>] ? wb_writeback+0x1ec/0x220 [ 242.433267] [<ffffffff811aa6ee>] ? wb_workfn+0xde/0x290 [ 242.437275] [<ffffffff8107e434>] ? process_one_work+0x134/0x310 [ 242.441492] [<ffffffff8107e72d>] ? worker_thread+0x11d/0x4a0 [ 242.445781] [<ffffffff8107e610>] ? process_one_work+0x310/0x310 [ 242.450190] [<ffffffff81083b18>] ? kthread+0xc8/0xe0 [ 242.454366] [<ffffffff8158b342>] ? ret_from_fork+0x22/0x40 [ 242.458870] [<ffffffff81083a50>] ? 
kthread_create_on_node+0x1a0/0x1a0 [ 242.465699] MemAlloc: kworker/0:2(285) flags=0x4208860 switches=275666 seq=15 gfp=0x2400000(GFP_NOIO) order=0 delay=58093 [ 242.474600] kworker/0:2 R running task 0 285 2 0x00000000 [ 242.479981] Workqueue: events_freezable_power_ disk_events_workfn [ 242.484669] ffff8800396f8600 0000000000000286 ffff8800396ff768 ffff88003f60dfc0 [ 242.490850] ffff8800396ff700 ffff8800396ff700 0000000000000017 ffffffff81587dec [ 242.496493] ffff88003f60dfc0 ffffffff8158a48e 0000000000000000 0000000000000286 [ 242.502195] Call Trace: [ 242.504347] [<ffffffff810c01dc>] ? try_to_del_timer_sync+0x4c/0x80 [ 242.509164] [<ffffffff81587dec>] ? schedule+0x2c/0x80 [ 242.513106] [<ffffffff8158a48e>] ? schedule_timeout+0x11e/0x1c0 [ 242.517333] [<ffffffff810c00a0>] ? cascade+0x80/0x80 [ 242.521263] [<ffffffff8112df6f>] ? wait_iff_congested+0xaf/0x120 [ 242.525472] [<ffffffff810a1ee0>] ? wait_woken+0x80/0x80 [ 242.529443] [<ffffffff81123c0f>] ? shrink_inactive_list+0x30f/0x490 [ 242.534392] [<ffffffff81124731>] ? shrink_zone_memcg+0x5d1/0x790 [ 242.539076] [<ffffffff81094910>] ? update_curr+0x90/0xd0 [ 242.543052] [<ffffffff81174b0d>] ? mem_cgroup_iter+0x14d/0x2b0 [ 242.547529] [<ffffffff811249c7>] ? shrink_zone+0xd7/0x2f0 [ 242.551904] [<ffffffff81124f2a>] ? do_try_to_free_pages+0x15a/0x3e0 [ 242.556629] [<ffffffff81125245>] ? try_to_free_pages+0x95/0xc0 [ 242.561000] [<ffffffff81119c77>] ? __alloc_pages_nodemask+0x867/0xc40 [ 242.566133] [<ffffffff8115d2df>] ? alloc_pages_current+0x7f/0x100 [ 242.570852] [<ffffffff81265b3b>] ? bio_copy_kern+0xbb/0x170 [ 242.575036] [<ffffffff8126eb0a>] ? blk_rq_map_kern+0x6a/0x120 [ 242.579227] [<ffffffff81269272>] ? blk_get_request+0x72/0xd0 [ 242.583721] [<ffffffff813ba2e2>] ? scsi_execute+0x122/0x150 [ 242.588072] [<ffffffff813ba3e5>] ? scsi_execute_req_flags+0x85/0xf0 [ 242.592773] [<ffffffffa01cf719>] ? sr_check_events+0xb9/0x2b0 [sr_mod] [ 242.597639] [<ffffffffa01c314f>] ? cdrom_check_events+0xf/0x30 [cdrom] [ 242.602455] [<ffffffff81278cfa>] ? disk_check_events+0x5a/0x140 [ 242.606821] [<ffffffff8107e434>] ? process_one_work+0x134/0x310 [ 242.611191] [<ffffffff8107e72d>] ? worker_thread+0x11d/0x4a0 [ 242.615560] [<ffffffff81587881>] ? __schedule+0x271/0x7b0 [ 242.619988] [<ffffffff8107e610>] ? process_one_work+0x310/0x310 [ 242.624618] [<ffffffff81083b18>] ? kthread+0xc8/0xe0 [ 242.628245] [<ffffffff8158b342>] ? ret_from_fork+0x22/0x40 [ 242.632185] [<ffffffff81083a50>] ? kthread_create_on_node+0x1a0/0x1a0 (...snipped...) [ 245.572082] MemAlloc: file_io.24(4715) flags=0x400040 switches=8650 uninterruptible dying victim [ 245.578876] file_io.24 D 0000000000000000 0 4715 1 0x00100084 [ 245.584122] ffff88002fd9c000 ffff88002fda4000 ffff880036221870 00000000000035a2 [ 245.589618] 0000000000000000 ffff880036221870 0000000000000000 ffffffff81587dec [ 245.595428] ffff880036221800 ffffffff8123b821 0000000000000000 ffff88002fd9c000 [ 245.601370] Call Trace: [ 245.603428] [<ffffffff81587dec>] ? schedule+0x2c/0x80 [ 245.607680] [<ffffffff8123b821>] ? wait_transaction_locked+0x81/0xc0 [ 245.613586] [<ffffffff810a1ee0>] ? wait_woken+0x80/0x80 [ 245.618074] [<ffffffff8123ba9a>] ? add_transaction_credits+0x21a/0x2a0 [ 245.623497] [<ffffffff81178abc>] ? mem_cgroup_commit_charge+0x7c/0xf0 [ 245.628352] [<ffffffff8123bceb>] ? start_this_handle+0x18b/0x400 [ 245.632755] [<ffffffff8110fb6e>] ? add_to_page_cache_lru+0x6e/0xd0 [ 245.637274] [<ffffffff8123c294>] ? jbd2__journal_start+0xf4/0x190 [ 245.642298] [<ffffffff81205ca4>] ? 
[ 245.647035] [<ffffffff8111116e>] ? generic_perform_write+0xce/0x1d0
[ 245.651651] [<ffffffff8119c440>] ? file_update_time+0xc0/0x110
[ 245.656166] [<ffffffff81111f2d>] ? __generic_file_write_iter+0x16d/0x1c0
[ 245.660835] [<ffffffff811fbafa>] ? ext4_file_write_iter+0x12a/0x340
[ 245.665292] [<ffffffff810226ad>] ? __switch_to+0x20d/0x3f0
[ 245.669604] [<ffffffff81182ddb>] ? __vfs_write+0xcb/0x100
[ 245.673904] [<ffffffff81183968>] ? vfs_write+0x98/0x190
[ 245.678174] [<ffffffff81184d2d>] ? SyS_write+0x4d/0xc0
[ 245.682376] [<ffffffff810034a7>] ? do_syscall_64+0x57/0xf0
[ 245.686845] [<ffffffff8158b1e1>] ? entry_SYSCALL64_slow_path+0x25/0x25
(...snipped...)
[ 246.216363] Mem-Info:
[ 246.218425] active_anon:183099 inactive_anon:2734 isolated_anon:0
[ 246.218425]  active_file:2006 inactive_file:36363 isolated_file:0
[ 246.218425]  unevictable:0 dirty:36369 writeback:0 unstable:0
[ 246.218425]  slab_reclaimable:2055 slab_unreclaimable:9453
[ 246.218425]  mapped:2266 shmem:3080 pagetables:1480 bounce:0
[ 246.218425]  free:1814 free_pcp:197 free_cma:0
[ 246.245998] Node 0 DMA free:3928kB min:60kB low:72kB high:84kB active_anon:7868kB inactive_anon:112kB active_file:188kB inactive_file:1504kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:1440kB writeback:0kB mapped:132kB shmem:120kB slab_reclaimable:184kB slab_unreclaimable:592kB kernel_stack:624kB pagetables:304kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:42274336 all_unreclaimable? yes
[ 246.281121] lowmem_reserve[]: 0 968 968 968
[ 246.284938] Node 0 DMA32 free:3328kB min:3812kB low:4804kB high:5796kB active_anon:724528kB inactive_anon:10824kB active_file:7836kB inactive_file:143948kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1032064kB managed:996008kB mlocked:0kB dirty:144036kB writeback:0kB mapped:8932kB shmem:12200kB slab_reclaimable:8036kB slab_unreclaimable:37220kB kernel_stack:23680kB pagetables:5616kB unstable:0kB bounce:0kB free_pcp:788kB local_pcp:116kB free_cma:0kB writeback_tmp:0kB pages_scanned:22926424 all_unreclaimable? yes
[ 246.319945] lowmem_reserve[]: 0 0 0 0
[ 246.323303] Node 0 DMA: 32*4kB (UME) 35*8kB (UME) 18*16kB (UE) 9*32kB (UE) 6*64kB (ME) 2*128kB (UE) 3*256kB (E) 3*512kB (UME) 0*1024kB 0*2048kB 0*4096kB = 3928kB
[ 246.334695] Node 0 DMA32: 332*4kB (UE) 244*8kB (U) 3*16kB (U) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3328kB
[ 246.344693] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[ 246.351599] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 246.357749] 41456 total pagecache pages
[ 246.360874] 0 pages in swap cache
[ 246.363717] Swap cache stats: add 0, delete 0, find 0/0
[ 246.368022] Free swap = 0kB
[ 246.370769] Total swap = 0kB
[ 246.373444] 262013 pages RAM
[ 246.376115] 0 pages HighMem/MovableOnly
[ 246.379669] 9035 pages reserved
[ 246.382654] 0 pages cma reserved
[ 246.385675] 0 pages hwpoisoned
[ 246.388597] Showing busy workqueues and worker pools:
[ 246.392477] workqueue events: flags=0x0
[ 246.395797]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256
[ 246.400741]     pending: vmw_fb_dirty_flush [vmwgfx]
[ 246.405129] workqueue events_freezable_power_: flags=0x84
[ 246.409390]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[ 246.413910]     in-flight: 285:disk_events_workfn
[ 246.417932] workqueue writeback: flags=0x4e
[ 246.421660]   pwq 128: cpus=0-63 flags=0x4 nice=0 active=2/256
[ 246.426158]     in-flight: 51:wb_workfn wb_workfn
[ 246.430208] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=0s workers=4 idle: 42 3280 4
[ 246.435871] pool 128: cpus=0-63 flags=0x4 nice=0 hung=0s workers=4 idle: 260 6 259
[ 246.441342] MemAlloc-Info: stalling=9 dying=31 exiting=0 victim=1 oom_count=783613/16904
----------

If I apply

----------
diff --git a/block/bio.c b/block/bio.c
index f124a0a..03250e86 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1504,6 +1504,8 @@ struct bio *bio_copy_kern(struct request_queue *q, void *data, unsigned int len,
 	void *p = data;
 	int nr_pages = 0;
 
+	gfp_mask |= __GFP_HIGH;
+
 	/*
 	 * Overflow, abort
 	 */
----------

then the disk_events_workfn stall is gone. If I also apply

----------
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 9a2191b..448f61e 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -55,7 +55,7 @@ static kmem_zone_t *xfs_buf_zone;
 #endif
 
 #define xb_to_gfp(flags) \
-	((((flags) & XBF_READ_AHEAD) ? __GFP_NORETRY : GFP_NOFS) | __GFP_NOWARN)
+	((((flags) & XBF_READ_AHEAD) ? __GFP_NORETRY : (GFP_NOFS | __GFP_HIGH)) | __GFP_NOWARN)
 
 static inline int
----------

then both the disk_events_workfn stall and the wb_workfn stall are gone, and I can no longer reproduce the OOM livelock using this reproducer.

Therefore, I think that the root cause of the OOM livelock is that

(A) We use the same watermark for GFP_KERNEL / GFP_NOFS / GFP_NOIO
    allocation requests.

(B) We allow GFP_KERNEL allocation requests to consume memory down to the
    min: watermark.

(C) GFP_KERNEL allocation requests might depend on GFP_NOFS
    allocation requests, and GFP_NOFS allocation requests
    might depend on GFP_NOIO allocation requests.

(D) A TIF_MEMDIE thread might wait forever for other threads'
    GFP_NOFS / GFP_NOIO allocation requests.

There is no gfp flag that prevents GFP_KERNEL from consuming memory down to the min: watermark. Thus, it is inevitable that GFP_KERNEL allocations consume memory down to the min: watermark and invoke the OOM killer.
But if we change memory allocations which might block writeback operations so that they can utilize memory reserves, it is likely that allocations from workqueue items will no longer stall, even without depending on mmap_sem, which is a weakness of the OOM reaper. (A sketch of the watermark behavior behind this follows at the end of this message.)

Of course, there is no guarantee that allowing such GFP_NOFS / GFP_NOIO allocations to utilize memory reserves always avoids the OOM livelock. But at least we do not need to give up GFP_NOFS / GFP_NOIO allocations immediately without trying to utilize memory reserves. Therefore, I object to this comment

Michal Hocko wrote:
> +	/*
> +	 * XXX: GFP_NOFS allocations should rather fail than rely on
> +	 * other request to make a forward progress.
> +	 * We are in an unfortunate situation where out_of_memory cannot
> +	 * do much for this context but let's try it to at least get
> +	 * access to memory reserved if the current task is killed (see
> +	 * out_of_memory). Once filesystems are ready to handle allocation
> +	 * failures more gracefully we should just bail out here.
> +	 */
> +

which tries to make !__GFP_FS allocations fail. It is possible that such GFP_NOFS / GFP_NOIO allocations need to select the next OOM victim. If we add a guaranteed unlocking mechanism (the simplest way is a timeout), such GFP_NOFS / GFP_NOIO allocations will succeed, and we can avoid losing the reliability of async write operations.

(By the way, can swap in/out work even if GFP_NOIO fails?)
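To see why the two __GFP_HIGH patches above help: __GFP_HIGH is translated into ALLOC_HIGH by gfp_to_alloc_flags(), and ALLOC_HIGH lowers the min: watermark in the zone watermark check. Below is a simplified sketch of the 4.6-era logic (a paraphrase of __zone_watermark_ok() in mm/page_alloc.c, not a verbatim copy; the flag values follow mm/internal.h, and lowmem reserves plus per-order checks are omitted for brevity):

----------
/*
 * Simplified sketch (not verbatim) of how the 4.6-era allocator relaxes
 * the watermark for __GFP_HIGH requests; see gfp_to_alloc_flags() and
 * __zone_watermark_ok() in mm/page_alloc.c.
 */
#define ALLOC_HARDER	0x10	/* try to alloc harder */
#define ALLOC_HIGH	0x20	/* __GFP_HIGH set */

static bool watermark_ok_sketch(long free_pages, long min, int alloc_flags)
{
	if (alloc_flags & ALLOC_HIGH)
		min -= min / 2;	/* __GFP_HIGH may dip half-way into reserves */
	if (alloc_flags & ALLOC_HARDER)
		min -= min / 4;	/* atomic/realtime callers may go lower still */

	return free_pages > min;
}
----------

With the DMA32 numbers above (min:3812kB, free:3328kB), an ordinary request fails this check, but an ALLOC_HIGH request passes because min drops to 1906kB, which is consistent with the observation that the two __GFP_HIGH patches make the disk_events_workfn and wb_workfn stalls disappear.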
* Re: [RFC PATCH] mm, oom: move GFP_NOFS check to out_of_memory
  2016-04-05 11:12 ` Tetsuo Handa
@ 2016-04-06 10:28 ` Tetsuo Handa
  2016-04-06 12:41 ` Michal Hocko
  1 sibling, 0 replies; 14+ messages in thread
From: Tetsuo Handa @ 2016-04-06 10:28 UTC (permalink / raw)
To: mhocko, linux-mm; +Cc: rientjes, hannes, akpm, linux-kernel, mhocko

This ext4 livelock case shows a race window which commit 36324a990cf5 ("oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space") did not take into account.

----------
[ 186.620979] Out of memory: Kill process 4458 (file_io.24) score 997 or sacrifice child
[ 186.627897] Killed process 4458 (file_io.24) total-vm:4336kB, anon-rss:116kB, file-rss:1024kB, shmem-rss:0kB
[ 186.688345] oom_reaper: reaped process 4458 (file_io.24), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 245.572082] MemAlloc: file_io.24(4715) flags=0x400040 switches=8650 uninterruptible dying victim
[ 245.578876] file_io.24 D 0000000000000000 0 4715 1 0x00100084
[ 245.584122] ffff88002fd9c000 ffff88002fda4000 ffff880036221870 00000000000035a2
[ 245.589618] 0000000000000000 ffff880036221870 0000000000000000 ffffffff81587dec
[ 245.595428] ffff880036221800 ffffffff8123b821 0000000000000000 ffff88002fd9c000
[ 245.601370] Call Trace:
[ 245.603428] [<ffffffff81587dec>] ? schedule+0x2c/0x80
[ 245.607680] [<ffffffff8123b821>] ? wait_transaction_locked+0x81/0xc0 /* linux-4.6-rc2/fs/jbd2/transaction.c:163 */
[ 245.613586] [<ffffffff810a1ee0>] ? wait_woken+0x80/0x80 /* linux-4.6-rc2/kernel/sched/wait.c:292 */
[ 245.618074] [<ffffffff8123ba9a>] ? add_transaction_credits+0x21a/0x2a0 /* linux-4.6-rc2/fs/jbd2/transaction.c:191 */
[ 245.623497] [<ffffffff81178abc>] ? mem_cgroup_commit_charge+0x7c/0xf0
[ 245.628352] [<ffffffff8123bceb>] ? start_this_handle+0x18b/0x400 /* linux-4.6-rc2/fs/jbd2/transaction.c:357 */
[ 245.632755] [<ffffffff8110fb6e>] ? add_to_page_cache_lru+0x6e/0xd0
[ 245.637274] [<ffffffff8123c294>] ? jbd2__journal_start+0xf4/0x190 /* linux-4.6-rc2/fs/jbd2/transaction.c:459 */
[ 245.642298] [<ffffffff81205ca4>] ? ext4_da_write_begin+0x114/0x360 /* linux-4.6-rc2/fs/ext4/inode.c:2883 */
[ 245.647035] [<ffffffff8111116e>] ? generic_perform_write+0xce/0x1d0 /* linux-4.6-rc2/mm/filemap.c:2639 */
[ 245.651651] [<ffffffff8119c440>] ? file_update_time+0xc0/0x110
[ 245.656166] [<ffffffff81111f2d>] ? __generic_file_write_iter+0x16d/0x1c0 /* linux-4.6-rc2/mm/filemap.c:2765 */
[ 245.660835] [<ffffffff811fbafa>] ? ext4_file_write_iter+0x12a/0x340 /* linux-4.6-rc2/fs/ext4/file.c:170 */
[ 245.665292] [<ffffffff810226ad>] ? __switch_to+0x20d/0x3f0
[ 245.669604] [<ffffffff81182ddb>] ? __vfs_write+0xcb/0x100
[ 245.673904] [<ffffffff81183968>] ? vfs_write+0x98/0x190
[ 245.678174] [<ffffffff81184d2d>] ? SyS_write+0x4d/0xc0
[ 245.682376] [<ffffffff810034a7>] ? do_syscall_64+0x57/0xf0
[ 245.686845] [<ffffffff8158b1e1>] ? entry_SYSCALL64_slow_path+0x25/0x25
----------
ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from) {
  ret = __generic_file_write_iter(iocb, from) {
    written = generic_perform_write(file, from, iocb->ki_pos) {
      if (fatal_signal_pending(current)) {
        status = -EINTR;
        break;
      }
      status = a_ops->write_begin(file, mapping, pos, bytes, flags,
                                  &page, &fsdata) /* ext4_da_write_begin */ {
        /***** Event1 *****/
        handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE,
                     ext4_da_write_credits(inode, pos, len)) /* __ext4_journal_start */ {
          __ext4_journal_start_sb(inode->i_sb, line, type, blocks, rsv_blocks) {
            jbd2__journal_start(journal, blocks, rsv_blocks, GFP_NOFS, type, line) {
              err = start_this_handle(journal, handle, gfp_mask) {
                if (!journal->j_running_transaction) {
                  /*
                   * If __GFP_FS is not present, then we may be being called from
                   * inside the fs writeback layer, so we MUST NOT fail.
                   */
                  if ((gfp_mask & __GFP_FS) == 0)
                    gfp_mask |= __GFP_NOFAIL;
                  new_transaction = kmem_cache_zalloc(transaction_cache, gfp_mask);
                  /***** Event2 *****/
                  if (!new_transaction)
                    return -ENOMEM;
                }
                /* We may have dropped j_state_lock - restart in that case */
                add_transaction_credits(journal, blocks, rsv_blocks) {
                  /*
                   * If the current transaction is locked down for commit, wait
                   * for the lock to be released.
                   */
                  if (t->t_state == T_LOCKED) {
                    /***** Event3 *****/
                    wait_transaction_locked(journal);
                    /***** Event4 *****/
                    return 1;
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

Event1 ... The OOM killer sent SIGKILL to file_io.24(4715) because file_io.24(4715) was sharing memory with file_io.24(4458).

Event2 ... file_io.24(4715) silently got TIF_MEMDIE via the fatal_signal_pending(current) shortcut in out_of_memory(), because kmem_cache_zalloc() is allowed to call out_of_memory() due to __GFP_NOFAIL.

Event3 ... The OOM reaper completed reaping the memory used by file_io.24(4458) and marked file_io.24(4458) as no longer OOM-killable. But since the OOM reaper cleared TIF_MEMDIE only from file_io.24(4458), TIF_MEMDIE in file_io.24(4715) still remains.

Event4 ... file_io.24(4715) (which used GFP_NOFS | __GFP_NOFAIL) is waiting for kworker/u128:1(51) (which used GFP_NOFS) to complete wb_workfn. But neither kworker/u128:1(51) (which used GFP_NOFS) nor kworker/0:2(285) (which used GFP_NOIO) can make forward progress, because the OOM reaper does not clear TIF_MEMDIE from file_io.24(4715), and the OOM killer does not select the next OOM victim due to TIF_MEMDIE in file_io.24(4715).

If we remove these shortcuts, set TIF_MEMDIE on all OOM-killed threads sharing the victim's memory at oom_kill_process(), and clear TIF_MEMDIE from all threads sharing the victim's memory at __oom_reap_task() (or do the equivalent using a per-signal_struct flag, a per-mm_struct flag, or a timer), we would not have hit this race window. Thus, I say again that I think removing these shortcuts is better, unless we add a guaranteed unlocking mechanism like a timer; a sketch of the marking side follows below.

Also, I want to repeat that letting the current thread's current allocation request complete by giving it TIF_MEMDIE does not guarantee that the thread will arrive at do_exit() shortly. Even if the current allocation succeeds, the thread may block in an unkillable wait.

Also, is it acceptable to make the allocation requests by kworker/u128:1(51) and kworker/0:2(285) fail because they are !__GFP_FS && !__GFP_NOFAIL, when file_io.24(4715) has managed to allocate memory for the journal's transaction using GFP_NOFS | __GFP_NOFAIL?
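To illustrate the marking side of that proposal, here is a minimal sketch (an illustration, not a tested patch): it gives TIF_MEMDIE to every thread sharing the victim's mm. It assumes the 4.6-era helpers mark_oom_victim() and process_shares_mm() from mm/oom_kill.c, and the loop shape follows the existing SIGKILL loop in oom_kill_process():

----------
/*
 * Minimal sketch, not a tested patch: mark every thread sharing the
 * victim's mm as an OOM victim, so no sharer is left holding TIF_MEMDIE
 * invisibly while blocking the OOM killer.
 */
static void mark_sharers_oom_victims(struct task_struct *victim,
				     struct mm_struct *mm)
{
	struct task_struct *p;

	rcu_read_lock();
	for_each_process(p) {
		if (!process_shares_mm(p, mm))
			continue;
		if (same_thread_group(p, victim))
			continue;	/* the victim itself is already marked */
		mark_oom_victim(p);	/* sets TIF_MEMDIE, grants reserves */
	}
	rcu_read_unlock();
}
----------

The clearing side at __oom_reap_task() would presumably walk the same set and call exit_oom_victim() on each thread, so that the OOM killer can select the next victim even when one sharer is stuck in an unkillable wait.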
* Re: [RFC PATCH] mm, oom: move GFP_NOFS check to out_of_memory
  2016-04-05 11:12 ` Tetsuo Handa
  2016-04-06 10:28 ` Tetsuo Handa
@ 2016-04-06 12:41 ` Michal Hocko
  1 sibling, 0 replies; 14+ messages in thread
From: Michal Hocko @ 2016-04-06 12:41 UTC (permalink / raw)
To: Tetsuo Handa; +Cc: linux-mm, rientjes, hannes, akpm, linux-kernel

On Tue 05-04-16 20:12:51, Tetsuo Handa wrote:
[...]
> What I can observe under OOM livelock condition is a three-way dependency loop.
>
> (1) An OOM victim (which has TIF_MEMDIE) is unable to make forward progress
>     because it is blocked at an unkillable lock, waiting for another thread's
>     memory allocation.
>
> (2) A filesystem writeback work item is unable to make forward progress
>     because it is waiting for a GFP_NOFS memory allocation to be satisfied,
>     because storage I/O is stalling.
>
> (3) A disk I/O work item is unable to make forward progress because it is
>     waiting for a GFP_NOIO memory allocation to be satisfied, because
>     the OOM victim does not release memory but the OOM reaper does not
>     unlock TIF_MEMDIE.

It is true that find_lock_task_mm might have returned NULL, in which case we cannot reap anything. I guess we want to clear TIF_MEMDIE for such a task, because it wouldn't have been selected in the next OOM victim selection round anyway, so we can argue this would be acceptable. After more thinking about this, we can also clear it for tasks which block the oom_reaper because of mmap_sem contention: those would still be sitting on the memory, and we can retry to select them later, so we cannot end up in a worse state than we are in now. I will prepare a patch for that.

[...]
> (A) We use the same watermark for GFP_KERNEL / GFP_NOFS / GFP_NOIO
>     allocation requests.
>
> (B) We allow GFP_KERNEL allocation requests to consume memory down to the
>     min: watermark.
>
> (C) GFP_KERNEL allocation requests might depend on GFP_NOFS
>     allocation requests, and GFP_NOFS allocation requests
>     might depend on GFP_NOIO allocation requests.
>
> (D) A TIF_MEMDIE thread might wait forever for other threads'
>     GFP_NOFS / GFP_NOIO allocation requests.
>
> There is no gfp flag that prevents GFP_KERNEL from consuming memory down to
> the min: watermark. Thus, it is inevitable that GFP_KERNEL allocations
> consume memory down to the min: watermark and invoke the OOM killer. But if
> we change memory allocations which might block writeback operations so that
> they can utilize memory reserves, it is likely that allocations from
> workqueue items will no longer stall, even without depending on mmap_sem,
> which is a weakness of the OOM reaper.

Depending on memory reserves just shifts the issue to a later moment. Heavy GFP_NOFS loads would deplete this reserve very easily and we would be back to square one.

> Of course, there is no guarantee that allowing such GFP_NOFS / GFP_NOIO
> allocations to utilize memory reserves always avoids the OOM livelock. But
> at least we do not need to give up GFP_NOFS / GFP_NOIO allocations
> immediately without trying to utilize memory reserves.
> Therefore, I object to this comment
>
> Michal Hocko wrote:
> > +	/*
> > +	 * XXX: GFP_NOFS allocations should rather fail than rely on
> > +	 * other request to make a forward progress.
> > +	 * We are in an unfortunate situation where out_of_memory cannot
> > +	 * do much for this context but let's try it to at least get
> > +	 * access to memory reserved if the current task is killed (see
> > +	 * out_of_memory). Once filesystems are ready to handle allocation
> > +	 * failures more gracefully we should just bail out here.
> > +	 */
> > +
>
> which tries to make !__GFP_FS allocations fail.
I do not get what you object to. The comment is clear that we are not yet there to make this happen. The primary purpose of the comment is to make it clear where we should back off and fail if we _ever_ consider this safe to do.

> It is possible that such GFP_NOFS / GFP_NOIO allocations need to select
> the next OOM victim. If we add a guaranteed unlocking mechanism (the
> simplest way is a timeout), such GFP_NOFS / GFP_NOIO allocations will
> succeed, and we can avoid losing the reliability of async write operations.

This still relies on somebody else to make forward progress, which is not good. I can imagine a highly theoretical situation where even selecting another task doesn't lead to any relief, because most of the memory might be pinned for some reason.

> (By the way, can swap in/out work even if GFP_NOIO fails?)

The page would be redirtied and kept around if get_swap_bio() failed the GFP_NOIO allocation.

-- 
Michal Hocko
SUSE Labs
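For reference, the redirty behavior described in the last answer corresponds to the failure path of the 4.6-era __swap_writepage() in mm/page_io.c. The following is a simplified sketch of that path (a paraphrase, not a verbatim copy; get_swap_bio() is the static helper in mm/page_io.c):

----------
/*
 * Simplified sketch of the 4.6-era __swap_writepage() failure path in
 * mm/page_io.c (paraphrased): when the GFP_NOIO bio allocation fails,
 * the page is redirtied and kept in memory so swap-out can be retried.
 */
int swap_writepage_sketch(struct page *page, struct writeback_control *wbc,
			  bio_end_io_t end_write_func)
{
	struct bio *bio = get_swap_bio(GFP_NOIO, page, end_write_func);

	if (bio == NULL) {
		set_page_dirty(page);	/* keep the data; retry swap-out later */
		unlock_page(page);
		return -ENOMEM;
	}
	/* ... submit_bio() and writeback accounting elided ... */
	return 0;
}
----------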