From: Oleg Nesterov <oleg@redhat.com>
To: Michal Hocko <mhocko@suse.cz>
Cc: Cong Wang <xiyou.wangcong@gmail.com>,
"Rafael J. Wysocki" <rjw@rjwysocki.net>,
Tejun Heo <tj@kernel.org>, David Rientjes <rientjes@google.com>,
Andrew Morton <akpm@linux-foundation.org>,
linux-kernel@vger.kernel.org
Subject: Re: oom && coredump
Date: Thu, 27 Nov 2014 18:47:47 +0100 [thread overview]
Message-ID: <20141127174747.GA7684@redhat.com> (raw)
In-Reply-To: <20141127122908.GB18833@dhcp22.suse.cz>
On 11/27, Michal Hocko wrote:
>
> On Mon 20-10-14 21:56:18, Oleg Nesterov wrote:
> > speaking of "partial" fixes for oom problems...
> >
> > Perhaps the patch below makes sense? Sure, it is racy, but probably
> > better than nothing. And in any case (imo) this SIGNAL_GROUP_COREDUMP
> > check doesn't look bad, the coredumping task can consume more memory,
> > and we can't assume it is going to actually exit soon.
>
> I am not familiar with this area much (it is too scary...).
> I guess the issue is that OOM killer might try to kill a task which is
> currently in the middle of cordumping which is not killable, right? And
> if it is blocked on memory allocation then we are effectively dead
> locked. Right?
>
> Wouldn't it be better to make coredumping killable? Is this even
> possible?
It is already killable.
The problem is that oom-killer assumes that the PF_EXITING task should
exit and release its memory "soon", so oom_scan_process_thread() returns
OOM_SCAN_ABORT.
This is obviously wrong if this PF_EXITING task participates in coredump
and sleeps in exit_mm(). (iirc there are other issues with mt tasks, but
lets not discuss this now). This task won't exit until the coredumping
completes, and this can take o lot of time, more memory, etc.
> > And at least we can kill that ugly and wrong ptrace check.
>
> Why is the ptrace check wrong?
It was added in reply to exploit I sent. But:
- It doesn't (and can't) really work, it can only detect this particular
case and the same exploit still blocks oom-killer with the minimal
modifications.
- Once again, this has nothing to do with ptrace. That exploit used
ptrace only to control (freeze) the coredumping process, the coredumping
can "hang" because of other reasons.
- It is no longer needed after this patch, the coredumping process will
be killed.
So I think the patch below makes sense anyway. Although I should probably
split it and remove PT_TRACE_EXIT in 2/2.
> PF_EXITING should be set after
> ptrace_event(PTRACE_EVENT_EXIT, code). But then I can see
> unlikely(tsk->flags & PF_EXITING) check right after PTRACE_EVENT_EXIT
> notification.
Note that it checks task->group_leader, not task. But see above. This makes
no sense.
> > --- x/mm/oom_kill.c
> > +++ x/mm/oom_kill.c
> > @@ -254,6 +254,12 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
> > }
> > #endif
> >
> > +static inline bool task_will_free_mem(struct task_struct *task)
> > +{
> > + return (task->flags & PF_EXITING) &&
> > + !(task->signal->flags & SIGNAL_GROUP_COREDUMP);
> > +}
> > +
> > enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
> > unsigned long totalpages, const nodemask_t *nodemask,
> > bool force_kill)
> > @@ -281,14 +287,9 @@ enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
> > if (oom_task_origin(task))
> > return OOM_SCAN_SELECT;
> >
> > - if (task->flags & PF_EXITING && !force_kill) {
> > - /*
> > - * If this task is not being ptraced on exit, then wait for it
> > - * to finish before killing some other task unnecessarily.
> > - */
> > - if (!(task->group_leader->ptrace & PT_TRACE_EXIT))
> > - return OOM_SCAN_ABORT;
> > - }
> > + if (task_will_free_mem(task) && !force_kill)
> > + return OOM_SCAN_ABORT;
> > +
> > return OOM_SCAN_OK;
> > }
> >
> > @@ -426,7 +427,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
> > * If the task is already exiting, don't alarm the sysadmin or kill
> > * its children or threads, just set TIF_MEMDIE so it can die quickly
> > */
> > - if (p->flags & PF_EXITING) {
> > + if (task_will_free_mem(p)) {
> > set_tsk_thread_flag(p, TIF_MEMDIE);
> > put_task_struct(p);
> > return;
> > @@ -632,7 +633,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> > * select it. The goal is to allow it to allocate so that it may
> > * quickly exit and free its memory.
> > */
> > - if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
> > + if (fatal_signal_pending(current) || task_will_free_mem(current)) {
> > set_thread_flag(TIF_MEMDIE);
> > return;
> > }
> >
>
> --
> Michal Hocko
> SUSE Labs
next prev parent reply other threads:[~2014-11-27 17:48 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-10-17 17:19 + oom-pm-oom-killed-task-cannot-escape-pm-suspend.patch added to -mm tree Oleg Nesterov
2014-10-20 18:46 ` Michal Hocko
2014-10-20 19:06 ` Oleg Nesterov
2014-10-20 19:56 ` oom && coredump Oleg Nesterov
2014-11-27 12:29 ` Michal Hocko
2014-11-27 17:47 ` Oleg Nesterov [this message]
2014-12-02 8:59 ` Michal Hocko
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20141127174747.GA7684@redhat.com \
--to=oleg@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=linux-kernel@vger.kernel.org \
--cc=mhocko@suse.cz \
--cc=rientjes@google.com \
--cc=rjw@rjwysocki.net \
--cc=tj@kernel.org \
--cc=xiyou.wangcong@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).