From: Oleg Nesterov <oleg@redhat.com>
To: Michal Hocko <mhocko@suse.cz>
Cc: Cong Wang <xiyou.wangcong@gmail.com>,
"Rafael J. Wysocki" <rjw@rjwysocki.net>,
Tejun Heo <tj@kernel.org>, David Rientjes <rientjes@google.com>,
Andrew Morton <akpm@linux-foundation.org>,
linux-kernel@vger.kernel.org
Subject: Re: oom && coredump
Date: Thu, 27 Nov 2014 18:47:47 +0100 [thread overview]
Message-ID: <20141127174747.GA7684@redhat.com> (raw)
In-Reply-To: <20141127122908.GB18833@dhcp22.suse.cz>
On 11/27, Michal Hocko wrote:
>
> On Mon 20-10-14 21:56:18, Oleg Nesterov wrote:
> > speaking of "partial" fixes for oom problems...
> >
> > Perhaps the patch below makes sense? Sure, it is racy, but probably
> > better than nothing. And in any case (imo) this SIGNAL_GROUP_COREDUMP
> > check doesn't look bad, the coredumping task can consume more memory,
> > and we can't assume it is going to actually exit soon.
>
> I am not familiar with this area much (it is too scary...).
> I guess the issue is that OOM killer might try to kill a task which is
> currently in the middle of cordumping which is not killable, right? And
> if it is blocked on memory allocation then we are effectively dead
> locked. Right?
>
> Wouldn't it be better to make coredumping killable? Is this even
> possible?
It is already killable.
The problem is that oom-killer assumes that the PF_EXITING task should
exit and release its memory "soon", so oom_scan_process_thread() returns
OOM_SCAN_ABORT.
This is obviously wrong if this PF_EXITING task participates in coredump
and sleeps in exit_mm(). (iirc there are other issues with mt tasks, but
lets not discuss this now). This task won't exit until the coredumping
completes, and this can take o lot of time, more memory, etc.
> > And at least we can kill that ugly and wrong ptrace check.
>
> Why is the ptrace check wrong?
It was added in reply to exploit I sent. But:
- It doesn't (and can't) really work, it can only detect this particular
case and the same exploit still blocks oom-killer with the minimal
modifications.
- Once again, this has nothing to do with ptrace. That exploit used
ptrace only to control (freeze) the coredumping process, the coredumping
can "hang" because of other reasons.
- It is no longer needed after this patch, the coredumping process will
be killed.
So I think the patch below makes sense anyway. Although I should probably
split it and remove PT_TRACE_EXIT in 2/2.
> PF_EXITING should be set after
> ptrace_event(PTRACE_EVENT_EXIT, code). But then I can see
> unlikely(tsk->flags & PF_EXITING) check right after PTRACE_EVENT_EXIT
> notification.
Note that it checks task->group_leader, not task. But see above. This makes
no sense.
> > --- x/mm/oom_kill.c
> > +++ x/mm/oom_kill.c
> > @@ -254,6 +254,12 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
> > }
> > #endif
> >
> > +static inline bool task_will_free_mem(struct task_struct *task)
> > +{
> > + return (task->flags & PF_EXITING) &&
> > + !(task->signal->flags & SIGNAL_GROUP_COREDUMP);
> > +}
> > +
> > enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
> > unsigned long totalpages, const nodemask_t *nodemask,
> > bool force_kill)
> > @@ -281,14 +287,9 @@ enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
> > if (oom_task_origin(task))
> > return OOM_SCAN_SELECT;
> >
> > - if (task->flags & PF_EXITING && !force_kill) {
> > - /*
> > - * If this task is not being ptraced on exit, then wait for it
> > - * to finish before killing some other task unnecessarily.
> > - */
> > - if (!(task->group_leader->ptrace & PT_TRACE_EXIT))
> > - return OOM_SCAN_ABORT;
> > - }
> > + if (task_will_free_mem(task) && !force_kill)
> > + return OOM_SCAN_ABORT;
> > +
> > return OOM_SCAN_OK;
> > }
> >
> > @@ -426,7 +427,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
> > * If the task is already exiting, don't alarm the sysadmin or kill
> > * its children or threads, just set TIF_MEMDIE so it can die quickly
> > */
> > - if (p->flags & PF_EXITING) {
> > + if (task_will_free_mem(p)) {
> > set_tsk_thread_flag(p, TIF_MEMDIE);
> > put_task_struct(p);
> > return;
> > @@ -632,7 +633,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> > * select it. The goal is to allow it to allocate so that it may
> > * quickly exit and free its memory.
> > */
> > - if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
> > + if (fatal_signal_pending(current) || task_will_free_mem(current)) {
> > set_thread_flag(TIF_MEMDIE);
> > return;
> > }
> >
>
> --
> Michal Hocko
> SUSE Labs
next prev parent reply other threads:[~2014-11-27 17:48 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-10-17 17:19 + oom-pm-oom-killed-task-cannot-escape-pm-suspend.patch added to -mm tree Oleg Nesterov
2014-10-20 18:46 ` Michal Hocko
2014-10-20 19:06 ` Oleg Nesterov
2014-10-20 19:56 ` oom && coredump Oleg Nesterov
2014-11-27 12:29 ` Michal Hocko
2014-11-27 17:47 ` Oleg Nesterov [this message]
2014-12-02 8:59 ` Michal Hocko
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20141127174747.GA7684@redhat.com \
--to=oleg@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=linux-kernel@vger.kernel.org \
--cc=mhocko@suse.cz \
--cc=rientjes@google.com \
--cc=rjw@rjwysocki.net \
--cc=tj@kernel.org \
--cc=xiyou.wangcong@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.