From: Petr Mladek <pmladek@suse.com>
To: Michal Hocko <mhocko@suse.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>,
linux-mm@kvack.org,
Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Subject: Re: [PATCH] mm/page_alloc: Wait for oom_lock before retrying.
Date: Mon, 12 Dec 2016 12:49:03 +0100
Message-ID: <20161212114903.GM3506@pathway.suse.cz>
In-Reply-To: <20161212090702.GD18163@dhcp22.suse.cz>

On Mon 2016-12-12 10:07:03, Michal Hocko wrote:
> On Sat 10-12-16 20:24:57, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Fri 09-12-16 23:23:10, Tetsuo Handa wrote:
> > > > Michal Hocko wrote:
> > > > > On Thu 08-12-16 00:29:26, Tetsuo Handa wrote:
> > > > > > Michal Hocko wrote:
> > > > > > > On Tue 06-12-16 19:33:59, Tetsuo Handa wrote:
> > > > > > > > If the OOM killer is invoked when many threads are looping inside the
> > > > > > > > page allocator, it is possible that the OOM killer is preempted by other
> > > > > > > > threads.
> > > > > > >
> > > > > > > Hmm, the only way I can see this would happen is when the task which
> > > > > > > actually manages to take the lock is not invoking the OOM killer for
> > > > > > > whatever reason. Is this what happens in your case? Are you able to
> > > > > > > trigger this reliably?
> > > > > >
> > > > > > Regarding http://I-love.SAKURA.ne.jp/tmp/serial-20161206.txt.xz ,
> > > > > > somebody called oom_kill_process() and reached
> > > > > >
> > > > > > pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
> > > > > >
> > > > > > line but did not reach
> > > > > >
> > > > > > pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
> > > > > >
> > > > > > line within tolerable delay.
> > > > >
> > > > > I would be really interested in that. This can happen only if
> > > > > find_lock_task_mm fails. This would mean that either we are selecting a
> > > > > child without mm or the selected victim has no mm anymore. Both cases
> > > > > > should be ephemeral because oom_badness will rule those tasks out on the
> > > > > next round. So the primary question here is why no other task has hit
> > > > > out_of_memory.
> > > >
> > > > This can also happen due to an AB-BA livelock (oom_lock vs. console_sem).
> > >
> > > Care to explain what that livelock would look like?
> >
> > Two types of threads (Thread-1 which is holding oom_lock, Thread-2 which is not
> > holding oom_lock) are doing memory allocation. Since oom_lock is a mutex, there
> > can be only 1 instance for Thread-1. But there can be multiple instances for
> > Thread-2.
> >
> > (1) Thread-1 enters out_of_memory() because it is holding oom_lock.
> > (2) Thread-1 enters printk() due to
> >
> > pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n", ...);
> >
> > in oom_kill_process().
> >
> > (3) vprintk_func() is mapped to vprintk_default() because Thread-1 is not
> > inside NMI handler.
> >
> > (4) In vprintk_emit(), in_sched == false because loglevel for pr_err()
> > is not LOGLEVEL_SCHED.
> >
> > (5) Thread-1 calls log_store() via log_output() from vprintk_emit().
> >
> > (6) Thread-1 calls console_trylock() because in_sched == false.
> >
> > (7) Thread-1 acquires console_sem via down_trylock_console_sem().
> >
> > (8) In console_trylock(), console_may_schedule is set to true because
> > Thread-1 is in sleepable context.
> >
> > (9) Thread-1 calls console_unlock() because console_trylock() succeeded.
> >
> > (10) In console_unlock(), pending data stored by log_store() are printed
> > to consoles. Since there may be slow consoles, cond_resched() is called
> > if possible. And since console_may_schedule == true because Thread-1 is
> > in sleepable context, Thread-1 may be scheduled out in console_unlock().
> >
> > (11) Thread-2 tries to acquire oom_lock but it fails because Thread-1 is
> > holding oom_lock.
> >
> > (12) Thread-2 enters warn_alloc() because it is waiting for Thread-1 to
> > return from oom_kill_process().
> >
> > (13) Thread-2 enters printk() due to
> >
> > warn_alloc(gfp_mask, "page allocation stalls for %ums, order:%u", ...);
> >
> > in __alloc_pages_slowpath().
> >
> > (14) vprintk_func() is mapped to vprintk_default() because Thread-2 is not
> > inside NMI handler.
> >
> > (15) In vprintk_emit(), in_sched == false because loglevel for pr_err()
> > is not LOGLEVEL_SCHED.
> >
> > (16) Thread-2 calls log_store() via log_output() from vprintk_emit().
> >
> > (17) Thread-2 calls console_trylock() because in_sched == false.
> >
> > (18) Thread-2 fails to acquire console_sem via down_trylock_console_sem().
> >
> > (19) Thread-2 returns from vprintk_emit().
> >
> > (20) Thread-2 leaves warn_alloc().
> >
> > (21) When Thread-1 is woken up, it finds new data appended by Thread-2.
> >
> > (22) Thread-1 remains inside console_unlock() with oom_lock still held
> > because there is data which should be printed to consoles.
> >
> > (23) Thread-2 keeps failing to acquire oom_lock and periodically appends
> > new data via warn_alloc().
> >
> > (24) The user-visible result is that Thread-1 is unable to return from
> >
> > pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n", ...);
> >
> > in oom_kill_process().
>
> OK, I see. This is not a new problem though and people are trying to
> solve it in the printk proper. CCed some people, I do not have links
> to those threads handy. And if this is really the problem here then we
> definitely shouldn't put hacks into the page allocator path to handle
> it, because the printk flood might come from other, arbitrary sources.

Yup, this is exactly the type of problem that we want to solve
with the async printk.

> > The introduction of uncontrolled
> >
> > warn_alloc(gfp_mask, "page allocation stalls for %ums, order:%u", ...);

I am surprised that there would be so many messages.
If I get it correctly, this warning is printed
once every 10 seconds. Or am I wrong?

Well, you might want to consider using

	stall_timeout *= 2;

instead of adding the constant 10 * HZ.
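
To get a feeling for the difference, here is a minimal user-space
sketch that counts how many warnings a single stalled allocation would
print under each policy. It is not the page allocator code. The
function names, the HZ value, and the simplified ">=" deadline check
are assumptions made only for this illustration; the 10 * HZ starting
value follows the 10-second period mentioned above.

```c
#define HZ 1000UL	/* assumed tick rate, for illustration only */

/*
 * Hypothetical helper: number of warnings printed by one allocation
 * that stalls for stall_jiffies when the deadline grows by a constant.
 */
static unsigned int warnings_linear(unsigned long stall_jiffies)
{
	unsigned long stall_timeout = 10 * HZ;
	unsigned int count = 0;

	while (stall_jiffies >= stall_timeout) {
		count++;			/* one warning is printed */
		stall_timeout += 10 * HZ;	/* constant step */
	}
	return count;
}

/*
 * Hypothetical helper: same as above, but the deadline doubles after
 * every warning.
 */
static unsigned int warnings_doubling(unsigned long stall_jiffies)
{
	unsigned long stall_timeout = 10 * HZ;
	unsigned int count = 0;

	while (stall_jiffies >= stall_timeout) {
		count++;			/* one warning is printed */
		stall_timeout *= 2;		/* exponential backoff */
	}
	return count;
}
```

For a 10-minute stall, warnings_linear(600 * HZ) gives 60 messages
while warnings_doubling(600 * HZ) gives 6. This is still per
allocation, so many stalled threads can flood the log in both cases.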

Of course, a better solution would be some global throttling of
this message.
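
A rough user-space sketch of what such global throttling could look
like: one shared deadline, and only the thread that manages to advance
it is allowed to print. The names stall_warning_allowed(),
next_warn_at and WARN_INTERVAL are made up for this example, the
caller passes the current time in jiffies, and jiffies wraparound is
ignored. The kernel has its own ratelimit helpers, which are
deliberately not used here.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define HZ 1000UL			/* assumed tick rate */
#define WARN_INTERVAL (10 * HZ)		/* assumed global period */

/* Earliest time at which any thread may print the next warning. */
static _Atomic unsigned long next_warn_at;

/*
 * Hypothetical helper: returns true for at most one caller per
 * WARN_INTERVAL, no matter how many threads are stalled.
 */
static bool stall_warning_allowed(unsigned long now)
{
	unsigned long next = atomic_load(&next_warn_at);

	if (now < next)
		return false;	/* somebody has warned recently */

	/* Only the thread that wins the update gets to print. */
	return atomic_compare_exchange_strong(&next_warn_at, &next,
					      now + WARN_INTERVAL);
}
```

A stalled thread would check stall_warning_allowed() before printing,
so the number of messages no longer grows with the number of threads
looping in the allocator.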

Best Regards,
Petr

PS: I am not an mm expert and did not read this thread. Just ignore this
if I missed the point. Anyway, it sounds weird to linearize all
allocation requests in an OOM situation. It is much harder to unblock
high-order requests than low-order ones.