From: Michal Hocko <mhocko@suse.cz>
To: azurIt <azurit@pobox.sk>
Cc: Johannes Weiner <hannes@cmpxchg.org>,
Andrew Morton <akpm@linux-foundation.org>,
David Rientjes <rientjes@google.com>,
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
linux-mm@kvack.org, cgroups@vger.kernel.org, x86@kernel.org,
linux-arch@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [patch 0/7] improve memcg oom killer robustness v2
Date: Mon, 16 Sep 2013 15:40:14 +0200 [thread overview]
Message-ID: <20130916134014.GA3674@dhcp22.suse.cz> (raw)
In-Reply-To: <20130914124831.4DD20346@pobox.sk>
On Sat 14-09-13 12:48:31, azurIt wrote:
[...]
> Here is the first occurrence, last night between 5:15 and 5:25:
> - this time I kept a terminal open from another server to the problematic one, with htop running
> - when the server went down I opened it and saw one process of one user running at the top and taking 97% of CPU (cgroup 1304)
I guess you do not have stack trace(s) for that process? That would be
extremely helpful.
> - everything was stuck, so htop didn't help me much
> - luckily, my new 'load check' script, which I mentioned before, was able to kill apache and everything went back to normal (success with its very first version, wow ;) )
> - I checked some other logs and everything seems to point to cgroup 1304; the kernel log at 5:14-15 also shows a hard OOM in that cgroup:
> http://watchdog.sk/lkml/kern7.log
I am not sure what you mean by hard OOM because there is no global OOM
in that log:
$ grep "Kill process" kern7.log | sed 's@.*]\(.*Kill process\>\).*@\1@' | sort -u
Memory cgroup out of memory: Kill process
But you had a lot of memcg OOMs in that group (1304) during that time
(and even earlier):
$ grep "\<1304\>" kern7.log
Sep 14 05:03:45 server01 kernel: [188287.778020] Task in /1304/uid killed as a result of limit of /1304
Sep 14 05:03:46 server01 kernel: [188287.871427] [30433] 1304 30433 181781 66426 7 0 0 apache2
Sep 14 05:03:46 server01 kernel: [188287.871594] [30808] 1304 30808 169111 53866 4 0 0 apache2
Sep 14 05:03:46 server01 kernel: [188287.871742] [30809] 1304 30809 181168 65992 2 0 0 apache2
Sep 14 05:03:46 server01 kernel: [188287.871890] [30811] 1304 30811 168684 53399 3 0 0 apache2
Sep 14 05:03:46 server01 kernel: [188287.872041] [30814] 1304 30814 181102 65924 3 0 0 apache2
Sep 14 05:03:46 server01 kernel: [188287.872189] [30815] 1304 30815 168814 53451 4 0 0 apache2
Sep 14 05:03:46 server01 kernel: [188287.877731] Task in /1304/uid killed as a result of limit of /1304
Sep 14 05:03:46 server01 kernel: [188287.973155] [30808] 1304 30808 169111 53918 3 0 0 apache2
Sep 14 05:03:46 server01 kernel: [188287.973155] [30809] 1304 30809 181168 65992 2 0 0 apache2
Sep 14 05:03:46 server01 kernel: [188287.973155] [30811] 1304 30811 168684 53399 3 0 0 apache2
Sep 14 05:03:46 server01 kernel: [188287.973155] [30814] 1304 30814 181102 65924 3 0 0 apache2
Sep 14 05:03:46 server01 kernel: [188287.973155] [30815] 1304 30815 168815 53558 0 0 0 apache2
Sep 14 05:03:47 server01 kernel: [188289.137540] Task in /1304/uid killed as a result of limit of /1304
Sep 14 05:03:47 server01 kernel: [188289.231873] [30809] 1304 30809 182662 67534 7 0 0 apache2
Sep 14 05:03:47 server01 kernel: [188289.232021] [30811] 1304 30811 171920 56781 4 0 0 apache2
Sep 14 05:03:47 server01 kernel: [188289.232171] [30814] 1304 30814 182596 67470 3 0 0 apache2
Sep 14 05:03:47 server01 kernel: [188289.232319] [30815] 1304 30815 171920 56778 1 0 0 apache2
Sep 14 05:03:47 server01 kernel: [188289.232478] [30896] 1304 30896 171918 56761 0 0 0 apache2
[...]
Sep 14 05:14:00 server01 kernel: [188902.666893] Task in /1304/uid killed as a result of limit of /1304
Sep 14 05:14:00 server01 kernel: [188902.742928] [ 7806] 1304 7806 178891 64008 6 0 0 apache2
Sep 14 05:14:00 server01 kernel: [188902.743080] [ 7910] 1304 7910 175318 60302 2 0 0 apache2
Sep 14 05:14:00 server01 kernel: [188902.743228] [ 7911] 1304 7911 174943 59878 1 0 0 apache2
Sep 14 05:14:00 server01 kernel: [188902.743376] [ 7912] 1304 7912 171568 56404 3 0 0 apache2
Sep 14 05:14:00 server01 kernel: [188902.743524] [ 7914] 1304 7914 174911 59879 5 0 0 apache2
Sep 14 05:14:00 server01 kernel: [188902.743673] [ 7915] 1304 7915 173472 58386 2 0 0 apache2
Sep 14 05:14:02 server01 kernel: [188904.249749] Task in /1304/uid killed as a result of limit of /1304
Sep 14 05:14:02 server01 kernel: [188904.336276] [ 7910] 1304 7910 176278 61211 6 0 0 apache2
Sep 14 05:14:02 server01 kernel: [188904.336276] [ 7911] 1304 7911 176278 61211 7 0 0 apache2
Sep 14 05:14:02 server01 kernel: [188904.336276] [ 7912] 1304 7912 173732 58655 3 0 0 apache2
Sep 14 05:14:02 server01 kernel: [188904.336276] [ 7914] 1304 7914 176269 61211 7 0 0 apache2
Sep 14 05:14:02 server01 kernel: [188904.336276] [ 7915] 1304 7915 176269 61211 7 0 0 apache2
Sep 14 05:14:02 server01 kernel: [188904.336276] [ 7966] 1304 7966 170385 55164 7 0 0 apache2
Sep 14 05:14:02 server01 kernel: [188904.340992] Task in /1304/uid killed as a result of limit of /1304
Sep 14 05:14:02 server01 kernel: [188904.424284] [ 7911] 1304 7911 176340 61332 2 0 0 apache2
Sep 14 05:14:02 server01 kernel: [188904.424284] [ 7912] 1304 7912 173996 58901 1 0 0 apache2
Sep 14 05:14:02 server01 kernel: [188904.424284] [ 7914] 1304 7914 176331 61331 4 0 0 apache2
Sep 14 05:14:02 server01 kernel: [188904.424284] [ 7915] 1304 7915 176331 61331 2 0 0 apache2
Sep 14 05:14:02 server01 kernel: [188904.424284] [ 7966] 1304 7966 170385 55164 7 0 0 apache2
[...]
The only thing that is clear from this is that one process is always
killed, a new one is spawned, and that leads right back to the same
out-of-memory situation. So this is precisely what Johannes already
described as a Hydra load.
Then there is silence in the logs between:
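The kill/respawn cycle can be confirmed from the log itself by counting
kill events per minute. A minimal sketch; the sample lines below are
hypothetical stand-ins in the style of kern7.log, and the same pipeline
can be pointed at the real file:

```shell
# Hypothetical sample lines in the style of kern7.log; substitute the
# real log file to see the actual kill rate.
cat > /tmp/kern-sample.log <<'EOF'
Sep 14 05:03:46 server01 kernel: [188287.871427] Memory cgroup out of memory: Kill process 30433 (apache2) score 1000 or sacrifice child
Sep 14 05:14:02 server01 kernel: [188904.336276] Memory cgroup out of memory: Kill process 7910 (apache2) score 1000 or sacrifice child
Sep 14 05:14:39 server01 kernel: [188940.869639] Killed process 8453 (apache2) total-vm:710732kB, anon-rss:245680kB, file-rss:4588kB
EOF

# Count "Kill process" events per minute: field 3 is HH:MM:SS, keep HH:MM.
grep "Kill process" /tmp/kern-sample.log \
  | awk '{ split($3, t, ":"); print t[1] ":" t[2] }' \
  | sort | uniq -c
# prints one count per minute, e.g. "1 05:03" and "1 05:14"
```

A steady stream of kills in the same minute for the same cgroup is the
Hydra signature: each killed apache2 is immediately replaced by a fresh
one that runs into the same limit.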
Sep 14 05:14:39 server01 kernel: [188940.869639] Killed process 8453 (apache2) total-vm:710732kB, anon-rss:245680kB, file-rss:4588kB
Sep 14 05:21:24 server01 kernel: [189344.518699] grsec: From 95.103.217.66: failed fork with errno EAGAIN by /bin/dash[sh:10362] uid/euid:1387/1387 gid/egid:100/100, parent /usr/sbin/cron[cron:10144] uid/euid:0/0 gid/egid:0/0
Maybe that is what you are referring to as a stuck situation. Is pid
8453 the task you have seen consuming the CPU? If yes, then we would
need a stack trace for that task to find out what is going on.
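For reference, the kernel-side stack of a stuck task can usually be read
from procfs (root only, and the task must still exist). A sketch using
the pid from the log above:

```shell
PID=8453   # the pid from the log above; substitute the task seen at 97% CPU
# /proc/<pid>/stack shows the task's in-kernel call stack. Guarded so the
# sketch degrades gracefully when the pid is gone or we lack privileges:
if [ -r "/proc/$PID/stack" ]; then
    cat "/proc/$PID/stack"
else
    echo "pid $PID gone or /proc/$PID/stack not readable (need root)"
fi
# Alternatively, SysRq-t dumps the stacks of all tasks into dmesg:
#   echo t > /proc/sysrq-trigger
```

Capturing this while the task is spinning at 97% CPU would show whether
it is looping inside the memcg charge/reclaim path or stuck elsewhere.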
Other than that nothing really suspicious in the log AFAICS.
--
Michal Hocko
SUSE Labs
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>