From: Michal Hocko <mhocko@suse.cz>
To: azurIt <azurit@pobox.sk>
Cc: Johannes Weiner <hannes@cmpxchg.org>,
Andrew Morton <akpm@linux-foundation.org>,
David Rientjes <rientjes@google.com>,
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
linux-mm@kvack.org, cgroups@vger.kernel.org, x86@kernel.org,
linux-arch@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [patch 0/7] improve memcg oom killer robustness v2
Date: Mon, 16 Sep 2013 15:40:14 +0200 [thread overview]
Message-ID: <20130916134014.GA3674@dhcp22.suse.cz> (raw)
In-Reply-To: <20130914124831.4DD20346@pobox.sk>
On Sat 14-09-13 12:48:31, azurIt wrote:
[...]
> Here is the first occurrence, last night between 5:15 and 5:25:
> - this time I kept a terminal open from another server to the problematic one, with htop running
> - when the server went down I opened it and saw one process of one user running at the top and taking 97% of CPU (cgroup 1304)
I guess you do not have stack trace(s) for that process? That would be
extremely helpful.
> - everything was stuck, so htop didn't help me much
> - luckily, my new 'load check' script, which I mentioned before, was able to kill apache and everything went back to normal (success with its very first version, wow ;) )
> - I checked some other logs and everything seems to point to cgroup 1304; the kernel log at 5:14-15 also shows a hard OOM in that cgroup:
> http://watchdog.sk/lkml/kern7.log
I am not sure what you mean by hard OOM because there is no global OOM
in that log:
$ grep "Kill process" kern7.log | sed 's@.*]\(.*Kill process\>\).*@\1@' | sort -u
Memory cgroup out of memory: Kill process
But you had a lot of memcg OOMs in that group (1304) during that time
(and even earlier):
$ grep "\<1304\>" kern7.log
Sep 14 05:03:45 server01 kernel: [188287.778020] Task in /1304/uid killed as a result of limit of /1304
Sep 14 05:03:46 server01 kernel: [188287.871427] [30433] 1304 30433 181781 66426 7 0 0 apache2
Sep 14 05:03:46 server01 kernel: [188287.871594] [30808] 1304 30808 169111 53866 4 0 0 apache2
Sep 14 05:03:46 server01 kernel: [188287.871742] [30809] 1304 30809 181168 65992 2 0 0 apache2
Sep 14 05:03:46 server01 kernel: [188287.871890] [30811] 1304 30811 168684 53399 3 0 0 apache2
Sep 14 05:03:46 server01 kernel: [188287.872041] [30814] 1304 30814 181102 65924 3 0 0 apache2
Sep 14 05:03:46 server01 kernel: [188287.872189] [30815] 1304 30815 168814 53451 4 0 0 apache2
Sep 14 05:03:46 server01 kernel: [188287.877731] Task in /1304/uid killed as a result of limit of /1304
Sep 14 05:03:46 server01 kernel: [188287.973155] [30808] 1304 30808 169111 53918 3 0 0 apache2
Sep 14 05:03:46 server01 kernel: [188287.973155] [30809] 1304 30809 181168 65992 2 0 0 apache2
Sep 14 05:03:46 server01 kernel: [188287.973155] [30811] 1304 30811 168684 53399 3 0 0 apache2
Sep 14 05:03:46 server01 kernel: [188287.973155] [30814] 1304 30814 181102 65924 3 0 0 apache2
Sep 14 05:03:46 server01 kernel: [188287.973155] [30815] 1304 30815 168815 53558 0 0 0 apache2
Sep 14 05:03:47 server01 kernel: [188289.137540] Task in /1304/uid killed as a result of limit of /1304
Sep 14 05:03:47 server01 kernel: [188289.231873] [30809] 1304 30809 182662 67534 7 0 0 apache2
Sep 14 05:03:47 server01 kernel: [188289.232021] [30811] 1304 30811 171920 56781 4 0 0 apache2
Sep 14 05:03:47 server01 kernel: [188289.232171] [30814] 1304 30814 182596 67470 3 0 0 apache2
Sep 14 05:03:47 server01 kernel: [188289.232319] [30815] 1304 30815 171920 56778 1 0 0 apache2
Sep 14 05:03:47 server01 kernel: [188289.232478] [30896] 1304 30896 171918 56761 0 0 0 apache2
[...]
Sep 14 05:14:00 server01 kernel: [188902.666893] Task in /1304/uid killed as a result of limit of /1304
Sep 14 05:14:00 server01 kernel: [188902.742928] [ 7806] 1304 7806 178891 64008 6 0 0 apache2
Sep 14 05:14:00 server01 kernel: [188902.743080] [ 7910] 1304 7910 175318 60302 2 0 0 apache2
Sep 14 05:14:00 server01 kernel: [188902.743228] [ 7911] 1304 7911 174943 59878 1 0 0 apache2
Sep 14 05:14:00 server01 kernel: [188902.743376] [ 7912] 1304 7912 171568 56404 3 0 0 apache2
Sep 14 05:14:00 server01 kernel: [188902.743524] [ 7914] 1304 7914 174911 59879 5 0 0 apache2
Sep 14 05:14:00 server01 kernel: [188902.743673] [ 7915] 1304 7915 173472 58386 2 0 0 apache2
Sep 14 05:14:02 server01 kernel: [188904.249749] Task in /1304/uid killed as a result of limit of /1304
Sep 14 05:14:02 server01 kernel: [188904.336276] [ 7910] 1304 7910 176278 61211 6 0 0 apache2
Sep 14 05:14:02 server01 kernel: [188904.336276] [ 7911] 1304 7911 176278 61211 7 0 0 apache2
Sep 14 05:14:02 server01 kernel: [188904.336276] [ 7912] 1304 7912 173732 58655 3 0 0 apache2
Sep 14 05:14:02 server01 kernel: [188904.336276] [ 7914] 1304 7914 176269 61211 7 0 0 apache2
Sep 14 05:14:02 server01 kernel: [188904.336276] [ 7915] 1304 7915 176269 61211 7 0 0 apache2
Sep 14 05:14:02 server01 kernel: [188904.336276] [ 7966] 1304 7966 170385 55164 7 0 0 apache2
Sep 14 05:14:02 server01 kernel: [188904.340992] Task in /1304/uid killed as a result of limit of /1304
Sep 14 05:14:02 server01 kernel: [188904.424284] [ 7911] 1304 7911 176340 61332 2 0 0 apache2
Sep 14 05:14:02 server01 kernel: [188904.424284] [ 7912] 1304 7912 173996 58901 1 0 0 apache2
Sep 14 05:14:02 server01 kernel: [188904.424284] [ 7914] 1304 7914 176331 61331 4 0 0 apache2
Sep 14 05:14:02 server01 kernel: [188904.424284] [ 7915] 1304 7915 176331 61331 2 0 0 apache2
Sep 14 05:14:02 server01 kernel: [188904.424284] [ 7966] 1304 7966 170385 55164 7 0 0 apache2
[...]
The only thing that is clear from this is that one process is always
killed, a new one is spawned, and that leads right back to the same
out-of-memory situation. So this is precisely what Johannes already
described as a Hydra load.
Then there is silence in the logs between:
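The kill/respawn cycle can be confirmed from the log itself by counting
kill events per minute. A minimal sketch; the sample lines below are
hypothetical stand-ins in the style of kern7.log, and the same pipeline
can be pointed at the real file:

```shell
# Hypothetical sample lines in the style of kern7.log; substitute the
# real log file to see the actual kill rate.
cat > /tmp/kern-sample.log <<'EOF'
Sep 14 05:03:46 server01 kernel: [188287.871427] Memory cgroup out of memory: Kill process 30433 (apache2) score 1000 or sacrifice child
Sep 14 05:14:02 server01 kernel: [188904.336276] Memory cgroup out of memory: Kill process 7910 (apache2) score 1000 or sacrifice child
Sep 14 05:14:39 server01 kernel: [188940.869639] Killed process 8453 (apache2) total-vm:710732kB, anon-rss:245680kB, file-rss:4588kB
EOF

# Count "Kill process" events per minute: field 3 is HH:MM:SS, keep HH:MM.
grep "Kill process" /tmp/kern-sample.log \
  | awk '{ split($3, t, ":"); print t[1] ":" t[2] }' \
  | sort | uniq -c
# prints one count per minute, e.g. "1 05:03" and "1 05:14"
```

A steady stream of kills in the same minute for the same cgroup is the
Hydra signature: each killed apache2 is immediately replaced by a fresh
one that runs into the same limit.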
Sep 14 05:14:39 server01 kernel: [188940.869639] Killed process 8453 (apache2) total-vm:710732kB, anon-rss:245680kB, file-rss:4588kB
Sep 14 05:21:24 server01 kernel: [189344.518699] grsec: From 95.103.217.66: failed fork with errno EAGAIN by /bin/dash[sh:10362] uid/euid:1387/1387 gid/egid:100/100, parent /usr/sbin/cron[cron:10144] uid/euid:0/0 gid/egid:0/0
Maybe that is what you are referring to as a stuck situation. Is pid
8453 the task you have seen consuming the CPU? If yes, then we would
need a stack trace for that task to find out what is going on.
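For reference, the kernel-side stack of a stuck task can usually be read
from procfs (root only, and the task must still exist). A sketch using
the pid from the log above:

```shell
PID=8453   # the pid from the log above; substitute the task seen at 97% CPU
# /proc/<pid>/stack shows the task's in-kernel call stack. Guarded so the
# sketch degrades gracefully when the pid is gone or we lack privileges:
if [ -r "/proc/$PID/stack" ]; then
    cat "/proc/$PID/stack"
else
    echo "pid $PID gone or /proc/$PID/stack not readable (need root)"
fi
# Alternatively, SysRq-t dumps the stacks of all tasks into dmesg:
#   echo t > /proc/sysrq-trigger
```

Capturing this while the task is spinning at 97% CPU would show whether
it is looping inside the memcg charge/reclaim path or stuck elsewhere.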
Other than that nothing really suspicious in the log AFAICS.
--
Michal Hocko
SUSE Labs
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>