public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* oom with 2.6.11
@ 2005-03-08 15:21 Christian Kujau
  2005-03-09 13:18 ` Mauricio Lin
  2005-03-11 10:59 ` Coywolf Qi Hunt
  0 siblings, 2 replies; 21+ messages in thread
From: Christian Kujau @ 2005-03-08 15:21 UTC (permalink / raw)
  To: linux-kernel

hallo list,

today my machine went out out memory and noticing it several hours after
the first OOM message in the log, i wonder
   1) why this happened at all and
   2) why almost every service was killed despite the clever algorithms
      documented in mm/oom_kill.c.

the first oom message went to the syslog at 01:27, i was away and no heavy
tasks were scheduled:

http://nerdbynature.de/bits/sheep/2.6.11/oom/oom_2.6.11.txt

mysqld got killed by the oom killer, so i have to suspect mysql for being
the reason for oom here, even that i know that mysqld is running all day
long. several other tasks got killed, but "Free swap" stays at 0kB and the
oom killer kills almost every other tasks, with no success in freeing ram.

the log stops at 03:21, perhaps syslog-ng got killed.
at around 07:31 i noticed the mess, did SYSRQ-E and now i was able to
login again. i pressed SYSRQ-M/T/P too, they are all in the log. at this
time loadavg was at 249 ;)

i went to runlevel 2, then up again to 3 and all services are up and
running again.

some 2.6.11-rc3 BK snapshot was running pretty stable (no OOM) for ~30
days before i switched to 2.6.11 (vanilla) a few days ago. i have to (not)
reproduce the problem the next night, i wonder if it will happen again.

do you vm-gurus have any idea to the points asked above?

more infos about the box here: http://nerdbynature.de/bits/sheep/2.6.11/oom/


thank you for your comments,
Christian.
-- 
BOFH excuse #281:

The co-locator cannot verify the frame-relay gateway to the ISDN server.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: oom with 2.6.11
  2005-03-08 15:21 oom " Christian Kujau
@ 2005-03-09 13:18 ` Mauricio Lin
  2005-03-09 13:41   ` Mauricio Lin
  2005-03-11 10:59 ` Coywolf Qi Hunt
  1 sibling, 1 reply; 21+ messages in thread
From: Mauricio Lin @ 2005-03-09 13:18 UTC (permalink / raw)
  To: Christian Kujau; +Cc: linux-kernel

Hi Christian,

Could you check the mm/oom_kill.c for your kernel 2.6.11-rc3?

During the 2.6.11-rc development, the oom killer was changed by Andrea
Arcangeli. I do not remember exactly which was the version that this
modification was included, perhaps in kernel 2.6.11-rc4.

Now this oom killer modification is part of 2.6.11 vanilla kernel.

Send the mm/oom_kill.c of 2.6.11-rc3 to me, please. Let me confirm my doubt.

BR,

Mauricio Lin. 

On Tue, 08 Mar 2005 16:21:21 +0100, Christian Kujau <evil@g-house.de> wrote:
> hallo list,
> 
> today my machine went out out memory and noticing it several hours after
> the first OOM message in the log, i wonder
>    1) why this happened at all and
>    2) why almost every service was killed despite the clever algorithms
>       documented in mm/oom_kill.c.
> 
> the first oom message went to the syslog at 01:27, i was away and no heavy
> tasks were scheduled:
> 
> http://nerdbynature.de/bits/sheep/2.6.11/oom/oom_2.6.11.txt
> 
> mysqld got killed by the oom killer, so i have to suspect mysql for being
> the reason for oom here, even that i know that mysqld is running all day
> long. several other tasks got killed, but "Free swap" stays at 0kB and the
> oom killer kills almost every other tasks, with no success in freeing ram.
> 
> the log stops at 03:21, perhaps syslog-ng got killed.
> at around 07:31 i noticed the mess, did SYSRQ-E and now i was able to
> login again. i pressed SYSRQ-M/T/P too, they are all in the log. at this
> time loadavg was at 249 ;)
> 
> i went to runlevel 2, then up again to 3 and all services are up and
> running again.
> 
> some 2.6.11-rc3 BK snapshot was running pretty stable (no OOM) for ~30
> days before i switched to 2.6.11 (vanilla) a few days ago. i have to (not)
> reproduce the problem the next night, i wonder if it will happen again.
> 
> do you vm-gurus have any idea to the points asked above?
> 
> more infos about the box here: http://nerdbynature.de/bits/sheep/2.6.11/oom/
> 
> thank you for your comments,
> Christian.
> --
> BOFH excuse #281:
> 
> The co-locator cannot verify the frame-relay gateway to the ISDN server.
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: OOM with 2.6.11
@ 2005-03-09 13:22 Christian Kujau
  0 siblings, 0 replies; 21+ messages in thread
From: Christian Kujau @ 2005-03-09 13:22 UTC (permalink / raw)
  To: linux-kernel


...replying to myself: it happened again!
switched back to 2.6.11-rc5-bk2, details will follow.

thanks,
Christian.
-- 
BOFH excuse #311:

transient bus protocol violation

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: oom with 2.6.11
  2005-03-09 13:18 ` Mauricio Lin
@ 2005-03-09 13:41   ` Mauricio Lin
  2005-03-09 14:00     ` Christian Kujau
  0 siblings, 1 reply; 21+ messages in thread
From: Mauricio Lin @ 2005-03-09 13:41 UTC (permalink / raw)
  To: Christian Kujau; +Cc: linux-kernel

Hi Christian,

I found the 2.6.11-rc3 patch. The oom killer modification from
Arcangeli was included in 2.6.11-rc3. Right? So this is correct, so
the problem is not related to Arcangeli modification.

Does anyone have idea?

BR,

Mauricio Lin.

On Wed, 9 Mar 2005 09:18:31 -0400, Mauricio Lin <mauriciolin@gmail.com> wrote:
> Hi Christian,
> 
> Could you check the mm/oom_kill.c for your kernel 2.6.11-rc3?
> 
> During the 2.6.11-rc development, the oom killer was changed by Andrea
> Arcangeli. I do not remember exactly which was the version that this
> modification was included, perhaps in kernel 2.6.11-rc4.
> 
> Now this oom killer modification is part of 2.6.11 vanilla kernel.
> 
> Send the mm/oom_kill.c of 2.6.11-rc3 to me, please. Let me confirm my doubt.
> 
> BR,
> 
> Mauricio Lin.
> 
> On Tue, 08 Mar 2005 16:21:21 +0100, Christian Kujau <evil@g-house.de> wrote:
> > hallo list,
> >
> > today my machine went out out memory and noticing it several hours after
> > the first OOM message in the log, i wonder
> >    1) why this happened at all and
> >    2) why almost every service was killed despite the clever algorithms
> >       documented in mm/oom_kill.c.
> >
> > the first oom message went to the syslog at 01:27, i was away and no heavy
> > tasks were scheduled:
> >
> > http://nerdbynature.de/bits/sheep/2.6.11/oom/oom_2.6.11.txt
> >
> > mysqld got killed by the oom killer, so i have to suspect mysql for being
> > the reason for oom here, even that i know that mysqld is running all day
> > long. several other tasks got killed, but "Free swap" stays at 0kB and the
> > oom killer kills almost every other tasks, with no success in freeing ram.
> >
> > the log stops at 03:21, perhaps syslog-ng got killed.
> > at around 07:31 i noticed the mess, did SYSRQ-E and now i was able to
> > login again. i pressed SYSRQ-M/T/P too, they are all in the log. at this
> > time loadavg was at 249 ;)
> >
> > i went to runlevel 2, then up again to 3 and all services are up and
> > running again.
> >
> > some 2.6.11-rc3 BK snapshot was running pretty stable (no OOM) for ~30
> > days before i switched to 2.6.11 (vanilla) a few days ago. i have to (not)
> > reproduce the problem the next night, i wonder if it will happen again.
> >
> > do you vm-gurus have any idea to the points asked above?
> >
> > more infos about the box here: http://nerdbynature.de/bits/sheep/2.6.11/oom/
> >
> > thank you for your comments,
> > Christian.
> > --
> > BOFH excuse #281:
> >
> > The co-locator cannot verify the frame-relay gateway to the ISDN server.
> > -
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/
> >
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: oom with 2.6.11
  2005-03-09 13:41   ` Mauricio Lin
@ 2005-03-09 14:00     ` Christian Kujau
  2005-03-10 15:12       ` Christian Kujau
  0 siblings, 1 reply; 21+ messages in thread
From: Christian Kujau @ 2005-03-09 14:00 UTC (permalink / raw)
  To: linux-kernel; +Cc: Mauricio Lin

Mauricio Lin wrote:
> Hi Christian,
> 
> I found the 2.6.11-rc3 patch. The oom killer modification from
> Arcangeli was included in 2.6.11-rc3. Right? So this is correct, so
> the problem is not related to Arcangeli modification.
> 
> Does anyone have idea?

hi Mauricio,

thank you for your answers, but i'm unable to investigate right now, will
do so in the evening.

thanks,
Christian.
-- 
BOFH excuse #240:

Too many little pins on CPU confusing it, bend back and forth until 10-20%
are neatly removed. Do _not_ leave metal bits visible!

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: oom with 2.6.11
  2005-03-09 14:00     ` Christian Kujau
@ 2005-03-10 15:12       ` Christian Kujau
  2005-03-11  0:39         ` Andrew Morton
  2005-03-11  9:01         ` Mauricio Lin
  0 siblings, 2 replies; 21+ messages in thread
From: Christian Kujau @ 2005-03-10 15:12 UTC (permalink / raw)
  To: linux-kernel; +Cc: elenstev, Mauricio Lin

ok,

as "promised", it the OOM happened again with the same plain 2.6.11,
details here.

http://nerdbynature.de/bits/sheep/2.6.11/oom/oom_2.6.11_2.txt

the following is a quite long, but please read on
(if anyone is reading at all :))

this time it happened at 08:01, and i could image some heavy cron jobs
were going on. but as i said: "it did not happen before". there are also
output of SYSRQ-T/M/P. i did SYSRQ-E to recover the machine, but then
decided to reboot back to 2.6.11-rc5-bk2.

i had a look at the changelogs too and noticed that ChangeLog-2.6.11
contains 7 occurrences of "OOM" in the patch desctiption:

[PATCH] mm: overcommit updates, 2005-01-03
[PATCH] vmscan: count writeback pages in nr_scanned, 2005-01-08
[PATCH] possible rq starvation on oom, 2005-01-13
[PATCH] mm: adjust dirty threshold for lowmem-only mappings, 2005-01-25
[PATCH] mm: oom-killer tunable, 2005-02-02
[PATCH] mm: fix several oom killer bugs, 2005-02-02
[PATCH] Fix oops in alloc_zeroed_user_highpage() when [...],2005-02-09

release dates:
2.6.11-rc5-bk1  26-Feb-2005
2.6.11-rc5-bk2  27-Feb-2005  <
2.6.11-rc5-bk3  28-Feb-2005
2.6.11-rc5-bk4  01-Mar-2005
2.6.11          02-Mar-2005

so i really don't see any patches that *could* have something to do with
the issue here.

now comes the weird part:

i was going to compile 2.6.11-rc5-bk4, to sort out the "bad" kernel.
compiling went fine. ok, finished some email, ok, suddenly my swap was
used up again, and no memory left - uh oh! OOM again, with 2.6.11-rc5-bk2!

to summarize it:
i've run 2.6.11-rc2-bk10 during whole february, then switched to
2.6.11-rc5-bk2 on 28.02.2005, then to 2.6.11 on 05.03.2005 - and only
noticed with 2.6.11 first, now with 2.6.11-rc5-bk2 too.

there is an interesting part in the logfiles:

http://nerdbynature.de/bits/sheep/2.6.11/oom/oom_2.6.11.txt
http://nerdbynature.de/bits/sheep/2.6.11/oom/oom_2.6.11_2.txt
http://nerdbynature.de/bits/sheep/2.6.11/oom/oom_2.6.11-rc5-bk2.txt

every last message before the "OOM" messages is something with pppd:

Mar 10 13:45:55 sheep pppd[1567]: Starting link
Mar 10 14:12:29 sheep kernel: oom-killer: gfp_mask=0x1d2

Mar  8 00:59:58 sheep pppd[418]: Starting link
Mar  8 01:27:33 sheep kernel: oom-killer: gfp_mask=0xd0

Mar  9 07:33:49 sheep pppd[30937]: Starting link
Mar  9 08:01:35 sheep kernel: oom-killer: gfp_mask=0x1d2

and 30min later OOM kicks in. normally, pppd (pppoe) gives messages like this:

Mar 10 14:23:38 sheep pppd[26365]: Starting link
Mar 10 14:23:38 sheep pppd[26365]: Serial connection established.
Mar 10 14:23:38 sheep pppd[26365]: Connect: ppp0 <--> /dev/pts/0
Mar 10 14:23:38 sheep pppoe[26383]: PADS: Service-Name: ''
Mar 10 14:23:38 sheep pppoe[26383]: PPP session is 6804
Mar 10 14:23:39 sheep pppd[26365]: CHAP authentication succeeded
Mar 10 14:23:40 sheep pppd[26365]: Local IP address changed to
[...]

is this strange? or not?

i hope someone has a hint for me, because "going back to the stable
kernel" would mean "being bound to 2.6.11-rc2-bk10" :(

thank you for any hints,
Christian.

PS: Steven, i've cc'ed you because you have trouble with new 2.6.11
kernels and pppd too. maybe unrelated, maybe not.
-- 
BOFH excuse #185:

system consumed all the paper for paging

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: oom with 2.6.11
  2005-03-10 15:12       ` Christian Kujau
@ 2005-03-11  0:39         ` Andrew Morton
  2005-03-11  1:14           ` Christian Kujau
  2005-03-11  9:01         ` Mauricio Lin
  1 sibling, 1 reply; 21+ messages in thread
From: Andrew Morton @ 2005-03-11  0:39 UTC (permalink / raw)
  To: Christian Kujau; +Cc: linux-kernel, elenstev, mauriciolin

Christian Kujau <evil@g-house.de> wrote:
>
> i was going to compile 2.6.11-rc5-bk4, to sort out the "bad" kernel.
> compiling went fine. ok, finished some email, ok, suddenly my swap was
> used up again, and no memory left - uh oh! OOM again, with 2.6.11-rc5-bk2!

Well if you ran out of swap then yes, the oom-killer will visit you.

Why did you run out of swapspace?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: oom with 2.6.11
  2005-03-11  0:39         ` Andrew Morton
@ 2005-03-11  1:14           ` Christian Kujau
  2005-03-11  7:45             ` Pasi Kärkkäinen
  0 siblings, 1 reply; 21+ messages in thread
From: Christian Kujau @ 2005-03-11  1:14 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andrew Morton

Andrew Morton wrote:
> Christian Kujau <evil@g-house.de> wrote:
> 
>>i was going to compile 2.6.11-rc5-bk4, to sort out the "bad" kernel.
>>compiling went fine. ok, finished some email, ok, suddenly my swap was
>>used up again, and no memory left - uh oh! OOM again, with 2.6.11-rc5-bk2!
> 
> 
> Well if you ran out of swap then yes, the oom-killer will visit you.
> 
> Why did you run out of swapspace?

hm, if i only knew. i don't know how long it took the other night to go
from "normal" to "OOM". but today, with 2.6.11-rc5-bk2 (well, yesterday
actually) i was working normally, and all of a sudden swap goes from 170MB
used swap (normal) to OOM. i think it took a minute or so, but i just
can't tell which application went nuts. today the first process that got
killed was "ssh-agent", the other day it was mysqld. but even after this,
it should've released some memory, right? but the oom-killer goes on and
on and kills the next task.

i'll monitor memory usage tonight and see what it gives. these "pppd"
messages are suspicious though.

thank you,
Christian.
-- 
BOFH excuse #135:

You put the disk in upside down.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: oom with 2.6.11
  2005-03-11  1:14           ` Christian Kujau
@ 2005-03-11  7:45             ` Pasi Kärkkäinen
  0 siblings, 0 replies; 21+ messages in thread
From: Pasi Kärkkäinen @ 2005-03-11  7:45 UTC (permalink / raw)
  To: Christian Kujau; +Cc: linux-kernel, Andrew Morton

On Fri, Mar 11, 2005 at 02:14:14AM +0100, Christian Kujau wrote:
> Andrew Morton wrote:
> > Christian Kujau <evil@g-house.de> wrote:
> > 
> >>i was going to compile 2.6.11-rc5-bk4, to sort out the "bad" kernel.
> >>compiling went fine. ok, finished some email, ok, suddenly my swap was
> >>used up again, and no memory left - uh oh! OOM again, with 2.6.11-rc5-bk2!
> > 
> > 
> > Well if you ran out of swap then yes, the oom-killer will visit you.
> > 
> > Why did you run out of swapspace?
> 
> hm, if i only knew. i don't know how long it took the other night to go
> from "normal" to "OOM". but today, with 2.6.11-rc5-bk2 (well, yesterday
> actually) i was working normally, and all of a sudden swap goes from 170MB
> used swap (normal) to OOM. i think it took a minute or so, but i just
> can't tell which application went nuts. today the first process that got
> killed was "ssh-agent", the other day it was mysqld. but even after this,
> it should've released some memory, right? but the oom-killer goes on and
> on and kills the next task.
> 
> i'll monitor memory usage tonight and see what it gives. these "pppd"
> messages are suspicious though.
> 

I've also seen this 3 times now.. I'm running Xen 2.0 and the 2.6.10-xen0
kernel (with -as5 patch) goes OOM after a couple of days operation.. and
then the box reboots.

The "virtual machines" are OK, it's only the dom0 kernel that goes OOM.. 
And I'm not running anything special on dom0, only xen control stuff
(which is written in python..), ntp, nfs server, ssh, lvm2 and software raid.

Now I'm running a script which logs the cpu/memory/swap usage every 1
minutes.. trying to see if I can find the cause for the OOM.

-- Pasi Kärkkäinen
       
                                   ^
                                .     .
                                 Linux
                              /    -    \
                             Choice.of.the
                           .Next.Generation.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: oom with 2.6.11
  2005-03-10 15:12       ` Christian Kujau
  2005-03-11  0:39         ` Andrew Morton
@ 2005-03-11  9:01         ` Mauricio Lin
  2005-03-11 15:09           ` Christian Kujau
  1 sibling, 1 reply; 21+ messages in thread
From: Mauricio Lin @ 2005-03-11  9:01 UTC (permalink / raw)
  To: Christian Kujau; +Cc: linux-kernel, elenstev

Hi Christian,

I would like to know what are the kernel versions this problem happened.

Did this problem start from 2.6.11-rc2-bk10?

BR,

Mauricio Lin.

On Thu, 10 Mar 2005 16:12:27 +0100, Christian Kujau <evil@g-house.de> wrote:
> ok,
> 
> as "promised", it the OOM happened again with the same plain 2.6.11,
> details here.
> 
> http://nerdbynature.de/bits/sheep/2.6.11/oom/oom_2.6.11_2.txt
> 
> the following is a quite long, but please read on
> (if anyone is reading at all :))
> 
> this time it happened at 08:01, and i could image some heavy cron jobs
> were going on. but as i said: "it did not happen before". there are also
> output of SYSRQ-T/M/P. i did SYSRQ-E to recover the machine, but then
> decided to reboot back to 2.6.11-rc5-bk2.
> 
> i had a look at the changelogs too and noticed that ChangeLog-2.6.11
> contains 7 occurrences of "OOM" in the patch desctiption:
> 
> [PATCH] mm: overcommit updates, 2005-01-03
> [PATCH] vmscan: count writeback pages in nr_scanned, 2005-01-08
> [PATCH] possible rq starvation on oom, 2005-01-13
> [PATCH] mm: adjust dirty threshold for lowmem-only mappings, 2005-01-25
> [PATCH] mm: oom-killer tunable, 2005-02-02
> [PATCH] mm: fix several oom killer bugs, 2005-02-02
> [PATCH] Fix oops in alloc_zeroed_user_highpage() when [...],2005-02-09
> 
> release dates:
> 2.6.11-rc5-bk1  26-Feb-2005
> 2.6.11-rc5-bk2  27-Feb-2005  <
> 2.6.11-rc5-bk3  28-Feb-2005
> 2.6.11-rc5-bk4  01-Mar-2005
> 2.6.11          02-Mar-2005
> 
> so i really don't see any patches that *could* have something to do with
> the issue here.
> 
> now comes the weird part:
> 
> i was going to compile 2.6.11-rc5-bk4, to sort out the "bad" kernel.
> compiling went fine. ok, finished some email, ok, suddenly my swap was
> used up again, and no memory left - uh oh! OOM again, with 2.6.11-rc5-bk2!
> 
> to summarize it:
> i've run 2.6.11-rc2-bk10 during whole february, then switched to
> 2.6.11-rc5-bk2 on 28.02.2005, then to 2.6.11 on 05.03.2005 - and only
> noticed with 2.6.11 first, now with 2.6.11-rc5-bk2 too.
> 
> there is an interesting part in the logfiles:
> 
> http://nerdbynature.de/bits/sheep/2.6.11/oom/oom_2.6.11.txt
> http://nerdbynature.de/bits/sheep/2.6.11/oom/oom_2.6.11_2.txt
> http://nerdbynature.de/bits/sheep/2.6.11/oom/oom_2.6.11-rc5-bk2.txt
> 
> every last message before the "OOM" messages is something with pppd:
> 
> Mar 10 13:45:55 sheep pppd[1567]: Starting link
> Mar 10 14:12:29 sheep kernel: oom-killer: gfp_mask=0x1d2
> 
> Mar  8 00:59:58 sheep pppd[418]: Starting link
> Mar  8 01:27:33 sheep kernel: oom-killer: gfp_mask=0xd0
> 
> Mar  9 07:33:49 sheep pppd[30937]: Starting link
> Mar  9 08:01:35 sheep kernel: oom-killer: gfp_mask=0x1d2
> 
> and 30min later OOM kicks in. normally, pppd (pppoe) gives messages like this:
> 
> Mar 10 14:23:38 sheep pppd[26365]: Starting link
> Mar 10 14:23:38 sheep pppd[26365]: Serial connection established.
> Mar 10 14:23:38 sheep pppd[26365]: Connect: ppp0 <--> /dev/pts/0
> Mar 10 14:23:38 sheep pppoe[26383]: PADS: Service-Name: ''
> Mar 10 14:23:38 sheep pppoe[26383]: PPP session is 6804
> Mar 10 14:23:39 sheep pppd[26365]: CHAP authentication succeeded
> Mar 10 14:23:40 sheep pppd[26365]: Local IP address changed to
> [...]
> 
> is this strange? or not?
> 
> i hope someone has a hint for me, because "going back to the stable
> kernel" would mean "being bound to 2.6.11-rc2-bk10" :(
> 
> thank you for any hints,
> Christian.
> 
> PS: Steven, i've cc'ed you because you have trouble with new 2.6.11
> kernels and pppd too. maybe unrelated, maybe not.
> --
> BOFH excuse #185:
> 
> system consumed all the paper for paging
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: oom with 2.6.11
  2005-03-08 15:21 oom " Christian Kujau
  2005-03-09 13:18 ` Mauricio Lin
@ 2005-03-11 10:59 ` Coywolf Qi Hunt
  2005-03-11 15:10   ` Christian Kujau
  1 sibling, 1 reply; 21+ messages in thread
From: Coywolf Qi Hunt @ 2005-03-11 10:59 UTC (permalink / raw)
  To: Christian Kujau; +Cc: linux-kernel

In file mm/oom_kill.c, uncomment line 24: /* #define DEBUG */. 
And next time when oom happens again, we'll see the badness.

--coywolf

On Tue, 08 Mar 2005 16:21:21 +0100, Christian Kujau <evil@g-house.de> wrote:
> hallo list,
> 
> today my machine went out out memory and noticing it several hours after
> the first OOM message in the log, i wonder
>    1) why this happened at all and
>    2) why almost every service was killed despite the clever algorithms
>       documented in mm/oom_kill.c.
> 
> the first oom message went to the syslog at 01:27, i was away and no heavy
> tasks were scheduled:
> 
> http://nerdbynature.de/bits/sheep/2.6.11/oom/oom_2.6.11.txt
> 
> mysqld got killed by the oom killer, so i have to suspect mysql for being
> the reason for oom here, even that i know that mysqld is running all day
> long. several other tasks got killed, but "Free swap" stays at 0kB and the
> oom killer kills almost every other tasks, with no success in freeing ram.
> 
> the log stops at 03:21, perhaps syslog-ng got killed.
> at around 07:31 i noticed the mess, did SYSRQ-E and now i was able to
> login again. i pressed SYSRQ-M/T/P too, they are all in the log. at this
> time loadavg was at 249 ;)
> 
> i went to runlevel 2, then up again to 3 and all services are up and
> running again.
> 
> some 2.6.11-rc3 BK snapshot was running pretty stable (no OOM) for ~30
> days before i switched to 2.6.11 (vanilla) a few days ago. i have to (not)
> reproduce the problem the next night, i wonder if it will happen again.
> 
> do you vm-gurus have any idea to the points asked above?
> 
> more infos about the box here: http://nerdbynature.de/bits/sheep/2.6.11/oom/
> 
> thank you for your comments,
> Christian.
> --
> BOFH excuse #281:
> 
> The co-locator cannot verify the frame-relay gateway to the ISDN server.
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


-- 
Coywolf Qi Hunt
Homepage http://sosdg.org/~coywolf/

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: oom with 2.6.11
  2005-03-11  9:01         ` Mauricio Lin
@ 2005-03-11 15:09           ` Christian Kujau
  2005-03-15  8:52             ` Mauricio Lin
  0 siblings, 1 reply; 21+ messages in thread
From: Christian Kujau @ 2005-03-11 15:09 UTC (permalink / raw)
  To: linux-kernel; +Cc: Mauricio Lin, elenstev

Mauricio Lin wrote:
> Hi Christian,
> 
> I would like to know what are the kernel versions this problem happened.
> 
> Did this problem start from 2.6.11-rc2-bk10?

i noticed it first at 2.6.11, then again with 2.6.11-rc5-bk2. suspecting
pppd to be the culprit to chew up all RAM after being terminated by my ISP
once a day - i just have to wait (must be around 2a.m.).

thanks,
Christian.
-- 
BOFH excuse #134:

because of network lag due to too many people playing deathmatch

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: oom with 2.6.11
  2005-03-11 10:59 ` Coywolf Qi Hunt
@ 2005-03-11 15:10   ` Christian Kujau
  2005-03-12 18:06     ` Christian Kujau
  0 siblings, 1 reply; 21+ messages in thread
From: Christian Kujau @ 2005-03-11 15:10 UTC (permalink / raw)
  To: linux-kernel; +Cc: Coywolf Qi Hunt

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Coywolf Qi Hunt wrote:
> In file mm/oom_kill.c, uncomment line 24: /* #define DEBUG */. 
> And next time when oom happens again, we'll see the badness.

oh, good hint. will do this before the next reboot (in a few hours i guess)


thanks,
Christian.

- --
BOFH excuse #134:

because of network lag due to too many people playing deathmatch
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.0 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFCMbTj+A7rjkF8z0wRAmdAAJ4sCIaUqZaKn7gGtkpN7Wb47acFiACgrOSe
DpT3tE1/4zfPIDueDwBMhDU=
=o6Lr
-----END PGP SIGNATURE-----

-- 
BOFH excuse #37:

heavy gravity fluctuation, move computer to floor rapidly

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: oom with 2.6.11
  2005-03-11 15:10   ` Christian Kujau
@ 2005-03-12 18:06     ` Christian Kujau
  2005-03-17  1:27       ` Christian Kujau
  0 siblings, 1 reply; 21+ messages in thread
From: Christian Kujau @ 2005-03-12 18:06 UTC (permalink / raw)
  To: linux-kernel

hi again,

i had to wait for my pppoe session to be terminated by the remote peer
[1], and now it happened again with 2.6.11-rc5-bk2:

http://nerdbynature.de/bits/sheep/2.6.11/oom/oom_2.6.11-rc5-bk2_2.txt
http://nerdbynature.de/bits/sheep/2.6.11/oom/lsmod_2.6.11-rc5-bk2
http://nerdbynature.de/bits/sheep/2.6.11/oom/config-2.6.11-rc5-bk2.gz

i wanted to see the application chewing up the most RAM so i was running
"ps aux --sort=-vsz,-rss | head -n1" every 5sec and wrote it's output to a
file:

http://nerdbynature.de/bits/sheep/2.6.11/oom/daily_stats-2.6.11-rc5-bk2.log.gz

but even when almost all swap is used up, mysqld ist still no#1 with
131928 KB VSZ, as always. still suspecting something wrong with pppd, the
binary istself does *not* show up on the top-ten-memory-pigs.

real memory is almost always used up, but that's ok.
i stripped it down to show only the timestamp and the "Swap: .. used" numbers:

http://nerdbynature.de/bits/sheep/2.6.11/oom/daily_stats-2.6.11-rc5-bk2_stripped.log.gz

as you can see, until "Sat Mar 12 01:05:51" 122MB swap was used, from then
on swap-usage goes up until "Sat Mar 12 01:49:45", when all swap is in
use. the system is unable to keep up with writing memory usage every 5
seconds, the next entry is from 12 "01:56:24" - 6 minutes after the first
OOM message (i'm sure someone could feed this to gnuplot and make nice
bars out of it - i can't)

where else could i look to actually *see* where the memory goes to?

somehow lost in kernel versions,
Christian.

[1] "remote peer" aka "ISP". is there a way to "simulate" a "LCP
terminated by peer" locally? otherwise i really have to wait 24h to
trigger the bug.
-- 
BOFH excuse #373:

Suspicious pointer corrupted virtual machine

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: oom with 2.6.11
  2005-03-11 15:09           ` Christian Kujau
@ 2005-03-15  8:52             ` Mauricio Lin
  2005-03-15 14:12               ` Christian Kujau
  0 siblings, 1 reply; 21+ messages in thread
From: Mauricio Lin @ 2005-03-15  8:52 UTC (permalink / raw)
  To: Christian Kujau; +Cc: linux-kernel, elenstev

Hi Christian,

On Fri, 11 Mar 2005 16:09:24 +0100, Christian Kujau <evil@g-house.de> wrote:
> Mauricio Lin wrote:
> > Hi Christian,
> >
> > I would like to know what are the kernel versions this problem happened.
> >
> > Did this problem start from 2.6.11-rc2-bk10?
> 
> i noticed it first at 2.6.11, then again with 2.6.11-rc5-bk2. suspecting
> pppd to be the culprit to chew up all RAM after being terminated by my ISP
> once a day - i just have to wait (must be around 2a.m.).

Have you tried with 2.6.10 in order to check this problem?

BR,

Mauricio Lin.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: oom with 2.6.11
  2005-03-15  8:52             ` Mauricio Lin
@ 2005-03-15 14:12               ` Christian Kujau
  0 siblings, 0 replies; 21+ messages in thread
From: Christian Kujau @ 2005-03-15 14:12 UTC (permalink / raw)
  To: linux-kernel; +Cc: Mauricio Lin, elenstev

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Mauricio Lin wrote:
>>>
>>>Did this problem start from 2.6.11-rc2-bk10?
>>
>>i noticed it first at 2.6.11, then again with 2.6.11-rc5-bk2. suspecting
>>pppd to be the culprit to chew up all RAM after being terminated by my ISP
>>once a day - i just have to wait (must be around 2a.m.).
> 
> Have you tried with 2.6.10 in order to check this problem?

i don't think i've run plain 2.6.10 at all, syslog does not say so. i've
run several versions of 2.6.9, then 2.6.10-rc1, 2.6.10-rc1-bk19,
2.6.10-rc2-bk9, 2.6.10-rc3-bk11, 2.6.11-rc2-bk10, 2.6.11-rc5-bk2.

as stated earlier, the problem kicked in when i switched to 2.6.11, i
switched back to 2.6.11-rc5-bk2 and it happened again. by looking at the
syslog i am (still) suspecting pppd to be the source of OOM, but it did
not show up as a memory pig in "ps". i am running 2.6.11.3 now and the
machine survived almost 2 passes of pppd's daily hangups. i'll keep an eye
on it.

thank you for your concern,
Christian.
- --
BOFH excuse #357:

I'd love to help you -- it's just that the Boss won't let me near the
computer.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.0 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFCNu1e+A7rjkF8z0wRAmSPAJ49erb21bO6inHPn/zoUH2gNHJk8QCguYDA
HyzXS9whrJnb9eXVz9IP8zc=
=zhEx
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: oom with 2.6.11
  2005-03-12 18:06     ` Christian Kujau
@ 2005-03-17  1:27       ` Christian Kujau
  2005-03-17  1:51         ` Andrew Morton
  2005-03-17 21:25         ` Coywolf Qi Hunt
  0 siblings, 2 replies; 21+ messages in thread
From: Christian Kujau @ 2005-03-17  1:27 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andrew Morton

hello again,

unfortunately i've hit OOM again, this time with "#define DEBUG" enabled
in mm/oom_kill.c:

http://nerdbynature.de/bits/sheep/2.6.11/oom/oom_2.6.11.3.txt

by "Mar 16 18:32" pppd died again and OOM kicked in 30min later.
(there are a *lot* messages of a shell script named "check-route.sh". it's
a little script which runs every minute or so to check if my default route
is still ok and if ping to the outside world are possible. definitely not
a memory hog, but noisy)

since tracking the "most memory consuming applications" did not reveal any
hints [1], i have monitored /proc/slabinfo and /proc/meminfo this time:

http://nerdbynature.de/bits/sheep/2.6.11/oom/daily_stats-2.6.11.3.gz

as stated before, i was suspecting pppd to be the bad guy here, and yes: i
downgraded pppd to an earlier version and pppd (and the system) survived 2
terminations of my dial-up ISP. yesterday i've upgraded back again to
current pppd (debian/unstable) and the OOM problem returned. yes, i'll bug
the debian people now (hello!), but grepping for "ppp" in
daily_stats-2.6.11.3.gz gives no hits. so "pppd" does not get *any* points
from mm/oom_kill.c and thus no attempts are made to kill it (it is always
only kill'able with "-9"). furthermore, i thought /proc/slabinfo coud give
me some hints about *where* all the memory went in. scrolling down this
file to the bottom, where "SwapFree" shows "0 kB" i don't see any alarming
numbers in the "slabinfo" right above "meminfo".

could someone give me a hint, please?

thanks,
Christian.

[1] http://lkml.org/lkml/2005/3/12/88

more info for this recent OOM issue:
http://nerdbynature.de/bits/sheep/2.6.11/oom/dmesg.2.6.11.3
http://nerdbynature.de/bits/sheep/2.6.11/oom/lsmod_2.6.11.3
http://nerdbynature.de/bits/sheep/2.6.11/oom/config-2.6.11.3.gz
-- 
BOFH excuse #62:

need to wrap system in aluminum foil to fix problem

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: oom with 2.6.11
  2005-03-17  1:27       ` Christian Kujau
@ 2005-03-17  1:51         ` Andrew Morton
  2005-03-17  2:00           ` Christian Kujau
  2005-03-17 21:25         ` Coywolf Qi Hunt
  1 sibling, 1 reply; 21+ messages in thread
From: Andrew Morton @ 2005-03-17  1:51 UTC (permalink / raw)
  To: Christian Kujau; +Cc: linux-kernel

Christian Kujau <evil@g-house.de> wrote:
>
> unfortunately i've hit OOM again, this time with "#define DEBUG" enabled
>  in mm/oom_kill.c:
> 
>  http://nerdbynature.de/bits/sheep/2.6.11/oom/oom_2.6.11.3.txt
> 
>  by "Mar 16 18:32" pppd died again and OOM kicked in 30min later.
>  (there are a *lot* messages of a shell script named "check-route.sh". it's
>  a little script which runs every minute or so to check if my default route
>  is still ok and if ping to the outside world are possible. definitely not
>  a memory hog, but noisy)
> 
>  since tracking the "most memory consuming applications" did not reveal any
>  hints [1], i have monitored /proc/slabinfo and /proc/meminfo this time:
> 
>  http://nerdbynature.de/bits/sheep/2.6.11/oom/daily_stats-2.6.11.3.gz
> 
>  as stated before, i was suspecting pppd to be the bad guy here, and yes: i
>  downgraded pppd to an earlier version and pppd (and the system) survived 2
>  terminations of my dial-up ISP. yesterday i've upgraded back again to
>  current pppd (debian/unstable) and the OOM problem returned. yes, i'll bug
>  the debian people now (hello!), but grepping for "ppp" in
>  daily_stats-2.6.11.3.gz gives no hits. so "pppd" does not get *any* points
>  from mm/oom_kill.c and thus no attempts are made to kill it (it is always
>  only kill'able with "-9").

The oom-killer tries to be nicer to processes which are running as root.

> furthermore, i thought /proc/slabinfo coud give
>  me some hints about *where* all the memory went in. scrolling down this
>  file to the bottom, where "SwapFree" shows "0 kB" i don't see any alarming
>  numbers in the "slabinfo" right above "meminfo".

MemTotal:       256372 kB
MemFree:          3280 kB
Buffers:           608 kB
Cached:           3256 kB
SwapCached:        664 kB
Active:         105020 kB
Inactive:        20364 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:       256372 kB
LowFree:          3280 kB
SwapTotal:      784468 kB
SwapFree:            0 kB
Dirty:              12 kB
Writeback:           0 kB
Mapped:         130332 kB
Slab:            61424 kB
CommitLimit:    912652 kB
Committed_AS:  1323548 kB
PageTables:      51668 kB
VmallocTotal:   778184 kB
VmallocUsed:      3464 kB
VmallocChunk:   774492 kB

Some application went berzerk, used up all the swap and then oomed the box.

You could perhaps run `top -d1' then hit M so the output is sorted by
bloatiness, then try to catch the culprit.

But it would be better to have some app which prints the N most
memory-hungry processes every second and simply scrolls that up the screen. 
I'm not aware of such a thing, but it could be cooked up via
/proc/N/cmdline and /proc/N/statm.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: oom with 2.6.11
  2005-03-17  1:51         ` Andrew Morton
@ 2005-03-17  2:00           ` Christian Kujau
  0 siblings, 0 replies; 21+ messages in thread
From: Christian Kujau @ 2005-03-17  2:00 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andrew Morton

Andrew Morton wrote:
> 
> Some application went berzerk, used up all the swap and then oomed the box.
> 
> You could perhaps run `top -d1' then hit M so the output is sorted by
> bloatiness, then try to catch the culprit.

i've already done that. as OOM happens when i am not around, i did that
with "ps":

http://lkml.org/lkml/2005/3/12/88
http://nerdbynature.de/bits/sheep/2.6.11/oom/daily_stats-2.6.11-rc5-bk2.log.gz

> But it would be better to have some app which prints the N most
> memory-hungry processes every second and simply scrolls that up the screen. 
> I'm not aware of such a thing, but it could be cooked up via
> /proc/N/cmdline and /proc/N/statm.

i hope the link above does reveal this information.

i just wrote a bug report for the (debian), ppp package, but to know
*where* the memory goes to would really help, i think.

thank you,
Christian.
-- 
BOFH excuse #72:

Satan did it

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: oom with 2.6.11
  2005-03-17  1:27       ` Christian Kujau
  2005-03-17  1:51         ` Andrew Morton
@ 2005-03-17 21:25         ` Coywolf Qi Hunt
  2005-03-18  1:59           ` Christian Kujau
  1 sibling, 1 reply; 21+ messages in thread
From: Coywolf Qi Hunt @ 2005-03-17 21:25 UTC (permalink / raw)
  To: Christian Kujau; +Cc: linux-kernel, Andrew Morton

On Thu, 17 Mar 2005 02:27:29 +0100, Christian Kujau <evil@g-house.de> wrote:
> hello again,
> 
> unfortunately i've hit OOM again, this time with "#define DEBUG" enabled
> in mm/oom_kill.c:
> 
> http://nerdbynature.de/bits/sheep/2.6.11/oom/oom_2.6.11.3.txt
> 
> by "Mar 16 18:32" pppd died again and OOM kicked in 30min later.
> (there are a *lot* messages of a shell script named "check-route.sh". it's
> a little script which runs every minute or so to check if my default route
> is still ok and if ping to the outside world are possible. definitely not
> a memory hog, but noisy)


I do "grep check-route.sh oom_2.6.11.3.txt | wc" and it shows 4365
lines, which means
there're 4365 that script processes running, from pid 4260 to12747,
mostly with pretty low points, 123.  Based on this points, suppose
each script consumes 100k, that'll be 100k*4k=400M roughly. And your
box's is merely 256M MemTotal.

The currently oom algorithm fails to find out such kinds of memory hog.
And the kernel kills other innocent processes because the its points
is much lower than most others.

Check this script and disable it; see what will happen.

-- 
Coywolf Qi Hunt
http://sosdg.org/~coywolf/

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: oom with 2.6.11
  2005-03-17 21:25         ` Coywolf Qi Hunt
@ 2005-03-18  1:59           ` Christian Kujau
  0 siblings, 0 replies; 21+ messages in thread
From: Christian Kujau @ 2005-03-18  1:59 UTC (permalink / raw)
  To: linux-kernel; +Cc: Coywolf Qi Hunt

Coywolf Qi Hunt wrote:
> I do "grep check-route.sh oom_2.6.11.3.txt | wc" and it shows 4365

duh, good catch! really!

> lines, which means there're 4365 that script processes running, from 
> pid 4260 to12747, mostly with pretty low points, 123.
> Based on this points, suppose each script consumes 100k, that'll be
> 100k*4k=400M roughly. And your box's is merely 256M MemTotal.

yes, i just checked, the script is looping and crond is starting a new
one, and another....and the oom-killer does not catch it, because it's too
small and of course don't know where it is coming from (crond).

> Check this script and disable it; see what will happen.

yes, will do that. on a (not so unimportant) side-note: i was told the
whole thing should be fixed with 2.6.11.4:

  [PATCH] CAN-2005-0384: Remote Linux DoS on ppp servers


after all it seems to be PEBKAC and bad luck...what a week.

thank you for your help,
Christian.
-- 
BOFH excuse #416:

We're out of slots on the server

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2005-03-18  1:59 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2005-03-09 13:22 OOM with 2.6.11 Christian Kujau
  -- strict thread matches above, loose matches on Subject: below --
2005-03-08 15:21 oom " Christian Kujau
2005-03-09 13:18 ` Mauricio Lin
2005-03-09 13:41   ` Mauricio Lin
2005-03-09 14:00     ` Christian Kujau
2005-03-10 15:12       ` Christian Kujau
2005-03-11  0:39         ` Andrew Morton
2005-03-11  1:14           ` Christian Kujau
2005-03-11  7:45             ` Pasi Kärkkäinen
2005-03-11  9:01         ` Mauricio Lin
2005-03-11 15:09           ` Christian Kujau
2005-03-15  8:52             ` Mauricio Lin
2005-03-15 14:12               ` Christian Kujau
2005-03-11 10:59 ` Coywolf Qi Hunt
2005-03-11 15:10   ` Christian Kujau
2005-03-12 18:06     ` Christian Kujau
2005-03-17  1:27       ` Christian Kujau
2005-03-17  1:51         ` Andrew Morton
2005-03-17  2:00           ` Christian Kujau
2005-03-17 21:25         ` Coywolf Qi Hunt
2005-03-18  1:59           ` Christian Kujau

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox