linux-mm.kvack.org archive mirror
* [3.2-rc3] OOM killer doesn't kill the obvious memory hog
@ 2011-12-01  9:36 Dave Chinner
  2011-12-01  9:50 ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2011-12-01  9:36 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm

Testing a 17TB filesystem with xfstests on a VM with 4GB RAM, test
017 reliably triggers the OOM killer, which eventually panics the
machine after it has killed everything but the process consuming all
the memory. The console output I captured from the last kill where
the panic occurs:

[  302.040482] Pid: 16666, comm: xfs_db Not tainted 3.2.0-rc3-dgc+ #105
[  302.041959] Call Trace:
[  302.042547]  [<ffffffff810debfd>] ? cpuset_print_task_mems_allowed+0x9d/0xb0
[  302.044380]  [<ffffffff8111afae>] dump_header.isra.8+0x7e/0x1c0
[  302.045770]  [<ffffffff8111b22c>] ? oom_badness+0x13c/0x150
[  302.047074]  [<ffffffff8111bb23>] out_of_memory+0x513/0x550
[  302.048524]  [<ffffffff81120976>] __alloc_pages_nodemask+0x726/0x740
[  302.049993]  [<ffffffff81155183>] alloc_pages_current+0xa3/0x110
[  302.051384]  [<ffffffff8111814f>] __page_cache_alloc+0x8f/0xa0
[  302.052960]  [<ffffffff811185be>] ? find_get_page+0x1e/0x90
[  302.054267]  [<ffffffff8111a2dd>] filemap_fault+0x2bd/0x480
[  302.055570]  [<ffffffff8106ead8>] ? flush_tlb_page+0x48/0xb0
[  302.056748]  [<ffffffff81138a1f>] __do_fault+0x6f/0x4f0
[  302.057616]  [<ffffffff81139cfc>] ? do_wp_page+0x2ac/0x740
[  302.058609]  [<ffffffff8113b567>] handle_pte_fault+0xf7/0x8b0
[  302.059557]  [<ffffffff8107933a>] ? finish_task_switch+0x4a/0xf0
[  302.060718]  [<ffffffff8113c035>] handle_mm_fault+0x155/0x250
[  302.061679]  [<ffffffff81acc902>] do_page_fault+0x142/0x4f0
[  302.062599]  [<ffffffff8107958d>] ? set_next_entity+0xad/0xd0
[  302.063548]  [<ffffffff8103f6d2>] ? __switch_to+0x132/0x310
[  302.064575]  [<ffffffff8107933a>] ? finish_task_switch+0x4a/0xf0
[  302.065586]  [<ffffffff81acc405>] do_async_page_fault+0x35/0x80
[  302.066570]  [<ffffffff81ac97b5>] async_page_fault+0x25/0x30
[  302.067509] Mem-Info:
[  302.067992] Node 0 DMA per-cpu:
[  302.068652] CPU    0: hi:    0, btch:   1 usd:   0
[  302.069444] CPU    1: hi:    0, btch:   1 usd:   0
[  302.070239] CPU    2: hi:    0, btch:   1 usd:   0
[  302.071034] CPU    3: hi:    0, btch:   1 usd:   0
[  302.071830] CPU    4: hi:    0, btch:   1 usd:   0
[  302.072776] CPU    5: hi:    0, btch:   1 usd:   0
[  302.073577] CPU    6: hi:    0, btch:   1 usd:   0
[  302.074374] CPU    7: hi:    0, btch:   1 usd:   0
[  302.075172] Node 0 DMA32 per-cpu:
[  302.075745] CPU    0: hi:  186, btch:  31 usd:   0
[  302.076712] CPU    1: hi:  186, btch:  31 usd:   0
[  302.077517] CPU    2: hi:  186, btch:  31 usd:   0
[  302.078313] CPU    3: hi:  186, btch:  31 usd:   1
[  302.079104] CPU    4: hi:  186, btch:  31 usd:   0
[  302.080274] CPU    5: hi:  186, btch:  31 usd:   0
[  302.081482] CPU    6: hi:  186, btch:  31 usd:   0
[  302.082689] CPU    7: hi:  186, btch:  31 usd:  36
[  302.084210] Node 0 Normal per-cpu:
[  302.085104] CPU    0: hi:  186, btch:  31 usd:   1
[  302.086363] CPU    1: hi:  186, btch:  31 usd:  30
[  302.087575] CPU    2: hi:  186, btch:  31 usd:   0
[  302.089193] CPU    3: hi:  186, btch:  31 usd:  16
[  302.090448] CPU    4: hi:  186, btch:  31 usd:  14
[  302.091646] CPU    5: hi:  186, btch:  31 usd:   0
[  302.092992] CPU    6: hi:  186, btch:  31 usd:  30
[  302.093968] CPU    7: hi:  186, btch:  31 usd:  14
[  302.094945] active_anon:789505 inactive_anon:197012 isolated_anon:0
[  302.094946]  active_file:11 inactive_file:18 isolated_file:0
[  302.094947]  unevictable:0 dirty:0 writeback:29 unstable:0
[  302.094948]  free:6465 slab_reclaimable:2020 slab_unreclaimable:3473
[  302.094949]  mapped:5 shmem:1 pagetables:2539 bounce:0
[  302.101211] Node 0 DMA free:15888kB min:28kB low:32kB high:40kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0s
[  302.108917] lowmem_reserve[]: 0 3512 4017 4017
[  302.109885] Node 0 DMA32 free:9020kB min:7076kB low:8844kB high:10612kB active_anon:2962672kB inactive_anon:592684kB active_file:44kB inactive_file:0kB unevictable:s
[  302.117811] lowmem_reserve[]: 0 0 505 505
[  302.118938] Node 0 Normal free:952kB min:1016kB low:1268kB high:1524kB active_anon:195348kB inactive_anon:195364kB active_file:0kB inactive_file:72kB unevictable:0ks
[  302.126920] lowmem_reserve[]: 0 0 0 0
[  302.127744] Node 0 DMA: 0*4kB 0*8kB 1*16kB 0*32kB 2*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15888kB
[  302.130415] Node 0 DMA32: 68*4kB 48*8kB 35*16kB 16*32kB 9*64kB 3*128kB 2*256kB 2*512kB 1*1024kB 0*2048kB 1*4096kB = 9344kB
[  302.133101] Node 0 Normal: 117*4kB 1*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 988kB
[  302.135488] 185 total pagecache pages
[  302.136455] 149 pages in swap cache
[  302.137171] Swap cache stats: add 126014, delete 125865, find 94/133
[  302.138523] Free swap  = 0kB
[  302.139114] Total swap = 497976kB
[  302.149921] 1048560 pages RAM
[  302.150591] 36075 pages reserved
[  302.151254] 35 pages shared
[  302.151830] 1004770 pages non-shared
[  302.152922] [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
[  302.154450] [  939]     0   939     5295        1   4     -17         -1000 udevd
[  302.156160] [ 1002]     0  1002     5294        1   4     -17         -1000 udevd
[  302.157673] [ 1003]     0  1003     5294        0   4     -17         -1000 udevd
[  302.159200] [ 2399]     0  2399     1737        0   7     -17         -1000 dhclient
[  302.161078] [ 2442]     0  2442    12405        0   4     -17         -1000 sshd
[  302.162581] [ 2446]     0  2446    20357        1   0     -17         -1000 sshd
[  302.164408] [ 2450]  1000  2450    20357        0   1     -17         -1000 sshd
[  302.165901] [ 2455]  1000  2455     5592        0   7     -17         -1000 bash
[  302.167401] [ 2516]     0  2516    20357        1   6     -17         -1000 sshd
[  302.169199] [ 2520]  1000  2520    20357        0   4     -17         -1000 sshd
[  302.170702] [ 2527]  1000  2527     5606        1   6     -17         -1000 bash
[  302.172508] [ 5516]     0  5516     5089        0   2     -17         -1000 sudo
[  302.174008] [ 5517]     0  5517     2862        1   0     -17         -1000 check
[  302.175536] [16484]     0 16484     2457        7   0     -17         -1000 017
[  302.177336] [16665]     0 16665     1036        0   2     -17         -1000 xfs_check
[  302.179001] [16666]     0 16666 10031571   986414   6     -17         -1000 xfs_db
[  302.180890] Kernel panic - not syncing: Out of memory and no killable processes...
[  302.180892]
[  302.182585] Pid: 16666, comm: xfs_db Not tainted 3.2.0-rc3-dgc+ #105
[  302.183764] Call Trace:
[  302.184528]  [<ffffffff81abe166>] panic+0x91/0x19d
[  302.185790]  [<ffffffff8111bb38>] out_of_memory+0x528/0x550
[  302.187244]  [<ffffffff81120976>] __alloc_pages_nodemask+0x726/0x740
[  302.188780]  [<ffffffff81155183>] alloc_pages_current+0xa3/0x110
[  302.189951]  [<ffffffff8111814f>] __page_cache_alloc+0x8f/0xa0
[  302.191039]  [<ffffffff811185be>] ? find_get_page+0x1e/0x90
[  302.192168]  [<ffffffff8111a2dd>] filemap_fault+0x2bd/0x480
[  302.193215]  [<ffffffff8106ead8>] ? flush_tlb_page+0x48/0xb0
[  302.194343]  [<ffffffff81138a1f>] __do_fault+0x6f/0x4f0
[  302.195312]  [<ffffffff81139cfc>] ? do_wp_page+0x2ac/0x740
[  302.196490]  [<ffffffff8113b567>] handle_pte_fault+0xf7/0x8b0
[  302.197554]  [<ffffffff8107933a>] ? finish_task_switch+0x4a/0xf0
[  302.198670]  [<ffffffff8113c035>] handle_mm_fault+0x155/0x250
[  302.199755]  [<ffffffff81acc902>] do_page_fault+0x142/0x4f0
[  302.200921]  [<ffffffff8107958d>] ? set_next_entity+0xad/0xd0
[  302.201987]  [<ffffffff8103f6d2>] ? __switch_to+0x132/0x310
[  302.203023]  [<ffffffff8107933a>] ? finish_task_switch+0x4a/0xf0
[  302.204321]  [<ffffffff81acc405>] do_async_page_fault+0x35/0x80
[  302.205417]  [<ffffffff81ac97b5>] async_page_fault+0x25/0x30

It looks to me like the process causing the page fault and trying to
allocate more memory (xfs_db) is also the one consuming all the
memory and by all metrics is the obvious candidate to kill. So, why
does the OOM killer kill everything else but the memory hog and then
panic the machine?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com



* Re: [3.2-rc3] OOM killer doesn't kill the obvious memory hog
  2011-12-01  9:36 [3.2-rc3] OOM killer doesn't kill the obvious memory hog Dave Chinner
@ 2011-12-01  9:50 ` KAMEZAWA Hiroyuki
  2011-12-01 12:46   ` Dave Chinner
  0 siblings, 1 reply; 9+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-12-01  9:50 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-mm

On Thu, 1 Dec 2011 20:36:44 +1100
Dave Chinner <david@fromorbit.com> wrote:

> Testing a 17TB filesystem with xfstests on a VM with 4GB RAM, test
> 017 reliably triggers the OOM killer, which eventually panics the
> machine after it has killed everything but the process consuming all
> the memory. The console output I captured from the last kill where
> the panic occurs:
> 

your xfs_db is configured to have oom_score_adj==-1000.

/*
 * /proc/<pid>/oom_score_adj set to OOM_SCORE_ADJ_MIN disables oom killing for
 * pid.
 */
#define OOM_SCORE_ADJ_MIN       (-1000)

 
IIUC, this task cannot be killed by oom-killer because of oom_score_adj settings.

> [  302.152922] [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
> [  302.175536] [16484]     0 16484     2457        7   0     -17         -1000 017
> [  302.177336] [16665]     0 16665     1036        0   2     -17         -1000 xfs_check
> [  302.179001] [16666]     0 16666 10031571   986414   6     -17         -1000 xfs_db
>

The task has 986414 pages on a host which has 1048560 pages of total RAM.

This seems like suicide. If you want xfs_db to be killable, set its oom_score_adj to 0 or higher.
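
For example, something like this will make it an ordinary OOM candidate
again (a minimal sketch, assuming a root shell on the test box and a
single running xfs_db instance):

# check the inherited value, then put it back to the default
cat /proc/$(pidof xfs_db)/oom_score_adj
echo 0 > /proc/$(pidof xfs_db)/oom_score_adj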

Thanks,
-Kame

> [  302.040482] Pid: 16666, comm: xfs_db Not tainted 3.2.0-rc3-dgc+ #105
> [  302.041959] Call Trace:
> [  302.042547]  [<ffffffff810debfd>] ? cpuset_print_task_mems_allowed+0x9d/0xb0
> [  302.044380]  [<ffffffff8111afae>] dump_header.isra.8+0x7e/0x1c0
> [  302.045770]  [<ffffffff8111b22c>] ? oom_badness+0x13c/0x150
> [  302.047074]  [<ffffffff8111bb23>] out_of_memory+0x513/0x550
> [  302.048524]  [<ffffffff81120976>] __alloc_pages_nodemask+0x726/0x740
> [  302.049993]  [<ffffffff81155183>] alloc_pages_current+0xa3/0x110
> [  302.051384]  [<ffffffff8111814f>] __page_cache_alloc+0x8f/0xa0
> [  302.052960]  [<ffffffff811185be>] ? find_get_page+0x1e/0x90
> [  302.054267]  [<ffffffff8111a2dd>] filemap_fault+0x2bd/0x480
> [  302.055570]  [<ffffffff8106ead8>] ? flush_tlb_page+0x48/0xb0
> [  302.056748]  [<ffffffff81138a1f>] __do_fault+0x6f/0x4f0
> [  302.057616]  [<ffffffff81139cfc>] ? do_wp_page+0x2ac/0x740
> [  302.058609]  [<ffffffff8113b567>] handle_pte_fault+0xf7/0x8b0
> [  302.059557]  [<ffffffff8107933a>] ? finish_task_switch+0x4a/0xf0
> [  302.060718]  [<ffffffff8113c035>] handle_mm_fault+0x155/0x250
> [  302.061679]  [<ffffffff81acc902>] do_page_fault+0x142/0x4f0
> [  302.062599]  [<ffffffff8107958d>] ? set_next_entity+0xad/0xd0
> [  302.063548]  [<ffffffff8103f6d2>] ? __switch_to+0x132/0x310
> [  302.064575]  [<ffffffff8107933a>] ? finish_task_switch+0x4a/0xf0
> [  302.065586]  [<ffffffff81acc405>] do_async_page_fault+0x35/0x80
> [  302.066570]  [<ffffffff81ac97b5>] async_page_fault+0x25/0x30
> [  302.067509] Mem-Info:
> [  302.067992] Node 0 DMA per-cpu:
> [  302.068652] CPU    0: hi:    0, btch:   1 usd:   0
> [  302.069444] CPU    1: hi:    0, btch:   1 usd:   0
> [  302.070239] CPU    2: hi:    0, btch:   1 usd:   0
> [  302.071034] CPU    3: hi:    0, btch:   1 usd:   0
> [  302.071830] CPU    4: hi:    0, btch:   1 usd:   0
> [  302.072776] CPU    5: hi:    0, btch:   1 usd:   0
> [  302.073577] CPU    6: hi:    0, btch:   1 usd:   0
> [  302.074374] CPU    7: hi:    0, btch:   1 usd:   0
> [  302.075172] Node 0 DMA32 per-cpu:
> [  302.075745] CPU    0: hi:  186, btch:  31 usd:   0
> [  302.076712] CPU    1: hi:  186, btch:  31 usd:   0
> [  302.077517] CPU    2: hi:  186, btch:  31 usd:   0
> [  302.078313] CPU    3: hi:  186, btch:  31 usd:   1
> [  302.079104] CPU    4: hi:  186, btch:  31 usd:   0
> [  302.080274] CPU    5: hi:  186, btch:  31 usd:   0
> [  302.081482] CPU    6: hi:  186, btch:  31 usd:   0
> [  302.082689] CPU    7: hi:  186, btch:  31 usd:  36
> [  302.084210] Node 0 Normal per-cpu:
> [  302.085104] CPU    0: hi:  186, btch:  31 usd:   1
> [  302.086363] CPU    1: hi:  186, btch:  31 usd:  30
> [  302.087575] CPU    2: hi:  186, btch:  31 usd:   0
> [  302.089193] CPU    3: hi:  186, btch:  31 usd:  16
> [  302.090448] CPU    4: hi:  186, btch:  31 usd:  14
> [  302.091646] CPU    5: hi:  186, btch:  31 usd:   0
> [  302.092992] CPU    6: hi:  186, btch:  31 usd:  30
> [  302.093968] CPU    7: hi:  186, btch:  31 usd:  14
> [  302.094945] active_anon:789505 inactive_anon:197012 isolated_anon:0
> [  302.094946]  active_file:11 inactive_file:18 isolated_file:0
> [  302.094947]  unevictable:0 dirty:0 writeback:29 unstable:0
> [  302.094948]  free:6465 slab_reclaimable:2020 slab_unreclaimable:3473
> [  302.094949]  mapped:5 shmem:1 pagetables:2539 bounce:0
> [  302.101211] Node 0 DMA free:15888kB min:28kB low:32kB high:40kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0s
> [  302.108917] lowmem_reserve[]: 0 3512 4017 4017
> [  302.109885] Node 0 DMA32 free:9020kB min:7076kB low:8844kB high:10612kB active_anon:2962672kB inactive_anon:592684kB active_file:44kB inactive_file:0kB unevictable:s
> [  302.117811] lowmem_reserve[]: 0 0 505 505
> [  302.118938] Node 0 Normal free:952kB min:1016kB low:1268kB high:1524kB active_anon:195348kB inactive_anon:195364kB active_file:0kB inactive_file:72kB unevictable:0ks
> [  302.126920] lowmem_reserve[]: 0 0 0 0
> [  302.127744] Node 0 DMA: 0*4kB 0*8kB 1*16kB 0*32kB 2*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15888kB
> [  302.130415] Node 0 DMA32: 68*4kB 48*8kB 35*16kB 16*32kB 9*64kB 3*128kB 2*256kB 2*512kB 1*1024kB 0*2048kB 1*4096kB = 9344kB
> [  302.133101] Node 0 Normal: 117*4kB 1*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 988kB
> [  302.135488] 185 total pagecache pages
> [  302.136455] 149 pages in swap cache
> [  302.137171] Swap cache stats: add 126014, delete 125865, find 94/133
> [  302.138523] Free swap  = 0kB
> [  302.139114] Total swap = 497976kB
> [  302.149921] 1048560 pages RAM
> [  302.150591] 36075 pages reserved
> [  302.151254] 35 pages shared
> [  302.151830] 1004770 pages non-shared
> [  302.152922] [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
> [  302.154450] [  939]     0   939     5295        1   4     -17         -1000 udevd
> [  302.156160] [ 1002]     0  1002     5294        1   4     -17         -1000 udevd
> [  302.157673] [ 1003]     0  1003     5294        0   4     -17         -1000 udevd
> [  302.159200] [ 2399]     0  2399     1737        0   7     -17         -1000 dhclient
> [  302.161078] [ 2442]     0  2442    12405        0   4     -17         -1000 sshd
> [  302.162581] [ 2446]     0  2446    20357        1   0     -17         -1000 sshd
> [  302.164408] [ 2450]  1000  2450    20357        0   1     -17         -1000 sshd
> [  302.165901] [ 2455]  1000  2455     5592        0   7     -17         -1000 bash
> [  302.167401] [ 2516]     0  2516    20357        1   6     -17         -1000 sshd
> [  302.169199] [ 2520]  1000  2520    20357        0   4     -17         -1000 sshd
> [  302.170702] [ 2527]  1000  2527     5606        1   6     -17         -1000 bash
> [  302.172508] [ 5516]     0  5516     5089        0   2     -17         -1000 sudo
> [  302.174008] [ 5517]     0  5517     2862        1   0     -17         -1000 check
> [  302.175536] [16484]     0 16484     2457        7   0     -17         -1000 017
> [  302.177336] [16665]     0 16665     1036        0   2     -17         -1000 xfs_check
> [  302.179001] [16666]     0 16666 10031571   986414   6     -17         -1000 xfs_db
> [  302.180890] Kernel panic - not syncing: Out of memory and no killable processes...
> [  302.180892]
> [  302.182585] Pid: 16666, comm: xfs_db Not tainted 3.2.0-rc3-dgc+ #105
> [  302.183764] Call Trace:
> [  302.184528]  [<ffffffff81abe166>] panic+0x91/0x19d
> [  302.185790]  [<ffffffff8111bb38>] out_of_memory+0x528/0x550
> [  302.187244]  [<ffffffff81120976>] __alloc_pages_nodemask+0x726/0x740
> [  302.188780]  [<ffffffff81155183>] alloc_pages_current+0xa3/0x110
> [  302.189951]  [<ffffffff8111814f>] __page_cache_alloc+0x8f/0xa0
> [  302.191039]  [<ffffffff811185be>] ? find_get_page+0x1e/0x90
> [  302.192168]  [<ffffffff8111a2dd>] filemap_fault+0x2bd/0x480
> [  302.193215]  [<ffffffff8106ead8>] ? flush_tlb_page+0x48/0xb0
> [  302.194343]  [<ffffffff81138a1f>] __do_fault+0x6f/0x4f0
> [  302.195312]  [<ffffffff81139cfc>] ? do_wp_page+0x2ac/0x740
> [  302.196490]  [<ffffffff8113b567>] handle_pte_fault+0xf7/0x8b0
> [  302.197554]  [<ffffffff8107933a>] ? finish_task_switch+0x4a/0xf0
> [  302.198670]  [<ffffffff8113c035>] handle_mm_fault+0x155/0x250
> [  302.199755]  [<ffffffff81acc902>] do_page_fault+0x142/0x4f0
> [  302.200921]  [<ffffffff8107958d>] ? set_next_entity+0xad/0xd0
> [  302.201987]  [<ffffffff8103f6d2>] ? __switch_to+0x132/0x310
> [  302.203023]  [<ffffffff8107933a>] ? finish_task_switch+0x4a/0xf0
> [  302.204321]  [<ffffffff81acc405>] do_async_page_fault+0x35/0x80
> [  302.205417]  [<ffffffff81ac97b5>] async_page_fault+0x25/0x30
> 
> It looks to me like the process causing the page fault and trying to
> allocate more memory (xfs_db) is also the one consuming all the
> memory and by all metrics is the obvious candidate to kill. So, why
> does the OOM killer kill everything else but the memory hog and then
> panic the machine?
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 



* Re: [3.2-rc3] OOM killer doesn't kill the obvious memory hog
  2011-12-01  9:50 ` KAMEZAWA Hiroyuki
@ 2011-12-01 12:46   ` Dave Chinner
  2011-12-01 22:35     ` David Rientjes
  0 siblings, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2011-12-01 12:46 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-kernel, linux-mm

On Thu, Dec 01, 2011 at 06:50:01PM +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 1 Dec 2011 20:36:44 +1100
> Dave Chinner <david@fromorbit.com> wrote:
> 
> > Testing a 17TB filesystem with xfstests on a VM with 4GB RAM, test
> > 017 reliably triggers the OOM killer, which eventually panics the
> > machine after it has killed everything but the process consuming all
> > the memory. The console output I captured from the last kill where
> > the panic occurs:
> > 
> 
> your xfs_db is configured to have oom_score_adj==-1000.

So who is setting that? I've got plenty of occurrences where xfs_db
does get killed when running from the test suite, e.g. from test 019:

[  283.294569] xfs_db invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0
[  283.299206] xfs_db cpuset=/ mems_allowed=0
[  283.300755] Pid: 11341, comm: xfs_db Not tainted 3.2.0-rc3-dgc+ #114
[  283.302514] Call Trace:
[  283.303616]  [<ffffffff810debfd>] ? cpuset_print_task_mems_allowed+0x9d/0xb0
[  283.305848]  [<ffffffff8111afae>] dump_header.isra.8+0x7e/0x1c0
[  283.308354]  [<ffffffff816b56d0>] ? ___ratelimit+0xa0/0x120
[  283.310645]  [<ffffffff8111b9c7>] out_of_memory+0x3b7/0x550
[  283.312634]  [<ffffffff81120976>] __alloc_pages_nodemask+0x726/0x740
[  283.315345]  [<ffffffff8115700a>] alloc_pages_vma+0x9a/0x150
[  283.317343]  [<ffffffff81139e1a>] do_wp_page+0x3ca/0x740
[  283.319512]  [<ffffffff8113b8bd>] handle_pte_fault+0x44d/0x8b0
[  283.321953]  [<ffffffff81155183>] ? alloc_pages_current+0xa3/0x110
[  283.323943]  [<ffffffff8113c035>] handle_mm_fault+0x155/0x250
[  283.326375]  [<ffffffff81ac7021>] ? _cond_resched+0x1/0x40
[  283.328605]  [<ffffffff81acc9c2>] do_page_fault+0x142/0x4f0
[  283.330898]  [<ffffffff8107958d>] ? set_next_entity+0xad/0xd0
[  283.332754]  [<ffffffff810796c7>] ? pick_next_task_fair+0xc7/0x110
[  283.334974]  [<ffffffff81acc4c5>] do_async_page_fault+0x35/0x80
[  283.337460]  [<ffffffff81ac9875>] async_page_fault+0x25/0x30
[  283.339397] Mem-Info:
[  283.340253] Node 0 DMA per-cpu:
[  283.341839] CPU    0: hi:    0, btch:   1 usd:   0
[  283.343502] CPU    1: hi:    0, btch:   1 usd:   0
[  283.345062] CPU    2: hi:    0, btch:   1 usd:   0
[  283.346635] CPU    3: hi:    0, btch:   1 usd:   0
[  283.348320] CPU    4: hi:    0, btch:   1 usd:   0
[  283.349632] CPU    5: hi:    0, btch:   1 usd:   0
[  283.350886] CPU    6: hi:    0, btch:   1 usd:   0
[  283.352290] CPU    7: hi:    0, btch:   1 usd:   0
[  283.353692] Node 0 DMA32 per-cpu:
[  283.354654] CPU    0: hi:  186, btch:  31 usd:   0
[  283.355968] CPU    1: hi:  186, btch:  31 usd:   6
[  283.356995] CPU    2: hi:  186, btch:  31 usd:   5
[  283.357840] CPU    3: hi:  186, btch:  31 usd:   0
[  283.358694] CPU    4: hi:  186, btch:  31 usd:   0
[  283.359549] CPU    5: hi:  186, btch:  31 usd:   0
[  283.360635] CPU    6: hi:  186, btch:  31 usd:   0
[  283.361478] CPU    7: hi:  186, btch:  31 usd:   0
[  283.362323] Node 0 Normal per-cpu:
[  283.362941] CPU    0: hi:  186, btch:  31 usd:  57
[  283.363790] CPU    1: hi:  186, btch:  31 usd: 171
[  283.364886] CPU    2: hi:  186, btch:  31 usd:   3
[  283.365781] CPU    3: hi:  186, btch:  31 usd:   0
[  283.366633] CPU    4: hi:  186, btch:  31 usd:   0
[  283.367484] CPU    5: hi:  186, btch:  31 usd:  33
[  283.368584] CPU    6: hi:  186, btch:  31 usd:   0
[  283.369483] CPU    7: hi:  186, btch:  31 usd:   0
[  283.370333] active_anon:790116 inactive_anon:198062 isolated_anon:0
[  283.370334]  active_file:19 inactive_file:161 isolated_file:0
[  283.370335]  unevictable:0 dirty:0 writeback:123 unstable:0
[  283.370336]  free:6585 slab_reclaimable:2052 slab_unreclaimable:4014
[  283.370337]  mapped:64 shmem:11 pagetables:2741 bounce:0
[  283.375779] Node 0 DMA free:15888kB min:28kB low:32kB high:40kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15664kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[  283.382547] lowmem_reserve[]: 0 3512 4017 4017
[  283.383536] Node 0 DMA32 free:9072kB min:7076kB low:8844kB high:10612kB active_anon:2960704kB inactive_anon:592396kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3596500kB mlocked:0kB dirty:0kB writeback:480kB mapped:4kB shmem:0kB slab_reclaimable:4kB slab_unreclaimable:620kB kernel_stack:0kB pagetables:7692kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[  283.390563] lowmem_reserve[]: 0 0 505 505
[  283.391390] Node 0 Normal free:1380kB min:1016kB low:1268kB high:1524kB active_anon:199760kB inactive_anon:199852kB active_file:76kB inactive_file:644kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:517120kB mlocked:0kB dirty:0kB writeback:12kB mapped:252kB shmem:44kB slab_reclaimable:8204kB slab_unreclaimable:15436kB kernel_stack:1184kB pagetables:3272kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  283.398685] lowmem_reserve[]: 0 0 0 0
[  283.399421] Node 0 DMA: 0*4kB 0*8kB 1*16kB 0*32kB 2*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15888kB
[  283.401673] Node 0 DMA32: 15*4kB 4*8kB 3*16kB 1*32kB 1*64kB 0*128kB 1*256kB 2*512kB 2*1024kB 1*2048kB 1*4096kB = 9708kB
[  283.403790] Node 0 Normal: 305*4kB 17*8kB 4*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1420kB
[  283.406200] 2722 total pagecache pages
[  283.406867] 2622 pages in swap cache
[  283.407498] Swap cache stats: add 124512, delete 121890, find 2/4
[  283.408766] Free swap  = 0kB
[  283.409324] Total swap = 497976kB
[  283.417994] 1048560 pages RAM
[  283.418661] 36075 pages reserved
[  283.419235] 595 pages shared
[  283.419747] 1003827 pages non-shared
[  283.420587] [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
[  283.421963] [  939]     0   939     5295       38   4     -17         -1000 udevd
[  283.423296] [  998]     0   998     5294       62   2     -17         -1000 udevd
[  283.424900] [  999]     0   999     5294       55   0     -17         -1000 udevd
[  283.426266] [ 1417]     0  1417     4731       63   1       0             0 rpcbind
[  283.427620] [ 1435]   102  1435     5825      110   0       0             0 rpc.statd
[  283.429436] [ 1452]     0  1452     7875       56   1       0             0 rpc.idmapd
[  283.430842] [ 1720]     0  1720    29565      156   6       0             0 rsyslogd
[  283.432526] [ 1797]     0  1797     1019       32   5       0             0 acpid
[  283.433904] [ 1861]     0  1861     4158       37   6       0             0 atd
[  283.435191] [ 1895]     0  1895     5090       58   2       0             0 cron
[  283.436754] [ 2045]   104  2045     5883       55   6       0             0 dbus-daemon
[  283.438224] [ 2069]     0  2069     5407      144   3       0             0 rpc.mountd
[  283.439616] [ 2147] 65534  2147     4218       40   4       0             0 pmproxy
[  283.441116] [ 2260]     0  2260     1737      118   1     -17         -1000 dhclient
[  283.442524] [ 2272]     0  2272    12405      137   3     -17         -1000 sshd
[  283.444069] [ 2284]     0  2284    17516      189   4       0             0 winbindd
[  283.445533] [ 2292]     0  2292    17516      189   4       0             0 winbindd
[  283.446891] [ 2345]     0  2345     4841       90   3       0             0 pmcd
[  283.448453] [ 2622]     0  2622     3324       47   1       0             0 pmie
[  283.449791] [ 2680]     0  2680     1530       24   1       0             0 getty
[  283.451093] [ 2681]     0  2681     1530       25   0       0             0 getty
[  283.452692] [ 2682]     0  2682     1530       25   0       0             0 getty
[  283.454001] [ 2683]     0  2683     1530       24   1       0             0 getty
[  283.455317] [ 2684]     0  2684     1530       25   5       0             0 getty
[  283.456849] [ 2685]     0  2685     1530       25   1       0             0 getty
[  283.458222] [ 2686]     0  2686     1530       24   4       0             0 getty
[  283.459532] [ 2687]     0  2687    20357      186   6       0             0 sshd
[  283.461086] [ 2691]  1000  2691    20357      184   4       0             0 sshd
[  283.462423] [ 2696]  1000  2696     5594      847   3       0             0 bash
[  283.463711] [ 2757]     0  2757    11422       86   5       0             0 sudo
[  283.465257] [ 2758]     0  2758     2862      213   0       0             0 check
[  283.466562] [11154]     0 11154     2466      200   3       0             0 019
[  283.467833] [11337]     0 11337     1036       24   6       0             0 xfs_check
[  283.469432] [11339]     0 11339     2466      200   5       0             0 019
[  283.470712] [11340]     0 11340     4198       89   0       0             0 perl
[  283.472284] [11341]     0 11341 10031496   981936   2       0             0 xfs_db
[  283.473722] Out of memory: Kill process 11341 (xfs_db) score 944 or sacrifice child
[  283.475112] Killed process 11341 (xfs_db) total-vm:40125984kB, anon-rss:3927736kB, file-rss:8kB

> 
> /*
>  * /proc/<pid>/oom_score_adj set to OOM_SCORE_ADJ_MIN disables oom killing for
>  * pid.
>  */
> #define OOM_SCORE_ADJ_MIN       (-1000)
> 
>  
> IIUC, this task cannot be killed by oom-killer because of oom_score_adj settings.

It's not me or the test suite that is setting this, so it's something
the kernel must be doing automagically.

> 
> > [  302.152922] [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
> > [  302.175536] [16484]     0 16484     2457        7   0     -17         -1000 017
> > [  302.177336] [16665]     0 16665     1036        0   2     -17         -1000 xfs_check
> > [  302.179001] [16666]     0 16666 10031571   986414   6     -17         -1000 xfs_db
> >
> 
> The task has 986414 pages on a host which has 1048560 pages of total RAM.
> 
> This seems like suicide. If you want xfs_db to be killable, set its oom_score_adj to 0 or higher.

Like I said: I'm not changing a thing, nor is the test suite. Both
tests are executed in the same test suite environment from the same
parent process, yet test 017 results in an unkillable xfs_db process
and test 019 does not. Nothing is touching oom_score_adj at all, so
it should have a zero value. Why doesn't it?

Indeed, why do the sshd processes have non-default oom_score_adj
in the test 017 case and not for test 019? They shouldn't be any
different, either, but they are. Why?
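
A quick way to survey which processes are carrying the -1000 value is
something like this (a rough sketch, not something the test suite runs):

for p in /proc/[0-9]*; do
        adj=$(cat "$p/oom_score_adj" 2>/dev/null)
        [ "$adj" = "-1000" ] && echo "$p $(cat "$p/comm") oom_score_adj=$adj"
done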

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com



* Re: [3.2-rc3] OOM killer doesn't kill the obvious memory hog
  2011-12-01 12:46   ` Dave Chinner
@ 2011-12-01 22:35     ` David Rientjes
  2011-12-02  1:59       ` Dave Chinner
  0 siblings, 1 reply; 9+ messages in thread
From: David Rientjes @ 2011-12-01 22:35 UTC (permalink / raw)
  To: Dave Chinner; +Cc: KAMEZAWA Hiroyuki, linux-kernel, linux-mm

On Thu, 1 Dec 2011, Dave Chinner wrote:

> > /*
> >  * /proc/<pid>/oom_score_adj set to OOM_SCORE_ADJ_MIN disables oom killing for
> >  * pid.
> >  */
> > #define OOM_SCORE_ADJ_MIN       (-1000)
> > 
> >  
> > IIUC, this task cannot be killed by oom-killer because of oom_score_adj settings.
> 
> It's not me or the test suite that is setting this, so it's something
> the kernel must be doing automagically.
> 

The kernel does not set oom_score_adj to ever disable oom killing for a 
thread.  The only time the kernel touches oom_score_adj is when setting it 
to "1000" in ksm and swap to actually prefer a memory allocator for oom 
killing.

It's also possible to change this value via the deprecated 
/proc/pid/oom_adj interface until it is removed next year.  Check your 
dmesg for warnings about using the deprecated oom_adj interface or change 
the printk_once() in oom_adjust_write() to a normal printk() to catch it.
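
e.g. something as simple as this is enough to spot it (a sketch):

dmesg | grep -i oom_adj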



* Re: [3.2-rc3] OOM killer doesn't kill the obvious memory hog
  2011-12-01 22:35     ` David Rientjes
@ 2011-12-02  1:59       ` Dave Chinner
  2011-12-02  3:31         ` Dave Chinner
  0 siblings, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2011-12-02  1:59 UTC (permalink / raw)
  To: David Rientjes; +Cc: KAMEZAWA Hiroyuki, linux-kernel, linux-mm

On Thu, Dec 01, 2011 at 02:35:31PM -0800, David Rientjes wrote:
> On Thu, 1 Dec 2011, Dave Chinner wrote:
> 
> > > /*
> > >  * /proc/<pid>/oom_score_adj set to OOM_SCORE_ADJ_MIN disables oom killing for
> > >  * pid.
> > >  */
> > > #define OOM_SCORE_ADJ_MIN       (-1000)
> > > 
> > >  
> > > IIUC, this task cannot be killed by oom-killer because of oom_score_adj settings.
> > 
> > It's not me or the test suite that is setting this, so it's something
> > the kernel must be doing automagically.
> > 
> 
> The kernel does not set oom_score_adj to ever disable oom killing for a 
> thread.  The only time the kernel touches oom_score_adj is when setting it 
> to "1000" in ksm and swap to actually prefer a memory allocator for oom 
> killing.
> 
> It's also possible to change this value via the deprecated 
> /proc/pid/oom_adj interface until it is removed next year.  Check your 
> dmesg for warnings about using the deprecated oom_adj interface or change 
> the printk_once() in oom_adjust_write() to a normal printk() to catch it.

No warnings at all, as I've already said. If it is userspace,
whatever is doing it is using the oom_score_adj interface correctly.

Hmmm - google is finding reports of sshd randomly inheriting -17 at
startup depending on which modules are loaded on Debian systems.
Except I'm not using a modular kernel, and it's running in a VM, so
there's no firmware being loaded.

Yup, all my systems end up with a random value for sessions logged
in via ssh:

$ ssh -X test-2
Linux test-2 3.2.0-rc3-dgc+ #114 SMP Thu Dec 1 22:14:55 EST 2011 x86_64
No mail.
Last login: Fri Dec  2 11:34:44 2011 from deranged
$ cat /proc/self/oom_adj
-17
$ sudo reboot;exit
[sudo] password for dave:

Broadcast message from root@test-2 (pts/0) (Fri Dec  2 12:39:39 2011):

The system is going down for reboot NOW!
logout
Connection to test-2 closed.
$ ssh -X test-2
Linux test-2 3.2.0-rc3-dgc+ #114 SMP Thu Dec 1 22:14:55 EST 2011 x86_64
No mail.
Last login: Fri Dec  2 12:40:15 2011 from deranged
$ cat /proc/self/oom_adj 
0
$ 

That'll be the root cause of the problem - I just caused an OOM
panic with test 019....

<sigh>

The reports all cycle around this loop:

	linux-mm says userspace/distro problem
	distro says openssh problem
	openssh says kernel problem

And there doesn't appear to be any resolution in any of the reports,
just circular finger pointing and frustrated users.

I can't find anything in the distro startup or udev scripts that
modifies the oom parameters, and the openssh guys say they only
pass on the value inherited from ssh's parent process, so it's
clearly not obvious where the bug lies at this point. It's been
around for some time, though...
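
The obvious brute-force check for a userspace culprit is something like
this (a rough sketch - the paths are guesses at where a Debian-ish
distro would hide such a knob):

grep -rn "oom_adj\|oom_score_adj" /etc/init.d /etc/default /etc/network /etc/udev /lib/udev 2>/dev/null

and that kind of search is coming up empty here.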

More digging to do...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com



* Re: [3.2-rc3] OOM killer doesn't kill the obvious memory hog
  2011-12-02  1:59       ` Dave Chinner
@ 2011-12-02  3:31         ` Dave Chinner
  2011-12-02  5:44           ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2011-12-02  3:31 UTC (permalink / raw)
  To: David Rientjes; +Cc: KAMEZAWA Hiroyuki, linux-kernel, linux-mm

On Fri, Dec 02, 2011 at 12:59:21PM +1100, Dave Chinner wrote:
> On Thu, Dec 01, 2011 at 02:35:31PM -0800, David Rientjes wrote:
> > On Thu, 1 Dec 2011, Dave Chinner wrote:
> > 
> > > > /*
> > > >  * /proc/<pid>/oom_score_adj set to OOM_SCORE_ADJ_MIN disables oom killing for
> > > >  * pid.
> > > >  */
> > > > #define OOM_SCORE_ADJ_MIN       (-1000)
> > > > 
> > > >  
> > > > IIUC, this task cannot be killed by oom-killer because of oom_score_adj settings.
> > > 
> > > It's not me or the test suite that is setting this, so it's something
> > > the kernel must be doing automagically.
> > > 
> > 
> > The kernel does not set oom_score_adj to ever disable oom killing for a 
> > thread.  The only time the kernel touches oom_score_adj is when setting it 
> > to "1000" in ksm and swap to actually prefer a memory allocator for oom 
> > killing.
> > 
> > It's also possible to change this value via the deprecated 
> > /proc/pid/oom_adj interface until it is removed next year.  Check your 
> > dmesg for warnings about using the deprecated oom_adj interface or change 
> > the printk_once() in oom_adjust_write() to a normal printk() to catch it.
> 
> No warnings at all, as I've already said. If it is userspace,
> whatever is doing it is using the oom_score_adj interface correctly.

.....

> <sigh>
> 
> The reports all cycle around this loop:
> 
> 	linux-mm says userspace/distro problem
> 	distro says openssh problem
> 	openssh says kernel problem
> 
> And there doesn't appear to be any resolution in any of the reports,
> just circular finger pointing and frustrated users.
> 
> I can't find anything in the distro startup or udev scripts that
> modifies the oom parameters, and the openssh guys say they only
> pass on the value inherited from ssh's parent process, so it's
> clearly not obvious where the bug lies at this point. It's been
> around for some time, though...
> 
> More digging to do...

A working sshd startup and login:

Dec  2 13:16:32 test-2 sshd[2119]: Set /proc/self/oom_score_adj from 0 to -1000
Dec  2 13:16:32 test-2 sshd[2119]: debug1: Bind to port 22 on 0.0.0.0.
Dec  2 13:16:32 test-2 sshd[2119]: Server listening on 0.0.0.0 port 22.
Dec  2 13:16:32 test-2 sshd[2119]: socket: Address family not supported by protocol
Dec  2 13:16:36 test-2 sshd[2119]: debug1: Forked child 2576.
Dec  2 13:16:36 test-2 sshd[2576]: Set /proc/self/oom_score_adj to 0

The child process sets itself back to 0 correctly. Now, a non-working
startup and login:

Dec  2 13:19:56 test-2 sshd[2126]: Set /proc/self/oom_score_adj from 0 to -1000
Dec  2 13:19:56 test-2 sshd[2126]: debug1: Bind to port 22 on 0.0.0.0.
Dec  2 13:19:56 test-2 sshd[2126]: Server listening on 0.0.0.0 port 22.
Dec  2 13:19:56 test-2 sshd[2126]: socket: Address family not supported by protocol
Dec  2 13:19:57 test-2 sshd[2126]: Received signal 15; terminating.
Dec  2 13:19:57 test-2 sshd[2317]: Set /proc/self/oom_score_adj from -1000 to -1000
Dec  2 13:19:57 test-2 sshd[2317]: debug1: Bind to port 22 on 0.0.0.0.
Dec  2 13:19:57 test-2 sshd[2317]: Server listening on 0.0.0.0 port 22.
Dec  2 13:19:57 test-2 sshd[2317]: socket: Address family not supported by protocol
Dec  2 13:20:01 test-2 sshd[2317]: debug1: Forked child 2322.
Dec  2 13:20:01 test-2 sshd[2322]: Set /proc/self/oom_score_adj to -1000

Somewhere in the startup process, an sshd process is getting a SIGTERM
and dying. It is then restarted immediately from a context that has
a /proc/self/oom_score_adj value of -1000, which is where the
problem lies. So, how does this occur? Looks like a distro problem -
I added 'echo "sshd restart" >> /var/log/auth.log' to the
/etc/init.d/ssh restart command, and this pops out now:

Dec  2 14:00:08 test-2 sshd[2037]: debug3: oom_adjust_setup
Dec  2 14:00:08 test-2 sshd[2037]: Set /proc/self/oom_score_adj from 0 to -1000
Dec  2 14:00:08 test-2 sshd[2037]: debug2: fd 3 setting O_NONBLOCK
Dec  2 14:00:08 test-2 sshd[2037]: debug1: Bind to port 22 on 0.0.0.0.
Dec  2 14:00:08 test-2 sshd[2037]: Server listening on 0.0.0.0 port 22.
Dec  2 14:00:08 test-2 sshd[2037]: socket: Address family not supported by protocol
sshd restart
Dec  2 14:00:11 test-2 sshd[2037]: Received signal 15; terminating.
Dec  2 14:00:11 test-2 sshd[2330]: debug3: oom_adjust_setup
Dec  2 14:00:11 test-2 sshd[2330]: Set /proc/self/oom_score_adj from -1000 to -1000

So, something in a startup script is causing an sshd restart from a
context where oom_score_adj = -1000. There are only two
possibilities here - the dhcp client bringing the network interface
up, or a udev event after sshd has been started.

Bingo:

$ sudo ifup --verbose eth0
Configuring interface eth0=eth0 (inet)
run-parts --verbose /etc/network/if-pre-up.d
run-parts: executing /etc/network/if-pre-up.d/bridge
run-parts: executing /etc/network/if-pre-up.d/uml-utilities

dhclient -v -pf /var/run/dhclient.eth0.pid -lf
/var/lib/dhcp/dhclient.eth0.leases eth0
Internet Systems Consortium DHCP Client 4.1.1-P1
Copyright 2004-2010 Internet Systems Consortium.
All rights reserved.
For info, please visit https://www.isc.org/software/dhcp/

Listening on LPF/eth0/00:e4:b6:63:63:6e
Sending on   LPF/eth0/00:e4:b6:63:63:6e
Sending on   Socket/fallback
DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 6
DHCPOFFER from 192.168.1.254
DHCPREQUEST on eth0 to 255.255.255.255 port 67
DHCPACK from 192.168.1.254
bound to 192.168.1.61 -- renewal in 17198 seconds.
run-parts --verbose /etc/network/if-up.d
run-parts: executing /etc/network/if-up.d/mountnfs
run-parts: executing /etc/network/if-up.d/ntpdate
run-parts: executing /etc/network/if-up.d/openssh-server
run-parts: executing /etc/network/if-up.d/uml-utilities
$

and /etc/network/if-up.d/openssh-server does a restart on the
ssh server.

And this is set in /etc/network/interfaces:

allow-hotplug eth0

which means udev can execute the ifup command when the device
appears, asynchronously to the startup scripts that are running.

So, it's a distro bug - sshd should never be started from a udev
context because of this inherited oom_score_adj thing.
Interestingly, the ifup ssh restart script says this:

# We'd like to use 'reload' here, but it has some problems; see #502444.
if [ -x /usr/sbin/invoke-rc.d ]; then
        invoke-rc.d ssh restart >/dev/null 2>&1 || true
else
        /etc/init.d/ssh restart >/dev/null 2>&1 || true
fi

Bug 502444 describes the exact startup race condition that I've just
found. It does a ssh server restart because reload causes the sshd
server to fail to start if a start is currently in progress.  So,
rather than solving the start vs reload race condition, it got a
bandaid (use restart to restart sshd from the reload context) and
left it as a landmine.....
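
An obvious band-aid on top of that band-aid would be for the if-up.d
hook to put itself back to the default before invoking the restart,
something along these lines (a sketch only, untested):

# reset the oom_score_adj inherited from the udev context so that the
# restarted sshd records 0 as the value to restore in its children
echo 0 > /proc/self/oom_score_adj

but the real fix is to not restart sshd from a udev-inherited context
in the first place.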

<sigh>

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com



* Re: [3.2-rc3] OOM killer doesn't kill the obvious memory hog
  2011-12-02  3:31         ` Dave Chinner
@ 2011-12-02  5:44           ` KAMEZAWA Hiroyuki
  2011-12-04 22:04             ` Dave Chinner
  2011-12-06 20:31             ` David Rientjes
  0 siblings, 2 replies; 9+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-12-02  5:44 UTC (permalink / raw)
  To: Dave Chinner; +Cc: David Rientjes, linux-kernel, linux-mm

On Fri, 2 Dec 2011 14:31:48 +1100
Dave Chinner <david@fromorbit.com> wrote:

> So, it's a distro bug - sshd should never be started from a udev
> context because of this inherited oom_score_adj thing.
> Interestingly, the ifup ssh restart script says this:
> 
> # We'd like to use 'reload' here, but it has some problems; see #502444.
> if [ -x /usr/sbin/invoke-rc.d ]; then
>         invoke-rc.d ssh restart >/dev/null 2>&1 || true
> else
>         /etc/init.d/ssh restart >/dev/null 2>&1 || true
> fi
> 
> Bug 502444 describes the exact startup race condition that I've just
> found. It does a ssh server restart because reload causes the sshd
> server to fail to start if a start is currently in progress.  So,
> rather than solving the start vs reload race condition, it got a
> bandaid (use restart to restart sshd from the reload context) and
> left it as a landmine.....
> 

Thank you for chasing this down.
Hm, BTW, do you think this kind of tracepoint would be useful for debugging?
This patch is just an example.

==


* Re: [3.2-rc3] OOM killer doesn't kill the obvious memory hog
  2011-12-02  5:44           ` KAMEZAWA Hiroyuki
@ 2011-12-04 22:04             ` Dave Chinner
  2011-12-06 20:31             ` David Rientjes
  1 sibling, 0 replies; 9+ messages in thread
From: Dave Chinner @ 2011-12-04 22:04 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: David Rientjes, linux-kernel, linux-mm

On Fri, Dec 02, 2011 at 02:44:41PM +0900, KAMEZAWA Hiroyuki wrote:
> On Fri, 2 Dec 2011 14:31:48 +1100
> Dave Chinner <david@fromorbit.com> wrote:
> 
> > So, it's a distro bug - sshd should never be started from a udev
> > context because of this inherited oom_score_adj thing.
> > Interestingly, the ifup ssh restart script says this:
> > 
> > # We'd like to use 'reload' here, but it has some problems; see #502444.
> > if [ -x /usr/sbin/invoke-rc.d ]; then
> >         invoke-rc.d ssh restart >/dev/null 2>&1 || true
> > else
> >         /etc/init.d/ssh restart >/dev/null 2>&1 || true
> > fi
> > 
> > Bug 502444 describes the exact startup race condition that I've just
> > found. It does a ssh server restart because reload causes the sshd
> > server to fail to start if a start is currently in progress.  So,
> > rather than solving the start vs reload race condition, it got a
> > bandaid (use restart to restart sshd from the reload context) and
> > left it as a landmine.....
> > 
> 
> Thank you for chasing this down.
> Hm, BTW, do you think this kind of tracepoint would be useful for debugging?
> This patch is just an example.

Definitely a good idea, because not all applications have logging
like sshd does.  Besides, the first thing I went looking for was
tracepoints. ;)
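
With events like the ones in the sample output below, catching the
culprit would have been a one-liner, something like (a sketch - the
event group name is a guess, it depends on where the patch puts the
new events):

cd /sys/kernel/debug/tracing
echo 1 > events/oom/enable        # hypothetical group for the new events
cat trace_pipe | grep -- '-1000'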

> 
> ==
> From ed565cbf842e0b30827fba7bfdbc724fe21d9d2d Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Date: Fri, 2 Dec 2011 14:10:51 +0900
> Subject: [PATCH] oom_score_adj trace point.
> 
> oom_score_adj is set by some daemons and launcher tasks and inherited
> by applications, sometimes unexpectedly.
> 
> This patch adds trace points for debugging oom_score_adj
> inheritance.
> 
>     bash-2501  [002]   448.860197: oom_score_adj_update: task 2501[bash] updates oom_score_adj=-1000
>     bash-2501  [002]   455.678190: oom_score_adj_inherited: new task 2527 inherited oom_score_adj -1000
>     ls-2527  [007]   455.678683: oom_score_task_rename: task 2527[bash] to [ls] oom_score_adj=-1000
>     bash-2501  [007]   461.632103: oom_score_adj_inherited: new task 2528 inherited oom_score_adj -1000
>     bash-2501  [007]   461.632335: oom_score_adj_inherited: new task 2529 inherited oom_score_adj -1000
>     ls-2528  [003]   461.632983: oom_score_task_rename: task 2528[bash] to [ls] oom_score_adj=-1000
>     less-2529  [005]   461.633086: oom_score_task_rename: task 2529[bash] to [less] oom_score_adj=-1000
>     bash-2501  [004]   474.888710: oom_score_adj_update: task 2501[bash] updates oom_score_adj=0
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Acked-by: Dave Chinner <dchinner@redhat.com>

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com



* Re: [3.2-rc3] OOM killer doesn't kill the obvious memory hog
  2011-12-02  5:44           ` KAMEZAWA Hiroyuki
  2011-12-04 22:04             ` Dave Chinner
@ 2011-12-06 20:31             ` David Rientjes
  1 sibling, 0 replies; 9+ messages in thread
From: David Rientjes @ 2011-12-06 20:31 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Dave Chinner, linux-kernel, linux-mm

On Fri, 2 Dec 2011, KAMEZAWA Hiroyuki wrote:

> From ed565cbf842e0b30827fba7bfdbc724fe21d9d2d Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Date: Fri, 2 Dec 2011 14:10:51 +0900
> Subject: [PATCH] oom_score_adj trace point.
> 
> oom_score_adj is set by some daemons and launcher tasks and inherited
> by applications, sometimes unexpectedly.
> 
> This patch adds trace points for debugging oom_score_adj
> inheritance.
> 
>     bash-2501  [002]   448.860197: oom_score_adj_update: task 2501[bash] updates oom_score_adj=-1000
>     bash-2501  [002]   455.678190: oom_score_adj_inherited: new task 2527 inherited oom_score_adj -1000
>     ls-2527  [007]   455.678683: oom_score_task_rename: task 2527[bash] to [ls] oom_score_adj=-1000
>     bash-2501  [007]   461.632103: oom_score_adj_inherited: new task 2528 inherited oom_score_adj -1000
>     bash-2501  [007]   461.632335: oom_score_adj_inherited: new task 2529 inherited oom_score_adj -1000
>     ls-2528  [003]   461.632983: oom_score_task_rename: task 2528[bash] to [ls] oom_score_adj=-1000
>     less-2529  [005]   461.633086: oom_score_task_rename: task 2529[bash] to [less] oom_score_adj=-1000
>     bash-2501  [004]   474.888710: oom_score_adj_update: task 2501[bash] updates oom_score_adj=0
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Acked-by: David Rientjes <rientjes@google.com>

Thanks Kame!


