[LTP] [BUG] oom hangs the system, NMI backtrace shows most CPUs in shrink_slab

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Jan Stancek <jstancek@redhat.com>
To: ltp@lists.linux.it
Subject: [LTP] [BUG] oom hangs the system, NMI backtrace shows most CPUs in shrink_slab
Date: Tue, 19 Jan 2016 16:13:11 +0100	[thread overview]
Message-ID: <569E5287.4080503@redhat.com> (raw)
In-Reply-To: <569E1010.2070806@I-love.SAKURA.ne.jp>

On 01/19/2016 11:29 AM, Tetsuo Handa wrote:
> Jan Stancek wrote:
>> I'm seeing system occasionally hanging after "oom01" testcase
>> from LTP triggers OOM.
>>
>> Here's a console log obtained from v4.4-8606 (shows oom, followed
>> by blocked task messages, followed by me triggering sysrq-t):
>>   http://jan.stancek.eu/tmp/oom_hangs/oom_hang_v4.4-8606.txt
>>   http://jan.stancek.eu/tmp/oom_hangs/config-v4.4-8606.txt
> 
> I think this console log reports an OOM livelock.
> 
> dump_tasks() shows that there are 2 oom1 tasks (10482 and 10528) but
> show_state() shows that there are 8 oom1 tasks whose parent is 10482
> (10528 10529 10530 10531 10532 10533 10534 10535). Thus, I guess oom1
> process called fork() and the child process created 7 threads.

Correct, parent forks a child and expects that child to be killed by OOM.
Child forks NCPU-1 threads, which all keep allocating/touching memory
while they can.

> 
> ----------
> [  495.322098] [10482]     0 10482     2692       10      11       3       27             0 oom01
> [  495.341231] [10528]     0 10528  8667787  3731327   11002      28  1890661             0 oom01
> [ 1736.414809] oom01           S ffff880453dd3e48 13008 10482  10479 0x00000080
> [ 1737.025258] oom01           x ffff88044ff9bc98 13080 10528  10482 0x00000084
> [ 1737.129553] oom01           R  running task    11616 10529  10482 0x00000084
> [ 1737.309520] oom01           R  running task    11544 10530  10482 0x00000084
> [ 1737.417212] oom01           R  running task    11616 10531  10482 0x00000084
> [ 1737.551904] oom01           R  running task    11616 10532  10482 0x00000084
> [ 1737.725174] oom01           R  running task    11616 10533  10482 0x00000084
> [ 1737.898452] oom01           R  running task    11616 10534  10482 0x00000084
> [ 1738.078431] oom01           R  running task    11616 10535  10482 0x00000084
> ----------
> 
> 10528 got both SIGKILL and TIF_MEMDIE at uptime = 495 and has exited by uptime = 1737.
> 
> ----------
> [  495.350845] Out of memory: Kill process 10528 (oom01) score 952 or sacrifice child
> [  495.359301] Killed process 10528 (oom01) total-vm:34671148kB, anon-rss:14925308kB, file-rss:0kB, shmem-rss:0kB
> ----------
> 
> Since 10529 to 10535 got only SIGKILL, all of them are looping inside __alloc_pages_slowpath().
> 
> ----------
> [ 1737.129553] oom01           R  running task    11616 10529  10482 0x00000084
> [ 1737.172049]  [<ffffffff81778a8c>] _cond_resched+0x1c/0x30
> [ 1737.178072]  [<ffffffff811e6daf>] shrink_slab.part.42+0x2cf/0x540
> [ 1737.184873]  [<ffffffff811176ae>] ? rcu_read_lock_held+0x5e/0x70
> [ 1737.191577]  [<ffffffff811ec4ba>] shrink_zone+0x30a/0x330
> [ 1737.197602]  [<ffffffff811ec884>] do_try_to_free_pages+0x174/0x440
> [ 1737.204499]  [<ffffffff811ecc50>] try_to_free_pages+0x100/0x2c0
> [ 1737.211107]  [<ffffffff81268ae0>] __alloc_pages_slowpath.constprop.85+0x3c6/0x74a
> [ 1737.309520] oom01           R  running task    11544 10530  10482 0x00000084
> [ 1737.352024]  [<ffffffff81778a8c>] _cond_resched+0x1c/0x30
> [ 1737.358049]  [<ffffffff81268b12>] __alloc_pages_slowpath.constprop.85+0x3f8/0x74a
> [ 1737.417212] oom01           R  running task    11616 10531  10482 0x00000084
> [ 1737.459715]  [<ffffffff81778a8c>] _cond_resched+0x1c/0x30
> [ 1737.465739]  [<ffffffff811e6daf>] shrink_slab.part.42+0x2cf/0x540
> [ 1737.472540]  [<ffffffff811176ae>] ? rcu_read_lock_held+0x5e/0x70
> [ 1737.479242]  [<ffffffff811ec4ba>] shrink_zone+0x30a/0x330
> [ 1737.485266]  [<ffffffff811ec884>] do_try_to_free_pages+0x174/0x440
> [ 1737.492162]  [<ffffffff811ecc50>] try_to_free_pages+0x100/0x2c0
> [ 1737.498768]  [<ffffffff81268ae0>] __alloc_pages_slowpath.constprop.85+0x3c6/0x74a
> [ 1737.551904] oom01           R  running task    11616 10532  10482 0x00000084
> [ 1737.594403]  [<ffffffff81778a8c>] _cond_resched+0x1c/0x30
> [ 1737.600428]  [<ffffffff811e6cf0>] shrink_slab.part.42+0x210/0x540
> [ 1737.607228]  [<ffffffff811ec4ba>] shrink_zone+0x30a/0x330
> [ 1737.613252]  [<ffffffff811ec884>] do_try_to_free_pages+0x174/0x440
> [ 1737.620150]  [<ffffffff811ecc50>] try_to_free_pages+0x100/0x2c0
> [ 1737.626755]  [<ffffffff81268ae0>] __alloc_pages_slowpath.constprop.85+0x3c6/0x74a
> [ 1737.725174] oom01           R  running task    11616 10533  10482 0x00000084
> [ 1737.767682]  [<ffffffff81778a8c>] _cond_resched+0x1c/0x30
> [ 1737.773705]  [<ffffffff811e6cf0>] shrink_slab.part.42+0x210/0x540
> [ 1737.780506]  [<ffffffff811ec4ba>] shrink_zone+0x30a/0x330
> [ 1737.786532]  [<ffffffff811ec884>] do_try_to_free_pages+0x174/0x440
> [ 1737.793428]  [<ffffffff811ecc50>] try_to_free_pages+0x100/0x2c0
> [ 1737.800037]  [<ffffffff81268ae0>] __alloc_pages_slowpath.constprop.85+0x3c6/0x74a
> [ 1737.898452] oom01           R  running task    11616 10534  10482 0x00000084
> [ 1737.940961]  [<ffffffff81778a8c>] _cond_resched+0x1c/0x30
> [ 1737.946985]  [<ffffffff811e6daf>] shrink_slab.part.42+0x2cf/0x540
> [ 1737.953785]  [<ffffffff811176ae>] ? rcu_read_lock_held+0x5e/0x70
> [ 1737.960488]  [<ffffffff811ec4ba>] shrink_zone+0x30a/0x330
> [ 1737.966512]  [<ffffffff811ec884>] do_try_to_free_pages+0x174/0x440
> [ 1737.973408]  [<ffffffff811ecc50>] try_to_free_pages+0x100/0x2c0
> [ 1737.980015]  [<ffffffff81268ae0>] __alloc_pages_slowpath.constprop.85+0x3c6/0x74a
> [ 1738.078431] oom01           R  running task    11616 10535  10482 0x00000084
> [ 1738.120938]  [<ffffffff81778a8c>] _cond_resched+0x1c/0x30
> [ 1738.126961]  [<ffffffff81268b12>] __alloc_pages_slowpath.constprop.85+0x3f8/0x74a
> ----------
> 
> As a result, 10482 remains sleeping at wait().
> 
> ----------
> [ 1736.414809] oom01           S ffff880453dd3e48 13008 10482  10479 0x00000080
> [ 1736.450320]  [<ffffffff817787ec>] schedule+0x3c/0x90
> [ 1736.455859]  [<ffffffff8109ba93>] do_wait+0x213/0x2f0
> [ 1736.461495]  [<ffffffff8109ce60>] SyS_wait4+0x80/0x100
> ----------
> 
> Also, 10519 is a typical trace which is looping inside __alloc_pages_slowpath()
> which can be observed when we hit OOM livelock. Since this allocation request
> includes neither __GFP_NOFAIL nor __GFP_FS, out_of_memory() cannot be called.
> 
> ----------
> [ 1736.716427] kworker/2:1     R  running task    12616 10519      2 0x00000088
> [ 1736.724319] Workqueue: events_freezable_power_ disk_events_workfn
> [ 1736.731128]  00000000ca948786 ffff88035997b8e8 0000000000000000 0000000000000000
> [ 1736.739427]  0000000000000000 0000000000000000 ffff88035997b850 ffffffff811ec4ba
> [ 1736.747725]  ffff88035997b9d0 0000000000000000 0000000000000000 0001000000000000
> [ 1736.756022] Call Trace:
> [ 1736.758749]  [<ffffffff811ec4ba>] ? shrink_zone+0x30a/0x330
> [ 1736.764969]  [<ffffffff811ec884>] ? do_try_to_free_pages+0x174/0x440
> [ 1736.772060]  [<ffffffff811ecc50>] ? try_to_free_pages+0x100/0x2c0
> [ 1736.778861]  [<ffffffff81268ae0>] ? __alloc_pages_slowpath.constprop.85+0x3c6/0x74a
> [ 1736.787408]  [<ffffffff811dc666>] ? __alloc_pages_nodemask+0x456/0x460
> [ 1736.794690]  [<ffffffff810f51e4>] ? __lock_is_held+0x54/0x70
> [ 1736.801004]  [<ffffffff81231c17>] ? alloc_pages_current+0x97/0x1b0
> [ 1736.807902]  [<ffffffff8137ebd9>] ? bio_copy_kern+0xc9/0x180
> [ 1736.814217]  [<ffffffff8138db15>] ? blk_rq_map_kern+0x75/0x130
> [ 1736.820726]  [<ffffffff815269e2>] ? scsi_execute+0x132/0x160
> [ 1736.827041]  [<ffffffff81528f7e>] ? scsi_execute_req_flags+0x8e/0xf0
> [ 1736.834133]  [<ffffffffa01956e7>] ? sr_check_events+0xb7/0x2a0 [sr_mod]
> [ 1736.841514]  [<ffffffffa00d7058>] ? cdrom_check_events+0x18/0x30 [cdrom]
> [ 1736.848992]  [<ffffffffa0195b2a>] ? sr_block_check_events+0x2a/0x30 [sr_mod]
> [ 1736.856859]  [<ffffffff81399bb0>] ? disk_check_events+0x60/0x170
> [ 1736.863561]  [<ffffffff81399cdc>] ? disk_events_workfn+0x1c/0x20
> [ 1736.870264]  [<ffffffff810b7055>] ? process_one_work+0x215/0x650
> [ 1736.876967]  [<ffffffff810b6fc1>] ? process_one_work+0x181/0x650
> [ 1736.883671]  [<ffffffff810b75b5>] ? worker_thread+0x125/0x4a0
> [ 1736.890082]  [<ffffffff810b7490>] ? process_one_work+0x650/0x650
> [ 1736.896785]  [<ffffffff810be191>] ? kthread+0x101/0x120
> [ 1736.902615]  [<ffffffff810f8c19>] ? trace_hardirqs_on_caller+0xf9/0x1c0
> [ 1736.909997]  [<ffffffff810be090>] ? kthread_create_on_node+0x250/0x250
> [ 1736.917281]  [<ffffffff8177ef9f>] ? ret_from_fork+0x3f/0x70
> [ 1736.923499]  [<ffffffff810be090>] ? kthread_create_on_node+0x250/0x250
> ----------
> 
> What is strange is, 10529 to 10535 should be able to get TIF_MEMDIE and exit
> __alloc_pages_slowpath(). There are three possibilities. First is that they are
> too unlucky to take oom_lock mutex which is needed for calling out_of_memory(),
> but it should not continue failing to take oom_lock mutex for such long time.
> Second is that their allocation requests include neither __GFP_NOFAIL nor
> __GFP_FS which is needed for calling out_of_memory(). Third is that, although I
> couldn't find evidence that mlock() and madvice() are related with this hangup,
> something is preventing __GFP_NOFAIL or __GFP_FS allocation requests from
> calling out_of_memory(). If you can reproduce with kmallocwd patch at
> https://lkml.kernel.org/r/201511250024.AAE78692.QVOtFFOSFOMLJH@I-love.SAKURA.ne.jp
> applied, I think you will be able to know which possibility you are hitting.
> 

It took slightly longer this time. Here's full console log:
  http://jan.stancek.eu/tmp/oom_hangs/console.log.2-v4.4-8606-with-memalloc_wc.txt.bz2

Last OOM message happens at:
[ 6845.765153] oom01 invoked oom-killer: gfp_mask=0x24280ca, order=0, oom_score_adj=0

The moments later, first MemAlloc output starts appearing:
[ 6904.555880] MemAlloc-Info: 2 stalling task, 0 dying task, 0 victim task.
[ 6904.563387] MemAlloc: oom01(22011) seq=5135 gfp=0x24280ca order=0 delay=10001
[ 6904.571353] MemAlloc: oom01(22013) seq=5101 gfp=0x24280ca order=0 delay=10001

[ 6915.195869] MemAlloc-Info: 16 stalling task, 0 dying task, 0 victim task.
[ 6915.203458] MemAlloc: systemd-journal(592) seq=33409 gfp=0x24201ca order=0 delay=20495
[ 6915.212300] MemAlloc: NetworkManager(807) seq=42042 gfp=0x24200ca order=0 delay=12030
[ 6915.221042] MemAlloc: gssproxy(815) seq=1551 gfp=0x24201ca order=0 delay=19414
[ 6915.229104] MemAlloc: irqbalance(825) seq=6763 gfp=0x24201ca order=0 delay=11234
[ 6915.237363] MemAlloc: tuned(1339) seq=74664 gfp=0x24201ca order=0 delay=20354
[ 6915.245329] MemAlloc: top(10485) seq=486624 gfp=0x24201ca order=0 delay=20124
[ 6915.253288] MemAlloc: kworker/1:1(20708) seq=48 gfp=0x2400000 order=0 delay=20248
[ 6915.261640] MemAlloc: sendmail(21855) seq=207 gfp=0x24201ca order=0 delay=19977
[ 6915.269800] MemAlloc: oom01(22007) seq=2 gfp=0x24201ca order=0 delay=20269
[ 6915.277466] MemAlloc: oom01(22008) seq=5659 gfp=0x24280ca order=0 delay=20502
[ 6915.285432] MemAlloc: oom01(22009) seq=5189 gfp=0x24280ca order=0 delay=20502
[ 6915.293389] MemAlloc: oom01(22010) seq=4795 gfp=0x24280ca order=0 delay=20502
[ 6915.301353] MemAlloc: oom01(22011) seq=5135 gfp=0x24280ca order=0 delay=20641
[ 6915.309316] MemAlloc: oom01(22012) seq=3828 gfp=0x24280ca order=0 delay=20502
[ 6915.317280] MemAlloc: oom01(22013) seq=5101 gfp=0x24280ca order=0 delay=20641
[ 6915.325244] MemAlloc: oom01(22014) seq=3633 gfp=0x24280ca order=0 delay=20502

seq numbers above appear to hold for another ~40 minutes, then I power cycled the system.

Regards,
Jan

WARNING: multiple messages have this Message-ID (diff)

From: Jan Stancek <jstancek@redhat.com>
To: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>, linux-mm@kvack.org
Cc: ltp@lists.linux.it
Subject: Re: [BUG] oom hangs the system, NMI backtrace shows most CPUs in shrink_slab
Date: Tue, 19 Jan 2016 16:13:11 +0100	[thread overview]
Message-ID: <569E5287.4080503@redhat.com> (raw)
In-Reply-To: <569E1010.2070806@I-love.SAKURA.ne.jp>

On 01/19/2016 11:29 AM, Tetsuo Handa wrote:
> Jan Stancek wrote:
>> I'm seeing system occasionally hanging after "oom01" testcase
>> from LTP triggers OOM.
>>
>> Here's a console log obtained from v4.4-8606 (shows oom, followed
>> by blocked task messages, followed by me triggering sysrq-t):
>>   http://jan.stancek.eu/tmp/oom_hangs/oom_hang_v4.4-8606.txt
>>   http://jan.stancek.eu/tmp/oom_hangs/config-v4.4-8606.txt
> 
> I think this console log reports an OOM livelock.
> 
> dump_tasks() shows that there are 2 oom1 tasks (10482 and 10528) but
> show_state() shows that there are 8 oom1 tasks whose parent is 10482
> (10528 10529 10530 10531 10532 10533 10534 10535). Thus, I guess oom1
> process called fork() and the child process created 7 threads.

Correct, parent forks a child and expects that child to be killed by OOM.
Child forks NCPU-1 threads, which all keep allocating/touching memory
while they can.

> 
> ----------
> [  495.322098] [10482]     0 10482     2692       10      11       3       27             0 oom01
> [  495.341231] [10528]     0 10528  8667787  3731327   11002      28  1890661             0 oom01
> [ 1736.414809] oom01           S ffff880453dd3e48 13008 10482  10479 0x00000080
> [ 1737.025258] oom01           x ffff88044ff9bc98 13080 10528  10482 0x00000084
> [ 1737.129553] oom01           R  running task    11616 10529  10482 0x00000084
> [ 1737.309520] oom01           R  running task    11544 10530  10482 0x00000084
> [ 1737.417212] oom01           R  running task    11616 10531  10482 0x00000084
> [ 1737.551904] oom01           R  running task    11616 10532  10482 0x00000084
> [ 1737.725174] oom01           R  running task    11616 10533  10482 0x00000084
> [ 1737.898452] oom01           R  running task    11616 10534  10482 0x00000084
> [ 1738.078431] oom01           R  running task    11616 10535  10482 0x00000084
> ----------
> 
> 10528 got both SIGKILL and TIF_MEMDIE at uptime = 495 and has exited by uptime = 1737.
> 
> ----------
> [  495.350845] Out of memory: Kill process 10528 (oom01) score 952 or sacrifice child
> [  495.359301] Killed process 10528 (oom01) total-vm:34671148kB, anon-rss:14925308kB, file-rss:0kB, shmem-rss:0kB
> ----------
> 
> Since 10529 to 10535 got only SIGKILL, all of them are looping inside __alloc_pages_slowpath().
> 
> ----------
> [ 1737.129553] oom01           R  running task    11616 10529  10482 0x00000084
> [ 1737.172049]  [<ffffffff81778a8c>] _cond_resched+0x1c/0x30
> [ 1737.178072]  [<ffffffff811e6daf>] shrink_slab.part.42+0x2cf/0x540
> [ 1737.184873]  [<ffffffff811176ae>] ? rcu_read_lock_held+0x5e/0x70
> [ 1737.191577]  [<ffffffff811ec4ba>] shrink_zone+0x30a/0x330
> [ 1737.197602]  [<ffffffff811ec884>] do_try_to_free_pages+0x174/0x440
> [ 1737.204499]  [<ffffffff811ecc50>] try_to_free_pages+0x100/0x2c0
> [ 1737.211107]  [<ffffffff81268ae0>] __alloc_pages_slowpath.constprop.85+0x3c6/0x74a
> [ 1737.309520] oom01           R  running task    11544 10530  10482 0x00000084
> [ 1737.352024]  [<ffffffff81778a8c>] _cond_resched+0x1c/0x30
> [ 1737.358049]  [<ffffffff81268b12>] __alloc_pages_slowpath.constprop.85+0x3f8/0x74a
> [ 1737.417212] oom01           R  running task    11616 10531  10482 0x00000084
> [ 1737.459715]  [<ffffffff81778a8c>] _cond_resched+0x1c/0x30
> [ 1737.465739]  [<ffffffff811e6daf>] shrink_slab.part.42+0x2cf/0x540
> [ 1737.472540]  [<ffffffff811176ae>] ? rcu_read_lock_held+0x5e/0x70
> [ 1737.479242]  [<ffffffff811ec4ba>] shrink_zone+0x30a/0x330
> [ 1737.485266]  [<ffffffff811ec884>] do_try_to_free_pages+0x174/0x440
> [ 1737.492162]  [<ffffffff811ecc50>] try_to_free_pages+0x100/0x2c0
> [ 1737.498768]  [<ffffffff81268ae0>] __alloc_pages_slowpath.constprop.85+0x3c6/0x74a
> [ 1737.551904] oom01           R  running task    11616 10532  10482 0x00000084
> [ 1737.594403]  [<ffffffff81778a8c>] _cond_resched+0x1c/0x30
> [ 1737.600428]  [<ffffffff811e6cf0>] shrink_slab.part.42+0x210/0x540
> [ 1737.607228]  [<ffffffff811ec4ba>] shrink_zone+0x30a/0x330
> [ 1737.613252]  [<ffffffff811ec884>] do_try_to_free_pages+0x174/0x440
> [ 1737.620150]  [<ffffffff811ecc50>] try_to_free_pages+0x100/0x2c0
> [ 1737.626755]  [<ffffffff81268ae0>] __alloc_pages_slowpath.constprop.85+0x3c6/0x74a
> [ 1737.725174] oom01           R  running task    11616 10533  10482 0x00000084
> [ 1737.767682]  [<ffffffff81778a8c>] _cond_resched+0x1c/0x30
> [ 1737.773705]  [<ffffffff811e6cf0>] shrink_slab.part.42+0x210/0x540
> [ 1737.780506]  [<ffffffff811ec4ba>] shrink_zone+0x30a/0x330
> [ 1737.786532]  [<ffffffff811ec884>] do_try_to_free_pages+0x174/0x440
> [ 1737.793428]  [<ffffffff811ecc50>] try_to_free_pages+0x100/0x2c0
> [ 1737.800037]  [<ffffffff81268ae0>] __alloc_pages_slowpath.constprop.85+0x3c6/0x74a
> [ 1737.898452] oom01           R  running task    11616 10534  10482 0x00000084
> [ 1737.940961]  [<ffffffff81778a8c>] _cond_resched+0x1c/0x30
> [ 1737.946985]  [<ffffffff811e6daf>] shrink_slab.part.42+0x2cf/0x540
> [ 1737.953785]  [<ffffffff811176ae>] ? rcu_read_lock_held+0x5e/0x70
> [ 1737.960488]  [<ffffffff811ec4ba>] shrink_zone+0x30a/0x330
> [ 1737.966512]  [<ffffffff811ec884>] do_try_to_free_pages+0x174/0x440
> [ 1737.973408]  [<ffffffff811ecc50>] try_to_free_pages+0x100/0x2c0
> [ 1737.980015]  [<ffffffff81268ae0>] __alloc_pages_slowpath.constprop.85+0x3c6/0x74a
> [ 1738.078431] oom01           R  running task    11616 10535  10482 0x00000084
> [ 1738.120938]  [<ffffffff81778a8c>] _cond_resched+0x1c/0x30
> [ 1738.126961]  [<ffffffff81268b12>] __alloc_pages_slowpath.constprop.85+0x3f8/0x74a
> ----------
> 
> As a result, 10482 remains sleeping at wait().
> 
> ----------
> [ 1736.414809] oom01           S ffff880453dd3e48 13008 10482  10479 0x00000080
> [ 1736.450320]  [<ffffffff817787ec>] schedule+0x3c/0x90
> [ 1736.455859]  [<ffffffff8109ba93>] do_wait+0x213/0x2f0
> [ 1736.461495]  [<ffffffff8109ce60>] SyS_wait4+0x80/0x100
> ----------
> 
> Also, 10519 is a typical trace which is looping inside __alloc_pages_slowpath()
> which can be observed when we hit OOM livelock. Since this allocation request
> includes neither __GFP_NOFAIL nor __GFP_FS, out_of_memory() cannot be called.
> 
> ----------
> [ 1736.716427] kworker/2:1     R  running task    12616 10519      2 0x00000088
> [ 1736.724319] Workqueue: events_freezable_power_ disk_events_workfn
> [ 1736.731128]  00000000ca948786 ffff88035997b8e8 0000000000000000 0000000000000000
> [ 1736.739427]  0000000000000000 0000000000000000 ffff88035997b850 ffffffff811ec4ba
> [ 1736.747725]  ffff88035997b9d0 0000000000000000 0000000000000000 0001000000000000
> [ 1736.756022] Call Trace:
> [ 1736.758749]  [<ffffffff811ec4ba>] ? shrink_zone+0x30a/0x330
> [ 1736.764969]  [<ffffffff811ec884>] ? do_try_to_free_pages+0x174/0x440
> [ 1736.772060]  [<ffffffff811ecc50>] ? try_to_free_pages+0x100/0x2c0
> [ 1736.778861]  [<ffffffff81268ae0>] ? __alloc_pages_slowpath.constprop.85+0x3c6/0x74a
> [ 1736.787408]  [<ffffffff811dc666>] ? __alloc_pages_nodemask+0x456/0x460
> [ 1736.794690]  [<ffffffff810f51e4>] ? __lock_is_held+0x54/0x70
> [ 1736.801004]  [<ffffffff81231c17>] ? alloc_pages_current+0x97/0x1b0
> [ 1736.807902]  [<ffffffff8137ebd9>] ? bio_copy_kern+0xc9/0x180
> [ 1736.814217]  [<ffffffff8138db15>] ? blk_rq_map_kern+0x75/0x130
> [ 1736.820726]  [<ffffffff815269e2>] ? scsi_execute+0x132/0x160
> [ 1736.827041]  [<ffffffff81528f7e>] ? scsi_execute_req_flags+0x8e/0xf0
> [ 1736.834133]  [<ffffffffa01956e7>] ? sr_check_events+0xb7/0x2a0 [sr_mod]
> [ 1736.841514]  [<ffffffffa00d7058>] ? cdrom_check_events+0x18/0x30 [cdrom]
> [ 1736.848992]  [<ffffffffa0195b2a>] ? sr_block_check_events+0x2a/0x30 [sr_mod]
> [ 1736.856859]  [<ffffffff81399bb0>] ? disk_check_events+0x60/0x170
> [ 1736.863561]  [<ffffffff81399cdc>] ? disk_events_workfn+0x1c/0x20
> [ 1736.870264]  [<ffffffff810b7055>] ? process_one_work+0x215/0x650
> [ 1736.876967]  [<ffffffff810b6fc1>] ? process_one_work+0x181/0x650
> [ 1736.883671]  [<ffffffff810b75b5>] ? worker_thread+0x125/0x4a0
> [ 1736.890082]  [<ffffffff810b7490>] ? process_one_work+0x650/0x650
> [ 1736.896785]  [<ffffffff810be191>] ? kthread+0x101/0x120
> [ 1736.902615]  [<ffffffff810f8c19>] ? trace_hardirqs_on_caller+0xf9/0x1c0
> [ 1736.909997]  [<ffffffff810be090>] ? kthread_create_on_node+0x250/0x250
> [ 1736.917281]  [<ffffffff8177ef9f>] ? ret_from_fork+0x3f/0x70
> [ 1736.923499]  [<ffffffff810be090>] ? kthread_create_on_node+0x250/0x250
> ----------
> 
> What is strange is, 10529 to 10535 should be able to get TIF_MEMDIE and exit
> __alloc_pages_slowpath(). There are three possibilities. First is that they are
> too unlucky to take oom_lock mutex which is needed for calling out_of_memory(),
> but it should not continue failing to take oom_lock mutex for such long time.
> Second is that their allocation requests include neither __GFP_NOFAIL nor
> __GFP_FS which is needed for calling out_of_memory(). Third is that, although I
> couldn't find evidence that mlock() and madvice() are related with this hangup,
> something is preventing __GFP_NOFAIL or __GFP_FS allocation requests from
> calling out_of_memory(). If you can reproduce with kmallocwd patch at
> https://lkml.kernel.org/r/201511250024.AAE78692.QVOtFFOSFOMLJH@I-love.SAKURA.ne.jp
> applied, I think you will be able to know which possibility you are hitting.
> 

It took slightly longer this time. Here's full console log:
  http://jan.stancek.eu/tmp/oom_hangs/console.log.2-v4.4-8606-with-memalloc_wc.txt.bz2

Last OOM message happens at:
[ 6845.765153] oom01 invoked oom-killer: gfp_mask=0x24280ca, order=0, oom_score_adj=0

The moments later, first MemAlloc output starts appearing:
[ 6904.555880] MemAlloc-Info: 2 stalling task, 0 dying task, 0 victim task.
[ 6904.563387] MemAlloc: oom01(22011) seq=5135 gfp=0x24280ca order=0 delay=10001
[ 6904.571353] MemAlloc: oom01(22013) seq=5101 gfp=0x24280ca order=0 delay=10001

[ 6915.195869] MemAlloc-Info: 16 stalling task, 0 dying task, 0 victim task.
[ 6915.203458] MemAlloc: systemd-journal(592) seq=33409 gfp=0x24201ca order=0 delay=20495
[ 6915.212300] MemAlloc: NetworkManager(807) seq=42042 gfp=0x24200ca order=0 delay=12030
[ 6915.221042] MemAlloc: gssproxy(815) seq=1551 gfp=0x24201ca order=0 delay=19414
[ 6915.229104] MemAlloc: irqbalance(825) seq=6763 gfp=0x24201ca order=0 delay=11234
[ 6915.237363] MemAlloc: tuned(1339) seq=74664 gfp=0x24201ca order=0 delay=20354
[ 6915.245329] MemAlloc: top(10485) seq=486624 gfp=0x24201ca order=0 delay=20124
[ 6915.253288] MemAlloc: kworker/1:1(20708) seq=48 gfp=0x2400000 order=0 delay=20248
[ 6915.261640] MemAlloc: sendmail(21855) seq=207 gfp=0x24201ca order=0 delay=19977
[ 6915.269800] MemAlloc: oom01(22007) seq=2 gfp=0x24201ca order=0 delay=20269
[ 6915.277466] MemAlloc: oom01(22008) seq=5659 gfp=0x24280ca order=0 delay=20502
[ 6915.285432] MemAlloc: oom01(22009) seq=5189 gfp=0x24280ca order=0 delay=20502
[ 6915.293389] MemAlloc: oom01(22010) seq=4795 gfp=0x24280ca order=0 delay=20502
[ 6915.301353] MemAlloc: oom01(22011) seq=5135 gfp=0x24280ca order=0 delay=20641
[ 6915.309316] MemAlloc: oom01(22012) seq=3828 gfp=0x24280ca order=0 delay=20502
[ 6915.317280] MemAlloc: oom01(22013) seq=5101 gfp=0x24280ca order=0 delay=20641
[ 6915.325244] MemAlloc: oom01(22014) seq=3633 gfp=0x24280ca order=0 delay=20502

seq numbers above appear to hold for another ~40 minutes, then I power cycled the system.

Regards,
Jan

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2016-01-19 15:13 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-01-18 15:38 [LTP] [BUG] oom hangs the system, NMI backtrace shows most CPUs in shrink_slab Jan Stancek
2016-01-18 15:38 ` Jan Stancek
2016-01-19 10:29 ` [LTP] " Tetsuo Handa
2016-01-19 10:29   ` Tetsuo Handa
2016-01-19 15:13   ` Jan Stancek [this message]
2016-01-19 15:13     ` Jan Stancek
2016-01-20 10:23     ` [LTP] [BUG] oom hangs the system, NMI backtrace shows most CPUs inshrink_slab Tetsuo Handa
2016-01-20 10:23       ` Tetsuo Handa
2016-01-20 13:17       ` [LTP] [BUG] oom hangs the system, NMI backtrace shows most CPUs in shrink_slab Tetsuo Handa
2016-01-20 13:17         ` Tetsuo Handa
2016-01-20 15:10         ` [LTP] " Tejun Heo
2016-01-20 15:10           ` Tejun Heo
2016-01-20 15:54           ` [LTP] " Tetsuo Handa
2016-01-20 15:54             ` Tetsuo Handa
2016-01-22 15:14   ` [LTP] " Jan Stancek
2016-01-22 15:14     ` Jan Stancek
2016-01-23  6:30     ` [LTP] " Tetsuo Handa
2016-01-23  6:30       ` Tetsuo Handa
2016-01-26  7:48     ` [LTP] " Jan Stancek
2016-01-26  7:48       ` Jan Stancek
2016-01-26 14:46       ` Tetsuo Handa
2016-01-26 14:46         ` Tetsuo Handa
2016-01-27 11:02         ` Tetsuo Handa
2016-01-28 15:48           ` Tetsuo Handa
2016-01-29  7:32             ` Jan Stancek
2016-01-29 12:35               ` Tetsuo Handa

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=569E5287.4080503@redhat.com \
    --to=jstancek@redhat.com \
    --cc=ltp@lists.linux.it \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.