From: Jan Stancek <jstancek@redhat.com>
To: ltp@lists.linux.it
Subject: [LTP] [BUG] oom hangs the system, NMI backtrace shows most CPUs in shrink_slab
Date: Tue, 19 Jan 2016 16:13:11 +0100 [thread overview]
Message-ID: <569E5287.4080503@redhat.com> (raw)
In-Reply-To: <569E1010.2070806@I-love.SAKURA.ne.jp>
On 01/19/2016 11:29 AM, Tetsuo Handa wrote:
> Jan Stancek wrote:
>> I'm seeing system occasionally hanging after "oom01" testcase
>> from LTP triggers OOM.
>>
>> Here's a console log obtained from v4.4-8606 (shows oom, followed
>> by blocked task messages, followed by me triggering sysrq-t):
>> http://jan.stancek.eu/tmp/oom_hangs/oom_hang_v4.4-8606.txt
>> http://jan.stancek.eu/tmp/oom_hangs/config-v4.4-8606.txt
>
> I think this console log reports an OOM livelock.
>
> dump_tasks() shows that there are 2 oom1 tasks (10482 and 10528) but
> show_state() shows that there are 8 oom1 tasks whose parent is 10482
> (10528 10529 10530 10531 10532 10533 10534 10535). Thus, I guess oom1
> process called fork() and the child process created 7 threads.
Correct, parent forks a child and expects that child to be killed by OOM.
Child forks NCPU-1 threads, which all keep allocating/touching memory
while they can.
>
> ----------
> [ 495.322098] [10482] 0 10482 2692 10 11 3 27 0 oom01
> [ 495.341231] [10528] 0 10528 8667787 3731327 11002 28 1890661 0 oom01
> [ 1736.414809] oom01 S ffff880453dd3e48 13008 10482 10479 0x00000080
> [ 1737.025258] oom01 x ffff88044ff9bc98 13080 10528 10482 0x00000084
> [ 1737.129553] oom01 R running task 11616 10529 10482 0x00000084
> [ 1737.309520] oom01 R running task 11544 10530 10482 0x00000084
> [ 1737.417212] oom01 R running task 11616 10531 10482 0x00000084
> [ 1737.551904] oom01 R running task 11616 10532 10482 0x00000084
> [ 1737.725174] oom01 R running task 11616 10533 10482 0x00000084
> [ 1737.898452] oom01 R running task 11616 10534 10482 0x00000084
> [ 1738.078431] oom01 R running task 11616 10535 10482 0x00000084
> ----------
>
> 10528 got both SIGKILL and TIF_MEMDIE at uptime = 495 and has exited by uptime = 1737.
>
> ----------
> [ 495.350845] Out of memory: Kill process 10528 (oom01) score 952 or sacrifice child
> [ 495.359301] Killed process 10528 (oom01) total-vm:34671148kB, anon-rss:14925308kB, file-rss:0kB, shmem-rss:0kB
> ----------
>
> Since 10529 to 10535 got only SIGKILL, all of them are looping inside __alloc_pages_slowpath().
>
> ----------
> [ 1737.129553] oom01 R running task 11616 10529 10482 0x00000084
> [ 1737.172049] [<ffffffff81778a8c>] _cond_resched+0x1c/0x30
> [ 1737.178072] [<ffffffff811e6daf>] shrink_slab.part.42+0x2cf/0x540
> [ 1737.184873] [<ffffffff811176ae>] ? rcu_read_lock_held+0x5e/0x70
> [ 1737.191577] [<ffffffff811ec4ba>] shrink_zone+0x30a/0x330
> [ 1737.197602] [<ffffffff811ec884>] do_try_to_free_pages+0x174/0x440
> [ 1737.204499] [<ffffffff811ecc50>] try_to_free_pages+0x100/0x2c0
> [ 1737.211107] [<ffffffff81268ae0>] __alloc_pages_slowpath.constprop.85+0x3c6/0x74a
> [ 1737.309520] oom01 R running task 11544 10530 10482 0x00000084
> [ 1737.352024] [<ffffffff81778a8c>] _cond_resched+0x1c/0x30
> [ 1737.358049] [<ffffffff81268b12>] __alloc_pages_slowpath.constprop.85+0x3f8/0x74a
> [ 1737.417212] oom01 R running task 11616 10531 10482 0x00000084
> [ 1737.459715] [<ffffffff81778a8c>] _cond_resched+0x1c/0x30
> [ 1737.465739] [<ffffffff811e6daf>] shrink_slab.part.42+0x2cf/0x540
> [ 1737.472540] [<ffffffff811176ae>] ? rcu_read_lock_held+0x5e/0x70
> [ 1737.479242] [<ffffffff811ec4ba>] shrink_zone+0x30a/0x330
> [ 1737.485266] [<ffffffff811ec884>] do_try_to_free_pages+0x174/0x440
> [ 1737.492162] [<ffffffff811ecc50>] try_to_free_pages+0x100/0x2c0
> [ 1737.498768] [<ffffffff81268ae0>] __alloc_pages_slowpath.constprop.85+0x3c6/0x74a
> [ 1737.551904] oom01 R running task 11616 10532 10482 0x00000084
> [ 1737.594403] [<ffffffff81778a8c>] _cond_resched+0x1c/0x30
> [ 1737.600428] [<ffffffff811e6cf0>] shrink_slab.part.42+0x210/0x540
> [ 1737.607228] [<ffffffff811ec4ba>] shrink_zone+0x30a/0x330
> [ 1737.613252] [<ffffffff811ec884>] do_try_to_free_pages+0x174/0x440
> [ 1737.620150] [<ffffffff811ecc50>] try_to_free_pages+0x100/0x2c0
> [ 1737.626755] [<ffffffff81268ae0>] __alloc_pages_slowpath.constprop.85+0x3c6/0x74a
> [ 1737.725174] oom01 R running task 11616 10533 10482 0x00000084
> [ 1737.767682] [<ffffffff81778a8c>] _cond_resched+0x1c/0x30
> [ 1737.773705] [<ffffffff811e6cf0>] shrink_slab.part.42+0x210/0x540
> [ 1737.780506] [<ffffffff811ec4ba>] shrink_zone+0x30a/0x330
> [ 1737.786532] [<ffffffff811ec884>] do_try_to_free_pages+0x174/0x440
> [ 1737.793428] [<ffffffff811ecc50>] try_to_free_pages+0x100/0x2c0
> [ 1737.800037] [<ffffffff81268ae0>] __alloc_pages_slowpath.constprop.85+0x3c6/0x74a
> [ 1737.898452] oom01 R running task 11616 10534 10482 0x00000084
> [ 1737.940961] [<ffffffff81778a8c>] _cond_resched+0x1c/0x30
> [ 1737.946985] [<ffffffff811e6daf>] shrink_slab.part.42+0x2cf/0x540
> [ 1737.953785] [<ffffffff811176ae>] ? rcu_read_lock_held+0x5e/0x70
> [ 1737.960488] [<ffffffff811ec4ba>] shrink_zone+0x30a/0x330
> [ 1737.966512] [<ffffffff811ec884>] do_try_to_free_pages+0x174/0x440
> [ 1737.973408] [<ffffffff811ecc50>] try_to_free_pages+0x100/0x2c0
> [ 1737.980015] [<ffffffff81268ae0>] __alloc_pages_slowpath.constprop.85+0x3c6/0x74a
> [ 1738.078431] oom01 R running task 11616 10535 10482 0x00000084
> [ 1738.120938] [<ffffffff81778a8c>] _cond_resched+0x1c/0x30
> [ 1738.126961] [<ffffffff81268b12>] __alloc_pages_slowpath.constprop.85+0x3f8/0x74a
> ----------
>
> As a result, 10482 remains sleeping at wait().
>
> ----------
> [ 1736.414809] oom01 S ffff880453dd3e48 13008 10482 10479 0x00000080
> [ 1736.450320] [<ffffffff817787ec>] schedule+0x3c/0x90
> [ 1736.455859] [<ffffffff8109ba93>] do_wait+0x213/0x2f0
> [ 1736.461495] [<ffffffff8109ce60>] SyS_wait4+0x80/0x100
> ----------
>
> Also, 10519 is a typical trace which is looping inside __alloc_pages_slowpath()
> which can be observed when we hit OOM livelock. Since this allocation request
> includes neither __GFP_NOFAIL nor __GFP_FS, out_of_memory() cannot be called.
>
> ----------
> [ 1736.716427] kworker/2:1 R running task 12616 10519 2 0x00000088
> [ 1736.724319] Workqueue: events_freezable_power_ disk_events_workfn
> [ 1736.731128] 00000000ca948786 ffff88035997b8e8 0000000000000000 0000000000000000
> [ 1736.739427] 0000000000000000 0000000000000000 ffff88035997b850 ffffffff811ec4ba
> [ 1736.747725] ffff88035997b9d0 0000000000000000 0000000000000000 0001000000000000
> [ 1736.756022] Call Trace:
> [ 1736.758749] [<ffffffff811ec4ba>] ? shrink_zone+0x30a/0x330
> [ 1736.764969] [<ffffffff811ec884>] ? do_try_to_free_pages+0x174/0x440
> [ 1736.772060] [<ffffffff811ecc50>] ? try_to_free_pages+0x100/0x2c0
> [ 1736.778861] [<ffffffff81268ae0>] ? __alloc_pages_slowpath.constprop.85+0x3c6/0x74a
> [ 1736.787408] [<ffffffff811dc666>] ? __alloc_pages_nodemask+0x456/0x460
> [ 1736.794690] [<ffffffff810f51e4>] ? __lock_is_held+0x54/0x70
> [ 1736.801004] [<ffffffff81231c17>] ? alloc_pages_current+0x97/0x1b0
> [ 1736.807902] [<ffffffff8137ebd9>] ? bio_copy_kern+0xc9/0x180
> [ 1736.814217] [<ffffffff8138db15>] ? blk_rq_map_kern+0x75/0x130
> [ 1736.820726] [<ffffffff815269e2>] ? scsi_execute+0x132/0x160
> [ 1736.827041] [<ffffffff81528f7e>] ? scsi_execute_req_flags+0x8e/0xf0
> [ 1736.834133] [<ffffffffa01956e7>] ? sr_check_events+0xb7/0x2a0 [sr_mod]
> [ 1736.841514] [<ffffffffa00d7058>] ? cdrom_check_events+0x18/0x30 [cdrom]
> [ 1736.848992] [<ffffffffa0195b2a>] ? sr_block_check_events+0x2a/0x30 [sr_mod]
> [ 1736.856859] [<ffffffff81399bb0>] ? disk_check_events+0x60/0x170
> [ 1736.863561] [<ffffffff81399cdc>] ? disk_events_workfn+0x1c/0x20
> [ 1736.870264] [<ffffffff810b7055>] ? process_one_work+0x215/0x650
> [ 1736.876967] [<ffffffff810b6fc1>] ? process_one_work+0x181/0x650
> [ 1736.883671] [<ffffffff810b75b5>] ? worker_thread+0x125/0x4a0
> [ 1736.890082] [<ffffffff810b7490>] ? process_one_work+0x650/0x650
> [ 1736.896785] [<ffffffff810be191>] ? kthread+0x101/0x120
> [ 1736.902615] [<ffffffff810f8c19>] ? trace_hardirqs_on_caller+0xf9/0x1c0
> [ 1736.909997] [<ffffffff810be090>] ? kthread_create_on_node+0x250/0x250
> [ 1736.917281] [<ffffffff8177ef9f>] ? ret_from_fork+0x3f/0x70
> [ 1736.923499] [<ffffffff810be090>] ? kthread_create_on_node+0x250/0x250
> ----------
>
> What is strange is, 10529 to 10535 should be able to get TIF_MEMDIE and exit
> __alloc_pages_slowpath(). There are three possibilities. First is that they are
> too unlucky to take oom_lock mutex which is needed for calling out_of_memory(),
> but it should not continue failing to take oom_lock mutex for such long time.
> Second is that their allocation requests include neither __GFP_NOFAIL nor
> __GFP_FS which is needed for calling out_of_memory(). Third is that, although I
> couldn't find evidence that mlock() and madvice() are related with this hangup,
> something is preventing __GFP_NOFAIL or __GFP_FS allocation requests from
> calling out_of_memory(). If you can reproduce with kmallocwd patch at
> https://lkml.kernel.org/r/201511250024.AAE78692.QVOtFFOSFOMLJH@I-love.SAKURA.ne.jp
> applied, I think you will be able to know which possibility you are hitting.
>
It took slightly longer this time. Here's full console log:
http://jan.stancek.eu/tmp/oom_hangs/console.log.2-v4.4-8606-with-memalloc_wc.txt.bz2
Last OOM message happens at:
[ 6845.765153] oom01 invoked oom-killer: gfp_mask=0x24280ca, order=0, oom_score_adj=0
The moments later, first MemAlloc output starts appearing:
[ 6904.555880] MemAlloc-Info: 2 stalling task, 0 dying task, 0 victim task.
[ 6904.563387] MemAlloc: oom01(22011) seq=5135 gfp=0x24280ca order=0 delay=10001
[ 6904.571353] MemAlloc: oom01(22013) seq=5101 gfp=0x24280ca order=0 delay=10001
[ 6915.195869] MemAlloc-Info: 16 stalling task, 0 dying task, 0 victim task.
[ 6915.203458] MemAlloc: systemd-journal(592) seq=33409 gfp=0x24201ca order=0 delay=20495
[ 6915.212300] MemAlloc: NetworkManager(807) seq=42042 gfp=0x24200ca order=0 delay=12030
[ 6915.221042] MemAlloc: gssproxy(815) seq=1551 gfp=0x24201ca order=0 delay=19414
[ 6915.229104] MemAlloc: irqbalance(825) seq=6763 gfp=0x24201ca order=0 delay=11234
[ 6915.237363] MemAlloc: tuned(1339) seq=74664 gfp=0x24201ca order=0 delay=20354
[ 6915.245329] MemAlloc: top(10485) seq=486624 gfp=0x24201ca order=0 delay=20124
[ 6915.253288] MemAlloc: kworker/1:1(20708) seq=48 gfp=0x2400000 order=0 delay=20248
[ 6915.261640] MemAlloc: sendmail(21855) seq=207 gfp=0x24201ca order=0 delay=19977
[ 6915.269800] MemAlloc: oom01(22007) seq=2 gfp=0x24201ca order=0 delay=20269
[ 6915.277466] MemAlloc: oom01(22008) seq=5659 gfp=0x24280ca order=0 delay=20502
[ 6915.285432] MemAlloc: oom01(22009) seq=5189 gfp=0x24280ca order=0 delay=20502
[ 6915.293389] MemAlloc: oom01(22010) seq=4795 gfp=0x24280ca order=0 delay=20502
[ 6915.301353] MemAlloc: oom01(22011) seq=5135 gfp=0x24280ca order=0 delay=20641
[ 6915.309316] MemAlloc: oom01(22012) seq=3828 gfp=0x24280ca order=0 delay=20502
[ 6915.317280] MemAlloc: oom01(22013) seq=5101 gfp=0x24280ca order=0 delay=20641
[ 6915.325244] MemAlloc: oom01(22014) seq=3633 gfp=0x24280ca order=0 delay=20502
seq numbers above appear to hold for another ~40 minutes, then I power cycled the system.
Regards,
Jan
WARNING: multiple messages have this Message-ID (diff)
From: Jan Stancek <jstancek@redhat.com>
To: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>, linux-mm@kvack.org
Cc: ltp@lists.linux.it
Subject: Re: [BUG] oom hangs the system, NMI backtrace shows most CPUs in shrink_slab
Date: Tue, 19 Jan 2016 16:13:11 +0100 [thread overview]
Message-ID: <569E5287.4080503@redhat.com> (raw)
In-Reply-To: <569E1010.2070806@I-love.SAKURA.ne.jp>
On 01/19/2016 11:29 AM, Tetsuo Handa wrote:
> Jan Stancek wrote:
>> I'm seeing system occasionally hanging after "oom01" testcase
>> from LTP triggers OOM.
>>
>> Here's a console log obtained from v4.4-8606 (shows oom, followed
>> by blocked task messages, followed by me triggering sysrq-t):
>> http://jan.stancek.eu/tmp/oom_hangs/oom_hang_v4.4-8606.txt
>> http://jan.stancek.eu/tmp/oom_hangs/config-v4.4-8606.txt
>
> I think this console log reports an OOM livelock.
>
> dump_tasks() shows that there are 2 oom1 tasks (10482 and 10528) but
> show_state() shows that there are 8 oom1 tasks whose parent is 10482
> (10528 10529 10530 10531 10532 10533 10534 10535). Thus, I guess oom1
> process called fork() and the child process created 7 threads.
Correct, parent forks a child and expects that child to be killed by OOM.
Child forks NCPU-1 threads, which all keep allocating/touching memory
while they can.
>
> ----------
> [ 495.322098] [10482] 0 10482 2692 10 11 3 27 0 oom01
> [ 495.341231] [10528] 0 10528 8667787 3731327 11002 28 1890661 0 oom01
> [ 1736.414809] oom01 S ffff880453dd3e48 13008 10482 10479 0x00000080
> [ 1737.025258] oom01 x ffff88044ff9bc98 13080 10528 10482 0x00000084
> [ 1737.129553] oom01 R running task 11616 10529 10482 0x00000084
> [ 1737.309520] oom01 R running task 11544 10530 10482 0x00000084
> [ 1737.417212] oom01 R running task 11616 10531 10482 0x00000084
> [ 1737.551904] oom01 R running task 11616 10532 10482 0x00000084
> [ 1737.725174] oom01 R running task 11616 10533 10482 0x00000084
> [ 1737.898452] oom01 R running task 11616 10534 10482 0x00000084
> [ 1738.078431] oom01 R running task 11616 10535 10482 0x00000084
> ----------
>
> 10528 got both SIGKILL and TIF_MEMDIE at uptime = 495 and has exited by uptime = 1737.
>
> ----------
> [ 495.350845] Out of memory: Kill process 10528 (oom01) score 952 or sacrifice child
> [ 495.359301] Killed process 10528 (oom01) total-vm:34671148kB, anon-rss:14925308kB, file-rss:0kB, shmem-rss:0kB
> ----------
>
> Since 10529 to 10535 got only SIGKILL, all of them are looping inside __alloc_pages_slowpath().
>
> ----------
> [ 1737.129553] oom01 R running task 11616 10529 10482 0x00000084
> [ 1737.172049] [<ffffffff81778a8c>] _cond_resched+0x1c/0x30
> [ 1737.178072] [<ffffffff811e6daf>] shrink_slab.part.42+0x2cf/0x540
> [ 1737.184873] [<ffffffff811176ae>] ? rcu_read_lock_held+0x5e/0x70
> [ 1737.191577] [<ffffffff811ec4ba>] shrink_zone+0x30a/0x330
> [ 1737.197602] [<ffffffff811ec884>] do_try_to_free_pages+0x174/0x440
> [ 1737.204499] [<ffffffff811ecc50>] try_to_free_pages+0x100/0x2c0
> [ 1737.211107] [<ffffffff81268ae0>] __alloc_pages_slowpath.constprop.85+0x3c6/0x74a
> [ 1737.309520] oom01 R running task 11544 10530 10482 0x00000084
> [ 1737.352024] [<ffffffff81778a8c>] _cond_resched+0x1c/0x30
> [ 1737.358049] [<ffffffff81268b12>] __alloc_pages_slowpath.constprop.85+0x3f8/0x74a
> [ 1737.417212] oom01 R running task 11616 10531 10482 0x00000084
> [ 1737.459715] [<ffffffff81778a8c>] _cond_resched+0x1c/0x30
> [ 1737.465739] [<ffffffff811e6daf>] shrink_slab.part.42+0x2cf/0x540
> [ 1737.472540] [<ffffffff811176ae>] ? rcu_read_lock_held+0x5e/0x70
> [ 1737.479242] [<ffffffff811ec4ba>] shrink_zone+0x30a/0x330
> [ 1737.485266] [<ffffffff811ec884>] do_try_to_free_pages+0x174/0x440
> [ 1737.492162] [<ffffffff811ecc50>] try_to_free_pages+0x100/0x2c0
> [ 1737.498768] [<ffffffff81268ae0>] __alloc_pages_slowpath.constprop.85+0x3c6/0x74a
> [ 1737.551904] oom01 R running task 11616 10532 10482 0x00000084
> [ 1737.594403] [<ffffffff81778a8c>] _cond_resched+0x1c/0x30
> [ 1737.600428] [<ffffffff811e6cf0>] shrink_slab.part.42+0x210/0x540
> [ 1737.607228] [<ffffffff811ec4ba>] shrink_zone+0x30a/0x330
> [ 1737.613252] [<ffffffff811ec884>] do_try_to_free_pages+0x174/0x440
> [ 1737.620150] [<ffffffff811ecc50>] try_to_free_pages+0x100/0x2c0
> [ 1737.626755] [<ffffffff81268ae0>] __alloc_pages_slowpath.constprop.85+0x3c6/0x74a
> [ 1737.725174] oom01 R running task 11616 10533 10482 0x00000084
> [ 1737.767682] [<ffffffff81778a8c>] _cond_resched+0x1c/0x30
> [ 1737.773705] [<ffffffff811e6cf0>] shrink_slab.part.42+0x210/0x540
> [ 1737.780506] [<ffffffff811ec4ba>] shrink_zone+0x30a/0x330
> [ 1737.786532] [<ffffffff811ec884>] do_try_to_free_pages+0x174/0x440
> [ 1737.793428] [<ffffffff811ecc50>] try_to_free_pages+0x100/0x2c0
> [ 1737.800037] [<ffffffff81268ae0>] __alloc_pages_slowpath.constprop.85+0x3c6/0x74a
> [ 1737.898452] oom01 R running task 11616 10534 10482 0x00000084
> [ 1737.940961] [<ffffffff81778a8c>] _cond_resched+0x1c/0x30
> [ 1737.946985] [<ffffffff811e6daf>] shrink_slab.part.42+0x2cf/0x540
> [ 1737.953785] [<ffffffff811176ae>] ? rcu_read_lock_held+0x5e/0x70
> [ 1737.960488] [<ffffffff811ec4ba>] shrink_zone+0x30a/0x330
> [ 1737.966512] [<ffffffff811ec884>] do_try_to_free_pages+0x174/0x440
> [ 1737.973408] [<ffffffff811ecc50>] try_to_free_pages+0x100/0x2c0
> [ 1737.980015] [<ffffffff81268ae0>] __alloc_pages_slowpath.constprop.85+0x3c6/0x74a
> [ 1738.078431] oom01 R running task 11616 10535 10482 0x00000084
> [ 1738.120938] [<ffffffff81778a8c>] _cond_resched+0x1c/0x30
> [ 1738.126961] [<ffffffff81268b12>] __alloc_pages_slowpath.constprop.85+0x3f8/0x74a
> ----------
>
> As a result, 10482 remains sleeping at wait().
>
> ----------
> [ 1736.414809] oom01 S ffff880453dd3e48 13008 10482 10479 0x00000080
> [ 1736.450320] [<ffffffff817787ec>] schedule+0x3c/0x90
> [ 1736.455859] [<ffffffff8109ba93>] do_wait+0x213/0x2f0
> [ 1736.461495] [<ffffffff8109ce60>] SyS_wait4+0x80/0x100
> ----------
>
> Also, 10519 is a typical trace which is looping inside __alloc_pages_slowpath()
> which can be observed when we hit OOM livelock. Since this allocation request
> includes neither __GFP_NOFAIL nor __GFP_FS, out_of_memory() cannot be called.
>
> ----------
> [ 1736.716427] kworker/2:1 R running task 12616 10519 2 0x00000088
> [ 1736.724319] Workqueue: events_freezable_power_ disk_events_workfn
> [ 1736.731128] 00000000ca948786 ffff88035997b8e8 0000000000000000 0000000000000000
> [ 1736.739427] 0000000000000000 0000000000000000 ffff88035997b850 ffffffff811ec4ba
> [ 1736.747725] ffff88035997b9d0 0000000000000000 0000000000000000 0001000000000000
> [ 1736.756022] Call Trace:
> [ 1736.758749] [<ffffffff811ec4ba>] ? shrink_zone+0x30a/0x330
> [ 1736.764969] [<ffffffff811ec884>] ? do_try_to_free_pages+0x174/0x440
> [ 1736.772060] [<ffffffff811ecc50>] ? try_to_free_pages+0x100/0x2c0
> [ 1736.778861] [<ffffffff81268ae0>] ? __alloc_pages_slowpath.constprop.85+0x3c6/0x74a
> [ 1736.787408] [<ffffffff811dc666>] ? __alloc_pages_nodemask+0x456/0x460
> [ 1736.794690] [<ffffffff810f51e4>] ? __lock_is_held+0x54/0x70
> [ 1736.801004] [<ffffffff81231c17>] ? alloc_pages_current+0x97/0x1b0
> [ 1736.807902] [<ffffffff8137ebd9>] ? bio_copy_kern+0xc9/0x180
> [ 1736.814217] [<ffffffff8138db15>] ? blk_rq_map_kern+0x75/0x130
> [ 1736.820726] [<ffffffff815269e2>] ? scsi_execute+0x132/0x160
> [ 1736.827041] [<ffffffff81528f7e>] ? scsi_execute_req_flags+0x8e/0xf0
> [ 1736.834133] [<ffffffffa01956e7>] ? sr_check_events+0xb7/0x2a0 [sr_mod]
> [ 1736.841514] [<ffffffffa00d7058>] ? cdrom_check_events+0x18/0x30 [cdrom]
> [ 1736.848992] [<ffffffffa0195b2a>] ? sr_block_check_events+0x2a/0x30 [sr_mod]
> [ 1736.856859] [<ffffffff81399bb0>] ? disk_check_events+0x60/0x170
> [ 1736.863561] [<ffffffff81399cdc>] ? disk_events_workfn+0x1c/0x20
> [ 1736.870264] [<ffffffff810b7055>] ? process_one_work+0x215/0x650
> [ 1736.876967] [<ffffffff810b6fc1>] ? process_one_work+0x181/0x650
> [ 1736.883671] [<ffffffff810b75b5>] ? worker_thread+0x125/0x4a0
> [ 1736.890082] [<ffffffff810b7490>] ? process_one_work+0x650/0x650
> [ 1736.896785] [<ffffffff810be191>] ? kthread+0x101/0x120
> [ 1736.902615] [<ffffffff810f8c19>] ? trace_hardirqs_on_caller+0xf9/0x1c0
> [ 1736.909997] [<ffffffff810be090>] ? kthread_create_on_node+0x250/0x250
> [ 1736.917281] [<ffffffff8177ef9f>] ? ret_from_fork+0x3f/0x70
> [ 1736.923499] [<ffffffff810be090>] ? kthread_create_on_node+0x250/0x250
> ----------
>
> What is strange is, 10529 to 10535 should be able to get TIF_MEMDIE and exit
> __alloc_pages_slowpath(). There are three possibilities. First is that they are
> too unlucky to take oom_lock mutex which is needed for calling out_of_memory(),
> but it should not continue failing to take oom_lock mutex for such long time.
> Second is that their allocation requests include neither __GFP_NOFAIL nor
> __GFP_FS which is needed for calling out_of_memory(). Third is that, although I
> couldn't find evidence that mlock() and madvice() are related with this hangup,
> something is preventing __GFP_NOFAIL or __GFP_FS allocation requests from
> calling out_of_memory(). If you can reproduce with kmallocwd patch at
> https://lkml.kernel.org/r/201511250024.AAE78692.QVOtFFOSFOMLJH@I-love.SAKURA.ne.jp
> applied, I think you will be able to know which possibility you are hitting.
>
It took slightly longer this time. Here's full console log:
http://jan.stancek.eu/tmp/oom_hangs/console.log.2-v4.4-8606-with-memalloc_wc.txt.bz2
Last OOM message happens at:
[ 6845.765153] oom01 invoked oom-killer: gfp_mask=0x24280ca, order=0, oom_score_adj=0
The moments later, first MemAlloc output starts appearing:
[ 6904.555880] MemAlloc-Info: 2 stalling task, 0 dying task, 0 victim task.
[ 6904.563387] MemAlloc: oom01(22011) seq=5135 gfp=0x24280ca order=0 delay=10001
[ 6904.571353] MemAlloc: oom01(22013) seq=5101 gfp=0x24280ca order=0 delay=10001
[ 6915.195869] MemAlloc-Info: 16 stalling task, 0 dying task, 0 victim task.
[ 6915.203458] MemAlloc: systemd-journal(592) seq=33409 gfp=0x24201ca order=0 delay=20495
[ 6915.212300] MemAlloc: NetworkManager(807) seq=42042 gfp=0x24200ca order=0 delay=12030
[ 6915.221042] MemAlloc: gssproxy(815) seq=1551 gfp=0x24201ca order=0 delay=19414
[ 6915.229104] MemAlloc: irqbalance(825) seq=6763 gfp=0x24201ca order=0 delay=11234
[ 6915.237363] MemAlloc: tuned(1339) seq=74664 gfp=0x24201ca order=0 delay=20354
[ 6915.245329] MemAlloc: top(10485) seq=486624 gfp=0x24201ca order=0 delay=20124
[ 6915.253288] MemAlloc: kworker/1:1(20708) seq=48 gfp=0x2400000 order=0 delay=20248
[ 6915.261640] MemAlloc: sendmail(21855) seq=207 gfp=0x24201ca order=0 delay=19977
[ 6915.269800] MemAlloc: oom01(22007) seq=2 gfp=0x24201ca order=0 delay=20269
[ 6915.277466] MemAlloc: oom01(22008) seq=5659 gfp=0x24280ca order=0 delay=20502
[ 6915.285432] MemAlloc: oom01(22009) seq=5189 gfp=0x24280ca order=0 delay=20502
[ 6915.293389] MemAlloc: oom01(22010) seq=4795 gfp=0x24280ca order=0 delay=20502
[ 6915.301353] MemAlloc: oom01(22011) seq=5135 gfp=0x24280ca order=0 delay=20641
[ 6915.309316] MemAlloc: oom01(22012) seq=3828 gfp=0x24280ca order=0 delay=20502
[ 6915.317280] MemAlloc: oom01(22013) seq=5101 gfp=0x24280ca order=0 delay=20641
[ 6915.325244] MemAlloc: oom01(22014) seq=3633 gfp=0x24280ca order=0 delay=20502
seq numbers above appear to hold for another ~40 minutes, then I power cycled the system.
Regards,
Jan
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2016-01-19 15:13 UTC|newest]
Thread overview: 26+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-01-18 15:38 [LTP] [BUG] oom hangs the system, NMI backtrace shows most CPUs in shrink_slab Jan Stancek
2016-01-18 15:38 ` Jan Stancek
2016-01-19 10:29 ` [LTP] " Tetsuo Handa
2016-01-19 10:29 ` Tetsuo Handa
2016-01-19 15:13 ` Jan Stancek [this message]
2016-01-19 15:13 ` Jan Stancek
2016-01-20 10:23 ` [LTP] [BUG] oom hangs the system, NMI backtrace shows most CPUs inshrink_slab Tetsuo Handa
2016-01-20 10:23 ` Tetsuo Handa
2016-01-20 13:17 ` [LTP] [BUG] oom hangs the system, NMI backtrace shows most CPUs in shrink_slab Tetsuo Handa
2016-01-20 13:17 ` Tetsuo Handa
2016-01-20 15:10 ` [LTP] " Tejun Heo
2016-01-20 15:10 ` Tejun Heo
2016-01-20 15:54 ` [LTP] " Tetsuo Handa
2016-01-20 15:54 ` Tetsuo Handa
2016-01-22 15:14 ` [LTP] " Jan Stancek
2016-01-22 15:14 ` Jan Stancek
2016-01-23 6:30 ` [LTP] " Tetsuo Handa
2016-01-23 6:30 ` Tetsuo Handa
2016-01-26 7:48 ` [LTP] " Jan Stancek
2016-01-26 7:48 ` Jan Stancek
2016-01-26 14:46 ` Tetsuo Handa
2016-01-26 14:46 ` Tetsuo Handa
2016-01-27 11:02 ` Tetsuo Handa
2016-01-28 15:48 ` Tetsuo Handa
2016-01-29 7:32 ` Jan Stancek
2016-01-29 12:35 ` Tetsuo Handa
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=569E5287.4080503@redhat.com \
--to=jstancek@redhat.com \
--cc=ltp@lists.linux.it \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.