* ps lockups, cgroup memory reclaim
From: Mark Hills @ 2013-09-17 15:50 UTC
To: linux-mm
I'm investigating intermittent kernel lockups in an HPC environment, with
the Red Hat kernel.
The symptoms are seen as lockups of multiple ps commands, with one
consuming full CPU:
# ps aux | grep ps
root 19557 68.9 0.0 108100 908 ? D Sep16 1045:37 ps --ppid 1 -o args=
root 19871 0.0 0.0 108100 908 ? D Sep16 0:00 ps --ppid 1 -o args=
SIGKILL on the busy one causes the other ps processes to run to completion
(TERM has no effect).
In this case I was able to run my own ps to see the process list, but not
always.
perf shows the locality of the spinning, roughly:
proc_pid_cmdline
get_user_pages
handle_mm_fault
mem_cgroup_try_charge_swapin
mem_cgroup_reclaim
There are two entry points; the codepaths taken are better shown by the
attached profile of CPU time.
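(For anyone wanting to reproduce the measurement, a call-graph sample of the
busy task along these lines should give an equivalent picture; options are
illustrative, not necessarily the exact invocation used here.)

    # sample the spinning ps (pid 19557 above) with call graphs for 30 seconds
    perf record -g -p 19557 sleep 30
    # summarise the recorded samples as a plain-text report
    perf report --stdio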
We've had this behaviour since switching to Scientific Linux 6 (based on
RHEL6, like CentOS) at kernel 2.6.32-279.9.1.el6.x86_64.
The example above is kernel 2.6.32-358.el6.x86_64.
I haven't been able to get a reproducible case with which to test the
mainline kernel; our large-scale automated use of ps is effectively acting
as a fuzz test, and switching kernels like that is unfortunately not an option.
Does this issue sound familiar? I'd appreciate any advice or information,
or pointers to where such cases have been investigated in mainline.
I could not find anything using Google, but this problem does not have a
distinctive keyword or error message.
Many thanks
--
Mark
[-- Attachment #2: Type: APPLICATION/PDF, Size: 31913 bytes --]
* Re: ps lockups, cgroup memory reclaim
From: Johannes Weiner @ 2013-09-17 16:28 UTC
To: Mark Hills; +Cc: linux-mm
On Tue, Sep 17, 2013 at 04:50:42PM +0100, Mark Hills wrote:
> I'm investigating intermittent kernel lockups in an HPC environment, with
> the Red Hat kernel.
>
> The symptoms are seen as lockups of multiple ps commands, with one
> consuming full CPU:
>
> # ps aux | grep ps
> root 19557 68.9 0.0 108100 908 ? D Sep16 1045:37 ps --ppid 1 -o args=
> root 19871 0.0 0.0 108100 908 ? D Sep16 0:00 ps --ppid 1 -o args=
>
> SIGKILL on the busy one causes the other ps processes to run to completion
> (TERM has no effect).
Any chance you can get to the stack of the non-busy blocked tasks?
It would be /proc/19871/stack in this case.
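An untested sketch to grab all of them in one go the next time it happens
(needs root, and /proc/<pid>/stack requires CONFIG_STACKTRACE):

    # dump the kernel stack of every ps task currently in D state
    for pid in $(ps -C ps -o pid=,stat= | awk '$2 ~ /^D/ { print $1 }'); do
            echo "== $pid =="
            cat /proc/$pid/stack
    done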
> In this case I was able to run my own ps to see the process list, but not
> always.
>
> perf shows the locality of the spinning, roughly:
>
> proc_pid_cmdline
> get_user_pages
> handle_mm_fault
> mem_cgroup_try_charge_swapin
> mem_cgroup_reclaim
>
> There are two entry points; the codepaths taken are better shown by the
> attached profile of CPU time.
Looks like it's spinning like crazy in shrink_mem_cgroup_zone().
Maybe an LRU counter underflow, maybe endlessly looping on the
should_continue_reclaim() compaction condition. But I don't see an
obvious connection to why killing the busy task wakes up the blocked
one.
So yeah, it would be helpful to know what that task is waiting for.
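If you can catch a group in this state, one cheap userspace check on the
underflow theory is its memory.stat; I'm not certain these exactly mirror the
per-zone counters the reclaim loop reads, but a wrapped counter would show up
as an absurdly large value (mount point may differ, /cgroup/memory is common
on RHEL6, and <group> is a placeholder):

    # LRU sizes as seen by memcg; an underflowed counter looks like a huge number
    grep -E '^(in)?active_(anon|file)' /cgroup/memory/<group>/memory.stat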
> We've had this behaviour since switching to Scientific Linux 6 (based on
> RHEL6, like CentOS) at kernel 2.6.32-279.9.1.el6.x86_64.
>
> The example above is kernel 2.6.32-358.el6.x86_64.
Can you test with the debug build? That should trap LRU counter
underflows at least. If you have the possibility to recompile the
distribution kernel I can provide you with debug patches.
* Re: ps lockups, cgroup memory reclaim
From: Mark Hills @ 2013-09-18 0:50 UTC
To: Johannes Weiner; +Cc: linux-mm
On Tue, 17 Sep 2013, Johannes Weiner wrote:
> On Tue, Sep 17, 2013 at 04:50:42PM +0100, Mark Hills wrote:
> > I'm investigating intermittent kernel lockups in an HPC environment, with
> > the Red Hat kernel.
> >
> > The symptoms are seen as lockups of multiple ps commands, with one
> > consuming full CPU:
> >
> > # ps aux | grep ps
> > root 19557 68.9 0.0 108100 908 ? D Sep16 1045:37 ps --ppid 1 -o args=
> > root 19871 0.0 0.0 108100 908 ? D Sep16 0:00 ps --ppid 1 -o args=
> >
> > SIGKILL on the busy one causes the other ps processes to run to completion
> > (TERM has no effect).
>
> Any chance you can get to the stack of the non-busy blocked tasks?
>
> It would be /proc/19871/stack in this case.
I had to return the machine above to the cluster, but next time I'll log
this information.
> > In this case I was able to run my own ps to see the process list, but not
> > always.
> >
> > perf shows the locality of the spinning, roughly:
> >
> > proc_pid_cmdline
> > get_user_pages
> > handle_mm_fault
> > mem_cgroup_try_charge_swapin
> > mem_cgroup_reclaim
> >
> > There are two entry points; the codepaths taken are better shown by the
> > attached profile of CPU time.
>
> Looks like it's spinning like crazy in shrink_mem_cgroup_zone().
> Maybe an LRU counter underflow, maybe endlessly looping on the
> should_continue_reclaim() compaction condition. But I don't see an
> obvious connection to why killing the busy task wakes up the blocked
> one.
Maybe it's as simple as the lock taken at quite a high level; perhaps even
a lock when reading values for /proc.
But no need for me to guess, we'll find out next time from the /proc
information.
> So yeah, it would be helpful to know what that task is waiting for.
>
> > We've had this behaviour since switching to Scientific Linux 6 (based on
> > RHEL6, like CentOS) at kernel 2.6.32-279.9.1.el6.x86_64.
> >
> > The example above is kernel 2.6.32-358.el6.x86_64.
>
> Can you test with the debug build? That should trap LRU counter
> underflows at least.
Ah, excellent -- I did not realise there was a kernel-debug package.
I'll see if I can isolate this with more detail from that kernel
(including the stack trace of each task that is hung).
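The plan is simply to pull the packaged debug variant onto a subset of nodes,
roughly as follows (the version string is an example; it will be whatever the
repository carries):

    # install the debug build alongside the stock kernel and boot into it
    yum install kernel-debug
    # select it for the next boot
    grubby --set-default=/boot/vmlinuz-2.6.32-358.el6.x86_64.debug
    reboot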
> If you have the possibility to recompile the distribution kernel I can
> provide you with debug patches.
Absolutely, I can deploy patched kernels, but not cluster-wide. Let's
hope I can get enough coverage to catch the bug.
Hopefully I'll have more information soon.
Thanks
--
Mark
* Re: ps lockups, cgroup memory reclaim
From: Mark Hills @ 2013-10-24 17:39 UTC
To: Johannes Weiner; +Cc: linux-mm
On Wed, 18 Sep 2013, Mark Hills wrote:
> On Tue, 17 Sep 2013, Johannes Weiner wrote:
>
> > On Tue, Sep 17, 2013 at 04:50:42PM +0100, Mark Hills wrote:
> > > I'm investigating intermittent kernel lockups in an HPC environment, with
> > > the Red Hat kernel.
> > >
> > > The symptoms are seen as lockups of multiple ps commands, with one
> > > consuming full CPU:
> > >
> > > # ps aux | grep ps
> > > root 19557 68.9 0.0 108100 908 ? D Sep16 1045:37 ps --ppid 1 -o args=
> > > root 19871 0.0 0.0 108100 908 ? D Sep16 0:00 ps --ppid 1 -o args=
> > >
> > > SIGKILL on the busy one causes the other ps processes to run to completion
> > > (TERM has no effect).
> >
> > Any chance you can get to the stack of the non-busy blocked tasks?
> >
> > It would be /proc/19871/stack in this case.
>
> I had to return the machine above to the cluster, but next time I'll log
> this information.
>
> > > In this case I was able to run my own ps to see the process list,
> > > but not always.
> > >
> > > perf shows the locality of the spinning, roughly:
> > >
> > > proc_pid_cmdline
> > > get_user_pages
> > > handle_mm_fault
> > > mem_cgroup_try_charge_swapin
> > > mem_cgroup_reclaim
> > >
> > > There are two entry points; the codepaths taken are better shown by
> > > the attached profile of CPU time.
> >
> > Looks like it's spinning like crazy in shrink_mem_cgroup_zone(). Maybe
> > an LRU counter underflow, maybe endlessly looping on the
> > should_continue_reclaim() compaction condition. But I don't see an
> > obvious connection to why killing the busy task wakes up the blocked
> > one.
>
> Maybe it's as simple as the lock taken at quite a high level; perhaps even
> a lock when reading values for /proc.
>
> But no need for me to guess, we'll find out next time from the /proc
> information.
>
> > So yeah, it would be helpful to know what that task is waiting for.
> >
> > > We've had this behaviour since switching to Scientific Linux 6 (based on
> > > RHEL6, like CentOS) at kernel 2.6.32-279.9.1.el6.x86_64.
> > >
> > > The example above is kernel 2.6.32-358.el6.x86_64.
> >
> > Can you test with the debug build? That should trap LRU counter
> > underflows at least.
>
> Ah, excellent -- I did not realise there was a kernel-debug package.
[...]
Ok, it has taken me a while but I now have some automated capture in place
and the equivalent packaged debug kernel:
$ uname -r
2.6.32-358.el6.x86_64.debug
Here is the initial result. The bad news is that there doesn't appear to
be an obvious BUG or alert to any kind of overflow.
With this in mind, I noted the times of the OOM-killer (which is operating
on containers that users run their jobs in).
The most recent is less than 5 minutes before the lockup was detected, and my
detection runs every 5 minutes. I have extracted this below (I can post
the full dmesg elsewhere if interested).
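(For reference, the extract is just the slice of the kernel log between the
invocation line and the kill line; something along these lines pulls each
report out:)

    # pull each memcg OOM report out of the kernel log
    dmesg | sed -n '/invoked oom-killer/,/Killed process/p'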
I assume that pid 8099 in the trace is a thread ID, because that pid does
not appear in the cgroup; only PID 7919 does.
Notably, the process requesting the memory is itself the memory hog; the
cgroup contains only the launching shell and the job binary.
Perhaps an expert can see an immediate bug from the call trace; e.g.
killing a process at an awkward time causing some kind of counter
underflow?
I'm unsure where to go next, given there is no obvious BUG here. I think I
will look next week for a reproducible test; perhaps I can simulate
the same out-of-memory condition.
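The rough shape of the test I have in mind, with paths and sizes purely
illustrative (the memory controller is typically mounted at /cgroup/memory on
RHEL6, and ./memory-hog stands in for any program that faults in more
anonymous memory than the limit):

    # a small memory+swap limited group, similar to the job containers
    mkdir /cgroup/memory/repro
    echo 256M > /cgroup/memory/repro/memory.limit_in_bytes
    echo 288M > /cgroup/memory/repro/memory.memsw.limit_in_bytes  # needs swap accounting
    # run only the hog inside the group...
    ( echo $BASHPID > /cgroup/memory/repro/tasks; exec ./memory-hog ) &
    # ...while hammering the same ps invocation from outside it
    while true; do ps --ppid 1 -o args= > /dev/null; done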
Thanks
--
Mark
frenpic.bin invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=0, oom_score_adj=0
frenpic.bin cpuset=manager-7805 mems_allowed=0-1
Pid: 8099, comm: frenpic.bin Not tainted 2.6.32-358.el6.x86_64.debug #1
Call Trace:
[<ffffffff81541dcb>] ? _spin_unlock+0x2b/0x40
[<ffffffff811314f3>] ? dump_header+0x83/0x210
[<ffffffff810b4ffd>] ? trace_hardirqs_on_caller+0x14d/0x190
[<ffffffff810b504d>] ? trace_hardirqs_on+0xd/0x10
[<ffffffff811319f2>] ? oom_kill_process+0x82/0x2f0
[<ffffffff811321e2>] ? mem_cgroup_out_of_memory+0x92/0xb0
[<ffffffff8118cff4>] ? mem_cgroup_handle_oom+0x274/0x2a0
[<ffffffff8118a700>] ? memcg_oom_wake_function+0x0/0xa0
[<ffffffff8118d5e8>] ? __mem_cgroup_try_charge+0x5c8/0x6b0
[<ffffffff8118d218>] ? __mem_cgroup_try_charge+0x1f8/0x6b0
[<ffffffff8118eca0>] ? mem_cgroup_charge_common+0x90/0xd0
[<ffffffff8118ed28>] ? mem_cgroup_newpage_charge+0x48/0x50
[<ffffffff8115a05a>] ? handle_pte_fault+0x72a/0xa90
[<ffffffff810a32bd>] ? sched_clock_cpu+0xcd/0x110
[<ffffffff8115a5fa>] ? handle_mm_fault+0x23a/0x310
[<ffffffff81047533>] ? __do_page_fault+0x163/0x4e0
[<ffffffff810b504d>] ? trace_hardirqs_on+0xd/0x10
[<ffffffff81541d10>] ? _spin_unlock_irq+0x30/0x40
[<ffffffff8105c43d>] ? finish_task_switch+0x7d/0x110
[<ffffffff8105c408>] ? finish_task_switch+0x48/0x110
[<ffffffff8153e853>] ? thread_return+0x4e/0x7db
[<ffffffff8100bb10>] ? restore_args+0x0/0x30
[<ffffffff815457be>] ? do_page_fault+0x3e/0xa0
[<ffffffff81542905>] ? page_fault+0x25/0x30
Task in /manager-7805 killed as a result of limit of /manager-7805
memory: usage 16381840kB, limit 16384000kB, failcnt 250389
memory+swap: usage 18022400kB, limit 18022400kB, failcnt 1117
Mem-Info:
Node 0 DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
CPU 1: hi: 0, btch: 1 usd: 0
CPU 2: hi: 0, btch: 1 usd: 0
CPU 3: hi: 0, btch: 1 usd: 0
CPU 4: hi: 0, btch: 1 usd: 0
CPU 5: hi: 0, btch: 1 usd: 0
CPU 6: hi: 0, btch: 1 usd: 0
CPU 7: hi: 0, btch: 1 usd: 0
CPU 8: hi: 0, btch: 1 usd: 0
CPU 9: hi: 0, btch: 1 usd: 0
CPU 10: hi: 0, btch: 1 usd: 0
CPU 11: hi: 0, btch: 1 usd: 0
CPU 12: hi: 0, btch: 1 usd: 0
CPU 13: hi: 0, btch: 1 usd: 0
CPU 14: hi: 0, btch: 1 usd: 0
CPU 15: hi: 0, btch: 1 usd: 0
Node 0 DMA32 per-cpu:
CPU 0: hi: 186, btch: 31 usd: 0
CPU 1: hi: 186, btch: 31 usd: 0
CPU 2: hi: 186, btch: 31 usd: 0
CPU 3: hi: 186, btch: 31 usd: 0
CPU 4: hi: 186, btch: 31 usd: 0
CPU 5: hi: 186, btch: 31 usd: 0
CPU 6: hi: 186, btch: 31 usd: 0
CPU 7: hi: 186, btch: 31 usd: 0
CPU 8: hi: 186, btch: 31 usd: 161
CPU 9: hi: 186, btch: 31 usd: 164
CPU 10: hi: 186, btch: 31 usd: 171
CPU 11: hi: 186, btch: 31 usd: 178
CPU 12: hi: 186, btch: 31 usd: 0
CPU 13: hi: 186, btch: 31 usd: 0
CPU 14: hi: 186, btch: 31 usd: 0
CPU 15: hi: 186, btch: 31 usd: 0
Node 0 Normal per-cpu:
CPU 0: hi: 186, btch: 31 usd: 175
CPU 1: hi: 186, btch: 31 usd: 120
CPU 2: hi: 186, btch: 31 usd: 77
CPU 3: hi: 186, btch: 31 usd: 0
CPU 4: hi: 186, btch: 31 usd: 0
CPU 5: hi: 186, btch: 31 usd: 0
CPU 6: hi: 186, btch: 31 usd: 89
CPU 7: hi: 186, btch: 31 usd: 2
CPU 8: hi: 186, btch: 31 usd: 39
CPU 9: hi: 186, btch: 31 usd: 127
CPU 10: hi: 186, btch: 31 usd: 9
CPU 11: hi: 186, btch: 31 usd: 170
CPU 12: hi: 186, btch: 31 usd: 0
CPU 13: hi: 186, btch: 31 usd: 0
CPU 14: hi: 186, btch: 31 usd: 121
CPU 15: hi: 186, btch: 31 usd: 0
Node 1 Normal per-cpu:
CPU 0: hi: 186, btch: 31 usd: 82
CPU 1: hi: 186, btch: 31 usd: 158
CPU 2: hi: 186, btch: 31 usd: 0
CPU 3: hi: 186, btch: 31 usd: 89
CPU 4: hi: 186, btch: 31 usd: 0
CPU 5: hi: 186, btch: 31 usd: 30
CPU 6: hi: 186, btch: 31 usd: 0
CPU 7: hi: 186, btch: 31 usd: 61
CPU 8: hi: 186, btch: 31 usd: 180
CPU 9: hi: 186, btch: 31 usd: 103
CPU 10: hi: 186, btch: 31 usd: 162
CPU 11: hi: 186, btch: 31 usd: 42
CPU 12: hi: 186, btch: 31 usd: 0
CPU 13: hi: 186, btch: 31 usd: 0
CPU 14: hi: 186, btch: 31 usd: 0
CPU 15: hi: 186, btch: 31 usd: 98
active_anon:7716448 inactive_anon:1137356 isolated_anon:128
active_file:2563426 inactive_file:3632112 isolated_file:0
unevictable:0 dirty:0 writeback:1676 unstable:0
free:1033107 slab_reclaimable:222717 slab_unreclaimable:21054
mapped:12050 shmem:18 pagetables:20595 bounce:0
Node 0 DMA free:15736kB min:20kB low:24kB high:28kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15352kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 3179 32216 32216
Node 0 DMA32 free:1395136kB min:4436kB low:5544kB high:6652kB active_anon:780288kB inactive_anon:376980kB active_file:6968kB inactive_file:25896kB unevictable:0kB isolated(anon):512kB isolated(file):0kB present:3255392kB mlocked:0kB dirty:0kB writeback:6700kB mapped:492kB shmem:0kB slab_reclaimable:3356kB slab_unreclaimable:408kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 29037 29037
Node 0 Normal free:694076kB min:40532kB low:50664kB high:60796kB active_anon:14153412kB inactive_anon:1905888kB active_file:5255884kB inactive_file:7325488kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:29734400kB mlocked:0kB dirty:0kB writeback:0kB mapped:20016kB shmem:68kB slab_reclaimable:533876kB slab_unreclaimable:42860kB kernel_stack:2840kB pagetables:34016kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 1 Normal free:2027480kB min:45116kB low:56392kB high:67672kB active_anon:15932092kB inactive_anon:2266556kB active_file:4990852kB inactive_file:7177064kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:33095680kB mlocked:0kB dirty:0kB writeback:4kB mapped:27692kB shmem:4kB slab_reclaimable:353636kB slab_unreclaimable:40948kB kernel_stack:528kB pagetables:48364kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 2*4kB 2*8kB 2*16kB 2*32kB 2*64kB 1*128kB 0*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15736kB
Node 0 DMA32: 676*4kB 716*8kB 493*16kB 482*32kB 331*64kB 314*128kB 194*256kB 74*512kB 90*1024kB 32*2048kB 258*4096kB = 1395136kB
Node 0 Normal: 409*4kB 355*8kB 360*16kB 90*32kB 48*64kB 1070*128kB 951*256kB 307*512kB 137*1024kB 0*2048kB 0*4096kB = 694076kB
Node 1 Normal: 2112*4kB 480*8kB 251*16kB 65*32kB 40*64kB 18*128kB 1522*256kB 1826*512kB 664*1024kB 0*2048kB 0*4096kB = 2027728kB
6319722 total pagecache pages
124333 pages in swap cache
Swap cache stats: add 8249007, delete 8124674, find 737685/862651
Free swap = 29529892kB
Total swap = 33013752kB
16777215 pages RAM
322632 pages reserved
176578 pages shared
15249459 pages non-shared
[ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
[ 7806] 243986 7806 28804 507 9 0 0 tcsh
[ 7919] 243986 7919 5133045 4096768 9 0 0 frenpic.bin
Memory cgroup out of memory: Kill process 7919 (frenpic.bin) score 1000 or sacrifice child
Killed process 7919, UID 243986, (frenpic.bin) total-vm:20532180kB, anon-rss:16373904kB, file-rss:13168kB