* Re: Kernel migration eat CPUs
       [not found] ` <20130825142837.GD31370@twins.programming.kicks-ass.net>
@ 2013-08-29 10:10   ` Alexey Vlasov
  2013-09-04 18:53   ` Alexey Vlasov
  1 sibling, 0 replies; 7+ messages in thread
From: Alexey Vlasov @ 2013-08-29 10:10 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Andrew Morton, Ingo Molnar, Mike Galbraith, linux-kernel

Hi Peter,

On Sun, Aug 25, 2013 at 04:28:37PM +0200, Peter Zijlstra wrote:
>
> Gargh.. I've never seen anything like that. Nor ever had a report like
> this. Is there anything in particular one can do to try and reproduce
> this?

I don't know how to reproduce it. It happens by itself, and only on
heavily loaded servers. For example, it happens almost every hour on one
server running kernel 3.8.11 with 10k web sites and 5k MySQL databases.
On another server running kernel 3.9.4 under the same load it happens
3-5 times per day. Sometimes it happens almost simultaneously on both
servers.

I went back to kernel 2.6.35 on the servers where it happened most
often; the remaining ones are perhaps just not loaded enough for the
effect to show. For example, here is a server that used to run kernel
3.9.4. It is heavily loaded, but migration stopped eating CPU after the
downgrade to 2.6.35:

# uname -r
2.6.35.7
# uptime
 13:56:34 up 32 days, 10:31, 10 users,  load average: 24.44, 23.44, 24.13
# ps -u root -o user,bsdtime,comm | grep -E 'COMMAND|migration'
USER       TIME COMMAND
root       4:20 migration/0
root       6:07 migration/1
root      17:00 migration/2
root       5:23 migration/3
root      16:43 migration/4
root       3:48 migration/5
root      12:28 migration/6
root       3:44 migration/7
root      12:25 migration/8
root       3:49 migration/9
root       1:52 migration/10
root       2:51 migration/11
root       1:28 migration/12
root       2:43 migration/13
root       2:16 migration/14
root       4:53 migration/15
root       2:15 migration/16
root       4:13 migration/17
root       2:13 migration/18
root       4:21 migration/19
root       2:07 migration/20
root       4:13 migration/21
root       2:13 migration/22
root       3:26 migration/23

For comparison, 3.9.4:

# uptime
 13:55:49 up 11 days, 15:36, 11 users,  load average: 24.62, 24.36, 23.63
USER       TIME COMMAND
root     233:51 migration/0
root     233:38 migration/1
root     231:57 migration/2
root     233:26 migration/3
root     231:46 migration/4
root     233:26 migration/5
root     231:37 migration/6
root     232:56 migration/7
root     231:09 migration/8
root     232:34 migration/9
root     231:04 migration/10
root     232:22 migration/11
root     230:50 migration/12
root     232:16 migration/13
root     230:38 migration/14
root     231:51 migration/15
root     230:04 migration/16
root     230:16 migration/17
root     230:06 migration/18
root     230:22 migration/19
root     229:45 migration/20
root     229:43 migration/21
root     229:27 migration/22
root     229:24 migration/23
root     229:11 migration/24
root     229:25 migration/25
root     229:16 migration/26
root     228:58 migration/27
root     228:48 migration/28
root     229:06 migration/29
root     228:25 migration/30
root     228:25 migration/31

> Could you perhaps send your .config and a function (or function-graph)
> trace for when this happens?

My .config:
https://www.dropbox.com/s/vuwvalj58cfgahu/.config_3.9.4-1gb-csmb-tr

I can't take a trace because tracing isn't enabled in my kernels. I'll
be able to reboot the servers at the weekend, since they host many
clients, and will send you a trace then.

> Also, do you use weird things like cgroup/cpusets or other such fancy
> stuff? If so, could you outline your setup?

The grsec patch is used on all kernels.
Also, kernel 3.8.11 alone carries the following patch:

--- kernel/cgroup.c.orig
+++ kernel/cgroup.c
@@ -1931,7 +1931,8 @@
 		ss->attach(cgrp, &tset);
 	}
 
-	synchronize_rcu();
+	synchronize_rcu_expedited();
+
 	/*
 	 * wake up rmdir() waiter. the rmdir should fail since the

I also use https://github.com/facebook/flashcache/

I do actively use cgroups, namely the cpuacct, memory and blkio
controllers. I create a cgroup for every user on the server, and all of
that user's processes run inside it. Making this work requires patches
to Apache/prefork, SSH and other user-facing software. There can be
about 10k-15k users, and correspondingly the same number of cgroups.

The other day I disabled all the cgroups, though the controllers are
still mounted:

# cat /proc/cgroups
#subsys_name    hierarchy       num_cgroups     enabled
cpuset          2               1               1
cpuacct         3               1               1
memory          4               1               1
blkio           5               1               1

But migration still eats CPUs. Note, however, that I also use cgroups
on kernel 2.6.35.

-- 
BRGDS. Alexey Vlasov.

^ permalink raw reply  [flat|nested] 7+ messages in thread
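For concreteness, the per-user setup described above can be sketched in a
few lines of shell, assuming cgroup v1 controllers mounted under
/sys/fs/cgroup; the user name "alice", the 512M limit and $pid are
illustrative assumptions, not values from this report:

  # create one group per user under each controller mentioned above
  user=alice
  for ctrl in cpuacct memory blkio; do
      mkdir -p "/sys/fs/cgroup/$ctrl/$user"
  done
  # example limit only; real per-user limits are not given in the report
  echo 512M > "/sys/fs/cgroup/memory/$user/memory.limit_in_bytes"
  # a patched Apache/SSH would then place each user process in the group
  echo "$pid" > "/sys/fs/cgroup/memory/$user/tasks"

With 10k-15k such groups the cost of cgroup teardown matters, which is
presumably why the synchronize_rcu_expedited() patch above was applied.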
* Re: Kernel migration eat CPUs
       [not found] ` <20130825142837.GD31370@twins.programming.kicks-ass.net>
  2013-08-29 10:10   ` Kernel migration eat CPUs Alexey Vlasov
@ 2013-09-04 18:53   ` Alexey Vlasov
  2013-09-05 11:12     ` Ingo Molnar
  1 sibling, 1 reply; 7+ messages in thread
From: Alexey Vlasov @ 2013-09-04 18:53 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Andrew Morton, Ingo Molnar, Mike Galbraith, linux-kernel

On Sun, Aug 25, 2013 at 04:28:37PM +0200, Peter Zijlstra wrote:
> and a function (or function-graph)
> trace for when this happens?

Unfortunately I could not get a trace.

First, when the migration threads start eating CPU and I turn tracing
on:

# echo function > current_tracer
# echo 1 > tracing_on

the server goes down at once. The load average climbs to several
thousand and it stops responding to commands. Sometimes I even have to
reboot it.

And second, when I do manage to get a trace, it shows little useful
data:

https://www.dropbox.com/s/kxgwr9fuckddtgf/trace.txt.gz

Could this junk be coming from the grsec patch? And I don't see the
migration threads in it at all.

> Perhaps perf top -p <pick-a-thread> while the problem is happening will
> show something.

"perf top" doesn't show anything either. It reports the same values
while the migration threads are eating CPU as when everything on the
server is fine.

=========================================
   PerfTop: 68932 irqs/sec  kernel:60.3%  exact: 0.0% [4000Hz cycles], (all, 32 CPUs)
---------------------------------------------------------------------------------
    64.11%  [netconsole]           [k] 0xffffffff81000de3
    34.19%  [unknown]              [.] 0x0000000000585c45
     0.40%  libc-2.15.so           [.] 0x0000000000124b10
     0.23%  perf                   [.] 0x000000000004df4f
     0.14%  ld-2.15.so             [.] 0x00000000000183e7
     0.08%  libc-2.15.so           [.] _nss_files_parse_grent
     0.06%  libc-2.15.so           [.] _nss_files_parse_pwent
     0.04%  libc-2.15.so           [.] memchr
     0.04%  php-cgi                [.] match
     0.03%  libcrypt-2.15.so       [.] 0x0000000000004eab
     0.03%  bash                   [.] 0x000000000001fe5a
     0.02%  libc-2.15.so           [.] _IO_getline_info
     0.02%  php-cgi                [.] php_pcre_exec
     0.02%  libc-2.15.so           [.] _dl_addr
     0.02%  eaccelerator.so        [.] eaccelerator_crc32
     0.02%  php-cgi                [.] zend_hash_quick_find
     0.02%  libnss_compat-2.15.so  [.] 0x00000000000055f1
     0.01%  php-cgi                [.] zendparse
     0.01%  libc-2.15.so           [.] fgets_unlocked
     0.01%  libnss_compat-2.15.so  [.] _nss_compat_initgroups_dyn
=========================================

The server is now more loaded, the migration threads eat CPU more
often, and each episode lasts longer.

-- 
BRGDS. Alexey Vlasov.

^ permalink raw reply  [flat|nested] 7+ messages in thread
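A lower-overhead capture might have been possible by restricting ftrace
to the migration threads before enabling it. A sketch, using only
standard ftrace control files and assuming debugfs is mounted in the
usual place (none of these commands appear in the original mails):

  cd /sys/kernel/debug/tracing
  echo 0 > tracing_on
  for pid in $(pgrep '^migration/'); do
      echo "$pid" >> set_ftrace_pid     # trace only these tasks
  done
  echo function_graph > current_tracer
  echo 1 > tracing_on
  sleep 10                              # capture one episode
  echo 0 > tracing_on
  cat trace > /tmp/migration-trace.txt

Filtering by pid keeps the tracer from instrumenting every task on all
32 CPUs, which is the overhead that plausibly took the box down above.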
* Re: Kernel migration eat CPUs
  2013-09-04 18:53     ` Alexey Vlasov
@ 2013-09-05 11:12       ` Ingo Molnar
  2013-09-11 15:18         ` Alexey Vlasov
  0 siblings, 1 reply; 7+ messages in thread
From: Ingo Molnar @ 2013-09-05 11:12 UTC (permalink / raw)
  To: Alexey Vlasov
  Cc: Peter Zijlstra, Andrew Morton, Ingo Molnar, Mike Galbraith, linux-kernel

* Alexey Vlasov <renton@renton.name> wrote:

> On Sun, Aug 25, 2013 at 04:28:37PM +0200, Peter Zijlstra wrote:
>
> > and a function (or function-graph)
> > trace for when this happens?
>
> Unfortunately I could not get a trace.
> First, when the migration threads start eating CPU and I turn tracing
> on:
> # echo function > current_tracer
> # echo 1 > tracing_on
> the server goes down at once. The load average climbs to several
> thousand and it stops responding to commands. Sometimes I even have to
> reboot it.

Could you try the latest -tip tree at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git

In particular this fix:

  5a8e01f8fa51 sched/cputime: Do not scale when utime == 0

could perhaps fix the phantom CPU overhead you are seeing?

Thanks,

        Ingo

^ permalink raw reply  [flat|nested] 7+ messages in thread
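Confirming that the fetched tree actually contains the suggested commit
before building is a one-liner; the git commands here are generic, only
the URL and commit id come from the mail above:

  git clone git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
  cd tip
  # should print "sched/cputime: Do not scale when utime == 0"
  git show --stat 5a8e01f8fa51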
* Re: Kernel migration eat CPUs
  2013-09-05 11:12       ` Ingo Molnar
@ 2013-09-11 15:18         ` Alexey Vlasov
  2013-10-10  7:13           ` Alexey Vlasov
  0 siblings, 1 reply; 7+ messages in thread
From: Alexey Vlasov @ 2013-09-11 15:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Andrew Morton, Ingo Molnar, Mike Galbraith, linux-kernel

On Thu, Sep 05, 2013 at 01:12:52PM +0200, Ingo Molnar wrote:
>
> Could you try the latest -tip tree at:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
>
> In particular this fix:
>
>   5a8e01f8fa51 sched/cputime: Do not scale when utime == 0
>
> could perhaps fix the phantom CPU overhead you are seeing?

This kernel solved the problem. Thank you.

-- 
BRGDS. Alexey Vlasov.

^ permalink raw reply  [flat|nested] 7+ messages in thread
* Re: Kernel migration eat CPUs
  2013-09-11 15:18         ` Alexey Vlasov
@ 2013-10-10  7:13           ` Alexey Vlasov
  0 siblings, 0 replies; 7+ messages in thread
From: Alexey Vlasov @ 2013-10-10 7:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Andrew Morton, Ingo Molnar, Mike Galbraith, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 615 bytes --]

Hi Ingo,

On Wed, Sep 11, 2013 at 07:18:14PM +0400, Alexey Vlasov wrote:
> On Thu, Sep 05, 2013 at 01:12:52PM +0200, Ingo Molnar wrote:
> >
> > Could you try the latest -tip tree at:
> >
> >   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
> >
> > In particular this fix:
> >
> >   5a8e01f8fa51 sched/cputime: Do not scale when utime == 0
> >
> > could perhaps fix the phantom CPU overhead you are seeing?
>
> This kernel solved the problem. Thank you.

One problem is solved, but another has appeared: this kernel has now
started to crash. The crash dump is attached.

-- 
BRGDS. Alexey Vlasov.

[-- Attachment #2: bug_kernel-git-tip.txt --]
[-- Type: text/plain, Size: 14160 bytes --]

Oct  4 20:42:55 l25 [255033.586773] BUG: unable to handle kernel paging request at 0000000100000000
Oct  4 20:42:55 l25 [255033.594760] IP: [<ffffffff811235b1>] kmem_cache_alloc+0x51/0x120
Oct  4 20:42:55 l25 [255033.601622] PGD 199b953067 PUD 0
Oct  4 20:42:55 l25 [255033.605481] Oops: 0000 [#1] SMP
Oct  4 20:42:55 l25 [255033.609249] Modules linked in: netconsole flashcache(O) x86_pkg_temp_thermal
Oct  4 20:42:55 l25 [255033.617333] CPU: 13 PID: 21874 Comm: httpd Tainted: G IO 3.11.0-1gb-cm-tr+ #1
Oct  4 20:42:55 l25 [255033.626297] Hardware name: Intel Corporation S2600WP/S2600WP, BIOS SE5C600.86B.01.06.0002.110120121539 11/01/2012
Oct  4 20:42:55 l25 [255033.637900] task: ffff880018108000 ti: ffff88003aafa000 task.ti: ffff88003aafa000
Oct  4 20:42:55 l25 [255033.646386] RIP: 0010:[<ffffffff811235b1>]  [<ffffffff811235b1>] kmem_cache_alloc+0x51/0x120
Oct  4 20:42:55 l25 [255033.656009] RSP: 0018:ffff88003aafbe18  EFLAGS: 00010206
Oct  4 20:42:55 l25 [255033.662077] RAX: 0000000000000000 RBX: ffff880e6fac6d80 RCX: 0000000000000000
Oct  4 20:42:55 l25 [255033.670165] RDX: 000000000a4f5d48 RSI: 00000000000080d0 RDI: 0000000000015380
Oct  4 20:42:55 l25 [255033.678256] RBP: ffff88003aafbe58 R08: ffff881c3fcb5380 R09: ffffffff81179fdf
Oct  4 20:42:55 l25 [255033.686442] R10: 0000000001d26b20 R11: 0000000000000246 R12: ffff8813ef803800
Oct  4 20:42:55 l25 [255033.694705] R13: 0000000100000000 R14: ffff880018108000 R15: ffff88161f1eac00
Oct  4 20:42:55 l25 [255033.702968] FS:  00007f3d46edc740(0000) GS:ffff881c3fca0000(0000) knlGS:0000000000000000
Oct  4 20:42:55 l25 [255033.733352] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct  4 20:42:55 l25 [255033.739996] CR2: 0000000100000000 CR3: 0000001688961000 CR4: 00000000000407e0
Oct  4 20:42:55 l25 [255033.748263] Stack:
Oct  4 20:42:55 l25 [255033.750716]  0000000000000000 000080d000000000 0000000000000000 ffff880e6fac6d80
Oct  4 20:42:55 l25 [255033.759374]  ffff88003aafbec8 ffff880e6fac6dc8 ffff880018108000 ffff88161f1eac00
Oct  4 20:42:55 l25 [255033.768028]  ffff88003aafbe78 ffffffff81179fdf 0000000000000001 ffff880e6fac6d80
Oct  4 20:42:55 l25 [255033.776672] Call Trace:
Oct  4 20:42:55 l25 [255033.779612]  [<ffffffff81179fdf>] locks_alloc_lock+0x1f/0x70
Oct  4 20:42:55 l25 [255033.786139]  [<ffffffff8117a284>] flock_lock_file_wait+0x254/0x380
Oct  4 20:42:55 l25 [255033.793262]  [<ffffffff81123658>] ? kmem_cache_alloc+0xf8/0x120
Oct  4 20:42:55 l25 [255033.800085]  [<ffffffff8117b56b>] SyS_flock+0x1bb/0x1d0
Oct  4 20:42:55 l25 [255033.806145]  [<ffffffff81502482>] system_call_fastpath+0x16/0x1b
Oct  4 20:42:55 l25 [255033.813062] Code: 03 04 25 08 cc 00 00 49 8b 50 08 4d 8b 28 49 83 78 10 00 0f 84 be 00 00 00 4d 85 ed 0f 84 b5 00 00 00 49 63 44 24 20 49 8b 3c 24 <49> 8b 5c 05 00 48 8d 4a 01 4c 89 e8 65 48 0f c7 0f 0f 94 c0 84
Oct  4 20:42:55 l25 [255033.835977] RIP  [<ffffffff811235b1>] kmem_cache_alloc+0x51/0x120
Oct  4 20:42:55 l25 [255033.843039]  RSP <ffff88003aafbe18>
Oct  4 20:42:55 l25 [255033.847166] CR2: 0000000100000000
Oct  4 20:42:55 l25 [255033.851756] ---[ end trace 74c79b3b05e82a8a ]---
Oct  4 20:42:55 l25 [255033.870833] BUG: unable to handle kernel paging request at 0000000100000000
Oct  4 20:42:55 l25 [255033.870842] IP: [<ffffffff8112247f>] __kmalloc+0x6f/0x170
Oct  4 20:42:55 l25 [255033.870844] PGD 199b953067 PUD 0
Oct  4 20:42:55 l25 [255033.870846] Oops: 0000 [#2] SMP
Oct  4 20:42:55 l25 [255033.870850] Modules linked in: netconsole flashcache(O) x86_pkg_temp_thermal
Oct  4 20:42:55 l25 [255033.870853] CPU: 13 PID: 16854 Comm: kworker/13:2 Tainted: G D IO 3.11.0-1gb-cm-tr+ #1
Oct  4 20:42:55 l25 [255033.870854] Hardware name: Intel Corporation S2600WP/S2600WP, BIOS SE5C600.86B.01.06.0002.110120121539 11/01/2012
Oct  4 20:42:55 l25 [255033.870860] Workqueue: xfs-log/dm-0 xfs_log_worker
Oct  4 20:42:55 l25 [255033.870861] task: ffff881c0df188f0 ti: ffff881206806000 task.ti: ffff881206806000
Oct  4 20:42:55 l25 [255033.870864] RIP: 0010:[<ffffffff8112247f>]  [<ffffffff8112247f>] __kmalloc+0x6f/0x170
Oct  4 20:42:55 l25 [255033.870865] RSP: 0018:ffff881206807c48  EFLAGS: 00010206
Oct  4 20:42:55 l25 [255033.870866] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000020
Oct  4 20:42:55 l25 [255033.870867] RDX: 000000000a4f5d48 RSI: 0000000000000000 RDI: 0000000000015380
Oct  4 20:42:55 l25 [255033.870868] RBP: ffff881206807c88 R08: ffff881c3fcb5380 R09: ffffffff811e1fbf
Oct  4 20:42:55 l25 [255033.870869] R10: 0000000000000000 R11: 0000000000005572 R12: ffff8813ef803800
Oct  4 20:42:55 l25 [255033.870870] R13: 0000000100000000 R14: 0000000000000098 R15: 0000000051eb851f
Oct  4 20:42:55 l25 [255033.870871] FS:  0000000000000000(0000) GS:ffff881c3fca0000(0000) knlGS:0000000000000000
Oct  4 20:42:55 l25 [255033.870873] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct  4 20:42:55 l25 [255033.870877] CR2: 0000000100000000 CR3: 0000001688961000 CR4: 00000000000407e0
Oct  4 20:42:55 l25 [255033.870886] Stack:
Oct  4 20:42:55 l25 [255033.870895]  0000000000000250 0000025000000050 ffff881c0e415980 0000000000000000
Oct  4 20:42:55 l25 [255033.870901]  0000000000000250 0000000000000098 0000000000000000 0000000051eb851f
Oct  4 20:42:55 l25 [255033.870907]  ffff881206807cc8 ffffffff811e1fbf 0000000000000009 ffff8808dfb7ee10
Oct  4 20:42:55 l25 [255033.870909] Call Trace:
Oct  4 20:42:55 l25 [255033.870917]  [<ffffffff811e1fbf>] kmem_alloc+0x6f/0xe0
Oct  4 20:42:55 l25 [255033.870921]  [<ffffffff8122ba6f>] xfs_log_commit_cil+0x10f/0x500
Oct  4 20:42:56 l25 [255033.870924]  [<ffffffff81226c21>] xfs_trans_commit+0x71/0x250
Oct  4 20:42:56 l25 [255033.870929]  [<ffffffff811d3051>] xfs_fs_log_dummy+0x61/0x90
Oct  4 20:42:56 l25 [255033.870931]  [<ffffffff81228d7b>] ? xfs_log_need_covered+0x9b/0xd0
Oct  4 20:42:56 l25 [255033.870933]  [<ffffffff8122a4a8>] xfs_log_worker+0x48/0x50
Oct  4 20:42:56 l25 [255033.870939]  [<ffffffff8105e291>] process_one_work+0x171/0x420
Oct  4 20:42:56 l25 [255033.870942]  [<ffffffff8105f396>] worker_thread+0x126/0x360
Oct  4 20:42:56 l25 [255033.870945]  [<ffffffff8105f270>] ? manage_workers.clone.19+0x270/0x270
Oct  4 20:42:56 l25 [255033.870949]  [<ffffffff810658ee>] kthread+0xce/0xe0
Oct  4 20:42:56 l25 [255033.870952]  [<ffffffff81065820>] ? __kthread_parkme+0x90/0x90
Oct  4 20:42:56 l25 [255033.870957]  [<ffffffff815023dc>] ret_from_fork+0x7c/0xb0
Oct  4 20:42:56 l25 [255033.870959]  [<ffffffff81065820>] ? __kthread_parkme+0x90/0x90
Oct  4 20:42:56 l25 [255033.870976] Code: 03 04 25 08 cc 00 00 49 8b 50 08 4d 8b 28 49 83 78 10 00 0f 84 d4 00 00 00 4d 85 ed 0f 84 cb 00 00 00 49 63 44 24 20 49 8b 3c 24 <49> 8b 5c 05 00 48 8d 4a 01 4c 89 e8 65 48 0f c7 0f 0f 94 c0 84
Oct  4 20:42:56 l25 [255033.870978] RIP  [<ffffffff8112247f>] __kmalloc+0x6f/0x170
Oct  4 20:42:56 l25 [255033.870979]  RSP <ffff881206807c48>
Oct  4 20:42:56 l25 [255033.870979] CR2: 0000000100000000
Oct  4 20:42:56 l25 [255033.870981] ---[ end trace 74c79b3b05e82a8b ]---
Oct  4 20:42:56 l25 [255033.879442] BUG: unable to handle kernel paging request at ffffffffffffffd8
Oct  4 20:42:56 l25 [255033.879448] IP: [<ffffffff81065c50>] kthread_data+0x10/0x20
Oct  4 20:42:56 l25 [255033.879451] PGD 1a0c067 PUD 1a0e067 PMD 0
Oct  4 20:42:56 l25 [255033.879454] Oops: 0000 [#3] SMP
Oct  4 20:42:56 l25 [255033.879458] Modules linked in: netconsole flashcache(O) x86_pkg_temp_thermal
Oct  4 20:42:56 l25 [255033.879461] CPU: 13 PID: 16854 Comm: kworker/13:2 Tainted: G D IO 3.11.0-1gb-cm-tr+ #1
Oct  4 20:42:56 l25 [255033.879463] Hardware name: Intel Corporation S2600WP/S2600WP, BIOS SE5C600.86B.01.06.0002.110120121539 11/01/2012
Oct  4 20:42:56 l25 [255033.879477] task: ffff881c0df188f0 ti: ffff881206806000 task.ti: ffff881206806000
Oct  4 20:42:56 l25 [255033.879482] RIP: 0010:[<ffffffff81065c50>]  [<ffffffff81065c50>] kthread_data+0x10/0x20
Oct  4 20:42:56 l25 [255033.879484] RSP: 0018:ffff881206807778  EFLAGS: 00010096
Oct  4 20:42:56 l25 [255033.879485] RAX: 0000000000000000 RBX: 000000000000000d RCX: ffffffffff48e500
Oct  4 20:42:56 l25 [255033.879487] RDX: ffff881c3fcb2700 RSI: 000000000000000d RDI: ffff881c0df188f0
Oct  4 20:42:56 l25 [255033.879488] RBP: ffff881206807778 R08: 00000000000000bd R09: ffffea004caae480
Oct  4 20:42:56 l25 [255033.879489] R10: ffffffff8127804b R11: 0000000000000000 R12: 000000000000000d
Oct  4 20:42:56 l25 [255033.879491] R13: ffff881c0df18dd0 R14: ffff8813d5590000 R15: 0000000000000000
Oct  4 20:42:56 l25 [255033.879493] FS:  0000000000000000(0000) GS:ffff881c3fca0000(0000) knlGS:0000000000000000
Oct  4 20:42:56 l25 [255033.879494] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct  4 20:42:56 l25 [255033.879496] CR2: 0000000000000028 CR3: 0000001688961000 CR4: 00000000000407e0
Oct  4 20:42:56 l25 [255033.879497] Stack:
Oct  4 20:42:56 l25 [255033.879503]  ffff8812068077a8 ffffffff8105f655 000000000000000d ffff881c0df18dd0
Oct  4 20:42:56 l25 [255033.879508]  ffff8812068077a8 ffff881c3fcb2700 ffff8812068078e8 ffffffff814fff88
Oct  4 20:42:56 l25 [255033.879514]  ffff881c0df188f0 0000000000012700 ffff881206807fd8 ffff881206806000
Oct  4 20:42:56 l25 [255033.879514] Call Trace:
Oct  4 20:42:56 l25 [255033.879521]  [<ffffffff8105f655>] wq_worker_sleeping+0x15/0xa0
Oct  4 20:42:56 l25 [255033.879529]  [<ffffffff814fff88>] __schedule+0x4f8/0x920
Oct  4 20:42:56 l25 [255033.879535]  [<ffffffff81265ce6>] ? put_io_context+0x86/0xc0
Oct  4 20:42:56 l25 [255033.879539]  [<ffffffff81265de2>] ? put_io_context_active+0xc2/0xf0
Oct  4 20:42:56 l25 [255033.879543]  [<ffffffff81500469>] schedule+0x29/0x70
Oct  4 20:42:56 l25 [255033.879550]  [<ffffffff81047b8d>] do_exit+0x6bd/0x9c0
Oct  4 20:42:56 l25 [255033.879556]  [<ffffffff81005e81>] oops_end+0x81/0xb0
Oct  4 20:42:56 l25 [255033.879574]  [<ffffffff81037daa>] no_context+0x12a/0x2e0
Oct  4 20:42:56 l25 [255033.879579]  [<ffffffff8103807d>] __bad_area_nosemaphore+0x11d/0x220
Oct  4 20:42:56 l25 [255033.879584]  [<ffffffff81038193>] bad_area_nosemaphore+0x13/0x20
Oct  4 20:42:56 l25 [255033.879588]  [<ffffffff8103874a>] __do_page_fault+0x2ea/0x490
Oct  4 20:42:56 l25 [255033.879593]  [<ffffffff8103892e>] do_page_fault+0xe/0x10
Oct  4 20:42:56 l25 [255033.879598]  [<ffffffff81501cb2>] page_fault+0x22/0x30
Oct  4 20:42:56 l25 [255033.879602]  [<ffffffff811e1fbf>] ? kmem_alloc+0x6f/0xe0
Oct  4 20:42:56 l25 [255033.879607]  [<ffffffff8112247f>] ? __kmalloc+0x6f/0x170
Oct  4 20:42:56 l25 [255033.879611]  [<ffffffff81122441>] ? __kmalloc+0x31/0x170
Oct  4 20:42:56 l25 [255033.879614]  [<ffffffff811e1fbf>] kmem_alloc+0x6f/0xe0
Oct  4 20:42:56 l25 [255033.879619]  [<ffffffff8122ba6f>] xfs_log_commit_cil+0x10f/0x500
Oct  4 20:42:56 l25 [255033.879623]  [<ffffffff81226c21>] xfs_trans_commit+0x71/0x250
Oct  4 20:42:56 l25 [255033.879627]  [<ffffffff811d3051>] xfs_fs_log_dummy+0x61/0x90
Oct  4 20:42:56 l25 [255033.879631]  [<ffffffff81228d7b>] ? xfs_log_need_covered+0x9b/0xd0
Oct  4 20:42:56 l25 [255033.879634]  [<ffffffff8122a4a8>] xfs_log_worker+0x48/0x50
Oct  4 20:42:56 l25 [255033.879639]  [<ffffffff8105e291>] process_one_work+0x171/0x420
Oct  4 20:42:56 l25 [255033.879643]  [<ffffffff8105f396>] worker_thread+0x126/0x360
Oct  4 20:42:56 l25 [255033.879648]  [<ffffffff8105f270>] ? manage_workers.clone.19+0x270/0x270
Oct  4 20:42:56 l25 [255033.879652]  [<ffffffff810658ee>] kthread+0xce/0xe0
Oct  4 20:42:56 l25 [255033.879656]  [<ffffffff81065820>] ? __kthread_parkme+0x90/0x90
Oct  4 20:42:56 l25 [255033.879659]  [<ffffffff815023dc>] ret_from_fork+0x7c/0xb0
Oct  4 20:42:56 l25 [255033.879663]  [<ffffffff81065820>] ? __kthread_parkme+0x90/0x90
Oct  4 20:42:56 l25 [255033.879690] Code: 88 04 00 00 48 8b 40 c8 c9 48 c1 e8 02 83 e0 01 c3 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 66 66 66 66 90 48 8b 87 88 04 00 00 <48> 8b 40 d8 c9 c3 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 48
Oct  4 20:42:56 l25 [255033.879693] RIP  [<ffffffff81065c50>] kthread_data+0x10/0x20
Oct  4 20:42:56 l25 [255033.879694] RSP <ffff881206807778>
Oct  4 20:42:56 l25 [255033.879695] CR2: ffffffffffffffd8
Oct  4 20:42:56 l25 [255033.879697] ---[ end trace 74c79b3b05e82a8c ]---
Oct  4 20:42:56 l25 [255033.890764] Fixing recursive fault but reboot is needed!
Oct  4 20:43:15 l25 [255053.594096] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 13
Oct  4 20:43:16 l25 [255054.636585] Shutting down cpus with NMI

^ permalink raw reply  [flat|nested] 7+ messages in thread
* Kernel migration eat CPUs
@ 2013-08-22 15:00 Alexey Vlasov
  2013-08-29  7:31 ` Mike Galbraith
  0 siblings, 1 reply; 7+ messages in thread
From: Alexey Vlasov @ 2013-08-22 15:00 UTC (permalink / raw)
  To: linux-kernel

Hi,

Starting with kernel 3.7, and up to and including 3.10, I have noticed
that from time to time all CPUs become 100% busy, and while it lasts my
clients' websites run slowly or don't work at all.

top screen:
https://www.dropbox.com/s/psuk2cwn6exinyd/3.9.cpu_migration.jpg

All CPUs are busy with the kernel migration threads for about 3-5
minutes. I don't know exactly how periodic it is, but it happens once
every 1-3 hours.

Approximately 30 of my servers run kernel 2.6.35 and there is no such
problem there. We use kernels 3.7-3.9 on 5 servers and I see this
effect everywhere. After a downgrade to 2.6.35, migration no longer
eats CPU.

These servers are used for ordinary LAMP shared hosting.

Can I get migration behaviour like that of kernel 2.6.35, or can this
feature be disabled?

-- 
BRGDS. Alexey Vlasov.
jabber: renton@1gb.ru
icq: 5128664

^ permalink raw reply  [flat|nested] 7+ messages in thread
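When an episode hits, one quick way to quantify it is to sample only the
migration threads. A sketch; pidstat comes from the sysstat package and
is an assumption here, not a tool mentioned in the thread:

  # per-second CPU usage of all migration threads, 5 samples
  pidstat -p "$(pgrep -d, '^migration/')" 1 5
  # or, with plain procps:
  top -b -n 1 | grep 'migration/'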
* Re: Kernel migration eat CPUs
  2013-08-22 15:00 Alexey Vlasov
@ 2013-08-29  7:31 ` Mike Galbraith
  0 siblings, 0 replies; 7+ messages in thread
From: Mike Galbraith @ 2013-08-29 7:31 UTC (permalink / raw)
  To: Alexey Vlasov; +Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Mike Galbraith

(CCs)

On Thu, 2013-08-22 at 19:00 +0400, Alexey Vlasov wrote:
> Hi,
>
> Starting with kernel 3.7, and up to and including 3.10, I have noticed
> that from time to time all CPUs become 100% busy, and while it lasts my
> clients' websites run slowly or don't work at all.
>
> top screen:
> https://www.dropbox.com/s/psuk2cwn6exinyd/3.9.cpu_migration.jpg
>
> All CPUs are busy with the kernel migration threads for about 3-5
> minutes. I don't know exactly how periodic it is, but it happens once
> every 1-3 hours.
>
> Approximately 30 of my servers run kernel 2.6.35 and there is no such
> problem there. We use kernels 3.7-3.9 on 5 servers and I see this
> effect everywhere. After a downgrade to 2.6.35, migration no longer
> eats CPU.
>
> These servers are used for ordinary LAMP shared hosting.
>
> Can I get migration behaviour like that of kernel 2.6.35, or can this
> feature be disabled?

I've not seen anything like this since the below went into 3.6, nor
heard of the same. Before that, migration threads could _appear_ to eat
CPU under certain circumstances.

Perhaps perf top -p <pick-a-thread> while the problem is happening will
show something.

commit 8f6189684eb4e85e6c593cd710693f09c944450a
Author: Mike Galbraith <mgalbraith@suse.de>
Date:   Sat Aug 4 05:44:14 2012 +0200

    sched: Fix migration thread runtime bogosity

    Make stop scheduler class do the same accounting as other classes.

    Migration threads can be caught in the act while doing exec
    balancing, leading to the below due to use of unmaintained
    ->se.exec_start. The load that triggered this particular instance
    was an apparently out of control heavily threaded application that
    does system monitoring in what equated to an exec bomb, with one of
    the VERY frequently migrated tasks being ps.

    %CPU   PID USER     CMD
    99.3    45 root     [migration/10]
    97.7    53 root     [migration/12]
    97.0    57 root     [migration/13]
    90.1    49 root     [migration/11]
    89.6    65 root     [migration/15]
    88.7    17 root     [migration/3]
    80.4    37 root     [migration/8]
    78.1    41 root     [migration/9]
    44.2    13 root     [migration/2]

^ permalink raw reply  [flat|nested] 7+ messages in thread
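Picking a thread for that perf suggestion is itself a one-liner. A
sketch; the pgrep and /proc usage is generic rather than from the mail,
and the schedstat file assumes schedstats support in the kernel:

  pid=$(pgrep -x 'migration/10')   # one busy thread, as in the table above
  perf top -p "$pid"
  # quick check without perf: cumulative on-CPU time (ns), runqueue wait
  # time (ns) and timeslice count for that thread
  cat /proc/"$pid"/schedstat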
end of thread, other threads: [~2013-10-10  7:16 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <20130823151711.f4c1f596b4c7aa1eecccc9a6@linux-foundation.org>
[not found] ` <20130825142837.GD31370@twins.programming.kicks-ass.net>
2013-08-29 10:10 ` Kernel migration eat CPUs Alexey Vlasov
2013-09-04 18:53 ` Alexey Vlasov
2013-09-05 11:12 ` Ingo Molnar
2013-09-11 15:18 ` Alexey Vlasov
2013-10-10 7:13 ` Alexey Vlasov
2013-08-22 15:00 Alexey Vlasov
2013-08-29 7:31 ` Mike Galbraith
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox