* vma corruption in today's -git

From: Dave Jones @ 2011-03-29 4:09 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Andrew Morton, Linux Kernel

Running a kernel that I built from a git pull from this afternoon,
I ran a barrage of stress tests, including some of the new syscall fuzzing
work that I've been working on. Everything seemed just fine, except that
after I had stopped running everything, loadavg was still at 6, with
top showing kworker processes running at high % of cpu.

I ran perf top, and got this ..

BUG: unable to handle kernel paging request at fffffffffffffff3
IP: [<ffffffff8118d3e9>] vma_stop.clone.3+0x18/0x33
PGD 1a05067 PUD 1a06067 PMD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/LNXSYSTM:00/device:00/PNP0C0A:00/power_supply/BAT0/energy_full
CPU 0
Modules linked in: ipx p8022 psnap llc p8023 sctp libcrc32c tun vfat fat usb_storage uas nfs fscache fuse nfsd lockd nfs_acl auth_rpcgss sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables ipv6 uinput arc4 iwlagn btusb bluetooth snd_hda_codec_hdmi snd_hda_codec_idt mac80211 snd_hda_intel snd_usb_audio snd_hda_codec dell_wmi snd_seq sparse_keymap uvcvideo snd_pcm dell_laptop snd_hwdep dcdbas snd_usbmidi_lib microcode videodev zaurus cfg80211 snd_rawmidi snd_seq_device cdc_ether v4l2_compat_ioctl32 usbnet snd_timer iTCO_wdt pcspkr joydev cdc_acm i2c_i801 cdc_wdm tg3 snd mii iTCO_vendor_support rfkill soundcore snd_page_alloc wmi i915 drm_kms_helper drm i2c_algo_bit i2c_core video [last unloaded: scsi_wait_scan]

Pid: 324, comm: perf Tainted: G        W   2.6.38+ #5 Dell Inc. Adamo 13 /0N70T0
RIP: 0010:[<ffffffff8118d3e9>]  [<ffffffff8118d3e9>] vma_stop.clone.3+0x18/0x33
RSP: 0018:ffff880114de1e48  EFLAGS: 00010286
RAX: 00000000fffffff3 RBX: ffff880007f44bc8 RCX: ffff8800b7216806
RDX: ffffffff81618f50 RSI: fffffffffffffff3 RDI: ffff880007f44bd8
RBP: ffff880114de1e58 R08: 0000000000000001 R09: 0000000000000000
R10: ffffffff81a26870 R11: 0000000000000246 R12: ffff88000fe3d880
R13: ffff8800b72168f8 R14: fffffffffffffff3 R15: 0000000000000000
FS:  00007f59e738c720(0000) GS:ffff88013fa00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: fffffffffffffff3 CR3: 0000000136566000 CR4: 00000000000406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process perf (pid: 324, threadinfo ffff880114de0000, task ffff8800b43c23b0)
Stack:
 0000000000000000 ffff880007f44bc8 ffff880114de1e78 ffffffff8118d46a
 0000000000000000 0000000000000400 ffff880114de1ef8 ffffffff81159a99
 0000000000000004 ffff8801fffffff3 0000000000000004 ffff8800b7216930
Call Trace:
 [<ffffffff8118d46a>] m_stop+0x1e/0x30
 [<ffffffff81159a99>] seq_read+0x240/0x36e
 [<ffffffff8113dc28>] vfs_read+0xac/0xff
 [<ffffffff8113f0f8>] ? fget_light+0x3a/0xa1
 [<ffffffff8113dcc8>] sys_read+0x4d/0x74
 [<ffffffff814d6b42>] system_call_fastpath+0x16/0x1b
Code: 8b 54 24 08 48 89 53 30 41 5d 5b 41 5c 31 c0 41 5d c9 c3 55 48 89 e5 53 48 83 ec 08 0f 1f 44 00 00 48 85 f6 74 1c 48 3b 37 74 17 8b 1e 48 8d bb a8 00 00 00 e8 d9 e6 ee ff 48 89 df e8 9a ab
RIP  [<ffffffff8118d3e9>] vma_stop.clone.3+0x18/0x33
 RSP <ffff880114de1e48>
CR2: fffffffffffffff3
---[ end trace ea1a722d5e92836a ]---

it smells like something got corrupted, but I'm not sure where to start.
I've got a bunch of debug options turned on, but nothing else triggered,
so I've got nothing other than this oops to go on.

Any thoughts ?

	Dave

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: vma corruption in today's -git

From: Américo Wang @ 2011-03-29 4:19 UTC (permalink / raw)
To: Dave Jones, Linus Torvalds, Andrew Morton, Linux Kernel

On Tue, Mar 29, 2011 at 12:09 PM, Dave Jones <davej@redhat.com> wrote:
> Running a kernel that I built from a git pull from this afternoon,
> I ran a barrage of stress tests, including some of the new syscall fuzzing
> work that I've been working on. Everything seemed just fine, except that
> after I had stopped running everything, loadavg was still at 6, with
> top showing kworker processes running at high % of cpu.
>
> I ran perf top, and got this ..
>
> BUG: unable to handle kernel paging request at fffffffffffffff3
> IP: [<ffffffff8118d3e9>] vma_stop.clone.3+0x18/0x33

Should be fixed by:

commit 76597cd31470fa130784c78fadb4dab2e624a723
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Sun Mar 27 19:09:29 2011 -0700

    proc: fix oops on invalid /proc/<pid>/maps access

    When m_start returns an error, the seq_file logic will still call m_stop
    with that error entry, so we'd better make sure that we check it before
    using it as a vma.

    Introduced by commit ec6fd8a4355c ("report errors in /proc/*/*map* sanely"),
    which replaced NULL with various ERR_PTR() cases.

    (On ia64, you happen to get a unaligned fault instead of a page fault,
    since the address used is generally some random error code like -EPERM)

    Reported-by: Anca Emanuel <anca.emanuel@gmail.com>
    Reported-by: Tony Luck <tony.luck@intel.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Américo Wang <xiyou.wangcong@gmail.com>
    Cc: Stephen Wilson <wilsons@start.ca>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Re: vma corruption in today's -git

From: Dave Jones @ 2011-03-29 4:26 UTC (permalink / raw)
To: Américo Wang; +Cc: Linus Torvalds, Andrew Morton, Linux Kernel

On Tue, Mar 29, 2011 at 12:19:01PM +0800, Américo Wang wrote:
> On Tue, Mar 29, 2011 at 12:09 PM, Dave Jones <davej@redhat.com> wrote:
> > Running a kernel that I built from a git pull from this afternoon,
> > I ran a barrage of stress tests, including some of the new syscall fuzzing
> > work that I've been working on. Everything seemed just fine, except that
> > after I had stopped running everything, loadavg was still at 6, with
> > top showing kworker processes running at high % of cpu.
> >
> > I ran perf top, and got this ..
> >
> > BUG: unable to handle kernel paging request at fffffffffffffff3
> > IP: [<ffffffff8118d3e9>] vma_stop.clone.3+0x18/0x33
>
> Should be fixed by:
>
> commit 76597cd31470fa130784c78fadb4dab2e624a723
> Author: Linus Torvalds <torvalds@linux-foundation.org>
> Date:   Sun Mar 27 19:09:29 2011 -0700
>
>     proc: fix oops on invalid /proc/<pid>/maps access

excellent, thanks. I'll repull and try again.

	Dave
* Re: vma corruption in today's -git

From: Linus Torvalds @ 2011-03-29 4:22 UTC (permalink / raw)
To: Dave Jones, Linus Torvalds, Andrew Morton, Linux Kernel

On Mon, Mar 28, 2011 at 9:09 PM, Dave Jones <davej@redhat.com> wrote:
>
> I ran perf top, and got this ..
>
> BUG: unable to handle kernel paging request at fffffffffffffff3

Should be fixed by commit 76597cd31470.

Except that was already committed yesterday. When did you pull?

                  Linus
* excessive kworker activity when idle. (was Re: vma corruption in today's -git)

From: Dave Jones @ 2011-03-31 3:09 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Andrew Morton, Linux Kernel

On Mon, Mar 28, 2011 at 09:22:25PM -0700, Linus Torvalds wrote:
> On Mon, Mar 28, 2011 at 9:09 PM, Dave Jones <davej@redhat.com> wrote:
> >
> > I ran perf top, and got this ..
> >
> > BUG: unable to handle kernel paging request at fffffffffffffff3
>
> Should be fixed by commit 76597cd31470.
>
> Except that was already committed yesterday. When did you pull?

It did fix it, I must have missed that commit.

But rerunning the same tests on current head (6aba74f2791287ec407e0f92487a725a25908067)
I can still reproduce the problem where kworker threads go nutso
when the machine should be completely idle.

top shows two kworker threads constantly at >80% cpu.

I can now profile it with perf though, so I'll dig deeper.
I need to try a build with slab debug etc turned off, just to be sure that
isn't throwing things into a spin.

	Dave
* Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git)

From: Dave Jones @ 2011-03-31 3:34 UTC (permalink / raw)
To: Linus Torvalds, Andrew Morton, Linux Kernel

On Wed, Mar 30, 2011 at 11:09:18PM -0400, Dave Jones wrote:
> I can still reproduce the problem where kworker threads go nutso
> when the machine should be completely idle.
>
> top shows two kworker threads constantly at >80% cpu.
>
> I can now profile it with perf though, so I'll dig deeper.
> I need to try a build with slab debug etc turned off, just to be sure that
> isn't throwing things into a spin.

so 'perf kmem record sleep 5' shows hundreds of calls to kmem_cache_free from
the kworker processes, called from shmem_i_callback, __d_free and file_free_rcu.
My guess is that my fuzzing caused so many allocations that the rcu freeing is
still ongoing an hour or so after the process has quit. Does that make any sense?

Although if that hypothesis is true, shouldn't I be able to observe this by
watching slabtop? Because I don't see anything decreasing...

	Dave
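(Editor's note: the "watch slabtop" idea above can be checked more directly by sampling the active-object column of /proc/slabinfo for the caches behind those RCU callbacks. A minimal sketch, assuming a readable /proc/slabinfo; the helper name and the canned example line are illustrative, not from the thread:)

```shell
# slab_active: given a cache name, print its active-object count from
# /proc/slabinfo-style text on stdin. Sampling it repeatedly would show
# whether delayed RCU freeing is actually draining objects over time.
slab_active() {   # usage: slab_active <cache-name> < slabinfo-text
    awk -v cache="$1" '$1 == cache { print $2 }'
}

# Example against a canned line; real use would be e.g.
#   slab_active shmem_inode_cache < /proc/slabinfo
printf 'dentry 10423 11088 192 21 1\nfilp 862 987 256 16 1\n' | slab_active dentry
# prints 10423
```

(If the rcu-freeing hypothesis were right, repeated samples of the relevant cache would show the count falling between runs.)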
* Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git)

From: Linus Torvalds @ 2011-03-31 3:44 UTC (permalink / raw)
To: Dave Jones, Andrew Morton, Linux Kernel, Paul E. McKenney

On Wed, Mar 30, 2011 at 8:34 PM, Dave Jones <davej@redhat.com> wrote:
>
> so 'perf kmem record sleep 5' shows hundreds of calls kmem_cache_free from
> the kworker processes. Called from shmem_i_callback, __d_free and file_free_rcu.
> My guess is that my fuzzing caused so many allocations that the rcu freeing is
> still ongoing an hour or so after the process has quit. does that make any sense?

No, that shouldn't be the case. RCU freeing should go on for just a
few RCU periods, and be done. I think there is some "limit the work we
do for rcu each time" in order to not have bad latencies, but even so
that shouldn't take _that_ long. And as you say, you should see the
freeing in the slab stats.

So clearly there are shmem inodes being destroyed, but it shouldn't be
from an hour ago. I wonder if your system isn't as idle as you think
it is.

But I'm cc'ing Paul, maybe he'll disagree and say it's expected and
that the RCU batch size is really small. Or at least give some hint
about how to check the pending rcu state.

                  Linus
* Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git)

From: Dave Jones @ 2011-03-31 4:08 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Andrew Morton, Linux Kernel, Paul E. McKenney

On Wed, Mar 30, 2011 at 08:44:06PM -0700, Linus Torvalds wrote:
> So clearly there are shmem inodes being destroyed, but it shouldn't be
> from an hour ago. I wonder if your system isn't as idle as you think
> it is.

hmm. When I say idle, I mean I'm staring at a gnome-terminal with an ssh
client, and nothing else. Certainly nothing that should be causing the
cpu meter in the taskbar to almost max out, and definitely nothing that
should cause the fans in the laptop to be roaring.

If you want to try and repeat this experiment, this should do it..

git clone git://git.codemonkey.org.uk/trinity
cd trinity ; make
./trinity --mode=random -i

let it run for a while, then kill it. kworker should then go nuts.

	Dave
* Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git)

From: Linus Torvalds @ 2011-03-31 15:53 UTC (permalink / raw)
To: Dave Jones, Andrew Morton, Linux Kernel, Paul E. McKenney

And with your fixed trinity thing, I can now do this:

On Wed, Mar 30, 2011 at 9:08 PM, Dave Jones <davej@redhat.com> wrote:
>
> git clone git://git.codemonkey.org.uk/trinity
> cd trinity ; make
> ./trinity --mode=random -i
>
> let it run for a while, then kill it. kworker should then go nuts.

I really thought "hey, that reproduced it", because "top" did indeed
show kworker going nuts after a while. And perf did show it doing all
that

   | queue_work
   | schedule_work
   | flush_to_ldisc

that looks like the re-scheduling.

But when I killed trinity, kworker immediately calmed down too. Which
I'd actually expect, since any tty it is using (and causing the problem)
should calm down when it's closed - even if it was stuck before. But
maybe I haven't triggered just the right condition to actually get it
into some really infinite case.

Regardless, I'll put my money where my mouth is, and try to remove the
crazy re-flush thing.

                  Linus
* Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git)

From: Linus Torvalds @ 2011-03-31 16:21 UTC (permalink / raw)
To: Dave Jones, Andrew Morton, Linux Kernel, Paul E. McKenney

On Thu, Mar 31, 2011 at 8:53 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Regardless, I'll put my money where my mouth is, and try to remove the
> crazy re-flush thing.

Yeah, that doesn't work. The tty does actually lock up when it fills
up. So clearly we actually depended on that reflushing happening.

That said, I do still think it's the right thing to do to remove that
line, it just means that I need to figure out where the missing flush
is.

                  Linus
* Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git)

From: Linus Torvalds @ 2011-03-31 21:38 UTC (permalink / raw)
To: Dave Jones, Andrew Morton, Linux Kernel, Paul E. McKenney, Alan Cox, Greg KH

[-- Attachment #1: Type: text/plain, Size: 1704 bytes --]

On Thu, Mar 31, 2011 at 9:21 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Mar 31, 2011 at 8:53 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Regardless, I'll put my money where my mouth is, and try to remove the
> > crazy re-flush thing.
>
> Yeah, that doesn't work. The tty does actually lock up when it fills
> up. So clearly we actually depended on that reflushing happening.
>
> That said, I do still think it's the right thing to do to remove that
> line, it just means that I need to figure out where the missing flush
> is.

Ok, that was unexpected.

So the reason for the need to do that crazy "try to flush from the
flush routine" is that in the case of "receive_room" going down to zero
(which only happens for n_tty and for the case of switching ldisc's
around), if we hit that during flushing, nothing will apparently ever
re-start the flushing when receive_room then opens up again.

So instead of having that case re-start the flush, we end up saying
"ok, we'll just retry the flush over and over again", and essentially
poll for receive_room opening up. No wonder you've seen high CPU use
and thousands of calls a second.

The "seen_tail" case doesn't have that issue, because anything that
adds a new buffer to the tty list should always be flipping anyway.

So this attached patch would seem to work. Not heavily tested, but the
case that I could trivially trigger before doesn't trigger for me any
more. And I can't seem to get kworker to waste lots of CPU time any
more, but it was kind of hit-and-miss before too, so I don't know how
much that's worth..

The locking here is kind of iffy, but otherwise? Comments?

                  Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 1520 bytes --]

 drivers/tty/n_tty.c      |    6 ++++++
 drivers/tty/tty_buffer.c |    4 +---
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/tty/n_tty.c b/drivers/tty/n_tty.c
index 428f4fe..0ad3288 100644
--- a/drivers/tty/n_tty.c
+++ b/drivers/tty/n_tty.c
@@ -95,6 +95,7 @@ static void n_tty_set_room(struct tty_struct *tty)
 {
 	/* tty->read_cnt is not read locked ? */
 	int	left = N_TTY_BUF_SIZE - tty->read_cnt - 1;
+	int	old_left;
 
 	/*
 	 * If we are doing input canonicalization, and there are no
@@ -104,7 +105,12 @@ static void n_tty_set_room(struct tty_struct *tty)
 	 */
 	if (left <= 0)
 		left = tty->icanon && !tty->canon_data;
+	old_left = tty->receive_room;
 	tty->receive_room = left;
+
+	/* Did this open up the receive buffer? We may need to flip */
+	if (left && !old_left)
+		schedule_work(&tty->buf.work);
 }
 
 static void put_tty_queue_nolock(unsigned char c, struct tty_struct *tty)
diff --git a/drivers/tty/tty_buffer.c b/drivers/tty/tty_buffer.c
index b945121..f1a7918 100644
--- a/drivers/tty/tty_buffer.c
+++ b/drivers/tty/tty_buffer.c
@@ -442,10 +442,8 @@ static void flush_to_ldisc(struct work_struct *work)
 			   line discipline as we want to empty the queue */
 			if (test_bit(TTY_FLUSHPENDING, &tty->flags))
 				break;
-			if (!tty->receive_room || seen_tail) {
-				schedule_work(&tty->buf.work);
+			if (!tty->receive_room || seen_tail)
 				break;
-			}
 			if (count > tty->receive_room)
 				count = tty->receive_room;
 			char_buf = head->char_buf_ptr + head->read;
* Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git)

From: Paul E. McKenney @ 2011-03-31 14:59 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Dave Jones, Andrew Morton, Linux Kernel

On Wed, Mar 30, 2011 at 08:44:06PM -0700, Linus Torvalds wrote:
> On Wed, Mar 30, 2011 at 8:34 PM, Dave Jones <davej@redhat.com> wrote:
> >
> > so 'perf kmem record sleep 5' shows hundreds of calls kmem_cache_free from
> > the kworker processes. Called from shmem_i_callback, __d_free and file_free_rcu.
> > My guess is that my fuzzing caused so many allocations that the rcu freeing is
> > still ongoing an hour or so after the process has quit. does that make any sense?
>
> No, that shouldn't be the case. RCU freeing should go on for just a
> few RCU periods, and be done. I think there is some "limit the work we
> do for rcu each time" in order to not have bad latencies, but even so
> that shouldn't take _that_ long. And as you say, you should see the
> freeing in the slab stats.
>
> So clearly there are shmem inodes being destroyed, but it shouldn't be
> from an hour ago. I wonder if your system isn't as idle as you think
> it is.
>
> But I'm cc'ing Paul, maybe he'll disagree and say it's expected and
> that the RCU batch size is really small. Or at least give some hint
> about how to check the pending rcu state.

There is by default a ten-callback-per-softirq per-CPU limit (set by
module parameter "blimit"), but if there are more than 10,000 callbacks
queued (set by module parameter "qhimark"), RCU stops being Mr. Nice Guy
and just plows through the callbacks. I am having a really hard time
believing that you could queue a batch of callbacks and still have RCU
processing them an hour later.

But one way this can happen is if the callbacks requeue themselves, for
example, if the RCU callback function contains a call to call_rcu().
I am not seeing that in the __d_free and file_free_rcu callbacks that
Dave mentioned, but this does happen in some places in the Linux kernel.
And maybe such a place is being invoked in a non-obvious way.

There is a debugfs file "rcu/rcudata" that reports the current RCU queue
length that is enabled with CONFIG_RCU_TRACE. Here is sample output:

rcu_sched:
  0 c=17829 g=17829 pq=1 pqc=17829 qp=0 dt=10951/1 dn=0 df=1101 of=0 ri=36 ql=0 b=10
  1 c=17829 g=17829 pq=1 pqc=17829 qp=0 dt=16117/1 dn=0 df=1015 of=0 ri=0 ql=0 b=10
  2 c=17829 g=17829 pq=1 pqc=17829 qp=0 dt=1445/1 dn=0 df=1839 of=0 ri=0 ql=0 b=10
  3 c=17829 g=17829 pq=1 pqc=17829 qp=0 dt=6681/1 dn=0 df=1545 of=0 ri=0 ql=0 b=10
  4 c=17829 g=17829 pq=1 pqc=17829 qp=0 dt=1003/1 dn=0 df=1992 of=0 ri=0 ql=0 b=10
  5 c=17829 g=17830 pq=1 pqc=17829 qp=1 dt=3887/1 dn=0 df=3331 of=0 ri=4 ql=2 b=10
  6 c=17829 g=17829 pq=1 pqc=17829 qp=0 dt=859/1 dn=0 df=3224 of=0 ri=0 ql=0 b=10
  7 c=17829 g=17830 pq=0 pqc=17829 qp=1 dt=3761/1 dn=0 df=1818 of=0 ri=0 ql=2 b=10
rcu_bh:
  0 c=-275 g=-275 pq=1 pqc=-275 qp=0 dt=10951/1 dn=0 df=0 of=0 ri=0 ql=0 b=10
  1 c=-275 g=-275 pq=1 pqc=-275 qp=0 dt=16117/1 dn=0 df=13 of=0 ri=0 ql=0 b=10
  2 c=-275 g=-275 pq=1 pqc=-275 qp=0 dt=1445/1 dn=0 df=15 of=0 ri=0 ql=0 b=10
  3 c=-275 g=-275 pq=1 pqc=-275 qp=0 dt=6681/1 dn=0 df=9 of=0 ri=0 ql=0 b=10
  4 c=-275 g=-275 pq=1 pqc=-275 qp=0 dt=1003/1 dn=0 df=15 of=0 ri=0 ql=0 b=10
  5 c=-275 g=-275 pq=1 pqc=-275 qp=0 dt=3887/1 dn=0 df=15 of=0 ri=0 ql=0 b=10
  6 c=-275 g=-275 pq=1 pqc=-275 qp=0 dt=859/1 dn=0 df=15 of=0 ri=0 ql=0 b=10
  7 c=-275 g=-275 pq=1 pqc=-275 qp=0 dt=3761/1 dn=0 df=15 of=0 ri=0 ql=0 b=10

The "ql=" gives the per-CPU callback queue size. For example, CPU 7 has
two rcu_sched callbacks queued. In CONFIG_TREE_PREEMPT_RCU builds, there
will be an additional "rcu_preempt" section.

If you see some huge number after a "ql=", that would indicate a
problem. ;-)

Does this help?

							Thanx, Paul
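(Editor's note: rather than eyeballing every line for the "huge number after a ql=" case, the queue lengths can be extracted mechanically. A small sketch; the debugfs path and the canned sample lines are assumptions for illustration:)

```shell
# max_ql: read rcu/rcudata-style text on stdin and print the largest
# per-CPU callback queue length (the "ql=" field).
# Real use (assumes CONFIG_RCU_TRACE and a mounted debugfs):
#   max_ql < /sys/kernel/debug/rcu/rcudata
max_ql() {
    grep -o 'ql=[0-9]*' | cut -d= -f2 | sort -n | tail -1
}

# Example against two canned lines in the format shown above:
printf '5 c=17829 g=17830 ql=2 b=10\n7 c=17829 g=17830 ql=12034 b=10\n' | max_ql
# prints 12034
```

(A healthy idle system should print a small number; a value near or above the 10,000 "qhimark" threshold would point at the callback-requeue scenario described above.)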
* Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git)

From: Linus Torvalds @ 2011-03-31 3:37 UTC (permalink / raw)
To: Dave Jones, Andrew Morton, Linux Kernel, Tejun Heo

On Wed, Mar 30, 2011 at 8:09 PM, Dave Jones <davej@redhat.com> wrote:
>
> But rerunning the same tests on current head (6aba74f2791287ec407e0f92487a725a25908067)
> I can still reproduce the problem where kworker threads go nutso
> when the machine should be completely idle.
>
> top shows two kworker threads constantly at >80% cpu.

Ok, I've seen that "tons of cpu by kworker" triggered by a few
different issues.

One was the intel graphics driver doing monitor detection constantly,
and spending all its time in one of the worker threads doing __udelay()
for the stupid i2c driver.

The other case I've seen is a wireless thing that falls back to GPIO,
and spends a lot of CPU time on that.

I'm not saying yours is either of those cases, but one of the problems
with that behavior is that it's actually fairly hard to figure out what
the hell is happening. You don't see some nice thread description in
'top' any more (like you used to when everybody created their own
threads and didn't do the common worker thread thing), and the best
approach literally seems to be something like

   perf record -ag sleep 10
   perf report

which does tend to show what's going on, but it's still a ridiculous
way to do this.

(Powertop can also do it, and is probably a better thing to use, I'm
just used to "perf record" for other reasons, so..)

Tejun, would there be some saner way to get the information about
_where_ all the kworker time is going?

                  Linus
* Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git)

From: Dave Jones @ 2011-03-31 3:55 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Andrew Morton, Linux Kernel, Tejun Heo

On Wed, Mar 30, 2011 at 08:37:21PM -0700, Linus Torvalds wrote:
> You don't see some nice thread description
> in 'top' any more (like you used to when everybody created their own
> threads and didn't do the common worker thread thing), and the best
> approach literally seems to be something like
>
>    perf record -ag sleep 10
>    perf report
>
> which does tend to show what's going on, but it's still a ridiculous
> way to this.

I tried that, and wasn't particularly enlightened.

+   6.53%  kworker/1:2  [kernel.kallsyms]  [k] read_hpet
+   4.83%  kworker/0:0  [kernel.kallsyms]  [k] read_hpet
+   4.28%  kworker/0:0  [kernel.kallsyms]  [k] arch_local_irq_restore
+   4.03%  kworker/1:2  [kernel.kallsyms]  [k] arch_local_irq_restore
+   3.10%  kworker/0:0  [kernel.kallsyms]  [k] do_raw_spin_trylock
+   2.88%  kworker/1:2  [kernel.kallsyms]  [k] do_raw_spin_trylock
+   2.85%  kworker/1:2  [kernel.kallsyms]  [k] debug_locks_off
+   2.69%  kworker/0:0  [kernel.kallsyms]  [k] debug_locks_off
+   2.48%  kworker/0:0  [kernel.kallsyms]  [k] lock_release
+   2.26%  kworker/1:2  [kernel.kallsyms]  [k] lock_release
+   2.03%  kworker/0:0  [kernel.kallsyms]  [k] lock_acquire
+   1.88%  kworker/0:0  [kernel.kallsyms]  [k] arch_local_save_flags
+   1.87%  kworker/1:2  [kernel.kallsyms]  [k] lock_acquire
+   1.82%  kworker/1:2  [kernel.kallsyms]  [k] arch_local_save_flags
+   1.81%  kworker/1:2  [kernel.kallsyms]  [k] arch_local_irq_save
+   1.78%  kworker/0:0  [kernel.kallsyms]  [k] arch_local_irq_save
+   1.56%  kworker/0:0  [kernel.kallsyms]  [k] lock_acquired
+   1.53%  kworker/1:2  [kernel.kallsyms]  [k] __lock_acquire
+   1.51%  kworker/0:0  [kernel.kallsyms]  [k] __lock_acquire
+   1.29%  kworker/0:0  [kernel.kallsyms]  [k] native_write_msr_safe
+   1.23%  kworker/1:2  [kernel.kallsyms]  [k] cpu_relax
+   1.17%  kworker/1:2  [kernel.kallsyms]  [k] lock_acquired
+   1.17%  kworker/0:0  [kernel.kallsyms]  [k] trace_hardirqs_off_caller
+   1.11%  kworker/1:2  [kernel.kallsyms]  [k] trace_hardirqs_off_caller
+   1.08%  kworker/1:2  [kernel.kallsyms]  [k] native_write_msr_safe
+   1.02%  kworker/0:0  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
+   0.92%  kworker/0:0  [kernel.kallsyms]  [k] process_one_work
+   0.87%  kworker/1:2  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
+   0.80%  kworker/0:0  [kernel.kallsyms]  [k] flush_to_ldisc
+   0.76%  kworker/1:2  [kernel.kallsyms]  [k] process_one_work
+   0.76%  kworker/1:2  [kernel.kallsyms]  [k] flush_to_ldisc
+   0.72%  kworker/0:0  [kernel.kallsyms]  [k] arch_local_irq_restore
+   0.71%  kworker/1:2  [kernel.kallsyms]  [k] arch_local_irq_restore
+   0.64%  kworker/1:2  [kernel.kallsyms]  [k] do_raw_spin_unlock
+   0.63%  kworker/0:0  [kernel.kallsyms]  [k] perf_event_task_tick
+   0.61%  kworker/1:2  [kernel.kallsyms]  [k] ktime_get
+   0.59%  kworker/0:0  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore

this is what led me to try the other perf methods. the kmem traces were
the only things that really jumped out.

> (Powertop can also do it, and is probably a better thing to use, I'm
> just used to "perf record" for other reasons, so..)

Tried that too, here's what it said..

Summary: 0.0 wakeups/second, 0.0 GPU ops/second and 0.0 VFS ops/sec

       Usage      Events/s    Category    Description
 -2147483648 ms/s    0.0      Timer
 -2147483648 ms/s    0.0      kWork
    35151589 ms/s    0.0      Timer
    35151588 ms/s    0.0      Timer
    35151587 ms/s    0.0      Timer
    35151586 ms/s    0.0      Timer
    35151585 ms/s    0.0      Timer
    35151585 ms/s    0.0      Timer
    35151584 ms/s    0.0      Timer
    35151583 ms/s    0.0      Timer
    35151582 ms/s    0.0      Timer
    35151581 ms/s    0.0      Timer
    35151581 ms/s    0.0      Timer
    35151580 ms/s    0.0      Timer

Not exactly helpful.

	Dave
* Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git)

From: Linus Torvalds @ 2011-03-31 5:32 UTC (permalink / raw)
To: Dave Jones, Linus Torvalds, Andrew Morton, Linux Kernel, Tejun Heo

On Wed, Mar 30, 2011 at 8:55 PM, Dave Jones <davej@redhat.com> wrote:
>
> I tried that, and wasn't particularly enlightened.
>
> +   6.53%  kworker/1:2  [kernel.kallsyms]  [k] read_hpet
> +   4.83%  kworker/0:0  [kernel.kallsyms]  [k] read_hpet

Go into those, deeper into the callchain. That's how to see what the
upper-level problem is - by seeing who actually calls it. The leafs
themselves (where the time is obviously spent) don't tend to be very
informative, no.

> > (Powertop can also do it, and is probably a better thing to use, I'm
> > just used to "perf record" for other reasons, so..)
>
> Tried that too, here's what it said..

Hmm. Powertop is _supposed_ to be able to look into the kwork things
(it has special trace points for it), but as mentioned, the times I've
seen it I've used 'perf', so I haven't actually tried it myself.

                  Linus
* Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git)

From: Arnaldo Carvalho de Melo @ 2011-03-31 14:21 UTC (permalink / raw)
To: Dave Jones; +Cc: Linus Torvalds, Andrew Morton, Linux Kernel, Tejun Heo

Em Wed, Mar 30, 2011 at 10:32:11PM -0700, Linus Torvalds escreveu:
> On Wed, Mar 30, 2011 at 8:55 PM, Dave Jones <davej@redhat.com> wrote:
> > I tried that, and wasn't particularly enlightened.
> >
> > +   6.53%  kworker/1:2  [kernel.kallsyms]  [k] read_hpet
> > +   4.83%  kworker/0:0  [kernel.kallsyms]  [k] read_hpet
>
> Go into those, deeper into the callchain. That's how to see what the
> upper-level problem is - by seeing who actually calls it. The leafs
> themselves (where the time is obviously spent) don't tend to be very
> informative, no.

Yeah, please press 'E' since you're using the TUI, where you can expand
or collapse callchains at all levels.

Pressing 'E' will expand all of them, 'C' will collapse them all, and
Enter on lines with a '+' will expand just that one, etc. Press 'h' to
get more help.

Or use 'perf report --stdio' and you'll get it all expanded and in
plain text.

- Arnaldo
* Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git)

From: Dave Jones @ 2011-03-31 14:58 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Andrew Morton, Linux Kernel, Tejun Heo

On Wed, Mar 30, 2011 at 10:32:11PM -0700, Linus Torvalds wrote:
> On Wed, Mar 30, 2011 at 8:55 PM, Dave Jones <davej@redhat.com> wrote:
> >
> > I tried that, and wasn't particularly enlightened.
> >
> > +   6.53%  kworker/1:2  [kernel.kallsyms]  [k] read_hpet
> > +   4.83%  kworker/0:0  [kernel.kallsyms]  [k] read_hpet
>
> Go into those, deeper into the callchain. That's how to see what the
> upper-level problem is - by seeing who actually calls it. The leafs
> themselves (where the time is obviously spent) don't tend to be very
> informative, no.

ok, a bit more interesting?

-  17.67%  kworker/0:1  [kernel.kallsyms]  [k] read_hpet
   - read_hpet
      - 76.07% timekeeping_get_ns
         - ktime_get
            - 64.02% tick_dev_program_event
                 tick_program_event
                 hrtimer_interrupt
                 smp_apic_timer_interrupt
               + apic_timer_interrupt
            - 35.30% sched_clock_tick
                 scheduler_tick
                 update_process_times
                 tick_sched_timer
                 __run_hrtimer
                 hrtimer_interrupt
                 smp_apic_timer_interrupt
               - apic_timer_interrupt
                  + 65.70% _raw_spin_unlock_irqrestore
                  + 27.33% _raw_spin_unlock_irq
                  + 1.16% lock_acquire
                  + 1.16% put_ldisc
                  + 1.16% process_one_work
                  + 1.16% lock_release
                  + 0.78% worker_thread
                  + 0.58% _raw_spin_lock_irqsave
                  + 0.55% tick_sched_timer
      - 23.82% do_timer
           tick_do_update_jiffies64
           tick_sched_timer
           __run_hrtimer
           hrtimer_interrupt
           smp_apic_timer_interrupt
         - apic_timer_interrupt
            + 63.46% _raw_spin_unlock_irqrestore
            + 22.98% _raw_spin_unlock_irq
            + 3.50% put_ldisc
            + 3.28% lock_acquire

btw, I left this on overnight capturing the output of /proc/loadavg.
It didn't drop off at all, and this morning the laptop is almost too
hot to hold.

	Dave
* Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git) 2011-03-31 14:58 ` Dave Jones @ 2011-03-31 15:03 ` Dave Jones 2011-03-31 15:09 ` Dave Jones 2011-03-31 15:25 ` Linus Torvalds 0 siblings, 2 replies; 27+ messages in thread From: Dave Jones @ 2011-03-31 15:03 UTC (permalink / raw) To: Linus Torvalds, Andrew Morton, Linux Kernel, Tejun Heo

On Thu, Mar 31, 2011 at 10:58:50AM -0400, Dave Jones wrote:

bah I missed a level. which seems to be the smoking gun. something in the tty layer seems to have gone nuts.

> - 17.67% kworker/0:1  [kernel.kallsyms]  [k] read_hpet
>    - read_hpet
>       - 76.07% timekeeping_get_ns
>          - ktime_get
>             - 64.02% tick_dev_program_event
>                  tick_program_event
>                  hrtimer_interrupt
>                  smp_apic_timer_interrupt
>                + apic_timer_interrupt

                 - 52.24% _raw_spin_unlock_irqrestore
                    - 39.26% flush_to_ldisc
                         process_one_work
                         worker_thread
                         kthread
                         kernel_thread_helper
                    - 32.52% __wake_up
                         put_ldisc
                         tty_ldisc_deref
                         flush_to_ldisc
                         process_one_work
                         worker_thread
                         kthread
                         kernel_thread_helper
                    - 28.22% tty_ldisc_try
                         tty_ldisc_ref
                         flush_to_ldisc
                         process_one_work
                         worker_thread
                         kthread
                         kernel_thread_helper
                 - 18.80% _raw_spin_unlock_irq
                      process_one_work
                      worker_thread
                      kthread
                      kernel_thread_helper
                 - 5.98% process_one_work
                      worker_thread
                      kthread
                      kernel_thread_helper
                 - 4.81% lock_acquire
                      process_one_work
                      worker_thread
                      kthread
                      kernel_thread_helper
                 - 4.70% put_ldisc
                      tty_ldisc_deref
                      flush_to_ldisc
                      process_one_work
                      worker_thread
                      kthread
                      kernel_thread_helper
                 - 4.17% worker_thread
                      kthread
                      kernel_thread_helper
                 - 4.17% lock_release
                      process_one_work
                      worker_thread
                      kthread
                      kernel_thread_helper
                 - 1.28% flush_to_ldisc
                      process_one_work
                      worker_thread
                      kthread
                      kernel_thread_helper
                 - 1.28% arch_local_irq_save
                    - 58.33% lock_release
                         process_one_work
                         worker_thread
                         kthread
                         kernel_thread_helper
                    - 41.67% lock_acquire
                         process_one_work
                         worker_thread
                         kthread
                         kernel_thread_helper
                 - 0.85% tty_ldisc_deref
                      flush_to_ldisc
                      process_one_work
                      worker_thread
                      kthread
                      kernel_thread_helper
                 - 0.85% _raw_spin_lock_irqsave
                    - 62.51% flush_to_ldisc
                         process_one_work
                         worker_thread
                         kthread
                         kernel_thread_helper
                    - 37.49% tty_ldisc_try
                         tty_ldisc_ref
                         flush_to_ldisc
                         process_one_work
                         worker_thread
                         kthread
                         kernel_thread_helper
                 - 0.53% tty_ldisc_ref
                      flush_to_ldisc
                      process_one_work
                      worker_thread
                      kthread

>             - 35.30% sched_clock_tick
>                  scheduler_tick
>                  update_process_times
>                  tick_sched_timer
>                  __run_hrtimer
>                  hrtimer_interrupt
>                  smp_apic_timer_interrupt
>                - apic_timer_interrupt
>                   + 65.70% _raw_spin_unlock_irqrestore
>                   + 27.33% _raw_spin_unlock_irq
>                   + 1.16% lock_acquire
>                   + 1.16% put_ldisc
>                   + 1.16% process_one_work
>                   + 1.16% lock_release
>                   + 0.78% worker_thread
>                   + 0.58% _raw_spin_lock_irqsave
>                   + 0.55% tick_sched_timer
>       - 23.82% do_timer
>            tick_do_update_jiffies64
>            tick_sched_timer
>            __run_hrtimer
>            hrtimer_interrupt
>            smp_apic_timer_interrupt
>          - apic_timer_interrupt
>             + 63.46% _raw_spin_unlock_irqrestore
>             + 22.98% _raw_spin_unlock_irq
>             + 3.50% put_ldisc
>             + 3.28% lock_acquire

^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git) 2011-03-31 15:03 ` Dave Jones @ 2011-03-31 15:09 ` Dave Jones 2011-03-31 15:45 ` Linus Torvalds 2011-03-31 15:25 ` Linus Torvalds 1 sibling, 1 reply; 27+ messages in thread From: Dave Jones @ 2011-03-31 15:09 UTC (permalink / raw) To: Linus Torvalds, Andrew Morton, Linux Kernel, Tejun Heo On Thu, Mar 31, 2011 at 11:03:44AM -0400, Dave Jones wrote: > On Thu, Mar 31, 2011 at 10:58:50AM -0400, Dave Jones wrote: > > bah I missed a level. which seems to be the smoking gun. > something in the tty layer seems to have gone nuts. I thought that trace looked familiar. http://lkml.org/lkml/2010/11/30/592 It's the same thing. Dave ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git) 2011-03-31 15:09 ` Dave Jones @ 2011-03-31 15:45 ` Linus Torvalds 0 siblings, 0 replies; 27+ messages in thread From: Linus Torvalds @ 2011-03-31 15:45 UTC (permalink / raw) To: Dave Jones, Linus Torvalds, Andrew Morton, Linux Kernel, Tejun Heo On Thu, Mar 31, 2011 at 8:09 AM, Dave Jones <davej@redhat.com> wrote: > > I thought that trace looked familiar. > > http://lkml.org/lkml/2010/11/30/592 > > It's the same thing. Ok, that's before the "tty: stop using "delayed_work" in the tty layer" commit I just pointed to. So apparently you've been able to trigger this even with the old code too - although maybe the lack of delays anywhere has made it easier, and has made it use more cpu. I'll have to think about it, but I wonder if it's the crazy "reflush" case in flush_to_ldisc. We do if (!tty->receive_room || seen_tail) { schedule_work(&tty->buf.work); break; } inside the routine that is the work itself - basically we're saying that "if there's no more room to flip, or we've seen a new buffer, give up now and reschedule ourselves". Which doesn't really make much sense to me, I have to admit. The code that actually empties the buffer, or the code that adds one, should already have scheduled us for a flip _anyway_. So the only thing that "schedule_work()" is doing is causing infinite work if nothing empties the buffer, or more likely if we have a flushing bug elsewhere. So I'm not sure, but my gut feel is that removing that "schedule_work()" line there is the right thing to do. At a guess, it was hiding some locking problem - and it's been carried around even though hopefully we've fixed all the crazy races we used to have (and it was a mindless "hey, we can retry in one jiffy - it doesn't really cost us anything"). NOTE!
Even if I'm right, and that line is just buggy, the bug may well have been hiding some other issue - i.e. just someone not flushing the tty when it made more room available. So I think the "make tty flush cause a re-flush when it cannot make progress" behavior is wrong, but removing the line may well expose some other problem. Linus ^ permalink raw reply [flat|nested] 27+ messages in thread
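Linus's hunch about the self-requeue can be illustrated with a toy userspace model (a sketch with made-up names, not the kernel code; run, requeue_on_no_room and the step cap are all illustrative): a work function that reschedules itself whenever it cannot make progress spins for as long as nothing frees up receive_room, while leaving the requeue to whoever empties the buffer lets the worker go idle after a single run.

```python
from collections import deque

def run(requeue_on_no_room, max_steps=1000):
    """Toy model of a flush_to_ldisc-style work item (illustrative, not kernel code)."""
    work_queue = deque(["flush"])   # one pending flush work item
    receive_room = 0                # consumer buffer stays full; nothing ever drains it
    wakeups = 0
    while work_queue and wakeups < max_steps:
        work_queue.popleft()        # the worker runs the work item
        wakeups += 1
        if receive_room == 0 and requeue_on_no_room:
            # the suspect pattern: no progress possible, so reschedule ourselves
            work_queue.append("flush")
        # otherwise: do nothing here; whoever frees receive_room is
        # responsible for scheduling another flush
    return wakeups

print(run(requeue_on_no_room=True))   # 1000: only the artificial step cap stops it
print(run(requeue_on_no_room=False))  # 1: runs once, then the worker goes idle
```

With the requeue in place, the only thing stopping the loop is the artificial step cap; the kernel has no such cap, which matches the pinned-CPU kworkers above. The earlier delayed_work variant would merely have rate-limited the same loop to one iteration per jiffy rather than fixing it.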
* Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git) 2011-03-31 15:03 ` Dave Jones 2011-03-31 15:09 ` Dave Jones @ 2011-03-31 15:25 ` Linus Torvalds 2011-03-31 15:49 ` Dave Jones 1 sibling, 1 reply; 27+ messages in thread From: Linus Torvalds @ 2011-03-31 15:25 UTC (permalink / raw) To: Dave Jones, Linus Torvalds, Andrew Morton, Linux Kernel, Tejun Heo On Thu, Mar 31, 2011 at 8:03 AM, Dave Jones <davej@redhat.com> wrote: > On Thu, Mar 31, 2011 at 10:58:50AM -0400, Dave Jones wrote: > > bah I missed a level. which seems to be the smoking gun. > something in the tty layer seems to have gone nuts. Ahhah. That is indeed the smoking gun. I bet it's an old problem where something causes some stupid tty ping-pong behaviour, and now it has become a high-CPU thing triggered by commit f23eb2b2b285 ("tty: stop using "delayed_work" in the tty layer"). Before that commit, if the tty layer got somehow confused, it would always delay for a timer tick in between doing nothing. Now it will go pretty much full blast. Which explains your CPU being hot. Linus ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git) 2011-03-31 15:25 ` Linus Torvalds @ 2011-03-31 15:49 ` Dave Jones 2011-03-31 15:58 ` Linus Torvalds 0 siblings, 1 reply; 27+ messages in thread From: Dave Jones @ 2011-03-31 15:49 UTC (permalink / raw) To: Linus Torvalds; +Cc: Andrew Morton, Linux Kernel, Tejun Heo On Thu, Mar 31, 2011 at 08:25:36AM -0700, Linus Torvalds wrote: > On Thu, Mar 31, 2011 at 8:03 AM, Dave Jones <davej@redhat.com> wrote: > > On Thu, Mar 31, 2011 at 10:58:50AM -0400, Dave Jones wrote: > > > > bah I missed a level. which seems to be the smoking gun. > > something in the tty layer seems to have gone nuts. > > Ahhah. That is indeed the smoking gun. > > I bet it's an old problem where something causes some stupid tty > ping-pong behaviour, and now it has become a high-CPU thing triggered > by commit f23eb2b2b285 ("tty: stop using "delayed_work" in the tty > layer"). > > Before that commit, if the tty layer got somehow confused, it would > always delay for a timer tick in between doing nothing. Now it will go > pretty much full blast. Which explains your CPU being hot. That's a recent change though, and I first saw this back in November. Dave ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git) 2011-03-31 15:49 ` Dave Jones @ 2011-03-31 15:58 ` Linus Torvalds 2011-03-31 16:13 ` Dave Jones 0 siblings, 1 reply; 27+ messages in thread From: Linus Torvalds @ 2011-03-31 15:58 UTC (permalink / raw) To: Dave Jones, Linus Torvalds, Andrew Morton, Linux Kernel, Tejun Heo On Thu, Mar 31, 2011 at 8:49 AM, Dave Jones <davej@redhat.com> wrote: > > That's a recent change though, and I first saw this back in November. So your November report said that you could see "thousands" of these a second. But maybe it didn't use up all CPU until recently? Especially if you have a high CONFIG_HZ value, you'd still see a thousand flushes a second even with the old "delay a bit". So it would use a fair amount of CPU, and certainly waste a lot of power. But it wouldn't pin the CPU entirely. With that commit f23eb2b2b285, the buggy case would become basically totally CPU-bound. I dunno. Right now 'trinity' just ends up printing out a lot of system call errors for me. I assume that's its normal behavior? Linus ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git) 2011-03-31 15:58 ` Linus Torvalds @ 2011-03-31 16:13 ` Dave Jones 0 siblings, 0 replies; 27+ messages in thread From: Dave Jones @ 2011-03-31 16:13 UTC (permalink / raw) To: Linus Torvalds; +Cc: Andrew Morton, Linux Kernel, Tejun Heo On Thu, Mar 31, 2011 at 08:58:14AM -0700, Linus Torvalds wrote: > On Thu, Mar 31, 2011 at 8:49 AM, Dave Jones <davej@redhat.com> wrote: > > > > That's a recent change though, and I first saw this back in November. > > So your November report said that you could see "thousands" of these a > second. But maybe it didn't use up all CPU until recently? From memory, I noticed it back then in the same way "hey, why is the laptop getting hot". > Especially if you have a high CONFIG_HZ value, you'd still see a > thousand flushes a second even with the old "delay a bit". So it would > use a fair amount of CPU, and certainly waste a lot of power. But it > wouldn't pin the CPU entirely. HZ=1000 here, so yeah that makes sense. > With that commit f23eb2b2b285, the buggy case would become basically > totally CPU-bound. > > I dunno. Right now 'trinity' just ends up printing out a lot of system > call errors for me. I assume that's its normal behavior? yep. it's throwing semi-random junk and seeing what sticks. 99% of the time, it ends up with an -EINVAL or something providing the syscalls have sufficient checks in place. We're pretty solid there these days. (Though I still need to finish adding more annotations of some of the syscall arguments just to be sure we're passing semi-sane things to get deeper into the syscalls) What still seems to throw a curve-ball though is the case where calling a syscall generates some state. Due to the random nature of the program, we never have a balanced alloc/free, so lists can grow quite large etc.
I'm wondering if something has created a livelock situation, where some queue has grown to the point that we're generating new work before we can process the backlog. The downside of using randomness of course is with bugs like this, there's no easy way to figure out wtf happened to get it into this state other than poring over huge logfiles of all the syscalls that were made. I'm happily ignorant of most of the inner workings of the tty layer. I don't see anything useful in procfs/sysfs/debugfs. Is there anything useful I can do with the trace tools to try and find out things like the length of queues ? Dave ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git) 2011-03-31 3:37 ` Linus Torvalds 2011-03-31 3:55 ` Dave Jones @ 2011-03-31 6:56 ` Tejun Heo 2011-03-31 10:37 ` [PATCH] workqueue: document debugging tricks Florian Mickler 1 sibling, 1 reply; 27+ messages in thread From: Tejun Heo @ 2011-03-31 6:56 UTC (permalink / raw) To: Linus Torvalds; +Cc: Dave Jones, Andrew Morton, Linux Kernel Hello, On Wed, Mar 30, 2011 at 08:37:21PM -0700, Linus Torvalds wrote: > On Wed, Mar 30, 2011 at 8:09 PM, Dave Jones <davej@redhat.com> wrote: > > > > But rerunning the same tests on current head (6aba74f2791287ec407e0f92487a725a25908067) > > I can still reproduce the problem where kworker threads go nutso > > when the machine should be completely idle. > > > > top shows two kworker threads constantly at >80% cpu. > > Ok, I've seen that "tons of cpu by kworker" triggered by a few > different issues. One was the intel graphics driver doing monitor > detection constantly, and spending all its time in one of the worker > threads doing __udelay() for the stupid i2c driver. > > The other case I've seen is a wireless thing that falls back to GPIO, > and spends a lot of CPU time on that. > > I'm not saying yours is either of those cases, but one of the problems > with that behavior is that it's actually fairly hard to figure out > what the hell is happening. You don't see some nice thread description > in 'top' any more (like you used to when everybody created their own > threads and didn't do the common worker thread thing), and the best > approach literally seems to be something like Yes, not having dedicated workers decreases immediate visibility via ps/top, but I don't think the actual visibility is worse. There are two types of kworker looping - something being scheduled in rapid succession, or a single work item consuming lots of CPU cycles. The first one can be tracked using tracing.
$ echo workqueue:workqueue_queue_work > /sys/kernel/debug/tracing/set_event
$ cat /sys/kernel/debug/tracing/trace_pipe > out.txt
(wait a few secs)
^C

If something is busy looping on work queueing, it would be dominating the output and the offender can be determined with the work item function. For the second type, "cat /proc/THE_OFFENDING_KWORKER/stack" is the easiest. The work item function will be trivially visible in the stack trace. Thanks. -- tejun ^ permalink raw reply [flat|nested] 27+ messages in thread
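To boil such a trace capture down to the worst offender, the event lines can be summarized with a short script. Note the sample lines and the exact "function=" field layout below are illustrative assumptions; the real trace_pipe output format varies by kernel version:

```python
from collections import Counter

def top_work_functions(trace_lines):
    """Count workqueue_queue_work events per work function name."""
    counts = Counter()
    for line in trace_lines:
        # only look at queue_work events that carry a function= field
        if "workqueue_queue_work" in line and "function=" in line:
            func = line.rsplit("function=", 1)[1].split()[0]
            counts[func] += 1
    return counts.most_common()

# Made-up sample in the rough shape of trace_pipe output:
sample = [
    "kworker/0:1-5671 [000] 101.1: workqueue_queue_work: work struct=c1 function=flush_to_ldisc",
    "kworker/0:1-5671 [000] 101.2: workqueue_queue_work: work struct=c1 function=flush_to_ldisc",
    "kworker/1:2-5672 [001] 101.3: workqueue_queue_work: work struct=d2 function=vmstat_update",
]
print(top_work_functions(sample))  # [('flush_to_ldisc', 2), ('vmstat_update', 1)]
```

A work function dominating this count is the "scheduled in rapid succession" case; for the single CPU-hog work item, the /proc/PID/stack check remains the simplest route.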
* [PATCH] workqueue: document debugging tricks 2011-03-31 6:56 ` Tejun Heo @ 2011-03-31 10:37 ` Florian Mickler 2011-03-31 11:41 ` Tejun Heo 0 siblings, 1 reply; 27+ messages in thread From: Florian Mickler @ 2011-03-31 10:37 UTC (permalink / raw) To: tj; +Cc: linux-kernel, rdunlap, linux-doc, Florian Mickler

It is not obvious how to debug run-away workers. These are some tips given by Tejun on lkml.

Signed-off-by: Florian Mickler <florian@mickler.org>
CC: Tejun Heo <tj@kernel.org>
---
 Documentation/workqueue.txt |   38 ++++++++++++++++++++++++++++++++++++++
 1 files changed, 38 insertions(+), 0 deletions(-)

diff --git a/Documentation/workqueue.txt b/Documentation/workqueue.txt
index 01c513f..cdbc3c6 100644
--- a/Documentation/workqueue.txt
+++ b/Documentation/workqueue.txt
@@ -12,6 +12,7 @@ CONTENTS
 4. Application Programming Interface (API)
 5. Example Execution Scenarios
 6. Guidelines
+7. Debugging


 1. Introduction
@@ -379,3 +380,40 @@ If q1 has WQ_CPU_INTENSIVE set,
 * Unless work items are expected to consume a huge amount of CPU
   cycles, using a bound wq is usually beneficial due to the increased
   level of locality in wq operations and work item execution.
+
+
+7. Debugging
+
+Because the work functions are executed by generic worker threads there are a
+few tricks needed to shed some light on misbehaving workqueue users.
+
+Worker threads show up in the process list as:
+
+root      5671  0.0  0.0      0     0 ?        S    12:07   0:00 [kworker/0:1]
+root      5672  0.0  0.0      0     0 ?        S    12:07   0:00 [kworker/1:2]
+root      5673  0.0  0.0      0     0 ?        S    12:12   0:00 [kworker/0:0]
+root      5674  0.0  0.0      0     0 ?        S    12:13   0:00 [kworker/1:0]
+
+If kworkers are going crazy (using too much cpu), there are two types of
+possible problems:
+
+	1. Something being scheduled in rapid succession
+	2. a single work item that consumes lots of cpu cycles
+
+The first one can be tracked using tracing:
+
+	$ echo workqueue:workqueue_queue_work > /sys/kernel/debug/tracing/set_event
+	$ cat /sys/kernel/debug/tracing/trace_pipe > out.txt
+	(wait a few secs)
+	^C
+
+If something is busy looping on work queueing, it would be dominating
+the output and the offender can be determined with the work item
+function.
+
+For the second type of problem it should be possible to just check the stack
+trace of the offending worker thread.
+
+	$ cat /proc/THE_OFFENDING_KWORKER/stack
+
+The work item's function should be trivially visible in the stack trace.
--
1.7.4.1 ^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [PATCH] workqueue: document debugging tricks 2011-03-31 10:37 ` [PATCH] workqueue: document debugging tricks Florian Mickler @ 2011-03-31 11:41 ` Tejun Heo 0 siblings, 0 replies; 27+ messages in thread From: Tejun Heo @ 2011-03-31 11:41 UTC (permalink / raw) To: Florian Mickler; +Cc: linux-kernel, rdunlap, linux-doc On Thu, Mar 31, 2011 at 12:37:06PM +0200, Florian Mickler wrote: > It is not obvious how to debug run-away workers. > > These are some tips given by Tejun on lkml. > > Signed-off-by: Florian Mickler <florian@mickler.org> > CC: Tejun Heo <tj@kernel.org> Right, good idea. Applied to fixes-2.6.39 with minor formatting changes. Thanks. -- tejun ^ permalink raw reply [flat|nested] 27+ messages in thread
end of thread, other threads:[~2011-03-31 21:39 UTC | newest] Thread overview: 27+ messages -- 2011-03-29 4:09 vma corruption in today's -git Dave Jones 2011-03-29 4:19 ` Américo Wang 2011-03-29 4:26 ` Dave Jones 2011-03-29 4:22 ` Linus Torvalds 2011-03-31 3:09 ` excessive kworker activity when idle. (was Re: vma corruption in today's -git) Dave Jones 2011-03-31 3:34 ` Dave Jones 2011-03-31 3:44 ` Linus Torvalds 2011-03-31 4:08 ` Dave Jones 2011-03-31 15:53 ` Linus Torvalds 2011-03-31 16:21 ` Linus Torvalds 2011-03-31 21:38 ` Linus Torvalds 2011-03-31 14:59 ` Paul E. McKenney 2011-03-31 3:37 ` Linus Torvalds 2011-03-31 3:55 ` Dave Jones 2011-03-31 5:32 ` Linus Torvalds 2011-03-31 14:21 ` Arnaldo Carvalho de Melo 2011-03-31 14:58 ` Dave Jones 2011-03-31 15:03 ` Dave Jones 2011-03-31 15:09 ` Dave Jones 2011-03-31 15:45 ` Linus Torvalds 2011-03-31 15:25 ` Linus Torvalds 2011-03-31 15:49 ` Dave Jones 2011-03-31 15:58 ` Linus Torvalds 2011-03-31 16:13 ` Dave Jones 2011-03-31 6:56 ` Tejun Heo 2011-03-31 10:37 ` [PATCH] workqueue: document debugging tricks Florian Mickler 2011-03-31 11:41 ` Tejun Heo