* vma corruption in today's -git

From: Dave Jones @ 2011-03-29 4:09 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Andrew Morton, Linux Kernel

Running a kernel that I built from a git pull from this afternoon,
I ran a barrage of stress tests, including some of the new syscall fuzzing
work that I've been working on. Everything seemed just fine, except that
after I had stopped running everything, loadavg was still at 6, with
top showing kworker processes running at high % of cpu.

I ran perf top, and got this ..

BUG: unable to handle kernel paging request at fffffffffffffff3
IP: [<ffffffff8118d3e9>] vma_stop.clone.3+0x18/0x33
PGD 1a05067 PUD 1a06067 PMD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/LNXSYSTM:00/device:00/PNP0C0A:00/power_supply/BAT0/energy_full
CPU 0
Modules linked in: ipx p8022 psnap llc p8023 sctp libcrc32c tun vfat fat usb_storage uas nfs fscache fuse nfsd lockd nfs_acl auth_rpcgss sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables ipv6 uinput arc4 iwlagn btusb bluetooth snd_hda_codec_hdmi snd_hda_codec_idt mac80211 snd_hda_intel snd_usb_audio snd_hda_codec dell_wmi snd_seq sparse_keymap uvcvideo snd_pcm dell_laptop snd_hwdep dcdbas snd_usbmidi_lib microcode videodev zaurus cfg80211 snd_rawmidi snd_seq_device cdc_ether v4l2_compat_ioctl32 usbnet snd_timer iTCO_wdt pcspkr joydev cdc_acm i2c_i801 cdc_wdm tg3 snd mii iTCO_vendor_support rfkill soundcore snd_page_alloc wmi i915 drm_kms_helper drm i2c_algo_bit i2c_core video [last unloaded: scsi_wait_scan]

Pid: 324, comm: perf Tainted: G        W   2.6.38+ #5 Dell Inc. Adamo 13 /0N70T0
RIP: 0010:[<ffffffff8118d3e9>]  [<ffffffff8118d3e9>] vma_stop.clone.3+0x18/0x33
RSP: 0018:ffff880114de1e48  EFLAGS: 00010286
RAX: 00000000fffffff3 RBX: ffff880007f44bc8 RCX: ffff8800b7216806
RDX: ffffffff81618f50 RSI: fffffffffffffff3 RDI: ffff880007f44bd8
RBP: ffff880114de1e58 R08: 0000000000000001 R09: 0000000000000000
R10: ffffffff81a26870 R11: 0000000000000246 R12: ffff88000fe3d880
R13: ffff8800b72168f8 R14: fffffffffffffff3 R15: 0000000000000000
FS:  00007f59e738c720(0000) GS:ffff88013fa00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: fffffffffffffff3 CR3: 0000000136566000 CR4: 00000000000406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process perf (pid: 324, threadinfo ffff880114de0000, task ffff8800b43c23b0)
Stack:
 0000000000000000 ffff880007f44bc8 ffff880114de1e78 ffffffff8118d46a
 0000000000000000 0000000000000400 ffff880114de1ef8 ffffffff81159a99
 0000000000000004 ffff8801fffffff3 0000000000000004 ffff8800b7216930
Call Trace:
 [<ffffffff8118d46a>] m_stop+0x1e/0x30
 [<ffffffff81159a99>] seq_read+0x240/0x36e
 [<ffffffff8113dc28>] vfs_read+0xac/0xff
 [<ffffffff8113f0f8>] ? fget_light+0x3a/0xa1
 [<ffffffff8113dcc8>] sys_read+0x4d/0x74
 [<ffffffff814d6b42>] system_call_fastpath+0x16/0x1b
Code: 8b 54 24 08 48 89 53 30 41 5d 5b 41 5c 31 c0 41 5d c9 c3 55 48 89 e5 53 48 83 ec 08 0f 1f 44 00 00 48 85 f6 74 1c 48 3b 37 74 17 8b 1e 48 8d bb a8 00 00 00 e8 d9 e6 ee ff 48 89 df e8 9a ab
RIP  [<ffffffff8118d3e9>] vma_stop.clone.3+0x18/0x33
 RSP <ffff880114de1e48>
CR2: fffffffffffffff3
---[ end trace ea1a722d5e92836a ]---

it smells like something got corrupted, but I'm not sure where to start.
I've got a bunch of debug options turned on, but nothing else triggered,
so I've got nothing other than this oops to go on.

Any thoughts ?

	Dave

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: vma corruption in today's -git

From: Américo Wang @ 2011-03-29 4:19 UTC (permalink / raw)
To: Dave Jones, Linus Torvalds, Andrew Morton, Linux Kernel

On Tue, Mar 29, 2011 at 12:09 PM, Dave Jones <davej@redhat.com> wrote:
> Running a kernel that I built from a git pull from this afternoon,
> I ran a barrage of stress tests, including some of the new syscall fuzzing
> work that I've been working on. Everything seemed just fine, except that
> after I had stopped running everything, loadavg was still at 6, with
> top showing kworker processes running at high % of cpu.
>
> I ran perf top, and got this ..
>
> BUG: unable to handle kernel paging request at fffffffffffffff3
> IP: [<ffffffff8118d3e9>] vma_stop.clone.3+0x18/0x33

Should be fixed by:

commit 76597cd31470fa130784c78fadb4dab2e624a723
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Sun Mar 27 19:09:29 2011 -0700

    proc: fix oops on invalid /proc/<pid>/maps access

    When m_start returns an error, the seq_file logic will still call m_stop
    with that error entry, so we'd better make sure that we check it before
    using it as a vma.

    Introduced by commit ec6fd8a4355c ("report errors in /proc/*/*map* sanely"),
    which replaced NULL with various ERR_PTR() cases.

    (On ia64, you happen to get a unaligned fault instead of a page fault,
    since the address used is generally some random error code like -EPERM)

    Reported-by: Anca Emanuel <anca.emanuel@gmail.com>
    Reported-by: Tony Luck <tony.luck@intel.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Américo Wang <xiyou.wangcong@gmail.com>
    Cc: Stephen Wilson <wilsons@start.ca>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Re: vma corruption in today's -git

From: Dave Jones @ 2011-03-29 4:26 UTC (permalink / raw)
To: Américo Wang; +Cc: Linus Torvalds, Andrew Morton, Linux Kernel

On Tue, Mar 29, 2011 at 12:19:01PM +0800, Américo Wang wrote:
> On Tue, Mar 29, 2011 at 12:09 PM, Dave Jones <davej@redhat.com> wrote:
> > Running a kernel that I built from a git pull from this afternoon,
> > I ran a barrage of stress tests, including some of the new syscall fuzzing
> > work that I've been working on. Everything seemed just fine, except that
> > after I had stopped running everything, loadavg was still at 6, with
> > top showing kworker processes running at high % of cpu.
> >
> > I ran perf top, and got this ..
> >
> > BUG: unable to handle kernel paging request at fffffffffffffff3
> > IP: [<ffffffff8118d3e9>] vma_stop.clone.3+0x18/0x33
>
> Should be fixed by:
>
> commit 76597cd31470fa130784c78fadb4dab2e624a723
> Author: Linus Torvalds <torvalds@linux-foundation.org>
> Date:   Sun Mar 27 19:09:29 2011 -0700
>
>     proc: fix oops on invalid /proc/<pid>/maps access

excellent, thanks. I'll repull and try again.

	Dave
* Re: vma corruption in today's -git

From: Linus Torvalds @ 2011-03-29 4:22 UTC (permalink / raw)
To: Dave Jones, Linus Torvalds, Andrew Morton, Linux Kernel

On Mon, Mar 28, 2011 at 9:09 PM, Dave Jones <davej@redhat.com> wrote:
>
> I ran perf top, and got this ..
>
> BUG: unable to handle kernel paging request at fffffffffffffff3

Should be fixed by commit 76597cd31470.

Except that was already committed yesterday. When did you pull?

                  Linus
* excessive kworker activity when idle. (was Re: vma corruption in today's -git)

From: Dave Jones @ 2011-03-31 3:09 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Andrew Morton, Linux Kernel

On Mon, Mar 28, 2011 at 09:22:25PM -0700, Linus Torvalds wrote:
> On Mon, Mar 28, 2011 at 9:09 PM, Dave Jones <davej@redhat.com> wrote:
> >
> > I ran perf top, and got this ..
> >
> > BUG: unable to handle kernel paging request at fffffffffffffff3
>
> Should be fixed by commit 76597cd31470.
>
> Except that was already committed yesterday. When did you pull?

It did fix it, I must have missed that commit.

But rerunning the same tests on current head (6aba74f2791287ec407e0f92487a725a25908067)
I can still reproduce the problem where kworker threads go nutso
when the machine should be completely idle.

top shows two kworker threads constantly at >80% cpu.

I can now profile it with perf though, so I'll dig deeper.
I need to try a build with slab debug etc turned off, just to be sure that
isn't throwing things into a spin.

	Dave
* Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git)

From: Dave Jones @ 2011-03-31 3:34 UTC (permalink / raw)
To: Linus Torvalds, Andrew Morton, Linux Kernel

On Wed, Mar 30, 2011 at 11:09:18PM -0400, Dave Jones wrote:
> I can still reproduce the problem where kworker threads go nutso
> when the machine should be completely idle.
>
> top shows two kworker threads constantly at >80% cpu.
>
> I can now profile it with perf though, so I'll dig deeper.
> I need to try a build with slab debug etc turned off, just to be sure that
> isn't throwing things into a spin.

so 'perf kmem record sleep 5' shows hundreds of calls to kmem_cache_free from
the kworker processes, called from shmem_i_callback, __d_free and file_free_rcu.
My guess is that my fuzzing caused so many allocations that the rcu freeing is
still ongoing an hour or so after the process has quit. Does that make any sense?

Although if that hypothesis is true, shouldn't I be able to observe this by
watching slabtop? Because I don't see anything decreasing...

	Dave
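(Editor's note: the "watch slabtop" idea above can be checked more directly by sampling the active-object column of /proc/slabinfo for the caches behind those RCU callbacks. A minimal sketch, assuming a readable /proc/slabinfo; the helper name and the canned example line are illustrative, not from the thread:)

```shell
# slab_active: given a cache name, print its active-object count from
# /proc/slabinfo-style text on stdin. Sampling it repeatedly would show
# whether delayed RCU freeing is actually draining objects over time.
slab_active() {   # usage: slab_active <cache-name> < slabinfo-text
    awk -v cache="$1" '$1 == cache { print $2 }'
}

# Example against a canned line; real use would be e.g.
#   slab_active shmem_inode_cache < /proc/slabinfo
printf 'dentry 10423 11088 192 21 1\nfilp 862 987 256 16 1\n' | slab_active dentry
# prints 10423
```

(If the rcu-freeing hypothesis were right, repeated samples of the relevant cache would show the count falling between runs.)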
* Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git)

From: Linus Torvalds @ 2011-03-31 3:44 UTC (permalink / raw)
To: Dave Jones, Andrew Morton, Linux Kernel, Paul E. McKenney

On Wed, Mar 30, 2011 at 8:34 PM, Dave Jones <davej@redhat.com> wrote:
>
> so 'perf kmem record sleep 5' shows hundreds of calls kmem_cache_free from
> the kworker processes. Called from shmem_i_callback, __d_free and file_free_rcu.
> My guess is that my fuzzing caused so many allocations that the rcu freeing is
> still ongoing an hour or so after the process has quit. does that make any sense?

No, that shouldn't be the case. RCU freeing should go on for just a
few RCU periods, and be done. I think there is some "limit the work we
do for rcu each time" in order to not have bad latencies, but even so
that shouldn't take _that_ long. And as you say, you should see the
freeing in the slab stats.

So clearly there are shmem inodes being destroyed, but it shouldn't be
from an hour ago. I wonder if your system isn't as idle as you think
it is.

But I'm cc'ing Paul, maybe he'll disagree and say it's expected and
that the RCU batch size is really small. Or at least give some hint
about how to check the pending rcu state.

                  Linus
* Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git)

From: Dave Jones @ 2011-03-31 4:08 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Andrew Morton, Linux Kernel, Paul E. McKenney

On Wed, Mar 30, 2011 at 08:44:06PM -0700, Linus Torvalds wrote:
> So clearly there are shmem inodes being destroyed, but it shouldn't be
> from an hour ago. I wonder if your system isn't as idle as you think
> it is.

hmm. When I say idle, I mean I'm staring at a gnome-terminal with an ssh
client, and nothing else. Certainly nothing that should be causing the
cpu meter in the taskbar to almost max out, and definitely nothing that
should cause the fans in the laptop to be roaring.

If you want to try and repeat this experiment, this should do it..

git clone git://git.codemonkey.org.uk/trinity
cd trinity ; make
./trinity --mode=random -i

let it run for a while, then kill it. kworker should then go nuts.

	Dave
* Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git)

From: Linus Torvalds @ 2011-03-31 15:53 UTC (permalink / raw)
To: Dave Jones, Andrew Morton, Linux Kernel, Paul E. McKenney

And with your fixed trinity thing, I can now do this:

On Wed, Mar 30, 2011 at 9:08 PM, Dave Jones <davej@redhat.com> wrote:
>
> git clone git://git.codemonkey.org.uk/trinity
> cd trinity ; make
> ./trinity --mode=random -i
>
> let it run for a while, then kill it. kworker should then go nuts.

I really thought "hey, that reproduced it", because "top" did indeed
show kworker going nuts after a while. And perf did show it doing all
that

   | queue_work
   | schedule_work
   | flush_to_ldisc

that looks like the re-scheduling.

But when I killed trinity, kworker immediately calmed down too. Which
I'd actually expect, since any tty it is using (and causing the problem)
should calm down when it's closed - even if it was stuck before. But
maybe I haven't triggered just the right condition to actually get it
into some really infinite case.

Regardless, I'll put my money where my mouth is, and try to remove the
crazy re-flush thing.

                  Linus
* Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git)

From: Linus Torvalds @ 2011-03-31 16:21 UTC (permalink / raw)
To: Dave Jones, Andrew Morton, Linux Kernel, Paul E. McKenney

On Thu, Mar 31, 2011 at 8:53 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Regardless, I'll put my money where my mouth is, and try to remove the
> crazy re-flush thing.

Yeah, that doesn't work. The tty does actually lock up when it fills
up. So clearly we actually depended on that reflushing happening.

That said, I do still think it's the right thing to do to remove that
line, it just means that I need to figure out where the missing flush
is.

                  Linus
* Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git)

From: Linus Torvalds @ 2011-03-31 21:38 UTC (permalink / raw)
To: Dave Jones, Andrew Morton, Linux Kernel, Paul E. McKenney, Alan Cox, Greg KH

[-- Attachment #1: Type: text/plain, Size: 1704 bytes --]

On Thu, Mar 31, 2011 at 9:21 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Mar 31, 2011 at 8:53 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Regardless, I'll put my money where my mouth is, and try to remove the
> > crazy re-flush thing.
>
> Yeah, that doesn't work. The tty does actually lock up when it fills
> up. So clearly we actually depended on that reflushing happening.
>
> That said, I do still think it's the right thing to do to remove that
> line, it just means that I need to figure out where the missing flush
> is.

Ok, that was unexpected.

So the reason for the need to do that crazy "try to flush from the
flush routine" is that in the case of "receive_room" going down to zero
(which only happens for n_tty and for the case of switching ldisc's
around), if we hit that during flushing, nothing will apparently ever
re-start the flushing when receive_room then opens up again.

So instead of having that case re-start the flush, we end up saying
"ok, we'll just retry the flush over and over again", and essentially
poll for receive_room opening up. No wonder you've seen high CPU use
and thousands of calls a second.

The "seen_tail" case doesn't have that issue, because anything that
adds a new buffer to the tty list should always be flipping anyway.

So this attached patch would seem to work. Not heavily tested, but the
case that I could trivially trigger before doesn't trigger for me any
more. And I can't seem to get kworker to waste lots of CPU time any
more, but it was kind of hit-and-miss before too, so I don't know how
much that's worth..

The locking here is kind of iffy, but otherwise? Comments?

                  Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 1520 bytes --]

 drivers/tty/n_tty.c      |    6 ++++++
 drivers/tty/tty_buffer.c |    4 +---
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/tty/n_tty.c b/drivers/tty/n_tty.c
index 428f4fe..0ad3288 100644
--- a/drivers/tty/n_tty.c
+++ b/drivers/tty/n_tty.c
@@ -95,6 +95,7 @@ static void n_tty_set_room(struct tty_struct *tty)
 {
 	/* tty->read_cnt is not read locked ? */
 	int	left = N_TTY_BUF_SIZE - tty->read_cnt - 1;
+	int	old_left;
 
 	/*
 	 * If we are doing input canonicalization, and there are no
@@ -104,7 +105,12 @@ static void n_tty_set_room(struct tty_struct *tty)
 	 */
 	if (left <= 0)
 		left = tty->icanon && !tty->canon_data;
+	old_left = tty->receive_room;
 	tty->receive_room = left;
+
+	/* Did this open up the receive buffer? We may need to flip */
+	if (left && !old_left)
+		schedule_work(&tty->buf.work);
 }
 
 static void put_tty_queue_nolock(unsigned char c, struct tty_struct *tty)
diff --git a/drivers/tty/tty_buffer.c b/drivers/tty/tty_buffer.c
index b945121..f1a7918 100644
--- a/drivers/tty/tty_buffer.c
+++ b/drivers/tty/tty_buffer.c
@@ -442,10 +442,8 @@ static void flush_to_ldisc(struct work_struct *work)
 			   line discipline as we want to empty the queue */
 			if (test_bit(TTY_FLUSHPENDING, &tty->flags))
 				break;
-			if (!tty->receive_room || seen_tail) {
-				schedule_work(&tty->buf.work);
+			if (!tty->receive_room || seen_tail)
 				break;
-			}
 			if (count > tty->receive_room)
 				count = tty->receive_room;
 			char_buf = head->char_buf_ptr + head->read;
* Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git)

From: Paul E. McKenney @ 2011-03-31 14:59 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Dave Jones, Andrew Morton, Linux Kernel

On Wed, Mar 30, 2011 at 08:44:06PM -0700, Linus Torvalds wrote:
> On Wed, Mar 30, 2011 at 8:34 PM, Dave Jones <davej@redhat.com> wrote:
> >
> > so 'perf kmem record sleep 5' shows hundreds of calls kmem_cache_free from
> > the kworker processes. Called from shmem_i_callback, __d_free and file_free_rcu.
> > My guess is that my fuzzing caused so many allocations that the rcu freeing is
> > still ongoing an hour or so after the process has quit. does that make any sense?
>
> No, that shouldn't be the case. RCU freeing should go on for just a
> few RCU periods, and be done. I think there is some "limit the work we
> do for rcu each time" in order to not have bad latencies, but even so
> that shouldn't take _that_ long. And as you say, you should see the
> freeing in the slab stats.
>
> So clearly there are shmem inodes being destroyed, but it shouldn't be
> from an hour ago. I wonder if your system isn't as idle as you think
> it is.
>
> But I'm cc'ing Paul, maybe he'll disagree and say it's expected and
> that the RCU batch size is really small. Or at least give some hint
> about how to check the pending rcu state.

There is by default a ten-callback-per-softirq per-CPU limit (set by
module parameter "blimit"), but if there are more than 10,000 callbacks
queued (set by module parameter "qhimark"), RCU stops being Mr. Nice Guy
and just plows through the callbacks. I am having a really hard time
believing that you could queue a batch of callbacks and still have RCU
processing them an hour later.

But one way this can happen is if the callbacks requeue themselves, for
example, if the RCU callback function contains a call to call_rcu().
I am not seeing that in the __d_free and file_free_rcu callbacks that
Dave mentioned, but this does happen in some places in the Linux kernel.
And maybe such a place is being invoked in a non-obvious way.

There is a debugfs file "rcu/rcudata" that reports the current RCU queue
length that is enabled with CONFIG_RCU_TRACE. Here is sample output:

rcu_sched:
  0 c=17829 g=17829 pq=1 pqc=17829 qp=0 dt=10951/1 dn=0 df=1101 of=0 ri=36 ql=0 b=10
  1 c=17829 g=17829 pq=1 pqc=17829 qp=0 dt=16117/1 dn=0 df=1015 of=0 ri=0 ql=0 b=10
  2 c=17829 g=17829 pq=1 pqc=17829 qp=0 dt=1445/1 dn=0 df=1839 of=0 ri=0 ql=0 b=10
  3 c=17829 g=17829 pq=1 pqc=17829 qp=0 dt=6681/1 dn=0 df=1545 of=0 ri=0 ql=0 b=10
  4 c=17829 g=17829 pq=1 pqc=17829 qp=0 dt=1003/1 dn=0 df=1992 of=0 ri=0 ql=0 b=10
  5 c=17829 g=17830 pq=1 pqc=17829 qp=1 dt=3887/1 dn=0 df=3331 of=0 ri=4 ql=2 b=10
  6 c=17829 g=17829 pq=1 pqc=17829 qp=0 dt=859/1 dn=0 df=3224 of=0 ri=0 ql=0 b=10
  7 c=17829 g=17830 pq=0 pqc=17829 qp=1 dt=3761/1 dn=0 df=1818 of=0 ri=0 ql=2 b=10
rcu_bh:
  0 c=-275 g=-275 pq=1 pqc=-275 qp=0 dt=10951/1 dn=0 df=0 of=0 ri=0 ql=0 b=10
  1 c=-275 g=-275 pq=1 pqc=-275 qp=0 dt=16117/1 dn=0 df=13 of=0 ri=0 ql=0 b=10
  2 c=-275 g=-275 pq=1 pqc=-275 qp=0 dt=1445/1 dn=0 df=15 of=0 ri=0 ql=0 b=10
  3 c=-275 g=-275 pq=1 pqc=-275 qp=0 dt=6681/1 dn=0 df=9 of=0 ri=0 ql=0 b=10
  4 c=-275 g=-275 pq=1 pqc=-275 qp=0 dt=1003/1 dn=0 df=15 of=0 ri=0 ql=0 b=10
  5 c=-275 g=-275 pq=1 pqc=-275 qp=0 dt=3887/1 dn=0 df=15 of=0 ri=0 ql=0 b=10
  6 c=-275 g=-275 pq=1 pqc=-275 qp=0 dt=859/1 dn=0 df=15 of=0 ri=0 ql=0 b=10
  7 c=-275 g=-275 pq=1 pqc=-275 qp=0 dt=3761/1 dn=0 df=15 of=0 ri=0 ql=0 b=10

The "ql=" gives the per-CPU callback queue size. For example, CPU 7 has
two rcu_sched callbacks queued. In CONFIG_TREE_PREEMPT_RCU builds, there
will be an additional "rcu_preempt" section.

If you see some huge number after a "ql=", that would indicate a
problem. ;-)

Does this help?

							Thanx, Paul
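(Editor's note: rather than eyeballing every line for the "huge number after a ql=" case, the queue lengths can be extracted mechanically. A small sketch; the debugfs path and the canned sample lines are assumptions for illustration:)

```shell
# max_ql: read rcu/rcudata-style text on stdin and print the largest
# per-CPU callback queue length (the "ql=" field).
# Real use (assumes CONFIG_RCU_TRACE and a mounted debugfs):
#   max_ql < /sys/kernel/debug/rcu/rcudata
max_ql() {
    grep -o 'ql=[0-9]*' | cut -d= -f2 | sort -n | tail -1
}

# Example against two canned lines in the format shown above:
printf '5 c=17829 g=17830 ql=2 b=10\n7 c=17829 g=17830 ql=12034 b=10\n' | max_ql
# prints 12034
```

(A healthy idle system should print a small number; a value near or above the 10,000 "qhimark" threshold would point at the callback-requeue scenario described above.)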
* Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git)

From: Linus Torvalds @ 2011-03-31 3:37 UTC (permalink / raw)
To: Dave Jones, Andrew Morton, Linux Kernel, Tejun Heo

On Wed, Mar 30, 2011 at 8:09 PM, Dave Jones <davej@redhat.com> wrote:
>
> But rerunning the same tests on current head (6aba74f2791287ec407e0f92487a725a25908067)
> I can still reproduce the problem where kworker threads go nutso
> when the machine should be completely idle.
>
> top shows two kworker threads constantly at >80% cpu.

Ok, I've seen that "tons of cpu by kworker" triggered by a few
different issues.

One was the intel graphics driver doing monitor detection constantly,
and spending all its time in one of the worker threads doing __udelay()
for the stupid i2c driver.

The other case I've seen is a wireless thing that falls back to GPIO,
and spends a lot of CPU time on that.

I'm not saying yours is either of those cases, but one of the problems
with that behavior is that it's actually fairly hard to figure out what
the hell is happening. You don't see some nice thread description in
'top' any more (like you used to when everybody created their own
threads and didn't do the common worker thread thing), and the best
approach literally seems to be something like

   perf record -ag sleep 10
   perf report

which does tend to show what's going on, but it's still a ridiculous
way to do this.

(Powertop can also do it, and is probably a better thing to use, I'm
just used to "perf record" for other reasons, so..)

Tejun, would there be some saner way to get the information about
_where_ all the kworker time is going?

                  Linus
* Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git)

From: Dave Jones @ 2011-03-31 3:55 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Andrew Morton, Linux Kernel, Tejun Heo

On Wed, Mar 30, 2011 at 08:37:21PM -0700, Linus Torvalds wrote:
> You don't see some nice thread description
> in 'top' any more (like you used to when everybody created their own
> threads and didn't do the common worker thread thing), and the best
> approach literally seems to be something like
>
>    perf record -ag sleep 10
>    perf report
>
> which does tend to show what's going on, but it's still a ridiculous
> way to this.

I tried that, and wasn't particularly enlightened.

+   6.53%  kworker/1:2  [kernel.kallsyms]  [k] read_hpet
+   4.83%  kworker/0:0  [kernel.kallsyms]  [k] read_hpet
+   4.28%  kworker/0:0  [kernel.kallsyms]  [k] arch_local_irq_restore
+   4.03%  kworker/1:2  [kernel.kallsyms]  [k] arch_local_irq_restore
+   3.10%  kworker/0:0  [kernel.kallsyms]  [k] do_raw_spin_trylock
+   2.88%  kworker/1:2  [kernel.kallsyms]  [k] do_raw_spin_trylock
+   2.85%  kworker/1:2  [kernel.kallsyms]  [k] debug_locks_off
+   2.69%  kworker/0:0  [kernel.kallsyms]  [k] debug_locks_off
+   2.48%  kworker/0:0  [kernel.kallsyms]  [k] lock_release
+   2.26%  kworker/1:2  [kernel.kallsyms]  [k] lock_release
+   2.03%  kworker/0:0  [kernel.kallsyms]  [k] lock_acquire
+   1.88%  kworker/0:0  [kernel.kallsyms]  [k] arch_local_save_flags
+   1.87%  kworker/1:2  [kernel.kallsyms]  [k] lock_acquire
+   1.82%  kworker/1:2  [kernel.kallsyms]  [k] arch_local_save_flags
+   1.81%  kworker/1:2  [kernel.kallsyms]  [k] arch_local_irq_save
+   1.78%  kworker/0:0  [kernel.kallsyms]  [k] arch_local_irq_save
+   1.56%  kworker/0:0  [kernel.kallsyms]  [k] lock_acquired
+   1.53%  kworker/1:2  [kernel.kallsyms]  [k] __lock_acquire
+   1.51%  kworker/0:0  [kernel.kallsyms]  [k] __lock_acquire
+   1.29%  kworker/0:0  [kernel.kallsyms]  [k] native_write_msr_safe
+   1.23%  kworker/1:2  [kernel.kallsyms]  [k] cpu_relax
+   1.17%  kworker/1:2  [kernel.kallsyms]  [k] lock_acquired
+   1.17%  kworker/0:0  [kernel.kallsyms]  [k] trace_hardirqs_off_caller
+   1.11%  kworker/1:2  [kernel.kallsyms]  [k] trace_hardirqs_off_caller
+   1.08%  kworker/1:2  [kernel.kallsyms]  [k] native_write_msr_safe
+   1.02%  kworker/0:0  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
+   0.92%  kworker/0:0  [kernel.kallsyms]  [k] process_one_work
+   0.87%  kworker/1:2  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
+   0.80%  kworker/0:0  [kernel.kallsyms]  [k] flush_to_ldisc
+   0.76%  kworker/1:2  [kernel.kallsyms]  [k] process_one_work
+   0.76%  kworker/1:2  [kernel.kallsyms]  [k] flush_to_ldisc
+   0.72%  kworker/0:0  [kernel.kallsyms]  [k] arch_local_irq_restore
+   0.71%  kworker/1:2  [kernel.kallsyms]  [k] arch_local_irq_restore
+   0.64%  kworker/1:2  [kernel.kallsyms]  [k] do_raw_spin_unlock
+   0.63%  kworker/0:0  [kernel.kallsyms]  [k] perf_event_task_tick
+   0.61%  kworker/1:2  [kernel.kallsyms]  [k] ktime_get
+   0.59%  kworker/0:0  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore

this is what led me to try the other perf methods. the kmem traces were
the only things that really jumped out.

> (Powertop can also do it, and is probably a better thing to use, I'm
> just used to "perf record" for other reasons, so..)

Tried that too, here's what it said..

Summary: 0.0 wakeups/second, 0.0 GPU ops/second and 0.0 VFS ops/sec

       Usage      Events/s    Category    Description
 -2147483648 ms/s    0.0      Timer
 -2147483648 ms/s    0.0      kWork
    35151589 ms/s    0.0      Timer
    35151588 ms/s    0.0      Timer
    35151587 ms/s    0.0      Timer
    35151586 ms/s    0.0      Timer
    35151585 ms/s    0.0      Timer
    35151585 ms/s    0.0      Timer
    35151584 ms/s    0.0      Timer
    35151583 ms/s    0.0      Timer
    35151582 ms/s    0.0      Timer
    35151581 ms/s    0.0      Timer
    35151581 ms/s    0.0      Timer
    35151580 ms/s    0.0      Timer

Not exactly helpful.

	Dave
* Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git)

From: Linus Torvalds @ 2011-03-31 5:32 UTC (permalink / raw)
To: Dave Jones, Linus Torvalds, Andrew Morton, Linux Kernel, Tejun Heo

On Wed, Mar 30, 2011 at 8:55 PM, Dave Jones <davej@redhat.com> wrote:
>
> I tried that, and wasn't particularly enlightened.
>
> +   6.53%  kworker/1:2  [kernel.kallsyms]  [k] read_hpet
> +   4.83%  kworker/0:0  [kernel.kallsyms]  [k] read_hpet

Go into those, deeper into the callchain. That's how to see what the
upper-level problem is - by seeing who actually calls it. The leafs
themselves (where the time is obviously spent) don't tend to be very
informative, no.

> > (Powertop can also do it, and is probably a better thing to use, I'm
> > just used to "perf record" for other reasons, so..)
>
> Tried that too, here's what it said..

Hmm. Powertop is _supposed_ to be able to look into the kwork things
(it has special trace points for it), but as mentioned, the times I've
seen it I've used 'perf', so I haven't actually tried it myself.

                  Linus
* Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git)

From: Arnaldo Carvalho de Melo @ 2011-03-31 14:21 UTC (permalink / raw)
To: Dave Jones; +Cc: Linus Torvalds, Andrew Morton, Linux Kernel, Tejun Heo

Em Wed, Mar 30, 2011 at 10:32:11PM -0700, Linus Torvalds escreveu:
> On Wed, Mar 30, 2011 at 8:55 PM, Dave Jones <davej@redhat.com> wrote:
> > I tried that, and wasn't particularly enlightened.
> >
> > +   6.53%  kworker/1:2  [kernel.kallsyms]  [k] read_hpet
> > +   4.83%  kworker/0:0  [kernel.kallsyms]  [k] read_hpet
>
> Go into those, deeper into the callchain. That's how to see what the
> upper-level problem is - by seeing who actually calls it. The leafs
> themselves (where the time is obviously spent) don't tend to be very
> informative, no.

Yeah, please press 'E' since you're using the TUI, where you can expand
or collapse callchains at all levels.

Pressing 'E' will expand all of them, 'C' will collapse them all, and
Enter on lines with a '+' will expand just that one, etc. Press 'h' to
get more help.

Or use 'perf report --stdio' and you'll get it all expanded and in
plain text.

- Arnaldo
* Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git)

From: Dave Jones @ 2011-03-31 14:58 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Andrew Morton, Linux Kernel, Tejun Heo

On Wed, Mar 30, 2011 at 10:32:11PM -0700, Linus Torvalds wrote:
> On Wed, Mar 30, 2011 at 8:55 PM, Dave Jones <davej@redhat.com> wrote:
> >
> > I tried that, and wasn't particularly enlightened.
> >
> > +   6.53%  kworker/1:2  [kernel.kallsyms]  [k] read_hpet
> > +   4.83%  kworker/0:0  [kernel.kallsyms]  [k] read_hpet
>
> Go into those, deeper into the callchain. That's how to see what the
> upper-level problem is - by seeing who actually calls it. The leafs
> themselves (where the time is obviously spent) don't tend to be very
> informative, no.

ok, a bit more interesting?

-  17.67%  kworker/0:1  [kernel.kallsyms]  [k] read_hpet
   - read_hpet
      - 76.07% timekeeping_get_ns
         - ktime_get
            - 64.02% tick_dev_program_event
                 tick_program_event
                 hrtimer_interrupt
                 smp_apic_timer_interrupt
               + apic_timer_interrupt
            - 35.30% sched_clock_tick
                 scheduler_tick
                 update_process_times
                 tick_sched_timer
                 __run_hrtimer
                 hrtimer_interrupt
                 smp_apic_timer_interrupt
               - apic_timer_interrupt
                  + 65.70% _raw_spin_unlock_irqrestore
                  + 27.33% _raw_spin_unlock_irq
                  + 1.16% lock_acquire
                  + 1.16% put_ldisc
                  + 1.16% process_one_work
                  + 1.16% lock_release
                  + 0.78% worker_thread
                  + 0.58% _raw_spin_lock_irqsave
                  + 0.55% tick_sched_timer
      - 23.82% do_timer
           tick_do_update_jiffies64
           tick_sched_timer
           __run_hrtimer
           hrtimer_interrupt
           smp_apic_timer_interrupt
         - apic_timer_interrupt
            + 63.46% _raw_spin_unlock_irqrestore
            + 22.98% _raw_spin_unlock_irq
            + 3.50% put_ldisc
            + 3.28% lock_acquire

btw, I left this on overnight capturing the output of /proc/loadavg.
It didn't drop off at all, and this morning the laptop is almost too
hot to hold.

	Dave
* Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git) 2011-03-31 14:58 ` Dave Jones @ 2011-03-31 15:03 ` Dave Jones 2011-03-31 15:09 ` Dave Jones 2011-03-31 15:25 ` Linus Torvalds 0 siblings, 2 replies; 27+ messages in thread From: Dave Jones @ 2011-03-31 15:03 UTC (permalink / raw) To: Linus Torvalds, Andrew Morton, Linux Kernel, Tejun Heo

On Thu, Mar 31, 2011 at 10:58:50AM -0400, Dave Jones wrote:

bah I missed a level. which seems to be the smoking gun. something in the tty layer seems to have gone nuts.

> - 17.67% kworker/0:1  [kernel.kallsyms]  [k] read_hpet
>    - read_hpet
>       - 76.07% timekeeping_get_ns
>          - ktime_get
>             - 64.02% tick_dev_program_event
>                  tick_program_event
>                  hrtimer_interrupt
>                  smp_apic_timer_interrupt
>                + apic_timer_interrupt

                 - 52.24% _raw_spin_unlock_irqrestore
                    - 39.26% flush_to_ldisc
                         process_one_work
                         worker_thread
                         kthread
                         kernel_thread_helper
                    - 32.52% __wake_up
                         put_ldisc
                         tty_ldisc_deref
                         flush_to_ldisc
                         process_one_work
                         worker_thread
                         kthread
                         kernel_thread_helper
                    - 28.22% tty_ldisc_try
                         tty_ldisc_ref
                         flush_to_ldisc
                         process_one_work
                         worker_thread
                         kthread
                         kernel_thread_helper
                 - 18.80% _raw_spin_unlock_irq
                      process_one_work
                      worker_thread
                      kthread
                      kernel_thread_helper
                 - 5.98% process_one_work
                      worker_thread
                      kthread
                      kernel_thread_helper
                 - 4.81% lock_acquire
                      process_one_work
                      worker_thread
                      kthread
                      kernel_thread_helper
                 - 4.70% put_ldisc
                      tty_ldisc_deref
                      flush_to_ldisc
                      process_one_work
                      worker_thread
                      kthread
                      kernel_thread_helper
                 - 4.17% worker_thread
                      kthread
                      kernel_thread_helper
                 - 4.17% lock_release
                      process_one_work
                      worker_thread
                      kthread
                      kernel_thread_helper
                 - 1.28% flush_to_ldisc
                      process_one_work
                      worker_thread
                      kthread
                      kernel_thread_helper
                 - 1.28% arch_local_irq_save
                    - 58.33% lock_release
                         process_one_work
                         worker_thread
                         kthread
                         kernel_thread_helper
                    - 41.67% lock_acquire
                         process_one_work
                         worker_thread
                         kthread
                         kernel_thread_helper
                 - 0.85% tty_ldisc_deref
                      flush_to_ldisc
                      process_one_work
                      worker_thread
                      kthread
                      kernel_thread_helper
                 - 0.85% _raw_spin_lock_irqsave
                    - 62.51% flush_to_ldisc
                         process_one_work
                         worker_thread
                         kthread
                         kernel_thread_helper
                    - 37.49% tty_ldisc_try
                         tty_ldisc_ref
                         flush_to_ldisc
                         process_one_work
                         worker_thread
                         kthread
                         kernel_thread_helper
                 - 0.53% tty_ldisc_ref
                      flush_to_ldisc
                      process_one_work
                      worker_thread
                      kthread

>             - 35.30% sched_clock_tick
>                  scheduler_tick
>                  update_process_times
>                  tick_sched_timer
>                  __run_hrtimer
>                  hrtimer_interrupt
>                  smp_apic_timer_interrupt
>                - apic_timer_interrupt
>                   + 65.70% _raw_spin_unlock_irqrestore
>                   + 27.33% _raw_spin_unlock_irq
>                   + 1.16% lock_acquire
>                   + 1.16% put_ldisc
>                   + 1.16% process_one_work
>                   + 1.16% lock_release
>                   + 0.78% worker_thread
>                   + 0.58% _raw_spin_lock_irqsave
>                   + 0.55% tick_sched_timer
>       - 23.82% do_timer
>            tick_do_update_jiffies64
>            tick_sched_timer
>            __run_hrtimer
>            hrtimer_interrupt
>            smp_apic_timer_interrupt
>          - apic_timer_interrupt
>             + 63.46% _raw_spin_unlock_irqrestore
>             + 22.98% _raw_spin_unlock_irq
>             + 3.50% put_ldisc
>             + 3.28% lock_acquire

^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git) 2011-03-31 15:03 ` Dave Jones @ 2011-03-31 15:09 ` Dave Jones 2011-03-31 15:45 ` Linus Torvalds 2011-03-31 15:25 ` Linus Torvalds 1 sibling, 1 reply; 27+ messages in thread From: Dave Jones @ 2011-03-31 15:09 UTC (permalink / raw) To: Linus Torvalds, Andrew Morton, Linux Kernel, Tejun Heo On Thu, Mar 31, 2011 at 11:03:44AM -0400, Dave Jones wrote: > On Thu, Mar 31, 2011 at 10:58:50AM -0400, Dave Jones wrote: > > bah I missed a level. which seems to be the smoking gun. > something in the tty layer seems to have gone nuts. I thought that trace looked familiar. http://lkml.org/lkml/2010/11/30/592 It's the same thing. Dave ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git) 2011-03-31 15:09 ` Dave Jones @ 2011-03-31 15:45 ` Linus Torvalds 0 siblings, 0 replies; 27+ messages in thread From: Linus Torvalds @ 2011-03-31 15:45 UTC (permalink / raw) To: Dave Jones, Linus Torvalds, Andrew Morton, Linux Kernel, Tejun Heo On Thu, Mar 31, 2011 at 8:09 AM, Dave Jones <davej@redhat.com> wrote: > > I thought that trace looked familiar. > > http://lkml.org/lkml/2010/11/30/592 > > It's the same thing. Ok, that's before the "tty: stop using "delayed_work" in the tty layer" commit I just pointed to. So apparently you've been able to trigger this even with the old code too - although maybe the lack of delays anywhere has made it easier, and has made it use more cpu. I'll have to think about it, but I wonder if it's the crazy "reflush" case in flush_to_ldisc. We do if (!tty->receive_room || seen_tail) { schedule_work(&tty->buf.work); break; } inside the routine that is the work itself - basically we're saying that "if there's no more room to flip, or we've seen a new buffer, give up now and reschedule ourselves". Which doesn't really make much sense to me, I have to admit. The code that actually empties the buffer, or the code that adds one, should already have scheduled us for a flip _anyway_. So the only thing that "schedule_work()" is doing is causing infinite work if nothing empties the buffer, or more likely if we have a flushing bug elsewhere. So I'm not sure, but my gut feel is that removing that "schedule_work()" line there is the right thing to do. At a guess, it was hiding some locking problem - and it's been carried around even though hopefully we've fixed all the crazy races we used to have (and it was a mindless "hey, we can retry in one jiffy - it doesn't really cost us anything"). NOTE!
Even if I'm right, and that line is just buggy, the bug may well have been hiding some other issue - i.e. just someone not flushing the tty when it made more room available. So I think the "make tty flush cause a re-flush when it cannot make progress" behavior is wrong, but removing the line may well expose some other problem. Linus ^ permalink raw reply [flat|nested] 27+ messages in thread
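Linus's hunch about the self-requeue can be illustrated with a toy userspace model (a sketch with made-up names, not the kernel code; run, requeue_on_no_room and the step cap are all illustrative): a work function that reschedules itself whenever it cannot make progress spins for as long as nothing frees up receive_room, while leaving the requeue to whoever empties the buffer lets the worker go idle after a single run.

```python
from collections import deque

def run(requeue_on_no_room, max_steps=1000):
    """Toy model of a flush_to_ldisc-style work item (illustrative, not kernel code)."""
    work_queue = deque(["flush"])   # one pending flush work item
    receive_room = 0                # consumer buffer stays full; nothing ever drains it
    wakeups = 0
    while work_queue and wakeups < max_steps:
        work_queue.popleft()        # the worker runs the work item
        wakeups += 1
        if receive_room == 0 and requeue_on_no_room:
            # the suspect pattern: no progress possible, so reschedule ourselves
            work_queue.append("flush")
        # otherwise: do nothing here; whoever frees receive_room is
        # responsible for scheduling another flush
    return wakeups

print(run(requeue_on_no_room=True))   # 1000: only the artificial step cap stops it
print(run(requeue_on_no_room=False))  # 1: runs once, then the worker goes idle
```

With the requeue in place, the only thing stopping the loop is the artificial step cap; the kernel has no such cap, which matches the pinned-CPU kworkers above. The earlier delayed_work variant would merely have rate-limited the same loop to one iteration per jiffy rather than fixing it.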
* Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git) 2011-03-31 15:03 ` Dave Jones 2011-03-31 15:09 ` Dave Jones @ 2011-03-31 15:25 ` Linus Torvalds 2011-03-31 15:49 ` Dave Jones 1 sibling, 1 reply; 27+ messages in thread From: Linus Torvalds @ 2011-03-31 15:25 UTC (permalink / raw) To: Dave Jones, Linus Torvalds, Andrew Morton, Linux Kernel, Tejun Heo On Thu, Mar 31, 2011 at 8:03 AM, Dave Jones <davej@redhat.com> wrote: > On Thu, Mar 31, 2011 at 10:58:50AM -0400, Dave Jones wrote: > > bah I missed a level. which seems to be the smoking gun. > something in the tty layer seems to have gone nuts. Ahhah. That is indeed the smoking gun. I bet it's an old problem where something causes some stupid tty ping-pong behaviour, and now it has become a high-CPU thing triggered by commit f23eb2b2b285 ("tty: stop using "delayed_work" in the tty layer"). Before that commit, if the tty layer got somehow confused, it would always delay for a timer tick in between doing nothing. Now it will go pretty much full blast. Which explains your CPU being hot. Linus ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git) 2011-03-31 15:25 ` Linus Torvalds @ 2011-03-31 15:49 ` Dave Jones 2011-03-31 15:58 ` Linus Torvalds 0 siblings, 1 reply; 27+ messages in thread From: Dave Jones @ 2011-03-31 15:49 UTC (permalink / raw) To: Linus Torvalds; +Cc: Andrew Morton, Linux Kernel, Tejun Heo On Thu, Mar 31, 2011 at 08:25:36AM -0700, Linus Torvalds wrote: > On Thu, Mar 31, 2011 at 8:03 AM, Dave Jones <davej@redhat.com> wrote: > > On Thu, Mar 31, 2011 at 10:58:50AM -0400, Dave Jones wrote: > > > > bah I missed a level. which seems to be the smoking gun. > > something in the tty layer seems to have gone nuts. > > Ahhah. That is indeed the smoking gun. > > I bet it's an old problem where something causes some stupid tty > ping-pong behaviour, and now it has become a high-CPU thing triggered > by commit f23eb2b2b285 ("tty: stop using "delayed_work" in the tty > layer"). > > Before that commit, if the tty layer got somehow confused, it would > always delay for a timer tick in between doing nothing. Now it will go > pretty much full blast. Which explains your CPU being hot. That's a recent change though, and I first saw this back in November. Dave ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git) 2011-03-31 15:49 ` Dave Jones @ 2011-03-31 15:58 ` Linus Torvalds 2011-03-31 16:13 ` Dave Jones 0 siblings, 1 reply; 27+ messages in thread From: Linus Torvalds @ 2011-03-31 15:58 UTC (permalink / raw) To: Dave Jones, Linus Torvalds, Andrew Morton, Linux Kernel, Tejun Heo On Thu, Mar 31, 2011 at 8:49 AM, Dave Jones <davej@redhat.com> wrote: > > That's a recent change though, and I first saw this back in November. So your November report said that you could see "thousands" of these a second. But maybe it didn't use up all CPU until recently? Especially if you have a high CONFIG_HZ value, you'd still see a thousand flushes a second even with the old "delay a bit". So it would use a fair amount of CPU, and certainly waste a lot of power. But it wouldn't pin the CPU entirely. With that commit f23eb2b2b285, the buggy case would become basically totally CPU-bound. I dunno. Right now 'trinity' just ends up printing out a lot of system call errors for me. I assume that's its normal behavior? Linus ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git) 2011-03-31 15:58 ` Linus Torvalds @ 2011-03-31 16:13 ` Dave Jones 0 siblings, 0 replies; 27+ messages in thread From: Dave Jones @ 2011-03-31 16:13 UTC (permalink / raw) To: Linus Torvalds; +Cc: Andrew Morton, Linux Kernel, Tejun Heo On Thu, Mar 31, 2011 at 08:58:14AM -0700, Linus Torvalds wrote: > On Thu, Mar 31, 2011 at 8:49 AM, Dave Jones <davej@redhat.com> wrote: > > > > That's a recent change though, and I first saw this back in November. > > So your November report said that you could see "thousands" of these a > second. But maybe it didn't use up all CPU until recently? From memory, I noticed it back then in the same way "hey, why is the laptop getting hot". > Especially if you have a high CONFIG_HZ value, you'd still see a > thousand flushes a second even with the old "delay a bit". So it would > use a fair amount of CPU, and certainly waste a lot of power. But it > wouldn't pin the CPU entirely. HZ=1000 here, so yeah that makes sense. > With that commit f23eb2b2b285, the buggy case would become basically > totally CPU-bound. > > I dunno. Right now 'trinity' just ends up printing out a lot of system > call errors for me. I assume that's its normal behavior? yep. it's throwing semi-random junk and seeing what sticks. 99% of the time, it ends up with an -EINVAL or something providing the syscalls have sufficient checks in place. We're pretty solid there these days. (Though I still need to finish adding more annotations of some of the syscall arguments just to be sure we're passing semi-sane things to get deeper into the syscalls) What still seems to throw a curve-ball though is the case where calling a syscall generates some state. Due to the random nature of the program, we never have a balanced alloc/free, so lists can grow quite large etc.
I'm wondering if something has created a livelock situation, where some queue has grown to the point that we're generating new work before we can process the backlog. The downside of using randomness of course is with bugs like this, there's no easy way to figure out wtf happened to get it into this state other than poring over huge logfiles of all the syscalls that were made. I'm happily ignorant of most of the inner workings of the tty layer. I don't see anything useful in procfs/sysfs/debugfs. Is there anything useful I can do with the trace tools to try and find out things like the length of queues ? Dave ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: excessive kworker activity when idle. (was Re: vma corruption in today's -git) 2011-03-31 3:37 ` Linus Torvalds 2011-03-31 3:55 ` Dave Jones @ 2011-03-31 6:56 ` Tejun Heo 2011-03-31 10:37 ` [PATCH] workqueue: document debugging tricks Florian Mickler 1 sibling, 1 reply; 27+ messages in thread From: Tejun Heo @ 2011-03-31 6:56 UTC (permalink / raw) To: Linus Torvalds; +Cc: Dave Jones, Andrew Morton, Linux Kernel Hello, On Wed, Mar 30, 2011 at 08:37:21PM -0700, Linus Torvalds wrote: > On Wed, Mar 30, 2011 at 8:09 PM, Dave Jones <davej@redhat.com> wrote: > > > > But rerunning the same tests on current head (6aba74f2791287ec407e0f92487a725a25908067) > > I can still reproduce the problem where kworker threads go nutso > > when the machine should be completely idle. > > > > top shows two kworker threads constantly at >80% cpu. > > Ok, I've seen that "tons of cpu by kworker" triggered by a few > different issues. One was the intel graphics driver doing monitor > detection constantly, and spending all its time in one of the worker > threads doing __udelay() for the stupid i2c driver. > > The other case I've seen is a wireless thing that falls back to GPIO, > and spends a lot of CPU time on that. > > I'm not saying yours is either of those cases, but one of the problems > with that behavior is that it's actually fairly hard to figure out > what the hell is happening. You don't see some nice thread description > in 'top' any more (like you used to when everybody created their own > threads and didn't do the common worker thread thing), and the best > approach literally seems to be something like Yes, not having dedicated workers decreases immediate visibility via ps/top, but I don't think the actual visibility is worse. There are two types of kworker looping - something being scheduled in rapid succession, or a single work item consuming lots of CPU cycles. The first one can be tracked using tracing.
$ echo workqueue:workqueue_queue_work > /sys/kernel/debug/tracing/set_event
$ cat /sys/kernel/debug/tracing/trace_pipe > out.txt
(wait a few secs)
^C

If something is busy looping on work queueing, it would be dominating the output and the offender can be determined with the work item function. For the second type, "cat /proc/THE_OFFENDING_KWORKER/stack" is the easiest. The work item function will be trivially visible in the stack trace. Thanks. -- tejun ^ permalink raw reply [flat|nested] 27+ messages in thread
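To boil such a trace capture down to the worst offender, the event lines can be summarized with a short script. Note the sample lines and the exact "function=" field layout below are illustrative assumptions; the real trace_pipe output format varies by kernel version:

```python
from collections import Counter

def top_work_functions(trace_lines):
    """Count workqueue_queue_work events per work function name."""
    counts = Counter()
    for line in trace_lines:
        # only look at queue_work events that carry a function= field
        if "workqueue_queue_work" in line and "function=" in line:
            func = line.rsplit("function=", 1)[1].split()[0]
            counts[func] += 1
    return counts.most_common()

# Made-up sample in the rough shape of trace_pipe output:
sample = [
    "kworker/0:1-5671 [000] 101.1: workqueue_queue_work: work struct=c1 function=flush_to_ldisc",
    "kworker/0:1-5671 [000] 101.2: workqueue_queue_work: work struct=c1 function=flush_to_ldisc",
    "kworker/1:2-5672 [001] 101.3: workqueue_queue_work: work struct=d2 function=vmstat_update",
]
print(top_work_functions(sample))  # [('flush_to_ldisc', 2), ('vmstat_update', 1)]
```

A work function dominating this count is the "scheduled in rapid succession" case; for the single CPU-hog work item, the /proc/PID/stack check remains the simplest route.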
* [PATCH] workqueue: document debugging tricks 2011-03-31 6:56 ` Tejun Heo @ 2011-03-31 10:37 ` Florian Mickler 2011-03-31 11:41 ` Tejun Heo 0 siblings, 1 reply; 27+ messages in thread From: Florian Mickler @ 2011-03-31 10:37 UTC (permalink / raw) To: tj; +Cc: linux-kernel, rdunlap, linux-doc, Florian Mickler

It is not obvious how to debug run-away workers. These are some tips given by Tejun on lkml.

Signed-off-by: Florian Mickler <florian@mickler.org>
CC: Tejun Heo <tj@kernel.org>
---
 Documentation/workqueue.txt |   38 ++++++++++++++++++++++++++++++++++++++
 1 files changed, 38 insertions(+), 0 deletions(-)

diff --git a/Documentation/workqueue.txt b/Documentation/workqueue.txt
index 01c513f..cdbc3c6 100644
--- a/Documentation/workqueue.txt
+++ b/Documentation/workqueue.txt
@@ -12,6 +12,7 @@ CONTENTS
 4. Application Programming Interface (API)
 5. Example Execution Scenarios
 6. Guidelines
+7. Debugging


 1. Introduction
@@ -379,3 +380,40 @@ If q1 has WQ_CPU_INTENSIVE set,
 * Unless work items are expected to consume a huge amount of CPU
   cycles, using a bound wq is usually beneficial due to the increased
   level of locality in wq operations and work item execution.
+
+
+7. Debugging
+
+Because the work functions are executed by generic worker threads there are a
+few tricks needed to shed some light on misbehaving workqueue users.
+
+Worker threads show up in the process list as:
+
+root      5671  0.0  0.0      0     0 ?        S    12:07   0:00 [kworker/0:1]
+root      5672  0.0  0.0      0     0 ?        S    12:07   0:00 [kworker/1:2]
+root      5673  0.0  0.0      0     0 ?        S    12:12   0:00 [kworker/0:0]
+root      5674  0.0  0.0      0     0 ?        S    12:13   0:00 [kworker/1:0]
+
+If kworkers are going crazy (using too much cpu), there are two types of
+possible problems:
+
+	1. Something being scheduled in rapid succession
+	2. a single work item that consumes lots of cpu cycles
+
+The first one can be tracked using tracing:
+
+	$ echo workqueue:workqueue_queue_work > /sys/kernel/debug/tracing/set_event
+	$ cat /sys/kernel/debug/tracing/trace_pipe > out.txt
+	(wait a few secs)
+	^C
+
+If something is busy looping on work queueing, it would be dominating
+the output and the offender can be determined with the work item
+function.
+
+For the second type of problem it should be possible to just check the stack
+trace of the offending worker thread.
+
+	$ cat /proc/THE_OFFENDING_KWORKER/stack
+
+The work item's function should be trivially visible in the stack trace.
--
1.7.4.1 ^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [PATCH] workqueue: document debugging tricks 2011-03-31 10:37 ` [PATCH] workqueue: document debugging tricks Florian Mickler @ 2011-03-31 11:41 ` Tejun Heo 0 siblings, 0 replies; 27+ messages in thread From: Tejun Heo @ 2011-03-31 11:41 UTC (permalink / raw) To: Florian Mickler; +Cc: linux-kernel, rdunlap, linux-doc On Thu, Mar 31, 2011 at 12:37:06PM +0200, Florian Mickler wrote: > It is not obvious how to debug run-away workers. > > These are some tips given by Tejun on lkml. > > Signed-off-by: Florian Mickler <florian@mickler.org> > CC: Tejun Heo <tj@kernel.org> Right, good idea. Applied to fixes-2.6.39 with minor formatting changes. Thanks. -- tejun ^ permalink raw reply [flat|nested] 27+ messages in thread
end of thread, other threads:[~2011-03-31 21:39 UTC | newest] Thread overview: 27+ messages -- 2011-03-29 4:09 vma corruption in today's -git Dave Jones 2011-03-29 4:19 ` Américo Wang 2011-03-29 4:26 ` Dave Jones 2011-03-29 4:22 ` Linus Torvalds 2011-03-31 3:09 ` excessive kworker activity when idle. (was Re: vma corruption in today's -git) Dave Jones 2011-03-31 3:34 ` Dave Jones 2011-03-31 3:44 ` Linus Torvalds 2011-03-31 4:08 ` Dave Jones 2011-03-31 15:53 ` Linus Torvalds 2011-03-31 16:21 ` Linus Torvalds 2011-03-31 21:38 ` Linus Torvalds 2011-03-31 14:59 ` Paul E. McKenney 2011-03-31 3:37 ` Linus Torvalds 2011-03-31 3:55 ` Dave Jones 2011-03-31 5:32 ` Linus Torvalds 2011-03-31 14:21 ` Arnaldo Carvalho de Melo 2011-03-31 14:58 ` Dave Jones 2011-03-31 15:03 ` Dave Jones 2011-03-31 15:09 ` Dave Jones 2011-03-31 15:45 ` Linus Torvalds 2011-03-31 15:25 ` Linus Torvalds 2011-03-31 15:49 ` Dave Jones 2011-03-31 15:58 ` Linus Torvalds 2011-03-31 16:13 ` Dave Jones 2011-03-31 6:56 ` Tejun Heo 2011-03-31 10:37 ` [PATCH] workqueue: document debugging tricks Florian Mickler 2011-03-31 11:41 ` Tejun Heo