Re: rcu_sched_state detected stall on CPU 0, 3.0-rc2

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Re: rcu_sched_state detected stall on CPU 0, 3.0-rc2
       [not found] ` <20110612231143.GC11580@hexapodia.org>
@ 2011-06-12 23:55   ` Andy Isaacson
  2011-06-13  2:30     ` Ben Hutchings
  0 siblings, 1 reply; 4+ messages in thread
From: Andy Isaacson @ 2011-06-12 23:55 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: linux-kernel, netdev, linux-pm

[-- Attachment #1: Type: text/plain, Size: 2414 bytes --]

Let's CC netdev and linux-pm since this is obviously a suspend issue,
and may have something to do with ethtool.

On Sun, Jun 12, 2011 at 04:11:43PM -0700, Andy Isaacson wrote:
> On Sun, Jun 12, 2011 at 12:58:56PM -0700, Andy Isaacson wrote:
> > My Thinkpad x201s threw some errors (?) a few minutes after resuming
> > from suspend-to-ram this morning.
> > 
> > [56415.672140] INFO: rcu_sched_state detected stall on CPU 0 (t=15000 jiffies)
> > 
> > Nothing jumps out of the backtraces at me.  Full dmesg and config
> > attached.  This was my first StR since upgrading from 2.6.39, let's see
> > if it fails again when I suspend after sending this email. :)
> 
> I haven't had a fully successful StR cycle yet (in 5 tries), although I
> can't pin them all on RCU.  On try 2 it hung completely about 10 seconds
> after I unlocked the screensaver, on try 3 it came back to a black
> console, and on try 4 it didn't suspend at all (blinking moon LED but
> battery LED and CPU fan still on).

Of course now that I'm trying to debug, I am seeing many successful
suspend-resume cycles.  I don't see any signs of difference between the
cases that hung and the cases that are now succeeding.

CCing netdev, because I suspend by running pm-suspend, and in at least
one failure, an ethtool running under pm-suspend seemed to be the
problem:

root 11558 pts/8    S+ \_ /bin/sh /usr/lib/pm-utils/sleep.d/00powers
root 11559 pts/8    S+     \_ /bin/sh /usr/sbin/pm-powersave
root 11576 pts/8    S+         \_ /bin/sh /usr/lib/pm-utils/power.d/
root 11577 pts/8    D+             \_ ethtool -s eth0 wol g

many processes were stuck in D:

USER     PID    VSZ   RSS STAT START COMMAND
root   11493      0     0 D    16:11  \_ [kworker/u:15]
nobody  1707  21472   992 D    14:31 dnsmasq --strict-order --bind-interfaces --pid-file=/var/run/libvirt/network/default.pid --conf-file= --except-interface lo --listen-address 192.168.122.1 --dhcp-range 192.168.122.2,192.168.122.254 --dhcp-leasefile=/var/lib/libvirt/dnsmasq/default.leases --dhcp-lease-max=253 --dhcp-no-override
adi    11606  41004  2424 D+   16:13  |       \_ ssh hex
root   11577   4092   324 D+   16:11  |                               \_ ethtool -s eth0 wol g
root   11595  22108   892 D+   16:12  |       \_ sudo cat /proc/11577/stack
root   11604  22108   900 D+   16:13  |       \_ sudo cat /proc/11577/stack

==> /proc/11577/wchan <==
synchronize_sched

-andy

[-- Attachment #2: trace.gz --]
[-- Type: application/octet-stream, Size: 27327 bytes --]

[-- Attachment #3: config-trim.gz --]
[-- Type: application/octet-stream, Size: 8350 bytes --]

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: rcu_sched_state detected stall on CPU 0, 3.0-rc2
  2011-06-12 23:55   ` rcu_sched_state detected stall on CPU 0, 3.0-rc2 Andy Isaacson
@ 2011-06-13  2:30     ` Ben Hutchings
  2011-06-13  6:04       ` Andy Isaacson
  0 siblings, 1 reply; 4+ messages in thread
From: Ben Hutchings @ 2011-06-13  2:30 UTC (permalink / raw)
  To: Andy Isaacson; +Cc: Paul E. McKenney, linux-kernel, netdev, linux-pm

On Sun, 2011-06-12 at 16:55 -0700, Andy Isaacson wrote:
> Let's CC netdev and linux-pm since this is obviously a suspend issue,
> and may have something to do with ethtool.
> 
> On Sun, Jun 12, 2011 at 04:11:43PM -0700, Andy Isaacson wrote:
> > On Sun, Jun 12, 2011 at 12:58:56PM -0700, Andy Isaacson wrote:
> > > My Thinkpad x201s threw some errors (?) a few minutes after resuming
> > > from suspend-to-ram this morning.
> > > 
> > > [56415.672140] INFO: rcu_sched_state detected stall on CPU 0 (t=15000 jiffies)
> > > 
> > > Nothing jumps out of the backtraces at me.  Full dmesg and config
> > > attached.  This was my first StR since upgrading from 2.6.39, let's see
> > > if it fails again when I suspend after sending this email. :)
> > 
> > I haven't had a fully successful StR cycle yet (in 5 tries), although I
> > can't pin them all on RCU.  On try 2 it hung completely about 10 seconds
> > after I unlocked the screensaver, on try 3 it came back to a black
> > console, and on try 4 it didn't suspend at all (blinking moon LED but
> > battery LED and CPU fan still on).
> 
> Of course now that I'm trying to debug, I am seeing many successful
> suspend-resume cycles.  I don't see any signs of difference between the
> cases that hung and the cases that are now succeeding.
> 
> CCing netdev, because I suspend by running pm-suspend, and in at least
> one failure, an ethtool running under pm-suspend seemed to be the
> problem:
> 
> root 11558 pts/8    S+ \_ /bin/sh /usr/lib/pm-utils/sleep.d/00powers
> root 11559 pts/8    S+     \_ /bin/sh /usr/sbin/pm-powersave
> root 11576 pts/8    S+         \_ /bin/sh /usr/lib/pm-utils/power.d/
> root 11577 pts/8    D+             \_ ethtool -s eth0 wol g
[...]

Wake-on-LAN configuration is entirely handled by the relevant driver;
the ethtool core just copies the parameters in and out.  It looks like
there is some sort of deadlock or missing unlock in the driver.  So my
question would be which driver is running eth0?

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: rcu_sched_state detected stall on CPU 0, 3.0-rc2
  2011-06-13  2:30     ` Ben Hutchings
@ 2011-06-13  6:04       ` Andy Isaacson
  2011-06-19 21:20         ` Andy Isaacson
  0 siblings, 1 reply; 4+ messages in thread
From: Andy Isaacson @ 2011-06-13  6:04 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: Paul E. McKenney, linux-kernel, netdev, linux-pm

On Sun, Jun 12, 2011 at 10:30:05PM -0400, Ben Hutchings wrote:
> > Of course now that I'm trying to debug, I am seeing many successful
> > suspend-resume cycles.  I don't see any signs of difference between the
> > cases that hung and the cases that are now succeeding.
> > 
> > CCing netdev, because I suspend by running pm-suspend, and in at least
> > one failure, an ethtool running under pm-suspend seemed to be the
> > problem:
> > 
> > root 11558 pts/8    S+ \_ /bin/sh /usr/lib/pm-utils/sleep.d/00powers
> > root 11559 pts/8    S+     \_ /bin/sh /usr/sbin/pm-powersave
> > root 11576 pts/8    S+         \_ /bin/sh /usr/lib/pm-utils/power.d/
> > root 11577 pts/8    D+             \_ ethtool -s eth0 wol g
> [...]
> 
> Wake-on-LAN configuration is entirely handled by the relevant driver;
> the ethtool core just copies the parameters in and out.  It looks like
> there is some sort of deadlock or missing unlock in the driver.  So my
> question would be which driver is running eth0?

I attached the whole gzipped dmesg, but anyways, it's e1000e:

[    1.168733] e1000e: Intel(R) PRO/1000 Network Driver - 1.3.10-k2
[    1.168879] e1000e: Copyright(c) 1999 - 2011 Intel Corporation.
[    1.169346] e1000e 0000:00:19.0: PCI INT A -> GSI 20 (level, low) -> IRQ 20
[    1.169494] e1000e 0000:00:19.0: setting latency timer to 64
[    1.169603] usbcore: registered new interface driver usbfs
[    1.169675] e1000e 0000:00:19.0: irq 40 for MSI/MSI-X
[    1.169794] usbcore: registered new interface driver hub
[    1.181231] usbcore: registered new device driver usb
[    1.181879] ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
[    1.190984] thermal LNXTHERM:00: registered as thermal_zone0
[    1.191129] ACPI: Thermal Zone [THM0] (48 C)
[    1.205191] SCSI subsystem initialized
[    1.207349] libata version 3.00 loaded.
[    1.352926] e1000e 0000:00:19.0: eth0: (PCI Express:2.5GT/s:Width x1) 00:26:2d:f3:14:0f
[    1.353243] e1000e 0000:00:19.0: eth0: Intel(R) PRO/1000 Network Connection
[    1.353473] e1000e 0000:00:19.0: eth0: MAC: 9, PHY: 10, PBA No: A002FF-0FF
[snip]
[ 1221.836727] PM: Entering mem sleep
[ 1221.836855] Suspending console(s) (use no_console_suspend to debug)
[ 1222.018735] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[ 1222.063181] sd 0:0:0:0: [sda] Stopping disk
[ 1222.272665] ehci_hcd 0000:00:1d.0: PCI INT D disabled
[ 1222.272668] ehci_hcd 0000:00:1a.0: PCI INT D disabled
[ 1222.304428] i915 0000:00:02.0: power state changed by ACPI to D3
[ 1222.378384] e1000e 0000:00:19.0: PCI INT A disabled
[ 1222.378393] e1000e 0000:00:19.0: PME# enabled
[ 1222.378400] e1000e 0000:00:19.0: wake-up capability enabled by ACPI

ethtool does show up in the failure dmesg:

[56520.872462] INFO: task ethtool:18105 blocked for more than 120 seconds.
[56520.872465] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[56520.872468] ethtool         D 7fffffffffffffff     0 18105  18104 0x00000000
[56520.872473]  ffff880105d99ad8 0000000000000082 ffffffff810d0c05 ffff880137d15730
[56520.872479]  ffff880130fd3780 ffff880105d99fd8 ffff880105d99fd8 0000000000012500
[56520.872485]  ffff8801333794d0 ffff880130fd3780 ffff880105d99fd8 ffff880137ffbe00
[56520.872490] Call Trace:
[56520.872499]  [<ffffffff810d0c05>] ? zone_statistics+0x7c/0x83
[56520.872504]  [<ffffffff8131c13e>] schedule_timeout+0x34/0xde
[56520.872509]  [<ffffffff8131bef8>] wait_for_common+0xa3/0x11a
[56520.872514]  [<ffffffff8103c393>] ? try_to_wake_up+0x1ac/0x1ac
[56520.872519]  [<ffffffff8131c023>] wait_for_completion+0x1d/0x1f
[56520.872524]  [<ffffffff8109557d>] synchronize_sched+0x5a/0x5c
[56520.872528]  [<ffffffff8105beac>] ? find_ge_pid+0x41/0x41
[56520.872536]  [<ffffffff8122f67a>] wakeup_source_remove+0x60/0x64
[56520.872541]  [<ffffffff8122f785>] wakeup_source_unregister+0x13/0x1f
[56520.872546]  [<ffffffff8122f7f1>] device_wakeup_disable+0x60/0x6b
[56520.872550]  [<ffffffff8122f942>] device_set_wakeup_enable+0x2d/0x2f
[56520.872573]  [<ffffffffa003a56a>] e1000_set_wol+0x9b/0x9f [e1000e]
[56520.872580]  [<ffffffff81261c8e>] dev_ethtool+0x338/0x1c06
[56520.872585]  [<ffffffff810d4be4>] ? __do_fault+0x313/0x34a
[56520.872591]  [<ffffffff810329ef>] ? should_resched+0xe/0x2d
[56520.872595]  [<ffffffff8131be41>] ? _cond_resched+0xe/0x22
[56520.872600]  [<ffffffff8126019d>] dev_ioctl+0x503/0x688
[56520.872605]  [<ffffffff81319009>] ? __slab_alloc+0x330/0x342
[56520.872610]  [<ffffffff8124ac72>] ? sock_alloc_inode+0x24/0xb7
[56520.872615]  [<ffffffff8124b0c2>] sock_do_ioctl+0x3b/0x46
[56520.872619]  [<ffffffff8124b4f1>] sock_ioctl+0x20d/0x21b
[56520.872625]  [<ffffffff810feb02>] ? get_empty_filp+0x93/0x11b
[56520.872630]  [<ffffffff8110b218>] do_vfs_ioctl+0x460/0x4a1
[56520.872635]  [<ffffffff8124ca82>] ? sock_alloc_file+0xb3/0x114
[56520.872640]  [<ffffffff8131d13e>] ? _raw_spin_lock+0xe/0x10
[56520.872646]  [<ffffffff810fc23d>] ? fd_install+0x31/0x5d
[56520.872650]  [<ffffffff8110b2a0>] sys_ioctl+0x47/0x6b
[56520.872655]  [<ffffffff8131ddc2>] system_call_fastpath+0x16/0x1b

-andy

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: rcu_sched_state detected stall on CPU 0, 3.0-rc2
  2011-06-13  6:04       ` Andy Isaacson
@ 2011-06-19 21:20         ` Andy Isaacson
  0 siblings, 0 replies; 4+ messages in thread
From: Andy Isaacson @ 2011-06-19 21:20 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: Paul E. McKenney, linux-kernel, netdev, linux-pm

On Sun, Jun 12, 2011 at 11:04:53PM -0700, Andy Isaacson wrote:
> On Sun, Jun 12, 2011 at 10:30:05PM -0400, Ben Hutchings wrote:
> > > Of course now that I'm trying to debug, I am seeing many successful
> > > suspend-resume cycles.  I don't see any signs of difference between the
> > > cases that hung and the cases that are now succeeding.
> > > 
> > > CCing netdev, because I suspend by running pm-suspend, and in at least
> > > one failure, an ethtool running under pm-suspend seemed to be the
> > > problem:
> > > 
> > > root 11558 pts/8    S+ \_ /bin/sh /usr/lib/pm-utils/sleep.d/00powers
> > > root 11559 pts/8    S+     \_ /bin/sh /usr/sbin/pm-powersave
> > > root 11576 pts/8    S+         \_ /bin/sh /usr/lib/pm-utils/power.d/
> > > root 11577 pts/8    D+             \_ ethtool -s eth0 wol g
> > [...]
> > 
> > Wake-on-LAN configuration is entirely handled by the relevant driver;
> > the ethtool core just copies the parameters in and out.  It looks like
> > there is some sort of deadlock or missing unlock in the driver.  So my
> > question would be which driver is running eth0?
> 
> I attached the whole gzipped dmesg, but anyways, it's e1000e:
> 
> [    1.168733] e1000e: Intel(R) PRO/1000 Network Driver - 1.3.10-k2
> [    1.168879] e1000e: Copyright(c) 1999 - 2011 Intel Corporation.
> [    1.169346] e1000e 0000:00:19.0: PCI INT A -> GSI 20 (level, low) -> IRQ 20
> [    1.169494] e1000e 0000:00:19.0: setting latency timer to 64
> [    1.169603] usbcore: registered new interface driver usbfs
> [    1.169675] e1000e 0000:00:19.0: irq 40 for MSI/MSI-X
> [    1.169794] usbcore: registered new interface driver hub
> [    1.181231] usbcore: registered new device driver usb
> [    1.181879] ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
> [    1.190984] thermal LNXTHERM:00: registered as thermal_zone0
> [    1.191129] ACPI: Thermal Zone [THM0] (48 C)
> [    1.205191] SCSI subsystem initialized
> [    1.207349] libata version 3.00 loaded.
> [    1.352926] e1000e 0000:00:19.0: eth0: (PCI Express:2.5GT/s:Width x1) 00:26:2d:f3:14:0f
> [    1.353243] e1000e 0000:00:19.0: eth0: Intel(R) PRO/1000 Network Connection
> [    1.353473] e1000e 0000:00:19.0: eth0: MAC: 9, PHY: 10, PBA No: A002FF-0FF
> [snip]
> [ 1221.836727] PM: Entering mem sleep
> [ 1221.836855] Suspending console(s) (use no_console_suspend to debug)
> [ 1222.018735] sd 0:0:0:0: [sda] Synchronizing SCSI cache
> [ 1222.063181] sd 0:0:0:0: [sda] Stopping disk
> [ 1222.272665] ehci_hcd 0000:00:1d.0: PCI INT D disabled
> [ 1222.272668] ehci_hcd 0000:00:1a.0: PCI INT D disabled
> [ 1222.304428] i915 0000:00:02.0: power state changed by ACPI to D3
> [ 1222.378384] e1000e 0000:00:19.0: PCI INT A disabled
> [ 1222.378393] e1000e 0000:00:19.0: PME# enabled
> [ 1222.378400] e1000e 0000:00:19.0: wake-up capability enabled by ACPI
> 
> ethtool does show up in the failure dmesg:
> 
> [56520.872462] INFO: task ethtool:18105 blocked for more than 120 seconds.
> [56520.872465] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [56520.872468] ethtool         D 7fffffffffffffff     0 18105  18104 0x00000000
> [56520.872473]  ffff880105d99ad8 0000000000000082 ffffffff810d0c05 ffff880137d15730
> [56520.872479]  ffff880130fd3780 ffff880105d99fd8 ffff880105d99fd8 0000000000012500
> [56520.872485]  ffff8801333794d0 ffff880130fd3780 ffff880105d99fd8 ffff880137ffbe00
> [56520.872490] Call Trace:
> [56520.872499]  [<ffffffff810d0c05>] ? zone_statistics+0x7c/0x83
> [56520.872504]  [<ffffffff8131c13e>] schedule_timeout+0x34/0xde
> [56520.872509]  [<ffffffff8131bef8>] wait_for_common+0xa3/0x11a
> [56520.872514]  [<ffffffff8103c393>] ? try_to_wake_up+0x1ac/0x1ac
> [56520.872519]  [<ffffffff8131c023>] wait_for_completion+0x1d/0x1f
> [56520.872524]  [<ffffffff8109557d>] synchronize_sched+0x5a/0x5c
> [56520.872528]  [<ffffffff8105beac>] ? find_ge_pid+0x41/0x41
> [56520.872536]  [<ffffffff8122f67a>] wakeup_source_remove+0x60/0x64
> [56520.872541]  [<ffffffff8122f785>] wakeup_source_unregister+0x13/0x1f
> [56520.872546]  [<ffffffff8122f7f1>] device_wakeup_disable+0x60/0x6b
> [56520.872550]  [<ffffffff8122f942>] device_set_wakeup_enable+0x2d/0x2f
> [56520.872573]  [<ffffffffa003a56a>] e1000_set_wol+0x9b/0x9f [e1000e]
> [56520.872580]  [<ffffffff81261c8e>] dev_ethtool+0x338/0x1c06
> [56520.872585]  [<ffffffff810d4be4>] ? __do_fault+0x313/0x34a
> [56520.872591]  [<ffffffff810329ef>] ? should_resched+0xe/0x2d
> [56520.872595]  [<ffffffff8131be41>] ? _cond_resched+0xe/0x22
> [56520.872600]  [<ffffffff8126019d>] dev_ioctl+0x503/0x688
> [56520.872605]  [<ffffffff81319009>] ? __slab_alloc+0x330/0x342
> [56520.872610]  [<ffffffff8124ac72>] ? sock_alloc_inode+0x24/0xb7
> [56520.872615]  [<ffffffff8124b0c2>] sock_do_ioctl+0x3b/0x46
> [56520.872619]  [<ffffffff8124b4f1>] sock_ioctl+0x20d/0x21b
> [56520.872625]  [<ffffffff810feb02>] ? get_empty_filp+0x93/0x11b
> [56520.872630]  [<ffffffff8110b218>] do_vfs_ioctl+0x460/0x4a1
> [56520.872635]  [<ffffffff8124ca82>] ? sock_alloc_file+0xb3/0x114
> [56520.872640]  [<ffffffff8131d13e>] ? _raw_spin_lock+0xe/0x10
> [56520.872646]  [<ffffffff810fc23d>] ? fd_install+0x31/0x5d
> [56520.872650]  [<ffffffff8110b2a0>] sys_ioctl+0x47/0x6b
> [56520.872655]  [<ffffffff8131ddc2>] system_call_fastpath+0x16/0x1b

I've been using 3.0.0-rc2 at 3c25fa74 and haven't seen this problem
reoccur, so I guess it was either transient or fixed by something in
7f45e5c..3c25fa7.  Thanks for the help, everyone.

-andy

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2011-06-19 21:20 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20110612195856.GA11580@hexapodia.org>
     [not found] ` <20110612231143.GC11580@hexapodia.org>
2011-06-12 23:55   ` rcu_sched_state detected stall on CPU 0, 3.0-rc2 Andy Isaacson
2011-06-13  2:30     ` Ben Hutchings
2011-06-13  6:04       ` Andy Isaacson
2011-06-19 21:20         ` Andy Isaacson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).