public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* Soft lockup during suspend since ~2.6.36
@ 2011-04-03  2:03 Thilo-Alexander Ginkel
  2011-04-03  6:18 ` Mike Galbraith
  0 siblings, 1 reply; 10+ messages in thread
From: Thilo-Alexander Ginkel @ 2011-04-03  2:03 UTC (permalink / raw)
  To: linux-kernel

Hello everyone,

recently, I upgraded my Linux kernel (running on an Intel Core 2 Quad)
from 2.6.35 to 2.6.38. Unfortunately, I have been experiencing
frequent (but sporadic) freezes during suspend (S3) since then, i.e.,
the system became completely unresponsive (even MagicSysRq stopped
working) after having spun down the hard disks.

I performed a bisect, which pointed out the following commit:

bd25f4dd6972755579d0ea50d1a5ace2e9b00d1a is the first bad commit
commit bd25f4dd6972755579d0ea50d1a5ace2e9b00d1a
Author: Arnd Bergmann <arnd@arndb.de>
Date:   Sun Jul 11 15:34:05 2010 +0200

    HID: hiddev: use usb_find_interface, get rid of BKL

    This removes the private hiddev_table in the usbhid
    driver and changes it to use usb_find_interface
    instead.

    The advantage is that we can avoid the race between
    usb_register_dev and usb_open and no longer need the
    big kernel lock.

    This doesn't introduce race condition -- the intf pointer could be
    invalidated only in hiddev_disconnect() through usb_deregister_dev(),
    but that will block on minor_rwsem and not actually remove the device
    until usb_open().

    Signed-off-by: Arnd Bergmann <arnd@arndb.de>
    Cc: Jiri Kosina <jkosina@suse.cz>
    Cc: "Greg Kroah-Hartman" <gregkh@suse.de>
    Signed-off-by: Jiri Kosina <jkosina@suse.cz>

:040000 040000 4ae14b3ba486373d7a354874e9ad334858f094e3
8041ffda20ca3020a6b60d64235ae179f8186bf0 M      drivers

Booting with no_console_suspend also brings up a BUG mentioning a soft
lockup. As my serial console breaks during suspend, I had to take
photos of the error messages:
  https://secure.tgbyte.de/dropbox/eaghoh9M.jpg
  https://secure.tgbyte.de/dropbox/ecae8ieR.jpg
  https://secure.tgbyte.de/dropbox/vah5ooR9.jpg

Unfortunately, I have so far been unable to get the complete error
message in a single screen capture (as the output is either too
verbose or too scarce).

This bug is present up to (and including) 2.6.38 (I did not try any
later version).

I'd appreciate your help to fix this issue. Let me know if you need
further details.

Thanks,
Thilo

P.S.: Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=31562

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Soft lockup during suspend since ~2.6.36
  2011-04-03  2:03 Soft lockup during suspend since ~2.6.36 Thilo-Alexander Ginkel
@ 2011-04-03  6:18 ` Mike Galbraith
  2011-04-03 17:32   ` Thilo-Alexander Ginkel
  0 siblings, 1 reply; 10+ messages in thread
From: Mike Galbraith @ 2011-04-03  6:18 UTC (permalink / raw)
  To: Thilo-Alexander Ginkel; +Cc: linux-kernel, Arnd Bergmann

Adds CC to author of fingered commit.

On Sun, 2011-04-03 at 04:03 +0200, Thilo-Alexander Ginkel wrote:
> Hello everyone,
> 
> recently, I upgraded my Linux kernel (running on an Intel Core 2 Quad)
> from 2.6.35 to 2.6.38. Unfortunately, I have been experiencing
> frequent (but sporadic) freezes during suspend (S3) since then, i.e.,
> the system became completely unresponsive (even MagicSysRq stopped
> working) after having spun down the hard disks.
> 
> I performed a bisect, which pointed out the following commit:
> 
> bd25f4dd6972755579d0ea50d1a5ace2e9b00d1a is the first bad commit
> commit bd25f4dd6972755579d0ea50d1a5ace2e9b00d1a
> Author: Arnd Bergmann <arnd@arndb.de>
> Date:   Sun Jul 11 15:34:05 2010 +0200
> 
>     HID: hiddev: use usb_find_interface, get rid of BKL
> 
>     This removes the private hiddev_table in the usbhid
>     driver and changes it to use usb_find_interface
>     instead.
> 
>     The advantage is that we can avoid the race between
>     usb_register_dev and usb_open and no longer need the
>     big kernel lock.
> 
>     This doesn't introduce race condition -- the intf pointer could be
>     invalidated only in hiddev_disconnect() through usb_deregister_dev(),
>     but that will block on minor_rwsem and not actually remove the device
>     until usb_open().
> 
>     Signed-off-by: Arnd Bergmann <arnd@arndb.de>
>     Cc: Jiri Kosina <jkosina@suse.cz>
>     Cc: "Greg Kroah-Hartman" <gregkh@suse.de>
>     Signed-off-by: Jiri Kosina <jkosina@suse.cz>
> 
> :040000 040000 4ae14b3ba486373d7a354874e9ad334858f094e3
> 8041ffda20ca3020a6b60d64235ae179f8186bf0 M      drivers
> 
> Booting with no_console_suspend also brings up a BUG mentioning a soft
> lockup. As my serial console breaks during suspend, I had to take
> photos of the error messages:
>   https://secure.tgbyte.de/dropbox/eaghoh9M.jpg
>   https://secure.tgbyte.de/dropbox/ecae8ieR.jpg
>   https://secure.tgbyte.de/dropbox/vah5ooR9.jpg
> 
> Unfortunately, I have so far been unable to get the complete error
> message in a single screen capture (as the output is either too
> verbose or too scarce).
> 
> This bug is present up to (and including) 2.6.38 (I did not try any
> later version).
> 
> I'd appreciate your help to fix this issue. Let me know if you need
> further details.
> 
> Thanks,
> Thilo
> 
> P.S.: Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=31562
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Soft lockup during suspend since ~2.6.36
  2011-04-03  6:18 ` Mike Galbraith
@ 2011-04-03 17:32   ` Thilo-Alexander Ginkel
  2011-04-03 21:08     ` Thilo-Alexander Ginkel
  0 siblings, 1 reply; 10+ messages in thread
From: Thilo-Alexander Ginkel @ 2011-04-03 17:32 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arnd Bergmann

On Sun, Apr 3, 2011 at 08:18, Mike Galbraith <efault@gmx.de> wrote:
> Adds CC to author of fingered commit.
>
> On Sun, 2011-04-03 at 04:03 +0200, Thilo-Alexander Ginkel wrote:
>> recently, I upgraded my Linux kernel (running on an Intel Core 2 Quad)
>> from 2.6.35 to 2.6.38. Unfortunately, I have been experiencing
>> frequent (but sporadic) freezes during suspend (S3) since then, i.e.,
>> the system became completely unresponsive (even MagicSysRq stopped
>> working) after having spun down the hard disks.
>>
>> I performed a bisect, which pointed out the following commit:
>>
>> bd25f4dd6972755579d0ea50d1a5ace2e9b00d1a is the first bad commit
>> commit bd25f4dd6972755579d0ea50d1a5ace2e9b00d1a

A good night of sleep and some Googling later I am under the
impression that this issue was already supposedly fixed by the
following commit a long time ago:
  http://git.kernel.org/?p=linux/kernel/git/jikos/hid.git;a=commitdiff;h=9c9e54a8df0be48aa359744f412377cc55c3b7d2
  http://permalink.gmane.org/gmane.linux.kernel/1024386

This makes some sense as versions closer to 2.6.36 only lock up
sporadically whereas the original bug locks up reliably. I'll try to
get a backtrace for the sporadic bug, but may only be able to do so
using a tainted kernel as my text console does not survive
suspend/resume without proprietary drivers (and the serial console
also breaks during the relevant suspend phase). I hope, this is
acceptable.

Thanks,
Thilo

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Soft lockup during suspend since ~2.6.36
  2011-04-03 17:32   ` Thilo-Alexander Ginkel
@ 2011-04-03 21:08     ` Thilo-Alexander Ginkel
  2011-04-04  3:02       ` Arnd Bergmann
  0 siblings, 1 reply; 10+ messages in thread
From: Thilo-Alexander Ginkel @ 2011-04-03 21:08 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arnd Bergmann

On Sun, Apr 3, 2011 at 19:32, Thilo-Alexander Ginkel <thilo@ginkel.com> wrote:
> A good night of sleep and some Googling later I am under the
> impression that this issue was already supposedly fixed by the
> following commit a long time ago:
>  http://git.kernel.org/?p=linux/kernel/git/jikos/hid.git;a=commitdiff;h=9c9e54a8df0be48aa359744f412377cc55c3b7d2
>  http://permalink.gmane.org/gmane.linux.kernel/1024386
>
> This makes some sense as versions closer to 2.6.36 only lock up
> sporadically whereas the original bug locks up reliably. I'll try to
> get a backtrace for the sporadic bug, but may only be able to do so
> using a tainted kernel as my text console does not survive
> suspend/resume without proprietary drivers (and the serial console
> also breaks during the relevant suspend phase). I hope, this is
> acceptable.

All right. I have actually been able to obtain a backtrace for a
2.6.36 kernel, which only shows this problem sporadically (so this
probably has a different root cause).

It all starts with the suspend sequence hanging while shutting down
non-boot CPUs (the CPU# varies from case to case):
  https://secure.tgbyte.de/dropbox/ueph3Ohm-1.jpg

Then, after some time, the following backtrace is written:
  https://secure.tgbyte.de/dropbox/ueph3Ohm-2.jpg

Transcript:
| RIP: _raw_spin_unlock_irqrestore
| Call Stack:
|   _set_cpus_allowed_ptr
|   worker_maybe_bind_and_lock
|   rescuer_thread
|   rescuer_thread
|   kernel_thread_helper
|   kthread
|   kernel_thread_helper
|   kthread

After some more time, the following backtrace is printed:
  https://secure.tgbyte.de/dropbox/ueph3Ohm-3.jpg

Transcript:
| RIP: worker_maybe_bind_and_lock
| Call Stack:
|   rescuer_thread
|   rescuer_thread
|   kernel_thread_helper
|   kthread
|   kernel_thread_helper
|   kthread

>From then on these two backtraces are printed in an alternating fashion.

Unfortunately, the top of the output is missing as it does not fit on
screen (does someone know an even smaller console font?), but I assume
it's the deadlock detection.

Thanks,
Thilo

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Soft lockup during suspend since ~2.6.36
  2011-04-03 21:08     ` Thilo-Alexander Ginkel
@ 2011-04-04  3:02       ` Arnd Bergmann
  2011-04-04 13:57         ` Thilo-Alexander Ginkel
  0 siblings, 1 reply; 10+ messages in thread
From: Arnd Bergmann @ 2011-04-04  3:02 UTC (permalink / raw)
  To: Thilo-Alexander Ginkel; +Cc: linux-kernel

On Sunday 03 April 2011, Thilo-Alexander Ginkel wrote:
> Transcript:
> | RIP: _raw_spin_unlock_irqrestore
> | Call Stack:
> |   _set_cpus_allowed_ptr
> |   worker_maybe_bind_and_lock
> |   rescuer_thread
> |   rescuer_thread
> |   kernel_thread_helper
> |   kthread
> |   kernel_thread_helper
> |   kthread
> 
> After some more time, the following backtrace is printed:
>   https://secure.tgbyte.de/dropbox/ueph3Ohm-3.jpg
> 
> Transcript:
> | RIP: worker_maybe_bind_and_lock
> | Call Stack:
> |   rescuer_thread
> |   rescuer_thread
> |   kernel_thread_helper
> |   kthread
> |   kernel_thread_helper
> |   kthread
> 
> From then on these two backtraces are printed in an alternating fashion.
> 
> Unfortunately, the top of the output is missing as it does not fit on
> screen (does someone know an even smaller console font?), but I assume
> it's the deadlock detection.

The interesting thing is that the backtrace is from unlock, not from lock,
so it can't really be a deadlock. However, the _raw_spin_unlock_irqrestore
function calls debug_spin_unlock(), which does a few sanity check. Maybe
one of those got triggered.

The easiest way to get the full output is usually to attach a serial
NULL modem cable and redirect the console to that, so you can get the
output on another machine. Another idea would be to modify the
show_registers function in arch/x86/kernel/dumpstack_64.c so that
it prints less data.

Yet another idea would be to set /sys/kernel/printk_delay so that the
oops gets printed slower.

	Arnd

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Soft lockup during suspend since ~2.6.36
  2011-04-04  3:02       ` Arnd Bergmann
@ 2011-04-04 13:57         ` Thilo-Alexander Ginkel
  2011-04-04 14:40           ` Arnd Bergmann
  0 siblings, 1 reply; 10+ messages in thread
From: Thilo-Alexander Ginkel @ 2011-04-04 13:57 UTC (permalink / raw)
  To: Arnd Bergmann, linux-kernel

On Mon, Apr 4, 2011 at 05:02, Arnd Bergmann <arnd@arndb.de> wrote:
> On Sunday 03 April 2011, Thilo-Alexander Ginkel wrote:
>> Transcript:
>> | RIP: _raw_spin_unlock_irqrestore
>> | Call Stack:
>> |   _set_cpus_allowed_ptr
>> |   worker_maybe_bind_and_lock
>> |   rescuer_thread
>> |   rescuer_thread
>> |   kernel_thread_helper
>> |   kthread
>> |   kernel_thread_helper
>> |   kthread
>>
>> After some more time, the following backtrace is printed:
>>   https://secure.tgbyte.de/dropbox/ueph3Ohm-3.jpg
>>
>> Transcript:
>> | RIP: worker_maybe_bind_and_lock
>> | Call Stack:
>> |   rescuer_thread
>> |   rescuer_thread
>> |   kernel_thread_helper
>> |   kthread
>> |   kernel_thread_helper
>> |   kthread
>>
>> From then on these two backtraces are printed in an alternating fashion.
>>
>> Unfortunately, the top of the output is missing as it does not fit on
>> screen (does someone know an even smaller console font?), but I assume
>> it's the deadlock detection.
>
> The interesting thing is that the backtrace is from unlock, not from lock,
> so it can't really be a deadlock. However, the _raw_spin_unlock_irqrestore
> function calls debug_spin_unlock(), which does a few sanity check. Maybe
> one of those got triggered.
>
> The easiest way to get the full output is usually to attach a serial
> NULL modem cable and redirect the console to that, so you can get the
> output on another machine. Another idea would be to modify the
> show_registers function in arch/x86/kernel/dumpstack_64.c so that
> it prints less data.

Unfortunately, the output via a serial console becomes garbled after
"Entering mem sleep", so I went for patching dumpstack_64.c and a
couple of other source files to reduce the verbosity. I hope not to
have stripped any essential information. The result is available in
these pictures:
  https://secure.tgbyte.de/dropbox/IeZalo4t-1.jpg
  https://secure.tgbyte.de/dropbox/IeZalo4t-2.jpg

For both traces, the printed error message reads: "BUG: soft lockup -
CPU#3 stuck for 67s! [kblockd:28]"

(After a bit of Googling I understand that a soft lockup is probably
different from a deadlock - please correct me if that assumption is
wrong)

> Yet another idea would be to set /sys/kernel/printk_delay so that the
> oops gets printed slower.

Hm, that file does not exist on my machine. Does it need a special
compile-time config  option to be enabled?

Regards,
Thilo

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Soft lockup during suspend since ~2.6.36
  2011-04-04 13:57         ` Thilo-Alexander Ginkel
@ 2011-04-04 14:40           ` Arnd Bergmann
  2011-04-04 15:02             ` Thilo-Alexander Ginkel
  0 siblings, 1 reply; 10+ messages in thread
From: Arnd Bergmann @ 2011-04-04 14:40 UTC (permalink / raw)
  To: Thilo-Alexander Ginkel; +Cc: linux-kernel

On Monday 04 April 2011, Thilo-Alexander Ginkel wrote:
> On Mon, Apr 4, 2011 at 05:02, Arnd Bergmann <arnd@arndb.de> wrote:

> Unfortunately, the output via a serial console becomes garbled after
> "Entering mem sleep", so I went for patching dumpstack_64.c and a
> couple of other source files to reduce the verbosity. I hope not to
> have stripped any essential information. The result is available in
> these pictures:
>   https://secure.tgbyte.de/dropbox/IeZalo4t-1.jpg
>   https://secure.tgbyte.de/dropbox/IeZalo4t-2.jpg
> 
> For both traces, the printed error message reads: "BUG: soft lockup -
> CPU#3 stuck for 67s! [kblockd:28]"
> 
> (After a bit of Googling I understand that a soft lockup is probably
> different from a deadlock - please correct me if that assumption is
> wrong)

My interpretation is that some process tries to use
kblockd_schedule_work() after the CPU for that workqueue has been
disabled. The work queue functions (worker_maybe_bind_and_lock)
is waiting for the CPU to become available, which it doesn't do.

You see different outputs every time the softlockup detection finds
this because the loop is in different states here. The reason why
the spin_unlock shows up here is because that is when the interrupts
get enabled and the softlockup detection notices the timeout.

I'm pretty sure that this has nothing to do with the bisected bug
that you initially found, but maybe somebody else can try analysing
this better.

> > Yet another idea would be to set /sys/kernel/printk_delay so that the
> > oops gets printed slower.
> 
> Hm, that file does not exist on my machine. Does it need a special
> compile-time config  option to be enabled?

Sorry, I meant /proc/sys/kernel/printk_delay.

	Arnd

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Soft lockup during suspend since ~2.6.36
  2011-04-04 14:40           ` Arnd Bergmann
@ 2011-04-04 15:02             ` Thilo-Alexander Ginkel
  2011-04-04 15:32               ` Arnd Bergmann
  0 siblings, 1 reply; 10+ messages in thread
From: Thilo-Alexander Ginkel @ 2011-04-04 15:02 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: linux-kernel

On Mon, Apr 4, 2011 at 16:40, Arnd Bergmann <arnd@arndb.de> wrote:
> On Monday 04 April 2011, Thilo-Alexander Ginkel wrote:
>> The result is available in these pictures:
>>   https://secure.tgbyte.de/dropbox/IeZalo4t-1.jpg
>>   https://secure.tgbyte.de/dropbox/IeZalo4t-2.jpg
>>
>> For both traces, the printed error message reads: "BUG: soft lockup -
>> CPU#3 stuck for 67s! [kblockd:28]"
>>
>> (After a bit of Googling I understand that a soft lockup is probably
>> different from a deadlock - please correct me if that assumption is
>> wrong)
>
> My interpretation is that some process tries to use
> kblockd_schedule_work() after the CPU for that workqueue has been
> disabled. The work queue functions (worker_maybe_bind_and_lock)
> is waiting for the CPU to become available, which it doesn't do.

Thanks for your help so far!

Is there a way to figure out which process that may be?

> You see different outputs every time the softlockup detection finds
> this because the loop is in different states here. The reason why
> the spin_unlock shows up here is because that is when the interrupts
> get enabled and the softlockup detection notices the timeout.

OK, that make sense.

> I'm pretty sure that this has nothing to do with the bisected bug
> that you initially found, but maybe somebody else can try analysing
> this better.

ACK. I see two possibilities:
a) The bug was introduced after the bisected bug was fixed
b) The bug was already present earlier, but was masked by the bug from
the bisected change

I hope for a) as that would open the possibility to bisect this new bug.

Regards,
Thilo

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Soft lockup during suspend since ~2.6.36
  2011-04-04 15:02             ` Thilo-Alexander Ginkel
@ 2011-04-04 15:32               ` Arnd Bergmann
  2011-04-09 16:37                 ` Pavel Machek
  0 siblings, 1 reply; 10+ messages in thread
From: Arnd Bergmann @ 2011-04-04 15:32 UTC (permalink / raw)
  To: Thilo-Alexander Ginkel; +Cc: linux-kernel

On Monday 04 April 2011, Thilo-Alexander Ginkel wrote:
> ACK. I see two possibilities:
> a) The bug was introduced after the bisected bug was fixed
> b) The bug was already present earlier, but was masked by the bug from
> the bisected change
> 
> I hope for a) as that would open the possibility to bisect this new bug.
> 

In case of b), you can still bisect it when you either apply the later fix
or revert the original patch whenever you build a kernel. Or you can try
to avoid using the usb-hid driver during bisect.

	Arnd

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Soft lockup during suspend since ~2.6.36
  2011-04-04 15:32               ` Arnd Bergmann
@ 2011-04-09 16:37                 ` Pavel Machek
  0 siblings, 0 replies; 10+ messages in thread
From: Pavel Machek @ 2011-04-09 16:37 UTC (permalink / raw)
  To: Arnd Bergmann, Rafael J. Wysocki; +Cc: Thilo-Alexander Ginkel, linux-kernel

Hi!

> > ACK. I see two possibilities:
> > a) The bug was introduced after the bisected bug was fixed
> > b) The bug was already present earlier, but was masked by the bug from
> > the bisected change
> > 
> > I hope for a) as that would open the possibility to bisect this new bug.
> 
> In case of b), you can still bisect it when you either apply the later fix
> or revert the original patch whenever you build a kernel. Or you can try
> to avoid using the usb-hid driver during bisect.

As this is both suspend problem and regression... you may want to cc: rjw.

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2011-04-09 16:38 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-04-03  2:03 Soft lockup during suspend since ~2.6.36 Thilo-Alexander Ginkel
2011-04-03  6:18 ` Mike Galbraith
2011-04-03 17:32   ` Thilo-Alexander Ginkel
2011-04-03 21:08     ` Thilo-Alexander Ginkel
2011-04-04  3:02       ` Arnd Bergmann
2011-04-04 13:57         ` Thilo-Alexander Ginkel
2011-04-04 14:40           ` Arnd Bergmann
2011-04-04 15:02             ` Thilo-Alexander Ginkel
2011-04-04 15:32               ` Arnd Bergmann
2011-04-09 16:37                 ` Pavel Machek

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox