* Re: Soft lockup during suspend since ~2.6.36 [bisected]
@ 2011-04-05 18:56 Thilo-Alexander Ginkel
2011-04-05 23:28 ` Arnd Bergmann
0 siblings, 1 reply; 13+ messages in thread
From: Thilo-Alexander Ginkel @ 2011-04-05 18:56 UTC (permalink / raw)
To: Tejun Heo; +Cc: linux-kernel, Arnd Bergmann
On Mon, Apr 4, 2011 at 17:32, Arnd Bergmann <arnd@arndb.de> wrote:
> On Monday 04 April 2011, Thilo-Alexander Ginkel wrote:
>> ACK. I see two possibilities:
>> a) The bug was introduced after the bisected bug was fixed
>> b) The bug was already present earlier, but was masked by the bug from
>> the bisected change
>>
>> I hope for a) as that would open the possibility to bisect this new bug.
>
> In case of b), you can still bisect it when you either apply the later fix
> or revert the original patch whenever you build a kernel. Or you can try
> to avoid using the usb-hid driver during bisect.

Thanks, that worked pretty well. Eleven bisect builds later, I have now
identified the following candidate commit, which may have introduced
the bug:

dcd989cb73ab0f7b722d64ab6516f101d9f43f88 is the first bad commit
commit dcd989cb73ab0f7b722d64ab6516f101d9f43f88
Author: Tejun Heo <tj@kernel.org>
Date: Tue Jun 29 10:07:14 2010 +0200
workqueue: implement several utility APIs
Implement the following utility APIs.
workqueue_set_max_active() : adjust max_active of a wq
workqueue_congested() : test whether a wq is congested
work_cpu() : determine the last / current cpu of a work
work_busy() : query whether a work is busy
* Anton Blanchard fixed missing ret initialization in work_busy().
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Anton Blanchard <anton@samba.org>

:040000 040000 8b7443c650f0af36f1deba560586a91f6a88abcc 065589a95857a2fb73b94dc242c50ba558179a2a M	include
:040000 040000 84ca2de78af16483fa60a423f4f2d6eee0279eed 27487850f11a1e7ee9e4eaac54fd88f16d420d47 M	kernel

Brief summary for Tejun: Starting with this commit, my system (x86_64,
4 CPUs) sporadically (probability around 25%) fails to suspend due to
a soft lockup. Full details at:
https://lkml.org/lkml/2011/4/4/116
or:
<BANLkTi=n4jLsjOYCd0L3hYb30sgPmdv_WA@mail.gmail.com>
I'd appreciate your help to resolve this issue and would be glad to
test any candidate patches.
Thanks,
Thilo
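
A minimal sketch of the bisect workflow described above (applying the
later fix, or reverting the masking commit, on top of each bisect
step); the commit references and build commands here are placeholders,
not the exact ones used:

  # One bisect iteration; repeat until git bisect converges.
  git bisect start <first-bad-release> <last-good-release>
  git cherry-pick --no-commit <later-fix-commit>  # or: git revert --no-commit <masking-commit>
  make -j8 bzImage modules && make modules_install install
  # ...reboot and try to reproduce the suspend lockup...
  git reset --hard    # drop the temporary fix before marking the result
  git bisect good     # or: git bisect bad, depending on the outcome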

* Re: Soft lockup during suspend since ~2.6.36 [bisected]
  2011-04-05 18:56 Soft lockup during suspend since ~2.6.36 [bisected] Thilo-Alexander Ginkel
@ 2011-04-05 23:28 ` Arnd Bergmann
  2011-04-06  6:03   ` Thilo-Alexander Ginkel
  0 siblings, 1 reply; 13+ messages in thread
From: Arnd Bergmann @ 2011-04-05 23:28 UTC (permalink / raw)
To: Thilo-Alexander Ginkel; +Cc: Tejun Heo, linux-kernel

On Tuesday 05 April 2011, Thilo-Alexander Ginkel wrote:
> On Mon, Apr 4, 2011 at 17:32, Arnd Bergmann <arnd@arndb.de> wrote:
> > On Monday 04 April 2011, Thilo-Alexander Ginkel wrote:
> >> ACK. I see two possibilities:
> >> a) The bug was introduced after the bisected bug was fixed
> >> b) The bug was already present earlier, but was masked by the bug from
> >> the bisected change
> >>
> >> I hope for a) as that would open the possibility to bisect this new bug.
> >
> > In case of b), you can still bisect it when you either apply the later fix
> > or revert the original patch whenever you build a kernel. Or you can try
> > to avoid using the usb-hid driver during bisect.
>
> Thanks, that worked pretty well. Eleven bisect builds later, I have now
> identified the following candidate commit, which may have introduced
> the bug:
>
> dcd989cb73ab0f7b722d64ab6516f101d9f43f88 is the first bad commit
> commit dcd989cb73ab0f7b722d64ab6516f101d9f43f88
> Author: Tejun Heo <tj@kernel.org>
> Date: Tue Jun 29 10:07:14 2010 +0200

Sorry, but looking at the patch shows that it can't possibly have introduced
the problem, since all the code that is modified in it is new code that
is not even used anywhere at that stage.

As far as I can tell, you must have hit a false positive or a false negative
somewhere in the bisect.

	Arnd

* Re: Soft lockup during suspend since ~2.6.36 [bisected]
  2011-04-05 23:28 ` Arnd Bergmann
@ 2011-04-06  6:03   ` Thilo-Alexander Ginkel
  2011-04-14 12:24     ` Thilo-Alexander Ginkel
  0 siblings, 1 reply; 13+ messages in thread
From: Thilo-Alexander Ginkel @ 2011-04-06 6:03 UTC (permalink / raw)
To: Arnd Bergmann; +Cc: Tejun Heo, linux-kernel

On Wed, Apr 6, 2011 at 01:28, Arnd Bergmann <arnd@arndb.de> wrote:
> On Tuesday 05 April 2011, Thilo-Alexander Ginkel wrote:
>> Thanks, that worked pretty well. Eleven bisect builds later, I have now
>> identified the following candidate commit, which may have introduced
>> the bug:
>>
>> dcd989cb73ab0f7b722d64ab6516f101d9f43f88 is the first bad commit
>> commit dcd989cb73ab0f7b722d64ab6516f101d9f43f88
>> Author: Tejun Heo <tj@kernel.org>
>> Date: Tue Jun 29 10:07:14 2010 +0200
>
> Sorry, but looking at the patch shows that it can't possibly have introduced
> the problem, since all the code that is modified in it is new code that
> is not even used anywhere at that stage.
>
> As far as I can tell, you must have hit a false positive or a false negative
> somewhere in the bisect.

Well, you're right. I hit "Reply" too early and should have paid closer
attention to what change the bisect actually brought up.

I already found a false negative (fortunately pretty close to the end
of the bisect sequence) and also verified the preceding good commits,
which gives me two new commits to test. I'll provide an update once
the builds and tests are through, which may however take until early
next week as I will be on vacation until then.

Regards,
Thilo
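
A mislabeled step like the false negative mentioned above can also be
corrected without redoing the whole bisect: git records every decision
in a log that can be edited and replayed. A short sketch:

  git bisect log > bisect.log   # save the decisions made so far
  $EDITOR bisect.log            # delete the wrong "good"/"bad" entry (and anything after it)
  git bisect reset              # abandon the tainted bisect state
  git bisect replay bisect.log  # resume from the corrected history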

* Re: Soft lockup during suspend since ~2.6.36 [bisected]
  2011-04-06  6:03 ` Thilo-Alexander Ginkel
@ 2011-04-14 12:24   ` Thilo-Alexander Ginkel
  2011-04-17 19:35     ` Arnd Bergmann
  0 siblings, 1 reply; 13+ messages in thread
From: Thilo-Alexander Ginkel @ 2011-04-14 12:24 UTC (permalink / raw)
To: Arnd Bergmann, Tejun Heo, Rafael J. Wysocki; +Cc: linux-kernel

On Wed, Apr 6, 2011 at 08:03, Thilo-Alexander Ginkel <thilo@ginkel.com> wrote:
> On Wed, Apr 6, 2011 at 01:28, Arnd Bergmann <arnd@arndb.de> wrote:
>> On Tuesday 05 April 2011, Thilo-Alexander Ginkel wrote:
>>> Thanks, that worked pretty well. Eleven bisect builds later, I have now
>>> identified the following candidate commit, which may have introduced
>>> the bug:
>>>
>>> dcd989cb73ab0f7b722d64ab6516f101d9f43f88 is the first bad commit
>>> commit dcd989cb73ab0f7b722d64ab6516f101d9f43f88
>>> Author: Tejun Heo <tj@kernel.org>
>>> Date: Tue Jun 29 10:07:14 2010 +0200
>>
>> Sorry, but looking at the patch shows that it can't possibly have introduced
>> the problem, since all the code that is modified in it is new code that
>> is not even used anywhere at that stage.
>>
>> As far as I can tell, you must have hit a false positive or a false negative
>> somewhere in the bisect.
>
> Well, you're right. I hit "Reply" too early and should have paid closer
> attention to what change the bisect actually brought up.
>
> I already found a false negative (fortunately pretty close to the end
> of the bisect sequence) and also verified the preceding good commits,
> which gives me two new commits to test. I'll provide an update once
> the builds and tests are through, which may however take until early
> next week as I will be on vacation until then.

All right... I verified all my bisect tests and actually found yet
another bug. After correcting that one (and verifying the correctness
of the other tests), git bisect actually came up with a commit that
makes some more sense:

| e22bee782b3b00bd4534ae9b1c5fb2e8e6573c5c is the first bad commit
| commit e22bee782b3b00bd4534ae9b1c5fb2e8e6573c5c
| Author: Tejun Heo <tj@kernel.org>
| Date: Tue Jun 29 10:07:14 2010 +0200
|
| workqueue: implement concurrency managed dynamic worker pool

The good news is that I am able to reproduce the issue within a KVM
virtual machine, so I am able to test for the soft lockup (which
somewhat looks like a race condition during worker / CPU shutdown) in
a mostly automated fashion. Unfortunately, that also means that this
issue is all but hardware specific, i.e., it most probably affects all
SMP systems (with a varying probability depending on the number of
CPUs).

Adding some further details about my configuration (which I replicated
in the VM):
- lvm running on top of
- dmcrypt (luks) running on top of
- md raid1

If anyone is interested in getting hold of this VM for further tests,
let me know and I'll try to figure out how to get it (2*8 GB, barely
compressible due to dmcrypt) to its recipient.

Regards,
Thilo

* Re: Soft lockup during suspend since ~2.6.36 [bisected]
  2011-04-14 12:24 ` Thilo-Alexander Ginkel
@ 2011-04-17 19:35   ` Arnd Bergmann
  2011-04-17 21:53     ` Thilo-Alexander Ginkel
  0 siblings, 1 reply; 13+ messages in thread
From: Arnd Bergmann @ 2011-04-17 19:35 UTC (permalink / raw)
To: Thilo-Alexander Ginkel
Cc: Tejun Heo, Rafael J. Wysocki, linux-kernel, dm-devel

On Thursday 14 April 2011, Thilo-Alexander Ginkel wrote:
> All right... I verified all my bisect tests and actually found yet
> another bug. After correcting that one (and verifying the correctness
> of the other tests), git bisect actually came up with a commit that
> makes some more sense:
>
> | e22bee782b3b00bd4534ae9b1c5fb2e8e6573c5c is the first bad commit
> | commit e22bee782b3b00bd4534ae9b1c5fb2e8e6573c5c
> | Author: Tejun Heo <tj@kernel.org>
> | Date: Tue Jun 29 10:07:14 2010 +0200
> |
> | workqueue: implement concurrency managed dynamic worker pool

Is it possible to make it work by reverting this patch in 2.6.38?

> The good news is that I am able to reproduce the issue within a KVM
> virtual machine, so I am able to test for the soft lockup (which
> somewhat looks like a race condition during worker / CPU shutdown) in
> a mostly automated fashion. Unfortunately, that also means that this
> issue is all but hardware specific, i.e., it most probably affects all
> SMP systems (with a varying probability depending on the number of
> CPUs).
>
> Adding some further details about my configuration (which I replicated
> in the VM):
> - lvm running on top of
> - dmcrypt (luks) running on top of
> - md raid1
>
> If anyone is interested in getting hold of this VM for further tests,
> let me know and I'll try to figure out how to get it (2*8 GB, barely
> compressible due to dmcrypt) to its recipient.

Adding dm-devel to Cc, in case the problem is somewhere in there.

	Arnd
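
The suggested revert would look roughly like the following sketch; as
the reply below notes, it does not apply cleanly on 2.6.38, so the
conflicts would have to be resolved by hand:

  git checkout v2.6.38
  git revert --no-edit e22bee782b3b00bd4534ae9b1c5fb2e8e6573c5c
  # expected to stop with conflicts that need manual resolution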

* Re: Soft lockup during suspend since ~2.6.36 [bisected]
  2011-04-17 19:35 ` Arnd Bergmann
@ 2011-04-17 21:53   ` Thilo-Alexander Ginkel
  2011-04-26 13:11     ` Tejun Heo
  0 siblings, 1 reply; 13+ messages in thread
From: Thilo-Alexander Ginkel @ 2011-04-17 21:53 UTC (permalink / raw)
To: Arnd Bergmann; +Cc: Tejun Heo, Rafael J. Wysocki, linux-kernel, dm-devel

On Sun, Apr 17, 2011 at 21:35, Arnd Bergmann <arnd@arndb.de> wrote:
> On Thursday 14 April 2011, Thilo-Alexander Ginkel wrote:
>> All right... I verified all my bisect tests and actually found yet
>> another bug. After correcting that one (and verifying the correctness
>> of the other tests), git bisect actually came up with a commit that
>> makes some more sense:
>>
>> | e22bee782b3b00bd4534ae9b1c5fb2e8e6573c5c is the first bad commit
>> | commit e22bee782b3b00bd4534ae9b1c5fb2e8e6573c5c
>> | Author: Tejun Heo <tj@kernel.org>
>> | Date: Tue Jun 29 10:07:14 2010 +0200
>> |
>> | workqueue: implement concurrency managed dynamic worker pool
>
> Is it possible to make it work by reverting this patch in 2.6.38?

Unfortunately, that's not that easy to test as the reverted patch does
not apply cleanly against 2.6.38 (23 failed hunks) and I am not sure
whether I want to revert it manually ;-).

>> The good news is that I am able to reproduce the issue within a KVM
>> virtual machine, so I am able to test for the soft lockup (which
>> somewhat looks like a race condition during worker / CPU shutdown) in
>> a mostly automated fashion. Unfortunately, that also means that this
>> issue is all but hardware specific, i.e., it most probably affects all
>> SMP systems (with a varying probability depending on the number of
>> CPUs).
>>
>> Adding some further details about my configuration (which I replicated
>> in the VM):
>> - lvm running on top of
>> - dmcrypt (luks) running on top of
>> - md raid1
>>
>> If anyone is interested in getting hold of this VM for further tests,
>> let me know and I'll try to figure out how to get it (2*8 GB, barely
>> compressible due to dmcrypt) to its recipient.
>
> Adding dm-devel to Cc, in case the problem is somewhere in there.

In the meantime I also figured out that 2.6.39-rc3 seems to fix the
issue (there have been some work queue changes, so this is somewhat
sensible) and that raid1 seems to be sufficient to trigger the issue.
Now one could try to figure out what actually fixed it, but if that
means another bisect series I am not too keen to perform that
exercise. ;-) If someone else feels inclined to do so, my test
environment is available for download, though:
https://secure.tgbyte.de/dropbox/lockup-test.tar.bz2 (~ 700 MB)

Boot using:
  kvm -hda LockupTestRaid-1.qcow2 -hdb LockupTestRaid-2.qcow2 -smp 8 -m 1024 -curses

To run the test, log in as root / test and run:
  /root/suspend-test

Regards,
Thilo
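
The /root/suspend-test script itself was not posted; a hypothetical
reconstruction of such a loop might look like this, where the rtcwake
wake-up delay and the default cycle count are assumptions:

  #!/bin/sh
  # Suspend to RAM repeatedly, using the RTC alarm to auto-resume;
  # a soft lockup shows up as a cycle that fails or never comes back.
  cycles=${1:-100}
  i=0
  while [ "$i" -lt "$cycles" ]; do
      i=$((i + 1))
      echo "suspend/resume cycle $i"
      rtcwake -m mem -s 15 || { echo "cycle $i failed"; exit 1; }
      sleep 5    # settling time after resume
  done
  echo "completed $cycles cycles without lockup"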

* Re: Soft lockup during suspend since ~2.6.36 [bisected]
  2011-04-17 21:53 ` Thilo-Alexander Ginkel
@ 2011-04-26 13:11   ` Tejun Heo
  2011-04-27 23:51     ` Thilo-Alexander Ginkel
  0 siblings, 1 reply; 13+ messages in thread
From: Tejun Heo @ 2011-04-26 13:11 UTC (permalink / raw)
To: Thilo-Alexander Ginkel
Cc: Arnd Bergmann, Rafael J. Wysocki, linux-kernel, dm-devel

Hello, sorry about the delay. Was on the road and then sick.

On Sun, Apr 17, 2011 at 11:53:42PM +0200, Thilo-Alexander Ginkel wrote:
> >> | e22bee782b3b00bd4534ae9b1c5fb2e8e6573c5c is the first bad commit
> >> | commit e22bee782b3b00bd4534ae9b1c5fb2e8e6573c5c
> >> | Author: Tejun Heo <tj@kernel.org>
> >> | Date: Tue Jun 29 10:07:14 2010 +0200
> >> |
> >> | workqueue: implement concurrency managed dynamic worker pool
> >
> > Is it possible to make it work by reverting this patch in 2.6.38?
>
> Unfortunately, that's not that easy to test as the reverted patch does
> not apply cleanly against 2.6.38 (23 failed hunks) and I am not sure
> whether I want to revert it manually ;-).

Yeap, reverting that one would be a major effort at this point.

Hmmm... assuming all the workqueue usages were correct, the change
shouldn't have introduced such a bug. All forward progress guarantees
remain the same in that all workqueues are automatically given a
rescuer thread. That said, there have been a number of bug fixes and
cases where the single rescuer guarantee wasn't enough (which was
dangerous before the change too but was less likely to trigger).

> >> If anyone is interested in getting hold of this VM for further tests,
> >> let me know and I'll try to figure out how to get it (2*8 GB, barely
> >> compressible due to dmcrypt) to its recipient.
> >
> > Adding dm-devel to Cc, in case the problem is somewhere in there.
>
> In the meantime I also figured out that 2.6.39-rc3 seems to fix the
> issue (there have been some work queue changes, so this is somewhat
> sensible)

Hmmm... that's a big demotivator. :-)

> and that raid1 seems to be sufficient to trigger the issue.
> Now one could try to figure out what actually fixed it, but if that
> means another bisect series I am not too keen to perform that
> exercise. ;-) If someone else feels inclined to do so, my test
> environment is available for download, though:
> https://secure.tgbyte.de/dropbox/lockup-test.tar.bz2 (~ 700 MB)
>
> Boot using:
>   kvm -hda LockupTestRaid-1.qcow2 -hdb LockupTestRaid-2.qcow2 -smp 8 -m 1024 -curses
>
> To run the test, log in as root / test and run:
>   /root/suspend-test

Before I go ahead and try that, do you happen to have a softlockup dump,
i.e., stack traces of the stuck tasks? I can't find the original
posting.

Thank you.

--
tejun

* Re: Soft lockup during suspend since ~2.6.36 [bisected]
  2011-04-26 13:11 ` Tejun Heo
@ 2011-04-27 23:51   ` Thilo-Alexander Ginkel
  2011-04-28 10:30     ` Tejun Heo
  0 siblings, 1 reply; 13+ messages in thread
From: Thilo-Alexander Ginkel @ 2011-04-27 23:51 UTC (permalink / raw)
To: Tejun Heo; +Cc: Arnd Bergmann, Rafael J. Wysocki, linux-kernel, dm-devel

On Tue, Apr 26, 2011 at 15:11, Tejun Heo <tj@kernel.org> wrote:
> Hello, sorry about the delay. Was on the road and then sick.

No problem. Thanks for getting back to me.

>> In the meantime I also figured out that 2.6.39-rc3 seems to fix the
>> issue (there have been some work queue changes, so this is somewhat
>> sensible)
>
> Hmmm... that's a big demotivator. :-)

Well, I get your point. ;-) Maybe this fact can help as a motivator: I
ran some further tests and while -rc3 seems to be ok (and survived 100
suspend/resume cycles), the issue strangely seems to be back with -rc4
(the softlockup call stack that I can see is identical to the photos
below; the lockup happened after only two cycles).

> Before I go ahead and try that, do you happen to have a softlockup dump,
> i.e., stack traces of the stuck tasks? I can't find the original
> posting.

Sure. From <BANLkTi=n4jLsjOYCd0L3hYb30sgPmdv_WA@mail.gmail.com>:

> Unfortunately, the output via a serial console becomes garbled after
> "Entering mem sleep", so I went for patching dumpstack_64.c and a
> couple of other source files to reduce the verbosity. I hope not to
> have stripped any essential information. The result is available in
> these pictures:
> https://secure.tgbyte.de/dropbox/IeZalo4t-1.jpg
> https://secure.tgbyte.de/dropbox/IeZalo4t-2.jpg
>
> For both traces, the printed error message reads: "BUG: soft lockup -
> CPU#3 stuck for 67s! [kblockd:28]"

Thanks,
Thilo

* Re: Soft lockup during suspend since ~2.6.36 [bisected]
  2011-04-27 23:51 ` Thilo-Alexander Ginkel
@ 2011-04-28 10:30   ` Tejun Heo
  2011-04-28 23:56     ` Thilo-Alexander Ginkel
  0 siblings, 1 reply; 13+ messages in thread
From: Tejun Heo @ 2011-04-28 10:30 UTC (permalink / raw)
To: Thilo-Alexander Ginkel
Cc: Arnd Bergmann, Rafael J. Wysocki, linux-kernel, dm-devel

Hello,

On Thu, Apr 28, 2011 at 01:51:34AM +0200, Thilo-Alexander Ginkel wrote:
> Well, I get your point. ;-) Maybe this fact can help as a motivator: I
> ran some further tests and while -rc3 seems to be ok (and survived 100
> suspend/resume cycles), the issue strangely seems to be back with -rc4
> (the softlockup call stack that I can see is identical to the photos
> below; the lockup happened after only two cycles).
>
> > Before I go ahead and try that, do you happen to have a softlockup dump,
> > i.e., stack traces of the stuck tasks? I can't find the original
> > posting.
>
> Sure. From <BANLkTi=n4jLsjOYCd0L3hYb30sgPmdv_WA@mail.gmail.com>:
>
> > Unfortunately, the output via a serial console becomes garbled after
> > "Entering mem sleep", so I went for patching dumpstack_64.c and a
> > couple of other source files to reduce the verbosity. I hope not to
> > have stripped any essential information. The result is available in
> > these pictures:
> > https://secure.tgbyte.de/dropbox/IeZalo4t-1.jpg
> > https://secure.tgbyte.de/dropbox/IeZalo4t-2.jpg
> >
> > For both traces, the printed error message reads: "BUG: soft lockup -
> > CPU#3 stuck for 67s! [kblockd:28]"

Does your kernel have preemption enabled? If not, does the following
patch fix the problem? Thanks.

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 04ef830..08c7334 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1293,6 +1293,7 @@ __acquires(&gcwq->lock)
 
 		/* CPU has come up inbetween, retry migration */
 		cpu_relax();
+		cond_resched();
 	}
 }
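
One way to answer the preemption question on a running system is to
inspect the build configuration; the file locations below are the
usual ones but may differ per distribution:

  # Distributions typically install the build config next to the kernel:
  grep '^CONFIG_PREEMPT' /boot/config-"$(uname -r)"
  # If the kernel was built with CONFIG_IKCONFIG_PROC, the config is
  # also exposed at runtime:
  zcat /proc/config.gz | grep '^CONFIG_PREEMPT'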

* Re: Soft lockup during suspend since ~2.6.36 [bisected]
  2011-04-28 10:30 ` Tejun Heo
@ 2011-04-28 23:56   ` Thilo-Alexander Ginkel
  2011-04-29 16:00     ` Tejun Heo
  0 siblings, 1 reply; 13+ messages in thread
From: Thilo-Alexander Ginkel @ 2011-04-28 23:56 UTC (permalink / raw)
To: Tejun Heo; +Cc: Arnd Bergmann, Rafael J. Wysocki, linux-kernel, dm-devel

On Thu, Apr 28, 2011 at 12:30, Tejun Heo <tj@kernel.org> wrote:
> Does your kernel have preemption enabled?

CONFIG_PREEMPT is not set

> If not, does the following patch fix the problem?

Yep, looks good so far (at least in my virtualized test environment):
2.6.39-rc4 with your patch applied survived 100 suspend/resume cycles
w/o locking up.

Just out of curiosity: Was this a new issue in 2.6.39-rc4, or could
this fix be backported to 2.6.38?

Thanks,
Thilo

* Re: Soft lockup during suspend since ~2.6.36 [bisected]
  2011-04-28 23:56 ` Thilo-Alexander Ginkel
@ 2011-04-29 16:00   ` Tejun Heo
  2011-04-29 16:18     ` [PATCH] workqueue: fix deadlock in worker_maybe_bind_and_lock() Tejun Heo
  0 siblings, 1 reply; 13+ messages in thread
From: Tejun Heo @ 2011-04-29 16:00 UTC (permalink / raw)
To: Thilo-Alexander Ginkel
Cc: Arnd Bergmann, Rafael J. Wysocki, linux-kernel, dm-devel

Hello,

On Fri, Apr 29, 2011 at 01:56:46AM +0200, Thilo-Alexander Ginkel wrote:
> On Thu, Apr 28, 2011 at 12:30, Tejun Heo <tj@kernel.org> wrote:
> > Does your kernel have preemption enabled?
>
> CONFIG_PREEMPT is not set
>
> > If not, does the following patch fix the problem?
>
> Yep, looks good so far (at least in my virtualized test environment):
> 2.6.39-rc4 with your patch applied survived 100 suspend/resume cycles
> w/o locking up.

Awesome, I'll forward the patch to mainline and -stable.

> Just out of curiosity: Was this a new issue in 2.6.39-rc4, or could
> this fix be backported to 2.6.38?

This needs to be backported. It's an issue which has been there from
the initial implementation of cmwq. It needs a non-preemptive kernel
and the rescuer kicking in at a very bad time, so not many people
seem to have been affected by this. Your setup somehow triggers it
reliably.

Thank you.

--
tejun

* [PATCH] workqueue: fix deadlock in worker_maybe_bind_and_lock()
  2011-04-29 16:00 ` Tejun Heo
@ 2011-04-29 16:18   ` Tejun Heo
  2011-04-29 20:40     ` Rafael J. Wysocki
  0 siblings, 1 reply; 13+ messages in thread
From: Tejun Heo @ 2011-04-29 16:18 UTC (permalink / raw)
To: Thilo-Alexander Ginkel
Cc: Arnd Bergmann, Rafael J. Wysocki, linux-kernel, dm-devel

From 5035b20fa5cd146b66f5f89619c20a4177fb736d Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Fri, 29 Apr 2011 18:08:37 +0200

If a rescuer and stop_machine() bringing down a CPU race with each
other, they may deadlock on non-preemptive kernel. The CPU won't
accept a new task, so the rescuer can't migrate to the target CPU,
while stop_machine() can't proceed because the rescuer is holding one
of the CPU retrying migration. GCWQ_DISASSOCIATED is never cleared
and worker_maybe_bind_and_lock() retries indefinitely.

This problem can be reproduced semi reliably while the system is
entering suspend.

http://thread.gmane.org/gmane.linux.kernel/1122051

A lot of kudos to Thilo-Alexander for reporting this tricky issue and
painstaking testing.

stable: This affects all kernels with cmwq, so all kernels since and
including v2.6.36 need this fix.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Thilo-Alexander Ginkel <thilo@ginkel.com>
Tested-by: Thilo-Alexander Ginkel <thilo@ginkel.com>
Cc: stable@kernel.org
---
Will soon send pull request to Linus. Thank you very much.

 kernel/workqueue.c |    8 +++++++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 04ef830..e3378e8 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1291,8 +1291,14 @@ __acquires(&gcwq->lock)
 			return true;
 		spin_unlock_irq(&gcwq->lock);
 
-		/* CPU has come up inbetween, retry migration */
+		/*
+		 * We've raced with CPU hot[un]plug.  Give it a breather
+		 * and retry migration.  cond_resched() is required here;
+		 * otherwise, we might deadlock against cpu_stop trying to
+		 * bring down the CPU on non-preemptive kernel.
+		 */
 		cpu_relax();
+		cond_resched();
 	}
 }

* Re: [PATCH] workqueue: fix deadlock in worker_maybe_bind_and_lock()
  2011-04-29 16:18 ` [PATCH] workqueue: fix deadlock in worker_maybe_bind_and_lock() Tejun Heo
@ 2011-04-29 20:40   ` Rafael J. Wysocki
  0 siblings, 0 replies; 13+ messages in thread
From: Rafael J. Wysocki @ 2011-04-29 20:40 UTC (permalink / raw)
To: Tejun Heo; +Cc: Thilo-Alexander Ginkel, Arnd Bergmann, linux-kernel, dm-devel

On Friday, April 29, 2011, Tejun Heo wrote:
> From 5035b20fa5cd146b66f5f89619c20a4177fb736d Mon Sep 17 00:00:00 2001
> From: Tejun Heo <tj@kernel.org>
> Date: Fri, 29 Apr 2011 18:08:37 +0200
>
> If a rescuer and stop_machine() bringing down a CPU race with each
> other, they may deadlock on non-preemptive kernel. The CPU won't
> accept a new task, so the rescuer can't migrate to the target CPU,
> while stop_machine() can't proceed because the rescuer is holding one
> of the CPU retrying migration. GCWQ_DISASSOCIATED is never cleared
> and worker_maybe_bind_and_lock() retries indefinitely.
>
> This problem can be reproduced semi reliably while the system is
> entering suspend.
>
> http://thread.gmane.org/gmane.linux.kernel/1122051
>
> A lot of kudos to Thilo-Alexander for reporting this tricky issue and
> painstaking testing.
>
> stable: This affects all kernels with cmwq, so all kernels since and
> including v2.6.36 need this fix.

Well, _that_ explains quite a number of mysterious reports where
suspend or poweroff hang randomly.

Thanks a lot for fixing it!

Rafael


> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reported-by: Thilo-Alexander Ginkel <thilo@ginkel.com>
> Tested-by: Thilo-Alexander Ginkel <thilo@ginkel.com>
> Cc: stable@kernel.org
> ---
> Will soon send pull request to Linus. Thank you very much.
>
>  kernel/workqueue.c |    8 +++++++-
>  1 files changed, 7 insertions(+), 1 deletions(-)
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 04ef830..e3378e8 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -1291,8 +1291,14 @@ __acquires(&gcwq->lock)
>  			return true;
>  		spin_unlock_irq(&gcwq->lock);
>
> -		/* CPU has come up inbetween, retry migration */
> +		/*
> +		 * We've raced with CPU hot[un]plug.  Give it a breather
> +		 * and retry migration.  cond_resched() is required here;
> +		 * otherwise, we might deadlock against cpu_stop trying to
> +		 * bring down the CPU on non-preemptive kernel.
> +		 */
>  		cpu_relax();
> +		cond_resched();
>  	}
>  }