* Re: Soft lockup during suspend since ~2.6.36 [bisected]
@ 2011-04-05 18:56 Thilo-Alexander Ginkel
2011-04-05 23:28 ` Arnd Bergmann
0 siblings, 1 reply; 13+ messages in thread
From: Thilo-Alexander Ginkel @ 2011-04-05 18:56 UTC (permalink / raw)
To: Tejun Heo; +Cc: linux-kernel, Arnd Bergmann
On Mon, Apr 4, 2011 at 17:32, Arnd Bergmann <arnd@arndb.de> wrote:
> On Monday 04 April 2011, Thilo-Alexander Ginkel wrote:
>> ACK. I see two possibilities:
>> a) The bug was introduced after the bisected bug was fixed
>> b) The bug was already present earlier, but was masked by the bug from
>> the bisected change
>>
>> I hope for a) as that would open the possibility to bisect this new bug.
>
> In case of b), you can still bisect it when you either apply the later fix
> or revert the original patch whenever you build a kernel. Or you can try
> to avoid using the usb-hid driver during bisect.

Thanks, that worked pretty well. Eleven bisect builds later, I have now
identified the following candidate commit, which may have introduced
the bug:

dcd989cb73ab0f7b722d64ab6516f101d9f43f88 is the first bad commit
commit dcd989cb73ab0f7b722d64ab6516f101d9f43f88
Author: Tejun Heo <tj@kernel.org>
Date: Tue Jun 29 10:07:14 2010 +0200
workqueue: implement several utility APIs
Implement the following utility APIs.
workqueue_set_max_active() : adjust max_active of a wq
workqueue_congested() : test whether a wq is congested
work_cpu() : determine the last / current cpu of a work
work_busy() : query whether a work is busy
* Anton Blanchard fixed missing ret initialization in work_busy().
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Anton Blanchard <anton@samba.org>

:040000 040000 8b7443c650f0af36f1deba560586a91f6a88abcc 065589a95857a2fb73b94dc242c50ba558179a2a M	include
:040000 040000 84ca2de78af16483fa60a423f4f2d6eee0279eed 27487850f11a1e7ee9e4eaac54fd88f16d420d47 M	kernel

Brief summary for Tejun: Starting with this commit, my system (x86_64,
4 CPUs) sporadically (probability around 25%) fails to suspend due to
a soft lockup. Full details at:
https://lkml.org/lkml/2011/4/4/116
or:
<BANLkTi=n4jLsjOYCd0L3hYb30sgPmdv_WA@mail.gmail.com>
I'd appreciate your help to resolve this issue and would be glad to
test any candidate patches.
Thanks,
Thilo
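
A minimal sketch of the bisect workflow described above (applying the
later fix, or reverting the masking commit, on top of each bisect
step); the commit references and build commands here are placeholders,
not the exact ones used:

  # One bisect iteration; repeat until git bisect converges.
  git bisect start <first-bad-release> <last-good-release>
  git cherry-pick --no-commit <later-fix-commit>  # or: git revert --no-commit <masking-commit>
  make -j8 bzImage modules && make modules_install install
  # ...reboot and try to reproduce the suspend lockup...
  git reset --hard    # drop the temporary fix before marking the result
  git bisect good     # or: git bisect bad, depending on the outcome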

* Re: Soft lockup during suspend since ~2.6.36 [bisected]
  2011-04-05 18:56 Soft lockup during suspend since ~2.6.36 [bisected] Thilo-Alexander Ginkel
@ 2011-04-05 23:28 ` Arnd Bergmann
  2011-04-06  6:03   ` Thilo-Alexander Ginkel
  0 siblings, 1 reply; 13+ messages in thread
From: Arnd Bergmann @ 2011-04-05 23:28 UTC (permalink / raw)
To: Thilo-Alexander Ginkel; +Cc: Tejun Heo, linux-kernel

On Tuesday 05 April 2011, Thilo-Alexander Ginkel wrote:
> On Mon, Apr 4, 2011 at 17:32, Arnd Bergmann <arnd@arndb.de> wrote:
> > On Monday 04 April 2011, Thilo-Alexander Ginkel wrote:
> >> ACK. I see two possibilities:
> >> a) The bug was introduced after the bisected bug was fixed
> >> b) The bug was already present earlier, but was masked by the bug from
> >> the bisected change
> >>
> >> I hope for a) as that would open the possibility to bisect this new bug.
> >
> > In case of b), you can still bisect it when you either apply the later fix
> > or revert the original patch whenever you build a kernel. Or you can try
> > to avoid using the usb-hid driver during bisect.
>
> Thanks, that worked pretty well. Eleven bisect builds later, I have now
> identified the following candidate commit, which may have introduced
> the bug:
>
> dcd989cb73ab0f7b722d64ab6516f101d9f43f88 is the first bad commit
> commit dcd989cb73ab0f7b722d64ab6516f101d9f43f88
> Author: Tejun Heo <tj@kernel.org>
> Date: Tue Jun 29 10:07:14 2010 +0200

Sorry, but looking at the patch shows that it can't possibly have introduced
the problem, since all the code that is modified in it is new code that
is not even used anywhere at that stage.

As far as I can tell, you must have hit a false positive or a false negative
somewhere in the bisect.

	Arnd

* Re: Soft lockup during suspend since ~2.6.36 [bisected]
  2011-04-05 23:28 ` Arnd Bergmann
@ 2011-04-06  6:03   ` Thilo-Alexander Ginkel
  2011-04-14 12:24     ` Thilo-Alexander Ginkel
  0 siblings, 1 reply; 13+ messages in thread
From: Thilo-Alexander Ginkel @ 2011-04-06 6:03 UTC (permalink / raw)
To: Arnd Bergmann; +Cc: Tejun Heo, linux-kernel

On Wed, Apr 6, 2011 at 01:28, Arnd Bergmann <arnd@arndb.de> wrote:
> On Tuesday 05 April 2011, Thilo-Alexander Ginkel wrote:
>> Thanks, that worked pretty well. Eleven bisect builds later, I have now
>> identified the following candidate commit, which may have introduced
>> the bug:
>>
>> dcd989cb73ab0f7b722d64ab6516f101d9f43f88 is the first bad commit
>> commit dcd989cb73ab0f7b722d64ab6516f101d9f43f88
>> Author: Tejun Heo <tj@kernel.org>
>> Date: Tue Jun 29 10:07:14 2010 +0200
>
> Sorry, but looking at the patch shows that it can't possibly have introduced
> the problem, since all the code that is modified in it is new code that
> is not even used anywhere at that stage.
>
> As far as I can tell, you must have hit a false positive or a false negative
> somewhere in the bisect.

Well, you're right. I hit "Reply" too early and should have paid closer
attention to what change the bisect actually brought up.

I already found a false negative (fortunately pretty close to the end
of the bisect sequence) and also verified the preceding good commits,
which gives me two new commits to test. I'll provide an update once
the builds and tests are through, which may however take until early
next week as I will be on vacation until then.

Regards,
Thilo
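
A mislabeled step like the false negative mentioned above can also be
corrected without redoing the whole bisect: git records every decision
in a log that can be edited and replayed. A short sketch:

  git bisect log > bisect.log   # save the decisions made so far
  $EDITOR bisect.log            # delete the wrong "good"/"bad" entry (and anything after it)
  git bisect reset              # abandon the tainted bisect state
  git bisect replay bisect.log  # resume from the corrected history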

* Re: Soft lockup during suspend since ~2.6.36 [bisected]
  2011-04-06  6:03 ` Thilo-Alexander Ginkel
@ 2011-04-14 12:24   ` Thilo-Alexander Ginkel
  2011-04-17 19:35     ` Arnd Bergmann
  0 siblings, 1 reply; 13+ messages in thread
From: Thilo-Alexander Ginkel @ 2011-04-14 12:24 UTC (permalink / raw)
To: Arnd Bergmann, Tejun Heo, Rafael J. Wysocki; +Cc: linux-kernel

On Wed, Apr 6, 2011 at 08:03, Thilo-Alexander Ginkel <thilo@ginkel.com> wrote:
> On Wed, Apr 6, 2011 at 01:28, Arnd Bergmann <arnd@arndb.de> wrote:
>> On Tuesday 05 April 2011, Thilo-Alexander Ginkel wrote:
>>> Thanks, that worked pretty well. Eleven bisect builds later, I have now
>>> identified the following candidate commit, which may have introduced
>>> the bug:
>>>
>>> dcd989cb73ab0f7b722d64ab6516f101d9f43f88 is the first bad commit
>>> commit dcd989cb73ab0f7b722d64ab6516f101d9f43f88
>>> Author: Tejun Heo <tj@kernel.org>
>>> Date: Tue Jun 29 10:07:14 2010 +0200
>>
>> Sorry, but looking at the patch shows that it can't possibly have introduced
>> the problem, since all the code that is modified in it is new code that
>> is not even used anywhere at that stage.
>>
>> As far as I can tell, you must have hit a false positive or a false negative
>> somewhere in the bisect.
>
> Well, you're right. I hit "Reply" too early and should have paid closer
> attention to what change the bisect actually brought up.
>
> I already found a false negative (fortunately pretty close to the end
> of the bisect sequence) and also verified the preceding good commits,
> which gives me two new commits to test. I'll provide an update once
> the builds and tests are through, which may however take until early
> next week as I will be on vacation until then.

All right... I verified all my bisect tests and actually found yet
another bug. After correcting that one (and verifying the correctness
of the other tests), git bisect actually came up with a commit that
makes some more sense:

| e22bee782b3b00bd4534ae9b1c5fb2e8e6573c5c is the first bad commit
| commit e22bee782b3b00bd4534ae9b1c5fb2e8e6573c5c
| Author: Tejun Heo <tj@kernel.org>
| Date: Tue Jun 29 10:07:14 2010 +0200
|
| workqueue: implement concurrency managed dynamic worker pool

The good news is that I am able to reproduce the issue within a KVM
virtual machine, so I am able to test for the soft lockup (which
somewhat looks like a race condition during worker / CPU shutdown) in
a mostly automated fashion. Unfortunately, that also means that this
issue is all but hardware specific, i.e., it most probably affects all
SMP systems (with a varying probability depending on the number of
CPUs).

Adding some further details about my configuration (which I replicated
in the VM):
- lvm running on top of
- dmcrypt (luks) running on top of
- md raid1

If anyone is interested in getting hold of this VM for further tests,
let me know and I'll try to figure out how to get it (2*8 GB, barely
compressible due to dmcrypt) to its recipient.

Regards,
Thilo

* Re: Soft lockup during suspend since ~2.6.36 [bisected]
  2011-04-14 12:24 ` Thilo-Alexander Ginkel
@ 2011-04-17 19:35   ` Arnd Bergmann
  2011-04-17 21:53     ` Thilo-Alexander Ginkel
  0 siblings, 1 reply; 13+ messages in thread
From: Arnd Bergmann @ 2011-04-17 19:35 UTC (permalink / raw)
To: Thilo-Alexander Ginkel
Cc: Tejun Heo, Rafael J. Wysocki, linux-kernel, dm-devel

On Thursday 14 April 2011, Thilo-Alexander Ginkel wrote:
> All right... I verified all my bisect tests and actually found yet
> another bug. After correcting that one (and verifying the correctness
> of the other tests), git bisect actually came up with a commit that
> makes some more sense:
>
> | e22bee782b3b00bd4534ae9b1c5fb2e8e6573c5c is the first bad commit
> | commit e22bee782b3b00bd4534ae9b1c5fb2e8e6573c5c
> | Author: Tejun Heo <tj@kernel.org>
> | Date: Tue Jun 29 10:07:14 2010 +0200
> |
> | workqueue: implement concurrency managed dynamic worker pool

Is it possible to make it work by reverting this patch in 2.6.38?

> The good news is that I am able to reproduce the issue within a KVM
> virtual machine, so I am able to test for the soft lockup (which
> somewhat looks like a race condition during worker / CPU shutdown) in
> a mostly automated fashion. Unfortunately, that also means that this
> issue is all but hardware specific, i.e., it most probably affects all
> SMP systems (with a varying probability depending on the number of
> CPUs).
>
> Adding some further details about my configuration (which I replicated
> in the VM):
> - lvm running on top of
> - dmcrypt (luks) running on top of
> - md raid1
>
> If anyone is interested in getting hold of this VM for further tests,
> let me know and I'll try to figure out how to get it (2*8 GB, barely
> compressible due to dmcrypt) to its recipient.

Adding dm-devel to Cc, in case the problem is somewhere in there.

	Arnd
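
The suggested revert would look roughly like the following sketch; as
the reply below notes, it does not apply cleanly on 2.6.38, so the
conflicts would have to be resolved by hand:

  git checkout v2.6.38
  git revert --no-edit e22bee782b3b00bd4534ae9b1c5fb2e8e6573c5c
  # expected to stop with conflicts that need manual resolution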

* Re: Soft lockup during suspend since ~2.6.36 [bisected]
  2011-04-17 19:35 ` Arnd Bergmann
@ 2011-04-17 21:53   ` Thilo-Alexander Ginkel
  2011-04-26 13:11     ` Tejun Heo
  0 siblings, 1 reply; 13+ messages in thread
From: Thilo-Alexander Ginkel @ 2011-04-17 21:53 UTC (permalink / raw)
To: Arnd Bergmann; +Cc: Tejun Heo, Rafael J. Wysocki, linux-kernel, dm-devel

On Sun, Apr 17, 2011 at 21:35, Arnd Bergmann <arnd@arndb.de> wrote:
> On Thursday 14 April 2011, Thilo-Alexander Ginkel wrote:
>> All right... I verified all my bisect tests and actually found yet
>> another bug. After correcting that one (and verifying the correctness
>> of the other tests), git bisect actually came up with a commit that
>> makes some more sense:
>>
>> | e22bee782b3b00bd4534ae9b1c5fb2e8e6573c5c is the first bad commit
>> | commit e22bee782b3b00bd4534ae9b1c5fb2e8e6573c5c
>> | Author: Tejun Heo <tj@kernel.org>
>> | Date: Tue Jun 29 10:07:14 2010 +0200
>> |
>> | workqueue: implement concurrency managed dynamic worker pool
>
> Is it possible to make it work by reverting this patch in 2.6.38?

Unfortunately, that's not that easy to test as the reverted patch does
not apply cleanly against 2.6.38 (23 failed hunks) and I am not sure
whether I want to revert it manually ;-).

>> The good news is that I am able to reproduce the issue within a KVM
>> virtual machine, so I am able to test for the soft lockup (which
>> somewhat looks like a race condition during worker / CPU shutdown) in
>> a mostly automated fashion. Unfortunately, that also means that this
>> issue is all but hardware specific, i.e., it most probably affects all
>> SMP systems (with a varying probability depending on the number of
>> CPUs).
>>
>> Adding some further details about my configuration (which I replicated
>> in the VM):
>> - lvm running on top of
>> - dmcrypt (luks) running on top of
>> - md raid1
>>
>> If anyone is interested in getting hold of this VM for further tests,
>> let me know and I'll try to figure out how to get it (2*8 GB, barely
>> compressible due to dmcrypt) to its recipient.
>
> Adding dm-devel to Cc, in case the problem is somewhere in there.

In the meantime I also figured out that 2.6.39-rc3 seems to fix the
issue (there have been some work queue changes, so this is somewhat
sensible) and that raid1 seems to be sufficient to trigger the issue.
Now one could try to figure out what actually fixed it, but if that
means another bisect series I am not too keen to perform that
exercise. ;-) If someone else feels inclined to do so, my test
environment is available for download, though:
https://secure.tgbyte.de/dropbox/lockup-test.tar.bz2 (~ 700 MB)

Boot using:
  kvm -hda LockupTestRaid-1.qcow2 -hdb LockupTestRaid-2.qcow2 -smp 8 -m 1024 -curses

To run the test, log in as root / test and run:
  /root/suspend-test

Regards,
Thilo
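
The /root/suspend-test script itself was not posted; a hypothetical
reconstruction of such a loop might look like this, where the rtcwake
wake-up delay and the default cycle count are assumptions:

  #!/bin/sh
  # Suspend to RAM repeatedly, using the RTC alarm to auto-resume;
  # a soft lockup shows up as a cycle that fails or never comes back.
  cycles=${1:-100}
  i=0
  while [ "$i" -lt "$cycles" ]; do
      i=$((i + 1))
      echo "suspend/resume cycle $i"
      rtcwake -m mem -s 15 || { echo "cycle $i failed"; exit 1; }
      sleep 5    # settling time after resume
  done
  echo "completed $cycles cycles without lockup"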

* Re: Soft lockup during suspend since ~2.6.36 [bisected]
  2011-04-17 21:53 ` Thilo-Alexander Ginkel
@ 2011-04-26 13:11   ` Tejun Heo
  2011-04-27 23:51     ` Thilo-Alexander Ginkel
  0 siblings, 1 reply; 13+ messages in thread
From: Tejun Heo @ 2011-04-26 13:11 UTC (permalink / raw)
To: Thilo-Alexander Ginkel
Cc: Arnd Bergmann, Rafael J. Wysocki, linux-kernel, dm-devel

Hello, sorry about the delay. Was on the road and then sick.

On Sun, Apr 17, 2011 at 11:53:42PM +0200, Thilo-Alexander Ginkel wrote:
> >> | e22bee782b3b00bd4534ae9b1c5fb2e8e6573c5c is the first bad commit
> >> | commit e22bee782b3b00bd4534ae9b1c5fb2e8e6573c5c
> >> | Author: Tejun Heo <tj@kernel.org>
> >> | Date: Tue Jun 29 10:07:14 2010 +0200
> >> |
> >> | workqueue: implement concurrency managed dynamic worker pool
> >
> > Is it possible to make it work by reverting this patch in 2.6.38?
>
> Unfortunately, that's not that easy to test as the reverted patch does
> not apply cleanly against 2.6.38 (23 failed hunks) and I am not sure
> whether I want to revert it manually ;-).

Yeap, reverting that one would be a major effort at this point.

Hmmm... assuming all the workqueue usages were correct, the change
shouldn't have introduced such a bug. All forward progress guarantees
remain the same in that all workqueues are automatically given a
rescuer thread. That said, there have been a number of bug fixes and
cases where the single rescuer guarantee wasn't enough (which was
dangerous before the change too but was less likely to trigger).

> >> If anyone is interested in getting hold of this VM for further tests,
> >> let me know and I'll try to figure out how to get it (2*8 GB, barely
> >> compressible due to dmcrypt) to its recipient.
> >
> > Adding dm-devel to Cc, in case the problem is somewhere in there.
>
> In the meantime I also figured out that 2.6.39-rc3 seems to fix the
> issue (there have been some work queue changes, so this is somewhat
> sensible)

Hmmm... that's a big demotivator. :-)

> and that raid1 seems to be sufficient to trigger the issue.
> Now one could try to figure out what actually fixed it, but if that
> means another bisect series I am not too keen to perform that
> exercise. ;-) If someone else feels inclined to do so, my test
> environment is available for download, though:
> https://secure.tgbyte.de/dropbox/lockup-test.tar.bz2 (~ 700 MB)
>
> Boot using:
>   kvm -hda LockupTestRaid-1.qcow2 -hdb LockupTestRaid-2.qcow2 -smp 8 -m 1024 -curses
>
> To run the test, log in as root / test and run:
>   /root/suspend-test

Before I go ahead and try that, do you happen to have a softlockup dump,
i.e., stack traces of the stuck tasks? I can't find the original
posting.

Thank you.

--
tejun

* Re: Soft lockup during suspend since ~2.6.36 [bisected]
  2011-04-26 13:11 ` Tejun Heo
@ 2011-04-27 23:51   ` Thilo-Alexander Ginkel
  2011-04-28 10:30     ` Tejun Heo
  0 siblings, 1 reply; 13+ messages in thread
From: Thilo-Alexander Ginkel @ 2011-04-27 23:51 UTC (permalink / raw)
To: Tejun Heo; +Cc: Arnd Bergmann, Rafael J. Wysocki, linux-kernel, dm-devel

On Tue, Apr 26, 2011 at 15:11, Tejun Heo <tj@kernel.org> wrote:
> Hello, sorry about the delay. Was on the road and then sick.

No problem. Thanks for getting back to me.

>> In the meantime I also figured out that 2.6.39-rc3 seems to fix the
>> issue (there have been some work queue changes, so this is somewhat
>> sensible)
>
> Hmmm... that's a big demotivator. :-)

Well, I get your point. ;-) Maybe this fact can help as a motivator: I
ran some further tests and while -rc3 seems to be ok (and survived 100
suspend/resume cycles), the issue strangely seems to be back with -rc4
(the softlockup call stack that I can see is identical to the photos
below; the lockup happened after only two cycles).

> Before I go ahead and try that, do you happen to have a softlockup dump,
> i.e., stack traces of the stuck tasks? I can't find the original
> posting.

Sure. From <BANLkTi=n4jLsjOYCd0L3hYb30sgPmdv_WA@mail.gmail.com>:

> Unfortunately, the output via a serial console becomes garbled after
> "Entering mem sleep", so I went for patching dumpstack_64.c and a
> couple of other source files to reduce the verbosity. I hope not to
> have stripped any essential information. The result is available in
> these pictures:
> https://secure.tgbyte.de/dropbox/IeZalo4t-1.jpg
> https://secure.tgbyte.de/dropbox/IeZalo4t-2.jpg
>
> For both traces, the printed error message reads: "BUG: soft lockup -
> CPU#3 stuck for 67s! [kblockd:28]"

Thanks,
Thilo

* Re: Soft lockup during suspend since ~2.6.36 [bisected]
  2011-04-27 23:51 ` Thilo-Alexander Ginkel
@ 2011-04-28 10:30   ` Tejun Heo
  2011-04-28 23:56     ` Thilo-Alexander Ginkel
  0 siblings, 1 reply; 13+ messages in thread
From: Tejun Heo @ 2011-04-28 10:30 UTC (permalink / raw)
To: Thilo-Alexander Ginkel
Cc: Arnd Bergmann, Rafael J. Wysocki, linux-kernel, dm-devel

Hello,

On Thu, Apr 28, 2011 at 01:51:34AM +0200, Thilo-Alexander Ginkel wrote:
> Well, I get your point. ;-) Maybe this fact can help as a motivator: I
> ran some further tests and while -rc3 seems to be ok (and survived 100
> suspend/resume cycles), the issue strangely seems to be back with -rc4
> (the softlockup call stack that I can see is identical to the photos
> below; the lockup happened after only two cycles).
>
> > Before I go ahead and try that, do you happen to have a softlockup dump,
> > i.e., stack traces of the stuck tasks? I can't find the original
> > posting.
>
> Sure. From <BANLkTi=n4jLsjOYCd0L3hYb30sgPmdv_WA@mail.gmail.com>:
>
> > Unfortunately, the output via a serial console becomes garbled after
> > "Entering mem sleep", so I went for patching dumpstack_64.c and a
> > couple of other source files to reduce the verbosity. I hope not to
> > have stripped any essential information. The result is available in
> > these pictures:
> > https://secure.tgbyte.de/dropbox/IeZalo4t-1.jpg
> > https://secure.tgbyte.de/dropbox/IeZalo4t-2.jpg
> >
> > For both traces, the printed error message reads: "BUG: soft lockup -
> > CPU#3 stuck for 67s! [kblockd:28]"

Does your kernel have preemption enabled? If not, does the following
patch fix the problem? Thanks.

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 04ef830..08c7334 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1293,6 +1293,7 @@ __acquires(&gcwq->lock)
 
 		/* CPU has come up inbetween, retry migration */
 		cpu_relax();
+		cond_resched();
 	}
 }
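
One way to answer the preemption question on a running system is to
inspect the build configuration; the file locations below are the
usual ones but may differ per distribution:

  # Distributions typically install the build config next to the kernel:
  grep '^CONFIG_PREEMPT' /boot/config-"$(uname -r)"
  # If the kernel was built with CONFIG_IKCONFIG_PROC, the config is
  # also exposed at runtime:
  zcat /proc/config.gz | grep '^CONFIG_PREEMPT'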

* Re: Soft lockup during suspend since ~2.6.36 [bisected]
  2011-04-28 10:30 ` Tejun Heo
@ 2011-04-28 23:56   ` Thilo-Alexander Ginkel
  2011-04-29 16:00     ` Tejun Heo
  0 siblings, 1 reply; 13+ messages in thread
From: Thilo-Alexander Ginkel @ 2011-04-28 23:56 UTC (permalink / raw)
To: Tejun Heo; +Cc: Arnd Bergmann, Rafael J. Wysocki, linux-kernel, dm-devel

On Thu, Apr 28, 2011 at 12:30, Tejun Heo <tj@kernel.org> wrote:
> Does your kernel have preemption enabled?

CONFIG_PREEMPT is not set

> If not, does the following patch fix the problem?

Yep, looks good so far (at least in my virtualized test environment):
2.6.39-rc4 with your patch applied survived 100 suspend/resume cycles
w/o locking up.

Just out of curiosity: Was this a new issue in 2.6.39-rc4, or could
this fix be backported to 2.6.38?

Thanks,
Thilo

* Re: Soft lockup during suspend since ~2.6.36 [bisected]
  2011-04-28 23:56 ` Thilo-Alexander Ginkel
@ 2011-04-29 16:00   ` Tejun Heo
  2011-04-29 16:18     ` [PATCH] workqueue: fix deadlock in worker_maybe_bind_and_lock() Tejun Heo
  0 siblings, 1 reply; 13+ messages in thread
From: Tejun Heo @ 2011-04-29 16:00 UTC (permalink / raw)
To: Thilo-Alexander Ginkel
Cc: Arnd Bergmann, Rafael J. Wysocki, linux-kernel, dm-devel

Hello,

On Fri, Apr 29, 2011 at 01:56:46AM +0200, Thilo-Alexander Ginkel wrote:
> On Thu, Apr 28, 2011 at 12:30, Tejun Heo <tj@kernel.org> wrote:
> > Does your kernel have preemption enabled?
>
> CONFIG_PREEMPT is not set
>
> > If not, does the following patch fix the problem?
>
> Yep, looks good so far (at least in my virtualized test environment):
> 2.6.39-rc4 with your patch applied survived 100 suspend/resume cycles
> w/o locking up.

Awesome, I'll forward the patch to mainline and -stable.

> Just out of curiosity: Was this a new issue in 2.6.39-rc4, or could
> this fix be backported to 2.6.38?

This needs to be backported. It's an issue which has been there from
the initial implementation of cmwq. It needs a non-preemptive kernel
and the rescuer kicking in at a very bad time, so not many people
seem to have been affected by this. Your setup somehow triggers it
reliably.

Thank you.

--
tejun

* [PATCH] workqueue: fix deadlock in worker_maybe_bind_and_lock()
  2011-04-29 16:00 ` Tejun Heo
@ 2011-04-29 16:18   ` Tejun Heo
  2011-04-29 20:40     ` Rafael J. Wysocki
  0 siblings, 1 reply; 13+ messages in thread
From: Tejun Heo @ 2011-04-29 16:18 UTC (permalink / raw)
To: Thilo-Alexander Ginkel
Cc: Arnd Bergmann, Rafael J. Wysocki, linux-kernel, dm-devel

From 5035b20fa5cd146b66f5f89619c20a4177fb736d Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Fri, 29 Apr 2011 18:08:37 +0200

If a rescuer and stop_machine() bringing down a CPU race with each
other, they may deadlock on non-preemptive kernel. The CPU won't
accept a new task, so the rescuer can't migrate to the target CPU,
while stop_machine() can't proceed because the rescuer is holding one
of the CPU retrying migration. GCWQ_DISASSOCIATED is never cleared
and worker_maybe_bind_and_lock() retries indefinitely.

This problem can be reproduced semi reliably while the system is
entering suspend.

http://thread.gmane.org/gmane.linux.kernel/1122051

A lot of kudos to Thilo-Alexander for reporting this tricky issue and
painstaking testing.

stable: This affects all kernels with cmwq, so all kernels since and
including v2.6.36 need this fix.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Thilo-Alexander Ginkel <thilo@ginkel.com>
Tested-by: Thilo-Alexander Ginkel <thilo@ginkel.com>
Cc: stable@kernel.org
---
Will soon send pull request to Linus. Thank you very much.

 kernel/workqueue.c |    8 +++++++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 04ef830..e3378e8 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1291,8 +1291,14 @@ __acquires(&gcwq->lock)
 			return true;
 		spin_unlock_irq(&gcwq->lock);
 
-		/* CPU has come up inbetween, retry migration */
+		/*
+		 * We've raced with CPU hot[un]plug.  Give it a breather
+		 * and retry migration.  cond_resched() is required here;
+		 * otherwise, we might deadlock against cpu_stop trying to
+		 * bring down the CPU on non-preemptive kernel.
+		 */
 		cpu_relax();
+		cond_resched();
 	}
 }

* Re: [PATCH] workqueue: fix deadlock in worker_maybe_bind_and_lock()
  2011-04-29 16:18 ` [PATCH] workqueue: fix deadlock in worker_maybe_bind_and_lock() Tejun Heo
@ 2011-04-29 20:40   ` Rafael J. Wysocki
  0 siblings, 0 replies; 13+ messages in thread
From: Rafael J. Wysocki @ 2011-04-29 20:40 UTC (permalink / raw)
To: Tejun Heo; +Cc: Thilo-Alexander Ginkel, Arnd Bergmann, linux-kernel, dm-devel

On Friday, April 29, 2011, Tejun Heo wrote:
> From 5035b20fa5cd146b66f5f89619c20a4177fb736d Mon Sep 17 00:00:00 2001
> From: Tejun Heo <tj@kernel.org>
> Date: Fri, 29 Apr 2011 18:08:37 +0200
>
> If a rescuer and stop_machine() bringing down a CPU race with each
> other, they may deadlock on non-preemptive kernel. The CPU won't
> accept a new task, so the rescuer can't migrate to the target CPU,
> while stop_machine() can't proceed because the rescuer is holding one
> of the CPU retrying migration. GCWQ_DISASSOCIATED is never cleared
> and worker_maybe_bind_and_lock() retries indefinitely.
>
> This problem can be reproduced semi reliably while the system is
> entering suspend.
>
> http://thread.gmane.org/gmane.linux.kernel/1122051
>
> A lot of kudos to Thilo-Alexander for reporting this tricky issue and
> painstaking testing.
>
> stable: This affects all kernels with cmwq, so all kernels since and
> including v2.6.36 need this fix.

Well, _that_ explains quite a number of mysterious reports where
suspend or poweroff hang randomly.

Thanks a lot for fixing it!

Rafael


> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reported-by: Thilo-Alexander Ginkel <thilo@ginkel.com>
> Tested-by: Thilo-Alexander Ginkel <thilo@ginkel.com>
> Cc: stable@kernel.org
> ---
> Will soon send pull request to Linus. Thank you very much.
>
>  kernel/workqueue.c |    8 +++++++-
>  1 files changed, 7 insertions(+), 1 deletions(-)
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 04ef830..e3378e8 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -1291,8 +1291,14 @@ __acquires(&gcwq->lock)
>  			return true;
>  		spin_unlock_irq(&gcwq->lock);
>
> -		/* CPU has come up inbetween, retry migration */
> +		/*
> +		 * We've raced with CPU hot[un]plug.  Give it a breather
> +		 * and retry migration.  cond_resched() is required here;
> +		 * otherwise, we might deadlock against cpu_stop trying to
> +		 * bring down the CPU on non-preemptive kernel.
> +		 */
>  		cpu_relax();
> +		cond_resched();
>  	}
>  }