* Hard lockups using 3.10.0
From: Rolf Eike Beer @ 2013-07-11 9:38 UTC
To: linux-kernel

Hi,

I'm running 3.10.0 (from openSUSE packages) on an "Intel(R) Core(TM) i7-2600
CPU @ 3.40GHz". I got a hard lockup on one of my CPUs twice, once with
backtrace (see attached image). Graphics is the builtin Intel, used with X 7.6
and KDE 4.10beta2 (basically current openSUSE 12.3+KDE).

I'm not aware that I had done anything special, just "normal" desktop and
development usage, but no heavy compile work at the moment the lockups
happened.

Any ideas?

Eike

[-- Attachment #2: lockup.jpg --]
[-- Type: image/jpeg, Size: 266338 bytes --]

* Re: Hard lockups using 3.10.0
From: Borislav Petkov @ 2013-07-11 10:07 UTC
To: Rolf Eike Beer; +Cc: linux-kernel, dhowells, Paul E. McKenney, Peter Zijlstra

On Thu, Jul 11, 2013 at 11:38:37AM +0200, Rolf Eike Beer wrote:
> Hi,
>
> I'm running 3.10.0 (from openSUSE packages) on an "Intel(R) Core(TM) i7-2600
> CPU @ 3.40GHz". I got a hard lockup on one of my CPUs twice, once with
> backtrace (see attached image). Graphics is the builtin Intel, used with X 7.6
> and KDE 4.10beta2 (basically current openSUSE 12.3+KDE).
>
> I'm not aware that I had done anything special, just "normal" desktop and
> development usage, but no heavy compile work at the moment the lockups
> happened.

Hmm, I can see commit_creds() doing some rcu pointers assignment and rcu
calling into the scheduler which screams about a cpu runqueue of the
task we're about to reschedule not being locked. Let's add some more
people who should know better.

--
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

* Re: Hard lockups using 3.10.0
From: Peter Zijlstra @ 2013-07-11 10:16 UTC
To: Borislav Petkov; +Cc: Rolf Eike Beer, linux-kernel, dhowells, Paul E. McKenney

On Thu, Jul 11, 2013 at 12:07:21PM +0200, Borislav Petkov wrote:
> On Thu, Jul 11, 2013 at 11:38:37AM +0200, Rolf Eike Beer wrote:
> > Hi,
> >
> > I'm running 3.10.0 (from openSUSE packages) on an "Intel(R) Core(TM) i7-2600
> > CPU @ 3.40GHz". I got a hard lockup on one of my CPUs twice, once with
> > backtrace (see attached image). Graphics is the builtin Intel, used with X 7.6
> > and KDE 4.10beta2 (basically current openSUSE 12.3+KDE).
> >
> > I'm not aware that I had done anything special, just "normal" desktop and
> > development usage, but no heavy compile work at the moment the lockups
> > happened.
>
> Hmm, I can see commit_creds() doing some rcu pointers assignment and rcu
> calling into the scheduler which screams about a cpu runqueue of the
> task we're about to reschedule not being locked. Let's add some more
> people who should know better.

-ENOIMAGE

* Re: Hard lockups using 3.10.0
From: Peter Zijlstra @ 2013-07-11 10:52 UTC
To: Borislav Petkov; +Cc: Rolf Eike Beer, linux-kernel, dhowells, Paul E. McKenney

On Thu, Jul 11, 2013 at 12:07:21PM +0200, Borislav Petkov wrote:
> On Thu, Jul 11, 2013 at 11:38:37AM +0200, Rolf Eike Beer wrote:
> > Hi,
> >
> > I'm running 3.10.0 (from openSUSE packages) on an "Intel(R) Core(TM) i7-2600
> > CPU @ 3.40GHz". I got a hard lockup on one of my CPUs twice, once with
> > backtrace (see attached image). Graphics is the builtin Intel, used with X 7.6
> > and KDE 4.10beta2 (basically current openSUSE 12.3+KDE).
> >
> > I'm not aware that I had done anything special, just "normal" desktop and
> > development usage, but no heavy compile work at the moment the lockups
> > happened.
>
> Hmm, I can see commit_creds() doing some rcu pointers assignment and rcu
> calling into the scheduler which screams about a cpu runqueue of the
> task we're about to reschedule not being locked. Let's add some more
> people who should know better.

Ok, for the other people too lazy to bother finding the picture:

  http://marc.info/?l=linux-kernel&m=137353587012001&q=p3

So we bug at:

  kernel/sched/core.c:519 assert_raw_spin_locked(&task_rq(p)->lock);

and get there through:

  resched_task()
  check_preempt_wakeup()
  check_preempt_curr()
  try_to_wake_up()
  autoremove_wake_function()
  __call_rcu_nocb_enqueue()
  __call_rcu()
  commit_creds()
  ____call_usermodehelper()
  ret_from_fork()

That don't make much sense though. Since:

  try_to_wake_up()
    ttwu_queue()
      raw_spin_lock(&rq->lock)
      ttwu_do_activate()
        ttwu_do_wakeup()
          check_preempt_curr()
            check_preempt_wakeup()
              resched_task(rq->curr)
                assert_raw_spin_locked(task_rq(p)->lock)

It would somehow mean that 'task_rq(rq->curr) != rq', that's completely
bonkers, we do after all have rq->lock locked.

I must also say that I've _never_ seen this bug before.

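The argument in the message above can be illustrated with a small userspace model. This is only a sketch, not kernel code: `struct rq`, `struct task`, `resched_task_check()` and `wakeup_on()` are simplified, hypothetical stand-ins for the runqueue, the task, the assertion at the top of resched_task() and the locked wakeup path. It shows that, with the waker holding rq's lock, the check can fail only when rq->curr points at a task that belongs to a different runqueue.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: every name below is a simplified, hypothetical stand-in. */
struct task;

struct rq {
    bool locked;        /* stands in for rq->lock being held */
    struct task *curr;  /* task currently running on this runqueue */
};

struct task {
    struct rq *rq;      /* stands in for task_rq(p) */
};

/* Models the check at the top of resched_task(): the runqueue that p
 * belongs to must be locked by the caller. Returns false in the case
 * where the kernel's assert_raw_spin_locked() would fire. */
static bool resched_task_check(const struct task *p)
{
    return p->rq->locked;
}

/* Models the wakeup path: take rq's lock, then reschedule rq->curr.
 * Returns the result of the check made while the lock is held. */
static bool wakeup_on(struct rq *rq)
{
    rq->locked = true;
    bool ok = resched_task_check(rq->curr);
    rq->locked = false;
    return ok;
}
```

With a consistent state (`rq0.curr == &t` and `t.rq == &rq0`), `wakeup_on(&rq0)` returns true. If some other runqueue's `curr` is made to point at `t`, so that `task_rq(rq->curr) != rq`, `wakeup_on()` on that runqueue returns false, which corresponds to the assertion in the backtrace.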
* Re: Hard lockups using 3.10.0
From: Paul E. McKenney @ 2013-07-11 17:50 UTC
To: Peter Zijlstra; +Cc: Borislav Petkov, Rolf Eike Beer, linux-kernel, dhowells

On Thu, Jul 11, 2013 at 12:52:07PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 11, 2013 at 12:07:21PM +0200, Borislav Petkov wrote:
> > On Thu, Jul 11, 2013 at 11:38:37AM +0200, Rolf Eike Beer wrote:
> > > Hi,
> > >
> > > I'm running 3.10.0 (from openSUSE packages) on an "Intel(R) Core(TM) i7-2600
> > > CPU @ 3.40GHz". I got a hard lockup on one of my CPUs twice, once with
> > > backtrace (see attached image). Graphics is the builtin Intel, used with X 7.6
> > > and KDE 4.10beta2 (basically current openSUSE 12.3+KDE).
> > >
> > > I'm not aware that I had done anything special, just "normal" desktop and
> > > development usage, but no heavy compile work at the moment the lockups
> > > happened.
> >
> > Hmm, I can see commit_creds() doing some rcu pointers assignment and rcu
> > calling into the scheduler which screams about a cpu runqueue of the
> > task we're about to reschedule not being locked. Let's add some more
> > people who should know better.
>
> Ok, for the other people too lazy to bother finding the picture:
>
>   http://marc.info/?l=linux-kernel&m=137353587012001&q=p3
>
> So we bug at:
>
>   kernel/sched/core.c:519 assert_raw_spin_locked(&task_rq(p)->lock);
>
> and get there through:
>
>   resched_task()
>   check_preempt_wakeup()
>   check_preempt_curr()
>   try_to_wake_up()
>   autoremove_wake_function()
>   __call_rcu_nocb_enqueue()
>   __call_rcu()
>   commit_creds()
>   ____call_usermodehelper()
>   ret_from_fork()
>
> That don't make much sense though. Since:
>
>   try_to_wake_up()
>     ttwu_queue()
>       raw_spin_lock(&rq->lock)
>       ttwu_do_activate()
>         ttwu_do_wakeup()
>           check_preempt_curr()
>             check_preempt_wakeup()
>               resched_task(rq->curr)
>                 assert_raw_spin_locked(task_rq(p)->lock)
>
> It would somehow mean that 'task_rq(rq->curr) != rq', that's completely
> bonkers, we do after all have rq->lock locked.
>
> I must also say that I've _never_ seen this bug before.

New one on me as well. Is this reproducible? If so, does it happen
when CONFIG_RCU_NOCB_CPU=n? (Given the call to call_rcu_nocb_enqueue(),
I expect that you built with CONFIG_RCU_NOCB_CPU=y.) Can't say that I
see how call_rcu_nocb_enqueue() would have caused this, but...

Well, I suppose that if RCU's callback lists got corrupted, this
(and much else besides) could in fact happen. Does your build have
CONFIG_DEBUG_OBJECTS_RCU_HEAD=y? If not, could you please try it?

Thanx, Paul

* Re: Hard lockups using 3.10.0
From: Rolf Eike Beer @ 2013-07-11 19:02 UTC
To: paulmck; +Cc: Peter Zijlstra, Borislav Petkov, linux-kernel, dhowells

Paul E. McKenney wrote:
> On Thu, Jul 11, 2013 at 12:52:07PM +0200, Peter Zijlstra wrote:
> > On Thu, Jul 11, 2013 at 12:07:21PM +0200, Borislav Petkov wrote:
> > > On Thu, Jul 11, 2013 at 11:38:37AM +0200, Rolf Eike Beer wrote:
> > > > Hi,
> > > >
> > > > I'm running 3.10.0 (from openSUSE packages) on an "Intel(R) Core(TM)
> > > > i7-2600 CPU @ 3.40GHz". I got a hard lockup on one of my CPUs twice,
> > > > once with backtrace (see attached image). Graphics is the builtin
> > > > Intel, used with X 7.6 and KDE 4.10beta2 (basically current openSUSE
> > > > 12.3+KDE).
> > > >
> > > > I'm not aware that I had done anything special, just "normal" desktop
> > > > and development usage, but no heavy compile work at the moment the
> > > > lockups happened.
> > >
> > > Hmm, I can see commit_creds() doing some rcu pointers assignment and rcu
> > > calling into the scheduler which screams about a cpu runqueue of the
> > > task we're about to reschedule not being locked. Let's add some more
> > > people who should know better.
> >
> > Ok, for the other people too lazy to bother finding the picture:
> >
> >   http://marc.info/?l=linux-kernel&m=137353587012001&q=p3
> >
> > So we bug at:
> >
> >   kernel/sched/core.c:519 assert_raw_spin_locked(&task_rq(p)->lock);
> >
> > and get there through:
> >
> >   resched_task()
> >   check_preempt_wakeup()
> >   check_preempt_curr()
> >   try_to_wake_up()
> >   autoremove_wake_function()
> >   __call_rcu_nocb_enqueue()
> >   __call_rcu()
> >   commit_creds()
> >   ____call_usermodehelper()
> >   ret_from_fork()
> >
> > That don't make much sense though. Since:
> >
> >   try_to_wake_up()
> >     ttwu_queue()
> >       raw_spin_lock(&rq->lock)
> >       ttwu_do_activate()
> >         ttwu_do_wakeup()
> >           check_preempt_curr()
> >             check_preempt_wakeup()
> >               resched_task(rq->curr)
> >                 assert_raw_spin_locked(task_rq(p)->lock)
> >
> > It would somehow mean that 'task_rq(rq->curr) != rq', that's completely
> > bonkers, we do after all have rq->lock locked.
> >
> > I must also say that I've _never_ seen this bug before.
>
> New one on me as well. Is this reproducible? If so, does it happen
> when CONFIG_RCU_NOCB_CPU=n? (Given the call to call_rcu_nocb_enqueue(),
> I expect that you built with CONFIG_RCU_NOCB_CPU=y.) Can't say that I
> see how call_rcu_nocb_enqueue() would have caused this, but...
>
> Well, I suppose that if RCU's callback lists got corrupted, this
> (and much else besides) could in fact happen. Does your build have
> CONFIG_DEBUG_OBJECTS_RCU_HEAD=y? If not, could you please try it?

I will look tomorrow. This is a "standard" openSUSE kernel RPM, dunno right
now which repository. It is not really reproducible, it suddenly happened
again today but this time without backtrace.

Eike

* Re: Hard lockups using 3.10.0
From: Rolf Eike Beer @ 2013-08-11 6:09 UTC
To: Peter Zijlstra; +Cc: Borislav Petkov, linux-kernel, dhowells, Paul E. McKenney

Peter Zijlstra wrote:
> On Thu, Jul 11, 2013 at 12:07:21PM +0200, Borislav Petkov wrote:
> > On Thu, Jul 11, 2013 at 11:38:37AM +0200, Rolf Eike Beer wrote:
> > > Hi,
> > >
> > > I'm running 3.10.0 (from openSUSE packages) on an "Intel(R) Core(TM)
> > > i7-2600 CPU @ 3.40GHz". I got a hard lockup on one of my CPUs twice,
> > > once with backtrace (see attached image). Graphics is the builtin
> > > Intel, used with X 7.6 and KDE 4.10beta2 (basically current openSUSE
> > > 12.3+KDE).
> > >
> > > I'm not aware that I had done anything special, just "normal" desktop
> > > and development usage, but no heavy compile work at the moment the
> > > lockups happened.
> >
> > Hmm, I can see commit_creds() doing some rcu pointers assignment and rcu
> > calling into the scheduler which screams about a cpu runqueue of the
> > task we're about to reschedule not being locked. Let's add some more
> > people who should know better.
>
> Ok, for the other people too lazy to bother finding the picture:
>
>   http://marc.info/?l=linux-kernel&m=137353587012001&q=p3
>
> So we bug at:
>
>   kernel/sched/core.c:519 assert_raw_spin_locked(&task_rq(p)->lock);
>
> and get there through:
>
>   resched_task()
>   check_preempt_wakeup()
>   check_preempt_curr()
>   try_to_wake_up()
>   autoremove_wake_function()
>   __call_rcu_nocb_enqueue()
>   __call_rcu()
>   commit_creds()
>   ____call_usermodehelper()
>   ret_from_fork()
>
> That don't make much sense though. Since:
>
>   try_to_wake_up()
>     ttwu_queue()
>       raw_spin_lock(&rq->lock)
>       ttwu_do_activate()
>         ttwu_do_wakeup()
>           check_preempt_curr()
>             check_preempt_wakeup()
>               resched_task(rq->curr)
>                 assert_raw_spin_locked(task_rq(p)->lock)
>
> It would somehow mean that 'task_rq(rq->curr) != rq', that's completely
> bonkers, we do after all have rq->lock locked.
>
> I must also say that I've _never_ seen this bug before.

Meanwhile I found that there was a hardware defect on this machine. So if it
does not happen again I will assume that this was caused by this.

Thanks for looking into this.

Eike

* Re: Hard lockups using 3.10.0
From: Borislav Petkov @ 2013-08-11 8:37 UTC
To: Rolf Eike Beer; +Cc: Peter Zijlstra, linux-kernel, dhowells, Paul E. McKenney

On Sun, Aug 11, 2013 at 08:09:19AM +0200, Rolf Eike Beer wrote:
> Meanwhile I found that there was a hardware defect on this machine.
> So if it does not happen again I will assume that this was caused by
> this.

What hardware defect exactly? DIMMs failing...? Probably, since it looks
like the spinlock gets corrupted and the assertion fires... In any case,
it would be interesting to know for future reference.

Thanks.

--
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

* Re: Hard lockups using 3.10.0
From: Rolf Eike Beer @ 2013-08-11 11:10 UTC
To: Borislav Petkov; +Cc: Peter Zijlstra, linux-kernel, dhowells, Paul E. McKenney

Borislav Petkov wrote:
> On Sun, Aug 11, 2013 at 08:09:19AM +0200, Rolf Eike Beer wrote:
> > Meanwhile I found that there was a hardware defect on this machine.
> > So if it does not happen again I will assume that this was caused by
> > this.
>
> What hardware defect exactly? DIMMs failing...? Probably, since it looks
> like the spinlock gets corrupted and the assertion fires... In any case,
> it would be interesting to know for future reference.

The RAM seems fine. It looks like it is the mainboard or a harddisk. The
issues have been magically gone for 3 weeks, but I have not done any
suspend2disk since then. Before that I had suspended the machine in the
evening and resumed when I came to work. So it's possible that there was
some corrupted stuff in the image.

This is the smart output I got of one disk yesterday:

Vendor:               /0:0:0:0
Product:
User Capacity:        600,332,565,813,390,450 bytes [600 PB]
Logical block size:   774843950 bytes
scsiModePageOffset: response length too short, resp_len=47 offset=50 bd_len=46
>> Terminate command early due to bad response to IEC mode page
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

Eike

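The values in that SMART report are implausible on their face, and a simple range check is enough to flag them. The following is a minimal sketch under stated assumptions: `plausible_block_size()` and `plausible_capacity()` are hypothetical helpers, and the bounds (power-of-two block size between 512 bytes and 64 KiB, capacity below 1 PB) are chosen for illustration only; they are not taken from smartctl or from any standard.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: treat a logical block size as plausible only if it
 * is a power of two between 512 bytes and 64 KiB. The bounds are an
 * assumption made for this example. */
static bool plausible_block_size(uint64_t bytes)
{
    bool power_of_two = bytes != 0 && (bytes & (bytes - 1)) == 0;
    return power_of_two && bytes >= 512 && bytes <= 65536;
}

/* Hypothetical helper: treat a reported capacity as plausible only if it
 * is non-zero and below an arbitrary ceiling of 1 PB (decimal). */
static bool plausible_capacity(uint64_t bytes)
{
    const uint64_t ceiling = UINT64_C(1000000000000000);
    return bytes > 0 && bytes < ceiling;
}
```

Both values from the report fail these checks: 774843950 is not a power of two, and 600,332,565,813,390,450 bytes is far above the ceiling. Ordinary values such as a 512-byte block size pass.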
* Re: Hard lockups using 3.10.0
From: Borislav Petkov @ 2013-08-13 10:38 UTC
To: Rolf Eike Beer; +Cc: Peter Zijlstra, linux-kernel, dhowells, Paul E. McKenney

On Sun, Aug 11, 2013 at 01:10:11PM +0200, Rolf Eike Beer wrote:
> The RAM seems fine. It looks like it is the mainboard or a harddisk.
> The issues have been magically gone for 3 weeks, but I have not
> done any suspend2disk since then. Before that I had suspended
> the machine in the evening and resumed when I came to work. So it's
> possible that there was some corrupted stuff in the image.

Hmm, probably...

> This is the smart output I got of one disk yesterday:
>
> Vendor:               /0:0:0:0
> Product:
> User Capacity:        600,332,565,813,390,450 bytes [600 PB]

Is this for real? 600 PB??

I wanna hdd like that :-)

--
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

* Re: Hard lockups using 3.10.0
From: Rolf Eike Beer @ 2013-08-13 11:57 UTC
To: Borislav Petkov; +Cc: Peter Zijlstra, linux-kernel, dhowells, Paul E. McKenney

Borislav Petkov wrote:
> On Sun, Aug 11, 2013 at 01:10:11PM +0200, Rolf Eike Beer wrote:
>> The RAM seems fine. It looks like it is the mainboard or a harddisk.
>> The issues have been magically gone for 3 weeks, but I have not
>> done any suspend2disk since then. Before that I had suspended
>> the machine in the evening and resumed when I came to work. So it's
>> possible that there was some corrupted stuff in the image.
>
> Hmm, probably...
>
>> This is the smart output I got of one disk yesterday:
>>
>> Vendor:               /0:0:0:0
>> Product:
>> User Capacity:        600,332,565,813,390,450 bytes [600 PB]
>
> Is this for real? 600 PB??
>
> I wanna hdd like that :-)

We have problems getting such a disk again. Seems all available ones have
disappeared somewhere near Bluffdale. I'm not sure how well ext4 can handle
sector sizes of several hundred megabytes, so it may not be that fun ;)

Eike
