Hard lockups using 3.10.0

* Hard lockups using 3.10.0
@ 2013-07-11  9:38 Rolf Eike Beer
  2013-07-11 10:07 ` Borislav Petkov
  0 siblings, 1 reply; 11+ messages in thread
From: Rolf Eike Beer @ 2013-07-11  9:38 UTC (permalink / raw)
  To: linux-kernel

Hi,

I'm running 3.10.0 (from openSUSE packages) on an "Intel(R) Core(TM) i7-2600 
CPU @ 3.40GHz". I got a hard lockup on one of my CPUs twice, once with 
backtrace (see attached image). Graphics is the builtin Intel, used with X 7.6 
and KDE 4.10beta2 (basically current openSUSE 12.3+KDE).

I'm not aware that I had done anything special, just "normal" desktop and 
development usage, but no heavy compile work at the moment the lockups 
happened.

Any ideas?

Eike

[-- Attachment #2: lockup.jpg --]
[-- Type: image/jpeg, Size: 266338 bytes --]

* Re: Hard lockups using 3.10.0
  2013-07-11  9:38 Hard lockups using 3.10.0 Rolf Eike Beer
@ 2013-07-11 10:07 ` Borislav Petkov
  2013-07-11 10:16   ` Peter Zijlstra
  2013-07-11 10:52   ` Peter Zijlstra
  0 siblings, 2 replies; 11+ messages in thread
From: Borislav Petkov @ 2013-07-11 10:07 UTC (permalink / raw)
  To: Rolf Eike Beer; +Cc: linux-kernel, dhowells, Paul E. McKenney, Peter Zijlstra

On Thu, Jul 11, 2013 at 11:38:37AM +0200, Rolf Eike Beer wrote:
> Hi,
> 
> I'm running 3.10.0 (from openSUSE packages) on an "Intel(R) Core(TM) i7-2600 
> CPU @ 3.40GHz". I got a hard lockup on one of my CPUs twice, once with 
> backtrace (see attached image). Graphics is the builtin Intel, used with X 7.6 
> and KDE 4.10beta2 (basically current openSUSE 12.3+KDE).
> 
> I'm not aware that I had done anything special, just "normal" desktop and 
> development usage, but no heavy compile work at the moment the lockups 
> happened.

Hmm, I can see commit_creds() doing some rcu pointers assignment and rcu
calling into the scheduler which screams about a cpu runqueue of the
task we're about to reschedule not being locked. Let's add some more
people who should know better.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

* Re: Hard lockups using 3.10.0
  2013-07-11 10:07 ` Borislav Petkov
@ 2013-07-11 10:16   ` Peter Zijlstra
  2013-07-11 10:52   ` Peter Zijlstra
  1 sibling, 0 replies; 11+ messages in thread
From: Peter Zijlstra @ 2013-07-11 10:16 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Rolf Eike Beer, linux-kernel, dhowells, Paul E. McKenney

On Thu, Jul 11, 2013 at 12:07:21PM +0200, Borislav Petkov wrote:
> On Thu, Jul 11, 2013 at 11:38:37AM +0200, Rolf Eike Beer wrote:
> > Hi,
> > 
> > I'm running 3.10.0 (from openSUSE packages) on an "Intel(R) Core(TM) i7-2600 
> > CPU @ 3.40GHz". I got a hard lockup on one of my CPUs twice, once with 
> > backtrace (see attached image). Graphics is the builtin Intel, used with X 7.6 
> > and KDE 4.10beta2 (basically current openSUSE 12.3+KDE).
> > 
> > I'm not aware that I had done anything special, just "normal" desktop and 
> > development usage, but no heavy compile work at the moment the lockups 
> > happened.
> 
> Hmm, I can see commit_creds() doing some rcu pointers assignment and rcu
> calling into the scheduler which screams about a cpu runqueue of the
> task we're about to reschedule not being locked. Let's add some more
> people who should know better.

-ENOIMAGE

* Re: Hard lockups using 3.10.0
  2013-07-11 10:07 ` Borislav Petkov
  2013-07-11 10:16   ` Peter Zijlstra
@ 2013-07-11 10:52   ` Peter Zijlstra
  2013-07-11 17:50     ` Paul E. McKenney
  2013-08-11  6:09     ` Rolf Eike Beer
  1 sibling, 2 replies; 11+ messages in thread
From: Peter Zijlstra @ 2013-07-11 10:52 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Rolf Eike Beer, linux-kernel, dhowells, Paul E. McKenney

On Thu, Jul 11, 2013 at 12:07:21PM +0200, Borislav Petkov wrote:
> On Thu, Jul 11, 2013 at 11:38:37AM +0200, Rolf Eike Beer wrote:
> > Hi,
> > 
> > I'm running 3.10.0 (from openSUSE packages) on an "Intel(R) Core(TM) i7-2600 
> > CPU @ 3.40GHz". I got a hard lockup on one of my CPUs twice, once with 
> > backtrace (see attached image). Graphics is the builtin Intel, used with X 7.6 
> > and KDE 4.10beta2 (basically current openSUSE 12.3+KDE).
> > 
> > I'm not aware that I had done anything special, just "normal" desktop and 
> > development usage, but no heavy compile work at the moment the lockups 
> > happened.
> 
> Hmm, I can see commit_creds() doing some rcu pointers assignment and rcu
> calling into the scheduler which screams about a cpu runqueue of the
> task we're about to reschedule not being locked. Let's add some more
> people who should know better.

Ok, for the other people too lazy to bother finding the picture:

  http://marc.info/?l=linux-kernel&m=137353587012001&q=p3

So we bug at:

kernel/sched/core.c:519 assert_raw_spin_locked(&task_rq(p)->lock);

and get there through:

  resched_task()
  check_preempt_wakeup()
  check_preempt_curr()
  try_to_wake_up()
  autoremove_wake_function()
  __call_rcu_nocb_enqueue()
  __call_rcu()
  commit_creds()
  ____call_usermodehelper()
  ret_from_fork()

That doesn't make much sense, though, since:

  try_to_wake_up()
    ttwu_queue()
      raw_spin_lock(&rq->lock)
      ttwu_do_activate()
        ttwu_do_wakeup()
          check_preempt_curr()
            check_preempt_wakeup()
              resched_task(rq->curr)
                assert_raw_spin_locked(&task_rq(p)->lock)

It would somehow mean that 'task_rq(rq->curr) != rq', which is completely
bonkers: we do, after all, hold rq->lock.

I must also say that I've _never_ seen this bug before.

* Re: Hard lockups using 3.10.0
  2013-07-11 10:52   ` Peter Zijlstra
@ 2013-07-11 17:50     ` Paul E. McKenney
  2013-07-11 19:02       ` Rolf Eike Beer
  2013-08-11  6:09     ` Rolf Eike Beer
  1 sibling, 1 reply; 11+ messages in thread
From: Paul E. McKenney @ 2013-07-11 17:50 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Borislav Petkov, Rolf Eike Beer, linux-kernel, dhowells

On Thu, Jul 11, 2013 at 12:52:07PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 11, 2013 at 12:07:21PM +0200, Borislav Petkov wrote:
> > On Thu, Jul 11, 2013 at 11:38:37AM +0200, Rolf Eike Beer wrote:
> > > Hi,
> > > 
> > > I'm running 3.10.0 (from openSUSE packages) on an "Intel(R) Core(TM) i7-2600 
> > > CPU @ 3.40GHz". I got a hard lockup on one of my CPUs twice, once with 
> > > backtrace (see attached image). Graphics is the builtin Intel, used with X 7.6 
> > > and KDE 4.10beta2 (basically current openSUSE 12.3+KDE).
> > > 
> > > I'm not aware that I had done anything special, just "normal" desktop and 
> > > development usage, but no heavy compile work at the moment the lockups 
> > > happened.
> > 
> > Hmm, I can see commit_creds() doing some rcu pointers assignment and rcu
> > calling into the scheduler which screams about a cpu runqueue of the
> > task we're about to reschedule not being locked. Let's add some more
> > people who should know better.
> 
> Ok, for the other people too lazy to bother finding the picture:
> 
>   http://marc.info/?l=linux-kernel&m=137353587012001&q=p3
> 
> So we bug at:
> 
> kernel/sched/core.c:519 assert_raw_spin_locked(&task_rq(p)->lock);
> 
> and get there through:
> 
>   resched_task()
>   check_preempt_wakeup()
>   check_preempt_curr()
>   try_to_wake_up()
>   autoremove_wake_function()
>   __call_rcu_nocb_enqueue()
>   __call_rcu()
>   commit_creds()
>   ____call_usermodehelper()
>   ret_from_fork()
> 
> That don't make much sense though. Since:
> 
>   try_to_wake_up()
>     ttwu_queue()
>       raw_spin_lock(&rq->lock)
>       ttwu_do_activate()
>         ttwu_do_wakeup()
>           check_preempt_curr()
>             check_preempt_wakeup()
>               resched_task(rq->curr)
>                 assert_raw_spin_locked(task_rq(p)->lock)
> 
> It would somehow mean that 'task_rq(rq->curr) != rq', that's completely
> bonkers, we do after all have rq->lock locked.
> 
> I must also say that I've _never_ seen this bug before.

New one on me as well.  Is this reproducible?  If so, does it happen
when CONFIG_RCU_NOCB_CPU=n?  (Given the call to __call_rcu_nocb_enqueue(),
I expect that you built with CONFIG_RCU_NOCB_CPU=y.)  Can't say that I
see how __call_rcu_nocb_enqueue() would have caused this, but...

Well, I suppose that if RCU's callback lists got corrupted, this
(and much else besides) could in fact happen.  Does your build have
CONFIG_DEBUG_OBJECTS_RCU_HEAD=y?  If not, could you please try it?

							Thanx, Paul


* Re: Hard lockups using 3.10.0
  2013-07-11 17:50     ` Paul E. McKenney
@ 2013-07-11 19:02       ` Rolf Eike Beer
  0 siblings, 0 replies; 11+ messages in thread
From: Rolf Eike Beer @ 2013-07-11 19:02 UTC (permalink / raw)
  To: paulmck; +Cc: Peter Zijlstra, Borislav Petkov, linux-kernel, dhowells

Paul E. McKenney wrote:
> On Thu, Jul 11, 2013 at 12:52:07PM +0200, Peter Zijlstra wrote:
> > On Thu, Jul 11, 2013 at 12:07:21PM +0200, Borislav Petkov wrote:
> > > On Thu, Jul 11, 2013 at 11:38:37AM +0200, Rolf Eike Beer wrote:
> > > > Hi,
> > > > 
> > > > I'm running 3.10.0 (from openSUSE packages) on an "Intel(R) Core(TM)
> > > > i7-2600 CPU @ 3.40GHz". I got a hard lockup on one of my CPUs twice,
> > > > once with backtrace (see attached image). Graphics is the builtin
> > > > Intel, used with X 7.6 and KDE 4.10beta2 (basically current openSUSE
> > > > 12.3+KDE).
> > > > 
> > > > I'm not aware that I had done anything special, just "normal" desktop
> > > > and
> > > > development usage, but no heavy compile work at the moment the lockups
> > > > happened.
> > > 
> > > Hmm, I can see commit_creds() doing some rcu pointers assignment and rcu
> > > calling into the scheduler which screams about a cpu runqueue of the
> > > task we're about to reschedule not being locked. Let's add some more
> > > people who should know better.
> > 
> > Ok, for the other people too lazy to bother finding the picture:
> >   http://marc.info/?l=linux-kernel&m=137353587012001&q=p3
> > 
> > So we bug at:
> > 
> > kernel/sched/core.c:519 assert_raw_spin_locked(&task_rq(p)->lock);
> > 
> > and get there through:
> >   resched_task()
> >   check_preempt_wakeup()
> >   check_preempt_curr()
> >   try_to_wake_up()
> >   autoremove_wake_function()
> >   __call_rcu_nocb_enqueue()
> >   __call_rcu()
> >   commit_creds()
> >   ____call_usermodehelper()
> >   ret_from_fork()
> > 
> > > That don't make much sense though. Since:
> > > 
> > >   try_to_wake_up()
> > >     ttwu_queue()
> > >       raw_spin_lock(&rq->lock)
> > >       ttwu_do_activate()
> > >         ttwu_do_wakeup()
> > >           check_preempt_curr()
> > >             check_preempt_wakeup()
> > >               resched_task(rq->curr)
> > >                 assert_raw_spin_locked(task_rq(p)->lock)
> > 
> > It would somehow mean that 'task_rq(rq->curr) != rq', that's completely
> > bonkers, we do after all have rq->lock locked.
> > 
> > I must also say that I've _never_ seen this bug before.
> 
> New one on me as well.  Is this reproducible?  If so, does it happen
> when CONFIG_RCU_NOCB_CPU=n?  (Given the call to __call_rcu_nocb_enqueue(),
> I expect that you built with CONFIG_RCU_NOCB_CPU=y.)  Can't say that I
> see how __call_rcu_nocb_enqueue() would have caused this, but...
> 
> Well, I suppose that if RCU's callback lists got corrupted, this
> (and much else besides) could in fact happen.  Does your build have
> CONFIG_DEBUG_OBJECTS_RCU_HEAD=y?  If not, could you please try it?

I will look tomorrow. This is a "standard" openSUSE kernel RPM; I don't know 
offhand which repository it came from. It is not really reproducible: it 
suddenly happened again today, but this time without a backtrace.

Eike

* Re: Hard lockups using 3.10.0
  2013-07-11 10:52   ` Peter Zijlstra
  2013-07-11 17:50     ` Paul E. McKenney
@ 2013-08-11  6:09     ` Rolf Eike Beer
  2013-08-11  8:37       ` Borislav Petkov
  1 sibling, 1 reply; 11+ messages in thread
From: Rolf Eike Beer @ 2013-08-11  6:09 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Borislav Petkov, linux-kernel, dhowells, Paul E. McKenney

Peter Zijlstra wrote:
> On Thu, Jul 11, 2013 at 12:07:21PM +0200, Borislav Petkov wrote:
> > On Thu, Jul 11, 2013 at 11:38:37AM +0200, Rolf Eike Beer wrote:
> > > Hi,
> > > 
> > > I'm running 3.10.0 (from openSUSE packages) on an "Intel(R) Core(TM)
> > > i7-2600 CPU @ 3.40GHz". I got a hard lockup on one of my CPUs twice,
> > > once with backtrace (see attached image). Graphics is the builtin
> > > Intel, used with X 7.6 and KDE 4.10beta2 (basically current openSUSE
> > > 12.3+KDE).
> > > 
> > > I'm not aware that I had done anything special, just "normal" desktop
> > > and
> > > development usage, but no heavy compile work at the moment the lockups
> > > happened.
> > 
> > Hmm, I can see commit_creds() doing some rcu pointers assignment and rcu
> > calling into the scheduler which screams about a cpu runqueue of the
> > task we're about to reschedule not being locked. Let's add some more
> > people who should know better.
> 
> Ok, for the other people too lazy to bother finding the picture:
> 
>   http://marc.info/?l=linux-kernel&m=137353587012001&q=p3
> 
> So we bug at:
> 
> kernel/sched/core.c:519 assert_raw_spin_locked(&task_rq(p)->lock);
> 
> and get there through:
> 
>   resched_task()
>   check_preempt_wakeup()
>   check_preempt_curr()
>   try_to_wake_up()
>   autoremove_wake_function()
>   __call_rcu_nocb_enqueue()
>   __call_rcu()
>   commit_creds()
>   ____call_usermodehelper()
>   ret_from_fork()
> 
> That don't make much sense though. Since:
> 
>   try_to_wake_up()
>     ttwu_queue()
>       raw_spin_lock(&rq->lock)
>       ttwu_do_activate()
>         ttwu_do_wakeup()
>           check_preempt_curr()
>             check_preempt_wakeup()
>               resched_task(rq->curr)
>                 assert_raw_spin_locked(task_rq(p)->lock)
> 
> It would somehow mean that 'task_rq(rq->curr) != rq', that's completely
> bonkers, we do after all have rq->lock locked.
> 
> I must also say that I've _never_ seen this bug before.

In the meantime I found that there was a hardware defect on this machine. So 
if the lockups do not happen again, I will assume that the defect caused them.

Thanks for looking into this.

Eike

* Re: Hard lockups using 3.10.0
  2013-08-11  6:09     ` Rolf Eike Beer
@ 2013-08-11  8:37       ` Borislav Petkov
  2013-08-11 11:10         ` Rolf Eike Beer
  0 siblings, 1 reply; 11+ messages in thread
From: Borislav Petkov @ 2013-08-11  8:37 UTC (permalink / raw)
  To: Rolf Eike Beer; +Cc: Peter Zijlstra, linux-kernel, dhowells, Paul E. McKenney

On Sun, Aug 11, 2013 at 08:09:19AM +0200, Rolf Eike Beer wrote:
> Meanwhile I found that there was a hardware defect on this machine.
> So if it does not happen again I will assume that this was caused by
> this.

What hardware defect exactly? DIMMs failing...? Probably, since it looks
like the spinlock gets corrupted and the assertion fires... In any case,
it would be interesting to know for future reference.

Thanks.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

* Re: Hard lockups using 3.10.0
  2013-08-11  8:37       ` Borislav Petkov
@ 2013-08-11 11:10         ` Rolf Eike Beer
  2013-08-13 10:38           ` Borislav Petkov
  0 siblings, 1 reply; 11+ messages in thread
From: Rolf Eike Beer @ 2013-08-11 11:10 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Peter Zijlstra, linux-kernel, dhowells, Paul E. McKenney

Borislav Petkov wrote:
> On Sun, Aug 11, 2013 at 08:09:19AM +0200, Rolf Eike Beer wrote:
> > Meanwhile I found that there was a hardware defect on this machine.
> > So if it does not happen again I will assume that this was caused by
> > this.
> 
> What hardware defect exactly? DIMMs failing...? Probably, since it looks
> like the spinlock gets corrupted and the assertion fires... In any case,
> it would be interesting to know for future reference.

The RAM seems fine. It looks like it is the mainboard or a hard disk. The 
issues have magically disappeared for 3 weeks now, but I have not done any 
suspend-to-disk since then. Before that I had suspended the machine in the 
evening and resumed it when I came to work. So it's possible that there was 
some corrupted stuff in the image.

This is the SMART output I got from one disk yesterday:

Vendor:               /0:0:0:0
Product:              
User Capacity:        600,332,565,813,390,450 bytes [600 PB]
Logical block size:   774843950 bytes
scsiModePageOffset: response length too short, resp_len=47 offset=50 bd_len=46
>> Terminate command early due to bad response to IEC mode page
A mandatory SMART command failed: exiting. To continue, add one or more '-T 
permissive' options.

Eike

* Re: Hard lockups using 3.10.0
  2013-08-11 11:10         ` Rolf Eike Beer
@ 2013-08-13 10:38           ` Borislav Petkov
  2013-08-13 11:57             ` Rolf Eike Beer
  0 siblings, 1 reply; 11+ messages in thread
From: Borislav Petkov @ 2013-08-13 10:38 UTC (permalink / raw)
  To: Rolf Eike Beer; +Cc: Peter Zijlstra, linux-kernel, dhowells, Paul E. McKenney

On Sun, Aug 11, 2013 at 01:10:11PM +0200, Rolf Eike Beer wrote:
> The RAM seems fine. It looks like it is the mainboard or a harddisk.
> The issues have magically disappeared since 3 weeks, but I have not
> done any suspend2disk since then anymore. Before that I had suspended
> the machine on the evening and resumed when I came to work. So it's
> possible that there was some corrupted stuff in the image.

Hmm, probably...

> This is the smart output I got of one disk yesterday:
> 
> Vendor:               /0:0:0:0
> Product:
> User Capacity:        600,332,565,813,390,450 bytes [600 PB]

Is this for real? 600 PB??

I wanna hdd like that :-)

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

* Re: Hard lockups using 3.10.0
  2013-08-13 10:38           ` Borislav Petkov
@ 2013-08-13 11:57             ` Rolf Eike Beer
  0 siblings, 0 replies; 11+ messages in thread
From: Rolf Eike Beer @ 2013-08-13 11:57 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Peter Zijlstra, linux-kernel, dhowells, Paul E. McKenney

Borislav Petkov wrote:
> On Sun, Aug 11, 2013 at 01:10:11PM +0200, Rolf Eike Beer wrote:
>> The RAM seems fine. It looks like it is the mainboard or a harddisk.
>> The issues have magically disappeared since 3 weeks, but I have not
>> done any suspend2disk since then anymore. Before that I had suspended
>> the machine on the evening and resumed when I came to work. So it's
>> possible that there was some corrupted stuff in the image.
> 
> Hmm, probably...
> 
>> This is the smart output I got of one disk yesterday:
>> 
>> Vendor:               /0:0:0:0
>> Product:
>> User Capacity:        600,332,565,813,390,450 bytes [600 PB]
> 
> Is this for real? 600 PB??
> 
> I wanna hdd like that :-)

We have problems getting such a disk again. It seems all the available ones 
have disappeared somewhere near Bluffdale.

I'm not sure how well ext4 can handle sector sizes of several hundred 
megabytes, so it may not be that much fun ;)

Eike
