public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* Total machine lockup w/ current kernels while installing from CD
@ 2006-05-11  1:22 Bernhard Rosenkraenzer
  2006-05-14 16:57 ` Bernhard Rosenkraenzer
  2006-05-15 20:32 ` [FIXED] " Bernhard Rosenkraenzer
  0 siblings, 2 replies; 6+ messages in thread
From: Bernhard Rosenkraenzer @ 2006-05-11  1:22 UTC (permalink / raw)
  To: linux-kernel

Hi,
I've built a CD that installs a customized system; basically what it does is 
boot from CD (iso9660) and run

mke2fs -m0 -j -O dir_index,filetype,has_journal,sparse_super /dev/hda1
mount -text3 /dev/hda1 /mnt/dest
rpm -r /mnt/dest -ivh /RPMS/*

This worked perfectly up until some recent kernel updates - with current 
kernels (both Linus and -mm branch), the system locks up totally at a random 
point during rpm installation (everything goes down, including the NumLock 
LED etc).

I'm currently building some old kernels to see when this problem was 
introduced and sort out what patch did it.
With 2.6.16-rc6, the problem occurs, but unlike later revisions it gives an 
error message before freezing:

BUG: soft lockup detected on CPU#0!

Pid: 421, comm: kjournald
EIP: 0060:[<b01a2f52>] CPU: 0
EIP is at journal_commit_transaction+0x92e/0xfcc
EFLAGS: 00000297 Not tainted (2.6.16-rc6 #1)
EAX: 00000001 EBX: c2d34788 ECX: 00000001 EDX: c785e000
ESI: b3ff8d04 EDI: 000000f0 EBP: b683b840 DS: 007b ES: 007b
CR0: 8005003b CR2: 0841f7fc CR3: 17217000 CR4: 000006d0
 [<b02bd52e>] schedule+0x2ee/0x5b6
 [<b01a6a88>] kjournald+0x201/0x213
 [<b0111089>] smp_apic_timer_interrupt+0x32/0x49
 [<b01a6937>] kjournald+0xb0/0x213
 [<b01a5ffa>] commit_timeout+0x0/0x9
 [<b012a789>] autoremove_wake_function+0x0/0x4b
 [<b01a6887>] kjournald+0x0/0x213
 [<b0101005>] kernel_thread_helper+0x5/0xb

I'll keep trying to isolate which patch introduced this; at the moment the 
last known working kernel is 2.6.15 (and 2.6.15-mm4), the first known broken 
one is 2.6.16-rc6; the ones in between are still waiting for the compile box.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Total machine lockup w/ current kernels while installing from CD
  2006-05-11  1:22 Total machine lockup w/ current kernels while installing from CD Bernhard Rosenkraenzer
@ 2006-05-14 16:57 ` Bernhard Rosenkraenzer
  2006-05-15 20:32 ` [FIXED] " Bernhard Rosenkraenzer
  1 sibling, 0 replies; 6+ messages in thread
From: Bernhard Rosenkraenzer @ 2006-05-14 16:57 UTC (permalink / raw)
  To: linux-kernel

On Thursday, 11. May 2006 03:22, Bernhard Rosenkraenzer wrote:
> Hi,
> I've built a CD that installs a customized system; basically what it does
> is boot from CD (iso9660) and run
>
> mke2fs -m0 -j -O dir_index,filetype,has_journal,sparse_super /dev/hda1
> mount -text3 /dev/hda1 /mnt/dest
> rpm -r /mnt/dest -ivh /RPMS/*
>
> This worked perfectly up until some recent kernel updates - with current
> kernels (both Linus and -mm branch), the system locks up totally at a
> random point during rpm installation (everything goes down, including the
> NumLock LED etc).
>
> I'm currently building some old kernels to see when this problem was
> introduced and sort out what patch did it.
> With 2.6.16-rc6, the problem occurs, but unlike later revisions it gives an
> error message before freezing:
>
> BUG: soft lockup detected on CPU#0!
>
> Pid: 421, comm: kjournald
> EIP: 0060:[<b01a2f52>] CPU: 0
> EIP is at journal_commit_transaction+0x92e/0xfcc
> EFLAGS: 00000297 Not tainted (2.6.16-rc6 #1)
> EAX: 00000001 EBX: c2d34788 ECX: 00000001 EDX: c785e000
> ESI: b3ff8d04 EDI: 000000f0 EBP: b683b840 DS: 007b ES: 007b
> CR0: 8005003b CR2: 0841f7fc CR3: 17217000 CR4: 000006d0
>  [<b02bd52e>] schedule+0x2ee/0x5b6
>  [<b01a6a88>] kjournald+0x201/0x213
>  [<b0111089>] smp_apic_timer_interrupt+0x32/0x49
>  [<b01a6937>] kjournald+0xb0/0x213
>  [<b01a5ffa>] commit_timeout+0x0/0x9
>  [<b012a789>] autoremove_wake_function+0x0/0x4b
>  [<b01a6887>] kjournald+0x0/0x213
>  [<b0101005>] kernel_thread_helper+0x5/0xb

I'm finally through compiling all kernels between the last good and first bad 
one (slow hardware sucks ;) ) -- the problem was introduced in the patch from 
2.6.16-rc5 to 2.6.16-rc6, and is apparently not present in any -mm releases 
before 2.6.16-rc6.

I'll try to isolate which change between 2.6.16-rc5 and -rc6 causes it... But 
it would be nice if someone who knows a bit more about the code involved 
could look at it.
l

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [FIXED] Re: Total machine lockup w/ current kernels while installing from CD
  2006-05-11  1:22 Total machine lockup w/ current kernels while installing from CD Bernhard Rosenkraenzer
  2006-05-14 16:57 ` Bernhard Rosenkraenzer
@ 2006-05-15 20:32 ` Bernhard Rosenkraenzer
  2006-05-15 20:45   ` Andrew Morton
  1 sibling, 1 reply; 6+ messages in thread
From: Bernhard Rosenkraenzer @ 2006-05-15 20:32 UTC (permalink / raw)
  To: linux-kernel

On Thursday, 11. May 2006 03:22, Bernhard Rosenkraenzer wrote:
> Hi,
> I've built a CD that installs a customized system
> [... crash at a random point ...]
> BUG: soft lockup detected on CPU#0!
>
> Pid: 421, comm: kjournald
> EIP: 0060:[<b01a2f52>] CPU: 0
> EIP is at journal_commit_transaction+0x92e/0xfcc
> EFLAGS: 00000297 Not tainted (2.6.16-rc6 #1)
> EAX: 00000001 EBX: c2d34788 ECX: 00000001 EDX: c785e000
> ESI: b3ff8d04 EDI: 000000f0 EBP: b683b840 DS: 007b ES: 007b
> CR0: 8005003b CR2: 0841f7fc CR3: 17217000 CR4: 000006d0
>  [<b02bd52e>] schedule+0x2ee/0x5b6
>  [<b01a6a88>] kjournald+0x201/0x213
>  [<b0111089>] smp_apic_timer_interrupt+0x32/0x49
>  [<b01a6937>] kjournald+0xb0/0x213
>  [<b01a5ffa>] commit_timeout+0x0/0x9
>  [<b012a789>] autoremove_wake_function+0x0/0x4b
>  [<b01a6887>] kjournald+0x0/0x213
>  [<b0101005>] kernel_thread_helper+0x5/0xb

After backing out lots of changes, I've figured out the problem is caused by 
this bit of 2.6.16-rc6:

diff -urN linux-2.6.16-rc5/kernel/sched.c linux-2.6.16-rc6/kernel/sched.c
--- linux-2.6.16-rc5/kernel/sched.c	2006-05-11 20:04:18.000000000 +0200
+++ linux-2.6.16-rc6/kernel/sched.c	2006-05-11 20:00:00.000000000 +0200
@@ -4028,6 +4021,8 @@
 	 */
 	if (unlikely(preempt_count()))
 		return;
+	if (unlikely(system_state != SYSTEM_RUNNING))
+		return;
 	do {
 		add_preempt_count(PREEMPT_ACTIVE);
 		schedule();


The problem is that (to save a couple of bits of space), my simple installer 
was running inside an initrd -- and system_state isn't set to SYSTEM_RUNNING 
before linuxrc is executed --> scheduler breakage causes the oops.

Since initrds are deprecated anyway, I've used this opportunity to update the 
installer to use initramfs instead, and as expected, the problem went away -- 
nevertheless [unless initrds are being removed in 2.6.18 anyway] we should 
probably fix this or at least document it somewhe.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [FIXED] Re: Total machine lockup w/ current kernels while installing from CD
  2006-05-15 20:32 ` [FIXED] " Bernhard Rosenkraenzer
@ 2006-05-15 20:45   ` Andrew Morton
  2006-05-15 20:53     ` Bernhard Rosenkraenzer
  0 siblings, 1 reply; 6+ messages in thread
From: Andrew Morton @ 2006-05-15 20:45 UTC (permalink / raw)
  To: Bernhard Rosenkraenzer; +Cc: linux-kernel, Ingo Molnar

Bernhard Rosenkraenzer <bero@arklinux.org> wrote:
>
> On Thursday, 11. May 2006 03:22, Bernhard Rosenkraenzer wrote:
> > Hi,
> > I've built a CD that installs a customized system
> > [... crash at a random point ...]
> > BUG: soft lockup detected on CPU#0!
> >
> > Pid: 421, comm: kjournald
> > EIP: 0060:[<b01a2f52>] CPU: 0
> > EIP is at journal_commit_transaction+0x92e/0xfcc
> > EFLAGS: 00000297 Not tainted (2.6.16-rc6 #1)
> > EAX: 00000001 EBX: c2d34788 ECX: 00000001 EDX: c785e000
> > ESI: b3ff8d04 EDI: 000000f0 EBP: b683b840 DS: 007b ES: 007b
> > CR0: 8005003b CR2: 0841f7fc CR3: 17217000 CR4: 000006d0
> >  [<b02bd52e>] schedule+0x2ee/0x5b6
> >  [<b01a6a88>] kjournald+0x201/0x213
> >  [<b0111089>] smp_apic_timer_interrupt+0x32/0x49
> >  [<b01a6937>] kjournald+0xb0/0x213
> >  [<b01a5ffa>] commit_timeout+0x0/0x9
> >  [<b012a789>] autoremove_wake_function+0x0/0x4b
> >  [<b01a6887>] kjournald+0x0/0x213
> >  [<b0101005>] kernel_thread_helper+0x5/0xb
> 
> After backing out lots of changes, I've figured out the problem is caused by 
> this bit of 2.6.16-rc6:
> 
> diff -urN linux-2.6.16-rc5/kernel/sched.c linux-2.6.16-rc6/kernel/sched.c
> --- linux-2.6.16-rc5/kernel/sched.c	2006-05-11 20:04:18.000000000 +0200
> +++ linux-2.6.16-rc6/kernel/sched.c	2006-05-11 20:00:00.000000000 +0200
> @@ -4028,6 +4021,8 @@
>  	 */
>  	if (unlikely(preempt_count()))
>  		return;
> +	if (unlikely(system_state != SYSTEM_RUNNING))
> +		return;
>  	do {
>  		add_preempt_count(PREEMPT_ACTIVE);
>  		schedule();

(That's cond_resched())

> 
> The problem is that (to save a couple of bits of space), my simple installer 
> was running inside an initrd -- and system_state isn't set to SYSTEM_RUNNING 
> before linuxrc is executed --> scheduler breakage causes the oops.

ah-hah.

It's odd that we'll run initrds in a !SYSTEM_RUNNING state.

It's not an oops - it's sort-of a warning.  Did the system actually
continue to run and boot up OK?

If so, I'd assume that the ext3 filesystem was mounted on a very slow
device - perhaps an IDE disk in PIO mode?

Perhaps we should poke the softlockup detector if someone called
cond_resched() when in a reschedulable state.


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [FIXED] Re: Total machine lockup w/ current kernels while installing from CD
  2006-05-15 20:45   ` Andrew Morton
@ 2006-05-15 20:53     ` Bernhard Rosenkraenzer
  2006-05-15 21:12       ` Andrew Morton
  0 siblings, 1 reply; 6+ messages in thread
From: Bernhard Rosenkraenzer @ 2006-05-15 20:53 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Ingo Molnar

On Monday, 15. May 2006 22:45, Andrew Morton wrote:
> It's odd that we'll run initrds in a !SYSTEM_RUNNING state.

True, especially because we run initramfs in SYSTEM_RUNNING state.

> It's not an oops - it's sort-of a warning.  Did the system actually
> continue to run and boot up OK?

No, it was a lockup and the system just hung at the point forever, so the 
lockup detection was right.

> If so, I'd assume that the ext3 filesystem was mounted on a very slow
> device - perhaps an IDE disk in PIO mode?

That too - happened with a pretty stupid 5-liner installation script that just 
formats the disk and installs a set of customized rpms from CD.

The hw we installed this on is Asus Pundit-R boxes, which means, basically, 
weird IDE setup (no secondary IDE --> harddisk on hda, CD drive on hdb) on an 
ATI IGP chipset, not exactly the fastest out there.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [FIXED] Re: Total machine lockup w/ current kernels while installing from CD
  2006-05-15 20:53     ` Bernhard Rosenkraenzer
@ 2006-05-15 21:12       ` Andrew Morton
  0 siblings, 0 replies; 6+ messages in thread
From: Andrew Morton @ 2006-05-15 21:12 UTC (permalink / raw)
  To: Bernhard Rosenkraenzer; +Cc: linux-kernel, mingo

Bernhard Rosenkraenzer <bero@arklinux.org> wrote:
>
> On Monday, 15. May 2006 22:45, Andrew Morton wrote:
> > It's odd that we'll run initrds in a !SYSTEM_RUNNING state.
> 
> True, especially because we run initramfs in SYSTEM_RUNNING state.
> 
> > It's not an oops - it's sort-of a warning.  Did the system actually
> > continue to run and boot up OK?
> 
> No, it was a lockup and the system just hung at the point forever, so the 
> lockup detection was right.

This is odd.  Your machine hangs if cond_resched() is a no-op.  This should
not happen.

What does `grep PREEMPT .config' say?


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2006-05-15 21:09 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-05-11  1:22 Total machine lockup w/ current kernels while installing from CD Bernhard Rosenkraenzer
2006-05-14 16:57 ` Bernhard Rosenkraenzer
2006-05-15 20:32 ` [FIXED] " Bernhard Rosenkraenzer
2006-05-15 20:45   ` Andrew Morton
2006-05-15 20:53     ` Bernhard Rosenkraenzer
2006-05-15 21:12       ` Andrew Morton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox