* BUG: Replaying ReiserFS log causes hard lockup
@ 2004-03-25 10:48 Adam Nielsen
2004-03-25 11:15 ` Vladimir Saveliev
2004-03-25 18:51 ` Matthias Andree
0 siblings, 2 replies; 10+ messages in thread
From: Adam Nielsen @ 2004-03-25 10:48 UTC
To: reiserfs-list
Hi,
My apologies if this is the wrong place to send a report like this, but
I couldn't find anywhere more suitable!
I have just come across a rather annoying bug, which seems to be most
likely located in ReiserFS as my computer now locks up while
booting a stock 2.6.3 kernel almost immediately after displaying this:
found reiserfs format "3.6" with standard journal
Reiserfs journal params: device sda1, size 8192, journal first block 18,
max trans len 1024, max batch 900, max commit age 30, max trans age 30
reiserfs: checking transaction log (sda1) for (sda1)
And it's a hard lockup - my keyboard LEDs stop working as does the magic
SysRq key. Are there any known problems that could exhibit
symptoms like this? (e.g. will upgrading to kernel 2.6.4 fix it?)
The drive itself is a Western Digital SATA drive, connected via an
onboard Silicon Image 3512A, which (I think) uses the under-development
sil3112 driver. I guess there's a chance it could be the driver's
fault, but I have not had any problems at all before this. The
board itself is a Gigabyte GA-7N400 Pro 2 (rev 2.0) with an nForce 2
chipset.
I've got no idea how to go about diagnosing the problem further, so any
suggestions would be greatly appreciated! I'm not subscribed to this
list, so please CC any responses to me.
Oh, and one last thing: the problem seems to be caused by writing to the
drive (I assume). My computer locked up sometime this afternoon, and it
was only after I reset it that it started locking up while booting, I
guess while trying to repeat the same write operation that caused the
original lockup.
Anyway, any info would be greatly appreciated (as would the ability to
be able to boot my computer again ;-))
Thanks,
Adam.
P.S. I've never tried debugging a kernel before, but I am able to
compile a kernel and test it on this machine (by booting it over the
network) and I do have a serial cable, should it come in handy for
anything ;-)
* Re: BUG: Replaying ReiserFS log causes hard lockup
2004-03-25 10:48 BUG: Replaying ReiserFS log causes hard lockup Adam Nielsen
@ 2004-03-25 11:15 ` Vladimir Saveliev
2004-03-25 12:14 ` Adam Nielsen
2004-03-25 12:54 ` Chris Mason
2004-03-25 18:51 ` Matthias Andree
1 sibling, 2 replies; 10+ messages in thread
From: Vladimir Saveliev @ 2004-03-25 11:15 UTC
To: Adam Nielsen; +Cc: reiserfs-list
Hello
On Thu, 2004-03-25 at 13:48, Adam Nielsen wrote:
> Hi,
>
> My apologies if this is the wrong place to send a report like this, but
> I couldn't find anywhere more suitable!
>
> I have just come across a rather annoying bug, which seems to be most
> likely located in ReiserFS as my computer now locks up while
> booting a stock 2.6.3 kernel almost immediately after displaying this:
>
> found reiserfs format "3.6" with standard journal
> Reiserfs journal params: device sda1, size 8192, journal first block 18,
> max trans len 1024, max batch 900, max commit age 30, max trans age 30
> reiserfs: checking transaction log (sda1) for (sda1)
>
Is sda1 your root filesystem?
To start with, you should boot without mounting it and run reiserfsck
on it.
> And it's a hard lockup - my keyboard LEDs stop working as does the magic
> SysRq key. Are there any known problems that could exhibit
> symptoms like this? (e.g. will upgrading to kernel 2.6.4 fix it?)
>
> The drive itself is a Western Digital SATA drive, connected via an
> onboard Silicon Image 3512A, which (I think) uses the under-development
> sil3112 driver. I guess there's a chance it could be the driver's
> fault, but I have not had any problems at all before this. The
> board itself is a Gigabyte GA-7N400 Pro 2 (rev 2.0) with an nForce 2
> chipset.
>
> I've got no idea how to go about diagnosing the problem further, so any
> suggestions would be greatly appreciated! I'm not subscribed to this
> list, so please CC any responses to me.
>
> Oh, and one last thing: the problem seems to be caused by writing to the
> drive (I assume). My computer locked up sometime this afternoon, and it
> was only after I reset it that it started locking up while booting, I
> guess while trying to repeat the same write operation that caused the
> original lockup.
>
> Anyway, any info would be greatly appreciated (as would the ability to
> be able to boot my computer again ;-))
>
> Thanks,
> Adam.
>
> P.S. I've never tried debugging a kernel before, but I am able to
> compile a kernel and test it on this machine (by booting it over the
> network) and I do have a serial cable, should it come in handy for
> anything ;-)
>
* Re: BUG: Replaying ReiserFS log causes hard lockup
2004-03-25 11:15 ` Vladimir Saveliev
@ 2004-03-25 12:14 ` Adam Nielsen
2004-03-25 12:55 ` Vladimir Saveliev
2004-03-25 12:54 ` Chris Mason
1 sibling, 1 reply; 10+ messages in thread
From: Adam Nielsen @ 2004-03-25 12:14 UTC
To: reiserfs-list; +Cc: Vladimir Saveliev
Hi,
Thanks for your reply!
> Is sda1 your root filesystem?
> To start with, you should boot without mounting it and run
> reiserfsck on it.
Yes, it's the only partition (apart from sda2 which is swap.) I've done
some more fiddling around with it, and I'm starting to think that
perhaps after all it's not ReiserFS. I disconnected and reconnected the
SATA cable, and suddenly it got past that point where it locked up
before. Now I can mostly read data off the drive fine, but I still
can't boot from it. Lilo only gets as far as "LIL", and when booting
the kernel over the network it seems to run fine until I try reading
from /boot/bzImage where most of the time it will cause the hard lockup.
So I'm not sure whether it's ReiserFS "crashing" when it can't read the
data, the sil3112 module misbehaving, or a physical hardware problem
(which seemed unlikely until wiggling the SATA cable helped...!) I
would've assumed though, that a loose SATA cable would cause the drive
to either work or not work - but working fine except for booting and
reading certain files seems a little weird (it also locked up when I
loaded X11.)
Any more ideas? I don't really know what to do next...!
Thanks,
Adam.
(please CC replies)
* Re: BUG: Replaying ReiserFS log causes hard lockup
2004-03-25 11:15 ` Vladimir Saveliev
2004-03-25 12:14 ` Adam Nielsen
@ 2004-03-25 12:54 ` Chris Mason
2004-03-26 12:01 ` Adam Nielsen
1 sibling, 1 reply; 10+ messages in thread
From: Chris Mason @ 2004-03-25 12:54 UTC
To: Vladimir Saveliev; +Cc: Adam Nielsen, reiserfs-list
On Thu, 2004-03-25 at 06:15, Vladimir Saveliev wrote:
> Hello
>
> On Thu, 2004-03-25 at 13:48, Adam Nielsen wrote:
> > Hi,
> >
> > My apologies if this is the wrong place to send a report like this, but
> > I couldn't find anywhere more suitable!
> >
> > I have just come across a rather annoying bug, which seems to be most
> > likely located in ReiserFS as my computer now locks up while
> > booting a stock 2.6.3 kernel almost immediately after displaying this:
> >
> > found reiserfs format "3.6" with standard journal
> > Reiserfs journal params: device sda1, size 8192, journal first block 18,
> > max trans len 1024, max batch 900, max commit age 30, max trans age 30
> > reiserfs: checking transaction log (sda1) for (sda1)
> >
Please compile the kernel with sysrq support, and try to do a sysrq-p
when things are locked. If the kernel is completely frozen, you can
turn on the nmi watchdog, which should trigger an oops to tell us where
things are broken. You can get more details about both of these from
the Documentation directory in the kernel source:
Documentation/nmi_watchdog.txt
Documentation/sysrq.txt
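
One way to confirm that the NMI watchdog is actually firing is to sample
the NMI line of /proc/interrupts twice and compare the counters. The
sketch below is illustrative only: the helper names nmi_counts and
watchdog_is_ticking are made up for this example, and it assumes the
usual layout in which the line starts with "NMI:" followed by one
counter per CPU (optionally followed by a description).

```python
def nmi_counts(interrupts_text):
    """Return the per-CPU NMI counters found in /proc/interrupts-style text.

    Returns an empty list if no line starting with "NMI:" is present.
    """
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "NMI:":
            counts = []
            for field in fields[1:]:
                if not field.isdigit():
                    break  # reached the trailing description, if any
                counts.append(int(field))
            return counts
    return []


def watchdog_is_ticking(before, after):
    """Return True if any CPU's NMI counter grew between two snapshots."""
    return any(old < new for old, new in zip(before, after))
```

In practice the two snapshots would come from reading /proc/interrupts
a few seconds apart. Counters that stay at zero suggest the watchdog
is not running, so a hard lockup would not produce an oops.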
-chris
* Re: BUG: Replaying ReiserFS log causes hard lockup
2004-03-25 12:14 ` Adam Nielsen
@ 2004-03-25 12:55 ` Vladimir Saveliev
0 siblings, 0 replies; 10+ messages in thread
From: Vladimir Saveliev @ 2004-03-25 12:55 UTC
To: Adam Nielsen; +Cc: reiserfs-list
Hello
On Thu, 2004-03-25 at 15:14, Adam Nielsen wrote:
> Hi,
>
> Thanks for your reply!
>
> > Is sda1 your root filesystem?
> > To start with, you should boot without mounting it and run
> > reiserfsck on it.
>
> Yes, it's the only partition (apart from sda2 which is swap.) I've done
> some more fiddling around with it, and I'm starting to think that
> perhaps after all it's not ReiserFS. I disconnected and reconnected the
> SATA cable, and suddenly it got past that point where it locked up
> before. Now I can mostly read data off the drive fine, but I still
> can't boot from it. Lilo only gets as far as "LIL", and when booting
> the kernel over the network it seems to run fine until I try reading
> from /boot/bzImage where most of the time it will cause the hard lockup.
>
> So I'm not sure whether it's ReiserFS "crashing" when it can't read the
> data, the sil3112 module misbehaving, or a physical hardware problem
> (which seemed unlikely until wiggling the SATA cable helped...!) I
> would've assumed though, that a loose SATA cable would cause the drive
> to either work or not work - but working fine except for booting and
> reading certain files seems a little weird (it also locked up when I
> loaded X11.)
>
> Any more ideas? I don't really know what to do next...!
>
You should boot from something other than /dev/sda1 and run reiserfsck
on it.
> Thanks,
> Adam.
> (please CC replies)
>
* Re: BUG: Replaying ReiserFS log causes hard lockup
2004-03-25 10:48 BUG: Replaying ReiserFS log causes hard lockup Adam Nielsen
2004-03-25 11:15 ` Vladimir Saveliev
@ 2004-03-25 18:51 ` Matthias Andree
1 sibling, 0 replies; 10+ messages in thread
From: Matthias Andree @ 2004-03-25 18:51 UTC
To: Adam Nielsen; +Cc: reiserfs-list
Adam Nielsen <a.nielsen@optushome.com.au> writes:
> The drive itself is a Western Digital SATA drive, connected via an
> onboard Silicon Image 3512A, which (I think) uses the under-development
> sil3112 driver. I guess there's a chance it could be the driver's
> fault, but I have not had any problems at all before this. The
> board itself is a Gigabyte GA-7N400 Pro 2 (rev 2.0) with an nForce 2
> chipset.
CMD/Silicon Image hardware, at least the SATA stuff, is said to be utter
crap, with hardware bugs that require workarounds and all that. Get the
latest 2.6 kernel you can find (-bk variants included) and retry in the
hope the more recent kernel already has the needed workaround.
Better yet, ditch SATA, buy SCSI, that will actually work and perform
well. Linux SATA support hasn't quite made it from the cradle.
--
Matthias Andree
Encrypt your mail: my GnuPG key ID is 0x052E7D95
* Re: BUG: Replaying ReiserFS log causes hard lockup
2004-03-25 12:54 ` Chris Mason
@ 2004-03-26 12:01 ` Adam Nielsen
2004-03-26 12:34 ` Adam Nielsen
0 siblings, 1 reply; 10+ messages in thread
From: Adam Nielsen @ 2004-03-26 12:01 UTC
To: Chris Mason; +Cc: reiserfs-list
> Please compile the kernel with sysrq support, and try to do a sysrq-p
> when things are locked. If the kernel is completely frozen, you can
> turn on the nmi watchdog, which should trigger an oops to tell us
> where things are broken.
Well, try as I might, I can't get the oops to appear. SysRq doesn't
work since it's a hard lockup, and the NMI watchdog doesn't seem to work
either. I got the kernel booting once with nmi_watchdog=1 which gave an
interrupt count of zero, and then again with nmi_watchdog=2 which
incremented the NMI count about once every 30 seconds (but I also had
profiling turned on, which appears to use the NMI interrupt.) I have
since turned profiling off, disabled the preemptive kernel, enabled
debugging info, made sure the APIC etc. was on but it has stopped
booting completely and now locks up during boot. Now it goes as far as:
reiserfs: checking transaction log (sda1) for (sda1)
Using r5 hash to sort names
VFS: Mounted root (reiserfs filesystem) readonly.
Freeing unused kernel memory: 164k freed
Then it locks. Once it printed the message about INIT starting after
this, so I guess it's something around this point that's causing the
problem. However since it seems to be a *really* bad lock, I don't know
how to proceed :-(
Cheers,
Adam. (please CC replies)
* Re: BUG: Replaying ReiserFS log causes hard lockup
2004-03-26 12:01 ` Adam Nielsen
@ 2004-03-26 12:34 ` Adam Nielsen
2004-03-26 15:46 ` Jeff Mahoney
0 siblings, 1 reply; 10+ messages in thread
From: Adam Nielsen @ 2004-03-26 12:34 UTC
To: reiserfs-list
> Then it locks. Once it printed the message about INIT starting after
> this, so I guess it's something around this point that's causing the
> problem. However since it seems to be a *really* bad lock, I don't
> know how to proceed :-(
Ok, I've done some more fiddling and I think I've narrowed the problem
down to the SATA driver. Since the kernel used to boot, then it
started locking up during booting, I was trying to work out what could
cause this. I tried wiggling the SATA cable again, and that fixed
it - I could boot Linux again.
Wiggling the SATA cable while Linux was loaded didn't seem to cause a
problem, so I started dd reading off a ton of data and started fiddling
with the cable. I couldn't seem to fault anything, so I tried
unplugging the cable altogether (as SATA is hot swappable.) Obviously
everything accessing the drive paused, and when I plugged the cable back
into the drive it seemed as though everything resumed again (because the
drive started reading off a ton of data due to dd running.) However,
after a few seconds the drive stopped reading, and dd wouldn't
terminate. I tried switching to another VT and logging in, but after I
typed in my login name it froze too. It seems that if the drive gets
disconnected and reconnected, everything accessing it stops (I'm
guessing due to a driver problem.)
I'm still not sure why it would lock up during boot, but perhaps when
the SATA hardware is initialised the drive disconnects and reconnects,
and if it hasn't reconnected by the time something (e.g. ReiserFS) wants
to access the disk, that causes the lockup - only since the lock happens
in the kernel, it completely locks everything up. Of course, this is
just a wild guess, but given the behaviour it does seem like a
possibility. Especially if hotswapping is a feature not yet implemented
in the driver (which is quite likely, as it is still only in the testing
stage.)
Anyway, thanks for all your help, and I might try to find the SI3512
driver people and see what they think. Wish me luck ;-)
Cheers,
Adam.
* Re: BUG: Replaying ReiserFS log causes hard lockup
2004-03-26 12:34 ` Adam Nielsen
@ 2004-03-26 15:46 ` Jeff Mahoney
2004-03-26 23:06 ` Adam Nielsen
0 siblings, 1 reply; 10+ messages in thread
From: Jeff Mahoney @ 2004-03-26 15:46 UTC
To: Adam Nielsen; +Cc: reiserfs-list
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Adam Nielsen wrote:
|>Then it locks. Once it printed the message about INIT starting after
|>this, so I guess it's something around this point that's causing the
|>problem. However since it seems to be a *really* bad lock, I don't
|>know how to proceed :-(
|
|
| Ok, I've done some more fiddling and I think I've narrowed the problem
| down to the SATA driver. Since the kernel used to boot, then it
| started locking up during booting, I was trying to work out what could
| cause this. I tried wiggling the SATA cable again, and that fixed
| it - I could boot Linux again.
|
| Wiggling the SATA cable while Linux was loaded didn't seem to cause a
| problem, so I started dd reading off a ton of data and started fiddling
| with the cable. I couldn't seem to fault anything, so I tried
| unplugging the cable altogether (as SATA is hot swappable.) Obviously
| everything accessing the drive paused, and when I plugged the cable back
| into the drive it seemed as though everything resumed again (because the
| drive started reading off a ton of data due to dd running.) However,
| after a few seconds the drive stopped reading, and dd wouldn't
| terminate. I tried switching to another VT and logging in, but after I
| typed in my login name it froze too. It seems that if the drive gets
| disconnected and reconnected, everything accessing it stops (I'm
| guessing due to a driver problem.)
|
| I'm still not sure why it would lock up during boot, but perhaps when
| the SATA hardware is initialised the drive disconnects and reconnects,
| and if it hasn't reconnected by the time something (e.g. ReiserFS) wants
| to access the disk, that causes the lockup - only since the lock happens
| in the kernel, it completely locks everything up. Of course, this is
| just a wild guess, but given the behaviour it does seem like a
| possibility. Especially if hotswapping is a feature not yet implemented
| in the driver (which is quite likely, as it is still only in the testing
| stage.)
|
| Anyway, thanks for all your help, and I might try to find the SI3512
| driver people and see what they think. Wish me luck ;-)
It may be a driver problem, but it could also be a shortcoming in the
way ReiserFS currently handles write errors in the journal.
Even when all you're doing is reading from the disk, you're still
writing a little bit (unless you've mounted with -onoatime,nodiratime).
ReiserFS can't deal at all with the device going away when it's
performing journal operations. Since updating the atime for a file
requires altering the metadata, it uses the journal. If I had to guess,
I'd say that you'd see ReiserFS panic in your logs, with something along
the lines of "journal-###: buffer write failed". A panic will make all
further access to the filesystem hang.
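
To see whether a given filesystem is mounted with those options, the
option field of /proc/mounts can be inspected. This is only a sketch:
the function name atime_update_options is invented for this example,
and it assumes the standard six-field /proc/mounts layout (device,
mount point, type, options, dump, pass). Note that the root device may
be listed under a different name than the one used at boot.

```python
def atime_update_options(mounts_text, device):
    """Report atime-related mount options for `device`.

    `mounts_text` is text in /proc/mounts format. Returns a dict of
    booleans, or None if the device is not listed.
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[0] == device:
            options = set(fields[3].split(","))
            return {
                "noatime": "noatime" in options,
                "nodiratime": "nodiratime" in options,
                "read_only": "ro" in options,
            }
    return None
```

If both noatime and nodiratime come back False on a read-write mount,
plain reads can still update access times and so still go through the
journal, which is the situation described above.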
I'm in the final stages of a patch that will allow ReiserFS to handle
journal io-errors more gracefully. The result will be, rather than
panicking the system on journal write, the filesystem will be forced
read-only and all active transactions will be aborted and released. The
filesystem will be umount'able, and on re-mount will appear much as if
a power failure had occurred. However, since it did abort on an
io-error, I'd recommend a reiserfsck on the aborted partition.
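
The intended behaviour can be pictured with a toy model. This is not
the patch and not ReiserFS code; ToyFilesystem, JournalAborted and the
write_block callback are all hypothetical names used purely to
illustrate "abort active transactions and go read-only instead of
panicking" when a journal write fails.

```python
class JournalAborted(Exception):
    """Raised when the toy filesystem refuses or abandons a transaction."""


class ToyFilesystem:
    def __init__(self, write_block):
        # write_block is a callable standing in for the block device;
        # it is expected to raise OSError on an I/O error.
        self.write_block = write_block
        self.read_only = False
        self.active = []  # names of open transactions

    def begin(self, name):
        if self.read_only:
            raise JournalAborted("filesystem is read-only")
        self.active.append(name)

    def commit(self, name, data):
        if self.read_only:
            raise JournalAborted("filesystem is read-only")
        try:
            self.write_block(data)
        except OSError as err:
            self.abort()
            raise JournalAborted("journal write failed") from err
        self.active.remove(name)

    def abort(self):
        # Drop every open transaction and refuse further writes.
        self.read_only = True
        self.active.clear()
```

After an abort the model still answers (it raises an error rather than
hanging), which mirrors the idea that the filesystem stays
unmountable-cleanly instead of blocking all further access.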
- -Jeff
- --
Jeff Mahoney
SuSE Labs
jeffm@suse.com
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.2 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org
iD8DBQFAZFBgLPWxlyuTD7IRAgOYAJ4m9a2QRsUH7BuB7igHOWZf3P3j4ACfWml/
0aYNPCreOG3UQbI4/YNJTSw=
=Bqo5
-----END PGP SIGNATURE-----
* Re: BUG: Replaying ReiserFS log causes hard lockup
2004-03-26 15:46 ` Jeff Mahoney
@ 2004-03-26 23:06 ` Adam Nielsen
0 siblings, 0 replies; 10+ messages in thread
From: Adam Nielsen @ 2004-03-26 23:06 UTC
To: Jeff Mahoney; +Cc: reiserfs-list
> It may be a driver problem, but it could also be a shortcoming in the
> way ReiserFS currently handles write errors in the journal.
I suppose that depends on how the driver handles the situation.
Presumably the best way to handle an active drive being disconnected
would simply be to block, so that anything accessing the disk (e.g.
ReiserFS) would sit there forever and a day waiting for the drive to
come back online - and when the drive does come back, everything could
continue where it left off.
If the driver doesn't do that however, then ReiserFS would be put in a
situation where the drive suddenly disappeared and it would have to work
out how to cope with it.
> Even when all you're doing is reading from the disk, you're still
> writing a little bit (unless you've mounted with
> -onoatime,nodiratime). ReiserFS can't deal at all with the device
> going away when it's performing journal operations. Since updating the
> atime for a file requires altering the metadata, it uses the journal.
> If I had to guess, I'd say that you'd see ReiserFS panic in your logs,
> with something along the lines of "journal-###: buffer write failed".
> A panic will make all further access to the filesystem hang.
That makes sense, but I think the fault lies higher up. Once the drive
disconnects, I don't think ReiserFS even gets a chance to complain. I'm
not sure which logs you're referring to, but if I run "cat /proc/kmsg"
while I'm dd-ing data off the drive, I don't get any messages at all,
even after unplugging the drive (and trying to access the filesystem.)
Oh, I just noticed that I do eventually receive an "ata1: DMA timeout,
stat 0x1" message - which does sound like it should be something handled
in the driver.
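
To tell the two kinds of message apart in a captured kernel log, a
small filter along these lines may help. It is a sketch under
assumptions: the patterns are modelled only on the two messages quoted
in this thread ("ataN: DMA timeout, stat 0x..." and "journal-###: ..."),
and the function name classify_kernel_log is made up for this example.

```python
import re

# Modelled on the "ata1: DMA timeout, stat 0x1" message seen above.
ATA_TIMEOUT = re.compile(r"\bata(\d+): DMA timeout, stat (0x[0-9a-fA-F]+)")
# Modelled on the "journal-###: ..." form; ### stands for any digits.
REISERFS_JOURNAL = re.compile(r"\bjournal-(\d+): (.*)")


def classify_kernel_log(lines):
    """Sort log lines into driver-level and ReiserFS journal findings."""
    findings = {"driver": [], "reiserfs": []}
    for line in lines:
        match = ATA_TIMEOUT.search(line)
        if match:
            findings["driver"].append((int(match.group(1)), match.group(2)))
            continue
        match = REISERFS_JOURNAL.search(line)
        if match:
            findings["reiserfs"].append(
                (int(match.group(1)), match.group(2).strip())
            )
    return findings
```

A log with only "driver" entries and no "reiserfs" ones would be
consistent with the request never reaching the point where ReiserFS
could report a journal error.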
> I'm in the final stages of a patch that will allow ReiserFS to handle
> journal io-errors more gracefully. The result will be, rather than
> panicking the system on journal write, the filesystem will be forced
> read-only and all active transactions will be aborted and released.
> The filesystem will be umount'able, and on re-mount will appear much
> as if a power failure had occurred. However, since it did abort on
> an io-error, I'd recommend a reiserfsck on the aborted partition.
That sounds good, however I'm reluctant to run reiserfsck for a while
yet, as the computer has previously locked up while simply reading data
from the drive (e.g. dd if=/dev/sda1). I don't want to risk doing more
damage to the data (if it locks up in the middle of reiserfsck) until
I'm sure the kernel can fully cope with a drive disconnect (if that is
indeed the root cause of all these problems.)
Cheers,
Adam.
Thread overview: 10+ messages
2004-03-25 10:48 BUG: Replaying ReiserFS log causes hard lockup Adam Nielsen
2004-03-25 11:15 ` Vladimir Saveliev
2004-03-25 12:14 ` Adam Nielsen
2004-03-25 12:55 ` Vladimir Saveliev
2004-03-25 12:54 ` Chris Mason
2004-03-26 12:01 ` Adam Nielsen
2004-03-26 12:34 ` Adam Nielsen
2004-03-26 15:46 ` Jeff Mahoney
2004-03-26 23:06 ` Adam Nielsen
2004-03-25 18:51 ` Matthias Andree