2.4.23 && md raid1 && reiserfs panic

public inbox for linux-kernel@vger.kernel.org

* 2.4.23 && md raid1 && reiserfs panic
@ 2004-02-07 11:23 James Bromberger
  2004-02-08 17:22 ` Oleg Drokin
  0 siblings, 1 reply; 4+ messages in thread
From: James Bromberger @ 2004-02-07 11:23 UTC
  To: linux-kernel


[-- Attachment #1.1: Type: text/plain, Size: 1168 bytes --]



Hello all,

Can someone point me towards the right mailing list to help get this 
issue resolved? I think I've hit a bug in the MD, block or reiserfs 
code, but I could be completely wrong.

The symptoms: rm a file from a working RAID1 md reiserfs filesystem, 
and I get a panic, rm(1) segfaults, and all further I/O to any interactive 
shells stops. The entire system is rendered unusable; reboot (via 
ctrl-alt-del) doesn't shut down and the only option is to hard reset the box.

Attached are a copy of the oops as recorded by syslog and the script I 
run that triggers the error. The exact line that segfaults is the rm of 
a 2.2 GB tar file.

For reference, the system is documented at 
	http://www.james.rcpt.to/programs/debian/raid1/

This is using only the Debian kernel package, 
kernel-image-2.4.23-686_2.4.23-1.

If there is any further info I can send, please let me know.

Please CC me as I am not subscribed to LKML. Ta.

Regards,
  James
-- 
 James Bromberger <james_AT_rcpt.to> www.james.rcpt.to
 Remainder moved to http://www.james.rcpt.to/james/sig.html

I am in London on UK mobile +44 7952 042920.

[-- Attachment #1.2: error-2.4.23-1-686-md-reiserfs.txt --]
[-- Type: text/plain, Size: 4562 bytes --]

The kernel: 2.4.23-1-686-

The script:

DAY=`date +%d`
MONTH=`date +%m`
YEAR=`date +%Y`
FROM="/usr/local/fileshare /usr/local/psql"
TO="/usr/local/share/${YEAR}-${MONTH}-${DAY}-fileshare-backup.tbz"
TAR="/bin/tar"
CMD="$TAR -Pcjlf $TO $FROM"
/bin/echo -n "Removing old backup..."
/bin/rm /usr/local/share/????-??-??-fileshare-backup.tbz

The script running:

++ date +%d
+ DAY=07
++ date +%m
+ MONTH=02
++ date +%Y
+ YEAR=2004
+ FROM=/usr/local/fileshare /usr/local/psql
+ TO=/usr/local/share/2004-02-07-fileshare-backup.tbz
+ TAR=/bin/tar
+ CMD=/bin/tar -Pcjlf /usr/local/share/2004-02-07-fileshare-backup.tbz /usr/local/fileshare /usr/local/psql
+ /bin/echo -n 'Removing old backup...'
Removing old backup...
+ /bin/rm /usr/local/share/2004-01-30-fileshare-backup.tbz
./makecompressedbackup.sh: line 23:  9729 Segmentation fault      /bin/rm /usr/local/share/????-??-??-fileshare-backup.tbz
++ date
+ NOW=Sat Feb  7 17:21:19 WST 2004
+ echo 'Sat Feb  7 17:21:19 WST 2004 Doing /bin/tar -Pcjlf /usr/local/share/2004-02-07-fileshare-backup.tbz /usr/local/fileshare /usr/local/psql...'
Sat Feb  7 17:21:19 WST 2004 Doing /bin/tar -Pcjlf /usr/local/share/2004-02-07-fileshare-backup.tbz /usr/local/fileshare /usr/local/psql...
+ /bin/tar -Pcjf /usr/local/share/2004-02-07-fileshare-backup.tbz /usr/local/fileshare /usr/local/psql



From /var/log/syslog:


Feb  7 17:21:19 phobe kernel: md(9,5):vs-4075: reiserfs_free_block: block 1076949926 is out of range on md(9,5)
Feb  7 17:21:19 phobe kernel: md(9,5):vs-4075: reiserfs_free_block: block 2212879009 is out of range on md(9,5)
Feb  7 17:21:19 phobe kernel: Unable to handle kernel NULL pointer dereference at virtual address 0000028c
Feb  7 17:21:19 phobe kernel:  printing eip:
Feb  7 17:21:19 phobe kernel: e085124d
Feb  7 17:21:19 phobe kernel: *pde = 00000000
Feb  7 17:21:19 phobe kernel: Oops: 0002
Feb  7 17:21:19 phobe kernel: CPU:    0
Feb  7 17:21:19 phobe kernel: EIP:    0010:[ide-floppy:__insmod_ide-floppy_O/lib/modules/2.4.23-1-686/kernel/drive+-1772979/96]    Not tainted
Feb  7 17:21:19 phobe kernel: EFLAGS: 00010282
Feb  7 17:21:19 phobe kernel: eax: 00000000   ebx: 0001d0a2   ecx: df47dc00   edx: e0ca624c
Feb  7 17:21:19 phobe kernel: esi: 0000146b   edi: e0c170d4   ebp: 000003f4   esp: d56b1b88
Feb  7 17:21:19 phobe kernel: ds: 0018   es: 0018   ss: 0018
Feb  7 17:21:19 phobe kernel: Process rm (pid: 9729, stackpage=d56b1000)
Feb  7 17:21:19 phobe kernel: Stack: d56b1f1c d56b1f1c e851146b 00000000 e0855af0 df47dc00 e851146b e0c170d4
Feb  7 17:21:19 phobe kernel:        00005aa1 e08327ca d56b1f1c e851146b c763be68 000003f4 e08328c6 d56b1f1c
Feb  7 17:21:19 phobe kernel:        df47dc00 e851146b e851146b 00000065 e084d735 d56b1f1c e851146b cbf92b00
Feb  7 17:21:19 phobe kernel: Call Trace:    [ide-floppy:__insmod_ide-floppy_O/lib/modules/2.4.23-1-686/kernel/drive+-1754384/96] [ide-floppy:__insmod_ide-floppy_O/lib/modules/2.4.23-1-686/kernel/drive+-1898550/96] [ide-floppy:__insmod_ide-floppy_O/lib/modules/2.4.23-1-686/kernel/drive+-1898298/96] [ide-floppy:__insmod_ide-floppy_O/lib/modules/2.4.23-1-686/kernel/drive+-1788107/96] [__make_request+892/1936]
Feb  7 17:21:19 phobe kernel:   [ide-floppy:__insmod_ide-floppy_O/lib/modules/2.4.23-1-686/kernel/drive+-1541824/96] [ide-floppy:__insmod_ide-floppy_O/lib/modules/2.4.23-1-686/kernel/drive+-1541784/96] [ide-floppy:__insmod_ide-floppy_O/lib/modules/2.4.23-1-686/kernel/drive+-1786838/96] [ide-floppy:__insmod_ide-floppy_O/lib/modules/2.4.23-1-686/kernel/drive+-1784312/96] [ide-floppy:__insmod_ide-floppy_O/lib/modules/2.4.23-1-686/kernel/drive+-1782530/96] [ide-floppy:__insmod_ide-floppy_O/lib/modules/2.4.23-1-686/kernel/drive+-1785591/96]
Feb  7 17:21:19 phobe kernel:   [ide-floppy:__insmod_ide-floppy_O/lib/modules/2.4.23-1-686/kernel/drive+-1868477/96] [ide-floppy:__insmod_ide-floppy_O/lib/modules/2.4.23-1-686/kernel/drive+-1717594/96] [ide-floppy:__insmod_ide-floppy_O/lib/modules/2.4.23-1-686/kernel/drive+-1868624/96] [ide-floppy:__insmod_ide-floppy_O/lib/modules/2.4.23-1-686/kernel/drive+-1716640/96] [iput+304/640] [cached_lookup+27/112]
Feb  7 17:21:19 phobe kernel:   [vfs_unlink+242/448] [sys_unlink+186/288] [system_call+51/56]
Feb  7 17:21:19 phobe kernel:
Feb  7 17:21:19 phobe kernel: Code: 0f ab 30 8b 5c 24 04 31 c0 8b 74 24 08 8b 7c 24 0c 83 c4 10
...
Feb  7 17:26:23 phobe sm-mta[1274]: rejecting connections on daemon MTA: load average: 12
Feb  7 17:26:23 phobe sm-mta[1274]: rejecting connections on daemon MSA: load average: 12

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]


* Re: 2.4.23 && md raid1 && reiserfs panic
  2004-02-07 11:23 2.4.23 && md raid1 && reiserfs panic James Bromberger
@ 2004-02-08 17:22 ` Oleg Drokin
  2004-02-09  1:10   ` James Bromberger
  0 siblings, 1 reply; 4+ messages in thread
From: Oleg Drokin @ 2004-02-08 17:22 UTC
  To: linux-kernel, james

James Bromberger <james@rcpt.to> wrote:

JB> The symptoms: rm a file from a working RAID1 md reiserfs filesystem, 
JB> and I get a panic, rm(1) segfaults, and all further I/O to any interactive 
JB> shells stops. The entire system is rendered unusable; reboot (via 
JB> ctrl-alt-del) doesn't shut down and the only option is to hard reset the box.

What if you run reiserfsck over the volume that seems to be corrupted,
then fix the errors and then retry the operation?

Bye,
    Oleg


* Re: 2.4.23 && md raid1 && reiserfs panic
  2004-02-08 17:22 ` Oleg Drokin
@ 2004-02-09  1:10   ` James Bromberger
  2004-02-09  1:56     ` Oleg Drokin
  0 siblings, 1 reply; 4+ messages in thread
From: James Bromberger @ 2004-02-09  1:10 UTC
  To: Oleg Drokin; +Cc: linux-kernel


[-- Attachment #1.1: Type: text/plain, Size: 1031 bytes --]

Oleg Drokin (green@linuxhacker.ru) wrote:
> James Bromberger <james@rcpt.to> wrote:
> 
> JB> The symptoms: rm a file from a working RAID1 md reiserfs filesystem, 
> JB> and I get a panic, rm(1) segfaults, and all further I/O to any interactive 
> JB> shells stops. The entire system is rendered unusable; reboot (via 
> JB> ctrl-alt-del) doesn't shut down and the only option is to hard reset the box.
> 
> What if you run reiserfsck over the volume that seems to be corrupted,
> then fix the errors and then retry the operation?

Yes! That was it. Attached is the output I captured from reiserfsck. 
It identified the very file I was attempting to remove that was causing
the segfault in rm(1).

So I guess it is a reiserfs-specific issue that all disk I/O is killed
when this corruption happens. Hmm.

Thanks Oleg for your help.

  James

[-- Attachment #1.2: error-reiserfsck.txt --]
[-- Type: text/plain, Size: 2619 bytes --]

bad_indirect_item: block 1449156: The item [9 24 0x1458e2001 IND (1)] has the bad pointer (904) to the block (616563974)
bad_indirect_item: block 1449156: The item [9 24 0x1458e2001 IND (1)] has the bad pointer (905) to the block (2538674917)
bad_indirect_item: block 1449156: The item [9 24 0x1458e2001 IND (1)] has the bad pointer (906) to the block (1226131222)
bad_indirect_item: block 1449156: The item [9 24 0x1458e2001 IND (1)] has the bad pointer (907) to the block (2721249827)
bad_indirect_item: block 1449156: The item [9 24 0x1458e2001 IND (1)] has the bad pointer (908) to the block (15941101)
bad_indirect_item: block 1449156: The item [9 24 0x1458e2001 IND (1)] has the bad pointer (909) to the block (74275561)
bad_indirect_item: block 1449156: The item [9 24 0x1458e2001 IND (1)] has the bad pointer (910) to the block (3897627755)
bad_indirect_item: block 1449156: The item [9 24 0x1458e2001 IND (1)] has the bad pointer (911) to the block (2212879009)
bad_indirect_item: block 1449156: The item [9 24 0x1458e2001 IND (1)] has the bad pointer (912) to the block (1076949926)
bad_indirect_item: block 1361383: The item (76625 76647 0xfa001 IND (1), len 4048, location 48 entry count 0, fsck need 0, format new) has the bad pointer (995) to the block (6306403), which is in tree already
finished
Comparing bitmaps..vpf-10640: The on-disk and the correct bitmaps differs.
/share/2004-01-30-fileshare-backup.tbz
vpf-10670: The file [9 24] has the wrong size in the StatData (2295013376), should be (5466054656)
vpf-10680: The file [9 24] has the wrong block count in the StatData (0), should be (10675888)
finished
16 found corruptions can be fixed when running with --fix-fixable


Comparing bitmaps..vpf-10630: The on-disk and the correct bitmaps differs. Will be fixed later.
Checking Semantic tree:
/fileshare/Documents/pics/standup block pics/standing for block.tif
vpf-10680: The file [76625 76647] has the wrong block count in the StatData (12192) - correct
/share/2004-01-30-fileshare-backup.tbz
vpf-10670: The file [9 24] has the wrong size in the StatData (2295013376) - corrected to (5466054656)
vpf-10680: The file [9 24] has the wrong block count in the StatData (0) - corrected to (10675800)
finished
No corruptions found
There are on the filesystem:
        Leaves 24533
        Internal nodes 168
        Directories 2483
        Other files 62601
        Data block pointers 8767338 (143 of them are zero)
        Safe links 0
###########
reiserfsck finished at Mon Feb  9 08:55:00 2004
###########


[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
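
The link James noticed can also be read off the block numbers: the two blocks that reiserfs_free_block reported as out of range in the syslog excerpt (1076949926 and 2212879009) are the targets of bad pointers 912 and 911 in item [9 24], and reiserfsck names file [9 24] as /share/2004-01-30-fileshare-backup.tbz, the backup that rm was deleting. The sketch below is a hypothetical helper, not part of reiserfsprogs, that cross-references the two logs. It assumes the message formats shown in this thread, and it uses grep -o, which GNU and BSD grep support but POSIX does not require.

```shell
#!/bin/sh
# Hypothetical helper (not part of reiserfsprogs): cross-reference kernel
# reiserfs_free_block errors with reiserfsck bad-pointer reports.
# Assumes the message formats quoted in this thread.

# Block numbers the kernel refused to free (vs-4075 messages), one per line.
kernel_blocks() {
    grep -o 'reiserfs_free_block: block [0-9]* is out of range' "$1" |
        sed 's/.*block \([0-9]*\) is out of range/\1/' | LC_ALL=C sort -u
}

# Block numbers reiserfsck flagged as bad indirect-item pointers.
# grep -o is used because progress output can fuse several reports on one line.
fsck_blocks() {
    grep -o 'bad pointer ([0-9]*) to the block ([0-9]*)' "$1" |
        sed 's/.*(\([0-9]*\))$/\1/' | LC_ALL=C sort -u
}

# Print blocks that appear in both the kernel log ($1) and the fsck output ($2).
common_blocks() {
    fsck_list=$(fsck_blocks "$2")
    kernel_blocks "$1" | while read -r b; do
        printf '%s\n' "$fsck_list" | grep -qx "$b" && echo "$b"
    done
}
```

Run as `common_blocks /var/log/syslog fsck-output.txt`; with the logs in this thread it should print 1076949926 and 2212879009.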


* Re: 2.4.23 && md raid1 && reiserfs panic
  2004-02-09  1:10   ` James Bromberger
@ 2004-02-09  1:56     ` Oleg Drokin
  0 siblings, 0 replies; 4+ messages in thread
From: Oleg Drokin @ 2004-02-09  1:56 UTC
  To: James Bromberger; +Cc: linux-kernel

Hello!

On Mon, Feb 09, 2004 at 09:10:40AM +0800, James Bromberger wrote:
> > JB> The symptoms: rm a file from a working RAID1 md reiserfs filesystem, 
> > JB> and I get a panic, rm(1) segfaults, and all further I/O to any interactive 
> > JB> shells stops. The entire system is rendered unusable; reboot (via 
> > JB> ctrl-alt-del) doesn't shut down and the only option is to hard reset the box.
> > 
> > What if you run reiserfsck over the volume that seems to be corrupted,
> > then fix the errors and then retry the operation?
> Yes! That was it. Attached is the output I captured from reiserfsck. 
> It identified the very file I was attempting to remove that was causing
> the segfault in rm(1).
> So I guess it is a reiserfs-specific issue that all disk I/O is killed
> when this corruption happens. Hmm.

Yes, in-kernel reiserfs does not yet cope well with corrupted
filesystems. The source of this corruption is still unknown, though.

You can fix the corruption with reiserfsck --fix-fixable, or
reiserfsck --rebuild-tree if the first one does not work.
Be sure to use the latest reiserfsprogs.

Bye,
    Oleg
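
A minimal sketch of the repair order Oleg suggests (check, then --fix-fixable, then --rebuild-tree only if that is not enough), written as a POSIX shell script. Assumptions not stated in the thread: the device is /dev/md5 (md(9,5) in the kernel messages is md major 9, minor 5); the helper names run and repair_reiserfs and the DRY_RUN switch are hypothetical; and escalation relies on reiserfsck --check exiting non-zero while corruption remains, which should be confirmed against the reiserfsck man page of the installed reiserfsprogs. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Hypothetical helper: walks the repair order suggested in the thread.
# DRY_RUN=1 (the default) only prints the commands; DRY_RUN=0 executes them.
DRY_RUN=${DRY_RUN:-1}

# Print a command, then execute it unless in dry-run mode.
# Returns the command's exit status, or 0 in dry-run mode.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

repair_reiserfs() {
    dev=$1
    # reiserfsck must be run on an unmounted filesystem; stop if umount fails.
    run umount "$dev" || return 1
    # Read-only check first; it reports what --fix-fixable can repair.
    run reiserfsck --check "$dev"
    # Repair the errors the check classified as fixable.
    run reiserfsck --fix-fixable "$dev" || return 1
    # Assumption: --check exits non-zero while corruption remains.
    if ! run reiserfsck --check "$dev"; then
        # Last resort: rebuilds the whole tree, so back up the data first.
        run reiserfsck --rebuild-tree "$dev"
    fi
}
```

For example, `DRY_RUN=0 repair_reiserfs /dev/md5` would perform the steps for real; reiserfsck may ask for confirmation, so run it from a terminal.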

