* BUG in submit_ordered_buffer at fs/reiserfs/journal.c:616!
@ 2005-03-11 22:49 Linas Vepstas
2005-03-11 23:39 ` Hans Reiser
2005-04-13 19:11 ` Jeff Mahoney
0 siblings, 2 replies; 5+ messages in thread
From: Linas Vepstas @ 2005-03-11 22:49 UTC
To: reiserfs-dev; +Cc: reiserfs-list
Hi,
I've been experimenting with automatic bus error recovery in the
2.6.11 kernel. During one of my failed experiments, I tripped over
a Reiserfs bug, below. Basically, my error recovery failed, which
means a SCSI disk went permanently offline, which, admittedly,
is pretty catastrophic, but shouldn't be a kernel panic. It seems
that reiser hits a 'BUG_ON' in this case.
FWIW, in my limited experience with ext3 in the same exact situation,
it seems that ext3 handles this gracefully, returning -EIO to all
affected apps accessing the disk.
Unfortunately, I don't know how to tell you how to reproduce this :)
--linas
Here's dmesg leading up to the failure, and the stack traces are shown below.
<4>sym0:8:0: HOST RESET operation timed-out.
<6>scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 8 lun 0
<3>scsi0 (8:0): rejecting I/O to offline device
<3>scsi0 (8:0): rejecting I/O to offline device
<3>Buffer I/O error on device sda3, logical block 8210
<4>lost page write due to I/O error on sda3
<4>ReiserFS: sda3: warning: journal-837: IO error during journal replay
<2>REISERFS: abort (device sda3): Write error while updating journal header in flush_journal_list
<2>REISERFS: Aborting journal for filesystem on sda3
<3>scsi0 (8:0): rejecting I/O to offline device
<3>Buffer I/O error on device sda3, logical block 741
<4>lost page write due to I/O error on sda3
<3>Buffer I/O error on device sda3, logical block 742
<4>lost page write due to I/O error on sda3
<3>Buffer I/O error on device sda3, logical block 743
<4>lost page write due to I/O error on sda3
<3>Buffer I/O error on device sda3, logical block 744
<4>lost page write due to I/O error on sda3
<3>Buffer I/O error on device sda3, logical block 745
<4>lost page write due to I/O error on sda3
<3>Buffer I/O error on device sda3, logical block 746
<4>lost page write due to I/O error on sda3
<3>Buffer I/O error on device sda3, logical block 747
<4>lost page write due to I/O error on sda3
<3>Buffer I/O error on device sda3, logical block 748
<4>lost page write due to I/O error on sda3
<3>Buffer I/O error on device sda3, logical block 749
<4>lost page write due to I/O error on sda3
<3>scsi0 (8:0): rejecting I/O to offline device
<4>ReiserFS: sda3: warning: clm-6006: writing inode 346759 on readonly FS
<4>ReiserFS: sda3: warning: clm-6006: writing inode 346759 on readonly FS
<4>ReiserFS: sda3: warning: clm-6006: writing inode 346759 on readonly FS
<4>ReiserFS: sda3: warning: clm-6006: writing inode 346759 on readonly FS
<4>ReiserFS: sda3: warning: clm-6006: writing inode 346759 on readonly FS
<4>ReiserFS: sda3: warning: clm-6006: writing inode 346759 on readonly FS
<2>kernel BUG in submit_ordered_buffer at fs/reiserfs/journal.c:616!
<3>scsi0 (8:0): rejecting I/O to offline device
cpu 0x1: Vector: 700 (Program Check) at [c00000000fcef740]
pc: c000000000132ac8: .write_ordered_chunk+0xa4/0x100
lr: c000000000133274: .write_ordered_buffers+0x348/0x364
sp: c00000000fcef9c0
msr: 9000000000029032
current = 0xc00000000fea87b0
paca = 0xc00000000053b400
pid = 953, comm = reiserfs/1
kernel BUG in submit_ordered_buffer at fs/reiserfs/journal.c:616!
enter ? for help
1:mon> t
[c00000000fcefa60] c000000000133274 .write_ordered_buffers+0x348/0x364
[c00000000fcefc30] c000000000133af0 .flush_commit_list+0x80c/0x8cc
[c00000000fcefd10] c000000000138ac0 .flush_async_commits+0xf0/0xf4
[c00000000fcefdb0] c00000000006d2fc .worker_thread+0x258/0x32c
[c00000000fcefee0] c000000000073d80 .kthread+0x174/0x1c8
[c00000000fceff90] c000000000014240 .kernel_thread+0x4c/0x6c
1:mon>
1:mon> c
cpus stopped: 0-3
1:mon> c 0
0:mon> t
[c0000000004efdd0] c00000000000f948 .cpu_idle+0x3c/0x54
[c0000000004efe50] c00000000000c188 .rest_init+0x3c/0x58
[c0000000004efed0] c00000000049b7dc .start_kernel+0x27c/0x2fc
[c0000000004eff90] c00000000000c000 .__setup_cpu_power3+0x0/0x4
0:mon> c 2
2:mon> t
[c00000000424fe80] c00000000000f948 .cpu_idle+0x3c/0x54
[c00000000424ff00] c00000000003a878 .start_secondary+0x108/0x148
[c00000000424ff90] c00000000000bd28 .enable_64b_mode+0x0/0x28
2:mon> c 3
3:mon> t
[c000000004253e80] c00000000000f948 .cpu_idle+0x3c/0x54
[c000000004253f00] c00000000003a878 .start_secondary+0x108/0x148
[c000000004253f90] c00000000000bd28 .enable_64b_mode+0x0/0x28
* Re: BUG in submit_ordered_buffer at fs/reiserfs/journal.c:616!
2005-03-11 22:49 BUG in submit_ordered_buffer at fs/reiserfs/journal.c:616! Linas Vepstas
@ 2005-03-11 23:39 ` Hans Reiser
2005-03-12 0:20 ` Jeff Mahoney
2005-03-14 19:50 ` Chris Mason
2005-04-13 19:11 ` Jeff Mahoney
1 sibling, 2 replies; 5+ messages in thread
From: Hans Reiser @ 2005-03-11 23:39 UTC
To: Linas Vepstas
Cc: reiserfs-dev, reiserfs-list, Chris Mason, vitaly, Jeff Mahoney
Chris/Jeff, can you modify your code so that, whenever it sees an I/O error,
it says "I/O errors usually indicate bad hardware not bad software,
probably you need to get a new disk and use dd_rescue to copy everything
to it."?
Thanks,
Hans
* Re: BUG in submit_ordered_buffer at fs/reiserfs/journal.c:616!
2005-03-11 23:39 ` Hans Reiser
@ 2005-03-12 0:20 ` Jeff Mahoney
2005-03-14 19:50 ` Chris Mason
1 sibling, 0 replies; 5+ messages in thread
From: Jeff Mahoney @ 2005-03-12 0:20 UTC
To: Hans Reiser
Cc: Linas Vepstas, reiserfs-dev, reiserfs-list, Chris Mason, vitaly
Hans -
Most of the error messages make it pretty clear that the problems are
hardware related. The BUG() at the end *is* a software error; this is a
valid bug report. I'll take a look at it on Monday.
That said, I don't think that kernel error messages are the best place
for data recovery howtos. Even so, ReiserFS is not the only filesystem
affected by I/O errors; every disk filesystem is. ext3's failure
messages are very similar, and I haven't heard much confusion over them.
-Jeff
Hans Reiser wrote:
> Chris/Jeff, can you modify your code so that, whenever it sees an I/O error,
> it says "I/O errors usually indicate bad hardware not bad software,
> probably you need to get a new disk and use dd_rescue to copy everything
> to it."?
>
> Thanks,
>
> Hans
--
Jeff Mahoney
SuSE Labs
* Re: BUG in submit_ordered_buffer at fs/reiserfs/journal.c:616!
2005-03-11 23:39 ` Hans Reiser
2005-03-12 0:20 ` Jeff Mahoney
@ 2005-03-14 19:50 ` Chris Mason
1 sibling, 0 replies; 5+ messages in thread
From: Chris Mason @ 2005-03-14 19:50 UTC
To: Hans Reiser
Cc: Linas Vepstas, reiserfs-dev, reiserfs-list, vitaly, Jeff Mahoney
On Friday 11 March 2005 18:39, Hans Reiser wrote:
> "I/O errors usually indicate bad hardware not bad software,
> probably you need to get a new disk and use dd_rescue to copy everything
This is your user-friendly error message targeted at users who don't know
what an I/O error is?
What's an I/O error?
What's software?
What's hardware?
What's a disk?
What's dd_rescue?
How do I copy everything?
How do I put a new disk in?
How do I make the kernel recognize and use the new disk instead of the old one?
The list goes on and on. You'll never make the kernel more usable by making
messages in the syslog more verbose. You can make it more usable by having
consistent error messages that can be found via search engines or the manual.
Jeff's completely right here.
-chris
* Re: BUG in submit_ordered_buffer at fs/reiserfs/journal.c:616!
2005-03-11 22:49 BUG in submit_ordered_buffer at fs/reiserfs/journal.c:616! Linas Vepstas
2005-03-11 23:39 ` Hans Reiser
@ 2005-04-13 19:11 ` Jeff Mahoney
1 sibling, 0 replies; 5+ messages in thread
From: Jeff Mahoney @ 2005-04-13 19:11 UTC
To: Linas Vepstas; +Cc: reiserfs-dev, reiserfs-list
Linas Vepstas wrote:
>
> Hi,
>
> I've been experimenting with automatic bus error recovery in the
> 2.6.11 kernel. During one of my failed experiments, I tripped over
> a Reiserfs bug, below. Basically, my error recovery failed, which
> means a SCSI disk went permanently offline, which, admittedly,
> is pretty catastrophic, but shouldn't be a kernel panic. It seems
> that reiser hits a 'BUG_ON' in this case.
>
> FWIW, in my limited experience with ext3 in the same exact situation,
> it seems that ext3 handles this gracefully, returning -EIO to all
> affected apps accessing the disk.
>
> Unfortunately, I don't know how to tell you how to reproduce this :)
Hi Linas -
Finally getting a chance to look into this one a little bit more. What
were your test cases? I've seen this bug before, and in doing a quick
trace, it may be possible to hit this if you attempt to write to a file
with a hole and an I/O error occurs while flushing that buffer. When
map_block_for_writepage calls journal_end and it fails, it can still
call reiserfs_get_block for the hole even though the journal has been
aborted. If the buffer is !uptodate (due to the I/O error),
you'll hit that BUG.
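
The failure mode described above can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions: `model_journal`, `model_buffer`, `submit_strict`, `submit_guarded` and `MODEL_BUG` are hypothetical names invented for illustration, not ReiserFS code, and the guarded variant is one possible shape of a graceful path (returning -EIO, as ext3 is reported to do in this thread), not the actual fix.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for kernel state. These are NOT the real
 * ReiserFS structures; they only model the two flags discussed above. */
struct model_journal {
    bool aborted;   /* set once a journal write has failed */
};

struct model_buffer {
    bool uptodate;  /* cleared when I/O on the buffer fails */
    bool submitted; /* set when the buffer is queued for write-out */
};

/* Arbitrary marker meaning "the assertion would have fired here". */
enum { MODEL_BUG = -1000 };

/* Strict path: a !uptodate buffer is treated as a programming error.
 * In the kernel this would be a BUG(); the model returns a marker. */
static int submit_strict(struct model_buffer *bh)
{
    if (!bh->uptodate)
        return MODEL_BUG;
    bh->submitted = true;
    return 0;
}

/* Guarded path (an assumption, not the real fix): once the journal is
 * aborted or the buffer is stale, refuse the write and report -EIO. */
static int submit_guarded(const struct model_journal *j,
                          struct model_buffer *bh)
{
    if (j->aborted || !bh->uptodate)
        return -EIO;
    bh->submitted = true;
    return 0;
}
```

With `aborted` set and `uptodate` cleared, `submit_strict` returns the marker that stands in for the BUG, while `submit_guarded` returns `-EIO` and leaves the buffer unsubmitted.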
I'll continue to track this one down, but any more info you can provide
on your test environment would be helpful.
Thanks.
-Jeff
--
Jeff Mahoney
SuSE Labs