* Errors requiring --rebuild-tree in 2.4.23
@ 2003-12-11 13:51 Jens Benecke
2003-12-11 14:22 ` Chris Mason
2003-12-11 17:27 ` Vitaly Fertman
0 siblings, 2 replies; 9+ messages in thread
From: Jens Benecke @ 2003-12-11 13:51 UTC (permalink / raw)
To: reiserfs-list
Hi,
I posted earlier about quota problems. We updated to 2.4.23 because of the
logging patches, after some power failures made our /home partition spew
out these errors: (QUESTIONS at the end of the mail)
Dec 1 06:26:05 artus kernel: is_leaf: item location seems wrong (second
one): *3.6* [113366 113466 0x1 DIRECT], item_len 280, item_location 1216,
free_space(entry_count) 65535
Dec 1 06:26:05 artus kernel: vs-5150: search_by_key: invalid format found
in block 36020. Fsck?
Dec 1 06:26:05 artus kernel: vs-13070: reiserfs_read_inode2: i/o failure
occurred trying to find stat data of [113366 113469 0x0 SD]
Dec 1 06:26:05 artus kernel: is_leaf: item location seems wrong (second
one): *3.6* [113366 113466 0x1 DIRECT], item_len 280, item_location 1216,
free_space(entry_count) 65535
Dec 1 06:26:05 artus kernel: vs-5150: search_by_key: invalid format found
in block 36020. Fsck?
Dec 1 06:26:05 artus kernel: vs-13070: reiserfs_read_inode2: i/o failure
occurred trying to find stat data of [113366 113469 0x0 SD]
Dec 1 06:26:05 artus kernel: is_leaf: item location seems wrong (second
one): *3.6* [113366 113466 0x1 DIRECT], item_len 280, item_location 1216,
free_space(entry_count) 65535
Dec 1 06:26:05 artus kernel: vs-5150: search_by_key: invalid format found
in block 36020. Fsck?
Dec 1 06:26:05 artus kernel: vs-13070: reiserfs_read_inode2: i/o failure
occurred trying to find stat data of [113366 113469 0x0 SD]
Dec 1 06:26:05 artus kernel: is_leaf: item location seems wrong (second
one): *3.6* [113366 113466 0x1 DIRECT], item_len 280, item_location 1216,
free_space(entry_count) 65535
Dec 1 06:26:05 artus kernel: vs-5150: search_by_key: invalid format found
in block 36020. Fsck?
Dec 1 06:26:05 artus kernel: vs-13070: reiserfs_read_inode2: i/o failure
occurred trying to find stat data of [113366 113469 0x0 SD]
Dec 1 06:26:05 artus kernel: is_leaf: item location seems wrong (second
one): *3.6* [113366 113466 0x1 DIRECT], item_len 280, item_location 1216,
free_space(entry_count) 65535
Dec 1 06:26:05 artus kernel: vs-5150: search_by_key: invalid format found
in block 36020. Fsck?
Dec 1 06:26:05 artus kernel: vs-13070: reiserfs_read_inode2: i/o failure
occurred trying to find stat data of [113366 113469 0x0 SD]
We only ever got read errors from cron scripts working on /home, so we
thought that was the only partition affected. (And anyway, stuff in / and
/usr was never _written_ to, so it should have been OK.)
But shortly after upgrading, our server crashed totally, and after booting
Knoppix we found this in the syslog:
Dec 9 23:00:57 linux1 kernel: is_leaf: item location seems wrong (second
one): *3.6* [5 19 0x1 IND], item_len 28, item_location 2964,
free_space(entry_count) 0
Dec 9 23:00:57 linux1 kernel: ide0(3,1):vs-5150: search_by_key: invalid
format found in block 23346. Fsck?
Dec 9 23:00:57 linux1 kernel: ide0(3,1):vs-13050: reiserfs_update_sd: i/o
failure occurred trying to update [5 9 0x0 SD] stat data
Dec 9 23:00:57 linux1 kernel: is_leaf: item location seems wrong (second
one): *3.6* [5 19 0x1 IND], item_len 28, item_location 2964,
free_space(entry_count) 0
Dec 9 23:00:57 linux1 kernel: ide0(3,1):vs-5150: search_by_key: invalid
format found in block 23346. Fsck?
Dec 9 23:00:57 linux1 kernel: ide0(3,1):vs-13050: reiserfs_update_sd: i/o
failure occurred trying to update [5 7 0x0 SD] stat data
Dec 9 23:00:57 linux1 kernel: is_leaf: item location seems wrong (second
one): *3.6* [5 19 0x1 IND], item_len 28, item_location 2964,
free_space(entry_count) 0
Dec 9 23:00:57 linux1 kernel: ide0(3,1):vs-5150: search_by_key: invalid
format found in block 23346. Fsck?
Dec 9 23:00:57 linux1 kernel: is_leaf: item location seems wrong (second
one): *3.6* [5 19 0x1 IND], item_len 28, item_location 2964,
free_space(entry_count) 0
Dec 9 23:00:57 linux1 kernel: ide0(3,1):vs-5150: search_by_key: invalid
format found in block 23346. Fsck?
Dec 9 23:00:57 linux1 kernel: is_leaf: item location seems wrong (second
one): *3.6* [5 19 0x1 IND], item_len 28, item_location 2964,
free_space(entry_count) 0
Dec 9 23:00:57 linux1 kernel: ide0(3,1):vs-5150: search_by_key: invalid
format found in block 23346. Fsck?
Dec 9 23:00:57 linux1 kernel: is_leaf: item location seems wrong (second
one): *3.6* [5 19 0x1 IND], item_len 28, item_location 2964,
free_space(entry_count) 0
Dec 9 23:00:57 linux1 qmail: 1071007257.421057 delivery 498: deferral:
Aack,_child_crashed._(#4.3.0)/
Dec 9 23:00:57 linux1 kernel: ide0(3,1):vs-5150: search_by_key: invalid
format found in block 23346. Fsck?
and many more of these. They required a --rebuild-tree (done with 3.6.11
from a November 2003 Knoppix):
Pass 0:
####### Pass 0 #######
Loading on-disk bitmap .. ok, 312614 blocks marked used
Skipping 8241 blocks (super block, journal, bitmaps) 304373 blocks will be
read
0%.block 23346: The number of items (35) is incorrect, should be (19) -
corrected
block 23346: The free space (1540) is incorrect, should be (2520) -
corrected
left 286816, 1350 /sec
[1]+ Stopped reiserfsck --rebuild-tree /dev/hda1
root@1[mnt]# fg
reiserfsck --rebuild-tree /dev/hda1
...20%....40%....60%....80%...block 447056: The free space (2) is incorrect,
should be (4072) - corrected
.100% left 0, 3074 /sec
133527 directory entries were hashed with "r5" hash.
"r5" hash is selected
Flushing..finished
Read blocks (but not data blocks) 304373
Leaves among those 34465
- leaves all contents of which could not be saved
and deleted 1
Objectids found 133061
Pass 1 (will try to insert 34464 leaves):
####### Pass 1 #######
Looking for allocable blocks .. finished
0%....20%....40%....60%....80%....100% left 0, 1641
sec
Flushing..finished
34464 leaves read
34431 inserted
33 not inserted
####### Pass 2 #######
Pass 2:
0%....20%....40%....60%....80%....100% left 0, 66
sec
Flushing..finished
Leaves inserted item by item 33
Pass 3 (semantic):
####### Pass 3 #########
/bin/dfvpf-10680: The file [5 19] has the wrong block count in the StatData
(56) - corrected to (0) rebuild_semantic_pass:
The entry [5 21] ("ln") in directory [2 5] points to nowhere - is removed
rebuild_semantic_pass: The entry [5 22] ("ls") in directory [2 5] points to
nowhere - is removed
rebuild_semantic_pass: The entry [5 25] ("mv") in directory [2 5] points to
nowhere - is removed
rebuild_semantic_pass: The entry [5 26] ("rm") in directory [2 5] points to
nowhere - is removed
rebuild_semantic_pass: The entry [5 20] ("dir") in directory [2 5] points to
nowhere - is removed
rebuild_semantic_pass: The entry [5 23] ("mkdir") in directory [2 5] points
to nowhere - is removed
rebuild_semantic_pass: The entry [5 24] ("mknod") in directory [2 5] points
to nowhere - is removed
rebuild_semantic_pass: The entry [5 27] ("rmdir") in directory [2 5] points
to nowhere - is removed
vpf-10680: The directory [2 5] has the wrong block count in the StatData (5)
- corrected to (4)
vpf-10650: The directory [2 5] has the wrong size in the StatData (2152) -
corrected to (1960) /libFlushing..finished
in the StatData (2587122) -/usr/lib/nessus/plugins Files found:
115781
Directories found: 10270
Symlinks found: 2875
Others: 4120
Files with fixed size: 1
Names pointing to nowhere (removed): 8
Pass 3a (looking for lost dir/files):
####### Pass 3a (lost+found pass) #########
Looking for lost directories:
Flushing..finished8, 103 /sec
Pass 4 - finished done 33606, 99 /sec
Deleted unreachable items 1
Flushing..finished
Syncing..finished
###########
reiserfsck finished at Wed Dec 10 22:42:29 2003
###########
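For anyone wanting to pull the removed names out of a saved copy of output like the above, here is a minimal sketch in Python. It is not part of reiserfsck; the regular expression is an assumption based only on the message wording quoted in this mail, and the function name `removed_entries` is made up for illustration. It only lists what reiserfsck reported, so it cannot show anything that was removed silently.

```python
import re

# Matches messages such as:
#   The entry [5 21] ("ln") in directory [2 5] points to nowhere - is removed
# The wording is taken from the output quoted in this thread (an assumption
# for any other reiserfsck version).
REMOVED_ENTRY = re.compile(
    r'The entry \[(\d+) (\d+)\] \("([^"]+)"\) in directory \[(\d+) (\d+)\] '
    r'points to nowhere'
)


def removed_entries(fsck_output: str):
    """Return (name, entry_key, directory_key) for each reported removal."""
    # Collapse all whitespace so messages wrapped across lines still match.
    text = " ".join(fsck_output.split())
    return [
        (m.group(3),
         (int(m.group(1)), int(m.group(2))),
         (int(m.group(4)), int(m.group(5))))
        for m in REMOVED_ENTRY.finditer(text)
    ]


sample = '''rebuild_semantic_pass: The entry [5 22] ("ls") in directory [2 5] points to
nowhere - is removed
rebuild_semantic_pass: The entry [5 23] ("mkdir") in directory [2 5] points
to nowhere - is removed'''

for name, entry, directory in removed_entries(sample):
    print(name, entry, directory)
```

The names can then be compared against a package list or a backup to decide what to reinstall.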
--------------------------------------------------------------------------
QUESTIONS:
- Can I safely assume that the ONLY files damaged were the ones that
reiserfsck mentioned and deleted?
- Is it possible that the new reiserfs code introduced data loss bugs, or is
it more likely that it brought old errors in the FS to the surface that the
2.4.19 code never noticed? (We never had problems executing "ls" or "ln"
before moving to 2.4.23, but they were apparently among the files that were
severely damaged.)
- Is it possible to show the files/directories that are affected in the
syslog when errors occur? Otherwise it's always guesswork which files you
can still trust and which you can't. (And yes, we do have backups, i.e. the
other drbd machine, but only for /home ATM.)
- How can I prevent this from happening in the future? Is it possible to
detect this kind of error and automatically reboot, forcing a fsck at
reboot? Does that make sense?
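As a rough illustration of the "detect this kind of error" idea, the following Python sketch counts how often each block number shows up in the vs-5150 messages of a syslog excerpt. The pattern is an assumption derived from the log lines quoted in this mail, and `corrupt_blocks` is a hypothetical helper name. It only reports block numbers; mapping a block back to a file name is not something this sketch attempts.

```python
import re
from collections import Counter

# Wording copied from the "vs-5150" messages quoted in this thread; other
# kernel versions may phrase it differently (assumption).
BAD_BLOCK = re.compile(
    r"vs-5150: search_by_key: invalid format found in block (\d+)"
)


def corrupt_blocks(log_text: str) -> Counter:
    """Count vs-5150 reports per block number in a syslog excerpt."""
    # Collapse whitespace so messages wrapped across lines still match.
    text = " ".join(log_text.split())
    return Counter(int(block) for block in BAD_BLOCK.findall(text))


sample_log = """Dec 9 23:00:57 linux1 kernel: ide0(3,1):vs-5150: search_by_key: invalid
format found in block 23346. Fsck?
Dec 9 23:00:57 linux1 kernel: ide0(3,1):vs-5150: search_by_key: invalid
format found in block 23346. Fsck?
Dec 1 06:26:05 artus kernel: vs-5150: search_by_key: invalid format found
in block 36020. Fsck?"""

for block, hits in corrupt_blocks(sample_log).most_common():
    print(f"block {block}: {hits} report(s)")
```

A cron job could run something like this over the current syslog and send mail when the counter is non-empty, rather than rebooting automatically.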
--
Jens Benecke (jens at spamfreemail.de)
http://www.hitchhikers.de - Europe-wide free ride-sharing service
http://www.spamfreemail.de - 100% clean mailboxes - guaranteed!
http://www.rb-hosting.de - PHP from 9 EUR - SSH from 19 EUR - low-cost traffic
* Re: Errors requiring --rebuild-tree in 2.4.23
From: Chris Mason @ 2003-12-11 14:22 UTC (permalink / raw)
To: Jens Benecke; +Cc: reiserfs-list
On Thu, 2003-12-11 at 08:51, Jens Benecke wrote:
> Hi,
>
> I posted earlier about quota problems. We updated to 2.4.23 because of the
> logging patches, after some power failures made our /home partition spew
> out these errors: (QUESTIONS at the end of the mail)
Sorry, before we get to the questions, what was the order of the events
above?
-chris
* Re: Errors requiring --rebuild-tree in 2.4.23
From: Jens Benecke @ 2003-12-11 16:43 UTC (permalink / raw)
To: reiserfs-list
Chris Mason wrote:
> On Thu, 2003-12-11 at 08:51, Jens Benecke wrote:
>> Hi,
>>
>> I posted earlier about quota problems. We updated to 2.4.23 because of
>> the logging patches, after some power failures made our /home partition
>> spew out these errors: (QUESTIONS at the end of the mail)
>
> Sorry, before we get to the questions, what was the order of the events
> above?
Oops. I guess I was a bit too confused myself. :)
1. Errors on /home in syslog, cron jobs running wild with i/o failures.
   The system kept running for a couple of days, though, because nobody
   was there to fix it. Those errors were probably caused by power
   outages and a non-data-logging ReiserFS kernel.
2. Backed up what's left of /home to a FireWire hard disk.
3. Updated to 2.4.23 with Chris' patches for data logging/quota.
4. Repartitioned hda2..4 (was needed anyway for drbd),
   reformatted the new /home (drbd), restored /home on the drbd device.
5. Crash of the server overnight, reboot (don't know why yet).
6. Couldn't reboot because the root partition was totally b0rken.
7. reiserfsck --rebuild-tree under Knoppix, which killed a couple of files.
8. Still running Knoppix; the secondary server took over and is running now.
btw: Is there a "reiserfs stress test" kind of thing to make sure a
configuration works before sending it two time zones away for production? I
plan on doing that in the next couple weeks. =;)
Would bonnie++ accomplish this or are there better tests?
--
Jens Benecke (jens at spamfreemail.de)
* Re: Errors requiring --rebuild-tree in 2.4.23
From: Chris Mason @ 2003-12-11 18:24 UTC (permalink / raw)
To: Jens Benecke; +Cc: reiserfs-list
On Thu, 2003-12-11 at 11:43, Jens Benecke wrote:
> Chris Mason wrote:
>
> > On Thu, 2003-12-11 at 08:51, Jens Benecke wrote:
> >> Hi,
> >>
> >> I posted earlier about quota problems. We updated to 2.4.23 because of
> >> the logging patches, after some power failures made our /home partition
> >> spew out these errors: (QUESTIONS at the end of the mail)
> >
> > Sorry, before we get to the questions, what was the order of the events
> > above?
>
> Oops. I guess I was a bit too confused myself. :)
>
> 1. Errors on /home in syslog, cron jobs running wild with i/o failures
> system kept running for a couple days because nobody was there
> to fix it, though
> Those errors were probably caused by power outages and
> a non-data-logging ReiserFS kernel.
> 2. Backup what's left of /home to firewire harddisk.
> 3. Update to 2.4.23 with Chris' patches for data logging/quota
> 4. Repartition hda2..4 (was needed anyway for drbd),
> reformat new /home (drbd), restore /home on drbd device
> 5. crash of the server overnight, reboot (don't know why yet)
Ok, we need to better understand step 5 here.
> 6. couldn't reboot because root partition was totally b0rken
> 7. reiserfsck --rebuild-tree under Knoppix, killed a couple files
> 8. still running Knoppix, secondary server took over and is running now
>
> btw: Is there a "reiserfs stress test" kind of thing to make sure a
> configuration works before sending it two time zones away for production? I
> plan on doing that in the next couple weeks. =;)
> Would bonnie++ accomplish this or are there better tests?
The best test is whatever that environment is going to use in
production. I've got a ton of different scripts that get used based on
different situations; most are ugly hacks.
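As one possible starting point for such a script, here is a minimal write-then-verify sketch in Python. It is not one of the scripts mentioned above; the function names, file naming scheme, and sizes are all made up for illustration. It writes files of random data, records a SHA-256 digest for each, and later reports any file whose contents no longer match.

```python
import hashlib
import os
import tempfile


def write_files(root: str, count: int, size: int) -> dict:
    """Write `count` files of random data; return {path: sha256 hex digest}."""
    digests = {}
    for i in range(count):
        data = os.urandom(size)
        path = os.path.join(root, f"stress_{i:05d}.bin")
        with open(path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # ask the kernel to push the data to disk
        digests[path] = hashlib.sha256(data).hexdigest()
    return digests


def verify_files(digests: dict) -> list:
    """Return the paths that are unreadable or whose digest has changed."""
    bad = []
    for path, expected in digests.items():
        try:
            with open(path, "rb") as f:
                actual = hashlib.sha256(f.read()).hexdigest()
        except OSError:
            bad.append(path)
            continue
        if actual != expected:
            bad.append(path)
    return bad


if __name__ == "__main__":
    # Demo on a temporary directory; point `root` at the filesystem under
    # test for a real run.
    with tempfile.TemporaryDirectory() as root:
        recorded = write_files(root, count=20, size=4096)
        print("mismatches:", verify_files(recorded))
```

Verifying immediately after writing mostly exercises the page cache. To test what actually reached the disk, the digests would have to be saved elsewhere and the verify step run after a remount, reboot, or deliberate power cut.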
-chris
* Re: Errors requiring --rebuild-tree in 2.4.23
From: Hans Reiser @ 2003-12-11 19:20 UTC (permalink / raw)
To: Jens Benecke; +Cc: reiserfs-list, mason
Jens Benecke wrote:
>Chris Mason wrote:
>
>>On Thu, 2003-12-11 at 08:51, Jens Benecke wrote:
>>
>>>Hi,
>>>
>>>I posted earlier about quota problems. We updated to 2.4.23 because of
>>>the logging patches, after some power failures made our /home partition
>>>spew out these errors: (QUESTIONS at the end of the mail)
>>
>>Sorry, before we get to the questions, what was the order of the events
>>above?
>
>Oops. I guess I was a bit too confused myself. :)
>
>1. Errors on /home in syslog, cron jobs running wild with i/o failures
> system kept running for a couple days because nobody was there
> to fix it, though
> Those errors were probably caused by power outages and
> a non-data-logging ReiserFS kernel.
>
Errors like this are not due to a lack of data-logging:
Dec 1 06:26:05 artus kernel: is_leaf: item location seems wrong (second
one): *3.6* [113366 113466 0x1 DIRECT], item_len 280, item_location 1216,
free_space(entry_count) 65535
>2. Backup what's left of /home to firewire harddisk.
>3. Update to 2.4.23 with Chris' patches for data logging/quota
>4. Repartition hda2..4 (was needed anyway for drbd),
> reformat new /home (drbd), restore /home on drbd device
>5. crash of the server overnight, reboot (don't know why yet)
>6. couldn't reboot because root partition was totally b0rken
>7. reiserfsck --rebuild-tree under Knoppix, killed a couple files
>8. still running Knoppix, secondary server took over and is running now
>
>btw: Is there a "reiserfs stress test" kind of thing to make sure a
>configuration works before sending it two time zones away for production? I
>plan on doing that in the next couple weeks. =;)
>Would bonnie++ accomplish this or are there better tests?
--
Hans
* Re: Errors requiring --rebuild-tree in 2.4.23
From: Jens Benecke @ 2003-12-13 17:38 UTC (permalink / raw)
To: reiserfs-list
Hans Reiser wrote:
> errors like this are not due to a lack of data-logging.
>
> Dec 1 06:26:05 artus kernel: is_leaf: item location seems wrong (second
> one): *3.6* [113366 113466 0x1 DIRECT], item_len 280, item_location 1216,
> free_space(entry_count) 65535
Hello Hans,
Can you imagine a hardware failure causing these errors that does not show
up in the logs? Usually IDE disk failures are accompanied by IDE layer
errors, unless the disk is totally down (which it isn't in this case).
--
Jens Benecke (jens at spamfreemail.de)
* Re: Errors requiring --rebuild-tree in 2.4.23
From: Hans Reiser @ 2003-12-14 12:05 UTC (permalink / raw)
To: Jens Benecke; +Cc: reiserfs-list, mason
Jens Benecke wrote:
>Hans Reiser wrote:
>
>>errors like this are not due to a lack of data-logging.
>>
>>Dec 1 06:26:05 artus kernel: is_leaf: item location seems wrong (second
>>one): *3.6* [113366 113466 0x1 DIRECT], item_len 280, item_location 1216,
>>free_space(entry_count) 65535
>
>Hello Hans,
>
>can you imagine a hardware failure causing these errors that does not show
>up in the logs? Usually IDE disk failures are accompanied with IDE layer
>errors, unless the disk is totally down (which it isn't in this case).
>
In theory it could be bad memory or a bad CPU. A software bug seems far
more likely. I will let Chris try to debug this; if he fails to get to
the bottom of it, let me know.
--
Hans
* Re: Errors requiring --rebuild-tree in 2.4.23
From: Jens Benecke @ 2003-12-11 16:45 UTC (permalink / raw)
To: reiserfs-list
Chris Mason wrote:
> On Thu, 2003-12-11 at 08:51, Jens Benecke wrote:
>> Hi,
>>
>> I posted earlier about quota problems. We updated to 2.4.23 because of
>> the logging patches, after some power failures made our /home partition
>> spew out these errors: (QUESTIONS at the end of the mail)
>
> Sorry, before we get to the questions, what was the order of the events
> above?
Oh yes, the quota problems were on a different machine. I never tried the
kernel that caused the quota problems on any machine other than the test
servers at home that are about to be sent away.
--
Jens Benecke (jens at spamfreemail.de)
* Re: Errors requiring --rebuild-tree in 2.4.23
From: Vitaly Fertman @ 2003-12-11 17:27 UTC (permalink / raw)
To: Jens Benecke, reiserfs-list; +Cc: vs
On Thursday 11 December 2003 16:51, Jens Benecke wrote:
> Hi,
>
> I posted earlier about quota problems. We updated to 2.4.23 because of the
> logging patches, after some power failures made our /home partition spew
> out these errors: (QUESTIONS at the end of the mail)
>
>
> Dec 1 06:26:05 artus kernel: is_leaf: item location seems wrong (second
> one): *3.6* [113366 113466 0x1 DIRECT], item_len 280, item_location 1216,
> free_space(entry_count) 65535
> Dec 1 06:26:05 artus kernel: vs-5150: search_by_key: invalid format found
> in block 36020. Fsck?
> Dec 1 06:26:05 artus kernel: vs-13070: reiserfs_read_inode2: i/o failure
> occurred trying to find stat data of [113366 113469 0x0 SD]
> Dec 1 06:26:05 artus kernel: is_leaf: item location seems wrong (second
> one): *3.6* [113366 113466 0x1 DIRECT], item_len 280, item_location 1216,
> free_space(entry_count) 65535
> Dec 1 06:26:05 artus kernel: vs-5150: search_by_key: invalid format found
> in block 36020. Fsck?
> Dec 1 06:26:05 artus kernel: vs-13070: reiserfs_read_inode2: i/o failure
> occurred trying to find stat data of [113366 113469 0x0 SD]
> Dec 1 06:26:05 artus kernel: is_leaf: item location seems wrong (second
> one): *3.6* [113366 113466 0x1 DIRECT], item_len 280, item_location 1216,
> free_space(entry_count) 65535
> Dec 1 06:26:05 artus kernel: vs-5150: search_by_key: invalid format found
> in block 36020. Fsck?
> Dec 1 06:26:05 artus kernel: vs-13070: reiserfs_read_inode2: i/o failure
> occurred trying to find stat data of [113366 113469 0x0 SD]
> Dec 1 06:26:05 artus kernel: is_leaf: item location seems wrong (second
> one): *3.6* [113366 113466 0x1 DIRECT], item_len 280, item_location 1216,
> free_space(entry_count) 65535
> Dec 1 06:26:05 artus kernel: vs-5150: search_by_key: invalid format found
> in block 36020. Fsck?
> Dec 1 06:26:05 artus kernel: vs-13070: reiserfs_read_inode2: i/o failure
> occurred trying to find stat data of [113366 113469 0x0 SD]
> Dec 1 06:26:05 artus kernel: is_leaf: item location seems wrong (second
> one): *3.6* [113366 113466 0x1 DIRECT], item_len 280, item_location 1216,
> free_space(entry_count) 65535
> Dec 1 06:26:05 artus kernel: vs-5150: search_by_key: invalid format found
> in block 36020. Fsck?
> Dec 1 06:26:05 artus kernel: vs-13070: reiserfs_read_inode2: i/o failure
> occurred trying to find stat data of [113366 113469 0x0 SD]
>
>
>
> We only ever got read errors from cron scripts working on /home so we
> thought that was the only partition affected. (And anyway, stuff in / and
> usr was never _written_ to so it should have been OK).
>
> But shortly after upgrading our server crashed totally and booting knoppix
> we found this in the syslog:
>
>
>
> Dec 9 23:00:57 linux1 kernel: is_leaf: item location seems wrong (second
> one): *3.6* [5 19 0x1 IND], item_len 28, item_location 2964,
> free_space(entry_count) 0
> Dec 9 23:00:57 linux1 kernel: ide0(3,1):vs-5150: search_by_key: invalid
> format found in block 23346. Fsck?
> Dec 9 23:00:57 linux1 kernel: ide0(3,1):vs-13050: reiserfs_update_sd: i/o
> failure occurred trying to update [5 9 0x0 SD] stat data
> Dec 9 23:00:57 linux1 kernel: is_leaf: item location seems wrong (second
> one): *3.6* [5 19 0x1 IND], item_len 28, item_location 2964,
> free_space(entry_count) 0
> Dec 9 23:00:57 linux1 kernel: ide0(3,1):vs-5150: search_by_key: invalid
> format found in block 23346. Fsck?
> Dec 9 23:00:57 linux1 kernel: ide0(3,1):vs-13050: reiserfs_update_sd: i/o
> failure occurred trying to update [5 7 0x0 SD] stat data
> Dec 9 23:00:57 linux1 kernel: is_leaf: item location seems wrong (second
> one): *3.6* [5 19 0x1 IND], item_len 28, item_location 2964,
> free_space(entry_count) 0
> Dec 9 23:00:57 linux1 kernel: ide0(3,1):vs-5150: search_by_key: invalid
> format found in block 23346. Fsck?
> Dec 9 23:00:57 linux1 kernel: is_leaf: item location seems wrong (second
> one): *3.6* [5 19 0x1 IND], item_len 28, item_location 2964,
> free_space(entry_count) 0
> Dec 9 23:00:57 linux1 kernel: ide0(3,1):vs-5150: search_by_key: invalid
> format found in block 23346. Fsck?
> Dec 9 23:00:57 linux1 kernel: is_leaf: item location seems wrong (second
> one): *3.6* [5 19 0x1 IND], item_len 28, item_location 2964,
> free_space(entry_count) 0
> Dec 9 23:00:57 linux1 kernel: ide0(3,1):vs-5150: search_by_key: invalid
> format found in block 23346. Fsck?
> Dec 9 23:00:57 linux1 kernel: is_leaf: item location seems wrong (second
> one): *3.6* [5 19 0x1 IND], item_len 28, item_location 2964,
> free_space(entry_count) 0
> Dec 9 23:00:57 linux1 qmail: 1071007257.421057 delivery 498: deferral:
> Aack,_child_crashed._(#4.3.0)/
> Dec 9 23:00:57 linux1 kernel: ide0(3,1):vs-5150: search_by_key: invalid
> format found in block 23346. Fsck?
>
>
> and much more of these. These required a --rebuild-tree (done with 3.6.11
> from Knoppix from November 2003):
>
>
> Pass 0:
> ####### Pass 0 #######
> Loading on-disk bitmap .. ok, 312614 blocks marked used
> Skipping 8241 blocks (super block, journal, bitmaps) 304373 blocks will be
> read
> 0%.block 23346: The number of items (35) is incorrect, should be (19) -
> corrected
> block 23346: The free space (1540) is incorrect, should be (2520) -
> corrected
> left 286816, 1350
> /sec [1]+ Stopped reiserfsck --rebuild-tree /dev/hda1
> root@1[mnt]# fg
> reiserfsck --rebuild-tree /dev/hda1
> ...20%....40%....60%....80%...block 447056: The free space (2) is
> incorrect, should be (4072) - corrected
> .100% left 0, 3074 /sec
> 133527 directory entries were hashed with "r5" hash.
> "r5" hash is selected
> Flushing..finished
> Read blocks (but not data blocks) 304373
> Leaves among those 34465
> - leaves all contents of which could not be saved
> and deleted 1
> Objectids found 133061
>
> Pass 1 (will try to insert 34464 leaves):
> ####### Pass 1 #######
> Looking for allocable blocks .. finished
> 0%....20%....40%....60%....80%....100% left 0, 1641
> sec
> Flushing..finished
> 34464 leaves read
> 34431 inserted
> 33 not inserted
> ####### Pass 2 #######
>
> Pass 2:
> 0%....20%....40%....60%....80%....100% left 0, 66
> sec
> Flushing..finished
> Leaves inserted item by item 33
> Pass 3 (semantic):
> ####### Pass 3 #########
> /bin/dfvpf-10680: The file [5 19] has the wrong block count in the StatData
> (56) - corrected to (0) rebuild_semantic_pass:
> The entry [5 21] ("ln") in directory [2 5] points to nowhere - is removed
> rebuild_semantic_pass: The entry [5 22] ("ls") in directory [2 5] points to
> nowhere - is removed
> rebuild_semantic_pass: The entry [5 25] ("mv") in directory [2 5] points to
> nowhere - is removed
> rebuild_semantic_pass: The entry [5 26] ("rm") in directory [2 5] points to
> nowhere - is removed
> rebuild_semantic_pass: The entry [5 20] ("dir") in directory [2 5] points
> to nowhere - is removed
> rebuild_semantic_pass: The entry [5 23] ("mkdir") in directory [2 5] points
> to nowhere - is removed
> rebuild_semantic_pass: The entry [5 24] ("mknod") in directory [2 5] points
> to nowhere - is removed
> rebuild_semantic_pass: The entry [5 27] ("rmdir") in directory [2 5] points
> to nowhere - is removed
> vpf-10680: The directory [2 5] has the wrong block count in the StatData
> (5) - corrected to (4)
> vpf-10650: The directory [2 5] has the wrong size in the StatData (2152) -
> corrected to (1960) /libFlushing..finished
> in the StatData (2587122) -/usr/lib/nessus/plugins Files found:
> 115781
> Directories found: 10270
> Symlinks found: 2875
> Others: 4120
> Files with fixed size: 1
> Names pointing to nowhere (removed): 8
> Pass 3a (looking for lost dir/files):
> ####### Pass 3a (lost+found pass) #########
> Looking for lost directories:
> Flushing..finished8, 103 /sec
> Pass 4 - finished done 33606, 99 /sec
> Deleted unreachable items 1
> Flushing..finished
> Syncing..finished
> ###########
> reiserfsck finished at Wed Dec 10 22:42:29 2003
> ###########
>
> --------------------------------------------------------------------------
>
> QUESTIONS:
>
> - can I safely assume that the ONLY files damaged were the ones that
> reiserfsck mentioned and deleted?
Almost. reiserfsck tries to report accurately what it does, but it
might silently remove something if it considers it unrecoverable.
>
> - is it possible that the new reiserfs code introduced data loss bugs, or
> is it more likely that it brought old errors in the FS to the surface that
> the 2.4.19 code never noticed? (We never had problems executing "ls" or
> "ln" before moving to 2.4.23, but it was apparently one of the files that
> was severely damaged).
>
Yes, it is possible.
> - is it possible to show the files/directories that are affected in the
> syslog when errors occur? Otherwise it's always guesswork which files you
> can still trust and which you can't. (And yes, we do have backups, i.e. the
> other drbd machine, but only for /home ATM).
>
Yes, I am working on improving those warnings.
> - How can I prevent this from happening in the future?
Am I correct that you got a corrupted filesystem as a result of the following:
install 2.4.23 with the data logging patches,
power failure under heavy load
?
> Is it possible to
> detect this kind of errors and automatically reboot, forcing a fsck at
> reboot? Does that make sense?
I think not. Returning -EIO when metadata corruption is encountered (currently
reiserfs returns -EACCES) seems to be more appropriate.
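To illustrate why the error code matters to userspace, here is a small Python sketch. It makes no claim about what any particular reiserfs version returns beyond what is stated above; `classify` and `probe` are hypothetical helper names. With EIO a monitoring script can tell "the storage or filesystem is in trouble" apart from an ordinary permission problem, which is not possible when both arrive as EACCES.

```python
import errno
import os


def classify(exc: OSError) -> str:
    """Map an OSError to a coarse category a monitoring script could act on."""
    if exc.errno == errno.EIO:
        return "io-error"           # worth alerting on: possible corruption
    if exc.errno == errno.EACCES:
        return "permission-denied"  # normally just a permissions issue
    return "other"


def probe(path: str) -> str:
    """stat() a path and report 'ok' or the error category."""
    try:
        os.stat(path)
    except OSError as exc:
        return classify(exc)
    return "ok"


if __name__ == "__main__":
    print(probe("."))
    print(classify(OSError(errno.EIO, "Input/output error")))
```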
--
Thanks,
Vitaly Fertman