All of lore.kernel.org
 help / color / mirror / Atom feed
* Bug report: reiserfsck --rebuild-tree not progressing
@ 2006-04-11  0:34 Tyler Phelps
  2006-04-11  7:21 ` Sander
  2006-04-12 15:35 ` Ard van Breemen
  0 siblings, 2 replies; 10+ messages in thread
From: Tyler Phelps @ 2006-04-11  0:34 UTC (permalink / raw)
  To: reiserfs-list; +Cc: System Administrators

Two sentence summary:  reiserfsck --rebuild-tree on a 1.4TB reiserfs 
filesystem seems to be spinning its wheels.  The reiserfsck has been 
running for a little more than 56 hours, the progress meter hasn't 
changed in over 24 hours, very little I/O activity is happening, and 
the reiserfsck process is utilizing 100% of one of the two CPUs in the 
system.

I'll provide system information first, then explain the problem in 
detail.

Hardware:
4U rackmount system with 16gb of memory and dual opteron cpus running 
in a controlled environment datacenter.  The partition in question sits 
on twelve 250gb drives (Western Digital Caviar Raid Edition 7200rpm 
SATA.. model number WD2500SD).  It is the only partition that resides 
on these drives.  The partition contains postgres data and nothing 
else.. the rest of the system lives on a separate set of disks.  The 
raid array is half hardware and half software.  The twelve disks are 
formed into six pairs.. each pair is formed into a raid 1 mirror.  This 
mirroring is done in hardware on two different 8-port 3ware 9500S 
controller cards.  EVMS is used to provide a software raid 0 stripe 
array that spans the six mirrored pairs.

Distro and kernel:
The system is running Debian using the sarge (stable) codeline, version 
1:3.3.4-6sarge1.  The kernel is self-compiled version 2.6.13.1.  No 
reiserfs-specific patches have been applied to the kernel source.  The 
kernel configuration can be provided as needed.. for now, the following 
are the lines from the config that contain "REISER":

CONFIG_REISERFS_FS=y
# CONFIG_REISERFS_CHECK is not set
CONFIG_REISERFS_PROC_INFO=y
# CONFIG_REISERFS_FS_XATTR is not set

Version:
I'm using the reiserfsprogs package from the sarge codeline.  The 
following is metadata about said package:

Package: reiserfsprogs
Status: install ok installed
Priority: optional
Section: admin
Installed-Size: 1079
Maintainer: Ed Boraas <ed@debian.org>
Architecture: i386
Version: 1:3.6.17-2
Depends: libc6 (>= 2.3.2.ds1-4), libuuid1
Description: User-level tools for ReiserFS filesystems
  This package contains utilities to create, check, resize, and debug 
ReiserFS
  filesystems.
  .
  NOTE: Releases of Linux prior to 2.4.1 do not support ReiserFS on 
their own.
  Thus, these tools will only be useful with Linux 2.4.1 or later, or if 
your
  kernel has been built with the ReiserFS patch applied. This patch can 
be found
  in the appropriate kernel-patch-<version>-reiserfs packages.
  .
  Homepage: http://www.namesys.com/

The following is mount information about the filesystem, taken from 
dmesg:

ReiserFS: dm-12: found reiserfs format "3.6" with standard journal
ReiserFS: dm-12: using ordered data mode
ReiserFS: dm-12: journal params: device dm-12, size 8192, journal first 
block 18, max trans len 1024, max batch 900, max commit age 30, max 
trans age 30
ReiserFS: dm-12: checking transaction log (dm-12)
ReiserFS: dm-12: Using r5 hash to sort names

The following are syslog messages from the kernel.  The filesystem 
trouble began at about 2:50am which matches the following log entries:

Apr  6 02:50:03 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 02:50:03 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 02:50:03 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 02:50:03 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 02:50:03 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 02:50:03 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 02:50:03 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 02:50:03 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 02:50:03 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 02:50:04 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 02:50:04 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 02:50:04 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 02:50:04 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 02:50:04 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 02:50:04 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 02:50:04 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 02:50:04 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 02:50:05 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 02:50:05 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 02:50:05 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 02:50:05 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 02:50:05 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 02:50:05 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 02:50:05 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 02:50:05 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 02:50:06 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 02:50:06 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not ounos
Apr  6 02:50:06 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 02:50:06 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 02:50:06 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 02:50:06 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 07:00:05 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 07:00:05 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 07:00:06 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 07:00:08 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 07:00:09 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 07:00:09 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 07:00:10 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 07:00:10 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 07:00:11 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 07:00:12 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 07:00:12 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 07:00:12 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 07:00:13 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 07:00:14 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 07:00:17 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 07:00:18 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 07:00:18 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 07:00:18 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 07:00:21 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 07:00:21 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 07:00:22 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 07:00:22 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 07:00:22 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 07:00:23 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 07:00:23 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 07:00:24 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 07:00:25 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 07:00:26 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 07:00:26 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 07:00:26 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 07:00:27 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 07:00:27 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 07:00:27 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 07:00:28 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 07:00:28 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 07:00:29 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)
Apr  6 07:00:29 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
not found (pos 1)

So here's what happened..

At around 2:50am on Apr 6 a standard sql query to insert something into 
the database hung.  From this point on, the process consumed 100% of 
one of the two cpus.  Any attempt to strace the process would cause 
strace to immediately hang.  All attempts to kill the process failed.  
The system needed to be up for business reasons so it was decided that 
the fastest way to recover postgres would be to reboot the system.  The 
system shutdown without any trouble.  At boot, postgres failed to 
start.. the following are log messages from that event:

2006-04-06 11:59:33.410 PDT PANIC:  could not count blocks of relation 
1663/16815/17215: Permission denied
2006-04-06 11:59:33.413 PDT LOG:  startup process (PID 3898) was 
terminated by signal 6
2006-04-06 11:59:33.413 PDT LOG:  aborting startup due to startup 
process failure
2006-04-06 11:59:33.419 PDT LOG:  logger shutting down

We began to look at the data files in postgres' data directory.  We 
found two files that responded with a "permission denied" message for 
any attempt to browse or manipulate them (using programs such as ls, 
stat, cp, etc.)  After browsing forums and newsgroups we concluded that 
this behavior may be indicative of a filesystem problem.  An example of 
one of the files exhibiting the behavior:

find: ./data/base/16815/17215.15: Permission denied

The partition was unmounted from the system.  We started an fsck using 
the following command:

reiserfsck --check --logfile /root/postgres-gwar-20060407.log 
/dev/evms/postgres

Unfortunately, this fsck was cutoff by a dropped ssh connection and the 
fsck wasn't running within screen (oops).  Based on the progress 
indicator, it seems that the check was about 2/3 of the way through the 
process.  The following is the contents of the logfile:

tyler@gwar ~> sudo cat /root/postgres-fsck-20060407.log
bad_path: The right delimiting key [124 982 0x3c001 IND (1)] of the 
node (227226375) must be greater than the last (6) element's key [124 
984 0x1 IND (1)] within the node.
vpf-10630: The on-disk and the correct bitmaps differs. Will be fixed 
later.

A second fsck was started (within screen) using the following command:

reiserfsck --fix-fixable /dev/evms/postgres

Unfortunately I don't have the verbatim output of this command.  The 
command completed successfully and found one problem which could only 
be fixed by running the fsck with --rebuild-tree.

We made a backup of the data minus the two files with permission denied 
problems to a different system.  We used this backup to start a 
parallel recovery effort to restore postgres on a different system 
using all of the data except the two broken files which couldn't be 
backed up.

A third reiserfsck was started using the following command:

reiserfsck --rebuild-tree /dev/evms/postgres

The following is the output from that command thus far:

Will rebuild the filesystem (/dev/evms/postgres) tree
Will put log info to 'stdout'

Do you want to run this program?[N/Yes] (note need to type Yes if you 
do):Yes
Replaying journal..
Reiserfs journal '/dev/evms/postgres' in blocks [18..8211]: 0 
transactions replayed
###########
reiserfsck --rebuild-tree started at Sat Apr  8 07:57:05 2006
###########

Pass 0:
####### Pass 0 #######
Loading on-disk bitmap .. ok, 264008904 blocks marked used
Skipping 19386 blocks (super block, journal, bitmaps) 263989518 blocks 
will be read
0%....block 162165170: The free space (6) is incorrect, should be 
(4072) - corrected
block 162165260: The free space (3) is incorrect, should be (4072) - 
corrected
block 162165340: The free space (3) is incorrect, should be (4072) - 
corrected
block 225738748: The number of items (5055) is incorrect, should be (1) 
- corrected
block 225738748: The free space (14) is incorrect, should be (13) - 
corrected
pass0: vpf-10110: block 225738748, item (0): Unknown item type found 
[1310733 1188392 0x1a7445900000000 ??? (4)] - deleted
                                                            left 0, 
20739 /sec
1424 directory entries were hashed with "r5" hash.
         "r5" hash is selected
Flushing..finished
         Read blocks (but not data blocks) 263989518
                 Leaves among those 95376536
                         - leaves all contents of which could not be 
saved and deleted 4
                 Objectids found 1426

Pass 1 (will try to insert 95376532 leaves):
####### Pass 1 #######
Looking for allocable blocks .. finished
0%....20%....40%.                                                       
    left 0, 3581 /seccc
Flushing..finished
         95376532 leaves read
                 95376530 inserted
                 2 not inserted
####### Pass 2 #######

Pass 2:
0%....20%....40%....60%...                                       left 
1, 0 /sec

The process has been "stuck" at this progress for more than 24 hours.

The reiserfsck process seems to be using 100% of one of the CPUs but 
very little I/O activity is happening.  The following is output about 
system resources from various commands:

tyler@gwar ~> vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- 
----cpu----
  r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us 
sy id wa
  1  0      0 15491440 700792  31092    0    0    51     0    2    20 29 
  6 62  3
  1  0      0 15491068 701080  31076    0    0   128    72  272    32 50 
  0 50  0
  1  0      0 15490820 701328  31100    0    0   128     0  257    26 50 
  0 50  0
  1  0      0 15490696 701536  31096    0    0   128     0  260    24 50 
  0 50  0
  1  0      0 15491564 700620  31196    0    0   128     0  262    27 50 
  0 50  0
  1  0      0 15491440 700664  31152    0    0     0     0  257    23 50 
  0 50  0
  1  0      0 15491192 700900  31188    0    0    72    76  275    57 50 
  0 50  0

tyler@gwar ~> ps aux | grep reiserfsck
root      9835 89.9  1.0 183296 180200 pts/1 R+   Apr08 2933:14 
reiserfsck --rebuild-tree /dev/evms/postgres

tyler@gwar ~> top
top - 14:17:26 up 4 days,  2:20,  1 user,  load average: 1.00, 1.00, 
1.00
Tasks: 159 total,   2 running, 157 sleeping,   0 stopped,   0 zombie
  Cpu0 : 100.0% us,  0.0% sy,  0.0% ni,  0.0% id,  0.0% wa,  0.0% hi,  
0.0% si
  Cpu1 :  0.0% us,  0.0% sy,  0.0% ni, 100.0% id,  0.0% wa,  0.0% hi,  
0.0% si
Mem:  16508948k total,  1018516k used, 15490432k free,   701212k buffers
Swap:  8388596k total,        0k used,  8388596k free,    31216k cached
   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  9835 root      25   0  179m 175m  688 R 99.7  1.1   2933:53 reiserfsck
     1 root      15   0  1588  516  452 S  0.0  0.0   0:00.75 init

tyler@gwar ~> free
              total       used       free     shared    buffers     
cached
Mem:      16508948    1042440   15466508          0     702812      
55116
-/+ buffers/cache:     284512   16224436
Swap:      8388596          0    8388596

I/O activity according to iostat.  The /dev/evms/postgres device lives 
on the six devices sdb through sdg.

tyler@gwar ~> iostat 1 -k
Linux 2.6.13.1-20050914 (gwar.savagebeast.com)  04/10/06
... output truncated ...
avg-cpu:  %user   %nice    %sys %iowait   %idle
           50.00    0.00    0.00    0.00   50.00

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               0.00         0.00         0.00          0          0
sdb               1.00        24.00         0.00         24          0
sdc               0.00         0.00         0.00          0          0
sdd               1.00         8.00         0.00          8          0
sde               1.00        32.00         0.00         32          0
sdf               1.00        32.00         0.00         32          0
sdg               1.00        32.00         0.00         32          0
sdh               0.00         0.00         0.00          0          0

I'm not convinced that the process is hung because an strace of the 
process reveals that it's actively making system calls.  It crossed my 
mind that the fsck may be looping.. this is supported by the fact that 
CPU utilization is high while I/O is minimal.  However, for what it's 
worth, in a test of 65,000 lines of strace output there wasn't a single 
duplicate line... if it is looping, it's not making identical calls 
over and over.

The following is 100 lines of an strace of the process.

_llseek(3, 242647523328, [242647523328], SEEK_SET) = 0
write(3, "\1\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0|\0\0\0"..., 
4096) = 4096
_llseek(3, 242647527424, [242647527424], SEEK_SET) = 0
write(3, "\1\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0|\0\0\0"..., 
4096) = 4096
_llseek(3, 242647531520, [242647531520], SEEK_SET) = 0
write(3, "\1\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0|\0\0\0"..., 
4096) = 4096
_llseek(3, 242647535616, [242647535616], SEEK_SET) = 0
write(3, "\1\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0|\0\0\0"..., 
4096) = 4096
_llseek(3, 242647539712, [242647539712], SEEK_SET) = 0
write(3, "\1\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0|\0\0\0"..., 
4096) = 4096
_llseek(3, 242647474176, [242647474176], SEEK_SET) = 0
write(3, "\1\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0|\0\0\0"..., 
4096) = 4096
_llseek(3, 242647543808, [242647543808], SEEK_SET) = 0
write(3, "\1\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0|\0\0\0"..., 
4096) = 4096
_llseek(3, 242647547904, [242647547904], SEEK_SET) = 0
write(3, "\1\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0|\0\0\0"..., 
4096) = 4096
_llseek(3, 242647552000, [242647552000], SEEK_SET) = 0
write(3, "\1\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0|\0\0\0"..., 
4096) = 4096
_llseek(3, 242647560192, [242647560192], SEEK_SET) = 0
write(3, "\1\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0|\0\0\0"..., 
4096) = 4096
_llseek(3, 1169951559680, [1169951559680], SEEK_SET) = 0
read(3, "\1\0\1\0`\0 \0\200\0\0\0\200\240w\316\0\0\0\0\6\0\0\0|"..., 
4096) = 4096
_llseek(3, 1169951563776, [1169951563776], SEEK_SET) = 0
read(3, "\1\0\1\0`\0 \0\200\0\0\0\200\20\201\311\0\0\0\0\16\0\21"..., 
4096) = 4096
_llseek(3, 1169951567872, [1169951567872], SEEK_SET) = 0
read(3, "\1\0\1\0`\0 \0\200\0\0\0\200 v\313\0\0\0\0\0\0\21\300|"..., 
4096) = 4096
_llseek(3, 1169951571968, [1169951571968], SEEK_SET) = 0
read(3, "\1\0\1\0`\0 \0\200\0\0\0\200P\351\344\0\0\0\0\0\0\0\0|"..., 
4096) = 4096
_llseek(3, 1169951576064, [1169951576064], SEEK_SET) = 0
read(3, "\1\0\1\0`\0 \0\200\0\0\0\200\260\223\331\0\0\0\0\0\0\0"..., 
4096) = 4096
_llseek(3, 1169951580160, [1169951580160], SEEK_SET) = 0
read(3, "\1\0\1\0`\0 \0\200\0\0\0\200\0B\314\0\0\0\0\34\0\21\300"..., 
4096) = 4096
_llseek(3, 1169951584256, [1169951584256], SEEK_SET) = 0
read(3, "\1\0\1\0`\0 \0\200\0\0\0\200@\26\333\0\0\0\0\v\0\21\300"..., 
4096) = 4096
_llseek(3, 1169951588352, [1169951588352], SEEK_SET) = 0
read(3, "\1\0\1\0`\0 \0\200\0\0\0\200P:\327\0\0\0\0\20\0\21\300"..., 
4096) = 4096
_llseek(3, 1169951592448, [1169951592448], SEEK_SET) = 0
read(3, "\1\0\1\0`\0 \0\200\0\0\0\200\300N\350\0\0\0\0\25\0\21\300"..., 
4096) = 4096
_llseek(3, 1169951596544, [1169951596544], SEEK_SET) = 0
read(3, "\1\0\1\0`\0 \0\200\0\0\0\200`\23\323\0\0\0\0\0\0\322\r"..., 
4096) = 4096
_llseek(3, 1169951600640, [1169951600640], SEEK_SET) = 0
read(3, "\1\0\1\0`\0 \0\200\0\0\0\200\0206\312\0\0\0\0\0\0\0\0|"..., 
4096) = 4096
_llseek(3, 1169951604736, [1169951604736], SEEK_SET) = 0
read(3, "\1\0\1\0`\0 \0\200\0\0\0\200\340&\324\0\0\0\0\20\0\21\300"..., 
4096) = 4096
_llseek(3, 1169951608832, [1169951608832], SEEK_SET) = 0
read(3, "\1\0\1\0`\0 \0\200\0\0\0\200\320\335\313\0\0\0\0\0\0\21"..., 
4096) = 4096
_llseek(3, 1169951612928, [1169951612928], SEEK_SET) = 0
read(3, "\1\0\1\0`\0 \0\334\1\0\0\334\1\257\324\0\0\0\0L\0\21\300"..., 
4096) = 4096
_llseek(3, 1169951617024, [1169951617024], SEEK_SET) = 0
read(3, "\1\0\1\0`\0 \0\200\0\0\0\200\20+\366\0\0\0\0\36\0\21\300"..., 
4096) = 4096
_llseek(3, 1169951621120, [1169951621120], SEEK_SET) = 0
read(3, "\1\0\1\0`\0 \0\200\0\0\0\200@\354\324\0\0\0\0\1\0\377\377"..., 
4096) = 4096
_llseek(3, 242647564288, [242647564288], SEEK_SET) = 0
write(3, "\1\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0|\0\0\0"..., 
4096) = 4096
_llseek(3, 242647568384, [242647568384], SEEK_SET) = 0
write(3, "\1\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0|\0\0\0"..., 
4096) = 4096
_llseek(3, 242647572480, [242647572480], SEEK_SET) = 0
write(3, "\1\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0|\0\0\0"..., 
4096) = 4096
_llseek(3, 242647576576, [242647576576], SEEK_SET) = 0
write(3, "\1\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0|\0\0\0"..., 
4096) = 4096
_llseek(3, 242647580672, [242647580672], SEEK_SET) = 0
write(3, "\1\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0|\0\0\0"..., 
4096) = 4096
_llseek(3, 242647515136, [242647515136], SEEK_SET) = 0
write(3, "\1\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0|\0\0\0"..., 
4096) = 4096
_llseek(3, 242647584768, [242647584768], SEEK_SET) = 0
write(3, "\1\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0|\0\0\0"..., 
4096) = 4096
_llseek(3, 242647588864, [242647588864], SEEK_SET) = 0
write(3, "\1\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0|\0\0\0"..., 
4096) = 4096
_llseek(3, 242647592960, [242647592960], SEEK_SET) = 0
write(3, "\1\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0|\0\0\0"..., 
4096) = 4096
_llseek(3, 242647601152, [242647601152], SEEK_SET) = 0
write(3, "\1\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0|\0\0\0"..., 
4096) = 4096
_llseek(3, 242647605248, [242647605248], SEEK_SET) = 0
write(3, "\1\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0|\0\0\0"..., 
4096) = 4096
_llseek(3, 242647609344, [242647609344], SEEK_SET) = 0
write(3, "\1\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0|\0\0\0"..., 
4096) = 4096
_llseek(3, 242647613440, [242647613440], SEEK_SET) = 0
write(3, "\1\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0|\0\0\0"..., 
4096) = 4096
_llseek(3, 242647617536, [242647617536], SEEK_SET) = 0
write(3, "\1\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0|\0\0\0"..., 
4096) = 4096
_llseek(3, 1169951625216, [1169951625216], SEEK_SET) = 0
read(3, "\1\0\1\0`\0 \0\200\0\0\0\200\200\374\315\0\0\0\0\0\0\21"..., 
4096) = 4096
_llseek(3, 1169951629312, [1169951629312], SEEK_SET) = 0
read(3, "\1\0\1\0`\0 \0\200\0\0\0\200\0\276\356\0\0\0\0\0\0\21\300"..., 
4096) = 4096
_llseek(3, 1169951633408, [1169951633408], SEEK_SET) = 0
read(3, "\1\0\1\0`\0 \0\200\0\0\0\200@\337\354\0\0\0\0\1\0\30\0"..., 
4096) = 4096
_llseek(3, 1169951637504, [1169951637504], SEEK_SET) = 0
read(3, "\1\0\1\0`\0 \0\200\0\0\0\200\360\245\351\0\0\0\0\24\0\0"..., 
4096) = 4096
_llseek(3, 1169951641600, [1169951641600], SEEK_SET) = 0
read(3, "\1\0\1\0`\0 \0\200\0\0\0\200\320G\315\0\0\0\0\36\0\21\300"..., 
4096) = 4096
_llseek(3, 1169951645696, [1169951645696], SEEK_SET) = 0
read(3, "\1\0\1\0`\0 \0\200\0\0\0\200\200\267\327\0\0\0\0\6\0\21"..., 
4096) = 4096
_llseek(3, 1169951649792, [1169951649792], SEEK_SET) = 0
read(3, "\1\0\1\0`\0 \0\200\0\0\0\200\0\322\356\0\0\0\0\n\0\21\300"..., 
4096) = 4096
_llseek(3, 1169951653888, [1169951653888], SEEK_SET) = 0
read(3, "\1\0\1\0`\0 \0\200\0\0\0\200 c\352\0\0\0\0\36\0\332\360"..., 
4096) = 4096
_llseek(3, 1169951657984, [1169951657984], SEEK_SET) = 0
read(3, "\1\0\1\0`\0 \0\200\0\0\0\200\300\203\355\0\0\0\0\0\0\21"..., 
4096) = 4096
_llseek(3, 1169951662080, [1169951662080], SEEK_SET) = 0
read(3, "\1\0\1\0`\0 \0\200\0\0\0\200\20\266\312\0\0\0\0\0\0\0\0"..., 
4096) = 4096

The following are the file descriptors of process 9835 (the reiserfsck 
process) according to /proc:

tyler@gwar ~> sudo ls -laF /proc/9835/fd/
total 5
dr-x------    2 root     root            0 Apr 10 16:24 ./
dr-xr-xr-x    5 root     root            0 Apr 10 15:41 ../
lrwx------    1 root     root           64 Apr 10 16:24 0 -> /dev/pts/1
lrwx------    1 root     root           64 Apr 10 16:24 1 -> /dev/pts/1
lrwx------    1 root     root           64 Apr 10 16:24 2 -> /dev/pts/1
lrwx------    1 root     root           64 Apr 10 16:24 3 -> 
/dev/evms/postgres
lrwx------    1 root     root           64 Apr 10 16:24 4 -> 
/dev/evms/postgres

The parallel recovery effort to bring up postgres on different hardware 
using the backup has proceeded smoothly.  As a result, if the fsck has 
in fact croaked it's not devastating.. but it is troublesome.

1.  Is it normal for pass 2 of the --rebuild-tree operation to stay at 
the same "progress" for more than 24 hours?
2.  Is it normal for the reiserfsck operation to use significant CPU 
resources while I/O activity is minimal?
3.  For a filesystem of such size, is it reasonable for the fsck to 
take many days to complete?
4.  How long do you recommend that I leave it in this state before 
aborting the fsck?

I've tried to provide as much useful information as I can think of.. if 
there's additional information that you need, I'll do my best to 
provide it.

Thank you for your help.

-Tyler


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Bug report: reiserfsck --rebuild-tree not progressing
  2006-04-11  0:34 Bug report: reiserfsck --rebuild-tree not progressing Tyler Phelps
@ 2006-04-11  7:21 ` Sander
  2006-04-11  8:12   ` Tyler Phelps
  2006-04-12 15:35 ` Ard van Breemen
  1 sibling, 1 reply; 10+ messages in thread
From: Sander @ 2006-04-11  7:21 UTC (permalink / raw)
  To: System Administrators; +Cc: reiserfs-list

Tyler Phelps wrote (ao):
> Package: reiserfsprogs
> Version: 1:3.6.17-2

Can you try a newer version?
ftp://ftp.namesys.com/pub/reiserfsprogs/reiserfsprogs-3.6.19.tar.gz

According to http://marc.theaimsgroup.com/?t=114235160800008&r=1&w=2
3.6.20 also exist, but I cant find it.

	Good luck, Sander

-- 
Humilis IT Services and Solutions
http://www.humilis.net

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Bug report: reiserfsck --rebuild-tree not progressing
  2006-04-11  7:21 ` Sander
@ 2006-04-11  8:12   ` Tyler Phelps
  2006-04-11  9:24     ` Vladimir V. Saveliev
  2006-04-11 11:22     ` Konstantin Münning
  0 siblings, 2 replies; 10+ messages in thread
From: Tyler Phelps @ 2006-04-11  8:12 UTC (permalink / raw)
  To: sander; +Cc: System Administrators, reiserfs-list

My questions (and all of the diagnostic information provided) revolve  
around a specific process of reiserfsck, using the version that I  
specified, which is still running.  The only way that I can try a new  
version is to abort the current fsck operation... doing that  
essentially invalidates all of the questions that I've asked.

I'm reluctant to abort the current process.  My primary reason for  
this is that I have no way of knowing if aborting the current fsck  
will cause further damage.  After all, the man page states, "Once  
reiserfsck --rebuild-tree is started it must finish its work (and you  
should not interrupt it), otherwise the filesystem will be left in  
the unmountable state to avoid subsequent data corruptions."  Second,  
I don't know if anything is even wrong with the way that things are  
"progressing" with the current process... hence the reason for my  
original questions.

-Tyler

On Apr 11, 2006, at 12:21 AM, Sander wrote:

> Tyler Phelps wrote (ao):
>> Package: reiserfsprogs
>> Version: 1:3.6.17-2
>
> Can you try a newer version?
> ftp://ftp.namesys.com/pub/reiserfsprogs/reiserfsprogs-3.6.19.tar.gz
>
> According to http://marc.theaimsgroup.com/?t=114235160800008&r=1&w=2
> 3.6.20 also exist, but I cant find it.
>
> 	Good luck, Sander
>
> -- 
> Humilis IT Services and Solutions
> http://www.humilis.net


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Bug report: reiserfsck --rebuild-tree not progressing
  2006-04-11  8:12   ` Tyler Phelps
@ 2006-04-11  9:24     ` Vladimir V. Saveliev
  2006-04-11 14:34       ` Bernhard Sadlowski
  2006-04-11 11:22     ` Konstantin Münning
  1 sibling, 1 reply; 10+ messages in thread
From: Vladimir V. Saveliev @ 2006-04-11  9:24 UTC (permalink / raw)
  To: Tyler Phelps
  Cc: Vitaly P. Fertman, sander, System Administrators, reiserfs-list

Hello

On Tue, 2006-04-11 at 01:12 -0700, Tyler Phelps wrote:
> My questions (and all of the diagnostic information provided) revolve  
> around a specific process of reiserfsck, using the version that I  
> specified, which is still running.  The only way that I can try a new  
> version is to abort the current fsck operation... doing that  
> essentially invalidates all of the questions that I've asked.
> 

> 1.  Is it normal for pass 2 of the --rebuild-tree operation to stay 
> at the same "progress" for more than 24 hours?
> 2.  Is it normal for the reiserfsck operation to use significant CPU 
> resources while I/O activity is minimal?

1-2: It is not normal but possible. Old reiserfsck (like the one you are
using) was extremly inefficient in some (rare) kind of corruptions).

> 3.  For a filesystem of such size, is it reasonable for the fsck to  
> take many days to complete?

Unfortunately, currently, reiserfsck takes a long time on big
filesystems.

> 4.  How long do you recommend that I leave it in this state before  
> aborting the fsck?

I would recommend you to abort it and try
ftp://ftp.namesys.com/pub/tmp/reiserfsprogs-3.6.20.tar.gz


> I'm reluctant to abort the current process.  My primary reason for  
> this is that I have no way of knowing if aborting the current fsck  
> will cause further damage.  After all, the man page states, "Once  
> reiserfsck --rebuild-tree is started it must finish its work (and you  
> should not interrupt it), otherwise the filesystem will be left in  
> the unmountable state to avoid subsequent data corruptions."  Second,  
> I don't know if anything is even wrong with the way that things are  
> "progressing" with the current process... hence the reason for my  
> original questions.
> 


> -Tyler
> 
> On Apr 11, 2006, at 12:21 AM, Sander wrote:
> 
> > Tyler Phelps wrote (ao):
> >> Package: reiserfsprogs
> >> Version: 1:3.6.17-2
> >
> > Can you try a newer version?
> > ftp://ftp.namesys.com/pub/reiserfsprogs/reiserfsprogs-3.6.19.tar.gz
> >
> > According to http://marc.theaimsgroup.com/?t=114235160800008&r=1&w=2
> > 3.6.20 also exist, but I cant find it.
> >
> > 	Good luck, Sander
> >
> > -- 
> > Humilis IT Services and Solutions
> > http://www.humilis.net
> 
> 


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Bug report: reiserfsck --rebuild-tree not progressing
  2006-04-11  8:12   ` Tyler Phelps
  2006-04-11  9:24     ` Vladimir V. Saveliev
@ 2006-04-11 11:22     ` Konstantin Münning
  1 sibling, 0 replies; 10+ messages in thread
From: Konstantin Münning @ 2006-04-11 11:22 UTC (permalink / raw)
  To: Tyler Phelps; +Cc: reiserfs-list, sysadmin

Hi!

I had the same problem about a year ago with a 0.8TB drive, you may
check some list archives for the details.

The solution was a patch to the reiserfsprogs which was then
incorporated in version 3.6.19. I am not familiar with the details as I
only supplied the information and Vladimir did the work but I can only
repeat what was said - abort current fsck and retry with latest tools
(3.6.19 should be sufficent, I can't tell about 3.6.20).

As for your concerns, it's correct that when you abort the current fsck
it will result in an unmountable FS but you can repair it with another
run. I doubt there is any way to make the current fsck to contnue except
maybe some weird magic hack into the running program ;-). If there are
chances to loose data in this process - I'm not the expert but I think
an abort is not that critical - at least at that stage where reiserfsck
is in an endless loop. With my problem I had some minor data corruption
issues afterwards but I think that was because of the primary fault -
the RAID controller had RAM errors and the RAID consistency was broken
which resulted in the corrupted FS. Then after several fsck tries,
superblock reconstruction etc. I was surprised how much was still intact
(far less than 1% of data was corrupted) but I think this doesn't apply
to you as it's not any guarantee.

Have a nice day,
Konstantin

Tyler Phelps wrote:
> My questions (and all of the diagnostic information provided) revolve 
> around a specific process of reiserfsck, using the version that I 
> specified, which is still running.  The only way that I can try a new 
> version is to abort the current fsck operation... doing that 
> essentially invalidates all of the questions that I've asked.
> 
> I'm reluctant to abort the current process.  My primary reason for  this
> is that I have no way of knowing if aborting the current fsck  will
> cause further damage.  After all, the man page states, "Once  reiserfsck
> --rebuild-tree is started it must finish its work (and you  should not
> interrupt it), otherwise the filesystem will be left in  the unmountable
> state to avoid subsequent data corruptions."  Second,  I don't know if
> anything is even wrong with the way that things are  "progressing" with
> the current process... hence the reason for my  original questions.
> 
> -Tyler
> 
> On Apr 11, 2006, at 12:21 AM, Sander wrote:
> 
>> Tyler Phelps wrote (ao):
>>
>>> Package: reiserfsprogs
>>> Version: 1:3.6.17-2
>>
>>
>> Can you try a newer version?
>> ftp://ftp.namesys.com/pub/reiserfsprogs/reiserfsprogs-3.6.19.tar.gz
>>
>> According to http://marc.theaimsgroup.com/?t=114235160800008&r=1&w=2
>> 3.6.20 also exist, but I cant find it.
>>
>>     Good luck, Sander
>>
>> -- 
>> Humilis IT Services and Solutions
>> http://www.humilis.net

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Bug report: reiserfsck --rebuild-tree not progressing
  2006-04-11  9:24     ` Vladimir V. Saveliev
@ 2006-04-11 14:34       ` Bernhard Sadlowski
  2006-04-11 15:23         ` Vladimir V. Saveliev
  0 siblings, 1 reply; 10+ messages in thread
From: Bernhard Sadlowski @ 2006-04-11 14:34 UTC (permalink / raw)
  To: Vladimir V. Saveliev; +Cc: reiserfs-list

On 11 Apr 2006 13:24, Vladimir V. Saveliev <vs@namesys.com> wrote:
> I would recommend you to abort it and try
> ftp://ftp.namesys.com/pub/tmp/reiserfsprogs-3.6.20.tar.gz

Why is it in /pub/tmp? Is this an official new version?

This version compiles with no Problems on Debian 3.1 with gcc 3.3.5, but
compiling on Debian 3.0 (woody, old stable distribution, gcc 2.95.4) does
not work, even thogh 3.6.19 had no still compile problems:

$ make
[...]
gcc -DHAVE_CONFIG_H -I. -I. -I../.. -I../../include   -Wall -O2 -fPIC -DPIC
-Wall -O2 -fPIC -DPIC -c -o debugfs_reiserfs-corruption.o test -f
'corruption.c' || echo './'corruption.c
corruption.c: In function do_bitmap_corruption':
corruption.c:892: parse error before *'
corruption.c:907: i' undeclared (first use in this function)
corruption.c:907: (Each undeclared identifier is reported only once
corruption.c:907: for each function it appears in.)
corruption.c:910: bh' undeclared (first use in this function)
make[3]: *** [debugfs_reiserfs-corruption.o] Error 1
make[3]: Leaving directory /usr/src/reiser/reiserfsprogs-3.6.20/utils/debugfs'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory /usr/src/reiser/reiserfsprogs-3.6.20/utils'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory /usr/src/reiser/reiserfsprogs-3.6.20'
make: *** [all] Error 2

$ gcc -v
Reading specs from /usr/lib/gcc-lib/i386-linux/2.95.4/specs
gcc version 2.95.4 20011002 (Debian prerelease)

Bernhard

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Bug report: reiserfsck --rebuild-tree not progressing
  2006-04-11 14:34       ` Bernhard Sadlowski
@ 2006-04-11 15:23         ` Vladimir V. Saveliev
  2006-04-11 15:40           ` Bernhard Sadlowski
  0 siblings, 1 reply; 10+ messages in thread
From: Vladimir V. Saveliev @ 2006-04-11 15:23 UTC (permalink / raw)
  To: Bernhard Sadlowski; +Cc: reiserfs-list

[-- Attachment #1: Type: text/plain, Size: 1688 bytes --]

Hello

On Tue, 2006-04-11 at 16:34 +0200, Bernhard Sadlowski wrote:
> On 11 Apr 2006 13:24, Vladimir V. Saveliev <vs@namesys.com> wrote:
> > I would recommend you to abort it and try
> > ftp://ftp.namesys.com/pub/tmp/reiserfsprogs-3.6.20.tar.gz
> 
> Why is it in /pub/tmp? Is this an official new version?
> 
not yet. we would like it to get more testing.

> This version compiles with no Problems on Debian 3.1 with gcc 3.3.5, but
> compiling on Debian 3.0 (woody, old stable distribution, gcc 2.95.4) does
> not work, even thogh 3.6.19 had no still compile problems:
> 
> $ make
> [...]
> gcc -DHAVE_CONFIG_H -I. -I. -I../.. -I../../include   -Wall -O2 -fPIC -DPIC
> -Wall -O2 -fPIC -DPIC -c -o debugfs_reiserfs-corruption.o test -f
> 'corruption.c' || echo './'corruption.c
> corruption.c: In function do_bitmap_corruption':
> corruption.c:892: parse error before *'
> corruption.c:907: i' undeclared (first use in this function)
> corruption.c:907: (Each undeclared identifier is reported only once
> corruption.c:907: for each function it appears in.)
> corruption.c:910: bh' undeclared (first use in this function)
> make[3]: *** [debugfs_reiserfs-corruption.o] Error 1
> make[3]: Leaving directory /usr/src/reiser/reiserfsprogs-3.6.20/utils/debugfs'
> make[2]: *** [all-recursive] Error 1
> make[2]: Leaving directory /usr/src/reiser/reiserfsprogs-3.6.20/utils'
> make[1]: *** [all-recursive] Error 1
> make[1]: Leaving directory /usr/src/reiser/reiserfsprogs-3.6.20'
> make: *** [all] Error 2
> 
would you try if this patch helps, please?

> $ gcc -v
> Reading specs from /usr/lib/gcc-lib/i386-linux/2.95.4/specs
> gcc version 2.95.4 20011002 (Debian prerelease)
> 
> Bernhard
> 

[-- Attachment #2: patch --]
[-- Type: text/plain, Size: 362 bytes --]

--- ./utils/debugfs/corruption.c~	2006-03-28 16:40:23.000000000 +0400
+++ ./utils/debugfs/corruption.c	2006-04-11 19:21:39.000000000 +0400
@@ -888,7 +888,7 @@
 
 	unsigned long first = fs->fs_super_bh->b_blocknr + 1;
 	unsigned long nr_bitmap_to_corrupt;
-	unsigned long block, bmap_nr;;
+	unsigned long block, bmap_nr;
 	reiserfs_bh_t * bh;
 	unsigned int i;
 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Bug report: reiserfsck --rebuild-tree not progressing
  2006-04-11 15:23         ` Vladimir V. Saveliev
@ 2006-04-11 15:40           ` Bernhard Sadlowski
  0 siblings, 0 replies; 10+ messages in thread
From: Bernhard Sadlowski @ 2006-04-11 15:40 UTC (permalink / raw)
  To: Vladimir V. Saveliev; +Cc: reiserfs-list

Hi,

On 11 Apr 2006 19:23, Vladimir V. Saveliev <vs@namesys.com> wrote:
> would you try if this patch helps, please?

> --- ./utils/debugfs/corruption.c~	2006-03-28 16:40:23.000000000 +0400
> +++ ./utils/debugfs/corruption.c	2006-04-11 19:21:39.000000000 +0400
> @@ -888,7 +888,7 @@
>  
>  	unsigned long first = fs->fs_super_bh->b_blocknr + 1;
>  	unsigned long nr_bitmap_to_corrupt;
> -	unsigned long block, bmap_nr;;
> +	unsigned long block, bmap_nr;
>  	reiserfs_bh_t * bh;
>  	unsigned int i;

This works now. Thanks!

Bernhard

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Bug report: reiserfsck --rebuild-tree not progressing
  2006-04-11  0:34 Bug report: reiserfsck --rebuild-tree not progressing Tyler Phelps
  2006-04-11  7:21 ` Sander
@ 2006-04-12 15:35 ` Ard van Breemen
  2006-04-13  9:48   ` Ard van Breemen
  1 sibling, 1 reply; 10+ messages in thread
From: Ard van Breemen @ 2006-04-12 15:35 UTC (permalink / raw)
  To: System Administrators; +Cc: reiserfs-list

Hi,
On Mon, Apr 10, 2006 at 05:34:58PM -0700, Tyler Phelps wrote:
> The following are syslog messages from the kernel.  The filesystem 
> trouble began at about 2:50am which matches the following log entries:
> 
> Apr  6 02:50:03 gwar kernel: ReiserFS: dm-12: warning: vs-13060: 
> reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1) 
> not found (pos 1)

Have you been able to determine why it happened?
We also have opterons and got this:
ReiserFS: sdb9: warning: vs-13060: reiserfs_update_sd: stat data of object [2 12 0x0 SD] (nlink == 3) not found (pos 1)
ReiserFS: sdb9: warning: vs-13060: reiserfs_update_sd: stat data of object [2 12 0x0 SD] (nlink == 3) not found (pos 1)
ReiserFS: sdb9: warning: vs-13060: reiserfs_update_sd: stat data of object [2 12 0x0 SD] (nlink == 3) not found (pos 1)

And then finally some sort of kernel panic.
And it is repeatable.
Kernel: 2.6.16.1 for AMD64
Kernel: 2.6.17rc1 for AMD64 (Yes, tried both)
Compiled with gcc-3.3 on an AMD64 platform
userspace is 32 bits debian
Hardware:
tyan s2891 (gt24 system) with nforce4 and 4 sata 1 drives.

success:
Run a lot of bonnie++'s

consistent failure:
(Tried as 4 seperate disks of 0.4T and as one raid5 partition of 1.1T)
After 2 hours of pumping a few million files onto the machine
reiserfs starts putting out these warnings (a few thousands):
^MReiserFS: sdb9: warning: vs-13060: reiserfs_update_sd: stat data of object [2 12 0x0 SD] (nlink == 3) not found (pos 1)

And then the panic for a 2.6.16.1 kernel:

NMI Watchdog detected LOCKUP on CPU 1
CPU 1 
Modules linked in: ipv6 tg3 nfsd exportfs nfs lockd sunrpc
Pid: 15215, comm: tar Not tainted 2.6.16.1-tyan-s2891 #1
RIP: 0010:[<ffffffff8037903d>] <ffffffff8037903d>{.text.lock.spinlock+22}
RSP: 0000:ffff81012fc9fc48  EFLAGS: 00000086
RAX: ffff8100cbcfec78 RBX: ffff8100cbcfec70 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff8100cbcfec70
RBP: ffff81012fc9fc88 R08: 00000000000200a3 R09: 0000000000000640
R10: 0000000000010c0c R11: 00000000000005b4 R12: 0000000000000000
R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000001
FS:  0000000000000000(0000) GS:ffff81012fc5fbc0(0063) knlGS:00000000f7e88080
CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 00000000f7f7dfbc CR3: 0000000042d8e000 CR4: 00000000000006e0
Process tar (pid: 15215, threadinfo ffff81002891c000, task ffff81007e2860c0)
Stack: 0000000000000296 ffffffff801280ed 0000000100000000 ffff810116728700 
       ffff81002ff90780 00000000ffffffff 0000000000000000 ffff81003d8de834 
       ffff810116728700 ffffffff80315604 
Call Trace: <IRQ> <ffffffff801280ed>{__wake_up+45} <ffffffff80315604>{sock_def_readable+52}
       <ffffffff80349cce>{tcp_data_queue+894} <ffffffff8034b236>{tcp_rcv_established+1638}
       <ffffffff80352883>{tcp_v4_do_rcv+35} <ffffffff80352ef7>{tcp_v4_rcv+1431}
       <ffffffff8031c69f>{dev_queue_xmit+559} <ffffffff80337872>{ip_local_deliver+402}
       <ffffffff80337e3c>{ip_rcv+1244} <ffffffff8031cb8c>{netif_receive_skb+508}
       <ffffffff880b9109>{:tg3:tg3_rx+905} <ffffffff880b927c>{:tg3:tg3_poll+140}
       <ffffffff8031cd84>{net_rx_action+132} <ffffffff80133c21>{__do_softirq+97}
       <ffffffff8010bef2>{call_softirq+30} <ffffffff8010dc11>{do_softirq+49}
       <ffffffff8010dbc7>{do_IRQ+71} <ffffffff8010b250>{ret_from_intr+0} <EOI>
       <ffffffff8021f9b1>{memmove+49} <ffffffff801ce065>{leaf_paste_entries+245}
       <ffffffff801cc0ab>{leaf_copy_dir_entries+619} <ffffffff801cc48a>{leaf_copy_boundary_item+970}
       <ffffffff80143ad8>{wake_up_bit+24} <ffffffff801cce3e>{leaf_copy_items+78}
       <ffffffff801cd150>{leaf_move_items+80} <ffffffff801cd1de>{leaf_shift_left+62}
       <ffffffff801b85a1>{balance_leaf_when_delete+865} <ffffffff801b865d>{balance_leaf+93}
       <ffffffff80143ad8>{wake_up_bit+24} <ffffffff801d8dd8>{reiserfs_prepare_for_journal+88}
       <ffffffff801babed>{do_balance+141} <ffffffff801c75be>{fix_nodes+590}
       <ffffffff801d2113>{reiserfs_cut_from_item+915} <ffffffff801bc174>{reiserfs_unlink+308}
       <ffffffff801934c4>{mntput_no_expire+36} <ffffffff8018803e>{vfs_unlink+110}
       <ffffffff8018812f>{do_unlinkat+175} <ffffffff8011de5e>{ia32_sysret+0}

Code: 83 3f 00 7e f9 e9 92 fd ff ff f3 90 83 3f 00 7e f9 e9 9e fd 
console shuts up ...
 Badness in do_exit at kernel/exit.c:802

Call Trace: <NMI> <ffffffff80131434>{do_exit+68} <ffffffff8010c7eb>{die_nmi+123}
       <ffffffff80117b16>{nmi_watchdog_tick+230} <ffffffff8010d2c6>{default_do_nmi+134}
       <ffffffff80117c15>{do_nmi+69} <ffffffff803792f3>{nmi+127}
       <ffffffff8037903d>{.text.lock.spinlock+22} <EOE> <IRQ>
       <ffffffff801280ed>{__wake_up+45} <ffffffff80315604>{sock_def_readable+52}
       <ffffffff80349cce>{tcp_data_queue+894} <ffffffff8034b236>{tcp_rcv_established+1638}
       <ffffffff80352883>{tcp_v4_do_rcv+35} <ffffffff80352ef7>{tcp_v4_rcv+1431}
       <ffffffff8031c69f>{dev_queue_xmit+559} <ffffffff80337872>{ip_local_deliver+402}
       <ffffffff80337e3c>{ip_rcv+1244} <ffffffff8031cb8c>{netif_receive_skb+508}
       <ffffffff880b9109>{:tg3:tg3_rx+905} <ffffffff880b927c>{:tg3:tg3_poll+140}
       <ffffffff8031cd84>{net_rx_action+132} <ffffffff80133c21>{__do_softirq+97}
       <ffffffff8010bef2>{call_softirq+30} <ffffffff8010dc11>{do_softirq+49}
       <ffffffff8010dbc7>{do_IRQ+71} <ffffffff8010b250>{ret_from_intr+0} <EOI>
       <ffffffff8021f9b1>{memmove+49} <ffffffff801ce065>{leaf_paste_entries+245}
       <ffffffff801cc0ab>{leaf_copy_dir_entries+619} <ffffffff801cc48a>{leaf_copy_boundary_item+970}
       <ffffffff80143ad8>{wake_up_bit+24} <ffffffff801cce3e>{leaf_copy_items+78}
       <ffffffff801cd150>{leaf_move_items+80} <ffffffff801cd1de>{leaf_shift_left+62}
       <ffffffff801b85a1>{balance_leaf_when_delete+865} <ffffffff801b865d>{balance_leaf+93}
       <ffffffff80143ad8>{wake_up_bit+24} <ffffffff801d8dd8>{reiserfs_prepare_for_journal+88}
       <ffffffff801babed>{do_balance+141} <ffffffff801c75be>{fix_nodes+590}
       <ffffffff801d2113>{reiserfs_cut_from_item+915} <ffffffff801bc174>{reiserfs_unlink+308}
       <ffffffff801934c4>{mntput_no_expire+36} <ffffffff8018803e>{vfs_unlink+110}
       <ffffffff8018812f>{do_unlinkat+175} <ffffffff8011de5e>{ia32_sysret+0}
Kernel panic - not syncing: Aiee, killing interrupt handler!


I've seen in git that there is a memory leak in tg3, but I guess
(looking at the graphs) that it was not memory related.

Anyway, off to put another 2 of those at testing...

Regards,
Ard van Breemen

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Bug report: reiserfsck --rebuild-tree not progressing
  2006-04-12 15:35 ` Ard van Breemen
@ 2006-04-13  9:48   ` Ard van Breemen
  0 siblings, 0 replies; 10+ messages in thread
From: Ard van Breemen @ 2006-04-13  9:48 UTC (permalink / raw)
  To: System Administrators; +Cc: reiserfs-list

Hello,

On Wed, Apr 12, 2006 at 05:35:57PM +0200, Ard van Breemen wrote:
> consistent failure:
> (Tried as 4 seperate disks of 0.4T and as one raid5 partition of 1.1T)
> After 2 hours of pumping a few million files onto the machine
> reiserfs starts putting out these warnings (a few thousands):
> ^MReiserFS: sdb9: warning: vs-13060: reiserfs_update_sd: stat data of object [2 12 0x0 SD] (nlink == 3) not found (pos 1)

I've now got a problem getting it to fail.
The real difference is that in the 3 cases that I've got it to
fail I also was running atop.
Now I've been pumping gigs onto the disk without failure.
I guess something is wrong in the process accounting mechanism.
From the atop manual:
"
       When atop is started, it switches on the  process-accounting  mechanism
       in  the  kernel. This forces the kernel to write a record with account-
       ing-information to the accounting-file whenever a process ends.   Apart
       from  the  kernel-administration related to the running processes, atop
       also interprets the accounting-records on disk with every interval;  in
       this way atop can also show the activity of a process during the inter-
       val in which it is finished.
       Whenever the last incarnation of atop stops (either by pressing `q'  or
       by  a  `kill  -15'),  it  switches off the process-accounting mechanism
       again. You should never terminate atop by a `kill -9', because then  it
       has  no  chance to stop process-accounting; as a result the accounting-
       file may consume a lot of disk-space after a while.
"

I will install 2 other systems, one running atop, and one not running atop.
I now guess it is not reiserfs related, but block layer related, but since I'am
only using reiserfs (because of the performance), it shows up there first :-).
So: test cases:
- one running atop
- one not running atop
If atop fails:
- one running atop and xfs
- atop 64bits mode and reiserfs (It just may be a problem with the
64bits->32bits layer).

-- 
begin  LOVE-LETTER-FOR-YOU.txt.vbs
I am a signature virus. Distribute me until the bitter
end

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2006-04-13  9:48 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-04-11  0:34 Bug report: reiserfsck --rebuild-tree not progressing Tyler Phelps
2006-04-11  7:21 ` Sander
2006-04-11  8:12   ` Tyler Phelps
2006-04-11  9:24     ` Vladimir V. Saveliev
2006-04-11 14:34       ` Bernhard Sadlowski
2006-04-11 15:23         ` Vladimir V. Saveliev
2006-04-11 15:40           ` Bernhard Sadlowski
2006-04-11 11:22     ` Konstantin Münning
2006-04-12 15:35 ` Ard van Breemen
2006-04-13  9:48   ` Ard van Breemen

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.