* Bug report: reiserfsck --rebuild-tree not progressing
@ 2006-04-11 0:34 Tyler Phelps
2006-04-11 7:21 ` Sander
2006-04-12 15:35 ` Ard van Breemen
0 siblings, 2 replies; 10+ messages in thread
From: Tyler Phelps @ 2006-04-11 0:34 UTC (permalink / raw)
To: reiserfs-list; +Cc: System Administrators
Two sentence summary: reiserfsck --rebuild-tree on a 1.4TB reiserfs
filesystem seems to be spinning its wheels. The reiserfsck has been
running for a little more than 56 hours, the progress meter hasn't
changed in over 24 hours, very little I/O activity is happening, and
the reiserfsck process is utilizing 100% of one of the two CPUs in the
system.
I'll provide system information first, then explain the problem in
detail.
Hardware:
4U rackmount system with 16gb of memory and dual opteron cpus running
in a controlled environment datacenter. The partition in question sits
on twelve 250gb drives (Western Digital Caviar Raid Edition 7200rpm
SATA.. model number WD2500SD). It is the only partition that resides
on these drives. The partition contains postgres data and nothing
else.. the rest of the system lives on a separate set of disks. The
raid array is half hardware and half software. The twelve disks are
formed into six pairs.. each pair is formed into a raid 1 mirror. This
mirroring is done in hardware on two different 8-port 3ware 9500S
controller cards. EVMS is used to provide a software raid 0 stripe
array that spans the six mirrored pairs.
Distro and kernel:
The system is running Debian using the sarge (stable) codeline, version
1:3.3.4-6sarge1. The kernel is self-compiled version 2.6.13.1. No
reiserfs-specific patches have been applied to the kernel source. The
kernel configuration can be provided as needed.. for now, the following
are the lines from the config that contain "REISER":
CONFIG_REISERFS_FS=y
# CONFIG_REISERFS_CHECK is not set
CONFIG_REISERFS_PROC_INFO=y
# CONFIG_REISERFS_FS_XATTR is not set
Version:
I'm using the reiserfsprogs package from the sarge codeline. The
following is metadata about said package:
Package: reiserfsprogs
Status: install ok installed
Priority: optional
Section: admin
Installed-Size: 1079
Maintainer: Ed Boraas <ed@debian.org>
Architecture: i386
Version: 1:3.6.17-2
Depends: libc6 (>= 2.3.2.ds1-4), libuuid1
Description: User-level tools for ReiserFS filesystems
This package contains utilities to create, check, resize, and debug
ReiserFS
filesystems.
.
NOTE: Releases of Linux prior to 2.4.1 do not support ReiserFS on
their own.
Thus, these tools will only be useful with Linux 2.4.1 or later, or if
your
kernel has been built with the ReiserFS patch applied. This patch can
be found
in the appropriate kernel-patch-<version>-reiserfs packages.
.
Homepage: http://www.namesys.com/
The following is mount information about the filesystem, taken from
dmesg:
ReiserFS: dm-12: found reiserfs format "3.6" with standard journal
ReiserFS: dm-12: using ordered data mode
ReiserFS: dm-12: journal params: device dm-12, size 8192, journal first
block 18, max trans len 1024, max batch 900, max commit age 30, max
trans age 30
ReiserFS: dm-12: checking transaction log (dm-12)
ReiserFS: dm-12: Using r5 hash to sort names
The following are syslog messages from the kernel. The filesystem
trouble began at about 2:50am which matches the following log entries:
Apr 6 02:50:03 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 02:50:03 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 02:50:03 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 02:50:03 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 02:50:03 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 02:50:03 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 02:50:03 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 02:50:03 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 02:50:03 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 02:50:04 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 02:50:04 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 02:50:04 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 02:50:04 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 02:50:04 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 02:50:04 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 02:50:04 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 02:50:04 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 02:50:05 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 02:50:05 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 02:50:05 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 02:50:05 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 02:50:05 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 02:50:05 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 02:50:05 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 02:50:05 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 02:50:06 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 02:50:06 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not ounos
Apr 6 02:50:06 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 02:50:06 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 02:50:06 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 02:50:06 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 07:00:05 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 07:00:05 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 07:00:06 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 07:00:08 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 07:00:09 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 07:00:09 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 07:00:10 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 07:00:10 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 07:00:11 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 07:00:12 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 07:00:12 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 07:00:12 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 07:00:13 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 07:00:14 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 07:00:17 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 07:00:18 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 07:00:18 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 07:00:18 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 07:00:21 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 07:00:21 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 07:00:22 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 07:00:22 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 07:00:22 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 07:00:23 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 07:00:23 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 07:00:24 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 07:00:25 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 07:00:26 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 07:00:26 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 07:00:26 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 07:00:27 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 07:00:27 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 07:00:27 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 07:00:28 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 07:00:28 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 07:00:29 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
Apr 6 07:00:29 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
not found (pos 1)
So here's what happened..
At around 2:50am on Apr 6 a standard sql query to insert something into
the database hung. From this point on, the process consumed 100% of
one of the two cpus. Any attempt to strace the process would cause
strace to immediately hang. All attempts to kill the process failed.
The system needed to be up for business reasons so it was decided that
the fastest way to recover postgres would be to reboot the system. The
system shutdown without any trouble. At boot, postgres failed to
start.. the following are log messages from that event:
2006-04-06 11:59:33.410 PDT PANIC: could not count blocks of relation
1663/16815/17215: Permission denied
2006-04-06 11:59:33.413 PDT LOG: startup process (PID 3898) was
terminated by signal 6
2006-04-06 11:59:33.413 PDT LOG: aborting startup due to startup
process failure
2006-04-06 11:59:33.419 PDT LOG: logger shutting down
We began to look at the data files in postgres' data directory. We
found two files that responded with a "permission denied" message for
any attempt to browse or manipulate them (using programs such as ls,
stat, cp, etc.) After browsing forums and newsgroups we concluded that
this behavior may be indicative of a filesystem problem. An example of
one of the files exhibiting the behavior:
find: ./data/base/16815/17215.15: Permission denied
The partition was unmounted from the system. We started an fsck using
the following command:
reiserfsck --check --logfile /root/postgres-gwar-20060407.log
/dev/evms/postgres
Unfortunately, this fsck was cutoff by a dropped ssh connection and the
fsck wasn't running within screen (oops). Based on the progress
indicator, it seems that the check was about 2/3 of the way through the
process. The following is the contents of the logfile:
tyler@gwar ~> sudo cat /root/postgres-fsck-20060407.log
bad_path: The right delimiting key [124 982 0x3c001 IND (1)] of the
node (227226375) must be greater than the last (6) element's key [124
984 0x1 IND (1)] within the node.
vpf-10630: The on-disk and the correct bitmaps differs. Will be fixed
later.
A second fsck was started (within screen) using the following command:
reiserfsck --fix-fixable /dev/evms/postgres
Unfortunately I don't have the verbatim output of this command. The
command completed successfully and found one problem which could only
be fixed by running the fsck with --rebuild-tree.
We made a backup of the data minus the two files with permission denied
problems to a different system. We used this backup to start a
parallel recovery effort to restore postgres on a different system
using all of the data except the two broken files which couldn't be
backed up.
A third reiserfsck was started using the following command:
reiserfsck --rebuild-tree /dev/evms/postgres
The following is the output from that command thus far:
Will rebuild the filesystem (/dev/evms/postgres) tree
Will put log info to 'stdout'
Do you want to run this program?[N/Yes] (note need to type Yes if you
do):Yes
Replaying journal..
Reiserfs journal '/dev/evms/postgres' in blocks [18..8211]: 0
transactions replayed
###########
reiserfsck --rebuild-tree started at Sat Apr 8 07:57:05 2006
###########
Pass 0:
####### Pass 0 #######
Loading on-disk bitmap .. ok, 264008904 blocks marked used
Skipping 19386 blocks (super block, journal, bitmaps) 263989518 blocks
will be read
0%....block 162165170: The free space (6) is incorrect, should be
(4072) - corrected
block 162165260: The free space (3) is incorrect, should be (4072) -
corrected
block 162165340: The free space (3) is incorrect, should be (4072) -
corrected
block 225738748: The number of items (5055) is incorrect, should be (1)
- corrected
block 225738748: The free space (14) is incorrect, should be (13) -
corrected
pass0: vpf-10110: block 225738748, item (0): Unknown item type found
[1310733 1188392 0x1a7445900000000 ??? (4)] - deleted
left 0,
20739 /sec
1424 directory entries were hashed with "r5" hash.
"r5" hash is selected
Flushing..finished
Read blocks (but not data blocks) 263989518
Leaves among those 95376536
- leaves all contents of which could not be
saved and deleted 4
Objectids found 1426
Pass 1 (will try to insert 95376532 leaves):
####### Pass 1 #######
Looking for allocable blocks .. finished
0%....20%....40%.
left 0, 3581 /seccc
Flushing..finished
95376532 leaves read
95376530 inserted
2 not inserted
####### Pass 2 #######
Pass 2:
0%....20%....40%....60%... left
1, 0 /sec
The process has been "stuck" at this progress for more than 24 hours.
The reiserfsck process seems to be using 100% of one of the CPUs but
very little I/O activity is happening. The following is output about
system resources from various commands:
tyler@gwar ~> vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system--
----cpu----
r b swpd free buff cache si so bi bo in cs us
sy id wa
1 0 0 15491440 700792 31092 0 0 51 0 2 20 29
6 62 3
1 0 0 15491068 701080 31076 0 0 128 72 272 32 50
0 50 0
1 0 0 15490820 701328 31100 0 0 128 0 257 26 50
0 50 0
1 0 0 15490696 701536 31096 0 0 128 0 260 24 50
0 50 0
1 0 0 15491564 700620 31196 0 0 128 0 262 27 50
0 50 0
1 0 0 15491440 700664 31152 0 0 0 0 257 23 50
0 50 0
1 0 0 15491192 700900 31188 0 0 72 76 275 57 50
0 50 0
tyler@gwar ~> ps aux | grep reiserfsck
root 9835 89.9 1.0 183296 180200 pts/1 R+ Apr08 2933:14
reiserfsck --rebuild-tree /dev/evms/postgres
tyler@gwar ~> top
top - 14:17:26 up 4 days, 2:20, 1 user, load average: 1.00, 1.00,
1.00
Tasks: 159 total, 2 running, 157 sleeping, 0 stopped, 0 zombie
Cpu0 : 100.0% us, 0.0% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.0% hi,
0.0% si
Cpu1 : 0.0% us, 0.0% sy, 0.0% ni, 100.0% id, 0.0% wa, 0.0% hi,
0.0% si
Mem: 16508948k total, 1018516k used, 15490432k free, 701212k buffers
Swap: 8388596k total, 0k used, 8388596k free, 31216k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
9835 root 25 0 179m 175m 688 R 99.7 1.1 2933:53 reiserfsck
1 root 15 0 1588 516 452 S 0.0 0.0 0:00.75 init
tyler@gwar ~> free
total used free shared buffers
cached
Mem: 16508948 1042440 15466508 0 702812
55116
-/+ buffers/cache: 284512 16224436
Swap: 8388596 0 8388596
I/O activity according to iostat. The /dev/evms/postgres device lives
on the six devices sdb through sdg.
tyler@gwar ~> iostat 1 -k
Linux 2.6.13.1-20050914 (gwar.savagebeast.com) 04/10/06
... output truncated ...
avg-cpu: %user %nice %sys %iowait %idle
50.00 0.00 0.00 0.00 50.00
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 0.00 0.00 0.00 0 0
sdb 1.00 24.00 0.00 24 0
sdc 0.00 0.00 0.00 0 0
sdd 1.00 8.00 0.00 8 0
sde 1.00 32.00 0.00 32 0
sdf 1.00 32.00 0.00 32 0
sdg 1.00 32.00 0.00 32 0
sdh 0.00 0.00 0.00 0 0
I'm not convinced that the process is hung because an strace of the
process reveals that it's actively making system calls. It crossed my
mind that the fsck may be looping.. this is supported by the fact that
CPU utilization is high while I/O is minimal. However, for what it's
worth, in a test of 65,000 lines of strace output there wasn't a single
duplicate line... if it is looping, it's not making identical calls
over and over.
The following is 100 lines of an strace of the process.
_llseek(3, 242647523328, [242647523328], SEEK_SET) = 0
write(3, "\1\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0|\0\0\0"...,
4096) = 4096
_llseek(3, 242647527424, [242647527424], SEEK_SET) = 0
write(3, "\1\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0|\0\0\0"...,
4096) = 4096
_llseek(3, 242647531520, [242647531520], SEEK_SET) = 0
write(3, "\1\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0|\0\0\0"...,
4096) = 4096
_llseek(3, 242647535616, [242647535616], SEEK_SET) = 0
write(3, "\1\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0|\0\0\0"...,
4096) = 4096
_llseek(3, 242647539712, [242647539712], SEEK_SET) = 0
write(3, "\1\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0|\0\0\0"...,
4096) = 4096
_llseek(3, 242647474176, [242647474176], SEEK_SET) = 0
write(3, "\1\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0|\0\0\0"...,
4096) = 4096
_llseek(3, 242647543808, [242647543808], SEEK_SET) = 0
write(3, "\1\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0|\0\0\0"...,
4096) = 4096
_llseek(3, 242647547904, [242647547904], SEEK_SET) = 0
write(3, "\1\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0|\0\0\0"...,
4096) = 4096
_llseek(3, 242647552000, [242647552000], SEEK_SET) = 0
write(3, "\1\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0|\0\0\0"...,
4096) = 4096
_llseek(3, 242647560192, [242647560192], SEEK_SET) = 0
write(3, "\1\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0|\0\0\0"...,
4096) = 4096
_llseek(3, 1169951559680, [1169951559680], SEEK_SET) = 0
read(3, "\1\0\1\0`\0 \0\200\0\0\0\200\240w\316\0\0\0\0\6\0\0\0|"...,
4096) = 4096
_llseek(3, 1169951563776, [1169951563776], SEEK_SET) = 0
read(3, "\1\0\1\0`\0 \0\200\0\0\0\200\20\201\311\0\0\0\0\16\0\21"...,
4096) = 4096
_llseek(3, 1169951567872, [1169951567872], SEEK_SET) = 0
read(3, "\1\0\1\0`\0 \0\200\0\0\0\200 v\313\0\0\0\0\0\0\21\300|"...,
4096) = 4096
_llseek(3, 1169951571968, [1169951571968], SEEK_SET) = 0
read(3, "\1\0\1\0`\0 \0\200\0\0\0\200P\351\344\0\0\0\0\0\0\0\0|"...,
4096) = 4096
_llseek(3, 1169951576064, [1169951576064], SEEK_SET) = 0
read(3, "\1\0\1\0`\0 \0\200\0\0\0\200\260\223\331\0\0\0\0\0\0\0"...,
4096) = 4096
_llseek(3, 1169951580160, [1169951580160], SEEK_SET) = 0
read(3, "\1\0\1\0`\0 \0\200\0\0\0\200\0B\314\0\0\0\0\34\0\21\300"...,
4096) = 4096
_llseek(3, 1169951584256, [1169951584256], SEEK_SET) = 0
read(3, "\1\0\1\0`\0 \0\200\0\0\0\200@\26\333\0\0\0\0\v\0\21\300"...,
4096) = 4096
_llseek(3, 1169951588352, [1169951588352], SEEK_SET) = 0
read(3, "\1\0\1\0`\0 \0\200\0\0\0\200P:\327\0\0\0\0\20\0\21\300"...,
4096) = 4096
_llseek(3, 1169951592448, [1169951592448], SEEK_SET) = 0
read(3, "\1\0\1\0`\0 \0\200\0\0\0\200\300N\350\0\0\0\0\25\0\21\300"...,
4096) = 4096
_llseek(3, 1169951596544, [1169951596544], SEEK_SET) = 0
read(3, "\1\0\1\0`\0 \0\200\0\0\0\200`\23\323\0\0\0\0\0\0\322\r"...,
4096) = 4096
_llseek(3, 1169951600640, [1169951600640], SEEK_SET) = 0
read(3, "\1\0\1\0`\0 \0\200\0\0\0\200\0206\312\0\0\0\0\0\0\0\0|"...,
4096) = 4096
_llseek(3, 1169951604736, [1169951604736], SEEK_SET) = 0
read(3, "\1\0\1\0`\0 \0\200\0\0\0\200\340&\324\0\0\0\0\20\0\21\300"...,
4096) = 4096
_llseek(3, 1169951608832, [1169951608832], SEEK_SET) = 0
read(3, "\1\0\1\0`\0 \0\200\0\0\0\200\320\335\313\0\0\0\0\0\0\21"...,
4096) = 4096
_llseek(3, 1169951612928, [1169951612928], SEEK_SET) = 0
read(3, "\1\0\1\0`\0 \0\334\1\0\0\334\1\257\324\0\0\0\0L\0\21\300"...,
4096) = 4096
_llseek(3, 1169951617024, [1169951617024], SEEK_SET) = 0
read(3, "\1\0\1\0`\0 \0\200\0\0\0\200\20+\366\0\0\0\0\36\0\21\300"...,
4096) = 4096
_llseek(3, 1169951621120, [1169951621120], SEEK_SET) = 0
read(3, "\1\0\1\0`\0 \0\200\0\0\0\200@\354\324\0\0\0\0\1\0\377\377"...,
4096) = 4096
_llseek(3, 242647564288, [242647564288], SEEK_SET) = 0
write(3, "\1\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0|\0\0\0"...,
4096) = 4096
_llseek(3, 242647568384, [242647568384], SEEK_SET) = 0
write(3, "\1\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0|\0\0\0"...,
4096) = 4096
_llseek(3, 242647572480, [242647572480], SEEK_SET) = 0
write(3, "\1\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0|\0\0\0"...,
4096) = 4096
_llseek(3, 242647576576, [242647576576], SEEK_SET) = 0
write(3, "\1\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0|\0\0\0"...,
4096) = 4096
_llseek(3, 242647580672, [242647580672], SEEK_SET) = 0
write(3, "\1\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0|\0\0\0"...,
4096) = 4096
_llseek(3, 242647515136, [242647515136], SEEK_SET) = 0
write(3, "\1\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0|\0\0\0"...,
4096) = 4096
_llseek(3, 242647584768, [242647584768], SEEK_SET) = 0
write(3, "\1\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0|\0\0\0"...,
4096) = 4096
_llseek(3, 242647588864, [242647588864], SEEK_SET) = 0
write(3, "\1\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0|\0\0\0"...,
4096) = 4096
_llseek(3, 242647592960, [242647592960], SEEK_SET) = 0
write(3, "\1\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0|\0\0\0"...,
4096) = 4096
_llseek(3, 242647601152, [242647601152], SEEK_SET) = 0
write(3, "\1\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0|\0\0\0"...,
4096) = 4096
_llseek(3, 242647605248, [242647605248], SEEK_SET) = 0
write(3, "\1\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0|\0\0\0"...,
4096) = 4096
_llseek(3, 242647609344, [242647609344], SEEK_SET) = 0
write(3, "\1\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0|\0\0\0"...,
4096) = 4096
_llseek(3, 242647613440, [242647613440], SEEK_SET) = 0
write(3, "\1\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0|\0\0\0"...,
4096) = 4096
_llseek(3, 242647617536, [242647617536], SEEK_SET) = 0
write(3, "\1\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0|\0\0\0"...,
4096) = 4096
_llseek(3, 1169951625216, [1169951625216], SEEK_SET) = 0
read(3, "\1\0\1\0`\0 \0\200\0\0\0\200\200\374\315\0\0\0\0\0\0\21"...,
4096) = 4096
_llseek(3, 1169951629312, [1169951629312], SEEK_SET) = 0
read(3, "\1\0\1\0`\0 \0\200\0\0\0\200\0\276\356\0\0\0\0\0\0\21\300"...,
4096) = 4096
_llseek(3, 1169951633408, [1169951633408], SEEK_SET) = 0
read(3, "\1\0\1\0`\0 \0\200\0\0\0\200@\337\354\0\0\0\0\1\0\30\0"...,
4096) = 4096
_llseek(3, 1169951637504, [1169951637504], SEEK_SET) = 0
read(3, "\1\0\1\0`\0 \0\200\0\0\0\200\360\245\351\0\0\0\0\24\0\0"...,
4096) = 4096
_llseek(3, 1169951641600, [1169951641600], SEEK_SET) = 0
read(3, "\1\0\1\0`\0 \0\200\0\0\0\200\320G\315\0\0\0\0\36\0\21\300"...,
4096) = 4096
_llseek(3, 1169951645696, [1169951645696], SEEK_SET) = 0
read(3, "\1\0\1\0`\0 \0\200\0\0\0\200\200\267\327\0\0\0\0\6\0\21"...,
4096) = 4096
_llseek(3, 1169951649792, [1169951649792], SEEK_SET) = 0
read(3, "\1\0\1\0`\0 \0\200\0\0\0\200\0\322\356\0\0\0\0\n\0\21\300"...,
4096) = 4096
_llseek(3, 1169951653888, [1169951653888], SEEK_SET) = 0
read(3, "\1\0\1\0`\0 \0\200\0\0\0\200 c\352\0\0\0\0\36\0\332\360"...,
4096) = 4096
_llseek(3, 1169951657984, [1169951657984], SEEK_SET) = 0
read(3, "\1\0\1\0`\0 \0\200\0\0\0\200\300\203\355\0\0\0\0\0\0\21"...,
4096) = 4096
_llseek(3, 1169951662080, [1169951662080], SEEK_SET) = 0
read(3, "\1\0\1\0`\0 \0\200\0\0\0\200\20\266\312\0\0\0\0\0\0\0\0"...,
4096) = 4096
The following are the file descriptors of process 9835 (the reiserfsck
process) according to /proc:
tyler@gwar ~> sudo ls -laF /proc/9835/fd/
total 5
dr-x------ 2 root root 0 Apr 10 16:24 ./
dr-xr-xr-x 5 root root 0 Apr 10 15:41 ../
lrwx------ 1 root root 64 Apr 10 16:24 0 -> /dev/pts/1
lrwx------ 1 root root 64 Apr 10 16:24 1 -> /dev/pts/1
lrwx------ 1 root root 64 Apr 10 16:24 2 -> /dev/pts/1
lrwx------ 1 root root 64 Apr 10 16:24 3 ->
/dev/evms/postgres
lrwx------ 1 root root 64 Apr 10 16:24 4 ->
/dev/evms/postgres
The parallel recovery effort to bring up postgres on different hardware
using the backup has proceeded smoothly. As a result, if the fsck has
in fact croaked it's not devastating.. but it is troublesome.
1. Is it normal for pass 2 of the --rebuild-tree operation to stay at
the same "progress" for more than 24 hours?
2. Is it normal for the reiserfsck operation to use significant CPU
resources while I/O activity is minimal?
3. For a filesystem of such size, is it reasonable for the fsck to
take many days to complete?
4. How long do you recommend that I leave it in this state before
aborting the fsck?
I've tried to provide as much useful information as I can think of.. if
there's additional information that you need, I'll do my best to
provide it.
Thank you for your help.
-Tyler
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Bug report: reiserfsck --rebuild-tree not progressing
2006-04-11 0:34 Bug report: reiserfsck --rebuild-tree not progressing Tyler Phelps
@ 2006-04-11 7:21 ` Sander
2006-04-11 8:12 ` Tyler Phelps
2006-04-12 15:35 ` Ard van Breemen
1 sibling, 1 reply; 10+ messages in thread
From: Sander @ 2006-04-11 7:21 UTC (permalink / raw)
To: System Administrators; +Cc: reiserfs-list
Tyler Phelps wrote (ao):
> Package: reiserfsprogs
> Version: 1:3.6.17-2
Can you try a newer version?
ftp://ftp.namesys.com/pub/reiserfsprogs/reiserfsprogs-3.6.19.tar.gz
According to http://marc.theaimsgroup.com/?t=114235160800008&r=1&w=2
3.6.20 also exist, but I cant find it.
Good luck, Sander
--
Humilis IT Services and Solutions
http://www.humilis.net
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Bug report: reiserfsck --rebuild-tree not progressing
2006-04-11 7:21 ` Sander
@ 2006-04-11 8:12 ` Tyler Phelps
2006-04-11 9:24 ` Vladimir V. Saveliev
2006-04-11 11:22 ` Konstantin Münning
0 siblings, 2 replies; 10+ messages in thread
From: Tyler Phelps @ 2006-04-11 8:12 UTC (permalink / raw)
To: sander; +Cc: System Administrators, reiserfs-list
My questions (and all of the diagnostic information provided) revolve
around a specific process of reiserfsck, using the version that I
specified, which is still running. The only way that I can try a new
version is to abort the current fsck operation... doing that
essentially invalidates all of the questions that I've asked.
I'm reluctant to abort the current process. My primary reason for
this is that I have no way of knowing if aborting the current fsck
will cause further damage. After all, the man page states, "Once
reiserfsck --rebuild-tree is started it must finish its work (and you
should not interrupt it), otherwise the filesystem will be left in
the unmountable state to avoid subsequent data corruptions." Second,
I don't know if anything is even wrong with the way that things are
"progressing" with the current process... hence the reason for my
original questions.
-Tyler
On Apr 11, 2006, at 12:21 AM, Sander wrote:
> Tyler Phelps wrote (ao):
>> Package: reiserfsprogs
>> Version: 1:3.6.17-2
>
> Can you try a newer version?
> ftp://ftp.namesys.com/pub/reiserfsprogs/reiserfsprogs-3.6.19.tar.gz
>
> According to http://marc.theaimsgroup.com/?t=114235160800008&r=1&w=2
> 3.6.20 also exist, but I cant find it.
>
> Good luck, Sander
>
> --
> Humilis IT Services and Solutions
> http://www.humilis.net
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Bug report: reiserfsck --rebuild-tree not progressing
2006-04-11 8:12 ` Tyler Phelps
@ 2006-04-11 9:24 ` Vladimir V. Saveliev
2006-04-11 14:34 ` Bernhard Sadlowski
2006-04-11 11:22 ` Konstantin Münning
1 sibling, 1 reply; 10+ messages in thread
From: Vladimir V. Saveliev @ 2006-04-11 9:24 UTC (permalink / raw)
To: Tyler Phelps
Cc: Vitaly P. Fertman, sander, System Administrators, reiserfs-list
Hello
On Tue, 2006-04-11 at 01:12 -0700, Tyler Phelps wrote:
> My questions (and all of the diagnostic information provided) revolve
> around a specific process of reiserfsck, using the version that I
> specified, which is still running. The only way that I can try a new
> version is to abort the current fsck operation... doing that
> essentially invalidates all of the questions that I've asked.
>
> 1. Is it normal for pass 2 of the --rebuild-tree operation to stay
> at the same "progress" for more than 24 hours?
> 2. Is it normal for the reiserfsck operation to use significant CPU
> resources while I/O activity is minimal?
1-2: It is not normal but possible. Old reiserfsck (like the one you are
using) was extremly inefficient in some (rare) kind of corruptions).
> 3. For a filesystem of such size, is it reasonable for the fsck to
> take many days to complete?
Unfortunately, currently, reiserfsck takes a long time on big
filesystems.
> 4. How long do you recommend that I leave it in this state before
> aborting the fsck?
I would recommend you to abort it and try
ftp://ftp.namesys.com/pub/tmp/reiserfsprogs-3.6.20.tar.gz
> I'm reluctant to abort the current process. My primary reason for
> this is that I have no way of knowing if aborting the current fsck
> will cause further damage. After all, the man page states, "Once
> reiserfsck --rebuild-tree is started it must finish its work (and you
> should not interrupt it), otherwise the filesystem will be left in
> the unmountable state to avoid subsequent data corruptions." Second,
> I don't know if anything is even wrong with the way that things are
> "progressing" with the current process... hence the reason for my
> original questions.
>
> -Tyler
>
> On Apr 11, 2006, at 12:21 AM, Sander wrote:
>
> > Tyler Phelps wrote (ao):
> >> Package: reiserfsprogs
> >> Version: 1:3.6.17-2
> >
> > Can you try a newer version?
> > ftp://ftp.namesys.com/pub/reiserfsprogs/reiserfsprogs-3.6.19.tar.gz
> >
> > According to http://marc.theaimsgroup.com/?t=114235160800008&r=1&w=2
> > 3.6.20 also exist, but I cant find it.
> >
> > Good luck, Sander
> >
> > --
> > Humilis IT Services and Solutions
> > http://www.humilis.net
>
>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Bug report: reiserfsck --rebuild-tree not progressing
2006-04-11 8:12 ` Tyler Phelps
2006-04-11 9:24 ` Vladimir V. Saveliev
@ 2006-04-11 11:22 ` Konstantin Münning
1 sibling, 0 replies; 10+ messages in thread
From: Konstantin Münning @ 2006-04-11 11:22 UTC (permalink / raw)
To: Tyler Phelps; +Cc: reiserfs-list, sysadmin
Hi!
I had the same problem about a year ago with a 0.8TB drive, you may
check some list archives for the details.
The solution was a patch to the reiserfsprogs which was then
incorporated in version 3.6.19. I am not familiar with the details as I
only supplied the information and Vladimir did the work but I can only
repeat what was said - abort current fsck and retry with latest tools
(3.6.19 should be sufficent, I can't tell about 3.6.20).
As for your concerns, it's correct that when you abort the current fsck
it will result in an unmountable FS but you can repair it with another
run. I doubt there is any way to make the current fsck to contnue except
maybe some weird magic hack into the running program ;-). If there are
chances to loose data in this process - I'm not the expert but I think
an abort is not that critical - at least at that stage where reiserfsck
is in an endless loop. With my problem I had some minor data corruption
issues afterwards but I think that was because of the primary fault -
the RAID controller had RAM errors and the RAID consistency was broken
which resulted in the corrupted FS. Then after several fsck tries,
superblock reconstruction etc. I was surprised how much was still intact
(far less than 1% of data was corrupted) but I think this doesn't apply
to you as it's not any guarantee.
Have a nice day,
Konstantin
Tyler Phelps wrote:
> My questions (and all of the diagnostic information provided) revolve
> around a specific process of reiserfsck, using the version that I
> specified, which is still running. The only way that I can try a new
> version is to abort the current fsck operation... doing that
> essentially invalidates all of the questions that I've asked.
>
> I'm reluctant to abort the current process. My primary reason for this
> is that I have no way of knowing if aborting the current fsck will
> cause further damage. After all, the man page states, "Once reiserfsck
> --rebuild-tree is started it must finish its work (and you should not
> interrupt it), otherwise the filesystem will be left in the unmountable
> state to avoid subsequent data corruptions." Second, I don't know if
> anything is even wrong with the way that things are "progressing" with
> the current process... hence the reason for my original questions.
>
> -Tyler
>
> On Apr 11, 2006, at 12:21 AM, Sander wrote:
>
>> Tyler Phelps wrote (ao):
>>
>>> Package: reiserfsprogs
>>> Version: 1:3.6.17-2
>>
>>
>> Can you try a newer version?
>> ftp://ftp.namesys.com/pub/reiserfsprogs/reiserfsprogs-3.6.19.tar.gz
>>
>> According to http://marc.theaimsgroup.com/?t=114235160800008&r=1&w=2
>> 3.6.20 also exist, but I cant find it.
>>
>> Good luck, Sander
>>
>> --
>> Humilis IT Services and Solutions
>> http://www.humilis.net
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Bug report: reiserfsck --rebuild-tree not progressing
2006-04-11 9:24 ` Vladimir V. Saveliev
@ 2006-04-11 14:34 ` Bernhard Sadlowski
2006-04-11 15:23 ` Vladimir V. Saveliev
0 siblings, 1 reply; 10+ messages in thread
From: Bernhard Sadlowski @ 2006-04-11 14:34 UTC (permalink / raw)
To: Vladimir V. Saveliev; +Cc: reiserfs-list
On 11 Apr 2006 13:24, Vladimir V. Saveliev <vs@namesys.com> wrote:
> I would recommend you to abort it and try
> ftp://ftp.namesys.com/pub/tmp/reiserfsprogs-3.6.20.tar.gz
Why is it in /pub/tmp? Is this an official new version?
This version compiles with no Problems on Debian 3.1 with gcc 3.3.5, but
compiling on Debian 3.0 (woody, old stable distribution, gcc 2.95.4) does
not work, even thogh 3.6.19 had no still compile problems:
$ make
[...]
gcc -DHAVE_CONFIG_H -I. -I. -I../.. -I../../include -Wall -O2 -fPIC -DPIC
-Wall -O2 -fPIC -DPIC -c -o debugfs_reiserfs-corruption.o test -f
'corruption.c' || echo './'corruption.c
corruption.c: In function do_bitmap_corruption':
corruption.c:892: parse error before *'
corruption.c:907: i' undeclared (first use in this function)
corruption.c:907: (Each undeclared identifier is reported only once
corruption.c:907: for each function it appears in.)
corruption.c:910: bh' undeclared (first use in this function)
make[3]: *** [debugfs_reiserfs-corruption.o] Error 1
make[3]: Leaving directory /usr/src/reiser/reiserfsprogs-3.6.20/utils/debugfs'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory /usr/src/reiser/reiserfsprogs-3.6.20/utils'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory /usr/src/reiser/reiserfsprogs-3.6.20'
make: *** [all] Error 2
$ gcc -v
Reading specs from /usr/lib/gcc-lib/i386-linux/2.95.4/specs
gcc version 2.95.4 20011002 (Debian prerelease)
Bernhard
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Bug report: reiserfsck --rebuild-tree not progressing
2006-04-11 14:34 ` Bernhard Sadlowski
@ 2006-04-11 15:23 ` Vladimir V. Saveliev
2006-04-11 15:40 ` Bernhard Sadlowski
0 siblings, 1 reply; 10+ messages in thread
From: Vladimir V. Saveliev @ 2006-04-11 15:23 UTC (permalink / raw)
To: Bernhard Sadlowski; +Cc: reiserfs-list
[-- Attachment #1: Type: text/plain, Size: 1688 bytes --]
Hello
On Tue, 2006-04-11 at 16:34 +0200, Bernhard Sadlowski wrote:
> On 11 Apr 2006 13:24, Vladimir V. Saveliev <vs@namesys.com> wrote:
> > I would recommend you to abort it and try
> > ftp://ftp.namesys.com/pub/tmp/reiserfsprogs-3.6.20.tar.gz
>
> Why is it in /pub/tmp? Is this an official new version?
>
not yet. we would like it to get more testing.
> This version compiles with no Problems on Debian 3.1 with gcc 3.3.5, but
> compiling on Debian 3.0 (woody, old stable distribution, gcc 2.95.4) does
> not work, even thogh 3.6.19 had no still compile problems:
>
> $ make
> [...]
> gcc -DHAVE_CONFIG_H -I. -I. -I../.. -I../../include -Wall -O2 -fPIC -DPIC
> -Wall -O2 -fPIC -DPIC -c -o debugfs_reiserfs-corruption.o test -f
> 'corruption.c' || echo './'corruption.c
> corruption.c: In function do_bitmap_corruption':
> corruption.c:892: parse error before *'
> corruption.c:907: i' undeclared (first use in this function)
> corruption.c:907: (Each undeclared identifier is reported only once
> corruption.c:907: for each function it appears in.)
> corruption.c:910: bh' undeclared (first use in this function)
> make[3]: *** [debugfs_reiserfs-corruption.o] Error 1
> make[3]: Leaving directory /usr/src/reiser/reiserfsprogs-3.6.20/utils/debugfs'
> make[2]: *** [all-recursive] Error 1
> make[2]: Leaving directory /usr/src/reiser/reiserfsprogs-3.6.20/utils'
> make[1]: *** [all-recursive] Error 1
> make[1]: Leaving directory /usr/src/reiser/reiserfsprogs-3.6.20'
> make: *** [all] Error 2
>
would you try if this patch helps, please?
> $ gcc -v
> Reading specs from /usr/lib/gcc-lib/i386-linux/2.95.4/specs
> gcc version 2.95.4 20011002 (Debian prerelease)
>
> Bernhard
>
[-- Attachment #2: patch --]
[-- Type: text/plain, Size: 362 bytes --]
--- ./utils/debugfs/corruption.c~ 2006-03-28 16:40:23.000000000 +0400
+++ ./utils/debugfs/corruption.c 2006-04-11 19:21:39.000000000 +0400
@@ -888,7 +888,7 @@
unsigned long first = fs->fs_super_bh->b_blocknr + 1;
unsigned long nr_bitmap_to_corrupt;
- unsigned long block, bmap_nr;;
+ unsigned long block, bmap_nr;
reiserfs_bh_t * bh;
unsigned int i;
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Bug report: reiserfsck --rebuild-tree not progressing
2006-04-11 15:23 ` Vladimir V. Saveliev
@ 2006-04-11 15:40 ` Bernhard Sadlowski
0 siblings, 0 replies; 10+ messages in thread
From: Bernhard Sadlowski @ 2006-04-11 15:40 UTC (permalink / raw)
To: Vladimir V. Saveliev; +Cc: reiserfs-list
Hi,
On 11 Apr 2006 19:23, Vladimir V. Saveliev <vs@namesys.com> wrote:
> would you try if this patch helps, please?
> --- ./utils/debugfs/corruption.c~ 2006-03-28 16:40:23.000000000 +0400
> +++ ./utils/debugfs/corruption.c 2006-04-11 19:21:39.000000000 +0400
> @@ -888,7 +888,7 @@
>
> unsigned long first = fs->fs_super_bh->b_blocknr + 1;
> unsigned long nr_bitmap_to_corrupt;
> - unsigned long block, bmap_nr;;
> + unsigned long block, bmap_nr;
> reiserfs_bh_t * bh;
> unsigned int i;
This works now. Thanks!
Bernhard
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Bug report: reiserfsck --rebuild-tree not progressing
2006-04-11 0:34 Bug report: reiserfsck --rebuild-tree not progressing Tyler Phelps
2006-04-11 7:21 ` Sander
@ 2006-04-12 15:35 ` Ard van Breemen
2006-04-13 9:48 ` Ard van Breemen
1 sibling, 1 reply; 10+ messages in thread
From: Ard van Breemen @ 2006-04-12 15:35 UTC (permalink / raw)
To: System Administrators; +Cc: reiserfs-list
Hi,
On Mon, Apr 10, 2006 at 05:34:58PM -0700, Tyler Phelps wrote:
> The following are syslog messages from the kernel. The filesystem
> trouble began at about 2:50am which matches the following log entries:
>
> Apr 6 02:50:03 gwar kernel: ReiserFS: dm-12: warning: vs-13060:
> reiserfs_update_sd: stat data of object [124 983 0x0 SD] (nlink == 1)
> not found (pos 1)
Have you been able to determine why it happened?
We also have opterons and got this:
ReiserFS: sdb9: warning: vs-13060: reiserfs_update_sd: stat data of object [2 12 0x0 SD] (nlink == 3) not found (pos 1)
ReiserFS: sdb9: warning: vs-13060: reiserfs_update_sd: stat data of object [2 12 0x0 SD] (nlink == 3) not found (pos 1)
ReiserFS: sdb9: warning: vs-13060: reiserfs_update_sd: stat data of object [2 12 0x0 SD] (nlink == 3) not found (pos 1)
And then finally some sort of kernel panic.
And it is repeatable.
Kernel: 2.6.16.1 for AMD64
Kernel: 2.6.17rc1 for AMD64 (Yes, tried both)
Compiled with gcc-3.3 on an AMD64 platform
userspace is 32 bits debian
Hardware:
tyan s2891 (gt24 system) with nforce4 and 4 sata 1 drives.
success:
Run a lot of bonnie++'s
consistent failure:
(Tried as 4 seperate disks of 0.4T and as one raid5 partition of 1.1T)
After 2 hours of pumping a few million files onto the machine
reiserfs starts putting out these warnings (a few thousands):
^MReiserFS: sdb9: warning: vs-13060: reiserfs_update_sd: stat data of object [2 12 0x0 SD] (nlink == 3) not found (pos 1)
And then the panic for a 2.6.16.1 kernel:
NMI Watchdog detected LOCKUP on CPU 1
CPU 1
Modules linked in: ipv6 tg3 nfsd exportfs nfs lockd sunrpc
Pid: 15215, comm: tar Not tainted 2.6.16.1-tyan-s2891 #1
RIP: 0010:[<ffffffff8037903d>] <ffffffff8037903d>{.text.lock.spinlock+22}
RSP: 0000:ffff81012fc9fc48 EFLAGS: 00000086
RAX: ffff8100cbcfec78 RBX: ffff8100cbcfec70 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff8100cbcfec70
RBP: ffff81012fc9fc88 R08: 00000000000200a3 R09: 0000000000000640
R10: 0000000000010c0c R11: 00000000000005b4 R12: 0000000000000000
R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000001
FS: 0000000000000000(0000) GS:ffff81012fc5fbc0(0063) knlGS:00000000f7e88080
CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 00000000f7f7dfbc CR3: 0000000042d8e000 CR4: 00000000000006e0
Process tar (pid: 15215, threadinfo ffff81002891c000, task ffff81007e2860c0)
Stack: 0000000000000296 ffffffff801280ed 0000000100000000 ffff810116728700
ffff81002ff90780 00000000ffffffff 0000000000000000 ffff81003d8de834
ffff810116728700 ffffffff80315604
Call Trace: <IRQ> <ffffffff801280ed>{__wake_up+45} <ffffffff80315604>{sock_def_readable+52}
<ffffffff80349cce>{tcp_data_queue+894} <ffffffff8034b236>{tcp_rcv_established+1638}
<ffffffff80352883>{tcp_v4_do_rcv+35} <ffffffff80352ef7>{tcp_v4_rcv+1431}
<ffffffff8031c69f>{dev_queue_xmit+559} <ffffffff80337872>{ip_local_deliver+402}
<ffffffff80337e3c>{ip_rcv+1244} <ffffffff8031cb8c>{netif_receive_skb+508}
<ffffffff880b9109>{:tg3:tg3_rx+905} <ffffffff880b927c>{:tg3:tg3_poll+140}
<ffffffff8031cd84>{net_rx_action+132} <ffffffff80133c21>{__do_softirq+97}
<ffffffff8010bef2>{call_softirq+30} <ffffffff8010dc11>{do_softirq+49}
<ffffffff8010dbc7>{do_IRQ+71} <ffffffff8010b250>{ret_from_intr+0} <EOI>
<ffffffff8021f9b1>{memmove+49} <ffffffff801ce065>{leaf_paste_entries+245}
<ffffffff801cc0ab>{leaf_copy_dir_entries+619} <ffffffff801cc48a>{leaf_copy_boundary_item+970}
<ffffffff80143ad8>{wake_up_bit+24} <ffffffff801cce3e>{leaf_copy_items+78}
<ffffffff801cd150>{leaf_move_items+80} <ffffffff801cd1de>{leaf_shift_left+62}
<ffffffff801b85a1>{balance_leaf_when_delete+865} <ffffffff801b865d>{balance_leaf+93}
<ffffffff80143ad8>{wake_up_bit+24} <ffffffff801d8dd8>{reiserfs_prepare_for_journal+88}
<ffffffff801babed>{do_balance+141} <ffffffff801c75be>{fix_nodes+590}
<ffffffff801d2113>{reiserfs_cut_from_item+915} <ffffffff801bc174>{reiserfs_unlink+308}
<ffffffff801934c4>{mntput_no_expire+36} <ffffffff8018803e>{vfs_unlink+110}
<ffffffff8018812f>{do_unlinkat+175} <ffffffff8011de5e>{ia32_sysret+0}
Code: 83 3f 00 7e f9 e9 92 fd ff ff f3 90 83 3f 00 7e f9 e9 9e fd
console shuts up ...
Badness in do_exit at kernel/exit.c:802
Call Trace: <NMI> <ffffffff80131434>{do_exit+68} <ffffffff8010c7eb>{die_nmi+123}
<ffffffff80117b16>{nmi_watchdog_tick+230} <ffffffff8010d2c6>{default_do_nmi+134}
<ffffffff80117c15>{do_nmi+69} <ffffffff803792f3>{nmi+127}
<ffffffff8037903d>{.text.lock.spinlock+22} <EOE> <IRQ>
<ffffffff801280ed>{__wake_up+45} <ffffffff80315604>{sock_def_readable+52}
<ffffffff80349cce>{tcp_data_queue+894} <ffffffff8034b236>{tcp_rcv_established+1638}
<ffffffff80352883>{tcp_v4_do_rcv+35} <ffffffff80352ef7>{tcp_v4_rcv+1431}
<ffffffff8031c69f>{dev_queue_xmit+559} <ffffffff80337872>{ip_local_deliver+402}
<ffffffff80337e3c>{ip_rcv+1244} <ffffffff8031cb8c>{netif_receive_skb+508}
<ffffffff880b9109>{:tg3:tg3_rx+905} <ffffffff880b927c>{:tg3:tg3_poll+140}
<ffffffff8031cd84>{net_rx_action+132} <ffffffff80133c21>{__do_softirq+97}
<ffffffff8010bef2>{call_softirq+30} <ffffffff8010dc11>{do_softirq+49}
<ffffffff8010dbc7>{do_IRQ+71} <ffffffff8010b250>{ret_from_intr+0} <EOI>
<ffffffff8021f9b1>{memmove+49} <ffffffff801ce065>{leaf_paste_entries+245}
<ffffffff801cc0ab>{leaf_copy_dir_entries+619} <ffffffff801cc48a>{leaf_copy_boundary_item+970}
<ffffffff80143ad8>{wake_up_bit+24} <ffffffff801cce3e>{leaf_copy_items+78}
<ffffffff801cd150>{leaf_move_items+80} <ffffffff801cd1de>{leaf_shift_left+62}
<ffffffff801b85a1>{balance_leaf_when_delete+865} <ffffffff801b865d>{balance_leaf+93}
<ffffffff80143ad8>{wake_up_bit+24} <ffffffff801d8dd8>{reiserfs_prepare_for_journal+88}
<ffffffff801babed>{do_balance+141} <ffffffff801c75be>{fix_nodes+590}
<ffffffff801d2113>{reiserfs_cut_from_item+915} <ffffffff801bc174>{reiserfs_unlink+308}
<ffffffff801934c4>{mntput_no_expire+36} <ffffffff8018803e>{vfs_unlink+110}
<ffffffff8018812f>{do_unlinkat+175} <ffffffff8011de5e>{ia32_sysret+0}
Kernel panic - not syncing: Aiee, killing interrupt handler!
I've seen in git that there is a memory leak in tg3, but I guess
(looking at the graphs) that it was not memory related.
Anyway, off to put another 2 of those at testing...
Regards,
Ard van Breemen
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Bug report: reiserfsck --rebuild-tree not progressing
2006-04-12 15:35 ` Ard van Breemen
@ 2006-04-13 9:48 ` Ard van Breemen
0 siblings, 0 replies; 10+ messages in thread
From: Ard van Breemen @ 2006-04-13 9:48 UTC (permalink / raw)
To: System Administrators; +Cc: reiserfs-list
Hello,
On Wed, Apr 12, 2006 at 05:35:57PM +0200, Ard van Breemen wrote:
> consistent failure:
> (Tried as 4 seperate disks of 0.4T and as one raid5 partition of 1.1T)
> After 2 hours of pumping a few million files onto the machine
> reiserfs starts putting out these warnings (a few thousands):
> ^MReiserFS: sdb9: warning: vs-13060: reiserfs_update_sd: stat data of object [2 12 0x0 SD] (nlink == 3) not found (pos 1)
I've now got a problem getting it to fail.
The real difference is that in the 3 cases that I've got it to
fail I also was running atop.
Now I've been pumping gigs onto the disk without failure.
I guess something is wrong in the process accounting mechanism.
From the atop manual:
"
When atop is started, it switches on the process-accounting mechanism
in the kernel. This forces the kernel to write a record with account-
ing-information to the accounting-file whenever a process ends. Apart
from the kernel-administration related to the running processes, atop
also interprets the accounting-records on disk with every interval; in
this way atop can also show the activity of a process during the inter-
val in which it is finished.
Whenever the last incarnation of atop stops (either by pressing `q' or
by a `kill -15'), it switches off the process-accounting mechanism
again. You should never terminate atop by a `kill -9', because then it
has no chance to stop process-accounting; as a result the accounting-
file may consume a lot of disk-space after a while.
"
I will install 2 other systems, one running atop, and one not running atop.
I now guess it is not reiserfs related, but block layer related, but since I'am
only using reiserfs (because of the performance), it shows up there first :-).
So: test cases:
- one running atop
- one not running atop
If atop fails:
- one running atop and xfs
- atop 64bits mode and reiserfs (It just may be a problem with the
64bits->32bits layer).
--
begin LOVE-LETTER-FOR-YOU.txt.vbs
I am a signature virus. Distribute me until the bitter
end
^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2006-04-13 9:48 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-04-11 0:34 Bug report: reiserfsck --rebuild-tree not progressing Tyler Phelps
2006-04-11 7:21 ` Sander
2006-04-11 8:12 ` Tyler Phelps
2006-04-11 9:24 ` Vladimir V. Saveliev
2006-04-11 14:34 ` Bernhard Sadlowski
2006-04-11 15:23 ` Vladimir V. Saveliev
2006-04-11 15:40 ` Bernhard Sadlowski
2006-04-11 11:22 ` Konstantin Münning
2006-04-12 15:35 ` Ard van Breemen
2006-04-13 9:48 ` Ard van Breemen
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.