From mboxrd@z Thu Jan 1 00:00:00 1970
From: eazgwmir@umail.furryterror.org (Zygo Blaxell)
Subject: 2.4.20, reiserfs, md linear, and "Permission denied"...
Date: 11 Jan 2003 14:30:36 -0500
Message-ID:
Return-path:
list-help:
list-unsubscribe:
list-post:
Errors-To: flx@namesys.com
List-Id:
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: reiserfs-list@namesys.com

I think I'm seeing a pattern of failure.  I'm wondering if there is a
problem with MD linear personality (aka JBOD) and reiserfs.

Here's the recipe for disaster:

Ingredients:

    reiserfs, of course ;-)

    2.4.18 and 2.4.20 kernels (compiled for SMP but running on UP)

    Arrays of disks in MD linear mode (aka raidtools2)

    lots of metadata I/O (cp -al & rm -rf)

Directions:

    Start with a large collection of files (e.g. the contents of /etc,
    /var, and /usr).  Put this on a reiserfs filesystem under
    '(mountpoint)/foo'.

    In one thread, do 'cp -al foo bar/`date +%Y%m%d%H%M%S`'.  Create a
    new directory for each copy.  The thread should check for free disk
    space after each 'cp' command and throttle itself as the filesystem
    gets too full (i.e. when there is less space free than the size of
    'foo').

    In a second and third thread, do 'rm -rf `ls -d bar/* | head -1`'.
    The third thread should sleep for one hour between rm commands,
    while the second thread will not sleep.  Both threads should check
    for free disk space and throttle themselves when the filesystem
    gets too empty (e.g. when there is more space free than the size of
    'foo').  Note that there will sometimes be two 'rm -rf's processing
    the same directories at the same time.

    In a fourth thread, run 'find -ls >/dev/null' over the entire
    filesystem continuously.

    In a fifth thread, replace a few of the files in 'foo'.  In my case
    these files are actually rsync-ed from a live Linux system that
    provided the original contents of 'foo'.
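
The copy and remove threads in the recipe above could be sketched
roughly as follows.  This is only an illustration of the throttling
logic, not the script actually used: the MNT mount point, the 60-second
poll interval, and all function names are assumptions, and 'cp -al'
requires GNU cp.

```shell
#!/bin/sh
# Sketch of the copy and remove loops from the recipe. MNT, the 60-second
# poll interval, and the function names are assumptions, not from the post.
MNT=${MNT:-/mnt/test}   # hypothetical mount point holding foo/ and bar/

# Free space on the filesystem containing $1, in KiB (4th column of df -P).
free_kb() { df -Pk "$1" | awk 'NR==2 {print $4}'; }

# Disk usage of the tree rooted at $1, in KiB.
tree_kb() { du -sk "$1" | awk '{print $1}'; }

# Oldest copy under $1: names are timestamps, so lexical order is age order.
# Prints nothing if the directory is empty or missing.
oldest() { ls -d "$1"/* 2>/dev/null | head -1; }

# Thread 1: hardlink-copy foo, waiting while free space < size of foo.
copier() {
    while :; do
        if [ "$(free_kb "$MNT")" -lt "$(tree_kb "$MNT/foo")" ]; then
            sleep 60
        else
            cp -al "$MNT/foo" "$MNT/bar/$(date +%Y%m%d%H%M%S)"
        fi
    done
}

# Threads 2 and 3: remove the oldest copy, waiting while free space > size
# of foo. $1 is the pause between removals in seconds (0 or 3600).
remover() {
    while :; do
        victim=$(oldest "$MNT/bar")
        if [ -z "$victim" ] ||
           [ "$(free_kb "$MNT")" -gt "$(tree_kb "$MNT/foo")" ]; then
            sleep 60
        else
            rm -rf "$victim"
            sleep "$1"
        fi
    done
}

# Run one role per process; with no argument only the functions are defined.
case "${1:-}" in
    copy)   copier ;;
    remove) remover "${2:-0}" ;;
esac
```

Saved under a hypothetical name such as stress.sh, this would be started
as 'sh stress.sh copy &', 'sh stress.sh remove 0 &' and 'sh stress.sh
remove 3600 &'.  The two removers make no attempt to avoid each other,
so they will sometimes pick the same directory, as in the recipe.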

If I run this for about two weeks, one day the 'find' and 'rm' threads
will start to emit lots of "Permission denied" messages when trying to
access random files under 'bar'.  This is especially apparent in the
'rm' threads, because they'll stop being useful if they're unable to
completely remove the oldest directory.

reiserfsck --fix-fixable runs for 24.5 hours, then says:

    [several megabytes deleted]
    k_semantic_pass: name "gtk" in directory 5015381 5015470 points to nowhere - removed
    dir 5015381 5015470 has wrong sd_size 72, has to be 48
    check_semantic_pass: name "Yell-O" in directory 5009649 5015381 points to nowhere - removed
    dir 5009649 5015381 has wrong sd_size 72, has to be 48
    check_semantic_pass: name "themes" in directory 5005514 5009649 points to nowhere - removed
    dir 5005514 5009649 has wrong sd_size 200, has to be 48
    check_semantic_pass: name "share" in directory 5004046 5005514 points to nowhere - removed
    dir 5004046 5005514 has wrong sd_size 120, has to be 48
    check_semantic_pass: name "usr" in directory 5004045 5004046 points to nowhere - removed
    dir 5004045 5004046 has wrong sd_size 96, has to be 48
    No corruptions found
    There are on the filesystem:
            Leaves 723046
            Internal nodes 5073
            Directories 1226338
            Other files 15857491
            Data block pointers 35351178 (zero of them 169766)
            Safe links 0
    ###########
    reiserfsck finished at Sat Jan 11 10:17:45 2003
    ###########

Observations:

    I've seen this phenomenon occur ten times in the last two months.
    I thought that upgrading from 2.4.18 to 2.4.20 might fix the
    problem, but it has occurred three times on 2.4.20 machines.

    There do not appear to be any disk errors in the kernel logs.

    I have no reason to believe that there are CPU, RAM, or cooling
    problems.  All of those tend to manifest themselves in other ways
    that I have not also observed.

    There does not appear to be any data corruption or loss in the
    filesystem other than the names pointing to nowhere.
    The affected files would be newly created hardlinks or recently
    deleted hardlinks--neither of which is "data loss" per se.  Other
    hardlinks to these files seem to be unaffected (I've never seen any
    part of 'foo' damaged, only 'bar').

    Both reiserfs and linear (JBOD) arrays of disks seem to be a
    requirement to reproduce the problem.  I have several machines
    running very similar workload and software--they're actually
    mirrored servers with automated failover, so whatever filesystem
    activity one machine does, another machine does soon after.
    Machines using reiserfs on RAID0, RAID1, or single disks do not
    have these problems, nor are there problems with machines using
    ext3 on linear arrays.  All machines are running the same kernels
    and they're all in the same building (so they have the same power
    failures and resulting unclean shutdowns).  There's a wide variety
    of hardware involved ranging from P133 to P4 and several different
    disk vendors.

    If there is an unclean shutdown, there will definitely be names
    pointing to nowhere; however, this problem has recently occurred
    once on a machine that was not shut down at all, cleanly or
    otherwise, since mkreiserfs.

-- 
Zygo Blaxell (Laptop)
GPG = D13D 6651 F446 9787 600B AD1E CCF3 6F93 2823 44AD