Message-ID: <4A4EEBE6.6060909@sandeen.net>
Date: Sat, 04 Jul 2009 00:43:02 -0500
From: Eric Sandeen
Subject: Re: bad fs - xfs_repair 3.01 crashes on it
In-Reply-To: <200907031320.48358@zmi.at>
To: Michael Monnerie
Cc: xfs mailing list

Michael Monnerie wrote:
> Tonight our server rebooted, and I found in /var/log/warn that it had
> been complaining a lot about xfs since June 7 already:
...
> But XFS didn't go offline, so nobody noticed these messages. There
> are a lot of them. They are obviously generated by the nightly
> "xfs_fsr -v -t 7200" that we have been running since then. It would
> have been nice if xfs_fsr could have displayed a message, so we would
> have received the cron mail. (But it got killed by the kernel, so
> that's a good excuse.)

ok yeah we should see why fsr didn't print anything ...

> Anyway, so I went to xfs_repair (3.01) and got this:
>
> Phase 3 - for each AG...
>         - scan and clear agi unlinked lists...
>         - process known inodes and perform inode discovery...
> [snip]
>         - agno = 14
> local inode 3857051697 attr too small (size = 3, min size = 4)
> bad attribute fork in inode 3857051697, clearing attr fork
> clearing inode 3857051697 attributes
> cleared inode 3857051697
> [snip]
> Phase 4 - check for duplicate blocks...
> [snip]
>         - agno = 15
> data fork in regular inode 3857051697 claims used block 537147998
> xfs_repair: dinode.c:2108: process_inode_data_fork: Assertion `err == 0' failed.

Ok, so this is essentially some code which first does a scan; if it
finds an error it bails out and clears the inode, but if not, it calls
essentially the same function again - the comments say "set bitmaps
this time". On that 2nd call it finds an error, which isn't handled
well. The ASSERT(err == 0) is presumably there because if the first
scan didn't find anything, the 2nd call shouldn't either, but ...
that's not the case here :( There are more checks that can go wrong
-after- the scan-only portion. So either the caller needs to cope w/
the error at this point, or the scan-only pass needs to do all the
checks, I think. Where's Barry when you need him ....

Also I need to look at when the ASSERTs are active and when they
should be; the Fedora-packaged xfsprogs doesn't have the ASSERT
active, so this doesn't trip there. After 2 calls to xfs_repair on
Fedora, w/o the ASSERTs active, it checks clean on the 3rd (!). Not
great. Not sure how much was cleared out in the process either...

-Eric

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs