Date: Sun, 09 May 2010 17:35:27 +0200
From: Christian Affolter
To: xfs@oss.sgi.com
Subject: Re: failed to read root inode
Message-ID: <4BE6D63F.3070404@purplehaze.ch>
In-Reply-To: <4BE5EB5D.5020702@hardwarefreak.com>
References: <4BE55A63.8070203@purplehaze.ch> <4BE5EB5D.5020702@hardwarefreak.com>
List-Id: XFS Filesystem from SGI

Hi

>> After a disk crash within a hardware RAID-6 controller and kernel
>> freeze, I'm unable to mount an XFS filesystem on top of an EVMS volume:

> What storage management operation(s) were you performing when this crash
> occurred? Were you adding, deleting, shrinking, or growing an EVMS volume
> when the "crash" occurred, or was the system just sitting idle with no load
> when the crash occurred?

No, there were no storage management operations in progress when the
system crashed. It's an NFS file server with random read and write
operations.

> Why did the "crash" of a single disk in a hardware RAID6 cause a kernel
> freeze? What is your definition of "disk crash"? A single physical disk
> failure should not have caused this under any circumstances. The RAID card
> should have handled a single disk failure transparently.

That's a good question ;) Honestly, I don't know how this could happen;
all I saw was a bunch of errors from the RAID controller driver. In the
past, two other disks failed, and the controller reported each failure
correctly and automatically started rebuilding the array onto the
hot-spare disk. So it did its job correctly both times.
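As an aside, one way to spot-check the member disks from the running host
is smartmontools; smartctl can address individual disks behind an Areca
controller with its '-d areca,N' option. A minimal sketch, assuming the
controller's SCSI generic node is /dev/sg0 (that node and the disk
numbers are illustrative, not values from this system):

  # Overall SMART health self-assessment of member disk 1 behind
  # the Areca controller (-d areca,N selects disk N; /dev/sg0 is
  # a guess at the controller's SCSI generic node)
  smartctl -H -d areca,1 /dev/sg0

  # Full attribute dump; rising Reallocated_Sector_Ct or
  # Current_Pending_Sector counts usually precede a disk failure
  smartctl -a -d areca,2 /dev/sg0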
[...]

kernel: arcmsr0: abort device command of scsi id = 1 lun = 0
kernel: arcmsr0: abort device command of scsi id = 4 lun = 0
kernel: arcmsr0: abort device command of scsi id = 4 lun = 0
kernel: arcmsr0: abort device command of scsi id = 4 lun = 0
kernel: arcmsr0: abort device command of scsi id = 4 lun = 0
kernel: arcmsr0: abort device command of scsi id = 2 lun = 0
kernel: arcmsr0: ccb ='0xffff8100bf819080' isr got aborted command
kernel: arcmsr0: isr get an illegal ccb command done acb = '0xffff81013ec245c8'ccb = '0xffff8100bf819080' ccbacb = '0xffff81013ec245c8' startdone = 0x0 ccboutstandingcount = 0
kernel: arcmsr0: abort device command of scsi id = 2 lun = 0
kernel: arcmsr0: abort device command of scsi id = 4 lun = 0
kernel: arcmsr0: ccb ='0xffff8100bf814640' isr got aborted command
kernel: arcmsr0: isr get an illegal ccb command done acb = '0xffff81013ec245c8'ccb = '0xffff8100bf814640' ccbacb = '0xffff81013ec245c8' startdone = 0x0 ccboutstandingcount = 13
kernel: sd 0:0:4:0: [sdd] Result: hostbyte=DID_ABORT driverbyte=DRIVER_OK,SUGGEST_OK
kernel: end_request: I/O error, dev sdd, sector 1897479440
kernel: Device dm-79, XFS metadata write error block 0x1813c0 in dm-79
kernel: Device dm-65, XFS metadata write error block 0x2c1a0 in dm-65
kernel: xfs_force_shutdown(dm-65,0x1) called from line 1093 of file fs/xfs/xfs_buf_item.c. Return address = 0xffffffff80374359
kernel: Filesystem "dm-65": I/O Error Detected. Shutting down filesystem: dm-65
kernel: Please umount the filesystem, and rectify the problem(s)
kernel: xfs_force_shutdown(dm-65,0x1) called from line 420 of file fs/xfs/xfs_rw.c. Return address = 0xffffffff803a9529

[...]

Afterwards, most of the volumes were shut down, and after a couple of
hours the kernel froze with a kernel panic (whose details I can't
remember, as I had no serial console attached).

> Exactly which make/model is the RAID card?

Areca Technology Corp. ARC-1160 16-Port PCI-X to SATA RAID Controller
Firmware Version: V1.42 2006-10-13

> What is the status of each of the remaining disks attached to the card
> as reported by its BIOS?

After the hard reset, one disk was reported as 'failed' and the rebuild
started.

> What is the status of the RAID6 volume as reported by the RAID card BIOS?

By now the rebuild has finished, therefore the volume is in a normal,
non-degraded state.

> What is the status of each of your EVMS volumes as reported by the EVMS UI?

They're all active. Do you need more information here? There are
approximately 45 active volumes on this server.

> I'm asking all of these questions because it seems rather clear that the
> root cause of your problem lies at a layer well below the XFS filesystem.

Yes, I never blamed XFS for being the cause of the problem.

> You have two layers of physical disk abstraction below XFS: a hardware
> RAID6 and a software logical volume manager. You've apparently suffered a
> storage system hardware failure, according to your description. You haven't
> given any details of the current status of the hardware RAID, or of the
> logical volumes, merely that XFS is having problems. I think a "Well duh!"
> is in order.
>
> Please provide _detailed_ information from the RAID card BIOS and the EVMS
> UI. Even if the problem isn't XFS related I for one would be glad to assist
> you in getting this fixed. Right now we don't have enough information. At
> least I don't.

Thanks for your help
Christian
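P.S. Once the hardware checks out, a no-modify xfs_repair run would show
whether dm-65 and dm-79 took any lasting metadata damage from the forced
shutdowns above. A minimal sketch, assuming a hypothetical EVMS device
node /dev/evms/xfs_vol (substitute the real volume path):

  # Mount and unmount once first so the dirty XFS log is replayed;
  # otherwise the check below can report stale inconsistencies
  mount /dev/evms/xfs_vol /mnt && umount /mnt

  # -n = no-modify mode: report problems only, write nothing
  xfs_repair -n /dev/evms/xfs_vol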