* advice for repair after IO error on raid device
@ 2010-06-22 14:36 Roel van Meer
2010-06-22 14:53 ` Roel van Meer
2010-06-22 16:07 ` Michael Monnerie
0 siblings, 2 replies; 7+ messages in thread
From: Roel van Meer @ 2010-06-22 14:36 UTC (permalink / raw)
To: xfs
Hi list,
I recently had a failed disk in a raid6 setup, which resulted in an IO
error, which in turn caused XFS to shut down with the messages below.
I've seen on this list that incorrect use of xfs_repair might damage the fs
even more, so I would like to ask for some advice on the best way to
proceed.
Currently I have unmounted the filesystem, replaced the failed disk and
rebuilt the raid array. I am upgrading xfstools to their latest version (the
current version is 2.9.8). Any hints on how to continue would be highly
appreciated.
Background: This is a Fedora Core 3 machine, with a vanilla 2.6.31 kernel.
The raid setup consists of 24x2TB disks in a raid6 setup. We use it to store
our backup snapshots and the entire volume is written to tape once a week.
Thanks in advance,
roel
Jun 21 23:23:59 backup2 kernel: arcmsr6: abort device command of scsi id = 0 lun = 0
Jun 21 23:24:10 backup2 kernel: arcmsr6: ccb ='0xffff8800cb88ad40' isr got aborted command
Jun 21 23:24:10 backup2 kernel: arcmsr6: isr get an illegal ccb command done acb = '0xffff880231c90408'ccb = '0xffff8800cb88ad40' ccbacb = '0xffff880231c90408' startdone = 0x0 ccboutstandingcount = 1
Jun 21 23:24:10 backup2 kernel: sd 6:0:0:0: [sdb] Unhandled error code
Jun 21 23:24:10 backup2 kernel: sd 6:0:0:0: [sdb] Result: hostbyte=DID_ABORT driverbyte=DRIVER_OK
Jun 21 23:24:10 backup2 kernel: end_request: I/O error, dev sdb, sector 12887056410
Jun 21 23:24:10 backup2 kernel: I/O error in filesystem ("sdb1") meta-data dev sdb1 block 0x30020dff8 ("xfs_trans_read_buf") error 5 buf count 4096
Jun 21 23:24:10 backup2 kernel: xfs_force_shutdown(sdb1,0x1) called from line 414 of file fs/xfs/xfs_trans_buf.c.  Return address = 0xffffffffa0168eaf
Jun 21 23:24:10 backup2 kernel: xfs_force_shutdown(sdb1,0x2) called from line 811 of file fs/xfs/xfs_log.c.  Return address = 0xffffffffa015c35f
Jun 21 23:24:10 backup2 kernel: Filesystem "sdb1": I/O Error Detected.  Shutting down filesystem: sdb1
Jun 21 23:24:10 backup2 kernel: Please umount the filesystem, and rectify the problem(s)
Jun 21 23:24:20 backup2 kernel: Filesystem "sdb1": xfs_log_force: error 5 returned.
Jun 21 23:24:50 backup2 kernel: Filesystem "sdb1": xfs_log_force: error 5 returned.
Jun 21 23:25:20 backup2 kernel: Filesystem "sdb1": xfs_log_force: error 5 returned.
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* Re: advice for repair after IO error on raid device
2010-06-22 14:36 advice for repair after IO error on raid device Roel van Meer
@ 2010-06-22 14:53 ` Roel van Meer
2010-06-22 15:29 ` Michael Weissenbacher
2010-06-22 16:31 ` Emmanuel Florac
2010-06-22 16:07 ` Michael Monnerie
1 sibling, 2 replies; 7+ messages in thread
From: Roel van Meer @ 2010-06-22 14:53 UTC (permalink / raw)
To: xfs
Roel van Meer writes:
> Currently I have unmounted the filesystem, replaced the failed disk and
> rebuilt the raid array. I am upgrading xfstools to their latest version (the
> current version is 2.9.8). Any hints on how to continue would be highly
> appreciated.
Trying to answer my own question. I _think_ this is the way to go:
1) Mount and unmount the fs, in order to replay the log.
2) Run xfs_repair -n
3) Run xfs_repair
If someone could confirm (or reject) that, that would be great.
(By the way, is it necessary to run xfs_repair with -n first? If not, are
there advantages that would justify the extra time it takes?)
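The three steps above could be sketched as a small shell function. This is only an illustration under assumptions: the device /dev/sdb1 is taken from the log, the mount point /mnt/backup is hypothetical, and the function only prints the commands so that nothing is run by accident.

```sh
# Print the commands for the mount / check / repair sequence.
# The device and mount point are arguments; /mnt/backup in the usage
# line below is a hypothetical mount point, not taken from the thread.
print_repair_plan() {
    dev=$1
    mnt=$2
    echo "mount $dev $mnt"      # step 1: mounting replays the XFS log
    echo "umount $mnt"          # step 1: unmount again before checking
    echo "xfs_repair -n $dev"   # step 2: no-modify mode, report only
    echo "xfs_repair $dev"      # step 3: the actual repair
}
```

Calling `print_repair_plan /dev/sdb1 /mnt/backup` lists the four commands in order, which can then be reviewed and run by hand one at a time.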
Thanks again,
roel
> Jun 21 23:23:59 backup2 kernel: arcmsr6: abort device command of scsi id = 0 lun = 0
> Jun 21 23:24:10 backup2 kernel: arcmsr6: ccb ='0xffff8800cb88ad40' isr got aborted command
> Jun 21 23:24:10 backup2 kernel: arcmsr6: isr get an illegal ccb command done acb = '0xffff880231c90408'ccb = '0xffff8800cb88ad40' ccbacb = '0xffff880231c90408' startdone = 0x0 ccboutstandingcount = 1
> Jun 21 23:24:10 backup2 kernel: sd 6:0:0:0: [sdb] Unhandled error code
> Jun 21 23:24:10 backup2 kernel: sd 6:0:0:0: [sdb] Result: hostbyte=DID_ABORT driverbyte=DRIVER_OK
> Jun 21 23:24:10 backup2 kernel: end_request: I/O error, dev sdb, sector 12887056410
> Jun 21 23:24:10 backup2 kernel: I/O error in filesystem ("sdb1") meta-data dev sdb1 block 0x30020dff8 ("xfs_trans_read_buf") error 5 buf count 4096
> Jun 21 23:24:10 backup2 kernel: xfs_force_shutdown(sdb1,0x1) called from line 414 of file fs/xfs/xfs_trans_buf.c.  Return address = 0xffffffffa0168eaf
> Jun 21 23:24:10 backup2 kernel: xfs_force_shutdown(sdb1,0x2) called from line 811 of file fs/xfs/xfs_log.c.  Return address = 0xffffffffa015c35f
> Jun 21 23:24:10 backup2 kernel: Filesystem "sdb1": I/O Error Detected.  Shutting down filesystem: sdb1
> Jun 21 23:24:10 backup2 kernel: Please umount the filesystem, and rectify the problem(s)
> Jun 21 23:24:20 backup2 kernel: Filesystem "sdb1": xfs_log_force: error 5 returned.
> Jun 21 23:24:50 backup2 kernel: Filesystem "sdb1": xfs_log_force: error 5 returned.
> Jun 21 23:25:20 backup2 kernel: Filesystem "sdb1": xfs_log_force: error 5 returned.
* Re: advice for repair after IO error on raid device
2010-06-22 14:53 ` Roel van Meer
@ 2010-06-22 15:29 ` Michael Weissenbacher
2010-06-22 16:31 ` Emmanuel Florac
1 sibling, 0 replies; 7+ messages in thread
From: Michael Weissenbacher @ 2010-06-22 15:29 UTC (permalink / raw)
To: xfs
Hi Roel!
> Trying to answer my own question. I _think_ this is the way to go:
>
> 1) Mount and unmount the fs, in order to replay the log.
> 2) Run xfs_repair -n
> 3) Run xfs_repair
>
> If someone could confirm (or reject) that, that would be great.
This is a sound plan; I always run "xfs_repair -n" first. If you're
lucky it will find no problems and you can skip step 3).
If "xfs_repair -n" does find problems you can ask for more advice here.
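That decision could be scripted roughly as below. It is a sketch, not a tested recipe: it relies on the xfs_repair man page, which says that in -n mode the exit status is 1 when corruption was detected and 0 otherwise. The XFS_REPAIR variable is a hypothetical override added here only so the logic can be tried without a real device.

```sh
# Read-only check first; only consider a real repair if it reports trouble.
# XFS_REPAIR is a hypothetical override for trying the logic safely;
# by default the real xfs_repair is called.
check_only() {
    dev=$1
    if "${XFS_REPAIR:-xfs_repair}" -n "$dev"; then
        echo "no problems reported, step 3 can be skipped"
    else
        # Any non-zero status is treated as "not confirmed clean".
        echo "problems reported, ask for advice before running xfs_repair"
    fi
}
```

On the real system this would be called as `check_only /dev/sdb1` with the filesystem unmounted.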
good luck,
Michael
* Re: advice for repair after IO error on raid device
2010-06-22 14:36 advice for repair after IO error on raid device Roel van Meer
2010-06-22 14:53 ` Roel van Meer
@ 2010-06-22 16:07 ` Michael Monnerie
2010-06-22 16:35 ` Emmanuel Florac
2010-06-24 8:28 ` advice for repair after IO error on raid device [SOLVED] Roel van Meer
1 sibling, 2 replies; 7+ messages in thread
From: Michael Monnerie @ 2010-06-22 16:07 UTC (permalink / raw)
To: xfs
On Tuesday, 22 June 2010 Roel van Meer wrote:
> Jun 21 23:23:59 backup2 kernel: arcmsr6: abort device command of scsi id = 0 lun = 0
> Jun 21 23:24:10 backup2 kernel: arcmsr6: ccb ='0xffff8800cb88ad40' isr got aborted
This does not sound like a simple failed disk. We also use Areca RAID
controllers; they are great, and a failed disk *never* affects the
running system. That is what the RAID controller is for, after all.
You only receive an e-mail from the controller about the disabled
disk.
The "isr got aborted" sounds like a driver problem. Please report that
log snippet to Areca support so they can help you (I'd be interested in
the results too; please let me know by private mail once you know). The
address is support@areca.com.tw, and they tend to answer quickly and well.
--
Kind regards,
Michael Monnerie, Ing. BSc
it-management Internet Services
http://proteger.at [pronounced: Prot-e-schee]
Tel: 0660 / 415 65 31
// We currently have two houses for sale:
// http://zmi.at/langegg/
// http://zmi.at/haus2009/
* Re: advice for repair after IO error on raid device
2010-06-22 14:53 ` Roel van Meer
2010-06-22 15:29 ` Michael Weissenbacher
@ 2010-06-22 16:31 ` Emmanuel Florac
1 sibling, 0 replies; 7+ messages in thread
From: Emmanuel Florac @ 2010-06-22 16:31 UTC (permalink / raw)
To: Roel van Meer; +Cc: xfs
On Tue, 22 Jun 2010 16:53:32 +0200
Roel van Meer <rolek@bokxing.nl> wrote:
> If someone could confirm (or reject) that, that would be great.
Confirmation granted.
> (By the way, is it necessary to run xfs_repair with -n first? If not,
> are there advantages that would justify the extra time it takes?)
It will show what modifications it would make before actually
making them, like "inode 038953095 corrupted, would remove it" then
"directory not connected, would move content to lost+found".
I have found that xfs_repair is usually quite quick and never takes much more
than a couple of minutes (even on very big arrays), so you have no valid
reason to skip the "xfs_repair -n" part.
It may be nice to dump the filesystem metadata with xfs_metadump
prior to using xfs_repair too. In the case where the repair's really
gone bad, you could at least revert to the prior less broken state using
xfs_mdrestore...
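As a rough sketch of what that could look like, the function below prints a possible command sequence. The file names under /root are hypothetical, and the last command (a dry-run repair against the restored image) is an extra illustration, not something suggested above. Two things worth knowing: a metadump holds metadata only, not file contents, and by default xfs_metadump obfuscates file names, which -o disables.

```sh
# Print commands that save the metadata and rehearse the repair on a copy.
# The dump and image paths are arguments; the ones in the usage line
# below are hypothetical and need a location with enough free space.
print_metadump_plan() {
    dev=$1
    dump=$2
    img=$3
    echo "xfs_metadump -o $dev $dump"   # metadata only; run on an unmounted fs
    echo "xfs_mdrestore $dump $img"     # rebuild a filesystem image from the dump
    echo "xfs_repair -n -f $img"        # dry-run check against the image file
}
```

For example, `print_metadump_plan /dev/sdb1 /root/sdb1.metadump /root/sdb1.img` prints the three commands for review before anything is run.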
--
------------------------------------------------------------------------
Emmanuel Florac | Direction technique
| Intellique
| <eflorac@intellique.com>
| +33 1 78 94 84 02
------------------------------------------------------------------------
* Re: advice for repair after IO error on raid device
2010-06-22 16:07 ` Michael Monnerie
@ 2010-06-22 16:35 ` Emmanuel Florac
2010-06-24 8:28 ` advice for repair after IO error on raid device [SOLVED] Roel van Meer
1 sibling, 0 replies; 7+ messages in thread
From: Emmanuel Florac @ 2010-06-22 16:35 UTC (permalink / raw)
To: Michael Monnerie; +Cc: xfs
On Tue, 22 Jun 2010 18:07:18 +0200
Michael Monnerie <michael.monnerie@is.it-management.at> wrote:
> The "isr got aborted" sounds like a driver problem
You bet, isr stands for "interrupt service routine"; apparently the
driver (or the firmware) seriously shat its pants. Maybe a firmware
and/or kernel upgrade would be good too.
--
------------------------------------------------------------------------
Emmanuel Florac | Direction technique
| Intellique
| <eflorac@intellique.com>
| +33 1 78 94 84 02
------------------------------------------------------------------------
* Re: advice for repair after IO error on raid device [SOLVED]
2010-06-22 16:07 ` Michael Monnerie
2010-06-22 16:35 ` Emmanuel Florac
@ 2010-06-24 8:28 ` Roel van Meer
1 sibling, 0 replies; 7+ messages in thread
From: Roel van Meer @ 2010-06-24 8:28 UTC (permalink / raw)
To: xfs
Hi list,
just a quick follow-up: xfs_repair didn't find any trouble and everything is
up and running again like a charm.
Always nice to be able to post good news.
Thanks for your help and thanks for the great FS!
Regards,
roel