* xfs data loss
@ 2009-08-27 7:22 Passerone, Daniele
2009-08-27 9:41 ` Christian Kujau
` (2 more replies)
0 siblings, 3 replies; 28+ messages in thread
From: Passerone, Daniele @ 2009-08-27 7:22 UTC (permalink / raw)
To: xfs@oss.sgi.com
Dear xfs developers
We have a Sun X4500 with 48 × 500 GB drives, configured under SUSE SLES 10.
Among others, we have 3 RAID5 XFS filesystems: /dev/md4 with 20 units (9.27 TB),
/dev/md5 with 20 units (9.27 TB), and /dev/md6 with 5 units (1.95 TB).
These units are not backed up.
Due to a power shock, suddenly and without any log messages, about one half (5 TB)
of the user directories on /dev/md4 disappeared.
Upon reboot, /dev/md6 showed only 3 units, and after an xfs_repair it was OK again.
/dev/md4 mounted immediately, but still with only one half of the directories.
1) xfs_check gives no problem on /dev/md4
but
2) xfs_logprint
xfs_logprint:
data device: 0x904
log device: 0x904 daddr: 9279295544 length: 262144
Header 0x1c8b wanted 0xfeedbabe
**********************************************************************
* ERROR: header cycle=7307 block=182240 *
**********************************************************************
Bad log record header
and the same happens on /dev/md5.
3) xfs_ncheck eats a lot of memory and freezes after 6-7 hours without giving output
4) I have a xfs_metadump in a file (1.8 GB) but I don't know what to do with it.
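For reference, the usual way to make use of a metadump (a sketch, not from the original message; file names are placeholders, and both tools ship with xfsprogs) is to expand it into an image and run a read-only repair pass against that image:

```shell
# Expand the metadump into a sparse filesystem image; this touches
# only the image file, never the real device.
xfs_mdrestore md4.metadump md4.img

# Dry-run repair on the image (-f: target is a regular file,
# -n: no-modify mode) to see what xfs_repair would report
# without risking anything.
xfs_repair -f -n md4.img
```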
What can I do? Any help would be appreciated; I would really be happy to recover those files...
:-)
Thank you in advance
Daniele Passerone
Empa, Switzerland
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* Re: xfs data loss
2009-08-27 7:22 xfs data loss Passerone, Daniele
@ 2009-08-27 9:41 ` Christian Kujau
2009-08-27 9:47 ` Passerone, Daniele
2009-08-27 9:54 ` Passerone, Daniele
2009-08-28 4:16 ` Eric Sandeen
2009-08-29 14:08 ` Peter Grandi
2 siblings, 2 replies; 28+ messages in thread
From: Christian Kujau @ 2009-08-27 9:41 UTC (permalink / raw)
To: Passerone, Daniele; +Cc: xfs@oss.sgi.com
On Thu, 27 Aug 2009 at 09:22, Passerone, Daniele wrote:
> We have a SUN X4500 with 48 500 GB drives, that we configured under SUSE
> SLES 10.
SLES 10, that's Linux 2.6.16-something, right?
> These units are not backed up.
I know, I probably shouldn't ask this right now, but I'm kinda curious:
_Why_ aren't they backed up?
> 3) xfs_ncheck eats a lot of memory and freezes after 6-7 hours without giving output
Did you look into the syslog too? Maybe the box was out of memory and
killed xfs_ncheck. How much memory does the box have?
Christian.
--
BOFH excuse #101:
Collapsed Backbone
* RE: xfs data loss
2009-08-27 9:41 ` Christian Kujau
@ 2009-08-27 9:47 ` Passerone, Daniele
2009-08-27 10:09 ` Christian Kujau
2009-08-27 9:54 ` Passerone, Daniele
1 sibling, 1 reply; 28+ messages in thread
From: Passerone, Daniele @ 2009-08-27 9:47 UTC (permalink / raw)
To: Christian Kujau; +Cc: xfs@oss.sgi.com
Thank you Christian for your reply.
>
>SLES 10, that's Linux 2.6.16-something, right?
>
Linux 2.6.16.60 x86_64
>I know, I probably shouldn't ask this right now, but I'm kinda curious:
>_Why_ aren't they backed up?
>
It is an absolutely legitimate question.
It is a 20 TB storage unit that we cannot back up, and the users know it.
>> 3) xfs_ncheck eats a lot of memory and freezes after 6-7 hours without
>giving output
>
>Did you look into the syslog too? Maybe the box was out of memory and
>killed xfs_ncheck. How much memory does the box have?
Unfortunately xfs_ncheck was not killed; it went into "D" state.
The box has 16 GB.
Can I do something with my xfs_metadump? [please be polite :-) ]
Daniele
* RE: xfs data loss
2009-08-27 9:41 ` Christian Kujau
2009-08-27 9:47 ` Passerone, Daniele
@ 2009-08-27 9:54 ` Passerone, Daniele
1 sibling, 0 replies; 28+ messages in thread
From: Passerone, Daniele @ 2009-08-27 9:54 UTC (permalink / raw)
To: Christian Kujau; +Cc: xfs@oss.sgi.com
Please find enclosed the link to my metadata dump
http://snipurl.com/rfo6u
Thank you in advance
Daniele
* RE: xfs data loss
2009-08-27 9:47 ` Passerone, Daniele
@ 2009-08-27 10:09 ` Christian Kujau
0 siblings, 0 replies; 28+ messages in thread
From: Christian Kujau @ 2009-08-27 10:09 UTC (permalink / raw)
To: Passerone, Daniele; +Cc: xfs@oss.sgi.com
On Thu, 27 Aug 2009 at 11:47, Passerone, Daniele wrote:
> Unfortunately xfs_ncheck was not killed, it went "D" state.
> The box has 16Gb.
"D" as in "waiting for I/O to complete". Were you able to confirm that,
i.e. was there really noticeable disk I/O happening during xfs_ncheck, or
was the process perhaps stuck in D state? I seem to remember that in
those cases the output of sysrq-w could be interesting.
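For what it's worth, sysrq-w can also be triggered without a console keyboard (a sketch; requires root, and on very old kernels the behavior of the 'w' key may differ):

```shell
# Ask the kernel to dump backtraces of blocked (D-state) tasks,
# then read the result from the kernel ring buffer.
echo w > /proc/sysrq-trigger
dmesg | tail -n 50
```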
> Can I do something with my xfs_metadump [please be polite :-) ]
You've put the dump online, let's hope some XFS wizard finds time to have
a look at this...I certainly can't read this :-\
Christian.
--
BOFH excuse #228:
That function is not currently supported, but Bill Gates assures us it will be featured in the next upgrade.
* Re: xfs data loss
2009-08-27 7:22 xfs data loss Passerone, Daniele
2009-08-27 9:41 ` Christian Kujau
@ 2009-08-28 4:16 ` Eric Sandeen
2009-08-28 9:19 ` Passerone, Daniele
2009-08-29 14:08 ` Peter Grandi
2 siblings, 1 reply; 28+ messages in thread
From: Eric Sandeen @ 2009-08-28 4:16 UTC (permalink / raw)
To: Passerone, Daniele; +Cc: xfs@oss.sgi.com
Passerone, Daniele wrote:
> Dear xfs developers
>
> We have a SUN X4500 with 48 500 GB drives, that we configured under SUSE SLES 10.
>
> Among others, we have 3 RAID5 xfs filesystems, /dev/md4 with 20 units (9.27 TB)
> /dev/md5 with 20 units (9.27 TB) and /dev/md6 with 5 units (1.95 TB)
>
> These units are not backed up.
>
> Due to a power shock, suddenly and without log messages about one half (5 TB) of the user
> directories on /dev/md4 have disappeared.
I presume you mean after a reboot?
> Upon reboot, /dev/md6 showed only 3 units, and after a xfs_repair it was again ok.
> /dev/md4 mounted immediately, but always with one half of the directories.
Were the lost directories recently created? I've never heard of
untouched, existing directories disappearing after a power loss...
> WHat can I do? Any help would be appreciated, I would really be happy to recover those files...
> :)
Not much to go on here I'm afraid. SLES10 is an old kernel, but it's
supported by SuSE at least.
-Eric
* RE: xfs data loss
2009-08-28 4:16 ` Eric Sandeen
@ 2009-08-28 9:19 ` Passerone, Daniele
2009-08-28 17:17 ` Eric Sandeen
0 siblings, 1 reply; 28+ messages in thread
From: Passerone, Daniele @ 2009-08-28 9:19 UTC (permalink / raw)
To: Eric Sandeen; +Cc: xfs@oss.sgi.com
>> Due to a power shock, suddenly and without log messages about one half
>(5 TB) of the user
>> directories on /dev/md4 have disappeared.
>
>I presume you mean after a reboot?
No! No reboot at all.
The directories were exported via NFS to our whole cluster,
and are of the type
/pool/user
All directories from
/pool/g*
to
/pool/z*
have disappeared.
Of course NO COMMAND of the kind rm /pool/[g-z]* was issued.
>Were the lost directories recently created? I've never heard of
>untouched, existing directories disappearing after a power loss...
>
Not at all.
Here I am.
>
>Not much to go on here I'm afraid. SLES10 is an old kernel, but it's
>supported by SuSE at least.
>
Can you use my metadata, or is it useless?
Thank you!
Daniele
* Re: xfs data loss
2009-08-28 9:19 ` Passerone, Daniele
@ 2009-08-28 17:17 ` Eric Sandeen
2009-08-28 19:42 ` Passerone, Daniele
` (4 more replies)
0 siblings, 5 replies; 28+ messages in thread
From: Eric Sandeen @ 2009-08-28 17:17 UTC (permalink / raw)
To: Passerone, Daniele; +Cc: xfs@oss.sgi.com
Passerone, Daniele wrote:
>>> Due to a power shock, suddenly and without log messages about one half
>> (5 TB) of the user
>>> directories on /dev/md4 have disappeared.
>> I presume you mean after a reboot?
>
>
>
> No! No reboot at all.
Ok then perhaps I don't know what you mean by "power shock"
> The directories were mounted via nfs to all our cluster,
> and are of the type
>
> /pool/user
>
> all directories from
>
> /pool/g*
>
> to
>
> /pool/z*
>
> have disappeared.
On the server as well? Or just clients? -really- no server-side errors
in the logs?
Are you sure the storage hardware & the md volume is in ok shape?
> Of course NO COMMAND of the kind rm /pool/[g-z]* was issued.
>
>> Were the lost directories recently created? I've never heard of
>> untouched, existing directories disappearing after a power loss...
>>
>
> Not at all.
> Here I am.
>> Not much to go on here I'm afraid. SLES10 is an old kernel, but it's
>> supported by SuSE at least.
>>
>
> Can you use my metadata, or is it useless?
Not yet, still wondering what really happened.
-eric
> Thank you!
>
> Daniele
>
* RE: xfs data loss
2009-08-28 17:17 ` Eric Sandeen
@ 2009-08-28 19:42 ` Passerone, Daniele
2009-08-29 6:08 ` Passerone, Daniele
` (3 subsequent siblings)
4 siblings, 0 replies; 28+ messages in thread
From: Passerone, Daniele @ 2009-08-28 19:42 UTC (permalink / raw)
To: Eric Sandeen; +Cc: xfs@oss.sgi.com
Hi Eric,
and thank you for your attention and your time.
>
>Ok then perhaps I don't know what you mean by "power shock"
>
We work at a materials science center, and there are big facilities for
simulating earthquakes and the like (believe it or not).
These facilities sometimes have the side effect of inducing strong
disturbances in the power supply.
We suspect that such a disturbance could have affected
the system of 48 drives which constitutes our NAS server.
But of course, this is only a hypothesis.
>
>On the server as well? Or just clients? -really- no server-side errors
>in the logs?
>
Really.
>Are you sure the storage hardware & the md volume is in ok shape?
This is a very good question.
Indeed, the md volume (md6) close to the affected one (md4) showed loss of 2 disks upon reboot, but a
repair of THAT filesystem (md6) worked.
>
>Not yet, still wondering what really happened.
>
Me too
Thanks a lot.
Daniele
>-eric
>
>> Thank you!
>>
>> Daniele
>>
* RE: xfs data loss
2009-08-28 17:17 ` Eric Sandeen
2009-08-28 19:42 ` Passerone, Daniele
@ 2009-08-29 6:08 ` Passerone, Daniele
2009-08-29 7:45 ` Ralf Gross
2009-08-29 7:11 ` Passerone, Daniele
` (2 subsequent siblings)
4 siblings, 1 reply; 28+ messages in thread
From: Passerone, Daniele @ 2009-08-29 6:08 UTC (permalink / raw)
To: Eric Sandeen; +Cc: xfs@oss.sgi.com
Dear Eric, all
During the night the "twin" partition of the previously affected one
(/dev/md5) also developed problems,
this time with a clear hardware cause:
one of the 20 disks failed (2 devices missing).
After that, and also after a reboot, I cannot mount the partition:
can't read superblock.
I would like to recover my data, so I try xfs_repair but...
ipazia-sun:~ # xfs_repair -v /dev/md5
Phase 1 - find and verify superblock...
superblock read failed, offset 0, size 524288, ag 0, rval 0
fatal error -- Invalid argument
What should I do now?
Thanks, Daniele
Here the logs:
Aug 29 04:52:31 ipazia-sun kernel: sdo: Current [descriptor]: sense key: Medium Error
Aug 29 04:52:31 ipazia-sun kernel: Additional sense: Unrecovered read error - auto reallocate failed
Aug 29 04:52:31 ipazia-sun kernel: Descriptor sense data with sense descriptors (in hex):
Aug 29 04:52:31 ipazia-sun kernel: 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Aug 29 04:52:31 ipazia-sun kernel: 08 6c c5 7e
Aug 29 04:52:31 ipazia-sun kernel: end_request: I/O error, dev sdo, sector 141346174
Aug 29 04:52:31 ipazia-sun kernel: raid5: read error not correctable.
Aug 29 04:52:31 ipazia-sun kernel: raid5: Disk failure on sdo1, disabling device. Operation continuing on 18 devices
Aug 29 04:52:31 ipazia-sun kernel: raid5: read error not correctable.
Aug 29 04:52:32 ipazia-sun kernel: raid5: read error not correctable.
Aug 29 04:52:32 ipazia-sun kernel: I/O error in filesystem ("md5") meta-data dev md5 block 0x229195c48 ("xlog_iodone") error 5 buf count 20480
Aug 29 04:52:32 ipazia-sun kernel: xfs_force_shutdown(md5,0x2) called from line 958 of file fs/xfs/xfs_log.c. Return address = 0xffffffff8829801c
Aug 29 04:52:32 ipazia-sun kernel: Filesystem "md5": Log I/O Error Detected. Shutting down filesystem: md5
Aug 29 04:52:32 ipazia-sun kernel: Please umount the filesystem, and rectify the problem(s)
Aug 29 04:52:32 ipazia-sun kernel: RAID5 conf printout:
Aug 29 04:52:32 ipazia-sun kernel: --- rd:20 wd:18 fd:2
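Before any repair attempt it is worth confirming the md array's own state, since a failed array explains the unreadable superblock (a sketch; device names are placeholders apart from those in the log above):

```shell
# Overall md status: which arrays are up, which members have failed.
cat /proc/mdstat

# Detailed state of the broken array, and the per-member metadata of
# the disk that threw the medium errors in the log above.
mdadm --detail /dev/md5
mdadm --examine /dev/sdo1
```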
>-----Original Message-----
>From: Eric Sandeen [mailto:sandeen@sandeen.net]
>Sent: Friday, August 28, 2009 7:18 PM
>To: Passerone, Daniele
>Cc: xfs@oss.sgi.com
>Subject: Re: xfs data loss
>
>Passerone, Daniele wrote:
>>>> Due to a power shock, suddenly and without log messages about one
>half
>>> (5 TB) of the user
>>>> directories on /dev/md4 have disappeared.
>>> I presume you mean after a reboot?
>>
>>
>>
>> No! No reboot at all.
>
>Ok then perhaps I don't know what you mean by "power shock"
>
>> The directories were mounted via nfs to all our cluster,
>> and are of the type
>>
>> /pool/user
>>
>> all directories from
>>
>> /pool/g*
>>
>> to
>>
>> /pool/z*
>>
>> have disappeared.
>
>On the server as well? Or just clients? -really- no server-side errors
>in the logs?
>
>Are you sure the storage hardware & the md volume is in ok shape?
>
>> Of course NO COMMAND of the kind rm /pool/[g-z]* was issued.
>>
>>> Were the lost directories recently created? I've never heard of
>>> untouched, existing directories disappearing after a power loss...
>>>
>>
>> Not at all.
>> Here I am.
>>> Not much to go on here I'm afraid. SLES10 is an old kernel, but it's
>>> supported by SuSE at least.
>>>
>>
>> Can you use my metadata, or is it useless?
>
>Not yet, still wondering what really happened.
>
>-eric
>
>> Thank you!
>>
>> Daniele
>>
* RE: xfs data loss
2009-08-28 17:17 ` Eric Sandeen
2009-08-28 19:42 ` Passerone, Daniele
2009-08-29 6:08 ` Passerone, Daniele
@ 2009-08-29 7:11 ` Passerone, Daniele
2009-08-29 20:03 ` Passerone, Daniele
2009-08-29 22:52 ` Passerone, Daniele
4 siblings, 0 replies; 28+ messages in thread
From: Passerone, Daniele @ 2009-08-29 7:11 UTC (permalink / raw)
To: Eric Sandeen; +Cc: xfs@oss.sgi.com
In particular, I don't understand why xfs_repair does not look for a secondary superblock
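One plausible explanation (an assumption, not confirmed in the thread): xfs_repair does search for backup superblocks, but only if the underlying block device is readable at all; with 2 of 20 RAID5 members gone, the md device fails every read before the search can start. A quick sketch to test the device level first:

```shell
# If this single-sector read fails, the problem is below XFS, and no
# xfs_repair invocation can help until the array is reassembled.
dd if=/dev/md5 of=/dev/null bs=512 count=1
```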
* Re: xfs data loss
2009-08-29 6:08 ` Passerone, Daniele
@ 2009-08-29 7:45 ` Ralf Gross
0 siblings, 0 replies; 28+ messages in thread
From: Ralf Gross @ 2009-08-29 7:45 UTC (permalink / raw)
To: xfs
Passerone, Daniele schrieb:
> Dear Eric, all
> During the night also the "twin" partition of the previously affected one
>
> (/dev/md5) got problems,
> this time with a clear hardware problem:
>
> one of the 20 disks was going on failure (2 devices missing).
> After that, and also after a reboot I cannot mount the partition:
> can't read superblock.
If it's a raid5 and 2 devices are missing, the raid is lost. Or am I
missing something?
Ralf
* Re: xfs data loss
2009-08-27 7:22 xfs data loss Passerone, Daniele
2009-08-27 9:41 ` Christian Kujau
2009-08-28 4:16 ` Eric Sandeen
@ 2009-08-29 14:08 ` Peter Grandi
2 siblings, 0 replies; 28+ messages in thread
From: Peter Grandi @ 2009-08-29 14:08 UTC (permalink / raw)
To: Linux XFS
> Dear xfs developers We have a SUN X4500 with 48 500 GB drives,
> that we configured under SUSE SLES 10.
> Among others, we have 3 RAID5 xfs filesystems, /dev/md4 with
> 20 units (9.27 TB) /dev/md5 with 20 units (9.27 TB) and
> /dev/md6 with 5 units (1.95 TB)
AAAAA-Amazing! 19+1 RAID5s. With all identical drives in the same
box. What a wondersome, challenging configuration!
> These units are not backed up.
Many would say that RAID means that backup is not necessary.
> Due to a power shock, suddenly and without log messages about
> one half (5 TB) of the user directories on /dev/md4 have
> disappeared. [ ... ]
That was unreally unforeseeable! Power problems never happen to
computers, especially those that fry electronics or cause write
errors on multiple drives, so why worry?
> Upon reboot, /dev/md6 showed only 3 units, and after a
> xfs_repair it was again ok.
Uh, that was the "/dev/md6 with 5 units", so I guess Elvis
personally came to deliver an exact copy of drive #4, or else
someone has invented a new algorithm to ensure that a RAID5 can
lose 2 drives and still be fine. In the latter case rush to the
patent office -- you shall become billionaires.
> /dev/md4 mounted immediately, but always with one half of the
> directories. 1) xfs_check gives no problem on /dev/md4 but 2)
> xfs_logprint [ ... ]
> 3) xfs_ncheck eats a lot of memory and freezes after 6-7 hours
> without giving output
I wonder whether the further advanced technique of creating a
multi-TB filesystem on a 32-bit kernel was used for higher
sophistication.
> WHat can I do? [ ... ]
Well, I am not at the level of those who can develop such an
advanced understanding of storage systems that includes 19+1
arrays and RAID5s that can lose 2 drives and be "again OK".
But I'd look at how many controllers and drives are actually
still working by doing a read test of all the drives. Once that
is known perhaps some recovery strategy can be discerned. If
several drives have failed it might take several weeks or months
or even years to do a partial recovery. If things are really
very lucky only some of the 6 host adapters have failed, or only
one drive per array, and replacing the host adapters will get
things working again, except that running 'xfs_repair' on an
incomplete array will have made things even better, more
challenging and advanced.
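The read test suggested above might look like this (a sketch; the device glob is a placeholder and must be adjusted to the real member disks):

```shell
# Read every member drive end to end; I/O errors appear on stderr
# and in the kernel log, identifying drives with unreadable sectors.
for d in /dev/sd[a-z]; do
    echo "=== testing $d ===" >> readtest.log
    dd if="$d" of=/dev/null bs=1M 2>> readtest.log
done
```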
* RE: xfs data loss
2009-08-28 17:17 ` Eric Sandeen
` (2 preceding siblings ...)
2009-08-29 7:11 ` Passerone, Daniele
@ 2009-08-29 20:03 ` Passerone, Daniele
2009-08-29 22:14 ` Michael Monnerie
2009-08-29 22:52 ` Passerone, Daniele
4 siblings, 1 reply; 28+ messages in thread
From: Passerone, Daniele @ 2009-08-29 20:03 UTC (permalink / raw)
To: xfs@oss.sgi.com
I would like to ask Mr. Peter Grandi whether it is really necessary to deliver his vast knowledge in such a harsh way.
Is this the habit of this mailing list?
Apart from that, thank you for your help.
I understand that RAID5 is not the ideal solution for that system, and I admit that
in the urgency of solving the /md4 problem I misstated the problem of /md6, which
of course was "erased" and not "repaired".
But apart from that, it is not so easy to back up 20 TB, so we decided to set it up as
data storage, leaving the responsibility for backups to our users.
I do not consider that completely absurd.
Moreover, when a RAID loses 2 devices and the devices are still OK, it is possible
to reassemble the RAID by assuming the devices clean.
This is not the case for /dev/md4, where apparently all devices are there.
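The forced reassembly referred to here is usually done along these lines (a sketch; member names are placeholders, and as discussed later in the thread this is only safe if the dropped members really are consistent):

```shell
# Stop the half-assembled array, then force-assemble it from members
# that dropped out but are believed to be intact.
mdadm --stop /dev/md5
mdadm --assemble --force /dev/md5 /dev/sd[e-x]1
```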
Thanks,
Daniele
* Re: xfs data loss
2009-08-29 20:03 ` Passerone, Daniele
@ 2009-08-29 22:14 ` Michael Monnerie
0 siblings, 0 replies; 28+ messages in thread
From: Michael Monnerie @ 2009-08-29 22:14 UTC (permalink / raw)
To: xfs
On Samstag 29 August 2009 Passerone, Daniele wrote:
> But apart from that, it is not as easy to backup 20 TB, so we decided
> to set it as data storage leaving the responsibilty of the backup to
> our users. I do not consider it completely absurd.
Right, if you communicated this to users it's OK.
But really, don't create any RAID with more than 8 data disks.
Performance doesn't increase above that, and the chance that some
disk in the array dies is already 8× that of a single disk.
I wish you luck with your recovery, but please try to split your 20
disks, make it 2x9 disks with a RAID-5, better RAID-6, and connect those
two via RAID-0. So you get a RAID-50 or RAID-60. Take the remaining 2
drives as hot spare. This will protect you at least from drive failures,
and speeds up recreating the RAID when a disk dies.
Try to connect the disks which are in a single RAID-5/6 via the same
controllers, so if a controller dies it's only one RAID-5/6 part that
dies, which will help to make it possible to repair.
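The layout suggested above might be created roughly like this (a sketch with placeholder device names, not a tested recipe):

```shell
# Two 9-disk RAID-6 sets on separate groups of member disks.
mdadm --create /dev/md10 --level=6 --raid-devices=9 /dev/sd[b-j]1
mdadm --create /dev/md11 --level=6 --raid-devices=9 /dev/sd[k-s]1

# Stripe the two sets together: a RAID-60.
mdadm --create /dev/md12 --level=0 --raid-devices=2 /dev/md10 /dev/md11

# The two remaining disks become hot spares, one per RAID-6 set.
mdadm --add /dev/md10 /dev/sdt1
mdadm --add /dev/md11 /dev/sdu1
```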
mfg zmi
--
// Michael Monnerie, Ing.BSc ----- http://it-management.at
// Tel: 0660 / 415 65 31 .network.your.ideas.
// PGP Key: "curl -s http://zmi.at/zmi.asc | gpg --import"
// Fingerprint: AC19 F9D5 36ED CD8A EF38 500E CE14 91F7 1C12 09B4
// Keyserver: wwwkeys.eu.pgp.net Key-ID: 1C1209B4
* xfs data loss
2009-08-28 17:17 ` Eric Sandeen
` (3 preceding siblings ...)
2009-08-29 20:03 ` Passerone, Daniele
@ 2009-08-29 22:52 ` Passerone, Daniele
2009-08-30 1:24 ` Eric Sandeen
2009-09-01 12:45 ` Peter Grandi
4 siblings, 2 replies; 28+ messages in thread
From: Passerone, Daniele @ 2009-08-29 22:52 UTC (permalink / raw)
To: xfs@oss.sgi.com
I would like to ask Mr. Peter Grandi whether it is really necessary to deliver his vast knowledge in such a harsh way.
Is this the habit of this mailing list?
Apart from that, thank you for your help.
I understand that RAID5 is not the ideal solution for that system, and I admit that
in the urgency of solving the /md4 problem I misstated the problem of /md6, which
of course was "erased" and not "repaired".
But apart from that, it is not so easy to back up 20 TB, so we decided to set it up as
data storage, leaving the responsibility for backups to our users.
I do not consider that completely absurd.
Moreover, when a RAID loses 2 devices and the devices are still OK, it is possible
to reassemble the RAID by assuming the devices clean.
This is not the case for /dev/md4, where apparently all devices are there.
Thanks,
Daniele
* Re: xfs data loss
2009-08-29 22:52 ` Passerone, Daniele
@ 2009-08-30 1:24 ` Eric Sandeen
2009-08-30 8:17 ` Michael Monnerie
2009-09-01 12:45 ` Peter Grandi
1 sibling, 1 reply; 28+ messages in thread
From: Eric Sandeen @ 2009-08-30 1:24 UTC (permalink / raw)
To: Passerone, Daniele; +Cc: xfs@oss.sgi.com
Passerone, Daniele wrote:
> I would like to ask mr. Peter Grandi, whether it is really necessary
> to delivery ist vaste knowledge in such a harsh way. Is this the
> habit of this mailing list?
Not generally.
> Apart from that, thank you for you help. I understand that RAID5 is
> not the ideal solution for that system, and I admit that in the
> urgence of solving the /md4 problem I miswrote the problem of /md6,
> which of course was "erased" and not "repaired".
>
> But apart from that, it is not as easy to backup 20 TB, so we decided
> to set it as data storage leaving the responsibilty of the backup to
> our users. I do not consider it completely absurd.
I think others have pointed out, though, that you start -increasing- the
risk of failure at a certain point...
> Moreover, when a raid loses 2 devices, and the devices are still ok,
> it is possible to reassemble the raid by assuming the devices clean.
>
> This is not the case for /Raid/md4, where apparently all devices are
> there.
This all seems most likely to be a raid failure problem, but it's hard
to know. I can't imagine why you're getting suddenly-disappearing
directories without a reboot or even a single error message; I just
don't know what to make of that.
-Eric
* Re: xfs data loss
2009-08-30 1:24 ` Eric Sandeen
@ 2009-08-30 8:17 ` Michael Monnerie
0 siblings, 0 replies; 28+ messages in thread
From: Michael Monnerie @ 2009-08-30 8:17 UTC (permalink / raw)
To: xfs
On Sonntag 30 August 2009 Eric Sandeen wrote:
> This all seems most likely to be a raid failure problem, but it's
> hard to know. I can't imagine why you're getting
> suddenly-disappearing directories without a reboot or even a single
> error message; I just don't know what to make of that.
Exactly, what does "we had a power shock" mean? The OP said it was no
power failure with reboot, so what happened?
mfg zmi
--
// Michael Monnerie, Ing.BSc ----- http://it-management.at
// Tel: 0660 / 415 65 31 .network.your.ideas.
// PGP Key: "curl -s http://zmi.at/zmi.asc | gpg --import"
// Fingerprint: AC19 F9D5 36ED CD8A EF38 500E CE14 91F7 1C12 09B4
// Keyserver: wwwkeys.eu.pgp.net Key-ID: 1C1209B4
* Re: xfs data loss
2009-08-29 22:52 ` Passerone, Daniele
2009-08-30 1:24 ` Eric Sandeen
@ 2009-09-01 12:45 ` Peter Grandi
2009-09-01 22:16 ` Michael Monnerie
2009-09-04 11:08 ` Andi Kleen
1 sibling, 2 replies; 28+ messages in thread
From: Peter Grandi @ 2009-09-01 12:45 UTC (permalink / raw)
To: Linux XFS
> [ ... ] such a harsh way.
Harsh? That sounds way too harsh. :-)
When you write to a mailing list asking for free help and support,
it is rather rude not to have done some preliminary work, such as
figuring out the characteristics of RAID5 in case of failure. It
is also somewhat rude (but amazingly common) to make confused and
partial reports, such as not checking and reporting what has
actually failed.
> Is this the habit of this mailing list?
Depends -- some people here are XFS salesmen, in that their career
and employability depend at least in part on widespread adoption
of XFS, and on support from other kernel subsystem guys, who may
be one day on an interview panel (the guild of Linux kernel
hackers is a pretty small and closed world in practice). These are
sell-side engineers, and they will be smooth and emollient even in
the face of outrageously ridiculous stuff. Sell-side engineers,
just like sell-side stock analysts, never issue anything as harsh
as a "sell" recommendation.
That's what I do myself when I am on the sell-side, to my
coworkers and customers; they pay me to solve their problems, not
to tell them they are idiots for creating those problems, and
suffering fools gladly is part of what I get paid for.
But here I am on the buy-side; I am buying XFS (and the Linux
block layer), not selling it. Not only that, I am providing unpaid
opinions.
Since I am here buying, and actually paying with my time, I can
comment more openly than someone with a sell-side POV, but still
in a relatively soft way, about the merit of the issues I comment
upon.
> Apart from that, thank you for you help.
But a soft yet more open assessment of how outrageous some queries
are is help too, as it makes it easier to assess the gravity of the
situation. The smooth, emollient sell-side people will let you dig
your own grave. Just consider your statement below about "assume
clean" that to me sounds very dangerous (big euphemism), and that
did not elicit any warning from the sell-side:
> Moreover, when a raid loses 2 devices, and the devices are still
> ok, it is possible to reassemble the raid by assuming the
> devices clean.
Sure you can reassemble the RAID, but what do you mean by "still
ok"? Have you read-tested those 2 drives? Have you tested the
*other* 18 drives? How do you know none of the other 18 drives got
damaged? Have you verified that only the host adapter electronics
failed or whatever it was that made those 2 drives drop out?
Why do you *need* to assume clean? If the 2 "lost" drives are
really ok, you just resync the array. If you *need* to assume
clean, it is likely that you have lost something like 5% of data
in (every stripe and thus) most files and directories (and
internal metadata) and will be replacing it with random
bytes. That will very likely cause XFS problems (the least of the
problems of course).
> I understand that RAID5 is not the ideal solution for that
> system, [ ... ]
That we don't know for sure; I personally very much dislike RAID5,
but for throw-away mostly read-only data I have to concede that it
seems appropriate. It is rather better than RAID6 in almost every
reasonable situation. Still a 19+1 array sounds rather bizarre to
say the least. Especially in a place where part of the everyday
activity is earthquake simulation...
> But apart from that, it is not as easy to backup 20 TB,
Or to 'fsck' several TB as you also discovered. Anyhow my opinion
is that the best way to backup large storage servers is another
large storage server (or more than one). When I buy a hard drive I
buy 3 backup drives for each "live" drive I use -- at *home*.
> so we decided to set it as data storage leaving the
> responsibilty of the backup to our users. I do not consider it
> completely absurd.
Not at all absurd -- if those users *really* accept that. But you
are trying to recover the arrays instead of scratching them and
restarting. That suggests to me that the users did not actually
accept that. If the real agreement with the users is "you have to
keep backups, but if something happens you will behave as if you
cannot or don't want to restore them" it is quite different.
> This is not the case for /Raid/md4, where apparently all devices
> are there.
That's not so clear. One problem with trying to provide some
opinions on your issue and whether the filesystems are recoverable
is that you haven't made clear what failed and how you tested each
component of each array to make sure that what is still working is
known (and talk of "assume clean" is very suspicious).
I'd check *everything* because until then you don't know how much
has been damaged where, as a major power issue may have affected
*everything* even if only partially. When you wrote:
> one half (5 TB) of the user directories on /dev/md4 have
> disappeared.
that seems to indicate some major filesystem metadata and data
loss, and the idea of "assume clean" seems to me extremely
dangerous. Also '/dev/md5' seems to have reported serious drive
issues, so perhaps something bad happened to the '/dev/md4' drives
too.
That you have tried to run repair tools on a filesystem with an
incomplete storage layer may have made things rather worse, so
knowing *exactly* what has failed may help you a lot.
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: xfs data loss
2009-09-01 12:45 ` Peter Grandi
@ 2009-09-01 22:16 ` Michael Monnerie
2009-09-04 11:08 ` Andi Kleen
1 sibling, 0 replies; 28+ messages in thread
From: Michael Monnerie @ 2009-09-01 22:16 UTC (permalink / raw)
To: xfs
On Tuesday, 01 September 2009, Peter Grandi wrote:
> knowing *exactly* what has failed may help you a lot.
Thank you for your very verbose posting, it was fun to read. And the
last line should be answered by the OP.
mfg zmi
--
// Michael Monnerie, Ing.BSc ----- http://it-management.at
// Tel: 0660 / 415 65 31 .network.your.ideas.
// PGP Key: "curl -s http://zmi.at/zmi.asc | gpg --import"
// Fingerprint: AC19 F9D5 36ED CD8A EF38 500E CE14 91F7 1C12 09B4
// Keyserver: wwwkeys.eu.pgp.net Key-ID: 1C1209B4
* RE: xfs data loss
@ 2009-09-03 15:31 Passerone, Daniele
2009-09-05 18:29 ` Peter Grandi
0 siblings, 1 reply; 28+ messages in thread
From: Passerone, Daniele @ 2009-09-03 15:31 UTC (permalink / raw)
To: xfs@oss.sgi.com
Dear Peter,
Thank you very much for the time spent in writing this long and
interesting answer. Now I agree with you that harsh and useful is better
than emollient and lying :-)
> When you write to a mailing list asking for free help and support,
> it is rather rude to not have done some preliminary work, such as
> figuring out the characteristics of RAID5 in case of failure. It
> is also somewhat rude (but amazingly common) to make confused and
> partial reports, such as not checking and reporting what has
> actually failed.
That is true. Unfortunately I am not the person who assembled the RAID5
and configured the machine, and I had to act mostly alone to figure out
what to do. That is why I eventually preferred to make a partial report.
> But a soft but more open assessment of how outrageous some queries
> are is help too as it makes it easier to assess the gravity of the
> situation. The smooth, emollient sell-side people will let you dig
> your own grave. Just consider your statement below about "assume
> clean" that to me sounds very dangerous (big euphemism), and that
> did not elicit any warning from the sell-side:
At the beginning of this week I was confronted with the following
situation:
1) /dev/md4, a 19+1 RAID 5, with the corresponding xfs /raidmd4 filesystem
that had lost half of the directories
on the 24th of August, for NO PARTICULAR APPARENT REASON (and this still drives me crazy).
No logs, nothing.
2) /dev/md5, a 19+1 RAID 5, that could not mount anymore...lost superblock.
3) /dev/md6 , a 4+1 RAID5, that was not mounting anymore because 2 devices were lost.
My colleague zapped the filesystem (which was almost empty), and rebuilt the RAID5.
Unfortunately I cannot say exactly what he did.
For 2) it was clear what happened:
Within a few days of each other, two devices of /dev/md5 died.
The death of a device is reported in /var/log/warn.
We had not checked it in the preceding days, so when the second device died, it was too late.
BUT: I followed the advice to make a read test on all devices (using dd) and all were ok.
So it seemed to be a RAID controller problem, of the same kind described here
http://maillists.uci.edu/mailman/public/uci-linux/2007-December/002225.html
where a solution is proposed that includes reassembling the RAID using mdadm with the option
"assume-clean". This is where this "assume-clean" comes from: from a read test, followed by
the study of the above mailing list post.
The resync of /dev/md5 was performed and the RAID again had 20 working devices,
but at the end of the day the filesystem was still not able to mount.
So I was eventually forced to do xfs_repair -L /dev/md5, which was a nightmare:
an incredible amount of forking, inodes cleared... but eventually... successful.
I was in the meanwhile 10 years older and with all my hair suddenly greyed, but...
RESULT: /dev/md5 is again up and running, with all data.
BUT at the same time, /dev/md4 was not able to mount anymore: superblock error.
So, at that point we bought another big drive (7 TB), performed a backup of /dev/md5,
and then ran the same procedure on /dev/md4.
RESULT: /dev/md4 is again up and running, but the data that disappeared on August 24 were still missing.
Since the array included all devices, at this point I ran xfs_repair -L /dev/md4. But nothing happened:
no error, and half of the data still missing.
So at this point I don't understand.
THERE IS ONE IMPORTANT THING THAT I DID NOT MENTION, BECAUSE IT WAS NOT EVIDENT BY LOOKING AT /etc/raidtab,
/proc/mdstat, etc., and it was done by my collaborator:
all structure of the RAIDs, partitioning etc. was done using YaST2 with LVM.
The use of LVM is a mystery to me, even more so than the basics of RAID ( :-) )
The /etc/lvm/backup and archive directories are empty.
In YaST2 the LVM panel is now empty, and I have forbidden my collaborator to try to go through LVM now...
Coming to other specific questions:
>Sure you can reassemble the RAID, but what do you mean by "still
>ok"? Have you read-tested those 2 drives? Have you tested the
>*other* 18 drives? How do you know none of the other 18 drives got
>damaged? Have you verified that only the host adapter electronics
>failed or whatever it was that made those 2 drives drop out?
Tested all drives, but not the host adapter electronics.
>Why do you *need* to assume clean? If the 2 "lost" drives are
>really ok, you just resync the array.
Well, following the post above, after checking that the lost drives were ok,
I first stopped the RAID, then created the RAID with 20 drives assuming them clean,
then stopped it again, then assembled it with resyncing.
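In commands, that sequence was roughly the following. This is only a dry-run sketch, not my exact invocation: the device list, level and member count here are illustrative, and a real mdadm --create must repeat the original geometry (chunk size, layout, device order) exactly, or it silently destroys the data it is supposed to preserve.

```shell
# DRY_RUN=1 only prints each command instead of executing it.
DRY_RUN=1
run() { if [ "$DRY_RUN" = 1 ]; then echo "$@"; else "$@"; fi; }

# Illustrative member list -- the real array had 20 drives.
DEVICES="/dev/sdb1 /dev/sdc1 /dev/sdd1"

run mdadm --stop /dev/md5
# --assume-clean skips the initial resync and trusts the existing
# on-disk data and parity; geometry MUST match the original array.
run mdadm --create /dev/md5 --level=5 --raid-devices=20 \
    --assume-clean $DEVICES
run mdadm --stop /dev/md5
# A plain assemble then lets md resync/verify the array normally.
run mdadm --assemble /dev/md5 $DEVICES
```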
>If you *need* to assume
>clean, it is likely that you have lost something like 5% of data
>in (every stripe and thus) most files and directories (and
>internal metadata) and will be replacing it with random
>bytes. That will very likely cause XFS problems (the least of the
>problems of course).
On the /raidmd5 fortunately this was not the case.
>Especially in a place where part of the everyday
>activity is earthquake simulation...
LOL you are right.
> But apart from that, it is not as easy to backup 20 TB,
>Or to 'fsck' several TB as you also discovered. Anyhow my opinion
>is that the best way to backup large storage servers is another
>large storage server (or more than one). When I buy a hard drive I
>buy 3 backup drives for each "live" drive I use -- at *home*.
At least now we did that right.
>Not at all absurd -- if those users *really* accept that. But you
>are trying to recover the arrays instead of scratching them and
>restarting. That suggests to me that the users did not actually
>accept that. If the real agreement with the users is "you have to
>keep backups, but if something happens you will behave as if you
>cannot or don't want to restore them" it is quite different.
Well. You would be surprised to know how stupid scientists can be when
they ignore the worst case scenario.
Including myself.
I knew exactly the situation, but if I had not succeeded in recovering
/raid/md5, it would have been a hard moment for me and my research group.
And we ALL knew that there were no backups.
>That's not so clear. One problem with trying to provide some
>opinions on your issue and whether the filesystems are recoverable
>is that you haven't made clear what failed and how you tested each
>component of each array to make sure that what is still working is
>known (and talk of "assume clean" is very suspicious).
Just to clarify: assume-clean was an option to the mdadm --create command,
used when I discovered that my 20 devices were there and running: I ran a dd command
reading the first megabytes of each device.
Was this wrong?
>That you have tried to run repair tools on a filesystem with an
>incomplete storage layer may have made things rather worse, so
>knowing *exactly* what has failed may help you a lot.
I will contact the Sun service and ask them to check the whole storage/controller part.
In the meanwhile I am almost convinced that the 4-5 TB lost on /dev/md4 are lost for good.
I sent the metadata one week ago to the mailing list. Do you think this could help in examining
the famous 20 drives?
I hope I can catch up. I am trying to learn quickly.
Thanks,
Daniele
* Re: xfs data loss
2009-09-01 12:45 ` Peter Grandi
2009-09-01 22:16 ` Michael Monnerie
@ 2009-09-04 11:08 ` Andi Kleen
1 sibling, 0 replies; 28+ messages in thread
From: Andi Kleen @ 2009-09-04 11:08 UTC (permalink / raw)
To: Peter Grandi; +Cc: Linux XFS
pg_xf2@xf2.to.sabi.co.UK (Peter Grandi) writes:
>
> Depends -- some people here are XFS salesmen, in that their career
> and employability depend at least in part on widespread adoption
> of XFS, and on support from other kernel subsystem guys, who may
> be one day on an interview panel (the guild of Linux kernel
> hackers is a pretty small and closed world in practice). These are
> sell-side engineers, and they will be smooth and emollient even in
> the face of outrageously ridiculous stuff.
The main thing that seems `outrageously ridiculous' is your cynical,
totally unfair and, in my experience, incorrect description of the
people who are doing great work on XFS and who, unlike you, are actually
helping users on this mailing list and improving Linux.
-Andi
* Re: xfs data loss
@ 2009-09-04 11:45 Passerone, Daniele
0 siblings, 0 replies; 28+ messages in thread
From: Passerone, Daniele @ 2009-09-04 11:45 UTC (permalink / raw)
To: xfs@oss.sgi.com
Commenting further on my preceding message, I would just like to stress the fact that everybody here has tried to help - xfs and non-xfs people alike. So I have seen no emollient answers here, at least not to my query.
Mr. Peter Grandi was harsh - very harsh at the beginning - but I think he also spent time thinking about my problem. For that I am grateful.
I am less grateful for being called "outrageously ridiculous". But I can skip that in times of trouble...
Daniele
* RE: xfs data loss
2009-09-03 15:31 Passerone, Daniele
@ 2009-09-05 18:29 ` Peter Grandi
[not found] ` <4AA3261E.1000005@sandeen.net>
0 siblings, 1 reply; 28+ messages in thread
From: Peter Grandi @ 2009-09-05 18:29 UTC (permalink / raw)
To: Linux XFS
> [ ... ]
> 1) /dev/md4 a 19+1 RAID 5, with the corresponding xfs /raidmd4
> filesystem that had lost half of the directories on the 24th
> of August; for NO PARTICULAR APPARENT REASON (and this still
> makes me crazy). No logs, nothing.
But there was apparently a power "event" of some sort, and IIRC
the system stopped working, and there were other signs that the
block layer had suffered damage:
> 2) /dev/md5, a 19+1 RAID 5, that could not mount
> anymore...lost superblock.
The fact that there was apparent difficulty means that the
automatic "resync" that RAID5 implementations do if only 1 drive
has been lost did not work, which is ominous.
> 3) /dev/md6 , a 4+1 RAID5, that was not mounting anymore because
> 2 devices were lost. My collegue zapped the filesystem (which
> was almost empty), and rebuilt the RAID5.
So let's forget about it, except that it indicates that there
was extensive storage system damage, whether detected or not.
> For 2) it was clear what happened: At the distance of a few
> days, two devices of /dev/md5 died. The information about the
> death of one device is issued in /var/log/warn. We did not
> check it during the last days, so when the second device died,
> it was too late.
With a 19+1 RAID5 with 2 devices dead you have lost around 5-6%
of the data; regrettably this is not 5-6% of the files, but most
likely 5-6% of most files (and probably quite a bit of XFS metadata).
> BUT: I followed the advice to make a read test on all devices
> (using dd) and all were ok.
That is good news, but it is not clear what "all ok" means here,
when "two devices of /dev/md5 died". Maybe the two ports on the
host adapter died, but it is far from clear even given this:
> So it seemed to be a raid controller problem, of the same kind
> described here
> http://maillists.uci.edu/mailman/public/uci-linux/2007-December/002225.html
> where a solution is proposed including the reassembling of the
> raid using mdadm with the option "assume-clean". This is where
> this "assume-clean" comes from: from a read test, followed by
> the study of the above mailing list post.
Oops. I suspect that one should not believe everything one reads
in a mailing list. The statement over there:
> It's set up as a RAID5 (one parity disk), with no spares. [
> ... ] Trying to force mdadm to assemble it did not work: $
> mdadm --assemble /dev/md0 --chunk 16 /dev/sd*1 mdadm:
> /dev/md0 assembled from 2 drives - not enough to start the
> array. It was a 4-disk array, so this is a failure.
> However, it did not destroy any data either.
Seems to be extremely optimistic (I am trying to be emollient and
mellifluous here :->).
> The resync of the /dev/md5 was performed, the raid was again
> with 20 working devices,
The original 20 devices, or did you put in 2 new blank hard drives?
My impression was that 2 blank drives went in, but then later I read
that all [original] 20 drives could be read for a few MB at the
beginning.
> but at the end of the day the filesystem still was not able to
> mount. So, I was eventually forced to do xfs_repair -L
> /dev/md5, which was a nightmare: incredible number of forking,
> inodes cleared... but eventually... successful. I was in the
> meanwhile 10 years older and with all my hair suddenly greyed,
> but... RESULT: /dev/md5 is again up and running, with all
> data.
I suspect that "with all data" is also extremely optimistic.
There is one vital detail here: the XFS design in effect makes
two assumptions:
* The block layer is error free. By and large XFS does not even
check that the block layer behaves perfectly. It is the sysadm
responsibility to ensure that.
* XFS only ensures consistency of metadata, for data the
application takes care.
> BUT at the same time, /dev/md4 was not able to mount anymore:
> superblock error.
> So, at that point we bought another big drive (7 TB), we
> performed backup of /dev/md5 , and then we run the same
> procedure on /dev/md4.
Backing up existing data is a very good idea before doing any
repair work.
> RESULT: /dev/md4 is again up and running, but the data
> disappeared on August 24 were still missing.
> Since the structure was including all devices, at this point I
> run xfs_repair -L /dev/md4. But nothing happens. No error, and
> half of the data still missing. So at this point I don't
> understand.
Well, I can try to explain the bits that maybe are missing.
* Almost all your problems are block layer problems. Since XFS
assumes an error-free block layer, it is your task to ensure that
the block layer is error free. Which means that almost all the
work that you should have done was to first ensure that the
block layer is error free, by testing each drive fully and
then putting together the array. It is quite likely that none
of the issues that you have reported has much to do with XFS.
* The array contains an XFS filesystem with a bit of metadata
and a lot of data. If something like 5% of the array is replaced by
random numbers (usually zeroes) one can be "lucky" and less than
5% of the metadata will be affected, and what is affected can be
reconstructed from other information. If this is the case then
'xfs_repair' will reconstruct the metadata and leave the data
alone. XFS and its utilities check the metadata and try to
reconstruct it, but do nothing for the data.
* This makes it look like the *filesystem* is fine, even if
quite a bit of data in each file has been replaced. XFS wisely
does nothing for the data (other than avoiding deliberately
damaging it) -- if your application does not add redundancy or
checksums to the data, you have no way to reconstruct it or even
check whether it is damaged in case of partial loss.
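As an illustration of the kind of application-level check XFS deliberately leaves to you: a checksum manifest made while the data is known good can later tell you exactly which files changed. A sketch (paths are hypothetical):

```shell
# Record a SHA-256 checksum for every file under a tree ...
make_manifest()  { find "$1" -type f -print0 | xargs -0 sha256sum > "$2"; }
# ... and later list only the files whose contents no longer match.
check_manifest() { sha256sum -c --quiet "$1"; }
# Usage:
#   make_manifest /raidmd5 /root/md5.sha256    # before any incident
#   check_manifest /root/md5.sha256            # after a repair
```

Without such a manifest made beforehand, there is no way to tell a correct file from one with a silently corrupted stripe in the middle.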
> THERE IS ONE IMPORTANT THING THAT I DID NOT MENTION, BECAUSE IT
> WAS NOT EVIDENT BY LOOKING AT /etc/raidtab, /proc/mdstat, etc.,
> and it was done by my collaborator All structure of the raids,
> partitioning etc. was done using Yast2 with LVM.
That's not important in itself, but it matters whether LVM used DM
for RAIDing, as it has fewer checking and repair options than MD.
>> Sure you can reassemble the RAID, but what do you mean by
>> "still ok"? Have you read-tested those 2 drives? Have you
>> tested the *other* 18 drives? How do you know none of the other
>> 18 drives got damaged? Have you verified that only the host
>> adapter electronics failed or whatever it was that made those 2
>> drives drop out?
> Tested all drives, but not the host adapter electronics.
Later on you say you tested only the first few MB of each drive.
We still don't know what really happened.
BTW, you mention LVM later but it is not clear whether you are
using LVM on top of MD or LVM on top of DM. If it is on top of
MD, a good way to check disk health regularly is to use the
option to verify the array. This is for example described
here: http://en.gentoo-wiki.com/wiki/RAID/Software#Data_Scrubbing
but this only works if the array was built with MD, not DM.
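With MD, that periodic verification can be triggered through sysfs. A sketch (the array name is illustrative, and the sysfs root is a parameter only so the function can be exercised outside a real system):

```shell
# Ask md to read-verify every stripe of an array and report the
# parity mismatch count afterwards.
scrub_check() {
    dir="${SYS_BLOCK:-/sys/block}/$1/md"
    [ -d "$dir" ] || { echo "no such md array: $1" >&2; return 1; }
    echo check > "$dir/sync_action"   # starts a background scrub
    # In real use, poll /proc/mdstat until the check has finished.
    cat "$dir/mismatch_cnt"           # non-zero => inconsistent stripes
}
# Usage: scrub_check md4
```

Run regularly, this catches a single failing member while the redundancy to rebuild it still exists, instead of discovering it only when a second member dies.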
>> Why do you *need* to assume clean? If the 2 "lost" drives are
>> really ok, you just resync the array.
> Well, following the post above, after checking that the lost
> drives are ok, first I stop the raid, then I create the raid
> with 20 drives assuming them clean, then I stop it again, then
> assemble it with resyncing.
If the array was very, very lucky none of the 20 drives was
actually damaged, some just stopped working momentarily, and
you could have actually just done the 'resync'; in fact the
'resync' is automatic in both DM and MD arrays.
>> If you *need* to assume clean, it is likely that you have lost
>> something like 5% of data in (every stripe and thus) most files
>> and directories (and internal metadata) and will be replacing
>> it with random bytes. That will very likely cause XFS problems
>> (the least of the problems of course).
> On the /raidmd5 fortunately this was not the case.
This still seems most likely extremely optimistic.
[ ... ]
> Well. You would be surprised to know how stupid scientists can
> be when they ignore the worst case scenario.
Well, I am familiar with a new "big science" place where lab
time costs several thousand $/hour; most of the scientists have
had data losses at other places before, and they have become
rather paranoid about that :-).
> Just to clarify: assume-clean was an option to the mdadm
> --create command when I discovered that my 20 devices were
> there and running: I run a dd command reading the first
> megabytes of each device. Was this wrong?
Given that the extent of the damage is unknown, you should have done a
scan of each disk in its entirety. The killer for RAID5 is when 2
or more disks have damage at the same offset.
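For reference, a full-surface read test is a sketch like the following (device names are whatever your members are; it is read-only and safe, but expect several hours per 500 GB drive):

```shell
# Read every sector of each given device; report and count failures.
read_test() {
    bad=0
    for dev in "$@"; do
        if dd if="$dev" of=/dev/null bs=4M 2>/dev/null; then
            echo "OK   $dev"
        else
            echo "FAIL $dev"
            bad=$((bad + 1))
        fi
    done
    return "$bad"
}
# Usage: read_test /dev/sd[a-t]1
```

Unlike reading the first few MB, this exercises the whole surface, which is the only way to find two members bad at the same offset before a rebuild does.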
> In the meanwhile I am almost convinced that that 4-5 TB lost
> on /dev/md4 are lost for good. [ ... ]
My current guess is:
* 2 or more disks in each of the 20-disk arrays are damaged at
the same offsets, and full data recovery is not possible.
* Somehow 'xfs_repair' managed to rebuild the metadata of
'/dev/md5' despite a loss of 5-6% of it, so it looks
"consistent" as far as XFS is concerned, but up to 5-6% of
each file is essentially random, and it is very difficult to
know where the random parts are.
* With '/dev/md4' the 5-6% of metadata lost was in
more critical parts of the filesystem, so the metadata for
half of the files is gone. Of the remaining files, up to
5-6% of their data is random.
It may well be more than 5-6% if in fact more than 2 drives per
array lost data.
Or the malfunction of the 2 or more drives that failed in each
array was "temporary", but then it is hard to imagine why there
were problems with the RAID resync and XFS checking.
* xfs data loss
@ 2009-09-06 9:00 Passerone, Daniele
2009-09-06 9:30 ` Michael Monnerie
2009-09-06 21:00 ` Peter Grandi
0 siblings, 2 replies; 28+ messages in thread
From: Passerone, Daniele @ 2009-09-06 9:00 UTC (permalink / raw)
To: xfs@oss.sgi.com
> [ ... ]
Hi Peter, thank you for your long message. Some of the things you suppose,
though, may not be exact. I'll try to give you some new elements.
>But there was apparently a power "event" of some sort, and IIRC
>the system stopped working, and there were other signs that the
>block layer had suffered damage
DP> 2) /dev/md5, a 19+1 RAID 5, that could not mount
DP> anymore...lost superblock.
PG> The fact that there was apparent difficulty means that the
PG> automatic "resync" that RAID5 implementations do if only 1 drive
PG> has been lost did not work, which is ominous.
PG> With a 19+1 RAID5 with 2 devices dead you have lost around 5-6%
PG> of the data; regrettably this is not 5-6% of the files, but most
PG> likely 5-6% of most files (and probably quite a bit of XFS metadata).
Up to now I have found no damage in any file of md5 after recovery with
mdadm --assemble --assume-clean.
Just an example: a MB-sized tar.gz file, the compression of a postscript file,
uncompressed perfectly and was displayed perfectly by ghostview.
Moreover, a device died (a different one) yesterday, and in the messages I have:
Sep 4 11:00:44 ipazia-sun kernel: Badness in mv_start_dma at drivers/ata/sata_mv.c:651
Sep 4 11:00:44 ipazia-sun kernel:
Sep 4 11:00:44 ipazia-sun kernel: Call Trace: <ffffffff88099f96>{:sata_mv:mv_qc_issue+292}
Sep 4 11:00:44 ipazia-sun kernel: <ffffffff88035600>{:scsi_mod:scsi_done+0} <ffffffff8807b214>{:libata:ata_scsi_rw_xlat+0}
Sep 4 11:00:44 ipazia-sun kernel: <ffffffff8807727b>{:libata:ata_qc_issue+1037} <ffffffff88035600>{:scsi_mod:scsi_done+0}
Sep 4 11:00:44 ipazia-sun kernel: <ffffffff8807b214>{:libata:ata_scsi_rw_xlat+0} <ffffffff8807b4a9>{:libata:ata_scsi_translate+286}
Sep 4 11:00:44 ipazia-sun kernel: <ffffffff88035600>{:scsi_mod:scsi_done+0} <ffffffff8807d549>{:libata:ata_scsi_queuecmd+315}
Sep 4 11:00:44 ipazia-sun kernel: <ffffffff88035a6d>{:scsi_mod:scsi_dispatch_cmd+546}
Sep 4 11:00:44 ipazia-sun kernel: <ffffffff8803b06d>{:scsi_mod:scsi_request_fn+760} <ffffffff801e8aff>{elv_insert+230}
Sep 4 11:00:44 ipazia-sun kernel: <ffffffff801ed890>{__make_request+987} <ffffffff80164059>{mempool_alloc+49}
Sep 4 11:00:44 ipazia-sun kernel: <ffffffff801eaa13>{generic_make_request+538} <ffffffff8018b629>{__bio_clone+116}
Sep 4 11:00:44 ipazia-sun kernel: <ffffffff80147d5d>{keventd_create_kthread+0} <ffffffff801ec844>{submit_bio+186}
Sep 4 11:00:44 ipazia-sun kernel: <ffffffff80275ae8>{md_update_sb+270} <ffffffff802780bb>{md_check_recovery+371}
Sep 4 11:00:44 ipazia-sun kernel: <ffffffff80147d5d>{keventd_create_kthread+0} <ffffffff880f6f61>{:raid5:raid5d+21}
Sep 4 11:00:44 ipazia-sun kernel: <ffffffff80279990>{md_thread+267} <ffffffff80148166>{autoremove_wake_function+0}
Sep 4 11:00:44 ipazia-sun kernel: <ffffffff80147d5d>{keventd_create_kthread+0} <ffffffff80279885>{md_thread+0}
Sep 4 11:00:44 ipazia-sun kernel: <ffffffff80148025>{kthread+236} <ffffffff8010bea6>{child_rip+8}
Sep 4 11:00:44 ipazia-sun kernel: <ffffffff80147d5d>{keventd_create_kthread+0} <ffffffff80147f39>{kthread+0}
Sep 4 11:00:44 ipazia-sun kernel: <ffffffff8010be9e>{child_rip+0}
Sep 4 11:01:44 ipazia-sun kernel: ata42: Entering mv_eng_timeout
Sep 4 11:01:44 ipazia-sun kernel: mmio_base ffffc20001000000 ap ffff8103f8b4c488 qc ffff8103f8b4cf68 scsi_cmnd ffff8101f7e556c0 &cmnd ffff8101f7e5571c
Sep 4 11:01:44 ipazia-sun kernel: ata42: no sense translation for status: 0x40
Sep 4 11:01:44 ipazia-sun kernel: ata42: translated ATA stat/err 0x40/00 to SCSI SK/ASC/ASCQ 0xb/00/00
Sep 4 11:01:44 ipazia-sun kernel: ata42: status=0x40 { DriveReady }
Sep 4 11:01:44 ipazia-sun kernel: end_request: I/O error, dev sdap, sector 976767935
Sep 4 11:01:44 ipazia-sun kernel: RAID5 conf printout:
(...)
DP> The resync of the /dev/md5 was performed, the raid was again
DP> with 20 working devices,
PG> The original 20 devices or did you put in 2 new blank hard drives?
PG> I feel like that 2 blank drives went in, but then later I read
PG>that all [original] 20 drives could be read for a few MB at the
PG>beginning.
No. No blank drives went in, and I always used the original 20 devices.
I therefore suspect that the "broken device" indication, since it has been seen repeatedly
in the last weeks, and always for different devices/filesystems, has to do with the RAID
controller, and not with a specific device failure.
PG>Well, I can try to explain the bits that maybe are missing.
PG>* Almost all your problems are block layer problems. Since XFS
PG> assumes an error-free block layer, it is your task to ensure that
PG> the block layer is error free. Which means that almost all the
PG> work that you should have done was to first ensure that the
PG> block layer is error free, by testing each drive fully and
PG> then putting together the array. It is quite likely that none
PG> of the issues that you have reported has much to do with XFS.
Could it have to do with the RAID controller layer?
PG>* This makes it look like that the *filesystem* is fine, even if
PG> quite a bit of data in each file has been replaced. XFS wisely
PG> does nothing for the data (other than avoiding to deliberately
PG> damage it) -- if your application does not add redundancy or
PG> checksums to the data, you have no way to reconstruct it or even
PG> check whether it is damaged in case of partial loss.
Well, a binary file with 5% data loss would simply not work.
But I have executables on this filesystem, and they run!
PG > * 2 or more in each of the 20 disk arrays is damaged in the same
PG >offsets, and full data recovery is not possible.
PG>* Somehow 'xfs_repair' managed to rebuild the metadata of
PG> '/dev/md5' despite a loss of 5-6% of it, so it looks
PG> "consistent" as far as XFS is concerned, but up to 5-6% of
PG> each file is essentially random, and it is very difficult to
PG> know where the random part are.
I don't see any element to support this - at present.
PG>* With '/dev/md4' 'xfs_repair' the 5-6% metadata lost was in
PG> more critical parts of the filesystem, so the metadata for
PG> half of the files is gone. Of the remaining files, up to
PG> 5-6% of their data is random.
Half of the files were gone already before the repair, and they remain gone after;
for the remaining files, I see no sign of randomness.
Summarizing, it may well be that the devices are broken, but I suspect, again, a failure in the controller.
Could it be?
I contacted Sun and they asked me for the output of Siga, ipmi, etc.
Daniele
* Re: xfs data loss
2009-09-06 9:00 Passerone, Daniele
@ 2009-09-06 9:30 ` Michael Monnerie
2009-09-06 21:00 ` Peter Grandi
1 sibling, 0 replies; 28+ messages in thread
From: Michael Monnerie @ 2009-09-06 9:30 UTC (permalink / raw)
To: xfs
On Sunday, 06 September 2009, Passerone, Daniele wrote:
> Well, a binary file with 5% data loss would simply not work.
> But I have executables on this filesystem, and they run!
Optimist. It just means the parts of the binary you run are not random.
Randomness of *all* code paths would have to be checked, which you
probably can't do manually, so binaries are not a good check at all.
Since you didn't change any drives, chances are good that you really
lost very little data.
> a MB-sized tar.gz file, compression of a postscript file,
> uncompressed perfectly and was visualized in a perfect way by
> ghostview.
That's a good test, so you are lucky.
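Extending that spot check to every compressed file is cheap, since gzip stores a CRC-32 of the uncompressed stream. A sketch (the path is illustrative):

```shell
# Test-decompress every .gz under a tree and report the broken ones;
# gzip -t verifies the CRC, so corruption anywhere in a file is caught.
check_gzips() {
    find "$1" -name '*.gz' -type f | while read -r f; do
        gzip -t "$f" 2>/dev/null || echo "CORRUPT: $f"
    done
}
# Usage: check_gzips /raidmd5
```

The same trick works for other self-checking formats (tar tzf, rpm -V, and so on), but plain data files give no such signal.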
> Moreover, a device died (a different one) yesterday, and in the
> messages I have ...
Is this on the same controller as the other broken disks were? Then this
should be it (or it's the cabling, or the backplane, etc.). And you should
immediately shut down the RAID on that controller, as you might lose
data (or the whole RAID) when the controller writes random data.
Broken hardware is the worst thing to have. Replace it, test the new
parts *thoroughly*, and only then start to use the RAID again.
mfg zmi
--
// Michael Monnerie, Ing.BSc ----- http://it-management.at
// Tel: 0660 / 415 65 31 .network.your.ideas.
// PGP Key: "curl -s http://zmi.at/zmi.asc | gpg --import"
// Fingerprint: AC19 F9D5 36ED CD8A EF38 500E CE14 91F7 1C12 09B4
// Keyserver: wwwkeys.eu.pgp.net Key-ID: 1C1209B4
* Re: xfs data loss
[not found] ` <4AA3261E.1000005@sandeen.net>
@ 2009-09-06 20:30 ` Peter Grandi
0 siblings, 0 replies; 28+ messages in thread
From: Peter Grandi @ 2009-09-06 20:30 UTC (permalink / raw)
To: Linux XFS
>> There is one vital detail here: the XFS design in effect makes
>> two assumptions:
>>
>> * The block layer is error free. By and large XFS does not even
>> check that the block layer behaves perfectly. It is the sysadm
>> responsibility to ensure that.
> Now, that's not quite accurate. XFS is -very- good at handling
> IO errors in general, and at detecting & handling metadata
> corruption
But what about recovering from bad blocks? For example bad blocks
that happen in the middle of a chunk of metadata? IIRC one cannot
even pass a list of bad blocks to 'mkfs.xfs' (JFS can handle that
statically, 'ext3' semi-dynamically). If one is lucky XFS will in
some cases have enough redundancy in the metadata to allow a
repair, but it does not seem to me that this has been a design
goal, just happenstance and lots of work in 'xfs_repair'.
> (potentially coming up from bad hardware) at runtime.... (where
> "handling" may mean "detecting and shutting down gracefully")
Without data loss? Because that is what matters to quite a few
people. Handling data loss by acknowledging it is a rather
optimistic use of "handling", even if ignoring errors is worse
still. My impression is that XFS assumes that the block layer
handles all data loss (and maybe corruption) issues.
> XFS -does- expect that when the hardware says an IO is complete,
> it is complete and safe on disk.
As to this, this is actually one of the few areas where XFS
deserves praise, as it does try to check whether the block layer
really does what it claims.
> If that's what you refer to then we're in agreement.
I was not saying that XFS ignores errors; I was referring to the
file system detecting and working around block layer failures
without data loss, or at least without loss of usability.
Some file systems add metadata to each block or extent to
facilitate data reconstruction should bad blocks arise, whether
or not they are detected by the block layer; some add it to
facilitate metadata index reconstruction (accidentally in VFAT,
by design in Reiser).
Personally I think that designing XFS to leave all data loss
issues to the block device layer was the right decision given its
likely goals. I am more of a believer in end-to-end reliability
checks and/or block layer redundancy.
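The end-to-end idea above can be sketched as a toy per-block checksum
check; block size, data, and corruption pattern are made up for the
illustration, and no real filesystem or device format is implied:

```python
import hashlib

BLOCK = 4096  # assumed block size for the illustration

def tag_blocks(data: bytes):
    """Compute a per-block checksum 'tag' over the data."""
    return [hashlib.sha256(data[i:i + BLOCK]).hexdigest()
            for i in range(0, len(data), BLOCK)]

def find_bad_blocks(data: bytes, tags):
    """Return indices of blocks whose contents no longer match their tag."""
    return [i for i, t in enumerate(tag_blocks(data)) if t != tags[i]]

# Simulate: record tags, then silently corrupt one block.
data = bytearray(b"x" * (4 * BLOCK))
tags = tag_blocks(bytes(data))
data[2 * BLOCK] ^= 0xFF  # bit flip in block 2, invisible to the block layer
print(find_bad_blocks(bytes(data), tags))  # → [2]
```

The point is that the corruption is caught, and localized to a block,
even when the block layer reports every read as successful.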
* Re: xfs data loss
2009-09-06 9:00 Passerone, Daniele
2009-09-06 9:30 ` Michael Monnerie
@ 2009-09-06 21:00 ` Peter Grandi
1 sibling, 0 replies; 28+ messages in thread
From: Peter Grandi @ 2009-09-06 21:00 UTC (permalink / raw)
To: Linux XFS
[ ... ]
>> The original 20 devices or did you put in 2 new blank hard
>> drives? I had the impression that 2 blank drives went in, but
>> then later I read that all [original] 20 drives could be read
>> for a few MB at the beginning.
> No. No blank drives went in. And I always used the original 20
> devices.
That may be very good news (or not, if some are partially
damaged).
[ ... ]
> I therefore suspect that the "broken devices" indication, since
> it has turned up repeatedly in the last weeks, and always for
> different devices/filesystems, has to do with the RAID
> controller rather than with a specific device failure.
But a broken RAID host adapter can write random stuff to
some/most disks and may continue to do so, unless its failure was
only temporary. Who knows?
>> * Somehow 'xfs_repair' managed to rebuild the metadata of
>> '/dev/md5' despite a loss of 5-6% of it, so it looks
>> "consistent" as far as XFS is concerned, but up to 5-6% of
>> each file is essentially random, and it is very difficult to
>> know where the random parts are.
> I don't see any element to support this - at present.
Well, the only thing known for sure at this point is that an
event happened that physically damaged parts of the system: some
of the 48 drives died, and there was huge data loss *apparently*
without cause, since in the arrays where the data loss happened
all drives were at least partially working. But some drives have
been failing since, and in any case the arrays would not resync
afterwards.
Given this background, I would not assume *anything* really
works unless it is proven to work with fairly challenging
testing.
Hence the repeated advice to do a thorough read check of all
drives. I would also check the error log of each drive with
'smartctl -l error', though if there was an electric shock the
drives might not have been able to log anything.
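As a sketch, the suggested checks could be driven by a small script.
The device names below are hypothetical, and the commands are only
printed (a dry run), since actually running them reads entire drives:

```python
# Dry-run sketch: for each drive, a full sequential read (any I/O error
# surfaces in dd's output) plus a dump of the SMART error log.
drives = [f"/dev/sd{c}" for c in "abcd"]  # in reality: all 48 drives

def check_commands(dev: str):
    """Build the two read-check commands for one drive."""
    return [
        # read the whole device end to end, bypassing the page cache
        f"dd if={dev} of=/dev/null bs=1M iflag=direct",
        # SMART error log; may be empty if the shock prevented logging
        f"smartctl -l error {dev}",
    ]

for dev in drives:
    for cmd in check_commands(dev):
        print(cmd)
```

Running the reads one controller at a time would also help distinguish
a controller-level fault from individual drive failures.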
end of thread, other threads:[~2009-09-09 16:48 UTC | newest]
Thread overview: 28+ messages
2009-08-27 7:22 xfs data loss Passerone, Daniele
2009-08-27 9:41 ` Christian Kujau
2009-08-27 9:47 ` Passerone, Daniele
2009-08-27 10:09 ` Christian Kujau
2009-08-27 9:54 ` Passerone, Daniele
2009-08-28 4:16 ` Eric Sandeen
2009-08-28 9:19 ` Passerone, Daniele
2009-08-28 17:17 ` Eric Sandeen
2009-08-28 19:42 ` Passerone, Daniele
2009-08-29 6:08 ` Passerone, Daniele
2009-08-29 7:45 ` Ralf Gross
2009-08-29 7:11 ` Passerone, Daniele
2009-08-29 20:03 ` Passerone, Daniele
2009-08-29 22:14 ` Michael Monnerie
2009-08-29 22:52 ` Passerone, Daniele
2009-08-30 1:24 ` Eric Sandeen
2009-08-30 8:17 ` Michael Monnerie
2009-09-01 12:45 ` Peter Grandi
2009-09-01 22:16 ` Michael Monnerie
2009-09-04 11:08 ` Andi Kleen
2009-08-29 14:08 ` Peter Grandi
-- strict thread matches above, loose matches on Subject: below --
2009-09-03 15:31 Passerone, Daniele
2009-09-05 18:29 ` Peter Grandi
[not found] ` <4AA3261E.1000005@sandeen.net>
2009-09-06 20:30 ` Peter Grandi
2009-09-04 11:45 Passerone, Daniele
2009-09-06 9:00 Passerone, Daniele
2009-09-06 9:30 ` Michael Monnerie
2009-09-06 21:00 ` Peter Grandi