public inbox for linux-xfs@vger.kernel.org
* xfs data loss
@ 2009-09-06  9:00 Passerone, Daniele
  2009-09-06  9:30 ` Michael Monnerie
                   ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Passerone, Daniele @ 2009-09-06  9:00 UTC (permalink / raw)
  To: xfs@oss.sgi.com

> [ ... ]




Hi Peter, thank you for your long message. Some of the things you suppose,
though, may not be accurate. I'll try to give you some new elements.



>But there was apparently a power "event" of some sort, and IIRC
>the system stopped working, and there were other signs that the
>block layer had suffered damage

DP> 2) /dev/md5, a 19+1 RAID 5, that could not mount
DP> anymore...lost superblock.

PG> The fact that there was apparent difficulty means that the
PG> automatic "resync" that RAID5 implementations do if only 1 drive
PG> has been lost did not work, which is ominous.



PG> With a 19+1 RAID5 with 2 devices dead you have lost around 5-6%
PG> of the data; regrettably this is not 5-6% of the files, but most
PG> likely 5-6% of most files (and probably quite a bit of XFS metadata).

Up to now I have found no damage in any file on md5 after recovery with
mdadm --assemble --assume-clean.
Just an example: an MB-sized tar.gz file, a compressed PostScript file,
decompressed perfectly and was displayed correctly by ghostview.
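(Editorial note: a successful decompression is actually meaningful evidence here, because gzip stores a CRC32 of the uncompressed data; silent corruption would almost certainly break the check. A minimal sketch, with example filenames, of how such a spot check works:)

```shell
# gzip stores a CRC32 of the uncompressed data, so a recovered .tar.gz
# that passes "gzip -t" (or decompresses without error) very likely has
# intact bytes. Demo with a freshly made archive (filenames are examples):
workdir=$(mktemp -d)
printf 'hello world\n' > "$workdir/sample.txt"
tar -C "$workdir" -czf "$workdir/sample.tar.gz" sample.txt
# -t tests integrity (decompress + CRC check) without writing output
gzip -t "$workdir/sample.tar.gz" && echo "archive OK"
```

The same check can be looped over every .gz file on a recovered filesystem to locate damaged stripes.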

Moreover, a device died (a different one) yesterday, and in the messages I have:

Sep 4 11:00:44 ipazia-sun kernel: Badness in mv_start_dma at drivers/ata/sata_mv.c:651
Sep 4 11:00:44 ipazia-sun kernel:
Sep 4 11:00:44 ipazia-sun kernel: Call Trace: <ffffffff88099f96>{:sata_mv:mv_qc_issue+292}
Sep 4 11:00:44 ipazia-sun kernel: <ffffffff88035600>{:scsi_mod:scsi_done+0} <ffffffff8807b214>{:libata:ata_scsi_rw_xlat+0}
Sep 4 11:00:44 ipazia-sun kernel: <ffffffff8807727b>{:libata:ata_qc_issue+1037} <ffffffff88035600>{:scsi_mod:scsi_done+0}
Sep 4 11:00:44 ipazia-sun kernel: <ffffffff8807b214>{:libata:ata_scsi_rw_xlat+0} <ffffffff8807b4a9>{:libata:ata_scsi_translate+286}
Sep 4 11:00:44 ipazia-sun kernel: <ffffffff88035600>{:scsi_mod:scsi_done+0} <ffffffff8807d549>{:libata:ata_scsi_queuecmd+315}
Sep 4 11:00:44 ipazia-sun kernel: <ffffffff88035a6d>{:scsi_mod:scsi_dispatch_cmd+546}
Sep 4 11:00:44 ipazia-sun kernel: <ffffffff8803b06d>{:scsi_mod:scsi_request_fn+760} <ffffffff801e8aff>{elv_insert+230}
Sep 4 11:00:44 ipazia-sun kernel: <ffffffff801ed890>{__make_request+987} <ffffffff80164059>{mempool_alloc+49}
Sep 4 11:00:44 ipazia-sun kernel: <ffffffff801eaa13>{generic_make_request+538} <ffffffff8018b629>{__bio_clone+116}
Sep 4 11:00:44 ipazia-sun kernel: <ffffffff80147d5d>{keventd_create_kthread+0} <ffffffff801ec844>{submit_bio+186}
Sep 4 11:00:44 ipazia-sun kernel: <ffffffff80275ae8>{md_update_sb+270} <ffffffff802780bb>{md_check_recovery+371}
Sep 4 11:00:44 ipazia-sun kernel: <ffffffff80147d5d>{keventd_create_kthread+0} <ffffffff880f6f61>{:raid5:raid5d+21}
Sep 4 11:00:44 ipazia-sun kernel: <ffffffff80279990>{md_thread+267} <ffffffff80148166>{autoremove_wake_function+0}
Sep 4 11:00:44 ipazia-sun kernel: <ffffffff80147d5d>{keventd_create_kthread+0} <ffffffff80279885>{md_thread+0}
Sep 4 11:00:44 ipazia-sun kernel: <ffffffff80148025>{kthread+236} <ffffffff8010bea6>{child_rip+8}
Sep 4 11:00:44 ipazia-sun kernel: <ffffffff80147d5d>{keventd_create_kthread+0} <ffffffff80147f39>{kthread+0}
Sep 4 11:00:44 ipazia-sun kernel: <ffffffff8010be9e>{child_rip+0}
Sep 4 11:01:44 ipazia-sun kernel: ata42: Entering mv_eng_timeout
Sep 4 11:01:44 ipazia-sun kernel: mmio_base ffffc20001000000 ap ffff8103f8b4c488 qc ffff8103f8b4cf68 scsi_cmnd ffff8101f7e556c0 &cmnd ffff8101f7e5571c
Sep 4 11:01:44 ipazia-sun kernel: ata42: no sense translation for status: 0x40
Sep 4 11:01:44 ipazia-sun kernel: ata42: translated ATA stat/err 0x40/00 to SCSI SK/ASC/ASCQ 0xb/00/00
Sep 4 11:01:44 ipazia-sun kernel: ata42: status=0x40 { DriveReady }
Sep 4 11:01:44 ipazia-sun kernel: end_request: I/O error, dev sdap, sector 976767935
Sep 4 11:01:44 ipazia-sun kernel: RAID5 conf printout:
(...)




DP> The resync of the /dev/md5 was performed, the raid was again
DP> with 20 working devices,

PG> The original 20 devices, or did you put in 2 new blank hard drives?
PG> I feel like 2 blank drives went in, but then later I read
PG> that all [original] 20 drives could be read for a few MB at the
PG> beginning.



No. No blank drives went in, and I always used the original 20 devices.
I therefore suspect that the "broken devices" indication has to do with the
RAID controller and not with a specific device failure, since it has been
found repeatedly in the last weeks, and always for different devices/filesystems.




PG>Well, I can try to explain the bits that maybe are missing.

PG>* Almost all your problems are block layer problems. Since XFS
PG>  assumes an error-free block layer, it is your task to ensure that
PG>  the block layer is error free. Which means that almost all the
PG>  work that you should have done was to first ensure that the
PG>  block layer is error free, by testing each drive fully and
PG>  then putting together the array. It is quite likely that none
PG>  of the issues that you have reported has much to do with XFS.


Could it have to do with the RAID controller layer?



PG>* This makes it look like the *filesystem* is fine, even if
PG>  quite a bit of data in each file has been replaced. XFS wisely
PG>  does nothing for the data (other than avoiding deliberately
PG>  damaging it) -- if your application does not add redundancy or
PG>  checksums to the data, you have no way to reconstruct it or even
PG>  check whether it is damaged in case of partial loss.

Well, a binary file with 5% data loss would simply not work.

But I have executables on this filesystem, and they run!

PG> * If 2 or more drives in each of the 20-disk arrays are damaged at the
PG> same offsets, full data recovery is not possible.



PG>* Somehow 'xfs_repair' managed to rebuild the metadata of
PG>  '/dev/md5' despite a loss of 5-6% of it, so it looks
PG>  "consistent" as far as XFS is concerned, but up to 5-6% of
PG>  each file is essentially random, and it is very difficult to
PG>  know where the random parts are.

I see no evidence to support this, at present.

PG>* With '/dev/md4', the 5-6% of metadata lost was in
PG>  more critical parts of the filesystem, so the metadata for
PG>  half of the files is gone. Of the remaining files, up to
PG>  5-6% of their data is random.

Half of the files were gone already before the repair, and they remain gone
after it; for the remaining files, I see no sign of randomness.



Summarizing, it may well be that the devices are broken, but I suspect, again, a failure in the controller.

Could it be?

I contacted Sun and they asked me for the output of Siga, ipmi, etc.

Daniele





_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: xfs data loss
@ 2009-09-04 11:45 Passerone, Daniele
  0 siblings, 0 replies; 29+ messages in thread
From: Passerone, Daniele @ 2009-09-04 11:45 UTC (permalink / raw)
  To: xfs@oss.sgi.com

Commenting further on my preceding message, I would just like to stress that everybody here has tried to help - xfs and non-xfs people. So I have seen no emollient answers here, at least not to my query.


Mr. Peter Grandi was harsh - very harsh at the beginning - but I think he also spent time thinking about my problem. For that I am grateful.

I am less grateful for being called "outrageously ridiculous". But I can let that go in times of trouble...


Daniele


* RE: xfs data loss
@ 2009-09-03 15:31 Passerone, Daniele
  2009-09-05 18:29 ` Peter Grandi
  0 siblings, 1 reply; 29+ messages in thread
From: Passerone, Daniele @ 2009-09-03 15:31 UTC (permalink / raw)
  To: xfs@oss.sgi.com

Dear Peter, 

Thank you very much for the time you spent writing this long and
interesting answer. Now I agree with you that harsh and useful is better
than emollient and lying :-)


> When you write to a mailing list asking for free help and support,
> it is rather rude to not have done some preliminary work, such as
> figuring out the characteristics of RAID5 in case of failure. It
> is also somewhat rude (but amazingly common) to make confused and
> partial reports, such as not checking and reporting what has
> actually failed.

That is true. Unfortunately I am not the person who assembled the RAID5
and configured the machine, and I had to act mostly alone to figure out
what to do. That is why I eventually preferred to make a partial report.



> But a soft but more open assessment of how outrageous some queries
> are is help too as it makes it easier to assess the gravity of the
> situation. The smooth, emollient sell-side people will let you dig
> your own grave. Just consider your statement below about "assume
> clean" that to me sounds very dangerous (big euphemism), and that
> did not elicit any warning from the sell-side:


At the beginning of this week I was confronted with the following 
situation:

1) /dev/md4, a 19+1 RAID 5, with the corresponding xfs /raidmd4 filesystem,
which had lost half of its directories
on the 24th of August, for NO PARTICULAR APPARENT REASON (and this still drives me crazy).
No logs, nothing.

2) /dev/md5, a 19+1 RAID 5, that could not mount anymore...lost superblock.

3) /dev/md6, a 4+1 RAID5, that was not mounting anymore because 2 devices were lost.
My colleague zapped the filesystem (which was almost empty) and rebuilt the RAID5.
Unfortunately I cannot say exactly what he did.


For 2) it was clear what happened:
within a few days of each other, two devices of /dev/md5 died.
The information about the death of one device is issued in /var/log/warn.
We did not check it during those days, so when the second device died, it was too late.

BUT: I followed the advice to run a read test on all devices (using dd), and all were ok.
So it seemed to be a RAID controller problem, of the same kind described here:

http://maillists.uci.edu/mailman/public/uci-linux/2007-December/002225.html

where a solution is proposed that includes reassembling the raid using mdadm with the
"assume-clean" option. This is where "assume-clean" comes from: from a read test, followed by
a study of the above mailing list post.


The resync of /dev/md5 was performed, and the raid was again running with 20 working devices,
but at the end of the day the filesystem still would not mount.
So, I was eventually forced to run xfs_repair -L /dev/md5, which was a nightmare:
an incredible amount of forking, inodes cleared... but eventually... successful.
In the meantime I had aged 10 years and all my hair had suddenly greyed, but...

RESULT: /dev/md5 is again up and running, with all data.

BUT at the same time, /dev/md4 was no longer able to mount: superblock error.

So, at that point we bought another big drive (7 TB), performed a backup of /dev/md5,
and then ran the same procedure on /dev/md4.

RESULT: /dev/md4 is again up and running, but the data that disappeared on August 24 were still missing.


Since the structure included all devices, at this point I ran xfs_repair -L /dev/md4. But nothing happened:
no errors, and half of the data still missing.

So at this point I don't understand.

THERE IS ONE IMPORTANT THING THAT I DID NOT MENTION, BECAUSE IT WAS NOT EVIDENT BY LOOKING AT /etc/raidtab,
/proc/mdstat, etc., and it was done by my collaborator:

All the structure of the raids, partitioning, etc. was done using YaST2 with LVM.
The use of LVM is a mystery to me, even more than the basics of RAID ( :-) )
The /etc/lvm/backup and archive directories are empty.
In yast2 the LVM panel is now empty, and I have forbidden my collaborator to try to go through LVM now...


Coming to other specific questions:

>Sure you can reassemble the RAID, but what do you mean by "still
>ok"? Have you read-tested those 2 drives? Have you tested the
>*other* 18 drives? How do you know none of the other 18 drives got
>damaged? Have you verified that only the host adapter electronics
>failed or whatever it was that made those 2 drives drop out?

Tested all drives, but not the host adapter electronics.


>Why do you *need* to assume clean? If the 2 "lost" drives are
>really ok, you just resync the array. 

Well, following the post above, after checking that the lost drives are ok,
first I stop the raid, then I create the raid with all 20 drives assuming them clean,
then I stop it again, and then assemble it with resyncing.
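(Editorial note: a sketch of the sequence described above, with placeholder device names and parameters. This is NOT a recommended procedure; recreating an array with --assume-clean only works if the level, chunk size, metadata version, and exact member order match the original array, otherwise it scrambles the data.)

```shell
# Hypothetical sketch of the stop / create --assume-clean / reassemble
# sequence. Device names, level and member order are placeholders and
# MUST match the original array exactly.
mdadm --stop /dev/md5

# Recreate the array metadata without rebuilding parity: --assume-clean
# tells md to trust the existing on-disk contents as already in sync.
mdadm --create /dev/md5 --assume-clean --level=5 --raid-devices=20 \
      /dev/sd[a-t]1

mdadm --stop /dev/md5

# Reassemble normally from the new superblocks.
mdadm --assemble /dev/md5 /dev/sd[a-t]1
```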

>If you *need* to assume
>clean, it is likely that you have lost something like 5% of data
>in (every stripe and thus) most files and directories (and
>internal metadata) and will be replacing it with random
>bytes. That will very likely cause XFS problems (the least of the
>problems of course).


On /raidmd5, fortunately, this was not the case.



>Especially in a place where part of the everyday
>activity is earthquake simulation...

LOL you are right.


> But apart from that, it is not as easy to backup 20 TB,

>Or to 'fsck' several TB as you also discovered. Anyhow my opinion
>is that the best way to backup large storage servers is another
>large storage server (or more than one). When I buy a hard drive I
>buy 3 backup drives for each "live" drive I use -- at *home*.

At least now we did that right.


>Not at all absurd -- if those users *really* accept that. But you
>are trying to recover the arrays instead of scratching them and
>restarting. That suggests to me that the users did not actually
>accept that. If the real agreement with the users is "you have to
>keep backups, but if something happens you will behave as if you
>cannot or don't want to restore them" it is quite different.


Well. You would be surprised to know how stupid scientists can be when
they ignore the worst-case scenario.
Including myself.
I knew exactly what the situation was, but if I had not succeeded in recovering
/raid/md5, it would have been a hard moment for me and my research group.
And we ALL knew that there were no backups.



>That's not so clear. One problem with trying to provide some
>opinions on your issue and whether the filesystems are recoverable
>is that you haven't made clear what failed and how you tested each
>component of each array to make sure that what is still working is
>known (and talk of "assume clean" is very suspicious).

Just to clarify: assume-clean was an option to the mdadm --create command,
used after I discovered that my 20 devices were there and running: I ran a dd command
reading the first megabytes of each device.
Was this wrong?
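(Editorial note: the read test described above can be sketched as follows; the device list is a placeholder. Note that reading only the first megabytes exercises a tiny fraction of each drive, so it proves much less than a full-surface read.)

```shell
#!/bin/sh
# Sketch of the dd read test described above. The device list is a
# placeholder -- substitute the actual array members. A full-surface
# check would drop count= and read each whole device instead.
read_test() {
    # Read the first 10 MiB of the given device (or file) and discard it.
    dd if="$1" of=/dev/null bs=1M count=10 2>/dev/null
}

for dev in /dev/sda /dev/sdb; do    # ...one entry per array member
    if read_test "$dev"; then
        echo "$dev: readable"
    else
        echo "$dev: READ ERROR"
    fi
done
```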

>That you have tried to run repair tools on a filesystem with an
>incomplete storage layer may have made things rather worse, so
>knowing *exactly* what has failed may help you a lot.

I will contact the Sun service and ask them to check the whole storage/controller part.
In the meantime I am almost convinced that the 4-5 TB lost on /dev/md4 are lost for good.
I sent the metadata to the mailing list one week ago. Do you think this could help in examining
the famous 20 drives?

I hope I can catch up. I am trying to learn quickly.

Thanks,

Daniele


* xfs data loss
@ 2009-08-27  7:22 Passerone, Daniele
  2009-08-27  9:41 ` Christian Kujau
                   ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Passerone, Daniele @ 2009-08-27  7:22 UTC (permalink / raw)
  To: xfs@oss.sgi.com

Dear xfs developers

We have a Sun X4500 with 48 500-GB drives, which we configured under SUSE SLES 10.

Among others, we have 3 RAID5 xfs filesystems: /dev/md4 with 20 units (9.27 TB),
/dev/md5 with 20 units (9.27 TB), and /dev/md6 with 5 units (1.95 TB).

These units are not backed up.

Due to a power surge, suddenly and without any log messages, about half (5 TB) of the user
directories on /dev/md4 disappeared.
Upon reboot, /dev/md6 showed only 3 units; after an xfs_repair it was ok again.
/dev/md4 mounted immediately, but always with half of the directories missing.

1) xfs_check reports no problems on /dev/md4,

but 

2) xfs_logprint

xfs_logprint:
    data device: 0x904
    log device: 0x904 daddr: 9279295544 length: 262144

Header 0x1c8b wanted 0xfeedbabe
**********************************************************************
* ERROR: header cycle=7307        block=182240                       *
**********************************************************************
Bad log record header

and the same happens on /dev/md5.


3) xfs_ncheck eats a lot of memory and freezes after 6-7 hours without giving output

4) I have an xfs_metadump in a file (1.8 GB) but I don't know what to do with it.
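(Editorial note: one thing a metadump is good for, sketched below with hypothetical filenames: it can be restored into a sparse image with xfs_mdrestore, and the repair tools can then be exercised offline against the image without touching the real array.)

```shell
# Hypothetical sketch: restore the metadump into a sparse image file and
# examine it offline. Filenames are placeholders.
xfs_mdrestore md4.metadump md4.img

# Dry-run repair against the image: reports problems without writing.
# -f tells xfs_repair the target is a regular file, not a device.
xfs_repair -n -f md4.img
```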

What can I do? Any help would be appreciated; I would really be happy to recover those files...
:-)

Thank you in advance

Daniele Passerone
Empa, Switzerland




end of thread, other threads:[~2009-09-09 16:48 UTC | newest]

Thread overview: 29+ messages
2009-09-06  9:00 xfs data loss Passerone, Daniele
2009-09-06  9:30 ` Michael Monnerie
2009-09-06 10:43 ` R: " Passerone, Daniele
2009-09-06 21:00 ` Peter Grandi
  -- strict thread matches above, loose matches on Subject: below --
2009-09-04 11:45 Passerone, Daniele
2009-09-03 15:31 Passerone, Daniele
2009-09-05 18:29 ` Peter Grandi
     [not found]   ` <4AA3261E.1000005@sandeen.net>
2009-09-06 20:30     ` Peter Grandi
2009-08-27  7:22 Passerone, Daniele
2009-08-27  9:41 ` Christian Kujau
2009-08-27  9:47   ` Passerone, Daniele
2009-08-27 10:09     ` Christian Kujau
2009-08-27  9:54   ` Passerone, Daniele
2009-08-28  4:16 ` Eric Sandeen
2009-08-28  9:19   ` Passerone, Daniele
2009-08-28 17:17     ` Eric Sandeen
2009-08-28 19:42       ` Passerone, Daniele
2009-08-29  6:08       ` Passerone, Daniele
2009-08-29  7:45         ` Ralf Gross
2009-08-29  7:11       ` Passerone, Daniele
2009-08-29 20:03       ` Passerone, Daniele
2009-08-29 22:14         ` Michael Monnerie
2009-08-29 22:52       ` Passerone, Daniele
2009-08-30  1:24         ` Eric Sandeen
2009-08-30  8:17           ` Michael Monnerie
2009-09-01 12:45         ` Peter Grandi
2009-09-01 22:16           ` Michael Monnerie
2009-09-04 11:08           ` Andi Kleen
2009-08-29 14:08 ` Peter Grandi
