Re: Help, array corrupted after clean shutdown.

From: Oliver Schinagl <oliver+list@schinagl.nl>
To: Durval Menezes <durval.menezes@gmail.com>
Cc: Linux RAID <linux-raid@vger.kernel.org>
Subject: Re: Help, array corrupted after clean shutdown.
Date: Sun, 07 Apr 2013 19:12:26 +0200
Message-ID: <5161A8FA.2080906@schinagl.nl>
In-Reply-To: <CACj=ugTozdbEKvzeNZ4ZjH67P2mqFAicj_SJM6Ui9pUT64oPeg@mail.gmail.com>

On 08-04-13 10:10, Durval Menezes wrote:
> Hi Oliver.
>
> On Sun, Apr 7, 2013 at 12:32 PM, Oliver Schinagl
> <oliver+list@schinagl.nl> wrote:
>> On 06-04-13 20:59, Durval Menezes wrote:
>>> Hi Oliver,
>>>
>>>
>>> On Sat, Apr 6, 2013 at 3:01 PM, Oliver Schinagl <oliver+list@schinagl.nl
>>> <mailto:oliver+list@schinagl.nl>> wrote:
>>>
>>>      On 04/06/13 19:44, Durval Menezes wrote:
>>>
>>>          Hi Oliver,
>>>
>>>          Seems most of your problems are filesystem corruption (the
>>>          extN family
>>>          is well known for lack of robustness).
>>>
>>>          I would try to mount the filesystem read-only (without fsck)
>>>          and copy
>>>          off as much data as possible... Then fsck and try to copy the
>>>          rest.
>>>
>>>          Good luck.
>>>
>>>      It fails to mount ;)
>>>
>>>      How can I ensure that the array is not corrupt however (while
>>>      degraded)? At least that way, I can try my luck with ext4 tools.
>>>
>>>
>>> If the array was not degraded, I would try an array check:
>>>
>>> |echo check > /sys/block/md0/md/sync_action|
>>>
>>> Then, if you had no (or very little) mismatches, I would consider it OK.
>>> But as your array is in degraded mode, you have no redundancy to enable you
>>> to check... :-/
>> I guess the 'order' wouldn't have mattered. I would have expected some
>> very basic check was available.
>>
>> Maybe for raid8 :p; Thinking along these lines: every block has an id, and
>> each stripe has matching ids. If the ids no longer match, something is
>> wrong. Would probably only waste space in the end.
> And time ;-)
>
>> Anyhow, I may have panicked a little too early. mount did indeed fail to
>> mount, but checking dmesg revealed a little more:
>> [  117.665385] EXT4-fs (md102): mounted filesystem with writeback data
>> mode. Opts: commit=120,data=writeback
>> [  126.743000] EXT4-fs (md101): ext4_check_descriptors: Checksum for group
>> 0 failed (42475!=15853)
>> [  126.743003] EXT4-fs (md101): group descriptors corrupted!
>>
>> I asked on linux-ext4 what could be going wrong; fsck -n does show
>> (all?) group descriptors not matching.
> Ouch :-/
>
>> Mounting ro however works
> Glad to hear it. When you said that "it fails to mount", I thought you
> had tried mounting read-only as I suggested.
mount complained the way it does for an invalid filesystem; the error
could have been more descriptive. I only tried mounting RO after you
mentioned it (and after marking the array as read-only).
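
For anyone following along, a minimal sketch of that read-only sequence.
It only builds the command lines and prints them; nothing is executed.
The device and mount point names are placeholders, and the helper name is
made up for illustration. The optional "noload" ext4 mount option (skip
journal replay) is an assumption on my part, not something used above.

```python
import shlex


def readonly_mount_commands(array, mountpoint, skip_journal=False):
    """Build (but do not run) the commands for a read-only inspection."""
    options = "ro,noload" if skip_journal else "ro"
    return [
        # Mark the md array read-only so nothing can write to it.
        ["mdadm", "--readonly", array],
        # Mount the filesystem read-only, without running fsck first.
        ["mount", "-t", "ext4", "-o", options, array, mountpoint],
    ]


if __name__ == "__main__":
    # Placeholder paths; substitute the real array and an empty directory.
    for command in readonly_mount_commands("/dev/md101", "/mnt/rescue"):
        print(shlex.join(command))
```

Printing the commands first makes it easy to double-check the device name
before running anything as root.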
>
>> and all data appears to be correct from a quick
>> investigation (my virtual machines start normally, so if that is ok, the
>> rest must be too).
> So probably only ext4 allocation metadata (which I think is what the
> group descriptors are) got corrupted... probably your data survived
> OK.
Looks like it. The disk reports an unhealthy amount of free space, and
every single group descriptor was flagged as corrupt, from 0, 1 .. up to
32k (at which point I hit ctrl-c). It's odd for corruption to look like
that. Well, strictly speaking only the checksums didn't match. I rather
suspect that either the on-disk format has changed somewhat since 2010,
or the userspace tools work differently.
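
The dmesg line (42475!=15853) is a comparison between a 16-bit checksum
stored in the descriptor and one recomputed by the kernel. A rough sketch
of that comparison for the older crc16-based descriptor checksum, as far
as I understand it: the CRC runs over the filesystem UUID, the group
number, and the descriptor with its own checksum field skipped. The 0x1E
field offset, the ordering, and all function names here are assumptions
for illustration; filesystems using metadata_csum checksum differently
and are not covered.

```python
import struct

# Assumed offset of the 16-bit checksum field inside a group descriptor.
CHECKSUM_OFFSET = 0x1E


def crc16(crc, data):
    """Bitwise CRC-16, reflected polynomial 0xA001, no final XOR."""
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0xA001 if crc & 1 else crc >> 1
    return crc & 0xFFFF


def group_desc_checksum(fs_uuid, group, desc):
    """Recompute the checksum of one raw group descriptor (assumed layout)."""
    crc = crc16(0xFFFF, fs_uuid)                  # 16-byte filesystem UUID
    crc = crc16(crc, struct.pack("<I", group))    # little-endian group number
    crc = crc16(crc, desc[:CHECKSUM_OFFSET])      # bytes before the checksum
    crc = crc16(crc, desc[CHECKSUM_OFFSET + 2:])  # bytes after it, if any
    return crc


def stored_checksum(desc):
    """Read the checksum recorded in the descriptor itself."""
    return struct.unpack_from("<H", desc, CHECKSUM_OFFSET)[0]


def descriptor_ok(fs_uuid, group, desc):
    """True when the stored and recomputed checksums agree."""
    return stored_checksum(desc) == group_desc_checksum(fs_uuid, group, desc)
```

Note that the UUID and the group number feed into the CRC, so a
descriptor that is byte-for-byte intact would still fail if it were
checked against a different UUID or group index. That would be one way
for every group to mismatch at once without the data itself being damaged.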

Side story: I have an Android tablet with an ext4 filesystem for
/data. The tablet runs a 3.0 kernel. A few weeks ago, the tablet refused
to boot. I booted from SD card into a stock GNU/Linux 3.4 environment and
ran fsck. Same thing: all group descriptors were corrupt (the checksums
didn't match). fsck ran for 10 minutes and the tablet is still working fine.
>
>> I am now in the process of copying, and rsync -car the
>> data to a temporary spot.
> After your data is copied, try validating it with whatever tools
> available, for example: for compressed files, try checking them (ex:
> "tar tvzf" for tar.gz files); if it's your root partition, try
> checking your distribution packages (rpm -Va on RPM distros, for
> example), etc. If it shows any corrupted data, it might point you
> towards things that need restoring, and if it shows nothing wrong, it
> will give you confidence that the rest of your (uncheckable) data is
> possibly good too.
It does look like the data survived just fine. It is a pure data disk,
but it did contain some virtual machines; kvm runs them all fine at the
moment.
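
A minimal sketch of one more check that works even for data with no
built-in integrity test: hash every file on the read-only source and
compare it with the copy, similar in spirit to what rsync -c does. The
function names and any paths passed in are placeholders, and only regular
files are handled.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of one file, read in chunks so large images fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def compare_trees(source_root, copy_root):
    """Return (relative path, problem) pairs for files missing or different in the copy."""
    problems = []
    for dirpath, _dirnames, filenames in os.walk(source_root):
        for name in filenames:
            source_path = os.path.join(dirpath, name)
            relative = os.path.relpath(source_path, source_root)
            copy_path = os.path.join(copy_root, relative)
            if not os.path.isfile(copy_path):
                problems.append((relative, "missing"))
            elif file_digest(source_path) != file_digest(copy_path):
                problems.append((relative, "differs"))
    return sorted(problems)
```

An empty result only says the copy matches what the degraded array
returns; it cannot tell whether the array itself handed back good data.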

While I could just fsck the fs and get it all good again, I now have all
the data off the device. I will use the opportunity to increase the chunk
size from 256k to 512k, and remake the fs with those new parameters. I'm
sure fsck would most likely fix it and nothing would be wrong. I'm simply
not willing to take the risk now that the disks are empty anyway.
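
When remaking the fs for a new chunk size, the ext4 stride and stripe
width are usually derived from it. A small sketch of that arithmetic:
stride is the chunk size in filesystem blocks, and stripe width is the
stride times the number of data-bearing disks. The 4k block size and the
three data disks are assumptions for the example only, as is the device
name in the printed command.

```python
def ext4_raid_layout(chunk_kib, data_disks, block_size=4096):
    """Return (stride, stripe_width), both in filesystem blocks."""
    chunk_bytes = chunk_kib * 1024
    if chunk_bytes % block_size:
        raise ValueError("chunk size must be a multiple of the block size")
    stride = chunk_bytes // block_size
    return stride, stride * data_disks


if __name__ == "__main__":
    # Hypothetical array: 512k chunks, three data disks (parity excluded).
    stride, stripe_width = ext4_raid_layout(chunk_kib=512, data_disks=3)
    print(f"mkfs.ext4 -E stride={stride},stripe_width={stripe_width} /dev/md101")
```

Going from 256k to 512k chunks doubles both values, so the old mkfs
parameters should not be reused as they are.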
>
>> Thanks for all the help though, I probably would
>> have kept trying to fix the array first.
> No prob, and good luck with the rest of your recovery!
Thank you ;)
>
>
>> I'm still wondering why my entire (and only the) partition table was gone.
> One theory: as your shutdown was clean, then ext4 allocation metadata
> has probably been badly mangled in memory before the shutdown, so some
> of your data was possibly written over the start of the disk,
> clobbering the GPT.
>
> Off (Linux md RAID) topic: If I were in your place, I would start
> worrying how the in-memory metadata was SILENTLY mangled in the first
> place... do you use ECC memory, for example? Also, I would consider
> (now that you will have to mkfs the mangled partition to restore your
> data anyway) using a filesystem that has multiple metadata copies and
> also the means for not only finding out about silent corruptions but
> also for fixing them, to say nothing of a built-in RAID with no
> write-hole and that gives your data the same silent-corruption
> detection-and-fixing feature: http://zfsonlinux.org/
>
> Cheers,
