Re: Recreate raid 10 array

linux-raid.vger.kernel.org archive mirror

From: Bill Davidsen <davidsen@tmr.com>
To: LCID Fire <lcid-fire@gmx.net>
Cc: linux-raid@vger.kernel.org
Subject: Re: Recreate raid 10 array
Date: Thu, 09 Apr 2009 18:38:20 -0400
Message-ID: <49DE78DC.8080700@tmr.com>
In-Reply-To: <49DD21C6.4060200@gmx.net>

LCID Fire wrote:
> First, the good news: I'm running on my RAID10 again,
> with only a little data loss.
>
> Andrew Burgess wrote:
>> On Wed, 2009-04-08 at 17:47 -0400, Bill Davidsen wrote:
>>> Goswin von Brederlow wrote:
>>>> mdadm --create --assume-clean -l 10 -n 4 /dev/mdX /dev/copied_disk_1 /dev/copied_disk_2 missing missing
>>>>
>>>> You need to match the create parameters exactly with the ones you
>>>> initially used (near/far/offset copies? chunk size? ...), and the
>>>> order of devices matters, so you might have to shuffle the disk
>>>> arguments. Just try different orders until the result can be mounted
>>>> or fscked. With the wrong options, mount/fsck could screw up the
>>>> data, but then you copy the disk again for the next try. It should be
>>>> reasonably obvious when mount/fsck goes wrong, as it should find tons
>>>> of errors. Mostly, though, I would expect mount/fsck to simply fail
>>>> with the wrong mdadm args.
>>
>> Most fscks can be told to run read-only, so they won't write to the
>> device, or interactively, so they ask before writing; either way you
>> should be able to avoid recopying. IIRC the ext3 journal recovery
>> ignores at least one of these settings (or used to), so if it's ext3,
>> find an option that tells it to skip the journal.
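
A minimal dry-run sketch of the trial-and-error approach above: it only prints the commands for each candidate device order, so nothing touches a disk until you run them by hand, one group at a time, on fresh copies. The device names, the /dev/mdX node, the n2 layout, the 64 KiB chunk and the 0.90 metadata version are assumptions; they must match whatever the original array used. The near2_is_plausible() filter relies on the fact that with the near=2 layout on four devices, slots 0/1 and 2/3 hold the same data, so an order that leaves a whole pair "missing" cannot contain the entire array. `e2fsck -n` opens the filesystem read-only and answers "no" to every question, which should also keep it from replaying the ext3 journal.

```python
import itertools
import shlex

# Hypothetical placeholders: substitute your own copies, array node and parameters.
ARRAY = "/dev/mdX"
MEMBERS = ["/dev/copied_disk_1", "/dev/copied_disk_2", "missing", "missing"]
LAYOUT = "n2"      # near=2; "f2" and "o2" are the far and offset layouts
CHUNK_KIB = 64
METADATA = "0.90"


def unique_orders(members):
    """Every distinct ordering of the members; the two "missing" slots are interchangeable."""
    return sorted(set(itertools.permutations(members)))


def near2_is_plausible(order):
    """With near=2 on four devices, slots (0, 1) and (2, 3) are mirror pairs.

    An order that leaves both slots of a pair "missing" cannot hold the whole array.
    """
    pairs = (order[0:2], order[2:4])
    return all(any(dev != "missing" for dev in pair) for pair in pairs)


def attempt_commands(order):
    """Build (but do not run) the commands for one recreate-and-check attempt."""
    create = ["mdadm", "--create", ARRAY, "--assume-clean",
              f"--metadata={METADATA}", "--level=10", "--raid-devices=4",
              f"--layout={LAYOUT}", f"--chunk={CHUNK_KIB}", *order]
    check = ["e2fsck", "-n", ARRAY]  # -n: read-only, answer "no" to every question
    stop = ["mdadm", "--stop", ARRAY]
    return [create, check, stop]


if __name__ == "__main__":
    for order in unique_orders(MEMBERS):
        if LAYOUT == "n2" and not near2_is_plausible(order):
            continue  # the filter only holds for the near=2 layout
        for cmd in attempt_commands(order):
            print(shlex.join(cmd))
        print()
```

Of the 12 distinct orders of two disks and two "missing" slots, 8 survive the near=2 filter; the first order that fsck accepts cleanly is the one to keep.
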
> Too late. The journal recovery complained quite a bit, and I didn't
> know better than to let it fix whatever it wanted to fix.
> As a result I'm seeing the problem with the many apps that use SQLite
> these days: it's not very good when the database file is corrupted.
>
>>> May I say that this makes a great case for saving the contents of
>>> some files to a safe place while the system is up and running
>>> correctly? Maybe all of /etc, plus at least the output of "tree /sys"
>>> and a copy of /proc/mdstat, preferably on something readable like a CD
>>> or USB flash drive, so you have a chance of reading it if you can't boot.
>>>
>>> Of course a rescue flash drive is pretty useful as well, so that's 
>>> probably the way to go.
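
One way to act on this advice is a small script run while the system is healthy, writing to the mounted CD or flash drive. The sketch below is an example under stated assumptions, not a standard tool: the file list, the output names and the destination path are hypothetical, and it simply skips anything missing, unreadable or not installed. `mdadm --detail --scan --verbose` and `mdadm --examine --scan --verbose` print the ARRAY lines and member devices mdadm knows about; the whole of /etc can be archived separately (for example with tar, as root).

```python
import shutil
import subprocess
from pathlib import Path

# Example lists only (assumptions): adjust for your distribution and arrays.
FILES = ["/proc/mdstat", "/etc/mdadm/mdadm.conf", "/etc/mdadm.conf", "/etc/fstab"]
COMMANDS = {
    "mdadm-detail.txt": ["mdadm", "--detail", "--scan", "--verbose"],
    "mdadm-examine.txt": ["mdadm", "--examine", "--scan", "--verbose"],
    "tree-sys.txt": ["tree", "/sys"],
}


def snapshot(dest, files=FILES, commands=COMMANDS):
    """Copy config files and save command output under dest; return the saved names."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for name in files:
        src = Path(name)
        try:
            data = src.read_bytes()  # read directly: /proc files report a size of 0
        except OSError:
            continue  # missing or unreadable on this system
        target = dest / src.as_posix().strip("/").replace("/", "_")
        target.write_bytes(data)
        saved.append(target.name)
    for out_name, cmd in commands.items():
        if shutil.which(cmd[0]) is None:
            continue  # tool not installed
        result = subprocess.run(cmd, capture_output=True, text=True)
        (dest / out_name).write_text(result.stdout + result.stderr)
        saved.append(out_name)
    return saved
```

For example, `snapshot("/media/usb/raid-notes")`, where the mount point is hypothetical; run it as root so the mdadm commands can read the superblocks.
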
> Quite frankly, I don't really care about / as long as my /home is
> safe, because I can set up my machine again; losing my work means
> losing far more time.
>
>> It also seems like mdadm could be enhanced to figure stuff like this out
>> given intact device superblocks (I suggest --wild-ass-guess as the
>> option name)
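
A rough sketch of what such a guess could look like. It assumes every surviving member still has a 1.x superblock and that `mdadm --examine` prints it as "Key : value" lines with the labels used below (Version, Raid Level, Raid Devices, Layout, Chunk Size, Device Role); check those against your own output before trusting it. It places each examined device in the slot its superblock records and fills the remaining slots with "missing". It does not compare the Data Offset, which also has to match and which different mdadm versions may choose differently. With intact superblocks, `mdadm --assemble` (with `--force` if needed) is normally tried before any re-create.

```python
import re


def parse_examine(text):
    """Collect the "Key : value" lines printed by `mdadm --examine` into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" : ")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def superblock_params(fields):
    """Return (slot, raid_devices, create_args) for one member's superblock."""
    kind, copies = fields["Layout"].split("=")  # e.g. "near=2"
    layout = {"near": "n", "far": "f", "offset": "o"}[kind] + copies
    chunk = re.fullmatch(r"(\d+)K", fields["Chunk Size"]).group(1)
    slot = int(re.search(r"Active device (\d+)", fields["Device Role"]).group(1))
    devices = int(fields["Raid Devices"])
    args = [f"--metadata={fields['Version']}", f"--level={fields['Raid Level']}",
            f"--raid-devices={devices}", f"--layout={layout}", f"--chunk={chunk}"]
    return slot, devices, args


def suggest_create(array, examined):
    """examined maps a device path to that device's `mdadm --examine` output (non-empty)."""
    slots, common, devices = {}, None, None
    for dev, text in examined.items():
        slot, n, args = superblock_params(parse_examine(text))
        if common is not None and args != common:
            raise ValueError(f"{dev}: superblock parameters disagree with the others")
        common, devices = args, n
        slots[slot] = dev
    order = [slots.get(i, "missing") for i in range(devices)]
    return ["mdadm", "--create", array, "--assume-clean", *common, *order]
```

The result is only a suggestion to review by hand; like any re-create, try it on copies first.
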
> That would be great (not that I'm eager to run into that again).
>
> As a note, I did a binary comparison of the RAID1 halves and was
> quite shocked. The corrupted one differed by around 1.000.000 bytes,
> which I would expect, but even the valid mirror differed by around
> 20.0000 bytes, which I can't easily explain.
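
A sketch of the kind of comparison described above, roughly what `cmp -l a b | wc -l` reports for two equally sized images, that also counts how many 4 KiB blocks contain a difference; the second number hints at whether whole pages diverged. The paths are placeholders. Members of an md array are not expected to be byte-identical: each member's superblock records its own device number and checksum, so a handful of differing bytes there is normal (where they sit depends on the metadata version).

```python
def compare_images(path_a, path_b, block_size=4096):
    """Return (differing_bytes, differing_blocks) for two disk images."""
    differing_bytes = 0
    differing_blocks = 0
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            block_a = a.read(block_size)
            block_b = b.read(block_size)
            if not block_a and not block_b:
                break  # both images exhausted
            if block_a != block_b:
                differing_blocks += 1
                differing_bytes += sum(x != y for x, y in zip(block_a, block_b))
                # bytes present in only one image count as differences too
                differing_bytes += abs(len(block_a) - len(block_b))
    return differing_bytes, differing_blocks
```
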

My personal explanation for mismatches on RAID1 swap is this: because
software RAID isn't "real" (hardware) RAID, you don't get a mirror by
sending the page to a controller once and letting it write the same
data to both drives. What you get is the kernel sending the same page to
the two drives separately, without locking the page, so it can change
while it's being written. That's a gross over-simplification, but I think
it addresses the heart of the matter. Preventing it would require a flag
that forces copy-on-write (COW) on the page until it has been handed to
the controller for the last time.

The alternative I've seen proposed is that the page is deallocated
somewhere between writing the copies, so the second write (or the Nth,
for a mirror with more than two copies) is abandoned.

In any case, this appears to happen because the page in memory changes
between writes and/or not all writes are mirrored.
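
A deliberately tiny, single-threaded model of the two explanations above, not a description of how md actually issues I/O: a "page" is written to each mirror in turn, and the model can either change the page after the first write or abandon the remaining writes. The copy_first flag stands in for the COW-style fix by writing every mirror from a snapshot taken before the first write. All names here are hypothetical.

```python
PAGE_SIZE = 8  # a toy "page", kept tiny so the bytes are easy to inspect


def write_mirrored(page, n_mirrors, between_writes=None, copy_first=False,
                   abandon_after=None):
    """Write one page to n_mirrors toy disks, one after another.

    between_writes: callback that modifies the page after the first write,
                    standing in for the page changing while I/O is in flight.
    copy_first:     write every mirror from a snapshot of the page (COW-like).
    abandon_after:  stop after this many writes, standing in for the page being
                    released before the remaining copies were written.
    """
    source = bytes(page) if copy_first else page
    disks = [bytearray(PAGE_SIZE) for _ in range(n_mirrors)]
    for i, disk in enumerate(disks):
        if abandon_after is not None and i >= abandon_after:
            break
        disk[:] = source  # copies whatever the source holds at this moment
        if i == 0 and between_writes is not None:
            between_writes(page)
    return [bytes(d) for d in disks]


def scribble(page):
    """Hypothetical writer that flips the first byte of the page."""
    page[0] ^= 0xFF


def mirrors_agree(disks):
    return len(set(disks)) == 1


if __name__ == "__main__":
    print(mirrors_agree(write_mirrored(bytearray(b"ABCDEFGH"), 2,
                                       between_writes=scribble)))  # False
    print(mirrors_agree(write_mirrored(bytearray(b"ABCDEFGH"), 2,
                                       between_writes=scribble,
                                       copy_first=True)))          # True
    print(mirrors_agree(write_mirrored(bytearray(b"ABCDEFGH"), 2,
                                       abandon_after=1)))          # False
```

Both failure modes leave the mirrors disagreeing; only the snapshot keeps them identical, at the cost of copying every page before it is written.
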

The interesting thing is that some systems seem to have lots of
mismatches and some almost always have none.
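
To see which kind of system you have, md exposes a counter through sysfs: writing "check" to /sys/block/<array>/md/sync_action makes md read every copy and compare them, and /sys/block/<array>/md/mismatch_cnt then reports the number of sectors found to differ (per the kernel's md documentation). A minimal sketch, assuming an array named md0 (hypothetical) and root privileges for starting the check; the sysfs_root parameter exists only so the functions can be pointed at a fake tree for testing.

```python
import time
from pathlib import Path


def md_attr(array, name, sysfs_root="/sys/block"):
    """Path of one md sysfs attribute, e.g. /sys/block/md0/md/mismatch_cnt."""
    return Path(sysfs_root) / array / "md" / name


def start_check(array, sysfs_root="/sys/block"):
    """Ask md to read and compare all copies (needs root on a real system)."""
    md_attr(array, "sync_action", sysfs_root).write_text("check\n")


def sync_action(array, sysfs_root="/sys/block"):
    """Current action, e.g. "check" while running and "idle" when done."""
    return md_attr(array, "sync_action", sysfs_root).read_text().strip()


def wait_until_idle(array, poll_seconds=10.0, sysfs_root="/sys/block"):
    while sync_action(array, sysfs_root) != "idle":
        time.sleep(poll_seconds)


def mismatch_count(array, sysfs_root="/sys/block"):
    """Sectors found to differ by the last check or repair."""
    return int(md_attr(array, "mismatch_cnt", sysfs_root).read_text())
```

Usage, as root: `start_check("md0")`, then `wait_until_idle("md0")`, then `mismatch_count("md0")`. On a RAID1 array holding swap, a nonzero count may come from the effects described above rather than from failing hardware.
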

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc

"You are disgraced professional losers. And by the way, give us our money back."
    - Representative Earl Pomeroy,  Democrat of North Dakota
on the A.I.G. executives who were paid bonuses after a federal bailout.
