Re: RAID5 Recovery - David Greaves

All of lore.kernel.org
 help / color / mirror / Atom feed

From: David Greaves <david@dgreaves.com>
To: Neil Cavan <neilcavan@gmail.com>
Cc: linux-raid@vger.kernel.org
Subject: Re: RAID5 Recovery
Date: Wed, 14 Nov 2007 10:58:34 +0000	[thread overview]
Message-ID: <473AD4DA.7060801@dgreaves.com> (raw)
In-Reply-To: <f0d815c0711131705s17f365b4kc122db98e8e27895@mail.gmail.com>

Neil Cavan wrote:
> Hello,
Hi Neil

What kernel version?
What mdadm version?

> This morning, I woke up to find the array had kicked two disks. This
> time, though, /proc/mdstat showed one of the failed disks (U_U_U, one
> of the "_"s) had been marked as a spare - weird, since there are no
> spare drives in this array. I rebooted, and the array came back in the
> same state: one failed, one spare. I hot-removed and hot-added the
> spare drive, which put the array back to where I thought it should be
> ( still U_U_U, but with both "_"s marked as failed). Then I rebooted,
> and the array began rebuilding on its own. Usually I have to hot-add
> manually, so that struck me as a little odd, but I gave it no mind and
> went to work. Without checking the contents of the filesystem. Which
> turned out not to have been mounted on reboot.
OK

> Because apparently things went horribly wrong.
Yep :(

> Do I have any hope of recovering this data? Could rebuilding the
> reiserfs superblock help if the rebuild managed to corrupt the
> superblock but not the data?
See below



> Nov 13 02:01:03 localhost kernel: [17805772.424000] hdc: dma_intr:
> status=0x51 { DriveReady SeekComplete Error }
<snip>
> Nov 13 02:01:06 localhost kernel: [17805775.156000] lost page write
> due to I/O error on md0
hdc1 fails


> Nov 13 02:01:06 localhost kernel: [17805775.196000] RAID5 conf printout:
> Nov 13 02:01:06 localhost kernel: [17805775.196000]  --- rd:5 wd:3 fd:2
> Nov 13 02:01:06 localhost kernel: [17805775.196000]  disk 0, o:1, dev:hda1
> Nov 13 02:01:06 localhost kernel: [17805775.196000]  disk 1, o:0, dev:hdc1
> Nov 13 02:01:06 localhost kernel: [17805775.196000]  disk 2, o:1, dev:hde1
> Nov 13 02:01:06 localhost kernel: [17805775.196000]  disk 4, o:1, dev:hdi1

hdg1 is already missing?

> Nov 13 02:01:06 localhost kernel: [17805775.212000] RAID5 conf printout:
> Nov 13 02:01:06 localhost kernel: [17805775.212000]  --- rd:5 wd:3 fd:2
> Nov 13 02:01:06 localhost kernel: [17805775.212000]  disk 0, o:1, dev:hda1
> Nov 13 02:01:06 localhost kernel: [17805775.212000]  disk 2, o:1, dev:hde1
> Nov 13 02:01:06 localhost kernel: [17805775.212000]  disk 4, o:1, dev:hdi1

so now the array is bad.

a reboot happens and:
> Nov 13 07:21:07 localhost kernel: [17179584.712000] md: md0 stopped.
> Nov 13 07:21:07 localhost kernel: [17179584.876000] md: bind<hdc1>
> Nov 13 07:21:07 localhost kernel: [17179584.884000] md: bind<hde1>
> Nov 13 07:21:07 localhost kernel: [17179584.884000] md: bind<hdg1>
> Nov 13 07:21:07 localhost kernel: [17179584.884000] md: bind<hdi1>
> Nov 13 07:21:07 localhost kernel: [17179584.892000] md: bind<hda1>
> Nov 13 07:21:07 localhost kernel: [17179584.892000] md: kicking
> non-fresh hdg1 from array!
> Nov 13 07:21:07 localhost kernel: [17179584.892000] md: unbind<hdg1>
> Nov 13 07:21:07 localhost kernel: [17179584.892000] md: export_rdev(hdg1)
> Nov 13 07:21:07 localhost kernel: [17179584.896000] raid5: allocated
> 5245kB for md0
... apparently hdc1 is OK? Hmmm.

> Nov 13 07:21:07 localhost kernel: [17179665.524000] ReiserFS: md0:
> found reiserfs format "3.6" with standard journal
> Nov 13 07:21:07 localhost kernel: [17179676.136000] ReiserFS: md0:
> using ordered data mode
> Nov 13 07:21:07 localhost kernel: [17179676.164000] ReiserFS: md0:
> journal params: device md0, size 8192, journal first block 18, max
> trans len 1024, max batch 900, max commit age 30, max trans age 30
> Nov 13 07:21:07 localhost kernel: [17179676.164000] ReiserFS: md0:
> checking transaction log (md0)
> Nov 13 07:21:07 localhost kernel: [17179676.828000] ReiserFS: md0:
> replayed 7 transactions in 1 seconds
> Nov 13 07:21:07 localhost kernel: [17179677.012000] ReiserFS: md0:
> Using r5 hash to sort names
> Nov 13 07:21:09 localhost kernel: [17179682.064000] lost page write
> due to I/O error on md0
Reiser tries to mount/replay itself relying on hdc1 (which is partly bad)

> Nov 13 07:25:39 localhost kernel: [17179584.828000] md: raid5
> personality registered as nr 4
> Nov 13 07:25:39 localhost kernel: [17179585.708000] md: kicking
> non-fresh hdg1 from array!
Another reboot...

> Nov 13 07:25:40 localhost kernel: [17179666.064000] ReiserFS: md0:
> found reiserfs format "3.6" with standard journal
> Nov 13 07:25:40 localhost kernel: [17179676.904000] ReiserFS: md0:
> using ordered data mode
> Nov 13 07:25:40 localhost kernel: [17179676.928000] ReiserFS: md0:
> journal params: device md0, size 8192, journal first block 18, max
> trans len 1024, max batch 900, max commit age 30, max trans age 30
> Nov 13 07:25:40 localhost kernel: [17179676.932000] ReiserFS: md0:
> checking transaction log (md0)
> Nov 13 07:25:40 localhost kernel: [17179677.080000] ReiserFS: md0:
> Using r5 hash to sort names
> Nov 13 07:25:42 localhost kernel: [17179683.128000] lost page write
> due to I/O error on md0
Reiser tries again...

> Nov 13 07:26:57 localhost kernel: [17179757.524000] md: unbind<hdc1>
> Nov 13 07:26:57 localhost kernel: [17179757.524000] md: export_rdev(hdc1)
> Nov 13 07:27:03 localhost kernel: [17179763.700000] md: bind<hdc1>
> Nov 13 07:30:24 localhost kernel: [17179584.180000] md: md driver
hdc is kicked too (again)

> Nov 13 07:30:24 localhost kernel: [17179584.184000] md: raid5
> personality registered as nr 4
Another reboot...

> Nov 13 07:30:24 localhost kernel: [17179585.068000] md: syncing RAID array md0
Now (I guess) hdg is being restored using hdc data:

> Nov 13 07:30:24 localhost kernel: [17179684.160000] ReiserFS: md0:
> warning: sh-2021: reiserfs_fill_super: can not find reiserfs on md0
But Reiser is confused.

> Nov 13 08:57:11 localhost kernel: [17184895.816000] md: md0: sync done.
hdg is back up to speed:


So hdc looks faulty.
Your only hope (IMO) is to use reiserfs recovery tools.
You may want to replace hdc to avoid an hdc failure interrupting any rebuild.

I think what happened is that hdg failed prior to 2am and you didn't notice
(mdadm --monitor is your friend). Then hdc had a real failure - at that point
you had data loss (not enough good disks). I don't know why md rebuilt using hdc
- I would expect it to have found hdc and hdg stale. If this is a newish kernel
then maybe Neil should take a look...

David

next prev parent reply	other threads:[~2007-11-14 10:58 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-11-14  1:05 RAID5 Recovery Neil Cavan
2007-11-14 10:58 ` David Greaves [this message]
     [not found]   ` <f0d815c0711140436x2ca72714ka6abee08c2bc70@mail.gmail.com>
2007-11-14 12:36     ` Fwd: " Neil Cavan
2007-11-14 12:59       ` David Greaves
  -- strict thread matches above, loose matches on Subject: below --
2013-06-05 12:02 RAID5 recovery Philipp Frauenfelder
2013-06-05 13:20 ` Drew
2013-06-05 13:29 ` John Robinson
2013-06-05 15:10   ` Drew
2013-06-05 20:12     ` Philipp Frauenfelder
2013-06-06 12:28 ` Sam Bingner
2006-10-22  1:14 RAID5 Recovery Neil Cavan
2006-10-22  1:14 ` Neil Cavan
2006-10-23  1:29 ` Neil Brown
2006-10-23  2:52   ` Neil Cavan
2006-10-23  2:52     ` Neil Cavan
2006-10-23  4:01     ` Neil Brown

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=473AD4DA.7060801@dgreaves.com \
    --to=david@dgreaves.com \
    --cc=linux-raid@vger.kernel.org \
    --cc=neilcavan@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.