* RAID5 crashed for unknown reason on old 2.6.16 kernel
@ 2010-06-26 21:22 Markus Hennig
  2010-06-28 15:29 ` Markus Hennig
  0 siblings, 1 reply; 5+ messages in thread
From: Markus Hennig @ 2010-06-26 21:22 UTC (permalink / raw)
  To: linux-raid
Hi all,
my RAID5 with 4 disks crashed on a Buffalo "NAS" box (big-endian!) -
no logs, of course...
I immediately made images of all disks and am now trying to recover my
very valuable content on a Linux box running GRML 4/10 (little-endian!)
with kernel 2.6.33 and mdadm v3.1.1.
Some blocks were not readable from HDD2; maybe that is why the Buffalo
box shut down.
What I know already:
- the RAID5 was created with a very old set of software:
linux-2.6.16-tshtgl.tgz   mdadm-2.5.2.tgz   xfsprogs-2.5.6_arm.tgz
- the Buffalo box blinked red on HDD2
- the box ran a rebuild on HDD4; I don't know whether it had finished
- all disks are identical, 250 GB
- Partitioning:
Disk /dev/sdb: 251.0 GB, 251000193024 bytes
255 heads, 63 sectors/track, 30515 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x35353535
   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1          48      385528+  fd  Linux raid autodetect
/dev/sdb2              49          65      136552+  82  Linux swap / Solaris
/dev/sdb3              66       30378   243481141   fd  Linux raid autodetect
/dev/sdb4           30378       30515     1108484   fd  Linux raid autodetect
- Partition 3 is the data partition:
HDD1: (mdadm --examine --metadata=0.swap /dev/mapper/loop1p3)
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 9eb0d5a8:1ce1a3c7:e82c8901:cfa389c2
  Creation Time : Sun Nov 21 19:31:12 2004
     Raid Level : raid5
  Used Dev Size : 243481024 (232.20 GiB 249.32 GB)
     Array Size : 730443072 (696.60 GiB 747.97 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 1
    Update Time : Tue Jun 22 01:58:56 2010
          State : clean
 Active Devices : 2
Working Devices : 3
 Failed Devices : 1
  Spare Devices : 1
       Checksum : b8d2f28c - correct
         Events : 131
         Layout : left-symmetric
     Chunk Size : 64K
      Number   Major   Minor   RaidDevice State
this     0       3        3        0      active sync
   0     0       3        3        0      active sync
   1     1      22        3        1      faulty
   2     2      33        3        2      active sync
   3     3       0        0        3      faulty removed
   4     4      34        3        4      spare
HDD2:
          Magic : a92b4efc
        Version : 0.91.00
           UUID : ffffffff:ffffffff:ffffffff:ffffffff
  Creation Time : Sun Nov 21 19:31:12 2004
     Raid Level : raid5
  Used Dev Size : 243481024 (232.20 GiB 249.32 GB)
     Array Size : 730443072 (696.60 GiB 747.97 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 1
  Reshape pos'n : 0
      New Level : raid0
     New Layout : left-asymmetric
  New Chunksize : 0
    Update Time : Mon Jun 21 22:41:19 2010
          State : active
 Active Devices : 3
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 1
       Checksum : b8d2c453 - expected 45703820
         Events : 129
         Layout : left-symmetric
     Chunk Size : 64K
      Number   Major   Minor   RaidDevice State
this     1      22        3        1      active sync
   0     0       3        3        0      active sync
   1     1      22        3        1      active sync
   2     2      33        3        2      active sync
   3     3       0        0        3      faulty removed
   4     4      34        3        4      spare
HDD3:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 9eb0d5a8:1ce1a3c7:e82c8901:cfa389c2
  Creation Time : Sun Nov 21 19:31:12 2004
     Raid Level : raid5
  Used Dev Size : 243481024 (232.20 GiB 249.32 GB)
     Array Size : 730443072 (696.60 GiB 747.97 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 1
    Update Time : Tue Jun 22 01:58:56 2010
          State : clean
 Active Devices : 2
Working Devices : 3
 Failed Devices : 1
  Spare Devices : 1
       Checksum : b8d2f2ae - correct
         Events : 131
         Layout : left-symmetric
     Chunk Size : 64K
      Number   Major   Minor   RaidDevice State
this     2      33        3        2      active sync
   0     0       3        3        0      active sync
   1     1      22        3        1      faulty
   2     2      33        3        2      active sync
   3     3       0        0        3      faulty removed
   4     4      34        3        4      spare
HDD4:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 9eb0d5a8:1ce1a3c7:e82c8901:cfa389c2
  Creation Time : Sun Nov 21 19:31:12 2004
     Raid Level : raid5
  Used Dev Size : 243481024 (232.20 GiB 249.32 GB)
     Array Size : 730443072 (696.60 GiB 747.97 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 1
    Update Time : Tue Jun 22 01:58:56 2010
          State : clean
 Active Devices : 2
Working Devices : 3
 Failed Devices : 1
  Spare Devices : 1
       Checksum : b8d2f2ad - correct
         Events : 131
         Layout : left-symmetric
     Chunk Size : 64K
      Number   Major   Minor   RaidDevice State
this     4      34        3        4      spare
   0     0       3        3        0      active sync
   1     1      22        3        1      faulty
   2     2      33        3        2      active sync
   3     3       0        0        3      faulty removed
   4     4      34        3        4      spare
My various experiments with "--assemble" and/or "--create" have not
been successful so far.
What I have learned already: I have to use "--update=byteorder" and
"--metadata=0"  ;-)
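For the record, the rough command sequence for such experiments looks
like this (device names follow the loop-mapped image partitions used
throughout this mail; treat the exact invocation as a sketch, and work
only on copies, since --update=byteorder rewrites the superblocks):

```shell
# Expose the partitions inside each rescued disk image
# (yields /dev/mapper/loop1p3 etc. for the data partitions).
losetup /dev/loop1 hdd1_ddrescue
kpartx -a /dev/loop1
# ...repeat for the other three images...

# Inspect a 0.90 superblock that was written on the big-endian box:
mdadm --examine --metadata=0.swap /dev/mapper/loop1p3

# Assemble, byte-swapping the superblocks to this host's byte order:
mdadm --assemble /dev/md1 --update=byteorder \
    /dev/mapper/loop1p3 /dev/mapper/loop2p3 \
    /dev/mapper/loop3p3 /dev/mapper/loop4p3
```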
Open questions for which I wasn't able to find an answer myself:
What triggers the event count? And why is the event counter only 129
on HDD2, but 131 on all the others?
Can that cause problems while rescuing my data, and how can I work
around it?
What is that "UUID : ffffffff:ffffffff:ffffffff:ffffffff" on HDD2?
What does it mean?
It's really in the superblock on the hard disk:
 hexdump -s 488006273b -C hdd2_ddrescue
 3a2cc50200  a9 2b 4e fc 00 00 00 00  00 00 00 5b 00 00 00 00
|.+N........[....|
 3a2cc50210  00 00 00 00 ff ff ff ff  41 a0 de f0 00 00 00 05
|........A.......|
 3a2cc50220  0e 83 39 c0 00 00 00 04  00 00 00 04 00 00 00 01
|..9.............|
 3a2cc50230  00 00 00 00 ff ff ff ff  ff ff ff ff ff ff ff ff
|................|
 3a2cc50240  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
|................|
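As a side note on how one finds that spot in the first place: a
version-0.90 superblock sits in the last 64 KiB-aligned 64 KiB block of
the member partition. The rule below mirrors the kernel's
MD_NEW_SIZE_SECTORS macro (MD_RESERVED_SECTORS = 128, i.e. 64 KiB);
note that for a whole-disk image like hdd2_ddrescue the partition's own
start offset still has to be added on top:

```python
def sb090_offset(part_bytes: int) -> int:
    """Byte offset of the 0.90 superblock within a member partition:
    round the partition size down to a 64 KiB boundary, then back off
    one 64 KiB block."""
    return (part_bytes & ~0xFFFF) - 0x10000
```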
Would it help to rewrite the UUID via hexedit to the correct one?
Can somebody explain the meaning of:
  Reshape pos'n : 0
      New Level : raid0
     New Layout : left-asymmetric
  New Chunksize : 0
on HDD2 ?
What parameters are included in the checksum?
And how critical is it that HDD2 shows "Checksum : b8d2c453 - expected
45703820"?
I have no explanation for why "Version :" on HDD2 reads 0.91.00...
I see 0x5B in the partition 3 superblock on HDD2 (and 0x5A on all the
others), so it's really on the disk...  Weird...
Does somebody have any idea about that?
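Incidentally, the 0x5B byte decodes cleanly once the superblock header
is read big-endian, exactly as the ARM box wrote it. A quick check
against the first hexdump row above (field order magic / major / minor
/ patch, per the 0.90 layout):

```python
import struct

# First 16 bytes of the HDD2 superblock, copied from the hexdump above.
hdr = bytes.fromhex("a92b4efc" "00000000" "0000005b" "00000000")
magic, major, minor, patch = struct.unpack(">4I", hdr)

assert magic == 0xA92B4EFC                 # md superblock magic
assert (major, minor, patch) == (0, 91, 0) # shown as "Version : 0.91.00"
```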
Any(!) help is very much appreciated, including hints at resources
(papers, documentation, code) or questions for additional information.
Thx in advance,
Markus
* Re: RAID5 crashed for unknown reason on old 2.6.16 kernel
  2010-06-26 21:22 RAID5 crashed for unknown reason on old 2.6.16 kernel Markus Hennig
@ 2010-06-28 15:29 ` Markus Hennig
  2010-06-29  6:50   ` Neil Brown
  0 siblings, 1 reply; 5+ messages in thread
From: Markus Hennig @ 2010-06-28 15:29 UTC (permalink / raw)
  To: linux-raid
Hi all,
for the (unlikely) case somebody is interested in a final update:
I learned in the meantime that the UUID as well as the version field
are part of the checksum, and that the checksum is calculated over the
first 1 KB of the 4 KB version-0.90 superblock:
(https://raid.wiki.kernel.org/index.php/RAID_superblock_formats#The_version-0.90_Superblock_Format)
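Based on that description, the checksum routine can be sketched in a
few lines of Python. The sb_csum field offset (152 = generic-state
section at 128 plus six 32-bit words) and the 64-bit fold are taken
from the 0.90 layout on the wiki page above; treat them as assumptions
to verify against the real code:

```python
import struct

SB_CSUM_OFF = 152   # offset of the sb_csum field in the 0.90 superblock

def sb090_checksum(sb: bytes, span: int = 1024, byteorder: str = "<") -> int:
    """32-bit sum of `span` bytes of the superblock as 32-bit words,
    with the sb_csum field treated as zero, folding the 64-bit sum back
    into 32 bits. The 1 KB default follows the wiki text above; the
    kernel sums the whole 4 KB structure, but the tail is normally all
    zeroes, so the two agree in practice."""
    buf = bytearray(sb[:span])
    buf[SB_CSUM_OFF:SB_CSUM_OFF + 4] = b"\x00\x00\x00\x00"
    words = struct.unpack(f"{byteorder}{span // 4}I", bytes(buf))
    total = sum(words)
    return ((total & 0xFFFFFFFF) + (total >> 32)) & 0xFFFFFFFF
```

For superblocks written by the big-endian box, pass byteorder=">".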
Via hexedit I set the UUID on HDD2 back to the correct value and also
changed the version information from 0.91.00 (0x5B) to 0.90.00 (0x5A).
After that, the checksum was correct and equal to the expected one.
mdadm --assemble then worked like a charm, and my RAID5 is back.
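The same fix can also be scripted over the rescued image, which is less
error-prone than editing bytes by hand. The field offsets below (UUID
words at 20 and 52/56/60, minor-version word at 8, checksum at 152)
follow the 0.90 layout on the wiki page, and everything is packed
big-endian because that is how the ARM box wrote the superblock; treat
the offsets as assumptions to double-check:

```python
import struct

def patch_sb090(sb: bytearray, uuid_words) -> None:
    """Restore the UUID, flip the version back to 0.90, and refresh the
    checksum of a big-endian 0.90 superblock in place."""
    # set_uuid0 lives at offset 20; set_uuid1..3 at 52, 56, 60.
    sb[20:24] = struct.pack(">I", uuid_words[0])
    sb[52:64] = struct.pack(">3I", *uuid_words[1:])
    # minor_version is the 32-bit word at offset 8: 91 (0x5B) -> 90 (0x5A).
    sb[8:12] = struct.pack(">I", 90)
    # Recompute the checksum over the first 1 KB with sb_csum zeroed.
    sb[152:156] = b"\x00\x00\x00\x00"
    total = sum(struct.unpack(">256I", bytes(sb[:1024])))
    csum = ((total & 0xFFFFFFFF) + (total >> 32)) & 0xFFFFFFFF
    sb[152:156] = struct.pack(">I", csum)

# UUID taken from the healthy members' --examine output in the first mail:
uuid = (0x9EB0D5A8, 0x1CE1A3C7, 0xE82C8901, 0xCFA389C2)
```

In use, one would seek to the superblock offset in the image, read the
4 KB block into a bytearray, patch it, and write it back.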
That's it,
Markus
On Sat, Jun 26, 2010 at 11:22 PM, Markus Hennig <mhennig@gmail.com> wrote:
> Hi all,
>
> my RAID5 with 4 disks crashed on a Buffalo "NAS" box (big-endian!) -
> no logs of course...
> I made immediately images of all disks and try to now gather my very
> valuable content on a Linux box running GRML 4/10 (little-endian!)
> with 2.6.33 and mdadm - v3.1.1.
> Some blocks were not readable from HDD2, maybe that's the reason why
> the Buffalo box shut down.
>
>
> What I know already:
>
> - the RAID5 was created with a very old set of software:
> linux-2.6.16-tshtgl.tgz   mdadm-2.5.2.tgz   xfsprogs-2.5.6_arm.tgz
> - the Buffalo box blinked red on HDD2
> - the box run a rebuild on HDD4, I don't know if that was already finished
> - all disks are identically, 250GB
>
> Open questions for which I wasn't able to find a answer myself :
>
> What triggers the event count? And why is the event counter on HDD2
> just 129, on all other 131?
> Can that cause problems while rescue my data and how can I work around it?
>
>
> What is that "UUID : ffffffff:ffffffff:ffffffff:ffffffff" on HDD2?
> What does it mean?
>
> Its really in the superblock on the hard disk:
>  hexdump -s 488006273b -C hdd2_ddrescue
>  3a2cc50200  a9 2b 4e fc 00 00 00 00  00 00 00 5b 00 00 00 00
> |.+N........[....|
>  3a2cc50210  00 00 00 00 ff ff ff ff  41 a0 de f0 00 00 00 05
> |........A.......|
>  3a2cc50220  0e 83 39 c0 00 00 00 04  00 00 00 04 00 00 00 01
> |..9.............|
>  3a2cc50230  00 00 00 00 ff ff ff ff  ff ff ff ff ff ff ff ff
> |................|
>  3a2cc50240  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
> |................|
> Would it help to rewrite the UUID via hexedit to the correct one?
>
>
> Can somebody explain the meaning of:
>  Reshape pos'n : 0
>      New Level : raid0
>     New Layout : left-asymmetric
>  New Chunksize : 0
> on HDD2 ?
>
>
> What parameters are included in the checksum?
> And how critical in on HHD2 that "Checksum : b8d2c453 - expected 45703820"?
>
>
> I have no explanation why "Version :" is on HDD2 on 0.91.00"...
> I see 0x5B in the partition 3 superblock on HDD2 (and on all other
> 0x5A), so its really on the disk...  Weird...
> Somebody any idea on that?
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
* Re: RAID5 crashed for unknown reason on old 2.6.16 kernel
  2010-06-28 15:29 ` Markus Hennig
@ 2010-06-29  6:50   ` Neil Brown
  2010-07-15 11:53     ` Markus Hennig
  0 siblings, 1 reply; 5+ messages in thread
From: Neil Brown @ 2010-06-29  6:50 UTC (permalink / raw)
  To: Markus Hennig; +Cc: linux-raid
On Mon, 28 Jun 2010 17:29:37 +0200
Markus Hennig <mhennig@gmail.com> wrote:
> Hi all,
> 
> for the (unlikely) case somebody is interested in a last update:
> 
> I learned in the meantime that the UUID as well as the mdadm version
> is part of the checksum. And that that checksum is calculated on the
> first 1kb of the 4kb ver0.0 superblock.
> (https://raid.wiki.kernel.org/index.php/RAID_superblock_formats#The_version-0.90_Superblock_Format)
> 
> Via hexedit I set the UUID on HHD2 back to the correct value and also
> changed the version information from 0.91.00 (0x5B) to 90 (0x5A).
> Done that the checksum was correct and equal the expect one.
> 
> mdadm --assemble worked than like a charm and my RAID5 is back.
Thanks for letting us know the resolution.
I cannot imagine how all those '1's got into the metadata where they
shouldn't be.
Based on the update times and event counters, HDD2 was slightly 'older'
than the other devices.  Hopefully nothing had changed on the array in the
intervening time.
You should have been able to assemble the array with just the 3 sane devices
and had a degraded RAID5.  Then add the fourth device and let it recover.
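As a sketch, that procedure would look something like this (hypothetical
device names; with the rescued images, the loop-mapped partitions from
the first mail would be used instead):

```shell
# 1. Assemble degraded from the three consistent members; --run starts
#    the array even though one slot is missing.
mdadm --assemble --run /dev/md1 /dev/sdX3 /dev/sdY3 /dev/sdZ3

# 2. Confirm it is up as a degraded 4-device RAID5.
cat /proc/mdstat

# 3. Add the fourth device back and let md rebuild onto it.
mdadm --manage /dev/md1 --add /dev/sdW3
```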
However what you did seems to have worked, so if your data looks OK, you
should be safe.
NeilBrown
> 
> That's it,
> Markus
> 
> 
> On Sat, Jun 26, 2010 at 11:22 PM, Markus Hennig <mhennig@gmail.com> wrote:
> > Hi all,
> >
> > my RAID5 with 4 disks crashed on a Buffalo "NAS" box (big-endian!) -
> > no logs of course...
> > I made immediately images of all disks and try to now gather my very
> > valuable content on a Linux box running GRML 4/10 (little-endian!)
> > with 2.6.33 and mdadm - v3.1.1.
> > Some blocks were not readable from HDD2, maybe that's the reason why
> > the Buffalo box shut down.
> >
> >
> > What I know already:
> >
> > - the RAID5 was created with a very old set of software:
> > linux-2.6.16-tshtgl.tgz   mdadm-2.5.2.tgz   xfsprogs-2.5.6_arm.tgz
> > - the Buffalo box blinked red on HDD2
> > - the box run a rebuild on HDD4, I don't know if that was already finished
> > - all disks are identically, 250GB
> >
> 
> > Open questions for which I wasn't able to find a answer myself :
> >
> > What triggers the event count? And why is the event counter on HDD2
> > just 129, on all other 131?
> > Can that cause problems while rescue my data and how can I work around it?
> >
> >
> > What is that "UUID : ffffffff:ffffffff:ffffffff:ffffffff" on HDD2?
> > What does it mean?
> >
> > Its really in the superblock on the hard disk:
> >  hexdump -s 488006273b -C hdd2_ddrescue
> >  3a2cc50200  a9 2b 4e fc 00 00 00 00  00 00 00 5b 00 00 00 00
> > |.+N........[....|
> >  3a2cc50210  00 00 00 00 ff ff ff ff  41 a0 de f0 00 00 00 05
> > |........A.......|
> >  3a2cc50220  0e 83 39 c0 00 00 00 04  00 00 00 04 00 00 00 01
> > |..9.............|
> >  3a2cc50230  00 00 00 00 ff ff ff ff  ff ff ff ff ff ff ff ff
> > |................|
> >  3a2cc50240  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
> > |................|
> > Would it help to rewrite the UUID via hexedit to the correct one?
> >
> >
> > Can somebody explain the meaning of:
> >  Reshape pos'n : 0
> >      New Level : raid0
> >     New Layout : left-asymmetric
> >  New Chunksize : 0
> > on HDD2 ?
> >
> >
> > What parameters are included in the checksum?
> > And how critical in on HHD2 that "Checksum : b8d2c453 - expected 45703820"?
> >
> >
> > I have no explanation why "Version :" is on HDD2 on 0.91.00"...
> > I see 0x5B in the partition 3 superblock on HDD2 (and on all other
> > 0x5A), so its really on the disk...  Weird...
> > Somebody any idea on that?
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
* Re: RAID5 crashed for unknown reason on old 2.6.16 kernel
  2010-06-29  6:50   ` Neil Brown
@ 2010-07-15 11:53     ` Markus Hennig
  2010-07-15 13:09       ` Roman Mamedov
  0 siblings, 1 reply; 5+ messages in thread
From: Markus Hennig @ 2010-07-15 11:53 UTC (permalink / raw)
  To: linux-raid; +Cc: Neil Brown
Hi all,
I got all my data back from a degraded RAID5 array with 3 disks.
The only point worth mentioning: XFS as the underlying file system is
unsuitable for small/cheap NAS boxes because it is not endian-safe.
I bought a PowerPC-based Mac to replay the XFS journal...
That leads to my question to the list: does somebody know whether
BTRFS is endian-safe, or what an endian-safe alternative to ext3/ext4
is?
Regards,
Markus
On Tue, Jun 29, 2010 at 8:50 AM, Neil Brown <neilb@suse.de> wrote:
> On Mon, 28 Jun 2010 17:29:37 +0200
> Markus Hennig <mhennig@gmail.com> wrote:
>
>> Hi all,
>>
>> for the (unlikely) case somebody is interested in a last update:
>>
>> I learned in the meantime that the UUID as well as the mdadm version
>> is part of the checksum. And that that checksum is calculated on the
>> first 1kb of the 4kb ver0.0 superblock.
>> (https://raid.wiki.kernel.org/index.php/RAID_superblock_formats#The_version-0.90_Superblock_Format)
>>
>> Via hexedit I set the UUID on HHD2 back to the correct value and also
>> changed the version information from 0.91.00 (0x5B) to 90 (0x5A).
>> Done that the checksum was correct and equal the expect one.
>>
>> mdadm --assemble worked than like a charm and my RAID5 is back.
>
> Thanks for letting us know the resolution.
> I cannot imagine how all those '1's got into the metadata where they
> shouldn't be.
>
> Based on the update times and event counter, the HDD2 was slightly 'older'
> than the other devices.  Hopefully nothing had changed on the array in the
> intervening time.
>
> You should have been able to assemble the array with just the 3 sane devices
> and had a degraded RAID5.  Then add the fourth device and let it recover.
>
> However what you did seems to have worked, so if your data looks OK, you
> should be safe.
>
> NeilBrown
>
>
>>
>> That's it,
>> Markus
>>
>>
>> On Sat, Jun 26, 2010 at 11:22 PM, Markus Hennig <mhennig@gmail.com> wrote:
>> > Hi all,
>> >
>> > my RAID5 with 4 disks crashed on a Buffalo "NAS" box (big-endian!) -
>> > no logs of course...
>> > I made immediately images of all disks and try to now gather my very
>> > valuable content on a Linux box running GRML 4/10 (little-endian!)
>> > with 2.6.33 and mdadm - v3.1.1.
>> > Some blocks were not readable from HDD2, maybe that's the reason why
>> > the Buffalo box shut down.
>> >
>> >
>> > What I know already:
>> >
>> > - the RAID5 was created with a very old set of software:
>> > linux-2.6.16-tshtgl.tgz   mdadm-2.5.2.tgz   xfsprogs-2.5.6_arm.tgz
>> > - the Buffalo box blinked red on HDD2
>> > - the box run a rebuild on HDD4, I don't know if that was already finished
>> > - all disks are identically, 250GB
>> >
>>
>> > Open questions for which I wasn't able to find a answer myself :
>> >
>> > What triggers the event count? And why is the event counter on HDD2
>> > just 129, on all other 131?
>> > Can that cause problems while rescue my data and how can I work around it?
>> >
>> >
>> > What is that "UUID : ffffffff:ffffffff:ffffffff:ffffffff" on HDD2?
>> > What does it mean?
>> >
>> > Its really in the superblock on the hard disk:
>> >  hexdump -s 488006273b -C hdd2_ddrescue
>> >  3a2cc50200  a9 2b 4e fc 00 00 00 00  00 00 00 5b 00 00 00 00
>> > |.+N........[....|
>> >  3a2cc50210  00 00 00 00 ff ff ff ff  41 a0 de f0 00 00 00 05
>> > |........A.......|
>> >  3a2cc50220  0e 83 39 c0 00 00 00 04  00 00 00 04 00 00 00 01
>> > |..9.............|
>> >  3a2cc50230  00 00 00 00 ff ff ff ff  ff ff ff ff ff ff ff ff
>> > |................|
>> >  3a2cc50240  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
>> > |................|
>> > Would it help to rewrite the UUID via hexedit to the correct one?
>> >
>> >
>> > Can somebody explain the meaning of:
>> >  Reshape pos'n : 0
>> >      New Level : raid0
>> >     New Layout : left-asymmetric
>> >  New Chunksize : 0
>> > on HDD2 ?
>> >
>> >
>> > What parameters are included in the checksum?
>> > And how critical in on HHD2 that "Checksum : b8d2c453 - expected 45703820"?
>> >
>> >
>> > I have no explanation why "Version :" is on HDD2 on 0.91.00"...
>> > I see 0x5B in the partition 3 superblock on HDD2 (and on all other
>> > 0x5A), so its really on the disk...  Weird...
>> > Somebody any idea on that?
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
* Re: RAID5 crashed for unknown reason on old 2.6.16 kernel
  2010-07-15 11:53     ` Markus Hennig
@ 2010-07-15 13:09       ` Roman Mamedov
  0 siblings, 0 replies; 5+ messages in thread
From: Roman Mamedov @ 2010-07-15 13:09 UTC (permalink / raw)
  To: Markus Hennig; +Cc: linux-raid, Neil Brown
On Thu, 15 Jul 2010 13:53:50 +0200
Markus Hennig <mhennig@gmail.com> wrote:
> I got all my data back from a degraded RAID5 array with 3 disks.
> The only point which is worth to mention: XFS as underlying file
> system is ineligible for small/cheap NAS because it is not edian safe.
> I bought a powerpc driven MAC to replay the XFS journal...
QEMU can emulate almost every modern architecture while running on almost any
different one, with various degrees of (understandable) slowness.
> That leads to my question to the list: does somebody know if BTRFS is
> endian safe or what is an endian-safe alternative to ext3/ext ?
I'd suggest using Ext4 for now. You can always convert it to BTRFS at a
later time: https://btrfs.wiki.kernel.org/index.php/Conversion_from_Ext3
-- 
With respect,
Roman