* Seeking help to get a failed RAID5 system back to life
@ 2014-08-29 2:07 Fabio Bacigalupo
2014-08-29 7:46 ` Robin Hill
0 siblings, 1 reply; 6+ messages in thread
From: Fabio Bacigalupo @ 2014-08-29 2:07 UTC (permalink / raw)
To: linux-raid
Hello,
I have been trying all night to get my system back to work. One of the
two remaining hard-drives suddenly stopped working today. I read and
tried everything I could find that seemed to not make things worse
than they are. Finally I stumbled upon this page [1] on the Linux Raid
wiki which recommends to consult this mailing list.
I had a RAID 5 installation with three disks but disk 0 (I assume as
it was /dev/sda3) has been taken out for a while. The disks reside in
a remote server.
Sorry if this is obvious to you but I am totally stuck. I always run
into dead ends.
Your help is very much appreciated!
Thank you for any hints,
Fabio
I could gather the following information:
================================================================================
# mdadm --examine /dev/sd*3
mdadm: No md superblock detected on /dev/sda3.
/dev/sdb3:
Magic : a92b4efc
Version : 0.90.00
UUID : f07f4bc6:36864b49:776c2c25:004bd7b2
Creation Time : Wed May 4 08:18:11 2011
Raid Level : raid5
Used Dev Size : 1462766336 (1395.00 GiB 1497.87 GB)
Array Size : 2925532672 (2790.01 GiB 2995.75 GB)
Raid Devices : 3
Total Devices : 1
Preferred Minor : 127
Update Time : Thu Aug 28 19:55:59 2014
State : clean
Active Devices : 1
Working Devices : 1
Failed Devices : 1
Spare Devices : 0
Checksum : 490fa722 - correct
Events : 68856340
Layout : left-symmetric
Chunk Size : 64K
Number Major Minor RaidDevice State
this 1 8 19 1 active sync /dev/sdb3
0 0 0 0 0 removed
1 1 8 19 1 active sync /dev/sdb3
2 2 0 0 2 faulty removed
/dev/sdc3:
Magic : a92b4efc
Version : 0.90.00
UUID : f07f4bc6:36864b49:776c2c25:004bd7b2
Creation Time : Wed May 4 08:18:11 2011
Raid Level : raid5
Used Dev Size : 1462766336 (1395.00 GiB 1497.87 GB)
Array Size : 2925532672 (2790.01 GiB 2995.75 GB)
Raid Devices : 3
Total Devices : 2
Preferred Minor : 127
Update Time : Thu Aug 28 19:22:19 2014
State : active
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
Checksum : 44f4f557 - correct
Events : 68856326
Layout : left-symmetric
Chunk Size : 64K
Number Major Minor RaidDevice State
this 2 8 35 2 active sync /dev/sdc3
0 0 0 0 0 removed
1 1 8 19 1 active sync /dev/sdb3
2 2 8 35 2 active sync /dev/sdc3
================================================================================
# mdadm --examine /dev/sd[b]
/dev/sdb:
MBR Magic : aa55
Partition[0] : 4737024 sectors at 2048 (type 83)
Partition[2] : 2925532890 sectors at 4739175 (type fd)
================================================================================
Disk /dev/sdc has been replaced with a new hard drive as the old one
had input/output errors.
I assume this is weird; it showed /dev/sdb3 before I started changing things:
# cat /proc/mdstat
Personalities : [raid1]
unused devices: <none>
I tried to copy the partition structure from /dev/sdb to /dev/sdc, which presumably worked:
# sgdisk -R /dev/sdc /dev/sdb
***************************************************************
Found invalid GPT and valid MBR; converting MBR to GPT format
in memory.
***************************************************************
The operation has completed successfully.
# sgdisk -G /dev/sdc
The operation has completed successfully.
# fdisk -l
-- Removed /dev/sda --
Disk /dev/sdb: 1500.3 GB, 1500301910016 bytes, 2930277168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: dos
Disk identifier: 0x0005fb16
Device Boot Start End Blocks Id System
/dev/sdb1 2048 4739071 2368512 83 Linux
/dev/sdb3 * 4739175 2930272064 1462766445 fd Linux raid autodetect
WARNING: fdisk GPT support is currently new, and therefore in an
experimental phase. Use at your own discretion.
Disk /dev/sdc: 1500.3 GB, 1500301910016 bytes, 2930277168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: gpt
# Start End Size Type Name
1 2048 4739071 2.3G Linux filesyste Linux filesystem
3 4739175 2930272064 1.4T Linux RAID Linux RAID
# mdadm --assemble /dev/md127 /dev/sd[bc]3
mdadm: no RAID superblock on /dev/sdc3
mdadm: /dev/sdc3 has no superblock - assembly aborted
# mdadm --assemble /dev/md127 /dev/sd[b]3
mdadm: /dev/md127 assembled from 1 drive - not enough to start the array.
# mdadm --misc -QD /dev/sd[bc]3
mdadm: /dev/sdb3 does not appear to be an md device
mdadm: /dev/sdc3 does not appear to be an md device
# mdadm --detail /dev/md127
/dev/md127:
Version :
Raid Level : raid0
Total Devices : 0
State : inactive
Number Major Minor RaidDevice
[1] https://raid.wiki.kernel.org/index.php/RAID_Recovery
* Re: Seeking help to get a failed RAID5 system back to life
2014-08-29 2:07 Seeking help to get a failed RAID5 system back to life Fabio Bacigalupo
@ 2014-08-29 7:46 ` Robin Hill
2014-08-29 8:55 ` Fabio Bacigalupo
0 siblings, 1 reply; 6+ messages in thread
From: Robin Hill @ 2014-08-29 7:46 UTC (permalink / raw)
To: Fabio Bacigalupo; +Cc: linux-raid
On Fri Aug 29, 2014 at 04:07:40AM +0200, Fabio Bacigalupo wrote:
> Hello,
>
> I have been trying all night to get my system back to work. One of the
> two remaining hard-drives suddenly stopped working today. I read and
> tried everything I could find that seemed to not make things worse
> than they are. Finally I stumbled upon this page [1] on the Linux Raid
> wiki which recommends to consult this mailing list.
>
> I had a RAID 5 installation with three disks but disk 0 (I assume as
> it was /dev/sda3) has been taken out for a while. The disks reside in
> a remote server.
>
That's a disaster waiting to happen. You should never leave a RAID array
in a degraded state for any longer than is absolutely necessary,
otherwise you might as well not bother running RAID at all.
> Sorry if this is obvious to you but I am totally stuck. I always run
> into dead ends.
>
> Your help is very much appreciated!
>
> Thank you for any hints,
> Fabio
>
> I could gather the following information:
>
> ================================================================================
>
> # mdadm --examine /dev/sd*3
> mdadm: No md superblock detected on /dev/sda3.
> /dev/sdb3:
> Magic : a92b4efc
> Version : 0.90.00
> UUID : f07f4bc6:36864b49:776c2c25:004bd7b2
> Creation Time : Wed May 4 08:18:11 2011
> Raid Level : raid5
> Used Dev Size : 1462766336 (1395.00 GiB 1497.87 GB)
> Array Size : 2925532672 (2790.01 GiB 2995.75 GB)
> Raid Devices : 3
> Total Devices : 1
> Preferred Minor : 127
>
> Update Time : Thu Aug 28 19:55:59 2014
> State : clean
> Active Devices : 1
> Working Devices : 1
> Failed Devices : 1
> Spare Devices : 0
> Checksum : 490fa722 - correct
> Events : 68856340
>
> Layout : left-symmetric
> Chunk Size : 64K
>
> Number Major Minor RaidDevice State
> this 1 8 19 1 active sync /dev/sdb3
>
> 0 0 0 0 0 removed
> 1 1 8 19 1 active sync /dev/sdb3
> 2 2 0 0 2 faulty removed
> /dev/sdc3:
> Magic : a92b4efc
> Version : 0.90.00
> UUID : f07f4bc6:36864b49:776c2c25:004bd7b2
> Creation Time : Wed May 4 08:18:11 2011
> Raid Level : raid5
> Used Dev Size : 1462766336 (1395.00 GiB 1497.87 GB)
> Array Size : 2925532672 (2790.01 GiB 2995.75 GB)
> Raid Devices : 3
> Total Devices : 2
> Preferred Minor : 127
>
> Update Time : Thu Aug 28 19:22:19 2014
> State : active
> Active Devices : 2
> Working Devices : 2
> Failed Devices : 0
> Spare Devices : 0
> Checksum : 44f4f557 - correct
> Events : 68856326
>
> Layout : left-symmetric
> Chunk Size : 64K
>
> Number Major Minor RaidDevice State
> this 2 8 35 2 active sync /dev/sdc3
>
> 0 0 0 0 0 removed
> 1 1 8 19 1 active sync /dev/sdb3
> 2 2 8 35 2 active sync /dev/sdc3
>
>
> ================================================================================
>
> # mdadm --examine /dev/sd[b]
> /dev/sdb:
> MBR Magic : aa55
> Partition[0] : 4737024 sectors at 2048 (type 83)
> Partition[2] : 2925532890 sectors at 4739175 (type fd)
>
>
> ================================================================================
>
> Disk /dev/sdc has been replaced with a new hard drive as the old one
> had input/output errors.
>
Are the above --examine results from before or after the replacement?
Was the old /dev/sdc data replicated onto the replacement disk?
> I assume this is weird; it showed /dev/sdb3 before I started changing things:
>
> # cat /proc/mdstat
> Personalities : [raid1]
> unused devices: <none>
>
> I tried to copy the partition structure from /dev/sdb to /dev/sdc, which presumably worked:
>
This shouldn't be needed if the old disk was replicated before being
replaced.
> # sgdisk -R /dev/sdc /dev/sdb
>
> ***************************************************************
> Found invalid GPT and valid MBR; converting MBR to GPT format
> in memory.
> ***************************************************************
>
> The operation has completed successfully.
>
> # sgdisk -G /dev/sdc
>
> The operation has completed successfully.
>
> # fdisk -l
>
> -- Removed /dev/sda --
>
> Disk /dev/sdb: 1500.3 GB, 1500301910016 bytes, 2930277168 sectors
> Units = sectors of 1 * 512 = 512 bytes
> Sector size (logical/physical): 512 bytes / 512 bytes
> I/O size (minimum/optimal): 512 bytes / 512 bytes
> Disk label type: dos
> Disk identifier: 0x0005fb16
>
> Device Boot Start End Blocks Id System
> /dev/sdb1 2048 4739071 2368512 83 Linux
> /dev/sdb3 * 4739175 2930272064 1462766445 fd Linux raid autodetect
> WARNING: fdisk GPT support is currently new, and therefore in an
> experimental phase. Use at your own discretion.
>
> Disk /dev/sdc: 1500.3 GB, 1500301910016 bytes, 2930277168 sectors
> Units = sectors of 1 * 512 = 512 bytes
> Sector size (logical/physical): 512 bytes / 512 bytes
> I/O size (minimum/optimal): 512 bytes / 512 bytes
> Disk label type: gpt
>
> # Start End Size Type Name
> 1 2048 4739071 2.3G Linux filesyste Linux filesystem
> 3 4739175 2930272064 1.4T Linux RAID Linux RAID
>
>
> # mdadm --assemble /dev/md127 /dev/sd[bc]3
> mdadm: no RAID superblock on /dev/sdc3
> mdadm: /dev/sdc3 has no superblock - assembly aborted
>
> # mdadm --assemble /dev/md127 /dev/sd[b]3
> mdadm: /dev/md127 assembled from 1 drive - not enough to start the array.
>
> # mdadm --misc -QD /dev/sd[bc]3
> mdadm: /dev/sdb3 does not appear to be an md device
> mdadm: /dev/sdc3 does not appear to be an md device
>
> # mdadm --detail /dev/md127
> /dev/md127:
> Version :
> Raid Level : raid0
> Total Devices : 0
>
> State : inactive
>
> Number Major Minor RaidDevice
>
>
> [1] https://raid.wiki.kernel.org/index.php/RAID_Recovery
If the initial --examine results were done on the same disks as the
--assemble then I'm rather confused as to why mdadm would find a
superblock for one and not for the other. Could you post the mdadm and
kernel versions - possibly there's a bug that's been fixed in newer
releases.
If the --examine was on the old disk and this wasn't replicated onto the
new one then I'm not sure what you're expecting to happen here - you've
lost 2 disks in a 3-disk RAID-5 so your data is now toast.
Cheers,
Robin
--
___
( ' } | Robin Hill <robin@robinhill.me.uk> |
/ / ) | Little Jim says .... |
// !! | "He fallen in de water !!" |
* Re: Seeking help to get a failed RAID5 system back to life
2014-08-29 7:46 ` Robin Hill
@ 2014-08-29 8:55 ` Fabio Bacigalupo
2014-08-29 9:10 ` Robin Hill
0 siblings, 1 reply; 6+ messages in thread
From: Fabio Bacigalupo @ 2014-08-29 8:55 UTC (permalink / raw)
To: linux-raid
Hello Robin,
thank you for your feedback!
2014-08-29 9:46 GMT+02:00 Robin Hill <robin@robinhill.me.uk>:
> That's a disaster waiting to happen. You should never leave a RAID array
> in a degraded state for any longer than is absolutely necessary,
> otherwise you might as well not bother running RAID at all.
>> I could gather the following information:
> Are the above --examine results from before or after the replacement?
I took them before the replacement.
> Was the old /dev/sdc data replicated onto the replacement disk?
No, not yet. Luckily the guys in the data center kept the disk.
> If the initial --examine results were done on the same disks as the
> --assemble then I'm rather confused as to why mdadm would find a
> superblock for one and not for the other. Could you post the mdadm and
> kernel versions - possibly there's a bug that's been fixed in newer
> releases.
There is no bug. I was just under a false assumption.
> If the --examine was on the old disk and this wasn't replicated onto the
> new one then I'm not sure what you're expecting to happen here - you've
> lost 2 disks in a 3-disk RAID-5 so your data is now toast.
Ok, now that is clear. I will use ddrescue to replicate the old disk
to the new one and try again.
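My plan is roughly the following - a quick first pass that skips the
problem areas, then a second pass that retries the bad sectors
(/dev/OLD and /dev/NEW are just placeholders for whatever names the
disks get once the old one is attached again):
# ddrescue -f -n /dev/OLD /dev/NEW /root/ddrescue_raid.log
# ddrescue -f -r3 /dev/OLD /dev/NEW /root/ddrescue_raid.log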
Thank you,
Fabio
* Re: Seeking help to get a failed RAID5 system back to life
2014-08-29 8:55 ` Fabio Bacigalupo
@ 2014-08-29 9:10 ` Robin Hill
2014-08-31 9:12 ` Fabio Bacigalupo
0 siblings, 1 reply; 6+ messages in thread
From: Robin Hill @ 2014-08-29 9:10 UTC (permalink / raw)
To: Fabio Bacigalupo; +Cc: linux-raid
On Fri Aug 29, 2014 at 10:55:53AM +0200, Fabio Bacigalupo wrote:
> Hello Robin,
>
> thank you for your feedback!
>
> 2014-08-29 9:46 GMT+02:00 Robin Hill <robin@robinhill.me.uk>:
> > That's a disaster waiting to happen. You should never leave a RAID array
> > in a degraded state for any longer than is absolutely necessary,
> > otherwise you might as well not bother running RAID at all.
>
> >> I could gather the following information:
>
> > Are the above --examine results from before or after the replacement?
>
> I took them before the replacement.
>
I suspected as much.
> > Was the old /dev/sdc data replicated onto the replacement disk?
>
> No, not yet. Luckily the guys in the data center kept the disk.
>
If you'd had the third disk in the array in the first place then you
could have just added the new disk to the array and left it to rebuild
the data, but with the array already in a degraded state you absolutely
need that data off the second disk.
> > If the initial --examine results were done on the same disks as the
> > --assemble then I'm rather confused as to why mdadm would find a
> > superblock for one and not for the other. Could you post the mdadm and
> > kernel versions - possibly there's a bug that's been fixed in newer
> > releases.
>
> There is no bug. I was just under a false assumption.
>
> > If the --examine was on the old disk and this wasn't replicated onto the
> > new one then I'm not sure what you're expecting to happen here - you've
> > lost 2 disks in a 3-disk RAID-5 so your data is now toast.
>
> Ok, now that is clear. I will use ddrescue to replicate the old disk
> to the new one and try again.
>
You'll need to use --assemble --force in order to get the array going
again afterwards (as the event counts are different on the two disks).
If there are any blocks that couldn't be read by ddrescue then you'll
also need to run a fsck on the array after assembly to deal with any
resulting corruption - this may affect file data, directory metadata or
may just be in unused parts of the disk (if you're really lucky).
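Something along these lines should do it - a rough sketch only, using
the same device names as in your output above, so double-check them
before running anything (if a half-assembled /dev/md127 is already
lying around, stop it first):
# mdadm --stop /dev/md127
# mdadm --assemble --force /dev/md127 /dev/sdb3 /dev/sdc3
# fsck -n /dev/md127
# fsck /dev/md127
(For ext* filesystems, fsck -n gives a read-only preview of what it
would change before you let it loose for real.)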
I'd definitely recommend adding the third disk back into the array
afterwards though, and making sure regular checks are run on the array
(echo check > /sys/block/mdX/md/sync_action) to pick up any disk errors
or sync issues before they cause major problems.
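For md127 here that would be something like the following - progress
shows up in /proc/mdstat, and mismatch_cnt should read 0 once the
check has finished:
# echo check > /sys/block/md127/md/sync_action
# cat /proc/mdstat
# cat /sys/block/md127/md/mismatch_cnt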
Cheers,
Robin
--
___
( ' } | Robin Hill <robin@robinhill.me.uk> |
/ / ) | Little Jim says .... |
// !! | "He fallen in de water !!" |
* Re: Seeking help to get a failed RAID5 system back to life
2014-08-29 9:10 ` Robin Hill
@ 2014-08-31 9:12 ` Fabio Bacigalupo
2014-08-31 11:15 ` Robin Hill
0 siblings, 1 reply; 6+ messages in thread
From: Fabio Bacigalupo @ 2014-08-31 9:12 UTC (permalink / raw)
To: Fabio Bacigalupo, linux-raid
Hello Robin, hello list,
> then you absolutely need that data off the second disk.
I ran ddrescue and it found errors but succeeded in copying the data.
# ddrescuelog -t /root/ddrescue_raid.log
current pos: 1500 GB, current status: finished
domain size: 1500 GB, in 1 area(s)
rescued: 1500 GB, in 8 area(s) ( 99.99%)
non-tried: 0 B, in 0 area(s) ( 0%)
errsize: 122368 B, errors: 7 ( 0.00%)
non-trimmed: 0 B, in 0 area(s) ( 0%)
non-split: 116736 B, in 9 area(s) ( 0.00%)
bad-sector: 5632 B, in 9 area(s) ( 0.00%)
2014-08-29 11:10 GMT+02:00 Robin Hill <robin@robinhill.me.uk>:
> You'll need to use --assemble --force in order to get the array going
> again afterwards (as the event counts are different on the two disks).
I finally got my RAID array back up and running. Thank you for your
guidance, Robin. There is one last question: the third drive used to
be /dev/sda3, which is now occupied by the system disk. If I add
another disk (a new one) to the system it will be /dev/sdd. What do I
need to do to add it to the RAID array? Can it fill the unused slot
[_UU], or do I have to add it as a new drive and end up with something
like [_UUU]?
It did not work right away, so if anyone stumbles upon this thread,
here is what I did:
# mdadm --assemble /dev/md127 /dev/sd[bc]3 --force
mdadm: forcing event count in /dev/sdc3(2) from 68856326 upto 68856340
mdadm: clearing FAULTY flag for device 1 in /dev/md127 for /dev/sdc3
mdadm: Marking array /dev/md127 as 'clean'
mdadm: /dev/md127 assembled from 2 drives - not enough to start the array.
# cat /proc/mdstat
Personalities : [raid1]
md127 : inactive sdb3[1](S) sdc3[2](S)
2925532672 blocks
unused devices: <none>
# mdadm --stop /dev/md127
mdadm: stopped /dev/md127
# mdadm --assemble /dev/md127 /dev/sd[bc]3 --force
mdadm: /dev/md127 has been started with 2 drives (out of 3).
# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md127 : active raid5 sdb3[1] sdc3[2]
2925532672 blocks level 5, 64k chunk, algorithm 2 [3/2] [_UU]
unused devices: <none>
# fsck /dev/md127
I was lucky this time. fsck complained only twice.
Ciao
Fabio
* Re: Seeking help to get a failed RAID5 system back to life
2014-08-31 9:12 ` Fabio Bacigalupo
@ 2014-08-31 11:15 ` Robin Hill
0 siblings, 0 replies; 6+ messages in thread
From: Robin Hill @ 2014-08-31 11:15 UTC (permalink / raw)
To: Fabio Bacigalupo; +Cc: linux-raid
On Sun Aug 31, 2014 at 11:12:44AM +0200, Fabio Bacigalupo wrote:
> 2014-08-29 11:10 GMT+02:00 Robin Hill <robin@robinhill.me.uk>:
> > You'll need to use --assemble --force in order to get the array going
> > again afterwards (as the event counts are different on the two disks).
>
> I finally got my RAID array back up and running. Thank you for your
> guidance, Robin. There is one last question: the third drive used to
> be /dev/sda3, which is now occupied by the system disk. If I add
> another disk (a new one) to the system it will be /dev/sdd. What do I
> need to do to add it to the RAID array? Can it fill the unused slot
> [_UU], or do I have to add it as a new drive and end up with something
> like [_UUU]?
>
You'll need to copy the partition info over from one of the existing
array members, then just use --add to add it into the array - it will
fill in the unused slot. md doesn't care what the kernel disk
order/naming is at all, it uses the RAID metadata to figure out which
disk goes where in the array.
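Roughly something like this, assuming the new disk really does show up
as /dev/sdd (double-check with fdisk -l first):
# sfdisk -d /dev/sdb > sdb.parts
# sfdisk /dev/sdd < sdb.parts
# mdadm /dev/md127 --add /dev/sdd3
The array should then start rebuilding onto the new disk on its own;
/proc/mdstat will show the recovery progress.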
Glad to hear you've got it all back up and running anyway.
Cheers,
Robin
--
___
( ' } | Robin Hill <robin@robinhill.me.uk> |
/ / ) | Little Jim says .... |
// !! | "He fallen in de water !!" |