* Rebuilding an array with a corrupt disk.
From: Sean Hildebrand @ 2008-06-14 2:22 UTC (permalink / raw)
To: linux-raid
I had a batch of disks go bad in my array, and have swapped in new disks.
My array is a five-disk RAID5, each disk 750GB. Currently I have four
disks operational within the array, so the array is functionally a
RAID0. Rebuilds have gone fine, except for the latest disk, which I've
tried four times.
At 74% into the rebuild, mdadm drops /dev/sdd1 (the spare being
synced) and /dev/sda1 (a synced disk active in the array) due to a
read error on /dev/sda1. According to smartctl, there have been 43
read errors on the disk, and they occur in groups.
The array contents have been modified since the removal of the older
disks, so only the four currently-operational disks are synced.
Fscking the array also runs into trouble past the halfway mark: at a
certain point /dev/sda1 is dropped from the array and fsck begins
spitting out inode read errors.
Are there any safe ways to remedy my problem? Resizing the array from
five disks to four and then removing /dev/sda1 seems impossible, since
for the array to be resized, error-free reads of /dev/sda1 would be
necessary, no?
* Re: Rebuilding an array with a corrupt disk.
From: David Greaves @ 2008-06-14 6:21 UTC (permalink / raw)
To: Sean Hildebrand; +Cc: linux-raid
Sean Hildebrand wrote:
> I had a batch of disks go bad in my array, and have swapped in new disks.
>
> My array is a five disk RAID5, each 750GB. Currently I have four disks
> operational within the array, so the array is functionally a RAID0.
> Rebuilds have gone fine, except for the latest disk, which I've tried
> four times.
>
> At 74% into the rebuild, mdadm drops /dev/sdd1 (The spare being
> synced) and /dev/sda1 (A synced disk active in the array.) due to a
> read error on /dev/sda1. Checking smartctl, there have been 43 read
> errors on the disk, and they occur in groups.
You have 2 faulty drives.
Pounding on them will only make things worse.
Get 2 new drives and use ddrescue to copy /dev/sda to a new drive and replace
/dev/sda. Then add your second new drive.
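As a rough sketch - /dev/sdX below stands for the new blank drive and the
logfile path is only an example, so double-check the device names before
running anything:

  # first pass: copy everything that reads cleanly, skip the slow splitting
  # phase, and keep a logfile so the run can be resumed later
  ddrescue -f -n /dev/sda /dev/sdX /root/sda-rescue.log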
> The array contents have been modified since the removal of the older
> disks - So only the four currently-operational disks are synced.
> Fscking the array also has issues past the halfway mark - Namely, when
> it gets to a certain point, /dev/sda1 is dropped from the array and
> fsck begins spitting out inode read errors.
Well, once sda is gone you're reading garbage if the array even stays up.
> Are there any safe ways to remedy my problem? Resizing the array from
> five disks to four and then removing /dev/sda1 is impossible, as for
> the array to be resized, error free reads of /dev/sda1 would be
> necessary, no?
It depends how well ddrescue does at reading /dev/sda.
The sooner you do it the more chance you have.
David
* Re: Rebuilding an array with a corrupt disk.
From: Sean Hildebrand @ 2008-06-14 10:54 UTC (permalink / raw)
To: David Greaves; +Cc: linux-raid
How's that?
The spare (/dev/sdd) seems to be fine. I haven't tried the rebuild
with any other disks, but smartctl doesn't report any issues with
/dev/sdd, only /dev/sda.
Ran ddrescue and managed to recover 559071 MB, but the other 191GB was
thousands upon thousands of read errors.
Now, prior to this, with the array in degraded mode, I was able to
access and modify all the files I found, but mdadm would always fail
on the rebuild, and fsck would always fail too - the array would go
down roughly 75% of the way through the scan, presumably when it first
hit the bad sections of the disk.
ddrescue has not yet finished - it's currently "Splitting error
areas..." Given that the array was mountable prior to running
ddrescue, is it safe to assume that, once it's done, the
partially-cloned /dev/sda1 that ddrescue has written onto /dev/sdd1
will be mountable as part of the array so I can assess the file loss?
I am unsure of how data is spread through a RAID5. Each disk gets an
equal portion of the data, but do the drives fill up in linear
fashion? I ask because, whether the array is being rebuilt or fscked,
it fails roughly 75% of the way through either operation, yet I never
had the array go down while I was using it - only when fsck was
running or mdadm was rebuilding.
The array is 2.69TB, with 1.57TB currently free. If the drives do
fill linearly (or even semi-linearly), is it likely that the majority
of the 191GB of errors is empty space?
If this isn't making much sense I apologize. I'm sleep-deprived and
not enjoying the prospect of losing large quantities of my data.
On Sat, Jun 14, 2008 at 2:21 AM, David Greaves <david@dgreaves.com> wrote:
> Sean Hildebrand wrote:
>> I had a batch of disks go bad in my array, and have swapped in new disks.
>>
>> My array is a five disk RAID5, each 750GB. Currently I have four disks
>> operational within the array, so the array is functionally a RAID0.
>> Rebuilds have gone fine, except for the latest disk, which I've tried
>> four times.
>>
>> At 74% into the rebuild, mdadm drops /dev/sdd1 (The spare being
>> synced) and /dev/sda1 (A synced disk active in the array.) due to a
>> read error on /dev/sda1. Checking smartctl, there have been 43 read
>> errors on the disk, and they occur in groups.
>
> You have 2 faulty drives.
>
> Pounding on them will only make things worse.
>
> Get 2 new drives and use ddrescue to copy /dev/sda to a new drive and replace
> /dev/sda. Then add your second new drive.
>
>> The array contents have been modified since the removal of the older
>> disks - So only the four currently-operational disks are synced.
>
>> Fscking the array also has issues past the halfway mark - Namely, when
>> it gets to a certain point, /dev/sda1 is dropped from the array and
>> fsck begins spitting out inode read errors.
> Well, once sda is gone you're reading garbage if the array even stays up.
>
>> Are there any safe ways to remedy my problem? Resizing the array from
>> five disks to four and then removing /dev/sda1 is impossible, as for
>> the array to be resized, error free reads of /dev/sda1 would be
>> necessary, no?
> It depends how well ddrescue does at reading /dev/sda.
>
> The sooner you do it the more chance you have.
>
> David
>
>
* Re: Rebuilding an array with a corrupt disk.
From: David Greaves @ 2008-06-14 11:47 UTC (permalink / raw)
To: Sean Hildebrand; +Cc: linux-raid
Sean Hildebrand wrote:
> How's that?
>
> The spare (/dev/sdd) seems to be fine. I haven't tried the rebuild
> with any other disks, but smartctl doesn't report any issues with
> /dev/sdd, only /dev/sda.
Sorry, I misread what you said.
I thought you had errors on both sda and sdd.
> Ran ddrescue and managed to recover 559071 MB, but the other 191GB was
> thousands upon thousands of read errors.
Looking fairly bad then.
> Now, prior to this with the array in degraded mode I was able to
> access and modify all files I found, but mdadm would always fail on
> rebuild, and fsck would always fail and the array would go down
> roughly 75% through the scan, presumably when first encountering bad
> sections of the disk.
Sounds reasonable.
> ddrescue has not yet finished - It's currently "Splitting error
> areas..." - Given that the array has been mountable prior to running
> ddrescue, is it safe to assume that once it's done, the
> partially-cloned /dev/sda1 that ddrescue has output onto /dev/sdd1
> will be mountable as part of the array so I can assess file loss?
It should be.
Additionally, the RAID won't die while fsck runs.
However, if any of the other disks die, then you will have problems.
It's safer to add the spare when it arrives and go to a redundant setup. Then,
if any one drive dies, fsck will continue.
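If the superblock made it onto the clone, the steps would look roughly like
this - the device names and md number are only examples, so adjust them to
your setup:

  mdadm --examine /dev/sdd1     # check the md superblock survived the clone
  mdadm --stop /dev/md0         # stop the degraded array
  mdadm --assemble --force /dev/md0 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
  mdadm --add /dev/md0 /dev/sdf1   # add the new spare once it arrives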
Also note that you *may* recover more data by using ddrescue with a logfile
and re-running it after chilling the failed drive, etc. - Google it.
The longer you persevere with ddrescue, the more data you have a chance of
recovering. Maybe keep at it until the replacement spare arrives. Again, read
up on ddrescue - the list archives had something on it in the last few weeks.
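A retry pass reuses the same logfile, so only the areas previously marked bad
get re-read - again, /dev/sdX and the logfile path are just placeholders:

  # -d uses direct disc access, -r3 retries each bad area three times
  ddrescue -f -d -r3 /dev/sda /dev/sdX /root/sda-rescue.log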
> I am unsure of how data is spread through a RAID5. Each disk gets an
> equal portion of data, but do drives fill up in linear fashion?
No. The data is spread among the drives, so you've lost everything from the
75% mark upwards on all the drives.
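Very roughly, the layout looks something like this with the default 64k
chunks (the exact parity rotation depends on the layout md uses; the disk
names are only for illustration):

              sda    sdb    sdc    sdd    sde
  stripe 0:   D0     D1     D2     D3     P
  stripe 1:   D4     D5     D6     P      D7
  stripe 2:   D8     D9     P      D10    D11
  ...

So block N of the array sits about N/4 of the way into one of the members -
the drives fill in parallel, not one after another.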
> I ask
> this because whether the array is being rebuilt or fscked it fails at
> roughly 75% through either operation, yet I never had the array go
> down while I was using it - Only when fsck was running or mdadm was
> rebuilding.
> The array is 2.69 TB, with 1.57TB currently free - If the drives do
> fill linearly (Or even semi-linearly) is it likely that the majority
> of the 191GB of errors are empty space?
I don't know how various filesystems use space.
It also depends on previous usage - was the disk ever more full? etc. etc.
I do know that with 'normal' filesystems (ext/xfs/etc.) the answer is undefined.
Plus it's 191GB x4 - so ~800GB of corrupted md device.
Sorry - keep your fingers crossed.
> If this isn't making much sense I apologize. I'm sleep deprived and
> not enjoying the prospect of losing large quantities of my data.
Sad, but people do use RAID instead of backups.
RAID is a convenience that helps with uptime in the event of a failure and
reduces the risk of data loss between backups.
Let's see what can be done to get it all back, though - you may be lucky.
David
* Re: Rebuilding an array with a corrupt disk.
From: Sean Hildebrand @ 2008-06-15 11:54 UTC (permalink / raw)
To: David Greaves; +Cc: linux-raid
Unfortunately, ddrescue didn't do me any good. The amount of data
recovered from /dev/sda1 and written to /dev/sdd1 was not sufficient
for /dev/sdd1 to be included in the array.
However, since I was only using around a third of the array, it looks
like there wasn't much data in the latter portion of /dev/sda1. I
mounted the array and cp'd the data off, ending up with only eleven
read errors, all of which caused the array to kick /dev/sda1 out as a
failed disk, and three of which were severe enough to stop the
motherboard from recognizing the disk until after a reboot.
I got all my data back, save the eleven folders the read errors
occurred in - thankfully the lost data isn't terribly important.
Is there no way to get mdadm to allow a certain number of read errors
from a disk, instead of removing it from the array immediately?
Manually unmounting, stopping, and re-assembling is something of a
chore, especially since the system locks access to the array while
copying, despite the read error.
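For reference, the loop I kept having to repeat was roughly this (the device
names, md number, and mount point are just examples):

  umount /mnt/array
  mdadm --stop /dev/md0
  mdadm --assemble --force /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sde1
  mount /dev/md0 /mnt/array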
> It also depends on previous usage - was the disk ever more full? etc. etc.
To answer that: the drive was brand new. The thing I find odd about
this failure is that the disk was integrated into the array without
issue, meaning it has no problem writing to the bad sectors, only
reading them. I've never seen that before.
In any case, I'm very glad to have got my data back with minimal loss.
And to think, all this could have been avoided if I'd just made my
array a RAID6 when it was first built. Certainly, when I have a new
fifth disk, the array will be rebuilt as such.
On Sat, Jun 14, 2008 at 7:47 AM, David Greaves <david@dgreaves.com> wrote:
> Sean Hildebrand wrote:
>> How's that?
>>
>> The spare (/dev/sdd) seems to be fine. I haven't tried the rebuild
>> with any other disks, but smartctl doesn't report any issues with
>> /dev/sdd, only /dev/sda.
> Sorry, misread what you said.
> Thought you had errors on both sda and sdd.
>
>
>> Ran ddrescue and managed to recover 559071 MB, but the other 191GB was
>> thousands upon thousands of read errors.
> Looking fairly bad then.
>
>> Now, prior to this with the array in degraded mode I was able to
>> access and modify all files I found, but mdadm would always fail on
>> rebuild, and fsck would always fail and the array would go down
>> roughly 75% through the scan, presumably when first encountering bad
>> sections of the disk.
> Sounds reasonable.
>
>> ddrescue has not yet finished - It's currently "Splitting error
>> areas..." - Given that the array has been mountable prior to running
>> ddrescue, is it safe to assume that once it's done, the
>> partially-cloned /dev/sda1 that ddrescue has output onto /dev/sdd1
>> will be mountable as part of the array so I can assess file loss?
> It should be.
> Additionally, the raid won't die as fsck works.
>
> However if any of the other disks die then you will have problems.
> It's safer to add the spare when it arrives and go to a redundant setup. Then, if
> any one drive dies, fsck will continue.
>
> Also note that you *may* recover more data by using ddrescue with a logfile and
> re-running it after chilling the failed drive etc. Google...
>
> The longer you persevere with ddrescue, the more data you have the chance of
> recovering. Maybe keep at it until the replacement spare arrives. Again - read
> up on ddrescue - the list archives had something in the last few weeks.
>
>> I am unsure of how data is spread through a RAID5. Each disk gets an
>> equal portion of data, but do drives fill up in linear fashion?
> No. The data is spread among the drives. You've lost everything from the 75% up
> mark on all the drives.
>
>> I ask
>> this because whether the array is being rebuilt or fscked it fails at
>> roughly 75% through either operation, yet I never had the array go
>> down while I was using it - Only when fsck was running or mdadm was
>> rebuilding.
>
>> The array is 2.69 TB, with 1.57TB currently free - If the drives do
>> fill linearly (Or even semi-linearly) is it likely that the majority
>> of the 191GB of errors are empty space?
> I don't know how various filesystems use space.
> It also depends on previous usage - was the disk ever more full? etc etc.
> I do know that with 'normal' filesystems (ext/xfs/etc) then the answer is undefined.
> Plus it's 191GB x4 - so ~800GB of corrupted md device.
>
> Sorry - keep fingers crossed.
>
>> If this isn't making much sense I apologize. I'm sleep deprived and
>> not enjoying the prospect of losing large quantities of my data.
> Sad, but people do use RAID instead of backups.
> RAID is a convenience that helps with uptime in the event of a failure and
> reduces the risk of data-loss between backups.
>
> Let's see what can be done to get it all back though - you may be lucky.
>
> David
>
* Re: Rebuilding an array with a corrupt disk.
From: David Greaves @ 2008-06-15 13:32 UTC (permalink / raw)
To: Sean Hildebrand; +Cc: linux-raid
Sean Hildebrand wrote:
> I got all my data, save the eleven folders that read errors occurred
> in - Thankfully the data lost isn't terribly important.
Good.
> Is there no way to get mdadm to allow a certain number of read errors
> from a disk, instead of removing it from the array immediately?
> Manually unmounting, stopping, and re-assembling is somewhat of a
> chore, especially when the system locks access to the array while
> copying, despite the read error.
Not that I know of.
Since you got your data back it's moot... but:
You couldn't add /dev/sdd1 because the RAID superblock is at the end of the
disk - clearly readable, though, since it was read at startup.
The next thing would have been to force a re-creation of the array using the
new disk.
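Something along these lines - a sketch only, and dangerous: the level, chunk
size, metadata version and member order all have to match the original array
exactly, with 'missing' in the failed disk's slot, or the data gets scrambled.
Device names and the 64k chunk are just examples:

  mdadm --create /dev/md0 --level=5 --raid-devices=5 --chunk=64 \
        --assume-clean /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 missing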
Anyhow, glad you're sorted
David
* Re: Rebuilding an array with a corrupt disk.
From: Peter Rabbitson @ 2008-06-15 15:09 UTC (permalink / raw)
To: Sean Hildebrand; +Cc: David Greaves, linux-raid
Sean Hildebrand wrote:
>
> <snip>
>
>
> To answer that: The drive was brand new. The thing I find odd about
> this failure is that it was integrated into the array without issue,
> meaning the disk has no issues writing to the bad sectors, just
> reading. Never had that before.
>
> In any case, I'm very glad to have got my data with minimal loss. And
> to think, all this could have been avoided if I'd just made my array a
> RAID6 when it was first built. Certainly when I have a new fifth disk
> the array will be rebuilt as such.
>
When you get your new disk (or any disk, for that matter), run badblocks -svw
on it. It takes about eight hours on average drive sizes today, but it guards
precisely against the problem you faced. Additionally, the drive will receive
a hefty dose of "break-in", so you know it has performed well under stress for
at least several hours.
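For the record, the invocation would be roughly the following - note that the
-w test is destructive and overwrites the entire disk, so only run it on a
drive with nothing on it (/dev/sdX is a placeholder, and the block size is
optional):

  badblocks -svw -b 4096 /dev/sdX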
* Re: Rebuilding an array with a corrupt disk.
From: Neil Brown @ 2008-06-19 4:57 UTC (permalink / raw)
To: Sean Hildebrand; +Cc: David Greaves, linux-raid
On Sunday June 15, silverwraithii@gmail.com wrote:
>
> Is there no way to get mdadm to allow a certain number of read errors
> from a disk, instead of removing it from the array immediately?
> Manually unmounting, stopping, and re-assembling is somewhat of a
> chore, especially when the system locks access to the array while
> copying, despite the read error.
No, there isn't.
It might make sense to arrange that if the array is flagged as
"read-only" (mdadm -r /dev/mdX), then rather than failing a drive, any
read error is passed up to the filesystem.... I'll put it on my todo
list (which isn't much of a promise).
NeilBrown