* Rebuilding an array with a corrupt disk.

From: Sean Hildebrand @ 2008-06-14  2:22 UTC
To: linux-raid

I had a batch of disks go bad in my array, and have swapped in new disks.

My array is a five-disk RAID5, each disk 750GB. Currently I have four disks
operational within the array, so the array is functionally a RAID0. Rebuilds
have gone fine, except for the latest disk, which I've tried four times.

At 74% into the rebuild, mdadm drops /dev/sdd1 (the spare being synced) and
/dev/sda1 (a synced disk active in the array) due to a read error on
/dev/sda1. Checking smartctl, there have been 43 read errors on the disk,
and they occur in groups.

The array contents have been modified since the removal of the older disks -
so only the four currently-operational disks are synced.

Fscking the array also has issues past the halfway mark - namely, when it
gets to a certain point, /dev/sda1 is dropped from the array and fsck begins
spitting out inode read errors.

Are there any safe ways to remedy my problem? Resizing the array from five
disks to four and then removing /dev/sda1 is impossible, as for the array to
be resized, error-free reads of /dev/sda1 would be necessary, no?
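[As a point of reference, the checks described above boil down to a few
commands. The array name /dev/md0 is an assumption, not taken from the
thread; substitute your own devices.]

  # SMART attributes and the drive's error log for the failing member
  smartctl -a /dev/sda

  # Current rebuild progress and which members are active, spare, or failed
  cat /proc/mdstat
  mdadm --detail /dev/md0

  # Kernel messages for the read errors that caused the drop
  dmesg | grep -i 'sda\|md0'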
* Re: Rebuilding an array with a corrupt disk.

From: David Greaves @ 2008-06-14  6:21 UTC
To: Sean Hildebrand; +Cc: linux-raid

Sean Hildebrand wrote:
> I had a batch of disks go bad in my array, and have swapped in new disks.
>
> My array is a five-disk RAID5, each disk 750GB. Currently I have four disks
> operational within the array, so the array is functionally a RAID0. Rebuilds
> have gone fine, except for the latest disk, which I've tried four times.
>
> At 74% into the rebuild, mdadm drops /dev/sdd1 (the spare being synced) and
> /dev/sda1 (a synced disk active in the array) due to a read error on
> /dev/sda1. Checking smartctl, there have been 43 read errors on the disk,
> and they occur in groups.

You have 2 faulty drives.

Pounding on them will only make things worse.

Get 2 new drives and use ddrescue to copy /dev/sda to a new drive and
replace /dev/sda. Then add your second new drive.

> The array contents have been modified since the removal of the older disks -
> so only the four currently-operational disks are synced.

> Fscking the array also has issues past the halfway mark - namely, when it
> gets to a certain point, /dev/sda1 is dropped from the array and fsck begins
> spitting out inode read errors.

Well, once sda is gone you're reading garbage, if the array even stays up.

> Are there any safe ways to remedy my problem? Resizing the array from five
> disks to four and then removing /dev/sda1 is impossible, as for the array to
> be resized, error-free reads of /dev/sda1 would be necessary, no?

It depends how well ddrescue does at reading /dev/sda.

The sooner you do it the more chance you have.

David
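[A minimal sketch of the copy-and-replace David suggests. The new drive
/dev/sdX, the member names, the array name /dev/md0, and the log path are
all placeholders, not taken from the thread.]

  # Copy everything that still reads cleanly from the failing disk to the
  # new one, keeping a logfile so the run can be resumed or retried later.
  # -f is needed because the output is a block device; -n skips the slow
  # splitting of error areas on the first pass.
  ddrescue -f -n /dev/sda /dev/sdX /root/sda-rescue.log

  # With the clone standing in for sda, assemble the array from the
  # surviving members plus the clone, then add the second new drive so the
  # rebuild runs against disks that no longer throw read errors.
  mdadm --assemble --force /dev/md0 /dev/sdX1 /dev/sdb1 /dev/sdc1 /dev/sde1
  mdadm /dev/md0 --add /dev/sdd1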
* Re: Rebuilding an array with a corrupt disk.

From: Sean Hildebrand @ 2008-06-14 10:54 UTC
To: David Greaves; +Cc: linux-raid

How's that?

The spare (/dev/sdd) seems to be fine. I haven't tried the rebuild with any
other disks, but smartctl doesn't report any issues with /dev/sdd, only
/dev/sda.

Ran ddrescue, managed to recover 559071 MB, but the other 191GB was
thousands upon thousands of read errors.

Now, prior to this, with the array in degraded mode I was able to access and
modify all files I found, but mdadm would always fail on rebuild, and fsck
would always fail and the array would go down roughly 75% through the scan,
presumably when first encountering bad sections of the disk.

ddrescue has not yet finished - it's currently "Splitting error areas..." -
Given that the array has been mountable prior to running ddrescue, is it
safe to assume that once it's done, the partially-cloned /dev/sda1 that
ddrescue has output onto /dev/sdd1 will be mountable as part of the array so
I can assess file loss?

I am unsure of how data is spread through a RAID5. Each disk gets an equal
portion of data, but do drives fill up in linear fashion? I ask this because
whether the array is being rebuilt or fscked, it fails at roughly 75%
through either operation, yet I never had the array go down while I was
using it - only when fsck was running or mdadm was rebuilding.

The array is 2.69TB, with 1.57TB currently free - if the drives do fill
linearly (or even semi-linearly), is it likely that the majority of the
191GB of errors are empty space?

If this isn't making much sense I apologize. I'm sleep deprived and not
enjoying the prospect of losing large quantities of my data.

On Sat, Jun 14, 2008 at 2:21 AM, David Greaves <david@dgreaves.com> wrote:
> <full quote of previous message snipped>
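[If the goal is only to assess file loss once the clone is in place, a
cautious approach is to bring the degraded array up and mount it read-only
so nothing further gets written. Device and mount-point names below are
assumptions.]

  # Assemble with the clone; --run starts the array even though it is
  # degraded, --force accepts members whose event counts no longer match
  mdadm --assemble --force --run /dev/md0 /dev/sdd1 /dev/sdb1 /dev/sdc1 /dev/sde1

  # Mount read-only and walk the tree; files hit by the unrecovered areas
  # show up as I/O errors without anything being modified on the array
  mount -o ro /dev/md0 /mnt/raid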
* Re: Rebuilding an array with a corrupt disk.

From: David Greaves @ 2008-06-14 11:47 UTC
To: Sean Hildebrand; +Cc: linux-raid

Sean Hildebrand wrote:
> How's that?
>
> The spare (/dev/sdd) seems to be fine. I haven't tried the rebuild with any
> other disks, but smartctl doesn't report any issues with /dev/sdd, only
> /dev/sda.

Sorry, misread what you said.
Thought you had errors on both sda and sdd.

> Ran ddrescue, managed to recover 559071 MB, but the other 191GB was
> thousands upon thousands of read errors.

Looking fairly bad then.

> Now, prior to this, with the array in degraded mode I was able to access and
> modify all files I found, but mdadm would always fail on rebuild, and fsck
> would always fail and the array would go down roughly 75% through the scan,
> presumably when first encountering bad sections of the disk.

Sounds reasonable.

> ddrescue has not yet finished - it's currently "Splitting error areas..." -
> Given that the array has been mountable prior to running ddrescue, is it
> safe to assume that once it's done, the partially-cloned /dev/sda1 that
> ddrescue has output onto /dev/sdd1 will be mountable as part of the array so
> I can assess file loss?

It should be.
Additionally, the raid won't die as fsck works.

However, if any of the other disks die then you will have problems.
It's safer to add the spare when it arrives and go to a redundant setup.
Then, if any one drive dies, fsck will continue.

Also note that you *may* recover more data by using ddrescue with a logfile
and re-running it after chilling the failed drive etc. Google...

The longer you persevere with ddrescue, the more data you have a chance of
recovering. Maybe keep at it until the replacement spare arrives. Again -
read up on ddrescue - the list archives had something in the last few weeks.

> I am unsure of how data is spread through a RAID5. Each disk gets an equal
> portion of data, but do drives fill up in linear fashion?

No. The data is spread among the drives. You've lost everything from the
75% mark up on all the drives.

> I ask this because whether the array is being rebuilt or fscked, it fails at
> roughly 75% through either operation, yet I never had the array go down
> while I was using it - only when fsck was running or mdadm was rebuilding.

> The array is 2.69TB, with 1.57TB currently free - if the drives do fill
> linearly (or even semi-linearly), is it likely that the majority of the
> 191GB of errors are empty space?

I don't know how various filesystems use space.
It also depends on previous usage - was the disk ever more full? etc etc.
I do know that with 'normal' filesystems (ext/xfs/etc) the answer is
undefined.
Plus it's 191GB x 4 - so ~800GB of corrupted md device.

Sorry - keep fingers crossed.

> If this isn't making much sense I apologize. I'm sleep deprived and not
> enjoying the prospect of losing large quantities of my data.

Sad, but people do use RAID instead of backups.
RAID is a convenience that helps with uptime in the event of a failure and
reduces the risk of data loss between backups.

Let's see what can be done to get it all back though - you may be lucky.

David
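[The logfile-based re-run David mentions might look roughly like the line
below; because ddrescue records in the logfile which areas failed, a second
run against the same logfile retries only those areas and never re-reads
what was already recovered. Device names and the log path are assumptions.]

  # Later passes (e.g. after letting the drive cool): -r 3 allows three
  # retry passes over each bad area, -d uses direct disc access to bypass
  # the kernel cache, -f is needed because the output is a block device
  ddrescue -f -d -r 3 /dev/sda /dev/sdX /root/sda-rescue.log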
* Re: Rebuilding an array with a corrupt disk.

From: Sean Hildebrand @ 2008-06-15 11:54 UTC
To: David Greaves; +Cc: linux-raid

Unfortunately, ddrescue didn't do me any good. The amount of data taken from
/dev/sda1 and output to /dev/sdd1 was not sufficient to include /dev/sdd1 in
the array. However, since I was only using around a third of the array, it
looks like there wasn't much data in the latter portion of /dev/sda1.

I mounted the array and cp'd data off - ended up having only 11 read errors,
all of which caused the array to kick /dev/sda1 out as a failed disk, and
three of which were severe enough to stop the motherboard from recognizing
the disk until after a reboot.

I got all my data, save the eleven folders that read errors occurred in -
thankfully the data lost isn't terribly important.

Is there no way to get mdadm to allow a certain number of read errors from a
disk, instead of removing it from the array immediately? Manually
unmounting, stopping, and re-assembling is somewhat of a chore, especially
when the system locks access to the array while copying, despite the read
error.

> It also depends on previous usage - was the disk ever more full? etc etc.

To answer that: the drive was brand new. The thing I find odd about this
failure is that it was integrated into the array without issue, meaning the
disk has no issues writing to the bad sectors, just reading. Never had that
before.

In any case, I'm very glad to have got my data back with minimal loss. And
to think, all this could have been avoided if I'd just made my array a RAID6
when it was first built. Certainly when I have a new fifth disk the array
will be rebuilt as such.

On Sat, Jun 14, 2008 at 7:47 AM, David Greaves <david@dgreaves.com> wrote:
> <full quote of previous message snipped>
* Re: Rebuilding an array with a corrupt disk.

From: David Greaves @ 2008-06-15 13:32 UTC
To: Sean Hildebrand; +Cc: linux-raid

Sean Hildebrand wrote:
> I got all my data, save the eleven folders that read errors occurred in -
> thankfully the data lost isn't terribly important.

Good.

> Is there no way to get mdadm to allow a certain number of read errors from a
> disk, instead of removing it from the array immediately? Manually
> unmounting, stopping, and re-assembling is somewhat of a chore, especially
> when the system locks access to the array while copying, despite the read
> error.

Not that I know of.

Since you got your data back it's moot... but:

You couldn't add /dev/sdd1 because the raid superblock is at the end of the
disk - clearly readable though, since it was read at startup.

The next thing would have been to force a recreation of the array using the
new disk.

Anyhow, glad you're sorted

David
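[The "force a recreation" David refers to is usually done with --create and
--assume-clean. It is a last resort: it only works if level, chunk size,
metadata version, and device order exactly match the original array, and the
values below are assumptions for illustration only.]

  # Re-write the array metadata in place without touching the data blocks.
  # 'missing' holds the slot of the disk being left out; getting the member
  # order or the chunk size wrong will scramble the data, so check against
  # old mdadm --examine output before running anything like this.
  mdadm --create /dev/md0 --level=5 --raid-devices=5 --chunk=64 \
        --assume-clean /dev/sdX1 /dev/sdb1 /dev/sdc1 /dev/sde1 missing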
* Re: Rebuilding an array with a corrupt disk.

From: Peter Rabbitson @ 2008-06-15 15:09 UTC
To: Sean Hildebrand; +Cc: David Greaves, linux-raid

Sean Hildebrand wrote:
>
> <snip>
>
> To answer that: the drive was brand new. The thing I find odd about this
> failure is that it was integrated into the array without issue, meaning the
> disk has no issues writing to the bad sectors, just reading. Never had that
> before.
>
> In any case, I'm very glad to have got my data back with minimal loss. And
> to think, all this could have been avoided if I'd just made my array a RAID6
> when it was first built. Certainly when I have a new fifth disk the array
> will be rebuilt as such.

When you get your new disk (or any disk for that matter), run badblocks -svw
on it. It takes about 8 hours on average drive sizes today, but guards
precisely against the problem you faced. Additionally the drive will receive
a hefty dose of "break in", so you know it performed well under stress for
at least several hours.
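[For reference, the burn-in Peter describes amounts to the command below.
The device name is a placeholder, and the test is destructive - it wipes
everything on the drive, so it is only for disks not yet in the array.]

  # -w writes and verifies four test patterns across the whole device,
  # destroying any existing data; -s shows progress, -v reports each bad
  # block found. A larger block size (-b) and blocks-per-pass count (-c)
  # speed the run up considerably.
  badblocks -svw -b 4096 -c 1024 /dev/sdX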
* Re: Rebuilding an array with a corrupt disk.

From: Neil Brown @ 2008-06-19  4:57 UTC
To: Sean Hildebrand; +Cc: David Greaves, linux-raid

On Sunday June 15, silverwraithii@gmail.com wrote:
>
> Is there no way to get mdadm to allow a certain number of read errors from a
> disk, instead of removing it from the array immediately? Manually
> unmounting, stopping, and re-assembling is somewhat of a chore, especially
> when the system locks access to the array while copying, despite the read
> error.

No, there isn't.

It might make sense to arrange that if the array is flagged as "read-only"
(mdadm -r /dev/mdX), then rather than failing a drive, any read error is
passed up to the filesystem....

I'll put it on my todo list (which isn't much of a promise).

NeilBrown