* Trouble adding disk to degraded array
From: Nicholas Ipsen(Sephiroth_VII) @ 2013-01-09 17:21 UTC
To: linux-raid

I recently had mdadm mark a disk in my RAID5 array as faulty. As it
was within warranty, I returned it to the manufacturer, and have now
installed a new drive. However, when I try to add it, recovery fails
about halfway through, with the newly added drive being marked as a
spare, and one of my other drives marked as faulty!

I seem to have full access to my data when assembling the array
without the new disk using --force, and e2fsck reports no problems
with the filesystem.

What is happening here?
* Re: Trouble adding disk to degraded array
From: Phil Turmel @ 2013-01-09 17:55 UTC
To: Nicholas Ipsen(Sephiroth_VII); +Cc: linux-raid

On 01/09/2013 12:21 PM, Nicholas Ipsen(Sephiroth_VII) wrote:
> I recently had mdadm mark a disk in my RAID5 array as faulty. As it
> was within warranty, I returned it to the manufacturer, and have now
> installed a new drive. However, when I try to add it, recovery fails
> about halfway through, with the newly added drive being marked as a
> spare, and one of my other drives marked as faulty!
>
> I seem to have full access to my data when assembling the array
> without the new disk using --force, and e2fsck reports no problems
> with the filesystem.
>
> What is happening here?

You haven't offered a great deal of information here, so I'll
speculate: an unused sector on one of your original drives has become
unreadable (per most drive specs, this occurs naturally about once
every 12TB read). Since rebuilding an array involves computing parity
for every stripe, the unused sector is read and triggers an
unrecoverable read error (URE). Since the rebuild is incomplete, mdadm
has no way to regenerate this sector from another source, and doesn't
know it isn't used, so the drive is kicked out of the array. You now
have a double-degraded raid5, which cannot continue operating.

If you post the output of dmesg, "mdadm -D /dev/mdX", and "mdadm -E
/dev/sd[a-z]" (the latter with the appropriate member devices), we can
be more specific.

BTW, this exact scenario is why raid6 is so popular, and why weekly
scrubbing is vital.

It's also possible that you are experiencing the side effects of an
error timeout mismatch between your drives (defaults vary) and the
Linux driver stack (default 30s). The drive timeout must be less than
the driver timeout, or good drives will eventually be kicked out of
your array. Enterprise drives default to 7 seconds. Desktop drives all
default to more than 60 seconds, and it seems most will retry for up
to 120 seconds.

Cheap desktop drives cannot change their timeout. For those, you must
raise the driver timeout with:

  echo 120 >/sys/block/sdX/device/timeout

Better desktop drives will allow you to set a 7-second drive timeout
with:

  smartctl -l scterc,70,70 /dev/sdX

Either setting must be reapplied on every boot and after any drive
hot-swap.

HTH,

Phil
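(For readers applying this advice: the two commands above have to be
reapplied on every boot. A minimal sketch of a startup script that
tries SCT ERC first and falls back to raising the driver timeout is
shown below; the drive list and the grep test for success are
assumptions to adapt to your own hardware, not a tested recipe.)

  #!/bin/sh
  # Prefer a 7-second drive error-recovery timeout (SCT ERC); if the
  # drive cannot do that, raise the kernel driver timeout instead.
  for dev in sda sdb sdc sdd sde; do                 # assumed members
      if smartctl -l scterc,70,70 /dev/$dev | grep -q 'seconds'; then
          echo "$dev: SCT ERC set to 7.0 seconds"
      else
          echo 120 > /sys/block/$dev/device/timeout  # driver fallback
          echo "$dev: driver timeout raised to 120 seconds"
      fi
  done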
* Re: Trouble adding disk to degraded array
From: Nicholas Ipsen(Sephiroth_VII) @ 2013-01-09 21:18 UTC
To: linux-raid

[-- Attachment #1: Type: text/plain, Size: 2614 bytes --]

Hello Phil, thank you for your prompt reply. It's the first time I've
done any serious debugging work on mdadm, so please excuse my
inadequacies. I've attached the files you requested. If you could
please look through them and offer your thoughts, it'd be most
appreciated.

Nicholas Ipsen

On 9 January 2013 18:55, Phil Turmel <philip@turmel.org> wrote:
> On 01/09/2013 12:21 PM, Nicholas Ipsen(Sephiroth_VII) wrote:
>> I recently had mdadm mark a disk in my RAID5 array as faulty. As it
>> was within warranty, I returned it to the manufacturer, and have now
>> installed a new drive. However, when I try to add it, recovery fails
>> about halfway through, with the newly added drive being marked as a
>> spare, and one of my other drives marked as faulty!
>>
>> I seem to have full access to my data when assembling the array
>> without the new disk using --force, and e2fsck reports no problems
>> with the filesystem.
>>
>> What is happening here?
>
> You haven't offered a great deal of information here, so I'll
> speculate: an unused sector on one of your original drives has become
> unreadable (per most drive specs, this occurs naturally about once
> every 12TB read). Since rebuilding an array involves computing parity
> for every stripe, the unused sector is read and triggers an
> unrecoverable read error (URE). Since the rebuild is incomplete,
> mdadm has no way to regenerate this sector from another source, and
> doesn't know it isn't used, so the drive is kicked out of the array.
> You now have a double-degraded raid5, which cannot continue
> operating.
>
> If you post the output of dmesg, "mdadm -D /dev/mdX", and "mdadm -E
> /dev/sd[a-z]" (the latter with the appropriate member devices), we
> can be more specific.
>
> BTW, this exact scenario is why raid6 is so popular, and why weekly
> scrubbing is vital.
>
> It's also possible that you are experiencing the side effects of an
> error timeout mismatch between your drives (defaults vary) and the
> Linux driver stack (default 30s). The drive timeout must be less than
> the driver timeout, or good drives will eventually be kicked out of
> your array. Enterprise drives default to 7 seconds. Desktop drives
> all default to more than 60 seconds, and it seems most will retry for
> up to 120 seconds.
>
> Cheap desktop drives cannot change their timeout. For those, you must
> raise the driver timeout with:
>
>   echo 120 >/sys/block/sdX/device/timeout
>
> Better desktop drives will allow you to set a 7-second drive timeout
> with:
>
>   smartctl -l scterc,70,70 /dev/sdX
>
> Either setting must be reapplied on every boot and after any drive
> hot-swap.
>
> HTH,
>
> Phil

[-- Attachment #2: as requested.tar.gz --]
[-- Type: application/x-gzip, Size: 20480 bytes --]
* Re: Trouble adding disk to degraded array
From: Phil Turmel @ 2013-01-09 21:54 UTC
To: Nicholas Ipsen(Sephiroth_VII); +Cc: linux-raid

Hi Nicholas,

[Top-posting fixed. Please don't do that.]

On 01/09/2013 04:18 PM, Nicholas Ipsen(Sephiroth_VII) wrote:
> On 9 January 2013 18:55, Phil Turmel <philip@turmel.org> wrote:
>> On 01/09/2013 12:21 PM, Nicholas Ipsen(Sephiroth_VII) wrote:
>>> I recently had mdadm mark a disk in my RAID5 array as faulty. As it
>>> was within warranty, I returned it to the manufacturer, and have
>>> now installed a new drive. However, when I try to add it, recovery
>>> fails about halfway through, with the newly added drive being
>>> marked as a spare, and one of my other drives marked as faulty!
>>>
>>> I seem to have full access to my data when assembling the array
>>> without the new disk using --force, and e2fsck reports no problems
>>> with the filesystem.
>>>
>>> What is happening here?
>>
>> You haven't offered a great deal of information here, so I'll
>> speculate: an unused sector on one of your original drives has
>> become unreadable (per most drive specs, this occurs naturally about
>> once every 12TB read). Since rebuilding an array involves computing
>> parity for every stripe, the unused sector is read and triggers an
>> unrecoverable read error (URE). Since the rebuild is incomplete,
>> mdadm has no way to regenerate this sector from another source, and
>> doesn't know it isn't used, so the drive is kicked out of the array.
>> You now have a double-degraded raid5, which cannot continue
>> operating.
>>
>> If you post the output of dmesg, "mdadm -D /dev/mdX", and "mdadm -E
>> /dev/sd[a-z]" (the latter with the appropriate member devices), we
>> can be more specific.
>>
>> BTW, this exact scenario is why raid6 is so popular, and why weekly
>> scrubbing is vital.
>>
>> It's also possible that you are experiencing the side effects of an
>> error timeout mismatch between your drives (defaults vary) and the
>> Linux driver stack (default 30s). The drive timeout must be less
>> than the driver timeout, or good drives will eventually be kicked
>> out of your array. Enterprise drives default to 7 seconds. Desktop
>> drives all default to more than 60 seconds, and it seems most will
>> retry for up to 120 seconds.
>>
>> Cheap desktop drives cannot change their timeout. For those, you
>> must raise the driver timeout with:
>>
>>   echo 120 >/sys/block/sdX/device/timeout
>>
>> Better desktop drives will allow you to set a 7-second drive timeout
>> with:
>>
>>   smartctl -l scterc,70,70 /dev/sdX
>>
>> Either setting must be reapplied on every boot and after any drive
>> hot-swap.
>
> Hello Phil, thank you for your prompt reply. It's the first time I've
> done any serious debugging work on mdadm, so please excuse my
> inadequacies. I've attached the files you requested. If you could
> please look through them and offer your thoughts, it'd be most
> appreciated.

I've looked at your dmesg, and it confirms that you had an
unrecoverable read error on /dev/sdc1.

The attachment that was supposed to be the output of "mdadm -E
/dev/sd[abcde]1" was something else, but no big deal. (Partition #1 is
the array member, not the whole drive.) (You can put such things
directly in the email in the future--easier to read.)

At this point, you could try to re-write the sectors on /dev/sdc that
are currently unreadable, to get them to relocate. But I'd recommend
using the spare with dd_rescue to copy everything readable from
/dev/sdc. (With the array stopped.) Then you can zero the superblock
on /dev/sdc1, leave the copy in place, and restart the array with the
copy. Then add sdc1 to the array, and let mdadm rebuild (*to* sdc,
instead of *from* sdc).

This plan does depend on the problem with sdc being transient. Many
UREs are, and are fixed by writing over them.

Please show the output of:

  smartctl -x /dev/sdc

Phil
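(To make that sequence concrete, the steps might look roughly like the
sketch below, assuming the array is /dev/md0, the troubled member is
/dev/sdc1, the new spare shows up as /dev/sdf1, and the remaining good
members are sda1, sdb1 and sde1. All of those names are placeholders;
substitute your real devices and double-check each step before running
anything.)

  # Stop the degraded array before copying anything.
  mdadm --stop /dev/md0

  # Copy everything readable from the failing member onto the spare;
  # the mapfile lets ddrescue be re-run later to retry bad areas.
  ddrescue -f /dev/sdc1 /dev/sdf1 /root/sdc1.map

  # Remove the old member's metadata so only the copy is recognized.
  mdadm --zero-superblock /dev/sdc1

  # Restart the array, still degraded, with the copy in place of sdc1.
  mdadm --assemble --force /dev/md0 /dev/sda1 /dev/sdb1 /dev/sde1 \
      /dev/sdf1

  # Finally, add the old device back and let md rebuild *onto* it.
  mdadm --add /dev/md0 /dev/sdc1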
* Re: Trouble adding disk to degraded array
From: Tudor Holton @ 2013-01-09 22:33 UTC
Cc: linux-raid

Having been through this process recently, and agreeing that the
advice above will most likely lead the user to this as the probable
cause, I wonder: is there some way we could more easily alert the user
to this situation? Maybe we could mark the disk with a (URE) tag in
mdstat (my preference) and/or report the error as "md: URE occurred
during read on disk X, aborting synchronization, returning disks
[Y,Z...] to spare"? Tailing logs during synchronization can take
several hours on large arrays (and busy servers) and causes a lot of
wasted time, particularly if you don't know what you're looking for.

Since it first affected me, I have found this kind of question asked
quite regularly on a multitude of tech forums, and a lot of the
responses I came across were incorrect or misleading at best. A lot
more were along the lines of "That happened to me, and after trying to
fix it for days I just wiped the array and started again. Then it
happened to the array again later. mdadm is so unstable!"
Unfortunately we can't avoid people blaming the software, but we can
at least help them diagnose the problem more quickly, which eases
their pain and helps our reputation. :-)

Incidentally, is the state "active faulty" an allowed state? Because
that could be a good way to report it, also.

On 10/01/13 08:18, Nicholas Ipsen(Sephiroth_VII) wrote:
> --snip---
>
> On 9 January 2013 18:55, Phil Turmel <philip@turmel.org> wrote:
>> On 01/09/2013 12:21 PM, Nicholas Ipsen(Sephiroth_VII) wrote:
>>> I recently had mdadm mark a disk in my RAID5 array as faulty. As it
>>> was within warranty, I returned it to the manufacturer, and have
>>> now installed a new drive. However, when I try to add it, recovery
>>> fails about halfway through, with the newly added drive being
>>> marked as a spare, and one of my other drives marked as faulty!
>>>
>>> I seem to have full access to my data when assembling the array
>>> without the new disk using --force, and e2fsck reports no problems
>>> with the filesystem.
>>>
>>> What is happening here?
>>
>> You haven't offered a great deal of information here, so I'll
>> speculate: an unused sector on one of your original drives has
>> become unreadable (per most drive specs, this occurs naturally about
>> once every 12TB read). Since rebuilding an array involves computing
>> parity for every stripe, the unused sector is read and triggers an
>> unrecoverable read error (URE). Since the rebuild is incomplete,
>> mdadm has no way to regenerate this sector from another source, and
>> doesn't know it isn't used, so the drive is kicked out of the array.
>> You now have a double-degraded raid5, which cannot continue
>> operating.
>>
--snip--
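(Until md reports something like that directly, one stopgap, offered
only as a sketch, is to watch the array state and a filtered kernel
log while the resync runs rather than trailing the full log afterward;
the log path and the grep pattern below are guesses that will need
adjusting for your distribution and controller.)

  # One terminal: array state, refreshed every 10 seconds.
  watch -n 10 'cat /proc/mdstat; mdadm --detail /dev/md0 | grep -i state'

  # Another terminal: only the lines that matter during a rebuild
  # (use /var/log/messages instead of syslog on some distributions).
  tail -f /var/log/syslog | grep -iE 'ata[0-9]+|end_request|md/raid'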
* Re: Trouble adding disk to degraded array
From: Nicholas Ipsen @ 2013-01-09 23:47 UTC
To: linux-raid

Thanks Phil, I wrote "mdadm -E /dev/sd[abcde]" instead of "mdadm -E
/dev/sd[abcde]1"... Anyway, I'm currently trying your advice with
dd_rescue; I'll report back when something happens.

Nicholas Ipsen

On 9 January 2013 23:33, Tudor Holton <tudor@smartguide.com.au> wrote:
> Having been through this process recently, and agreeing that the
> advice above will most likely lead the user to this as the probable
> cause, I wonder: is there some way we could more easily alert the
> user to this situation? Maybe we could mark the disk with a (URE) tag
> in mdstat (my preference) and/or report the error as "md: URE
> occurred during read on disk X, aborting synchronization, returning
> disks [Y,Z...] to spare"? Tailing logs during synchronization can
> take several hours on large arrays (and busy servers) and causes a
> lot of wasted time, particularly if you don't know what you're
> looking for.
>
> Since it first affected me, I have found this kind of question asked
> quite regularly on a multitude of tech forums, and a lot of the
> responses I came across were incorrect or misleading at best. A lot
> more were along the lines of "That happened to me, and after trying
> to fix it for days I just wiped the array and started again. Then it
> happened to the array again later. mdadm is so unstable!"
> Unfortunately we can't avoid people blaming the software, but we can
> at least help them diagnose the problem more quickly, which eases
> their pain and helps our reputation. :-)
>
> Incidentally, is the state "active faulty" an allowed state? Because
> that could be a good way to report it, also.
>
> On 10/01/13 08:18, Nicholas Ipsen(Sephiroth_VII) wrote:
>>
>> --snip---
>>
>> On 9 January 2013 18:55, Phil Turmel <philip@turmel.org> wrote:
>>>
>>> On 01/09/2013 12:21 PM, Nicholas Ipsen(Sephiroth_VII) wrote:
>>>>
>>>> I recently had mdadm mark a disk in my RAID5 array as faulty. As
>>>> it was within warranty, I returned it to the manufacturer, and
>>>> have now installed a new drive. However, when I try to add it,
>>>> recovery fails about halfway through, with the newly added drive
>>>> being marked as a spare, and one of my other drives marked as
>>>> faulty!
>>>>
>>>> I seem to have full access to my data when assembling the array
>>>> without the new disk using --force, and e2fsck reports no problems
>>>> with the filesystem.
>>>>
>>>> What is happening here?
>>>
>>> You haven't offered a great deal of information here, so I'll
>>> speculate: an unused sector on one of your original drives has
>>> become unreadable (per most drive specs, this occurs naturally
>>> about once every 12TB read). Since rebuilding an array involves
>>> computing parity for every stripe, the unused sector is read and
>>> triggers an unrecoverable read error (URE). Since the rebuild is
>>> incomplete, mdadm has no way to regenerate this sector from another
>>> source, and doesn't know it isn't used, so the drive is kicked out
>>> of the array. You now have a double-degraded raid5, which cannot
>>> continue operating.
>>>
> --snip--
* Re: Trouble adding disk to degraded array
From: Nicholas Ipsen @ 2013-01-11 13:14 UTC
To: linux-raid

Hello again. I managed to recover most of the data using ddrescue and
various options, but I'm still left with 24576 bytes missing. Of
course, it's probably nothing major in terms of lost files, but I
can't help but worry that it will cause issues when I add the
replacement disk to the array. Do you have any advice regarding this
problem?

Nicholas Ipsen
* Re: Trouble adding disk to degraded array
From: Mikael Abrahamsson @ 2013-01-11 14:07 UTC
To: Nicholas Ipsen; +Cc: linux-raid

On Fri, 11 Jan 2013, Nicholas Ipsen wrote:

> Hello again. I managed to recover most of the data using ddrescue and
> various options, but I'm still left with 24576 bytes missing. Of
> course, it's probably nothing major in terms of lost files, but I
> can't help but worry that it will cause issues when I add the
> replacement disk to the array. Do you have any advice regarding this
> problem?

Force a filesystem check. I don't see any problem for md-raid from
these zeroed 24576 bytes; the only thing that will care is the
filesystem.

If you want to be 100% sure, run a "check" action on the md raid and
make sure you have no mismatches on the md layer.

--
Mikael Abrahamsson    email: swmike@swm.pp.se
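(For reference, the two checks suggested here might be run roughly as
follows, assuming the array is /dev/md0, the filesystem is ext4, and
it is unmounted while fsck runs.)

  # Filesystem-level check, forced even if the superblock looks clean.
  fsck.ext4 -f /dev/md0

  # md-level consistency check; progress is visible in /proc/mdstat.
  echo check > /sys/block/md0/md/sync_action
  cat /proc/mdstat

  # When it completes, a non-zero count here means mismatched stripes.
  cat /sys/block/md0/md/mismatch_cnt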
* Re: Trouble adding disk to degraded array
From: Nicholas Ipsen @ 2013-01-12 0:01 UTC
To: linux-raid

Yikes, I may really have stepped in it with that filesystem check...

I first mounted the filesystem, and everything looked fine after
replacing sdc1 with sdd1. I then ran fsck -f /dev/md0, and it started
to report problems after 10-15 minutes. Here's one of them; they all
said the same thing except for the block number:

  "Error reading block 5767176 (Attempt to read block from filesystem
  resulted in short read) while reading inode and block bitmaps.
  Ignore error<y>?"

I just kept pressing y and then approving a forced rewrite, as I
thought it was just a few blocks, but after approving 40 or so
rewrites I got cold feet, cancelled fsck, and tried to remount md0
through nautilus. But it wouldn't mount. Instead, I got this message:

  Error mounting /dev/md0 at
  /media/ubuntu/be265e78-0c46-41fb-96ba-fda269e74a60: Command-line
  `mount -t "ext4" -o "uhelper=udisks2,nodev,nosuid" "/dev/md0"
  "/media/ubuntu/be265e78-0c46-41fb-96ba-fda269e74a60"' exited with
  non-zero exit status 32: mount: wrong fs type, bad option, bad
  superblock on /dev/md0, missing codepage or helper program, or
  other error
  In some cases useful info is found in syslog - try
  dmesg | tail or so

Naturally, I ran dmesg | tail, and got this:

  [104349.032352] EXT4-fs (md0): can't read group descriptor 378
  [104350.997625] EXT4-fs (md0): can't read group descriptor 250
  [104739.002433] EXT4-fs (md0): can't read group descriptor 250
  [104780.612995] EXT4-fs (md0): can't read group descriptor 250

I then stopped the array and restarted it, which gave me this message:

  mdadm: clearing FAULTY flag for device 1 in /dev/md0 for /dev/sdb1

Note that this is not the device that was giving me trouble before,
nor is it the one I replaced it with.

What have I done wrong here?

Nicholas Ipsen
* Re: Trouble adding disk to degraded array
From: Phil Turmel @ 2013-01-12 0:24 UTC
To: Nicholas Ipsen; +Cc: linux-raid

On 01/11/2013 07:01 PM, Nicholas Ipsen wrote:
> Yikes, I may really have stepped in it with that filesystem check...

[trim /]

> Note that this is not the device that was giving me trouble before,
> nor is it the one I replaced it with.
>
> What have I done wrong here?

Are you also getting errors from the controller/drive? If so, you
might have a hardware problem, like a power supply going bad (writing
draws significantly more power than reading).

Please post your latest complete dmesg.

Phil