* Help understanding the root cause of a member dropping out of a RAID 1 set.
From: Simon Jackson @ 2009-08-13 8:44 UTC
To: linux-raid@vger.kernel.org

I am running RAID1 partitions on some systems, and a few times I have seen
a RAID set become degraded because a member has failed out of the md device.
Looking at the /var/log/messages file, I have seen output similar to that
below.

Can anyone help me decode what actually happened here?

Thanks,
Simon.

2009-08-11T06:21:04-07:00 Metro-1 kernel: [556568.670377] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
2009-08-11T06:21:04-07:00 Metro-1 kernel: [556568.670477] ata1: hard resetting link
2009-08-11T06:21:08-07:00 Metro-1 kernel: [556573.122562] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
2009-08-11T06:21:08-07:00 Metro-1 kernel: [556573.259057] ata1.00: configured for UDMA/133
2009-08-11T06:21:08-07:00 Metro-1 kernel: [556573.348168] ata1.01: configured for UDMA/133
2009-08-11T06:21:08-07:00 Metro-1 kernel: [556573.348168] md: super_written gets error=-5, uptodate=0
2009-08-11T06:21:08-07:00 Metro-1 kernel: [556573.348168] raid1: Operation continuing on 1 devices.
2009-08-11T06:21:08-07:00 Metro-1 kernel: [556573.348168] ata1: EH complete

* Re: Help understanding the root cause of a member dropping out of a RAID 1 set.
From: Billy Crook @ 2009-08-13 16:13 UTC
To: Simon Jackson; +Cc: linux-raid@vger.kernel.org

On Thu, Aug 13, 2009 at 03:44, Simon Jackson<sjackson@bluearc.com> wrote:
>
> I am running RAID1 partitions on some systems and a few times I have seen a raid set become degraded as a member has failed out of the md device. Looking at the /var/log/message file I have seen output similar to below:
>
> Can anyone help me decode what actually happened here.
>
> Thanks Simon.
>
> 2009-08-11T06:21:04-07:00 Metro-1 kernel: [556568.670377] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

the hard drive didn't respond to an ata command

> 2009-08-11T06:21:04-07:00 Metro-1 kernel: [556568.670477] ata1: hard resetting link

kernel tells hdd controller to reset link. (Sometimes this gets
"frozen" hard drives to respond again.)

> 2009-08-11T06:21:08-07:00 Metro-1 kernel: [556573.122562] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> 2009-08-11T06:21:08-07:00 Metro-1 kernel: [556573.259057] ata1.00: configured for UDMA/133
> 2009-08-11T06:21:08-07:00 Metro-1 kernel: [556573.348168] ata1.01: configured for UDMA/133

link has been reset.

> 2009-08-11T06:21:08-07:00 Metro-1 kernel: [556573.348168] md: super_written gets error=-5, uptodate=0

mdraid notices. says oh craps.

> 2009-08-11T06:21:08-07:00 Metro-1 kernel: [556573.348168] raid1: Operation continuing on 1 devices.

mdraid marks the component that encountered the error failed, and
keeps on keeping on

> 2009-08-11T06:21:08-07:00 Metro-1 kernel: [556573.348168] ata1: EH complete

ata reset (of link, and subsequently drive) is complete.
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

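The line-by-line reading of the log in the reply above can be sketched as a small classifier. This is only an illustration: the stage labels and the function names `classify` and `timeline` are made up for this sketch, and the patterns are modelled on nothing more than the eight log lines quoted in this thread, so other kernels or drivers may word these messages differently. For context, `error=-5` is a negated errno value, and errno 5 is EIO (I/O error) on Linux.

```python
import re

# Patterns modelled on the log lines quoted in this thread.
# The labels are hypothetical names chosen for this sketch.
STAGES = [
    (re.compile(r"Emask 0x[0-9a-f]+ \(timeout\)"), "command timeout"),
    (re.compile(r"ata\d+: hard resetting link"), "link reset started"),
    (re.compile(r"ata\d+: SATA link up"), "link back up"),
    (re.compile(r"ata\d+\.\d+: configured for"), "device reconfigured"),
    (re.compile(r"md: super_written gets error=-?\d+"), "md superblock write failed"),
    (re.compile(r"raid1: Operation continuing on \d+ devices"), "array degraded"),
    (re.compile(r"ata\d+: EH complete"), "error handling finished"),
]


def classify(line):
    """Return the stage label for one log line, or None if nothing matches."""
    for pattern, label in STAGES:
        if pattern.search(line):
            return label
    return None


def timeline(lines):
    """Return the stage labels for all recognised lines, in log order."""
    labels = (classify(line) for line in lines)
    return [label for label in labels if label is not None]
```

Feeding the quoted excerpt through `timeline` gives the same story as the reply: a timed-out command, a link reset, the devices coming back, and md failing the member because its superblock write had already returned an error.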
* Re: Help understanding the root cause of a member dropping out of a RAID 1 set.
From: John Robinson @ 2009-08-13 16:26 UTC
To: linux-raid@vger.kernel.org

On Thu, 13 August, 2009 5:13 pm, Billy Crook wrote:
> On Thu, Aug 13, 2009 at 03:44, Simon Jackson<sjackson@bluearc.com> wrote:
[...]
>> 2009-08-11T06:21:08-07:00 Metro-1 kernel: [556573.348168] md:
>> super_written gets error=-5, uptodate=0
>
> mdraid notices. says oh craps.
>
>> 2009-08-11T06:21:08-07:00 Metro-1 kernel: [556573.348168] raid1:
>> Operation continuing on 1 devices.
>
> mdraid marks the component that encountered the error failed, and
> keeps on keeping on
>
>> 2009-08-11T06:21:08-07:00 Metro-1 kernel: [556573.348168] ata1: EH
>> complete
>
> ata reset (of link, and subsequently drive) is complete.

Can or could md be made or configured to try re-adding a device if this
sort of thing happens? After all, a stray cosmic ray or whatever perhaps
shouldn't make one lose redundancy if the drive's actually OK?

Cheers,

John.

* Re: Help understanding the root cause of a member dropping out of a RAID 1 set.
From: Paweł Brodacki @ 2009-08-14 13:09 UTC
To: linux-raid@vger.kernel.org

2009/8/13 John Robinson <john.robinson@anonymous.org.uk>:

> Can or could md be made or configured to try re-adding a device if this
> sort of thing happens? After all, a stray cosmic ray or whatever perhaps
> shouldn't make one lose redundancy if the drive's actually OK?
>
> Cheers,
>
> John.
>

I think that from the coding point of view md probably could. The more
important thing is if it should. The only hard fact is that there was
an error while accessing the device. md has no way of telling if it
was just a freak accident, or the drive is unreliable from now on.
Therefore it does the one safe thing and says "I won't trust you
anymore.". If a human being knows better, the said being is free to
re-add the drive.

Personally I'd hate having a suspicious drive being auto-added in hope
it will rebuild and function properly.

Because such an option could seem tempting but could and would cause
loss of reliability I'd expect bad publicity if it was actually added.

Just my 2c.

Regards,
Paweł

* Re: Help understanding the root cause of a member dropping out of a RAID 1 set.
From: John Robinson @ 2009-08-14 17:07 UTC
To: Paweł Brodacki; +Cc: linux-raid@vger.kernel.org

On 14/08/2009 14:09, Paweł Brodacki wrote:
> 2009/8/13 John Robinson <john.robinson@anonymous.org.uk>:
>
>> Can or could md be made or configured to try re-adding a device if this
>> sort of thing happens? After all, a stray cosmic ray or whatever perhaps
>> shouldn't make one lose redundancy if the drive's actually OK?
>
> I think that from the coding point of view md probably could. The more
> important thing is if it should. The only hard fact is that there was
> an error while accessing the device. md has no way of telling if it
> was just a freak accident, or the drive is unreliable from now on.

Ah well, perhaps we need to give md a way of knowing the difference
between a transient error (that has been recovered from) and a more
serious error.

> Therefore it does the one safe thing and says "I won't trust you
> anymore.". If a human being knows better, the said being is free to
> re-add the drive.
>
> Personally I'd hate having a suspicious drive being auto-added in hope
> it will rebuild and function properly.

I wouldn't want it to be the default behaviour, but I'd like the option
to configure things that way. I'd want the number of auto-re-adds
configurable too.

> Because such an option could seem tempting but could and would cause
> loss of reliability I'd expect bad publicity if it was actually added.

But it could cause improvements in reliability too. If the cable on
drive A is hit by cosmic rays, the drive is taken out of the array, but
the drive's actually still fine, then drive B fails before the operator
has re-added drive A, the array goes down when it didn't need to.

What is the operator's most likely response to seeing the SATA bus
reset? She's going to re-add the drive assuming it was a transient
error. If we could make this happen automatically, we could close a
window when the array's more vulnerable. I wouldn't suggest we do it
silently; it gets logged, notified etc. just like the drive being taken
out of the array would be.

Cheers,

John.

* Re: Help understanding the root cause of a member dropping out of a RAID 1 set.
From: Richard Scobie @ 2009-08-14 20:56 UTC
To: John Robinson; +Cc: Paweł Brodacki, linux-raid@vger.kernel.org

John Robinson wrote:

> What is the operator's most likely response to seeing the SATA bus
> reset? She's going to re-add the drive assuming it was a transient
> error. If we could make this happen automatically, we could close a

I'd like to think a better response would be to use smartctl on the
drive to examine it for signs of internal errors first...

Regards,

Richard

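The "check with smartctl first" step suggested above could look roughly like the following. This is a sketch, not a complete health check: it only reads the overall-health verdict that `smartctl -H` prints for ATA drives, it assumes smartmontools is installed and the caller has the privileges to query the device, and the function names are made up here. A PASSED verdict does not prove a drive is sound, so attributes and the drive's error log (`smartctl -a`) are still worth reading by hand.

```python
import re
import subprocess

# Line printed by `smartctl -H` for ATA drives; SCSI/SAS drives word it differently.
HEALTH_RE = re.compile(r"SMART overall-health self-assessment test result:\s*(\S+)")


def smartctl_health_command(device):
    """Build the command line for an overall-health query on `device`."""
    return ["smartctl", "-H", device]


def health_passed(smartctl_output):
    """Return True for PASSED, False for any other verdict, None if no verdict line."""
    match = HEALTH_RE.search(smartctl_output)
    if match is None:
        return None
    return match.group(1) == "PASSED"


def check_drive(device):
    """Run smartctl on `device` and interpret its output.

    smartctl's exit status is a bit mask rather than a plain success flag,
    so the return code is deliberately not treated as an error here.
    """
    result = subprocess.run(
        smartctl_health_command(device),
        capture_output=True,
        text=True,
        check=False,
    )
    return health_passed(result.stdout)
```

Treating `None` (no verdict found) the same as a failure is the cautious choice if this is ever used to gate a re-add.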
* Re: Help understanding the root cause of a member dropping out of a RAID 1 set.
From: Robin Hill @ 2009-08-14 13:21 UTC
To: linux-raid@vger.kernel.org

On Thu Aug 13, 2009 at 05:26:39PM +0100, John Robinson wrote:

> On Thu, 13 August, 2009 5:13 pm, Billy Crook wrote:
> > On Thu, Aug 13, 2009 at 03:44, Simon Jackson<sjackson@bluearc.com> wrote:
> [...]
> >> 2009-08-11T06:21:08-07:00 Metro-1 kernel: [556573.348168] md:
> >> super_written gets error=-5, uptodate=0
> >
> > mdraid notices. says oh craps.
> >
> >> 2009-08-11T06:21:08-07:00 Metro-1 kernel: [556573.348168] raid1:
> >> Operation continuing on 1 devices.
> >
> > mdraid marks the component that encountered the error failed, and
> > keeps on keeping on
> >
> >> 2009-08-11T06:21:08-07:00 Metro-1 kernel: [556573.348168] ata1: EH
> >> complete
> >
> > ata reset (of link, and subsequently drive) is complete.
>
> Can or could md be made or configured to try re-adding a device if this
> sort of thing happens? After all, a stray cosmic ray or whatever perhaps
> shouldn't make one lose redundancy if the drive's actually OK?
>
If you want to do this, it should be doable via the PROGRAM option in
mdadm.conf (using standard mdadm calls). As has been pointed out
elsewhere though, doing so automatically can be rather a risky option.

Cheers,
    Robin
-- 
     ___
    ( ' }     |       Robin Hill        <robin@robinhill.me.uk> |
   / / )      | Little Jim says ....                            |
  // !!       |      "He fallen in de water !!"                 |

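A handler for the PROGRAM route mentioned above might be sketched as follows. When `mdadm --monitor` runs the configured program, it passes the event name, the md device, and (for events that concern one) the component device. Everything else here is an assumption made for illustration: the names `plan_readd` and `main`, the `MAX_READDS` limit (standing in for the "configurable number of auto-re-adds" asked for earlier in the thread), and the choice to only print the commands rather than run them. Whether `--re-add` succeeds, or a full resync is needed instead, depends on the array's setup, and the risk raised in this thread of trusting a suspect drive still applies.

```python
import sys

# Hypothetical policy knob: how many automatic re-adds to allow per component.
MAX_READDS = 1


def plan_readd(event, md_device, component, readds_so_far, max_readds=MAX_READDS):
    """Return the mdadm commands to try for this event, or [] to do nothing.

    Only a "Fail" event with a named component is acted on, and only while
    the re-add budget for that component has not been used up.
    """
    if event != "Fail" or component is None:
        return []
    if readds_so_far >= max_readds:
        return []
    return [
        ["mdadm", md_device, "--remove", component],   # drop the faulty member
        ["mdadm", md_device, "--re-add", component],   # then try to bring it back
    ]


def main(argv):
    """Entry point for use as the PROGRAM target: argv = [event, md, component?]."""
    event, md_device = argv[0], argv[1]
    component = argv[2] if len(argv) > 2 else None
    # Assumption: a real handler would persist a per-component counter;
    # this sketch always starts from zero and only prints its plan.
    for command in plan_readd(event, md_device, component, readds_so_far=0):
        print("would run:", " ".join(command))


if __name__ == "__main__" and len(sys.argv) >= 3:
    main(sys.argv[1:])
```

Keeping the decision in a pure function (`plan_readd`) makes the policy easy to test without touching a real array, and leaves room to add further checks before any command is actually executed.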