* /etc/cron.weekly/99-raid-check
From: Farkas Levente @ 2009-11-30 13:08 UTC
To: CentOS mailing list, linux-raid

Hi,

It's been a few weeks since RHEL/CentOS 5.4 was released, and there has been
a lot of discussion about its new "feature", the weekly RAID partition check.
We have many servers with RAID1 arrays, and I have already tried to configure
them not to send these messages, but without success. I added all of my swap
partitions to SKIP_DEVS (since I read on the linux-kernel list that swap can
legitimately show a non-zero mismatch_cnt, although I still don't understand
why). But even the data partitions (i.e. all RAID1 partitions on all of my
servers) produce this error: their mismatch_cnt is never 0 at the weekend.
As a result, all of my RAID1 partitions are rebuilt during the weekend, and
I don't like it :-(

So my questions are:
- Is this a real bug in the RAID1 code?
- Is it a real bug in the disks running the RAID (which I don't really
  believe, since it affects dozens of servers)?
- Is /etc/cron.weekly/99-raid-check itself wrong in RHEL/CentOS 5.4?

Or what is the problem? Can someone enlighten me? Thanks in advance.

Regards,

--
Levente                               "Si vis pacem para bellum!"
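For reference, the SKIP_DEVS knob mentioned above typically lives in a
sysconfig file consumed by the weekly script rather than being edited in the
script itself. A minimal sketch, assuming the RHEL/CentOS 5.4 layout; the
file path and the example device names are assumptions and should be checked
against whatever the installed 99-raid-check script actually sources:

    # /etc/sysconfig/raid-check   (path assumed; verify what 99-raid-check sources)
    # Space-separated list of md devices the weekly check should skip, e.g.
    # swap-backed arrays that are expected to accumulate a mismatch_cnt:
    SKIP_DEVS="md1 md3"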
[parent not found: <lfarkas@lfarkas.org>]
* Re: /etc/cron.weekly/99-raid-check
From: greg @ 2009-12-02 15:38 UTC
To: Farkas Levente, CentOS mailing list, linux-raid

On Nov 30, 2:08pm, Farkas Levente wrote:
} Subject: /etc/cron.weekly/99-raid-check

Hi Farkas, hope your day is going well. Just thought I would respond for the
edification of others who are troubled by this issue.

> It's been a few weeks since RHEL/CentOS 5.4 was released, and there has
> been a lot of discussion about its new "feature", the weekly RAID
> partition check.
> [...]
> So my questions are:
> - Is this a real bug in the RAID1 code?
> - Is it a real bug in the disks running the RAID (which I don't really
>   believe, since it affects dozens of servers)?
> - Is /etc/cron.weekly/99-raid-check itself wrong in RHEL/CentOS 5.4?
> Or what is the problem? Can someone enlighten me?

It's a combination of what I would consider a misfeature with what MAY BE,
and I stress MAY be, a genuine bug someplace.

The current RAID/IO stack does not 'pin' pages which are destined to be
written out to disk. As a result, the contents of a page may change while
the request to do I/O against it transits the stack down to the disk.

This results in a race condition where one side of a RAID1 mirror gets one
version of the data written to it while the other side of the mirror gets a
different version. In the case of a swap partition this appears to be
harmless. In the case of filesystems there seems to be a general assurance
that it occurs only in uninhabited (unallocated) portions of the filesystem.

The 'check' feature of the MD system, which 99-raid-check uses, reads the
underlying physical devices of a composite RAID device. The mismatch_cnt is
incremented when the contents of mirrored sectors are not identical.

The intersection of all this is problematic now that major distributions
have shipped the raid-check feature. There are probably hundreds if not
thousands of systems reporting what may or may not be false positives with
respect to data corruption.

The current RAID stack has an option to 'repair' a RAID set which has
mismatches. Unfortunately there is no intelligence in this facility: it
arbitrarily picks one of the sectors as 'good' and uses it to overwrite the
contents of the other. I am somewhat reluctant to recommend using this
facility given the issues at hand.

A complicating factor is that the kernel does not report where the
mismatches occur. There appears to be movement underway to add kernel
support for printing the sector locations of the mismatches.

When that feature becomes available, there will be a need for some type of
tool, in the case of RAID1 devices backing filesystems, to assess which
version of the data is correct so that the faulty version can be overwritten
with the correct one.

As an aside, what is really needed is a tool which assesses whether or not
the mismatched sectors actually lie in an inhabited portion of the
filesystem. If not, the 'repair' facility on RAID1 could presumably be run
with no issues, given appropriate coherency/validation checks to confirm
that the sectors are still mismatched and have not, through a race, ended up
in a portion of the filesystem that has since become inhabited.

We see the issue over a large range of production systems running standard
RHEL5 kernels all the way up to recent versions of Fedora. Interestingly,
the mismatch counts are always an exact multiple of 128 on all the systems.

We have also isolated the problem to RAID1, independent of the backing
store. We run geographical mirrors where an initiator is fed from two
separate data-centers and each mirror half is backed by a RAID5 Linux
target. On RAID1 mirrors which show mismatches, the two separate RAID5
backing volumes both report themselves completely consistent.

So that is the situation as I believe it currently stands.

The notion of running the 'check' sync_action is well founded. The issue of
'silent' data corruption is well understood and real. As of a couple of
years ago, the Linux RAID system rewrites any sectors which come up
unreadable during the check, and the disk drive then reallocates the bad
sector from its remapping pool, effectively replacing it. This pays huge
dividends with respect to maintaining healthy RAID farms.

Unfortunately, the reporting of mismatch_cnt values is problematic given the
above issues. I think it is unfortunate that the vendors opted to ship this
checking/reporting while those issues are still unresolved.

> Thanks in advance.
> Regards,
> Levente

Hope the above information is helpful for everyone running into this issue.

Best wishes for a productive remainder of the week to everyone.

Greg

}-- End of excerpt from Farkas Levente

As always,
Dr. G.W. Wettstein, Ph.D.      Enjellic Systems Development, LLC.
4206 N. 19th Ave.              Specializing in information
Fargo, ND 58102                infra-structure development.
PH: 701-281-1686
FAX: 701-281-3949              EMAIL: greg@enjellic.com
------------------------------------------------------------------------------
"Experience is something you don't get until just after you need it."
                                                            -- Olivier
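For concreteness, the 'check' and 'repair' actions described above are
driven through md's sysfs interface and can be exercised by hand outside the
cron job. A minimal sketch (md0 is only an example device name, and per the
caveats above 'repair' blindly copies one mirror half over the other):

    # Start a scrub of /dev/md0 and inspect the result (run as root).
    echo check > /sys/block/md0/md/sync_action
    cat /proc/mdstat                       # shows the check in progress
    cat /sys/block/md0/md/mismatch_cnt     # sectors found to differ; read it
                                           # after the scrub finishes

    # 'repair' rewrites mismatched regions, but with no way to pick the
    # "right" copy; use it only if you accept that behaviour:
    # echo repair > /sys/block/md0/md/sync_action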
* Re: /etc/cron.weekly/99-raid-check
From: CoolCold @ 2009-12-03 9:11 UTC
To: greg
Cc: Farkas Levente, CentOS mailing list, linux-raid

On Wed, Dec 2, 2009 at 6:38 PM, <greg@enjellic.com> wrote:

> The current RAID/IO stack does not 'pin' pages which are destined to be
> written out to disk. As a result, the contents of a page may change while
> the request to do I/O against it transits the stack down to the disk.

Can you write a bit more about "the pages may change"? Who can change the
page contents?

> [remainder of Greg's reply, quoted in full, snipped]

--
Best regards,
[COOLCOLD-RIPN]
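Since the kernel at this point does not say where the mismatches are, one
crude user-space workaround is to compare the RAID1 member devices directly.
This is only a sketch under several assumptions: the device names are
examples, the array should be quiesced (or the filesystem mounted read-only)
so live writes do not show up as differences, with 0.90 metadata the
per-device md superblock near the end of each member will legitimately
differ, and newer metadata formats place the data at an offset so the
comparison would need adjusting:

    # Compare the two halves of a RAID1 pair byte-by-byte (example devices).
    cmp -l /dev/sda2 /dev/sdb2 | head
    # Each output line is: <byte offset> <octal byte from sda2> <octal byte from sdb2>.
    # Dividing the offset by 512 gives the sector; dividing by the filesystem
    # block size (e.g. 4096) gives a block number that can then be checked
    # against the filesystem to see whether the mismatch is in allocated data.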
* Re: /etc/cron.weekly/99-raid-check
From: Robin Hill @ 2009-12-03 9:53 UTC
To: linux-raid

On Thu Dec 03, 2009 at 12:11:29PM +0300, CoolCold wrote:

> > The current RAID/IO stack does not 'pin' pages which are destined to be
> > written out to disk. As a result, the contents of a page may change while
> > the request to do I/O against it transits the stack down to the disk.
>
> Can you write a bit more about "the pages may change"? Who can change the
> page contents?

My understanding of this is:
- An application maps a file to memory.
- It makes some modifications.
- The kernel flags the md layer to write these changes to disk.
- The md layer writes to one copy of the RAID1.
- The application makes some more modifications.
- The md layer writes to the second RAID1 copy (getting different data).
- The kernel flags the md layer to write the new changes.
- The md layer writes both copies.

The second set of writes can go to a different section of the disk, so
you're left with the (now unused) blocks differing between the two disks.
There is no data problem, because the kernel always flags the md layer to
write _after_ the data has changed, and the md layer always writes the data
_after_ the kernel notifies it.

Note: I've not looked at the code for any of this, so I don't _know_ that
this is what's happening, but that's my understanding from what I've read on
here in the past.

HTH,
    Robin

--
     ___
    ( ' }     |   Robin Hill      <robin@robinhill.me.uk>   |
   / / )      |   Little Jim says ....                      |
  // !!       |    "He fallen in de water !!"               |
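A toy illustration of the ordering described above, with two ordinary files
standing in for the mirror halves; this is only an analogy for the sequence
of events, not how md actually issues the writes:

    # 'buffer' plays the role of the in-flight page the application keeps
    # modifying; memberA/memberB stand in for the two RAID1 members.
    printf 'version-1' > buffer
    cp buffer memberA                  # first mirror half written
    printf 'version-2' > buffer        # page changes while the I/O is in flight
    cp buffer memberB                  # second half written with newer contents
    cmp -s memberA memberB || echo "halves differ (like mismatch_cnt > 0)"
    # In the scenario above, the follow-up write of the now-stable page may
    # land in a different block, so the stale, mismatched copies are simply
    # left behind in blocks nothing references.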
* Re: /etc/cron.weekly/99-raid-check
From: Sujit K M @ 2009-12-03 10:49 UTC
To: CoolCold
Cc: greg, Farkas Levente, CentOS mailing list, linux-raid

Kindly follow proper mailing-list etiquette. Do not jump into unrelated
threads without knowing what is being discussed.

On Thu, Dec 3, 2009 at 2:41 PM, CoolCold <coolthecold@gmail.com> wrote:
> [full quote of CoolCold's message, including Greg's reply, snipped]

--
-- Sujit K M
blog(http://kmsujit.blogspot.com/)
End of thread (newest message: 2009-12-03 10:49 UTC)

Thread overview: 5+ messages
2009-11-30 13:08 /etc/cron.weekly/99-raid-check Farkas Levente
[not found] <lfarkas@lfarkas.org>
2009-12-02 15:38 ` /etc/cron.weekly/99-raid-check greg
2009-12-03 9:11 ` /etc/cron.weekly/99-raid-check CoolCold
2009-12-03 9:53 ` /etc/cron.weekly/99-raid-check Robin Hill
2009-12-03 10:49 ` /etc/cron.weekly/99-raid-check Sujit K M