* md devices: Suggestion for in place time and checksum within the RAID
From: Joachim Otahal @ 2010-03-13 23:00 UTC
To: linux-raid

Current Situation in RAID:
If a drive fails silently and is giving out wrong data instead of read
errors, there is no way to detect that corruption (no fun, I had that a
few times already). Even in RAID1 with three drives there is no "two over
three" voting mechanism.

A workaround for that problem would be:
Adding one sector to each chunk to store the time (in nanosecond
resolution) + a CRC or ECC value of the whole stripe, making it possible
to see and handle such errors below the filesystem level.
Time in nanoseconds only to differ between those many writes that
actually happen; it does not really matter how precise the time actually
is, just every stripe update should have a different time value from the
previous update. It would be an easy way to know which chunks are
actually the latest (or which contain correct data in case one out of
three+ chunks has a wrong time upon reading). A random unique ID or
counter could also do the job of the time value if anyone prefers, but I
doubt it would be better, since the collision probability would be
higher.
The use of a CRC or ECC or whatever hash should be obvious; their
existence would make it easy to detect drive degradation, even in a
RAID0 or LINEAR.

Bad side: Adding this might break the on-the-fly RAID expansion
capabilities. A workaround might be using 8K (+ one sector) chunks by
default upon creation, or the need to specify the chunk size on creation
(like 8k + 1 sector) if future expansion capabilities are actually wanted
with RAID0/4/5/6, but that is a different issue anyway.

Question:
Will RAID4/5/6 in the future use the parity upon read too? Currently it
would not detect wrong data reads from the parity chunk, resulting in a
disaster when it is actually needed.

Do those plans already exist and my post was completely useless?

Sorry that I cannot give patches; my last kernel patch + compile was
2.2.26, since then I never compiled a kernel.

Joachim Otahal
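A minimal sketch of what such a per-chunk trailer could look like. The
struct layout, field names, and CRC choice below are illustrative
assumptions only; md has no such on-disk format:

/* Hypothetical trailer stored in the extra sector appended to each chunk.
 * Field names and sizes are illustrative; this is not an md structure. */
#include <stdint.h>
#include <string.h>

struct chunk_trailer {
    uint64_t update_id;   /* ns timestamp or monotonic write counter; only
                             uniqueness between successive updates matters */
    uint32_t stripe_crc;  /* CRC32 over all data chunks of the stripe */
    uint32_t pad;         /* padding; the rest of the 512-byte sector
                             would stay reserved */
};

/* Plain bitwise CRC32 (reflected, polynomial 0xEDB88320). */
static uint32_t crc32_buf(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (uint32_t)-(int32_t)(crc & 1));
    }
    return ~crc;
}

/* On write: stamp every chunk of the stripe with the same id and CRC.
 * On read: a chunk whose trailer disagrees with its peers is suspect. */
static void stamp_stripe(struct chunk_trailer *t, uint64_t id,
                         const uint8_t *stripe, size_t stripe_len)
{
    memset(t, 0, sizeof(*t));
    t->update_id  = id;
    t->stripe_crc = crc32_buf(stripe, stripe_len);
}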
* Re: md devices: Suggestion for in place time and checksum within the RAID
From: Bill Davidsen @ 2010-03-14  0:04 UTC
To: Joachim Otahal; +Cc: linux-raid

Joachim Otahal wrote:
> Current Situation in RAID:
> If a drive fails silently and is giving out wrong data instead of read
> errors, there is no way to detect that corruption (no fun, I had that a
> few times already).

That is almost certainly a hardware issue. The chance of silent bad data
is tiny; the chance of bad hardware messing up the data is much higher.
Often cable issues.

> Even in RAID1 with three drives there is no "two over three" voting
> mechanism.
>
> A workaround for that problem would be:
> Adding one sector to each chunk to store the time (in nanosecond
> resolution) + a CRC or ECC value of the whole stripe, making it possible
> to see and handle such errors below the filesystem level.
> Time in nanoseconds only to differ between those many writes that
> actually happen; it does not really matter how precise the time actually
> is, just every stripe update should have a different time value from the
> previous update.

Unlikely to have meaning; there is so much caching and delay that it
would be inaccurate. A simple monotonic counter of writes would do as
well. And I think you need to do it at a lower level than chunk, like
sector. Have to look at that code again.

> It would be an easy way to know which chunks are actually the latest
> (or which contain correct data in case one out of three+ chunks has a
> wrong time upon reading). A random unique ID or counter could also do
> the job of the time value if anyone prefers, but I doubt it would be
> better, since the collision probability would be higher.

You can only know the time when the buffer is filled; after that you have
write cache, drive cache, and rotational delay. A count does as well and
doesn't depend on the time between CPUs being the same at ns level.

> The use of a CRC or ECC or whatever hash should be obvious; their
> existence would make it easy to detect drive degradation, even in a
> RAID0 or LINEAR.

There is a ton of that in the drive already.

> Bad side: Adding this might break the on-the-fly RAID expansion
> capabilities. A workaround might be using 8K (+ one sector) chunks by
> default upon creation, or the need to specify the chunk size on creation
> (like 8k + 1 sector) if future expansion capabilities are actually
> wanted with RAID0/4/5/6, but that is a different issue anyway.
>
> Question:
> Will RAID4/5/6 in the future use the parity upon read too? Currently it
> would not detect wrong data reads from the parity chunk, resulting in a
> disaster when it is actually needed.
>
> Do those plans already exist and my post was completely useless?
>
> Sorry that I cannot give patches; my last kernel patch + compile was
> 2.2.26, since then I never compiled a kernel.
>
> Joachim Otahal

-- 
Bill Davidsen <davidsen@tmr.com>
  "We can't solve today's problems by using the same thinking we
   used in creating them." - Einstein
* Re: md devices: Suggestion for in place time and checksum within the RAID
From: Joachim Otahal @ 2010-03-14  1:25 UTC
To: Bill Davidsen; +Cc: linux-raid

Bill Davidsen schrieb:
> Joachim Otahal wrote:
>> Current Situation in RAID:
>> If a drive fails silently and is giving out wrong data instead of read
>> errors, there is no way to detect that corruption (no fun, I had that
>> a few times already).
>
> That is almost certainly a hardware issue. The chance of silent bad
> data is tiny; the chance of bad hardware messing up the data is much
> higher. Often cable issues.

In over 20 years (including our customers' drives) I have seen about ten
hard drives of that type. It does indeed not happen often. They were not
cable issues; we replaced the drive with the same type and vendor and
RMA'd the original. It is not vendor specific either; it seems every
vendor has such problematic drives at some point. The last case was just
a few months ago.

>> Even in RAID1 with three drives there is no "two over three" voting
>> mechanism.
>>
>> A workaround for that problem would be:
>> Adding one sector to each chunk to store the time (in nanosecond
>> resolution) + a CRC or ECC value of the whole stripe, making it
>> possible to see and handle such errors below the filesystem level.
>> Time in nanoseconds only to differ between those many writes that
>> actually happen; it does not really matter how precise the time
>> actually is, just every stripe update should have a different time
>> value from the previous update.
>
> Unlikely to have meaning; there is so much caching and delay that it
> would be inaccurate. A simple monotonic counter of writes would do as
> well. And I think you need to do it at a lower level than chunk, like
> sector. Have to look at that code again.

From what I know from the docs: the "stripe" is normally 64k, so the
"chunk" on each drive when using RAID5 with three drives is 32k, smaller
with more drives. At least that is what I am referring to : ). The
filesystem level never sees what is done at the RAID level, not even in
the ZFS implementation on Linux, which was originally designed for such
a case.

>> The use of a CRC or ECC or whatever hash should be obvious; their
>> existence would make it easy to detect drive degradation, even in a
>> RAID0 or LINEAR.
>
> There is a ton of that in the drive already.

That is mainly meant to know whether the stripe is consistent (after a
power failure etc.), and if not, to correct it. Currently that cannot be
detected, especially since the parity is not read in the current
implementation (at least the docs say so!). If it can be reconstructed
using the ECC and/or parity, write the corrected data back silently (if
mounted rw) to get the data consistent again. For a successful silent
correction only one syslog line would be enough; if correction is not
possible it can still go back to the current default behaviour, read
whatever is there, but at least we could _detect_ such inconsistency.
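The silent-correction behaviour described above, as a rough C sketch.
Every type and helper function here is hypothetical, invented only for
illustration; this is not existing md code:

/* Hypothetical read-path policy; none of these types or helpers exist in md. */
struct stripe;                                            /* opaque placeholder  */
extern int  stripe_checksum_ok(struct stripe *s);         /* trailer CRC matches */
extern int  stripe_repair_from_parity(struct stripe *s);  /* 1 if reconstructed  */
extern int  array_writable(struct stripe *s);
extern void writeback_repaired(struct stripe *s);
extern void md_log(const char *msg);

enum verify_result { VERIFY_OK, VERIFY_REPAIRED, VERIFY_SUSPECT };

static enum verify_result verify_stripe_on_read(struct stripe *s)
{
    if (stripe_checksum_ok(s))
        return VERIFY_OK;                       /* common case, no extra work */

    if (stripe_repair_from_parity(s)) {
        if (array_writable(s))
            writeback_repaired(s);              /* silent correction */
        md_log("md: corrected inconsistent stripe");   /* one syslog line */
        return VERIFY_REPAIRED;
    }

    /* Not correctable: return the data as today, but the corruption is
     * at least detected and can be hinted to the next layer. */
    md_log("md: uncorrectable stripe inconsistency detected");
    return VERIFY_SUSPECT;
}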
* Re: md devices: Suggestion for in place time and checksum within the RAID
From: Keld Simonsen @ 2010-03-14 10:20 UTC
To: Joachim Otahal; +Cc: Bill Davidsen, linux-raid

On Sun, Mar 14, 2010 at 02:25:38AM +0100, Joachim Otahal wrote:
> [...]
> That is mainly meant to know whether the stripe is consistent (after a
> power failure etc.), and if not, to correct it. Currently that cannot
> be detected, especially since the parity is not read in the current
> implementation (at least the docs say so!). If it can be reconstructed
> using the ECC and/or parity, write the corrected data back silently (if
> mounted rw) to get the data consistent again. For a successful silent
> correction only one syslog line would be enough; if correction is not
> possible it can still go back to the current default behaviour, read
> whatever is there, but at least we could _detect_ such inconsistency.
> [...]
>> Question:
>> Will RAID4/5/6 in the future use the parity upon read too? Currently
>> it would not detect wrong data reads from the parity chunk, resulting
>> in a disaster when it is actually needed.

Hmm, would that not be detected by a check initiated by cron?

Which data to believe could then be determined according to a number of
techniques, like for a 3-copy array the best 2 out of 3, investigating
the error log of the drives, and relaying the error information to the
file system layer for manual inspection and repair.

I would expect this is not something that occurs frequently, so maybe
once a year for the unlucky or for systems with many disks.

best regards
keld
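The "best 2 out of 3" idea above is, at its core, a per-block majority
vote. A minimal self-contained sketch; the function name and the fixed
copy count are assumptions for illustration:

#include <stddef.h>
#include <string.h>

/* Majority vote over three copies of one block.
 * Returns the index of a copy that agrees with at least one other copy
 * (0 or 1), or -1 if all three differ and manual inspection is needed. */
static int vote_2_of_3(const unsigned char *c0, const unsigned char *c1,
                       const unsigned char *c2, size_t len)
{
    int a01 = memcmp(c0, c1, len) == 0;
    int a02 = memcmp(c0, c2, len) == 0;
    int a12 = memcmp(c1, c2, len) == 0;

    if (a01 || a02)
        return 0;          /* copy 0 matches at least one peer */
    if (a12)
        return 1;          /* copies 1 and 2 agree, copy 0 is the odd one out */
    return -1;             /* three-way disagreement */
}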
* Re: md devices: Suggestion for in place time and checksum within the RAID
From: Joachim Otahal @ 2010-03-14 11:58 UTC
To: Keld Simonsen; +Cc: Bill Davidsen, linux-raid

Keld Simonsen schrieb:
> On Sun, Mar 14, 2010 at 02:25:38AM +0100, Joachim Otahal wrote:
>>>> Question:
>>>> Will RAID4/5/6 in the future use the parity upon read too? Currently
>>>> it would not detect wrong data reads from the parity chunk, resulting
>>>> in a disaster when it is actually needed.
> [...]
> Hmm, would that not be detected by a check initiated by cron?

Debian schedules a monthly check (first Sunday, 00:57), IMHO the best
possible time and frequency: less is dangerous, more is useless. I added
a cronjob that checks every 15 minutes for changes in /proc/mdstat and in
the SMART info (reallocated sector count and the drive-internal error
list only) and emails me if something changed since the previous check.
I use the script because /etc/mdadm/mdadm.conf only takes ONE email
address and requires a local MTA to be installed, and I always uninstall
the local MTA if the machine is not going to be a mail server.

But why not check the parity during normal read operation? Was that a
performance decision? It is not _that_ bad not doing it during normal
operation, since the good dists schedule a regular check, but could it
be controlled by something like
echo "1" > /proc/sys/dev/raid/always_read_parity ?

> Which data to believe could then be determined according to a number of
> techniques, like for a 3-copy array the best 2 out of 3, investigating
> the error log of the drives, and relaying the error information to the
> file system layer for manual inspection and repair.

That is a matter of "believing" and "best guess", not of "knowing" which
copy contains the correct data in redundant array levels; hence the
earlier suggestion to include a timer + ECC (or better) at the RAID
level, so we actually _know_ which is the newest, and we _know_ which
stripe has consistent data. No guessing needed, we can apply crystal
clear rules. My ruleset would be:

  1. use newest time and correct ECC
  2. use newest time and correctable ECC
  3. use any time and correct ECC (hint possible filesystem error to the
     next layer)
  4. use any time and correctable ECC (hint possible filesystem error to
     the next layer)
  5. current implementation: use the data from the active drives ordered
     according to the list in the superblock + hint possible filesystem
     error to the next layer.

A RAID-aware filesystem would be perfect (compare ZFS on Solaris),
eliminating the write-hole problem; doing the checksum at the RAID level
makes it more flexible.

> I would expect this is not something that occurs frequently, so maybe
> once a year for the unlucky or for systems with many disks.

If you care about really important data, corrupting it even once in five
years is too much. Implementing the checksum + timestamp would lift Linux
software RAID to the next level, closer to enterprise storage, where such
techniques are actually in use. At its current level it is very good and
solid, so it is time to get to the next level for long-term archiving.

regards,

Joachim Otahal
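The rule order above can be expressed as a simple ranking over the
redundant copies. A sketch under the assumption that each copy carries
the proposed trailer; the struct and field names are illustrative, not
an md interface:

#include <stdint.h>

/* One redundant copy of a chunk, as seen after a read (illustrative only). */
struct copy_state {
    uint64_t update_time;     /* trailer timestamp/counter, 0 if unreadable */
    int      ecc_correct;     /* 1: ECC verifies as-is */
    int      ecc_correctable; /* 1: ECC mismatched but repairable */
};

/* Pick the copy to return, following the precedence from the mail:
 * newest+correct > newest+correctable > any+correct > any+correctable >
 * fallback (first copy in superblock order). */
static int pick_copy(const struct copy_state *c, int n, uint64_t newest)
{
    int best = -1, best_rank = 0;
    for (int i = 0; i < n; i++) {
        int rank;
        if (c[i].update_time == newest && c[i].ecc_correct)          rank = 5;
        else if (c[i].update_time == newest && c[i].ecc_correctable) rank = 4;
        else if (c[i].ecc_correct)                                   rank = 3;
        else if (c[i].ecc_correctable)                               rank = 2;
        else                                                         rank = 1;
        if (rank > best_rank) { best_rank = rank; best = i; }
    }
    return best;   /* ranks 3 and below should also hint an error upward */
}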
* Re: md devices: Suggestion for in place time and checksum within the RAID
From: Keld Simonsen @ 2010-03-14 13:03 UTC
To: Joachim Otahal; +Cc: Bill Davidsen, linux-raid

On Sun, Mar 14, 2010 at 12:58:50PM +0100, Joachim Otahal wrote:
> Debian schedules a monthly check (first Sunday, 00:57), IMHO the best
> possible time and frequency: less is dangerous, more is useless. I
> added a cronjob that checks every 15 minutes for changes in
> /proc/mdstat and in the SMART info (reallocated sector count and the
> drive-internal error list only) and emails me if something changed
> since the previous check.
> [...]

Interesting! I would like to see your scripts....

> But why not check the parity during normal read operation? Was that a
> performance decision?

I don't know, but I do think it would hurt performance considerably.

> It is not _that_ bad not doing it during normal operation, since the
> good dists schedule a regular check, but could it be controlled by
> something like echo "1" > /proc/sys/dev/raid/always_read_parity ?

Well, I think making an optional check would be fine. I don't know if it
could be done in a way that does not hurt performance, such as being
delayed or running at a lower IO priority.

> That is a matter of "believing" and "best guess", not of "knowing"
> which copy contains the correct data in redundant array levels; hence
> the earlier suggestion to include a timer + ECC (or better) at the RAID
> level, so we actually _know_ which is the newest, and we _know_ which
> stripe has consistent data.
> [...]
> A RAID-aware filesystem would be perfect (compare ZFS on Solaris),
> eliminating the write-hole problem; doing the checksum at the RAID
> level makes it more flexible.

Interesting ideas.

>> I would expect this is not something that occurs frequently, so maybe
>> once a year for the unlucky or for systems with many disks.
>
> If you care about really important data, corrupting it even once in
> five years is too much. Implementing the checksum + timestamp would
> lift Linux software RAID to the next level, closer to enterprise
> storage, where such techniques are actually in use. At its current
> level it is very good and solid, so it is time to get to the next
> level for long-term archiving.

I was not trying to say this is not important, but rather that error
correction could be done by manual intervention, given that it is not so
frequent. Or at least that manual correction should be one of the
implemented ways of addressing it.

best regards
keld
* Re: md devices: Suggestion for in place time and checksum within the RAID
From: Joachim Otahal @ 2010-03-14 14:00 UTC
To: Keld Simonsen; +Cc: Bill Davidsen, linux-raid

Keld Simonsen schrieb:
> On Sun, Mar 14, 2010 at 12:58:50PM +0100, Joachim Otahal wrote:
>> Debian schedules a monthly check (first Sunday, 00:57), IMHO the best
>> possible time and frequency: less is dangerous, more is useless. I
>> added a cronjob that checks every 15 minutes for changes in
>> /proc/mdstat and in the SMART info (reallocated sector count and the
>> drive-internal error list only) and emails me if something changed
>> since the previous check.
>> I use the script because /etc/mdadm/mdadm.conf only takes ONE email
>> address and requires a local MTA to be installed, and I always
>> uninstall the local MTA if the machine is not going to be a mail
>> server.
>
> Interesting! I would like to see your scripts....

sendEmail.pl is from http://caspian.dotconf.net/menu/Software/SendEmail/;
in his latest update he managed to get rid of the TLS and base64-encoding
problems. Here is the unpolished script, in "it does what it should do"
state. The HEALTHFILE variable is changed somewhere in the middle. The
locations are chosen so that the RAID info is sent at every boot and upon
change, the SMART info only when something changes. It is run every 15
minutes from cron. One of my HDDs had a growing reallocated sector count
every two weeks, but it seems to have stabilized now; I can follow that
nicely in my inbox.

#!/bin/sh
# State files, monitored drives, and the mail command (placeholders).
HEALTHFILE="/tmp/healthcheck.mdstat"
HARDDRIVES="/dev/sda /dev/sdb /dev/sdc /dev/sdd"
SENDEMAILCOMMAND="/usr/local/sbin/sendEmail.pl -f <sender> -t <recipient> -cc <recipient> -cc <recipient> -s <smtp-server> -o tls=auto -xu <smtp-user> -xp <smtp-password>"

# Rotate the previous snapshot and take a new copy of /proc/mdstat.
if [ -f ${HEALTHFILE}.1 ] ; then /bin/rm -f ${HEALTHFILE}.1 ; fi
if [ -f ${HEALTHFILE}.0 ] ; then /bin/mv ${HEALTHFILE}.0 ${HEALTHFILE}.1 ; else /usr/bin/touch ${HEALTHFILE}.1 ; fi
/bin/cat /proc/mdstat > ${HEALTHFILE}.0
# Mail the new snapshot only if it differs from the previous one.
/usr/bin/diff ${HEALTHFILE}.0 ${HEALTHFILE}.1 > /dev/null
case "$?" in
        0)
                # no change
        ;;
        1)
                ${SENDEMAILCOMMAND} -u "RAID status" < ${HEALTHFILE}.0
        ;;
esac

# Same rotate-and-diff approach for the SMART data.
HEALTHFILE="/var/log/healthcheck.smartdtl.realloc-sector-count"
if [ -f ${HEALTHFILE}.1 ] ; then /bin/rm -f ${HEALTHFILE}.1 ; fi
if [ -f ${HEALTHFILE}.0 ] ; then /bin/mv ${HEALTHFILE}.0 ${HEALTHFILE}.1 ; else /usr/bin/touch ${HEALTHFILE}.1 ; fi
echo "SMART shot info:" > ${HEALTHFILE}.0
for X in ${HARDDRIVES} ; do
        /bin/echo "${X}" >> ${HEALTHFILE}.0
        /usr/local/sbin/smartctl --all ${X} | /bin/grep -i Reallocated_Sector_Ct >> ${HEALTHFILE}.0
done
/bin/echo "------------------------------------------------------------------------" >> ${HEALTHFILE}.0
/bin/echo "Error Log from drives" >> ${HEALTHFILE}.0
for X in ${HARDDRIVES} ; do
        /bin/echo "${X}" >> ${HEALTHFILE}.0
        /usr/local/sbin/smartctl --all ${X} | /bin/grep -i -A 999 "SMART Error Log" | grep -v "without error" >> ${HEALTHFILE}.0
        /bin/echo "------------------------------------------------------------------------" >> ${HEALTHFILE}.0
done
/usr/bin/diff ${HEALTHFILE}.0 ${HEALTHFILE}.1 > /dev/null
case "$?" in
        0)
                # no change
        ;;
        1)
                ${SENDEMAILCOMMAND} -u "SMART Status, Reallocated Sector Count" < ${HEALTHFILE}.0
        ;;
esac

>> But why not check the parity during normal read operation? Was that a
>> performance decision?
>
> I don't know, but I do think it would hurt performance considerably.

If http://www.accs.com/p_and_p/RAID/LinuxRAID.html is still current info:
it will hurt performance due to the "left-symmetric" default layout, but
I expect the real-world difference to be small.

>> It is not _that_ bad not doing it during normal operation, since the
>> good dists schedule a regular check, but could it be controlled by
>> something like echo "1" > /proc/sys/dev/raid/always_read_parity ?
>
> Well, I think making an optional check would be fine. I don't know if
> it could be done in a way that does not hurt performance, such as being
> delayed or running at a lower IO priority.

I doubt delaying would help the performance; in asymmetric layouts it is
the fifth HD doing a read, in symmetric layouts the next chunk to read is
directly after the parity chunk.

kind regards,

Joachim Otahal
* Re: md devices: Suggestion for in place time and checksum within the RAID
From: Joachim Otahal @ 2010-03-15 21:28 UTC
To: Keld Simonsen; +Cc: Bill Davidsen, linux-raid

Keld Simonsen schrieb:
> Interesting! I would like to see your scripts....

I did not realize how OLD that script was until I saw it today; I could
not leave it that way. Here is a revised and less embarrassing version,
easy to extend to bang you with emails on a RAID error too, but South
Park is on TV now:

#!/bin/sh
# State files, monitored drives, and the mail command (placeholder).
HEALTHFILE="/tmp/healthcheck.mdstat"
HARDDRIVES="/dev/sda /dev/sdb /dev/sdc /dev/sdd"
SENDEMAILCOMMAND="/usr/local/sbin/sendEmail.pl <commandline-here>"

# Rotate the previous /proc/mdstat snapshot and mail only on change.
[ -f ${HEALTHFILE}.1 ] && /bin/rm -f ${HEALTHFILE}.1
[ -f ${HEALTHFILE}.0 ] && /bin/mv ${HEALTHFILE}.0 ${HEALTHFILE}.1
/usr/bin/touch ${HEALTHFILE}.1
/bin/cat /proc/mdstat > ${HEALTHFILE}.0
/usr/bin/diff ${HEALTHFILE}.0 ${HEALTHFILE}.1 > /dev/null
if [ $? = 1 ] ; then
        ${SENDEMAILCOMMAND} -u "RAID Status" < ${HEALTHFILE}.0
fi

# Same approach for the SMART data; force a mail whenever errors are logged.
HEALTHFILE="/var/log/healthcheck.smartctl"
[ -f ${HEALTHFILE}.1 ] && /bin/rm -f ${HEALTHFILE}.1
[ -f ${HEALTHFILE}.0 ] && /bin/mv ${HEALTHFILE}.0 ${HEALTHFILE}.1
/usr/bin/touch ${HEALTHFILE}.1
echo "SMART info:" > ${HEALTHFILE}.0
EMAILSUBJECT="SMART Status, Reallocated Sector Count"
for X in ${HARDDRIVES} ; do
        Y="`/usr/local/sbin/smartctl --all ${X} | /bin/grep -i Reallocated_Sector_Ct`"
        if [ "${Y}" != "" ] ; then
                /bin/echo "${X} ${Y}" >> ${HEALTHFILE}.0
                if [ "`/usr/local/sbin/smartctl --all ${X} | /bin/grep -o 'No Errors Logged'`" = "No Errors Logged" ] ; then
                        /bin/echo "${X} No Errors Logged" >> ${HEALTHFILE}.0
                else
                        # Errors logged: change the subject and empty the old
                        # snapshot so the diff below always triggers a mail.
                        EMAILSUBJECT="SMART ERRORS LOGGED, Reallocated Sector Count"
                        [ -f ${HEALTHFILE}.1 ] && /bin/rm -f ${HEALTHFILE}.1
                        /usr/bin/touch ${HEALTHFILE}.1
                        /bin/echo "------------------------------------------------------------------------" >> ${HEALTHFILE}.0
                        /bin/echo "${X}" >> ${HEALTHFILE}.0
                        /usr/local/sbin/smartctl --all ${X} | /bin/grep -i -A 999 "SMART Error Log" >> ${HEALTHFILE}.0
                        /bin/echo "------------------------------------------------------------------------" >> ${HEALTHFILE}.0
                fi
        fi
done
/usr/bin/diff ${HEALTHFILE}.0 ${HEALTHFILE}.1 > /dev/null
if [ $? = 1 ] ; then
        ${SENDEMAILCOMMAND} -u "${EMAILSUBJECT}" < ${HEALTHFILE}.0
fi

regards,

Joachim Otahal