* mismatch_cnt questions
From: Christian Pernegger @ 2007-03-04 11:22 UTC
To: linux-raid

Hello,

these questions apparently got buried in another thread, so here goes again ...

I have a mismatch_cnt of 384 on a 2-way mirror. The box runs 2.6.17.4 and can't really be rebooted or have its kernel updated easily.

1) Where does the mismatch come from? The box hasn't been down since the creation of the array.

2) How much data is 384? Blocks? Chunks? Bytes?

3) Is the "repair" sync action safe to use on the above kernel? Any other methods / additional steps for fixing this?

Regards,

C.

* Re: mismatch_cnt questions
From: Neil Brown @ 2007-03-04 11:50 UTC
To: Christian Pernegger; +Cc: linux-raid

On Sunday March 4, pernegger@gmail.com wrote:
> I have a mismatch_cnt of 384 on a 2-way mirror.
> The box runs 2.6.17.4 and can't really be rebooted or have its kernel
> updated easily
>
> 1) Where does the mismatch come from?
> The box hasn't been down since the creation of the array.

Do you have swap on the mirror at all?

I recently discovered/realised that when 'swap' writes to a raid1 it can end up with different data on the different devices. This is perfectly acceptable, as in that case the data will never be read.

If you don't have swap, then I don't know what is happening.

> 2) How much data is 384? Blocks? Chunks? Bytes?

The unit is 'sectors', but the granularity is about 64K, so '384' means 3 different 64K sections of the device showed an error. One day I might reduce the granularity.

> 3) Is the "repair" sync action safe to use on the above kernel? Any
> other methods / additional steps for fixing this?

"repair" is safe, though it may not be effective. "repair" for raid1 did not work until Jan 26th this year. Before then it was identical in effect to 'check'.

NeilBrown

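For reference, a minimal sketch of driving the check from userspace through the md sysfs interface; /dev/md0 is a placeholder array name and the paths assume a kernel that exposes these files:

   # trigger a read-only consistency check of the example array md0
   echo check > /sys/block/md0/md/sync_action

   # watch progress; the action returns to 'idle' when the pass completes
   cat /proc/mdstat
   cat /sys/block/md0/md/sync_action

   # number of mismatched sectors found by the last check/repair pass
   cat /sys/block/md0/md/mismatch_cnt
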
* Re: mismatch_cnt questions
From: Christian Pernegger @ 2007-03-04 12:01 UTC
To: linux-raid

Hey, that was quick ... thanks!

> > 1) Where does the mismatch come from? The box hasn't been down since
> > the creation of the array.
>
> Do you have swap on the mirror at all?

As a matter of fact I do, /dev/md0_p2 is a swap partition.

> I recently discovered/realised that when 'swap' writes to a raid1 it can
> end up with different data on the different devices. This is perfectly
> acceptable as in that case the data will never be read.

Interesting ... care to elaborate a little?

Would disabling swap, running mkswap again and rerunning check return 0 in this case?

Regards,

C.

* Re: mismatch_cnt questions
From: Neil Brown @ 2007-03-04 22:19 UTC
To: Christian Pernegger; +Cc: linux-raid

On Sunday March 4, pernegger@gmail.com wrote:
> > Do you have swap on the mirror at all?
>
> As a matter of fact I do, /dev/md0_p2 is a swap partition.
>
> Interesting ... care to elaborate a little?

When we write to a raid1, the data is DMAed from memory out to each device independently, so if the memory changes between the two (or more) DMA operations, you will get inconsistency between the devices.

When the data being written is part of a file, the page will still be dirty after the write 'completes', so another write will be issued fairly soon (depending on various VM settings) and so the inconsistency will only be visible for a short time, and you probably won't notice.

If this happens when writing to swap - i.e. if the page is dirtied while the write is happening - then the swap system will just forget that that page was written out. It is obviously still active, so some other page will get swapped out instead. There will never be any attempt to write out the 'correct' data to the device, as that doesn't really mean anything. As more swap activity happens it is quite possible that the inconsistent area of the array will be written again with consistent data, but it is also quite possible that it won't be written for a long time. Long enough that a 'check' will find it.

In any of these cases there is no risk of data corruption, as the inconsistent area of the array will never be read from.

> Would disabling swap, running mkswap again and rerunning check return
> 0 in this case?

Disable swap, write to the entire swap area

   dd if=/dev/zero of=/dev/md0_p2 bs=1M

then mkswap and rerun 'check' and it should return '0'. It did for me.

NeilBrown

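Put together as one hedged sequence (device names are taken from this thread, with md0 assumed to be the underlying array; keep swap disabled until the check has finished so nothing dirties the area mid-rewrite):

   swapoff /dev/md0_p2                          # stop using the swap partition
   dd if=/dev/zero of=/dev/md0_p2 bs=1M         # rewrite the whole area so all mirrors agree
   mkswap /dev/md0_p2                           # recreate the swap signature
   echo check > /sys/block/md0/md/sync_action   # rerun the consistency check
   cat /sys/block/md0/md/mismatch_cnt           # should now read 0
   swapon /dev/md0_p2                           # re-enable swap afterwards
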
* Re: mismatch_cnt questions - how about raid10?
From: Peter Rabbitson @ 2007-03-06 10:04 UTC
To: linux-raid

Neil Brown wrote:
> When we write to a raid1, the data is DMAed from memory out to each
> device independently, so if the memory changes between the two (or
> more) DMA operations, you will get inconsistency between the devices.

Does this apply to raid 10 devices too? And in the case of LVM, if swap is on top of a LV which is part of a VG which has a single PV as the raid array - will this happen as well? Or will the LVM layer take the data once and distribute exact copies of it to the PVs (in this case just the raid), effectively giving the raid array invariable data?

* Re: mismatch_cnt questions - how about raid10?
From: Neil Brown @ 2007-03-06 10:20 UTC
To: Peter Rabbitson; +Cc: linux-raid

On Tuesday March 6, rabbit@rabbit.us wrote:
> Does this apply to raid 10 devices too? And in the case of LVM, if swap
> is on top of a LV which is part of a VG which has a single PV as the
> raid array - will this happen as well?

Yes, it applies to raid10 too.

I don't know the details of the inner workings of LVM, but I doubt it will make a difference. Copying the data in memory is just too costly to do if it can be avoided. With LVM and raid1/10 it can be avoided with no significant cost. With raid4/5/6, not copying into the cache can cause data corruption, so we always copy.

NeilBrown

* Re: mismatch_cnt questions - how about raid10?
From: Peter Rabbitson @ 2007-03-06 10:56 UTC
To: linux-raid

Neil Brown wrote:
> Yes, it applies to raid10 too.
>
> I don't know the details of the inner workings of LVM, but I doubt it
> will make a difference. Copying the data in memory is just too costly
> to do if it can be avoided. With LVM and raid1/10 it can be avoided
> with no significant cost.
> With raid4/5/6, not copying into the cache can cause data corruption.
> So we always copy.

I see. So basically, for those of us who want to run swap on raid 1 or 10 and at the same time want to rely on mismatch_cnt for early problem detection, the only option is to create a separate md device just for the swap. Is this about right?

* Re: mismatch_cnt questions - how about raid10?
From: Justin Piszcz @ 2007-03-06 10:59 UTC
To: Peter Rabbitson; +Cc: linux-raid

On Tue, 6 Mar 2007, Peter Rabbitson wrote:
> I see. So basically, for those of us who want to run swap on raid 1 or 10
> and at the same time want to rely on mismatch_cnt for early problem
> detection, the only option is to create a separate md device just for the
> swap. Is this about right?
[trim]

That is what I do.

/dev/md0 - swap
/dev/md1 - boot
/dev/md2 - root

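A minimal sketch of creating such a dedicated swap mirror, assuming two drives with matching spare partitions (the partition names below are placeholders, not taken from the thread):

   # build a small raid1 used only for swap
   mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2
   mkswap /dev/md0
   swapon /dev/md0
   # any mismatch_cnt reported on the data arrays (md1, md2 above) then stays meaningful
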
* Re: mismatch_cnt questions - how about raid10?
From: Neil Brown @ 2007-03-12 5:35 UTC
To: Peter Rabbitson; +Cc: linux-raid

On Tuesday March 6, rabbit@rabbit.us wrote:
> I see. So basically, for those of us who want to run swap on raid 1 or
> 10 and at the same time want to rely on mismatch_cnt for early problem
> detection, the only option is to create a separate md device just for
> the swap. Is this about right?

Though it is less likely, a regular filesystem could still (I think) genuinely write different data to different devices in a raid1/10.

So relying on mismatch_cnt for early problem detection probably isn't really workable.

And I think that if a drive is returning bad data without signalling an error, then you are very much into the 'late' side of problem detection.

I see the 'check' and 'repair' functions mostly as valuable for the fact that they read every block and will surface latent bad blocks early. If they ever find a discrepancy, then it is either perfectly normal, or something seriously wrong that could have been wrong for a while....

NeilBrown

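Scheduling such a full-read pass is straightforward to script; a rough sketch follows, under the assumption that the arrays are named md0-md2 and a local 'mail' command is available (some distributions ship their own checkarray-style wrapper instead):

   #!/bin/sh
   # periodic scrub: read every block of each array and report mismatches
   for md in md0 md1 md2; do
       echo check > /sys/block/$md/md/sync_action
       # wait for this array's pass to finish before starting the next
       while [ "$(cat /sys/block/$md/md/sync_action)" != "idle" ]; do sleep 60; done
       echo "$md mismatch_cnt: $(cat /sys/block/$md/md/mismatch_cnt)"
   done | mail -s "md check results" root
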
* Re: mismatch_cnt questions - how about raid10?
From: Peter Rabbitson @ 2007-03-12 14:26 UTC
To: linux-raid

Neil Brown wrote:
> Though it is less likely, a regular filesystem could still (I think)
> genuinely write different data to different devices in a raid1/10.
>
> So relying on mismatch_cnt for early problem detection probably isn't
> really workable.
>
> And I think that if a drive is returning bad data without signalling
> an error, then you are very much into the 'late' side of problem
> detection.

I agree with the latter, but my concern is not so much with the cause as with the effect. From past discussion on the list I gather that no special effort is made to determine which chunk to take as 'valid', even though more than 2 logically identical chunks might be present (raid1/10). And you also seem to think that the DMA syndrome might even apply to plain fast-changing filesystems, let alone something with multiple layers (fs on lvm on raid).

So here is my question: how (theoretically) safe is it to use a raid1/10 array for something very disk intensive, e.g. a mail spool? How likely is it that the effect you described above will creep different blocks onto the disks and subsequently return the wrong data to the kernel? Should I look into raid5/6 for this kind of activity, in case both uptime and data integrity are my number one priorities and I am willing to sacrifice performance?

Thank you

* Re: mismatch_cnt questions
From: Eyal Lebedinsky @ 2007-03-04 21:21 UTC
To: Neil Brown; +Cc: Christian Pernegger, linux-raid

Neil Brown wrote:
> On Sunday March 4, pernegger@gmail.com wrote:
>> I have a mismatch_cnt of 384 on a 2-way mirror.
[trim]
>> 3) Is the "repair" sync action safe to use on the above kernel? Any
>> other methods / additional steps for fixing this?
>
> "repair" is safe, though it may not be effective.
> "repair" for raid1 did not work until Jan 26th this year.
> Before then it was identical in effect to 'check'.

How is "repair" safe but not effective? When it finds a mismatch, how does it know which part is correct and which should be fixed (which copy of raid1, or which block in raid5)?

When a disk fails we know what to rewrite, but when we discover a mismatch we do not have this knowledge. It may corrupt the good copy of a raid1.

--
Eyal Lebedinsky (eyal@eyal.emu.id.au) <http://samba.org/eyal/>

* Re: mismatch_cnt questions
From: Neil Brown @ 2007-03-04 22:30 UTC
To: Eyal Lebedinsky; +Cc: Christian Pernegger, linux-raid

On Monday March 5, eyal@eyal.emu.id.au wrote:
> How is "repair" safe but not effective? When it finds a mismatch, how does
> it know which part is correct and which should be fixed (which copy of
> raid1, or which block in raid5)?

It is not 'effective' in that before 26jan2007 it did not actually copy the chosen data on to the other drives, i.e. a 'repair' had the same effect as a 'check', which is 'safe'.

> When a disk fails we know what to rewrite, but when we discover a mismatch
> we do not have this knowledge. It may corrupt the good copy of a raid1.

If a block differs between the different drives in a raid1, then no copy is 'good'. It is possible that one copy is the one you think you want, but you probably wouldn't know by looking at it. The worst situation is to have inconsistent data. If you read and get one value, then later read and get another value, that is really bad.

For raid1 we 'fix' an inconsistency by arbitrarily choosing one copy and writing it over all other copies. For raid5 we assume the data is correct and update the parity.

You might be able to imagine a failure scenario where this produces the 'wrong' result, but I'm confident that in the majority of cases it is as good as any other option. If we had something like ZFS, which tracks checksums for all blocks, and could somehow get that information usefully into the md level, then maybe we could do something better.

I suspect that it would be very rare for raid5 to detect a mismatch during a 'check', and raid1 would only see them when a write was aborted, such as swap can do, and filesystems might do occasionally (e.g. truncate a file that was recently written to).

NeilBrown

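For reference, a sketch of driving a repair the same way the check is driven, assuming /dev/md0 as the array and a kernel recent enough that raid1 'repair' actually rewrites (per the January 2007 note above):

   echo repair > /sys/block/md0/md/sync_action
   # wait until the action drops back to 'idle'
   while [ "$(cat /sys/block/md0/md/sync_action)" != "idle" ]; do sleep 30; done
   # a follow-up check should then report no remaining mismatches
   echo check > /sys/block/md0/md/sync_action
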
* Re: mismatch_cnt questions
From: Eyal Lebedinsky @ 2007-03-05 7:45 UTC
To: Neil Brown; +Cc: Christian Pernegger, linux-raid

Neil Brown wrote:
[trim Q re how resync fixes data]
> For raid1 we 'fix' an inconsistency by arbitrarily choosing one copy
> and writing it over all other copies.
> For raid5 we assume the data is correct and update the parity.

Can raid6 identify the bad block (two parity blocks could allow this if only one block has bad data in a stripe)? If so, does it?

This will surely mean more value for raid6 than just the two-disk-failure protection.

--
Eyal Lebedinsky (eyal@eyal.emu.id.au) <http://samba.org/eyal/>
	attach .zip as .dat

* detecting/correcting _slightly_ flaky disks
From: Michael Stumpf @ 2007-03-05 14:56 UTC
To: linux-raid

I'm trying to assemble an array (raid 5) of 8 older, but not yet old-age, ATA 120 gig disks, but there is intermittent flakiness in one or more of the drives. Symptoms:

* Won't boot sometimes. Even after moving to 2 power supplies and monitoring the amp spikes, sometimes I get "clicking" from 1-2 of the drives after the startup.

* When initiating a SMART long test, so far two of them have:
  + passed 50-75% of the time
  + when "failed", didn't actually fail, just got perpetually stuck at an arbitrary % of test remaining
  + if I cancel and restart the test, often they pass

I've heard clicking from some drives when executing SMART long tests. I'm doing 4 drives at a time, but still can't isolate the culprit, and I don't want to use the laborious "sit and listen by the computer" method to determine which are dying - I would prefer a tool to detect the issue.

I know there's a problem with one or more because my issues with my primary array disappeared the minute I used LVM to remove these devices (and upgrade to some larger/newer ones).

Two questions:

1) Is it smartest to isolate which drives are clicking and chuck them into the wood chipper, given the circumstances?

2) Are there tools that are designed to determine if a drive is fit for duty? dd_rescue et al. seem focused on saving a dying drive; spinrite seems to be controversial black-magic marketing, etc. I could try the manufacturer-shipped tools, but given their black-box nature I have no idea how much (or little) is being done by their tests. What do you folks recommend?

Thanks in advance.
--Michael Stumpf

* Re: detecting/correcting _slightly_ flaky disks
From: Justin Piszcz @ 2007-03-05 15:09 UTC
To: Michael Stumpf; +Cc: linux-raid

On Mon, 5 Mar 2007, Michael Stumpf wrote:
> 2) Are there tools that are designed to determine if a drive is fit for
> duty? dd_rescue et al. seem focused on saving a dying drive; spinrite
> seems to be controversial black-magic marketing, etc. I could try the
> manufacturer-shipped tools, but given their black-box nature I have no
> idea how much (or little) is being done by their tests. What do you
> folks recommend?
[trim]

This is what I use:

799] What is the best way to verify a hard drive has no bad blocks?

   /usr/bin/time badblocks -b 512 -s -v -w /dev/hdg

Note, this will wipe anything out on the drive. There is also a non-destructive write mode; check the manpage for badblocks(8).

This operation usually takes 12 hours or so on a 400GB drive. If this passes and the short+long tests pass without error, the drive is probably OK for the time being.

Also, what does smartctl -a /dev/hda for each of your drives show?

Justin.

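A hedged sketch of the gentler variants mentioned above (the device name /dev/hdg is carried over from the example; -n is badblocks' non-destructive read-write mode and is considerably slower than the destructive -w run):

   # non-destructive read-write test (keeps existing data, still exercises every block)
   badblocks -n -s -v /dev/hdg

   # kick off the drive's own self-test, then read the log once it finishes
   smartctl -t long /dev/hdg        # or -t short for a quick pass
   smartctl -l selftest /dev/hdg
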
* Re: detecting/correcting _slightly_ flaky disks
From: Michael Stumpf @ 2007-03-05 17:01 UTC
To: Justin Piszcz; +Cc: linux-raid

This is the drive I think is most suspect. What isn't obvious, because it isn't listed in the self-test log, is that between #1 and #2 there was an aborted, hung test. The #4 short test that was aborted was also a hung test that I eventually aborted manually -- I heard clicking from drives at that time, though I can't swear it was from this drive.

I'm not sure I fully understand the nuances of this report. If anything jumps out at you, I'd appreciate a tip on how you read it. (To me, it looks mostly healthy.)

> Also, what does smartctl -a /dev/hda for each of your drives show?
>
> Justin.

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar SE family
Device Model:     WDC WD1200JB-75CRA0
Serial Number:    WD-WMA8C3115683
Firmware Version: 16.06V76
User Capacity:    120,000,000,000 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   5
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Mon Mar 05 10:52:05 2007 CAST
SMART support is: Available - device has SMART capability.
                  Enabled status cached by OS, trying SMART RETURN STATUS cmd.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x85) Offline data collection activity
                                        was aborted by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever been run.
Total time to complete Offline
data collection:                 (4680) seconds.
Offline data collection
capabilities:                    (0x3b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        No General Purpose Logging support.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  87) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   146   098   021    Pre-fail  Always       -       3491
  4 Start_Stop_Count        0x0032   100   100   040    Old_age   Always       -       399
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   070   070   000    Old_age   Always       -       22147
 10 Spin_Retry_Count        0x0013   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0013   100   100   051    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       397
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0012   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                    Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error   00%        299              -
# 2  Extended offline    Interrupted (host reset)  50%        279              -
# 3  Short offline       Completed without error   00%        279              -
# 4  Short offline       Aborted by host           80%        279              -
# 5  Extended offline    Completed without error   00%        102              -
# 6  Extended offline    Completed without error   00%        1026             -
# 7  Extended offline    Completed without error   00%        859              -
# 8  Extended offline    Completed without error   00%        692              -
# 9  Extended offline    Completed without error   00%        525              -
#10  Extended offline    Completed without error   00%        380              -
#11  Extended offline    Completed without error   00%        370              -

Device does not support Selective Self Tests/Logging

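As a quick, hedged triage aid, the handful of attributes that most often flag impending failure in a report like the one above can be pulled out with a one-liner (the device name is an example):

   smartctl -A /dev/hdg | egrep 'Reallocated_Sector|Reallocated_Event|Current_Pending|Offline_Uncorrectable|UDMA_CRC'

Non-zero raw values in those rows are usually worth more attention than the overall PASSED verdict.
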
* Re: detecting/correcting _slightly_ flaky disks
From: Justin Piszcz @ 2007-03-05 17:11 UTC
To: Michael Stumpf; +Cc: linux-raid

Besides having been run for a long time, I don't see anything strange with this drive.

Justin.

On Mon, 5 Mar 2007, Michael Stumpf wrote:
> This is the drive I think is most suspect. What isn't obvious, because it
> isn't listed in the self-test log, is that between #1 and #2 there was an
> aborted, hung test. The #4 short test that was aborted was also a hung
> test that I eventually aborted manually -- I heard clicking from drives
> at that time, though I can't swear it was from this drive.
>
> I'm not sure I fully understand the nuances of this report. If anything
> jumps out at you, I'd appreciate a tip on how you read it. (To me, it
> looks mostly healthy.)
[trim - full smartctl output quoted]

* Re: detecting/correcting _slightly_ flaky disks
From: Bill Davidsen @ 2007-03-07 0:14 UTC
To: mjstumpf; +Cc: Justin Piszcz, linux-raid

Michael Stumpf wrote:
> This is the drive I think is most suspect. What isn't obvious, because it
> isn't listed in the self-test log, is that between #1 and #2 there was an
> aborted, hung test.
[trim]
> I'm not sure I fully understand the nuances of this report. If anything
> jumps out at you, I'd appreciate a tip on how you read it. (To me, it
> looks mostly healthy.)

For what it's worth, if you are getting hung tests, either your drive or power supply should be redeployed as a paperweight. My opinion...

--
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979

* Re: detecting/correcting _slightly_ flaky disks
From: Michael Stumpf @ 2007-03-07 1:37 UTC
To: Bill Davidsen; +Cc: Justin Piszcz, linux-raid

Bill Davidsen wrote:
> For what it's worth, if you are getting hung tests, either your drive
> or power supply should be redeployed as a paperweight. My opinion...

I don't disagree, but I'd like to find something more concrete or repeatable, especially given that these give an audible click when failing. The problem I'm having is that I can't nail down precisely where the problem is, although your suggestion makes a lot of sense.

After running Justin's suggested badblocks test, I'm kind of disturbed to see that all these drives are passing with flying colors.

Firmware issue? WD had it in the past.

* Re: detecting/correcting _slightly_ flaky disks
From: berk walker @ 2007-03-07 13:57 UTC
To: mjstumpf; +Cc: Bill Davidsen, Justin Piszcz, linux-raid

Michael Stumpf wrote:
> I don't disagree, but I'd like to find something more concrete or
> repeatable, especially given that these give an audible click when
> failing. The problem I'm having is that I can't nail down precisely
> where the problem is, although your suggestion makes a lot of sense.
>
> After running Justin's suggested badblocks test, I'm kind of disturbed
> to see that all these drives are passing with flying colors.
>
> Firmware issue? WD had it in the past.
[trim]

One nice thing: if your cables are OK, and your power is OK, then you can trash the electronics and transplant from a similar drive with bad sectors.

"load head, seek, spindle, unload head" is not a nice thing for the hardware.

b-

* Re: detecting/correcting _slightly_ flaky disks
From: Bill Davidsen @ 2007-03-07 15:01 UTC
To: mjstumpf; +Cc: Justin Piszcz, linux-raid

Michael Stumpf wrote:
> I don't disagree, but I'd like to find something more concrete or
> repeatable, especially given that these give an audible click when
> failing. The problem I'm having is that I can't nail down precisely
> where the problem is, although your suggestion makes a lot of sense.

Well, here's a thought if you are inclined... power up and go into BIOS config mode. That will leave the drives powered but not in use. Now pull the power cable out on one of them. Does the drive make a familiar click as the heads do an emergency park? That's the easiest thing to check which might cause the click.

One thing your SMART output doesn't include is Temp, which might or might not tell you anything. You could try hddtemp, but SMART would probably report it if the sensor was there.

> After running Justin's suggested badblocks test, I'm kind of disturbed
> to see that all these drives are passing with flying colors.
>
> Firmware issue? WD had it in the past.

Certainly you could check for newer firmware, and check to see if all drives have the same level.

--
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979

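Two quick ways to gather that information, as a sketch (the device name is a placeholder; hddtemp only helps if the drive actually exposes a temperature attribute):

   hddtemp /dev/hda                         # report the drive temperature, if available
   smartctl -i /dev/hda | grep -i firmware  # confirm the firmware revision on each drive
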
* Re: mismatch_cnt questions
From: Neil Brown @ 2007-03-05 23:40 UTC
To: Eyal Lebedinsky; +Cc: Christian Pernegger, linux-raid

On Monday March 5, eyal@eyal.emu.id.au wrote:
> Can raid6 identify the bad block (two parity blocks could allow this
> if only one block has bad data in a stripe)? If so, does it?

No, it doesn't.

I guess that maybe it could: rebuild each block in turn based on the xor parity, and then test if the Q-syndrome is satisfied. But I doubt the gain would be worth the pain.

What we really want is drives that store 520 byte sectors so that a checksum can be passed all the way up and down through the stack .... or something like that.

NeilBrown

* Re: mismatch_cnt questions
From: Bill Davidsen @ 2007-03-07 0:22 UTC
To: Neil Brown; +Cc: Eyal Lebedinsky, Christian Pernegger, linux-raid

Neil Brown wrote:
> On Monday March 5, eyal@eyal.emu.id.au wrote:
>> Can raid6 identify the bad block (two parity blocks could allow this
>> if only one block has bad data in a stripe)? If so, does it?
>
> No, it doesn't.
>
> I guess that maybe it could:
>   Rebuild each block in turn based on the xor parity, and then test
>   if the Q-syndrome is satisfied.
> but I doubt the gain would be worth the pain.

What's the value of "I have a drive which returned bad data" vs. "I have a whole array and some part of it returned bad data"? What's the cost of doing that identification, since it need only be done when the data are inconsistent between the drives and give a parity or Q mismatch? It seems easy, given that you are going to read all the pertinent sectors into memory anyway. If the drive can be identified, the data can be rewritten with confidence.

--
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979

* Re: mismatch_cnt questions
From: H. Peter Anvin @ 2007-03-08 6:39 UTC
To: Neil Brown; +Cc: Eyal Lebedinsky, Christian Pernegger, linux-raid

Neil Brown wrote:
> What we really want is drives that store 520 byte sectors so that a
> checksum can be passed all the way up and down through the stack
> .... or something like that.

A lot of SCSI disks have that option, but I believe it's not arbitrary bytes. In particular, the integrity check portion is only 2 bytes, 16 bits.

One option, of course, would be to store, say, 16 sectors/pages/blocks in 17 physical sectors/pages/blocks, where the last one is a packing of some sort of high-powered integrity checks, e.g. SHA-256, or even an ECC block. This would hurt performance substantially, but it would be highly useful for very high data integrity applications.

I will look at the mathematics of trying to do this with RAID-6, but I'm 99% sure RAID-6 isn't sufficient to do it, even with syndrome set recomputation on every read.

-hpa

* Re: mismatch_cnt questions
From: Martin K. Petersen @ 2007-03-08 13:54 UTC
To: H. Peter Anvin; +Cc: Neil Brown, Eyal Lebedinsky, Christian Pernegger, linux-raid

>>>>> "hpa" == H Peter Anvin <hpa@zytor.com> writes:

>> What we really want is drives that store 520 byte sectors so that a
>> checksum can be passed all the way up and down through the stack
>> .... or something like that.

hpa> A lot of SCSI disks have that option, but I believe it's not
hpa> arbitrary bytes. In particular, the integrity check portion is
hpa> only 2 bytes, 16 bits.

It's important to distinguish between drives that support 520 byte sectors and drives that include the Data Integrity Feature, which also uses 520 byte sectors.

Most regular SCSI drives can be formatted with 520 byte sectors, and a lot of disk arrays use the extra space to store an internal checksum. The downside to 520 byte sectors is that it makes buffer management a pain, as 512 bytes of data is followed by 8 bytes of protection data. That sucks when writing - say - a 4KB block, because your scatterlist becomes long and twisted, having to interleave data and protection data every sector.

The Data Integrity Feature also uses 520 byte sectors. The difference is that the format of the 8 bytes is well defined, and that both initiator and target are capable of verifying the integrity of an I/O. It is correct that the CRC is only 16 bits.

DIF is strictly between HBA and disk. I'm lobbying HBA vendors to expose it to the OS so we can use it. I'm also lobbying to get them to allow us to submit the data and the protection data in separate scatterlists so we don't have to do the interleaving at the OS level.

hpa> One option, of course, would be to store, say, 16
hpa> sectors/pages/blocks in 17 physical sectors/pages/blocks, where
hpa> the last one is a packing of some sort of high-powered integrity
hpa> checks, e.g. SHA-256, or even an ECC block.

A while ago I tinkered with something like that. I actually cheated and stored the checking data in a different partition on the same drive. It was a pretty simple test using my DIF code (i.e. 8 bytes per sector).

I wanted to see how badly the extra seeks would affect us. The results weren't too discouraging, but I decided I liked the ZFS approach better (having the checksum in the fs parent block, which you'll be reading anyway).

--
Martin K. Petersen      Oracle Linux Engineering

* Re: mismatch_cnt questions
From: Bill Davidsen @ 2007-03-09 2:00 UTC
To: Martin K. Petersen; +Cc: H. Peter Anvin, Neil Brown, Eyal Lebedinsky, Christian Pernegger, linux-raid

Martin K. Petersen wrote:
> The Data Integrity Feature also uses 520 byte sectors. The difference
> is that the format of the 8 bytes is well defined, and that both
> initiator and target are capable of verifying the integrity of an I/O.
> It is correct that the CRC is only 16 bits.

When last I looked at Hamming codes, and that would be 1989 or 1990, I believe I learned that the number of Hamming bits needed to cover N data bits was 1+log2(N), which for 512 bytes would be 1+12, and fits into a 16 bit field nicely. I don't know that I would go that way - fix any one-bit error, detect any two-bit error - rather than a CRC, which gives me only one chance in 64k of an undetected data error, but I find it interesting.

I also looked at fire codes, which at the time would still have been a viable topic for a thesis. I remember nothing about how they worked whatsoever.

--
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979

* Re: mismatch_cnt questions
From: H. Peter Anvin @ 2007-03-09 4:20 UTC
To: Bill Davidsen; +Cc: Martin K. Petersen, Neil Brown, Eyal Lebedinsky, Christian Pernegger, linux-raid

Bill Davidsen wrote:
> When last I looked at Hamming codes, and that would be 1989 or 1990, I
> believe I learned that the number of Hamming bits needed to cover N data
> bits was 1+log2(N), which for 512 bytes would be 1+12, and fits into a
> 16 bit field nicely. I don't know that I would go that way - fix any
> one-bit error, detect any two-bit error - rather than a CRC, which gives
> me only one chance in 64k of an undetected data error, but I find it
> interesting.

A Hamming code across the bytes of a sector is pretty darn pointless, since that's not a typical failure pattern.

-hpa

* Re: mismatch_cnt questions
From: Bill Davidsen @ 2007-03-09 5:20 UTC
To: H. Peter Anvin; +Cc: Martin K. Petersen, Neil Brown, Eyal Lebedinsky, Christian Pernegger, linux-raid

H. Peter Anvin wrote:
> A Hamming code across the bytes of a sector is pretty darn pointless,
> since that's not a typical failure pattern.

I just thought it was perhaps one of those little-known facts that meaningful ECC could fit in 16 bits. I mentioned that I wouldn't go that way, mainly because it would be less effective at catching multibit errors. This was a "fun fact" for all those folks who missed Hamming codes in their education, because they are old tech.

--
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979

* Re: mismatch_cnt questions
From: H. Peter Anvin @ 2007-03-08 6:34 UTC
To: Eyal Lebedinsky; +Cc: Neil Brown, Christian Pernegger, linux-raid

Eyal Lebedinsky wrote:
> Can raid6 identify the bad block (two parity blocks could allow this
> if only one block has bad data in a stripe)? If so, does it?
>
> This will surely mean more value for raid6 than just the
> two-disk-failure protection.

No. It's not mathematically possible.

-hpa

* Re: mismatch_cnt questions
From: H. Peter Anvin @ 2007-03-08 7:00 UTC
To: H. Peter Anvin; +Cc: Eyal Lebedinsky, Neil Brown, Christian Pernegger, linux-raid

H. Peter Anvin wrote:
> Eyal Lebedinsky wrote:
>> Can raid6 identify the bad block (two parity blocks could allow this
>> if only one block has bad data in a stripe)? If so, does it?
>
> No. It's not mathematically possible.

Okay, I've thought about it, and I got it wrong the first time (off-the-cuff misapplication of the pigeonhole principle.) It apparently *is* possible (for notation and algebra rules, see my paper):

Let's assume we know exactly one of the data (Dn) drives is corrupt (ignoring the case of P or Q corruption for now.) That means instead of Dn we have a corrupt value, Xn. Note that which data drive is corrupt (n) is not known.

We compute P' and Q' as the computed values over the corrupt set.

   P+P' = Dn+Xn
   Q+Q' = g^n Dn + g^n Xn          g = {02}
   Q+Q' = g^n (Dn+Xn)

By assumption, Dn != Xn, so P+P' = Dn+Xn != {00}. g^n is *never* {00}, so Q+Q' = g^n (Dn+Xn) != {00}.

   (Q+Q')/(P+P') = [g^n (Dn+Xn)]/(Dn+Xn) = g^n

Since n is known to be in the range [0,255), we thus have:

   n = log_g((Q+Q')/(P+P'))

... which is a well-defined relation.

For the case where either the P or the Q drive is corrupt (and the data drives are all good), this is easily detected by the fact that if P is the corrupt drive, Q+Q' = {00}; similarly, if Q is the corrupt drive, P+P' = {00}. Obviously, if P+P' = Q+Q' = {00}, then as far as RAID-6 can discover, there is no corruption in the drive set.

So, yes, RAID-6 *can* detect single drive corruption, and even tell you which drive it is, if you're willing to compute a full syndrome set (P', Q') on every read (as well as on every write.)

Note: RAID-6 cannot detect 2-drive corruption, unless of course the corruption is in different byte positions. If multiple corresponding byte positions are corrupt, then the algorithm above will generally point you to a completely innocent drive.

-hpa

* Re: mismatch_cnt questions 2007-03-08 7:00 ` H. Peter Anvin @ 2007-03-08 8:21 ` H. Peter Anvin 2007-03-13 9:58 ` Andre Noll 0 siblings, 1 reply; 36+ messages in thread From: H. Peter Anvin @ 2007-03-08 8:21 UTC (permalink / raw) To: H. Peter Anvin Cc: Eyal Lebedinsky, Neil Brown, Christian Pernegger, linux-raid I have just updated the paper at: http://www.kernel.org/pub/linux/kernel/people/hpa/raid6.pdf ... with this information (in slightly different notation and with a bit more detail.) -hpa ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: mismatch_cnt questions 2007-03-08 8:21 ` H. Peter Anvin @ 2007-03-13 9:58 ` Andre Noll 2007-03-13 23:46 ` H. Peter Anvin 0 siblings, 1 reply; 36+ messages in thread From: Andre Noll @ 2007-03-13 9:58 UTC (permalink / raw) To: H. Peter Anvin Cc: Eyal Lebedinsky, Neil Brown, Christian Pernegger, linux-raid On 00:21, H. Peter Anvin wrote: > I have just updated the paper at: > > http://www.kernel.org/pub/linux/kernel/people/hpa/raid6.pdf > > ... with this information (in slightly different notation and with a bit > more detail.) There's a typo in the new section: s/By assumption, X_z != D_n/By assumption, X_z != D_z/ Regards Andre -- The only person who always got his work done by Friday was Robinson Crusoe ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: mismatch_cnt questions 2007-03-13 9:58 ` Andre Noll @ 2007-03-13 23:46 ` H. Peter Anvin 0 siblings, 0 replies; 36+ messages in thread From: H. Peter Anvin @ 2007-03-13 23:46 UTC (permalink / raw) To: Andre Noll; +Cc: Eyal Lebedinsky, Neil Brown, Christian Pernegger, linux-raid Andre Noll wrote: > On 00:21, H. Peter Anvin wrote: >> I have just updated the paper at: >> >> http://www.kernel.org/pub/linux/kernel/people/hpa/raid6.pdf >> >> ... with this information (in slightly different notation and with a bit >> more detail.) > > There's a typo in the new section: > > s/By assumption, X_z != D_n/By assumption, X_z != D_z/ > Thanks, fixed. -hpa ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: mismatch_cnt questions 2007-03-04 22:30 ` Neil Brown 2007-03-05 7:45 ` Eyal Lebedinsky @ 2007-03-06 6:27 ` Paul Davidson 1 sibling, 0 replies; 36+ messages in thread From: Paul Davidson @ 2007-03-06 6:27 UTC (permalink / raw) To: Neil Brown; +Cc: linux-raid Hi Neil, I've been following this thread with interest and I have a few questions. Neil Brown wrote: > On Monday March 5, eyal@eyal.emu.id.au wrote: > >>Neil Brown wrote: > >>When a disk fails we know what to rewrite, but when we discover a mismatch >>we do not have this knowledge. It may corrupt the good copy of a raid1. > > If a block differs between the different drives in a raid1, then no > copy is 'good'. It is possible that one copy is the one you think you > want, but you probably wouldn't know by looking at it. > The worst situation is to have inconsistent data. If you read and get > one value, then later read and get another value, that is really bad. > > For raid1 we 'fix' an inconsistency by arbitrarily choosing one copy > and writing it over all other copies. > For raid5 we assume the data is correct and update the parity. Wouldn't it be better to signal an error rather than potentially corrupt data - or perhaps this already happens? Does the above only refer to a 'repair' action? I'm worrying here about silent data corruption that gets onto my backup tapes. If an error was (is?) signaled by the raid system during the backup and could be tracked to the file being copied at the time, it would allow recovery of the data from a prior backup. If raid remains silent, the corrupted data eventually gets copied onto my entire backup rotation. Can you comment on this? FWIW, my 600GB raid5 array shows mismatch_cnt of 24 when I 'check' it - that machine has hung up on occasion. Cheers, Paul ^ permalink raw reply [flat|nested] 36+ messages in thread
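As an aside on the backup worry above: a pre-backup scrub is easy to script against md's sysfs files. The sketch below is only an illustration and makes assumptions - it relies on /sys/block/<array>/md/sync_action and mismatch_cnt being present, "md0" is just an example array name, and it only ever requests 'check', never 'repair', so it rewrites nothing.

# Run a 'check' pass and refuse to proceed with a backup if mismatches are found.
import sys
import time

def scrub_and_report(md="md0"):
    base = "/sys/block/%s/md" % md
    with open(base + "/sync_action", "w") as f:
        f.write("check")                  # count mismatches only; no rewriting
    while True:
        with open(base + "/sync_action") as f:
            if f.read().strip() == "idle":
                break                     # the check pass has finished
        time.sleep(30)
    with open(base + "/mismatch_cnt") as f:
        return int(f.read())              # units are sectors, as noted earlier in the thread

if __name__ == "__main__":
    count = scrub_and_report(sys.argv[1] if len(sys.argv) > 1 else "md0")
    if count:
        print("WARNING: mismatch_cnt = %d - investigate before backing up" % count)
        sys.exit(1)
    print("array looks consistent")

A wrapper like this, run as root before the backup job, at least turns a silent mismatch into a loud one; it cannot say which copy or which file is affected, which is exactly the limitation discussed in the rest of the thread.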
* Re: mismatch_cnt questions 2007-03-04 11:50 ` Neil Brown 2007-03-04 12:01 ` Christian Pernegger 2007-03-04 21:21 ` mismatch_cnt questions Eyal Lebedinsky @ 2008-05-12 11:16 ` Bas van Schaik 2008-05-12 14:31 ` Justin Piszcz 2 siblings, 1 reply; 36+ messages in thread From: Bas van Schaik @ 2008-05-12 11:16 UTC (permalink / raw) To: Neil Brown; +Cc: linux-raid Neil Brown wrote: > On Sunday March 4, pernegger@gmail.com wrote: > > (...) >> 3) Is the "repair" sync action safe to use on the above kernel? Any >> other methods / additional steps for fixing this? >> > > "repair" is safe, though it may not be effective. > "repair" for raid1 was did not work until Jan 26th this year. > Before then it was identical in effect to 'check'. > Sorry to dig up such an old thread, but I'd like to know since when (which kernel version) "repair" (recompute parity, assuming data is consistent) for raid5 is effective? -- Bas ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: mismatch_cnt questions 2008-05-12 11:16 ` Bas van Schaik @ 2008-05-12 14:31 ` Justin Piszcz 0 siblings, 0 replies; 36+ messages in thread From: Justin Piszcz @ 2008-05-12 14:31 UTC (permalink / raw) To: Bas van Schaik; +Cc: Neil Brown, linux-raid On Mon, 12 May 2008, Bas van Schaik wrote: > Neil Brown wrote: >> On Sunday March 4, pernegger@gmail.com wrote: >> >> (...) >>> 3) Is the "repair" sync action safe to use on the above kernel? Any >>> other methods / additional steps for fixing this? >>> >> >> "repair" is safe, though it may not be effective. >> "repair" for raid1 was did not work until Jan 26th this year. >> Before then it was identical in effect to 'check'. >> > Sorry to dig up such an old thread, but I'd like to know since when > (which kernel version) "repair" (recompute parity, assuming data is > consistent) for raid5 is effective? Good question. ^ permalink raw reply [flat|nested] 36+ messages in thread
end of thread, other threads: [~2008-05-12 14:31 UTC | newest]

Thread overview: 36+ messages
2007-03-04 11:22 mismatch_cnt questions Christian Pernegger
2007-03-04 11:50 ` Neil Brown
2007-03-04 12:01 ` Christian Pernegger
2007-03-04 22:19 ` Neil Brown
2007-03-06 10:04 ` mismatch_cnt questions - how about raid10? Peter Rabbitson
2007-03-06 10:20 ` Neil Brown
2007-03-06 10:56 ` Peter Rabbitson
2007-03-06 10:59 ` Justin Piszcz
2007-03-12 5:35 ` Neil Brown
2007-03-12 14:26 ` Peter Rabbitson
2007-03-04 21:21 ` mismatch_cnt questions Eyal Lebedinsky
2007-03-04 22:30 ` Neil Brown
2007-03-05 7:45 ` Eyal Lebedinsky
2007-03-05 14:56 ` detecting/correcting _slightly_ flaky disks Michael Stumpf
2007-03-05 15:09 ` Justin Piszcz
2007-03-05 17:01 ` Michael Stumpf
2007-03-05 17:11 ` Justin Piszcz
2007-03-07 0:14 ` Bill Davidsen
2007-03-07 1:37 ` Michael Stumpf
2007-03-07 13:57 ` berk walker
2007-03-07 15:01 ` Bill Davidsen
2007-03-05 23:40 ` mismatch_cnt questions Neil Brown
2007-03-07 0:22 ` Bill Davidsen
2007-03-08 6:39 ` H. Peter Anvin
2007-03-08 13:54 ` Martin K. Petersen
2007-03-09 2:00 ` Bill Davidsen
2007-03-09 4:20 ` H. Peter Anvin
2007-03-09 5:20 ` Bill Davidsen
2007-03-08 6:34 ` H. Peter Anvin
2007-03-08 7:00 ` H. Peter Anvin
2007-03-08 8:21 ` H. Peter Anvin
2007-03-13 9:58 ` Andre Noll
2007-03-13 23:46 ` H. Peter Anvin
2007-03-06 6:27 ` Paul Davidson
2008-05-12 11:16 ` Bas van Schaik
2008-05-12 14:31 ` Justin Piszcz