* RFC: detection of silent corruption via ATA long sector reads @ 2008-12-26 21:44 Greg Freemyer 2008-12-26 22:15 ` Robert Hancock 2008-12-28 22:26 ` Mark Lord 0 siblings, 2 replies; 19+ messages in thread From: Greg Freemyer @ 2008-12-26 21:44 UTC (permalink / raw) To: Redeeman; +Cc: piergiorgio.sartor, neilb, linux-raid, LKML, Mark Lord All, On the mdraid list, there was a recent thread about using raid functionality to detect / repair silent corruption. The issues brought up were that a lot of silent data corruption occurs when cables, controllers, power supplies, ram, cache, etc. goes bad. It made me think about another option for detecting silent corruption I have not seen discussed, but maybe I missed it. Aiui, the ATA spec allows for the reading of a long sector as well as the normal 512 byte sector. When you get a long sector you also get the CRC (or whatever checksum data there is on the disk that allows the drive itself to detect media errors). I don't have any idea how easy or hard it would be to do, but I would like to see the entire block subsystem enhanced to optionally allow long sector reads to be used in a "paranoid" fashion. Effectively it would be: 1) Read long sector from drive: verify CRC in kernel. This tests most everything on the i/o path. 2) maintain CRC type information in block subsystem. Verify no corruption just before handing off to userspace. This would potentially identify CPU/cache/RAM failures. Mark Lord has implemented long sector reads via hdparm. Mark can you comment on the feasibility of this idea? Thanks Greg -- Greg Freemyer Litigation Triage Solutions Specialist http://www.linkedin.com/in/gregfreemyer First 99 Days Litigation White Paper - http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf The Norcross Group The Intersection of Evidence & Technology http://www.norcrossgroup.com ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: RFC: detection of silent corruption via ATA long sector reads 2008-12-26 21:44 RFC: detection of silent corruption via ATA long sector reads Greg Freemyer @ 2008-12-26 22:15 ` Robert Hancock 2008-12-27 0:32 ` David Lethe 2008-12-28 22:26 ` Mark Lord 1 sibling, 1 reply; 19+ messages in thread From: Robert Hancock @ 2008-12-26 22:15 UTC (permalink / raw) To: linux-raid; +Cc: linux-kernel Greg Freemyer wrote: > All, > > On the mdraid list, there was a recent thread about using raid > functionality to detect / repair silent corruption. > > The issues brought up were that a lot of silent data corruption occurs > when cables, controllers, power supplies, ram, cache, etc. goes bad. > > It made me think about another option for detecting silent corruption > I have not seen discussed, but maybe I missed it. > > Aiui, the ATA spec allows for the reading of a long sector as well as > the normal 512 byte sector. When you get a long sector you also get > the CRC (or whatever checksum data there is on the disk that allows > the drive itself to detect media errors). > > I don't have any idea how easy or hard it would be to do, but I would > like to see the entire block subsystem enhanced to optionally allow > long sector reads to be used in a "paranoid" fashion. > > Effectively it would be: > > 1) Read long sector from drive: verify CRC in kernel. This tests > most everything on the i/o path. > > 2) maintain CRC type information in block subsystem. Verify no > corruption just before handing off to userspace. This would > potentially identify CPU/cache/RAM failures. Even if the drive supports those commands the problem is the CRC/ECC data is in a vendor-specific format, so it couldn't be processed generically. ^ permalink raw reply [flat|nested] 19+ messages in thread
* RE: Re: RFC: detection of silent corruption via ATA long sector reads 2008-12-26 22:15 ` Robert Hancock @ 2008-12-27 0:32 ` David Lethe 0 siblings, 0 replies; 19+ messages in thread From: David Lethe @ 2008-12-27 0:32 UTC (permalink / raw) To: Robert Hancock, linux-raid; +Cc: linux-kernel > -----Original Message----- > From: linux-raid-owner@vger.kernel.org [mailto:linux-raid- > owner@vger.kernel.org] On Behalf Of Robert Hancock > Sent: Friday, December 26, 2008 4:16 PM > To: linux-raid@vger.kernel.org > Cc: linux-kernel@vger.kernel.org > Subject: Re: RFC: detection of silent corruption via ATA long sector > reads > > Greg Freemyer wrote: > > All, > > > > On the mdraid list, there was a recent thread about using raid > > functionality to detect / repair silent corruption. > > > > The issues brought up were that a lot of silent data corruption > occurs > > when cables, controllers, power supplies, ram, cache, etc. goes bad. > > > > It made me think about another option for detecting silent corruption > > I have not seen discussed, but maybe I missed it. > > > > Aiui, the ATA spec allows for the reading of a long sector as well as > > the normal 512 byte sector. When you get a long sector you also get > > the CRC (or whatever checksum data there is on the disk that allows > > the drive itself to detect media errors). > > > > I don't have any idea how easy or hard it would be to do, but I would > > like to see the entire block subsystem enhanced to optionally allow > > long sector reads to be used in a "paranoid" fashion. > > > > Effectively it would be: > > > > 1) Read long sector from drive: verify CRC in kernel. This tests > > most everything on the i/o path. > > > > 2) maintain CRC type information in block subsystem. Verify no > > corruption just before handing off to userspace. This would > > potentially identify CPU/cache/RAM failures. 
> > Even if the drive supports those commands the problem is the CRC/ECC > data is in a vendor-specific format, so it couldn't be processed > generically. > > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" > in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html Many of the RAID appliance/subsystem vendors format the disks to 520 or 528 Bytes/sector, but expose 512-byte blocks to the user. The ECC logic is done by the firmware ... or if this ever gets implemented, would be done by the LINUX kernel. True there are some issues with many of the cheap consumer class drives not supporting anything but 512-byte blocks, but we shouldn't code to lowest common denominator. With 1TB SATA disks selling for $99, then it isn't as if the extra 8-16 bytes for ECC on the disk drive is going to be a problem. David ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: RFC: detection of silent corruption via ATA long sector reads 2008-12-26 21:44 RFC: detection of silent corruption via ATA long sector reads Greg Freemyer 2008-12-26 22:15 ` Robert Hancock @ 2008-12-28 22:26 ` Mark Lord 1 sibling, 0 replies; 19+ messages in thread From: Mark Lord @ 2008-12-28 22:26 UTC (permalink / raw) To: Greg Freemyer; +Cc: Redeeman, piergiorgio.sartor, neilb, linux-raid, LKML Greg Freemyer wrote: > All, > > On the mdraid list, there was a recent thread about using raid > functionality to detect / repair silent corruption. > > The issues brought up were that a lot of silent data corruption occurs > when cables, controllers, power supplies, ram, cache, etc. goes bad. > > It made me think about another option for detecting silent corruption > I have not seen discussed, but maybe I missed it. > > Aiui, the ATA spec allows for the reading of a long sector as well as > the normal 512 byte sector. When you get a long sector you also get > the CRC (or whatever checksum data there is on the disk that allows > the drive itself to detect media errors). > > I don't have any idea how easy or hard it would be to do, but I would > like to see the entire block subsystem enhanced to optionally allow > long sector reads to be used in a "paranoid" fashion. > > Effectively it would be: > > 1) Read long sector from drive: verify CRC in kernel. This tests > most everything on the i/o path. > > 2) maintain CRC type information in block subsystem. Verify no > corruption just before handing off to userspace. This would > potentially identify CPU/cache/RAM failures. > > Mark Lord has implemented long sector reads via hdparm. Mark can you > comment on the feasibility of this idea? .. The ATA READ/WRITE LONG commands have been obsoleted in the past few ATA specs, even though most drives continue to implement them. But not a good avenue. There's a separate effort, involving drive vendors and kernel hackers, to provide end-to-end CRC protection of data. 
I forget what it was called, but that's the future of this stuff for high-reliability requirements. Cheers ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: RFC: detection of silent corruption via ATA long sector reads [not found] ` <fa.4QcsYZC0gJJwJ0eUOht3hDYaVWs@ifi.uio.no> @ 2008-12-28 22:40 ` Sitsofe Wheeler 2008-12-30 13:48 ` Mark Lord 2009-01-02 20:26 ` Greg Freemyer 0 siblings, 2 replies; 19+ messages in thread From: Sitsofe Wheeler @ 2008-12-28 22:40 UTC (permalink / raw) To: Mark Lord; +Cc: Greg Freemyer, Redeeman, piergiorgio.sartor, neilb, linux-raid Mark Lord wrote: > There's a separate effort, involving drive vendors and kernel hackers, > to provide end-to-end CRC protection of data. I forget what it was called, > but that's the future of this stuff for high-reliability requirements. Are you thinking of BLK_DEV_INTEGRITY which tries to support T10/SCSI Data Integrity Field or the T13/ATA External Path Protection? ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: RFC: detection of silent corruption via ATA long sector reads 2008-12-28 22:40 ` Sitsofe Wheeler @ 2008-12-30 13:48 ` Mark Lord 2009-01-02 20:26 ` Greg Freemyer 1 sibling, 0 replies; 19+ messages in thread From: Mark Lord @ 2008-12-30 13:48 UTC (permalink / raw) To: Sitsofe Wheeler Cc: Greg Freemyer, Redeeman, piergiorgio.sartor, neilb, linux-raid Sitsofe Wheeler wrote: > Mark Lord wrote: > >> There's a separate effort, involving drive vendors and kernel hackers, >> to provide end-to-end CRC protection of data. I forget what it was >> called, >> but that's the future of this stuff for high-reliability requirements. > > Are you thinking of BLK_DEV_INTEGRITY which tries to support T10/SCSI > Data Integrity Field or the T13/ATA External Path Protection? .. One or both of those, I think. Bad memory here, though! :) ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: RFC: detection of silent corruption via ATA long sector reads 2008-12-28 22:40 ` Sitsofe Wheeler 2008-12-30 13:48 ` Mark Lord @ 2009-01-02 20:26 ` Greg Freemyer 2009-01-02 20:43 ` Sitsofe Wheeler 2009-01-02 22:04 ` Martin K. Petersen 1 sibling, 2 replies; 19+ messages in thread From: Greg Freemyer @ 2009-01-02 20:26 UTC (permalink / raw) To: Sitsofe Wheeler Cc: Mark Lord, Redeeman, piergiorgio.sartor, neilb, linux-raid, IDE/ATA development list On Sun, Dec 28, 2008 at 5:40 PM, Sitsofe Wheeler <sitsofe@yahoo.com> wrote: > Mark Lord wrote: > >> There's a separate effort, involving drive vendors and kernel hackers, >> to provide end-to-end CRC protection of data. I forget what it was >> called, >> but that's the future of this stuff for high-reliability requirements. > > Are you thinking of BLK_DEV_INTEGRITY which tries to support T10/SCSI Data > Integrity Field or the T13/ATA External Path Protection? > I see that my Opensuse kernel has CONFIG_BLK_DEV_INTEGRITY enabled and that block layer changes have been implemented and documented in Documentation/block/data-integrity.txt I also see Device Mapper support was discussed in Oct. (My 2.6.27 kernel does not have those patches). Is there a more comprehensive write-up / resource that describes the current status of the overall INTEGRITY support, especially as it relates to ATA devices? ie. Do actual ATA hardware devices that support "T13/ATA External Path Protection" exist yet? Does it require HDD and controller support? Or just HDD? Does libata support those devices and the extra INTEGRITY bio that holds the CRC field? Does mdraid? Device Mapper? Thanks Greg -- Greg Freemyer Litigation Triage Solutions Specialist http://www.linkedin.com/in/gregfreemyer First 99 Days Litigation White Paper - http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf The Norcross Group The Intersection of Evidence & Technology http://www.norcrossgroup.com ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: RFC: detection of silent corruption via ATA long sector reads 2009-01-02 20:26 ` Greg Freemyer @ 2009-01-02 20:43 ` Sitsofe Wheeler 2009-01-02 21:05 ` Greg Freemyer 2009-01-02 22:04 ` Martin K. Petersen 1 sibling, 1 reply; 19+ messages in thread From: Sitsofe Wheeler @ 2009-01-02 20:43 UTC (permalink / raw) To: Greg Freemyer Cc: Mark Lord, Redeeman, piergiorgio.sartor, neilb, linux-raid, IDE/ATA development list, linux-kernel > Is there a more comprehensive write-up / resource that describes the > current status of the overall INTEGRITY support is, especially as it > relates to ATA devices? Did you check the kernel notes on kernelnewbies when the feature went in - http://kernelnewbies.org/Linux_2_6_27 ? ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: RFC: detection of silent corruption via ATA long sector reads 2009-01-02 20:43 ` Sitsofe Wheeler @ 2009-01-02 21:05 ` Greg Freemyer 0 siblings, 0 replies; 19+ messages in thread From: Greg Freemyer @ 2009-01-02 21:05 UTC (permalink / raw) To: Sitsofe Wheeler Cc: Mark Lord, Redeeman, piergiorgio.sartor, neilb, linux-raid, IDE/ATA development list, linux-kernel On Fri, Jan 2, 2009 at 3:43 PM, Sitsofe Wheeler <sitsofe@yahoo.com> wrote: >> Is there a more comprehensive write-up / resource that describes the >> current status of the overall INTEGRITY support is, especially as it >> relates to ATA devices? > > > Did you check the kernel notes on kernelnewbies when the feature went in - > http://kernelnewbies.org/Linux_2_6_27 ? Interesting read, but it does not really answer the questions I posed. I did look through the 2.6.27 source I have handy and the only call to blk_integrity_register() is in./drivers/scsi/sd_dif.c. That leaves me with the impression that there are not any ATA devices claiming support yet. Greg -- Greg Freemyer Litigation Triage Solutions Specialist http://www.linkedin.com/in/gregfreemyer First 99 Days Litigation White Paper - http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf The Norcross Group The Intersection of Evidence & Technology http://www.norcrossgroup.com ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: RFC: detection of silent corruption via ATA long sector reads 2009-01-02 20:26 ` Greg Freemyer 2009-01-02 20:43 ` Sitsofe Wheeler @ 2009-01-02 22:04 ` Martin K. Petersen 2009-01-02 22:41 ` Greg Freemyer 2009-01-03 13:20 ` John Robinson 1 sibling, 2 replies; 19+ messages in thread From: Martin K. Petersen @ 2009-01-02 22:04 UTC (permalink / raw) To: Greg Freemyer Cc: Sitsofe Wheeler, Mark Lord, Redeeman, piergiorgio.sartor, neilb, linux-raid, IDE/ATA development list >>>>> "Greg" == Greg Freemyer <greg.freemyer@gmail.com> writes: Greg> I also see Device Mapper support was discussed in Oct. (My 2.6.27 Greg> kernel does not have those patches). See below. Greg> Is there a more comprehensive write-up / resource that describes Greg> the current status of the overall INTEGRITY support, http://oss.oracle.com/projects/data-integrity/documentation/ The status is: - The infrastructure in the kernel is in place as of .27. Hoping to get MD/DM support in .29 but I'm running late wrt. the merge window. - We recently announced an early adopter program for Oracle DB customers. The ASM component of the database now supports the integrity hooks so we can provide true end-to-end integrity protection of DB I/O. - btrfs support is work in progress. - Other people have expressed interest in adding support to ext4 and XFS. Greg> especially as it relates to ATA devices? ATA support was put on hold in the T13 committee because the drive vendors don't feel like adding a big, intrusive feature to their firmware. I'm still hoping we can eventually get support added to nearline class drives but it'll be a while. Market demand needs to be there first. I.e. the array vendors that use SATA drives will need to start asking for it. We're just, just, just starting to push out FC support. Then comes SAS. And then hopefully ATA. Greg> ie. Do actual ATA hardware devices that support "T13/ATA External Greg> Path Protection" exist yet? Does it require HDD and controller Greg> support? Or just HDD? Both. 
You could emulate some of the DIX features in software (like scatterlist interleaving) and then plug in the long commands on the back end. But as Mark said the checksum formats differ between drive vendors/models. On SCSI you could conceivably use the block integrity stuff to store an LVM/MD checksum when used with devices that expose the application tag. However, it's only a 16-bit field (16 bits - 1 to be exact) so it's not exactly a lot of space. And only dumb drives are going to make it available. Some RAID controllers are going to keep those 16-bits for their own internal use. The main purpose of the block integrity stuff is to protect in-flight I/O. Persistence is an optional feature and a side-effect. So I think it would be much more worthwhile to implement checksumming in MD/DM without relying on special hardware. I did some experiments in that department a few years ago when we were investigating how to go about fixing some of the data integrity problems in Linux. I wrote something akin to DIF in software by doing 64 512-byte blocks + 512 bytes of checksums. The disadvantage there is having to do read-modify-write for small writes. I tried several other approaches sacrificing both space and locality but performance was still anemic. The reason DIF is implemented the way it is (with 520 byte sectors: 512 bytes followed by 8 bytes of checksum) is to prevent the cost of seeking to write the protection information elsewhere. With solid state devices that seek penalty doesn't exist so this may become less of an issue going forward. The beauty of checksumming in btrfs is that the checksum is stored in the filesystem metadata which is read/written anyway. So the only overhead is in calculating the actual checksum. That's something virtual block devices have a much harder time providing because they don't have metadata describing individual blocks. That doesn't mean it can't be done but it's a lot more work. 
I'm personally much more interested in adding support for adding a retry-other-mirror interface to MD/DM and leave the checksumming to the filesystems. -- Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: RFC: detection of silent corruption via ATA long sector reads 2009-01-02 22:04 ` Martin K. Petersen @ 2009-01-02 22:41 ` Greg Freemyer 2009-01-03 3:01 ` Martin K. Petersen 2009-01-03 13:20 ` John Robinson 1 sibling, 1 reply; 19+ messages in thread From: Greg Freemyer @ 2009-01-02 22:41 UTC (permalink / raw) To: Martin K. Petersen Cc: Sitsofe Wheeler, Mark Lord, Redeeman, piergiorgio.sartor, neilb, linux-raid, IDE/ATA development list Thanks Martin, comments interspersed On Fri, Jan 2, 2009 at 5:04 PM, Martin K. Petersen <martin.petersen@oracle.com> wrote: >>>>>> "Greg" == Greg Freemyer <greg.freemyer@gmail.com> writes: > <snip> > The status is: > > - The infrastructure in the kernel is in place as of .27. Hoping to > get MD/DM support in .29 but I'm running late wrt. the merge window. I haven't seen any MD patches at all. Will the MD support verify the CRC on read and trigger a RAID re-read other mirror on failure? > - We recently announced an early adopter program for Oracle DB > customers. The ASM component of the database now supports the > integrity hooks so we can true end-to-end integrity protection of DB > I/O. Very cool. > - btrfs support is work in progress. > > - Other people have expressed interest in adding support to ext4 and > XFS. Nice, but it seems the block layer will capture that vast majority of issues. > Greg> especially as it relates to ATA devices? > > ATA support was put on hold in the T13 committee because the drive > vendors don't feel like adding a big, intrusive feature to their > firmware. I'm still hoping we can eventually get support added to > nearline class drives but it'll be a while. Market demand needs to be > there first. I.e. the array vendors that use SATA drives will need to > start asking for it. > > We're just, just, just starting to push out FC support. Then comes SAS. > And then hopefully ATA. The LHC (Large Hadron Collider) people put out a white paper on silent corruption a year or two ago. 
They were very concerned that it could negatively impact their results. I don't remember the details, or how they worked around it. If they are not already part of your integrity team, you might want to reach out to them. And I think they bought / are buying huge amounts of hardware. > > Greg> ie. Do actual ATA hardware devices that support "T13/ATA External > Greg> Path Protection" exist yet? Does it require HDD and controller > Greg> support? Or just HDD? > > Both. You could emulate some of the DIX features in software (like > scatterlist interleaving) and then plug in the long commands on the back > end. But as Mark said the checksum formats differ between drive > vendors/models. The Linux kernel obviously supports a large amount of vendor specific code. Maybe the INTEGRITY crc could be calculated on the fly by libata for at least a few hard drive vendors that have known CRC algorithms used with the current long sector reads. ie. When INTEGRITY is enabled and supported hard drives are being read from, libata requests the long sector with proprietary CRC and verifies the vendor specific CRC. If it looks good, then the vendor specific CRC is replaced by the SCSI Spec CRC and the sector / bios are passed up the line just like a supported SCSI device would do. If those drives started selling well, maybe the drive manufacturers could be persuaded to implement the full end-to-end protocol. > On SCSI you could conceivably use the block integrity stuff to store an > LVM/MD checksum when used with devices that expose the application tag. > > However, it's only a 16-bit field (16 bits - 1 to be exact) so it's not > exactly a lot of space. And only dumb drives are going to make it > available. Some RAID controllers are going to keep those 16-bits for > their own internal use. > > The main purpose of the block integrity stuff is to protect in-flight > I/O. Persistence is an optional feature and a side-effect. In-flight is my concern as well. 
All of the silent corruption I've seen and taken the time to troubleshoot was caused by in-flight errors. I've seen it be cables, power supply, controller, ram, and CPU cache at a minimum. > So I think it would be much more worthwhile to implement checksumming in > MD/DM without relying on special hardware. I did some experiments in > that department a few years ago when we were investigating how to go > about fixing some of the data integrity problems in Linux. > > I wrote something akin to DIF in software by doing 64 512-byte blocks + > 512 bytes of checksums. The disadvantage there is having to do > read-modify-write for small writes. I tried several other approaches > sacrificing both space and locality but performance was still anemic. > > The reason DIF is implemented the way it is (with 520 byte sectors: 512 > bytes followed by 8 bytes of checksum) is to prevent the cost of seeking > to write the protection information elsewhere. With solid state devices > that seek penalty doesn't exist so this may become less of an issue > going forward. > > The beauty of checksumming in btrfs is that the checksum is stored in > the filesystem metadata which is read/written anyway. So the only > overhead is in calculating the actual checksum. That's something > virtual block devices have a much harder time providing because they > don't have metadata describing individual blocks. > > That doesn't mean it can't be done but it's a lot more work. I'm > personally much more interested in adding support for adding a > retry-other-mirror interface to MD/DM and leave the checksumming to the > filesystems. That makes sense as well, but given that most filesystems won't have inherent INTEGRITY support, the block layer should also be able to make retry-other-mirror requests of MD / DM. > -- > Martin K. Petersen Oracle Linux Engineering > Also, is there any effort to add diagnostic messages at the various tiers? 
You describe this as end-to-end protection, but when it fails, it would be extremely useful to check dmesg or something and be able to see that a sector came in from the controller fine, but was corrupted later, so CPU / memory is suspected vs. sector came in bad from the controller, so suspect a problem in the controller / cable / power supply area. Greg -- Greg Freemyer Litigation Triage Solutions Specialist http://www.linkedin.com/in/gregfreemyer First 99 Days Litigation White Paper - http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf The Norcross Group The Intersection of Evidence & Technology http://www.norcrossgroup.com ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: RFC: detection of silent corruption via ATA long sector reads 2009-01-02 22:41 ` Greg Freemyer @ 2009-01-03 3:01 ` Martin K. Petersen 0 siblings, 0 replies; 19+ messages in thread From: Martin K. Petersen @ 2009-01-03 3:01 UTC (permalink / raw) To: Greg Freemyer Cc: Martin K. Petersen, Sitsofe Wheeler, Mark Lord, Redeeman, piergiorgio.sartor, neilb, linux-raid, IDE/ATA development list >>>>> "Greg" == Greg Freemyer <greg.freemyer@gmail.com> writes: Greg> I haven't seen any MD patches at all. Will the MD support verify Greg> the CRC on read and trigger a RAID re-read other mirror on Greg> failure? No. With the data integrity model it is the owner of the integrity metadata that needs to re-drive the I/O in case of failure. So that means the application, filesystem or the block layer depending on who added it. The reason for this is twofold: 1) The owner of the I/O in question has much better knowledge about the context. On a write it can re-run verification checks on its buffers before deciding whether to try again, notify the user, etc. 2) Limiting the number of times we calculate the CRC/checksum. If every layer in the I/O stack did a check things would get painfully slow. So it's better to bubble everything to the top and do it once. That's why it's important to me to ensure that the appropriate signaling is in place so that upper layers can influence what's going on below. I.e. telling MD/DM to retry redundant copies. That said, adding a belt-and-suspenders option to MD/DM to verify all I/O would be trivial. But I don't think it's worth it. Greg> The LHC (Large Hadron Collider) people put out a white paper on Greg> silent corruption a year or two ago. They were very concerned Greg> that it could negatively impact there results. I've been talking to them on and off. >> Both. You could emulate some of the DIX features in software (like >> scatterlist interleaving) and then plug in the long commands on the >> back end. 
But as Mark said the checksum formats differ between drive >> vendors/models. Greg> The linux kernel obviously supports a large amount of vendor Greg> specific code. However, the actual ECC stored by disk drives is proprietary. The drive vendors have spent years and years refining their algorithms. I think it's highly unlikely that they'd be willing to tell us what's in there and how it's calculated. I really think you should all just go bug your drive vendors about this feature. The ATA add-on (called External Path Protection) was pretty much fully baked when it was shelved. It is compatible with the SCSI ditto so interoperability is a no-brainer. But the drive vendors fought it vehemently. Interestingly enough, SSD vendors seem much more interested in adding competitive features. Greg> Maybe the INTEGRITY crc could be calculated on the fly by libata Greg> for at least a few hard drive vendors that have known CRC Greg> algorithms used with the current long sector reads. It's usually an ECC and not a CRC, btw. And it's relatively big. It's not unusual to be able to correct on the order of 50 bytes out of 512. Greg> ie. When INTEGRITY is enabled and supported hard drives are being Greg> read from, libata requests the long sector with proprietary CRC Greg> and verifies the vendor specific CRC. If it looks good, then the Greg> vendor specific CRC is replaced by the SCSI Spec CRC and the Greg> sector / bios are passed up the line just like a supported SCSI Greg> device would do. Not necessary. The integrity infrastructure is completely agnostic to the data contained in the protection buffer. It's all done by callbacks registered with the block device. And consequently filesystems and applications operate at the "protect this buffer"/"verify this buffer" level. They don't have to know or care about T10, CRCs, ATA or anything. The actual format is negotiated in case of MD/DM that spans devices with potentially different capabilities/checksum formats. 
With SCSI we have the luxury that the CRC is mandatory so we can always fall back to that. Greg> In-flight is my concern as well. All of the silent corruption Greg> I've seen and taken the time to troubleshoot was caused by Greg> in-flight errors. I've seen it be cables, power supply, Greg> controller, ram, and CPU cache at a minimum. Yup. Greg> That makes sense as well, but given the most filesystems won't Greg> have inherent INTEGRITY support, then the block layer should also Greg> be able to make retry-other-mirror requests of MD / DM. Well, this is somewhat orthogonal. A drive is not going to return good sense information if the CRC didn't match the data. So the I/O is going to fail and DM/MD can retry at will. In that case it doesn't really matter what caused the failure and DM/MD will retry regardless. You could argue that the data could still be corrupted on the way back from the drive. But I haven't seen that happen much. In any case, the verification further up the stack is going to catch the mismatch. Most of the errors I see on READ are due to DMAs that for whatever reason didn't actually happen. That's actually a fun thing to do: Poison all pages in the target scatterlist before issuing a READ. I've had to do that several times to prove that transfers went missing in action. Greg> Also is there any effort to add diagnostic messages at the various Greg> tiers. Greg> You describe this as end-to-end protection, but when it fails, it Greg> would be extremely useful to check dmesg or something and be able Greg> to see that a sector came in from the controller fine, but was Greg> corrupted later, so CPU / memory is suspected vs. sector came in Greg> bad from the controller, so suspect a problem in the controller / Greg> cable / power supply area. Right now we distinguish between errors caught by the HBA and errors caught by the target device. 
A big problem we're trying to tackle is the case where a write is acknowledged by the RAID controller and stored in non-volatile memory there. Once the RAID controller commits the write to an actual disk the write fails and for some reason the RAID controller doesn't succeed in writing the block elsewhere. In that case the original I/O has been completed at the OS level. There's really no means for the array head to come back and say "Oh, btw. that I/O that I acked a while ago didn't actually make it". And even if it did we would have forgotten all about the context of that I/O so it wouldn't be of much help. So out of band error reporting like that (that also involves SAN switches) is a topic for discussion within the SNIA Data Integrity TWG. -- Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: RFC: detection of silent corruption via ATA long sector reads
  2009-01-02 22:04   ` Martin K. Petersen
  2009-01-02 22:41     ` Greg Freemyer
@ 2009-01-03 13:20     ` John Robinson
  2009-01-04  7:37       ` Martin K. Petersen
  1 sibling, 1 reply; 19+ messages in thread
From: John Robinson @ 2009-01-03 13:20 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: linux-raid

On 02/01/2009 22:04, Martin K. Petersen wrote:
[...]
> I wrote something akin to DIF in software by doing 64 512-byte blocks +
> 512 bytes of checksums.  The disadvantage there is having to do
> read-modify-write for small writes.  I tried several other approaches
> sacrificing both space and locality but performance was still anemic.

Excuse me if I'm being dense - and indeed tell me! - but RAID 4/5/6
already suffer from having to do read-modify-write for small writes, so
is there any chance this could be done at relatively little additional
expense for these?

Cheers,

John.

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: RFC: detection of silent corruption via ATA long sector reads
  2009-01-03 13:20     ` John Robinson
@ 2009-01-04  7:37       ` Martin K. Petersen
  2009-01-04 12:31         ` John Robinson
  0 siblings, 1 reply; 19+ messages in thread
From: Martin K. Petersen @ 2009-01-04  7:37 UTC (permalink / raw)
  To: John Robinson; +Cc: Martin K. Petersen, linux-raid

>>>>> "John" == John Robinson <john.robinson@anonymous.org.uk> writes:

John> Excuse me if I'm being dense - and indeed tell me! - but RAID
John> 4/5/6 already suffer from having to do read-modify-write for small
John> writes, so is there any chance this could be done at relatively
John> little additional expense for these?

You'd still need to store a checksum somewhere else, incurring
additional seek cost.  You could attempt to weasel out of that by adding
the checksum sector after a limited number of blocks and hope that you'd
be able to pull it in or write it out in one sweep.

The downside: assume we do checksums on - say - 8KB chunks in the RAID5
case.  We only need to store a few handfuls of bytes of checksum goo per
block.  But we can't address less than a 512 byte sector.  So we need to
either waste the bulk of 1 sector for every 16 to increase the
likelihood of adjacent access, or push the checksum sector further out
to fill it completely.  That wastes less space but has a higher chance
of causing an extra seek.  Pick your poison.

The reason I'm advocating checksumming on logical (filesystem) blocks is
that the filesystems have a much better idea what's good and what's bad
in a recovery situation.  And the filesystems already have an
infrastructure for storing metadata like checksums.  The cost of
accessing that metadata is inherent and inevitable.

btrfs had checksums from the get-go.  The XFS folks are working hard on
adding them.  ext4 is going to checksum metadata, I believe.  So this is
stuff that's already in the pipeline.

We also don't want to do checksumming at every layer.
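[Editor's note: the space-versus-locality tradeoff above can be put in rough numbers. This is a back-of-the-envelope sketch; the 8 bytes of checksum per chunk is an assumption — Martin only says "a few handfuls of bytes".]

```python
SECTOR = 512            # smallest addressable unit, in bytes
CHUNK = 8 * 1024        # checksum granularity Martin mentions: 8KB = 16 sectors
CSUM = 8                # assumed bytes of "checksum goo" per chunk (hypothetical)

sectors_per_chunk = CHUNK // SECTOR                   # 16

# Layout A: a checksum sector adjacent to every chunk (good locality,
# but most of the sector goes to waste).
overhead_adjacent = 1 / (sectors_per_chunk + 1)       # ~5.9% of the disk
sector_utilization = CSUM / SECTOR                    # ~1.6% of each csum sector

# Layout B: pack checksums until the sector is full (less waste, but the
# checksum sector drifts away from its data, risking an extra seek).
chunks_per_csum_sector = SECTOR // CSUM               # 64 chunks share one sector
overhead_packed = 1 / (chunks_per_csum_sector * sectors_per_chunk + 1)

print(f"adjacent: {overhead_adjacent:.1%} overhead, "
      f"{sector_utilization:.1%} of each checksum sector used")
print(f"packed:   {overhead_packed:.2%} overhead, likely extra seek per access")
```

Either way the checksums cost something; the numbers just show which resource (space or seeks) each layout spends.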
That's going to suck from a performance perspective.  It's better to do
checksumming high up in the stack and only do it once, as long as we
give the upper layers the option of re-driving the I/O.

That involves adding a cookie to each bio that gets filled out by DM/MD
on completion.  If the filesystem checksum fails we can resubmit the I/O
and pass along the cookie indicating that we want a different copy than
the one the cookie represents.

-- 
Martin K. Petersen        Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 19+ messages in thread
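[Editor's note: the cookie scheme might look roughly like this. It is a toy model with invented names (`TwoMirrorDevice`, `avoid_cookie`), not the actual bio interface — the point is only that the lower layer identifies which copy served a read, and a checksum failure upstream re-drives the I/O demanding a different copy.]

```python
import zlib

class TwoMirrorDevice:
    """Toy two-way mirror; mirror 0 silently returns corrupted data."""
    def __init__(self, good: bytes, bad: bytes):
        self.copies = {0: bad, 1: good}

    def read(self, avoid_cookie=None):
        # Serve any copy except the one the caller already tried; return
        # the data plus a cookie identifying which copy served the read.
        for mirror, data in self.copies.items():
            if mirror != avoid_cookie:
                return data, mirror

def checksummed_read(dev, expected_csum, checksum=zlib.crc32):
    data, cookie = dev.read()
    if checksum(data) != expected_csum:
        # Filesystem-level checksum failed: re-drive the I/O and pass the
        # cookie along so the lower layer picks a different copy.
        data, cookie = dev.read(avoid_cookie=cookie)
    return data

good = b"A" * 512
dev = TwoMirrorDevice(good=good, bad=b"B" * 512)
assert checksummed_read(dev, zlib.crc32(good)) == good
```

The checksum lives at the top of the stack and is computed once; the redundancy layer only needs to honour "give me a copy other than this one".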
* Re: RFC: detection of silent corruption via ATA long sector reads
  2009-01-04  7:37       ` Martin K. Petersen
@ 2009-01-04 12:31         ` John Robinson
  2009-01-04 13:49           ` John Robinson
  2009-01-05  2:45           ` Martin K. Petersen
  1 sibling, 2 replies; 19+ messages in thread
From: John Robinson @ 2009-01-04 12:31 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: linux-raid

On 04/01/2009 07:37, Martin K. Petersen wrote:
>>>>>> "John" == John Robinson <john.robinson@anonymous.org.uk> writes:
>
> John> Excuse me if I'm being dense - and indeed tell me! - but RAID
> John> 4/5/6 already suffer from having to do read-modify-write for
> John> small writes, so is there any chance this could be done at
> John> relatively little additional expense for these?
>
> You'd still need to store a checksum somewhere else, incurring
> additional seek cost.  You could attempt to weasel out of that by
> adding the checksum sector after a limited number of blocks and hope
> that you'd be able to pull it in or write it out in one sweep.
>
> The downside: assume we do checksums on - say - 8KB chunks in the
> RAID5 case.  We only need to store a few handfuls of bytes of checksum
> goo per block.  But we can't address less than a 512 byte sector.  So
> we need to either waste the bulk of 1 sector for every 16 to increase
> the likelihood of adjacent access, or push the checksum sector further
> out to fill it completely.  That wastes less space but has a higher
> chance of causing an extra seek.  Pick your poison.

Well, I was assuming that MD/DM operates in chunk-size amounts (e.g. 32K
or 64 sectors) anyway, and having a sector or two of checksums on disc
immediately following each chunk would be a pretty small cost,
increasing each read or write cycle only marginally (e.g. to 65
sectors), which shouldn't cause much drop in performance (I guess 1/64th
in throughput and IOPS, if the discs themselves are the bottleneck).
Essentially DIF on 32K blocks instead of 512 byte ones.
But perhaps this is a bad assumption and MD/DM already optimises out
whole-chunk reads and writes where they're not required (for very short,
less-than-one-chunk transactions), and I've no idea whether this happens
a lot.

> The reason I'm advocating checksumming on logical (filesystem) blocks
> is that the filesystems have a much better idea what's good and what's
> bad in a recovery situation.  And the filesystems already have an
> infrastructure for storing metadata like checksums.  The cost of
> accessing that metadata is inherent and inevitable.

Yes, I can see that.  But the old premise that RAID tried to maintain
was that disc sectors don't go bad.  You're quite reasonably dropping
the premise rather than trying to do more to maintain it.  There might
be validity to both approaches.

> We also don't want to do checksumming at every layer.  That's going to
> suck from a performance perspective.  It's better to do checksumming
> high up in the stack and only do it once.  As long as we give the
> upper layers the option of re-driving the I/O.
>
> That involves adding a cookie to each bio that gets filled out by
> DM/MD on completion.  If the filesystem checksum fails we can resubmit
> the I/O and pass along the cookie indicating that we want a different
> copy than the one the cookie represents.

I'd like to understand this mechanism better; at first glance it's
either going to be too simplistic and not cover the various block layer
cases well, or it means you end up re-implementing RAID and LVM in the
filesystem.

Just my €$£0.02 of course.

Cheers,

John.

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: RFC: detection of silent corruption via ATA long sector reads
  2009-01-04 12:31         ` John Robinson
@ 2009-01-04 13:49           ` John Robinson
  2009-01-05  2:43             ` Martin K. Petersen
  1 sibling, 1 reply; 19+ messages in thread
From: John Robinson @ 2009-01-04 13:49 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: linux-raid

On 04/01/2009 12:31, John Robinson wrote:
> On 04/01/2009 07:37, Martin K. Petersen wrote:
[...]
>> We also don't want to do checksumming at every layer.  That's going
>> to suck from a performance perspective.  It's better to do
>> checksumming high up in the stack and only do it once.  As long as we
>> give the upper layers the option of re-driving the I/O.
>>
>> That involves adding a cookie to each bio that gets filled out by
>> DM/MD on completion.  If the filesystem checksum fails we can
>> resubmit the I/O and pass along the cookie indicating that we want a
>> different copy than the one the cookie represents.
>
> I'd like to understand this mechanism better; at first glance it's
> either going to be too simplistic and not cover the various block
> layer cases well, or it means you end up re-implementing RAID and LVM
> in the filesystem.

I've thought about this again, and I'm wrong; there may be complications
in handling the cookies up and down the stack where more than one layer
thinks it knows how to have another go, but I can see what you describe
as being useful and relatively device-agnostic.

I wonder if there might also be scope for cookies going down through the
stack to carry an indication of how hard to try; some filesystems or
other consumers of block devices may be willing to ask again or want to
be told about problems quickly (e.g. btrfs over RAID over TLER-equipped
discs), while some may need best efforts all out first time because they
can't cope with failure returns (e.g. FAT over cheap IDE discs).

Anyway, I think I'd better leave all this to the experts, i.e. you :-)

Cheers,

John.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: RFC: detection of silent corruption via ATA long sector reads
  2009-01-04 13:49           ` John Robinson
@ 2009-01-05  2:43             ` Martin K. Petersen
  0 siblings, 0 replies; 19+ messages in thread
From: Martin K. Petersen @ 2009-01-05  2:43 UTC (permalink / raw)
  To: John Robinson; +Cc: Martin K. Petersen, linux-raid

>>>>> "John" == John Robinson <john.robinson@anonymous.org.uk> writes:

John> I've thought about this again, and I'm wrong; there may be
John> complications in handling the cookies up and down the stack where
John> more than one layer thinks it knows how to have another go, but I
John> can see what you describe as being useful and relatively
John> device-agnostic.

Yeah, care will need to be taken if you have multiple layers in the
stack providing redundancy.  That's usually not the case, though.

John> I wonder if there might also be scope for cookies going down
John> through the stack to carry an indication of how hard to try; some
John> filesystems or other consumers of block devices may be willing to
John> ask again or want to be told about problems quickly (e.g. btrfs
John> over RAID over TLER-equipped discs), while some may need best
John> efforts all out first time because they can't cope with failure
John> returns (e.g. FAT over cheap IDE discs).

We already have this functionality.  It's orthogonal to the integrity
bits.  You can tell the low-level drivers to either fail a request
immediately or to retry.

That's only a software thing, though.  It doesn't work terribly well
with consumer harddrives that assume there's only one copy of the data
and consequently enter annoying-click-mode and retry for a long time.
Nearline and enterprise drives assume there's a redundant copy and will
not try as hard, under the assumption that you know how to remedy the
problem.

-- 
Martin K. Petersen        Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: RFC: detection of silent corruption via ATA long sector reads
  2009-01-04 12:31         ` John Robinson
  2009-01-04 13:49           ` John Robinson
@ 2009-01-05  2:45           ` Martin K. Petersen
  2009-01-05  3:24             ` NeilBrown
  1 sibling, 1 reply; 19+ messages in thread
From: Martin K. Petersen @ 2009-01-05  2:45 UTC (permalink / raw)
  To: John Robinson; +Cc: Martin K. Petersen, linux-raid

>>>>> "John" == John Robinson <john.robinson@anonymous.org.uk> writes:

John> Essentially DIF on 32k blocks instead of 512 byte ones.  But
John> perhaps this is a bad assumption and MD/DM already optimises out
John> whole-chunk reads and writes where they're not required (for very
John> short, less-than-one-chunk transactions), and I've no idea whether
John> this happens a lot.

I haven't looked at the RAID4/5/6 code for a long time so I'm not sure
whether they only write dirty pages or the whole chunk + parity ditto.
Neil?

-- 
Martin K. Petersen        Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: RFC: detection of silent corruption via ATA long sector reads
  2009-01-05  2:45           ` Martin K. Petersen
@ 2009-01-05  3:24             ` NeilBrown
  0 siblings, 0 replies; 19+ messages in thread
From: NeilBrown @ 2009-01-05  3:24 UTC (permalink / raw)
  Cc: John Robinson, Martin K. Petersen, linux-raid

On Mon, January 5, 2009 1:45 pm, Martin K. Petersen wrote:
>>>>>> "John" == John Robinson <john.robinson@anonymous.org.uk> writes:
>
> John> Essentially DIF on 32k blocks instead of 512 byte ones.  But
> John> perhaps this is a bad assumption and MD/DM already optimises out
> John> whole-chunk reads and writes where they're not required (for
> John> very short, less-than-one-chunk transactions), and I've no idea
> John> whether this happens a lot.
>
> I haven't looked at the RAID4/5/6 code for a long time so I'm not sure
> whether they only write dirty pages or the whole chunk + parity ditto.
> Neil?

md/RAID456 writes whole pages (aligned to the array) but not whole
chunks.

If a filesystem requests a write of one page at a sector address which
is not a multiple of the page size, we will pre-read the rest of the two
array-aligned pages, and then write them (and parity) back out.
Otherwise, it will just write the requested pages plus parity updates.

NeilBrown

^ permalink raw reply	[flat|nested] 19+ messages in thread
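[Editor's note: Neil's alignment rule can be illustrated with a toy calculation. The page/sector sizes and the `pages_touched` helper are illustrative, not md internals — the point is that a write not starting on a page boundary straddles two array-aligned pages and so forces a read-modify-write of both.]

```python
PAGE_SECTORS = 8   # illustrative: 4KB page / 512-byte sectors

def pages_touched(start_sector: int, nr_sectors: int):
    """Array-aligned page numbers a write covers.  A write that doesn't
    start on a page boundary straddles two pages, and md must pre-read
    the untouched parts of both before writing them (plus parity) out."""
    first = start_sector // PAGE_SECTORS
    last = (start_sector + nr_sectors - 1) // PAGE_SECTORS
    return list(range(first, last + 1))

# Page-aligned, page-sized write: one page, no pre-read needed.
print(pages_touched(16, 8))   # → [2]

# The same write shifted by one sector straddles two pages, forcing a
# read-modify-write of both.
print(pages_touched(17, 8))   # → [2, 3]
```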
end of thread, other threads:[~2009-01-05 3:24 UTC | newest]
Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-12-26 21:44 RFC: detection of silent corruption via ATA long sector reads Greg Freemyer
2008-12-26 22:15 ` Robert Hancock
2008-12-27 0:32 ` David Lethe
2008-12-28 22:26 ` Mark Lord
[not found] <fa.8mwKV7y4hm+Q6mvIKtp9QGoJYUU@ifi.uio.no>
[not found] ` <fa.4QcsYZC0gJJwJ0eUOht3hDYaVWs@ifi.uio.no>
2008-12-28 22:40 ` Sitsofe Wheeler
2008-12-30 13:48 ` Mark Lord
2009-01-02 20:26 ` Greg Freemyer
2009-01-02 20:43 ` Sitsofe Wheeler
2009-01-02 21:05 ` Greg Freemyer
2009-01-02 22:04 ` Martin K. Petersen
2009-01-02 22:41 ` Greg Freemyer
2009-01-03 3:01 ` Martin K. Petersen
2009-01-03 13:20 ` John Robinson
2009-01-04 7:37 ` Martin K. Petersen
2009-01-04 12:31 ` John Robinson
2009-01-04 13:49 ` John Robinson
2009-01-05 2:43 ` Martin K. Petersen
2009-01-05 2:45 ` Martin K. Petersen
2009-01-05 3:24 ` NeilBrown