Re: RFC: detection of silent corruption via ATA long sector reads

linux-ide.vger.kernel.org archive mirror

* Re: RFC: detection of silent corruption via ATA long sector reads
       [not found]   ` <49580061.9060506@yahoo.com>
@ 2009-01-02 20:26     ` Greg Freemyer
  2009-01-02 20:43       ` Sitsofe Wheeler
  2009-01-02 22:04       ` Martin K. Petersen
  0 siblings, 2 replies; 6+ messages in thread
From: Greg Freemyer @ 2009-01-02 20:26 UTC (permalink / raw)
  To: Sitsofe Wheeler
  Cc: Mark Lord, Redeeman, piergiorgio.sartor, neilb, linux-raid,
	IDE/ATA development list

On Sun, Dec 28, 2008 at 5:40 PM, Sitsofe Wheeler <sitsofe@yahoo.com> wrote:
> Mark Lord wrote:
>
>> There's a separate effort, involving drive vendors and kernel hackers,
>> to provide end-to-end CRC protection of data.  I forget what it was
>> called,
>> but that's the future of this stuff for high-reliability requirements.
>
> Are you thinking of BLK_DEV_INTEGRITY which tries to support T10/SCSI Data
> Integrity Field or the T13/ATA External Path Protection?
>

I see that my openSUSE kernel has CONFIG_BLK_DEV_INTEGRITY enabled and
that block layer changes have been implemented and documented in

Documentation/block/data-integrity.txt

I also see Device Mapper support was discussed in Oct.  (My 2.6.27
kernel does not have those patches).

Is there a more comprehensive write-up / resource that describes the
current status of the overall INTEGRITY support, especially as it
relates to ATA devices?

ie.
Do actual ATA hardware devices that support "T13/ATA External Path
Protection" exist yet?  Does it require HDD and controller support?
Or just HDD?

Does libata support those devices and the extra INTEGRITY bio that
holds the CRC field?

Does mdraid?  Device Mapper?

Thanks
Greg
-- 
Greg Freemyer
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
First 99 Days Litigation White Paper -
http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com


* Re: RFC: detection of silent corruption via ATA long sector reads
  2009-01-02 20:26     ` RFC: detection of silent corruption via ATA long sector reads Greg Freemyer
@ 2009-01-02 20:43       ` Sitsofe Wheeler
  2009-01-02 21:05         ` Greg Freemyer
  2009-01-02 22:04       ` Martin K. Petersen
  1 sibling, 1 reply; 6+ messages in thread
From: Sitsofe Wheeler @ 2009-01-02 20:43 UTC (permalink / raw)
  To: Greg Freemyer
  Cc: Mark Lord, Redeeman, piergiorgio.sartor, neilb, linux-raid,
	IDE/ATA development list, linux-kernel

> Is there a more comprehensive write-up / resource that describes the
> current status of the overall INTEGRITY support, especially as it
> relates to ATA devices?


Did you check the kernel notes on kernelnewbies when the feature went in - 
http://kernelnewbies.org/Linux_2_6_27 ?


      


* Re: RFC: detection of silent corruption via ATA long sector reads
  2009-01-02 20:43       ` Sitsofe Wheeler
@ 2009-01-02 21:05         ` Greg Freemyer
  0 siblings, 0 replies; 6+ messages in thread
From: Greg Freemyer @ 2009-01-02 21:05 UTC (permalink / raw)
  To: Sitsofe Wheeler
  Cc: Mark Lord, Redeeman, piergiorgio.sartor, neilb, linux-raid,
	IDE/ATA development list, linux-kernel

On Fri, Jan 2, 2009 at 3:43 PM, Sitsofe Wheeler <sitsofe@yahoo.com> wrote:
>> Is there a more comprehensive write-up / resource that describes the
>> current status of the overall INTEGRITY support, especially as it
>> relates to ATA devices?
>
>
> Did you check the kernel notes on kernelnewbies when the feature went in -
> http://kernelnewbies.org/Linux_2_6_27 ?

Interesting read, but it does not really answer the questions I posed.

I did look through the 2.6.27 source I have handy and the only call to
blk_integrity_register() is in ./drivers/scsi/sd_dif.c.

That leaves me with the impression that there are not any ATA devices
claiming support yet.

Greg

* Re: RFC: detection of silent corruption via ATA long sector reads
  2009-01-02 20:26     ` RFC: detection of silent corruption via ATA long sector reads Greg Freemyer
  2009-01-02 20:43       ` Sitsofe Wheeler
@ 2009-01-02 22:04       ` Martin K. Petersen
  2009-01-02 22:41         ` Greg Freemyer
  1 sibling, 1 reply; 6+ messages in thread
From: Martin K. Petersen @ 2009-01-02 22:04 UTC (permalink / raw)
  To: Greg Freemyer
  Cc: Sitsofe Wheeler, Mark Lord, Redeeman, piergiorgio.sartor, neilb,
	linux-raid, IDE/ATA development list

>>>>> "Greg" == Greg Freemyer <greg.freemyer@gmail.com> writes:

Greg> I also see Device Mapper support was discussed in Oct.  (My 2.6.27
Greg> kernel does not have those patches).

See below.

Greg> Is there a more comprehensive write-up / resource that describes
Greg> the current status of the overall INTEGRITY support,

http://oss.oracle.com/projects/data-integrity/documentation/

The status is:

 - The infrastructure in the kernel is in place as of .27.  Hoping to
   get MD/DM support in .29 but I'm running late wrt. the merge window.

 - We recently announced an early adopter program for Oracle DB
   customers.  The ASM component of the database now supports the
   integrity hooks so we can offer true end-to-end integrity protection of DB
   I/O.

 - btrfs support is work in progress.

 - Other people have expressed interest in adding support to ext4 and
   XFS.

Greg> especially as it relates to ATA devices?

ATA support was put on hold in the T13 committee because the drive
vendors don't feel like adding a big, intrusive feature to their
firmware.  I'm still hoping we can eventually get support added to
nearline class drives but it'll be a while.  Market demand needs to be
there first.  I.e. the array vendors that use SATA drives will need to
start asking for it.

We're just, just, just starting to push out FC support.  Then comes SAS.
And then hopefully ATA.

Greg> ie.  Do actual ATA hardware devices that support "T13/ATA External
Greg> Path Protection" exist yet?  Does it require HDD and controller
Greg> support?  Or just HDD?

Both.  You could emulate some of the DIX features in software (like
scatterlist interleaving) and then plug in the long commands on the back
end.  But as Mark said the checksum formats differ between drive
vendors/models.

On SCSI you could conceivably use the block integrity stuff to store an
LVM/MD checksum when used with devices that expose the application tag.

However, it's only a 16-bit field (16 bits - 1 to be exact) so it's not
exactly a lot of space.  And only dumb drives are going to make it
available.  Some RAID controllers are going to keep those 16-bits for
their own internal use.

The main purpose of the block integrity stuff is to protect in-flight
I/O.  Persistence is an optional feature and a side-effect.

So I think it would be much more worthwhile to implement checksumming in
MD/DM without relying on special hardware.  I did some experiments in
that department a few years ago when we were investigating how to go
about fixing some of the data integrity problems in Linux.

I wrote something akin to DIF in software by doing 64 512-byte blocks +
512 bytes of checksums.  The disadvantage there is having to do
read-modify-write for small writes.  I tried several other approaches
sacrificing both space and locality but performance was still anemic.
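
To illustrate why that layout forces read-modify-write, here is a
minimal Python sketch of the address arithmetic.  It assumes the
checksum sector trails its 64 data sectors and that each data sector
gets 8 bytes of checksum (512 / 64); `locate` and
`sectors_touched_by_write` are hypothetical names, not code from the
experiments mentioned above.

```python
SECTOR = 512
DATA_PER_GROUP = 64                      # data sectors per group
CSUM_BYTES = SECTOR // DATA_PER_GROUP    # 8 bytes of checksum per data sector
GROUP = DATA_PER_GROUP + 1               # 64 data sectors + 1 checksum sector


def locate(lba):
    """Map a logical sector to (data sector, checksum sector, byte offset)."""
    group, index = divmod(lba, DATA_PER_GROUP)
    base = group * GROUP
    # Assumption: the checksum sector sits right after the group's data.
    return base + index, base + DATA_PER_GROUP, index * CSUM_BYTES


def sectors_touched_by_write(lba, count):
    """Return (data sectors written, checksum sectors needing read-modify-write)."""
    data = set()
    covered = {}                         # checksum sector -> data sectors covered
    for logical in range(lba, lba + count):
        d, c, _ = locate(logical)
        data.add(d)
        covered[c] = covered.get(c, 0) + 1
    # A checksum sector that is only partly covered must be read back first.
    rmw = sorted(c for c, n in covered.items() if n < DATA_PER_GROUP)
    return sorted(data), rmw
```

A single-sector write, e.g. `sectors_touched_by_write(5, 1)`, has to
read and rewrite the group's checksum sector as well, whereas a write
covering a whole aligned group of 64 sectors can replace the checksum
sector outright.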

The reason DIF is implemented the way it is (with 520 byte sectors: 512
bytes followed by 8 bytes of checksum) is to prevent the cost of seeking
to write the protection information elsewhere.  With solid state devices
that seek penalty doesn't exist so this may become less of an issue
going forward.
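
A rough Python sketch of a 520-byte protected sector.  The 8 trailing
bytes are split here into a 16-bit guard CRC, a 16-bit application tag
and a 32-bit reference tag holding the low bits of the LBA, which is how
the T10 DIF Type 1 tuple is commonly described; the polynomial 0x8BB7
and the helper names `protect` and `verify` are assumptions of this
sketch rather than anything stated in this thread.

```python
import struct


def crc16_t10dif(data: bytes) -> int:
    """Bitwise CRC-16, polynomial 0x8BB7, zero initial value, no reflection."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x8BB7) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc


def protect(sector: bytes, lba: int, app_tag: int = 0) -> bytes:
    """Append guard, application and reference tags to a 512-byte sector."""
    if len(sector) != 512:
        raise ValueError("expected a 512-byte sector")
    guard = crc16_t10dif(sector)
    return sector + struct.pack(">HHI", guard, app_tag, lba & 0xFFFFFFFF)


def verify(block: bytes, lba: int) -> bool:
    """Check the guard and reference tags of a 520-byte block."""
    if len(block) != 520:
        raise ValueError("expected a 520-byte block")
    guard, _app_tag, ref = struct.unpack(">HHI", block[512:])
    return guard == crc16_t10dif(block[:512]) and ref == (lba & 0xFFFFFFFF)
```

Because the tuple travels with the sector, a write needs no extra seek,
and the reference tag lets `verify` also catch a block that was read
from or written to the wrong LBA.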

The beauty of checksumming in btrfs is that the checksum is stored in
the filesystem metadata which is read/written anyway.  So the only
overhead is in calculating the actual checksum.  That's something
virtual block devices have a much harder time providing because they
don't have metadata describing individual blocks.

That doesn't mean it can't be done but it's a lot more work.  I'm
personally much more interested in adding a retry-other-mirror
interface to MD/DM and leaving the checksumming to the filesystems.

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: RFC: detection of silent corruption via ATA long sector reads
  2009-01-02 22:04       ` Martin K. Petersen
@ 2009-01-02 22:41         ` Greg Freemyer
  2009-01-03  3:01           ` Martin K. Petersen
  0 siblings, 1 reply; 6+ messages in thread
From: Greg Freemyer @ 2009-01-02 22:41 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Sitsofe Wheeler, Mark Lord, Redeeman, piergiorgio.sartor, neilb,
	linux-raid, IDE/ATA development list

Thanks Martin, comments interspersed

On Fri, Jan 2, 2009 at 5:04 PM, Martin K. Petersen
<martin.petersen@oracle.com> wrote:
>>>>>> "Greg" == Greg Freemyer <greg.freemyer@gmail.com> writes:
>
<snip>

> The status is:
>
>  - The infrastructure in the kernel is in place as of .27.  Hoping to
>   get MD/DM support in .29 but I'm running late wrt. the merge window.

I haven't seen any MD patches at all.  Will the MD support verify the
CRC on read and trigger a RAID re-read from the other mirror on failure?

>  - We recently announced an early adopter program for Oracle DB
>   customers.  The ASM component of the database now supports the
>   integrity hooks so we can offer true end-to-end integrity protection of DB
>   I/O.

Very cool.

>  - btrfs support is work in progress.
>
>  - Other people have expressed interest in adding support to ext4 and
>   XFS.

Nice, but it seems the block layer will capture the vast majority of issues.

> Greg> especially as it relates to ATA devices?
>
> ATA support was put on hold in the T13 committee because the drive
> vendors don't feel like adding a big, intrusive feature to their
> firmware.  I'm still hoping we can eventually get support added to
> nearline class drives but it'll be a while.  Market demand needs to be
> there first.  I.e. the array vendors that use SATA drives will need to
> start asking for it.
>
> We're just, just, just starting to push out FC support.  Then comes SAS.
> And then hopefully ATA.

The LHC (Large Hadron Collider) people put out a white paper on silent
corruption a year or two ago.  They were very concerned that it could
negatively impact their results.  I don't remember the details, or how
they worked around it.

If they are not already part of your integrity team, you might want to
reach out to them.  And I think they bought / are buying huge amounts
of hardware.

>
> Greg> ie.  Do actual ATA hardware devices that support "T13/ATA External
> Greg> Path Protection" exist yet?  Does it require HDD and controller
> Greg> support?  Or just HDD?
>
> Both.  You could emulate some of the DIX features in software (like
> scatterlist interleaving) and then plug in the long commands on the back
> end.  But as Mark said the checksum formats differ between drive
> vendors/models.

The Linux kernel obviously supports a large amount of vendor specific code.

Maybe the INTEGRITY crc could be calculated on the fly by libata for
at least a few hard drive vendors that have known CRC algorithms used
with the current long sector reads.

ie. When INTEGRITY is enabled and supported hard drives are being read
from, libata requests the long sector with proprietary CRC and
verifies the vendor specific CRC.  If it looks good, then the vendor
specific CRC is replaced by the SCSI Spec CRC and the sector / bios
are passed up the line just like a supported SCSI device would do.

If those drives started selling well, maybe the drive manufacturers
could be persuaded to implement the full end-to-end protocol.

> On SCSI you could conceivably use the block integrity stuff to store an
> LVM/MD checksum when used with devices that expose the application tag.
>
> However, it's only a 16-bit field (16 bits - 1 to be exact) so it's not
> exactly a lot of space.  And only dumb drives are going to make it
> available.  Some RAID controllers are going to keep those 16-bits for
> their own internal use.
>
> The main purpose of the block integrity stuff is to protect in-flight
> I/O.  Persistence is an optional feature and a side-effect.

In-flight is my concern as well.  All of the silent corruption I've
seen and taken the time to troubleshoot was caused by in-flight
errors.  I've seen it be cables, power supply, controller, ram, and
CPU cache at a minimum.

> So I think it would be much more worthwhile to implement checksumming in
> MD/DM without relying on special hardware.  I did some experiments in
> that department a few years ago when we were investigating how to go
> about fixing some of the data integrity problems in Linux.
>
> I wrote something akin to DIF in software by doing 64 512-byte blocks +
> 512 bytes of checksums.  The disadvantage there is having to do
> read-modify-write for small writes.  I tried several other approaches
> sacrificing both space and locality but performance was still anemic.
>
> The reason DIF is implemented the way it is (with 520 byte sectors: 512
> bytes followed by 8 bytes of checksum) is to prevent the cost of seeking
> to write the protection information elsewhere.  With solid state devices
> that seek penalty doesn't exist so this may become less of an issue
> going forward.
>
> The beauty of checksumming in btrfs is that the checksum is stored in
> the filesystem metadata which is read/written anyway.  So the only
> overhead is in calculating the actual checksum.  That's something
> virtual block devices have a much harder time providing because they
> don't have metadata describing individual blocks.
>
> That doesn't mean it can't be done but it's a lot more work.  I'm
> personally much more interested in adding a retry-other-mirror
> interface to MD/DM and leaving the checksumming to the filesystems.

That makes sense as well, but given that most filesystems won't have
inherent INTEGRITY support, the block layer should also be able
to make retry-other-mirror requests of MD / DM.

> --
> Martin K. Petersen      Oracle Linux Engineering
>

Also, is there any effort to add diagnostic messages at the various tiers?

You describe this as end-to-end protection, but when it fails, it
would be extremely useful to check dmesg or something and be able to
see that a sector came in from the controller fine, but was corrupted
later, so CPU / memory is suspected vs. sector came in bad from the
controller, so suspect a problem in the controller / cable / power
supply area.

Greg

* Re: RFC: detection of silent corruption via ATA long sector reads
  2009-01-02 22:41         ` Greg Freemyer
@ 2009-01-03  3:01           ` Martin K. Petersen
  0 siblings, 0 replies; 6+ messages in thread
From: Martin K. Petersen @ 2009-01-03  3:01 UTC (permalink / raw)
  To: Greg Freemyer
  Cc: Martin K. Petersen, Sitsofe Wheeler, Mark Lord, Redeeman,
	piergiorgio.sartor, neilb, linux-raid, IDE/ATA development list

>>>>> "Greg" == Greg Freemyer <greg.freemyer@gmail.com> writes:

Greg> I haven't seen any MD patches at all.  Will the MD support verify
Greg> the CRC on read and trigger a RAID re-read from the other mirror
Greg> on failure?

No.  With the data integrity model it is the owner of the integrity
metadata that needs to re-drive the I/O in case of failure.  So that
means the application, filesystem or the block layer depending on who
added it.

The reason for this is twofold:

 1) The owner of the I/O in question has much better knowledge about the
    context.  On a write it can re-run verification checks on its
    buffers before deciding whether to try again, notify the user, etc.

 2) Limiting the number of times we calculate the CRC/checksum.  If
    every layer in the I/O stack did a check things would get painfully
    slow.  So it's better to bubble everything to the top and do it
    once.

That's why it's important to me to ensure that the appropriate signaling
is in place so that upper layers can influence what's going on below.
I.e. telling MD/DM to retry redundant copies.
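
The division of labour described above can be sketched as follows, in
Python.  `Mirror` is a toy stand-in for MD/DM and `checked_read` plays
the filesystem that owns the checksum; both names, the per-copy `read`
interface and the use of CRC32 are assumptions made for illustration,
since no such interface is specified in this thread.

```python
import zlib


class Mirror:
    """Toy stand-in for an MD/DM mirror holding several copies of each block."""

    def __init__(self, copies):
        self.copies = copies          # list of dicts: block number -> bytes

    def num_copies(self):
        return len(self.copies)

    def read(self, block, copy=0):
        """Return the requested copy without checking its contents."""
        return self.copies[copy][block]


def checked_read(mirror, block, expected_crc):
    """Verify once, at the top, and re-drive the read on a mismatch."""
    for copy in range(mirror.num_copies()):
        data = mirror.read(block, copy)
        if zlib.crc32(data) == expected_crc:
            return data, copy
    raise IOError(f"block {block}: no copy matched the checksum")
```

The mirror layer never computes a checksum here; it only needs to honour
a request for a specific copy, which keeps the CRC work in one place.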

That said, adding a belt-and-suspenders option to MD/DM to verify all
I/O would be trivial.  But I don't think it's worth it.

Greg> The LHC (Large Hadron Collider) people put out a white paper on
Greg> silent corruption a year or two ago.  They were very concerned
Greg> that it could negatively impact their results.

I've been talking to them on and off.

>> Both.  You could emulate some of the DIX features in software (like
>> scatterlist interleaving) and then plug in the long commands on the
>> back end.  But as Mark said the checksum formats differ between drive
>> vendors/models.

Greg> The Linux kernel obviously supports a large amount of vendor
Greg> specific code.

However, the actual ECC stored by disk drives is proprietary.  The drive
vendors have spent years and years refining their algorithms.  I think
it's highly unlikely that they'd be willing to tell us what's in there
and how it's calculated.

I really think you should all just go bug your drive vendors about this
feature.  The ATA add-on (called External Path Protection) was pretty
much fully baked when it was shelved.  It is compatible with its SCSI
counterpart so interoperability is a no-brainer.  But the drive vendors fought
it vehemently.

Interestingly enough, SSD vendors seem much more interested in adding
competitive features.

Greg> Maybe the INTEGRITY crc could be calculated on the fly by libata
Greg> for at least a few hard drive vendors that have known CRC
Greg> algorithms used with the current long sector reads.

It's usually an ECC and not a CRC, btw.  And it's relatively big.  It's
not unusual to be able to correct on the order of 50 bytes out of 512.

Greg> ie. When INTEGRITY is enabled and supported hard drives are being
Greg> read from, libata requests the long sector with proprietary CRC
Greg> and verifies the vendor specific CRC.  If it looks good, then the
Greg> vendor specific CRC is replaced by the SCSI Spec CRC and the
Greg> sector / bios are passed up the line just like a supported SCSI
Greg> device would do.

Not necessary.

The integrity infrastructure is completely agnostic to the data
contained in the protection buffer.  It's all done by callbacks
registered with the block device.  And consequently filesystems and
applications operate at the "protect this buffer"/"verify this buffer"
level.  They don't have to know or care about T10, CRCs, ATA or
anything.
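
A small Python sketch of that callback arrangement.  The class and
method names (`IntegrityProfile`, `register_integrity`, `protect`,
`check`) are hypothetical and do not mirror the kernel's actual C
interface; the point is only that the caller never inspects the
protection format.

```python
import hashlib
import zlib


class IntegrityProfile:
    """Hypothetical profile: a format name plus generate/verify callbacks."""

    def __init__(self, name, generate, verify):
        self.name = name
        self.generate = generate      # bytes -> protection metadata
        self.verify = verify          # (bytes, metadata) -> bool


class BlockDevice:
    """Toy device that only remembers the profile registered with it."""

    def __init__(self):
        self.profile = None

    def register_integrity(self, profile):
        self.profile = profile

    def protect(self, buf):
        """'Protect this buffer': return opaque metadata for buf."""
        return self.profile.generate(buf)

    def check(self, buf, meta):
        """'Verify this buffer': True if meta still matches buf."""
        return self.profile.verify(buf, meta)


def crc32_meta(buf):
    return zlib.crc32(buf).to_bytes(4, "big")


def sha1_meta(buf):
    return hashlib.sha1(buf).digest()


CRC_PROFILE = IntegrityProfile("crc32", crc32_meta, lambda b, m: crc32_meta(b) == m)
SHA_PROFILE = IntegrityProfile("sha1", sha1_meta, lambda b, m: sha1_meta(b) == m)


def write_then_read(dev, buf, corrupt=False):
    """Caller code that is identical whatever format the device registered."""
    meta = dev.protect(buf)
    if corrupt:
        buf = bytes([buf[0] ^ 0xFF]) + buf[1:]   # damage the first byte
    return dev.check(buf, meta)
```

Swapping `CRC_PROFILE` for `SHA_PROFILE` changes the metadata size and
algorithm without touching `write_then_read`.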

The actual format is negotiated in case of MD/DM that spans devices with
potentially different capabilities/checksum formats.  With SCSI we have
the luxury that the CRC is mandatory so we can always fall back to that.

Greg> In-flight is my concern as well.  All of the silent corruption
Greg> I've seen and taken the time to troubleshoot was caused by
Greg> in-flight errors.  I've seen it be cables, power supply,
Greg> controller, ram, and CPU cache at a minimum.

Yup.

Greg> That makes sense as well, but given that most filesystems won't
Greg> have inherent INTEGRITY support, the block layer should also
Greg> be able to make retry-other-mirror requests of MD / DM.

Well, this is somewhat orthogonal.  A drive is not going to return good
sense information if the CRC didn't match the data.  So the I/O is going
to fail and DM/MD can retry at will.  In that case it doesn't really
matter what caused the failure and DM/MD will retry regardless.

You could argue that the data could still be corrupted on the way back
from the drive.  But I haven't seen that happen much.  In any case, the
verification further up the stack is going to catch the mismatch.

Most of the errors I see on READ are due to DMAs that for whatever
reason didn't actually happen.

That's actually a fun thing to do: Poison all pages in the target
scatterlist before issuing a READ.  I've had to do that several times to
prove that transfers went missing in action.
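
A user-space sketch of the poisoning trick in Python.  Bytearrays stand
in for the pages of the scatterlist, `flaky_read` simulates a transfer
that silently skips some pages, and the 0xDEADBEEF pattern is an
arbitrary choice; all of these are assumptions for illustration.

```python
POISON = bytes.fromhex("deadbeef")


def poison(pages):
    """Fill every page with the poison pattern before issuing the READ."""
    for page in pages:
        page[:] = POISON * (len(page) // len(POISON))


def flaky_read(pages, skipped):
    """Simulated READ: fills every page except the ones in `skipped`."""
    for i, page in enumerate(pages):
        if i not in skipped:
            page[:] = bytes([i + 1]) * len(page)


def still_poisoned(pages):
    """Indices of pages the transfer never touched."""
    return [i for i, page in enumerate(pages)
            if bytes(page) == POISON * (len(page) // len(POISON))]


pages = [bytearray(4096) for _ in range(4)]
poison(pages)
flaky_read(pages, skipped={2})
print(still_poisoned(pages))
```

A page that still holds nothing but the pattern after the READ completes
is strong evidence that the DMA into it never happened, since real data
is very unlikely to match the pattern for a whole page.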

Greg> Also, is there any effort to add diagnostic messages at the
Greg> various tiers?

Greg> You describe this as end-to-end protection, but when it fails, it
Greg> would be extremely useful to check dmesg or something and be able
Greg> to see that a sector came in from the controller fine, but was
Greg> corrupted later, so CPU / memory is suspected vs. sector came in
Greg> bad from the controller, so suspect a problem in the controller /
Greg> cable / power supply area.

Right now we distinguish between errors caught by the HBA and errors
caught by the target device.

A big problem we're trying to tackle is the case where a write is
acknowledged by the RAID controller and stored in non-volatile memory
there.  Once the RAID controller commits the write to an actual disk the
write fails and for some reason the RAID controller doesn't succeed in
writing the block elsewhere.  In that case the original I/O has been
completed at the OS level.  There's really no means for the array head
to come back and say "Oh, btw. that I/O that I acked a while ago didn't
actually make it".  And even if it did we would have forgotten all about
the context of that I/O so it wouldn't be of much help.

So out of band error reporting like that (that also involves SAN
switches) is a topic for discussion within the SNIA Data Integrity TWG.

-- 
Martin K. Petersen	Oracle Linux Engineering
