RFC: detection of silent corruption via ATA long sector reads

* RFC: detection of silent corruption via ATA long sector reads
From: Greg Freemyer @ 2008-12-26 21:44 UTC
  To: Redeeman; +Cc: piergiorgio.sartor, neilb, linux-raid, LKML, Mark Lord

All,

On the mdraid list, there was a recent thread about using raid
functionality to detect / repair silent corruption.

The issues brought up were that a lot of silent data corruption occurs
when cables, controllers, power supplies, ram, cache, etc. go bad.

It made me think about another option for detecting silent corruption
I have not seen discussed, but maybe I missed it.

Aiui, the ATA spec allows for the reading of a long sector as well as
the normal 512 byte sector.  When you get a long sector you also get
the CRC (or whatever checksum data there is on the disk that allows
the drive itself to detect media errors).

I don't have any idea how easy or hard it would be to do, but I would
like to see the entire block subsystem enhanced to optionally allow
long sector reads to be used in a "paranoid" fashion.

Effectively it would be:

1) Read long sector from drive:  verify CRC in kernel.  This tests
most everything on the i/o path.

2) maintain CRC type information in block subsystem.  Verify no
corruption just before handing off to userspace.  This would
potentially identify CPU/cache/RAM failures.

Mark Lord has implemented long sector reads via hdparm.  Mark can you
comment on the feasibility of this idea?

Thanks
Greg
-- 
Greg Freemyer
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
First 99 Days Litigation White Paper -
http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com

* Re: RFC: detection of silent corruption via ATA long sector reads
From: Robert Hancock @ 2008-12-26 22:15 UTC
  To: linux-raid; +Cc: linux-kernel

Greg Freemyer wrote:
> All,
> 
> On the mdraid list, there was a recent thread about using raid
> functionality to detect / repair silent corruption.
> 
> The issues brought up were that a lot of silent data corruption occurs
> when cables, controllers, power supplies, ram, cache, etc. go bad.
> 
> It made me think about another option for detecting silent corruption
> I have not seen discussed, but maybe I missed it.
> 
> Aiui, the ATA spec allows for the reading of a long sector as well as
> the normal 512 byte sector.  When you get a long sector you also get
> the CRC (or whatever checksum data there is on the disk that allows
> the drive itself to detect media errors).
> 
> I don't have any idea how easy or hard it would be to do, but I would
> like to see the entire block subsystem enhanced to optionally allow
> long sector reads to be used in a "paranoid" fashion.
> 
> Effectively it would be:
> 
> 1) Read long sector from drive:  verify CRC in kernel.  This tests
> most everything on the i/o path.
> 
> 2) maintain CRC type information in block subsystem.  Verify no
> corruption just before handing off to userspace.  This would
> potentially identify CPU/cache/RAM failures.

Even if the drive supports those commands the problem is the CRC/ECC 
data is in a vendor-specific format, so it couldn't be processed 
generically.


* Re: RFC: detection of silent corruption via ATA long sector reads
From: Mark Lord @ 2008-12-28 22:26 UTC
  To: Greg Freemyer; +Cc: Redeeman, piergiorgio.sartor, neilb, linux-raid, LKML

Greg Freemyer wrote:
> All,
> 
> On the mdraid list, there was a recent thread about using raid
> functionality to detect / repair silent corruption.
> 
> The issues brought up were that a lot of silent data corruption occurs
> when cables, controllers, power supplies, ram, cache, etc. go bad.
> 
> It made me think about another option for detecting silent corruption
> I have not seen discussed, but maybe I missed it.
> 
> Aiui, the ATA spec allows for the reading of a long sector as well as
> the normal 512 byte sector.  When you get a long sector you also get
> the CRC (or whatever checksum data there is on the disk that allows
> the drive itself to detect media errors).
> 
> I don't have any idea how easy or hard it would be to do, but I would
> like to see the entire block subsystem enhanced to optionally allow
> long sector reads to be used in a "paranoid" fashion.
> 
> Effectively it would be:
> 
> 1) Read long sector from drive:  verify CRC in kernel.  This tests
> most everything on the i/o path.
> 
> 2) maintain CRC type information in block subsystem.  Verify no
> corruption just before handing off to userspace.  This would
> potentially identify CPU/cache/RAM failures.
> 
> Mark Lord has implemented long sector reads via hdparm.  Mark can you
> comment on the feasibility of this idea?
..

The ATA READ/WRITE LONG commands have been obsoleted in the past few ATA specs,
even though most drives continue to implement them.

But not a good avenue.

There's a separate effort, involving drive vendors and kernel hackers,
to provide end-to-end CRC protection of data.  I forget what it was called,
but that's the future of this stuff for high-reliability requirements.

Cheers

* Re: RFC: detection of silent corruption via ATA long sector reads
From: Sitsofe Wheeler @ 2008-12-28 22:40 UTC
  To: Mark Lord; +Cc: Greg Freemyer, Redeeman, piergiorgio.sartor, neilb, linux-raid

Mark Lord wrote:

> There's a separate effort, involving drive vendors and kernel hackers,
> to provide end-to-end CRC protection of data.  I forget what it was called,
> but that's the future of this stuff for high-reliability requirements.

Are you thinking of BLK_DEV_INTEGRITY which tries to support T10/SCSI 
Data Integrity Field or the T13/ATA External Path Protection?

* Re: RFC: detection of silent corruption via ATA long sector reads
From: Mark Lord @ 2008-12-30 13:48 UTC
  To: Sitsofe Wheeler
  Cc: Greg Freemyer, Redeeman, piergiorgio.sartor, neilb, linux-raid

Sitsofe Wheeler wrote:
> Mark Lord wrote:
> 
>> There's a separate effort, involving drive vendors and kernel hackers,
>> to provide end-to-end CRC protection of data.  I forget what it was 
>> called,
>> but that's the future of this stuff for high-reliability requirements.
> 
> Are you thinking of BLK_DEV_INTEGRITY which tries to support T10/SCSI 
> Data Integrity Field or the T13/ATA External Path Protection?
..

One or both of those, I think.  Bad memory here, though!  :)

* Re: RFC: detection of silent corruption via ATA long sector reads
From: Greg Freemyer @ 2009-01-02 20:26 UTC
  To: Sitsofe Wheeler
  Cc: Mark Lord, Redeeman, piergiorgio.sartor, neilb, linux-raid,
	IDE/ATA development list

On Sun, Dec 28, 2008 at 5:40 PM, Sitsofe Wheeler <sitsofe@yahoo.com> wrote:
> Mark Lord wrote:
>
>> There's a separate effort, involving drive vendors and kernel hackers,
>> to provide end-to-end CRC protection of data.  I forget what it was
>> called,
>> but that's the future of this stuff for high-reliability requirements.
>
> Are you thinking of BLK_DEV_INTEGRITY which tries to support T10/SCSI Data
> Integrity Field or the T13/ATA External Path Protection?
>

I see that my Opensuse kernel has CONFIG_BLK_DEV_INTEGRITY enabled and
that block layer changes have been implemented and documented in

Documentation/block/data-integrity.txt

I also see Device Mapper support was discussed in Oct.  (My 2.6.27
kernel does not have those patches).

Is there a more comprehensive write-up / resource that describes the
current status of the overall INTEGRITY support, especially as it
relates to ATA devices?

ie.
Do actual ATA hardware devices that support "T13/ATA External Path
Protection" exist yet?  Does it require HDD and controller support?
Or just HDD?

Does libata support those devices and the extra INTEGRITY bio that
holds the CRC field?

Does mdraid?  Device Mapper?

Thanks
Greg

* Re: RFC: detection of silent corruption via ATA long sector reads
From: Sitsofe Wheeler @ 2009-01-02 20:43 UTC
  To: Greg Freemyer
  Cc: Mark Lord, Redeeman, piergiorgio.sartor, neilb, linux-raid,
	IDE/ATA development list, linux-kernel

> Is there a more comprehensive write-up / resource that describes the
> current status of the overall INTEGRITY support, especially as it
> relates to ATA devices?


Did you check the kernel notes on kernelnewbies when the feature went in - 
http://kernelnewbies.org/Linux_2_6_27 ?

* Re: RFC: detection of silent corruption via ATA long sector reads
From: Greg Freemyer @ 2009-01-02 21:05 UTC
  To: Sitsofe Wheeler
  Cc: Mark Lord, Redeeman, piergiorgio.sartor, neilb, linux-raid,
	IDE/ATA development list, linux-kernel

On Fri, Jan 2, 2009 at 3:43 PM, Sitsofe Wheeler <sitsofe@yahoo.com> wrote:
>> Is there a more comprehensive write-up / resource that describes the
>> current status of the overall INTEGRITY support, especially as it
>> relates to ATA devices?
>
>
> Did you check the kernel notes on kernelnewbies when the feature went in -
> http://kernelnewbies.org/Linux_2_6_27 ?

Interesting read, but it does not really answer the questions I posed.

I did look through the 2.6.27 source I have handy and the only call to
blk_integrity_register() is in ./drivers/scsi/sd_dif.c.

That leaves me with the impression that there are not any ATA devices
claiming support yet.

Greg

* Re: RFC: detection of silent corruption via ATA long sector reads
From: Martin K. Petersen @ 2009-01-02 22:04 UTC
  To: Greg Freemyer
  Cc: Sitsofe Wheeler, Mark Lord, Redeeman, piergiorgio.sartor, neilb,
	linux-raid, IDE/ATA development list

>>>>> "Greg" == Greg Freemyer <greg.freemyer@gmail.com> writes:

Greg> I also see Device Mapper support was discussed in Oct.  (My 2.6.27
Greg> kernel does not have those patches).

See below.

Greg> Is there a more comprehensive write-up / resource that describes
Greg> the current status of the overall INTEGRITY support,

http://oss.oracle.com/projects/data-integrity/documentation/

The status is:

 - The infrastructure in the kernel is in place as of .27.  Hoping to
   get MD/DM support in .29 but I'm running late wrt. the merge window.

 - We recently announced an early adopter program for Oracle DB
   customers.  The ASM component of the database now supports the
   integrity hooks so we can provide true end-to-end integrity
   protection of DB I/O.

 - btrfs support is work in progress.

 - Other people have expressed interest in adding support to ext4 and
   XFS.

Greg> especially as it relates to ATA devices?

ATA support was put on hold in the T13 committee because the drive
vendors don't feel like adding a big, intrusive feature to their
firmware.  I'm still hoping we can eventually get support added to
nearline class drives but it'll be a while.  Market demand needs to be
there first.  I.e. the array vendors that use SATA drives will need to
start asking for it.

We're just, just, just starting to push out FC support.  Then comes SAS.
And then hopefully ATA.

Greg> ie.  Do actual ATA hardware devices that support "T13/ATA External
Greg> Path Protection" exist yet?  Does it require HDD and controller
Greg> support?  Or just HDD?

Both.  You could emulate some of the DIX features in software (like
scatterlist interleaving) and then plug in the long commands on the back
end.  But as Mark said the checksum formats differ between drive
vendors/models.

On SCSI you could conceivably use the block integrity stuff to store an
LVM/MD checksum when used with devices that expose the application tag.

However, it's only a 16-bit field (16 bits - 1 to be exact) so it's not
exactly a lot of space.  And only dumb drives are going to make it
available.  Some RAID controllers are going to keep those 16-bits for
their own internal use.

The main purpose of the block integrity stuff is to protect in-flight
I/O.  Persistence is an optional feature and a side-effect.

So I think it would be much more worthwhile to implement checksumming in
MD/DM without relying on special hardware.  I did some experiments in
that department a few years ago when we were investigating how to go
about fixing some of the data integrity problems in Linux.

I wrote something akin to DIF in software by doing 64 512-byte blocks +
512 bytes of checksums.  The disadvantage there is having to do
read-modify-write for small writes.  I tried several other approaches
sacrificing both space and locality but performance was still anemic.
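
As a rough illustration of the layout described above (not the original
experimental code), the mapping from a logical block to its data sector
and checksum slot might look like the Python sketch below.  The placement
of the checksum sector at the end of each 65-sector group, the 8-byte slot
size (512 / 64) and all function names are assumptions.

```python
BLOCKS_PER_GROUP = 64                         # data sectors covered by one checksum sector
SECTOR_SIZE = 512
SLOT_SIZE = SECTOR_SIZE // BLOCKS_PER_GROUP   # 8 bytes of checksum per block
GROUP_SECTORS = BLOCKS_PER_GROUP + 1          # 64 data + 1 checksum (assumed: checksum last)


def locate(logical_block):
    """Return (data_sector, checksum_sector, slot_offset) on the backing device."""
    group, index = divmod(logical_block, BLOCKS_PER_GROUP)
    base = group * GROUP_SECTORS
    return base + index, base + BLOCKS_PER_GROUP, index * SLOT_SIZE


def sectors_touched_by_write(logical_block):
    """A single-block write must also read, patch and rewrite the checksum sector."""
    data, csum, _ = locate(logical_block)
    return {"read_first": [csum], "write": [data, csum]}
```

With this layout locate(64) yields data sector 65 and checksum sector 129,
so a one-sector write becomes one read plus two writes, which is the
read-modify-write penalty for small writes.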

The reason DIF is implemented the way it is (with 520 byte sectors: 512
bytes followed by 8 bytes of checksum) is to prevent the cost of seeking
to write the protection information elsewhere.  With solid state devices
that seek penalty doesn't exist so this may become less of an issue
going forward.
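
For reference, the 8 bytes of protection information in the T10 format
are a 16-bit guard tag (a CRC of the 512 data bytes), a 16-bit application
tag and a 32-bit reference tag.  A minimal Python sketch of a 520-byte
sector follows, assuming Type 1 protection where the reference tag is the
low 32 bits of the LBA; the function names are made up for illustration.

```python
import struct

T10_DIF_POLY = 0x8BB7  # CRC-16 polynomial used for the DIF guard tag


def crc16_t10dif(data):
    """Bitwise CRC-16, MSB first, initial value 0, no final XOR."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ T10_DIF_POLY) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def make_sector(data, lba, app_tag=0):
    """Append the 8-byte tuple: guard (16), application (16), reference (32), big-endian."""
    assert len(data) == 512
    guard = crc16_t10dif(data)
    return data + struct.pack(">HHI", guard, app_tag, lba & 0xFFFFFFFF)


def verify_sector(sector, lba):
    """Check the guard and reference tags of a 520-byte sector."""
    data = sector[:512]
    guard, _app, ref = struct.unpack(">HHI", sector[512:])
    return guard == crc16_t10dif(data) and ref == (lba & 0xFFFFFFFF)
```

The reference tag check is what catches a sector that is intact but was
read from or written to the wrong LBA.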

The beauty of checksumming in btrfs is that the checksum is stored in
the filesystem metadata which is read/written anyway.  So the only
overhead is in calculating the actual checksum.  That's something
virtual block devices have a much harder time providing because they
don't have metadata describing individual blocks.

That doesn't mean it can't be done but it's a lot more work.  I'm
personally much more interested in adding a retry-other-mirror
interface to MD/DM and leaving the checksumming to the filesystems.

-- 
Martin K. Petersen	Oracle Linux Engineering

* Re: RFC: detection of silent corruption via ATA long sector reads
From: Greg Freemyer @ 2009-01-02 22:41 UTC
  To: Martin K. Petersen
  Cc: Sitsofe Wheeler, Mark Lord, Redeeman, piergiorgio.sartor, neilb,
	linux-raid, IDE/ATA development list

Thanks Martin, comments interspersed

On Fri, Jan 2, 2009 at 5:04 PM, Martin K. Petersen
<martin.petersen@oracle.com> wrote:
>>>>>> "Greg" == Greg Freemyer <greg.freemyer@gmail.com> writes:
>
<snip>

> The status is:
>
>  - The infrastructure in the kernel is in place as of .27.  Hoping to
>   get MD/DM support in .29 but I'm running late wrt. the merge window.

I haven't seen any MD patches at all.  Will the MD support verify the
CRC on read and trigger a RAID re-read other mirror on failure?

>  - We recently announced an early adopter program for Oracle DB
>   customers.  The ASM component of the database now supports the
>   integrity hooks so we can provide true end-to-end integrity
>   protection of DB I/O.

Very cool.

>  - btrfs support is work in progress.
>
>  - Other people have expressed interest in adding support to ext4 and
>   XFS.

Nice, but it seems the block layer will capture the vast majority of issues.

> Greg> especially as it relates to ATA devices?
>
> ATA support was put on hold in the T13 committee because the drive
> vendors don't feel like adding a big, intrusive feature to their
> firmware.  I'm still hoping we can eventually get support added to
> nearline class drives but it'll be a while.  Market demand needs to be
> there first.  I.e. the array vendors that use SATA drives will need to
> start asking for it.
>
> We're just, just, just starting to push out FC support.  Then comes SAS.
> And then hopefully ATA.

The LHC (Large Hadron Collider) people put out a white paper on silent
corruption a year or two ago.  They were very concerned that it could
negatively impact their results.  I don't remember the details, or how
they worked around it.

If they are not already part of your integrity team, you might want to
reach out to them.  And I think they bought / are buying huge amounts
of hardware.

>
> Greg> ie.  Do actual ATA hardware devices that support "T13/ATA External
> Greg> Path Protection" exist yet?  Does it require HDD and controller
> Greg> support?  Or just HDD?
>
> Both.  You could emulate some of the DIX features in software (like
> scatterlist interleaving) and then plug in the long commands on the back
> end.  But as Mark said the checksum formats differ between drive
> vendors/models.

The linux kernel obviously supports a large amount of vendor specific code.

Maybe the INTEGRITY crc could be calculated on the fly by libata for
at least a few hard drive vendors that have known CRC algorithms used
with the current long sector reads.

ie. When INTEGRITY is enabled and supported hard drives are being read
from, libata requests the long sector with proprietary  CRC and
verifies the vendor specific CRC.  If it looks good, then the vendor
specific CRC is replaced by the SCSI Spec CRC and the sector / bios
are passed up the line just like a supported SCSI device would do.

If those drives started selling well, maybe the drive manufacturers
could be persuaded to implement the full end-to-end protocol.

> On SCSI you could conceivably use the block integrity stuff to store an
> LVM/MD checksum when used with devices that expose the application tag.
>
> However, it's only a 16-bit field (16 bits - 1 to be exact) so it's not
> exactly a lot of space.  And only dumb drives are going to make it
> available.  Some RAID controllers are going to keep those 16-bits for
> their own internal use.
>
> The main purpose of the block integrity stuff is to protect in-flight
> I/O.  Persistence is an optional feature and a side-effect.

In-flight is my concern as well.  All of the silent corruption I've
seen and taken the time to troubleshoot was caused by in-flight
errors.  I've seen it be cables, power supply, controller, ram, and
CPU cache at a minimum.

> So I think it would be much more worthwhile to implement checksumming in
> MD/DM without relying on special hardware.  I did some experiments in
> that department a few years ago when we were investigating how to go
> about fixing some of the data integrity problems in Linux.
>
> I wrote something akin to DIF in software by doing 64 512-byte blocks +
> 512 bytes of checksums.  The disadvantage there is having to do
> read-modify-write for small writes.  I tried several other approaches
> sacrificing both space and locality but performance was still anemic.
>
> The reason DIF is implemented the way it is (with 520 byte sectors: 512
> bytes followed by 8 bytes of checksum) is to prevent the cost of seeking
> to write the protection information elsewhere.  With solid state devices
> that seek penalty doesn't exist so this may become less of an issue
> going forward.
>
> The beauty of checksumming in btrfs is that the checksum is stored in
> the filesystem metadata which is read/written anyway.  So the only
> overhead is in calculating the actual checksum.  That's something
> virtual block devices have a much harder time providing because they
> don't have metadata describing individual blocks.
>
> That doesn't mean it can't be done but it's a lot more work.  I'm
> personally much more interested in adding a retry-other-mirror
> interface to MD/DM and leaving the checksumming to the filesystems.

That makes sense as well, but given that most filesystems won't have
inherent INTEGRITY support, the block layer should also be able
to make retry-other-mirror requests of MD / DM.

> --
> Martin K. Petersen      Oracle Linux Engineering
>

Also, is there any effort to add diagnostic messages at the various tiers?

You describe this as end-to-end protection, but when it fails, it
would be extremely useful to check dmesg or something and be able to
see that a sector came in from the controller fine, but was corrupted
later, so CPU / memory is suspected vs. sector came in bad from the
controller, so suspect a problem in the controller / cable / power
supply area.

Greg

* Re: RFC: detection of silent corruption via ATA long sector reads
From: Martin K. Petersen @ 2009-01-03  3:01 UTC
  To: Greg Freemyer
  Cc: Martin K. Petersen, Sitsofe Wheeler, Mark Lord, Redeeman,
	piergiorgio.sartor, neilb, linux-raid, IDE/ATA development list

>>>>> "Greg" == Greg Freemyer <greg.freemyer@gmail.com> writes:

Greg> I haven't seen any MD patches at all.  Will the MD support verify
Greg> the CRC on read and trigger a RAID re-read other mirror on
Greg> failure?

No.  With the data integrity model it is the owner of the integrity
metadata that needs to re-drive the I/O in case of failure.  So that
means the application, filesystem or the block layer depending on who
added it.

The reason for this is twofold:

 1) The owner of the I/O in question has much better knowledge about the
    context.  On a write it can re-run verification checks on its
    buffers before deciding whether to try again, notify the user, etc.

 2) Limiting the number of times we calculate the CRC/checksum.  If
    every layer in the I/O stack did a check things would get painfully
    slow.  So it's better to bubble everything to the top and do it
    once.

That's why it's important to me to ensure that the appropriate signaling
is in place so that upper layers can influence what's going on below.
I.e. telling MD/DM to retry redundant copies.

That said, adding a belt-and-suspenders option to MD/DM to verify all
I/O would be trivial.  But I don't think it's worth it.

Greg> The LHC (Large Hadron Collider) people put out a white paper on
Greg> silent corruption a year or two ago.  They were very concerned
Greg> that it could negatively impact their results.

I've been talking to them on and off.

>> Both.  You could emulate some of the DIX features in software (like
>> scatterlist interleaving) and then plug in the long commands on the
>> back end.  But as Mark said the checksum formats differ between drive
>> vendors/models.

Greg> The linux kernel obviously supports a large amount of vendor
Greg> specific code.

However, the actual ECC stored by disk drives is proprietary.  The drive
vendors have spent years and years refining their algorithms.  I think
it's highly unlikely that they'd be willing to tell us what's in there
and how it's calculated.

I really think you should all just go bug your drive vendors about this
feature.  The ATA add-on (called External Path Protection) was pretty
much fully baked when it was shelved.  It is compatible with the SCSI
ditto so interoperability is a no-brainer.  But the drive vendors fought
it vehemently.

Interestingly enough, SSD vendors seem much more interested in adding
competitive features.

Greg> Maybe the INTEGRITY crc could be calculated on the fly by libata
Greg> for at least a few hard drive vendors that have known CRC
Greg> algorithms used with the current long sector reads.

It's usually an ECC and not a CRC, btw.  And it's relatively big.  It's
not unusual to be able to correct on the order of 50 bytes out of 512.

Greg> ie. When INTEGRITY is enabled and supported hard drives are being
Greg> read from, libata requests the long sector with proprietary CRC
Greg> and verifies the vendor specific CRC.  If it looks good, then the
Greg> vendor specific CRC is replaced by the SCSI Spec CRC and the
Greg> sector / bios are passed up the line just like a supported SCSI
Greg> device would do.

Not necessary.

The integrity infrastructure is completely agnostic to the data
contained in the protection buffer.  It's all done by callbacks
registered with the block device.  And consequently filesystems and
applications operate at the "protect this buffer"/"verify this buffer"
level.  They don't have to know or care about T10, CRCs, ATA or
anything.
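
To illustrate the callback idea in isolation, here is a toy Python model.
It is not the kernel interface: the class name, the device table and the
use of zlib.crc32 as a stand-in checksum are all assumptions made for the
sketch.

```python
import zlib


class IntegrityProfile:
    """Pair of callbacks a device registers; callers never look inside the tag."""

    def __init__(self, name, generate, verify):
        self.name = name
        self.generate = generate
        self.verify = verify


# Stand-in profile using CRC-32 from zlib (a real device would supply its own format).
crc32_profile = IntegrityProfile(
    "toy-crc32",
    generate=lambda buf: zlib.crc32(buf).to_bytes(4, "big"),
    verify=lambda buf, tag: zlib.crc32(buf).to_bytes(4, "big") == tag,
)

registered = {"sda": crc32_profile}  # hypothetical device -> profile table


def protect(device, buf):
    """'Protect this buffer': returns an opaque tag from the device's profile."""
    return registered[device].generate(buf)


def verify(device, buf, tag):
    """'Verify this buffer': the caller only learns pass or fail."""
    return registered[device].verify(buf, tag)
```

Swapping in a different profile for another device changes the tag format
without touching the callers of protect() and verify().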

The actual format is negotiated in case of MD/DM that spans devices with
potentially different capabilities/checksum formats.  With SCSI we have
the luxury that the CRC is mandatory so we can always fall back to that.

Greg> In-flight is my concern as well.  All of the silent corruption
Greg> I've seen and taken the time to troubleshoot was caused by
Greg> in-flight errors.  I've seen it be cables, power supply,
Greg> controller, ram, and CPU cache at a minimum.

Yup.

Greg> That makes sense as well, but given that most filesystems won't
Greg> have inherent INTEGRITY support, the block layer should also
Greg> be able to make retry-other-mirror requests of MD / DM.

Well, this is somewhat orthogonal.  A drive is not going to return good
sense information if the CRC didn't match the data.  So the I/O is going
to fail and DM/MD can retry at will.  In that case it doesn't really
matter what caused the failure and DM/MD will retry regardless.

You could argue that the data could still be corrupted on the way back
from the drive.  But I haven't seen that happen much.  In any case, the
verification further up the stack is going to catch the mismatch.

Most of the errors I see on READ are due to DMAs that for whatever
reason didn't actually happen.

That's actually a fun thing to do: Poison all pages in the target
scatterlist before issuing a READ.  I've had to do that several times to
prove that transfers went missing in action.
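
The poisoning trick can be sketched in user-space Python as follows.  The
poison pattern, the page size and the fake_dma_read() stand-in for the
device are assumptions for illustration only.

```python
PAGE_SIZE = 4096
POISON = b"\xde\xad\xbe\xef" * (PAGE_SIZE // 4)  # arbitrary, easily recognised pattern


def poison_pages(n_pages):
    """Pre-fill every target page with the poison pattern before the READ is issued."""
    return [bytearray(POISON) for _ in range(n_pages)]


def missing_transfers(pages):
    """Indices of pages still holding poison, i.e. pages the transfer never wrote."""
    return [i for i, page in enumerate(pages) if bytes(page) == POISON]


def fake_dma_read(pages, skip=()):
    """Stand-in for the device: fills pages with data but 'loses' those in skip."""
    for i, page in enumerate(pages):
        if i not in skip:
            page[:] = bytes([i & 0xFF]) * PAGE_SIZE
```

Any page that still matches the pattern after completion was never
written, barring the unlikely case that the on-disk data equals the
poison itself.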

Greg> Also, is there any effort to add diagnostic messages at the
Greg> various tiers?

Greg> You describe this as end-to-end protection, but when it fails, it
Greg> would be extremely useful to check dmesg or something and be able
Greg> to see that a sector came in from the controller fine, but was
Greg> corrupted later, so CPU / memory is suspected vs. sector came in
Greg> bad from the controller, so suspect a problem in the controller /
Greg> cable / power supply area.

Right now we distinguish between errors caught by the HBA and errors
caught by the target device.

A big problem we're trying to tackle is the case where a write is
acknowledged by the RAID controller and stored in non-volatile memory
there.  Once the RAID controller commits the write to an actual disk the
write fails and for some reason the RAID controller doesn't succeed in
writing the block elsewhere.  In that case the original I/O has been
completed at the OS level.  There's really no means for the array head
to come back and say "Oh, btw. that I/O that I acked a while ago didn't
actually make it".  And even if it did we would have forgotten all about
the context of that I/O so it wouldn't be of much help.

So out of band error reporting like that (that also involves SAN
switches) is a topic for discussion within the SNIA Data Integrity TWG.

-- 
Martin K. Petersen	Oracle Linux Engineering

* Re: RFC: detection of silent corruption via ATA long sector reads
From: John Robinson @ 2009-01-03 13:20 UTC
  To: Martin K. Petersen; +Cc: linux-raid

On 02/01/2009 22:04, Martin K. Petersen wrote:
[...]
> I wrote something akin to DIF in software by doing 64 512-byte blocks +
> 512 bytes of checksums.  The disadvantage there is having to do
> read-modify-write for small writes.  I tried several other approaches
> sacrificing both space and locality but performance was still anemic.

Excuse me if I'm being dense - and indeed tell me! - but RAID 4/5/6
already suffer from having to do read-modify-write for small writes, so
is there any chance this could be done at relatively little additional
expense for these?

Cheers,

John.


* Re: RFC: detection of silent corruption via ATA long sector reads
From: Martin K. Petersen @ 2009-01-04  7:37 UTC
  To: John Robinson; +Cc: Martin K. Petersen, linux-raid

>>>>> "John" == John Robinson <john.robinson@anonymous.org.uk> writes:

John> Excuse me if I'm being dense - and indeed tell me! - but RAID
John> 4/5/6 already suffer from having to do read-modify-write for
John> small writes, so is there any chance this could be done at
John> relatively little additional expense for these?

You'd still need to store a checksum somewhere else, incurring
additional seek cost.  You could attempt to weasel out of that by adding
the checksum sector after a limited number of blocks and hope that you'd
be able to pull it in or write it out in one sweep.

The downside: assume we do checksums on - say - 8KB chunks in the
RAID5 case.  We only need to store a few handfuls of bytes of checksum
goo per block.  But we can't address less than a 512 byte sector.  So we
either waste the bulk of 1 sector for every 16 to increase the
likelihood of adjacent access, or we push the checksum sector further
out to fill it completely.  The latter wastes less space but has a
higher chance of causing an extra seek.  Pick your poison.
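
To put rough numbers on the two options, here is a small sketch. The 16-byte checksum entry is an assumed figure standing in for "a few handfuls of bytes", and `layout` is a hypothetical helper name.

```python
SECTOR = 512
CHUNK = 8 * 1024        # 8KB checksummed unit, from the text
CSUM_BYTES = 16         # assumed size of one checksum entry (hypothetical)

def layout(chunks_per_csum_sector):
    """Describe a layout where one checksum sector serves N data chunks."""
    data_sectors = chunks_per_csum_sector * CHUNK // SECTOR
    used = chunks_per_csum_sector * CSUM_BYTES
    return {
        "data_sectors": data_sectors,               # distance covered by one checksum sector
        "space_overhead": 1 / (data_sectors + 1),   # fraction of disc spent on checksums
        "csum_sector_fill": used / SECTOR,          # how much of the checksum sector is used
    }

adjacent = layout(1)                     # checksum sector right after every chunk
packed = layout(SECTOR // CSUM_BYTES)    # checksum sector filled completely
```

Under these assumptions the adjacent layout spends one sector in 17 on checksums and uses about 3% of each checksum sector, while the packed layout spends one sector in 513 but places the checksum up to 512 sectors away from the data it covers.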

The reason I'm advocating checksumming on logical (filesystem) blocks is
that the filesystems have a much better idea what's good and what's bad
in a recovery situation.  And the filesystems already have an
infrastructure for storing metadata like checksums.  The cost of
accessing that metadata is inherent and inevitable.

btrfs had checksums from the get-go.  The XFS folks are working hard on
adding them.  ext4 is going to checksum metadata, I believe.  So this is
stuff that's already in the pipeline.

We also don't want to do checksumming at every layer.  That's going to
suck from a performance perspective.  It's better to do checksumming
high up in the stack and only do it once, as long as we give the upper
layers the option of re-driving the I/O.

That involves adding a cookie to each bio that gets filled out by DM/MD
on completion.  If the filesystem checksum fails we can resubmit the I/O
and pass along the cookie indicating that we want a different copy than
the one the cookie represents.
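
As a userspace toy model of the cookie idea (not kernel code, and not an existing bio interface): the `Mirror` class, its `read` signature and the use of CRC32 are all assumptions made for illustration.

```python
import zlib

class Mirror:
    """Toy two-way mirror standing in for MD/DM (hypothetical interface)."""
    def __init__(self, copies):
        self.copies = copies            # list of dicts: block number -> bytes

    def read(self, block, avoid=None):
        """Return (data, cookie); the cookie names the copy that was used."""
        for cookie, copy in enumerate(self.copies):
            if cookie != avoid:
                return copy[block], cookie
        raise IOError("no alternative copy left")

def fs_read(mirror, block, expected_crc):
    """Filesystem-side read: verify once, re-drive the I/O on a mismatch."""
    data, cookie = mirror.read(block)
    if zlib.crc32(data) == expected_crc:
        return data
    # Checksum failed: resubmit, passing the cookie back so the lower
    # layer returns a different copy than the one that just failed.
    data, _ = mirror.read(block, avoid=cookie)
    if zlib.crc32(data) != expected_crc:
        raise IOError("all copies failed checksum")
    return data
```

The point of the sketch is that only the filesystem computes a checksum, while the redundancy layer only needs to remember which copy it handed out.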

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: RFC: detection of silent corruption via ATA long sector reads
  2009-01-04  7:37           ` Martin K. Petersen
@ 2009-01-04 12:31             ` John Robinson
  2009-01-04 13:49               ` John Robinson
  2009-01-05  2:45               ` Martin K. Petersen
  0 siblings, 2 replies; 18+ messages in thread
From: John Robinson @ 2009-01-04 12:31 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: linux-raid

On 04/01/2009 07:37, Martin K. Petersen wrote:
>>>>>> "John" == John Robinson <john.robinson@anonymous.org.uk> writes:
> 
> John> Excuse me if I'm being dense - and indeed tell me! - but RAID
> John> 4/5/6 already suffer from having to do read-modify-write for
> John> small writes, so is there any chance this could be done at
> John> relatively little additional expense for these?
> 
> You'd still need to store a checksum somewhere else, incurring
> additional seek cost.  You could attempt to weasel out of that by adding
> the checksum sector after a limited number of blocks and hope that you'd
> be able to pull it in or write it out in one sweep.
> 
> The downside: assume we do checksums on - say - 8KB chunks in the
> RAID5 case.  We only need to store a few handfuls of bytes of checksum
> goo per block.  But we can't address less than a 512 byte sector.  So we
> either waste the bulk of 1 sector for every 16 to increase the
> likelihood of adjacent access, or we push the checksum sector further
> out to fill it completely.  The latter wastes less space but has a
> higher chance of causing an extra seek.  Pick your poison.

Well, I was assuming that MD/DM operates in chunk size amounts (e.g. 32K 
or 64 sectors) anyway, and having a sector or two of checksums on disc 
immediately following each chunk would be a pretty small cost, 
increasing each read or write cycle only marginally (e.g. to 65 
sectors), which shouldn't cause much drop in performance (I guess 1/64th 
in throughput and IOPS, if the discs themselves are the bottleneck). 
Essentially DIF on 32k blocks instead of 512 byte ones. But perhaps this 
is a bad assumption and MD/DM already optimises out whole-chunk reads 
and writes where they're not required (for very short, 
less-than-one-chunk transactions), and I've no idea whether this happens 
a lot.
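
A quick way to see why the whole-chunk assumption matters is sketched below. It assumes one trailing checksum sector per 64-sector chunk, chunk-aligned requests, and that verifying any part of a chunk means transferring all of it; the function names are hypothetical.

```python
CHUNK_SECTORS = 64   # 32K chunk, from the text
CSUM_SECTORS = 1     # assumed: one trailing checksum sector per chunk

def sectors_transferred(request_sectors):
    """Sectors moved for a chunk-aligned request when every touched
    chunk has to be transferred whole, together with its checksum sector."""
    chunks = -(-request_sectors // CHUNK_SECTORS)   # ceiling division
    return chunks * (CHUNK_SECTORS + CSUM_SECTORS)

def amplification(request_sectors):
    """Ratio of sectors transferred to sectors actually requested."""
    return sectors_transferred(request_sectors) / request_sectors
```

For a full-chunk request the ratio is 65/64, the marginal cost estimated above; for a single 4K request (8 sectors) it is 65/8, which is the case where the assumption would hurt.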

> The reason I'm advocating checksumming on logical (filesystem) blocks is
> that the filesystems have a much better idea what's good and what's bad
> in a recovery situation.  And the filesystems already have an
> infrastructure for storing metadata like checksums.  The cost of
> accessing that metadata is inherent and inevitable.

Yes, I can see that. But the old premise that RAID tried to maintain was 
that disc sectors don't go bad. You're quite reasonably dropping the 
premise rather than trying to do more to maintain it. There might be 
validity to both approaches.

> We also don't want to do checksumming at every layer.  That's going to
> suck from a performance perspective.  It's better to do checksumming
> high up in the stack and only do it once.  As long as we give the upper
> layers the option of re-driving the I/O.
> 
> That involves adding a cookie to each bio that gets filled out by DM/MD
> on completion.  If the filesystem checksum fails we can resubmit the I/O
> and pass along the cookie indicating that we want a different copy than
> the one the cookie represents.

I'd like to understand this mechanism better; at first glance it's 
either going to be too simplistic and not cover the various block layer 
cases well, or it means you end up re-implementing RAID and LVM in the 
filesystem.

Just my €$£0.02 of course.

Cheers,

John.



* Re: RFC: detection of silent corruption via ATA long sector reads
  2009-01-04 12:31             ` John Robinson
@ 2009-01-04 13:49               ` John Robinson
  2009-01-05  2:43                 ` Martin K. Petersen
  2009-01-05  2:45               ` Martin K. Petersen
  1 sibling, 1 reply; 18+ messages in thread
From: John Robinson @ 2009-01-04 13:49 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: linux-raid

On 04/01/2009 12:31, John Robinson wrote:
> On 04/01/2009 07:37, Martin K. Petersen wrote:
[...]
>> We also don't want to do checksumming at every layer.  That's going to
>> suck from a performance perspective.  It's better to do checksumming
>> high up in the stack and only do it once.  As long as we give the upper
>> layers the option of re-driving the I/O.
>>
>> That involves adding a cookie to each bio that gets filled out by DM/MD
>> on completion.  If the filesystem checksum fails we can resubmit the I/O
>> and pass along the cookie indicating that we want a different copy than
>> the one the cookie represents.
> 
> I'd like to understand this mechanism better; at first glance it's 
> either going to be too simplistic and not cover the various block layer 
> cases well, or it means you end up re-implementing RAID and LVM in the 
> filesystem.

I've thought about this again, and I'm wrong; there may be complications 
in handling the cookies up and down the stack where more than one layer 
thinks it knows how to have another go, but I can see what you describe 
as being useful and relatively device-agnostic.

I wonder if there might also be scope for cookies going down through the 
stack to carry an indication of how hard to try; some filesystems or 
other consumers of block devices may be willing to ask again or want to 
be told about problems quickly (e.g. btrfs over RAID over TLER-equipped 
discs), while some may need best efforts all out first time because they 
can't cope with failure returns (e.g. FAT over cheap IDE discs).

Anyway, I think I'd better leave all this to the experts i.e. you :-)

Cheers,

John.


* Re: RFC: detection of silent corruption via ATA long sector reads
  2009-01-04 13:49               ` John Robinson
@ 2009-01-05  2:43                 ` Martin K. Petersen
  0 siblings, 0 replies; 18+ messages in thread
From: Martin K. Petersen @ 2009-01-05  2:43 UTC (permalink / raw)
  To: John Robinson; +Cc: Martin K. Petersen, linux-raid

>>>>> "John" == John Robinson <john.robinson@anonymous.org.uk> writes:

John> I've thought about this again, and I'm wrong; there may be
John> complications in handling the cookies up and down the stack where
John> more than one layer thinks it knows how to have another go, but I
John> can see what you describe as being useful and relatively
John> device-agnostic.

Yeah, care will need to be taken if you have multiple layers in the
stack providing redundancy.  That's usually not the case, though.

John> I wonder if there might also be scope for cookies going down
John> through the stack to carry an indication of how hard to try; some
John> filesystems or other consumers of block devices may be willing to
John> ask again or want to be told about problems quickly (e.g. btrfs
John> over RAID over TLER-equipped discs), while some may need best
John> efforts all out first time because they can't cope with failure
John> returns (e.g. FAT over cheap IDE discs).

We already have this functionality.  It's orthogonal to the integrity
bits.  You can tell the low-level drivers either to fail a request
immediately or to retry it.

That's only a software thing, though.  It doesn't work terribly well
with consumer harddrives that assume there's only one copy of the data
and consequently enter annoying-click-mode and retry for a long time.
Nearline and enterprise drives assume there's a redundant copy and will
not try as hard under the assumption that you know how to remedy the
problem.

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: RFC: detection of silent corruption via ATA long sector reads
  2009-01-04 12:31             ` John Robinson
  2009-01-04 13:49               ` John Robinson
@ 2009-01-05  2:45               ` Martin K. Petersen
  2009-01-05  3:24                 ` NeilBrown
  1 sibling, 1 reply; 18+ messages in thread
From: Martin K. Petersen @ 2009-01-05  2:45 UTC (permalink / raw)
  To: John Robinson; +Cc: Martin K. Petersen, linux-raid

>>>>> "John" == John Robinson <john.robinson@anonymous.org.uk> writes:

John> Essentially DIF on 32k blocks instead of 512 byte ones. But
John> perhaps this is a bad assumption and MD/DM already optimises out
John> whole-chunk reads and writes where they're not required (for very
John> short, less-than-one-chunk transactions), and I've no idea whether
John> this happens a lot.

I haven't looked at the RAID4/5/6 code for a long time so I'm not sure
whether they only write dirty pages or the whole chunk + parity ditto.
Neil?

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: RFC: detection of silent corruption via ATA long sector reads
  2009-01-05  2:45               ` Martin K. Petersen
@ 2009-01-05  3:24                 ` NeilBrown
  0 siblings, 0 replies; 18+ messages in thread
From: NeilBrown @ 2009-01-05  3:24 UTC (permalink / raw)
  Cc: John Robinson, Martin K. Petersen, linux-raid

On Mon, January 5, 2009 1:45 pm, Martin K. Petersen wrote:
>>>>>> "John" == John Robinson <john.robinson@anonymous.org.uk> writes:
>
> John> Essentially DIF on 32k blocks instead of 512 byte ones. But
> John> perhaps this is a bad assumption and MD/DM already optimises out
> John> whole-chunk reads and writes where they're not required (for very
> John> short, less-than-one-chunk transactions), and I've no idea whether
> John> this happens a lot.
>
> I haven't looked at the RAID4/5/6 code for a long time so I'm not sure
> whether they only write dirty pages or the whole chunk + parity ditto.
> Neil?

md/RAID456 writes whole pages (aligned to the array) but not whole chunks.

If a filesystem requests a write of one page at a sector address which
is not a multiple of the page size, we will pre-read the rest of the
two array-aligned pages, and then write them (and parity) back out.

Otherwise, it will just write the requested pages plus parity updates.
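
The alignment rule described above can be modelled roughly as follows. This sketch assumes 4 KiB pages (8 sectors) and ignores parity handling altogether; `plan_write` is a hypothetical helper, not an md function.

```python
PAGE_SECTORS = 8    # assumes 4 KiB pages of 512-byte sectors

def plan_write(start, count):
    """For a write of `count` sectors at sector `start`, return
    (array-aligned pages written, sectors that must be pre-read)."""
    first_page = start // PAGE_SECTORS
    last_page = (start + count - 1) // PAGE_SECTORS
    pages = list(range(first_page, last_page + 1))
    # Whatever part of the touched pages the request does not cover
    # has to be read in before the whole pages can be written back.
    preread = PAGE_SECTORS * len(pages) - count
    return pages, preread
```

A page-sized write at sector 8 touches only page 1 and needs no pre-read, while the same write at sector 4 straddles pages 0 and 1 and needs 8 sectors pre-read.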

NeilBrown




end of thread, other threads:[~2009-01-05  3:24 UTC | newest]

Thread overview: 18+ messages
     [not found] <fa.8mwKV7y4hm+Q6mvIKtp9QGoJYUU@ifi.uio.no>
     [not found] ` <fa.4QcsYZC0gJJwJ0eUOht3hDYaVWs@ifi.uio.no>
2008-12-28 22:40   ` RFC: detection of silent corruption via ATA long sector reads Sitsofe Wheeler
2008-12-30 13:48     ` Mark Lord
2009-01-02 20:26     ` Greg Freemyer
2009-01-02 20:43       ` Sitsofe Wheeler
2009-01-02 21:05         ` Greg Freemyer
2009-01-02 22:04       ` Martin K. Petersen
2009-01-02 22:41         ` Greg Freemyer
2009-01-03  3:01           ` Martin K. Petersen
2009-01-03 13:20         ` John Robinson
2009-01-04  7:37           ` Martin K. Petersen
2009-01-04 12:31             ` John Robinson
2009-01-04 13:49               ` John Robinson
2009-01-05  2:43                 ` Martin K. Petersen
2009-01-05  2:45               ` Martin K. Petersen
2009-01-05  3:24                 ` NeilBrown
2008-12-26 21:44 Greg Freemyer
2008-12-26 22:15 ` Robert Hancock
2008-12-28 22:26 ` Mark Lord
