public inbox for linux-xfs@vger.kernel.org
From: "William J. Earl" <earl@agami.com>
To: xfs-oss <xfs@oss.sgi.com>
Cc: David Chinner <dgc@sgi.com>, Michael Nishimoto <miken@agami.com>,
	markgw@sgi.com
Subject: Re: RFC: log record CRC validation
Date: Fri, 27 Jul 2007 19:00:32 -0700	[thread overview]
Message-ID: <46AAA340.60208@agami.com> (raw)
In-Reply-To: <20070726233129.GM12413810@sgi.com>

David Chinner wrote:
> On Thu, Jul 26, 2007 at 10:53:02AM -0700, Michael Nishimoto wrote:
>   
> ...
>> Is CRC checking being added to xfs log data?
>>     
>
> Yes. It's a little-used debug option right now, and I'm
> planning on making it the default behaviour.
>
>   
>> If so, what data has been collected to show that this needs to be added?
>>     
>
> The sizes of high-end filesystems are now of the same order of
> magnitude as the bit error rates of the storage hardware, e.g. 1PB =
> 10^16 bits. The bit error rate of high-end FC drives? 1 in 10^16
> bits.  For "enterprise" SATA drives? 1 in 10^15 bits. For desktop
> SATA drives it's 1 in 10^14 bits (i.e. 1 in 10TB).
>
> We've got filesystems capable of moving > 2 x 10^16 bits of data
> *per day* and we see lots of instances of multi-TB arrays made out
> of desktop SATA disks. Given the recent studies of long-term disk
> reliability, these vendor figures are likely to be the best error
> rates we can hope for.....
>
> IOWs, we don't need evidence to justify this sort of error detection
> because simple maths says there are going to be errors.  We
> have to deal with that, and hence we are going to be adding CRC
> checking to on-disk metadata structures so we can detect bit errors
> that would otherwise go undetected and result in filesystem
> corruption.
>
> This means that instead of getting shutdown reports for some strange
> and unreproducible btree corruption, we'll get a shutdown for a CRC
> failure on the btree block. It is very likely that this will occur
> much earlier than a subtle btree corruption would otherwise be
> detected and hence we'll be less likely to propagate errors around
> the filesystem.
> ...
       Mike Nishimoto pointed out this thread to me and suggested I
reply, since I have worked on analyzing disk failure modes.

       First, note that the claimed bit error rates are rates of 
reported bad blocks, not rates of silent data corruption.   The latter, 
while not quoted, are far lower.

       With large modern disks, it is unrealistic not to use RAID 1,
RAID 5, or RAID 6 to mask reported disk bit errors.   A CRC in a
filesystem data structure, however, is not required to detect disk
errors, even without RAID protection, since the disks themselves
report error correction failures.   The rates of 1 in 10**14 (desktop
SATA), 1 in 10**15 (enterprise SATA and some FC), and 1 in 10**16
(some FC) bits read are detected and reported error rates.   From
conversations with drive vendors, these figures are actually fairly
conservative: they assume you write the data once and read it back
after 10 years without rewriting it, as in an archive.   That is the
worst case, since you have ten years of accumulated deterioration.
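
       To put those rates in concrete terms, here is a
back-of-the-envelope sketch (purely illustrative; the drive classes
are the ones quoted above, and the 10 TB read volume is an arbitrary
example):

/* Expected *reported* (not silent) read errors for a given volume
 * of data read, at the vendor-quoted worst-case bit error rates. */
#include <stdio.h>

int main(void)
{
    struct { const char *name; double bits_per_error; } rates[] = {
        { "desktop SATA",       1e14 },
        { "enterprise SATA/FC", 1e15 },
        { "high-end FC",        1e16 },
    };
    double bytes_read = 10e12;   /* say, 10 TB read back from one drive */
    int i;

    for (i = 0; i < 3; i++)
        printf("%-19s: %.3f expected reported errors per 10 TB read\n",
               rates[i].name, bytes_read * 8.0 / rates[i].bits_per_error);
    return 0;
}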

       With RAID protection, seeing reported errors which are not
masked by the RAID layer is extremely unlikely, except in the case of
a drive failure, where it is necessary to read all of the surviving
drives to reconstruct the contents of the lost drive.    With 500 GB
drives and a 7D+1P RAID 5 group, we need to read the seven surviving
drives in full, 3.5 TB, to rebuild the RAID array onto a replacement
drive, which is to say about 2.8 * 10**13 bits.   This implies we
would see one block not reconstructed in roughly every 3 to 4 rebuilds
on desktop SATA drives, if the data were archive data written once and
the rebuild happened after 10 years.     We would expect about one
rebuild per group in 10 years.

       With more RAID groups, the chance of data loss grows rapidly.
With 100 groups, we would see a rebuild every month or so, so we would
expect an unmasked read error every few months.    With better drives,
the chance is of course reduced, but it is still significant in
multi-PB storage systems.   RAID 6 largely eliminates the chance of
seeing a read error, at some cost in performance.
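
       The arithmetic behind the last two paragraphs, as a sketch
(drive capacity, group geometry, error rate, and rebuild frequency are
the assumptions stated above, not measurements):

/* Expected unrecovered read errors during RAID 5 rebuilds. */
#include <stdio.h>

int main(void)
{
    double capacity_bits = 500e9 * 8.0;  /* 500 GB per drive */
    int    survivors     = 7;            /* 7D+1P group, one drive lost */
    double ber           = 1e14;         /* desktop SATA, worst case */

    double bits_read   = survivors * capacity_bits;  /* ~2.8e13 bits */
    double per_rebuild = bits_read / ber;            /* ~0.28 errors */

    int    groups            = 100;
    double rebuilds_per_year = groups / 10.0; /* one per group per decade */

    printf("expected unrecovered blocks per rebuild: %.2f\n", per_rebuild);
    printf("expected unmasked errors per year across %d groups: %.1f\n",
           groups, per_rebuild * rebuilds_per_year);
    return 0;
}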

       Note that some people have reported both much higher annual
failure rates for drives (which increase the frequency of RAID
reconstructions and hence the chance of data loss) and higher read
error rates.    Based on personal experience with a large number of
drives, I
believe that both of these are a consequence of systems (including 
software and disk host bus adapters) dealing poorly with common 
transient disk problems, not actual errors on the surface of the disk.   
For example, if the drive firmware gets confused and stops talking, the 
controller will treat it as a read timeout, and the software may simply 
report "read failure", which in turn may be interpreted as "drive 
failed", even though an adequate drive reset would return the drive to 
service with no harm done to the data.  
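
       As a sketch of the kind of recovery I mean, consider retrying
with a reset before declaring the drive failed.  The disk_read() and
disk_reset() helpers here are hypothetical stand-ins, not any real
driver API; the stubs merely simulate one transient firmware hang:

#include <stdio.h>

enum io_status { IO_OK, IO_TIMEOUT, IO_MEDIA_ERROR };

static int hung = 1;                 /* simulate one firmware hang */

static enum io_status disk_read(void *buf)
{
    (void)buf;
    return hung ? IO_TIMEOUT : IO_OK;
}

static void disk_reset(void) { hung = 0; }

static enum io_status robust_read(void *buf)
{
    int attempt;

    for (attempt = 0; attempt < 3; attempt++) {
        enum io_status s = disk_read(buf);

        if (s != IO_TIMEOUT)
            return s;        /* success, or a genuine media error */
        disk_reset();        /* likely a firmware hang, not bad media */
    }
    return IO_TIMEOUT;       /* only now consider the drive failed */
}

int main(void)
{
    char buf[512];

    printf("read %s\n", robust_read(buf) == IO_OK ? "ok" : "failed");
    return 0;
}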

       In addition, many people have taken to using desktop drives for
primary storage, for which they are not designed.   Desktop drives are
typically rated at 800,000 to 1,000,000 hours MTBF at 30% duty
cycle.    Using them at 100% duty cycle drastically decreases their
MTBF, which in turn sharply increases the rate of unmasked read errors
as a consequence of the extra RAID reconstructions.
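
       To make the duty-cycle effect concrete, a rough annualized
failure rate (AFR) sketch.  The 4x MTBF derating at 100% duty cycle is
an assumption for illustration only; vendors do not publish such a
factor:

/* Rough annualized failure rate (AFR) from a quoted MTBF. */
#include <stdio.h>

int main(void)
{
    double mtbf_hours     = 1.0e6;     /* rated at 30% duty cycle */
    double hours_per_year = 8760.0;

    /* At the rated duty cycle: active 30% of the year. */
    double afr_rated = hours_per_year * 0.30 / mtbf_hours;

    /* Assumed (not published) MTBF penalty for running 24x7. */
    double derate        = 4.0;
    double afr_always_on = hours_per_year / (mtbf_hours / derate);

    printf("AFR at rated duty cycle:            %.2f%%\n",
           afr_rated * 100.0);
    printf("AFR at 100%% duty (assumed %gx hit): %.2f%%\n",
           derate, afr_always_on * 100.0);
    return 0;
}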

      Lastly, quite a few "white box" vendors have shipped chassis which 
do not adequately cool and power the drives, and excessive heat can also 
drastically reduce the MTBF.

       In a well-designed chassis, with software which correctly 
recovers from transient drive issues, I have observed higher than 
claimed MTBF and much lower than claimed bit error rates.    The 
undetected error rate from a modern drive is not quoted publicly, but 
the block ECC is quite strong (since it has to mask raw bit error rates 
as high as 1 in 10**3) and hence can detect most error scenarios and 
correct many of them.

       None of the above, however, implies that we need CRCs on
filesystem data structures.   That is, if you get EIO on a disk read,
you don't need a CRC to know the block is bad.    Other concerns can
motivate having CRCs.   In particular, if the path between drive and
memory can corrupt the data, then CRCs can help us recover to some
extent.    This has been a recurring problem with various
technologies, but was particularly common on IDE (PATA) drives with
their vulnerable cables, where mishandling of the cable could lead to
silent data corruption.    CRCs on just filesystem data structures,
though, only help with metadata integrity, leaving file data integrity
in the dark.   Some vendors at various times, such as Tandem and
NetApp, have added a checksum to each block, usually when they were
having data integrity issues which turned out in the end to be bad
cables or bad software, but which could be masked by RAID recovery if
detected by the block checksum.   It is usually more cost-effective to
simply select a reliable disk subsystem.
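
       For concreteness, a minimal sketch of a per-block metadata
checksum of the sort being proposed, using CRC32c.  The block layout
and helper names are hypothetical, and any real implementation would
differ in detail:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* CRC32c (Castagnoli), reflected polynomial, bit-at-a-time. */
static uint32_t crc32c(const void *buf, size_t len)
{
    const uint8_t *p = buf;
    uint32_t crc = ~0U;

    while (len--) {
        int i;

        crc ^= *p++;
        for (i = 0; i < 8; i++)
            crc = (crc >> 1) ^ ((crc & 1) ? 0x82F63B78 : 0);
    }
    return ~crc;
}

/* Hypothetical on-disk metadata block: the CRC lives in the header
 * and is computed with the CRC field itself zeroed. */
struct meta_block {
    uint32_t crc;
    uint8_t  payload[508];
};

static void meta_block_stamp(struct meta_block *b)
{
    b->crc = 0;
    b->crc = crc32c(b, sizeof(*b));
}

/* Returns nonzero if the block survived the path from disk intact. */
static int meta_block_verify(struct meta_block *b)
{
    uint32_t want = b->crc;
    uint32_t got;

    b->crc = 0;
    got = crc32c(b, sizeof(*b));
    b->crc = want;
    return got == want;
}

int main(void)
{
    struct meta_block b = { 0 };

    meta_block_stamp(&b);
    b.payload[42] ^= 1;    /* simulate a bit flipped in the path */
    printf("verify after bit flip: %s\n",
           meta_block_verify(&b) ? "ok" : "corruption detected");
    return 0;
}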

       With SATA, SAS, and FC, which have link-level integrity checks, 
silent data corruption on the link is unlikely.   This leaves mainly the 
host bus adapter, the main memory, and the path between them.   If those 
are bad, however, it is hard to see how much the filesystem can help.

       In conclusion, I doubt that CRCs are worth the added
complexity.    If I wanted to mask flaky hardware, I would look at
using RAID 6, validating parity on all reads, and doing RAID recovery
on any errors; but offline testing of the hardware, and repairing or
replacing any broken parts, would be simpler still.
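
       As a sketch of that verify-on-read idea for a simple XOR-parity
(RAID 5 style) stripe; the geometry and buffers are illustrative only:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NDATA 7      /* data blocks per stripe (7D+1P) */
#define BLK   512    /* bytes per block */

/* Verify that the stripe's parity equals the XOR of its data blocks. */
static int stripe_parity_ok(uint8_t data[NDATA][BLK],
                            const uint8_t parity[BLK])
{
    uint8_t x[BLK] = { 0 };
    int d, i;

    for (d = 0; d < NDATA; d++)
        for (i = 0; i < BLK; i++)
            x[i] ^= data[d][i];
    return memcmp(x, parity, BLK) == 0;
}

/* Rebuild one lost data block as the XOR of the parity block and
 * the surviving data blocks. */
static void stripe_reconstruct(uint8_t data[NDATA][BLK],
                               const uint8_t parity[BLK], int lost)
{
    int d, i;

    memcpy(data[lost], parity, BLK);
    for (d = 0; d < NDATA; d++)
        if (d != lost)
            for (i = 0; i < BLK; i++)
                data[lost][i] ^= data[d][i];
}

int main(void)
{
    static uint8_t data[NDATA][BLK], parity[BLK];
    int d, i;

    data[3][100] = 0xAB;                 /* some payload */
    for (d = 0; d < NDATA; d++)          /* compute parity */
        for (i = 0; i < BLK; i++)
            parity[i] ^= data[d][i];

    memset(data[3], 0, BLK);             /* lose block 3 */
    stripe_reconstruct(data, parity, 3);
    printf("parity ok: %d, recovered byte: 0x%X\n",
           stripe_parity_ok(data, parity), data[3][100]);
    return 0;
}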


Thread overview: 18+ messages
     [not found] <20070725092445.GT12413810@sgi.com>
2007-07-25 10:14 ` RFC: log record CRC validation Mark Goodwin
2007-07-26  5:55   ` David Chinner
2007-07-26 23:01     ` Andi Kleen
2007-07-26 23:50       ` David Chinner
2007-07-26 17:53   ` Michael Nishimoto
2007-07-26 23:31     ` David Chinner
2007-07-27  1:24       ` Michael Nishimoto
2007-07-27  6:59         ` David Chinner
2007-08-01  0:49           ` Michael Nishimoto
2007-08-01  2:24             ` David Chinner
2007-08-01  2:36               ` Barry Naujok
2007-08-01  2:43                 ` David Chinner
2007-08-01 12:11               ` Andi Kleen
2007-07-28  2:00       ` William J. Earl [this message]
2007-07-28 14:03         ` Andi Kleen
2007-07-31  5:30         ` David Chinner
2007-08-01  1:32           ` William J. Earl
2007-08-01 10:02             ` David Chinner
