From: "Martin K. Petersen" <martin.petersen@oracle.com>
To: Mattias Wadenstein <maswan@acc.umu.se>
Cc: Neil Brown <neilb@suse.de>, David Chinner <dgc@sgi.com>,
	Avi Kivity <avi@argo.co.il>,
	david@lang.hm, linux-kernel@vger.kernel.org,
	linux-raid@vger.kernel.org
Subject: Re: limits on raid
Date: Thu, 21 Jun 2007 14:30:29 -0400
Message-ID: <yq1zm2ttb22.fsf@sermon.lab.mkp.net>
In-Reply-To: <Pine.GSO.4.64.0706211417180.15647@montezuma.acc.umu.se> (Mattias Wadenstein's message of "Thu, 21 Jun 2007 14:40:44 +0200 (MEST)")

>>>>> "Mattias" == Mattias Wadenstein <maswan@acc.umu.se> writes:

Mattias> In theory, that's how storage should work. In practice,
Mattias> silent data corruption does happen. If not from the disks
Mattias> themselves, somewhere along the path of cables, controllers,
Mattias> drivers, buses, etc. If you add in fcal, you'll get even more
Mattias> sources of failure, but usually you can avoid SANs (if you
Mattias> care about your data).

Oracle cares a lot about people's data 8).  And we've seen many cases
of silent data corruption.  Often the problem goes unnoticed for
months.  And by the time you find out about it you may have gone
through your backup cycle so the data is simply lost.

The Oracle database in combination with certain high-end arrays
supports a technology called HARD (Hardware Assisted Resilient Data)
which allows the array front end to verify the integrity of an I/O
before committing it to disk.  The downside to HARD is that it's
proprietary and only really high-end customers use it (many
enterprises actually mandate HARD).

A couple of years ago some changes started to trickle into the SCSI
Block Commands spec, introducing a Data Integrity Field (DIF).  As
some of you know, I've been working on implementing support for DIF
in Linux.

What DIF allows you to do is attach integrity metadata to an I/O.  We
can attach this metadata all the way up in the userland application
context, where the risk of corruption is relatively small.  The
metadata then travels with the data through the I/O stack: it is
verified by the HBA firmware, crosses the fabric, is verified again by
the array front end, and is verified a final time by the disk drive
before the change is committed to platter.  Any discrepancy causes the
I/O to be failed.  And thanks to the intermediate checks you also get
fault isolation.
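
To illustrate how intermediate checks give fault isolation, here is a
minimal simulation sketch.  It is not how any HBA or array implements
this: the hop names, the `send_through` helper and the use of CRC-32
as the integrity tag are all assumptions made for the example.

```python
import zlib

# Hypothetical checkpoints along the write path, in order.
PATH = ["hba", "array_front_end", "disk"]


def protect(data):
    """Attach integrity metadata at the top of the stack (CRC-32 stand-in)."""
    return data, zlib.crc32(data)


def send_through(path, data, tag, corrupt_after=None):
    """Pass the I/O through each hop; every hop re-verifies the tag.

    corrupt_after names a hop after which one bit is flipped, simulating
    a fault on the link that follows it.  Returns the name of the hop
    that failed the I/O, or None if every check passed.
    """
    for hop in path:
        if zlib.crc32(data) != tag:
            return hop  # fault lies between the previous hop and this one
        if hop == corrupt_after:
            data = bytes([data[0] ^ 0x01]) + data[1:]
    return None


block, tag = protect(b"\xab" * 512)
clean = send_through(PATH, block, tag)
after_hba = send_through(PATH, block, tag, corrupt_after="hba")
after_array = send_through(PATH, block, tag, corrupt_after="array_front_end")
```

The first checkpoint that rejects the I/O bounds where the corruption
happened: a failure reported by the array front end means the data was
still good when the HBA checked it.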

The DIF integrity metadata contains a CRC of the data block as well as
a reference tag that (for Type 1) needs to match the target sector on
disk.  This way the common problem of misdirected writes can be
alleviated.
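
A rough sketch of what such a check looks like.  The message above
only says the metadata holds a CRC and a reference tag; the 8-byte
layout used here (16-bit guard, 16-bit application tag, 32-bit
reference tag, big-endian) and the 0x8BB7 CRC polynomial are my
understanding of the T10 definition, and `make_tuple` and `verify` are
hypothetical helper names, not a kernel API.

```python
import struct

T10_DIF_POLY = 0x8BB7  # assumed: CRC-16 polynomial from the T10 spec


def crc16_t10dif(data):
    """Bitwise CRC-16 with initial value 0, no reflection, no final XOR."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ T10_DIF_POLY) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def make_tuple(block, lba, app_tag=0):
    """Build the 8-byte integrity tuple for one sector (Type 1 style)."""
    return struct.pack(">HHI", crc16_t10dif(block), app_tag, lba & 0xFFFFFFFF)


def verify(block, pi, target_lba):
    """Return None if the sector checks out, else which tag mismatched."""
    guard, _app_tag, ref = struct.unpack(">HHI", pi)
    if guard != crc16_t10dif(block):
        return "guard"  # data was corrupted in flight
    if ref != target_lba & 0xFFFFFFFF:
        return "ref"    # data is intact but headed for the wrong sector
    return None


sector = bytes(range(256)) * 2           # one 512-byte block
pi = make_tuple(sector, lba=1000)

ok = verify(sector, pi, target_lba=1000)
flipped = bytes([sector[0] ^ 0x80]) + sector[1:]
corrupted = verify(flipped, pi, target_lba=1000)
misdirected = verify(sector, pi, target_lba=1001)
```

The reference tag check is what catches the misdirected write: the
data and its CRC are perfectly fine, but the tuple says sector 1000
and the write is about to land on sector 1001.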

Initially, DIF is going to be offered in the FC/SAS space.  But I
encourage everybody to lean on their SATA drive manufacturer of choice
to provide similar functionality on consumer or, at the very least,
nearline drives.


Note there's a difference between FS checksums and DIF.  Filesystem
checksums (plug: http://oss.oracle.com/projects/btrfs/) allow the
filesystem to detect that it read something bad.  And as discussed
earlier we can potentially retry the read from another mirror or
reconstruct in the case of RAID5/6.
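
The read-side idea can be sketched as follows.  This is not btrfs
code: `read_verified`, the in-memory "mirrors" and the use of CRC-32
are assumptions chosen to keep the example small.

```python
import zlib


class ChecksumError(Exception):
    """Raised when no copy of a block matches its stored checksum."""


def read_verified(mirrors, block_no, expected_crc):
    """Return the first copy of block_no whose checksum matches.

    mirrors is a list of dict-like devices mapping block number to bytes.
    """
    for mirror in mirrors:
        data = mirror[block_no]
        if zlib.crc32(data) == expected_crc:
            return data
    raise ChecksumError("no good copy of block %d" % block_no)


good = b"important data"
stored_crc = zlib.crc32(good)          # recorded by the filesystem at write time

mirror_a = {7: b"important dat\x00"}   # silently corrupted copy
mirror_b = {7: good}                   # intact copy

recovered = read_verified([mirror_a, mirror_b], 7, stored_crc)
```

Note that detection only happens when the block is read back, which
may be long after the bad copy was written.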

DIF, however, is a proactive technology.  It prevents bad stuff from
being written to disk in the first place.  You'll know right away when
corruption happens, not 4 months later when you try to read the data
back.

So DIF and filesystem checksumming go hand in hand in preventing data
corruption...

-- 
Martin K. Petersen	Oracle Linux Engineering


