Re: data corruption: ext3/lvm2/md/mptsas/vitesse/seagate

From: James Bottomley <James.Bottomley@HansenPartnership.com>
To: Marc Bejarano <beej@alum.mit.edu>
Cc: linux-scsi@vger.kernel.org, linux-raid@vger.kernel.org
Subject: Re: data corruption: ext3/lvm2/md/mptsas/vitesse/seagate
Date: Thu, 06 Mar 2008 18:10:52 -0600
Message-ID: <1204848652.3062.100.camel@localhost.localdomain>
In-Reply-To: <200803062108.m26L8e4i020882@colby.verdasys.com>

On Thu, 2008-03-06 at 16:08 -0500, Marc Bejarano wrote:
> i've been doing burn-in on a new server i had hoped to deploy months 
> ago and can't seem to figure out the cause of data corruption i've 
> been seeing.  the SAS controller is an LSI SAS3801E connected to an 
> xTore XJ-SA12-316 SAS enclosures (vitesses expanders) full of seagate 
> 7200.10 750-GB SATA drives.
> 
> the corruption is occurring in ext3 filesystems that live on top of 
> an lvm2 RAID 0 stripe composed of 16 2-drive md RAID 1 sets.  the 
> corruption has been detected both by MySQL noticing bad checksums and 
> also by using md's "check" (sync_action) for RAID 1 consistency.

Actually, the RAID-1 check might be the most useful.  Is there anything
significant about the differing data?  Do od dumps of the corrupt
sectors in both halves of the mirror and see what actually appears in
the data.  Things like how long the corruption is (do the two sectors
differ entirely, or is it just a run of a few bytes within them) can
help in tracking down its source.
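
As a rough illustration of that comparison, here is a minimal Python
sketch that reads the same sectors from both halves of a mirror and
reports each run of differing bytes.  The helper names (read_sectors,
diff_runs) and any device paths passed to them are hypothetical; only
the 512-byte sector size comes from the report above.

```python
SECTOR_SIZE = 512  # bytes; the sector size mentioned in the report


def read_sectors(path, first_sector, count=1):
    """Read `count` sectors starting at `first_sector` from a device or image file."""
    with open(path, "rb") as f:
        f.seek(first_sector * SECTOR_SIZE)
        return f.read(count * SECTOR_SIZE)


def diff_runs(a, b):
    """Return (offset, length) for each run of consecutive differing bytes."""
    if len(a) != len(b):
        raise ValueError("buffers must be the same length")
    runs = []
    start = None  # offset where the current differing run began, if any
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:
        # The buffers still differed at the final byte.
        runs.append((start, len(a) - start))
    return runs
```

A single run covering the whole sector suggests the two halves hold
entirely different data, while a few short runs point to localized
damage.  Reading a raw block device this way normally requires root.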

> most recently we got two cases of the storage stack apparently 
> writing a mysql 16K page starting at the wrong 512-byte (sector) 
> boundary.  in both cases it was at too low a sector.  one page was 13 
> sectors too early, the other 34 too early.  in both cases, one disk 
> in each mirror set had the correct data and the other incorrect 
> (apparently ruling out everything above md). unfortunately, the 
> problem is not easily repeatable.  the system can run for days with 
> terabytes of writes before we notice any corruption.

Do you happen to have the absolute block number (and relative block
number---relative to the partition start) of the corruption?  That might
help analyse the writing algorithms to see if there's a problem
somewhere.
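
To make the two numbers concrete, here is a small Python sketch that
converts an absolute sector number into one relative to the partition
start, and also gives its offset within a 16K page (32 sectors of 512
bytes, per the report).  The function names and the example values are
hypothetical.  On Linux the partition start, in 512-byte sectors, can
be read from /sys/block/<disk>/<partition>/start.

```python
SECTOR_SIZE = 512                           # bytes per sector
PAGE_SIZE = 16 * 1024                       # the 16K MySQL page from the report
SECTORS_PER_PAGE = PAGE_SIZE // SECTOR_SIZE  # 32


def relative_sector(absolute_sector, partition_start):
    """Convert a whole-disk sector number to one relative to the partition start."""
    if absolute_sector < partition_start:
        raise ValueError("sector lies before the start of the partition")
    return absolute_sector - partition_start


def offset_in_page(sector):
    """Return how many sectors `sector` sits past a 16K-aligned boundary."""
    return sector % SECTORS_PER_PAGE


# Hypothetical example values, not taken from the report.
rel = relative_sector(1000063, partition_start=63)
print(rel, offset_in_page(rel))
```

Note that offset_in_page only measures alignment relative to whatever
origin the sector number is counted from; layers between the
filesystem and the disk (lvm2 and md here) may add their own offsets.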

> we're running RHEL 5.1's kernel and drivers and i understand that 
> these lists are for vanilla kernel support.  i've already engaged 
> redhat support, but i just wanted to see if anybody else has seen 
> something similar or anybody has any brilliant troubleshooting 
> ideas.  swapping drives, enclosures, HBA's, cables, and sacrifices of 
> animals to gods have so far not been able to make the world right.

Don't worry too much; the RHEL 5 stack is close enough to the vanilla
kernel, and we're interested in tracking it down.  Of course, it would
be useful to confirm that git head has this problem too, so we could
rule out patches added to the RHEL kernel ...

James
