linux-raid.vger.kernel.org archive mirror
From: NeilBrown <neilb@suse.de>
To: Robin Humble <robin.humble+raid@anu.edu.au>
Cc: linux-raid@vger.kernel.org
Subject: Re: rhel5 raid6 corruption
Date: Fri, 8 Apr 2011 20:33:45 +1000
Message-ID: <20110408203345.5bb5d179@notabene.brown>
In-Reply-To: <20110407074505.GA20218@grizzly.cita.utoronto.ca>

On Thu, 7 Apr 2011 03:45:05 -0400 Robin Humble <robin.humble+raid@anu.edu.au>
wrote:

> On Tue, Apr 05, 2011 at 03:00:22PM +1000, NeilBrown wrote:
> >On Mon, 4 Apr 2011 09:59:02 -0400 Robin Humble <robin.humble+raid@anu.edu.au>
> >> we are finding non-zero mismatch_cnt's and getting data corruption when
> >> using RHEL5/CentOS5 kernels with md raid6.
> >> actually, all kernels prior to 2.6.32 seem to have the bug.
> >> 
> >> the corruption only happens after we replace a failed disk, and the
> >> incorrect data is always on the replacement disk. i.e. the problem is
> >> with rebuild. mismatch_cnt is always a multiple of 8, so I suspect
> >> pages are going astray.
> ...
> >> git bisecting through drivers/md/raid5.c between 2.6.31 (has mismatches)
> >> and .32 (no problems) says that one of these (unbisectable) commits
> >> fixed the issue:
> >>   a9b39a741a7e3b262b9f51fefb68e17b32756999  md/raid6: asynchronous handle_stripe_dirtying6
> >>   5599becca4bee7badf605e41fd5bcde76d51f2a4  md/raid6: asynchronous handle_stripe_fill6
> >>   d82dfee0ad8f240fef1b28e2258891c07da57367  md/raid6: asynchronous handle_parity_check6
> >>   6c0069c0ae9659e3a91b68eaed06a5c6c37f45c8  md/raid6: asynchronous handle_stripe6
> >> 
> >> any ideas?
> >> were any "write i/o whilst rebuilding from degraded" issues fixed by
> >> the above patches?
> >
> >It looks like they were, but I didn't notice at the time.
> >
> >If a write to a block in a stripe happens at exactly the same time as the
> >recovery of a different block in that stripe - and both operations are
> >combined into a single "fix up the stripe parity and write it all out"
> >operation, then the block that needs to be recovered is computed but not
> >written out.  oops.
> >
> >The following patch should fix it.  Please test and report your results.
> >If they prove the fix I will submit it for the various -stable kernels.
> >It looks like this bug has "always" been present :-(
> 
> thanks for the very quick reply!
> 
> however, I don't think the patch has solved the problem :-/
> I applied it to 2.6.31.14 and have got several mismatches since on both
> FC and SATA machines.

That's disappointing - I was sure I had found it.  I'm tempted to ask "are
you really sure you are running the modified kernel", but I'm sure you are.
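
To restate the failure mode described in the quoted text above in concrete
terms, here is a deliberately simplified sketch.  The names are hypothetical -
this is not the real drivers/md/raid5.c code and not the patch under
discussion - it only shows the step that can go missing: the recovered block
is computed into the stripe cache but never flagged for write-out to the
replacement disk.

/* Deliberately simplified sketch - hypothetical names, not the actual
 * drivers/md/raid5.c structures and not the patch under discussion.
 * The point: when a reconstruct-write pass also computes a block for
 * the device being rebuilt, that block must be scheduled for write-out
 * too, otherwise it only ever exists in the stripe cache and the
 * replacement disk keeps stale data. */
struct dev_state {
        int being_recovered;    /* this is the replacement device */
        int has_new_write;      /* dirty data queued by the writer */
        int just_computed;      /* block reconstructed from the other devices */
        int want_write;         /* block will be written out in this pass */
};

static void finish_reconstruct_write(struct dev_state *dev, int ndevs)
{
        int i;

        for (i = 0; i < ndevs; i++) {
                if (dev[i].has_new_write)
                        dev[i].want_write = 1;  /* the new write itself */
                if (dev[i].being_recovered && dev[i].just_computed)
                        dev[i].want_write = 1;  /* the recovered block: the
                                                 * write-out step that can be
                                                 * missed in the bug above */
        }
}

In the real driver the compute and the write-out decisions are made in
separate passes over the stripe, which is why a block can end up computed but
never scheduled for writing.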

> 
> BTW, these tests are actually fairly quick. often <10 kickout/rebuild
> loops. so just a few hours.
> 
> in the above when you say "block in a stripe", does that mean the whole
> 128k on that disk (we have --chunk=128) might not have been written, or
> one 512byte block (or a page)?

'page'.  raid5 does everything in one-page (4K) per-device strips.
So the mismatch count - which is measured in 512-byte sectors - will always
be a multiple of 8.
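
As a minimal illustration of that arithmetic (assuming the array is md0 and
the standard sysfs layout; adjust the path for your device): one stray 4K
page shows up as a count of 8, and a whole 128K chunk would be 256.

/* Minimal sketch: read an md array's mismatch_cnt and convert it to 4K
 * pages.  Assumes the array is md0; adjust the sysfs path as needed. */
#include <stdio.h>

#define SECTOR_SIZE   512                    /* mismatch_cnt is counted in 512-byte sectors */
#define PAGE_SECTORS  (4096 / SECTOR_SIZE)   /* one 4K stripe-cache page = 8 sectors */

int main(void)
{
        FILE *f = fopen("/sys/block/md0/md/mismatch_cnt", "r");
        unsigned long sectors;

        if (!f) {
                perror("mismatch_cnt");
                return 1;
        }
        if (fscanf(f, "%lu", &sectors) != 1) {
                fclose(f);
                fprintf(stderr, "unexpected mismatch_cnt contents\n");
                return 1;
        }
        fclose(f);

        /* 8 sectors = one page astray, 16 = two pages; a whole 128K
         * chunk would be 256 sectors. */
        printf("%lu mismatched sectors = %lu page(s)\n",
               sectors, sectors / PAGE_SECTORS);
        return 0;
}

Note the count is only meaningful after a 'check' (or 'repair') pass has been
triggered via sync_action.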


> we don't see mismatch counts of 256 - usually 8 or 16, but I can see
> why our current testing might hide such a count. <later> ok - now I
> blank (dd zeros over) the replacement disk before putting it back into
> the raid and am seeing perhaps slightly larger typical mismatch counts
> of 16 and 32, but so far not 128k of mismatches.
> 
> another (potential) data point is that often position of the mismatches
> on the md device doesn't really line up with where I think the data is
> being written to. the mismatch is often near the start of the md device,
> but sometimes 50% of the way in, and sometimes 95%.
> the filesystem is <20% full, although the wildcard is that I really
> have no idea how the (sparse) files doing the 4k direct i/o are
> allocated across the filesystem, or where fs metadata and journals might
> be updating blocks either.
> seems odd though...

I suspect the filesystem spreads files across the whole disk, though it
depends a lot on the details of the particular filesystem.


> 
> >Thanks for the report .... and for all that testing!  A git-bisect where each
> >run can take 36 hours is a real test of commitment!!!
> 
> :-) no worries. thanks for md and for the help!
> we've been trying to figure this out for months so a bit of testing
> isn't a problem. we first eliminated a bunch of other things
> (filesystem, drivers, firmware, ...) as possibilities. for a long time
> we didn't really believe the problem could be with md as it's so well
> tested around the world and has been very solid for us except for these
> rebuilds.
> 

I'll try staring at the code a bit longer and see if anything jumps out at me.

thanks,
NeilBrown

Thread overview (7 messages):
2011-04-04 13:59 rhel5 raid6 corruption Robin Humble
2011-04-05  5:00 ` NeilBrown
2011-04-07  7:45   ` Robin Humble
2011-04-08 10:33     ` NeilBrown [this message]
2011-04-09  4:24       ` Robin Humble
2011-04-18  0:00         ` NeilBrown
2011-04-24  0:45           ` Robin Humble
