Re: raid5 (re)-add recovery data corruption

From: NeilBrown <neilb@suse.de>
To: Bill <billstuff2001@sbcglobal.net>
Cc: linux-raid <linux-raid@vger.kernel.org>
Subject: Re: raid5 (re)-add recovery data corruption
Date: Mon, 30 Jun 2014 13:23:35 +1000
Message-ID: <20140630132335.4361445e@notabene.brown>
In-Reply-To: <53AF5304.7020401@sbcglobal.net>

On Sat, 28 Jun 2014 18:43:00 -0500 Bill <billstuff2001@sbcglobal.net> wrote:

> On 06/22/2014 08:36 PM, NeilBrown wrote:
> > On Sat, 21 Jun 2014 00:31:39 -0500 Bill <billstuff2001@sbcglobal.net> wrote:
> >
> >> Hi Neil,
> >>
> >> I'm running a test on 3.14.8 and seeing data corruption after a recovery.
> >> I have this array:
> >>
> >>       md5 : active raid5 sdc1[2] sdb1[1] sda1[0] sde1[4] sdd1[3]
> >>             16777216 blocks level 5, 64k chunk, algorithm 2 [5/5] [UUUUU]
> >>             bitmap: 0/1 pages [0KB], 2048KB chunk
> >>
> >> with an xfs filesystem on it:
> >>       /dev/md5 on /hdtv/data5 type xfs (rw,noatime,barrier,swalloc,allocsize=256m,logbsize=256k,largeio)
> >>
> >> and I do this in a loop:
> >>
> >> 1. start writing 1/4 GB files to the filesystem
> >> 2. fail a disk. wait a bit
> >> 3. remove it. wait a bit
> >> 4. add the disk back into the array
> >> 5. wait for the array to sync and the file writes to finish
> >> 6. checksum the files.
> >> 7. wait a bit and do it all again
> >>
> >> The checksum QC will eventually fail, usually after a few hours.
> >>
> >> My last test failed after 4 hours:
> >>
> >>       18:51:48 - mdadm /dev/md5 -f /dev/sdc1
> >>       18:51:58 - mdadm /dev/md5 -r /dev/sdc1
> >>       18:52:06 - start writing 3 files
> >>       18:52:08 - mdadm /dev/md5 -a /dev/sdc1
> >>       18:52:18 - array recovery done
> >>       18:52:23 - writes finished. QC failed for one of three files.
> >>
> >> dmesg shows no errors and the disks are operating normally.
> >>
> >> If I "check" /dev/md5 it shows mismatch_cnt = 896
> >> If I dump the raw data on sd[abcde]1 underneath the bad file, it shows
> >> sd[abde]1 are correct, and sdc1 has some chunks of old data from a
> >> previous file.
> >>
> >> If I fail sdc1, --zero-superblock it, and add it, it then syncs and the
> >> QC is correct.
> >>
> >> So somehow it seems like md is losing track of some changes which need
> >> to be written to sdc1 in the recovery. But rarely - in this case it
> >> failed after 175 cycles.
> >>
> >> Do you have any idea what could be happening here?
> > No.  As you say, it looks like md is not setting a bit in the bitmap
> > correctly, or ignoring one that is set, or maybe clearing one that shouldn't
> > be cleared.
> > The last is most likely I would guess.
> 
> Neil,
> 
> I'm still digging through this but I found something that might help narrow
> it down - the bitmap stays dirty after the re-add and recovery is complete:
> 
>          Filename : /dev/sde1
>             Magic : 6d746962
>           Version : 4
>              UUID : 609846f8:ad08275f:824b3cb4:2e180e57
>            Events : 5259
>    Events Cleared : 5259
>             State : OK
>         Chunksize : 2 MB
>            Daemon : 5s flush period
>        Write Mode : Normal
>         Sync Size : 4194304 (4.00 GiB 4.29 GB)
>            Bitmap : 2048 bits (chunks), 2 dirty (0.1%)
>                                         ^^^^^^^^^^^^^^
> 
> This is after 1/2 hour idle. sde1 was the one removed / re-added, but
> all five disks show the same bitmap info, and the event count matches that
> of the array (5259). At this point the QC check fails.
> 
> Then I manually failed, removed and re-added /dev/sde1, and shortly the
> array synced the dirty chunks:
> 
>          Filename : /dev/sde1
>             Magic : 6d746962
>           Version : 4
>              UUID : 609846f8:ad08275f:824b3cb4:2e180e57
>            Events : 5275
>    Events Cleared : 5259
>             State : OK
>         Chunksize : 2 MB
>            Daemon : 5s flush period
>        Write Mode : Normal
>         Sync Size : 4194304 (4.00 GiB 4.29 GB)
>            Bitmap : 2048 bits (chunks), 0 dirty (0.0%)
>                                         ^^^^^^^^^^^^^^
> 
> Now the QC check succeeds and an array "check" shows no mismatches.
> 
> So it seems like md is ignoring a set bit in the bitmap, which then gets
> noticed with the fail / remove / re-add sequence.

Thanks, that helps a lot ... maybe.

I have a theory.  This patch explains it and should fix it.
I'm not sure this is the patch I will go with if it works, but it will help
confirm my theory.
Can you test it?

thanks,
NeilBrown

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 34846856dbc6..27387a3740c8 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7906,6 +7906,15 @@ void md_check_recovery(struct mddev *mddev)
 			clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
 			clear_bit(MD_RECOVERY_REQUESTED, &mddev->recovery);
 			set_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
+			/* If there is a bitmap, we need to make sure
+			 * all writes that started before we added a spare
+			 * complete before we start doing a recovery.
+			 * Otherwise the write might complete and set
+			 * a bit in the bitmap after the recovery has
+			 * checked that bit and skipped that region.
+			 */
+			mddev->pers->quiesce(mddev, 1);
+			mddev->pers->quiesce(mddev, 0);
 		} else if (mddev->recovery_cp < MaxSector) {
 			set_bit(MD_RECOVERY_SYNC, &mddev->recovery);
 			clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
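
The race described in the patch comment can be illustrated with a toy
user-space model. This is only a sketch under stated assumptions, not kernel
code: struct toy_array, complete_inflight_write(), run_recovery() and
simulate_readd() are hypothetical names that do not exist in drivers/md, and
the model assumes (following the patch comment) that the bitmap bit for a
write started before the re-add only becomes visible when that write
completes. The quiesce step is modelled simply as "let the in-flight write
finish before recovery looks at the bitmap".

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NREGIONS 8

/* Toy model only: none of these names exist in drivers/md. */
struct toy_array {
	bool bitmap_dirty[NREGIONS];  /* write-intent bit per region */
	bool member_stale[NREGIONS];  /* re-added member is missing data here */
	int inflight_region;          /* region of a write still in flight, -1 if none */
};

/* Assumption taken from the patch comment: the bit for a write issued
 * before the re-add only becomes visible when that write completes. */
static void complete_inflight_write(struct toy_array *a)
{
	if (a->inflight_region < 0)
		return;
	assert(a->inflight_region < NREGIONS);
	a->bitmap_dirty[a->inflight_region] = true;
	a->inflight_region = -1;
}

/* Bitmap-based recovery: rebuild dirty regions, skip clean ones. */
static void run_recovery(struct toy_array *a)
{
	for (int i = 0; i < NREGIONS; i++) {
		if (!a->bitmap_dirty[i])
			continue;
		a->member_stale[i] = false;
		a->bitmap_dirty[i] = false;
	}
}

/* Returns how many regions of the re-added member are still stale. */
int simulate_readd(bool quiesce_first)
{
	struct toy_array a;
	int stale = 0;

	memset(&a, 0, sizeof(a));
	/* Region 1: a write that completed while the member was missing. */
	a.bitmap_dirty[1] = true;
	a.member_stale[1] = true;
	/* Region 5: a write still in flight when the member is re-added. */
	a.inflight_region = 5;
	a.member_stale[5] = true;

	if (quiesce_first)
		complete_inflight_write(&a);  /* stand-in for quiesce(1)/quiesce(0) */
	run_recovery(&a);
	complete_inflight_write(&a);      /* late completion, no-op if already done */

	for (int i = 0; i < NREGIONS; i++)
		stale += a.member_stale[i];
	return stale;
}
```

In this model simulate_readd(false) returns 1: region 5 is skipped by
recovery and its bit is set afterwards, which resembles the leftover dirty
bits Bill reported. simulate_readd(true) returns 0, because the in-flight
write is drained before recovery reads the bitmap.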

