From: NeilBrown <neilb@suse.de>
To: Bill <billstuff2001@sbcglobal.net>
Cc: linux-raid <linux-raid@vger.kernel.org>
Subject: Re: raid5 (re)-add recovery data corruption
Date: Mon, 30 Jun 2014 13:23:35 +1000
Message-ID: <20140630132335.4361445e@notabene.brown>
In-Reply-To: <53AF5304.7020401@sbcglobal.net>
On Sat, 28 Jun 2014 18:43:00 -0500 Bill <billstuff2001@sbcglobal.net> wrote:
> On 06/22/2014 08:36 PM, NeilBrown wrote:
> > On Sat, 21 Jun 2014 00:31:39 -0500 Bill <billstuff2001@sbcglobal.net> wrote:
> >
> >> Hi Neil,
> >>
> >> I'm running a test on 3.14.8 and seeing data corruption after a recovery.
> >> I have this array:
> >>
> >> md5 : active raid5 sdc1[2] sdb1[1] sda1[0] sde1[4] sdd1[3]
> >> 16777216 blocks level 5, 64k chunk, algorithm 2 [5/5] [UUUUU]
> >> bitmap: 0/1 pages [0KB], 2048KB chunk
> >>
> >> with an xfs filesystem on it:
> >> /dev/md5 on /hdtv/data5 type xfs
> >> (rw,noatime,barrier,swalloc,allocsize=256m,logbsize=256k,largeio)
> >>
> >> and I do this in a loop:
> >>
> >> 1. start writing 1/4 GB files to the filesystem
> >> 2. fail a disk. wait a bit
> >> 3. remove it. wait a bit
> >> 4. add the disk back into the array
> >> 5. wait for the array to sync and the file writes to finish
> >> 6. checksum the files.
> >> 7. wait a bit and do it all again
> >>
> >> The checksum QC will eventually fail, usually after a few hours.
> >>
> >> My last test failed after 4 hours:
> >>
> >> 18:51:48 - mdadm /dev/md5 -f /dev/sdc1
> >> 18:51:58 - mdadm /dev/md5 -r /dev/sdc1
> >> 18:52:06 - start writing 3 files
> >> 18:52:08 - mdadm /dev/md5 -a /dev/sdc1
> >> 18:52:18 - array recovery done
> >> 18:52:23 - writes finished. QC failed for one of three files.
> >>
> >> dmesg shows no errors and the disks are operating normally.
> >>
> >> If I "check" /dev/md5 it shows mismatch_cnt = 896
> >> If I dump the raw data on sd[abcde]1 underneath the bad file, it shows
> >> sd[abde]1 are correct, and sdc1 has some chunks of old data from a
> >> previous file.
> >>
> >> If I fail sdc1, --zero-superblock it, and add it, it then syncs and the
> >> QC is correct.
> >>
> >> So somehow it seems like md is losing track of some changes which need
> >> to be written to sdc1 in the recovery. But rarely - in this case it failed
> >> after 175 cycles.
> >>
> >> Do you have any idea what could be happening here?
> > No. As you say, it looks like md is not setting a bit in the bitmap
> > correctly, or ignoring one that is set, or maybe clearing one that shouldn't
> > be cleared.
> > The last is most likely I would guess.
>
> Neil,
>
> I'm still digging through this but I found something that might help
> narrow it down - the bitmap stays dirty after the re-add and recovery is
> complete:
>
> Filename : /dev/sde1
> Magic : 6d746962
> Version : 4
> UUID : 609846f8:ad08275f:824b3cb4:2e180e57
> Events : 5259
> Events Cleared : 5259
> State : OK
> Chunksize : 2 MB
> Daemon : 5s flush period
> Write Mode : Normal
> Sync Size : 4194304 (4.00 GiB 4.29 GB)
> Bitmap : 2048 bits (chunks), 2 dirty (0.1%)
> ^^^^^^^^^^^^^^
>
> This is after 1/2 hour idle. sde1 was the one removed / re-added, but
> all five disks show the same bitmap info, and the event count matches
> that of the array (5259). At this point the QC check fails.
>
> Then I manually failed, removed and re-added /dev/sde1, and shortly the
> array synced the dirty chunks:
>
> Filename : /dev/sde1
> Magic : 6d746962
> Version : 4
> UUID : 609846f8:ad08275f:824b3cb4:2e180e57
> Events : 5275
> Events Cleared : 5259
> State : OK
> Chunksize : 2 MB
> Daemon : 5s flush period
> Write Mode : Normal
> Sync Size : 4194304 (4.00 GiB 4.29 GB)
> Bitmap : 2048 bits (chunks), 0 dirty (0.0%)
> ^^^^^^^^^^^^^^
>
> Now the QC check succeeds and an array "check" shows no mismatches.
>
> So it seems like md is ignoring a set bit in the bitmap, which then gets
> noticed with the fail / remove / re-add sequence.

Thanks, that helps a lot ... maybe.

I have a theory.  This patch explains it and should fix it.
I'm not sure this is the patch I will go with if it works, but it will help
confirm my theory.
Can you test it?

thanks,
NeilBrown

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 34846856dbc6..27387a3740c8 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7906,6 +7906,15 @@ void md_check_recovery(struct mddev *mddev)
 			clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
 			clear_bit(MD_RECOVERY_REQUESTED, &mddev->recovery);
 			set_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
+			/* If there is a bitmap, we need to make sure
+			 * all writes that started before we added a spare
+			 * complete before we start doing a recovery.
+			 * Otherwise the write might complete and set
+			 * a bit in the bitmap after the recovery has
+			 * checked that bit and skipped that region.
+			 */
+			mddev->pers->quiesce(mddev, 1);
+			mddev->pers->quiesce(mddev, 0);
 		} else if (mddev->recovery_cp < MaxSector) {
 			set_bit(MD_RECOVERY_SYNC, &mddev->recovery);
 			clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
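
The race described in the patch comment can be sketched as a small
user-space model. This is only a toy, not kernel code: struct toy_array,
complete_inflight(), recover() and simulate_readd() are hypothetical names
invented for illustration, and the model assumes (as the comment suggests)
that a write which started before the re-add sets its bitmap bit only when
it completes.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NCHUNKS 8	/* number of bitmap chunks in the toy array */

/* Toy model (assumption, not the md data structures). */
struct toy_array {
	bool dirty[NCHUNKS];	/* write-intent bitmap, one bit per chunk */
	bool stale[NCHUNKS];	/* re-added disk holds old data for this chunk */
	int inflight_chunk;	/* chunk of a write still in flight, or -1 */
};

/* The in-flight write completes and only now sets its bitmap bit. */
static void complete_inflight(struct toy_array *a)
{
	if (a->inflight_chunk >= 0) {
		assert(a->inflight_chunk < NCHUNKS);
		a->dirty[a->inflight_chunk] = true;
		a->inflight_chunk = -1;
	}
}

/* Bitmap-driven recovery: resync dirty chunks, skip clean ones.
 * Any write still in flight completes right after the scan has
 * passed chunk 'write_completes_after'. */
static void recover(struct toy_array *a, int write_completes_after)
{
	for (int c = 0; c < NCHUNKS; c++) {
		if (a->dirty[c]) {
			a->stale[c] = false;	/* chunk rebuilt on re-added disk */
			a->dirty[c] = false;
		}
		if (c == write_completes_after)
			complete_inflight(a);
	}
}

/* Returns the number of chunks left stale on the re-added disk and
 * stores the number of bitmap bits still set in *dirty_left. */
int simulate_readd(bool quiesce_first, int *dirty_left)
{
	struct toy_array a;
	int stale = 0, dirty = 0;

	memset(&a, 0, sizeof(a));
	a.inflight_chunk = 3;	/* write started while the disk was absent */
	a.stale[3] = true;	/* so the re-added disk misses that data */

	if (quiesce_first)	/* models quiesce(mddev, 1); quiesce(mddev, 0) */
		complete_inflight(&a);

	recover(&a, 5);		/* write finishes after chunk 3 was checked */

	for (int c = 0; c < NCHUNKS; c++) {
		stale += a.stale[c];
		dirty += a.dirty[c];
	}
	if (dirty_left)
		*dirty_left = dirty;
	return stale;
}
```

In this model, simulate_readd(false, &d) ends with one stale chunk and one
bit still set in the bitmap, which resembles the dirty bits reported above
after recovery had finished. simulate_readd(true, &d) drains the write
first, so recovery sees the bit and leaves nothing stale or dirty.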