From: NeilBrown <neilb@suse.de>
To: Bill <billstuff2001@sbcglobal.net>
Cc: linux-raid <linux-raid@vger.kernel.org>
Subject: Re: raid5 (re)-add recovery data corruption
Date: Mon, 30 Jun 2014 13:40:45 +1000
Message-ID: <20140630134045.601cd33d@notabene.brown>
In-Reply-To: <20140630132335.4361445e@notabene.brown>
On Mon, 30 Jun 2014 13:23:35 +1000 NeilBrown <neilb@suse.de> wrote:
> On Sat, 28 Jun 2014 18:43:00 -0500 Bill <billstuff2001@sbcglobal.net> wrote:
>
> > On 06/22/2014 08:36 PM, NeilBrown wrote:
> > > On Sat, 21 Jun 2014 00:31:39 -0500 Bill <billstuff2001@sbcglobal.net> wrote:
> > >
> > >> Hi Neil,
> > >>
> > >> I'm running a test on 3.14.8 and seeing data corruption after a recovery.
> > >> I have this array:
> > >>
> > >> md5 : active raid5 sdc1[2] sdb1[1] sda1[0] sde1[4] sdd1[3]
> > >> 16777216 blocks level 5, 64k chunk, algorithm 2 [5/5] [UUUUU]
> > >> bitmap: 0/1 pages [0KB], 2048KB chunk
> > >>
> > >> with an xfs filesystem on it:
> > >> /dev/md5 on /hdtv/data5 type xfs
> > >> (rw,noatime,barrier,swalloc,allocsize=256m,logbsize=256k,largeio)
> > >>
> > >> and I do this in a loop:
> > >>
> > >> 1. start writing 1/4 GB files to the filesystem
> > >> 2. fail a disk. wait a bit
> > >> 3. remove it. wait a bit
> > >> 4. add the disk back into the array
> > >> 5. wait for the array to sync and the file writes to finish
> > >> 6. checksum the files.
> > >> 7. wait a bit and do it all again
> > >>
> > >> The checksum QC will eventually fail, usually after a few hours.
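[For reference, a minimal shell sketch of such a test loop -- the device
names, mount point, file count, and timings here are illustrative
assumptions, not Bill's exact script:]

    #!/bin/sh
    # Build one 1/4 GB reference file and record its checksum.
    dd if=/dev/urandom of=/tmp/ref bs=1M count=256
    REF=$(md5sum < /tmp/ref)
    cycle=0
    while true; do
        cycle=$((cycle + 1))
        for i in 1 2 3; do                          # 1. start writing files
            cp /tmp/ref /hdtv/data5/file$i &
        done
        mdadm /dev/md5 -f /dev/sdc1; sleep 10       # 2. fail a disk, wait a bit
        mdadm /dev/md5 -r /dev/sdc1; sleep 10       # 3. remove it, wait a bit
        mdadm /dev/md5 -a /dev/sdc1                 # 4. re-add (bitmap-based recovery)
        wait                                        # 5a. file writes finish
        mdadm --wait /dev/md5                       # 5b. recovery finishes
        sync
        for i in 1 2 3; do                          # 6. checksum the files
            [ "$(md5sum < /hdtv/data5/file$i)" = "$REF" ] \
                || { echo "QC failed on cycle $cycle: file$i"; exit 1; }
        done
        sleep 10                                    # 7. wait a bit, do it all again
    done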
> > >>
> > >> My last test failed after 4 hours:
> > >>
> > >> 18:51:48 - mdadm /dev/md5 -f /dev/sdc1
> > >> 18:51:58 - mdadm /dev/md5 -r /dev/sdc1
> > >> 18:52:06 - start writing 3 files
> > >> 18:52:08 - mdadm /dev/md5 -a /dev/sdc1
> > >> 18:52:18 - array recovery done
> > >> 18:52:23 - writes finished. QC failed for one of three files.
> > >>
> > >> dmesg shows no errors and the disks are operating normally.
> > >>
> > >> If I "check" /dev/md5 it shows mismatch_cnt = 896
> > >> If I dump the raw data on sd[abcde]1 underneath the bad file, it shows
> > >> sd[abde]1 are correct, and sdc1 has some chunks of old data from a
> > >> previous file.
> > >>
> > >> If I fail sdc1, --zero-superblock it, and add it, it then syncs and the
> > >> QC is correct.
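[For reference, the check and the forced full recovery can be driven like
this -- a sketch reusing the md5/sdc1 names from above:]

    echo check > /sys/block/md5/md/sync_action    # read-only scrub of the array
    mdadm --wait /dev/md5
    cat /sys/block/md5/md/mismatch_cnt            # 896 in the failing run above

    mdadm /dev/md5 -f /dev/sdc1                   # fail the suspect member ...
    mdadm /dev/md5 -r /dev/sdc1                   # ... and remove it
    mdadm --zero-superblock /dev/sdc1             # discard its superblock/bitmap state
    mdadm /dev/md5 -a /dev/sdc1                   # re-add: full recovery, not bitmap-based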
> > >>
> > >> So somehow it seems like md is losing track of some changes which
> > >> need to be written to sdc1 during the recovery. But only rarely - in
> > >> this case it failed after 175 cycles.
> > >>
> > >> Do you have any idea what could be happening here?
> > > No. As you say, it looks like md is not setting a bit in the bitmap
> > > correctly, or ignoring one that is set, or maybe clearing one that shouldn't
> > > be cleared.
> > > The last is most likely I would guess.
> >
> > Neil,
> >
> > I'm still digging through this but I found something that might help
> > narrow it down - the bitmap stays dirty after the re-add, even once
> > the recovery is complete:
> >
> >         Filename : /dev/sde1
> >            Magic : 6d746962
> >          Version : 4
> >             UUID : 609846f8:ad08275f:824b3cb4:2e180e57
> >           Events : 5259
> >   Events Cleared : 5259
> >            State : OK
> >        Chunksize : 2 MB
> >           Daemon : 5s flush period
> >       Write Mode : Normal
> >        Sync Size : 4194304 (4.00 GiB 4.29 GB)
> >           Bitmap : 2048 bits (chunks), 2 dirty (0.1%)
> >                                        ^^^^^^^^^^^^^^
> >
> > This is after 1/2 hour idle. sde1 was the one removed / re-added,
> > but all five disks show the same bitmap info, and the event count
> > matches that of the array (5259). At this point the QC check fails.
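[The dumps above and below look like the output of mdadm's bitmap examine
mode; presumably something like:]

    for d in /dev/sd[abcde]1; do mdadm -X $d; done    # -X is --examine-bitmap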
> >
> > Then I manually failed, removed and re-added /dev/sde1, and shortly
> > the array synced the dirty chunks:
> >
> >         Filename : /dev/sde1
> >            Magic : 6d746962
> >          Version : 4
> >             UUID : 609846f8:ad08275f:824b3cb4:2e180e57
> >           Events : 5275
> >   Events Cleared : 5259
> >            State : OK
> >        Chunksize : 2 MB
> >           Daemon : 5s flush period
> >       Write Mode : Normal
> >        Sync Size : 4194304 (4.00 GiB 4.29 GB)
> >           Bitmap : 2048 bits (chunks), 0 dirty (0.0%)
> >                                        ^^^^^^^^^^^^^^
> >
> > Now the QC check succeeds and an array "check" shows no mismatches.
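[Presumably the equivalent of:]

    mdadm /dev/md5 -f /dev/sde1
    mdadm /dev/md5 -r /dev/sde1
    mdadm /dev/md5 -a /dev/sde1                   # resyncs only the dirty bitmap chunks
    echo check > /sys/block/md5/md/sync_action    # then verify: mismatch_cnt is now 0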
> >
> > So it seems like md is ignoring a set bit in the bitmap, which then
> > gets noticed with the fail / remove / re-add sequence.
>
> Thanks, that helps a lot ... maybe.
>
> I have a theory. This patch explains it and should fix it.
> I'm not sure this is the patch I will go with if it works, but it will help
> confirm my theory.
> Can you test it?
>
> thanks,
> NeilBrown
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 34846856dbc6..27387a3740c8 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -7906,6 +7906,15 @@ void md_check_recovery(struct mddev *mddev)
>  			clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
>  			clear_bit(MD_RECOVERY_REQUESTED, &mddev->recovery);
>  			set_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
> +			/* If there is a bitmap, we need to make sure
> +			 * all writes that started before we added a spare
> +			 * complete before we start doing a recovery.
> +			 * Otherwise the write might complete and set
> +			 * a bit in the bitmap after the recovery has
> +			 * checked that bit and skipped that region.
> +			 */
> +			mddev->pers->quiesce(mddev, 1);
> +			mddev->pers->quiesce(mddev, 0);
>  		} else if (mddev->recovery_cp < MaxSector) {
>  			set_bit(MD_RECOVERY_SYNC, &mddev->recovery);
>  			clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
>
Don't even bother trying that - it will definitely deadlock:
md_check_recovery() is called from the array's own md thread, and
->quiesce(mddev, 1) has to wait for in-flight writes which only that
thread can drain.

Please try this instead - the same quiesce/unquiesce pair, but done from
the separate sync thread in md_do_sync():
NeilBrown
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 34846856dbc6..f8cd0bd83402 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7501,6 +7501,16 @@ void md_do_sync(struct md_thread *thread)
 			    rdev->recovery_offset < j)
 				j = rdev->recovery_offset;
 		rcu_read_unlock();
+
+		/* If there is a bitmap, we need to make sure
+		 * all writes that started before we added a spare
+		 * complete before we start doing a recovery.
+		 * Otherwise the write might complete and set
+		 * a bit in the bitmap after the recovery has
+		 * checked that bit and skipped that region.
+		 */
+		mddev->pers->quiesce(mddev, 1);
+		mddev->pers->quiesce(mddev, 0);
 	}

 	printk(KERN_INFO "md: %s of RAID array %s\n", desc, mdname(mddev));