linux-raid.vger.kernel.org archive mirror
* Scrubbing "check" not working for RAID10 in 3.10-rc1+
From: Jonathan Brassow @ 2013-06-25  6:19 UTC
  To: neilb; +Cc: linux-raid

Neil,

I've noticed that the "check" operation no longer works for RAID10.  It
works just fine for the other RAIDs.  The ("data-check") sync_thread
kicks off just fine, sync_request_write() is called, but it never gets
past:
        if (i == conf->copies)
                goto done;
The test I am performing creates a RAID array, waits for it to sync,
shuts it down, writes random data to one of the devices, assembles the
array, and then runs a "check" - there should be discrepancies.  The
discrepancies are found and recorded in resync_mismatches for all RAID
levels on kernels <= 3.9, but only for the non-RAID10 levels on 3.10-rc1+.
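
For readers following along, the check quoted above can be read as the
tail of a scan for the first copy whose bio is flagged up to date (the
fix later in this thread calls it "the loop at the top of
sync_request_write").  The following is a minimal userland model of
that scan, not the kernel code: struct copy_model and first_uptodate()
are hypothetical names, and a plain bool stands in for the
BIO_UPTODATE bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one copy's bio; only the flag matters here. */
struct copy_model {
	bool uptodate;	/* models the BIO_UPTODATE bit of that copy's bio */
};

/*
 * Return the index of the first copy marked up to date, or `copies`
 * if none is marked.  A return value equal to `copies` corresponds to
 * the quoted "if (i == conf->copies) goto done;" path.
 */
static size_t first_uptodate(const struct copy_model *devs, size_t copies)
{
	size_t i;

	assert(devs != NULL);
	for (i = 0; i < copies; i++)
		if (devs[i].uptodate)
			break;
	return i;
}
```

If no copy is flagged, the function returns the number of copies, so
nothing is compared and no mismatch can be counted.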

I'm sorry I haven't tracked it down yet and I'm going to be on vacation
starting tomorrow with only intermittent access to e-mail.  Sorry to
leave you hanging.

Thanks,
 brassow

P.S.  This also reminded me of a patch I have concerning tracking the
last sync action for the purpose of making mismatch_count more useful.
I'll post that before leaving.




* Re: Scrubbing "check" not working for RAID10 in 3.10-rc1+
From: NeilBrown @ 2013-06-25  6:32 UTC
  To: Jonathan Brassow; +Cc: linux-raid

On Tue, 25 Jun 2013 01:19:20 -0500 Jonathan Brassow <jbrassow@redhat.com>
wrote:

> Neil,
> 
> I've noticed that the "check" operation no longer works for RAID10.  It
> works just fine for the other RAIDs.  The ("data-check") sync_thread
> kicks off just fine, sync_request_write() is called, but it never gets
> past:
>         if (i == conf->copies)
>                 goto done;
> The test I am performing creates a RAID array, waits for it to sync,
> shuts it down, writes random data to one of the devices, assembles the
> array, and then runs a "check" - there should be discrepancies.  The
> discrepancies are found and recorded in resync_mismatches for all RAID
> levels on kernels <= 3.9, but only for the non-RAID10 levels on 3.10-rc1+.

I just tried on 3.10-rc5+ and it works as expected.
If you can provide a test script that fails, I'll look into it.

> 
> I'm sorry I haven't tracked it down yet and I'm going to be on vacation
> starting tomorrow with only intermittent access to e-mail.  Sorry to
> leave you hanging.

Go enjoy your vacation and don't worry about me hanging :-)

> 
> Thanks,
>  brassow
> 
> P.S.  This also reminded me of a patch I have concerning tracking the
> last sync action for the purpose of making mismatch_count more useful.
> I'll post that before leaving.
> 

Thanks.

NeilBrown


* Re: Scrubbing "check" not working for RAID10 in 3.10-rc1+
From: Brassow Jonathan @ 2013-07-15 15:35 UTC
  To: NeilBrown; +Cc: linux-raid


On Jun 25, 2013, at 1:32 AM, NeilBrown wrote:

> On Tue, 25 Jun 2013 01:19:20 -0500 Jonathan Brassow <jbrassow@redhat.com>
> wrote:
> 
>> Neil,
>> 
>> I've noticed that the "check" operation no longer works for RAID10.  It
>> works just fine for the other RAIDs.  The ("data-check") sync_thread
>> kicks off just fine, sync_request_write() is called, but it never gets
>> past:
>>        if (i == conf->copies)
>>                goto done;
>> The test I am performing creates a RAID array, waits for it to sync,
>> shuts it down, writes random data to one of the devices, assembles the
>> array, and then runs a "check" - there should be discrepancies.  The
>> discrepancies are found and recorded in resync_mismatches for all RAID
>> levels on kernels <= 3.9, but only for the non-RAID10 levels on 3.10-rc1+.
> 
> I just tried on 3.10-rc5+ and it works as expected.
> If you can provide a test script that fails, I'll look into it.

Just tried 3.10 - it fails for me there too.  I'll send you the script I use shortly.

thanks,
 brassow

(vacation ends soon.)



* Re: Scrubbing "check" not working for RAID10 in 3.10-rc1+
From: Jonathan Brassow @ 2013-07-15 15:40 UTC
  To: neilb; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 2786 bytes --]

Neil,

You will need to change 'devices' to suit your needs.  I can run this
test with RAID 1/4/5/6 and it works, but it fails with RAID10 since
3.10-rc1.

thanks,
 brassow

example output:
[~]# ./md.sh 1
mdadm: /dev/sda1 appears to be part of a raid array:
    level=raid5 devices=4 ctime=Sat Jul 13 14:52:49 2013
mdadm: Note: this array has metadata at the start and
    may not be suitable as a boot device.  If you plan to
    store '/boot' on this device please ensure that
    your boot-loader understands md/v1.x metadata, or use
    --metadata=0.90
mdadm: /dev/sdb1 appears to be part of a raid array:
    level=raid5 devices=4 ctime=Sat Jul 13 14:52:49 2013
mdadm: /dev/sdc1 appears to be part of a raid array:
    level=raid5 devices=4 ctime=Sat Jul 13 14:52:49 2013
mdadm: /dev/sdd1 appears to be part of a raid array:
    level=raid5 devices=4 ctime=Sat Jul 13 14:52:49 2013
mdadm: largest drive (/dev/sda1) exceeds size (102400K) by more than 1%
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
        Waiting for resync to complete
        Waiting for resync to complete
        Waiting for resync to complete
        Waiting for resync to complete
RAID1 mismatch count after creation     : 0
mdadm: stopped /dev/md0
Writing garbage to one of the MD devices...
mdadm: /dev/md0 has been started with 4 drives.
RAID1 mismatch count after reactivation : 0
        Waiting for check to complete
        Waiting for check to complete
        Waiting for check to complete
RAID1 mismatch count after data-check   : 61440
mdadm: stopped /dev/md0


[~]# ./md.sh 10
mdadm: /dev/sda1 appears to be part of a raid array:
    level=raid1 devices=4 ctime=Mon Jul 15 10:30:44 2013
mdadm: /dev/sdb1 appears to be part of a raid array:
    level=raid1 devices=4 ctime=Mon Jul 15 10:30:44 2013
mdadm: /dev/sdc1 appears to be part of a raid array:
    level=raid1 devices=4 ctime=Mon Jul 15 10:30:44 2013
mdadm: /dev/sdd1 appears to be part of a raid array:
    level=raid1 devices=4 ctime=Mon Jul 15 10:30:44 2013
mdadm: largest drive (/dev/sda1) exceeds size (102400K) by more than 1%
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
        Waiting for resync to complete
        Waiting for resync to complete
        Waiting for resync to complete
RAID10 mismatch count after creation     : 0
mdadm: stopped /dev/md0
Writing garbage to one of the MD devices...
mdadm: /dev/md0 has been started with 4 drives.
RAID10 mismatch count after reactivation : 0
        Waiting for check to complete
        Waiting for check to complete
        Waiting for check to complete
RAID10 mismatch count after data-check   : 0
mdadm: stopped /dev/md0
***** mismatch_cnt should not be zero !!!!!!
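
The md.sh attachment is not reproduced in this archive, so how it
gathers the counts above is not shown.  One plausible way, assuming the
array is md0, is to read the md sysfs attribute
/sys/block/md0/md/mismatch_cnt after the check finishes.  A minimal
sketch in C; read_counter() is a hypothetical helper, not part of mdadm
or of the script:

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read a non-negative decimal counter from the file at `path`.
 * Returns the value, or -1 if the file cannot be opened or parsed.
 */
static long long read_counter(const char *path)
{
	FILE *f;
	long long value;

	assert(path != NULL);
	f = fopen(path, "r");
	if (f == NULL)
		return -1;
	if (fscanf(f, "%lld", &value) != 1 || value < 0)
		value = -1;
	fclose(f);
	return value;
}
```

Under the assumptions above, a caller would pass
"/sys/block/md0/md/mismatch_cnt" and treat a result of 0 after a check
on a deliberately corrupted array as the failure reported here.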


[-- Attachment #2: md.sh --]
[-- Type: application/x-shellscript, Size: 1742 bytes --]


* Re: Scrubbing "check" not working for RAID10 in 3.10-rc1+
From: NeilBrown @ 2013-07-16  7:01 UTC
  To: Brassow Jonathan; +Cc: linux-raid

On Mon, 15 Jul 2013 10:35:07 -0500 Brassow Jonathan <jbrassow@redhat.com>
wrote:

> 
> On Jun 25, 2013, at 1:32 AM, NeilBrown wrote:
> 
> > On Tue, 25 Jun 2013 01:19:20 -0500 Jonathan Brassow <jbrassow@redhat.com>
> > wrote:
> > 
> >> Neil,
> >> 
> >> I've noticed that the "check" operation no longer works for RAID10.  It
> >> works just fine for the other RAIDs.  The ("data-check") sync_thread
> >> kicks off just fine, sync_request_write() is called, but it never gets
> >> past:
> >>        if (i == conf->copies)
> >>                goto done;
> >> The test I am performing creates a RAID array, waits for it to sync,
> >> shuts it down, writes random data to one of the devices, assembles the
> >> array, and then runs a "check" - there should be discrepancies.  The
> >> discrepancies are found and recorded in resync_mismatches for all RAID
> >> levels on kernels <= 3.9, but only for the non-RAID10 levels on 3.10-rc1+.
> > 
> > I just tried on 3.10-rc5+ and it works as expected.
> > If you can provide a test script that fails, I'll look into it.
> 
> Just tried 3.10 - it fails for me there too.  I'll send you the script I use shortly.
> 
> thanks,
>  brassow
> 
> (vacation ends soon.)
:-)

Thanks.  This patch seems to fix it.

NeilBrown

From b0b0ac3ecf1e54dd6a429294082c47f1e52db41d Mon Sep 17 00:00:00 2001
From: NeilBrown <neilb@suse.de>
Date: Tue, 16 Jul 2013 16:50:47 +1000
Subject: [PATCH] md/raid10: fix two problems with RAID10 resync.

1/ When a difference between blocks is found, data is copied from
   one bio to the other.  However bv_len is used as the length to
   copy and this could be zero.  So use r10_bio->sectors to calculate
   length instead.
   Using bv_len was probably always a bit dubious, but the introduction
   of bio_advance made it much more likely to be a problem.

2/ When preparing some blocks for sync, we don't set BIO_UPTODATE
   except on bios that we schedule for a read.  This ensures that
   missing/failed devices don't confuse the loop at the top of
   sync_request_write().
   Commit 8be185f2c9d54d6 "raid10: Use bio_reset()"
   removed a loop which set BIO_UPTODATE on all appropriate bios.
   So we need to re-add that flag.

Reported-by: Brassow Jonathan <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index cd066b6..957a719 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -2097,11 +2097,17 @@ static void sync_request_write(struct mddev *mddev, struct r10bio *r10_bio)
 			 * both 'first' and 'i', so we just compare them.
 			 * All vec entries are PAGE_SIZE;
 			 */
-			for (j = 0; j < vcnt; j++)
+			int sectors = r10_bio->sectors;
+			for (j = 0; j < vcnt; j++) {
+				int len = PAGE_SIZE;
+				if (sectors < (len / 512))
+					len = sectors * 512;
 				if (memcmp(page_address(fbio->bi_io_vec[j].bv_page),
 					   page_address(tbio->bi_io_vec[j].bv_page),
-					   fbio->bi_io_vec[j].bv_len))
+					   len))
 					break;
+				sectors -= len/512;
+			}
 			if (j == vcnt)
 				continue;
 			atomic64_add(r10_bio->sectors, &mddev->resync_mismatches);
@@ -3407,6 +3413,7 @@ static sector_t sync_request(struct mddev *mddev, sector_t sector_nr,
 
 		if (bio->bi_end_io == end_sync_read) {
 			md_sync_acct(bio->bi_bdev, nr_sectors);
+			set_bit(BIO_UPTODATE, &bio->bi_flags);
 			generic_make_request(bio);
 		}
 	}
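
To illustrate the first hunk outside the kernel: memcmp() with a length
of zero always returns 0, so a compare driven by a zero bv_len can never
report a difference.  The sketch below models the patched loop in
userland, assuming 4096-byte pages laid out contiguously;
copies_differ() and the MODEL_* constants are hypothetical names, not
kernel identifiers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE   4096	/* assumed page size for this model */
#define MODEL_SECTOR_SIZE 512

/*
 * Compare two copies page by page.  The length compared in each page is
 * derived from the number of sectors still to be checked, mirroring the
 * patched loop, rather than from a per-page length field that may be 0.
 * Returns 1 if the copies differ within `sectors` sectors, 0 otherwise.
 */
static int copies_differ(const unsigned char *first,
			 const unsigned char *other,
			 size_t vcnt, int sectors)
{
	size_t j;

	assert(sectors >= 0);
	for (j = 0; j < vcnt; j++) {
		size_t len = MODEL_PAGE_SIZE;

		/* The last page may be only partly covered. */
		if (sectors < (int)(len / MODEL_SECTOR_SIZE))
			len = (size_t)sectors * MODEL_SECTOR_SIZE;
		if (memcmp(first + j * MODEL_PAGE_SIZE,
			   other + j * MODEL_PAGE_SIZE, len) != 0)
			return 1;
		sectors -= (int)(len / MODEL_SECTOR_SIZE);
	}
	return 0;
}
```

With one page and 8 sectors the whole page is compared; with 1 sector
only the first 512 bytes are, so a difference beyond that range is
outside the request and is not reported.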


* Re: Scrubbing "check" not working for RAID10 in 3.10-rc1+
From: Brassow Jonathan @ 2013-07-17 18:24 UTC
  To: NeilBrown; +Cc: linux-raid


On Jul 16, 2013, at 2:01 AM, NeilBrown wrote:

> On Mon, 15 Jul 2013 10:35:07 -0500 Brassow Jonathan <jbrassow@redhat.com>
> wrote:
> 
>> 
>> On Jun 25, 2013, at 1:32 AM, NeilBrown wrote:
>> 
>>> On Tue, 25 Jun 2013 01:19:20 -0500 Jonathan Brassow <jbrassow@redhat.com>
>>> wrote:
>>> 
>>>> Neil,
>>>> 
>>>> I've noticed that the "check" operation no longer works for RAID10.  It
>>>> works just fine for the other RAIDs.  The ("data-check") sync_thread
>>>> kicks off just fine, sync_request_write() is called, but it never gets
>>>> past:
>>>>       if (i == conf->copies)
>>>>               goto done;
>>>> The test I am performing creates a RAID array, waits for it to sync,
>>>> shuts it down, writes random data to one of the devices, assembles the
>>>> array, and then runs a "check" - there should be discrepancies.  The
>>>> discrepancies are found and recorded in resync_mismatches for all RAID
>>>> levels on kernels <= 3.9, but only for the non-RAID10 levels on 3.10-rc1+.
>>> 
>>> I just tried on 3.10-rc5+ and it works as expected.
>>> If you can provide a test script that fails, I'll look into it.
>> 
>> Just tried 3.10 - it fails for me there too.  I'll send you the script I use shortly.
>> 
>> thanks,
>> brassow
>> 
>> (vacation ends soon.)
> :-)
> 
> Thanks.  This patch seems to fix it.

Yes, it does.  Thanks!

 brassow

