* Scrubbing "check" not working for RAID10 in 3.10-rc1+
From: Jonathan Brassow @ 2013-06-25  6:19 UTC
To: neilb; +Cc: linux-raid

Neil,

I've noticed that the "check" operation no longer works for RAID10.  It
works just fine for the other RAIDs.  The ("data-check") sync_thread
kicks off just fine, sync_request_write() is called, but it never gets
past:
	if (i == conf->copies)
		goto done;

The test I am performing creates a RAID array, waits for it to sync,
shuts it down, writes random data to one of the devices, assembles the
array, and then runs a "check" - there should be discrepancies.  The
discrepancies are found and recorded in resync_mismatches for all RAID
levels on kernels <= 3.9, but only for the non-RAID10 levels on 3.10-rc1+.

I'm sorry I haven't tracked it down yet and I'm going to be on vacation
starting tomorrow with only intermittent access to e-mail.  Sorry to
leave you hanging.

Thanks,
 brassow

P.S.  This also reminded me of a patch I have concerning tracking the
last sync action for the purpose of making mismatch_count more useful.
I'll post that before leaving.
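For readers following along: the check quoted above sits right after a scan for the first copy whose read completed successfully. The sketch below is a simplified userspace model of that control flow, not the kernel source. `struct copy`, its `uptodate` field and `first_uptodate()` are hypothetical stand-ins for the per-copy bio and its up-to-date flag. It only illustrates why nothing is compared when no copy is marked up to date.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one mirror copy and its "read succeeded" flag. */
struct copy {
	bool uptodate;
};

/*
 * Return the index of the first copy marked up to date, or ncopies if
 * none is.  A caller would treat a return of ncopies the way the quoted
 * check treats i == conf->copies: there is no reference copy to compare
 * against, so it skips straight to the end.
 */
static size_t first_uptodate(const struct copy *copies, size_t ncopies)
{
	size_t i;

	assert(copies != NULL || ncopies == 0);
	for (i = 0; i < ncopies; i++)
		if (copies[i].uptodate)
			break;
	return i;
}
```

In this model, a run in which the flag is never set on any copy always returns `ncopies`, so a caller would report no mismatches regardless of what is on disk.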
* Re: Scrubbing "check" not working for RAID10 in 3.10-rc1+
From: NeilBrown @ 2013-06-25  6:32 UTC
To: Jonathan Brassow; +Cc: linux-raid

On Tue, 25 Jun 2013 01:19:20 -0500 Jonathan Brassow <jbrassow@redhat.com>
wrote:

> Neil,
>
> I've noticed that the "check" operation no longer works for RAID10.  It
> works just fine for the other RAIDs.  The ("data-check") sync_thread
> kicks off just fine, sync_request_write() is called, but it never gets
> past:
> 	if (i == conf->copies)
> 		goto done;
> The test I am performing creates a RAID array, waits for it to sync,
> shuts it down, writes random data to one of the devices, assembles the
> array, and then runs a "check" - there should be discrepancies.  The
> discrepancies are found and recorded in resync_mismatches for all RAIDs
> <= 3.9 and only for non-RAID10 3.10-rc1+.

I just tried on 3.10-rc5+ and it works as expected.
If you can provide a test script that fails, I'll look into it.

>
> I'm sorry I haven't tracked it down yet and I'm going to be on vacation
> starting tomorrow with only intermittent access to e-mail.  Sorry to
> leave you hanging.

Go enjoy your vacation and don't worry about me hanging :-)

>
> Thanks,
>  brassow
>
> P.S.  This also reminded me of a patch I have concerning tracking the
> last sync action for the purpose of making mismatch_count more useful.
> I'll post that before leaving.
>

Thanks.

NeilBrown
* Re: Scrubbing "check" not working for RAID10 in 3.10-rc1+
From: Brassow Jonathan @ 2013-07-15 15:35 UTC
To: NeilBrown; +Cc: linux-raid

On Jun 25, 2013, at 1:32 AM, NeilBrown wrote:

> I just tried on 3.10-rc5+ and it works as expected.
> If you can provide a test script that fails, I'll look into it.

Just tried 3.10 - it fails for me there too.  I'll send you the script I use shortly.

thanks,
 brassow

(vacation ends soon.)
* Re: Scrubbing "check" not working for RAID10 in 3.10-rc1+
From: NeilBrown @ 2013-07-16  7:01 UTC
To: Brassow Jonathan; +Cc: linux-raid

On Mon, 15 Jul 2013 10:35:07 -0500 Brassow Jonathan <jbrassow@redhat.com>
wrote:

> Just tried 3.10 - it fails for me there too.  I'll send you the script I use shortly.
>
> thanks,
>  brassow
>
> (vacation ends soon.)

:-)

Thanks.  This patch seems to fix it.

NeilBrown

From b0b0ac3ecf1e54dd6a429294082c47f1e52db41d Mon Sep 17 00:00:00 2001
From: NeilBrown <neilb@suse.de>
Date: Tue, 16 Jul 2013 16:50:47 +1000
Subject: [PATCH] md/raid10: fix two problems with RAID10 resync.

1/ When a difference between blocks is found, data is copied from
   one bio to the other.  However bv_len is used as the length to
   copy and this could be zero.  So use r10_bio->sectors to calculate
   length instead.
   Using bv_len was probably always a bit dubious, but the introduction
   of bio_advance made it much more likely to be a problem.

2/ When preparing some blocks for sync, we don't set BIO_UPTODATE
   except on bios that we schedule for a read.  This ensures that
   missing/failed devices don't confuse the loop at the top of
   sync_request_write().
   Commit 8be185f2c9d54d6 "raid10: Use bio_reset()" removed a loop
   which set BIO_UPTODATE on all appropriate bios.  So we need to
   re-add that flag.

Reported-by: Brassow Jonathan <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index cd066b6..957a719 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -2097,11 +2097,17 @@ static void sync_request_write(struct mddev *mddev, struct r10bio *r10_bio)
 			 * both 'first' and 'i', so we just compare them.
 			 * All vec entries are PAGE_SIZE;
 			 */
-			for (j = 0; j < vcnt; j++)
+			int sectors = r10_bio->sectors;
+			for (j = 0; j < vcnt; j++) {
+				int len = PAGE_SIZE;
+				if (sectors < (len / 512))
+					len = sectors * 512;
 				if (memcmp(page_address(fbio->bi_io_vec[j].bv_page),
 					   page_address(tbio->bi_io_vec[j].bv_page),
-					   fbio->bi_io_vec[j].bv_len))
+					   len))
 					break;
+				sectors -= len/512;
+			}
 			if (j == vcnt)
 				continue;
 			atomic64_add(r10_bio->sectors, &mddev->resync_mismatches);
@@ -3407,6 +3413,7 @@ static sector_t sync_request(struct mddev *mddev, sector_t sector_nr,
 
 		if (bio->bi_end_io == end_sync_read) {
 			md_sync_acct(bio->bi_bdev, nr_sectors);
+			set_bit(BIO_UPTODATE, &bio->bi_flags);
 			generic_make_request(bio);
 		}
 	}
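The first hunk of the patch derives the compare length from the number of sectors still to be checked rather than from a per-vector length field. A minimal userspace sketch of that idea follows. It is an illustration under stated assumptions, not kernel code: `SKETCH_PAGE_SIZE`, `SECTOR_SIZE` and `first_mismatch()` are hypothetical names, and plain byte buffers stand in for the pages behind each bio vector.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096	/* assumed page size for this model */
#define SECTOR_SIZE 512

/*
 * Compare two sets of page-sized buffers that together cover `sectors`
 * sectors.  The length used for each page comes from the remaining
 * sector count, so the final page may be compared only partially.
 * Returns the index of the first page that differs, or npages if all
 * compared bytes match.
 */
static size_t first_mismatch(unsigned char *const *a, unsigned char *const *b,
			     size_t npages, int sectors)
{
	size_t j;

	assert(sectors >= 0);
	for (j = 0; j < npages; j++) {
		int len = SKETCH_PAGE_SIZE;

		/* Clamp the last page to the sectors actually in the request. */
		if (sectors < len / SECTOR_SIZE)
			len = sectors * SECTOR_SIZE;
		if (memcmp(a[j], b[j], (size_t)len))
			break;
		sectors -= len / SECTOR_SIZE;
	}
	return j;
}
```

Because the length no longer depends on a field that other code may have advanced to zero, a differing page is always seen by `memcmp()` as long as it lies within the requested sector range.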
* Re: Scrubbing "check" not working for RAID10 in 3.10-rc1+
From: Brassow Jonathan @ 2013-07-17 18:24 UTC
To: NeilBrown; +Cc: linux-raid

On Jul 16, 2013, at 2:01 AM, NeilBrown wrote:

> Thanks.  This patch seems to fix it.

Yes, it does.  Thanks!

 brassow
* Re: Scrubbing "check" not working for RAID10 in 3.10-rc1+
From: Jonathan Brassow @ 2013-07-15 15:40 UTC
To: neilb; +Cc: linux-raid

Neil,

You will need to change 'devices' to suit your needs.  I can run this
test with RAID 1/4/5/6 and it works, but it fails with RAID10 since
3.10-rc1.

thanks,
 brassow

example output:

[~]# ./md.sh 1
mdadm: /dev/sda1 appears to be part of a raid array:
    level=raid5 devices=4 ctime=Sat Jul 13 14:52:49 2013
mdadm: Note: this array has metadata at the start and
    may not be suitable as a boot device.  If you plan to
    store '/boot' on this device please ensure that
    your boot-loader understands md/v1.x metadata, or use
    --metadata=0.90
mdadm: /dev/sdb1 appears to be part of a raid array:
    level=raid5 devices=4 ctime=Sat Jul 13 14:52:49 2013
mdadm: /dev/sdc1 appears to be part of a raid array:
    level=raid5 devices=4 ctime=Sat Jul 13 14:52:49 2013
mdadm: /dev/sdd1 appears to be part of a raid array:
    level=raid5 devices=4 ctime=Sat Jul 13 14:52:49 2013
mdadm: largest drive (/dev/sda1) exceeds size (102400K) by more than 1%
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
Waiting for resync to complete
Waiting for resync to complete
Waiting for resync to complete
Waiting for resync to complete
RAID1 mismatch count after creation : 0
mdadm: stopped /dev/md0
Writing garbage to one of the MD devices...
mdadm: /dev/md0 has been started with 4 drives.
RAID1 mismatch count after reactivation : 0
Waiting for check to complete
Waiting for check to complete
Waiting for check to complete
RAID1 mismatch count after data-check : 61440
mdadm: stopped /dev/md0

[~]# ./md.sh 10
mdadm: /dev/sda1 appears to be part of a raid array:
    level=raid1 devices=4 ctime=Mon Jul 15 10:30:44 2013
mdadm: /dev/sdb1 appears to be part of a raid array:
    level=raid1 devices=4 ctime=Mon Jul 15 10:30:44 2013
mdadm: /dev/sdc1 appears to be part of a raid array:
    level=raid1 devices=4 ctime=Mon Jul 15 10:30:44 2013
mdadm: /dev/sdd1 appears to be part of a raid array:
    level=raid1 devices=4 ctime=Mon Jul 15 10:30:44 2013
mdadm: largest drive (/dev/sda1) exceeds size (102400K) by more than 1%
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
Waiting for resync to complete
Waiting for resync to complete
Waiting for resync to complete
RAID10 mismatch count after creation : 0
mdadm: stopped /dev/md0
Writing garbage to one of the MD devices...
mdadm: /dev/md0 has been started with 4 drives.
RAID10 mismatch count after reactivation : 0
Waiting for check to complete
Waiting for check to complete
Waiting for check to complete
RAID10 mismatch count after data-check : 0
mdadm: stopped /dev/md0
***** mismatch_cnt should not be zero !!!!!!

[-- Attachment #2: md.sh --]
[-- Type: application/x-shellscript, Size: 1742 bytes --]