* Reliability of bitmapped resync @ 2009-02-23 19:40
  From: Piergiorgio Sartor
  To: linux-raid

Hi all,

I've a strange issue.

I've a PC with 2 HDs in RAID-10 f2 with bitmap.
There are actually 3 md devices, boot, swap and root.

It happens that one SATA cable is/was flaky, so sometimes,
at boot, /dev/sdb does not show up.
The RAID starts in degraded mode, tracking the writes
in the bitmap.
On the next reboot, /dev/sdb is again there, so it is
possible to re-add it.
The md device resyncs what is to be resynced, very
quickly, due to the bitmap.

Later, if I run a check, usually a lot of mismatches
show up.  After a repair (or add), further checks return
zero mismatches.
Without a boot failure, no mismatches showed up after
a check.

On a different PC, with the same setup but good cables,
something similar happened.  I tried, just for testing, to
fail-remove-writeSomething-reAdd one HD, and then run a
check.  This also returned some (few) mismatches.
Now, I did not repeat this test, so I cannot say this was
always the case.

Nevertheless, I'm a bit concerned.

Is this behaviour somehow expected?
Is there something special to take into account when
removing and re-adding a RAID component?

Thanks a lot in advance,

bye,

--
piergiorgio
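The re-add step described above is, roughly, the following (a sketch;
the device names /dev/md2 and /dev/sdb3 follow later messages in this
thread):

  mdadm /dev/md2 --re-add /dev/sdb3   # re-add the disk that came back
  cat /proc/mdstat                    # the bitmap resync should finish quickly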
* Re: Reliability of bitmapped resync @ 2009-02-23 19:59
  From: NeilBrown
  To: Piergiorgio Sartor; Cc: linux-raid

On Tue, February 24, 2009 6:40 am, Piergiorgio Sartor wrote:
> Hi all,
>
> I've a strange issue.
>
> I've a PC with 2 HDs in RAID-10 f2 with bitmap.
> There are actually 3 md devices, boot, swap and root.
>
> It happens that one SATA cable is/was flaky, so sometimes,
> at boot, /dev/sdb does not show up.
> The RAID starts in degraded mode, tracking the writes
> in the bitmap.
> On the next reboot, /dev/sdb is again there, so it is
> possible to re-add it.
> The md device resyncs what is to be resynced, very
> quickly, due to the bitmap.
>
> Later, if I run a check, usually a lot of mismatches
> show up.

What exactly do you mean by "check"?

If you mean "look in /sys/block/md0/md/mismatch_cnt", then that is
exactly what I would expect.  The resync found some differences, just
as you would expect it to, and reported them.

However, if by "check" you mean:

  echo check > /sys/block/md0/md/sync_action
  mdadm --wait /dev/md0
  cat /sys/block/md0/md/mismatch_cnt

then I would not expect any mismatches, and the resync should have
fixed them.

If it is the latter, that is a real concern and I will need to look
into it.
Please let me know exactly which kernel version and mdadm version you
are running.

Thanks,
NeilBrown
* Re: Reliability of bitmapped resync @ 2009-02-23 20:19
  From: Piergiorgio Sartor
  To: NeilBrown; Cc: Piergiorgio Sartor, linux-raid

Hi,

> What exactly do you mean by "check"?
>
> If you mean "look in /sys/block/md0/md/mismatch_cnt", then that is
> exactly what I would expect.  The resync found some differences, just
> as you would expect it to, and reported them.
>
> However, if by "check" you mean:
>
>   echo check > /sys/block/md0/md/sync_action
>   mdadm --wait /dev/md0
>   cat /sys/block/md0/md/mismatch_cnt

yes, that is what I mean.
I start the check _after_ the resync.

Actually, maybe this is not correct: I keep running
something like

  watch cat /proc/mdstat /sys/block/md2/md/mismatch_cnt

so I can see, real-time so to speak, the check
progress and the mismatch count, just to have
an idea of where on the RAID the mismatches
could be located.

Is this a problem?

> then I would not expect any mismatches, and the resync should have
> fixed them.
>
> If it is the latter, that is a real concern and I will need to look
> into it.
> Please let me know exactly which kernel version and mdadm version you
> are running.

It is an up-to-date Fedora 10, i.e. kernel-2.6.27.15-170.2.24.fc10
and mdadm-2.6.7.1-1.fc10.

Thanks again,

bye,

--
piergiorgio
* Re: Reliability of bitmapped resync @ 2009-02-23 21:31
  From: NeilBrown
  To: Piergiorgio Sartor; Cc: linux-raid

On Tue, February 24, 2009 7:19 am, Piergiorgio Sartor wrote:
> Hi,
>
>> What exactly do you mean by "check"?
>>
>> If you mean "look in /sys/block/md0/md/mismatch_cnt", then that is
>> exactly what I would expect.  The resync found some differences, just
>> as you would expect it to, and reported them.
>>
>> However, if by "check" you mean:
>>
>>   echo check > /sys/block/md0/md/sync_action
>>   mdadm --wait /dev/md0
>>   cat /sys/block/md0/md/mismatch_cnt
>
> yes, that is what I mean.
> I start the check _after_ the resync.
>
> Actually, maybe this is not correct: I keep running
> something like
>
>   watch cat /proc/mdstat /sys/block/md2/md/mismatch_cnt
>
> so I can see, real-time so to speak, the check
> progress and the mismatch count, just to have
> an idea of where on the RAID the mismatches
> could be located.
>
> Is this a problem?

No, that isn't a problem.

>> then I would not expect any mismatches, and the resync should have
>> fixed them.
>>
>> If it is the latter, that is a real concern and I will need to look
>> into it.
>> Please let me know exactly which kernel version and mdadm version you
>> are running.
>
> It is an up-to-date Fedora 10, i.e. kernel-2.6.27.15-170.2.24.fc10
> and mdadm-2.6.7.1-1.fc10.

I might have found something.  If the bitmap chunk size is smaller than
the raid10 chunk size, and the first bitmap chunk in a raid10 chunk is
clean, it might be skipping the remaining bitmap chunks in that
raid10 chunk.

Can you please show me "--examine" and "--examine-bitmap" output from
one of the devices in your array?

Thanks,
NeilBrown
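One way to test this hypothesis on scratch devices, without touching
real data, would be something like the following sketch; the loop
devices, sizes and chunk values are illustrative only (note the raid10
chunk, 256K, deliberately larger than the bitmap chunk, 64K):

  # build two small loop devices
  dd if=/dev/zero of=/tmp/md-test-0 bs=1M count=256
  dd if=/dev/zero of=/tmp/md-test-1 bs=1M count=256
  losetup /dev/loop0 /tmp/md-test-0
  losetup /dev/loop1 /tmp/md-test-1

  # raid10 f2 with a bitmap chunk smaller than the raid10 chunk
  mdadm --create /dev/md9 --level=10 --layout=f2 --raid-devices=2 \
        --chunk=256 --bitmap=internal --bitmap-chunk=64 \
        /dev/loop0 /dev/loop1

  # fail one half, dirty some bitmap chunks, re-add and resync
  mdadm /dev/md9 --fail /dev/loop1 --remove /dev/loop1
  dd if=/dev/urandom of=/dev/md9 bs=1M count=64
  mdadm /dev/md9 --re-add /dev/loop1
  mdadm --wait /dev/md9

  # a non-zero count here would confirm the skipping
  echo check > /sys/block/md9/md/sync_action
  mdadm --wait /dev/md9
  cat /sys/block/md9/md/mismatch_cnt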
* Re: Reliability of bitmapped resync @ 2009-02-23 21:40
  From: Piergiorgio Sartor
  To: NeilBrown; Cc: Piergiorgio Sartor, linux-raid

Hi,

> I might have found something.  If the bitmap chunk size is smaller than
> the raid10 chunk size, and the first bitmap chunk in a raid10 chunk is
> clean, it might be skipping the remaining bitmap chunks in that
> raid10 chunk.

The RAID-10 chunk is the standard 64K, the bitmap
chunk should be 256K.
So this does not seem to be the case, if I
got it correctly.

> Can you please show me "--examine" and "--examine-bitmap" output from
> one of the devices in your array?

OK, but you'll have to wait until tomorrow, I do
not have the PC here (it is an office PC).

bye,

--
piergiorgio
* Re: Reliability of bitmapped resync @ 2009-02-23 21:49
  From: NeilBrown
  To: Piergiorgio Sartor; Cc: linux-raid

On Tue, February 24, 2009 8:40 am, Piergiorgio Sartor wrote:
> Hi,
>
>> I might have found something.  If the bitmap chunk size is smaller than
>> the raid10 chunk size, and the first bitmap chunk in a raid10 chunk is
>> clean, it might be skipping the remaining bitmap chunks in that
>> raid10 chunk.
>
> The RAID-10 chunk is the standard 64K, the bitmap
> chunk should be 256K.
> So this does not seem to be the case, if I
> got it correctly.

OK.  In that case I cannot reproduce the problem.

>> Can you please show me "--examine" and "--examine-bitmap" output from
>> one of the devices in your array?
>
> OK, but you'll have to wait until tomorrow, I do
> not have the PC here (it is an office PC).

I'll wait for these details before I start hunting further.

Thanks,
NeilBrown
* Re: Reliability of bitmapped resync @ 2009-02-24 19:39
  From: Piergiorgio Sartor
  To: NeilBrown; Cc: Piergiorgio Sartor, linux-raid

Hi,

> I'll wait for these details before I start hunting further.

OK, here we are.
Some background first: the last disk to fail at boot was
/dev/sda, and this data was collected after a "clean"
add of /dev/sda3 to the RAID.
This means the superblock was zeroed and the device
added, so it should be clean.

mdadm --examine /dev/sda3

  /dev/sda3:
            Magic : a92b4efc
          Version : 1.1
      Feature Map : 0x1
       Array UUID : b601d547:b62e9563:2c68459c:22db163f
             Name : root
    Creation Time : Tue Feb 10 15:43:09 2009
       Raid Level : raid10
     Raid Devices : 2

   Avail Dev Size : 483941796 (230.76 GiB 247.78 GB)
       Array Size : 483941632 (230.76 GiB 247.78 GB)
    Used Dev Size : 483941632 (230.76 GiB 247.78 GB)
      Data Offset : 264 sectors
     Super Offset : 0 sectors
            State : active
      Device UUID : f3665458:d51d27f5:87724fb8:529f91f1

  Internal Bitmap : 8 sectors from superblock
      Update Time : Tue Feb 24 09:03:46 2009
         Checksum : 68a2de81 - correct
           Events : 6541

           Layout : near=1, far=2
       Chunk Size : 64K

       Array Slot : 3 (failed, failed, 1, 0)
      Array State : Uu 2 failed

mdadm --examine-bitmap /dev/sda3

          Filename : /dev/sda3
             Magic : 6d746962
           Version : 4
              UUID : b601d547:b62e9563:2c68459c:22db163f
            Events : 6541
    Events Cleared : 6540
             State : OK
         Chunksize : 256 KB
            Daemon : 5s flush period
        Write Mode : Normal
         Sync Size : 241970816 (230.76 GiB 247.78 GB)
            Bitmap : 945199 bits (chunks), 524289 dirty (55.5%)

Now, one thing I do not understand, but maybe it is
OK anyway, is this last line:

  Bitmap : 945199 bits (chunks), 524289 dirty (55.5%)

because the array status was fully recovered (in sync),
and /dev/sdb3 showed:

  Bitmap : 945199 bits (chunks), 1 dirty (0.0%)

which is confirmed, somehow, by /proc/mdstat.

How could it be 55.5% dirty?  Is this expected?

Further note.  I tested, on an identical PC with a slightly
different RAID (metadata 1.0 vs. 1.1), the following:

  mdadm --fail /dev/md2 /dev/sdb3
  (wait a little)
  mdadm --remove /dev/md2 /dev/sdb3
  (do something to make the bitmap a bit dirty)
  mdadm --re-add /dev/md2 /dev/sdb3
  (wait for resync to finish with "watch cat /proc/mdstat")
  echo check > /sys/block/md2/md/sync_action
  watch cat /proc/mdstat /sys/block/md2/md/mismatch_cnt

Now, immediately the mismatch count went to something like
1152 (or similar).  After around 25% of the check it was
around 1440, then I issued an "idle" and re-added the disk
cleanly.
This repeats the experience I already had.

This is still a RAID-10 f2, with header 1.0, chunk 64KB
and bitmap chunk size of 16MB (or 16384KB).

Somehow it seems, at least on this setup, that the bitmap
does not track everything, or the resync does not consider
all the bitmap chunks.

Thanks,

bye,

--
piergiorgio
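For reference, a runnable version of the sequence above could look
like this (a sketch; the dd target is just one example of a write that
dirties the bitmap, and it assumes /var/tmp lives on md2):

  mdadm --fail /dev/md2 /dev/sdb3
  sleep 10                                    # "wait a little"
  mdadm --remove /dev/md2 /dev/sdb3
  dd if=/dev/urandom of=/var/tmp/dirty.bin bs=1M count=64 conv=fsync
  mdadm --re-add /dev/md2 /dev/sdb3
  mdadm --wait /dev/md2                       # let the resync finish
  echo check > /sys/block/md2/md/sync_action
  mdadm --wait /dev/md2
  cat /sys/block/md2/md/mismatch_cnt          # expected to be 0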
* Re: Reliability of bitmapped resync @ 2009-02-25  2:39
  From: Neil Brown
  To: Piergiorgio Sartor; Cc: linux-raid

On Tuesday February 24, piergiorgio.sartor@nexgo.de wrote:
> Hi,
>
>> I'll wait for these details before I start hunting further.
>
> OK, here we are.
> Some background first: the last disk to fail at boot was
> /dev/sda, and this data was collected after a "clean"
> add of /dev/sda3 to the RAID.
> This means the superblock was zeroed and the device
> added, so it should be clean.
> ....

Thanks a lot for that.

> Now, one thing I do not understand, but maybe it is
> OK anyway, is this last line:
>
>   Bitmap : 945199 bits (chunks), 524289 dirty (55.5%)
>
> because the array status was fully recovered (in sync),
> and /dev/sdb3 showed:
>
>   Bitmap : 945199 bits (chunks), 1 dirty (0.0%)
>
> which is confirmed, somehow, by /proc/mdstat.
>
> How could it be 55.5% dirty?  Is this expected?

This is a bug.  It is fixed by a patch that I have queued for 2.6.30.
As it doesn't cause a crash or data corruption, it doesn't get to jump
the queue.  It is very small though:

--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -266,7 +266,6 @@ static mdk_rdev_t *next_active_rdev(mdk_rdev_t *rdev, mddev_t *mddev)
 	list_for_each_continue_rcu(pos, &mddev->disks) {
 		rdev = list_entry(pos, mdk_rdev_t, same_set);
 		if (rdev->raid_disk >= 0 &&
-		    test_bit(In_sync, &rdev->flags) &&
 		    !test_bit(Faulty, &rdev->flags)) {
 			/* this is a usable devices */
 			atomic_inc(&rdev->nr_pending);

I'm fairly sure I have found the bug that caused the problem you first
noticed.  It was introduced in 2.6.25.
Below are two patches for raid10 which I have just submitted for
2.6.29 (as they can cause data corruption and so can jump the queue).

The first solves your problem.  The second solves a similar situation
when the bitmap chunk size is smaller.

If you are able to test and confirm, that would be great.

Thanks a lot for reporting the problem and following through!

NeilBrown

From: NeilBrown <neilb@suse.de>
Subject: [PATCH 2/2] md/raid10: Don't call bitmap_cond_end_sync when we are doing recovery.
Date: Wed, 25 Feb 2009 13:38:19 +1100

For raid1/4/5/6, resync (fixing inconsistencies between devices) is
very similar to recovery (rebuilding a failed device onto a spare).
They both walk through the device addresses in order.

For raid10 it can be quite different.  Resync follows the 'array'
address, and makes sure all copies are the same.  Recovery walks
through 'device' addresses and recreates each missing block.

The 'bitmap_cond_end_sync' function allows the write-intent-bitmap
(when present) to be updated to reflect a partially completed resync.
It makes assumptions which mean that it does not work correctly for
raid10 recovery at all.

In particular, it can cause bitmap-directed recovery of a raid10 to
not recover some of the blocks that need to be recovered.

So move the call to bitmap_cond_end_sync into the resync path, rather
than being in the common "resync or recovery" path.
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
---
 drivers/md/raid10.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 118f89e..e1feb87 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1749,8 +1749,6 @@ static sector_t sync_request(mddev_t *mddev, sector_t sector_nr, int *skipped, i
 	if (!go_faster && conf->nr_waiting)
 		msleep_interruptible(1000);
 
-	bitmap_cond_end_sync(mddev->bitmap, sector_nr);
-
 	/* Again, very different code for resync and recovery.
 	 * Both must result in an r10bio with a list of bios that
 	 * have bi_end_io, bi_sector, bi_bdev set,
@@ -1886,6 +1884,8 @@ static sector_t sync_request(mddev_t *mddev, sector_t sector_nr, int *skipped, i
 		/* resync. Schedule a read for every block at this virt offset */
 		int count = 0;
 
+		bitmap_cond_end_sync(mddev->bitmap, sector_nr);
+
 		if (!bitmap_start_sync(mddev->bitmap, sector_nr,
 				       &sync_blocks, mddev->degraded) &&
 		    !conf->fullsync && !test_bit(MD_RECOVERY_REQUESTED,
 						 &mddev->recovery)) {

From: NeilBrown <neilb@suse.de>
Subject: [PATCH 1/2] md/raid10: Don't skip more than 1 bitmap-chunk at a time during recovery.
Date: Wed, 25 Feb 2009 13:38:19 +1100

When doing recovery on a raid10 with a write-intent bitmap, we only
need to recover chunks that are flagged in the bitmap.  However, if we
choose to skip a chunk because it isn't flagged, the code currently
skips the whole raid10 chunk, and thus it might not recover some
blocks that need recovering.

This patch fixes it.

In case that is confusing, it might help to understand that there is
a 'raid10 chunk size' which guides how data is distributed across the
devices, and a 'bitmap chunk size' which says how much data
corresponds to a single bit in the bitmap.

This bug only affects cases where the bitmap chunk size is smaller
than the raid10 chunk size.

Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
---
 drivers/md/raid10.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 6736d6d..118f89e 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -2010,13 +2010,13 @@ static sector_t sync_request(mddev_t *mddev, sector_t sector_nr, int *skipped, i
 		/* There is nowhere to write, so all non-sync
 		 * drives must be failed, so try the next chunk...
 		 */
-		{
-		sector_t sec = max_sector - sector_nr;
-		sectors_skipped += sec;
+		if (sector_nr + max_sync < max_sector)
+			max_sector = sector_nr + max_sync;
+
+		sectors_skipped += (max_sector - sector_nr);
 		chunks_skipped ++;
 		sector_nr = max_sector;
 		goto skipped;
-		}
 }
 
 static int run(mddev_t *mddev)
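To try these against a recent tree, one could apply them roughly as
follows (a sketch; the .patch file names are illustrative, saved from
the two messages above):

  cd linux-2.6.28.7
  patch -p1 < raid10-skip-one-bitmap-chunk.patch        # patch 1/2
  patch -p1 < raid10-cond-end-sync-resync-only.patch    # patch 2/2
  make && make modules_install install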
* Re: Reliability of bitmapped resync @ 2009-02-25 18:51
  From: Piergiorgio Sartor
  To: Neil Brown; Cc: Piergiorgio Sartor, linux-raid

Hi,

>> How could it be 55.5% dirty?  Is this expected?
>
> This is a bug.  It is fixed by a patch that I have queued for 2.6.30.

Ah! OK, good to know.

> I'm fairly sure I have found the bug that caused the problem you first
> noticed.  It was introduced in 2.6.25.
> Below are two patches for raid10 which I have just submitted for
> 2.6.29 (as they can cause data corruption and so can jump the queue).
>
> The first solves your problem.  The second solves a similar situation
> when the bitmap chunk size is smaller.
>
> If you are able to test and confirm, that would be great.

I downloaded a random kernel (2.6.28.7) and patched it with the first
patch only (and the bitmap patch).

Then I was lucky enough to have another HD missing at boot (sigh! It
seems the PSU has a bad mood), so I could immediately try the bitmap
resync (after a second reboot, of course).

It seems it worked fine.  After the (relatively short) resync, I
checked the array and no mismatches were found.

I only ran one test, I hope that is OK.

There is only one thing I noticed.  I was under the impression that,
previously, the "dirty" bits of the bitmap were cleared during the
resync, while now they were all cleared at the end.

> Thanks a lot for reporting the problem and following through!

Not at all, it is also in my interest... :-)

Thanks for the quick solution.

Question about the second patch.  Is it really meaningful to have the
possibility of a bitmap chunk smaller than a RAID chunk?

My understanding is that the data "quantum" is a RAID chunk, so why be
able to track changes at sub-chunk level?

Maybe constraining the bitmap chunk to an integer multiple of the RAID
chunk would keep the code simpler and cleaner, without bringing big
disadvantages.

Just my 2 cents...

bye,

--
piergiorgio
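For what it's worth, the bitmap chunk can already be chosen when the
array is created, so the constraint suggested above can at least be
applied by hand.  A sketch, with the sizes discussed in this thread
(values are in KB for mdadm 2.6.x):

  mdadm --create /dev/md2 --level=10 --layout=f2 --raid-devices=2 \
        --chunk=64 --bitmap=internal --bitmap-chunk=256 \
        /dev/sda3 /dev/sdb3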
* Re: Reliability of bitmapped resync @ 2009-03-13 17:19
  From: Bill Davidsen
  To: Neil Brown; Cc: linux-raid

Neil Brown wrote:
> On Tuesday February 24, piergiorgio.sartor@nexgo.de wrote:
>
>> Hi,
>>
>>> I'll wait for these details before I start hunting further.
>>
>> OK, here we are.
>> Some background first: the last disk to fail at boot was
>> /dev/sda, and this data was collected after a "clean"
>> add of /dev/sda3 to the RAID.
>> This means the superblock was zeroed and the device
>> added, so it should be clean.
>> ....
>
> Thanks a lot for that.
>
>> Now, one thing I do not understand, but maybe it is
>> OK anyway, is this last line:
>>
>>   Bitmap : 945199 bits (chunks), 524289 dirty (55.5%)
>>
>> because the array status was fully recovered (in sync),
>> and /dev/sdb3 showed:
>>
>>   Bitmap : 945199 bits (chunks), 1 dirty (0.0%)
>>
>> which is confirmed, somehow, by /proc/mdstat.
>>
>> How could it be 55.5% dirty?  Is this expected?
>
> This is a bug.  It is fixed by a patch that I have queued for 2.6.30.
> As it doesn't cause a crash or data corruption, it doesn't get to jump
> the queue.  It is very small though:

Belatedly I ask if this went to -stable for 2.6.29.

--
Bill Davidsen <davidsen@tmr.com>
  "Woe unto the statesman who makes war without a reason that will
  still be valid when the war is over..."  Otto von Bismark
* Re: Reliability of bitmapped resync @ 2009-02-23 21:18
  From: Eyal Lebedinsky
  To: linux-raid

Did you mean

  echo repair > /sys/block/md0/md/sync_action

I run 'check' regularly and I am rather sure that it reports, but does
not fix, the mismatches.  Here is my log of recent weekly 'check' runs.

BTW, the mismatches are a mystery to me, as there was not one i/o error
and no other event (the machine was not even powered down/up).  I will
see how it does when the array moves (soon) to a new mobo.

  Sat Oct 25 10:47:16 EST 2008 mdcheck: end mismatch_cnt=0
  Sat Nov  1 14:49:11 EST 2008 mdcheck: end mismatch_cnt=0
  Sat Nov  8 14:48:15 EST 2008 mdcheck: end mismatch_cnt=0
  Sat Nov 15 14:49:13 EST 2008 mdcheck: end mismatch_cnt=0
  Sat Nov 22 14:48:12 EST 2008 mdcheck: end mismatch_cnt=0
  Sat Nov 29 14:48:10 EST 2008 mdcheck: end mismatch_cnt=16
  Sat Dec  6 14:48:11 EST 2008 mdcheck: end mismatch_cnt=136
  Sat Dec 13 14:48:10 EST 2008 mdcheck: end mismatch_cnt=184
  Sat Dec 20 14:48:10 EST 2008 mdcheck: end mismatch_cnt=280
  Sat Dec 27 14:48:07 EST 2008 mdcheck: end mismatch_cnt=288
  Sat Jan  3 14:48:09 EST 2009 mdcheck: end mismatch_cnt=288
  Sat Jan 10 14:48:09 EST 2009 mdcheck: end mismatch_cnt=328
  Sat Jan 17 10:21:16 EST 2009 mdcheck: end mismatch_cnt=328
  Sat Jan 24 10:17:13 EST 2009 mdcheck: end mismatch_cnt=400
  Sat Jan 31 10:17:14 EST 2009 mdcheck: end mismatch_cnt=408
  Mon Feb  2 17:23:14 EST 2009 mdcheck: end mismatch_cnt=408
  Sat Feb  7 10:12:09 EST 2009 mdcheck: end mismatch_cnt=0   <<< after repair done manually

Eyal

NeilBrown wrote:
> On Tue, February 24, 2009 6:40 am, Piergiorgio Sartor wrote:
>> Hi all,
>>
>> I've a strange issue.
>>
>> I've a PC with 2 HDs in RAID-10 f2 with bitmap.
>> There are actually 3 md devices, boot, swap and root.
>>
>> It happens that one SATA cable is/was flaky, so sometimes,
>> at boot, /dev/sdb does not show up.
>> The RAID starts in degraded mode, tracking the writes
>> in the bitmap.
>> On the next reboot, /dev/sdb is again there, so it is
>> possible to re-add it.
>> The md device resyncs what is to be resynced, very
>> quickly, due to the bitmap.
>>
>> Later, if I run a check, usually a lot of mismatches
>> show up.
>
> What exactly do you mean by "check"?
>
> If you mean "look in /sys/block/md0/md/mismatch_cnt", then that is
> exactly what I would expect.  The resync found some differences, just
> as you would expect it to, and reported them.
>
> However, if by "check" you mean:
>
>   echo check > /sys/block/md0/md/sync_action
>   mdadm --wait /dev/md0
>   cat /sys/block/md0/md/mismatch_cnt
>
> then I would not expect any mismatches, and the resync should have
> fixed them.
>
> If it is the latter, that is a real concern and I will need to look
> into it.
> Please let me know exactly which kernel version and mdadm version you
> are running.
>
> Thanks,
> NeilBrown

--
Eyal Lebedinsky (eyal@eyal.emu.id.au)
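A weekly 'check' job producing a log like the one above can be as
simple as the following sketch ('mdcheck' is assumed here to be a
local script, and the log path is illustrative):

  #!/bin/sh
  # run a read-only check on md0 and log the resulting mismatch count
  md=md0
  log=/var/log/mdcheck.log
  echo check > /sys/block/$md/md/sync_action
  mdadm --wait /dev/$md
  echo "$(date) mdcheck: end mismatch_cnt=$(cat /sys/block/$md/md/mismatch_cnt)" >> "$log"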
* Re: Reliability of bitmapped resync @ 2009-02-23 21:36
  From: Piergiorgio Sartor
  To: Eyal Lebedinsky; Cc: linux-raid

Hi,

> Did you mean
>
>   echo repair > /sys/block/md0/md/sync_action
>
> I run 'check' regularly and I am rather sure that it reports, but does
> not fix, the mismatches.

The sequence is as follows:

  1) disk removed
  2) disk re-added
  3) _automatic_ resync
  4) manual check (after the resync finished)

The last check returns mismatches.

bye,

--
piergiorgio
Thread overview: 12+ messages (newest: 2009-03-13 17:19 UTC)

  2009-02-23 19:40 Reliability of bitmapped resync -- Piergiorgio Sartor
  2009-02-23 19:59 ` NeilBrown
  2009-02-23 20:19   ` Piergiorgio Sartor
  2009-02-23 21:31     ` NeilBrown
  2009-02-23 21:40       ` Piergiorgio Sartor
  2009-02-23 21:49         ` NeilBrown
  2009-02-24 19:39           ` Piergiorgio Sartor
  2009-02-25  2:39             ` Neil Brown
  2009-02-25 18:51               ` Piergiorgio Sartor
  2009-03-13 17:19               ` Bill Davidsen
  2009-02-23 21:18   ` Eyal Lebedinsky
  2009-02-23 21:36     ` Piergiorgio Sartor