* question about bitmaps and dirty percentile
@ 2009-07-30 18:25 Jon Nelson
From: Jon Nelson @ 2009-07-30 18:25 UTC (permalink / raw)
To: LinuxRaid
I have a 3-disk raid1 configured with bitmaps.
Most of the time it has only one disk (disk A).
Periodically (weekly or less frequently) I re-add a second disk (disk
B), which then re-synchronizes, and when it's done I --fail and
--remove it.
Even less frequently (monthly or less frequently) I do the same thing
with a third disk (disk C).
Before adding the disks, I will issue an --examine.
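For concreteness, the rotation looks roughly like this - a sketch rather
than the actual script, and /dev/sdX1 is a placeholder for whichever member
disk is being rotated in:

    mdadm --examine /dev/sdX1               # check events / dirty bits first
    mdadm /dev/md12 --re-add /dev/sdX1      # bitmap-based recovery starts
    mdadm --wait /dev/md12                  # block until recovery finishes
    mdadm /dev/md12 --fail /dev/sdX1 --remove /dev/sdX1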
When I added disk B today, it said this:
Events : 14580
Bitmap : 283645 bits (chunks), 11781 dirty (4.2%)
I'm curious why *any* of the bitmap chunks are dirty - when the disks
are removed the device has typically been quiescent for quite some
time. Is there a way to force a "flush" or whatever to get each disk
as up-to-date as possible, prior to a --fail and --remove?
While /dev/nbd0 was syncing, I also --re-add'ed /dev/sdf1, which (as
expected) waited until /dev/nbd0 was done.
Then, due to a logic bug in a script, /dev/sdf1 was removed: the script
was waiting with "mdadm --wait /dev/md12", which returned when /dev/nbd0
was done even though recovery of /dev/sdf1 had not yet started!
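In hindsight the script probably needs to watch the specific member rather
than trusting a single --wait on the array. An untested sketch of what I
mean (the sysfs path and flag name are from memory, so treat them as
assumptions):

    # loop until the member we just re-added actually reports in_sync
    until grep -q in_sync /sys/block/md12/md/dev-sdf1/state; do
        sleep 30
    done
    mdadm /dev/md12 --fail /dev/sdf1 --remove /dev/sdf1

That would only proceed once /dev/sdf1 itself is in sync, no matter how
many recoveries the array queues up in the meantime.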
Then things got weird.
I saw this, which just *can't* be right:
md12 : active raid1 nbd0[2](W) sde[0]
      72612988 blocks super 1.1 [3/1] [U__]
      [======================================>] recovery =192.7%
      (69979200/36306494) finish=13228593199978.6min speed=11620K/sec
      bitmap: 139/139 pages [556KB], 256KB chunk
and of course the percentile kept growing, and the finish minutes are crazy.
I had to --fail and --remove /dev/nbd0, and re-add it, which
unfortunately started the recovery over.
I haven't even gotten to my questions about dirty percentages and so
on, which I will save for later.
In summary:
3-disk raid1, using bitmaps, with 2 missing disks.
re-add disk B. recovery begins.
re-add disk C. recovery to disk B continues; recovery to disk C waits.
recovery completes on disk B; mdadm --wait returns (earlier than the script expected)
--fail, --remove disk C (which was never recovered onto)
/proc/mdstat crazy, disk I/O still high (WTF is it *doing*, then?)
--fail --remove disk B, --re-add disk B, recovery starts over.
--
Jon
* Re: question about bitmaps and dirty percentile
From: Jon Nelson @ 2009-07-30 19:16 UTC (permalink / raw)
To: LinuxRaid

On Thu, Jul 30, 2009 at 1:25 PM, Jon Nelson<jnelson-linux-raid@jamponi.net> wrote:
> Then things got weird.
>
> I saw this, which just *can't* be right:
>
> md12 : active raid1 nbd0[2](W) sde[0]
>       72612988 blocks super 1.1 [3/1] [U__]
>       [======================================>] recovery =192.7%
>       (69979200/36306494) finish=13228593199978.6min speed=11620K/sec
>       bitmap: 139/139 pages [556KB], 256KB chunk
>
> and of course the percentile kept growing, and the finish minutes are crazy.

Weirdness: it read 199% (or so) and then completed:

md12 : active raid1 nbd0[2](W) sde[0]
      72612988 blocks super 1.1 [3/2] [UU_]
      bitmap: 139/139 pages [556KB], 256KB chunk

I --fail, --remove the device, and then --re-add it.

The recovery *starts over*, as if nothing had happened over the last
hour or so. The event counters are very close between /dev/nbd0 (the
device here) and /dev/sde (the core device), within a dozen or so, but
the "dirty percentile" on /dev/nbd0 is big - 18.8% - and unchanging
between runs. It's like the bitmap isn't getting updated, or is getting
updated incompletely, or something.

Does the bitmap only get updated when *all* devices have sync'd???
I'll let you know in about 2 hours.

--
Jon
* Re: question about bitmaps and dirty percentile
From: Paul Clements @ 2009-07-31 18:17 UTC (permalink / raw)
To: Jon Nelson; +Cc: LinuxRaid

Jon Nelson wrote:
> On Thu, Jul 30, 2009 at 1:25 PM, Jon Nelson<jnelson-linux-raid@jamponi.net> wrote:
>> Then things got weird.
>>
>> I saw this, which just *can't* be right:
>>
>> md12 : active raid1 nbd0[2](W) sde[0]
>>       72612988 blocks super 1.1 [3/1] [U__]
>>       [======================================>] recovery =192.7%
>>       (69979200/36306494) finish=13228593199978.6min speed=11620K/sec
>>       bitmap: 139/139 pages [556KB], 256KB chunk
>>
>> and of course the percentile kept growing, and the finish minutes are crazy.
>
> Weirdness: it read 199% (or so) and then completed:
>
> md12 : active raid1 nbd0[2](W) sde[0]
>       72612988 blocks super 1.1 [3/2] [UU_]
>       bitmap: 139/139 pages [556KB], 256KB chunk
>
> I --fail, --remove the device, and then --re-add it.
>
> The recovery *starts over*, as if nothing had happened over the last
> hour or so. The event counters are very close between /dev/nbd0 (the
> device here) and /dev/sde (the core device), within a dozen or so, but
> the "dirty percentile" on /dev/nbd0 is big - 18.8% - and unchanging
> between runs. It's like the bitmap isn't getting updated, or is getting
> updated incompletely, or something.
>
> Does the bitmap only get updated when *all* devices have sync'd???
> I'll let you know in about 2 hours.

The bitmap never gets cleared unless all disks in the array are in sync.

--
Paul
* Re: question about bitmaps and dirty percentile
From: Jon Nelson @ 2009-07-31 19:09 UTC (permalink / raw)
Cc: LinuxRaid

> The bitmap never gets cleared unless all disks in the array are in sync.

Well, that sucks. What is the reasoning behind that? It would seem that
having 2 out of 3 disks with an up-to-date bitmap would be useful.

However, it doesn't explain the 200% problem.

--
Jon
* Re: question about bitmaps and dirty percentile
From: Matthias Urlichs @ 2009-08-03 16:44 UTC (permalink / raw)
To: linux-raid

On Fri, 31 Jul 2009 14:09:06 -0500, Jon Nelson wrote:
>> The bitmap never gets cleared unless all disks in the array are in
>> sync.
>
> Well, that sucks. What is the reasoning behind that?

There's only one bitmap per array. If the bits were cleared once the data
had been written to disk #2, the system would forget that it still needs
to be written to disk #3.

--
Matthias Urlichs
* Re: question about bitmaps and dirty percentile
From: Paul Clements @ 2009-08-03 20:30 UTC (permalink / raw)
To: linux-raid

Matthias Urlichs wrote:
> On Fri, 31 Jul 2009 14:09:06 -0500, Jon Nelson wrote:
>>> The bitmap never gets cleared unless all disks in the array are in
>>> sync.
>> Well, that sucks. What is the reasoning behind that?
>
> There's only one bitmap per array. If the bits were cleared once the data
> had been written to disk #2, the system would forget that it still needs
> to be written to disk #3.

Right, and that decision was made for efficiency and simplicity of
design. Having a bitmap per pair of component disks would be inefficient
and very complicated.

You could stack raid1's if you absolutely had to have that type of
functionality.

--
Paul
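A rough, untested sketch of that stacking - device names are invented for
illustration, and each mirror layer gets its own internal bitmap:

    # inner mirror: the two members that are normally kept in sync
    mdadm --create /dev/md10 --level=1 --raid-devices=2 --bitmap=internal \
        /dev/sdX1 /dev/sdY1
    # outer mirror: the inner array plus the occasionally-attached member
    mdadm --create /dev/md11 --level=1 --raid-devices=2 --bitmap=internal \
        /dev/md10 /dev/sdZ1

The inner array can then clear its bitmap as soon as its own two legs
agree, independently of whether the outer array's extra leg is attached.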
* Re: question about bitmaps and dirty percentile
From: Neil Brown @ 2009-08-06 6:21 UTC (permalink / raw)
To: Jon Nelson; +Cc: LinuxRaid

On Thursday July 30, jnelson-linux-raid@jamponi.net wrote:
>
> I saw this, which just *can't* be right:
>
> md12 : active raid1 nbd0[2](W) sde[0]
>       72612988 blocks super 1.1 [3/1] [U__]
>       [======================================>] recovery =192.7%
>       (69979200/36306494) finish=13228593199978.6min speed=11620K/sec
>       bitmap: 139/139 pages [556KB], 256KB chunk

Certainly very strange. I cannot explain it at all.

Please report exactly what kernel version you were running, and all kernel
log messages from before the first resync completed until after the
sync-to-200% completed.

Hopefully there will be a clue somewhere in there.

Thanks,
NeilBrown
* Re: question about bitmaps and dirty percentile
From: Jon Nelson @ 2009-08-06 13:02 UTC (permalink / raw)
Cc: LinuxRaid

On Thu, Aug 6, 2009 at 1:21 AM, Neil Brown<neilb@suse.de> wrote:
> Certainly very strange. I cannot explain it at all.
>
> Please report exactly what kernel version you were running, and all kernel
> log messages from before the first resync completed until after the
> sync-to-200% completed.
>
> Hopefully there will be a clue somewhere in there.

Stock openSUSE 2.6.27.25-0.1-default on x86_64.

I'm pretty sure this is it:

Jul 30 13:51:01 turnip kernel: md: bind<nbd0>
Jul 30 13:51:01 turnip kernel: RAID1 conf printout:
Jul 30 13:51:01 turnip kernel:  --- wd:1 rd:3
Jul 30 13:51:01 turnip kernel:  disk 0, wo:0, o:1, dev:sde
Jul 30 13:51:01 turnip kernel:  disk 1, wo:1, o:1, dev:nbd0
Jul 30 13:51:01 turnip kernel: md: recovery of RAID array md12
Jul 30 13:51:01 turnip kernel: md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Jul 30 13:51:01 turnip kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
Jul 30 13:51:01 turnip kernel: md: using 128k window, over a total of 72612988 blocks.
Jul 30 14:10:48 turnip kernel: md: md12: recovery done.
Jul 30 14:10:49 turnip kernel: RAID1 conf printout:
Jul 30 14:10:49 turnip kernel:  --- wd:2 rd:3
Jul 30 14:10:49 turnip kernel:  disk 0, wo:0, o:1, dev:sde
Jul 30 14:10:49 turnip kernel:  disk 1, wo:0, o:1, dev:nbd0

--
Jon
* Re: question about bitmaps and dirty percentile
From: NeilBrown @ 2009-08-07 1:47 UTC (permalink / raw)
To: Jon Nelson; +Cc: LinuxRaid

On Thu, August 6, 2009 11:02 pm, Jon Nelson wrote:
> On Thu, Aug 6, 2009 at 1:21 AM, Neil Brown<neilb@suse.de> wrote:
>> Please report exactly what kernel version you were running, and all kernel
>> log messages from before the first resync completed until after the
>> sync-to-200% completed.
>
> Stock openSUSE 2.6.27.25-0.1-default on x86_64.

Ok, so it was probably broken by whoever maintains md for SuSE....
oh wait, that's me :-)

> I'm pretty sure this is it:
>
> Jul 30 13:51:01 turnip kernel: md: bind<nbd0>
> Jul 30 13:51:01 turnip kernel: md: recovery of RAID array md12
> Jul 30 13:51:01 turnip kernel: md: using 128k window, over a total of 72612988 blocks.
> Jul 30 14:10:48 turnip kernel: md: md12: recovery done.

Thanks... So:

- When the recovery started, mddev->size was the value printed after
  "over a total of ..." (that message shows max_sectors/2), i.e. 72612988
  (I assume this is expected to be a 72 gig array). Twice this will have
  been stored in 'max_sectors', and the loop in md_do_sync will have taken
  'j' up to that value and periodically stored it in mddev->curr_resync.

- When you ran "cat /proc/mdstat", mddev->array_sectors will have been
  twice the value printed at "... blocks", which is the same: 145225976.

- When you ran "cat /proc/mdstat", it printed "recovery", not "resync",
  so MD_RECOVERY_SYNC was not set, and max_sectors was set to
  mddev->size... that looks wrong (->size is in KB). Half of this is
  printed as the second number in the (%d/%d) bit, so ->size was twice
  36306494, or 72612988, which is consistent.

So the problem is that in status_resync, max_sectors is being set to
mddev->size rather than mddev->size*2. This is purely a cosmetic problem;
it does not affect data safety at all.

It looks like I botched a backport of
commit dd71cf6b2773310b01c6fe6c773064c80fd2476b
into the SuSE kernel. I'll get that fixed for the next update.

Thanks for the report, and as I said, the only thing affected here is
the content of /proc/mdstat. The recovery is doing the right thing.

Thanks,
NeilBrown
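To spell out the arithmetic (a back-of-envelope check against the numbers
above, not an authoritative reading of the code):

    numerator shown   ~ curr_resync/2, climbing to 145225976/2 = 72612988
    denominator shown = max_sectors/2 = ->size/2  = 72612988/2  = 36306494
    at the snapshot   : 69979200 / 36306494 = 192.7%
    at completion     : 72612988 / 36306494 = 200%

With the intended max_sectors = ->size*2, the denominator would have been
72612988 and the same snapshot would have read a believable 96.4%.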
* Re: question about bitmaps and dirty percentile
From: Jon Nelson @ 2009-08-07 2:17 UTC (permalink / raw)
Cc: LinuxRaid

On Thu, Aug 6, 2009 at 8:47 PM, NeilBrown<neilb@suse.de> wrote:
...
> So the problem is that in status_resync, max_sectors is being set to
> mddev->size rather than mddev->size*2. This is purely a cosmetic problem;
> it does not affect data safety at all.
>
> It looks like I botched a backport of
> commit dd71cf6b2773310b01c6fe6c773064c80fd2476b
> into the SuSE kernel. I'll get that fixed for the next update.
>
> Thanks for the report, and as I said, the only thing affected here is
> the content of /proc/mdstat. The recovery is doing the right thing.

Sweet! Open Source is AWESOME.

--
Jon
* Re: question about bitmaps and dirty percentile
From: John Robinson @ 2009-08-07 12:29 UTC (permalink / raw)
To: Jon Nelson; +Cc: LinuxRaid

On 07/08/2009 03:17, Jon Nelson wrote:
> Sweet! Open Source is AWESOME.

Amen to that, brother Jon!