* question about bitmaps and dirty percentile
@ 2009-07-30 18:25 Jon Nelson
From: Jon Nelson @ 2009-07-30 18:25 UTC (permalink / raw)
To: LinuxRaid
I have a 3-disk raid1 configured with bitmaps.
Most of the time it has only one disk (disk A).
Periodically (weekly or less frequently) I re-add a second disk (disk
B), which then re-synchronizes, and when it's done I --fail and
--remove it.
Even less frequently (monthly or less frequently) I do the same thing
with a third disk (disk C).
Before adding the disks, I will issue an --examine.
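For concreteness, the rotation looks roughly like this - a sketch rather
than the actual script, and /dev/sdX1 is a placeholder for whichever member
disk is being rotated in:

    mdadm --examine /dev/sdX1               # check events / dirty bits first
    mdadm /dev/md12 --re-add /dev/sdX1      # bitmap-based recovery starts
    mdadm --wait /dev/md12                  # block until recovery finishes
    mdadm /dev/md12 --fail /dev/sdX1 --remove /dev/sdX1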
When I added disk B today, it said this:
Events : 14580
Bitmap : 283645 bits (chunks), 11781 dirty (4.2%)
I'm curious why *any* of the bitmap chunks are dirty - when the disks
are removed the device has typically been quiescent for quite some
time. Is there a way to force a "flush" or whatever to get each disk
as up-to-date as possible, prior to a --fail and --remove?
While /dev/nbd0 was syncing, I also --re-add'ed /dev/sdf1, which (as
expected) waited until /dev/nbd0 was done.
Then, due to a logic bug in a script, /dev/sdf1 was removed: the script
was waiting with "mdadm --wait /dev/md12", which returned when /dev/nbd0
was done even though recovery of /dev/sdf1 had not yet started!
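In hindsight the script probably needs to watch the specific member rather
than trusting a single --wait on the array. An untested sketch of what I
mean (the sysfs path and flag name are from memory, so treat them as
assumptions):

    # loop until the member we just re-added actually reports in_sync
    until grep -q in_sync /sys/block/md12/md/dev-sdf1/state; do
        sleep 30
    done
    mdadm /dev/md12 --fail /dev/sdf1 --remove /dev/sdf1

That would only proceed once /dev/sdf1 itself is in sync, no matter how
many recoveries the array queues up in the meantime.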
Then things got weird.
I saw this, which just *can't* be right:
md12 : active raid1 nbd0[2](W) sde[0]
      72612988 blocks super 1.1 [3/1] [U__]
      [======================================>] recovery =192.7%
      (69979200/36306494) finish=13228593199978.6min speed=11620K/sec
      bitmap: 139/139 pages [556KB], 256KB chunk
and of course the percentile kept growing, and the finish minutes are crazy.
I had to --fail and --remove /dev/nbd0, and re-add it, which
unfortunately started the recovery over.
I haven't even gotten to my questions about dirty percentages and so
on, which I will save for later.
In summary:
3-disk raid1, using bitmaps, with 2 missing disks.
re-add disk B. recovery begins.
re-add disk C. recovery to disk B continues; recovery to disk C waits.
recovery completes on disk B; mdadm --wait returns (earlier than the script expected)
--fail, --remove disk C (which was never recovered onto)
/proc/mdstat crazy, disk I/O still high (WTF is it *doing*, then?)
--fail --remove disk B, --re-add disk B, recovery starts over.
--
Jon
* Re: question about bitmaps and dirty percentile
From: Jon Nelson @ 2009-07-30 19:16 UTC (permalink / raw)
To: LinuxRaid

On Thu, Jul 30, 2009 at 1:25 PM, Jon Nelson<jnelson-linux-raid@jamponi.net> wrote:
> Then things got weird.
>
> I saw this, which just *can't* be right:
>
> md12 : active raid1 nbd0[2](W) sde[0]
>       72612988 blocks super 1.1 [3/1] [U__]
>       [======================================>] recovery =192.7%
>       (69979200/36306494) finish=13228593199978.6min speed=11620K/sec
>       bitmap: 139/139 pages [556KB], 256KB chunk
>
> and of course the percentile kept growing, and the finish minutes are crazy.

Weirdness: it read 199% (or so) and then completed:

md12 : active raid1 nbd0[2](W) sde[0]
      72612988 blocks super 1.1 [3/2] [UU_]
      bitmap: 139/139 pages [556KB], 256KB chunk

I --fail, --remove the device, and then --re-add it.

The recovery *starts over*, as if nothing had happened over the last
hour or so. The event counters are very close between /dev/nbd0 (the
device here) and /dev/sde (the core device), within a dozen or so, but
the "dirty percentile" on /dev/nbd0 is big - 18.8% - and unchanging
between runs. It's like the bitmap isn't getting updated, or is getting
updated incompletely, or something.

Does the bitmap only get updated when *all* devices have sync'd???
I'll let you know in about 2 hours.

--
Jon
* Re: question about bitmaps and dirty percentile
From: Paul Clements @ 2009-07-31 18:17 UTC (permalink / raw)
To: Jon Nelson; +Cc: LinuxRaid

Jon Nelson wrote:
> On Thu, Jul 30, 2009 at 1:25 PM, Jon Nelson<jnelson-linux-raid@jamponi.net> wrote:
>> Then things got weird.
>>
>> I saw this, which just *can't* be right:
>>
>> md12 : active raid1 nbd0[2](W) sde[0]
>>       72612988 blocks super 1.1 [3/1] [U__]
>>       [======================================>] recovery =192.7%
>>       (69979200/36306494) finish=13228593199978.6min speed=11620K/sec
>>       bitmap: 139/139 pages [556KB], 256KB chunk
>>
>> and of course the percentile kept growing, and the finish minutes are crazy.
>
> Weirdness: it read 199% (or so) and then completed:
>
> md12 : active raid1 nbd0[2](W) sde[0]
>       72612988 blocks super 1.1 [3/2] [UU_]
>       bitmap: 139/139 pages [556KB], 256KB chunk
>
> I --fail, --remove the device, and then --re-add it.
>
> The recovery *starts over*, as if nothing had happened over the last
> hour or so. The event counters are very close between /dev/nbd0 (the
> device here) and /dev/sde (the core device), within a dozen or so, but
> the "dirty percentile" on /dev/nbd0 is big - 18.8% - and unchanging
> between runs. It's like the bitmap isn't getting updated, or is getting
> updated incompletely, or something.
>
> Does the bitmap only get updated when *all* devices have sync'd???
> I'll let you know in about 2 hours.

The bitmap never gets cleared unless all disks in the array are in sync.

--
Paul
* Re: question about bitmaps and dirty percentile
From: Jon Nelson @ 2009-07-31 19:09 UTC (permalink / raw)
Cc: LinuxRaid

> The bitmap never gets cleared unless all disks in the array are in sync.

Well, that sucks. What is the reasoning behind that? It would seem that
having 2 out of 3 disks with an up-to-date bitmap would be useful.

However, it doesn't explain the 200% problem.

--
Jon
* Re: question about bitmaps and dirty percentile
From: Matthias Urlichs @ 2009-08-03 16:44 UTC (permalink / raw)
To: linux-raid

On Fri, 31 Jul 2009 14:09:06 -0500, Jon Nelson wrote:
>> The bitmap never gets cleared unless all disks in the array are in
>> sync.
>
> Well, that sucks. What is the reasoning behind that?

There's only one bitmap per array. If the bits were cleared once the data
had been written to disk #2, the system would forget that it still needs
to be written to disk #3.

--
Matthias Urlichs
* Re: question about bitmaps and dirty percentile
From: Paul Clements @ 2009-08-03 20:30 UTC (permalink / raw)
To: linux-raid

Matthias Urlichs wrote:
> On Fri, 31 Jul 2009 14:09:06 -0500, Jon Nelson wrote:
>>> The bitmap never gets cleared unless all disks in the array are in
>>> sync.
>> Well, that sucks. What is the reasoning behind that?
>
> There's only one bitmap per array. If the bits were cleared once the data
> had been written to disk #2, the system would forget that it still needs
> to be written to disk #3.

Right, and that decision was made for efficiency and simplicity of
design. Having a bitmap per pair of component disks would be inefficient
and very complicated.

You could stack raid1's if you absolutely had to have that type of
functionality.

--
Paul
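A rough, untested sketch of that stacking - device names are invented for
illustration, and each mirror layer gets its own internal bitmap:

    # inner mirror: the two members that are normally kept in sync
    mdadm --create /dev/md10 --level=1 --raid-devices=2 --bitmap=internal \
        /dev/sdX1 /dev/sdY1
    # outer mirror: the inner array plus the occasionally-attached member
    mdadm --create /dev/md11 --level=1 --raid-devices=2 --bitmap=internal \
        /dev/md10 /dev/sdZ1

The inner array can then clear its bitmap as soon as its own two legs
agree, independently of whether the outer array's extra leg is attached.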
* Re: question about bitmaps and dirty percentile
From: Neil Brown @ 2009-08-06 6:21 UTC (permalink / raw)
To: Jon Nelson; +Cc: LinuxRaid

On Thursday July 30, jnelson-linux-raid@jamponi.net wrote:
>
> I saw this, which just *can't* be right:
>
> md12 : active raid1 nbd0[2](W) sde[0]
>       72612988 blocks super 1.1 [3/1] [U__]
>       [======================================>] recovery =192.7%
>       (69979200/36306494) finish=13228593199978.6min speed=11620K/sec
>       bitmap: 139/139 pages [556KB], 256KB chunk

Certainly very strange. I cannot explain it at all.

Please report exactly what kernel version you were running, and all kernel
log messages from before the first resync completed until after the
sync-to-200% completed.

Hopefully there will be a clue somewhere in there.

Thanks,
NeilBrown
* Re: question about bitmaps and dirty percentile
From: Jon Nelson @ 2009-08-06 13:02 UTC (permalink / raw)
Cc: LinuxRaid

On Thu, Aug 6, 2009 at 1:21 AM, Neil Brown<neilb@suse.de> wrote:
> Certainly very strange. I cannot explain it at all.
>
> Please report exactly what kernel version you were running, and all kernel
> log messages from before the first resync completed until after the
> sync-to-200% completed.
>
> Hopefully there will be a clue somewhere in there.

Stock openSUSE 2.6.27.25-0.1-default on x86_64.

I'm pretty sure this is it:

Jul 30 13:51:01 turnip kernel: md: bind<nbd0>
Jul 30 13:51:01 turnip kernel: RAID1 conf printout:
Jul 30 13:51:01 turnip kernel:  --- wd:1 rd:3
Jul 30 13:51:01 turnip kernel:  disk 0, wo:0, o:1, dev:sde
Jul 30 13:51:01 turnip kernel:  disk 1, wo:1, o:1, dev:nbd0
Jul 30 13:51:01 turnip kernel: md: recovery of RAID array md12
Jul 30 13:51:01 turnip kernel: md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Jul 30 13:51:01 turnip kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
Jul 30 13:51:01 turnip kernel: md: using 128k window, over a total of 72612988 blocks.
Jul 30 14:10:48 turnip kernel: md: md12: recovery done.
Jul 30 14:10:49 turnip kernel: RAID1 conf printout:
Jul 30 14:10:49 turnip kernel:  --- wd:2 rd:3
Jul 30 14:10:49 turnip kernel:  disk 0, wo:0, o:1, dev:sde
Jul 30 14:10:49 turnip kernel:  disk 1, wo:0, o:1, dev:nbd0

--
Jon
* Re: question about bitmaps and dirty percentile
From: NeilBrown @ 2009-08-07 1:47 UTC (permalink / raw)
To: Jon Nelson; +Cc: LinuxRaid

On Thu, August 6, 2009 11:02 pm, Jon Nelson wrote:
> On Thu, Aug 6, 2009 at 1:21 AM, Neil Brown<neilb@suse.de> wrote:
>> Please report exactly what kernel version you were running, and all kernel
>> log messages from before the first resync completed until after the
>> sync-to-200% completed.
>
> Stock openSUSE 2.6.27.25-0.1-default on x86_64.

Ok, so it was probably broken by whoever maintains md for SuSE....
oh wait, that's me :-)

> I'm pretty sure this is it:
>
> Jul 30 13:51:01 turnip kernel: md: bind<nbd0>
> Jul 30 13:51:01 turnip kernel: md: recovery of RAID array md12
> Jul 30 13:51:01 turnip kernel: md: using 128k window, over a total of 72612988 blocks.
> Jul 30 14:10:48 turnip kernel: md: md12: recovery done.

Thanks... So:

- When the recovery started, mddev->size was the value printed after
  "over a total of ..." (that message shows max_sectors/2), i.e. 72612988
  (I assume this is expected to be a 72 gig array). Twice this will have
  been stored in 'max_sectors', and the loop in md_do_sync will have taken
  'j' up to that value and periodically stored it in mddev->curr_resync.

- When you ran "cat /proc/mdstat", mddev->array_sectors will have been
  twice the value printed at "... blocks", which is the same: 145225976.

- When you ran "cat /proc/mdstat", it printed "recovery", not "resync",
  so MD_RECOVERY_SYNC was not set, and max_sectors was set to
  mddev->size... that looks wrong (->size is in KB). Half of this is
  printed as the second number in the (%d/%d) bit, so ->size was twice
  36306494, or 72612988, which is consistent.

So the problem is that in status_resync, max_sectors is being set to
mddev->size rather than mddev->size*2. This is purely a cosmetic problem;
it does not affect data safety at all.

It looks like I botched a backport of
commit dd71cf6b2773310b01c6fe6c773064c80fd2476b
into the SuSE kernel. I'll get that fixed for the next update.

Thanks for the report, and as I said, the only thing affected here is
the content of /proc/mdstat. The recovery is doing the right thing.

Thanks,
NeilBrown
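To spell out the arithmetic (a back-of-envelope check against the numbers
above, not an authoritative reading of the code):

    numerator shown   ~ curr_resync/2, climbing to 145225976/2 = 72612988
    denominator shown = max_sectors/2 = ->size/2  = 72612988/2  = 36306494
    at the snapshot   : 69979200 / 36306494 = 192.7%
    at completion     : 72612988 / 36306494 = 200%

With the intended max_sectors = ->size*2, the denominator would have been
72612988 and the same snapshot would have read a believable 96.4%.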
* Re: question about bitmaps and dirty percentile
From: Jon Nelson @ 2009-08-07 2:17 UTC (permalink / raw)
Cc: LinuxRaid

On Thu, Aug 6, 2009 at 8:47 PM, NeilBrown<neilb@suse.de> wrote:
...
> So the problem is that in status_resync, max_sectors is being set to
> mddev->size rather than mddev->size*2. This is purely a cosmetic problem;
> it does not affect data safety at all.
>
> It looks like I botched a backport of
> commit dd71cf6b2773310b01c6fe6c773064c80fd2476b
> into the SuSE kernel. I'll get that fixed for the next update.
>
> Thanks for the report, and as I said, the only thing affected here is
> the content of /proc/mdstat. The recovery is doing the right thing.

Sweet! Open Source is AWESOME.

--
Jon
* Re: question about bitmaps and dirty percentile
From: John Robinson @ 2009-08-07 12:29 UTC (permalink / raw)
To: Jon Nelson; +Cc: LinuxRaid

On 07/08/2009 03:17, Jon Nelson wrote:
> Sweet! Open Source is AWESOME.

Amen to that, brother Jon!