linux-raid.vger.kernel.org archive mirror
* Incorrect in-kernel bitmap on raid10
@ 2009-04-18 18:15 Mario 'BitKoenig' Holbe
  2009-04-19  6:24 ` Neil Brown
  0 siblings, 1 reply; 16+ messages in thread
From: Mario 'BitKoenig' Holbe @ 2009-04-18 18:15 UTC (permalink / raw)
  To: linux-raid

Hello,

I created a 4.5T RAID10 with internal bitmap out of 3 1.5T disks on a
Debian 2.6.28-1-686-bigmem kernel:
# mdadm --create -l raid10 -n 6 -c 512 -b internal -a md /dev/md7 /dev/sdc1 missing /dev/sdd1 missing /dev/sde1 missing
and I get a strange inconsistency between the on-disk and the in-kernel
bitmap representation:
[202617.869531] md: bind<sdc1>
[202617.888998] md: bind<sdd1>
[202617.908895] md: bind<sde1>
[202617.917307] md: md7: raid array is not clean -- starting background reconstruction
[202619.527127] md: raid10 personality registered for level 10
[202619.544588] raid10: raid set md7 active with 3 out of 6 devices
[202619.563536] md7: bitmap file is out of date (0 < 1) -- forcing full recovery
[202619.584919] md7: bitmap file is out of date, doing full recovery
[202619.655867] md7: bitmap initialized from disk: read 1/1 pages, set 6131 bits
[202619.677268] created bitmap (3 pages) for device md7
[202619.714033] md7: detected capacity change from 0 to 4500896612352
[202619.732554]  md7: unknown partition table
[202781.236591] md: resync of RAID array md7
[202781.248610] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[202781.266343] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for resync.
[202781.294735] md: using 128k window, over a total of 4395406848 blocks.
[202781.336808] md: md7: resync done.
[202781.347708] md7: invalid bitmap page request: 3 (> 2)
[202781.363228] md7: invalid bitmap page request: 3 (> 2)
[202781.378637] md7: invalid bitmap page request: 3 (> 2)
[202781.394286] md7: invalid bitmap page request: 3 (> 2)
[202781.409704] md7: invalid bitmap page request: 3 (> 2)
...lots more of them...
[202781.832046] md7: invalid bitmap page request: 129 (> 2)
[202781.832047] md7: invalid bitmap page request: 129 (> 2)
[202781.832048] md7: invalid bitmap page request: 129 (> 2)
[202781.832049] md7: invalid bitmap page request: 129 (> 2)
[202781.832050] md7: invalid bitmap page request: 129 (> 2)
...lots more of them...
[202781.832095] md7: invalid bitmap page request: 130 (> 2)
[202781.832096] md7: invalid bitmap page request: 130 (> 2)
[202781.832097] md7: invalid bitmap page request: 130 (> 2)
[202781.832097] md7: invalid bitmap page request: 130 (> 2)
[202781.832098] md7: invalid bitmap page request: 130 (> 2)
...lots more of them...

While mdadm -X on every component shows a reasonable bitmap with a 16M
chunk size and 268275 chunks, the kernel seems to assume there are only
6131 chunks and initializes 3 pages instead of 131.

The same happens when adding the bitmap with -G -b internal to an
existing array.

I don't know whether this problem is related to the array size (i.e.
some 32-bit boundary problem) or to raid10 in general.
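
For what it's worth, the numbers do fit a 32-bit truncation of the sector
count. A quick check in plain userspace C (not kernel code; the
32768-sectors-per-chunk and 2048-counters-per-page figures are assumptions
derived from the 16 MB bitmap chunk size, 512-byte sectors and 4 KiB
pages):

#include <stdio.h>
#include <stdint.h>

/* Reproduce the chunk and page counts for the full and the truncated
 * sector count of the 4.5T array above. */
static void report(const char *label, uint64_t sectors)
{
        uint64_t per_chunk = 32768;   /* 16 MiB chunk / 512 B sectors */
        uint64_t chunks = (sectors + per_chunk - 1) / per_chunk;
        uint64_t pages  = (chunks + 2047) / 2048;  /* 16-bit counters, 4K pages */

        printf("%-9s: %llu chunks, %llu pages\n", label,
               (unsigned long long)chunks, (unsigned long long)pages);
}

int main(void)
{
        uint64_t sectors = 8790813696ULL;        /* 4395406848 1K blocks * 2 */

        report("64-bit", sectors);               /* 268275 chunks, 131 pages */
        report("truncated", (uint32_t)sectors);  /* 6131 chunks, 3 pages     */
        return 0;
}

The truncated value reproduces exactly the 6131 bits and 3 pages from the
kernel log above, while the full value matches the 268275 chunks reported
by mdadm -X.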


regards
   Mario
-- 
<snupidity> bjmg: yes, logic is my field of expertise. it's in the gene
<uepsie> in which one?
<snupidity> in the second X



* Re: Incorrect in-kernel bitmap on raid10
  2009-04-18 18:15 Incorrect in-kernel bitmap on raid10 Mario 'BitKoenig' Holbe
@ 2009-04-19  6:24 ` Neil Brown
  2009-04-19 22:55   ` Mario 'BitKoenig' Holbe
  2009-04-22 18:45   ` Incorrect in-kernel bitmap on raid10 Mario 'BitKoenig' Holbe
  0 siblings, 2 replies; 16+ messages in thread
From: Neil Brown @ 2009-04-19  6:24 UTC (permalink / raw)
  To: Mario 'BitKoenig' Holbe; +Cc: linux-raid

On Saturday April 18, Mario.Holbe@TU-Ilmenau.DE wrote:
> Hello,
> 
> I created a 4.5T RAID10 with internal bitmap out of 3 1.5T disks on a
> Debian 2.6.28-1-686-bigmem kernel:
> # mdadm --create -l raid10 -n 6 -c 512 -b internal -a md /dev/md7 /dev/sdc1 missing /dev/sdd1 missing /dev/sde1 missing
> and I get a strange inconsistency between the on-disk and the in-kernel
> bitmap representation:

oops...

Could you let me know if the following patch helps?
This bug will affect RAID4,5,6 when the device exceeds 2 terabytes,
but it affects RAID10 when the array exceeds 2 terabytes.
(For RAID1, the device size and array size are the same, so it falls
into both categories.)

Thanks,
NeilBrown


diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index e4510c9..1fb91ed 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -1590,7 +1590,7 @@ void bitmap_destroy(mddev_t *mddev)
 int bitmap_create(mddev_t *mddev)
 {
 	struct bitmap *bitmap;
-	unsigned long blocks = mddev->resync_max_sectors;
+	sector_t blocks = mddev->resync_max_sectors;
 	unsigned long chunks;
 	unsigned long pages;
 	struct file *file = mddev->bitmap_file;
@@ -1632,8 +1632,8 @@ int bitmap_create(mddev_t *mddev)
 	bitmap->chunkshift = ffz(~bitmap->chunksize);
 
 	/* now that chunksize and chunkshift are set, we can use these macros */
- 	chunks = (blocks + CHUNK_BLOCK_RATIO(bitmap) - 1) /
-			CHUNK_BLOCK_RATIO(bitmap);
+ 	chunks = (blocks + CHUNK_BLOCK_RATIO(bitmap) - 1) >>
+			CHUNK_BLOCK_SHIFT(bitmap);
  	pages = (chunks + PAGE_COUNTER_RATIO - 1) / PAGE_COUNTER_RATIO;
 
 	BUG_ON(!pages);
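
A quick sanity check of the change (plain userspace C, not kernel code),
assuming CHUNK_BLOCK_RATIO is simply 1 << CHUNK_BLOCK_SHIFT as the patch
implies: the shift form rounds up exactly like the old division, so the
real fix is keeping "blocks" in a 64-bit sector_t; the shift presumably
just avoids a plain 64-bit division in 32-bit kernel code.

#include <assert.h>
#include <stdint.h>

int main(void)
{
        uint64_t blocks = 8790813696ULL; /* resync_max_sectors of the array */
        int shift = 15;                  /* CHUNK_BLOCK_SHIFT for 16 MiB chunks */
        uint64_t ratio = (uint64_t)1 << shift;

        uint64_t by_div   = (blocks + ratio - 1) / ratio;   /* old form     */
        uint64_t by_shift = (blocks + ratio - 1) >> shift;  /* patched form */
        assert(by_div == by_shift && by_div == 268275);     /* matches mdadm -X */

        /* With the old "unsigned long blocks" on a 32-bit kernel the input
         * itself was already truncated: */
        assert((((uint32_t)blocks + ratio - 1) >> shift) == 6131);
        return 0;
}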


* Re: Incorrect in-kernel bitmap on raid10
  2009-04-19  6:24 ` Neil Brown
@ 2009-04-19 22:55   ` Mario 'BitKoenig' Holbe
  2009-04-19 23:27     ` Neil Brown
  2009-04-22 18:45   ` Incorrect in-kernel bitmap on raid10 Mario 'BitKoenig' Holbe
  1 sibling, 1 reply; 16+ messages in thread
From: Mario 'BitKoenig' Holbe @ 2009-04-19 22:55 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 2473 bytes --]

On Sun, Apr 19, 2009 at 04:24:02PM +1000, Neil Brown wrote:
> On Saturday April 18, Mario.Holbe@TU-Ilmenau.DE wrote:
> > I created a 4.5T RAID10 with internal bitmap out of 3 1.5T disks on a
> > and I get a strange inconsistency between the on-disk and the in-kernel
> > bitmap representation:
> Could you let me know if that following patch helps?

I applied the patch to 2.6.28 because of the still-pending .29 fix.
It looks better but not perfect, if you ask me:

root@darkside:~# mdadm -G -b internal /dev/md7
[  137.605821] md7: bitmap file is out of date (0 < 8382) -- forcing full recovery
[  137.627777] md7: bitmap file is out of date, doing full recovery
[  137.871855] md7: bitmap initialized from disk: read 9/9 pages, set 268275 bits
[  137.893543] created bitmap (131 pages) for device md7
root@darkside:~# cat /proc/mdstat
Personalities : [raid1] [raid10]
md7 : active raid10 sdc1[0] sde1[4] sdd1[2]
      4395406848 blocks 512K chunks 2 near-copies [6/3] [U_U_U_]
      bitmap: 0/131 pages [0KB], 16384KB chunk
...

It looks like there are now enough pages allocated in-kernel.
So - yes, the patch helps :)

The "read 9/9 pages" message does still look somewhat strange but better
than before (where it was "read 1/1 pages, set 6131 bits") and it seems
to be similar to messages of my other raids.
The "set 268275 bits" message does not seem to be consistent to the
"bitmap: 0/131 pages [0KB]" mdstat, but this is quite likely unrelated
to the original problem.

root@darkside:~# mdadm -X /dev/sd[cde]1 | grep Bitmap
          Bitmap : 268275 bits (chunks), 137203 dirty (51.1%)
          Bitmap : 268275 bits (chunks), 137203 dirty (51.1%)
          Bitmap : 268275 bits (chunks), 137203 dirty (51.1%)

The discrepancy between the "0/131 pages [0KB]" in-kernel and the
"137203 dirty (51.1%)" on-disk seems to be another, unrelated issue.
I have experienced somewhat similar issues when adding a new component
to an existing bitmapped device. When the full sync of the new component
is finished, the bitmap on the new component usually still shows lots
of dirty bits (sometimes only a few %, sometimes up to 95%) while the
other devices show 0 dirty bits. And this doesn't change over time or
when dropping page caches.


Mario
-- 
We know that communication is a problem, but the company is not going to
discuss it with the employees.
                       -- Switching supervisor, AT&T Long Lines Division



* Re: Incorrect in-kernel bitmap on raid10
  2009-04-19 22:55   ` Mario 'BitKoenig' Holbe
@ 2009-04-19 23:27     ` Neil Brown
  2009-04-20  0:13       ` Race condition in write_sb_page? (was: Re: Incorrect in-kernel bitmap on raid10) Mario 'BitKoenig' Holbe
  0 siblings, 1 reply; 16+ messages in thread
From: Neil Brown @ 2009-04-19 23:27 UTC (permalink / raw)
  To: Mario 'BitKoenig' Holbe; +Cc: linux-raid

On Monday April 20, Mario.Holbe@TU-Ilmenau.DE wrote:
> On Sun, Apr 19, 2009 at 04:24:02PM +1000, Neil Brown wrote:
> > On Saturday April 18, Mario.Holbe@TU-Ilmenau.DE wrote:
> > > I created a 4.5T RAID10 with internal bitmap out of 3 1.5T disks on a
> > > and I get a strange inconsistency between the on-disk and the in-kernel
> > > bitmap representation:
> > Could you let me know if that following patch helps?
> 
> I attached the patch to 2.6.28 because of the still pending .29-fix.
> It looks better but not perfect, if you ask me:

Thanks for testing.

> 
> root@darkside:~# mdadm -G -b internal /dev/md7
> [  137.605821] md7: bitmap file is out of date (0 < 8382) -- forcing full recovery
> [  137.627777] md7: bitmap file is out of date, doing full recovery
> [  137.871855] md7: bitmap initialized from disk: read 9/9 pages, set 268275 bits
> [  137.893543] created bitmap (131 pages) for device md7
> root@darkside:~# cat /proc/mdstat
> Personalities : [raid1] [raid10]
> md7 : active raid10 sdc1[0] sde1[4] sdd1[2]
>       4395406848 blocks 512K chunks 2 near-copies [6/3] [U_U_U_]
>       bitmap: 0/131 pages [0KB], 16384KB chunk
> ...
> 
> It looks like there are now enough pages allocated in-kernel.
> So - yes, the patch helps :)
> 
> The "read 9/9 pages" message does still look somewhat strange but better
> than before (where it was "read 1/1 pages, set 6131 bits") and it seems
> to be similar to messages of my other raids.
> The "set 268275 bits" message does not seem to be consistent to the
> "bitmap: 0/131 pages [0KB]" mdstat, but this is quite likely unrelated
> to the original problem.

I think this is all consistent, though possibly confusing.
On disk, we use 1 bit per chunk so 268275 chunks uses 33535 bytes or 66
sectors or 9 (4K) pages.

In memory, we use 16 bits per chunk, so we can count how many pending
accesses there are to each chunk and so know when we can clear the
bit.
So 268275 chunks uses 536550 bytes or 523K or 131 (4K) pages.
These pages are only alloced on demand, so when the bitmap is
completely clean, there are likely to be 0 allocated.
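
That arithmetic can be checked directly (plain userspace C, not kernel
code; 512-byte sectors and 4 KiB pages assumed):

#include <stdio.h>

int main(void)
{
        unsigned long chunks = 268275;

        /* on disk: 1 bit per chunk, padded to sectors and then to pages */
        unsigned long disk_bytes   = (chunks + 7) / 8;                   /* 33535 */
        unsigned long disk_sectors = (disk_bytes + 511) / 512;           /* 66    */
        unsigned long disk_pages   = (disk_sectors * 512 + 4095) / 4096; /* 9     */

        /* in memory: a 16-bit counter per chunk, pages allocated on demand */
        unsigned long mem_bytes = chunks * 2;                            /* 536550 */
        unsigned long mem_pages = (mem_bytes + 4095) / 4096;             /* 131    */

        printf("on disk  : %lu bytes, %lu sectors, %lu pages\n",
               disk_bytes, disk_sectors, disk_pages);
        printf("in memory: %lu bytes (%luK), %lu pages\n",
               mem_bytes, mem_bytes / 1024, mem_pages);
        return 0;
}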

> 
> root@darkside:~# mdadm -X /dev/sd[cde]1 | grep Bitmap
>           Bitmap : 268275 bits (chunks), 137203 dirty (51.1%)
>           Bitmap : 268275 bits (chunks), 137203 dirty (51.1%)
>           Bitmap : 268275 bits (chunks), 137203 dirty (51.1%)
> 
> The discrepancy between the "0/131 pages [0KB]" in-kernel and the
> "137203 dirty (51.1%)" on-disk seems to be another, unrelated issue.
> I experienced somehow similar issues when adding a new component to an
> existing bitmapped device. When the full-sync of the new component is
> finished, the bitmap on the new component does usually show still lots
> of dirty bits (sometimes only a few %, sometimes up to 95%) while the
> other devices show 0 dirties. And this doesn't change over time or when
> dropping page caches.

I think that problem is fixed by 
  commit 355a43e641b948a7b755cb4c2466ec548d5b495f

which is in 2.6.29.

NeilBrown


* Race condition in write_sb_page? (was: Re: Incorrect in-kernel bitmap on raid10)
  2009-04-19 23:27     ` Neil Brown
@ 2009-04-20  0:13       ` Mario 'BitKoenig' Holbe
  2009-04-20  1:57         ` NeilBrown
  0 siblings, 1 reply; 16+ messages in thread
From: Mario 'BitKoenig' Holbe @ 2009-04-20  0:13 UTC (permalink / raw)
  To: linux-raid

Neil Brown <neilb@suse.de> wrote:
> On Monday April 20, Mario.Holbe@TU-Ilmenau.DE wrote:
>> existing bitmapped device. When the full-sync of the new component is
>> finished, the bitmap on the new component does usually show still lots
>> of dirty bits (sometimes only a few %, sometimes up to 95%) while the
> I think that problem is fixed by 
>   commit 355a43e641b948a7b755cb4c2466ec548d5b495f
> which is in 2.6.29.

.2 probably :)

While looking at commit 355a43e641b948a7b755cb4c2466ec548d5b495f I'm
not sure whether this could introduce a race condition: the comment in
next_active_rdev() states:
	 * As devices are only added or removed when raid_disk is < 0 and
	 * nr_pending is 0 and In_sync is clear, the entries we return will
	 * still be in the same position on the list when we re-enter
	 * list_for_each_continue_rcu.
but commit 355a43e641b948a7b755cb4c2466ec548d5b495f removes exactly the
In_sync test. If the comment is accurate, removing the test probably
opens a window for a race condition.


regards
   Mario
-- 
Programming in C++ keeps the grey cells alive. It sharpens all five
senses: imbecility, idiocy, madness, nonsense, and dullness.
                                 [Holger Veit in doc]



* Re: Race condition in write_sb_page? (was: Re: Incorrect in-kernel bitmap on raid10)
  2009-04-20  0:13       ` Race condition in write_sb_page? (was: Re: Incorrect in-kernel bitmap on raid10) Mario 'BitKoenig' Holbe
@ 2009-04-20  1:57         ` NeilBrown
  2009-04-20  8:03           ` Race condition in write_sb_page? Mario 'BitKoenig' Holbe
  0 siblings, 1 reply; 16+ messages in thread
From: NeilBrown @ 2009-04-20  1:57 UTC (permalink / raw)
  To: Mario 'BitKoenig' Holbe; +Cc: linux-raid

On Mon, April 20, 2009 10:13 am, Mario 'BitKoenig' Holbe wrote:
> Neil Brown <neilb@suse.de> wrote:
>> On Monday April 20, Mario.Holbe@TU-Ilmenau.DE wrote:
>>> existing bitmapped device. When the full-sync of the new component is
>>> finished, the bitmap on the new component does usually show still lots
>>> of dirty bits (sometimes only a few %, sometimes up to 95%) while the
>> I think that problem is fixed by
>>   commit 355a43e641b948a7b755cb4c2466ec548d5b495f
>> which is in 2.6.29.
>
> .2 probably :)

Actually not - I haven't tagged it for -stable.  It'll be in .30.

I had used "git describe" to see which release it was in, but that
tells me a previous release rather than a subsequent one, which makes
it not useful for that task.
I should have used 'git name-rev' after a 'git pull --tags', which
would have told me
  355a43e641b948a7b755cb4c2466ec548d5b495f tags/v2.6.30-rc1~241^2~49
so it is in 30-rc1.


>
> While looking at commit 355a43e641b948a7b755cb4c2466ec548d5b495f I'm not
> sure, if this could raise a race condition: the comment in
> next_active_rdev() states:
> 	 * As devices are only added or removed when raid_disk is < 0 and
> 	 * nr_pending is 0 and In_sync is clear, the entries we return will
> 	 * still be in the same position on the list when we re-enter
> 	 * list_for_each_continue_rcu.
> but commit 355a43e641b948a7b755cb4c2466ec548d5b495f does exactly remove
> the In_sync test. If the comment is true, the removal of the test
> probably opens a window for a race condition.

Thanks for reviewing the code, I really appreciate it.

In this case, I think the code is still fine.
The comment lists a number of conditions that must all hold before
something is removed from (or added to) the list.  So to be sure an
entry won't be removed from the list, we only need to be sure that one
of those conditions does not hold.  We previously had 3 such guarantees.
Now we only have 2.  But that is still plenty.

Thanks,
NeilBrown



* Re: Race condition in write_sb_page?
  2009-04-20  1:57         ` NeilBrown
@ 2009-04-20  8:03           ` Mario 'BitKoenig' Holbe
  0 siblings, 0 replies; 16+ messages in thread
From: Mario 'BitKoenig' Holbe @ 2009-04-20  8:03 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 615 bytes --]

On Mon, Apr 20, 2009 at 11:57:54AM +1000, NeilBrown wrote:
> The comment lists a number of conditions that are all true when
> something is remove from (or added to) the list.  So we only need
> to be sure that one of these is true to be sure the thing wont be removed

Okay, I guess I just confused the component's descriptor index (desc_nr,
which changes when a resync finishes) and its position in mddev->disks
(which doesn't change).


Mario
-- 
File names are infinite in length where infinity is set to 255 characters.
                                -- Peter Collinson, "The Unix File System"



* Re: Incorrect in-kernel bitmap on raid10
  2009-04-19  6:24 ` Neil Brown
  2009-04-19 22:55   ` Mario 'BitKoenig' Holbe
@ 2009-04-22 18:45   ` Mario 'BitKoenig' Holbe
  2009-04-28 14:05     ` Mario 'BitKoenig' Holbe
  1 sibling, 1 reply; 16+ messages in thread
From: Mario 'BitKoenig' Holbe @ 2009-04-22 18:45 UTC (permalink / raw)
  To: linux-raid

Neil Brown <neilb@suse.de> wrote:
> Could you let me know if that following patch helps?

Hmmm, it looks like the patch doesn't fully fix it.
I told you already about the semi-initialized on-disk bitmap:

root@darkside:~# mdadm -X /dev/sd[cdefgh][1] | grep Bitmap | sort
          Bitmap : 268275 bits (chunks), 137203 dirty (51.1%)
          Bitmap : 268275 bits (chunks), 137203 dirty (51.1%)
          Bitmap : 268275 bits (chunks), 137203 dirty (51.1%)
          Bitmap : 268275 bits (chunks), 137203 dirty (51.1%)
          Bitmap : 268275 bits (chunks), 137203 dirty (51.1%)
          Bitmap : 268275 bits (chunks), 137203 dirty (51.1%)

I tried to clean it up with:
root@darkside:~# echo 0-268275 > /sys/block/md7/md/bitmap_set_bits

This should set all bits, and it seems to do so:
root@darkside:~# mdadm -X /dev/sd[cdefgh][1] | grep Bitmap 
          Bitmap : 268275 bits (chunks), 268275 dirty (100.0%)
          Bitmap : 268275 bits (chunks), 268275 dirty (100.0%)
          Bitmap : 268275 bits (chunks), 268275 dirty (100.0%)
          Bitmap : 268275 bits (chunks), 268275 dirty (100.0%)
          Bitmap : 268275 bits (chunks), 268275 dirty (100.0%)
          Bitmap : 268275 bits (chunks), 268275 dirty (100.0%)

However, in-kernel it should also allocate all pages, but it does not:
root@darkside:~# head -4 /proc/mdstat 
Personalities : [raid1] [raid10] 
md7 : active (auto-read-only) raid10 sdc1[0] sdh1[5] sde1[4] sdg1[3] sdd1[2] sdf1[1]
      4395406848 blocks 512K chunks 2 near-copies [6/6] [UUUUUU]
      bitmap: 64/131 pages [256KB], 16384KB chunk

stopping/starting the array doesn't help either:
root@darkside:~# mdadm --stop /dev/md7
mdadm: stopped /dev/md7
root@darkside:~# mdadm -A /dev/md7
mdadm: /dev/md7 has been started with 6 drives.
root@darkside:~# head -4 /proc/mdstat 
Personalities : [raid1] [raid10] 
md7 : active (auto-read-only) raid10 sdc1[0] sdh1[5] sde1[4] sdg1[3] sdd1[2] sdf1[1]
      4395406848 blocks 512K chunks 2 near-copies [6/6] [UUUUUU]
      bitmap: 64/131 pages [256KB], 16384KB chunk
root@darkside:~# sleep 10
root@darkside:~# head -4 /proc/mdstat 
Personalities : [raid1] [raid10] 
md7 : active (auto-read-only) raid10 sdc1[0] sdh1[5] sde1[4] sdg1[3] sdd1[2] sdf1[1]
      4395406848 blocks 512K chunks 2 near-copies [6/6] [UUUUUU]
      bitmap: 0/131 pages [0KB], 16384KB chunk
root@darkside:~# echo 3 > /proc/sys/vm/drop_caches 
root@darkside:~# mdadm -X /dev/sd[cdefgh][1] | grep Bitmap
          Bitmap : 268275 bits (chunks), 137203 dirty (51.1%)
          Bitmap : 268275 bits (chunks), 137203 dirty (51.1%)
          Bitmap : 268275 bits (chunks), 137203 dirty (51.1%)
          Bitmap : 268275 bits (chunks), 137203 dirty (51.1%)
          Bitmap : 268275 bits (chunks), 137203 dirty (51.1%)
          Bitmap : 268275 bits (chunks), 137203 dirty (51.1%)


regards
   Mario
-- 
As a rule, the more bizarre a thing is, the less mysterious it proves to be.
                                    -- Sherlock Holmes by Arthur Conan Doyle



* Re: Incorrect in-kernel bitmap on raid10
  2009-04-22 18:45   ` Incorrect in-kernel bitmap on raid10 Mario 'BitKoenig' Holbe
@ 2009-04-28 14:05     ` Mario 'BitKoenig' Holbe
  2009-05-01  2:11       ` Neil Brown
  0 siblings, 1 reply; 16+ messages in thread
From: Mario 'BitKoenig' Holbe @ 2009-04-28 14:05 UTC (permalink / raw)
  To: linux-raid

Mario 'BitKoenig' Holbe <Mario.Holbe@TU-Ilmenau.DE> wrote:
> Neil Brown <neilb@suse.de> wrote:
>> Could you let me know if that following patch helps?
> Hmmm, it looks like the patch doesn't fully fix it.
> root@darkside:~# echo 0-268275 > /sys/block/md7/md/bitmap_set_bits
> However, in-kernel it should also allocate all pages, but it does not:

*push*

I forgot to clarify the (new) issue:
While the number of pages needed in-kernel is now calculated correctly,
only half of them seem to actually be used, even when all bits are
set.


regards
   Mario
-- 
I thought the only thing the internet was good for was porn.  -- Futurama



* Re: Incorrect in-kernel bitmap on raid10
  2009-04-28 14:05     ` Mario 'BitKoenig' Holbe
@ 2009-05-01  2:11       ` Neil Brown
  2009-05-01 17:55         ` Mario 'BitKoenig' Holbe
  0 siblings, 1 reply; 16+ messages in thread
From: Neil Brown @ 2009-05-01  2:11 UTC (permalink / raw)
  To: Mario 'BitKoenig' Holbe; +Cc: linux-raid

On Tuesday April 28, Mario.Holbe@TU-Ilmenau.DE wrote:
> Mario 'BitKoenig' Holbe <Mario.Holbe@TU-Ilmenau.DE> wrote:
> > Neil Brown <neilb@suse.de> wrote:
> >> Could you let me know if that following patch helps?
> > Hmmm, it looks like the patch doesn't fully fix it.
> > root@darkside:~# echo 0-268275 > /sys/block/md7/md/bitmap_set_bits
> > However, in-kernel it should also allocate all pages, but it does not:
> 
> *push*

Thanks for persisting.

> 
> I forgot to clarify the (new) issue:
> While now the amount of pages needed in-kernel is calculated correctly,
> only the half of them seems to be actually used, even if all bits are
> set.
> 

This is not necessarily a bug.  If the attempt to allocate a page
fails, we can persevere using fewer counters with much larger
granularity.  So we might not always allocate all the pages that are
required.
However, I don't think that is the case here.  There are some other
places where we are overflowing on a shift.  One of those (in
bitmap_dirty_bits) can cause the problem you see.
This patch should fix it.  Please confirm.

NeilBrown

diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index 1fb91ed..fcbf439 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -1016,8 +1016,11 @@ static int bitmap_init_from_disk(struct bitmap *bitmap, sector_t start)
 		kunmap_atomic(paddr, KM_USER0);
 		if (b) {
 			/* if the disk bit is set, set the memory bit */
-			bitmap_set_memory_bits(bitmap, i << CHUNK_BLOCK_SHIFT(bitmap),
-					       ((i+1) << (CHUNK_BLOCK_SHIFT(bitmap)) >= start)
+			int needed = ((sector_t)(i+1) << (CHUNK_BLOCK_SHIFT(bitmap))
+				      >= start);
+			bitmap_set_memory_bits(bitmap,
+					       (sector_t)i << CHUNK_BLOCK_SHIFT(bitmap),
+					       needed);
 				);
 			bit_cnt++;
 			set_page_attr(bitmap, page, BITMAP_PAGE_CLEAN);
@@ -1154,8 +1157,9 @@ void bitmap_daemon_work(struct bitmap *bitmap)
 			spin_lock_irqsave(&bitmap->lock, flags);
 			clear_page_attr(bitmap, page, BITMAP_PAGE_CLEAN);
 		}
-		bmc = bitmap_get_counter(bitmap, j << CHUNK_BLOCK_SHIFT(bitmap),
-					&blocks, 0);
+		bmc = bitmap_get_counter(bitmap,
+					 (sector_t)j << CHUNK_BLOCK_SHIFT(bitmap),
+					 &blocks, 0);
 		if (bmc) {
 /*
   if (j < 100) printk("bitmap: j=%lu, *bmc = 0x%x\n", j, *bmc);
@@ -1169,7 +1173,8 @@ void bitmap_daemon_work(struct bitmap *bitmap)
 			} else if (*bmc == 1) {
 				/* we can clear the bit */
 				*bmc = 0;
-				bitmap_count_page(bitmap, j << CHUNK_BLOCK_SHIFT(bitmap),
+				bitmap_count_page(bitmap,
+						  (sector_t)j << CHUNK_BLOCK_SHIFT(bitmap),
 						  -1);
 
 				/* clear the bit */
@@ -1514,7 +1519,7 @@ void bitmap_dirty_bits(struct bitmap *bitmap, unsigned long s, unsigned long e)
 	unsigned long chunk;
 
 	for (chunk = s; chunk <= e; chunk++) {
-		sector_t sec = chunk << CHUNK_BLOCK_SHIFT(bitmap);
+		sector_t sec = (sector_t)chunk << CHUNK_BLOCK_SHIFT(bitmap);
 		bitmap_set_memory_bits(bitmap, sec, 1);
 		bitmap_file_set_bit(bitmap, sec);
 	}
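
The overflow itself is easy to reproduce outside the kernel (plain
userspace C; uint32_t stands in for a 32-bit "unsigned long").  With
16 MiB bitmap chunks the shift is 15, so any chunk index of 131072 or
more (everything beyond 2 TiB into the array) wraps, and 131072 / 2048
is 64 counter pages, which lines up with the "64/131 pages" seen earlier.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        int shift = 15;           /* CHUNK_BLOCK_SHIFT: 16 MiB / 512 B = 2^15  */
        uint32_t chunk = 200000;  /* a chunk index beyond 2 TiB into the array */

        uint32_t wrapped = chunk << shift;           /* what the old code computed */
        uint64_t correct = (uint64_t)chunk << shift; /* with the (sector_t) cast   */

        printf("wrapped: %u sectors\n", wrapped);    /* 2258632704, aliases a low chunk */
        printf("correct: %llu sectors\n", (unsigned long long)correct); /* 6553600000 */
        return 0;
}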


* Re: Incorrect in-kernel bitmap on raid10
  2009-05-01  2:11       ` Neil Brown
@ 2009-05-01 17:55         ` Mario 'BitKoenig' Holbe
  2009-05-01 21:36           ` NeilBrown
  0 siblings, 1 reply; 16+ messages in thread
From: Mario 'BitKoenig' Holbe @ 2009-05-01 17:55 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid


[-- Attachment #1.1: Type: text/plain, Size: 1287 bytes --]

On Fri, May 01, 2009 at 12:11:43PM +1000, Neil Brown wrote:
> There some other places
> where are are overflowing on a shift.  One of those (in
> bitmap_dirty_bits) can cause the problem you see.
> This patch should fix it.  Please confirm.

Together with the small syntax fix attached, this patch fixes the
problem that only half of the available pages were being allocated.
Now all pages are allocated when I set all bits, and they all get
cleaned in-kernel as well as on-disk.

However, can you confirm that the bitmap is really used in a raid10
resync? I removed half of the disks (a correctly removable subset, of
course :)), copied 100G to the degraded array, got about 7k bits set in
the bitmap, and (re-)added the removed devices (mdadm correctly reports
a re-add as well), but the resync looks *very* sequential.
Moreover: I stopped and re-assembled the array with about 2k bits left
set, and the resync starts from the beginning; I can see no skip to the
previous position in the resync process.
I'll try to watch this and will ping you again when I have more stable
evidence, but perhaps you have some faster test cases; I have to wait
for at least 5 hours now :)


regards
   Mario
-- 
Singing is the lowest form of communication.
                         -- Homer J. Simpson

[-- Attachment #1.2: linux-source-2.6.28+bitmap3.patch --]
[-- Type: text/x-diff, Size: 392 bytes --]

diff -urN a/drivers/md/bitmap.c b/drivers/md/bitmap.c
--- a/drivers/md/bitmap.c	2009-05-01 12:50:48.463877165 +0200
+++ b/drivers/md/bitmap.c	2009-05-01 12:55:56.185432118 +0200
@@ -1021,7 +1021,6 @@
 			bitmap_set_memory_bits(bitmap,
 					       (sector_t)i << CHUNK_BLOCK_SHIFT(bitmap),
 					       needed);
-				);
 			bit_cnt++;
 			set_page_attr(bitmap, page, BITMAP_PAGE_CLEAN);
 		}



* Re: Incorrect in-kernel bitmap on raid10
  2009-05-01 17:55         ` Mario 'BitKoenig' Holbe
@ 2009-05-01 21:36           ` NeilBrown
  2009-05-02 19:52             ` Mario 'BitKoenig' Holbe
  0 siblings, 1 reply; 16+ messages in thread
From: NeilBrown @ 2009-05-01 21:36 UTC (permalink / raw)
  To: Mario 'BitKoenig' Holbe, Neil Brown, linux-raid

On Sat, May 2, 2009 3:55 am, Mario 'BitKoenig' Holbe wrote:
> On Fri, May 01, 2009 at 12:11:43PM +1000, Neil Brown wrote:
>> There some other places
>> where are are overflowing on a shift.  One of those (in
>> bitmap_dirty_bits) can cause the problem you see.
>> This patch should fix it.  Please confirm.
>
> Together with the small syntax-fix attached this patch fixes the
> allocation of half of the available pages only. Now, all pages are
> allocated when I set all bits and they all get cleaned in-kernel as well
> as on-disk.

Good.  Thanks for the confirmation (and fix).

>
> However, can you confirm that the bitmap is really used in raid10
> resync? I removed half of the disks (a correctly removable subset, of
> course :)), copied 100G to the degraded array, got about 7k bit set in
> the bitmap, (re-)added the removed devices (mdadm correctly states
> re-add as well), but the resync looks *very* sequential.
> Moreover: I stopped and re-assembled the array with about 2k bit left
> set and the resync starts from the beginning, I can see no skip to the
> previous position in the resync process.
> I'll try to watch this and will trigger you again when I have more
> stable evidence, but perhaps you have some faster test-cases, I have to
> wait for at least 5 hours now :)

I just did some testing and it does seem to honour the bitmap during
recovery.  However there are some caveats.

1/ it processes the whole array from start to finish in chunk-sized blocks
  and simply doesn't generate IO where it isn't needed.  This is different
  to e.g. raid1, which can skip over a whole bitmap chunk at a time.
  So it does use more CPU (see the sketch after this list).
2/ With raid1, when it skips a whole bitmap chunk, that chunk is not
  included in the speed calculation.  With raid10, everything is included.
  So I found the resync was hitting the limit of 200M/sec and backing off.
  I increased the limit (added a few more zeros) and it sped up.
3/ I found a bug.  If you have two devices missing and add just one,
  then after the recovery it might clear the bitmap even though
  there is another missing device.  When that device is re-added, it will
  be added with no recovery.  This is bad.  I'll post a patch shortly.
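
A toy model of the difference in points 1 and 2 (plain C, not the real
drivers/md code; all names and numbers below are made up for
illustration):

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define CHUNK_SECTORS     1024ULL    /* raid chunk: 512 KiB  */
#define BITMAP_CHUNK_SECT 32768ULL   /* bitmap chunk: 16 MiB */

/* pretend every second bitmap chunk is clean */
static bool bitmap_chunk_clean(uint64_t s) { return (s / BITMAP_CHUNK_SECT) % 2; }

int main(void)
{
        uint64_t max = 4 * BITMAP_CHUNK_SECT, s, counted, io;

        /* raid1-style: skip whole clean bitmap chunks, don't count them */
        for (s = counted = io = 0; s < max; ) {
                if (bitmap_chunk_clean(s)) {
                        s = (s / BITMAP_CHUNK_SECT + 1) * BITMAP_CHUNK_SECT;
                        continue;
                }
                io += CHUNK_SECTORS; counted += CHUNK_SECTORS; s += CHUNK_SECTORS;
        }
        printf("raid1-style : %llu sectors of I/O, %llu counted for speed\n",
               (unsigned long long)io, (unsigned long long)counted);

        /* raid10 here: walk everything in chunk-sized blocks, skip the I/O
         * for clean chunks but still count them towards the speed limit */
        for (s = counted = io = 0; s < max; s += CHUNK_SECTORS) {
                if (!bitmap_chunk_clean(s))
                        io += CHUNK_SECTORS;
                counted += CHUNK_SECTORS;
        }
        printf("raid10-style: %llu sectors of I/O, %llu counted for speed\n",
               (unsigned long long)io, (unsigned long long)counted);
        return 0;
}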

What speed are (were) you getting for the resync?  If it was around
200M/sec, then point 2 would explain it.  If it was closer to the device
speed, then there must be something else going wrong.

NeilBrown



* Re: Incorrect in-kernel bitmap on raid10
  2009-05-01 21:36           ` NeilBrown
@ 2009-05-02 19:52             ` Mario 'BitKoenig' Holbe
  2009-05-02 22:41               ` NeilBrown
  0 siblings, 1 reply; 16+ messages in thread
From: Mario 'BitKoenig' Holbe @ 2009-05-02 19:52 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid


[-- Attachment #1.1: Type: text/plain, Size: 2639 bytes --]

On Sat, May 02, 2009 at 07:36:45AM +1000, NeilBrown wrote:
> On Sat, May 2, 2009 3:55 am, Mario 'BitKoenig' Holbe wrote:
> > However, can you confirm that the bitmap is really used in raid10
> > resync? I removed half of the disks (a correctly removable subset, of
...
> > I'll try to watch this and will trigger you again when I have more
> > stable evidence, but perhaps you have some faster test-cases, I have to
> 1/ it processes the whole array from start to finish in chunk-sized blocks
>   and simply doesn't generate IO where it isn't needed.  This is different

It definitely does generate I/O all the time here.

> 2/ With raid1, when it skips a whole bitmap chunk, that chunk is not
>   included in the speed calculation.  With raid10, everything is included.

It doesn't seem to touch the sync-limits here.

> 3/ I found a bug.  If you have two devices missing and add just one,
>   then after the recovery it might clear the bitmap even though
>   there is another missing device.  When that device is re-added, it will

The resync never seems to respect the bitmap here.

> What speed are (were) you getting for resync.  If it was around 200M/sec,
> then point 2 would explain it.  If it was closer to the device speed,
> then there must be something else going wrong.

I guess there is something else going wrong here. I attached a (quite
large, sorry for that) transcript of what I was doing, with the output
that I think could help.

What I did:
* I had a stable and clean raid10 out of 6 disks with superblocks all
  uptodate, on-disk bitmaps all uptodate.
* I failed and removed 3 of the disks.
* I set some bits in the bitmap via mounting/umounting the raid10.
* I stopped the raid10 just to make sure all superblocks/bitmaps are
  uptodate.
* I assembled the raid10 again with 3 out of 6 devices now.
* I re-added the 3 missing disks.
  Please note, that I cannot add them all at the same time because if
  the array is read-write the resync starts immediately when the first
  device is added, while I cannot add devices as long as the array is
  read-only.
  In this immediately starting re-sync only the first of the three
  spares is synched, it seems to ignore the bitmap, and it generates
  I/O all the time, I just forgot to c'n'p the evidence. I have seen
  that 3 times now with iostat, it really does I/O.
* I stopped and started the array to make it resync over all 3 spares
  concurrently, it seems to ignore the bitmap, and it generates I/O
  all the time again.


regards
   Mario
-- 
() Ascii Ribbon Campaign
/\ Support plain text e-mail

[-- Attachment #1.2: raid10-resync --]
[-- Type: text/plain, Size: 24793 bytes --]

root@darkside:~# head -4 /proc/mdstat 
Personalities : [raid1] [raid10] 
md7 : active raid10 sdc1[0] sdf1[1] sdg1[3] sdh1[5] sde1[4] sdd1[2]
      4395406848 blocks 512K chunks 2 near-copies [6/6] [UUUUUU]
      bitmap: 0/131 pages [0KB], 16384KB chunk
root@darkside:~# mdadm -E /dev/sd[c-h]1
/dev/sdc1:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : bbcc5f03:13105928:bea48659:c10950ee
  Creation Time : Sat Apr 18 18:26:28 2009
     Raid Level : raid10
  Used Dev Size : 1465135616 (1397.26 GiB 1500.30 GB)
     Array Size : 4395406848 (4191.79 GiB 4500.90 GB)
   Raid Devices : 6
  Total Devices : 6
Preferred Minor : 7

    Update Time : Sat May  2 21:22:22 2009
          State : clean
Internal Bitmap : present
 Active Devices : 6
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 0
       Checksum : e2f919ac - correct
         Events : 13278

         Layout : near=2
     Chunk Size : 512K

      Number   Major   Minor   RaidDevice State
this     0       8       33        0      active sync   /dev/sdc1

   0     0       8       33        0      active sync   /dev/sdc1
   1     1       8       81        1      active sync   /dev/sdf1
   2     2       8       49        2      active sync   /dev/sdd1
   3     3       8       97        3      active sync   /dev/sdg1
   4     4       8       65        4      active sync   /dev/sde1
   5     5       8      113        5      active sync   /dev/sdh1
/dev/sdd1:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : bbcc5f03:13105928:bea48659:c10950ee
  Creation Time : Sat Apr 18 18:26:28 2009
     Raid Level : raid10
  Used Dev Size : 1465135616 (1397.26 GiB 1500.30 GB)
     Array Size : 4395406848 (4191.79 GiB 4500.90 GB)
   Raid Devices : 6
  Total Devices : 6
Preferred Minor : 7

    Update Time : Sat May  2 21:22:22 2009
          State : clean
Internal Bitmap : present
 Active Devices : 6
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 0
       Checksum : e2f919c0 - correct
         Events : 13278

         Layout : near=2
     Chunk Size : 512K

      Number   Major   Minor   RaidDevice State
this     2       8       49        2      active sync   /dev/sdd1

   0     0       8       33        0      active sync   /dev/sdc1
   1     1       8       81        1      active sync   /dev/sdf1
   2     2       8       49        2      active sync   /dev/sdd1
   3     3       8       97        3      active sync   /dev/sdg1
   4     4       8       65        4      active sync   /dev/sde1
   5     5       8      113        5      active sync   /dev/sdh1
/dev/sde1:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : bbcc5f03:13105928:bea48659:c10950ee
  Creation Time : Sat Apr 18 18:26:28 2009
     Raid Level : raid10
  Used Dev Size : 1465135616 (1397.26 GiB 1500.30 GB)
     Array Size : 4395406848 (4191.79 GiB 4500.90 GB)
   Raid Devices : 6
  Total Devices : 6
Preferred Minor : 7

    Update Time : Sat May  2 21:22:22 2009
          State : clean
Internal Bitmap : present
 Active Devices : 6
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 0
       Checksum : e2f919d4 - correct
         Events : 13278

         Layout : near=2
     Chunk Size : 512K

      Number   Major   Minor   RaidDevice State
this     4       8       65        4      active sync   /dev/sde1

   0     0       8       33        0      active sync   /dev/sdc1
   1     1       8       81        1      active sync   /dev/sdf1
   2     2       8       49        2      active sync   /dev/sdd1
   3     3       8       97        3      active sync   /dev/sdg1
   4     4       8       65        4      active sync   /dev/sde1
   5     5       8      113        5      active sync   /dev/sdh1
/dev/sdf1:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : bbcc5f03:13105928:bea48659:c10950ee
  Creation Time : Sat Apr 18 18:26:28 2009
     Raid Level : raid10
  Used Dev Size : 1465135616 (1397.26 GiB 1500.30 GB)
     Array Size : 4395406848 (4191.79 GiB 4500.90 GB)
   Raid Devices : 6
  Total Devices : 6
Preferred Minor : 7

    Update Time : Sat May  2 21:22:22 2009
          State : clean
Internal Bitmap : present
 Active Devices : 6
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 0
       Checksum : e2f919de - correct
         Events : 13278

         Layout : near=2
     Chunk Size : 512K

      Number   Major   Minor   RaidDevice State
this     1       8       81        1      active sync   /dev/sdf1

   0     0       8       33        0      active sync   /dev/sdc1
   1     1       8       81        1      active sync   /dev/sdf1
   2     2       8       49        2      active sync   /dev/sdd1
   3     3       8       97        3      active sync   /dev/sdg1
   4     4       8       65        4      active sync   /dev/sde1
   5     5       8      113        5      active sync   /dev/sdh1
/dev/sdg1:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : bbcc5f03:13105928:bea48659:c10950ee
  Creation Time : Sat Apr 18 18:26:28 2009
     Raid Level : raid10
  Used Dev Size : 1465135616 (1397.26 GiB 1500.30 GB)
     Array Size : 4395406848 (4191.79 GiB 4500.90 GB)
   Raid Devices : 6
  Total Devices : 6
Preferred Minor : 7

    Update Time : Sat May  2 21:22:22 2009
          State : clean
Internal Bitmap : present
 Active Devices : 6
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 0
       Checksum : e2f919f2 - correct
         Events : 13278

         Layout : near=2
     Chunk Size : 512K

      Number   Major   Minor   RaidDevice State
this     3       8       97        3      active sync   /dev/sdg1

   0     0       8       33        0      active sync   /dev/sdc1
   1     1       8       81        1      active sync   /dev/sdf1
   2     2       8       49        2      active sync   /dev/sdd1
   3     3       8       97        3      active sync   /dev/sdg1
   4     4       8       65        4      active sync   /dev/sde1
   5     5       8      113        5      active sync   /dev/sdh1
/dev/sdh1:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : bbcc5f03:13105928:bea48659:c10950ee
  Creation Time : Sat Apr 18 18:26:28 2009
     Raid Level : raid10
  Used Dev Size : 1465135616 (1397.26 GiB 1500.30 GB)
     Array Size : 4395406848 (4191.79 GiB 4500.90 GB)
   Raid Devices : 6
  Total Devices : 6
Preferred Minor : 7

    Update Time : Sat May  2 21:22:22 2009
          State : clean
Internal Bitmap : present
 Active Devices : 6
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 0
       Checksum : e2f91a06 - correct
         Events : 13278

         Layout : near=2
     Chunk Size : 512K

      Number   Major   Minor   RaidDevice State
this     5       8      113        5      active sync   /dev/sdh1

   0     0       8       33        0      active sync   /dev/sdc1
   1     1       8       81        1      active sync   /dev/sdf1
   2     2       8       49        2      active sync   /dev/sdd1
   3     3       8       97        3      active sync   /dev/sdg1
   4     4       8       65        4      active sync   /dev/sde1
   5     5       8      113        5      active sync   /dev/sdh1
root@darkside:~# mdadm -X /dev/sd[c-h]1
        Filename : /dev/sdc1
           Magic : 6d746962
         Version : 4
            UUID : bbcc5f03:13105928:bea48659:c10950ee
          Events : 13278
  Events Cleared : 13278
           State : OK
       Chunksize : 16 MB
          Daemon : 5s flush period
      Write Mode : Normal
       Sync Size : 4395406848 (4191.79 GiB 4500.90 GB)
          Bitmap : 268275 bits (chunks), 0 dirty (0.0%)
        Filename : /dev/sdd1
           Magic : 6d746962
         Version : 4
            UUID : bbcc5f03:13105928:bea48659:c10950ee
          Events : 13278
  Events Cleared : 13278
           State : OK
       Chunksize : 16 MB
          Daemon : 5s flush period
      Write Mode : Normal
       Sync Size : 4395406848 (4191.79 GiB 4500.90 GB)
          Bitmap : 268275 bits (chunks), 0 dirty (0.0%)
        Filename : /dev/sde1
           Magic : 6d746962
         Version : 4
            UUID : bbcc5f03:13105928:bea48659:c10950ee
          Events : 13278
  Events Cleared : 13278
           State : OK
       Chunksize : 16 MB
          Daemon : 5s flush period
      Write Mode : Normal
       Sync Size : 4395406848 (4191.79 GiB 4500.90 GB)
          Bitmap : 268275 bits (chunks), 0 dirty (0.0%)
        Filename : /dev/sdf1
           Magic : 6d746962
         Version : 4
            UUID : bbcc5f03:13105928:bea48659:c10950ee
          Events : 13278
  Events Cleared : 13278
           State : OK
       Chunksize : 16 MB
          Daemon : 5s flush period
      Write Mode : Normal
       Sync Size : 4395406848 (4191.79 GiB 4500.90 GB)
          Bitmap : 268275 bits (chunks), 0 dirty (0.0%)
        Filename : /dev/sdg1
           Magic : 6d746962
         Version : 4
            UUID : bbcc5f03:13105928:bea48659:c10950ee
          Events : 13278
  Events Cleared : 13278
           State : OK
       Chunksize : 16 MB
          Daemon : 5s flush period
      Write Mode : Normal
       Sync Size : 4395406848 (4191.79 GiB 4500.90 GB)
          Bitmap : 268275 bits (chunks), 0 dirty (0.0%)
        Filename : /dev/sdh1
           Magic : 6d746962
         Version : 4
            UUID : bbcc5f03:13105928:bea48659:c10950ee
          Events : 13278
  Events Cleared : 13278
           State : OK
       Chunksize : 16 MB
          Daemon : 5s flush period
      Write Mode : Normal
       Sync Size : 4395406848 (4191.79 GiB 4500.90 GB)
          Bitmap : 268275 bits (chunks), 0 dirty (0.0%)
root@darkside:~# mdadm --fail /dev/md7 /dev/sd[fgh]1
mdadm: set /dev/sdf1 faulty in /dev/md7
mdadm: set /dev/sdg1 faulty in /dev/md7
mdadm: set /dev/sdh1 faulty in /dev/md7
root@darkside:~# mdadm --remove /dev/md7 /dev/sd[fgh]1
mdadm: hot removed /dev/sdf1
mdadm: hot removed /dev/sdg1
mdadm: hot removed /dev/sdh1
root@darkside:~# mount -a
root@darkside:~# umount /dev/md7
root@darkside:~# mdadm --stop /dev/md7
mdadm: stopped /dev/md7
root@darkside:~# mdadm -E /dev/sd[c-h]1
/dev/sdc1:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : bbcc5f03:13105928:bea48659:c10950ee
  Creation Time : Sat Apr 18 18:26:28 2009
     Raid Level : raid10
  Used Dev Size : 1465135616 (1397.26 GiB 1500.30 GB)
     Array Size : 4395406848 (4191.79 GiB 4500.90 GB)
   Raid Devices : 6
  Total Devices : 3
Preferred Minor : 7

    Update Time : Sat May  2 21:24:31 2009
          State : clean
Internal Bitmap : present
 Active Devices : 3
Working Devices : 3
 Failed Devices : 3
  Spare Devices : 0
       Checksum : e2f91915 - correct
         Events : 13294

         Layout : near=2
     Chunk Size : 512K

      Number   Major   Minor   RaidDevice State
this     0       8       33        0      active sync   /dev/sdc1

   0     0       8       33        0      active sync   /dev/sdc1
   1     1       0        0        1      faulty removed
   2     2       8       49        2      active sync   /dev/sdd1
   3     3       0        0        3      faulty removed
   4     4       8       65        4      active sync   /dev/sde1
   5     5       0        0        5      faulty removed
/dev/sdd1:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : bbcc5f03:13105928:bea48659:c10950ee
  Creation Time : Sat Apr 18 18:26:28 2009
     Raid Level : raid10
  Used Dev Size : 1465135616 (1397.26 GiB 1500.30 GB)
     Array Size : 4395406848 (4191.79 GiB 4500.90 GB)
   Raid Devices : 6
  Total Devices : 3
Preferred Minor : 7

    Update Time : Sat May  2 21:24:31 2009
          State : clean
Internal Bitmap : present
 Active Devices : 3
Working Devices : 3
 Failed Devices : 3
  Spare Devices : 0
       Checksum : e2f91929 - correct
         Events : 13294

         Layout : near=2
     Chunk Size : 512K

      Number   Major   Minor   RaidDevice State
this     2       8       49        2      active sync   /dev/sdd1

   0     0       8       33        0      active sync   /dev/sdc1
   1     1       0        0        1      faulty removed
   2     2       8       49        2      active sync   /dev/sdd1
   3     3       0        0        3      faulty removed
   4     4       8       65        4      active sync   /dev/sde1
   5     5       0        0        5      faulty removed
/dev/sde1:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : bbcc5f03:13105928:bea48659:c10950ee
  Creation Time : Sat Apr 18 18:26:28 2009
     Raid Level : raid10
  Used Dev Size : 1465135616 (1397.26 GiB 1500.30 GB)
     Array Size : 4395406848 (4191.79 GiB 4500.90 GB)
   Raid Devices : 6
  Total Devices : 3
Preferred Minor : 7

    Update Time : Sat May  2 21:24:31 2009
          State : clean
Internal Bitmap : present
 Active Devices : 3
Working Devices : 3
 Failed Devices : 3
  Spare Devices : 0
       Checksum : e2f9193d - correct
         Events : 13294

         Layout : near=2
     Chunk Size : 512K

      Number   Major   Minor   RaidDevice State
this     4       8       65        4      active sync   /dev/sde1

   0     0       8       33        0      active sync   /dev/sdc1
   1     1       0        0        1      faulty removed
   2     2       8       49        2      active sync   /dev/sdd1
   3     3       0        0        3      faulty removed
   4     4       8       65        4      active sync   /dev/sde1
   5     5       0        0        5      faulty removed
/dev/sdf1:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : bbcc5f03:13105928:bea48659:c10950ee
  Creation Time : Sat Apr 18 18:26:28 2009
     Raid Level : raid10
  Used Dev Size : 1465135616 (1397.26 GiB 1500.30 GB)
     Array Size : 4395406848 (4191.79 GiB 4500.90 GB)
   Raid Devices : 6
  Total Devices : 6
Preferred Minor : 7

    Update Time : Sat May  2 21:22:22 2009
          State : clean
Internal Bitmap : present
 Active Devices : 6
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 0
       Checksum : e2f919de - correct
         Events : 13278

         Layout : near=2
     Chunk Size : 512K

      Number   Major   Minor   RaidDevice State
this     1       8       81        1      active sync   /dev/sdf1

   0     0       8       33        0      active sync   /dev/sdc1
   1     1       8       81        1      active sync   /dev/sdf1
   2     2       8       49        2      active sync   /dev/sdd1
   3     3       8       97        3      active sync   /dev/sdg1
   4     4       8       65        4      active sync   /dev/sde1
   5     5       8      113        5      active sync   /dev/sdh1
/dev/sdg1:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : bbcc5f03:13105928:bea48659:c10950ee
  Creation Time : Sat Apr 18 18:26:28 2009
     Raid Level : raid10
  Used Dev Size : 1465135616 (1397.26 GiB 1500.30 GB)
     Array Size : 4395406848 (4191.79 GiB 4500.90 GB)
   Raid Devices : 6
  Total Devices : 6
Preferred Minor : 7

    Update Time : Sat May  2 21:23:51 2009
          State : clean
Internal Bitmap : present
 Active Devices : 5
Working Devices : 5
 Failed Devices : 1
  Spare Devices : 0
       Checksum : e2f91a5e - correct
         Events : 13280

         Layout : near=2
     Chunk Size : 512K

      Number   Major   Minor   RaidDevice State
this     3       8       97        3      active sync   /dev/sdg1

   0     0       8       33        0      active sync   /dev/sdc1
   1     1       0        0        1      faulty removed
   2     2       8       49        2      active sync   /dev/sdd1
   3     3       8       97        3      active sync   /dev/sdg1
   4     4       8       65        4      active sync   /dev/sde1
   5     5       8      113        5      active sync   /dev/sdh1
/dev/sdh1:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : bbcc5f03:13105928:bea48659:c10950ee
  Creation Time : Sat Apr 18 18:26:28 2009
     Raid Level : raid10
  Used Dev Size : 1465135616 (1397.26 GiB 1500.30 GB)
     Array Size : 4395406848 (4191.79 GiB 4500.90 GB)
   Raid Devices : 6
  Total Devices : 6
Preferred Minor : 7

    Update Time : Sat May  2 21:23:51 2009
          State : clean
Internal Bitmap : present
 Active Devices : 4
Working Devices : 4
 Failed Devices : 2
  Spare Devices : 0
       Checksum : e2f91a87 - correct
         Events : 13282

         Layout : near=2
     Chunk Size : 512K

      Number   Major   Minor   RaidDevice State
this     5       8      113        5      active sync   /dev/sdh1

   0     0       8       33        0      active sync   /dev/sdc1
   1     1       0        0        1      faulty removed
   2     2       8       49        2      active sync   /dev/sdd1
   3     3       0        0        3      faulty removed
   4     4       8       65        4      active sync   /dev/sde1
   5     5       8      113        5      active sync   /dev/sdh1
root@darkside:~# mdadm -X /dev/sd[c-h]1
        Filename : /dev/sdc1
           Magic : 6d746962
         Version : 4
            UUID : bbcc5f03:13105928:bea48659:c10950ee
          Events : 13294
  Events Cleared : 13278
           State : OK
       Chunksize : 16 MB
          Daemon : 5s flush period
      Write Mode : Normal
       Sync Size : 4395406848 (4191.79 GiB 4500.90 GB)
          Bitmap : 268275 bits (chunks), 3 dirty (0.0%)
        Filename : /dev/sdd1
           Magic : 6d746962
         Version : 4
            UUID : bbcc5f03:13105928:bea48659:c10950ee
          Events : 13294
  Events Cleared : 13278
           State : OK
       Chunksize : 16 MB
          Daemon : 5s flush period
      Write Mode : Normal
       Sync Size : 4395406848 (4191.79 GiB 4500.90 GB)
          Bitmap : 268275 bits (chunks), 3 dirty (0.0%)
        Filename : /dev/sde1
           Magic : 6d746962
         Version : 4
            UUID : bbcc5f03:13105928:bea48659:c10950ee
          Events : 13294
  Events Cleared : 13278
           State : OK
       Chunksize : 16 MB
          Daemon : 5s flush period
      Write Mode : Normal
       Sync Size : 4395406848 (4191.79 GiB 4500.90 GB)
          Bitmap : 268275 bits (chunks), 3 dirty (0.0%)
        Filename : /dev/sdf1
           Magic : 6d746962
         Version : 4
            UUID : bbcc5f03:13105928:bea48659:c10950ee
          Events : 13278
  Events Cleared : 13278
           State : OK
       Chunksize : 16 MB
          Daemon : 5s flush period
      Write Mode : Normal
       Sync Size : 4395406848 (4191.79 GiB 4500.90 GB)
          Bitmap : 268275 bits (chunks), 0 dirty (0.0%)
        Filename : /dev/sdg1
           Magic : 6d746962
         Version : 4
            UUID : bbcc5f03:13105928:bea48659:c10950ee
          Events : 13280
  Events Cleared : 13278
           State : OK
       Chunksize : 16 MB
          Daemon : 5s flush period
      Write Mode : Normal
       Sync Size : 4395406848 (4191.79 GiB 4500.90 GB)
          Bitmap : 268275 bits (chunks), 0 dirty (0.0%)
        Filename : /dev/sdh1
           Magic : 6d746962
         Version : 4
            UUID : bbcc5f03:13105928:bea48659:c10950ee
          Events : 13282
  Events Cleared : 13278
           State : OK
       Chunksize : 16 MB
          Daemon : 5s flush period
      Write Mode : Normal
       Sync Size : 4395406848 (4191.79 GiB 4500.90 GB)
          Bitmap : 268275 bits (chunks), 0 dirty (0.0%)
root@darkside:~# mdadm -A /dev/md7
[115116.898904] md: bind<sdf1>
[115116.907579] md: bind<sdd1>
[115116.916526] md: bind<sdg1>
[115116.926345] md: bind<sde1>
[115116.935542] md: bind<sdh1>
[115116.944477] md: bind<sdc1>
[115116.953071] md: kicking non-fresh sdh1 from array!
[115116.967686] md: unbind<sdh1>
[115116.977298] md: export_rdev(sdh1)
[115116.987495] md: kicking non-fresh sdg1 from array!
[115117.002105] md: unbind<sdg1>
[115117.012009] md: export_rdev(sdg1)
[115117.022209] md: kicking non-fresh sdf1 from array!
[115117.036817] md: unbind<sdf1>
[115117.046827] md: export_rdev(sdf1)
[115117.058179] raid10: raid set md7 active with 3 out of 6 devices
[115117.083006] md7: bitmap initialized from disk: read 9/9 pages, set 3 bits
[115117.103594] created bitmap (131 pages) for device md7
[115117.171218] md7: detected capacity change from 0 to 4500896612352
[115117.189718]  md7: unknown partition table
mdadm: /dev/md7 has been started with 3 drives (out of 6).
root@darkside:~# mdadm --readonly /dev/md7
[115125.143499] md: md7 switched to read-only mode.
root@darkside:~# mdadm --add /dev/md7 /dev/sd[fgh]1
mdadm: add new device failed for /dev/sdf1 as 6: Read-only file system
root@darkside:~# head -4 /proc/mdstat 
Personalities : [raid1] [raid10] 
md7 : active (read-only) raid10 sdc1[0] sde1[4] sdd1[2]
      4395406848 blocks 512K chunks 2 near-copies [6/3] [U_U_U_]
      bitmap: 2/131 pages [8KB], 16384KB chunk
root@darkside:~# mdadm --readwrite /dev/md7
[115207.027682] md: md7 switched to read-write mode.
root@darkside:~# mdadm --add /dev/md7 /dev/sd[fgh]1
[115218.558523] md: bind<sdf1>
[115218.641655] RAID10 conf printout:
[115218.651846]  --- wd:3 rd:6
[115218.660217]  disk 0, wo:0, o:1, dev:sdc1
[115218.660219]  disk 1, wo:1, o:1, dev:sdf1
[115218.660220]  disk 2, wo:0, o:1, dev:sdd1
[115218.660222]  disk 4, wo:0, o:1, dev:sde1
[115218.709281] md: recovery of RAID array md7
[115218.721834] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[115218.739591] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
[115218.739596] md: using 128k window, over a total of 1465135616 blocks.
mdadm: re-added /dev/sdf1
[115220.582354] md: bind<sdg1>
[115221.844379] md: bind<sdh1>
mdadm: re-added /dev/sdg1
mdadm: re-added /dev/sdh1
root@darkside:~# head -4 /proc/mdstat 
Personalities : [raid1] [raid10] 
md7 : active raid10 sdh1[6](S) sdg1[7](S) sdf1[8] sdc1[0] sde1[4] sdd1[2]
      4395406848 blocks 512K chunks 2 near-copies [6/3] [U_U_U_]
      [>....................]  recovery =  0.2% (3877760/1465135616) finish=200.9min speed=121180K/sec
root@darkside:~# sleep 60
root@darkside:~# head -4 /proc/mdstat 
Personalities : [raid1] [raid10] 
md7 : active raid10 sdh1[6](S) sdg1[7](S) sdf1[8] sdc1[0] sde1[4] sdd1[2]
      4395406848 blocks 512K chunks 2 near-copies [6/3] [U_U_U_]
      [>....................]  recovery =  2.0% (29498240/1465135616) finish=193.0min speed=123907K/sec
root@darkside:~# mdadm --stop /dev/md7
mdadm: stopped /dev/md7
root@darkside:~# mdadm -A /dev/md7
mdadm: /dev/md7 has been started with 3 drives (out of 6) and 3 spares.
root@darkside:~# head -4 /proc/mdstat 
Personalities : [raid1] [raid10] 
md7 : active (auto-read-only) raid10 sdc1[0] sdf1[8](S) sdg1[7](S) sdh1[6](S) sde1[4] sdd1[2]
      4395406848 blocks 512K chunks 2 near-copies [6/3] [U_U_U_]
      bitmap: 1/131 pages [4KB], 16384KB chunk
root@darkside:~# mdadm --readwrite /dev/md7
root@darkside:~# head -4 /proc/mdstat 
Personalities : [raid1] [raid10] 
md7 : active raid10 sdc1[0] sdf1[8] sdg1[7] sdh1[6] sde1[4] sdd1[2]
      4395406848 blocks 512K chunks 2 near-copies [6/3] [U_U_U_]
      [>....................]  recovery =  0.0% (172608/1465135616) finish=282.7min speed=86304K/sec
root@darkside:~# sleep 10
root@darkside:~# head -4 /proc/mdstat 
Personalities : [raid1] [raid10] 
md7 : active raid10 sdc1[0] sdf1[8] sdg1[7] sdh1[6] sde1[4] sdd1[2]
      4395406848 blocks 512K chunks 2 near-copies [6/3] [U_U_U_]
      [>....................]  recovery =  0.2% (4008384/1465135616) finish=323.8min speed=75191K/sec
root@darkside:~# iostat -m /dev/sd[c-h] 10 2 | tail -11
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1,34    0,10    4,75    0,99    0,00   92,83

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sdc             759,90        73,45         0,00        734          0
sdd             752,20        73,45         0,00        734          0
sde             753,10        73,45         0,00        734          0
sdf             730,70         0,00        73,45          0        734
sdg             261,10         0,00        73,47          0        734
sdh             290,60         0,00        73,41          0        734


* Re: Incorrect in-kernel bitmap on raid10
  2009-05-02 19:52             ` Mario 'BitKoenig' Holbe
@ 2009-05-02 22:41               ` NeilBrown
  2009-05-03 13:22                 ` Mario 'BitKoenig' Holbe
  0 siblings, 1 reply; 16+ messages in thread
From: NeilBrown @ 2009-05-02 22:41 UTC (permalink / raw)
  To: Mario 'BitKoenig' Holbe, NeilBrown, linux-raid

On Sun, May 3, 2009 5:52 am, Mario 'BitKoenig' Holbe wrote:
> I guess, there is something else going wrong here. I attached a (quite
> large, sorry for that) transcript of what I was doing with the output of
> what I think could help.
>
> What I did:
> * I had a stable and clean raid10 out of 6 disks with superblocks all
>   uptodate, on-disk bitmaps all uptodate.
> * I failed and removed 3 of the disks.
> * I set some bits in the bitmap via mounting/umounting the raid10.
> * I stopped the raid10 just to make sure all superblocks/bitmaps are
>   uptodate.
> * I assembled the raid10 again with 3 out of 6 devices now.
> * I re-added the 3 missing disks.
>   Please note that I cannot add them all at the same time because if
>   the array is read-write the resync starts immediately when the first
>   device is added, while I cannot add devices as long as the array is
>   read-only.
>   In this immediately starting resync only the first of the three
>   spares is synced; it seems to ignore the bitmap and generates
>   I/O all the time (I just forgot to c'n'p the evidence). I have seen
>   that 3 times now with iostat; it really does I/O.
> * I stopped and started the array to make it resync over all 3 spares
>   concurrently; it seems to ignore the bitmap and generates I/O
>   all the time again.

I managed to reproduce this thanks to all the detail you provided.
The problem was caused by trying to add a device to the array while the
array was readonly.
mdadm attempts a re-add. When this fails, it falls back to a conventional
add, which involves writing new metadata that marks the device as a spare.
That destroys the information that would have allowed the fast,
bitmap-based resync.

Can you duplicate the problem without setting the array to readonly?

The next mdadm release will check for EROFS from the re-add attempt and
not attempt the conventional add, thus saving the metadata.
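
Spelled out with the device names used in the transcript above (a recap
of the commands already shown there, not a new test), the ordering that
triggers this is:

  mdadm --assemble /dev/md7
  mdadm --readonly /dev/md7
  mdadm --add /dev/md7 /dev/sdf1 /dev/sdg1 /dev/sdh1

The kernel refuses the actual add (EROFS), but as described above the
conventional-add fallback rewrites the superblocks on the new devices
as plain spares, which is what destroys the fast-resync information.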

Yes, it would be nice to be able to add multiple devices at once.
Maybe I could just get the kernel to wait 100ms after an add before
starting recovery in case a second device is about to be added.
I'll give it some thought.

NeilBrown



* Re: Incorrect in-kernel bitmap on raid10
  2009-05-02 22:41               ` NeilBrown
@ 2009-05-03 13:22                 ` Mario 'BitKoenig' Holbe
  2009-05-07 20:25                   ` Mario 'BitKoenig' Holbe
  0 siblings, 1 reply; 16+ messages in thread
From: Mario 'BitKoenig' Holbe @ 2009-05-03 13:22 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid


On Sun, May 03, 2009 at 08:41:49AM +1000, NeilBrown wrote:
> I managed to reproduce this thanks to all the detail you provided.
> The problem was caused by trying to add a device to the array while the
> array was readonly.

Okay, I understand. I was just trying to find a way to sync all the
spares simultaneously.

> Can you duplicate the problem without setting the array to readonly?

No: mdadm -A /dev/md7; mdadm --add /dev/md7 /dev/sd[fgh]1 seems to
respect the bitmap, but it initially adds only the first spare and adds
the two remaining spares only after the first resync has finished.
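
Spelled out with the glob expanded (a sketch, device names as in my
earlier transcript), what I ran and what happens is:

  mdadm -A /dev/md7
  mdadm --add /dev/md7 /dev/sdf1 /dev/sdg1 /dev/sdh1
  # sdf1 is re-added and recovered using the bitmap;
  # sdg1 and sdh1 sit as spares until that recovery has finished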

According to your previous mail, I will run into your case 3/ this way
(the bug where the bitmap is cleared too early), won't I?
I'm very willing to test the patch you announced :)

> Yes, it would be nice to be able to add multiple devices at once.
> Maybe I could just get the kernel to wait 100ms after an add before

This is probably the easiest way to work around the issue in the short
run. However, timeouts like that invite race conditions. Perhaps allowing
spares to be added in read-only mode, or some flag like "wait for
confirmation before starting to resync", would be a cleaner approach in
the long run.
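
Something close to that can probably already be approximated from
userspace via sysfs, assuming the running kernel's md sync_action
attribute accepts the "frozen" keyword (I have not verified this on
2.6.28, so treat it as a sketch rather than a tested recipe):

  echo frozen > /sys/block/md7/md/sync_action   # keep recovery from starting
  mdadm --add /dev/md7 /dev/sdf1 /dev/sdg1 /dev/sdh1
  echo idle > /sys/block/md7/md/sync_action     # unfreeze; recovery then runs
                                                # over all newly added devices at once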


Thank you very much for your help
   Mario
-- 
Vegetables taste best when, shortly before eating, you replace them
with a schnitzel!


* Re: Incorrect in-kernel bitmap on raid10
  2009-05-03 13:22                 ` Mario 'BitKoenig' Holbe
@ 2009-05-07 20:25                   ` Mario 'BitKoenig' Holbe
  0 siblings, 0 replies; 16+ messages in thread
From: Mario 'BitKoenig' Holbe @ 2009-05-07 20:25 UTC (permalink / raw)
  To: linux-raid

Mario 'BitKoenig' Holbe <Mario.Holbe@TU-Ilmenau.DE> wrote:
> No: mdadm -A /dev/md7; mdadm --add /dev/md7 /dev/sd[fgh]1 seems to
> respect the bitmap, but it initially adds only the first spare and adds
...
> According to your previous mail, I will run into your case 3/ this way
> (the bug where the bitmap is cleared too early), won't I?

All right, I applied your "[md PATCH 2/7] md/raid10: don't clear bitmap
during recovery if array will still be degraded." patch, modified about
2% of my fully degraded raid10, synched the 3 missing devices, compared
the mirrors and found no difference, so I guess I can confirm the patch
works for me :)
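
For anyone who wants to repeat that comparison: one convenient way
(assuming raid10 on this kernel supports the "check" sync action; this
is just a suggestion, not necessarily how I compared the mirrors above)
is md's own consistency check:

  echo check > /sys/block/md7/md/sync_action
  # wait for the check to finish (watch /proc/mdstat), then:
  cat /sys/block/md7/md/mismatch_cnt   # 0 means the copies agree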


regards
   Mario
-- 
The opposite of a correct statement is a false statement.
But the opposite of a profound truth may well be another profound truth.
                                                           -- Niels Bohr



Thread overview: 16+ messages
2009-04-18 18:15 Incorrect in-kernel bitmap on raid10 Mario 'BitKoenig' Holbe
2009-04-19  6:24 ` Neil Brown
2009-04-19 22:55   ` Mario 'BitKoenig' Holbe
2009-04-19 23:27     ` Neil Brown
2009-04-20  0:13       ` Race condition in write_sb_page? (was: Re: Incorrect in-kernel bitmap on raid10) Mario 'BitKoenig' Holbe
2009-04-20  1:57         ` NeilBrown
2009-04-20  8:03           ` Race condition in write_sb_page? Mario 'BitKoenig' Holbe
2009-04-22 18:45   ` Incorrect in-kernel bitmap on raid10 Mario 'BitKoenig' Holbe
2009-04-28 14:05     ` Mario 'BitKoenig' Holbe
2009-05-01  2:11       ` Neil Brown
2009-05-01 17:55         ` Mario 'BitKoenig' Holbe
2009-05-01 21:36           ` NeilBrown
2009-05-02 19:52             ` Mario 'BitKoenig' Holbe
2009-05-02 22:41               ` NeilBrown
2009-05-03 13:22                 ` Mario 'BitKoenig' Holbe
2009-05-07 20:25                   ` Mario 'BitKoenig' Holbe
