* AWFUL reshape speed with raid5.
@ 2008-07-28 17:39 Jon Nelson
2008-07-28 18:14 ` Justin Piszcz
` (2 more replies)
0 siblings, 3 replies; 20+ messages in thread
From: Jon Nelson @ 2008-07-28 17:39 UTC (permalink / raw)
To: LinuxRaid
I built a raid5 with 2 devices (and --assume-clean) using 2x 4GB
partitions (not logical volumes).
I then grew it to 3 devices.
The reshape speed is really really slow.
vmstat shows I/O like this:
0 0 212 25844 141160 497484 0 0 0 612 673 1284 0 6 93 0
0 0 212 25164 141160 497748 0 0 0 19 594 1253 1 4 95 0
0 0 212 25044 141160 498004 0 0 0 0 374 445 0 1 99 0
1 0 212 25220 141164 498000 0 0 0 23 506 1149 0 3 96 1
0 0 212 25500 141164 498004 0 0 0 3 546 1416 0 5 95 0
The min/max is 1000/200000.
What might be going on here?
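(For reference, 1000 and 200000 KB/sec are the kernel's default md
resync speed limits. A minimal sketch of where to read them, assuming the
array is md99 as below:

cat /proc/sys/dev/raid/speed_limit_min   # system-wide floor, in KB/sec (default 1000)
cat /proc/sys/dev/raid/speed_limit_max   # system-wide ceiling, in KB/sec (default 200000)
cat /sys/block/md99/md/sync_speed_min    # per-array override of the floor
cat /sys/block/md99/md/sync_speed_max    # per-array override of the ceiling
)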
Kernel is 2.6.25.11 (openSUSE 11.0 x86-64 stock)
/proc/mdstat for this entry:
md99 : active raid5 sdd3[2] sdc3[1] sdb3[0]
3903744 blocks super 1.0 level 5, 64k chunk, algorithm 2 [3/3] [UUU]
[=>...................] reshape = 8.2% (324224/3903744)
finish=43.3min speed=1373K/sec
This is on a set of devices capable of 70+ MB/s.
No meaningful change if I start with 3 disks and grow to 4, with or
without bitmap.
--
Jon
* Re: AWFUL reshape speed with raid5.
2008-07-28 17:39 AWFUL reshape speed with raid5 Jon Nelson
@ 2008-07-28 18:14 ` Justin Piszcz
2008-07-28 18:24 ` Jon Nelson
2008-08-01 1:26 ` Neil Brown
2008-08-21 2:58 ` Jon Nelson
2 siblings, 1 reply; 20+ messages in thread
From: Justin Piszcz @ 2008-07-28 18:14 UTC (permalink / raw)
To: Jon Nelson; +Cc: LinuxRaid
What happens if you use 0.90 superblocks?
Also, are these SATA ports on the motherboard, or SATA ports on a
PCI-based controller card?
On Mon, 28 Jul 2008, Jon Nelson wrote:
> I built a raid5 with 2 devices (and --assume-clean) using 2x 4GB
> partitions (not logical volumes).
> I then grew it to 3 devices.
> The reshape speed is really really slow.
>
> vmstat shows I/O like this:
>
> 0 0 212 25844 141160 497484 0 0 0 612 673 1284 0 6 93 0
> 0 0 212 25164 141160 497748 0 0 0 19 594 1253 1 4 95 0
> 0 0 212 25044 141160 498004 0 0 0 0 374 445 0 1 99 0
> 1 0 212 25220 141164 498000 0 0 0 23 506 1149 0 3 96 1
> 0 0 212 25500 141164 498004 0 0 0 3 546 1416 0 5 95 0
>
> The min/max is 1000/200000.
> What might be going on here?
>
> Kernel is 2.6.25.11 (openSUSE 11.0 x86-64 stock)
>
> /proc/mdstat for this entry:
>
> md99 : active raid5 sdd3[2] sdc3[1] sdb3[0]
> 3903744 blocks super 1.0 level 5, 64k chunk, algorithm 2 [3/3] [UUU]
> [=>...................] reshape = 8.2% (324224/3903744)
> finish=43.3min speed=1373K/sec
>
>
> This is on a set of devices capable of 70+ MB/s.
>
> No meaningful change if I start with 3 disks and grow to 4, with or
> without bitmap.
>
> --
> Jon
* Re: AWFUL reshape speed with raid5.
2008-07-28 18:14 ` Justin Piszcz
@ 2008-07-28 18:24 ` Jon Nelson
2008-07-28 18:55 ` Jon Nelson
0 siblings, 1 reply; 20+ messages in thread
From: Jon Nelson @ 2008-07-28 18:24 UTC (permalink / raw)
To: Justin Piszcz; +Cc: LinuxRaid
On Mon, Jul 28, 2008 at 1:14 PM, Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
> What happens if you use 0.90 superblocks?
An even more spectacular explosion. Really quite impressive, actually.
Just a small sample of the >800 lines:
Jul 28 13:18:54 turnip kernel: itmap 9: invalid bitmap page 9: invalid
bitmap page 9: invalid bitmap p9: invalid bitmap pag9: invalid bitmap
9: invalid bitmap page9: invalid bitmap pa9: invalid bitmap pag9:
invalid bitmap 9: invalid bitmap 9: invalid bitmap pag9: invalid
bitmap page9: invalid bitmap 9: invalid bitmap 9: invalid bitmap
page9: invalid bitmap pa9: invalid bitmap pag9: invalid bitmap pa9:
invalid bitmap p9: invalid bitmap page 9: invalid bitmap page 9:
invalid bitmap p9: invalid bitmap 9: invalid bitmap page9: invalid
bitmap pag9: invalid bitmap pag9: invalid bitmap page9: invalid bitmap
9: invalid bitmap pag9: invalid bitmap page9: invalid bitmap pag9:
invalid bitmap9: invalid bitmap page9: invalid bitmap pag9: invalid
bitmap pa9: invalid bitmap page9: invalid bitmap pa9: invalid bitmap
p9: invalid bitmap page 9: invalid bitmap page 9: invalid bit9:
invalid bitmap pa9: invalid bitmap page 9: invalid bitmap pag9:
invalid bitmap page9: invalid bitmap 9: invalid bitmap pa9: invalid
bitmap page9: invalid
(Note the page -> pag -> p ... variations/corruption/whatever).
and
Jul 28 13:18:55 turnip kernel: i9: i9: 9: 9: i9: 9: i9: 9: 9: 9: 9: 9:
9: 9: 9: 9: 9: 9: i9: 9: 9: 9: 9: 9: 9: 9: 9: i9: i9: 9: i9: 9: i9:
i9: 9: 9: i9: 9: i9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: i9: 9: 9: 9: 9:
9: 9: 9: 9: i9: i9: 9: in9: 9: 9: 9: 9: 9: i9: 9: i9: 9: i9: 9: i9:
i9: 9: 9: 9: 9: 9: 9: i9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9:
9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9:
9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9:
9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: i9: 9: i9:
9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: i9: i9: 9:
9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: i9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9:
9: i9: i9: 9: 9: 9: 9: 9: 9: i9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9:
9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9:
9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9:
9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9:
9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9:
Jul 28 13:18:55 turnip kernel: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9:
9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9:
9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9:
9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9:
9: i9: i9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9:
9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9:
9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: i9: i9: 9: 9:
9: 9: 9: 9: 9: i9: 9: i9: 9: i9: 9: i9: 9: i9: 9: i9: 9: i9: 9: i9: 9:
i9: 9: i9: 9: i9: 9: i9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9:
9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9:
9: 9: 9: i9: 9: i9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9:
9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: 9: invalid bitmap page request:
222 (> 122)
> Also, are these SATA ports on the motherboard, or SATA ports on a
> PCI-based controller card?
The ports are on a PCI-E motherboard, MCP55 chipset. The
motherboard/chipset/whatever is *not* the problem.
--
Jon
* Re: AWFUL reshape speed with raid5.
2008-07-28 18:24 ` Jon Nelson
@ 2008-07-28 18:55 ` Jon Nelson
2008-07-28 19:17 ` Roger Heflin
` (2 more replies)
0 siblings, 3 replies; 20+ messages in thread
From: Jon Nelson @ 2008-07-28 18:55 UTC (permalink / raw)
To: Justin Piszcz; +Cc: LinuxRaid
Some more data points, observations, and questions.
For each test, I'd --create the array, drop the caches, --grow, and
then watch vmstat and also record the time between
kernel: md: resuming resync of md99 from checkpoint.
and
kernel: md: md99: resync done.
I found three things:
1. metadata version matters. Why?
2. VERY LITTLE I/O takes place (between 0 and 100KB/s, typically no
I/O at all) according to vmstat. Why? If it takes 1m34s to "grow" the
array, but no I/O is taking place, then what is actually taking so
long?
3. I removed the bitmap for these tests. Having a bitmap meant that
the overall speed was REALLY HORRIBLE.
The results:
metadata: time taken
0.9: 27s
1.0: 27s
1.1: 37s
1.2: 1m34s
Questions (repeated):
1. Why does the metadata version matter so much?
2. If no I/O is taking place, why does it take so long? [ NOTE: I/O
must be taking place but why doesn't vmstat show it? ]
--
Jon
* Re: AWFUL reshape speed with raid5.
2008-07-28 18:55 ` Jon Nelson
@ 2008-07-28 19:17 ` Roger Heflin
2008-07-28 19:43 ` Justin Piszcz
2008-07-30 16:23 ` Bill Davidsen
2 siblings, 0 replies; 20+ messages in thread
From: Roger Heflin @ 2008-07-28 19:17 UTC (permalink / raw)
To: Jon Nelson; +Cc: Justin Piszcz, LinuxRaid
Jon Nelson wrote:
> Some more data points, observations, and questions.
>
> For each test, I'd --create the array, drop the caches, --grow, and
> then watch vmstat and also record the time between
>
> kernel: md: resuming resync of md99 from checkpoint.
> and
> kernel: md: md99: resync done.
>
> I found three things:
>
> 1. metadata version matters. Why?
> 2. VERY LITTLE I/O takes place (between 0 and 100KB/s, typically no
> I/O at all) according to vmstat. Why? If it takes 1m34s to "grow" the
> array, but no I/O is taking place, then what is actually taking so
> long?
I *think* that internal md io is not being shown.
I know I can tell an array to check itself and have mdstat indicate a
speed of 35MB/second while vmstat indicates no IO is happening. The same
happens when an array is rebuilding: vmstat indicates no IO. If you do IO
to the md device from outside, it does show that. And in both cases
visually checking the drives confirms that quite a lot appears to be
going on.
Roger
* Re: AWFUL reshape speed with raid5.
2008-07-28 18:55 ` Jon Nelson
2008-07-28 19:17 ` Roger Heflin
@ 2008-07-28 19:43 ` Justin Piszcz
2008-07-28 19:59 ` David Lethe
2008-07-30 16:23 ` Bill Davidsen
2 siblings, 1 reply; 20+ messages in thread
From: Justin Piszcz @ 2008-07-28 19:43 UTC (permalink / raw)
To: Jon Nelson; +Cc: LinuxRaid
There was once a bug in an earlier kernel in which the rebuild ran at
min_speed if you had a specific chunk size. Have you tried echoing 30000
into min_speed? Does it increase the rebuild to 30MB/s?
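A minimal sketch of that test, assuming the array is md99 as elsewhere in
this thread (the value is in KB/sec, so 30000 asks for roughly 30 MB/s):

echo 30000 > /sys/block/md99/md/sync_speed_min   # raise the per-array minimum
cat /proc/mdstat                                 # see whether the reported speed follows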
On Mon, 28 Jul 2008, Jon Nelson wrote:
> Some more data points, observations, and questions.
>
> For each test, I'd --create the array, drop the caches, --grow, and
> then watch vmstat and also record the time between
>
> kernel: md: resuming resync of md99 from checkpoint.
> and
> kernel: md: md99: resync done.
>
> I found three things:
>
> 1. metadata version matters. Why?
> 2. VERY LITTLE I/O takes place (between 0 and 100KB/s, typically no
> I/O at all) according to vmstat. Why? If it takes 1m34s to "grow" the
> array, but no I/O is taking place, then what is actually taking so
> long?
> 3. I removed the bitmap for these tests. Having a bitmap meant that
> the overall speed was REALLY HORRIBLE.
>
> The results:
>
> metadata: time taken
>
> 0.9: 27s
> 1.0: 27s
> 1.1: 37s
> 1.2: 1m34s
>
> Questions (repeated):
>
> 1. Why does the metadata version matter so much?
> 2. If no I/O is taking place, why does it take so long? [ NOTE: I/O
> must be taking place but why doesn't vmstat show it? ]
>
> --
> Jon
>
* RE: AWFUL reshape speed with raid5.
2008-07-28 19:43 ` Justin Piszcz
@ 2008-07-28 19:59 ` David Lethe
2008-07-28 20:56 ` Jon Nelson
0 siblings, 1 reply; 20+ messages in thread
From: David Lethe @ 2008-07-28 19:59 UTC (permalink / raw)
To: Justin Piszcz, Jon Nelson; +Cc: LinuxRaid
>-----Original Message-----
>From: linux-raid-owner@vger.kernel.org
>[mailto:linux-raid-owner@vger.kernel.org] On Behalf Of Justin Piszcz
>Sent: Monday, July 28, 2008 2:44 PM
>To: Jon Nelson
>Cc: LinuxRaid
>Subject: Re: AWFUL reshape speed with raid5.
>
>There was once a bug in an earlier kernel in which the rebuild ran at
>min_speed if you had a specific chunk size. Have you tried echoing 30000
>into min_speed? Does it increase the rebuild to 30MB/s?
>
>On Mon, 28 Jul 2008, Jon Nelson wrote:
>
>> Some more data points, observations, and questions.
>>
>> For each test, I'd --create the array, drop the caches, --grow, and
>> then watch vmstat and also record the time between
>>
>> kernel: md: resuming resync of md99 from checkpoint.
>> and
>> kernel: md: md99: resync done.
>>
>> I found three things:
>>
>> 1. metadata version matters. Why?
>> 2. VERY LITTLE I/O takes place (between 0 and 100KB/s, typically no
>> I/O at all) according to vmstat. Why? If it takes 1m34s to "grow" the
>> array, but no I/O is taking place, then what is actually taking so
>> long?
>> 3. I removed the bitmap for these tests. Having a bitmap meant that
>> the overall speed was REALLY HORRIBLE.
>>
>> The results:
>>
>> metadata: time taken
>>
>> 0.9: 27s
>> 1.0: 27s
>> 1.1: 37s
>> 1.2: 1m34s
>>
>> Questions (repeated):
>>
>> 1. Why does the metadata version matter so much?
>> 2. If no I/O is taking place, why does it take so long? [ NOTE: I/O
>> must be taking place but why doesn't vmstat show it? ]
>>
>> --
>> Jon
>>
You are incorrectly working from the premise that vmstat measures
disk activity. It does not. vmstat has no idea how many actual bytes
get sent to, or received from, disk drives.
Why not do a real test and hook up a pair of SAS, SCSI, or FC disks,
then issue some LOG SENSE commands to report the actual number of bytes
read & written to each disk during the rebuild? If the disks are
FibreChannel, then you have even more ways to measure true throughput in
bytes. It will not be an estimate, it will be a real count of
cumulative bytes read, written, re-read/re-written, recovered, etc., for
any instant in time. Heck, if you have Seagate and some other disks,
then you can even see detailed information for cached reads so you can
see if any particular md configuration results in a higher number of
cached I/Os, meaning greater efficiency and smaller overall latency.
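As a rough sketch of that approach (hypothetical device name; needs
sg3_utils, and only drives that implement the relevant log pages report
byte counters):

sg_logs -a /dev/sdb    # dump all supported SCSI log pages
# on SAS/FC drives the read/write counter pages include cumulative
# totals of bytes processed; sample before and after the reshape and diff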
David @ santools dot com
David
* Re: AWFUL reshape speed with raid5.
2008-07-28 19:59 ` David Lethe
@ 2008-07-28 20:56 ` Jon Nelson
0 siblings, 0 replies; 20+ messages in thread
From: Jon Nelson @ 2008-07-28 20:56 UTC (permalink / raw)
To: David Lethe; +Cc: Justin Piszcz, LinuxRaid
On Mon, Jul 28, 2008 at 2:59 PM, David Lethe <david@santools.com> wrote:
>>-----Original Message-----
>>From: linux-raid-owner@vger.kernel.org
>>[mailto:linux-raid-owner@vger.kernel.org] On Behalf Of Justin Piszcz
>>Sent: Monday, July 28, 2008 2:44 PM
>>To: Jon Nelson
>>Cc: LinuxRaid
>>Subject: Re: AWFUL reshape speed with raid5.
>>
>>There was once a bug in an earlier kernel in which the rebuild ran at
>>min_speed if you had a specific chunk size. Have you tried echoing 30000
>>into min_speed? Does it increase the rebuild to 30MB/s?
As I said in my original post, I'm running 2.6.25.11
> You are incorrectly working from the premise that vmstat measures
> disk activity. It does not. vmstat has no idea how many actual bytes
> get sent to, or received from, disk drives.
Fair enough.
> Why not do a real test and hook up a pair of SAS, SCSI, or FC disks,
...
Cost.
--
Jon
* Re: AWFUL reshape speed with raid5.
2008-07-28 18:55 ` Jon Nelson
2008-07-28 19:17 ` Roger Heflin
2008-07-28 19:43 ` Justin Piszcz
@ 2008-07-30 16:23 ` Bill Davidsen
2008-07-30 16:31 ` Jon Nelson
2008-07-30 16:50 ` David Greaves
2 siblings, 2 replies; 20+ messages in thread
From: Bill Davidsen @ 2008-07-30 16:23 UTC (permalink / raw)
To: Jon Nelson; +Cc: Justin Piszcz, LinuxRaid
Jon Nelson wrote:
> 1. metadata version matters. Why?
> 2. VERY LITTLE I/O takes place (between 0 and 100KB/s, typically no
> I/O at all) according to vmstat. Why? If it takes 1m34s to "grow" the
> array, but no I/O is taking place, then what is actually taking so
> long?
> 3. I removed the bitmap for these tests. Having a bitmap meant that
> the overall speed was REALLY HORRIBLE.
>
> The results:
>
> metadata: time taken
>
> 0.9: 27s
> 1.0: 27s
> 1.1: 37s
> 1.2: 1m34s
>
> Questions (repeated):
>
> 1. Why does the metadata version matter so much?
>
I have no idea.
> 2. If no I/O is taking place, why does it take so long? [ NOTE: I/O
> must be taking place but why doesn't vmstat show it? ]
>
vmstat doesn't tell you enough; you need a tool that shows per-device and
per-partition I/O, which will give you what you need. I can't put my
finger on the one I wrote, but there are others.
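Two common options, sketched with the device names used in this thread
(iostat is from the sysstat package; dstat is what gets used a bit
further down):

iostat -x -k 1 sdb sdc sdd sde    # extended per-device stats, once per second
dstat -d -D sdb,sdc,sdd,sde       # per-disk read/write rates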
--
Bill Davidsen <davidsen@tmr.com>
"Woe unto the statesman who makes war without a reason that will still
be valid when the war is over..." Otto von Bismark
* Re: AWFUL reshape speed with raid5.
2008-07-30 16:23 ` Bill Davidsen
@ 2008-07-30 16:31 ` Jon Nelson
2008-07-30 17:08 ` Justin Piszcz
2008-07-30 16:50 ` David Greaves
1 sibling, 1 reply; 20+ messages in thread
From: Jon Nelson @ 2008-07-30 16:31 UTC (permalink / raw)
To: Bill Davidsen; +Cc: Justin Piszcz, LinuxRaid
On Wed, Jul 30, 2008 at 11:23 AM, Bill Davidsen <davidsen@tmr.com> wrote:
> Jon Nelson wrote:
>> 2. If no I/O is taking place, why does it take so long? [ NOTE: I/O
>> must be taking place but why doesn't vmstat show it? ]
>>
>
> vmstat doesn't tell you enough, you need a tool to show per-device and
> per-partition io, which will give you what you need. I can't put a finger on
> the one I wrote, but there are others.
I gave dstat a try (actually, I rather prefer dstat over vmstat...):
This is right before, during, and after the --grow operation.
--dsk/sdb-----dsk/sdc-----dsk/sdd-----dsk/sde----dsk/total-
read writ: read writ: read writ: read writ: read writ
0 44k: 0 24k: 0 24k: 0 0 : 0 184k
0 0 : 0 0 : 0 0 : 0 0 : 0 0
0 24k: 0 0 : 0 0 : 0 0 : 0 48k
32M 14k: 32M 2048B: 32M 2048B: 0 0 : 191M 36k
63M 0 : 63M 0 : 63M 0 : 0 0 : 377M 0
65M 0 : 65M 0 : 65M 0 : 0 0 : 391M 0
72M 0 : 73M 0 : 73M 0 : 0 0 : 437M 0
74M 0 : 73M 0 : 74M 0 : 0 0 : 441M 0
70M 48k: 70M 0 : 70M 0 : 0 0 : 419M 96k
61M 44k: 61M 16k: 62M 32k: 0 0 : 368M 184k
71M 0 : 72M 0 : 71M 0 : 0 0 : 429M 0
74M 0 : 73M 0 : 73M 0 : 0 0 : 439M 0
73M 0 : 73M 0 : 73M 0 : 0 0 : 439M 0
71M 20k: 71M 0 : 71M 0 : 0 0 : 426M 40k
72M 0 : 72M 0 : 73M 0 : 0 0 : 434M 0
73M 0 : 74M 0 : 73M 0 : 0 0 : 442M 0
60M 40k: 59M 16k: 59M 28k: 0 0 : 356M 168k
73M 0 : 73M 0 : 73M 0 : 0 0 : 438M 0
70M 24k: 69M 0 : 70M 0 : 0 0 : 418M 48k
72M 0 : 71M 0 : 72M 0 : 0 0 : 430M 0
73M 0 : 73M 0 : 73M 0 : 0 0 : 437M 0
71M 0 : 71M 0 : 71M 0 : 0 0 : 427M 0
73M 0 : 73M 0 : 73M 0 : 0 0 : 437M 0
71M 24k: 71M 0 : 71M 0 : 0 0 : 428M 48k
72M 0 : 72M 0 : 72M 0 : 0 0 : 432M 0
73M 0 : 73M 0 : 73M 0 : 0 0 : 437M 0
71M 4096B: 70M 0 : 70M 0 : 0 0 : 422M 8192B
58M 0 : 58M 0 : 58M 0 : 0 0 : 350M 0
59M 24k: 60M 0 : 59M 0 : 0 0 : 357M 48k
74M 0 : 73M 0 : 74M 0 : 0 0 : 441M 0
19M 8192B: 19M 8192B: 19M 4096B: 0 0 : 114M 40k
0 0 : 0 0 : 0 0 : 0 0 : 0 0
0 0 : 0 0 : 0 0 : 0 0 : 0 0
0 160k: 0 16k: 0 20k: 0 0 : 0 392k
So. Clearly, lots of I/O. 440MB/s total. Almost entirely reads.
Question: when we --grow --size the array, we clearly see lots of reads.
Why aren't we seeing any (meaningful) writes? If there are no writes,
then what purpose do the reads serve?
* Re: AWFUL reshape speed with raid5.
2008-07-30 16:23 ` Bill Davidsen
2008-07-30 16:31 ` Jon Nelson
@ 2008-07-30 16:50 ` David Greaves
2008-07-30 17:24 ` Bill Davidsen
1 sibling, 1 reply; 20+ messages in thread
From: David Greaves @ 2008-07-30 16:50 UTC (permalink / raw)
To: Bill Davidsen; +Cc: Jon Nelson, Justin Piszcz, LinuxRaid
Bill Davidsen wrote:
> vmstat doesn't tell you enough, you need a tool to show per-device and
> per-partition io, which will give you what you need. I can't put a
> finger on the one I wrote, but there are others.
iostat?
David
* Re: AWFUL reshape speed with raid5.
2008-07-30 16:31 ` Jon Nelson
@ 2008-07-30 17:08 ` Justin Piszcz
2008-07-30 17:48 ` Jon Nelson
0 siblings, 1 reply; 20+ messages in thread
From: Justin Piszcz @ 2008-07-30 17:08 UTC (permalink / raw)
To: Jon Nelson; +Cc: Bill Davidsen, LinuxRaid
On Wed, 30 Jul 2008, Jon Nelson wrote:
> On Wed, Jul 30, 2008 at 11:23 AM, Bill Davidsen <davidsen@tmr.com> wrote:
>> Jon Nelson wrote:
>>> 2. If no I/O is taking place, why does it take so long? [ NOTE: I/O
>>> must be taking place but why doesn't vmstat show it? ]
>>>
>>
>> vmstat doesn't tell you enough, you need a tool to show per-device and
>> per-partition io, which will give you what you need. I can't put a finger on
>> the one I wrote, but there are others.
>
> I gave dstat a try (actually, I rather prefer dstat over vmstat...):
>
> This is right before, during, and after the --grow operation.
>
> --dsk/sdb-----dsk/sdc-----dsk/sdd-----dsk/sde----dsk/total-
> read writ: read writ: read writ: read writ: read writ
> 0 44k: 0 24k: 0 24k: 0 0 : 0 184k
> 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> 0 24k: 0 0 : 0 0 : 0 0 : 0 48k
> 32M 14k: 32M 2048B: 32M 2048B: 0 0 : 191M 36k
> 63M 0 : 63M 0 : 63M 0 : 0 0 : 377M 0
> 65M 0 : 65M 0 : 65M 0 : 0 0 : 391M 0
> 72M 0 : 73M 0 : 73M 0 : 0 0 : 437M 0
> 74M 0 : 73M 0 : 74M 0 : 0 0 : 441M 0
> 70M 48k: 70M 0 : 70M 0 : 0 0 : 419M 96k
> 61M 44k: 61M 16k: 62M 32k: 0 0 : 368M 184k
> 71M 0 : 72M 0 : 71M 0 : 0 0 : 429M 0
> 74M 0 : 73M 0 : 73M 0 : 0 0 : 439M 0
> 73M 0 : 73M 0 : 73M 0 : 0 0 : 439M 0
> 71M 20k: 71M 0 : 71M 0 : 0 0 : 426M 40k
> 72M 0 : 72M 0 : 73M 0 : 0 0 : 434M 0
> 73M 0 : 74M 0 : 73M 0 : 0 0 : 442M 0
> 60M 40k: 59M 16k: 59M 28k: 0 0 : 356M 168k
> 73M 0 : 73M 0 : 73M 0 : 0 0 : 438M 0
> 70M 24k: 69M 0 : 70M 0 : 0 0 : 418M 48k
> 72M 0 : 71M 0 : 72M 0 : 0 0 : 430M 0
> 73M 0 : 73M 0 : 73M 0 : 0 0 : 437M 0
> 71M 0 : 71M 0 : 71M 0 : 0 0 : 427M 0
> 73M 0 : 73M 0 : 73M 0 : 0 0 : 437M 0
> 71M 24k: 71M 0 : 71M 0 : 0 0 : 428M 48k
> 72M 0 : 72M 0 : 72M 0 : 0 0 : 432M 0
> 73M 0 : 73M 0 : 73M 0 : 0 0 : 437M 0
> 71M 4096B: 70M 0 : 70M 0 : 0 0 : 422M 8192B
> 58M 0 : 58M 0 : 58M 0 : 0 0 : 350M 0
> 59M 24k: 60M 0 : 59M 0 : 0 0 : 357M 48k
> 74M 0 : 73M 0 : 74M 0 : 0 0 : 441M 0
> 19M 8192B: 19M 8192B: 19M 4096B: 0 0 : 114M 40k
> 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> 0 0 : 0 0 : 0 0 : 0 0 : 0 0
> 0 160k: 0 16k: 0 20k: 0 0 : 0 392k
>
> So. Clearly, lots of I/O. 440MB/s total. Almost entirely reads.
>
>
> Question: to --grow --size the array, clearly we see lots of reads.
> Why aren't we seeing any (meaningful) writes? If there are no writes,
> then what purpose do the reads serve?
>
In dstat, the total is doubled for some reason; divide by 2 (if you
compare with iostat -x -k 1) and you should see the difference.
As for the grow itself, the last time I did one was 2-3 years ago, but as
I recall it ran for 24+ hours (IDE system, PCI, etc.) at between 5 and
15 MiB/s.
Justin.
* Re: AWFUL reshape speed with raid5.
2008-07-30 16:50 ` David Greaves
@ 2008-07-30 17:24 ` Bill Davidsen
0 siblings, 0 replies; 20+ messages in thread
From: Bill Davidsen @ 2008-07-30 17:24 UTC (permalink / raw)
To: David Greaves; +Cc: Jon Nelson, Justin Piszcz, LinuxRaid
David Greaves wrote:
> Bill Davidsen wrote:
>
>> vmstat doesn't tell you enough, you need a tool to show per-device and
>> per-partition io, which will give you what you need. I can't put a
>> finger on the one I wrote, but there are others.
>>
>
> iostat?
>
dstat seems to do what he wants, the one I wrote produced a file which
could be used by gnuplot to generate neat graphical output to make
problems glaringly obvious.
--
Bill Davidsen <davidsen@tmr.com>
"Woe unto the statesman who makes war without a reason that will still
be valid when the war is over..." Otto von Bismark
* Re: AWFUL reshape speed with raid5.
2008-07-30 17:08 ` Justin Piszcz
@ 2008-07-30 17:48 ` Jon Nelson
2008-08-01 1:43 ` Neil Brown
0 siblings, 1 reply; 20+ messages in thread
From: Jon Nelson @ 2008-07-30 17:48 UTC (permalink / raw)
To: Justin Piszcz; +Cc: Bill Davidsen, LinuxRaid
On Wed, Jul 30, 2008 at 12:08 PM, Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
> In dstat, the speed is doubled for the total for some reason, divide by 2
> (if you compare with iostat -x -k 1) you should see the difference.
Ah, nice catch. It still doesn't change my question.
Upon further reflection, I believe I know what has caused the
read/write disparity:
1. I've done so much testing with these devices that the contents have
been zeroed many times.
2. I am GUESSING that when the raid recovery code reads from drives A,
B, and C and builds and verifies the appropriate checksum, it skips the
write if the checksum matches. I believe that this is what is happening.
To confirm, I wrote a few gigs of /dev/urandom to one of the devices and
re-tested. Indeed, this time around I saw plenty of writing. One mystery
solved.
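A sketch of that kind of test (sizes and the target partition are only
examples, and this destroys the member's contents, so scratch devices
only; stop the array first):

mdadm --stop /dev/md99
dd if=/dev/urandom of=/dev/sdd3 bs=1M count=2048   # scribble a couple of GB of random data
echo 3 > /proc/sys/vm/drop_caches                  # drop the page cache before re-testing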
Remaining questions:
1. Why does the version of metadata matter so much in a --grow --size operation?
2. There appear to be bugs when a bitmap is used. Can somebody else confirm?
3. I'll look into the awful speed thing later as that doesn't seem to
be an issue.
--
Jon
* Re: AWFUL reshape speed with raid5.
2008-07-28 17:39 AWFUL reshape speed with raid5 Jon Nelson
2008-07-28 18:14 ` Justin Piszcz
@ 2008-08-01 1:26 ` Neil Brown
2008-08-01 13:14 ` Jon Nelson
2008-08-21 2:58 ` Jon Nelson
2 siblings, 1 reply; 20+ messages in thread
From: Neil Brown @ 2008-08-01 1:26 UTC (permalink / raw)
To: Jon Nelson; +Cc: LinuxRaid
On Monday July 28, jnelson-linux-raid@jamponi.net wrote:
> I built a raid5 with 2 devices (and --assume-clean) using 2x 4GB
> partitions (not logical volumes).
> I then grew it to 3 devices.
> The reshape speed is really really slow.
...
>
> Kernel is 2.6.25.11 (openSUSE 11.0 x86-64 stock)
>
> /proc/mdstat for this entry:
>
> md99 : active raid5 sdd3[2] sdc3[1] sdb3[0]
> 3903744 blocks super 1.0 level 5, 64k chunk, algorithm 2 [3/3] [UUU]
> [=>...................] reshape = 8.2% (324224/3903744)
> finish=43.3min speed=1373K/sec
>
1.3MB/sec is certainly slow. On my test system (which is just a bunch
of fairly ordinary SATA drives in a cheap controller) I get about 10
times this - 13MB/sec.
>
> This is on a set of devices capable of 70+ MB/s.
The 70MB/s is streaming IO. When doing a reshape like this, md/raid5
needs to read some data, then go back and write it somewhere else. So
there is lots of seeking backwards and forwards.
You can possibly increase the speed somewhat by increasing the buffer
space that is used, thus allowing larger reads followed by larger
writes. This is done by increasing
/sys/block/mdXX/md/stripe_cache_size
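A minimal example of that knob (md99 as elsewhere in the thread; 4096 is
just an illustrative value, and it is the value Jon tries further down):

cat /sys/block/md99/md/stripe_cache_size    # default is 256
echo 4096 > /sys/block/md99/md/stripe_cache_size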
Still, 1373K/sec is very slow. I cannot explain that.
NeilBrown
* Re: AWFUL reshape speed with raid5.
2008-07-30 17:48 ` Jon Nelson
@ 2008-08-01 1:43 ` Neil Brown
2008-08-01 13:23 ` Jon Nelson
0 siblings, 1 reply; 20+ messages in thread
From: Neil Brown @ 2008-08-01 1:43 UTC (permalink / raw)
To: Jon Nelson; +Cc: Justin Piszcz, Bill Davidsen, LinuxRaid
On Wednesday July 30, jnelson-linux-raid@jamponi.net wrote:
> On Wed, Jul 30, 2008 at 12:08 PM, Justin Piszcz <jpiszcz@lucidpixels.com> wrote:
> > In dstat, the speed is doubled for the total for some reason, divide by 2
> > (if you compare with iostat -x -k 1) you should see the difference.
>
> Ah, nice catch. It still doesn't change my question.
> Upon further reflection, I believe I know what has caused the
> read/write disparity:
>
> 1. I've done so much testing with these devices that the contents have
> been zeroed many times.
> 2. I am GUESSING that if the raid recovery code reads from drives A,
> B, and C, builds the appropriate checksum and verifies it, if the
> checksum matches it skips the write. I believe that this is what is
> happening. To confirm, I wrote a few gigs of /dev/urandom to one of
> the devices and re-tested. Indeed, this time around I saw plenty of
> writing. One mystery solved.
I note that while in the original mail in this thread you were talking
about growing an array by adding drives, you are now talking about
growing an array by using more space on each drive. This change threw
me at first...
You are correct. When raid5 is repairing parity, it reads everything
and only writes if something is found to be wrong. This is generally
fast. When you grow the --size of a raid5 it repairs the parity on
the newly added space. If this already has correct parity, nothing
will be written.
>
> Remaining questions:
>
> 1. Why does the version of metadata matter so much in a --grow --size operation?
I cannot measure any significant difference. Could you give some
precise details of the tests you ran and the results you got?
> 2. There appear to be bugs when a bitmap is used. Can somebody else confirm?
Confirmed. If you --grow an array with a bitmap, you will hit
problems as there is no mechanism to grow the bitmap.
What you need to do is to remove the bitmap, do the 'grow', then
re-add the bitmap.
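A sketch of that sequence with mdadm, using the array name and member
count from earlier in the thread:

mdadm --grow /dev/md99 --bitmap=none        # remove the internal bitmap
mdadm --grow /dev/md99 --raid-devices=3     # (or --size=..., whichever grow is wanted)
mdadm --grow /dev/md99 --bitmap=internal    # re-add the bitmap afterwards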
I thought I had arranged that a grow would fail if there was a bitmap
in place, but I guess not.
I'll have a look into this.
Thanks.
NeilBrown
> 3. I'll look into the awful speed thing later as that doesn't seem to
> be an issue.
>
> --
> Jon
* Re: AWFUL reshape speed with raid5.
2008-08-01 1:26 ` Neil Brown
@ 2008-08-01 13:14 ` Jon Nelson
0 siblings, 0 replies; 20+ messages in thread
From: Jon Nelson @ 2008-08-01 13:14 UTC (permalink / raw)
To: Neil Brown; +Cc: LinuxRaid
On Thu, Jul 31, 2008 at 8:26 PM, Neil Brown <neilb@suse.de> wrote:
> On Monday July 28, jnelson-linux-raid@jamponi.net wrote:
>> I built a raid5 with 2 devices (and --assume-clean) using 2x 4GB
>> partitions (not logical volumes).
>> I then grew it to 3 devices.
>> The reshape speed is really really slow.
> ...
>>
>> Kernel is 2.6.25.11 (openSUSE 11.0 x86-64 stock)
>>
>> /proc/mdstat for this entry:
>>
>> md99 : active raid5 sdd3[2] sdc3[1] sdb3[0]
>> 3903744 blocks super 1.0 level 5, 64k chunk, algorithm 2 [3/3] [UUU]
>> [=>...................] reshape = 8.2% (324224/3903744)
>> finish=43.3min speed=1373K/sec
>>
>
>
> 1.3MB/sec is certainly slow. On my test system (which is just a bunch
> of fairly ordinary SATA drives in a cheap controller) I get about 10
> times this - 13MB/sec.
I was able to sorta replicate it.
The exact sequence of commands:
mdadm --create /dev/md99 --level=raid5 --raid-devices=2
--spare-devices=0 --assume-clean --metadata=1.0 /dev/sdb3 /dev/sdc3
echo 2000 > /sys/block/md99/md/sync_speed_min
mdadm --add /dev/md99 /dev/sdd3
mdadm --grow --raid-devices=3 /dev/md99
cat /proc/mdstat
md99 : active raid5 sdd3[2] sdc3[1] sdb3[0]
3903744 blocks super 1.0 level 5, 64k chunk, algorithm 2 [3/3] [UUU]
[=>...................] reshape = 5.1% (202240/3903744)
finish=28.5min speed=2161K/sec
dstat shows this (note that the total is doubled for some reason...):
--dsk/sdb-----dsk/sdc-----dsk/sdd-----dsk/sde----dsk/total-
read writ: read writ: read writ: read writ: read writ
549k 78k: 542k 68k: 534k 72k: 174B 11k:3333k 1097k
7936k 3970k:7936k 3970k: 0 3970k: 0 0 : 31M 23M
4096k 2050k:4096k 2050k: 0 2050k: 0 0 : 16M 12M
2816k 1282k:2816k 1282k: 0 1410k: 0 0 : 11M 7948k
1280k 836k:1280k 768k: 0 640k: 0 0 :5120k 4488k
7936k 3970k:7936k 3970k: 0 3970k: 0 0 : 31M 23M
5120k 2562k:5120k 2562k: 0 2562k: 0 0 : 20M 15M
3456k 1728k:3456k 1728k: 0 1728k: 0 0 : 14M 10M
1664k 834k:1664k 834k: 0 834k: 0 0 :6656k 5004k
3072k 1560k:3072k 1536k: 0 1536k: 0 0 : 12M 9264k
9216k 4612k:9216k 4612k: 0 4612k: 0 0 : 36M 27M
Which clearly shows not a great deal of I/O: 5-18MB/s *total*.
..
> You can possibly increase the speed somewhat by increasing the buffer
> space that is used, thus allowing larger reads followed by larger
> writes. This is done by increasing
> /sys/block/mdXX/md/stripe_cache_size
turnip:~ # cat /sys/block/md99/md/stripe_cache_size
256
turnip:~ # cat /sys/block/md99/md/stripe_cache_active
0
Increasing that to 4096 moves the rebuild speed to between 3 and 4MB/s.
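(A small aside: md's own view of the reshape can be read directly,
independent of vmstat/dstat, e.g. for md99:

cat /sys/block/md99/md/sync_speed       # current resync/reshape speed, in K/sec
cat /sys/block/md99/md/sync_completed   # sectors done out of the total
)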
Any ideas?
This appears to happen with all metadata versions.
100% reproducible.
--
Jon
* Re: AWFUL reshape speed with raid5.
2008-08-01 1:43 ` Neil Brown
@ 2008-08-01 13:23 ` Jon Nelson
2008-08-01 15:57 ` Jon Nelson
0 siblings, 1 reply; 20+ messages in thread
From: Jon Nelson @ 2008-08-01 13:23 UTC (permalink / raw)
To: Neil Brown; +Cc: Justin Piszcz, Bill Davidsen, LinuxRaid
On Thu, Jul 31, 2008 at 8:43 PM, Neil Brown <neilb@suse.de> wrote:
> I note that while in the original mail in this thread you were talking
> about growing an array by adding drives, you are now talking about
> growing an array by using more space on each drive. This change threw
> me at first...
True. Mea Culpa.
>> Remaining questions:
>>
>> 1. Why does the version of metadata matter so much in a --grow --size operation?
>
> I cannot measure any significant difference. Could you give some
> precise details of the tests you ran and the results you got?
I'll try to throw some stuff together soon.
>> 2. There appear to be bugs when a bitmap is used. Can somebody else confirm?
>
> Confirmed. If you --grow an array with a bitmap, you will hit
> problems as there is no mechanism to grow the bitmap.
> What you need to do is to remove the bitmap, do the 'grow', then
> re-add the bitmap.
> I thought I had arranged that a grow would fail if there was a bitmap
> in place, but I guess not.
> I'll have a look into this.
A small suggestion: perhaps a small note like "you may re-add the bitmap
while the array is still rebuilding/growing/whatever" would help avoid
some worry.
There are two other possible solutions: have the underlying code grow the
bitmap (probably hard), or have it automatically remove and re-add the
bitmap.
--
Jon
* Re: AWFUL reshape speed with raid5.
2008-08-01 13:23 ` Jon Nelson
@ 2008-08-01 15:57 ` Jon Nelson
0 siblings, 0 replies; 20+ messages in thread
From: Jon Nelson @ 2008-08-01 15:57 UTC (permalink / raw)
Cc: LinuxRaid
>>> 1. Why does the version of metadata matter so much in a --grow --size operation?
>>
>> I cannot measure any significant difference. Could you give some
>> precise details of the tests you ran and the results you got?
>
> I'll try to throw some stuff together soon.
I was unable to replicate the difference in bitmap speeds.
--
Jon
* Re: AWFUL reshape speed with raid5.
2008-07-28 17:39 AWFUL reshape speed with raid5 Jon Nelson
2008-07-28 18:14 ` Justin Piszcz
2008-08-01 1:26 ` Neil Brown
@ 2008-08-21 2:58 ` Jon Nelson
2 siblings, 0 replies; 20+ messages in thread
From: Jon Nelson @ 2008-08-21 2:58 UTC (permalink / raw)
To: LinuxRaid
On Mon, Jul 28, 2008 at 12:39 PM, Jon Nelson
<jnelson-linux-raid@jamponi.net> wrote:
> I built a raid5 with 2 devices (and --assume-clean) using 2x 4GB
> partitions (not logical volumes).
> I then grew it to 3 devices.
> The reshape speed is really really slow.
>
> vmstat shows I/O like this:
>
> 0 0 212 25844 141160 497484 0 0 0 612 673 1284 0 6 93 0
> 0 0 212 25164 141160 497748 0 0 0 19 594 1253 1 4 95 0
> 0 0 212 25044 141160 498004 0 0 0 0 374 445 0 1 99 0
> 1 0 212 25220 141164 498000 0 0 0 23 506 1149 0 3 96 1
> 0 0 212 25500 141164 498004 0 0 0 3 546 1416 0 5 95 0
>
> The min/max is 1000/200000.
> What might be going on here?
>
> Kernel is 2.6.25.11 (openSUSE 11.0 x86-64 stock)
>
> /proc/mdstat for this entry:
>
> md99 : active raid5 sdd3[2] sdc3[1] sdb3[0]
> 3903744 blocks super 1.0 level 5, 64k chunk, algorithm 2 [3/3] [UUU]
> [=>...................] reshape = 8.2% (324224/3903744)
> finish=43.3min speed=1373K/sec
>
>
> This is on a set of devices capable of 70+ MB/s.
I found some time to give this another shot.
It's still true!
Here is how I built the array:
mdadm --create /dev/md99 --level=raid5 --raid-devices=2
--spare-devices=0 --assume-clean --metadata=1.0 --chunk=64 /dev/sdb3
/dev/sdc3
and then I added a drive:
mdadm --add /dev/md99 /dev/sdd3
and then I grew the array to 3 devices:
mdadm --grow /dev/md99 --raid-devices=3
This is what the relevant portion of /proc/mdstat looks like:
md99 : active raid5 sdd3[2] sdc3[1] sdb3[0]
3903744 blocks super 1.0 level 5, 64k chunk, algorithm 2 [3/3] [UUU]
[=>...................] reshape = 6.1% (241920/3903744)
finish=43.0min speed=1415K/sec
The 1000/200000 min/max defaults are being used.
If I bump up the min to, say, 30000, the rebuild speed does grow to
hover around 30000.
As Justin Piszcz said:
There was once a bug in an earlier kernel in which the rebuild ran at
min_speed if you had a specific chunk size. Have you tried echoing 30000
into min_speed? Does it increase the rebuild to 30MB/s?
Yes, apparently, it does. However, 'git log drivers/md' in the
linux-2.6 tree doesn't show me anything obvious. Can somebody point me
to a specific commit or patch? As of 2.6.25.11 it's apparently still a
problem (on an otherwise idle system, too).
>
> No meaningful change if I start with 3 disks and grow to 4, with or
> without bitmap.
>
> --
> Jon
>
--
Jon