* Very long raid5 init/rebuild times
@ 2014-01-21 7:35 Marc MERLIN
2014-01-21 16:37 ` Marc MERLIN
` (2 more replies)
0 siblings, 3 replies; 41+ messages in thread
From: Marc MERLIN @ 2014-01-21 7:35 UTC (permalink / raw)
To: linux-raid
Howdy,
I'm setting up a new array with 5 4TB drives for which I'll use dmcrypt.
Question #1:
Is it better to dmcrypt the 5 drives and then make a raid5 on top, or the opposite
(raid5 first, and then dmcrypt on top)?
I used:
cryptsetup luksFormat --align-payload=8192 -s 256 -c aes-xts-plain64 /dev/sd[mnopq]1
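Concretely, the two layouts being compared look roughly like this (a sketch only;
the luksOpen mapper names and the mdadm options are illustrative, not the exact
commands used here):

Option A, dmcrypt under the raid (what the luksFormat above sets up):
~$ cryptsetup luksOpen /dev/sdm1 crypt_sdm1        # repeat for sdn1..sdq1
~$ mdadm --create /dev/md5 --level=5 --raid-devices=5 /dev/mapper/crypt_sd[mnopq]1

Option B, raid under dmcrypt:
~$ mdadm --create /dev/md5 --level=5 --raid-devices=5 /dev/sd[mnopq]1
~$ cryptsetup luksFormat --align-payload=8192 -s 256 -c aes-xts-plain64 /dev/md5
~$ cryptsetup luksOpen /dev/md5 crypt_md5

In option A every member gets its own dm-crypt device (and its own encryption
thread); in option B all I/O funnels through a single dm-crypt device sitting on
top of the array.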
Question #2:
In order to copy data from a working system, I connected the drives via an external
enclosure which uses a SATA PMP. As a result, things are slow:
md5 : active raid5 dm-7[5] dm-6[3] dm-5[2] dm-4[1] dm-2[0]
15627526144 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/4] [UUUU_]
[>....................] recovery = 0.9% (35709052/3906881536) finish=3406.6min speed=18939K/sec
bitmap: 0/30 pages [0KB], 65536KB chunk
2.5 days for an init or rebuild is going to be painful.
I already checked that I'm not CPU/dmcrypt bound.
I read Neil's message on why the init is still required:
http://marc.info/?l=linux-raid&m=112044009718483&w=2
Even so, on brand new blank drives full of 0s, I'm thinking this could be faster
by just assuming the array is clean (all 0s give a parity of 0).
Is it really unsafe to do so? (Actually, since I built the array on top of dmcrypt
here, the members won't read back as 0s, so that way around the init is
unfortunately still necessary.)
I suppose that 1 day-ish rebuild times are kind of a given with 4TB drives anyway?
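For reference, the "skip the init" idea above corresponds to mdadm's --assume-clean
flag; a sketch using the device and mapper names from this thread (the caveats in
Neil's message still apply):

~$ mdadm --create /dev/md5 --level=5 --raid-devices=5 --chunk=512 --assume-clean \
     /dev/mapper/crypt_sd[mnopq]1

With RAID5's XOR parity, an all-zero stripe does produce all-zero parity, so this is
only consistent if the members genuinely read back as zeros, which, as noted above,
dmcrypt members won't.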
Question #3:
Since I'm going to put btrfs on top, I'm almost tempted to skip the md raid5
layer and just use the native support, but the raid code in btrfs still
seems a bit younger than I'm comfortable with.
Is anyone using it, and has anyone been through disk failures, replaces, and all that?
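For comparison, the native-btrfs variant being weighed here would look something like
the following (a sketch; the profile mix is illustrative, and btrfs raid5/6 was still
flagged experimental at this point in time):

~$ mkfs.btrfs -d raid5 -m raid1 /dev/mapper/crypt_sd[mnopq]1
~$ mount /dev/mapper/crypt_sdm1 /mnt/btrfs_pool1
~$ btrfs replace start /dev/mapper/crypt_sdq1 /dev/mapper/crypt_new1 /mnt/btrfs_pool1

That last command is the disk-replace path the question is really about.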
Thanks,
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
^ permalink raw reply [flat|nested] 41+ messages in thread

* Re: Very long raid5 init/rebuild times
2014-01-21 7:35 Very long raid5 init/rebuild times Marc MERLIN
@ 2014-01-21 16:37 ` Marc MERLIN
2014-01-21 17:08 ` Mark Knecht
` (2 more replies)
2014-01-21 18:31 ` Chris Murphy
2014-01-22 13:46 ` Ethan Wilson
2 siblings, 3 replies; 41+ messages in thread
From: Marc MERLIN @ 2014-01-21 16:37 UTC (permalink / raw)
To: linux-raid

On Mon, Jan 20, 2014 at 11:35:40PM -0800, Marc MERLIN wrote:
> Howdy,
>
> I'm setting up a new array with 5 4TB drives for which I'll use dmcrypt.
>
> Question #1:
> Is it better to dmcrypt the 5 drives and then make a raid5 on top, or the opposite
> (raid5 first, and then dmcrypt)
> I used:
> cryptsetup luksFormat --align-payload=8192 -s 256 -c aes-xts-plain64 /dev/sd[mnopq]1

I should have said that this is seemingly a stupid question since obviously
if you encrypt each drive separately, you're going through the encryption
layer 5 times during rebuilds instead of just once.
However in my case, I'm not CPU-bound, so that didn't seem to be an issue
and I was more curious to know if the dmcrypt and dmraid5 layers stacked the
same regardless of which one was on top and which one at the bottom.

Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
.... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/

^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Very long raid5 init/rebuild times
2014-01-21 16:37 ` Marc MERLIN
@ 2014-01-21 17:08 ` Mark Knecht
2014-01-21 18:42 ` Chris Murphy
2014-01-22 7:55 ` Stan Hoeppner
2 siblings, 0 replies; 41+ messages in thread
From: Mark Knecht @ 2014-01-21 17:08 UTC (permalink / raw)
To: Marc MERLIN; +Cc: Linux-RAID

On Tue, Jan 21, 2014 at 8:37 AM, Marc MERLIN <marc@merlins.org> wrote:
> On Mon, Jan 20, 2014 at 11:35:40PM -0800, Marc MERLIN wrote:
>> Howdy,
>>
>> I'm setting up a new array with 5 4TB drives for which I'll use dmcrypt.
>>
>> Question #1:
>> Is it better to dmcrypt the 5 drives and then make a raid5 on top, or the opposite
>> (raid5 first, and then dmcrypt)
>> I used:
>> cryptsetup luksFormat --align-payload=8192 -s 256 -c aes-xts-plain64 /dev/sd[mnopq]1
>
> I should have said that this is seemingly a stupid question since obviously
> if you encrypt each drive separately, you're going through the encryption
> layer 5 times during rebuilds instead of just once.
> However in my case, I'm not CPU-bound, so that didn't seem to be an issue
> and I was more curious to know if the dmcrypt and dmraid5 layers stacked the
> same regardless of which one was on top and which one at the bottom.
>
> Marc

I know nothing about dmcrypt, but as someone pointed out to me in another
thread recently about alternative parity methods for RAID6, you might be
able to do some tests using loopback devices instead of real hard drives
to speed up your investigation times.

Cheers,
Mark

^ permalink raw reply [flat|nested] 41+ messages in thread
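A minimal sketch of the loopback approach Mark describes (file names, sizes and the
md number are arbitrary):

~$ for i in 0 1 2 3 4; do truncate -s 1G /tmp/disk$i.img; losetup /dev/loop$i /tmp/disk$i.img; done
~$ mdadm --create /dev/md100 --level=5 --raid-devices=5 /dev/loop[0-4]
~$ cat /proc/mdstat        # the resync on 1GB members finishes in seconds

This makes it cheap to compare the two dmcrypt/md stacking orders, chunk sizes and
so on without waiting days for 4TB members.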
* Re: Very long raid5 init/rebuild times 2014-01-21 16:37 ` Marc MERLIN 2014-01-21 17:08 ` Mark Knecht @ 2014-01-21 18:42 ` Chris Murphy 2014-01-22 7:55 ` Stan Hoeppner 2 siblings, 0 replies; 41+ messages in thread From: Chris Murphy @ 2014-01-21 18:42 UTC (permalink / raw) To: linux-raid@vger.kernel.org Mailing List On Jan 21, 2014, at 9:37 AM, Marc MERLIN <marc@merlins.org> wrote: > I should have said that this is seemingly a stupid question since obviously > if you encrypt each drive separately, you're going through the encryption > layer 5 times during rebuilds instead of just once. It wasn't a stupid question, but I think you've succeeded in confusing yourself into thinking more work is happening by encrypting the drives rather than the logical md device. > However in my case, I'm not CPU-bound, so that didn't seem to be an issue > and I was more curious to know if the dmcrypt and dmraid5 layers stacked the > same regardless of which one was on top and which one at the bottom. md raid isn't dmraid. I'm actually not sure where the dmraid work is at. I'm under the impression most of that work is happening within LVM2 - they now have their own raid 0,1,10,5,6 implementation. My understanding is it uses md kernel code, but uses lvm tools to create and monitor, rather than mdadm. Chris Murphy ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Very long raid5 init/rebuild times 2014-01-21 16:37 ` Marc MERLIN 2014-01-21 17:08 ` Mark Knecht 2014-01-21 18:42 ` Chris Murphy @ 2014-01-22 7:55 ` Stan Hoeppner 2014-01-22 17:48 ` Marc MERLIN 2014-01-22 19:38 ` Opal 2.0 SEDs on linux, was: " Chris Murphy 2 siblings, 2 replies; 41+ messages in thread From: Stan Hoeppner @ 2014-01-22 7:55 UTC (permalink / raw) To: Marc MERLIN, linux-raid On 1/21/2014 10:37 AM, Marc MERLIN wrote: > On Mon, Jan 20, 2014 at 11:35:40PM -0800, Marc MERLIN wrote: >> Howdy, >> >> I'm setting up a new array with 5 4TB drives for which I'll use dmcrypt. >> >> Question #1: >> Is it better to dmcrypt the 5 drives and then make a raid5 on top, or the opposite >> (raid5 first, and then dmcrypt) For maximum throughput and to avoid hitting a ceiling with one thread on one core, using one dmcrypt thread per physical device is a way to achieve this. >> I used: >> cryptsetup luksFormat --align-payload=8192 -s 256 -c aes-xts-plain64 /dev/sd[mnopq]1 Changing the key size or the encryption method may decrease latency a bit, but likely not enough. > I should have said that this is seemingly a stupid question since obviously > if you encrypt each drive separately, you're going through the encryption > layer 5 times during rebuilds instead of just once. Each dmcrypt thread is handling 1/5th of the IOs. The low init throughput isn't caused by using 5 threads. One thread would likely do no better. > However in my case, I'm not CPU-bound, so that didn't seem to be an issue > and I was more curious to know if the dmcrypt and dmraid5 layers stacked the > same regardless of which one was on top and which one at the bottom. You are not CPU bound, nor hardware bandwidth bound. You are latency bound, just like every dmcrypt user. dmcrypt adds a non trivial amount of latency to every IO. Latency with serial IO equals low throughput. Experiment with these things to increase throughput. If you're using the CFQ elevator switch to deadline. Try smaller md chunk sizes, key lengths, different ciphers, etc. Turn off automatic CPU frequency scaling. I've read reports of encryption causing the frequency to drop instead of increase. In general, to increase serial IO throughput on a high latency path one must: 1. Issue lots of IOs asynchronously 2. And/or issue lots of IOs in parallel Or both. AFAIK both of these require code rewrites for md maintenance operations. Once in production, if your application workloads do 1 or 2 above then you may see higher throughput than the 18MB/s you see with the init. If your workloads are serial maybe not much more. Common sense says that encrypting 16TB of storage at the block level, using software libraries and optimized CPU instructions, is not a smart thing to do. Not if one desires decent performance, and especially if one doesn't need all 16TB encrypted. If you in fact don't need all 16TB encrypted, and I'd argue very few do, especially John and Jane Doe, then tear this down, build a regular array, and maintain an encrypted directory or few. If you actually *need* to encrypt all 16TB at the block level, and require decent performance, you need to acquire a dedicated crypto board. One board will cost more than your complete server. The cost of such devices should be a strong clue as to who does and does not need to encrypt their entire storage. -- Stan ^ permalink raw reply [flat|nested] 41+ messages in thread
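The elevator and frequency-scaling knobs mentioned above live in sysfs; roughly (sdm
stands in for each member drive, and these are the usual generic paths rather than
anything specific to this box):

~$ cat /sys/block/sdm/queue/scheduler              # shows e.g. noop deadline [cfq]
~$ echo deadline > /sys/block/sdm/queue/scheduler
~$ for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo performance > $g; done

The scheduler change has to be applied to each member drive, not to the md device.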
* Re: Very long raid5 init/rebuild times 2014-01-22 7:55 ` Stan Hoeppner @ 2014-01-22 17:48 ` Marc MERLIN 2014-01-22 23:17 ` Stan Hoeppner 2014-01-23 2:37 ` Stan Hoeppner 2014-01-22 19:38 ` Opal 2.0 SEDs on linux, was: " Chris Murphy 1 sibling, 2 replies; 41+ messages in thread From: Marc MERLIN @ 2014-01-22 17:48 UTC (permalink / raw) To: Stan Hoeppner; +Cc: linux-raid On Wed, Jan 22, 2014 at 01:55:34AM -0600, Stan Hoeppner wrote: > >> Question #1: > >> Is it better to dmcrypt the 5 drives and then make a raid5 on top, or the opposite > >> (raid5 first, and then dmcrypt) > > For maximum throughput and to avoid hitting a ceiling with one thread on > one core, using one dmcrypt thread per physical device is a way to > achieve this. There is that, but at rebuild time, if dmcrypt is after raid5, the raid5 rebuild would happen without going through encryption, and hence would save 5 core's worth of encryption bandwidth, would it not (for 5 drives) I agree that during non rebuild operation, I do get 5 cores of encryption bandwidth insttead of 1, so if I'm willing to suck up the CPU from rebuild time, it may be a good thing anyway. > >> I used: > >> cryptsetup luksFormat --align-payload=8192 -s 256 -c aes-xts-plain64 /dev/sd[mnopq]1 > > Changing the key size or the encryption method may decrease latency a > bit, but likely not enough. Ok, thanks. > > I should have said that this is seemingly a stupid question since obviously > > if you encrypt each drive separately, you're going through the encryption > > layer 5 times during rebuilds instead of just once. > > Each dmcrypt thread is handling 1/5th of the IOs. The low init > throughput isn't caused by using 5 threads. One thread would likely do > no better. If crypt is on top of raid5, it seems (and that makes sense) that no encryption is neded for the rebuild. However in my test I can confirm that the rebuild time is exactly the same. I only get 19MB/s of rebuild bandwidth and I think tha'ts because of the port multiplier. > > However in my case, I'm not CPU-bound, so that didn't seem to be an issue > > and I was more curious to know if the dmcrypt and dmraid5 layers stacked the > > same regardless of which one was on top and which one at the bottom. > > You are not CPU bound, nor hardware bandwidth bound. You are latency > bound, just like every dmcrypt user. dmcrypt adds a non trivial amount > of latency to every IO. Latency with serial IO equals low throughput. Are you sure that applies here in the rebuild time? I see no crypt thread running. > Experiment with these things to increase throughput. If you're using > the CFQ elevator switch to deadline. Try smaller md chunk sizes, key > lengths, different ciphers, etc. Turn off automatic CPU frequency > scaling. I've read reports of encryption causing the frequency to drop > instead of increase. I'll check those too, they can't hurt. > Once in production, if your application workloads do 1 or 2 above then > you may see higher throughput than the 18MB/s you see with the init. If > your workloads are serial maybe not much more. I expect to see more because the drives will move inside the array that is directly connected to the SATA card without going through a PMP (with PMP all the SATA IO is shared on a single SATA chip). > Common sense says that encrypting 16TB of storage at the block level, > using software libraries and optimized CPU instructions, is not a smart > thing to do. Not if one desires decent performance, and especially if > one doesn't need all 16TB encrypted. 
I encrypt everything now because I think it's good general hygiene, and I don't want to think about where my drives and data end up 5 years later, or worry if they get stolen. Software encryption on linux has been close enough to wire speed for a little while now, I encrypt my 500MB/s capable SSD on my laptop and barely see slowdowns (except a bit of extra latency as you point out). > If you in fact don't need all 16TB encrypted, and I'd argue very few do, > especially John and Jane Doe, then tear this down, build a regular > array, and maintain an encrypted directory or few. Not bad advise in general. > If you actually *need* to encrypt all 16TB at the block level, and > require decent performance, you need to acquire a dedicated crypto > board. One board will cost more than your complete server. The cost of > such devices should be a strong clue as to who does and does not need to > encrypt their entire storage. I'm not actually convinced that the CPU is the bottleneck, and as pointed out if I put dmcrypt on top of raid5, the rebuild happens without any encryption. Or did I miss something? Thanks, Marc -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems .... .... what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Very long raid5 init/rebuild times 2014-01-22 17:48 ` Marc MERLIN @ 2014-01-22 23:17 ` Stan Hoeppner 2014-01-23 14:28 ` John Stoffel 2014-01-23 2:37 ` Stan Hoeppner 1 sibling, 1 reply; 41+ messages in thread From: Stan Hoeppner @ 2014-01-22 23:17 UTC (permalink / raw) To: Marc MERLIN; +Cc: linux-raid On 1/22/2014 11:48 AM, Marc MERLIN wrote: ... > If crypt is on top of raid5, it seems (and that makes sense) that no > encryption is neded for the rebuild. However in my test I can confirm that > the rebuild time is exactly the same. I only get 19MB/s of rebuild bandwidth > and I think tha'ts because of the port multiplier. Ok, now I think we're finally getting to the heart of this. Given the fact that you're doing full array encryption, and after reading your bio on your website the other day, I think I've been giving you too much credit. So let's get back to md basics. Have you performed any md optimizations? The default value of /sys/block/mdX/md/stripe_cache_size is 256. This default is woefully inadequate for modern systems, and will yield dreadfully low throughput. To fix this execute ~$ echo 2048 > /sys/block/mdX/md/stripe_cache_size To specifically address slow resync speed try ~$ echo 50000 > /proc/sys/dev/raid/speed_limit_min And you also likely need to increase readahead from the default 128KB to something like 1MB (in 512KiB units) ~$ blockdev --setra 2048 /dev/mdX Since kernel 2.6.23 Linux does on demand readahead, so small random IO won't trigger it. Thus a large value here will not negatively impact random IO. See: http://lwn.net/Articles/235181/ These changes should give you just a bit of a boost to resync throughput, and streaming workloads in general. Please test and post your results. I don't think your problems have anything to do with crypto. However, after you get md running at peak performance you then may start to see limitations in your crypto setup, if you have chosen to switch to dmcrypt above md. -- Stan ^ permalink raw reply [flat|nested] 41+ messages in thread
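One caveat (an assumption here, not something covered above): none of these three
settings persist across a reboot, so they need to be reapplied at boot time, for
example appended to /etc/rc.local or set via a udev rule:

echo 2048 > /sys/block/md5/md/stripe_cache_size
echo 50000 > /proc/sys/dev/raid/speed_limit_min
blockdev --setra 2048 /dev/md5

The speed_limit knob can alternatively go into sysctl.conf as dev.raid.speed_limit_min.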
* Re: Very long raid5 init/rebuild times 2014-01-22 23:17 ` Stan Hoeppner @ 2014-01-23 14:28 ` John Stoffel 2014-01-24 1:02 ` Stan Hoeppner 0 siblings, 1 reply; 41+ messages in thread From: John Stoffel @ 2014-01-23 14:28 UTC (permalink / raw) To: stan; +Cc: Marc MERLIN, linux-raid >>>>> "Stan" == Stan Hoeppner <stan@hardwarefreak.com> writes: Stan> On 1/22/2014 11:48 AM, Marc MERLIN wrote: Stan> ... >> If crypt is on top of raid5, it seems (and that makes sense) that no >> encryption is neded for the rebuild. However in my test I can confirm that >> the rebuild time is exactly the same. I only get 19MB/s of rebuild bandwidth >> and I think tha'ts because of the port multiplier. Stan> Ok, now I think we're finally getting to the heart of this. Given the Stan> fact that you're doing full array encryption, and after reading your bio Stan> on your website the other day, I think I've been giving you too much Stan> credit. So let's get back to md basics. Have you performed any md Stan> optimizations? The default value of Stan> /sys/block/mdX/md/stripe_cache_size Stan> is 256. This default is woefully inadequate for modern systems, and Stan> will yield dreadfully low throughput. To fix this execute Stan> ~$ echo 2048 > /sys/block/mdX/md/stripe_cache_size Which Linux kernel version is this in? I'm running 3.9.3 on my main home server and I'm not finding this at all. Nor do I find it on my Linux Mint 16 desktop running 3.11.0-12-generic either. # cat /proc/version Linux version 3.9.3 (root@quad) (gcc version 4.4.5 (Debian 4.4.5-8) ) #1 SMP Wed May 22 12:15:10 EDT 2013 # find /sys -name "*stripe*" # find /dev -name "*stripe*" # find /proc -name "stripe*" # Oh wait... I'm a total moron here. This feature is only for RAID[456] arrays, and all I have are RAID1 mirrors for all my disks. Hmm... it would almost make more sense for it to be named something different, but legacy systems would be impacted. But more importantly, maybe it would make sense to have this number automatically scale with memory size? If you only have 1gig stay at 256, but then jump more aggresively to 1024, 2048, 4196 and 8192 and then (for now) capping at 8192. John ^ permalink raw reply [flat|nested] 41+ messages in thread
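A quick way to see whether a given array has the knob at all is to check the
personality in sysfs; stripe_cache_size is only created for the parity levels
(md0 here is illustrative):

~$ cat /sys/block/md0/md/level                     # prints raid1, raid5, raid6, ...
~$ ls /sys/block/md0/md/ | grep stripe             # stripe_cache_size only shows up for raid4/5/6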
* Re: Very long raid5 init/rebuild times 2014-01-23 14:28 ` John Stoffel @ 2014-01-24 1:02 ` Stan Hoeppner 2014-01-24 3:07 ` NeilBrown 0 siblings, 1 reply; 41+ messages in thread From: Stan Hoeppner @ 2014-01-24 1:02 UTC (permalink / raw) To: John Stoffel; +Cc: Marc MERLIN, linux-raid On 1/23/2014 8:28 AM, John Stoffel wrote: > But more importantly, maybe it would make sense to have this number > automatically scale with memory size? If you only have 1gig stay at > 256, but then jump more aggresively to 1024, 2048, 4196 and 8192 and > then (for now) capping at 8192. Setting the default based strictly on memory capacity won't work. See this discussion for background. http://www.spinics.net/lists/raid/msg45364.html -- Stan ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Very long raid5 init/rebuild times 2014-01-24 1:02 ` Stan Hoeppner @ 2014-01-24 3:07 ` NeilBrown 2014-01-24 8:24 ` Stan Hoeppner 0 siblings, 1 reply; 41+ messages in thread From: NeilBrown @ 2014-01-24 3:07 UTC (permalink / raw) To: stan; +Cc: John Stoffel, Marc MERLIN, linux-raid [-- Attachment #1: Type: text/plain, Size: 781 bytes --] On Thu, 23 Jan 2014 19:02:21 -0600 Stan Hoeppner <stan@hardwarefreak.com> wrote: > On 1/23/2014 8:28 AM, John Stoffel wrote: > > > But more importantly, maybe it would make sense to have this number > > automatically scale with memory size? If you only have 1gig stay at > > 256, but then jump more aggresively to 1024, 2048, 4196 and 8192 and > > then (for now) capping at 8192. > > Setting the default based strictly on memory capacity won't work. See > this discussion for background. > > http://www.spinics.net/lists/raid/msg45364.html > I would like to see the stripe cache grow on demand, shrink when idle, and use the "shrinker" interface to shrink even when not idle if there is memory pressure. So if someone wants a project.... NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Very long raid5 init/rebuild times 2014-01-24 3:07 ` NeilBrown @ 2014-01-24 8:24 ` Stan Hoeppner 0 siblings, 0 replies; 41+ messages in thread From: Stan Hoeppner @ 2014-01-24 8:24 UTC (permalink / raw) To: NeilBrown; +Cc: John Stoffel, Marc MERLIN, linux-raid On 1/23/2014 9:07 PM, NeilBrown wrote: > On Thu, 23 Jan 2014 19:02:21 -0600 Stan Hoeppner <stan@hardwarefreak.com> > wrote: > >> On 1/23/2014 8:28 AM, John Stoffel wrote: >> >>> But more importantly, maybe it would make sense to have this number >>> automatically scale with memory size? If you only have 1gig stay at >>> 256, but then jump more aggresively to 1024, 2048, 4196 and 8192 and >>> then (for now) capping at 8192. >> >> Setting the default based strictly on memory capacity won't work. See >> this discussion for background. >> >> http://www.spinics.net/lists/raid/msg45364.html >> > > I would like to see the stripe cache grow on demand, shrink when idle, and > use the "shrinker" interface to shrink even when not idle if there is memory > pressure. > So if someone wants a project.... > > NeilBrown I'm a user, not a kernel hacker, and I don't know C. Three strikes right there. :( Otherwise I'd love to tackle it. I do have some comments/ideas on the subject. Progressively growing and shrinking the cache should be relatively straightforward. We can do it dynamically today by modifying a system variable. What's needed is code to track data input volume or rate to md and to interface with the shrinker. I think the difficult aspect of this will be determining the upper bound on the cache size for a given system, as the optimum cache size directly correlates to the throughput of the hardware. With the current power of 2 restrictions, less than thorough testing indicates that disk based arrays seem to prefer a value of 1024-2048 for max throughput whereas SSD arrays seem to prefer 4096. In either case, going to the next legal value decreases throughput and eats double the RAM while doing so. So here we need some way to determine device throughput or at least device class, and set an upper bound accordingly. I also think we should consider unhitching our wagon from powers of 2 if we're going to be dynamically growing/shrinking the cache. I think grow/shrink should be progressive with smaller jumps. With 5 drives growing from 2048 to 4096 is going to grab 40MB of pages, likewise dumping 40MB for the impending shrink iteration, then 20MB, 10MB, and finally dumping 5MB arriving back at the 1MB/drive default. This may cause a lot of memory thrashing on some systems and workloads, evicting application data from L2/L3 caches. So we may want to be careful about how much memory we're shuffling and how often. -- Stan ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Very long raid5 init/rebuild times 2014-01-22 17:48 ` Marc MERLIN 2014-01-22 23:17 ` Stan Hoeppner @ 2014-01-23 2:37 ` Stan Hoeppner 2014-01-23 9:13 ` Marc MERLIN 1 sibling, 1 reply; 41+ messages in thread From: Stan Hoeppner @ 2014-01-23 2:37 UTC (permalink / raw) To: Marc MERLIN; +Cc: linux-raid On 1/22/2014 11:48 AM, Marc MERLIN wrote: ... > If crypt is on top of raid5, it seems (and that makes sense) that no > encryption is neded for the rebuild. However in my test I can confirm that > the rebuild time is exactly the same. I only get 19MB/s of rebuild bandwidth > and I think tha'ts because of the port multiplier. I didn't address this earlier as I assumed you, and anyone else reading this thread, would do a little background reading and realize no SATA PMP would behave in this manner. No SATA PMP, not Silicon Image, not Marvell, none of them, will limit host port throughput to 20MB/s. All of them achieve pretty close to wire speed throughput. -- Stan ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Very long raid5 init/rebuild times 2014-01-23 2:37 ` Stan Hoeppner @ 2014-01-23 9:13 ` Marc MERLIN 2014-01-23 12:24 ` Stan Hoeppner 0 siblings, 1 reply; 41+ messages in thread From: Marc MERLIN @ 2014-01-23 9:13 UTC (permalink / raw) To: Stan Hoeppner; +Cc: linux-raid On Wed, Jan 22, 2014 at 08:37:49PM -0600, Stan Hoeppner wrote: > On 1/22/2014 11:48 AM, Marc MERLIN wrote: > ... > > If crypt is on top of raid5, it seems (and that makes sense) that no > > encryption is neded for the rebuild. However in my test I can confirm that > > the rebuild time is exactly the same. I only get 19MB/s of rebuild bandwidth > > and I think tha'ts because of the port multiplier. > > I didn't address this earlier as I assumed you, and anyone else reading > this thread, would do a little background reading and realize no SATA > PMP would behave in this manner. No SATA PMP, not Silicon Image, not > Marvell, none of them, will limit host port throughput to 20MB/s. All > of them achieve pretty close to wire speed throughput. I haven't answered your other message, as I'm getting more data to do so, but I can assure you that this is incorrect :) I've worked with 3 different PMP boards and three different SATA cards over the last 6 years (sil3124, 3132, and marvel), and got similarly slow results on all of them. The marvel was faster than sil3124 but it stopped being stable in kernels in the last year and fell unsupported (no one to fix the bugs), so I went back to sil3124. I'm not saying that they can't go faster somehow, but in my experience that has not been the case. In case you don't believe me, I just switched my drives from the PMP to directly connected to the motherboard and a marvel card, and my rebuild speed changed from 19MB/s to 99MB/s. (I made no other setting changes, but I did try your changes without saving them before and after the PMP change and will report below) You also said: > Ok, now I think we're finally getting to the heart of this. Given the > fact that you're doing full array encryption, and after reading your bio > on your website the other day, I think I've been giving you too much > credit. So let's get back to md basics. Have you performed any md > optimizations? The default value of Can't hurt to ask, you never know if I may have forgotten or not know about one. > /sys/block/mdX/md/stripe_cache_size > is 256. This default is woefully inadequate for modern systems, and > will yield dreadfully low throughput. To fix this execute > ~$ echo 2048 > /sys/block/mdX/md/stripe_cache_size Thanks for that one. It made no speed difference on the PMP or without, but can't hurt to do anyway. > To specifically address slow resync speed try > ~$ echo 50000 > /proc/sys/dev/raid/speed_limit_min I had this, but good reminder. > And you also likely need to increase readahead from the default 128KB to > something like 1MB (in 512KiB units) > > ~$ blockdev --setra 2048 /dev/mdX I had this already set to 8192, but again, thanks for asking too. > Since kernel 2.6.23 Linux does on demand readahead, so small random IO > won't trigger it. Thus a large value here will not negatively impact > random IO. See: http://lwn.net/Articles/235181/ > > Please test and post your results. I don't think your problems have > anything to do with crypto. However, after you get md running at peak > performance you then may start to see limitations in your crypto setup, > if you have chosen to switch to dmcrypt above md. Looks like so far my only problem was the PMP. Thank you for your suggestions though. 
Back to my original questions: > Question #1: > Is it better to dmcrypt the 5 drives and then make a raid5 on top, or the opposite > (raid5 first, and then dmcrypt) > I used: > cryptsetup luksFormat --align-payload=8192 -s 256 -c aes-xts-plain64 /dev/sd[mnopq]1 As you did point out, the array will be faster when I use it because the encryption will be sharded over my CPUs, but rebuilding is going to create 5 encryption threads whereas if md5 is first and encryption is on top, rebuilds do not involve any encryption on CPU. So it depends what's more important. > Question #2: > In order to copy data from a working system, I connected the drives via an external > enclosure which uses a SATA PMP. As a result, things are slow: > > md5 : active raid5 dm-7[5] dm-6[3] dm-5[2] dm-4[1] dm-2[0] > 15627526144 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/4] [UUUU_] > [>....................] recovery = 0.9% (35709052/3906881536) finish=3406.6min speed=18939K/sec > bitmap: 0/30 pages [0KB], 65536KB chunk > > 2.5 days for an init or rebuild is going to be painful. > I already checked that I'm not CPU/dmcrpyt pegged. > > I read Neil's message why init is still required: > http://marc.info/?l=linux-raid&m=112044009718483&w=2 > even if somehow on brand new blank drives full of 0s I'm thinking this could be faster > by just assuming the array is clean (all 0s give a parity of 0). > Is it really unsafe to do so? (actually if you do this on top of dmcrypt > like I did here, I won't get 0s, so that way around, it's unfortunately > necessary). Still curious on this: if the drives are brand new, is it safe to assume t> hey're full of 0's and tell mdadm to skip the re-init? (parity of X x 0 = 0) > Question #3: > Since I'm going to put btrfs on top, I'm almost tempted to skip the md raid5 > layer and just use the native support, but the raid code in btrfs still > seems a bit younger than I'm comfortable with. > Is anyone using it and has done disk failures, replaces, and all? Ok, this is not a btrfs list, so I'll asume no one tried that here, no biggie. Cheers, Marc -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems .... .... what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ | PGP 1024R/763BE901 ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Very long raid5 init/rebuild times 2014-01-23 9:13 ` Marc MERLIN @ 2014-01-23 12:24 ` Stan Hoeppner 2014-01-23 21:01 ` Marc MERLIN 2014-01-30 20:18 ` Phillip Susi 0 siblings, 2 replies; 41+ messages in thread From: Stan Hoeppner @ 2014-01-23 12:24 UTC (permalink / raw) To: Marc MERLIN; +Cc: linux-raid On 1/23/2014 3:13 AM, Marc MERLIN wrote: > On Wed, Jan 22, 2014 at 08:37:49PM -0600, Stan Hoeppner wrote: >> On 1/22/2014 11:48 AM, Marc MERLIN wrote: >> ... >>> If crypt is on top of raid5, it seems (and that makes sense) that no >>> encryption is neded for the rebuild. However in my test I can confirm that >>> the rebuild time is exactly the same. I only get 19MB/s of rebuild bandwidth >>> and I think tha'ts because of the port multiplier. >> >> I didn't address this earlier as I assumed you, and anyone else reading >> this thread, would do a little background reading and realize no SATA >> PMP would behave in this manner. No SATA PMP, not Silicon Image, not >> Marvell, none of them, will limit host port throughput to 20MB/s. All >> of them achieve pretty close to wire speed throughput. > > I haven't answered your other message, as I'm getting more data to do > so, but I can assure you that this is incorrect :) > > I've worked with 3 different PMP boards and three different SATA cards > over the last 6 years (sil3124, 3132, and marvel), and got similarly > slow results on all of them. > The marvel was faster than sil3124 but it stopped being stable in > kernels in the last year and fell unsupported (no one to fix the bugs), > so I went back to sil3124. > > I'm not saying that they can't go faster somehow, but in my experience > that has not been the case. Others don't seem to be having such PMP problems. Not in modern times anyway. Maybe it's just your specific hardware mix. If eliminating the PMP increased your read-only resync speed by a factor of 5x, I'm elated to be wrong here. > In case you don't believe me, I just switched my drives from the PMP to > directly connected to the motherboard and a marvel card, and my rebuild > speed changed from 19MB/s to 99MB/s. > (I made no other setting changes, but I did try your changes without > saving them before and after the PMP change and will report below) Why would you assume I wouldn't believe you? > You also said: >> Ok, now I think we're finally getting to the heart of this. Given the >> fact that you're doing full array encryption, and after reading your bio >> on your website the other day, I think I've been giving you too much >> credit. So let's get back to md basics. Have you performed any md >> optimizations? The default value of > > Can't hurt to ask, you never know if I may have forgotten or not know about one. > >> /sys/block/mdX/md/stripe_cache_size >> is 256. This default is woefully inadequate for modern systems, and >> will yield dreadfully low throughput. To fix this execute >> ~$ echo 2048 > /sys/block/mdX/md/stripe_cache_size > > Thanks for that one. > It made no speed difference on the PMP or without, but can't hurt to do anyway. If you're not writing it won't. The problem here is that you're apparently using a non-destructive resync as a performance benchmark. Don't do that. It's representative of nothing but read-only resync speed. Increasing stripe_cache_size above the default as I suggested will ALWAYS increase write speed, often by a factor of 2-3x or more on modern hardware. It should speed up destructive resyncs considerably, as well as normal write IO. 
Once your array has settled down after the inits and resyncs and what not, run some parallel FIO write tests with the default of 256 and then with 2048. You can try 4096 as well, but with 5 rusty drives 4096 will probably cause a slight tailing off of throughput. 2048 should be your sweet spot. You can also just time a few large parallel file copies. You'll be amazed at the gains. The reason is simply that the default of 256 was selected some ~10 years ago when disks were much slower. Increasing this default has been a topic of much discussion recently, because bumping it up increases throughput for everyone, substantially, even with 3 disk RAID5 arrays. >> To specifically address slow resync speed try >> ~$ echo 50000 > /proc/sys/dev/raid/speed_limit_min > > I had this, but good reminder. > >> And you also likely need to increase readahead from the default 128KB to >> something like 1MB (in 512KiB units) >> >> ~$ blockdev --setra 2048 /dev/mdX > > I had this already set to 8192, but again, thanks for asking too. > >> Since kernel 2.6.23 Linux does on demand readahead, so small random IO >> won't trigger it. Thus a large value here will not negatively impact >> random IO. See: http://lwn.net/Articles/235181/ >> >> Please test and post your results. I don't think your problems have >> anything to do with crypto. However, after you get md running at peak >> performance you then may start to see limitations in your crypto setup, >> if you have chosen to switch to dmcrypt above md. > > Looks like so far my only problem was the PMP. That's because you've not been looking deep enough. > Thank you for your suggestions though. You're welcome. > Back to my original questions: >> Question #1: >> Is it better to dmcrypt the 5 drives and then make a raid5 on top, or the opposite >> (raid5 first, and then dmcrypt) >> I used: >> cryptsetup luksFormat --align-payload=8192 -s 256 -c aes-xts-plain64 /dev/sd[mnopq]1 > > As you did point out, the array will be faster when I use it because the > encryption will be sharded over my CPUs, but rebuilding is going to create 5 encryption > threads whereas if md5 is first and encryption is on top, rebuilds do > not involve any encryption on CPU. > > So it depends what's more important. Yep. If you post what CPU you're using I can probably give you a good idea if one core is sufficient for dmcrypt. I'll also reiterate that encrypting a 16TB array device is silly when you can simply carve off an LV for files that need to be encrypted, and run dmcrypt only against that LV. You can always expand an LV. This is a huge performance win for all other files, such your media collections, which don't need to be encrypted. >> Question #2: >> In order to copy data from a working system, I connected the drives via an external >> enclosure which uses a SATA PMP. As a result, things are slow: >> >> md5 : active raid5 dm-7[5] dm-6[3] dm-5[2] dm-4[1] dm-2[0] >> 15627526144 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/4] [UUUU_] >> [>....................] recovery = 0.9% (35709052/3906881536) finish=3406.6min speed=18939K/sec >> bitmap: 0/30 pages [0KB], 65536KB chunk >> >> 2.5 days for an init or rebuild is going to be painful. With stripe_cache_size=2048 this should drop from 2.5 days to less than a day. >> I already checked that I'm not CPU/dmcrpyt pegged. 
>> >> I read Neil's message why init is still required: >> http://marc.info/?l=linux-raid&m=112044009718483&w=2 >> even if somehow on brand new blank drives full of 0s I'm thinking this could be faster >> by just assuming the array is clean (all 0s give a parity of 0). >> Is it really unsafe to do so? (actually if you do this on top of dmcrypt >> like I did here, I won't get 0s, so that way around, it's unfortunately >> necessary). > > Still curious on this: if the drives are brand new, is it safe to assume > t> hey're full of 0's and tell mdadm to skip the re-init? > (parity of X x 0 = 0) No, for a few reasons: 1. Because not all bits are always 0 out of the factory. 2. Bad sectors may exist and need to be discovered/remapped 3. With the increased stripe_cache_size, and if your CPU turns out to be fast enough for dmcrypt in front of md, resync speed won't be as much of an issue, eliminating your motivation for skipping the init. -- Stan ^ permalink raw reply [flat|nested] 41+ messages in thread
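One crude way to see the stripe_cache_size effect described above, short of setting
up fio, is to time the same large write at both values (a single-stream dd understates
parallel throughput, but the trend shows; the path and size are arbitrary):

~$ echo 256 > /sys/block/md5/md/stripe_cache_size
~$ dd if=/dev/zero of=/mnt/btrfs_pool1/sctest1 bs=1M count=4096 conv=fdatasync
~$ echo 2048 > /sys/block/md5/md/stripe_cache_size
~$ dd if=/dev/zero of=/mnt/btrfs_pool1/sctest2 bs=1M count=4096 conv=fdatasync
~$ rm /mnt/btrfs_pool1/sctest[12]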
* Re: Very long raid5 init/rebuild times 2014-01-23 12:24 ` Stan Hoeppner @ 2014-01-23 21:01 ` Marc MERLIN 2014-01-24 5:13 ` Stan Hoeppner 2014-01-30 20:18 ` Phillip Susi 1 sibling, 1 reply; 41+ messages in thread From: Marc MERLIN @ 2014-01-23 21:01 UTC (permalink / raw) To: Stan Hoeppner; +Cc: linux-raid On Thu, Jan 23, 2014 at 06:24:39AM -0600, Stan Hoeppner wrote: > > In case you don't believe me, I just switched my drives from the PMP to > > directly connected to the motherboard and a marvel card, and my rebuild > > speed changed from 19MB/s to 99MB/s. > > (I made no other setting changes, but I did try your changes without > > saving them before and after the PMP change and will report below) > > Why would you assume I wouldn't believe you? You seemed incredulous that PMPs could make things so slow :) > > Thanks for that one. > > It made no speed difference on the PMP or without, but can't hurt to do anyway. > > If you're not writing it won't. The problem here is that you're > apparently using a non-destructive resync as a performance benchmark. > Don't do that. It's representative of nothing but read-only resync speed. Let me think about this: the resync is done at build array time. If all the drives are full of 0's indeed there will be nothing to write. Given that, I think you're right. > Increasing stripe_cache_size above the default as I suggested will > ALWAYS increase write speed, often by a factor of 2-3x or more on modern > hardware. It should speed up destructive resyncs considerably, as well > as normal write IO. Once your array has settled down after the inits > and resyncs and what not, run some parallel FIO write tests with the > default of 256 and then with 2048. You can try 4096 as well, but with 5 > rusty drives 4096 will probably cause a slight tailing off of > throughput. 2048 should be your sweet spot. You can also just time a > few large parallel file copies. You'll be amazed at the gains. Will do, thanks. > The reason is simply that the default of 256 was selected some ~10 years > ago when disks were much slower. Increasing this default has been a > topic of much discussion recently, because bumping it up increases > throughput for everyone, substantially, even with 3 disk RAID5 arrays. Great to hear that the default may hopefully be increased for all. > > As you did point out, the array will be faster when I use it because the > > encryption will be sharded over my CPUs, but rebuilding is going to create 5 encryption > > threads whereas if md5 is first and encryption is on top, rebuilds do > > not involve any encryption on CPU. > > > > So it depends what's more important. > > Yep. If you post what CPU you're using I can probably give you a good > idea if one core is sufficient for dmcrypt. Oh, I did forget to post that. 
That server is a low power-ish dual core with 4 HT units: processor : 3 vendor_id : GenuineIntel cpu family : 6 model : 42 model name : Intel(R) Core(TM) i3-2100T CPU @ 2.50GHz stepping : 7 microcode : 0x28 cpu MHz : 2500.000 cache size : 3072 KB physical id : 0 siblings : 4 core id : 1 cpu cores : 2 apicid : 3 initial apicid : 3 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 popcnt tsc_deadline_timer xsave avx lahf_lm arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid bogomips : 5150.14 clflush size : 64 cache_alignment : 64 address sizes : 36 bits physical, 48 bits virtual power management: > I'll also reiterate that encrypting a 16TB array device is silly when > you can simply carve off an LV for files that need to be encrypted, and > run dmcrypt only against that LV. You can always expand an LV. This is > a huge performance win for all other files, such your media collections, > which don't need to be encrypted. I use btrfs for LV management, so it's easier to encrypt the entire pool. I also encrypt any data on any drive at this point, kind of like I wash my hands. I'm not saying it's the right thing to do for all, but it's my personal choice. I've seen too many drives end up on ebay with data, and I don't want to have to worry about this later, or even erasing my own drives before sending them back to warranty, especially in cases where maybe I can't erase them, but the manufacturer can read them anyway. You get the idea... I've used LVM for too many years (15 was it?) and I'm happy to switch away now :) (I know thin snapshots were recently added, but basically I've been not super happy with LVM performance, and LVM snapshots have been abysmal if you keep them long term). Also, this is off topic here, but I like the fact that I can compute snapshot diffs with btfrs and use that for super fast backups of changed blocks instead of a very slow rsync that has to scan millions of inodes (which is what I've been doing so far). > >> Question #2: > >> In order to copy data from a working system, I connected the drives via an external > >> enclosure which uses a SATA PMP. As a result, things are slow: > >> > >> md5 : active raid5 dm-7[5] dm-6[3] dm-5[2] dm-4[1] dm-2[0] > >> 15627526144 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/4] [UUUU_] > >> [>....................] recovery = 0.9% (35709052/3906881536) finish=3406.6min speed=18939K/sec > >> bitmap: 0/30 pages [0KB], 65536KB chunk > >> > >> 2.5 days for an init or rebuild is going to be painful. > > With stripe_cache_size=2048 this should drop from 2.5 days to less than > a day. It didn't since it PMP limited, but I made that change for the other reasons you suggested. > > Still curious on this: if the drives are brand new, is it safe to assume > > t> hey're full of 0's and tell mdadm to skip the re-init? > > (parity of X x 0 = 0) > > No, for a few reasons: > > 1. Because not all bits are always 0 out of the factory. > 2. Bad sectors may exist and need to be discovered/remapped > 3. 
With the increased stripe_cache_size, and if your CPU turns out to > be fast enough for dmcrypt in front of md, resync speed won't be as much > of an issue, eliminating your motivation for skipping the init. All fair points, thanks for explaining. For now, I put dmcrypt on top of md5, I get 100MB/s raw block write speed (actually just writing a big file in btrfs and going through all the layers) even though it's only using one CPU thread for encryption instead of 2 or more if each disk were encrypted under the md5 layer. Since 100MB/s was also the resync speed I was getting without encryption involved, looks like a single CPU thread can keep up with the raw IO of the array, so I guess I'll leave things that way. As another test gargamel:/mnt/btrfs_pool1# dd if=/dev/md5 of=/dev/null bs=1M count=1024 1073741824 bytes (1.1 GB) copied, 9.78191 s, 110 MB/s So it looks like 100-110MB/s is the read and write speed limit of that array. The drives are rated for 150MB/s each so I'm not too sure which limit I'm hitting, but 100MB/s is fast enough for my intended use. Thanks for you answers again, Marc -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems .... .... what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ ^ permalink raw reply [flat|nested] 41+ messages in thread
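For what it's worth, a plain dd read from the md device goes through the page cache
and readahead, so the number can be skewed either way; a slightly more controlled
single-stream read test would be something like:

~$ echo 3 > /proc/sys/vm/drop_caches
~$ dd if=/dev/md5 of=/dev/null bs=1M count=4096 iflag=direct

It is still only a lower bound, as the parallel fio runs discussed below make clear.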
* Re: Very long raid5 init/rebuild times 2014-01-23 21:01 ` Marc MERLIN @ 2014-01-24 5:13 ` Stan Hoeppner 2014-01-25 8:36 ` Marc MERLIN 2014-01-30 20:36 ` Phillip Susi 0 siblings, 2 replies; 41+ messages in thread From: Stan Hoeppner @ 2014-01-24 5:13 UTC (permalink / raw) To: Marc MERLIN; +Cc: linux-raid On 1/23/2014 3:01 PM, Marc MERLIN wrote: > On Thu, Jan 23, 2014 at 06:24:39AM -0600, Stan Hoeppner wrote: >>> In case you don't believe me, I just switched my drives from the PMP to >>> directly connected to the motherboard and a marvel card, and my rebuild >>> speed changed from 19MB/s to 99MB/s. >>> (I made no other setting changes, but I did try your changes without >>> saving them before and after the PMP change and will report below) >> >> Why would you assume I wouldn't believe you? > > You seemed incredulous that PMPs could make things so slow :) Well, no, not really. I know there are some real quality issues with a lot of cheap PMP JBODs out there. I was just surprised to see an experienced Linux sysadmin have bad luck with 3/3 of em. Most folks using Silicon Image HBAs with SiI PMPs seem to get good performance. Personally, I've never used PMPs. Given the cost ratio between drives and HBA ports, a quality 4/8 port SAS HBA such as one of the LSIs is a better solution all around. 4TB drives average $200 each. A five drive array is $1000. An LSI 8 port 12G SAS HBA with guaranteed compatibility, quality, support, and performance is $300. A cheap 2 port SATA HBA and 5 port PMP card gives sub optimal performance, iffy compatibility, and low quality, and is ~$130. $1300 vs $1130. Going with a cheap SATA HBA and PMP makes no sense. >>> Thanks for that one. >>> It made no speed difference on the PMP or without, but can't hurt to do anyway. >> >> If you're not writing it won't. The problem here is that you're >> apparently using a non-destructive resync as a performance benchmark. >> Don't do that. It's representative of nothing but read-only resync speed. > > Let me think about this: the resync is done at build array time. > If all the drives are full of 0's indeed there will be nothing to write. > Given that, I think you're right. The initial resync is read-only. It won't modify anything unless there's a discrepancy. So the stripe cache isn't in play. The larger stripe cache should indeed increase rebuild rate though. >> Increasing stripe_cache_size above the default as I suggested will >> ALWAYS increase write speed, often by a factor of 2-3x or more on modern >> hardware. It should speed up destructive resyncs considerably, as well >> as normal write IO. Once your array has settled down after the inits >> and resyncs and what not, run some parallel FIO write tests with the >> default of 256 and then with 2048. You can try 4096 as well, but with 5 >> rusty drives 4096 will probably cause a slight tailing off of >> throughput. 2048 should be your sweet spot. You can also just time a >> few large parallel file copies. You'll be amazed at the gains. > > Will do, thanks. > >> The reason is simply that the default of 256 was selected some ~10 years >> ago when disks were much slower. Increasing this default has been a >> topic of much discussion recently, because bumping it up increases >> throughput for everyone, substantially, even with 3 disk RAID5 arrays. > > Great to hear that the default may hopefully be increased for all. It may be a while, or never. Neil's last note suggests the default likely won't change, but eventually we may have automated stripe cache size management. 
>>> As you did point out, the array will be faster when I use it because the >>> encryption will be sharded over my CPUs, but rebuilding is going to create 5 encryption >>> threads whereas if md5 is first and encryption is on top, rebuilds do >>> not involve any encryption on CPU. >>> >>> So it depends what's more important. >> >> Yep. If you post what CPU you're using I can probably give you a good >> idea if one core is sufficient for dmcrypt. > > Oh, I did forget to post that. > > That server is a low power-ish dual core with 4 HT units: ... > model name : Intel(R) Core(TM) i3-2100T CPU @ 2.50GHz ... > cache size : 3072 KB ... Actually, instead of me making an educated guess, I'd suggest you run ~$ cryptsetup benchmark This will tell you precisely what your throughput is with various settings and ciphers. Depending on what this spits back you may want to change your setup, assuming we get the IO throughput where it should be. >> I'll also reiterate that encrypting a 16TB array device is silly when >> you can simply carve off an LV for files that need to be encrypted, and >> run dmcrypt only against that LV. You can always expand an LV. This is >> a huge performance win for all other files, such your media collections, >> which don't need to be encrypted. > > I use btrfs for LV management, so it's easier to encrypt the entire pool. I > also encrypt any data on any drive at this point, kind of like I wash my > hands. I'm not saying it's the right thing to do for all, but it's my > personal choice. I've seen too many drives end up on ebay with data, and I > don't want to have to worry about this later, or even erasing my own drives > before sending them back to warranty, especially in cases where maybe I > can't erase them, but the manufacturer can read them anyway. > You get the idea... So be it. Now let's work to see if we can squeeze every ounce of performance out of it. ... >>>> Question #2: >>>> In order to copy data from a working system, I connected the drives via an external >>>> enclosure which uses a SATA PMP. As a result, things are slow: >>>> >>>> md5 : active raid5 dm-7[5] dm-6[3] dm-5[2] dm-4[1] dm-2[0] >>>> 15627526144 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/4] [UUUU_] >>>> [>....................] recovery = 0.9% (35709052/3906881536) finish=3406.6min speed=18939K/sec >>>> bitmap: 0/30 pages [0KB], 65536KB chunk >>>> >>>> 2.5 days for an init or rebuild is going to be painful. >> >> With stripe_cache_size=2048 this should drop from 2.5 days to less than >> a day. > > It didn't since it PMP limited, but I made that change for the other reasons > you suggested. You said you had pulled the PMP and connected direct to an HBA, bumping from 19MB/s to 99MB/s. Did you switch back to the PMP and are now getting 100MB/s through the PMP? We should be able to get much higher if it's 3/6G SATA, a little higher if it's 1/5G. >>> Still curious on this: if the drives are brand new, is it safe to assume >>> t> hey're full of 0's and tell mdadm to skip the re-init? >>> (parity of X x 0 = 0) >> >> No, for a few reasons: >> >> 1. Because not all bits are always 0 out of the factory. >> 2. Bad sectors may exist and need to be discovered/remapped >> 3. With the increased stripe_cache_size, and if your CPU turns out to >> be fast enough for dmcrypt in front of md, resync speed won't be as much >> of an issue, eliminating your motivation for skipping the init. I shouldn't have included #3 here as it doesn't affect initial resync, only rebuild. > All fair points, thanks for explaining. 
> For now, I put dmcrypt on top of md5, I get 100MB/s raw block write speed (actually > just writing a big file in btrfs and going through all the layers) even > though it's only using one CPU thread for encryption instead of 2 or more if > each disk were encrypted under the md5 layer. 100MB/s sequential read throughput is very poor for a 5 drive RAID5, especially with new 4TB drives which can stream well over 130MB/s each. > Since 100MB/s was also the resync speed I was getting without encryption > involved, looks like a single CPU thread can keep up with the raw IO of the > array, so I guess I'll leave things that way. 100MB/s is leaving big performance on the table. And 100 isn't the peak array throughput of your current configuration. > As another test > gargamel:/mnt/btrfs_pool1# dd if=/dev/md5 of=/dev/null bs=1M count=1024 > 1073741824 bytes (1.1 GB) copied, 9.78191 s, 110 MB/s dd single stream copies are not a valid test of array throughput. This tells you only the -minimum- throughput of the array. > So it looks like 100-110MB/s is the read and write speed limit of that array. > The drives are rated for 150MB/s each so I'm not too sure which limit I'm > hitting, but 100MB/s is fast enough for my intended use. To test real maximum throughput install fio, save and run this job file, and post your results. Monitor CPU burn of dmcrypt, using top is fine, while running the job to see if it eats all of one core. The job runs in multiple steps, first creating the eight 1GB test files, then running the read/write tests against those files. [global] directory=/some/directory zero_buffers numjobs=4 group_reporting blocksize=1024k ioengine=libaio iodepth=16 direct=1 size=1g [read] rw=read stonewall [write] rw=write stonewall > Thanks for you answers again, You're welcome. If you wish to wring maximum possible performance from this rig I'll stick with ya until we get there. You're not far. Just takes some testing and tweaking unless you have a real hardware limitation, not a driver setting or firmware issue. BTW, I don't recall you mentioning which HBA and PMP you're using at the moment, and whether the PMP is an Addonics card or integrated in a JBOD. Nor if you're 1.5/3/6G from HBA through PMP to each drive. Post your dmesg output showing the drive link speeds if you would, i.e. ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310) -- Stan ^ permalink raw reply [flat|nested] 41+ messages in thread
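To use the job file above, point directory= at a path on the array, save it under any
name (the name below is arbitrary), and run it while watching CPU use in another
terminal; the cryptsetup benchmark invocation matching the cipher and key size used
earlier in the thread is shown as well:

~$ fio raid5-test.fio
~$ top                                             # watch for a single core saturating on dm-crypt work
~$ cryptsetup benchmark -c aes-xts-plain64 -s 256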
* Re: Very long raid5 init/rebuild times 2014-01-24 5:13 ` Stan Hoeppner @ 2014-01-25 8:36 ` Marc MERLIN 2014-01-28 7:46 ` Stan Hoeppner 2014-01-30 20:36 ` Phillip Susi 1 sibling, 1 reply; 41+ messages in thread From: Marc MERLIN @ 2014-01-25 8:36 UTC (permalink / raw) To: Stan Hoeppner; +Cc: linux-raid On Thu, Jan 23, 2014 at 11:13:41PM -0600, Stan Hoeppner wrote: > Well, no, not really. I know there are some real quality issues with a > lot of cheap PMP JBODs out there. I was just surprised to see an > experienced Linux sysadmin have bad luck with 3/3 of em. Most folks > using Silicon Image HBAs with SiI PMPs seem to get good performance. I've worked with the raw chips on silicon, have the firmware flashing tool for the PMP, and never saw better than that. So I'm not sure who those most folks are, or what chips they have, but obviously the experience you describe is very different from the one I've seen, or even from what the 2 kernel folks I know who used to maintain them have, since they've abandonned using them due to them being more trouble than they're worth and the performance poor. To be fair, at the time I cared about performance on PMP, I was also using snapshots on LVM and those were so bad that they actually were the performance issue sometimes I got as slow as 5MB/s. Yes, LVM snapshots were horrible for performance, which is why I switched to brtfs now. > Personally, I've never used PMPs. Given the cost ratio between drives > and HBA ports, a quality 4/8 port SAS HBA such as one of the LSIs is a > better solution all around. 4TB drives average $200 each. A five drive > array is $1000. An LSI 8 port 12G SAS HBA with guaranteed > compatibility, quality, support, and performance is $300. A cheap 2 You are correct. When I started with PMPs there was not a single good SATA card that had 10 ports or more and didn't cost $900. That was 4-5 years ago though. Today, I don't use PMPs anymore, except for some enclosures where it's easy to just have one cable and where what you describe would need 5 sata cables to the enclosure, would it not? (unless you use something like USB3, but that's another interface I've had my share of driver bug problems with, so it's not a net win either). > port SATA HBA and 5 port PMP card gives sub optimal performance, iffy > compatibility, and low quality, and is ~$130. $1300 vs $1130. Going > with a cheap SATA HBA and PMP makes no sense. I generally agree. Here I was using it to transfer data off some drives, but indeed I wouldn't use this for a main array. > > Let me think about this: the resync is done at build array time. > > If all the drives are full of 0's indeed there will be nothing to write. > > Given that, I think you're right. > > The initial resync is read-only. It won't modify anything unless > there's a discrepancy. So the stripe cache isn't in play. The larger > stripe cache should indeed increase rebuild rate though. Right, I understood that the first time you explained it. > Actually, instead of me making an educated guess, I'd suggest you run > > ~$ cryptsetup benchmark > > This will tell you precisely what your throughput is with various > settings and ciphers. Depending on what this spits back you may want to > change your setup, assuming we get the IO throughput where it should be. Sigh, debian unstable doesn't have the brand new cryptsetup with that option yet, will have to get it. Either way, I already know my CPU is not a bottleneck, so it's not that important. 
> > I use btrfs for LV management, so it's easier to encrypt the entire pool. I > > also encrypt any data on any drive at this point, kind of like I wash my > > hands. I'm not saying it's the right thing to do for all, but it's my > > personal choice. I've seen too many drives end up on ebay with data, and I > > don't want to have to worry about this later, or even erasing my own drives > > before sending them back to warranty, especially in cases where maybe I > > can't erase them, but the manufacturer can read them anyway. > > You get the idea... > > So be it. Now let's work to see if we can squeeze every ounce of > performance out of it. Since I get the same speed writing through all the layers as raid5 gets doing a resync without writes and the other layers, I'm not sure how you're suggesting that I can get extra performance. Well, unless you mean just raw swraid5 can be made faster with my drives still. That is likely possible if I get a better sata card to put in my machine or find another way to increase cpu to drive throughput. > You said you had pulled the PMP and connected direct to an HBA, bumping > from 19MB/s to 99MB/s. Did you switch back to the PMP and are now > getting 100MB/s through the PMP? We should be able to get much higher > if it's 3/6G SATA, a little higher if it's 1/5G. No, I did not. I'm not planning on having my destination array (the one I'm writing to) behind a PMP for the reasons we discussed above. The ports are 3MB/s. Obviously I'm not getting the right speed, but I think there is something wrong with the motherboard of the system this is in, causing some bus conflicts and slowdowns. This is something I'll need to investigate outside of this list since it's not related to raid anymore. > > For now, I put dmcrypt on top of md5, I get 100MB/s raw block write speed (actually > > just writing a big file in btrfs and going through all the layers) even > > though it's only using one CPU thread for encryption instead of 2 or more if > > each disk were encrypted under the md5 layer. > > 100MB/s sequential read throughput is very poor for a 5 drive RAID5, > especially with new 4TB drives which can stream well over 130MB/s each. Yes, I totally agree. > > As another test > > gargamel:/mnt/btrfs_pool1# dd if=/dev/md5 of=/dev/null bs=1M count=1024 > > 1073741824 bytes (1.1 GB) copied, 9.78191 s, 110 MB/s > > dd single stream copies are not a valid test of array throughput. This > tells you only the -minimum- throughput of the array. If the array is idle, how is that not a valid block read test? > > So it looks like 100-110MB/s is the read and write speed limit of that array. > To test real maximum throughput install fio, save and run this job file, > and post your results. Monitor CPU burn of dmcrypt, using top is fine, > while running the job to see if it eats all of one core. The job runs > in multiple steps, first creating the eight 1GB test files, then running > the read/write tests against those files. > > [global] > directory=/some/directory > zero_buffers > numjobs=4 > group_reporting > blocksize=1024k > ioengine=libaio > iodepth=16 > direct=1 > size=1g > > [read] > rw=read > stonewall > > [write] > rw=write > stonewall Yeah, I have fio, didn't seem needed here, but I'll it a shot when I get a chance. > > Thanks for you answers again, > > You're welcome. If you wish to wring maximum possible performance from > this rig I'll stick with ya until we get there. You're not far. 
Just > takes some testing and tweaking unless you have a real hardware > limitation, not a driver setting or firmware issue. Thanks for your offer, although to be honest, I think I'm hitting a hardware problem which I need to look into when I get a chance. > BTW, I don't recall you mentioning which HBA and PMP you're using at the > moment, and whether the PMP is an Addonics card or integrated in a JBOD. > Nor if you're 1.5/3/6G from HBA through PMP to each drive. That PMP is integrated in the jbod, I haven't torn it apart to check which one it was, but I've pretty much gotten slow speeds from those things and more importantly PMPs have bugs during drive hangs and retries which can cause recovery problems and killing swraid5 arrays, so that's why I stopped using them for serious use. The driver authors know about the issues, and some are in the PMP firmware and not something they can work around. > Post your dmesg output showing the drive link speeds if you would, i.e. > ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310) Yep, very familiar with that unfortunately from my PMP debugging days [ 6.188660] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 330) [ 6.211533] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 330) [ 6.444897] ata1.00: SATA link up 3.0 Gbps (SStatus 123 SControl 330) [ 6.444918] ata1.01: SATA link up 3.0 Gbps (SStatus 123 SControl 330) [ 6.445087] ata2.00: SATA link up 6.0 Gbps (SStatus 133 SControl 330) [ 6.445109] ata2.01: SATA link up 3.0 Gbps (SStatus 123 SControl 330) [ 14.179297] ata9: SATA link up 3.0 Gbps (SStatus 123 SControl 300) [ 14.675693] ata10: SATA link up 3.0 Gbps (SStatus 123 SControl 300) [ 15.516390] ata11: SATA link up 3.0 Gbps (SStatus 123 SControl 300) [ 16.008800] ata12: SATA link up 3.0 Gbps (SStatus 123 SControl 300) [ 19.339559] ata14: SATA link up 3.0 Gbps (SStatus 123 SControl 0) [ 19.692273] ata14.00: SATA link up 3.0 Gbps (SStatus 123 SControl 320) [ 20.705263] ata14.01: SATA link up 3.0 Gbps (SStatus 123 SControl 300) [ 21.785956] ata14.02: SATA link up 3.0 Gbps (SStatus 123 SControl 300) [ 22.899091] ata14.03: SATA link up 3.0 Gbps (SStatus 123 SControl 300) [ 23.935813] ata14.04: SATA link up 3.0 Gbps (SStatus 123 SControl 300) Of course, I'm not getting that speed, but again, I'll look into it. Thanks for your suggestions for tweaks. Best, Marc -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems .... .... what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ ^ permalink raw reply [flat|nested] 41+ messages in thread
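One quick way to tell whether the shortfall is per drive or in the shared path (HBA, PMP, or bus) is to read each member device on its own with the cache bypassed; with the real member names substituted in:

for d in /dev/sd[mnopq]; do echo $d; dd if=$d of=/dev/null bs=1M count=1024 iflag=direct; done

If every drive reads at the expected 130-150MB/s alone but the array tops out near 100MB/s, the loss is in the aggregate path rather than the disks.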
* Re: Very long raid5 init/rebuild times 2014-01-25 8:36 ` Marc MERLIN @ 2014-01-28 7:46 ` Stan Hoeppner 2014-01-28 16:50 ` Marc MERLIN 2014-01-30 20:47 ` Phillip Susi 0 siblings, 2 replies; 41+ messages in thread From: Stan Hoeppner @ 2014-01-28 7:46 UTC (permalink / raw) To: Marc MERLIN; +Cc: linux-raid On 1/25/2014 2:36 AM, Marc MERLIN wrote: > On Thu, Jan 23, 2014 at 11:13:41PM -0600, Stan Hoeppner wrote: >> Well, no, not really. I know there are some real quality issues with a >> lot of cheap PMP JBODs out there. I was just surprised to see an >> experienced Linux sysadmin have bad luck with 3/3 of em. Most folks >> using Silicon Image HBAs with SiI PMPs seem to get good performance. > > I've worked with the raw chips on silicon, have the firmware flashing tool > for the PMP, and never saw better than that. > So I'm not sure who those most folks are, or what chips they have, but > obviously the experience you describe is very different from the one I've > seen, or even from what the 2 kernel folks I know who used to maintain them > have, since they've abandonned using them due to them being more trouble > than they're worth and the performance poor. The first that comes to mind is Backblaze, a cloud storage provider for consumer file backup. They're on their 3rd generation of storage pod, and they're still using the original Syba SiI 3132 PCIe, Addonics SiI 3124 PCI cards, and SiI 3726 PMP backplane boards, since 2009. All Silicon Image ASICs both HBA and PMP. Each pod has 4 SATA cards and 9 PMPs boards with 45 drive slots. The version 3.0 pod offers 180TB of storage. They have a few hundred of these storage pods in service backing up user files over the net. Here's the original design. The post has links to version 2 and 3. http://blog.backblaze.com/2009/09/01/petabytes-on-a-budget-how-to-build-cheap-cloud-storage/ The key to their success is obviously working closely with all their vendors to make sure the SATA cards and PMPs have the correct firmware versions to work reliably with each other. Consumers buying cheap big box store HBAs and enclosures don't have this advantage. > To be fair, at the time I cared about performance on PMP, I was also using > snapshots on LVM and those were so bad that they actually were the > performance issue sometimes I got as slow as 5MB/s. Yes, LVM snapshots were > horrible for performance, which is why I switched to brtfs now. > >> Personally, I've never used PMPs. Given the cost ratio between drives >> and HBA ports, a quality 4/8 port SAS HBA such as one of the LSIs is a >> better solution all around. 4TB drives average $200 each. A five drive >> array is $1000. An LSI 8 port 12G SAS HBA with guaranteed >> compatibility, quality, support, and performance is $300. A cheap 2 > > You are correct. When I started with PMPs there was not a single good SATA > card that had 10 ports or more and didn't cost $900. That was 4-5 years ago > though. > Today, I don't use PMPs anymore, except for some enclosures where it's easy > to just have one cable and where what you describe would need 5 sata cables > to the enclosure, would it not? No. For external JBOD storage you go with an SAS expander unit instead of a PMP. You have a single SFF 8088 cable to the host which carries 4 SAS/SATA channels, up to 2.4 GB/s with 6G interfaces. > (unless you use something like USB3, but that's another interface I've had > my share of driver bug problems with, so it's not a net win either). Yes, USB is a horrible interface for RAID storage. 
>> port SATA HBA and 5 port PMP card gives sub optimal performance, iffy >> compatibility, and low quality, and is ~$130. $1300 vs $1130. Going >> with a cheap SATA HBA and PMP makes no sense. > > I generally agree. Here I was using it to transfer data off some drives, but > indeed I wouldn't use this for a main array. Your original posts left me with the impression that you were using this as a production array. Apologies for not digesting those correctly. ... > Since I get the same speed writing through all the layers as raid5 gets > doing a resync without writes and the other layers, I'm not sure how you're > suggesting that I can get extra performance. You don't get extra performance. You expose the performance you already have. Serial submission typically doesn't reach peak throughput. Both the resync operation and dd copy are serial submitters. You usually must submit asynchronously or in parallel to reach maximum throughput. Being limited by a PMP it may not matter. But with your direct connected drives of your production array you should see a substantial increase in throughput with parallel submission. > Well, unless you mean just raw swraid5 can be made faster with my drives > still. > That is likely possible if I get a better sata card to put in my machine > or find another way to increase cpu to drive throughput. To significantly increase single streaming throughput you need AIO. A faster CPU won't make any difference. Neither will a better SATA card, unless your current one is defective, or limits port throughput will more than one port active--I've heard of couple that do so. >> You said you had pulled the PMP and connected direct to an HBA, bumping >> from 19MB/s to 99MB/s. Did you switch back to the PMP and are now >> getting 100MB/s through the PMP? We should be able to get much higher >> if it's 3/6G SATA, a little higher if it's 1/5G. > > No, I did not. I'm not planning on having my destination array (the one I'm > writing to) behind a PMP for the reasons we discussed above. > The ports are 3MB/s. Obviously I'm not getting the right speed, but I think > there is something wrong with the motherboard of the system this is in, > causing some bus conflicts and slowdowns. > This is something I'll need to investigate outside of this list since it's > not related to raid anymore. Interesting. >>> For now, I put dmcrypt on top of md5, I get 100MB/s raw block write speed (actually >>> just writing a big file in btrfs and going through all the layers) even >>> though it's only using one CPU thread for encryption instead of 2 or more if >>> each disk were encrypted under the md5 layer. >> >> 100MB/s sequential read throughput is very poor for a 5 drive RAID5, >> especially with new 4TB drives which can stream well over 130MB/s each. > > Yes, I totally agree. > >>> As another test >>> gargamel:/mnt/btrfs_pool1# dd if=/dev/md5 of=/dev/null bs=1M count=1024 >>> 1073741824 bytes (1.1 GB) copied, 9.78191 s, 110 MB/s >> >> dd single stream copies are not a valid test of array throughput. This >> tells you only the -minimum- throughput of the array. > > If the array is idle, how is that not a valid block read test? See above WRT asynchronous and parallel submission. >>> So it looks like 100-110MB/s is the read and write speed limit of that array. >> To test real maximum throughput install fio, save and run this job file, >> and post your results. Monitor CPU burn of dmcrypt, using top is fine, >> while running the job to see if it eats all of one core. 
The job runs >> in multiple steps, first creating the eight 1GB test files, then running >> the read/write tests against those files. >> >> [global] >> directory=/some/directory >> zero_buffers >> numjobs=4 >> group_reporting >> blocksize=1024k >> ioengine=libaio >> iodepth=16 >> direct=1 >> size=1g >> >> [read] >> rw=read >> stonewall >> >> [write] >> rw=write >> stonewall > > Yeah, I have fio, didn't seem needed here, but I'll it a shot when I get a > chance. With your setup and its apparent hardware limitations, parallel submission may not reveal any more performance. On the vast majority of systems it does. >>> Thanks for you answers again, >> >> You're welcome. If you wish to wring maximum possible performance from >> this rig I'll stick with ya until we get there. You're not far. Just >> takes some testing and tweaking unless you have a real hardware >> limitation, not a driver setting or firmware issue. > > Thanks for your offer, although to be honest, I think I'm hitting a hardware > problem which I need to look into when I get a chance. Got it. >> BTW, I don't recall you mentioning which HBA and PMP you're using at the >> moment, and whether the PMP is an Addonics card or integrated in a JBOD. >> Nor if you're 1.5/3/6G from HBA through PMP to each drive. > > That PMP is integrated in the jbod, I haven't torn it apart to check which > one it was, but I've pretty much gotten slow speeds from those things and > more importantly PMPs have bugs during drive hangs and retries which can > cause recovery problems and killing swraid5 arrays, so that's why I stopped > using them for serious use. Probably a good call WRT consumer PMP JBODs. > The driver authors know about the issues, and some are in the PMP firmware > and not something they can work around. > >> Post your dmesg output showing the drive link speeds if you would, i.e. >> ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310) > > Yep, very familiar with that unfortunately from my PMP debugging days > [ 6.188660] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 330) > [ 6.211533] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 330) > [ 6.444897] ata1.00: SATA link up 3.0 Gbps (SStatus 123 SControl 330) > [ 6.444918] ata1.01: SATA link up 3.0 Gbps (SStatus 123 SControl 330) > [ 6.445087] ata2.00: SATA link up 6.0 Gbps (SStatus 133 SControl 330) > [ 6.445109] ata2.01: SATA link up 3.0 Gbps (SStatus 123 SControl 330) > [ 14.179297] ata9: SATA link up 3.0 Gbps (SStatus 123 SControl 300) > [ 14.675693] ata10: SATA link up 3.0 Gbps (SStatus 123 SControl 300) > [ 15.516390] ata11: SATA link up 3.0 Gbps (SStatus 123 SControl 300) > [ 16.008800] ata12: SATA link up 3.0 Gbps (SStatus 123 SControl 300) > [ 19.339559] ata14: SATA link up 3.0 Gbps (SStatus 123 SControl 0) > [ 19.692273] ata14.00: SATA link up 3.0 Gbps (SStatus 123 SControl 320) > [ 20.705263] ata14.01: SATA link up 3.0 Gbps (SStatus 123 SControl 300) > [ 21.785956] ata14.02: SATA link up 3.0 Gbps (SStatus 123 SControl 300) > [ 22.899091] ata14.03: SATA link up 3.0 Gbps (SStatus 123 SControl 300) > [ 23.935813] ata14.04: SATA link up 3.0 Gbps (SStatus 123 SControl 300) > > Of course, I'm not getting that speed, but again, I'll look into it. Yeah, something's definitely up with that. All drives are 3G sync, so you 'should' have 300 MB/s data rate through the PMP. > Thanks for your suggestions for tweaks. No problem Marc. Have you noticed the right hand side of my email address? :) I'm kinda like a dog with a bone when it comes to hardware issues. 
Apologies if I've been a bit too tenacious with this. -- Stan ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Very long raid5 init/rebuild times 2014-01-28 7:46 ` Stan Hoeppner @ 2014-01-28 16:50 ` Marc MERLIN 2014-01-29 0:56 ` Stan Hoeppner 2014-01-30 20:47 ` Phillip Susi 1 sibling, 1 reply; 41+ messages in thread From: Marc MERLIN @ 2014-01-28 16:50 UTC (permalink / raw) To: Stan Hoeppner; +Cc: linux-raid On Tue, Jan 28, 2014 at 01:46:28AM -0600, Stan Hoeppner wrote: > > Today, I don't use PMPs anymore, except for some enclosures where it's easy > > to just have one cable and where what you describe would need 5 sata cables > > to the enclosure, would it not? > > No. For external JBOD storage you go with an SAS expander unit instead > of a PMP. You have a single SFF 8088 cable to the host which carries 4 > SAS/SATA channels, up to 2.4 GB/s with 6G interfaces. Yeah, I know about those, but I have 5 drives in my enclosures, so that's one short :) > > I generally agree. Here I was using it to transfer data off some drives, but > > indeed I wouldn't use this for a main array. > > Your original posts left me with the impression that you were using this > as a production array. Apologies for not digesting those correctly. I likely wasn't clear, sorry about that. > You don't get extra performance. You expose the performance you already > have. Serial submission typically doesn't reach peak throughput. Both > the resync operation and dd copy are serial submitters. You usually > must submit asynchronously or in parallel to reach maximum throughput. > Being limited by a PMP it may not matter. But with your direct > connected drives of your production array you should see a substantial > increase in throughput with parallel submission. I agree, it should be faster. > >> [global] > >> directory=/some/directory > >> zero_buffers > >> numjobs=4 > >> group_reporting > >> blocksize=1024k > >> ioengine=libaio > >> iodepth=16 > >> direct=1 > >> size=1g > >> > >> [read] > >> rw=read > >> stonewall > >> > >> [write] > >> rw=write > >> stonewall > > > > Yeah, I have fio, didn't seem needed here, but I'll it a shot when I get a > > chance. > > With your setup and its apparent hardware limitations, parallel > submission may not reveal any more performance. On the vast majority of > systems it does. fio said: Run status group 0 (all jobs): READ: io=4096.0MB, aggrb=77695KB/s, minb=77695KB/s, maxb=77695KB/s, mint=53984msec, maxt=53984msec Run status group 1 (all jobs): WRITE: io=4096.0MB, aggrb=77006KB/s, minb=77006KB/s, maxb=77006KB/s, mint=54467msec, maxt=54467msec > > Of course, I'm not getting that speed, but again, I'll look into it. > > Yeah, something's definitely up with that. All drives are 3G sync, so > you 'should' have 300 MB/s data rate through the PMP. Right. > > Thanks for your suggestions for tweaks. > > No problem Marc. Have you noticed the right hand side of my email > address? :) I'm kinda like a dog with a bone when it comes to hardware > issues. Apologies if I've been a bit too tenacious with this. I had not :) I usually try to optimize stuff as much as possible when it's worth it or when I really care and have time. I agree this one is puzzling me a bit and even if it's fast enough for my current needs and the time I have right now, I'll try and move it to another system to see. I'm pretty sure that one system has a weird bottleneck. Cheers, Marc -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems .... .... 
what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ ^ permalink raw reply [flat|nested] 41+ messages in thread
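When the aggregate number lands this far below the per-drive spec, it is usually informative to watch the members while the fio job runs; assuming the sysstat package is installed:

iostat -x -m 2    # per-device await and %util for the md5 members during the run

Drives sitting at high await with low MB/s point at the path to the disks; low %util across the board points back at the submission side.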
* Re: Very long raid5 init/rebuild times 2014-01-28 16:50 ` Marc MERLIN @ 2014-01-29 0:56 ` Stan Hoeppner 2014-01-29 1:01 ` Marc MERLIN 0 siblings, 1 reply; 41+ messages in thread From: Stan Hoeppner @ 2014-01-29 0:56 UTC (permalink / raw) To: Marc MERLIN; +Cc: linux-raid On 1/28/2014 10:50 AM, Marc MERLIN wrote: > On Tue, Jan 28, 2014 at 01:46:28AM -0600, Stan Hoeppner wrote: >>> Today, I don't use PMPs anymore, except for some enclosures where it's easy >>> to just have one cable and where what you describe would need 5 sata cables >>> to the enclosure, would it not? >> >> No. For external JBOD storage you go with an SAS expander unit instead >> of a PMP. You have a single SFF 8088 cable to the host which carries 4 >> SAS/SATA channels, up to 2.4 GB/s with 6G interfaces. > > Yeah, I know about those, but I have 5 drives in my enclosures, so that's > one short :) I think you misunderstood. I was referring to a JBOD chassis with SAS expander, up to 32 drives, typically 12-24 drives with two host or two daisy chain ports. Maybe an example would help here. http://www.newegg.com/Product/Product.aspx?Item=N82E16816133047 Obviously this is in a difference cost category, and not typical for consumer use. Smaller units are available for less $$ but you pay more per drive, as the expander board is the majority of the cost. Steel and plastic are cheap, as are PSUs. >>> I generally agree. Here I was using it to transfer data off some drives, but >>> indeed I wouldn't use this for a main array. >> >> Your original posts left me with the impression that you were using this >> as a production array. Apologies for not digesting those correctly. > > I likely wasn't clear, sorry about that. > >> You don't get extra performance. You expose the performance you already >> have. Serial submission typically doesn't reach peak throughput. Both >> the resync operation and dd copy are serial submitters. You usually >> must submit asynchronously or in parallel to reach maximum throughput. >> Being limited by a PMP it may not matter. But with your direct >> connected drives of your production array you should see a substantial >> increase in throughput with parallel submission. > > I agree, it should be faster. > >>>> [global] >>>> directory=/some/directory >>>> zero_buffers >>>> numjobs=4 >>>> group_reporting >>>> blocksize=1024k >>>> ioengine=libaio >>>> iodepth=16 >>>> direct=1 >>>> size=1g >>>> >>>> [read] >>>> rw=read >>>> stonewall >>>> >>>> [write] >>>> rw=write >>>> stonewall >>> >>> Yeah, I have fio, didn't seem needed here, but I'll it a shot when I get a >>> chance. >> >> With your setup and its apparent hardware limitations, parallel >> submission may not reveal any more performance. On the vast majority of >> systems it does. > > fio said: > Run status group 0 (all jobs): > READ: io=4096.0MB, aggrb=77695KB/s, minb=77695KB/s, maxb=77695KB/s, mint=53984msec, maxt=53984msec > > Run status group 1 (all jobs): > WRITE: io=4096.0MB, aggrb=77006KB/s, minb=77006KB/s, maxb=77006KB/s, mint=54467msec, maxt=54467msec Something is definitely not right if parallel FIO submission is ~25% lower than single submission dd. But you were running your dd tests through buffer cache IIRC. This FIO test uses O_DIRECT. So it's not apples to apples. When testing IO throughput one should also bypass buffer cache. >>> Of course, I'm not getting that speed, but again, I'll look into it. >> >> Yeah, something's definitely up with that. All drives are 3G sync, so >> you 'should' have 300 MB/s data rate through the PMP. > > Right. 
> >>> Thanks for your suggestions for tweaks. >> >> No problem Marc. Have you noticed the right hand side of my email >> address? :) I'm kinda like a dog with a bone when it comes to hardware >> issues. Apologies if I've been a bit too tenacious with this. > > I had not :) I usually try to optimize stuff as much as possible when it's > worth it or when I really care and have time. I agree this one is puzzling > me a bit and even if it's fast enough for my current needs and the time I > have right now, I'll try and move it to another system to see. I'm pretty > sure that one system has a weird bottleneck. Yeah, something definitely not right. Your RAID throughput is less than a single 7.2K SATA drive. It's probably just something funky with that JBOD chassis. -- Stan ^ permalink raw reply [flat|nested] 41+ messages in thread
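As a quick cross-check, dd itself can be told to bypass the page cache so the buffered-versus-O_DIRECT difference drops out of the comparison; for example (the test file name is a placeholder):

dd if=/dev/md5 of=/dev/null bs=1M count=4096 iflag=direct
dd if=/dev/zero of=/mnt/btrfs_pool1/ddtest bs=1M count=4096 oflag=direct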
* Re: Very long raid5 init/rebuild times 2014-01-29 0:56 ` Stan Hoeppner @ 2014-01-29 1:01 ` Marc MERLIN 0 siblings, 0 replies; 41+ messages in thread From: Marc MERLIN @ 2014-01-29 1:01 UTC (permalink / raw) To: Stan Hoeppner; +Cc: linux-raid On Tue, Jan 28, 2014 at 06:56:32PM -0600, Stan Hoeppner wrote: > On 1/28/2014 10:50 AM, Marc MERLIN wrote: > > On Tue, Jan 28, 2014 at 01:46:28AM -0600, Stan Hoeppner wrote: > >>> Today, I don't use PMPs anymore, except for some enclosures where it's easy > >>> to just have one cable and where what you describe would need 5 sata cables > >>> to the enclosure, would it not? > >> > >> No. For external JBOD storage you go with an SAS expander unit instead > >> of a PMP. You have a single SFF 8088 cable to the host which carries 4 > >> SAS/SATA channels, up to 2.4 GB/s with 6G interfaces. > > > > Yeah, I know about those, but I have 5 drives in my enclosures, so that's > > one short :) > > I think you misunderstood. I was referring to a JBOD chassis with SAS > expander, up to 32 drives, typically 12-24 drives with two host or two > daisy chain ports. Maybe an example would help here. > > http://www.newegg.com/Product/Product.aspx?Item=N82E16816133047 Ah, yes, that. So indeed in the price category of a PMP chassis with 5 drives ($150-ish), I haven't found anything that isn't PMP or 5 sata cables. > Yeah, something definitely not right. Your RAID throughput is less than > a single 7.2K SATA drive. It's probably just something funky with that > JBOD chassis. That's also possible. If/when I have time, I'll unplug things and plug the drives directly to the card as well as try another MB. Thanks, Marc -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems .... .... what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ ^ permalink raw reply [flat|nested] 41+ messages in thread
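Before re-cabling, it may be worth confirming what each drive itself negotiated end to end; recent smartmontools print both the drive's maximum and current link rate, and libata exposes the same through sysfs on reasonably recent kernels (device name is a placeholder):

smartctl -i /dev/sdm | grep -i 'SATA Version'
cat /sys/class/ata_link/link*/sata_spd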
* Re: Very long raid5 init/rebuild times 2014-01-28 7:46 ` Stan Hoeppner 2014-01-28 16:50 ` Marc MERLIN @ 2014-01-30 20:47 ` Phillip Susi 2014-02-01 22:39 ` Stan Hoeppner 1 sibling, 1 reply; 41+ messages in thread From: Phillip Susi @ 2014-01-30 20:47 UTC (permalink / raw) To: stan, Marc MERLIN; +Cc: linux-raid -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 1/28/2014 2:46 AM, Stan Hoeppner wrote: > You usually must submit asynchronously or in parallel to reach > maximum throughput. Being limited by a PMP it may not matter. But > with your direct connected drives of your production array you > should see a substantial increase in throughput with parallel > submission. Not for streaming IO; you just need to make sure your cache is big enough so the drive is never waiting for the app. > To significantly increase single streaming throughput you need AIO. > A faster CPU won't make any difference. Neither will a better SATA > card, unless your current one is defective, or limits port > throughput will more than one port active--I've heard of couple > that do so. What AIO gets you is the ability to use O_DIRECT to avoid a memory copy to/from the kernel page cache. That saves you some cpu time, but doesn't make *that* much difference unless you have a crazy fast storage array, or crazy slow ram. And since almost nobody uses it, it's a bit of an unrealistic benchmark. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.17 (MingW32) Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iQEcBAEBAgAGBQJS6rpiAAoJEI5FoCIzSKrw6oMH/jDcFOs8Gu2wjbbuE1eoGtG7 aHeUvF6klWWV5VWCVBd4tHieVkj1zyg3nQa3DGaOvqBnz6mtIQUx6Pg5MgYkJAhD EY1f3zVH+hxBEyJwwmMIDIyVsDCbdsryKndfPuYolaqNSgXLyWpAcL6g/SM9vjoG nH29w1GC3TJP5Py1DNP4P04Q+kJMTYnY/4AFJOtsMRK5XRpno784YZauS/basEH3 rpSf/JvhcZMbk6nE8jkqIYnMbA35E8f+GfSa60epqDSSM3hU5U1xYnh6vCZSSndK pMCFv26O9AVoFdyPZTJwM32gqGXdsGkDanK2+0y/j2im5IT0PxKCWO+uCLO/1mQ= =9NYg -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Very long raid5 init/rebuild times 2014-01-30 20:47 ` Phillip Susi @ 2014-02-01 22:39 ` Stan Hoeppner 2014-02-02 18:53 ` Phillip Susi 0 siblings, 1 reply; 41+ messages in thread From: Stan Hoeppner @ 2014-02-01 22:39 UTC (permalink / raw) To: Phillip Susi, Marc MERLIN; +Cc: linux-raid On 1/30/2014 2:47 PM, Phillip Susi wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > On 1/28/2014 2:46 AM, Stan Hoeppner wrote: >> You usually must submit asynchronously or in parallel to reach >> maximum throughput. Being limited by a PMP it may not matter. But >> with your direct connected drives of your production array you >> should see a substantial increase in throughput with parallel >> submission. > > Not for streaming IO; you just need to make sure your cache is big > enough so the drive is never waiting for the app. > >> To significantly increase single streaming throughput you need AIO. >> A faster CPU won't make any difference. Neither will a better SATA >> card, unless your current one is defective, or limits port >> throughput will more than one port active--I've heard of couple >> that do so. > > What AIO gets you is the ability to use O_DIRECT to avoid a memory > copy to/from the kernel page cache. That saves you some cpu time, but > doesn't make *that* much difference unless you have a crazy fast > storage array, or crazy slow ram. And since almost nobody uses it, > it's a bit of an unrealistic benchmark. Phillip, you seem to be arguing application performance. For an application/data type that doesn't need fsync you'd be correct. However, the purpose of this exchange has been to determine the maximum hardware throughput of the OP's array. It's not possible to accurately measure IO throughput doing buffered writes. Thus O_DIRECT is needed. But, as I already stated, because O_DIRECT is synchronous with significant completion latency, regardless of CPU/RAM speed, a single write stream typically won't saturate the storage. Thus, one needs to use either use AIO or parallel submission, or possibly both, to saturate the storage. -- Stan ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Very long raid5 init/rebuild times 2014-02-01 22:39 ` Stan Hoeppner @ 2014-02-02 18:53 ` Phillip Susi 2014-02-03 6:34 ` Stan Hoeppner 0 siblings, 1 reply; 41+ messages in thread From: Phillip Susi @ 2014-02-02 18:53 UTC (permalink / raw) To: stan, Marc MERLIN; +Cc: linux-raid -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA512 On 02/01/2014 05:39 PM, Stan Hoeppner wrote: > Phillip, you seem to be arguing application performance. For an > application/data type that doesn't need fsync you'd be correct. > > However, the purpose of this exchange has been to determine the > maximum hardware throughput of the OP's array. It's not possible > to accurately measure IO throughput doing buffered writes. Thus > O_DIRECT is needed. No, you can get there just fine with buffered IO as well, unless you have an obscenely fast array and very slow ram, the overhead of buffering won't really matter for IO throughput ( just the extra cpu ). -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.14 (GNU/Linux) Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iQEcBAEBCgAGBQJS7pRFAAoJEI5FoCIzSKrw/5AH/2EUXwPjdaqKyg9oD5igX92p VLLjCCDJyzd2MaC3u4DPpQUtzIXPEkjuKT6hBMNwx/+gFnODXQjssZ3siXVgc2Mi JAGYwRWYbLDYQscKagYyQFDiNjg5b1zA/KEjKYTO5hpkFQELDgE115PPn6NhV/Q9 r/vjzEtvDLjuN5Y7uQDpZv87bEy2O7aJX1TFygPN0MazJo6O93yFfweUwI2JUm1H q1DL96Evd/CCfzhPWiPnWrP4MpuUG7B1OKjwfXnlw5tW1tj7wRA6tRnyuCeAHCfb 1/wvxXm426rxflSNa85sP///amh8mnccN+4ZcOYEql4XSt8qh8G2sdH35Kp3og8= =BnZb -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Very long raid5 init/rebuild times 2014-02-02 18:53 ` Phillip Susi @ 2014-02-03 6:34 ` Stan Hoeppner 2014-02-03 14:42 ` Phillip Susi 0 siblings, 1 reply; 41+ messages in thread From: Stan Hoeppner @ 2014-02-03 6:34 UTC (permalink / raw) To: Phillip Susi, Marc MERLIN; +Cc: linux-raid On 2/2/2014 12:53 PM, Phillip Susi wrote: > On 02/01/2014 05:39 PM, Stan Hoeppner wrote: >> It's not possible >> to accurately measure IO throughput doing buffered writes. Thus >> O_DIRECT is needed. > No, you can get there just fine with buffered IO as well, unless you > have an obscenely fast array and very slow ram, the overhead of > buffering won't really matter for IO throughput ( just the extra cpu ). Please reread my statement above. Now let me restate that as: Measuring disk throughput when writing through the buffer cache isn't a measurement of disk throughput as much as it is a measurement of cache throughput. Thus, such measurements do not demonstrate actual disk throughput. Do you disagree? -- Stan ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Very long raid5 init/rebuild times 2014-02-03 6:34 ` Stan Hoeppner @ 2014-02-03 14:42 ` Phillip Susi 2014-02-04 3:30 ` Stan Hoeppner 0 siblings, 1 reply; 41+ messages in thread From: Phillip Susi @ 2014-02-03 14:42 UTC (permalink / raw) To: stan, Marc MERLIN; +Cc: linux-raid -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 2/3/2014 1:34 AM, Stan Hoeppner wrote: > Please reread my statement above. Now let me restate that as: > > Measuring disk throughput when writing through the buffer cache > isn't a measurement of disk throughput as much as it is a > measurement of cache throughput. Thus, such measurements do not > demonstrate actual disk throughput. > > Do you disagree? Yes, I do because cache throughput is >>>> disk throughput. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.17 (MingW32) Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iQEcBAEBAgAGBQJS76rMAAoJEI5FoCIzSKrw+1oH/1xdPTZwoaw4MKQrwQV22sgM zu1BXTs1+/wEjyxJdwr9Rpa6W/aLlaMYmoriNWVXG+MLm2aemrGq4nHD5i3GhESU T1R65IY92fVPqCAYNUjYQftGryYcZjWxiNurHI4/Tt0BH4hPn0Ol34xwuTE7/mg7 ozn7mzqYFxJltomRjtARuXulJz4DW5p0tKjsgBRnqAqXyywU/bEC5fpb0xuqhNWK xm4asE9FPJxCxV8QuqQUwehU+IAQ3ObqgDIsdcCX0wSZurDQCbPBfbW2OuYpSBSd QOrjulMciutBxIyejA1OC20z9DEPhoWfJEn2HSUrMEMgpvuz/+q7ybBkCcHDs5Y= =4ti3 -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Very long raid5 init/rebuild times 2014-02-03 14:42 ` Phillip Susi @ 2014-02-04 3:30 ` Stan Hoeppner 2014-02-04 17:59 ` Larry Fenske 0 siblings, 1 reply; 41+ messages in thread From: Stan Hoeppner @ 2014-02-04 3:30 UTC (permalink / raw) To: Phillip Susi; +Cc: linux-raid On 2/3/2014 8:42 AM, Phillip Susi wrote: > On 2/3/2014 1:34 AM, Stan Hoeppner wrote: >> Please reread my statement above. Now let me restate that as: >> >> Measuring disk throughput when writing through the buffer cache >> isn't a measurement of disk throughput as much as it is a >> measurement of cache throughput. Thus, such measurements do not >> demonstrate actual disk throughput. >> >> Do you disagree? > > Yes, I do because cache throughput is >>>> disk throughput. It is because buffer cache throughput is greater that measurements of disk throughput are not accurate. If one issues a sync after writing through buffer cache the measured throughput should be fairly close. But without issuing a sync you're measuring buffer cache throughput. Thus, as I said previously, it is better to do parallel O_DIRECT writes or use AIO with O_DIRECT for testing disk throughput as one doesn't have to worry about these buffer cache issues. -- Stan ^ permalink raw reply [flat|nested] 41+ messages in thread
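In dd terms, "issuing a sync" can be folded into the measurement itself so the reported rate includes the time to flush the cache out to the disks; for example (path is a placeholder):

dd if=/dev/zero of=/mnt/btrfs_pool1/ddtest bs=1M count=4096 conv=fdatasync

Without conv=fdatasync (or oflag=direct), dd reports how fast the page cache absorbed the data, which is the cache-throughput number being objected to above.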
* Re: Very long raid5 init/rebuild times 2014-02-04 3:30 ` Stan Hoeppner @ 2014-02-04 17:59 ` Larry Fenske 2014-02-04 18:08 ` Phillip Susi 0 siblings, 1 reply; 41+ messages in thread From: Larry Fenske @ 2014-02-04 17:59 UTC (permalink / raw) To: stan, Phillip Susi; +Cc: linux-raid On 02/03/2014 08:30 PM, Stan Hoeppner wrote: > On 2/3/2014 8:42 AM, Phillip Susi wrote: > >> On 2/3/2014 1:34 AM, Stan Hoeppner wrote: >>> Please reread my statement above. Now let me restate that as: >>> >>> Measuring disk throughput when writing through the buffer cache >>> isn't a measurement of disk throughput as much as it is a >>> measurement of cache throughput. Thus, such measurements do not >>> demonstrate actual disk throughput. >>> >>> Do you disagree? >> Yes, I do because cache throughput is >>>> disk throughput. > It is because buffer cache throughput is greater that measurements of > disk throughput are not accurate. If one issues a sync after writing > through buffer cache the measured throughput should be fairly close. > But without issuing a sync you're measuring buffer cache throughput. > > Thus, as I said previously, it is better to do parallel O_DIRECT writes > or use AIO with O_DIRECT for testing disk throughput as one doesn't have > to worry about these buffer cache issues. > Perhaps Phillip is doing the obvious and only measuring throughput after the cache is full. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Very long raid5 init/rebuild times 2014-02-04 17:59 ` Larry Fenske @ 2014-02-04 18:08 ` Phillip Susi 2014-02-04 18:43 ` Stan Hoeppner 0 siblings, 1 reply; 41+ messages in thread From: Phillip Susi @ 2014-02-04 18:08 UTC (permalink / raw) To: Larry Fenske, stan; +Cc: linux-raid -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 2/4/2014 12:59 PM, Larry Fenske wrote: > Perhaps Phillip is doing the obvious and only measuring throughput > after the cache is full. Or the obvious and dropping the cache first, the way hdparm -t does. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.17 (MingW32) Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iQEcBAEBAgAGBQJS8SygAAoJEI5FoCIzSKrwdesH/3ZuonrTN58MTmKalrdbmrvO P9UJCY90Nzv57TBm6POzipDKe7cdBjBRSr97DU2Ea8yOqxTo9ErbbS2prUmDeC04 RrTUJnWw5eP5Zrt2TT4tnUJbKmmhxbXMxTPz8ZrzaV/2PJzh2PWbj5HjGgceyCVS 2V7iuMJfpPvL/EiiTm32gXVAp9FlWtOpiKdBg+eaD4UfLemYMObRScbhmS0+1XYh lQ7Ce7RUE+y0zkgaeLSBDTpyUL+mVF6l9C2q0dxK8U2mhUChQl7RX81yk3ePD6DD w+SzG+UaQjdNM9jbldHKdXSYT04u/ZX9KmZszFQGPRGvvw9tMxihjUmmawN9GWQ= =2/Lt -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Very long raid5 init/rebuild times 2014-02-04 18:08 ` Phillip Susi @ 2014-02-04 18:43 ` Stan Hoeppner 2014-02-04 18:55 ` Phillip Susi 0 siblings, 1 reply; 41+ messages in thread From: Stan Hoeppner @ 2014-02-04 18:43 UTC (permalink / raw) To: Phillip Susi, Larry Fenske; +Cc: linux-raid On 2/4/2014 12:08 PM, Phillip Susi wrote: > Or the obvious and dropping the cache first, the way hdparm -t does. "hdparm -t" is a read test. Everything we've been discussing has been about maximizing write throughput. The fact that you argue this at this point makes it crystal clear that you have no understanding of the differences in the read/write paths and how buffer cache affects each differently. Further discussion is thus pointless. -- Stan ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Very long raid5 init/rebuild times 2014-02-04 18:43 ` Stan Hoeppner @ 2014-02-04 18:55 ` Phillip Susi 2014-02-04 19:15 ` Stan Hoeppner 0 siblings, 1 reply; 41+ messages in thread From: Phillip Susi @ 2014-02-04 18:55 UTC (permalink / raw) To: stan, Larry Fenske; +Cc: linux-raid -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 2/4/2014 1:43 PM, Stan Hoeppner wrote: > Everything we've been discussing has been about maximizing write > throughput. The fact that you argue this at this point makes it > crystal clear that you don't have no understanding of the > differences in the read/write paths and how buffer cache affects > each differently. Further discussion is thus pointless. I am intimately familiar with the two code paths, having written several applications using them, studied the kernel code extensively, and been one of the original strong advocates for the kernel to grow direct aio apis in the first place, since it worked swimmingly well on WinNT. So I say again: switching to direct aio, while saving a decent chunk of cpu time, makes very little difference in streaming write throughput. If it did, there would be something terribly broken with the buffer cache if it couldn't keep the disk queues full. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.17 (MingW32) Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iQEcBAEBAgAGBQJS8TeVAAoJEI5FoCIzSKrwMJ0IAJB3+mGIVXJ+qMCHwSGjFI7G dNJp8/0NFdy42Eww1Yu3EOBRPim4sDxmRE6bqRX9Ytbb5jqnlr22c/fjcUmqH3wr fO7qqj2T6FiaFgrudFNukCAqRiCWTS3nkxzrAs5HV1PBukJtAXugQEBYEtHcVZ7l EoeTu16N70RMnywK0vbHx7Gqx9AOps9xe6qyStN7KptgGbkX/b0OkDLRjSedLput qQNyLA8/kuoGfVvswSzKqneK/CC0GAdbdQt0rP0hC3Icsh2qKQZLdsAwKgL3L0f6 zyALvUBvuRSD6ZQW8VdNI+i4BCyeYSCwZT/5pPXI5AtRZtIUkymQkZtW7cjNKpM= =ILJS -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Very long raid5 init/rebuild times 2014-02-04 18:55 ` Phillip Susi @ 2014-02-04 19:15 ` Stan Hoeppner 2014-02-04 20:16 ` Phillip Susi 0 siblings, 1 reply; 41+ messages in thread From: Stan Hoeppner @ 2014-02-04 19:15 UTC (permalink / raw) To: Phillip Susi, Larry Fenske; +Cc: linux-raid On 2/4/2014 12:55 PM, Phillip Susi wrote: > On 2/4/2014 1:43 PM, Stan Hoeppner wrote: >> Everything we've been discussing has been about maximizing write >> throughput. The fact that you argue this at this point makes it >> crystal clear that you have no understanding of the >> differences in the read/write paths and how buffer cache affects >> each differently. Further discussion is thus pointless. > > I am intimately familiar with the two code paths, having written > several applications using them, studied the kernel code extensively, > and been one of the original strong advocates for the kernel to grow > direct aio apis in the first place, since it worked swimmingly well on > WinNT. > > So I say again: switching to direct aio, while saving a decent chunk > of cpu time, makes very little difference in streaming write > throughput. If it did, there would be something terribly broken with > the buffer cache if it couldn't keep the disk queues full. If all this is true, then why do you keep making tangential arguments that are not relevant? I never argued that the buffer cache path is slower. It is in fact much faster in most cases. I argued that accurately measuring the actual data throughput at the disks isn't possible when writing through buffer cache. At least not in a straightforward manner as with O_DIRECT. I've made the point in the last two or three replies. Yet instead of directly addressing that, rebutting that, you keep making these tangential irrelevant arguments... -- Stan ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Very long raid5 init/rebuild times 2014-02-04 19:15 ` Stan Hoeppner @ 2014-02-04 20:16 ` Phillip Susi 2014-02-04 21:58 ` Stan Hoeppner 0 siblings, 1 reply; 41+ messages in thread From: Phillip Susi @ 2014-02-04 20:16 UTC (permalink / raw) To: stan, Larry Fenske; +Cc: linux-raid -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 2/4/2014 2:15 PM, Stan Hoeppner wrote: > I never argued that the buffer cache path is slower. It is in fact > much faster in most cases. > > I argued that accurately measuring the actual data throughput at > the disks isn't possible when writing through buffer cache. At > least not in a straightforward manner as with O_DIRECT. I've made > the point in the last two or three replies. Yet instead of > directly addressing that, rebutting that, you keep making these > tangential irrelevant arguments... You originally said "To significantly increase single streaming throughput you need AIO." Now you appear to be saying otherwise. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.17 (MingW32) Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iQEcBAEBAgAGBQJS8UqkAAoJEI5FoCIzSKrwIMYH/AvTA9Z1+SGarMym9hy1DPis czb3W4+MVeisH2QwyfTbMPkicOw4pffa3Hc9ZLLI0yvGnU8b6XbFvG+2sWYBtqhj HQX1Osjy0ZP7GuVU5TtydbNNXba4f+iIm6FIpzX3eseAjZgBJeDeG2s0oePw8q/d b+P0PAZSqA99CNNpqOw7GTnYZqh++SM9CYPmr7KC4LYFyaqklj3eS0XQPDT+rbej K1ly2ZibE348Nol/A6gT63x6WnuMFj4jAUK40O//farqbjsDTJgWaz9x0aV58c6x Uxsrzx9a92tZC5AL0BW6dqLeY0BMZ3j9hqF/51nz5MrGtzm7qninrsSp1hJJRuM= =Ls3I -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Very long raid5 init/rebuild times 2014-02-04 20:16 ` Phillip Susi @ 2014-02-04 21:58 ` Stan Hoeppner 2014-02-05 1:19 ` Phillip Susi 0 siblings, 1 reply; 41+ messages in thread From: Stan Hoeppner @ 2014-02-04 21:58 UTC (permalink / raw) To: Phillip Susi, Larry Fenske; +Cc: linux-raid On 2/4/2014 2:16 PM, Phillip Susi wrote: > On 2/4/2014 2:15 PM, Stan Hoeppner wrote: >> I never argued that the buffer cache path is slower. It is in fact >> much faster in most cases. >> >> I argued that accurately measuring the actual data throughput at >> the disks isn't possible when writing through buffer cache. At >> least not in a straightforward manner as with O_DIRECT. I've made >> the point in the last two or three replies. Yet instead of >> directly addressing that, rebutting that, you keep making these >> tangential irrelevant arguments... > > You originally said "To significantly increase single streaming > throughput you need AIO." Yes, I stated that in this post http://www.spinics.net/lists/raid/msg45726.html in the context of achieving greater throughput with an FIO job file configured to use O_DIRECT, a job file I created, that the OP was using for testing. That job file is quoted further down in this same post, and is included in my posts prior this one in the thread. Apparently you ignored them. The context of my comment above is clearly established multiple times earlier in the thread. In my paragraph directly preceding the statement you quote above, I stated this: "Serial submission typically doesn't reach peak throughput... You usually must submit asynchronously or in parallel to reach maximum throughput." And again this is in the context of the FIO job file using O_DIRECT, and this statement is factual. As I repeated earlier today, O_DIRECT is used because measuring actual throughput at the disks is straightforward. To increase O_DIRECT write throughput in FIO, you typically need parallel submission or AIO. This is well known. > Now you appear to be saying otherwise. No, I have not contradicted myself Phillip. I've stated the same thing until becoming blue in the face, and it's quite frustrating. The fact of the matter is that you took a single sentence out of context in a very long thread, and attacked it, attacked me, as being wrong, when in fact I am, and have been, correct throughout. Context matters, always. -- Stan ^ permalink raw reply [flat|nested] 41+ messages in thread
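The effect being described is easy to reproduce from the command line by running the same O_DIRECT write once through a synchronous engine and once through libaio with a deeper queue; file name and size are arbitrary:

fio --name=directsync --filename=/mnt/btrfs_pool1/fiotest --rw=write --bs=1M --size=1g --direct=1 --ioengine=sync
fio --name=directaio --filename=/mnt/btrfs_pool1/fiotest --rw=write --bs=1M --size=1g --direct=1 --ioengine=libaio --iodepth=16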
* Re: Very long raid5 init/rebuild times 2014-02-04 21:58 ` Stan Hoeppner @ 2014-02-05 1:19 ` Phillip Susi 2014-02-05 1:42 ` Stan Hoeppner 0 siblings, 1 reply; 41+ messages in thread From: Phillip Susi @ 2014-02-05 1:19 UTC (permalink / raw) To: stan, Larry Fenske; +Cc: linux-raid -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA512 On 02/04/2014 04:58 PM, Stan Hoeppner wrote: > Yes, I stated that in this post > > http://www.spinics.net/lists/raid/msg45726.html > > in the context of achieving greater throughput with an FIO job > file configured to use O_DIRECT, a job file I created, that the OP > was using for testing. That job file is quoted further down in > this same post, and is included in my posts prior this one in the > thread. Apparently you ignored them. The context of my comment > above is clearly established multiple times earlier in the thread. > > In my paragraph directly preceding the statement you quote above, > I stated this: > > "Serial submission typically doesn't reach peak throughput... You > usually must submit asynchronously or in parallel to reach maximum > throughput." > > And again this is in the context of the FIO job file using > O_DIRECT, and this statement is factual. As I repeated earlier > today, O_DIRECT is used because measuring actual throughput at the > disks is straightforward. To increase O_DIRECT write throughput in > FIO, you typically need parallel submission or AIO. This is well > known. Ahh, I did not gather that O_DIRECT was already assumed. In that case, then I was simply restating the same thing: that you want aio with O_DIRECT, but otherwise, buffered IO works fine too ( which is what the OP was using with dd, which is why it sounded like you were saying not to do that, that you must use O_DIRECT + aio because buffered IO won't get you the performance you're looking for ). -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.14 (GNU/Linux) Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iQEcBAEBCgAGBQJS8ZGfAAoJEI5FoCIzSKrw43gH/2sUIEVB97YOGpTj5H8XkySb wJAQxU//LyZRcUiK37TNeIF+6QUfqVtD/VFYxjTFfV8gmLSmu7JzwfMZQjJ2Rrb5 I08Pks2xCrU/XvfLKqum5JQHreJaz8jQQVIByXAziDAj+H46k5NV34rUNDP5glyk 18uKN1ty0//jyKNlzWhRZllw7Uo7CAvJvfTHSxvoTGgTmzeea2Q6eADIv0Ov96Lb ZeNKnZXTwDyIXskEduDToWQdGL01TYSKXiV8zTqnhMsMBUZ33oE7r5l+a/o/m6Kv ZKWE+JG/5xzZiFipNj1ELYuPwM/SD6cCPBRfwh2tWmKTG3Z/waD+kjytIwieUDY= =1T19 -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Very long raid5 init/rebuild times 2014-02-05 1:19 ` Phillip Susi @ 2014-02-05 1:42 ` Stan Hoeppner 0 siblings, 0 replies; 41+ messages in thread From: Stan Hoeppner @ 2014-02-05 1:42 UTC (permalink / raw) To: Phillip Susi, Larry Fenske; +Cc: linux-raid On 2/4/2014 7:19 PM, Phillip Susi wrote: > On 02/04/2014 04:58 PM, Stan Hoeppner wrote: >> Yes, I stated that in this post >> >> http://www.spinics.net/lists/raid/msg45726.html >> >> in the context of achieving greater throughput with an FIO job >> file configured to use O_DIRECT, a job file I created, that the OP >> was using for testing. That job file is quoted further down in >> this same post, and is included in my posts prior this one in the >> thread. Apparently you ignored them. The context of my comment >> above is clearly established multiple times earlier in the thread. >> >> In my paragraph directly preceding the statement you quote above, >> I stated this: >> >> "Serial submission typically doesn't reach peak throughput... You >> usually must submit asynchronously or in parallel to reach maximum >> throughput." >> >> And again this is in the context of the FIO job file using >> O_DIRECT, and this statement is factual. As I repeated earlier >> today, O_DIRECT is used because measuring actual throughput at the >> disks is straightforward. To increase O_DIRECT write throughput in >> FIO, you typically need parallel submission or AIO. This is well >> known. > > Ahh, I did not gather that O_DIRECT was already assumed. In that > case, then I was simply restating the same thing: that you want aio > with O_DIRECT, but otherwise, buffered IO works fine too ( which is > what the OP was using with dd, which is why it sounded like you were > saying not to do that, that you must use O_DIRECT + aio because > buffered IO won't get you the performance you're looking for ). I guess maybe I wasn't clear with my wording at that time. Yes, IIRC he was doing dd through buffer cache. My point to him was that O_DIRECT dd gives more accurate throughput numbers, but a single stream may not be sufficient to peak the disks. Which is why I recommended FIO and provided a job file, as it can do multiple O_DIRECT streams. AIO can reduce dispatch latency thus increasing throughput, but not to the extent that multiple streams will, as the latter can fully overlap the single stream dispatch latency, keeping the pipeline full. -- Stan ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Very long raid5 init/rebuild times 2014-01-24 5:13 ` Stan Hoeppner 2014-01-25 8:36 ` Marc MERLIN @ 2014-01-30 20:36 ` Phillip Susi 1 sibling, 0 replies; 41+ messages in thread From: Phillip Susi @ 2014-01-30 20:36 UTC (permalink / raw) To: stan, Marc MERLIN; +Cc: linux-raid -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 1/24/2014 12:13 AM, Stan Hoeppner wrote: > The initial resync is read-only. It won't modify anything unless > there's a discrepancy. So the stripe cache isn't in play. The > larger stripe cache should indeed increase rebuild rate though. What? That makes no sense. It doesn't really take any longer to write the parity than to read it, and the odds are pretty good that it is going to need to write it anyhow, so doing a read first would waste a *lot* of time. I think you are thinking of when you manually issue a check_repair on the array after it is fully initialized. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.17 (MingW32) Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iQEcBAEBAgAGBQJS6rexAAoJEI5FoCIzSKrwm8sIAJ93OT39nxYau3//33skn0lW XOT0z+EyBvAMBHMz4x36GJilHsd9gNwRLSETHu3sSTFve+0hkTWPRfLFt+OMkX2S 30hZnVWh+fd02enTE3uw6kaCcuU709hKDwCUOf1wQhm3bUeJGIOTRrkqnGtKpncR 9qxENIdhcRrMIkh1F1dmJOZvehlGc6doa1ddodM8QfSESlQtTu8N9nyxAbVtLf5S lGebMF+dsf3DEK1UXn7RUus8IE28DifWQnMlfUPtI7u2A50cjADQh9mu1moJdDdo HVgz/Y5sveYq5KfRMco4cNVGQiyR3t1LYmZQFTcAcCUqbQ6PctXp3pCEfogLYTw= =zlqj -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 41+ messages in thread
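For reference, those manual passes are driven through sysfs on an assembled array; a read-only check versus a repair, assuming the array is md5:

echo check > /sys/block/md5/md/sync_action
cat /sys/block/md5/md/mismatch_cnt    # non-zero after the check means parity discrepancies were found
echo repair > /sys/block/md5/md/sync_action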
* Re: Very long raid5 init/rebuild times 2014-01-23 12:24 ` Stan Hoeppner 2014-01-23 21:01 ` Marc MERLIN @ 2014-01-30 20:18 ` Phillip Susi 1 sibling, 0 replies; 41+ messages in thread From: Phillip Susi @ 2014-01-30 20:18 UTC (permalink / raw) To: stan, Marc MERLIN; +Cc: linux-raid -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 1/23/2014 7:24 AM, Stan Hoeppner wrote: > Increasing stripe_cache_size above the default as I suggested will > ALWAYS increase write speed, often by a factor of 2-3x or more on > modern hardware. It should speed up destructive resyncs > considerably, as well as normal write IO. Once your array has > settled down after the inits and resyncs and what not, run some > parallel FIO write tests with the default of 256 and then with > 2048. You can try 4096 as well, but with 5 rusty drives 4096 will > probably cause a slight tailing off of throughput. 2048 should be > your sweet spot. You can also just time a few large parallel file > copies. You'll be amazed at the gains. I have never seen it make a difference on destructive syncs on 3 or 4 disk raid5 arrays and when you think about it, that makes perfect sense. The point of the cache is to coalesce small random writes into full stripes so you don't have to do a RMW cycle, so for streaming IO, it isn't going to do a thing. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.17 (MingW32) Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iQEcBAEBAgAGBQJS6rOcAAoJEI5FoCIzSKrwz0YIAJMJG6C0aePvKfXJlQy1mmbp AaFHQX8MgIgj3kgiNj8s83uWEdVZfzaQpcc3oJcB0PD2FTHxt+204e8C2wZz0b5N 4zZim1YRec67LTRkLwNeko5HrkiapmWf0FYmx95d3gNvb0UGUbh98hRItgSX78NS Lu+afQWOLCqiv3UjMHpG4Blb37oT0cp2pttuGKbDZTS4OSgd/qWRvcQ5sRQ09338 n40EiCWaIIlWlSQJ0r6GUylTOiys+JziB+qcHK6SvK+9gF3VdYN955GXNKxMn7t1 A3rFbeVsC0MSRcMytVmMk3cFQDn+xhZ1ZvHvkbLagj9i0uLcC+1cR6KIeScdQC4= =iFoR -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 41+ messages in thread
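The tunable being debated is cheap to experiment with; assuming md5, reading the default and raising it looks like the following, at a memory cost of roughly stripe_cache_size x page size x number of member devices:

cat /sys/block/md5/md/stripe_cache_size     # default is 256
echo 2048 > /sys/block/md5/md/stripe_cache_size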
* Re: Opal 2.0 SEDs on linux, was: Very long raid5 init/rebuild times 2014-01-22 7:55 ` Stan Hoeppner 2014-01-22 17:48 ` Marc MERLIN @ 2014-01-22 19:38 ` Chris Murphy 1 sibling, 0 replies; 41+ messages in thread From: Chris Murphy @ 2014-01-22 19:38 UTC (permalink / raw) To: linux-raid On Jan 22, 2014, at 12:55 AM, Stan Hoeppner <stan@hardwarefreak.com> wrote: > > You are not CPU bound, nor hardware bandwidth bound. You are latency > bound, just like every dmcrypt user. dmcrypt adds a non trivial amount > of latency to every IO. Latency with serial IO equals low throughput. A self-encrypting drive (a drive that always & only writes ciphertext to the platters) I'd expect totally eliminates this. It's difficult finding information whether and how these drives are configurable purely for data drives (non-bootable) which ought to be easier to implement than the boot use case. For booting, it appears necessary to have computer firmware that explicitly supports the TCG OPAL spec: an EFI application might be sufficient for unlocking the drive(s) at boot time, but I'm not certain about that. It seems necessary for hibernate support that firmware and kernel support is required. Chris Murphy ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: Very long raid5 init/rebuild times 2014-01-21 7:35 Very long raid5 init/rebuild times Marc MERLIN 2014-01-21 16:37 ` Marc MERLIN @ 2014-01-21 18:31 ` Chris Murphy 2014-01-22 13:46 ` Ethan Wilson 2 siblings, 0 replies; 41+ messages in thread From: Chris Murphy @ 2014-01-21 18:31 UTC (permalink / raw) To: linux-raid On Jan 21, 2014, at 12:35 AM, Marc MERLIN <marc@merlins.org> wrote: > Howdy, > > I'm setting up a new array with 5 4TB drives for which I'll use dmcrypt. > > Question #1: > Is it better to dmcrypt the 5 drives and then make a raid5 on top, or the opposite > (raid5 first, and then dmcrypt) I'm pretty sure there are multithreading patches for dmcrypt, but I don't know what versions it's accepted into, and that's the question you need to answer. If the version you're using still only supports one encryption thread per block device, then you want to dmcrypt 5 drives, then create md on top of that. This way you get 5 threads doing encryption rather than 1 thread and hence one core trying to encrypt for 5 drives. > I suppose that 1 day-ish rebuild times are kind of a given with 4TB drives anyway? 4000000MB / 135MB/s = 8.23 hours. So 24 hours seems like a long time and I'd wonder what the bottleneck is. > Question #3: > Since I'm going to put btrfs on top, I'm almost tempted to skip the md raid5 > layer and just use the native support, but the raid code in btrfs still > seems a bit younger than I'm comfortable with. > Is anyone using it and has done disk failures, replaces, and all? Like I mentioned on linux-btrfs@ it's there for testing. If you're prepared to use it with the intent of breaking it and reporting breakage, then great, that's useful. If you're just being impatient, expect to get bitten. And it's undergoing quite a bit of changes in btrfs-next. I'm not sure if those changes will appear in 3.14 or 3.15 Chris Murphy ^ permalink raw reply [flat|nested] 41+ messages in thread
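A rough sketch of that crypt-below-md ordering, assuming the five partitions have already been luksFormat-ed and with the mapper names chosen arbitrarily:

for d in m n o p q; do cryptsetup luksOpen /dev/sd${d}1 crypt_sd${d}; done
mdadm --create /dev/md5 --level=5 --raid-devices=5 /dev/mapper/crypt_sd[mnopq]

Each mapping then gets its own encryption thread, at the cost of entering (or scripting) the passphrase five times.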
* Re: Very long raid5 init/rebuild times
From: Ethan Wilson @ 2014-01-22 13:46 UTC (permalink / raw)
To: linux-raid

On 21/01/2014 08:35, Marc MERLIN wrote:
> Howdy,
>
> I'm setting up a new array with 5 4TB drives for which I'll use dmcrypt.
>
> Question #1:
> Is it better to dmcrypt the 5 drives and then make a raid5 on top, or the opposite
> (raid5 first, and then dmcrypt)

Crypt above (dmcrypt on top of the md array), or you will need to enter
the password 5 times. Array checks and rebuilds would also be slower.
And when working at a low level with mdadm commands, it would be too
easy to get confused and specify an underlying volume instead of the
one above LUKS, wiping all data as a result. A sketch of the "crypt
above" layering follows this message.

> Question #2:
> In order to copy data from a working system, I connected the drives via an external
> enclosure which uses a SATA PMP. As a result, things are slow:
>
> md5 : active raid5 dm-7[5] dm-6[3] dm-5[2] dm-4[1] dm-2[0]
>       15627526144 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/4] [UUUU_]
>       [>....................]  recovery =  0.9% (35709052/3906881536) finish=3406.6min speed=18939K/sec
>       bitmap: 0/30 pages [0KB], 65536KB chunk
>
> 2.5 days for an init or rebuild is going to be painful.
> I already checked that I'm not CPU/dmcrpyt pegged.
>
> I read Neil's message why init is still required:
> http://marc.info/?l=linux-raid&m=112044009718483&w=2
> even if somehow on brand new blank drives full of 0s I'm thinking this could be faster
> by just assuming the array is clean (all 0s give a parity of 0).
> Is it really unsafe to do so? (actually if you do this on top of dmcrypt
> like I did here, I won't get 0s, so that way around, it's unfortunately
> necessary).

Yes, it is unsafe, because raid5 does shortcut read-modify-write (RMW):
it uses the current (wrong) parity to compute the new parity, which
therefore also comes out wrong. The parities of your array will never
be correct, so you won't be able to withstand a disk failure. You need
to do the initial init/rebuild. You can start writing to the array
right away, but keep in mind that such data will only be safe after the
first init/rebuild has completed.

> I suppose that 1 day-ish rebuild times are kind of a given with 4TB drives anyway?

I think around 13 hours if your connections to the disks are fast.

> Question #3:
> Since I'm going to put btrfs on top, I'm almost tempted to skip the md raid5
> layer and just use the native support, but the raid code in btrfs still
> seems a bit younger than I'm comfortable with.

Native btrfs raid5 is WAY experimental at this stage. Only raid0/1/10
is reasonably stable right now.

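A minimal sketch of the "crypt above" ordering Ethan recommends, i.e.
a single LUKS container on top of the finished array rather than five
containers below it. The device names, chunk size, and mapping name are
illustrative assumptions, not a recipe given in the thread:

  # build the raid5 from the raw partitions first ...
  mdadm --create /dev/md5 --level=5 --raid-devices=5 --chunk=512 \
      /dev/sdm1 /dev/sdn1 /dev/sdo1 /dev/sdp1 /dev/sdq1
  # ... then one LUKS layer on the md device, unlocked once with one password
  cryptsetup luksFormat /dev/md5
  cryptsetup luksOpen /dev/md5 md5crypt
  # the filesystem goes on the opened mapping
  mkfs.btrfs /dev/mapper/md5crypt

With this ordering, mdadm only ever sees the raw partitions and
/dev/md5, so there is no per-drive encrypted device to mistake for its
plaintext counterpart, and a resync reads and writes ciphertext without
passing through dm-crypt at all.
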
End of thread. Thread overview: 41+ messages (newest: 2014-02-05 1:42 UTC)

2014-01-21  7:35 Very long raid5 init/rebuild times Marc MERLIN
2014-01-21 16:37 ` Marc MERLIN
2014-01-21 17:08 ` Mark Knecht
2014-01-21 18:42 ` Chris Murphy
2014-01-22  7:55 ` Stan Hoeppner
2014-01-22 17:48 ` Marc MERLIN
2014-01-22 23:17 ` Stan Hoeppner
2014-01-23 14:28 ` John Stoffel
2014-01-24  1:02 ` Stan Hoeppner
2014-01-24  3:07 ` NeilBrown
2014-01-24  8:24 ` Stan Hoeppner
2014-01-23  2:37 ` Stan Hoeppner
2014-01-23  9:13 ` Marc MERLIN
2014-01-23 12:24 ` Stan Hoeppner
2014-01-23 21:01 ` Marc MERLIN
2014-01-24  5:13 ` Stan Hoeppner
2014-01-25  8:36 ` Marc MERLIN
2014-01-28  7:46 ` Stan Hoeppner
2014-01-28 16:50 ` Marc MERLIN
2014-01-29  0:56 ` Stan Hoeppner
2014-01-29  1:01 ` Marc MERLIN
2014-01-30 20:47 ` Phillip Susi
2014-02-01 22:39 ` Stan Hoeppner
2014-02-02 18:53 ` Phillip Susi
2014-02-03  6:34 ` Stan Hoeppner
2014-02-03 14:42 ` Phillip Susi
2014-02-04  3:30 ` Stan Hoeppner
2014-02-04 17:59 ` Larry Fenske
2014-02-04 18:08 ` Phillip Susi
2014-02-04 18:43 ` Stan Hoeppner
2014-02-04 18:55 ` Phillip Susi
2014-02-04 19:15 ` Stan Hoeppner
2014-02-04 20:16 ` Phillip Susi
2014-02-04 21:58 ` Stan Hoeppner
2014-02-05  1:19 ` Phillip Susi
2014-02-05  1:42 ` Stan Hoeppner
2014-01-30 20:36 ` Phillip Susi
2014-01-30 20:18 ` Phillip Susi
2014-01-22 19:38 ` Opal 2.0 SEDs on linux, was: Very long raid5 init/rebuild times Chris Murphy
2014-01-21 18:31 ` Chris Murphy
2014-01-22 13:46 ` Ethan Wilson