* recommended way to add ssd cache to mdraid array
@ 2012-12-22 6:57 Thomas Fjellstrom
2012-12-23 3:44 ` Thomas Fjellstrom
0 siblings, 1 reply; 53+ messages in thread
From: Thomas Fjellstrom @ 2012-12-22 6:57 UTC (permalink / raw)
To: linux-raid
I'm setting up a new NAS box (7x2TB on an IBM M1015 8-port SAS card flashed to
9211 IT mode) and was thinking about adding an SSD cache to it. I've been
following bcache's development, but it seems to have stalled a bit.
I've got a 240GB Samsung 470/810 that I'd like to use for this.
Also I was wondering if anyone has any tips on the best (or their preferred)
way to set up a "big" raid6 array with a single filesystem. I'm probably going
to stick with XFS, but I'm not married to it, if there's something better for
a big media (audio, video, disk images, backups, etc) volume I'd love to hear
about it.
--
Thomas Fjellstrom
thomas@fjellstrom.ca
* Re: recommended way to add ssd cache to mdraid array
2012-12-22 6:57 recommended way to add ssd cache to mdraid array Thomas Fjellstrom
@ 2012-12-23 3:44 ` Thomas Fjellstrom
2013-01-09 18:41 ` Thomas Fjellstrom
2013-01-10 13:13 ` Brad Campbell
0 siblings, 2 replies; 53+ messages in thread
From: Thomas Fjellstrom @ 2012-12-23 3:44 UTC (permalink / raw)
To: linux-raid
On Fri Dec 21, 2012, Thomas Fjellstrom wrote:
> I'm setting up a new nas box (7x2TB on a IBM M1015 8 port sas card flashed
> to 9211IT mode) and was thinking about adding an SSD cache to it. I've
> been following bcache's development, but it seems to have stalled a bit.
>
> I've got a 240G Samsung 470/810 that I'd like to use for this.
>
> Also I was wondering if anyone has any tips on the best (or their
> preferred) way to set up a "big" raid6 array with a single filesystem. I'm
> probably going to stick with XFS, but I'm not married to it, if there's
> something better for a big media (audio, video, disk images, backups, etc)
> volume I'd love to hear about it.
So my array has finally finished resyncing, and I've run a simple iozone test
on it formatted with XFS, and I'm seeing somewhat low write results:
moose@mrbig:/mnt/mrbig/data/test$ iozone -a -s 32G -r 8M
                                                      random   random     bkwd   record   stride
      KB reclen    write  rewrite     read   reread     read    write     read  rewrite     read   fwrite frewrite    fread  freread
33554432   8192   212507   210382   630327   630852   372807   161710   388319  4922757   617347   210642   217122   717279   716150
Is this normal for a 7-disk (2TB Seagate Barracudas) RAID array on a PCIe x8
SAS controller?
I was thinking it might be alignment issues, but there are no partitions on
the disks, and XFS seems to have correctly set up the sunit and swidth
settings (128/640 for a 7-disk RAID6). While 200MB/s is probably more than I
need day to day, I'd like to make sure it is set up properly.
--
Thomas Fjellstrom
thomas@fjellstrom.ca
* Re: recommended way to add ssd cache to mdraid array
2012-12-23 3:44 ` Thomas Fjellstrom
@ 2013-01-09 18:41 ` Thomas Fjellstrom
2013-01-10 6:25 ` Chris Murphy
2013-01-10 13:13 ` Brad Campbell
1 sibling, 1 reply; 53+ messages in thread
From: Thomas Fjellstrom @ 2013-01-09 18:41 UTC (permalink / raw)
To: linux-raid
On Sat Dec 22, 2012, you wrote:
> On Fri Dec 21, 2012, Thomas Fjellstrom wrote:
> > I'm setting up a new nas box (7x2TB on a IBM M1015 8 port sas card
> > flashed to 9211IT mode) and was thinking about adding an SSD cache to
> > it. I've been following bcache's development, but it seems to have
> > stalled a bit.
> >
> > I've got a 240G Samsung 470/810 that I'd like to use for this.
> >
> > Also I was wondering if anyone has any tips on the best (or their
> > preferred) way to set up a "big" raid6 array with a single filesystem.
> > I'm probably going to stick with XFS, but I'm not married to it, if
> > there's something better for a big media (audio, video, disk images,
> > backups, etc) volume I'd love to hear about it.
>
> So my array has finally finished resyncing, and I've run a simple iozone
> test on it formated with xfs, and I'm seeing some somewhat low write
> results:
>
> moose@mrbig:/mnt/mrbig/data/test$ iozone -a -s 32G -r 8M
> random random
> bkwd record stride
> KB reclen write rewrite read reread read write
> read rewrite read fwrite frewrite fread freread
> 33554432 8192 212507 210382 630327 630852 372807 161710
> 388319 4922757 617347 210642 217122 717279 716150
>
> Is this normal for a 7 disk (2TB seagate barracudas) raid array on a pcie
> x8 sas controller?
>
> I was thinking it might be alignment issues, but there are no partitions on
> the disks, and xfs seems to have correctly set up the sunit and swidth
> settings (128/640 for a 7 disk raid6). While 200MB/s is probably more than
> I need day to day, I'd like to make sure it is set up properly.
So I've retested the array without bcache in the way, and my write speeds are
still a fraction of the read speeds. Is this normal for a setup like mine?
The iozone results are a bit odd as well: I'm seeing the write speeds get
worse as the record size goes up. To compare with the bcache result above:
                                                      random   random     bkwd   record   stride
      KB reclen    write  rewrite     read   reread     read    write     read  rewrite     read   fwrite frewrite    fread  freread
33554432   8192   177919   167524   632030   629077   371602   115228   384030  4934570   618061   161562   176033   708542   709788
I'm pretty sure the bare results I got before I tested the bcache setup were
better than this, a little over 200MB/s. But with the current kernel, these
are the numbers I'm getting.
Specs:
Intel Core i3-2120
16GB 1333MHz DDR3 ECC
30GB Vertex 1 SSD for /
IBM M1015 flashed with the LSI 9211-8i IT firmware
7x2TB Seagate Barracuda HDDs in RAID6, unpartitioned and formatted with XFS
This system is currently dedicated to nothing but NAS duties. Later on I might
get it doing other things, but right now it's actually doing nothing but
running a RAID6 array that I haven't started using yet, so there shouldn't be
too much getting in the way of the iozone tests.
--
Thomas Fjellstrom
thomas@fjellstrom.ca
* Re: recommended way to add ssd cache to mdraid array
2013-01-09 18:41 ` Thomas Fjellstrom
@ 2013-01-10 6:25 ` Chris Murphy
2013-01-10 10:49 ` Thomas Fjellstrom
0 siblings, 1 reply; 53+ messages in thread
From: Chris Murphy @ 2013-01-10 6:25 UTC (permalink / raw)
To: linux-raid
On Jan 9, 2013, at 11:41 AM, Thomas Fjellstrom <thomas@fjellstrom.ca> wrote:
>
> So I've retested the array without bcache in th way, and my write speeds are
> still a fraction of the read speeds. Is this normal for a setup like mine?
It looks like roughly 1/3 of the read performance? There is a significant IOPS penalty with RAID6 writes, so I would look at whether your test is even valid; i.e. is it writing or rewriting in amounts that match what you will be using the array for? If not, disqualify the test and results and come up with something else that's more like the actual usage.
Is this for DAS? For big media files, IOPS is not typically what you need. You get large sequential writes.
>
> IBM M1015 flashed with the LSI 9211-8i IT firmware
> 7x2TB Seagate Baracuda HDDs in raid6 unpartitioned and formatted with XFS
I'm not following. This is a SAS controller and you're using a Barracuda, which I think is a SATA drive? If you're using SAS drives, I'd expect them to be of good enough quality and low enough UER that you could do RAID5 and avoid some of the write penalty. RAID 6 implies quite a few drives; have you considered RAID 10 instead? That card will do RAID 10 itself. The other possibility is to just RAID 0 the thing for DAS, and then back it up daily with rsync to a separate DAS RAID 0, or a linear array.
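A minimal sketch of the rsync-to-a-second-array idea; both mount points are assumptions, and the job would normally be run from cron for the daily copy:
  # one-way mirror of the primary array onto the backup array, deleting files that have vanished
  rsync -aHAX --delete /mnt/array/ /mnt/backup/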
Chris Murphy
* Re: recommended way to add ssd cache to mdraid array
2013-01-10 6:25 ` Chris Murphy
@ 2013-01-10 10:49 ` Thomas Fjellstrom
2013-01-10 21:36 ` Chris Murphy
0 siblings, 1 reply; 53+ messages in thread
From: Thomas Fjellstrom @ 2013-01-10 10:49 UTC (permalink / raw)
To: Chris Murphy; +Cc: linux-raid
On Wed Jan 9, 2013, you wrote:
> On Jan 9, 2013, at 11:41 AM, Thomas Fjellstrom <thomas@fjellstrom.ca> wrote:
> > So I've retested the array without bcache in th way, and my write speeds
> > are still a fraction of the read speeds. Is this normal for a setup like
> > mine?
>
> It looks like roughly 1/3 of the write performance? There is a
> non-insignificant penalty with RAID6 writes in IOPS, so I would look at
> whether your test is even valid; i.e. is it writing or rewriting in
> amounts that match what you will be using the array for? If not,
> disqualify the test and results and come up with something else that's
> more like the actual usage.
A lot of it will be streaming. Some may end up being random reads/writes. The
test is just to gauge overall performance of the setup. 600MB/s read is far
more than I need, but having writes at 1/3 of that seems odd to me.
> Is this for DAS? For big media files, IOPS is not typically what you need.
> You get large sequential writes.
IOPS is not something I'm worrying about, but write throughput at 1/3 of the
reads seems a bit off to me. I assume I've misconfigured something.
> > IBM M1015 flashed with the LSI 9211-8i IT firmware
> > 7x2TB Seagate Baracuda HDDs in raid6 unpartitioned and formatted with XFS
>
> I'm not following. This is a SAS controller and you're using a Barracuda,
> which I think is a SATA drive? If you're using SAS drives, I'd expect them
> to be good enough quality and low UER that you could do RAID5 and avoid
> some of the write penalty. RAID 6 implies quite a few drives, have you
> considered RAID 10 instead? That card will do RAID 10 itself. The other
> possibility just RAID 0 the thing for DAS, and then back it up daily with
> rsync to a separate DAS RAID 0, or linear array.
It's a relatively inexpensive ($80-100) 8-port SAS/SATA controller, and I have
it flashed in IT mode, so there's no RAID support whatsoever. I'm not a big
fan of hardware-based RAID, as it tends to be rather proprietary and you end
up having one more weak link in the chain to account for (you get to buy
identical backup cards to make sure you have spares, since other cards will
likely not understand the metadata or on-disk layout the original happened to use).
This is primarily a media and file share. RAID 10 comes with a rather large
space penalty (50%), which is not something I want to deal with. I'd end up
with an array the same size as my current 7x1TB RAID5, which is full. My idea
was to make a 10TB array to replace the current one, and recommission some of
those disks for use in a backup array (along with some 3TB drives I picked up).
The reason I've selected RAID6 to begin with is that I've read (on this mailing
list, and on some hardware tech sites) that even with SAS drives, the
rebuild/resync time on a large array using large disks (2TB+) is long enough
that it gives more than enough time for another disk to hit a random read
error, which will kick that disk out of the array, potentially giving you a
double fault and taking the entire array out. This is something I'd like to
avoid, but also something I don't want to spend 2x the money on drives for.
I'm also going with parity-based RAID so the array isn't totally out of
commission for long stretches even when some kind of error takes out one of
the disks (and resync times are short thanks to mdraid's write-intent bitmap).
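For reference, a write-intent bitmap can be checked for and added after the fact; /dev/md0 is an assumed name here:
  # see whether the array already carries a bitmap
  mdadm --detail /dev/md0 | grep -i bitmap
  # add an internal write-intent bitmap if it is missing
  mdadm --grow --bitmap=internal /dev/md0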
> Chris Murphy
--
Thomas Fjellstrom
thomas@fjellstrom.ca
* Re: recommended way to add ssd cache to mdraid array
2012-12-23 3:44 ` Thomas Fjellstrom
2013-01-09 18:41 ` Thomas Fjellstrom
@ 2013-01-10 13:13 ` Brad Campbell
1 sibling, 0 replies; 53+ messages in thread
From: Brad Campbell @ 2013-01-10 13:13 UTC (permalink / raw)
To: thomas; +Cc: linux-raid
On 23/12/12 11:44, Thomas Fjellstrom wrote:
>
> moose@mrbig:/mnt/mrbig/data/test$ iozone -a -s 32G -r 8M
>                                                       random   random     bkwd   record   stride
>       KB reclen    write  rewrite     read   reread     read    write     read  rewrite     read   fwrite frewrite    fread  freread
> 33554432   8192   212507   210382   630327   630852   372807   161710   388319  4922757   617347   210642   217122   717279   716150
>
> Is this normal for a 7 disk (2TB seagate barracudas) raid array on a pcie x8
> sas controller?
>
For comparison, using the same command line. This is a running system and
has a slight but measurable load on the array, but not enough that it
should really impact the numbers.
This is 10 x WD 2TB Green drives (sloooow) in RAID-6 on a pair of those
cards, formatted ext4.
mpt2sas0: LSISAS2008: FWVersion(15.00.00.00), ChipRevision(0x02), BiosVersion(00.00.00.00)

md0 : active raid6 sdl[0] sdr[5] sdo[6] sdp[7] sdn[8] sdk[9] sdm[10] sdq[11] sds[4] sdf[12]
      15628106752 blocks super 1.2 level 6, 128k chunk, algorithm 2 [10/10] [UUUUUUUUUU]
      bitmap: 2/15 pages [8KB], 65536KB chunk
                                                      random   random     bkwd   record   stride
      KB reclen    write  rewrite     read   reread     read    write     read  rewrite     read   fwrite frewrite    fread  freread
33554432   8192   349745   448840   481170   488821   284313   312892   355422  2288708   440818   397878   469964   478596   496619
Not sure this actually means anything though.
Regards,
Brad
* Re: recommended way to add ssd cache to mdraid array
2013-01-10 10:49 ` Thomas Fjellstrom
@ 2013-01-10 21:36 ` Chris Murphy
2013-01-11 0:18 ` Stan Hoeppner
` (2 more replies)
0 siblings, 3 replies; 53+ messages in thread
From: Chris Murphy @ 2013-01-10 21:36 UTC (permalink / raw)
To: linux-raid Raid
On Jan 10, 2013, at 3:49 AM, Thomas Fjellstrom <thomas@fjellstrom.ca> wrote:
> A lot of it will be streaming. Some may end up being random read/writes. The
> test is just to gauge over all performance of the setup. 600MBs read is far
> more than I need, but having writes at 1/3 that seems odd to me.
Tell us how many disks there are, and what the chunk size is. It could be too small if you have too few disks, which results in a small full stripe size for a video context. If you're using the default, it could be too big and you're getting a lot of RMW. Stan, and others, can better answer this.
You said these are unpartitioned disks, I think. In which case alignment of 4096 byte sectors isn't a factor if these are AF disks.
The scheduler is unlikely to make up the difference, but parallel fs's like XFS don't perform nearly as well with CFQ, so you should set the kernel parameter elevator=noop.
Another thing to look at is md/stripe_cache_size, which probably needs to be higher for your application.
Another thing to look at, if you're using XFS, is what your mount options are. Invariably with an array of this size you need to be mounting with the inode64 option.
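A rough sketch of checking those knobs follows; the device names and mount point are assumptions, and the stripe_cache_size value is only an example, not a recommendation:
  # per-member I/O scheduler (repeat for each disk in the array)
  cat /sys/block/sdb/queue/scheduler
  echo noop > /sys/block/sdb/queue/scheduler
  # md stripe cache, in pages per device (default 256); larger often helps RAID6 writes
  cat /sys/block/md0/md/stripe_cache_size
  echo 4096 > /sys/block/md0/md/stripe_cache_size
  # confirm whether inode64 is actually among the active mount options
  grep md0 /proc/mounts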
> The reason I've selected RAID6 to begin with is I've read (on this mailing
> list, and on some hardware tech sites) that even with SAS drives, the
> rebuild/resync time on a large array using large disks (2TB+) is long enough
> that it gives more than enough time for another disk to hit a random read
> error,
This is true for high density consumer SATA drives. It's not nearly as applicable for low to moderate density nearline SATA which has an order of magnitude lower UER, or for enterprise SAS (and some enterprise SATA) which has yet another order of magnitude lower UER. So it depends on the disks, and the RAID size, and the backup/restore strategy.
Another way people get into trouble with the event you're talking about is that they don't do regular scrubs or poll drive SMART data. I have no empirical data, but I'd expect a much better than order-of-magnitude lower chance of array loss during a rebuild when the array is being properly maintained, rather than considering it a push-button "it's magic" appliance to be forgotten about.
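For the record, that kind of maintenance amounts to something like the following; /dev/md0 and /dev/sdb are assumed names:
  # start a scrub and watch it run
  echo check > /sys/block/md0/md/sync_action
  cat /proc/mdstat
  # after it finishes, a non-zero mismatch count is worth investigating
  cat /sys/block/md0/md/mismatch_cnt
  # poll SMART health and attributes per member disk
  smartctl -H -A /dev/sdb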
Chris Murphy
* Re: recommended way to add ssd cache to mdraid array
2013-01-10 21:36 ` Chris Murphy
@ 2013-01-11 0:18 ` Stan Hoeppner
2013-01-11 12:35 ` Thomas Fjellstrom
2013-01-11 12:20 ` Thomas Fjellstrom
2013-01-12 12:06 ` Roy Sigurd Karlsbakk
2 siblings, 1 reply; 53+ messages in thread
From: Stan Hoeppner @ 2013-01-11 0:18 UTC (permalink / raw)
To: Chris Murphy; +Cc: linux-raid Raid
On 1/10/2013 3:36 PM, Chris Murphy wrote:
>
> On Jan 10, 2013, at 3:49 AM, Thomas Fjellstrom <thomas@fjellstrom.ca> wrote:
>
>> A lot of it will be streaming. Some may end up being random read/writes. The
>> test is just to gauge over all performance of the setup. 600MBs read is far
>> more than I need, but having writes at 1/3 that seems odd to me.
>
> Tell us how many disks there are, and what the chunk size is. It could be too small if you have too few disks which results in a small full stripe size for a video context. If you're using the default, it could be too big and you're getting a lot of RWM. Stan, and others, can better answer this.
Thomas is using a benchmark, and a single one at that, to judge the
performance. He's not using his actual workloads. Tuning/tweaking to
increase the numbers in a benchmark could be detrimental to actual
performance instead of providing a boost. One must be careful.
Regarding RAID6, it will always have horrible performance compared to
non-parity RAID levels and even RAID5, for anything but full stripe
aligned writes, which means writing new large files or doing large
appends to existing files.
However, everything is relative. This RAID6 may have plenty of random
and streaming write/read throughput for Thomas. But a single benchmark
isn't going to inform him accurately.
> You said these are unpartitioned disks, I think. In which case alignment of 4096 byte sectors isn't a factor if these are AF disks.
>
> Unlikely to make up the difference is the scheduler. Parallel fs's like XFS don't perform nearly as well with CFQ, so you should have a kernel parameter elevator=noop.
If the HBAs have [BB|FB]WC then one should probably use noop as the
cache schedules the actual IO to the drives. If the HBAs lack cache,
then deadline often provides better performance. Testing of each is
required on a system and workload basis. With two identical systems
(hardware/RAID/OS) one may perform better with noop, the other with
deadline. The determining factor is the applications' IO patterns.
> Another thing to look at is md/stripe_cache_size which probably needs to be higher for your application.
>
> Another thing to look at is if you're using XFS, what your mount options are. Invariably with an array of this size you need to be mounting with the inode64 option.
The desired allocator behavior is independent of array size but, once
again, dependent on the workloads. inode64 is only needed for large
filesystems with lots of files, where 1TB may not be enough for the
directory inodes. Or, for mixed metadata/data heavy workloads.
For many workloads including databases, video ingestion, etc, the
inode32 allocator is preferred, regardless of array size. This is the
linux-raid list so I'll not go into detail of the XFS allocators.
>> The reason I've selected RAID6 to begin with is I've read (on this mailing
>> list, and on some hardware tech sites) that even with SAS drives, the
>> rebuild/resync time on a large array using large disks (2TB+) is long enough
>> that it gives more than enough time for another disk to hit a random read
>> error,
> This is true for high density consumer SATA drives. It's not nearly as applicable for low to moderate density nearline SATA which has an order of magnitude lower UER, or for enterprise SAS (and some enterprise SATA) which has yet another order of magnitude lower UER. So it depends on the disks, and the RAID size, and the backup/restore strategy.
Yes, enterprise drives have a much larger spare sector pool.
WRT rebuild time, this is one more reason to use RAID10 or a concat of
RAID1s. The rebuild time is low, constant, predictable. For 2TB drives
about 5-6 hours at 100% rebuild rate. And rebuild time, for any array
type, with gargantuan drives, is yet one more reason not to use the
largest drives you can get your hands on. Using 1TB drives will cut
that to 2.5-3 hours, and using 500GB drives will cut it down to 1.25-1.5
hours, as all these drives tend to have similar streaming write rates.
To wit, as a general rule I always build my arrays with the smallest
drives I can get away with for the workload at hand. Yes, for a given
TB total it increases acquisition cost of drives, HBAs, enclosures, and
cables, and power consumption, but it also increases spindle count--thus
performance-- while decreasing rebuild times substantially/dramatically.
--
Stan
* Re: recommended way to add ssd cache to mdraid array
2013-01-10 21:36 ` Chris Murphy
2013-01-11 0:18 ` Stan Hoeppner
@ 2013-01-11 12:20 ` Thomas Fjellstrom
2013-01-11 17:39 ` Chris Murphy
2013-01-11 18:50 ` Stan Hoeppner
2013-01-12 12:06 ` Roy Sigurd Karlsbakk
2 siblings, 2 replies; 53+ messages in thread
From: Thomas Fjellstrom @ 2013-01-11 12:20 UTC (permalink / raw)
To: Chris Murphy; +Cc: linux-raid Raid
On Thu Jan 10, 2013, Chris Murphy wrote:
> On Jan 10, 2013, at 3:49 AM, Thomas Fjellstrom <thomas@fjellstrom.ca> wrote:
> > A lot of it will be streaming. Some may end up being random read/writes.
> > The test is just to gauge over all performance of the setup. 600MBs read
> > is far more than I need, but having writes at 1/3 that seems odd to me.
>
> Tell us how many disks there are, and what the chunk size is. It could be
> too small if you have too few disks which results in a small full stripe
> size for a video context. If you're using the default, it could be too big
> and you're getting a lot of RWM. Stan, and others, can better answer this.
As stated earlier, it's a 7x2TB array.
> You said these are unpartitioned disks, I think. In which case alignment of
> 4096 byte sectors isn't a factor if these are AF disks.
They are AF disks.
> Unlikely to make up the difference is the scheduler. Parallel fs's like XFS
> don't perform nearly as well with CFQ, so you should have a kernel
> parameter elevator=noop.
>
> Another thing to look at is md/stripe_cache_size which probably needs to be
> higher for your application.
I'll look into it.
> Another thing to look at is if you're using XFS, what your mount options
> are. Invariably with an array of this size you need to be mounting with
> the inode64 option.
I'm not sure, but I think that's the default.
> > The reason I've selected RAID6 to begin with is I've read (on this
> > mailing list, and on some hardware tech sites) that even with SAS
> > drives, the rebuild/resync time on a large array using large disks
> > (2TB+) is long enough that it gives more than enough time for another
> > disk to hit a random read error,
>
> This is true for high density consumer SATA drives. It's not nearly as
> applicable for low to moderate density nearline SATA which has an order of
> magnitude lower UER, or for enterprise SAS (and some enterprise SATA)
> which has yet another order of magnitude lower UER. So it depends on the
> disks, and the RAID size, and the backup/restore strategy.
Plain old Seagate Barracudas, so not the best, but at least they aren't Greens.
> Another way people get into trouble with the event you're talking about, is
> they don't do regular scrubs or poll drive SMART data. I have no empirical
> data, but I'd expect much better than order of magnitude lower array loss
> during a rebuild when the array is being properly maintained, rather than
> considering it a push button "it's magic" appliance to be forgotten about.
Debian seems to set up a weekly or monthly scrub, which I leave enabled after
reading that same advice.
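On Debian this is typically the checkarray job shipped with the mdadm package; roughly the following, though the exact schedule may differ between releases:
  # the cron job that drives the periodic scrub
  cat /etc/cron.d/mdadm
  # run the same scrub by hand against every array, at idle I/O priority
  /usr/share/mdadm/checkarray --all --idle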
>
> Chris Murphy
--
Thomas Fjellstrom
thomas@fjellstrom.ca
* Re: recommended way to add ssd cache to mdraid array
2013-01-11 0:18 ` Stan Hoeppner
@ 2013-01-11 12:35 ` Thomas Fjellstrom
2013-01-11 12:48 ` Thomas Fjellstrom
2013-01-14 0:05 ` Tommy Apel Hansen
0 siblings, 2 replies; 53+ messages in thread
From: Thomas Fjellstrom @ 2013-01-11 12:35 UTC (permalink / raw)
To: stan; +Cc: Chris Murphy, linux-raid Raid
On Thu Jan 10, 2013, Stan Hoeppner wrote:
> On 1/10/2013 3:36 PM, Chris Murphy wrote:
> > On Jan 10, 2013, at 3:49 AM, Thomas Fjellstrom <thomas@fjellstrom.ca> wrote:
> >> A lot of it will be streaming. Some may end up being random read/writes.
> >> The test is just to gauge over all performance of the setup. 600MBs
> >> read is far more than I need, but having writes at 1/3 that seems odd
> >> to me.
> >
> > Tell us how many disks there are, and what the chunk size is. It could be
> > too small if you have too few disks which results in a small full stripe
> > size for a video context. If you're using the default, it could be too
> > big and you're getting a lot of RWM. Stan, and others, can better answer
> > this.
>
> Thomas is using a benchmark, and a single one at that, to judge the
> performance. He's not using his actual workloads. Tuning/tweaking to
> increase the numbers in a benchmark could be detrimental to actual
> performance instead of providing a boost. One must be careful.
>
> Regarding RAID6, it will always have horrible performance compared to
> non-parity RAID levels and even RAID5, for anything but full stripe
> aligned writes, which means writing new large files or doing large
> appends to existing files.
Considering it's a rather simple use case, mostly streaming video and misc
file sharing for my home network, an iozone test should be rather telling,
especially the full test, from 4K up to 16MB record sizes:
                                                      random   random     bkwd   record   stride
      KB reclen    write  rewrite     read   reread     read    write     read  rewrite     read   fwrite frewrite    fread  freread
33554432      4   243295   221756   628767   624081     1028     4627    16822  7468777    17740   233295   231092   582036   579131
33554432      8   241134   225728   628264   627015     2027     8879    25977 10030302    19578   228923   233928   591478   584892
33554432     16   233758   228122   633406   618248     3952    13635    35676 10166457    19968   227599   229698   579267   576850
33554432     32   232390   219484   625968   625627     7604    18800    44252 10728450    24976   216880   222545   556513   555371
33554432     64   222936   206166   631659   627823    14112    22837    52259 11243595    30251   196243   192755   498602   494354
33554432    128   214740   182619   628604   626407    25088    26719    64912 11232068    39867   198638   185078   463505   467853
33554432    256   202543   185964   626614   624367    44363    34763    73939 10148251    62349   176724   191899   593517   595646
33554432    512   208081   188584   632188   629547    72617    39145    84876  9660408    89877   182736   172912   610681   608870
33554432   1024   196429   166125   630785   632413   116793    51904   133342  8687679   121956   168756   175225   620587   616722
33554432   2048   185399   167484   622180   627606   188571    70789   218009  5357136   370189   171019   166128   637830   637120
33554432   4096   198340   188695   632693   628225   289971    95211   278098  4836433   611529   161664   170469   665617   655268
33554432   8192   177919   167524   632030   629077   371602   115228   384030  4934570   618061   161562   176033   708542   709788
33554432  16384   196639   183744   631478   627518   485622   133467   462861  4890426   644615   175411   179795   725966   734364
> However, everything is relative. This RAID6 may have plenty of random
> and streaming write/read throughput for Thomas. But a single benchmark
> isn't going to inform him accurately.
200MB/s may be enough, but the difference between the read and write
throughput is a bit unexpected. It's not a weak machine (Core i3-2120, dual
core 3.2GHz with HT, 16GB ECC 1333MHz RAM), and this is basically all it's
going to be doing.
> > You said these are unpartitioned disks, I think. In which case alignment
> > of 4096 byte sectors isn't a factor if these are AF disks.
> >
> > Unlikely to make up the difference is the scheduler. Parallel fs's like
> > XFS don't perform nearly as well with CFQ, so you should have a kernel
> > parameter elevator=noop.
>
> If the HBAs have [BB|FB]WC then one should probably use noop as the
> cache schedules the actual IO to the drives. If the HBAs lack cache,
> then deadline often provides better performance. Testing of each is
> required on a system and workload basis. With two identical systems
> (hardware/RAID/OS) one may perform better with noop, the other with
> deadline. The determining factor is the applications' IO patterns.
Mostly streaming reads, some long rsync's to copy stuff back and forth, file
share duties (downloads etc).
> > Another thing to look at is md/stripe_cache_size which probably needs to
> > be higher for your application.
> >
> > Another thing to look at is if you're using XFS, what your mount options
> > are. Invariably with an array of this size you need to be mounting with
> > the inode64 option.
>
> The desired allocator behavior is independent of array size but, once
> again, dependent on the workloads. inode64 is only needed for large
> filesystems with lots of files, where 1TB may not be enough for the
> directory inodes. Or, for mixed metadata/data heavy workloads.
>
> For many workloads including databases, video ingestion, etc, the
> inode32 allocator is preferred, regardless of array size. This is the
> linux-raid list so I'll not go into detail of the XFS allocators.
If you have the time and the desire, I'd like to hear about it off list.
> >> The reason I've selected RAID6 to begin with is I've read (on this
> >> mailing list, and on some hardware tech sites) that even with SAS
> >> drives, the rebuild/resync time on a large array using large disks
> >> (2TB+) is long enough that it gives more than enough time for another
> >> disk to hit a random read error,
> >
> > This is true for high density consumer SATA drives. It's not nearly as
> > applicable for low to moderate density nearline SATA which has an order
> > of magnitude lower UER, or for enterprise SAS (and some enterprise SATA)
> > which has yet another order of magnitude lower UER. So it depends on
> > the disks, and the RAID size, and the backup/restore strategy.
>
> Yes, enterprise drives have a much larger spare sector pool.
>
> WRT rebuild time, this is one more reason to use RAID10 or a concat of
> RAID1s. The rebuild time is low, constant, predictable. For 2TB drives
> about 5-6 hours at 100% rebuild rate. And rebuild time, for any array
> type, with gargantuan drives, is yet one more reason not to use the
> largest drives you can get your hands on. Using 1TB drives will cut
> that to 2.5-3 hours, and using 500GB drives will cut it down to 1.25-1.5
> hours, as all these drives tend to have similar streaming write rates.
>
> To wit, as a general rule I always build my arrays with the smallest
> drives I can get away with for the workload at hand. Yes, for a given
> TB total it increases acquisition cost of drives, HBAs, enclosures, and
> cables, and power consumption, but it also increases spindle count--thus
> performance-- while decreasing rebuild times substantially/dramatically.
I'd go RAID10 or something if I had the space, but this little 10TB NAS (which
is the goal: a small, quiet, not-too-slow 10TB NAS with some kind of
redundancy) only fits 7 3.5" HDDs.
Maybe sometime in the future I'll get a big 3U or 4U case with a crapload of
3.5" HDD bays, but for now this is what I have (as well as my old array,
7x1TB RAID5+XFS in 4-in-3 hot swap bays with room for 8 drives, but I haven't
bothered to expand the old array, and I have the new one almost ready to go).
I don't know if it impacts anything at all, but when burning in these drives
after I bought them, I ran the same full iozone test a couple of times, and each
drive showed 150MB/s read and similar write speeds (100-120+ MB/s?). It impressed
me somewhat to see a mechanical hard drive go that fast. I remember back a few
years ago thinking 80MB/s was fast for an HDD.
--
Thomas Fjellstrom
thomas@fjellstrom.ca
* Re: recommended way to add ssd cache to mdraid array
2013-01-11 12:35 ` Thomas Fjellstrom
@ 2013-01-11 12:48 ` Thomas Fjellstrom
2013-01-14 0:05 ` Tommy Apel Hansen
1 sibling, 0 replies; 53+ messages in thread
From: Thomas Fjellstrom @ 2013-01-11 12:48 UTC (permalink / raw)
To: stan; +Cc: Chris Murphy, linux-raid Raid
On Fri Jan 11, 2013, Thomas Fjellstrom wrote:
> On Thu Jan 10, 2013, Stan Hoeppner wrote:
> > On 1/10/2013 3:36 PM, Chris Murphy wrote:
> > > On Jan 10, 2013, at 3:49 AM, Thomas Fjellstrom <thomas@fjellstrom.ca>
wrote:
> > >> A lot of it will be streaming. Some may end up being random
> > >> read/writes. The test is just to gauge over all performance of the
> > >> setup. 600MBs read is far more than I need, but having writes at 1/3
> > >> that seems odd to me.
> > >
> > > Tell us how many disks there are, and what the chunk size is. It could
> > > be too small if you have too few disks which results in a small full
> > > stripe size for a video context. If you're using the default, it could
> > > be too big and you're getting a lot of RWM. Stan, and others, can
> > > better answer this.
> >
> > Thomas is using a benchmark, and a single one at that, to judge the
> > performance. He's not using his actual workloads. Tuning/tweaking to
> > increase the numbers in a benchmark could be detrimental to actual
> > performance instead of providing a boost. One must be careful.
> >
> > Regarding RAID6, it will always have horrible performance compared to
> > non-parity RAID levels and even RAID5, for anything but full stripe
> > aligned writes, which means writing new large files or doing large
> > appends to existing files.
>
> Considering its a rather simple use case, mostly streaming video, and misc
> file sharing for my home network, an iozone test should be rather telling.
> Especially the full test, from 4k up to 16mb
>
> [snip iozone results, quoted in full in the previous message]
>
> > However, everything is relative. This RAID6 may have plenty of random
> > and streaming write/read throughput for Thomas. But a single benchmark
> > isn't going to inform him accurately.
>
> 200MB/s may be enough, but the difference between the read and write
> throughput is a bit unexpected. It's not a weak machine (core i3-2120, dual
> core 3.2Ghz with HT, 16GB ECC 1333Mhz ram), and this is basically all its
> going to be doing.
>
> > > You said these are unpartitioned disks, I think. In which case
> > > alignment of 4096 byte sectors isn't a factor if these are AF disks.
> > >
> > > Unlikely to make up the difference is the scheduler. Parallel fs's like
> > > XFS don't perform nearly as well with CFQ, so you should have a kernel
> > > parameter elevator=noop.
> >
> > If the HBAs have [BB|FB]WC then one should probably use noop as the
> > cache schedules the actual IO to the drives. If the HBAs lack cache,
> > then deadline often provides better performance. Testing of each is
> > required on a system and workload basis. With two identical systems
> > (hardware/RAID/OS) one may perform better with noop, the other with
> > deadline. The determining factor is the applications' IO patterns.
>
> Mostly streaming reads, some long rsync's to copy stuff back and forth,
> file share duties (downloads etc).
>
> > > Another thing to look at is md/stripe_cache_size which probably needs
> > > to be higher for your application.
> > >
> > > Another thing to look at is if you're using XFS, what your mount
> > > options are. Invariably with an array of this size you need to be
> > > mounting with the inode64 option.
> >
> > The desired allocator behavior is independent of array size but, once
> > again, dependent on the workloads. inode64 is only needed for large
> > filesystems with lots of files, where 1TB may not be enough for the
> > directory inodes. Or, for mixed metadata/data heavy workloads.
> >
> > For many workloads including databases, video ingestion, etc, the
> > inode32 allocator is preferred, regardless of array size. This is the
> > linux-raid list so I'll not go into detail of the XFS allocators.
>
> If you have the time and the desire, I'd like to hear about it off list.
>
> > >> The reason I've selected RAID6 to begin with is I've read (on this
> > >> mailing list, and on some hardware tech sites) that even with SAS
> > >> drives, the rebuild/resync time on a large array using large disks
> > >> (2TB+) is long enough that it gives more than enough time for another
> > >> disk to hit a random read error,
> > >
> > > This is true for high density consumer SATA drives. It's not nearly as
> > > applicable for low to moderate density nearline SATA which has an order
> > > of magnitude lower UER, or for enterprise SAS (and some enterprise
> > > SATA) which has yet another order of magnitude lower UER. So it
> > > depends on the disks, and the RAID size, and the backup/restore
> > > strategy.
> >
> > Yes, enterprise drives have a much larger spare sector pool.
> >
> > WRT rebuild time, this is one more reason to use RAID10 or a concat of
> > RAID1s. The rebuild time is low, constant, predictable. For 2TB drives
> > about 5-6 hours at 100% rebuild rate. And rebuild time, for any array
> > type, with gargantuan drives, is yet one more reason not to use the
> > largest drives you can get your hands on. Using 1TB drives will cut
> > that to 2.5-3 hours, and using 500GB drives will cut it down to 1.25-1.5
> > hours, as all these drives tend to have similar streaming write rates.
> >
> > To wit, as a general rule I always build my arrays with the smallest
> > drives I can get away with for the workload at hand. Yes, for a given
> > TB total it increases acquisition cost of drives, HBAs, enclosures, and
> > cables, and power consumption, but it also increases spindle count--thus
> > performance-- while decreasing rebuild times substantially/dramatically.
>
> I'd go raid10 or something if I had the space, but this little 10TB nas
> (which is the goal, a small, quiet, not too slow, 10TB nas with some kind
> of redundancy) only fits 7 3.5" HDDs.
>
> Maybe sometime in the future I'll get a big 3 or 4 u case with a crap load
> of 3.5" HDD bays, but for now, this is what I have (as well as my old
> array, 7x1TB RAID5+XFS in 4in3 hot swap bays with room for 8 drives, but
> haven't bothered to expand the old array, and I have the new one almost
> ready to go).
>
> I don't know if it impacts anything at all, but when burning in these
> drives after I bought them, I ran the same full iozone test a couple
> times, and each drive shows 150MB/s read, and similar write times
> (100-120+?). It impressed me somewhat, to see a mechanical hard drive go
> that fast. I remember back a few years ago thinking 80MBs was fast for a
> HDD.
I should note, it might do some p2p duties in the future. Not sure about that.
--
Thomas Fjellstrom
thomas@fjellstrom.ca
* Re: recommended way to add ssd cache to mdraid array
2013-01-11 12:20 ` Thomas Fjellstrom
@ 2013-01-11 17:39 ` Chris Murphy
2013-01-11 17:46 ` Chris Murphy
2013-01-11 18:51 ` Thomas Fjellstrom
2013-01-11 18:50 ` Stan Hoeppner
1 sibling, 2 replies; 53+ messages in thread
From: Chris Murphy @ 2013-01-11 17:39 UTC (permalink / raw)
To: linux-raid Raid
On Jan 11, 2013, at 5:20 AM, Thomas Fjellstrom <thomas@fjellstrom.ca> wrote:
> On Thu Jan 10, 2013, Chris Murphy wrote:
>> On Jan 10, 2013, at 3:49 AM, Thomas Fjellstrom <thomas@fjellstrom.ca> wrote:
>>> A lot of it will be streaming. Some may end up being random read/writes.
>>> The test is just to gauge over all performance of the setup. 600MBs read
>>> is far more than I need, but having writes at 1/3 that seems odd to me.
>>
>> Tell us how many disks there are, and what the chunk size is. It could be
>> too small if you have too few disks which results in a small full stripe
>> size for a video context. If you're using the default, it could be too big
>> and you're getting a lot of RWM. Stan, and others, can better answer this.
>
> As stated earlier, its a 7x2TB array.
OK fair enough, and as asked earlier, what's the chunk size?
>
>> Another thing to look at is if you're using XFS, what your mount options
>> are. Invariably with an array of this size you need to be mounting with
>> the inode64 option.
>
> I'm not sure, but I think that's the default.
It's not, but as Stan writes, it may not be preferred for your application.
>>
>
> Plain old seagate baracudas, so not the best but at least they aren't greens.
They probably have a high ERC timeout, as all consumer disks do, so you should also check /sys/block/sdX/device/timeout and make sure it's not significantly less than the drive's. It may be possible for smartctl or hdparm to figure out what the drive's ERC timeout is.
http://cgi.csc.liv.ac.uk/~greg/projects/erc/
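Concretely, the two things to compare look like this; sdb below stands in for each member disk:
  # kernel-side command timeout, in seconds (default 30)
  cat /sys/block/sdb/device/timeout
  # drive-side error recovery limit, if the firmware exposes SCT ERC
  smartctl -l scterc /dev/sdb
  # desktop drives typically report ERC as disabled or unsupported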
>
>
> Debian seems to set up a weekly or monthly scrub, which I leave on due to
> reading that same fact.
Unless there's a script to email you the results, you'll need to check it periodically. Or write a script that checks for reported errors and only sends an email if that's the case.
Chris Murphy
* Re: recommended way to add ssd cache to mdraid array
2013-01-11 17:39 ` Chris Murphy
@ 2013-01-11 17:46 ` Chris Murphy
2013-01-11 18:52 ` Thomas Fjellstrom
2013-01-12 0:47 ` Phil Turmel
2013-01-11 18:51 ` Thomas Fjellstrom
1 sibling, 2 replies; 53+ messages in thread
From: Chris Murphy @ 2013-01-11 17:46 UTC (permalink / raw)
To: linux-raid Raid
On Jan 11, 2013, at 10:39 AM, Chris Murphy <lists@colorremedies.com> wrote:
>
> They probably have a high ERC time out as all consumer disks do so you should also check /sys/block/sdX/device/timeout and make sure it's not significantly less than the drive. It may be possible for smartctl or hdparm to figure out what the drive ERC timeout is.
>
> http://cgi.csc.liv.ac.uk/~greg/projects/erc/
Actually what I wrote is misleading to the point that it's wrong. You want the Linux device timeout to be greater than the drive's timeout. The drive needs to be allowed to give up and report back a read error to Linux/md, so that md knows it should reconstruct the missing data from parity and overwrite the (obviously) bad blocks causing the read error.
If the Linux device timeout is even a little bit less than the drive's timeout, md never gets the sector read error and doesn't repair it, since Linux boots the whole drive. Now instead of repairing a few sectors, you have a degraded array on your hands. Usual consumer drive timeouts are quite high; they can be up to a couple of minutes long. The Linux device timeout is 30 seconds.
Chris Murphy
* Re: recommended way to add ssd cache to mdraid array
2013-01-11 12:20 ` Thomas Fjellstrom
2013-01-11 17:39 ` Chris Murphy
@ 2013-01-11 18:50 ` Stan Hoeppner
2013-01-12 2:45 ` Thomas Fjellstrom
1 sibling, 1 reply; 53+ messages in thread
From: Stan Hoeppner @ 2013-01-11 18:50 UTC (permalink / raw)
To: thomas; +Cc: Chris Murphy, linux-raid Raid
On 1/11/2013 6:20 AM, Thomas Fjellstrom wrote:
> On Thu Jan 10, 2013, Chris Murphy wrote:
>> Another thing to look at is if you're using XFS, what your mount options
>> are. Invariably with an array of this size you need to be mounting with
>> the inode64 option.
>
> I'm not sure, but I think that's the default.
No, inode32 has always been the default allocator. It was decided just
recently to make inode64 the default, and the patchset for this was
committed on 9/20/2012, ~3 months ago, into 3.6-rc1-17:
http://oss.sgi.com/archives/xfs/2012-09/msg00397.html
So with 3.6+ kernels inode64 is the new default. Any current mainstream
distro kernel defaults to inode32.
Worth noting: if you upgrade the kernel on a system with existing
inode32 XFS filesystems the allocator for these will remain inode32
unless the mount option is manually changed. This is the same behavior
as with current kernels. New filesystems created after upgrading to a
64 bit kernel w/this patch set will be mounted with inode64. IIRC 32
bit kernels are limited to the inode32 allocator and 32 bit inodes.
There are many workloads that prefer, or even require, 32bit inodes,
and/or the behavior available with the inode32 allocator. Thus it would
not be smart to auto convert them to inode64 after a kernel upgrade.
--
Stan
* Re: recommended way to add ssd cache to mdraid array
2013-01-11 17:39 ` Chris Murphy
2013-01-11 17:46 ` Chris Murphy
@ 2013-01-11 18:51 ` Thomas Fjellstrom
2013-01-11 22:17 ` Stan Hoeppner
2013-01-13 19:18 ` Chris Murphy
1 sibling, 2 replies; 53+ messages in thread
From: Thomas Fjellstrom @ 2013-01-11 18:51 UTC (permalink / raw)
To: Chris Murphy; +Cc: linux-raid Raid
On Fri Jan 11, 2013, you wrote:
> On Jan 11, 2013, at 5:20 AM, Thomas Fjellstrom <thomas@fjellstrom.ca> wrote:
> > On Thu Jan 10, 2013, Chris Murphy wrote:
> >> On Jan 10, 2013, at 3:49 AM, Thomas Fjellstrom <thomas@fjellstrom.ca>
wrote:
> >>> A lot of it will be streaming. Some may end up being random
> >>> read/writes. The test is just to gauge over all performance of the
> >>> setup. 600MBs read is far more than I need, but having writes at 1/3
> >>> that seems odd to me.
> >>
> >> Tell us how many disks there are, and what the chunk size is. It could
> >> be too small if you have too few disks which results in a small full
> >> stripe size for a video context. If you're using the default, it could
> >> be too big and you're getting a lot of RWM. Stan, and others, can
> >> better answer this.
> >
> > As stated earlier, its a 7x2TB array.
>
> OK fair enough, and as asked earlier, what's the chunk size?
Ah, sorry, I missed that bit. I didn't tweak the chunk size, so I have the
default 512K.
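The chunk size can be confirmed straight from the array metadata; /dev/md0 is an assumed name:
  mdadm --detail /dev/md0 | grep -i 'chunk size'
  cat /proc/mdstat    # also shows e.g. "512k chunk" on the md line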
> >> Another thing to look at is if you're using XFS, what your mount options
> >> are. Invariably with an array of this size you need to be mounting with
> >> the inode64 option.
> >
> > I'm not sure, but I think that's the default.
>
> It's not but as Stan writes it may not be preferred for your application.
Hm, ok, it is in my mount line, so I'll try it with that off. Though I would
be interested in hearing from Stan why it may not be good.
> > Plain old seagate baracudas, so not the best but at least they aren't
> > greens.
>
> They probably have a high ERC time out as all consumer disks do so you
> should also check /sys/block/sdX/device/timeout and make sure it's not
> significantly less than the drive. It may be possible for smartctl or
> hdparm to figure out what the drive ERC timeout is.
>
> http://cgi.csc.liv.ac.uk/~greg/projects/erc/
>
> > Debian seems to set up a weekly or monthly scrub, which I leave on due to
> > reading that same fact.
>
> Unless there's a script to email you the results, you'll need to check it
> periodically. Or write a script that checks for reported errors and only
> sends an email if that's the case.
I believe it will send a mail to root. It seems as if it's an mdadm/mdmon
feature, as the config option is in mdadm.conf. I've set it to send mail to
my main email address, so I should see alerts.
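For reference, the relevant pieces are a MAILADDR line in mdadm.conf plus a running mdadm monitor; the address below is a placeholder, and the --test run sends a test alert per array so mail delivery can be verified:
  # /etc/mdadm/mdadm.conf (Debian path)
  MAILADDR thomas@example.com
  # fire a one-off test notification for every array in the config
  mdadm --monitor --scan --oneshot --test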
> Chris Murphy
--
Thomas Fjellstrom
thomas@fjellstrom.ca
* Re: recommended way to add ssd cache to mdraid array
2013-01-11 17:46 ` Chris Murphy
@ 2013-01-11 18:52 ` Thomas Fjellstrom
2013-01-12 0:47 ` Phil Turmel
1 sibling, 0 replies; 53+ messages in thread
From: Thomas Fjellstrom @ 2013-01-11 18:52 UTC (permalink / raw)
To: Chris Murphy; +Cc: linux-raid Raid
On Fri Jan 11, 2013, Chris Murphy wrote:
> On Jan 11, 2013, at 10:39 AM, Chris Murphy <lists@colorremedies.com> wrote:
> > They probably have a high ERC time out as all consumer disks do so you
> > should also check /sys/block/sdX/device/timeout and make sure it's not
> > significantly less than the drive. It may be possible for smartctl or
> > hdparm to figure out what the drive ERC timeout is.
> >
> > http://cgi.csc.liv.ac.uk/~greg/projects/erc/
>
> Actually what I wrote is misleading to the point it's wrong. You want the
> linux device time out to be greater than the device timeout. The device
> needs to be allowed to give up, and report back a read error to linux/md,
> so that md knows it should reconstruct the missing data from parity, and
> overwrite the (obviously) bad blocks causing the read error.
>
> If the linux device time out is even a little bit less than the drive's
> timeout, md never gets the sector read error, doesn't repair it, since
> linux boots the whole drive. Now instead of repairing a few sectors, you
> have a degraded array on your hands. Usual consumer drive time outs are
> quite high, they can be up to a couple minutes long. Linux device time out
> is 30 seconds.
>
Hm, ok. I'll look into that, and set those up properly.
Thanks.
> Chris Murphy
--
Thomas Fjellstrom
thomas@fjellstrom.ca
* Re: recommended way to add ssd cache to mdraid array
2013-01-11 18:51 ` Thomas Fjellstrom
@ 2013-01-11 22:17 ` Stan Hoeppner
2013-01-12 2:44 ` Thomas Fjellstrom
2013-01-13 19:18 ` Chris Murphy
1 sibling, 1 reply; 53+ messages in thread
From: Stan Hoeppner @ 2013-01-11 22:17 UTC (permalink / raw)
To: thomas; +Cc: Chris Murphy, linux-raid Raid
On 1/11/2013 12:51 PM, Thomas Fjellstrom wrote:
>> It's not but as Stan writes it may not be preferred for your application.
>
> Hm, ok, it is in my mount line, so I'll try it with that off. Though I would
> be interested in hearing from Stan why it may not be good.
You should not do this. This parameter is not a toggle switch intended
to be flipped on/off at will, but changed only once and left there.
Never change XFS parameters willy-nilly without knowing the
consequences. And currently you certainly do not know them.
I don't have time for the detailed explanation. Ask on the XFS list.
--
Stan
* Re: recommended way to add ssd cache to mdraid array
2013-01-11 17:46 ` Chris Murphy
2013-01-11 18:52 ` Thomas Fjellstrom
@ 2013-01-12 0:47 ` Phil Turmel
2013-01-12 3:56 ` Chris Murphy
1 sibling, 1 reply; 53+ messages in thread
From: Phil Turmel @ 2013-01-12 0:47 UTC (permalink / raw)
To: Chris Murphy; +Cc: linux-raid Raid
On 01/11/2013 12:46 PM, Chris Murphy wrote:
>
> On Jan 11, 2013, at 10:39 AM, Chris Murphy <lists@colorremedies.com>
> wrote:
>>
>> They probably have a high ERC time out as all consumer disks do so
>> you should also check /sys/block/sdX/device/timeout and make sure
>> it's not significantly less than the drive. It may be possible for
>> smartctl or hdparm to figure out what the drive ERC timeout is.
>>
>> http://cgi.csc.liv.ac.uk/~greg/projects/erc/
>
> Actually what I wrote is misleading to the point it's wrong. You want
> the linux device time out to be greater than the device timeout. The
> device needs to be allowed to give up, and report back a read error
> to linux/md, so that md knows it should reconstruct the missing data
> from parity, and overwrite the (obviously) bad blocks causing the
> read error.
>
> If the linux device time out is even a little bit less than the
> drive's timeout, md never gets the sector read error, doesn't repair
> it, since linux boots the whole drive. Now instead of repairing a few
> sectors, you have a degraded array on your hands. Usual consumer
> drive time outs are quite high, they can be up to a couple minutes
> long. Linux device time out is 30 seconds.
This isn't quite right. When the linux driver stack times out, it
passes the error to MD. MD doesn't care if the drive reported the
error, or if the controller reported the error, it just knows that it
couldn't read that block. It goes to recovery, which typically
generates the replacement data in a few milliseconds, and tries to write
back to the first disk. *That* instantly fails, since the controller is
resetting the link and the drive is still in la-la land trying to read
the data. MD will tolerate several bad reads before it kicks out a
drive, but will immediately kick if a write fails.
By the time you come to investigate, the drive has completed its
timeout, the link has reset, and the otherwise good drive is sitting
idle (failed).
Any array running with mismatched timeouts will kick a drive on every
unrecoverable read error, where it would likely have just fixed it.
Sadly, many hobbyist arrays are built with desktop drives, and the
timeouts are left mismatched. When that hobbyist later learns s/he
should be scrubbing, the long-overdue scrub is very likely to produce
UREs on multiple drives (BOOM).
Phil
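The usual mitigation, as a sketch; sdb stands in for each member disk, and the values are common choices rather than anything mandated:
  # preferred: cap the drive's own error recovery at 7 seconds (units are tenths of a second)
  smartctl -l scterc,70,70 /dev/sdb
  # fallback for drives without SCT ERC: raise the kernel timeout well past the drive's retries
  echo 180 > /sys/block/sdb/device/timeout
  # both settings are volatile, so reapply them for every member from a boot script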
* Re: recommended way to add ssd cache to mdraid array
2013-01-11 22:17 ` Stan Hoeppner
@ 2013-01-12 2:44 ` Thomas Fjellstrom
2013-01-12 8:33 ` Stan Hoeppner
0 siblings, 1 reply; 53+ messages in thread
From: Thomas Fjellstrom @ 2013-01-12 2:44 UTC (permalink / raw)
To: stan; +Cc: Chris Murphy, linux-raid Raid
On Fri Jan 11, 2013, Stan Hoeppner wrote:
> On 1/11/2013 12:51 PM, Thomas Fjellstrom wrote:
> >> It's not but as Stan writes it may not be preferred for your
> >> application.
> >
> > Hm, ok, it is in my mount line, so I'll try it with that off. Though I
> > would be interested in hearing from Stan why it may not be good.
>
> You should not do this. This parameter is not a toggle switch intended
> to be flipped on/off at will, but changed only once and left there.
Makes me wonder why they are mount options if they aren't meant to ever be
changed?
> Never change XFS parameters willy-nilly without knowing the
> consequences. And currently you certainly do not know them.
>
> I don't have time for the detailed explanation. Ask on the XFS list.
Alright, thanks for your help :)
--
Thomas Fjellstrom
thomas@fjellstrom.ca
* Re: recommended way to add ssd cache to mdraid array
2013-01-11 18:50 ` Stan Hoeppner
@ 2013-01-12 2:45 ` Thomas Fjellstrom
0 siblings, 0 replies; 53+ messages in thread
From: Thomas Fjellstrom @ 2013-01-12 2:45 UTC (permalink / raw)
To: stan; +Cc: Chris Murphy, linux-raid Raid
On Fri Jan 11, 2013, Stan Hoeppner wrote:
> On 1/11/2013 6:20 AM, Thomas Fjellstrom wrote:
> > On Thu Jan 10, 2013, Chris Murphy wrote:
> >> Another thing to look at is if you're using XFS, what your mount options
> >> are. Invariably with an array of this size you need to be mounting with
> >> the inode64 option.
> >
> > I'm not sure, but I think that's the default.
>
> No, inode32 has always been the default allocator. It was decided just
> recently to make inode64 the default, and the patcheset for this was
> committed on 9/20/2012, ~3 months ago, into 3.6-rc1-17:
Ahh. I somewhat sorta try and keep up to date on kernel happenings, and after
a while things get muddled. So that's probably where I got that from.
> http://oss.sgi.com/archives/xfs/2012-09/msg00397.html
>
> So with 3.6+ kernels inode64 is the new default. Any current mainstream
> distro kernel defaults to inode32.
>
> Worth noting: if you upgrade the kernel on a system with existing
> inode32 XFS filesystems the allocator for these will remain inode32
> unless the mount option is manually changed. This is the same behavior
> as with current kernels. New filesystems created after upgrading to a
> 64 bit kernel w/this patch set will be mounted with inode64. IIRC 32
> bit kernels are limited to the inode32 allocator and 32 bit inodes.
>
> There are many workloads that prefer, or even require, 32bit inodes,
> and/or the behavior available with the inode32 allocator. Thus it would
> not be smart to auto convert them to inode64 after a kernel upgrade.
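(Concretely, asking for the new allocator explicitly just means putting
it in the mount options; a hypothetical fstab line, with /dev/md0 and
/mnt/data standing in for the real device and mount point:

  /dev/md0  /mnt/data  xfs  defaults,inode64  0  0

Leaving it out keeps whatever the running kernel defaults to.)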
--
Thomas Fjellstrom
thomas@fjellstrom.ca
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: recommended way to add ssd cache to mdraid array
2013-01-12 0:47 ` Phil Turmel
@ 2013-01-12 3:56 ` Chris Murphy
2013-01-13 22:13 ` Phil Turmel
0 siblings, 1 reply; 53+ messages in thread
From: Chris Murphy @ 2013-01-12 3:56 UTC (permalink / raw)
To: Phil Turmel; +Cc: linux-raid Raid
On Jan 11, 2013, at 5:47 PM, Phil Turmel <philip@turmel.org> wrote:
> On 01/11/2013 12:46 PM, Chris Murphy wrote:
>>
>> On Jan 11, 2013, at 10:39 AM, Chris Murphy <lists@colorremedies.com>
>> wrote:
>>>
>>> They probably have a high ERC time out as all consumer disks do so
>>> you should also check /sys/block/sdX/device/timeout and make sure
>>> it's not significantly less than the drive. It may be possible for
>>> smartctl or hdparm to figure out what the drive ERC timeout is.
>>>
>>> http://cgi.csc.liv.ac.uk/~greg/projects/erc/
>>
>> Actually what I wrote is misleading to the point it's wrong. You want
>> the linux device time out to be greater than the device timeout. The
>> device needs to be allowed to give up, and report back a read error
>> to linux/md, so that md knows it should reconstruct the missing data
>> from parity, and overwrite the (obviously) bad blocks causing the
>> read error.
>>
>> If the linux device time out is even a little bit less than the
>> drive's timeout, md never gets the sector read error, doesn't repair
>> it, since linux boots the whole drive. Now instead of repairing a few
>> sectors, you have a degraded array on your hands. Usual consumer
>> drive time outs are quite high, they can be up to a couple minutes
>> long. Linux device time out is 30 seconds.
>
> This isn't quite right. When the linux driver stack times out, it
> passes the error to MD. MD doesn't care if the drive reported the
> error, or if the controller reported the error, it just knows that it
> couldn't read that block. It goes to recovery, which typically
> generates the replacement data in a few milliseconds, and tries to write
> back to the first disk. *That* instantly fails, since the controller is
> resetting the link and the drive is still in la-la land trying to read
> the data. MD will tolerate several bad reads before it kicks out a
> drive, but will immediately kick if a write fails.
>
> By the time you come to investigate, the drive has completed its
> timeout, the link has reset, and the otherwise good drive is sitting
> idle (failed).
I admit I omitted the handling of the error md gets in the case of linux itself timing out the drive, because I don't know how that's handled. For example:
When you say, "the linux driver stack times out, it passes the error to MD," what error is passed? Is it the same (I think it's 0x40) read error that the drive would have produced, along with affected LBAs? Does the driver know the affected LBA's, maybe by inference? Otherwise md wouldn't know what replacement data to generate. Or is it a different error, neither a read nor write error, that causes md to bounce the drive wholesale?
>
> Any array running with mismatched timeouts will kick a drive on every
> unrecoverable read error, where it would likely have just fixed it.
This is the key phrase I was trying to get at.
> Sadly, many hobbyist arrays are built with desktop drives, and the
> timeouts are left mismatched. When that hobbyist later learns s/he
> should be scrubbing, the long-overdue scrub is very likely to produce
> UREs on multiple drives (BOOM).
Or even if they have been scrubbing all along. If the drive recovers the data inside of 30 seconds, and also doesn't relocate the data to a new sector (? I have no idea when drives do this on their own; I know they will do it on a write failure but I'm unclear when they do it on persistent read "difficulty"), the scrub has no means of even being aware there's a problem to fix!
Given the craptastic state of affairs that manufacturers disallow a simple setting change to ask the drive to do LESS error correction, the recommendation to buy a different drive that can be so configured is the best suggestion. Alternative 1 is to change the linux driver timeout to maybe upwards of two minutes, and then deal with the fallout of that behavior, which could be worse than a drive being booted out of the array sooner. A very distant alternative 2 is to zero or Secure Erase the drive every so often, in hopes of avoiding bad sectors altogether; that is tedious, and implies either putting the drive into a degraded state or cycling in a spare drive. And at the point you're going to buy a spare drive for this fiasco, you might as well just buy drives suited for the purpose.
How's that?
Chris Murphy
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: recommended way to add ssd cache to mdraid array
2013-01-12 2:44 ` Thomas Fjellstrom
@ 2013-01-12 8:33 ` Stan Hoeppner
2013-01-12 14:44 ` Thomas Fjellstrom
0 siblings, 1 reply; 53+ messages in thread
From: Stan Hoeppner @ 2013-01-12 8:33 UTC (permalink / raw)
To: thomas; +Cc: Chris Murphy, linux-raid Raid
On 1/11/2013 8:44 PM, Thomas Fjellstrom wrote:
> On Fri Jan 11, 2013, Stan Hoeppner wrote:
>> On 1/11/2013 12:51 PM, Thomas Fjellstrom wrote:
>>>> It's not but as Stan writes it may not be preferred for your
>>>> application.
>>>
>>> Hm, ok, it is in my mount line, so I'll try it with that off. Though I
>>> would be interested in hearing from Stan why it may not be good.
>>
>> You should not do this. This parameter is not a toggle switch intended
>> to be flipped on/off at will, but changed only once and left there.
>
> Makes me wonder why they are mount options if they aren't meant to ever be
> changed?
I'll answer that question with a question: If you were to implement a
new (secondary) allocator on a 10 year old filesystem, by what mechanism
would you have the user enable it?
You can't change the allocator while the filesystem is mounted, so you
can't do this with a sysctl. So if you must remount the filesystem to
enable the new allocator, where do you enable it?
Make sense yet?
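Concretely, enabling the new allocator means taking the filesystem
offline and mounting it again with the option set. A minimal sketch,
with /dev/md0 and /mnt/data as placeholders:

  umount /mnt/data
  mount -o inode64 /dev/md0 /mnt/data

There is no sysctl-style knob that could do that on a live filesystem.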
>> Never change XFS parameters willy-nilly without knowing the
>> consequences. And currently you certainly do not know them.
>>
>> I don't have time for the detailed explanation. Ask on the XFS list.
>
> Alright, thanks for your help :)
And as the last few times over many months, you never post to the XFS
list. Which tells me you really don't care to learn this stuff.
--
Stan
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: recommended way to add ssd cache to mdraid array
2013-01-10 21:36 ` Chris Murphy
2013-01-11 0:18 ` Stan Hoeppner
2013-01-11 12:20 ` Thomas Fjellstrom
@ 2013-01-12 12:06 ` Roy Sigurd Karlsbakk
2013-01-12 14:14 ` Stan Hoeppner
2 siblings, 1 reply; 53+ messages in thread
From: Roy Sigurd Karlsbakk @ 2013-01-12 12:06 UTC (permalink / raw)
To: Chris Murphy; +Cc: linux-raid Raid
> Unlikely to make up the difference is the scheduler. Parallel fs's
> like XFS don't perform nearly as well with CFQ, so you should have a
> kernel parameter elevator=noop.
uh… how can the filesystem chosen be relevant to the disk elevator? CFQ will try to optimize access to reduce seeks and should be completely independent of the filesystem used on top. Also, if you really want to disable CFQ (which you shouldn't do on rotating rust, IMO), better to use deadline (which is the default on newer kernels).
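For what it's worth, the scheduler is per block device and can be
checked or changed at runtime, so there's no need for a global
elevator= boot parameter just to test. A quick sketch, with sdX as a
placeholder:

  cat /sys/block/sdX/queue/scheduler    # the active one is shown in [brackets]
  echo deadline > /sys/block/sdX/queue/scheduler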
Vennlige hilsener / Best regards
roy
--
Roy Sigurd Karlsbakk
(+47) 98013356
roy@karlsbakk.net
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of xenotypic etymology. In most cases, adequate and relevant synonyms exist in Norwegian.
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: recommended way to add ssd cache to mdraid array
2013-01-12 12:06 ` Roy Sigurd Karlsbakk
@ 2013-01-12 14:14 ` Stan Hoeppner
2013-01-12 16:37 ` Roy Sigurd Karlsbakk
0 siblings, 1 reply; 53+ messages in thread
From: Stan Hoeppner @ 2013-01-12 14:14 UTC (permalink / raw)
To: Roy Sigurd Karlsbakk; +Cc: Chris Murphy, linux-raid Raid
On 1/12/2013 6:06 AM, Roy Sigurd Karlsbakk wrote:
>> Unlikely to make up the difference is the scheduler. Parallel fs's
>> like XFS don't perform nearly as well with CFQ, so you should have a
>> kernel parameter elevator=noop.
>
> uh… how can the filesystem chosen be relevant to the disk elevator?
It's the other way round. The chosen elevator can cause problems with
the filesystem. You should find this relevant conversation amongst the
lead XFS developers educational:
http://oss.sgi.com/archives/xfs/2011-07/msg00464.html
> CFQ will try to optimize access to reduce seeks
"Completely Fair Queuing" -- The name alone tells you how it works. It
most certainly does not do what you state. Please read the brief
Wikipedia article: http://en.wikipedia.org/wiki/CFQ
> and should be completely independent on the filesystem used on top.
Operative word: "should"
The USA should not be $16 Trillion in debt, but it is. By international
law whales should not be killed, but they still are. Etc.
--
Stan
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: recommended way to add ssd cache to mdraid array
2013-01-12 8:33 ` Stan Hoeppner
@ 2013-01-12 14:44 ` Thomas Fjellstrom
0 siblings, 0 replies; 53+ messages in thread
From: Thomas Fjellstrom @ 2013-01-12 14:44 UTC (permalink / raw)
To: stan; +Cc: Chris Murphy, linux-raid Raid
On Sat Jan 12, 2013, Stan Hoeppner wrote:
> On 1/11/2013 8:44 PM, Thomas Fjellstrom wrote:
> > On Fri Jan 11, 2013, Stan Hoeppner wrote:
> >> On 1/11/2013 12:51 PM, Thomas Fjellstrom wrote:
> >>>> It's not but as Stan writes it may not be preferred for your
> >>>> application.
> >>>
> >>> Hm, ok, it is in my mount line, so I'll try it with that off. Though I
> >>> would be interested in hearing from Stan why it may not be good.
> >>
> >> You should not do this. This parameter is not a toggle switch intended
> >> to be flipped on/off at will, but changed only once and left there.
> >
> > Makes me wonder why they are mount options if they aren't meant to ever
> > be changed?
>
> I'll answer that question with a question: If you were to implement a
> new (secondary) allocator on a 10 year old filesystem, by what mechanism
> would you have the user enable it?
>
> You can't change the allocator while the filesystem is mounted, so you
> can't do this with a sysctl. So if you must remount the filesystem to
> enable the new allocator, where do you enable it?
>
> Make sense yet?
You could have a tool like tune2fs; if you have to remount anyway, putting
the not-really-an-option option behind a tool would make it less dangerous
for people who don't know these settings aren't actually meant to be changed.
> >> Never change XFS parameters willy-nilly without knowing the
> >> consequences. And currently you certainly do not know them.
> >>
> >> I don't have time for the detailed explanation. Ask on the XFS list.
> >
> > Alright, thanks for your help :)
>
> And as the last few times over many months, you never post to the XFS
> list. Which tells me you really don't care to learn this stuff.
I don't think there's a reason to be hostile. I had been meaning to join the
XFS list, but usually, I get so far into looking into this stuff, before I run
out of time and have to get back to work. As it is this new NAS box has been
sitting on the floor doing nothing for a month or so waiting for me to have
time to finish configuring it.
--
Thomas Fjellstrom
thomas@fjellstrom.ca
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: recommended way to add ssd cache to mdraid array
2013-01-12 14:14 ` Stan Hoeppner
@ 2013-01-12 16:37 ` Roy Sigurd Karlsbakk
0 siblings, 0 replies; 53+ messages in thread
From: Roy Sigurd Karlsbakk @ 2013-01-12 16:37 UTC (permalink / raw)
To: stan; +Cc: Chris Murphy, linux-raid Raid
> > uh… how can the filesystem chosen be relevant to the disk elevator?
>
> It's the other way round. The chosen elevator can cause problems with
> the filesystem. You should find this relevant conversation amongst the
> lead XFS developers educational:
>
> http://oss.sgi.com/archives/xfs/2011-07/msg00464.html
This article is about using CFQ with RAID controllers with write cache, which isn't something I'd recommend. CFQ is, if I'm not mistaken, good if Linux has local access to each disk, and the disks are spinning (!SSD).
> > CFQ will try to optimize access to reduce seeks
>
> "Completely Fair Queuing" -- The name alone tells you how it works. It
> most certainly does not do what you state. Please read the brief
> Wikipedia article: http://en.wikipedia.org/wiki/CFQ
It's how it's designed, for spinning rust.
> > and should be completely independent on the filesystem used on top.
>
> Operative word: "should"
>
> The USA should not be $16 Trillion in debt, but it is. By
> international
> law whales should not be killed, but they still are. Etc.
Well, most things don't work as they are intended to do, but I guess CFQ works pretty well for spinning rust directly attached to a Linux box.
The default scheduler was changed in the kernel, but only because SSDs are becoming more common. Wouldn't those kernel developers know their stuff?
Vennlige hilsener / Best regards
roy
--
Roy Sigurd Karlsbakk
(+47) 98013356
roy@karlsbakk.net
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of xenotypic etymology. In most cases, adequate and relevant synonyms exist in Norwegian.
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: recommended way to add ssd cache to mdraid array
2013-01-11 18:51 ` Thomas Fjellstrom
2013-01-11 22:17 ` Stan Hoeppner
@ 2013-01-13 19:18 ` Chris Murphy
2013-01-14 9:06 ` Thomas Fjellstrom
1 sibling, 1 reply; 53+ messages in thread
From: Chris Murphy @ 2013-01-13 19:18 UTC (permalink / raw)
To: Thomas Fjellstrom; +Cc: linux-raid Raid
On Jan 11, 2013, at 11:51 AM, Thomas Fjellstrom <thomas@fjellstrom.ca> wrote:
> On Fri Jan 11, 2013, you wrote:
>> OK fair enough, and as asked earlier, what's the chunk size?
>
> Ah, sorry, I missed that bit. I didn't tweak the chunk size, so I have the
> default 512K.
OK now that the subject has been sufficiently detoured…
Benchmarking can perhaps help you isolate causes for problems you're having, but you haven't said what problems you're having. You're using benchmarking data to go looking for a problem. Is this a GigE network to the NAS? If it's wireless, who cares; all of these array numbers are better than wireless bandwidth and latency. If it's GigE, all of your sequential read/write numbers exceed the bandwidth of the wire. So what's the problem? You want better random reads/writes? Rebuild the array with a 32KB or 64KB chunk size, knowing you may take a small hit on the larger sequential reads/writes.
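If you do go that route, recreating the array is the simplest way to get a different chunk size. A sketch only, with device names as placeholders; note that --create wipes the existing array, so the data has to come back from a copy afterwards:

  mdadm --create /dev/md0 --level=6 --raid-devices=7 --chunk=64 /dev/sd[b-h]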
Chris Murphy
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: recommended way to add ssd cache to mdraid array
2013-01-12 3:56 ` Chris Murphy
@ 2013-01-13 22:13 ` Phil Turmel
2013-01-13 23:20 ` Chris Murphy
0 siblings, 1 reply; 53+ messages in thread
From: Phil Turmel @ 2013-01-13 22:13 UTC (permalink / raw)
To: Chris Murphy; +Cc: linux-raid Raid
On 01/11/2013 10:56 PM, Chris Murphy wrote:
>
> On Jan 11, 2013, at 5:47 PM, Phil Turmel <philip@turmel.org> wrote:
[trim /]
>> This isn't quite right. When the linux driver stack times out, it
>> passes the error to MD. MD doesn't care if the drive reported the
>> error, or if the controller reported the error, it just knows
>> that it couldn't read that block. It goes to recovery, which
>> typically generates the replacement data in a few milliseconds, and
>> tries to write back to the first disk. *That* instantly fails,
>> since the controller is resetting the link and the drive is still
>> in la-la land trying to read the data. MD will tolerate several
>> bad reads before it kicks out a drive, but will immediately kick if
>> a write fails.
>>
>> By the time you come to investigate, the drive has completed its
>> timeout, the link has reset, and the otherwise good drive is
>> sitting idle (failed).
>
> I admit I omitted the handling of the error md gets in the case of
> linux itself timing out the drive, because I don't know how that's
> handled. For example:
>
> When you say, "the linux driver stack times out, it passes the error
> to MD," what error is passed? Is it the same (I think it's 0x40) read
> error that the drive would have produced, along with affected LBAs?
> Does the driver know the affected LBA's, maybe by inference?
> Otherwise md wouldn't know what replacement data to generate. Or is
> it a different error, neither a read nor write error, that causes md
> to bounce the drive wholesale?
I haven't examined the code in detail, just watched patches pass on the
list. :-) But as I understand it, the error is returned with the
request that it belongs to, and the MD does not look at the drive error
code itself. So MD know what read it was, for which member devices, but
doesn't care if the error came from the drive itself, or the controller,
or the driver.
>
>> Any array running with mismatched timeouts will kick a drive on
>> every unrecoverable read error, where it would likely have just
>> fixed it.
>
> This is the key phrase I was trying to get at.
>
>> Sadly, many hobbyist arrays are built with desktop drives, and the
>> timeouts are left mismatched. When that hobbyist later learns
>> s/he should be scrubbing, the long-overdue scrub is very likely to
>> produce UREs on multiple drives (BOOM).
>
> Or even if they have been scrubbing all along.
Yes. But in that case, they've probably lost arrays without
understanding why.
> If the drive recovers the data inside of 30 seconds, and also doesn't
> relocated the data to a new sector (? I have no idea when drives do
> this on their own; I know they will do it on a write failure but I'm
> unclear when they do it on persistent read "difficulty") the scrub
> has no means of even being aware there's a problem to fix!
I understand that it varies. But drives generally only reallocate on a
write, and only if they are primed to verify the write at that sector by
a previous URE at that spot. Those show up in a smartctl report as
"Pending".
> Given the craptastic state of affairs that manufacturers disallow a
> simple setting change to ask the drive to do LESS error correction,
> the recommendation to buy a different drive that can be so
> configured, is the best suggestion.
For a while, it was Hitachi Deskstar or enterprise. Western Digital's
new "Red" series appears to be an attempt to deal with the backlash.
> Alternative 1 is to change the linux driver timeout to maybe upwards
> of two minutes, and then deal with the fall out of that behavior,
> which could be worse than a drive being booted out of the array
> sooner.
Yes. Some servers will time out a connection in 90 seconds if a reply
is delayed. To be safe with desktop drives, a timeout of 120 seconds
seems to be necessary. I wouldn't be surprised if certain drives needed
more, but I have insufficient experience.
> And a very distant alternative 2 is to zero or Secure Erase the drive
> every so often, in hopes of avoiding bad sectors altogether — is
> tedious, as well as implies either putting the drive into a degraded
> state, or cycling a spare drive.
No. This still isn't safe. UREs can happen at any time, and are spec'd
to occur at about every 12TB read. Even on a freshly wiped drive.
Spares don't help either, as a rebuild onto a spare stresses the rest of
the array, and is likely to expose any developing UREs.
> And at the point you're going to buy a spare drive for this fiasco,
> you might as well just buy drives suited for the purpose.
Options are:
A) Buy Enterprise drives. They have appropriate error timeouts and work
properly with MD right out of the box.
B) Buy Desktop drives with SCTERC support. They have inappropriate
default timeouts, but can be set to an appropriate value. Udev or boot
script assistance is needed to call smartctl to set it. They do *not*
work properly with MD out of the box.
C) Suffer with desktop drives without SCTERC support. They cannot be
set to appropriate error timeouts. Udev or boot script assistance is
needed to set a 120 second driver timeout in sysfs (a sketch of such a
script follows below). They do *not* work properly with MD out of the box.
D) Lose your data during spare rebuild after your first URE. (Odds in
proportion to array size.)
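For options B and C, the boot-time assistance can be as simple as a
script along these lines. This is a sketch only: device names are
placeholders, exact smartctl output differs between drives, and a udev
rule can do the same job more robustly.

  for dev in sdb sdc sdd sde sdf sdg sdh; do
      if smartctl -l scterc,70,70 /dev/$dev | grep -q seconds; then
          :  # drive accepted a 7 second ERC limit (option B);
             # the kernel's 30 second default is then fine
      else
          # option C: no SCTERC support, so raise the kernel timeout
          echo 120 > /sys/block/$dev/device/timeout
      fi
  done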
One last point bears repeating: MD is *not* a backup system, although
some people leverage its features for rotating off-site backup disks.
Raid arrays are all about *uptime*. They will not save you from
accidental deletion or other operator errors. They will not save you if
your office burns down. You need a separate backup system for critical
files.
Phil
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: recommended way to add ssd cache to mdraid array
2013-01-13 22:13 ` Phil Turmel
@ 2013-01-13 23:20 ` Chris Murphy
2013-01-14 0:23 ` Phil Turmel
0 siblings, 1 reply; 53+ messages in thread
From: Chris Murphy @ 2013-01-13 23:20 UTC (permalink / raw)
To: Phil Turmel; +Cc: linux-raid Raid
On Jan 13, 2013, at 3:13 PM, Phil Turmel <philip@turmel.org> wrote:
>
> I haven't examined the code in detail, just watched patches pass on the
> list. :-) But as I understand it, the error is returned with the
> request that it belongs to, and the MD does not look at the drive error
> code itself. So MD know what read it was, for which member devices, but
> doesn't care if the error came from the drive itself, or the controller,
> or the driver.
I think it does, but I don't know diddly about the code.
Only the drive knows what LBA's exhibit read and write errors. The linux driver doesn't. And the linux driver times out after 30 seconds by default, which is an eternity. There isn't just one request pending. There are dozens to hundreds of pending commands, representing possibly tens of thousands of LBAs in the drive's own cache. Once the drive goes into sector recovery, I'm pretty sure SATA drives, unlike SAS, basically go silent. That's probably when the linux timeout counter starts, in the meantime md is still talking to the linux driver making more requests.
This is a huge pile of just requests, not even the data it represents. Some of those requests made it to the drive, some are with the linux driver. I think it's much easier/efficient, when linux block device driver timeout arrives, for the linux driver to just nullify all requests, gives the drive the boot (lalala i can't hear you i don't care if you start talking to me again in 60 seconds), and tells md with a single error that the drive isn't even available. And I'd expect md does the only thing it can do if it gets such an error which is the same as a write error; it flags the device in the array as faulty. I'd be surprised if it tried to reconstruct data at all in such a case, without an explicit read error and LBA reported by the drive.
But I don't know the code, so I'm talking out my ass.
>>
>>> Any array running with mismatched timeouts will kick a drive on
>>> every unrecoverable read error, where it would likely have just
>>> fixed it.
>>
>> This is the key phrase I was trying to get at.
>>
>>> Sadly, many hobbyist arrays are built with desktop drives, and the
>>> timeouts are left mismatched. When that hobbyist later learns
>>> s/he should be scrubbing, the long-overdue scrub is very likely to
>>> produce UREs on multiple drives (BOOM).
>>
>> Or even if they have been scrubbing all along.
>
> Yes. But in that case, they've probably lost arrays without
> understanding why.
Maybe. I don't have data on this. If recovery occurs in less than 30 seconds, they effectively get no indication. They'd have to be looking at ECC errors recorded by SMART. And not all drives record that attribute.
>> If the drive recovers the data inside of 30 seconds, and also doesn't
>> relocated the data to a new sector (? I have no idea when drives do
>> this on their own; I know they will do it on a write failure but I'm
>> unclear when they do it on persistent read "difficulty") the scrub
>> has no means of even being aware there's a problem to fix!
>
> I understand that it varies. But drives generally only reallocate on a
> write, and only if they are primed to verify the write at that sector by
> a previous URE at that spot. Those show up in a smartctl report as
> "Pending".
I'm pretty sure that attribute 197, usually called current pending sector, is only due to unrecoverable read errors, not to a sector where ECC detects a transient read error and can correct it. With an uncorrectable error, the firmware doesn't want to relocate the data on the sector because it clearly can't read it correctly, so it just leaves it there until the sector is written; if there's a persistent write failure at that point, it gets remapped to a reserve sector. I don't know off hand if there is a read-specific remap count, but attribute 5 'reallocated sectors' appears to cover read or write remaps. *shrug*
I have several HDDs that have attribute 197, but not attribute 5. And an SSD with only attribute 5, not 197.
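Checking is easy enough; with sdX as a placeholder, something like

  smartctl -A /dev/sdX | egrep 'Reallocated_Sector|Current_Pending'

shows attributes 5 and 197 side by side, if the drive reports them at all.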
>
>> Given the craptastic state of affairs that manufacturers disallow a
>> simple setting change to ask the drive to do LESS error correction,
>> the recommendation to buy a different drive that can be so
>> configured, is the best suggestion.
>
> For a while, it was Hitachi Deskstar or enterprise. Western Digital's
> new "Red" series appears to be an attempt to deal with the backlash.
Yeah I know. :-) But I mean all drives could have this. It's a request for LESS, not more. I'm not asking for better ECC, although that would be nice, merely a faster time out as a settable option.
>
>> And a very distant alternative 2 is to zero or Secure Erase the drive
>> every so often, in hopes of avoiding bad sectors altogether — is
>> tedious, as well as implies either putting the drive into a degraded
>> state, or cycling a spare drive.
>
> No. This still isn't safe. UREs can happen at any time, and are spec'd
> to occur at about every 12TB read. Even on a freshly wiped drive.
> Spares don't help either, as a rebuild onto a spare stresses the rest of
> the array, and is likely to expose any developing UREs.
Technically the spec statistic is "less than" 1 bit in 10^14 for a consumer disk. So it's not that you will get a URE at 12TB, but that you should be able to read at least 11.37TiB without a URE. It's entirely within the tolerance if the mean occurrence happens 2 bits shy of 10^15 bits, or 113TiB. By using "less than" the value is not a mean. It's likely a lot higher than 12TB, or we'd have total mayhem by now. That's only 3 reads of a 4TB drive otherwise. It's bad, but not that bad.
>
>> And at the point you're going to buy a spare drive for this fiasco,
>> you might as well just buy drives suited for the purpose.
>
> Options are:
>
> A) Buy Enterprise drives. They have appropriate error timeouts and work
> properly with MD right out of the box.
>
> B) Buy Desktop drives with SCTERC support. They have inappropriate
> default timeouts, but can be set to an appropriate value. Udev or boot
> script assistance is needed to call smartctl to set it. They do *not*
> work properly with MD out of the box.
>
> C) Suffer with desktop drives without SCTERC support. They cannot be
> set to appropriate error timeouts. Udev or boot script assistance is
> needed to set a 120 second driver timeout in sysfs. They do *not* work
> properly with MD out of the box.
>
> D) Lose your data during spare rebuild after your first URE. (Odds in
> proportion to array size.)
That's a good summary.
>
> One last point bears repeating: MD is *not* a backup system, although
> some people leverage it's features for rotating off-site backup disks.
> Raid arrays are all about *uptime*. They will not save you from
> accidental deletion or other operator errors. They will not save you if
> your office burns down. You need a separate backup system for critical
> files.
Yeah and that's why I'm sorta leery of this RAID 6 setup in the home. I think that people are reading that the odds of an array failure with RAID 5 are so high that they are better off adding one more drive for dual-parity, and *still* not having a real backup and restore plan. As if the RAID 6 is the faux-backup plan.
Some home NAS's, with BluRay vids, are so big that people just either need to stop such behavior, or get a used LTO 2 or 3 drive for their gargantuan backups.
Chris
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: recommended way to add ssd cache to mdraid array
2013-01-11 12:35 ` Thomas Fjellstrom
2013-01-11 12:48 ` Thomas Fjellstrom
@ 2013-01-14 0:05 ` Tommy Apel Hansen
2013-01-14 8:58 ` Thomas Fjellstrom
1 sibling, 1 reply; 53+ messages in thread
From: Tommy Apel Hansen @ 2013-01-14 0:05 UTC (permalink / raw)
To: thomas; +Cc: stan, Chris Murphy, linux-raid Raid
Could you do me a favor and run the iozone test with the -I switch on, so
that we can see the actual speed of the array and not your RAM?
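Something like this should do it; -I asks iozone to use O_DIRECT so the
page cache is bypassed (file size and record length here are only
examples):

  iozone -a -I -s 32G -r 8M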
/Tommy
On Fri, 2013-01-11 at 05:35 -0700, Thomas Fjellstrom wrote:
> On Thu Jan 10, 2013, Stan Hoeppner wrote:
> > On 1/10/2013 3:36 PM, Chris Murphy wrote:
> > > On Jan 10, 2013, at 3:49 AM, Thomas Fjellstrom <thomas@fjellstrom.ca> wrote:
> > >> A lot of it will be streaming. Some may end up being random read/writes.
> > >> The test is just to gauge over all performance of the setup. 600MBs
> > >> read is far more than I need, but having writes at 1/3 that seems odd
> > >> to me.
> > >
> > > Tell us how many disks there are, and what the chunk size is. It could be
> > > too small if you have too few disks which results in a small full stripe
> > > size for a video context. If you're using the default, it could be too
> > > big and you're getting a lot of RWM. Stan, and others, can better answer
> > > this.
> >
> > Thomas is using a benchmark, and a single one at that, to judge the
> > performance. He's not using his actual workloads. Tuning/tweaking to
> > increase the numbers in a benchmark could be detrimental to actual
> > performance instead of providing a boost. One must be careful.
> >
> > Regarding RAID6, it will always have horrible performance compared to
> > non-parity RAID levels and even RAID5, for anything but full stripe
> > aligned writes, which means writing new large files or doing large
> > appends to existing files.
>
> Considering its a rather simple use case, mostly streaming video, and misc
> file sharing for my home network, an iozone test should be rather telling.
> Especially the full test, from 4k up to 16mb
>
> random random bkwd record stride
> KB reclen write rewrite read reread read write read rewrite read fwrite frewrite fread freread
> 33554432 4 243295 221756 628767 624081 1028 4627 16822 7468777 17740 233295 231092 582036 579131
> 33554432 8 241134 225728 628264 627015 2027 8879 25977 10030302 19578 228923 233928 591478 584892
> 33554432 16 233758 228122 633406 618248 3952 13635 35676 10166457 19968 227599 229698 579267 576850
> 33554432 32 232390 219484 625968 625627 7604 18800 44252 10728450 24976 216880 222545 556513 555371
> 33554432 64 222936 206166 631659 627823 14112 22837 52259 11243595 30251 196243 192755 498602 494354
> 33554432 128 214740 182619 628604 626407 25088 26719 64912 11232068 39867 198638 185078 463505 467853
> 33554432 256 202543 185964 626614 624367 44363 34763 73939 10148251 62349 176724 191899 593517 595646
> 33554432 512 208081 188584 632188 629547 72617 39145 84876 9660408 89877 182736 172912 610681 608870
> 33554432 1024 196429 166125 630785 632413 116793 51904 133342 8687679 121956 168756 175225 620587 616722
> 33554432 2048 185399 167484 622180 627606 188571 70789 218009 5357136 370189 171019 166128 637830 637120
> 33554432 4096 198340 188695 632693 628225 289971 95211 278098 4836433 611529 161664 170469 665617 655268
> 33554432 8192 177919 167524 632030 629077 371602 115228 384030 4934570 618061 161562 176033 708542 709788
> 33554432 16384 196639 183744 631478 627518 485622 133467 462861 4890426 644615 175411 179795 725966 734364
>
> > However, everything is relative. This RAID6 may have plenty of random
> > and streaming write/read throughput for Thomas. But a single benchmark
> > isn't going to inform him accurately.
>
> 200MB/s may be enough, but the difference between the read and write
> throughput is a bit unexpected. It's not a weak machine (core i3-2120, dual
> core 3.2Ghz with HT, 16GB ECC 1333Mhz ram), and this is basically all its
> going to be doing.
>
> > > You said these are unpartitioned disks, I think. In which case alignment
> > > of 4096 byte sectors isn't a factor if these are AF disks.
> > >
> > > Unlikely to make up the difference is the scheduler. Parallel fs's like
> > > XFS don't perform nearly as well with CFQ, so you should have a kernel
> > > parameter elevator=noop.
> >
> > If the HBAs have [BB|FB]WC then one should probably use noop as the
> > cache schedules the actual IO to the drives. If the HBAs lack cache,
> > then deadline often provides better performance. Testing of each is
> > required on a system and workload basis. With two identical systems
> > (hardware/RAID/OS) one may perform better with noop, the other with
> > deadline. The determining factor is the applications' IO patterns.
>
> Mostly streaming reads, some long rsync's to copy stuff back and forth, file
> share duties (downloads etc).
>
> > > Another thing to look at is md/stripe_cache_size which probably needs to
> > > be higher for your application.
> > >
> > > Another thing to look at is if you're using XFS, what your mount options
> > > are. Invariably with an array of this size you need to be mounting with
> > > the inode64 option.
> >
> > The desired allocator behavior is independent of array size but, once
> > again, dependent on the workloads. inode64 is only needed for large
> > filesystems with lots of files, where 1TB may not be enough for the
> > directory inodes. Or, for mixed metadata/data heavy workloads.
> >
> > For many workloads including databases, video ingestion, etc, the
> > inode32 allocator is preferred, regardless of array size. This is the
> > linux-raid list so I'll not go into detail of the XFS allocators.
>
> If you have the time and the desire, I'd like to hear about it off list.
>
> > >> The reason I've selected RAID6 to begin with is I've read (on this
> > >> mailing list, and on some hardware tech sites) that even with SAS
> > >> drives, the rebuild/resync time on a large array using large disks
> > >> (2TB+) is long enough that it gives more than enough time for another
> > >> disk to hit a random read error,
> > >
> > > This is true for high density consumer SATA drives. It's not nearly as
> > > applicable for low to moderate density nearline SATA which has an order
> > > of magnitude lower UER, or for enterprise SAS (and some enterprise SATA)
> > > which has yet another order of magnitude lower UER. So it depends on
> > > the disks, and the RAID size, and the backup/restore strategy.
> >
> > Yes, enterprise drives have a much larger spare sector pool.
> >
> > WRT rebuild time, this is one more reason to use RAID10 or a concat of
> > RAID1s. The rebuild time is low, constant, predictable. For 2TB drives
> > about 5-6 hours at 100% rebuild rate. And rebuild time, for any array
> > type, with gargantuan drives, is yet one more reason not to use the
> > largest drives you can get your hands on. Using 1TB drives will cut
> > that to 2.5-3 hours, and using 500GB drives will cut it down to 1.25-1.5
> > hours, as all these drives tend to have similar streaming write rates.
> >
> > To wit, as a general rule I always build my arrays with the smallest
> > drives I can get away with for the workload at hand. Yes, for a given
> > TB total it increases acquisition cost of drives, HBAs, enclosures, and
> > cables, and power consumption, but it also increases spindle count--thus
> > performance-- while decreasing rebuild times substantially/dramatically.
>
> I'd go raid10 or something if I had the space, but this little 10TB nas (which
> is the goal, a small, quiet, not too slow, 10TB nas with some kind of
> redundancy) only fits 7 3.5" HDDs.
>
> Maybe sometime in the future I'll get a big 3 or 4 u case with a crap load of
> 3.5" HDD bays, but for now, this is what I have (as well as my old array,
> 7x1TB RAID5+XFS in 4in3 hot swap bays with room for 8 drives, but haven't
> bothered to expand the old array, and I have the new one almost ready to go).
>
> I don't know if it impacts anything at all, but when burning in these drives
> after I bought them, I ran the same full iozone test a couple times, and each
> drive shows 150MB/s read, and similar write times (100-120+?). It impressed me
> somewhat, to see a mechanical hard drive go that fast. I remember back a few
> years ago thinking 80MBs was fast for a HDD.
>
> --
> Thomas Fjellstrom
> thomas@fjellstrom.ca
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: recommended way to add ssd cache to mdraid array
2013-01-13 23:20 ` Chris Murphy
@ 2013-01-14 0:23 ` Phil Turmel
2013-01-14 3:58 ` Chris Murphy
2013-01-14 22:00 ` Thomas Fjellstrom
0 siblings, 2 replies; 53+ messages in thread
From: Phil Turmel @ 2013-01-14 0:23 UTC (permalink / raw)
To: Chris Murphy; +Cc: linux-raid Raid
On 01/13/2013 06:20 PM, Chris Murphy wrote:
>
> On Jan 13, 2013, at 3:13 PM, Phil Turmel <philip@turmel.org> wrote:
>>
>> I haven't examined the code in detail, just watched patches pass on
>> the list. :-) But as I understand it, the error is returned with
>> the request that it belongs to, and the MD does not look at the
>> drive error code itself. So MD know what read it was, for which
>> member devices, but doesn't care if the error came from the drive
>> itself, or the controller, or the driver.
>
> I think it does, but I don't know diddly about the code.
If you think about this, you'll realize that the driver *must* keep
track of every unique request for the block device. Otherwise, how
would MD know what data was read and ready to use? And what data was
written so its buffer could be freed? Reads are distinct from writes.
> Only the drive knows what LBA's exhibit read and write errors. The
> linux driver doesn't. And the linux driver times out after 30 seconds
> by default, which is an eternity. There isn't just one request
> pending. There are dozens to hundreds of pending commands,
> representing possibly tens of thousands of LBAs in the drive's own
> cache.
But reads are separate from writes, and MD handles them differently.
See the "md" man-page under "Recovery".
> Once the drive goes into sector recovery, I'm pretty sure SATA
> drives, unlike SAS, basically go silent. That's probably when the
> linux timeout counter starts, in the meantime md is still talking to
> the linux driver making more requests.
>
> This is a huge pile of just requests, not even the data it
> represents. Some of those requests made it to the drive, some are
> with the linux driver. I think it's much easier/efficient, when linux
> block device driver timeout arrives, for the linux driver to just
> nullify all requests, gives the drive the boot (lalala i can't hear
> you i don't care if you start talking to me again in 60 seconds), and
> tells md with a single error that the drive isn't even available. And
> I'd expect md does the only thing it can do if it gets such an error
> which is the same as a write error; it flags the device in the array
> as faulty. I'd be surprised if it tried to reconstruct data at all in
> such a case, without an explicit read error and LBA reported by the
> drive.
This is just wrong.
> But I don't know the code, so I'm talking out my ass.
:-)
[trim /]
>> Yes. But in that case, they've probably lost arrays without
>> understanding why.
>
> Maybe. I don't have data on this. If recovery occurs in less than 30
> seconds, they effectively get no indication. They'd have to be
> looking at ECC errors recorded by SMART. And not all drives record
> that attribute.
See all the assistance requests on this list where the OP says something
to the effect of: "I don't understand! The (failed) drive appears to be OK!"
[trim /]
>>> Given the craptastic state of affairs that manufacturers disallow
>>> a simple setting change to ask the drive to do LESS error
>>> correction, the recommendation to buy a different drive that can
>>> be so configured, is the best suggestion.
>>
>> For a while, it was Hitachi Deskstar or enterprise. Western
>> Digital's new "Red" series appears to be an attempt to deal with
>> the backlash.
>
> Yeah I know. :-) But I mean all drives could have this. It's a
> request for LESS, not more. I'm not asking for better ECC, although
> that would be nice, merely a faster time out as a settable option.
You don't seem to understand: The hard drive industry loses revenue
when people set up raid arrays with cheap drives, and then have the
temerity to return good drives that have been (arguably) misapplied.
Manufacturers have a financial interest in selling enterprise drives
instead of desktop drives, and have made desktop drives painful to use
in this application. The manufacturers have even redefined "RAID".
Supposedly "I" now stands for "Independent" instead of "Inexpensive".
>>> And a very distant alternative 2 is to zero or Secure Erase the
>>> drive every so often, in hopes of avoiding bad sectors altogether
>>> — is tedious, as well as implies either putting the drive into a
>>> degraded state, or cycling a spare drive.
>>
>> No. This still isn't safe. UREs can happen at any time, and are
>> spec'd to occur at about every 12TB read. Even on a freshly wiped
>> drive. Spares don't help either, as a rebuild onto a spare stresses
>> the rest of the array, and is likely to expose any developing
>> UREs.
>
> Technically the spec statistic is "less than" 1 bit in 10^14 for a
> consumer disk. So it's not that you will get a URE at 12TB, but that
> you should be able to read at least 11.37TiB without a URE. It's
> entirely within the tolerance if the mean occurrence happens 2 bits
> shy of 10^15 bits, or 113TiB. By using "less than" the value is not a
> mean. It's likely a lot higher than 12TB, or we'd have total mayhem
> by now. That's only 3 reads of a 4TB drive otherwise. It's bad, but
> not that bad.
That's not really how the statistics work. The spec just means that if
you run a typical drive for some long time on some workload you'll
average one URE every 10^14 bits. What the actual shape of the
distribution is varies through the life of the drive. IIRC, Google's
analysis was that the rate spikes early, then forms a gaussian
distribution for the bulk of the life, then spikes again as mechanical
parts wear out.
>>> And at the point you're going to buy a spare drive for this
>>> fiasco, you might as well just buy drives suited for the
>>> purpose.
>>
>> Options are:
>>
>> A) Buy Enterprise drives. They have appropriate error timeouts and
>> work properly with MD right out of the box.
>>
>> B) Buy Desktop drives with SCTERC support. They have
>> inappropriate default timeouts, but can be set to an appropriate
>> value. Udev or boot script assistance is needed to call smartctl
>> to set it. They do *not* work properly with MD out of the box.
>>
>> C) Suffer with desktop drives without SCTERC support. They cannot
>> be set to appropriate error timeouts. Udev or boot script
>> assistance is needed to set a 120 second driver timeout in sysfs.
>> They do *not* work properly with MD out of the box.
>>
>> D) Lose your data during spare rebuild after your first URE. (Odds
>> in proportion to array size.)
>
> That's a good summary.
Yeah. Not enough people hear it though. If I was more than a very
light user, I'd be on option A. As it is, option B is best for me.
>> One last point bears repeating: MD is *not* a backup system,
>> although some people leverage its features for rotating off-site
>> backup disks. Raid arrays are all about *uptime*. They will not
>> save you from accidental deletion or other operator errors. They
>> will not save you if your office burns down. You need a separate
>> backup system for critical files.
>
> Yeah and that's why I'm sorta leery of this RAID 6 setup in the home.
> I think that people are reading that the odds of an array failure
> with RAID 5 are so high that they are better off adding one more
> drive for dual-parity, and *still* not having a real backup and
> restore plan. As if the RAID 6 is the faux-backup plan.
>
> Some home NAS's, with BluRay vids, are so big that people just either
> need to stop such behavior, or get a used LTO 2 or 3 drive for their
> gargantuous backups.
Well, for me, the copies of such material on hard drives *are* the
backups. I use "par2" for big backup files, not MD raid. I also skip
backups for my Hi-Def MythTV recordings. Just not valuable enough.
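For the curious, par2 usage is about as simple as it gets; file names
here are only placeholders:

  par2 create -r10 backup.tar.par2 backup.tar   # ~10% recovery data
  par2 verify backup.tar.par2

and "par2 repair" works the same way if the verify ever fails.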
Phil
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: recommended way to add ssd cache to mdraid array
2013-01-14 0:23 ` Phil Turmel
@ 2013-01-14 3:58 ` Chris Murphy
2013-01-14 22:00 ` Thomas Fjellstrom
1 sibling, 0 replies; 53+ messages in thread
From: Chris Murphy @ 2013-01-14 3:58 UTC (permalink / raw)
To: Phil Turmel; +Cc: linux-raid Raid
On Jan 13, 2013, at 5:23 PM, Phil Turmel <philip@turmel.org> wrote:
> On 01/13/2013 06:20 PM, Chris Murphy wrote:
>>
>> On Jan 13, 2013, at 3:13 PM, Phil Turmel <philip@turmel.org> wrote:
>>>
>>> I haven't examined the code in detail, just watched patches pass on
>>> the list. :-) But as I understand it, the error is returned with
>>> the request that it belongs to, and the MD does not look at the
>>> drive error code itself. So MD know what read it was, for which
>>> member devices, but doesn't care if the error came from the drive
>>> itself, or the controller, or the driver.
>>
>> I think it does, but I don't know diddly about the code.
>
> If you think about this, you'll realize that the driver *must* keep
> track of every unique request for the block device.
Perhaps not. It looks like it's the SCSI layer that sets the timer on each request. The driver may be doing something comparatively simple. But in any case there is something in between the SATA controller and md that's tracking requests, yes. But presumably md tracks its own requests. It's expecting something back.
> Otherwise, how
> would MD know what data was read and ready to use? And what data was
> written so its buffer could be freed? Reads are distinct from writes.
Yes fine, and I'm suggesting timeouts are distinct from reads and writes. Read and write errors are drive reported errors. I'm suggesting there's some other error from either the linux block device driver (or the SCSI layer) that is not a read error or a write error, when there's a time out.
>
>> Only the drive knows what LBA's exhibit read and write errors. The
>> linux driver doesn't. And the linux driver times out after 30 seconds
>> by default, which is an eternity. There isn't just one request
>> pending. There are dozens to hundreds of pending commands,
>> representing possibly tens of thousands of LBAs in the drive's own
>> cache.
>
> But reads are separate from writes, and MD handles them differently.
> See the "md" man-page under "Recovery".
I know that. I'm just suggesting it's not only a read or write error that's possible.
>
>> Once the drive goes into sector recovery, I'm pretty sure SATA
>> drives, unlike SAS, basically go silent. That's probably when the
>> linux timeout counter starts, in the meantime md is still talking to
>> the linux driver making more requests.
>>
>> This is a huge pile of just requests, not even the data it
>> represents. Some of those requests made it to the drive, some are
>> with the linux driver. I think it's much easier/efficient, when linux
>> block device driver timeout arrives, for the linux driver to just
>> nullify all requests, gives the drive the boot (lalala i can't hear
>> you i don't care if you start talking to me again in 60 seconds), and
>> tells md with a single error that the drive isn't even available. And
>> I'd expect md does the only thing it can do if it gets such an error
>> which is the same as a write error; it flags the device in the array
>> as faulty. I'd be surprised if it tried to reconstruct data at all in
>> such a case, without an explicit read error and LBA reported by the
>> drive.
>
> This is just wrong.
Yeah, the last sentence is clearly quite wrong: md has to reconstruct the data at some point in such a case, without an explicit read error from the drive. However, I think I largely had the idea right, now that I've looked at this after the fact.
https://access.redhat.com/knowledge/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Storage_Administration_Guide/task_controlling-scsi-command-timer-onlining-devices.html
I expect that the SCSI layer informs md that the device state is offline. And md in turn marks the drive faulty. And in turn rebuilds all pending and future data chunks for that device from parity (or mirrored copy). No need for a superfluous write attempt for an offline device.
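Incidentally, the state the SCSI layer tracks is visible in sysfs; with sdX as a placeholder,

  cat /sys/block/sdX/device/state

normally prints "running", and "offline" once the device has been taken out of service.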
>>
>> Technically the spec statistic is "less than" 1 bit in 10^14 for a
>> consumer disk. So it's not that you will get a URE at 12TB, but that
>> you should be able to read at least 11.37TiB without a URE. It's
>> entirely within the tolerance if the mean occurrence happens 2 bits
>> shy of 10^15 bits, or 113TiB. By using "less than" the value is not a
>> mean. It's likely a lot higher than 12TB, or we'd have total mayhem
>> by now. That's only 3 reads of a 4TB drive otherwise. It's bad, but
>> not that bad.
>
> That's not really how the statistics work. The spec just means that if
> you run a typically drive for some long time on some workload you'll
> average one URE every 10^14 bits.
I don't accept this. "less than" cannot be redefined as "mean/average" in statistics.
And 1 bit does not equal an actual URE either. The drive either reports all 4096 bits of a sector (usually they are good, but they might be corrupted), or you get a URE in which case all 4096 bits are lost. There is no such thing as getting 1 bit of URE with a hard drive when the smallest unit is a sector containing 4096 bits (for non-AF drives). 4096 bits in 10^14 bits is *NOT* the same thing as 1 bit in 10^14 bits. If you lose a sector to a URE in 12TB, that's the same thing as 1 bit in roughly 2.4x10^10 bits.
And for an AF drive, with 4096-byte sectors, a URE means you lose 32768 bits. If that happened every 12TB, it would be roughly 1 bit in 3x10^9 bits of loss.
So I don't agree at all with the basic math you've proposed. But then again, I'm just an ape so someone probably ought to double check it.
> What the actual shape of the
> distribution is varies through the life of the drive. IIRC, Google's
> analysis was that the rate spikes early, then forms a gaussian
> distribution for the bulk of the life, then spikes again as mechanical
> parts wear out.
That was not a UBER/URE study however. That study was about failures, i.e. whole disks were replaced. They only looked at sector reallocation to see if there was correlation to drive failures/replacements, not if UBER was consistent with manufacturer's stated spec.
Chris Murphy
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: recommended way to add ssd cache to mdraid array
2013-01-14 0:05 ` Tommy Apel Hansen
@ 2013-01-14 8:58 ` Thomas Fjellstrom
2013-01-14 18:22 ` Thomas Fjellstrom
0 siblings, 1 reply; 53+ messages in thread
From: Thomas Fjellstrom @ 2013-01-14 8:58 UTC (permalink / raw)
To: Tommy Apel Hansen; +Cc: stan, Chris Murphy, linux-raid Raid
On Sun Jan 13, 2013, Tommy Apel Hansen wrote:
> Could you do me a favor and run the iozone test with the -I switch on so
> that we can seen the actual speed of the array and not you RAM
Sure. Though I thought running the test with a file size twice the size of RAM
would help with that issue.
> /Tommy
>
> On Fri, 2013-01-11 at 05:35 -0700, Thomas Fjellstrom wrote:
> > On Thu Jan 10, 2013, Stan Hoeppner wrote:
> > > On 1/10/2013 3:36 PM, Chris Murphy wrote:
> > > > On Jan 10, 2013, at 3:49 AM, Thomas Fjellstrom <thomas@fjellstrom.ca> wrote:
> > > >> A lot of it will be streaming. Some may end up being random
> > > >> read/writes. The test is just to gauge over all performance of the
> > > >> setup. 600MBs read is far more than I need, but having writes at
> > > >> 1/3 that seems odd to me.
> > > >
> > > > Tell us how many disks there are, and what the chunk size is. It
> > > > could be too small if you have too few disks which results in a
> > > > small full stripe size for a video context. If you're using the
> > > > default, it could be too big and you're getting a lot of RWM. Stan,
> > > > and others, can better answer this.
> > >
> > > Thomas is using a benchmark, and a single one at that, to judge the
> > > performance. He's not using his actual workloads. Tuning/tweaking to
> > > increase the numbers in a benchmark could be detrimental to actual
> > > performance instead of providing a boost. One must be careful.
> > >
> > > Regarding RAID6, it will always have horrible performance compared to
> > > non-parity RAID levels and even RAID5, for anything but full stripe
> > > aligned writes, which means writing new large files or doing large
> > > appends to existing files.
> >
> > Considering its a rather simple use case, mostly streaming video, and
> > misc file sharing for my home network, an iozone test should be rather
> > telling. Especially the full test, from 4k up to 16mb
> >
> >                                                       random  random    bkwd  record  stride
> >       KB  reclen   write rewrite    read  reread    read   write    read rewrite    read  fwrite frewrite   fread freread
> > 33554432       4  243295  221756  628767  624081    1028    4627   16822  7468777  17740  233295   231092  582036  579131
> > 33554432       8  241134  225728  628264  627015    2027    8879   25977 10030302  19578  228923   233928  591478  584892
> > 33554432      16  233758  228122  633406  618248    3952   13635   35676 10166457  19968  227599   229698  579267  576850
> > 33554432      32  232390  219484  625968  625627    7604   18800   44252 10728450  24976  216880   222545  556513  555371
> > 33554432      64  222936  206166  631659  627823   14112   22837   52259 11243595  30251  196243   192755  498602  494354
> > 33554432     128  214740  182619  628604  626407   25088   26719   64912 11232068  39867  198638   185078  463505  467853
> > 33554432     256  202543  185964  626614  624367   44363   34763   73939 10148251  62349  176724   191899  593517  595646
> > 33554432     512  208081  188584  632188  629547   72617   39145   84876  9660408  89877  182736   172912  610681  608870
> > 33554432    1024  196429  166125  630785  632413  116793   51904  133342  8687679 121956  168756   175225  620587  616722
> > 33554432    2048  185399  167484  622180  627606  188571   70789  218009  5357136 370189  171019   166128  637830  637120
> > 33554432    4096  198340  188695  632693  628225  289971   95211  278098  4836433 611529  161664   170469  665617  655268
> > 33554432    8192  177919  167524  632030  629077  371602  115228  384030  4934570 618061  161562   176033  708542  709788
> > 33554432   16384  196639  183744  631478  627518  485622  133467  462861  4890426 644615  175411   179795  725966  734364
> > >
> > > However, everything is relative. This RAID6 may have plenty of random
> > > and streaming write/read throughput for Thomas. But a single benchmark
> > > isn't going to inform him accurately.
> >
> > 200MB/s may be enough, but the difference between the read and write
> > throughput is a bit unexpected. It's not a weak machine (core i3-2120,
> > dual core 3.2Ghz with HT, 16GB ECC 1333Mhz ram), and this is basically
> > all its going to be doing.
> >
> > > > You said these are unpartitioned disks, I think. In which case
> > > > alignment of 4096 byte sectors isn't a factor if these are AF disks.
> > > >
> > > > Unlikely to make up the difference is the scheduler. Parallel fs's
> > > > like XFS don't perform nearly as well with CFQ, so you should have a
> > > > kernel parameter elevator=noop.
> > >
> > > If the HBAs have [BB|FB]WC then one should probably use noop as the
> > > cache schedules the actual IO to the drives. If the HBAs lack cache,
> > > then deadline often provides better performance. Testing of each is
> > > required on a system and workload basis. With two identical systems
> > > (hardware/RAID/OS) one may perform better with noop, the other with
> > > deadline. The determining factor is the applications' IO patterns.
> >
> > Mostly streaming reads, some long rsync's to copy stuff back and forth,
> > file share duties (downloads etc).
> >
> > > > Another thing to look at is md/stripe_cache_size which probably needs
> > > > to be higher for your application.
> > > >
> > > > Another thing to look at is if you're using XFS, what your mount
> > > > options are. Invariably with an array of this size you need to be
> > > > mounting with the inode64 option.
> > >
> > > The desired allocator behavior is independent of array size but, once
> > > again, dependent on the workloads. inode64 is only needed for large
> > > filesystems with lots of files, where 1TB may not be enough for the
> > > directory inodes. Or, for mixed metadata/data heavy workloads.
> > >
> > > For many workloads including databases, video ingestion, etc, the
> > > inode32 allocator is preferred, regardless of array size. This is the
> > > linux-raid list so I'll not go into detail of the XFS allocators.
> >
> > If you have the time and the desire, I'd like to hear about it off list.
> >
> > > >> The reason I've selected RAID6 to begin with is I've read (on this
> > > >> mailing list, and on some hardware tech sites) that even with SAS
> > > >> drives, the rebuild/resync time on a large array using large disks
> > > >> (2TB+) is long enough that it gives more than enough time for
> > > >> another disk to hit a random read error,
> > > >
> > > > This is true for high density consumer SATA drives. It's not nearly
> > > > as applicable for low to moderate density nearline SATA which has an
> > > > order of magnitude lower UER, or for enterprise SAS (and some
> > > > enterprise SATA) which has yet another order of magnitude lower UER.
> > > > So it depends on the disks, and the RAID size, and the
> > > > backup/restore strategy.
> > >
> > > Yes, enterprise drives have a much larger spare sector pool.
> > >
> > > WRT rebuild time, this is one more reason to use RAID10 or a concat of
> > > RAID1s. The rebuild time is low, constant, predictable. For 2TB
> > > drives about 5-6 hours at 100% rebuild rate. And rebuild time, for
> > > any array type, with gargantuan drives, is yet one more reason not to
> > > use the largest drives you can get your hands on. Using 1TB drives
> > > will cut that to 2.5-3 hours, and using 500GB drives will cut it down
> > > to 1.25-1.5 hours, as all these drives tend to have similar streaming
> > > write rates.
> > >
> > > To wit, as a general rule I always build my arrays with the smallest
> > > drives I can get away with for the workload at hand. Yes, for a given
> > > TB total it increases acquisition cost of drives, HBAs, enclosures, and
> > > cables, and power consumption, but it also increases spindle
> > > count--thus performance-- while decreasing rebuild times
> > > substantially/dramatically.
> >
> > I'd go raid10 or something if I had the space, but this little 10TB nas
> > (which is the goal, a small, quiet, not too slow, 10TB nas with some
> > kind of redundancy) only fits 7 3.5" HDDs.
> >
> > Maybe sometime in the future I'll get a big 3 or 4 u case with a crap
> > load of 3.5" HDD bays, but for now, this is what I have (as well as my
> > old array, 7x1TB RAID5+XFS in 4in3 hot swap bays with room for 8 drives,
> > but haven't bothered to expand the old array, and I have the new one
> > almost ready to go).
> >
> > I don't know if it impacts anything at all, but when burning in these
> > drives after I bought them, I ran the same full iozone test a couple
> > times, and each drive shows 150MB/s read, and similar write times
> > (100-120+?). It impressed me somewhat, to see a mechanical hard drive go
> > that fast. I remember back a few years ago thinking 80MBs was fast for a
> > HDD.
--
Thomas Fjellstrom
thomas@fjellstrom.ca
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: recommended way to add ssd cache to mdraid array
2013-01-13 19:18 ` Chris Murphy
@ 2013-01-14 9:06 ` Thomas Fjellstrom
0 siblings, 0 replies; 53+ messages in thread
From: Thomas Fjellstrom @ 2013-01-14 9:06 UTC (permalink / raw)
To: Chris Murphy; +Cc: linux-raid Raid
On Sun Jan 13, 2013, Chris Murphy wrote:
> On Jan 11, 2013, at 11:51 AM, Thomas Fjellstrom <thomas@fjellstrom.ca>
wrote:
> > On Fri Jan 11, 2013, you wrote:
> >> OK fair enough, and as asked earlier, what's the chunk size?
> >
> > Ah, sorry, I missed that bit. I didn't tweak the chunk size, so I have
> > the default 512K.
>
> OK now that the subject has been sufficiently detoured…
>
> Benchmarking can perhaps help you isolate causes for problems you're
> having, but you haven't said what problems you're having. You're using
> benchmarking data to go looking for a problem. This is a GigE network to
> the NAS? If it's wireless, who cares, all of these array numbers are
> better than wireless bandwidth and latency. If it's GigE, all of your
> sequential read write numbers exceed the bandwidth of wired. So what's the
> problem? You want better random read writes? Rebuild the array with a 32KB
> or 64KB chunk size, knowing you may take a small hit on the larger
> sequential read/writes.
I did the "just ignore it" with my last array. This time around I figured it
would be a good idea to make sure everything is setup as correctly as is
possible given the hardware I have, because once its setup, It's too late to
fix any issues found that need changes to the lower level settings.
To me, the write performance shown so far seems quite low, which hints at a
problem that could be solved with some config change before I put it into
"production". I would like to get it right (as I can), and learn a little in
the process.
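Concretely, the kind of thing I've been checking (or plan to check) is just the
geometry on both layers, along these lines (assuming the array shows up as md0
here; the mount point is shortened):
  mdadm --detail /dev/md0                   # chunk size, layout, device count
  cat /sys/block/md0/md/stripe_cache_size
  xfs_info /mnt/mrbig                       # sunit/swidth should match chunk/stripe
so any further suggestions on what to look at are welcome.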
> Chris Murphy--
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Thomas Fjellstrom
thomas@fjellstrom.ca
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: recommended way to add ssd cache to mdraid array
2013-01-14 8:58 ` Thomas Fjellstrom
@ 2013-01-14 18:22 ` Thomas Fjellstrom
2013-01-14 19:45 ` Stan Hoeppner
` (2 more replies)
0 siblings, 3 replies; 53+ messages in thread
From: Thomas Fjellstrom @ 2013-01-14 18:22 UTC (permalink / raw)
To: Tommy Apel Hansen; +Cc: stan, Chris Murphy, linux-raid Raid
On Mon Jan 14, 2013, Thomas Fjellstrom wrote:
> On Sun Jan 13, 2013, Tommy Apel Hansen wrote:
> > Could you do me a favor and run the iozone test with the -I switch on so
> > that we can see the actual speed of the array and not your RAM?
>
> Sure. Though I thought running the test with a file size twice the size of
> ram would help with that issue.
This is the initial single test at the 8MB record size:
random random bkwd record stride
KB reclen write rewrite read reread read write read rewrite read fwrite frewrite fread freread
33554432 8192 124664 121973 524509 527971 376880 104357 336083 40088 392683 213941 215453 631122 631617
The full run will take a couple days.
> > /Tommy
> >
> > On Fri, 2013-01-11 at 05:35 -0700, Thomas Fjellstrom wrote:
> > > On Thu Jan 10, 2013, Stan Hoeppner wrote:
> > > > On 1/10/2013 3:36 PM, Chris Murphy wrote:
> > > > > On Jan 10, 2013, at 3:49 AM, Thomas Fjellstrom
> > > > > <thomas@fjellstrom.ca>
>
> wrote:
> > > > >> A lot of it will be streaming. Some may end up being random
> > > > >> read/writes. The test is just to gauge over all performance of the
> > > > >> setup. 600MBs read is far more than I need, but having writes at
> > > > >> 1/3 that seems odd to me.
> > > > >
> > > > > Tell us how many disks there are, and what the chunk size is. It
> > > > > could be too small if you have too few disks which results in a
> > > > > small full stripe size for a video context. If you're using the
> > > > > default, it could be too big and you're getting a lot of RWM. Stan,
> > > > > and others, can better answer this.
> > > >
> > > > Thomas is using a benchmark, and a single one at that, to judge the
> > > > performance. He's not using his actual workloads. Tuning/tweaking
> > > > to increase the numbers in a benchmark could be detrimental to
> > > > actual performance instead of providing a boost. One must be
> > > > careful.
> > > >
> > > > Regarding RAID6, it will always have horrible performance compared to
> > > > non-parity RAID levels and even RAID5, for anything but full stripe
> > > > aligned writes, which means writing new large files or doing large
> > > > appends to existing files.
> > >
> > > Considering its a rather simple use case, mostly streaming video, and
> > > misc file sharing for my home network, an iozone test should be rather
> > > telling. Especially the full test, from 4k up to 16mb
> > >
> > > random random bkwd record stride
> > > KB reclen write rewrite read reread read write read rewrite read fwrite frewrite fread freread
> > > 33554432 4 243295 221756 628767 624081 1028 4627 16822 7468777 17740 233295 231092 582036 579131
> > > 33554432 8 241134 225728 628264 627015 2027 8879 25977 10030302 19578 228923 233928 591478 584892
> > > 33554432 16 233758 228122 633406 618248 3952 13635 35676 10166457 19968 227599 229698 579267 576850
> > > 33554432 32 232390 219484 625968 625627 7604 18800 44252 10728450 24976 216880 222545 556513 555371
> > > 33554432 64 222936 206166 631659 627823 14112 22837 52259 11243595 30251 196243 192755 498602 494354
> > > 33554432 128 214740 182619 628604 626407 25088 26719 64912 11232068 39867 198638 185078 463505 467853
> > > 33554432 256 202543 185964 626614 624367 44363 34763 73939 10148251 62349 176724 191899 593517 595646
> > > 33554432 512 208081 188584 632188 629547 72617 39145 84876 9660408 89877 182736 172912 610681 608870
> > > 33554432 1024 196429 166125 630785 632413 116793 51904 133342 8687679 121956 168756 175225 620587 616722
> > > 33554432 2048 185399 167484 622180 627606 188571 70789 218009 5357136 370189 171019 166128 637830 637120
> > > 33554432 4096 198340 188695 632693 628225 289971 95211 278098 4836433 611529 161664 170469 665617 655268
> > > 33554432 8192 177919 167524 632030 629077 371602 115228 384030 4934570 618061 161562 176033 708542 709788
> > > 33554432 16384 196639 183744 631478 627518 485622 133467 462861 4890426 644615 175411 179795 725966 734364
> > > >
> > > > However, everything is relative. This RAID6 may have plenty of
> > > > random and streaming write/read throughput for Thomas. But a single
> > > > benchmark isn't going to inform him accurately.
> > >
> > > 200MB/s may be enough, but the difference between the read and write
> > > throughput is a bit unexpected. It's not a weak machine (core i3-2120,
> > > dual core 3.2Ghz with HT, 16GB ECC 1333Mhz ram), and this is basically
> > > all its going to be doing.
> > >
> > > > > You said these are unpartitioned disks, I think. In which case
> > > > > alignment of 4096 byte sectors isn't a factor if these are AF
> > > > > disks.
> > > > >
> > > > > Unlikely to make up the difference is the scheduler. Parallel fs's
> > > > > like XFS don't perform nearly as well with CFQ, so you should have
> > > > > a kernel parameter elevator=noop.
> > > >
> > > > If the HBAs have [BB|FB]WC then one should probably use noop as the
> > > > cache schedules the actual IO to the drives. If the HBAs lack cache,
> > > > then deadline often provides better performance. Testing of each is
> > > > required on a system and workload basis. With two identical systems
> > > > (hardware/RAID/OS) one may perform better with noop, the other with
> > > > deadline. The determining factor is the applications' IO patterns.
> > >
> > > Mostly streaming reads, some long rsync's to copy stuff back and forth,
> > > file share duties (downloads etc).
> > >
> > > > > Another thing to look at is md/stripe_cache_size which probably
> > > > > needs to be higher for your application.
> > > > >
> > > > > Another thing to look at is if you're using XFS, what your mount
> > > > > options are. Invariably with an array of this size you need to be
> > > > > mounting with the inode64 option.
> > > >
> > > > The desired allocator behavior is independent of array size but, once
> > > > again, dependent on the workloads. inode64 is only needed for large
> > > > filesystems with lots of files, where 1TB may not be enough for the
> > > > directory inodes. Or, for mixed metadata/data heavy workloads.
> > > >
> > > > For many workloads including databases, video ingestion, etc, the
> > > > inode32 allocator is preferred, regardless of array size. This is
> > > > the linux-raid list so I'll not go into detail of the XFS
> > > > allocators.
> > >
> > > If you have the time and the desire, I'd like to hear about it off
> > > list.
> > >
> > > > >> The reason I've selected RAID6 to begin with is I've read (on this
> > > > >> mailing list, and on some hardware tech sites) that even with SAS
> > > > >> drives, the rebuild/resync time on a large array using large disks
> > > > >> (2TB+) is long enough that it gives more than enough time for
> > > > >> another disk to hit a random read error,
> > > > >
> > > > > This is true for high density consumer SATA drives. It's not nearly
> > > > > as applicable for low to moderate density nearline SATA which has
> > > > > an order of magnitude lower UER, or for enterprise SAS (and some
> > > > > enterprise SATA) which has yet another order of magnitude lower
> > > > > UER.
> > > > >
> > > > > So it depends on the disks, and the RAID size, and the
> > > > >
> > > > > backup/restore strategy.
> > > >
> > > > Yes, enterprise drives have a much larger spare sector pool.
> > > >
> > > > WRT rebuild time, this is one more reason to use RAID10 or a concat
> > > > of RAID1s. The rebuild time is low, constant, predictable. For 2TB
> > > > drives about 5-6 hours at 100% rebuild rate. And rebuild time, for
> > > > any array type, with gargantuan drives, is yet one more reason not
> > > > to use the largest drives you can get your hands on. Using 1TB
> > > > drives will cut that to 2.5-3 hours, and using 500GB drives will cut
> > > > it down to 1.25-1.5 hours, as all these drives tend to have similar
> > > > streaming write rates.
> > > >
> > > > To wit, as a general rule I always build my arrays with the smallest
> > > > drives I can get away with for the workload at hand. Yes, for a
> > > > given TB total it increases acquisition cost of drives, HBAs,
> > > > enclosures, and cables, and power consumption, but it also increases
> > > > spindle count--thus performance-- while decreasing rebuild times
> > > > substantially/dramatically.
> > >
> > > I'd go raid10 or something if I had the space, but this little 10TB nas
> > > (which is the goal, a small, quiet, not too slow, 10TB nas with some
> > > kind of redundancy) only fits 7 3.5" HDDs.
> > >
> > > Maybe sometime in the future I'll get a big 3 or 4 u case with a crap
> > > load of 3.5" HDD bays, but for now, this is what I have (as well as my
> > > old array, 7x1TB RAID5+XFS in 4in3 hot swap bays with room for 8
> > > drives, but haven't bothered to expand the old array, and I have the
> > > new one almost ready to go).
> > >
> > > I don't know if it impacts anything at all, but when burning in these
> > > drives after I bought them, I ran the same full iozone test a couple
> > > times, and each drive shows 150MB/s read, and similar write times
> > > (100-120+?). It impressed me somewhat, to see a mechanical hard drive
> > > go that fast. I remember back a few years ago thinking 80MBs was fast
> > > for a HDD.
--
Thomas Fjellstrom
thomas@fjellstrom.ca
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: recommended way to add ssd cache to mdraid array
2013-01-14 18:22 ` Thomas Fjellstrom
@ 2013-01-14 19:45 ` Stan Hoeppner
2013-01-14 21:53 ` Thomas Fjellstrom
2013-01-14 21:38 ` Tommy Apel Hansen
2013-01-14 21:47 ` Tommy Apel Hansen
2 siblings, 1 reply; 53+ messages in thread
From: Stan Hoeppner @ 2013-01-14 19:45 UTC (permalink / raw)
To: thomas; +Cc: Tommy Apel Hansen, Chris Murphy, linux-raid Raid
On 1/14/2013 12:22 PM, Thomas Fjellstrom wrote:
> This is the initial single 8MB chunk size test
>
> random random bkwd record stride
> KB reclen write rewrite read reread read write read rewrite read fwrite frewrite fread freread
> 33554432 8192 124664 121973 524509 527971 376880 104357 336083 40088 392683 213941 215453 631122 631617
Thomas, if you can't paste a data table into an email such that folks
can actually read it, try a paste bin. Sending this junk demonstrates a
lack of respect for fellow technical users here-- assuming they have or
will take time to decipher this. I won't even bother to tackle your
lack of trim posting 3 page emails when replying with a single sentence...
It's bad enough that you require so much hand holding for a simple home
server array, but then you expect us to assemble puzzles in order to
assist you? This is absurd. By doing this you're asking to be treated
like a child, asking us to wipe your nose for you.
I'm not being 'hostile' here but REAL. Cowboy up and get it together.
Don't retort, don't give excuses. Just take your licks, bite your
tongue, and resolve yourself to get this stuff right from now on. It's
simple list etiquette, courtesy. If it's worth your time to send an
email with an Iozone output table, then it's worth taking a little extra
time formatting it so folks can read it.
--
Stan
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: recommended way to add ssd cache to mdraid array
2013-01-14 18:22 ` Thomas Fjellstrom
2013-01-14 19:45 ` Stan Hoeppner
@ 2013-01-14 21:38 ` Tommy Apel Hansen
2013-01-14 21:47 ` Tommy Apel Hansen
2 siblings, 0 replies; 53+ messages in thread
From: Tommy Apel Hansen @ 2013-01-14 21:38 UTC (permalink / raw)
To: thomas; +Cc: stan, Chris Murphy, linux-raid Raid
Test of my raw SAS2 7K2 disks without bcache
*****************************************************************************
*****************************************************************************
~ # mdadm -C /dev/md0 -l 6 -c 512 -n 7 --assume-clean --run --force /dev/dm-[0,1,2,3,4,5,7]
*****************************************************************************
~ # mdadm -D /dev/md0
/dev/md0:
Raid Level : raid6
Array Size : 9767564800 (9315.08 GiB 10001.99 GB)
Used Dev Size : 1953512960 (1863.02 GiB 2000.40 GB)
Raid Devices : 7
Total Devices : 7
Layout : left-symmetric
Chunk Size : 512K
Number Major Minor RaidDevice State
0 253 0 0 active sync /dev/dm-0
1 253 1 1 active sync /dev/dm-1
2 253 2 2 active sync /dev/dm-2
3 253 3 3 active sync /dev/dm-3
4 253 4 4 active sync /dev/dm-4
5 253 5 5 active sync /dev/dm-5
6 253 7 6 active sync /dev/dm-7
*****************************************************************************
~ # echo 32768 > /sys/block/md0/md/stripe_cache_size
*****************************************************************************
~ # blockdev --getra /dev/dm-0
2048
~ # iozone -I -a -s 1g -y 8192 -q 8192 -i 0 -i 1 -i 2 -I -f /dev/md0
O_DIRECT feature enabled
Auto Mode
File size set to 1048576 KB
Using Minimum Record Size 8192 KB
Using Maximum Record Size 8192 KB
O_DIRECT feature enabled
Command line used: iozone -I -a -s 1g -y 8192 -q 8192 -i 0 -i 1 -i 2 -I -f /dev/md0
Output is in Kbytes/sec
Time Resolution = 0.000001 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.
random random
KB reclen write rewrite read reread read write
1048576 8192 193370 193976 297343 314465 311949 197962
iozone test complete.
*****************************************************************************
*****************************************************************************
This test should be of more interest for you as those are the normal block sizes you'll see hit the disks
*****************************************************************************
~ # iozone -I -a -s 25m -y 4 -q 64 -i 0 -i 1 -i 2 -I -f /dev/md0
O_DIRECT feature enabled
Auto Mode
File size set to 25600 KB
Using Minimum Record Size 4 KB
Using Maximum Record Size 64 KB
O_DIRECT feature enabled
Command line used: iozone -I -a -s 25m -y 4 -q 64 -i 0 -i 1 -i 2 -I -f /dev/md0
Output is in Kbytes/sec
Time Resolution = 0.000001 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.
random random
KB reclen write rewrite read reread read write
25600 4 864 931 149704 197296 202074 8378
25600 8 1713 1717 204531 254967 245363 15468
25600 16 3522 3389 270464 266586 237274 19848
25600 32 5824 5887 342603 422281 324350 36111
25600 64 10555 10304 382500 367605 318845 56680
iozone test complete.
*****************************************************************************
One more thing you should look for when you do this type of test is that you are properly
maxing out your CPU while doing this. My tests were done on a dual Xeon E5-2609 @ 2.40GHz,
that's a total of 8 physical cores. While doing this single-threaded iozone test only 1 core was in use
and it was maxed out at 12.5% iowait according to iostat. If you run iozone in a multithreaded setup
your results will get higher; I was noting just about 48MB/s to every drive in the array while doing
the tests above, which is roughly 100MB/s short of the max of those drives.
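If you want to watch that for yourself while iozone runs, something along these
lines in another terminal will do (sysstat tools; substitute your own member
devices):
~ # mpstat -P ALL 2
~ # iostat -xm 2 /dev/sd[b-h]
mpstat shows whether a single core is pegged and iostat shows the per-drive
throughput and utilisation during the run.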
*****************************************************************************
MULTITHREAD TEST BELOW USING XFS
*****************************************************************************
~ # mkfs.xfs -f /dev/md0
data = bsize=4096 blocks=2441887744, imaxpct=5
= sunit=128 swidth=640 blks
*****************************************************************************
~ # mount -t xfs -o noatime,inode64 /dev/md0 /mnt
*****************************************************************************
mnt # iozone -t 20 -s 512m -i 0 -i 1 -i 2 -I -r 8m
Children see throughput for 20 initial writers = 431515.35 KB/sec
Children see throughput for 20 rewriters = 349231.38 KB/sec
Children see throughput for 20 readers = 647558.30 KB/sec
Children see throughput for 20 re-readers = 643240.44 KB/sec
Children see throughput for 20 random readers = 642147.36 KB/sec
Children see throughput for 20 random writers = 325886.66 KB/sec
iozone test complete.
*****************************************************************************
avg-cpu: %user %nice %system %iowait %steal %idle
0.00 0.00 21.01 78.61 0.00 0.38
*****************************************************************************
This test brought the drives up around 100MB/s but I ran out of cpu cycles.
Now replicate my tests on your system and see what numbers you can come up with;
considering your drives are not enterprise SAS drives your numbers might be some
10% lower or so.
*****************************************************************************
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: recommended way to add ssd cache to mdraid array
2013-01-14 18:22 ` Thomas Fjellstrom
2013-01-14 19:45 ` Stan Hoeppner
2013-01-14 21:38 ` Tommy Apel Hansen
@ 2013-01-14 21:47 ` Tommy Apel Hansen
2 siblings, 0 replies; 53+ messages in thread
From: Tommy Apel Hansen @ 2013-01-14 21:47 UTC (permalink / raw)
To: thomas; +Cc: stan, Chris Murphy, linux-raid Raid
Oh I forgot to mention, I'm running Gentoo-Hardened kernel-3.7.1 with Grsecurity and PaX enabled
/Tommy
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: recommended way to add ssd cache to mdraid array
2013-01-14 19:45 ` Stan Hoeppner
@ 2013-01-14 21:53 ` Thomas Fjellstrom
2013-01-14 22:51 ` Chris Murphy
2013-01-15 1:50 ` Stan Hoeppner
0 siblings, 2 replies; 53+ messages in thread
From: Thomas Fjellstrom @ 2013-01-14 21:53 UTC (permalink / raw)
To: stan; +Cc: Tommy Apel Hansen, Chris Murphy, linux-raid Raid
On Mon Jan 14, 2013, you wrote:
> On 1/14/2013 12:22 PM, Thomas Fjellstrom wrote:
[snip]
> Thomas, if you can't paste a data table into an email such that folks
> can actually read it, try a paste bin. Sending this junk demonstrates a
> lack of respect for fellow technical users here-- assuming they have or
> will take time to decipher this. I won't even bother to tackle your
> lack of trim posting 3 page emails when replying with a single sentence...
>
> It's bad enough that you require so much hand holding for a simple home
> server array, but then you expect us to assemble puzzles in order to
> assist you? This is absurd. By doing this you're asking to be treated
> like a child, asking us to wipe your nose for you.
>
> I'm not being 'hostile' here but REAL. Cowboy up and get it together.
> Don't retort, don't give excuses. Just take your licks, bite your
> tongue, and resolve yourself to get this stuff right from now on. It's
> simple list etiquette, courtesy. If it's worth your time to send an
> email with an Iozone output table, then it's worth taking a little extra
> time formatting it so folks can read it.
random random bkwd record stride
KB reclen write rewrite read reread read write read rewrite read fwrite frewrite fread freread
33554432 8192 124664 121973 524509 527971 376880 104357 336083 40088 392683 213941 215453 631122 631617
I assume that is to your liking?
As for the simple home server array, if it were so simple, it'd work out
of the box with no issues at all.
The implied insults are completely unnecessary. And yes,
you are being hostile, whether you choose to admit it or not.
I apologize for not trimming, and not formatting the table
properly. I should have waited to post when I wasn't so tired. I tend to
forget things when I post too tired.
p.s. I am seriously tempted to play your little flame game,
but it's not worth the time.
--
Thomas Fjellstrom
thomas@fjellstrom.ca
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: recommended way to add ssd cache to mdraid array
2013-01-14 0:23 ` Phil Turmel
2013-01-14 3:58 ` Chris Murphy
@ 2013-01-14 22:00 ` Thomas Fjellstrom
1 sibling, 0 replies; 53+ messages in thread
From: Thomas Fjellstrom @ 2013-01-14 22:00 UTC (permalink / raw)
To: Phil Turmel; +Cc: Chris Murphy, linux-raid Raid
On Sun Jan 13, 2013, Phil Turmel wrote:
> On 01/13/2013 06:20 PM, Chris Murphy wrote:
[snip]
> >> One last point bears repeating: MD is *not* a backup system,
> >> although some people leverage it's features for rotating off-site
> >> backup disks. Raid arrays are all about *uptime*. They will not
> >> save you from accidental deletion or other operator errors. They
> >> will not save you if your office burns down. You need a separate
> >> backup system for critical files.
> >
> > Yeah and that's why I'm sorta leery of this RAID 6 setup in the home.
> > I think that people are reading that the odds of an array failure
> > with RAID 5 are so high that they are better off adding one more
> > drive for dual-parity, and *still* not having a real backup and
> > restore plan. As if the RAID 6 is the faux-backup plan.
> >
> > Some home NAS's, with BluRay vids, are so big that people just either
> > need to stop such behavior, or get a used LTO 2 or 3 drive for their
> > gargantuous backups.
>
> Well, for me, such material on hard drives *are* the backups. I use
> "par2" for big backup files, not MD raid. I also skip backups for my
> Hi-Def MythTV recordings. Just not valuable enough.
Yeah, I learned a while back to make proper backups. My very important files
are backed up from each machine every day to a raid1, which is then synced up
to a remote machine. Once the new array is up, the backups will have another
location to copy to, as well as the not so important media files will have a
backup (some 3TB drives and a few of the old 1TB drives in a linear concat
'array'). I may add another remote backup location in the future, I just
haven't decided who to go with.
> Phil
> --
--
Thomas Fjellstrom
thomas@fjellstrom.ca
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: recommended way to add ssd cache to mdraid array
2013-01-14 21:53 ` Thomas Fjellstrom
@ 2013-01-14 22:51 ` Chris Murphy
2013-01-15 3:25 ` Thomas Fjellstrom
2013-01-15 1:50 ` Stan Hoeppner
1 sibling, 1 reply; 53+ messages in thread
From: Chris Murphy @ 2013-01-14 22:51 UTC (permalink / raw)
To: linux-raid Raid, Thomas Fjellstrom
Cc: stan@hardwarefreak.com Hoeppner, Tommy Apel Hansen
On Jan 14, 2013, at 2:53 PM, Thomas Fjellstrom <thomas@fjellstrom.ca> wrote:
> random random bkwd record stride
> KB reclen write rewrite read reread read write read rewrite read fwrite frewrite fread freread
> 33554432 8192 124664 121973 524509 527971 376880 104357 336083 40088 392683 213941 215453 631122 631617
>
> I assume that is to you liking?
No, it's getting hosed for me too; I think it's just too wide for the forum. Pastebin it. And I suggest you post the one that has the variable reclen, not just the 8MB result.
And also, now I'm confused because the above reclen 8192 result doesn't match the one you posted before, so I'm going to guess this is the one with -I. My suggestion is that you put the command you used to arrive at the results into the pastebin as well. Both are important.
> p.s. I am seriously tempted to play your little flame game,
> but It's not worth the time.
Don't say that or people won't help. If it's your first day at the rodeo taking a bit of a beating and you announce it's not worth it, most people will agree with you and stop playing.
Chris Murphy
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: recommended way to add ssd cache to mdraid array
2013-01-14 21:53 ` Thomas Fjellstrom
2013-01-14 22:51 ` Chris Murphy
@ 2013-01-15 1:50 ` Stan Hoeppner
2013-01-15 3:52 ` Thomas Fjellstrom
1 sibling, 1 reply; 53+ messages in thread
From: Stan Hoeppner @ 2013-01-15 1:50 UTC (permalink / raw)
To: thomas; +Cc: Tommy Apel Hansen, Chris Murphy, linux-raid Raid
On 1/14/2013 3:53 PM, Thomas Fjellstrom wrote:
>
> random random bkwd record stride
> KB reclen write rewrite read reread read write read rewrite read fwrite frewrite fread freread
> 33554432 8192 124664 121973 524509 527971 376880 104357 336083 40088 392683 213941 215453 631122 631617
>
> I assume that is to you liking?
Yes, much better. Now, where is the output from the system you're
comparing performance against?
> As for the simple home server array, if it were so simple, it'd work out
> of the box with no issues at all.
It is working. And there are no issues, but for your subjective
interpretation of the iozone data, assuming it is not working properly.
This is why benchmarks of this sort are generally only good for
comparing one system to another.
--
Stan
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: recommended way to add ssd cache to mdraid array
2013-01-14 22:51 ` Chris Murphy
@ 2013-01-15 3:25 ` Thomas Fjellstrom
0 siblings, 0 replies; 53+ messages in thread
From: Thomas Fjellstrom @ 2013-01-15 3:25 UTC (permalink / raw)
To: Chris Murphy
Cc: linux-raid Raid, stan@hardwarefreak.com Hoeppner,
Tommy Apel Hansen
On Mon Jan 14, 2013, Chris Murphy wrote:
> On Jan 14, 2013, at 2:53 PM, Thomas Fjellstrom <thomas@fjellstrom.ca> wrote:
> > random random bkwd record stride
> > KB reclen write rewrite read reread read write read rewrite read fwrite frewrite fread freread
> > 33554432 8192 124664 121973 524509 527971 376880 104357 336083 40088 392683 213941 215453 631122 631617
> >
> > I assume that is to you liking?
>
> No it's getting hosed for me too, I think it's just too wide for the forum.
> Pastebin it. And I suggest you post the one that has the variable reclen
> not just the 8MB result.
Forum? The email I got back views fine in my email program. Well mostly fine,
one of the header lines is offset by one character.
> And also, now I'm confused because the above reclen8192 result doesn't
> match with the one you posted from before, so I'm goingn to guess this is
> the one with -I. My suggestion is that you put into the pastebin the
> command you used to arrive at the results. Both are important.
Yeah, total screw up on my part there. I still had the contents from replying
to Tommy with a different test. *sigh* *headdesk*
http://pastebin.com/ffaKiKnN
moose@mrbig:/mnt/mrbig/data/test$ time iozone -a -s 32G -r 8M
random random bkwd record stride
KB reclen write rewrite read reread read write read rewrite read fwrite frewrite fread freread
33554432 8192 208232 210453 628247 630950 370444 163451 382472 4889055 618370 210211 216416 714091 709861
> > p.s. I am seriously tempted to play your little flame game,
> > but It's not worth the time.
>
> Don't say that or people won't help. If it's your first day at the rodeo
> taking a bit of a beating and you announce it's not worth it, most people
> will agree with you and stop playing.
I'm a little confused. People won't help if I state that I'm uninterested in a
flame war? I've spent way more than enough time engaging in flame wars on the
internet.
> Chris Murphy
--
Thomas Fjellstrom
thomas@fjellstrom.ca
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: recommended way to add ssd cache to mdraid array
2013-01-15 1:50 ` Stan Hoeppner
@ 2013-01-15 3:52 ` Thomas Fjellstrom
2013-01-15 8:38 ` Stan Hoeppner
0 siblings, 1 reply; 53+ messages in thread
From: Thomas Fjellstrom @ 2013-01-15 3:52 UTC (permalink / raw)
To: stan; +Cc: Tommy Apel Hansen, Chris Murphy, linux-raid Raid
On Mon Jan 14, 2013, Stan Hoeppner wrote:
> On 1/14/2013 3:53 PM, Thomas Fjellstrom wrote:
> > random random bkwd record stride
> > KB reclen write rewrite read reread read write read rewrite read fwrite frewrite fread freread
> > 33554432 8192 124664 121973 524509 527971 376880 104357 336083 40088 392683 213941 215453 631122 631617
> >
> > I assume that is to you liking?
>
> Yes, much better. Now, where is the output from the system you're
> comparing performance against?
I haven't been comparing it against my other system, as it's kind of apples and
oranges. My old array is on somewhat similar hardware for the most part, but
uses older 1TB drives in RAID5.
Server hw:
Supermicro X9SCM-FO
Xeon E3-1230 3.2Ghz
16GB DDR3 1333mhz ECC
8 port IBM/LSI SAS/SATA HBA
NAS hw:
Intel S1200KP
Core i3-2120 3.3Ghz
16GB DDR3 1333mhz ECC
8 port IBM/LSI SAS/SATA HBA
Not the highest end hardware out there, but it gets the job done. I was
actually trying to get less powerful hardware for the NAS, but I really
couldn't find much that fit my other requirements (mini-itx server grade hw).
Very limited selection of motherboards, most of which take socket 1155 cpus,
and the selection of those that also take ECC ram is fairly limited as well.
> > As for the simple home server array, if it were so simple, it'd work out
> > of the box with no issues at all.
>
> It is working. And there are no issues, but for your subjective
> interpretation of the iozone data, assuming it is not working properly.
It is working. And I can live with it as is, but it does seem like something
isn't right. If that's just me jumping to conclusions, well that's fine then.
But 600MB/s+ reads vs 200MB/s writes seems a tad off.
> This is why benchmarks of this sort are generally only good for
> comparing one system to another.
I'm running the same iozone test on the old array, to see how it goes. But it's
currently in use, and getting full (84G free out of 5.5TB), so I'm not
positive how well it'll do compared to if it were a fresh array like the new
nas array.
Preliminary results show similar read/write patterns (140MB/s write, 380MB/s
read), albeit slower probably due to being well aged, in use, and maybe the
drive speeds (the 1TB drives are 20-40MB/s slower than the 2TB drives in a
straight read test, I can't remember the write differences).
--
Thomas Fjellstrom
thomas@fjellstrom.ca
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: recommended way to add ssd cache to mdraid array
2013-01-15 3:52 ` Thomas Fjellstrom
@ 2013-01-15 8:38 ` Stan Hoeppner
2013-01-15 9:02 ` Tommy Apel
` (2 more replies)
0 siblings, 3 replies; 53+ messages in thread
From: Stan Hoeppner @ 2013-01-15 8:38 UTC (permalink / raw)
To: thomas; +Cc: Tommy Apel Hansen, Chris Murphy, linux-raid Raid
On 1/14/2013 9:52 PM, Thomas Fjellstrom wrote:
...
> I haven't been comparing it against my other system, as its kind of apples and
> oranges. My old array, on somewhat similar hardware for the most part, but
> uses older 1TB drives in RAID5.
...
> It is working. And I can live with it as is, but it does seem like something
> isn't right. If thats just me jumping to conclusions, well thats fine then.
> But 600MB/s+ reads vs 200MB/s writes seems a tad off.
It's not off. As myself and others stated previously, this low write
performance is typical of RAID6, particularly for unaligned or partial
stripe writes--anything that triggers a RMW cycle.
> I'm running the same iozone test on the old array, see how it goes. But its
> currently in use, and getting full (84G free out of 5.5TB), so I'm not
> positive how well it'll do as compared to if it was a fresh array like the new
> nas array.
...
> Preliminary results show similar read/write patterns (140MB/s write, 380MB/s
> read), albeit slower probably due to being well aged, in use, and maybe the
> drive speeds (the 1TB drives are 20-40MB/s slower than the 2TB drives in a
> straight read test, I can't remember the write differences).
Yes, the way in which the old filesystem has aged, and the difference in
single drive performance, will both cause lower numbers on the old hardware.
What you're really after, what you want to see, is iozone numbers from a
similar system with a 7 drive md/RAID6 array with XFS. Only that will
finally convince you, one way or the other, that your array is doing
pretty much as well as it can, or not. However, even once you've
established this, it still doesn't inform you as to how well the new
array will perform with your workloads.
On that note, someone stated you should run iozone using O_DIRECT writes
to get more accurate numbers, or more precisely, to eliminate the Linux
buffer cache from the equation. Doing this actually makes your testing
LESS valid, because your real world use will likely include all buffered
IO, and no direct IO.
What you should be concentrating on right now is identifying if any of
your workloads make use of fsync. If they do not, or if the majority do
not (Samba does not by default IIRC, neither does NFS), then you should
be running iozone with fsync disabled. In other words, since you're not
comparing two similar systems, you should be tweaking iozone to best
mimic your real workloads. Running iozone with buffer cache and with
fsync disabled should produce higher write numbers, which should be
closer to what you will see with your real workloads.
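As a rough sketch (sizes and path are illustrative only), that simply means a
buffered run without -I, -e or -o, e.g.:
  iozone -a -s 32G -r 8M -i 0 -i 1 -i 2 -f /path/to/testfile
Adding -e would fold the flush/fsync time into the timings, and -o would force
O_SYNC writes, neither of which resembles a typical Samba/NFS workload.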
--
Stan
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: recommended way to add ssd cache to mdraid array
2013-01-15 8:38 ` Stan Hoeppner
@ 2013-01-15 9:02 ` Tommy Apel
2013-01-15 11:19 ` Stan Hoeppner
2013-01-15 10:47 ` Tommy Apel
2013-01-16 5:31 ` Thomas Fjellstrom
2 siblings, 1 reply; 53+ messages in thread
From: Tommy Apel @ 2013-01-15 9:02 UTC (permalink / raw)
To: linux-raid Raid
Stan: it is true what you are saying about the cache and real life usage,
but if you suspect a problem with the array I would suggest testing the array
rather than the buffer system in Linux, hence the use of O_DIRECT, as that
will determine the array performance and not the vmem.
On Jan 15, 2013 9:38 AM, "Stan Hoeppner" <stan@hardwarefreak.com> wrote:
>
> On 1/14/2013 9:52 PM, Thomas Fjellstrom wrote:
> ...
> > I haven't been comparing it against my other system, as its kind of apples and
> > oranges. My old array, on somewhat similar hardware for the most part, but
> > uses older 1TB drives in RAID5.
> ...
> > It is working. And I can live with it as is, but it does seem like something
> > isn't right. If thats just me jumping to conclusions, well thats fine then.
> > But 600MB/s+ reads vs 200MB/s writes seems a tad off.
>
> It's not off. As myself and others stated previously, this low write
> performance is typical of RAID6, particularly for unaligned or partial
> stripe writes--anything that triggers a RMW cycle.
>
> > I'm running the same iozone test on the old array, see how it goes. But its
> > currently in use, and getting full (84G free out of 5.5TB), so I'm not
> > positive how well it'll do as compared to if it was a fresh array like the new
> > nas array.
> ...
> > Preliminary results show similar read/write patterns (140MB/s write, 380MB/s
> > read), albeit slower probably due to being well aged, in use, and maybe the
> > drive speeds (the 1TB drives are 20-40MB/s slower than the 2TB drives in a
> > straight read test, I can't remember the write differences).
>
> Yes, the way in which the old filesystem has aged, and the difference in
> single drive performance, will both cause lower numbers on the old hardware.
>
> What you're really after, what you want to see, is iozone numbers from a
> similar system with a 7 drive md/RAID6 array with XFS. Only that will
> finally convince you, one way or the other, that your array is doing
> pretty much as well as it can, or not. However, even once you've
> established this, it still doesn't inform you as to how well the new
> array will perform with your workloads.
>
> On that note, someone stated you should run iozone using O_DIRECT writes
> to get more accurate numbers, or more precisely, to eliminate the Linux
> buffer cache from the equation. Doing this actually makes your testing
> LESS valid, because your real world use will likely include all buffered
> IO, and no direct IO.
>
> What you should be concentrating on right now is identifying if any of
> your workloads make use of fsync. If they do not, or if the majority do
> not (Samba does not by default IIRC, neither does NFS), then you should
> be running iozone with fsync disabled. In other words, since you're not
> comparing two similar systems, you should be tweaking iozone to best
> mimic your real workloads. Running iozone with buffer cache and with
> fsync disable should produce higher write numbers, which should be
> closer to what you will see with your real workloads.
>
> --
> Stan
>
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: recommended way to add ssd cache to mdraid array
2013-01-15 8:38 ` Stan Hoeppner
2013-01-15 9:02 ` Tommy Apel
@ 2013-01-15 10:47 ` Tommy Apel
2013-01-16 5:31 ` Thomas Fjellstrom
2 siblings, 0 replies; 53+ messages in thread
From: Tommy Apel @ 2013-01-15 10:47 UTC (permalink / raw)
To: stan; +Cc: thomas, Chris Murphy, linux-raid Raid
If I follow you, I can only assume that my server is better at
administering cache than Thomas' server according to these results;
this doesn't tell me much about how the subsystem is handling the IO though.
Command line used: iozone -a -s 128g -r 8m
Output is in Kbytes/sec
Time Resolution = 0.000001 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.
random random
KB reclen write rewrite read reread read write
134217728 8192 365802 371250 397293 399526 241641 265306
2013/1/15 Stan Hoeppner <stan@hardwarefreak.com>:
> On 1/14/2013 9:52 PM, Thomas Fjellstrom wrote:
> ...
>> I haven't been comparing it against my other system, as its kind of apples and
>> oranges. My old array, on somewhat similar hardware for the most part, but
>> uses older 1TB drives in RAID5.
> ...
>> It is working. And I can live with it as is, but it does seem like something
>> isn't right. If thats just me jumping to conclusions, well thats fine then.
>> But 600MB/s+ reads vs 200MB/s writes seems a tad off.
>
> It's not off. As myself and others stated previously, this low write
> performance is typical of RAID6, particularly for unaligned or partial
> stripe writes--anything that triggers a RMW cycle.
>
>> I'm running the same iozone test on the old array, see how it goes. But its
>> currently in use, and getting full (84G free out of 5.5TB), so I'm not
>> positive how well it'll do as compared to if it was a fresh array like the new
>> nas array.
> ...
>> Preliminary results show similar read/write patterns (140MB/s write, 380MB/s
>> read), albeit slower probably due to being well aged, in use, and maybe the
>> drive speeds (the 1TB drives are 20-40MB/s slower than the 2TB drives in a
>> straight read test, I can't remember the write differences).
>
> Yes, the way in which the old filesystem has aged, and the difference in
> single drive performance, will both cause lower numbers on the old hardware.
>
> What you're really after, what you want to see, is iozone numbers from a
> similar system with a 7 drive md/RAID6 array with XFS. Only that will
> finally convince you, one way or the other, that your array is doing
> pretty much as well as it can, or not. However, even once you've
> established this, it still doesn't inform you as to how well the new
> array will perform with your workloads.
>
> On that note, someone stated you should run iozone using O_DIRECT writes
> to get more accurate numbers, or more precisely, to eliminate the Linux
> buffer cache from the equation. Doing this actually makes your testing
> LESS valid, because your real world use will likely include all buffered
> IO, and no direct IO.
>
> What you should be concentrating on right now is identifying if any of
> your workloads make use of fsync. If they do not, or if the majority do
> not (Samba does not by default IIRC, neither does NFS), then you should
> be running iozone with fsync disabled. In other words, since you're not
> comparing two similar systems, you should be tweaking iozone to best
> mimic your real workloads. Running iozone with buffer cache and with
> fsync disable should produce higher write numbers, which should be
> closer to what you will see with your real workloads.
>
> --
> Stan
>
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: recommended way to add ssd cache to mdraid array
2013-01-15 9:02 ` Tommy Apel
@ 2013-01-15 11:19 ` Stan Hoeppner
0 siblings, 0 replies; 53+ messages in thread
From: Stan Hoeppner @ 2013-01-15 11:19 UTC (permalink / raw)
To: Tommy Apel; +Cc: linux-raid Raid
On 1/15/2013 3:02 AM, Tommy Apel wrote:
> Stan: it is true what you are saying about the cache and real life usage
> but if you suspect a problem with the array I would suggest testing array
> rather than the buffer system in linux hence the use if O_DIRECT as that
> will determine the array performance and not the vmem.
If you really want to get to testing only the array you must bypass the
filesystem as well using something like fio for your testing. Simply
bypassing the buffer cache only removes one obstacle of many in the IO path.
But in Thomas' case raw IO numbers still don't clear the fog for him
because he has no idea what the numbers _should_ be in the first place.
He is assuming, due to lack of knowledge/experience, that write
throughput that is ~1/3rd his read throughput is wrong. It's not. It
is expected.
To see this for himself all he need do is blow it away and create a 6
disk RAID10 or layered RAID0 over 3 RAID1 pairs. When he sees the much
higher iozone write throughput with only 6 of his disks, only 3 spindles
vs 5, then he'll finally understand there is a huge write performance
penalty with RAID6.
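A quick sketch of what I mean, with made-up device names:
  # straight 6-disk RAID10
  mdadm -C /dev/md0 -l 10 -n 6 /dev/sd[b-g]
  # or a RAID0 layered over 3 RAID1 pairs
  mdadm -C /dev/md1 -l 1 -n 2 /dev/sdb /dev/sdc
  mdadm -C /dev/md2 -l 1 -n 2 /dev/sdd /dev/sde
  mdadm -C /dev/md3 -l 1 -n 2 /dev/sdf /dev/sdg
  mdadm -C /dev/md0 -l 0 -n 3 /dev/md[1-3]
Then run the same iozone test against it and compare the write columns.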
--
Stan
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: recommended way to add ssd cache to mdraid array
2013-01-15 8:38 ` Stan Hoeppner
2013-01-15 9:02 ` Tommy Apel
2013-01-15 10:47 ` Tommy Apel
@ 2013-01-16 5:31 ` Thomas Fjellstrom
2013-01-16 8:59 ` John Robinson
2013-01-16 22:06 ` Stan Hoeppner
2 siblings, 2 replies; 53+ messages in thread
From: Thomas Fjellstrom @ 2013-01-16 5:31 UTC (permalink / raw)
To: stan; +Cc: Tommy Apel Hansen, Chris Murphy, linux-raid Raid
On Tue Jan 15, 2013, Stan Hoeppner wrote:
> On 1/14/2013 9:52 PM, Thomas Fjellstrom wrote:
> ...
>
> > I haven't been comparing it against my other system, as its kind of
> > apples and oranges. My old array, on somewhat similar hardware for the
> > most part, but uses older 1TB drives in RAID5.
>
> ...
>
> > It is working. And I can live with it as is, but it does seem like
> > something isn't right. If thats just me jumping to conclusions, well
> > thats fine then. But 600MB/s+ reads vs 200MB/s writes seems a tad off.
>
> It's not off. As myself and others stated previously, this low write
> performance is typical of RAID6, particularly for unaligned or partial
> stripe writes--anything that triggers a RMW cycle.
That gets me thinking. Maybe try a test with the record size set to the
stripe width; that would hopefully show some more accurate numbers.
If that large of a difference between reads and writes is perfectly normal, I
can accept that. I am wondering what kinds of numbers others see in the real
world.
> > I'm running the same iozone test on the old array, see how it goes. But
> > its currently in use, and getting full (84G free out of 5.5TB), so I'm
> > not positive how well it'll do as compared to if it was a fresh array
> > like the new nas array.
>
> ...
>
> > Preliminary results show similar read/write patterns (140MB/s write,
> > 380MB/s read), albeit slower probably due to being well aged, in use,
> > and maybe the drive speeds (the 1TB drives are 20-40MB/s slower than the
> > 2TB drives in a straight read test, I can't remember the write
> > differences).
>
> Yes, the way in which the old filesystem has aged, and the difference in
> single drive performance, will both cause lower numbers on the old
> hardware.
>
> What you're really after, what you want to see, is iozone numbers from a
> similar system with a 7 drive md/RAID6 array with XFS. Only that will
> finally convince you, one way or the other, that your array is doing
> pretty much as well as it can, or not. However, even once you've
> established this, it still doesn't inform you as to how well the new
> array will perform with your workloads.
In the end, the performance I am getting is more than I currently use day to
day. So it's not a huge problem I need to solve; rather it's something I thought
was odd, and wanted to figure out.
> On that note, someone stated you should run iozone using O_DIRECT writes
> to get more accurate numbers, or more precisely, to eliminate the Linux
> buffer cache from the equation. Doing this actually makes your testing
> LESS valid, because your real world use will likely include all buffered
> IO, and no direct IO.
I didn't think it'd be a very good test of real world performance, but it
can't hurt to be thorough. Though I just checked on it, that one run is still
going, and it seems like it may be quite a while.
> What you should be concentrating on right now is identifying if any of
> your workloads make use of fsync. If they do not, or if the majority do
> not (Samba does not by default IIRC, neither does NFS), then you should
> be running iozone with fsync disabled. In other words, since you're not
> comparing two similar systems, you should be tweaking iozone to best
> mimic your real workloads. Running iozone with buffer cache and with
> fsync disable should produce higher write numbers, which should be
> closer to what you will see with your real workloads.
I doubt very much of my workload uses fsync, though if I move some p2p stuff
to it, it /may/ use fsync; to be honest, I'm not sure which (if any) p2p clients
use fsync. Or if I particularly care in that case. P2P performance really
depends on decent random writes of 512KB-4MB, which an array like this isn't
exactly going to excel at.
P2P is one reason I was interested in ssd caching (I tried playing with bcache;
it seemed to cut read speeds down to 200MB/s or something crazy, likely a
misconfig on my part. I still have to finish looking into that; the author
suggested changing the cache mode).
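From what I remember of the bcache docs, the cache mode is just a sysfs knob,
something like the following once the cache device is attached (I'm going from
memory here, so treat it as a rough note rather than gospel):
  echo writeback > /sys/block/bcache0/bcache/cache_mode
  cat /sys/block/bcache0/bcache/cache_mode
That's the first thing I'll try when I get back to it.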
I've found it incredibly annoying when video playback stutters due to other
activity on the array. It used to happen often enough due to the different
jobs I had it doing. At one point it held VM images, regular rsnapshot
backups, all of my media, and some torrent downloads. Over the past while I've
slowly pulled one job after another off the old array until all it's really
doing now is storing random downloads, p2p and media file streaming.
--
Thomas Fjellstrom
thomas@fjellstrom.ca
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: recommended way to add ssd cache to mdraid array
2013-01-16 5:31 ` Thomas Fjellstrom
@ 2013-01-16 8:59 ` John Robinson
2013-01-16 21:29 ` Stan Hoeppner
2013-01-16 22:06 ` Stan Hoeppner
1 sibling, 1 reply; 53+ messages in thread
From: John Robinson @ 2013-01-16 8:59 UTC (permalink / raw)
To: thomas; +Cc: stan, Tommy Apel Hansen, Chris Murphy, linux-raid Raid
On 16/01/2013 05:31, Thomas Fjellstrom wrote:
> On Tue Jan 15, 2013, Stan Hoeppner wrote:
>> On 1/14/2013 9:52 PM, Thomas Fjellstrom wrote:
>> ...
>>> It is working. And I can live with it as is, but it does seem like
>>> something isn't right. If thats just me jumping to conclusions, well
>>> thats fine then. But 600MB/s+ reads vs 200MB/s writes seems a tad off.
>>
>> It's not off. As myself and others stated previously, this low write
>> performance is typical of RAID6, particularly for unaligned or partial
>> stripe writes--anything that triggers a RMW cycle.
>
> That gets me thinking. Maybe try a test with the record test size set to the
> stripe width, that would hopefully show some more accurate numbers.
Your 7-drive RAID-6 with 512K chunk will have a 2.5MB stripe width, or
stride, whichever is the correct term, on the basis of 5 data chunks.
Even still, a filesystem-level test cannot guarantee to be writing
records aligned to the array's data stripes.
If you do another benchmark, try running iostat concurrently, to see how
many reads are happening during the write tests.
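Something as simple as this alongside iozone would do (substitute your own
member drives):
  iostat -xm 2 sdb sdc sdd sde sdf sdg sdh
If the members show significant r/s and rMB/s while iozone is in its write
phases, that's the read half of the read-modify-write cycles showing up.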
At the same time, if in the real world you're doing streaming writes of
dozens of MB/s, I would expect that write caching would turn a good
proportion of the writes into full-stripe writes.
Cheers,
John.
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: recommended way to add ssd cache to mdraid array
2013-01-16 8:59 ` John Robinson
@ 2013-01-16 21:29 ` Stan Hoeppner
2013-02-10 6:59 ` Thomas Fjellstrom
0 siblings, 1 reply; 53+ messages in thread
From: Stan Hoeppner @ 2013-01-16 21:29 UTC (permalink / raw)
To: John Robinson; +Cc: thomas, Tommy Apel Hansen, Chris Murphy, linux-raid Raid
On 1/16/2013 2:59 AM, John Robinson wrote:
> At the same time, if in the real world you're doing streaming writes of
> dozens of MB/s, I would expect that write caching would turn a good
> proportion of the writes into full-stripe writes.
The filesystem dictates whether a write is stripe aligned or not, and
whether it fills a full stripe. If the filesystem performs multiple partial
stripe writes, no amount of write caching is going to turn them into a
single stripe-aligned write. That's not possible.
On that note, a great many people on this list mistakenly configure,
optimize, and test their arrays for a magical chunk/stripe size,
apparently oblivious to the fact that most writes with most workloads
are not going to be full stripe writes, or even stripe aligned at all:
1. Journal writes can be aligned, but usually don't fill a full stripe
2. Metadata writes to the directory trees are often unaligned
3. File appends or modify-in-place ops are never aligned
The only instance in which one will always get full stripe writes is
when creating and writing a new file whose size is a multiple of the
full stripe width. I.e. one must be performing allocation to fill full
stripes. Most workloads don't do this.
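(To make that concrete, a hypothetical pair of writes against a 2.5 MiB-wide
stripe; the path is a placeholder, and the first case assumes the filesystem
starts the allocation on a stripe boundary.)

  # new file written in whole multiples of the stripe width (5 x 512 KiB)
  dd if=/dev/zero of=/mnt/array/newfile bs=2560k count=400 conv=fsync

  # small append to the same file: a partial-stripe write, hence RMW
  dd if=/dev/zero of=/mnt/array/newfile bs=64k count=1 oflag=append conv=notrunc,fsync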
This is the reason why RAID6 performs so horribly with mixed read/write
workloads. Using Thomas' example, while he was doing a streaming read
of a media file and simultaneously doing non-aligned writes from a P2P
or other application, md is performing a RMW operation during each
write, adding substantially to the seek burden on the drives. RAID5/6
use rotating parity, so he also has an extra seek on each of two drives
occurring, competing with the read seeks of his streaming app. Consumer
7.2K drives aren't designed to handle this type of random seek load with
good performance.
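(A rough tally, glossing over md's exact internals, for a small write that
touches a single chunk:)

  RAID6:  read old data + old P + old Q (or the other data chunks in the
          stripe), then write new data + new P + new Q  -> ~6 disk ops
          spread over 3 or more drives
  RAID10: write the chunk to each of 2 mirrors -> 2 disk ops, no reads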
If using RAID10 or RAID0 over RAID1, there is no RMW penalty for partial
stripe width writes, and no extra seek burden for the parity writes, as
described above for RAID5/6. Thus it doesn't cause the playback stutter,
because the disks can service the read and write requests without running
out of head seek bandwidth the way parity arrays do, due to RMW and parity
block writes.
In summary, with Thomas' old disk system, he would have most likely
avoided the playback stutter simply by using a non-parity RAID level.
I'm constantly amazed by the fact that so many people here using parity
RAID don't understand the performance impact of these basic parity RAID
IO behaviors, and how striping actually works, and the fact that most
often they're not writing full stripes, and thus not benefiting from
their spindle count.
--
Stan
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: recommended way to add ssd cache to mdraid array
2013-01-16 5:31 ` Thomas Fjellstrom
2013-01-16 8:59 ` John Robinson
@ 2013-01-16 22:06 ` Stan Hoeppner
1 sibling, 0 replies; 53+ messages in thread
From: Stan Hoeppner @ 2013-01-16 22:06 UTC (permalink / raw)
To: thomas; +Cc: Tommy Apel Hansen, Chris Murphy, linux-raid Raid
On 1/15/2013 11:31 PM, Thomas Fjellstrom wrote:
> That gets me thinking. Maybe try a test with the record size set to the
> stripe width; that would hopefully show some more accurate numbers.
That will give you less accurate numbers if you're looking for numbers
that reflect real-world performance. Again, in general, most real-world
writes are not stripe aligned, and not full width, especially with the
insanely large 512KB chunk size.
For all but a few workloads, storage should be designed and optimized
assuming most writes will be unaligned or not full width, -not- assuming
most will be full stripe writes.
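(If you do want to see both ends of the spectrum side by side, something along
these lines; the flags and path are assumptions, with 2560k matching the full
stripe width and 64k standing in for the unaligned case.)

  for r in 64k 512k 2560k; do
      iozone -s 8G -r $r -i 0 -i 1 -e -f /mnt/array/iozone.tmp
  done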
--
Stan
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: recommended way to add ssd cache to mdraid array
2013-01-16 21:29 ` Stan Hoeppner
@ 2013-02-10 6:59 ` Thomas Fjellstrom
0 siblings, 0 replies; 53+ messages in thread
From: Thomas Fjellstrom @ 2013-02-10 6:59 UTC (permalink / raw)
To: linux-raid; +Cc: stan, Phil Turmel, Chris Murphy, Tommy Apel Hansen
On January 16, 2013, Stan Hoeppner wrote:
[snip]
> This is the reason why RAID6 performs so horribly with mixed read/write
> workloads. Using Thomas' example, while he was doing a streaming read
> of a media file and simultaneously doing non-aligned writes from a P2P
> or other application, md is performing a RMW operation during each
> write, adding substantially to the seek burden on the drives. RAID5/6
> use rotating parity, so he also has an extra seek on each of two drives
> occurring, competing with the read seeks of his streaming app. Consumer
> 7.2K drives aren't designed to handle this type of random seek load with
> good performance.
Re-reading through this thread (I have a bit of spare time this weekend), I
finally understood what you wrote there. I'm not normally quite this dense,
and I appologise. The RMW ops and double seek penalties are really quite
harsh.
> If using RAID10 or RAID0 over RAID1, there is no RMW penalty for partial
> stripe width writes, and no extra seek burden for the parity writes, as
> described above for RAID5/6. Thus it doesn't cause the playback stutter
> as the disks can service the read and write requests without running out
> of head seek bandwidth as parity arrays do due to RMW and parity block
> writes.
>
> In summary, with Thomas' old disk system, he would have most likely
> avoided the playback stutter simply by using a non-parity RAID level.
>
> I'm constantly amazed by the fact that so many people here using parity
> RAID don't understand the performance impact of these basic parity RAID
> IO behaviors, and how striping actually works, and the fact that most
> often they're not writing full stripes, and thus not benefiting from
> their spindle count.
I actually do/did have a decent understanding of how raid5 works, and why it's
slower than a lay-person would intuit: the extra seeking, the RMW, and for
software raid, the parity calculations. What happened is I didn't extend
that to how raid6 obviously works, with an extra parity set interleaved
with the data. *sigh*
Complete face palm moment.
In the future, if I need more performance than what I'll get out of this
setup, I'll move to smaller drives (with ERC and low URE rates) in raid10.
If I had a couple/few extra bays in this box, I'd be very tempted to go raid10
right now.
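(A sketch of both pieces, with device names as placeholders; the smartctl lines
only work on drives that actually support SCT ERC, and the mdadm line is just
one possible 6-disk raid10 layout, not a recommendation for specific drives.)

  # does the drive support ERC/TLER, and what is it set to (tenths of a second)?
  smartctl -l scterc /dev/sda

  # if supported, set 7.0s read/write recovery timeouts
  smartctl -l scterc,70,70 /dev/sda

  # one possible non-parity layout: 6-disk raid10 with the default near-2 copies
  mdadm --create /dev/md0 --level=10 --layout=n2 --raid-devices=6 /dev/sd[a-f]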
As I haven't noticed any stuttering with my "old" 5.5TB (7x1TB) array, I
somewhat doubt I'll notice any on my new 11TB array (7x2TB, switched to raid5;
as with the backup array, I probably don't need/want the extra parity, since if
two drives do die at the same time, I still have a backup copy of most/all of
the data and can just restore it).
Thank you Stan, Chris, Phil, and Tommy for the help and insight. It was all
very helpful.
--
Thomas Fjellstrom
thomas@fjellstrom.ca
^ permalink raw reply [flat|nested] 53+ messages in thread
end of thread
Thread overview: 53+ messages
2012-12-22 6:57 recommended way to add ssd cache to mdraid array Thomas Fjellstrom
2012-12-23 3:44 ` Thomas Fjellstrom
2013-01-09 18:41 ` Thomas Fjellstrom
2013-01-10 6:25 ` Chris Murphy
2013-01-10 10:49 ` Thomas Fjellstrom
2013-01-10 21:36 ` Chris Murphy
2013-01-11 0:18 ` Stan Hoeppner
2013-01-11 12:35 ` Thomas Fjellstrom
2013-01-11 12:48 ` Thomas Fjellstrom
2013-01-14 0:05 ` Tommy Apel Hansen
2013-01-14 8:58 ` Thomas Fjellstrom
2013-01-14 18:22 ` Thomas Fjellstrom
2013-01-14 19:45 ` Stan Hoeppner
2013-01-14 21:53 ` Thomas Fjellstrom
2013-01-14 22:51 ` Chris Murphy
2013-01-15 3:25 ` Thomas Fjellstrom
2013-01-15 1:50 ` Stan Hoeppner
2013-01-15 3:52 ` Thomas Fjellstrom
2013-01-15 8:38 ` Stan Hoeppner
2013-01-15 9:02 ` Tommy Apel
2013-01-15 11:19 ` Stan Hoeppner
2013-01-15 10:47 ` Tommy Apel
2013-01-16 5:31 ` Thomas Fjellstrom
2013-01-16 8:59 ` John Robinson
2013-01-16 21:29 ` Stan Hoeppner
2013-02-10 6:59 ` Thomas Fjellstrom
2013-01-16 22:06 ` Stan Hoeppner
2013-01-14 21:38 ` Tommy Apel Hansen
2013-01-14 21:47 ` Tommy Apel Hansen
2013-01-11 12:20 ` Thomas Fjellstrom
2013-01-11 17:39 ` Chris Murphy
2013-01-11 17:46 ` Chris Murphy
2013-01-11 18:52 ` Thomas Fjellstrom
2013-01-12 0:47 ` Phil Turmel
2013-01-12 3:56 ` Chris Murphy
2013-01-13 22:13 ` Phil Turmel
2013-01-13 23:20 ` Chris Murphy
2013-01-14 0:23 ` Phil Turmel
2013-01-14 3:58 ` Chris Murphy
2013-01-14 22:00 ` Thomas Fjellstrom
2013-01-11 18:51 ` Thomas Fjellstrom
2013-01-11 22:17 ` Stan Hoeppner
2013-01-12 2:44 ` Thomas Fjellstrom
2013-01-12 8:33 ` Stan Hoeppner
2013-01-12 14:44 ` Thomas Fjellstrom
2013-01-13 19:18 ` Chris Murphy
2013-01-14 9:06 ` Thomas Fjellstrom
2013-01-11 18:50 ` Stan Hoeppner
2013-01-12 2:45 ` Thomas Fjellstrom
2013-01-12 12:06 ` Roy Sigurd Karlsbakk
2013-01-12 14:14 ` Stan Hoeppner
2013-01-12 16:37 ` Roy Sigurd Karlsbakk
2013-01-10 13:13 ` Brad Campbell