Re: recommended way to add ssd cache to mdraid array

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Thomas Fjellstrom <thomas@fjellstrom.ca>
To: Tommy Apel Hansen <tommyapeldk@gmail.com>
Cc: stan@hardwarefreak.com, Chris Murphy <lists@colorremedies.com>,
	linux-raid Raid <linux-raid@vger.kernel.org>
Subject: Re: recommended way to add ssd cache to mdraid array
Date: Mon, 14 Jan 2013 01:58:27 -0700	[thread overview]
Message-ID: <201301140158.27135.thomas@fjellstrom.ca> (raw)
In-Reply-To: <1358121900.3019.1.camel@workstation-home>

On Sun Jan 13, 2013, Tommy Apel Hansen wrote:
> Could you do me a favor and run the iozone test with the -I switch on so
> that we can seen the actual speed of the array and not you RAM

Sure. Though I thought running the test with a file size twice the size of ram 
would help with that issue.

> /Tommy
> 
> On Fri, 2013-01-11 at 05:35 -0700, Thomas Fjellstrom wrote:
> > On Thu Jan 10, 2013, Stan Hoeppner wrote:
> > > On 1/10/2013 3:36 PM, Chris Murphy wrote:
> > > > On Jan 10, 2013, at 3:49 AM, Thomas Fjellstrom <thomas@fjellstrom.ca> 
wrote:
> > > >> A lot of it will be streaming. Some may end up being random
> > > >> read/writes. The test is just to gauge over all performance of the
> > > >> setup. 600MBs read is far more than I need, but having writes at
> > > >> 1/3 that seems odd to me.
> > > > 
> > > > Tell us how many disks there are, and what the chunk size is. It
> > > > could be too small if you have too few disks which results in a
> > > > small full stripe size for a video context. If you're using the
> > > > default, it could be too big and you're getting a lot of RWM. Stan,
> > > > and others, can better answer this.
> > > 
> > > Thomas is using a benchmark, and a single one at that, to judge the
> > > performance.  He's not using his actual workloads.  Tuning/tweaking to
> > > increase the numbers in a benchmark could be detrimental to actual
> > > performance instead of providing a boost.  One must be careful.
> > > 
> > > Regarding RAID6, it will always have horrible performance compared to
> > > non-parity RAID levels and even RAID5, for anything but full stripe
> > > aligned writes, which means writing new large files or doing large
> > > appends to existing files.
> > 
> > Considering its a rather simple use case, mostly streaming video, and
> > misc file sharing for my home network, an iozone test should be rather
> > telling. Especially the full test, from 4k up to 16mb
> > 
> >                                                             random  random   
> >                                                             bkwd   record  
> >                                                             stride
> >               
> >               KB  reclen   write rewrite    read    reread    read  
> >               write    read  rewrite     read   fwrite frewrite   fread 
> >               freread
> >         
> >         33554432       4  243295  221756   628767   624081    1028   
> >         4627   16822  7468777    17740   233295   231092  582036  
> >         579131 33554432       8  241134  225728   628264   627015   
> >         2027    8879   25977 10030302    19578   228923   233928  591478
> >           584892 33554432      16  233758  228122   633406   618248   
> >         3952   13635   35676 10166457    19968   227599   229698  579267
> >           576850 33554432      32  232390  219484   625968   625627   
> >         7604   18800   44252 10728450    24976   216880   222545  556513
> >           555371 33554432      64  222936  206166   631659   627823  
> >         14112   22837   52259 11243595    30251   196243   192755 
> >         498602   494354 33554432     128  214740  182619   628604  
> >         626407   25088   26719   64912 11232068    39867   198638  
> >         185078  463505   467853 33554432     256  202543  185964  
> >         626614   624367   44363   34763   73939 10148251    62349  
> >         176724   191899  593517   595646 33554432     512  208081 
> >         188584   632188   629547   72617   39145   84876  9660408   
> >         89877   182736   172912  610681   608870 33554432    1024 
> >         196429  166125   630785   632413  116793   51904  133342 
> >         8687679   121956   168756   175225  620587   616722 33554432   
> >         2048  185399  167484   622180   627606  188571   70789  218009 
> >         5357136   370189   171019   166128  637830   637120 33554432   
> >         4096  198340  188695   632693   628225  289971   95211  278098 
> >         4836433   611529   161664   170469  665617   655268 33554432   
> >         8192  177919  167524   632030   629077  371602  115228  384030 
> >         4934570   618061   161562   176033  708542   709788 33554432  
> >         16384  196639  183744   631478   627518  485622  133467  462861 
> >         4890426   644615   175411   179795  725966   734364
> > > 
> > > However, everything is relative.  This RAID6 may have plenty of random
> > > and streaming write/read throughput for Thomas.  But a single benchmark
> > > isn't going to inform him accurately.
> > 
> > 200MB/s may be enough, but the difference between the read and write
> > throughput is a bit unexpected. It's not a weak machine (core i3-2120,
> > dual core 3.2Ghz with HT, 16GB ECC 1333Mhz ram), and this is basically
> > all its going to be doing.
> > 
> > > > You said these are unpartitioned disks, I think. In which case
> > > > alignment of 4096 byte sectors isn't a factor if these are AF disks.
> > > > 
> > > > Unlikely to make up the difference is the scheduler. Parallel fs's
> > > > like XFS don't perform nearly as well with CFQ, so you should have a
> > > > kernel parameter elevator=noop.
> > > 
> > > If the HBAs have [BB|FB]WC then one should probably use noop as the
> > > cache schedules the actual IO to the drives.  If the HBAs lack cache,
> > > then deadline often provides better performance.  Testing of each is
> > > required on a system and workload basis.  With two identical systems
> > > (hardware/RAID/OS) one may perform better with noop, the other with
> > > deadline.  The determining factor is the applications' IO patterns.
> > 
> > Mostly streaming reads, some long rsync's to copy stuff back and forth,
> > file share duties (downloads etc).
> > 
> > > > Another thing to look at is md/stripe_cache_size which probably needs
> > > > to be higher for your application.
> > > > 
> > > > Another thing to look at is if you're using XFS, what your mount
> > > > options are. Invariably with an array of this size you need to be
> > > > mounting with the inode64 option.
> > > 
> > > The desired allocator behavior is independent of array size but, once
> > > again, dependent on the workloads.  inode64 is only needed for large
> > > filesystems with lots of files, where 1TB may not be enough for the
> > > directory inodes.  Or, for mixed metadata/data heavy workloads.
> > > 
> > > For many workloads including databases, video ingestion, etc, the
> > > inode32 allocator is preferred, regardless of array size.  This is the
> > > linux-raid list so I'll not go into detail of the XFS allocators.
> > 
> > If you have the time and the desire, I'd like to hear about it off list.
> > 
> > > >> The reason I've selected RAID6 to begin with is I've read (on this
> > > >> mailing list, and on some hardware tech sites) that even with SAS
> > > >> drives, the rebuild/resync time on a large array using large disks
> > > >> (2TB+) is long enough that it gives more than enough time for
> > > >> another disk to hit a random read error,
> > > > 
> > > > This is true for high density consumer SATA drives. It's not nearly
> > > > as applicable for low to moderate density nearline SATA which has an
> > > > order of magnitude lower UER, or for enterprise SAS (and some
> > > > enterprise SATA) which has yet another order of magnitude lower UER.
> > > >  So it depends on the disks, and the RAID size, and the
> > > > backup/restore strategy.
> > > 
> > > Yes, enterprise drives have a much larger spare sector pool.
> > > 
> > > WRT rebuild time, this is one more reason to use RAID10 or a concat of
> > > RAID1s.  The rebuild time is low, constant, predictable.  For 2TB
> > > drives about 5-6 hours at 100% rebuild rate.  And rebuild time, for
> > > any array type, with gargantuan drives, is yet one more reason not to
> > > use the largest drives you can get your hands on.  Using 1TB drives
> > > will cut that to 2.5-3 hours, and using 500GB drives will cut it down
> > > to 1.25-1.5 hours, as all these drives tend to have similar streaming
> > > write rates.
> > > 
> > > To wit, as a general rule I always build my arrays with the smallest
> > > drives I can get away with for the workload at hand.  Yes, for a given
> > > TB total it increases acquisition cost of drives, HBAs, enclosures, and
> > > cables, and power consumption, but it also increases spindle
> > > count--thus performance-- while decreasing rebuild times
> > > substantially/dramatically.
> > 
> > I'd go raid10 or something if I had the space, but this little 10TB nas
> > (which is the goal, a small, quiet, not too slow, 10TB nas with some
> > kind of redundancy) only fits 7 3.5" HDDs.
> > 
> > Maybe sometime in the future I'll get a big 3 or 4 u case with a crap
> > load of 3.5" HDD bays, but for now, this is what I have (as well as my
> > old array, 7x1TB RAID5+XFS in 4in3 hot swap bays with room for 8 drives,
> > but haven't bothered to expand the old array, and I have the new one
> > almost ready to go).
> > 
> > I don't know if it impacts anything at all, but when burning in these
> > drives after I bought them, I ran the same full iozone test a couple
> > times, and each drive shows 150MB/s read, and similar write times
> > (100-120+?). It impressed me somewhat, to see a mechanical hard drive go
> > that fast. I remember back a few years ago thinking 80MBs was fast for a
> > HDD.


-- 
Thomas Fjellstrom
thomas@fjellstrom.ca

next prev parent reply	other threads:[~2013-01-14  8:58 UTC|newest]

Thread overview: 53+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-12-22  6:57 recommended way to add ssd cache to mdraid array Thomas Fjellstrom
2012-12-23  3:44 ` Thomas Fjellstrom
2013-01-09 18:41   ` Thomas Fjellstrom
2013-01-10  6:25     ` Chris Murphy
2013-01-10 10:49       ` Thomas Fjellstrom
2013-01-10 21:36         ` Chris Murphy
2013-01-11  0:18           ` Stan Hoeppner
2013-01-11 12:35             ` Thomas Fjellstrom
2013-01-11 12:48               ` Thomas Fjellstrom
2013-01-14  0:05               ` Tommy Apel Hansen
2013-01-14  8:58                 ` Thomas Fjellstrom [this message]
2013-01-14 18:22                   ` Thomas Fjellstrom
2013-01-14 19:45                     ` Stan Hoeppner
2013-01-14 21:53                       ` Thomas Fjellstrom
2013-01-14 22:51                         ` Chris Murphy
2013-01-15  3:25                           ` Thomas Fjellstrom
2013-01-15  1:50                         ` Stan Hoeppner
2013-01-15  3:52                           ` Thomas Fjellstrom
2013-01-15  8:38                             ` Stan Hoeppner
2013-01-15  9:02                               ` Tommy Apel
2013-01-15 11:19                                 ` Stan Hoeppner
2013-01-15 10:47                               ` Tommy Apel
2013-01-16  5:31                               ` Thomas Fjellstrom
2013-01-16  8:59                                 ` John Robinson
2013-01-16 21:29                                   ` Stan Hoeppner
2013-02-10  6:59                                     ` Thomas Fjellstrom
2013-01-16 22:06                                 ` Stan Hoeppner
2013-01-14 21:38                     ` Tommy Apel Hansen
2013-01-14 21:47                     ` Tommy Apel Hansen
2013-01-11 12:20           ` Thomas Fjellstrom
2013-01-11 17:39             ` Chris Murphy
2013-01-11 17:46               ` Chris Murphy
2013-01-11 18:52                 ` Thomas Fjellstrom
2013-01-12  0:47                 ` Phil Turmel
2013-01-12  3:56                   ` Chris Murphy
2013-01-13 22:13                     ` Phil Turmel
2013-01-13 23:20                       ` Chris Murphy
2013-01-14  0:23                         ` Phil Turmel
2013-01-14  3:58                           ` Chris Murphy
2013-01-14 22:00                           ` Thomas Fjellstrom
2013-01-11 18:51               ` Thomas Fjellstrom
2013-01-11 22:17                 ` Stan Hoeppner
2013-01-12  2:44                   ` Thomas Fjellstrom
2013-01-12  8:33                     ` Stan Hoeppner
2013-01-12 14:44                       ` Thomas Fjellstrom
2013-01-13 19:18                 ` Chris Murphy
2013-01-14  9:06                   ` Thomas Fjellstrom
2013-01-11 18:50             ` Stan Hoeppner
2013-01-12  2:45               ` Thomas Fjellstrom
2013-01-12 12:06           ` Roy Sigurd Karlsbakk
2013-01-12 14:14             ` Stan Hoeppner
2013-01-12 16:37               ` Roy Sigurd Karlsbakk
2013-01-10 13:13   ` Brad Campbell

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=201301140158.27135.thomas@fjellstrom.ca \
    --to=thomas@fjellstrom.ca \
    --cc=linux-raid@vger.kernel.org \
    --cc=lists@colorremedies.com \
    --cc=stan@hardwarefreak.com \
    --cc=tommyapeldk@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.