* sequential versus random I/O
@ 2014-01-29 17:23 Matt Garman
2014-01-30 0:10 ` Adam Goryachev
` (2 more replies)
0 siblings, 3 replies; 15+ messages in thread
From: Matt Garman @ 2014-01-29 17:23 UTC (permalink / raw)
To: Mdadm
This is arguably off-topic for this list, but hopefully it's relevant
enough that no one gets upset...
I have a conceptual question regarding "sequential" versus "random"
I/O, reads in particular.
Say I have a simple case: one disk and exactly one program reading one
big file off the disk. Clearly, that's a sequential read operation.
(And I assume that's basically a description of a sequential read disk
benchmark program.)
Now I have one disk with two large files on it. By "large" I mean the
files are at least 2x bigger than any disk cache or system RAM, i.e.
for the sake of argument, ignore caching in the system. I have
exactly two programs running, and each program constantly reads and
re-reads one of those two big files.
From the programs' perspective, this is clearly a sequential read.
But from the disk's perspective, it looks to me at least somewhat like
random I/O: for a spinning disk, the head will presumably be jumping
around quite a bit to fulfill both requests at the same time.
And then generalize that second example: one disk, one filesystem,
with some arbitrary number of large files, and an arbitrary number of
running programs, all doing sequential reads of the files. Again,
looking at each program in isolation, it's a sequential read request.
But at the system level, all those programs in aggregate present more
of a random read I/O load... right?
So if a storage system (individual disk, RAID, NAS appliance, etc)
advertises X MB/s sequential read, that X is only meaningful if there
is exactly one reader. Obviously I can't run two sequential read
benchmarks in parallel and expect to get the same result as running
one benchmark in isolation. I would expect the two parallel
benchmarks to report roughly 1/2 the performance of the single
instance. And as more benchmarks are run in parallel, I would expect
the performance report to eventually look like the result of a random
read benchmark.
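For illustration, here is a rough back-of-the-envelope model of that
degradation on a single spinning disk; the streaming rate, average seek
time, and per-visit chunk size are assumed numbers chosen only to show the
shape of the curve, not measurements:

    # Model N programs each sequentially reading a different large file from
    # one spinning disk.  Assumed (not measured) numbers: ~150 MB/s streaming
    # rate, ~8 ms average seek, 1 MB handed to a reader per service visit.
    STREAM_MBPS = 150.0
    SEEK_S      = 0.008
    CHUNK_MB    = 1.0

    def throughput(n_readers):
        transfer_s = CHUNK_MB / STREAM_MBPS
        seek_s = SEEK_S if n_readers > 1 else 0.0      # a lone reader never seeks away
        aggregate = CHUNK_MB / (transfer_s + seek_s)   # total MB/s the disk delivers
        return aggregate, aggregate / n_readers        # (aggregate, per-reader) MB/s

    for n in (1, 2, 4, 16, 64):
        total, each = throughput(n)
        print(f"{n:3d} readers: {total:6.1f} MB/s aggregate, {each:6.2f} MB/s per reader")

With more than one reader the aggregate collapses toward the disk's
random-read rate, which is exactly the effect described above.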
The motivation for this question comes from my use case, which is
similar to running a bunch of sequential read benchmarks in parallel.
In particular, we have a big NFS server that houses a collection of
large files (average ~400 MB). The server is read-only mounted by
dozens of compute nodes. Each compute node in turn runs dozens of
processes that continually re-read those big files. Generally
speaking, should the NFS server (including RAID subsystem) be tuned
for sequential I/O or random I/O?
Furthermore, how does this differ (if at all) between spinning drives
and SSDs? For simplicity, assume a spinning drive and an SSD
advertise the same sequential read throughput. (I know this is a
stretch, but assume the advertising is honest and accurate.) The
difference, though, is that the spinning disk can do 200 IOPS, but the
SSD can do 10,000 IOPS... intuitively, it seems like the SSD ought to
have the edge in my multi-consumer example. But, is my intuition
correct? And if so, how can I quantify how much better the SSD is?
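One way to put rough numbers on that intuition, using the IOPS figures
above: assume every request in the heavily shared case effectively becomes
a random access that moves one readahead-sized chunk, and (per the premise)
give both devices the same advertised sequential rate. The 1 MB request
size and the 500 MB/s figure are assumptions for illustration only:

    # Effective throughput of one device under many interleaved sequential
    # streams, where each request pays a positioning cost (1/IOPS) plus the
    # transfer time at the advertised sequential rate.
    REQUEST_MB = 1.0       # assumed readahead-sized request
    SEQ_MBPS   = 500.0     # same advertised sequential rate for both devices

    def busy_throughput_mbps(iops):
        service_s = 1.0 / iops + REQUEST_MB / SEQ_MBPS
        return REQUEST_MB / service_s

    hdd = busy_throughput_mbps(200)      # spinning disk
    ssd = busy_throughput_mbps(10_000)   # SSD
    print(f"HDD under heavy concurrency: ~{hdd:.0f} MB/s")
    print(f"SSD under heavy concurrency: ~{ssd:.0f} MB/s ({ssd / hdd:.1f}x the HDD)")

The gap widens as the effective request size shrinks (smaller readahead,
more head movement per byte) and narrows as requests get larger.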
Thanks,
Matt
^ permalink raw reply [flat|nested] 15+ messages in thread

* Re: sequential versus random I/O
  2014-01-29 17:23 sequential versus random I/O Matt Garman
@ 2014-01-30  0:10 ` Adam Goryachev
  2014-01-30  0:41 ` Roberto Spadim
  2014-01-30  2:38 ` Stan Hoeppner
  2 siblings, 0 replies; 15+ messages in thread
From: Adam Goryachev @ 2014-01-30 0:10 UTC (permalink / raw)
To: Matt Garman, Mdadm

On 30/01/14 04:23, Matt Garman wrote:
> [...]
> difference, though, is that the spinning disk can do 200 IOPS, but the
> SSD can do 10,000 IOPS... intuitively, it seems like the SSD ought to
> have the edge in my multi-consumer example. But, is my intuition
> correct? And if so, how can I quantify how much better the SSD is?

When doing parallel reads, you will get less than half the read speed for
each of the two readers, because you have to wait for the drive's seek time
each time it moves from reading one file to the other. You might get 40% of
the read speed for each; but if you have 100 readers, you will get a lot
less than 1% each, because the seek overhead is multiplied 100x instead of
only 2x.

For an SSD, however, the seek time is effectively zero, so you will get
exactly half the read speed for each of the two readers (or 1% of the read
speed each for 100 readers, and so on). That would be the perfect
application for SSDs: read-only (so you never even have to think about the
write-endurance limitation) and a large number of concurrent accesses.

Of course, RAID at various levels will help you scale even further with
either spinning disks or SSDs; even a linear (concatenated) array would
help, because different files will land on different disks. You will
probably also want some protection from failed disks.

Regards,
Adam

--
Adam Goryachev
Website Managers
www.websitemanagers.com.au

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: sequential versus random I/O
  2014-01-29 17:23 sequential versus random I/O Matt Garman
  2014-01-30  0:10 ` Adam Goryachev
@ 2014-01-30  0:41 ` Roberto Spadim
  2014-01-30  0:45 ` Roberto Spadim
  2014-01-30  2:38 ` Stan Hoeppner
  2 siblings, 1 reply; 15+ messages in thread
From: Roberto Spadim @ 2014-01-30 0:41 UTC (permalink / raw)
To: Matt Garman; +Cc: Mdadm

2014-01-29 Matt Garman <matthew.garman@gmail.com>:
> Say I have a simple case: one disk and exactly one program reading one
> big file off the disk. Clearly, that's a sequential read operation.
> [...]

No -- you forgot that the kernel is "another program", and the filesystem
is "another program" too. Only if your 'exactly one program' reads and
writes the block device directly do you get "exactly" one program using the
disk (well, your program plus the Linux kernel...). Some filesystems will
split a file into many fragments, but assuming the disk is new and clean,
many filesystems will not fragment it.

> Now I have one disk with two large files on it. [...] I have
> exactly two programs running, and each program constantly reads and
> re-reads one of those two big files.

OK -- and I will keep in mind that this sits on top of a filesystem, since
you have two large 'files'.

> From the programs' perspective, this is clearly a sequential read.
> But from the disk's perspective, it looks to me at least somewhat like
> random I/O [...]

Hmm, you should look at the Linux I/O scheduler:
http://en.wikipedia.org/wiki/I/O_scheduling
It can do a very, very nice job here. :)

Random versus contiguous is really a block-layer matter. If I'm not wrong,
the filesystem sends requests to the block device, the block layer groups
them and 'creates' the disk commands that fetch the data; a level down,
SATA/SCSI and other protocols talk to the hardware and report errors and
other conditions. I'm a bit out of date on the Linux source, but if you
check the read-balance function of RAID1 you will see an example of how
contiguous reads are treated: read balancing tries to send contiguous reads
to the same disk, which speeds things up a lot and leaves the other disks
free for other tasks/threads/programs.

> And then generalize that second example: one disk, one filesystem,
> with some arbitrary number of large files, and an arbitrary number of
> running programs, all doing sequential reads of the files. [...]

Hmm, the block layer handles this -- check the schedulers/elevators again.
There is a time gap between commands sent to the disk, and even the "noop"
scheduler (elevator) waits a bit so that contiguous reads can be issued
together more often.

> So if a storage system (individual disk, RAID, NAS appliance, etc)
> advertises X MB/s sequential read, that X is only meaningful if there
> is exactly one reader.

The advertised read speed is the "super pro master top ultrablaster"
number: what you can read off the disk with no cache involved and a good
SAS/SCSI/SATA card.

> Obviously I can't run two sequential read
> benchmarks in parallel and expect to get the same result as running
> one benchmark in isolation.

Yes. :)

> I would expect the two parallel
> benchmarks to report roughly 1/2 the performance of the single
> instance. And as more benchmarks are run in parallel, I would expect
> the performance report to eventually look like the result of a random
> read benchmark.

Hmm... you are forgetting the elevators: the 1/2 could be more or less, and
1/n (where n = number of tests) isn't good math either -- there are more
things we could be forgetting (cache, bus problems, disk problems, IRQ
problems, DMA problems, etc.).

> The motivation for this question comes from my use case [...]
> Generally speaking, should the NFS server (including RAID subsystem) be
> tuned for sequential I/O or random I/O?

Hmm -- when I have many threads I use RAID1; I only use RAID0 or another
stripe/linear layout when I have only big files (like a DVR). That gives
better speed than RAID1 in some cases (but you should check for yourself).
Another nice option is a hardware RAID card with a flash-memory cache;
those do a nice caching job.

> Furthermore, how does this differ (if at all) between spinning drives
> and SSDs? [...] And if so, how can I quantify how much better the SSD is?

Hmm, if cost is the concern, consider using SSDs as a cache and HDDs as the
main storage -- Facebook uses this kind of setup a lot. Check bcache,
flashcache and dm-cache:
https://github.com/facebook/flashcache/
http://en.wikipedia.org/wiki/Bcache
http://en.wikipedia.org/wiki/Dm-cache

:)

--
Roberto Spadim

^ permalink raw reply [flat|nested] 15+ messages in thread
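For reference, the elevator in use for a given disk is visible through
sysfs and can be switched at runtime; a minimal sketch, assuming the device
is named sda (the available scheduler names depend on the kernel build):

    # Show, and optionally change, the I/O scheduler ("elevator") for one disk.
    SCHED = "/sys/block/sda/queue/scheduler"   # assumes the device is sda

    with open(SCHED) as f:
        print("schedulers:", f.read().strip())  # e.g. "noop deadline [cfq]"

    # To switch (requires root), write the new name back:
    # with open(SCHED, "w") as f:
    #     f.write("deadline")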
* Re: sequential versus random I/O
  2014-01-30  0:41 ` Roberto Spadim
@ 2014-01-30  0:45 ` Roberto Spadim
  2014-01-30  0:58 ` Roberto Spadim
  0 siblings, 1 reply; 15+ messages in thread
From: Roberto Spadim @ 2014-01-30 0:45 UTC (permalink / raw)
To: Matt Garman; +Cc: Mdadm

Check these too:
http://www.hardwaresecrets.com/article/315
http://en.wikipedia.org/wiki/Native_Command_Queuing
http://en.wikipedia.org/wiki/TCQ

http://en.wikipedia.org/wiki/I/O_scheduling
http://en.wikipedia.org/wiki/Deadline_scheduler
http://en.wikipedia.org/wiki/CFQ
http://en.wikipedia.org/wiki/Anticipatory_scheduling
http://en.wikipedia.org/wiki/Noop_scheduler

http://doc.opensuse.org/products/draft/SLES/SLES-tuning_sd_draft/cha.tuning.io.html

and many others

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: sequential versus random I/O
  2014-01-30  0:45 ` Roberto Spadim
@ 2014-01-30  0:58 ` Roberto Spadim
  2014-01-30  1:03 ` Roberto Spadim
  0 siblings, 1 reply; 15+ messages in thread
From: Roberto Spadim @ 2014-01-30 0:58 UTC (permalink / raw)
To: Matt Garman; +Cc: Mdadm

Hmm, another thing... since you are using NFS, the network matters too --
did you enable jumbo frames?

http://en.wikipedia.org/wiki/Jumbo_frame
https://wiki.archlinux.org/index.php/Jumbo_Frames

Sorry for the many mails, guys; that's the last one. =)

2014-01-29 Roberto Spadim <rspadim@gmail.com>:
> [...]

^ permalink raw reply [flat|nested] 15+ messages in thread
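A quick way to confirm whether jumbo frames are actually in effect on the
NFS-facing interface; the interface name eth0 below is only a placeholder:

    # Print the MTU of the interface carrying the NFS traffic; ~9000 means
    # jumbo frames are enabled, 1500 is the ordinary Ethernet default.
    IFACE = "eth0"  # assumed name -- substitute the real NFS-facing interface
    with open(f"/sys/class/net/{IFACE}/mtu") as f:
        print(IFACE, "MTU:", f.read().strip())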
* Re: sequential versus random I/O
  2014-01-30  0:58 ` Roberto Spadim
@ 2014-01-30  1:03 ` Roberto Spadim
  2014-01-30  1:18 ` Roberto Spadim
  0 siblings, 1 reply; 15+ messages in thread
From: Roberto Spadim @ 2014-01-30 1:03 UTC (permalink / raw)
To: Matt Garman; +Cc: Mdadm

Oops, this one is the last...

http://kernel.dk/blk-mq.pdf
https://lwn.net/Articles/552904/
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=320ae51feed5c2f13664aa05a76bec198967e04d
http://kernelnewbies.org/LinuxChanges#head-d433b7e91267144d5ad63dc96789f97519a73422

:) sorry

2014-01-29 Roberto Spadim <rspadim@gmail.com>:
> [...]

--
Roberto Spadim

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: sequential versus random I/O
  2014-01-30  1:03 ` Roberto Spadim
@ 2014-01-30  1:18 ` Roberto Spadim
  0 siblings, 0 replies; 15+ messages in thread
From: Roberto Spadim @ 2014-01-30 1:18 UTC (permalink / raw)
To: Matt Garman; +Cc: Mdadm

Sorry again -- the last interesting thing (that I remember): compression is
a nice feature. Check these:

https://btrfs.wiki.kernel.org/index.php/Main_Page
  (Btrfs can compress files online; I don't know if it's stable)
http://en.wikipedia.org/wiki/JFS_(file_system)
  (JFS version 1, but I don't know if you will hit bugs)
https://code.google.com/p/fusecompress/
  (FUSE is a userland filesystem layer -- maybe not as fast as an in-kernel
  filesystem, but it does a nice job)
http://squashfs.sourceforge.net/
  (read-only filesystem)

end =]

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: sequential versus random I/O
  2014-01-29 17:23 sequential versus random I/O Matt Garman
  2014-01-30  0:10 ` Adam Goryachev
  2014-01-30  0:41 ` Roberto Spadim
@ 2014-01-30  2:38 ` Stan Hoeppner
  2014-01-30  3:20 ` Matt Garman
  2 siblings, 1 reply; 15+ messages in thread
From: Stan Hoeppner @ 2014-01-30 2:38 UTC (permalink / raw)
To: Matt Garman, Mdadm

On 1/29/2014 11:23 AM, Matt Garman wrote:
...
> In particular, we have a big NFS server that houses a collection of
> large files (average ~400 MB). The server is read-only mounted by
> dozens of compute nodes. Each compute node in turn runs dozens of
> processes that continually re-read those big files. Generally
> speaking, should the NFS server (including RAID subsystem) be tuned
> for sequential I/O or random I/O?
...

If your workflow description is accurate, and assuming you're trying to
fix a bottleneck at the NFS server, the solution to this is simple, and
very well known: local scratch space. Given your workflow description
it's odd that you're not already doing so. Which leads me to believe
that the description isn't entirely accurate. If it is, you simply copy
each file to local scratch disk and iterate over it locally. If you're
using diskless compute nodes then that's an architectural
flaw/oversight, as this workload as described begs for scratch disk.

--
Stan

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: sequential versus random I/O
  2014-01-30  2:38 ` Stan Hoeppner
@ 2014-01-30  3:20 ` Matt Garman
  2014-01-30  4:10 ` Roberto Spadim
  2014-01-30 10:22 ` Stan Hoeppner
  0 siblings, 2 replies; 15+ messages in thread
From: Matt Garman @ 2014-01-30 3:20 UTC (permalink / raw)
To: Stan Hoeppner; +Cc: Mdadm

On Wed, Jan 29, 2014 at 8:38 PM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
> If your workflow description is accurate, and assuming you're trying to
> fix a bottleneck at the NFS server, the solution to this is simple, and
> very well known: local scratch space. Given your workflow description
> it's odd that you're not already doing so. Which leads me to believe
> that the description isn't entirely accurate. If it is, you simply copy
> each file to local scratch disk and iterate over it locally. If you're
> using diskless compute nodes then that's an architectural
> flaw/oversight, as this workload as described begs for scratch disk.

There really is no bottleneck now, but looking into the future, there
will be a bottleneck at the next addition of compute nodes. I've
thought about local caching at the compute node level, but I don't
think it will help. The total collection of big files on the NFS
server is upwards of 20 TB. Processes are distributed randomly across
compute nodes, and any process could access any part of that 20 TB
file collection. (My description may have implied there is a 1-to-1
process-to-file mapping, but that is not the case.) So the local
scratch space would have to be quite big to prevent thrashing. In
other words, unless the local cache was multi-terabyte in size, I'm
quite confident that the local cache would actually degrade
performance due to constant turnover.

Furthermore, let's simplify the workflow: say there is only one
compute server, and its local disk is sufficiently large to hold the
entire data set (assume 20 TB drives exist with performance
characteristics similar to today's spinning drives). In other words,
there is no need for the NFS server now. I believe even in this
scenario, the single local disk would be a bottleneck to the dozens of
programs running on the node... these compute nodes are typically dual
socket, 6 or 8 core systems. The computational part is fast enough on
modern CPUs that the I/O workload can be realistically approximated by
dozens of parallel "dd if=/random/big/file of=/dev/null" processes,
all accessing different files from the collection. In other words,
very much like my contrived example of multiple parallel read
benchmark programs.

FWIW, the current NFS server is from a big iron storage vendor. It's
made up of 96 15k SAS drives. A while ago we were hitting a
bottleneck on the spinning disks, so the vendor was happy to sell us 1
TB of their very expensive SSD cache module. This worked quite well
at reducing spinning disk utilization, and cache module utilization
was quite high. The recent compute node expansion has lowered cache
utilization at the expense of spinning disk utilization... things are
still chugging along acceptably, but we're at capacity. We've maxed
out at just under 3 GB/sec of throughput (that's gigabytes, not bits).

What I'm trying to do is decide if we should continue to pay expensive
maintenance and additional cache upgrades to our current device, or if
I might be better served by a DIY big array of consumer SSDs, ala the
"Dirt Cheap Data Warehouse" [1]. I don't see too many people building
big arrays of consumer-grade SSDs, or even vendors selling pre-made
big SSD based systems. (To be fair, you can buy big SSD arrays, but
with crazy-expensive *enterprise* SSDs... we have effectively a WORM
workload, so don't need the write endurance features of enterprise
SSD. I think that's where the value opportunity comes in for us.)
Anyway, I'm just looking for reasons why taking on such a project
might blow up in my face (assuming I can convince the check-writer to
basically fund a storage R&D project).

[1] http://www.openida.com/the-dirt-cheap-data-warehouse-an-introduction/

^ permalink raw reply [flat|nested] 15+ messages in thread
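The dd-style load described above is easy to approximate with a small
script; a rough sketch of such a load generator (the file paths are passed
on the command line, and the threaded readers are only meant to mimic many
independent streaming consumers):

    # Spawn one reader per file, each streaming the file end to end and
    # discarding the data -- roughly parallel "dd if=BIGFILE of=/dev/null" runs.
    import sys, time
    from concurrent.futures import ThreadPoolExecutor

    CHUNK = 1 << 20  # 1 MiB reads

    def stream(path):
        start, total = time.time(), 0
        with open(path, "rb", buffering=0) as f:
            while True:
                buf = f.read(CHUNK)
                if not buf:
                    break
                total += len(buf)
        secs = time.time() - start
        return path, total / secs / 1e6  # MB/s achieved by this reader

    if __name__ == "__main__":
        files = sys.argv[1:]  # pass the big files to re-read as arguments
        with ThreadPoolExecutor(max_workers=len(files) or 1) as pool:
            for path, mbps in pool.map(stream, files):
                print(f"{path}: {mbps:.1f} MB/s")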
* Re: sequential versus random I/O
  2014-01-30  3:20 ` Matt Garman
@ 2014-01-30  4:10 ` Roberto Spadim
  2014-01-30 10:22 ` Stan Hoeppner
  1 sibling, 0 replies; 15+ messages in thread
From: Roberto Spadim @ 2014-01-30 4:10 UTC (permalink / raw)
To: Matt Garman; +Cc: Stan Hoeppner, Mdadm

Hmm, there's a nice solution for low-cost storage -- it costs less than any
IBM/Dell/HP/etc. storage and has a nice read rate:
http://www.backblaze.com/ (the red chassis with many disks)

Maybe you could get better performance from many SATA disks running RAID1,
RAID0, RAID6 or RAID10 than from many SAS disks, for the same cost... At
least where I live (Brazil) SAS is very expensive, and two SATA disks beat
one SAS disk for enterprise databases with the same random-read-heavy
workload (my tests only). With SATA you buy more space and "lose" read rate
(7200 rpm vs 15000 rpm, about 2x faster), but you get more disk heads (nice
for a RAID1 solution), which could allow more programs to read different
parts of the "logical volume", each head working in one part (OK, you must
test this yourself).

In other words, maybe you could save some money with many SATA disks and
spend it on a nice cache layer: SSDs under bcache/flashcache/dm-cache, or a
RAID card with flash cache. Just an idea... other solutions could be
better.

2014-01-30 Matt Garman <matthew.garman@gmail.com>:
> [...]

--
Roberto Spadim

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: sequential versus random I/O
  2014-01-30  3:20 ` Matt Garman
  2014-01-30  4:10 ` Roberto Spadim
@ 2014-01-30 10:22 ` Stan Hoeppner
  2014-01-30 15:28 ` Matt Garman
  1 sibling, 1 reply; 15+ messages in thread
From: Stan Hoeppner @ 2014-01-30 10:22 UTC (permalink / raw)
To: Matt Garman; +Cc: Mdadm

On 1/29/2014 9:20 PM, Matt Garman wrote:
> [...]
> What I'm trying to do is decide if we should continue to pay expensive
> maintenance and additional cache upgrades to our current device, or if
> I might be better served by a DIY big array of consumer SSDs, ala the
> "Dirt Cheap Data Warehouse" [1].

I wouldn't go used as they do. Not for something this critical.

> I don't see too many people building
> big arrays of consumer-grade SSDs, or even vendors selling pre-made
> big SSD based systems. (To be fair, you can buy big SSD arrays, but
> with crazy-expensive *enterprise* SSDs... we have effectively a WORM
> workload, so don't need the write endurance features of enterprise
> SSD. I think that's where the value opportunity comes in for us.)

I absolutely agree.

> Anyway, I'm just looking for reasons why taking on such a project
> might blow up in my face

If you architect the system correctly, and use decent quality hardware,
it won't blow up on you. If you don't get the OS environment tuned
correctly you'll simply get less throughput than desired. But that can
always be remedied with tweaking.

> (assuming I can convince the check-writer to
> basically fund a storage R&D project).

How big a check? 24x 1TB Samsung SSDs will run you $12,000:
http://www.newegg.com/Product/Product.aspx?Item=N82E16820147251

A suitable server with 48 2.5" SAS bays sans HBAs and NICs will run
$5,225.00:
http://www.rackmountpro.com/products/servers/5u-servers/details/&pnum=YM5U52652&cpu=int

CPU: 2x Intel® Xeon® Processor E5-2630v2 6 core (2.6/3.1 GHz, 80W)
RAM: 8x 8GB DDR3 1600MHz ECC Registered Memory
OSD: 2x 2.5" SSD 120GB SATA III 6Gb/s
NET: Integrated Quad Intel GbE
Optical Drive: 8x Slim Internal DVD-RW
PSU: 1140W R3G5B40V4V 2+1 redundant power supply
OS:  No OS
3 year warranty

3x LSI 9201-16i SAS HBAs: $1,100
http://www.newegg.com/Product/Product.aspx?Item=N82E16816118142

Each of the two backplanes has 24 drive slots and 6 SFF-8087 connectors.
Each 8087 carries 4 SAS channels. You connect two ports of each HBA to
the top backplane and the other two to the bottom backplane. I.e. one
HBA controls the left 16 drives, one controls the middle 16 drives, and
one controls the right 16 drives. Starting with 24 drives in the top
tray, each HBA controls 8 drives. These 3 HBAs are 8 lane PCIe 2.0 and
provide an aggregate peak uni/bi-directional throughput of ~12/24 GB/s.
Samsung 840 EVO raw read throughput is ~0.5GB/s * 24 drives = 12GB/s.
Additional SSDs will not provide much increased throughput, if any, as
the HBAs are pretty well maxed at 24 drives. This doesn't matter as your
network throughput will be much less.

Speaking of network throughput, if you're not using Infiniband but 10GbE,
you'll want to acquire this 6 port 10 GbE NIC. I don't have a price:
http://www.interfacemasters.com/pdf/Niagara_32716.pdf

With proper TX load balancing and the TCP stack well tuned you'll have a
potential peak of ~6GB/s of NFS throughput. They offer dual and quad port
models as well if you want two cards for redundancy:
http://www.interfacemasters.com/products/server-adapters/server-adapters-product-matrix.html#twoj_fragment1-5

Without the cost of NICs you're looking at roughly $19,000 for this
configuration, including shipping costs, for a ~22TB DIY SSD based NFS
server system expandable to 46TB. With two quad port 10GbE NICs and SFPs
you're at less than $25K with the potential for ~6GB/s NFS throughput.

In specifying HBAs instead of RAID controllers I am assuming you'll use
md/RAID. With this many SSDs any current RAID controller would slow you
down anyway as the ASICs aren't fast enough. You'll need minimum
redundancy to guard against an SSD failure, which means RAID5 with SSDs.
Your workload is almost exclusively read heavy, which means you could
simply create a single 24 drive RAID5 or RAID6 with the default 512KB
chunk. I'd go with RAID6. That will yield a stripe width of
22*512KB=11MB. Using RAID5/6 allows you to grow the array incrementally
without the need for LVM, which may slow you down.

Surely you'll use XFS as it's the only Linux filesystem suitable for
such a parallel workload. As you will certainly grow the array in the
future, I'd format XFS without stripe alignment and have it do 4KB IOs.
Stripe alignment won't gain you anything with this workload on SSDs, but
it could cause performance problems after you grow the array, at which
point the XFS stripe alignment will not match the new array geometry.
mkfs.xfs will auto align to the md geometry, so forcing it to use the
default 4KB single FS block alignment will be necessary. I can help you
with this if you indeed go down this path.

The last point I'll make is that it may require some serious tweaking of
IRQ load balancing, md/RAID, NFS, Ethernet bonding driver, etc, to wring
peak throughput out of such a DIY SSD system. Achieving ~1GB/s parallel
NFS throughput from a DIY rig with a single 10GbE port isn't horribly
difficult. 3+GB/s parallel NFS via bonded 10GbE interfaces is a bit
more challenging.

--
Stan

^ permalink raw reply [flat|nested] 15+ messages in thread
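A quick sketch of the arithmetic behind that proposed layout; the drive
count, chunk size, and per-drive throughput are the assumed figures from
the message above, not measurements:

    # Geometry and raw-throughput math for the proposed 24-drive md RAID6 array.
    DRIVES     = 24
    CHUNK_KB   = 512          # md default chunk size
    DRIVE_TB   = 1            # 1 TB consumer SSDs
    DRIVE_GBPS = 0.5          # ~500 MB/s raw read per SSD (assumed)

    data_drives  = DRIVES - 2                      # RAID6 uses two drives for parity
    stripe_width = data_drives * CHUNK_KB / 1024   # MB per full stripe
    capacity     = data_drives * DRIVE_TB          # usable TB
    raw_read     = DRIVES * DRIVE_GBPS             # GB/s off the SSDs

    print(f"usable capacity : {capacity} TB")
    print(f"stripe width    : {stripe_width:.0f} MB ({data_drives} x {CHUNK_KB} KB chunks)")
    print(f"raw read rate   : ~{raw_read:.0f} GB/s (vs ~1 GB/s per 10GbE port)")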
* Re: sequential versus random I/O
  2014-01-30 10:22 ` Stan Hoeppner
@ 2014-01-30 15:28 ` Matt Garman
  2014-02-01 18:28 ` Stan Hoeppner
  0 siblings, 1 reply; 15+ messages in thread
From: Matt Garman @ 2014-01-30 15:28 UTC (permalink / raw)
To: Stan Hoeppner; +Cc: Mdadm

On Thu, Jan 30, 2014 at 4:22 AM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
> I wouldn't go used as they do. Not for something this critical.

No, not for an actual production system. I linked that as "conceptual
inspiration", not as an exact template for what I'd do. Although the
used route might be useful for building a cheap prototype to
demonstrate proof of concept.

> If you architect the system correctly, and use decent quality hardware,
> it won't blow up on you. If you don't get the OS environment tuned
> correctly you'll simply get less throughput than desired. But that can
> always be remedied with tweaking.

Right. I think the general concept is solid, but, as with most things,
"the devil's in the details". FWIW, the creator of the DCDW enumerated
some of the "gotchas" for a build like this [1]. He went into more
detail in some private correspondence with me. It's a little alarming
that he got roughly 50% of the performance with a tuned Linux setup
compared to a mostly out-of-the-box Solaris install. Also, subtle
latency issues with PCIe timings across different motherboards sound
like a migraine-caliber headache.

> Each of the two backplanes has 24 drive slots and 6 SFF-8087 connectors.
> Each 8087 carries 4 SAS channels. You connect two ports of each HBA to
> the top backplane and the other two to the bottom backplane. I.e. one
> [...]

Your concept is similar to what I've sketched out in my mind. My twist
is that I think I would actually build multiple servers, each one would
be a 24-disk 2U system. Our data is fairly easy to partition across
multiple servers. Also, we already have a big "symlink index" directory
that abstracts the actual location of the files. IOW, my users don't
know/don't care where the files actually live, as long as the symlinks
are there and not broken.

> Without the cost of NICs you're looking at roughly $19,000 for this
> configuration, including shipping costs, for a ~22TB DIY SSD based NFS
> server system expandable to 46TB. With two quad port 10GbE NICs and
> SFPs you're at less than $25K with the potential for ~6GB/s NFS throughput.

Yup, and this amount is less than one year's maintenance on the big
iron system we have in place. And, quoting the vendor, "Maintenance
costs only go up."

> In specifying HBAs instead of RAID controllers I am assuming you'll use
> md/RAID. With this many SSDs any current RAID controller would slow you
> down anyway as the ASICs aren't fast enough. You'll need minimum
> redundancy to guard against an SSD failure, which means RAID5 with SSDs.
> Your workload is almost exclusively read heavy, which means you could
> simply create a single 24 drive RAID5 or RAID6 with the default 512KB
> chunk. I'd go with RAID6. That will yield a stripe width of
> 22*512KB=11MB. Using RAID5/6 allows you to grow the array incrementally
> without the need for LVM, which may slow you down.

At the expense of storage capacity, I had in mind RAID10 with 3-way
mirrors. We do have backups, but downtime on this system won't be taken
lightly.

> Surely you'll use XFS as it's the only Linux filesystem suitable for
> such a parallel workload. As you will certainly grow the array in the
> future, I'd format XFS without stripe alignment and have it do 4KB IOs.
> [...]

I was definitely thinking XFS. But one other motivation for multiple 2U
systems (instead of one massive system) is that it's more modular.
Existing systems never have to be grown or reconfigured. When we need
more space/throughput, I just throw another system in place. I might
have to re-distribute the data, but this would be a very rare (maybe
once/year) event.

If I get the green light to do this, I'd actually test a few
configurations. But some that come to mind:
  - raid10,f3
  - groups of 3-way raid1 mirrors striped together with XFS
  - groups of raid6 sets not striped together (our symlink index I
    mentioned above makes this not as messy as it sounds)

> The last point I'll make is that it may require some serious tweaking of
> IRQ load balancing, md/RAID, NFS, Ethernet bonding driver, etc, to wring
> peak throughput out of such a DIY SSD system. Achieving ~1GB/s parallel
> NFS throughput from a DIY rig with a single 10GbE port isn't horribly
> difficult. 3+GB/s parallel NFS via bonded 10GbE interfaces is a bit
> more challenging.

I agree, I think that comes back around to what we said above: the
concept is simple, but the details mean the difference between brilliant
and mediocre.

Thanks for your input Stan, I appreciate it. I'm an infrequent poster
to this list, but a long-time reader, and I've learned a lot from your
posts over the years.

[1] http://forums.servethehome.com/diy-server-builds/2894-utterly-absurd-quad-xeon-e5-supermicro-server-48-ssd-drives.html

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: sequential versus random I/O
  2014-01-30 15:28 ` Matt Garman
@ 2014-02-01 18:28 ` Stan Hoeppner
  2014-02-03 19:28 ` Matt Garman
  0 siblings, 1 reply; 15+ messages in thread
From: Stan Hoeppner @ 2014-02-01 18:28 UTC (permalink / raw)
To: Matt Garman; +Cc: Mdadm

On 1/30/2014 9:28 AM, Matt Garman wrote:
> Right. I think the general concept is solid, but, as with most things,
> "the devil's in the details".

Always.

> FWIW, the creator of the DCDW enumerated some of the "gotchas" for a
> build like this [1]. [...] It's a little alarming that he got roughly
> 50% of the performance with a tuned Linux setup compared to a mostly
> out-of-the-box Solaris install.

Most x86-64 Linux distro kernels are built to perform on servers,
desktops, and laptops, thus performance on each is somewhat compromised.
Solaris x86-64 is built primarily for server duty, and tuned for that out
of the box. So what you state above isn't too surprising.

> Also, subtle latency issues with PCIe timings across different
> motherboards sound like a migraine-caliber headache.

This is an issue of board design and Q.A., specifically trace routing and
resulting signal skew, that the buyer can't do anything about. And
unfortunately this kind of information just isn't "out there" in reviews
and whatnot when you buy boards. The best one can do is buy a reputable
brand and cross fingers.

> Your concept is similar to what I've sketched out in my mind. My twist
> is that I think I would actually build multiple servers, each one would
> be a 24-disk 2U system. [...]

That makes tuning each box much easier if you go with a single 10GbE
port. But this has some downsides I'll address below.

> Yup, and this amount is less than one year's maintenance on the big
> iron system we have in place. And, quoting the vendor, "Maintenance
> costs only go up."

Yes, it's sad. "Maintenance Contract" = mostly pure profit. This is the
Best Buy extended warranty of the big iron marketplace. You pay a ton of
money and get very little, if anything, in return.

> At the expense of storage capacity, I had in mind RAID10 with 3-way
> mirrors. We do have backups, but downtime on this system won't be
> taken lightly.

I was in lock step with you until this point. We're talking about SSDs,
aren't we? And a read-only workload? RAID10 today is only for
transactional workloads on rust, to avoid RMW. SSD doesn't suffer RMW
latency. And this isn't a transactional workload, but parallel linear
read. Three-way mirroring within a RAID10 setup is used strictly to avoid
losing the 2nd disk in a mirror while its partner is rebuilding in a
standard RAID10. This is suitable when using large rusty drives where
rebuild times are 8+ hours. With a RAID10 triple mirror setup 2/3rds of
your capacity is wasted. This isn't a sane architecture for SSDs and a
read-only workload. Here's why. Under optimal conditions:

  a 4TB 7.2K SAS/SATA mirror rebuild takes 4TB / 130MB/s = ~8.5 hours
  a 1TB Sammy 840 EVO mirror rebuild takes 1TB / 500MB/s = ~34 minutes

A RAID6 rebuild will take a little longer, but still much less than an
hour, say 45 minutes max. With RAID6 you would have to sustain *two*
additional drive failures within that 45 minute rebuild window to lose
the array. Only an HBA, backplane, or PSU failure could take down two
more drives in 45 minutes, and if that happens you're losing many drives,
probably all of them, and you're sunk anyway. No matter how you slice it,
I can't see RAID10 being of any benefit here, and especially not 3-way
mirror RAID10.

If one of your concerns is decreased client throughput during rebuild,
then simply turn down the rebuild priority to 50%. Your rebuild will take
1.5 hours, in which you'd have to lose 2 additional drives to lose the
array, and you'll still have more client throughput at the array than the
network interface can push:

  (22 * 500MB/s = 11GB/s) / 2 = 5.5GB/s client B/W during rebuild
  10GbE interface B/W         = 1.0GB/s max

Using RAID10 yields no gain but increases cost. Using RAID10 with 3
mirrors is simply 3 times the cost and 2/3rds wasted capacity. Any form
of mirroring just isn't suitable for this type of SSD system.
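A compact restatement of that arithmetic; the drive sizes and rates are
the assumed figures above, not measurements, and small differences from
the quoted numbers are just rounding:

    # Rebuild windows and client bandwidth during a throttled rebuild.
    def hours(size_gb, rate_mb_s):
        return size_gb * 1000 / rate_mb_s / 3600

    print(f"4 TB 7.2k mirror rebuild  : ~{hours(4000, 130):.1f} h")
    print(f"1 TB SSD mirror rebuild   : ~{hours(1000, 500) * 60:.0f} min")

    ssd_read_gbs = 22 * 0.5   # data drives x per-SSD read rate (GB/s)
    print(f"client B/W at 50% rebuild : ~{ssd_read_gbs / 2:.1f} GB/s "
          f"(vs ~1 GB/s for one 10GbE port)")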
>> Surely you'll use XFS as it's the only Linux filesystem suitable for
>> such a parallel workload. As you will certainly grow the array in the
>> future, I'd format XFS without stripe alignment and have it do 4KB IOs.
>> [...]
>
> I was definitely thinking XFS. But one other motivation for multiple
> 2U systems (instead of one massive system) is that it's more modular.

The modular approach has advantages. But keep in mind that modularity
increases complexity and component count, which increase the probability
of a failure. The more vehicles you own, the more often one of them is in
the shop at any given time, even if only for an oil change.

> Existing systems never have to be grown or reconfigured. When we need
> more space/throughput, I just throw another system in place. I might
> have to re-distribute the data, but this would be a very rare (maybe
> once/year) event.

Gluster has advantages here as it can redistribute data automatically
among the storage nodes. If you do distributed mirroring you can take a
node completely offline for maintenance, and clients won't skip a beat,
or at worst a short beat. It costs half your storage for the mirroring,
but using RAID6 it's still ~33% less than the RAID10 with 3-way mirrors.

> If I get the green light to do this, I'd actually test a few
> configurations. But some that come to mind:
> - raid10,f3

Skip it. RAID10 is a no-go. And none of the alternate layouts will
provide any benefit, because SSDs are not spinning rust. The alternate
layouts exist strictly to reduce rotational latency.

> - groups of 3-way raid1 mirrors striped together with XFS

I covered this above. Skip it. And you're thinking of XFS over
concatenated mirror sets here. This architecture is used only for high
IOPS transactional workloads on rust. It won't gain you anything with
SSDs.

> - groups of raid6 sets not striped together (our symlink index I
>   mentioned above makes this not as messy as it sounds)

If you're going with multiple identical 24 bay nodes, you want a single
24 drive md/RAID6 in each, directly formatted with XFS. Or Gluster atop
XFS. It's the best approach for your read-only workload with large files.

>> The last point I'll make is that it may require some serious tweaking
>> of IRQ load balancing, md/RAID, NFS, Ethernet bonding driver, etc, to
>> wring peak throughput out of such a DIY SSD system. Achieving ~1GB/s
>> parallel NFS throughput from a DIY rig with a single 10GbE port isn't
>> horribly difficult. 3+GB/s parallel NFS via bonded 10GbE interfaces is
>> a bit more challenging.
>
> I agree, I think that comes back around to what we said above: the
> concept is simple, but the details mean the difference between
> brilliant and mediocre.

The details definitely become a bit easier with one array and one NIC per
node. But one thing really bothers me about such a setup. You have
~11GB/s of read throughput with 22 SSDs (24 drive RAID6). It doesn't make
sense to waste ~10GB/s of SSD throughput by using a single 10GbE
interface. At the very least you should be using 4x 10GbE ports per box
to achieve potentially 3+ GB/s. I think what's happening here is that
you're saving so much money compared to the proprietary NAS filer that
you're intoxicated by the savings. You're throwing money around on SSDs
like a drunken sailor on 6 month leave at a strip club :) without fully
understanding the implications, and the capability, of what you're buying
and putting in each box.

> Thanks for your input Stan, I appreciate it. I'm an infrequent poster
> to this list, but a long-time reader, and I've learned a lot from your
> posts over the years.

Glad someone actually reads my drivel on occasion. :)

I'm firmly an AMD guy. I used the YMI 48 bay Intel server in my previous
example for expediency, and to avoid what I'm doing here now. Please
allow me to indulge you with a complete parts list for one fully DIY NFS
server node build. I've matched and verified compatibility of all of the
components, using manufacturer specs, down to the iPASS/SGPIO SAS cables.
Combined with the LSI HBAs and this SM backplane, these sideband
signaling SAS cables should enable you to make drive failure LEDs work
with mdadm, using:
http://sourceforge.net/projects/ledmon/

I've not tried the software myself, but if it's up to par, dead drive
identification should work the same as with any vendor storage array,
which to this point has been nearly impossible with md arrays using
plain non-RAID HBAs. Preemptively flashing the mobo and SAS HBAs with the
latest firmware image should prevent any issues with the hardware. These
products have "shelf" firmware which is often quite a few revs old by the
time the customer receives product.

All but one of the necessary parts are stocked by NewEgg, believe it or
not. The build consists of a 24 bay 2U SuperMicro chassis with 920W dual
hot-swap PSUs, a SuperMicro dual socket C32 mobo w/5 PCIe 2.0 x8 slots,
2x Opteron 4334 3.1GHz 6 core CPUs, 2x Dynatron C32/1207 2U CPU coolers,
8x Kingston 4GB ECC registered DDR3-1333 single rank DIMMs, 3x LSI
9207-8i PCIe 3.0 x8 SAS HBAs, a rear 2 drive HS cage, 2x Samsung 120GB
boot SSDs, 24x Samsung 1TB data SSDs, 6x 2ft LSI SFF-8087 sideband
cables, and two dual port Intel 10GbE NICs sans SFPs, as you probably
already have some spares. You may prefer another NIC brand/model; these
are <$800 of the total.

1x  http://www.newegg.com/Product/Product.aspx?Item=N82E16811152565
1x  http://www.newegg.com/Product/Product.aspx?Item=N82E16813182320
2x  http://www.newegg.com/Product/Product.aspx?Item=N82E16819113321
2x  http://www.newegg.com/Product/Product.aspx?Item=N82E16835114139
8x  http://www.newegg.com/Product/Product.aspx?Item=N82E16820239618
3x  http://www.newegg.com/Product/Product.aspx?Item=N82E16816118182
24x http://www.newegg.com/Product/Product.aspx?Item=N82E16820147251
2x  http://www.newegg.com/Product/Product.aspx?Item=N82E16820147247
6x  http://www.newegg.com/Product/Product.aspx?Item=N82E16812652015
2x  http://www.newegg.com/Product/Product.aspx?Item=N82E16833106044
1x  http://www.costcentral.com/proddetail/Supermicro_Storage_drive_cage/MCP220826090N/11744345/

Total cost today: $16,927.23
SSD cost:         $13,119.98

Note all SSDs are direct connected to the HBAs. This system doesn't
suffer any disk bandwidth starvation due to SAS expanders as with most
storage arrays. As such you get nearly full bandwidth per drive, being
limited only by the NorthBridge to CPU HT link. At the hardware level the
system bandwidth breakdown is as follows:

  Memory:        42.6 GB/s
  PCIe to CPU:   10.4 GB/s unidirectional x2
  HBA to PCIe:   12   GB/s unidirectional x2
  SSD to HBA:    12   GB/s unidirectional x2
  PCIe to NIC:    8   GB/s unidirectional x2
  NIC to client:  4   GB/s unidirectional x2

Your HBA traffic will flow on the HT uplink and your NIC traffic on the
downlink, so you are not constrained here with this NFS read-only
workload. Assuming an 8:1 bandwidth ratio between file bytes requested
and memory bandwidth consumed by the kernel -- in the form of DMA xfers
to/from buffer cache, memory-memory copies for TCP and NFS, and hardware
overhead in the form of coherency traffic between the CPUs and HBAs,
interrupts in the form of MSI-X writes to memory, etc -- then 4GB/s of
requested data generates ~32GB/s at the memory controllers before it is
transmitted over the wire. Beyond tweaking parameters, it may require
building a custom kernel to achieve this throughput. But the hardware is
capable.
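A sketch of that back-of-the-envelope memory budget, treating the 8:1
amplification and the bandwidth figures above as stated assumptions:

    # Estimated memory-controller traffic generated per byte of NFS payload.
    NFS_PAYLOAD_GBS = 4.0    # target wire throughput (4x 10GbE)
    AMPLIFICATION   = 8      # assumed memory bytes moved per payload byte
    MEMORY_BW_GBS   = 42.6   # 8 channels of DDR3-1333, as listed above

    mem_traffic = NFS_PAYLOAD_GBS * AMPLIFICATION
    print(f"memory traffic for {NFS_PAYLOAD_GBS:.0f} GB/s of NFS reads: ~{mem_traffic:.0f} GB/s")
    print(f"headroom vs {MEMORY_BW_GBS} GB/s of memory bandwidth: "
          f"{MEMORY_BW_GBS - mem_traffic:.1f} GB/s")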
Using a single 10GbE interface yields 1/10th of the SSD b/w to clients.
This is a huge waste of the $$ spent on the SSDs. Using 4 will come close
to maxing out the rest of the hardware, so I spec'd 4 ports. With the
correct bonding setup you should be able to get between 3-4GB/s. Still
only 1/4th - 1/3rd of the SSD throughput.

To get close to taking near full advantage of the 12GB/s read bandwidth
offered by these 24 SSDs requires a box with dual Socket G34 processors
to get 8 DDR3-1333 memory channels--85GB/s--two SR5690 PCIe to HT
controllers, and 8x 10GbE ports (or 2x QDR Infiniband 4x).

Notice I didn't discuss CPU frequency or core count anywhere? That's
because it's not a factor. The critical factor is memory bandwidth. Any
single/dual Opteron 4xxx/6xxx system with ~8 or more cores should do the
job as long as IRQs are balanced across cores.

Hope you at least found this an interesting read, if not actionable.
Maybe others will as well. I had some fun putting this one together. I
think the only things I omitted were Velcro straps and self-stick lock
loops for tidying up the cables for optimum airflow. Experienced builders
usually have these on hand, but I figured I'd mention them just in case.

Locating certified DIMMs in the clock speed and rank required took too
much time, but this was not unforeseen. The only easy way to spec memory
for server boards and allow max expansion is to go with the lowest clock
speed. If I'd done that here you'd lose 17GB/s of memory bandwidth, a 40%
reduction. I also wanted to use a single socket G34 board, but
unfortunately nobody makes one with more than 3 PCIe slots. This design
required at least 4.

--
Stan

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: sequential versus random I/O
  2014-02-01 18:28             ` Stan Hoeppner
@ 2014-02-03 19:28               ` Matt Garman
  2014-02-04 15:16                 ` Stan Hoeppner
  0 siblings, 1 reply; 15+ messages in thread
From: Matt Garman @ 2014-02-03 19:28 UTC (permalink / raw)
To: Stan Hoeppner; +Cc: Mdadm

On Sat, Feb 1, 2014 at 12:28 PM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
> I was in lock step with you until this point.  We're talking about SSDs
> aren't we?  And a read-only workload?  RAID10 today is only for
> transactional workloads on rust to avoid RMW.  SSD doesn't suffer RMW
> ...

OK, I think I'm convinced, raid10 isn't appropriate here.  (If I get the green light for this, I might still do it in the buildup/experimentation stage, just for kicks and grins if nothing else.)

So, just to be clear, you've implied that if I have an N-disk raid6, then the (theoretical) sequential read throughput is

    (N-2) * T

where T is the throughput of a single drive (assuming uniform drives).  Is this correct?

> If one of your concerns is decreased client throughput during rebuild,
> then simply turn down the rebuild priority to 50%.  Your rebuild will

The main concern was "high availability".  This isn't like my home server, where I use raid as an excuse to de-prioritize my backups. :)  This is raid for its actual designed purpose: to minimize service interruptions in case of failure(s).

The thing is, I think consumer SSDs are still somewhat of an unknown entity in terms of reliability, longevity, and failure modes.  Just from the SSDs I've dealt with at home (tiny sample size), I've had two fail the "bad way": that is, they die and are no longer recognizable by the system (neither OS nor BIOS).  Presumably a failure of the SSD's controller.  And with spinning rust, we have decades of experience and useful public information like Google's HDD study and Backblaze's blog.  SSDs just haven't been out in the wild long enough to have a big enough sample size to do similar studies.

Those two SSDs I had die just abruptly went out, without any kind of advance warning.  (To be fair, these were first-gen, discount, consumer SSDs.)  Certainly, traditional spinning drives can also die in this way, but with regular SMART monitoring and such, we (in theory) have some useful means to predict impending death.  I'm not sure the SMART monitoring on SSDs is up to par with their rusty counterparts.

> The modular approach has advantages.  But keep in mind that modularity
> increases complexity and component count, which increase the probability
> of a failure.  The more vehicles you own the more often one of them is
> in the shop at any given time, if even only for an oil change.

Good point.  Although if I have more cars than I actually need (redundancy), I can afford to always have a car in the shop. ;)

> Gluster has advantages here as it can redistribute data automatically
> among the storage nodes.  If you do distributed mirroring you can take a
> node completely offline for maintenance, and clients won't skip a beat,
> or at worst a short beat.  It costs half your storage for the mirroring,
> but using RAID6 it's still ~33% less than the RAID10 w/3 way mirrors.
> ...
> If you're going with multiple identical 24 bay nodes, you want a single
> 24 drive md/RAID6 in each directly formatted with XFS.  Or Gluster atop
> XFS.  It's the best approach for your read only workload with large files.
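[For reference, a minimal sketch of what the quoted 24-drive md/RAID6 + XFS layout and the rebuild throttling could look like; device names, chunk size, spare count, and mount point are illustrative assumptions, not values from the thread.]

    # 24 SSDs: 22 active RAID6 members plus 2 hot spares.
    mdadm --create /dev/md0 --level=6 --chunk=128 \
          --raid-devices=22 --spare-devices=2 /dev/sd[b-y]

    # XFS aligned to the array geometry: su = chunk size,
    # sw = number of data disks (22 members - 2 parity).
    mkfs.xfs -d su=128k,sw=20 /dev/md0
    mount -o noatime,inode64 /dev/md0 /export/data

    # Cap resync/rebuild speed so client NFS traffic isn't starved
    # during a rebuild; units are KB/s per device.
    sysctl -w dev.raid.speed_limit_max=200000
    sysctl -w dev.raid.speed_limit_min=50000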
Now that you've convinced me RAID6 is the way to go: if I can get 3 GB/s out of one of these systems, then two of these systems would literally double the capability (storage capacity and throughput) of our current big iron system.  What would be ideal is to use something like Gluster to add a third system for redundancy, and have a "raid 5" at the server level.  I.e., the same storage capacity as two systems, but one whole node could go down without losing service availability.  I have no experience with cluster filesystems, however, so this presents another risk vector.

> I'm firmly an AMD guy.

Any reason for that?  That's an honest question, not a veiled argument.

Do the latest AMD server chips include the PCIe controller on-chip like the Sandy Bridge and newer Intel chips?  Or does AMD still put the PCIe controller on a separate chip (a northbridge)?

Just wondering if having dual on-CPU-die PCIe controllers is an advantage here (assuming a dual-socket system).  I agree with you, CPU core count and clock aren't terribly important; it's all about being able to extract maximum I/O from basically every other component in the system.

> sideband signaling SAS cables should enable you to make drive failure
> LEDs work with mdadm, using:
> http://sourceforge.net/projects/ledmon/
>
> I've not tried the software myself, but if it's up to par, dead drive
> identification should work the same as with any vendor storage array,
> which to this point has been nearly impossible with md arrays using
> plain non-RAID HBAs.

Ha, that's nice.  In my home server, which is idle 99% of the time, I've identified drives by simply doing a "dd if=/dev/target/drive of=/dev/null" and looking for the drive that lights up.  Although I've noticed some drives (Samsung) don't even light up when I do that.  I could do this in reverse on a system that's 99% busy: just offline the target drive, and look for the one light that's NOT lit.  Failing that, I had planned to use the old school paper-and-pencil method of just keeping good notes of which drive (identified by serial number) is in which bay.

> All but one of the necessary parts are stocked by NewEgg believe it or
> not.  The build consists of a 24 bay 2U SuperMicro 920W dual HS PSU,
> SuperMicro dual socket C32 mobo w/5 PCIe 2.0 x8 slots, 2x Opteron 4334
> ...

Thanks for that.  I integrated these into my planning spreadsheet, which incidentally already had 75% of what you spec'ed out.  The main difference is I spec'ed out an Intel-based system, and you used AMD.  Big cost savings by going with AMD, however!

> Total cost today: $16,927.23
> SSD cost: $13,119.98

Looks like you're using the $550 sale price for those 1TB Samsung SSDs.  Normal price is $600.  Newegg usually has a limit of 5 (IIRC) on sale-priced drives.

> maxing out the rest of the hardware so I spec'd 4 ports.  With the
> correct bonding setup you should be able to get between 3-4GB/s.  Still
> only 1/4th - 1/3rd the SSD throughput.

Right.  I might start with just a single dual-port 10gig NIC, and see if I can saturate that.  Let's be pessimistic, and assume I can only wrangle 250 MB/sec out of each SSD.  I'll also designate two hot spares, leaving a 22-drive raid6.  So that's: 250 MB/s * 20 = 5 GB/s.  Now that's not so far away from the 4 GB/sec theoretical with 4x 10gig NICs.

> Hope you at least found this an interesting read, if not actionable.
> Maybe others will as well.  I had some fun putting this one together.  I

Absolutely interesting, thanks again for all the detailed feedback.
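[On the "see if I can saturate that" point, a hedged sketch of how one might emulate the many-parallel-sequential-readers workload with fio; the directory, file size, and job count are assumptions chosen to mimic dozens of processes re-reading ~400MB files, not figures from the thread.]

    # Emulate 48 processes, each sequentially re-reading its own ~400MB
    # file.  Run directly against the md/XFS filesystem to measure the
    # array, or from an NFS client to include the network path.
    fio --name=parallel-seq-readers \
        --directory=/export/data/fio \
        --rw=read --bs=1M --size=400m \
        --numjobs=48 --loops=10 \
        --direct=1 --group_reporting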
> think the only things I omitted were Velcro straps and self stick lock
> loops for tidying up the cables for optimum airflow.  Experienced
> builders usually have these on hand, but I figured I'd mention them just
> in case.

Of course, but why can't I ever find them when I actually need them? :)

Anyway, thanks again for your feedback.  The first roadblock is definitely getting manager buy-in.  He tends to dismiss projects like this because (1) we're not a storage company / we don't DIY servers, (2) why isn't anyone else doing this / why can't you buy an OTS system like this, and (3) even though the cost savings are dramatic, it's still a ~$20k risk: what if I can't get even 50% of the theoretical throughput?  What if those SSDs require constant replacement?  What if there are subtle kernel- or driver-level bugs in "landmine" status, waiting for something like this to expose them?

-Matt

^ permalink raw reply	[flat|nested] 15+ messages in thread
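[On the throughput-risk concern, the first knobs one would typically try on a Linux NFS server before anything exotic are sketched below; the thread count, buffer sizes, and client mount line are illustrative assumptions, not recommendations from the thread.]

    # Server side: more nfsd threads so dozens of clients times dozens
    # of processes aren't serialized behind the default 8 threads.
    rpc.nfsd 64

    # Larger socket buffer ceilings help sustain many parallel TCP
    # streams on 10GbE.
    sysctl -w net.core.rmem_max=16777216
    sysctl -w net.core.wmem_max=16777216

    # Client side (one line per compute node): big read size, TCP,
    # read-only mount.
    # mount -t nfs -o ro,tcp,vers=3,rsize=1048576 server:/export/data /data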
* Re: sequential versus random I/O
  2014-02-03 19:28               ` Matt Garman
@ 2014-02-04 15:16                 ` Stan Hoeppner
  0 siblings, 0 replies; 15+ messages in thread
From: Stan Hoeppner @ 2014-02-04 15:16 UTC (permalink / raw)
To: Matt Garman; +Cc: Mdadm

On 2/3/2014 1:28 PM, Matt Garman wrote:
> On Sat, Feb 1, 2014 at 12:28 PM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
>> I was in lock step with you until this point.  We're talking about SSDs
>> aren't we?  And a read-only workload?  RAID10 today is only for
>> transactional workloads on rust to avoid RMW.  SSD doesn't suffer RMW
>> ...
>
> OK, I think I'm convinced, raid10 isn't appropriate here.  (If I get
> the green light for this, I might still do it in the
> buildup/experimentation stage, just for kicks and grins if nothing
> else.)
>
> So, just to be clear, you've implied that if I have an N-disk raid6,
> then the (theoretical) sequential read throughput is
>     (N-2) * T
> where T is the throughput of a single drive (assuming uniform drives).
> Is this correct?

Should be pretty close to that for parallel streaming reads.

>> If one of your concerns is decreased client throughput during rebuild,
>> then simply turn down the rebuild priority to 50%.  Your rebuild will
>
> The main concern was "high availability".  This isn't like my home
> server, where I use raid as an excuse to de-prioritize my backups. :)
> This is raid for its actual designed purpose: to minimize service
> interruptions in case of failure(s).

The major problem with rust based RAID5/6 arrays is the big throughput hit you take during a rebuild.  Concurrent access causes massive head seeking, slowing everything down, both user IO and rebuild.  This proposed SSD rig has disk throughput that is 4-8x the network throughput, and there are no heads to seek, thus no increased latency nor reduced bandwidth.  You should be able to dial down the rebuild rate by as little as 25% and the NFS throughput shouldn't vary from normal state.  This is the definition of high availability--failures don't affect function or performance.

> The thing is, I think consumer SSDs are still somewhat of an unknown
> entity in terms of reliability, longevity, and failure modes.  Just
> from the SSDs I've dealt with at home (tiny sample size), I've had two
> fail the "bad way": that is, they die and are no longer recognizable
> by the system (neither OS nor BIOS).  Presumably a failure of the SSD's
> controller.

I had one die like that in 2011, after 4 months: a Corsair V32, 1st gen Indilinx drive.

> And with spinning rust, we have decades of experience and
> useful public information like Google's HDD study and Backblaze's
> blog.  SSDs just haven't been out in the wild long enough to have a
> big enough sample size to do similar studies.

As is the case with all new technologies.  Hybrid technology is much newer still, but will probably start being adopted at a much faster pace than pure SSD for most applications.  Speaking of SSHDs, I should have mentioned them sooner, because they're actually a perfect fit for your workload, as you reread the same ~400MB files repeatedly.  Have you considered hybrid SSHD drives?  These Seagate 1TB 2.5" drives have an 8GB SSD cache:

http://www.newegg.com/Product/Product.aspx?Item=N82E16822178340

24 of these yields the same capacity as the pure SSD solution, but at *1/6th* the price per drive, ~$2600 for 24 drives vs ~$15,500.  You'd have an aggregate 192GB of SSD cache per server node and close to 1GB/s of network throughput even when hitting platters instead of cache.
So a single 10GbE connection would be a good fit, and no bonding headaches.  The drives drop into the same chassis.  You'll save $10,000 per chassis.  In essence you'd be duplicating the NetApp's disk + SSD cache setup, but inside each drive.  I worked up the totals; see down below.

...

>> The modular approach has advantages.  But keep in mind that modularity
>> increases complexity and component count, which increase the probability
>> of a failure.  The more vehicles you own the more often one of them is
>> in the shop at any given time, if even only for an oil change.
>
> Good point.  Although if I have more cars than I actually need
> (redundancy), I can afford to always have a car in the shop. ;)

But it requires two vehicles and two people to get the car to the shop and get you back home.  This is the point I was making.  The more complex the infrastructure, the more time/effort required for maintenance.

>> Gluster has advantages here as it can redistribute data automatically
>> among the storage nodes.  If you do distributed mirroring you can take a
>> node completely offline for maintenance, and clients won't skip a beat,
>> or at worst a short beat.  It costs half your storage for the mirroring,
>> but using RAID6 it's still ~33% less than the RAID10 w/3 way mirrors.
>> ...
>> If you're going with multiple identical 24 bay nodes, you want a single
>> 24 drive md/RAID6 in each directly formatted with XFS.  Or Gluster atop
>> XFS.  It's the best approach for your read only workload with large files.
>
> Now that you've convinced me RAID6 is the way to go, and if I can get
> 3 GB/s out of one of these systems, then two of these systems would
> literally double the capability (storage capacity and throughput) of
> our current big iron system.

The challenge will be getting 3GB/s.  You may spend weeks, maybe months, in testing and development work to achieve it.  I can't say, as I've never tried this.  Getting close to 1GB/s from one interface is much easier.  This fact, and cost, make the SSHD solution much, much more attractive.

> What would be ideal is to use something
> like Gluster to add a third system for redundancy, and have a "raid 5"
> at the server level.  I.e., same storage capacity of two systems, but
> one whole node could go down without losing service availability.  I
> have no experience with cluster filesystems, however, so this presents
> another risk vector.

Read up on Gluster and its replication capabilities.  I say "DFS" because Gluster is a distributed filesystem; a cluster filesystem, or "CFS", is a completely different technology.

>> I'm firmly an AMD guy.
>
> Any reason for that?

We've seen ample examples in the US of what happens with a monopolist.  Prices increase and innovation decreases.  If AMD goes bankrupt or simply exits the desktop/server x86 CPU market, then Chipzilla has a monopoly on x86 desktop/server CPUs.  They nearly do now simply based on market share.  AMD still makes plenty of capable CPUs, chipsets, etc, and at a lower cost.  Intel chips may have superior performance at the moment, but AMD was superior for half a decade.  As long as AMD has a remotely competitive offering I'll support them with my business.  I don't want to be at the mercy of a monopolist.

> Do the latest AMD server chips include the PCIe controller on-chip
> like the Sandy Bridge and newer Intel chips?  Or does AMD still put
> the PCIe controller on a separate chip (a northbridge)?
>
> Just wondering if having dual on-CPU-die PCIe controllers is an
> advantage here (assuming a dual-socket system).
> I agree with you, CPU core count and clock aren't terribly important,
> it's all about being able to extract maximum I/O from basically every
> other component in the system.

Adding PCIe interfaces to the CPU die eliminates the need for an IO support chip, simplifying board design and testing, and freeing up board real estate.  This is good for large NUMA systems, such as SGI's Altix UV, which contain dozens or hundreds of CPU boards.  It does not increase PCIe channel throughput, though it does lower latency by a few nanoseconds.  There may be a small but noticeable gain here for HPC applications sending MPI messages over PCIe Infiniband HCAs, but not for any other device connected via PCIe.  Storage IO is typically not latency bound and is always pipelined, so latency is largely irrelevant.

>> sideband signaling SAS cables should enable you to make drive failure
>> LEDs work with mdadm, using:
>> http://sourceforge.net/projects/ledmon/
>>
>> I've not tried the software myself, but if it's up to par, dead drive
>> identification should work the same as with any vendor storage array,
>> which to this point has been nearly impossible with md arrays using
>> plain non-RAID HBAs.
>
> Ha, that's nice.  In my home server, which is idle 99% of the time,
> I've identified drives by simply doing a "dd if=/dev/target/drive
> of=/dev/null" and looking for the drive that lights up.  Although,
> I've noticed some drives (Samsung) don't even light up when I do that.

It's always good to have a fallback position.  This is another thing you have to integrate yourself.  Part of the "DIY" thing.

...

>> All but one of the necessary parts are stocked by NewEgg believe it or
>> not.  The build consists of a 24 bay 2U SuperMicro 920W dual HS PSU,
>> SuperMicro dual socket C32 mobo w/5 PCIe 2.0 x8 slots, 2x Opteron 4334
>> ...
>
> Thanks for that.  I integrated these into my planning spreadsheet,
> which incidentally already had 75% of what you spec'ed out.  The main
> difference is I spec'ed out an Intel-based system, and you used AMD.
> Big cost savings by going with AMD, however!
>
>> Total cost today: $16,927.23

Corrected total:  $19,384.10

>> SSD cost: $13,119.98

Corrected SSD cost:  $15,927.34
SSHD system:         $ 6,238.50
Savings:             $13,145.60

Specs same as before, but with one dual port 10GbE NIC and 26x Seagate 1TB 2.5" SSHDs displacing the Samsung SSDs.  These drives target the laptop market.  As such they are built for vibration and should fare well in a multi-drive chassis.

$6,300 may be more palatable to the boss for an experimental development system.  It shouldn't be difficult to reach maximum potential throughput of the 10GbE interface with a little tweaking.  Your time to proof of concept should be minimal.  Once proven, you could put it into limited production with a subset of the data to see how the drives stand up to continuous use.  If it holds up for a month, purchase components for another 4 units for ~$25,000.  Put 3 nodes into production for 4 total, and keep the other set of parts as spares for the 4 production units, since consumer parts availability is volatile, even on a 6 month time scale.  You'll have ~$32,000 in the total system.

Once you've racked the 3 systems and burned them in, install and configure Gluster and load your datasets.  By then you'll know Gluster well: how to spread data for load balancing, configure fault tolerance, etc.  You'll have the cheap node concept you originally mentioned.  You should be able to get close to 4GB/s out of the 4 node farm, and scale up by ~1GB/s with each future node.
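[For the Gluster step, a minimal sketch of a distributed-replicated volume across four such nodes might look like the following; hostnames, brick paths, and the replica count are illustrative assumptions, and the right layout depends on how much node-level redundancy is actually wanted.]

    # Run from node1; each node exports its md/RAID6-backed XFS
    # filesystem as a brick.  "replica 2" pairs the bricks so one node
    # in each pair can go offline without interrupting clients.
    gluster peer probe node2
    gluster peer probe node3
    gluster peer probe node4

    gluster volume create bigfiles replica 2 transport tcp \
        node1:/export/data/brick node2:/export/data/brick \
        node3:/export/data/brick node4:/export/data/brick
    gluster volume start bigfiles

    # Clients mount the volume with the native FUSE client:
    # mount -t glusterfs node1:/bigfiles /data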
> Looks like you're using the $550 sale price for those 1TB Samsung
> SSDs.  Normal price is $600.  Newegg usually has a limit of 5 (IIRC)
> on sale-priced drives.

I didn't look closely enough.  It's actually $656.14; I've corrected all the figures above.

>> maxing out the rest of the hardware so I spec'd 4 ports.  With the
>> correct bonding setup you should be able to get between 3-4GB/s.  Still
>> only 1/4th - 1/3rd the SSD throughput.
>
> Right.  I might start with just a single dual-port 10gig NIC, and see
> if I can saturate that.  Let's be pessimistic, and assume I can only
> wrangle 250 MB/sec out of each SSD.  I'll also designate two hot
> spares, leaving a 22-drive raid6.  So that's: 250 MB/s * 20 = 5 GB/s.
> Now that's not so far away from the 4 GB/sec theoretical with 4x 10gig
> NICs.

You'll get near full read bandwidth from the SSDs without any problems.  That's not an issue.  The problem will likely be getting 3-4GB/s of NFS/TCP throughput from your bonded stack.  The one thing in your favor is that you only need transmit load balancing for your workload, which is much easier to do than receive load balancing.

>> Hope you at least found this an interesting read, if not actionable.
>> Maybe others will as well.  I had some fun putting this one together.  I
>
> Absolutely interesting, thanks again for all the detailed feedback.

They don't call me "HardwareFreak" for nothin'. :)

...

> Anyway, thanks again for your feedback.  The first roadblock is
> definitely getting manager buy-in.  He tends to dismiss projects like
> this because (1) we're not a storage company / we don't DIY servers,
> (2) why isn't anyone else doing this / why can't you buy an OTS system
> like this, and (3) even though the cost savings are dramatic, it's still
> a ~$20k risk: what if I can't get even 50% of the theoretical
> throughput?  What if those SSDs require constant replacement?  What if
> there are subtle kernel- or driver-level bugs in "landmine" status,
> waiting for something like this to expose them?

(1) I'm not an HVAC contractor nor an electrician, but I rewired my entire house and replaced the HVAC system, including all new duct work.  I did it because I know how, and it saved me ~$10,000.  And the results are better than if I'd hired a contractor.  If you can do something yourself at lower cost and higher quality, do so.

(2) Because an OTS "system" is not a DIY system.  You're paying for expertise and support more than for the COTS gear.  Hardware at the wholesale OEM level is inexpensive.  When you buy a NetApp, their unit cost from the supplier is less than 1/4th what you pay NetApp for the hardware.  The rest is profit, R&D, customer support, employee overhead, etc.  When you buy hardware for a DIY build, you're buying the hardware and paying 10-20% profit to the wholesaler, depending on the item.

(3) The bulk of storage systems on the market today use embedded Linux, so any kernel or driver level bugs that may affect a DIY system will also affect such vendor solutions.

The risks boil down to one thing: competence.  If your staff is competent, your risk is extremely low.  Your boss has competent staff.  The problem with most management is they know they can buy X for Y cost from company Z and get some kind of guarantee for paying cost Y.  They feel they have "assurance" that things will just work.  We all know from experience, journals, and word of mouth that one can spend $100K to millions on hardware or software and/or "expert" consultants, and a year later it still doesn't work right.  There are no real guarantees.
Frankly, I'd much rather do everything myself, because I can, and have complete control of it.  That's a much better guarantee for me than any contract or SLA a vendor could ever provide.

-- 
Stan

^ permalink raw reply	[flat|nested] 15+ messages in thread
end of thread, other threads:[~2014-02-04 15:16 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-01-29 17:23 sequential versus random I/O Matt Garman
2014-01-30  0:10 ` Adam Goryachev
2014-01-30  0:41 ` Roberto Spadim
2014-01-30  0:45 ` Roberto Spadim
2014-01-30  0:58 ` Roberto Spadim
2014-01-30  1:03 ` Roberto Spadim
2014-01-30  1:18 ` Roberto Spadim
2014-01-30  2:38 ` Stan Hoeppner
2014-01-30  3:20 ` Matt Garman
2014-01-30  4:10 ` Roberto Spadim
2014-01-30 10:22 ` Stan Hoeppner
2014-01-30 15:28 ` Matt Garman
2014-02-01 18:28 ` Stan Hoeppner
2014-02-03 19:28 ` Matt Garman
2014-02-04 15:16 ` Stan Hoeppner