* sequential versus random I/O
@ 2014-01-29 17:23 Matt Garman
2014-01-30 0:10 ` Adam Goryachev
` (2 more replies)
0 siblings, 3 replies; 15+ messages in thread
From: Matt Garman @ 2014-01-29 17:23 UTC (permalink / raw)
To: Mdadm
This is arguably off-topic for this list, but hopefully it's relevant
enough that no one gets upset...
I have a conceptual question regarding "sequential" versus "random"
I/O, reads in particular.
Say I have a simple case: one disk and exactly one program reading one
big file off the disk. Clearly, that's a sequential read operation.
(And I assume that's basically a description of a sequential read disk
benchmark program.)
Now I have one disk with two large files on it. By "large" I mean the
files are at least 2x bigger than any disk cache or system RAM, i.e.
for the sake of argument, ignore caching in the system. I have
exactly two programs running, and each program constantly reads and
re-reads one of those two big files.
From the programs' perspective, this is clearly a sequential read.
But from the disk's perspective, it looks to me at least somewhat like
random I/O: for a spinning disk, the head will presumably be jumping
around quite a bit to fulfill both requests at the same time.
And then generalize that second example: one disk, one filesystem,
with some arbitrary number of large files, and an arbitrary number of
running programs, all doing sequential reads of the files. Again,
looking at each program in isolation, it's a sequential read request.
But at the system level, all those programs in aggregate present more
of a random read I/O load... right?
So if a storage system (individual disk, RAID, NAS appliance, etc)
advertises X MB/s sequential read, that X is only meaningful if there
is exactly one reader. Obviously I can't run two sequential read
benchmarks in parallel and expect to get the same result as running
one benchmark in isolation. I would expect the two parallel
benchmarks to report roughly 1/2 the performance of the single
instance. And as more benchmarks are run in parallel, I would expect
the performance report to eventually look like the result of a random
read benchmark.
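For illustration, here is a rough back-of-the-envelope model of that
degradation on a single spinning disk; the streaming rate, average seek
time, and per-visit chunk size are assumed numbers chosen only to show the
shape of the curve, not measurements:

    # Model N programs each sequentially reading a different large file from
    # one spinning disk.  Assumed (not measured) numbers: ~150 MB/s streaming
    # rate, ~8 ms average seek, 1 MB handed to a reader per service visit.
    STREAM_MBPS = 150.0
    SEEK_S      = 0.008
    CHUNK_MB    = 1.0

    def throughput(n_readers):
        transfer_s = CHUNK_MB / STREAM_MBPS
        seek_s = SEEK_S if n_readers > 1 else 0.0      # a lone reader never seeks away
        aggregate = CHUNK_MB / (transfer_s + seek_s)   # total MB/s the disk delivers
        return aggregate, aggregate / n_readers        # (aggregate, per-reader) MB/s

    for n in (1, 2, 4, 16, 64):
        total, each = throughput(n)
        print(f"{n:3d} readers: {total:6.1f} MB/s aggregate, {each:6.2f} MB/s per reader")

With more than one reader the aggregate collapses toward the disk's
random-read rate, which is exactly the effect described above.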
The motivation for this question comes from my use case, which is
similar to running a bunch of sequential read benchmarks in parallel.
In particular, we have a big NFS server that houses a collection of
large files (average ~400 MB). The server is read-only mounted by
dozens of compute nodes. Each compute node in turn runs dozens of
processes that continually re-read those big files. Generally
speaking, should the NFS server (including RAID subsystem) be tuned
for sequential I/O or random I/O?
Furthermore, how does this differ (if at all) between spinning drives
and SSDs? For simplicity, assume a spinning drive and an SSD
advertise the same sequential read throughput. (I know this is a
stretch, but assume the advertising is honest and accurate.) The
difference, though, is that the spinning disk can do 200 IOPS, but the
SSD can do 10,000 IOPS... intuitively, it seems like the SSD ought to
have the edge in my multi-consumer example. But, is my intuition
correct? And if so, how can I quantify how much better the SSD is?
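One way to put rough numbers on that intuition, using the IOPS figures
above: assume every request in the heavily shared case effectively becomes
a random access that moves one readahead-sized chunk, and (per the premise)
give both devices the same advertised sequential rate. The 1 MB request
size and the 500 MB/s figure are assumptions for illustration only:

    # Effective throughput of one device under many interleaved sequential
    # streams, where each request pays a positioning cost (1/IOPS) plus the
    # transfer time at the advertised sequential rate.
    REQUEST_MB = 1.0       # assumed readahead-sized request
    SEQ_MBPS   = 500.0     # same advertised sequential rate for both devices

    def busy_throughput_mbps(iops):
        service_s = 1.0 / iops + REQUEST_MB / SEQ_MBPS
        return REQUEST_MB / service_s

    hdd = busy_throughput_mbps(200)      # spinning disk
    ssd = busy_throughput_mbps(10_000)   # SSD
    print(f"HDD under heavy concurrency: ~{hdd:.0f} MB/s")
    print(f"SSD under heavy concurrency: ~{ssd:.0f} MB/s ({ssd / hdd:.1f}x the HDD)")

The gap widens as the effective request size shrinks (smaller readahead,
more head movement per byte) and narrows as requests get larger.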
Thanks,
Matt
^ permalink raw reply [flat|nested] 15+ messages in thread

* Re: sequential versus random I/O
  2014-01-29 17:23 sequential versus random I/O Matt Garman
@ 2014-01-30  0:10 ` Adam Goryachev
  2014-01-30  0:41 ` Roberto Spadim
  2014-01-30  2:38 ` Stan Hoeppner
  2 siblings, 0 replies; 15+ messages in thread
From: Adam Goryachev @ 2014-01-30 0:10 UTC (permalink / raw)
To: Matt Garman, Mdadm

On 30/01/14 04:23, Matt Garman wrote:
> [...]
> difference, though, is that the spinning disk can do 200 IOPS, but the
> SSD can do 10,000 IOPS... intuitively, it seems like the SSD ought to
> have the edge in my multi-consumer example. But, is my intuition
> correct? And if so, how can I quantify how much better the SSD is?

When doing parallel reads, you will get less than half the read speed for
each of the two readers, because you have to wait for the drive's seek time
each time it moves from reading one file to the other. You might get 40% of
the read speed for each; but if you have 100 readers, you will get a lot
less than 1% each, because the seek overhead is multiplied 100x instead of
only 2x.

For an SSD, however, the seek time is effectively zero, so you will get
exactly half the read speed for each of the two readers (or 1% of the read
speed each for 100 readers, and so on). That would be the perfect
application for SSDs: read-only (so you never even have to think about the
write-endurance limitation) and a large number of concurrent accesses.

Of course, RAID at various levels will help you scale even further with
either spinning disks or SSDs; even a linear (concatenated) array would
help, because different files will land on different disks. You will
probably also want some protection from failed disks.

Regards,
Adam

--
Adam Goryachev
Website Managers
www.websitemanagers.com.au

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: sequential versus random I/O
  2014-01-29 17:23 sequential versus random I/O Matt Garman
  2014-01-30  0:10 ` Adam Goryachev
@ 2014-01-30  0:41 ` Roberto Spadim
  2014-01-30  0:45 ` Roberto Spadim
  2014-01-30  2:38 ` Stan Hoeppner
  2 siblings, 1 reply; 15+ messages in thread
From: Roberto Spadim @ 2014-01-30 0:41 UTC (permalink / raw)
To: Matt Garman; +Cc: Mdadm

2014-01-29 Matt Garman <matthew.garman@gmail.com>:
> Say I have a simple case: one disk and exactly one program reading one
> big file off the disk. Clearly, that's a sequential read operation.
> [...]

No -- you forgot that the kernel is "another program", and the filesystem
is "another program" too. Only if your 'exactly one program' reads and
writes the block device directly do you get "exactly" one program using the
disk (well, your program plus the Linux kernel...). Some filesystems will
split a file into many fragments, but assuming the disk is new and clean,
many filesystems will not fragment it.

> Now I have one disk with two large files on it. [...] I have
> exactly two programs running, and each program constantly reads and
> re-reads one of those two big files.

OK -- and I will keep in mind that this sits on top of a filesystem, since
you have two large 'files'.

> From the programs' perspective, this is clearly a sequential read.
> But from the disk's perspective, it looks to me at least somewhat like
> random I/O [...]

Hmm, you should look at the Linux I/O scheduler:
http://en.wikipedia.org/wiki/I/O_scheduling
It can do a very, very nice job here. :)

Random versus contiguous is really a block-layer matter. If I'm not wrong,
the filesystem sends requests to the block device, the block layer groups
them and 'creates' the disk commands that fetch the data; a level down,
SATA/SCSI and other protocols talk to the hardware and report errors and
other conditions. I'm a bit out of date on the Linux source, but if you
check the read-balance function of RAID1 you will see an example of how
contiguous reads are treated: read balancing tries to send contiguous reads
to the same disk, which speeds things up a lot and leaves the other disks
free for other tasks/threads/programs.

> And then generalize that second example: one disk, one filesystem,
> with some arbitrary number of large files, and an arbitrary number of
> running programs, all doing sequential reads of the files. [...]

Hmm, the block layer handles this -- check the schedulers/elevators again.
There is a time gap between commands sent to the disk, and even the "noop"
scheduler (elevator) waits a bit so that contiguous reads can be issued
together more often.

> So if a storage system (individual disk, RAID, NAS appliance, etc)
> advertises X MB/s sequential read, that X is only meaningful if there
> is exactly one reader.

The advertised read speed is the "super pro master top ultrablaster"
number: what you can read off the disk with no cache involved and a good
SAS/SCSI/SATA card.

> Obviously I can't run two sequential read
> benchmarks in parallel and expect to get the same result as running
> one benchmark in isolation.

Yes. :)

> I would expect the two parallel
> benchmarks to report roughly 1/2 the performance of the single
> instance. And as more benchmarks are run in parallel, I would expect
> the performance report to eventually look like the result of a random
> read benchmark.

Hmm... you are forgetting the elevators: the 1/2 could be more or less, and
1/n (where n = number of tests) isn't good math either -- there are more
things we could be forgetting (cache, bus problems, disk problems, IRQ
problems, DMA problems, etc.).

> The motivation for this question comes from my use case [...]
> Generally speaking, should the NFS server (including RAID subsystem) be
> tuned for sequential I/O or random I/O?

Hmm -- when I have many threads I use RAID1; I only use RAID0 or another
stripe/linear layout when I have only big files (like a DVR). That gives
better speed than RAID1 in some cases (but you should check for yourself).
Another nice option is a hardware RAID card with a flash-memory cache;
those do a nice caching job.

> Furthermore, how does this differ (if at all) between spinning drives
> and SSDs? [...] And if so, how can I quantify how much better the SSD is?

Hmm, if cost is the concern, consider using SSDs as a cache and HDDs as the
main storage -- Facebook uses this kind of setup a lot. Check bcache,
flashcache and dm-cache:
https://github.com/facebook/flashcache/
http://en.wikipedia.org/wiki/Bcache
http://en.wikipedia.org/wiki/Dm-cache

:)

--
Roberto Spadim

^ permalink raw reply [flat|nested] 15+ messages in thread
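For reference, the elevator in use for a given disk is visible through
sysfs and can be switched at runtime; a minimal sketch, assuming the device
is named sda (the available scheduler names depend on the kernel build):

    # Show, and optionally change, the I/O scheduler ("elevator") for one disk.
    SCHED = "/sys/block/sda/queue/scheduler"   # assumes the device is sda

    with open(SCHED) as f:
        print("schedulers:", f.read().strip())  # e.g. "noop deadline [cfq]"

    # To switch (requires root), write the new name back:
    # with open(SCHED, "w") as f:
    #     f.write("deadline")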
* Re: sequential versus random I/O
  2014-01-30  0:41 ` Roberto Spadim
@ 2014-01-30  0:45 ` Roberto Spadim
  2014-01-30  0:58 ` Roberto Spadim
  0 siblings, 1 reply; 15+ messages in thread
From: Roberto Spadim @ 2014-01-30 0:45 UTC (permalink / raw)
To: Matt Garman; +Cc: Mdadm

Check these too:
http://www.hardwaresecrets.com/article/315
http://en.wikipedia.org/wiki/Native_Command_Queuing
http://en.wikipedia.org/wiki/TCQ

http://en.wikipedia.org/wiki/I/O_scheduling
http://en.wikipedia.org/wiki/Deadline_scheduler
http://en.wikipedia.org/wiki/CFQ
http://en.wikipedia.org/wiki/Anticipatory_scheduling
http://en.wikipedia.org/wiki/Noop_scheduler

http://doc.opensuse.org/products/draft/SLES/SLES-tuning_sd_draft/cha.tuning.io.html

and many others

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: sequential versus random I/O
  2014-01-30  0:45 ` Roberto Spadim
@ 2014-01-30  0:58 ` Roberto Spadim
  2014-01-30  1:03 ` Roberto Spadim
  0 siblings, 1 reply; 15+ messages in thread
From: Roberto Spadim @ 2014-01-30 0:58 UTC (permalink / raw)
To: Matt Garman; +Cc: Mdadm

Hmm, another thing... since you are using NFS, the network matters too --
did you enable jumbo frames?

http://en.wikipedia.org/wiki/Jumbo_frame
https://wiki.archlinux.org/index.php/Jumbo_Frames

Sorry for the many mails, guys; that's the last one. =)

2014-01-29 Roberto Spadim <rspadim@gmail.com>:
> [...]

^ permalink raw reply [flat|nested] 15+ messages in thread
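A quick way to confirm whether jumbo frames are actually in effect on the
NFS-facing interface; the interface name eth0 below is only a placeholder:

    # Print the MTU of the interface carrying the NFS traffic; ~9000 means
    # jumbo frames are enabled, 1500 is the ordinary Ethernet default.
    IFACE = "eth0"  # assumed name -- substitute the real NFS-facing interface
    with open(f"/sys/class/net/{IFACE}/mtu") as f:
        print(IFACE, "MTU:", f.read().strip())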
* Re: sequential versus random I/O
  2014-01-30  0:58 ` Roberto Spadim
@ 2014-01-30  1:03 ` Roberto Spadim
  2014-01-30  1:18 ` Roberto Spadim
  0 siblings, 1 reply; 15+ messages in thread
From: Roberto Spadim @ 2014-01-30 1:03 UTC (permalink / raw)
To: Matt Garman; +Cc: Mdadm

Oops, this one is the last...

http://kernel.dk/blk-mq.pdf
https://lwn.net/Articles/552904/
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=320ae51feed5c2f13664aa05a76bec198967e04d
http://kernelnewbies.org/LinuxChanges#head-d433b7e91267144d5ad63dc96789f97519a73422

:) sorry

2014-01-29 Roberto Spadim <rspadim@gmail.com>:
> [...]

--
Roberto Spadim

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: sequential versus random I/O
  2014-01-30  1:03 ` Roberto Spadim
@ 2014-01-30  1:18 ` Roberto Spadim
  0 siblings, 0 replies; 15+ messages in thread
From: Roberto Spadim @ 2014-01-30 1:18 UTC (permalink / raw)
To: Matt Garman; +Cc: Mdadm

Sorry again -- the last interesting thing (that I remember): compression is
a nice feature. Check these:

https://btrfs.wiki.kernel.org/index.php/Main_Page
  (Btrfs can compress files online; I don't know if it's stable)
http://en.wikipedia.org/wiki/JFS_(file_system)
  (JFS version 1, but I don't know if you will hit bugs)
https://code.google.com/p/fusecompress/
  (FUSE is a userland filesystem layer -- maybe not as fast as an in-kernel
  filesystem, but it does a nice job)
http://squashfs.sourceforge.net/
  (read-only filesystem)

end =]

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: sequential versus random I/O
  2014-01-29 17:23 sequential versus random I/O Matt Garman
  2014-01-30  0:10 ` Adam Goryachev
  2014-01-30  0:41 ` Roberto Spadim
@ 2014-01-30  2:38 ` Stan Hoeppner
  2014-01-30  3:20 ` Matt Garman
  2 siblings, 1 reply; 15+ messages in thread
From: Stan Hoeppner @ 2014-01-30 2:38 UTC (permalink / raw)
To: Matt Garman, Mdadm

On 1/29/2014 11:23 AM, Matt Garman wrote:
...
> In particular, we have a big NFS server that houses a collection of
> large files (average ~400 MB). The server is read-only mounted by
> dozens of compute nodes. Each compute node in turn runs dozens of
> processes that continually re-read those big files. Generally
> speaking, should the NFS server (including RAID subsystem) be tuned
> for sequential I/O or random I/O?
...

If your workflow description is accurate, and assuming you're trying to
fix a bottleneck at the NFS server, the solution to this is simple, and
very well known: local scratch space. Given your workflow description
it's odd that you're not already doing so. Which leads me to believe
that the description isn't entirely accurate. If it is, you simply copy
each file to local scratch disk and iterate over it locally. If you're
using diskless compute nodes then that's an architectural
flaw/oversight, as this workload as described begs for scratch disk.

--
Stan

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: sequential versus random I/O
  2014-01-30  2:38 ` Stan Hoeppner
@ 2014-01-30  3:20 ` Matt Garman
  2014-01-30  4:10 ` Roberto Spadim
  2014-01-30 10:22 ` Stan Hoeppner
  0 siblings, 2 replies; 15+ messages in thread
From: Matt Garman @ 2014-01-30 3:20 UTC (permalink / raw)
To: Stan Hoeppner; +Cc: Mdadm

On Wed, Jan 29, 2014 at 8:38 PM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
> If your workflow description is accurate, and assuming you're trying to
> fix a bottleneck at the NFS server, the solution to this is simple, and
> very well known: local scratch space. Given your workflow description
> it's odd that you're not already doing so. Which leads me to believe
> that the description isn't entirely accurate. If it is, you simply copy
> each file to local scratch disk and iterate over it locally. If you're
> using diskless compute nodes then that's an architectural
> flaw/oversight, as this workload as described begs for scratch disk.

There really is no bottleneck now, but looking into the future, there
will be a bottleneck at the next addition of compute nodes. I've
thought about local caching at the compute node level, but I don't
think it will help. The total collection of big files on the NFS
server is upwards of 20 TB. Processes are distributed randomly across
compute nodes, and any process could access any part of that 20 TB
file collection. (My description may have implied there is a 1-to-1
process-to-file mapping, but that is not the case.) So the local
scratch space would have to be quite big to prevent thrashing. In
other words, unless the local cache was multi-terabyte in size, I'm
quite confident that the local cache would actually degrade
performance due to constant turnover.

Furthermore, let's simplify the workflow: say there is only one
compute server, and its local disk is sufficiently large to hold the
entire data set (assume 20 TB drives exist with performance
characteristics similar to today's spinning drives). In other words,
there is no need for the NFS server now. I believe even in this
scenario, the single local disk would be a bottleneck to the dozens of
programs running on the node... these compute nodes are typically dual
socket, 6 or 8 core systems. The computational part is fast enough on
modern CPUs that the I/O workload can be realistically approximated by
dozens of parallel "dd if=/random/big/file of=/dev/null" processes,
all accessing different files from the collection. In other words,
very much like my contrived example of multiple parallel read
benchmark programs.

FWIW, the current NFS server is from a big iron storage vendor. It's
made up of 96 15k SAS drives. A while ago we were hitting a
bottleneck on the spinning disks, so the vendor was happy to sell us 1
TB of their very expensive SSD cache module. This worked quite well
at reducing spinning disk utilization, and cache module utilization
was quite high. The recent compute node expansion has lowered cache
utilization at the expense of spinning disk utilization... things are
still chugging along acceptably, but we're at capacity. We've maxed
out at just under 3 GB/sec of throughput (that's gigabytes, not bits).

What I'm trying to do is decide if we should continue to pay expensive
maintenance and additional cache upgrades to our current device, or if
I might be better served by a DIY big array of consumer SSDs, ala the
"Dirt Cheap Data Warehouse" [1]. I don't see too many people building
big arrays of consumer-grade SSDs, or even vendors selling pre-made
big SSD based systems. (To be fair, you can buy big SSD arrays, but
with crazy-expensive *enterprise* SSDs... we have effectively a WORM
workload, so don't need the write endurance features of enterprise
SSD. I think that's where the value opportunity comes in for us.)
Anyway, I'm just looking for reasons why taking on such a project
might blow up in my face (assuming I can convince the check-writer to
basically fund a storage R&D project).

[1] http://www.openida.com/the-dirt-cheap-data-warehouse-an-introduction/

^ permalink raw reply [flat|nested] 15+ messages in thread
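The dd-style load described above is easy to approximate with a small
script; a rough sketch of such a load generator (the file paths are passed
on the command line, and the threaded readers are only meant to mimic many
independent streaming consumers):

    # Spawn one reader per file, each streaming the file end to end and
    # discarding the data -- roughly parallel "dd if=BIGFILE of=/dev/null" runs.
    import sys, time
    from concurrent.futures import ThreadPoolExecutor

    CHUNK = 1 << 20  # 1 MiB reads

    def stream(path):
        start, total = time.time(), 0
        with open(path, "rb", buffering=0) as f:
            while True:
                buf = f.read(CHUNK)
                if not buf:
                    break
                total += len(buf)
        secs = time.time() - start
        return path, total / secs / 1e6  # MB/s achieved by this reader

    if __name__ == "__main__":
        files = sys.argv[1:]  # pass the big files to re-read as arguments
        with ThreadPoolExecutor(max_workers=len(files) or 1) as pool:
            for path, mbps in pool.map(stream, files):
                print(f"{path}: {mbps:.1f} MB/s")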
* Re: sequential versus random I/O
  2014-01-30  3:20 ` Matt Garman
@ 2014-01-30  4:10 ` Roberto Spadim
  2014-01-30 10:22 ` Stan Hoeppner
  1 sibling, 0 replies; 15+ messages in thread
From: Roberto Spadim @ 2014-01-30 4:10 UTC (permalink / raw)
To: Matt Garman; +Cc: Stan Hoeppner, Mdadm

Hmm, there's a nice solution for low-cost storage -- it costs less than any
IBM/Dell/HP/etc. storage and has a nice read rate:
http://www.backblaze.com/ (the red chassis with many disks)

Maybe you could get better performance from many SATA disks running RAID1,
RAID0, RAID6 or RAID10 than from many SAS disks, for the same cost... At
least where I live (Brazil) SAS is very expensive, and two SATA disks beat
one SAS disk for enterprise databases with the same random-read-heavy
workload (my tests only). With SATA you buy more space and "lose" read rate
(7200 rpm vs 15000 rpm, about 2x faster), but you get more disk heads (nice
for a RAID1 solution), which could allow more programs to read different
parts of the "logical volume", each head working in one part (OK, you must
test this yourself).

In other words, maybe you could save some money with many SATA disks and
spend it on a nice cache layer: SSDs under bcache/flashcache/dm-cache, or a
RAID card with flash cache. Just an idea... other solutions could be
better.

2014-01-30 Matt Garman <matthew.garman@gmail.com>:
> [...]

--
Roberto Spadim

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: sequential versus random I/O
  2014-01-30  3:20 ` Matt Garman
  2014-01-30  4:10 ` Roberto Spadim
@ 2014-01-30 10:22 ` Stan Hoeppner
  2014-01-30 15:28 ` Matt Garman
  1 sibling, 1 reply; 15+ messages in thread
From: Stan Hoeppner @ 2014-01-30 10:22 UTC (permalink / raw)
To: Matt Garman; +Cc: Mdadm

On 1/29/2014 9:20 PM, Matt Garman wrote:
> [...]
> What I'm trying to do is decide if we should continue to pay expensive
> maintenance and additional cache upgrades to our current device, or if
> I might be better served by a DIY big array of consumer SSDs, ala the
> "Dirt Cheap Data Warehouse" [1].

I wouldn't go used as they do. Not for something this critical.

> I don't see too many people building
> big arrays of consumer-grade SSDs, or even vendors selling pre-made
> big SSD based systems. (To be fair, you can buy big SSD arrays, but
> with crazy-expensive *enterprise* SSDs... we have effectively a WORM
> workload, so don't need the write endurance features of enterprise
> SSD. I think that's where the value opportunity comes in for us.)

I absolutely agree.

> Anyway, I'm just looking for reasons why taking on such a project
> might blow up in my face

If you architect the system correctly, and use decent quality hardware,
it won't blow up on you. If you don't get the OS environment tuned
correctly you'll simply get less throughput than desired. But that can
always be remedied with tweaking.

> (assuming I can convince the check-writer to
> basically fund a storage R&D project).

How big a check? 24x 1TB Samsung SSDs will run you $12,000:
http://www.newegg.com/Product/Product.aspx?Item=N82E16820147251

A suitable server with 48 2.5" SAS bays sans HBAs and NICs will run
$5,225.00:
http://www.rackmountpro.com/products/servers/5u-servers/details/&pnum=YM5U52652&cpu=int

CPU: 2x Intel® Xeon® Processor E5-2630v2 6 core (2.6/3.1 GHz, 80W)
RAM: 8x 8GB DDR3 1600MHz ECC Registered Memory
OSD: 2x 2.5" SSD 120GB SATA III 6Gb/s
NET: Integrated Quad Intel GbE
Optical Drive: 8x Slim Internal DVD-RW
PSU: 1140W R3G5B40V4V 2+1 redundant power supply
OS:  No OS
3 year warranty

3x LSI 9201-16i SAS HBAs: $1,100
http://www.newegg.com/Product/Product.aspx?Item=N82E16816118142

Each of the two backplanes has 24 drive slots and 6 SFF-8087 connectors.
Each 8087 carries 4 SAS channels. You connect two ports of each HBA to
the top backplane and the other two to the bottom backplane. I.e. one
HBA controls the left 16 drives, one controls the middle 16 drives, and
one controls the right 16 drives. Starting with 24 drives in the top
tray, each HBA controls 8 drives. These 3 HBAs are 8 lane PCIe 2.0 and
provide an aggregate peak uni/bi-directional throughput of ~12/24 GB/s.
Samsung 840 EVO raw read throughput is ~0.5GB/s * 24 drives = 12GB/s.
Additional SSDs will not provide much increased throughput, if any, as
the HBAs are pretty well maxed at 24 drives. This doesn't matter as your
network throughput will be much less.

Speaking of network throughput, if you're not using Infiniband but 10GbE,
you'll want to acquire this 6 port 10 GbE NIC. I don't have a price:
http://www.interfacemasters.com/pdf/Niagara_32716.pdf

With proper TX load balancing and the TCP stack well tuned you'll have a
potential peak of ~6GB/s of NFS throughput. They offer dual and quad port
models as well if you want two cards for redundancy:
http://www.interfacemasters.com/products/server-adapters/server-adapters-product-matrix.html#twoj_fragment1-5

Without the cost of NICs you're looking at roughly $19,000 for this
configuration, including shipping costs, for a ~22TB DIY SSD based NFS
server system expandable to 46TB. With two quad port 10GbE NICs and SFPs
you're at less than $25K with the potential for ~6GB/s NFS throughput.

In specifying HBAs instead of RAID controllers I am assuming you'll use
md/RAID. With this many SSDs any current RAID controller would slow you
down anyway as the ASICs aren't fast enough. You'll need minimum
redundancy to guard against an SSD failure, which means RAID5 with SSDs.
Your workload is almost exclusively read heavy, which means you could
simply create a single 24 drive RAID5 or RAID6 with the default 512KB
chunk. I'd go with RAID6. That will yield a stripe width of
22*512KB=11MB. Using RAID5/6 allows you to grow the array incrementally
without the need for LVM, which may slow you down.

Surely you'll use XFS as it's the only Linux filesystem suitable for
such a parallel workload. As you will certainly grow the array in the
future, I'd format XFS without stripe alignment and have it do 4KB IOs.
Stripe alignment won't gain you anything with this workload on SSDs, but
it could cause performance problems after you grow the array, at which
point the XFS stripe alignment will not match the new array geometry.
mkfs.xfs will auto align to the md geometry, so forcing it to use the
default 4KB single FS block alignment will be necessary. I can help you
with this if you indeed go down this path.

The last point I'll make is that it may require some serious tweaking of
IRQ load balancing, md/RAID, NFS, Ethernet bonding driver, etc, to wring
peak throughput out of such a DIY SSD system. Achieving ~1GB/s parallel
NFS throughput from a DIY rig with a single 10GbE port isn't horribly
difficult. 3+GB/s parallel NFS via bonded 10GbE interfaces is a bit
more challenging.

--
Stan

^ permalink raw reply [flat|nested] 15+ messages in thread
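A quick sketch of the arithmetic behind that proposed layout; the drive
count, chunk size, and per-drive throughput are the assumed figures from
the message above, not measurements:

    # Geometry and raw-throughput math for the proposed 24-drive md RAID6 array.
    DRIVES     = 24
    CHUNK_KB   = 512          # md default chunk size
    DRIVE_TB   = 1            # 1 TB consumer SSDs
    DRIVE_GBPS = 0.5          # ~500 MB/s raw read per SSD (assumed)

    data_drives  = DRIVES - 2                      # RAID6 uses two drives for parity
    stripe_width = data_drives * CHUNK_KB / 1024   # MB per full stripe
    capacity     = data_drives * DRIVE_TB          # usable TB
    raw_read     = DRIVES * DRIVE_GBPS             # GB/s off the SSDs

    print(f"usable capacity : {capacity} TB")
    print(f"stripe width    : {stripe_width:.0f} MB ({data_drives} x {CHUNK_KB} KB chunks)")
    print(f"raw read rate   : ~{raw_read:.0f} GB/s (vs ~1 GB/s per 10GbE port)")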
* Re: sequential versus random I/O
  2014-01-30 10:22 ` Stan Hoeppner
@ 2014-01-30 15:28 ` Matt Garman
  2014-02-01 18:28 ` Stan Hoeppner
  0 siblings, 1 reply; 15+ messages in thread
From: Matt Garman @ 2014-01-30 15:28 UTC (permalink / raw)
To: Stan Hoeppner; +Cc: Mdadm

On Thu, Jan 30, 2014 at 4:22 AM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
> I wouldn't go used as they do. Not for something this critical.

No, not for an actual production system. I linked that as "conceptual
inspiration", not as an exact template for what I'd do. Although the
used route might be useful for building a cheap prototype to
demonstrate proof of concept.

> If you architect the system correctly, and use decent quality hardware,
> it won't blow up on you. If you don't get the OS environment tuned
> correctly you'll simply get less throughput than desired. But that can
> always be remedied with tweaking.

Right. I think the general concept is solid, but, as with most things,
"the devil's in the details". FWIW, the creator of the DCDW enumerated
some of the "gotchas" for a build like this [1]. He went into more
detail in some private correspondence with me. It's a little alarming
that he got roughly 50% of the performance with a tuned Linux setup
compared to a mostly out-of-the-box Solaris install. Also, subtle
latency issues with PCIe timings across different motherboards sound
like a migraine-caliber headache.

> Each of the two backplanes has 24 drive slots and 6 SFF-8087 connectors.
> Each 8087 carries 4 SAS channels. You connect two ports of each HBA to
> the top backplane and the other two to the bottom backplane. I.e. one
> [...]

Your concept is similar to what I've sketched out in my mind. My twist
is that I think I would actually build multiple servers, each one would
be a 24-disk 2U system. Our data is fairly easy to partition across
multiple servers. Also, we already have a big "symlink index" directory
that abstracts the actual location of the files. IOW, my users don't
know/don't care where the files actually live, as long as the symlinks
are there and not broken.

> Without the cost of NICs you're looking at roughly $19,000 for this
> configuration, including shipping costs, for a ~22TB DIY SSD based NFS
> server system expandable to 46TB. With two quad port 10GbE NICs and
> SFPs you're at less than $25K with the potential for ~6GB/s NFS throughput.

Yup, and this amount is less than one year's maintenance on the big
iron system we have in place. And, quoting the vendor, "Maintenance
costs only go up."

> In specifying HBAs instead of RAID controllers I am assuming you'll use
> md/RAID. With this many SSDs any current RAID controller would slow you
> down anyway as the ASICs aren't fast enough. You'll need minimum
> redundancy to guard against an SSD failure, which means RAID5 with SSDs.
> Your workload is almost exclusively read heavy, which means you could
> simply create a single 24 drive RAID5 or RAID6 with the default 512KB
> chunk. I'd go with RAID6. That will yield a stripe width of
> 22*512KB=11MB. Using RAID5/6 allows you to grow the array incrementally
> without the need for LVM, which may slow you down.

At the expense of storage capacity, I had in mind RAID10 with 3-way
mirrors. We do have backups, but downtime on this system won't be taken
lightly.

> Surely you'll use XFS as it's the only Linux filesystem suitable for
> such a parallel workload. As you will certainly grow the array in the
> future, I'd format XFS without stripe alignment and have it do 4KB IOs.
> [...]

I was definitely thinking XFS. But one other motivation for multiple 2U
systems (instead of one massive system) is that it's more modular.
Existing systems never have to be grown or reconfigured. When we need
more space/throughput, I just throw another system in place. I might
have to re-distribute the data, but this would be a very rare (maybe
once/year) event.

If I get the green light to do this, I'd actually test a few
configurations. But some that come to mind:
  - raid10,f3
  - groups of 3-way raid1 mirrors striped together with XFS
  - groups of raid6 sets not striped together (our symlink index I
    mentioned above makes this not as messy as it sounds)

> The last point I'll make is that it may require some serious tweaking of
> IRQ load balancing, md/RAID, NFS, Ethernet bonding driver, etc, to wring
> peak throughput out of such a DIY SSD system. Achieving ~1GB/s parallel
> NFS throughput from a DIY rig with a single 10GbE port isn't horribly
> difficult. 3+GB/s parallel NFS via bonded 10GbE interfaces is a bit
> more challenging.

I agree, I think that comes back around to what we said above: the
concept is simple, but the details mean the difference between brilliant
and mediocre.

Thanks for your input Stan, I appreciate it. I'm an infrequent poster
to this list, but a long-time reader, and I've learned a lot from your
posts over the years.

[1] http://forums.servethehome.com/diy-server-builds/2894-utterly-absurd-quad-xeon-e5-supermicro-server-48-ssd-drives.html

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: sequential versus random I/O
  2014-01-30 15:28 ` Matt Garman
@ 2014-02-01 18:28 ` Stan Hoeppner
  2014-02-03 19:28 ` Matt Garman
  0 siblings, 1 reply; 15+ messages in thread
From: Stan Hoeppner @ 2014-02-01 18:28 UTC (permalink / raw)
To: Matt Garman; +Cc: Mdadm

On 1/30/2014 9:28 AM, Matt Garman wrote:
> Right. I think the general concept is solid, but, as with most things,
> "the devil's in the details".

Always.

> FWIW, the creator of the DCDW enumerated some of the "gotchas" for a
> build like this [1]. [...] It's a little alarming that he got roughly
> 50% of the performance with a tuned Linux setup compared to a mostly
> out-of-the-box Solaris install.

Most x86-64 Linux distro kernels are built to perform on servers,
desktops, and laptops, thus performance on each is somewhat compromised.
Solaris x86-64 is built primarily for server duty, and tuned for that out
of the box. So what you state above isn't too surprising.

> Also, subtle latency issues with PCIe timings across different
> motherboards sound like a migraine-caliber headache.

This is an issue of board design and Q.A., specifically trace routing and
resulting signal skew, that the buyer can't do anything about. And
unfortunately this kind of information just isn't "out there" in reviews
and whatnot when you buy boards. The best one can do is buy a reputable
brand and cross fingers.

> Your concept is similar to what I've sketched out in my mind. My twist
> is that I think I would actually build multiple servers, each one would
> be a 24-disk 2U system. [...]

That makes tuning each box much easier if you go with a single 10GbE
port. But this has some downsides I'll address below.

> Yup, and this amount is less than one year's maintenance on the big
> iron system we have in place. And, quoting the vendor, "Maintenance
> costs only go up."

Yes, it's sad. "Maintenance Contract" = mostly pure profit. This is the
Best Buy extended warranty of the big iron marketplace. You pay a ton of
money and get very little, if anything, in return.

> At the expense of storage capacity, I had in mind RAID10 with 3-way
> mirrors. We do have backups, but downtime on this system won't be
> taken lightly.

I was in lock step with you until this point. We're talking about SSDs,
aren't we? And a read-only workload? RAID10 today is only for
transactional workloads on rust, to avoid RMW. SSD doesn't suffer RMW
latency. And this isn't a transactional workload, but parallel linear
read. Three-way mirroring within a RAID10 setup is used strictly to avoid
losing the 2nd disk in a mirror while its partner is rebuilding in a
standard RAID10. This is suitable when using large rusty drives where
rebuild times are 8+ hours. With a RAID10 triple mirror setup 2/3rds of
your capacity is wasted. This isn't a sane architecture for SSDs and a
read-only workload. Here's why. Under optimal conditions:

  a 4TB 7.2K SAS/SATA mirror rebuild takes 4TB / 130MB/s = ~8.5 hours
  a 1TB Sammy 840 EVO mirror rebuild takes 1TB / 500MB/s = ~34 minutes

A RAID6 rebuild will take a little longer, but still much less than an
hour, say 45 minutes max. With RAID6 you would have to sustain *two*
additional drive failures within that 45 minute rebuild window to lose
the array. Only an HBA, backplane, or PSU failure could take down two
more drives in 45 minutes, and if that happens you're losing many drives,
probably all of them, and you're sunk anyway. No matter how you slice it,
I can't see RAID10 being of any benefit here, and especially not 3-way
mirror RAID10.

If one of your concerns is decreased client throughput during rebuild,
then simply turn down the rebuild priority to 50%. Your rebuild will take
1.5 hours, in which you'd have to lose 2 additional drives to lose the
array, and you'll still have more client throughput at the array than the
network interface can push:

  (22 * 500MB/s = 11GB/s) / 2 = 5.5GB/s client B/W during rebuild
  10GbE interface B/W         = 1.0GB/s max

Using RAID10 yields no gain but increases cost. Using RAID10 with 3
mirrors is simply 3 times the cost and 2/3rds wasted capacity. Any form
of mirroring just isn't suitable for this type of SSD system.
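A compact restatement of that arithmetic; the drive sizes and rates are
the assumed figures above, not measurements, and small differences from
the quoted numbers are just rounding:

    # Rebuild windows and client bandwidth during a throttled rebuild.
    def hours(size_gb, rate_mb_s):
        return size_gb * 1000 / rate_mb_s / 3600

    print(f"4 TB 7.2k mirror rebuild  : ~{hours(4000, 130):.1f} h")
    print(f"1 TB SSD mirror rebuild   : ~{hours(1000, 500) * 60:.0f} min")

    ssd_read_gbs = 22 * 0.5   # data drives x per-SSD read rate (GB/s)
    print(f"client B/W at 50% rebuild : ~{ssd_read_gbs / 2:.1f} GB/s "
          f"(vs ~1 GB/s for one 10GbE port)")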
>> Surely you'll use XFS as it's the only Linux filesystem suitable for
>> such a parallel workload. As you will certainly grow the array in the
>> future, I'd format XFS without stripe alignment and have it do 4KB IOs.
>> [...]
>
> I was definitely thinking XFS. But one other motivation for multiple
> 2U systems (instead of one massive system) is that it's more modular.

The modular approach has advantages. But keep in mind that modularity
increases complexity and component count, which increase the probability
of a failure. The more vehicles you own, the more often one of them is in
the shop at any given time, even if only for an oil change.

> Existing systems never have to be grown or reconfigured. When we need
> more space/throughput, I just throw another system in place. I might
> have to re-distribute the data, but this would be a very rare (maybe
> once/year) event.

Gluster has advantages here as it can redistribute data automatically
among the storage nodes. If you do distributed mirroring you can take a
node completely offline for maintenance, and clients won't skip a beat,
or at worst a short beat. It costs half your storage for the mirroring,
but using RAID6 it's still ~33% less than the RAID10 with 3-way mirrors.

> If I get the green light to do this, I'd actually test a few
> configurations. But some that come to mind:
> - raid10,f3

Skip it. RAID10 is a no-go. And none of the alternate layouts will
provide any benefit, because SSDs are not spinning rust. The alternate
layouts exist strictly to reduce rotational latency.

> - groups of 3-way raid1 mirrors striped together with XFS

I covered this above. Skip it. And you're thinking of XFS over
concatenated mirror sets here. This architecture is used only for high
IOPS transactional workloads on rust. It won't gain you anything with
SSDs.

> - groups of raid6 sets not striped together (our symlink index I
>   mentioned above makes this not as messy as it sounds)

If you're going with multiple identical 24 bay nodes, you want a single
24 drive md/RAID6 in each, directly formatted with XFS. Or Gluster atop
XFS. It's the best approach for your read-only workload with large files.

>> The last point I'll make is that it may require some serious tweaking
>> of IRQ load balancing, md/RAID, NFS, Ethernet bonding driver, etc, to
>> wring peak throughput out of such a DIY SSD system. Achieving ~1GB/s
>> parallel NFS throughput from a DIY rig with a single 10GbE port isn't
>> horribly difficult. 3+GB/s parallel NFS via bonded 10GbE interfaces is
>> a bit more challenging.
>
> I agree, I think that comes back around to what we said above: the
> concept is simple, but the details mean the difference between
> brilliant and mediocre.

The details definitely become a bit easier with one array and one NIC per
node. But one thing really bothers me about such a setup. You have
~11GB/s of read throughput with 22 SSDs (24 drive RAID6). It doesn't make
sense to waste ~10GB/s of SSD throughput by using a single 10GbE
interface. At the very least you should be using 4x 10GbE ports per box
to achieve potentially 3+ GB/s. I think what's happening here is that
you're saving so much money compared to the proprietary NAS filer that
you're intoxicated by the savings. You're throwing money around on SSDs
like a drunken sailor on 6 month leave at a strip club :) without fully
understanding the implications, and the capability, of what you're buying
and putting in each box.

> Thanks for your input Stan, I appreciate it. I'm an infrequent poster
> to this list, but a long-time reader, and I've learned a lot from your
> posts over the years.

Glad someone actually reads my drivel on occasion. :)

I'm firmly an AMD guy. I used the YMI 48 bay Intel server in my previous
example for expediency, and to avoid what I'm doing here now. Please
allow me to indulge you with a complete parts list for one fully DIY NFS
server node build. I've matched and verified compatibility of all of the
components, using manufacturer specs, down to the iPASS/SGPIO SAS cables.
Combined with the LSI HBAs and this SM backplane, these sideband
signaling SAS cables should enable you to make drive failure LEDs work
with mdadm, using:
http://sourceforge.net/projects/ledmon/

I've not tried the software myself, but if it's up to par, dead drive
identification should work the same as with any vendor storage array,
which to this point has been nearly impossible with md arrays using
plain non-RAID HBAs. Preemptively flashing the mobo and SAS HBAs with the
latest firmware image should prevent any issues with the hardware. These
products have "shelf" firmware which is often quite a few revs old by the
time the customer receives product.

All but one of the necessary parts are stocked by NewEgg, believe it or
not. The build consists of a 24 bay 2U SuperMicro chassis with 920W dual
hot-swap PSUs, a SuperMicro dual socket C32 mobo w/5 PCIe 2.0 x8 slots,
2x Opteron 4334 3.1GHz 6 core CPUs, 2x Dynatron C32/1207 2U CPU coolers,
8x Kingston 4GB ECC registered DDR3-1333 single rank DIMMs, 3x LSI
9207-8i PCIe 3.0 x8 SAS HBAs, a rear 2 drive HS cage, 2x Samsung 120GB
boot SSDs, 24x Samsung 1TB data SSDs, 6x 2ft LSI SFF-8087 sideband
cables, and two dual port Intel 10GbE NICs sans SFPs, as you probably
already have some spares. You may prefer another NIC brand/model; these
are <$800 of the total.

1x  http://www.newegg.com/Product/Product.aspx?Item=N82E16811152565
1x  http://www.newegg.com/Product/Product.aspx?Item=N82E16813182320
2x  http://www.newegg.com/Product/Product.aspx?Item=N82E16819113321
2x  http://www.newegg.com/Product/Product.aspx?Item=N82E16835114139
8x  http://www.newegg.com/Product/Product.aspx?Item=N82E16820239618
3x  http://www.newegg.com/Product/Product.aspx?Item=N82E16816118182
24x http://www.newegg.com/Product/Product.aspx?Item=N82E16820147251
2x  http://www.newegg.com/Product/Product.aspx?Item=N82E16820147247
6x  http://www.newegg.com/Product/Product.aspx?Item=N82E16812652015
2x  http://www.newegg.com/Product/Product.aspx?Item=N82E16833106044
1x  http://www.costcentral.com/proddetail/Supermicro_Storage_drive_cage/MCP220826090N/11744345/

Total cost today: $16,927.23
SSD cost:         $13,119.98

Note all SSDs are direct connected to the HBAs. This system doesn't
suffer any disk bandwidth starvation due to SAS expanders as with most
storage arrays. As such you get nearly full bandwidth per drive, being
limited only by the NorthBridge to CPU HT link. At the hardware level the
system bandwidth breakdown is as follows:

  Memory:        42.6 GB/s
  PCIe to CPU:   10.4 GB/s unidirectional x2
  HBA to PCIe:   12   GB/s unidirectional x2
  SSD to HBA:    12   GB/s unidirectional x2
  PCIe to NIC:    8   GB/s unidirectional x2
  NIC to client:  4   GB/s unidirectional x2

Your HBA traffic will flow on the HT uplink and your NIC traffic on the
downlink, so you are not constrained here with this NFS read-only
workload. Assuming an 8:1 bandwidth ratio between file bytes requested
and memory bandwidth consumed by the kernel -- in the form of DMA xfers
to/from buffer cache, memory-memory copies for TCP and NFS, and hardware
overhead in the form of coherency traffic between the CPUs and HBAs,
interrupts in the form of MSI-X writes to memory, etc -- then 4GB/s of
requested data generates ~32GB/s at the memory controllers before it is
transmitted over the wire. Beyond tweaking parameters, it may require
building a custom kernel to achieve this throughput. But the hardware is
capable.
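A sketch of that back-of-the-envelope memory budget, treating the 8:1
amplification and the bandwidth figures above as stated assumptions:

    # Estimated memory-controller traffic generated per byte of NFS payload.
    NFS_PAYLOAD_GBS = 4.0    # target wire throughput (4x 10GbE)
    AMPLIFICATION   = 8      # assumed memory bytes moved per payload byte
    MEMORY_BW_GBS   = 42.6   # 8 channels of DDR3-1333, as listed above

    mem_traffic = NFS_PAYLOAD_GBS * AMPLIFICATION
    print(f"memory traffic for {NFS_PAYLOAD_GBS:.0f} GB/s of NFS reads: ~{mem_traffic:.0f} GB/s")
    print(f"headroom vs {MEMORY_BW_GBS} GB/s of memory bandwidth: "
          f"{MEMORY_BW_GBS - mem_traffic:.1f} GB/s")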
Using a single 10GbE interface yields 1/10th of the SSD b/w to clients.
This is a huge waste of the $$ spent on the SSDs. Using 4 will come close
to maxing out the rest of the hardware, so I spec'd 4 ports. With the
correct bonding setup you should be able to get between 3-4GB/s. Still
only 1/4th - 1/3rd of the SSD throughput.

To get close to taking near full advantage of the 12GB/s read bandwidth
offered by these 24 SSDs requires a box with dual Socket G34 processors
to get 8 DDR3-1333 memory channels--85GB/s--two SR5690 PCIe to HT
controllers, and 8x 10GbE ports (or 2x QDR Infiniband 4x).

Notice I didn't discuss CPU frequency or core count anywhere? That's
because it's not a factor. The critical factor is memory bandwidth. Any
single/dual Opteron 4xxx/6xxx system with ~8 or more cores should do the
job as long as IRQs are balanced across cores.

Hope you at least found this an interesting read, if not actionable.
Maybe others will as well. I had some fun putting this one together. I
think the only things I omitted were Velcro straps and self-stick lock
loops for tidying up the cables for optimum airflow. Experienced builders
usually have these on hand, but I figured I'd mention them just in case.

Locating certified DIMMs in the clock speed and rank required took too
much time, but this was not unforeseen. The only easy way to spec memory
for server boards and allow max expansion is to go with the lowest clock
speed. If I'd done that here you'd lose 17GB/s of memory bandwidth, a 40%
reduction. I also wanted to use a single socket G34 board, but
unfortunately nobody makes one with more than 3 PCIe slots. This design
required at least 4.

--
Stan

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: sequential versus random I/O
  2014-02-01 18:28             ` Stan Hoeppner
@ 2014-02-03 19:28               ` Matt Garman
  2014-02-04 15:16                 ` Stan Hoeppner
  0 siblings, 1 reply; 15+ messages in thread
From: Matt Garman @ 2014-02-03 19:28 UTC (permalink / raw)
To: Stan Hoeppner; +Cc: Mdadm

On Sat, Feb 1, 2014 at 12:28 PM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
> I was in lock step with you until this point.  We're talking about SSDs
> aren't we?  And a read-only workload?  RAID10 today is only for
> transactional workloads on rust to avoid RMW.  SSD doesn't suffer RMW
> ...

OK, I think I'm convinced, raid10 isn't appropriate here.  (If I get the green light for this, I might still do it in the buildup/experimentation stage, just for kicks and grins if nothing else.)

So, just to be clear, you've implied that if I have an N-disk raid6, then the (theoretical) sequential read throughput is

    (N-2) * T

where T is the throughput of a single drive (assuming uniform drives).  Is this correct?

> If one of your concerns is decreased client throughput during rebuild,
> then simply turn down the rebuild priority to 50%.  Your rebuild will

The main concern was "high availability".  This isn't like my home server, where I use raid as an excuse to de-prioritize my backups. :)  This is raid for its actual designed purpose: to minimize service interruptions in case of failure(s).

The thing is, I think consumer SSDs are still somewhat of an unknown entity in terms of reliability, longevity, and failure modes.  Just from the SSDs I've dealt with at home (tiny sample size), I've had two fail the "bad way": that is, they die and are no longer recognizable by the system (neither OS nor BIOS).  Presumably a failure of the SSD's controller.  And with spinning rust, we have decades of experience and useful public information like Google's HDD study and Backblaze's blog.  SSDs just haven't been out in the wild long enough to have a big enough sample size to do similar studies.

Those two SSDs I had die just abruptly went out, without any kind of advance warning.  (To be fair, these were first-gen, discount, consumer SSDs.)  Certainly, traditional spinning drives can also die in this way, but with regular SMART monitoring and such, we (in theory) have some useful means to predict impending death.  I'm not sure the SMART monitoring on SSDs is up to par with their rusty counterparts.

> The modular approach has advantages.  But keep in mind that modularity
> increases complexity and component count, which increase the probability
> of a failure.  The more vehicles you own the more often one of them is
> in the shop at any given time, if even only for an oil change.

Good point.  Although if I have more cars than I actually need (redundancy), I can afford to always have a car in the shop. ;)

> Gluster has advantages here as it can redistribute data automatically
> among the storage nodes.  If you do distributed mirroring you can take a
> node completely offline for maintenance, and clients won't skip a beat,
> or at worst a short beat.  It costs half your storage for the mirroring,
> but using RAID6 it's still ~33% less than the RAID10 w/3 way mirrors.
> ...
> If you're going with multiple identical 24 bay nodes, you want a single
> 24 drive md/RAID6 in each directly formatted with XFS.  Or Gluster atop
> XFS.  It's the best approach for your read only workload with large files.
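[For reference, a minimal sketch of what the quoted 24-drive md/RAID6 + XFS layout and the rebuild throttling could look like; device names, chunk size, spare count, and mount point are illustrative assumptions, not values from the thread.]

    # 24 SSDs: 22 active RAID6 members plus 2 hot spares.
    mdadm --create /dev/md0 --level=6 --chunk=128 \
          --raid-devices=22 --spare-devices=2 /dev/sd[b-y]

    # XFS aligned to the array geometry: su = chunk size,
    # sw = number of data disks (22 members - 2 parity).
    mkfs.xfs -d su=128k,sw=20 /dev/md0
    mount -o noatime,inode64 /dev/md0 /export/data

    # Cap resync/rebuild speed so client NFS traffic isn't starved
    # during a rebuild; units are KB/s per device.
    sysctl -w dev.raid.speed_limit_max=200000
    sysctl -w dev.raid.speed_limit_min=50000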
Now that you've convinced me RAID6 is the way to go: if I can get 3 GB/s out of one of these systems, then two of these systems would literally double the capability (storage capacity and throughput) of our current big iron system.  What would be ideal is to use something like Gluster to add a third system for redundancy, and have a "raid 5" at the server level.  I.e., the same storage capacity as two systems, but one whole node could go down without losing service availability.  I have no experience with cluster filesystems, however, so this presents another risk vector.

> I'm firmly an AMD guy.

Any reason for that?  That's an honest question, not a veiled argument.

Do the latest AMD server chips include the PCIe controller on-chip like the Sandy Bridge and newer Intel chips?  Or does AMD still put the PCIe controller on a separate chip (a northbridge)?

Just wondering if having dual on-CPU-die PCIe controllers is an advantage here (assuming a dual-socket system).  I agree with you, CPU core count and clock aren't terribly important; it's all about being able to extract maximum I/O from basically every other component in the system.

> sideband signaling SAS cables should enable you to make drive failure
> LEDs work with mdadm, using:
> http://sourceforge.net/projects/ledmon/
>
> I've not tried the software myself, but if it's up to par, dead drive
> identification should work the same as with any vendor storage array,
> which to this point has been nearly impossible with md arrays using
> plain non-RAID HBAs.

Ha, that's nice.  In my home server, which is idle 99% of the time, I've identified drives by simply doing a "dd if=/dev/target/drive of=/dev/null" and looking for the drive that lights up.  Although I've noticed some drives (Samsung) don't even light up when I do that.  I could do this in reverse on a system that's 99% busy: just offline the target drive, and look for the one light that's NOT lit.  Failing that, I had planned to use the old school paper-and-pencil method of just keeping good notes of which drive (identified by serial number) is in which bay.

> All but one of the necessary parts are stocked by NewEgg believe it or
> not.  The build consists of a 24 bay 2U SuperMicro 920W dual HS PSU,
> SuperMicro dual socket C32 mobo w/5 PCIe 2.0 x8 slots, 2x Opteron 4334
> ...

Thanks for that.  I integrated these into my planning spreadsheet, which incidentally already had 75% of what you spec'ed out.  The main difference is I spec'ed out an Intel-based system, and you used AMD.  Big cost savings by going with AMD, however!

> Total cost today: $16,927.23
> SSD cost: $13,119.98

Looks like you're using the $550 sale price for those 1TB Samsung SSDs.  Normal price is $600.  Newegg usually has a limit of 5 (IIRC) on sale-priced drives.

> maxing out the rest of the hardware so I spec'd 4 ports.  With the
> correct bonding setup you should be able to get between 3-4GB/s.  Still
> only 1/4th - 1/3rd the SSD throughput.

Right.  I might start with just a single dual-port 10gig NIC, and see if I can saturate that.  Let's be pessimistic, and assume I can only wrangle 250 MB/sec out of each SSD.  I'll also designate two hot spares, leaving a 22-drive raid6.  So that's: 250 MB/s * 20 = 5 GB/s.  Now that's not so far away from the 4 GB/sec theoretical with 4x 10gig NICs.

> Hope you at least found this an interesting read, if not actionable.
> Maybe others will as well.  I had some fun putting this one together.  I

Absolutely interesting, thanks again for all the detailed feedback.
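[On the "see if I can saturate that" point, a hedged sketch of how one might emulate the many-parallel-sequential-readers workload with fio; the directory, file size, and job count are assumptions chosen to mimic dozens of processes re-reading ~400MB files, not figures from the thread.]

    # Emulate 48 processes, each sequentially re-reading its own ~400MB
    # file.  Run directly against the md/XFS filesystem to measure the
    # array, or from an NFS client to include the network path.
    fio --name=parallel-seq-readers \
        --directory=/export/data/fio \
        --rw=read --bs=1M --size=400m \
        --numjobs=48 --loops=10 \
        --direct=1 --group_reporting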
> think the only things I omitted were Velcro straps and self stick lock
> loops for tidying up the cables for optimum airflow.  Experienced
> builders usually have these on hand, but I figured I'd mention them just
> in case.

Of course, but why can't I ever find them when I actually need them? :)

Anyway, thanks again for your feedback.  The first roadblock is definitely getting manager buy-in.  He tends to dismiss projects like this because (1) we're not a storage company / we don't DIY servers, (2) why isn't anyone else doing this / why can't you buy an OTS system like this, and (3) even though the cost savings are dramatic, it's still a ~$20k risk: what if I can't get even 50% of the theoretical throughput?  What if those SSDs require constant replacement?  What if there are subtle kernel- or driver-level bugs in "landmine" status, waiting for something like this to expose them?

-Matt

^ permalink raw reply	[flat|nested] 15+ messages in thread
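[On the throughput-risk concern, the first knobs one would typically try on a Linux NFS server before anything exotic are sketched below; the thread count, buffer sizes, and client mount line are illustrative assumptions, not recommendations from the thread.]

    # Server side: more nfsd threads so dozens of clients times dozens
    # of processes aren't serialized behind the default 8 threads.
    rpc.nfsd 64

    # Larger socket buffer ceilings help sustain many parallel TCP
    # streams on 10GbE.
    sysctl -w net.core.rmem_max=16777216
    sysctl -w net.core.wmem_max=16777216

    # Client side (one line per compute node): big read size, TCP,
    # read-only mount.
    # mount -t nfs -o ro,tcp,vers=3,rsize=1048576 server:/export/data /data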
* Re: sequential versus random I/O
  2014-02-03 19:28               ` Matt Garman
@ 2014-02-04 15:16                 ` Stan Hoeppner
  0 siblings, 0 replies; 15+ messages in thread
From: Stan Hoeppner @ 2014-02-04 15:16 UTC (permalink / raw)
To: Matt Garman; +Cc: Mdadm

On 2/3/2014 1:28 PM, Matt Garman wrote:
> On Sat, Feb 1, 2014 at 12:28 PM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
>> I was in lock step with you until this point.  We're talking about SSDs
>> aren't we?  And a read-only workload?  RAID10 today is only for
>> transactional workloads on rust to avoid RMW.  SSD doesn't suffer RMW
>> ...
>
> OK, I think I'm convinced, raid10 isn't appropriate here.  (If I get
> the green light for this, I might still do it in the
> buildup/experimentation stage, just for kicks and grins if nothing
> else.)
>
> So, just to be clear, you've implied that if I have an N-disk raid6,
> then the (theoretical) sequential read throughput is
>     (N-2) * T
> where T is the throughput of a single drive (assuming uniform drives).
> Is this correct?

Should be pretty close to that for parallel streaming reads.

>> If one of your concerns is decreased client throughput during rebuild,
>> then simply turn down the rebuild priority to 50%.  Your rebuild will
>
> The main concern was "high availability".  This isn't like my home
> server, where I use raid as an excuse to de-prioritize my backups. :)
> This is raid for its actual designed purpose: to minimize service
> interruptions in case of failure(s).

The major problem with rust based RAID5/6 arrays is the big throughput hit you take during a rebuild.  Concurrent access causes massive head seeking, slowing everything down, both user IO and rebuild.  This proposed SSD rig has disk throughput that is 4-8x the network throughput, and there are no heads to seek, thus no increased latency nor reduced bandwidth.  You should be able to dial down the rebuild rate by as little as 25% and the NFS throughput shouldn't vary from normal state.  This is the definition of high availability--failures don't affect function or performance.

> The thing is, I think consumer SSDs are still somewhat of an unknown
> entity in terms of reliability, longevity, and failure modes.  Just
> from the SSDs I've dealt with at home (tiny sample size), I've had two
> fail the "bad way": that is, they die and are no longer recognizable
> by the system (neither OS nor BIOS).  Presumably a failure of the SSD's
> controller.

I had one die like that in 2011, after 4 months: a Corsair V32, 1st gen Indilinx drive.

> And with spinning rust, we have decades of experience and
> useful public information like Google's HDD study and Backblaze's
> blog.  SSDs just haven't been out in the wild long enough to have a
> big enough sample size to do similar studies.

As is the case with all new technologies.  Hybrid technology is much newer still, but will probably start being adopted at a much faster pace than pure SSD for most applications.  Speaking of SSHDs, I should have mentioned them sooner, because they're actually a perfect fit for your workload, as you reread the same ~400MB files repeatedly.  Have you considered hybrid SSHD drives?  These Seagate 1TB 2.5" drives have an 8GB SSD cache:

http://www.newegg.com/Product/Product.aspx?Item=N82E16822178340

24 of these yields the same capacity as the pure SSD solution, but at *1/6th* the price per drive, ~$2600 for 24 drives vs ~$15,500.  You'd have an aggregate 192GB of SSD cache per server node and close to 1GB/s of network throughput even when hitting platters instead of cache.
So a single 10GbE connection would be a good fit, and no bonding headaches.  The drives drop into the same chassis.  You'll save $10,000 per chassis.  In essence you'd be duplicating the NetApp's disk + SSD cache setup, but inside each drive.  I worked up the totals; see down below.

...

>> The modular approach has advantages.  But keep in mind that modularity
>> increases complexity and component count, which increase the probability
>> of a failure.  The more vehicles you own the more often one of them is
>> in the shop at any given time, if even only for an oil change.
>
> Good point.  Although if I have more cars than I actually need
> (redundancy), I can afford to always have a car in the shop. ;)

But it requires two vehicles and two people to get the car to the shop and get you back home.  This is the point I was making.  The more complex the infrastructure, the more time/effort required for maintenance.

>> Gluster has advantages here as it can redistribute data automatically
>> among the storage nodes.  If you do distributed mirroring you can take a
>> node completely offline for maintenance, and clients won't skip a beat,
>> or at worst a short beat.  It costs half your storage for the mirroring,
>> but using RAID6 it's still ~33% less than the RAID10 w/3 way mirrors.
>> ...
>> If you're going with multiple identical 24 bay nodes, you want a single
>> 24 drive md/RAID6 in each directly formatted with XFS.  Or Gluster atop
>> XFS.  It's the best approach for your read only workload with large files.
>
> Now that you've convinced me RAID6 is the way to go, and if I can get
> 3 GB/s out of one of these systems, then two of these systems would
> literally double the capability (storage capacity and throughput) of
> our current big iron system.

The challenge will be getting 3GB/s.  You may spend weeks, maybe months, in testing and development work to achieve it.  I can't say, as I've never tried this.  Getting close to 1GB/s from one interface is much easier.  This fact, and cost, make the SSHD solution much, much more attractive.

> What would be ideal is to use something
> like Gluster to add a third system for redundancy, and have a "raid 5"
> at the server level.  I.e., same storage capacity of two systems, but
> one whole node could go down without losing service availability.  I
> have no experience with cluster filesystems, however, so this presents
> another risk vector.

Read up on Gluster and its replication capabilities.  I say "DFS" because Gluster is a distributed filesystem; a cluster filesystem, or "CFS", is a completely different technology.

>> I'm firmly an AMD guy.
>
> Any reason for that?

We've seen ample examples in the US of what happens with a monopolist.  Prices increase and innovation decreases.  If AMD goes bankrupt or simply exits the desktop/server x86 CPU market, then Chipzilla has a monopoly on x86 desktop/server CPUs.  They nearly do now simply based on market share.  AMD still makes plenty of capable CPUs, chipsets, etc, and at a lower cost.  Intel chips may have superior performance at the moment, but AMD was superior for half a decade.  As long as AMD has a remotely competitive offering I'll support them with my business.  I don't want to be at the mercy of a monopolist.

> Do the latest AMD server chips include the PCIe controller on-chip
> like the Sandy Bridge and newer Intel chips?  Or does AMD still put
> the PCIe controller on a separate chip (a northbridge)?
>
> Just wondering if having dual on-CPU-die PCIe controllers is an
> advantage here (assuming a dual-socket system).
> I agree with you, CPU core count and clock aren't terribly important,
> it's all about being able to extract maximum I/O from basically every
> other component in the system.

Adding PCIe interfaces to the CPU die eliminates the need for an IO support chip, simplifying board design and testing, and freeing up board real estate.  This is good for large NUMA systems, such as SGI's Altix UV, which contain dozens or hundreds of CPU boards.  It does not increase PCIe channel throughput, though it does lower latency by a few nanoseconds.  There may be a small but noticeable gain here for HPC applications sending MPI messages over PCIe Infiniband HCAs, but not for any other device connected via PCIe.  Storage IO is typically not latency bound and is always pipelined, so latency is largely irrelevant.

>> sideband signaling SAS cables should enable you to make drive failure
>> LEDs work with mdadm, using:
>> http://sourceforge.net/projects/ledmon/
>>
>> I've not tried the software myself, but if it's up to par, dead drive
>> identification should work the same as with any vendor storage array,
>> which to this point has been nearly impossible with md arrays using
>> plain non-RAID HBAs.
>
> Ha, that's nice.  In my home server, which is idle 99% of the time,
> I've identified drives by simply doing a "dd if=/dev/target/drive
> of=/dev/null" and looking for the drive that lights up.  Although,
> I've noticed some drives (Samsung) don't even light up when I do that.

It's always good to have a fallback position.  This is another thing you have to integrate yourself.  Part of the "DIY" thing.

...

>> All but one of the necessary parts are stocked by NewEgg believe it or
>> not.  The build consists of a 24 bay 2U SuperMicro 920W dual HS PSU,
>> SuperMicro dual socket C32 mobo w/5 PCIe 2.0 x8 slots, 2x Opteron 4334
>> ...
>
> Thanks for that.  I integrated these into my planning spreadsheet,
> which incidentally already had 75% of what you spec'ed out.  The main
> difference is I spec'ed out an Intel-based system, and you used AMD.
> Big cost savings by going with AMD, however!
>
>> Total cost today: $16,927.23

Corrected total:  $19,384.10

>> SSD cost: $13,119.98

Corrected SSD cost:  $15,927.34
SSHD system:         $ 6,238.50
Savings:             $13,145.60

Specs same as before, but with one dual port 10GbE NIC and 26x Seagate 1TB 2.5" SSHDs displacing the Samsung SSDs.  These drives target the laptop market.  As such they are built for vibration and should fare well in a multi-drive chassis.

$6,300 may be more palatable to the boss for an experimental development system.  It shouldn't be difficult to reach maximum potential throughput of the 10GbE interface with a little tweaking.  Your time to proof of concept should be minimal.  Once proven, you could put it into limited production with a subset of the data to see how the drives stand up to continuous use.  If it holds up for a month, purchase components for another 4 units for ~$25,000.  Put 3 nodes into production for 4 total, and keep the other set of parts as spares for the 4 production units, since consumer parts availability is volatile, even on a 6 month time scale.  You'll have ~$32,000 in the total system.

Once you've racked the 3 systems and burned them in, install and configure Gluster and load your datasets.  By then you'll know Gluster well: how to spread data for load balancing, configure fault tolerance, etc.  You'll have the cheap node concept you originally mentioned.  You should be able to get close to 4GB/s out of the 4 node farm, and scale up by ~1GB/s with each future node.
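[For the Gluster step, a minimal sketch of a distributed-replicated volume across four such nodes might look like the following; hostnames, brick paths, and the replica count are illustrative assumptions, and the right layout depends on how much node-level redundancy is actually wanted.]

    # Run from node1; each node exports its md/RAID6-backed XFS
    # filesystem as a brick.  "replica 2" pairs the bricks so one node
    # in each pair can go offline without interrupting clients.
    gluster peer probe node2
    gluster peer probe node3
    gluster peer probe node4

    gluster volume create bigfiles replica 2 transport tcp \
        node1:/export/data/brick node2:/export/data/brick \
        node3:/export/data/brick node4:/export/data/brick
    gluster volume start bigfiles

    # Clients mount the volume with the native FUSE client:
    # mount -t glusterfs node1:/bigfiles /data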
> Looks like you're using the $550 sale price for those 1TB Samsung
> SSDs.  Normal price is $600.  Newegg usually has a limit of 5 (IIRC)
> on sale-priced drives.

I didn't look closely enough.  It's actually $656.14; I've corrected all the figures above.

>> maxing out the rest of the hardware so I spec'd 4 ports.  With the
>> correct bonding setup you should be able to get between 3-4GB/s.  Still
>> only 1/4th - 1/3rd the SSD throughput.
>
> Right.  I might start with just a single dual-port 10gig NIC, and see
> if I can saturate that.  Let's be pessimistic, and assume I can only
> wrangle 250 MB/sec out of each SSD.  I'll also designate two hot
> spares, leaving a 22-drive raid6.  So that's: 250 MB/s * 20 = 5 GB/s.
> Now that's not so far away from the 4 GB/sec theoretical with 4x 10gig
> NICs.

You'll get near full read bandwidth from the SSDs without any problems.  That's not an issue.  The problem will likely be getting 3-4GB/s of NFS/TCP throughput from your bonded stack.  The one thing in your favor is that you only need transmit load balancing for your workload, which is much easier to do than receive load balancing.

>> Hope you at least found this an interesting read, if not actionable.
>> Maybe others will as well.  I had some fun putting this one together.  I
>
> Absolutely interesting, thanks again for all the detailed feedback.

They don't call me "HardwareFreak" for nothin'. :)

...

> Anyway, thanks again for your feedback.  The first roadblock is
> definitely getting manager buy-in.  He tends to dismiss projects like
> this because (1) we're not a storage company / we don't DIY servers,
> (2) why isn't anyone else doing this / why can't you buy an OTS system
> like this, and (3) even though the cost savings are dramatic, it's still
> a ~$20k risk: what if I can't get even 50% of the theoretical
> throughput?  What if those SSDs require constant replacement?  What if
> there are subtle kernel- or driver-level bugs in "landmine" status,
> waiting for something like this to expose them?

(1) I'm not an HVAC contractor nor an electrician, but I rewired my entire house and replaced the HVAC system, including all new duct work.  I did it because I know how, and it saved me ~$10,000.  And the results are better than if I'd hired a contractor.  If you can do something yourself at lower cost and higher quality, do so.

(2) Because an OTS "system" is not a DIY system.  You're paying for expertise and support more than for the COTS gear.  Hardware at the wholesale OEM level is inexpensive.  When you buy a NetApp, their unit cost from the supplier is less than 1/4th what you pay NetApp for the hardware.  The rest is profit, R&D, customer support, employee overhead, etc.  When you buy hardware for a DIY build, you're buying the hardware and paying 10-20% profit to the wholesaler, depending on the item.

(3) The bulk of storage systems on the market today use embedded Linux, so any kernel or driver level bugs that may affect a DIY system will also affect such vendor solutions.

The risks boil down to one thing: competence.  If your staff is competent, your risk is extremely low.  Your boss has competent staff.  The problem with most management is they know they can buy X for Y cost from company Z and get some kind of guarantee for paying cost Y.  They feel they have "assurance" that things will just work.  We all know from experience, journals, and word of mouth that one can spend $100K to millions on hardware or software and/or "expert" consultants, and a year later it still doesn't work right.  There are no real guarantees.
Frankly, I'd much rather do everything myself, because I can, and have complete control of it.  That's a much better guarantee for me than any contract or SLA a vendor could ever provide.

-- 
Stan

^ permalink raw reply	[flat|nested] 15+ messages in thread
end of thread, other threads:[~2014-02-04 15:16 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-01-29 17:23 sequential versus random I/O Matt Garman
2014-01-30  0:10 ` Adam Goryachev
2014-01-30  0:41 ` Roberto Spadim
2014-01-30  0:45 ` Roberto Spadim
2014-01-30  0:58 ` Roberto Spadim
2014-01-30  1:03 ` Roberto Spadim
2014-01-30  1:18 ` Roberto Spadim
2014-01-30  2:38 ` Stan Hoeppner
2014-01-30  3:20 ` Matt Garman
2014-01-30  4:10 ` Roberto Spadim
2014-01-30 10:22 ` Stan Hoeppner
2014-01-30 15:28 ` Matt Garman
2014-02-01 18:28 ` Stan Hoeppner
2014-02-03 19:28 ` Matt Garman
2014-02-04 15:16 ` Stan Hoeppner