* Thoughts on big SSD arrays?
@ 2015-07-31 15:23 Matt Garman
From: Matt Garman @ 2015-07-31 15:23 UTC (permalink / raw)
To: Mdadm
Every few years I reprise this topic on this mailing list[1], [2].
Basically I'm just brainstorming what is possible on the DIY front
versus purchased solutions from a traditional "big iron" storage
vendor. Our particular use case is "ultra-high parallel sequential
read throughput". Our workload is effectively WORM: we do a small
daily incremental write, and then the rest of the time it's constant
re-reading of the data. Literally a 99:1 read:write ratio.
I continue to be inspired by the "Dirt Cheap Data Warehouse (DCDW)"
[3]. SSDs are getting bigger and prices are dropping rapidly (2 TB
SSDs available now for $800). With our WORM-like workload, I believe
we can safely get away with consumer drives, as durability shouldn't
be an issue.
So at this point I'm just putting out a feeler---has anyone out there
actually built a massive SSD array, using either Linux software raid
or hardware raid (technically off-topic for this list, though I hope
the discussion is interesting enough to let it slide). If so, how big
of an array (i.e. drives/capacity)? What was the target versus actual
performance? Any particularly challenging issues that came up?
FWIW, I'm thinking of something along the lines of a 24-disk chassis,
with 2 disks for OS (raid1), 2 disks as hot spares, and the remaining
20 in raid-6. The 22 data disks (raid + hot spares) would be 2 TB
SSDs.
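Concretely, I imagine the array creation would look something like this
(device names and chunk size are placeholders, not a tested recipe):

    # 20 active drives plus 2 hot spares in one RAID 6 set; sd[c-x] = 22 devices
    mdadm --create /dev/md0 --level=6 --raid-devices=20 --spare-devices=2 \
          --chunk=128 /dev/sd[c-x]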
The "problem" with SSDs is that they're just so seductive:
back-of-the-envelope numbers are wonderful, so it's easy to get
overly-optimistic about builds that use them. But as with most
things, the devil's in the details.
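To put rough numbers on that temptation (ballpark figures only, assuming
~500 MB/s sequential read per SATA SSD):

    20 drives x ~500 MB/s          = ~10 GB/s raw read bandwidth
    SAS 6Gb/s x4 expander uplink   = ~2.2 GB/s usable
    PCIe 2.0 x8 HBA slot           = ~3-4 GB/s
    one 10GbE port                 = ~1.25 GB/s line rate

i.e. almost every link in the chain, not the drives, becomes the ceiling.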
Off the top of my head, potential issues I can think of:
- Subtle PCIe latency/timing issues of the motherboard
- High variation in SSD latency
- Software stacks still making assumptions based on spinning
drives (i.e. not adequately tuned for SSDs; see the tuning sketch
after this list)
- Non-parallel RAID implementation (i.e. single CPU bottleneck potential)
- Potential bandwidth bottlenecks at various stages: SATA/SAS
interface, SAS expander/backplane, SATA/SAS controller (or HBA), PCIe
bus, CPU memory bus, network card, etc.
- I forget the exact number, but the DCDW guy told me with Linux
he was only able to get about 30% of the predicted throughput in his
SSD array
- Wacky TRIM related issues (seem to be drive dependent)
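On the "tuned for SSDs" point above, the per-device queue knobs are where
I'd expect to start; a rough sketch (values are starting points only, sdX
is each member drive):

    echo noop > /sys/block/sdX/queue/scheduler   # or deadline; skip the rotational heuristics
    echo 0 > /sys/block/sdX/queue/add_random     # don't feed the entropy pool from SSD completions
    echo 2 > /sys/block/sdX/queue/rq_affinity    # complete I/O on the submitting CPU
    echo 1024 > /sys/block/sdX/queue/nr_requests # deeper software queue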
Not asking any particular question here, just hoping to start an
open-ended discussion. Of course I'd love to hear from anyone with
actual SSD RAID experience!
Thanks,
Matt
[1] "high throughput storage server?", Feb 14, 2011
http://marc.info/?l=linux-raid&m=129772818924753&w=2
[2] "high read throughput storage server, take 2"
http://marc.info/?l=linux-raid&m=138359009013781&w=2
[3] "The Dirt Cheap Data Warehouse"
http://www.openida.com/the-dirt-cheap-data-warehouse-an-introduction/
* Re: Thoughts on big SSD arrays?
From: Pasi Kärkkäinen @ 2015-08-01 8:34 UTC (permalink / raw)
To: Matt Garman; +Cc: Mdadm
On Fri, Jul 31, 2015 at 10:23:26AM -0500, Matt Garman wrote:
> Every few years I reprise this topic on this mailing list[1], [2].
> Basically I'm just brainstorming what is possible on the DIY front
> versus purchased solutions from a traditional "big iron" storage
> vendor. Our particular use case is "ultra-high parallel sequential
> read throughput". Our workload is effectively WORM: we do a small
> daily incremental write, and then the rest of the time it's constant
> re-reading of the data. Literally a 99:1 read:write ratio.
>
> I continue to be inspired by the "Dirt Cheap Data Warehouse (DCDW)"
> [3]. SSDs are getting bigger and prices are dropping rapidly (2 TB
> SSDs available now for $800). With our WORM-like workload, I believe
> we can safely get away with consumer drives, as durability shouldn't
> be an issue.
>
> So at this point I'm just putting out a feeler---has anyone out there
> actually built a massive SSD array, using either Linux software raid
> or hardware raid (technically off-topic for this list, though I hope
> the discussion is interesting enough to let it slide). If so, how big
> of an array (i.e. drives/capacity)? What was the target versus actual
> performance? Any particularly challenging issues that came up?
>
> FWIW, I'm thinking of something along the lines of a 24-disk chassis,
> with 2 disks for OS (raid1), 2 disks as hot spares, and the remaining
> 20 in raid-6. The 22 data disks (raid + hot spares) would be 2 TB
> SSDs.
>
Also remember RAID rebuilds after SSD failures... with 20 disks in the same
RAID 6 set, you'll have a lot of reads going on during the rebuild :)
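Ballpark: rebuilding one failed 2 TB member means reading roughly 18-19 x
2 TB (about 36-38 TB) from the surviving drives while writing 2 TB to the
replacement. Even at a few hundred MB/s of sustained rebuild rate that is a
couple of hours with every disk in the set busy, on top of the production
reads.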
-- Pasi
> The "problem" with SSDs is that they're just so seductive:
> back-of-the-envelope numbers are wonderful, so it's easy to get
> overly-optimistic about builds that use them. But as with most
> things, the devil's in the details.
>
> Off the top of my head, potential issues I can think of:
>
> - Subtle PCIe latency/timing issues of the motherboard
> - High variation in SSD latency
> - Software stacks still making assumptions based on spinning
> drives (i.e. not adequately tuned for SSDs)
> - Non-parallel RAID implementation (i.e. single CPU bottleneck potential)
> - Potential bandwidth bottlenecks at various stages: SATA/SAS
> interface, SAS expander/backplane, SATA/SAS controller (or HBA), PCIe
> bus, CPU memory bus, network card, etc
> - I forget the exact number, but the DCDW guy told me with Linux
> he was only able to get about 30% of the predicted throughput in his
> SSD array
> - Wacky TRIM related issues (seem to be drive dependent)
>
> Not asking any particular question here, just hoping to start an
> open-ended discussion. Of course I'd love to hear from anyone with
> actual SSD RAID experience!
>
> Thanks,
> Matt
>
>
> [1] "high throughput storage server?", Feb 14, 2011
> http://marc.info/?l=linux-raid&m=129772818924753&w=2
>
> [2] "high read throughput storage server, take 2"
> http://marc.info/?l=linux-raid&m=138359009013781&w=2
>
> [3] "The Dirt Cheap Data Warehouse"
> http://www.openida.com/the-dirt-cheap-data-warehouse-an-introduction/
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: Thoughts on big SSD arrays?
From: Markus Stockhausen @ 2015-08-03 10:52 UTC (permalink / raw)
To: Matt Garman, Mdadm
> From: linux-raid-owner@vger.kernel.org [linux-raid-owner@vger.kernel.org] on behalf of Matt Garman [matthew.garman@gmail.com]
> Sent: Friday, 31 July 2015 17:23
> To: Mdadm
> Subject: Thoughts on big SSD arrays?
>
> Every few years I reprise this topic on this mailing list[1], [2].
> Basically I'm just brainstorming what is possible on the DIY front
> versus purchased solutions from a traditional "big iron" storage
> vendor. Our particular use case is "ultra-high parallel sequential
> read throughput". Our workload is effectively WORM: we do a small
> daily incremental write, and then the rest of the time it's constant
> re-reading of the data. Literally a 99:1 read:write ratio.
>
> I continue to be inspired by the "Dirt Cheap Data Warehouse (DCDW)"
> [3]. SSDs are getting bigger and prices are dropping rapidly (2 TB
> SSDs available now for $800). With our WORM-like workload, I believe
> we can safely get away with consumer drives, as durability shouldn't
> be an issue.
>
> So at this point I'm just putting out a feeler---has anyone out there
> actually built a massive SSD array, using either Linux software raid
> or hardware raid (technically off-topic for this list, though I hope
> the discussion is interesting enough to let it slide). If so, how big
> of an array (i.e. drives/capacity)? What was the target versus actual
> performance? Any particularly challenging issues that came up?
Hi Matt,
maybe not a 100% match... We wanted to set up a 22-HDD md RAID 6 last
year for one of our storage servers. Doing some tests we experienced
slow random write I/O and got discouraged, so we headed over to
hardware RAID.
To get a clearer picture we did further analysis. The bottleneck arose
from two sources: md issued only 4K I/Os, and it always did a reconstruct
write. Thus massive write amplification. Several patches in 4.1 mitigate
the situation. In the meantime I have come to like a lot of things in md
RAID 4/5/6; especially the parity calculation benefits massively from
high-powered CPUs.
Back to the reason for my explanation: the biggest "area under construction"
in md RAID 4/5/6 is the internal stripe cache handling. Allocating, flushing
and freeing stripes is based on solid but untuned algorithms (like LRU).
IIRC it does not even have any scheduling or queueing. That could become
your bottleneck. Maybe massive CPU power easily compensates for the gap.
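If you want to see how much headroom there is, the stripe cache is tunable
per array (the value is the number of cache entries, default 256; memory use
is roughly entries x page size x number of member devices, so raise it with
care):

    echo 8192 > /sys/block/md0/md/stripe_cache_size
    dmesg | grep raid6     # parity throughput the kernel measured at module load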
That said, your read-mostly setup might avoid a lot of these headaches, but
you need to push it to the limits yourself. I do not see benchmarks on the
net that closely fit your case, so I advise enabling ramdisks in the Linux
kernel and building a RAID 6 on top of /dev/ramX. A synthetic test matching
your workload should give an idea of the scalability and overhead of md RAID.
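A minimal sketch of such a test, assuming the brd module is available (the
ramdisk sizes and the fio job are placeholders to adapt to your workload and
available RAM):

    modprobe brd rd_nr=20 rd_size=1048576    # 20 ramdisks of 1 GiB each (rd_size is in KiB)
    mdadm --create /dev/md10 --level=6 --raid-devices=20 --assume-clean /dev/ram{0..19}
    fio --name=seqread --filename=/dev/md10 --rw=read --bs=1M --direct=1 \
        --ioengine=libaio --iodepth=32 --numjobs=8 --runtime=30 --group_reporting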
Good luck.
Markus
* Re: Thoughts on big SSD arrays?
From: Adam Goryachev @ 2015-08-03 11:38 UTC (permalink / raw)
To: Matt Garman, Mdadm
On 1/08/2015 01:23, Matt Garman wrote:
> I continue to be inspired by the "Dirt Cheap Data Warehouse (DCDW)"
> [3]. SSDs are getting bigger and prices are dropping rapidly (2 TB
> SSDs available now for $800). With our WORM-like workload, I believe
> we can safely get away with consumer drives, as durability shouldn't
> be an issue.
>
> So at this point I'm just putting out a feeler---has anyone out there
> actually built a massive SSD array, using either Linux software raid
> or hardware raid (technically off-topic for this list, though I hope
> the discussion is interesting enough to let it slide). If so, how big
> of an array (i.e. drives/capacity)? What was the target versus actual
> performance? Any particularly challenging issues that came up?
I have been using an 8x 480GB RAID 5 Linux md array for an iSCSI SAN for a
number of years, and it has worked well after some careful tuning and some
careful (lucky) hardware selection (i.e. the motherboard happened to have
the right bandwidth on the memory/PCI bus/etc.).
The main challenge I had was actually with DRBD on top of the array; once I
disabled the forced writes (flushes), it all worked really well. The flushes
were forcing a consistent on-disk state for every write, since the SSDs I
use have no power-loss protection and so cannot preserve in-flight data
during a power outage.
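For reference, the settings in question are DRBD's flush options; with
8.4-style syntax it is roughly the following (check the option names against
your DRBD version, and be aware of the consistency you give up on power
loss):

    disk {
        disk-flushes no;
        md-flushes no;
    }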
> FWIW, I'm thinking of something along the lines of a 24-disk chassis,
> with 2 disks for OS (raid1), 2 disks as hot spares, and the remaining
> 20 in raid-6. The 22 data disks (raid + hot spares) would be 2 TB
> SSDs.
I'm not sure that sounds like a good idea. Personally, I'd probably prefer
to use at least 2 x RAID 6 arrays, but that is just based on the advice I
see on this list. Using two arrays will also get you more parallel
processing (more CPU cores in use), as I think you are limited to one CPU
per array.
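If you go that way, one obvious layout for a 24-bay box is two 10-drive
RAID 6 sets striped together (device names are placeholders; whether you put
RAID 0 or LVM on top is a matter of how you want to carve up the space):

    mdadm --create /dev/md1 --level=6 --raid-devices=10 /dev/sd[c-l]
    mdadm --create /dev/md2 --level=6 --raid-devices=10 /dev/sd[m-v]
    mdadm --create /dev/md3 --level=0 --raid-devices=2 /dev/md1 /dev/md2

Each RAID 6 array gets its own md kernel thread, which is where the extra
CPU parallelism comes from.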
> The "problem" with SSDs is that they're just so seductive:
> back-of-the-envelope numbers are wonderful, so it's easy to get
> overly-optimistic about builds that use them. But as with most
> things, the devil's in the details.
I was able to get 2.5GB/s read and 1.5GB/s write with (I think) only 6
SSDs in RAID 5. However, when I eventually ran the correct test to match my
actual load, that dropped to abysmal values (well under 100MB/s). The
reason is that my live load uses very small read/write block sizes, so
there were a massive number of small random reads/writes, leading to high
IOPS. Using large block sizes can deliver massive throughput with only a
small number of IOPS.
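That difference is easy to reproduce with fio (the device name is a
placeholder; point it at a test array, not live data, since the random job
writes):

    # large sequential reads: the flattering number
    fio --name=big --filename=/dev/md0 --rw=read --bs=1M --direct=1 \
        --ioengine=libaio --iodepth=32 --runtime=30
    # small random mixed I/O: much closer to my real load
    fio --name=small --filename=/dev/md0 --rw=randrw --rwmixread=70 --bs=4k \
        --direct=1 --ioengine=libaio --iodepth=32 --runtime=30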
> Off the top of my head, potential issues I can think of:
>
> - Subtle PCIe latency/timing issues of the motherboard
From memory, this can include the amount of bandwidth between
memory/CPU/PCI bus/SATA bus/etc., with the speed of the RAM being just one
of the factors. I don't know all the tricky details, but I do recall that
while the bandwidth looks plenty fast enough at first, the data moves over
a number of bridges, and sometimes over the same bridge more than once
(e.g. the disk interface and network interface might be on the same
bridge).
> - High variation in SSD latency
> - Software stacks still making assumptions based on spinning
> drives (i.e. not adequately tuned for SSDs)
> - Non-parallel RAID implementation (i.e. single CPU bottleneck potential)
> - Potential bandwidth bottlenecks at various stages: SATA/SAS
> interface, SAS expander/backplane, SATA/SAS controller (or HBA), PCIe
> bus, CPU memory bus, network card, etc
> - I forget the exact number, but the DCDW guy told me with Linux
> he was only able to get about 30% of the predicted throughput in his
> SSD array
I got close to the theoretical maximum (from memory), but it depended on
the actual real-life workload. Those theoretical performance values are
only achieved under "optimal" conditions; real life is often a lot messier.
> - Wacky TRIM related issues (seem to be drive dependent)
If you are mostly reading, then TRIM shouldn't be much of an issue for you.
> Not asking any particular question here, just hoping to start an
> open-ended discussion. Of course I'd love to hear from anyone with
> actual SSD RAID experience!
>
My experience has been positive. BTW, I'm using Intel 480GB SSDs
(basically the consumer-grade 520/530 series). If you want any extra
information/details, let me know.
Regards,
Adam