* Intel SSD or other brands
@ 2016-12-29 2:14 Adam Goryachev
2016-12-29 11:39 ` Peter Grandi
` (2 more replies)
0 siblings, 3 replies; 15+ messages in thread
From: Adam Goryachev @ 2016-12-29 2:14 UTC (permalink / raw)
To: linux-raid@vger.kernel.org
Apologies for my prematurely sent email (if it gets through), this one
is complete...
Hi all,
I've spent a number of years trying to build up a nice RAID array for my
SAN, but I seem to be slowly solving one bottleneck only to find
another one. Right now, I've identified the underlying SSDs as being a
major factor in that performance issue.
I started with 5 x 480GB Intel 520s SSDs in a RAID5 array, and this
performed really well.
I added 1 x 480GB 530s SSD
I added 2 x 480GB 530s SSD
I have now found that a 520s SSD is around 180 times faster than a
530s SSD. I had to run many tests, but eventually I found the right
things to test for (which matched my real life results), and the
numbers were nothing short of crazy.
Running each test 5 times and averaging the results...
520s: 70MB/s
530s: 0.4MB/s
OK, so before I could remove and test the 520s, I removed/tested one of
the 530s and saw the horrible performance, so I bought and tested a 540s
and found:
540s: 6.7MB/s
That's around 20 times better than the 530, so I replaced all the
drives with 540s, but I still have worse performance than the original
5 x 520s array.
Working with Intel, they swapped a 530s drive for a DC3510, and I then
found the DC3510 was awesome:
DC3510: 99MB/s
Except, a few weeks back when I placed the order, I was told there is no
longer any stock of this drive (I wanted 16 of the 800GB model), and that
the replacement model is the DC3520. I figured I shouldn't just blindly
buy the DC3520 assuming its performance would be similar to the previous
model, so I bought 4 x 480GB DC3520 and started testing.
DC3520: 37MB/s
So, about 1/3rd the speed of a DC3510: still better than the current
live 540s drives, but also still only half the speed of the original
520s drives.
Summary:
520s: 70217kB/s
530s: 391kB/s
540s: 6712kB/s
330s: 24kB/s
DC3510: 99313kB/s
DC3520: 37051kB/s
WD2TBCB: 475kB/s
* For comparison, I had an older Western Digital Black 2TB spare, and ran
the same test on it. It got a better result than some of the SSDs, which
was really surprising, but it's certainly not an option.
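Since the test is 4 KiB synchronous writes at queue depth 1, these throughput figures map directly to sync-write IOPS (throughput divided by the block size). A quick sketch of that conversion, using the summary numbers above:

```python
# Convert the 4 KiB sync-write throughput figures from the summary into IOPS.
# Values are the kB/s results reported above; at bs=4k, IOPS = kB/s / 4.
results_kb_s = {
    "520s": 70217,
    "530s": 391,
    "540s": 6712,
    "330s": 24,
    "DC3510": 99313,
    "DC3520": 37051,
    "WD2TBCB": 475,
}

BLOCK_KB = 4  # fio --bs=4k

iops = {drive: kb / BLOCK_KB for drive, kb in results_kb_s.items()}
for drive, v in sorted(iops.items(), key=lambda kv: -kv[1]):
    print(f"{drive:8s} {v:10.0f} IOPS")
```

This puts the 520s at roughly 17,500 sync-write IOPS and the 530s at under 100, which makes the scale of the gap easier to see than MB/s does.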
FYI, the test I'm running is this:
fio --filename=/dev/sdb --direct=1 --sync=1 --rw=write --bs=4k
--iodepth=1 --runtime=60 --time_based --group_reporting
--name=IntelDC3510_4kj1 --numjobs=1
All drives were tested on the same machine/SATA3 port (a basic Intel
desktop motherboard), with nothing on the drive (no fs, no partition,
nothing trying to access it, etc.).
In reality, I tested iodepth from 1..10, but in my use case,
iodepth=1 is the relevant number. At higher iodepth we see
performance on all the drives improve; if interested, I can provide a
full set of my results/analysis.
So, my actual question... Can you suggest or have you tested any Intel
(or other brand) SSD which has good performance (similar to the DC3510
or the 520s)? (I can't buy and test every single variant out there, my
budget doesn't go anywhere close to that).
It needs to be SATA, since I don't have enough PCIe slots to get the
needed capacity (nor enough budget). I need around 8 x drives with
around 6TB capacity in RAID5.
FYI, my storage stack is like this:
8 x SSDs
mdadm - RAID5
LVM
DRBD
iSCSI
From my understanding, it is DRBD that makes everything an iodepth=1
issue. It is possible to reach iodepth=2 if I have 2 x VMs both doing a
lot of IO at the same time, but it is usually single-VM performance that
is too limited.
Regards,
Adam
--
Adam Goryachev Website Managers www.websitemanagers.com.au
^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Intel SSD or other brands
  2016-12-29  2:14 Intel SSD or other brands Adam Goryachev
@ 2016-12-29 11:39 ` Peter Grandi
  2016-12-29 14:35   ` Adam Goryachev
  2016-12-29 16:56 ` Robert LeBlanc
  2016-12-29 18:50 ` Doug Dumitru
  2 siblings, 1 reply; 15+ messages in thread
From: Peter Grandi @ 2016-12-29 11:39 UTC (permalink / raw)
To: Linux RAID

> [ ... ] I now found out that performance of a 520s SSD is
> around 180 times faster than a 530s SSD. [ ... ]

Well "performance" can be roughly the same, even if "speed" can
be very different.

http://www.sabi.co.uk/blog/15-two.html#151023

> 520s: 70MB/s
> 530s: 0.4MB/s
> ....
> fio --filename=/dev/sdb --direct=1 --sync=1 --rw=write --bs=4k
> --iodepth=1 --runtime=60 --time_based --group_reporting
> --name=IntelDC3510_4kj1 --numjobs=1

Arguably the 520s actually have no performance.

https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
* Re: Intel SSD or other brands
  2016-12-29 11:39 ` Peter Grandi
@ 2016-12-29 14:35   ` Adam Goryachev
  2016-12-29 17:46     ` Peter Grandi
  0 siblings, 1 reply; 15+ messages in thread
From: Adam Goryachev @ 2016-12-29 14:35 UTC (permalink / raw)
To: Peter Grandi, Linux RAID

On 29/12/16 22:39, Peter Grandi wrote:
>> [ ... ] I now found out that performance of a 520s SSD is
>> around 180 times faster than a 530s SSD. [ ... ]
> Well "performance" can be roughly the same, even if "speed" can
> be very different.
>
> http://www.sabi.co.uk/blog/15-two.html#151023

I'm not entirely sure what you mean to say here. I have a reasonably
well defined real life workload (i.e. single threaded, small random
writes), and I am measuring the same statistics across multiple
devices and comparing those numbers. In addition, replacing the
devices with others that showed an improvement during testing produced
an equivalent improvement (a reduction) in end user complaints on the
real life system. So I feel reasonably sure that I am "on the right
track"... Am I overlooking something else (very possible)?

>> 520s: 70MB/s
>> 530s: 0.4MB/s
> ....
>> fio --filename=/dev/sdb --direct=1 --sync=1 --rw=write --bs=4k
>> --iodepth=1 --runtime=60 --time_based --group_reporting
>> --name=IntelDC3510_4kj1 --numjobs=1
> Arguably the 520s actually have no performance.
>
> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/

I have seen this page, and all I can suggest is that the 480GB 520s
performs very differently from the 60GB model: I see 70MB/s, which is
significantly different from the listed 9MB/s on that page. This page
matches my results (comparatively) in that the 520 performs much
better than the 535, though I don't have easy access to a 535 in
order to run my destructive tests... but I did run similar tests on a
535, and I ran the more thorough tests on the 530 and 540.

Regards,
Adam
* Re: Intel SSD or other brands
  2016-12-29 14:35 ` Adam Goryachev
@ 2016-12-29 17:46   ` Peter Grandi
  0 siblings, 0 replies; 15+ messages in thread
From: Peter Grandi @ 2016-12-29 17:46 UTC (permalink / raw)
To: Linux RAID

>> Well "performance" can be roughly the same, even if "speed" can
>> be very different. http://www.sabi.co.uk/blog/15-two.html#151023
> [ ... ] what you mean to say here, [ ... ]

Some people may say that 'eatmydata $COMMAND' can improve a lot
the "performance" of running '$COMMAND'. More properly it can
improve its speed, but its performance arguably goes to zero, or
more precisely becomes insignificant, in most cases. That's a
pretty huge difference.
* Re: Intel SSD or other brands
  2016-12-29  2:14 Intel SSD or other brands Adam Goryachev
  2016-12-29 11:39 ` Peter Grandi
@ 2016-12-29 16:56 ` Robert LeBlanc
  2016-12-29 17:33   ` Peter Grandi
  2016-12-29 23:04   ` Adam Goryachev
  1 sibling, 2 replies; 15+ messages in thread
From: Robert LeBlanc @ 2016-12-29 16:56 UTC (permalink / raw)
To: Adam Goryachev; +Cc: linux-raid@vger.kernel.org

This is a similar workload to Ceph, and you may find more information
on their mailing lists. When I was working with Ceph about a year ago,
we tested a bunch of SSDs and found that sync=1 really differentiates
drives: you really find out which drives are better. In our testing,
we found that the 35xx, 36xx, and 37xx drives handled the workloads
the best. The 3x00 drives were close to EOL, so we focused on the 3x10
drives. I don't have the data anymore, but the 3610 had the best
performance, the 3710 had the best data integrity in the case of power
failure, and the 3510 had the best price. The 3510 had about ~0.1
drive writes per day (DWPD), the 3610 ~1 DWPD, and the 3710 ~3 DWPD.
Due to the fault tolerance of Ceph, we felt comfortable with the
3610s.

In our testing, we exceeded the performance numbers listed for the
drives on their data sheets when running up to 8 jobs, even with
sync=1, which no other manufacturer did. For Ceph, we could put
multiple OSDs on a disk and take advantage of this performance gain.
You may be able to do something similar by partitioning your RAID 5
and putting multiple DRBDs on it.

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1

On Wed, Dec 28, 2016 at 7:14 PM, Adam Goryachev
<mailinglists@websitemanagers.com.au> wrote:
> [ ... ]
* Re: Intel SSD or other brands
  2016-12-29 16:56 ` Robert LeBlanc
@ 2016-12-29 17:33   ` Peter Grandi
  2016-12-29 18:37     ` Peter Grandi
  1 sibling, 1 reply; 15+ messages in thread
From: Peter Grandi @ 2016-12-29 17:33 UTC (permalink / raw)
To: Linux RAID

> [ ... ] sync=1 really differentiates drives and you really
> find which drives are better. [ ... ]

It is not necessarily "better" in a strict sense: flash SSD devices
with supercapacitor-backed persistent caches can be much faster on
'fsync'-heavy workloads, but also cost a lot more (probably mostly
because of market segmentation). This matters especially on a RAID5
set with lots of read-modify-write. It is a different performance
envelope, not necessarily a "better" one. If one does not need
small-write speed then cheaper drives are more appropriate.

However, devices which don't have persistent caches and still show
high 'sync=1'/'direct=1' speed because they don't implement 'fsync'
synchronously are definitely worse, in the sense of having arguably
no performance at all. Some manufacturers think that using an SLC
cache helps without a persisted RAM cache, but the persisted RAM
seems a lot better to me.
* Re: Intel SSD or other brands
  2016-12-29 17:33 ` Peter Grandi
@ 2016-12-29 18:37   ` Peter Grandi
  0 siblings, 0 replies; 15+ messages in thread
From: Peter Grandi @ 2016-12-29 18:37 UTC (permalink / raw)
To: Linux RAID

>> [ ... ] sync=1 really differentiates drives and you really
>> find which drives are better. [ ... ]
> It is not necessarily "better" in a strict sense: flash SSD
> devices with supercapacitor-backed persistent caches [ ... ]

http://www.storagereview.com/images/Samsung-SSD-SM825-PCB-Bottom.jpg
http://www.storagereview.com/samsung_ssd_sm825_enterprise_ssd_review
http://www.theregister.co.uk/2014/09/24/storage_supercapacitors/
http://us.apacer.com/business/technology/CorePower_Technology
http://www.badcaps.net/forum/showthread.php?t=34417
* Re: Intel SSD or other brands
  2016-12-29 16:56 ` Robert LeBlanc
@ 2016-12-29 23:04   ` Adam Goryachev
  2016-12-29 23:20     ` Robert LeBlanc
  1 sibling, 1 reply; 15+ messages in thread
From: Adam Goryachev @ 2016-12-29 23:04 UTC (permalink / raw)
To: Robert LeBlanc; +Cc: linux-raid@vger.kernel.org

On 30/12/16 03:56, Robert LeBlanc wrote:
> [ ... ] I don't have the data anymore, but the 3610 had the best
> performance, the 3710 had the best data integrity in the case of
> power failure, and the 3510 had the best price.

So it seems that my "good/best" results were based on the 3510, which
was the cheapest of the options you tested. Any chance you could find
the raw data again? Or do you recall the relative performance
difference between these three drives?

> The 3510 had about ~.1 drive writes per day, the 3610 had ~1 DWPD
> and the 3710 had ~3 DWPD.

We seem to be around 0.03 DWPD, so I don't think any of these drives
would be a problem for us. Lifetime seems much longer than the useful
life, given capacity/etc.

> Due to the fault tolerance of Ceph, we felt comfortable with the
> 3610s.

Equally, we have fault tolerance (RAID5) as well as DRBD onto the
other node, which also has RAID5. I also monitor the drive lifetime;
I'm not sure what value I would consider urgent replacement, but
probably around 20% remaining life...

> In our testing, we exceeded the performance numbers listed for the
> drives on their data sheets when running up to 8 jobs even with
> sync=1 [ ... ] You may be able to do something similar by
> partitioning your RAID 5 and putting multiple DRBDs on it.

We do this already... we use a single RAID5 which is split with LVM2
(20 LVs), and each LV is then a DRBD device (so 20 DRBDs). This was
one of the optimisations linbit advised us to do way back at the
beginning.

The problem I'm having is that a single DRBD will reach saturation
because the underlying devices are saturated. So I'm trying to improve
the underlying device performance, and expect to be able to "move" the
bottleneck to DRBD or, hopefully, the ethernet of the iSCSI interface.

Regards,
Adam
* Re: Intel SSD or other brands
  2016-12-29 23:04 ` Adam Goryachev
@ 2016-12-29 23:20   ` Robert LeBlanc
  0 siblings, 0 replies; 15+ messages in thread
From: Robert LeBlanc @ 2016-12-29 23:20 UTC (permalink / raw)
To: Adam Goryachev; +Cc: linux-raid@vger.kernel.org

On Thu, Dec 29, 2016 at 4:04 PM, Adam Goryachev
<mailinglists@websitemanagers.com.au> wrote:
> So it seems that my "good/best" results were based on the 3510,
> which was the cheapest out of the options you tested. Any chance
> you could find the raw data again? Or do you recall the relative
> performance difference between these three drives?

This was done at another job and the data stayed when I left. The
performance between the three drives was pretty close, I think less
than 10%, but I can't remember exactly.

> We seem to be around 0.03 DWPD, so I don't think any of these
> drives would be a problem for us. Lifetime seems much longer than
> the useful life, given capacity/etc.

We had really good wear on the 35xx drives; I think the ratings are
understated, but I don't have the data to back that up.

> Equally, we have fault tolerance (RAID5) as well as DRBD onto the
> other node which also has RAID5. I also monitor the drive lifetime,
> I'm not sure what value I would consider urgent replacement, but
> probably around 20% remaining life...

You may never even get there at 0.03 DWPD.

> [ ... ]

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
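The DWPD arithmetic behind "you may never even get there" is simple enough to sketch. The DWPD ratings and the ~0.03 DWPD actual workload are from this thread; the 5-year rating window is an illustrative assumption, and drive capacity cancels out of the calculation:

```python
# Rough endurance estimate: years until the rated write budget is used up.
# rated_dwpd * warranty_years gives the total write budget in "drive
# capacities"; dividing by the actual daily rate gives years. Capacity
# cancels, so only the DWPD figures matter.

def years_until_rated_writes_used(rated_dwpd, actual_dwpd, warranty_years=5.0):
    """Rated total writes divided by the actual daily write rate, in years."""
    return rated_dwpd * warranty_years / actual_dwpd

# DWPD ratings quoted earlier in the thread; 0.03 DWPD is the actual workload.
for name, dwpd in [("DC S3510", 0.1), ("DC S3610", 1.0), ("DC S3710", 3.0)]:
    years = years_until_rated_writes_used(dwpd, 0.03)
    print(f"{name} (~{dwpd} DWPD rating): ~{years:.0f} years at 0.03 DWPD")
```

Even the lowest-endurance 3510, at ~0.1 DWPD rated over an assumed 5 years, outlasts the useful life of the array at this write rate.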
* Re: Intel SSD or other brands 2016-12-29 2:14 Intel SSD or other brands Adam Goryachev 2016-12-29 11:39 ` Peter Grandi 2016-12-29 16:56 ` Robert LeBlanc @ 2016-12-29 18:50 ` Doug Dumitru 2016-12-29 22:51 ` Adam Goryachev 2 siblings, 1 reply; 15+ messages in thread From: Doug Dumitru @ 2016-12-29 18:50 UTC (permalink / raw) To: Adam Goryachev; +Cc: linux-raid@vger.kernel.org Mr. Goryachev, I find it easier to look at these numbers in terms of IOPS. You are dealing with 17,500 IOPS vs 100 IOPS. This pretty much has to be the "commit" from the drive. The benchmark is basically waiting for the data to actually hit "recorded media" before the IO completes. The "better" drive is returning this "write ACK" when the data is in RAM, the the "worse" drive is returning this "write ACK" after the data is somewhere much slower (probably flash). I would note that 17,500 IOPS is a "good" but not "great" number. Doing commit writes to Flash is expensive. Not only do you have to wait for the flash, but you have to update the mapping tables to get to the data. Flash also does not typically allow 4K updates (even given the erase rules), so your 4K sync update probably has to update a 16K "page" is probably causing a lot of flash wear. Maybe as much as 10:1 write amplification. Maybe more. There are a bunch of things to consider when looking at "sync" performance. The easiest way to look at this is that the drive "absolutely" has to have the data in stable storage to be "correct". This is not really true, and the overhead of this can be huge. File systems "know" this behavior and instead of looking for a hard sync, they use "barriers". The idea of a barrier, is that the drive is allowed to buffer writes, just not re-order them so that an IO crosses a "barrier". Testing of SSDs for this is looking for "serialization errors". If you pull power from an SSD and then go look at the blocks that made it to the media after the reboot, drives can work in one of three ways. 
If absolutely every ACKd block is on the drive, then sync works and barriers are not relevant. If the writes stop and no "newer" write made it to the drive when an "older" one did not, then the drive is still OK with barriers. If "newer" writes made it to the media but older writes did not, then this is a serialization error and you have spaghetti. SSDs with power fail serialization errors are "bad". Then again, it is important to understand the system-level implications of how the error will impact your stack. In the case of RAID-5 in a "traditional" Linux deployment, and especially with DRBD protecting you on another node, you are probably fine without having every last ACK "perfect". After all, if you power fail the primary node, the secondary will become the primary, and any "tail writes" that are missing will get re-sync'd by DRBD's hash checks. And because the amount of data being re-synced is small, it will happen very quickly and you might not even notice it. Back to performance, you should also consider what your array is doing to you. You are running an 8 drive raid-5 array. This will limit performance even more because every write becomes 2 sync writes, plus 6 reads. With q=1 latencies, if you run this test on the array with "good" drives, you should probably get about 15K IOPS max, but it might be a bit worse as the read and write latencies add for each OP. I tried your test on our "in house" "server-side FTL" mapping layer on 8 drives raid-5. This is an E5-1650 v3 w/ an LSI 3008 SAS controller and 8 Samsung 256GB 850 Pro SSDs. The array is "new" so it will slow down somewhat as it fills. 439K IOPS is actually quite a bit under the array's bandwidth, but at q=1, you end up benchmarking the benchmark program. (at q=10, the array saturates the drives linear performance at about 900K IOPS or 3518 MB/sec). 
root@ubuntu-16-24-2:/usr/local/ess# fio --filename=/dev/mapper/ess-md0 --direct=1 --sync=1 --rw=write --bs=4k --iodepth=1 --runtime=60 --time_based --group_reporting --name=ess-raid5 --numjobs=1 ess-raid5: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1 fio-2.2.10 Starting 1 process Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/1714MB/0KB /s] [0/439K/0 iops] [eta 00m:00s] ess-raid5: (groupid=0, jobs=1): err= 0: pid=29544: Thu Dec 29 10:36:13 2016 write: io=102653MB, bw=1710.9MB/s, iops=437980, runt= 60001msec clat (usec): min=1, max=155, avg= 2.06, stdev= 0.49 lat (usec): min=1, max=155, avg= 2.10, stdev= 0.49 clat percentiles (usec): | 1.00th=[ 1], 5.00th=[ 1], 10.00th=[ 2], 20.00th=[ 2], | 30.00th=[ 2], 40.00th=[ 2], 50.00th=[ 2], 60.00th=[ 2], | 70.00th=[ 2], 80.00th=[ 2], 90.00th=[ 3], 95.00th=[ 3], | 99.00th=[ 3], 99.50th=[ 3], 99.90th=[ 7], 99.95th=[ 8], | 99.99th=[ 11] bw (MB /s): min= 1330, max= 1755, per=100.00%, avg=1710.84, stdev=39.12 lat (usec) : 2=5.17%, 4=94.51%, 10=0.31%, 20=0.02%, 50=0.01% lat (usec) : 250=0.01% cpu : usr=14.08%, sys=85.92%, ctx=47, majf=0, minf=10 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=0/w=26279247/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=1 Run status group 0 (all jobs): WRITE: io=102653MB, aggrb=1710.9MB/s, minb=1710.9MB/s, maxb=1710.9MB/s, mint=60001msec, maxt=60001msec On Wed, Dec 28, 2016 at 6:14 PM, Adam Goryachev <mailinglists@websitemanagers.com.au> wrote: > Apologies for my prematurely sent email (if it gets through), this one is > complete... > > Hi all, > > I've spent a number of years trying to build up a nice RAID array for my > SAN, but I seem to be slowly solving one bottle neck only to find another > one. 
Right now, I've identified the underlying SSD's as being a major factor > in that performance issue. > > I started with 5 x 480GB Intel 520s SSD's in a RAID5 array, and this > performed really well. > I added 1 x 480GB 530s SSD > I added 2 x 480GB 530s SSD > > I now found out that performance of a 520s SSD is around 180 times faster > than a 530s SSD. I had to run many tests, but eventually I found the right > things to test for (which matched my real life results), and the numbers > were nothing short of crazy. > Running each test 5 times and average the results... > 520s: 70MB/s > 530s: 0.4MB/s > > OK, so before I could remove and test the 520s, I removed/tested one of the > 530s and saw the horrible performance, so I bought and tested a 540s and > found: > 540s: 6.7MB/s > So, around 20 times better than the 530, so I replaced all the drives with > the 540, but I still have worse performance than the original 5 x 520s > array. > > Working with Intel, they swapped a 530s drive for a DC3510, and I then found > the DC3510 was awesome: > DC3510: 99MB/s > Except, a few weeks back when I placed the order, I was told there is no > longer any stock of this drive, (I wanted 16 x 800GB model), and that the > replacement model is the DC3520. So I figure I won't just blindly buy the > DC3520 assuming it's performance will be similar to the previous model, so I > buy 4 x 480GB DC3520 and start testing. > DC3520: 37MB/s > > So, 1/3rd of a DC3510, but still better than the current live 540s drives, > but also still half the original 520s drives. > > Summary: > 520s: 70217kB/s > 530s: 391kB/s > 540s: 6712kB/s > 330s: 24kB/s > DC3510: 99313kB/s > DC3520: 37051kB/s > WD2TBCB: 475kB/s > > * For comparison, I had a older Western Digital Black 2TB spare, and ran the > same test on it. Got a better result than some of the SSD's which was really > surprising, but it's certainly not an option. 
> FYI, the test I'm running is this:
> fio --filename=/dev/sdb --direct=1 --sync=1 --rw=write --bs=4k --iodepth=1 --runtime=60 --time_based --group_reporting --name=IntelDC3510_4kj1 --numjobs=1
> All drives were tested on the same machine/SATA3 port (basic Intel desktop motherboard), with nothing on the drive (no fs, no partition, nothing trying to access it, etc.).
> In reality, I tested iodepth from 1..10, but in my use case iodepth=1 is the relevant number. At higher iodepth we see performance on all the drives improve; if interested, I can provide a full set of my results/analysis.
>
> So, my actual question... Can you suggest, or have you tested, any Intel (or other brand) SSD which has good performance (similar to the DC3510 or the 520s)? (I can't buy and test every single variant out there; my budget doesn't go anywhere close to that.)
> It needs to be SATA, since I don't have enough PCIe slots to get the needed capacity (nor enough budget). I need around 8 x drives with around 6TB capacity in RAID5.
>
> FYI, my storage stack is like this:
> 8 x SSD's
> mdadm - RAID5
> LVM
> DRBD
> iSCSI
>
> From my understanding, it is DRBD that makes everything an iodepth=1 issue. It is possible to reach iodepth=2 if I have 2 x VM's both doing a lot of IO at the same time, but it is usually single-VM performance that is too limited.
>
> Regards,
> Adam
>
> --
> Adam Goryachev Website Managers www.websitemanagers.com.au
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
Doug Dumitru
EasyCo LLC

^ permalink raw reply [flat|nested] 15+ messages in thread
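[Editorial sketch] The test methodology quoted above (4k, direct, sync writes, swept across iodepth 1..10) can be scripted roughly as below. This is a hedged sketch: the device path is a placeholder, and note that fio's default sync ioengine ignores --iodepth, so the sweep substitutes libaio for the depth sweep (which Adam's original command did not use). It prints the commands rather than running them, since the real test destructively overwrites the target drive.

```python
# Sketch: print the fio command for each queue depth 1..10, approximating
# the 4k direct sync-write test described above. Commands are only printed,
# never executed, because the real test destroys all data on the drive.
DEV = "/dev/sdb"  # placeholder: a blank drive under test

def fio_cmd(qd: int, dev: str = DEV) -> str:
    # fio's default sync engine ignores --iodepth, so use libaio for qd > 1
    return (f"fio --filename={dev} --direct=1 --sync=1 --ioengine=libaio "
            f"--rw=write --bs=4k --iodepth={qd} --runtime=60 --time_based "
            f"--group_reporting --name=ssd_4k_q{qd} --numjobs=1")

for qd in range(1, 11):
    print(fio_cmd(qd))
```

Each printed line can be run by hand (or piped to a shell) against a scratch drive; comparing the reported IOPS per depth reproduces the depth sweep Adam describes.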
* Re: Intel SSD or other brands
  2016-12-29 18:50 ` Doug Dumitru
@ 2016-12-29 22:51 ` Adam Goryachev
  2016-12-30  1:24   ` Doug Dumitru
  0 siblings, 1 reply; 15+ messages in thread
From: Adam Goryachev @ 2016-12-29 22:51 UTC (permalink / raw)
To: doug; +Cc: linux-raid@vger.kernel.org

On 30/12/16 05:50, Doug Dumitru wrote:
> Mr. Goryachev,
>
> I find it easier to look at these numbers in terms of IOPS. You are dealing with 17,500 IOPS vs 100 IOPS. This pretty much has to be the "commit" from the drive. The benchmark is basically waiting for the data to actually hit "recorded media" before the IO completes. The "better" drive is returning this "write ACK" when the data is in RAM, while the "worse" drive is returning this "write ACK" after the data is somewhere much slower (probably flash). I would note that 17,500 IOPS is a "good" but not "great" number.

So what would you consider a great number? I guess in practice the environment isn't really that massive; it shouldn't really *need* great numbers, but it seems no matter how hard I try to "over-architect", it is still not performing to end user expectation.

> Doing commit writes to Flash is expensive. Not only do you have to wait for the flash, but you have to update the mapping tables to get to the data. Flash also does not typically allow 4K updates (even given the erase rules), so your 4K sync update probably has to update a 16K "page", which is probably causing a lot of flash wear. Maybe as much as 10:1 write amplification. Maybe more.

Wear doesn't seem to have been a problem so far.

  9 Power_On_Hours_and_Msec 0x0032 000 000 000 Old_age Always - 914339h+46m+34.180s

This is obviously wrong, I haven't had the drive for >100 years, but it is at least almost 4 years old (early 2013 I suspect).
233 Media_Wearout_Indicator 0x0032 095 095 000 Old_age Always - 0

This is the worst drive out of the whole array; the best is 99. Either way it suggests these drives could easily last >10 years, which would be well and truly longer than their expected/useful life (based on capacity).

241 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always - 3329773

This is the drive with the highest number of writes. Obviously most writes are smaller than 32MB, so I'm not entirely sure what this means, but I suspect we are not doing a lot of writes per day compared to the total storage capacity...
3329773 * 32MB / 3 years / 365 days = 97308MB/day. Total capacity is 480*7 = 3360GB, or approx 0.03 drive writes per day.

I've actually asked this question before, but here again we find what appears to be an anomaly... some drives have significantly more writes than others, and I don't understand why in a RAID5 array this would be the case; I would have expected the writes to be split approx equally across all drives...

225 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always - 1501762
225 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always - 1712480
225 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always - 1684811
225 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always - 1781849
225 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always - 2282764
225 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always - 2269957
225 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always - 2154155
225 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always - 2163563
225 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always - 3329774

It's hard to calculate whether some drives were replaced or similar due to the nonsense power-on-hours values... but generally all these drives were purchased at the same time, and so should have been used mostly equally.
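[Editorial sketch] The back-of-envelope write-volume arithmetic above checks out; the ~3-year service life and decimal-MB capacity are the assumptions stated in the message:

```python
# Verify the Host_Writes_32MiB arithmetic from the SMART data above.
units_32mb = 3_329_773               # highest Host_Writes_32MiB raw value
total_mb = units_32mb * 32           # the attribute counts 32 MB units
mb_per_day = total_mb / (3 * 365)    # assume ~3 years in service
print(round(mb_per_day))             # ~97308 MB/day, matching the message

capacity_mb = 480 * 1000 * 7         # 7 data-bearing 480 GB drives
print(round(mb_per_day / capacity_mb, 3))  # ~0.029 array-writes per day
```

At roughly 0.03 drive writes per day, the drives are written far below even consumer endurance ratings, which is consistent with the healthy wear-out indicators above.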
> In the case of RAID-5 in a "traditional" Linux deployment, and especially with DRBD protecting you on another node, you are probably fine without having every last ACK "perfect". After all, if you power fail the primary node, the secondary will become the primary, and any "tail writes" that are missing will get re-sync'd by DRBD's hash checks. And because the amount of data being re-synced is small, it will happen very quickly and you might not even notice it.

Right, and at this stage I'm not even looking at data integrity, I'm only examining "performance". In fact, it would be within the "acceptable" parameters to lose some data under a "disaster" scenario (where disaster means losing both primary and secondary in an unclean shutdown). Of course, I wouldn't design the system to do that, but it isn't a strict requirement, as long as "normal" processes mean no data loss/corruption, and any drive should (eventually) write all the data it has told you it will.

> Back to performance, you should also consider what your array is doing to you. You are running an 8 drive raid-5 array. This will limit performance even more because every write becomes 2 sync writes, plus 6 reads. With q=1 latencies, if you run this test on the array with "good" drives, you should probably get about 15K IOPS max, but it might be a bit worse as the read and write latencies add for each OP.

Right, and one thing I've considered was moving to RAID10 to avoid this, but even RAID10 means 2 writes. Assuming reads are relatively quick, then that should reduce the impact of the RAID5 as well. At this stage, converting to RAID10 is still something I'm holding up my sleeve as a last resort (due to the additional wasted capacity).

Note that my tests are all on single drives, not the array. I can't afford to be doing testing on the full array due to its destructive nature, and also it is almost impossible to get a quiet moment where the tests wouldn't be affected by workload.
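[Editorial sketch] Doug's IOPS framing, and the RAID-5 small-write penalty discussed above, can be put into numbers. The conversion is just throughput divided by block size. The latency model is an illustrative assumption, not a measurement of these drives, and it uses the read-modify-write variant (2 reads + 2 writes); as Doug notes, md can instead reconstruct parity by reading the six untouched chunks.

```python
def mbps_to_iops(mbps: float, bs_bytes: int = 4096) -> float:
    """Convert a 4k sync-write throughput figure (decimal MB/s) to IOPS."""
    return mbps * 1_000_000 / bs_bytes

# Per-drive results from the summary earlier in the thread (MB/s)
for name, mbps in {"520s": 70.217, "530s": 0.391, "DC3510": 99.313}.items():
    print(name, round(mbps_to_iops(mbps)))
# The 520s lands near Doug's 17,500 IOPS figure; the 530s near his 100 IOPS.

# RAID-5 small-write penalty at queue depth 1: with read-modify-write,
# one logical 4k write costs 2 reads (old data + old parity) and
# 2 writes (new data + new parity), all serialised at q=1.
def rmw_q1_iops(read_us: float, write_us: float) -> float:
    return 1_000_000 / (2 * read_us + 2 * write_us)

print(round(rmw_q1_iops(60, 60)))  # assumed 60 us per op, for illustration
```

This makes the latency-adding effect Doug mentions concrete: even with fast drives, four serialised ~60 us operations cap a q=1 workload at a few thousand IOPS.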
> I tried your test on our "in house" "server-side FTL" mapping layer on 8 drives raid-5. This is an E5-1650 v3 w/ an LSI 3008 SAS controller and 8 Samsung 256GB 850 Pro SSDs. The array is "new" so it will slow down somewhat as it fills. 439K IOPS is actually quite a bit under the array's bandwidth, but at q=1, you end up benchmarking the benchmark program. (At q=10, the array saturates the drives' linear performance at about 900K IOPS or 3518 MB/sec.)
>
> root@ubuntu-16-24-2:/usr/local/ess# fio --filename=/dev/mapper/ess-md0 --direct=1 --sync=1 --rw=write --bs=4k --iodepth=1 --runtime=60 --time_based --group_reporting --name=ess-raid5 --numjobs=1

Would it be possible for you to run the test on a single drive directly instead?

> Run status group 0 (all jobs):
>   WRITE: io=102653MB, aggrb=1710.9MB/s, minb=1710.9MB/s, maxb=1710.9MB/s, mint=60001msec, maxt=60001msec

I might be looking at the wrong value, but you are getting 1711MB/s out of an 8 drive array, and I got a max of 99MB/s on a single drive; even if I multiply that by 7 (8 drives - 1 redundancy), it's still less than half. I'd be pretty keen to see your single drive results, and also whether those results will change when using the 800GB model.

Thank you for your advice. I'll see whether I can find a way to purchase one of the Samsung drives for testing/evaluation; they seem to be a similar price to the Intel S3510 that I was looking at.

Regards,
Adam

> On Wed, Dec 28, 2016 at 6:14 PM, Adam Goryachev <mailinglists@websitemanagers.com.au> wrote:
>> [original message quoted in full; trimmed]

--
Adam Goryachev
Website Managers
www.websitemanagers.com.au

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Intel SSD or other brands
  2016-12-29 22:51 ` Adam Goryachev
@ 2016-12-30  1:24 ` Doug Dumitru
  2016-12-30 16:32   ` Pasi Kärkkäinen
  0 siblings, 1 reply; 15+ messages in thread
From: Doug Dumitru @ 2016-12-30  1:24 UTC (permalink / raw)
To: Adam Goryachev; +Cc: linux-raid@vger.kernel.org

On Thu, Dec 29, 2016 at 2:51 PM, Adam Goryachev <mailinglists@websitemanagers.com.au> wrote:
> On 30/12/16 05:50, Doug Dumitru wrote:
>> [...] I would note that 17,500 IOPS is a "good" but not "great" number.
>
> So what would you consider a great number? I guess in practice the environment isn't really that massive; it shouldn't really *need* great numbers, but it seems no matter how hard I try to "over-architect", it is still not performing to end user expectation.

The Intel drives run the same IOPS regardless of pre-conditioning. They do this mostly by intentionally slowing down random writes so that the worst case does not actually look any worse. You can pretty much dial in any level of random write IOPS by manipulating over-provisioning. With 8% OP, a drive might get 5K IOPS, but at 20%, this goes up to 15K. So if you want to keep an SSD fast, don't fill it up.

>> Doing commit writes to Flash is expensive. Not only do you have to wait for the flash, but you have to update the mapping tables to get to the data.
>> Flash also does not typically allow 4K updates (even given the erase rules), so your 4K sync update probably has to update a 16K "page", which is probably causing a lot of flash wear. Maybe as much as 10:1 write amplification. Maybe more.
>
> Wear doesn't seem to have been a problem so far.
> [Adam's SMART wear and host-write figures trimmed; see his message above]
>
>> Back to performance, you should also consider what your array is doing to you. [...]
>
> Right, and one thing I've considered was moving to RAID10 to avoid this, but even RAID10 means 2 writes. [...]
>
>> I tried your test on our "in house" "server-side FTL" mapping layer on 8 drives raid-5. [...]
>
> Would it be possible for you to run the test on a single drive directly instead?
>>
>> Run status group 0 (all jobs):
>>   WRITE: io=102653MB, aggrb=1710.9MB/s, minb=1710.9MB/s, maxb=1710.9MB/s, mint=60001msec, maxt=60001msec
>
> I might be looking at the wrong value, but you are getting 1711MB/s out of an 8 drive array, and I got a max of 99MB/s on a single drive; even if I multiply that by 7 (8 drives - 1 redundancy), it's still less than half. I'd be pretty keen to see your single drive results, and also whether those results will change when using the 800GB model.

My test is of a "managed" array with a "host side Flash Translation Layer". This means that software is linearizing the writes before RAID-5 sees them. This is how the major "storage appliance" vendors get really fast performance. One vendor, running an earlier version of the software I am running here, was able to support 5000 ESXi VDI clients from a single 2U storage server (with a lot of FC cards). The boot storm took about 3 minutes to settle.

Single drives are around 500 MB/sec, which is 125K IOPS through our engine. Eight drives are (8-1)x500=3500 MB/sec, or 900K IOPS. This is actually faster than FIO can generate a test pattern from a single job. It is also faster than stock RAID-5 can linearly write without patches.

In terms of wear, lots of users are running very light write environments. This is good, as many configurations are > 50:1 write amp if you measure "end to end". By end to end, I mean: how many flash writes happen when you create a small file inside of a file system? This leads to "file system write amp" x "raid write amp" x "SSD write amp". Some people don't like this approach, as the file system is often "off limits" and a black box. Then again, some file systems are better than others (for 10K sync creates, EXT4 and XFS are both about 4.4:1, whereas ZFS is a lot worse). And with EXT4/XFS, you can mitigate some of this with an SSD or mapping layer that compresses blocks.
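[Editorial sketch] The end-to-end write-amplification chain described above multiplies the per-layer factors. A toy illustration, using Doug's ~4.4:1 EXT4/XFS figure and assumed (not measured) RAID and SSD factors:

```python
def end_to_end_write_amp(fs_amp: float, raid_amp: float, ssd_amp: float) -> float:
    """Flash bytes ultimately written per logical byte: the factors multiply."""
    return fs_amp * raid_amp * ssd_amp

fs_amp = 4.4    # EXT4/XFS on 10K sync creates, per the figure above
raid_amp = 2.0  # assumption: RAID-5 small write -> data chunk + parity chunk
ssd_amp = 6.0   # assumption: drive-internal amplification on small sync writes
print(round(end_to_end_write_amp(fs_amp, raid_amp, ssd_amp), 1))  # 52.8
```

With these assumed factors the product lands in the >50:1 range cited, which is why reducing amplification at any single layer (compression, linearized writes, more over-provisioning) pays off multiplicatively.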
Doug Dumitru

> Thank you for your advice. I'll see whether I can find a way to purchase one of the Samsung drives for testing/evaluation; they seem to be a similar price to the Intel S3510 that I was looking at.
>
> Regards,
> Adam
>
>> [earlier messages quoted in full; trimmed]

--
Doug Dumitru
EasyCo LLC

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Intel SSD or other brands
  2016-12-30  1:24 ` Doug Dumitru
@ 2016-12-30 16:32 ` Pasi Kärkkäinen
  2016-12-30 18:23   ` Doug Dumitru
  0 siblings, 1 reply; 15+ messages in thread
From: Pasi Kärkkäinen @ 2016-12-30 16:32 UTC (permalink / raw)
To: Doug Dumitru; +Cc: Adam Goryachev, linux-raid@vger.kernel.org

Hello,

On Thu, Dec 29, 2016 at 05:24:16PM -0800, Doug Dumitru wrote:
> My test is of a "managed" array with a "host side Flash Translation Layer". This means that software is linearizing the writes before RAID-5 sees them. This is how the major "storage appliance" vendors get really fast performance. One vendor, running an earlier version of the software I am running here, was able to support 5000 ESXi VDI clients from a single 2U storage server (with a lot of FC cards). The boot storm took about 3 minutes to settle.

Does this software happen to be opensource / publicly available?

Thanks,

-- Pasi

> Single drives are around 500 MB/sec which is 125K IOPS through our engine. Eight drives are (8-1)x500=3500 MB/sec or 900K IOPS. This is actually faster than FIO can generate a test pattern from a single job. It is also faster than stock RAID-5 can linearly write without patches.
>
> [remainder of quoted message trimmed]
>
> Doug Dumitru

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Intel SSD or other brands
  2016-12-30 16:32 ` Pasi Kärkkäinen
@ 2016-12-30 18:23 ` Doug Dumitru
  0 siblings, 0 replies; 15+ messages in thread
From: Doug Dumitru @ 2016-12-30 18:23 UTC (permalink / raw)
To: Pasi Kärkkäinen; +Cc: Adam Goryachev, linux-raid@vger.kernel.org

Pasi,

The software is proprietary, but can run on most 64-bit Linux systems in a server setting. It also works wonders with "embedded Flash" (SD, eMMC, USB, etc.) on 32-bit x86 and ARM systems.

Please send me a direct email, off-list, and I can give you more details.

Doug Dumitru

On Fri, Dec 30, 2016 at 8:32 AM, Pasi Kärkkäinen <pasik@iki.fi> wrote:
> Hello,
>
> On Thu, Dec 29, 2016 at 05:24:16PM -0800, Doug Dumitru wrote:
>> My test is of a "managed" array with a "host side Flash Translation Layer". [...]
>
> Does this software happen to be opensource / publicly available?
>
> Thanks,
>
> -- Pasi
>
>> [remainder of quoted message trimmed]

--
Doug Dumitru
EasyCo LLC

^ permalink raw reply [flat|nested] 15+ messages in thread
* Intel SSD or other brands
@ 2016-12-29  1:52 Adam Goryachev
  0 siblings, 0 replies; 15+ messages in thread
From: Adam Goryachev @ 2016-12-29 1:52 UTC (permalink / raw)
To: linux-raid

Hi all,

I've spent a number of years trying to build up a nice RAID array for my SAN, but I seem to be slowly solving one bottleneck only to find another one. Right now, I've identified the underlying SSD's as being a major factor in that performance issue.

I started with 5 x 480GB Intel 520s SSD's in a RAID5 array, and this performed really well.
I added 1 x 480GB 530s SSD
I added 2 x 480GB 530s SSD

I have now found out that the performance of a 520s SSD is around 180 times faster than a 530s SSD.

--
Adam Goryachev
Website Managers
P: +61 2 8304 0000   adam@websitemanagers.com.au
F: +61 2 8304 0001   www.websitemanagers.com.au

^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads: [~2016-12-30 18:23 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-12-29  2:14 Intel SSD or other brands Adam Goryachev
2016-12-29 11:39 ` Peter Grandi
2016-12-29 14:35 ` Adam Goryachev
2016-12-29 17:46 ` Peter Grandi
2016-12-29 16:56 ` Robert LeBlanc
2016-12-29 17:33 ` Peter Grandi
2016-12-29 18:37 ` Peter Grandi
2016-12-29 23:04 ` Adam Goryachev
2016-12-29 23:20 ` Robert LeBlanc
2016-12-29 18:50 ` Doug Dumitru
2016-12-29 22:51 ` Adam Goryachev
2016-12-30  1:24 ` Doug Dumitru
2016-12-30 16:32 ` Pasi Kärkkäinen
2016-12-30 18:23 ` Doug Dumitru

-- strict thread matches above, loose matches on Subject: below --
2016-12-29  1:52 Adam Goryachev
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox