* Intel SSD or other brands
@ 2016-12-29 1:52 Adam Goryachev
0 siblings, 0 replies; 15+ messages in thread
From: Adam Goryachev @ 2016-12-29 1:52 UTC (permalink / raw)
To: linux-raid
Hi all,
I've spent a number of years trying to build up a nice RAID array for my
SAN, but I seem to be slowly solving one bottleneck only to find
another one. Right now, I've identified the underlying SSDs as a
major factor in that performance issue.
I started with 5 x 480GB Intel 520s SSDs in a RAID5 array, and this
performed really well.
I added 1 x 480GB 530s SSD
I added 2 x 480GB 530s SSD
I have now found that a 520s SSD is around 180 times faster than a
530s SSD.
--
Adam Goryachev
Website Managers
P: +61 2 8304 0000 adam@websitemanagers.com.au
F: +61 2 8304 0001 www.websitemanagers.com.au
* Intel SSD or other brands
@ 2016-12-29 2:14 Adam Goryachev
2016-12-29 11:39 ` Peter Grandi
` (2 more replies)
0 siblings, 3 replies; 15+ messages in thread
From: Adam Goryachev @ 2016-12-29 2:14 UTC (permalink / raw)
To: linux-raid@vger.kernel.org
Apologies for my prematurely sent email (if it gets through), this one
is complete...
Hi all,
I've spent a number of years trying to build up a nice RAID array for my
SAN, but I seem to be slowly solving one bottleneck only to find
another one. Right now, I've identified the underlying SSDs as a
major factor in that performance issue.
I started with 5 x 480GB Intel 520s SSDs in a RAID5 array, and this
performed really well.
I added 1 x 480GB 530s SSD
I added 2 x 480GB 530s SSD
I have now found that a 520s SSD is around 180 times faster than a
530s SSD. I had to run many tests, but eventually I found the right
things to test for (which matched my real-life results), and the
numbers were nothing short of crazy.
Running each test 5 times and averaging the results...
520s: 70MB/s
530s: 0.4MB/s
OK, so before I could remove and test the 520s, I removed and tested one
of the 530s and saw the horrible performance, so I bought and tested a
540s and found:
540s: 6.7MB/s
That's around 20 times better than the 530s, so I replaced all the drives
with 540s, but I still had worse performance than the original 5 x
520s array.
Working with Intel, I had a 530s drive swapped for a DC3510, and I then
found the DC3510 was awesome:
DC3510: 99MB/s
Except, a few weeks back when I placed the order, I was told there is no
longer any stock of this drive (I wanted the 16 x 800GB model), and that
the replacement model is the DC3520. I figured I shouldn't just blindly
buy the DC3520 assuming its performance would be similar to the previous
model, so I bought 4 x 480GB DC3520s and started testing.
DC3520: 37MB/s
So, about a third of a DC3510, still better than the current live 540s
drives, but also only about half of the original 520s drives.
Summary:
520s: 70217kB/s
530s: 391kB/s
540s: 6712kB/s
330s: 24kB/s
DC3510: 99313kB/s
DC3520: 37051kB/s
WD2TBCB: 475kB/s
* For comparison, I had an older Western Digital Black 2TB spare, and ran
the same test on it. It got a better result than some of the SSDs, which
was really surprising, but it's certainly not an option.
FYI, the test I'm running is this:
fio --filename=/dev/sdb --direct=1 --sync=1 --rw=write --bs=4k
--iodepth=1 --runtime=60 --time_based --group_reporting
--name=IntelDC3510_4kj1 --numjobs=1
All drives were tested on the same machine/SATA3 port (basic intel
desktop motherboard), with nothing on the drive (no fs, no partition,
nothing trying to access it, etc..).
In reality, I tested iodepth from 1..10 (a rough sketch of the sweep is
below), but in my use case iodepth=1 is the relevant number. At higher
iodepth, performance on all the drives improves; if interested, I can
provide a full set of my results/analysis.
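For reference, something along these lines is how such a sweep can be run
(the device name and output file are placeholders, and it assumes the
libaio engine so that iodepth above 1 actually takes effect; it overwrites
the target, so only point it at a scratch drive):

#!/bin/bash
# Destructive: overwrites the target device. Adjust DEV to the drive under test.
DEV=/dev/sdb
for QD in 1 2 3 4 5 6 7 8 9 10; do
    fio --filename=$DEV --direct=1 --sync=1 --rw=write --bs=4k \
        --ioengine=libaio --iodepth=$QD --runtime=60 --time_based \
        --group_reporting --name=qd${QD} --numjobs=1 >> sweep-results.txt
done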
So, my actual question... Can you suggest or have you tested any Intel
(or other brand) SSD which has good performance (similar to the DC3510
or the 520s)? (I can't buy and test every single variant out there, my
budget doesn't go anywhere close to that).
It needs to be SATA, since I don't have enough PCIe slots to get the
needed capacity (nor enough budget). I need around 8 x drives with
around 6TB capacity in RAID5.
FYI, my storage stack is like this:
8 x SSD's
mdadm - RAID5
LVM
DRBD
iSCSI
From my understanding, it is DRBD that makes everything an iodepth=1
issue. It is possible to reach iodepth=2 if I have 2 x VMs both doing a
lot of IO at the same time, but it is usually single-VM performance that
is too limited.
Regards,
Adam
--
Adam Goryachev Website Managers www.websitemanagers.com.au
* Re: Intel SSD or other brands
2016-12-29 2:14 Intel SSD or other brands Adam Goryachev
@ 2016-12-29 11:39 ` Peter Grandi
2016-12-29 14:35 ` Adam Goryachev
2016-12-29 16:56 ` Robert LeBlanc
2016-12-29 18:50 ` Doug Dumitru
2 siblings, 1 reply; 15+ messages in thread
From: Peter Grandi @ 2016-12-29 11:39 UTC (permalink / raw)
To: Linux RAID
> [ ... ] I now found out that performance of a 520s SSD is
> around 180 times faster than a 530s SSD. [ ... ]
Well "performance" can be roughly the same, even if "speed" can
be very different.
http://www.sabi.co.uk/blog/15-two.html#151023
> 520s: 70MB/s
> 530s: 0.4MB/s
....
> fio --filename=/dev/sdb --direct=1 --sync=1 --rw=write --bs=4k
> --iodepth=1 --runtime=60 --time_based --group_reporting
> --name=IntelDC3510_4kj1 --numjobs=1
Arguably the 520s actually have no performance.
https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
* Re: Intel SSD or other brands
2016-12-29 11:39 ` Peter Grandi
@ 2016-12-29 14:35 ` Adam Goryachev
2016-12-29 17:46 ` Peter Grandi
0 siblings, 1 reply; 15+ messages in thread
From: Adam Goryachev @ 2016-12-29 14:35 UTC (permalink / raw)
To: Peter Grandi, Linux RAID
On 29/12/16 22:39, Peter Grandi wrote:
>> [ ... ] I now found out that performance of a 520s SSD is
>> around 180 times faster than a 530s SSD. [ ... ]
> Well "performance" can be roughly the same, even if "speed" can
> be very different.
>
> http://www.sabi.co.uk/blog/15-two.html#151023
I'm not entirely sure what you mean to say here. I have a reasonably
well defined real-life workload (i.e., single-threaded, small random
writes)... I am measuring the same statistics across multiple devices
and comparing those numbers.
In addition, replacing the devices with others that showed an
improvement during testing produced an equivalent improvement (a
reduction) in end-user complaints on the real-life system. So I feel
reasonably sure that I am "on the right track"....
Am I overlooking something else (very possible)?
>> 520s: 70MB/s
>> 530s: 0.4MB/s
> ....
>> fio --filename=/dev/sdb --direct=1 --sync=1 --rw=write --bs=4k
>> --iodepth=1 --runtime=60 --time_based --group_reporting
>> --name=IntelDC3510_4kj1 --numjobs=1
> Arguably the 520s actually have no performance.
>
> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
I have seen this page, and all I can suggest is that the 480GB 520s
performs very differently to the 60GB model. I see 70MB/s, which is
significantly different to the listed 9MB/s on that page.
This page matches my results (comparatively) in that the 520 performs much
better than the 535, though I don't have easy access to a 535 in order
to run my destructive tests... but I did run similar tests on a 535,
and I ran the more thorough tests on the 530 and 540.
Regards,
Adam
* Re: Intel SSD or other brands
2016-12-29 2:14 Intel SSD or other brands Adam Goryachev
2016-12-29 11:39 ` Peter Grandi
@ 2016-12-29 16:56 ` Robert LeBlanc
2016-12-29 17:33 ` Peter Grandi
2016-12-29 23:04 ` Adam Goryachev
2016-12-29 18:50 ` Doug Dumitru
2 siblings, 2 replies; 15+ messages in thread
From: Robert LeBlanc @ 2016-12-29 16:56 UTC (permalink / raw)
To: Adam Goryachev; +Cc: linux-raid@vger.kernel.org
This is a similar workload to Ceph's, and you may find more information
on their mailing lists. When I was working with Ceph about a year
ago, we tested a bunch of SSDs and found that sync=1 really
differentiates drives and shows you which drives are better. In
our testing, we found that the 35xx, 36xx, and 37xx drives handled the
workloads the best. The 3x00 drives were close to EOL, so we focused
on the 3x10 drives. I don't have the data anymore, but the 3610 had
the best performance, the 3710 had the best data integrity in the case
of power failure, and the 3510 had the best price. The 3510 was rated
at ~0.1 drive writes per day, the 3610 at ~1 DWPD and the 3710 at
~3 DWPD. Due to the fault tolerance of Ceph, we felt comfortable with
the 3610s. In our testing, we exceeded the performance numbers listed
on the drives' data sheets when running up to 8 jobs even with sync=1,
which no other manufacturer did. For Ceph, we could put multiple OSDs
on a disk and take advantage of this performance gain. You may be able
to do something similar by partitioning your RAID 5 and putting
multiple DRBDs on it (a rough sketch follows).
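A minimal sketch of what I mean, assuming LVM sits on top of the md
array; the volume group and LV names are made up, and each LV would then
back its own DRBD resource:

# Hypothetical layout: vg_san on /dev/md0, one LV per DRBD resource.
pvcreate /dev/md0
vgcreate vg_san /dev/md0
for i in 1 2 3 4; do
    lvcreate -L 200G -n drbd_backing_$i vg_san
done
# Several independent sync writers is where the DC drives kept scaling
# in our testing; numjobs=4 against one LV approximates that pattern.
fio --filename=/dev/vg_san/drbd_backing_1 --direct=1 --sync=1 --rw=write \
    --bs=4k --iodepth=1 --runtime=60 --time_based --group_reporting \
    --name=lv_sync_test --numjobs=4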
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
* Re: Intel SSD or other brands
2016-12-29 16:56 ` Robert LeBlanc
@ 2016-12-29 17:33 ` Peter Grandi
2016-12-29 18:37 ` Peter Grandi
2016-12-29 23:04 ` Adam Goryachev
1 sibling, 1 reply; 15+ messages in thread
From: Peter Grandi @ 2016-12-29 17:33 UTC (permalink / raw)
To: Linux RAID
> [ ... ] sync=1 really differentiates drives and you really
> find which drives are better. [ ... ]
It is not necessarily "better" in a strict sense: flash SSD
devices with supercapacitor-backed persistent caches can be much
faster on 'fsync'-heavy workloads, but also cost a lot more
(probably mostly because of market segmentation). This matters
especially on a RAID5 set with lots of read-modify-write.
It is a different performance envelope, not necessarily a
"better" one. If one does not need small-write speed then
cheaper drives are more appropriate.
However devices which don't have persistent caches and still
have high 'sync=1'/'direct=1' speed because they don't implement
'fsync' synchronously are definitely worse, in the sense of
having arguably no performance at all.
Some manufacturers think that using an SLC cache helps without a
persisted RAM cache, but the persisted RAM seems a lot
better to me.
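One rough way to probe for this from the host (a sketch, not a
definitive test): run the same 'fio' job with the drive's
volatile write cache enabled and then disabled; a device whose
'sync' numbers depend on unprotected RAM usually collapses with
the cache off:

hdparm -W  /dev/sdb    # report the current write-cache setting
hdparm -W0 /dev/sdb    # disable the volatile write cache
fio --filename=/dev/sdb --direct=1 --sync=1 --rw=write --bs=4k \
    --iodepth=1 --runtime=60 --time_based --name=wc_off --numjobs=1
hdparm -W1 /dev/sdb    # re-enable the cache afterwards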
* Re: Intel SSD or other brands
2016-12-29 14:35 ` Adam Goryachev
@ 2016-12-29 17:46 ` Peter Grandi
0 siblings, 0 replies; 15+ messages in thread
From: Peter Grandi @ 2016-12-29 17:46 UTC (permalink / raw)
To: Linux RAID
>> Well "performance" can be roughly the same, even if "speed" can
>> be very different. http://www.sabi.co.uk/blog/15-two.html#151023
> [ ... ] what you mean to say here, [ ... ]
Some people may say that 'eatmydata $COMMAND' can greatly improve
the "performance" of running '$COMMAND'. More properly it can
improve its speed, but its performance arguably goes to zero, or
more precisely becomes insignificant, in most cases. That's a
pretty huge difference.
* Re: Intel SSD or other brands
2016-12-29 17:33 ` Peter Grandi
@ 2016-12-29 18:37 ` Peter Grandi
0 siblings, 0 replies; 15+ messages in thread
From: Peter Grandi @ 2016-12-29 18:37 UTC (permalink / raw)
To: Linux RAID
>> [ ... ] sync=1 really differentiates drives and you really
>> find which drives are better. [ ... ]
> It is not necessarily "better" in a strict sense: flash SSD
> devices with supercapacitor-backed persistent caches [ ... ]
http://www.storagereview.com/images/Samsung-SSD-SM825-PCB-Bottom.jpg
http://www.storagereview.com/samsung_ssd_sm825_enterprise_ssd_review
http://www.theregister.co.uk/2014/09/24/storage_supercapacitors/
http://us.apacer.com/business/technology/CorePower_Technology
http://www.badcaps.net/forum/showthread.php?t=34417
* Re: Intel SSD or other brands
2016-12-29 2:14 Intel SSD or other brands Adam Goryachev
2016-12-29 11:39 ` Peter Grandi
2016-12-29 16:56 ` Robert LeBlanc
@ 2016-12-29 18:50 ` Doug Dumitru
2016-12-29 22:51 ` Adam Goryachev
2 siblings, 1 reply; 15+ messages in thread
From: Doug Dumitru @ 2016-12-29 18:50 UTC (permalink / raw)
To: Adam Goryachev; +Cc: linux-raid@vger.kernel.org
Mr. Goryachev,
I find it easier to look at these numbers in terms of IOPS. You are
dealing with 17,500 IOPS vs 100 IOPS. This pretty much has to be the
"commit" from the drive. The benchmark is basically waiting for the
data to actually hit "recorded media" before the IO completes. The
"better" drive is returning this "write ACK" when the data is in RAM,
and the "worse" drive is returning this "write ACK" after the data is
somewhere much slower (probably flash). I would note that 17,500 IOPS
is a "good" but not "great" number.
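(The conversion is just throughput divided by the 4k block size; a quick
sanity check against the kB/s summary numbers from the original post:)

# IOPS at 4k = (kB/s) / 4
echo $(( 70217 / 4 ))   # 520s -> ~17554 IOPS
echo $(( 391 / 4 ))     # 530s -> ~97 IOPS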
Doing commit writes to Flash is expensive. Not only do you have to
wait for the flash, but you have to update the mapping tables to get
to the data. Flash also does not typically allow 4K updates (even
given the erase rules), so your 4K sync update probably has to update
a 16K "page", which is probably causing a lot of flash wear. Maybe as
much as 10:1 write amplification. Maybe more.
There are a bunch of things to consider when looking at "sync"
performance. The easiest way to look at this is that the drive
"absolutely" has to have the data in stable storage to be "correct".
This is not really true, and the overhead of this can be huge. File
systems "know" this behavior and instead of looking for a hard sync,
they use "barriers". The idea of a barrier is that the drive is
allowed to buffer writes, just not to re-order them so that an IO
crosses a "barrier".
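(A rough way to see how much the per-write commit dominates, using
standard fio options; --fsync=1 is roughly what the original --sync=1
test forces, while --fsync=64 lets the drive buffer in between:)

fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --fsync=1 \
    --runtime=60 --time_based --name=commit_every_write
fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --fsync=64 \
    --runtime=60 --time_based --name=commit_every_64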
Testing of SSDs for this is looking for "serialization errors". If
you pull power from an SSD and then go look at the blocks that made it
to the media after the reboot, drives can work in one of three ways.
If absolutely every ACK'd block is on the drive, then sync works and
barriers are not relevant. If the writes stop and no "newer" write
made it to the drive where an "older" one did not, then the drive is
still OK with barriers. If "newer" writes made it to the media but
older writes did not, then this is a serialization error and you have
spaghetti. SSDs with power-fail serialization errors are "bad". Then
again, it is important to understand the system-level implications of
how the error will impact your stack.
In the case of RAID-5 in a "traditional" Linux deployment, and
especially with DRBD protecting you on another node, you are probably
fine without having every last ACK "perfect". After all, if you power
fail the primary node, the secondary will become the primary, and any
"tail writes" that are missing will get re-sync'd by DRBD's hash
checks. And because the amount of data being re-synced is small, it
will happen very quickly and you might not even notice it.
Back to performance, you should also consider what your array is doing
to you. You are running an 8 drive raid-5 array. This will limit
performance even more because every write becomes 2 sync writes, plus
6 reads. With q=1 latencies, if you run this test on the array with
"good" drives, you should probably get about 15K IOPS max, but it
might be a bit worse as the read and write latencies add for each OP.
I tried your test on our "in house" "server-side FTL" mapping layer on
an 8-drive RAID-5. This is an E5-1650 v3 w/ an LSI 3008 SAS controller
and 8 Samsung 256GB 850 Pro SSDs. The array is "new" so it will slow
down somewhat as it fills. 439K IOPS is actually quite a bit under
the array's bandwidth, but at q=1, you end up benchmarking the
benchmark program. (At q=10, the array saturates the drives' linear
performance at about 900K IOPS or 3518 MB/sec.)
root@ubuntu-16-24-2:/usr/local/ess# fio --filename=/dev/mapper/ess-md0
--direct=1 --sync=1 --rw=write --bs=4k --iodepth=1 --runtime=60
--time_based --group_reporting --name=ess-raid5 --numjobs=1
ess-raid5: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
fio-2.2.10
Starting 1 process
Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/1714MB/0KB /s] [0/439K/0 iops] [eta 00m:00s]
ess-raid5: (groupid=0, jobs=1): err= 0: pid=29544: Thu Dec 29 10:36:13 2016
write: io=102653MB, bw=1710.9MB/s, iops=437980, runt= 60001msec
clat (usec): min=1, max=155, avg= 2.06, stdev= 0.49
lat (usec): min=1, max=155, avg= 2.10, stdev= 0.49
clat percentiles (usec):
| 1.00th=[ 1], 5.00th=[ 1], 10.00th=[ 2], 20.00th=[ 2],
| 30.00th=[ 2], 40.00th=[ 2], 50.00th=[ 2], 60.00th=[ 2],
| 70.00th=[ 2], 80.00th=[ 2], 90.00th=[ 3], 95.00th=[ 3],
| 99.00th=[ 3], 99.50th=[ 3], 99.90th=[ 7], 99.95th=[ 8],
| 99.99th=[ 11]
bw (MB /s): min= 1330, max= 1755, per=100.00%, avg=1710.84, stdev=39.12
lat (usec) : 2=5.17%, 4=94.51%, 10=0.31%, 20=0.02%, 50=0.01%
lat (usec) : 250=0.01%
cpu : usr=14.08%, sys=85.92%, ctx=47, majf=0, minf=10
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=26279247/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: io=102653MB, aggrb=1710.9MB/s, minb=1710.9MB/s,
maxb=1710.9MB/s, mint=60001msec, maxt=60001msec
--
Doug Dumitru
EasyCo LLC
* Re: Intel SSD or other brands
2016-12-29 18:50 ` Doug Dumitru
@ 2016-12-29 22:51 ` Adam Goryachev
2016-12-30 1:24 ` Doug Dumitru
0 siblings, 1 reply; 15+ messages in thread
From: Adam Goryachev @ 2016-12-29 22:51 UTC (permalink / raw)
To: doug; +Cc: linux-raid@vger.kernel.org
On 30/12/16 05:50, Doug Dumitru wrote:
> Mr. Goryachev,
>
> I find it easier to look at these numbers in terms of IOPS. You are
> dealing with 17,500 IOPS vs 100 IOPS. This pretty much has to be the
> "commit" from the drive. The benchmark is basically waiting for the
> data to actually hit "recorded media" before the IO completes. The
> "better" drive is returning this "write ACK" when the data is in RAM,
> the the "worse" drive is returning this "write ACK" after the data is
> somewhere much slower (probably flash). I would note that 17,500 IOPS
> is a "good" but not "great" number.
So what would you consider a great number? I guess in practice the
environment isn't really that massive, and it shouldn't really *need*
great numbers, but it seems that no matter how hard I try to
"over-architect", it still doesn't perform to end-user expectations.
> Doing commit writes to Flash is expensive. Not only do you have to
> wait for the flash, but you have to update the mapping tables to get
> to the data. Flash also does not typically allow 4K updates (even
> given the erase rules), so your 4K sync update probably has to update
> a 16K "page" is probably causing a lot of flash wear. Maybe as much
> as 10:1 write amplification. Maybe more.
Wear doesn't seem to have been a problem so far.
9 Power_On_Hours_and_Msec 0x0032 000 000 000 Old_age Always - 914339h+46m+34.180s
This is obviously wrong; I haven't had the drive for >100 years, but it
is at least almost 4 years old (early 2013 I suspect).
233 Media_Wearout_Indicator 0x0032 095 095 000 Old_age Always - 0
This is the worst drive out of the whole array (the best is 99), but
either way it suggests these drives could easily last >10 years, which
would be well and truly longer than their expected/useful life (based on
capacity).
241 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always - 3329773
This is the drive with the highest number of writes.... Obviously most
writes are smaller than 32MB, so I'm not entirely sure what this means,
but I suspect we are not doing a lot of writes per day compared to the
total storage capacity...
3329773 * 32MB / 3 years / 365 days = 97308MB/day. Total capacity is
480*7 = 3360GB, or approx 0.03 drive writes per day.
I've actually asked this question before, but here again we find what
appears to be an anomaly... some drives have significantly more writes
than others, and I don't understand why in a RAID5 array this would be
the case; I would have expected the writes to be split approximately
equally across all drives...
225 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always - 1501762
225 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always - 1712480
225 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always - 1684811
225 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always - 1781849
225 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always - 2282764
225 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always - 2269957
225 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always - 2154155
225 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always - 2163563
225 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always - 3329774
It's hard to tell whether some drives were replaced at some point, given
the nonsensical power-on-hours values... but generally all these drives
were purchased at the same time, and so should have been used roughly
equally.
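For reference, these attributes come from smartctl; a quick loop along
these lines (the drive letters are placeholders for the array members)
is enough to pull the per-drive write counters and turn them into MB/day:

#!/bin/bash
# Report Host_Writes_32MiB per drive and a rough MB/day figure,
# assuming about 3 years of service.
DAYS=$((3 * 365))
for d in sdb sdc sdd sde sdf sdg sdh sdi; do
    units=$(smartctl -A /dev/$d | awk '/Host_Writes_32MiB/ {print $NF}')
    echo "$d: ${units} x 32MiB written, ~$(( units * 32 / DAYS )) MB/day"
done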
> In the case of RAID-5 in a "traditional" Linux deployment, and
> especially with DRBD protecting you on another node, you are probably
> fine without having every last ACK "perfect". After all, if you power
> fail the primary node, the secondary will become the primary, and any
> "tail writes" that are missing will get re-sync'd by DRBD's hash
> checks. And because the amount of data being re-synced is small, it
> will happen very quickly and you might not even notice it.
Right, and at this stage I'm not even looking at data integrity; I'm
only examining "performance". In fact, it would be within the
"acceptable" parameters to lose some data under a "disaster" scenario
(where disaster means losing both primary and secondary in an unclean
shutdown). Of course, I wouldn't design the system to do that, but it
isn't a strict requirement, as long as "normal" operation means no data
loss/corruption, and any drive should (eventually) write all the data it
has told you it will.
> Back to performance, you should also consider what your array is doing
> to you. You are running an 8 drive raid-5 array. This will limit
> performance even more because every write becomes 2 sync writes, plus
> 6 reads. With q=1 latencies, if you run this test on the array with
> "good" drives, you should probably get about 15K IOPS max, but it
> might be a bit worse as the read and write latencies add for each OP.
Right, and one thing I've considered is moving to RAID10 to avoid this,
but even RAID10 means 2 writes. Assuming reads are relatively quick,
then that should reduce the impact of the RAID5 as well. At this stage,
converting to RAID10 is still something I'm holding up my sleeve as a
last resort (due to the additional wasted capacity).
Note that my tests are all on single drives, not the array. I can't
afford to be doing testing on the full array due to its destructive
nature, and also it is almost impossible to get a quiet moment where the
tests wouldn't be affected by workload.
> I tried your test on our "in house" "server-side FTL" mapping layer on
> 8 drives raid-5. This is an E5-1650 v3 w/ an LSI 3008 SAS controller
> and 8 Samsung 256GB 850 Pro SSDs. The array is "new" so it will slow
> down somewhat as it fills. 439K IOPS is actually quite a bit under
> the array's bandwidth, but at q=1, you end up benchmarking the
> benchmark program. (at q=10, the array saturates the drives linear
> performance at about 900K IOPS or 3518 MB/sec).
>
> root@ubuntu-16-24-2:/usr/local/ess# fio --filename=/dev/mapper/ess-md0
> --direct=1 --sync=1 --rw=write --bs=4k --iodepth=1 --runtime=60
> --time_based --group_reporting --name=ess-raid5 --numjobs=1
Would it be possible for you to run the test on a single drive directly
instead?
>
> Run status group 0 (all jobs):
> WRITE: io=102653MB, aggrb=1710.9MB/s, minb=1710.9MB/s,
> maxb=1710.9MB/s, mint=60001msec, maxt=60001msec
I might be looking at the wrong value, but you are getting 1711MB/s out
of an 8-drive array, while I got a max of 99MB/s on a single drive; even
if I multiply that by 7 (8 drives - 1 redundancy), it's still less than
half.
I'd be pretty keen to see your single-drive results, and also whether
those results change when using the 800GB model.
Thank you for your advice. I'll see whether I can find a way to purchase
one of the Samsung drives for testing/evaluation; they seem to be a
similar price to the Intel S3510 that I was looking at.
Regards,
Adam
--
Adam Goryachev Website Managers www.websitemanagers.com.au
* Re: Intel SSD or other brands
2016-12-29 16:56 ` Robert LeBlanc
2016-12-29 17:33 ` Peter Grandi
@ 2016-12-29 23:04 ` Adam Goryachev
2016-12-29 23:20 ` Robert LeBlanc
1 sibling, 1 reply; 15+ messages in thread
From: Adam Goryachev @ 2016-12-29 23:04 UTC (permalink / raw)
To: Robert LeBlanc; +Cc: linux-raid@vger.kernel.org
On 30/12/16 03:56, Robert LeBlanc wrote:
> This is a similar workload as Ceph and you may find more information
> from their mailing lists. When I was working with Ceph about a year
> ago, we tested a bunch of SSDs and found that sync=1 really
> differentiates drives and you really find which drives are better. In
> our testing, we found that the 35xx, 36xx, and 37xx drives handled the
> workloads the best. The 3x00 drives were close to EOL, so we focused
> on the 3x10 drives. I don't have the data anymore, but the 3610 had
> the best performance, the 3710 had the best data integrity in the case
> of power failure, and the 3510 had the best price.
So it seems that my "good/best" results were based on the 3510, which
was the cheapest out of the options you tested. Any chance you could
find the raw data again? Or do you recall the relative performance
difference between these three drives?
> The 3510 had about
> ~.1 drive writes per day, the 3610 had ~1 DWPD and the 3710 had ~3
> DWPD.
We seem to be around 0.03 DWPD, so I don't think any of these drives
would be a problem for us. Lifetime seems much longer than the useful
life, given capacity/etc.
> Due to the fault tolerance of Ceph, we felt comfortable with the
> 3610s.
Equally, we have fault tolerance (RAID5) as well as DRBD onto the other
node, which also has RAID5. I also monitor the drive lifetime; I'm not
sure at what value I would consider urgent replacement, but probably
around 20% remaining life....
> In our testing, we exceeded the performance numbers listed for
> the drives on their data sheets when running up to 8 jobs even with
> sync=1 which no other manufacture did. For Ceph, we could put multiple
> OSDs on a disk and take advantage of this performance gain. You may be
> able to do something similar by partitioning your RAID 5 and putting
> multiple DRBDs on it.
We do this already... we use a single RAID5 which is split with LVM2 (20
LVs), and each LV is then a DRBD device (so 20 DRBDs). This was one of
the optimisations Linbit advised us to do way back at the beginning.
The problem I'm having is that a single DRBD will reach saturation
because the underlying devices are saturated. So I'm trying to improve
the underlying device performance, and expect to be able to "move" the
bottleneck to DRBD or, hopefully, the ethernet of the iSCSI interface.
Regards,
Adam
--
Adam Goryachev Website Managers www.websitemanagers.com.au
* Re: Intel SSD or other brands
2016-12-29 23:04 ` Adam Goryachev
@ 2016-12-29 23:20 ` Robert LeBlanc
0 siblings, 0 replies; 15+ messages in thread
From: Robert LeBlanc @ 2016-12-29 23:20 UTC (permalink / raw)
To: Adam Goryachev; +Cc: linux-raid@vger.kernel.org
On Thu, Dec 29, 2016 at 4:04 PM, Adam Goryachev
<mailinglists@websitemanagers.com.au> wrote:
> On 30/12/16 03:56, Robert LeBlanc wrote:
>>
>> This is a similar workload as Ceph and you may find more information
>> from their mailing lists. When I was working with Ceph about a year
>> ago, we tested a bunch of SSDs and found that sync=1 really
>> differentiates drives and you really find which drives are better. In
>> our testing, we found that the 35xx, 36xx, and 37xx drives handled the
>> workloads the best. The 3x00 drives were close to EOL, so we focused
>> on the 3x10 drives. I don't have the data anymore, but the 3610 had
>> the best performance, the 3710 had the best data integrity in the case
>> of power failure, and the 3510 had the best price.
>
> So it seems that my "good/best" results were based on the 3510, which was
> the cheapest out of the options you tested. Any chance you could find the
> raw data again? Or do you recall the relative performance difference between
> these three drives?
This was done at another job and the data stayed when I left. The
performance between the three drives was pretty close, I think less
than 10%, but I can't remember exactly.
>> The 3510 had about
>> ~.1 drive writes per day, the 3610 had ~1 DWPD and the 3710 had ~3
>> DWPD.
>
> We seem to be around 0.03 DWPD, so I don't think any of these drives would
> be a problem for us. Lifetime seems much longer than the useful life, given
> capacity/etc.
We had really good wear on the 35xx drives; I think the ratings are
understated, but I don't have the data to back that up.
>> Due to the fault tolerance of Ceph, we felt comfortable with the
>> 3610s.
>
> Equally, we have fault tolerance (RAID5) as well as DRBD onto the other node
> which also has RAID5. I also monitor the drive lifetime, I'm not sure what
> value I would consider urgent replacement, but probably around 20% remaining
> life....
You may never even get there at 0.03 DWPD.
>> In our testing, we exceeded the performance numbers listed for
>> the drives on their data sheets when running up to 8 jobs even with
>> sync=1 which no other manufacture did. For Ceph, we could put multiple
>> OSDs on a disk and take advantage of this performance gain. You may be
>> able to do something similar by partitioning your RAID 5 and putting
>> multiple DRBDs on it.
>
>
> We do this already... we use a single RAID5 which is split with LVM2 (20
> LV's), and each LV is then a DRBD device (so 20 DRBD's). This was one of the
> optimisations linbit advised us to do way back at the beginning.
>
> The problem I'm having is that a single DRBD will reach saturation because
> the underlying devices are saturated. So I'm trying to improve the
> underlying device performance, and expect to be able to "move" the
> bottleneck to DRBD or hopefully, the ethernet of the iSCSI interface.
>
> Regards,
> Adam
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
* Re: Intel SSD or other brands
2016-12-29 22:51 ` Adam Goryachev
@ 2016-12-30 1:24 ` Doug Dumitru
2016-12-30 16:32 ` Pasi Kärkkäinen
0 siblings, 1 reply; 15+ messages in thread
From: Doug Dumitru @ 2016-12-30 1:24 UTC (permalink / raw)
To: Adam Goryachev; +Cc: linux-raid@vger.kernel.org
On Thu, Dec 29, 2016 at 2:51 PM, Adam Goryachev
<mailinglists@websitemanagers.com.au> wrote:
> On 30/12/16 05:50, Doug Dumitru wrote:
>>
>> Mr. Goryachev,
>>
>> I find it easier to look at these numbers in terms of IOPS. You are
>> dealing with 17,500 IOPS vs 100 IOPS. This pretty much has to be the
>> "commit" from the drive. The benchmark is basically waiting for the
>> data to actually hit "recorded media" before the IO completes. The
>> "better" drive is returning this "write ACK" when the data is in RAM,
>> the the "worse" drive is returning this "write ACK" after the data is
>> somewhere much slower (probably flash). I would note that 17,500 IOPS
>> is a "good" but not "great" number.
>
>
> So what would you consider a great number? I guess in practice the
> environment isn't really that massive, it shouldn't really *need* great
> numbers, but it seems no matter how hard I try to "over-architect", it is
> still not performing to end user expectation.
The Intel drives run the same IOPS regardless of pre-conditioning.
They do this mostly by intentionally slowing down random writes so
that the worst case does not actually look any worse. You can pretty
much dial in any level of random write IOPS by manipulating
over-provisioning. With 8% OP, a drive might get 5K IOPS, but at 20%,
this goes up to 15K. So if you want to keep an SSD fast, don't fill it up.
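(A rough illustration of dialing in over-provisioning by hand; the
device name and the 80/20 split are placeholders, and the drive should
be secure-erased or fully trimmed first so the controller knows the
spare area is actually free:)

blkdiscard /dev/sdb                              # TRIM the whole drive
parted -s /dev/sdb mklabel gpt mkpart primary 0% 80%
# Build the array on /dev/sdb1 and never write the last 20%; the
# controller can use it as extra spare area for garbage collection.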
>>
>> Doing commit writes to Flash is expensive. Not only do you have to
>> wait for the flash, but you have to update the mapping tables to get
>> to the data. Flash also does not typically allow 4K updates (even
>> given the erase rules), so your 4K sync update probably has to update
>> a 16K "page" is probably causing a lot of flash wear. Maybe as much
>> as 10:1 write amplification. Maybe more.
>
> Wear doesn't seem to have been a problem so far.
>
> 9 Power_On_Hours_and_Msec 0x0032 000 000 000 Old_age Always
> - 914339h+46m+34.180s
> This is obviously wrong, I haven't had the drive for >100 years, but it is
> at least almost 4 years old (early 2013 I suspect).
> 233 Media_Wearout_Indicator 0x0032 095 095 000 Old_age Always
> - 0
> This is the worst drive out of the whole array, the best is 99, but either
> way it suggests these drives could easily last >10 years, which would be
> well and truly longer than their expected/useful life (based on capacity).
>
> 241 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always
> - 3329773
> This is the drive with the highest number of writes.... Obviously most
> writes are smaller than 32MB, so I'm not entirely sure what this means, but
> I suspect we are not doing a lot of writes per day compared to the total
> storage capacity...
> 3329773 * 32MB / 3 years / 365 days = 97308MB/day. Total capacity is 480*7 =
> 3360GB or approx 0.03 per drive writes per day.
>
> I've actually asked this question before, but here again we find what
> appears to be an anomaly... some drives have significantly more writes than
> others, and I don't understand why in a RAID5 array this would be the case,
> I would have expected the writes to be split approx equally across all
> drives...
> 225 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always
> - 1501762
> 225 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always
> - 1712480
> 225 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always
> - 1684811
> 225 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always
> - 1781849
> 225 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always
> - 2282764
> 225 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always
> - 2269957
> 225 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always
> - 2154155
> 225 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always
> - 2163563
> 225 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always
> - 3329774
>
> It's hard to calculate whether some drives were replaced or similar due to
> the nonsense power on hours values.... but generally all these drives were
> purchased at the same time, and so should have been used mostly equally.
>
>> In the case of RAID-5 in a "traditional" Linux deployment, and
>> especially with DRBD protecting you on another node, you are probably
>> fine without having every last ACK "perfect". After all, if you power
>> fail the primary node, the secondary will become the primary, and any
>> "tail writes" that are missing will get re-sync'd by DRBD's hash
>> checks. And because the amount of data being re-synced is small, it
>> will happen very quickly and you might not even notice it.
>
> Right, and at this stage I'm not even looking at data integrity, I'm only
> examining "performance". In fact, it would be within the "acceptable"
> parameters" to lose some data under a "disaster" scenario (where disaster
> means losing both primary and secondary in an unclean shutdown). Of course,
> I wouldn't design the system to do that, but it isn't a strict requirement,
> as long as "normal" processes mean no data loss/corruption, and any drive
> should (eventually) write all the data it has told you it will.
>>
>> Back to performance, you should also consider what your array is doing
>> to you. You are running an 8 drive raid-5 array. This will limit
>> performance even more because every write becomes 2 sync writes, plus
>> 6 reads. With q=1 latencies, if you run this test on the array with
>> "good" drives, you should probably get about 15K IOPS max, but it
>> might be a bit worse as the read and write latencies add for each OP.
>
> Right, and one thing I've considered was moving to RAID10 to avoid this, but
> even RAID10 means 2 writes. Assuming reads are relatively quick, than that
> should reduce the impact of the RAID5 as well. At this stage, converting to
> RAID10 is still something I'm holding up my sleeve as a last resort (due to
> the additional wasted capacity).
>
> Note that my tests are all on single drives, not the array. I can't afford
> to be doing testing on the full array due to the destructive nature, and
> also it is almost impossible to get a quiet moment where the tests wouldn't
> be affected by workload.
>>
>> I tried your test on our "in house" "server-side FTL" mapping layer on
>> 8 drives raid-5. This is an E5-1650 v3 w/ an LSI 3008 SAS controller
>> and 8 Samsung 256GB 850 Pro SSDs. The array is "new" so it will slow
>> down somewhat as it fills. 439K IOPS is actually quite a bit under
>> the array's bandwidth, but at q=1, you end up benchmarking the
>> benchmark program. (at q=10, the array saturates the drives linear
>> performance at about 900K IOPS or 3518 MB/sec).
>>
>> root@ubuntu-16-24-2:/usr/local/ess# fio --filename=/dev/mapper/ess-md0
>> --direct=1 --sync=1 --rw=write --bs=4k --iodepth=1 --runtime=60
>> --time_based --group_reporting --name=ess-raid5 --numjobs=1
>
> Would it be possible for you to run the test on a single drive directly
> instead?
>>
>>
>> Run status group 0 (all jobs):
>> WRITE: io=102653MB, aggrb=1710.9MB/s, minb=1710.9MB/s,
>> maxb=1710.9MB/s, mint=60001msec, maxt=60001msec
>
> I might be looking at the wrong value, but you are getting 1711MB/s out of
> an 8 drive array, I got a max of 99MB/s on a single drive, even if I
> multiply that by 7 (8 drives - 1 redundancy), it's still less than half. I'd
> be pretty keen to see your single drive results. Also whether those results
> will change when using the 800GB model.
My test is of a "managed" array with a "host side Flash Translation
Layer". This means that software is linearizing the writes before
RAID-5 sees them. This is how the major "storage appliance" vendors
get really fast performance. One vendor, running an earlier version
of the software I am running here, was able to support 5000 ESXi VDI
clients from a single 2U storage server (with a lot of FC cards). The
boot storm took about 3 minutes to settle.
Single drives are around 500 MB/sec which is 125K IOPS through our
engine. Eight drives are (8-1)x500=3500 MB/sec or 900K IOPS. This is
actually faster than FIO can generate a test pattern from a single
job. It is also faster than stock RAID-5 can linearly write without
patches.
In terms of wear, lots of users are running very light write
environments. This is good as many configurations are > 50:1 write
amp if you measure "end to end". By end to end, I mean, how many
flash writes happen when you create a small file inside of a file
system. This leads to "file system write amp" x "raid write amp" x
"SSD write amp". Some people don't like this approach as the file
system is often "off limits" and a black box. Then again, some file
systems are better than others (for 10K sync creates, EXT4 and XFS are
both about 4.4:1 whereas ZFS is a lot worse). And with EXT4/XFS, you
can mitigate some of this with an SSD or mapping layer that compresses
blocks.
Doug Dumitru
>
> Thank you for your advice, I'll see whether I can find a way to purchase one
> of the samsung drives for testing/evaluation, then seem to be a similar
> price to the Intel S3510 that I was looking at.
>
> Regards,
> Adam
>
>
>> On Wed, Dec 28, 2016 at 6:14 PM, Adam Goryachev
>> <mailinglists@websitemanagers.com.au> wrote:
>>>
>>> Apologies for my prematurely sent email (if it gets through), this one is
>>> complete...
>>>
>>> Hi all,
>>>
>>> I've spent a number of years trying to build up a nice RAID array for my
>>> SAN, but I seem to be slowly solving one bottle neck only to find another
>>> one. Right now, I've identified the underlying SSD's as being a major
>>> factor
>>> in that performance issue.
>>>
>>> I started with 5 x 480GB Intel 520s SSD's in a RAID5 array, and this
>>> performed really well.
>>> I added 1 x 480GB 530s SSD
>>> I added 2 x 480GB 530s SSD
>>>
>>> I now found out that performance of a 520s SSD is around 180 times faster
>>> than a 530s SSD. I had to run many tests, but eventually I found the
>>> right
>>> things to test for (which matched my real life results), and the numbers
>>> were nothing short of crazy.
>>> Running each test 5 times and average the results...
>>> 520s: 70MB/s
>>> 530s: 0.4MB/s
>>>
>>> OK, so before I could remove and test the 520s, I removed/tested one of
>>> the
>>> 530s and saw the horrible performance, so I bought and tested a 540s and
>>> found:
>>> 540s: 6.7MB/s
>>> So, around 20 times better than the 530, so I replaced all the drives
>>> with
>>> the 540, but I still have worse performance than the original 5 x 520s
>>> array.
>>>
>>> Working with Intel, they swapped a 530s drive for a DC3510, and I then
>>> found
>>> the DC3510 was awesome:
>>> DC3510: 99MB/s
>>> Except, a few weeks back when I placed the order, I was told there is no
>>> longer any stock of this drive, (I wanted 16 x 800GB model), and that the
>>> replacement model is the DC3520. So I figure I won't just blindly buy the
>>> DC3520 assuming it's performance will be similar to the previous model,
>>> so I
>>> buy 4 x 480GB DC3520 and start testing.
>>> DC3520: 37MB/s
>>>
>>> So, 1/3rd of a DC3510, but still better than the current live 540s
>>> drives,
>>> but also still half the original 520s drives.
>>>
>>> Summary:
>>> 520s: 70217kB/s
>>> 530s: 391kB/s
>>> 540s: 6712kB/s
>>> 330s: 24kB/s
>>> DC3510: 99313kB/s
>>> DC3520: 37051kB/s
>>> WD2TBCB: 475kB/s
>>>
>>> * For comparison, I had a older Western Digital Black 2TB spare, and ran
>>> the
>>> same test on it. Got a better result than some of the SSD's which was
>>> really
>>> surprising, but it's certainly not an option.
>>> FYI, the test I'm running is this:
>>> fio --filename=/dev/sdb --direct=1 --sync=1 --rw=write --bs=4k
>>> --iodepth=1
>>> --runtime=60 --time_based --group_reporting --name=IntelDC3510_4kj1
>>> --numjobs=1
>>> All drives were tested on the same machine/SATA3 port (basic intel
>>> desktop
>>> motherboard), with nothing on the drive (no fs, no partition, nothing
>>> trying
>>> to access it, etc..).
>>> In reality, I tested iodepth from 1..10, but in my use case, iodepth=1
>>> is the relevant number. At higher iodepth, we see performance on all the
>>> drives improve; if interested, I can provide a full set of my
>>> results/analysis.
>>>
>>> So, my actual question... Can you suggest or have you tested any Intel
>>> (or
>>> other brand) SSD which has good performance (similar to the DC3510 or the
>>> 520s)? (I can't buy and test every single variant out there, my budget
>>> doesn't go anywhere close to that).
>>> It needs to be SATA, since I don't have enough PCIe slots to get the
>>> needed
>>> capacity (nor enough budget). I need around 8 x drives with around 6TB
>>> capacity in RAID5.
>>>
>>> FYI, my storage stack is like this:
>>> 8 x SSD's
>>> mdadm - RAID5
>>> LVM
>>> DRBD
>>> iSCSI
>>>
>>> From my understanding, it is DRBD that makes everything an iodepth=1
>>> issue. It is possible to reach iodepth=2 if I have 2 x VM's both doing a
>>> lot of IO at the same time, but it is usually a single VM's performance
>>> that is the limiting factor.
>>>
>>> Regards,
>>> Adam
>>>
>>>
>>> --
>>> Adam Goryachev Website Managers www.websitemanagers.com.au
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
>>
>>
>
>
> --
> Adam Goryachev Website Managers www.websitemanagers.com.au
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Doug Dumitru
EasyCo LLC
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Intel SSD or other brands
2016-12-30 1:24 ` Doug Dumitru
@ 2016-12-30 16:32 ` Pasi Kärkkäinen
2016-12-30 18:23 ` Doug Dumitru
0 siblings, 1 reply; 15+ messages in thread
From: Pasi Kärkkäinen @ 2016-12-30 16:32 UTC (permalink / raw)
To: Doug Dumitru; +Cc: Adam Goryachev, linux-raid@vger.kernel.org
Hello,
On Thu, Dec 29, 2016 at 05:24:16PM -0800, Doug Dumitru wrote:
>
> My test is of a "managed" array with a "host side Flash Translation
> Layer". This means that software is linearizing the writes before
> RAID-5 sees them. This is how the major "storage appliance" vendors
> get really fast performance. One vendor, running an earlier version
> of the software I am running here, was able to support 5000 ESXI VDI
> clients from a single 2U storage server (with a lot of FC cards). The
> boot storm took about 3 minutes to settle.
>
Does this software happen to be opensource / publicly available ?
Thanks,
-- Pasi
> Single drives are around 500 MB/sec which is 125K IOPS through our
> engine. Eight drives are (8-1)x500=3500 MB/sec or 900K IOPS. This is
> actually faster than FIO can generate a test pattern from a single
> job. It is also faster than stock RAID-5 can linearly write without
> patches.
>
> In terms of wear, lots of users are running very light write
> environments. This is good as many configurations are > 50:1 write
> amp if you measure "end to end". By end to end, I mean, how many
> flash writes happen when you create a small file inside of a file
> system. This leads to "file system write amp" x "raid write amp" x
> "SSD write amp". Some people don't like this approach as the file
> system is often "off limits" and a black box. Then again, some file
> systems are better than others (for 10K sync creates, EXT4 and XFS are
> both about 4.4:1 whereas ZFS is a lot worse). And with EXT4/XFS, you
> can mitigate some of this with an SSD or mapping layer that compresses
> blocks.
>
> Doug Dumitru
>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Intel SSD or other brands
2016-12-30 16:32 ` Pasi Kärkkäinen
@ 2016-12-30 18:23 ` Doug Dumitru
0 siblings, 0 replies; 15+ messages in thread
From: Doug Dumitru @ 2016-12-30 18:23 UTC (permalink / raw)
To: Pasi Kärkkäinen; +Cc: Adam Goryachev, linux-raid@vger.kernel.org
Pasi,
The software is proprietary, but can run on most 64-bit Linux systems
in a server setting. It also works wonders with "embedded Flash" (SD,
eMMC, USB, etc.) for 32-bit x86 and ARM systems.
Please send me a direct email, off-list, and I can give you more details.
Doug Dumitru
On Fri, Dec 30, 2016 at 8:32 AM, Pasi Kärkkäinen <pasik@iki.fi> wrote:
> Hello,
>
> On Thu, Dec 29, 2016 at 05:24:16PM -0800, Doug Dumitru wrote:
>>
>> My test is of a "managed" array with a "host side Flash Translation
>> Layer". This means that software is linearizing the writes before
>> RAID-5 sees them. This is how the major "storage appliance" vendors
>> get really fast performance. One vendor, running an earlier version
>> of the software I am running here, was able to support 5000 ESXI VDI
>> clients from a single 2U storage server (with a lot of FC cards). The
>> boot storm took about 3 minutes to settle.
>>
>
> Does this software happen to be opensource / publicly available ?
>
>
> Thanks,
>
> -- Pasi
>
>> Single drives are around 500 MB/sec which is 125K IOPS through our
>> engine. Eight drives are (8-1)x500=3500 MB/sec or 900K IOPS. This is
>> actually faster than FIO can generate a test pattern from a single
>> job. It is also faster than stock RAID-5 can linearly write without
>> patches.
>>
>> In terms of wear, lots of users are running very light write
>> environments. This is good as many configurations are > 50:1 write
>> amp if you measure "end to end". By end to end, I mean, how many
>> flash writes happen when you create a small file inside of a file
>> system. This leads to "file system write amp" x "raid write amp" x
>> "SSD write amp". Some people don't like this approach as the file
>> system is often "off limits" and a black box. Then again, some file
>> systems are better than others (for 10K sync creates, EXT4 and XFS are
>> both about 4.4:1 whereas ZFS is a lot worse). And with EXT4/XFS, you
>> can mitigate some of this with an SSD or mapping layer that compresses
>> blocks.
>>
>> Doug Dumitru
>>
>
--
Doug Dumitru
EasyCo LLC
^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, newest: 2016-12-30 18:23 UTC
Thread overview: 15+ messages
2016-12-29 2:14 Intel SSD or other brands Adam Goryachev
2016-12-29 11:39 ` Peter Grandi
2016-12-29 14:35 ` Adam Goryachev
2016-12-29 17:46 ` Peter Grandi
2016-12-29 16:56 ` Robert LeBlanc
2016-12-29 17:33 ` Peter Grandi
2016-12-29 18:37 ` Peter Grandi
2016-12-29 23:04 ` Adam Goryachev
2016-12-29 23:20 ` Robert LeBlanc
2016-12-29 18:50 ` Doug Dumitru
2016-12-29 22:51 ` Adam Goryachev
2016-12-30 1:24 ` Doug Dumitru
2016-12-30 16:32 ` Pasi Kärkkäinen
2016-12-30 18:23 ` Doug Dumitru
-- strict thread matches above, loose matches on Subject: below --
2016-12-29 1:52 Adam Goryachev