Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO

Linux RAID subsystem development

* Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
@ 2013-04-19 22:58 Andrei Banu
       [not found] ` <CAH3kUhEaZGON=fAyVMZOz5fH_DcfKv=hCa96UCeK4pN7k81c_Q@mail.gmail.com>
                   ` (3 more replies)
  0 siblings, 4 replies; 38+ messages in thread
From: Andrei Banu @ 2013-04-19 22:58 UTC
  To: linux-raid

Hello!

I come to you with a difficult problem. We have an otherwise snappy
server fitted with an mdraid-1 array made of two Samsung 840 PRO SSDs.
If we copy a large file to the server (from the same server or from the
network, it doesn't matter), the server load increases from roughly 0.7
to over 100 (for files of several GB). Apparently the reason is that the
RAID can't sustain writes.

A few examples:

root [~]# dd if=testfile.tar.gz of=test20 oflag=sync bs=4M
130+1 records in
130+1 records out
547682517 bytes (548 MB) copied, 7.99664 s, 68.5 MB/s

And 10-20 seconds later I try the very same test:

root [~]# dd if=testfile.tar.gz of=test21 oflag=sync bs=4M
130+1 records in / 130+1 records out
547682517 bytes (548 MB) copied, 52.1958 s, 10.5 MB/s

A different test with 'bs=1G'
root [~]# w
  12:08:34 up 1 day, 13:09,  1 user,  load average: 0.37, 0.60, 0.72

root [~]# dd if=testfile.tar.gz of=test oflag=sync bs=1G
0+1 records in / 0+1 records out
547682517 bytes (548 MB) copied, 75.3476 s, 7.3 MB/s

root [~]# w
  12:09:56 up 1 day, 13:11,  1 user,  load average: 39.29, 12.67, 4.93

It took 75 seconds to copy a half-GB file, and the server load
increased roughly 100 times.

And a final test:

root@ [~]# dd if=/dev/zero of=test24 bs=64k count=16k conv=fdatasync
16384+0 records in / 16384+0 records out
1073741824 bytes (1.1 GB) copied, 61.8796 s, 17.4 MB/s

This time the load spiked to only ~ 20.
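
The MB/s figures dd printed above can be rechecked from the byte counts and elapsed times. This is only a sketch in Python: the helper name `mb_per_s` and the run labels are made up for illustration, and the inputs are copied from the four dd runs.

```python
def mb_per_s(nbytes, seconds):
    """Throughput in MB/s, with 1 MB = 10**6 bytes as dd reports it."""
    return nbytes / seconds / 1e6

# (label, bytes copied, elapsed seconds), copied from the dd output above.
runs = [
    ("bs=4M, first run", 547682517, 7.99664),
    ("bs=4M, second run", 547682517, 52.1958),
    ("bs=1G", 547682517, 75.3476),
    ("bs=64k from /dev/zero", 1073741824, 61.8796),
]

rates = {label: mb_per_s(nbytes, secs) for label, nbytes, secs in runs}
best = max(rates.values())

for label, rate in rates.items():
    # The slowdown factor is relative to the fastest of the four runs.
    print(f"{label:<24} {rate:6.1f} MB/s  ({best / rate:4.1f}x slower than best)")
```

The slowdown column makes the spread between the first run and the later ones easier to see than the raw rates do.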

A few other peculiarities:

root@ [~]# hdparm -t /dev/sda
Timing buffered disk reads:  654 MB in  3.01 seconds = 217.55 MB/sec
root@ [~]# hdparm -t /dev/sdb
Timing buffered disk reads:  272 MB in  3.01 seconds =  90.44 MB/sec

The read speed differs greatly between the two devices (sda is about
140% faster), but look what happens when I run it with --direct:

root@ [~]# hdparm --direct -t /dev/sda
Timing O_DIRECT disk reads:  788 MB in  3.00 seconds = 262.23 MB/sec
root@ [~]# hdparm --direct -t /dev/sdb
Timing O_DIRECT disk reads:  554 MB in  3.00 seconds = 184.53 MB/sec

So the hardware seems able to sustain about 200MB/s on both devices,
but the two still differ greatly.
With --direct the sda figure rose about 20%, while the sdb figure
doubled. Maybe there's a problem with the page cache?
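
The percentages quoted above follow directly from the hdparm figures. A minimal sketch in Python, where the helper name `percent_change` is made up for illustration and the inputs are the MB/sec values from the four hdparm runs:

```python
def percent_change(old, new):
    """Return the relative change from old to new, in percent."""
    return (new - old) / old * 100.0

# MB/sec values copied from the hdparm runs above.
buffered = {"sda": 217.55, "sdb": 90.44}
direct = {"sda": 262.23, "sdb": 184.53}

# How much faster sda reads than sdb in the buffered test.
margin = percent_change(buffered["sdb"], buffered["sda"])

# How much each device gained when switching to O_DIRECT reads.
gain = {dev: percent_change(buffered[dev], direct[dev]) for dev in buffered}

print(f"buffered sda vs sdb: {margin:+.1f}%")
for dev, pct in gain.items():
    print(f"{dev} buffered -> direct: {pct:+.1f}%")
```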

BACKGROUND INFORMATION
Server type: general shared hosting server (3 weeks new)
O/S: CentOS 6.4 / 64 bit (2.6.32-358.2.1.el6.x86_64)
Hardware: SuperMicro 5017C-MTRF, E3-1270v2, 16GB RAM, 2 x Samsung 840 
PRO 512GB
Partitioning: ~100GB left unpartitioned for over-provisioning, ext4.

I believe it is aligned:

root [~]# fdisk -lu

Disk /dev/sda: 512.1 GB, 512110190592 bytes
255 heads, 63 sectors/track, 62260 cylinders, total 1000215216 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00026d59

    Device Boot      Start         End      Blocks   Id  System
/dev/sda1            2048     4196351     2097152   fd  Linux raid autodetect
Partition 1 does not end on cylinder boundary.
/dev/sda2   *     4196352     4605951      204800   fd  Linux raid autodetect
Partition 2 does not end on cylinder boundary.
/dev/sda3         4605952   814106623   404750336   fd  Linux raid autodetect

Disk /dev/sdb: 512.1 GB, 512110190592 bytes
255 heads, 63 sectors/track, 62260 cylinders, total 1000215216 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x0003dede

    Device Boot      Start         End      Blocks   Id  System
/dev/sdb1            2048     4196351     2097152   fd  Linux raid autodetect
Partition 1 does not end on cylinder boundary.
/dev/sdb2   *     4196352     4605951      204800   fd  Linux raid autodetect
Partition 2 does not end on cylinder boundary.
/dev/sdb3         4605952   814106623   404750336   fd  Linux raid autodetect
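
The alignment claim can be checked from the start sectors in the fdisk output. A small sketch in Python, assuming a 1 MiB boundary as the alignment target (that target and the names `is_aligned` and `starts` are my assumptions, not something fdisk reports):

```python
SECTOR_BYTES = 512          # logical sector size reported by fdisk
ALIGN_BYTES = 1024 * 1024   # assumed alignment target: 1 MiB

# Partition start sectors copied from the fdisk -lu output above
# (identical on sda and sdb).
starts = {"part1": 2048, "part2": 4196352, "part3": 4605952}

def is_aligned(start_sector, align_bytes=ALIGN_BYTES):
    """True if the partition's byte offset is a multiple of align_bytes."""
    return (start_sector * SECTOR_BYTES) % align_bytes == 0

for name, start in starts.items():
    print(name, start, "aligned" if is_aligned(start) else "NOT aligned")
```

This only covers the partition offsets. It says nothing about the erase-block size of the SSDs, which fdisk cannot report.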

The array is NOT degraded:

root@ [~]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb2[1] sda2[0]
       204736 blocks super 1.0 [2/2] [UU]
md2 : active raid1 sdb3[1] sda3[0]
       404750144 blocks super 1.0 [2/2] [UU]
md1 : active raid1 sdb1[1] sda1[0]
       2096064 blocks super 1.1 [2/2] [UU]
unused devices: <none>

Write cache is on:

root@ [~]# hdparm -W /dev/sda
write-caching =  1 (on)
root@ [~]# hdparm -W /dev/sdb
write-caching =  1 (on)

SMART seems to be OK:
SMART overall-health self-assessment test result: PASSED (for both devices)

I have tried changing the I/O scheduler to noop and deadline, but I
couldn't see any improvement.

I have tried running fstrim but it errors out:

root [~]# fstrim -v /
fstrim: /: FITRIM ioctl failed: Operation not supported

So I changed /etc/fstab to include the noatime and discard mount options
and rebooted the server, but to no avail.

I no longer know what to do, and I need to come up with some sort of
solution (it is neither reasonable nor acceptable to hit three-digit
loads from copying a few GB of files). If anyone can help me, please do!

Thanks in advance!
Andy


[parent not found: <CAH3kUhEaZGON=fAyVMZOz5fH_DcfKv=hCa96UCeK4pN7k81c_Q@mail.gmail.com>]

[parent not found: <51725458.7020109@redhost.ro>]

[parent not found: <CAH3kUhHxBiqugFQm=PPJNNe9jOdKy0etUjQNsoDz_LJNUCLCCQ@mail.gmail.com>]

* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
       [not found]     ` <CAH3kUhHxBiqugFQm=PPJNNe9jOdKy0etUjQNsoDz_LJNUCLCCQ@mail.gmail.com>
@ 2013-04-20 23:25       ` Andrei Banu
  2013-04-20 23:26       ` Andrei Banu
  1 sibling, 0 replies; 38+ messages in thread
From: Andrei Banu @ 2013-04-20 23:25 UTC
  To: linux-raid

The previous test was done with "noop" as the scheduler (the speed test
completed at about 8MB/s). Then I rebooted the server and redid the
test (also with noop), and the result was slightly better but still not
what it should be (21MB/s). A third test 5-10 minutes later (after the
load subsided) completed at 16MB/s. A fourth test ended at 14.6MB/s.

Something else: the weekly automatic RAID check started a little while
ago and it's running at an average of 60MB/s (anywhere between 25 and
100MB/s) with noop, cfq and deadline. A RAID check on ordinary mechanical
drives runs at about 160MB/s on the outer cylinders. Why are these
SSDs so slow?

These are the results from the 21MB/s test (iostat at a 5-second interval):

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             204.94      1918.02       389.30    1367719     277605
sdb             154.80      1196.21       389.30     853008     277605
md1               0.65         2.59         0.00       1848          0
md2             355.45      3106.05       388.53    2214890     277056
md0               1.10         2.90         0.01       2069          9

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             583.40     42764.80     29452.90     213824     147264
sdb             234.80     23172.00     14950.50     115860      74752
md1               0.00         0.00         0.00          0          0
md2            8079.60     65886.40     29862.40     329432     149312
md0               0.00         0.00         0.00          0          0

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda              15.00         1.60      1740.00          8       8700
sdb              15.00         0.00      7196.80          0      35984
md1               0.00         0.00         0.00          0          0
md2             333.20         1.60      1330.40          8       6652
md0               0.00         0.00         0.00          0          0

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             167.20       538.40     37432.80       2692     187164
sdb              86.20        16.00     33688.80         80     168444
md1               0.00         0.00         0.00          0          0
md2            9510.80       572.80     37934.40       2864     189672
md0               0.00         0.00         0.00          0          0

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             150.20       585.60     29090.40       2928     145452
sdb              71.20        44.00     30355.20        220     151776
md1               0.00         0.00         0.00          0          0
md2            7306.20       615.20     28998.40       3076     144992
md0               0.00         0.00         0.00          0          0

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             257.20      1624.00      9913.80       8120      49569
sdb             137.20       372.00     21438.60       1860     107193
md1               0.00         0.00         0.00          0          0
md2            2600.80      1991.20      9504.00       9956      47520
md0               0.00         0.00         0.00          0          0

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             186.80       972.80       292.70       4864       1463
sdb             150.40       733.60       292.70       3668       1463
md1               0.00         0.00         0.00          0          0
md2             283.80      1706.40       291.20       8532       1456
md0               0.00         0.00         0.00          0          0
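
Since both members of a RAID1 mirror have to receive the same writes over time, comparing kB_wrtn for sda and sdb within one interval shows which drive is lagging. A rough sketch in Python that parses one sample from the output above (the second one); the function name `parse_iostat_sample` is hypothetical and the parser assumes the six-column `iostat -d -k` layout shown here:

```python
SAMPLE = """\
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             583.40     42764.80     29452.90     213824     147264
sdb             234.80     23172.00     14950.50     115860      74752
md2            8079.60     65886.40     29862.40     329432     149312
"""

def parse_iostat_sample(text):
    """Parse one `iostat -d -k` sample into {device: {column: value}}."""
    columns = ("tps", "kB_read/s", "kB_wrtn/s", "kB_read", "kB_wrtn")
    sample = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 6 or fields[0] == "Device:":
            continue  # skip blank lines and the header row
        sample[fields[0]] = dict(zip(columns, map(float, fields[1:])))
    return sample

sample = parse_iostat_sample(SAMPLE)
sda_w = sample["sda"]["kB_wrtn"]
sdb_w = sample["sdb"]["kB_wrtn"]
print(f"sda wrote {sda_w:.0f} kB, sdb wrote {sdb_w:.0f} kB, "
      f"ratio {sda_w / sdb_w:.2f}")
```

A ratio far from 1.0 in one interval only means one member is catching up later. The since-boot totals in the first sample are equal for both drives.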

If you have any idea what I can do to improve this, please let me know.

Thanks!!


* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
       [not found]     ` <CAH3kUhHxBiqugFQm=PPJNNe9jOdKy0etUjQNsoDz_LJNUCLCCQ@mail.gmail.com>
  2013-04-20 23:25       ` Andrei Banu
@ 2013-04-20 23:26       ` Andrei Banu
  2013-04-21  2:48         ` Stan Hoeppner
  2013-04-25 11:38         ` Thomas Jarosch
  1 sibling, 2 replies; 38+ messages in thread
From: Andrei Banu @ 2013-04-20 23:26 UTC
  To: linux-raid

Hi!

They are connected through SATA2 ports in AHCI mode (this explains the
read speed, but not the pitiful write speed).

OK, I redid the test with iostat at a 6-second interval ('-d 6') and the
'noop' scheduler during the same file copy, and these are the complete
results:

root [~]# iostat -d 6 -k
Linux 2.6.32-358.2.1.el6.x86_64 (host)      04/21/2013 _x86_64_(8 CPU)

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             245.95       832.69       591.13  219499895  155823699
sdb             190.80       572.24       590.88  150844446  155758671
md1               1.15         2.15         2.43     567732     641156
md2             406.02      1368.44       587.74  360725304  154930520
md0               0.06         0.23         0.00      59992        171

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda              34.17         0.00      4466.00          0      26796
sdb               9.67         0.00      4949.33          0      29696
md1               0.00         0.00         0.00          0          0
md2            1116.50         0.00      4466.00          0      26796
md0               0.00         0.00         0.00          0          0

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda              35.17         0.00      5475.33          0      32852
sdb               9.33         2.00      4522.67         12      27136
md1               0.00         0.00         0.00          0          0
md2            1369.67         8.00      5475.33         48      32852
md0               0.00         0.00         0.00          0          0

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda              40.33         0.00      3160.00          0      18960
sdb              19.50         0.00      7882.00          0      47292
md1               0.00         0.00         0.00          0          0
md2             790.50         2.67      3160.00         16      18960
md0               0.00         0.00         0.00          0          0

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda              77.67         4.00     15328.00         24      91968
sdb              50.33        16.00     10972.67         96      65836
md1               0.00         0.00         0.00          0          0
md2            3834.33         9.33     15328.00         56      91968
md0               0.00         0.00         0.00          0          0

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda              66.67        48.00     10604.00        288      63624
sdb              23.17         0.00      9660.00          0      57960
md1               0.00         0.00         0.00          0          0
md2            2653.50        51.33     10604.00        308      63624
md0               0.00         0.00         0.00          0          0

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda              37.83        24.67      5378.67        148      32272
sdb              13.17         3.33      6315.33         20      37892
md1               0.00         0.00         0.00          0          0
md2            1345.17        26.00      5378.67        156      32272
md0               0.00         0.00         0.00          0          0

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             132.50         4.67     22714.00         28     136284
sdb              32.33        20.00     12328.00        120      73968
md1               0.00         0.00         0.00          0          0
md2            5713.67        31.33     22843.33        188     137060
md0               0.00         0.00         0.00          0          0

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda              58.17         6.00      8200.00         36      49200
sdb              23.00         8.00     11349.33         48      68096
md1               0.00         0.00         0.00          0          0
md2            1936.17        21.33      7729.33        128      46376
md0               0.00         0.00         0.00          0          0

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               6.17         0.00        24.67          0        148
sdb              10.00         0.00      5120.00          0      30720
md1               0.00         0.00         0.00          0          0
md2               6.17         0.00        24.67          0        148
md0               0.00         0.00         0.00          0          0

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               1.50         0.00         5.33          0         32
sdb              14.17         0.00      7170.67          0      43024
md1               0.00         0.00         0.00          0          0
md2               1.50         0.00         5.33          0         32
md0               0.00         0.00         0.00          0          0

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             256.00       346.67      1105.17       2080       6631
sdb             270.83       544.00      7029.17       3264      42175
md1              49.33       170.00        27.33       1020        164
md2             311.83       705.33      1076.67       4232       6460
md0               0.00         0.00         0.00          0          0

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda              51.17        46.67       219.08        280       1314
sdb              48.67       140.00       219.08        840       1314
md1              20.67        82.67         0.00        496          0
md2              58.00       104.00       218.00        624       1308
md0               0.00         0.00         0.00          0          0

Thank you for your time.

Kind regards!

On 20/04/2013 4:11 PM, Roberto Spadim wrote:
>
> Hmm, at the beginning you have more IOPS than at the end. How are these
> devices connected? Normally an SSD can handle more than 1000 IOPS and an
> HD no more than 300 IOPS. How did you configure the queue of the SSDs?
> Could you change it to noop and test again?
>
> On 20/04/2013 05:39, "Andrei Banu" <andrei.banu@redhost.ro> wrote:
>
>     Hi,
>
>     I ran iostat with '-d 3' during a "heavy" (540MB) copy. It took a
>     bit over a minute and completed at less than 9MB/s. These are
>     some of the results (this does NOT include the first batch, i.e.
>     the averages since boot):
>
>     Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
>     sda             503.00      1542.67     28157.33       4628      84472
>     sdb              66.00        72.00     13162.67        216      39488
>     md1             373.00      1492.00         0.00       4476          0
>     md2            6951.67       126.67     27734.67        380      83204
>     md0               0.00         0.00         0.00          0          0
>
>     Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
>     sda              56.67        20.00      1177.50         60       3532
>     sdb              47.33        12.00     10824.17         36      32472
>     md1               0.67         2.67         0.00          8          0
>     md2             322.00        25.33      1266.67         76       3800
>     md0               0.00         0.00         0.00          0          0
>
>     Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
>     sda             122.00        16.00     45773.33         48     137320
>     sdb              96.67        14.67     19472.00         44      58416
>     md1               0.00         0.00         0.00          0          0
>     md2           11431.00        32.00     45684.00         96     137052
>     md0               0.00         0.00         0.00          0          0
>
>     Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
>     sda               0.00         0.00         0.00          0          0
>     sdb              13.67         8.00      5973.33         24      17920
>     md1               0.00         0.00         0.00          0          0
>     md2               2.00         8.00         0.00         24          0
>     md0               0.00         0.00         0.00          0          0
>
>     This is the "normal" iostat output taken after 10 minutes (this DOES
>     include the first batch, i.e. the averages since boot):
>
>     Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
>     sda             281.83       973.99       641.55  212615675  140045467
>     sdb             215.51       665.94       641.55  145369465  140045467
>     md1               1.18         2.17         2.56     473492     558452
>     md2             470.71      1596.29       638.01  348460340  139272912
>     md0               0.08         0.27         0.00      59983        171
>
>     Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
>     sda              41.67       237.33       133.67        712        401
>     sdb              39.33        90.67       133.67        272        401
>     md1               0.00         0.00         0.00          0          0
>     md2              83.00       328.00       133.33        984        400
>     md0               0.00         0.00         0.00          0          0
>
>     Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
>     sda              29.33         2.67       110.00          8        330
>     sdb              29.33         2.67       110.00          8        330
>     md1               0.00         0.00         0.00          0          0
>     md2              28.67         5.33       109.33         16        328
>     md0               0.00         0.00         0.00          0          0
>
>     Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
>     sda             175.67         1.33       747.50          4       2242
>     sdb             182.00        56.00       747.50        168       2242
>     md1               0.00         0.00         0.00          0          0
>     md2             191.33        57.33       746.67        172       2240
>     md0               0.00         0.00         0.00          0          0
>
>     Best regards!
>
>     On 20/04/2013 3:59 AM, Roberto Spadim wrote:
>>     run some kind of iostat -d 1 -k and check the write/read  iops
>>     and kb/s
>>
>>     -- 
>>     Roberto Spadim
>



* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
  2013-04-20 23:26       ` Andrei Banu
@ 2013-04-21  2:48         ` Stan Hoeppner
  2013-04-21 12:23           ` Tommy Apel
  2013-04-25 11:38         ` Thomas Jarosch
  1 sibling, 1 reply; 38+ messages in thread
From: Stan Hoeppner @ 2013-04-21  2:48 UTC
  To: Andrei Banu; +Cc: linux-raid

On 4/20/2013 6:26 PM, Andrei Banu wrote:

> They are connected through SATA2 ports (this does explain the read speed
> but not the pitiful write one) in AHCI.

These SSDs are capable of 500MB/s, and cost ~$1000 USD.  Spend ~$200 USD
on a decent HBA.  The 6G SAS/SATA LSI 9211-4i seems perfectly suited to
your RAID1 SSD application.  It is a 4 port enterprise JBOD HBA that
also supports ASIC level RAID 1, 1E, 10.

Also, the difference in throughput you show between RAID maintenance,
direct device access, and filesystem access suggests you have something
running between the block and filesystem layers, for instance LUKS.
Though LUKS alone shouldn't hammer your CPU and IO throughput so
dramatically.  However, if the SSDs do compression or encryption
automatically, and I believe the 840s do, the LUKS encrypted blocks may
cause the SSD firmware to take considerably more time to process the blocks.

-- 
Stan


* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
  2013-04-21  2:48         ` Stan Hoeppner
@ 2013-04-21 12:23           ` Tommy Apel
  2013-04-21 16:48             ` Tommy Apel
  2013-04-21 19:33             ` Stan Hoeppner
  0 siblings, 2 replies; 38+ messages in thread
From: Tommy Apel @ 2013-04-21 12:23 UTC
  To: stan; +Cc: Andrei Banu, linux-raid Raid

Hello, FYI I'm also getting ~68MB/s on two Intel 330s in RAID1 on
vanilla 3.8.8 and 3.9.0-rc3 when writing random data, and ~236MB/s
when writing from /dev/zero:

mdadm -C /dev/md0 -l 1 -n 2 --assume-clean --force --run /dev/sdb /dev/sdc
openssl enc -aes-128-ctr -pass pass:"$(dd if=/dev/urandom bs=128 count=1 2>/dev/null | base64)" -nosalt < /dev/zero | pv -pterb > /run/fill   # ~1.06GB/s
dd if=/run/fill of=/dev/null bs=1M count=1024 iflag=fullblock   # ~5.7GB/s
dd if=/run/fill of=/dev/md0 bs=1M count=1024 oflag=direct       # ~68MB/s
dd if=/dev/zero of=/dev/md0 bs=1M count=1024 oflag=direct       # ~236MB/s

iostat reports 100% util on both drives while doing so, with both the
deadline and noop schedulers. Doing the same with 4 threads, each offset
by 1.1GB on the disk and pinned with taskset to 4 cores, makes no
difference: still ~68MB/s with random data.
# for x in `seq 0 4`; do taskset -c $x dd if=/run/fill of=/dev/md0 bs=1M count=1024 seek=$(($x * 1024)) oflag=direct & done
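
One possible reading of the gap between the /dev/zero and random-data writes is that the drive controller compresses data before storing it. That is an assumption about the hardware, not something these numbers prove. The sketch below, in Python, only illustrates how differently the two inputs compress, using zlib as a stand-in for whatever a controller might do; the 1 MiB buffer size and the helper name `compressed_ratio` are arbitrary choices for illustration.

```python
import os
import zlib

SIZE = 1024 * 1024  # 1 MiB per buffer, an arbitrary illustrative size

zeros = bytes(SIZE)             # what /dev/zero supplies
random_data = os.urandom(SIZE)  # stand-in for the encrypted /run/fill stream

def compressed_ratio(buf):
    """Compressed size divided by original size (smaller = more compressible)."""
    return len(zlib.compress(buf)) / len(buf)

print(f"zeros:  {compressed_ratio(zeros):.4f}")
print(f"random: {compressed_ratio(random_data):.4f}")
```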

/Tommy



* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
  2013-04-21 12:23           ` Tommy Apel
@ 2013-04-21 16:48             ` Tommy Apel
  2013-04-21 19:33             ` Stan Hoeppner
  1 sibling, 0 replies; 38+ messages in thread
From: Tommy Apel @ 2013-04-21 16:48 UTC
  To: stan; +Cc: Andrei Banu, linux-raid Raid

I just did a block-size test with fio as well:

Single SSD:
# ./scst-trunk/scripts/blockdev-perftest -d -f -i 1 -j -m 10 -M 20 -s 30 -f /dev/sdb
blocksize        W   W(avg,   W(std,        W        R   R(avg,   R(std,        R
  (bytes)      (s)    MB/s)    MB/s)   (IOPS)      (s)    MB/s)    MB/s)   (IOPS)
  1048576    6.548  156.384    0.000  156.384    2.383  429.710    0.000  429.710
   524288    6.311  162.256    0.000  324.513    2.521  406.188    0.000  812.376
   262144    6.183  165.615    0.000  662.462    3.003  340.992    0.000 1363.969
   131072    6.096  167.979    0.000 1343.832    3.140  326.115    0.000 2608.917
    65536    5.973  171.438    0.000 2743.010    3.807  268.978    0.000 4303.651
    32768    5.748  178.149    0.000 5700.765    4.609  222.174    0.000 7109.568
    16384    5.693  179.870    0.000 11511.681    5.203  196.810    0.000 12595.810
     8192    6.188  165.482    0.000 21181.642    7.339  139.529    0.000 17859.654
     4096   10.190  100.491    0.000 25725.613   13.816   74.117    0.000 18973.943
     2048   25.018   40.931    0.000 20956.431   26.136   39.180    0.000 20059.994
     1024   39.693   25.798    0.000 26417.152   50.580   20.245    0.000 20731.040

RAID1 with two Intel330 SSDs:
# ./scst-trunk/scripts/blockdev-perftest -d -f -i 1 -j -m 10 -M 20 -s 30 -f /dev/md0
blocksize        W   W(avg,   W(std,         W        R   R(avg,   R(std,         R
  (bytes)      (s)    MB/s)    MB/s)    (IOPS)      (s)    MB/s)    MB/s)    (IOPS)
  1048576    7.053  145.186    0.000   145.186    2.384  429.530    0.000   429.530
   524288    6.906  148.277    0.000   296.554    2.518  406.672    0.000   813.344
   262144    6.763  151.412    0.000   605.648    2.871  356.670    0.000  1426.681
   131072    6.558  156.145    0.000  1249.161    3.166  323.437    0.000  2587.492
    65536    6.578  155.670    0.000  2490.727    3.835  267.014    0.000  4272.229
    32768    6.311  162.256    0.000  5192.204    4.379  233.843    0.000  7482.987
    16384    6.406  159.850    0.000 10230.409    5.953  172.014    0.000 11008.903
     8192    7.776  131.687    0.000 16855.967    8.621  118.780    0.000 15203.805
     4096   11.137   91.946    0.000 23538.116   14.138   72.429    0.000 18541.802
     2048   38.440   26.639    0.000 13639.126   22.512   45.487    0.000 23289.268
     1024   60.933   16.805    0.000 17208.672   43.247   23.678    0.000 24246.214

It sort of confirms that write performance goes down on RAID1, but I
would expect that anyway, since the write confirmation has to come from
both disks.
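
To put a number on that gap, here is a small sketch that takes the
W(avg, MB/s) figures from the two runs and prints how far the RAID1
result trails the single SSD, in percent. The function name
write_slowdown_pct is made up for illustration; the inputs are the
values reported for 1 MiB, 4 KiB and 1 KiB blocks.

```shell
# Percent by which RAID1 write throughput trails the single SSD.
# $1 = single-SSD W(avg) in MB/s, $2 = RAID1 W(avg) in MB/s
write_slowdown_pct() {
    awk -v single="$1" -v raid="$2" \
        'BEGIN { printf "%.1f\n", (single - raid) / single * 100 }'
}

write_slowdown_pct 156.384 145.186   # 1048576-byte blocks
write_slowdown_pct 100.491 91.946    # 4096-byte blocks
write_slowdown_pct 25.798 16.805     # 1024-byte blocks
```

With these inputs the gap is about 7% at 1 MiB, about 8.5% at 4 KiB and
roughly 35% at 1 KiB, so the mirror penalty in this run is mostly a
small-block effect.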

/Tommy


* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
  2013-04-21 12:23           ` Tommy Apel
  2013-04-21 16:48             ` Tommy Apel
@ 2013-04-21 19:33             ` Stan Hoeppner
  2013-04-21 19:56               ` Tommy Apel
  1 sibling, 1 reply; 38+ messages in thread
From: Stan Hoeppner @ 2013-04-21 19:33 UTC (permalink / raw)
  To: Tommy Apel; +Cc: Andrei Banu, linux-raid Raid

On 4/21/2013 7:23 AM, Tommy Apel wrote:
> Hello, FYI I'm getting ~68MB/s on two intel330 in RAID1 aswell on
> vanilla 3.8.8 and 3.9.0-rc3 when writing random data and ~236MB/s
> writing from /dev/zero
> 
> mdadm -C /dev/md0 -l 1 -n 2 --assume-clean --force --run /dev/sdb /dev/sdc

> openssl enc -aes-128-ctr -pass pass:"$(dd if=/dev/urandom bs=128
> count=1 2>/dev/null | base64)" -nosalt < /dev/zero | pv -pterb >
> /run/fill ~1.06GB/s

What's the purpose of all of this?  Surely not simply to create random
data, which is accomplished much more easily.  Are you sandbagging us
here with a known bug, or simply trying to show off your mad skillz?
Either way this is entirely unnecessary for troubleshooting an IO
performance issue.  dd doesn't (shouldn't) care if the bits are random
or not, though the Intel SSD controller might, as well as other layers
you may have in your IO stack.  Keep it simple so we can isolate one
layer at a time.

> dd if=/run/fill of=/dev/null bs=1M count=1024 iflag=fullblock ~5.7GB/s
> dd if=/run/fill of=/dev/md0 bs=1M count=1024 oflag=direct ~68MB/s
> dd if=/dev/zero of=/dev/md0 bs=1M count=1024 oflag=direct ~236MB/s

Noting the above, it's interesting that you omitted this test

  dd if=/run/fill of=/dev/sdb bs=1M count=1024 oflag=direct

preventing an apples to apples comparison between raw SSD device and
md/RAID1 performance with your uber random file as input.
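
A sketch of the comparison being asked for: the same input file and the
same dd flags against each layer in turn, so that only the target
changes. The helper below (dd_cmd is a made-up name) only prints the
commands and does not run them, because writing to /dev/sdb or /dev/md0
overwrites whatever is stored there, and /dev/sdb should not be written
directly while it is still an active member of /dev/md0.

```shell
# Print, without executing, the direct-I/O dd write used in this thread.
# $1 = input file, $2 = target block device
dd_cmd() {
    printf 'dd if=%s of=%s bs=1M count=1024 oflag=direct\n' "$1" "$2"
}

# Same input and flags for the raw SSD and for the md array.
for dev in /dev/sdb /dev/md0; do
    dd_cmd /run/fill "$dev"
done
```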

-- 
Stan


* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
  2013-04-21 19:33             ` Stan Hoeppner
@ 2013-04-21 19:56               ` Tommy Apel
  2013-04-22  0:47                 ` Stan Hoeppner
  0 siblings, 1 reply; 38+ messages in thread
From: Tommy Apel @ 2013-04-21 19:56 UTC (permalink / raw)
  To: stan; +Cc: Andrei Banu, linux-raid Raid

Calm the f. down. I was just handing over some information. Sorry your
day was ruined, Mr. High and Mighty; use the info for whatever you want,
but flaming me isn't going to help anyone.


* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
  2013-04-21 19:56               ` Tommy Apel
@ 2013-04-22  0:47                 ` Stan Hoeppner
  2013-04-22  7:51                   ` Tommy Apel
  0 siblings, 1 reply; 38+ messages in thread
From: Stan Hoeppner @ 2013-04-22  0:47 UTC (permalink / raw)
  To: Tommy Apel; +Cc: Andrei Banu, linux-raid Raid

On 4/21/2013 2:56 PM, Tommy Apel wrote:
> Calm the f. down, I was just handing over some information, sorry your
> day was ruined mr. high and mighty, use the info for whatever you want
> to but flaming me is't going to help anyone.

Your tantrum aside, the Intel 330, as well as all current Intel SSDs,
uses the SandForce 2281 controller.  The SF2xxx series' write
performance is limited by the compressibility of the data.  What you're
doing below is simply showcasing the write bandwidth limitation of the
SF2xxx controllers with incompressible data.

This is not relevant to md.  And it's not relevant to Andrei.  It turns
out that the Samsung 840 SSDs have consistent throughput because they
don't rely on compression.
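
A rough way to see how different the two inputs are to a compressing
controller: push 1 MiB of each through a general-purpose compressor and
compare the output sizes. gzip is only a stand-in here, an assumption
for illustration; it says nothing about how the SandForce firmware
actually compresses. The function name compressed_size is made up.

```shell
# Size in bytes of 1 MiB from the given source after gzip compression.
# $1 = source to read from, e.g. /dev/zero or /dev/urandom
compressed_size() {
    head -c 1048576 "$1" | gzip -c | wc -c | tr -d ' '
}

compressed_size /dev/zero      # zeros shrink to a tiny fraction of 1 MiB
compressed_size /dev/urandom   # random data does not shrink at all
```

Zeros collapse to almost nothing while random data stays at about its
original size, which is why a drive that depends on compression writes
/dev/zero much faster than a random fill file.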

-- 
Stan

* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
  2013-04-22  0:47                 ` Stan Hoeppner
@ 2013-04-22  7:51                   ` Tommy Apel
  2013-04-22  8:29                     ` Tommy Apel
                                       ` (2 more replies)
  0 siblings, 3 replies; 38+ messages in thread
From: Tommy Apel @ 2013-04-22  7:51 UTC (permalink / raw)
  To: stan; +Cc: Andrei Banu, linux-raid Raid

Stan>
That was exactly what I was trying to show: your results may vary
depending on the data and the backing device. As far as the raid1 goes,
it doesn't care much about the data being passed through it.

Ben>
Could you run iostat -x 2 for a few minutes, just to make sure
there is no other I/O going to the device before running your tests,
and then run the tests with fio instead of dd?

fio write test:

  fio --rw=write --filename=testfile --bs=1048576 \
      --size=4294967296 --ioengine=psync --end_fsync=1 --invalidate=1 \
      --direct=1 --name=writeperftest
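
For anyone reading the byte counts in that fio command, a quick sketch
of what they work out to. The helper name fio_write_count is made up;
the two numbers are the --bs and --size values given above.

```shell
# Number of block-sized writes needed to cover a file of the given size.
# $1 = block size in bytes (--bs), $2 = total size in bytes (--size)
fio_write_count() {
    echo $(( $2 / $1 ))
}

bs=1048576
size=4294967296
echo "block size: $(( bs / 1024 / 1024 )) MiB"
echo "file size:  $(( size / 1024 / 1024 / 1024 )) GiB"
echo "writes:     $(fio_write_count "$bs" "$size")"
```

So the job issues 4096 sequential 1 MiB writes to a 4 GiB file. fio
also accepts suffixed values, so --bs=1M --size=4G should be an
equivalent and more readable spelling.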

/Tommy


* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
  2013-04-22  7:51                   ` Tommy Apel
@ 2013-04-22  8:29                     ` Tommy Apel
  2013-04-22 10:26                     ` Andrei Banu
  2013-04-22 23:21                     ` Stan Hoeppner
  2 siblings, 0 replies; 38+ messages in thread
From: Tommy Apel @ 2013-04-22  8:29 UTC (permalink / raw)
  To: Andrei Banu, stan; +Cc: linux-raid Raid

Ben = Andrei, sorry for the typo.


* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
  2013-04-22  7:51                   ` Tommy Apel
  2013-04-22  8:29                     ` Tommy Apel
@ 2013-04-22 10:26                     ` Andrei Banu
  2013-04-22 12:02                       ` Tommy Apel
  2013-04-22 23:21                     ` Stan Hoeppner
  2 siblings, 1 reply; 38+ messages in thread
From: Andrei Banu @ 2013-04-22 10:26 UTC (permalink / raw)
  To: linux-raid

Hi,

No worries about the typo. I ran iostat -x -m 2 for a few minutes and I 
get:

- 0-500KB/s 70% of the time
- 1-2MB/s 20% of the time
- 3-4MB/s 10% of the time.

It never went beyond 4MB/s write speed. But I guess none of this 
qualifies as a heavy write. Right?

Can the fio test be run safely on an active production server, exactly
as you gave it?

Thanks!
Andrei


* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
  2013-04-22 10:26                     ` Andrei Banu
@ 2013-04-22 12:02                       ` Tommy Apel
  2013-04-23  2:59                         ` Stan Hoeppner
  0 siblings, 1 reply; 38+ messages in thread
From: Tommy Apel @ 2013-04-22 12:02 UTC (permalink / raw)
  To: Andrei Banu, stan; +Cc: linux-raid Raid

Yes, it can be run as it is; it will write to the file given by --filename=

Well, from what I make of it so far I wouldn't rule out the bad device
part, but at the same time there could be other things involved,
although I don't believe it to be the md part.

Stan> do you know anything about the state of ext4 on CentOS 6.x?

/Tommy


* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
  2013-04-22 12:02                       ` Tommy Apel
@ 2013-04-23  2:59                         ` Stan Hoeppner
  0 siblings, 0 replies; 38+ messages in thread
From: Stan Hoeppner @ 2013-04-23  2:59 UTC (permalink / raw)
  To: Tommy Apel; +Cc: Andrei Banu, linux-raid Raid

On 4/22/2013 7:02 AM, Tommy Apel wrote:
> Yes it can be run as it is, it will write to the file given by --filename=
> 
> well from what I make of it so far I wouldn't rule out the bad device
> part but at the same time there could be other things involved
> although I don't belive it to be the md part
> 
> Stan> do you know anything about the state of ext4 on centos 6.x ?

Enough to assume it's not part of the problem here.  Andrei's hdparm
throughput, measured below the filesystem layer, is bouncing up and down
by ~100MB/s depending on when he runs it.

If he's using LVM and has active snapshots that would definitely cause
some extra load, but in that case given his 3 RAID1 pairs it should
affect both drives equally.  And that's not what we're seeing.

I hope my last post gets him closer to identifying the problem.  The
perf top and iotop data taken during a $bigfile copy should be instructive.

-- 
Stan


* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
  2013-04-22  7:51                   ` Tommy Apel
  2013-04-22  8:29                     ` Tommy Apel
  2013-04-22 10:26                     ` Andrei Banu
@ 2013-04-22 23:21                     ` Stan Hoeppner
  2 siblings, 0 replies; 38+ messages in thread
From: Stan Hoeppner @ 2013-04-22 23:21 UTC (permalink / raw)
  To: Tommy Apel; +Cc: Andrei Banu, linux-raid Raid

On 4/22/2013 2:51 AM, Tommy Apel wrote:
> Stan>
> That was exactly what I was trying to show, that you result may vary
> depending on data and backing device, as far as the raid1 goes it
> doesn't care much for the data beeing passed through it.

As I mentioned, this is true of the SandForce 2nd gen ASICs, maybe some
others.  The Samsung SSDs use a home grown Samsung controller which
doesn't do compression.  Its performance doesn't vary due to data
content.  Thus the performance gap you demonstrated doesn't apply to Andrei.

We can eliminate this as a possible cause of his apparently horrible
performance.  And I think we can eliminate the regression in 2.6.32 as
that patch seems to be included in his kernel, otherwise he'd likely not
get 260MB/s in his dd raw read tests.  The mystery continues...

-- 
Stan


* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
  2013-04-20 23:26       ` Andrei Banu
  2013-04-21  2:48         ` Stan Hoeppner
@ 2013-04-25 11:38         ` Thomas Jarosch
  1 sibling, 0 replies; 38+ messages in thread
From: Thomas Jarosch @ 2013-04-25 11:38 UTC (permalink / raw)
  To: Andrei Banu; +Cc: linux-raid

On Sunday, 21. April 2013 02:26:26 Andrei Banu wrote:
> They are connected through SATA2 ports (this does explain the read speed
> but not the pitiful write one) in AHCI.

So the SATA controller is already in AHCI mode. Good.

You didn't say what kind of server hardware you are using or I missed it.
On the HP DL3xxx servers we usually use, we have to enable AHCI mode _and_ 
the write cache in the BIOS. Maybe your server needs something similar.

Some RAID controllers only allow you to enable the write cache
when a battery-backed write cache module is installed.

HTH,
Thomas


* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
       [not found] ` <CAH3kUhEaZGON=fAyVMZOz5fH_DcfKv=hCa96UCeK4pN7k81c_Q@mail.gmail.com>
       [not found]   ` <51725458.7020109@redhost.ro>
@ 2013-04-20 23:26   ` Andrei Banu
  1 sibling, 0 replies; 38+ messages in thread
From: Andrei Banu @ 2013-04-20 23:26 UTC (permalink / raw)
  To: linux-raid

Hi,

I ran 'iostat -d 3' during a "heavy" (540MB) copy. The copy took a bit 
over a minute and finished at less than 9MB/s. These are some of the 
results (this does NOT include the first batch, i.e. the average since 
startup):

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             503.00      1542.67     28157.33       4628      84472
sdb              66.00        72.00     13162.67        216      39488
md1             373.00      1492.00         0.00       4476          0
md2            6951.67       126.67     27734.67        380      83204
md0               0.00         0.00         0.00          0          0

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda              56.67        20.00      1177.50         60       3532
sdb              47.33        12.00     10824.17         36      32472
md1               0.67         2.67         0.00          8          0
md2             322.00        25.33      1266.67         76       3800
md0               0.00         0.00         0.00          0          0

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             122.00        16.00     45773.33         48     137320
sdb              96.67        14.67     19472.00         44      58416
md1               0.00         0.00         0.00          0          0
md2           11431.00        32.00     45684.00         96     137052
md0               0.00         0.00         0.00          0          0

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               0.00         0.00         0.00          0          0
sdb              13.67         8.00      5973.33         24      17920
md1               0.00         0.00         0.00          0          0
md2               2.00         8.00         0.00         24          0
md0               0.00         0.00         0.00          0          0
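
To make the four samples above easier to compare, here is a minimal sketch that totals the kB_wrtn column per device. The values are copied from the tables; the helper name is made up for illustration. Since these are only "some of the results", the samples may not be contiguous, so the totals are indicative at best. In a healthy two-way mirror both members would be expected to receive roughly the same write volume.

```python
# kB_wrtn per 3-second interval, copied from the four samples above.
samples = {
    "sda": [84472, 3532, 137320, 0],
    "sdb": [39488, 32472, 58416, 17920],
    "md2": [83204, 3800, 137052, 0],
}


def total_written_mb(kb_per_interval):
    """Sum per-interval kB_wrtn values and convert to MB (1 MB = 1024 kB)."""
    return sum(kb_per_interval) / 1024.0


totals = {dev: total_written_mb(vals) for dev, vals in samples.items()}
for dev, mb in sorted(totals.items()):
    print(f"{dev}: {mb:.1f} MB written over the four samples")

# How far sdb trails sda within this window.
lag_mb = totals["sda"] - totals["sdb"]
print(f"sdb trails sda by {lag_mb:.1f} MB")
```

If sda tracks md2 closely while sdb lags, that would point at one member completing writes more slowly than the other, which is worth checking against the hdparm numbers.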

This is the "normal" iostat taken after 10 minutes (this DOES include the 
first batch, i.e. the average since startup):

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             281.83       973.99       641.55  212615675  140045467
sdb             215.51       665.94       641.55  145369465  140045467
md1               1.18         2.17         2.56     473492     558452
md2             470.71      1596.29       638.01  348460340  139272912
md0               0.08         0.27         0.00      59983        171

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda              41.67       237.33       133.67        712        401
sdb              39.33        90.67       133.67        272        401
md1               0.00         0.00         0.00          0          0
md2              83.00       328.00       133.33        984        400
md0               0.00         0.00         0.00          0          0

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda              29.33         2.67       110.00          8        330
sdb              29.33         2.67       110.00          8        330
md1               0.00         0.00         0.00          0          0
md2              28.67         5.33       109.33         16        328
md0               0.00         0.00         0.00          0          0

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             175.67         1.33       747.50          4       2242
sdb             182.00        56.00       747.50        168       2242
md1               0.00         0.00         0.00          0          0
md2             191.33        57.33       746.67        172       2240
md0               0.00         0.00         0.00          0          0

Best regards!

On 20/04/2013 3:59 AM, Roberto Spadim wrote:
> run some kind of iostat -d 1 -k and check the write/read  iops and kb/s

* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
  2013-04-19 22:58 Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO Andrei Banu
       [not found] ` <CAH3kUhEaZGON=fAyVMZOz5fH_DcfKv=hCa96UCeK4pN7k81c_Q@mail.gmail.com>
@ 2013-04-21  0:10 ` Stan Hoeppner
       [not found] ` <51732E2B.6090607@hardwarefreak.com>
  2013-04-23  6:01 ` Stan Hoeppner
  3 siblings, 0 replies; 38+ messages in thread
From: Stan Hoeppner @ 2013-04-21  0:10 UTC (permalink / raw)
  To: Andrei Banu, Linux RAID

Forgot to CC the list.  Sorry for the dup, Andrei.

On 4/19/2013 5:58 PM, Andrei Banu wrote:

> I come to you with a difficult problem. We have a server otherwise
> snappy fitted with mdraid-1 made of Samsung 840 PRO SSDs. If we copy a
> larger file to the server (from the same server, from net doesn't
> matter) the server load will increase from roughly 0.7 to over 100 (for
> several GB files). Apparently the reason is that the raid can't write well.
...
> 547682517 bytes (548 MB) copied, 7.99664 s, 68.5 MB/s
> 547682517 bytes (548 MB) copied, 52.1958 s, 10.5 MB/s
> 547682517 bytes (548 MB) copied, 75.3476 s, 7.3 MB/s
> 1073741824 bytes (1.1 GB) copied, 61.8796 s, 17.4 MB/s
> Timing buffered disk reads:  654 MB in  3.01 seconds = 217.55 MB/sec
> Timing buffered disk reads:  272 MB in  3.01 seconds =  90.44 MB/sec
> Timing O_DIRECT disk reads:  788 MB in  3.00 seconds = 262.23 MB/sec
> Timing O_DIRECT disk reads:  554 MB in  3.00 seconds = 184.53 MB/sec
...

Obviously this is frustrating, but the fix should be pretty easy.

> O/S: CentOS 6.4 / 64 bit (2.6.32-358.2.1.el6.x86_64)

I'd guess your problem is the following regression.  I don't believe
this regression is fixed in Red Hat 2.6.32-* kernels:

http://www.archivum.info/linux-ide@vger.kernel.org/2010-02/00243/bad-performance-with-SSD-since-kernel-version-2.6.32.html

After I discovered this regression and recommended Adam Goryachev
upgrade from Debian 2.6.32 to 3.2.x, his SSD RAID5 throughput increased
by a factor of 5, though much of this was due to testing methods.  His raw
SSD throughput more than doubled per drive.  The thread detailing this
is long but is a good read:

http://marc.info/?l=linux-raid&m=136098921212920&w=2

-- 
Stan



[parent not found: <51732E2B.6090607@hardwarefreak.com>]

* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
       [not found] ` <51732E2B.6090607@hardwarefreak.com>
@ 2013-04-21 20:46   ` Andrei Banu
  2013-04-21 23:17     ` Stan Hoeppner
  0 siblings, 1 reply; 38+ messages in thread
From: Andrei Banu @ 2013-04-21 20:46 UTC (permalink / raw)
  To: linux-raid

Hello,

At this point I should probably state that I am not an experienced 
sysadmin. Knowing this, I do have a server management company, but they 
said they don't know what to do, so now I am trying to fix things myself 
even though I am something of a noob. I normally keep my actions to 
cautious config changes and testing. I have never done a kernel update. 
Is there an easy way to do this?

Regarding your second piece of advice (to purchase a decent HBA), I have 
already thought about it, but I guess it comes with its own drivers that 
need to be built into the initramfs, etc. So I am trying to replace the 
baseboard with one that supports SATA3, to avoid any configuration 
changes (the old board has the C202 chipset and the new one has the C204, 
so I guess this replacement is as simple as it gets: just remove the old 
board and plug in the new one, without any software changes or 
recompiles). Again, I need to say this server is in production and I 
can't move the data or the users. I can have a few hours of downtime 
during the night, but that's about all.

Regarding the kernel upgrade, do we need to compile one from source, or 
is there an easier way?

Thanks!


* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
  2013-04-21 20:46   ` Andrei Banu
@ 2013-04-21 23:17     ` Stan Hoeppner
  2013-04-22 10:19       ` Andrei Banu
                         ` (2 more replies)
  0 siblings, 3 replies; 38+ messages in thread
From: Stan Hoeppner @ 2013-04-21 23:17 UTC (permalink / raw)
  To: Andrei Banu; +Cc: linux-raid

On 4/21/2013 3:46 PM, Andrei Banu wrote:
> Hello,
> 
> At this point I probably should state that I am not an experienced
> sysadmin. 

Things are becoming more clear now.

> Knowing this, I do have a server management company but they
> said they don't know what to do 

So you own this hardware and it is colocated, correct?

> so now I am trying to fix things myself
> but I am something of a noob. I normally try to keep my actions to
> cautious config changes and testing. 

Why did you choose CentOS?  Was this installed by the company?

> I have never done a kernel update.
> Any easy way to do this?

It may not be necessary, at least not to solve any SSD performance
problems.  Reexamining your numbers shows you hit 262MB/s from /dev/sda.
That's close to 90% of the ~300MB/s a SATA2 link can carry, so this
kernel probably does have the patch.  Your problem lies elsewhere.
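
The share of the SATA2 link that 262MB/s represents depends on the ceiling assumed. The sketch below works it out under the commonly quoted figures of a 3 Gbit/s line rate and 8b/10b encoding; neither figure is stated in the thread, and the function name is made up for illustration.

```python
def sata_payload_mb_per_s(line_rate_gbit):
    """Usable payload rate for a SATA link, assuming 8b/10b line encoding."""
    bits_per_s = line_rate_gbit * 1e9
    payload_bits_per_s = bits_per_s * 8 / 10  # 8 data bits per 10 line bits
    return payload_bits_per_s / 8 / 1e6       # bits -> bytes -> MB


observed_mb_per_s = 262.23                    # hdparm --direct -t /dev/sda
ceiling = sata_payload_mb_per_s(3.0)          # SATA2, assumed 3 Gbit/s
utilisation = observed_mb_per_s / ceiling
print(f"ceiling {ceiling:.0f} MB/s, utilisation {utilisation:.0%}")
```

Whatever ceiling is chosen, the read rate is a large share of what the link can carry, which is the point being made about the regression patch.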

> Regarding your second advice (to purchase a decent HBA) I have already
> thought about it but I guess it comes with its own drivers that need to
> be compiled into initramfs etc. 

The default CentOS (RHEL) initramfs should include mptsas, which
supports all the LSI HBAs.  The LSI caching RAID cards are supported as
well with megaraid_sas.

The question is, do you really need more than the ~260MB/s of peak
throughput you currently have?  And is it worth the hassle?

> So I am trying to replace the baseboard
> with one with SATA3 support to avoid any configuration changes (the old
> board has the C202 chipset and the new one has C204 so I guess this
> replacement is as simple as it gets - just remove the old board and plug
> the new one without any software changes or recompiles). Again I need to
> say this server is in production and I can't move the data or the users.
> I can have a few hours downtime during the night but that's about all.

It's not clear your problem is hardware bandwidth.  In fact it seems the
problem lies elsewhere.  It may simply be that you're running these tests
while other substantial IO is occurring.  Actually, your numbers show
this is exactly the case.  What they don't show is how much other IO is
hitting the SSDs while you're running your tests.
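
One way to put a number on that background IO, offered as a minimal sketch and not something used in this thread: take two snapshots of /proc/diskstats and turn the sector counters into MB/s per device. The function names are hypothetical; the field positions follow the kernel's documented diskstats layout, where sectors are always 512 bytes.

```python
import os
import time

SECTOR_BYTES = 512  # /proc/diskstats always counts 512-byte sectors


def parse_diskstats(text):
    """Map device name -> (sectors_read, sectors_written)."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 10:
            continue  # skip blank or truncated lines
        # fields: major minor name reads merged sectors_read ms
        #         writes merged sectors_written ...
        stats[fields[2]] = (int(fields[5]), int(fields[9]))
    return stats


def throughput_mb_s(before, after, interval_s):
    """Per-device (read MB/s, write MB/s) between two snapshots."""
    rates = {}
    for name, (r1, w1) in after.items():
        if name not in before:
            continue
        r0, w0 = before[name]
        rates[name] = ((r1 - r0) * SECTOR_BYTES / 1e6 / interval_s,
                       (w1 - w0) * SECTOR_BYTES / 1e6 / interval_s)
    return rates


def read_snapshot(path="/proc/diskstats"):
    """Read and parse the current counters (Linux only)."""
    with open(path) as f:
        return parse_diskstats(f.read())


if __name__ == "__main__" and os.path.exists("/proc/diskstats"):
    first = read_snapshot()
    time.sleep(3)
    for dev, (rd, wr) in sorted(throughput_mb_s(first, read_snapshot(), 3).items()):
        print(f"{dev}: read {rd:.2f} MB/s, write {wr:.2f} MB/s")
```

Running it once just before a dd test and once during it would show how much of the device traffic belongs to the test and how much was already there.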

> Regarding the kernel upgrade, do we need to compile one from source or
> there's an easier way?

I don't believe at this point you need a new kernel to fix the problem
you have.  If this patch were not present you'd not be able to get
260MB/s from SATA2.  Your problem lies elsewhere.

In the future, instead of making a post saying "md is slow, my SSDs are
slow" and pasting test data which appears to back that claim, you'd be
better served by describing a general problem, such as "users say the
system is slow and I think it may be md or SSD related".  This way we
don't waste time following a troubleshooting path based on incorrect
assumptions, as we've done here.  Or at least as I've done here, as I'm
the only one assisting.

Boot all users off the system, shut down any daemons that may generate
any meaningful load on the disks or CPUs.  Disable any encryption or
compression.  Then rerun your tests while completely idle.  Then we'll
go from there.

-- 
Stan


* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
  2013-04-21 23:17     ` Stan Hoeppner
@ 2013-04-22 10:19       ` Andrei Banu
  2013-04-23  2:51         ` Stan Hoeppner
  2013-04-22 23:11       ` Andrei Banu
  2013-04-22 23:25       ` Stan Hoeppner
  2 siblings, 1 reply; 38+ messages in thread
From: Andrei Banu @ 2013-04-22 10:19 UTC (permalink / raw)
  To: linux-raid

Hello!

First off, allow me to apologize if my rambling sent you in the wrong 
direction, and thank you for assisting.

Most of the data I supplied was background information. Let me start 
fresh, but first allow me to answer your explicit questions:

1. Yes, I own the hardware and it's colocated in a datacenter.
2. I am quite happy with 260MB/s read for SATA2. I think that's decent 
and I never meant it as a problem.
3. I ran iostat -x -m 2 for a few minutes, and from what I see the 
normal write rate is about 0-500KB/s; sometimes it reaches 1-2MB/s and 
rarely 3-4MB/s.
4. I will redo the tests during off-peak hours, when I can afford to 
shut down various services.

The actual problem is that when I write any larger file (hundreds of MB 
or more) to the server, from the network or from the server itself, the 
server starts to overload. The load can exceed 100 for files of ~5GB. 
This server has an average load of 0.52 (sar -q), but it can spike to 
3-digit loads within a few minutes of creating or downloading a larger 
cPanel backup file. I have to rely only on R1Soft for backups right now 
because the normal cPanel backups make the server unstable when they 
back up accounts over 1GB (of which there are many).

So I concluded this is due to very low write speeds, and I ran the 'dd' 
tests to check that assumption. I don't think the problem is that I ran 
these tests during other I/O-intensive tasks. It's as if the SSDs 
themselves overload after a certain number of megabytes written at a 
time. During off-peak hours I can sometimes get a decent speed (60-100MB/s 
write), but if I redo the test soon afterwards (tens of seconds to a few 
minutes later) I get much lower write speeds (under 10MB/s).

Or maybe the write speed itself is not the problem, but rather the fact 
that the server seems to stop doing anything else while I write a large 
file. So the speed test results are poor AND the server overloads. A 
lot! Most write results are in the 10-20MB/s range. I have very rarely 
seen more than 25MB/s, and I was almost never able to reproduce such a 
result within the same hour. With a 'dd' test using a 'bs' of 2-4MB I 
sometimes get good results (40-60MB/s), but never with a 'bs' of 1GB 
(the top speed I got with a 1G 'bs' was 27MB/s, during the night). The 
essential problem is that this server can't copy large files without 
seriously overloading itself.
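
Because the results swing so much from one run to the next, a single dd figure says little. A minimal sketch of a repeatable write test follows; it is not what was run on this server, and the function names, sizes and pause are arbitrary choices for illustration. It uses os.fsync so that flush time is included in the measurement, comparable in spirit to dd's conv=fdatasync.

```python
import os
import tempfile
import time


def timed_write(path, total_bytes, block_size):
    """Write total_bytes of zeros to path, flush to disk, return MB/s."""
    block = b"\0" * block_size
    written = 0
    start = time.perf_counter()
    with open(path, "wb") as f:
        while written < total_bytes:
            chunk = block[: total_bytes - written]
            f.write(chunk)
            written += len(chunk)
        f.flush()
        os.fsync(f.fileno())  # include the time needed to reach the device
    elapsed = time.perf_counter() - start
    return written / elapsed / 1e6


def repeat_write_test(directory, runs=3, total_bytes=8 * 1024 * 1024,
                      block_size=64 * 1024, pause_s=0.0):
    """Run the write test several times and return the MB/s of each run."""
    results = []
    for _ in range(runs):
        fd, path = tempfile.mkstemp(dir=directory)
        os.close(fd)
        try:
            results.append(timed_write(path, total_bytes, block_size))
        finally:
            os.remove(path)
        time.sleep(pause_s)
    return results


if __name__ == "__main__":
    rates = repeat_write_test(tempfile.gettempdir())
    print("MB/s per run:", [round(r, 1) for r in rates])
    print("min/max:", round(min(rates), 1), round(max(rates), 1))
```

Pointing it at a directory on md2 with a larger total_bytes and a pause of some tens of seconds between runs would show whether the slowdown on repeated writes is reproducible.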

Now let me elaborate why I have given the read speeds (as I am not 
unhappy with them):
1. Some said the low write speed might be due to a bad cable. So I 
stated the 260MB/s read speed to show that it's probably not a bad 
cable: if the link can push 260MB/s, the cable is probably fine.
2. I have observed a very big difference between /dev/sda and /dev/sdb 
and I thought it might be indicative of a problem somewhere. If I run 
hdparm -t /dev/sda I get about 215MB/s, but on /dev/sdb I get about 
80-90MB/s. Only if I add the --direct flag do I get 260MB/s for /dev/sda. 
Previously, when I added --direct for /dev/sdb, I was getting about 
180MB/s, but now I get ~85MB/s with or without --direct.

root [/]# hdparm -t /dev/sdb
Timing buffered disk reads:  262 MB in  3.01 seconds =  86.92 MB/sec

root [/]# hdparm --direct -t /dev/sdb
Timing O_DIRECT disk reads:  264 MB in  3.08 seconds =  85.74 MB/sec

This is something new. /dev/sdb no longer gets to nearly 200MB/s (with 
--direct) but stays under 100MB/s in all cases. Maybe it is indeed a 
problem with the cable or with the device itself.

And an update 30 minutes later: /dev/sdb returned to 90MB/s read speed 
WITHOUT --direct and 180MB/s WITH --direct. /dev/sda is constant (215MB/s 
without --direct and 260MB/s with --direct). What do you make of this?

Kind regards!


* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
  2013-04-22 10:19       ` Andrei Banu
@ 2013-04-23  2:51         ` Stan Hoeppner
  2013-04-23 10:17           ` Andrei Banu
  0 siblings, 1 reply; 38+ messages in thread
From: Stan Hoeppner @ 2013-04-23  2:51 UTC (permalink / raw)
  To: Andrei Banu; +Cc: linux-raid

On 4/22/2013 5:19 AM, Andrei Banu wrote:
> Hello!
> 
> First off allow me to apologize if my rambling sent you in the wrong
> direction and thank you for assisting.

No harm done, and you're welcome.

> The actual problem is that when I write any larger file hundreds of MB
> or more to the server (from network or from the same server) the server
> starts to overload. The server can overload to over 100 for files of ~
> 5GB. I mean this server has an average load of 0.52 (sar -q) but it can
> spike to 3 digit server loads in a few minutes from making or
> downloading a larger cPanel backup file. I have to rely only on R1Soft
> for backups right now because the normal cPanel backups make the server
> unstable when it backs up accounts over 1GB (many).

Describing this problem in terms of load average isn't very helpful.
What would help is 'perf top -U' output, so we can see what is eating CPU,
captured simultaneously with 'iotop', so we can see what is eating IO.

> So I concluded this is due to very low write speeds so I ran the 'dd'

It's most likely that the low disk throughput is a symptom of the
problem, which is lurking elsewhere awaiting discovery.

> 1. Some said the low write speed might be due to a bad cable. 

Very unlikely, but possible.  This is easy to verify.  Does dmesg show
hundreds of "hard resetting link" messages?
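
The dmesg check above can be scripted.  What follows is a minimal sketch,
assuming the kernel log uses the libata wording "hard resetting link";
the function name count_link_resets is made up for illustration and is
not a standard tool.

```shell
# Count libata link-reset messages in kernel log text read from stdin.
# count_link_resets is a hypothetical helper name, not a standard tool.
count_link_resets() {
    # grep -c prints the number of matching lines.  It exits non-zero when
    # the count is 0, so mask that to keep the function usable under set -e.
    grep -c 'hard resetting link' || true
}
```

Typical use would be `dmesg | count_link_resets`.  A count of zero does not
prove the cabling is good, but a large count points at the link.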

> 2. I have observed a very big difference between /dev/sda and /dev/sdb
> and I thought it might be indicative of a problem somewhere. If I run
> hdparm -t /dev/sda I get about 215MB/s but on /dev/sdb I get about
> 80-90MB/s. Only if I add --direct flag I get 260MB/s for /dev/sda.
> Previously when I added --direct for /dev/sdb I was getting about
> 180MB/s but now I get ~85MB/s with or without --direct.

I simply chalked up the difference to IO load variance between test runs
of hdparm.  If one SSD is always that much slower there may be a problem
with the drive or controller but it's not likely.  If you haven't
already, swap the cable on the slow drive with a new one.  In fact, SATA
cables are cheap as dirt so I'd swap them both just for peace of mind.

> root [/]# hdparm -t /dev/sdb
> Timing buffered disk reads:  262 MB in  3.01 seconds =  86.92 MB/sec
> 
> root [/]# hdparm --direct -t /dev/sdb
> Timing O_DIRECT disk reads:  264 MB in  3.08 seconds =  85.74 MB/sec
...
> This is something new. /dev/sdb no longer gets to nearly 200MB/s (with
> --direct) but stays under 100MB/s in all cases. Maybe indeed it's a
> problem with the cable or with the device itself.
...
> And a 30 minutes later update: /dev/sdb returned to 90MB/s read speed
> WITHOUT --direct and 180MB/s WITH --direct. /dev/sda is constant (215
> without --direct and 260 with --direct). What do you make of this?

Show your partition tables again.  My gut instinct tells me you have a
swap partition on /dev/sdb, and/or some other partition that is not part
of the RAID1, nor equally present on /dev/sda, that is/are being
accessed heavily at some times and not others, thus the throughput
discrepancy.

If this is the case, and the kernel is low on RAM due to an application
memory leak or just normal process load, that swap partition may become
critical.  When you start the $big_file copy, the kernel goes into
overdrive swapping and/or dropping cache to make room for $big_file in
the write buffers.  This could explain both your triple digit system
load and the decreased throughput on /dev/sdb.

The fdisk output you provided previously showed only 3 partitions per
SSD, all RAID autodetect, all in md/RAID1 I assume.  However, the
symptoms you're reporting tend to suggest the partition layout I just
described, and could be responsible for the odd up/down throughput on sdb.
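
One way to test the swap hypothesis above is to list the active swap
devices and see whether any of them is a raw SSD partition rather than
an md device.  This is a sketch only, assuming the usual /proc/swaps
layout (one header line, device path in the first column); swap_devices
is a hypothetical helper name.

```shell
# List active swap devices, one per line, from /proc/swaps-style input.
# swap_devices is a hypothetical helper; pass a file, or it reads /proc/swaps.
swap_devices() {
    # The first line of /proc/swaps is a column header; field 1 is the path.
    awk 'NR > 1 { print $1 }' "${1:-/proc/swaps}"
}
```

Running `swap_devices` and comparing the result with `cat /proc/mdstat`
shows where swap lives.  An entry such as /dev/sdb2 instead of /dev/md1
would mean swap sits directly on one SSD.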

-- 
Stan


* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
  2013-04-23  2:51         ` Stan Hoeppner
@ 2013-04-23 10:17           ` Andrei Banu
  2013-04-24  3:24             ` Stan Hoeppner
  0 siblings, 1 reply; 38+ messages in thread
From: Andrei Banu @ 2013-04-23 10:17 UTC (permalink / raw)
  Cc: linux-raid

Hi,

I am sorry for the very long email. And thanks a lot for all your patience.

1. DMESG doesn't show any "hard resetting link" at all.

2. The SSDs are connected to ATA 0 and ATA1. The server is brand new (or 
at least it should be).

3. Partition table:

root [~]# cat /etc/fstab
# Created by anaconda on Wed Apr  3 17:22:52 2013

UUID=8fedde2c-f5b7-4edf-975f-d8d087d79ebf       /       ext4 noatime,usrjquota=quota.user,jqfmt=vfsv0        1       1
UUID=bfc50d02-6d4d-4510-93ea-27941cd49cf4 /boot ext4    noatime,defaults        1 2
UUID=cef1d19d-2578-43db-9ffc-b6b70e227bfa swap swap    defaults        0 0
tmpfs                   /dev/shm                tmpfs defaults        0 0
devpts                  /dev/pts                devpts gid=5,mode=620  0 0
sysfs                   /sys                    sysfs defaults        0 0
proc                    /proc                   proc defaults        0 0
/usr/tmpDSK             /tmp                    ext3 noatime,defaults,noauto        0 0

root [~]# cat /etc/mdadm.conf
# mdadm.conf written out by anaconda
MAILADDR root
AUTO +imsm +1.x -all
ARRAY /dev/md0 level=raid1 num-devices=2 UUID=8a4b7005:a4f71a13:7d4659cf:104f9a4f
ARRAY /dev/md1 level=raid1 num-devices=2 UUID=ead5b5ca:9f5397a2:3b488cbe:11eb8bdb
ARRAY /dev/md2 level=raid1 num-devices=2 UUID=44efd14d:8bcd26d4:4d1fda9f:a4b5fe14

root [/]# mount
/dev/md2 on / type ext4 (rw,noatime,usrjquota=quota.user,jqfmt=vfsv0)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw,rootcontext="system_u:object_r:tmpfs_t:s0")
/dev/md0 on /boot type ext4 (rw,noatime)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
/usr/tmpDSK on /tmp type ext3 (rw,noexec,nosuid,loop=/dev/loop0)
/tmp on /var/tmp type none (rw,noexec,nosuid,bind)

And now the tests you indicated:

4.
root [/]# echo 3 > /proc/sys/vm/drop_caches
root [~]# time cp largefile.tar.gz test03.tmp; time sync;

(this is probably when the file is read into some swap/cache)
real    0m3.052s
user    0m0.010s
sys     0m0.612s

(this is probably when the file is actually written)
real    1m2.570s
user    0m0.000s
sys     0m0.011s

root [/]# echo 3 > /proc/sys/vm/drop_caches
root [~]# time cp largefile.tar.gz test04.tmp;

real    0m3.848s
user    0m0.004s
sys     0m0.634s

After about 15 seconds the server load started to increase from 1, 
spiked to 40 in about a minute and then it started decreasing.

5. The perf top -U output during a dd copy:

Samples: 2M of event 'cycles', Event count (approx.): 19505138470
   9.10%  [kernel]             [k] page_fault
   5.56%  [kernel]             [k] clear_page_c_e
   3.29%  [kernel]             [k] list_del
   2.51%  [kernel]             [k] unmap_vmas
   2.50%  [kernel]             [k] __mem_cgroup_commit_charge
   2.50%  [kernel]             [k] mem_cgroup_update_file_mapped
   2.26%  [kernel]             [k] port_inb
   1.89%  [kernel]             [k] shmem_getpage_gfp
   1.78%  [kernel]             [k] _spin_lock
   1.72%  [kernel]             [k] __alloc_pages_nodemask
   1.67%  [kernel]             [k] __mem_cgroup_uncharge_common
   1.61%  [kernel]             [k] free_pcppages_bulk
   1.59%  [kernel]             [k] get_page_from_freelist
   1.56%  [kernel]             [k] alloc_pages_vma
   1.37%  [kernel]             [k] get_page
   1.26%  [kernel]             [k] release_pages
   1.22%  [kernel]             [k] radix_tree_lookup_slot
   1.19%  [kernel]             [k] lookup_page_cgroup
   1.11%  [kernel]             [k] handle_mm_fault
   0.98%  [kernel]             [k] __wake_up_bit
   0.98%  [kernel]             [k] copy_page_c
   0.97%  [kernel]             [k] __d_lookup
   0.94%  [kernel]             [k] __do_fault
   0.92%  [kernel]             [k] free_hot_cold_page
   0.80%  [kernel]             [k] find_vma

6. iotop is very dynamic and I am afraid the data I am providing will be 
unclear but let me give a number of snapshots from during the large file 
copy and maybe you can make something of it (samples a few seconds apart):

Total DISK READ: 15.39 K/s | Total DISK WRITE: 169.29 M/s
   TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO COMMAND
  4236 be/4 nobody      0.00 B/s    0.00 B/s  0.00 %  0.00 % [httpd]
  4662 be/4 nobody      0.00 B/s    0.00 B/s  0.00 %  0.00 % [httpd]
31126 be/4 mysql       0.00 B/s   46.17 K/s  0.00 %  0.00 % mysqld --basedir=/ --datadir=/var/lib/mysql --user=mysql --log-error=/var/lib/mysql/server.err --open-files-limit=50000 --pid-file=/var/$
  4971 be/4 nobody      0.00 B/s   23.08 K/s  0.00 %  0.00 % [httpd]
  5284 be/4 nobody      0.00 B/s    7.69 K/s  0.00 %  0.00 % [httpd]
  9522 be/4 user    7.69 K/s   38.47 K/s  0.00 %  0.00 % spamd child
  5547 be/4 nobody      0.00 B/s    7.69 K/s  0.00 %  0.00 % [httpd]

!!!!!! 6085 be/4 root        7.69 K/s 1004.85 M/s  0.00 %  0.00 % dd if=largefile.tar.gz of=test10 oflag=sync bs=1G

Total DISK READ: 7.71 K/s | Total DISK WRITE: 29.91 M/s
   TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO COMMAND
   506 be/4 root        0.00 B/s    0.00 B/s  0.00 % 99.99 % [md2_raid1]
30861 be/4 root        0.00 B/s    7.71 K/s  0.00 %  0.00 % httpd -k start -DSSL
31346 be/4 root        0.00 B/s    7.71 K/s  0.00 %  0.00 % tailwatchd
  1457 be/3 root        0.00 B/s    7.71 K/s  0.00 %  0.00 % auditd
  5914 be/4 root        7.71 K/s    0.00 B/s  0.00 %  0.00 % cpanellogd - scanning logs
  6085 be/4 root        0.00 B/s    7.71 K/s  0.00 %  0.00 % dd if=largefile.tar.gz of=test10 oflag=sync bs=1G

Total DISK READ: 0.00 B/s | Total DISK WRITE: 29.30 M/s
   TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO COMMAND
  9522 be/4 user    0.00 B/s    0.00 B/s  0.00 % 99.99 % spamd child
   506 be/4 root        0.00 B/s    0.00 B/s  0.00 % 99.99 % [md2_raid1]
31346 be/4 root        0.00 B/s    7.73 K/s  0.00 %  0.00 % tailwatchd
  1397 be/4 root        0.00 B/s    7.73 K/s  0.00 %  0.00 % [flush-9:2]
  6085 be/4 root        0.00 B/s   15.45 K/s  0.00 %  0.00 % dd if=largefile.tar.gz of=test10 oflag=sync bs=1G

Total DISK READ: 12.43 K/s | Total DISK WRITE: 5.96 M/s
   TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO COMMAND
  5914 be/4 root        0.00 B/s    0.00 B/s  0.00 % 99.99 % cpanellogd - setting up logs for promusic
  6101 be/4 mailnull    0.00 B/s  353.61 B/s  0.00 % 99.99 % exim -bd -q1h
  6107 be/4 user     0.00 B/s    0.00 B/s  0.00 % 99.99 % pop3
  6124 be/4 nobody      0.00 B/s  353.61 B/s  0.00 % 99.99 % httpd -k start -DSSL
  9522 be/4 user 1060.83 B/s  184.06 K/s  0.00 % 99.99 % spamd child
  1669 be/4 root        0.00 B/s    2.42 K/s  0.00 % 99.99 % rsyslogd -i /var/run/syslogd.pid -c 5
  1235 be/4 root        0.00 B/s    2.42 K/s  0.00 % 98.28 % [kjournald]
   506 be/4 root        0.00 B/s    0.00 B/s  0.00 % 28.46 % [md2_raid1]
   541 be/3 root        0.00 B/s   34.04 M/s  0.00 %  3.43 % [jbd2/md2-8]

Total DISK READ: 303.21 K/s | Total DISK WRITE: 60.64 M/s
   TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO COMMAND
  1235 be/4 root        0.00 B/s   60.64 K/s  0.00 % 99.99 % [kjournald]
   541 be/3 root        0.00 B/s    0.00 B/s  0.00 % 96.16 % [jbd2/md2-8]
  1232 be/0 root        0.00 B/s    0.00 B/s  0.00 % 81.07 % [loop0]
11449 be/4 mysql     250.15 K/s    0.00 B/s  0.00 % 12.84 % mysqld --basedir=/ --datadir=/var/lib/mysql --user=mysql --log-error=/var/lib/mysql/server.err --open-files-limit=50000 --pid-file=/var/$
  6085 be/4 root        7.58 K/s   30.32 K/s  0.00 %  5.24 % dd if=largefile.tar.gz of=test10 oflag=sync bs=1G

Total DISK READ: 2023.83 K/s | Total DISK WRITE: 82.31 M/s
   TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO COMMAND
  6085 be/4 root        0.00 B/s   38.04 K/s  0.00 % 99.99 % dd if=largefile.tar.gz of=test10 oflag=sync bs=1G
  6267 be/4 user    0.00 B/s    0.00 B/s  0.00 % 99.99 % pop3
  6291 be/4 user     0.00 B/s    0.00 B/s  0.00 % 99.99 % pop3
   541 be/3 root        0.00 B/s  492.43 M/s  0.00 % 99.99 % [jbd2/md2-8]
  6282 be/4 nobody    730.40 K/s    0.00 B/s  0.00 % 99.99 % httpd -k start -DSSL
   506 be/4 root        0.00 B/s    0.00 B/s  0.00 % 52.39 % [md2_raid1]

Total DISK READ: 74.61 K/s | Total DISK WRITE: 8.66 M/s
   TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO COMMAND
  6282 be/4 nobody     26.55 K/s    0.00 B/s  0.00 % 97.65 % httpd -k start -DSSL
   541 be/3 root        0.00 B/s    7.04 M/s  0.00 % 95.64 % [jbd2/md2-8]
  1235 be/4 root        0.00 B/s    0.00 B/s  0.00 % 94.07 % [kjournald]
  1394 be/4 root        0.00 B/s    0.00 B/s  0.00 % 89.26 % [flush-7:0]
   506 be/4 root        0.00 B/s    0.00 B/s  0.00 % 31.66 % [md2_raid1]

Total DISK READ: 544.44 K/s | Total DISK WRITE: 82.08 M/s
   TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO COMMAND
  1235 be/4 root        0.00 B/s  129.31 K/s  0.00 % 99.99 % [kjournald]
   541 be/3 root        0.00 B/s   63.57 M/s  0.00 % 99.99 % [jbd2/md2-8]
31119 be/4 mysql       0.00 B/s   61.25 K/s  0.00 % 88.49 % mysqld --basedir=/ --datadir=/var/lib/mysql --user=mysql --log-error=/var/lib/mysql/server.err --open-files-limit=50000 --pid-file=/var/$
   506 be/4 root        0.00 B/s    0.00 B/s  0.00 % 72.41 % [md2_raid1]
31346 be/4 root        0.00 B/s   20.42 K/s  0.00 % 69.36 % tailwatchd
  1232 be/0 root        0.00 B/s  183.75 K/s  0.00 % 54.04 % [loop0]
  6085 be/4 root        3.40 K/s   40.83 K/s  0.00 % 26.49 % dd if=largefile.tar.gz of=test10 oflag=sync bs=1G
11561 be/4 mysql       0.00 B/s   45.64 M/s  0.00 %  0.00 % mysqld --basedir=/ --datadir=/var/lib/mysql --user=mysql --log-error=/var/lib/mysql/server.err --open-files-limit=50000 --pid-file=/var/$

I have also run it with the "-a" flag and there is something interesting
(long, though heavily grepped, output below).
This was taken during the 'dd oflag=sync' copy. It seems it does
something right at the beginning (writes about 250MB of that file), then
it mostly idles through the end:

Total DISK READ: 333.35 K/s | Total DISK WRITE: 38.76 M/s
   PID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
   541 be/3 root          0.00 B    332.00 K  0.00 %  0.49 % [jbd2/md2-8]
13467 be/4 root          0.00 B      4.00 K  0.00 %  0.00 % python /usr/bin/iotop -baoP -d 1
13479 be/4 root          4.00 K    250.12 M  0.00 %  0.00 % dd if=largefile.tar.gz of=test11 oflag=sync bs=1G

Total DISK READ: 4.84 M/s | Total DISK WRITE: 11.77 K/s
   PID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
   541 be/3 root          0.00 B    332.00 K  0.00 %  0.37 % [jbd2/md2-8]
13467 be/4 root          0.00 B      4.00 K  0.00 %  0.00 % python /usr/bin/iotop -baoP -d 1
13479 be/4 root          4.00 K    250.12 M  0.00 %  0.00 % dd if=largefile.tar.gz of=test11 oflag=sync bs=1G

Total DISK READ: 0.00 B/s | Total DISK WRITE: 379.93 K/s
   PID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
   541 be/3 root          0.00 B    332.00 K  0.00 %  0.30 % [jbd2/md2-8]
13467 be/4 root          0.00 B      8.00 K  0.00 %  0.00 % python /usr/bin/iotop -baoP -d 1
13479 be/4 root          4.00 K    250.12 M  0.00 %  0.00 % dd if=largefile.tar.gz of=test11 oflag=sync bs=1G
  1232 be/0 root          0.00 B    244.00 K  0.00 %  0.00 % [loop0]
  1397 be/4 root          0.00 B     24.00 K  0.00 %  0.00 % [flush-9:2]

Total DISK READ: 0.00 B/s | Total DISK WRITE: 69.69 M/s
   PID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
13479 be/4 root          4.00 K    250.16 M  0.00 % 79.98 % dd if=largefile.tar.gz of=test11 oflag=sync bs=1G
   541 be/3 root          0.00 B    458.64 M  0.00 %  0.25 % [jbd2/md2-8]
13467 be/4 root          0.00 B      8.00 K  0.00 %  0.00 % python /usr/bin/iotop -baoP -d 1
  1232 be/0 root          0.00 B    244.00 K  0.00 %  0.00 % [loop0]
  1397 be/4 root          0.00 B     24.00 K  0.00 %  0.00 % [flush-9:2]

Total DISK READ: 20.81 K/s | Total DISK WRITE: 6.07 M/s
   PID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
   541 be/3 root          0.00 B    765.19 M  0.00 % 83.17 % [jbd2/md2-8]
  1235 be/4 root          0.00 B      0.00 B  0.00 % 78.06 % [kjournald]
13479 be/4 root          8.00 K    250.24 M  0.00 % 60.66 % dd if=largefile.tar.gz of=test11 oflag=sync bs=1G
   506 be/4 root          0.00 B      0.00 B  0.00 % 35.01 % [md2_raid1]
  1394 be/4 root          0.00 B      0.00 B  0.00 % 11.25 % [flush-7:0]

Total DISK READ: 43.28 K/s | Total DISK WRITE: 34.09 M/s
   PID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
   541 be/3 root          0.00 B    767.47 M  0.00 % 84.84 % [jbd2/md2-8]
  1235 be/4 root          0.00 B     28.00 K  0.00 % 70.65 % [kjournald]
13479 be/4 root         12.00 K    250.29 M  0.00 % 65.12 % dd if=largefile.tar.gz of=test11 oflag=sync bs=1G
   506 be/4 root          0.00 B      0.00 B  0.00 % 31.81 % [md2_raid1]
  1232 be/0 root          0.00 B   1568.00 K  0.00 % 14.57 % [loop0]
  1394 be/4 root          0.00 B      0.00 B  0.00 %  9.71 % [flush-7:0]
  1397 be/4 root          0.00 B      3.44 M  0.00 %  1.47 % [flush-9:2]
13467 be/4 root          0.00 B     12.00 K  0.00 %  0.00 % python /usr/bin/iotop -baoP -d 1

Total DISK READ: 3.85 K/s | Total DISK WRITE: 35.28 M/s
   PID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
   541 be/3 root          0.00 B    768.32 M  0.00 % 84.36 % [jbd2/md2-8]
  1235 be/4 root          0.00 B     28.00 K  0.00 % 83.53 % [kjournald]
13479 be/4 root         12.00 K    250.30 M  0.00 % 65.05 % dd if=largefile.tar.gz of=test11 oflag=sync bs=1G
   506 be/4 root          0.00 B      0.00 B  0.00 % 32.55 % [md2_raid1]
  1232 be/0 root          0.00 B   1568.00 K  0.00 % 14.21 % [loop0]
  1394 be/4 root          0.00 B      0.00 B  0.00 %  9.46 % [flush-7:0]
  1397 be/4 root          0.00 B      3.45 M  0.00 %  1.48 % [flush-9:2]

Total DISK READ: 3.91 K/s | Total DISK WRITE: 3.91 K/s
   PID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
   541 be/3 root          0.00 B    768.32 M  0.00 % 82.29 % [jbd2/md2-8]
  1235 be/4 root          0.00 B     28.00 K  0.00 % 81.48 % [kjournald]
13479 be/4 root         12.00 K    250.30 M  0.00 % 63.37 % dd if=largefile.tar.gz of=test11 oflag=sync bs=1G
   506 be/4 root          0.00 B      0.00 B  0.00 % 31.75 % [md2_raid1]
  1232 be/0 root          0.00 B   1568.00 K  0.00 % 13.86 % [loop0]
  1394 be/4 root          0.00 B      0.00 B  0.00 %  9.23 % [flush-7:0]
  1397 be/4 root          0.00 B      3.45 M  0.00 %  1.44 % [flush-9:2]
13467 be/4 root          0.00 B     28.00 K  0.00 %  0.00 % python /usr/bin/iotop -baoP -d 1

Total DISK READ: 15.64 K/s | Total DISK WRITE: 15.32 M/s
   PID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
   541 be/3 root          0.00 B    768.71 M  0.00 % 85.51 % [jbd2/md2-8]
  1235 be/4 root          0.00 B     28.00 K  0.00 % 79.53 % [kjournald]
13479 be/4 root         12.00 K    250.31 M  0.00 % 61.78 % dd if=largefile.tar.gz of=test11 oflag=sync bs=1G
   506 be/4 root          0.00 B      0.00 B  0.00 % 36.15 % [md2_raid1]
  1232 be/0 root          0.00 B   1568.00 K  0.00 % 13.53 % [loop0]
  1394 be/4 root          0.00 B      0.00 B  0.00 %  9.01 % [flush-7:0]
  1397 be/4 root          0.00 B      3.45 M  0.00 %  6.60 % [flush-9:2]
13467 be/4 root          0.00 B     32.00 K  0.00 %  0.00 % python /usr/bin/iotop -baoP -d 1

Total DISK READ: 0.00 B/s | Total DISK WRITE: 0.00 B/s
   PID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
   541 be/3 root          0.00 B    768.71 M  0.00 % 85.77 % [jbd2/md2-8]
  1235 be/4 root          0.00 B     28.00 K  0.00 % 75.90 % [kjournald]
13479 be/4 root         12.00 K    250.31 M  0.00 % 58.82 % dd if=largefile.tar.gz of=test11 oflag=sync bs=1G
   506 be/4 root          0.00 B      0.00 B  0.00 % 34.51 % [md2_raid1]
  1232 be/0 root          0.00 B   1568.00 K  0.00 % 12.91 % [loop0]
  1394 be/4 root          0.00 B      0.00 B  0.00 %  8.60 % [flush-7:0]
  1397 be/4 root          0.00 B      3.45 M  0.00 %  6.30 % [flush-9:2]
31346 be/4 root          0.00 B    120.00 K  0.00 %  3.42 % tailwatchd
13467 be/4 root          0.00 B     44.00 K  0.00 %  0.00 % python /usr/bin/iotop -baoP -d 1

Total DISK READ: 19.56 K/s | Total DISK WRITE: 10.12 M/s
   PID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
   541 be/3 root          0.00 B    768.76 M  0.00 % 86.39 % [jbd2/md2-8]
  1235 be/4 root          0.00 B     28.00 K  0.00 % 74.21 % [kjournald]
13479 be/4 root         12.00 K    250.31 M  0.00 % 64.36 % dd if=largefile.tar.gz of=test11 oflag=sync bs=1G
   506 be/4 root          0.00 B      0.00 B  0.00 % 39.86 % [md2_raid1]
  1232 be/0 root          0.00 B   1568.00 K  0.00 % 12.62 % [loop0]
  1394 be/4 root          0.00 B      0.00 B  0.00 %  8.41 % [flush-7:0]
  1397 be/4 root          0.00 B      3.46 M  0.00 %  6.16 % [flush-9:2]
13467 be/4 root          0.00 B     48.00 K  0.00 %  0.00 % python /usr/bin/iotop -baoP -d 1

Total DISK READ: 0.00 B/s | Total DISK WRITE: 15.65 K/s
   PID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
   541 be/3 root          0.00 B    768.76 M  0.00 % 87.13 % [jbd2/md2-8]
  1235 be/4 root          0.00 B     28.00 K  0.00 % 72.58 % [kjournald]
13479 be/4 root         12.00 K    250.31 M  0.00 % 65.64 % dd if=largefile.tar.gz of=test11 oflag=sync bs=1G
   506 be/4 root          0.00 B      0.00 B  0.00 % 38.98 % [md2_raid1]
  1232 be/0 root          0.00 B   1568.00 K  0.00 % 12.34 % [loop0]
  1394 be/4 root          0.00 B      0.00 B  0.00 %  8.22 % [flush-7:0]
  1397 be/4 root          0.00 B      3.46 M  0.00 %  6.03 % [flush-9:2]
13467 be/4 root          0.00 B     52.00 K  0.00 %  0.00 % python /usr/bin/iotop -baoP -d 1

Total DISK READ: 46.71 K/s | Total DISK WRITE: 38.92 K/s
   PID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
   541 be/3 root          0.00 B    768.76 M  0.00 % 87.24 % [jbd2/md2-8]
  1235 be/4 root          0.00 B     28.00 K  0.00 % 71.03 % [kjournald]
13479 be/4 root         12.00 K    250.31 M  0.00 % 66.24 % dd if=largefile.tar.gz of=test11 oflag=sync bs=1G
   506 be/4 root          0.00 B      0.00 B  0.00 % 38.15 % [md2_raid1]
  1232 be/0 root          0.00 B   1568.00 K  0.00 % 12.08 % [loop0]
  1394 be/4 root          0.00 B      0.00 B  0.00 %  8.05 % [flush-7:0]
  1397 be/4 root          0.00 B      3.46 M  0.00 %  5.90 % [flush-9:2]
13467 be/4 root          0.00 B     56.00 K  0.00 %  0.00 % python /usr/bin/iotop -baoP -d 1

Total DISK READ: 0.00 B/s | Total DISK WRITE: 0.00 B/s
   PID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
   541 be/3 root          0.00 B    768.76 M  0.00 % 87.63 % [jbd2/md2-8]
  1235 be/4 root          0.00 B     28.00 K  0.00 % 69.54 % [kjournald]
13479 be/4 root         12.00 K    250.31 M  0.00 % 67.10 % dd if=largefile.tar.gz of=test11 oflag=sync bs=1G
   506 be/4 root          0.00 B      0.00 B  0.00 % 42.88 % [md2_raid1]
  1232 be/0 root          0.00 B   1568.00 K  0.00 % 11.83 % [loop0]
  1394 be/4 root          0.00 B      0.00 B  0.00 %  7.88 % [flush-7:0]
  1397 be/4 root          0.00 B      3.46 M  0.00 %  5.78 % [flush-9:2]
13467 be/4 root          0.00 B     60.00 K  0.00 %  0.00 % python /usr/bin/iotop -baoP -d 1

Total DISK READ: 7.82 K/s | Total DISK WRITE: 0.00 B/s
   PID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
   541 be/3 root          0.00 B    768.76 M  0.00 % 87.91 % [jbd2/md2-8]
  1235 be/4 root          0.00 B     28.00 K  0.00 % 68.12 % [kjournald]
13479 be/4 root         12.00 K    250.31 M  0.00 % 67.83 % dd if=largefile.tar.gz of=test11 oflag=sync bs=1G
   506 be/4 root          0.00 B      0.00 B  0.00 % 42.01 % [md2_raid1]
  1232 be/0 root          0.00 B   1568.00 K  0.00 % 11.59 % [loop0]
  1394 be/4 root          0.00 B      0.00 B  0.00 %  7.72 % [flush-7:0]
  1397 be/4 root          0.00 B      3.46 M  0.00 %  5.66 % [flush-9:2]
13467 be/4 root          0.00 B     68.00 K  0.00 %  0.00 % python /usr/bin/iotop -baoP -d 1

Total DISK READ: 0.00 B/s | Total DISK WRITE: 50.84 K/s
   PID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
   541 be/3 root          0.00 B    768.76 M  0.00 % 88.16 % [jbd2/md2-8]
  1235 be/4 root          0.00 B     28.00 K  0.00 % 66.75 % [kjournald]
13479 be/4 root         12.00 K    250.31 M  0.00 % 68.51 % dd if=largefile.tar.gz of=test11 oflag=sync bs=1G
   506 be/4 root          0.00 B      0.00 B  0.00 % 41.16 % [md2_raid1]
  1232 be/0 root          0.00 B   1568.00 K  0.00 % 11.35 % [loop0]
  1394 be/4 root          0.00 B      0.00 B  0.00 %  7.56 % [flush-7:0]
  1397 be/4 root          0.00 B      3.46 M  0.00 %  5.54 % [flush-9:2]
13467 be/4 root          0.00 B     72.00 K  0.00 %  0.00 % python /usr/bin/iotop -baoP -d 1

Total DISK READ: 3.91 K/s | Total DISK WRITE: 93.83 K/s
   PID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
   541 be/3 root          0.00 B    768.76 M  0.00 % 88.33 % [jbd2/md2-8]
13479 be/4 root         12.00 K    250.31 M  0.00 % 69.09 % dd if=largefile.tar.gz of=test11 oflag=sync bs=1G
  1235 be/4 root          0.00 B     28.00 K  0.00 % 65.44 % [kjournald]
   506 be/4 root          0.00 B      0.00 B  0.00 % 40.35 % [md2_raid1]
  1232 be/0 root          0.00 B   1568.00 K  0.00 % 11.13 % [loop0]
  1394 be/4 root          0.00 B      0.00 B  0.00 %  7.41 % [flush-7:0]
  1397 be/4 root          0.00 B      3.46 M  0.00 %  5.43 % [flush-9:2]
31346 be/4 root          0.00 B    120.00 K  0.00 %  2.95 % tailwatchd
13467 be/4 root          0.00 B     76.00 K  0.00 %  0.00 % python /usr/bin/iotop -baoP -d 1

Total DISK READ: 0.00 B/s | Total DISK WRITE: 0.00 B/s
   PID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
   541 be/3 root          0.00 B    768.76 M  0.00 % 88.53 % [jbd2/md2-8]
13479 be/4 root         12.00 K    250.31 M  0.00 % 69.69 % dd if=largefile.tar.gz of=test11 oflag=sync bs=1G
  1235 be/4 root          0.00 B     28.00 K  0.00 % 64.18 % [kjournald]
   506 be/4 root          0.00 B      0.00 B  0.00 % 39.57 % [md2_raid1]
  1232 be/0 root          0.00 B   1568.00 K  0.00 % 10.91 % [loop0]
  1394 be/4 root          0.00 B      0.00 B  0.00 %  7.27 % [flush-7:0]
  1397 be/4 root          0.00 B      3.46 M  0.00 %  5.33 % [flush-9:2]
13467 be/4 root          0.00 B     80.00 K  0.00 %  0.00 % python /usr/bin/iotop -baoP -d 1

Total DISK READ: 0.00 B/s | Total DISK WRITE: 15.64 K/s
   PID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
   541 be/3 root          0.00 B    768.76 M  0.00 % 88.20 % [jbd2/md2-8]
13479 be/4 root         12.00 K    250.31 M  0.00 % 69.72 % dd if=largefile.tar.gz of=test11 oflag=sync bs=1G
  1235 be/4 root          0.00 B     28.00 K  0.00 % 62.96 % [kjournald]
   506 be/4 root          0.00 B      0.00 B  0.00 % 38.82 % [md2_raid1]
  1232 be/0 root          0.00 B   1568.00 K  0.00 % 10.71 % [loop0]
  1394 be/4 root          0.00 B      0.00 B  0.00 %  7.13 % [flush-7:0]
  1397 be/4 root          0.00 B      3.46 M  0.00 %  5.23 % [flush-9:2]
13467 be/4 root          0.00 B     84.00 K  0.00 %  0.00 % python /usr/bin/iotop -baoP -d 1

Total DISK READ: 0.00 B/s | Total DISK WRITE: 0.00 B/s
   PID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
   541 be/3 root          0.00 B    768.76 M  0.00 % 88.61 % [jbd2/md2-8]
13479 be/4 root         12.00 K    250.31 M  0.00 % 70.50 % dd if=largefile.tar.gz of=test11 oflag=sync bs=1G
  1235 be/4 root          0.00 B     28.00 K  0.00 % 61.79 % [kjournald]
   506 be/4 root          0.00 B      0.00 B  0.00 % 38.10 % [md2_raid1]
  1232 be/0 root          0.00 B   1568.00 K  0.00 % 10.51 % [loop0]
  1394 be/4 root          0.00 B      0.00 B  0.00 %  7.00 % [flush-7:0]
  1397 be/4 root          0.00 B      3.46 M  0.00 %  5.13 % [flush-9:2]
13467 be/4 root          0.00 B     92.00 K  0.00 %  0.00 % python /usr/bin/iotop -baoP -d 1

Total DISK READ: 258.12 K/s | Total DISK WRITE: 86.04 K/s
   PID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
   541 be/3 root          0.00 B    768.76 M  0.00 % 89.19 % [jbd2/md2-8]
13479 be/4 root         12.00 K    250.31 M  0.00 % 71.45 % dd if=largefile.tar.gz of=test11 oflag=sync bs=1G
  1235 be/4 root          0.00 B     28.00 K  0.00 % 60.66 % [kjournald]
   506 be/4 root          0.00 B      0.00 B  0.00 % 37.40 % [md2_raid1]
  1232 be/0 root          0.00 B   1568.00 K  0.00 % 10.32 % [loop0]
  1394 be/4 root          0.00 B      0.00 B  0.00 %  6.87 % [flush-7:0]
  1397 be/4 root          0.00 B      3.46 M  0.00 %  5.04 % [flush-9:2]
13467 be/4 root          0.00 B     96.00 K  0.00 %  0.00 % python /usr/bin/iotop -baoP -d 1


I apologize for such a lengthy email!

Kind regards!
Andrei Banu


* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
  2013-04-23 10:17           ` Andrei Banu
@ 2013-04-24  3:24             ` Stan Hoeppner
  2013-04-24  8:26               ` Andrei Banu
  0 siblings, 1 reply; 38+ messages in thread
From: Stan Hoeppner @ 2013-04-24  3:24 UTC (permalink / raw)
  To: Andrei Banu

On 4/23/2013 5:17 AM, Andrei Banu wrote:

> I am sorry for the very long email. And thanks a lot for all your patience.

From now on simply provide what is asked for.  That keeps the length
manageable and the info relevant, and allows us to help you get to a
solution more quickly without being bogged down.

> 1. DMESG doesn't show any "hard resetting link" at all.

Then it seems you don't have hardware problems.

> 2. The SSDs are connected to ATA 0 and ATA1. The server is brand new (or
> at least it should be).

Nor the Intel 6 Series SATA problem.

> 3. Partition table:

/etc/fstab contains mount points, not the partition table.

> root [~]# cat /etc/fstab

> UUID=cef1d19d-2578-43db-9ffc-b6b70e227bfa swap swap    defaults        0 0

I can't discern from UUID where your swap partition is located.  Is it a
partition directly on an SSD or is it a partition atop md1?
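
Mapping a UUID back to its block device can be done from blkid output.
The sketch below assumes the common blkid line format
(`/dev/name: UUID="..." TYPE="..."`); device_for_uuid is a hypothetical
helper name used only for illustration.

```shell
# Print the device whose blkid line carries the given filesystem UUID.
# Reads `blkid` output on stdin; device_for_uuid is a hypothetical helper.
device_for_uuid() {
    # blkid lines look like: /dev/md1: UUID="..." TYPE="swap"
    # The leading space keeps PARTUUID="..." fields from matching.
    awk -v want=" UUID=\"$1\"" '
        index($0, want) { sub(/:.*/, "", $0); print; exit }'
}
```

Typical use would be
`blkid | device_for_uuid cef1d19d-2578-43db-9ffc-b6b70e227bfa`.
util-linux blkid also accepts `-U <uuid>` to do the lookup directly.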

> root [/]# echo 3 > /proc/sys/vm/drop_caches
> root [~]# time cp largefile.tar.gz test03.tmp; time sync;

You're slowing us down here.  Please execute commands as instructed
without modification.  The above is wrong.  You don't call time twice.
If you want the sync execution time included, use:
$ time (cp src.tmp src.temp; sync)

Though it makes little difference as Linux is pretty good about flushing
the last few write buffers.  But you missed the important part, the math
for bandwidth determination:  548/real = xx MB/s

This is cp not dd.  It's up to you to do the math.  Using time allows
you to do so.  548MB is my example using your previous file size in your
tests.  Modify accordingly if needed.
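
The bandwidth arithmetic (size in MB divided by the "real" time in
seconds) can be wrapped in a small function.  This is only a sketch;
mbps is a hypothetical helper name, and the caller supplies the file
size and the elapsed time reported by time.

```shell
# Throughput in MB/s from a size in MB and an elapsed time in seconds.
# mbps is a hypothetical helper name.
mbps() {
    awk -v mb="$1" -v secs="$2" 'BEGIN {
        if (secs <= 0) {
            print "elapsed time must be positive" > "/dev/stderr"
            exit 1
        }
        # One decimal place is plenty for this kind of comparison.
        printf "%.1f\n", mb / secs
    }'
}
```

For instance, if the file were 548 MB and cp plus sync took 3.052 s +
62.570 s, `mbps 548 65.6` gives roughly 8.4 MB/s.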

*Important note*  The job of this list is to provide knowledge transfer,
advice, and assistance.  You must do the work, and you must learn along
the way.  We don't fix people's problems, as we don't have access to
their computers.  What we do is *enable* people to fix their problems
themselves.

> After about 15 seconds the server load started to increase from 1,
> spiked to 40 in about a minute and then it started decreasing.

Please stop telling us this.  Linux load average is irrelevant.

> 5. The perf top -U output during a dd copy:

This was supposed to be executed before and simultaneously with the cp
operation above.  Do you know how to use multiple terminal windows?

> 6. iotop 

Again, this was supposed to be run with the cp command, exited toward
the end of the cp operation, then copy/pasted.

> is very dynamic and I am afraid the data I am providing will be
> unclear but let me give a number of snapshots from during the large file
> copy and maybe you can make something of it (samples a few seconds apart):

> !!!!!! 6085 be/4 root        7.69 K/s 1004.85 M/s  0.00 %  0.00 % dd
> if=largefile.tar.gz of=test10 oflag=sync bs=1G

This is another example of why you don't use dd for IO testing, and
especially with a block size of 1GB.  dd buffers into RAM up to
$block_size bytes before it begins flushing to disk.  So what you're
seeing here is that massive push at the beginning of the run.  Your SSDs
in RAID1 peak at ~265MB/s.  iotop is showing 1GB/s, 4 times what the
drives can do.  This is obviously not real.

You can get away with oflag=sync using 1GB block size.  But if you run
dd the only way it can be run for realistic results, using bs=4096 which
matches the default block size of EXTx, XFS, and JFS, then using
oflag=sync will degrade your performance, because an ack is required on
each block.  That's what sync does.  With SSD it won't be nearly as
dramatic as rust, where the runtime is 100-200x longer due to
rotational latency.
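
To illustrate the point about small synchronous writes, here is a minimal
sketch, assuming GNU dd and a scratch file from mktemp.  dd_4k is a
hypothetical helper name, not something used earlier in this thread, and
it only writes 1 MiB per run so it should be harmless on a busy box:

```shell
# Hypothetical helper: write COUNT blocks of 4096 bytes from /dev/zero to FILE.
# Any further operands (oflag=sync, conv=fdatasync, ...) are passed to dd as-is.
dd_4k() {
    file=$1
    count=$2
    shift 2
    dd if=/dev/zero of="$file" bs=4096 count="$count" "$@" 2>/dev/null
}

scratch=$(mktemp)                      # scratch file, removed below
dd_4k "$scratch" 256                   # buffered: dd returns once the data is in page cache
dd_4k "$scratch" 256 oflag=sync        # each 4 KiB write waits for the device to ack
dd_4k "$scratch" 256 conv=fdatasync    # buffered writes, one flush before dd exits
rm -f "$scratch"
```

Timing each call separately should show the per-block cost of oflag=sync
against the single flush of conv=fdatasync.  The absolute numbers depend
on the device and on whatever else is writing at the time.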

> I apologize for such a lengthy email!

Don't apologize, just don't send more information than needed,
especially if you don't know it's relevant. ;)  Send only what's
requested, and as requested, please.

-- 
Stan

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
  2013-04-24  3:24             ` Stan Hoeppner
@ 2013-04-24  8:26               ` Andrei Banu
  2013-04-24  9:12                 ` Adam Goryachev
  2013-04-24 16:37                 ` Stan Hoeppner
  0 siblings, 2 replies; 38+ messages in thread
From: Andrei Banu @ 2013-04-24  8:26 UTC (permalink / raw)
  To: linux-raid

Hello,

I am sorry for the irrelevant feedback. Where I misunderstood your 
request, I filled in the blanks (poorly).

1. SWAP
root [~]# blkid | grep cef1d19d-2578-43db-9ffc-b6b70e227bfa
/dev/md1: UUID="cef1d19d-2578-43db-9ffc-b6b70e227bfa" TYPE="swap"

So yes, swap is on md1. This *md1 has a size of 2GB*. Isn't this way too 
low for a system with 16GB of memory?

2. Let me try again to give you the right test results:

Before the bigfile copy:

root [~]# perf top -U
Samples: 768  of event 'cycles', Event count (approx.): 499088870
  18.58%  [kernel]  [k] port_inb
   6.21%  [kernel]  [k] page_fault
   3.36%  [kernel]  [k] clear_page_c_e
   2.82%  [kernel]  [k] kallsyms_expand_symbol
   1.99%  [kernel]  [k] __mem_cgroup_commit_charge
   1.84%  [kernel]  [k] shmem_getpage_gfp
   1.51%  [kernel]  [k] alloc_pages_vma
   1.51%  [kernel]  [k] __alloc_pages_nodemask
   1.46%  [kernel]  [k] avtab_search_node
   1.45%  [kernel]  [k] format_decode
   1.40%  [kernel]  [k] list_del
   1.36%  [kernel]  [k] get_page_from_freelist
   1.35%  [kernel]  [k] vsnprintf
   1.29%  [kernel]  [k] avc_has_perm_noaudit
   1.28%  [kernel]  [k] number
   1.22%  [kernel]  [k] free_pcppages_bulk
   1.21%  [kernel]  [k] ____pagevec_lru_add
   1.14%  [kernel]  [k] get_page
   1.08%  [kernel]  [k] memcpy
   1.07%  [kernel]  [k] mem_cgroup_update_file_mapped
   1.07%  [kernel]  [k] page_waitqueue
   0.98%  [kernel]  [k] __d_lookup
   0.97%  [kernel]  [k] unmap_vmas
   0.91%  [kernel]  [k] _spin_lock
   0.87%  [kernel]  [k] inode_has_perm
   0.81%  [kernel]  [k] string
   0.77%  [kernel]  [k] page_remove_rmap
   0.73%  [kernel]  [k] __audit_syscall_exit
   0.68%  [kernel]  [k] lookup_page_cgroup
   0.61%  [kernel]  [k] unlock_page
   0.61%  [kernel]  [k] shmem_find_get_pages_and_swap
   0.61%  [kernel]  [k] free_hot_cold_page
   0.61%  [kernel]  [k] release_pages
   0.56%  [kernel]  [k] mem_cgroup_lru_del_list
   0.55%  [kernel]  [k] strncpy_from_user
   0.54%  [kernel]  [k] module_get_kallsym
   0.52%  [kernel]  [k] find_get_page
   0.50%  [kernel]  [k] __do_fault
   0.48%  [kernel]  [k] path_put
   0.46%  [kernel]  [k] __list_add
   0.46%  [kernel]  [k] handle_mm_fault
   0.45%  [kernel]  [k] __wake_up_bit
   0.44%  [kernel]  [k] handle_pte_fault
   0.43%  [kernel]  [k] audit_syscall_entry
   0.43%  [kernel]  [k] thread_return
   0.42%  [kernel]  [k] path_init
   0.41%  [kernel]  [k] dput
   0.40%  [kernel]  [k] task_has_capability
   0.40%  [kernel]  [k] get_task_cred
   0.40%  [kernel]  [k] pointer
   0.40%  [kernel]  [k] _atomic_dec_and_lock
   0.39%  [kernel]  [k] __link_path_walk
   0.38%  [kernel]  [k] memset
   0.37%  [kernel]  [k] do_lookup
   0.34%  [kernel]  [k] radix_tree_lookup_slot
   0.34%  [kernel]  [k] down_read_trylock
   0.33%  [kernel]  [k] kmem_cache_alloc
   0.31%  [kernel]  [k] __set_page_dirty_no_writeback
   0.31%  [kernel]  [k] __inc_zone_state
   0.31%  [kernel]  [k] __mem_cgroup_uncharge_common

root [~]# iotop
Total DISK READ: 0.00 B/s | Total DISK WRITE: 2.33 M/s
   TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO> COMMAND
   541 be/3 root        0.00 B/s    7.83 K/s  0.00 %  2.27 % [jbd2/md2-8]
  8568 be/4 root        0.00 B/s    7.83 K/s  0.00 %  0.00 % lfd - sleeping
  1457 be/3 root        0.00 B/s    7.83 K/s  0.00 %  0.00 % auditd
  1669 be/4 root        0.00 B/s    3.91 K/s  0.00 %  0.00 % rsyslogd -i /var/run/syslogd.pid -c 5
  1695 be/4 named       0.00 B/s    3.91 K/s  0.00 %  0.00 % named -u named
 31391 be/4 mysql       0.00 B/s   23.48 K/s  0.00 %  0.00 % mysqld --basedir=/ --datadir=/var/lib/mysql --user=mysql --log-error=/var~r --open-files-limit=50000 --pid-file=/var/lib/mysql/server.pid
     1 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % init
     2 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [kthreadd]
     3 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/0]
     4 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/0]
     5 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/0]
     6 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/0]
     7 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/1]
     8 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/1]
     9 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/1]
    10 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/1]
    11 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/2]
    12 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/2]
    13 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/2]
    14 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/2]
    15 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/3]
    16 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/3]
    17 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/3]
    18 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/3]
    19 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/4]
    20 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/4]
    21 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/4]
    22 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/4]
    23 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/5]
    24 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/5]
    25 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/5]
    26 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/5]
    27 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/6]
    28 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/6]
    29 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/6]
    30 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/6]
    31 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/7]
    32 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/7]
    33 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/7]
    34 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/7]
    35 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [events/0]
    36 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [events/1]
    37 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [events/2]
    38 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [events/3]
    39 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [events/4]
    40 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [events/5]
    41 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [events/6]
    42 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [events/7]
    43 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [cgroup]
    44 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [khelper]
    45 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [netns]
    46 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [async/mgr]
    47 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [pm]
    48 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [sync_supers]
    49 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [bdi-default]
    50 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [kintegrityd/0]
    51 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [kintegrityd/1]
    52 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [kintegrityd/2]
    53 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [kintegrityd/3]
    54 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [kintegrityd/4]

Now the file copy with sync:

root [~]# time (cp largefile.tar.gz test05.tmp; sync)

real    1m33.923s
user    0m0.002s
sys     0m0.713s

Large file size: 523MB
BW determination: 523MB / 93.923 seconds = 5.57 MB/s

File copy without sync:
root [~]# echo 3 > /proc/sys/vm/drop_caches
root [~]# time cp largefile.tar.gz test07.tmp
real    0m6.452s
user    0m0.007s
sys     0m0.687s
Large file size: 523MB
BW determination: 523MB / 6.452 seconds = 81.06 MB/s
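
The bandwidth arithmetic above can be reproduced with a small helper.
This is only a sketch, assuming awk is available; bw is a hypothetical
name, and the sizes and times are the ones measured in this message:

```shell
# Hypothetical helper: bw SIZE_MB SECONDS prints throughput in MB/s,
# rounded to two decimals.
bw() {
    awk -v mb="$1" -v s="$2" 'BEGIN { printf "%.2f\n", mb / s }'
}

bw 523 93.923    # cp followed by sync
bw 523 6.452     # cp alone, data may still be in page cache
```

The second figure mostly reflects how fast the page cache accepts the
data, so only the run that includes sync says anything about the array.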

During the copy (near the end: about 70 seconds into the copy - results 
with sync):

Samples: 17K of event 'cycles', Event count (approx.): 5067697991
   7.48%  [kernel]             [k] port_inb
   5.40%  [kernel]             [k] page_fault
   2.92%  [kernel]             [k] clear_page_c_e
   2.29%  [kernel]             [k] list_del
   2.21%  [kernel]             [k] _spin_lock
   1.99%  [kernel]             [k] __d_lookup
   1.92%  [kernel]             [k] avtab_search_node
   1.64%  [kernel]             [k] unmap_vmas
   1.59%  [kernel]             [k] get_page_from_freelist
   1.55%  [kernel]             [k] __mem_cgroup_commit_charge
   1.22%  [kernel]             [k] mem_cgroup_update_file_mapped
   1.21%  [kernel]             [k] copy_page_c
   1.04%  [kernel]             [k] find_vma
   1.00%  [kernel]             [k] _spin_lock_irq
   0.97%  [kernel]             [k] __wake_up_bit
   0.94%  [kernel]             [k] __mem_cgroup_uncharge_common
   0.92%  [kernel]             [k] get_page
   0.91%  [kernel]             [k] __alloc_pages_nodemask
   0.87%  [kernel]             [k] handle_mm_fault
   0.85%  [kernel]             [k] __link_path_walk
   0.84%  [kernel]             [k] avc_has_perm_noaudit
   0.83%  [kernel]             [k] alloc_pages_vma
   0.81%  [kernel]             [k] lookup_page_cgroup
   0.80%  [kernel]             [k] __do_page_fault
   0.80%  [kernel]             [k] free_pcppages_bulk
   0.77%  [kernel]             [k] _spin_lock_irqsave
   0.75%  [kernel]             [k] radix_tree_lookup_slot
   0.73%  [kernel]             [k] kmem_cache_alloc
   0.68%  [ip_tables]          [k] ipt_do_table
   0.66%  [kernel]             [k] _atomic_dec_and_lock
   0.65%  [kernel]             [k] release_pages
   0.62%  [kernel]             [k] find_get_page
   0.61%  [kernel]             [k] schedule
   0.60%  [kernel]             [k] inode_has_perm
   0.56%  [kernel]             [k] sidtab_context_to_sid
   0.54%  [kernel]             [k] handle_pte_fault
   0.53%  [kernel]             [k] _spin_unlock_irqrestore
   0.53%  [kernel]             [k] memset
   0.52%  [kernel]             [k] __inc_zone_state
   0.51%  [kernel]             [k] update_curr
   0.51%  [kernel]             [k] kfree
   0.50%  [kernel]             [k] __list_add
   0.50%  [kernel]             [k] __do_fault
   0.49%  [kernel]             [k] shmem_getpage_gfp
   0.47%  [kernel]             [k] filemap_fault


Total DISK READ: 0.00 B/s | Total DISK WRITE: 0.00 B/s
   TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO> COMMAND
   541 be/3 root        0.00 B/s    0.00 B/s  0.00 % 96.96 % [jbd2/md2-8]
 12468 be/4 nobody      0.00 B/s    3.89 K/s  0.00 %  0.00 % httpd -k start -DSSL
 18818 be/4 mysql       0.00 B/s    3.89 K/s  0.00 %  0.00 % mysqld --basedir=/ --da~sql/server.pid
 12333 be/4 nobody      0.00 B/s    3.89 K/s  0.00 %  0.00 % httpd -k start -DSSL
 12560 be/4 nobody      0.00 B/s    3.89 K/s  0.00 %  0.00 % httpd -k start -DSSL
 12568 be/4 nobody      0.00 B/s    3.89 K/s  0.00 %  0.00 % httpd -k start -DSSL
 12281 be/4 nobody      0.00 B/s    3.89 K/s  0.00 %  0.00 % [httpd]
     1 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % init
     2 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [kthreadd]
     3 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/0]
     4 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/0]
     5 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/0]
     6 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/0]
     7 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/1]
     8 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/1]
     9 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/1]
    10 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/1]
    11 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/2]
    12 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/2]
    13 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/2]
    14 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/2]
    15 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/3]
    16 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/3]
    17 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/3]
    18 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/3]
    19 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/4]
    20 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/4]
    21 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/4]
    22 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/4]
    23 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/5]
    24 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/5]
    25 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/5]
    26 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/5]
    27 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/6]

Please let me know if I messed up again so that I can correct it.


@Adam

3. root [~]# fdisk -lu /dev/sd*

Disk /dev/sda: 512.1 GB, 512110190592 bytes
255 heads, 63 sectors/track, 62260 cylinders, total 1000215216 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00026d59

    Device Boot      Start         End      Blocks   Id  System
/dev/sda1            2048     4196351     2097152   fd  Linux raid autodetect
Partition 1 does not end on cylinder boundary.
/dev/sda2   *     4196352     4605951      204800   fd  Linux raid autodetect
Partition 2 does not end on cylinder boundary.
/dev/sda3         4605952   814106623   404750336   fd  Linux raid autodetect

Disk /dev/sda1: 2147 MB, 2147483648 bytes
255 heads, 63 sectors/track, 261 cylinders, total 4194304 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0xfffefffe

Disk /dev/sda2: 209 MB, 209715200 bytes
255 heads, 63 sectors/track, 25 cylinders, total 409600 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

Disk /dev/sda3: 414.5 GB, 414464344064 bytes
255 heads, 63 sectors/track, 50389 cylinders, total 809500672 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

Disk /dev/sdb: 512.1 GB, 512110190592 bytes
255 heads, 63 sectors/track, 62260 cylinders, total 1000215216 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x0003dede

    Device Boot      Start         End      Blocks   Id  System
/dev/sdb1            2048     4196351     2097152   fd  Linux raid autodetect
Partition 1 does not end on cylinder boundary.
/dev/sdb2   *     4196352     4605951      204800   fd  Linux raid autodetect
Partition 2 does not end on cylinder boundary.
/dev/sdb3         4605952   814106623   404750336   fd  Linux raid autodetect

Disk /dev/sdb1: 2147 MB, 2147483648 bytes
255 heads, 63 sectors/track, 261 cylinders, total 4194304 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0xfffefffe

Disk /dev/sdb2: 209 MB, 209715200 bytes
255 heads, 63 sectors/track, 25 cylinders, total 409600 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

Disk /dev/sdb3: 414.5 GB, 414464344064 bytes
255 heads, 63 sectors/track, 50389 cylinders, total 809500672 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

Kind regards!
Andrei Banu



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
  2013-04-24  8:26               ` Andrei Banu
@ 2013-04-24  9:12                 ` Adam Goryachev
  2013-04-24 10:24                   ` Tommy Apel
  2013-04-24 21:40                   ` Andrei Banu
  2013-04-24 16:37                 ` Stan Hoeppner
  1 sibling, 2 replies; 38+ messages in thread
From: Adam Goryachev @ 2013-04-24  9:12 UTC (permalink / raw)
  To: Andrei Banu; +Cc: linux-raid

On 24/04/13 18:26, Andrei Banu wrote:
> Hello,
>
> I am sorry for the irrelevant feedback. Where I misunderstood your
> request, I filled in the blanks (poorly).
>
> 1. SWAP
> root [~]# blkid | grep cef1d19d-2578-43db-9ffc-b6b70e227bfa
> /dev/md1: UUID="cef1d19d-2578-43db-9ffc-b6b70e227bfa" TYPE="swap"
>
> So yes, swap is on md1. This *md1 has a size of 2GB*. Isn't this way
> too low for a system with 16GB of memory?
>
Provide the output of "free". If there is RAM available, then it isn't
too small (that is my personal opinion, but at least it won't affect
performance/operations until you are using most of that swap space).

>
> 3. root [~]# fdisk -lu /dev/sd*
>
My mistake, I should have said:
fdisk -lu /dev/sd?

In any case, all of the relevant information was included, so no harm done.
> Disk /dev/sda: 512.1 GB, 512110190592 bytes
> 255 heads, 63 sectors/track, 62260 cylinders, total 1000215216 sectors
> Units = sectors of 1 * 512 = 512 bytes
> Sector size (logical/physical): 512 bytes / 512 bytes
> I/O size (minimum/optimal): 512 bytes / 512 bytes
> Disk identifier: 0x00026d59
>
>    Device Boot      Start         End      Blocks   Id  System
> /dev/sda1            2048     4196351     2097152   fd  Linux raid autodetect
> Partition 1 does not end on cylinder boundary.
> /dev/sda2   *     4196352     4605951      204800   fd  Linux raid autodetect
> Partition 2 does not end on cylinder boundary.
> /dev/sda3         4605952   814106623   404750336   fd  Linux raid autodetect
>
> Disk /dev/sdb: 512.1 GB, 512110190592 bytes
> 255 heads, 63 sectors/track, 62260 cylinders, total 1000215216 sectors
> Units = sectors of 1 * 512 = 512 bytes
> Sector size (logical/physical): 512 bytes / 512 bytes
> I/O size (minimum/optimal): 512 bytes / 512 bytes
> Disk identifier: 0x0003dede
>
>    Device Boot      Start         End      Blocks   Id  System
> /dev/sdb1            2048     4196351     2097152   fd  Linux raid autodetect
> Partition 1 does not end on cylinder boundary.
> /dev/sdb2   *     4196352     4605951      204800   fd  Linux raid autodetect
> Partition 2 does not end on cylinder boundary.
> /dev/sdb3         4605952   814106623   404750336   fd  Linux raid autodetect
>
I'm assuming from this you have three md RAID1 arrays where sda1/sdb1
are a pair, sda2/sdb2 are a pair and sda3/sdb3 are a pair?

Can you describe what is on each of these arrays?
The output of the following might be helpful:
cat /proc/mdstat
df
pvs
lvs

Regards,
Adam

-- 
Adam Goryachev
Website Managers
www.websitemanagers.com.au


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
  2013-04-24  9:12                 ` Adam Goryachev
@ 2013-04-24 10:24                   ` Tommy Apel
  2013-04-24 21:42                     ` Andrei Banu
  2013-04-24 21:40                   ` Andrei Banu
  1 sibling, 1 reply; 38+ messages in thread
From: Tommy Apel @ 2013-04-24 10:24 UTC (permalink / raw)
  To: Adam Goryachev; +Cc: Andrei Banu, linux-raid Raid, stan

Looks to me like it's the journaled quota process that holds everything back.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
  2013-04-24 10:24                   ` Tommy Apel
@ 2013-04-24 21:42                     ` Andrei Banu
  0 siblings, 0 replies; 38+ messages in thread
From: Andrei Banu @ 2013-04-24 21:42 UTC (permalink / raw)
  Cc: linux-raid Raid

Hi,

Why would it do that?
And how do I fix this?

Thanks!

On 24/04/2013 1:24 PM, Tommy Apel wrote:
> Looks to me like it's the journaled quota process that holds everything back.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
  2013-04-24  9:12                 ` Adam Goryachev
  2013-04-24 10:24                   ` Tommy Apel
@ 2013-04-24 21:40                   ` Andrei Banu
  1 sibling, 0 replies; 38+ messages in thread
From: Andrei Banu @ 2013-04-24 21:40 UTC (permalink / raw)
  Cc: linux-raid

Hi,

1. free -m
root [~]# free -m
             total       used       free     shared    buffers     cached
Mem:         15921      15542        379          0       1063      11870
-/+ buffers/cache:       2608      13313
Swap:         2046        100       1946

2. Yes, you understood correctly regarding the raid arrays (all 3 of them
are raid 1):

root@gts6 [~]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb2[1] sda2[0]
       204736 blocks super 1.0 [2/2] [UU]

md2 : active raid1 sdb3[1] sda3[0]
       404750144 blocks super 1.0 [2/2] [UU]

md1 : active raid1 sdb1[1] sda1[0]
       2096064 blocks super 1.1 [2/2] [UU]

unused devices: <none>

md0 is boot.
md1 is swap.
md2 is /
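
The array-to-partition pairing can also be pulled out of /proc/mdstat
mechanically. A minimal sketch, assuming the simple "mdN : active LEVEL
members..." layout shown above (extra state fields such as
"(auto-read-only)" are not handled); md_members is a hypothetical helper
name:

```shell
# Hypothetical helper: for each active array in an mdstat-format file,
# print "array level member member ..." with the [N] role suffix stripped.
# $1 is the file to read; defaults to /proc/mdstat.
md_members() {
    awk '$2 == ":" && $3 == "active" {
        line = $1 " " $4
        for (i = 5; i <= NF; i++) {
            m = $i
            sub(/\[.*/, "", m)      # sdb2[1] -> sdb2
            line = line " " m
        }
        print line
    }' "${1:-/proc/mdstat}"
}

if [ -r /proc/mdstat ]; then
    md_members
fi
```

On the mdstat output above this should list md0, md2 and md1, each as
raid1 with its sdbN/sdaN pair.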

3. df

root@gts6 [~]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/md2              380G  246G  116G  68% /
tmpfs                 7.8G     0  7.8G   0% /dev/shm
/dev/md0              194M   47M  137M  26% /boot
/usr/tmpDSK           3.6G  1.2G  2.2G  36% /tmp

4. pvs

root [~]# pvs -a
   PV         VG   Fmt Attr PSize PFree
   /dev/loop0          ---     0     0
   /dev/md0            ---     0     0
   /dev/md1            ---     0     0
   /dev/ram0           ---     0     0
   /dev/ram1           ---     0     0
   /dev/ram10          ---     0     0
   /dev/ram11          ---     0     0
   /dev/ram12          ---     0     0
   /dev/ram13          ---     0     0
   /dev/ram14          ---     0     0
   /dev/ram15          ---     0     0
   /dev/ram2           ---     0     0
   /dev/ram3           ---     0     0
   /dev/ram4           ---     0     0
   /dev/ram5           ---     0     0
   /dev/ram6           ---     0     0
   /dev/ram7           ---     0     0
   /dev/ram8           ---     0     0
   /dev/ram9           ---     0     0
   /dev/root           ---     0     0

5. lvs (No volume groups).

Thanks!




* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
  2013-04-24  8:26               ` Andrei Banu
  2013-04-24  9:12                 ` Adam Goryachev
@ 2013-04-24 16:37                 ` Stan Hoeppner
  2013-04-24 21:46                   ` Andrei Banu
  1 sibling, 1 reply; 38+ messages in thread
From: Stan Hoeppner @ 2013-04-24 16:37 UTC (permalink / raw)
  To: Andrei Banu; +Cc: linux-raid

On 4/24/2013 3:26 AM, Andrei Banu wrote:

> Total DISK READ: 0.00 B/s | Total DISK WRITE: 0.00 B/s
>   TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO> COMMAND
>   541 be/3 root        0.00 B/s    0.00 B/s  0.00 % 96.96 % [jbd2/md2-8]

This seems to be your problem.  jbd2 (journal block device) is causing
97% iowait, yet without doing much physical IO.  This is a component of
EXT4.  As this will fire intermittently it explains why you see such a
wide throughput gap between tests at different points in time.

This isn't a bug or Google would reveal that.  Andrei, you need to
identify which daemon or kernel feature is causing this.  Do you happen
to have realtime TRIM enabled?  It is well known to bring IO to a crawl.

If not realtime TRIM, I'd guess you turned a knob you should not have in
some config file, causing a daemon to frequently issue a few gazillion
atomic updates.

-- 
Stan


* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
  2013-04-24 16:37                 ` Stan Hoeppner
@ 2013-04-24 21:46                   ` Andrei Banu
       [not found]                     ` <CAH3kUhHnF0imY=CAHfzaQy4XJuOMgOtbHNp17EYzeSJR2en7Fg@mail.gmail.com>
  2013-04-25 10:56                     ` Stan Hoeppner
  0 siblings, 2 replies; 38+ messages in thread
From: Andrei Banu @ 2013-04-24 21:46 UTC (permalink / raw)
  Cc: linux-raid

Hi,

1. How can I at least start trying to find the daemon that might be 
doing this?

2. I am not sure what real time TRIM is. I thought there was the 'discard'
option in fstab (which I tried and it didn't help) and other trim commands
(fstrim, which errors out when run on /, or mdtrim, which seems to be
somebody's experiment). But I am not sure what real time trim might be.

I am not really sure where to go from here. I am a bit lost as it seems we
hit a dead end.

Thanks!
Andrei Banu

On 24/04/2013 7:37 PM, Stan Hoeppner wrote:
> On 4/24/2013 3:26 AM, Andrei Banu wrote:
>
>> Total DISK READ: 0.00 B/s | Total DISK WRITE: 0.00 B/s
>>    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO> COMMAND
>>    541 be/3 root        0.00 B/s    0.00 B/s  0.00 % 96.96 % [jbd2/md2-8]
> This seems to be your problem.  jbd2 (journal block device) is causing
> 97% iowait, yet without doing much physical IO.  This is a component of
> EXT4.  As this will fire intermittently it explains why you see such a
> wide throughput gap between tests at different points in time.
>
> This isn't a bug or Google would reveal that.  Andrei, you need to
> identify which daemon or kernel feature is causing this.  Do you happen
> to have realtime TRIM enabled?  It is well known to bring IO to a crawl.
>
> If not realtime TRIM, I'd guess you turned a knob you should not have in
> some config file, causing a daemon to frequently issue a few gazillion
> atomic updates.
>



[parent not found: <CAH3kUhHnF0imY=CAHfzaQy4XJuOMgOtbHNp17EYzeSJR2en7Fg@mail.gmail.com>]

* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
       [not found]                     ` <CAH3kUhHnF0imY=CAHfzaQy4XJuOMgOtbHNp17EYzeSJR2en7Fg@mail.gmail.com>
@ 2013-04-25 10:11                       ` Andrei Banu
  0 siblings, 0 replies; 38+ messages in thread
From: Andrei Banu @ 2013-04-25 10:11 UTC (permalink / raw)
  To: linux-raid

Hi,

I don't have fstab discard option set. I was just enumerating the trim 
kinds I know. I did try discard but it didn't do anything good. And the 
problem dated from before my discard test.

Regards!

On 2013-04-25 00:53, Roberto Spadim wrote:
> TRIM in ext4 = discard
> 2013/4/24 Andrei Banu <andrei.banu@redhost.ro>
> 
>> Hi,
>> 1. How can I at least start trying to find the daemon that might be 
>> doing this?
>> 2. I am not sure what real time TRIM is. I thought there was the 
>> 'discard' option in
>> fstab (which I tried and didn't help) and other command like trims 
>> (fstrim - which
>> errors out when run on / or mdtrim that seems somebody's experiment). 
>> But I
>> am not sure what real time trim might be.
>> I am not really sure where do I go from here. I am a bit lost as it 
>> seems we hit
>> a dead end.
>> Thanks!
>> Andrei Banu
>> On 24/04/2013 7:37 PM, Stan Hoeppner wrote:
>> 
>>> On 4/24/2013 3:26 AM, Andrei Banu wrote:
>>> 
>>>> Total DISK READ: 0.00 B/s | Total DISK WRITE: 0.00 B/s
>>>>    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO> 
>>>> COMMAND
>>>>    541 be/3 root        0.00 B/s    0.00 B/s  0.00 % 96.96 % 
>>>> [jbd2/md2-8]
>>> This seems to be your problem.  jbd2 (journal block device) is 
>>> causing
>>> 97% iowait, yet without doing much physical IO.  This is a component 
>>> of
>>> EXT4.  As this will fire intermittently it explains why you see such 
>>> a
>>> wide throughput gap between tests at different points in time.
>>> This isn't a bug or Google would reveal that.  Andrei, you need to
>>> identify which daemon or kernel feature is causing this.  Do you 
>>> happen
>>> to have realtime TRIM enabled?  It is well known to bring IO to a 
>>> crawl.
>>> If not realtime TRIM, I'd guess you turned a knob you should not 
>>> have in
>>> some config file, causing a daemon to frequently issue a few 
>>> gazillion
>>> atomic updates.
> --
> Roberto Spadim

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
  2013-04-24 21:46                   ` Andrei Banu
       [not found]                     ` <CAH3kUhHnF0imY=CAHfzaQy4XJuOMgOtbHNp17EYzeSJR2en7Fg@mail.gmail.com>
@ 2013-04-25 10:56                     ` Stan Hoeppner
  1 sibling, 0 replies; 38+ messages in thread
From: Stan Hoeppner @ 2013-04-25 10:56 UTC (permalink / raw)
  To: Andrei Banu

On 4/24/2013 4:46 PM, Andrei Banu wrote:

> 1. How can I at least start trying to find the daemon that might be
> doing this?

For you, I'd say grab a bucket of popcorn and watch top and iotop for a
while during peak use periods.  Fire up two ssh sessions and watch both
simultaneously, left and right on your screen.  You need to become
familiar with your system, what the applications are doing to cpu, mem,
and io.
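One way to make that watching less manual is to log iotop in batch mode and keep only the threads that spend most of their time waiting on IO. This is a rough sketch, not a diagnosis tool: busy_io_threads is a hypothetical helper name, and the column positions are assumed from the sample iotop line quoted in this thread (TID, PRIO, USER, read, write, SWAPIN, IO>, COMMAND).

```shell
#!/bin/sh
# Keep only iotop batch lines whose IO> column is at or above a threshold.
# Assumed column layout, taken from the sample line quoted in this thread:
#   TID PRIO USER <read> <unit> <write> <unit> <swapin> % <io> % COMMAND
# busy_io_threads is a hypothetical helper name, not part of iotop.
busy_io_threads() {
    threshold="${1:-50}"
    awk -v thr="$threshold" '
        # Data lines start with a numeric TID; header lines are skipped.
        $1 ~ /^[0-9]+$/ && NF >= 12 && ($10 + 0) >= thr {
            cmd = $12
            for (i = 13; i <= NF; i++) cmd = cmd " " $i
            printf "%s %s %s\n", $1, $10, cmd
        }'
}
```

Run as root, something like `iotop -b -o -d 5 -n 60 | busy_io_threads 50` would print the TID, IO percentage and command of anything above 50% over a five minute window.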

When you're not doing that, use Google.  Start reading about problems
others have with "[jbd2/]" and/or super slow performance with very fast
SSDs.

> 2. I am not sure what real time TRIM is. I thought there was the
> 'discard' option in
> fstab (which I tried and didn't help) and other command like trims

discard = realtime trim

If it's not enabled then this isn't the source of your problem.
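To confirm what the kernel is actually using, rather than what fstab says, the live mount options can be checked. A minimal sketch, assuming the filesystem of interest is mounted at / and that /proc/mounts is readable; has_discard is a hypothetical helper name.

```shell
#!/bin/sh
# Report whether a mount point carries the "discard" option.
# has_discard is a hypothetical helper; the mounts file defaults to
# /proc/mounts and can be overridden (for example, to test the function).
has_discard() {
    mount_point="$1"
    mounts_file="${2:-/proc/mounts}"
    awk -v mp="$mount_point" '
        $2 == mp {
            # Field 4 is the comma-separated option list.
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++) if (opts[i] == "discard") found = 1
        }
        END { exit found ? 0 : 1 }' "$mounts_file"
}
```

For example, `has_discard / && echo "realtime TRIM is on" || echo "realtime TRIM is off"`.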

> I am not really sure where do I go from here. I am a bit lost as it
> seems we hit
> a dead end.

There's only so much we can do.  The problem appears to have nothing to
do with md/RAID.  I'm doing my best to point you in the right
direction(s), but I'm neither a CentOS nor EXT4 user and am not familiar
with those ecosystems or their support channels.

You need to research your problem via Google, interface with other
CentOS users and others using the same type of cpanel based hosting
software stack.

If I had access to the box I'm sure I could figure this out for you, but
this isn't something I'm willing to do at this time.

Keep at it and you'll eventually figure it out.  And you'll learn a lot
along the way.

Best of luck.

-- 
Stan

> Thanks!
> Andrei Banu
> 
> On 24/04/2013 7:37 PM, Stan Hoeppner wrote:
>> On 4/24/2013 3:26 AM, Andrei Banu wrote:
>>
>>> Total DISK READ: 0.00 B/s | Total DISK WRITE: 0.00 B/s
>>>    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO> COMMAND
>>>    541 be/3 root        0.00 B/s    0.00 B/s  0.00 % 96.96 %
>>> [jbd2/md2-8]
>> This seems to be your problem.  jbd2 (journal block device) is causing
>> 97% iowait, yet without doing much physical IO.  This is a component of
>> EXT4.  As this will fire intermittently it explains why you see such a
>> wide throughput gap between tests at different points in time.
>>
>> This isn't a bug or Google would reveal that.  Andrei, you need to
>> identify which daemon or kernel feature is causing this.  Do you happen
>> to have realtime TRIM enabled?  It is well known to bring IO to a crawl.
>>
>> If not realtime TRIM, I'd guess you turned a knob you should not have in
>> some config file, causing a daemon to frequently issue a few gazillion
>> atomic updates.
>>


* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
  2013-04-21 23:17     ` Stan Hoeppner
  2013-04-22 10:19       ` Andrei Banu
@ 2013-04-22 23:11       ` Andrei Banu
  2013-04-23  4:39         ` Stan Hoeppner
  2013-04-22 23:25       ` Stan Hoeppner
  2 siblings, 1 reply; 38+ messages in thread
From: Andrei Banu @ 2013-04-22 23:11 UTC (permalink / raw)
  To: linux-raid

Hello again!

I have closed all the load generating services, waited a few minutes for
the server load to reach a clean 0.00 and then I have re-performed the
dd tests with various bs sizes. I was not able to set up fio correctly
because of a compile error, but I'll get it done.

One more thing before the results: I omitted to answer something earlier
today. CentOS was installed due to the fact that cPanel can be installed
on only a few OSes (CentOS, RHEL and I think that's about it). So I picked
CentOS. The installation was done remotely over KVM with a minimal 
CentOS CD (datacenter does not offer any server related services so we 
had to do it ourselves over a Raritan KVM).

Tests were done roughly 1 minute apart.

1. First test (bs=1G): same as always.
root [~]# dd if=testfile.tar.gz of=test oflag=sync bs=1G
547682517 bytes (548 MB) copied, 53.3767 s, 10.3 MB/s

2. With a bs of 4MB: niceeee! Best result ever. I am not sure what 
happened this time. However it's short lived.
root [~]# dd if=testfile.tar.gz of=test2 oflag=sync bs=4M
547682517 bytes (548 MB) copied, 4.43305 s, 124 MB/s

3. bs=2MB, starting to decay.
root [~]# dd if=testfile.tar.gz of=test3 oflag=sync bs=2M
547682517 bytes (548 MB) copied, 20.3647 s, 26.9 MB/s

4. bs=4MB again. Back to square 1.
root [~]# dd if=testfile.tar.gz of=test4 oflag=sync bs=4M
547682517 bytes (548 MB) copied, 56.7124 s, 9.7 MB/s

As services were shut down prior to the test, the biggest load it 
reached was about 2.

5. Finally I restarted the services and reran the bs=4MB test (starting
from a load of 0.23):
root [~]# dd if=testfile.tar.gz of=test6 oflag=sync bs=4M
547682517 bytes (548 MB) copied, 116.469 s, 4.7 MB/s

Again, I don't think my problem is related to any concurrent I/O 
starvation. These SSDs or this mdraid or I don't know what simply can't 
take any sustained write task. And this is not due to the server load. 
Even during very low server loads it's enough to write about 1GB of data
within a short time frame (minutes) to bring the I/O system to its
knees for a considerable time (at least tens of minutes).
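To make runs like these easier to repeat and compare, the write test can be wrapped in a small function. A minimal sketch, assuming GNU dd (for conv=fdatasync, as in the /dev/zero test earlier in the thread); dd_write_test is a hypothetical helper name and the file names in the usage line are placeholders.

```shell
#!/bin/sh
# dd_write_test OUTFILE BS COUNT: write COUNT blocks of size BS from
# /dev/zero and print dd's summary line. conv=fdatasync makes dd flush the
# file data to disk before it reports the elapsed time and rate.
dd_write_test() {
    LC_ALL=C dd if=/dev/zero of="$1" bs="$2" count="$3" conv=fdatasync 2>&1 | tail -n 1
}
```

For example, `for bs in 1M 2M 4M; do dd_write_test "sweep.$bs" "$bs" 128; sleep 60; done` writes 128 blocks at each size, one minute apart.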

4.7MB per second for writing a 548MB file starting from a load of 0.23 
during off peak hours on SSDs. Nice!!!

Thanks!

On 22/04/2013 2:17 AM, Stan Hoeppner wrote:
> On 4/21/2013 3:46 PM, Andrei Banu wrote:
>> Hello,
>>
>> At this point I probably should state that I am not an experienced
>> sysadmin.
> Things are becoming more clear now.
>
>> Knowing this, I do have a server management company but they
>> said they don't know what to do
> So you own this hardware and it is colocated, correct?
>
>> so now I am trying to fix things myself
>> but I am something of a noob. I normally try to keep my actions to
>> cautious config changes and testing.
> Why did you choose Centos?  Was this installed by the company?
>
>> I have never done a kernel update.
>> Any easy way to do this?
> It may not be necessary, at least to solve any SSD performance problems
> anyway.  Reexamining your numbers shows you hit 262MB/s to /dev/sda.
> That's 65% of SATA2 interface bandwidth, so this kernel probably does
> have the patch.  Your problem lies elsewhere.
>
>> Regarding your second advice (to purchase a decent HBA) I have already
>> thought about it but I guess it comes with its own drivers that need to
>> be compiled into initramfs etc.
> The default CentOS (RHEL) initramfs should include mptsas, which
> supports all the LSI HBAs.  The LSI caching RAID cards are supported as
> well with megaraid_sas.
>
> The question is, do you really need more than the ~260MB/s of peak
> throughput you currently have?  And is it worth the hassle?
>
>> So I am trying to replace the baseboard
>> with one with SATA3 support to avoid any configuration changes (the old
>> board has the C202 chipset and the new one has C204 so I guess this
>> replacement is as simple as it gets - just remove the old board and plug
>> the new one without any software changes or recompiles). Again I need to
>> say this server is in production and I can't move the data or the users.
>> I can have a few hours downtime during the night but that's about all.
> It's not clear your problem is hardware bandwidth.  In fact it seems the
> problem lies elsewhere.  It may simply be that you're running these tests
> while other substantial IO is occurring.  Actually, your numbers show
> this is exactly the case.  What they don't show is how much other IO is
> hitting the SSDs while you're running your tests.
>
>> Regarding the kernel upgrade, do we need to compile one from source or
>> there's an easier way?
> I don't believe at this point you need a new kernel to fix the problem
> you have.  If this patch was not present you'd not be able to get
> 260MB/s from SATA2.  Your problem lies elsewhere.
>
> In the future, instead of making a post saying "md is slow, my SSDs are
> slow" and pasting test data which appears to back that claim, you'd be
> better served by describing a general problem, such as "users say the
> system is slow and I think it may be md or SSD related".  This way we
> don't waste time following a troubleshooting path based on incorrect
> assumptions, as we've done here.  Or at least as I've done here, as I'm
> the only one assisting.
>
> Boot all users off the system, shut down any daemons that may generate
> any meaningful load on the disks or CPUs.  Disable any encryption or
> compression.  Then rerun your tests while completely idle.  Then we'll
> go from there.
>



* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
  2013-04-22 23:11       ` Andrei Banu
@ 2013-04-23  4:39         ` Stan Hoeppner
  0 siblings, 0 replies; 38+ messages in thread
From: Stan Hoeppner @ 2013-04-23  4:39 UTC (permalink / raw)
  To: Andrei Banu; +Cc: linux-raid

On 4/22/2013 6:11 PM, Andrei Banu wrote:
...
> 1. First test (bs=1G): same as always.
> root [~]# dd if=testfile.tar.gz of=test oflag=sync bs=1G
> 547682517 bytes (548 MB) copied, 53.3767 s, 10.3 MB/s
...
> root [~]# dd if=testfile.tar.gz of=test6 oflag=sync bs=4M
> 547682517 bytes (548 MB) copied, 116.469 s, 4.7 MB/s
...
> Again, I don't think my problem is related to any concurrent I/O
> starvation. These SSDs or this mdraid or I don't know what simply can't
> take any sustained write task. And this is not due to the server load.
> Even during very low server loads it's enough to write about 1GB of data
> within a short time frame (minutes) to bring the I/O system to its
> knees for a considerable time (at least tens of minutes).

Something's going on here.  Ditch dd for now.  What's the result of:

$ echo 3 > /proc/sys/vm/drop_caches
$ time cp testfile.tar.gz testxx.tmp; sync
548/real = xx MB/s

And now ditch flushing FS buffers:
$ echo 3 > /proc/sys/vm/drop_caches
$ time cp testfile.tar.gz testxx.tmp
548/real = xx MB/s
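The "548/real" arithmetic can be scripted so the flush is included in the measurement. Note that in bash, `time cp a b; sync` times only the cp, so this sketch brackets both steps itself. It assumes whole-second timestamps from `date +%s` are precise enough, which only holds for copies that take several seconds; mb_per_sec and timed_copy are hypothetical helper names.

```shell
#!/bin/sh
# mb_per_sec BYTES SECONDS: decimal megabytes per second, one decimal place,
# using the same 10^6 convention as the dd summaries in this thread.
mb_per_sec() {
    awk -v b="$1" -v s="$2" 'BEGIN { printf "%.1f\n", b / s / 1000000 }'
}

# timed_copy SRC DST: copy, flush with sync, and report throughput for both
# steps together. Elapsed time is clamped to at least 1 second.
timed_copy() {
    src="$1"; dst="$2"
    size=$(wc -c < "$src")
    start=$(date +%s)
    cp "$src" "$dst" && sync
    end=$(date +%s)
    elapsed=$((end - start))
    [ "$elapsed" -gt 0 ] || elapsed=1
    echo "$(mb_per_sec "$size" "$elapsed") MB/s over ${elapsed}s"
}
```

After dropping caches as root, `timed_copy testfile.tar.gz testxx.tmp` would report the combined copy-and-flush rate.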

And please paste this so we can see how you're mounting EXT4.
$ cat /etc/fstab |grep ext

Mounting data=journal will decrease write throughput by 50% as
everything is written twice: once to the journal, once into the
filesystem.  This wouldn't account for the entire performance deficit
though.

-- 
Stan



* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
  2013-04-21 23:17     ` Stan Hoeppner
  2013-04-22 10:19       ` Andrei Banu
  2013-04-22 23:11       ` Andrei Banu
@ 2013-04-22 23:25       ` Stan Hoeppner
  2013-04-23  4:49         ` Mikael Abrahamsson
  2 siblings, 1 reply; 38+ messages in thread
From: Stan Hoeppner @ 2013-04-22 23:25 UTC (permalink / raw)
  To: stan; +Cc: Andrei Banu, linux-raid

On 4/21/2013 6:17 PM, Stan Hoeppner wrote:

> It may not be necessary, at least to solve any SSD performance problems
> anyway.  Reexamining your numbers shows you hit 262MB/s to /dev/sda.
> That's 65% of SATA2 interface bandwidth, so this kernel probably does
> have the patch.  Your problem lies elsewhere.

Big correction.  That should state 87% of SATA2 interface bandwidth.  I
must have been thinking of three things at once when I fubar'd that, as
that's not simply a typo.
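The corrected figure is easy to check. A minimal sketch, assuming the nominal 300 MB/s SATA2 rate and the 262 MB/s measurement discussed in this thread; pct_of is a hypothetical helper name.

```shell
#!/bin/sh
# pct_of VALUE TOTAL: VALUE as a percentage of TOTAL, one decimal place.
pct_of() {
    awk -v v="$1" -v t="$2" 'BEGIN { printf "%.1f\n", 100 * v / t }'
}

pct_of 262 300   # measured MB/s against the nominal SATA2 rate; prints 87.3
```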

-- 
Stan




* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
  2013-04-22 23:25       ` Stan Hoeppner
@ 2013-04-23  4:49         ` Mikael Abrahamsson
  0 siblings, 0 replies; 38+ messages in thread
From: Mikael Abrahamsson @ 2013-04-23  4:49 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: Andrei Banu, linux-raid

On Mon, 22 Apr 2013, Stan Hoeppner wrote:

> On 4/21/2013 6:17 PM, Stan Hoeppner wrote:
>
>> It may not be necessary, at least to solve any SSD performance problems
>> anyway.  Reexamining your numbers shows you hit 262MB/s to /dev/sda.
>> That's 65% of SATA2 interface bandwidth, so this kernel probably does
>> have the patch.  Your problem lies elsewhere.
>
> Big correction.  That should state 87% of SATA2 interface bandwidth.  I
> must have been thinking of three things at once when I fubar'd that, as
> that's not simply a typo.

As far as I know, the 300 megabyte/s of SATA2 bandwidth doesn't include
coding overhead etc., so it's not theoretically possible to reach all the
way up to 300. From all the tests I've seen, around 260-270 megabyte/s
seems to be the maximum achievable, so I'd say 262 MB/s is basically as
much as can be expected from SATA2.

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se


* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
  2013-04-19 22:58 Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO Andrei Banu
                   ` (2 preceding siblings ...)
       [not found] ` <51732E2B.6090607@hardwarefreak.com>
@ 2013-04-23  6:01 ` Stan Hoeppner
  3 siblings, 0 replies; 38+ messages in thread
From: Stan Hoeppner @ 2013-04-23  6:01 UTC (permalink / raw)
  To: Andrei Banu; +Cc: linux-raid

On 4/19/2013 5:58 PM, Andrei Banu wrote:
> Hardware: SuperMicro 5017C-MTRF

Not relevant if you're using SATA ports 0-1, but may well be if using
2-5, assuming this system isn't brand new.  As I said previously, you'd
see some errors in dmesg if you had port/cable issues.  From:

Intel® 6 Series Chipset and Intel® C200 Series Chipset Specification Update

Problem:  Due to a circuit design issue on Intel 6 Series Chipset and
Intel C200 Series Chipset, electrical lifetime wear out may affect clock
distribution for SATA ports 2-5. This may manifest itself as a
functional issue on SATA ports 2-5 over time.

•The electrical lifetime wear out may result in device oxide degradation
which over time can cause drain to gate leakage current.

•This issue has time, temperature and voltage sensitivities.

Implication:  The increased leakage current may result in an unstable
clock and potentially functional issues on SATA ports 2-5 in the form of
receive errors, transmit errors, and unrecognized drives.

...
•SATA ports 0-1 are not affected by this design issue as they have
separate clock generation circuitry.

Workaround:  Intel has worked with board and system manufacturers to
identify and implement solutions for affected systems.

•Use only SATA ports 0-1.
•Use an add-in PCIe SATA bridge solution.

Not all boards are affected by this.  You'd have to check the spec
revision on your C202, which means contacting SuperMicro with your board
revision/serial number.  To be certain you're not affected simply use
only ports 0-1.  But on that note...

It may be an opportune time to consider dropping in a LSI 9211-4i.
4GB/s raw throughput, plenty for 4 SSDs at full boogie should you
expand.  The kit version comes with a 1-4 breakout cable for your 1U SM
chassis drive backplane.  Even if we get your issue fixed via software
and both drives are humming away at ~260MB/s, that nightly backup
process you mentioned, and others, would surely benefit from an
additional ~200MB/s throughput.

-- 
Stan



end of thread, other threads:[~2013-04-25 11:38 UTC | newest]

Thread overview: 38+ messages
2013-04-19 22:58 Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO Andrei Banu
     [not found] ` <CAH3kUhEaZGON=fAyVMZOz5fH_DcfKv=hCa96UCeK4pN7k81c_Q@mail.gmail.com>
     [not found]   ` <51725458.7020109@redhost.ro>
     [not found]     ` <CAH3kUhHxBiqugFQm=PPJNNe9jOdKy0etUjQNsoDz_LJNUCLCCQ@mail.gmail.com>
2013-04-20 23:25       ` Andrei Banu
2013-04-20 23:26       ` Andrei Banu
2013-04-21  2:48         ` Stan Hoeppner
2013-04-21 12:23           ` Tommy Apel
2013-04-21 16:48             ` Tommy Apel
2013-04-21 19:33             ` Stan Hoeppner
2013-04-21 19:56               ` Tommy Apel
2013-04-22  0:47                 ` Stan Hoeppner
2013-04-22  7:51                   ` Tommy Apel
2013-04-22  8:29                     ` Tommy Apel
2013-04-22 10:26                     ` Andrei Banu
2013-04-22 12:02                       ` Tommy Apel
2013-04-23  2:59                         ` Stan Hoeppner
2013-04-22 23:21                     ` Stan Hoeppner
2013-04-25 11:38         ` Thomas Jarosch
2013-04-20 23:26   ` Andrei Banu
2013-04-21  0:10 ` Stan Hoeppner
     [not found] ` <51732E2B.6090607@hardwarefreak.com>
2013-04-21 20:46   ` Andrei Banu
2013-04-21 23:17     ` Stan Hoeppner
2013-04-22 10:19       ` Andrei Banu
2013-04-23  2:51         ` Stan Hoeppner
2013-04-23 10:17           ` Andrei Banu
2013-04-24  3:24             ` Stan Hoeppner
2013-04-24  8:26               ` Andrei Banu
2013-04-24  9:12                 ` Adam Goryachev
2013-04-24 10:24                   ` Tommy Apel
2013-04-24 21:42                     ` Andrei Banu
2013-04-24 21:40                   ` Andrei Banu
2013-04-24 16:37                 ` Stan Hoeppner
2013-04-24 21:46                   ` Andrei Banu
     [not found]                     ` <CAH3kUhHnF0imY=CAHfzaQy4XJuOMgOtbHNp17EYzeSJR2en7Fg@mail.gmail.com>
2013-04-25 10:11                       ` Andrei Banu
2013-04-25 10:56                     ` Stan Hoeppner
2013-04-22 23:11       ` Andrei Banu
2013-04-23  4:39         ` Stan Hoeppner
2013-04-22 23:25       ` Stan Hoeppner
2013-04-23  4:49         ` Mikael Abrahamsson
2013-04-23  6:01 ` Stan Hoeppner
