* Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
@ 2013-04-19 22:58 Andrei Banu
From: Andrei Banu @ 2013-04-19 22:58 UTC (permalink / raw)
To: linux-raid
Hello!
I come to you with a difficult problem. We have an otherwise snappy
server fitted with an mdraid-1 array of two Samsung 840 PRO SSDs. If we
copy a large file to the server (from the same server or from the
network, it doesn't matter) the server load increases from roughly 0.7
to over 100 (for files of several GB). Apparently the reason is that
the RAID can't write well.
A few examples:
root [~]# dd if=testfile.tar.gz of=test20 oflag=sync bs=4M
130+1 records in
130+1 records out
547682517 bytes (548 MB) copied, 7.99664 s, 68.5 MB/s
And 10-20 seconds later I try the very same test:
root [~]# dd if=testfile.tar.gz of=test21 oflag=sync bs=4M
130+1 records in / 130+1 records out
547682517 bytes (548 MB) copied, 52.1958 s, 10.5 MB/s
A different test with 'bs=1G'
root [~]# w
12:08:34 up 1 day, 13:09, 1 user, load average: 0.37, 0.60, 0.72
root [~]# dd if=testfile.tar.gz of=test oflag=sync bs=1G
0+1 records in / 0+1 records out
547682517 bytes (548 MB) copied, 75.3476 s, 7.3 MB/s
root [~]# w
12:09:56 up 1 day, 13:11, 1 user, load average: 39.29, 12.67, 4.93
It needed 75 seconds to copy a half GB file and the server load
increased 100 times.
And a final test:
root@ [~]# dd if=/dev/zero of=test24 bs=64k count=16k conv=fdatasync
16384+0 records in / 16384+0 records out
1073741824 bytes (1.1 GB) copied, 61.8796 s, 17.4 MB/s
This time the load spiked to only ~ 20.
A few other peculiarities:
root@ [~]# hdparm -t /dev/sda
Timing buffered disk reads: 654 MB in 3.01 seconds = 217.55 MB/sec
root@ [~]# hdparm -t /dev/sdb
Timing buffered disk reads: 272 MB in 3.01 seconds = 90.44 MB/sec
The read speed is very different between the 2 devices (the margin is
140%) but look what happens when I run it with --direct:
root@ [~]# hdparm --direct -t /dev/sda
Timing O_DIRECT disk reads: 788 MB in 3.00 seconds = 262.23 MB/sec
root@ [~]# hdparm --direct -t /dev/sdb
Timing O_DIRECT disk reads: 554 MB in 3.00 seconds = 184.53 MB/sec
So the hardware seems to sustain speeds of about 200MB/s on both
devices but it differs greatly.
The measurement of sda increased 20% but sdb doubled. Maybe there's a
problem with the page cache?
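The percentages quoted above can be re-derived from the hdparm figures. This is only a minimal sketch, assuming awk is available; `pct_faster` is a hypothetical helper name introduced here for illustration, not part of hdparm.

```shell
# pct_faster A B: print how much larger A is than B, in percent (one decimal).
# Hypothetical helper for illustration only.
pct_faster() {
    LC_ALL=C awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (a - b) / b * 100 }'
}

pct_faster 217.55 90.44    # buffered reads: sda vs sdb
pct_faster 262.23 217.55   # sda: O_DIRECT vs buffered
pct_faster 184.53 90.44    # sdb: O_DIRECT vs buffered
```

With the numbers from this message the three results come out at roughly 140%, 20% and 104%, which matches "the margin is 140%", "increased 20%" and "doubled".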
BACKGROUND INFORMATION
Server type: general shared hosting server (3 weeks new)
O/S: CentOS 6.4 / 64 bit (2.6.32-358.2.1.el6.x86_64)
Hardware: SuperMicro 5017C-MTRF, E3-1270v2, 16GB RAM, 2 x Samsung 840
PRO 512GB
Partitioning: ~100GB left for over-provisioning, ext4.
I believe it is aligned:
root [~]# fdisk -lu
Disk /dev/sda: 512.1 GB, 512110190592 bytes
255 heads, 63 sectors/track, 62260 cylinders, total 1000215216 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00026d59
   Device Boot      Start         End      Blocks   Id  System
/dev/sda1            2048     4196351     2097152   fd  Linux raid autodetect
Partition 1 does not end on cylinder boundary.
/dev/sda2   *     4196352     4605951      204800   fd  Linux raid autodetect
Partition 2 does not end on cylinder boundary.
/dev/sda3         4605952   814106623   404750336   fd  Linux raid autodetect
Disk /dev/sdb: 512.1 GB, 512110190592 bytes
255 heads, 63 sectors/track, 62260 cylinders, total 1000215216 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x0003dede
   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1            2048     4196351     2097152   fd  Linux raid autodetect
Partition 1 does not end on cylinder boundary.
/dev/sdb2   *     4196352     4605951      204800   fd  Linux raid autodetect
Partition 2 does not end on cylinder boundary.
/dev/sdb3         4605952   814106623   404750336   fd  Linux raid autodetect
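The alignment claim can be checked from the start sectors in the fdisk listing. The sketch below assumes 512-byte sectors (as fdisk reports) and uses 1 MiB as the alignment boundary, which is a common convention and an assumption here, not something the drive vendor is quoted on in this thread. `is_aligned` is a hypothetical helper name.

```shell
# is_aligned START_SECTOR [SECTOR_BYTES] [BOUNDARY_BYTES]
# Succeeds when the partition's byte offset is a multiple of the boundary.
# Defaults: 512-byte sectors, 1 MiB boundary (assumed convention).
is_aligned() {
    ia_start=$1
    ia_sector=${2:-512}
    ia_boundary=${3:-1048576}
    [ $(( ia_start * ia_sector % ia_boundary )) -eq 0 ]
}

# Start sectors of partitions 1-3 from the fdisk output (same on both disks).
for s in 2048 4196352 4605952; do
    if is_aligned "$s"; then
        echo "start sector $s: aligned"
    else
        echo "start sector $s: NOT aligned"
    fi
done

# Bytes left unpartitioned after partition 3
# (total sectors minus last used sector minus one, times 512).
echo $(( (1000215216 - 814106623 - 1) * 512 ))
```

All three start sectors are multiples of 2048, so they sit on 1 MiB boundaries, and the unpartitioned tail works out to about 95 GB, consistent with the "~100GB left for over-provisioning" figure.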
The array is NOT degraded:
root@ [~]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb2[1] sda2[0]
204736 blocks super 1.0 [2/2] [UU]
md2 : active raid1 sdb3[1] sda3[0]
404750144 blocks super 1.0 [2/2] [UU]
md1 : active raid1 sdb1[1] sda1[0]
2096064 blocks super 1.1 [2/2] [UU]
unused devices: <none>
Write cache is on:
root@ [~]# hdparm -W /dev/sda
write-caching = 1 (on)
root@ [~]# hdparm -W /dev/sdb
write-caching = 1 (on)
SMART seems to be OK:
SMART overall-health self-assessment test result: PASSED (for both devices)
I have tried changing the IO scheduler to noop and deadline but I
couldn't see any improvement.
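To confirm which scheduler is actually in effect on each member disk, the active one is shown in square brackets in the sysfs `scheduler` file. A small sketch; `active_sched` is a hypothetical helper name, and the device names are the ones from this message.

```shell
# active_sched LINE: print the scheduler shown in [brackets] in a sysfs
# scheduler line such as "noop anticipatory deadline [cfq]".
# Hypothetical helper for illustration only.
active_sched() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Report the active scheduler for both RAID members, if the files exist.
for dev in sda sdb; do
    f=/sys/block/$dev/queue/scheduler
    if [ -r "$f" ]; then
        echo "$dev: $(active_sched "$(cat "$f")")"
    fi
done
```

Writing a scheduler name into the same file (for example `echo noop > /sys/block/sda/queue/scheduler` as root) switches it at runtime without a reboot.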
I have tried running fstrim but it errors out:
root [~]# fstrim -v /
fstrim: /: FITRIM ioctl failed: Operation not supported
So I have changed /etc/fstab to contain noatime and discard and rebooted
the server but to no avail.
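The FITRIM failure suggests that some layer below the filesystem does not accept discard requests. One way to narrow that down is to look at what each block device advertises in sysfs. This is a sketch under assumptions: it relies on the `discard_max_bytes` queue attribute, which may not exist on every kernel, and `discard_supported` is a hypothetical helper name. Whether md RAID1 passes discard through on this particular kernel is not established in this message.

```shell
# discard_supported DEV [SYSFS_ROOT]
# Succeeds when the device's queue advertises a non-zero discard_max_bytes.
# A missing attribute is treated as "not supported".
discard_supported() {
    ds_file="${2:-/sys/block}/$1/queue/discard_max_bytes"
    [ -r "$ds_file" ] && [ "$(cat "$ds_file")" -gt 0 ]
}

# Check the two SSDs and the md device that holds the root filesystem.
for dev in sda sdb md2; do
    if discard_supported "$dev"; then
        echo "$dev: discard advertised"
    else
        echo "$dev: no discard advertised (or attribute missing)"
    fi
done
```

If the SSDs advertise discard but the md device does not, the `discard` mount option and fstrim cannot reach the drives through the array.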
I no longer know what to do. And I need to come up with some sort of a
solution (it's neither reasonable nor acceptable to get 3-digit loads
from copying several GB worth of files). If anyone can help me, please do!
Thanks in advance!
Andy
* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
From: Andrei Banu @ 2013-04-20 23:25 UTC (permalink / raw)
To: linux-raid

The previous test was done with "noop" for scheduler (the speed test
completed at about 8MB/s). Then I rebooted the server and redid the
test (also with noop) and the result was slightly better but not as it
should be (21MB/s). A third test 5-10 minutes later (after the load
subsided) completed at 16MB/s. A fourth test ended with 14.6MB/s.

Something else: the weekly auto raid check started a little while ago
and it's going at an average of 60MB/s (anywhere between 25 and
100MB/s) with noop, cfq and deadline. A raid check with ordinary
mechanical drives gets done at about 160MB/s on the outer cylinders.
Why are these SSDs so slow?

These are the results from the 21MB/s test (5 seconds delay):

Device:          tps   kB_read/s   kB_wrtn/s     kB_read     kB_wrtn
sda           204.94     1918.02      389.30     1367719      277605
sdb           154.80     1196.21      389.30      853008      277605
md1             0.65        2.59        0.00        1848           0
md2           355.45     3106.05      388.53     2214890      277056
md0             1.10        2.90        0.01        2069           9

Device:          tps   kB_read/s   kB_wrtn/s     kB_read     kB_wrtn
sda           583.40    42764.80    29452.90      213824      147264
sdb           234.80    23172.00    14950.50      115860       74752
md1             0.00        0.00        0.00           0           0
md2          8079.60    65886.40    29862.40      329432      149312
md0             0.00        0.00        0.00           0           0

Device:          tps   kB_read/s   kB_wrtn/s     kB_read     kB_wrtn
sda            15.00        1.60     1740.00           8        8700
sdb            15.00        0.00     7196.80           0       35984
md1             0.00        0.00        0.00           0           0
md2           333.20        1.60     1330.40           8        6652
md0             0.00        0.00        0.00           0           0

Device:          tps   kB_read/s   kB_wrtn/s     kB_read     kB_wrtn
sda           167.20      538.40    37432.80        2692      187164
sdb            86.20       16.00    33688.80          80      168444
md1             0.00        0.00        0.00           0           0
md2          9510.80      572.80    37934.40        2864      189672
md0             0.00        0.00        0.00           0           0

Device:          tps   kB_read/s   kB_wrtn/s     kB_read     kB_wrtn
sda           150.20      585.60    29090.40        2928      145452
sdb            71.20       44.00    30355.20         220      151776
md1             0.00        0.00        0.00           0           0
md2          7306.20      615.20    28998.40        3076      144992
md0             0.00        0.00        0.00           0           0

Device:          tps   kB_read/s   kB_wrtn/s     kB_read     kB_wrtn
sda           257.20     1624.00     9913.80        8120       49569
sdb           137.20      372.00    21438.60        1860      107193
md1             0.00        0.00        0.00           0           0
md2          2600.80     1991.20     9504.00        9956       47520
md0             0.00        0.00        0.00           0           0

Device:          tps   kB_read/s   kB_wrtn/s     kB_read     kB_wrtn
sda           186.80      972.80      292.70        4864        1463
sdb           150.40      733.60      292.70        3668        1463
md1             0.00        0.00        0.00           0           0
md2           283.80     1706.40      291.20        8532        1456
md0             0.00        0.00        0.00           0           0

If you have any idea what I can do to improve this please let me know.

Thanks!!
* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
From: Andrei Banu @ 2013-04-20 23:26 UTC (permalink / raw)
To: linux-raid

Hi!

They are connected through SATA2 ports (this does explain the read speed
but not the pitiful write one) in AHCI.

Ok, I redid the test with '-d 6' seconds and 'noop' scheduler during the
same file copy and these are the entire results:

root [~]# iostat -d 6 -k
Linux 2.6.32-358.2.1.el6.x86_64 (host) 04/21/2013 _x86_64_(8 CPU)

Device:          tps   kB_read/s   kB_wrtn/s     kB_read     kB_wrtn
sda           245.95      832.69      591.13   219499895   155823699
sdb           190.80      572.24      590.88   150844446   155758671
md1             1.15        2.15        2.43      567732      641156
md2           406.02     1368.44      587.74   360725304   154930520
md0             0.06        0.23        0.00       59992         171

Device:          tps   kB_read/s   kB_wrtn/s     kB_read     kB_wrtn
sda            34.17        0.00     4466.00           0       26796
sdb             9.67        0.00     4949.33           0       29696
md1             0.00        0.00        0.00           0           0
md2          1116.50        0.00     4466.00           0       26796
md0             0.00        0.00        0.00           0           0

Device:          tps   kB_read/s   kB_wrtn/s     kB_read     kB_wrtn
sda            35.17        0.00     5475.33           0       32852
sdb             9.33        2.00     4522.67          12       27136
md1             0.00        0.00        0.00           0           0
md2          1369.67        8.00     5475.33          48       32852
md0             0.00        0.00        0.00           0           0

Device:          tps   kB_read/s   kB_wrtn/s     kB_read     kB_wrtn
sda            40.33        0.00     3160.00           0       18960
sdb            19.50        0.00     7882.00           0       47292
md1             0.00        0.00        0.00           0           0
md2           790.50        2.67     3160.00          16       18960
md0             0.00        0.00        0.00           0           0

Device:          tps   kB_read/s   kB_wrtn/s     kB_read     kB_wrtn
sda            77.67        4.00    15328.00          24       91968
sdb            50.33       16.00    10972.67          96       65836
md1             0.00        0.00        0.00           0           0
md2          3834.33        9.33    15328.00          56       91968
md0             0.00        0.00        0.00           0           0

Device:          tps   kB_read/s   kB_wrtn/s     kB_read     kB_wrtn
sda            66.67       48.00    10604.00         288       63624
sdb            23.17        0.00     9660.00           0       57960
md1             0.00        0.00        0.00           0           0
md2          2653.50       51.33    10604.00         308       63624
md0             0.00        0.00        0.00           0           0

Device:          tps   kB_read/s   kB_wrtn/s     kB_read     kB_wrtn
sda            37.83       24.67     5378.67         148       32272
sdb            13.17        3.33     6315.33          20       37892
md1             0.00        0.00        0.00           0           0
md2          1345.17       26.00     5378.67         156       32272
md0             0.00        0.00        0.00           0           0

Device:          tps   kB_read/s   kB_wrtn/s     kB_read     kB_wrtn
sda           132.50        4.67    22714.00          28      136284
sdb            32.33       20.00    12328.00         120       73968
md1             0.00        0.00        0.00           0           0
md2          5713.67       31.33    22843.33         188      137060
md0             0.00        0.00        0.00           0           0

Device:          tps   kB_read/s   kB_wrtn/s     kB_read     kB_wrtn
sda            58.17        6.00     8200.00          36       49200
sdb            23.00        8.00    11349.33          48       68096
md1             0.00        0.00        0.00           0           0
md2          1936.17       21.33     7729.33         128       46376
md0             0.00        0.00        0.00           0           0

Device:          tps   kB_read/s   kB_wrtn/s     kB_read     kB_wrtn
sda             6.17        0.00       24.67           0         148
sdb            10.00        0.00     5120.00           0       30720
md1             0.00        0.00        0.00           0           0
md2             6.17        0.00       24.67           0         148
md0             0.00        0.00        0.00           0           0

Device:          tps   kB_read/s   kB_wrtn/s     kB_read     kB_wrtn
sda             1.50        0.00        5.33           0          32
sdb            14.17        0.00     7170.67           0       43024
md1             0.00        0.00        0.00           0           0
md2             1.50        0.00        5.33           0          32
md0             0.00        0.00        0.00           0           0

Device:          tps   kB_read/s   kB_wrtn/s     kB_read     kB_wrtn
sda           256.00      346.67     1105.17        2080        6631
sdb           270.83      544.00     7029.17        3264       42175
md1            49.33      170.00       27.33        1020         164
md2           311.83      705.33     1076.67        4232        6460
md0             0.00        0.00        0.00           0           0

Device:          tps   kB_read/s   kB_wrtn/s     kB_read     kB_wrtn
sda            51.17       46.67      219.08         280        1314
sdb            48.67      140.00      219.08         840        1314
md1            20.67       82.67        0.00         496           0
md2            58.00      104.00      218.00         624        1308
md0             0.00        0.00        0.00           0           0

Thank you for your time.

Kind regards!

On 20/04/2013 4:11 PM, Roberto Spadim wrote:
>
> Hum at beginning you have more iops than the end, how you connected
> this devices, normally a ssd can handler more than 1000 iops and a hd
> no more than 300iops, how did you configured the queue of ssd disks?
> Could you change it to noop and test again?
>
> On 20/04/2013 05:39, "Andrei Banu" <andrei.banu@redhost.ro> wrote:
>
> Hi,
>
> I ran with '-d 3' iostat during a "heavy" (540MB) copy. It took a
> bit over a minute and completed with less than 9MB/s. These are
> some of the results (this does NOT include the first batch i.e.
> the average from start up result):
>
> Device:          tps   kB_read/s   kB_wrtn/s     kB_read     kB_wrtn
> sda           503.00     1542.67    28157.33        4628       84472
> sdb            66.00       72.00    13162.67         216       39488
> md1           373.00     1492.00        0.00        4476           0
> md2          6951.67      126.67    27734.67         380       83204
> md0             0.00        0.00        0.00           0           0
>
> Device:          tps   kB_read/s   kB_wrtn/s     kB_read     kB_wrtn
> sda            56.67       20.00     1177.50          60        3532
> sdb            47.33       12.00    10824.17          36       32472
> md1             0.67        2.67        0.00           8           0
> md2           322.00       25.33     1266.67          76        3800
> md0             0.00        0.00        0.00           0           0
>
> Device:          tps   kB_read/s   kB_wrtn/s     kB_read     kB_wrtn
> sda           122.00       16.00    45773.33          48      137320
> sdb            96.67       14.67    19472.00          44       58416
> md1             0.00        0.00        0.00           0           0
> md2         11431.00       32.00    45684.00          96      137052
> md0             0.00        0.00        0.00           0           0
>
> Device:          tps   kB_read/s   kB_wrtn/s     kB_read     kB_wrtn
> sda             0.00        0.00        0.00           0           0
> sdb            13.67        8.00     5973.33          24       17920
> md1             0.00        0.00        0.00           0           0
> md2             2.00        8.00        0.00          24           0
> md0             0.00        0.00        0.00           0           0
>
> This is the "normal" iostat taken after 10 minutes (this DOES
> include the first batch i.e. the average from start up result):
>
> Device:          tps   kB_read/s   kB_wrtn/s     kB_read     kB_wrtn
> sda           281.83      973.99      641.55   212615675   140045467
> sdb           215.51      665.94      641.55   145369465   140045467
> md1             1.18        2.17        2.56      473492      558452
> md2           470.71     1596.29      638.01   348460340   139272912
> md0             0.08        0.27        0.00       59983         171
>
> Device:          tps   kB_read/s   kB_wrtn/s     kB_read     kB_wrtn
> sda            41.67      237.33      133.67         712         401
> sdb            39.33       90.67      133.67         272         401
> md1             0.00        0.00        0.00           0           0
> md2            83.00      328.00      133.33         984         400
> md0             0.00        0.00        0.00           0           0
>
> Device:          tps   kB_read/s   kB_wrtn/s     kB_read     kB_wrtn
> sda            29.33        2.67      110.00           8         330
> sdb            29.33        2.67      110.00           8         330
> md1             0.00        0.00        0.00           0           0
> md2            28.67        5.33      109.33          16         328
> md0             0.00        0.00        0.00           0           0
>
> Device:          tps   kB_read/s   kB_wrtn/s     kB_read     kB_wrtn
> sda           175.67        1.33      747.50           4        2242
> sdb           182.00       56.00      747.50         168        2242
> md1             0.00        0.00        0.00           0           0
> md2           191.33       57.33      746.67         172        2240
> md0             0.00        0.00        0.00           0           0
>
> Best regards!
>
> On 20/04/2013 3:59 AM, Roberto Spadim wrote:
>> run some kind of iostat -d 1 -k and check the write/read iops
>> and kb/s
>>
>> --
>> Roberto Spadim
* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
From: Stan Hoeppner @ 2013-04-21 2:48 UTC (permalink / raw)
To: Andrei Banu; +Cc: linux-raid

On 4/20/2013 6:26 PM, Andrei Banu wrote:

> They are connected through SATA2 ports (this does explain the read speed
> but not the pitiful write one) in AHCI.

These SSDs are capable of 500MB/s, and cost ~$1000 USD. Spend ~$200 USD
on a decent HBA. The 6G SAS/SATA LSI 9211-4i seems perfectly suited to
your RAID1 SSD application. It is a 4 port enterprise JBOD HBA that
also supports ASIC level RAID 1, 1E, 10.

Also, the difference in throughput you show between RAID maintenance,
direct device access, and filesystem access suggests you have something
running between the block and filesystem layers, for instance LUKS.
Though LUKS alone shouldn't hammer your CPU and IO throughput so
dramatically. However, if the SSDs do compression or encryption
automatically, and I believe the 840s do, the LUKS encrypted blocks may
cause the SSD firmware to take considerably more time to process the blocks.

--
Stan
* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
From: Tommy Apel @ 2013-04-21 12:23 UTC (permalink / raw)
To: stan; +Cc: Andrei Banu, linux-raid Raid

Hello, FYI I'm getting ~68MB/s on two intel330 in RAID1 as well on
vanilla 3.8.8 and 3.9.0-rc3 when writing random data and ~236MB/s
writing from /dev/zero

mdadm -C /dev/md0 -l 1 -n 2 --assume-clean --force --run /dev/sdb /dev/sdc
openssl enc -aes-128-ctr -pass pass:"$(dd if=/dev/urandom bs=128 count=1 2>/dev/null | base64)" -nosalt < /dev/zero | pv -pterb > /run/fill   # ~1.06GB/s
dd if=/run/fill of=/dev/null bs=1M count=1024 iflag=fullblock   # ~5.7GB/s
dd if=/run/fill of=/dev/md0 bs=1M count=1024 oflag=direct   # ~68MB/s
dd if=/dev/zero of=/dev/md0 bs=1M count=1024 oflag=direct   # ~236MB/s

iostat claims 100% util on both drives when doing so, with both the
deadline and noop schedulers. Doing the same with 4 threads, offset by
1.1GB on the disk and pinned to 4 cores with taskset, makes no
difference: still ~68MB/s with random data

# for x in `seq 0 4`; do taskset -c $x dd if=/run/fill of=/dev/md0 bs=1M count=1024 seek=$(($x * 1024)) oflag=direct & done

/Tommy
* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
From: Tommy Apel @ 2013-04-21 16:48 UTC (permalink / raw)
To: stan; +Cc: Andrei Banu, linux-raid Raid

Just did a blockwise test as well with fio.

Single SSD:

# ./scst-trunk/scripts/blockdev-perftest -d -f -i 1 -j -m 10 -M 20 -s 30 -f /dev/sdb
blocksize          W     W(avg,     W(std,          W          R     R(avg,     R(std,          R
  (bytes)        (s)      MB/s)      MB/s)     (IOPS)        (s)      MB/s)      MB/s)     (IOPS)
  1048576      6.548    156.384      0.000    156.384      2.383    429.710      0.000    429.710
   524288      6.311    162.256      0.000    324.513      2.521    406.188      0.000    812.376
   262144      6.183    165.615      0.000    662.462      3.003    340.992      0.000   1363.969
   131072      6.096    167.979      0.000   1343.832      3.140    326.115      0.000   2608.917
    65536      5.973    171.438      0.000   2743.010      3.807    268.978      0.000   4303.651
    32768      5.748    178.149      0.000   5700.765      4.609    222.174      0.000   7109.568
    16384      5.693    179.870      0.000  11511.681      5.203    196.810      0.000  12595.810
     8192      6.188    165.482      0.000  21181.642      7.339    139.529      0.000  17859.654
     4096     10.190    100.491      0.000  25725.613     13.816     74.117      0.000  18973.943
     2048     25.018     40.931      0.000  20956.431     26.136     39.180      0.000  20059.994
     1024     39.693     25.798      0.000  26417.152     50.580     20.245      0.000  20731.040

RAID1 with two Intel330 SSDs:

# ./scst-trunk/scripts/blockdev-perftest -d -f -i 1 -j -m 10 -M 20 -s 30 -f /dev/md0
blocksize          W     W(avg,     W(std,          W          R     R(avg,     R(std,          R
  (bytes)        (s)      MB/s)      MB/s)     (IOPS)        (s)      MB/s)      MB/s)     (IOPS)
  1048576      7.053    145.186      0.000    145.186      2.384    429.530      0.000    429.530
   524288      6.906    148.277      0.000    296.554      2.518    406.672      0.000    813.344
   262144      6.763    151.412      0.000    605.648      2.871    356.670      0.000   1426.681
   131072      6.558    156.145      0.000   1249.161      3.166    323.437      0.000   2587.492
    65536      6.578    155.670      0.000   2490.727      3.835    267.014      0.000   4272.229
    32768      6.311    162.256      0.000   5192.204      4.379    233.843      0.000   7482.987
    16384      6.406    159.850      0.000  10230.409      5.953    172.014      0.000  11008.903
     8192      7.776    131.687      0.000  16855.967      8.621    118.780      0.000  15203.805
     4096     11.137     91.946      0.000  23538.116     14.138     72.429      0.000  18541.802
     2048     38.440     26.639      0.000  13639.126     22.512     45.487      0.000  23289.268
     1024     60.933     16.805      0.000  17208.672     43.247     23.678      0.000  24246.214

It sort of confirms that the performance goes down, but I would kind of
expect that in a way, as the write confirmation has to come from both
disks.

/Tommy
* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
From: Stan Hoeppner @ 2013-04-21 19:33 UTC (permalink / raw)
To: Tommy Apel; +Cc: Andrei Banu, linux-raid Raid

On 4/21/2013 7:23 AM, Tommy Apel wrote:
> Hello, FYI I'm getting ~68MB/s on two intel330 in RAID1 aswell on
> vanilla 3.8.8 and 3.9.0-rc3 when writing random data and ~236MB/s
> writing from /dev/zero
>
> mdadm -C /dev/md0 -l 1 -n 2 --assume-clean --force --run /dev/sdb /dev/sdc
> openssl enc -aes-128-ctr -pass pass:"$(dd if=/dev/urandom bs=128
> count=1 2>/dev/null | base64)" -nosalt < /dev/zero | pv -pterb >
> /run/fill ~1.06GB/s

What's the purpose of all of this? Surely not simply to create random
data, which is accomplished much more easily. Are you sand bagging us
here with a known bug, or simply trying to show off your mad skillz?
Either way this is entirely unnecessary for troubleshooting an IO
performance issue. dd doesn't (shouldn't) care if the bits are random
or not, though the Intel SSD controller might, as well as other layers
you may have in your IO stack. Keep it simple so we can isolate one
layer at a time.

> dd if=/run/fill of=/dev/null bs=1M count=1024 iflag=fullblock ~5.7GB/s
> dd if=/run/fill of=/dev/md0 bs=1M count=1024 oflag=direct ~68MB/s
> dd if=/dev/zero of=/dev/md0 bs=1M count=1024 oflag=direct ~236MB/s

Noting the above, it's interesting that you omitted this test

dd if=/run/fill of=/dev/sdb bs=1M count=1024 oflag=direct

preventing an apples to apples comparison between raw SSD device and
md/RAID1 performance with your uber random file as input.

--
Stan
* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
From: Tommy Apel @ 2013-04-21 19:56 UTC (permalink / raw)
To: stan; +Cc: Andrei Banu, linux-raid Raid

Calm the f. down, I was just handing over some information, sorry your
day was ruined mr. high and mighty, use the info for whatever you want
to but flaming me isn't going to help anyone.
* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
2013-04-21 19:56 ` Tommy Apel
@ 2013-04-22 0:47 ` Stan Hoeppner
2013-04-22 7:51 ` Tommy Apel
0 siblings, 1 reply; 38+ messages in thread
From: Stan Hoeppner @ 2013-04-22 0:47 UTC (permalink / raw)
To: Tommy Apel; +Cc: Andrei Banu, linux-raid Raid

On 4/21/2013 2:56 PM, Tommy Apel wrote:
> Calm the f. down, I was just handing over some information, sorry your
> day was ruined mr. high and mighty, use the info for whatever you want
> to but flaming me is't going to help anyone.

Your tantrum aside, the Intel 330, as well as all current Intel SSDs,
uses the SandForce 2281 controller. The SF2xxx series' write
performance is limited by the compressibility of the data. What you're
doing below is simply showcasing the write bandwidth limitation of the
SF2xxx controllers with incompressible data.

This is not relevant to md. And it's not relevant to Andrei. It turns
out that the Samsung 840 SSDs have consistent throughput because they
don't rely on compression.

--
Stan
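To make the compressibility point concrete, here is a minimal Python sketch, offered as an illustration rather than a model of the drive. It assumes zlib as a stand-in for whatever compression a SandForce controller applies internally (the real algorithm is not described in this thread). It compares zero-filled data, as read from /dev/zero, with random data, as read from /dev/urandom.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Return compressed size divided by original size (smaller = more compressible)."""
    return len(zlib.compress(data)) / len(data)


SIZE = 1024 * 1024               # 1 MiB sample, chosen arbitrarily

zeros = bytes(SIZE)              # all-zero buffer, like reading /dev/zero
random_data = os.urandom(SIZE)   # random buffer, like reading /dev/urandom

# zlib is only an assumed stand-in for the controller's own compression.
print(f"zeros:  {compression_ratio(zeros):.4f}")
print(f"random: {compression_ratio(random_data):.4f}")
```

The zero buffer shrinks to a tiny fraction of its size while the random buffer does not shrink at all. A controller that compresses before writing has far less to store in the first case, which would be consistent with the ~236MB/s versus ~68MB/s gap reported earlier in the thread. A controller that does not compress should show no such gap.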
* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
2013-04-22 0:47 ` Stan Hoeppner
@ 2013-04-22 7:51 ` Tommy Apel
2013-04-22 8:29 ` Tommy Apel
` (2 more replies)
0 siblings, 3 replies; 38+ messages in thread
From: Tommy Apel @ 2013-04-22 7:51 UTC (permalink / raw)
To: stan; +Cc: Andrei Banu, linux-raid Raid

Stan>
That was exactly what I was trying to show, that your results may vary
depending on data and backing device; as far as the raid1 goes, it
doesn't care much about the data being passed through it.

Ben>
could you try to run iostat -x 2 for a few minutes just to make sure
there is no other I/O going on the device before running your tests,
and then run the tests with fio instead of dd?

fio write test:
fio --rw=write --filename=testfile --bs=1048576 --size=4294967296 --ioengine=psync --end_fsync=1 --invalidate=1 --direct=1 --name=writeperftest

/Tommy
* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
2013-04-22 7:51 ` Tommy Apel
@ 2013-04-22 8:29 ` Tommy Apel
2013-04-22 10:26 ` Andrei Banu
2013-04-22 23:21 ` Stan Hoeppner
2 siblings, 0 replies; 38+ messages in thread
From: Tommy Apel @ 2013-04-22 8:29 UTC (permalink / raw)
To: Andrei Banu, stan; +Cc: linux-raid Raid

Ben = Andrei, sorry for the typo.
* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
2013-04-22 7:51 ` Tommy Apel
2013-04-22 8:29 ` Tommy Apel
@ 2013-04-22 10:26 ` Andrei Banu
2013-04-22 12:02 ` Tommy Apel
2013-04-22 23:21 ` Stan Hoeppner
2 siblings, 1 reply; 38+ messages in thread
From: Andrei Banu @ 2013-04-22 10:26 UTC (permalink / raw)
To: linux-raid

Hi,

No worries about the typo. I ran iostat -x -m 2 for a few minutes and I get:

- 0-500KB/s 70% of the time
- 1-2MB/s 20% of the time
- 3-4MB/s 10% of the time.

It never went beyond 4MB/s write speed. But I guess none of this
qualifies as a heavy write. Right?

Can the fio test be carried out safely on an active production server
just as you gave it?

Thanks!
Andrei
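As a back-of-the-envelope check on whether that background load matters, the distribution reported above can be reduced to a single average. The sketch below assumes the midpoint of each reported range is representative, which is a simplification made here for illustration and not something measured in the thread.

```python
# (share of time, low MB/s, high MB/s) as reported from "iostat -x -m 2"
observed = [
    (0.70, 0.0, 0.5),   # 0-500KB/s
    (0.20, 1.0, 2.0),   # 1-2MB/s
    (0.10, 3.0, 4.0),   # 3-4MB/s
]


def average_write_mb_s(buckets):
    """Weighted average write rate, using each range's midpoint (an assumption)."""
    return sum(share * (low + high) / 2 for share, low, high in buckets)


avg = average_write_mb_s(observed)
print(f"approximate background write rate: {avg:.3f} MB/s")
```

Under these assumptions the background writes average below 1 MB/s, which is small next to the tens of MB/s the dd tests were attempting, so idle-time load alone seems unlikely to explain the slow copies.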
* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
2013-04-22 10:26 ` Andrei Banu
@ 2013-04-22 12:02 ` Tommy Apel
2013-04-23 2:59 ` Stan Hoeppner
0 siblings, 1 reply; 38+ messages in thread
From: Tommy Apel @ 2013-04-22 12:02 UTC (permalink / raw)
To: Andrei Banu, stan; +Cc: linux-raid Raid

Yes, it can be run as it is; it will write to the file given by --filename=

Well, from what I make of it so far I wouldn't rule out the bad device
part, but at the same time there could be other things involved,
although I don't believe it to be the md part.

Stan> do you know anything about the state of ext4 on CentOS 6.x?

/Tommy
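Before running the fio command from earlier in the thread on a production machine, it may help to spell out what its --bs and --size values imply. The sketch below decodes the two byte counts and estimates how long the write phase would last at the best and worst dd rates Andrei reported in his first message (68.5 MB/s and 7.3 MB/s). The estimate is a plain division and assumes the rate stays constant, which the thread suggests it does not.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

BLOCK_SIZE = 1048576      # value passed to --bs
TOTAL_SIZE = 4294967296   # value passed to --size


def estimate_seconds(size_bytes: int, rate_mb_s: float) -> float:
    """Time to write size_bytes at rate_mb_s, with MB meaning 10**6 bytes as dd reports it."""
    return size_bytes / (rate_mb_s * 1_000_000)


print(f"block size: {BLOCK_SIZE // MIB} MiB")
print(f"file size:  {TOTAL_SIZE // GIB} GiB")
print(f"writes:     {TOTAL_SIZE // BLOCK_SIZE}")

# Rates taken from the dd results in the original post.
for rate in (68.5, 7.3):
    print(f"at {rate} MB/s: about {estimate_seconds(TOTAL_SIZE, rate):.0f} s")
```

So the test needs 4 GiB of free space for testfile, and at the slower observed rate it could keep the array busy for several minutes. That is worth weighing when choosing a time to run it.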
* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
2013-04-22 12:02 ` Tommy Apel
@ 2013-04-23 2:59 ` Stan Hoeppner
0 siblings, 0 replies; 38+ messages in thread
From: Stan Hoeppner @ 2013-04-23 2:59 UTC (permalink / raw)
To: Tommy Apel; +Cc: Andrei Banu, linux-raid Raid

On 4/22/2013 7:02 AM, Tommy Apel wrote:
> Yes it can be run as it is, it will write to the file given by --filename=
>
> well from what I make of it so far I wouldn't rule out the bad device
> part but at the same time there could be other things involved
> although I don't belive it to be the md part
>
> Stan> do you know anything about the state of ext4 on centos 6.x ?

Enough to assume it's not part of the problem here. Andrei's hdparm
throughput, which is measured below the filesystem layer, is bouncing
up and down by ~100MB/s depending on when he runs it. If he's using
LVM and has active snapshots, that would definitely cause some extra
load, but in that case, given his 3 RAID1 pairs, it should affect both
drives equally. And that's not what we're seeing.

I hope my last post gets him closer to identifying the problem. The
perf top and iotop data captured during a $bigfile copy should be
instructive.

--
Stan
* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
2013-04-22 7:51 ` Tommy Apel
2013-04-22 8:29 ` Tommy Apel
2013-04-22 10:26 ` Andrei Banu
@ 2013-04-22 23:21 ` Stan Hoeppner
2 siblings, 0 replies; 38+ messages in thread
From: Stan Hoeppner @ 2013-04-22 23:21 UTC (permalink / raw)
To: Tommy Apel; +Cc: Andrei Banu, linux-raid Raid

On 4/22/2013 2:51 AM, Tommy Apel wrote:
> Stan>
> That was exactly what I was trying to show, that you result may vary
> depending on data and backing device, as far as the raid1 goes it
> doesn't care much for the data beeing passed through it.

As I mentioned, this is true of the SandForce 2nd gen ASICs, maybe some
others. The Samsung SSDs use a home grown Samsung controller which
doesn't do compression. Its performance doesn't vary due to data
content. Thus the performance gap you demonstrated doesn't apply to
Andrei.

We can eliminate this as a possible cause of his apparently horrible
performance. And I think we can eliminate the regression in 2.6.32, as
that patch seems to be included in his kernel; otherwise he'd likely
not get ~260MB/s in his hdparm O_DIRECT read tests.

The mystery continues...

--
Stan
* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
2013-04-20 23:26 ` Andrei Banu
2013-04-21 2:48 ` Stan Hoeppner
@ 2013-04-25 11:38 ` Thomas Jarosch
1 sibling, 0 replies; 38+ messages in thread
From: Thomas Jarosch @ 2013-04-25 11:38 UTC (permalink / raw)
To: Andrei Banu; +Cc: linux-raid

On Sunday, 21. April 2013 02:26:26 Andrei Banu wrote:
> They are connected through SATA2 ports (this does explain the read speed
> but not the pitiful write one) in AHCI.

So the SATA controller is already in AHCI mode. Good.

You didn't say what kind of server hardware you are using or I missed it.

On the HP DL3xxx servers we usually use, we have to enable AHCI mode
_and_ the write cache in the BIOS. Maybe your server needs something
similar. Some RAID controllers only allow you to enable the write cache
when a battery-backed write cache module is installed.

HTH,
Thomas
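The remark that SATA2 "does explain the read speed" can be checked with a little arithmetic. SATA uses 8b/10b line encoding, so a 3 Gbit/s link carries at most about 300 MB/s of payload. The sketch below compares that ceiling with the hdparm --direct figures from the original post. It treats hdparm's "MB" as 10^6 bytes, which is an assumption; if hdparm counts in units of 1024*1024 bytes, the shares come out a few points higher.

```python
def sata_payload_mb_s(line_rate_gbit_s: float) -> float:
    """Usable payload rate in MB/s for a SATA link using 8b/10b encoding."""
    bits_per_second = line_rate_gbit_s * 1e9
    return bits_per_second / 10 / 1e6   # 10 line bits carry one data byte


SATA2_MB_S = sata_payload_mb_s(3.0)   # SATA 3 Gbit/s
SATA3_MB_S = sata_payload_mb_s(6.0)   # SATA 6 Gbit/s

# hdparm --direct -t results from the original post (units assumed decimal).
hdparm_direct = {"sda": 262.23, "sdb": 184.53}

for dev, mb_s in hdparm_direct.items():
    share = mb_s / SATA2_MB_S
    print(f"{dev}: {mb_s} MB/s is {share:.0%} of the SATA2 ceiling ({SATA2_MB_S:.0f} MB/s)")
```

On these numbers sda reads close to the interface limit while sdb does not, so the port speed can account for sda's read rate but not for the difference between the two drives, and not for the write speeds.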
* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
[not found] ` <CAH3kUhEaZGON=fAyVMZOz5fH_DcfKv=hCa96UCeK4pN7k81c_Q@mail.gmail.com>
[not found] ` <51725458.7020109@redhost.ro>
@ 2013-04-20 23:26 ` Andrei Banu
1 sibling, 0 replies; 38+ messages in thread
From: Andrei Banu @ 2013-04-20 23:26 UTC (permalink / raw)
To: linux-raid

Hi,

I ran iostat with '-d 3' during a "heavy" (540MB) copy. It took a bit
over a minute and completed at less than 9MB/s. These are some of the
results (this does NOT include the first batch, i.e. the average since
start-up):

Device:       tps  kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
sda        503.00    1542.67   28157.33     4628    84472
sdb         66.00      72.00   13162.67      216    39488
md1        373.00    1492.00       0.00     4476        0
md2       6951.67     126.67   27734.67      380    83204
md0          0.00       0.00       0.00        0        0

Device:       tps  kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
sda         56.67      20.00    1177.50       60     3532
sdb         47.33      12.00   10824.17       36    32472
md1          0.67       2.67       0.00        8        0
md2        322.00      25.33    1266.67       76     3800
md0          0.00       0.00       0.00        0        0

Device:       tps  kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
sda        122.00      16.00   45773.33       48   137320
sdb         96.67      14.67   19472.00       44    58416
md1          0.00       0.00       0.00        0        0
md2      11431.00      32.00   45684.00       96   137052
md0          0.00       0.00       0.00        0        0

Device:       tps  kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
sda          0.00       0.00       0.00        0        0
sdb         13.67       8.00    5973.33       24    17920
md1          0.00       0.00       0.00        0        0
md2          2.00       8.00       0.00       24        0
md0          0.00       0.00       0.00        0        0

This is the "normal" iostat taken after 10 minutes (this DOES include
the first batch, i.e. the average since start-up):

Device:       tps  kB_read/s  kB_wrtn/s    kB_read    kB_wrtn
sda        281.83     973.99     641.55  212615675  140045467
sdb        215.51     665.94     641.55  145369465  140045467
md1          1.18       2.17       2.56     473492     558452
md2        470.71    1596.29     638.01  348460340  139272912
md0          0.08       0.27       0.00      59983        171

Device:       tps  kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
sda         41.67     237.33     133.67      712      401
sdb         39.33      90.67     133.67      272      401
md1          0.00       0.00       0.00        0        0
md2         83.00     328.00     133.33      984      400
md0          0.00       0.00       0.00        0        0

Device:       tps  kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
sda         29.33       2.67     110.00        8      330
sdb         29.33       2.67     110.00        8      330
md1          0.00       0.00       0.00        0        0
md2         28.67       5.33     109.33       16      328
md0          0.00       0.00       0.00        0        0

Device:       tps  kB_read/s  kB_wrtn/s  kB_read  kB_wrtn
sda        175.67       1.33     747.50        4     2242
sdb        182.00      56.00     747.50      168     2242
md1          0.00       0.00       0.00        0        0
md2        191.33      57.33     746.67      172     2240
md0          0.00       0.00       0.00        0        0

Best regards!

On 20/04/2013 3:59 AM, Roberto Spadim wrote:
> run some kind of iostat -d 1 -k and check the write/read iops and kb/s
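One way to read the copy-time samples above is to total what each RAID1 member wrote. Both members of a mirror receive the same writes, and the long-run totals in the "normal" table are indeed identical for sda and sdb. The sketch below sums the kB_wrtn column for the four 3-second samples taken during the copy. Andrei describes these as only "some of the results", so the totals are illustrative and the intervals may not be contiguous.

```python
INTERVAL_S = 3  # iostat was run with '-d 3'

# kB_wrtn per sample during the copy, copied from the first four tables.
kb_written = {
    "sda": [84472, 3532, 137320, 0],
    "sdb": [39488, 32472, 58416, 17920],
}


def summarize(samples, interval_s=INTERVAL_S):
    """Return (total kB written, average kB/s) over the given samples."""
    total_kb = sum(samples)
    return total_kb, total_kb / (len(samples) * interval_s)


for dev, samples in kb_written.items():
    total_kb, avg_kb_s = summarize(samples)
    print(f"{dev}: {total_kb} kB written, {avg_kb_s:.0f} kB/s average")

backlog_kb = sum(kb_written["sda"]) - sum(kb_written["sdb"])
print(f"sdb trails sda by {backlog_kb} kB over these samples")
```

In these samples sdb completes noticeably less than sda and is still writing in the last interval when sda is already idle. That pattern is consistent with sdb being the slower member of the mirror, though with only partial samples it is a hint rather than proof.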
* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
2013-04-19 22:58 Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO Andrei Banu
[not found] ` <CAH3kUhEaZGON=fAyVMZOz5fH_DcfKv=hCa96UCeK4pN7k81c_Q@mail.gmail.com>
@ 2013-04-21 0:10 ` Stan Hoeppner
[not found] ` <51732E2B.6090607@hardwarefreak.com>
2013-04-23 6:01 ` Stan Hoeppner
3 siblings, 0 replies; 38+ messages in thread
From: Stan Hoeppner @ 2013-04-21 0:10 UTC (permalink / raw)
To: Andrei Banu, Linux RAID

Forgot to CC the list. Sorry for the dup Andrei.

On 4/19/2013 5:58 PM, Andrei Banu wrote:

> I come to you with a difficult problem. We have a server otherwise
> snappy fitted with mdraid-1 made of Samsung 840 PRO SSDs. If we copy a
> larger file to the server (from the same server, from net doesn't
> matter) the server load will increase from roughly 0.7 to over 100 (for
> several GB files). Apparently the reason is that the raid can't write well.
...
> 547682517 bytes (548 MB) copied, 7.99664 s, 68.5 MB/s
> 547682517 bytes (548 MB) copied, 52.1958 s, 10.5 MB/s
> 547682517 bytes (548 MB) copied, 75.3476 s, 7.3 MB/s
> 1073741824 bytes (1.1 GB) copied, 61.8796 s, 17.4 MB/s
> Timing buffered disk reads: 654 MB in 3.01 seconds = 217.55 MB/sec
> Timing buffered disk reads: 272 MB in 3.01 seconds = 90.44 MB/sec
> Timing O_DIRECT disk reads: 788 MB in 3.00 seconds = 262.23 MB/sec
> Timing O_DIRECT disk reads: 554 MB in 3.00 seconds = 184.53 MB/sec
...

Obviously this is frustrating, but the fix should be pretty easy.

> O/S: CentOS 6.4 / 64 bit (2.6.32-358.2.1.el6.x86_64)

I'd guess your problem is the following regression. I don't believe
this regression is fixed in Red Hat 2.6.32-* kernels:

http://www.archivum.info/linux-ide@vger.kernel.org/2010-02/00243/bad-performance-with-SSD-since-kernel-version-2.6.32.html

After I discovered this regression and recommended that Adam Goryachev
upgrade from Debian 2.6.32 to 3.2.x, his SSD RAID5 throughput increased
by a factor of 5, though much of this was due to testing methods. His
raw SSD throughput more than doubled per drive. The thread detailing
this is long but is a good read:

http://marc.info/?l=linux-raid&m=136098921212920&w=2

--
Stan
[parent not found: <51732E2B.6090607@hardwarefreak.com>]
* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
[not found] ` <51732E2B.6090607@hardwarefreak.com>
@ 2013-04-21 20:46 ` Andrei Banu
2013-04-21 23:17 ` Stan Hoeppner
0 siblings, 1 reply; 38+ messages in thread
From: Andrei Banu @ 2013-04-21 20:46 UTC (permalink / raw)
To: linux-raid

Hello,

At this point I probably should state that I am not an experienced
sysadmin. I do have a server management company, but they said they
don't know what to do, so now I am trying to fix things myself, and I
am something of a noob. I normally try to keep my actions to cautious
config changes and testing. I have never done a kernel update. Is
there an easy way to do this?

Regarding your second piece of advice (to purchase a decent HBA), I
have already thought about it, but I guess it comes with its own
drivers that need to be compiled into initramfs etc. So I am trying to
replace the baseboard with one with SATA3 support to avoid any
configuration changes (the old board has the C202 chipset and the new
one has C204, so I guess this replacement is as simple as it gets: just
remove the old board and plug in the new one without any software
changes or recompiles).

Again I need to say this server is in production and I can't move the
data or the users. I can have a few hours of downtime during the night
but that's about all.

Regarding the kernel upgrade, do we need to compile one from source or
is there an easier way?

Thanks!
* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
From: Stan Hoeppner @ 2013-04-21 23:17 UTC
To: Andrei Banu; +Cc: linux-raid

On 4/21/2013 3:46 PM, Andrei Banu wrote:
> Hello,
>
> At this point I probably should state that I am not an experienced
> sysadmin.

Things are becoming more clear now.

> Knowing this, I do have a server management company but they
> said they don't know what to do

So you own this hardware and it is colocated, correct?

> so now I am trying to fix things myself
> but I am something of a noob. I normally try to keep my actions to
> cautious config changes and testing.

Why did you choose CentOS? Was this installed by the company?

> I have never done a kernel update.
> Any easy way to do this?

It may not be necessary, at least to solve any SSD performance problems
anyway. Reexamining your numbers shows you hit 262MB/s on /dev/sda.
That's 65% of SATA2 interface bandwidth, so this kernel probably does
have the patch. Your problem lies elsewhere.

> Regarding your second advice (to purchase a decent HBA) I have already
> thought about it but I guess it comes with it's own drivers that need to
> be compiled into initramfs etc.

The default CentOS (RHEL) initramfs should include mptsas, which
supports all the LSI HBAs. The LSI caching RAID cards are supported as
well with megaraid_sas.

The question is, do you really need more than the ~260MB/s of peak
throughput you currently have? And is it worth the hassle?

> So I am trying to replace the baseboard
> with one with SATA3 support to avoid any configuration changes (the old
> board has the C202 chipset and the new one has C204 so I guess this
> replacement is as simple as it gets - just remove the old board and plug
> the new one without any software changes or recompiles). Again I need to
> say this server is in production and I can't move the data or the users.
> I can have a few hours downtime during the night but that's about all.

It's not clear your problem is hardware bandwidth. In fact it seems the
problem lies elsewhere. It may simply be that you're running these tests
while other substantial IO is occurring. Actually, your numbers show
this is exactly the case. What they don't show is how much other IO is
hitting the SSDs while you're running your tests.

> Regarding the kernel upgrade, do we need to compile one from source or
> there's an easier way?

I don't believe at this point you need a new kernel to fix the problem
you have. If this patch were not present you would not be able to get
260MB/s from SATA2. Your problem lies elsewhere.

In the future, instead of making a post saying "md is slow, my SSDs are
slow" and pasting test data which appears to back that claim, you'd be
better served by describing a general problem, such as "users say the
system is slow and I think it may be md or SSD related". This way we
don't waste time following a troubleshooting path based on incorrect
assumptions, as we've done here. Or at least as I've done here, as I'm
the only one assisting.

Boot all users off the system and shut down any daemons that may
generate any meaningful load on the disks or CPUs. Disable any
encryption or compression. Then rerun your tests while completely idle.
Then we'll go from there.

--
Stan
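The bandwidth reasoning in this message can be checked with a few lines
of arithmetic. What follows is only a minimal sketch, assuming SATA2
signals at 3 Gbit/s with 8b/10b encoding (roughly 300 MB/s of payload)
and treating the MB reported by hdparm as 10^6 bytes; the function name
`utilisation` is made up for illustration. Under those assumptions
262 MB/s works out nearer 87% of the link than 65%, which does not
change the conclusion that the interface is not being throttled.

```python
# Assumption: SATA2 line rate is 3 Gbit/s and 8b/10b encoding means only
# 8 of every 10 line bits carry payload.
LINE_RATE_BITS_PER_S = 3_000_000_000
PAYLOAD_MB_PER_S = LINE_RATE_BITS_PER_S * 8 / 10 / 8 / 1_000_000


def utilisation(measured_mb_per_s, ceiling_mb_per_s=PAYLOAD_MB_PER_S):
    """Return the fraction of the assumed link ceiling that was reached."""
    return measured_mb_per_s / ceiling_mb_per_s


# hdparm --direct figures quoted earlier in the thread.
for device, mb_per_s in [("/dev/sda", 262.23), ("/dev/sdb", 184.53)]:
    print(f"{device}: {mb_per_s:.2f} MB/s = {utilisation(mb_per_s):.0%} of "
          f"{PAYLOAD_MB_PER_S:.0f} MB/s")
```

Protocol overhead lowers the real ceiling somewhat further, so these
percentages are best read as lower bounds on how close each drive came
to saturating the link.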
* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
From: Andrei Banu @ 2013-04-22 10:19 UTC
To: linux-raid

Hello!

First off, allow me to apologize if my rambling sent you in the wrong
direction, and thank you for assisting. Most of the data I supplied was
background information. Let me start fresh, but first allow me to
answer your explicit questions:

1. Yes, I own the hardware and it's colocated in a datacenter.

2. I am quite happy with 260MB/s read for SATA2. I think that's decent
and I never meant it as a problem.

3. I have run iostat -x -m 2 for a few minutes and from what I see the
normal write rate is about 0-500KB/s, sometimes it gets to 1-2MB/s and
rarely between 3 and 4MB/s.

4. I will redo the test during off-peak hours when I can afford to shut
down various services.

The actual problem is that when I write any larger file (hundreds of MB
or more) to the server, from the network or from the same server, the
server starts to overload. The load can go over 100 for files of ~5GB.
I mean this server has an average load of 0.52 (sar -q) but it can
spike to 3 digit loads in a few minutes from making or downloading a
larger cPanel backup file. I have to rely only on R1Soft for backups
right now because the normal cPanel backups make the server unstable
when they back up accounts over 1GB (of which there are many).

So I concluded this is due to very low write speeds, and I ran the 'dd'
tests to evaluate this assumption.

You know, I don't think the problem is that I ran these tests during
other I/O intensive tasks. It's as if, after a certain number of
megabytes written at a time, the SSD devices themselves overload. I
mean during off-peak hours I can sometimes get a decent write speed
(like 60-100MB/s), but if I redo the test soon after (tens of seconds
to minutes) I get much lower write speeds (under 10MB/s). Or maybe the
write speed itself is not the problem but the fact that when I write a
large file the server seems to stop doing anything else.

So... the speed test results are poor AND the server overloads. A lot!
I mean most write results are in the 10-20MB/s range. I have seen more
than 25MB/s very rarely and I was almost never able to reproduce it
within the same hour. If I do a 'dd' test with a 'bs' of 2-4MB I
sometimes get good results (40-60MB/s) but never with a 'bs' of 1GB
(the top speed I got with 1G 'bs' was 27MB/s during the night). But the
essential problem is that this server can't copy large files without
seriously overloading itself.

Now let me elaborate on why I have given the read speeds (as I am not
unhappy with them):

1. Some said the low write speed might be due to a bad cable. So I
stated the 260MB/s read speed to show it's probably not a bad cable. If
it's capable of pushing 260MB/s, it's probably not a bad cable.

2. I have observed a very big difference between /dev/sda and /dev/sdb
and I thought it might be indicative of a problem somewhere. If I run
hdparm -t /dev/sda I get about 215MB/s but on /dev/sdb I get about
80-90MB/s. Only if I add the --direct flag do I get 260MB/s for
/dev/sda. Previously when I added --direct for /dev/sdb I was getting
about 180MB/s but now I get ~85MB/s with or without --direct.

root [/]# hdparm -t /dev/sdb
Timing buffered disk reads: 262 MB in 3.01 seconds = 86.92 MB/sec
root [/]# hdparm --direct -t /dev/sdb
Timing O_DIRECT disk reads: 264 MB in 3.08 seconds = 85.74 MB/sec

This is something new. /dev/sdb no longer gets to nearly 200MB/s (with
--direct) but stays under 100MB/s in all cases. Maybe indeed it's a
problem with the cable or with the device itself.
And an update 30 minutes later: /dev/sdb returned to 90MB/s read speed
WITHOUT --direct and 180MB/s WITH --direct. /dev/sda is constant (215
without --direct and 260 with --direct). What do you make of this?

Kind regards!
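The run-to-run variation described in this message is easier to judge
when the dd summary lines are reduced to numbers. The following is a
minimal sketch, not anything that was run in this thread: it recomputes
throughput from the byte count and elapsed time of the four dd summary
lines quoted earlier in the thread. The helper name `dd_rate_mb_per_s`
is hypothetical, and the regular expression assumes the GNU dd summary
format shown in those lines.

```python
import re

# Matches e.g. "547682517 bytes (548 MB) copied, 7.99664 s, 68.5 MB/s"
DD_SUMMARY = re.compile(r"(\d+) bytes .*? copied, ([\d.]+) s")


def dd_rate_mb_per_s(summary_line):
    """Recompute throughput (MB/s, 10^6 bytes) from one dd summary line."""
    match = DD_SUMMARY.search(summary_line)
    if match is None:
        return None
    return int(match.group(1)) / float(match.group(2)) / 1_000_000


# Summary lines copied from the first message of the thread.
runs = [
    "547682517 bytes (548 MB) copied, 7.99664 s, 68.5 MB/s",
    "547682517 bytes (548 MB) copied, 52.1958 s, 10.5 MB/s",
    "547682517 bytes (548 MB) copied, 75.3476 s, 7.3 MB/s",
    "1073741824 bytes (1.1 GB) copied, 61.8796 s, 17.4 MB/s",
]
rates = [dd_rate_mb_per_s(line) for line in runs]
spread = max(rates) / min(rates)

for line, rate in zip(runs, rates):
    print(f"{rate:6.1f} MB/s  <-  {line}")
print(f"fastest run is {spread:.1f}x the slowest")
```

A spread this wide between back-to-back runs of the same command points
away from a fixed bandwidth limit and toward something that changes
between runs.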
* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
From: Stan Hoeppner @ 2013-04-23 2:51 UTC
To: Andrei Banu; +Cc: linux-raid

On 4/22/2013 5:19 AM, Andrei Banu wrote:
> Hello!
>
> First off allow me to apologize if my rumbling sent you in a wrong
> direction and thank you for assisting.

No harm done, and you're welcome.

> The actual problem is that when I write any larger file hundreds of MB
> or more to the server (from network or from the same server) the server
> starts to overload. The server can overload to over 100 for files of ~
> 5GB. I mean this server has an average load of 0.52 (sar -q) but it can
> spike to 3 digit server loads in a few minutes from making or
> downloading a larger cPanel backup file. I have to rely only on R1Soft
> for backups right now because the normal cPanel backups make the server
> unstable when it backs up accounts over 1GB (many).

Describing this problem in terms of load average isn't very helpful.
What would help is 'perf top -U' output so we can see what is eating
CPU, captured simultaneously with 'iotop' so we can see what's eating
IO.

> So I concluded this is due to very low write speeds so I ran the 'dd'

It's most likely that the low disk throughput is a symptom of the
problem, which is lurking elsewhere awaiting discovery.

> 1. Some said the low write speed might be due to a bad cable.

Very unlikely, but possible. This is easy to verify. Does dmesg show
hundreds of "hard resetting link" messages?

> 2. I have observed a very big difference between /dev/sda and /dev/sdb
> and I thought it might me indicative of a problem somewhere. If I run
> hdparm -t /dev/sda I get about 215MB/s but on /dev/sdb I get about
> 80-90MB/s. Only if I add --direct flag I get 260MB/s for /dev/sda.
> Previously when I added --direct for /dev/sdb I was getting about
> 180MB/s but now I get ~85MB/s with or without --direct.

I simply chalked up the difference to IO load variance between test
runs of hdparm. If one SSD is always that much slower there may be a
problem with the drive or controller, but it's not likely. If you
haven't already, swap the cable on the slow drive with a new one. In
fact, SATA cables are cheap as dirt so I'd swap them both just for
peace of mind.

> root [/]# hdparm -t /dev/sdb
> Timing buffered disk reads: 262 MB in 3.01 seconds = 86.92 MB/sec
>
> root [/]# hdparm --direct -t /dev/sdb
> Timing O_DIRECT disk reads: 264 MB in 3.08 seconds = 85.74 MB/sec
...
> This is something new. /dev/sdb no longer gets to nearly 200MB/s (with
> --direct) but stays under 100MB/s in all cases. Maybe indeed it's a
> problem with the cable or with the device itself.
...
> And a 30 minutes later update: /dev/sdb returned to 90MB/s read speed
> WITHOUT --direct and 180MB/s WITH --direct. /dev/sda is constant (215
> without --direct and 260 with --direct). What do you make of this?

Show your partition tables again. My gut instinct tells me you have a
swap partition on /dev/sdb, and/or some other partition that is not
part of the RAID1, nor equally present on /dev/sda, that is being
accessed heavily at some times and not others, thus the throughput
discrepancy. If this is the case, and the kernel is low on RAM due to
an application memory leak or just normal process load, that swap
partition may become critical. When you start the $big_file copy, the
kernel goes into overdrive swapping and/or dropping cache to make room
for $big_file in the write buffers. This could explain both your triple
digit system load and the decreased throughput on /dev/sdb.

The fdisk output you provided previously showed only 3 partitions per
SSD, all RAID autodetect, all in md/RAID1 I assume. However, the
symptoms you're reporting tend to suggest the partition layout I just
described, which could be responsible for the odd up/down throughput on
sdb.

--
Stan
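Both checks suggested in this message (link resets in dmesg, and swap
that lives outside the RAID1) can be scripted. The sketch below is
illustrative only: the helper names `count_link_resets` and
`active_swap` are hypothetical, and the sample dmesg and /proc/swaps
text is invented to show the expected shape of the input, not taken
from the server discussed here.

```python
import re
from collections import Counter

# libata logs link resets as "<port>: hard resetting link".
RESET_RE = re.compile(r"\b(ata\d+(?:\.\d+)?): hard resetting link")


def count_link_resets(dmesg_text):
    """Count 'hard resetting link' messages per ATA port."""
    return Counter(m.group(1) for m in RESET_RE.finditer(dmesg_text))


def active_swap(proc_swaps_text):
    """Return (device, type, size_kb, used_kb) for each active swap area."""
    rows = []
    for line in proc_swaps_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 4:
            rows.append((fields[0], fields[1], int(fields[2]), int(fields[3])))
    return rows


# Invented sample input, for illustration only.
SAMPLE_DMESG = """\
[  812.101] ata2: hard resetting link
[  812.420] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[  901.007] ata2: hard resetting link
"""
SAMPLE_SWAPS = """\
Filename        Type        Size     Used   Priority
/dev/md1        partition   2097144  51200  -1
"""

print(dict(count_link_resets(SAMPLE_DMESG)))
print(active_swap(SAMPLE_SWAPS))
```

On a live system the same functions could be fed the output of `dmesg`
and the contents of `/proc/swaps`. A swap device that is a bare
partition such as /dev/sdb2 rather than an md device would match the
layout hypothesised above.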
* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
From: Andrei Banu @ 2013-04-23 10:17 UTC
Cc: linux-raid

Hi,

I am sorry for the very long email. And thanks a lot for all your
patience.

1. DMESG doesn't show any "hard resetting link" at all.

2. The SSDs are connected to ATA 0 and ATA1. The server is brand new (or
at least it should be).

3. Partition table:

root [~]# cat /etc/fstab
# Created by anaconda on Wed Apr 3 17:22:52 2013
UUID=8fedde2c-f5b7-4edf-975f-d8d087d79ebf / ext4 noatime,usrjquota=quota.user,jqfmt=vfsv0 1 1
UUID=bfc50d02-6d4d-4510-93ea-27941cd49cf4 /boot ext4 noatime,defaults 1 2
UUID=cef1d19d-2578-43db-9ffc-b6b70e227bfa swap swap defaults 0 0
tmpfs /dev/shm tmpfs defaults 0 0
devpts /dev/pts devpts gid=5,mode=620 0 0
sysfs /sys sysfs defaults 0 0
proc /proc proc defaults 0 0
/usr/tmpDSK /tmp ext3 noatime,defaults,noauto 0 0

root [~]# cat /etc/mdadm.conf
# mdadm.conf written out by anaconda
MAILADDR root
AUTO +imsm +1.x -all
ARRAY /dev/md0 level=raid1 num-devices=2 UUID=8a4b7005:a4f71a13:7d4659cf:104f9a4f
ARRAY /dev/md1 level=raid1 num-devices=2 UUID=ead5b5ca:9f5397a2:3b488cbe:11eb8bdb
ARRAY /dev/md2 level=raid1 num-devices=2 UUID=44efd14d:8bcd26d4:4d1fda9f:a4b5fe14

root [/]# mount
/dev/md2 on / type ext4 (rw,noatime,usrjquota=quota.user,jqfmt=vfsv0)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw,rootcontext="system_u:object_r:tmpfs_t:s0")
/dev/md0 on /boot type ext4 (rw,noatime)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
/usr/tmpDSK on /tmp type ext3 (rw,noexec,nosuid,loop=/dev/loop0)
/tmp on /var/tmp type none (rw,noexec,nosuid,bind)

And now the tests you indicated:

4.
root [/]# echo 3 > /proc/sys/vm/drop_caches
root [~]# time cp largefile.tar.gz test03.tmp; time sync;

(this is probably when the file is read into some swap/cache)
real 0m3.052s
user 0m0.010s
sys 0m0.612s

(this is probably when the file is actually written)
real 1m2.570s
user 0m0.000s
sys 0m0.011s

root [/]# echo 3 > /proc/sys/vm/drop_caches
root [~]# time cp largefile.tar.gz test04.tmp;

real 0m3.848s
user 0m0.004s
sys 0m0.634s

After about 15 seconds the server load started to increase from 1,
spiked to 40 in about a minute and then it started decreasing.

5. The perf top -U output during a dd copy:

Samples: 2M of event 'cycles', Event count (approx.): 19505138470
9.10% [kernel] [k] page_fault
5.56% [kernel] [k] clear_page_c_e
3.29% [kernel] [k] list_del
2.51% [kernel] [k] unmap_vmas
2.50% [kernel] [k] __mem_cgroup_commit_charge
2.50% [kernel] [k] mem_cgroup_update_file_mapped
2.26% [kernel] [k] port_inb
1.89% [kernel] [k] shmem_getpage_gfp
1.78% [kernel] [k] _spin_lock
1.72% [kernel] [k] __alloc_pages_nodemask
1.67% [kernel] [k] __mem_cgroup_uncharge_common
1.61% [kernel] [k] free_pcppages_bulk
1.59% [kernel] [k] get_page_from_freelist
1.56% [kernel] [k] alloc_pages_vma
1.37% [kernel] [k] get_page
1.26% [kernel] [k] release_pages
1.22% [kernel] [k] radix_tree_lookup_slot
1.19% [kernel] [k] lookup_page_cgroup
1.11% [kernel] [k] handle_mm_fault
0.98% [kernel] [k] __wake_up_bit
0.98% [kernel] [k] copy_page_c
0.97% [kernel] [k] __d_lookup
0.94% [kernel] [k] __do_fault
0.92% [kernel] [k] free_hot_cold_page
0.80% [kernel] [k] find_vma

6.
iotop is very dynamic and I am afraid the data I am providing will be
unclear, but let me give a number of snapshots taken during the large
file copy and maybe you can make something of them (samples a few
seconds apart):

Total DISK READ: 15.39 K/s | Total DISK WRITE: 169.29 M/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
4236 be/4 nobody 0.00 B/s 0.00 B/s 0.00 % 0.00 % [httpd]
4662 be/4 nobody 0.00 B/s 0.00 B/s 0.00 % 0.00 % [httpd]
31126 be/4 mysql 0.00 B/s 46.17 K/s 0.00 % 0.00 % mysqld --basedir=/ --datadir=/var/lib/mysql --user=mysql --log-error=/var/lib/mysql/server.err --open-files-limit=50000 --pid-file=/var/$
4971 be/4 nobody 0.00 B/s 23.08 K/s 0.00 % 0.00 % [httpd]
5284 be/4 nobody 0.00 B/s 7.69 K/s 0.00 % 0.00 % [httpd]
9522 be/4 user 7.69 K/s 38.47 K/s 0.00 % 0.00 % spamd child
5547 be/4 nobody 0.00 B/s 7.69 K/s 0.00 % 0.00 % [httpd]
!!!!!! 6085 be/4 root 7.69 K/s 1004.85 M/s 0.00 % 0.00 % dd if=largefile.tar.gz of=test10 oflag=sync bs=1G

Total DISK READ: 7.71 K/s | Total DISK WRITE: 29.91 M/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
506 be/4 root 0.00 B/s 0.00 B/s 0.00 % 99.99 % [md2_raid1]
30861 be/4 root 0.00 B/s 7.71 K/s 0.00 % 0.00 % httpd -k start -DSSL
31346 be/4 root 0.00 B/s 7.71 K/s 0.00 % 0.00 % tailwatchd
1457 be/3 root 0.00 B/s 7.71 K/s 0.00 % 0.00 % auditd
5914 be/4 root 7.71 K/s 0.00 B/s 0.00 % 0.00 % cpanellogd - scanning logs
6085 be/4 root 0.00 B/s 7.71 K/s 0.00 % 0.00 % dd if=largefile.tar.gz of=test10 oflag=sync bs=1G

Total DISK READ: 0.00 B/s | Total DISK WRITE: 29.30 M/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
9522 be/4 user 0.00 B/s 0.00 B/s 0.00 % 99.99 % spamd child
506 be/4 root 0.00 B/s 0.00 B/s 0.00 % 99.99 % [md2_raid1]
31346 be/4 root 0.00 B/s 7.73 K/s 0.00 % 0.00 % tailwatchd
1397 be/4 root 0.00 B/s 7.73 K/s 0.00 % 0.00 % [flush-9:2]
6085 be/4 root 0.00 B/s 15.45 K/s 0.00 % 0.00 % dd if=largefile.tar.gz of=test10 oflag=sync bs=1G

Total DISK READ: 12.43 K/s | Total DISK WRITE: 5.96 M/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
5914 be/4 root 0.00 B/s 0.00 B/s 0.00 % 99.99 % cpanellogd - setting up logs for promusic
6101 be/4 mailnull 0.00 B/s 353.61 B/s 0.00 % 99.99 % exim -bd -q1h
6107 be/4 user 0.00 B/s 0.00 B/s 0.00 % 99.99 % pop3
6124 be/4 nobody 0.00 B/s 353.61 B/s 0.00 % 99.99 % httpd -k start -DSSL
9522 be/4 user 1060.83 B/s 184.06 K/s 0.00 % 99.99 % spamd child
1669 be/4 root 0.00 B/s 2.42 K/s 0.00 % 99.99 % rsyslogd -i /var/run/syslogd.pid -c 5
1235 be/4 root 0.00 B/s 2.42 K/s 0.00 % 98.28 % [kjournald]
506 be/4 root 0.00 B/s 0.00 B/s 0.00 % 28.46 % [md2_raid1]
541 be/3 root 0.00 B/s 34.04 M/s 0.00 % 3.43 % [jbd2/md2-8]

Total DISK READ: 303.21 K/s | Total DISK WRITE: 60.64 M/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
1235 be/4 root 0.00 B/s 60.64 K/s 0.00 % 99.99 % [kjournald]
541 be/3 root 0.00 B/s 0.00 B/s 0.00 % 96.16 % [jbd2/md2-8]
1232 be/0 root 0.00 B/s 0.00 B/s 0.00 % 81.07 % [loop0]
11449 be/4 mysql 250.15 K/s 0.00 B/s 0.00 % 12.84 % mysqld --basedir=/ --datadir=/var/lib/mysql --user=mysql --log-error=/var/lib/mysql/server.err --open-files-limit=50000 --pid-file=/var/$
6085 be/4 root 7.58 K/s 30.32 K/s 0.00 % 5.24 % dd if=largefile.tar.gz of=test10 oflag=sync bs=1G

Total DISK READ: 2023.83 K/s | Total DISK WRITE: 82.31 M/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
6085 be/4 root 0.00 B/s 38.04 K/s 0.00 % 99.99 % dd if=largefile.tar.gz of=test10 oflag=sync bs=1G
6267 be/4 user 0.00 B/s 0.00 B/s 0.00 % 99.99 % pop3
6291 be/4 user 0.00 B/s 0.00 B/s 0.00 % 99.99 % pop3
541 be/3 root 0.00 B/s 492.43 M/s 0.00 % 99.99 % [jbd2/md2-8]
6282 be/4 nobody 730.40 K/s 0.00 B/s 0.00 % 99.99 % httpd -k start -DSSL
506 be/4 root 0.00 B/s 0.00 B/s 0.00 % 52.39 % [md2_raid1]

Total DISK READ: 74.61 K/s | Total DISK WRITE: 8.66 M/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
6282 be/4 nobody 26.55 K/s 0.00 B/s 0.00 % 97.65 % httpd -k start -DSSL
541 be/3 root 0.00 B/s 7.04 M/s 0.00 % 95.64 % [jbd2/md2-8]
1235 be/4 root 0.00 B/s 0.00 B/s 0.00 % 94.07 % [kjournald]
1394 be/4 root 0.00 B/s 0.00 B/s 0.00 % 89.26 % [flush-7:0]
506 be/4 root 0.00 B/s 0.00 B/s 0.00 % 31.66 % [md2_raid1]

Total DISK READ: 544.44 K/s | Total DISK WRITE: 82.08 M/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
1235 be/4 root 0.00 B/s 129.31 K/s 0.00 % 99.99 % [kjournald]
541 be/3 root 0.00 B/s 63.57 M/s 0.00 % 99.99 % [jbd2/md2-8]
31119 be/4 mysql 0.00 B/s 61.25 K/s 0.00 % 88.49 % mysqld --basedir=/ --datadir=/var/lib/mysql --user=mysql --log-error=/var/lib/mysql/server.err --open-files-limit=50000 --pid-file=/var/$
506 be/4 root 0.00 B/s 0.00 B/s 0.00 % 72.41 % [md2_raid1]
31346 be/4 root 0.00 B/s 20.42 K/s 0.00 % 69.36 % tailwatchd
1232 be/0 root 0.00 B/s 183.75 K/s 0.00 % 54.04 % [loop0]
6085 be/4 root 3.40 K/s 40.83 K/s 0.00 % 26.49 % dd if=largefile.tar.gz of=test10 oflag=sync bs=1G
11561 be/4 mysql 0.00 B/s 45.64 M/s 0.00 % 0.00 % mysqld --basedir=/ --datadir=/var/lib/mysql --user=mysql --log-error=/var/lib/mysql/server.err --open-files-limit=50000 --pid-file=/var/$

I have also run it with the "-a" flag and there is something
interesting (long, though heavily grepped, output below). This is taken
during the 'dd oflag=sync' copy. It seems it does something right at
the beginning (writes about 250MB of that file), then it mostly idles
through to the end:

Total DISK READ: 333.35 K/s | Total DISK WRITE: 38.76 M/s
PID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
541 be/3 root 0.00 B 332.00 K 0.00 % 0.49 % [jbd2/md2-8]
13467 be/4 root 0.00 B 4.00 K 0.00 % 0.00 % python /usr/bin/iotop -baoP -d 1
13479 be/4 root 4.00 K 250.12 M 0.00 % 0.00 % dd if=largefile.tar.gz of=test11 oflag=sync bs=1G

Total DISK READ: 4.84 M/s | Total DISK WRITE: 11.77 K/s
PID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
541 be/3 root 0.00 B 332.00 K 0.00 % 0.37 % [jbd2/md2-8]
13467 be/4 root 0.00 B 4.00 K 0.00 % 0.00 % python /usr/bin/iotop -baoP -d 1
13479 be/4 root 4.00 K 250.12 M 0.00 % 0.00 % dd if=largefile.tar.gz of=test11 oflag=sync bs=1G

Total DISK READ: 0.00 B/s | Total DISK WRITE: 379.93 K/s
PID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
541 be/3 root 0.00 B 332.00 K 0.00 % 0.30 % [jbd2/md2-8]
13467 be/4 root 0.00 B 8.00 K 0.00 % 0.00 % python /usr/bin/iotop -baoP -d 1
13479 be/4 root 4.00 K 250.12 M 0.00 % 0.00 % dd if=largefile.tar.gz of=test11 oflag=sync bs=1G
1232 be/0 root 0.00 B 244.00 K 0.00 % 0.00 % [loop0]
1397 be/4 root 0.00 B 24.00 K 0.00 % 0.00 % [flush-9:2]

Total DISK READ: 0.00 B/s | Total DISK WRITE: 69.69 M/s
PID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
13479 be/4 root 4.00 K 250.16 M 0.00 % 79.98 % dd if=largefile.tar.gz of=test11 oflag=sync bs=1G
541 be/3 root 0.00 B 458.64 M 0.00 % 0.25 % [jbd2/md2-8]
13467 be/4 root 0.00 B 8.00 K 0.00 % 0.00 % python /usr/bin/iotop -baoP -d 1
1232 be/0 root 0.00 B 244.00 K 0.00 % 0.00 % [loop0]
1397 be/4 root 0.00 B 24.00 K 0.00 % 0.00 % [flush-9:2]

Total DISK READ: 20.81 K/s | Total DISK WRITE: 6.07 M/s
PID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
541 be/3 root 0.00 B 765.19 M 0.00 % 83.17 % [jbd2/md2-8]
1235 be/4 root 0.00 B 0.00 B 0.00 % 78.06 % [kjournald]
13479 be/4 root 8.00 K 250.24 M 0.00 % 60.66 % dd if=largefile.tar.gz of=test11 oflag=sync bs=1G
506 be/4 root 0.00 B 0.00 B 0.00 % 35.01 % [md2_raid1]
1394 be/4 root 0.00 B 0.00 B 0.00 % 11.25 % [flush-7:0]

Total DISK READ: 43.28 K/s | Total DISK WRITE: 34.09 M/s
PID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
541 be/3 root 0.00 B 767.47 M 0.00 % 84.84 % [jbd2/md2-8]
1235 be/4 root 0.00 B 28.00 K 0.00 % 70.65 % [kjournald]
13479 be/4 root 12.00 K 250.29 M 0.00 % 65.12 % dd if=largefile.tar.gz of=test11 oflag=sync bs=1G
506 be/4 root 0.00 B 0.00 B 0.00 % 31.81 % [md2_raid1]
1232 be/0 root 0.00 B 1568.00 K 0.00 % 14.57 % [loop0]
1394 be/4 root 0.00 B 0.00 B 0.00 % 9.71 % [flush-7:0]
1397 be/4 root 0.00 B 3.44 M 0.00 % 1.47 % [flush-9:2]
13467 be/4 root 0.00 B 12.00 K 0.00 % 0.00 % python /usr/bin/iotop -baoP -d 1

Total DISK READ: 3.85 K/s | Total DISK WRITE: 35.28 M/s
PID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
541 be/3 root 0.00 B 768.32 M 0.00 % 84.36 % [jbd2/md2-8]
1235 be/4 root 0.00 B 28.00 K 0.00 % 83.53 % [kjournald]
13479 be/4 root 12.00 K 250.30 M 0.00 % 65.05 % dd if=largefile.tar.gz of=test11 oflag=sync bs=1G
506 be/4 root 0.00 B 0.00 B 0.00 % 32.55 % [md2_raid1]
1232 be/0 root 0.00 B 1568.00 K 0.00 % 14.21 % [loop0]
1394 be/4 root 0.00 B 0.00 B 0.00 % 9.46 % [flush-7:0]
1397 be/4 root 0.00 B 3.45 M 0.00 % 1.48 % [flush-9:2]

Total DISK READ: 3.91 K/s | Total DISK WRITE: 3.91 K/s
PID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
541 be/3 root 0.00 B 768.32 M 0.00 % 82.29 % [jbd2/md2-8]
1235 be/4 root 0.00 B 28.00 K 0.00 % 81.48 % [kjournald]
13479 be/4 root 12.00 K 250.30 M 0.00 % 63.37 % dd if=largefile.tar.gz of=test11 oflag=sync bs=1G
506 be/4 root 0.00 B 0.00 B 0.00 % 31.75 % [md2_raid1]
1232 be/0 root 0.00 B 1568.00 K 0.00 % 13.86 % [loop0]
1394 be/4 root 0.00 B 0.00 B 0.00 % 9.23 % [flush-7:0]
1397 be/4 root 0.00 B 3.45 M 0.00 % 1.44 % [flush-9:2]
13467 be/4 root 0.00 B 28.00 K 0.00 % 0.00 % python /usr/bin/iotop -baoP -d 1

Total DISK READ: 15.64 K/s | Total DISK WRITE: 15.32 M/s
PID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
541 be/3 root 0.00 B 768.71 M 0.00 % 85.51 % [jbd2/md2-8]
1235 be/4 root 0.00 B 28.00 K 0.00 % 79.53 % [kjournald]
13479 be/4 root 12.00 K 250.31 M 0.00 % 61.78 % dd if=largefile.tar.gz of=test11 oflag=sync bs=1G
506 be/4 root 0.00 B 0.00 B 0.00 % 36.15 % [md2_raid1]
1232 be/0 root 0.00 B 1568.00 K 0.00 % 13.53 % [loop0]
1394 be/4 root 0.00 B 0.00 B 0.00 % 9.01 % [flush-7:0]
1397 be/4 root 0.00 B 3.45 M 0.00 % 6.60 % [flush-9:2]
13467 be/4 root 0.00 B 32.00 K 0.00 % 0.00 % python /usr/bin/iotop -baoP -d 1

Total DISK READ: 0.00 B/s | Total DISK WRITE: 0.00 B/s
PID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
541 be/3 root 0.00 B 768.71 M 0.00 % 85.77 % [jbd2/md2-8]
1235 be/4 root 0.00 B 28.00 K 0.00 % 75.90 % [kjournald]
13479 be/4 root 12.00 K 250.31 M 0.00 % 58.82 % dd if=largefile.tar.gz of=test11 oflag=sync bs=1G
506 be/4 root 0.00 B 0.00 B 0.00 % 34.51 % [md2_raid1]
1232 be/0 root 0.00 B 1568.00 K 0.00 % 12.91 % [loop0]
1394 be/4 root 0.00 B 0.00 B 0.00 % 8.60 % [flush-7:0]
1397 be/4 root 0.00 B 3.45 M 0.00 % 6.30 % [flush-9:2]
31346 be/4 root 0.00 B 120.00 K 0.00 % 3.42 % tailwatchd
13467 be/4 root 0.00 B 44.00 K 0.00 % 0.00 % python /usr/bin/iotop -baoP -d 1

Total DISK READ: 19.56 K/s | Total DISK WRITE: 10.12 M/s
PID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
541 be/3 root 0.00 B 768.76 M 0.00 % 86.39 % [jbd2/md2-8]
1235 be/4 root 0.00 B 28.00 K 0.00 % 74.21 % [kjournald]
13479 be/4 root 12.00 K 250.31 M 0.00 % 64.36 % dd if=largefile.tar.gz of=test11 oflag=sync bs=1G
506 be/4 root 0.00 B 0.00 B 0.00 % 39.86 % [md2_raid1]
1232 be/0 root 0.00 B 1568.00 K 0.00 % 12.62 % [loop0]
1394 be/4 root 0.00 B 0.00 B 0.00 % 8.41 % [flush-7:0]
1397 be/4 root 0.00 B 3.46 M 0.00 % 6.16 % [flush-9:2]
13467 be/4 root 0.00 B 48.00 K 0.00 % 0.00 % python /usr/bin/iotop -baoP -d 1

Total DISK READ: 0.00 B/s | Total DISK WRITE: 15.65 K/s
PID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
541 be/3 root 0.00 B 768.76 M 0.00 % 87.13 % [jbd2/md2-8]
1235 be/4 root 0.00 B 28.00 K 0.00 % 72.58 % [kjournald]
13479 be/4 root 12.00 K 250.31 M 0.00 % 65.64 % dd if=largefile.tar.gz of=test11 oflag=sync bs=1G
506 be/4 root 0.00 B 0.00 B 0.00 % 38.98 % [md2_raid1]
1232 be/0 root 0.00 B 1568.00 K 0.00 % 12.34 % [loop0]
1394 be/4 root 0.00 B 0.00 B 0.00 % 8.22 % [flush-7:0]
1397 be/4 root 0.00 B 3.46 M 0.00 % 6.03 % [flush-9:2]
13467 be/4 root 0.00 B 52.00 K 0.00 % 0.00 % python /usr/bin/iotop -baoP -d 1

Total DISK READ: 46.71 K/s | Total DISK WRITE: 38.92 K/s
PID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
541 be/3 root 0.00 B 768.76 M 0.00 % 87.24 % [jbd2/md2-8]
1235 be/4 root 0.00 B 28.00 K 0.00 % 71.03 % [kjournald]
13479 be/4 root 12.00 K 250.31 M 0.00 % 66.24 % dd if=largefile.tar.gz of=test11 oflag=sync bs=1G
506 be/4 root 0.00 B 0.00 B 0.00 % 38.15 % [md2_raid1]
1232 be/0 root 0.00 B 1568.00 K 0.00 % 12.08 % [loop0]
1394 be/4 root 0.00 B 0.00 B 0.00 % 8.05 % [flush-7:0]
1397 be/4 root 0.00 B 3.46 M 0.00 % 5.90 % [flush-9:2]
13467 be/4 root 0.00 B 56.00 K 0.00 % 0.00 % python /usr/bin/iotop -baoP -d 1

Total DISK READ: 0.00 B/s | Total DISK WRITE: 0.00 B/s
PID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
541 be/3 root 0.00 B 768.76 M 0.00 % 87.63 % [jbd2/md2-8]
1235 be/4 root 0.00 B 28.00 K 0.00 % 69.54 % [kjournald]
13479 be/4 root 12.00 K 250.31 M 0.00 % 67.10 % dd if=largefile.tar.gz of=test11 oflag=sync bs=1G
506 be/4 root 0.00 B 0.00 B 0.00 % 42.88 % [md2_raid1]
1232 be/0 root 0.00 B 1568.00 K 0.00 % 11.83 % [loop0]
1394 be/4 root 0.00 B 0.00 B 0.00 % 7.88 % [flush-7:0]
1397 be/4 root 0.00 B 3.46 M 0.00 % 5.78 % [flush-9:2]
13467 be/4 root 0.00 B 60.00 K 0.00 % 0.00 % python /usr/bin/iotop -baoP -d 1

Total DISK READ: 7.82 K/s | Total DISK WRITE: 0.00 B/s
PID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
541 be/3 root 0.00 B 768.76 M 0.00 % 87.91 % [jbd2/md2-8]
1235 be/4 root 0.00 B 28.00 K 0.00 % 68.12 % [kjournald]
13479 be/4 root 12.00 K 250.31 M 0.00 % 67.83 % dd if=largefile.tar.gz
of=test11 oflag=sync bs=1G 506 be/4 root 0.00 B 0.00 B 0.00 % 42.01 % [md2_raid1] 1232 be/0 root 0.00 B 1568.00 K 0.00 % 11.59 % [loop0] 1394 be/4 root 0.00 B 0.00 B 0.00 % 7.72 % [flush-7:0] 1397 be/4 root 0.00 B 3.46 M 0.00 % 5.66 % [flush-9:2] 13467 be/4 root 0.00 B 68.00 K 0.00 % 0.00 % python /usr/bin/iotop -baoP -d 1 Total DISK READ: 0.00 B/s | Total DISK WRITE: 50.84 K/s PID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND 541 be/3 root 0.00 B 768.76 M 0.00 % 88.16 % [jbd2/md2-8] 1235 be/4 root 0.00 B 28.00 K 0.00 % 66.75 % [kjournald] 13479 be/4 root 12.00 K 250.31 M 0.00 % 68.51 % dd if=largefile.tar.gz of=test11 oflag=sync bs=1G 506 be/4 root 0.00 B 0.00 B 0.00 % 41.16 % [md2_raid1] 1232 be/0 root 0.00 B 1568.00 K 0.00 % 11.35 % [loop0] 1394 be/4 root 0.00 B 0.00 B 0.00 % 7.56 % [flush-7:0] 1397 be/4 root 0.00 B 3.46 M 0.00 % 5.54 % [flush-9:2] 13467 be/4 root 0.00 B 72.00 K 0.00 % 0.00 % python /usr/bin/iotop -baoP -d 1 Total DISK READ: 3.91 K/s | Total DISK WRITE: 93.83 K/s PID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND 541 be/3 root 0.00 B 768.76 M 0.00 % 88.33 % [jbd2/md2-8] 13479 be/4 root 12.00 K 250.31 M 0.00 % 69.09 % dd if=largefile.tar.gz of=test11 oflag=sync bs=1G 1235 be/4 root 0.00 B 28.00 K 0.00 % 65.44 % [kjournald] 506 be/4 root 0.00 B 0.00 B 0.00 % 40.35 % [md2_raid1] 1232 be/0 root 0.00 B 1568.00 K 0.00 % 11.13 % [loop0] 1394 be/4 root 0.00 B 0.00 B 0.00 % 7.41 % [flush-7:0] 1397 be/4 root 0.00 B 3.46 M 0.00 % 5.43 % [flush-9:2] 31346 be/4 root 0.00 B 120.00 K 0.00 % 2.95 % tailwatchd 13467 be/4 root 0.00 B 76.00 K 0.00 % 0.00 % python /usr/bin/iotop -baoP -d 1 Total DISK READ: 0.00 B/s | Total DISK WRITE: 0.00 B/s PID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND 541 be/3 root 0.00 B 768.76 M 0.00 % 88.53 % [jbd2/md2-8] 13479 be/4 root 12.00 K 250.31 M 0.00 % 69.69 % dd if=largefile.tar.gz of=test11 oflag=sync bs=1G 1235 be/4 root 0.00 B 28.00 K 0.00 % 64.18 % [kjournald] 506 be/4 root 0.00 B 0.00 B 0.00 % 39.57 % [md2_raid1] 
1232 be/0 root 0.00 B 1568.00 K 0.00 % 10.91 % [loop0] 1394 be/4 root 0.00 B 0.00 B 0.00 % 7.27 % [flush-7:0] 1397 be/4 root 0.00 B 3.46 M 0.00 % 5.33 % [flush-9:2] 13467 be/4 root 0.00 B 80.00 K 0.00 % 0.00 % python /usr/bin/iotop -baoP -d 1 Total DISK READ: 0.00 B/s | Total DISK WRITE: 15.64 K/s PID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND 541 be/3 root 0.00 B 768.76 M 0.00 % 88.20 % [jbd2/md2-8] 13479 be/4 root 12.00 K 250.31 M 0.00 % 69.72 % dd if=largefile.tar.gz of=test11 oflag=sync bs=1G 1235 be/4 root 0.00 B 28.00 K 0.00 % 62.96 % [kjournald] 506 be/4 root 0.00 B 0.00 B 0.00 % 38.82 % [md2_raid1] 1232 be/0 root 0.00 B 1568.00 K 0.00 % 10.71 % [loop0] 1394 be/4 root 0.00 B 0.00 B 0.00 % 7.13 % [flush-7:0] 1397 be/4 root 0.00 B 3.46 M 0.00 % 5.23 % [flush-9:2] 13467 be/4 root 0.00 B 84.00 K 0.00 % 0.00 % python /usr/bin/iotop -baoP -d 1 Total DISK READ: 0.00 B/s | Total DISK WRITE: 0.00 B/s PID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND 541 be/3 root 0.00 B 768.76 M 0.00 % 88.61 % [jbd2/md2-8] 13479 be/4 root 12.00 K 250.31 M 0.00 % 70.50 % dd if=largefile.tar.gz of=test11 oflag=sync bs=1G 1235 be/4 root 0.00 B 28.00 K 0.00 % 61.79 % [kjournald] 506 be/4 root 0.00 B 0.00 B 0.00 % 38.10 % [md2_raid1] 1232 be/0 root 0.00 B 1568.00 K 0.00 % 10.51 % [loop0] 1394 be/4 root 0.00 B 0.00 B 0.00 % 7.00 % [flush-7:0] 1397 be/4 root 0.00 B 3.46 M 0.00 % 5.13 % [flush-9:2] 13467 be/4 root 0.00 B 92.00 K 0.00 % 0.00 % python /usr/bin/iotop -baoP -d 1 Total DISK READ: 258.12 K/s | Total DISK WRITE: 86.04 K/s PID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND 541 be/3 root 0.00 B 768.76 M 0.00 % 89.19 % [jbd2/md2-8] 13479 be/4 root 12.00 K 250.31 M 0.00 % 71.45 % dd if=largefile.tar.gz of=test11 oflag=sync bs=1G 1235 be/4 root 0.00 B 28.00 K 0.00 % 60.66 % [kjournald] 506 be/4 root 0.00 B 0.00 B 0.00 % 37.40 % [md2_raid1] 1232 be/0 root 0.00 B 1568.00 K 0.00 % 10.32 % [loop0] 1394 be/4 root 0.00 B 0.00 B 0.00 % 6.87 % [flush-7:0] 1397 be/4 root 0.00 B 3.46 M 
0.00 % 5.04 % [flush-9:2] 13467 be/4 root 0.00 B 96.00 K 0.00 % 0.00 % python /usr/bin/iotop -baoP -d 1 I appologize for such a lengthy email! Kind regards! Andrei Banu ^ permalink raw reply [flat|nested] 38+ messages in thread
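The pattern described in this message (a burst of writes at the start, then long stretches with almost nothing reaching the disks) is easier to see when only the per-snapshot totals are kept. The following is a minimal sketch, not something posted in the thread: the function name `iotop_write_mbps` is made up for illustration, and it assumes the summary line format shown above ("Total DISK READ: ... | Total DISK WRITE: ..."), which may differ in other iotop versions.

```shell
# Read an iotop batch log on stdin and print one number per snapshot:
# the "Total DISK WRITE" rate converted to MB/s.
# Assumption: the summary line looks like the ones in the message above.
iotop_write_mbps() {
    awk '/Total DISK WRITE:/ {
        value = ""; unit = ""
        # Locate the number and unit that follow the "WRITE:" token.
        for (i = 1; i < NF; i++) {
            if ($i == "WRITE:") { value = $(i + 1); unit = $(i + 2) }
        }
        if (unit == "B/s")      value /= 1048576
        else if (unit == "K/s") value /= 1024
        else if (unit == "G/s") value *= 1024
        printf "%.2f\n", value
    }'
}
```

Assuming the batch output was saved to a file (the name iotop.log is hypothetical), `iotop_write_mbps < iotop.log` would print one MB/s figure per one-second sample.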
* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
2013-04-23 10:17 ` Andrei Banu
@ 2013-04-24 3:24 ` Stan Hoeppner
2013-04-24 8:26 ` Andrei Banu
0 siblings, 1 reply; 38+ messages in thread
From: Stan Hoeppner @ 2013-04-24 3:24 UTC (permalink / raw)
To: Andrei Banu

On 4/23/2013 5:17 AM, Andrei Banu wrote:

> I am sorry for the very long email. And thanks a lot for all your patience.

From now on simply provide what is asked for. That keeps the length manageable and the info relevant, and allows us to help you get to a solution more quickly without being bogged down.

> 1. DMESG doesn't show any "hard resetting link" at all.

Then it seems you don't have hardware problems.

> 2. The SSDs are connected to ATA 0 and ATA1. The server is brand new (or
> at least it should be).

Nor the Intel 6 Series SATA problem.

> 3. Partition table:

/etc/fstab contains mount points, not the partition table.

> root [~]# cat /etc/fstab
> UUID=cef1d19d-2578-43db-9ffc-b6b70e227bfa swap swap defaults 0 0

I can't discern from UUID where your swap partition is located. Is it a partition directly on an SSD or is it a partition atop md1?

> root [/]# echo 3 > /proc/sys/vm/drop_caches
> root [~]# time cp largefile.tar.gz test03.tmp; time sync;

You're slowing us down here. Please execute commands as instructed without modification. The above is wrong. You don't call time twice. If you're worried about sync execution being included in the time, use:

$ time (cp src.tmp src.temp; sync)

Though it makes little difference as Linux is pretty good about flushing the last few write buffers. But you missed the important part, the math for bandwidth determination:

548/real = xx MB/s

This is cp not dd. It's up to you to do the math. Using time allows you to do so. 548MB is my example using your previous file size in your tests. Modify accordingly if needed.

*Important note* The job of this list is to provide knowledge transfer, advice, and assistance. You must do the work, and you must learn along the way. We don't fix people's problems, as we don't have access to their computers. What we do is *enable* people to fix their problems themselves.

> After about 15 seconds the server load started to increase from 1,
> spiked to 40 in about a minute and then it started decreasing.

Please stop telling us this. Linux load average is irrelevant.

> 5. The perf top -U output during a dd copy:

This was supposed to be executed before and simultaneously with the cp operation above. Do you know how to use multiple terminal windows?

> 6. iotop

Again, this was supposed to be run with the cp command, exited toward the end of the cp operation, then copy/pasted.

> is very dynamic and I am afraid the data I am providing will be
> unclear but let me give a number of snapshots from during the large file
> copy and maybe you can make something of it (samples a few seconds apart):
> !!!!!! 6085 be/4 root 7.69 K/s 1004.85 M/s 0.00 % 0.00 % dd
> if=largefile.tar.gz of=test10 oflag=sync bs=1G

This is another example of why you don't use dd for IO testing, and especially with a block size of 1GB. dd buffers into RAM up to $block_size bytes before it begins flushing to disk. So what you're seeing here is that massive push at the beginning of the run. Your SSDs in RAID1 peak at ~265MB/s. iotop is showing 1GB/s, 4 times what the drives can do. This is obviously not real.

You can get away with oflag=sync using 1GB block size. But if you run dd the only way it can be run for realistic results, using bs=4096 which matches every filesystem block size including EXTx, XFS, and JFS, then using oflag=sync will degrade your performance, as an ack is required on each block. That's what sync does. With SSD it won't be nearly as dramatic as rust, where the difference in runtime is 100-200x slower due to rotational latency.

> I apologize for such a lengthy email!

Don't apologize, just don't send more information than needed, especially if you don't know it's relevant. ;) Send only what's requested, and as requested, please.

--
Stan
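The bandwidth math Stan asks for (file size in MB divided by the "real" time reported by time) can be sketched as two small shell helpers. This is only an illustrative sketch under stated assumptions: the function names `to_seconds` and `mbps` are invented here, and the input is assumed to be in the "1m33.923s" form that the shell's time builtin prints.

```shell
# Convert a "real" value as printed by time (e.g. 1m33.923s) to seconds.
# Hypothetical helper, not part of any tool mentioned in the thread.
to_seconds() {
    awk -v t="$1" 'BEGIN {
        n = split(t, part, "m")          # part[1] = minutes, part[2] = "SS.sss" + "s"
        sub(/s$/, "", part[n])           # drop the trailing "s"
        total = (n == 2 ? part[1] * 60 : 0) + part[n]
        printf "%.3f\n", total
    }'
}

# Throughput in MB/s: size in MB divided by elapsed seconds.
mbps() {
    awk -v mb="$1" -v secs="$2" 'BEGIN { printf "%.2f\n", mb / secs }'
}
```

With these defined, the figure for a timed copy would come from something like `mbps 548 "$(to_seconds 1m33.923s)"`, replacing 548 with the actual file size in MB and the time string with the "real" line from the run.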
* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
2013-04-24 3:24 ` Stan Hoeppner
@ 2013-04-24 8:26 ` Andrei Banu
2013-04-24 9:12 ` Adam Goryachev
2013-04-24 16:37 ` Stan Hoeppner
0 siblings, 2 replies; 38+ messages in thread
From: Andrei Banu @ 2013-04-24 8:26 UTC (permalink / raw)
To: linux-raid

Hello,

I am sorry for the irrelevant feedback. Where I misunderstood your request, I filled in the blanks (poorly).

1. SWAP

root [~]# blkid | grep cef1d19d-2578-43db-9ffc-b6b70e227bfa
/dev/md1: UUID="cef1d19d-2578-43db-9ffc-b6b70e227bfa" TYPE="swap"

So yes, swap is on md1. This *md1 has a size of 2GB*. Isn't this way too low for a system with 16GB of memory?

2. Let me try again to give you the right test results:

Before the bigfile copy:

root [~]# perf top -U
Samples: 768 of event 'cycles', Event count (approx.): 499088870
 18.58%  [kernel]     [k] port_inb
  6.21%  [kernel]     [k] page_fault
  3.36%  [kernel]     [k] clear_page_c_e
  2.82%  [kernel]     [k] kallsyms_expand_symbol
  1.99%  [kernel]     [k] __mem_cgroup_commit_charge
  1.84%  [kernel]     [k] shmem_getpage_gfp
  1.51%  [kernel]     [k] alloc_pages_vma
  1.51%  [kernel]     [k] __alloc_pages_nodemask
  1.46%  [kernel]     [k] avtab_search_node
  1.45%  [kernel]     [k] format_decode
  1.40%  [kernel]     [k] list_del
  1.36%  [kernel]     [k] get_page_from_freelist
  1.35%  [kernel]     [k] vsnprintf
  1.29%  [kernel]     [k] avc_has_perm_noaudit
  1.28%  [kernel]     [k] number
  1.22%  [kernel]     [k] free_pcppages_bulk
  1.21%  [kernel]     [k] ____pagevec_lru_add
  1.14%  [kernel]     [k] get_page
  1.08%  [kernel]     [k] memcpy
  1.07%  [kernel]     [k] mem_cgroup_update_file_mapped
  1.07%  [kernel]     [k] page_waitqueue
  0.98%  [kernel]     [k] __d_lookup
  0.97%  [kernel]     [k] unmap_vmas
  0.91%  [kernel]     [k] _spin_lock
  0.87%  [kernel]     [k] inode_has_perm
  0.81%  [kernel]     [k] string
  0.77%  [kernel]     [k] page_remove_rmap
  0.73%  [kernel]     [k] __audit_syscall_exit
  0.68%  [kernel]     [k] lookup_page_cgroup
  0.61%  [kernel]     [k] unlock_page
  0.61%  [kernel]     [k] shmem_find_get_pages_and_swap
  0.61%  [kernel]     [k] free_hot_cold_page
  0.61%  [kernel]     [k] release_pages
  0.56%  [kernel]     [k] mem_cgroup_lru_del_list
  0.55%  [kernel]     [k] strncpy_from_user
  0.54%  [kernel]     [k] module_get_kallsym
  0.52%  [kernel]     [k] find_get_page
  0.50%  [kernel]     [k] __do_fault
  0.48%  [kernel]     [k] path_put
  0.46%  [kernel]     [k] __list_add
  0.46%  [kernel]     [k] handle_mm_fault
  0.45%  [kernel]     [k] __wake_up_bit
  0.44%  [kernel]     [k] handle_pte_fault
  0.43%  [kernel]     [k] audit_syscall_entry
  0.43%  [kernel]     [k] thread_return
  0.42%  [kernel]     [k] path_init
  0.41%  [kernel]     [k] dput
  0.40%  [kernel]     [k] task_has_capability
  0.40%  [kernel]     [k] get_task_cred
  0.40%  [kernel]     [k] pointer
  0.40%  [kernel]     [k] _atomic_dec_and_lock
  0.39%  [kernel]     [k] __link_path_walk
  0.38%  [kernel]     [k] memset
  0.37%  [kernel]     [k] do_lookup
  0.34%  [kernel]     [k] radix_tree_lookup_slot
  0.34%  [kernel]     [k] down_read_trylock
  0.33%  [kernel]     [k] kmem_cache_alloc
  0.31%  [kernel]     [k] __set_page_dirty_no_writeback
  0.31%  [kernel]     [k] __inc_zone_state
  0.31%  [kernel]     [k] __mem_cgroup_uncharge_common

root [~]# iotop
Total DISK READ: 0.00 B/s | Total DISK WRITE: 2.33 M/s
  TID PRIO USER      DISK READ  DISK WRITE  SWAPIN     IO> COMMAND
  541 be/3 root       0.00 B/s    7.83 K/s  0.00 %  2.27 % [jbd2/md2-8]
 8568 be/4 root       0.00 B/s    7.83 K/s  0.00 %  0.00 % lfd - sleeping
 1457 be/3 root       0.00 B/s    7.83 K/s  0.00 %  0.00 % auditd
 1669 be/4 root       0.00 B/s    3.91 K/s  0.00 %  0.00 % rsyslogd -i /var/run/syslogd.pid -c 5
 1695 be/4 named      0.00 B/s    3.91 K/s  0.00 %  0.00 % named -u named
31391 be/4 mysql      0.00 B/s   23.48 K/s  0.00 %  0.00 % mysqld --basedir=/ --datadir=/var/lib/mysql --user=mysql --log-error=/var~r --open-files-limit=50000 --pid-file=/var/lib/mysql/server.pid
    1 be/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % init
    2 be/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [kthreadd]
    3 rt/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/0]
    4 be/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/0]
    5 rt/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/0]
    6 rt/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/0]
    7 rt/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/1]
    8 rt/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/1]
    9 be/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/1]
   10 rt/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/1]
   11 rt/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/2]
   12 rt/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/2]
   13 be/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/2]
   14 rt/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/2]
   15 rt/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/3]
   16 rt/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/3]
   17 be/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/3]
   18 rt/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/3]
   19 rt/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/4]
   20 rt/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/4]
   21 be/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/4]
   22 rt/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/4]
   23 rt/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/5]
   24 rt/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/5]
   25 be/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/5]
   26 rt/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/5]
   27 rt/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/6]
   28 rt/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/6]
   29 be/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/6]
   30 rt/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/6]
   31 rt/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/7]
   32 rt/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/7]
   33 be/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/7]
   34 rt/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/7]
   35 be/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [events/0]
   36 be/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [events/1]
   37 be/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [events/2]
   38 be/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [events/3]
   39 be/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [events/4]
   40 be/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [events/5]
   41 be/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [events/6]
   42 be/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [events/7]
   43 be/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [cgroup]
   44 be/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [khelper]
   45 be/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [netns]
   46 be/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [async/mgr]
   47 be/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [pm]
   48 be/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [sync_supers]
   49 be/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [bdi-default]
   50 be/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [kintegrityd/0]
   51 be/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [kintegrityd/1]
   52 be/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [kintegrityd/2]
   53 be/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [kintegrityd/3]
   54 be/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [kintegrityd/4]

Now the file copy with sync:

root [~]# time (cp largefile.tar.gz test05.tmp; sync)
real 1m33.923s
user 0m0.002s
sys 0m0.713s

Large file size: 523MB
BW determination: 523MB / 93.923 seconds = 5.56MB/s

File copy without sync:

root [~]# echo 3 > /proc/sys/vm/drop_caches
root [~]# time cp largefile.tar.gz test07.tmp
real 0m6.452s
user 0m0.007s
sys 0m0.687s

Large file size: 523MB
BW determination: 523MB / 6.452 seconds = 81.06 MB/s

During the copy (near the end: about 70 seconds into the copy - results with sync):

Samples: 17K of event 'cycles', Event count (approx.): 5067697991
  7.48%  [kernel]     [k] port_inb
  5.40%  [kernel]     [k] page_fault
  2.92%  [kernel]     [k] clear_page_c_e
  2.29%  [kernel]     [k] list_del
  2.21%  [kernel]     [k] _spin_lock
  1.99%  [kernel]     [k] __d_lookup
  1.92%  [kernel]     [k] avtab_search_node
  1.64%  [kernel]     [k] unmap_vmas
  1.59%  [kernel]     [k] get_page_from_freelist
  1.55%  [kernel]     [k] __mem_cgroup_commit_charge
  1.22%  [kernel]     [k] mem_cgroup_update_file_mapped
  1.21%  [kernel]     [k] copy_page_c
  1.04%  [kernel]     [k] find_vma
  1.00%  [kernel]     [k] _spin_lock_irq
  0.97%  [kernel]     [k] __wake_up_bit
  0.94%  [kernel]     [k] __mem_cgroup_uncharge_common
  0.92%  [kernel]     [k] get_page
  0.91%  [kernel]     [k] __alloc_pages_nodemask
  0.87%  [kernel]     [k] handle_mm_fault
  0.85%  [kernel]     [k] __link_path_walk
  0.84%  [kernel]     [k] avc_has_perm_noaudit
  0.83%  [kernel]     [k] alloc_pages_vma
  0.81%  [kernel]     [k] lookup_page_cgroup
  0.80%  [kernel]     [k] __do_page_fault
  0.80%  [kernel]     [k] free_pcppages_bulk
  0.77%  [kernel]     [k] _spin_lock_irqsave
  0.75%  [kernel]     [k] radix_tree_lookup_slot
  0.73%  [kernel]     [k] kmem_cache_alloc
  0.68%  [ip_tables]  [k] ipt_do_table
  0.66%  [kernel]     [k] _atomic_dec_and_lock
  0.65%  [kernel]     [k] release_pages
  0.62%  [kernel]     [k] find_get_page
  0.61%  [kernel]     [k] schedule
  0.60%  [kernel]     [k] inode_has_perm
  0.56%  [kernel]     [k] sidtab_context_to_sid
  0.54%  [kernel]     [k] handle_pte_fault
  0.53%  [kernel]     [k] _spin_unlock_irqrestore
  0.53%  [kernel]     [k] memset
  0.52%  [kernel]     [k] __inc_zone_state
  0.51%  [kernel]     [k] update_curr
  0.51%  [kernel]     [k] kfree
  0.50%  [kernel]     [k] __list_add
  0.50%  [kernel]     [k] __do_fault
  0.49%  [kernel]     [k] shmem_getpage_gfp
  0.47%  [kernel]     [k] filemap_fault

Total DISK READ: 0.00 B/s | Total DISK WRITE: 0.00 B/s
  TID PRIO USER      DISK READ  DISK WRITE  SWAPIN     IO> COMMAND
  541 be/3 root       0.00 B/s    0.00 B/s  0.00 % 96.96 % [jbd2/md2-8]
12468 be/4 nobody     0.00 B/s    3.89 K/s  0.00 %  0.00 % httpd -k start -DSSL
18818 be/4 mysql      0.00 B/s    3.89 K/s  0.00 %  0.00 % mysqld --basedir=/ --da~sql/server.pid
12333 be/4 nobody     0.00 B/s    3.89 K/s  0.00 %  0.00 % httpd -k start -DSSL
12560 be/4 nobody     0.00 B/s    3.89 K/s  0.00 %  0.00 % httpd -k start -DSSL
12568 be/4 nobody     0.00 B/s    3.89 K/s  0.00 %  0.00 % httpd -k start -DSSL
12281 be/4 nobody     0.00 B/s    3.89 K/s  0.00 %  0.00 % [httpd]
    1 be/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % init
    2 be/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [kthreadd]
    3 rt/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/0]
    4 be/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/0]
    5 rt/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/0]
    6 rt/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/0]
    7 rt/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/1]
    8 rt/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/1]
    9 be/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/1]
   10 rt/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/1]
   11 rt/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/2]
   12 rt/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/2]
   13 be/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/2]
   14 rt/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/2]
   15 rt/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/3]
   16 rt/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/3]
   17 be/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/3]
   18 rt/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/3]
   19 rt/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/4]
   20 rt/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/4]
   21 be/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/4]
   22 rt/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/4]
   23 rt/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/5]
   24 rt/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/5]
   25 be/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/5]
   26 rt/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/5]
   27 rt/4 root       0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/6]

Please let me know if I messed up again so that I can correct it.

@Adam

3. root [~]# fdisk -lu /dev/sd*

Disk /dev/sda: 512.1 GB, 512110190592 bytes
255 heads, 63 sectors/track, 62260 cylinders, total 1000215216 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00026d59

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1            2048     4196351     2097152   fd  Linux raid autodetect
Partition 1 does not end on cylinder boundary.
/dev/sda2   *     4196352     4605951      204800   fd  Linux raid autodetect
Partition 2 does not end on cylinder boundary.
/dev/sda3         4605952   814106623   404750336   fd  Linux raid autodetect

Disk /dev/sda1: 2147 MB, 2147483648 bytes
255 heads, 63 sectors/track, 261 cylinders, total 4194304 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0xfffefffe

Disk /dev/sda2: 209 MB, 209715200 bytes
255 heads, 63 sectors/track, 25 cylinders, total 409600 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

Disk /dev/sda3: 414.5 GB, 414464344064 bytes
255 heads, 63 sectors/track, 50389 cylinders, total 809500672 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

Disk /dev/sdb: 512.1 GB, 512110190592 bytes
255 heads, 63 sectors/track, 62260 cylinders, total 1000215216 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x0003dede

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1            2048     4196351     2097152   fd  Linux raid autodetect
Partition 1 does not end on cylinder boundary.
/dev/sdb2   *     4196352     4605951      204800   fd  Linux raid autodetect
Partition 2 does not end on cylinder boundary.
/dev/sdb3         4605952   814106623   404750336   fd  Linux raid autodetect

Disk /dev/sdb1: 2147 MB, 2147483648 bytes
255 heads, 63 sectors/track, 261 cylinders, total 4194304 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0xfffefffe

Disk /dev/sdb2: 209 MB, 209715200 bytes
255 heads, 63 sectors/track, 25 cylinders, total 409600 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

Disk /dev/sdb3: 414.5 GB, 414464344064 bytes
255 heads, 63 sectors/track, 50389 cylinders, total 809500672 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

Kind regards!
Andrei Banu

On 4/24/2013 6:24 AM, Stan Hoeppner wrote:
>> root [~]# cat /etc/fstab
>> UUID=cef1d19d-2578-43db-9ffc-b6b70e227bfa swap swap defaults 0 0
> I can't discern from UUID where your swap partition is located. Is it a
> partition directly on an SSD or is it a partition atop md1?
>
>> root [/]# echo 3 > /proc/sys/vm/drop_caches
>> root [~]# time cp largefile.tar.gz test03.tmp; time sync;
> You're slowing us down here. Please execute commands as instructed
> without modification. The above is wrong. You don't call time twice.
> If you're worried about sync execution being included in the time, use:
> $ time (cp src.tmp src.temp; sync)
>
> Though it makes little difference as Linux is pretty good about flushing
> the last few write buffers. But you missed the important part, the math
> for bandwidth determination: 548/real = xx MB/s
>
> This is cp not dd. It's up to you to do the math. Using time allows
> you to do so. 548MB is my example using your previous file size in your
> tests. Modify accordingly if needed.
>
> *Important note* The job of this list is to provide knowledge transfer,
> advice, and assistance. You must do the work, and you must learn along
> the way. We don't fix people's problems, as we don't have access to
> their computers. What we do is *enable* people to fix their problems
> themselves.
>
>> After about 15 seconds the server load started to increase from 1,
>> spiked to 40 in about a minute and then it started decreasing.
> Please stop telling us this. Linux load average is irrelevant.
>
>> 5. The perf top -U output during a dd copy:
> This was supposed to be executed before and simultaneously with the cp
> operation above. Do you know how to use multiple terminal windows?
>
>> 6. iotop
> Again, this was supposed to be run with the cp command, exited toward
> the end of the cp operation, then copy/pasted.
>
>> is very dynamic and I am afraid the data I am providing will be
>> unclear but let me give a number of snapshots from during the large file
>> copy and maybe you can make something of it (samples a few seconds apart):
>> !!!!!! 6085 be/4 root 7.69 K/s 1004.85 M/s 0.00 % 0.00 % dd
>> if=largefile.tar.gz of=test10 oflag=sync bs=1G
> This is another example of why you don't use dd for IO testing, and
> especially with a block size of 1GB. dd buffers into RAM up to
> $block_size bytes before it begins flushing to disk. So what you're
> seeing here is that massive push at the beginning of the run. Your SSDs
> in RAID1 peak at ~265MB/s. iotop is showing 1GB/s, 4 times what the
> drives can do. This is obviously not real.
>
> You can get away with oflag=sync using 1GB block size. But if you run
> dd the only way it can be run for realistic results, using bs=4096 which
> matches every filesystem block size including EXTx, XFS, and JFS, then
> using oflag=sync will degrade your performance, as an ack is required on
> each block. That's what sync does. With SSD it won't be nearly as
> dramatic as rust, where the difference in runtime is 100-200x slower due
> to rotational latency.
>
>> I apologize for such a lengthy email!
> Don't apologize, just don't send more information than needed,
> especially if you don't know it's relevant. ;) Send only what's
> requested, and as requested, please.
* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
2013-04-24 8:26 ` Andrei Banu
@ 2013-04-24 9:12 ` Adam Goryachev
2013-04-24 10:24 ` Tommy Apel
2013-04-24 21:40 ` Andrei Banu
2013-04-24 16:37 ` Stan Hoeppner
1 sibling, 2 replies; 38+ messages in thread
From: Adam Goryachev @ 2013-04-24 9:12 UTC (permalink / raw)
To: Andrei Banu; +Cc: linux-raid

On 24/04/13 18:26, Andrei Banu wrote:
> Hello,
>
> I am sorry for the irrelevant feedback. Where I misunderstood your
> request, I filled in the blanks (poorly).
>
> 1. SWAP
> root [~]# blkid | grep cef1d19d-2578-43db-9ffc-b6b70e227bfa
> /dev/md1: UUID="cef1d19d-2578-43db-9ffc-b6b70e227bfa" TYPE="swap"
>
> So yes, swap is on md1. This *md1 has a size of 2GB*. Isn't this way
> too low for a system with 16GB of memory?
>
Provide the output of "free", if there is RAM available, then it isn't too small (that is my personal opinion, but at least it won't affect performance/operations until you are using most of that swap space).

>
> 3. root [~]# fdisk -lu /dev/sd*
>
My mistake, I should have said:
fdisk -lu /dev/sd?

In any case, all of the relevant information was included, so no harm done.

> Disk /dev/sda: 512.1 GB, 512110190592 bytes
> 255 heads, 63 sectors/track, 62260 cylinders, total 1000215216 sectors
> Units = sectors of 1 * 512 = 512 bytes
> Sector size (logical/physical): 512 bytes / 512 bytes
> I/O size (minimum/optimal): 512 bytes / 512 bytes
> Disk identifier: 0x00026d59
>
>    Device Boot      Start         End      Blocks   Id  System
> /dev/sda1            2048     4196351     2097152   fd  Linux raid autodetect
> Partition 1 does not end on cylinder boundary.
> /dev/sda2   *     4196352     4605951      204800   fd  Linux raid autodetect
> Partition 2 does not end on cylinder boundary.
> /dev/sda3         4605952   814106623   404750336   fd  Linux raid autodetect
>
> Disk /dev/sdb: 512.1 GB, 512110190592 bytes
> 255 heads, 63 sectors/track, 62260 cylinders, total 1000215216 sectors
> Units = sectors of 1 * 512 = 512 bytes
> Sector size (logical/physical): 512 bytes / 512 bytes
> I/O size (minimum/optimal): 512 bytes / 512 bytes
> Disk identifier: 0x0003dede
>
>    Device Boot      Start         End      Blocks   Id  System
> /dev/sdb1            2048     4196351     2097152   fd  Linux raid autodetect
> Partition 1 does not end on cylinder boundary.
> /dev/sdb2   *     4196352     4605951      204800   fd  Linux raid autodetect
> Partition 2 does not end on cylinder boundary.
> /dev/sdb3         4605952   814106623   404750336   fd  Linux raid autodetect
>
I'm assuming from this you have three md RAID1 arrays where sda1/sdb1 are a pair, sda2/sdb2 are a pair and sda3/sdb3 are a pair?

Can you describe what is on each of these arrays?
Output of
cat /proc/mdstat
df
pvs
lvs

Might be helpful....

Regards,
Adam

--
Adam Goryachev
Website Managers
www.websitemanagers.com.au
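The difference between the two patterns Adam mentions comes down to shell globbing: `?` matches exactly one character, so `/dev/sd?` expands to whole disks only, while `*` matches any run of characters and therefore also picks up every partition node. A minimal sketch of that behaviour follows; it uses a throwaway directory with empty files standing in for device nodes (these are hypothetical stand-ins, not real block devices), and the function name `demo_globs` is invented for illustration.

```shell
# Compare the "sd?" and "sd*" glob patterns on fake device names.
demo_globs() {
    dir=$(mktemp -d) || return 1
    # Empty files that only mimic /dev naming; nothing here touches real disks.
    touch "$dir/sda" "$dir/sda1" "$dir/sda2" "$dir/sdb" "$dir/sdb1"
    (
        cd "$dir" || exit 1
        echo "sd? matches: $(echo sd?)"
        echo "sd* matches: $(echo sd*)"
    )
    rm -rf "$dir"
}

demo_globs
```

The first line lists only the two whole-disk names, while the second also lists the partition names, which is why the `sd*` form produced the extra per-partition "Disk /dev/sda1 ..." sections in the fdisk output.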
* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO 2013-04-24 9:12 ` Adam Goryachev @ 2013-04-24 10:24 ` Tommy Apel 2013-04-24 21:42 ` Andrei Banu 2013-04-24 21:40 ` Andrei Banu 1 sibling, 1 reply; 38+ messages in thread From: Tommy Apel @ 2013-04-24 10:24 UTC (permalink / raw) To: Adam Goryachev; +Cc: Andrei Banu, linux-raid Raid, stan Looks to me like it's the journaled quota process that holds everything back. 2013/4/24 Adam Goryachev <mailinglists@websitemanagers.com.au>: > On 24/04/13 18:26, Andrei Banu wrote: >> Hello, >> >> I am sorry for the irrelevant feedback. Where I misunderstood your >> request, I filled in the blanks (poorly). >> >> 1. SWAP >> root [~]# blkid | grep cef1d19d-2578-43db-9ffc-b6b70e227bfa >> /dev/md1: UUID="cef1d19d-2578-43db-9ffc-b6b70e227bfa" TYPE="swap" >> >> So yes, swap is on md1. This *md1 has a size of 2GB*. Isn't this way >> too low for a system with 16GB of memory? >> > Provide the output of "free", if there is RAM available, then it isn't > too small (that is my personal opinion, but at least it won't affect > performance/operations until you are using most of that swap space). > >> >> 3. root [~]# fdisk -lu /dev/sd* >> > My mistake, I should have said: > fdisk -lu /dev/sd? > > In any case, all of the relevant information was included, so no harm done. >> Disk /dev/sda: 512.1 GB, 512110190592 bytes >> 255 heads, 63 sectors/track, 62260 cylinders, total 1000215216 sectors >> Units = sectors of 1 * 512 = 512 bytes >> Sector size (logical/physical): 512 bytes / 512 bytes >> I/O size (minimum/optimal): 512 bytes / 512 bytes >> Disk identifier: 0x00026d59 >> >> Device Boot Start End Blocks Id System >> /dev/sda1 2048 4196351 2097152 fd Linux raid >> autodetect >> Partition 1 does not end on cylinder boundary. >> /dev/sda2 * 4196352 4605951 204800 fd Linux raid >> autodetect >> Partition 2 does not end on cylinder boundary. 
>> /dev/sda3 4605952 814106623 404750336 fd Linux raid >> autodetect >> >> Disk /dev/sdb: 512.1 GB, 512110190592 bytes >> 255 heads, 63 sectors/track, 62260 cylinders, total 1000215216 sectors >> Units = sectors of 1 * 512 = 512 bytes >> Sector size (logical/physical): 512 bytes / 512 bytes >> I/O size (minimum/optimal): 512 bytes / 512 bytes >> Disk identifier: 0x0003dede >> >> Device Boot Start End Blocks Id System >> /dev/sdb1 2048 4196351 2097152 fd Linux raid >> autodetect >> Partition 1 does not end on cylinder boundary. >> /dev/sdb2 * 4196352 4605951 204800 fd Linux raid >> autodetect >> Partition 2 does not end on cylinder boundary. >> /dev/sdb3 4605952 814106623 404750336 fd Linux raid >> autodetect >> > I'm assuming from this you have three md RAID1 arrays where sda1/sdb1 > are a pair, sda2/sdb2 are a pair and sda3/sdb3 are a pair? > > Can you describe what is on each of these arrays? > Output of > cat /proc/mdstat > df > pvs > lvs > > Might be helpful.... > > Regards, > Adam > > -- > Adam Goryachev > Website Managers > www.websitemanagers.com.au > > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
2013-04-24 10:24 ` Tommy Apel
@ 2013-04-24 21:42 ` Andrei Banu
0 siblings, 0 replies; 38+ messages in thread
From: Andrei Banu @ 2013-04-24 21:42 UTC (permalink / raw)
Cc: linux-raid Raid
Hi,
Why would it do that? And how do I fix this?
Thanks!
On 24/04/2013 1:24 PM, Tommy Apel wrote:
> Looks to me like it's the journaled quota process that holds everything back.
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO 2013-04-24 9:12 ` Adam Goryachev 2013-04-24 10:24 ` Tommy Apel @ 2013-04-24 21:40 ` Andrei Banu 1 sibling, 0 replies; 38+ messages in thread From: Andrei Banu @ 2013-04-24 21:40 UTC (permalink / raw) Cc: linux-raid Hi, 1. free -m root [~]# free -m total used free shared buffers cached Mem: 15921 15542 379 0 1063 11870 -/+ buffers/cache: 2608 13313 Swap: 2046 100 1946 2. Yes, you understood correctly regarding the raid array (all 3 of them are raid 1): root@gts6 [~]# cat /proc/mdstat Personalities : [raid1] md0 : active raid1 sdb2[1] sda2[0] 204736 blocks super 1.0 [2/2] [UU] md2 : active raid1 sdb3[1] sda3[0] 404750144 blocks super 1.0 [2/2] [UU] md1 : active raid1 sdb1[1] sda1[0] 2096064 blocks super 1.1 [2/2] [UU] unused devices: <none> md0 is boot. md1 is swap. md2 is / 3. df root@gts6 [~]# df -h Filesystem Size Used Avail Use% Mounted on /dev/md2 380G 246G 116G 68% / tmpfs 7.8G 0 7.8G 0% /dev/shm /dev/md0 194M 47M 137M 26% /boot /usr/tmpDSK 3.6G 1.2G 2.2G 36% /tmp 4. pvs root [~]# pvs -a PV VG Fmt Attr PSize PFree /dev/loop0 --- 0 0 /dev/md0 --- 0 0 /dev/md1 --- 0 0 /dev/ram0 --- 0 0 /dev/ram1 --- 0 0 /dev/ram10 --- 0 0 /dev/ram11 --- 0 0 /dev/ram12 --- 0 0 /dev/ram13 --- 0 0 /dev/ram14 --- 0 0 /dev/ram15 --- 0 0 /dev/ram2 --- 0 0 /dev/ram3 --- 0 0 /dev/ram4 --- 0 0 /dev/ram5 --- 0 0 /dev/ram6 --- 0 0 /dev/ram7 --- 0 0 /dev/ram8 --- 0 0 /dev/ram9 --- 0 0 /dev/root --- 0 0 5. lvs (No volume groups). Thanks! On 24/04/2013 12:12 PM, Adam Goryachev wrote: > On 24/04/13 18:26, Andrei Banu wrote: >> Hello, >> >> I am sorry for the irrelevant feedback. Where I misunderstood your >> request, I filled in the blanks (poorly). >> >> 1. SWAP >> root [~]# blkid | grep cef1d19d-2578-43db-9ffc-b6b70e227bfa >> /dev/md1: UUID="cef1d19d-2578-43db-9ffc-b6b70e227bfa" TYPE="swap" >> >> So yes, swap is on md1. This *md1 has a size of 2GB*. Isn't this way >> too low for a system with 16GB of memory? 
>> > Provide the output of "free", if there is RAM available, then it isn't > too small (that is my personal opinion, but at least it won't affect > performance/operations until you are using most of that swap space). > >> 3. root [~]# fdisk -lu /dev/sd* >> > My mistake, I should have said: > fdisk -lu /dev/sd? > > In any case, all of the relevant information was included, so no harm done. >> Disk /dev/sda: 512.1 GB, 512110190592 bytes >> 255 heads, 63 sectors/track, 62260 cylinders, total 1000215216 sectors >> Units = sectors of 1 * 512 = 512 bytes >> Sector size (logical/physical): 512 bytes / 512 bytes >> I/O size (minimum/optimal): 512 bytes / 512 bytes >> Disk identifier: 0x00026d59 >> >> Device Boot Start End Blocks Id System >> /dev/sda1 2048 4196351 2097152 fd Linux raid >> autodetect >> Partition 1 does not end on cylinder boundary. >> /dev/sda2 * 4196352 4605951 204800 fd Linux raid >> autodetect >> Partition 2 does not end on cylinder boundary. >> /dev/sda3 4605952 814106623 404750336 fd Linux raid >> autodetect >> >> Disk /dev/sdb: 512.1 GB, 512110190592 bytes >> 255 heads, 63 sectors/track, 62260 cylinders, total 1000215216 sectors >> Units = sectors of 1 * 512 = 512 bytes >> Sector size (logical/physical): 512 bytes / 512 bytes >> I/O size (minimum/optimal): 512 bytes / 512 bytes >> Disk identifier: 0x0003dede >> >> Device Boot Start End Blocks Id System >> /dev/sdb1 2048 4196351 2097152 fd Linux raid >> autodetect >> Partition 1 does not end on cylinder boundary. >> /dev/sdb2 * 4196352 4605951 204800 fd Linux raid >> autodetect >> Partition 2 does not end on cylinder boundary. >> /dev/sdb3 4605952 814106623 404750336 fd Linux raid >> autodetect >> > I'm assuming from this you have three md RAID1 arrays where sda1/sdb1 > are a pair, sda2/sdb2 are a pair and sda3/sdb3 are a pair? > > Can you describe what is on each of these arrays? > Output of > cat /proc/mdstat > df > pvs > lvs > > Might be helpful.... 
> > Regards, > Adam > ^ permalink raw reply [flat|nested] 38+ messages in thread
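The `free -m` figures in the message above bear on Adam's question about whether 2 GB of swap is too small. A minimal sketch of the arithmetic, assuming the old procps layout shown there, in which memory available to applications is roughly free + buffers + cached. The function names are illustrative only; the numbers are copied from the message.

```python
# Values copied from the "free -m" output in the message above (MiB).
mem_total, mem_used, mem_free = 15921, 15542, 379
buffers, cached = 1063, 11870
swap_total, swap_used = 2046, 100


def reclaimable_mib(free, buffers, cached):
    """Memory the kernel can hand to applications: free plus buffers and page cache."""
    return free + buffers + cached


def swap_used_pct(used, total):
    """Percentage of swap space currently in use."""
    return 100.0 * used / total


available = reclaimable_mib(mem_free, buffers, cached)
print(f"available to applications: ~{available} MiB of {mem_total} MiB")
print(f"swap in use: {swap_used_pct(swap_used, swap_total):.1f}%")
```

The result agrees, within rounding, with the 13313 MiB on the "-/+ buffers/cache" line, and only about 5% of swap is in use, so by Adam's criterion the swap size is unlikely to be what limits write throughput here.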
* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
2013-04-24 8:26 ` Andrei Banu
2013-04-24 9:12 ` Adam Goryachev
@ 2013-04-24 16:37 ` Stan Hoeppner
2013-04-24 21:46 ` Andrei Banu
1 sibling, 1 reply; 38+ messages in thread
From: Stan Hoeppner @ 2013-04-24 16:37 UTC (permalink / raw)
To: Andrei Banu; +Cc: linux-raid
On 4/24/2013 3:26 AM, Andrei Banu wrote:
> Total DISK READ: 0.00 B/s | Total DISK WRITE: 0.00 B/s
> TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
> 541 be/3 root 0.00 B/s 0.00 B/s 0.00 % 96.96 % [jbd2/md2-8]
This seems to be your problem. jbd2 (journal block device) is causing
97% iowait, yet without doing much physical IO. This is a component of
EXT4. As this will fire intermittently it explains why you see such a
wide throughput gap between tests at different points in time.
This isn't a bug or Google would reveal that. Andrei, you need to
identify which daemon or kernel feature is causing this. Do you happen
to have realtime TRIM enabled? It is well known to bring IO to a crawl.
If not realtime TRIM, I'd guess you turned a knob you should not have in
some config file, causing a daemon to frequently issue a few gazillion
atomic updates.
--
Stan
^ permalink raw reply [flat|nested] 38+ messages in thread
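One way to answer the realtime TRIM question is to look at the options the filesystems are actually mounted with, since realtime TRIM on ext4 is the `discard` mount option. The sketch below parses text in `/proc/mounts` format and lists the mounts that carry a given option. The function name and the sample text are hypothetical and not taken from the thread.

```python
def mounts_with_option(mounts_text, option):
    """Return (device, mountpoint, fstype) for each mount whose options include `option`."""
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mountpoint, fstype, options = fields[:4]
        if option in options.split(","):
            hits.append((device, mountpoint, fstype))
    return hits


# Hypothetical sample in /proc/mounts format; NOT the real mounts of this server.
SAMPLE = """\
/dev/md2 / ext4 rw,relatime,discard,data=ordered 0 0
/dev/md0 /boot ext4 rw,relatime,data=ordered 0 0
tmpfs /dev/shm tmpfs rw 0 0
"""

print(mounts_with_option(SAMPLE, "discard"))
```

On a live Linux system the same function can be fed `open("/proc/mounts").read()`; an empty list for `discard` would mean realtime TRIM is not enabled on any mounted filesystem. The same check works for other options, for example `data=journal`.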
* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO 2013-04-24 16:37 ` Stan Hoeppner @ 2013-04-24 21:46 ` Andrei Banu [not found] ` <CAH3kUhHnF0imY=CAHfzaQy4XJuOMgOtbHNp17EYzeSJR2en7Fg@mail.gmail.com> 2013-04-25 10:56 ` Stan Hoeppner 0 siblings, 2 replies; 38+ messages in thread From: Andrei Banu @ 2013-04-24 21:46 UTC (permalink / raw) Cc: linux-raid Hi, 1. How can I at least start trying to find the daemon that might be doing this? 2. I am not sure what real time TRIM is. I thought there was the 'discard' option in fstab (which I tried and didn't help) and other command like trims (fstrim - which errors out when run on / or mdtrim that seems somebody's experiment). But I am not sure what real time trim might be. I am not really sure where do I go from here. I am a bit lost as it seems we hit a dead end. Thanks! Andrei Banu On 24/04/2013 7:37 PM, Stan Hoeppner wrote: > On 4/24/2013 3:26 AM, Andrei Banu wrote: > >> Total DISK READ: 0.00 B/s | Total DISK WRITE: 0.00 B/s >> TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND >> 541 be/3 root 0.00 B/s 0.00 B/s 0.00 % 96.96 % [jbd2/md2-8] > This seems to be your problem. jbd2 (journal block device) is causing > 97% iowait, yet without doing much physical IO. This is a component of > EXT4. As this will fire intermittently it explains why you see such a > wide throughput gap between tests at different points in time. > > This isn't a bug or Google would reveal that. Andrei, you need to > identify which daemon or kernel feature is causing this. Do you happen > to have realtime TRIM enabled? It is well known to bring IO to a crawl. > > If not realtime TRIM, I'd guess you turned a knob you should not have in > some config file, causing a daemon to frequently issue a few gazillion > atomic updates. > ^ permalink raw reply [flat|nested] 38+ messages in thread
[parent not found: <CAH3kUhHnF0imY=CAHfzaQy4XJuOMgOtbHNp17EYzeSJR2en7Fg@mail.gmail.com>]
* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO [not found] ` <CAH3kUhHnF0imY=CAHfzaQy4XJuOMgOtbHNp17EYzeSJR2en7Fg@mail.gmail.com> @ 2013-04-25 10:11 ` Andrei Banu 0 siblings, 0 replies; 38+ messages in thread From: Andrei Banu @ 2013-04-25 10:11 UTC (permalink / raw) To: linux-raid Hi, I don't have fstab discard option set. I was just enumerating the trim kinds I know. I did try discard but it didn't do anything good. And the problem dated from before my discard test. Regards! On 2013-04-25 00:53, Roberto Spadim wrote: > TRIM in ext4 = discard > 2013/4/24 Andrei Banu <andrei.banu@redhost.ro> > >> Hi, >> 1. How can I at least start trying to find the daemon that might be >> doing this? >> 2. I am not sure what real time TRIM is. I thought there was the >> 'discard' option in >> fstab (which I tried and didn't help) and other command like trims >> (fstrim - which >> errors out when run on / or mdtrim that seems somebody's experiment). >> But I >> am not sure what real time trim might be. >> I am not really sure where do I go from here. I am a bit lost as it >> seems we hit >> a dead end. >> Thanks! >> Andrei Banu >> On 24/04/2013 7:37 PM, Stan Hoeppner wrote: >> >>> On 4/24/2013 3:26 AM, Andrei Banu wrote: >>> >>>> Total DISK READ: 0.00 B/s | Total DISK WRITE: 0.00 B/s >>>> TID PRIO USER DISK READ DISK WRITE SWAPIN IO> >>>> COMMAND >>>> 541 be/3 root 0.00 B/s 0.00 B/s 0.00 % 96.96 % >>>> [jbd2/md2-8] >>> This seems to be your problem. jbd2 (journal block device) is >>> causing >>> 97% iowait, yet without doing much physical IO. This is a component >>> of >>> EXT4. As this will fire intermittently it explains why you see such >>> a >>> wide throughput gap between tests at different points in time. >>> This isn't a bug or Google would reveal that. Andrei, you need to >>> identify which daemon or kernel feature is causing this. Do you >>> happen >>> to have realtime TRIM enabled? It is well known to bring IO to a >>> crawl. 
>>> If not realtime TRIM, I'd guess you turned a knob you should not >>> have in >>> some config file, causing a daemon to frequently issue a few >>> gazillion >>> atomic updates. >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-raid" >> in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> [1] > -- > Roberto Spadim > Links: > ------ > [1] http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO 2013-04-24 21:46 ` Andrei Banu [not found] ` <CAH3kUhHnF0imY=CAHfzaQy4XJuOMgOtbHNp17EYzeSJR2en7Fg@mail.gmail.com> @ 2013-04-25 10:56 ` Stan Hoeppner 1 sibling, 0 replies; 38+ messages in thread From: Stan Hoeppner @ 2013-04-25 10:56 UTC (permalink / raw) To: Andrei Banu On 4/24/2013 4:46 PM, Andrei Banu wrote: > 1. How can I at least start trying to find the daemon that might be > doing this? For you, I'd say grab a bucket of popcorn and watch top and iotop for a while during peak use periods. Fire up two ssh sessions and watch both simultaneously, left and right on your screen. You need to become familiar with your system, what the applications are doing to cpu, mem, and io. When you're not doing that, use Google. Start reading about problems others have with "[jbd2/]" and/or super slow performance with very fast SSDs. > 2. I am not sure what real time TRIM is. I thought there was the > 'discard' option in > fstab (which I tried and didn't help) and other command like trims discard = realtime trim If it's not enabled then this isn't the source of your problem. > I am not really sure where do I go from here. I am a bit lost as it > seems we hit > a dead end. There's only so much we can do. The problem appears to have nothing to do with md/RAID. I'm doing my best to point you in the right direction(s), but I'm neither a CentOS nor EXT4 user and am not familiar with those ecosystems nor support channels. You need to research your problem via Google, interface with other CentOS users and others using the same type of cpanel based hosting software stack. If I had access to the box I'm sure I could figure this out for you, but this isn't something I'm willing to do at this time. Keep at it and you'll eventually figure it out. And you'll learn a lot along the way. Best of luck. -- Stan > Thanks! 
> Andrei Banu > > On 24/04/2013 7:37 PM, Stan Hoeppner wrote: >> On 4/24/2013 3:26 AM, Andrei Banu wrote: >> >>> Total DISK READ: 0.00 B/s | Total DISK WRITE: 0.00 B/s >>> TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND >>> 541 be/3 root 0.00 B/s 0.00 B/s 0.00 % 96.96 % >>> [jbd2/md2-8] >> This seems to be your problem. jbd2 (journal block device) is causing >> 97% iowait, yet without doing much physical IO. This is a component of >> EXT4. As this will fire intermittently it explains why you see such a >> wide throughput gap between tests at different points in time. >> >> This isn't a bug or Google would reveal that. Andrei, you need to >> identify which daemon or kernel feature is causing this. Do you happen >> to have realtime TRIM enabled? It is well known to bring IO to a crawl. >> >> If not realtime TRIM, I'd guess you turned a knob you should not have in >> some config file, causing a daemon to frequently issue a few gazillion >> atomic updates. >> > > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
2013-04-21 23:17 ` Stan Hoeppner
2013-04-22 10:19 ` Andrei Banu
@ 2013-04-22 23:11 ` Andrei Banu
2013-04-23 4:39 ` Stan Hoeppner
2013-04-22 23:25 ` Stan Hoeppner
2 siblings, 1 reply; 38+ messages in thread
From: Andrei Banu @ 2013-04-22 23:11 UTC (permalink / raw)
To: linux-raid
Hello again!
I have closed all the load generating services, waited a few minutes for
the server load to reach a clean 0.00 and then I have re-performed the
dd tests with various bs sizes. I was not able to set up fio correctly
because of a compile error, but I'll get it done.
One more thing before the results: I omitted to answer something earlier
today. CentOS was installed due to the fact that cPanel is not
installable on many OSes (CentOS, RHEL and I think that's about it). So
I picked CentOS. The installation was done remotely over KVM with a
minimal CentOS CD (the datacenter does not offer any server related
services so we had to do it ourselves over a Raritan KVM).
Tests were done roughly 1 minute apart.
1. First test (bs=1G): same as always.
root [~]# dd if=testfile.tar.gz of=test oflag=sync bs=1G
547682517 bytes (548 MB) copied, 53.3767 s, 10.3 MB/s
2. With a bs of 4MB: niceeee! Best result ever. I am not sure what
happened this time. However it's short lived.
root [~]# dd if=testfile.tar.gz of=test2 oflag=sync bs=4M
547682517 bytes (548 MB) copied, 4.43305 s, 124 MB/s
3. bs=2MB, starting to decay.
root [~]# dd if=testfile.tar.gz of=test3 oflag=sync bs=2M
547682517 bytes (548 MB) copied, 20.3647 s, 26.9 MB/s
4. bs=4MB again. Back to square 1.
root [~]# dd if=testfile.tar.gz of=test4 oflag=sync bs=4M
547682517 bytes (548 MB) copied, 56.7124 s, 9.7 MB/s
As services were shut down prior to the test, the biggest load it
reached was about 2.
5. Finally I restarted the services and reran the bs=4MB test (starting
from a load of 0.23):
root [~]# dd if=testfile.tar.gz of=test6 oflag=sync bs=4M
547682517 bytes (548 MB) copied, 116.469 s, 4.7 MB/s
Again, I don't think my problem is related to any concurrent I/O
starvation. These SSDs or this mdraid or I don't know what simply can't
take any sustained write task. And this is not due to the server load.
Even during very low server loads it's enough to write about 1GB of data
within a short time frame (minutes) to bring the I/O system to its
knees for a considerable time (at least tens of minutes).
4.7MB per second for writing a 548MB file starting from a load of 0.23
during off peak hours on SSDs. Nice!!!
Thanks!
On 22/04/2013 2:17 AM, Stan Hoeppner wrote:
> On 4/21/2013 3:46 PM, Andrei Banu wrote:
>> Hello,
>>
>> At this point I probably should state that I am not an experienced
>> sysadmin.
> Things are becoming more clear now.
>
>> Knowing this, I do have a server management company but they
>> said they don't know what to do
> So you own this hardware and it is colocated, correct?
>
>> so now I am trying to fix things myself
>> but I am something of a noob. I normally try to keep my actions to
>> cautious config changes and testing.
> Why did you choose Centos? Was this installed by the company?
>
>> I have never done a kernel update.
>> Any easy way to do this?
> It may not be necessary, at least to solve any SSD performance problems
> anyway. Reexamining your numbers shows you hit 262MB/s to /dev/sda.
> That's 65% of SATA2 interface bandwidth, so this kernel probably does
> have the patch. Your problem lie elsewhere.
>
>> Regarding your second advice (to purchase a decent HBA) I have already
>> thought about it but I guess it comes with it's own drivers that need to
>> be compiled into initramfs etc.
> The default CentOS (RHEL) initramfs should include mptsas, which
> supports all the LSI HBAs.
The LSI caching RAID cards are supported as > well with megaraid_sas. > > The question is, do you really need more than the ~260MB/s of peak > throughput you currently have? And is it worth the hassle? > >> So I am trying to replace the baseboard >> with one with SATA3 support to avoid any configuration changes (the old >> board has the C202 chipset and the new one has C204 so I guess this >> replacement is as simple as it gets - just remove the old board and plug >> the new one without any software changes or recompiles). Again I need to >> say this server is in production and I can't move the data or the users. >> I can have a few hours downtime during the night but that's about all. > It's not clear your problem is hardware bandwidth. In fact it seems the > problem lie elsewhere. It may simply be that you're running these tests > while other substantial IO is occurring. Actually, your numbers show > this is exactly the case. What they don't show is how much other IO is > hitting the SSDs while you're running your tests. > >> Regarding the kernel upgrade, do we need to compile one from source or >> there's an easier way? > I don't believe at this point you need a new kernel to fix the problem > you have. If this patch was not present you'd not be able to get > 260MB/s from SATA2. Your problem lie elsewhere. > > In the future, instead of making a post saying "md is slow, my SSDs are > slow" and pasting test data which appears to back that claim, you'd be > better served by describing a general problem, such as "users say the > system is slow and I think it may be md or SSD related". This way we > don't waste time following a troubleshooting path based on incorrect > assumptions, as we've done here. Or at least as I've done here, as I'm > the only one assisting. > > Boot all users off the system, shut down any daemons that may generate > any meaningful load on the disks or CPUs. Disable any encryption or > compression. Then rerun your tests while completely idle. 
Then we'll > go from there. > ^ permalink raw reply [flat|nested] 38+ messages in thread
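To put the spread between the five dd runs in this message side by side, the sketch below recomputes throughput from the byte count and elapsed time in each dd summary line. The summary strings are copied from the message; the labels, regular expression and function name are illustrative only.

```python
import re

# dd summary lines copied from the message above, with a label for each run.
DD_RUNS = [
    ("bs=1G", "547682517 bytes (548 MB) copied, 53.3767 s, 10.3 MB/s"),
    ("bs=4M, first run", "547682517 bytes (548 MB) copied, 4.43305 s, 124 MB/s"),
    ("bs=2M", "547682517 bytes (548 MB) copied, 20.3647 s, 26.9 MB/s"),
    ("bs=4M, second run", "547682517 bytes (548 MB) copied, 56.7124 s, 9.7 MB/s"),
    ("bs=4M, services restarted", "547682517 bytes (548 MB) copied, 116.469 s, 4.7 MB/s"),
]

SUMMARY = re.compile(r"(\d+) bytes .* copied, ([\d.]+) s,")


def mb_per_s(summary_line):
    """Recompute throughput in MB/s (10^6 bytes per second) from a dd summary line."""
    match = SUMMARY.search(summary_line)
    if match is None:
        raise ValueError(f"not a dd summary line: {summary_line!r}")
    return int(match.group(1)) / float(match.group(2)) / 1e6


rates = [(label, mb_per_s(line)) for label, line in DD_RUNS]
for label, rate in rates:
    print(f"{label:28s} {rate:7.1f} MB/s")

fastest = max(rate for _, rate in rates)
slowest = min(rate for _, rate in rates)
print(f"fastest / slowest: {fastest / slowest:.0f}x")
```

The same block size (4M) appears at both ends of the range, which supports the observation in the message that the variation is driven by when a test runs rather than by the dd parameters.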
* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
2013-04-22 23:11 ` Andrei Banu
@ 2013-04-23 4:39 ` Stan Hoeppner
0 siblings, 0 replies; 38+ messages in thread
From: Stan Hoeppner @ 2013-04-23 4:39 UTC (permalink / raw)
To: Andrei Banu; +Cc: linux-raid
On 4/22/2013 6:11 PM, Andrei Banu wrote:
...
> 1. First test (bs=1G): same as always.
> root [~]# dd if=testfile.tar.gz of=test oflag=sync bs=1G
> 547682517 bytes (548 MB) copied, 53.3767 s, 10.3 MB/s
...
> root [~]# dd if=testfile.tar.gz of=test6 oflag=sync bs=4M
> 547682517 bytes (548 MB) copied, 116.469 s, 4.7 MB/s
...
> Again, I don't think my problem is related to any concurrent I/O
> starvation. These SSDs or this mdraid or I don't know what simply can't
> take any sustained write task. And this is not due to the server load.
> Even during very low server loads it's enough to write about 1GB of data
> within a short time frame (minutes) to bring the I/O system to it's
> knees for a considerable time (at least tens of minutes).
Something's going on here. Ditch dd for now. What's the result of:
$ echo 3 > /proc/sys/vm/drop_caches
$ time cp testfile.tar.gz testxx.tmp; sync
548/real = xx MB/s
And now ditch flushing FS buffers:
$ echo 3 > /proc/sys/vm/drop_caches
$ time cp testfile.tar.gz testxx.tmp
548/real = xx MB/s
And please paste this so we can see how you're mounting EXT4.
$ cat /etc/fstab |grep ext
Mounting data=journal will decrease write throughput by 50% as
everything is written twice: once to the journal, once into the
filesystem. This wouldn't account for the entire performance deficit
though.
--
Stan
^ permalink raw reply [flat|nested] 38+ messages in thread
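A caveat on the first test as typed: in bash, `time cp testfile.tar.gz testxx.tmp; sync` times only the `cp`, because `time` applies to the first command and `sync` runs afterwards, outside the measurement. The sketch below is one possible way to include the flush in the timed interval and to do the "548/real" division automatically. It is an approximation under stated assumptions, not a replacement for the commands above: it flushes only the destination file with `os.fsync` rather than the whole system, it does not drop caches (which needs root), and the function name and demo file are hypothetical.

```python
import os
import shutil
import tempfile
import time


def timed_copy_mb_per_s(src, dst):
    """Copy src to dst, flush dst to stable storage, and return MB/s (10^6 bytes per second)."""
    start = time.perf_counter()
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
        fout.flush()
        os.fsync(fout.fileno())  # the flush is inside the timed interval
    elapsed = time.perf_counter() - start
    return os.path.getsize(dst) / elapsed / 1e6


# Demo on a small throwaway file; on the server, src would be testfile.tar.gz.
with tempfile.TemporaryDirectory() as workdir:
    demo_src = os.path.join(workdir, "demo_src.bin")
    demo_dst = os.path.join(workdir, "demo_dst.bin")
    with open(demo_src, "wb") as f:
        f.write(os.urandom(8 * 1024 * 1024))  # 8 MiB of incompressible data
    demo_rate = timed_copy_mb_per_s(demo_src, demo_dst)
    demo_size = os.path.getsize(demo_dst)
    print(f"copied {demo_size} bytes at {demo_rate:.1f} MB/s")
```

Running it several times a minute or so apart, as with the dd tests, would show whether the slowdown also appears without dd and `oflag=sync`.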
* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
2013-04-21 23:17 ` Stan Hoeppner
2013-04-22 10:19 ` Andrei Banu
2013-04-22 23:11 ` Andrei Banu
@ 2013-04-22 23:25 ` Stan Hoeppner
2013-04-23 4:49 ` Mikael Abrahamsson
2 siblings, 1 reply; 38+ messages in thread
From: Stan Hoeppner @ 2013-04-22 23:25 UTC (permalink / raw)
To: stan; +Cc: Andrei Banu, linux-raid
On 4/21/2013 6:17 PM, Stan Hoeppner wrote:
> It may not be necessary, at least to solve any SSD performance problems
> anyway. Reexamining your numbers shows you hit 262MB/s to /dev/sda.
> That's 65% of SATA2 interface bandwidth, so this kernel probably does
> have the patch. Your problem lie elsewhere.
Big correction. That should state 87% of SATA2 interface bandwidth. I
must have been thinking of three things at once when I fubar'd that, as
that's not simply a typo.
--
Stan
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
2013-04-22 23:25 ` Stan Hoeppner
@ 2013-04-23 4:49 ` Mikael Abrahamsson
0 siblings, 0 replies; 38+ messages in thread
From: Mikael Abrahamsson @ 2013-04-23 4:49 UTC (permalink / raw)
To: Stan Hoeppner; +Cc: Andrei Banu, linux-raid
On Mon, 22 Apr 2013, Stan Hoeppner wrote:
> On 4/21/2013 6:17 PM, Stan Hoeppner wrote:
>
>> It may not be necessary, at least to solve any SSD performance problems
>> anyway. Reexamining your numbers shows you hit 262MB/s to /dev/sda.
>> That's 65% of SATA2 interface bandwidth, so this kernel probably does
>> have the patch. Your problem lie elsewhere.
>
> Big correction. That should state 87% of SATA2 interface bandwidth. I
> must have been thinking of three things at once when I fubar'd that, as
> that's not simply a typo.
As far as I know, the 300 megabyte/s of SATA2 bandwidth doesn't include
coding overhead etc, so it's not theoretically possible to reach all the
way up to 300. From all the tests I've seen, around 260-270 megabyte/s
seems to be the maximum achievable, so I'd say 262 MB/s is basically as
much as can be expected from SATA2.
--
Mikael Abrahamsson email: swmike@swm.pp.se
^ permalink raw reply [flat|nested] 38+ messages in thread
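The percentages in this exchange can be checked with a short calculation. The sketch assumes the commonly cited SATA2 figures, a 3 Gbit/s line rate with 8b/10b encoding, which is where the nominal 300 MB/s comes from; framing and protocol overhead on top of that would account for the lower practical ceiling of 260-270 MB/s mentioned above. Constant and function names are illustrative only.

```python
LINE_RATE_GBIT = 3.0            # SATA2 signalling rate in Gbit/s (assumed nominal figure)
ENCODING_EFFICIENCY = 8 / 10    # 8b/10b: 8 payload bits carried per 10 line bits
MEASURED_MB_S = 262.0           # read rate to /dev/sda quoted in the thread


def payload_mb_per_s(line_rate_gbit, efficiency):
    """Payload bandwidth in MB/s (10^6 bytes per second) after line coding."""
    return line_rate_gbit * 1e9 * efficiency / 8 / 1e6


nominal = payload_mb_per_s(LINE_RATE_GBIT, ENCODING_EFFICIENCY)
share = 100.0 * MEASURED_MB_S / nominal
print(f"nominal SATA2 payload bandwidth: {nominal:.0f} MB/s")
print(f"262 MB/s is {share:.1f}% of that")
```

This matches the corrected figure of 87%, and 262 MB/s sits inside the 260-270 MB/s range given above as the practical maximum, so the SATA2 link itself is effectively saturated on reads.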
* Re: Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO
2013-04-19 22:58 Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO Andrei Banu
` (2 preceding siblings ...)
[not found] ` <51732E2B.6090607@hardwarefreak.com>
@ 2013-04-23 6:01 ` Stan Hoeppner
3 siblings, 0 replies; 38+ messages in thread
From: Stan Hoeppner @ 2013-04-23 6:01 UTC (permalink / raw)
To: Andrei Banu; +Cc: linux-raid
On 4/19/2013 5:58 PM, Andrei Banu wrote:
> Hardware: SuperMicro 5017C-MTRF
Not relevant if you're using SATA ports 0-1, but may well be if using
2-5, assuming this system isn't brand new. As I said previously, you'd
see some errors in dmesg if you had port/cable issues.
From: Intel® 6 Series Chipset and Intel® C200 Series Chipset
Specification Update
Problem:
Due to a circuit design issue on Intel 6 Series Chipset and Intel C200
Series Chipset, electrical lifetime wear out may affect clock
distribution for SATA ports 2-5. This may manifest itself as a
functional issue on SATA ports 2-5 over time.
• The electrical lifetime wear out may result in device oxide
degradation which over time can cause drain to gate leakage current.
• This issue has time, temperature and voltage sensitivities.
Implication:
The increased leakage current may result in an unstable clock and
potentially functional issues on SATA ports 2-5 in the form of receive
errors, transmit errors, and unrecognized drives.
...
• SATA ports 0-1 are not affected by this design issue as they have
separate clock generation circuitry.
Workaround:
Intel has worked with board and system manufacturers to identify and
implement solutions for affected systems.
• Use only SATA ports 0-1.
• Use an add-in PCIe SATA bridge solution.
Not all boards are affected by this. You'd have to check the spec
revision on your C202, which means contacting SuperMicro with your board
revision/serial number. To be certain you're not affected simply use
only ports 0-1. But on that note...
It may be an opportune time to consider dropping in a LSI 9211-4i. 4GB/s
raw throughput, plenty for 4 SSDs at full boogie should you expand. The
kit version comes with a 1-4 breakout cable for your 1U SM chassis drive
backplane. Even if we get your issue fixed via software and both drives
are humming away at ~260MB/s, that nightly backup process you mentioned,
and others, would surely benefit from an additional ~200MB/s throughput.
--
Stan
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 38+ messages in thread
end of thread, other threads:[~2013-04-25 11:38 UTC | newest]
Thread overview: 38+ messages
2013-04-19 22:58 Incredibly poor performance of mdraid-1 with 2 SSD Samsung 840 PRO Andrei Banu
[not found] ` <CAH3kUhEaZGON=fAyVMZOz5fH_DcfKv=hCa96UCeK4pN7k81c_Q@mail.gmail.com>
[not found] ` <51725458.7020109@redhost.ro>
[not found] ` <CAH3kUhHxBiqugFQm=PPJNNe9jOdKy0etUjQNsoDz_LJNUCLCCQ@mail.gmail.com>
2013-04-20 23:25 ` Andrei Banu
2013-04-20 23:26 ` Andrei Banu
2013-04-21 2:48 ` Stan Hoeppner
2013-04-21 12:23 ` Tommy Apel
2013-04-21 16:48 ` Tommy Apel
2013-04-21 19:33 ` Stan Hoeppner
2013-04-21 19:56 ` Tommy Apel
2013-04-22 0:47 ` Stan Hoeppner
2013-04-22 7:51 ` Tommy Apel
2013-04-22 8:29 ` Tommy Apel
2013-04-22 10:26 ` Andrei Banu
2013-04-22 12:02 ` Tommy Apel
2013-04-23 2:59 ` Stan Hoeppner
2013-04-22 23:21 ` Stan Hoeppner
2013-04-25 11:38 ` Thomas Jarosch
2013-04-20 23:26 ` Andrei Banu
2013-04-21 0:10 ` Stan Hoeppner
[not found] ` <51732E2B.6090607@hardwarefreak.com>
2013-04-21 20:46 ` Andrei Banu
2013-04-21 23:17 ` Stan Hoeppner
2013-04-22 10:19 ` Andrei Banu
2013-04-23 2:51 ` Stan Hoeppner
2013-04-23 10:17 ` Andrei Banu
2013-04-24 3:24 ` Stan Hoeppner
2013-04-24 8:26 ` Andrei Banu
2013-04-24 9:12 ` Adam Goryachev
2013-04-24 10:24 ` Tommy Apel
2013-04-24 21:42 ` Andrei Banu
2013-04-24 21:40 ` Andrei Banu
2013-04-24 16:37 ` Stan Hoeppner
2013-04-24 21:46 ` Andrei Banu
[not found] ` <CAH3kUhHnF0imY=CAHfzaQy4XJuOMgOtbHNp17EYzeSJR2en7Fg@mail.gmail.com>
2013-04-25 10:11 ` Andrei Banu
2013-04-25 10:56 ` Stan Hoeppner
2013-04-22 23:11 ` Andrei Banu
2013-04-23 4:39 ` Stan Hoeppner
2013-04-22 23:25 ` Stan Hoeppner
2013-04-23 4:49 ` Mikael Abrahamsson
2013-04-23 6:01 ` Stan Hoeppner