* Growing RAID5 SSD Array
@ 2014-03-13 2:49 Adam Goryachev
2014-03-13 11:58 ` Stan Hoeppner
0 siblings, 1 reply; 16+ messages in thread
From: Adam Goryachev @ 2014-03-13 2:49 UTC (permalink / raw)
To: linux-raid
Hi all,
About a year ago I set up a RAID5 array with 5 x Intel 480GB SSDs (with
a huge amount of help from the list in general, and Stan in particular;
thanks again). Now I need to grow my array to 6 drives to get a little
extra storage capacity, and I just want to confirm I'm not doing anything
crazy/stupid, and take the opportunity to re-check what I've got.
So, currently I have 5 x Intel 480GB SSD:
Device Model: INTEL SSDSC2CW480A3
Serial Number: CVCV205201PK480DGN
LU WWN Device Id: 5 001517 bb2833c5f
Firmware Version: 400i
User Capacity: 480,103,981,056 bytes [480 GB]
Sector Size: 512 bytes logical/physical
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: ACS-2 revision 3
Local Time is: Thu Mar 13 13:40:20 2014 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
md1 : active raid5 sdc1[7] sde1[9] sdf1[5] sdd1[8] sda1[6]
1875391744 blocks super 1.2 level 5, 64k chunk, algorithm 2 [5/5]
[UUUUU]
/dev/md1:
Version : 1.2
Creation Time : Wed Aug 22 00:47:03 2012
Raid Level : raid5
Array Size : 1875391744 (1788.51 GiB 1920.40 GB)
Used Dev Size : 468847936 (447.13 GiB 480.10 GB)
Raid Devices : 5
Total Devices : 5
Persistence : Superblock is persistent
Update Time : Thu Mar 13 13:41:03 2014
State : active
Active Devices : 5
Working Devices : 5
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 64K
Name : san1:1 (local to host san1)
UUID : 707957c0:b7195438:06da5bc4:485d301c
Events : 1712560
Number Major Minor RaidDevice State
7 8 33 0 active sync /dev/sdc1
6 8 1 1 active sync /dev/sda1
8 8 49 2 active sync /dev/sdd1
5 8 81 3 active sync /dev/sdf1
9 8 65 4 active sync /dev/sde1
One thing I've noticed is that, on average, some drives seem to have more
activity than others (ie, watching the flashing lights). However, here
are the stats from the drives themselves:
/dev/sda
241 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 845235
242 Total_LBAs_Read    0x0032 100 100 000 Old_age Always - 1725102
/dev/sdb
241 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 0
242 Total_LBAs_Read    0x0032 100 100 000 Old_age Always - 0
/dev/sdc
241 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 851335
242 Total_LBAs_Read    0x0032 100 100 000 Old_age Always - 1715159
/dev/sdd
241 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 804564
242 Total_LBAs_Read    0x0032 100 100 000 Old_age Always - 1670041
/dev/sde
241 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 719767
242 Total_LBAs_Read    0x0032 100 100 000 Old_age Always - 1577363
/dev/sdf
241 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 719982
242 Total_LBAs_Read    0x0032 100 100 000 Old_age Always - 1577900
sdb is the new drive obviously, not yet part of the array.
So the drive with the highest writes 851335 and the drive with the
lowest writes 719982 show a big difference. Perhaps I have a problem
with the setup/config of my array, or similar?
So, I could simply do the following:
mdadm --manage /dev/md1 --add /dev/sdb1
mdadm --grow /dev/md1 --raid-devices=6
Probably also need to remove the bitmap and re-add the bitmap.
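Presumably the bitmap steps wrap around the grow, something like this
(untested, going by the mdadm man page):
mdadm --grow /dev/md1 --bitmap=none        # remove the internal bitmap first
mdadm --manage /dev/md1 --add /dev/sdb1
mdadm --grow /dev/md1 --raid-devices=6
mdadm --grow /dev/md1 --bitmap=internal    # re-add it after the reshape completes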
Can anyone suggest if what I am seeing is "normal", and should I just go
ahead and add the extra disk?
Regards,
Adam
--
Adam Goryachev Website Managers www.websitemanagers.com.au
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Growing RAID5 SSD Array
2014-03-13 2:49 Growing RAID5 SSD Array Adam Goryachev
@ 2014-03-13 11:58 ` Stan Hoeppner
2014-03-17 5:43 ` Adam Goryachev
0 siblings, 1 reply; 16+ messages in thread
From: Stan Hoeppner @ 2014-03-13 11:58 UTC (permalink / raw)
To: Adam Goryachev, linux-raid
On 3/12/2014 9:49 PM, Adam Goryachev wrote:
...
> Number Major Minor RaidDevice State
> 7 8 33 0 active sync /dev/sdc1
> 6 8 1 1 active sync /dev/sda1
> 8 8 49 2 active sync /dev/sdd1
> 5 8 81 3 active sync /dev/sdf1
> 9 8 65 4 active sync /dev/sde1
...
> /dev/sda Total_LBAs_Written 845235
> /dev/sdc Total_LBAs_Written 851335
> /dev/sdd Total_LBAs_Written 804564
> /dev/sde Total_LBAs_Written 719767
> /dev/sdf Total_LBAs_Written 719982
...
> So the drive with the highest writes 851335 and the drive with the
> lowest writes 719982 show a big difference. Perhaps I have a problem
> with the setup/config of my array, or similar?
This is normal for striped arrays. If we reorder your write statistics
table to reflect array device order, we can clearly see the effect of
partial stripe writes. These are new file allocations, appends, etc
that are smaller than stripe width. Totally normal. To get these close
to equal you'd need a chunk size of 16K or smaller.
> /dev/sdc Total_LBAs_Written 851335
> /dev/sda Total_LBAs_Written 845235
> /dev/sdd Total_LBAs_Written 804564
> /dev/sde Total_LBAs_Written 719767
> /dev/sdf Total_LBAs_Written 719982
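As a rough worked example of why: with your 64K chunk and 5 drives there
are 4 data chunks per stripe, so
stripe width = 4 data chunks x 64K = 256K
one 64K append = 1 data chunk + 1 parity chunk = writes to only 2 of 5 drives
Which 2 drives get hit depends on where in the stripe the write lands, so
the per-drive write counters drift apart over time.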
> So, I could simply do the following:
> mdadm --manage /dev/md1 --add /dev/sdb1
> mdadm --grow /dev/md1 --raid-devices=6
>
> Probably also need to remove the bitmap and re-add the bitmap.
Might want to do
~$ echo 250000 > /proc/sys/dev/raid/speed_limit_min
~$ echo 500000 > /proc/sys/dev/raid/speed_limit_max
That'll bump min resync to 250 MB/s per drive, max 500 MB/s. IIRC the
defaults are 1 MB/s and 200 MB/s.
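You can check what's currently set with:
~$ cat /proc/sys/dev/raid/speed_limit_min /proc/sys/dev/raid/speed_limit_max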
> Can anyone suggest if what I am seeing is "normal", and should I just go
> ahead and add the extra disk?
Don't see why not. You might want to stop drbd first.
--
Stan
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Growing RAID5 SSD Array
2014-03-13 11:58 ` Stan Hoeppner
@ 2014-03-17 5:43 ` Adam Goryachev
2014-03-17 21:43 ` Stan Hoeppner
0 siblings, 1 reply; 16+ messages in thread
From: Adam Goryachev @ 2014-03-17 5:43 UTC (permalink / raw)
To: stan, linux-raid
On 13/03/14 22:58, Stan Hoeppner wrote:
> On 3/12/2014 9:49 PM, Adam Goryachev wrote:
> ...
>> Number Major Minor RaidDevice State
>> 7 8 33 0 active sync /dev/sdc1
>> 6 8 1 1 active sync /dev/sda1
>> 8 8 49 2 active sync /dev/sdd1
>> 5 8 81 3 active sync /dev/sdf1
>> 9 8 65 4 active sync /dev/sde1
> ...
>> /dev/sda Total_LBAs_Written 845235
>> /dev/sdc Total_LBAs_Written 851335
>> /dev/sdd Total_LBAs_Written 804564
>> /dev/sde Total_LBAs_Written 719767
>> /dev/sdf Total_LBAs_Written 719982
> ...
>> So the drive with the highest writes 851335 and the drive with the
>> lowest writes 719982 show a big difference. Perhaps I have a problem
>> with the setup/config of my array, or similar?
> This is normal for striped arrays. If we reorder your write statistics
> table to reflect array device order, we can clearly see the effect of
> partial stripe writes. These are new file allocations, appends, etc
> that are smaller than stripe width. Totally normal. To get these close
> to equal you'd need a chunk size of 16K or smaller.
Would that have a material impact on performance?
While current wear stats (Media Wearout Indicator) are all 98 or higher,
at some point, would it be reasonable to fail the drive with the lowest
write count, and then use it to replace the drive with the highest write
count, repeating twice, so that over the next period of time usage
should merge toward the average? Given the current wear rate, will
probably replace all the drives in 5 years, which is well before they
reach 50% wear anyway.
>> So, I could simply do the following:
>> mdadm --manage /dev/md1 --add /dev/sdb1
>> mdadm --grow /dev/md1 --raid-devices=6
>>
>> Probably also need to remove the bitmap and re-add the bitmap.
> Might want to do
>
> ~$ echo 250000 > /proc/sys/dev/raid/speed_limit_min
> ~$ echo 500000 > /proc/sys/dev/raid/speed_limit_max
>
> That'll bump min resync to 250 MB/s per drive, max 500 MB/s. IIRC the
> defaults are 1 MB/s and 200 MB/s.
Worked perfectly on one machine, but the second machine hung and
basically crashed. It almost turned into a disaster, but thankfully,
having two copies over the two machines, I managed to get everything
sorted. After a reboot, the second machine recovered and it grew the
array also.
Some of the logs from that time:
Mar 13 23:05:59 san2 kernel: [42511.418380] RAID conf printout:
Mar 13 23:05:59 san2 kernel: [42511.418385] --- level:5 rd:6 wd:6
Mar 13 23:05:59 san2 kernel: [42511.418388] disk 0, o:1, dev:sdc1
Mar 13 23:05:59 san2 kernel: [42511.418390] disk 1, o:1, dev:sde1
Mar 13 23:05:59 san2 kernel: [42511.418392] disk 2, o:1, dev:sdd1
Mar 13 23:05:59 san2 kernel: [42511.418394] disk 3, o:1, dev:sdf1
Mar 13 23:05:59 san2 kernel: [42511.418396] disk 4, o:1, dev:sda1
Mar 13 23:05:59 san2 kernel: [42511.418399] disk 5, o:1, dev:sdb1
Mar 13 23:05:59 san2 kernel: [42511.418444] md: reshape of RAID array md1
Mar 13 23:05:59 san2 kernel: [42511.418448] md: minimum _guaranteed_
speed: 1000 KB/sec/disk.
Mar 13 23:05:59 san2 kernel: [42511.418451] md: using maximum available
idle IO bandwidth (but not more than 200000 KB/sec) for reshape.
Mar 13 23:05:59 san2 kernel: [42511.418493] md: using 128k window, over
a total of 468847936k.
Mar 13 23:06:00 san2 kernel: [42511.512165] md: md_do_sync() got signal
... exiting
Mar 13 23:07:01 san2 kernel: [42573.067781] iscsi_trgt: Abort Task (01)
issued on tid:9 lun:0 by sid:8162774362161664 (Function Complete)
Mar 13 23:07:01 san2 kernel: [42573.067789] iscsi_trgt: Abort Task (01)
issued on tid:11 lun:0 by sid:7318349599801856 (Function Complete)
Mar 13 23:07:01 san2 kernel: [42573.067797] iscsi_trgt: Abort Task (01)
issued on tid:12 lun:0 by sid:6473924787110400 (Function Complete)
Mar 13 23:07:01 san2 kernel: [42573.067838] iscsi_trgt: Abort Task (01)
issued on tid:14 lun:0 by sid:5348025014485504 (Function Complete)
Mar 13 23:07:02 san2 kernel: [42573.237591] iscsi_trgt: Abort Task (01)
issued on tid:8 lun:0 by sid:4503599899804160 (Function Complete)
Mar 13 23:07:02 san2 kernel: [42573.237600] iscsi_trgt: Abort Task (01)
issued on tid:2 lun:0 by sid:14918173819994624 (Function Complete)
I probably hit CTRL-C causing the "got signal... exiting" because the
system wasn't responding. There are a *lot* more iscsi errors and then
these:
Mar 13 23:09:09 san2 kernel: [42700.645060] INFO: task md1_raid5:314
blocked for more than 120 seconds.
Mar 13 23:09:09 san2 kernel: [42700.645087] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 13 23:09:09 san2 kernel: [42700.645117] md1_raid5 D
ffff880236833780 0 314 2 0x00000000
Mar 13 23:09:09 san2 kernel: [42700.645123] ffff88022fc53690
0000000000000046 ffff8801ee330240 ffff88023593e0c0
Mar 13 23:09:09 san2 kernel: [42700.645128] 0000000000013780
ffff88022d859fd8 ffff88022d859fd8 ffff88022fc53690
Mar 13 23:09:09 san2 kernel: [42700.645133] ffff8801ee4b85b8
ffffffff81071011 0000000000000046 ffff8802307aa000
Mar 13 23:09:09 san2 kernel: [42700.645138] Call Trace:
Mar 13 23:09:09 san2 kernel: [42700.645146] [<ffffffff81071011>] ?
arch_local_irq_save+0x11/0x17
Mar 13 23:09:09 san2 kernel: [42700.645160] [<ffffffffa0111c44>] ?
check_reshape+0x27b/0x51a [raid456]
Mar 13 23:09:09 san2 kernel: [42700.645165] [<ffffffff8103f6ba>] ?
try_to_wake_up+0x197/0x197
Mar 13 23:09:09 san2 kernel: [42700.645175] [<ffffffffa0060381>] ?
md_check_recovery+0x2a5/0x514 [md_mod]
Mar 13 23:09:09 san2 kernel: [42700.645181] [<ffffffffa01156fe>] ?
raid5d+0x1c/0x483 [raid456]
Mar 13 23:09:09 san2 kernel: [42700.645187] [<ffffffff8134fdc7>] ?
_raw_spin_unlock_irqrestore+0xe/0xf
Mar 13 23:09:09 san2 kernel: [42700.645192] [<ffffffff8134eedb>] ?
schedule_timeout+0x2c/0xdb
Mar 13 23:09:09 san2 kernel: [42700.645195] [<ffffffff81071011>] ?
arch_local_irq_save+0x11/0x17
Mar 13 23:09:09 san2 kernel: [42700.645199] [<ffffffff81071011>] ?
arch_local_irq_save+0x11/0x17
Mar 13 23:09:09 san2 kernel: [42700.645206] [<ffffffffa005a256>] ?
md_thread+0x114/0x132 [md_mod]
Mar 13 23:09:09 san2 kernel: [42700.645212] [<ffffffff8105fcd3>] ?
add_wait_queue+0x3c/0x3c
Mar 13 23:09:09 san2 kernel: [42700.645219] [<ffffffffa005a142>] ?
md_rdev_init+0xea/0xea [md_mod]
Mar 13 23:09:09 san2 kernel: [42700.645224] [<ffffffff8105f681>] ?
kthread+0x76/0x7e
Mar 13 23:09:09 san2 kernel: [42700.645229] [<ffffffff81356ef4>] ?
kernel_thread_helper+0x4/0x10
Mar 13 23:09:09 san2 kernel: [42700.645234] [<ffffffff8105f60b>] ?
kthread_worker_fn+0x139/0x139
Mar 13 23:09:09 san2 kernel: [42700.645238] [<ffffffff81356ef0>] ?
gs_change+0x13/0x13
Mar 13 23:11:09 san2 kernel: [42820.250905] INFO: task md1_raid5:314
blocked for more than 120 seconds.
Mar 13 23:11:09 san2 kernel: [42820.250932] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 13 23:11:09 san2 kernel: [42820.250961] md1_raid5 D
ffff880236833780 0 314 2 0x00000000
Mar 13 23:11:09 san2 kernel: [42820.250967] ffff88022fc53690
0000000000000046 ffff8801ee330240 ffff88023593e0c0
Mar 13 23:11:09 san2 kernel: [42820.250973] 0000000000013780
ffff88022d859fd8 ffff88022d859fd8 ffff88022fc53690
Mar 13 23:11:09 san2 kernel: [42820.250978] ffff8801ee4b85b8
ffffffff81071011 0000000000000046 ffff8802307aa000
Mar 13 23:11:09 san2 kernel: [42820.250982] Call Trace:
Mar 13 23:11:09 san2 kernel: [42820.250991] [<ffffffff81071011>] ?
arch_local_irq_save+0x11/0x17
Mar 13 23:11:09 san2 kernel: [42820.251004] [<ffffffffa0111c44>] ?
check_reshape+0x27b/0x51a [raid456]
Mar 13 23:11:09 san2 kernel: [42820.251009] [<ffffffff8103f6ba>] ?
try_to_wake_up+0x197/0x197
Mar 13 23:11:09 san2 kernel: [42820.251019] [<ffffffffa0060381>] ?
md_check_recovery+0x2a5/0x514 [md_mod]
Mar 13 23:11:09 san2 kernel: [42820.251025] [<ffffffffa01156fe>] ?
raid5d+0x1c/0x483 [raid456]
Mar 13 23:11:09 san2 kernel: [42820.251031] [<ffffffff8134fdc7>] ?
_raw_spin_unlock_irqrestore+0xe/0xf
Mar 13 23:11:09 san2 kernel: [42820.251035] [<ffffffff8134eedb>] ?
schedule_timeout+0x2c/0xdb
Mar 13 23:11:09 san2 kernel: [42820.251039] [<ffffffff81071011>] ?
arch_local_irq_save+0x11/0x17
Mar 13 23:11:09 san2 kernel: [42820.251043] [<ffffffff81071011>] ?
arch_local_irq_save+0x11/0x17
Mar 13 23:11:09 san2 kernel: [42820.251050] [<ffffffffa005a256>] ?
md_thread+0x114/0x132 [md_mod]
Mar 13 23:11:09 san2 kernel: [42820.251056] [<ffffffff8105fcd3>] ?
add_wait_queue+0x3c/0x3c
Mar 13 23:11:09 san2 kernel: [42820.251063] [<ffffffffa005a142>] ?
md_rdev_init+0xea/0xea [md_mod]
Mar 13 23:11:09 san2 kernel: [42820.251068] [<ffffffff8105f681>] ?
kthread+0x76/0x7e
Mar 13 23:11:09 san2 kernel: [42820.251073] [<ffffffff81356ef4>] ?
kernel_thread_helper+0x4/0x10
Mar 13 23:11:09 san2 kernel: [42820.251078] [<ffffffff8105f60b>] ?
kthread_worker_fn+0x139/0x139
Mar 13 23:11:09 san2 kernel: [42820.251082] [<ffffffff81356ef0>] ?
gs_change+0x13/0x13
Plus a few more (can provide them if interested), then more iscsi
errors, and finally I rebooted the machine:
Mar 14 00:55:08 san2 kernel: [ 4.415215] md/raid:md1: not clean --
starting background reconstruction
Mar 14 00:55:08 san2 kernel: [ 4.415216] md/raid:md1: reshape will
continue
Mar 14 00:55:08 san2 kernel: [ 4.415223] md/raid:md1: device sdc1
operational as raid disk 0
Mar 14 00:55:08 san2 kernel: [ 4.415225] md/raid:md1: device sdb1
operational as raid disk 5
Mar 14 00:55:08 san2 kernel: [ 4.415226] md/raid:md1: device sda1
operational as raid disk 4
Mar 14 00:55:08 san2 kernel: [ 4.415227] md/raid:md1: device sdf1
operational as raid disk 3
Mar 14 00:55:08 san2 kernel: [ 4.415228] md/raid:md1: device sdd1
operational as raid disk 2
Mar 14 00:55:08 san2 kernel: [ 4.415230] md/raid:md1: device sde1
operational as raid disk 1
Mar 14 00:55:08 san2 kernel: [ 4.415477] md/raid:md1: allocated 6384kB
Mar 14 00:55:08 san2 kernel: [ 4.415491] md/raid:md1: raid level 5
active with 6 out of 6 devices, algorithm 2
Mar 14 00:55:08 san2 kernel: [ 4.415492] RAID conf printout:
Mar 14 00:55:08 san2 kernel: [ 4.415493] --- level:5 rd:6 wd:6
Mar 14 00:55:08 san2 kernel: [ 4.415494] disk 0, o:1, dev:sdc1
Mar 14 00:55:08 san2 kernel: [ 4.415495] disk 1, o:1, dev:sde1
Mar 14 00:55:08 san2 kernel: [ 4.415496] disk 2, o:1, dev:sdd1
Mar 14 00:55:08 san2 kernel: [ 4.415497] disk 3, o:1, dev:sdf1
Mar 14 00:55:08 san2 kernel: [ 4.415498] disk 4, o:1, dev:sda1
Mar 14 00:55:08 san2 kernel: [ 4.415499] disk 5, o:1, dev:sdb1
Mar 14 00:55:08 san2 kernel: [ 4.415526] md1: detected capacity
change from 0 to 1920401145856
Mar 14 00:55:08 san2 kernel: [ 4.416733] md1: unknown partition table
Later, after the resync completed, I grew the array to make the extra
space available:
Mar 14 01:37:02 san2 kernel: [ 2514.928987] md: md1: reshape done.
Mar 14 01:37:02 san2 kernel: [ 2514.982394] RAID conf printout:
Mar 14 01:37:02 san2 kernel: [ 2514.982398] --- level:5 rd:6 wd:6
Mar 14 01:37:02 san2 kernel: [ 2514.982402] disk 0, o:1, dev:sdc1
Mar 14 01:37:02 san2 kernel: [ 2514.982405] disk 1, o:1, dev:sde1
Mar 14 01:37:02 san2 kernel: [ 2514.982407] disk 2, o:1, dev:sdd1
Mar 14 01:37:02 san2 kernel: [ 2514.982410] disk 3, o:1, dev:sdf1
Mar 14 01:37:02 san2 kernel: [ 2514.982413] disk 4, o:1, dev:sda1
Mar 14 01:37:02 san2 kernel: [ 2514.982415] disk 5, o:1, dev:sdb1
Mar 14 01:37:02 san2 kernel: [ 2514.982422] md1: detected capacity
change from 1920401145856 to 2400501432320
Mar 14 01:37:02 san2 kernel: [ 2514.993988] md: resync of RAID array md1
Mar 14 01:37:02 san2 kernel: [ 2514.993992] md: minimum _guaranteed_
speed: 300000 KB/sec/disk.
Mar 14 01:37:02 san2 kernel: [ 2514.993995] md: using maximum available
idle IO bandwidth (but not more than 400000 KB/sec) for resync.
Mar 14 01:37:02 san2 kernel: [ 2514.994041] md: using 128k window, over
a total of 468847936k.
Mar 14 01:55:16 san2 kernel: [ 3605.141839] md: md1: resync done.
Mar 14 01:55:16 san2 kernel: [ 3605.172547] RAID conf printout:
Mar 14 01:55:16 san2 kernel: [ 3605.172551] --- level:5 rd:6 wd:6
Mar 14 01:55:16 san2 kernel: [ 3605.172554] disk 0, o:1, dev:sdc1
Mar 14 01:55:16 san2 kernel: [ 3605.172556] disk 1, o:1, dev:sde1
Mar 14 01:55:16 san2 kernel: [ 3605.172558] disk 2, o:1, dev:sdd1
Mar 14 01:55:16 san2 kernel: [ 3605.172560] disk 3, o:1, dev:sdf1
Mar 14 01:55:16 san2 kernel: [ 3605.172562] disk 4, o:1, dev:sda1
Mar 14 01:55:16 san2 kernel: [ 3605.172564] disk 5, o:1, dev:sdb1
This did lead to another observation.... The speed of the resync seemed
limited by something other than disk IO. It was usually around 250 to
300MB/s, the maximum achieved was around 420MB/s. I also noticed that
idle CPU time on one of the cores was relatively low, though I never saw
it hit 0 (minimum I saw was 12% idle, average around 20%).
So, I'm wondering whether I should consider upgrading the CPU and/or
motherboard to try and improve peak performance?
Currently I have Intel Xeon E3-1230V2/3.3GHz/8MB
Cache/4core/8thread/5GTs, my supplier has offered a number of options:
1) Compatible with current motherboard
Intel Xeon E3-1280V2/3.6GHz/8MB Cache/4core/8thread/5GTs
2) Intel Xeon E5-2620V2/2.1GHz/15MB Cache/6core/12thread/5GTs
3) Intel Xeon E5-2630V2/2.6GHz/15MB Cache/6core/12thread/7.2GTs
My understanding is that the RAID5 is single threaded, so will work best
with a higher speed single core CPU compared to a larger number of cores
at a lower speed. However, I'm not sure how much "work" is being done
across the various models. ie, does a E5 CPU do more work even though it
has a lower clock speed? Does this carry over to the E7 class as well?
Currently I'm looking to replace at least the motherboard with
http://www.supermicro.com/products/motherboard/Xeon/C202_C204/X9SCM-F.cfm in
order to get 2 of the PCIe 2.0 8x slots (one for the existing LSI SATA
controller and one for a dual port 10Gb ethernet card). This will provide
a 10Gb cross-over connection between the two servers, plus replace the 8
x 1G ports with a single 10Gb port (solving the load balancing across
the multiple links issue). Finally, this 28 port (4 x 10G + 24 x 1G)
switch
http://www.netgear.com.au/business/products/switches/stackable-smart-switches/GS728TXS.aspx#
should allow the 2 x 10G connections to be connected through to the 8
servers with 2 x 1G connections each, using multipath SCSI to set up two
connections (one on each 1G port) with the same destination (the 10G port).
Any suggestions/comments would be welcome.
Regards,
Adam
--
Adam Goryachev Website Managers www.websitemanagers.com.au
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Growing RAID5 SSD Array
2014-03-17 5:43 ` Adam Goryachev
@ 2014-03-17 21:43 ` Stan Hoeppner
2014-03-18 1:41 ` Adam Goryachev
0 siblings, 1 reply; 16+ messages in thread
From: Stan Hoeppner @ 2014-03-17 21:43 UTC (permalink / raw)
To: Adam Goryachev, linux-raid
On 3/17/2014 12:43 AM, Adam Goryachev wrote:
> On 13/03/14 22:58, Stan Hoeppner wrote:
>> On 3/12/2014 9:49 PM, Adam Goryachev wrote:
>> ...
>>> Number Major Minor RaidDevice State
>>> 7 8 33 0 active sync /dev/sdc1
>>> 6 8 1 1 active sync /dev/sda1
>>> 8 8 49 2 active sync /dev/sdd1
>>> 5 8 81 3 active sync /dev/sdf1
>>> 9 8 65 4 active sync /dev/sde1
>> ...
>>> /dev/sda Total_LBAs_Written 845235
>>> /dev/sdc Total_LBAs_Written 851335
>>> /dev/sdd Total_LBAs_Written 804564
>>> /dev/sde Total_LBAs_Written 719767
>>> /dev/sdf Total_LBAs_Written 719982
>> ...
>>> So the drive with the highest writes 851335 and the drive with the
>>> lowest writes 719982 show a big difference. Perhaps I have a problem
>>> with the setup/config of my array, or similar?
>> This is normal for striped arrays. If we reorder your write statistics
>> table to reflect array device order, we can clearly see the effect of
>> partial stripe writes. These are new file allocations, appends, etc
>> that are smaller than stripe width. Totally normal. To get these close
>> to equal you'd need a chunk size of 16K or smaller.
>
> Would that have a material impact on performance?
Not with SSDs. If this was a rust array you'd probably want an 8KB or
16KB chunk to more evenly spread the small write IOs.
> While current wear stats (Media Wearout Indicator) are all 98 or higher,
> at some point, would it be reasonable to fail the drive with the lowest
> write count, and then use it to replace the drive with the highest write
> count, repeating twice, so that over the next period of time usage
> should merge toward the average? Given the current wear rate, will
> probably replace all the drives in 5 years, which is well before they
> reach 50% wear anyway.
Given the level of production write activity on your array, doing what
you suggest above will simply cause leapfrogging, taking drives with
lesser wear on them and shooting them way out in front of the drives
with the most wear. In fact, any array operations you perform are
putting far more wear on the flash cells than normal operation is.
>>> So, I could simply do the following:
>>> mdadm --manage /dev/md1 --add /dev/sdb1
>>> mdadm --grow /dev/md1 --raid-devices=6
>>>
>>> Probably also need to remove the bitmap and re-add the bitmap.
>> Might want to do
>>
>> ~$ echo 250000 > /proc/sys/dev/raid/speed_limit_min
>> ~$ echo 500000 > /proc/sys/dev/raid/speed_limit_max
>>
>> That'll bump min resync to 250 MB/s per drive, max 500 MB/s. IIRC the
>> defaults are 1 MB/s and 200 MB/s.
>
> Worked perfectly on one machine, but the second machine hung and
> basically crashed. It almost turned into a disaster, but thankfully,
> having two copies over the two machines, I managed to get everything
> sorted. After a reboot, the second machine recovered and it grew the
> array also.
See: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=629442
This is the backup machine, yes? Last info I had from you said this box
was using rust not SSD. Is that still the case? If so you should not
have bumped the reshape speed upward as rust can't handle it, especially
with load other than md on it. Also, I recall you had to install a
backport kernel on san1 as well as a new iscsi-target package.
What kernel and iscsi-target versions are running on each of san1 and
san2? I'm guessing they're not the same.
What elevator is configured on san1 and san2? It should be noop for SSD
and deadline for rust.
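Quick check per disk (the scheduler in brackets is the active one); it
should print something like:
~$ cat /sys/block/sda/queue/scheduler
noop deadline [cfq]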
> Some of the logs from that time:
> Mar 13 23:05:59 san2 kernel: [42511.418380] RAID conf printout:
> Mar 13 23:05:59 san2 kernel: [42511.418385] --- level:5 rd:6 wd:6
> Mar 13 23:05:59 san2 kernel: [42511.418388] disk 0, o:1, dev:sdc1
> Mar 13 23:05:59 san2 kernel: [42511.418390] disk 1, o:1, dev:sde1
> Mar 13 23:05:59 san2 kernel: [42511.418392] disk 2, o:1, dev:sdd1
> Mar 13 23:05:59 san2 kernel: [42511.418394] disk 3, o:1, dev:sdf1
> Mar 13 23:05:59 san2 kernel: [42511.418396] disk 4, o:1, dev:sda1
> Mar 13 23:05:59 san2 kernel: [42511.418399] disk 5, o:1, dev:sdb1
> Mar 13 23:05:59 san2 kernel: [42511.418444] md: reshape of RAID array md1
> Mar 13 23:05:59 san2 kernel: [42511.418448] md: minimum _guaranteed_
> speed: 1000 KB/sec/disk.
> Mar 13 23:05:59 san2 kernel: [42511.418451] md: using maximum available
> idle IO bandwidth (but not more than 200000 KB/sec) for reshape.
> Mar 13 23:05:59 san2 kernel: [42511.418493] md: using 128k window, over
> a total of 468847936k.
> Mar 13 23:06:00 san2 kernel: [42511.512165] md: md_do_sync() got signal
> ... exiting
> Mar 13 23:07:01 san2 kernel: [42573.067781] iscsi_trgt: Abort Task (01)
> issued on tid:9 lun:0 by sid:8162774362161664 (Function Complete)
> Mar 13 23:07:01 san2 kernel: [42573.067789] iscsi_trgt: Abort Task (01)
> issued on tid:11 lun:0 by sid:7318349599801856 (Function Complete)
> Mar 13 23:07:01 san2 kernel: [42573.067797] iscsi_trgt: Abort Task (01)
> issued on tid:12 lun:0 by sid:6473924787110400 (Function Complete)
> Mar 13 23:07:01 san2 kernel: [42573.067838] iscsi_trgt: Abort Task (01)
> issued on tid:14 lun:0 by sid:5348025014485504 (Function Complete)
> Mar 13 23:07:02 san2 kernel: [42573.237591] iscsi_trgt: Abort Task (01)
> issued on tid:8 lun:0 by sid:4503599899804160 (Function Complete)
> Mar 13 23:07:02 san2 kernel: [42573.237600] iscsi_trgt: Abort Task (01)
> issued on tid:2 lun:0 by sid:14918173819994624 (Function Complete)
...
> I probably hit CTRL-C causing the "got signal... exiting" because the
> system wasn't responding. There are a *lot* more iscsi errors and then
> these:
> Mar 13 23:09:09 san2 kernel: [42700.645060] INFO: task md1_raid5:314
> blocked for more than 120 seconds.
The md write thread blocked for more than 2 minutes. Often these
timeouts are due to multiple processes fighting for IO. This leads me
to believe san2 has rust-based disks, and that the kernel and other
tweaks applied to san1 were not applied to san2.
...
> This did lead to another observation.... The speed of the resync seemed
> limited by something other than disk IO.
On both san1/san2 or just san1? I'm assuming for now you mean san1 only.
> It was usually around 250 to
> 300MB/s, the maximum achieved was around 420MB/s. I also noticed that
> idle CPU time on one of the cores was relatively low, though I never saw
> it hit 0 (minimum I saw was 12% idle, average around 20%).
Never look at idle; look at what's eating the CPU. Was that 80+% being
eaten by sys, wa, or a process? Without that information it's not
possible to definitively answer your questions below.
Do note, recall that during fio testing you were hitting 1.6 GB/s write
throughput, ~4x greater than the resync throughput stated above. If one
of your cores was at greater than 80% utilization with only ~420 MB/s of
resync throughput, then something other than the md write thread was
hammering that core.
> So, I'm wondering whether I should consider upgrading the CPU and/or
> motherboard to try and improve peak performance?
As I mentioned after walking you through all of the fio testing, you
have far more hardware than your workload needs.
> Currently I have Intel Xeon E3-1230V2/3.3GHz/8MB
> Cache/4core/8thread/5GTs, my supplier has offered a number of options:
> 1) Compatible with current motherboard
> Intel Xeon E3-1280V2/3.6GHz/8MB Cache/4core/8thread/5GTs
This may gain you 5% peak RAID5 throughput.
> 2) Intel Xeon E5-2620V2/2.1GHz/15MB Cache/6core/12thread/5GTs
> 3) Intel Xeon E5-2630V2/2.6GHz/15MB Cache/6core/12thread/7.2GTs
Both of these will decrease your peak RAID5 throughput quite markedly.
md raid5 is clock sensitive, not cache sensitive.
> My understanding is that the RAID5 is single threaded, so will work best
> with a higher speed single core CPU compared to a larger number of cores
> at a lower speed. However, I'm not sure how much "work" is being done
> across the various models. ie, does a E5 CPU do more work even though it
> has a lower clock speed? Does this carry over to the E7 class as well?
You're chasing a red herring. Any performance issue you currently have,
and I've seen no evidence of such to this point, is not due to the model
of CPU in the box. It's due to tuning, administration, etc.
> Currently I'm looking to replace at least the motherboard with
> http://www.supermicro.com/products/motherboard/Xeon/C202_C204/X9SCM-F.cfm in
> order to get 2 of the PCIe 2.0 8x slots (one for the existing LSI SATA
> controller and one for a dual port 10Gb ethernet card). This will provide
> a 10Gb cross-over connection between the two servers, plus replace the 8
> x 1G ports with a single 10Gb port (solving the load balancing across
> the multiple links issue). Finally, this 28 port (4 x 10G + 24 x 1G)
> switch
Adam if you have the budget now I absolutely agree that 10 GbE is a much
better solution than the multi-GbE setup. But you don't need a new
motherboard. The S1200BTLR has 4 PCIe 2.0 slots: one x8 electrical in
x16 physical slot, and three x4 electrical in x8 physical slots. Your
bandwidth per slot is:
x8 4 GB/s unidirectional x2 <- occupied by LSI SAS HBA
x4 2 GB/s unidirectional x2 <- occupied by quad port GbE cards
10 Gbps Ethernet has a 1 GB/s effective data rate one way. Inserting an
x8 PCIe card into an x4 electrical/x8 physical slot gives you 4 active
lanes for 2+2 GB/s bandwidth. This is an exact match for a dual port 10
GbE card. You could install up to three dual port 10 GbE cards into
these 3 slots of the S1200BTLR.
> http://www.netgear.com.au/business/products/switches/stackable-smart-switches/GS728TXS.aspx#
> should allow the 2 x 10G connections to be connected through to the 8
> servers with 2 x 1G connections each using multipath scsi to setup two
> connections (one on each 1G port) with the same destination (10G port)
>
> Any suggestions/comments would be welcome.
You'll want to use SFP+ NICs and passive Twin-Ax cables to avoid paying the
$2000 fiber tax, as that is what four SFP+ 10 Gbit fiber LC transceivers
cost--$500 each. The only SFP+ Intel dual port 10 GbE NIC that ships
with vacant SFP+ ports is the X520-DA2:
http://www.newegg.com/Product/Product.aspx?Item=N82E16833106044
To connect the NICs to the switch and to one another you'll need 3 or 4
SFP+ passive Twin-Ax cables of appropriate length. Three if direct
server-to-server works, four if it doesn't, in which case you connect
all 4 to the 4 SFP+ switch ports. You'll need to contact Intel and
inquire about the NIC-to-NIC functionality. I'm not using the word
cross-over because I don't believe it applies to Twin-Ax cable. But you
need to confirm their NICs will auto negotiate the send/receive pairs.
This isn't twisted pair cable Adam. It's a different beast entirely.
You can't run the length you want, cut the cable and terminate it
yourself. These cables must be pre-made to length and terminated at the
factory. One look at the prices tells you that. The 1 meter Intel
cable costs more than a 500ft spool of Cat 5e. A 1 meter and a 3 meter
Passive Twin-Ax cable, Intel and Netgear:
http://www.newegg.com/Product/Product.aspx?Item=N82E16812128002
http://www.newegg.com/Product/Product.aspx?Item=N82E16812638004
If the server to switch distance is much over 15ft you will need to
inquire with Intel and Netgear about the possibility of using active
Twin-Ax cables. If their products do not support active cables you'll
have to go with fiber, and spend the extra $2000 for the 4 transceivers,
along with one LC-to-LC multimode fiber cable for the server-to-server
link, and two straight through LC-LC multimode fiber cables.
--
Stan
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Growing RAID5 SSD Array
2014-03-17 21:43 ` Stan Hoeppner
@ 2014-03-18 1:41 ` Adam Goryachev
2014-03-18 11:22 ` Stan Hoeppner
0 siblings, 1 reply; 16+ messages in thread
From: Adam Goryachev @ 2014-03-18 1:41 UTC (permalink / raw)
To: stan, linux-raid
On 18/03/14 08:43, Stan Hoeppner wrote:
> On 3/17/2014 12:43 AM, Adam Goryachev wrote:
>> On 13/03/14 22:58, Stan Hoeppner wrote:
>>> On 3/12/2014 9:49 PM, Adam Goryachev wrote:
>>>> So, I could simply do the following:
>>>> mdadm --manage /dev/md1 --add /dev/sdb1
>>>> mdadm --grow /dev/md1 --raid-devices=6
>>>>
>>>> Probably also need to remove the bitmap and re-add the bitmap.
>>> Might want to do
>>>
>>> ~$ echo 250000 > /proc/sys/dev/raid/speed_limit_min
>>> ~$ echo 500000 > /proc/sys/dev/raid/speed_limit_max
>>>
>>> That'll bump min resync to 250 MB/s per drive, max 500 MB/s. IIRC the
>>> defaults are 1 MB/s and 200 MB/s.
>> Worked perfectly on one machine, but the second machine hung and
>> basically crashed. It almost turned into a disaster, but thankfully,
>> having two copies over the two machines, I managed to get everything
>> sorted. After a reboot, the second machine recovered and it grew the
>> array also.
> See: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=629442
>
> This is the backup machine, yes? Last info I had from you said this box
> was using rust not SSD. Is that still the case? If so you should not
> have bumped the reshape speed upward as rust can't handle it, especially
> with load other than md on it.
The second machine is hardware and software identical to the primary
now, ie, both had 5 x 480GB SSD, and I added 1 x 480GB SSD to each.
> Also, I recall you had to install a
> backport kernel on san1 as well as a new iscsi-target package.
>
> What kernel and iscsi-target versions are running on each of san1 and
> san2? I'm guessing they're not the same.
Yep, I did install 3.2.41-2~bpo60+1 some time ago, but it looks like
I've upgraded to 3.2.54-2 since then, and that is the version currently
running.
ii iscsitarget 1.4.20.2-10.1 amd64 iSCSI
Enterprise Target userland tools
ii iscsitarget-dkms 1.4.20.2-10.1 all iSCSI
Enterprise Target kernel module source - dkms version
Versions are identical on both machines. I don't think it is an iscsi
issue; I think iscsi had a problem because the kernel stopped providing
IO...
> What elevator is configured on san1 and san2? It should be noop for SSD
> and deadline for rust.
This is from /etc/rc.local:
for disk in sda sdb sdc sdd sde sdf sdg
do
echo noop > /sys/block/${disk}/queue/scheduler
echo 128 > /sys/block/${disk}/queue/nr_requests
done
echo 4096 > /sys/block/md1/md/stripe_cache_size
It is identical on both machines.
NOTE: I just added sdg to the end now, so it wasn't there before.
However, sdg is/would have been the OS 120G SSD, therefore shouldn't
make any difference with the raid array.
I was thinking recently that maybe I should try and use cfq or deadline,
as one of the issues I'm getting is IO starvation with multiple heavy IO
workloads. ie, if I leave the DRBD connection up between the machines,
single copy from a client is around 25 to 30MB/s, but if I do two copies
I can see each copy take turns for around 5 or more seconds each.
Although I'm hoping the below faster interconnect will help to resolve this.
>> Some of the logs from that time:
>> Mar 13 23:05:59 san2 kernel: [42511.418380] RAID conf printout:
>> Mar 13 23:05:59 san2 kernel: [42511.418385] --- level:5 rd:6 wd:6
>> Mar 13 23:05:59 san2 kernel: [42511.418388] disk 0, o:1, dev:sdc1
>> Mar 13 23:05:59 san2 kernel: [42511.418390] disk 1, o:1, dev:sde1
>> Mar 13 23:05:59 san2 kernel: [42511.418392] disk 2, o:1, dev:sdd1
>> Mar 13 23:05:59 san2 kernel: [42511.418394] disk 3, o:1, dev:sdf1
>> Mar 13 23:05:59 san2 kernel: [42511.418396] disk 4, o:1, dev:sda1
>> Mar 13 23:05:59 san2 kernel: [42511.418399] disk 5, o:1, dev:sdb1
>> Mar 13 23:05:59 san2 kernel: [42511.418444] md: reshape of RAID array md1
>> Mar 13 23:05:59 san2 kernel: [42511.418448] md: minimum _guaranteed_
>> speed: 1000 KB/sec/disk.
>> Mar 13 23:05:59 san2 kernel: [42511.418451] md: using maximum available
>> idle IO bandwidth (but not more than 200000 KB/sec) for reshape.
>> Mar 13 23:05:59 san2 kernel: [42511.418493] md: using 128k window, over
>> a total of 468847936k.
>> Mar 13 23:06:00 san2 kernel: [42511.512165] md: md_do_sync() got signal
>> ... exiting
>> Mar 13 23:07:01 san2 kernel: [42573.067781] iscsi_trgt: Abort Task (01)
>> issued on tid:9 lun:0 by sid:8162774362161664 (Function Complete)
>> Mar 13 23:07:01 san2 kernel: [42573.067789] iscsi_trgt: Abort Task (01)
>> issued on tid:11 lun:0 by sid:7318349599801856 (Function Complete)
>> Mar 13 23:07:01 san2 kernel: [42573.067797] iscsi_trgt: Abort Task (01)
>> issued on tid:12 lun:0 by sid:6473924787110400 (Function Complete)
>> Mar 13 23:07:01 san2 kernel: [42573.067838] iscsi_trgt: Abort Task (01)
>> issued on tid:14 lun:0 by sid:5348025014485504 (Function Complete)
>> Mar 13 23:07:02 san2 kernel: [42573.237591] iscsi_trgt: Abort Task (01)
>> issued on tid:8 lun:0 by sid:4503599899804160 (Function Complete)
>> Mar 13 23:07:02 san2 kernel: [42573.237600] iscsi_trgt: Abort Task (01)
>> issued on tid:2 lun:0 by sid:14918173819994624 (Function Complete)
> ...
>> I probably hit CTRL-C causing the "got signal... exiting" because the
>> system wasn't responding. There are a *lot* more iscsi errors and then
>> these:
>> Mar 13 23:09:09 san2 kernel: [42700.645060] INFO: task md1_raid5:314
>> blocked for more than 120 seconds.
> The md write thread blocked for more than 2 minutes. Often these
> timeouts are due to multiple processes fighting for IO. This leads me
> to believe san2 has rust based disk, and that the kernel and other
> tweaks applied to san1 were not applied to san2.
>
> ...
Nope, both san1 and san2 are identical.... however, yes, it looks like
IO starvation, which I suspect is because md1 was blocking, which is
where drbd/lvm2/iscsi gets the data from.
>> This did lead to another observation.... The speed of the resync seemed
>> limited by something other than disk IO.
> On both san1/san2 or just san1? I'm assuming for now you mean san1 only.
I watched the resync a lot closer on san2, because while san1 did the
resync I was driving into the office :)
>> It was usually around 250 to
>> 300MB/s, the maximum achieved was around 420MB/s. I also noticed that
>> idle CPU time on one of the cores was relatively low, though I never saw
>> it hit 0 (minimum I saw was 12% idle, average around 20%).
> Never look at idle; look at what's eating the CPU. Was that 80+% being
> eaten by sys, wa, or a process? Without that information it's not
> possible to definitively answer your questions below.
Unfortunately I should have logged the info but didn't. I am pretty sure
md1_resync was at the top of the task list...
> Do note, recall that during fio testing you were hitting 1.6 GB/s write
> throughput, ~4x greater than the resync throughput stated above. If one
> of your cores was at greater than 80% utilization with only ~420 MB/s of
> resync throughput, then something other than the md write thread was
> hammering that core.
Shouldn't be any other CPU tasks running on this machine. These machines
only do MD RAID + DRBD + LVM2 + iSCSI, there are no other tasks that run
on these systems.
>> So, I'm wondering whether I should consider upgrading the CPU and/or
>> motherboard to try and improve peak performance?
> As I mentioned after walking you through all of the fio testing, you
> have far more hardware than your workload needs.
Which is driving me insane..... I really really don't understand why I
have such horrible performance :(
I don't know what is missing or lacking to cause things to perform so
poorly in live usage when the benchmarks run so well.
Right now users are complaining about performance, and I see md1_raid5
in the top 1 or 2 process positions, but CPU utilisation is under 2%
user, 5% sys, and 3%ni, and over 95% idle, wa is practically 0....
>> My understanding is that the RAID5 is single threaded, so will work best
>> with a higher speed single core CPU compared to a larger number of cores
>> at a lower speed. However, I'm not sure how much "work" is being done
>> across the various models. ie, does a E5 CPU do more work even though it
>> has a lower clock speed? Does this carry over to the E7 class as well?
> You're chasing a red herring. Any performance issue you currently have,
> and I've seen no evidence of such to this point, is not due to the model
> of CPU in the box. It's due to tuning, administration, etc.
OK, so forgetting about a newer CPU then (I really can't imagine that
any near-modern CPU should not be capable of this workload, but I'm
struggling to solve the underlying issues, and I'm hoping that throwing
hardware at it will help...). Obviously CPU hardware is the wrong fit though.
>> Currently I'm looking to replace at least the motherboard with
>> http://www.supermicro.com/products/motherboard/Xeon/C202_C204/X9SCM-F.cfm in
>> order to get 2 of the PCIe 2.0 8x slots (one for the existing LSI SATA
>> controller and one for a dual port 10Gb ethernet card). This will provide
>> a 10Gb cross-over connection between the two servers, plus replace the 8
>> x 1G ports with a single 10Gb port (solving the load balancing across
>> the multiple links issue). Finally, this 28 port (4 x 10G + 24 x 1G)
>> switch
> Adam if you have the budget now I absolutely agree that 10 GbE is a much
> better solution than the multi-GbE setup.
Well, I've been tasked to fix the problem..... Whatever it takes. I just
don't know what I should be targeting....
> But you don't need a new
> motherboard. The S1200BTLR has 4 PCIe 2.0 slots: one x8 electrical in
> x16 physical slot, and three x4 electrical in x8 physical slots. Your
> bandwidth per slot is:
>
> x8 4 GB/s unidirectional x2 <- occupied by LSI SAS HBA
> x4 2 GB/s unidirectional x2 <- occupied by quad port GbE cards
>
> 10 Gbps Ethernet has a 1 GB/s effective data rate one way. Inserting an
> x8 PCIe card into an x4 electrical/x8 physical slot gives you 4 active
> lanes for 2+2 GB/s bandwidth. This is an exact match for a dual port 10
> GbE card. You could install up to three dual port 10 GbE cards into
> these 3 slots of the S1200BTLR.
This is somewhat beyond my knowledge, but I'm trying to understand, so
thank you for the information. From
http://en.wikipedia.org/wiki/PCI_Express#PCI_Express_2.0 it says:
"Like 1.x, PCIe 2.0 uses an 8b/10b encoding
<http://en.wikipedia.org/wiki/8b/10b_encoding> scheme, therefore
delivering, per-lane, an effective 4 Gbit/s max transfer rate from its
5 GT/s raw data rate."
So, it suggests that we can get 4Gbit/s * 4 (using the x4 slots), which
provides a maximum throughput of 16Gbit/s, which wouldn't quite manage
the full 20Gb/s a dual port 10Gb card is capable of. One option is to
only use a single port for the cross connect, but it would probably help
to be able to use the second port to replace the 8 x 1Gb ports. (BTW,
the PCIe and ethernet bandwidth is apparently full duplex, so that
shouldn't be a problem AFAIK.)
Or am I reading something wrong?
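In numbers, as I understand it (each direction):
per lane: 5 GT/s x 8b/10b = 4 Gbit/s
x4 slot: 4 lanes x 4 Gbit/s = 16 Gbit/s (~2 GB/s)
dual port 10GbE flat out: 2 x 10 Gbit/s = 20 Gbit/s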
>> http://www.netgear.com.au/business/products/switches/stackable-smart-switches/GS728TXS.aspx#
>> should allow the 2 x 10G connections to be connected through to the 8
>> servers with 2 x 1G connections each using multipath scsi to setup two
>> connections (one on each 1G port) with the same destination (10G port)
>>
>> Any suggestions/comments would be welcome.
> You'll want to use SFP+ NICs and passive Twin-Ax cables to avoid paying the
> $2000 fiber tax, as that is what four SFP+ 10 Gbit fiber LC transceivers
> cost--$500 each. The only SFP+ Intel dual port 10 GbE NIC that ships
> with vacant SFP+ ports is the X520-DA2:
> http://www.newegg.com/Product/Product.aspx?Item=N82E16833106044
>
> To connect the NICs to the switch and to one another you'll need 3 or 4
> SFP+ passive Twin-Ax cables of appropriate length. Three if direct
> server-to-server works, four if it doesn't, in which case you connect
> all 4 to the 4 SFP+ switch ports. You'll need to contact Intel and
> inquire about the NIC-to-NIC functionality. I'm not using the word
> cross-over because I don't believe it applies to Twin-Ax cable. But you
> need to confirm their NICs will auto negotiate the send/receive pairs.
> This isn't twisted pair cable Adam. It's a different beast entirely.
> You can't run the length you want, cut the cable and terminate it
> yourself. These cables must be pre-made to length and terminated at the
> factory. One look at the prices tells you that. The 1 meter Intel
> cable costs more than a 500ft spool of Cat 5e. A 1 meter and a 3 meter
> Passive Twin-Ax cable, Intel and Netgear:
>
> http://www.newegg.com/Product/Product.aspx?Item=N82E16812128002
> http://www.newegg.com/Product/Product.aspx?Item=N82E16812638004
I understand about the cables, though I was planning on trying to use
Cat6 cables, as I thought that would be an option together with the
Intel X540T2:
http://www.newegg.com/Product/Product.aspx?Item=N82E16833106083
Though that has PCIe 2.1, so maybe it wouldn't work, so I was then
looking at the X520T2, which has PCIe 2.0:
http://www.newegg.com/Product/Product.aspx?Item=N82E16833106075
However, if the twin-ax cables will offer lower latency, then I think
that is a better option. I think DRBD will work a lot better with lower
latency, and I'm sure iSCSI will also benefit.
Also it seems that finding the SFP+ modules for the netgear switch to
provide the Cat6 ports might also be challenging and/or more expensive.
Given the proximity of the two servers (one rack apart) I think the
Intel card you mentioned above, plus 4 of the 3m cables (might as well
order the 4th cable now in case we need it later) would be the best
solution.
> If the server to switch distance is much over 15ft you will need to
> inquire with Intel and Netgear about the possibility of using active
> Twin-Ax cables. If their products do not support active cables you'll
> have to go with fiber, and spend the extra $2000 for the 4 transceivers,
> along with one LC-to-LC multimode fiber cable for the server-to-server
> link, and two straight through LC-LC multimode fiber cables.
Hopefully not :) I originally thought fibre might provide lower
latency (I'm sure it does for a long distance cable run), but once I
read that it adds latency in the conversion (copper <-> fibre) I
figured it was better to avoid it. Cat6 seemed to provide a suitable
solution, but as mentioned, if twin-ax is lower latency then that's a
better solution.
Finally, can you suggest a reasonable approach to what to monitor, and
how, to rule out the various components?
I know in the past I've used fio on the server itself and got excellent
results (2.5GB/s read + 1.6GB/s write); I've done multiple parallel fio
tests from the linux clients, and each gets around 180+MB/s read and
write; and I can do fio tests within my windows VMs and still get
200MB/s read/write (one at a time recently). Yet at times I am seeing
*really* slow disk IO from the windows VMs (and linux VMs), where in
windows you can wait 30 seconds for the command prompt to change to
another drive, or 2 minutes for the "My Computer" window to show the
list of drives. I have all this hardware, and yet performance feels
really bad; if it's not hardware, then it must be some config option
that I've seriously stuffed up...
Firstly, I want to rule out MD. So far I am graphing the read/write
sectors per second for each physical disk as well as md1, drbd2 and each
LVM volume. I am also graphing BackLog and ActiveTime, taken from
/sys/block/DEVICE/stat.
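For reference, if I'm reading Documentation/block/stat.txt correctly,
those are the 10th and 11th fields in that file (io_ticks and
time_in_queue, both in milliseconds), so something like this samples
them every 5 seconds:
while sleep 5; do awk '{print $10, $11}' /sys/block/md1/stat; done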
These stats clearly show significantly higher IO during the backups than
during peak times, so again it suggests that the system should be
capable of performing really well.
Thanks again for any advice or suggestions.
Regards,
Adam
--
Adam Goryachev Website Managers www.websitemanagers.com.au
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Growing RAID5 SSD Array
2014-03-18 1:41 ` Adam Goryachev
@ 2014-03-18 11:22 ` Stan Hoeppner
2014-03-18 23:25 ` Adam Goryachev
0 siblings, 1 reply; 16+ messages in thread
From: Stan Hoeppner @ 2014-03-18 11:22 UTC (permalink / raw)
To: Adam Goryachev, linux-raid
On 3/17/2014 8:41 PM, Adam Goryachev wrote:
> On 18/03/14 08:43, Stan Hoeppner wrote:
>> On 3/17/2014 12:43 AM, Adam Goryachev wrote:
>>> On 13/03/14 22:58, Stan Hoeppner wrote:
>>>> On 3/12/2014 9:49 PM, Adam Goryachev wrote:
>>>>> So, I could simply do the following:
>>>>> mdadm --manage /dev/md1 --add /dev/sdb1
>>>>> mdadm --grow /dev/md1 --raid-devices=6
>>>>>
>>>>> Probably also need to remove the bitmap and re-add the bitmap.
>>>> Might want to do
>>>>
>>>> ~$ echo 250000 > /proc/sys/dev/raid/speed_limit_min
>>>> ~$ echo 500000 > /proc/sys/dev/raid/speed_limit_max
>>>>
>>>> That'll bump min resync to 250 MB/s per drive, max 500 MB/s. IIRC the
>>>> defaults are 1 MB/s and 200 MB/s.
>>> Worked perfectly on one machine, but the second machine hung and
>>> basically crashed. It almost turned into a disaster, but thankfully,
>>> having two copies over the two machines, I managed to get everything
>>> sorted. After a reboot, the second machine recovered and it grew the
>>> array also.
>> See: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=629442
>>
>> This is the backup machine, yes? Last info I had from you said this box
>> was using rust not SSD. Is that still the case? If so you should not
>> have bumped the reshape speed upward as rust can't handle it, especially
>> with load other than md on it.
>
> The second machine is hardware and software identical to the primary
> now, ie, both had 5 x 480GB SSD, and I added 1 x 480GB SSD to each.
>
>> Also, I recall you had to install a
>> backport kernel on san1 as well as a new iscsi-target package.
>>
>> What kernel and iscsi-target versions are running on each of san1 and
>> san2? I'm guessing they're not the same.
>
> Yep, I did install 3.2.41-2~bpo60+1 some time ago, but it looks like
> I've upgraded to 3.2.54-2 since then, and that is the version currently
> running.
> ii iscsitarget 1.4.20.2-10.1 amd64 iSCSI
> Enterprise Target userland tools
> ii iscsitarget-dkms 1.4.20.2-10.1 all iSCSI
> Enterprise Target kernel module source - dkms version
>
> Versions are identical on both machines. I don't think it is an iscsi
> issue; I think iscsi had a problem because the kernel stopped providing
> IO...
Given the multi-gigabyte/sec throughput of your block hardware I'd say
it's fairly certain that you had plenty of idle HBA and SSD when this
warning and stack trace occurred. Thus you hit a kernel bug. I don't
have time to track it down. And since this only occurred on one of two
identical machines performing identical reshape operations, it's likely
not something that will affect your production workload.
>> What elevator is configured on san1 and san2? It should be noop for SSD
>> and deadline for rust.
> This is from /etc/rc.local:
> for disk in sda sdb sdc sdd sde sdf sdg
> do
> echo noop > /sys/block/${disk}/queue/scheduler
> echo 128 > /sys/block/${disk}/queue/nr_requests
> done
> echo 4096 > /sys/block/md1/md/stripe_cache_size
>
> It is identical on both machines.
> NOTE: I just added sdg to the end now, so it wasn't there before.
> However, sdg is/would have been the OS 120G SSD, therefore shouldn't
> make any difference with the raid array.
>
> I was thinking recently that maybe I should try and use cfq or deadline,
> as one of the issues I'm getting is IO starvation with multiple heavy IO
> workloads.
First, CFQ and deadline are coded specifically for rotational disks.
They are designed to do basically the same thing as TCQ/NCQ. With SSD
they will do nothing but add latency, not decrease it. Regardless, if
you simply look at iostat you'll see that the SSD latency isn't your
problem.
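e.g. run
~$ iostat -x 5
during a slow period and watch await and %util per device. Single digit
await on the SSDs while users are complaining means the disks aren't the
bottleneck.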
I know what the TS client performance problem with your production
workload is and it has nothing to do with your iSCSI servers. You know
what it is as well but you've forgotten over the past year since I
helped you track it down. See below.
> ie, if I leave the DRBD connection up between the machines,
> single copy from a client is around 25 to 30MB/s, but if I do two copies
> I can see each copy take turns for around 5 or more seconds each.
> Although I'm hoping the below faster interconnect will help to resolve
> this.
>
>>> Some of the logs from that time:
>>> Mar 13 23:05:59 san2 kernel: [42511.418380] RAID conf printout:
>>> Mar 13 23:05:59 san2 kernel: [42511.418385] --- level:5 rd:6 wd:6
>>> Mar 13 23:05:59 san2 kernel: [42511.418388] disk 0, o:1, dev:sdc1
>>> Mar 13 23:05:59 san2 kernel: [42511.418390] disk 1, o:1, dev:sde1
>>> Mar 13 23:05:59 san2 kernel: [42511.418392] disk 2, o:1, dev:sdd1
>>> Mar 13 23:05:59 san2 kernel: [42511.418394] disk 3, o:1, dev:sdf1
>>> Mar 13 23:05:59 san2 kernel: [42511.418396] disk 4, o:1, dev:sda1
>>> Mar 13 23:05:59 san2 kernel: [42511.418399] disk 5, o:1, dev:sdb1
>>> Mar 13 23:05:59 san2 kernel: [42511.418444] md: reshape of RAID array
>>> md1
>>> Mar 13 23:05:59 san2 kernel: [42511.418448] md: minimum _guaranteed_
>>> speed: 1000 KB/sec/disk.
>>> Mar 13 23:05:59 san2 kernel: [42511.418451] md: using maximum available
>>> idle IO bandwidth (but not more than 200000 KB/sec) for reshape.
>>> Mar 13 23:05:59 san2 kernel: [42511.418493] md: using 128k window, over
>>> a total of 468847936k.
>>> Mar 13 23:06:00 san2 kernel: [42511.512165] md: md_do_sync() got signal
>>> ... exiting
>>> Mar 13 23:07:01 san2 kernel: [42573.067781] iscsi_trgt: Abort Task (01)
>>> issued on tid:9 lun:0 by sid:8162774362161664 (Function Complete)
>>> Mar 13 23:07:01 san2 kernel: [42573.067789] iscsi_trgt: Abort Task (01)
>>> issued on tid:11 lun:0 by sid:7318349599801856 (Function Complete)
>>> Mar 13 23:07:01 san2 kernel: [42573.067797] iscsi_trgt: Abort Task (01)
>>> issued on tid:12 lun:0 by sid:6473924787110400 (Function Complete)
>>> Mar 13 23:07:01 san2 kernel: [42573.067838] iscsi_trgt: Abort Task (01)
>>> issued on tid:14 lun:0 by sid:5348025014485504 (Function Complete)
>>> Mar 13 23:07:02 san2 kernel: [42573.237591] iscsi_trgt: Abort Task (01)
>>> issued on tid:8 lun:0 by sid:4503599899804160 (Function Complete)
>>> Mar 13 23:07:02 san2 kernel: [42573.237600] iscsi_trgt: Abort Task (01)
>>> issued on tid:2 lun:0 by sid:14918173819994624 (Function Complete)
>> ...
>>> I probably hit CTRL-C causing the "got signal... exiting" because the
>>> system wasn't responding. There are a *lot* more iscsi errors and then
>>> these:
>>> Mar 13 23:09:09 san2 kernel: [42700.645060] INFO: task md1_raid5:314
>>> blocked for more than 120 seconds.
>> The md write thread blocked for more than 2 minutes. Often these
>> timeouts are due to multiple processes fighting for IO. This leads me
>> to believe san2 has rust-based disks, and that the kernel and other
>> tweaks applied to san1 were not applied to san2.
>>
>> ...
> Nope, both san1 and san2 are identical.... however, yes, it looks like
> IO starvation, which I suspect is because md1 was blocking, which is
> where drbd/lvm2/iscsi gets the data from.
But again you should have had no iSCSI sessions active, and if you
didn't shut down DRBD during the reshape then you're asking for it anyway.
Recall that in my initial response I recommended you shut down DRBD
before doing the reshapes.
>>> This did lead to another observation.... The speed of the resync seemed
>>> limited by something other than disk IO.
>> On both san1/san2 or just san1? I'm assuming for now you mean san1 only.
>
> I watched the resync a lot closer on san2, because while san1 did the
> resync I was driving into the office :)
>
>>> It was usually around 250 to
>>> 300MB/s, the maximum achieved was around 420MB/s. I also noticed that
>>> idle CPU time on one of the cores was relatively low, though I never saw
>>> it hit 0 (minimum I saw was 12% idle, average around 20%).
>> Never look at idle; look at what's eating the CPU. Was that 80+% being
>> eaten by sys, wa, or a process? Without that information it's not
>> possible to definitively answer your questions below.
>
> Unfortunately I should have logged the info but didn't. I am pretty sure
> md1_resync was at the top of the task list...
A reshape reads and writes all drives concurrently. You're likely not
going to get even one drive worth of write throughput. Your FIO testing
under my direction showed 1.6GB/s div 4 = 400MB/s peak per drive write
throughput with a highly parallel workload, i.e. queue depth >4. I'd
say these reshape numbers are pretty good. If it peaked at 420MB/s and
average 250-300 then other processes were accessing the drives. If DRBD
was active that would probably explain it. This isn't something to
spend any time worrying about because it's not relevant to your
production issues.
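For reference, that "not more than 200000 KB/sec" in the log above is
just the md throttle, and it's tunable if you ever do care about reshape
speed:

  cat /proc/sys/dev/raid/speed_limit_min    # default 1000 (KB/s)
  echo 200000 > /proc/sys/dev/raid/speed_limit_min
  watch -n5 cat /proc/mdstat                # shows the current reshape speed

Raising the minimum just tells md to favor the reshape over competing IO.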
>> Do note, recall that during fio testing you were hitting 1.6 GB/s write
>> throughput, ~4x greater than the resync throughput stated above. If one
>> of your cores was at greater than 80% utilization with only ~420 MB/s of
>> resync throughput, then something other than the md write thread was
>> hammering that core.
> Shouldn't be any other CPU tasks running on this machine. These machines
> only do MD RAID + DRBD + LVM2 + iSCSI, there are no other tasks that run
> on these systems.
Scratch that. I wasn't thinking straight here. A RAID5 reshape is more
CPU intensive than multi-threaded FIO. With a reshape everything is a
read-modify-write (RMW) operation, and many more cycles are spent
managing the stripe cache due to the reads, etc.
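The stripe cache is tunable per array if you ever want to experiment,
though the default is usually fine:

  cat /sys/block/md1/md/stripe_cache_size    # pages per device, default 256
  echo 4096 > /sys/block/md1/md/stripe_cache_size

A bigger cache means fewer re-reads during RMW cycles, at the cost of RAM.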
>>> So, I'm wondering whether I should consider upgrading the CPU and/or
>>> motherboard to try and improve peak performance?
>> As I mentioned after walking you through all of the fio testing, you
>> have far more hardware than your workload needs.
> Which is driving me insane..... I really really don't understand why I
> have such horrible performance :(
> I don't know what is missing or misconfigured that causes things to
> perform so poorly in live usage when benchmarks run so well.
>
> Right now users are complaining about performance, and I see md1_raid5
> in the top 1 or 2 process positions, but CPU utilisation is under 2%
> user, 5% sys, and 3%ni, and over 95% idle, wa is practically 0....
You're looking in the wrong place--on the wrong box.
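And when you do look, capture per-core and per-thread numbers rather
than the machine-wide summary; with the sysstat package, something like:

  mpstat -P ALL 5 3    # usr/sys/iowait broken out per core
  pidstat -u 5 3       # per process and kernel thread, catches md1_raid5

A single saturated core can hide behind a 95% idle machine-wide average.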
>>> My understanding is that the RAID5 is single threaded, so will work best
>>> with a higher speed single core CPU compared to a larger number of cores
>>> at a lower speed. However, I'm not sure how much "work" is being done
>>> across the various models. ie, does a E5 CPU do more work even though it
>>> has a lower clock speed? Does this carry over to the E7 class as well?
>> You're chasing a red herring. Any performance issue you currently have,
>> and I've seen no evidence of such to this point, is not due to the model
>> of CPU in the box. It's due to tuning, administration, etc.
>
> OK, so forgetting about a newer CPU then. (I really can't imagine that
> any near-modern CPU wouldn't be capable of this workload, but I'm
> struggling to solve the underlying issues, and I'm hoping that throwing
> hardware at it will help.) Obviously CPU hardware is the wrong fit,
> though.
>
>>> Currently I'm looking to replace at least the motherboard with
>>> http://www.supermicro.com/products/motherboard/Xeon/C202_C204/X9SCM-F.cfm
>>> in
>>> order to get 2 of the PCIe 2.0 8x slots (one for the existing LSI SATA
>>> controller and one for a dual port 10Gb ethernet card). This will provide
>>> a 10Gb cross-over connection between the two servers, plus replace the 8
>>> x 1G ports with a single 10Gb port (solving the load balancing across
>>> the multiple links issue). Finally, this 28 port (4 x 10G + 24 x 1G)
>>> switch
>> Adam if you have the budget now I absolutely agree that 10 GbE is a much
>> better solution than the multi-GbE setup.
> Well, I've been tasked to fix the problem..... Whatever it takes. I just
> don't know what I should be targeting....
>> But you don't need a new
>> motherboard. The S1200BTLR has 4 PCIe 2.0 slots: one x8 electrical in
>> x16 physical slot, and three x4 electrical in x8 physical slots. Your
>> bandwidth per slot is:
>>
>> x8 4 GB/s unidirectional x2 <- occupied by LSI SAS HBA
>> x4 2 GB/s unidirectional x2 <- occupied by quad port GbE cards
>>
>> 10 Gbps Ethernet has a 1 GB/s effective data rate one way. Inserting an
>> x8 PCIe card into an x4 electrical/x8 physical slot gives you 4 active
>> lanes for 2+2 GB/s bandwidth. This is an exact match for a dual port 10
>> GbE card. You could install up to three dual port 10 GbE cards into
>> these 3 slots of the S1200BTLR.
> This is somewhat beyond my knowledge, but I'm trying to understand, so
> thank you for the information. From
> http://en.wikipedia.org/wiki/PCI_Express#PCI_Express_2.0 it says:
>
> "Like 1.x, PCIe 2.0 uses an 8b/10b encoding
> <http://en.wikipedia.org/wiki/8b/10b_encoding> scheme, therefore
> delivering, per-lane, an effective 4 Gbit/s max transfer rate from its 5
> GT/s raw data rate."
>
> So, it suggests that we can get 4Gbit/s * 4 (using the x4 slots), which
> provides a maximum throughput of 16Gbit/s and wouldn't quite manage
> the full 20Gb/s that a dual port 10Gb card is capable of.
Except for the fact that you'll never get close to 10 Gbps with TCP due
to protocol overhead, host latency, etc. Your goal in switching to 10
GbE should not be achieving 10 Gb/s throughput, as that's not possible
with your workload. Your goal should be achieving more bandwidth more
of the time than what you can achieve now with 8 GbE interfaces, and
simplifying your topology.
Again, your core problem isn't lack of bandwidth in the storage network.
> One option is to
> only use a single port for the cross connect, but it would probably help
> to be able to use the second port to replace the 8x1Gb ports. (BTW, the
> pci and ethernet bandwidth is apparently full duplex, so that shouldn't
> be a problem AFAIK).
>
> Or, I'm reading something wrong?
Everything is full duplex today, has been for many years. Yes, you'd
use one port on each 2-port 10 GbE NIC for DRBD traffic and the other to
replace the 8 GbE ports. Again, this won't solve the current core
problem but it will provide benefits.
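Once the cards are in, confirming the negotiated lane count takes one
command (the bus address below is an example; find yours with plain
lspci):

  lspci -vv -s 02:00.0 | grep -E 'LnkCap|LnkSta'
  # LnkCap = what the card supports, LnkSta = what was actually negotiated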
>>> http://www.netgear.com.au/business/products/switches/stackable-smart-switches/GS728TXS.aspx#
>>>
>>> should allow the 2 x 10G connections to be connected through to the 8
>>> servers (each with 2 x 1G connections), using multipath SCSI to set up
>>> two connections (one on each 1G port) with the same destination (10G port)
>>>
>>> Any suggestions/comments would be welcome.
>> You'll want to use SFP+ NICs and passive Twin-Ax cables to avoid paying the
>> $2000 fiber tax, as that is what four SFP+ 10 Gbit fiber LC transceivers
>> cost--$500 each. The only SFP+ Intel dual port 10 GbE NIC that ships
>> with vacant SFP+ ports is the X520-DA2:
>> http://www.newegg.com/Product/Product.aspx?Item=N82E16833106044
>>
>> To connect the NICs to the switch and to one another you'll need 3 or 4
>> SFP+ passive Twin-Ax cables of appropriate length. Three if direct
>> server-to-server works, four if it doesn't, in which case you connect
>> all 4 to the 4 SFP+ switch ports. You'll need to contact Intel and
>> inquire about the NIC-to-NIC functionality. I'm not using the word
>> cross-over because I don't believe it applies to Twin-Ax cable. But you
>> need to confirm their NICs will auto-negotiate the send/receive pairs.
>> This isn't twisted pair cable Adam. It's a different beast entirely.
>> You can't run the length you want, cut the cable and terminate it
>> yourself. These cables must be pre-made to length and terminated at the
>> factory. One look at the prices tells you that. The 1 meter Intel
>> cable costs more than a 500ft spool of Cat 5e. A 1 meter and a 3 meter
>> Passive Twin-Ax cable, Intel and Netgear:
>>
>> http://www.newegg.com/Product/Product.aspx?Item=N82E16812128002
>> http://www.newegg.com/Product/Product.aspx?Item=N82E16812638004
> I understand about the cables, though I was planning on trying to use
> Cat6 cables as I thought that would be an option, together with the
> Intel X540T2
> http://www.newegg.com/Product/Product.aspx?Item=N82E16833106083
> Though that has PCIe 2.1 so maybe it wouldn't work, so was then looking
> at X520T2
> http://www.newegg.com/Product/Product.aspx?Item=N82E16833106075
> Which has PCIe 2.0.
All PCIe devices are forward and backward compatible. That's not a problem.
> However, if the twin-ax cables will offer lower latency, then I think
> that is a better option. I think DRBD will work a lot better with lower
> latency, as I'm sure iSCSI should also benefit.
Definitely go with Twin-Ax.
> Also it seems that finding the SFP+ modules for the netgear switch to
> provide the Cat6 ports might also be challenging and/or more expensive.
> Given the proximity of the two servers (one rack apart) I think the
> Intel card you mentioned above, plus 4 of the 3m cables (might as well
> order the 4th cable now in case we need it later) would be the best
> solution.
10 GBase-T transceivers have limited availability, which drives up the
cost. The reason is most folks use Twin-Ax due to its advantages in
rack-to-rack connections. In addition SFP+ transceivers are not
universal. Many 10 GbE SFP+ NICs don't support 10GBase-T transceivers
due to the power draw.
And I absolutely agree on the 4th cable--if the server-server cable
doesn't work, why wait another week or two to get DRBD running through
the switch?
>> If the server to switch distance is much over 15ft you will need to
>> inquire with Intel and Netgear about the possibility of using active
>> Twin-Ax cables. If their products do not support active cables you'll
>> have to go with fiber, and spend the extra $2000 for the 4 transceivers,
>> along with one LC-to-LC multimode fiber cable for the server-to-server
>> link, and two straight through LC-LC multimode fiber cables.
> Hopefully not :) I originally thought fibre might provide a lower
> latency (I'm sure it does for a long distance cable run), but once I
> read that it increases latency in the conversion (copper <-> fibre) I
> figured it was better to avoid it. Cat6 seemed to provide a suitable
> solution, but as mentioned, if twin-ax is lower latency then that's a
> better solution.
And it's easier to acquire.
> Finally, can you suggest a reasonable solution on how or what to monitor
> to rule out the various components?
You don't need to. You already found the problem, a year ago. I'm
guessing you simply forgot to fix it, or didn't sufficiently fix it.
> I know in the past I've used fio on the server itself, and got excellent
> results (2.5GB/s read + 1.6GB/s write), I know I've done multiple
> parallel fio tests from the linux clients and each gets around 180+MB/s
> read and write, I know I can do fio tests within my windows VM's, and
> still get 200MB/s read/write (one at a time recently). Yet at times I am
> seeing *really* slow disk IO from the windows VM's (and linux VM's),
> where in windows you can wait 30 seconds for the command prompt to
> change to another drive, or 2 minutes for the "My Computer" window to
> show the list of drives. I have all this hardware, and yet performance
> feels really bad, if it's not hardware, then it must be some config
> option that I've seriously stuffed up...
I may have some details incorrect as I'm going strictly from organic
memory here, so please pardon me if I fubar a detail or two.
You had a Windows 2000 Domain Controller VM that hosts all of your
SMB file shares. You were giving it only one virtual CPU, i.e. one
core, and not enough RAM. It was peaking the core during any sustained
SMB file copy in either direction while achieving less than 100 MB/s SMB
throughput IIRC. In addition, your topology limits SMB traffic between
the hypervisor nodes to a single GbE link, 100 MB/s.
The W2K VM simply couldn't handle more than 200 MB/s of combined SMB and
block IO processing. I did some research at that time and found that
2003/2008 had many enhancements for running in VMs that solved many of
the virtualization performance problems of W2K. I suggested you
wholesale move SMB file sharing directly to the storage servers running
Samba to fix this once and for all, with a sledgehammer, but you did not
want to part with a Windows VM hosting the SMB shares. I said your next
best option was to upgrade and give the DC VM 4 virtual CPUs and 2GB of
RAM. IIRC you said you needed to allocate as much CPU/RAM as possible
to the other VMs on that box and you couldn't spare it.
So, as of the last information I have, you had not fixed this. Given
the nature of the end user issues you describe, which are pretty much
identical to a year ago, I can only assume you didn't properly upgrade
or replace this Windows DC file server VM and it is still the
bottleneck. The long delays you mention tend to indicate it is trying
to swap heavily but is experiencing tremendous latency in doing so. Is
the swap file for this DC VM physically located on the iSCSI server? If
so the round trip latency is exacerbating the VM's attempts to swap.
Get out your medical examiner's kit and perform an autopsy on this
Windows DC/SMB server VM. This is where you'll find the problem I
think. If not it's somewhere in your Windows infrastructure.
Two minutes to display the mapped drive list in Explorer? That might be
a master browser issue. Go through all the Windows Event logs for the
Terminal Services VMs with a fine toothed comb.
> Firstly I want to rule out MD, so far I am graphing the read/write
> sectors per second for each physical disk as well as md1, drbd2 and each
> LVM. I am also graphing BackLog and ActiveTime taken from
> /sys/block/DEVICE/stat
> These stats clearly show significantly higher IO during the backups than
> during peak times, so again it suggests that the system should be
> capable of performing really well.
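For what it's worth, those two counters are fields 10 and 11 of
/sys/block/<dev>/stat (io_ticks and time_in_queue, both in ms, per
Documentation/block/stat.txt), and a sampler is a one-liner (gawk
assumed; md1 as the example device):

  while sleep 5; do
      awk '{print strftime("%T"), "inflight=" $9, "active_ms=" $10, "backlog_ms=" $11}' /sys/block/md1/stat
  done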
You're troubleshooting what you know because you know how to do it, even
though you know deep down that's not where the problem is. You're
comfortable with it so that's the path you take. You're avoiding
troubleshooting Windows, but this is where the heart of this problem is,
so you simply must.
> Thanks again for any advice or suggestions.
I hope I helped steer you toward the right path Adam. Always keep in
mind that the apparent causes of problems within a virtual machine guest
are not always what they appear to be.
Cheers,
Stan
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Growing RAID5 SSD Array
2014-03-18 11:22 ` Stan Hoeppner
@ 2014-03-18 23:25 ` Adam Goryachev
2014-03-19 20:45 ` Stan Hoeppner
0 siblings, 1 reply; 16+ messages in thread
From: Adam Goryachev @ 2014-03-18 23:25 UTC (permalink / raw)
To: stan, linux-raid
On 18/03/14 22:22, Stan Hoeppner wrote:
> On 3/17/2014 8:41 PM, Adam Goryachev wrote:
>> On 18/03/14 08:43, Stan Hoeppner wrote:
>>> On 3/17/2014 12:43 AM, Adam Goryachev wrote:
>>>> On 13/03/14 22:58, Stan Hoeppner wrote:
>>>>> On 3/12/2014 9:49 PM, Adam Goryachev wrote:
>>>>>
>>>>>
>>>>>
> But again you should have had no iSCSI sessions active, and if you
> didn't shutdown DRBD during a reshape then you're asking for it
> anyway. Recall in my initial response I recommended you shutdown DRBD
> before doing the reshapes?
Yes, and I did ignore that very good, sane advice. However, it should
have worked... So a kernel bug somewhere happened to bite me, hopefully
it has been fixed in a newer kernel already, and I will definitely learn
from the experience and shut down all iSCSI clients prior to the next
upgrade. However, I don't think I can stop drbd when I grow the array,
because drbd uses information at the "end" of the block device, and if
the array has grown, then it won't find the right information. I need to
grow the MD device, and then grow drbd while it is on-line to work
smoothly. Though if I shut down iscsi, then stop lvm, then nothing will
even have the drbd device open, so it should be totally idle.
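So the grow sequence I have in mind is roughly this (r0 and vg0 are
placeholder names for my resource and VG):

  # grow md1 on both nodes and let the reshapes finish, then:
  drbdadm resize r0       # drbd picks up the new backing device size
  pvresize /dev/drbd2     # grow the LVM PV sitting on the drbd device
  vgs                     # confirm the new free extents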
>>>> It was usually around 250 to
>>>> 300MB/s, the maximum achieved was around 420MB/s. I also noticed that
>>>> idle CPU time on one of the cores was relatively low, though I never saw
>>>> it hit 0 (minimum I saw was 12% idle, average around 20%).
>>> Never look at idle; look at what's eating the CPU. Was that 80+% being
>>> eaten by sys, wa, or a process? Without that information it's not
>>> possible to definitively answer your questions below.
>> Unfortunately I should have logged the info but didn't. I am pretty sure
>> md1_resync was at the top of the task list...
> A reshape reads and writes all drives concurrently. You're likely not
> going to get even one drive worth of write throughput. Your FIO testing
> under my direction showed 1.6GB/s div 4 = 400MB/s peak per drive write
> throughput with a highly parallel workload, i.e. queue depth >4. I'd
> say these reshape numbers are pretty good. If it peaked at 420MB/s and
> averaged 250-300 then other processes were accessing the drives. If DRBD
> was active that would probably explain it. This isn't something to
> spend any time worrying about because it's not relevant to your
> production issues.
OK, good :) Less to worry about is a good thing.
>>>> Currently I'm looking to replace at least the motherboard with
>>>> http://www.supermicro.com/products/motherboard/Xeon/C202_C204/X9SCM-F.cfm
>>>> in
>>>> order to get 2 of the PCIe 2.0 8x slots (one for the existing LSI SATA
>>>> controller and one for a dual port 10Gb ethernet card). This will provide
>>>> a 10Gb cross-over connection between the two servers, plus replace the 8
>>>> x 1G ports with a single 10Gb port (solving the load balancing across
>>>> the multiple links issue). Finally, this 28 port (4 x 10G + 24 x 1G)
>>>> switch
>>> Adam if you have the budget now I absolutely agree that 10 GbE is a much
>>> better solution than the multi-GbE setup.
>> Well, I've been tasked to fix the problem..... Whatever it takes. I just
>> don't know what I should be targeting....
>>> But you don't need a new
>>> motherboard. The S1200BTLR has 4 PCIe 2.0 slots: one x8 electrical in
>>> x16 physical slot, and three x4 electrical in x8 physical slots. Your
>>> bandwidth per slot is:
>>>
>>> x8 4 GB/s unidirectional x2 <- occupied by LSI SAS HBA
>>> x4 2 GB/s unidirectional x2 <- occupied by quad port GbE cards
>>>
>>> 10 Gbps Ethernet has a 1 GB/s effective data rate one way. Inserting an
>>> x8 PCIe card into an x4 electrical/x8 physical slot gives you 4 active
>>> lanes for 2+2 GB/s bandwidth. This is an exact match for a dual port 10
>>> GbE card. You could install up to three dual port 10 GbE cards into
>>> these 3 slots of the S1200BTLR.
>> This is somewhat beyond my knowledge, but I'm trying to understand, so
>> thank you for the information. From
>> http://en.wikipedia.org/wiki/PCI_Express#PCI_Express_2.0 it says:
>>
>> "Like 1.x, PCIe 2.0 uses an 8b/10b encoding
>> <http://en.wikipedia.org/wiki/8b/10b_encoding> scheme, therefore
>> delivering, per-lane, an effective 4 Gbit/s max transfer rate from its 5
>> GT/s raw data rate."
>>
>> So, it suggests that we can get 4Gbit/s * 4 (using the x4 slots), which
>> provides a maximum throughput of 16Gbit/s and wouldn't quite manage
>> the full 20Gb/s that a dual port 10Gb card is capable of.
> Except for the fact that you'll never get close to 10 Gbps with TCP due
> to protocol overhead, host latency, etc. Your goal in switching to 10
> GbE should not be achieving 10 Gb/s throughput, as that's not possible
> with your workload. Your goal should be achieving more bandwidth more
> of the time than what you can achieve now with 8 GbE interfaces, and
> simplifying your topology.
>
> Again, your core problem isn't lack of bandwidth in the storage network.
I'm still somewhat concerned that this might cause problems. Given a new
motherboard is around $350, I'd prefer to replace it if that is going to
help at all. Even if I solve the "other" problem, I'd prefer the users
to *really* notice the difference, rather than just "normal". ie, I want
the end result to be excellent rather than good, considering all the
time, money and effort... For now, I've just ordered the 2 x Intel cards
plus 1 of the cables (only one in stock right now, the other three are
on back order) plus the switch. I should have all that by tomorrow, and
if all goes well and I can use the single cable as a direct connect
between the two machines, then that's great, if not I will have to wait
for more cables.
>> One option is to
>> only use a single port for the cross connect, but it would probably help
>> to be able to use the second port to replace the 8x1Gb ports. (BTW, the
>> pci and ethernet bandwidth is apparently full duplex, so that shouldn't
>> be a problem AFAIK).
>>
>> Or, I'm reading something wrong?
> Everything is full duplex today, has been for many years. Yes, you'd
> use one port on each 2-port 10 GbE NIC for DRBD traffic and the other to
> replace the 8 GbE ports. Again, this won't solve the current core
> problem but it will provide benefits.
>
>>>> http://www.netgear.com.au/business/products/switches/stackable-smart-switches/GS728TXS.aspx#
>>>>
>>>> should allow the 2 x 10G connections to be connected through to the 8
>>>> servers (each with 2 x 1G connections), using multipath SCSI to set up
>>>> two connections (one on each 1G port) with the same destination (10G port)
>>>>
>>>> Any suggestions/comments would be welcome.
>> Finally, can you suggest a reasonable solution on how or what to monitor
>> to rule out the various components?
> You don't need to. You already found the problem, a year ago. I'm
> guessing you simply forgot to fix it, or didn't sufficiently fix it.
>
>> I know in the past I've used fio on the server itself, and got excellent
>> results (2.5GB/s read + 1.6GB/s write), I know I've done multiple
>> parallel fio tests from the linux clients and each gets around 180+MB/s
>> read and write, I know I can do fio tests within my windows VM's, and
>> still get 200MB/s read/write (one at a time recently). Yet at times I am
>> seeing *really* slow disk IO from the windows VM's (and linux VM's),
>> where in windows you can wait 30 seconds for the command prompt to
>> change to another drive, or 2 minutes for the "My Computer" window to
>> show the list of drives. I have all this hardware, and yet performance
>> feels really bad, if it's not hardware, then it must be some config
>> option that I've seriously stuffed up...
> I may have some details incorrect as I'm going strictly from organic
> memory here, so please pardon me if I fubar a detail or two.
>
> You had a Windows 2000 Domain Controller VM that hosts all of your
> SMB file shares. You were giving it only one virtual CPU, i.e. one
> core, and not enough RAM. It was peaking the core during any sustained
> SMB file copy in either direction while achieving less than 100 MB/s SMB
> throughput IIRC. In addition, your topology limits SMB traffic between
> the hypervisor nodes to a single GbE link, 100 MB/s.
I only ran win2000 for a very minimal time, I think less than one day,
which was part of the process of migrating from the old winNT 4.0
physical machine to the VM. It has been running win2003 for over a year
now.
I know in the initial period I did have an issue where I couldn't
upgrade to multiple CPU's, but it seems I did eventually manage to solve
that, because it is now running with 4 vCPU's and has been for a long
time. I think I also had issues with running win2003sp1, but an upgrade
to sp2 resolved that issue, something to do with the way the CPU was
being used by the virtualisation layer.
Generally, it has always been the only VM running on the physical
machine (to ensure network/cpu/etc priority), and has 4 vCPU's mapped to
4 physical CPU's which are not shared with anything. The dom0 has two
dedicated CPU's as well. All the win2003 machines are allocated 4GB RAM
(maximum for win2003 32bit).
The windows VM is limited to 1Gbps for SMB traffic, in fact the entire
"user" LAN is 1Gbps, at least for all the VM's.
> The W2K VM simply couldn't handle more than 200 MB/s of combined SMB and
> block IO processing. I did some research at that time and found that
> 2003/2008 had many enhancements for running in VMs that solved many of
> the virtualization performance problems of W2K. I suggested you
> wholesale move SMB file sharing directly to the storage servers running
> Samba to fix this once and for all, with a sledgehammer, but you did not
> want to part with a Windows VM hosting the SMB shares. I said your next
> best option was to upgrade and give the DC VM 4 virtual CPUs and 2GB of
> RAM. IIRC you said you needed to allocate as much CPU/RAM as possible
> to the other VMs on that box and you couldn't spare it.
Yes, I was (still am) very scared to replace the DC with a Linux box.
Moving the SMB shares would have resulted in changing the "location" of
all the files, and means finding and fixing every config file or spot
which relies on that. Though I have thought about this a number of
times. Currently, the plan is to migrate the authentication, DHCP, DNS,
etc to a new win2008R2 machine this weekend. Once that is done, next
weekend I will try and migrate the shares to a new win2012R2 machine.
The goal being to resolve any issues caused by upgrading the old win NT
era machine over and over and over again, by using brand new
installations of more modern versions. When the time comes, I may
consider migrating the file sharing to a linux VM, I've very slightly
played with samba4, but I'm not particularly confident about it yet (it
isn't included in Debian stable yet).
> So, as of the last information I have, you had not fixed this. Given
> the nature of the end user issues you describe, which are pretty much
> identical to a year ago, I can only assume you didn't properly upgrade
> or replace this Windows DC file server VM and it is still the
> bottleneck. The long delays you mention tend to indicate it is trying
> to swap heavily but is experiencing tremendous latency in doing so. Is
> the swap file for this DC VM physically located on the iSCSI server? If
> so the round trip latency is exacerbating the VM's attempts to swap.
The VM isn't swapping at all. At one stage I allocated an additional 4GB
ram drive for each VM (DC plus terminal servers), which simply looked
like a normal 4GB hard drive to windows. Then moved the pagefile to this
drive. It didn't make any difference to the performance issues, and in
the end I removed it because it meant I couldn't live migrate VM's to
different physical boxes, etc. In any case, swap is not in use for the
DC at all, right now (9:15am) there is 17% physical memory in use, and
CPU load is under 1%. The next time things are running slowly I'll take
another look at these numbers, but I don't suspect the issue is memory
or cpu on this box.
> Get out your medical examiner's kit and perform an autopsy on this
> Windows DC/SMB server VM. This is where you'll find the problem I
> think. If not it's somewhere in your Windows infrastructure.
>
> Two minutes to display the mapped drive list in Explorer? That might be
> a master browser issue. Go through all the Windows Event logs for the
> Terminal Services VMs with a fine toothed comb.
The performance issue impacts on unrelated linux VM's as well. I
recently setup a new Linux VM to run a new application. When the issue
is happening, if I login to this VM, disk IO is severely slow, like
running ls will take a long time etc...
I see the following event logs on the DC:
NTDS (764)NTDSA: A request to write to the file "C:\WINNT\NTDS\edb.chk"
at offset 0 (0x0000000000000000) for 4096 (0x00001000) bytes succeeded
but took an abnormally long time (72 seconds) to be serviced by the OS.
This problem is likely due to faulty hardware. Please contact your
hardware vendor for further assistance diagnosing the problem.
That type of event hasn't happened often:
20140314 11:15:35 72 seconds
20131124 17:55:48 55 minutes 12 seconds
20130422 20:45:23 367 seconds
20130410 23:57:16 901 seconds
Though these look like they may have happened at times when DRBD crashed
or similar, since I've definitely had a lot more times of very slow
performance....
Also looking on the terminal servers has produced a similar lack of
events, except some auth errors when the DC has crashed recently.
The newest terminal servers (running Win 2012R2) show this event for
every logon:
Remote Desktop services has taken too long to load the user
configuration from server \\DC for user xyz
Although the logins actually do work, and things seem mostly normal after
login, except for times when it runs really slow again.
Finally, on the old terminal servers, the PST file for outlook contained
*all* of the email and was stored on the SMB server, on the new terminal
servers, the PST file on the SMB server only contains contacts and
calendars (ie, very small) and the email is stored in the "local"
profile on the C: (which is iSCSI still). I'm hopeful that this will
reduce the file sharing load on the domain controller. (If the C: pst
file is lost, then it is automatically re-created and all the email is
re-downloaded from the IMAP server, so nothing is lost, but it
drastically increases the SAN load to re-download 2GB of email for each
user, which had a massive impact on performance on Friday last week!).
>> Firstly I want to rule out MD, so far I am graphing the read/write
>> sectors per second for each physical disk as well as md1, drbd2 and each
>> LVM. I am also graphing BackLog and ActiveTime taken from
>> /sys/block/DEVICE/stat
>> These stats clearly show significantly higher IO during the backups than
>> during peak times, so again it suggests that the system should be
>> capable of performing really well.
> You're troubleshooting what you know because you know how to do it, even
> though you know deep down that's not where the problem is. You're
> comfortable with it so that's the path you take. You're avoiding
> troubleshooting Windows, but this is where the heart of this problem is,
> so you simply must.
>
>> Thanks again for any advice or suggestions.
> I hope I helped steer you toward the right path Adam. Always keep in
> mind that the apparent causes of problems within a virtual machine guest
> are not always what they appear to be.
I'm really not sure, I still don't like the domain controller and file
server being on the same box, and the fact it has been upgraded so many
times, but I'm doubtful that it is the real cause.
On Thursday night after the failed RAID5 grow, I decided not to increase
the allocated space for the two new terminal servers (in case I caused
more problems), and simply deleted a number of user profiles on each
system. (I assumed the roaming profile would simply copy back when the
user logged in the next day). However, the roaming profile didn't copy,
and windows logged users in with a temp profile, so eventually the only
fix was to restore the profile from the backup server. Once I did this,
the user could login normally, except the backup doesn't save the pst
file, so outlook was forced to re-download all of the users email from
IMAP. This then caused the really, really, really bad performance across
the SAN, yet it didn't generate any traffic on the SMB shares from the
domain controller. In addition, as I mentioned, disk IO on the newest
Linux VM was also badly delayed. Also, copying from a smb share on a
different windows 2008 VM (basically idle and unused) showed equally bad
performance copying to my desktop (linux), etc.
So, essentially the current plans are:
Install the Intel 10Gb network cards
Replace the existing 1Gbps crossover connection with one 10Gbps connection
Replace the existing 8 x 1Gbps connections with 1 x 10Gbps connection
Migrate the win2003sp2 authentication etc to a new win2008R2 server
Migrate the win2003sp2 SMB to a new win2012R2 server
I'd still like to clarify whether there is any benefit to replacing the
motherboard; if needed, I would prefer to do that now rather than later.
Mainly I wanted to confirm that the rest of the interfaces on the
motherboard were not interconnected "worse" than the current one. I
think from the manual the 2 x PCIe x8 and one PCIe x4 and memory were
directly connected to the CPU, while everything else including onboard
sata, onboard ethernet, etc are all connected via another chip.
Thanks again for all your advice, much appreciated.
Regards,
Adam
--
Adam Goryachev Website Managers www.websitemanagers.com.au
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Growing RAID5 SSD Array
2014-03-18 23:25 ` Adam Goryachev
@ 2014-03-19 20:45 ` Stan Hoeppner
2014-03-20 2:54 ` Adam Goryachev
0 siblings, 1 reply; 16+ messages in thread
From: Stan Hoeppner @ 2014-03-19 20:45 UTC (permalink / raw)
To: Adam Goryachev, linux-raid
On 3/18/2014 6:25 PM, Adam Goryachev wrote:
> On 18/03/14 22:22, Stan Hoeppner wrote:
>> On 3/17/2014 8:41 PM, Adam Goryachev wrote:
>>> On 18/03/14 08:43, Stan Hoeppner wrote:
>>>> On 3/17/2014 12:43 AM, Adam Goryachev wrote:
>>>>> On 13/03/14 22:58, Stan Hoeppner wrote:
>>>>>> On 3/12/2014 9:49 PM, Adam Goryachev wrote:
...
> I'm still somewhat concerned that this might cause problems. Given a new
> motherboard is around $350, I'd prefer to replace it if that is going to
> help at all. Even if I solve the "other" problem, I'd prefer the users
> to *really* notice the difference, rather than just "normal". ie, I want
> the end result to be excellent rather than good, considering all the
> time, money and effort...
Replacing the motherboards, CPUs, memory, etc in the storage servers
isn't going to increase your user performance.
None of your problems are due to faulty hardware, or to a lack of hardware
horsepower in your SAN machines or network gear. You have far more
than sufficient bandwidth, both network and SSD array. The problems you
are experiencing are due to configuration issues and/or faults.
> For now, I've just ordered the 2 x Intel cards
> plus 1 of the cables (only one in stock right now, the other three are
> on back order) plus the switch. I should have all that by tomorrow, and
> if all goes well and I can use the single cable as a direct connect
> between the two machines, then that's great, if not I will have to wait
> for more cables.
Never install new hardware until after you have the root problem(s)
identified and fixed. Replacing hardware may introduce additional
problems and won't solve any.
...
> Yes, I was (still am) very scared to replace the DC with a Linux box.
> Moving the SMB shares would have resulted in changing the "location" of
> all the files, and means finding and fixing every config file or spot
> which relies on that. Though I have thought about this a number of
> times. Currently, the plan is to migrate the authentication, DHCP, DNS,
> etc to a new win2008R2 machine this weekend.
So your DHCP and DNS servers are on the DC VM.
> Once that is done, next
> weekend I will try and migrate the shares to a new win2012R2 machine.
> The goal being to resolve any issues caused by upgrading the old win NT
> era machine over and over and over again, by using brand new
> installations of more modern versions. When the time comes, I may
> consider migrating the file sharing to a linux VM, I've very slightly
> played with samba4, but I'm not particularly confident about it yet (it
> isn't included in Debian stable yet).
The problem isn't what is serving the shares. The problem is the
reliability of the system serving up the shares.
...
>> Get out your medical examiner's kit and perform an autopsy on this
>> Windows DC/SMB server VM. This is where you'll find the problem I
>> think. If not it's somewhere in your Windows infrastructure.
>>
>> Two minutes to display the mapped drive list in Explorer? That might be
>> a master browser issue. Go through all the Windows Event logs for the
>> Terminal Services VMs with a fine toothed comb.
>
> The performance issue impacts on unrelated linux VM's as well. I
> recently setup a new Linux VM to run a new application. When the issue
> is happening, if I login to this VM, disk IO is severely slow, like
> running ls will take a long time etc...
Slow, or delayed? I'm guessing delayed. Do Linux VM guests get DNS
resolution from the Windows DNS server running on the DC? Do any get
their IP assignment from the DHCP server running on the DC VM?
Do your Linux hypervisors resolve the IPs of the SAN1 interfaces via
DNS? Or do you use /etc/hosts? Or do you have these statically
configured in the iSCSI initiator?
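Three quick checks on each hypervisor will answer that definitively:

  grep '^hosts' /etc/nsswitch.conf   # e.g. "hosts: files dns"
  getent hosts san1                  # resolves exactly the way libc will
  iscsiadm -m node                   # shows whether targets are recorded by IP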
> I see the following event logs on the DC:
> NTDS (764)NTDSA: A request to write to the file "C:\WINNT\NTDS\edb.chk"
> at offset 0 (0x0000000000000000) for 4096 (0x00001000) bytes succeeded
> but took an abnormally long time (72 seconds) to be serviced by the OS.
> This problem is likely due to faulty hardware. Please contact your
> hardware vendor for further assistance diagnosing the problem.
Microsoft engineers always assume drive C: is a local disk. This is why
the error msg says "faulty hardware". But in your case, drive C: is
actually a SAN LUN mapped through to Windows by the hypervisor, correct?
To incur a 72 second delay attempting to write to drive C: indicates
that the underlying hypervisor is experiencing significant delay in
resolving the IP of the SAN1 network interface containing the LUN, or IP
packets are being dropped, or the switch is malfunctioning.
"C:\WINNT\NTDS\edb.chk" is the Active Directory database checkpoint
file. I.e. it is a journal. AD updates are written to the journal,
then written to the database file "NTDS.DIT", and when that operation is
successful the transaction is removed from the checkpoint file (journal)
edb.chk. Such a file will likely be read/write locked when written due
to its critical nature. NTDS.DIT will also likely be read/write locked
when being written. Look for errors in your logs related to NTDS.DIT
and Active Directory in general.
> That type of event hasn't happened often:
> 20140314 11:15:35 72 seconds
> 20131124 17:55:48 55 minutes 12 seconds
> 20130422 20:45:23 367 seconds
> 20130410 23:57:16 901 seconds
Large delays/timeouts like this are nearly always resolution related,
DNS, NIS, etc. I'm surprised that Windows would wait 55 minutes to
write to a local AD file, without timing out and producing a hard error.
> Though these look like they may have happened at times when DRBD crashed
> or similar, since I've definitely had a lot more times of very slow
> performance....
I seriously doubt this is part of the delay problem since none of your
hosts map anything on SAN2, according to what you told me a year ago,
anyway.
However, why is DRBD crashing? And what do you mean by "crashed"? You
mean the daemon crashed? On which host? Or both?
"may have happened at times when"...
Did you cross reference the logs on the Windows DC with the Linux logs?
That should give you a definitive answer.
> Also looking on the terminal servers has produced a similar lack of
> events, except some auth errors when the DC has crashed recently.
This DC is likely the entirety of your problems. This is what I was
referring to above about reliability. Why is the DC VM crashing? How
often does it crash? Is it just the VM crashing, or the physical box?
That DC provides the entire infrastructure for your Windows Terminal
Servers and any Windows PC on the network and, from the symptoms and log
information you've provided, it seems pretty clear you're experiencing
delays of some kind when the hypervisors access the SAN LUNs. Surely
you're not using DNS resolution for the IPs on SAN1, are you?
An unreliable AD/DNS server could explain the vast majority of the
problems you're experiencing.
> The newest terminal servers (running Win 2012R2) show this event for
> every logon:
> Remote Desktop services has taken too long to load the user
> configuration from server \\DC for user xyz
Slow AD/DNS.
> Although the logins actually do work, and seems mostly normal after
> login, except for times when it runs really slow again.
Same problem, slow AD/DNS.
> Finally, on the old terminal servers, the PST file for outlook contained
> *all* of the email and was stored on the SMB server, on the new terminal
> servers, the PST file on the SMB server only contains contacts and
> calendars (ie, very small) and the email is stored in the "local"
> profile on the C: (which is iSCSI still). I'm hopeful that this will
> reduce the file sharing load on the domain controller. (If the C: pst
> file is lost, then it is automatically re-created and all the email is
> re-downloaded from the IMAP server, so nothing is lost, but it
> drastically increases the SAN load to re-download 2GB of email for each
> user, which had a massive impact on performance on Friday last week!).
You have an IMAP server which is already storing all the mail. The
entire point of IMAP is keeping all the mail on the IMAP server. Each
message is transferred to a client only when the user opens it, thus
network load is nonexistent.
Why, again, are you not having Outlook use IMAP as intended? For the
life of me I can't imagine why you don't...
...
> I'm really not sure, I still don't like the domain controller and file
> server being on the same box, and the fact it has been upgraded so many
> times, but I'm doubtful that it is the real cause.
Being on the same physical box is fine. You just need to get it
reliable. And I would never put a DNS server inside a VM if any bare
metal outside the VM environment needs that DNS resolution. DNS is
infrastructure. VMs are NOT infrastructure, but reside on top of it.
For less than the $375 cost of that mainboard you mentioned you can
build/buy a box for AD duty, install Windows and configure from scratch.
It only needs the one inbuilt NIC port for the user LAN because it
won't host the shares/files.
You'll export the shares key from the registry of the current SMB
server. After you have the new bare metal AD/DNS server up, you'll shut
the current one down and never fire it up again because you'll get a
name collision with the new VM you are going to build...
You build a fresh SMB server VM for file serving and give it the host
name of the now shut down DC SMB server. Moving the shares/files to the
this new server is as simple as mounting/mapping the file share SAN LUN
to the new VM, into the same Windows local device path as on the old SMB
server (e.g. D:\). After that you restore the shares registry key onto
the new SMB server VM.
This allows all systems that currently map those shares by hostname and
share path to continue to do so. Basic instructions for migrating
shares in this manner can be found here:
http://support.microsoft.com/kb/125996
> On Thursday night after the failed RAID5 grow, I decided not to increase
> the allocated space for the two new terminal servers (in case I caused
> more problems), and simply deleted a number of user profiles on each
> system. (I assumed the roaming profile would simply copy back when the
> user logged in the next day). However, the roaming profile didn't copy,
> and windows logged users in with a temp profile, so eventually the only
> fix was to restore the profile from the backup server. Once I did this,
> the user could login normally, except the backup doesn't save the pst
> file, so outlook was forced to re-download all of the users email from
> IMAP.
...
> This then caused the really, really, really bad performance across
> the SAN,
Can you quantify this? What was the duration of this really, really,
really bad performance? And how do you know the bad performance existed
on the SAN links and not just the shared LAN segment? You don't have
your network links, or systems, instrumented, so how do you know?
Given that you've had continuous problems with this particular mini
datacenter, and the fact that you don't document problems in order to
track them, you need to instrument everything you can. Then when
problems arise you can look at the data and have a pretty good idea of
where the problems are. Munin is pretty decent for collecting most
Linux metrics, bare metal and guest, and it's free:
http://munin-monitoring.org/
It may help identify problem periods based on array throughput, NIC
throughput, errors, etc.
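Even before Munin is deployed, the sar collector in the sysstat package
you likely already have gives you history for almost nothing (Debian
paths assumed):

  sed -i 's/ENABLED="false"/ENABLED="true"/' /etc/default/sysstat
  # add -S DISK to SA1_OPTIONS in the same file for per-disk history
  sar -d        # per-device disk throughput for today
  sar -n DEV    # per-NIC packet and byte counters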
> yet it didn't generate any traffic on the SMB shares from the
> domain controller. In addition, as I mentioned, disk IO on the newest
> Linux VM was also badly delayed.
Now you say "delayed", not "bad performance". Do all of your VMs
acquire DHCP and DNS from the DC VM? If so, again, there's your problem.
Linux does not cache DNS information. It queries the remote DNS server
every time it needs a name to address mapping.
> Also, copying from a smb share on a
> different windows 2008 VM (basically idle and unused) showed equally bad
> performance copying to my desktop (linux), etc.
Now you say "bad performance" again. So you have a combination of DNS
problems, "delay", and throughput issues, "bad performance". Again, can
you quantify this "bad performance"?
I'm trying my best to help you identify and fix your problems, but your
descriptions lack detail.
> So, essentially the current plans are:
> Install the Intel 10Gb network cards
> Replace the existing 1Gbps crossover connection with one 10Gbps connection
> Replace the existing 8 x 1Gbps connections with 1 x 10Gbps connection
You can't fix these problems by throwing bigger hardware at them.
Switching to 10 GbE links might fix your current "bad performance" by
eliminating the ALB bonds, or by eliminating ports that are currently
problematic but unknown, see link speed/duplex below. However, as I
recommended when you acquired the quad port NICs, you shouldn't have
used bonds in the first place. Linux bonding relies heavily on ARP
negotiation and the assumption that the switch properly updates its MAC
routing tables in a timely manner. It also relies on the bond
interfaces having a higher routing priority than all the slaves, or that
the slaves have no route configured. You probably never checked nor
ensured this when you setup your bonding.
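Both assumptions take seconds to verify:

  cat /proc/net/bonding/bond0   # active slave, per-slave link state, failure counts
  ip route show                 # no slave should hold a route the bond needs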
It's possible that, due to bonding issues, all of your SAN1 outbound
iSCSI packets are going out only two of the 8 ports, and it's possible
that all the inbound traffic is hitting a single port. It's also
possible that the master link in either bond may have dropped link
intermittently, dropped link speed to 100 or 10, or is bouncing up and
down due to a cable or switch issue, or may have switched from full to
half duplex. Without some kind of monitoring such as Munin setup you
simply won't know this without manually looking at the link and TX/RX
statistics for every port with ifconfig and ethtool, which, at this point
is a good idea. But, if any links are flapping up and down at irregular
intervals, note they may all show 1000 FDX when you check manually with
ethtool, even though they're dropping link on occasion.
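A sweep of all eight ports is a one-liner (interface names are examples):

  for i in eth0 eth1 eth2 eth3 eth4 eth5 eth6 eth7; do
      echo "== $i"
      ethtool $i | grep -E 'Speed|Duplex|Link detected'
      ethtool -S $i | grep -iE 'err|drop' | grep -v ': 0$'
  done

Anything below 1000/Full, or non-zero error/drop counters, is your
smoking gun--though as I said, a link that flaps and renegotiates
cleanly won't show up here.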
You need to have some monitoring setup, alerting is even better. If an
interface in those two bonds drops link you should currently be
receiving an email or a page. Same goes for the DRBD link.
Last I recall you had setup two ALB bonds of 4 ports each, with the
multipath mappings of LUNs atop the bonds--against my recommendation of
using straight multipath without bonding. That would have probably
avoided some of your problems.
Anyway, switching to 10 GbE should solve all of this as you'll have a
single interface for iSCSI traffic at the server, no bond problems to
deal with, and 200 MB/s more peak potential bandwidth to boot, even
though you'll never use half of it, and then only in short bursts.
> Migrate the win2003sp2 authentication etc to a new win2008R2 server
> Migrate the win2003sp2 SMB to a new win2012R2 server
DNS is nearly always the cause of network delays. To avoid it, always
hard code hostnames and IPs into the host files of all your operating
systems because your server IPs never change. This prevents problems in
your DNS server from propagating across everything and causing delays
everywhere. With only 8 physical boxen and a dozen VMs, it simply
doesn't make sense to use DNS for resolving the IPs of these
infrastructure servers, given the massive problems it causes, and how
easy it is to manually configure hosts entries.
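That is, a few static entries in /etc/hosts on every Linux box, and the
same in %SystemRoot%\system32\drivers\etc\hosts on the Windows guests
(addresses invented for the example):

  10.0.0.1    san1
  10.0.0.2    san2
  10.0.0.11   xen1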
> I'd still like to clarify whether there is any benefit to replacing the
> motherboard, if needed, I would prefer to do that now rather than later.
The Xeon E3-1230V2 CPU has an embedded PCI Express 3.0 controller with
16 lanes. The bandwidth is 32 GB/s. This is greater than the 21/25
GB/s memory bandwidth of the CPU, so the interface is downgraded to PCIe
2.0 at 16 GB/s. In the S1200BTLR motherboard this is split into one x8
slot and two x4 slots. The third x4 slot is connected to the C204
Southbridge chip.
With this motherboard, CPU, 16GB RAM, 8 of those Intel SSDs in a nested
stripe 2x md/RAID5 on the LSI, and two dual port 10G NICs, the system
could be easily tuned to achieve ~3.5/2.5 GB/s TCP read/write
throughput. Which is 10x (350/250 MB/s) the peak load your 6 Xen
servers will ever put on it. The board has headroom to do 4-5 times
more than you're asking of it, if you insert/attach the right combo of
hardware, and tweak the bejesus out of your kernel and apps.
The maximum disk-to-network (and reverse) throughput one can achieve on
a platform with sufficient IO bandwidth and an optimally tuned Linux
kernel is typically 20-25% of the system memory bandwidth. This is due
to cache misses, interrupts, DMA from disk, memcpy into TCP buffers, DMA
from TCP buffers to NIC, window scaling, buffer sizes, retransmitted
packets, etc, etc. With dual channel DDR3 that is 21 GB/s x 20-25%,
i.e. roughly 4-5 GB/s.
As I've said many times over, you have ample, actually excess, raw
hardware performance in all of your machines.
> Mainly I wanted to confirm that the rest of the interfaces on the
> motherboard were not interconnected "worse" than the current one. I
> think from the manual the 2 x PCIe x8 and one PCIe x4 and memory were
> directly connected to the CPU, while everything else including onboard
> sata, onboard ethernet, etc are all connected via another chip.
See above. Your PCIe slots and everything else in your current servers
are very well connected.
If you go ahead and replace the server mobos, I'm buying a ticket,
flying literally half way around the world, just to plant my boot in
your arse. ;)
> Thanks again for all your advice, much appreciated.
You're welcome. And you're lucky I'm not billing you my hourly rate. :)
Believe it or not, I've spent considerable time both this year and last
digging up specs on your gear, doing Windows server instability
research, bonding configuration, etc, etc. This is part of my "giving
back to the community". In that respect, I can just idle until June
before helping anyone else. ;)
Cheers,
Stan
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Growing RAID5 SSD Array
2014-03-19 20:45 ` Stan Hoeppner
@ 2014-03-20 2:54 ` Adam Goryachev
2014-03-22 19:39 ` Stan Hoeppner
0 siblings, 1 reply; 16+ messages in thread
From: Adam Goryachev @ 2014-03-20 2:54 UTC (permalink / raw)
To: stan, linux-raid@vger.kernel.org
On 20/03/14 07:45, Stan Hoeppner wrote:
> On 3/18/2014 6:25 PM, Adam Goryachev wrote:
>> On 18/03/14 22:22, Stan Hoeppner wrote:
>>> On 3/17/2014 8:41 PM, Adam Goryachev wrote:
>>>> On 18/03/14 08:43, Stan Hoeppner wrote:
>>>>> On 3/17/2014 12:43 AM, Adam Goryachev wrote:
>>>>>> On 13/03/14 22:58, Stan Hoeppner wrote:
>>>>>>> On 3/12/2014 9:49 PM, Adam Goryachev wrote:
> ...
>> For now, I've just ordered the 2 x Intel cards
>> plus 1 of the cables (only one in stock right now, the other three are
>> on back order) plus the switch. I should have all that by tomorrow, and
>> if all goes well and I can use the single cable as a direct connect
>> between the two machines, then that's great, if not I will have to wait
>> for more cables.
> Never install new hardware until after you have the root problem(s)
> identified and fixed. Replacing hardware may introduce additional
> problems and won't solve any.
>
> ...
>> Yes, I was (still am) very scared to replace the DC with a Linux box.
>> Moving the SMB shares would have resulted in changing the "location" of
>> all the files, and means finding and fixing every config file or spot
>> which relies on that. Though I have thought about this a number of
>> times. Currently, the plan is to migrate the authentication, DHCP, DNS,
>> etc to a new win2008R2 machine this weekend.
> So your DHCP and DNS servers are on the DC VM.
Correct.
>> Once that is done, next
>> weekend I will try and migrate the shares to a new win2012R2 machine.
>> The goal being to resolve any issues caused by upgrading the old win NT
>> era machine over and over and over again, by using brand new
>> installations of more modern versions. When the time comes, I may
>> consider migrating the file sharing to a linux VM, I've very slightly
>> played with samba4, but I'm not particularly confident about it yet (it
>> isn't included in Debian stable yet).
> The problem isn't what is serving the shares. The problem is the
> reliability of the system serving up the shares.
>
> ...
>>> Get out your medical examiner's kit and perform an autopsy on this
>>> Windows DC/SMB server VM. This is where you'll find the problem I
>>> think. If not it's somewhere in your Windows infrastructure.
>>>
>>> Two minutes to display the mapped drive list in Explorer? That might be
>>> a master browser issue. Go through all the Windows Event logs for the
>>> Terminal Services VMs with a fine toothed comb.
>> The performance issue impacts on unrelated linux VM's as well. I
>> recently setup a new Linux VM to run a new application. When the issue
>> is happening, if I login to this VM, disk IO is severely slow, like
>> running ls will take a long time etc...
> Slow, or delayed? I'm guessing delayed. Do Linux VM guests get DNS
> resolution from the Windows DNS server running on the DC? Do any get
> their IP assignment from the DHCP server running on the DC VM?
>
> Do your Linux hypervisors resolve the IPs of the SAN1 interfaces via
> DNS? Or do you use /etc/hosts? Or do you have these statically
> configured in the iSCSI initiator?
Well, slow somewhat equals delayed... if it takes 20 seconds for ls of a
small directory to return the results, then there is a problem
somewhere. I've used slow/delayed/performance problem to mean the same
thing. Sorry for the confusion.
Every machine (VM and physical) are configured with the DC DNS IP.
However, no server gets any details from DHCP, they are all static
configurations.
The Linux hypervisors use IP's for iSCSI, in fact the iSCSI servers are
not configured in DNS, nor are the hypervisor machines, nor any of the
Linux VM's. The only entries in DNS are the ones that windows
automatically does as part of active directory. Almost every machine or
service is configured by IP address.
Additional evidence that iSCSI doesn't rely on DNS is that when
*everything* is down, I can start the san1/san2, and then start the
linux hypervisors, and then bootup the VM's, all while obviously the
DNS/DHCP server is not yet up. There is absolutely no external DNS
resolution at all (though that isn't really relevant at the iSCSI/etc
level).
>> I see the following event logs on the DC:
>> NTDS (764)NTDSA: A request to write to the file "C:\WINNT\NTDS\edb.chk"
>> at offset 0 (0x0000000000000000) for 4096 (0x00001000) bytes succeeded
>> but took an abnormally long time (72 seconds) to be serviced by the OS.
>> This problem is likely due to faulty hardware. Please contact your
>> hardware vendor for further assistance diagnosing the problem.
> Microsoft engineers always assume drive C: is a local disk. This is why
> the error msg says "faulty hardware". But in your case, drive C: is
> actually a SAN LUN mapped through to Windows by the hypervisor, correct?
> To incur a 72 second delay attempting to write to drive C: indicates
> that the underlying hypervisor is experiencing significant delay in
> resolving the IP of the SAN1 network interface containing the LUN, or IP
> packets are being dropped, or the switch is malfunctioning.
>
> "C:\WINNT\NTDS\edb.chk" is the Active Directory database checkpoint
> file. I.e. it is a journal. AD updates are written to the journal,
> then written to the database file "NTDS.DIT", and when that operation is
> successful the transaction is removed from the checkpoint file (journal)
> edb.chk. Such a file will likely be read/write locked when written due
> to its critical nature. NTDS.DIT will also likely be read/write locked
> when being written. Look for errors in your logs related to NTDS.DIT
> and Active Directory in general.
This event happened last week, in the midst of when all the users were
re-caching all their email. At the same time, before I had worked that
out, I was attempting to "fix" a standalone PC user's problems with their
PST file (stored on the SMB server). The PST file was approx 3GB, and I
copied it from the SMB server to the local PC, ran scanpst to repair the
file. When I attempted to copy the file back to the server (the PC is on
a 100Mbps connection), the server stopped responding (totally), even
though the console was not BSOD, all network responses stopped, no
console activity could be seen, and SMB shares were no longer
accessible. I assumed the server had been overloaded and crashed, in
actual fact it was probably just overloaded and very, very, very slow.
I forced a reboot from the hypervisor, and the above error message was
logged in the event viewer about 10 minutes after the crash, probably
when I tried to copy the same file again. After it did the same thing
the second time (stopped responding) I cancelled the copy, and
everything recovered (without rebooting the server). In the end I copied
the file after hours, and it completed normally. So, I would suspect the
72 seconds occurred during that second 'freeze' when the server wasn't
responding but I patiently waited for it to recover. This DC VM doesn't
crash, at least I don't think it ever has, except when the san
crashed/got lost/etc...
>> That type of event hasn't happened often:
>> 20140314 11:15:35 72 seconds
>> 20131124 17:55:48 55 minutes 12 seconds
>> 20130422 20:45:23 367 seconds
>> 20130410 23:57:16 901 seconds
> Large delays/timeouts like this are nearly always resolution related,
> DNS, NIS, etc. I'm surprised that Windows would wait 55 minutes to
> write to a local AD file, without timing out and producing a hard error.
As part of all the previous work, every layer has been configured to
stall rather than return disk failures, so even if the SAN vanishes, no
disk read/write should be handed a failure, though I would imagine that
sooner or later Windows should assume no answer is a failure, so
surprising indeed.
20131124 17:55:48 55 minutes 12 seconds
Records show that san1 iscsi process was stopped at 5:03:44 (or up to 5 minutes earlier), san2 never kicked in and started serving, and san1 recovered at 17:58:59 (or up to 5 minutes earlier). So I'm not sure *why* san1 failed, but I do have records showing that it did. I know it wasn't rebooted in order to recover it, and san2 wasn't offline at the time, nor did it become active (automatically or manually), nor was it rebooted.
20130422 20:45:23 367 seconds
This one looks strange... at 7:37pm all the VMs stopped responding to network (ping), at 7:41pm all the physical boxes' CPU went high, and at 7:45pm the DC logged this in the event log:
Sys: E 'Mon Apr 22 19:45:23 2013': XenVbd - " The device, \Device\Scsi\XenVbd1, did not respond within
the timeout period. "
Sys: E 'Mon Apr 22 19:45:23 2013': MRxSmb - " The master browser has received a server announcement from
the computer BACKUPPC that believes that it is the master browser for the domain on transport NetBT_Tcpip_{AB8F434F-3023-498C-.
The master browser is stopping or an election is being forced. "
Sys: E 'Mon Apr 22 19:45:23 2013': XenVbd - " The device, \Device\Scsi\XenVbd1, did not respond within
the timeout period. "
Finally, at 19:51 the physical boxes' CPU recovered.
At 19:53 san1 was rebooted.
At 19:56 the physical CPUs' load went high again.
At 19:58 san1 came back online.
At 20:01 the physical CPUs' load went back to normal.
By 20:21 all VMs had been rebooted and were back to normal.
I think at this stage, the sync between san1/san2 was disabled, and there was no automatic failover. It might also have been me changing networking on the SAN systems; I know a lot of changes were being made between Jan and April last year...
20130410 23:57:16 901 seconds
Without checking, I'm almost certain this would have been caused by me messing around, or changing things around. The timeframe is correct, (late at night, in April last year)...
>
>> Though these look like they may have happened at times when DRBD crashed
>> or similar, since I've definitely had a lot more times of very slow
>> performance....
> I seriously doubt this is part of the delay problem since none of your
> hosts map anything on SAN2, according to what you told me a year ago,
> anyway.
>
> However, why is DRBD crashing? And what do you mean by "crashed"? You
> mean the daemon crashed? On which host? Or both?
I generally mean that I did something (like adding a new SSD and growing
the MD array) which caused a crash. I have also had issues with LVM
snapshots where it would get into a state where I couldn't
add/list/delete any snapshots any longer, though the machine would
continue to work. Generally these were solved by migrating to san2 and
rebooting san1, after which everything worked normally on san2 (or
failing back to san1).
I am pretty sure that I haven't had any 'crashes' on san1/san2 under
normal workload or without a known cause (at least for a very long time,
probably after I installed that kernel from backports).
> "may have happened at times when"...
>
> Did you cross reference the logs on the Windows DC with the Linux logs?
> That should give you a definitive answer.
I do have an installation of Xymon (actually the older version still
called Hobbit) which catches things like logs, cpu, memory, disk,
processes, etc and stores those things as well as alerts. I've never
actually setup munin, but I have seen some of what it produces, and I
did like the level of detail it logged (ie, the graphs I saw logged
every smart counter from a HDD).
>
>> Also looking on the terminal servers has produced a similar lack of
>> events, except some auth errors when the DC has crashed recently.
> This DC is likely the entirety of your problems. This is what I was
> referring to above about reliability. Why is the DC VM crashing? How
> often does it crash? Is it just the VM crashing, or the physical box?
> That DC provides the entire infrastructure for your Windows Terminal
> Servers and any Windows PC on the network and, from the symptoms and log
> information you've provided, it seems pretty clear you're experiencing
> delays of some kind when the hypervisors access the SAN LUNs. Surely
> you're not using DNS resolution for the IPs on SAN1, are you?
>
> An unreliable AD/DNS server could explain the vast majority of the
> problems you're experiencing.
Nope, definitely not using DNS for the SAN config, iscsi, etc.. I'm
somewhat certain that this isn't a DNS issue.
>> The newest terminal servers (running Win 2012R2) show this event for
>> every logon:
>> Remote Desktop services has taken too long to load the user
>> configuration from server \\DC for user xyz
> Slow AD/DNS.
>
>> Although the logins actually do work, and it seems mostly normal after
>> login, except for times when it runs really slow again.
> Same problem, slow AD/DNS.
>
>> Finally, on the old terminal servers, the PST file for outlook contained
>> *all* of the email and was stored on the SMB server, on the new terminal
>> servers, the PST file on the SMB server only contains contacts and
>> calendars (ie, very small) and the email is stored in the "local"
>> profile on the C: (which is iSCSI still). I'm hopeful that this will
>> reduce the file sharing load on the domain controller. (If the C: pst
>> file is lost, then it is automatically re-created and all the email is
>> re-downloaded from the IMAP server, so nothing is lost, but it
>> drastically increases the SAN load to re-download 2GB of email for each
>> user, which had a massive impact on performance on Friday last week!).
> You have an IMAP server which is already storing all the mail. The
> entire point of IMAP is keeping all the mail on the IMAP server. Each
> message is transferred to a client only when the user opens it, thus
> network load is nonexistent.
>
> Why, again, are you not having Outlook use IMAP as intended? For the
> life of me I can't imagine why you don't...
>
> ...
Well, I'm trying to do the best (most sensible) thing that is possible
within the constraints of MS Outlook (not my first preference for email
client, but that's another story). To the best of my knowledge, MS
Outlook (various versions) has never worked properly with IMAP; however,
Outlook 2013 is one of the best versions yet. You can actually tell it
how much email to cache (timeframe of 1 month, 3 months, etc), but if
you tell it to only cache 3 months, then you simply can't see or access
any email older than that. Don't ask me, but that seems to be what
happens. Change the cache time to 6 months, and you can suddenly access
up to 6 months of email. So the only solution is to cache ALL email
(yes, luckily it does have a forever option).
However, the good news is that it means I don't need to store the PST
file with the massive cache on the SMB server, since it doesn't contain
any data that can't be automatically recovered. I create a small pst
file on SMB to store contacts and calendars, but all other IMAP cached
data is stored on the local C: of the terminal server. So, reduced load
on SMB, but still the same load on iSCSI.
>> I'm really not sure, I still don't like the domain controller and file
>> server being on the same box, and the fact it has been upgraded so many
>> times, but I'm doubtful that it is the real cause.
> Being on the same physical box is fine. You just need to get it
> reliable. And I would never put a DNS server inside a VM if any bare
> metal outside the VM environment needs that DNS resolution. DNS is
> infrastructure. VMs are NOT infrastructure, but reside on top of it.
Nope, nothing requires DNS to work.... at least not to boot up, etc...
Probably Windows needs some DNS/AD for file sharing, but that is a
higher level issue anyway.
> For less than the $375 cost of that mainboard you mentioned you can
> build/buy a box for AD duty, install Windows and configure from scratch.
> It only needs the one inbuilt NIC port for the user LAN because it
> won't host the shares/files.
Well, I'll be doing this as a new VM... Windows 2008R2. While I hope
this will help to split DNS/AD from SMB, I'm doubtful it will resolve
the issues.
> You'll export the shares key from the registry of the current SMB
> server. After you have the new bare metal AD/DNS server up, you'll shut
> the current one down and never fire it up again because you'll get a
> name collision with the new VM you are going to build...
>
> You build a fresh SMB server VM for file serving and give it the host
> name of the now shut down DC SMB server. Moving the shares/files to
> this new server is as simple as mounting/mapping the file share SAN LUN
> to the new VM, into the same Windows local device path as on the old SMB
> server (e.g. D:\). After that you restore the shares registry key onto
> the new SMB server VM.
>
> This allows all systems that currently map those shares by hostname and
> share path to continue to do so. Basic instructions for migrating
> shares in this manner can be found here:
>
> http://support.microsoft.com/kb/125996
Thank you for the pointer, that makes me more confident about copying
share configuration and permissions. The only difference to the above is
I plan on creating a new disk, formatting it with win2012R2, and copying
the data from the old disk across. The reason is that the old disk was
originally formatted by Win NT; it was suggested that it might be a good
idea to start with a newly formatted/clean filesystem. The concern with
this is copying of the ACL information on those files, hence some
testing beforehand will be needed.
>
>> On Thursday night after the failed RAID5 grow, I decided not to increase
>> the allocated space for the two new terminal servers (in case I caused
>> more problems), and simply deleted a number of user profiles on each
>> system. (I assumed the roaming profile would simply copy back when the
>> user logged in the next day). However, the roaming profile didn't copy,
>> and windows logged users in with a temp profile, so eventually the only
>> fix was to restore the profile from the backup server. Once I did this,
>> the user could login normally, except the backup doesn't save the pst
>> file, so outlook was forced to re-download all of the users email from
>> IMAP.
> ...
>> This then caused the really, really, really bad performance across
>> the SAN,
> Can you quantify this? What was the duration of this really, really,
> really bad performance? And how do you know the bad performance existed
> on the SAN links and not just the shared LAN segment? You don't have
> your network links, or systems, instrumented, so how do you know?
Well, running an ls from a Linux VM CLI doesn't rely on the user LAN
segment... (other than the ssh connection).
I do collect and graph a lot of numbers, though generally I find
graphs don't provide the fine-grained values that would be most useful,
but I keep trying to collect more information in the hope that it might
tell me something eventually...
For example, I am graphing the "Backlog" and "ActiveTime" on each
physical disk, DRBD, and each LV in san1. At the time of my tests, when
I said I did an "ls" command on this test VM, I see BackLog values on
the LV for the VM of up to 9948, which, AFAIK, means a 10-second delay.
This was either consistently around 10 seconds for a number of minutes,
or varied much higher and lower to produce this average/graph figure.
Using these same graphs, I can see the much higher than normal BackLog
and ActiveTime values for the two terminal servers that I expected were
re-caching all the IMAP emails. So again, there is some correlation to
iSCSI load and the issues being seen.
In addition, I can see much higher (at least three times higher) values
on the SMB/DC server.
If I look at read/write sectors/sec graphs, then I can see:
1) Higher than normal read activity on the IMAP VM
2) Significantly higher than normal write activity on the two Terminal
Servers between 10am (when I fixed the user profiles) and 3pm.
3) Higher than normal read/write activity on the SMB/DC between 9am and
12pm, but much lower than backup read rates for example.
Looking at the user LAN, I also take the values from the hypervisor for
each network interface. During testing I can see the new win2008R2
server was doing 28Mbps receive and 21Mbps transmit. Though given the
intermittent nature of my testing, it may not have been long enough to
generate accurate average values that can be seen on the graphs, even
though the rates somewhat match what I was reporting, around 20 to
25MB/s transfer rates.
During the day (Friday) I can see much higher than normal activity on
the mail server, up to around 5MB/s peak value.
Again, the two new terminal servers show TX rates up to 3.4MB/s and
2.5MB/s, which is a lot higher than a "normal" work day peak, and also
these high traffic levels were consistent over the time periods above
(10am to 3pm).
Finally, on the SMB/DC I see RX traffic peaking at 4MB/s and TX at
3MB/s, but other than those peaks (probably when I was copying that PST
file that caused the "crash") traffic levels look similar to other days.
I also graph CPU load. This is the number of seconds of CPU time given
to the VM divided by the interval. So if the VM was given 20 seconds of
CPU time in the past minute, then we record a value of 0.33; however, we
should also remember that a value of 4.0 would be expected for a VM with
4 vCPUs. On the Friday, no VM was especially busy, the mail server was
about the same as normal, and still below 0.4, and it has 2 vCPUs.
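For the record, the calculation is roughly the following, using the
cumulative per-domain CPU seconds from column 6 of 'xl list' (the domain
name dc-vm and the 60 second interval are illustrative; older toolstacks
use 'xm list' instead):
t1=$(xl list dc-vm | awk 'NR==2 { print $6 }')
sleep 60
t2=$(xl list dc-vm | awk 'NR==2 { print $6 }')
awk -v a="$t1" -v b="$t2" 'BEGIN { printf "load = %.2f\n", (b - a) / 60 }'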
Also, I graph the "disk" IO performed by each VM, as reported by the
hypervisor, in bytes read/write per second.
During my late night Friday testing, I can see the test win2008R2 VM
peaking at 185MB/s write, I don't recall what I did to generate the
traffic, I think I was copying a file from its C: to the same drive. So
the read was probably cached, but re-writing the same file multiple
times generated a lot of write load.
On the Friday, I again see the high disk IO for the two new terminal
servers, higher than the normal load. Of course, for most other machines
their peak is lower than the backup load peak, but for these two the
backup is done from LVM snapshots, so the load doesn't show up on the VM
at all. (BTW, due to the load that LVM snapshots seem to place, the
backup system takes a snapshot, does the backup, and immediately removes
the snapshot when done). All backups are done at night time, to avoid
any issues with users etc.
I also have MRTG graphs for each port on each switch.
I can see that for each physical machine (hypervisor) it is balancing
the traffic evenly across both iSCSI links. Both send and receive
traffic is equal across the pair of links.
Also, for san1, I can see the switch reports IN traffic (which would be
outbound from san1) is not evenly balanced across all 8 links, but there
is definite amounts of traffic across all 8 links. I can also see OUT
traffic (inbound to san1) is 0 on 5 of the links, and the large majority
of the traffic is on one link (peaking at 40Mbps yesterday during normal
work day load, and 75Mbps during backup load last night). The other two
links with load peaked at 18Mbps yesterday, and didn't do very much load
during the backups being run last night (actually, basically zero).
Today's peak so far for these two lines is 30Mbps, and the single line
peak is 30Mbps, all three at the same time.
One issue I have is that I don't necessarily know which physical machine
was hosting which VM at what time, although I know I always put the
DC/SMB server on the same physical box. So this makes it more difficult
to match the "user" lan traffic with the VM, though the other graphs
above from the hypervisor should be accurate for network traffic anyway.
Also, the MRTG graphs are only every 5 minutes, while the hypervisor
based graphs are 1 minute averages, so MRTG is a lot "coarser".
> Given that you've had continuous problems with this particular mini
> datacenter, and the fact that you don't document problems in order to
> track them, you need to instrument everything you can. Then when
> problems arise you can look at the data and have a pretty good idea of
> where the problems are. Munin is pretty decent for collecting most
> Linux metrics, bare metal and guest, and it's free:
>
> http://munin-monitoring.org/
>
> It may help identify problem periods based on array throughput, NIC
> throughput, errors, etc.
Thanks, I'll take a look at installing it, will probably start with my
desktop pc, and then extend to san2 and one of the hypervisor boxes,
before extending to san1 and the rest. I'm not sure where I'll put the
"master" node, or how much it will overlap with the existing stats I'm
collecting, but it certainly promises to help find performance issues....
>> yet it didn't generate any traffic on the SMB shares from the
>> domain controller. In addition, as I mentioned, disk IO on the newest
>> Linux VM was also badly delayed.
> Now you say "delayed", not "bad performance". Do all of your VMs
> acquire DHCP and DNS from the DC VM? If so, again, there's your problem.
>
> Linux does not cache DNS information. It queries the remote DNS server
> every time it needs a name to address mapping.
Delayed just means it didn't work as quickly as expected but worked
eventually; bad performance means it took longer than expected but still
worked. I.e., both the same. At least, to me they mean the same thing....
>> Also, copying from a smb share on a
>> different windows 2008 VM (basically idle and unused) showed equally bad
>> performance copying to my desktop (linux), etc.
> Now you say "bad performance" again. So you have a combination of DNS
> problems, "delay", and throughput issues, "bad performance". Again, can
> you quantify this "bad performance"?
>
> I'm trying my best to help you identify and fix your problems, but your
> descriptions lack detail.
Apologies, both the same thing.
>> So, essentially the current plans are:
>> Install the Intel 10Gb network cards
>> Replace the existing 1Gbps crossover connection with one 10Gbps connection
>> Replace the existing 8 x 1Gbps connections with 1 x 10Gbps connection
> You can't fix these problems by throwing bigger hardware at them.
> Switching to 10 GbE links might fix your current "bad performance" by
> eliminating the ALB bonds, or by eliminating ports that are currently
> problematic but unknown, see link speed/duplex below. However, as I
> recommended when you acquired the quad port NICs, you shouldn't have
> used bonds in the first place. Linux bonding relies heavily on ARP
> negotiation and the assumption that the switch properly updates its MAC
> routing tables and in a timely manner. It also relies on the bond
> interfaces having a higher routing priority than all the slaves, or that
> the slaves have no route configured. You probably never checked nor
> ensured this when you setup your bonding.
I'm not using bonding on the hypervisors, they are using multipath to
make use of each link. I'm using bonding on the san1/san2 server only,
which is configured as:
iface bond0 inet static
    address x.x.16.1
    netmask 255.255.255.0
    slaves eth2 eth3 eth4 eth5 eth6 eth7 eth8 eth9
    bond-mode balance-alb
    bond-miimon 100
    bond-updelay 200
    mtu 9000
This is slightly different to what you suggested; from memory, you
suggested I should have two bond groups on each of san1/san2 of 4
connections each, and each physical server should have one ethernet
connection to each bond group. Changing that would probably improve the
problem mentioned above with almost all the inbound (san1) traffic using
the one link.
None of the slave interfaces are configured at all, so I doubt there is
any issue with routing or interface priority.
> It's possible that due to bonding issues that all of your SAN1 outbound
> iSCSI packets are going out only two of the 8 ports, and it's possible
> that all the inbound traffic is hitting a single port.
It looks like (from the switch mrtg graphs) that outbound balancing is
working properly, but inbound balancing is very poor, almost just a
single link.
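One way I can cross-check this from san1 itself, rather than the switch
graphs: the bonding status file pairs each slave with its MAC
(balance-alb hands different slave MACs to different clients via ARP),
and the per-slave RX counters show which slaves actually receive anything:
grep -E 'Slave Interface|Permanent HW addr' /proc/net/bonding/bond0
for i in eth2 eth3 eth4 eth5 eth6 eth7 eth8 eth9; do
    echo -n "$i rx_bytes: "; cat /sys/class/net/$i/statistics/rx_bytes
done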
> It's also
> possible that the master link in either bond may have dropped link
> intermittently, dropped link speed to 100 or 10, or is bouncing up and
> down due to a cable or switch issue, or may have switched from full to
> half duplex. Without some kind of monitoring such as Munin setup you
> simply won't know this without manually looking at the link and TX/RX
> statistics for every port with ifconfig and ethtool, which, at this point
> is a good idea. But, if any links are flapping up and down at irregular
> intervals, note they may all show 1000 FDX when you check manually with
> ethtool, even though they're dropping link on occasion.
The switch logs don't show any links dropped or changing speed since at
least Monday night when I last rebooted one of the san servers. The
switch also logs via syslog to the mail server, logs there don't show
any unexpected link drops or speed changes etc. All cables used are new
cables (from last year), cat6 I think or else cat5e, and all are less
than 3m long. I've not seen any evidence of a faulty cable or port on
either the network cards or switches. In addition, any link drop logs a
kernel message in syslog, which is reported up to hobbit/xymon with an
associated alert (SMS).
>
> You need to have some monitoring setup, alerting is even better. If an
> interface in those two bonds drops link you should currently be
> receiving an email or a page. Same goes for the DRBD link.
Done, any process not running, port not listening (ie TCP port), port
not in a connected state (DRBD talking), MD alert, or certain log
entries will all generate an SMS alert.
> Last I recall you had setup two ALB bonds of 4 ports each, with the
> multipath mappings of LUNS atop the bonds--against my recommendation of
> using straight multipath without bonding. That would have probably
> avoided some of your problems.
I might be wrong, but from memory we had agreed that using 2 groups of 4
bonded channels on the SAN1/2 side was the best option. I never did get
around to doing that, because it seemed to be working well enough as is,
and I didn't want to keep changing things (ie, breaking things and then
trying to fix them again). Things were never really fully resolved, they
were just good enough, but the mess on Friday means that now things need
to be pretty much perfect. I think replacing this group of 8 bonded
connections with a single 10Gbps connection should solve this even
better than using 2 groups of 4 bonds, or any other option. I assume I
will keep the 2 multipath connections on the physical boxes the same as
current, simply removing the bond group on the san, configuring the new
10Gbps port with the same IP/netmask as previous, and everything should
work nicely.
> Anyway, switching to 10 GbE should solve all of this as you'll have a
> single interface for iSCSI traffic at the server, no bond problems to
> deal with, and 200 MB/s more peak potential bandwidth to boot, even
> though you'll never use half of it, and then only in short bursts.
Agreed.
>> Migrate the win2003sp2 authentication etc to a new win2008R2 server
>> Migrate the win2003sp2 SMB to a new win2012R2 server
> DNS is nearly always the cause of network delays. To avoid it, always
> hard code hostnames and IPs into the host files of all your operating
> systems because your server IPs never change. This prevents problems in
> your DNS server from propagating across everything and causing delays
> everywhere. With only 8 physical boxen and a dozen VMs, it simply
> doesn't make sense to use DNS for resolving the IPs of these
> infrastructure servers, given the massive problems it causes, and how
> easy it is to manually configure hosts entries.
Done, I definitely couldn't rely on DNS being provided by the VM as you
noted. Generally Linux machines (that I configure) don't rely on DNS for
anything; I don't change IP addresses on servers enough to make that
even slightly useful (would anyone?).
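For what it's worth, the static approach amounts to nothing more than
entries like these (names and addresses here are made up):
cat >> /etc/hosts <<'EOF'
10.0.16.1   san1-iscsi
10.0.16.2   san2-iscsi
10.0.1.10   dc
EOF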
>> I'd still like to clarify whether there is any benefit to replacing the
>> motherboard, if needed, I would prefer to do that now rather than later.
> The Xeon E3-1230V2 CPU has an embedded PCI Express 3.0 controller with
> 16 lanes. The bandwidth is 32 GB/s. This is greater than the 21/25
> GB/s memory bandwidth of the CPU, so the interface is downgraded to PCIe
> 2.0 at 16 GB/s. In the S1200BTLR motherboard this is split into one x8
> slot and two x4 slots. The third x4 slot is connected to the C204
> Southbridge chip.
>
> With this motherboard, CPU, 16GB RAM, 8 of those Intel SSDs in a nested
> stripe 2x md/RAID5 on the LSI, and two dual port 10G NICs, the system
> could be easily tuned to achieve ~3.5/2.5 GB/s TCP read/write
> throughput. Which is 10x (350/250 MB/s) the peak load your 6 Xen
> servers will ever put on it. The board has headroom to do 4-5 times
> more than you're asking of it, if you insert/attach the right combo of
> hardware, and tweak the bejesus out of your kernel and apps.
>
> The maximum disk-to-network and reverse throughput one can typically
> achieve on a platform with sufficient IO bandwidth, and an optimally
> tuned Linux kernel, is typically 20-25% of the system memory bandwidth.
> This is due to cache misses, interrupts, DMA from disk, memcpy into TCP
> buffers, DMA from TCP buffers to NIC, window scaling, buffer sizes,
> retransmitted packets, etc, etc. With dual channel DDR3 this is
> 21/[5|4]= 4-5 GB/s.
>
> As I've said many times over, you have ample, actually excess, raw
> hardware performance in all of your machines.
OK, so I'll just add the dual port 10Gbps network card, and remove the 2
quad port 1Gbps cards from each server. That will mean there are only two
cards installed in each san system. I really don't think it is
worthwhile right now, but I may re-use these cards by installing one
quad port card into 4 of the physical machines, and use 2 x dual port
cards in the other 4, and increase the iSCSI to 4 multipath connections
on each physical. That is all in the future though; for now I just want
to obtain at least 50MB/s (minimum, I should expect at least 100MB/s)
performance for the VMs, consistently....
>> Mainly I wanted to confirm that the rest of the interfaces on the
>> motherboard were not interconnected "worse" than the current one. I
>> think from the manual the 2 x PCIe x8 and one PCIe x4 and memory were
>> directly connected to the CPU, while everything else including onboard
>> sata, onboard ethernet, etc are all connected via another chip.
> See above. Your PCIe slots and everything else in your current servers
> are very well connected.
>
> If you go ahead and replace the server mobos, I'm buying a ticket,
> flying literally half way around the world, just to plant my boot in
> your arse. ;)
I'll save you most of the trouble, I'll be in the USA next month :)
however, I promise I won't get any new motherboards for now :)
>> Thanks again for all your advice, much appreciated.
> You're welcome. And you're lucky I'm not billing you my hourly rate. :)
>
> Believe it or not, I've spent considerable time both this year and last
> digging up specs on your gear, doing Windows server instability
> research, bonding configuration, etc, etc. This is part of my "giving
> back to the community". In that respect, I can just idle until June
> before helping anyone else. ;)
Absolutely, and I greatly appreciate it all!
Regards,
Adam
--
Adam Goryachev Website Managers www.websitemanagers.com.au
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Growing RAID5 SSD Array
2014-03-20 2:54 ` Adam Goryachev
@ 2014-03-22 19:39 ` Stan Hoeppner
2014-03-25 13:10 ` Adam Goryachev
0 siblings, 1 reply; 16+ messages in thread
From: Stan Hoeppner @ 2014-03-22 19:39 UTC (permalink / raw)
To: Adam Goryachev, linux-raid@vger.kernel.org
On 3/19/2014 9:54 PM, Adam Goryachev wrote:
> On 20/03/14 07:45, Stan Hoeppner wrote:
>> On 3/18/2014 6:25 PM, Adam Goryachev wrote:
>>> On 18/03/14 22:22, Stan Hoeppner wrote:
>>>> On 3/17/2014 8:41 PM, Adam Goryachev wrote:
>>>>> On 18/03/14 08:43, Stan Hoeppner wrote:
>>>>>> On 3/17/2014 12:43 AM, Adam Goryachev wrote:
>>>>>>> On 13/03/14 22:58, Stan Hoeppner wrote:
>>>>>>>> On 3/12/2014 9:49 PM, Adam Goryachev wrote:
...
>> Do your Linux hypervisors resolve the IPs of the SAN1 interfaces via
>> DNS? Or do you use /etc/hosts? Or do you have these statically
>> configured in the iSCSI initiator?
>
> Well, slow somewhat equals delayed... if it takes 20 seconds for ls of a
> small directory to return the results, then there is a problem
> somewhere.
Agreed. But where exactly? Hmmm... this 'ls' delay sounds vaguely
familiar.
> I've used slow/delayed/performance problem to mean the same
> thing. Sorry for the confusion.
An example of the distinction between "delayed" and "slow" would be
clicking a link in your browser. In the "delayed" case it takes 10
seconds for IP resolution but the file downloads at max throughput in 30
seconds. In the "slow" case IP resolution is instant but network
congestion causes a 2 minute download.
With a browser it's easy to see where the problem is, but not here. In
your case the delays are not necessarily distinguishable without using
tools. For slow 'ls' on the new Linux guest you can see where the
individual latencies exist in execution by running the 'ls' command
through strace. And that reminds me...
Nearly every time I've seen this 'slow ls' problem reported, the cause
has been delayed or slow response from an LDAP server in a single
sign-on, global authentication environment. With such a setup, during
'ls' of a local filesystem, the Linux group and user data must be looked
up on the LDAP server for each file in the directory, not locally as is
the case with standard Linux passwd security.
Do you have such a single sign on configuration on the new Linux VM you
mentioned? If so this may tend to explain why 'ls' in the Linux guest
is slow at the same time Windows share operations are also slow, as both
rely on the AD/DC server.
> Every machine (VM and physical) are configured with the DC DNS IP.
> However, no server gets any details from DHCP, they are all static
> configurations.
Got it. Just covering the bases.
...
>>> I see the following event logs on the DC:
>>> NTDS (764)NTDSA: A request to write to the file "C:\WINNT\NTDS\edb.chk"
>>> at offset 0 (0x0000000000000000) for 4096 (0x00001000) bytes succeeded
>>> but took an abnormally long time (72 seconds) to be serviced by the OS.
>>> This problem is likely due to faulty hardware. Please contact your
>>> hardware vendor for further assistance diagnosing the problem.
>>
>> Microsoft engineers always assume drive C: is a local disk. This is why
>> the error msg says "faulty hardware". But in your case, drive C: is
>> actually a SAN LUN mapped through to Windows by the hypervisor, correct?
>> To incur a 72 second delay attempting to write to drive C: indicates
>> that the underlying hypervisor is experiencing significant delay in
>> resolving the IP of the SAN1 network interface containing the LUN, or IP
>> packets are being dropped, or the switch is malfunctioning.
>>
>> "C:\WINNT\NTDS\edb.chk" is the Active Directory database checkpoint
>> file. I.e. it is a journal. AD updates are written to the journal,
>> then written to the database file "NTDT.DIT", and when that operation is
>> successful the transaction is removed from the checkpoint file (journal)
>> edb.chk. Such a file will likely be read/write locked when written due
>> to its critical nature. NTDT.DIT will also likely be read/write locked
>> when being written. Look for errors in your logs related to NTDT.DIT
>> and Active Directory in general.
>
> This event happened last week, in the midst of when all the users were
> re-caching all their email. At the same time, before I had worked that
> out, I was attempting to "fix" a standalone PC user's problems with their
> PST file (stored on the SMB server). The PST file was approx 3GB, and I
> copied it from the SMB server to the local PC, ran scanpst to repair the
> file. When I attempted to copy the file back to the server (the PC is on
> a 100Mbps connection),
Let's assume for now that this was just an ugly byproduct of the NT
kernel going out for lunch at the time. And this would make sense in
the case of the kernel driver issue in the KB article below.
> the server stopped responding (totally), even
> though the console was not BSOD, all network responses stopped, no
> console activity could be seen, and SMB shares were no longer
> accessible. I assumed the server had been overloaded and crashed, in
> actual fact it was probably just overloaded and very, very, very, slow.
A 3GB file copy to a share over 100FDX won't overload a Windows server
if it's configured properly. These may apply to your problem, and
probably many more:
http://support.microsoft.com/kb/822219
http://support.microsoft.com/kb/2550581
These don't address AD non-responsiveness, but with Windows, it's
certainly possible that the SMB problem described here is negatively
impacting the AD service, and/or other services. Windows is rather
notorious for breakage in one service causing problems with others due
to the interdependent design of Windows processes, unlike in the UNIX
world where daemons tend to be designed for fault isolation.
> I forced a reboot from the hypervisor, and the above error message was
> logged in the event viewer about 10 minutes after the crash, probably
> when I tried to copy the same file again. After it did the same thing
> the second time (stopped responding) I cancelled the copy, and
> everything recovered (without rebooting the server). In the end I copied
> the file after hours, and it completed normally. So, I would suspect the
> 72 seconds occurred during that second 'freeze' when the server wasn't
> responding but I patiently waited for it to recover. This DC VM doesn't
> crash, at least I don't think it ever has, except when the san
> crashed/got lost/etc...
Windows event logging is anything but realtime. It will often log an
error that occurred before a reboot long after the system comes back up.
Sometimes the time stamp tells you this, sometimes it doesn't...
>>> That type of event hasn't happened often:
>>> 20140314 11:15:35 72 seconds
>>> 20131124 17:55:48 55 minutes 12 seconds
>>> 20130422 20:45:23 367 seconds
>>> 20130410 23:57:16 901 seconds
Looks like the SAN LUNs were unavailable at these times. The above are
all on the DC Xen host, correct? Did the other Windows VMs log delayed
C: writes at these times?
> As part of all the previous work, every layer has been configured to
> stall rather than return disk failures, so even if the SAN vanishes, no
> disk read/write should be handed a failure, though I would imagine that
> sooner or later Windows should assume no answer is a failure, so
> surprising indeed.
This is "designing for failure" and I recommend against it. If one's
SAN is properly designed and implemented this should not be necessary.
All this does is delay detection of serious problems. Even with a home
brew SAN this shouldn't be necessary. I've done a few boot from SAN
systems and never did anything like you describe here, but not on home
brew hardware, but IBM blades with Qlogic FC HBAs.
...
> I do have an installation of Xymon (actually the older version still
> called Hobbit) which catches things like logs, cpu, memory, disk,
> processes, etc and stores those things as well as alerts. I've never
Ok, good, so you've got some monitoring/collection going on.
> actually setup munin, but I have seen some of what it produces, and I
> did like the level of detail it logged (ie, the graphs I saw logged
> every smart counter from a HDD).
It can be pretty handy.
>>> Also looking on the terminal servers has produced a similar lack of
>>> events, except some auth errors when the DC has crashed recently.
>>
>> This DC is likely the entirety of your problems. This is what I was
>> referring to above about reliability. Why is the DC VM crashing? How
>> often does it crash?
...
>> An unreliable AD/DNS server could explain the vast majority of the
>> problems you're experiencing.
...
> Nope, definitely not using DNS for the SAN config, iscsi, etc.. I'm
> somewhat certain that this isn't a DNS issue.
And at this point I agree. It's not DNS, but most likely the SMB
redirector and kernel on the DC going out to lunch, and the AD service
with them, likely many services on this Windows VM as well.
>>> The newest terminal servers (running Win 2012R2) show this event for
>>> every logon:
>>> Remote Desktop services has taken too long to load the user
>>> configuration from server \\DC for user xyz
>> Slow AD/DNS.
Malfunctioning SMB redirector.
>>> Although the logins actually do work, and it seems mostly normal after
>>> login, except for times when it runs really slow again.
>> Same problem, slow AD/DNS.
Malfunctioning SMB redirector.
...
> However, the good news is that it means I don't need to store the PST
> file with the massive cache on the SMB server, since it doesn't contain
> any data that can't be automatically recovered. I create a small pst
> file on SMB to store contacts and calendars, but all other IMAP cached
> data is stored on the local C: of the terminal server. So, reduced load
> on SMB, but still the same load on iSCSI.
The block IO load is probably small, as your throughput numbers below
demonstrate. The problem here will be CPU load in the VM as Outlook
parses 2-3GB or larger cached mail files.
>>> I'm really not sure, I still don't like the domain controller and file
>>> server being on the same box, and the fact it has been upgraded so many
>>> times, but I'm doubtful that it is the real cause.
>>
>> Being on the same physical box is fine. You just need to get it
>> reliable. And I would never put a DNS server inside a VM if any bare
>> metal outside the VM environment needs that DNS resolution. DNS is
>> infrastructure. VMs are NOT infrastructure, but reside on top of it.
>
> Nope, nothing requires DNS to work.... at least not to boot up, etc...
> Probably Windows needs some DNS/AD for file sharing, but that is a
> higher level issue anyway.
In modern MS networks since Win2000 AD/DNS are required for all hostname
resolution if NETBIOS is disabled across the board, as it should be.
Every machine in the AD domain registers its hostname in DNS. So if AD
goes down, machines can't find one another after their local DNS caches
have expired.
TTBOMK AD is required for locating shares, user/group permissions, etc
in a domain based network. For workgroups this is still handled solely
by the SMB redirector and local machine SAM.
>> For less than the $375 cost of that mainboard you mentioned you can
>> build/buy a box for AD duty, install Windows and configure from scratch.
>> It only needs the one inbuilt NIC port for the user LAN because it
>> won't host the shares/files.
>
> Well, I'll be doing this as a new VM... Windows 2008R2. While I hope
> this will help to split DNS/AD from SMB, I'm doubtful it will resolve
> the issues.
It very well may fix it, based on what the MS knowledge base had to say.
Just make sure all service packs go on immediately, obviously, and that
automatic updates are enabled/scheduled to install in the wee a.m.
The next time this happens, manually stop and restart the Server service
on the DC and see if that breaks the SMB hang. Of course, if the CPU is
racing this may be difficult.
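From an elevated prompt that's simply:
net stop server /y
net start server
The /y answers the prompt about stopping dependent services (Computer
Browser, etc); those will need restarting as well.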
>> You'll export the shares key from the registry of the current SMB
>> server. After you have the new bare metal AD/DNS server up, you'll shut
>> the current one down and never fire it up again because you'll get a
>> name collision with the new VM you are going to build...
>>
>> You build a fresh SMB server VM for file serving and give it the host
>> name of the now shut down DC SMB server. Moving the shares/files to
>> this new server is as simple as mounting/mapping the file share SAN LUN
>> to the new VM, into the same Windows local device path as on the old SMB
>> server (e.g. D:\). After that you restore the shares registry key onto
>> the new SMB server VM.
>>
>> This allows all systems that currently map those shares by hostname and
>> share path to continue to do so. Basic instructions for migrating
>> shares in this manner can be found here:
>>
>> http://support.microsoft.com/kb/125996
>
> Thank you for the pointer, that makes me more confident about copying
> share configuration and permissions. The only difference to the above is
> I plan on creating a new disk, formatting it with win2012R2, and copying
> the data from the old disk across.
I assume you read the caveat about duplicate hostnames. You can't have
both hosts running simultaneously. And AFAIK you can't change the
hostname of a DC after the Windows install. So plan this migration
carefully. You also must have the AD database dumped and imported to
the new host -before- you copy the files from the old "disk" to the new
disk, and before you import the registry shares file. The users and
groups must exist before importing the shares.
> The reason is that the old disk was
> originally formatted by Win NT; it was suggested that it might be a good
> idea to start with a newly formatted/clean filesystem. The concern with
> this is copying of the ACL information on those files, hence some
> testing beforehand will be needed.
"Volumes formatted with previous versions of NTFS are upgraded
automatically by Windows 2000 Setup."
http://technet.microsoft.com/en-us/library/cc938945.aspx
The only on disk format change since NTFS 3.0 (Windows 2000) and NTFS
3.1 (all later Windows version) is the addition of symbolic links, which
you won't be using since you never have and none of your apps require
them. Normally the sole reason to copy the files to a fresh NTFS
filesystem would be to eliminate fragmentation. This filesystem resides
on SSD, where fragmentation effects are non existent.
Thus there is no advantage of any kind to your new filesystem plan.
Mount the current filesystem on the new VM and continue. This will also
ensure the shares transfer procedure works, whereas your copy plan might
break that.
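If you insist on the copy plan anyway, at least use a tool that carries
the security descriptors across, e.g. robocopy (drive letters
illustrative; the users and groups must already exist on the new server
or the copied ACLs will reference orphaned SIDs):
robocopy D:\ E:\ /MIR /COPYALL /R:1 /W:1 /LOG:C:\robocopy.log
/COPYALL copies data, attributes, timestamps, the NTFS ACLs, owner, and
auditing info.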
>>> On Thursday night after the failed RAID5 grow, I decided not to increase
>>> the allocated space for the two new terminal servers (in case I caused
>>> more problems), and simply deleted a number of user profiles on each
>>> system. (I assumed the roaming profile would simply copy back when the
>>> user logged in the next day). However, the roaming profile didn't copy,
>>> and windows logged users in with a temp profile, so eventually the only
>>> fix was to restore the profile from the backup server. Once I did this,
>>> the user could login normally, except the backup doesn't save the pst
>>> file, so outlook was forced to re-download all of the users email from
>>> IMAP.
>> ...
>>> This then caused the really, really, really bad performance across
>>> the SAN,
>>
>> Can you quantify this? What was the duration of this really, really,
>> really bad performance? And how do you know the bad performance existed
>> on the SAN links and not just the shared LAN segment? You don't have
>> your network links, or systems, instrumented, so how do you know?
>
> Well, running an ls from a Linux VM CLI doesn't rely on the user LAN
> segment... (other than the ssh connection).
Unless as I mentioned up above you're using LDAP for global auth/single
sign on in this Linux VM.
> I do collect and graph a lot of numbers, though generally I find
> graphs don't provide the fine-grained values that would be most useful,
> but I keep trying to collect more information in the hope that it might
> tell me something eventually...
Munin provides trends. It can assist in proactive monitoring, but can
also assist in troubleshooting when things break or performance drops,
often allowing one to quickly zero in on which daemon or hardware is
causing problems. For this benefit one need be familiar with the data,
which requires looking at one's Munin graphs regularly, as in daily,
every other day, etc.
> For example, I am graphing the "Backlog" and "ActiveTime" on each
> physical disk, DRBD, and each LV in san1. At the time of my tests, when
> I said I did an "ls" command on this test VM, I see BackLog values on
> the LV for the VM of up to 9948, which, AFAIK, means a 10-second delay.
> This was either consistently around 10 seconds for a number of minutes,
> or varied much higher and lower to produce this average/graph figure.
I can tell you this right now--any value for a "backlog" metric relating
to block devices is not likely to be elapsed time. It's going to be
outstanding requests, pages, kbytes, etc. And if it is a time value
then your LVM setup is totally fubar'ed.
Is this a Xymon or Munin graph you're referring to? I can't find any
information on LVM metrics captured for either, because as is typical
with most FOSS, the documentation is non-existent. It would really help
if you'd:
A. pastebin or include the raw data and heading quantities
B. look up "backlog" and "activetime" in the documentation that came
with the package(s) you installed. That way we don't have to guess
as to the meaning of "backlog" and what the value quantity is
> Using these same graphs, I can see the much higher than normal BackLog
> and ActiveTime values for the two terminal servers that I expected were
> re-caching all the IMAP emails. So again, there is some correlation to
> iSCSI load and the issues being seen.
No, the correlation is simply between application use and IO. The
amount of IO isn't causing the problems. The cause of those lies
elsewhere, probably in what I described above in the KB references, or
similar.
Your md arrays are capable of approximately 6*50,000 = 300,000 4KB read
IOPs, and 250,000 write. A backlog of 10K open/outstanding LVM read
pages is insignificant as it will be drained in 0.03 seconds, writes in
0.04 seconds.
> In addition, I can see much higher (at least three times higher) values
> on the SMB/DC server.
And this tells you (and me) absolutely squat without knowing the meaning
of "backlog" and the quantity it is providing. You're walking around in
the dark without that information.
> If I look at read/write sectors/sec graphs, then I can see:
> 1) Higher than normal read activity on the IMAP VM
Because Outlook is syncing. Nothing abnormal about this.
> 2) Significantly higher than normal write activity on the two Terminal
> Servers between 10am (when I fixed the user profiles) and 3pm.
Again, simply users doing work.
> 3) Higher than normal read/write activity on the SMB/DC between 9am and
> 12pm, but much lower than backup read rates for example.
Define "normal" and then "higher". User loads fluctuate. If something
happens to break when user load is "higher than normal", it's not
because your storage infrastructure can't handle the load. It's because
some piece of software, 99% sure to be MS, is broken, and that's why it
can't handle the load.
> Looking at the user LAN, I also take the values from the hypervisor for
> each network interface. During testing I can see the new win2008R2
> server was doing 28Mbps receive and 21Mbps transmit. Though given the
3.5 MB/s and 2.6 MB/s, which is nothing.
> intermittent nature of my testing, it may not have been long enough to
> generate accurate average values that can be seen on the graphs, even
> though the rates somewhat match what I was reporting, around 20 to
> 25MB/s transfer rates.
Are you talking about the same thing? You just switched bandwidths by a
factor of 8, bits to bytes. In either case, that amount of traffic is
nothing given your hardware horsepower.
> During the day (Friday) I can see much higher than normal activity on
> the mail server, up to around 5MB/s peak value.
> Again, the two new terminal servers show TX rates up to 3.4MB/s and
> 2.5MB/s, which is a lot higher than a "normal" work day peak, and also
> these high traffic levels were consistent over the time periods above
> (10am to 3pm).
So what were your users doing? Maybe the extra traffic was a single
user doing data transformations or something. Who knows.
> Finally, on the SMB/DC I see RX traffic peaking at 4MB/s and TX at
> 3MB/s, but other than those peaks (probably when I was copying that PST
> file that caused the "crash") traffic levels look similar to other days.
Just to be thorough, run ifconfig on the DC hypervisor and look at
errors, dropped packets, overruns, frame errors, carrier errors, and
collisions for the user NIC port and the two SAN NIC ports. Also do the
same for all 8 ports on SAN1. Check the switch for any errors on the
ports that the DC box connects to. Check the user switch for errors on
the DC box port.
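Something along these lines on each Linux host will surface most of it
(interface names illustrative; 'ethtool -S' output varies by driver, and
the Intel NICs expose far more counters than the Realteks):
for i in eth0 eth1 eth2; do
    echo "== $i =="
    ifconfig $i | grep -E 'errors|dropped|overruns|collisions'
    ethtool -S $i | grep -iE 'err|drop' | grep -v ': 0$'
done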
If the PC on which you had to reload the Outlook data is connected to
the same user switch as the DC box, check stats/errors on that port as
well, and check for network related errors in the event log. If you
find any corresponding to that time period, I'd replace the patch cables
on both ends, especially if that user has reported any odd problems,
more so than other users. If you see a lot of errors, replace the NIC
as well.
The reason I mention this is that I've seen low end intelligent switches
get "knocked out" temporarily when a client NIC goes bad, a VRM fails,
etc, and it starts putting too much voltage on the wire. Some switches
aren't designed to drain the extra voltage to ground, or they simply
aren't grounded and can't, and they literally "lock up" until the client
stops transmitting.
You said that copying the PST down from the DC went fine, but as soon as
you starting copying the repaired version back up to the DC, everything
went to hell.
Makes ya go "hmmm" doesn't it? This can happen even if the client is on
an upstream switch as well, but it's far less likely in that case, as it
usually just makes the upstream switch go brain dead for a bit.
> I also graph CPU load. This is the number of seconds of CPU time given
> to the VM divided by the interval. So if the VM was given 20 seconds of
> CPU time in the past minute, then we record a value of 0.33; however, we
> should also remember that a value of 4.0 would be expected for a VM with
> 4 vCPUs. On the Friday, no VM was especially busy, the mail server was
> about the same as normal, and still below 0.4, and it has 2 vCPUs.
>
> Also, I graph the "disk" IO performed by each VM, as reported by the
> hypervisor, in bytes read/write per second.
> During my late night Friday testing, I can see the test win2008R2 VM
> peaking at 185MB/s write, I don't recall what I did to generate the
> traffic, I think I was copying a file from its C: to the same drive. So
> the read was probably cached, but re-writing the same file multiple
> times generated a lot of write load.
But this did not involve the DC, correct?
> On the Friday, I again see the high disk IO for the two new terminal
> servers, higher than the normal load. Of course, for most other machines
> their peak is lower than the backup load peak, but for these two the
> backup is done from LVM snapshots, so the load doesn't show up on the VM
> at all. (BTW, due to the load that LVM snapshots seem to place, the
> backup system takes a snapshot, does the backup, and immediately removes
> the snapshot when done). All backups are done at night time, to avoid
> any issues with users etc.
As is usually done.
> I also have MRTG graphs for each port on each switch.
> I can see that for each physical machine (hypervisor) it is balancing
> the traffic evenly across both iSCSI links. Both send and receive
> traffic is equal across the pair of links.
Which is great.
> Also, for san1, I can see the switch reports IN traffic (which would be
> outbound from san1) is not evenly balanced across all 8 links, but there
> is definite amounts of traffic across all 8 links. I can also see OUT
This is because balance-alb is transmit load adaptive. It only
transmits from more than one link when packet load is sufficiently high.
This data confirms what I've said from the time I got involved in this:
that you don't *need* anything more than a dual GbE iSCSI NIC in each
SAN server. If you'd have done that, and used straight scsi-multipath,
you'd have had perfectly even scaling across both ports this entire past
year, and plenty of headroom.
Quite frankly, after seeing the bandwidth numbers you're posting, I'd
cancel the order or send the 10 GbE gear back for a refund. It's
absolutely unnecessary, total overkill.
Instead, ditch the bonding, go straight scsi-multipath as I recommended
last year. Use two ports of each quad NIC for iSCSI. Export each LUN
on one port of each NIC going in a round robin fashion, wrapping back
around, while separating any "heavy hitter" LUNs, such as the file
share, on different ports. With the remaining 2 ports on each quad HBA,
use x-over cables and connect the two SAN servers. Configure a
balance-rr bond of these 4 ports on each server.
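On the initiator side that's nothing exotic, something like the
following per LUN (the IQN and portal IPs are illustrative):
iscsiadm -m discovery -t sendtargets -p 10.0.16.1
iscsiadm -m node -T iqn.2014-03.com.example:san1.lun1 -p 10.0.16.1 --login
iscsiadm -m node -T iqn.2014-03.com.example:san1.lun1 -p 10.0.17.1 --login
multipath -ll   # the LUN should appear once, with two paths under it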
This configuration will yield 400 MB/s peak full duplex dedicated iSCSI
throughput at the SAN server, and 400 MB/s peak dedicated DRBD
throughput. Currently you have 800 MB/s shared one way for both iSCSI
and DRBD, but only 100 MB/s the other way.
This will also give each Xen client 200 MB/s peak full duplex iSCSI
throughput, about 8x what they're currently using.
This is pretty much exactly what I recommended a year ago. You rejected
it because in your mind it lacked "symmetry", as IO wasn't
"automatically" balanced. Well, your way has proven that "automatic
balancing" doesn't work. And it's proven a 6:1 imbalance for Xen writes
to SAN, all through one SAN port. My proposed multipath configuration
gets you full bandwidth on all links both ways, 4 write links and peak
200 MB/s write per Xen host. Which is again about 8x what your iSCSI
clients actually use.
And it costs nothing but your time.
> traffic (inbound to san1) is 0 on 5 of the links, and the large majority
> of the traffic is on one link (peaking at 40Mbps yesterday during normal
> work day load, and 75Mbps during backup load last night). The other two
> links with load peaked at 18Mbps yesterday, and didn't do very much load
> during the backups being run last night (actually, basically zero).
> Today's peak so far for these two lines is 30Mbps, and the single line
> peak is 30Mbps, all three at the same time.
Except for the fact that the SAN boxen are taking all the Xen outbound
down a single port of 8. This is due to the ARP problem with
balance-alb receive load balancing I previously described.
> One issue I have is that I don't necessarily know which physical machine
> was hosting which VM at what time, although I know I always put the
> DC/SMB server on the same physical box. So this makes it more difficult
> to match the "user" lan traffic with the VM, though the other graphs
> above from the hypervisor should be accurate for network traffic anyway.
> Also, the MRTG graphs are only every 5 minutes, while the hypervisor
> based graphs are 1 minute averages, so MRTG is a lot "coarser".
That's not critical because you know where the DC VM is. And we know
that network load isn't the issue, unless you have a bad/marginal NIC.
>> Given that you've had continuous problems with this particular mini
>> datacenter, and the fact that you don't document problems in order to
>> track them, you need to instrument everything you can. Then when
>> problems arise you can look at the data and have a pretty good idea of
>> where the problems are. Munin is pretty decent for collecting most
>> Linux metrics, bare metal and guest, and it's free:
>>
>> http://munin-monitoring.org/
>>
>> It may help identify problem periods based on array throughput, NIC
>> throughput, errors, etc.
>
> Thanks, I'll take a look at installing it, will probably start with my
> desktop pc, and then extend to san2 and one of the hypervisor boxes,
> before extending to san1 and the rest. I'm not sure where I'll put the
> "master" node, or how much it will overlap with the existing stats I'm
> collecting, but it certainly promises to help find performance issues....
Munin master (debian package munin) is installed on a system with
a web server which provides the munin interface and graphs. Munin 1.4
worked great on lighttpd. I tried 2.0 and never got it working. Last I
knew it required Apache because that's the only platform they developed
it for. That was over a year ago so it may work with lighttpd now,
maybe nginx and others.
munin-node is a tiny daemon that runs on each Linux host to be
monitored. It collects the data and sends it to the munin master.
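A rough sketch of the setup (package names from memory, so verify
against the Debian archive before relying on them):
apt-get install munin        # on the master, alongside the web server
apt-get install munin-node   # on every host to be monitored
Then list each host in /etc/munin/munin.conf on the master:
[san1]
    address x.x.16.1
and allow the master's IP in each node's /etc/munin/munin-node.conf:
allow ^x\.x\.16\.250$
(The x.x addresses are placeholders, matching the convention in your
config snippets.)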
...
>>> Also, copying from a smb share on a
>>> different windows 2008 VM (basically idle and unused) showed equally bad
>>> performance copying to my desktop (linux), etc.
Define "equally bad" in this context. All of the Realtek GbE NICs I've
used have topped out at ~35 MB/s from Windows to Windows via SMB shares,
and it wasn't consistently that high. It would jump from 12 to 35, to
22.5, to 28, to 12, etc. This in bare metal. Surely it's worth in a
VM. It you got less than 10MB/s for the entire copy I'd say something
is wrong, other than the Realtek NICs being crap to begin with.
>>> So, essentially the current plans are:
>>> Install the Intel 10Gb network cards
>>> Replace the existing 1Gbps crossover connection with one 10Gbps
>>> connection
>>> Replace the existing 8 x 1Gbps connections with 1 x 10Gbps connection
>>
>> You can't fix these problems by throwing bigger hardware at them.
>> Switching to 10 GbE links might fix your current "bad performance" by
>> eliminating the ALB bonds, or by eliminating ports that are currently
>> problematic but unknown, see link speed/duplex below. However, as I
>> recommended when you acquired the quad port NICs, you shouldn't have
>> used bonds in the first place. Linux bonding relies heavily on ARP
>> negotiation and the assumption that the switch properly updates its MAC
>> routing tables and in a timely manner. It also relies on the bond
>> interfaces having a higher routing priority than all the slaves, or that
>> the slaves have no route configured. You probably never checked nor
>> ensured this when you setup your bonding.
>
> I'm not using bonding on the hypervisors, they are using multipath to
Yes, I knew that.
> make use of each link. I'm using bonding on the san1/san2 server only,
And I recalled this as well.
> which is configured as:
> iface bond0 inet static
> address x.x.16.1
> netmask 255.255.255.0
> slaves eth2 eth3 eth4 eth5 eth6 eth7 eth8 eth9
>>>> bond-mode balance-alb
Ditch the 10 GbE idea. Ditch alb. Go straight scsi-multipath.
> bond-miimon 100
> bond-updelay 200
>>>> mtu 9000
If you have jumbos enabled on the user network, with so many cheap
Realtek NICs, I'd think this may be involved in your stability issues.
If enabled, disable it for a couple of months. Both throughput and
stability may increase. Cheap Ethernet ASICs and drivers often don't
handle jumbo frames well.
> This is slightly different to what you suggested, from memory, you
> suggested I should have two bond groups on each of san1/san2 of 4
I believe what I originally suggested was 2 Intel GbE iSCSI ports on the
SAN server, 2 on each Xen client, and two on each server for dedicated
DRBD traffic. I suggested you use two small switches with each host
port connected to a different switch, as this allows balance-rr to fully
utilize both ports in both directions for maximum bandwidth. You shot
this suggestion down because you had already ordered a new 48? port
switch, and you were convinced you needed more aggregate Ethernet
bandwidth at the SAN servers, not simply equal to that of one Xen client
(but as it turns out, the bandwidth data you provided in this thread
shows a peak of 75 MB/s, in which case 2 ports on the SAN server would
have been more than sufficient, with one simply for redundancy).
So I then suggested two quad NICs in the servers, and assisted you as
you tried a bunch of different bond modes and scsi-multipath combos.
The last recommendation I made, due to the difficulties you had with
bonding, was to simply use straight scsi-multipath, exporting your LUNs
appropriately across the 8 ports, as this would have guaranteed a peak
of 200 MB/s full duplex per Xen client. You then made the beginner's
argument that two Xens could each do a big transfer and each only get
half bandwidth, or 50 MB/s per port. You tried exporting all LUNs on
all ports and doing multipath across all SAN ports to achieve what you
considered "balanced IO". I don't recall if that worked or not. Even if
it did, you needed bonding for the DRBD links. It was at that point
that you decided to create bonds so DRBD would get multiple links, and
you exported your iSCSI LUNs atop the bonds for the Xen hosts.
I still don't know exactly what your current setup is. I thought it was
2x 4 port alb bonds. But below you seem to indicate it's something
else. What is the current bonding/iSCSI setup on the servers?
> connections each, and each physical server should have one ethernet
> connection to each bond group. Changing that would probably improve the
> problem mentioned above with almost all the inbound (san1) traffic using
> the one link.
Ditching bonding for pure multipath is the solution. Always has been.
You didn't like the idea before because it's not "symmetrical" in your
mind. It doesn't have to be. Just do it. Afterward maybe you'll begin
to understand why it works so well.
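For reference, a minimal sketch of the pure multipath client side
(target IQN and portal addresses are illustrative): log into the same
target once per SAN portal with open-iscsi,
iscsiadm -m discovery -t sendtargets -p x.x.16.1
iscsiadm -m discovery -t sendtargets -p x.x.16.2
iscsiadm -m node --login
then let multipath-tools fold the sessions into one device, e.g. in
/etc/multipath.conf:
defaults {
user_friendly_names yes
path_grouping_policy multibus
}
multipath -ll should then show each LUN as a single dm device with
multiple active paths, spreading IO across them with no bonding
anywhere in the path.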
> None of the slave interfaces are configured at all, so I doubt there is
> any issue with routing or interface priority.
Just one: ARP negotiation.
>> It's possible that due to bonding issues all of your SAN1 outbound
>> iSCSI packets are going out only two of the 8 ports, and it's possible
>> that all the inbound traffic is hitting a single port.
> It looks like (from the switch mrtg graphs) that outbound balancing is
> working properly, but inbound balancing is very poor, almost just a
> single link.
See directly above. You can read the primer again, and again, and
again, as I did, still without fully understanding what needs to be
configured to make the ARP negotiation work. Or, again, just switch to
pure multipath and you're done.
>> It's also
>> possible that the master link in either bond may have dropped link
>> intermittently, dropped link speed to 100 or 10, or is bouncing up and
>> down due to a cable or switch issue, or may have switched from full to
>> half duplex. Without some kind of monitoring such as Munin setup you
>> simply won't know this without manually looking at the link and TX/RX
>> statistic for every port with ifconfig and ethtool, which, at this point
>> is a good idea. But, if any links are flapping up and down at irregular
>> intervals, note they may all show 1000 FDX when you check manually with
>> ethtool, even though they're dropping link on occasion.
>
> The switch logs don't show any links dropped or changing speed since at
> least Monday night when I last rebooted one of the san servers. The
> switch also logs via syslog to the mail server, logs there don't show
> any unexpected link drops or speed changes etc.
How about dropped frames, CRC errors, etc?
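Something like:
ethtool -S eth2 | grep -iE 'err|drop|crc'
ip -s link show eth2
run against each port (plus the per-port counters on the switch) would
show whether frames are being corrupted or discarded even while the
links report a solid 1000 FDX.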
...
>> Last I recall you had setup two ALB bonds of 4 ports each, with the
>> multipath mappings of LUNS atop the bonds--against my recommendation of
>> using straight multipath without bonding. That would have probably
>> avoided some of your problems.
>
> I might be wrong, but from memory we had agreed that using 2 groups of 4
> bonded channels on the SAN1/2 side was the best option. I never did get
> around to doing that, because it seemed to be working well enough as is,
> and I didn't want to keep changing things (ie, breaking things and then
> trying to fix them again). Things were never really fully resolved, they
> were just good enough, but the mess on Friday means that now things need
> to be pretty much perfect. I think replacing this group of 8 bonded
> connections with a single 10Gbps connection should solve this even
> better than using 2 groups of 4 bonds, or any other option. I assume I
My 20+ years of experience disagrees. You may have heard the drum I've
been banging for a while in this reply, the pure scsi-multipath drum.
Ditch bonds, do that, and all your iSCSI IO bandwidth issues are
resolved. Every Xen host will get 200 MB/s full duplex. Your
statistics say your peak throughput is 75 MB/s for all hosts aggregate.
So it's really hard to screw up the LUN assignments so badly that it
would make performance worse than it is now.
> will keep the 2 multipath connections on the physical boxes the same as
> current, simply removing the bond group on the san, configuring the new
> 10Gbps port with the same IP/netmask as previous, and everything should
> work nicely.
Again, you're hitting 75 MB/s peak with your real workloads. A single
GbE is sufficient. You currently have dual 200 MB/s hardware in the Xen
hosts, and 800 MB/s in the servers. Why do you need 1 GB/s links? You
don't. You simply need to reconfigure what you have so it works
properly. And it's free.
BTW, what's the peak aggregate data rate on the DRBD links?
>> Anyway, switching to 10 GbE should solve all of this as you'll have a
>> single interface for iSCSI traffic at the server, no bond problems to
>> deal with, and 200 MB/s more peak potential bandwidth to boot, even
>> though you'll never use half of it, and then only in short bursts.
>
> Agreed.
Seeing your bandwidth numbers for the first time changed my mind. You'd
be insane to spend any money on hardware to fix this, when you already
have quality gear and over 10 times the throughput you need.
I should have asked you for numbers earlier. Since you hadn't offered
them I assumed you weren't gathering that info.
...
> Done, I definitely couldn't rely on DNS being provided by the VM as you
> noted. Generally Linux machines (that I configure) don't rely on DNS for
> anything, I don't change IP addresses on servers enough to make that
> even slightly useful (would anyone?).
At least you're doing something right. ;) (heavily tongue in cheek)
...
> OK, so I'll just add the dual port 10Gbps network card, and remove the 2
> quad port 1Gbps cards from each server. That will mean there is only two
> cards installed in each san system. I really don't think it is
> worthwhile right now, but I may re-use these cards by installing one
> quad port card into 4 of the physical machines, and use 2 x dual port
> cards in the other 4, and increase the iSCSI to 4 multipath connections
> on each physical. That is all in the future though, for now I just want
> to obtain at least 50MB/s (minimum, I should expect at least 100MB/s)
> performance for the VM's, consistently....
I don't get the disconnect here. You want 50 MB/s minimum, you already
show a max of 75 MB/s in your stats, you desire 100 MB/s capability.
Yet you already have 200 MB/s hardware, and you're talking about buying
1 GB/s hardware, which is ten times your requirement...
...
>> If you go ahead and replace the server mobos, I'm buying a ticket,
>> flying literally half way around the world, just to plant my boot in
>> your arse. ;)
I should add the 10 GbE parts in here as well. Your numbers confirm
what I suspected back when you went 2x quad GbE: your needs aren't
anywhere near this level of throughput.
> I'll save you most of the trouble, I'll be in the USA next month :)
> however, I promise I won't get any new motherboards for now :)
>
>>> Thanks again for all your advice, much appreciated.
>> You're welcome. And you're lucky I'm not billing you my hourly rate. :)
>>
>> Believe it or not, I've spent considerable time both this year and last
>> digging up specs on your gear, doing Windows server instability
>> research, bonding configuration, etc, etc. This is part of my "giving
>> back to the community". In that respect, I can just idle until June
>> before helping anyone else. ;)
>
> Absolutely, and I greatly appreciate it all!
Well let's hope you appreciate the advice above, and actually follow it
this time. :) You'll be glad you did.
Cheers,
Stan
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Growing RAID5 SSD Array
2014-03-22 19:39 ` Stan Hoeppner
@ 2014-03-25 13:10 ` Adam Goryachev
2014-03-25 20:31 ` Stan Hoeppner
0 siblings, 1 reply; 16+ messages in thread
From: Adam Goryachev @ 2014-03-25 13:10 UTC (permalink / raw)
To: stan, linux-raid@vger.kernel.org
I'll respond to the other email later on, but in between, I've found
something else that seems just plain wrong.
So, right now, I've shut down most of the VMs (just one Linux VM left,
which should be mostly idle since it is after 11pm local time). I'm
trying to create a duplicate copy of one LV to another as a backup (in
case I mess it up). I've shut down DRBD, so we are operating
independently (not that there is any change if DRBD is connected), and
I'm running on the storage server itself (so no iscsi or network issues).
So, two LV's:
LV                                    VG   Attr     LSize    Pool Origin Data%  Move Log Copy%  Convert
backup_xptserver1_d1_20140325_224311  vg0  -wi-ao-- 453.00g
xptserver1_d1                         vg0  -wi-ao-- 452.00g
running the command:
dd if=/dev/vg0/xptserver1_d1 of=/dev/vg0/backup_xptserver1_d1_20140325_224311
from another shell I run:
while pidof dd > /dev/null;do kill -USR1 `pidof dd`;sleep 10;done
dd shows this output:
99059692032 bytes (99 GB) copied, 2515.43 s, 39.4 MB/s
99403235840 bytes (99 GB) copied, 2525.45 s, 39.4 MB/s
99817538048 bytes (100 GB) copied, 2535.47 s, 39.4 MB/s
100252660224 bytes (100 GB) copied, 2545.49 s, 39.4 MB/s
iostat -dmx 1 shows this output:
sda - sdg are the RAID5 SSD drives, single partition, used by md only
dm-8 is the source for the dd copy
dm-17 is the destination of the dd copy,
dm-12 is the Linux VM which is currently running...
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdg 957.00 6767.00 930.00 356.00 8.68 27.65 57.85 0.65 0.50 0.16 1.39 0.37 48.00
sdd 956.00 6774.00 921.00 313.00 8.69 27.50 60.06 0.26 0.21 0.08 0.60 0.17 20.80
sda 940.00 6781.00 927.00 326.00 8.65 27.57 59.20 0.28 0.22 0.09 0.60 0.17 20.80
sdf 967.00 6768.00 927.00 320.00 8.70 27.50 59.46 0.29 0.23 0.12 0.55 0.16 20.00
sde 943.00 6770.00 933.00 369.00 8.69 27.71 57.26 0.74 0.57 0.16 1.60 0.44 57.20
sdc 983.00 6790.00 937.00 317.00 8.86 27.55 59.46 1.58 1.27 0.71 2.90 0.49 61.60
sdb 966.00 6813.00 929.00 313.00 8.76 27.57 59.92 1.20 0.97 0.34 2.84 0.49 61.20
md1 0.00 0.00 12037.00 42030.00 56.42 164.04 8.35 0.00 0.00 0.00 0.00 0.00 0.00
drbd2 0.00 0.00 12034.00 41989.00 56.41 164.02 8.36 177.73 3.31 0.46 4.13 0.02 91.60
dm-8 0.00 0.00 5955.00 0.00 23.26 0.00 8.00 4.43 0.74 0.74 0.00 0.01 6.40
dm-12 0.00 0.00 254.00 5.00 10.39 0.02 82.38 0.28 1.08 1.01 4.80 0.59 15.20
dm-17 0.00 0.00 5813.00 41984.00 22.71 164.00 8.00 174.87 3.65 0.15 4.13 0.02 100.00

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdg 1378.00 156.00 1672.00 8.00 18.20 0.69 23.03 0.15 0.09 0.08 2.00 0.08 12.80
sdd 1370.00 169.00 1667.00 9.00 18.15 0.75 23.09 0.13 0.08 0.07 1.78 0.08 12.80
sda 1365.00 169.00 1655.00 9.00 18.09 0.75 23.18 0.14 0.08 0.07 1.33 0.08 13.20
sdf 1377.00 156.00 1672.00 6.00 18.23 0.69 23.09 0.14 0.09 0.08 2.00 0.08 12.80
sde 1374.00 159.00 1657.00 8.00 18.10 0.71 23.13 0.16 0.10 0.09 2.00 0.10 16.40
sdc 1365.00 146.00 1666.00 10.00 18.12 0.75 23.07 0.20 0.12 0.10 3.20 0.10 16.80
sdb 1375.00 137.00 1665.00 10.00 18.16 0.75 23.13 0.23 0.14 0.11 4.00 0.12 19.60
md1 0.00 0.00 21218.00 820.00 126.86 3.20 12.09 0.00 0.00 0.00 0.00 0.00 0.00
drbd2 0.00 0.00 21221.00 820.00 126.88 3.20 12.09 10.17 0.46 0.25 5.98 0.03 67.60
dm-8 0.00 0.00 9984.00 0.00 39.00 0.00 8.00 4.47 0.45 0.45 0.00 0.00 2.80
dm-12 0.00 0.00 1166.00 0.00 48.54 0.00 85.25 0.20 0.17 0.17 0.00 0.09 10.00
dm-17 0.00 0.00 10061.00 819.00 39.30 3.20 8.00 4.38 0.51 0.06 5.99 0.06 63.60

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdg 1472.00 0.00 1681.00 0.00 14.70 0.00 17.90 0.14 0.08 0.08 0.00 0.08 13.60
sdd 1472.00 0.00 1668.00 0.00 14.64 0.00 17.98 0.12 0.07 0.07 0.00 0.07 11.20
sda 1472.00 0.00 1673.00 0.00 14.66 0.00 17.95 0.12 0.07 0.07 0.00 0.07 11.60
sdf 1472.00 0.00 1680.00 0.00 14.69 0.00 17.91 0.13 0.08 0.08 0.00 0.07 12.40
sde 1472.00 0.00 1685.00 0.00 14.71 0.00 17.88 0.12 0.07 0.07 0.00 0.07 11.60
sdc 1478.00 0.00 1687.00 0.00 14.72 0.00 17.87 0.12 0.07 0.07 0.00 0.07 11.20
sdb 1487.00 0.00 1679.00 0.00 14.69 0.00 17.92 0.14 0.08 0.08 0.00 0.08 13.20
md1 0.00 0.00 22182.00 0.00 103.29 0.00 9.54 0.00 0.00 0.00 0.00 0.00 0.00
drbd2 0.00 0.00 22244.00 0.00 103.66 0.00 9.54 5.76 0.26 0.26 0.00 0.03 59.60
dm-8 0.00 0.00 10945.00 0.00 42.75 0.00 8.00 5.74 0.50 0.50 0.00 0.00 4.00
dm-12 0.00 0.00 446.00 0.00 18.51 0.00 84.99 0.07 0.15 0.15 0.00 0.07 3.20
dm-17 0.00 0.00 10836.00 0.00 42.33 0.00 8.00 0.58 0.05 0.05 0.00 0.05 57.60

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdg 1441.00 13.00 1684.00 4.00 17.76 0.06 21.62 0.14 0.08 0.08 1.00 0.08 14.00
sdd 1421.00 32.00 1676.00 9.00 17.65 0.16 21.64 0.15 0.09 0.08 1.33 0.09 14.40
sda 1426.00 28.00 1682.00 5.00 17.69 0.13 21.63 0.13 0.08 0.08 1.60 0.07 12.40
sdf 1429.00 14.00 1671.00 6.00 17.66 0.08 21.66 0.11 0.07 0.07 0.00 0.06 10.80
sde 1442.00 13.00 1686.00 7.00 17.77 0.08 21.59 0.14 0.08 0.08 0.57 0.08 13.20
sdc 1405.00 28.00 1673.00 8.00 17.60 0.14 21.61 0.16 0.10 0.09 1.50 0.09 14.40
sdb 1408.00 18.00 1672.00 6.00 17.64 0.09 21.64 0.16 0.10 0.09 2.67 0.09 15.60
md1 0.00 0.00 21539.00 17.00 123.00 0.55 11.74 0.00 0.00 0.00 0.00 0.00 0.00
drbd2 0.00 0.00 21477.00 17.00 122.64 0.55 11.74 4.61 0.21 0.21 1.18 0.03 68.40
dm-8 0.00 0.00 10176.00 0.00 39.75 0.00 8.00 3.26 0.35 0.35 0.00 0.00 3.20
dm-12 0.00 0.00 1031.00 17.00 42.88 0.55 84.87 0.14 0.13 0.12 1.18 0.08 8.80
dm-17 0.00 0.00 10262.00 0.00 40.09 0.00 8.00 0.64 0.06 0.06 0.00 0.06 64.00

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdg 1472.00 0.00 1643.00 1.00 13.53 0.00 16.85 0.17 0.10 0.10 4.00 0.09 15.60
sdd 1472.00 0.00 1653.00 2.00 13.57 0.00 16.79 0.14 0.08 0.08 0.00 0.08 13.60
sda 1472.00 0.00 1654.00 1.00 13.57 0.00 16.79 0.14 0.08 0.08 0.00 0.07 12.40
sdf 1472.00 0.00 1656.00 1.00 13.58 0.00 16.78 0.11 0.07 0.07 4.00 0.07 11.20
sde 1472.00 0.00 1657.00 1.00 13.58 0.00 16.77 0.12 0.07 0.07 4.00 0.07 11.20
sdc 1472.00 0.00 1662.00 1.00 13.60 0.00 16.75 0.16 0.10 0.09 12.00 0.09 14.80
sdb 1472.00 0.00 1654.00 2.00 13.57 0.00 16.79 0.17 0.10 0.09 10.00 0.10 16.80
md1 0.00 0.00 21882.00 1.00 94.99 0.00 8.89 0.00 0.00 0.00 0.00 0.00 0.00
drbd2 0.00 0.00 21882.00 1.00 94.99 0.00 8.89 6.72 0.31 0.31 0.00 0.03 65.20
dm-8 0.00 0.00 10753.00 0.00 42.00 0.00 8.00 6.02 0.56 0.56 0.00 0.00 4.40
dm-12 0.00 0.00 274.00 1.00 10.47 0.00 78.02 0.10 0.38 0.38 0.00 0.13 3.60
dm-17 0.00 0.00 10849.00 0.00 42.38 0.00 8.00 0.63 0.06 0.06 0.00 0.06 62.80

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdg 1518.00 0.00 1669.00 0.00 12.45 0.00 15.28 0.10 0.06 0.06 0.00 0.06 9.20
sdd 1523.00 0.00 1651.00 0.00 12.40 0.00 15.38 0.13 0.08 0.08 0.00 0.07 12.00
sda 1518.00 0.00 1651.00 0.00 12.38 0.00 15.36 0.15 0.09 0.09 0.00 0.09 14.40
sdf 1528.00 0.00 1669.00 0.00 12.49 0.00 15.32 0.12 0.07 0.07 0.00 0.07 12.00
sde 1517.00 0.00 1655.00 0.00 12.39 0.00 15.34 0.13 0.08 0.08 0.00 0.07 12.40
sdc 1534.00 0.00 1662.00 0.00 12.48 0.00 15.38 0.13 0.08 0.08 0.00 0.08 13.20
sdb 1534.00 0.00 1651.00 0.00 12.44 0.00 15.43 0.13 0.08 0.08 0.00 0.08 12.40
md1 0.00 0.00 22277.00 0.00 87.02 0.00 8.00 0.00 0.00 0.00 0.00 0.00 0.00
drbd2 0.00 0.00 22277.00 0.00 87.02 0.00 8.00 6.06 0.27 0.27 0.00 0.03 68.40
dm-8 0.00 0.00 11137.00 0.00 43.50 0.00 8.00 5.44 0.49 0.49 0.00 0.01 5.60
dm-12 0.00 0.00 1.00 0.00 0.00 0.00 8.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-17 0.00 0.00 11121.00 0.00 43.44 0.00 8.00 0.63 0.06 0.06 0.00 0.06 63.20

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdg 976.00 6175.00 968.00 327.00 8.77 25.27 53.84 0.60 0.47 0.13 1.47 0.35 45.20
sdd 973.00 6130.00 949.00 292.00 8.64 24.95 55.43 0.26 0.21 0.10 0.58 0.17 20.80
sda 979.00 6176.00 952.00 292.00 8.68 25.13 55.65 0.26 0.21 0.11 0.53 0.15 19.20
sdf 978.00 6191.00 958.00 294.00 8.73 25.20 55.49 0.28 0.22 0.11 0.57 0.16 19.60
sde 976.00 6189.00 968.00 330.00 8.79 25.33 53.82 0.67 0.51 0.14 1.61 0.41 52.80
sdc 972.00 6202.00 947.00 292.00 8.69 24.93 55.58 1.43 1.15 0.54 3.10 0.46 56.40
sdb 974.00 6174.00 954.00 293.00 8.70 24.82 55.05 1.19 0.95 0.29 3.10 0.46 57.20
md1 0.00 0.00 12427.00 38133.00 56.70 148.83 8.33 0.00 0.00 0.00 0.00 0.00 0.00
drbd2 0.00 0.00 12424.00 38092.00 56.69 148.80 8.33 159.95 3.14 0.49 4.00 0.02 89.60
dm-8 0.00 0.00 6144.00 0.00 24.00 0.00 8.00 5.21 0.85 0.85 0.00 0.01 6.80
dm-12 0.00 0.00 231.00 0.00 9.06 0.00 80.31 0.15 0.64 0.64 0.00 0.36 8.40
dm-17 0.00 0.00 6030.00 38093.00 23.55 148.80 8.00 157.57 3.48 0.13 4.00 0.02 99.20

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdg 914.00 6081.00 921.00 324.00 8.36 24.89 54.70 0.58 0.46 0.17 1.30 0.35 43.20
sdd 925.00 6079.00 914.00 286.00 8.38 24.75 56.53 0.24 0.20 0.06 0.63 0.16 18.80
sda 940.00 6048.00 921.00 289.00 8.48 24.62 56.03 0.31 0.26 0.16 0.58 0.20 23.60
sdf 907.00 6032.00 920.00 287.00 8.31 24.56 55.77 0.28 0.23 0.11 0.60 0.17 20.00
sde 919.00 6074.00 930.00 324.00 8.41 24.88 54.36 0.99 0.79 0.38 1.96 0.43 54.40
sdc 909.00 6077.00 926.00 289.00 8.36 25.05 56.32 1.51 1.25 0.63 3.24 0.48 58.40
sdb 905.00 6083.00 914.00 287.00 8.29 25.06 56.88 1.09 0.92 0.26 3.02 0.43 52.00
md1 0.00 0.00 11831.00 37443.00 54.55 146.14 8.34 0.00 0.00 0.00 0.00 0.00 0.00
drbd2 0.00 0.00 11834.00 37407.00 54.56 146.12 8.35 179.72 3.71 0.64 4.68 0.02 89.60
dm-8 0.00 0.00 5760.00 0.00 22.50 0.00 8.00 6.43 1.12 1.12 0.00 0.02 10.00
dm-12 0.00 0.00 228.00 0.00 9.23 0.00 82.88 0.31 1.37 1.37 0.00 0.70 16.00
dm-17 0.00 0.00 5847.00 37406.00 22.84 146.12 8.00 172.03 4.07 0.15 4.68 0.02 96.40

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdg 1439.00 44.00 1562.00 21.00 12.91 0.25 17.02 0.20 0.13 0.12 0.95 0.11 18.00
sdd 1434.00 64.00 1571.00 15.00 12.87 0.30 17.00 0.17 0.11 0.10 1.07 0.11 16.80
sda 1447.00 66.00 1568.00 23.00 12.89 0.34 17.03 0.13 0.08 0.07 0.70 0.08 12.80
sdf 1464.00 66.00 1561.00 22.00 12.95 0.35 17.20 0.11 0.07 0.06 0.73 0.07 10.80
sde 1454.00 78.00 1567.00 30.00 12.93 0.41 17.10 0.18 0.11 0.09 0.93 0.11 16.80
sdc 1478.00 91.00 1575.00 32.00 13.05 0.47 17.23 0.19 0.12 0.09 1.50 0.09 14.40
sdb 1464.00 102.00 1581.00 27.00 13.12 0.50 17.35 0.20 0.12 0.08 2.67 0.11 17.60
md1 0.00 0.00 20847.00 45.00 90.04 1.66 8.99 0.00 0.00 0.00 0.00 0.00 0.00
drbd2 0.00 0.00 20845.00 44.00 90.04 1.66 8.99 6.39 0.31 0.30 2.00 0.03 71.60
dm-8 0.00 0.00 10369.00 0.00 40.50 0.00 8.00 5.66 0.55 0.55 0.00 0.01 5.20
dm-12 0.00 0.00 231.00 44.00 9.51 1.66 83.20 0.10 0.36 0.05 2.00 0.17 4.80
dm-17 0.00 0.00 10237.00 0.00 39.99 0.00 8.00 0.68 0.07 0.07 0.00 0.07 67.60
Another 15 seconds of 0.00 wMB/s on dm-17
In fact, the peak value is 180.00 and the minimum is 0.00, with a total
of 44 seconds at 0.00, 16 seconds over 100.00, and 16 seconds between 0
and 100.
Here is a look at top -b -d 0.5 -n 60|grep ^\%Cpu
%Cpu0 : 2.1 us, 29.2 sy, 0.0 ni, 4.2 id, 64.6 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 1.9 us, 11.5 sy, 0.0 ni, 78.8 id, 7.7 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 0.0 us, 12.0 sy, 0.0 ni, 88.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 0.0 us, 7.8 sy, 0.0 ni, 90.2 id, 2.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu0 : 0.0 us, 24.4 sy, 0.0 ni, 6.7 id, 66.7 wa, 0.0 hi, 2.2 si, 0.0 st
%Cpu1 : 0.0 us, 13.5 sy, 0.0 ni, 75.0 id, 11.5 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 0.0 us, 6.0 sy, 0.0 ni, 94.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 0.0 us, 2.1 sy, 0.0 ni, 97.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu0 : 0.0 us, 25.5 sy, 0.0 ni, 14.9 id, 57.4 wa, 0.0 hi, 2.1 si, 0.0 st
%Cpu1 : 2.0 us, 8.2 sy, 0.0 ni, 75.5 id, 14.3 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 0.0 us, 4.1 sy, 0.0 ni, 93.9 id, 2.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 0.0 us, 9.8 sy, 0.0 ni, 90.2 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu0 : 2.2 us, 32.6 sy, 0.0 ni, 4.3 id, 56.5 wa, 0.0 hi, 4.3 si, 0.0 st
%Cpu1 : 2.0 us, 3.9 sy, 0.0 ni, 92.2 id, 2.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 0.0 us, 7.7 sy, 0.0 ni, 92.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu0 : 4.0 us, 42.0 sy, 0.0 ni, 0.0 id, 54.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 0.0 us, 2.0 sy, 0.0 ni, 98.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 0.0 us, 1.9 sy, 0.0 ni, 98.1 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu0 : 2.2 us, 39.1 sy, 0.0 ni, 0.0 id, 58.7 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 2.0 us, 0.0 sy, 0.0 ni, 98.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 0.0 us, 2.0 sy, 0.0 ni, 98.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu0 : 2.2 us, 34.8 sy, 0.0 ni, 4.3 id, 56.5 wa, 0.0 hi, 2.2 si, 0.0 st
%Cpu1 : 0.0 us, 3.8 sy, 0.0 ni, 94.2 id, 1.9 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 0.0 us, 3.9 sy, 0.0 ni, 96.1 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 1.9 us, 1.9 sy, 0.0 ni, 96.2 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
Currently, there are no LVM snapshots at all, the raid array is in sync,
operating normally:
md1 : active raid5 sdd1[7] sdb1[11] sdc1[10] sdf1[9] sdg1[5] sde1[8] sda1[6]
2813087616 blocks super 1.2 level 5, 64k chunk, algorithm 2 [7/7] [UUUUUUU]
mdadm --detail /dev/md1
/dev/md1:
Version : 1.2
Creation Time : Wed Aug 22 00:47:03 2012
Raid Level : raid5
Array Size : 2813087616 (2682.77 GiB 2880.60 GB)
Used Dev Size : 468847936 (447.13 GiB 480.10 GB)
Raid Devices : 7
Total Devices : 7
Persistence : Superblock is persistent
Update Time : Tue Mar 25 23:55:42 2014
State : active
Active Devices : 7
Working Devices : 7
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 64K
Name : san1:1 (local to host san1)
UUID : 707957c0:b7195438:06da5bc4:485d301c
Events : 1713337
Number Major Minor RaidDevice State
7 8 49 0 active sync /dev/sdd1
6 8 1 1 active sync /dev/sda1
8 8 65 2 active sync /dev/sde1
5 8 97 3 active sync /dev/sdg1
9 8 81 4 active sync /dev/sdf1
10 8 33 5 active sync /dev/sdc1
11 8 17 6 active sync /dev/sdb1
Also, the DRBD is disconnected:
2: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r-----
ns:79767379 nr:0 dw:137515806 dr:388623024 al:37206 bm:6688 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:58639192
So, I know dd isn't the ideal performance testing tool or metric, but
I'd really like to know why I can't get more than 40MB/s. There is no
networking, no iscsi, just a fairly simple raid5, drbd, and lvm.
So, am I crazy? What totally retarded thing have I done here?
--
Adam Goryachev Website Managers www.websitemanagers.com.au
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Growing RAID5 SSD Array
2014-03-25 13:10 ` Adam Goryachev
@ 2014-03-25 20:31 ` Stan Hoeppner
2014-04-05 19:25 ` Adam Goryachev
0 siblings, 1 reply; 16+ messages in thread
From: Stan Hoeppner @ 2014-03-25 20:31 UTC (permalink / raw)
To: Adam Goryachev, linux-raid@vger.kernel.org
On 3/25/2014 8:10 AM, Adam Goryachev wrote:
> I'll respond to the other email later on, but in between, I've found something else that seems just plain wrong.
>
> So, right now, I've shutdown most of the VM's (just one Linux VM left, which should be mostly idle since it is after 11pm local time). I'm trying to create a duplicate copy of one LV to another as a backup (in case I mess it up). So, I've shutdown DRBD, so we are operating independently (not that there is any change if DRBD is connected), I'm running on the storage server itself (so no iscsi or network issues).
>
> So, two LV's:
> LV VG Attr LSize Pool Origin Data% Move Log Copy% Convert
> backup_xptserver1_d1_20140325_224311 vg0 -wi-ao-- 453.00g
> xptserver1_d1 vg0 -wi-ao-- 452.00g
So you're copying 452 GB of raw bytes from one LV to another.
> running the command:
This is part of the problem:
> dd if=/dev/vg0/xptserver1_d1 of=/dev/vg0/backup_xptserver1_d1_20140325_224311
Using the dd defaults of buffered IO and 512 byte block size is horribly inefficient when copying 452 GB of data, especially to SSD. Buffered IO consumes 904 GB of extra memory bandwidth. Using 512 byte IOs requires much work of the raid5 write thread and more stripe cache bandwidth. Use this instead:
dd if=/dev/vg0/xxx of=/dev/vg0/yyy iflag=direct oflag=direct bs=1536k
This eliminates 904 GB of RAM b/w in memcpy's and writes out to the block layer in 1.5 MB IOs, i.e. four full stripes. This decreases the amount of work required of md as it receives 4 stripes of aligned IO at once, instead of 512 byte IOs which it must assemble.
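To spell out the arithmetic behind that number: with 7 drives in RAID5
you have 6 data chunks per stripe, 6 x 64 KiB = 384 KiB of data per full
stripe, and 1536 KiB / 384 KiB = 4 full stripes per IO. Full aligned
stripes let md compute parity directly from the incoming data instead of
doing read-modify-write cycles.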
> from another shell I run:
> while pidof dd > /dev/null;do kill -USR1 `pidof dd`;sleep 10;done
>
> dd shows this output:
> 99059692032 bytes (99 GB) copied, 2515.43 s, 39.4 MB/s
> 99403235840 bytes (99 GB) copied, 2525.45 s, 39.4 MB/s
> 99817538048 bytes (100 GB) copied, 2535.47 s, 39.4 MB/s
> 100252660224 bytes (100 GB) copied, 2545.49 s, 39.4 MB/s
Yes, that is very low, worse than a single spinning-rust drive. Using the dd options above should bump this up substantially. However I have read claims that LVM2 over md tends to decrease performance. I'm still looking into that for verification.
When you performed the in depth FIO testing last year with the job files I provided, was the target the md RAID device or an LV?
> iostat -dmx 1 shows this output:
>
> sda - sdg are the RAID5 SSD drives, single partition, used by md only
> dm-8 is the source for the dd copy
> dm-17 is the destination of the dd copy,
> dm-12 is the Linux VM which is currently running...
>
> Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> sdg 957.00 6767.00 930.00 356.00 8.68 27.65 57.85 0.65 0.50 0.16 1.39 0.37 48.00
> sdd 956.00 6774.00 921.00 313.00 8.69 27.50 60.06 0.26 0.21 0.08 0.60 0.17 20.80
> sda 940.00 6781.00 927.00 326.00 8.65 27.57 59.20 0.28 0.22 0.09 0.60 0.17 20.80
> sdf 967.00 6768.00 927.00 320.00 8.70 27.50 59.46 0.29 0.23 0.12 0.55 0.16 20.00
> sde 943.00 6770.00 933.00 369.00 8.69 27.71 57.26 0.74 0.57 0.16 1.60 0.44 57.20
> sdc 983.00 6790.00 937.00 317.00 8.86 27.55 59.46 1.58 1.27 0.71 2.90 0.49 61.60
> sdb 966.00 6813.00 929.00 313.00 8.76 27.57 59.92 1.20 0.97 0.34 2.84 0.49 61.20
^^^^^^^ ^^^^^^^
Note the difference between read merges and write merges, about 7:1, whereas the bandwidth is about 3:1. That's about 7K read merges/s and 48K write merges/s. Telling dd to use 1.5 MB IOs should reduce merges significantly, increasing throughput by a non-negligible amount. It should also decrease %util substantially, as less CPU time is required for merging, and less for md to assemble stripes from tiny 512 byte writes.
> md1 0.00 0.00 12037.00 42030.00 56.42 164.04 8.35 0.00 0.00 0.00 0.00 0.00 0.00
> drbd2 0.00 0.00 12034.00 41989.00 56.41 164.02 8.36 177.73 3.31 0.46 4.13 0.02 91.60
> dm-8 0.00 0.00 5955.00 0.00 23.26 0.00 8.00 4.43 0.74 0.74 0.00 0.01 6.40
> dm-12 0.00 0.00 254.00 5.00 10.39 0.02 82.38 0.28 1.08 1.01 4.80 0.59 15.20
> dm-17 0.00 0.00 5813.00 41984.00 22.71 164.00 8.00 174.87 3.65 0.15 4.13 0.02 100.00
...
> Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> sdg 1472.00 0.00 1681.00 0.00 14.70 0.00 17.90 0.14 0.08 0.08 0.00 0.08 13.60
> sdd 1472.00 0.00 1668.00 0.00 14.64 0.00 17.98 0.12 0.07 0.07 0.00 0.07 11.20
> sda 1472.00 0.00 1673.00 0.00 14.66 0.00 17.95 0.12 0.07 0.07 0.00 0.07 11.60
> sdf 1472.00 0.00 1680.00 0.00 14.69 0.00 17.91 0.13 0.08 0.08 0.00 0.07 12.40
> sde 1472.00 0.00 1685.00 0.00 14.71 0.00 17.88 0.12 0.07 0.07 0.00 0.07 11.60
> sdc 1478.00 0.00 1687.00 0.00 14.72 0.00 17.87 0.12 0.07 0.07 0.00 0.07 11.20
> sdb 1487.00 0.00 1679.00 0.00 14.69 0.00 17.92 0.14 0.08 0.08 0.00 0.08 13.20
> md1 0.00 0.00 22182.00 0.00 103.29 0.00 9.54 0.00 0.00 0.00 0.00 0.00 0.00
> drbd2 0.00 0.00 22244.00 0.00 103.66 0.00 9.54 5.76 0.26 0.26 0.00 0.03 59.60
> dm-8 0.00 0.00 10945.00 0.00 42.75 0.00 8.00 5.74 0.50 0.50 0.00 0.00 4.00
> dm-12 0.00 0.00 446.00 0.00 18.51 0.00 84.99 0.07 0.15 0.15 0.00 0.07 3.20
> dm-17 0.00 0.00 10836.00 0.00 42.33 0.00 8.00 0.58 0.05 0.05 0.00 0.05 57.60
No clue here. You're reading exactly the same amount from the drives, drbd2, dm-8, and dm-17. Given your description of a dd copy from dm-8 to dm-17, it seems odd that dm-8 and dm-17 show nearly the same number of bytes read here, with no writes.
...
> Another 15 seconds of 0.00 wMB/s on dm-17
These periods of no write activity suggest that your iostat timing didn't fully coincide with your dd copy. If it's not that, then something is causing your write IO to stall entirely. Any stack traces in dmesg?
> In fact, the peak value is 180.00 and the minimum is 0.00, with a total of 44 seconds at 0.00, 16 seconds over 100.00, and 16 seconds between 0 and 100.
>
> Here is a look at top -b -d 0.5 -n 60|grep ^\%Cpu
>>>> 95.9% -- %Cpu0 : 2.1 us, 29.2 sy, 0.0 ni, 4.2 id, 64.6 wa, 0.0 hi, 0.0 si, 0.0 st
>>>> 91.1% -- %Cpu0 : 0.0 us, 24.4 sy, 0.0 ni, 6.7 id, 66.7 wa, 0.0 hi, 2.2 si, 0.0 st
>>>> 82.9% -- %Cpu0 : 0.0 us, 25.5 sy, 0.0 ni, 14.9 id, 57.4 wa, 0.0 hi, 2.1 si, 0.0 st
>>>> 91.3% -- %Cpu0 : 2.2 us, 32.6 sy, 0.0 ni, 4.3 id, 56.5 wa, 0.0 hi, 4.3 si, 0.0 st
>>>> 100.0% -- %Cpu0 : 4.0 us, 42.0 sy, 0.0 ni, 0.0 id, 54.0 wa, 0.0 hi, 0.0 si, 0.0 st
>>>> 100.0% -- %Cpu0 : 2.2 us, 39.1 sy, 0.0 ni, 0.0 id, 58.7 wa, 0.0 hi, 0.0 si, 0.0 st
>>>> 93.5% -- %Cpu0 : 2.2 us, 34.8 sy, 0.0 ni, 4.3 id, 56.5 wa, 0.0 hi, 2.2 si, 0.0 st
It would appear that the raid5 write thread is being scheduled only on Cpu0, which is not good as core0 is the only core on this machine that processes interrupts. Hardware interrupt load above is zero, but with a real disk and network throughput rate it will eat into the cycles needed by the RAID5 thread.
The physical IO work does not seem to be spread very well across all 4 cores. However, the data rates are so low here it's difficult to come to any conclusion. Cores 1-2 are performing a little work, 5-10% or so. If you present a workload with bare minimal optimization, removing the choke hold from md and the elevator, as in my dd example up above, I'm sure you'll see much more work done by the other cores, as there will be far more IO to process.
> Currently, there are no LVM snapshots at all, the raid array is in sync, operating normally:
> md1 : active raid5 sdd1[7] sdb1[11] sdc1[10] sdf1[9] sdg1[5] sde1[8] sda1[6]
> 2813087616 blocks super 1.2 level 5, 64k chunk, algorithm 2 [7/7] [UUUUUUU]
>
> mdadm --detail /dev/md1
> /dev/md1:
> Version : 1.2
> Creation Time : Wed Aug 22 00:47:03 2012
> Raid Level : raid5
> Array Size : 2813087616 (2682.77 GiB 2880.60 GB)
> Used Dev Size : 468847936 (447.13 GiB 480.10 GB)
> Raid Devices : 7
> Total Devices : 7
> Persistence : Superblock is persistent
>
> Update Time : Tue Mar 25 23:55:42 2014
> State : active
> Active Devices : 7
> Working Devices : 7
> Failed Devices : 0
> Spare Devices : 0
>
> Layout : left-symmetric
> Chunk Size : 64K
>
> Name : san1:1 (local to host san1)
> UUID : 707957c0:b7195438:06da5bc4:485d301c
> Events : 1713337
>
> Number Major Minor RaidDevice State
> 7 8 49 0 active sync /dev/sdd1
> 6 8 1 1 active sync /dev/sda1
> 8 8 65 2 active sync /dev/sde1
> 5 8 97 3 active sync /dev/sdg1
> 9 8 81 4 active sync /dev/sdf1
> 10 8 33 5 active sync /dev/sdc1
> 11 8 17 6 active sync /dev/sdb1
>
>
> Also, the DRBD is disconnected:
> 2: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r-----
> ns:79767379 nr:0 dw:137515806 dr:388623024 al:37206 bm:6688 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:58639192
According to your iostat output above, drbd2 was indeed still engaged, eating between 59.6% and 91.6% of a core.
> So, I know dd isn't the ideal performance testing tool or metric, but I'd really like to know why I can't get more than 40MB/s. There is no networking, no iscsi, just a fairly simple raid5, drbd, and lvm.
You can get much more than 40MB/s, but you must know your tools, and gain a better understanding of the Linux IO subsystem.
> So, am I crazy? What totally retarded thing have I done here?
No, not crazy. Not totally retarded. You simply shoved a gazillion 512 byte IOs through the block layer. Even with SSDs that's going to be slow due to the extra work the kernel threads must perform on all those tiny IOs, and all the memory bandwidth consumed by buffered IO and stripe cache operations.
The problem with your dd run here is the same problem you had before I taught you how to use FIO a year ago. If you recall you were testing back then with a single dd process. As I explained then, dd is a serial application. It submits blocks one at a time with no overlap, and thus can't keep the request pipeline full. With FIO and an appropriate job file, we kept the request pipeline full using parallel requests, and we used large IOs to keep overhead to a minimum. The only way to increase dd throughput is to use large blocks and O_DIRECT to eliminate the RAM bandwidth of two unneeded memcpy's.
You've simply forgotten that lesson, apparently. Which is a shame, as I spent so much time teaching you the how and why of Linux IO performance...
Cheers,
Stan
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Growing RAID5 SSD Array
2014-03-25 20:31 ` Stan Hoeppner
@ 2014-04-05 19:25 ` Adam Goryachev
2014-04-08 15:27 ` Stan Hoeppner
0 siblings, 1 reply; 16+ messages in thread
From: Adam Goryachev @ 2014-04-05 19:25 UTC (permalink / raw)
To: stan, linux-raid@vger.kernel.org
On 26/03/14 07:31, Stan Hoeppner wrote:
> On 3/25/2014 8:10 AM, Adam Goryachev wrote:
>> I'll respond to the other email later on, but in between, I've found something else that seems just plain wrong.
>>
>> So, right now, I've shutdown most of the VM's (just one Linux VM left, which should be mostly idle since it is after 11pm local time). I'm trying to create a duplicate copy of one LV to another as a backup (in case I mess it up). So, I've shutdown DRBD, so we are operating independently (not that there is any change if DRBD is connected), I'm running on the storage server itself (so no iscsi or network issues).
>>
>> So, two LV's:
>> LV VG Attr LSize Pool Origin Data% Move Log Copy% Convert
>> backup_xptserver1_d1_20140325_224311 vg0 -wi-ao-- 453.00g
>> xptserver1_d1 vg0 -wi-ao-- 452.00g
> So you're copying 452 GB of raw bytes from one LV to another.
>
>> running the command:
> This is part of the problem:
>> dd if=/dev/vg0/xptserver1_d1 of=/dev/vg0/backup_xptserver1_d1_20140325_224311
> Using the dd defaults of buffered IO and 512 byte block size is horribly inefficient when copying 452 GB of data, especially to SSD. Buffered IO consumes 904 GB of extra memory bandwidth. Using 512 byte IOs requires much work of the raid5 write thread and more stripe cache bandwidth. Use this instead:
>
> dd if=/dev/vg0/xxx of=/dev/vg0/yyy iflag=direct oflag=direct bs=1536k
>
> This eliminates 904 GB of RAM b/w in memcpy's and writes out to the block layer in 1.5 MB IOs, i.e. four full stripes. This decreases the amount of work required of md as it receives 4 stripes of aligned IO at once, instead of 512 byte IOs which it must assemble.
Yes, of course, I should have known better! What a waste of three hours
or so....
>> from another shell I run:
>> while pidof dd > /dev/null;do kill -USR1 `pidof dd`;sleep 10;done
>>
>> dd shows this output:
>> 99059692032 bytes (99 GB) copied, 2515.43 s, 39.4 MB/s
>> 99403235840 bytes (99 GB) copied, 2525.45 s, 39.4 MB/s
>> 99817538048 bytes (100 GB) copied, 2535.47 s, 39.4 MB/s
>> 100252660224 bytes (100 GB) copied, 2545.49 s, 39.4 MB/s
> Yes, that is very low, worse than a single spinning-rust drive. Using the dd options above should bump this up substantially. However I have read claims that LVM2 over md tends to decrease performance. I'm still looking into that for verification.
>
> When you performed the in depth FIO testing last year with the job files I provided, was the target the md RAID device or an LV?
I'm certain that it was against an LV on DRBD on MD RAID5, while the
DRBD was disconnected.
>> iostat -dmx 1 shows this output:
>>
>> sda - sdg are the RAID5 SSD drives, single partition, used by md only
>> dm-8 is the source for the dd copy
>> dm-17 is the destination of the dd copy,
>> dm-12 is the Linux VM which is currently running...
>>
>> Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
>> sdg 957.00 6767.00 930.00 356.00 8.68 27.65 57.85 0.65 0.50 0.16 1.39 0.37 48.00
>> sdd 956.00 6774.00 921.00 313.00 8.69 27.50 60.06 0.26 0.21 0.08 0.60 0.17 20.80
>> sda 940.00 6781.00 927.00 326.00 8.65 27.57 59.20 0.28 0.22 0.09 0.60 0.17 20.80
>> sdf 967.00 6768.00 927.00 320.00 8.70 27.50 59.46 0.29 0.23 0.12 0.55 0.16 20.00
>> sde 943.00 6770.00 933.00 369.00 8.69 27.71 57.26 0.74 0.57 0.16 1.60 0.44 57.20
>> sdc 983.00 6790.00 937.00 317.00 8.86 27.55 59.46 1.58 1.27 0.71 2.90 0.49 61.60
>> sdb 966.00 6813.00 929.00 313.00 8.76 27.57 59.92 1.20 0.97 0.34 2.84 0.49 61.20
> ^^^^^^^ ^^^^^^^
> Note the difference between read merges and write merges, about 7:1, whereas the bandwidth is about 3:1. That's about 7K read merges/s and 48K write merges/s. Telling dd to use 1.5 MB IOs should reduce merges significantly, increasing throughput by a non-negligible amount. It should also decrease %util substantially, as less CPU time is required for merging, and less for md to assemble stripes from tiny 512 byte writes.
>
>> md1 0.00 0.00 12037.00 42030.00 56.42 164.04 8.35 0.00 0.00 0.00 0.00 0.00 0.00
>> drbd2 0.00 0.00 12034.00 41989.00 56.41 164.02 8.36 177.73 3.31 0.46 4.13 0.02 91.60
>> dm-8 0.00 0.00 5955.00 0.00 23.26 0.00 8.00 4.43 0.74 0.74 0.00 0.01 6.40
>> dm-12 0.00 0.00 254.00 5.00 10.39 0.02 82.38 0.28 1.08 1.01 4.80 0.59 15.20
>> dm-17 0.00 0.00 5813.00 41984.00 22.71 164.00 8.00 174.87 3.65 0.15 4.13 0.02 100.00
> ...
>> Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
>> sdg 1472.00 0.00 1681.00 0.00 14.70 0.00 17.90 0.14 0.08 0.08 0.00 0.08 13.60
>> sdd 1472.00 0.00 1668.00 0.00 14.64 0.00 17.98 0.12 0.07 0.07 0.00 0.07 11.20
>> sda 1472.00 0.00 1673.00 0.00 14.66 0.00 17.95 0.12 0.07 0.07 0.00 0.07 11.60
>> sdf 1472.00 0.00 1680.00 0.00 14.69 0.00 17.91 0.13 0.08 0.08 0.00 0.07 12.40
>> sde 1472.00 0.00 1685.00 0.00 14.71 0.00 17.88 0.12 0.07 0.07 0.00 0.07 11.60
>> sdc 1478.00 0.00 1687.00 0.00 14.72 0.00 17.87 0.12 0.07 0.07 0.00 0.07 11.20
>> sdb 1487.00 0.00 1679.00 0.00 14.69 0.00 17.92 0.14 0.08 0.08 0.00 0.08 13.20
>> md1 0.00 0.00 22182.00 0.00 103.29 0.00 9.54 0.00 0.00 0.00 0.00 0.00 0.00
>> drbd2 0.00 0.00 22244.00 0.00 103.66 0.00 9.54 5.76 0.26 0.26 0.00 0.03 59.60
>> dm-8 0.00 0.00 10945.00 0.00 42.75 0.00 8.00 5.74 0.50 0.50 0.00 0.00 4.00
>> dm-12 0.00 0.00 446.00 0.00 18.51 0.00 84.99 0.07 0.15 0.15 0.00 0.07 3.20
>> dm-17 0.00 0.00 10836.00 0.00 42.33 0.00 8.00 0.58 0.05 0.05 0.00 0.05 57.60
> No clue here. You're reading exactly the same amount from the drives, drbd2, dm-8, and dm-17. Given your description of a dd copy from dm-8 to dm-17, it seems odd that dm-8 and dm-17 show nearly the same number of bytes read here, with no writes.
I've just double checked: definitely reading from dm-8 and writing to
dm-17. Since all the LVs are on DRBD, the total reads on the LVs
should equal the reads on drbd2, and the same goes for writes. Also,
values for drbd2 should (approximately) equal md1, and the sum of
sd[a-g]. I really have no idea who, what, or why there would be any
reads on dm-17...
>> Another 15 seconds of 0.00 wMB/s on dm-17
> These periods of no write activity suggest that your iostat timing didn't fully coincide with your dd copy. If it's not that, then something is causing your write IO to stall entirely. Any stack traces in dmesg?
Definitely not, the stats were collected and the email sent hours before
the dd completed.... I only collected the stats for 76 seconds, the copy
took around 4 hours...
Very interesting, what looking at log files can turn up at times :)
So, no stack traces etc. in relation to this. However, just last night
the log started recording errors on the OS drive (sdh), and some testing
with dd shows that it returns read errors between 77551 MB and 77555 MB.
The first command below works, the second fails:
dd if=/dev/sdh of=/dev/null bs=1M iflag=direct skip=77550 count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.00356682 s, 294 MB/s
dd if=/dev/sdh of=/dev/null bs=1M iflag=direct skip=77551 count=1
dd: reading `/dev/sdh': Input/output error
0+0 records in
0+0 records out
0 bytes (0 B) copied, 0.213353 s, 0.0 kB/s
This drive is on a different SATA controller (onboard) while all the
rest of the drives are on the LSI SATA controller. I can read from the
drive fine before 77551 and after 77556. I've just ordered a replacement
drive, and will replace that tonight, then wait for the warranty
replacement later. FYI, it's an Intel 120GB SSD.
I can't be sure, but I don't think this should have impacted on the
copy, given that the OS isn't even in use generally, and wasn't the
source/destination of the copy.
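(I'd guess smartctl -a /dev/sdh would also show whether the drive has
logged anything for that region, e.g. the Reallocated_Sector_Ct and
Current_Pending_Sector attributes.)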
Actually, the drive has already been replaced....
>> In fact, the peak value is 180.00 and the minimum is 0.00, with a total of 44 seconds of 0.00 and 16seconds over 100.00 and 16 seconds between 0 and 100.
>>
>> Here is a look at top -b -d 0.5 -n 60|grep ^\%Cpu
>>>>> 95.9% -- %Cpu0 : 2.1 us, 29.2 sy, 0.0 ni, 4.2 id, 64.6 wa, 0.0 hi, 0.0 si, 0.0 st
>>>>> 91.1% -- %Cpu0 : 0.0 us, 24.4 sy, 0.0 ni, 6.7 id, 66.7 wa, 0.0 hi, 2.2 si, 0.0 st
>>>>> 82.9% -- %Cpu0 : 0.0 us, 25.5 sy, 0.0 ni, 14.9 id, 57.4 wa, 0.0 hi, 2.1 si, 0.0 st
>>>>> 91.3% -- %Cpu0 : 2.2 us, 32.6 sy, 0.0 ni, 4.3 id, 56.5 wa, 0.0 hi, 4.3 si, 0.0 st
>>>>> 100.0% -- %Cpu0 : 4.0 us, 42.0 sy, 0.0 ni, 0.0 id, 54.0 wa, 0.0 hi, 0.0 si, 0.0 st
>>>>> 100.0% -- %Cpu0 : 2.2 us, 39.1 sy, 0.0 ni, 0.0 id, 58.7 wa, 0.0 hi, 0.0 si, 0.0 st
>>>>> 93.5% -- %Cpu0 : 2.2 us, 34.8 sy, 0.0 ni, 4.3 id, 56.5 wa, 0.0 hi, 2.2 si, 0.0 st
> It would appear that the raid5 write thread is being scheduled only on Cpu0, which is not good as core0 is the only core on this machine that processes interrupts. Hardware interrupt load above is zero, but with a real disk and network throughput rate it will eat into the cycles needed by the RAID5 thread.
OK, I'm going to add the following to the /etc/rc.local:
for irq in `cat /proc/interrupts |grep mpt2sas| awk -F: '{ print $1}'`
do
echo 4 > /proc/irq/${irq}/smp_affinity
done
That will move the LSI card interrupt processing to CPU2 like this:
57: 143806142 7246 41052 0 IR-PCI-MSI-edge mpt2sas0-msix0
58: 14381650 0 22952 0 IR-PCI-MSI-edge mpt2sas0-msix1
59: 6733526 0 144387 0 IR-PCI-MSI-edge mpt2sas0-msix2
60: 3342802 0 32053 0 IR-PCI-MSI-edge mpt2sas0-msix3
You can see I briefly moved one to CPU1 as well.
Would you suggest moving the eth devices to another CPU as well, perhaps
CPU3?
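(For reference: smp_affinity takes a CPU bitmask, so 4 = binary 100 =
CPU2, and CPU3 would be mask 8. Assuming the eth interrupts appear by
name in /proc/interrupts, the same trick should move them:
for irq in `grep eth /proc/interrupts | awk -F: '{ print $1}'`
do
echo 8 > /proc/irq/${irq}/smp_affinity
done
)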
> The physical IO work does not seem to be spread very well across all 4 cores. However, the data rates are so low here it's difficult to come to any conclusion. Cores 1-2 are performing a little work, 5-10% or so. If you present a workload with bare minimal optimization, removing the choke hold from md and the elevator, as in my dd example up above, I'm sure you'll see much more work done by the other cores, as there will be far more IO to process.
I'll run a bunch more tests tonight, and get a better idea. For now though:
dd if=/dev/vg0/xptest of=/dev/vg0/testing iflag=direct oflag=direct bs=1536k count=5k
iostat shows much more solid read and write rates, with peaks around
120MB/s (dd itself reported 88MB/s). It also shows 0 for rrqm and wrqm,
so no more merging is being done. The avgrq-sz value is always 128 for
the destination, and almost always 128 for the source, during the copy.
Since avgrq-sz is in 512-byte sectors, that equals 64kB, so I'm not sure
why that is when we told dd to use 1536k ...
top shows:
%Cpu0 : 0.0 us, 3.9 sy, 0.0 ni, 92.2 id, 2.0 wa, 0.0 hi, 2.0 si, 0.0 st
%Cpu1 : 0.0 us, 2.0 sy, 0.0 ni, 94.1 id, 3.9 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 2.0 us, 8.2 sy, 0.0 ni, 75.5 id, 12.2 wa, 0.0 hi, 2.0 si, 0.0 st
%Cpu3 : 0.0 us, 9.6 sy, 0.0 ni, 86.5 id, 3.8 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu0 : 19.2 us, 13.5 sy, 0.0 ni, 61.5 id, 5.8 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 21.6 us, 11.8 sy, 0.0 ni, 66.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 17.6 us, 19.6 sy, 0.0 ni, 51.0 id, 7.8 wa, 0.0 hi, 3.9 si, 0.0 st
%Cpu3 : 19.6 us, 15.7 sy, 0.0 ni, 58.8 id, 5.9 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu0 : 0.0 us, 2.0 sy, 0.0 ni, 91.8 id, 6.1 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 0.0 us, 1.9 sy, 0.0 ni, 96.2 id, 1.9 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 2.0 us, 10.0 sy, 0.0 ni, 80.0 id, 8.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 2.0 us, 7.8 sy, 0.0 ni, 88.2 id, 2.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu0 : 0.0 us, 2.0 sy, 0.0 ni, 96.1 id, 2.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 0.0 us, 0.0 sy, 0.0 ni, 93.9 id, 6.1 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 0.0 us, 10.0 sy, 0.0 ni, 76.0 id, 14.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 0.0 us, 9.3 sy, 0.0 ni, 85.2 id, 5.6 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu0 : 0.0 us, 3.9 sy, 0.0 ni, 92.2 id, 2.0 wa, 0.0 hi, 2.0 si, 0.0 st
%Cpu1 : 0.0 us, 2.0 sy, 0.0 ni, 94.1 id, 3.9 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 1.9 us, 15.1 sy, 0.0 ni, 67.9 id, 15.1 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 0.0 us, 5.9 sy, 0.0 ni, 84.3 id, 9.8 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu0 : 0.0 us, 2.0 sy, 0.0 ni, 98.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 0.0 us, 10.2 sy, 0.0 ni, 81.6 id, 8.2 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 0.0 us, 6.1 sy, 0.0 ni, 85.7 id, 8.2 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu0 : 0.0 us, 3.8 sy, 0.0 ni, 90.4 id, 5.8 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 0.0 us, 0.0 sy, 0.0 ni, 88.2 id, 11.8 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 0.0 us, 8.0 sy, 0.0 ni, 90.0 id, 2.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 0.0 us, 2.0 sy, 0.0 ni, 98.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu0 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 0.0 us, 0.0 sy, 0.0 ni, 96.0 id, 4.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 2.0 us, 6.0 sy, 0.0 ni, 86.0 id, 6.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 0.0 us, 9.3 sy, 0.0 ni, 75.9 id, 14.8 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu0 : 0.0 us, 3.9 sy, 0.0 ni, 80.4 id, 15.7 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 0.0 us, 3.8 sy, 0.0 ni, 90.4 id, 5.8 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 2.0 us, 3.9 sy, 0.0 ni, 92.2 id, 2.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 0.0 us, 2.0 sy, 0.0 ni, 94.1 id, 3.9 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu0 : 0.0 us, 3.9 sy, 0.0 ni, 96.1 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 0.0 us, 1.9 sy, 0.0 ni, 94.2 id, 3.8 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 0.0 us, 10.4 sy, 0.0 ni, 79.2 id, 10.4 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 0.0 us, 6.1 sy, 0.0 ni, 91.8 id, 2.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu0 : 0.0 us, 4.1 sy, 0.0 ni, 95.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 2.0 us, 2.0 sy, 0.0 ni, 94.1 id, 2.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 0.0 us, 12.0 sy, 0.0 ni, 76.0 id, 12.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 0.0 us, 13.2 sy, 0.0 ni, 81.1 id, 5.7 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu0 : 2.0 us, 4.0 sy, 0.0 ni, 88.0 id, 6.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 0.0 us, 0.0 sy, 0.0 ni, 96.0 id, 4.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 0.0 us, 6.2 sy, 0.0 ni, 83.3 id, 10.4 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 0.0 us, 7.7 sy, 0.0 ni, 84.6 id, 7.7 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu0 : 0.0 us, 4.0 sy, 0.0 ni, 96.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 0.0 us, 3.8 sy, 0.0 ni, 88.5 id, 7.7 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 0.0 us, 6.4 sy, 0.0 ni, 87.2 id, 6.4 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 0.0 us, 8.0 sy, 0.0 ni, 84.0 id, 8.0 wa, 0.0 hi, 0.0 si, 0.0 st
So it looks like CPU0 is less busy, with more work being done on CPU2
(the interrupts for the LSI SATA controller).
If I increase bs=6M then dd reports 130MB/s ...
>> Currently, there are no LVM snapshots at all, the raid array is in sync, operating normally:
>> md1 : active raid5 sdd1[7] sdb1[11] sdc1[10] sdf1[9] sdg1[5] sde1[8] sda1[6]
>> 2813087616 blocks super 1.2 level 5, 64k chunk, algorithm 2 [7/7] [UUUUUUU]
>>
>> mdadm --detail /dev/md1
>> /dev/md1:
>> Version : 1.2
>> Creation Time : Wed Aug 22 00:47:03 2012
>> Raid Level : raid5
>> Array Size : 2813087616 (2682.77 GiB 2880.60 GB)
>> Used Dev Size : 468847936 (447.13 GiB 480.10 GB)
>> Raid Devices : 7
>> Total Devices : 7
>> Persistence : Superblock is persistent
>>
>> Update Time : Tue Mar 25 23:55:42 2014
>> State : active
>> Active Devices : 7
>> Working Devices : 7
>> Failed Devices : 0
>> Spare Devices : 0
>>
>> Layout : left-symmetric
>> Chunk Size : 64K
>>
>> Name : san1:1 (local to host san1)
>> UUID : 707957c0:b7195438:06da5bc4:485d301c
>> Events : 1713337
>>
>> Number Major Minor RaidDevice State
>> 7 8 49 0 active sync /dev/sdd1
>> 6 8 1 1 active sync /dev/sda1
>> 8 8 65 2 active sync /dev/sde1
>> 5 8 97 3 active sync /dev/sdg1
>> 9 8 81 4 active sync /dev/sdf1
>> 10 8 33 5 active sync /dev/sdc1
>> 11 8 17 6 active sync /dev/sdb1
>>
>>
>> Also, the DRBD is disconnected:
>> 2: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r-----
>> ns:79767379 nr:0 dw:137515806 dr:388623024 al:37206 bm:6688 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:58639192
> According to your iostat output above, drbd2 was indeed still engaged, eating between 59.6% and 91.6% of a core.
Nope, definitely not connected; however, it is still part of the IO
path, because the LV sits on DRBD. So it isn't talking to its partner,
but it still does its own "work" in between LVM and MD.
>> So, I know dd isn't the ideal performance testing tool or metric, but I'd really like to know why I can't get more than 40MB/s. There is no networking, no iscsi, just a fairly simple raid5, drbd, and lvm.
> You can get much more than 40MB/s, but you must know your tools, and gain a better understanding of the Linux IO subsystem.
Apologies, it was a second late night in a row, and I wasn't doing very
well, I should have remembered my previous lessons about this!
>> So, am I crazy? What totally retarded thing have I done here?
> No, not crazy. Not totally retarded. You simply shoved a gazillion 512 byte IOs through the block layer. Even with SSDs that's going to be slow due to the extra work the kernel threads must perform on all those tiny IOs, and all the memory bandwidth consumed by buffered IO and stripe cache operations.
>
> The problem with your dd run here is the same problem you had before I taught you how to use FIO a year ago. If you recall you were testing back then with a single dd process. As I explained then, dd is a serial application. It submits blocks one at a time with no overlap, and thus can't keep the request pipeline full. With FIO and an appropriate job file, we kept the request pipeline full using parallel requests, and we used large IOs to keep overhead to a minimum. The only way to increase dd throughput is to use large blocks and O_DIRECT to eliminate the RAM bandwidth of two unneeded memcpy's.
>
> You've simply forgotten that lesson, apparently. Which is a shame, as I spent so much time teaching you the how and why of Linux IO performance...
OK, so thinking this through... We should expect really poor performance
if we are not using O_DIRECT, and not doing large requests in parallel.
I think the parallel part of the workload should be fine in real world
use, since each user and machine will be generating some random load,
which should be delivered in parallel to the stack (LVM/DRBD/MD).
However, in 'real world' use, we don't determine the request size, only
the application or client OS, or perhaps iscsi will determine that.
My concern is that while I can get fantastical numbers from specific
tests (such as highly parallel, large block size requests) I don't need
that type of I/O, so my system isn't tuned to my needs.
After working with linbit (DRBD) I've found out some more useful
information, which puts me right back at the beginning, I think, but
with a lot more experience and knowledge.
It seems that DRBD keeps its own "journal", so every write is written
to the journal, then its bitmap is marked, then the journal is written
to the data area, then the bitmap updated again, and then it starts over
for the next write. This means it is doing lots and lots of small writes
to the same areas of the disk, ie, 4k blocks.
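(A side note from the DRBD documentation rather than from linbit
directly: the size of that journal (DRBD calls it the activity log) is
tunable via al-extents in the resource's disk section, e.g.
disk {
al-extents 3389; # illustrative value, check the DRBD manual
}
More extents means fewer metadata updates under scattered writes, at
the cost of a longer resync after a primary crash.)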
Anyway, I was advised to re-organise the stack from:
RAID5 -> DRBD -> LVM -> iSCSI
To:
RAID5 -> LVM -> DRBD -> iSCSI
This means each DRBD device is smaller, and so the "working set" is
smaller, and should be more efficient. So, now I am easily able to do
tests completely excluding drbd by targeting the LV itself. Which means
just RAID5 + LVM layers to worry about.
When I use this fio job:
[global]
filename=/dev/vg0/testing
zero_buffers
numjobs=16
thread
group_reporting
blocksize=4k
ioengine=libaio
iodepth=16
direct=1
runtime=60
size=16g
[read]
rw=randread
stonewall
[write]
rw=randwrite
stonewall
Then I get these results:
read: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=16
...
read: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=16
write: (g=1): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=16
...
write: (g=1): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=16
2.0.8
Starting 32 threads
read: (groupid=0, jobs=16): err= 0: pid=36459
read : io=74697MB, bw=1244.1MB/s, iops=318691 , runt= 60003msec
slat (usec): min=0 , max=999873 , avg= 5.90, stdev=529.35
clat (usec): min=0 , max=1002.4K, avg=795.43, stdev=5201.15
lat (usec): min=0 , max=1002.4K, avg=801.56, stdev=5233.38
clat percentiles (usec):
| 1.00th=[ 0], 5.00th=[ 213], 10.00th=[ 286], 20.00th=[ 366],
| 30.00th=[ 438], 40.00th=[ 516], 50.00th=[ 604], 60.00th=[ 708],
| 70.00th=[ 860], 80.00th=[ 1096], 90.00th=[ 1544], 95.00th=[ 1928],
| 99.00th=[ 2608], 99.50th=[ 2800], 99.90th=[ 3536], 99.95th=[ 4128],
| 99.99th=[15424]
bw (KB/s) : min=22158, max=245376, per=6.39%, avg=81462.59, stdev=22339.85
lat (usec) : 2=3.34%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
lat (usec) : 100=0.01%, 250=3.67%, 500=31.43%, 750=24.55%, 1000=13.33%
lat (msec) : 2=19.37%, 4=4.25%, 10=0.04%, 20=0.01%, 50=0.01%
lat (msec) : 100=0.01%, 250=0.01%, 1000=0.01%, 2000=0.01%
cpu : usr=30.27%, sys=236.67%, ctx=239859018, majf=0, minf=64588
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%,
>=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%,
>=64=0.0%
issued : total=r=19122474/w=0/d=0, short=r=0/w=0/d=0
write: (groupid=1, jobs=16): err= 0: pid=38376
write: io=13885MB, bw=236914KB/s, iops=59228 , runt= 60016msec
slat (usec): min=2 , max=25554K, avg=25.74, stdev=17219.99
clat (usec): min=122 , max=43459K, avg=4294.06, stdev=100111.47
lat (usec): min=129 , max=43459K, avg=4319.92, stdev=101581.66
clat percentiles (usec):
| 1.00th=[ 482], 5.00th=[ 628], 10.00th=[ 748], 20.00th=[ 996],
| 30.00th=[ 1320], 40.00th=[ 1784], 50.00th=[ 2352], 60.00th=[ 3056],
| 70.00th=[ 4192], 80.00th=[ 5920], 90.00th=[ 8384], 95.00th=[10816],
| 99.00th=[17536], 99.50th=[20096], 99.90th=[57088], 99.95th=[67072],
| 99.99th=[123392]
bw (KB/s) : min= 98, max=25256, per=6.74%, avg=15959.71, stdev=2969.06
lat (usec) : 250=0.01%, 500=1.25%, 750=8.72%, 1000=10.13%
lat (msec) : 2=23.87%, 4=24.78%, 10=24.87%, 20=5.85%, 50=0.39%
lat (msec) : 100=0.11%, 250=0.01%, 750=0.01%, 2000=0.01%, >=2000=0.01%
cpu : usr=5.47%, sys=39.74%, ctx=54762279, majf=0, minf=62375
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%,
>=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%,
>=64=0.0%
issued : total=r=0/w=3554662/d=0, short=r=0/w=0/d=0
Run status group 0 (all jobs):
READ: io=74697MB, aggrb=1244.1MB/s, minb=1244.1MB/s, maxb=1244.1MB/s, mint=60003msec, maxt=60003msec
Run status group 1 (all jobs):
WRITE: io=13885MB, aggrb=236914KB/s, minb=236914KB/s, maxb=236914KB/s, mint=60016msec, maxt=60016msec
So, a maximum of 237MB/s write. Once DRBD takes that and adds its
overhead, I'm getting approx 10% of that performance (some of the time;
other times I'm getting even less, but that is probably yet another issue).
Now, 237MB/s is pretty poor, and when you try and share that between a
dozen VM's, with some of those VM's trying to work on 2+ GB files
(outlook users), then I suspect that is why there are so many issues.
The question is, what can I do to improve this? Should I use RAID5 with
a smaller stripe size? Should I use RAID10 or RAID1+linear? Could the
issue be from LVM? LVM is using 4MB Physical Extents, but from reading
around, nobody seems to worry about PE size in relation to performance
(only LVM1 had a limit on the number of PEs, which meant a larger LV
required larger PEs).
Here is the current md array:
/dev/md1:
Version : 1.2
Creation Time : Wed Aug 22 00:47:03 2012
Raid Level : raid5
Array Size : 2813087616 (2682.77 GiB 2880.60 GB)
Used Dev Size : 468847936 (447.13 GiB 480.10 GB)
Raid Devices : 7
Total Devices : 7
Persistence : Superblock is persistent
Update Time : Sun Apr 6 05:19:14 2014
State : clean
Active Devices : 7
Working Devices : 7
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 64K
Name : san1:1 (local to host san1)
UUID : 707957c0:b7195438:06da5bc4:485d301c
Events : 1713347
Number Major Minor RaidDevice State
7 8 49 0 active sync /dev/sdd1
6 8 1 1 active sync /dev/sda1
8 8 65 2 active sync /dev/sde1
5 8 97 3 active sync /dev/sdg1
9 8 81 4 active sync /dev/sdf1
10 8 33 5 active sync /dev/sdc1
11 8 17 6 active sync /dev/sdb1
BTW, I've also split the domain controller to a win2008R2 server, and
upgraded the file server to win2012R2.
Regards,
Adam
--
Adam Goryachev Website Managers www.websitemanagers.com.au
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Growing RAID5 SSD Array
2014-04-05 19:25 ` Adam Goryachev
@ 2014-04-08 15:27 ` Stan Hoeppner
2014-04-09 3:57 ` Adam Goryachev
0 siblings, 1 reply; 16+ messages in thread
From: Stan Hoeppner @ 2014-04-08 15:27 UTC (permalink / raw)
To: Adam Goryachev, linux-raid@vger.kernel.org
On 4/5/2014 2:25 PM, Adam Goryachev wrote:
> On 26/03/14 07:31, Stan Hoeppner wrote:
>> On 3/25/2014 8:10 AM, Adam Goryachev wrote:
...
...
> OK, I'm going to add the following to the /etc/rc.local:
> for irq in `cat /proc/interrupts |grep mpt2sas| awk -F: '{ print $1}'`
> do
> echo 4 > /proc/irq/${irq}/smp_affinity
> done
>
> That will move the LSI card interrupt processing to CPU2 like this:
> 57: 143806142 7246 41052 0 IR-PCI-MSI-edge mpt2sas0-msix0
> 58: 14381650 0 22952 0 IR-PCI-MSI-edge mpt2sas0-msix1
> 59: 6733526 0 144387 0 IR-PCI-MSI-edge mpt2sas0-msix2
> 60: 3342802 0 32053 0 IR-PCI-MSI-edge mpt2sas0-msix3
>
> You can see I briefly moved one to CPU1 as well.
Most of your block IO interrupts are read traffic. md/RAID5 reads are
fully threaded, unlike writes, and can be serviced by any core. Assign
each LSI interrupt queue to a different core.
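A sketch of one way to do that, reusing your grep and assuming your 4
cores (untested):

i=0
for irq in $(awk -F: '/mpt2sas/ {print $1}' /proc/interrupts); do
    mask=$(printf '%x' $((1 << i)))   # CPU0, CPU1, CPU2, CPU3 in turn
    echo $mask > /proc/irq/$irq/smp_affinity
    i=$(( (i + 1) % 4 ))
done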
> Would you suggest moving the eth devices to another CPU as well, perhaps
> CPU3 ?
Spread all the interrupt queues across all cores, starting with CPU3
moving backwards and eth0 moving forward, this because IIRC eth0 is your
only interface receiving inbound traffic currently, due to a broken
balance-alb config. NICs generally only generate interrupts for inbound
packets, so balancing IRQs won't make much difference until you get
inbound load balancing working.
...
> I'll run a bunch more tests tonight, and get a better idea. For now though:
> dd if=/dev/vg0/xptest of=/dev/vg0/testing iflag=direct oflag=direct
> bs=1536k count=5k
> iostat shows much more solid read and write rates, around 120MB/s peaks,
> dd reported 88MB/s, it also shows 0 for rrqm and wrqm, so no more
> merging was being done.
Moving larger blocks and thus eliminating merges increased throughput a
little over 2x. The absolute data rate is still very poor as something
is broken. Still, doubling throughput with a few command line args is
always impressive.
> The avgrq-sz value is always 128 for the
> destination, and almost always 128 for the source during the copy. This
> seems to equal 64kB, so I'm not sure why that is if we told dd to use
> 1536k ...
I'd need to see the actual output to comment intelligently on this.
However, do note that application read/write IO size and avgrq-sz
reported by iostat are two different things.
...
> So it looks like CPU0 is less busy, with more work being done on CPU2
> (the interrupts for the LSI SATA controller)
The md write thread is typically scheduled on the processor (core) which
is servicing interrupts for the thread. The %sy you're seeing on CPU2
is not interrupt processing but the RAID5 write thread execution.
> If I increase bs=6M then dd reports 130MB/s ...
You can continue increasing the dd block size and gain small increases
in throughput incrementally until you hit the wall. But again,
something is broken somewhere for single thread throughput to be this low.
...
>> According to your iostat output above, drbd2 was indeed still
>> engaged. And eating over 59.6% and 91.6% of a core.
>
> Nope, definitely not connected; however, it is still part of the IO
> path, because the LV sits on DRBD. So it isn't talking to its partner,
> but it still does its own "work" in between LVM and MD.
>
>>> So, I know dd isn't the ideal performance testing tool or metric, but
>>> I'd really like to know why I can't get more than 40MB/s. There is no
>>> networking, no iscsi, just a fairly simple raid5, drbd, and lvm.
There is nothing simple at all about a storage architecture involving
layered lvm, drbd, and md RAID. This may be a "popular" configuration,
but popular does not equal "simple".
>> You can get much more than 40MB/s, but you must know your tools, and
>> gain a better understanding of the Linux IO subsystem.
>
> Apologies, it was a second late night in a row, and I wasn't doing very
> well, I should have remembered my previous lessons about this!
Remember: High throughput requires large IOs in parallel. High IOPS
requires small IOs in parallel. Bandwidth and IOPS are inversely
proportional.
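To put numbers on that: bandwidth = IOPS * IO size. Your 4KB FIO read
result below shows the arithmetic: 318,691 IOPS * 4 KB comes to roughly
1,245 MB/s, which matches the read bandwidth FIO reports.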
...
> OK, so thinking this through... We should expect really poor performance
> if we are not using O_DIRECT, and not doing large requests in parallel.
You should not expect poor performance with a single thread, though you
won't see the full hardware potential of the SSDs either. Something odd
is going on
in your current setup if a dd copy with large block size and O_DIRECT
can only hit 130 MB/s to an array of 7 of these SandForce based Intel
SSDs. You should be able to hit a few hundred MB/s with a simultaneous
read and write stream from one LV to another. Something is plugging a
big finger into the ends of your fat IO pipe when single streaming.
Determining what this finger is will require some investigation.
> I think the parallel part of the workload should be fine in real world
> use, since each user and machine will be generating some random load,
> which should be delivered in parallel to the stack (LVM/DRBD/MD).
> However, in 'real world' use, we don't determine the request size, only
> the application or client OS, or perhaps iscsi will determine that.
Note that in your previous testing you achieved 200 MB/s iSCSI traffic
at the Xen hosts. Whether using many threads on the client or not,
iSCSI over GbE at the server should never be faster than a local LV to
LV copy. Something is misconfigured or you have a bug somewhere.
> My concern is that while I can get fantastical numbers from specific
> tests (such as highly parallel, large block size requests) I don't need
> that type of I/O,
The previous testing I assisted you with a year ago demonstrated peak
hardware read/write throughput of your RAID5 array. Demonstrating
throughput was what you requested, not IOPS.
The broken FIO test you performed, with results down below, demonstrated
320K read IOPS, or 45K IOPS per drive. This is the inverse test of
bandwidth. Here you also achieved near peak hardware IO rate from the
SSDs, which is claimed by Intel at 50K read IOPS. You have the best of
both worlds, max throughput and IOPS. Had you not broken the test,
your write IOPS would have been correctly demonstrated as well.
Playing the broken record again, you simply don't yet understand how to
use your benchmarking/testing tools, nor the data, the picture, they are
presenting to you.
> so my system isn't tuned to my needs.
While that statement may be true, the thing(s) not properly tuned are
not the SSDs, nor LSI, nor mobo, nor md. That leaves LVM and DRBD. And
the problems may not be due to tuning but bugs.
> After working with linbit (DRBD) I've found out some more useful
> information, which puts me right back to the beginning I think, but with
> a lot more experience and knowledge.
> It seems that DRBD keeps it's own "journal", so every write is written
> to the journal, then it's bitmap is marked, then the journal is written
> to the data area, then the bitmap updated again, and then start over for
> the next write. This means it is doing lots and lots of small writes to
> the same areas of the disk ie, 4k blocks.
Your 5 SSDs had a combined ~160,000 4KB IOPS write performance. Your 7
SSDs should hit ~240,000 4KB write IOPS when configured properly. To
put this into perspective, an array comprised of 15K SAS drives in RAID0
would require 533 and 800 drives respectively to reach the same IOPS
performance, 1066 and 1600 drives in RAID10.
With that comparison in mind, surely it's clear that your original DRBD
journal throughput was not creating a bottleneck of any kind at the SSDs.
> Anyway, I was advised to re-organise the stack from:
> RAID5 -> DRBD -> LVM -> iSCSI
> To:
> RAID5 -> LVM -> DRBD -> iSCSI
> This means each DRBD device is smaller, and so the "working set" is
> smaller, and should be more efficient.
Makes sense. But I saw nothing previously to suggest DRBD CPU or memory
consumption was a problem, nor write IOPS.
> So, now I am easily able to do
> tests completely excluding drbd by targeting the LV itself. Which means
> just RAID5 + LVM layers to worry about.
Recall what I said previously about knowing your tools?
...
> [global]
> filename=/dev/vg0/testing
> zero_buffers
> numjobs=16
> thread
> group_reporting
> blocksize=4k
> ioengine=libaio
> iodepth=16
> direct=1
It's generally a bad idea to mix size and run time. It makes results
non-deterministic. Best to use one or the other. But you have much
bigger problems here...
> runtime=60
> size=16g
16 jobs * 2 streams (read + write) * 16 GB per stream = 512 GB required
for this test. The size= parm is per job thread, not aggregate. What
was the capacity of /dev/vg0/testing? Is this a filesystem or raw
device? I'm assuming raw device of capacity well less than 512 GB.
> read : io=74697MB, bw=1244.1MB/s, iops=318691 , runt= 60003msec
^^^^^^^ ^^^^^^
318K IOPS is 45K IOPS per drive, all 7 active on reads. This is
awesome, and close to the claimed peak hardware performance of 50K 4KB
read IOPS per drive.
> lat (usec) : 2=3.34%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
> lat (usec) : 100=0.01%, 250=3.67%, 500=31.43%, 750=24.55%, 1000=13.33%
> lat (msec) : 2=19.37%, 4=4.25%, 10=0.04%, 20=0.01%, 50=0.01%
> lat (msec) : 100=0.01%, 250=0.01%, 1000=0.01%, 2000=0.01%
76% of read IOPS completed in 1 millisecond or less, 63% in 750
microseconds or less, and 31% in 500 microseconds or less. This is
nearly perfect for 7 of these SSDs.
...
> write: io=13885MB, bw=236914KB/s, iops=59228 , runt= 60016msec
^^^^^^^ ^^^^^
The write IOPS is roughly 10K per drive counting 6 drives no parity.
This result should be 200K-240K IOPS, 40K IOPS per drive, for these
SandForce based SSDs. Why is it so horribly low? The latencies yield a
clue.
> lat (usec) : 250=0.01%, 500=1.25%, 750=8.72%, 1000=10.13%
> lat (msec) : 2=23.87%, 4=24.78%, 10=24.87%, 20=5.85%, 50=0.39%
> lat (msec) : 100=0.11%, 250=0.01%, 750=0.01%, 2000=0.01%, >=2000=0.01%
80% of write IOPS required more than 2 milliseconds to complete, 56%
required more than 4ms, 31% required over 10ms, and 6.35% required over
20ms. This is roughly equivalent to 15K SAS performance. What tends to
make SSD write latency so high? Erase block rewrite, garbage
collection. Why are we experiencing this during the test? Let's see...
Your read test was 75 GB and your write test was 14 GB. These should
always be equal values when the size= parameter is specified. Using
file based IO, FIO will normally create one read and one write file per
job thread of size "size=", and should throw an error and exit if the
filesystem space is not sufficient.
When performing IO to a raw block device I don't know what the FIO
behavior is as the raw device scenario isn't documented and I've never
traced it. Given your latency results it's clear that your SSD were
performing heavyweight garbage collection during the write test. This
would tend to suggest that the test device was significantly smaller
than the 512 GB required, and thus the erase blocks were simply
rewritten many times over. This scenario would tend to explain the
latencies reported.
...
> So, a maximum of 237MB/s write. Once DRBD takes that and adds its
> overhead, I'm getting approx 10% of that performance (some of the time;
> other times I'm getting even less, but that is probably yet another issue).
>
> Now, 237MB/s is pretty poor, and when you try and share that between a
> dozen VM's, with some of those VM's trying to work on 2+ GB files
> (outlook users), then I suspect that is why there are so many issues.
> The question is, what can I do to improve this? Should I use RAID5 with
> a smaller stripe size? Should I use RAID10 or RAID1+linear? Could the
> issue be from LVM? LVM is using 4MB Physical Extents, but from reading
> around, nobody seems to worry about PE size in relation to performance
> (only LVM1 had a limit on the number of PEs, which meant a larger LV
> required larger PEs).
I suspect you'll be rethinking the above after running a proper FIO test
for 4KB IOPS. Try numjobs=8 and size=500m, for an 8 GB test, assuming
the test LV is greater than 8 GB in size.
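That is, your same job file with the size shrunk to fit and the runtime
line dropped, so size alone bounds each pass:

[global]
filename=/dev/vg0/testing
zero_buffers
numjobs=8
thread
group_reporting
blocksize=4k
ioengine=libaio
iodepth=16
direct=1
size=500m
[read]
rw=randread
stonewall
[write]
rw=randwrite
stonewall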
...
> BTW, I've also split the domain controller to a win2008R2 server, and
> upgraded the file server to win2012R2.
I take it you decided this route had fewer potential pitfalls than
reassigning the DC share LUN to a new VM with the same Windows host
name, exporting/importing the shares, etc? It'll be interesting to see
if this resolves some/all of the problems. Have my fingers crossed for ya.
Please don't feel I'm picking on you WRT your understanding of IO
performance, benching, etc. It is not my intent to belittle you. It is
critical that you better understand Linux block IO, proper testing, and
correct interpretation of the results. Once you do, you can recognize
if/when and where you actually have problems, instead of thinking you
have a problem where none exists.
Cheers,
Stan
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Growing RAID5 SSD Array
2014-04-08 15:27 ` Stan Hoeppner
@ 2014-04-09 3:57 ` Adam Goryachev
2014-04-10 8:06 ` Stan Hoeppner
0 siblings, 1 reply; 16+ messages in thread
From: Adam Goryachev @ 2014-04-09 3:57 UTC (permalink / raw)
To: stan, linux-raid@vger.kernel.org
On 09/04/14 01:27, Stan Hoeppner wrote:
> On 4/5/2014 2:25 PM, Adam Goryachev wrote:
>> On 26/03/14 07:31, Stan Hoeppner wrote:
>>> On 3/25/2014 8:10 AM, Adam Goryachev wrote:
> ...
> ...
>> OK, I'm going to add the following to the /etc/rc.local:
>> for irq in `cat /proc/interrupts |grep mpt2sas| awk -F: '{ print $1}'`
>> do
>> echo 4 > /proc/irq/${irq}/smp_affinity
>> done
>>
>> That will move the LSI card interrupt processing to CPU2 like this:
>> 57: 143806142 7246 41052 0 IR-PCI-MSI-edge mpt2sas0-msix0
>> 58: 14381650 0 22952 0 IR-PCI-MSI-edge mpt2sas0-msix1
>> 59: 6733526 0 144387 0 IR-PCI-MSI-edge mpt2sas0-msix2
>> 60: 3342802 0 32053 0 IR-PCI-MSI-edge mpt2sas0-msix3
>>
>> You can see I briefly moved one to CPU1 as well.
> Most of your block IO interrupts are read traffic. md/RAID5 reads are
> fully threaded, unlike writes, and can be serviced by any core. Assign
> each LSI interrupt queue to a different core.
>
>> Would you suggest moving the eth devices to another CPU as well, perhaps
>> CPU3 ?
> Spread all the interrupt queues across all cores, starting with CPU3
> moving backwards and eth0 moving forward, this because IIRC eth0 is your
> only interface receiving inbound traffic currently, due to a broken
> balance-alb config. NICs generally only generate interrupts for inbound
> packets, so balancing IRQs won't make much difference until you get
> inbound load balancing working.
>
> ...
My /proc/interrupts now looks like this:
47: 22036 0 78203150 0 IR-PCI-MSI-edge mpt2sas0-msix0
48: 1588 0 78058322 0 IR-PCI-MSI-edge mpt2sas0-msix1
49: 616 0 352803023 0 IR-PCI-MSI-edge mpt2sas0-msix2
50: 382 0 78836976 0 IR-PCI-MSI-edge mpt2sas0-msix3
51: 303 0 0 34032878 IR-PCI-MSI-edge eth3-TxRx-0
52: 120 0 0 49823788 IR-PCI-MSI-edge eth3-TxRx-1
53: 118 0 0 27475141 IR-PCI-MSI-edge eth3-TxRx-2
54: 100 0 0 52690836 IR-PCI-MSI-edge eth3-TxRx-3
55: 2 0 0 13 IR-PCI-MSI-edge eth3
56: 8845363 0 0 0 IR-PCI-MSI-edge eth0-rx-0
57: 7884067 0 0 0 IR-PCI-MSI-edge eth0-tx-0
58: 2 0 0 0 IR-PCI-MSI-edge eth0
59: 26 18534150 0 0 IR-PCI-MSI-edge eth2-TxRx-0
60: 23 292294351 0 0 IR-PCI-MSI-edge eth2-TxRx-1
61: 21 29820261 0 0 IR-PCI-MSI-edge eth2-TxRx-2
62: 21 32405950 0 0 IR-PCI-MSI-edge eth2-TxRx-3
I've replaced the 8 x 1G ethernet with the 1 x 10G ethernet (yep, I
know, probably not useful, but at least it solved the unbalanced
traffic, and removed another potential problem point).
So, currently, total IRQs per core are roughly equal. Given I only have
4 cores, is it still useful to put each IRQ on a different core? Also,
most of the interrupt load for the LSI card lands on a single vector
(mpt2sas0-msix2), so again will it make any difference?
>> I'll run a bunch more tests tonight, and get a better idea. For now though:
>> dd if=/dev/vg0/xptest of=/dev/vg0/testing iflag=direct oflag=direct
>> bs=1536k count=5k
>> iostat shows much more solid read and write rates, around 120MB/s peaks,
>> dd reported 88MB/s, it also shows 0 for rrqm and wrqm, so no more
>> merging was being done.
> Moving larger blocks and thus eliminating merges increased throughput a
> little over 2x. The absolute data rate is still very poor as something
> is broken. Still, doubling throughput with a few command line args is
> always impressive.
OK, re-running the above test now (while some other load is active) I
get this result from iostat while the copy is running:
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 1316.00 11967.80 391.40 791.80 44.96 49.97 164.32 0.83 0.69 0.96 0.56 0.40 47.20
sdc 1274.00 11918.20 383.00 815.60 44.73 49.81 161.54 0.82 0.67 0.88 0.58 0.39 47.20
sdd 1288.00 11965.00 388.00 791.00 44.84 49.95 164.65 0.88 0.73 1.05 0.57 0.42 49.28
sde 1358.00 11972.20 385.00 795.60 45.10 50.00 164.98 0.95 0.79 1.10 0.64 0.44 52.24
sdf 1304.60 11963.60 393.20 804.80 44.94 50.00 162.30 0.80 0.66 0.93 0.53 0.38 45.84
sdg 1329.80 11967.00 394.00 802.60 45.03 49.99 162.64 0.80 0.67 0.94 0.53 0.39 46.64
sdi 1282.60 11937.00 380.80 803.40 44.75 49.84 163.59 0.81 0.67 0.91 0.56 0.40 47.68
md1 0.00 0.00 4595.00 4693.00 286.00 287.40 126.43 0.00 0.00 0.00 0.00 0.00 0.00
root@san1:~# dd if=/dev/vg0/xptest of=/dev/vg0/testing iflag=direct oflag=direct bs=1536k count=5k
5120+0 records in
5120+0 records out
8053063680 bytes (8.1 GB) copied, 23.684 s, 340 MB/s
So, now 340MB/s... but now the merging is being done again. I'm not sure
this is going to matter though, see below...
>> The avgrq-sz value is always 128 for the
>> destination, and almost always 128 for the source during the copy. This
>> seems to equal 64kB, so I'm not sure why that is if we told dd to use
>> 1536k ...
> I'd need to see the actual output to comment intelligently on this.
> However, do note that application read/write IO size and avgrq-sz
> reported by iostat are two different things.
>
> ...
See results above...
>> So it looks like CPU0 is less busy, with more work being done on CPU2
>> (the interrupts for the LSI SATA controller)
> The md write thread is typically scheduled on the processor (core) which
> is servicing interrupts for the thread. The %sy you're seeing on CPU2
> is not interrupt processing but the RAID5 write thread execution.
>
>> If I increase bs=6M then dd reports 130MB/s ...
> You can continue increasing the dd block size and gain small increases
> in throughput incrementally until you hit the wall. But again,
> something is broken somewhere for single thread throughput to be this low.
>
> ...
>>> According to your iostat output above, drbd2 was indeed still
>>> engaged. And eating over 59.6% and 91.6% of a core.
>> Nope, definitely not connected; however, it is still part of the IO
>> path, because the LV sits on DRBD. So it isn't talking to its partner,
>> but it still does its own "work" in between LVM and MD.
>>
>>>> So, I know dd isn't the ideal performance testing tool or metric, but
>>>> I'd really like to know why I can't get more than 40MB/s. There is no
>>>> networking, no iscsi, just a fairly simple raid5, drbd, and lvm.
> There is nothing simple at all about a storage architecture involving
> layered lvm, drbd, and md RAID. This may be a "popular" configuration,
> but popular does not equal "simple".
Sorry, I meant "simple raid5", drbd and lvm or (simple raid5), drbd and
lvm.... :)
>>> You can get much more than 40MB/s, but you must know your tools, and
>>> gain a better understanding of the Linux IO subsystem.
>> Apologies, it was a second late night in a row, and I wasn't doing very
>> well, I should have remembered my previous lessons about this!
> Remember: High throughput requires large IOs in parallel. High IOPS
> requires small IOs in parallel. Bandwidth and IOPS are inversely
> proportional.
>
Yep, I'm working through that learning curve :) I never considered
storage to be such a complex topic, and I'm sure I never had to deal
with this much before. The last time I seriously dealt with storage
performance was setting up an NNTP news server, where the simple solution
was to drop in lots of small (well, compared to current sizes) SCSI
drives to allow the nntp server to balance load amongst the different
drives. From memory that was all without raid, since if you lost a bunch
of newsgroups you just said "too bad" to the users, waited a few days,
and everything was fine again :)
>> OK, so thinking this through... We should expect really poor performance
>> if we are not using O_DIRECT, and not doing large requests in parallel.
> You should not expect poor performance with a single thread, though you
> won't see the full hardware potential of the SSDs either. Something odd
> is going on
> in your current setup if a dd copy with large block size and O_DIRECT
> can only hit 130 MB/s to an array of 7 of these SandForce based Intel
> SSDs. You should be able to hit a few hundred MB/s with a simultaneous
> read and write stream from one LV to another. Something is plugging a
> big finger into the ends of your fat IO pipe when single streaming.
> Determining what this finger is will require some investigation.
I think we might have part of the answer... see below...
>
>> I think the parallel part of the workload should be fine in real world
>> use, since each user and machine will be generating some random load,
>> which should be delivered in parallel to the stack (LVM/DRBD/MD).
>> However, in 'real world' use, we don't determine the request size, only
>> the application or client OS, or perhaps iscsi will determine that.
> Note that in your previous testing you achieved 200 MB/s iSCSI traffic
> at the Xen hosts. Whether using many threads on the client or not,
> iSCSI over GbE at the server should never be faster than a local LV to
> LV copy. Something is misconfigured or you have a bug somewhere.
Or perhaps we are testing different things. I think the 200MB/s over
iSCSI was using fio, with large block sizes, and multiple threads.
>
>> My concern is that while I can get fantastical numbers from specific
>> tests (such as highly parallel, large block size requests) I don't need
>> that type of I/O,
> The previous testing I assisted you with a year ago demonstrated peak
> hardware read/write throughput of your RAID5 array. Demonstrating
> throughput was what you requested, not IOPS.
Yep, again, my own complete ignorance. Sometimes you just want to see a
big number because it looks good, regardless of what it means. At the
time I was merely suspicious of a performance issue, and randomly
testing things I only partly understood, and then focusing on the items
which produced unexpected results. That started as throughput on the SAN.
> The broken FIO test you performed, with results down below, demonstrated
> 320K read IOPS, or 45K IOPS per drive. This is the inverse test of
> bandwidth. Here you also achieved near peak hardware IO rate from the
> SSDs, which is claimed by Intel at 50K read IOPS. You have the best of
> both worlds, max throughput and IOPS. Had you not broken the test,
> your write IOPS would have been correctly demonstrated as well.
>
> Playing the broken record again, you simply don't yet understand how to
> use your benchmarking/testing tools, nor the data, the picture, they are
> presenting to you.
>
>> so my system isn't tuned to my needs.
> While that statement may be true, the thing(s) not properly tuned are
> not the SSDs, nor LSI, nor mobo, nor md. That leaves LVM and DRBD. And
> the problems may not be due to tuning but bugs.
Absolutely, and to be honest, while we have tuned a few of those things
I don't think they were significant in the scheme of things. Tuning
something that isn't broken might get an extra few percent, but we were
always looking to get a significant improvement (like 5x or something).
>> After working with linbit (DRBD) I've found out some more useful
>> information, which puts me right back to the beginning I think, but with
>> a lot more experience and knowledge.
>> It seems that DRBD keeps its own "journal", so every write is written
>> to the journal, then its bitmap is marked, then the journal is written
>> to the data area, then the bitmap is updated again, and then it starts
>> over for the next write. This means it is doing lots and lots of small
>> writes to the same areas of the disk, i.e. 4k blocks.
> Your 5 SSDs had a combined ~160,000 4KB IOPS write performance. Your 7
> SSDs should hit ~240,000 4KB write IOPS when configured properly. To
> put this into perspective, an array comprised of 15K SAS drives in RAID0
> would require 533 and 800 drives respectively to reach the same IOPS
> performance, 1066 and 1600 drives in RAID10.
OK, so like I always thought, the hardware I have *should* be producing
some awesome performance... I'd hate to think how someone might connect
1600 15k SAS drives, let alone handle the noise, heat, power draw, etc.
> With that comparison in mind, surely it's clear that your original DRBD
> journal throughput was not creating a bottleneck of any kind at the SSDs.
See below...
>> Anyway, I was advised to re-organise the stack from:
>> RAID5 -> DRBD -> LVM -> iSCSI
>> To:
>> RAID5 -> LVM -> DRBD -> iSCSI
>> This means each DRBD device is smaller, and so the "working set" is
>> smaller, and should be more efficient.
> Makes sense. But I saw nothing previously to suggest DRBD CPU or memory
> consumption was a problem, nor write IOPS.
>
>> So, now I am easily able to do
>> tests completely excluding drbd by targeting the LV itself. Which means
>> just RAID5 + LVM layers to worry about.
> Recall what I said previously about knowing your tools?
>
> ...
>> [global]
>> filename=/dev/vg0/testing
>> zero_buffers
>> numjobs=16
>> thread
>> group_reporting
>> blocksize=4k
>> ioengine=libaio
>> iodepth=16
>> direct=1
> It's generally a bad idea to mix size and run time. It makes results
> non-deterministic. Best to use one or the other. But you have much
> bigger problems here...
>
>> runtime=60
>> size=16g
> 16 jobs * 2 streams (read + write) * 16 GB per stream = 512 GB required
> for this test. The size= parm is per job thread, not aggregate. What
> was the capacity of /dev/vg0/testing? Is this a filesystem or raw
> device? I'm assuming raw device of capacity well less than 512 GB.
From running the tests, fio runs one stream (read or write) at a time,
not both concurrently. So it does the read test first, and then does the
write test.
testing vg0 -wi-ao-- 50.00g
The LV was 50G.... somewhat smaller than the 512GB required then....
What I thought it was doing was making 16 requests in parallel, with a
total test size of 16G. Clearly a mistake again.
>> read : io=74697MB, bw=1244.1MB/s, iops=318691 , runt= 60003msec
> ^^^^^^^ ^^^^^^
>
> 318K IOPS is 45K IOPS per drive, all 7 active on reads. This is
> awesome, and close to the claimed peak hardware performance of 50K 4KB
> read IOPS per drive.
Yep, read performance is awesome, and I don't think this was ever an
issue... at least, not for a long time (or my memory is corrupt)...
>> lat (usec) : 2=3.34%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
>> lat (usec) : 100=0.01%, 250=3.67%, 500=31.43%, 750=24.55%, 1000=13.33%
>> lat (msec) : 2=19.37%, 4=4.25%, 10=0.04%, 20=0.01%, 50=0.01%
>> lat (msec) : 100=0.01%, 250=0.01%, 1000=0.01%, 2000=0.01%
> 76% of read IOPS completed in 1 millisecond or less, 63% in 750
> microseconds or less, and 31% in 500 microseconds or less. This is
> nearly perfect for 7 of these SSDs.
Inadvertently, I have ended up with 5 x SSDSC2CW480A3 + 2 x
SSDSC2BW480A4 in each server. I noticed significantly higher %util
reported by iostat on the 2 SSD's compared to the other 5. Finally on
Monday I moved two of the SSDSC2CW480A3 models from the second server
into the primary, (one at a time) and the two SSDSC2BW480A4 into the
second server. So then I had 7 x SSDSC2CW480A3 in the primary, and the
secondary had 3 of them plus 4 of the other model. iostat on the primary
then showed a much more balanced load across all 7 of the SSD's in the
primary (with DRBD disconnected).
BTW, when I say much higher, the 2 SSD's would show 40% while the
other 5 would show around 10%, with the two peaking at 100% while the
other 5 would peak at 30%...
I haven't been able to find detailed enough specs on the differences
between these two models to explain that yet. In any case, the
SSDSC2CW480A3 model is no longer available, so I can't order more of
them anyway.
>> write: io=13885MB, bw=236914KB/s, iops=59228 , runt= 60016msec
> ^^^^^^^ ^^^^^
> The write IOPS is roughly 10K per drive counting 6 drives no parity.
> This result should be 200K-240K IOPS, 40K IOPS per drive, for these
> SandForce based SSDs. Why is it so horribly low? The latencies yield a
> clue.
>
>> lat (usec) : 250=0.01%, 500=1.25%, 750=8.72%, 1000=10.13%
>> lat (msec) : 2=23.87%, 4=24.78%, 10=24.87%, 20=5.85%, 50=0.39%
>> lat (msec) : 100=0.11%, 250=0.01%, 750=0.01%, 2000=0.01%, >=2000=0.01%
> 80% of write IOPS required more than 2 milliseconds to complete, 56%
> required more than 4ms, 31% required over 10ms, and 6.35% required over
> 20ms. This is roughly equivalent to 15K SAS performance. What tends to
> make SSD write latency so high? Erase block rewrite, garbage
> collection. Why are we experiencing this during the test? Let's see...
>
> Your read test was 75 GB and your write test was 14 GB. These should
> always be equal values when the size= parameter is specified. Using
> file based IO, FIO will normally create one read and one write file per
> job thread of size "size=", and should throw an error and exit if the
> filesystem space is not sufficient.
>
> When performing IO to a raw block device I don't know what the FIO
> behavior is as the raw device scenario isn't documented and I've never
> traced it. Given your latency results it's clear that your SSD were
> performing heavyweight garbage collection during the write test. This
> would tend to suggest that the test device was significantly smaller
> than the 512 GB required, and thus the erase blocks were simply
> rewritten many times over. This scenario would tend to explain the
> latencies reported.
One other explanation for the different sizes might be that the
bandwidth was different, but the time was constant (because I specified
the time option as well). In any case, the performance difference might
easily be due to your suggestion, which was definitely another idea I
was having. I was thinking now that I have more drives, I could go back
to the old solution of leaving some un-allocated space on each drive.
However, to do that I would have needed to reduce the PV (ensuring no
allocated blocks at the "end" of the MD), then reduce the MD, and finally
reduce the partition. Then I would still have needed a way to tell the
SSDs that the space is now unused (TRIM). Now I think it isn't so
important any more...
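(Since writing the above I've noticed util-linux ships a blkdiscard
tool which looks like it can TRIM an arbitrary range of a raw device,
e.g. something like

blkdiscard --offset 430G --length 50G /dev/sdX

-- untested on these drives, and obviously destructive to anything
stored in that range.)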
>> So, a maximum of 237MB/s write. Once DRBD takes that and adds its
>> overhead, I'm getting approx 10% of that performance (some of the time;
>> other times I'm getting even less, but that is probably yet another issue).
>>
>> Now, 237MB/s is pretty poor, and when you try and share that between a
>> dozen VM's, with some of those VM's trying to work on 2+ GB files
>> (outlook users), then I suspect that is why there are so many issues.
>> The question is, what can I do to improve this? Should I use RAID5 with
>> a smaller stripe size? Should I use RAID10 or RAID1+linear? Could the
>> issue be from LVM? LVM is using 4MB Physical Extents, but from reading
>> around, nobody seems to worry about PE size in relation to performance
>> (only LVM1 had a limit on the number of PEs, which meant a larger LV
>> required larger PEs).
> I suspect you'll be rethinking the above after running a proper FIO test
> for 4KB IOPS. Try numjobs=8 and size=500m, for an 8 GB test, assuming
> the test LV is greater than 8 GB in size.
>
> ...
OK, I'll retry with numjobs=16 and size=1G which should require a 32G
LV, which should be fine with my 50G LV.
read: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=16
...
read: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=16
write: (g=1): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=16
...
write: (g=1): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=16
2.0.8
Starting 32 threads
Jobs: 2 (f=2): [_________________w_____________w] [100.0% done] [0K/157.9M /s] [0 /40.5K iops] [eta 00m:00s]]
read: (groupid=0, jobs=16): err= 0: pid=26714
read : io=16384MB, bw=1267.4MB/s, iops=324360 , runt= 12931msec
slat (usec): min=1 , max=141080 , avg= 7.28, stdev=141.90
clat (usec): min=9 , max=207827 , avg=764.34, stdev=962.30
lat (usec): min=55 , max=207831 , avg=771.84, stdev=981.10
clat percentiles (usec):
| 1.00th=[ 159], 5.00th=[ 215], 10.00th=[ 262], 20.00th=[ 342],
| 30.00th=[ 426], 40.00th=[ 524], 50.00th=[ 628], 60.00th=[ 740],
| 70.00th=[ 868], 80.00th=[ 1048], 90.00th=[ 1352], 95.00th=[ 1672],
| 99.00th=[ 2672], 99.50th=[ 3632], 99.90th=[ 8896], 99.95th=[13632],
| 99.99th=[36608]
bw (KB/s) : min=40608, max=109600, per=6.29%, avg=81566.38, stdev=8098.56
lat (usec) : 10=0.01%, 20=0.01%, 50=0.01%, 100=0.02%, 250=8.72%
lat (usec) : 500=29.09%, 750=23.21%, 1000=16.65%
lat (msec) : 2=19.74%, 4=2.16%, 10=0.33%, 20=0.05%, 50=0.02%
lat (msec) : 100=0.01%, 250=0.01%
cpu : usr=41.33%, sys=238.07%, ctx=48328280, majf=0, minf=64230
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%,
>=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%,
>=64=0.0%
issued : total=r=4194304/w=0/d=0, short=r=0/w=0/d=0
write: (groupid=1, jobs=16): err= 0: pid=27973
write: io=16384MB, bw=262686KB/s, iops=65671 , runt= 63868msec
slat (usec): min=2 , max=4387.4K, avg=64.75, stdev=9203.16
clat (usec): min=13 , max=6500.9K, avg=3692.55, stdev=47966.38
lat (usec): min=64 , max=6500.9K, avg=3757.42, stdev=48862.99
clat percentiles (usec):
| 1.00th=[ 410], 5.00th=[ 564], 10.00th=[ 700], 20.00th=[ 1080],
| 30.00th=[ 1432], 40.00th=[ 1688], 50.00th=[ 1880], 60.00th=[ 2064],
| 70.00th=[ 2256], 80.00th=[ 2480], 90.00th=[ 2992], 95.00th=[ 3632],
| 99.00th=[ 8640], 99.50th=[12736], 99.90th=[577536], 99.95th=[954368],
| 99.99th=[2146304]
bw (KB/s) : min= 97, max=56592, per=7.49%, avg=19678.60, stdev=8387.79
lat (usec) : 20=0.01%, 100=0.01%, 250=0.08%, 500=2.74%, 750=8.96%
lat (usec) : 1000=6.49%
lat (msec) : 2=38.00%, 4=40.30%, 10=2.68%, 20=0.36%, 50=0.02%
lat (msec) : 100=0.14%, 250=0.06%, 500=0.07%, 750=0.04%, 1000=0.03%
lat (msec) : 2000=0.03%, >=2000=0.01%
cpu : usr=10.05%, sys=40.27%, ctx=60488513, majf=0, minf=62068
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%,
>=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%,
>=64=0.0%
issued : total=r=0/w=4194304/d=0, short=r=0/w=0/d=0
Run status group 0 (all jobs):
READ: io=16384MB, aggrb=1267.4MB/s, minb=1267.4MB/s, maxb=1267.4MB/s, mint=12931msec, maxt=12931msec
Run status group 1 (all jobs):
WRITE: io=16384MB, aggrb=262685KB/s, minb=262685KB/s, maxb=262685KB/s, mint=63868msec, maxt=63868msec
So, I don't think that made a lot of difference to the results.
>> BTW, I've also split the domain controller to a win2008R2 server, and
>> upgraded the file server to win2012R2.
> I take it you decided this route had fewer potential pitfalls than
> reassigning the DC share LUN to a new VM with the same Windows host
> name, exporting/importing the shares, etc? It'll be interesting to see
> if this resolves some/all of the problems. Have my fingers crossed for ya.
It wasn't clear, but what I meant was:
1) Install new 2008R2 server, promote to DC, migrate roles across to it, etc
2) Install new 2012R2 server
3) export registry with share information and shutdown the old 2003 server
4) change name of the new server (to the same as the old server) and
join the domain
5) attach the existing LUN to the 2012R2 server
6) import the registry information
Short answer: it seemed to have a variable result, but I think that was
just the usual some days are good and some days are bad, depending on
who is doing what, when, and how much the users decide to complain.
> Please don't feel I'm picking on you WRT your understanding of IO
> performance, benching, etc. It is not my intent to belittle you. It is
> critical that you better understand Linux block IO, proper testing, and
> correct interpretation of the results. Once you do, you can recognize
> if/when and where you actually have problems, instead of thinking you
> have a problem where none exists.
Absolutely, and I do appreciate the lessons. I apologise for needing so
much "hand holding", but hopefully we are almost at the end.
After some more work with linbit, they logged in, and took a look
around, doing some of their own measurements, and the outcome was to add
the following three options to the DRBD config file, which improved the
DRBD IOPS from around 3000 to 50000.
disk-barrier no;
disk-flushes no;
md-flushes no;
Essentially DRBD was disabling the SSD write cache by forcing every
write to be completed before returning, and this was drastically
reducing the IOPS that could be achieved.
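For anyone following along, those three options go in the disk section
of the resource definition, so the config now looks roughly like this
(resource name and other details trimmed):

resource r0 {
    disk {
        disk-barrier no;
        disk-flushes no;
        md-flushes no;
    }
    ...
}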
Running the same test against the DRBD device, in a connected state:
read: (groupid=0, jobs=16): err= 0: pid=4498
read : io=16384MB, bw=1238.8MB/s, iops=317125 , runt= 13226msec
slat (usec): min=0 , max=997330 , avg=11.16, stdev=992.34
clat (usec): min=0 , max=1015.8K, avg=769.38, stdev=7791.99
lat (usec): min=0 , max=1018.6K, avg=781.10, stdev=7873.73
clat percentiles (usec):
| 1.00th=[ 0], 5.00th=[ 0], 10.00th=[ 195], 20.00th=[ 298],
| 30.00th=[ 370], 40.00th=[ 446], 50.00th=[ 532], 60.00th=[ 620],
| 70.00th=[ 732], 80.00th=[ 876], 90.00th=[ 1144], 95.00th=[ 1480],
| 99.00th=[ 4896], 99.50th=[ 7200], 99.90th=[16512], 99.95th=[21888],
| 99.99th=[53504]
bw (KB/s) : min= 5085, max=305504, per=6.35%, avg=80531.22, stdev=29062.40
lat (usec) : 2=7.73%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
lat (usec) : 100=0.04%, 250=6.78%, 500=32.00%, 750=25.02%, 1000=14.15%
lat (msec) : 2=11.28%, 4=1.64%, 10=1.10%, 20=0.20%, 50=0.05%
lat (msec) : 100=0.01%, 250=0.01%, 1000=0.01%, 2000=0.01%
cpu : usr=41.05%, sys=253.29%, ctx=49215916, majf=0, minf=65328
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%,
>=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%,
>=64=0.0%
issued : total=r=4194304/w=0/d=0, short=r=0/w=0/d=0
write: (groupid=1, jobs=16): err= 0: pid=5163
write: io=16384MB, bw=138483KB/s, iops=34620 , runt=121150msec
slat (usec): min=1 , max=84258 , avg=20.68, stdev=303.42
clat (usec): min=179 , max=123372 , avg=7354.94, stdev=3634.96
lat (usec): min=187 , max=132967 , avg=7375.81, stdev=3644.96
clat percentiles (usec):
| 1.00th=[ 3696], 5.00th=[ 4576], 10.00th=[ 5088], 20.00th=[ 5920],
| 30.00th=[ 6560], 40.00th=[ 7008], 50.00th=[ 7328], 60.00th=[ 7584],
| 70.00th=[ 7840], 80.00th=[ 8160], 90.00th=[ 8640], 95.00th=[ 9280],
| 99.00th=[13504], 99.50th=[23168], 99.90th=[67072], 99.95th=[70144],
| 99.99th=[75264]
bw (KB/s) : min= 5976, max=12447, per=6.26%, avg=8673.20, stdev=731.62
lat (usec) : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.09%, 4=1.76%, 10=94.97%, 20=2.61%, 50=0.29%
lat (msec) : 100=0.26%, 250=0.01%
cpu : usr=8.99%, sys=33.90%, ctx=71679376, majf=0, minf=69677
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%,
>=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%,
>=64=0.0%
issued : total=r=0/w=4194304/d=0, short=r=0/w=0/d=0
Run status group 0 (all jobs):
READ: io=16384MB, aggrb=1238.8MB/s, minb=1238.8MB/s, maxb=1238.8MB/s, mint=13226msec, maxt=13226msec
Run status group 1 (all jobs):
WRITE: io=16384MB, aggrb=138483KB/s, minb=138483KB/s, maxb=138483KB/s, mint=121150msec, maxt=121150msec
Disk stats (read/write):
drbd17: ios=4194477/4188834, merge=0/0, ticks=2645376/30507320, in_queue=33171672, util=99.81%
Here is the summary of the first fio above:
read : io=16384MB, bw=1267.4MB/s, iops=324360 , runt= 12931msec
write: io=16384MB, bw=262686KB/s, iops=65671 , runt= 63868msec
READ: io=16384MB, aggrb=1267.4MB/s, minb=1267.4MB/s, maxb=1267.4MB/s, mint=12931msec, maxt=12931msec
WRITE: io=16384MB, aggrb=262685KB/s, minb=262685KB/s, maxb=262685KB/s, mint=63868msec, maxt=63868msec
So, do you still think there is an issue (from looking at the first fio
results above) with getting "only" 65k IOPS write?
One potential clue I did find was hidden in the Intel specs:
Firstly Intel markets it here:
http://www.intel.com/content/www/us/en/solid-state-drives/solid-state-drives-520-series.html
480GB: SATA 6Gb/s 550 MB/s / 520 MB/s; SATA 3Gb/s 280 MB/s / 260 MB/s;
50,000 IOPS / 50,000 IOPS; 9.5mm 2.5-inch SATA
However, here:
http://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/ssd-530-sata-specification.pdf
Table 5 shows the Incompressible Performance:
480GB: Random 4k Read 37500 IOPS, Random 4k Write 13000 IOPS
So, now we might be better placed to calculate the "expected" results?
13000 * 6 = 78000, we are getting 65000, which is not very far away.
So, for yesterday and today, with the barriers/flushes disabled, things
seem to be working well, I haven't had any user complaints, and that
makes me happy :) However, if you still think I should be able to get
200000 IOPS or higher on write, then I'll definitely be interested in
investigating further.
Regards,
Adam
--
Adam Goryachev Website Managers www.websitemanagers.com.au
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Growing RAID5 SSD Array
2014-04-09 3:57 ` Adam Goryachev
@ 2014-04-10 8:06 ` Stan Hoeppner
0 siblings, 0 replies; 16+ messages in thread
From: Stan Hoeppner @ 2014-04-10 8:06 UTC (permalink / raw)
To: Adam Goryachev, linux-raid@vger.kernel.org
On 4/8/2014 10:57 PM, Adam Goryachev wrote:
> On 09/04/14 01:27, Stan Hoeppner wrote:
>> On 4/5/2014 2:25 PM, Adam Goryachev wrote:
>>> On 26/03/14 07:31, Stan Hoeppner wrote:
>>>> On 3/25/2014 8:10 AM, Adam Goryachev wrote:
...
>>> Would you suggest moving the eth devices to another CPU as well, perhaps
>>> CPU3 ?
>>
>> Spread all the interrupt queues across all cores, starting with CPU3
>> moving backwards and eth0 moving forward, this because IIRC eth0 is your
>> only interface receiving inbound traffic currently, due to a broken
>> balance-alb config. NICs generally only generate interrupts for inbound
>> packets, so balancing IRQs won't make much difference until you get
>> inbound load balancing working.
>
> My /proc/interrupts now looks like this:
> 47: 22036 0 78203150 0 IR-PCI-MSI-edge mpt2sas0-msix0
> 48: 1588 0 78058322 0 IR-PCI-MSI-edge mpt2sas0-msix1
> 49: 616 0 352803023 0 IR-PCI-MSI-edge mpt2sas0-msix2
> 50: 382 0 78836976 0 IR-PCI-MSI-edge mpt2sas0-msix3
> 51: 303 0 0 34032878 IR-PCI-MSI-edge eth3-TxRx-0
> 52: 120 0 0 49823788 IR-PCI-MSI-edge eth3-TxRx-1
> 53: 118 0 0 27475141 IR-PCI-MSI-edge eth3-TxRx-2
> 54: 100 0 0 52690836 IR-PCI-MSI-edge eth3-TxRx-3
> 55: 2 0 0 13 IR-PCI-MSI-edge eth3
> 56: 8845363 0 0 0 IR-PCI-MSI-edge eth0-rx-0
> 57: 7884067 0 0 0 IR-PCI-MSI-edge eth0-tx-0
> 58: 2 0 0 0 IR-PCI-MSI-edge eth0
> 59: 26 18534150 0 0 IR-PCI-MSI-edge eth2-TxRx-0
> 60: 23 292294351 0 0 IR-PCI-MSI-edge eth2-TxRx-1
> 61: 21 29820261 0 0 IR-PCI-MSI-edge eth2-TxRx-2
> 62: 21 32405950 0 0 IR-PCI-MSI-edge eth2-TxRx-3
eth0 is the integrated/management port? eth2/3 are the two ports of the new 10 GbE? This should free up all of cpu3 for the RAID5 write thread.
> I've replaced the 8 x 1G ethernet with the 1 x 10G ethernet (yep, I know, probably not useful, but at least it solved the unbalanced traffic, and removed another potential problem point).
It's overkill, but it does make things much cleaner, simpler to manage.
> So, currently, total IRQ's per core are roughly equal. Given I only have 4 cores, is it still useful to put each IRQ on a different core? Also, most of the IRQ's for the LSI card are all on the same IRQ, so again will it make any difference?
It will make the most difference under heavy RAID write load. With a light load probably not much. Given the cost to implement it you can't go wrong here.
>>> I'll run a bunch more tests tonight, and get a better idea. For now though:
>>> dd if=/dev/vg0/xptest of=/dev/vg0/testing iflag=direct oflag=direct
>>> bs=1536k count=5k
>>> iostat shows much more solid read and write rates, around 120MB/s peaks,
>>> dd reported 88MB/s, it also shows 0 for rrqm and wrqm, so no more
>>> merging was being done.
>>
>> Moving larger blocks and thus eliminating merges increased throughput a
>> little over 2x. The absolute data rate is still very poor as something
>> is broken. Still, doubling throughput with a few command line args is
>> always impressive.
I should have said "eliminating [some of the] merges" here. There is always merging, see below.
> OK, re-running the above test now (while some other load is active) I get this result from iostat while the copy is running:
> Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> sda 1316.00 11967.80 391.40 791.80 44.96 49.97 164.32 0.83 0.69 0.96 0.56 0.40 47.20
> sdc 1274.00 11918.20 383.00 815.60 44.73 49.81 161.54 0.82 0.67 0.88 0.58 0.39 47.20
> sdd 1288.00 11965.00 388.00 791.00 44.84 49.95 164.65 0.88 0.73 1.05 0.57 0.42 49.28
> sde 1358.00 11972.20 385.00 795.60 45.10 50.00 164.98 0.95 0.79 1.10 0.64 0.44 52.24
> sdf 1304.60 11963.60 393.20 804.80 44.94 50.00 162.30 0.80 0.66 0.93 0.53 0.38 45.84
> sdg 1329.80 11967.00 394.00 802.60 45.03 49.99 162.64 0.80 0.67 0.94 0.53 0.39 46.64
> sdi 1282.60 11937.00 380.80 803.40 44.75 49.84 163.59 0.81 0.67 0.91 0.56 0.40 47.68
> md1 0.00 0.00 4595.00 4693.00 286.00 287.40 126.43 0.00 0.00 0.00 0.00 0.00 0.00
>
> root@san1:~# dd if=/dev/vg0/xptest of=/dev/vg0/testing iflag=direct oflag=direct bs=1536k count=5k
> 5120+0 records in
> 5120+0 records out
> 8053063680 bytes (8.1 GB) copied, 23.684 s, 340 MB/s
>
> So, now 340MB/s... but now the merging is being done again. I'm not sure this is going to matter though, see below...
Request merging is always performed. You just tend to get more merging with small IOs than with large IOs. Think along the lines of jumbo frames vs standard frames-- more data transferred with less overhead.
>>> The avgrq-sz value is always 128 for the
>>> destination, and almost always 128 for the source during the copy. This
>>> seems to equal 64kB, so I'm not sure why that is if we told dd to use
>>> 1536k ...
>> I'd need to see the actual output to comment intelligently on this.
>> However, do note that application read/write IO size and avgrq-sz
>> reported by iostat are two different things.
>>
>> ...
> See results above...
From 128 to 160, both using 1536 KB block size doing the same copy operation. I'd say there is other load on the system every time you run a test, and that's causing variable results, possibly artificially low results as well.
...
>> Remember: High throughput requires large IOs in parallel. High IOPS
>> requires small IOs in parallel. Bandwidth and IOPS are inversely
>> proportional.
>
> Yep, I'm working through that learning curve :) I never considered storage to be such a complex topic, and I'm sure I never had to deal with this much before. The last time I sincerely dealt with storage performance was setting up a NNTP news server, where the simple solution was to drop in lots of small (well, compared to current sizes) SCSI drives to allow the nntp server to balance load amongst the different drives. From memory that was all without raid, since if you lost a bunch of newsgroups you just said "too bad" to the users, waited a few days, and everything was fine again :)
The only difference between tuning storage and rocket science is that disk drives don't fly-- until you get really frustrated.
...
>>> I think the parallel part of the workload should be fine in real world
>>> use, since each user and machine will be generating some random load,
>>> which should be delivered in parallel to the stack (LVM/DRBD/MD).
>>> However, in 'real world' use, we don't determine the request size, only
>>> the application or client OS, or perhaps iscsi will determine that.
>>
>> Note that in your previous testing you achieved 200 MB/s iSCSI traffic
>> at the Xen hosts. Whether using many threads on the client or not,
>> iSCSI over GbE at the server should never be faster than a local LV to
>> LV copy. Something is misconfigured or you have a bug somewhere.
>
> Or perhaps we are testing different things. I think the 200MB/s over iSCSI was using fio, with large block sizes, and multiple threads.
Anything over the wire, regardless of thread count and block size, should not be faster than a local single stream operation on the same storage, simply due to TCP latency being at least a hundred times higher than local SATA. Worth noting, there are many folks on this list who have demonstrated 500 MB/s+ with similar dd streaming but with only a handful of high cap rust drives. 340 MB/s is only about 1/3rd of the minimum I think you should be seeing. So there's more investigation and optimization to be done.
>>> My concern is that while I can get fantastical numbers from specific
>>> tests (such as highly parallel, large block size requests) I don't need
>>> that type of I/O,
>>
>> The previous testing I assisted you with a year ago demonstrated peak
>> hardware read/write throughput of your RAID5 array. Demonstrating
>> throughput was what you requested, not IOPS.
>
> Yep, again, my own complete ignorance. Sometimes you just want to see a big number because it looks good, regardless of what it means. At the time I was merely suspicious of a performance issue, and randomly testing things I only partly understood, and then focusing on the items which produced unexpected results. That started as throughput on the SAN.
2.5GB/s is such a large number, and is the parallel FIO read throughput you achieved with 5 SSDs last year. You should be able to hit 3.5GB/s read throughput with 7 drives and that job file.
318,000 doesn't seem like a big number to some folks these days who are accustomed to quantities in the GB and TB. But for anyone who has been around storage for a while and understands what "random IOPS" means, this number would have made jaws drop just a few years ago. Before the big storage players started offering SSD based products, a disk based storage system capable of 300K+ random read IOPS would have cost USD $1 million, minimum, and included many FC heads connected to ~2000 disk drives.
>> The broken FIO test you performed, with results down below, demonstrated
>> 320K read IOPS, or 45K IOPS per drive. This is the inverse test of
>> bandwidth. Here you also achieved near peak hardware IO rate from the
>> SSDs, which is claimed by Intel at 50K read IOPS. You have the best of
>> both worlds, max throughput and IOPS. Had you not broken the test,
>> your write IOPS would have been correctly demonstrated as well.
>>
>> Playing the broken record again, you simply don't yet understand how to
>> use your benchmarking/testing tools, nor the data, the picture, they are
>> presenting to you.
>>
>>> so my system isn't tuned to my needs.
>> While that statement may be true, the thing(s) not properly tuned are
>> not the SSDs, nor LSI, nor mobo, nor md. That leaves LVM and DRBD. And
>> the problems may not be due to tuning but bugs.
>
> Absolutely, and to be honest, while we have tuned a few of those things I don't think they were significant in the scheme of things. Tuning something that isn't broken might get an extra few percent, but we were always looking to get a significant improvement (like 5x or something).
Some of the tuning you've done did have a big impact on throughput, specifically testing stripe_cache_size values and settling on 4096. That alone bumped your sustained measured write throughput from ~1 GB/s to 1.6 GB/s, and it provided real world benefit. IIRC, before this tuning you were unable to run some daemon (DRBD or LVM snapshot related) in realtime due to the hit to storage throughput and the resulting poor user performance. After the tuning you were able to re-enable it thanks to the extra headroom.
Speaking of which, you've increased your data 'spindles' by 50% from 4 to 6, which means your drive level peak write throughput with the parallel IO should now be 2.4 GB/s. You should run the last FIO job file you used last year that produced the 1.6 GB/s write throughput with stripe_cache_size 4096, for an apples-to-apples 5-drive vs 7-drive comparison. Then bump stripe_cache_size to 8192 to see if that helps your sequential write throughput. Also perform your recent 4KB FIO test at 8192.
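For the archives, that knob is tuned at runtime through sysfs; a sketch, assuming the array is /dev/md1. It's sized in pages per member device, so memory cost is stripe_cache_size x 4 KiB x member count, i.e. exactly 224 MiB at 8192 on a 7-drive array:

cat /sys/block/md1/md/stripe_cache_size       # show the current setting
echo 8192 > /sys/block/md1/md/stripe_cache_size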
>>> After working with linbit (DRBD) I've found out some more useful
>>> information, which puts me right back to the beginning I think, but with
>>> a lot more experience and knowledge.
>>> It seems that DRBD keeps its own "journal", so every write is written
>>> to the journal, then its bitmap is marked, then the journal is written
>>> to the data area, then the bitmap is updated again, and then it starts over for
>>> the next write. This means it is doing lots and lots of small writes to
>>> the same areas of the disk ie, 4k blocks.
>>
>> Your 5 SSDs had a combined ~160,000 4KB IOPS write performance. Your 7
>> SSDs should hit ~240,000 4KB write IOPS when configured properly. To
>> put this into perspective, an array comprised of 15K SAS drives in RAID0
>> would require 533 and 800 drives respectively to reach the same IOPS
>> performance, 1066 and 1600 drives in RAID10.
>
> OK, so like I always thought, the hardware I have *should* be producing some awesome performance...
Your server isn't the problem. The MS Windows infrastructure is.
> I'd hate to think how someone might connect 1600 15k SAS drives, nor the noise, heat, power draw, etc..
This is small potatoes for large enterprises, sites serving lots of HD video, and of course the HPC labs such as NCSA, ORNL, LLNL, NASA's NAS, LHC, et al with their multiple petabyte Lustre storage. The 4U 60 drive SAN/DAS/JBOD chassis becoming popular today pack 1800 drives in just three 19" cabinets. Many HPC clusters are connected to dozens of such cabinets.
...
>>> [global]
>>> filename=/dev/vg0/testing
>>> zero_buffers
>>> numjobs=16
>>> thread
>>> group_reporting
>>> blocksize=4k
>>> ioengine=libaio
>>> iodepth=16
>>> direct=1
>>
>> It's generally a bad idea to mix size and run time. It makes results
>> non-deterministic. Best to use one or the other. But you have much
>> bigger problems here...
>>
>>> runtime=60
>>> size=16g
>>
>> 16 jobs * 2 streams (read + write) * 16 GB per stream = 512 GB required
>> for this test. The size= parm is per job thread, not aggregate. What
>> was the capacity of /dev/vg0/testing? Is this a filesystem or raw
>> device? I'm assuming raw device of capacity well less than 512 GB.
>
> From running the tests, fio runs one stream (read or write) at a time, not both concurrently. So it does the read test first, and then does the write test.
Correct, that is how fio executes. But that's not the point of confusion here, which I finally figured out. My apologies for not catching this sooner. After re-re-reading your job file I realized you're specifying "filename=" instead of "directory=". I'd assumed you had used the latter, as it was in the example job files I sent you. "directory=" gives you numjobs*2 files (one per thread, for both the read and write jobs), each of size "size=" by default. Specifying one file with "filename=" causes all threads to read/write the same file. So fio should have been using a single 16 GB file, as you thought, or, apparently in your case, 16 GB of raw device space. This is correct, yes? This device has no filesystem, correct?
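To make the distinction concrete, the two forms behave like this (sketch; the directory path is hypothetical):

[global]
filename=/dev/vg0/testing   # all 16 threads share this one target, so only
numjobs=16                  # size= bytes of it are ever touched
size=1g

versus:

[global]
directory=/mnt/scratch      # fio creates a separate file per job thread:
numjobs=16                  # 16 files of size= each, 32 with separate read
size=1g                     # and write job sections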
However, many filesystems tend to achieve poor performance writing to different parts of one file in parallel. fio does this without locking so it's not as bad as the normal case. But even so performance is typically less than accessing multiple files in parallel.
Which filesystem is on this LV? Is it aligned to the RAID geometry?
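If it's XFS, alignment is set at mkfs time to match the md geometry; with the array's 64 KB chunk and 6 data spindles that would look something like this (untested sketch; LV and mount point names hypothetical):

# su = md chunk size, sw = data drives (7 members - 1 parity)
mkfs.xfs -d su=64k,sw=6 /dev/vg0/some_lv
xfs_info /srv/some_mount    # sunit/swidth report the geometry an existing
                            # mounted filesystem is using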
...
> What I thought that was doing is making 16 requests in parallel, with a total test size of 16G. Clearly a mistake again.
Yes it was, but this time it was my mistake.
>>> read : io=74697MB, bw=1244.1MB/s, iops=318691 , runt= 60003msec
>> ^^^^^^^ ^^^^^^
>>
>> 318K IOPS is 45K IOPS per drive, all 7 active on reads. This is
>> awesome, and close to the claimed peak hardware performance of 50K 4KB
>> read IOPS per drive.
>
> Yep, read performance is awesome, and I don't think this was ever an issue... at least, not for a long time (or my memory is corrupt)...
Write performance hadn't been severely lacking either. It simply needed to be demonstrated and quantified, and tweaked a bit.
>>> lat (usec) : 2=3.34%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
>>> lat (usec) : 100=0.01%, 250=3.67%, 500=31.43%, 750=24.55%, 1000=13.33%
>>> lat (msec) : 2=19.37%, 4=4.25%, 10=0.04%, 20=0.01%, 50=0.01%
>>> lat (msec) : 100=0.01%, 250=0.01%, 1000=0.01%, 2000=0.01%
>>
>> 76% of read IOPS completed in 1 millisecond or less, 63% in 750
>> microseconds or less, and 31% in 500 microseconds or less. This is
>> nearly perfect for 7 of these SSDs.
>
> Inadvertently, I have ended up with 5 x SSDSC2CW480A3 + 2 x SSDSC2BW480A4 in each server. I noticed significantly higher %util reported by iostat on the 2 SSD's compared to the other 5.
Which is interesting as the A4 is presumably newer than the A3.
> Finally on Monday I moved two of the SSDSC2CW480A3 models from the second server into the primary, (one at a time) and the two SSDSC2BW480A4 into the second server. So then I had 7 x SSDSC2CW480A3 in the primary, and the secondary had 3 of them plus 4 of the other model. iostat on the primary then showed a much more balanced load across all 7 of the SSD's in the primary (with DRBD disconnected).
> BTW, when I say much higher, the 2 SSDs would show 40% while the other 5 would show around 10%, with the two peaking at 100% while the other 5 would peak at 30%...
Swapped out two drives from a 7 drive SSD RAID5? How long did each rebuild take?
> I haven't been able to find detailed enough specs on the differences between these two models to explain that yet. In any case, the SSDSC2CW480A3 model is no longer available, so I can't order more of them anyway.
Did you check to see if newer firmware is available for these two?
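smartctl will show what they're running now, for comparison against Intel's download page (sketch; substitute the right device):

smartctl -i /dev/sdX | grep -i firmware    # prints "Firmware Version: ..."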
...
> One other explanation for the different sizes might be that the bandwidth was different, but the time was constant (because I specified the time option as well). In any case, the performance difference might easily be due to your suggestion, which was definitely another idea I was having.
Usually latencies this high with SSDs are due to GC, i.e. garbage collection kicking in for lack of TRIM. A few microseconds of the latency could be in the IO path, but you're seeing a huge number of IOs at 10ms, which just has to be occurring inside the SSDs.
> I was thinking now that I have more drives, I could go back to the old solution of leaving some un-allocated space on each drive. However to do that I would have needed to reduce the PV ensuring no allocated blocks at the "end" of the MD, then reduce the MD, and finally reduce the partition. Then I still needed to find a method to tell the SSD that the space is now unused (trim). Now I think it isn't so important any more...
That would be option Z for me.
>>> So, a maximum of 237MB/s write. Once DRBD takes that and adds it's
>>> overhead, I'm getting approx 10% of that performance (some of the time,
>>> other times I'm getting even less, but that is probably yet another issue).
>>>
>>> Now, 237MB/s is pretty poor, and when you try and share that between a
>>> dozen VM's, with some of those VM's trying to work on 2+ GB files
>>> (outlook users), then I suspect that is why there are so many issues.
>>> The question is, what can I do to improve this? Should I use RAID5 with
>>> a smaller stripe size? Should I use RAID10 or RAID1+linear? Could the
>>> issue be from LVM? LVM is using 4MB Physical Extents, from reading
>>> though, nobody seems to worry about the PE size related to performance
>>> (only LVM1 had a limit on the number of PE's... which meant a larger LV
>>> required larger PE's).
>> I suspect you'll be rethinking the above after running a proper FIO test
>> for 4KB IOPS. Try numjobs=8 and size=500m, for an 8 GB test, assuming
>> the test LV is greater than 8 GB in size.
>>
>> ...
> OK, I'll retry with numjobs=16 and size=1G which should require a 32G LV, which should be fine with my 50G LV.
Actually, with "filename=" the data footprint is apparently just the one 1 GB region, shared by all threads, however much total IO gets pushed through it. I must say I really do dislike the raw device target.
> read: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=16
> ...
> read: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=16
> write: (g=1): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=16
> ...
> write: (g=1): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=16
> 2.0.8
> Starting 32 threads
> Jobs: 2 (f=2): [_________________w_____________w] [100.0% done] [0K/157.9M /s] [0 /40.5K iops] [eta 00m:00s]
> read: (groupid=0, jobs=16): err= 0: pid=26714
> read : io=16384MB, bw=1267.4MB/s, iops=324360 , runt= 12931msec
> slat (usec): min=1 , max=141080 , avg= 7.28, stdev=141.90
> clat (usec): min=9 , max=207827 , avg=764.34, stdev=962.30
> lat (usec): min=55 , max=207831 , avg=771.84, stdev=981.10
> clat percentiles (usec):
> | 1.00th=[ 159], 5.00th=[ 215], 10.00th=[ 262], 20.00th=[ 342],
> | 30.00th=[ 426], 40.00th=[ 524], 50.00th=[ 628], 60.00th=[ 740],
> | 70.00th=[ 868], 80.00th=[ 1048], 90.00th=[ 1352], 95.00th=[ 1672],
> | 99.00th=[ 2672], 99.50th=[ 3632], 99.90th=[ 8896], 99.95th=[13632],
> | 99.99th=[36608]
> bw (KB/s) : min=40608, max=109600, per=6.29%, avg=81566.38, stdev=8098.56
> lat (usec) : 10=0.01%, 20=0.01%, 50=0.01%, 100=0.02%, 250=8.72%
> lat (usec) : 500=29.09%, 750=23.21%, 1000=16.65%
> lat (msec) : 2=19.74%, 4=2.16%, 10=0.33%, 20=0.05%, 50=0.02%
> lat (msec) : 100=0.01%, 250=0.01%
> cpu : usr=41.33%, sys=238.07%, ctx=48328280, majf=0, minf=64230
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
> issued : total=r=4194304/w=0/d=0, short=r=0/w=0/d=0
> write: (groupid=1, jobs=16): err= 0: pid=27973
> write: io=16384MB, bw=262686KB/s, iops=65671 , runt= 63868msec
> slat (usec): min=2 , max=4387.4K, avg=64.75, stdev=9203.16
> clat (usec): min=13 , max=6500.9K, avg=3692.55, stdev=47966.38
> lat (usec): min=64 , max=6500.9K, avg=3757.42, stdev=48862.99
> clat percentiles (usec):
> | 1.00th=[ 410], 5.00th=[ 564], 10.00th=[ 700], 20.00th=[ 1080],
> | 30.00th=[ 1432], 40.00th=[ 1688], 50.00th=[ 1880], 60.00th=[ 2064],
> | 70.00th=[ 2256], 80.00th=[ 2480], 90.00th=[ 2992], 95.00th=[ 3632],
> | 99.00th=[ 8640], 99.50th=[12736], 99.90th=[577536], 99.95th=[954368],
> | 99.99th=[2146304]
> bw (KB/s) : min= 97, max=56592, per=7.49%, avg=19678.60, stdev=8387.79
> lat (usec) : 20=0.01%, 100=0.01%, 250=0.08%, 500=2.74%, 750=8.96%
> lat (usec) : 1000=6.49%
> lat (msec) : 2=38.00%, 4=40.30%, 10=2.68%, 20=0.36%, 50=0.02%
> lat (msec) : 100=0.14%, 250=0.06%, 500=0.07%, 750=0.04%, 1000=0.03%
> lat (msec) : 2000=0.03%, >=2000=0.01%
> cpu : usr=10.05%, sys=40.27%, ctx=60488513, majf=0, minf=62068
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
> issued : total=r=0/w=4194304/d=0, short=r=0/w=0/d=0
>
> Run status group 0 (all jobs):
> READ: io=16384MB, aggrb=1267.4MB/s, minb=1267.4MB/s, maxb=1267.4MB/s, mint=12931msec, maxt=12931msec
>
> Run status group 1 (all jobs):
> WRITE: io=16384MB, aggrb=262685KB/s, minb=262685KB/s, maxb=262685KB/s, mint=63868msec, maxt=63868msec
>
> So, I don't think that made a lot of difference to the results.
Measured 65K random write IOPS performance is much lower than I'd expect given the advertised rates for SandForce 22xx based SSDs. However, putting this into perspective...
15K SAS drives peak at 300 random seeks/second.
65K random write IOPS = (65671/300)*2 ≈ 438 SAS 15K drives using nested RAID10 (each write lands on two mirrored drives, hence the factor of 2).
A 6ft 40U cabinet containing 18x 2U 24 drive chassis provides 432 drives, 4U for the server.
Practically speaking, it's a full rack of 15K SAS drives.
If you had this sitting next to your server cage providing your storage, would you consider it insufficient, or overkill on the scale of hunting mice with nukes?
>>> BTW, I've also split the domain controller to a win2008R2 server, and
>>> upgraded the file server to win2012R2.
>> I take it you decided this route had fewer potential pitfalls than
>> reassigning the DC share LUN to a new VM with the same Windows host
>> name, exporting/importing the shares, etc? It'll be interesting to see
>> if this resolves some/all of the problems. Have my fingers crossed for ya.
>
> It wasn't clear, but what I meant was:
> 1) Install new 2008R2 server, promote to DC, migrate roles across to it, etc
> 2) Install new 2012R2 server
> 3) export registry with share information and shutdown the old 2003 server
> 4) change name of the new server (to the same as the old server) and join the domain
> 5) attach the existing LUN to the 2012R2 server
> 6) import the registry information
Got it.
> Short answer: it seemed to have a variable result, but I think that was just the usual pattern of good days and bad days, depending on who is doing what, when, and how much the users decide to complain.
How many use a full TS desktop as their "PC"? Are standalone PC users complaining about performance as well?
>> Please don't feel I'm picking on you WRT your understanding of IO
>> performance, benching, etc. It is not my intent to belittle you. It is
>> critical that you better understand Linux block IO, proper testing,
>> correctly interpreting the results. Once you do you can realize if/when
>> and where you do actually have problems, instead of thinking you have a
>> problem where none exists.
>
> Absolutely, and I do appreciate the lessons. I apologise for needing so much "hand holding", but hopefully we are almost at the end.
>
> After some more work with linbit, they logged in, and took a look around, doing some of their own measurements, and the outcome was to add the following three options to the DRBD config file, which improved the DRBD IOPS from around 3000 to 50000.
> disk-barrier no;
> disk-flushes no;
> md-flushes no;
>
> Essentially DRBD was negating the SSD write cache by forcing every write to be flushed to media before returning, and this was drastically reducing the IOPS that could be achieved.
The plot thickens. When you previously mentioned DRBD writes a journal log it didn't click that they'd be doing barriers and flushes. But this makes perfect sense given the mirror function of DRBD.
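For anyone who finds this thread later, those three options live in the disk section of the resource definition, something like this in 8.4-style syntax (resource name hypothetical):

resource r0 {
  disk {
    disk-barrier no;
    disk-flushes no;
    md-flushes no;    # disabling flushes trades a data-loss window on power
                      # failure for IOPS; only sane with non-volatile caches
                      # or an accepted risk
  }
}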
> Running the same test against the DRBD device, in a connected state:
> read: (groupid=0, jobs=16): err= 0: pid=4498
> read : io=16384MB, bw=1238.8MB/s, iops=317125 , runt= 13226msec
> slat (usec): min=0 , max=997330 , avg=11.16, stdev=992.34
> clat (usec): min=0 , max=1015.8K, avg=769.38, stdev=7791.99
> lat (usec): min=0 , max=1018.6K, avg=781.10, stdev=7873.73
> clat percentiles (usec):
> | 1.00th=[ 0], 5.00th=[ 0], 10.00th=[ 195], 20.00th=[ 298],
> | 30.00th=[ 370], 40.00th=[ 446], 50.00th=[ 532], 60.00th=[ 620],
> | 70.00th=[ 732], 80.00th=[ 876], 90.00th=[ 1144], 95.00th=[ 1480],
> | 99.00th=[ 4896], 99.50th=[ 7200], 99.90th=[16512], 99.95th=[21888],
> | 99.99th=[53504]
> bw (KB/s) : min= 5085, max=305504, per=6.35%, avg=80531.22, stdev=29062.40
> lat (usec) : 2=7.73%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
> lat (usec) : 100=0.04%, 250=6.78%, 500=32.00%, 750=25.02%, 1000=14.15%
> lat (msec) : 2=11.28%, 4=1.64%, 10=1.10%, 20=0.20%, 50=0.05%
> lat (msec) : 100=0.01%, 250=0.01%, 1000=0.01%, 2000=0.01%
> cpu : usr=41.05%, sys=253.29%, ctx=49215916, majf=0, minf=65328
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
> issued : total=r=4194304/w=0/d=0, short=r=0/w=0/d=0
> write: (groupid=1, jobs=16): err= 0: pid=5163
> write: io=16384MB, bw=138483KB/s, iops=34620 , runt=121150msec
> slat (usec): min=1 , max=84258 , avg=20.68, stdev=303.42
> clat (usec): min=179 , max=123372 , avg=7354.94, stdev=3634.96
> lat (usec): min=187 , max=132967 , avg=7375.81, stdev=3644.96
> clat percentiles (usec):
> | 1.00th=[ 3696], 5.00th=[ 4576], 10.00th=[ 5088], 20.00th=[ 5920],
> | 30.00th=[ 6560], 40.00th=[ 7008], 50.00th=[ 7328], 60.00th=[ 7584],
> | 70.00th=[ 7840], 80.00th=[ 8160], 90.00th=[ 8640], 95.00th=[ 9280],
> | 99.00th=[13504], 99.50th=[23168], 99.90th=[67072], 99.95th=[70144],
> | 99.99th=[75264]
> bw (KB/s) : min= 5976, max=12447, per=6.26%, avg=8673.20, stdev=731.62
> lat (usec) : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
> lat (msec) : 2=0.09%, 4=1.76%, 10=94.97%, 20=2.61%, 50=0.29%
> lat (msec) : 100=0.26%, 250=0.01%
> cpu : usr=8.99%, sys=33.90%, ctx=71679376, majf=0, minf=69677
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
> issued : total=r=0/w=4194304/d=0, short=r=0/w=0/d=0
>
> Run status group 0 (all jobs):
> READ: io=16384MB, aggrb=1238.8MB/s, minb=1238.8MB/s, maxb=1238.8MB/s, mint=13226msec, maxt=13226msec
>
> Run status group 1 (all jobs):
> WRITE: io=16384MB, aggrb=138483KB/s, minb=138483KB/s, maxb=138483KB/s, mint=121150msec, maxt=121150msec
>
> Disk stats (read/write):
> drbd17: ios=4194477/4188834, merge=0/0, ticks=2645376/30507320, in_queue=33171672, util=99.81%
>
>
> Here is the summary of the first fio above:
> read : io=16384MB, bw=1267.4MB/s, iops=324360 , runt= 12931msec
> write: io=16384MB, bw=262686KB/s, iops=65671 , runt= 63868msec
> READ: io=16384MB, aggrb=1267.4MB/s, minb=1267.4MB/s, maxb=1267.4MB/s, mint=12931msec, maxt=12931msec
> WRITE: io=16384MB, aggrb=262685KB/s, minb=262685KB/s, maxb=262685KB/s, mint=63868msec, maxt=63868msec
Given that even with the new settings DRBD cuts your random IOPS in half, it would make a lot of sense to move the journal off the array and onto the system SSD, since as you stated it is idle all the time. XFS allows one to put the journal on a separate device for precisely this reason. Does DRBD? If not, request this feature be added. There's no technical requirement that ties the journal to the device being mirrored.
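If memory serves, DRBD's meta-disk keyword may already cover this: the metadata (activity log plus bitmap) can be placed on a separate device. An 8.4-style sketch, with hypothetical host and device names:

resource r0 {
  on hostA {
    device    /dev/drbd17;
    disk      /dev/vg0/data;
    meta-disk /dev/sdX1;    # AL + bitmap on the otherwise idle system SSD
    address   192.168.0.1:7789;
  }
}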
> So, do you still think there is an issue (from looking the the first fio results above) with getting "only" 65k IOPS write?
Yes. But I think the bulk of the issue is your benchmark configuration, mainly the tiny sliver of the array you keep hammering with test write IOs.
> One potential clue I did find was hidden in the Intel specs:
> Firstly Intel markets it here:
> http://www.intel.com/content/www/us/en/solid-state-drives/solid-state-drives-520-series.html
> 480GB, 2.5-inch 9.5mm SATA:
> SATA 6Gb/s: 550 MB/s read / 520 MB/s write
> SATA 3Gb/s: 280 MB/s read / 260 MB/s write
> Random 4KB IOPS: 50,000 read / 50,000 write
That link doesn't work for me, but the ARK always has the info:
http://ark.intel.com/products/66251/Intel-SSD-520-Series-480GB-2_5in-SATA-6Gbs-25nm-MLC
Their 4KB random write IOPS tests are performed on an "out of box" SSD, meaning all fresh cells. They only write 8 GB but randomly across the entire LBA range, all 480 GB. This prevents wear leveling from kicking in. The yield is 42K write IOPS.
You're achieving 1/4th of that IOPS rate with non-trimmed, heavily used drives and testing greater than 8 GB. Recall I suggested you test with only 8 GB or less? It should actually be much smaller given the LV device size. Realistically, you should be testing over the entire capacity of all the drives, but that's not possible. Hammering this small LV causes the wear leveling routine to attack like a rabid dog, remapping erase blocks on the fly and dragging down interface performance dramatically due to all of the internal data shuffling going on. This is the cause of the large number of IOs requiring 4, 10, and 20ms to complete.
LVM supports TRIM for some destructive operations:
https://wiki.debian.org/SSDOptimization#A.2Fetc.2Flvm.2Flvm.conf_example
You could enable TRIM support and lvremove the 50 GB device. This should trim the ~7 GB on each drive, if your kernel version's md RAID5 module supports TRIM pass-through--I haven't kept up. Then create a new LV device, which should be fully trimmed and fresh. Ask others on the list about the status of RAID5 TRIM pass-through in your kernel version.
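Concretely, something like this (sketch, assuming the vg0/testing names from your job file):

# /etc/lvm/lvm.conf
devices {
    issue_discards = 1    # discards are issued only on destructive
                          # operations (lvremove, lvreduce), never on
                          # normal writes
}

lvremove /dev/vg0/testing
lvcreate -L 50G -n testing vg0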
> However, here: http://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/ssd-530-sata-specification.pdf
This is the 530 series, although it is very similar to your 520s.
> Table 5 shows the Incompressible Performance:
> 480GB: Random 4K Read 37,500 IOPS / Random 4K Write 13,000 IOPS
zero_buffers
in your job file causes all write buffers to be zeros. This should allow maximum compression by the SF-22xx controllers on the SSDs.
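If you want to approximate Intel's incompressible numbers instead, fio can regenerate random buffer contents rather than zeros; assuming your 2.0.8 build supports the option, that's (sketch):

# replace zero_buffers in [global] with:
refill_buffers    # new pseudo-random data for every IO, defeating the
                  # SandForce compression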
> So, now we might be better placed to calculate the "expected" results? 13000 * 6 = 78000, we are getting 65000, which is not very far away.
Unless your fio is broken, bugged, and not zeroing buffers, I can't see compression being a factor in the low benching throughput. Everything seems to point to garbage collection, i.e. wear leveling. Note that you're achieving ~45K read IOPS per drive with worn no-TRIM drives, huge data sets compared to Intel's tests, and on a tiny sliver of each SSD. Intel says 50K on pristine drives.
In almost all cases where SSD write performance is much lower than spec, decreases over time, etc, it is due to lack of TRIM and massive wear leveling kicking in as a result.
> So, for yesterday and today, with the barriers/flushes disabled, things seem to be working well,
Good to hear.
> I haven't had any user complaints, and that makes me happy :)
Also good to hear. But even with 'only' 35K IOPS available w/DRBD running, that's equivalent to 232 SAS 15K drives in RAID 10, which should be a tad bit more than sufficient. So I'm guessing this may be the normal case of the benchmarks not accurately reflecting reality, your actual workload.
> However, if you still think I should be able to get 200000 IOPS or higher on write, then I'll definitely be interested in investigating further.
You can surely get close to it with future fio testing, but the results may not be very informative, as we already know the bulk of the performance hit is the result of no TRIM and garbage collection. A larger stripe_cache_size may help a little, but the lvremove with TRIM should help far more if that 50 GB is the only slice available for testing. Hitting Intel's published write numbers may require secure erasing the drives to make them factory fresh, and that's not an option on a production machine.
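For completeness, the factory-fresh route is an ATA secure erase, strictly for a drive pulled out of the array (sketch; this destroys all data, and the drive must not be in the 'frozen' state):

hdparm -I /dev/sdX | grep -i frozen                      # must say "not frozen"
hdparm --user-master u --security-set-pass pw /dev/sdX   # set a temporary password
hdparm --user-master u --security-erase pw /dev/sdX      # erase returns all cells to fresh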
Cheers,
Stan