linux-raid.vger.kernel.org archive mirror
* ARC-1120 and MD very sloooow
@ 2013-11-22 11:13 Jimmy Thrasibule
  2013-11-22 11:17 ` Mikael Abrahamsson
  2013-11-22 20:17 ` Stan Hoeppner
  0 siblings, 2 replies; 28+ messages in thread
From: Jimmy Thrasibule @ 2013-11-22 11:13 UTC (permalink / raw)
  To: linux-raid

Hi,

I've got a bunch of servers with an ARC-1120 8-Port PCI-X to SATA RAID
Controller.


        $ lspci -d 17d3:1120 -v
        02:0e.0 RAID bus controller: Areca Technology Corp. ARC-1120 8-Port PCI-X to SATA RAID Controller
        	Subsystem: Areca Technology Corp. ARC-1120 8-Port PCI-X to SATA RAID Controller
        	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr+ Stepping+ SERR- FastB2B- DisINTx-
        	Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort+ >SERR- <PERR- INTx-
        	Latency: 32 (32000ns min), Cache Line Size: 32 bytes
        	Interrupt: pin A routed to IRQ 16
        	Region 0: Memory at fceff000 (32-bit, non-prefetchable) [size=4K]
        	Region 2: Memory at fdc00000 (32-bit, prefetchable) [size=4M]
        	[virtual] Expansion ROM at fce00000 [disabled] [size=64K]
        	Capabilities: [c0] Power Management version 2
        		Flags: PMEClk- DSI- D1+ D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
        		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        	Capabilities: [d0] MSI: Enable- Count=1/2 Maskable- 64bit+
        		Address: 0000000000000000  Data: 0000
        	Capabilities: [e0] PCI-X non-bridge device
        		Command: DPERE+ ERO- RBC=1024 OST=8
        		Status: Dev=02:0e.0 64bit+ 133MHz+ SCD- USC- DC=bridge DMMRBC=1024 DMOST=4 DMCRS=32 RSCEM- 266MHz- 533MHz-
        	Kernel driver in use: arcmsr


They are all running Debian Wheezy (7) and kernel version 3.2.


        $ uname -srvmo
        Linux 3.2.0-4-amd64 #1 SMP Debian 3.2.51-1 x86_64 GNU/Linux


I don't want to use the hardware RAID capabilities of this SATA
controller; I prefer to bet on Linux's software RAID. So I configured
the drives in the ARC-1120 controller as just a bunch of drives (JBOD)
and then used mdadm to create some RAID arrays.

For instance:


        $ cat /proc/mdstat 
        Personalities : [raid1] [raid10] 
        md3 : active raid10 sdc1[0] sdf1[3] sde1[2] sdd1[1]
              7813770240 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
              
        md2 : active raid1 sda4[0] sdb4[1]
              67893176 blocks super 1.2 [2/2] [UU]
              
        md1 : active raid1 sda3[0] sdb3[1]
              4205556 blocks super 1.2 [2/2] [UU]
              
        md0 : active raid1 sda2[0] sdb2[1]
              509940 blocks super 1.2 [2/2] [UU]
              
        unused devices: <none>
        
        # mount
        [...]
        /dev/md3 on /srv type xfs (rw,nosuid,nodev,noexec,noatime,attr2,delaylog,inode64,sunit=2048,swidth=4096,noquota)
        
        # xfs_info /dev/md3 
        meta-data=/dev/md3               isize=256    agcount=32, agsize=30523648 blks
                 =                       sectsz=512   attr=2
        data     =                       bsize=4096   blocks=976755712, imaxpct=5
                 =                       sunit=256    swidth=512 blks
        naming   =version 2              bsize=4096   ascii-ci=0
        log      =internal               bsize=4096   blocks=476936, version=2
                 =                       sectsz=512   sunit=8 blks, lazy-count=1
        realtime =none                   extsz=4096   blocks=0, rtextents=0


The issue is that disk access is very slow and I cannot spot why. Here
is some data when I try to access the file system.


        # dd if=/dev/zero of=/srv/test.zero bs=512K count=6000
        6000+0 records in
        6000+0 records out
        3145728000 bytes (3.1 GB) copied, 82.2142 s, 38.3 MB/s
        
        # dd if=/srv/store/video/test.zero of=/dev/null
        6144000+0 records in
        6144000+0 records out
        3145728000 bytes (3.1 GB) copied, 12.0893 s, 260 MB/s
        
        First run:
        $ time ls /srv/files
        [...]
        real	9m59.609s
        user	0m0.408s
        sys	0m0.176s
        
        Second run:
        $ time ls /srv/files
        [...]
        real	0m0.257s
        user	0m0.108s
        sys	0m0.088s
        
        $ ls -l /srv/files | wc -l
        17189


I guess the controller is what's blocking here as I encounter the
issue only on servers where it is installed. I tried many settings like
enabling or disabling cache but nothing changed.

Any advice would be appreciated.

--
Jimmy





* Re: ARC-1120 and MD very sloooow
  2013-11-22 11:13 ARC-1120 and MD very sloooow Jimmy Thrasibule
@ 2013-11-22 11:17 ` Mikael Abrahamsson
  2013-11-22 20:17 ` Stan Hoeppner
  1 sibling, 0 replies; 28+ messages in thread
From: Mikael Abrahamsson @ 2013-11-22 11:17 UTC (permalink / raw)
  To: Jimmy Thrasibule; +Cc: linux-raid

On Fri, 22 Nov 2013, Jimmy Thrasibule wrote:

> Any advice would be appreciated.

"iostat -x 5" is something I use to see what's going on in more detail. 
Let it run during your writing.

If you see a lot of reading from the drives, it might be good to tune 
/sys/block/mdX/md/stripe_cache_size to something larger than the default 
256 setting.
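
A minimal sketch of checking that tunable before changing it, in Python.
The helper name read_stripe_cache_size and the sysfs_root parameter are
illustrative and not part of any md tool; the function simply returns
None when a given array does not expose the file.

```python
from pathlib import Path


def read_stripe_cache_size(md_name, sysfs_root="/sys/block"):
    """Return the current stripe_cache_size for an md array, or None.

    md_name is the array name without /dev/, e.g. "md3".
    None is returned when the array does not expose this tunable.
    """
    path = Path(sysfs_root) / md_name / "md" / "stripe_cache_size"
    try:
        return int(path.read_text().strip())
    except FileNotFoundError:
        return None
```

Calling read_stripe_cache_size("md3") on the affected host shows the
current value, if any, before a larger one is written to the same file.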

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se


* Re: ARC-1120 and MD very sloooow
  2013-11-22 11:13 ARC-1120 and MD very sloooow Jimmy Thrasibule
  2013-11-22 11:17 ` Mikael Abrahamsson
@ 2013-11-22 20:17 ` Stan Hoeppner
  2013-11-25  8:56   ` Jimmy Thrasibule
  1 sibling, 1 reply; 28+ messages in thread
From: Stan Hoeppner @ 2013-11-22 20:17 UTC (permalink / raw)
  To: Jimmy Thrasibule; +Cc: Linux RAID, xfs@oss.sgi.com

[CC'ing XFS]

On 11/22/2013 5:13 AM, Jimmy Thrasibule wrote:

Hi Jimmy,

This may not be an md problem.  It appears you've mangled your XFS
filesystem alignment.  This may be a contributing factor to the low
write throughput.

>         md3 : active raid10 sdc1[0] sdf1[3] sde1[2] sdd1[1]
>               7813770240 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
...
>         /dev/md3 on /srv type xfs (rw,nosuid,nodev,noexec,noatime,attr2,delaylog,inode64,sunit=2048,swidth=4096,noquota)

Beyond having a ridiculously unnecessary quantity of mount options, it
appears you've got your filesystem alignment messed up, still.  Your
RAID geometry is 512KB chunk, 1MB stripe width.  Your override above is
telling the filesystem that the RAID geometry is chunk size 1MB and
stripe width 2MB, so XFS is pumping double the IO size that md is
expecting.

>         # xfs_info /dev/md3 
>         meta-data=/dev/md3               isize=256    agcount=32, agsize=30523648 blks
>                  =                       sectsz=512   attr=2
>         data     =                       bsize=4096   blocks=976755712, imaxpct=5
>                  =                       sunit=256    swidth=512 blks
>         naming   =version 2              bsize=4096   ascii-ci=0
>         log      =internal               bsize=4096   blocks=476936, version=2
>                  =                       sectsz=512   sunit=8 blks, lazy-count=1

You created your filesystem with stripe unit of 128KB and stripe width
of 256KB which don't match the RAID geometry.  I assume this is the
reason for the fstab overrides.  I suggest you try overriding with
values that match the RAID geometry, which should be sunit=1024 and
swidth=2048.  This may or may not cure the low write throughput but it's
a good starting point, and should be done anyway.  You could also try
specifying zeros to force all filesystem write IOs to be 4KB, i.e. no
alignment.
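
For reference, the override values suggested above can be derived from
the md geometry. This is only a sketch: it assumes a "near" layout whose
device count is a multiple of the copy count, and that the XFS
sunit/swidth mount options are given in 512-byte units. The function
names are illustrative.

```python
def raid10_near_geometry(chunk_kib, raid_devices, near_copies):
    """Return (stripe_unit_bytes, stripe_width_bytes) for md RAID10 'near'."""
    if raid_devices % near_copies != 0:
        raise ValueError("sketch only handles devices divisible by copies")
    data_chunks = raid_devices // near_copies  # distinct chunks per stripe
    stripe_unit = chunk_kib * 1024
    return stripe_unit, stripe_unit * data_chunks


def to_mount_units(nbytes):
    """Convert bytes to the 512-byte units used by sunit=/swidth= options."""
    return nbytes // 512


# md3 from the report: 512K chunk, 4 devices, 2 near-copies
su, sw = raid10_near_geometry(512, 4, 2)
print(f"sunit={to_mount_units(su)},swidth={to_mount_units(sw)}")
```

For md3 this gives a 512KB stripe unit and a 1MB stripe width, i.e. the
sunit=1024 and swidth=2048 quoted above.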

Also, your log was created with a stripe unit alignment of 4KB, which is
128 times smaller than your chunk.  The default value is zero, which
means use 4KB IOs.  This shouldn't be a problem, but I do wonder why you
manually specified a value equal to the default.

mkfs.xfs automatically reads the stripe geometry from md and sets
sunit/swidth correctly (assuming non-nested arrays).  Why did you
specify these manually?

> The issue is that disk access is very slow and I cannot spot why. Here
> is some data when I try to access the file system.
> 
> 
>         # dd if=/dev/zero of=/srv/test.zero bs=512K count=6000
>         6000+0 records in
>         6000+0 records out
>         3145728000 bytes (3.1 GB) copied, 82.2142 s, 38.3 MB/s
>         
>         # dd if=/srv/store/video/test.zero of=/dev/null
>         6144000+0 records in
>         6144000+0 records out
>         3145728000 bytes (3.1 GB) copied, 12.0893 s, 260 MB/s

What percent of the filesystem space is currently used?

>         First run:
>         $ time ls /srv/files
>         [...]
>         real	9m59.609s
>         user	0m0.408s
>         sys	0m0.176s

This is a separate problem and has nothing to do with the hardware, md,
or XFS.  I assisted with a similar, probably identical, ls completion
time issue last week on the XFS list.  I'd guess you're storing user and
group data on a remote LDAP server and it is responding somewhat slowly.
Use 'strace -T' with ls and you'll see lots of poll calls and the time
taken by each.  17,189 files at 35ms avg latency per LDAP query yields
10m02s, if my math is correct, so 35ms is your current avg latency per
query.  Be aware that even if you get the average LDAP latency per file
down to 2ms, you're still looking at 34s for ls to complete on this
directory.  Much better than 10 minutes, but nothing close to the local
speed you're used to.
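
The arithmetic behind those figures, as a small Python sketch. It
assumes one serialized lookup per file, which is the premise of the
estimate above rather than something measured; the function names are
illustrative.

```python
def total_seconds(n_files, latency_ms):
    """Wall-clock time if every file costs one serialized lookup."""
    return n_files * latency_ms / 1000.0


def fmt(seconds):
    """Format seconds as XmYYs, rounded to the nearest second."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes}m{secs:02d}s"


def implied_latency_ms(n_files, wall_seconds):
    """Average per-file latency implied by an observed run time."""
    return wall_seconds / n_files * 1000.0


n = 17189
print(fmt(total_seconds(n, 35)))   # estimate at 35ms per lookup
print(fmt(total_seconds(n, 2)))    # estimate at 2ms per lookup
print(implied_latency_ms(n, 9 * 60 + 59.609))  # from the 9m59.609s run
```

The last line works backwards from the measured 9m59.609s and lands
just under 35ms per file, consistent with the estimate.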

>         Second run:
>         $ time ls /srv/files
>         [...]
>         real	0m0.257s
>         user	0m0.108s
>         sys	0m0.088s

Here the LDAP data has been cached.  Wait an hour, run ls again, and
it'll be slow again.

>         $ ls -l /srv/files | wc -l
>         17189

> I guess the controller is what's blocking here as I encounter the
> issue only on servers where it is installed. I tried many settings like
> enabling or disabling cache but nothing changed.

The controller is not the cause of the 10 minute ls delay.  If you see
the ls delay only on servers with this controller it is coincidence.
The cause lies elsewhere.

Areca are pretty crappy controllers generally, but I doubt they're at
fault WRT your low write throughput, though it is possible.

> Any advice would be appreciated.

I hope I've steered you in the right direction.

-- 
Stan


* Re: ARC-1120 and MD very sloooow
  2013-11-22 20:17 ` Stan Hoeppner
@ 2013-11-25  8:56   ` Jimmy Thrasibule
  2013-11-26  0:45     ` Stan Hoeppner
  0 siblings, 1 reply; 28+ messages in thread
From: Jimmy Thrasibule @ 2013-11-25  8:56 UTC (permalink / raw)
  To: stan; +Cc: Linux RAID, xfs@oss.sgi.com

Hello Stan,

> This may not be an md problem.  It appears you've mangled your XFS
> filesystem alignment.  This may be a contributing factor to the low
> write throughput.
> 
> >         md3 : active raid10 sdc1[0] sdf1[3] sde1[2] sdd1[1]
> >               7813770240 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
> ...
> >         /dev/md3 on /srv type xfs (rw,nosuid,nodev,noexec,noatime,attr2,delaylog,inode64,sunit=2048,swidth=4096,noquota)
> 
> Beyond having a ridiculously unnecessary quantity of mount options, it
> appears you've got your filesystem alignment messed up, still.  Your
> RAID geometry is 512KB chunk, 1MB stripe width.  Your override above is
> telling the filesystem that the RAID geometry is chunk size 1MB and
> stripe width 2MB, so XFS is pumping double the IO size that md is
> expecting.

The nosuid, nodev, noexec, noatime and inode64 options are mine, the
others are added by the system.


> >         # xfs_info /dev/md3 
> >         meta-data=/dev/md3               isize=256    agcount=32, agsize=30523648 blks
> >                  =                       sectsz=512   attr=2
> >         data     =                       bsize=4096   blocks=976755712, imaxpct=5
> >                  =                       sunit=256    swidth=512 blks
> >         naming   =version 2              bsize=4096   ascii-ci=0
> >         log      =internal               bsize=4096   blocks=476936, version=2
> >                  =                       sectsz=512   sunit=8 blks, lazy-count=1
> 
> You created your filesystem with stripe unit of 128KB and stripe width
> of 256KB which don't match the RAID geometry.  I assume this is the
> reason for the fstab overrides.  I suggest you try overriding with
> values that match the RAID geometry, which should be sunit=1024 and
> swidth=2048.  This may or may not cure the low write throughput but it's
> a good starting point, and should be done anyway.  You could also try
> specifying zeros to force all filesystem write IOs to be 4KB, i.e. no
> alignment.
> 
> Also, your log was created with a stripe unit alignment of 4KB, which is
> 128 times smaller than your chunk.  The default value is zero, which
> means use 4KB IOs.  This shouldn't be a problem, but I do wonder why you
> manually specified a value equal to the default.
> 
> mkfs.xfs automatically reads the stripe geometry from md and sets
> sunit/swidth correctly (assuming non-nested arrays).  Why did you
> specify these manually?

The advice is to trust mkfs.xfs, so that's what I did. I specified no
options; mkfs.xfs guessed everything by itself.


> > The issue is that disk access is very slow and I cannot spot why. Here
> > is some data when I try to access the file system.
> > 
> > 
> >         # dd if=/dev/zero of=/srv/test.zero bs=512K count=6000
> >         6000+0 records in
> >         6000+0 records out
> >         3145728000 bytes (3.1 GB) copied, 82.2142 s, 38.3 MB/s
> >         
> >         # dd if=/srv/store/video/test.zero of=/dev/null
> >         6144000+0 records in
> >         6144000+0 records out
> >         3145728000 bytes (3.1 GB) copied, 12.0893 s, 260 MB/s
> 
> What percent of the filesystem space is currently used?

Very small, 3GB / 6TB, something like 0.05%.

> >         First run:
> >         $ time ls /srv/files
> >         [...]
> >         real	9m59.609s
> >         user	0m0.408s
> >         sys	0m0.176s
> 
> This is a separate problem and has nothing to do with the hardware, md,
> or XFS.  I assisted with a similar, probably identical, ls completion
> time issue last week on the XFS list.  I'd guess you're storing user and
> group data on a remote LDAP server and it is responding somewhat slowly.
>  Use 'strace -T' with ls and you'll see lots of poll calls and the time
> taken by each.  17,189 files at 35ms avg latency per LDAP query yields
> 10m02s, if my math is correct, so 35ms is your current avg latency per
> query.  Be aware that even if you get the average LDAP latency per file
> down to 2ms, you're still looking at 34s for ls to complete on this
> directory.  Much better than 10 minutes, but nothing close to the local
> speed you're used to.
> 
> >         Second run:
> >         $ time ls /srv/files
> >         [...]
> >         real	0m0.257s
> >         user	0m0.108s
> >         sys	0m0.088s
> 
> Here the LDAP data has been cached.  Wait an hour, run ls again, and
> it'll be slow again.
> 
> >         $ ls -l /srv/files | wc -l
> >         17189
> 
> > I guess the controller is what's blocking here as I encounter the
> > issue only on servers where it is installed. I tried many settings like
> > enabling or disabling cache but nothing changed.

Just using the good old `/etc/passwd` and `/etc/group` files here. There
is no special permissions configuration.


> The controller is not the cause of the 10 minute ls delay.  If you see
> the ls delay only on servers with this controller it is coincidence.
> The cause lies elsewhere.
> 
> Areca are pretty crappy controllers generally, but I doubt they're at
> fault WRT your low write throughput, though it is possible.

Well, I have issues only on those servers. Strangely enough.


I see, however, that I mixed up the outputs concerning the filesystem
details. Let me put everything in order.


Server 1
--------

# xfs_info /dev/md3
meta-data=/dev/mapper/data-video isize=256    agcount=33, agsize=50331520 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=1610612736, imaxpct=5
         =                       sunit=128    swidth=256 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

# mdadm -D /dev/md3
/dev/md3:
        Version : 1.2
  Creation Time : Thu Oct 24 14:33:59 2013
     Raid Level : raid10
     Array Size : 7813770240 (7451.79 GiB 8001.30 GB)
  Used Dev Size : 3906885120 (3725.90 GiB 4000.65 GB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

    Update Time : Fri Nov 22 12:30:20 2013
          State : clean 
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

         Layout : near=2
     Chunk Size : 512K

           Name : srv1:data  (local to host srv1)
           UUID : ea612767:5870a6f5:38e8537a:8fd03631
         Events : 22

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       1       8       49        1      active sync   /dev/sdd1
       2       8       65        2      active sync   /dev/sde1
       3       8       81        3      active sync   /dev/sdf1

# grep md3 /etc/fstab
/dev/md3        /srv        xfs        defaults,inode64        0        0


Server 2
--------

# xfs_info /dev/md0
meta-data=/dev/md0               isize=256    agcount=32, agsize=30523648 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=976755712, imaxpct=5
         =                       sunit=256    swidth=512 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=476936, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

# mdadm -D /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Thu Nov  8 11:20:57 2012
     Raid Level : raid10
     Array Size : 3907022848 (3726.03 GiB 4000.79 GB)
  Used Dev Size : 1953511424 (1863.01 GiB 2000.40 GB)
   Raid Devices : 4
  Total Devices : 5
    Persistence : Superblock is persistent

    Update Time : Mon Nov 25 08:37:33 2013
          State : active 
 Active Devices : 4
Working Devices : 5
 Failed Devices : 0
  Spare Devices : 1

         Layout : near=2
     Chunk Size : 1024K

           Name : srv2:0
           UUID : 0bb3f599:e414f7ae:0ba93fa2:7a2b4e67
         Events : 280490

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       1       8       33        1      active sync   /dev/sdc1
       2       8       49        2      active sync   /dev/sdd1
       5       8       65        3      active sync   /dev/sde1

       4       8       81        -      spare   /dev/sdf1

# grep md0 /etc/fstab
/dev/md0        /srv       noatime,nodev,nosuid,noexec,inode64        0        0


--
Jimmy




* Re: ARC-1120 and MD very sloooow
  2013-11-25  8:56   ` Jimmy Thrasibule
@ 2013-11-26  0:45     ` Stan Hoeppner
  2013-11-26  2:52       ` Dave Chinner
  0 siblings, 1 reply; 28+ messages in thread
From: Stan Hoeppner @ 2013-11-26  0:45 UTC (permalink / raw)
  To: Jimmy Thrasibule; +Cc: Linux RAID, xfs@oss.sgi.com

On 11/25/2013 2:56 AM, Jimmy Thrasibule wrote:
> Hello Stan,
> 
>> This may not be an md problem.  It appears you've mangled your XFS
>> filesystem alignment.  This may be a contributing factor to the low
>> write throughput.
>>
>>>         md3 : active raid10 sdc1[0] sdf1[3] sde1[2] sdd1[1]
>>>               7813770240 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
>> ...
>>>         /dev/md3 on /srv type xfs (rw,nosuid,nodev,noexec,noatime,attr2,delaylog,inode64,sunit=2048,swidth=4096,noquota)
>>
>> Beyond having a ridiculously unnecessary quantity of mount options, it
>> appears you've got your filesystem alignment messed up, still.  Your
>> RAID geometry is 512KB chunk, 1MB stripe width.  Your override above is
>> telling the filesystem that the RAID geometry is chunk size 1MB and
>> stripe width 2MB, so XFS is pumping double the IO size that md is
>> expecting.
> 
> The nosuid, nodev, noexec, noatime and inode64 options are mine, the
> others are added by the system.

Right.  It's unusual to see this many mount options.  FYI, the XFS
default is relatime, which is nearly identical to noatime.  Specifying
noatime won't gain you anything.  Do you really need nosuid, nodev, noexec?

>>>         # xfs_info /dev/md3 
>>>         meta-data=/dev/md3               isize=256    agcount=32, agsize=30523648 blks
>>>                  =                       sectsz=512   attr=2
>>>         data     =                       bsize=4096   blocks=976755712, imaxpct=5
>>>                  =                       sunit=256    swidth=512 blks
>>>         naming   =version 2              bsize=4096   ascii-ci=0
>>>         log      =internal               bsize=4096   blocks=476936, version=2
>>>                  =                       sectsz=512   sunit=8 blks, lazy-count=1
>>
>> You created your filesystem with stripe unit of 128KB and stripe width
>> of 256KB which don't match the RAID geometry.  I assume this is the
>> reason for the fstab overrides.  I suggest you try overriding with
>> values that match the RAID geometry, which should be sunit=1024 and
>> swidth=2048.  This may or may not cure the low write throughput but it's
>> a good starting point, and should be done anyway.  You could also try
>> specifying zeros to force all filesystem write IOs to be 4KB, i.e. no
>> alignment.
>>
>> Also, your log was created with a stripe unit alignment of 4KB, which is
>> 128 times smaller than your chunk.  The default value is zero, which
>> means use 4KB IOs.  This shouldn't be a problem, but I do wonder why you
>> manually specified a value equal to the default.
>>
>> mkfs.xfs automatically reads the stripe geometry from md and sets
>> sunit/swidth correctly (assuming non-nested arrays).  Why did you
>> specify these manually?
> 
> The advice is to trust mkfs.xfs, so that's what I did. I specified no
> options; mkfs.xfs guessed everything by itself.

So the mkfs.xfs defaults in Wheezy did this.  Maybe I'm missing
something WRT the md/RAID10 near2 layout.  I know the alternate layouts
can play tricks with the resulting stripe width but I'm not sure if
that's the case here.  The log sunit of 8 blocks may be due to your
chunk being 512KB, which IIRC is greater than the XFS allowed maximum
for the log.  Hence it may have been dropped to 4KB for this reason.

>>> The issue is that disk access is very slow and I cannot spot why. Here
>>> is some data when I try to access the file system.
>>>
>>>
>>>         # dd if=/dev/zero of=/srv/test.zero bs=512K count=6000
>>>         6000+0 records in
>>>         6000+0 records out
>>>         3145728000 bytes (3.1 GB) copied, 82.2142 s, 38.3 MB/s
>>>         
>>>         # dd if=/srv/store/video/test.zero of=/dev/null
>>>         6144000+0 records in
>>>         6144000+0 records out
>>>         3145728000 bytes (3.1 GB) copied, 12.0893 s, 260 MB/s
>>
>> What percent of the filesystem space is currently used?
> 
> Very small, 3GB / 6TB, something like 0.05%.

So the low write speed shouldn't be related to free space fragmentation.

>>>         First run:
>>>         $ time ls /srv/files
>>>         [...]
>>>         real	9m59.609s
>>>         user	0m0.408s
>>>         sys	0m0.176s
>>
>> This is a separate problem and has nothing to do with the hardware, md,
>> or XFS.  I assisted with a similar, probably identical, ls completion
>> time issue last week on the XFS list.  I'd guess you're storing user and
>> group data on a remote LDAP server and it is responding somewhat slowly.
>>  Use 'strace -T' with ls and you'll see lots of poll calls and the time
>> taken by each.  17,189 files at 35ms avg latency per LDAP query yields
>> 10m02s, if my math is correct, so 35ms is your current avg latency per
>> query.  Be aware that even if you get the average LDAP latency per file
>> down to 2ms, you're still looking at 34s for ls to complete on this
>> directory.  Much better than 10 minutes, but nothing close to the local
>> speed you're used to.
>>
>>>         Second run:
>>>         $ time ls /srv/files
>>>         [...]
>>>         real	0m0.257s
>>>         user	0m0.108s
>>>         sys	0m0.088s
>>
>> Here the LDAP data has been cached.  Wait an hour, run ls again, and
>> it'll be slow again.
>>
>>>         $ ls -l /srv/files | wc -l
>>>         17189
>>
>>> I guess the controller is what's blocking here as I encounter the
>>> issue only on servers where it is installed. I tried many settings like
>>> enabling or disabling cache but nothing changed.
> 
> Just using the good old `/etc/passwd` and `/etc/group` files here. There
> is no special permissions configuration.

You'll need to run "strace -T ls -l" to determine what's eating all the
time.  The user and kernel code is taking less than 0.5s combined.  The
other 9m58s is spent waiting on something.  You need to identify that.
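
One way to digest that strace output is to total the per-call times
that -T appends in angle brackets. The Python sketch below assumes the
usual "name(args) = result <seconds>" line shape; the helper name and
the sample lines are made up for illustration and are not real output
from the affected server.

```python
import re
from collections import defaultdict

# Optional pid prefix, syscall name, then the trailing <seconds> from -T.
LINE_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?(\w+)\(.*<(\d+\.\d+)>\s*$")


def time_per_syscall(lines):
    """Sum the -T timings per syscall name, slowest syscall first."""
    totals = defaultdict(float)
    for line in lines:
        match = LINE_RE.match(line)
        if match:  # lines without a timing (signals, exit notes) are skipped
            totals[match.group(1)] += float(match.group(2))
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical sample lines, for illustration only.
sample = [
    'lstat("/srv/files/a", {st_mode=S_IFREG|0644, ...}) = 0 <0.034000>',
    'lstat("/srv/files/b", {st_mode=S_IFREG|0644, ...}) = 0 <0.036000>',
    'getdents(3, /* 2 entries */, 32768) = 48 <0.000100>',
    '+++ exited with 0 +++',
]
for name, seconds in time_per_syscall(sample):
    print(f"{name:10s} {seconds:.6f}s")
```

With a real trace saved via "strace -T -o ls.trace ls -l /srv/files",
passing open("ls.trace") to time_per_syscall shows which call accounts
for the waiting.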

This is interesting.  You have low linear write speed to a file with dd,
yet also horrible latency with a read operation.

Do you see any errors in dmesg relating to the Areca, or anything else?

>> The controller is not the cause of the 10 minute ls delay.  If you see
>> the ls delay only on servers with this controller it is coincidence.
>> The cause lies elsewhere.
>>
>> Areca are pretty crappy controllers generally, but I doubt they're at
>> fault WRT your low write throughput, though it is possible.
> 
> Well, I have issues only on those servers. Strangely enough.

Yes, this is a strange case thus far.

Do you also see the low write speed and slow ls on md0, or on any or all
of your md/RAID10 arrays?


> I see, however, that I mixed up the outputs concerning the filesystem
> details. Let me put everything in order.
> 
> 
> Server 1
> --------
> 
> # xfs_info /dev/md3
> meta-data=/dev/mapper/data-video isize=256    agcount=33, agsize=50331520 blks
>          =                       sectsz=512   attr=2
> data     =                       bsize=4096   blocks=1610612736, imaxpct=5
>          =                       sunit=128    swidth=256 blks
> naming   =version 2              bsize=4096   ascii-ci=0
> log      =internal               bsize=4096   blocks=521728, version=2
>          =                       sectsz=512   sunit=8 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> 
> # mdadm -D /dev/md3
> /dev/md3:
>         Version : 1.2
>   Creation Time : Thu Oct 24 14:33:59 2013
>      Raid Level : raid10
>      Array Size : 7813770240 (7451.79 GiB 8001.30 GB)
>   Used Dev Size : 3906885120 (3725.90 GiB 4000.65 GB)
>    Raid Devices : 4
>   Total Devices : 4
>     Persistence : Superblock is persistent
> 
>     Update Time : Fri Nov 22 12:30:20 2013
>           State : clean 
>  Active Devices : 4
> Working Devices : 4
>  Failed Devices : 0
>   Spare Devices : 0
> 
>          Layout : near=2
>      Chunk Size : 512K
> 
>            Name : srv1:data  (local to host srv1)
>            UUID : ea612767:5870a6f5:38e8537a:8fd03631
>          Events : 22
> 
>     Number   Major   Minor   RaidDevice State
>        0       8       33        0      active sync   /dev/sdc1
>        1       8       49        1      active sync   /dev/sdd1
>        2       8       65        2      active sync   /dev/sde1
>        3       8       81        3      active sync   /dev/sdf1
> 
> # grep md3 /etc/fstab
> /dev/md3        /srv        xfs        defaults,inode64        0        0
> 
> 
> Server 2
> --------
> 
> # xfs_info /dev/md0
> meta-data=/dev/md0               isize=256    agcount=32, agsize=30523648 blks
>          =                       sectsz=512   attr=2
> data     =                       bsize=4096   blocks=976755712, imaxpct=5
>          =                       sunit=256    swidth=512 blks
> naming   =version 2              bsize=4096   ascii-ci=0
> log      =internal               bsize=4096   blocks=476936, version=2
>          =                       sectsz=512   sunit=8 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> 
> # mdadm -D /dev/md0
> /dev/md0:
>         Version : 1.2
>   Creation Time : Thu Nov  8 11:20:57 2012
>      Raid Level : raid10
>      Array Size : 3907022848 (3726.03 GiB 4000.79 GB)
>   Used Dev Size : 1953511424 (1863.01 GiB 2000.40 GB)
>    Raid Devices : 4
>   Total Devices : 5
>     Persistence : Superblock is persistent
> 
>     Update Time : Mon Nov 25 08:37:33 2013
>           State : active 
>  Active Devices : 4
> Working Devices : 5
>  Failed Devices : 0
>   Spare Devices : 1
> 
>          Layout : near=2
>      Chunk Size : 1024K
> 
>            Name : srv2:0
>            UUID : 0bb3f599:e414f7ae:0ba93fa2:7a2b4e67
>          Events : 280490
> 
>     Number   Major   Minor   RaidDevice State
>        0       8       17        0      active sync   /dev/sdb1
>        1       8       33        1      active sync   /dev/sdc1
>        2       8       49        2      active sync   /dev/sdd1
>        5       8       65        3      active sync   /dev/sde1
> 
>        4       8       81        -      spare   /dev/sdf1
> 
> # grep md0 /etc/fstab
> /dev/md0        /srv       noatime,nodev,nosuid,noexec,inode64        0        0

-- 
Stan



* Re: ARC-1120 and MD very sloooow
  2013-11-26  0:45     ` Stan Hoeppner
@ 2013-11-26  2:52       ` Dave Chinner
  2013-11-26  3:58         ` Stan Hoeppner
  0 siblings, 1 reply; 28+ messages in thread
From: Dave Chinner @ 2013-11-26  2:52 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: Jimmy Thrasibule, Linux RAID, xfs@oss.sgi.com

On Mon, Nov 25, 2013 at 06:45:38PM -0600, Stan Hoeppner wrote:
> On 11/25/2013 2:56 AM, Jimmy Thrasibule wrote:
> > Hello Stan,
> > 
> >> This may not be an md problem.  It appears you've mangled your XFS
> >> filesystem alignment.  This may be a contributing factor to the low
> >> write throughput.
> >>
> >>>         md3 : active raid10 sdc1[0] sdf1[3] sde1[2] sdd1[1]
> >>>               7813770240 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
> >> ...
> >>>         /dev/md3 on /srv type xfs (rw,nosuid,nodev,noexec,noatime,attr2,delaylog,inode64,sunit=2048,swidth=4096,noquota)
> >>
> >> Beyond having a ridiculously unnecessary quantity of mount options, it
> >> appears you've got your filesystem alignment messed up, still.  Your
> >> RAID geometry is 512KB chunk, 1MB stripe width.  Your override above is
> >> telling the filesystem that the RAID geometry is chunk size 1MB and
> >> stripe width 2MB, so XFS is pumping double the IO size that md is
> >> expecting.
> > 
> > The nosuid, nodev, noexec, noatime and inode64 options are mine, the
> > others are added by the system.
> 
> Right.  It's unusual to see this many mount options.  FYI, the XFS
> default is relatime, which is nearly identical to noatime.  Specifying
> noatime won't gain you anything.  Do you really need nosuid, nodev, noexec?
> 
> >>>         # xfs_info /dev/md3 
> >>>         meta-data=/dev/md3               isize=256    agcount=32, agsize=30523648 blks
> >>>                  =                       sectsz=512   attr=2
> >>>         data     =                       bsize=4096   blocks=976755712, imaxpct=5
> >>>                  =                       sunit=256    swidth=512 blks
> >>>         naming   =version 2              bsize=4096   ascii-ci=0
> >>>         log      =internal               bsize=4096   blocks=476936, version=2
> >>>                  =                       sectsz=512   sunit=8 blks, lazy-count=1
> >>
> >> You created your filesystem with stripe unit of 128KB and stripe width
> >> of 256KB which don't match the RAID geometry.  I assume this is the

sunit/swidth is in filesystem blocks, not sectors. Hence
sunit is 1MB, swidth = 2MB. While it's not quite correct
(su=512k,sw=1m), it's not actually a problem...
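
As a quick check of that arithmetic, here is a minimal Python sketch. The
helper names are made up for illustration; the numbers come from the xfs_info
and mount output quoted above, and it assumes the sunit/swidth mount options
are expressed in 512-byte units while xfs_info reports filesystem blocks:

```python
FS_BLOCK_SIZE = 4096   # bsize reported by xfs_info
SECTOR_SIZE = 512      # unit assumed for the sunit/swidth mount options


def fs_blocks_to_kib(blocks, block_size=FS_BLOCK_SIZE):
    """Convert a value reported in filesystem blocks to KiB."""
    return blocks * block_size // 1024


def sectors_to_kib(sectors):
    """Convert a value given in 512-byte units to KiB."""
    return sectors * SECTOR_SIZE // 1024


# xfs_info: sunit=256 swidth=512 blks
print(fs_blocks_to_kib(256), fs_blocks_to_kib(512))   # prints: 1024 2048
# mount options: sunit=2048,swidth=4096
print(sectors_to_kib(2048), sectors_to_kib(4096))     # prints: 1024 2048
```

Both views describe the same geometry (1 MiB stripe unit, 2 MiB stripe
width); only the reporting unit differs.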

> >> reason for the fstab overrides.  I suggest you try overriding with
> >> values that match the RAID geometry, which should be sunit=1024 and
> >> swidth=2048.  This may or may not cure the low write throughput but it's
> >> a good starting point, and should be done anyway.  You could also try
> >> specifying zeros to force all filesystem write IOs to be 4KB, i.e. no
> >> alignment.
> >>
> >> Also, your log was created with a stripe unit alignment of 4KB, which is
> >> 128 times smaller than your chunk.  The default value is zero, which
> >> means use 4KB IOs.  This shouldn't be a problem, but I do wonder why you
> >> manually specified a value equal to the default.
> >>
> >> mkfs.xfs automatically reads the stripe geometry from md and sets
> >> sunit/swidth correctly (assuming non-nested arrays).  Why did you
> >> specify these manually?
> > 
> > It is said to trust mkfs.xfs, that's what I did. No options have been
> > specified by me and mkfs.xfs guessed everything by itself.

Well, mkfs.xfs just uses what it gets from the kernel, so it
might have been told the wrong thing by MD itself.  However, you can
modify sunit/swidth by mount options, so you can't directly trust
what is reported from xfs_info to be what mkfs actually set
originally.

> So the mkfs.xfs defaults in Wheezy did this.  Maybe I'm missing
> something WRT the md/RAID10 near2 layout.  I know the alternate layouts
> can play tricks with the resulting stripe width but I'm not sure if
> that's the case here.  The log sunit of 8 blocks may be due to your
> chunk being 512KB, which IIRC is greater than the XFS allowed maximum
> for the log.  Hence it may have been dropped to 4KB for this reason.

Again, lsunit is in filesystem blocks, so it is 32k, not 4k. And
yes, the default lsunit when the sunit > 256k is 32k. So, nothing
wrong there, either.


> >>> The issue is that disk access is very slow and I cannot spot why. Here
> >>> is some data when I try to access the file system.
> >>>
> >>>
> >>>         # dd if=/dev/zero of=/srv/test.zero bs=512K count=6000
> >>>         6000+0 records in
> >>>         6000+0 records out
> >>>         3145728000 bytes (3.1 GB) copied, 82.2142 s, 38.3 MB/s
> >>>         
> >>>         # dd if=/srv/store/video/test.zero of=/dev/null
> >>>         6144000+0 records in
> >>>         6144000+0 records out
> >>>         3145728000 bytes (3.1 GB) copied, 12.0893 s, 260 MB/s
> >>
> >> What percent of the filesystem space is currently used?
> > 
> > Very small, 3GB / 6TB, something like 0.05%.

The usual: post "iostat -x -d -m 5" output captured while the test is running.
Also, you are using buffered IO, so changing it to use direct IO
will tell us exactly what the disks are doing when IO is issued.
blktrace is your friend here....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: ARC-1120 and MD very sloooow
  2013-11-26  2:52       ` Dave Chinner
@ 2013-11-26  3:58         ` Stan Hoeppner
  2013-11-26  6:14           ` Dave Chinner
  0 siblings, 1 reply; 28+ messages in thread
From: Stan Hoeppner @ 2013-11-26  3:58 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Jimmy Thrasibule, Linux RAID, xfs@oss.sgi.com

On 11/25/2013 8:52 PM, Dave Chinner wrote:
...
> sunit/swidth is in filesystem blocks, not sectors. Hence
> sunit is 1MB, swidth = 2MB. While it's not quite correct
> (su=512k,sw=1m), it's not actually a problem...

Well that's what I thought as well, and I was puzzled by the 8 blocks
value for the log sunit.  So I double checked before posting, and 'man
mkfs.xfs' told me

	sunit=value
              This is used to specify the stripe unit for a RAID device
              or a logical volume. The  value  has  to  be specified in
              512-byte block units.

So apparently the units of 'sunit' are different depending on which XFS
tool one is using.  That's a bit confusing.  And 'man xfs_info'
(xfs_growfs) doesn't tell us that sunit is given in filesystem blocks.
I'm using xfsprogs 3.1.4 so maybe these have been corrected since.

> Well, mkfs.xfs just uses what it gets from the kernel, so it
> might have been told the wrong thing by MD itself.  However, you can
> modify sunit/swidth by mount options, so you can't directly trust
> what is reported from xfs_info to be what mkfs actually set
> originally.

Got it.

> Again, lsunit is in filesystem blocks, so it is 32k, not 4k. And
> yes, the default lsunit when the sunit > 256k is 32k. So, nothing
> wrong there, either.

So where should I have looked to confirm sunit reported by xfs_info is
in fs block (4KB) multiples, not in the 512B multiples of mkfs.xfs?

> The usual: "iostat -x -d -m 5" output while the test is running.
> Also, you are using buffered IO, so changing it to use direct IO
> will tell us exactly what the disks are doing when Io is issued.
> blktrace is your friend here....

It'll be interesting to see where this troubleshooting leads.  Buffered
single stream write speed is ~6x slower than read w/RAID10.  That makes
me wonder if the controller and drive write caches have been disabled.
That could explain this.

-- 
Stan

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: ARC-1120 and MD very sloooow
  2013-11-26  3:58         ` Stan Hoeppner
@ 2013-11-26  6:14           ` Dave Chinner
  2013-11-26  8:03             ` Stan Hoeppner
                               ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Dave Chinner @ 2013-11-26  6:14 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: Jimmy Thrasibule, Linux RAID, xfs@oss.sgi.com

On Mon, Nov 25, 2013 at 09:58:21PM -0600, Stan Hoeppner wrote:
> On 11/25/2013 8:52 PM, Dave Chinner wrote:
> ...
> > sunit/swidth is in filesystem blocks, not sectors. Hence
> > sunit is 1MB, swidth = 2MB. While it's not quite correct
> > (su=512k,sw=1m), it's not actually a problem...
> 
> Well that's what I thought as well, and I was puzzled by the 8 blocks
> value for the log sunit.  So I double checked before posting, and 'man
> mkfs.xfs' told me
> 
> 	sunit=value
>               This is used to specify the stripe unit for a RAID device
>               or a logical volume. The  value  has  to  be specified in
>               512-byte block units.
> 
> So apparently the units of 'sunit' are different depending on which XFS
> tool one is using. 

No, they aren't. sunit as a mkfs input value is specified in 512-byte
units. The output is given in units of "blks", i.e. the log block
size:

$ mkfs.xfs -N -l sunit=64 /dev/vdb
....
log      =internal log           bsize=4096   blocks=12800, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1

The size of a "blk" is given by the "bsize=4096" field, so the units are,
in this case, 4k in size.  input = 64 * 512 bytes = 8 * 4096 bytes = output

Remember, you can specify su rather than sunit, and su accepts values in
sectors, filesystem blocks or bytes; the output is still in units of log
block size:

# mkfs.xfs -N -b size=4096 -l su=8b /dev/vdb
....
log      =internal log           bsize=4096   blocks=12800, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1

# mkfs.xfs -N -l su=32k /dev/vdb
....
log      =internal log           bsize=4096   blocks=12800, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1

IOW, the input units can vary, but the output units are always the
same.
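
The equivalence of the three invocations can be sketched in a few lines of
Python. This is only an illustration of the unit conversion, not how mkfs.xfs
is implemented; the function name and the unit labels are invented here:

```python
SECTOR_SIZE = 512


def log_sunit_in_blocks(value, unit, block_size=4096):
    """Normalise a log stripe unit to log blocks, as shown in the mkfs output."""
    bytes_per_unit = {"sectors": SECTOR_SIZE, "fsblocks": block_size, "bytes": 1}
    total_bytes = value * bytes_per_unit[unit]
    if total_bytes % block_size:
        raise ValueError("stripe unit is not a multiple of the block size")
    return total_bytes // block_size


print(log_sunit_in_blocks(64, "sectors"))        # -l sunit=64  -> 8
print(log_sunit_in_blocks(8, "fsblocks"))        # -l su=8b     -> 8
print(log_sunit_in_blocks(32 * 1024, "bytes"))   # -l su=32k    -> 8
```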

> That's a bit confusing.  And 'man xfs_info'
> (xfs_growfs) doesn't tell us that sunit is given in filesystem blocks.
> I'm using xfsprogs 3.1.4 so maybe these have been corrected since.

It might seem confusing at first, but it's actually quite
consistent...

> > Again, lsunit is in filesystem blocks, so it is 32k, not 4k. And
> > yes, the default lsunit when the sunit > 256k is 32k. So, nothing
> > wrong there, either.
> 
> So where should I have looked to confirm sunit reported by xfs_info is
> in fs block (4KB) multiples, not the in the 512B multiples of mkfs.xfs?

Explained above.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: ARC-1120 and MD very sloooow
  2013-11-26  6:14           ` Dave Chinner
@ 2013-11-26  8:03             ` Stan Hoeppner
  2013-11-28 15:59               ` Jimmy Thrasibule
  2013-11-27 13:48             ` md raid5 performace 6x SSD RAID5 lilofile
  2013-11-27 13:51             ` 答复:md " lilofile
  2 siblings, 1 reply; 28+ messages in thread
From: Stan Hoeppner @ 2013-11-26  8:03 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Jimmy Thrasibule, Linux RAID, xfs@oss.sgi.com

On 11/26/2013 12:14 AM, Dave Chinner wrote:
> On Mon, Nov 25, 2013 at 09:58:21PM -0600, Stan Hoeppner wrote:
>> On 11/25/2013 8:52 PM, Dave Chinner wrote:
>> ...
>>> sunit/swidth is in filesystem blocks, not sectors. Hence
>>> sunit is 1MB, swidth = 2MB. While it's not quite correct
>>> (su=512k,sw=1m), it's not actually a problem...
>>
>> Well that's what I thought as well, and I was puzzled by the 8 blocks
>> value for the log sunit.  So I double checked before posting, and 'man
>> mkfs.xfs' told me
>>
>> 	sunit=value
>>               This is used to specify the stripe unit for a RAID device
>>               or a logical volume. The  value  has  to  be specified in
>>               512-byte block units.
>>
>> So apparently the units of 'sunit' are different depending on which XFS
>> tool one is using. 
> 
> No they don't. sunit as a mkfs input value is determined by 512 byte
> units. The output is given in units of "blks" i.e. the log block
> size:

Yes.  That's pretty clear now.  And I've figured out why this is...

> $ mkfs.xfs -N -l sunit=64 /dev/vdb
> ....
> log      =internal log           bsize=4096   blocks=12800, version=2
>          =                       sectsz=512   sunit=8 blks, lazy-count=1
> 
> Which is given by the "bsize=4096" variable and so are, in this
> case, 4k in size.  input = 64 * 512 bytes = 8 * 4096 bytes = output
> 
> Remember, you can specify su rather than sunit, and they are
> specified in sectors, filesystem blocks or bytes, and the output is
> still in units of log block size:

I never used IRIX.  But I've deduced that this made sense then due to
variable filesystem block size selection during mkfs.  But in Linux the
filesystem block size is static, at 4KB, equal to page size, and from
everything I've read the page size isn't going to change any time soon.
Thus for Linux-only users, this exercise of using creation values in
512 byte blocks, or bytes, or multiples of the fs block size, can be
very confusing, when the output is always a multiple of filesystem
blocks, always a multiple of 4KB.

> # mkfs.xfs -N -b size=4096 -l su=8b /dev/vdb
                                ^^^^^
I never noticed this until now because I've never used an external log,
nor needed an internal log with different geometry than the data section.

But why do we have different input values for su in the data (bytes) and
log (blocks) sections?  I hope to learn something from your answer, as I
usually do. :)

> ....
> log      =internal log           bsize=4096   blocks=12800, version=2
>          =                       sectsz=512   sunit=8 blks, lazy-count=1
> 
> # mkfs.xfs -N -l su=32k /dev/vdb
> ....
> log      =internal log           bsize=4096   blocks=12800, version=2
>          =                       sectsz=512   sunit=8 blks, lazy-count=1
> 
> IOws, the input units can vary, but the output units are always the
> same.
> 
>> That's a bit confusing.  And 'man xfs_info'
>> (xfs_growfs) doesn't tell us that sunit is given in filesystem blocks.
>> I'm using xfsprogs 3.1.4 so maybe these have been corrected since.
> 
> It might seem confusing at first, but it's actually quite
> consistent...

At first?  Dang Dave, you've been mentoring me for something like 3+
years now. :)  I don't deal with alignment issues very often, but this
isn't my first rodeo.  I had my answer based on 4KB blocks, and went to
the docs to verify it before posting.  That's the logical thing to do.
In this case, the docs led me astray.  That shouldn't happen.

It won't happen to me again, but if it did once, after using the
software and documentation for over 4 years, it may likely happen to
someone else.  So I'm thinking a short caveat/note might be in order in
mkfs.xfs(8).  Something like

"Note: During filesystem creation, data section stripe alignment values
(sunit/swidth/su/sw) are specified in units other than filesystem
blocks.  After creation, sunit/swidth values are referenced in multiples
of filesystem blocks by the xfsprogs tools."

>>> Again, lsunit is in filesystem blocks, so it is 32k, not 4k. And
>>> yes, the default lsunit when the sunit > 256k is 32k. So, nothing
>>> wrong there, either.
>>
>> So where should I have looked to confirm sunit reported by xfs_info is
>> in fs block (4KB) multiples, not the in the 512B multiples of mkfs.xfs?
> 
> Explained above.

Thanks Dave.  Hopefully others learn from this as well.

-- 
Stan

^ permalink raw reply	[flat|nested] 28+ messages in thread

* md raid5 performace 6x SSD RAID5
  2013-11-26  6:14           ` Dave Chinner
  2013-11-26  8:03             ` Stan Hoeppner
@ 2013-11-27 13:48             ` lilofile
  2013-11-27 13:51             ` 答复:md " lilofile
  2 siblings, 0 replies; 28+ messages in thread
From: lilofile @ 2013-11-27 13:48 UTC (permalink / raw)
  To: Linux RAID

Hi all,
I created a RAID5 array from six SSDs (sTEC s840),
with stripe_cache_size set to 4096.
root@host1:/sys/block/md126/md# cat /proc/mdstat 
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md126 : active raid5 sdg[6] sdf[4] sde[3] sdd[2] sdc[1] sdb[0]
      3906404480 blocks super 1.2 level 5, 128k chunk, algorithm 2 [6/6] [UUUUUU]

Single-SSD read/write performance:

root@host1:~# dd if=/dev/sdb of=/dev/zero count=100000 bs=1M
^C76120+0 records in
76119+0 records out
79816556544 bytes (80 GB) copied, 208.278 s, 383 MB/s

root@host1:~# dd of=/dev/sdb if=/dev/zero count=100000 bs=1M
100000+0 records in
100000+0 records out
104857600000 bytes (105 GB) copied, 232.943 s, 450 MB/s

The RAID array delivers approx. 1.8 GB/s read and 1.1 GB/s write:
root@sc0:/sys/block/md126/md# dd if=/dev/zero of=/dev/md126 count=100000 bs=1M
100000+0 records in
100000+0 records out
104857600000 bytes (105 GB) copied, 94.2039 s, 1.1 GB/s


root@sc0:/sys/block/md126/md# dd of=/dev/zero if=/dev/md126 count=100000 bs=1M
100000+0 records in
100000+0 records out
104857600000 bytes (105 GB) copied, 59.5551 s, 1.8 GB/s

Why is the performance so bad, especially the write performance?

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 28+ messages in thread

* 答复:md raid5 performace 6x SSD RAID5
  2013-11-26  6:14           ` Dave Chinner
  2013-11-26  8:03             ` Stan Hoeppner
  2013-11-27 13:48             ` md raid5 performace 6x SSD RAID5 lilofile
@ 2013-11-27 13:51             ` lilofile
  2013-11-28  4:41               ` Stan Hoeppner
                                 ` (5 more replies)
  2 siblings, 6 replies; 28+ messages in thread
From: lilofile @ 2013-11-27 13:51 UTC (permalink / raw)
  To: lilofile, Linux RAID

additional: CPU: Intel(R) Xeon(R) CPU           X5650  @ 2.67GHz
                memory:32GB


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: 答复:md raid5 performace 6x SSD RAID5
  2013-11-27 13:51             ` 答复:md " lilofile
@ 2013-11-28  4:41               ` Stan Hoeppner
  2013-11-28  4:46                 ` Roman Mamedov
  2013-11-28 10:02               ` 答复:答复:md " lilofile
                                 ` (4 subsequent siblings)
  5 siblings, 1 reply; 28+ messages in thread
From: Stan Hoeppner @ 2013-11-28  4:41 UTC (permalink / raw)
  To: lilofile, Linux RAID

On 11/27/2013 7:51 AM, lilofile wrote:
> additional: CPU: Intel(R) Xeon(R) CPU           X5650  @ 2.67GHz
>                 memory:32GB
...
> when I create raid5 which use six SSD(sTEC s840),
> when the stripe_cache_size is set 4096. 
> root@host1:/sys/block/md126/md# cat /proc/mdstat 
> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
> md126 : active raid5 sdg[6] sdf[4] sde[3] sdd[2] sdc[1] sdb[0]
>       3906404480 blocks super 1.2 level 5, 128k chunk, algorithm 2 [6/6] [UUUUUU]
> 
> the single ssd read/write performance :
> 
> root@host1:~# dd if=/dev/sdb of=/dev/zero count=100000 bs=1M
> ^C76120+0 records in
> 76119+0 records out
> 79816556544 bytes (80 GB) copied, 208.278 s, 383 MB/s
> 
> root@host1:~# dd of=/dev/sdb if=/dev/zero count=100000 bs=1M
> 100000+0 records in
> 100000+0 records out
> 104857600000 bytes (105 GB) copied, 232.943 s, 450 MB/s
> 
> the raid read and write performance is  approx 1.8GB/s read and 1.1GB/s write performance
> root@sc0:/sys/block/md126/md# dd if=/dev/zero of=/dev/md126 count=100000 bs=1M
> 100000+0 records in
> 100000+0 records out
> 104857600000 bytes (105 GB) copied, 94.2039 s, 1.1 GB/s
> 
> 
> root@sc0:/sys/block/md126/md# dd of=/dev/zero if=/dev/md126 count=100000 bs=1M
> 100000+0 records in
> 100000+0 records out
> 104857600000 bytes (105 GB) copied, 59.5551 s, 1.8 GB/s
> 
> why the performance is so bad?  especially the write performace.

There are 3 things that could be, or are, limiting performance here.

1.  The RAID5 write thread peaks one CPU core as it is single threaded
2.  A 4KB stripe cache is too small for 6 SSDs, try 8KB
3.  dd issues IOs serially and will thus never saturate the hardware

#1 will eventually be addressed with a multi-thread patch to the various
RAID drivers including RAID5.  There is no workaround at this time.

To address #3 use FIO or a similar testing tool that can issue IOs in
parallel.  With SSD based storage you will never reach maximum
throughput with a serial data stream.

-- 
Stan


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: 答复:md raid5 performace 6x SSD RAID5
  2013-11-28  4:41               ` Stan Hoeppner
@ 2013-11-28  4:46                 ` Roman Mamedov
  2013-11-28  6:24                   ` Stan Hoeppner
  0 siblings, 1 reply; 28+ messages in thread
From: Roman Mamedov @ 2013-11-28  4:46 UTC (permalink / raw)
  To: stan; +Cc: lilofile, Linux RAID

On Wed, 27 Nov 2013 22:41:49 -0600
Stan Hoeppner <stan@hardwarefreak.com> wrote:

> > when the stripe_cache_size is set 4096. 
...
> 2.  A 4KB stripe cache is too small for 6 SSDs, try 8KB

The stripe cache size setting is specified not in KB, but in pages per disk,
so a value of 4096 on x86 systems means 4096*4096*6 = 96 MB of cache for the
whole array.

-- 
With respect,
Roman

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: 答复:md raid5 performace 6x SSD RAID5
  2013-11-28  4:46                 ` Roman Mamedov
@ 2013-11-28  6:24                   ` Stan Hoeppner
  0 siblings, 0 replies; 28+ messages in thread
From: Stan Hoeppner @ 2013-11-28  6:24 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: lilofile, Linux RAID

On 11/27/2013 10:46 PM, Roman Mamedov wrote:
> On Wed, 27 Nov 2013 22:41:49 -0600
> Stan Hoeppner <stan@hardwarefreak.com> wrote:
> 
>>> when the stripe_cache_size is set 4096. 
> ...
>> 2.  A 4KB stripe cache is too small for 6 SSDs, try 8KB
> 
> The stripe cache size setting is specified not in KB, but in pages per disk,
> so a value of 4096 on x86 systems means 4096*4096*6 = 96 MB of cache for the
> whole array.

Thanks Roman for correcting me on that which I know well.  Typing a
trailing "KB" so often hard wires the brain and fingers I guess.  My KBs
were intended to be Ks.

http://www.spinics.net/lists/raid/msg42370.html
On 04/03/13 23:20, Stan Hoeppner wrote:
...
> Formula:  stripe_cache_size * 4096 bytes * drive_count = RAM usage.

To expound on the importance of this, with a handful of SSDs and a value
of 8K, throughput tends to plateau, and then slowly decreases as
stripe_cache_size is increased.  The upper bound of stripe_cache_size
gains has not yet been established because the write thread peaks a core
with only a few SSDs.  Multiple write threads and a larger quantity of
SSDs, or much faster SSDs, are needed to explore whether values of
16K-32K provide a meaningful increase in throughput, and whether this is
worth the RAM consumed.  For instance, with 12 SSDs and
stripe_cache_size of 32768:

(((32768*4096)*12)/1048576)/1024 = 1.5 GB of RAM is consumed
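
The formula is easy to play with in a short Python sketch. The function name
is hypothetical, and the 4096-byte page size is an assumption that holds on
x86 but not necessarily elsewhere:

```python
PAGE_SIZE = 4096  # bytes; assumes 4 KiB pages, as on x86


def stripe_cache_mib(stripe_cache_size, drive_count, page_size=PAGE_SIZE):
    """RAM used by the md stripe cache, in MiB:
    stripe_cache_size * page size * number of drives."""
    return stripe_cache_size * page_size * drive_count / (1024 * 1024)


print(stripe_cache_mib(4096, 6))     # 6 drives, stripe_cache_size 4096   -> 96.0
print(stripe_cache_mib(32768, 12))   # 12 drives, stripe_cache_size 32768 -> 1536.0
```

The first call reproduces the 96 MB figure for the six-SSD array in this
thread, the second the 12-SSD example.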

When Shaohua Li completes his threading patch series it may be possible
to explore this more thoroughly.

-- 
Stan


^ permalink raw reply	[flat|nested] 28+ messages in thread

* 答复:答复:md raid5 performace 6x SSD RAID5
  2013-11-27 13:51             ` 答复:md " lilofile
  2013-11-28  4:41               ` Stan Hoeppner
@ 2013-11-28 10:02               ` lilofile
  2013-11-29  2:38                 ` Stan Hoeppner
  2013-11-30 14:12                 ` 答复:答复:答复:md raid5 random " lilofile
  2013-11-28 11:54               ` 答复:答复:md raid5 " lilofile
                                 ` (3 subsequent siblings)
  5 siblings, 2 replies; 28+ messages in thread
From: lilofile @ 2013-11-28 10:02 UTC (permalink / raw)
  To: stan, Linux RAID

Thank you for your advice. I have now tested the multi-thread patch; single RAID5 performance improved by 30%.

But I have another problem. When writing to a single RAID array, write performance is approx. 1.1 GB/s:

root@host0:/sys/block/md126/md# dd if=/dev/zero of=/dev/md126 count=100000 bs=1M
100000+0 records in
100000+0 records out
104857600000 bytes (105 GB) copied, 94.2039 s, 1.1 GB/s

When writing to two RAID arrays at once, write performance is approx. 0.96 + 0.84 = 1.8 GB/s, while the theoretical figure is 2.2 GB/s. Why is there a 400 MB/s performance loss?

root@host0:/sys/block/md126/md# 100000+0 records in
100000+0 records out
104857600000 bytes (105 GB) copied, 108.56 s, 966 MB/s
100000+0 records in
100000+0 records out
104857600000 bytes (105 GB) copied, 123.511 s, 849 MB/s

[1]-  Done                    dd if=/dev/zero of=/dev/md126 count=100000 bs=1M
[2]+  Done                    dd if=/dev/zero of=/dev/md127 count=100000 bs=1M
root@host0:/sys/block/md126/md# 
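
The size of the gap can be recomputed from the dd figures above with a small
Python sketch. It assumes nothing beyond the byte counts and elapsed times dd
printed, and uses decimal megabytes as dd does; the helper name is made up:

```python
BYTES_WRITTEN = 104857600000  # count=100000 bs=1M, per dd run


def mb_per_s(nbytes, seconds):
    """Throughput in decimal MB/s, the unit dd reports."""
    return nbytes / seconds / 1e6


single = mb_per_s(BYTES_WRITTEN, 94.2039)          # one array on its own
concurrent = (mb_per_s(BYTES_WRITTEN, 108.56)      # md126 and md127
              + mb_per_s(BYTES_WRITTEN, 123.511))  # written at the same time
shortfall = 2 * single - concurrent

print(f"single array : {single:7.0f} MB/s")
print(f"two arrays   : {concurrent:7.0f} MB/s combined")
print(f"shortfall    : {shortfall:7.0f} MB/s "
      f"({shortfall / (2 * single):.1%} of 2x single)")
```

This puts the loss at a little over 400 MB/s, or somewhat under a fifth of the
ideal two-array figure.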




------------------------------------------------------------------
From: Stan Hoeppner <stan@hardwarefreak.com>
Sent: Thursday, 28 November 2013, 12:41
To: lilofile <lilofile@aliyun.com>; Linux RAID <linux-raid@vger.kernel.org>
Subject: Re: 答复:md raid5 performace 6x SSD RAID5

On 11/27/2013 7:51 AM, lilofile wrote:
> additional: CPU: Intel(R) Xeon(R) CPU           X5650  @ 2.67GHz
>                 memory:32GB
...
> when I create raid5 which use six SSD(sTEC s840),
> when the stripe_cache_size is set 4096. 
> root@host1:/sys/block/md126/md# cat /proc/mdstat 
> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
> md126 : active raid5 sdg[6] sdf[4] sde[3] sdd[2] sdc[1] sdb[0]
>       3906404480 blocks super 1.2 level 5, 128k chunk, algorithm 2 [6/6] [UUUUUU]
> 
> the single ssd read/write performance :
> 
> root@host1:~# dd if=/dev/sdb of=/dev/zero count=100000 bs=1M
> ^C76120+0 records in
> 76119+0 records out
> 79816556544 bytes (80 GB) copied, 208.278 s, 383 MB/s
> 
> root@host1:~# dd of=/dev/sdb if=/dev/zero count=100000 bs=1M
> 100000+0 records in
> 100000+0 records out
> 104857600000 bytes (105 GB) copied, 232.943 s, 450 MB/s
> 
> the raid read and write performance is  approx 1.8GB/s read and 1.1GB/s write performance
> root@sc0:/sys/block/md126/md# dd if=/dev/zero of=/dev/md126 count=100000 bs=1M
> 100000+0 records in
> 100000+0 records out
> 104857600000 bytes (105 GB) copied, 94.2039 s, 1.1 GB/s
> 
> 
> root@sc0:/sys/block/md126/md# dd of=/dev/zero if=/dev/md126 count=100000 bs=1M
> 100000+0 records in
> 100000+0 records out
> 104857600000 bytes (105 GB) copied, 59.5551 s, 1.8 GB/s
> 
> why the performance is so bad?  especially the write performace.

There are 3 things that could be, or are, limiting performance here.

1.  The RAID5 write thread peaks one CPU core as it is single threaded
2.  A 4KB stripe cache is too small for 6 SSDs, try 8KB
3.  dd issues IOs serially and will thus never saturate the hardware

#1 will eventually be addressed with a multi-thread patch to the various
RAID drivers including RAID5.  There is no workaround at this time.

To address #3 use FIO or a similar testing tool that can issue IOs in
parallel.  With SSD based storage you will never reach maximum
throughput with a serial data stream.

-- 
Stan

^ permalink raw reply	[flat|nested] 28+ messages in thread

* 答复:答复:md raid5 performace 6x SSD RAID5
  2013-11-27 13:51             ` 答复:md " lilofile
  2013-11-28  4:41               ` Stan Hoeppner
  2013-11-28 10:02               ` 答复:答复:md " lilofile
@ 2013-11-28 11:54               ` lilofile
  2013-12-02  3:48               ` md " lilofile
                                 ` (2 subsequent siblings)
  5 siblings, 0 replies; 28+ messages in thread
From: lilofile @ 2013-11-28 11:54 UTC (permalink / raw)
  To: stan, Linux RAID

I have changed the stripe cache size from 4096 to 8192. The test results show performance improved by less than 5%, so the effect is not very noticeable.


------------------------------------------------------------------
From: Stan Hoeppner <stan@hardwarefreak.com>
Sent: Thursday, 28 November 2013, 12:41
To: lilofile <lilofile@aliyun.com>; Linux RAID <linux-raid@vger.kernel.org>
Subject: Re: 答复:md raid5 performace 6x SSD RAID5

On 11/27/2013 7:51 AM, lilofile wrote:
> additional: CPU: Intel(R) Xeon(R) CPU           X5650  @ 2.67GHz
>                 memory:32GB
...
> when I create raid5 which use six SSD(sTEC s840),
> when the stripe_cache_size is set 4096. 
> root@host1:/sys/block/md126/md# cat /proc/mdstat 
> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
> md126 : active raid5 sdg[6] sdf[4] sde[3] sdd[2] sdc[1] sdb[0]
>       3906404480 blocks super 1.2 level 5, 128k chunk, algorithm 2 [6/6] [UUUUUU]
> 
> the single ssd read/write performance :
> 
> root@host1:~# dd if=/dev/sdb of=/dev/zero count=100000 bs=1M
> ^C76120+0 records in
> 76119+0 records out
> 79816556544 bytes (80 GB) copied, 208.278 s, 383 MB/s
> 
> root@host1:~# dd of=/dev/sdb if=/dev/zero count=100000 bs=1M
> 100000+0 records in
> 100000+0 records out
> 104857600000 bytes (105 GB) copied, 232.943 s, 450 MB/s
> 
> the raid read and write performance is  approx 1.8GB/s read and 1.1GB/s write performance
> root@sc0:/sys/block/md126/md# dd if=/dev/zero of=/dev/md126 count=100000 bs=1M
> 100000+0 records in
> 100000+0 records out
> 104857600000 bytes (105 GB) copied, 94.2039 s, 1.1 GB/s
> 
> 
> root@sc0:/sys/block/md126/md# dd of=/dev/zero if=/dev/md126 count=100000 bs=1M
> 100000+0 records in
> 100000+0 records out
> 104857600000 bytes (105 GB) copied, 59.5551 s, 1.8 GB/s
> 
> why the performance is so bad?  especially the write performace.

There are 3 things that could be, or are, limiting performance here.

1.  The RAID5 write thread peaks one CPU core as it is single threaded
2.  A 4KB stripe cache is too small for 6 SSDs, try 8KB
3.  dd issues IOs serially and will thus never saturate the hardware

#1 will eventually be addressed with a multi-thread patch to the various
RAID drivers including RAID5.  There is no workaround at this time.

To address #3 use FIO or a similar testing tool that can issue IOs in
parallel.  With SSD based storage you will never reach maximum
throughput with a serial data stream.

-- 
Stan

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: ARC-1120 and MD very sloooow
  2013-11-26  8:03             ` Stan Hoeppner
@ 2013-11-28 15:59               ` Jimmy Thrasibule
  2013-11-28 19:59                 ` Stan Hoeppner
  0 siblings, 1 reply; 28+ messages in thread
From: Jimmy Thrasibule @ 2013-11-28 15:59 UTC (permalink / raw)
  To: stan; +Cc: Dave Chinner, Linux RAID, xfs@oss.sgi.com

> Right.  It's unusual to see this many mount options.  FYI, the XFS
> default is relatime, which is nearly identical to noatime.  Specifying
> noatime won't gain you anything.  Do you really need nosuid, nodev, noexec?

Well, better to say what I don't want on the filesystem, no?

> Do you also see the low write speed and slow ls on md0, any/all of your
> md/RAID10 arrays?

Yes, all drive operations are slow. Unfortunately, I have no drives in the
machine that are not managed by the controller to push tests further.

> The usual: "iostat -x -d -m 5" output while the test is running.
> Also, you are using buffered IO, so changing it to use direct IO
> will tell us exactly what the disks are doing when Io is issued.
> blktrace is your friend here....

I ran the following:


    # dd if=/dev/zero of=/srv/store/video/test.zero bs=512K count=6000 oflag=direct
    6000+0 records in
    6000+0 records out
    3145728000 bytes (3.1 GB) copied, 179.945 s, 17.5 MB/s

    # dd if=/srv/store/video/test.zero of=/dev/null iflag=direct
    6144000+0 records in
    6144000+0 records out
    3145728000 bytes (3.1 GB) copied, 984.317 s, 3.2 MB/s


Traces are huge for the read test so I put them on Google Drive + SHA1 sums:
https://drive.google.com/folderview?id=0BxJZG8aWsaMaVWkyQk1ELU5yX2c

Drives `sdc` to `sdf` are part of the RAID10 array. Only drives `sdc` and `sde`
are used when reading.

> That makes me wonder if the controller and drive write caches have been disabled.
> That could explain this.

Caching is enabled on the controller, but it does not report much information.


    > sys info
    The System Information
    ===========================================
    Main Processor     : 500MHz
    CPU ICache Size    : 32KB
    CPU DCache Size    : 32KB
    CPU SCache Size    : 0KB
    System Memory      : 128MB/333MHz/ECC
    Firmware Version   : V1.49 2010-12-02
    BOOT ROM Version   : V1.49 2010-12-02
    Serial Number      : Y611CAABAR200126
    Controller Name    : ARC-1120
    ===========================================


By the way, is enabling the controller cache a good idea? I would rather
disable it and let the kernel manage caching.

--
Jimmy

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: ARC-1120 and MD very sloooow
  2013-11-28 15:59               ` Jimmy Thrasibule
@ 2013-11-28 19:59                 ` Stan Hoeppner
  0 siblings, 0 replies; 28+ messages in thread
From: Stan Hoeppner @ 2013-11-28 19:59 UTC (permalink / raw)
  To: Jimmy Thrasibule; +Cc: Linux RAID, xfs@oss.sgi.com

On 11/28/2013 9:59 AM, Jimmy Thrasibule wrote:
>> Right.  It's unusual to see this many mount options.  FYI, the XFS
>> default is relatime, which is nearly identical to noatime.  Specifying
>> noatime won't gain you anything.  Do you really need nosuid, nodev, noexec?
> 
> Well better say what I don't want on the filesystem no?
> 
>> Do you also see the low write speed and slow ls on md0, any/all of your
>> md/RAID10 arrays?
> 
> Yes, all drive operations are slow.  Unfortunately, I have no drives in
> the machine that are not managed by the controller to push tests further.

Testing a single drive might provide a useful comparison.

>> The usual: "iostat -x -d -m 5" output while the test is running.
>> Also, you are using buffered IO, so changing it to use direct IO
>> will tell us exactly what the disks are doing when IO is issued.
>> blktrace is your friend here....
> 
> I ran the following:
>
>     # dd if=/dev/zero of=/srv/store/video/test.zero bs=512K count=6000 oflag=direct
>     6000+0 records in
>     6000+0 records out
>     3145728000 bytes (3.1 GB) copied, 179.945 s, 17.5 MB/s

While O_DIRECT writing will give a more accurate picture of the
throughput at the disks, single threaded O_DIRECT is usually not a good
test due to serialization.  That said, 17.5MB/s is very slow even for a
single thread.

>     # dd if=/srv/store/video/test.zero of=/dev/null iflag=direct
>     6144000+0 records in
>     6144000+0 records out
>     3145728000 bytes (3.1 GB) copied, 984.317 s, 3.2 MB/s

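This is useless.  Never use O_DIRECT on input with dd like this.  With no
bs= given, dd reads in 512-byte blocks (hence the 6144000 records), each a
separate synchronous IO, so the result will be ~20x lower than actual
drive throughput.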
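
The block size dd actually used can be recovered from its own summary
lines by dividing bytes copied by records transferred.  A small sketch
with the figures quoted above (the function name is made up for
illustration):

```shell
#!/bin/sh
# Recover the block size dd used from its own summary lines.

dd_block_size() {
    # $1 = total bytes copied, $2 = full records transferred
    echo $(( $1 / $2 ))
}

# Write test: bs=512K was given explicitly
echo "write bs: $(dd_block_size 3145728000 6000) bytes"
# Read test: no bs= on the command line
echo "read bs:  $(dd_block_size 3145728000 6144000) bytes"
```

The write run comes out at 524288 bytes (512K) per record and the read
run at 512 bytes per record, so the two tests are not comparable.  A
fairer read test would presumably pass the same bs=512K.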

> Traces are huge for the read test so I put them on Google Drive + SHA1 sums:
> https://drive.google.com/folderview?id=0BxJZG8aWsaMaVWkyQk1ELU5yX2c
> 
> Drives `sdc` to `sdf` are part of the RAID10 array. Only drives `sdc` and `sde`
> are used when reading.
> 
>> That makes me wonder if the controller and drive write caches have been disabled.
>> That could explain this.
> 
> Caching is enabled for the controller but not much information.
> 
>     > sys info
>     The System Information
>     ===========================================
>     Main Processor     : 500MHz
>     CPU ICache Size    : 32KB
>     CPU DCache Size    : 32KB
>     CPU SCache Size    : 0KB
>     System Memory      : 128MB/333MHz/ECC
>     Firmware Version   : V1.49 2010-12-02
>     BOOT ROM Version   : V1.49 2010-12-02
>     Serial Number      : Y611CAABAR200126
>     Controller Name    : ARC-1120
>     ===========================================

This doesn't tell you if the read/write cache is enabled or disabled.
This is simply the controller information summary.

> By the way is enabling the controller cache a good idea? I would disable
> it and let the kernel manage.

With any decent RAID card the cache is enabled automatically for reads.
The write cache will only be enabled automatically if a battery module
is present and the firmware test shows it is in good condition.  Some
controllers allow manually enabling the write cache without a battery.
This is usually not advised.  Since barriers are enabled in XFS by
default, you may try enabling write cache on the controller to see if
this helps performance.  It may not, depending on how the controller
handles barriers.  And of course, using md you'll want drive caches
enabled or performance will be horrible, which is why I recommended
checking to make sure they're enabled.

-- 
Stan

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Re: Re: md raid5 performace 6x SSD RAID5
  2013-11-28 10:02               ` Re: Re: md " lilofile
@ 2013-11-29  2:38                 ` Stan Hoeppner
  2013-11-29  6:23                   ` Stan Hoeppner
  2013-11-30 14:12                 ` Re: Re: Re: md raid5 random " lilofile
  1 sibling, 1 reply; 28+ messages in thread
From: Stan Hoeppner @ 2013-11-29  2:38 UTC (permalink / raw)
  To: lilofile, Linux RAID

On 11/28/2013 4:02 AM, lilofile wrote:
> Thank you for your advice. I have now tested the multi-thread patch, and single RAID5 performance improved by 30%.
> 
> But I have another problem: when writing to a single RAID array, write performance is approx. 1.1GB/s 
...
> [1]-  Done                    dd if=/dev/zero of=/dev/md126 count=100000 bs=1M
> [2]+  Done                    dd if=/dev/zero of=/dev/md127 count=100000 bs=1M

No.  This is not a parallel IO test.

...
> To address #3 use FIO or a similar testing tool that can issue IOs in
> parallel.  With SSD based storage you will never reach maximum
> throughput with a serial data stream.

This is a parallel IO test, one command line:

~# fio --filename=/dev/md126 --zero_buffers --numjobs=16 \
     --group_reporting --blocksize=64k --ioengine=libaio --iodepth=16 \
     --direct=1 --size=64g --name=read --rw=read --stonewall --name=write \
     --rw=write --stonewall

Normally this targets a filesystem with --directory, not a raw block
device.  With --filename pointing at the device, this command line
should work for a raw md device.

-- 
Stan

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Re: Re: md raid5 performace 6x SSD RAID5
  2013-11-29  2:38                 ` Stan Hoeppner
@ 2013-11-29  6:23                   ` Stan Hoeppner
  0 siblings, 0 replies; 28+ messages in thread
From: Stan Hoeppner @ 2013-11-29  6:23 UTC (permalink / raw)
  To: lilofile, Linux RAID

On 11/28/2013 8:38 PM, Stan Hoeppner wrote:
> On 11/28/2013 4:02 AM, lilofile wrote:
>> Thank you for your advice. I have now tested the multi-thread patch, and single RAID5 performance improved by 30%.
>>
>> But I have another problem: when writing to a single RAID array, write performance is approx. 1.1GB/s 
> ...
>> [1]-  Done                    dd if=/dev/zero of=/dev/md126 count=100000 bs=1M
>> [2]+  Done                    dd if=/dev/zero of=/dev/md127 count=100000 bs=1M
> 
> No.  This is not a parallel IO test.
> 
> ...
>> To address #3 use FIO or a similar testing tool that can issue IOs in
>> parallel.  With SSD based storage you will never reach maximum
>> throughput with a serial data stream.
> 
> This is a parallel IO test, one command line:
> 
> ~# fio --filename=/dev/md126 --zero_buffers --numjobs=16 \
>      --group_reporting --blocksize=64k --ioengine=libaio --iodepth=16 \
>      --direct=1 --size=64g --name=read --rw=read --stonewall --name=write \
>      --rw=write --stonewall

Correction.  The --size value is per job, not per fio run.  We use 16
jobs in parallel to maximize the hardware throughput.  So use --size=4g
for 64GB total written in the test.  If you use --size=64g as I stated
above you'll write 1TB total in the test, and it will take forever to
finish.  With --size=4g the read test should take ~30 seconds and the
write test ~40s, not including the fio initialization time.
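
The arithmetic behind the correction, plus the adjusted command line,
can be sketched as follows.  The function names are hypothetical, and
--filename (rather than --directory) is an assumption made here for
pointing fio at a raw block device.

```shell
#!/bin/sh
# Sketch: total data moved per fio job group when --size is per job.

fio_total_gib() {
    # $1 = --numjobs value, $2 = per-job --size in GiB
    echo $(( $1 * $2 ))
}

echo "--size=64g x 16 jobs = $(fio_total_gib 16 64) GiB"   # 1 TiB in total
echo "--size=4g  x 16 jobs = $(fio_total_gib 16 4) GiB"

# Hypothetical wrapper around the command line from this thread, using
# the corrected per-job size.  It is only defined here, not called:
# running it overwrites the target device.
run_streaming_test() {
    dev=$1
    fio --filename="$dev" --zero_buffers --numjobs=16 \
        --group_reporting --blocksize=64k --ioengine=libaio --iodepth=16 \
        --direct=1 --size=4g --name=read --rw=read --stonewall \
        --name=write --rw=write --stonewall
}
```

Calling run_streaming_test /dev/md126 as root would start the read pass
followed by the write pass.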

> Normally this targets a filesystem, not a raw block device.  This
> command line should work for a raw md device.


-- 
Stan


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Re: Re: md raid5 random performace 6x SSD RAID5
  2013-11-28 10:02               ` Re: Re: md " lilofile
  2013-11-29  2:38                 ` Stan Hoeppner
@ 2013-11-30 14:12                 ` lilofile
  2013-12-01 14:14                   ` Stan Hoeppner
  2013-12-01 16:33                   ` md " lilofile
  1 sibling, 2 replies; 28+ messages in thread
From: lilofile @ 2013-11-30 14:12 UTC (permalink / raw)
  To: stan, Linux RAID

Thanks.  I am now using fio to test random write performance.
Why is the random write performance so low?  With 6 SSDs, 4k random write reaches only 55097 IOPS, while a single SSD reaches about 35,000 IOPS (3.5W, i.e. 3.5 x 10,000) for 4k random writes under fio.

root@host0:/# fio -filename=/dev/md0     -iodepth=16 -thread -rw=randwrite -ioengine=libaio -bs=4k -size=30G  -numjobs=10 -runtime=1000 -group_reporting -name=mytest 
mytest: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=16
...
mytest: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=16
fio 1.59
Starting 10 threads
Jobs: 1 (f=1): [____w_____] [68.3% done] [0K/0K /s] [0 /0  iops] [eta 07m:53s]
mytest: (groupid=0, jobs=10): err= 0: pid=6099
  write: io=215230MB, bw=220392KB/s, iops=55097 , runt=1000019msec
    slat (usec): min=1 , max=337733 , avg=176.46, stdev=2623.23
    clat (usec): min=4 , max=540048 , avg=2667.83, stdev=10078.16
     lat (usec): min=40 , max=576049 , avg=2844.42, stdev=10399.30
    bw (KB/s) : min=    0, max=1100192, per=10.22%, avg=22514.48, stdev=17262.85
  cpu          : usr=6.70%, sys=16.48%, ctx=11656865, majf=46, minf=1626216
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w/d: total=0/55098999/0, short=0/0/0
     lat (usec): 10=0.01%, 50=41.01%, 100=50.01%, 250=1.23%, 500=0.42%
     lat (usec): 750=0.02%, 1000=0.01%
     lat (msec): 2=0.01%, 4=0.01%, 10=0.05%, 20=0.16%, 50=6.58%
     lat (msec): 100=0.44%, 250=0.05%, 500=0.01%, 750=0.01%

Run status group 0 (all jobs):
  WRITE: io=215230MB, aggrb=220391KB/s, minb=225681KB/s, maxb=225681KB/s, mint=1000019msec, maxt=1000019msec

Disk stats (read/write):
  md0: ios=167/49755890, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=12530125/13199536, aggrmerge=1151802/1283069, aggrticks=14762174/11503916, aggrin_queue=26230996, aggrutil=95.56%
    sdh: ios=12519812/13192529, merge=1157990/1291154, ticks=11854444/8141456, in_queue=19960416, util=90.19%
    sdi: ios=12524619/13201735, merge=1158477/1280984, ticks=12161064/8308572, in_queue=20436280, util=90.56%
    sdj: ios=12526628/13210796, merge=1155512/1274875, ticks=12074040/8250524, in_queue=20289960, util=90.63%
    sdk: ios=12534367/13213646, merge=1148527/1268088, ticks=12372792/8455368, in_queue=20791752, util=90.81%
    sdl: ios=12534777/13205894, merge=1147263/1275381, ticks=12632824/8728444, in_queue=21325724, util=90.86%
    sdm: ios=12540551/13172620, merge=1143048/1307937, ticks=27477880/27139136, in_queue=54581844, util=95.56%






^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Re: Re: Re: md raid5 random performace 6x SSD RAID5
  2013-11-30 14:12                 ` Re: Re: Re: md raid5 random " lilofile
@ 2013-12-01 14:14                   ` Stan Hoeppner
  2013-12-01 16:33                   ` md " lilofile
  1 sibling, 0 replies; 28+ messages in thread
From: Stan Hoeppner @ 2013-12-01 14:14 UTC (permalink / raw)
  To: lilofile, Linux RAID

On 11/30/2013 8:12 AM, lilofile wrote:
> thanks.  now i use fio to test random write performance

You were using dd for testing your array throughput.  dd uses single
thread sequential IO which does not fully tax your hardware and thus
does not provide realistic results.  I recommended you use FIO with many
threads which will tax your hardware.  The purpose of this was threefold:

1.  Show the difference between single and multiple thread throughput
2.  Show the peak hardware streaming throughput you might achieve
3.  Show the effects of stripe_cache_size as IO rate increases

Please show the FIO multi thread streaming results, with
stripe_cache_size of 2048, 4096, 8192 so everyone can see the
differences, and so those results are in the list archive.  This
information is useful to others in the future.  Please show these
results before we move on to discussing random IO performance.
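
One way to script the requested runs is sketched below.  The sysfs path
follows the /sys/block/md127/md directory seen elsewhere in this thread;
the function names, the log file names, and the use of --filename with
--size=4g are assumptions for illustration.

```shell
#!/bin/sh
# Sketch: repeat the streaming fio run at several stripe_cache_size values.

stripe_cache_path() {
    # $1 = md device name without /dev/, e.g. md127
    echo "/sys/block/$1/md/stripe_cache_size"
}

sweep_stripe_cache() {
    # $1 = md device name.  Needs root and overwrites the array contents.
    md=$1
    for size in 2048 4096 8192; do
        echo "$size" > "$(stripe_cache_path "$md")"
        fio --filename="/dev/$md" --zero_buffers --numjobs=16 \
            --group_reporting --blocksize=64k --ioengine=libaio --iodepth=16 \
            --direct=1 --size=4g --name=read --rw=read --stonewall \
            --name=write --rw=write --stonewall \
            > "fio-stream-$md-scs$size.log" 2>&1
    done
}

# Show which sysfs file would be written for an array named md127
stripe_cache_path md127
```

Running sweep_stripe_cache md127 as root would leave one log per setting,
ready to paste into a reply.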

Remember, getting help on a mailing list isn't strictly for your
benefit, but the benefit of everyone.  So when you are instructed to run
a test, always post the results, as they are for everyone's benefit, not
just yours.

Thanks.

> Why is the random write performance so low?  With 6 SSDs, 4k random write reaches only 55097 IOPS, while a single SSD reaches about 35,000 IOPS (3.5W, i.e. 3.5 x 10,000) for 4k random writes under fio.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* md raid5 random performace 6x SSD RAID5
  2013-11-30 14:12                 ` Re: Re: Re: md raid5 random " lilofile
  2013-12-01 14:14                   ` Stan Hoeppner
@ 2013-12-01 16:33                   ` lilofile
  2013-12-02  2:37                     ` Stan Hoeppner
  1 sibling, 1 reply; 28+ messages in thread
From: lilofile @ 2013-12-01 16:33 UTC (permalink / raw)
  To: linux-raid

Six SSDs in RAID5.  CPU: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz, memory: 32G
sTEC SSDs: single disk iops=35973
root@host0:/sys/block/md127/md# cat /proc/mdstat 
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md127 : active raid5 sdg[6] sdl[4] sdk[3] sdj[2] sdi[1] sdh[0]
      3906404480 blocks super 1.2 level 5, 128k chunk, algorithm 2 [6/6] [UUUUUU]
    
unused devices: <none>


Random write IOPS are as follows:
 stripe_cache_size==2048   iops=59617
 stripe_cache_size==4096   iops=61623
 stripe_cache_size==8192   iops=59877


Why is the random write IOPS so low, when a single disk reaches about 36,000 (3.6W, i.e. 3.6 x 10,000) write IOPS?
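
For a rough point of comparison, the sketch below applies the textbook
RAID5 small-write model, in which each sub-stripe write becomes a
read-modify-write of four member-disk IOs (read old data, read old
parity, write new data, write new parity).  That model and the function
name are assumptions brought in for illustration; nobody in this thread
has stated them.

```shell
#!/bin/sh
# Back-of-the-envelope ceiling for RAID5 small random writes.
# Assumption (not from the thread): each sub-stripe write is a
# read-modify-write costing 4 member-disk IOs.

raid5_rmw_iops() {
    # $1 = member disks, $2 = random write IOPS of one disk
    echo $(( $1 * $2 / 4 ))
}

echo "estimated ceiling: $(raid5_rmw_iops 6 35973) IOPS"
```

The estimate is 53959 IOPS, in the same range as the measured 59617 to
61623.  That suggests, but does not prove, that parity overhead rather
than stripe_cache_size is what bounds these random write numbers.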


The fio parameters and results are as follows:

Test result with stripe_cache_size==2048:
root@sc0:~# fio -filename=/dev/md/md0    -iodepth 16 -thread -rw=randwrite -ioengine=libaio -bs=4k -size=30G  -numjobs=16 -runtime=1000 -group_reporting -name=mytest
mytest: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=16
...
mytest: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=16
fio 1.59
Starting 16 threads
Jobs: 7 (f=7): [www__w____w_w__w] [47.3% done] [0K/186.6M /s] [0 /46.7K iops] [eta 18m:35s]
mytest: (groupid=0, jobs=16): err= 0: pid=5208
  write: io=232889MB, bw=238470KB/s, iops=59617 , runt=1000036msec
    slat (usec): min=1 , max=65595 , avg=264.91, stdev=3322.66
    clat (usec): min=4 , max=111435 , avg=3992.16, stdev=12317.14
     lat (usec): min=40 , max=111439 , avg=4257.19, stdev=12679.23
    bw (KB/s) : min=    0, max=350792, per=6.31%, avg=15039.33, stdev=6492.82
  cpu          : usr=1.45%, sys=31.90%, ctx=7766821, majf=136, minf=3585068
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w/d: total=0/59619701/0, short=0/0/0
     lat (usec): 10=0.01%, 50=19.28%, 100=70.12%, 250=1.14%, 500=0.01%
     lat (usec): 750=0.01%, 1000=0.01%
     lat (msec): 2=0.01%, 4=0.02%, 10=0.05%, 20=0.09%, 50=9.14%
     lat (msec): 100=0.13%, 250=0.01%

Run status group 0 (all jobs):
  WRITE: io=232889MB, aggrb=238470KB/s, minb=244193KB/s, maxb=244193KB/s, mint=1000036msec, maxt=1000036msec
root@host0:~# 



Test result with stripe_cache_size==4096:
root@host0:~# fio -filename=/dev/md/md0    -iodepth 16 -thread -rw=randwrite -ioengine=libaio -bs=4k -size=30G  -numjobs=16 -runtime=1000 -group_reporting -name=mytest
mytest: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=16
...
mytest: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=16
fio 1.59
Starting 16 threads
Jobs: 7 (f=7): [ww_ww_ww_______w] [48.3% done] [0K/224.8M /s] [0 /56.2K iops] [eta 17m:58s]
mytest: (groupid=0, jobs=16): err= 0: pid=4851
  write: io=240727MB, bw=246495KB/s, iops=61623 , runt=1000037msec
    slat (usec): min=1 , max=837996 , avg=257.06, stdev=3387.21
    clat (usec): min=4 , max=838074 , avg=3873.92, stdev=12967.09
     lat (usec): min=41 , max=838077 , avg=4131.10, stdev=13376.14
    bw (KB/s) : min=    0, max=449685, per=6.28%, avg=15490.34, stdev=5760.87
  cpu          : usr=6.16%, sys=18.83%, ctx=15818324, majf=181, minf=3591162
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w/d: total=0/61626113/0, short=0/0/0
     lat (usec): 10=0.01%, 50=20.21%, 100=70.72%, 250=0.21%, 500=0.01%
     lat (usec): 750=0.01%, 1000=0.01%
     lat (msec): 2=0.01%, 4=0.02%, 10=0.06%, 20=0.10%, 50=7.87%
     lat (msec): 100=0.75%, 250=0.03%, 500=0.01%, 750=0.01%, 1000=0.01%

Run status group 0 (all jobs):
  WRITE: io=240727MB, aggrb=246495KB/s, minb=252411KB/s, maxb=252411KB/s, mint=1000037msec, maxt=1000037msec
root@host0:~# 

Test result with stripe_cache_size==8192:
root@host0:~# fio -filename=/dev/md/md0    -iodepth 16 -thread -rw=randwrite -ioengine=libaio -bs=4k -size=30G  -numjobs=16 -runtime=1000 -group_reporting -name=mytest
mytest: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=16
...
mytest: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=16
fio 1.59
Starting 16 threads
Jobs: 6 (f=6): [__w_w__ww__w___w] [47.6% done] [0K/178.6M /s] [0 /44.7K iops] [eta 18m:24s]
mytest: (groupid=0, jobs=16): err= 0: pid=5047
  write: io=233924MB, bw=239511KB/s, iops=59877 , runt=1000114msec
    slat (usec): min=1 , max=235194 , avg=263.80, stdev=4435.78
    clat (usec): min=2 , max=391878 , avg=3974.23, stdev=16930.35
     lat (usec): min=4 , max=391885 , avg=4238.15, stdev=17467.30
    bw (KB/s) : min=    0, max=303248, per=6.34%, avg=15180.71, stdev=5877.14
  cpu          : usr=4.93%, sys=27.37%, ctx=6335719, majf=103, minf=3591206
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w/d: total=0/59884454/0, short=0/0/0
     lat (usec): 4=0.01%, 10=0.01%, 20=0.01%, 50=36.26%, 100=55.83%
     lat (usec): 250=0.78%, 500=0.01%, 750=0.01%, 1000=0.01%
     lat (msec): 2=0.01%, 4=0.02%, 10=0.05%, 20=0.09%, 50=5.38%
     lat (msec): 100=0.75%, 250=0.80%, 500=0.01%

Run status group 0 (all jobs):
  WRITE: io=233924MB, aggrb=239510KB/s, minb=245258KB/s, maxb=245258KB/s, mint=1000114msec, maxt=1000114msec
root@host0:~# 

Test result for a single SSD:
root@host0:~# fio -filename=/dev/sdb    -iodepth 16 -thread -rw=randwrite -ioengine=libaio -bs=4k -size=30G  -numjobs=16 -runtime=1000 -group_reporting -name=mytest
mytest: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=16
...
mytest: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=16
fio 1.59
Starting 16 threads
Jobs: 1 (f=1): [___w____________] [28.5% done] [0K/0K /s] [0 /0  iops] [eta 43m:08s]
mytest: (groupid=0, jobs=16): err= 0: pid=5308
  write: io=140528MB, bw=143894KB/s, iops=35973 , runt=1000046msec
    slat (usec): min=1 , max=159802 , avg=443.06, stdev=4487.35
    clat (usec): min=4 , max=159916 , avg=6665.26, stdev=16174.17
     lat (usec): min=40 , max=159922 , avg=7108.46, stdev=16611.67
    bw (KB/s) : min=    3, max=892696, per=6.26%, avg=9008.49, stdev=8706.58
  cpu          : usr=2.61%, sys=13.09%, ctx=7436836, majf=58, minf=782937
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w/d: total=0/35975210/0, short=0/0/0
     lat (usec): 10=0.01%, 50=16.00%, 100=67.45%, 250=1.81%, 500=0.05%
     lat (usec): 750=0.01%, 1000=0.01%
     lat (msec): 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=13.33%
     lat (msec): 100=1.28%, 250=0.04%

Run status group 0 (all jobs):
  WRITE: io=140528MB, aggrb=143894KB/s, minb=147347KB/s, maxb=147347KB/s, mint=1000046msec, maxt=1000046msec

Disk stats (read/write):
  sdb: ios=261/27342034, merge=0/5212609, ticks=48/143752312, in_queue=143721596, util=100.00%
root@host0:~# 





^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: md raid5 random performace 6x SSD RAID5
  2013-12-01 16:33                   ` md " lilofile
@ 2013-12-02  2:37                     ` Stan Hoeppner
  0 siblings, 0 replies; 28+ messages in thread
From: Stan Hoeppner @ 2013-12-02  2:37 UTC (permalink / raw)
  To: lilofile, linux-raid

Again, please post the result output from the streaming read/write fio
runs, not random.  After I see those we can discuss your random performance.


On 12/1/2013 10:33 AM, lilofile wrote:
> six ssd disk ,raid5 cpu:Intel(R) Xeon(R) CPU     X5650  @ 2.67GHz memory:32G
> sTEC SSD disk: single disk iops=35973
> root@host0:/sys/block/md127/md# cat /proc/mdstat 
> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
> md127 : active raid5 sdg[6] sdl[4] sdk[3] sdj[2] sdi[1] sdh[0]
>       3906404480 blocks super 1.2 level 5, 128k chunk, algorithm 2 [6/6] [UUUUUU]
>     
> unused devices: <none>
> 
> 
> ramdom write iops is as follows:
>  stripe_cache_size==2048   iops= 59617
>  stripe_cache_size==4096   iops=61623
>  stripe_cache_size==8192   iops= 59877
> 
> 
> why the random write iops is so low,while single disk write IOPS reach to 3.6W?
> 
> 
>  fio parameter is as follows:
> 
> the test result shows: stripe_cache_size==2048
> root@sc0:~# fio -filename=/dev/md/md0    -iodepth 16 -thread -rw=randwrite -ioengine=libaio -bs=4k -size=30G  -numjobs=16 -runtime=1000 -group_reporting -name=mytest
> mytest: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=16
> ...
> mytest: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=16
> fio 1.59
> Starting 16 threads
> Jobs: 7 (f=7): [www__w____w_w__w] [47.3% done] [0K/186.6M /s] [0 /46.7K iops] [eta 18m:35s]s]
> mytest: (groupid=0, jobs=16): err= 0: pid=5208
>   write: io=232889MB, bw=238470KB/s, iops=59617 , runt=1000036msec
>     slat (usec): min=1 , max=65595 , avg=264.91, stdev=3322.66
>     clat (usec): min=4 , max=111435 , avg=3992.16, stdev=12317.14
>      lat (usec): min=40 , max=111439 , avg=4257.19, stdev=12679.23
>     bw (KB/s) : min=    0, max=350792, per=6.31%, avg=15039.33, stdev=6492.82
>   cpu          : usr=1.45%, sys=31.90%, ctx=7766821, majf=136, minf=3585068
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued r/w/d: total=0/59619701/0, short=0/0/0
>      lat (usec): 10=0.01%, 50=19.28%, 100=70.12%, 250=1.14%, 500=0.01%
>      lat (usec): 750=0.01%, 1000=0.01%
>      lat (msec): 2=0.01%, 4=0.02%, 10=0.05%, 20=0.09%, 50=9.14%
>      lat (msec): 100=0.13%, 250=0.01%
> 
> Run status group 0 (all jobs):
>   WRITE: io=232889MB, aggrb=238470KB/s, minb=244193KB/s, maxb=244193KB/s, mint=1000036msec, maxt=1000036msec
> root@host0:~# 
> 
> 
> 
> the test result shows: stripe_cache_size==4096
> root@host0:~# fio -filename=/dev/md/md0    -iodepth 16 -thread -rw=randwrite -ioengine=libaio -bs=4k -size=30G  -numjobs=16 -runtime=1000 -group_reporting -name=mytest
> mytest: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=16
> ...
> mytest: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=16
> fio 1.59
> Starting 16 threads
> Jobs: 7 (f=7): [ww_ww_ww_______w] [48.3% done] [0K/224.8M /s] [0 /56.2K iops] [eta 17m:58s]s]               
> mytest: (groupid=0, jobs=16): err= 0: pid=4851
>   write: io=240727MB, bw=246495KB/s, iops=61623 , runt=1000037msec
>     slat (usec): min=1 , max=837996 , avg=257.06, stdev=3387.21
>     clat (usec): min=4 , max=838074 , avg=3873.92, stdev=12967.09
>      lat (usec): min=41 , max=838077 , avg=4131.10, stdev=13376.14
>     bw (KB/s) : min=    0, max=449685, per=6.28%, avg=15490.34, stdev=5760.87
>   cpu          : usr=6.16%, sys=18.83%, ctx=15818324, majf=181, minf=3591162
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued r/w/d: total=0/61626113/0, short=0/0/0
>      lat (usec): 10=0.01%, 50=20.21%, 100=70.72%, 250=0.21%, 500=0.01%
>      lat (usec): 750=0.01%, 1000=0.01%
>      lat (msec): 2=0.01%, 4=0.02%, 10=0.06%, 20=0.10%, 50=7.87%
>      lat (msec): 100=0.75%, 250=0.03%, 500=0.01%, 750=0.01%, 1000=0.01%
> 
> Run status group 0 (all jobs):
>   WRITE: io=240727MB, aggrb=246495KB/s, minb=252411KB/s, maxb=252411KB/s, mint=1000037msec, maxt=1000037msec
> root@host0:~# 
> 
> the test result shows: stripe_cache_size==8192
> root@host0:~# fio -filename=/dev/md/md0    -iodepth 16 -thread -rw=randwrite -ioengine=libaio -bs=4k -size=30G  -numjobs=16 -runtime=1000 -group_reporting -name=mytest
> mytest: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=16
> ...
> mytest: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=16
> fio 1.59
> Starting 16 threads
> Jobs: 6 (f=6): [__w_w__ww__w___w] [47.6% done] [0K/178.6M /s] [0 /44.7K iops] [eta 18m:24s]s]
> mytest: (groupid=0, jobs=16): err= 0: pid=5047
>   write: io=233924MB, bw=239511KB/s, iops=59877 , runt=1000114msec
>     slat (usec): min=1 , max=235194 , avg=263.80, stdev=4435.78
>     clat (usec): min=2 , max=391878 , avg=3974.23, stdev=16930.35
>      lat (usec): min=4 , max=391885 , avg=4238.15, stdev=17467.30
>     bw (KB/s) : min=    0, max=303248, per=6.34%, avg=15180.71, stdev=5877.14
>   cpu          : usr=4.93%, sys=27.37%, ctx=6335719, majf=103, minf=3591206
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued r/w/d: total=0/59884454/0, short=0/0/0
>      lat (usec): 4=0.01%, 10=0.01%, 20=0.01%, 50=36.26%, 100=55.83%
>      lat (usec): 250=0.78%, 500=0.01%, 750=0.01%, 1000=0.01%
>      lat (msec): 2=0.01%, 4=0.02%, 10=0.05%, 20=0.09%, 50=5.38%
>      lat (msec): 100=0.75%, 250=0.80%, 500=0.01%
> 
> Run status group 0 (all jobs):
>   WRITE: io=233924MB, aggrb=239510KB/s, minb=245258KB/s, maxb=245258KB/s, mint=1000114msec, maxt=1000114msec
> root@host0:~# 
> 
> Test results for a single SSD:
> root@host0:~# fio -filename=/dev/sdb    -iodepth 16 -thread -rw=randwrite -ioengine=libaio -bs=4k -size=30G  -numjobs=16 -runtime=1000 -group_reporting -name=mytest
> mytest: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=16
> ...
> mytest: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=16
> fio 1.59
> Starting 16 threads
> Jobs: 1 (f=1): [___w____________] [28.5% done] [0K/0K /s] [0 /0  iops] [eta 43m:08s]
> mytest: (groupid=0, jobs=16): err= 0: pid=5308
>   write: io=140528MB, bw=143894KB/s, iops=35973 , runt=1000046msec
>     slat (usec): min=1 , max=159802 , avg=443.06, stdev=4487.35
>     clat (usec): min=4 , max=159916 , avg=6665.26, stdev=16174.17
>      lat (usec): min=40 , max=159922 , avg=7108.46, stdev=16611.67
>     bw (KB/s) : min=    3, max=892696, per=6.26%, avg=9008.49, stdev=8706.58
>   cpu          : usr=2.61%, sys=13.09%, ctx=7436836, majf=58, minf=782937
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued r/w/d: total=0/35975210/0, short=0/0/0
>      lat (usec): 10=0.01%, 50=16.00%, 100=67.45%, 250=1.81%, 500=0.05%
>      lat (usec): 750=0.01%, 1000=0.01%
>      lat (msec): 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=13.33%
>      lat (msec): 100=1.28%, 250=0.04%
> 
> Run status group 0 (all jobs):
>   WRITE: io=140528MB, aggrb=143894KB/s, minb=147347KB/s, maxb=147347KB/s, mint=1000046msec, maxt=1000046msec
> 
> Disk stats (read/write):
>   sdb: ios=261/27342034, merge=0/5212609, ticks=48/143752312, in_queue=143721596, util=100.00%
> root@host0:~# 
> 
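
As a quick sanity check on the fio summaries quoted above: with a fixed
4 KB block size, the reported IOPS should be roughly the reported
bandwidth in KB/s divided by 4.  A minimal sketch in shell; the function
name iops_from_bw is made up for illustration, and the bandwidth figures
are copied from the three "write:" lines above.

```shell
#!/bin/sh
# iops_from_bw BW_KB_PER_S BLOCK_KB
# Hypothetical helper: integer IOPS estimate from bandwidth and block size.
iops_from_bw() {
    echo $(( $1 / $2 ))
}

# bw= values taken from the quoted 4k random write runs
echo "md0, first quoted run:       $(iops_from_bw 246495 4) iops"
echo "md0, stripe_cache_size=8192: $(iops_from_bw 239511 4) iops"
echo "single SSD (sdb):            $(iops_from_bw 143894 4) iops"
```

These work out to 61623, 59877 and 35973, which matches the iops= values
fio printed, so in these runs the array delivers roughly 1.7x the 4k
random write rate of one SSD.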

^ permalink raw reply	[flat|nested] 28+ messages in thread

* md raid5 performace 6x SSD RAID5
  2013-11-27 13:51             ` Re: md " lilofile
                                 ` (2 preceding siblings ...)
  2013-11-28 11:54               ` Re: Re: md raid5 " lilofile
@ 2013-12-02  3:48               ` lilofile
  2013-12-02  5:51                 ` Stan Hoeppner
  2014-09-23  3:34               ` raid sync speed lilofile
  2014-09-23  5:11               ` behind_writes lilofile
  5 siblings, 1 reply; 28+ messages in thread
From: lilofile @ 2013-12-02  3:48 UTC (permalink / raw)
  To: lilofile, stan, Linux RAID

#1 will eventually be addressed with a multi-thread patch to the various RAID drivers including RAID5

What is the difference between the multi-thread patch and CONFIG_MULTICORE_RAID456?

My understanding is that with CONFIG_MULTICORE_RAID456, the following stripe operations
 enum {
	STRIPE_OP_BIOFILL,
	STRIPE_OP_COMPUTE_BLK,
	STRIPE_OP_PREXOR,
	STRIPE_OP_BIODRAIN,
	STRIPE_OP_RECONSTRUCT,
	STRIPE_OP_CHECK,
};
can be scheduled to run on other CPUs,

while the multi-thread patch mainly addresses lock contention between threads. Is this understanding correct?

------------------------------------------------------------------
From: lilofile <lilofile@aliyun.com>
Sent: 2013-11-28 (Thursday) 19:54
To: stan <stan@hardwarefreak.com>; Linux RAID <linux-raid@vger.kernel.org>
Subject: Re: Re: md raid5 performace 6x SSD RAID5

I changed stripe_cache_size from 4096 to 8192. The test results show a performance improvement of less than 5%, so the effect is not very noticeable.


------------------------------------------------------------------
From: Stan Hoeppner <stan@hardwarefreak.com>
Sent: 2013-11-28 (Thursday) 12:41
To: lilofile <lilofile@aliyun.com>; Linux RAID <linux-raid@vger.kernel.org>
Subject: Re: Re: md raid5 performace 6x SSD RAID5

On 11/27/2013 7:51 AM, lilofile wrote:
> additional: CPU: Intel(R) Xeon(R) CPU           X5650  @ 2.67GHz
>                 memory:32GB
...
> I created a RAID5 array from six SSDs (sTEC s840),
> with stripe_cache_size set to 4096.
> root@host1:/sys/block/md126/md# cat /proc/mdstat 
> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
> md126 : active raid5 sdg[6] sdf[4] sde[3] sdd[2] sdc[1] sdb[0]
>       3906404480 blocks super 1.2 level 5, 128k chunk, algorithm 2 [6/6] [UUUUUU]
> 
> Single SSD read/write performance:
> 
> root@host1:~# dd if=/dev/sdb of=/dev/zero count=100000 bs=1M
> ^C76120+0 records in
> 76119+0 records out
> 79816556544 bytes (80 GB) copied, 208.278 s, 383 MB/s
> 
> root@host1:~# dd of=/dev/sdb if=/dev/zero count=100000 bs=1M
> 100000+0 records in
> 100000+0 records out
> 104857600000 bytes (105 GB) copied, 232.943 s, 450 MB/s
> 
> The RAID array delivers approximately 1.8 GB/s read and 1.1 GB/s write:
> root@sc0:/sys/block/md126/md# dd if=/dev/zero of=/dev/md126 count=100000 bs=1M
> 100000+0 records in
> 100000+0 records out
> 104857600000 bytes (105 GB) copied, 94.2039 s, 1.1 GB/s
> 
> 
> root@sc0:/sys/block/md126/md# dd of=/dev/zero if=/dev/md126 count=100000 bs=1M
> 100000+0 records in
> 100000+0 records out
> 104857600000 bytes (105 GB) copied, 59.5551 s, 1.8 GB/s
> 
> Why is the performance so bad, especially the write performance?

There are 3 things that could be, or are, limiting performance here.

1.  The RAID5 write thread peaks one CPU core as it is single threaded
2.  A stripe_cache_size of 4096 is too small for 6 SSDs, try 8192
3.  dd issues IOs serially and will thus never saturate the hardware

#1 will eventually be addressed with a multi-thread patch to the various
RAID drivers including RAID5.  There is no workaround at this time.

To address #3 use FIO or a similar testing tool that can issue IOs in
parallel.  With SSD based storage you will never reach maximum
throughput with a serial data stream.

-- 
Stan
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: md raid5 performace 6x SSD RAID5
  2013-12-02  3:48               ` md " lilofile
@ 2013-12-02  5:51                 ` Stan Hoeppner
  0 siblings, 0 replies; 28+ messages in thread
From: Stan Hoeppner @ 2013-12-02  5:51 UTC (permalink / raw)
  To: lilofile, Linux RAID, Shaohua Li

On 12/1/2013 9:48 PM, lilofile wrote:
> #1 will eventually be addressed with a multi-thread patch to the various RAID drivers including RAID5
> 
> What is the difference between the multi-thread patch and CONFIG_MULTICORE_RAID456?

I can't find the original description for that option, but I can tell
you that:

1.  It was experimental
2.  Neil Brown requested its complete removal from git in March 2013:

http://permalink.gmane.org/gmane.linux.kernel.commits.head/372527

> My understanding is that with CONFIG_MULTICORE_RAID456, the following stripe operations
>  enum {
> 	STRIPE_OP_BIOFILL,
> 	STRIPE_OP_COMPUTE_BLK,
> 	STRIPE_OP_PREXOR,
> 	STRIPE_OP_BIODRAIN,
> 	STRIPE_OP_RECONSTRUCT,
> 	STRIPE_OP_CHECK,
> };
> can be scheduled to run on other CPUs,
> 
> while the multi-thread patch mainly addresses lock contention between threads. Is this understanding correct?

Shaohua Li has been working on multi-threaded md drivers to fix the CPU
bottleneck with SSD storage for some time now.  He's currently focusing
on raid5.c.  See:
http://lwn.net/Articles/500200/
http://www.spinics.net/lists/raid/msg44699.html

AFAIK this work is not yet fully completed nor thoroughly tested, nor
included in a stable release.  Shaohua, could you give us a quick update
on the status of your RAID5 multi-thread work?  Demand for it seems to
be increasing steeply: this current thread, and another last week about
slow RAID10 on the new hybrid SSD/rust drives.

> ------------------------------------------------------------------
> From: lilofile <lilofile@aliyun.com>
> Sent: 2013-11-28 (Thursday) 19:54
> To: stan <stan@hardwarefreak.com>; Linux RAID <linux-raid@vger.kernel.org>
> Subject: Re: Re: md raid5 performace 6x SSD RAID5
> 
> I changed stripe_cache_size from 4096 to 8192. The test results show a performance improvement of less than 5%, so the effect is not very noticeable.

IIRC, this was before you started testing with FIO.  I'd really like to
see your streaming read/write results of FIO with the command line I
gave you, for each of these 3 stripe_cache_size values.  BTW, you don't
need to set a timer.  The size=30G limits the test to 30GB.  I chose
this value because the test runs should only take 15s at this size.  Go
any smaller and it makes capturing accurate data more difficult.
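
For reference when comparing stripe_cache_size values: per the kernel md
documentation, the stripe cache consumes
page_size * nr_disks * stripe_cache_size bytes.  A rough sketch, assuming
4 KiB pages and the 6-drive array from this thread; the helper name
stripe_cache_mib is hypothetical, and the two sizes shown are simply the
ones mentioned so far.

```shell
#!/bin/sh
# stripe_cache_mib ENTRIES NR_DISKS [PAGE_SIZE_BYTES]
# Hypothetical helper: memory used by the md stripe cache, in MiB.
# Assumes a 4096-byte page unless a third argument is given.
stripe_cache_mib() {
    entries=$1
    nr_disks=$2
    page_size=${3:-4096}
    echo $(( entries * nr_disks * page_size / 1024 / 1024 ))
}

for entries in 4096 8192; do
    echo "stripe_cache_size=$entries -> $(stripe_cache_mib "$entries" 6) MiB"
done
```

On a 32GB machine neither 96 MiB nor 192 MiB is a concern, so memory
should not be the deciding factor between these settings.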

The reason for running the streaming tests is that it eliminates the RMW
code path and any associated latencies you get with the random write
test.  The command line I gave you should give us an idea of the peak
streaming read/write throughput of your SSD RAID5 array with the only
limitation being single core performance.
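
To illustrate why small random writes hit the RMW path while large
sequential writes need not: a RAID5 full stripe holds (nr_disks - 1)
data chunks.  Below is a sketch of the arithmetic for the array in this
thread (6 drives, 128k chunk, per the mdstat output quoted below).  The
function names are made up for illustration, and this ignores how md
actually batches incoming bios.

```shell
#!/bin/sh
# full_stripe_kib NR_DISKS CHUNK_KIB
# Data width of one RAID5 full stripe: one chunk per drive minus one for parity.
full_stripe_kib() {
    echo $(( ($1 - 1) * $2 ))
}

# write_kind IO_KIB NR_DISKS CHUNK_KIB
# Prints "full" when the IO size is a whole multiple of the full-stripe
# data width, otherwise "partial".  Assumes the write starts stripe-aligned.
write_kind() {
    stripe=$(full_stripe_kib "$2" "$3")
    if [ $(( $1 % stripe )) -eq 0 ]; then
        echo full
    else
        echo partial
    fi
}

echo "full stripe data width: $(full_stripe_kib 6 128) KiB"
echo "4 KiB write:    $(write_kind 4 6 128)"
echo "1280 KiB write: $(write_kind 1280 6 128)"
```

A 4 KiB random write covers a small fraction of the 640 KiB stripe, so
parity has to be updated from data already on disk, whereas a streaming
workload can fill whole stripes.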

To discover how much CPU is being burned, concurrently with each FIO
test, execute the following as well once FIO initialization is complete
and the actual read/write tests begin.  This will show us what your CPU
consumption looks like and if you're hitting the single core ceiling
with the md write thread.  This will give you 20 seconds of CPU stats
polled every .5s:

~# top -b -n 40 -d 0.5 |grep Cpu|mawk '{print ($1,$3,$4) }'

This will generate a lot of output.  Piping through mawk will clean this
up making it easier to see which CPU is running the md write thread
during your write tests.  The FIO threads will execute in user space,
the md write thread in system space.  You won't see one core peaking
during read tests as any/all CPUs may be used.
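
To show what the filter keeps, here is a small sketch run against two
made-up sample lines in the style of per-CPU top batch output (the
numbers are hypothetical, not measurements).  It uses plain awk instead
of mawk for portability, and the helper name filter_cpu is mine.  The
field positions assume the older procps layout ("Cpu0  :  1.3%us, ...");
other top versions format these lines differently, so check your own
output first.

```shell
#!/bin/sh
# Hypothetical sample lines, formatted like per-CPU lines from "top -b".
sample='Cpu0  :  1.3%us, 97.4%sy,  0.0%ni,  1.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  : 45.0%us,  5.0%sy,  0.0%ni, 50.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st'

# Same filter as in the command above: keep the CPU name ($1),
# user time ($3) and system time ($4); $2 is the ":" separator.
filter_cpu() {
    grep Cpu | awk '{ print ($1, $3, $4) }'
}

printf '%s\n' "$sample" | filter_cpu
```

In this made-up sample Cpu0 would be the core to watch: nearly all of
its time is in the %sy column, which is where the md write thread shows
up.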

Which kernel version are you using?  I don't recall you saying.  With
later kernels IIRC the parity calculations are offloaded to another
thread, so you may see high load on two cores.

> ------------------------------------------------------------------
> From: Stan Hoeppner <stan@hardwarefreak.com>
> Sent: 2013-11-28 (Thursday) 12:41
> To: lilofile <lilofile@aliyun.com>; Linux RAID <linux-raid@vger.kernel.org>
> Subject: Re: Re: md raid5 performace 6x SSD RAID5
> 
> On 11/27/2013 7:51 AM, lilofile wrote:
>> additional: CPU: Intel(R) Xeon(R) CPU           X5650  @ 2.67GHz
>>                 memory:32GB
> ...
>> I created a RAID5 array from six SSDs (sTEC s840),
>> with stripe_cache_size set to 4096.
>> root@host1:/sys/block/md126/md# cat /proc/mdstat 
>> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
>> md126 : active raid5 sdg[6] sdf[4] sde[3] sdd[2] sdc[1] sdb[0]
>>       3906404480 blocks super 1.2 level 5, 128k chunk, algorithm 2 [6/6] [UUUUUU]
>>
>> Single SSD read/write performance:
>>
>> root@host1:~# dd if=/dev/sdb of=/dev/zero count=100000 bs=1M
>> ^C76120+0 records in
>> 76119+0 records out
>> 79816556544 bytes (80 GB) copied, 208.278 s, 383 MB/s
>>
>> root@host1:~# dd of=/dev/sdb if=/dev/zero count=100000 bs=1M
>> 100000+0 records in
>> 100000+0 records out
>> 104857600000 bytes (105 GB) copied, 232.943 s, 450 MB/s
>>
>> The RAID array delivers approximately 1.8 GB/s read and 1.1 GB/s write:
>> root@sc0:/sys/block/md126/md# dd if=/dev/zero of=/dev/md126 count=100000 bs=1M
>> 100000+0 records in
>> 100000+0 records out
>> 104857600000 bytes (105 GB) copied, 94.2039 s, 1.1 GB/s
>>
>>
>> root@sc0:/sys/block/md126/md# dd of=/dev/zero if=/dev/md126 count=100000 bs=1M
>> 100000+0 records in
>> 100000+0 records out
>> 104857600000 bytes (105 GB) copied, 59.5551 s, 1.8 GB/s
>>
>> Why is the performance so bad, especially the write performance?
> 
> There are 3 things that could be, or are, limiting performance here.
> 
> 1.  The RAID5 write thread peaks one CPU core as it is single threaded
> 2.  A stripe_cache_size of 4096 is too small for 6 SSDs, try 8192
> 3.  dd issues IOs serially and will thus never saturate the hardware
> 
> #1 will eventually be addressed with a multi-thread patch to the various
> RAID drivers including RAID5.  There is no workaround at this time.
> 
> To address #3 use FIO or a similar testing tool that can issue IOs in
> parallel.  With SSD based storage you will never reach maximum
> throughput with a serial data stream.
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* raid sync speed 
  2013-11-27 13:51             ` Re: md " lilofile
                                 ` (3 preceding siblings ...)
  2013-12-02  3:48               ` md " lilofile
@ 2014-09-23  3:34               ` lilofile
  2014-09-23  5:11               ` behind_writes lilofile
  5 siblings, 0 replies; 28+ messages in thread
From: lilofile @ 2014-09-23  3:34 UTC (permalink / raw)
  To: stan, Linux RAID, lilofile

When reading the RAID sync speed control code, I found it very difficult to understand,
for example the calculation of currspeed and the setting of SYNC_MARK_STEP. Any suggestions would be welcome.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* behind_writes
  2013-11-27 13:51             ` Re: md " lilofile
                                 ` (4 preceding siblings ...)
  2014-09-23  3:34               ` raid sync speed lilofile
@ 2014-09-23  5:11               ` lilofile
  5 siblings, 0 replies; 28+ messages in thread
From: lilofile @ 2014-09-23  5:11 UTC (permalink / raw)
  To: stan, Linux RAID, lilofile

In struct bitmap, what does the behind_writes variable mean?

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2014-09-23  5:11 UTC | newest]

Thread overview: 28+ messages
2013-11-22 11:13 ARC-1120 and MD very sloooow Jimmy Thrasibule
2013-11-22 11:17 ` Mikael Abrahamsson
2013-11-22 20:17 ` Stan Hoeppner
2013-11-25  8:56   ` Jimmy Thrasibule
2013-11-26  0:45     ` Stan Hoeppner
2013-11-26  2:52       ` Dave Chinner
2013-11-26  3:58         ` Stan Hoeppner
2013-11-26  6:14           ` Dave Chinner
2013-11-26  8:03             ` Stan Hoeppner
2013-11-28 15:59               ` Jimmy Thrasibule
2013-11-28 19:59                 ` Stan Hoeppner
2013-11-27 13:48             ` md raid5 performace 6x SSD RAID5 lilofile
2013-11-27 13:51             ` Re: md " lilofile
2013-11-28  4:41               ` Stan Hoeppner
2013-11-28  4:46                 ` Roman Mamedov
2013-11-28  6:24                   ` Stan Hoeppner
2013-11-28 10:02               ` Re: Re: md " lilofile
2013-11-29  2:38                 ` Stan Hoeppner
2013-11-29  6:23                   ` Stan Hoeppner
2013-11-30 14:12                 ` Re: Re: Re: md raid5 random " lilofile
2013-12-01 14:14                   ` Stan Hoeppner
2013-12-01 16:33                   ` md " lilofile
2013-12-02  2:37                     ` Stan Hoeppner
2013-11-28 11:54               ` Re: Re: md raid5 " lilofile
2013-12-02  3:48               ` md " lilofile
2013-12-02  5:51                 ` Stan Hoeppner
2014-09-23  3:34               ` raid sync speed lilofile
2014-09-23  5:11               ` behind_writes lilofile
