single cpu thread performance limit?

Linux RAID subsystem development

* single cpu thread performance limit?
@ 2011-08-11 15:58 mark delfman
  2011-08-11 16:01 ` Mathias Burén
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: mark delfman @ 2011-08-11 15:58 UTC (permalink / raw)
  To: Linux RAID Mailing List, NeilBrown

I seem to have hit a significant hard stop in MD RAID1/10 performance
which seems to be linked to a single CPU thread.

I am using extremely high speed (IOPS) internal block devices – 8 in
total.  They are capable of achieving > 1million iops.

However, if I use RAID1 / 10 then MD seems to use a single thread, which
reaches 100% CPU utilisation (single core) at around 200K IOPS,
limiting the entire performance to around 200K.

If I use, say, 4 x RAID1 / 10’s with a RAID0 on top, I do not see much
better results (although the theory says I should, and there
are now 4 CPU threads running, it still seems to hit 4 x 100% at maybe
350K).

Is there any way to increase the number of threads per RAID set? Or
any other suggestions on configurations?  (I have tried every
permutation of R0+R1/10’s)

Thank you for any advice.

Mark
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: single cpu thread performance limit?
  2011-08-11 15:58 single cpu thread performance limit? mark delfman
@ 2011-08-11 16:01 ` Mathias Burén
  2011-08-11 16:07   ` mark delfman
  2011-08-11 18:58 ` Stan Hoeppner
  2011-08-11 19:04 ` Bernd Schubert
  2 siblings, 1 reply; 14+ messages in thread
From: Mathias Burén @ 2011-08-11 16:01 UTC (permalink / raw)
  To: mark delfman; +Cc: Linux RAID Mailing List, NeilBrown

On 11 August 2011 16:58, mark delfman <markdelfman@googlemail.com> wrote:
> Is there any way to increase the number of threads per RAID set? Or
> any other suggestions on configurations?  (I have tried every
> permutation of R0+R1/10’s)

Maybe create separate MD RAID1 devices, then a new MD device with
RAID0? (instead of using mdadm RAID"10")

/M
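
For anyone wanting to try the layout suggested here, a minimal sketch follows. The device names (/dev/sdb to /dev/sde) and the md numbering are placeholders, not taken from this thread. The function only prints the mdadm commands so they can be reviewed before being run as root.

```shell
#!/bin/sh
# Print (not execute) the mdadm commands that build RAID1 pairs with a
# RAID0 on top. Device names are hypothetical; pass an even number of them.
nested_raid10_cmds() {
    n=1
    stripes=""
    while [ "$#" -ge 2 ]; do
        echo "mdadm --create /dev/md$n --level=1 --raid-devices=2 $1 $2"
        stripes="$stripes /dev/md$n"
        n=$((n + 1))
        shift 2
    done
    # The top-level stripe takes the next free md number.
    echo "mdadm --create /dev/md$n --level=0 --raid-devices=$((n - 1))$stripes"
}

nested_raid10_cmds /dev/sdb /dev/sdc /dev/sdd /dev/sde
```

With four devices this prints two RAID1 creations followed by one RAID0 creation over /dev/md1 and /dev/md2.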

* Re: single cpu thread performance limit?
  2011-08-11 16:01 ` Mathias Burén
@ 2011-08-11 16:07   ` mark delfman
  0 siblings, 0 replies; 14+ messages in thread
From: mark delfman @ 2011-08-11 16:07 UTC (permalink / raw)
  To: Mathias Burén; +Cc: Linux RAID Mailing List, NeilBrown

Tried this... it gives the same result :(

On Thu, Aug 11, 2011 at 5:01 PM, Mathias Burén <mathias.buren@gmail.com> wrote:
> Maybe create separate MD RAID1 devices, then a new MD device with
> RAID0? (instead of using mdadm RAID"10")

* Re: single cpu thread performance limit?
  2011-08-11 15:58 single cpu thread performance limit? mark delfman
  2011-08-11 16:01 ` Mathias Burén
@ 2011-08-11 18:58 ` Stan Hoeppner
  2011-08-11 19:37   ` mark delfman
  2011-08-12 13:23   ` mark delfman
  2011-08-11 19:04 ` Bernd Schubert
  2 siblings, 2 replies; 14+ messages in thread
From: Stan Hoeppner @ 2011-08-11 18:58 UTC (permalink / raw)
  To: mark delfman; +Cc: Linux RAID Mailing List, NeilBrown

On 8/11/2011 10:58 AM, mark delfman wrote:
> I seem to have hit a significant hard stop in MD RAID1/10 performance
> which seems to be linked to a single CPU thread.

What is the name of the kernel thread that is peaking your cores?  Could
the device driver be eating the CPU and not the md kernel threads?  Is
it both?  Is it a different thread?  How much CPU is the IO generator
app eating?
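
One possible way to answer the thread-name question is to filter ps output for md RAID kernel threads. This is a sketch that assumes a procps-style ps and thread names of the form mdN_raidL (for example md0_raid10); the filter function name is made up for illustration.

```shell
#!/bin/sh
# Keep only md RAID kernel threads (named like md0_raid1 or md3_raid10)
# from ps output: PID, last-used CPU, %CPU and thread name.
md_threads() {
    grep -E '(^| )md[0-9]+_raid[0-9]+$' || true
}

ps -eo pid,psr,pcpu,comm | md_threads
```

A thread sitting near 100 in the %CPU column while the benchmark runs would point at md rather than at the device driver or the load generator.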

What Linux kernel version are you running?  Which Linux distribution?
What application are you using to generate the IO load?  Does it work at
the raw device/partition level or at the file level?

> I am using extremely high speed (IOPS) internal block devices – 8 in
> total.  They are capable of achieving > 1million iops.

8 solid state drives of one model or another, probably occupying 8 PCIe
slots.  IBIS, VeloDrive, the LSI SSD, or other PCIe based SSD?  Or are
these plain SATA II SSDs that *claim* to have 125K 4KB random IOPS
performance?

> However if I use RAID1 / 10 then MD seems to use a single thread which
> will reach 100% CPU utilisation (single core) at around 200K IOPS.
> Limiting the entire performance to around 200K.

CPU frequency?  How many sockets?  Total cores?  Whose box?  HP, Dell,
IBM, whitebox, self built?  If the latter two, whose motherboard?  How
many PCIe slots are occupied by the SSD cards?

> If I use say 4 x RAID1 / 10’s and a RAID0 on top – I see not much
> greater results. (although the theory seems to say I should and there
> are now 4 CPU threads running, it still seems to hit 4 x 100% at maybe
> 350K).

Assuming you have 4 processors (cores), then yes, you should see better
scaling.  If you have fewer cores than threads, then no.  Do you see more
IOPS before running out of CPU when reading than when writing?  You
should, as you're doing half the IOs when reading.

> Is there any way to increase the number of threads per RAID set? Or
> any other suggestions on configurations?  (I have tried every
> permutation of R0+R1/10’s)

The answer to the first question AFAIK is no.  Do you have the same
problem with a single --linear array?  What is the result when putting a
filesystem on each individual drive?  Do you get your 1 million IOPS?

Is MSI enabled and verified to be working for each PCIe SSD device?  See:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/PCI/MSI-HOWTO.txt;hb=HEAD

-- 
Stan

* Re: single cpu thread performance limit?
  2011-08-11 15:58 single cpu thread performance limit? mark delfman
  2011-08-11 16:01 ` Mathias Burén
  2011-08-11 18:58 ` Stan Hoeppner
@ 2011-08-11 19:04 ` Bernd Schubert
  2 siblings, 0 replies; 14+ messages in thread
From: Bernd Schubert @ 2011-08-11 19:04 UTC (permalink / raw)
  To: mark delfman; +Cc: Linux RAID Mailing List

On 08/11/2011 05:58 PM, mark delfman wrote:
> I seem to have hit a significant hard stop in MD RAID1/10 performance
> which seems to be linked to a single CPU thread.
>
> I am using extremely high speed (IOPS) internal block devices – 8 in
> total.  They are capable of achieving > 1million iops.
>
> However if I use RAID1 / 10 then MD seems to use a single thread which
> will reach 100% CPU utilisation (single core) at around 200K IOPS.
> Limiting the entire performance to around 200K.
>

Out of interest, could you please run "perf top" to let us see where the 
kernel is busy?

Thanks,
Bernd

* Re: single cpu thread performance limit?
  2011-08-11 18:58 ` Stan Hoeppner
@ 2011-08-11 19:37   ` mark delfman
  2011-08-11 19:57     ` Joe Landman
                       ` (2 more replies)
  2011-08-12 13:23   ` mark delfman
  1 sibling, 3 replies; 14+ messages in thread
From: mark delfman @ 2011-08-11 19:37 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: Linux RAID Mailing List, NeilBrown

Hi... sorry for the lack of initial info; your questions made me
realise how much I had missed off! Hopefully this adds some color:

PCIe based Flash - SLC based
Multiple Xeon 5640s (total 16 cores)
MSI interrupts all set (and affinity / pinning tried)
SLES 11 (2.6.32.43-0.5)
Tried on both a Supermicro and a Dell R server

The thread is MD0_RAID10 (or something similar, as I am not near it now).
This thread is easily linked to the MD(s):
create 4 x RAID1's and you have 4 x MD threads etc.

So, a single RAID10 creates a single thread - which will max at maybe 200K IOPS.
Create 4 x RAID10's seems OK, but they will not scale so great with a
RAID0 on top :(
Ideal would be a few threads per RAIDx


Using basic fio for IOPS (4 workers - 128 QD) - this uses hardly any
CPU resource.
Reads are maybe 50% faster, as you would expect.
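
A fio invocation along these lines might look like the sketch below. Only the 4 workers and the queue depth of 128 come from this message; the block size, I/O engine, runtime and target device are assumptions. The function prints the command rather than running it, since a random-write test against a raw device destroys its contents.

```shell
#!/bin/sh
# Build a fio command line matching "4 workers, queue depth 128".
# Block size, I/O engine and runtime are assumptions, not from the thread.
fio_cmd() {
    target=$1        # block device under test, e.g. /dev/md0 (placeholder)
    pattern=$2       # randread or randwrite
    echo "fio --name=iops-test --filename=$target --rw=$pattern" \
         "--bs=4k --ioengine=libaio --direct=1" \
         "--iodepth=128 --numjobs=4 --runtime=60 --time_based --group_reporting"
}

fio_cmd /dev/md0 randread
```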

The issue seems to be the fact that a single thread will only deliver X
before hitting 100% CPU... with emerging flash, this does not reach the
devices' capability.

FS:  An FS is not really an option for this solution, so we have not
tried this on this rig, but in the past the FS has degraded the IOPS

Whilst a R0 on top of the R1/10's does offer some increase in
performance, linear does not :(
LVM R0 on top of the MD R1/10's gives much the same results.
The limiter seems fixed on the single thread per R1/10

Thank you for any feedback!

Mark




* Re: single cpu thread performance limit?
  2011-08-11 19:37   ` mark delfman
@ 2011-08-11 19:57     ` Joe Landman
  2011-08-12  9:04       ` David Brown
  2011-08-11 20:51     ` Stan Hoeppner
  2011-08-12 12:48     ` Asdo
  2 siblings, 1 reply; 14+ messages in thread
From: Joe Landman @ 2011-08-11 19:57 UTC (permalink / raw)
  To: mark delfman; +Cc: Linux RAID Mailing List

On 08/11/2011 03:37 PM, mark delfman wrote:

> So, a single RAID10 creates a single thread - which will max at maybe 200K IOPS.

We are seeing ~110k IOPs per PCI HBA for an SSD variant of what you 
have.  FWIW, MD RAID is significantly faster than the hardware RAID 
here, but that's due to the processor more than anything else.

Which cards if you don't mind my asking?  We work with a number of 
vendors in this space.

> Create 4 x RAID10's seems OK, but they will not scale so great with a
> RAID0 on top :(
> Ideal would be a few threads per RAIDx

[...]

> Whilst a R0 on top of the R1/10's does offer some increase in
> performance, linear does not :(

Linear makes no sense for distributing IO's among many devices.  Linear 
is a concatenation.

> LVM R0 on top of the MD R1/10's gives much the same results.
> The limiter seems fixed on the single thread per R1/10

What's your CPU?  What does your 'lspci -vvv' output look like (is it
possible you've oversubscribed your PCIe channels)?  How many PCIe lanes
do you have on your MB?

FWIW, our array of SSD's hit 7.8 GB/s and 330k IOPs (8k random reads 
against 768GB of data) using MD RAID5's.  Each RAID5 hits around 75k 
IOPs, and when joined together, they hit closer to 110k per HBA.

The PCIe units are generally much better than this.  Last set of cards 
we played with a few weeks ago we were getting about 400k IOPs for a 
pair of cards in an MD RAID0.  I expect newer drivers and other things 
to help out a bit.

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: landman@scalableinformatics.com
web  : http://scalableinformatics.com
        http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615


* Re: single cpu thread performance limit?
  2011-08-11 19:37   ` mark delfman
  2011-08-11 19:57     ` Joe Landman
@ 2011-08-11 20:51     ` Stan Hoeppner
  2011-08-12  1:05       ` Stan Hoeppner
  2011-08-12 12:48     ` Asdo
  2 siblings, 1 reply; 14+ messages in thread
From: Stan Hoeppner @ 2011-08-11 20:51 UTC (permalink / raw)
  To: mark delfman; +Cc: Linux RAID Mailing List, NeilBrown

On 8/11/2011 2:37 PM, mark delfman wrote:

> FS:  An FS is not really an option for this solution, so we have not
> tried this on this rig, but in the past the FS has degraded the IOPS

I'm wondering what your application is, given you have the option to
write to raw devices in production.

> Whilst a R0 on top of the R1/10's does offer some increase in
> performance, linear does not :(
> LVM R0 on top of the MD R1/10's gives much the same results.
> The limiter seems fixed on the single thread per R1/10

This might provide you some really interesting results. :)  Take your 8
flash devices, which are of equal size I assume, and create an md
--linear array  on the raw device, no partitions (we'll worry about
redundancy later).  Format this md device with:

~$ mkfs.xfs -d agcount=8 /dev/mdX

Mount it with:

~$ mount -o inode64,logbsize=256k,noatime,nobarrier /dev/mdX /test

(Too bad you're running 2.6.32 instead of 2.6.35 or above, as enabling
the XFS delayed logging mount option would probably bump your small file
block IOPS to well over a million, if the hardware is actually up to it.)

Now, create 8 directories, say test[1-8].  XFS drives parallelism
through allocation groups.  Each directory will be created in a
different AG.  Thus, you'll end up with one directory per SSD, and any
files written to that directory will go to that same SSD.  Thus,
writing files to all 8 directories in parallel will get you near perfect
scaling across all disks, with files, not simply raw blocks.

I'm not really that familiar with FIO but I'll assume it can do file as
well as block IO.  If not, grab iozone or bonnie, etc, and run tests
writing small files to all 8 directories in parallel.  The results may
surprise you.  After you've done this, create 4 mirror pairs and then a
--linear of them.  Duplicate the above but use 4 allocation groups and 4
directories.  Please post the results for both test setups.
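
The parallel test described above could be driven along the following lines. This is a sketch under stated assumptions: the per-job file size, block size, queue depth and runtime are invented for illustration, and the function prints the commands instead of running them, so nothing is written until the output is reviewed and executed on a mounted filesystem.

```shell
#!/bin/sh
# Print one fio invocation per test directory so that all allocation
# groups are exercised at once. Sizes and fio settings are assumptions.
parallel_ag_cmds() {
    mnt=$1       # XFS mount point, e.g. /test
    count=$2     # number of directories, matching the allocation group count
    i=1
    while [ "$i" -le "$count" ]; do
        echo "mkdir -p $mnt/test$i"
        echo "fio --name=ag$i --directory=$mnt/test$i --size=1g --rw=randwrite --bs=4k --ioengine=libaio --direct=1 --iodepth=32 --runtime=60 --time_based &"
        i=$((i + 1))
    done
    # Wait for all background fio jobs before collecting results.
    echo "wait"
}

parallel_ag_cmds /test 8
```

For the mirrored variant, the same function can be called with a count of 4 to match 4 allocation groups.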

-- 
Stan


* Re: single cpu thread performance limit?
  2011-08-11 20:51     ` Stan Hoeppner
@ 2011-08-12  1:05       ` Stan Hoeppner
  0 siblings, 0 replies; 14+ messages in thread
From: Stan Hoeppner @ 2011-08-12  1:05 UTC (permalink / raw)
  To: mark delfman; +Cc: Linux RAID Mailing List, NeilBrown

On 8/11/2011 3:51 PM, Stan Hoeppner wrote:
> On 8/11/2011 2:37 PM, mark delfman wrote:
> 
>> FS:  An FS is not really an option for this solution, so we have not
>> tried this on this rig, but in the past the FS has degraded the IOPS

>> Whilst a R0 on top of the R1/10's does offer some increase in
>> performance, linear does not :(
>> LVM R0 on top of the MD R1/10's gives much the same results.
>> The limiter seems fixed on the single thread per R1/10

This seems to be the case.  The md processes apparently aren't threaded,
at least not when doing mirroring/+striping.  xfsbufd, xfssyncd, and
xfsaild are all threaded.

> This might provide you some really interesting results. :)  Take your 8
> flash devices, which are of equal size I assume, and create an md
> --linear array  on the raw device, no partitions (we'll worry about
> redundancy later).  Format this md device with:

A concat shouldn't use nearly as much CPU as a mirror or stripe.  Though
I don't know if one core will be enough here.  Test and see.

> ~$ mkfs.xfs -d agcount=8 /dev/mdX
> 
> Mount it with:
> 
> ~$ mount -o inode64,logbsize=256k,noatime,nobarrier /dev/mdX /test
> 
> (Too bad you're running 2.6.32 instead of 2.6.35 or above, as enabling
> the XFS delayed logging mount option would probably bump your small file
> block IOPS to well over a million, if the hardware is actually up to it.)
> 
> Now, create 8 directories, say test[1-8].  XFS drives parallelism
> through allocation groups.  Each directory will be created in a
> different AG.  Thus, you'll end up with one directory per SSD, and any
> files written to that directory will go to that same SSD.  Thus,
> writing files to all 8 directories in parallel will get you near perfect
> scaling across all disks, with files, not simply raw blocks.

In actuality, since you're running up against CPU vs IOPs, it may be
better here to create 32 or even 64 allocation groups and spread files
evenly across them.  IIRC, each XFS file IO gets its own worker thread,
so you'll be able to take advantage of all 16 cores in the box.  The
kernel IO is more than sufficiently threaded.

You mentioned above that using a filesystem isn't really an option.  As
I see it, given the lack of md's lateral (parallel) scalability with
your hardware and workload, you may want to evaluate the following ideas:

1.  Upgrade to 2.6.38 or later.  There have been IO optimizations since
2.6.32, though I'm not sure WRT the md code itself.

2.  Try the XFS option.  It may or may not work in your case, but it
will parallelize to hundreds of cores when writing hundreds of files
concurrently.  The trick is matching your workload to it, or vice versa.
If you're writing single large files, it's likely not going to
parallelize.  If you can't use a filesystem...

3.  mdraid on your individual cores can't keep up with your SSDs, so:
    A.  Switch to 24 SLC SATA SSDs attached to 3* 8 port LSI SAS HBAs:
http://www.lsi.com/products/storagecomponents/Pages/LSISAS9211-8i.aspx
        which will give you 12 mdraid1 processes instead of 4.  Use
        cpumemsets to lock the 12 mdraid1 processes to 12 specific
        cores, and the mdraid0 process to another core.  And disable HT.
    B.  Swap the CPUs for higher frequency models, though it'll gain you
        little and cost quite a bit for four 3.6GHz Xeon W5590s
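
As a rough illustration of the pinning idea in option 3A, the sketch below prints taskset commands that bind each md mirror thread to its own core. The thread names and core numbers are placeholders, and whether a given kernel thread accepts an affinity change depends on the kernel, so treat this as an assumption to verify rather than a recipe.

```shell
#!/bin/sh
# Print taskset commands pinning each named md kernel thread to its own
# core, starting at the given core number. Thread names are placeholders.
pin_cmds() {
    core=$1
    shift
    for thread in "$@"; do
        # pgrep -x matches the exact thread name; taskset -cp sets the CPU list.
        echo "taskset -cp $core \$(pgrep -x $thread)"
        core=$((core + 1))
    done
}

pin_cmds 2 md1_raid1 md2_raid1 md3_raid1 md4_raid1
```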

I'm sure you've already thought of these options, but I figured I'd get
them in Google.

-- 
Stan


* Re: single cpu thread performance limit?
  2011-08-11 19:57     ` Joe Landman
@ 2011-08-12  9:04       ` David Brown
  0 siblings, 0 replies; 14+ messages in thread
From: David Brown @ 2011-08-12  9:04 UTC (permalink / raw)
  To: linux-raid

On 11/08/11 21:57, Joe Landman wrote:
> On 08/11/2011 03:37 PM, mark delfman wrote:
>
>
>> Whilst a R0 on top of the R1/10's does offer some increase in
>> performance, linear does not :(
>
> Linear makes no sense for distributing IO's among many devices. Linear
> is a concatenation.
>

If the real-world application involves parallel access to lots of 
different files, then XFS on a linear concatenation /will/ make sense, 
if your allocation groups match your concatenated devices.  It won't 
give you faster access to any of the files, but it will let you have 
fast access to several files at the same time.  Of course, YMMV 
according to the setup and application.


* Re: single cpu thread performance limit?
  2011-08-11 19:37   ` mark delfman
  2011-08-11 19:57     ` Joe Landman
  2011-08-11 20:51     ` Stan Hoeppner
@ 2011-08-12 12:48     ` Asdo
  2 siblings, 0 replies; 14+ messages in thread
From: Asdo @ 2011-08-12 12:48 UTC (permalink / raw)
  To: mark delfman; +Cc: Stan Hoeppner, Linux RAID Mailing List, NeilBrown

On 08/11/11 21:37, mark delfman wrote:
> So, a single RAID10 creates a single thread - which will max at maybe 200K IOPS.
> Create 4 x RAID10's seems OK, but they will not scale so great with a
> RAID0 on top :(
> Ideal would be a few threads per RAIDx

Try this: LVM.
AFAIR, LVM does not have its own thread; it is the application thread
that executes the LVM code.
This should not impede scalability.

If you are testing with something like fio, which randomly spans the 
whole device with random I/O during test, you can use a linear LVM 
concatenation (which is the default when you create a LV that spans the 
whole VG).
Otherwise use striping on lvcreate.
Try both if possible.
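
The two LVM variants could be set up roughly as sketched below. The volume group and logical volume names, the 64 KiB stripe size and the md device names are all assumptions; the function prints the commands for review instead of running them.

```shell
#!/bin/sh
# Print LVM commands for a linear or a striped logical volume on top of
# existing md arrays. Volume group and LV names are hypothetical.
lvm_cmds() {
    mode=$1          # "linear" or "striped"
    shift
    echo "pvcreate $*"
    echo "vgcreate vg_flash $*"
    if [ "$mode" = "striped" ]; then
        # -i sets the stripe count (one per PV), -I the stripe size in KiB.
        echo "lvcreate -i $# -I 64 -l 100%FREE -n lv_test vg_flash"
    else
        # Without -i the LV is a linear concatenation of the PVs.
        echo "lvcreate -l 100%FREE -n lv_test vg_flash"
    fi
}

lvm_cmds striped /dev/md1 /dev/md2 /dev/md3 /dev/md4
```

Calling the function with "linear" instead of "striped" prints the concatenated variant for comparison.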

Also, as other people have said, your kernel is quite old... Actually I 
don't remember if there were performance improvements regarding what you 
are doing, but you probably should try a newer one.

Let me know how it goes.


* Re: single cpu thread performance limit?
  2011-08-11 18:58 ` Stan Hoeppner
  2011-08-11 19:37   ` mark delfman
@ 2011-08-12 13:23   ` mark delfman
  2011-08-12 14:23     ` Asdo
  2011-08-12 20:51     ` Stan Hoeppner
  1 sibling, 2 replies; 14+ messages in thread
From: mark delfman @ 2011-08-12 13:23 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: Linux RAID Mailing List, NeilBrown

Hi

Quick update with the XFS tests suggested (although a FS is still
probably not a real option at the moment for me)

This rig only has 4 x Flash (2 MLC and 2 SLC).....  125K IOPS each for
MLC - 165K each for SLC.

Create linear RAID and XFS with ag=4

Mount as suggested and create 4 test folders.....

If I test individually - we get 99.9% of the IOPS (i.e. 125K for first 2
AG's and 165K for last 2), which is great news and means that the AG
does what it should.

But if I run the test over all 4, then we see it peak at around 320K
IOPS.  Interestingly each AG = 80K IOPS and, as we can see above, this
need not be the case.  The CPU load is not having any issues - I am
presuming that this could be a simple XFS limit maybe.


More testing with many R1's and R0's on top seems to suggest that R0 is
losing around 20-25% of the IOPS (R1 around 5%).  I have tried with
LVM striping and see much the same.







* Re: single cpu thread performance limit?
  2011-08-12 13:23   ` mark delfman
@ 2011-08-12 14:23     ` Asdo
  2011-08-12 20:51     ` Stan Hoeppner
  1 sibling, 0 replies; 14+ messages in thread
From: Asdo @ 2011-08-12 14:23 UTC (permalink / raw)
  To: mark delfman; +Cc: Stan Hoeppner, Linux RAID Mailing List, NeilBrown

On 08/12/11 15:23, mark delfman wrote:
> Hi
>
> Quick update with the XFS tests suggested (although a FS is still
> probably not a real option at the moment for me)
>
> This rig only has 4 x Flash (2 MLC and 2 SLC).....  125K IOPS each for
> MLC - 165K each for SLC.
>
> Create linear RAID and XFS with ag=4
>
> Mount as suggested and create 4 test folders.....
>
> If I test individually - we get 99.9% of the IOPS (i.e. 125K for first 2
> AG's and 165K for last 2), which is great news and means that the AG
> does what it should.
>
> But if I run the test over all 4, then we see it peak at around 320K
> IOPS.  Interestingly each AG = 80K IOPS and, as we can see above, this
> need not be the case.  The CPU load is not having any issues - I am
> presuming that this could be a simple XFS limit maybe.
>
>
> More testing with many R1's and R0's on top seems to suggest that R0 is
> losing around 20-25% of the IOPS (R1 around 5%).  I have tried with
> LVM striping and see much the same.
>

So you report a higher speed now:  (25% overhead + 5% overhead = 30% 
overhead = 70% remains)
(125*2+175*2)*0.7 = 420 K
Previously in your first post you were talking about 350K, do you confirm?
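
The estimate above can be reproduced with shell integer arithmetic. Note that it uses 175K for the SLC devices, whereas the figure reported earlier in the thread was 165K; the sketch evaluates both. The function name is made up for illustration, and the overhead is applied as a single combined percentage, as in the estimate.

```shell
#!/bin/sh
# Expected aggregate IOPS (in thousands) after a combined percentage overhead.
# Arguments: overhead percent, then one IOPS figure (in K) per device.
expected_kiops() {
    overhead=$1
    shift
    total=0
    for dev in "$@"; do
        total=$((total + dev))
    done
    echo $((total * (100 - overhead) / 100))
}

expected_kiops 30 125 125 175 175   # figures as used in the estimate: 420
expected_kiops 30 125 125 165 165   # with the reported 165K SLC figure: 406
```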

Unfortunately I think 20% overhead for R0 or LVM is reasonable, I have 
measured 15% for LVM in other situations.
Your figures with 4 SSDs are not bad I'd say.

But this means that you should obtain 840K IOPS when you have all 8 SSD 
PCIe cards installed (like in your first post).
If possible repeat the test with LVM stripes on the big rig.

Oh and I also wanted to ask: if you run 8 parallel tests on the big rig 
with 8 SSDs, each test on a different SSD but all tests simultaneously, 
without RAIDs or LVMs, are you sure you reach 1 million IOPS overall, or 
do you max out at 600K or similar?  (600K would be the last performance 
you measured but adjusted to remove the overheads of LVM and RAID)

BTW: please note you do NOT have 16 cores; you have 8 cores if you have
a dual Xeon 5640. The other 8 cores you see are fake: that's
hyperthreading. If one core's CPU utilisation goes up, you will see its
twin phantom core go up as well. This makes the benchmarking more
difficult to understand, so you might disable hyperthreading in the
BIOS if you want a clearer picture of what's going on. Performance
should probably change very little after you disable hyperthreading.
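
To check how many of the CPUs the kernel reports are physical cores rather than hyperthreads, something like the sketch below can be used. It assumes an x86-style /proc/cpuinfo in which each processor entry carries "physical id" and "core id" fields; on other layouts the physical count will not be meaningful.

```shell
#!/bin/sh
# Count logical CPUs and distinct physical cores from /proc/cpuinfo-style
# input. A physical core is identified by its (physical id, core id) pair.
count_cores() {
    awk -F: '
        /^processor/   { logical++ }
        /^physical id/ { pkg = $2 }
        /^core id/     { seen[pkg "," $2] = 1 }
        END {
            physical = 0
            for (k in seen) physical++
            printf "logical=%d physical=%d\n", logical, physical
        }'
}

if [ -r /proc/cpuinfo ]; then
    count_cores < /proc/cpuinfo
fi
```

On a box with hyperthreading enabled, the logical count would be expected to come out at twice the physical count.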


* Re: single cpu thread performance limit?
  2011-08-12 13:23   ` mark delfman
  2011-08-12 14:23     ` Asdo
@ 2011-08-12 20:51     ` Stan Hoeppner
  1 sibling, 0 replies; 14+ messages in thread
From: Stan Hoeppner @ 2011-08-12 20:51 UTC (permalink / raw)
  To: mark delfman; +Cc: Linux RAID Mailing List, NeilBrown

On 8/12/2011 8:23 AM, mark delfman wrote:

> Quick update with the XFS tests suggested (although a FS is still
> probably not a real option at the moment for me)
> 
> This rig only has 4 x Flash (2 MLC and 2 SLC).....  125K IOPS each for
> MLC - 165K each for SLC.
> 
> Create linear RAID and XFS with ag=4
> 
> Mount as suggested and create 4 test folders.....
> 
> If I test individually - we get 99.9% of the IOPS (i.e. 125K for first 2
> AG's and 165K for last 2), which is great news and means that the AG
> does what it should.

Now you know why XFS has the high performance reputation it does.

> But if I run the test over all 4, then we see it peak at around 320K
> IOPS.  Interestingly each AG = 80K IOPS and, as we can see above, this
> need not be the case.  The CPU load is not having any issues - I am
> presuming that this could be a simple XFS limit maybe.

Ok, now this is interesting, because the 320K IOPS you mentioned as a
limit here is very close to the ~350K IOPS you mentioned in your first
post, when 4 cores were pegged with the md processes.  In this case your
CPUs are not pegged, but you're hitting nearly the same ceiling, 320K IOPS.

I'm pretty sure you're not hitting an XFS limit here.  To confirm,
create 4 subdirectories in each of the current 4 directories, and
generate 16 concurrent writers against the 16 dirs.

On 8/11/2011 10:58 AM, mark delfman wrote:
> If I use say 4 x RAID1 / 10’s and a RAID0 on top – I see not much
> greater results. (although the theory seems to say I should and there
> are now 4 CPU threads running, it still seems to hit 4 x 100% at maybe
> 350K).

So it's beginning to look like your scalability issue may not
necessarily be with mdraid, but possibly a hardware bottleneck, or a
bottleneck somewhere else in the kernel.  As Bernd mentioned previously,
you should probably run perf top or some other tool to see where the
kernel is busy.

Also, you never answered my question regarding which block device
driver(s) you're using for these PCIe SSDs.

> More testing with many R1's and R0's on top seems to suggest that R0 is
> losing around 20-25% of the IOPS (R1 around 5%).  I have tried with
> LVM striping and see much the same.

Are you hitting the same ~320K-350K IOPS aggregate limit with all test
configurations?

-- 
Stan
