Linux RAID subsystem development
* write performance of HW RAID VS MD RAID
@ 2015-06-10 22:27 Ming Lin
  2015-06-10 23:00 ` Neil Brown
  2015-06-11  0:27 ` Roman Mamedov
  0 siblings, 2 replies; 9+ messages in thread
From: Ming Lin @ 2015-06-10 22:27 UTC (permalink / raw)
  To: linux-raid; +Cc: Neil Brown

Hi NeilBrown,

As you may have already seen, I ran a lot of tests with 10 HDDs for the patchset
"simplify block layer based on immutable biovecs"

Here is the summary.
http://minggr.net/pub/20150608/fio_results/summary.log

MD RAID6 read performance is OK.
But write performance is much lower than HW RAID6.

Is it a known issue?

Thanks.

* Re: write performance of HW RAID VS MD RAID
  2015-06-10 22:27 write performance of HW RAID VS MD RAID Ming Lin
@ 2015-06-10 23:00 ` Neil Brown
  2015-06-10 23:34   ` Steven Haigh
  2015-06-10 23:59   ` Ming Lin
  2015-06-11  0:27 ` Roman Mamedov
  1 sibling, 2 replies; 9+ messages in thread
From: Neil Brown @ 2015-06-10 23:00 UTC (permalink / raw)
  To: Ming Lin; +Cc: linux-raid

On Wed, 10 Jun 2015 15:27:07 -0700
Ming Lin <mlin@kernel.org> wrote:

> Hi NeilBrown,
> 
> As you may have already seen, I ran a lot of tests with 10 HDDs for the patchset
> "simplify block layer based on immutable biovecs"
> 
> Here is the summary.
> http://minggr.net/pub/20150608/fio_results/summary.log
> 
> MD RAID6 read performance is OK.
> But write performance is much lower than HW RAID6.
> 
> Is it a known issue?

It is not unexpected.
There are two likely reasons.
One is that HW RAID cards often have on-board NVRAM which is used as a
write-behind cache.  This allows better throughput by hiding latency and more
often gathering full-stripe writes.  HW RAID cards may also have accelerators
for the parity calculations, but that is not likely to make a big difference.
What sort of RAID6 controller do you have?

The other is that it is not easy for MD/RAID6 to schedule stripe writes
optimally.  It doesn't really know if more writes are coming, so it should
wait, or if it already has everything - so it should get to work straight away.
It is possible that it could reply to writes as soon as they are in the
(volatile) cache and only force things to storage when a REQ_FUA or REQ_FLUSH
arrives.  That might help ... or it might corrupt filesystems :-(
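
To illustrate why gathering full-stripe writes matters, here is a minimal
sketch that counts per-stripe I/O under a reconstruct-write strategy. The
8-data-disk geometry and the function name are assumptions made for
illustration only; what MD actually does depends on the kernel version and on
what is already in the stripe cache.

```python
def reconstruct_write_ios(data_disks, chunks_written, parity_disks=2):
    """Return (reads, writes) in chunk-sized units for one stripe update.

    Reconstruct-write: read the data chunks that are NOT being overwritten,
    recompute parity from the complete stripe, then write new data + parity.
    """
    if not 0 < chunks_written <= data_disks:
        raise ValueError("chunks_written must be in 1..data_disks")
    reads = data_disks - chunks_written
    writes = chunks_written + parity_disks
    return reads, writes


# Hypothetical geometry: 8 data disks + 2 parity disks.
for k in (1, 4, 8):
    reads, writes = reconstruct_write_ios(data_disks=8, chunks_written=k)
    print(f"{k} of 8 data chunks written: {reads} reads, {writes} writes")
```

Only the full-stripe case needs no reads at all, which is why a cache that
can wait for the rest of a stripe to arrive tends to help throughput.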

As long as the patches don't make things obviously worse, I'm happy.

Thanks,
NeilBrown

* Re: write performance of HW RAID VS MD RAID
  2015-06-10 23:00 ` Neil Brown
@ 2015-06-10 23:34   ` Steven Haigh
  2015-06-10 23:59   ` Ming Lin
  1 sibling, 0 replies; 9+ messages in thread
From: Steven Haigh @ 2015-06-10 23:34 UTC (permalink / raw)
  To: linux-raid

On Thu, 11 Jun 2015 09:00:54 AM Neil Brown wrote:
> On Wed, 10 Jun 2015 15:27:07 -0700
> 
> Ming Lin <mlin@kernel.org> wrote:
> > Hi NeilBrown,
> > 
> > As you may have already seen, I ran a lot of tests with 10 HDDs for the patchset
> > "simplify block layer based on immutable biovecs"
> > 
> > Here is the summary.
> > http://minggr.net/pub/20150608/fio_results/summary.log
> > 
> > MD RAID6 read performance is OK.
> > But write performance is much lower than HW RAID6.
> > 
> > Is it a known issue?
> 
> It is not unexpected.
> There are two likely reasons.
> One is that HW RAID cards often have on-board NVRAM which is used as a
> write-behind cache.  This allows better throughput by hiding latency and
> more often gathering full-stripe writes.  HW RAID cards may also have
> accelerators for the parity calculations, but that is not likely to make a
> big difference. What sort of RAID6 controller do you have?
> 
> The other is that it is not easy for MD/RAID6 to schedule stripe writes
> optimally.  It doesn't really know if more writes are coming, so it should
> wait, or if it already has everything - so it should get to work straight
> away. It is possible that it could reply to writes as soon as they are in
> the (volatile) cache and only force things to storage when a REQ_FUA or
> REQ_FLUSH arrives.  That might help ... or it might corrupt filesystems :-(

And this here is the problem. Any conceptual changes that risk filesystem and 
therefore data integrity are bad. For something as simple as benchmarks it 
isn't really worth the risk of losing data integrity.

In a hardware card setup, one would hope that the write cache is battery 
backed - or flash - or something that won't lose data if the power goes out. 
When you're running this in software, you can't magically keep data if you 
lose power - so the longer something is not flushed to disk, the longer the 
risk period for a write.

If you extend this concept, a write is also at risk if power is lost while it
is in flight between the kernel's write buffer and the (hopefully)
battery-backed RAM on the hardware card. It is still at risk while the card is
writing to the physical disk - modern hard drives have massive caches! If the
drive has the write in its cache and loses power, is the data gone?

Guaranteed data integrity these days is a difficult subject. The kernel may say 
the data is written properly - but is it? The HW RAID card may say the data is 
written properly - but is it? Or is it still in cache? Or has it just hit the 
HDD cache?

What we currently have is a slight tradeoff in performance for a minimisation
of risk (as far as practical anyway) - and I'm ok with this.

-- 
Steven Haigh

Email: netwiz@crc.id.au
Web: http://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897

* Re: write performance of HW RAID VS MD RAID
  2015-06-10 23:00 ` Neil Brown
  2015-06-10 23:34   ` Steven Haigh
@ 2015-06-10 23:59   ` Ming Lin
  2015-06-11  0:28     ` Neil Brown
  1 sibling, 1 reply; 9+ messages in thread
From: Ming Lin @ 2015-06-10 23:59 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid

On Wed, Jun 10, 2015 at 4:00 PM, Neil Brown <neilb@suse.de> wrote:
> On Wed, 10 Jun 2015 15:27:07 -0700
> Ming Lin <mlin@kernel.org> wrote:
>
>> Hi NeilBrown,
>>
>> As you may have already seen, I ran a lot of tests with 10 HDDs for the patchset
>> "simplify block layer based on immutable biovecs"
>>
>> Here is the summary.
>> http://minggr.net/pub/20150608/fio_results/summary.log
>>
>> MD RAID6 read performance is OK.
>> But write performance is much lower than HW RAID6.
>>
>> Is it a known issue?
>
> It is not unexpected.
> There are two likely reasons.
> One is that HW RAID cards often have on-board NVRAM which is used as a
> write-behind cache.  This allows better throughput by hiding latency and more
> often gathering full-stripe writes.  HW RAID cards may also have accelerators
> for the parity calculations, but that is not likely to make a big difference.
> What sort of RAID6 controller do you have?

PERC H730 Mini

* Re: write performance of HW RAID VS MD RAID
  2015-06-10 22:27 write performance of HW RAID VS MD RAID Ming Lin
  2015-06-10 23:00 ` Neil Brown
@ 2015-06-11  0:27 ` Roman Mamedov
  2015-06-11  5:39   ` AW: " Markus Stockhausen
  1 sibling, 1 reply; 9+ messages in thread
From: Roman Mamedov @ 2015-06-11  0:27 UTC (permalink / raw)
  To: Ming Lin; +Cc: linux-raid, Neil Brown

On Wed, 10 Jun 2015 15:27:07 -0700
Ming Lin <mlin@kernel.org> wrote:

> Hi NeilBrown,
> 
> As you may have already seen, I ran a lot of tests with 10 HDDs for the patchset
> "simplify block layer based on immutable biovecs"
> 
> Here is the summary.
> http://minggr.net/pub/20150608/fio_results/summary.log
> 
> MD RAID6 read performance is OK.
> But write performance is much lower than HW RAID6.
> 
> Is it a known issue?

Did you tune the stripe_cache_size for the array? Try 32768.
https://peterkieser.com/2009/11/29/raid-mdraid-stripe_cache_size-vs-write-transfer/
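
A larger stripe cache costs RAM. As a rough guide, the kernel md documentation
gives the memory used as page size * number of disks * stripe_cache_size. The
sketch below applies that formula, assuming all 10 HDDs are array members and
a 4 KiB page size; both figures and the function name are assumptions for
illustration.

```python
def stripe_cache_bytes(stripe_cache_size, nr_disks, page_size=4096):
    """Approximate RAM used by the md stripe cache.

    Uses the formula from the kernel md documentation:
    page_size * nr_disks * stripe_cache_size.
    """
    return page_size * nr_disks * stripe_cache_size


# Assumed: 10 member disks, 4 KiB pages.
for entries in (256, 8192, 32768):
    mib = stripe_cache_bytes(entries, nr_disks=10) / 2**20
    print(f"stripe_cache_size={entries:>5}: ~{mib:.0f} MiB")
```

The value itself is set by writing it to the array's
md/stripe_cache_size file in sysfs, for example under /sys/block/md0/ for a
hypothetical array named md0.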

-- 
With respect,
Roman

* Re: write performance of HW RAID VS MD RAID
  2015-06-10 23:59   ` Ming Lin
@ 2015-06-11  0:28     ` Neil Brown
  0 siblings, 0 replies; 9+ messages in thread
From: Neil Brown @ 2015-06-11  0:28 UTC (permalink / raw)
  To: Ming Lin; +Cc: linux-raid

On Wed, 10 Jun 2015 16:59:10 -0700
Ming Lin <mlin@kernel.org> wrote:

> On Wed, Jun 10, 2015 at 4:00 PM, Neil Brown <neilb@suse.de> wrote:
> > On Wed, 10 Jun 2015 15:27:07 -0700
> > Ming Lin <mlin@kernel.org> wrote:
> >
> >> Hi NeilBrown,
> >>
> >> As you may have already seen, I ran a lot of tests with 10 HDDs for the patchset
> >> "simplify block layer based on immutable biovecs"
> >>
> >> Here is the summary.
> >> http://minggr.net/pub/20150608/fio_results/summary.log
> >>
> >> MD RAID6 read performance is OK.
> >> But write performance is much lower than HW RAID6.
> >>
> >> Is it a known issue?
> >
> > It is not unexpected.
> > There are two likely reasons.
> > One is that HW RAID cards often have on-board NVRAM which is used as a
> > write-behind cache.  This allows better throughput by hiding latency and more
> > often gathering full-stripe writes.  HW RAID cards may also have accelerators
> > for the parity calculations, but that is not likely to make a big difference.
> > What sort of RAID6 controller do you have?
> 
> PERC H730 Mini

http://www.dell.com/learn/us/en/04/campaigns/dell-raid-controllers
   1GB NV Flash Backed Cache on the H730

That would explain a lot of the performance difference for writes.

NeilBrown

* AW: write performance of HW RAID VS MD RAID
  2015-06-11  0:27 ` Roman Mamedov
@ 2015-06-11  5:39   ` Markus Stockhausen
  2015-06-11  6:02     ` Ming Lin
  0 siblings, 1 reply; 9+ messages in thread
From: Markus Stockhausen @ 2015-06-11  5:39 UTC (permalink / raw)
  To: Ming Lin; +Cc: linux-raid@vger.kernel.org, Neil Brown, Roman Mamedov

> From: linux-raid-owner@vger.kernel.org [linux-raid-owner@vger.kernel.org] on behalf of Roman Mamedov [rm@romanrm.net]
> Sent: Thursday, 11 June 2015 02:27
> To: Ming Lin
> Cc: linux-raid@vger.kernel.org; Neil Brown
> Subject: Re: write performance of HW RAID VS MD RAID
> 
> On Wed, 10 Jun 2015 15:27:07 -0700
> Ming Lin <mlin@kernel.org> wrote:
> 
> > Hi NeilBrown,
> >
> > As you may have already seen, I ran a lot of tests with 10 HDDs for the patchset
> > "simplify block layer based on immutable biovecs"
> >
> > Here is the summary.
> > http://minggr.net/pub/20150608/fio_results/summary.log
> >
> > MD RAID6 read performance is OK.
> > But write performance is much lower than HW RAID6.
> >
> > Is it a known issue?
> 
> Did you tune the stripe_cache_size for the array? Try 32768.
> https://peterkieser.com/2009/11/29/raid-mdraid-stripe_cache_size-vs-write-transfer/

+1 for giving an increased cache size a try. 

From the numbers I gather that you are doing sequential
read/write tests. Otherwise I would expect a write penalty for 
the HW RAID setup too.

Markus

* Re: write performance of HW RAID VS MD RAID
  2015-06-11  5:39   ` AW: " Markus Stockhausen
@ 2015-06-11  6:02     ` Ming Lin
  2015-06-12 17:20       ` Ming Lin
  0 siblings, 1 reply; 9+ messages in thread
From: Ming Lin @ 2015-06-11  6:02 UTC (permalink / raw)
  To: Markus Stockhausen; +Cc: Roman Mamedov, linux-raid@vger.kernel.org, Neil Brown

On Wed, Jun 10, 2015 at 10:39 PM, Markus Stockhausen
<stockhausen@collogia.de> wrote:
>> From: linux-raid-owner@vger.kernel.org [linux-raid-owner@vger.kernel.org] on behalf of Roman Mamedov [rm@romanrm.net]
>> Sent: Thursday, 11 June 2015 02:27
>> To: Ming Lin
>> Cc: linux-raid@vger.kernel.org; Neil Brown
>> Subject: Re: write performance of HW RAID VS MD RAID
>>
>> On Wed, 10 Jun 2015 15:27:07 -0700
>> Ming Lin <mlin@kernel.org> wrote:
>>
>> > Hi NeilBrown,
>> >
>> > As you may have already seen, I ran a lot of tests with 10 HDDs for the patchset
>> > "simplify block layer based on immutable biovecs"
>> >
>> > Here is the summary.
>> > http://minggr.net/pub/20150608/fio_results/summary.log
>> >
>> > MD RAID6 read performance is OK.
>> > But write performance is much lower than HW RAID6.
>> >
>> > Is it a known issue?
>>
>> Did you tune the stripe_cache_size for the array? Try 32768.
>> https://peterkieser.com/2009/11/29/raid-mdraid-stripe_cache_size-vs-write-transfer/
>
> +1 for giving an increased cache size a try.

Will try it.

>
> From the numbers I gather that you are doing sequential
> read/write tests. Otherwise I would expect a write penalty for
> the HW RAID setup too.

Yes,

[global]
ioengine=libaio
iodepth=64
direct=1
runtime=1800
time_based
group_reporting
numjobs=48
gtod_reduce=0
norandommap
write_iops_log=fs

[job1]
bs=640K
directory=/mnt
size=5G
rw=write
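
Whether bs=640K can produce full-stripe writes depends on the array geometry,
which the job file does not state. A minimal sketch, assuming a 10-disk RAID6
with a 64 KiB chunk (both values are assumptions about the array, and the
function name is hypothetical):

```python
def full_stripe_bytes(n_disks, chunk_bytes, parity_disks=2):
    """Data carried by one full RAID6 stripe (parity chunks excluded)."""
    return (n_disks - parity_disks) * chunk_bytes


KIB = 1024
stripe = full_stripe_bytes(n_disks=10, chunk_bytes=64 * KIB)  # assumed geometry
bs = 640 * KIB                                                 # bs from the job file

print(f"full stripe: {stripe // KIB} KiB")
print(f"bs / full stripe = {bs / stripe}")
print("bs is a whole number of stripes:", bs % stripe == 0)
```

Under these assumptions each request spans 1.25 stripes, so a single request
cannot fill every stripe it touches; whether neighbouring requests complete
those stripes depends on how the filesystem lays out the file and on what the
stripe cache manages to merge.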

* Re: write performance of HW RAID VS MD RAID
  2015-06-11  6:02     ` Ming Lin
@ 2015-06-12 17:20       ` Ming Lin
  0 siblings, 0 replies; 9+ messages in thread
From: Ming Lin @ 2015-06-12 17:20 UTC (permalink / raw)
  To: Ming Lin
  Cc: Markus Stockhausen, Roman Mamedov, linux-raid@vger.kernel.org,
	Neil Brown

On Wed, Jun 10, 2015 at 11:02 PM, Ming Lin <mlin@kernel.org> wrote:
> On Wed, Jun 10, 2015 at 10:39 PM, Markus Stockhausen
> <stockhausen@collogia.de> wrote:
>>> From: linux-raid-owner@vger.kernel.org [linux-raid-owner@vger.kernel.org] on behalf of Roman Mamedov [rm@romanrm.net]
>>> Sent: Thursday, 11 June 2015 02:27
>>> To: Ming Lin
>>> Cc: linux-raid@vger.kernel.org; Neil Brown
>>> Subject: Re: write performance of HW RAID VS MD RAID
>>>
>>> On Wed, 10 Jun 2015 15:27:07 -0700
>>> Ming Lin <mlin@kernel.org> wrote:
>>>
>>> > Hi NeilBrown,
>>> >
>>> > As you may have already seen, I ran a lot of tests with 10 HDDs for the patchset
>>> > "simplify block layer based on immutable biovecs"
>>> >
>>> > Here is the summary.
>>> > http://minggr.net/pub/20150608/fio_results/summary.log
>>> >
>>> > MD RAID6 read performance is OK.
>>> > But write performance is much lower than HW RAID6.
>>> >
>>> > Is it a known issue?
>>>
>>> Did you tune the stripe_cache_size for the array? Try 32768.
>>> https://peterkieser.com/2009/11/29/raid-mdraid-stripe_cache_size-vs-write-transfer/
>>
>> +1 for giving an increased cache size a try.
>
> Will try it.
>
>>
>> From the numbers I gather that you are doing sequential
>> read/write tests. Otherwise I would expect a write penalty for
>> the HW RAID setup too.
>
> Yes,
>
> [global]
> ioengine=libaio
> iodepth=64
> direct=1
> runtime=1800
> time_based
> group_reporting
> numjobs=48
> gtod_reduce=0
> norandommap
> write_iops_log=fs
>
> [job1]
> bs=640K
> directory=/mnt
> size=5G
> rw=write

I tested XFS writes on MD RAID6 (stripe size 64k) with different
stripe_cache_size values.

stripe_cache_size    throughput (MB/s)
-----------------    -----------------
256                  181.7
512                  185.1
768                  178.3
1024                 194.4
2048                 227
4096                 247
8192                 300.3
16834                312.4
32768                304

For comparison, XFS write throughput on HW RAID6 is 753.16 MB/s.
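
The figures above can be put in proportion with a small calculation. This is
only a sketch over the numbers reported in this message; the variable names
are illustrative and the stripe_cache_size values are copied as written.

```python
# (stripe_cache_size, MD RAID6 throughput in MB/s), as reported above.
md_results = [
    (256, 181.7), (512, 185.1), (768, 178.3), (1024, 194.4), (2048, 227.0),
    (4096, 247.0), (8192, 300.3), (16834, 312.4), (32768, 304.0),
]
hw_raid6 = 753.16  # HW RAID6 throughput in MB/s, as reported above

baseline = md_results[0][1]  # throughput at stripe_cache_size=256
best_size, best = max(md_results, key=lambda row: row[1])

print(f"best MD result: {best} MB/s at stripe_cache_size={best_size}")
print(f"gain over stripe_cache_size=256: {best / baseline:.2f}x")
print(f"fraction of HW RAID6 throughput: {best / hw_raid6:.0%}")
```

Tuning the stripe cache helps noticeably, but the best MD result is still
well under half of the HW RAID6 figure, and the gains flatten out beyond
8192.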
