linux-raid.vger.kernel.org archive mirror
* RAID456 direct I/O write performance
@ 2014-09-04 16:23 Markus Stockhausen
  2014-09-04 16:30 ` AW: " Markus Stockhausen
  0 siblings, 1 reply; 6+ messages in thread
From: Markus Stockhausen @ 2014-09-04 16:23 UTC (permalink / raw)
  To: linux-raid@vger.kernel.org

Hello,

While running some benchmark tests for my patch series I discovered that
direct I/O write performance for a single thread onto a raid456 md device
does not seem to be on par with what I would expect.

A simple 5 minute test run with md on top of /dev/ramX ramdisks 
reveals:

raid5 - 3 disks - chunk 32k - 1 writer
4K random direct IO writes: 5MB/sec (five megabytes per second)

raid5 - 3 disks - chunk 32k - 1 writer
4K random direct IO writes: 600MB/sec

raid5 - 3 disks - chunk 32k - 8 writers
4K random direct IO writes: 45MB/sec 

raid5 - 3 disks - chunk 32k - 8 writers
4K random direct IO writes: 800MB/sec

for comparison:

raid1 - 3 disks - 1 writer
4K random direct IO writes: 150MB/sec 

That works out to roughly a 1ms penalty per md block write (5MB/sec at
4K is roughly 1,250 writes per second, i.e. about 0.8ms per write), or,
to be precise, per 3 device I/Os. The three-disk RAID1, which also issues
3 device I/Os per write, shows reasonable numbers.
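
For reference, here is a minimal sketch of this kind of single-writer
load generator. It is not the actual test tool (the "iot" process in the
profile below); the device path, test area size and request count are
assumptions:

/*
 * Hedged sketch of a 4K random direct I/O writer - not the real "iot"
 * tool.  Assumes /dev/md0 is the raid5 array on the ramdisks and that
 * the first 1 GiB may be overwritten.
 */
#define _GNU_SOURCE                      /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLKSZ 4096
#define SPAN  (1024ULL * 1024 * 1024)    /* 1 GiB test area */

int main(void)
{
        void *buf;
        int fd = open("/dev/md0", O_WRONLY | O_DIRECT);

        if (fd < 0) { perror("open"); return 1; }
        if (posix_memalign(&buf, BLKSZ, BLKSZ)) /* O_DIRECT wants alignment */
                return 1;
        memset(buf, 0xab, BLKSZ);

        srand(1);
        for (long i = 0; i < 100000; i++) {
                off_t off = (off_t)(rand() % (int)(SPAN / BLKSZ)) * BLKSZ;
                if (pwrite(fd, buf, BLKSZ, off) != BLKSZ) {
                        perror("pwrite");
                        return 1;
                }
        }
        close(fd);
        return 0;
}

Built with "gcc -O2 -o iot-sketch iot-sketch.c" (file name made up) and run
against the md device; for the multi-writer case one would simply start
several instances.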

A perf record of the 1 writer test gives:

    38.40%      swapper  [kernel.kallsyms]   [k] default_idle
    13.14%    md0_raid5  [kernel.kallsyms]   [k] _raw_spin_unlock_irqrestore
    13.05%      swapper  [kernel.kallsyms]   [k] tick_nohz_idle_enter
    10.01%          iot  [raid456]           [k] raid5_unplug
     9.06%      swapper  [kernel.kallsyms]   [k] tick_nohz_idle_exit
     3.39%    md0_raid5  [kernel.kallsyms]   [k] __kernel_fpu_begin
     1.67%    md0_raid5  [xor]               [k] xor_sse_2_pf64
     0.87%          iot  [kernel.kallsyms]   [k] finish_task_switch

I'm confused and clueless. In particular, looking at the source of
raid5_unplug I cannot see where its 10% overhead might come from.
Any ideas from someone with better insight?

Markus

^ permalink raw reply	[flat|nested] 6+ messages in thread

* AW: RAID456 direct I/O write performance
  2014-09-04 16:23 RAID456 direct I/O write performance Markus Stockhausen
@ 2014-09-04 16:30 ` Markus Stockhausen
  2014-09-04 21:12   ` Ethan Wilson
  0 siblings, 1 reply; 6+ messages in thread
From: Markus Stockhausen @ 2014-09-04 16:30 UTC (permalink / raw)
  To: linux-raid@vger.kernel.org

Sorry, correction - the second and fourth results should have read:

>
> Hello,
> 
> Doing some benchmark tests for my patch series I discovered that
> direct I/O performance for a single thread onto raid456 md seems
> to be not on par with what I would have expected.
> 
> A simple 5 minute test run with md on top of /dev/ramX ramdisks
> reveals:
> 
> raid5 - 3 disks - chunk 32k - 1 writer
> 4K random direct IO writes: 5MB/sec (five megabytes per second)
> 

raid5 - 3 disks - chunk 32k - 1 writer
4K random writes (*NOT direct IO*): 600MB/sec

>
>raid5 - 3 disks - chunk 32k - 8 writers
>4K random direct IO writes: 45MB/sec
>

raid5 - 3 disks - chunk 32k - 8 writers
4K random writes (*NOT direct IO*): 800MB/sec

> 
> for comparison:
> 
> raid1 - 3 disks - 1 writer
> 4K random direct IO writes: 150MB/sec
> 
> That would be a 1ms penalty per one md block write (or to be
> precise per 3 device I/Os). A similar I/O overhead on the three
> disk RAID1 shows reasonable numbers.
> 
> A perf record of the 1 writer test gives:
> 
>     38.40%      swapper  [kernel.kallsyms]   [k] default_idle
>     13.14%    md0_raid5  [kernel.kallsyms]   [k] _raw_spin_unlock_irqrestore
>     13.05%      swapper  [kernel.kallsyms]   [k] tick_nohz_idle_enter
>     10.01%          iot  [raid456]           [k] raid5_unplug
>      9.06%      swapper  [kernel.kallsyms]   [k] tick_nohz_idle_exit
>      3.39%    md0_raid5  [kernel.kallsyms]   [k] __kernel_fpu_begin
>      1.67%    md0_raid5  [xor]               [k] xor_sse_2_pf64
>      0.87%          iot  [kernel.kallsyms]   [k] finish_task_switch
> 
> I'm confused and clueless. Especially I cannot see where the
> 10% overhead in the source of raid5_unplug might come
> from? Any idea from someone with better insight?
> 

Markus

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: AW: RAID456 direct I/O write performance
  2014-09-04 16:30 ` AW: " Markus Stockhausen
@ 2014-09-04 21:12   ` Ethan Wilson
  2014-09-05 18:06     ` AW: " Markus Stockhausen
  2014-09-09  3:24     ` NeilBrown
  0 siblings, 2 replies; 6+ messages in thread
From: Ethan Wilson @ 2014-09-04 21:12 UTC (permalink / raw)
  To: linux-raid@vger.kernel.org

On 04/09/2014 18:30, Markus Stockhausen wrote:
> A perf record of the 1 writer test gives:
>
>      38.40%      swapper  [kernel.kallsyms]   [k] default_idle
>      13.14%    md0_raid5  [kernel.kallsyms]   [k] _raw_spin_unlock_irqrestore
>      13.05%      swapper  [kernel.kallsyms]   [k] tick_nohz_idle_enter
>      10.01%          iot  [raid456]           [k] raid5_unplug
>       9.06%      swapper  [kernel.kallsyms]   [k] tick_nohz_idle_exit
>       3.39%    md0_raid5  [kernel.kallsyms]   [k] __kernel_fpu_begin
>       1.67%    md0_raid5  [xor]               [k] xor_sse_2_pf64
>       0.87%          iot  [kernel.kallsyms]   [k] finish_task_switch
>
> I'm confused and clueless. Especially I cannot see where the
> 10% overhead in the source of raid5_unplug might come
> from? Any idea from someone with better insight?

I am no kernel developer, but I have read that the CPU time spent serving
interrupts is often accounted to whichever process has the bad luck to be
running when the interrupt arrives and steals the CPU. I read this about
top, htop etc., which probably use a different accounting mechanism than
perf, but maybe something similar happens here, because
_raw_spin_unlock_irqrestore at 13% looks absurd to me.
In fact, probably as soon as interrupts are re-enabled by
_raw_spin_unlock_irqrestore, the CPU often goes off to serve an interrupt
that was queued, and this happens before _raw_spin_unlock_irqrestore
returns, so the time is really accounted there and that's why it's so high.

OTOH I would like to ask the kernel experts one thing, if I may: does
anybody know a way to get a stack trace for a process that is currently
running in kernel mode on a CPU right now, rather than sleeping in a
queue? I know about /proc/pid/stack, but that shows 0xffffffffffffffff
in such a case. Being able to do that would help answer the above
question too...

Thanks
EW


^ permalink raw reply	[flat|nested] 6+ messages in thread

* AW: AW: RAID456 direct I/O write performance
  2014-09-04 21:12   ` Ethan Wilson
@ 2014-09-05 18:06     ` Markus Stockhausen
  2014-09-06 19:46       ` Markus Stockhausen
  2014-09-09  3:24     ` NeilBrown
  1 sibling, 1 reply; 6+ messages in thread
From: Markus Stockhausen @ 2014-09-05 18:06 UTC (permalink / raw)
  To: Ethan Wilson, linux-raid@vger.kernel.org

> From: Ethan Wilson [ethan.wilson@shiftmail.org]
> Sent: Thursday, 4 September 2014 23:12
> To: linux-raid@vger.kernel.org
> Subject: Re: AW: RAID456 direct I/O write performance
> 
> On 04/09/2014 18:30, Markus Stockhausen wrote:
> > A perf record of the 1 writer test gives:
> >
> >      38.40%      swapper  [kernel.kallsyms]   [k] default_idle
> >      13.14%    md0_raid5  [kernel.kallsyms]   [k] _raw_spin_unlock_irqrestore
> >      13.05%      swapper  [kernel.kallsyms]   [k] tick_nohz_idle_enter
> >      10.01%          iot  [raid456]           [k] raid5_unplug
> >       9.06%      swapper  [kernel.kallsyms]   [k] tick_nohz_idle_exit
> >       3.39%    md0_raid5  [kernel.kallsyms]   [k] __kernel_fpu_begin
> >       1.67%    md0_raid5  [xor]               [k] xor_sse_2_pf64
> >       0.87%          iot  [kernel.kallsyms]   [k] finish_task_switch
> >
> > I'm confused and clueless. Especially I cannot see where the
> > 10% overhead in the source of raid5_unplug might come
> > from? Any idea from someone with better insight?
> 
> I am no kernel developer but I have read that the CPU time for serving
> interrupts is often accounted to the random process which has the bad
> luck to be running at the time the interrupt comes and steals the CPU. I
> read this for top, htop etc, which have probably a different accounting
> mechanism than perf, but maybe something similar happens here, because
> _raw_spin_unlock_irqrestore at 13% looks too absurd to me.
> In fact, probably as soon as the interrupts are re-enabled by
> _raw_spin_unlock_irqrestore, the CPU often goes serving one interrupt
> that was queued, and this is before the function
> _raw_spin_unlock_irqrestore exits, so the time is really accounted there
> and that's why it's so high.

Thanks Ethan for the explanation. 

In the meantime I found the reason for the slowness. RAID5 handles direct
I/O write requests sequentially, with each one restarting the raid5d()
worker thread (more like sync I/O). The raid1d() worker thread can batch
those requests. Maybe Neil can tell us whether that behaviour is intended.


^ permalink raw reply	[flat|nested] 6+ messages in thread

* AW: AW: RAID456 direct I/O write performance
  2014-09-05 18:06     ` AW: " Markus Stockhausen
@ 2014-09-06 19:46       ` Markus Stockhausen
  0 siblings, 0 replies; 6+ messages in thread
From: Markus Stockhausen @ 2014-09-06 19:46 UTC (permalink / raw)
  To: Ethan Wilson, linux-raid@vger.kernel.org

> From: Markus Stockhausen
> Sent: Friday, 5 September 2014 20:06
> To: Ethan Wilson; linux-raid@vger.kernel.org
> Subject: AW: AW: RAID456 direct I/O write performance
> 
> > From: Ethan Wilson [ethan.wilson@shiftmail.org]
> > Sent: Thursday, 4 September 2014 23:12
> > To: linux-raid@vger.kernel.org
> > Subject: Re: AW: RAID456 direct I/O write performance
> >
> > On 04/09/2014 18:30, Markus Stockhausen wrote:
> > > A perf record of the 1 writer test gives:
> > >
> > >      38.40%      swapper  [kernel.kallsyms]   [k] default_idle
> > >      13.14%    md0_raid5  [kernel.kallsyms]   [k] _raw_spin_unlock_irqrestore
> > >      13.05%      swapper  [kernel.kallsyms]   [k] tick_nohz_idle_enter
> > >      10.01%          iot  [raid456]           [k] raid5_unplug
> > >       9.06%      swapper  [kernel.kallsyms]   [k] tick_nohz_idle_exit
> > >       3.39%    md0_raid5  [kernel.kallsyms]   [k] __kernel_fpu_begin
> > >       1.67%    md0_raid5  [xor]               [k] xor_sse_2_pf64
> > >       0.87%          iot  [kernel.kallsyms]   [k] finish_task_switch
> > >
> > > I'm confused and clueless. Especially I cannot see where the
> > > 10% overhead in the source of raid5_unplug might come
> > > from? Any idea from someone with better insight?
> >
> > I am no kernel developer but I have read that the CPU time for serving
> > interrupts is often accounted to the random process which has the bad
> > luck to be running at the time the interrupt comes and steals the CPU. I
> > read this for top, htop etc, which have probably a different accounting
> > mechanism than perf, but maybe something similar happens here, because
> > _raw_spin_unlock_irqrestore at 13% looks too absurd to me.
> > In fact, probably as soon as the interrupts are re-enabled by
> > _raw_spin_unlock_irqrestore, the CPU often goes serving one interrupt
> > that was queued, and this is before the function
> > _raw_spin_unlock_irqrestore exits, so the time is really accounted there
> > and that's why it's so high.
> 
> Thanks Ethan for the explanation.
> 
> In the meantime I found the reason for the slowness. RAID5 handles direct
> I/O write requests sequentially, with each one restarting the raid5d()
> worker thread (more like sync I/O). The raid1d() worker thread can batch
> those requests. Maybe Neil can tell us whether that behaviour is intended.

Hopefully right this time - correcting myself so as not to spread wrong
information: in the raid1 case the kernel writes via raid1_unplug(), and
the block handling is done within that function, in the calling thread.
In the raid5 case the kernel calls raid5_unplug() for the writeout
handling, but inside there is only logic that waits for the raid5d()
thread to complete the work. So the numbers seem to be reasonable.
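
To make the cost of that hand-off concrete, here is a small user-space toy
model - purely illustrative, not kernel code, and all names in it are made
up: the raid1-like path does the work inline in the submitting thread, the
raid5-like path wakes a single worker thread per request and waits for it:

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static bool pending, done, stop;

static void handle_request(void) { /* stands for parity + member writes */ }

static void *worker(void *arg)    /* stands for the raid5d() thread */
{
        (void)arg;
        pthread_mutex_lock(&lock);
        while (!stop) {
                while (!pending && !stop)
                        pthread_cond_wait(&cond, &lock);
                if (stop)
                        break;
                pending = false;
                handle_request();
                done = true;
                pthread_cond_broadcast(&cond);
        }
        pthread_mutex_unlock(&lock);
        return NULL;
}

int main(void)
{
        pthread_t t;
        pthread_create(&t, NULL, worker, NULL);

        for (int i = 0; i < 100000; i++) {
                /* raid5-like hand-off: one wakeup plus one wait per
                 * request, so a single direct I/O writer pays two
                 * context switches before it can issue the next write. */
                pthread_mutex_lock(&lock);
                pending = true;
                done = false;
                pthread_cond_broadcast(&cond);
                while (!done)
                        pthread_cond_wait(&cond, &lock);
                pthread_mutex_unlock(&lock);

                /* the raid1-like path would just call handle_request()
                 * here, in the writer itself, with no switch at all.    */
        }

        pthread_mutex_lock(&lock);
        stop = true;
        pthread_cond_broadcast(&cond);
        pthread_mutex_unlock(&lock);
        pthread_join(t, NULL);
        return 0;
}

Timing the loop in both variants (e.g. with clock_gettime()) shows the
per-request round-trip cost that batching would amortise.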

A funny side effect: a slow sequential direct I/O or sync writer to an
md raid456 array on fast devices can be sped up by starting an
(independent) reader - even one reading from non-md devices. Maybe
because of thread scheduling optimization.

Markus

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: RAID456 direct I/O write performance
  2014-09-04 21:12   ` Ethan Wilson
  2014-09-05 18:06     ` AW: " Markus Stockhausen
@ 2014-09-09  3:24     ` NeilBrown
  1 sibling, 0 replies; 6+ messages in thread
From: NeilBrown @ 2014-09-09  3:24 UTC (permalink / raw)
  To: Ethan Wilson; +Cc: linux-raid@vger.kernel.org

On Thu, 04 Sep 2014 23:12:44 +0200 Ethan Wilson <ethan.wilson@shiftmail.org>
wrote:

> OTOH I would like to ask kernel experts one thing if I may: does anybody 
> know a way to get a stack trace for a process which is currently running 
> in kernel mode and is running NOW on a CPU and it is not stopped waiting 
> in a queue? I know about /proc/pid/stack but that one shows 
> 0xffffffffffffffff for such a case. Being able to do that would help to 
> answer the above question too...

The contents of the stack would change while it was being inspected, so it
would be impossible to get a meaningful trace.

Maybe you could disable all but one CPU.  Then whenever you try to look at
the stack of another process it must have scheduled and so will have a
stable visible stack...

Or maybe you could use "perf record -g" to get some stack information, I'm
not sure.

NeilBrown

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2014-09-09  3:24 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-09-04 16:23 RAID456 direct I/O write performance Markus Stockhausen
2014-09-04 16:30 ` AW: " Markus Stockhausen
2014-09-04 21:12   ` Ethan Wilson
2014-09-05 18:06     ` AW: " Markus Stockhausen
2014-09-06 19:46       ` Markus Stockhausen
2014-09-09  3:24     ` NeilBrown
