* RAID456 direct I/O write performance
@ 2014-09-04 16:23 Markus Stockhausen
2014-09-04 16:30 ` AW: " Markus Stockhausen
0 siblings, 1 reply; 6+ messages in thread
From: Markus Stockhausen @ 2014-09-04 16:23 UTC (permalink / raw)
To: linux-raid@vger.kernel.org
[-- Attachment #1: Type: text/plain, Size: 1667 bytes --]
Hello,
Doing some benchmark tests for my patch series I discovered that
direct I/O performance for a single thread onto raid456 md seems
to be not on par with what I would have expected.
A simple 5 minute test run with md on top of /dev/ramX ramdisks
reveals:
raid5 - 3 disks - chunk 32k - 1 writer
4K random direct IO writes: 5MB/sec (five megabytes per second)
raid5 - 3 disks - chunk 32k - 1 writer
4K random direct IO writes: 600MB/sec
raid5 - 3 disks - chunk 32k - 8 writers
4K random direct IO writes: 45MB/sec
raid5 - 3 disks - chunk 32k - 8 writers
4K random direct IO writes: 800MB/sec
for comparison:
raid1 - 3 disks - 1 writer
4K random direct IO writes: 150MB/sec
That would be a 1ms penalty per one md block write (or to be
precise per 3 device I/Os). A similar I/O overhead on the three
disk RAID1 shows reasonable numbers.
A perf record of the 1 writer test gives:
38.40% swapper [kernel.kallsyms] [k] default_idle
13.14% md0_raid5 [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
13.05% swapper [kernel.kallsyms] [k] tick_nohz_idle_enter
10.01% iot [raid456] [k] raid5_unplug
9.06% swapper [kernel.kallsyms] [k] tick_nohz_idle_exit
3.39% md0_raid5 [kernel.kallsyms] [k] __kernel_fpu_begin
1.67% md0_raid5 [xor] [k] xor_sse_2_pf64
0.87% iot [kernel.kallsyms] [k] finish_task_switch
I'm confused and clueless. Especially I cannot see where the
10% overhead in the source of raid5_unplug might come
from? Any idea from someone with better insight?
Markus
=
[-- Attachment #2: InterScan_Disclaimer.txt --]
[-- Type: text/plain, Size: 1650 bytes --]
****************************************************************************
Diese E-Mail enthält vertrauliche und/oder rechtlich geschützte
Informationen. Wenn Sie nicht der richtige Adressat sind oder diese E-Mail
irrtümlich erhalten haben, informieren Sie bitte sofort den Absender und
vernichten Sie diese Mail. Das unerlaubte Kopieren sowie die unbefugte
Weitergabe dieser Mail ist nicht gestattet.
Ãber das Internet versandte E-Mails können unter fremden Namen erstellt oder
manipuliert werden. Deshalb ist diese als E-Mail verschickte Nachricht keine
rechtsverbindliche Willenserklärung.
Collogia
Unternehmensberatung AG
Ubierring 11
D-50678 Köln
Vorstand:
Kadir Akin
Dr. Michael Höhnerbach
Vorsitzender des Aufsichtsrates:
Hans Kristian Langva
Registergericht: Amtsgericht Köln
Registernummer: HRB 52 497
This e-mail may contain confidential and/or privileged information. If you
are not the intended recipient (or have received this e-mail in error)
please notify the sender immediately and destroy this e-mail. Any
unauthorized copying, disclosure or distribution of the material in this
e-mail is strictly forbidden.
e-mails sent over the internet may have been written under a wrong name or
been manipulated. That is why this message sent as an e-mail is not a
legally binding declaration of intention.
Collogia
Unternehmensberatung AG
Ubierring 11
D-50678 Köln
executive board:
Kadir Akin
Dr. Michael Höhnerbach
President of the supervisory board:
Hans Kristian Langva
Registry office: district court Cologne
Register number: HRB 52 497
****************************************************************************
^ permalink raw reply [flat|nested] 6+ messages in thread
* AW: RAID456 direct I/O write performance
2014-09-04 16:23 RAID456 direct I/O write performance Markus Stockhausen
@ 2014-09-04 16:30 ` Markus Stockhausen
2014-09-04 21:12 ` Ethan Wilson
0 siblings, 1 reply; 6+ messages in thread
From: Markus Stockhausen @ 2014-09-04 16:30 UTC (permalink / raw)
To: linux-raid@vger.kernel.org
[-- Attachment #1: Type: text/plain, Size: 1785 bytes --]
Sorry must be
>
> Hello,
>
> Doing some benchmark tests for my patch series I discovered that
> direct I/O performance for a single thread onto raid456 md seems
> to be not on par with what I would have expected.
>
> A simple 5 minute test run with md on top of /dev/ramX ramdisks
> reveals:
>
> raid5 - 3 disks - chunk 32k - 1 writer
> 4K random direct IO writes: 5MB/sec (five megabytes per second)
>
raid5 - 3 disks - chunk 32k - 1 writer
4K random writes (*NOT direct IO*): 600MB/sec
>
>raid5 - 3 disks - chunk 32k - 8 writers
>4K random direct IO writes: 45MB/sec
>
raid5 - 3 disks - chunk 32k - 8 writers
4K random writes (*NOT direct IO*): 800MB/sec
>
> for comparison:
>
> raid1 - 3 disks - 1 writer
> 4K random direct IO writes: 150MB/sec
>
> That would be a 1ms penalty per one md block write (or to be
> precise per 3 device I/Os). A similar I/O overhead on the three
> disk RAID1 shows reasonable numbers.
>
> A perf record of the 1 writer test gives:
>
> 38.40% swapper [kernel.kallsyms] [k] default_idle
> 13.14% md0_raid5 [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
> 13.05% swapper [kernel.kallsyms] [k] tick_nohz_idle_enter
> 10.01% iot [raid456] [k] raid5_unplug
> 9.06% swapper [kernel.kallsyms] [k] tick_nohz_idle_exit
> 3.39% md0_raid5 [kernel.kallsyms] [k] __kernel_fpu_begin
> 1.67% md0_raid5 [xor] [k] xor_sse_2_pf64
> 0.87% iot [kernel.kallsyms] [k] finish_task_switch
>
> I'm confused and clueless. Especially I cannot see where the
> 10% overhead in the source of raid5_unplug might come
> from? Any idea from someone with better insight?
>
Markus
=
[-- Attachment #2: InterScan_Disclaimer.txt --]
[-- Type: text/plain, Size: 1650 bytes --]
****************************************************************************
Diese E-Mail enthält vertrauliche und/oder rechtlich geschützte
Informationen. Wenn Sie nicht der richtige Adressat sind oder diese E-Mail
irrtümlich erhalten haben, informieren Sie bitte sofort den Absender und
vernichten Sie diese Mail. Das unerlaubte Kopieren sowie die unbefugte
Weitergabe dieser Mail ist nicht gestattet.
Ãber das Internet versandte E-Mails können unter fremden Namen erstellt oder
manipuliert werden. Deshalb ist diese als E-Mail verschickte Nachricht keine
rechtsverbindliche Willenserklärung.
Collogia
Unternehmensberatung AG
Ubierring 11
D-50678 Köln
Vorstand:
Kadir Akin
Dr. Michael Höhnerbach
Vorsitzender des Aufsichtsrates:
Hans Kristian Langva
Registergericht: Amtsgericht Köln
Registernummer: HRB 52 497
This e-mail may contain confidential and/or privileged information. If you
are not the intended recipient (or have received this e-mail in error)
please notify the sender immediately and destroy this e-mail. Any
unauthorized copying, disclosure or distribution of the material in this
e-mail is strictly forbidden.
e-mails sent over the internet may have been written under a wrong name or
been manipulated. That is why this message sent as an e-mail is not a
legally binding declaration of intention.
Collogia
Unternehmensberatung AG
Ubierring 11
D-50678 Köln
executive board:
Kadir Akin
Dr. Michael Höhnerbach
President of the supervisory board:
Hans Kristian Langva
Registry office: district court Cologne
Register number: HRB 52 497
****************************************************************************
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: AW: RAID456 direct I/O write performance
2014-09-04 16:30 ` AW: " Markus Stockhausen
@ 2014-09-04 21:12 ` Ethan Wilson
2014-09-05 18:06 ` AW: " Markus Stockhausen
2014-09-09 3:24 ` NeilBrown
0 siblings, 2 replies; 6+ messages in thread
From: Ethan Wilson @ 2014-09-04 21:12 UTC (permalink / raw)
To: linux-raid@vger.kernel.org
On 04/09/2014 18:30, Markus Stockhausen wrote:
> A perf record of the 1 writer test gives:
>
> 38.40% swapper [kernel.kallsyms] [k] default_idle
> 13.14% md0_raid5 [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
> 13.05% swapper [kernel.kallsyms] [k] tick_nohz_idle_enter
> 10.01% iot [raid456] [k] raid5_unplug
> 9.06% swapper [kernel.kallsyms] [k] tick_nohz_idle_exit
> 3.39% md0_raid5 [kernel.kallsyms] [k] __kernel_fpu_begin
> 1.67% md0_raid5 [xor] [k] xor_sse_2_pf64
> 0.87% iot [kernel.kallsyms] [k] finish_task_switch
>
> I'm confused and clueless. Especially I cannot see where the
> 10% overhead in the source of raid5_unplug might come
> from? Any idea from someone with better insight?
I am no kernel developer but I have read that the CPU time for serving
interrupts is often accounted to the random process which has the bad
luck to be running at the time the interrupt comes and steals the CPU. I
read this for top, htop etc, which have probably a different accounting
mechanism than perf, but maybe something similar happens here, because
_raw_spin_unlock_irqrestore at 13% looks too absurd to me.
In fact, probably as soon as the interrupts are re-enabled by
_raw_spin_unlock_irqrestore, the CPU often goes serving one interrupt
that was queued, and this is before the function
_raw_spin_unlock_irqrestore exits, so the time is really accounted there
and that's why it's so high.
OTOH I would like to ask kernel experts one thing if I may: does anybody
know a way to get a stack trace for a process which is currently running
in kernel mode and is running NOW on a CPU and it is not stopped waiting
in a queue? I know about /proc/pid/stack but that one shows
0xffffffffffffffff for such a case. Being able to do that would help to
answer the above question too...
Thanks
EW
^ permalink raw reply [flat|nested] 6+ messages in thread
* AW: AW: RAID456 direct I/O write performance
2014-09-04 21:12 ` Ethan Wilson
@ 2014-09-05 18:06 ` Markus Stockhausen
2014-09-06 19:46 ` Markus Stockhausen
2014-09-09 3:24 ` NeilBrown
1 sibling, 1 reply; 6+ messages in thread
From: Markus Stockhausen @ 2014-09-05 18:06 UTC (permalink / raw)
To: Ethan Wilson, linux-raid@vger.kernel.org
[-- Attachment #1: Type: text/plain, Size: 2129 bytes --]
> Von: Ethan Wilson [ethan.wilson@shiftmail.org]
> Gesendet: Donnerstag, 4. September 2014 23:12
> An: linux-raid@vger.kernel.org
> Betreff: Re: AW: RAID456 direct I/O write performance
>
> On 04/09/2014 18:30, Markus Stockhausen wrote:
> > A perf record of the 1 writer test gives:
> >
> > 38.40% swapper [kernel.kallsyms] [k] default_idle
> > 13.14% md0_raid5 [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
> > 13.05% swapper [kernel.kallsyms] [k] tick_nohz_idle_enter
> > 10.01% iot [raid456] [k] raid5_unplug
> > 9.06% swapper [kernel.kallsyms] [k] tick_nohz_idle_exit
> > 3.39% md0_raid5 [kernel.kallsyms] [k] __kernel_fpu_begin
> > 1.67% md0_raid5 [xor] [k] xor_sse_2_pf64
> > 0.87% iot [kernel.kallsyms] [k] finish_task_switch
> >
> > I'm confused and clueless. Especially I cannot see where the
> > 10% overhead in the source of raid5_unplug might come
> > from? Any idea from someone with better insight?
>
> I am no kernel developer but I have read that the CPU time for serving
> interrupts is often accounted to the random process which has the bad
> luck to be running at the time the interrupt comes and steals the CPU. I
> read this for top, htop etc, which have probably a different accounting
> mechanism than perf, but maybe something similar happens here, because
> _raw_spin_unlock_irqrestore at 13% looks too absurd to me.
> In fact, probably as soon as the interrupts are re-enabled by
> _raw_spin_unlock_irqrestore, the CPU often goes serving one interrupt
> that was queued, and this is before the function
> _raw_spin_unlock_irqrestore exits, so the time is really accounted there
> and that's why it's so high.
Thanks Ethan for the explanation.
In between I found the reason for the slowness. RAID5 handles direct IO
write requests sequentially with each one restarting the raid5d() worker
thread (more lke sync IO). The raid1d() worker thread can batch those
requests. Maybe Neil can tell if that behaviour is desired.
=
[-- Attachment #2: InterScan_Disclaimer.txt --]
[-- Type: text/plain, Size: 1650 bytes --]
****************************************************************************
Diese E-Mail enthält vertrauliche und/oder rechtlich geschützte
Informationen. Wenn Sie nicht der richtige Adressat sind oder diese E-Mail
irrtümlich erhalten haben, informieren Sie bitte sofort den Absender und
vernichten Sie diese Mail. Das unerlaubte Kopieren sowie die unbefugte
Weitergabe dieser Mail ist nicht gestattet.
Ãber das Internet versandte E-Mails können unter fremden Namen erstellt oder
manipuliert werden. Deshalb ist diese als E-Mail verschickte Nachricht keine
rechtsverbindliche Willenserklärung.
Collogia
Unternehmensberatung AG
Ubierring 11
D-50678 Köln
Vorstand:
Kadir Akin
Dr. Michael Höhnerbach
Vorsitzender des Aufsichtsrates:
Hans Kristian Langva
Registergericht: Amtsgericht Köln
Registernummer: HRB 52 497
This e-mail may contain confidential and/or privileged information. If you
are not the intended recipient (or have received this e-mail in error)
please notify the sender immediately and destroy this e-mail. Any
unauthorized copying, disclosure or distribution of the material in this
e-mail is strictly forbidden.
e-mails sent over the internet may have been written under a wrong name or
been manipulated. That is why this message sent as an e-mail is not a
legally binding declaration of intention.
Collogia
Unternehmensberatung AG
Ubierring 11
D-50678 Köln
executive board:
Kadir Akin
Dr. Michael Höhnerbach
President of the supervisory board:
Hans Kristian Langva
Registry office: district court Cologne
Register number: HRB 52 497
****************************************************************************
^ permalink raw reply [flat|nested] 6+ messages in thread
* AW: AW: RAID456 direct I/O write performance
2014-09-05 18:06 ` AW: " Markus Stockhausen
@ 2014-09-06 19:46 ` Markus Stockhausen
0 siblings, 0 replies; 6+ messages in thread
From: Markus Stockhausen @ 2014-09-06 19:46 UTC (permalink / raw)
To: Ethan Wilson, linux-raid@vger.kernel.org
[-- Attachment #1: Type: text/plain, Size: 3012 bytes --]
> Von: Markus Stockhausen
> Gesendet: Freitag, 5. September 2014 20:06
> An: Ethan Wilson; linux-raid@vger.kernel.org
> Betreff: AW: AW: RAID456 direct I/O write performance
>
> > Von: Ethan Wilson [ethan.wilson@shiftmail.org]
> > Gesendet: Donnerstag, 4. September 2014 23:12
> > An: linux-raid@vger.kernel.org
> > Betreff: Re: AW: RAID456 direct I/O write performance
> >
> > On 04/09/2014 18:30, Markus Stockhausen wrote:
> > > A perf record of the 1 writer test gives:
> > >
> > > 38.40% swapper [kernel.kallsyms] [k] default_idle
> > > 13.14% md0_raid5 [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
> > > 13.05% swapper [kernel.kallsyms] [k] tick_nohz_idle_enter
> > > 10.01% iot [raid456] [k] raid5_unplug
> > > 9.06% swapper [kernel.kallsyms] [k] tick_nohz_idle_exit
> > > 3.39% md0_raid5 [kernel.kallsyms] [k] __kernel_fpu_begin
> > > 1.67% md0_raid5 [xor] [k] xor_sse_2_pf64
> > > 0.87% iot [kernel.kallsyms] [k] finish_task_switch
> > >
> > > I'm confused and clueless. Especially I cannot see where the
> > > 10% overhead in the source of raid5_unplug might come
> > > from? Any idea from someone with better insight?
> >
> > I am no kernel developer but I have read that the CPU time for serving
> > interrupts is often accounted to the random process which has the bad
> > luck to be running at the time the interrupt comes and steals the CPU. I
> > read this for top, htop etc, which have probably a different accounting
> > mechanism than perf, but maybe something similar happens here, because
> > _raw_spin_unlock_irqrestore at 13% looks too absurd to me.
> > In fact, probably as soon as the interrupts are re-enabled by
> > _raw_spin_unlock_irqrestore, the CPU often goes serving one interrupt
> > that was queued, and this is before the function
> > _raw_spin_unlock_irqrestore exits, so the time is really accounted there
> > and that's why it's so high.
>
> Thanks Ethan for the explanation.
>
> In between I found the reason for the slowness. RAID5 handles direct IO
> write requests sequentially with each one restarting the raid5d() worker
> thread (more lke sync IO). The raid1d() worker thread can batch those
> requests. Maybe Neil can tell if that behaviour is desired.
Hopefully right this time - correcting myself to not spread wrong
information: In raid1 case kernel writes via calling raid1_unplug(). Block
handling is done within the function inside the calling thread. In raid5
case kernel calls raid5_unplug() for writeout handling. Inside we just
have waiting logic for the raid5d() thread completion. So the numbers
seem to be reasonable.
A funny side effect is: A slow sequential directio or sync writer to a md
raid456 on fast devices can be sped up by starting a (independent)
reader - even to non md devices. Maybe because of thread scheduling
optimization.
Markus
=
[-- Attachment #2: InterScan_Disclaimer.txt --]
[-- Type: text/plain, Size: 1650 bytes --]
****************************************************************************
Diese E-Mail enthält vertrauliche und/oder rechtlich geschützte
Informationen. Wenn Sie nicht der richtige Adressat sind oder diese E-Mail
irrtümlich erhalten haben, informieren Sie bitte sofort den Absender und
vernichten Sie diese Mail. Das unerlaubte Kopieren sowie die unbefugte
Weitergabe dieser Mail ist nicht gestattet.
Ãber das Internet versandte E-Mails können unter fremden Namen erstellt oder
manipuliert werden. Deshalb ist diese als E-Mail verschickte Nachricht keine
rechtsverbindliche Willenserklärung.
Collogia
Unternehmensberatung AG
Ubierring 11
D-50678 Köln
Vorstand:
Kadir Akin
Dr. Michael Höhnerbach
Vorsitzender des Aufsichtsrates:
Hans Kristian Langva
Registergericht: Amtsgericht Köln
Registernummer: HRB 52 497
This e-mail may contain confidential and/or privileged information. If you
are not the intended recipient (or have received this e-mail in error)
please notify the sender immediately and destroy this e-mail. Any
unauthorized copying, disclosure or distribution of the material in this
e-mail is strictly forbidden.
e-mails sent over the internet may have been written under a wrong name or
been manipulated. That is why this message sent as an e-mail is not a
legally binding declaration of intention.
Collogia
Unternehmensberatung AG
Ubierring 11
D-50678 Köln
executive board:
Kadir Akin
Dr. Michael Höhnerbach
President of the supervisory board:
Hans Kristian Langva
Registry office: district court Cologne
Register number: HRB 52 497
****************************************************************************
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: RAID456 direct I/O write performance
2014-09-04 21:12 ` Ethan Wilson
2014-09-05 18:06 ` AW: " Markus Stockhausen
@ 2014-09-09 3:24 ` NeilBrown
1 sibling, 0 replies; 6+ messages in thread
From: NeilBrown @ 2014-09-09 3:24 UTC (permalink / raw)
To: Ethan Wilson; +Cc: linux-raid@vger.kernel.org
[-- Attachment #1: Type: text/plain, Size: 899 bytes --]
On Thu, 04 Sep 2014 23:12:44 +0200 Ethan Wilson <ethan.wilson@shiftmail.org>
wrote:
> OTOH I would like to ask kernel experts one thing if I may: does anybody
> know a way to get a stack trace for a process which is currently running
> in kernel mode and is running NOW on a CPU and it is not stopped waiting
> in a queue? I know about /proc/pid/stack but that one shows
> 0xffffffffffffffff for such a case. Being able to do that would help to
> answer the above question too...
The contents of the stack would change while it was being inspected, so it
would be impossible to get a meaningful trace.
Maybe you could disable all but one CPU. Then whenever you try to look at
the stack of another process it must have scheduled and so will have a
stable visible stack...
Or maybe you would use "perf record -g" to get some stack information, I'm
not sure.
NeilBrown
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2014-09-09 3:24 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-09-04 16:23 RAID456 direct I/O write performance Markus Stockhausen
2014-09-04 16:30 ` AW: " Markus Stockhausen
2014-09-04 21:12 ` Ethan Wilson
2014-09-05 18:06 ` AW: " Markus Stockhausen
2014-09-06 19:46 ` Markus Stockhausen
2014-09-09 3:24 ` NeilBrown
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).