* sata_mv performance; impact of NCQ
From: Jody McIntyre @ 2008-04-26 0:16 UTC
To: linux-ide
I have completed several performance tests of the Marvell SATA
controllers (6x MV88SX6081) in a Sun x4500 (Thumper) system. The
complete results are available from:
http://downloads.lustre.org/people/scjody/thumper-kernel-comparison-01.ods
My ultimate goal is to replace mv_sata (the out-of-tree vendor driver)
on RHEL 4 with sata_mv on a modern kernel, and to do this I need
equivalent or better performance, especially on large (1MB) IOs.
I note the recent changes to enable NCQ result in a net performance gain
for 64K and 128K IOs, but due to a chipset limitation in the 6xxx series
(according to commit f273827e2aadcf2f74a7bdc9ad715a1b20ea7dda),
max_sectors is now limited, which means we can no longer perform IOs
greater than 128K (ENOMEM is returned from an sg write). Therefore
large IO performance suffers - for example, the performance of 2.6.25
with NCQ support removed on 1MB IOs is better than anything possible
with stock 2.6.25 for many workloads.
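(For reference, my understanding is that the restriction amounts to
something like the following in the driver's dev_config hook - this is a
paraphrase of the idea rather than the exact code from that commit;
ATA_MAX_SECTORS is 256, i.e. 128K with 512-byte sectors:)

#include <linux/libata.h>

/* Paraphrase of the Gen-II restriction: NCQ commands on these chips
 * apparently cannot carry a full 16-bit sector count, so NCQ-capable
 * devices get capped at ATA_MAX_SECTORS (256 sectors = 128K). */
static void mv6_dev_config(struct ata_device *adev)
{
	if (adev->flags & ATA_DFLAG_NCQ) {
		if (adev->max_sectors > ATA_MAX_SECTORS)
			adev->max_sectors = ATA_MAX_SECTORS;
	}
}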
Would it be worth re-enabling large IOs on this hardware when NCQ is
disabled (using the queue_depth /proc variable)? If so I'll come up
with a patch.
Does anyone know what mv_sata does about NCQ? I see references to NCQ
throughout the code but I don't yet understand it enough to determine
what's going on. mv_sata _does_ support IOs greater than 128K, which
suggests that it does not use NCQ on this hardware at least.
Any advice on areas to explore to improve sata_mv's performance? I
imagine I need to understand what mv_sata does differently, and plan on
spending some time reading that code, but I'd appreciate more specific
ideas if anyone has them.
Details on the tests I performed:
I used the sgpdd-survey tool, a low level performance test from the
Lustre iokit. It can be downloaded from:
http://downloads.lustre.org/public/tools/lustre-iokit/ . The tool
performs timed sgp_dd commands using various IO sizes, region counts,
and thread counts and reports aggregate bandwidth. These results were
then graphed using a spreadsheet.
Note that the largest thread count measured is not the same on all
graphs. For some reason, write() to the sg device returns ENOMEM to
some sgp_dd threads when large numbers of threads are run on recent
kernels. The problem does not exist with the RHEL 4 kernel. I have not
yet investigated why this happens.
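(In case anyone wants to poke at this outside of sgp_dd, here is a quick
sketch of issuing one large read through the sg driver. It uses the
synchronous SG_IO ioctl rather than the async write()/read() interface
sgp_dd uses, and /dev/sg0 is just a placeholder device; on a kernel where
max_sectors is capped at 256 sectors, I'd expect a 1MB request like this
to be rejected, though possibly with a different errno than the ENOMEM
seen from write().)

#include <fcntl.h>
#include <scsi/sg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const char *dev = argc > 1 ? argv[1] : "/dev/sg0"; /* example path */
	unsigned int blocks = 2048;        /* 2048 * 512 bytes = 1MB */
	unsigned char cdb[10] = { 0x28 };  /* READ(10), starting at LBA 0 */
	unsigned char sense[32];
	struct sg_io_hdr io;
	unsigned char *buf;
	int fd;

	fd = open(dev, O_RDWR);
	if (fd < 0) { perror("open"); return 1; }
	buf = malloc(blocks * 512);
	if (!buf) { perror("malloc"); return 1; }

	cdb[7] = (blocks >> 8) & 0xff;     /* transfer length in blocks */
	cdb[8] = blocks & 0xff;

	memset(&io, 0, sizeof(io));
	io.interface_id = 'S';
	io.cmd_len = sizeof(cdb);
	io.cmdp = cdb;
	io.dxfer_direction = SG_DXFER_FROM_DEV;
	io.dxfer_len = blocks * 512;
	io.dxferp = buf;
	io.mx_sb_len = sizeof(sense);
	io.sbp = sense;
	io.timeout = 20000;                /* milliseconds */

	/* Expected to fail once the request exceeds the max_sectors limit. */
	if (ioctl(fd, SG_IO, &io) < 0)
		perror("SG_IO");
	else
		printf("status=0x%x resid=%d\n", io.status, io.resid);

	free(buf);
	close(fd);
	return 0;
}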
Cheers,
Jody
--
Jody McIntyre - Linux Kernel Engineer, Sun HPC
* Re: sata_mv performance; impact of NCQ
From: Grant Grundler @ 2008-04-26 3:13 UTC
To: Jody McIntyre; +Cc: linux-ide
On Fri, Apr 25, 2008 at 5:16 PM, Jody McIntyre <scjody@sun.com> wrote:
> I have completed several performance tests of the Marvell SATA
> controllers (6x MV88SX6081) in a Sun x4500 (Thumper) system. The
> complete results are available from:
> http://downloads.lustre.org/people/scjody/thumper-kernel-comparison-01.ods
>
> My ultimate goal is to replace mv_sata (the out-of-tree vendor driver)
> on RHEL 4 with sata_mv on a modern kernel, and to do this I need
> equivalent or better performance, especially on large (1MB) IOs.
>
> I note the recent changes to enable NCQ result in a net performance gain
> for 64K and 128K IOs, but due to a chipset limitation in the 6xxx series
> (according to commit f273827e2aadcf2f74a7bdc9ad715a1b20ea7dda),
> max_sectors is now limited which means we can no longer perform IOs
> greater than 128K (ENOMEM is returned from an sg write.) Therefore
> large IO performance suffers
I would not jump to that conclusion. 4-5 years ago I measured CPU
overhead and throughput vs block size for u320 SCSI and the
difference was minimal once block size exceeded 64K. By minimal
I mean low single digit differences. I haven't measured large block IO
recently for SCSI (or SATA) where it might be different.
Recently I measured sequential 4K IOs performance (goal was to
determine CPU overhead in the device drivers) and was surprised
to find that performance peaked at 4 threads per disk. 8 threads per
disk got about the same throughput as 1 thread.
For sequential reads the disk device firmware should recognize the
"stream" and read ahead to maintain media rate. My guess is more
threads will confuse this strategy and performance will suffer. This
behavior depends on the drive firmware and will vary with the IO
request size and with how the disk firmware segments its read
buffers. And in this case, the firmware is known to have some
"OS specific optimizations" (not Linux), so I'm not sure whether that
makes a difference as well.
For sequential writes, I'd be very surprised if the same is true since all
the SATA drives I've seen have WCE turned on by default. Thus, the
actual write to media is disconnected from the write request by the
on-disk write buffering. The drives don't have to guess what needs to
be written. How buffers are managed in the device might depend on the
original request size and on how much RAM the disk controller has
(I expect 8 MB - 32 MB for SATA drives these days).
Again, YMMV depending on disk firmware.
> - for example, the performance of 2.6.25
> with NCQ support removed on 1MB IOs is better than anything possible
> with stock 2.6.25 for many workloads.
>
> Would it be worth re-enabling large IOs on this hardware when NCQ is
> disabled (using the queue_depth /proc variable)? If so I'll come up
> with a patch.
I think it depends on the difference in performance, how much the
performance depends on disk firmware (can someone else reproduce those
results with a different drive?), and how much uglier it makes the code.
I'm not openoffice enabled at the moment thus haven't been able to look
at the spreadsheet you provided. (Odd that OpenOffice isn't installed on
this google-issued Mac laptop by default...I'll get that installed though and
make sure to look at the spreadsheet.)
hth,
grant
> Does anyone know what mv_sata does about NCQ? I see references to NCQ
> throughout the code but I don't yet understand it enough to determine
> what's going on. mv_sata _does_ support IOs greater than 128K, which
> suggests that it does not use NCQ on this hardware at least.
>
> Any advice on areas to explore to improve sata_mv's performance? I
> imagine I need to understand what mv_sata does differently, and plan on
> spending some time reading that code, but I'd appreciate more specific
> ideas if anyone has them.
>
> Details on the tests I performed:
>
> I used the sgpdd-survey tool, a low level performance test from the
> Lustre iokit. It can be downloaded from:
> http://downloads.lustre.org/public/tools/lustre-iokit/ . The tool
> performs timed sgp_dd commands using various IO sizes, region counts,
> and thread counts and reports aggregate bandwidth. These results were
> then graphed using a spreadsheet.
>
> Note that the largest thread count measured is not the same on all
> graphs. For some reason, write() to the sg device returns ENOMEM to
> some sgp_dd threads when large numbers of threads are run on recent
> kernels. The problem does not exist with the RHEL 4 kernel. I have not
> yet investigated why this happens.
>
> Cheers,
> Jody
>
> --
> Jody McIntyre - Linux Kernel Engineer, Sun HPC
* Re: sata_mv performance; impact of NCQ
From: Mark Lord @ 2008-04-26 4:49 UTC
To: Jody McIntyre; +Cc: linux-ide
Jody McIntyre wrote:
..
> My ultimate goal is to replace mv_sata (the out-of-tree vendor driver)
> on RHEL 4 with sata_mv on a modern kernel, and to do this I need
> equivalent or better performance, especially on large (1MB) IOs.
..
Good goal, but hang on for a few weeks of updates yet-to-come first.
sata_mv is still not safe enough for production use, but getting there soon.
We're still behind on the errata workarounds (quite important), but should
be catching up soon. And I've just today noticed a bug in the recently
reworked IRQ handling (patch to fix it coming soon-ish).
> I note the recent changes to enable NCQ result in a net performance gain
> for 64K and 128K IOs, but due to a chipset limitation in the 6xxx series
> (according to commit f273827e2aadcf2f74a7bdc9ad715a1b20ea7dda),
> max_sectors is now limited which means we can no longer perform IOs
> greater than 128K (ENOMEM is returned from an sg write.) Therefore
> large IO performance suffers - for example, the performance of 2.6.25
> with NCQ support removed on 1MB IOs is better than anything possible
> with stock 2.6.25 for many workloads.
>
> Would it be worth re-enabling large IOs on this hardware when NCQ is
> disabled (using the queue_depth /proc variable)? If so I'll come up
> with a patch.
..
Mmm.. I'd like to see numbers for that, though all bets are off if it's
a WD drive that's performing more slowly with NCQ (quite common, that).
But yes, a sata_mv specific queue_depth op would be a good thing,
and it could revert to non-NCQ with a depth of "1", I suppose.
I'd still prefer if libata continued with NCQ on a depth of "1",
and only switched to non-NCQ at a depth of "0", though. That's more of
a thing for Tejun than for you, though. :)
If you could do a rough code-up of the non-NCQ for depth 1 patch
then that could save me some time here, thanks!
Cheers
* Re: sata_mv performance; impact of NCQ
From: Jody McIntyre @ 2008-04-28 17:14 UTC
To: Grant Grundler; +Cc: linux-ide
On Fri, Apr 25, 2008 at 08:13:16PM -0700, Grant Grundler wrote:
> > max_sectors is now limited which means we can no longer perform IOs
> > greater than 128K (ENOMEM is returned from an sg write.) Therefore
> > large IO performance suffers
>
> I would not jump to that conclusion. 4-5 years ago I measured CPU
> overhead and throughput vs block size for u320 SCSI and the
> difference was minimal once block size exceeded 64K. By minimal
> I mean low single digit differences. I haven't measured large block IO
> recently for SCSI (or SATA) where it might be different.
The conclusion is supported by the graphs - comparing 128K IOs on stock
2.6.25 with 1024K IOs on 2.6.25 with NCQ removed shows lower performance
for stock 2.6.25 at high thread counts. It's not dramatically lower, but
it's measurable. Also, the results for small IOs are less consistent -
performance goes up and down as thread count increases.
> Recently I measured sequential 4K IOs performance (goal was to
> determine CPU overhead in the device drivers) and was surprised
> to find that performance peaked at 4 threads per disk. 8 threads per
> disk got about the same throughput as 1 thread.
>
> For sequential reads the disk device firmware should recognize the
> "stream" and read ahead to maintain media rate. My guess is more
> threads will confuse this strategy and performance will suffer. This
> behavior depends on drive firmware which will vary depending
> on the IO request size and on how the disk firmware segments read
> buffers. And in this case, the firmware is known to have some
> "OS specific optimizations" (not linux) so I'm not sure if that made
> a difference as well.
Indeed - and since (at the higher end of the scale) the thread count is
larger than the NCQ queue size, the drive firmware has no chance of
being able to reorder everything optimally.
> For sequential writes, I'd be very surprised if the same is true since all
> the SATA drives I've seen have WCE turned on by default. Thus, the
> actual write to media is disconnected from the write request by the
> on-disk write buffering. The drives don't have to guess what needs to
> be written and the original request size might affect how buffers are
> managed in the device and how much RAM the disk controller has
> (I expect 8 MB - 32MB for SATA drives these days).
> Again, YMMV depending on disk firmware.
We turn write caching off for data integrity reasons (write reordering
does bad things to journalling file systems if power is interrupted -
and at the scale of many Lustre deployments, it happens often enough to
be a concern.) I'm also concerned about the effects of NCQ in this area
so we'll probably turn it off anyway.
> > Would it be worth re-enabling large IOs on this hardware when NCQ is
> > disabled (using the queue_depth /proc variable)? If so I'll come up
> > with a patch.
>
> I think it depends on the difference in performance, how much the performance
> depends on disk firmware (can someone else reproduce those results with
> a different drive?) and how much uglier it makes the code.
Yes - sounds like it's worth a try anyway, especially if we end up
disabling NCQ for integrity reasons (no NCQ _and_ small IOs really sucks
according to my tests.)
Cheers,
Jody
> I'm not openoffice enabled at the moment thus haven't been able to look
> at the spreadsheet you provided. (Odd that OpenOffice isn't installed on
> this google-issued Mac laptop by default...I'll get that installed though and
> make sure to look at the spreadsheet.)
>
> hth,
> grant
>
> > Does anyone know what mv_sata does about NCQ? I see references to NCQ
> > throughout the code but I don't yet understand it enough to determine
> > what's going on. mv_sata _does_ support IOs greater than 128K, which
> > suggests that it does not use NCQ on this hardware at least.
> >
> > Any advice on areas to explore to improve sata_mv's performance? I
> > imagine I need to understand what mv_sata does differently, and plan on
> > spending some time reading that code, but I'd appreciate more specific
> > ideas if anyone has them.
> >
> > Details on the tests I performed:
> >
> > I used the sgpdd-survey tool, a low level performance test from the
> > Lustre iokit. It can be downloaded from:
> > http://downloads.lustre.org/public/tools/lustre-iokit/ . The tool
> > performs timed sgp_dd commands using various IO sizes, region counts,
> > and thread counts and reports aggregate bandwidth. These results were
> > then graphed using a spreadsheet.
> >
> > Note that the largest thread count measured is not the same on all
> > graphs. For some reason, write() to the sg device returns ENOMEM to
> > some sgp_dd threads when large numbers of threads are run on recent
> > kernels. The problem does not exist with the RHEL 4 kernel. I have not
> > yet investigated why this happens.
> >
> > Cheers,
> > Jody
> >
> > --
> > Jody McIntyre - Linux Kernel Engineer, Sun HPC
--
Jody McIntyre - Linux Kernel Engineer, Sun HPC
* Re: sata_mv performance; impact of NCQ
From: Jody McIntyre @ 2008-04-28 17:33 UTC
To: Mark Lord; +Cc: linux-ide
On Sat, Apr 26, 2008 at 12:49:01AM -0400, Mark Lord wrote:
> >My ultimate goal is to replace mv_sata (the out-of-tree vendor driver)
> >on RHEL 4 with sata_mv on a modern kernel, and to do this I need
> >equivalent or better performance, especially on large (1MB) IOs.
>
> Good goal, but hang on for a few weeks of updates yet-to-come first.
> sata_mv is still not safe enough for production use, but getting there soon.
OK. For what it's worth, I've never had any problems with the 2.6.25
version of sata_mv, which is a nice improvement over the one in the
latest RHEL 5 kernel. I haven't done any specific reliability tests
though.
> We're still behind on the errata workarounds (quite important), but should
> be catching up soon. And I've just today noticed a bug in the recently
> reworked IRQ handling (patch to fix it coming soon-ish).
OK. Let me know if there's anything I can help with, though my
experience with both the chipset and SATA is quite limited. If you
need testing, I'm certainly able to do that.
> >Would it be worth re-enabling large IOs on this hardware when NCQ is
> >disabled (using the queue_depth /proc variable)? If so I'll come up
> >with a patch.
>
> Mmm.. I'd like to see numbers for that, though all bets are off if it's
> a WD drive that's performing more slowly with NCQ (quite common, that).
These are Hitachi drives:
Vendor: ATA Model: HITACHI HUA7250S Rev: GK6O
From my tests, NCQ is definitely a win for the IO sizes that work.
> But yes, a sata_mv specific queue_depth op would be a good thing,
> and it could revert to non-NCQ with a depth of "1", I suppose.
>
> I'd still prefer if libata continued with NCQ on a depth of "1",
> and only switched to non-NCQ at a depth of "0", though. That's more of
> a thing for Tejun than for you, though. :)
>
> If you could do a rough code-up of the non-NCQ for depth 1 patch
> then that could save me some time here, thanks!
OK, I'll work on that.
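Roughly what I have in mind (completely untested, and I'm assuming
ata_ncq_enabled() - or an equivalent check of ATA_DFLAG_NCQ /
ATA_DFLAG_NCQ_OFF - can be used here):

#include <linux/libata.h>

/* Idea: only apply the Gen-II 256-sector cap while NCQ is actually
 * enabled for the device, so that dropping queue_depth to 1 (which
 * should set ATA_DFLAG_NCQ_OFF) lets large IOs through again. */
static void mv6_dev_config(struct ata_device *adev)
{
	if (ata_ncq_enabled(adev) && adev->max_sectors > ATA_MAX_SECTORS)
		adev->max_sectors = ATA_MAX_SECTORS;
}

Since dev_config only runs when the device is (re)configured, I expect
sata_mv would also need its own change_queue_depth wrapper to re-apply
this check when the depth changes at runtime - I'll look into how to do
that cleanly.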
Cheers,
Jody
> Cheers
--
Jody McIntyre - Linux Kernel Engineer, Sun HPC
* Re: sata_mv performance; impact of NCQ
From: Mark Lord @ 2008-04-28 17:41 UTC
To: Jody McIntyre; +Cc: Grant Grundler, linux-ide
Jody McIntyre wrote:
>
> We turn write caching off for data integrity reasons (write reordering
> does bad things to journalling file systems if power is interrupted -
> and at the scale of many Lustre deployments, it happens often enough to
> be a concern.) I'm also concerned about the effects of NCQ in this area
> so we'll probably turn it off anyway.
..
I haven't done a detailed examination lately, but..
Both write-caching and NCQ re-ordering should be safe on Linux.
The kernel will issue FLUSH_CACHE_EXT commands as required to checkpoint
data to the disk.
Or at least that's how I understood Tejun's last explanation of it.
It is possible that the drive firmware in some brands may not follow
spec for FLUSH_CACHE_EXT, but I don't know of a specific instance of this.
Cheers
* Re: sata_mv performance; impact of NCQ
From: Alan Cox @ 2008-04-28 17:53 UTC
To: Mark Lord; +Cc: Jody McIntyre, Grant Grundler, linux-ide
> Or at least that's how I understood Tejun's last explanation of it.
> It is possible that the drive firmware in some brands may not follow
> spec for FLUSH_CACHE_EXT, but I don't know of a specific instance of this.
For an ATA device it should be ok - the standard purposefully specified
flush cache functions in a way that was intended to stop drives lying for
benchmarketing reasons and the like. So you should always get proper
behaviour on any drive which implements it providing barriers are enabled
on the fs.
* Re: sata_mv performance; impact of NCQ
From: Jeff Garzik @ 2008-04-28 20:07 UTC
To: Mark Lord; +Cc: Jody McIntyre, Grant Grundler, linux-ide
Mark Lord wrote:
> Jody McIntyre wrote:
>>
>> We turn write caching off for data integrity reasons (write reordering
>> does bad things to journalling file systems if power is interrupted -
>> and at the scale of many Lustre deployments, it happens often enough to
>> be a concern.) I'm also concerned about the effects of NCQ in this area
>> so we'll probably turn it off anyway.
> ..
>
> I haven't done a detailed examination lately, but..
>
> Both write-caching and NCQ re-ordering should be safe on Linux.
> The kernel will issue FLUSH_CACHE_EXT commands as required to checkpoint
> data to the disk.
It depends on the fs's use of barriers (or not).
Without barriers, FLUSH CACHE[ EXT] isn't issued very often.
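To make that concrete with a trivial userspace sketch - whether data
written like this is actually on the platter when fsync() returns depends
on the write cache and barrier settings being discussed, not just on the
syscall ("testfile" is an arbitrary example path):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0) { perror("open"); return 1; }
	if (write(fd, "important\n", 10) != 10)
		perror("write");
	/* fsync() forces the data out of the kernel's page cache, but
	 * whether the drive's volatile write cache gets flushed as well
	 * depends on the filesystem issuing a barrier/flush, as above. */
	if (fsync(fd) < 0)
		perror("fsync");
	close(fd);
	return 0;
}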
Jeff