Drive performance bottleneck

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Drive performance bottleneck
@ 2005-02-03  2:24 Ian Godin
  2005-02-03  3:56 ` Bernd Eckenfels
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Ian Godin @ 2005-02-03  2:24 UTC (permalink / raw)
  To: linux-kernel


   I am trying to get very fast disk drive performance and I am seeing 
some interesting bottlenecks.  We are trying to get 800 MB/sec or more 
(yes, that is megabytes per second).  We are currently using 
PCI-Express with a 16 drive raid card (SATA drives).  We have achieved 
that speed, but only through the SG (SCSI generic) driver.  This is 
running the stock 2.6.10 kernel.  And the device is not mounted as a 
file system.  I also set the read ahead size on the device to 16KB 
(which speeds things up a lot):

blockdev --setra 16834 /dev/sdb

So here are the results:

$ time dd if=/dev/sdb of=/dev/null bs=64k count=1000000
1000000+0 records in
1000000+0 records out
0.27user 86.19system 2:40.68elapsed 53%CPU (0avgtext+0avgdata 
0maxresident)k
0inputs+0outputs (0major+177minor)pagefaults 0swaps

64k * 1000000 / 160.68 = 398.3 MB/sec

Using sg_dd just to make sure it works the same:

$ time sg_dd if=/dev/sdb of=/dev/null bs=64k count=1000000
1000000+0 records in
1000000+0 records out
0.05user 144.27system 2:41.55elapsed 89%CPU (0avgtext+0avgdata 
0maxresident)k
0inputs+0outputs (17major+5375minor)pagefaults 0swaps

   Pretty much the same speed.  Now using the SG device (sg1 is tied to 
sdb):

$ time sg_dd if=/dev/sg1 of=/dev/null bs=64k count=1000000
Reducing read to 16 blocks per loop
1000000+0 records in
1000000+0 records out
0.22user 66.21system 1:10.23elapsed 94%CPU (0avgtext+0avgdata 
0maxresident)k
0inputs+0outputs (0major+2327minor)pagefaults 0swaps

64k * 1000000 / 70.23 = 911.3 MB/sec

   Now that's more like the speeds we expected.  I understand that the 
SG device uses direct I/O and/or mmap memory from the kernel.  What I 
cannot believe is that there is that much overhead in going through the 
page buffer/cache system in Linux.

   We also tried going through a file system (various ones, JFS, XFS, 
Reiser, Ext3).  They all seem to bottleneck at around 400MB/sec, much 
like /dev/sdb does.  We also have a "real" SCSI raid system which also 
bottlenecks right at 400 MB/sec.  Under Windows (XP) both of these 
systems run at 650 (SCSI) or 800 (SATA) MB/sec.

   Other variations I've tried: setting the read ahead to larger or 
smaller number (1, 2, 4, 8, 16, 32, 64 KB)... 8 or 16 seems to be 
optimal.  Using different block sizes in the dd command (again 1, 2, 4, 
8, 16, 32, 64).  16, 32, 64 are pretty much identical and fastest.

  Below is an oprofile (truncated) of (the same) dd running on /dev/sdb.

   So is the overhead really that high?  Hopefully there's a bottleneck 
in there that no one has come across yet, and it can be optimized.  
Anyone else trying to pull close to 1GB/sec from disk? :)  The kernel 
has changed a lot since the last time I really worked with it (2.2), so 
any suggestions are appreciated.

Ian Godin
Senior Software Developer
DTS/Lowry Digital Images

-----------

CPU: P4 / Xeon, speed 3402.13 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not 
stopped) with a unit mask of 0x01 (mandatory) count 100000
samples  %        symbol name
848185    8.3510  __copy_to_user_ll
772172    7.6026  do_anonymous_page
701579    6.9076  _spin_lock_irq
579024    5.7009  __copy_user_intel
361634    3.5606  _spin_lock
343018    3.3773  _spin_lock_irqsave
307462    3.0272  kmap_atomic
193327    1.9035  page_fault
181040    1.7825  schedule
174502    1.7181  radix_tree_delete
158967    1.5652  end_buffer_async_read
124124    1.2221  free_hot_cold_page
119057    1.1722  sysenter_past_esp
117384    1.1557  shrink_list
112762    1.1102  buffered_rmqueue
105490    1.0386  smp_call_function
101568    1.0000  kmem_cache_alloc
97404     0.9590  kmem_cache_free
95826     0.9435  __rmqueue
95443     0.9397  __copy_from_user_ll
93181     0.9174  free_pages_bulk
92732     0.9130  release_pages
86912     0.8557  shrink_cache
85896     0.8457  block_read_full_page
79629     0.7840  free_block
78304     0.7710  mempool_free
72264     0.7115  create_empty_buffers
71303     0.7020  do_syslog
70769     0.6968  emit_log_char
66413     0.6539  mark_offset_tsc
64333     0.6334  vprintk
63468     0.6249  file_read_actor
63292     0.6232  add_to_page_cache
62281     0.6132  unlock_page
61655     0.6070  _spin_unlock_irqrestore
59486     0.5857  find_get_page
58901     0.5799  drop_buffers
58775     0.5787  do_generic_mapping_read
55070     0.5422  __wake_up_bit
48681     0.4793  __end_that_request_first
47121     0.4639  bad_range
47102     0.4638  submit_bh
45009     0.4431  journal_add_journal_head
41270     0.4063  __alloc_pages
41247     0.4061  page_waitqueue
39520     0.3891  generic_file_buffered_write
38520     0.3793  __pagevec_lru_add
38142     0.3755  do_select
38105     0.3752  do_mpage_readpage
37020     0.3645  vsnprintf
36541     0.3598  __clear_page_buffers
35932     0.3538  journal_put_journal_head
35769     0.3522  radix_tree_lookup
35636     0.3509  bio_put
34904     0.3437  jfs_get_blocks
34865     0.3433  mark_page_accessed
33686     0.3317  bio_alloc
33273     0.3276  __switch_to
32928     0.3242  __generic_file_aio_write_nolock
32781     0.3228  mempool_alloc
31457     0.3097  number
31160     0.3068  do_get_write_access
30922     0.3045  sys_select
30201     0.2974  blk_rq_map_sg
29488     0.2903  smp_call_function_interrupt
28815     0.2837  do_page_fault
27654     0.2723  radix_tree_insert
27650     0.2722  __block_prepare_write
27185     0.2677  cache_alloc_refill
26910     0.2649  page_remove_rmap
26378     0.2597  handle_mm_fault
26192     0.2579  invalidate_complete_page
25762     0.2536  page_add_anon_rmap
25580     0.2519  max_block
25497     0.2510  do_gettimeofday


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Drive performance bottleneck
  2005-02-03  2:24 Drive performance bottleneck Ian Godin
@ 2005-02-03  3:56 ` Bernd Eckenfels
  2005-02-03 16:49   ` Ian Godin
  2005-02-03 16:06 ` Jens Axboe
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 12+ messages in thread
From: Bernd Eckenfels @ 2005-02-03  3:56 UTC (permalink / raw)
  To: linux-kernel

In article <c4fc982390674caa2eae4f252bf4fc78@lowrydigital.com> you wrote:
>  Below is an oprofile (truncated) of (the same) dd running on /dev/sdb.

do  you also have the oprofile of the sg_dd handy?

Greetings
Bernd

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Drive performance bottleneck
  2005-02-03  3:56 ` Bernd Eckenfels
@ 2005-02-03 16:49   ` Ian Godin
  0 siblings, 0 replies; 12+ messages in thread
From: Ian Godin @ 2005-02-03 16:49 UTC (permalink / raw)
  To: linux-kernel

On Feb 2, 2005, at 7:56 PM, Bernd Eckenfels wrote:

> In article <c4fc982390674caa2eae4f252bf4fc78@lowrydigital.com> you 
> wrote:
>>  Below is an oprofile (truncated) of (the same) dd running on 
>> /dev/sdb.
>
> do  you also have the oprofile of the sg_dd handy?
>
> Greetings
> Bernd
>

   Just ran it on the sg_dd (using /dev/sg1):

CPU: P4 / Xeon, speed 3402.13 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not 
stopped) with a unit mask of 0x01 (mandatory) count 10
0000
samples  %        symbol name
2145136  89.7474  __copy_to_user_ll
64720     2.7077  lock_kernel
17883     0.7482  mark_offset_tsc
16212     0.6783  page_address
8091      0.3385  schedule
5380      0.2251  timer_interrupt
5314      0.2223  _spin_lock
5311      0.2222  mwait_idle
4034      0.1688  sysenter_past_esp
3569      0.1493  do_anonymous_page
3530      0.1477  apic_timer_interrupt
3368      0.1409  _spin_lock_irqsave
3150      0.1318  sg_read_xfer
3043      0.1273  kmem_cache_alloc
2590      0.1084  find_busiest_group
2553      0.1068  scheduler_tick
2319      0.0970  __copy_from_user_ll
2299      0.0962  sg_ioctl
2131      0.0892  irq_entries_start
1905      0.0797  sys_ioctl
1773      0.0742  copy_page_range
1678      0.0702  fget
1648      0.0689  __switch_to
1648      0.0689  scsi_block_when_processing_errors
1632      0.0683  sg_start_req
1569      0.0656  increment_tail
1511      0.0632  try_to_wake_up
1506      0.0630  update_one_process
1454      0.0608  fput
1397      0.0584  do_gettimeofday
1396      0.0584  zap_pte_range
1371      0.0574  recalc_task_prio
1357      0.0568  sched_clock
1352      0.0566  sg_link_reserve
1282      0.0536  __elv_add_request
1229      0.0514  kmap_atomic
1195      0.0500  page_fault

   Oddly enough it appears to also be copying to user space...

      Ian.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Drive performance bottleneck
  2005-02-03  2:24 Drive performance bottleneck Ian Godin
  2005-02-03  3:56 ` Bernd Eckenfels
@ 2005-02-03 16:06 ` Jens Axboe
  2005-02-03 17:40 ` Nuno Silva
  2005-02-04  9:32 ` Andrew Morton
  3 siblings, 0 replies; 12+ messages in thread
From: Jens Axboe @ 2005-02-03 16:06 UTC (permalink / raw)
  To: Ian Godin; +Cc: linux-kernel

On Wed, Feb 02 2005, Ian Godin wrote:
> 
>   I am trying to get very fast disk drive performance and I am seeing 
> some interesting bottlenecks.  We are trying to get 800 MB/sec or more 
> (yes, that is megabytes per second).  We are currently using 
> PCI-Express with a 16 drive raid card (SATA drives).  We have achieved 
> that speed, but only through the SG (SCSI generic) driver.  This is 
> running the stock 2.6.10 kernel.  And the device is not mounted as a 
> file system.  I also set the read ahead size on the device to 16KB 
> (which speeds things up a lot):
> 
> blockdev --setra 16834 /dev/sdb
> 
> So here are the results:
> 
> $ time dd if=/dev/sdb of=/dev/null bs=64k count=1000000
> 1000000+0 records in
> 1000000+0 records out
> 0.27user 86.19system 2:40.68elapsed 53%CPU (0avgtext+0avgdata 
> 0maxresident)k
> 0inputs+0outputs (0major+177minor)pagefaults 0swaps
> 
> 64k * 1000000 / 160.68 = 398.3 MB/sec
> 
> Using sg_dd just to make sure it works the same:
> 
> $ time sg_dd if=/dev/sdb of=/dev/null bs=64k count=1000000
> 1000000+0 records in
> 1000000+0 records out
> 0.05user 144.27system 2:41.55elapsed 89%CPU (0avgtext+0avgdata 
> 0maxresident)k
> 0inputs+0outputs (17major+5375minor)pagefaults 0swaps
> 
>   Pretty much the same speed.  Now using the SG device (sg1 is tied to 
> sdb):
> 
> $ time sg_dd if=/dev/sg1 of=/dev/null bs=64k count=1000000
> Reducing read to 16 blocks per loop
> 1000000+0 records in
> 1000000+0 records out
> 0.22user 66.21system 1:10.23elapsed 94%CPU (0avgtext+0avgdata 
> 0maxresident)k
> 0inputs+0outputs (0major+2327minor)pagefaults 0swaps
> 
> 64k * 1000000 / 70.23 = 911.3 MB/sec
> 
>   Now that's more like the speeds we expected.  I understand that the 
> SG device uses direct I/O and/or mmap memory from the kernel.  What I 
> cannot believe is that there is that much overhead in going through the 
> page buffer/cache system in Linux.

It's not going through the page cache that is the problem, it's the
copying to user space. Have you tried using O_DIRECT? What kind of
speeds are you getting with that?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Drive performance bottleneck
  2005-02-03  2:24 Drive performance bottleneck Ian Godin
  2005-02-03  3:56 ` Bernd Eckenfels
  2005-02-03 16:06 ` Jens Axboe
@ 2005-02-03 17:40 ` Nuno Silva
  2005-02-03 18:08   ` Ian Godin
  2005-02-04  9:32 ` Andrew Morton
  3 siblings, 1 reply; 12+ messages in thread
From: Nuno Silva @ 2005-02-03 17:40 UTC (permalink / raw)
  To: Ian Godin; +Cc: linux-kernel

Ian Godin wrote:
> 
>   I am trying to get very fast disk drive performance and I am seeing 
> some interesting bottlenecks.  We are trying to get 800 MB/sec or more 
> (yes, that is megabytes per second).  We are currently using PCI-Express 
> with a 16 drive raid card (SATA drives).  We have achieved that speed, 
> but only through the SG (SCSI generic) driver.  This is running the 
> stock 2.6.10 kernel.  And the device is not mounted as a file system.  I 
> also set the read ahead size on the device to 16KB (which speeds things 
> up a lot):

I was trying to reproduce but got distracted by this:
(use page down, if you just want to see the odd result)

puma:/tmp/dd# sg_map
/dev/sg0  /dev/sda
/dev/sg1  /dev/sdb
/dev/sg2  /dev/scd0
/dev/sg3  /dev/sdc
puma:/tmp/dd# time sg_dd if=/dev/sg1 of=/tmp/dd/sg1 bs=64k count=1000
Reducing read to 64 blocks per loop
1000+0 records in
1000+0 records out

real    0m0.187s
user    0m0.001s
sys     0m0.141s
puma:/tmp/dd# time dd if=/dev/sdb of=/tmp/dd/sdb bs=64k count=1000
1000+0 records in
1000+0 records out
65536000 bytes transferred in 1.203468 seconds (54455956 bytes/sec)

real    0m1.219s
user    0m0.001s
sys     0m0.138s
puma:/tmp/dd# ls -l
total 128000
-rw-r--r--  1 root root 65536000 Feb  3 17:16 sdb
-rw-r--r--  1 root root 65536000 Feb  3 17:16 sg1
puma:/tmp/dd# md5sum *
ec31224970ddd3fb74501c8e68327e7b  sdb
60d4689227d60e6122f1ffe0ec1b2ad7  sg1
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

See? dd from sdb is not the same as sg1! Is this supposed to happen?

About the 900MB/sec:
This same sg1 (= sdb, which is a single hitachi sata hdd) performes like 
this:

puma:/tmp/dd# time sg_dd if=/dev/sg1 of=/dev/null bs=64k count=1000000 
time=1
Reducing read to 64 blocks per loop
time to transfer data was 69.784784 secs, 939.12 MB/sec
1000000+0 records in
1000000+0 records out

real    1m9.787s
user    0m0.063s
sys     0m58.115s

I can assure you that this drive can't do more than 60MB/sec sustained.

My only conclusion is that sg (or sg_dd) is broken? ;)

Peace,
Nuno Silva

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Drive performance bottleneck
  2005-02-03 17:40 ` Nuno Silva
@ 2005-02-03 18:08   ` Ian Godin
  2005-02-03 19:03     ` Paulo Marques
  0 siblings, 1 reply; 12+ messages in thread
From: Ian Godin @ 2005-02-03 18:08 UTC (permalink / raw)
  To: linux-kernel


On Feb 3, 2005, at 9:40 AM, Nuno Silva wrote:

> Ian Godin wrote:
>>   I am trying to get very fast disk drive performance and I am seeing 
>> some interesting bottlenecks.  We are trying to get 800 MB/sec or 
>> more (yes, that is megabytes per second).  We are currently using 
>> PCI-Express with a 16 drive raid card (SATA drives).  We have 
>> achieved that speed, but only through the SG (SCSI generic) driver.  
>> This is running the stock 2.6.10 kernel.  And the device is not 
>> mounted as a file system.  I also set the read ahead size on the 
>> device to 16KB (which speeds things up a lot):
>
> I was trying to reproduce but got distracted by this:
> (use page down, if you just want to see the odd result)
>
> puma:/tmp/dd# sg_map
> /dev/sg0  /dev/sda
> /dev/sg1  /dev/sdb
> /dev/sg2  /dev/scd0
> /dev/sg3  /dev/sdc
> puma:/tmp/dd# time sg_dd if=/dev/sg1 of=/tmp/dd/sg1 bs=64k count=1000
> Reducing read to 64 blocks per loop
> 1000+0 records in
> 1000+0 records out
>
> real    0m0.187s
> user    0m0.001s
> sys     0m0.141s
> puma:/tmp/dd# time dd if=/dev/sdb of=/tmp/dd/sdb bs=64k count=1000
> 1000+0 records in
> 1000+0 records out
> 65536000 bytes transferred in 1.203468 seconds (54455956 bytes/sec)
>
> real    0m1.219s
> user    0m0.001s
> sys     0m0.138s
> puma:/tmp/dd# ls -l
> total 128000
> -rw-r--r--  1 root root 65536000 Feb  3 17:16 sdb
> -rw-r--r--  1 root root 65536000 Feb  3 17:16 sg1
> puma:/tmp/dd# md5sum *
> ec31224970ddd3fb74501c8e68327e7b  sdb
> 60d4689227d60e6122f1ffe0ec1b2ad7  sg1
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> See? dd from sdb is not the same as sg1! Is this supposed to happen?
>
> About the 900MB/sec:
> This same sg1 (= sdb, which is a single hitachi sata hdd) performes 
> like this:
>
> puma:/tmp/dd# time sg_dd if=/dev/sg1 of=/dev/null bs=64k count=1000000 
> time=1
> Reducing read to 64 blocks per loop
> time to transfer data was 69.784784 secs, 939.12 MB/sec
> 1000000+0 records in
> 1000000+0 records out
>
> real    1m9.787s
> user    0m0.063s
> sys     0m58.115s
>
> I can assure you that this drive can't do more than 60MB/sec sustained.
>
> My only conclusion is that sg (or sg_dd) is broken? ;)
>
> Peace,
> Nuno Silva
>

   Definitely have been able to repeat that here, so the SG driver 
definitely appears to be broken.  At least I'm glad I am not going 
insane, I was starting to wonder :)

   I'll run some more tests with O_DIRECT and such things, see if I can 
figure out what the REAL max speed is.

  Thanks for the help everyone,
    Ian.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Drive performance bottleneck
  2005-02-03 18:08   ` Ian Godin
@ 2005-02-03 19:03     ` Paulo Marques
  2005-02-04 19:36       ` Andy Isaacson
  0 siblings, 1 reply; 12+ messages in thread
From: Paulo Marques @ 2005-02-03 19:03 UTC (permalink / raw)
  To: Ian Godin; +Cc: linux-kernel

Ian Godin wrote:
> [...]
>   Definitely have been able to repeat that here, so the SG driver 
> definitely appears to be broken.  At least I'm glad I am not going 
> insane, I was starting to wonder :)
> 
>   I'll run some more tests with O_DIRECT and such things, see if I can 
> figure out what the REAL max speed is.

FYI there was a patch running around last April that made a new option 
for "dd" to make it use O_DIRECT. You can get it here:

http://marc.theaimsgroup.com/?l=linux-kernel&m=108135935629589&w=2

Unfortunately this hasn't made it into coreutils. IIRC there were issues 
about dd being multi-platform and the way O_DIRECT was done in other 
systems.

Anyway, you can patch dd yourself and have a tool for debugging with 
O_DIRECT. I hope this helps,

-- 
Paulo Marques - www.grupopie.com

All that is necessary for the triumph of evil is that good men do nothing.
Edmund Burke (1729 - 1797)

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Drive performance bottleneck
  2005-02-03 19:03     ` Paulo Marques
@ 2005-02-04 19:36       ` Andy Isaacson
  0 siblings, 0 replies; 12+ messages in thread
From: Andy Isaacson @ 2005-02-04 19:36 UTC (permalink / raw)
  To: Paulo Marques; +Cc: Ian Godin, linux-kernel

On Thu, Feb 03, 2005 at 07:03:48PM +0000, Paulo Marques wrote:
> FYI there was a patch running around last April that made a new option 
> for "dd" to make it use O_DIRECT. You can get it here:
> 
> http://marc.theaimsgroup.com/?l=linux-kernel&m=108135935629589&w=2
> 
> Unfortunately this hasn't made it into coreutils.

Follow down the thread and you'll see that it was merged to coreutils
CVS (message-id <85k70qsp71.fsf@pi.meyering.net>), but there apparently
hasn't been a coreutils release since then.  Nor has the patch been
added to the debian package.

-andy

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Drive performance bottleneck
  2005-02-03  2:24 Drive performance bottleneck Ian Godin
                   ` (2 preceding siblings ...)
  2005-02-03 17:40 ` Nuno Silva
@ 2005-02-04  9:32 ` Andrew Morton
  2005-02-04 22:51   ` Lincoln Dale
  3 siblings, 1 reply; 12+ messages in thread
From: Andrew Morton @ 2005-02-04  9:32 UTC (permalink / raw)
  To: Ian Godin; +Cc: linux-kernel

Ian Godin <Ian.Godin@lowrydigital.com> wrote:
>
> 
>    I am trying to get very fast disk drive performance and I am seeing 
> some interesting bottlenecks.  We are trying to get 800 MB/sec or more 
> (yes, that is megabytes per second).  We are currently using 
> PCI-Express with a 16 drive raid card (SATA drives).  We have achieved 
> that speed, but only through the SG (SCSI generic) driver.  This is 
> running the stock 2.6.10 kernel.  And the device is not mounted as a 
> file system.  I also set the read ahead size on the device to 16KB 
> (which speeds things up a lot):
> ...
> samples  %        symbol name
> 848185    8.3510  __copy_to_user_ll
> 772172    7.6026  do_anonymous_page
> 701579    6.9076  _spin_lock_irq
> 579024    5.7009  __copy_user_intel
> 361634    3.5606  _spin_lock
> 343018    3.3773  _spin_lock_irqsave
> 307462    3.0272  kmap_atomic
> 193327    1.9035  page_fault

Something funny is happening here - it looks like there's plenty of CPU
capacity left over.

It's odd that you're getting a lot of pagefaults in this test but not with
the sg_dd test, too.  I wonder why dd is getting so many pagefaults?  (I
recall that sg_dd did something cheaty, but I forget what it was).

Could you monitor the CPU load during the various tests?  If the `dd'
workload isn't pegging the CPU then it could be that there's something
wrong with the I/O submission patterns.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Drive performance bottleneck
  2005-02-04  9:32 ` Andrew Morton
@ 2005-02-04 22:51   ` Lincoln Dale
  2005-02-04 23:07     ` Andrew Morton
  0 siblings, 1 reply; 12+ messages in thread
From: Lincoln Dale @ 2005-02-04 22:51 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Ian Godin, linux-kernel

At 08:32 PM 4/02/2005, Andrew Morton wrote:
>Something funny is happening here - it looks like there's plenty of CPU
>capacity left over.
[..]
>Could you monitor the CPU load during the various tests?  If the `dd'
>workload isn't pegging the CPU then it could be that there's something
>wrong with the I/O submission patterns.

as an educated guess, i'd say that the workload is running out of memory 
bandwidth ..

lets say the RAM is single-channel DDR400.  peak bandwidth = 3.2Gb/s (400 x 
10^6 x 64 bits / 1000000000).  its fair to say that peak bandwidth is 
pretty rare thing to achieve with SDRAM given real-world access patterns -- 
lets take a conservative "it'll be 50% efficient" -- so DDR400 realistic 
peak = 1.6Gbps.

as far as memory-accesses go, a standard user-space read() from disk 
results in 4 memory-accesses (1. DMA from HBA to RAM, 2. read in 
copy_to_user(), 3. write in copy_to_user(), 4. userspace accessing that data).
1.6Gbps / 4 = 400MB/s -- or roughly what Ian was seeing.

sg_dd uses a window into a kernel DMA window.  as such, two of the four 
memory acccesses are cut out (1. DMA from HBA to RAM, 2. userspace 
accessing data).
1.6Gbps / 2 = 800MB/s -- or roughly what Ian was seeing with sg_dd.

DIRECT_IO should achieve similar numbers to sg_dd, but perhaps not quite as 
efficient.

cheers,

lincoln.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Drive performance bottleneck
  2005-02-04 22:51   ` Lincoln Dale
@ 2005-02-04 23:07     ` Andrew Morton
  2005-02-06  4:25       ` Nuno Silva
  0 siblings, 1 reply; 12+ messages in thread
From: Andrew Morton @ 2005-02-04 23:07 UTC (permalink / raw)
  To: Lincoln Dale; +Cc: Ian.Godin, linux-kernel

Lincoln Dale <ltd@cisco.com> wrote:
>
> sg_dd uses a window into a kernel DMA window.  as such, two of the four 
> memory acccesses are cut out (1. DMA from HBA to RAM, 2. userspace 
> accessing data).
> 1.6Gbps / 2 = 800MB/s -- or roughly what Ian was seeing with sg_dd.

Right.  That's a fancy way of saying "cheating" ;)

But from the oprofile output it appears to me that there is plenty of CPU
capacity left over.  Maybe I'm misreading it due to oprofile adding in the
SMP factor (25% CPU on a 4-way means we've exhausted CPU capacity).

> DIRECT_IO should achieve similar numbers to sg_dd, but perhaps not quite as 
> efficient.

Probably so.

There are various tools in
http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz which are
more useful than dd, btw.  `odread' and `odwrite' are usful for this sort
of thing.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Drive performance bottleneck
  2005-02-04 23:07     ` Andrew Morton
@ 2005-02-06  4:25       ` Nuno Silva
  0 siblings, 0 replies; 12+ messages in thread
From: Nuno Silva @ 2005-02-06  4:25 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Lincoln Dale, Ian.Godin, linux-kernel

Andrew Morton wrote:
> Lincoln Dale <ltd@cisco.com> wrote:
> 
>>sg_dd uses a window into a kernel DMA window.  as such, two of the four 
>>memory acccesses are cut out (1. DMA from HBA to RAM, 2. userspace 
>>accessing data).
>>1.6Gbps / 2 = 800MB/s -- or roughly what Ian was seeing with sg_dd.
> 
> 
> Right.  That's a fancy way of saying "cheating" ;)
> 
> But from the oprofile output it appears to me that there is plenty of CPU
> capacity left over.  Maybe I'm misreading it due to oprofile adding in the
> SMP factor (25% CPU on a 4-way means we've exhausted CPU capacity).

sg_dd is lying or /dev/sg* is broken. Try to do that sg_dd test in any 
single drive and you'll get 20 times the performance you're supposed to 
achieve:

puma:/tmp/dd# time sg_dd if=/dev/sg1 of=/dev/null bs=64k count=1000000 
time=1
Reducing read to 64 blocks per loop
time to transfer data was 69.784784 secs, 939.12 MB/sec
1000000+0 records in
1000000+0 records out

This is a single sata drive. I'm lucky, am I not?  ;-)

Regards,
Nuno Silva


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2005-02-06  4:25 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2005-02-03  2:24 Drive performance bottleneck Ian Godin
2005-02-03  3:56 ` Bernd Eckenfels
2005-02-03 16:49   ` Ian Godin
2005-02-03 16:06 ` Jens Axboe
2005-02-03 17:40 ` Nuno Silva
2005-02-03 18:08   ` Ian Godin
2005-02-03 19:03     ` Paulo Marques
2005-02-04 19:36       ` Andy Isaacson
2005-02-04  9:32 ` Andrew Morton
2005-02-04 22:51   ` Lincoln Dale
2005-02-04 23:07     ` Andrew Morton
2005-02-06  4:25       ` Nuno Silva

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).