public inbox for linux-kernel@vger.kernel.org
* ext3 performance bottleneck as the number of spindles gets large
@ 2002-06-19 21:29 mgross
  2002-06-20  0:54 ` Andrew Morton
  2002-06-20  1:55 ` Andrew Morton
  0 siblings, 2 replies; 24+ messages in thread
From: mgross @ 2002-06-19 21:29 UTC (permalink / raw)
  To: Linux Kernel Mailing List, lse-tech; +Cc: richard.a.griffiths

[-- Attachment #1: Type: text/plain, Size: 2309 bytes --]

We've been doing some throughput comparisons and benchmarks of block I/O 
throughput for 8KB writes as the number of SCSI adapters, and drives per 
adapter, is increased.

The Linux platform is a dual-processor 1.2GHz PIII, 2GB of RAM, 2U box.
Similar results have been seen with both the 2.4.16 and 2.4.18 base kernels, as 
well as one of the O(1)-scheduler-patched 2.4.18 kernels out there.

The benchmark is Bonnie++.

What seems to be happening is that throughput for 8KB sequential writes with 
300MB files goes down as the number of spindles goes up. We see negative scaling 
WRT spindles per SCSI adapter, and very poor scaling per SCSI adapter.

(Another dual-processor platform running a different OS sees its throughput go 
up with adapters and spindles.)

Running this benchmark with lockmeter ends up pointing a big finger at BKL 
contention in: ext3_commit_write, ext3_dirty_inode, ext3_get_block_handle, 
and ext3_prepare_write (twice!).  Attached is the output from the worst 
case, 4 SCSI adapters with 6 drives per adapter.

Has anyone done any work looking into the I/O scaling of Linux / ext3 per 
spindle or per adapter?  We would like to compare notes.

I've only just started to look at the ext3 code, but it seems to me that replacing 
the BKL with a per-ext3-filesystem lock could remove some of the contention that's 
getting measured.  What data is the BKL protecting in these ext3 functions?  Could a 
lock-per-FS approach work?

Thoughts? 
Comments?
Ideas?

--mgross



- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
SPINLOCKS         HOLD            WAIT
  UTIL  CON    MEAN(  MAX )   MEAN(  MAX )(% CPU)     TOTAL NOWAIT SPIN RJECT  NAME

        3.7%  0.7us(  44ms)  7.8us(  44ms)(22.9%)  49644038 96.3%  3.7% 0.00%  *TOTAL*

 26.6% 71.2%   13us(  44ms)  8.0us(8076us)( 5.8%)    632107 28.8% 71.2%    0%    ext3_commit_write+0x38
  4.4% 30.3%  4.3us( 360us)   13us(7511us)( 2.1%)    316124 69.7% 30.3%    0%    ext3_dirty_inode+0x2c
 28.1%  7.9%   14us(1660us)  9.7us(6842us)(0.78%)    632239 92.1%  7.9%    0%    ext3_get_block_handle+0x8c
  1.2% 27.2%  0.6us( 240us)   11us(6604us)( 3.0%)    632107 72.8% 27.2%    0%    ext3_prepare_write+0x34
 0.26% 88.1%  0.1us(  74us)  9.6us(7026us)( 8.6%)    632107 11.9% 88.1%    0%    ext3_prepare_write+0xe0


[-- Attachment #2: lm_4x6_300MBw --]
[-- Type: text/plain, Size: 40755 bytes --]

Lockmeter statistics are now RESET
Lockmeter statistics are now ON
___________________________________________________________________________________________
System: Linux TSRLT2 2.4.18 #2 SMP Mon Jun 17 08:28:25 PDT 2002 i686
Total counts

All (2) CPUs

Start time: Mon Jun 17 11:12:36 2002
End   time: Mon Jun 17 11:13:07 2002
Delta Time: 30.99 sec.
Hash table slots in use:      314.
Global read lock slots in use: 884.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
SPINLOCKS         HOLD            WAIT
  UTIL  CON    MEAN(  MAX )   MEAN(  MAX )(% CPU)     TOTAL NOWAIT SPIN RJECT  NAME

        3.7%  0.7us(  44ms)  7.8us(  44ms)(22.9%)  49644038 96.3%  3.7% 0.00%  *TOTAL*

 0.00%    0%  1.6us( 3.4us)    0us                        3  100%    0%    0%  [0xdff2bf90]
 0.00%    0%  3.4us( 3.4us)    0us                        1  100%    0%    0%    complete+0x1c
 0.00%    0%  1.3us( 1.3us)    0us                        1  100%    0%    0%    wait_for_completion+0x18
 0.00%    0%  0.1us( 0.1us)    0us                        1  100%    0%    0%    wait_for_completion+0x98

 0.00%    0%  0.1us( 0.1us)    0us                        1  100%    0%    0%  [0xdff56694]
 0.00%    0%  0.1us( 0.1us)    0us                        1  100%    0%    0%    exec_mmap+0x8c

 0.00%    0%  1.2us( 2.0us)    0us                        7  100%    0%    0%  [0xf410c22c]
 0.00%    0%  0.9us( 1.5us)    0us                        5  100%    0%    0%    unmap_fixup+0x8c
 0.00%    0%  1.9us( 2.0us)    0us                        2  100%    0%    0%    unmap_fixup+0x134

 0.00%    0%  0.1us( 0.1us)    0us                        1  100%    0%    0%  [0xf683bf0c]
 0.00%    0%  0.1us( 0.1us)    0us                        1  100%    0%    0%    neigh_destroy+0x108

 0.00%    0%  2.5us( 3.9us)    0us                       15  100%    0%    0%  [0xf6e170d0]
 0.00%    0%  2.5us( 3.9us)    0us                       15  100%    0%    0%    dev_watchdog+0x14

 0.00%    0%  0.4us( 0.7us)    0us                        2  100%    0%    0%  [0xf703f504]
 0.00%    0%  0.1us( 0.1us)    0us                        1  100%    0%    0%    skb_recv_datagram+0x90
 0.00%    0%  0.7us( 0.7us)    0us                        1  100%    0%    0%    unix_dgram_sendmsg+0x35c

 0.37% 0.31%  2.5us(  12us)  3.9us(  10us)(0.00%)     45356 99.7% 0.31%    0%  allocator_request_lock
 0.07% 0.49%  1.0us( 7.8us)  3.9us(  10us)(0.00%)     22678 99.5% 0.49%    0%    scsi_free+0x1c
 0.30% 0.14%  4.1us(  12us)  3.7us( 7.0us)(0.00%)     22678 99.9% 0.14%    0%    scsi_malloc+0x48

 0.00%    0%  0.1us( 0.6us)    0us                       36  100%    0%    0%  arbitration_lock
 0.00%    0%  0.1us( 0.6us)    0us                       30  100%    0%    0%    deny_write_access+0xc
 0.00%    0%  0.2us( 0.3us)    0us                        6  100%    0%    0%    get_write_access+0xc

 0.00%    0%  0.7us( 2.8us)    0us                     1011  100%    0%    0%  bdev_lock
 0.00%    0%  0.7us( 2.8us)    0us                     1011  100%    0%    0%    bdget+0x34

 0.00%    0%  4.1us( 5.8us)    0us                       18  100%    0%    0%  call_lock
 0.00%    0%  4.1us( 5.8us)    0us                       18  100%    0%    0%    smp_call_function+0x58

 0.00%    0%  0.2us( 0.9us)    0us                        6  100%    0%    0%  cdev_lock
 0.00%    0%  0.2us( 0.9us)    0us                        6  100%    0%    0%    cdput+0x28

  2.9% 0.59%  0.7us( 6.5us)  1.0us( 3.6us)(0.01%)   1256678 99.4% 0.59%    0%  contig_page_data+0xa8
 0.72%  1.1%  0.4us( 5.8us)  1.0us( 3.4us)(0.01%)    628341 98.9%  1.1%    0%    __free_pages_ok+0xc8
  2.2% 0.06%  1.1us( 6.5us)  0.9us( 3.6us)(0.00%)    628337  100% 0.06%    0%    rmqueue+0x28

 0.14% 0.02%  0.1us(  84us)  4.4us(  47us)(0.00%)    325106  100% 0.02%    0%  dcache_lock
 0.00%    0%  0.1us( 0.6us)    0us                       20  100%    0%    0%    d_alloc+0x128
 0.00%    0%  0.1us( 0.1us)    0us                        2  100%    0%    0%    d_delete+0x10
 0.00%    0%  0.1us( 0.3us)    0us                       23  100%    0%    0%    d_instantiate+0x1c
 0.01% 0.02%  0.4us(  24us)  0.9us( 0.9us)(0.00%)      4375  100% 0.02%    0%    d_lookup+0x5c
 0.00%    0%  0.2us( 1.2us)    0us                       20  100%    0%    0%    d_rehash+0x40
 0.00%    0%  1.2us( 1.2us)    0us                        1  100%    0%    0%    do_readv_writev+0x28c
 0.00% 0.09%  0.1us( 3.1us)  1.2us( 1.2us)(0.00%)      1069  100% 0.09%    0%    dput+0x30
 0.00%    0%  2.0us( 3.4us)    0us                       10  100%    0%    0%    link_path_walk+0x2a8
 0.00%    0%  0.2us( 0.7us)    0us                        4  100%    0%    0%    notify_change+0xec
 0.00%    0%  1.1us( 1.9us)    0us                        4  100%    0%    0%    prune_dcache+0x14
 0.00% 0.87%  0.5us(  59us)  0.9us( 1.5us)(0.00%)      3231 99.1% 0.87%    0%    prune_dcache+0x138
 0.00%    0%  2.3us( 2.3us)    0us                        1  100%    0%    0%    sys_getcwd+0xc8
 0.00% 0.30%  0.2us( 1.4us)  0.6us( 0.6us)(0.00%)       334 99.7% 0.30%    0%    sys_read+0xac
 0.13% 0.01%  0.1us(  84us)  8.0us(  47us)(0.00%)    316012  100% 0.01%    0%    sys_write+0xac

 0.20% 0.10%  1.3us(  41us)  1.7us( 6.1us)(0.00%)     47856 99.9% 0.10%    0%  device_request_lock
 0.01% 0.15%  0.1us( 1.6us)  2.0us( 6.1us)(0.00%)     23928 99.8% 0.15%    0%    __scsi_release_command+0x14
 0.19% 0.05%  2.4us(  41us)  1.0us( 1.7us)(0.00%)     23928  100% 0.05%    0%    scsi_allocate_device+0x30

 0.00%    0%  0.2us( 2.5us)    0us                      642  100%    0%    0%  files_lock
 0.00%    0%  0.1us( 1.1us)    0us                      200  100%    0%    0%    file_move+0x18
 0.00%    0%  0.1us( 0.8us)    0us                      201  100%    0%    0%    fput+0x80
 0.00%    0%  0.3us( 2.5us)    0us                      203  100%    0%    0%    get_empty_filp+0xc
 0.00%    0%  0.7us( 1.4us)    0us                       37  100%    0%    0%    get_empty_filp+0xdc
 0.00%    0%  0.1us( 0.1us)    0us                        1  100%    0%    0%    put_filp+0x18

  2.9%  8.9%   32us(1727us)    0us                    27945 91.1%    0%  8.9%  global_bh_lock
  2.9%  8.9%   32us(1727us)    0us                    27945 91.1%    0%  8.9%    bh_action+0x18

 0.05%    0%  5.2us( 7.8us)    0us                     3099  100%    0%    0%  i8253_lock
 0.05%    0%  5.2us( 7.8us)    0us                     3099  100%    0%    0%    timer_interrupt+0x2c

 0.01%    0%  1.2us( 2.5us)    0us                     3099  100%    0%    0%  i8259A_lock
 0.01%    0%  1.2us( 2.5us)    0us                     3099  100%    0%    0%    timer_interrupt+0x90

 0.00%    0%  0.4us( 0.4us)    0us                        1  100%    0%    0%  inet_peer_unused_lock
 0.00%    0%  0.4us( 0.4us)    0us                        1  100%    0%    0%    cleanup_once+0x24

 0.00%    0%  0.1us( 1.2us)    0us                       34  100%    0%    0%  init_mm+0x2c
 0.00%    0%  0.9us( 1.2us)    0us                        2  100%    0%    0%    __vmalloc+0x70
 0.00%    0%  0.1us( 0.3us)    0us                       32  100%    0%    0%    __vmalloc+0x120

 0.00%    0%  0.5us( 295us)    0us                     2768  100%    0%    0%  inode_lock
 0.00%    0%  0.4us( 1.9us)    0us                      309  100%    0%    0%    __mark_inode_dirty+0x48
 0.00%    0%  0.8us( 1.0us)    0us                        4  100%    0%    0%    get_empty_inode+0x24
 0.00%    0%  0.7us( 1.3us)    0us                       12  100%    0%    0%    get_new_inode+0x34
 0.00%    0%  1.3us( 2.0us)    0us                       12  100%    0%    0%    iget4+0x3c
 0.00%    0%  0.5us( 0.5us)    0us                        2  100%    0%    0%    insert_inode_hash+0x44
 0.00%    0%  0.1us( 2.1us)    0us                     2138  100%    0%    0%    iput+0x68
 0.00%    0%  182us( 295us)    0us                        4  100%    0%    0%    prune_icache+0x1c
 0.00%    0%  5.8us( 8.0us)    0us                        6  100%    0%    0%    sync_unlocked_inodes+0x10
 0.00%    0%  1.1us(  42us)    0us                      281  100%    0%    0%    sync_unlocked_inodes+0x10c

  2.0%  2.6%  0.8us( 103us)  2.6us(  29us)(0.08%)    788392 97.4%  2.6%    0%  io_request_lock
 0.00%  6.5%  0.6us( 2.0us)  4.6us(  13us)(0.00%)       262 93.5%  6.5%    0%    __get_request_wait+0x90
  1.1%  2.0%  0.5us(  46us)  3.0us(  29us)(0.06%)    666006 98.0%  2.0%    0%    __make_request+0xc0
 0.19%  4.4%  2.4us(  36us)  2.8us(  22us)(0.00%)     23953 95.6%  4.4%    0%    ahc_linux_isr+0x2ec
 0.03%  9.6%  4.6us(  46us)  1.6us(  15us)(0.00%)      2319 90.4%  9.6%    0%    generic_unplug_device+0x10
 0.31%  8.9%  4.0us( 103us)  1.5us(  27us)(0.01%)     24068 91.1%  8.9%    0%    scsi_dispatch_cmd+0x11c
 0.02%  3.9%  0.3us( 1.9us)  2.3us(  20us)(0.00%)     23928 96.1%  3.9%    0%    scsi_finish_command+0x18
 0.17%  3.0%  2.2us(  36us)  2.2us(  15us)(0.00%)     23928 97.0%  3.0%    0%    scsi_queue_next_request+0x18
 0.16%  7.0%  2.0us(  37us)  1.4us(  11us)(0.00%)     23928 93.0%  7.0%    0%    scsi_request_fn+0x31c

 0.51%    0%  0.1us( 147us)    0us                  1300436  100%    0%    0%  jh_splice_lock
 0.27%    0%  0.1us(  59us)    0us                   665500  100%    0%    0%    __journal_remove_journal_head+0xe8
 0.24%    0%  0.1us( 147us)    0us                   634936  100%    0%    0%    journal_add_journal_head+0xd0

 12.8% 0.49%  0.1us(  15ms)  1.3us(  15ms)(0.33%)  33219940 99.5% 0.49%    0%  journal_datalist_lock
 0.00%    0%  2.3us( 2.3us)    0us                        1  100%    0%    0%    dispose_buffer+0x18
 0.00%  2.7%  1.1us( 7.8us)  0.7us( 1.0us)(0.00%)       828 97.3%  2.7%    0%    do_get_write_access+0x9c
  1.6% 0.25%  0.1us( 163us)  1.0us(  84us)(0.03%)   8220760 99.8% 0.25%    0%    do_get_write_access+0x204
  2.0% 0.40%  0.1us( 208us)  0.9us( 182us)(0.05%)   8855491 99.6% 0.40%    0%    journal_add_journal_head+0x10
 0.62% 0.49%  0.3us( 158us)  0.8us( 6.0us)(0.00%)    634936 99.5% 0.49%    0%    journal_add_journal_head+0x88
 0.01%    0%   25us(1934us)    0us                      179  100%    0%    0%    journal_commit_transaction+0x1bc
  3.0% 0.53% 2460us(  15ms)  1.0us( 1.2us)(0.00%)       377 99.5% 0.53%    0%    journal_commit_transaction+0x258
 0.08%    0%   23us( 369us)    0us                     1111  100%    0%    0%    journal_commit_transaction+0x3c0
 0.01% 0.73%  1.4us(  50us)  1.0us( 1.4us)(0.00%)      1786 99.3% 0.73%    0%    journal_commit_transaction+0xd5c
 0.00%  1.1%  0.2us( 0.8us)  1.1us( 1.3us)(0.00%)       179 98.9%  1.1%    0%    journal_commit_transaction+0xee4
 0.45% 0.44%  0.2us( 111us)  1.0us( 2.8us)(0.00%)    632107 99.6% 0.44%    0%    journal_dirty_data+0x54
  2.7% 0.28%  0.2us( 578us)  0.9us(  50us)(0.02%)   5376063 99.7% 0.28%    0%    journal_dirty_metadata+0x54
 0.01% 0.70%  0.4us(  19us)  1.0us( 2.9us)(0.00%)      5537 99.3% 0.70%    0%    journal_file_buffer+0x18
 0.00%    0%  2.2us(  20us)    0us                      619  100%    0%    0%    journal_get_create_access+0x130
 0.60%  9.4%  0.3us( 106us)  1.8us(  15ms)(0.18%)    632510 90.6%  9.4%    0%    journal_try_to_free_buffers+0x4c
 0.00% 0.36%  0.2us( 2.4us)  0.8us( 1.1us)(0.00%)      1965 99.6% 0.36%    0%    journal_unfile_buffer+0xc
  1.7% 0.28%  0.1us( 320us)  1.0us( 263us)(0.04%)   8855491 99.7% 0.28%    0%    journal_unlock_journal_head+0xc

 0.01%    0%   21us(  41us)    0us                      176  100%    0%    0%  kbd_controller_lock
 0.01%    0%   21us(  41us)    0us                      176  100%    0%    0%    keyboard_interrupt+0x14

 64.2% 46.6%  7.0us(  44ms)   10us(  13ms)(21.4%)   2845886 53.4% 46.6%    0%  kernel_flag
 0.00%    0%  1.3us( 1.3us)    0us                        1  100%    0%    0%    chrdev_open+0x4c
 0.00% 57.1%  0.4us( 0.5us)  8.6us(  20us)(0.00%)         7 42.9% 57.1%    0%    de_put+0x28
 0.00% 20.0%   89us( 142us)  4.2us( 4.2us)(0.00%)         5 80.0% 20.0%    0%    do_exit+0xd8
 26.6% 71.2%   13us(  44ms)  8.0us(8076us)( 5.8%)    632107 28.8% 71.2%    0%    ext3_commit_write+0x38
 0.00%    0%   43us(  43us)    0us                        1  100%    0%    0%    ext3_delete_inode+0x48
  4.4% 30.3%  4.3us( 360us)   13us(7511us)( 2.1%)    316124 69.7% 30.3%    0%    ext3_dirty_inode+0x2c
 0.00%    0%  2.2us( 2.2us)    0us                        1  100%    0%    0%    ext3_force_commit+0x38
 28.1%  7.9%   14us(1660us)  9.7us(6842us)(0.78%)    632239 92.1%  7.9%    0%    ext3_get_block_handle+0x8c
  1.2% 27.2%  0.6us( 240us)   11us(6604us)( 3.0%)    632107 72.8% 27.2%    0%    ext3_prepare_write+0x34
 0.26% 88.1%  0.1us(  74us)  9.6us(7026us)( 8.6%)    632107 11.9% 88.1%    0%    ext3_prepare_write+0xe0
 0.00%    0%  5.1us( 5.1us)    0us                        1  100%    0%    0%    get_chrfops+0x88
 0.00%  100%  0.8us( 0.8us)   33us(  33us)(0.00%)         1    0%  100%    0%    locks_remove_posix+0x3c
 0.00%    0%  137us( 221us)    0us                        2  100%    0%    0%    lookup_hash+0x7c
 0.00% 25.0%   17us(  49us)   27us(  27us)(0.00%)         4 75.0% 25.0%    0%    notify_change+0x50
 0.00% 50.0%   40us( 121us)  5.6us( 9.4us)(0.00%)        16 50.0% 50.0%    0%    real_lookup+0x64
  3.7% 58.4% 1007us(  15ms) 1062us(  13ms)( 1.1%)      1136 41.6% 58.4%    0%    schedule+0x508
 0.00% 83.3%  181us( 284us)   17us(  57us)(0.00%)         6 16.7% 83.3%    0%    sync_old_buffers+0x1c
 0.00%    0%  3.2us( 4.2us)    0us                        3  100%    0%    0%    sys_ioctl+0x4c
 0.00% 50.0%  1.3us( 2.1us)   33us( 105us)(0.00%)         8 50.0% 50.0%    0%    sys_llseek+0x88
 0.00%    0%  0.8us( 0.8us)    0us                        1  100%    0%    0%    sys_lseek+0x70
 0.00% 50.0%  7.5us( 9.1us)   13us(  13us)(0.00%)         2 50.0% 50.0%    0%    sys_sysctl+0x70
 0.00%    0%  115us( 172us)    0us                        2  100%    0%    0%    vfs_create+0x84
 0.00%    0%   13us(  13us)    0us                        1  100%    0%    0%    vfs_link+0xa4
 0.00%    0%   12us(  24us)    0us                        2  100%    0%    0%    vfs_readdir+0x68
 0.00%    0%   15us(  16us)    0us                        2  100%    0%    0%    vfs_unlink+0x108

 0.00%    0%  0.6us( 0.9us)    0us                        5  100%    0%    0%  lastpid_lock
 0.00%    0%  0.6us( 0.9us)    0us                        5  100%    0%    0%    get_pid+0x20

 0.00%    0%  0.5us( 1.3us)    0us                      176  100%    0%    0%  logbuf_lock
 0.00%    0%  0.5us( 1.3us)    0us                      176  100%    0%    0%    release_console_sem+0x1c

  7.5% 0.74%  0.9us(  44ms)   14us(  44ms)(0.46%)   2683880 99.3% 0.74%    0%  lru_list_lock
  1.2%  1.5%  7.1us( 499us)  2.5us( 134us)(0.00%)     50765 98.5%  1.5%    0%    balance_dirty+0x18
 0.06% 16.8%   24us( 121us)   12us(  91us)(0.00%)       792 83.2% 16.8%    0%    bdflush+0x98
 0.54%    0%   28ms(  44ms)    0us                        6  100%    0%    0%    bdflush+0xb8
 0.14% 0.42%  0.1us(  85us)  2.0us(  19us)(0.01%)    632101 99.6% 0.42%    0%    buffer_insert_inode_data_queue+0x10
 0.00%  100%  0.4us( 0.4us)   11us(  11us)(0.00%)         1    0%  100%    0%    fsync_inode_buffers+0x28
 0.00%    0%  0.7us( 0.7us)    0us                        1  100%    0%    0%    fsync_inode_data_buffers+0x28
 0.00%    0%  0.6us( 0.6us)    0us                        1  100%    0%    0%    fsync_inode_data_buffers+0xb4
 0.00%    0%  0.1us( 0.1us)    0us                        1  100%    0%    0%    fsync_inode_data_buffers+0x128
 0.00%  3.8%  0.1us( 0.7us)  1.4us( 3.5us)(0.00%)      1760 96.2%  3.8%    0%    inode_has_buffers+0x10
 0.00%  1.1%  0.1us( 0.7us)  1.1us( 1.9us)(0.00%)       877 98.9%  1.1%    0%    invalidate_inode_buffers+0x10
 0.19%    0% 9733us(  17ms)    0us                        6  100%    0%    0%    kupdate+0x98
 0.00%    0%  1.0us( 1.0us)    0us                        1  100%    0%    0%    osync_inode_buffers+0x14
 0.00%    0%  0.8us( 0.8us)    0us                        1  100%    0%    0%    osync_inode_data_buffers+0x14
  1.2% 0.29%  0.3us( 192us)   45us(  44ms)(0.29%)   1363983 99.7% 0.29%    0%    refile_buffer+0xc
 0.00%    0%  0.5us( 0.8us)    0us                        6  100%    0%    0%    sync_old_buffers+0x64
  4.2%  1.9%  2.1us( 384us)  7.7us(  29ms)(0.15%)    633578 98.1%  1.9%    0%    try_to_free_buffers+0x1c

 0.00%    0%  0.5us( 1.9us)    0us                       55  100%    0%    0%  mmlist_lock
 0.00%    0%  0.1us( 0.1us)    0us                        4  100%    0%    0%    copy_mm+0x120
 0.00%    0%  0.5us( 0.5us)    0us                        1  100%    0%    0%    exec_mmap+0x50
 0.00%    0%  0.4us( 0.6us)    0us                        5  100%    0%    0%    mmput+0x28
 0.00%    0%  0.6us( 1.9us)    0us                       45  100%    0%    0%    swap_out+0x50

 0.00%    0%  0.3us( 1.7us)    0us                      223  100%    0%    0%  page_uptodate_lock.0
 0.00%    0%  0.3us( 1.7us)    0us                      223  100%    0%    0%    end_buffer_io_async+0x38

  5.7% 0.86%  0.9us( 278us)  1.6us( 240us)(0.04%)   1901672 99.1% 0.86%    0%  pagecache_lock
 0.00%  2.5%  0.5us( 2.4us)  1.2us( 2.4us)(0.00%)      1028 97.5%  2.5%    0%    __find_get_page+0x18
  2.6% 0.35%  1.3us( 278us)  4.7us( 240us)(0.02%)    632107 99.7% 0.35%    0%    __find_lock_page+0xc
  1.6%  1.1%  0.8us( 196us)  1.0us( 158us)(0.01%)    632348 98.9%  1.1%    0%    add_to_page_cache_unique+0x18
 0.00% 0.29%  0.6us( 3.5us)  1.4us( 1.4us)(0.00%)       349 99.7% 0.29%    0%    do_generic_file_read+0x1a4
 0.00%    0%  1.1us( 1.5us)    0us                        2  100%    0%    0%    do_generic_file_read+0x370
 0.00%    0%  0.1us( 0.9us)    0us                      282  100%    0%    0%    filemap_fdatasync+0x20
 0.00% 0.35%  0.1us( 1.1us)  2.0us( 2.0us)(0.00%)       282 99.6% 0.35%    0%    filemap_fdatawait+0x14
 0.00% 0.69%  0.8us(  54us)  1.0us( 2.0us)(0.00%)      1011 99.3% 0.69%    0%    find_or_create_page+0x38
 0.00% 0.49%  0.7us( 2.4us)  1.0us( 1.4us)(0.00%)      1011 99.5% 0.49%    0%    find_or_create_page+0x78
 0.00% 0.68%  0.1us( 1.2us)  0.8us( 0.8us)(0.00%)       146 99.3% 0.68%    0%    page_cache_read+0x48
 0.00%    0%  0.3us( 0.3us)    0us                        1  100%    0%    0%    remove_inode_page+0x18
 0.00%  4.5%  0.1us( 0.9us)  1.4us( 1.5us)(0.00%)       111 95.5%  4.5%    0%    set_page_dirty+0x24
  1.5%  1.1%  0.7us( 163us)  1.1us(  69us)(0.01%)    632992 98.9%  1.1%    0%    shrink_cache+0x2c0
 0.00%    0%  0.9us( 0.9us)    0us                        1  100%    0%    0%    truncate_inode_pages+0x38
 0.00%    0%  0.2us( 0.2us)    0us                        1  100%    0%    0%    truncate_list_pages+0x158

  6.3% 20.8%  1.5us( 956us)  1.2us( 953us)(0.54%)   1310161 79.2% 20.8%    0%  pagemap_lru_lock
 0.00%  2.3%  0.7us(  71us)  1.0us( 2.2us)(0.00%)      1688 97.7%  2.3%    0%    activate_page+0xc
 0.69%  2.6%  0.3us( 166us)  1.5us( 254us)(0.04%)    634142 97.4%  2.6%    0%    lru_cache_add+0x1c
 0.00% 0.70%  0.5us( 135us)  1.0us( 1.2us)(0.00%)       714 99.3% 0.70%    0%    lru_cache_del+0xc
 0.03% 12.7%  0.4us(  30us)  1.7us(  79us)(0.01%)     19785 87.3% 12.7%    0%    refill_inactive+0x10
 0.11% 11.6%  1.8us(  71us)  2.3us( 122us)(0.01%)     19785 88.4% 11.6%    0%    shrink_cache+0x50
 0.00%  1.6%  1.0us(  26us)  1.5us( 1.9us)(0.00%)       385 98.4%  1.6%    0%    shrink_cache+0x194
 0.00%    0%  1.2us(  13us)    0us                       85  100%    0%    0%    shrink_cache+0x21c
  5.5% 39.8%  2.7us( 956us)  1.2us( 953us)(0.48%)    632981 60.2% 39.8%    0%    shrink_cache+0x290
 0.00% 18.0%  1.0us(  14us)  1.1us( 4.0us)(0.00%)       596 82.0% 18.0%    0%    shrink_cache+0x2b0

 0.45%  1.8%  3.8us(  21us)  3.3us(  11us)(0.00%)     36592 98.2%  1.8%    0%  runqueue_lock
 0.09%  4.6%  3.2us(  11us)  3.5us( 8.7us)(0.00%)      9263 95.4%  4.6%    0%    __wake_up+0x5c
 0.00%    0%  0.7us( 0.7us)    0us                        1  100%    0%    0%    complete+0x6c
 0.00%    0%  0.3us( 0.7us)    0us                        3  100%    0%    0%    deliver_signal+0x48
 0.00%    0%  3.0us( 9.8us)    0us                       20  100%    0%    0%    process_timeout+0x14
 0.00%    0%  3.6us( 3.6us)    0us                        1  100%    0%    0%    schedule_tail+0x58
 0.31% 0.35%  5.4us(  21us)  4.7us(  11us)(0.00%)     18215 99.7% 0.35%    0%    schedule+0xa0
 0.00%    0%  1.6us( 4.6us)    0us                       43  100%    0%    0%    schedule+0x264
 0.04%  2.2%  1.5us( 8.3us)  2.1us( 7.1us)(0.00%)      8325 97.8%  2.2%    0%    schedule+0x4c8
 0.00%    0%  0.8us( 8.4us)    0us                      721  100%    0%    0%    wake_up_process+0x14

 0.00%    0%  0.4us( 5.9us)    0us                      496  100%    0%    0%  sb_lock
 0.00%    0%  0.1us( 0.1us)    0us                      168  100%    0%    0%    drop_super+0x24
 0.00%    0%  0.5us( 4.4us)    0us                      174  100%    0%    0%    sync_supers+0x6c
 0.00%    0%  4.0us( 5.9us)    0us                        6  100%    0%    0%    sync_unlocked_inodes+0x18
 0.00%    0%  0.4us( 1.3us)    0us                      148  100%    0%    0%    sync_unlocked_inodes+0x18c

 0.03% 0.06%  0.1us( 1.9us)  1.2us( 2.1us)(0.00%)     68015  100% 0.06%    0%  scsi_bhqueue_lock
 0.01% 0.07%  0.1us( 1.2us)  1.3us( 2.1us)(0.00%)     43947  100% 0.07%    0%    scsi_bottom_half_handler+0x1c
 0.02% 0.04%  0.2us( 1.9us)  0.9us( 1.5us)(0.00%)     24068  100% 0.04%    0%    scsi_done+0x3c

 0.00%    0%  0.4us( 2.1us)    0us                     2366  100%    0%    0%  semaphore_lock
 0.00%    0%  0.4us( 1.9us)    0us                     1483  100%    0%    0%    __down+0x44
 0.00%    0%  0.5us( 2.1us)    0us                      715  100%    0%    0%    __down+0x78
 0.00%    0%  0.3us( 0.3us)    0us                      168  100%    0%    0%    __down_trylock+0x10

 0.00%    0%  0.1us( 6.3us)    0us                      416  100%    0%    0%  swap_info+0x8
 0.00%    0%  0.2us( 6.3us)    0us                      111  100%    0%    0%    get_swap_page+0x74
 0.00%    0%  0.1us( 1.3us)    0us                      123  100%    0%    0%    swap_duplicate+0x54
 0.00%    0%  0.1us( 0.8us)    0us                      182  100%    0%    0%    swap_info_get+0xb4

 0.00%    0%  0.4us( 9.0us)    0us                      294  100%    0%    0%  swaplock
 0.00%    0%  0.4us( 9.0us)    0us                      111  100%    0%    0%    get_swap_page+0x20
 0.00%    0%  1.7us( 1.7us)    0us                        1  100%    0%    0%    si_swapinfo+0x18
 0.00%    0%  0.4us( 2.3us)    0us                      182  100%    0%    0%    swap_info_get+0x88

 0.04% 0.07%  0.2us(  15us)  1.5us( 7.8us)(0.00%)     52142  100% 0.07%    0%  timerlist_lock
 0.01% 0.07%  0.2us( 2.9us)  1.3us( 3.1us)(0.00%)     24279  100% 0.07%    0%    add_timer+0x10
 0.01% 0.07%  0.1us( 1.4us)  1.6us( 7.8us)(0.00%)     24423  100% 0.07%    0%    del_timer+0x14
 0.00%    0%  0.3us( 0.7us)    0us                       27  100%    0%    0%    del_timer_sync+0x1c
 0.00%    0%  0.8us( 1.7us)    0us                      197  100%    0%    0%    mod_timer+0x18
 0.02%    0%  1.5us(  15us)    0us                     3099  100%    0%    0%    timer_bh+0xcc
 0.00%    0%  0.2us( 0.9us)    0us                      117  100%    0%    0%    timer_bh+0x254

 0.00% 0.02%  0.1us( 0.9us)  1.4us( 1.4us)(0.00%)      5566  100% 0.02%    0%  tqueue_lock
 0.00%    0%  0.1us( 0.8us)    0us                     2259  100%    0%    0%    __run_task_queue+0x14
 0.00%    0%  0.1us( 0.8us)    0us                     1067  100%    0%    0%    batch_entropy_store+0x7c
 0.00% 0.05%  0.1us( 0.6us)  1.4us( 1.4us)(0.00%)      2064  100% 0.05%    0%    generic_plug_device+0x34
 0.00%    0%  0.2us( 0.9us)    0us                      176  100%    0%    0%    schedule_task+0x28

 0.00%    0%  0.9us( 1.6us)    0us                        3  100%    0%    0%  uidhash_lock
 0.00%    0%  1.6us( 1.6us)    0us                        1  100%    0%    0%    alloc_uid+0x10
 0.00%    0%  0.1us( 0.1us)    0us                        1  100%    0%    0%    alloc_uid+0x94
 0.00%    0%  1.0us( 1.0us)    0us                        1  100%    0%    0%    free_uid+0x28

  1.5% 0.12%  0.4us( 192us)  2.0us( 147us)(0.00%)   1269889 99.9% 0.12%    0%  unused_list_lock
 0.37% 0.19%  0.2us( 192us)  2.0us( 142us)(0.00%)    635121 99.8% 0.19%    0%    get_unused_buffer_head+0x8
 0.00% 0.22%  0.1us( 1.3us)  1.0us( 2.4us)(0.00%)      1786 99.8% 0.22%    0%    put_unused_buffer_head+0xc
  1.2% 0.05%  0.6us( 157us)  2.0us( 147us)(0.00%)    632982  100% 0.05%    0%    try_to_free_buffers+0x54

 0.00%    0%  0.3us( 1.1us)    0us                        8  100%    0%    0%  __kmem_cache_shrink+0x18
 0.00%    0%  0.1us( 0.3us)    0us                      142  100%    0%    0%  __kmem_cache_shrink+0x48
 0.71% 0.01%  0.1us(  34us)  1.8us( 8.9us)(0.00%)   2045139  100% 0.01%    0%  __wake_up+0x24
 0.00%    0%  0.1us( 0.6us)    0us                     8216  100%    0%    0%  add_wait_queue+0x10
 0.00%  6.8%  0.1us( 1.3us)  2.9us( 8.4us)(0.00%)      1490 93.2%  6.8%    0%  add_wait_queue_exclusive+0x10
 0.50% 0.08%  4.3us( 124us)  2.8us(  10us)(0.00%)     35856  100% 0.08%    0%  ahc_linux_isr+0x24
 0.28% 0.21%  3.6us(  34us)  6.4us(  93us)(0.00%)     24068 99.8% 0.21%    0%  ahc_linux_queue+0x34
 0.00%    0%  0.7us( 1.6us)    0us                       24  100%    0%    0%  change_protection+0x34
 0.00%    0%   11us(  15us)    0us                        9  100%    0%    0%  clear_page_tables+0x1c
 0.00%    0%  0.1us( 0.4us)    0us                       63  100%    0%    0%  copy_mm+0x1e8
 0.00%    0%  2.8us(  48us)    0us                       84  100%    0%    0%  copy_mm+0x230
 0.00%    0%  3.9us(  48us)    0us                       84  100%    0%    0%  copy_page_range+0x100
 0.02%    0%  0.3us( 2.2us)    0us                    27234  100%    0%    0%  do_IRQ+0x40
 0.02%    0%  0.3us( 2.4us)    0us                    27234  100%    0%    0%  do_IRQ+0xc0
 0.00%    0%  0.9us( 5.2us)    0us                      617  100%    0%    0%  do_anonymous_page+0x5c
 0.00%    0%  0.4us( 0.9us)    0us                        5  100%    0%    0%  do_brk+0x1d4
 0.00%    0%  0.1us( 0.1us)    0us                        5  100%    0%    0%  do_exit+0x124
 0.00%    0%  0.1us( 0.3us)    0us                        5  100%    0%    0%  do_exit+0x84
 0.00%    0%  0.1us( 0.1us)    0us                        5  100%    0%    0%  do_exit+0xf4
 0.00%    0%  0.6us( 2.7us)    0us                       86  100%    0%    0%  do_mmap_pgoff+0x40c
 0.00%    0%  0.5us( 3.8us)    0us                      168  100%    0%    0%  do_mmap_pgoff+0x418
 0.00%    0%  0.1us( 0.8us)    0us                       38  100%    0%    0%  do_munmap+0x1b8
 0.00%    0%  0.6us( 5.4us)    0us                      114  100%    0%    0%  do_munmap+0xe0
 0.00%    0%  0.1us( 1.3us)    0us                     1025  100%    0%    0%  do_no_page+0xdc
 0.00%    0%  0.3us( 1.0us)    0us                       14  100%    0%    0%  do_page_fault+0xe0
 0.00%    0%  0.4us( 4.9us)    0us                      114  100%    0%    0%  do_sigaction+0x58
 0.00%    0%  0.6us( 1.5us)    0us                        9  100%    0%    0%  do_sigaction+0xd8
 0.00%    0%  2.6us( 6.2us)    0us                        7  100%    0%    0%  do_signal+0x54
 0.00%    0%  1.1us( 4.2us)    0us                      205  100%    0%    0%  do_wp_page+0x118
 0.00%    0%  0.1us( 0.1us)    0us                        9  100%    0%    0%  exit_mmap+0x18
 0.00%    0%  0.1us( 0.7us)    0us                      143  100%    0%    0%  exit_mmap+0x88
 0.00%    0%  1.7us( 2.8us)    0us                        5  100%    0%    0%  exit_sighand+0x18
 0.13% 0.02%  7.9us(  30us)  1.2us( 1.2us)(0.00%)      5232  100% 0.02%    0%  free_block+0x1c
 0.00%    0%  0.3us(  17us)    0us                     2025  100%    0%    0%  handle_mm_fault+0x34
 0.00%    0%  0.4us( 0.7us)    0us                        5  100%    0%    0%  handle_signal+0xb0
 0.00%    0%  1.7us( 3.3us)    0us                        5  100%    0%    0%  insert_vm_struct+0x60
 0.00%    0%  0.1us( 0.3us)    0us                      182  100%    0%    0%  interruptible_sleep_on+0x28
 0.00%    0%  0.1us( 0.2us)    0us                      182  100%    0%    0%  interruptible_sleep_on+0x54
 0.19% 0.02%  2.9us(  51us)  2.1us( 2.4us)(0.00%)     19844  100% 0.02%    0%  kmem_cache_alloc_batch+0x18
 0.00%    0%  0.1us( 1.3us)    0us                     6765  100%    0%    0%  kmem_cache_grow+0x1d4
 0.01% 0.01%  0.3us( 1.5us)  1.5us( 1.5us)(0.00%)      6765  100% 0.01%    0%  kmem_cache_grow+0x80
 0.00%    0%  0.1us( 0.6us)    0us                      160  100%    0%    0%  kmem_cache_reap+0x25c
 0.00% 0.07%  0.1us( 0.6us)   10us(  15us)(0.00%)      7063  100% 0.07%    0%  kmem_cache_reap+0x2c4
 0.72% 0.00%  1.2us( 327us)  3.0us( 6.9us)(0.00%)    194356  100% 0.00%    0%  kmem_cache_reap+0xa4
 0.00%    0%  1.0us( 2.0us)    0us                       24  100%    0%    0%  mprotect_fixup+0x2a8
 0.00%    0%  0.7us( 1.5us)    0us                       24  100%    0%    0%  mprotect_fixup+0x2b4
 0.00%    0%  5.0us(  41us)    0us                       27  100%    0%    0%  pte_alloc+0x88
 0.00%    0%  0.3us( 0.8us)    0us                        5  100%    0%    0%  put_dirty_page+0x3c
 0.00%    0%  0.1us( 0.4us)    0us                        5  100%    0%    0%  release_task+0x3c
 0.00%  2.9%  0.1us( 1.2us)  3.5us( 9.2us)(0.00%)      9706 97.1%  2.9%    0%  remove_wait_queue+0x10
 0.07%    0%  1.1us( 263us)    0us                    18189  100%    0%    0%  schedule+0x478
 0.00%    0%  1.7us( 7.4us)    0us                        5  100%    0%    0%  schedule_tail+0x20
 0.00%    0%   13us(  31us)    0us                        6  100%    0%    0%  send_sig_info+0x4c
 0.00%    0%  0.1us( 0.4us)    0us                       40  100%    0%    0%  sleep_on+0x28
 0.00%    0%  0.1us( 0.8us)    0us                       40  100%    0%    0%  sleep_on+0x54
 0.01%    0%   57us( 546us)    0us                       45  100%    0%    0%  swap_out+0xc8
 0.00%    0%  0.1us( 0.1us)    0us                        9  100%    0%    0%  sys_rt_sigprocmask+0x18c
 0.00%    0%  0.3us( 1.8us)    0us                       69  100%    0%    0%  sys_rt_sigprocmask+0x98
 0.00%    0%  0.4us( 0.7us)    0us                        5  100%    0%    0%  sys_sigreturn+0x84
 0.00%    0%  1.1us( 1.7us)    0us                        8  100%    0%    0%  unmap_fixup+0xa8
 0.00%    0%  0.6us( 1.0us)    0us                        8  100%    0%    0%  unmap_fixup+0xb8
 0.00%    0%  0.1us( 4.1us)    0us                      249  100%    0%    0%  vma_merge+0x54
 0.01%    0%  5.9us( 398us)    0us                      300  100%    0%    0%  zap_page_range+0x48

- - - - - - - - - - - -  - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
RWLOCK READS   HOLD    MAX  RDR BUSY PERIOD      WAIT
  UTIL  CON    MEAN   RDRS   MEAN(  MAX )   MEAN(  MAX )( %CPU)     TOTAL NOWAIT SPIN  NAME

       0.68%                               2.1us( 229us)(0.11%)   5063033 99.3% 0.68%  *TOTAL*

 0.00%    0%   3.9us     1  3.9us(  19us)    0us                        3  100%    0%  [0xe6f1b044]
          0%                                 0us                        3  100%    0%    copy_files+0x158

 0.00%    0%   0.2us     1  0.2us( 5.8us)    0us                        1  100%    0%  [0xf417a044]
          0%                                 0us                        1  100%    0%    do_fcntl+0x104

 0.00%    0%   1.2us     1  1.2us( 1.2us)    0us                        1  100%    0%  [0xf703f224]
          0%                                 0us                        1  100%    0%    unix_write_space+0x14

 0.00%    0%   1.3us     1  1.3us( 1.3us)    0us                        1  100%    0%  [0xf703f448]
          0%                                 0us                        1  100%    0%    unix_dgram_sendmsg+0x80

 0.00%    0%  10.1us     1   10us(  10us)    0us                        1  100%    0%  [0xf703f564]
          0%                                 0us                        1  100%    0%    sock_def_readable+0x14

 0.00%    0%   4.5us     1  4.5us( 4.5us)    0us                        1  100%    0%  [0xf703f788]
          0%                                 0us                        1  100%    0%    unix_dgram_sendmsg+0x21c

 0.00%    0%   0.2us     1  0.2us( 3.3us)    0us                        1  100%    0%  [0xf7658d24]
          0%                                 0us                        1  100%    0%    sys_getcwd+0x38

 0.00%    0%   0.7us     1  0.7us( 0.9us)    0us                        2  100%    0%  arp_tbl+0xc4
          0%                                 0us                        2  100%    0%    neigh_lookup+0x40

 0.00%    0%   0.9us     1  0.9us( 1.3us)    0us                        5  100%    0%  binfmt_lock
          0%                                 0us                        5  100%    0%    search_binary_handler+0x38

 0.00%    0%   1.3us     1  1.3us( 1.3us)    0us                        1  100%    0%  chrdevs_lock
          0%                                 0us                        1  100%    0%    get_chrfops+0x28

 0.00%    0%   3.4us     1  3.4us( 4.2us)    0us                        4  100%    0%  fib_hash_lock
          0%                                 0us                        4  100%    0%    fn_hash_lookup+0x10

  2.9% 0.72%   0.2us     2  0.2us( 288us)  2.1us( 229us)(0.11%)   4745142 99.3% 0.72%  hash_table_lock
       0.72%                               2.1us( 229us)(0.11%)   4745142 99.3% 0.72%    get_hash_table+0x60

 0.00%    0%   0.8us     1  0.8us( 1.4us)    0us                        4  100%    0%  inetdev_lock
          0%                                 0us                        2  100%    0%    arp_rcv+0x28
          0%                                 0us                        2  100%    0%    ip_route_input_slow+0x18

 0.01%    0%  28.0us     2   28us(  80us)    0us                       69  100%    0%  tasklist_lock
          0%                                 0us                        6  100%    0%    count_active_tasks+0xc
          0%                                 0us                        5  100%    0%    exit_notify+0x18
          0%                                 0us                       43  100%    0%    schedule+0x218
          0%                                 0us                        1  100%    0%    sys_setsid+0x10
          0%                                 0us                       14  100%    0%    sys_wait4+0x8c

 0.00%    0%   2.1us     1  2.1us( 3.2us)    0us                        3  100%    0%  udp_hash_lock
          0%                                 0us                        3  100%    0%    udp_v4_mcast_deliver+0x10

 0.00%    0%   0.6us     1  0.6us( 1.4us)    0us                       36  100%    0%  xtime_lock
          0%                                 0us                       36  100%    0%    do_gettimeofday+0x14

          0%                                 0us                        5  100%    0%  copy_files+0x100
          0%                                 0us                        5  100%    0%  do_fork+0x35c
          0%                                 0us                       14  100%    0%  do_select+0x24
          0%                                 0us                   316664  100%    0%  fget+0x1c
          0%                                 0us                        5  100%    0%  ip_route_input+0x88
          0%                                 0us                        5  100%    0%  net_rx_action+0x48
          0%                                 0us                        4  100%    0%  path_init+0x114
          0%                                 0us                     1056  100%    0%  path_init+0x30

- - - - - - - - - - -  - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
RWLOCK WRITES     HOLD           WAIT (ALL)           WAIT (WW) 
  UTIL  CON    MEAN(  MAX )   MEAN(  MAX )( %CPU)   MEAN(  MAX )     TOTAL NOWAIT SPIN(  WW )  NAME

        4.9%  1.4us( 217us)  1.2us( 218us)(0.06%)  0.3us(  34us)    643483 95.1%  3.5%( 1.4%)  *TOTAL*

 0.00%    0%  0.6us( 1.3us)    0us                   0us                 6  100%    0%(   0%)  [0xe6f1b204]
 0.00%    0%  0.5us( 1.3us)    0us                   0us                 3  100%    0%(   0%)    copy_files+0x12c
 0.00%    0%  0.6us( 0.7us)    0us                   0us                 3  100%    0%(   0%)    expand_fd_array+0x88

 0.00%    0%  4.4us(  21us)    0us                   0us                 5  100%    0%(   0%)  [0xf417a044]
 0.00%    0%  0.1us( 0.1us)    0us                   0us                 2  100%    0%(   0%)    do_fcntl+0x140
 0.00%    0%  7.3us(  21us)    0us                   0us                 3  100%    0%(   0%)    sys_dup2+0x2c

 0.00%    0%  0.1us( 0.1us)    0us                   0us                 4  100%    0%(   0%)  [0xf50ce044]
 0.00%    0%  0.1us( 0.1us)    0us                   0us                 2  100%    0%(   0%)    do_pipe+0x174
 0.00%    0%  0.1us( 0.1us)    0us                   0us                 2  100%    0%(   0%)    do_pipe+0x1a4

 0.00%    0%  0.7us( 0.7us)    0us                   0us                 1  100%    0%(   0%)  [0xf7658d24]
 0.00%    0%  0.7us( 0.7us)    0us                   0us                 1  100%    0%(   0%)    sys_chdir+0x9c

 0.00%    0%  0.1us( 0.1us)    0us                   0us                 1  100%    0%(   0%)  [0xf7df24f4]
 0.00%    0%  0.1us( 0.1us)    0us                   0us                 1  100%    0%(   0%)    neigh_destroy+0x8c

 0.00%    0%  104us( 104us)    0us                   0us                 1  100%    0%(   0%)  arp_tbl+0xc4
 0.00%    0%  104us( 104us)    0us                   0us                 1  100%    0%(   0%)    neigh_periodic_timer__thr+0x20

 0.00%    0%  0.5us( 0.6us)    0us                   0us                 2  100%    0%(   0%)  dn_lock
 0.00%    0%  0.5us( 0.6us)    0us                   0us                 2  100%    0%(   0%)    fcntl_dirnotify+0x94

  2.9%  5.0%  1.4us( 217us)  1.2us( 218us)(0.06%)  0.3us(  34us)    634589 95.0%  3.6%( 1.4%)  hash_table_lock
 0.00% 0.59%  1.3us(  30us)   17us(  34us)(0.00%)  8.8us(  34us)      1011 99.4% 0.10%(0.49%)    hash_page_buffers+0x48
  2.9%  5.0%  1.4us( 217us)  1.2us( 218us)(0.06%)  0.2us( 4.9us)    633578 95.0%  3.6%( 1.4%)    try_to_free_buffers+0x28

 0.00%    0%  7.6us(  38us)    0us                   0us                15  100%    0%(   0%)  tasklist_lock
 0.00%    0%  2.2us( 2.7us)    0us                   0us                 5  100%    0%(   0%)    do_fork+0x530
 0.00%    0%   20us(  38us)    0us                   0us                 5  100%    0%(   0%)    exit_notify+0x1b0
 0.00%    0%  0.6us( 1.0us)    0us                   0us                 5  100%    0%(   0%)    release_task+0x7c

 0.00%    0%   20us(  40us)    0us                   0us                 4  100%    0%(   0%)  vmlist_lock
 0.00%    0%  4.4us( 5.2us)    0us                   0us                 2  100%    0%(   0%)    get_vm_area+0x3c
 0.00%    0%   35us(  40us)    0us                   0us                 2  100%    0%(   0%)    vfree+0x58

 0.12%    0%  5.9us(  19us)    0us                   0us              6198  100%    0%(   0%)  xtime_lock
 0.02%    0%  1.7us( 6.7us)    0us                   0us              3099  100%    0%(   0%)    timer_bh+0xc
 0.10%    0%   10us(  19us)    0us                   0us              3099  100%    0%(   0%)    timer_interrupt+0x10

 0.00%    0%  0.9us( 1.2us)    0us                   0us                 5  100%    0%(   0%)  flush_old_exec+0x22c
 0.00%    0%  0.2us( 3.3us)    0us                   0us               813  100%    0%(   0%)  get_unused_fd+0x24
 0.00%    0%  0.1us( 0.1us)    0us                   0us                 5  100%    0%(   0%)  load_elf_binary+0x190
 0.00%    0%  0.4us( 0.5us)    0us                   0us                 4  100%    0%(   0%)  neigh_periodic_timer__thr+0xa8
 0.00%    0%  0.1us( 4.1us)    0us                   0us               820  100%    0%(   0%)  rt_check_expire__thr+0x64
 0.00%    0%  0.2us( 2.6us)    0us                   0us               206  100%    0%(   0%)  sys_close+0x1c
 0.00%    0%  0.1us( 1.7us)    0us                   0us               188  100%    0%(   0%)  sys_open+0x60
 0.00%    0%  0.1us( 0.8us)    0us                   0us               616  100%    0%(   0%)  sys_open+0xa8
_________________________________________________________________________________________________________________________
Number of read locks found=16
Lockmeter statistics are now OFF

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: ext3 performance bottleneck as the number of spindles gets large
  2002-06-19 21:29 ext3 performance bottleneck as the number of spindles gets large mgross
@ 2002-06-20  0:54 ` Andrew Morton
  2002-06-20  4:09   ` [Lse-tech] " Dave Hansen
  2002-06-20  9:54   ` Stephen C. Tweedie
  2002-06-20  1:55 ` Andrew Morton
  1 sibling, 2 replies; 24+ messages in thread
From: Andrew Morton @ 2002-06-20  0:54 UTC (permalink / raw)
  To: mgross; +Cc: Linux Kernel Mailing List, lse-tech, richard.a.griffiths

mgross wrote:
> 
> ...
> Has anyone done any work looking into the I/O scaling of Linux / ext3 per
> spindle or per adapter?  We would like to compare notes.

No.  ext3 scalability is very poor, I'm afraid.  The fs really wasn't
up and running until kernel 2.4.5 and we just didn't have time to
address that issue.
 
> I've only just started to look at the ext3 code, but it seems to me that replacing the
> BKL with a per-ext3-filesystem lock could remove some of the contention that's
> getting measured.  What data is the BKL protecting in these ext3 functions?  Could a
> lock-per-FS approach work?

The vague plan there is to replace lock_kernel with lock_journal
where appropriate.  But ext3 scalability work of this nature
will be targeted at the 2.5 kernel, most probably.

I'll take a look, see if there's any low-hanging fruit in there,
but I doubt that the results will be fantastic.

-


* Re: ext3 performance bottleneck as the number of spindles gets large
  2002-06-19 21:29 ext3 performance bottleneck as the number of spindles gets large mgross
  2002-06-20  0:54 ` Andrew Morton
@ 2002-06-20  1:55 ` Andrew Morton
  2002-06-20  6:05   ` Jens Axboe
  1 sibling, 1 reply; 24+ messages in thread
From: Andrew Morton @ 2002-06-20  1:55 UTC (permalink / raw)
  To: mgross; +Cc: Linux Kernel Mailing List, lse-tech, richard.a.griffiths

mgross wrote:
> 
> We've been doing some throughput comparisons and benchmarks of block I/O
> throughput for 8KB writes as the number of SCSI adapters and drives per
> adapter is increased.
> 
> The Linux platform is a dual-processor 1.2GHz PIII, 2GB of RAM, 2U box.
> Similar results have been seen with both the 2.4.16 and 2.4.18 base kernels, as
> well as one of those patched-up O(1) 2.4.18 kernels out there.

umm.  Are you not using block-highmem?  That is a must-have.

http://www.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.19pre9aa2/00_block-highmem-all-18b-12.gz

-


* Re: [Lse-tech] Re: ext3 performance bottleneck as the number of spindles gets large
  2002-06-20  0:54 ` Andrew Morton
@ 2002-06-20  4:09   ` Dave Hansen
  2002-06-20  6:03     ` Andreas Dilger
  2002-06-20  9:54   ` Stephen C. Tweedie
  1 sibling, 1 reply; 24+ messages in thread
From: Dave Hansen @ 2002-06-20  4:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: mgross, Linux Kernel Mailing List, lse-tech, richard.a.griffiths

Andrew Morton wrote:
> mgross wrote:
>>Has anyone done any work looking into the I/O scaling of Linux / ext3 per
>>spindle or per adapter?  We would like to compare notes.
> 
> No.  ext3 scalability is very poor, I'm afraid.  The fs really wasn't
> up and running until kernel 2.4.5 and we just didn't have time to
> address that issue.

Ick.  That takes the prize for the highest BKL contention I've ever 
seen, except for some horribly contrived torture tests of mine.  I've 
had data like this sent to me a few times to analyze and the only 
thing I've been able to suggest up to this point is not to use ext3.

>>I've only just started to look at the ext3 code, but it seems to me that replacing the
>>BKL with a per-ext3-filesystem lock could remove some of the contention that's
>>getting measured.  What data is the BKL protecting in these ext3 functions?  Could a
>>lock-per-FS approach work?
> 
> The vague plan there is to replace lock_kernel with lock_journal
> where appropriate.  But ext3 scalability work of this nature
> will be targeted at the 2.5 kernel, most probably.

I really doubt that dropping in lock_journal will help this case very 
much.  Every single kernel_flag entry in the lockmeter output where 
Util > 0.00% is caused by ext3.  The schedule entry is probably caused 
by something in ext3 grabbing BKL, getting scheduled out for some 
reason, then having it implicitly released in schedule().  The 
schedule() contention comes from the reacquire_kernel_lock().

We used to see plenty of ext2 BKL contention, but Al Viro did a good 
job fixing that early in 2.5 using a per-inode rwlock.  I think that 
this is the required level of lock granularity, another global lock 
just won't cut it.
http://lse.sourceforge.net/lockhier/bkl_rollup.html#getblock
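The per-inode idea can be sketched in a few lines of user-space Python
(illustrative only -- the names and structure here are invented, not the
actual 2.5 code): each inode gets its own lock, so get_block-style work
on different inodes stops contending on one global lock.

```python
import threading

class PerInodeLocks:
    """One lock per inode instead of a single global (BKL-style) lock:
    work against different inodes no longer contends at all."""
    def __init__(self):
        self._locks = {}
        self._table_lock = threading.Lock()  # guards only the lock table itself

    def lock_for(self, ino):
        # Short critical section: look up (or lazily create) this inode's lock.
        with self._table_lock:
            return self._locks.setdefault(ino, threading.Lock())

inode_locks = PerInodeLocks()
with inode_locks.lock_for(42):   # e.g. block-mapping work for inode 42
    pass                         # another thread can work on inode 43 meanwhile
```

The only shared state is the tiny lock-table lookup, which is held for
nanoseconds, instead of the whole block-mapping path.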

-- 
Dave Hansen
haveblue@us.ibm.com



* Re: [Lse-tech] Re: ext3 performance bottleneck as the number of spindles gets large
  2002-06-20  4:09   ` [Lse-tech] " Dave Hansen
@ 2002-06-20  6:03     ` Andreas Dilger
  2002-06-20  6:53       ` Andrew Morton
  0 siblings, 1 reply; 24+ messages in thread
From: Andreas Dilger @ 2002-06-20  6:03 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Morton, mgross, Linux Kernel Mailing List, lse-tech,
	richard.a.griffiths

On Jun 19, 2002  21:09 -0700, Dave Hansen wrote:
> Andrew Morton wrote:
> >The vague plan there is to replace lock_kernel with lock_journal
> >where appropriate.  But ext3 scalability work of this nature
> >will be targeted at the 2.5 kernel, most probably.
> 
> I really doubt that dropping in lock_journal will help this case very 
> much.  Every single kernel_flag entry in the lockmeter output where 
> Util > 0.00% is caused by ext3.  The schedule entry is probably caused 
> by something in ext3 grabbing BKL, getting scheduled out for some 
> reason, then having it implicitly released in schedule().  The 
> schedule() contention comes from the reacquire_kernel_lock().
> 
> We used to see plenty of ext2 BKL contention, but Al Viro did a good 
> job fixing that early in 2.5 using a per-inode rwlock.  I think that 
> this is the required level of lock granularity, another global lock 
> just won't cut it.
> http://lse.sourceforge.net/lockhier/bkl_rollup.html#getblock

There are a variety of different efforts that could be made towards
removing the BKL from ext2 and ext3.  The first, of course, would be
to have a per-filesystem lock instead of taking the BKL (I don't know
if Al has changed lock_super() in 2.5 to be a real semaphore or not).
As Andrew mentioned, there would also need to be a per-journal lock to
ensure coherency of the journal data.  Currently the per-filesystem and
per-journal lock would be equivalent, but when a single journal device
can be shared among multiple filesystems they would be different locks.

I will leave it up to Andrew and Stephen to discuss locking scalability
within the journal layer.

Within the filesystem there can be a large number of increasingly fine
locks added - a superblock-only lock with per-group locks, or even
per-bitmap and per-inode-table(-block) locks if needed.  This would
allow multi-threaded inode and block allocations, but a sane lock
ranking strategy would have to be developed.  The bitmap locks would
only need to be 2-state locks, because you only look at the bitmaps
when you want to modify them.  The inode table locks would be read/write
locks.

If there is a try-writelock mechanism for the individual inode table
blocks you can avoid write lock contention for creations by simply
finding the first un-write-locked block in the target group's inode table
(usually in the hundreds of blocks per group for default parameters).
For inode allocation you don't really care which inode you get, as long
as you get one in the preferred group (even that isn't critical for
directory creation).  For inode deletions you will get essentially random
block locking, which is actually improved by the find-first-unlocked
allocation policy (at the expense of dirtying more inode table blocks).
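A user-space sketch of that find-first-unlocked scan (illustrative
Python with invented names -- real code would use the kernel's
try-writelock primitive, not threading.Lock):

```python
import threading

class InodeTableGroup:
    """Toy model of one block group's inode table: each block gets its
    own lock, standing in for a per-block write lock."""
    def __init__(self, nblocks):
        self.locks = [threading.Lock() for _ in range(nblocks)]

    def first_unlocked_block(self):
        """Scan for the first block whose write lock is free, take it,
        and return its index; -1 if every block in the group is busy."""
        for i, lock in enumerate(self.locks):
            if lock.acquire(blocking=False):   # try-writelock: never spins
                try:
                    return i                   # allocate an inode in block i here
                finally:
                    lock.release()
        return -1

group = InodeTableGroup(4)
group.locks[0].acquire()   # pretend other CPUs hold blocks 0 and 1
group.locks[1].acquire()
print(group.first_unlocked_block())   # -> 2
```

Since any free inode in the preferred group will do, skipping past busy
blocks never blocks a creator; it just spreads creations across the table.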

Contention for the superblock lock for updates to the superblock free
block and free inode counts could be mitigated by keeping "per-group
delta buckets" in memory, that are written into the superblock only
once every few seconds or at statfs time instead of needing multiple
locks for each block/inode alloc/free.  The groups already keep their
own summary counts for free blocks and inodes.  The coherency of these
fields with the superblock on recovery would be handled at journal
recovery time (either in the kernel or e2fsck*).  Other than these two
fields there are few write updates to the superblock (on ext3 there
is also the orphan list, modified at truncate and when an open file is
unlinked and when such a file is closed).
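The delta-bucket idea, sketched in illustrative user-space Python (names
are invented; real code would live in the ext2/ext3 allocation paths):

```python
import threading

class GroupDeltas:
    """Per-group delta buckets for the superblock free-block count.
    Each alloc/free touches only its own group's bucket; the shared
    superblock field is folded up only at flush (periodic/statfs) time."""
    def __init__(self, ngroups, sb_free_blocks):
        self.sb_free_blocks = sb_free_blocks
        self.deltas = [0] * ngroups
        self.locks = [threading.Lock() for _ in range(ngroups)]

    def alloc(self, group, nblocks):
        with self.locks[group]:        # group-local lock, no superblock lock
            self.deltas[group] -= nblocks

    def free(self, group, nblocks):
        with self.locks[group]:
            self.deltas[group] += nblocks

    def flush(self):
        """Fold every group's delta into the superblock count."""
        for g, lock in enumerate(self.locks):
            with lock:
                self.sb_free_blocks += self.deltas[g]
                self.deltas[g] = 0
        return self.sb_free_blocks

sb = GroupDeltas(ngroups=4, sb_free_blocks=10000)
sb.alloc(0, 8)
sb.alloc(3, 2)
sb.free(0, 1)
print(sb.flush())   # -> 9991
```

The hot paths only ever take a group-local lock; the superblock field is
written once per flush interval instead of once per block alloc/free.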

I have even been thinking about multi-threaded directory-entry creation
in a single directory.  One nice thing about ext2/ext3 directory blocks
is that each one is self-contained and can be modified independently.
For regular ext2/ext3 directories you would only be able to do
multi-threaded deletes by having a lock for each directory block.
For creations you would need to lock the entire directory to ensure
exclusive access for a create, which is the same single-threaded behaviour
for a single directory we have today with the directory i_sem.

However, if you are using the htree indexed directory layout (which you
will be, if you care about scalable filesystem performance) then there
is only a single[**] block into which a given filename can be added, so
you can have per-block locks even for file creation.  As the number of
directory entries grows (and hence more directory blocks) the locking
becomes increasingly more fine-grained so you get better scalability
with larger directories, which is what you want.
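A toy model of the per-block create locking that the htree layout makes
possible (illustrative Python; crc32 merely stands in for the real htree
hash -- determinism is the only property used here):

```python
import threading
import zlib

class HtreeDirectory:
    """Toy htree-style directory: a filename's hash selects exactly one
    directory block, so a create needs only that block's lock rather
    than a whole-directory semaphore (i_sem)."""
    def __init__(self, nblocks):
        self.blocks = [[] for _ in range(nblocks)]
        self.locks = [threading.Lock() for _ in range(nblocks)]

    def block_for(self, name):
        # Stand-in for the htree hash: deterministic name -> block mapping.
        return zlib.crc32(name.encode()) % len(self.blocks)

    def create(self, name):
        b = self.block_for(name)
        with self.locks[b]:            # creates in other blocks run in parallel
            self.blocks[b].append(name)
        return b
```

As the directory grows and splits into more blocks, the same scheme
automatically gives finer-grained locking, which is the scaling property
described above.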

Cheers, Andreas
[*]  If we think that we will go to any kind of per-group locking in the
     near future, the support for this could be added into e2fsck and
     existing kernels today with read support for a COMPAT flag to
     ensure maximal forwards compatibility.  On e2fsck runs we already
     validate the superblock on each boot, and the group descriptor table
     is contiguous with the superblock, so the amount of extra checking
     at boot time would be very minimal.

     The kernel already has ext[23]_count_free_{blocks,inodes} functions
     that just need a bit of tweaking to check only the descriptor
     summaries unless mounted with debug and check options, and to update
     the superblock counts at mount time if the COMPAT flag is set.

[**] In rare circumstances you may have a large number of hash collisions
     for a single hash value which fill more than one block, so an entry
     with that hash value could live in more than a single block.  This
     would need to be handled somehow (e.g. always getting the locks on
     all such blocks in order at create time; you only need a single
     block lock at delete time).
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/



* Re: ext3 performance bottleneck as the number of spindles gets large
  2002-06-20  1:55 ` Andrew Morton
@ 2002-06-20  6:05   ` Jens Axboe
  0 siblings, 0 replies; 24+ messages in thread
From: Jens Axboe @ 2002-06-20  6:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: mgross, Linux Kernel Mailing List, lse-tech, richard.a.griffiths

On Wed, Jun 19 2002, Andrew Morton wrote:
> mgross wrote:
> > 
> > We've been doing some throughput comparisons and benchmarks of block I/O
> > throughput for 8KB writes as the number of SCSI adapters and drives per
> > adapter is increased.
> > 
> > The Linux platform is a dual-processor 1.2GHz PIII, 2GB of RAM, 2U box.
> > Similar results have been seen with both the 2.4.16 and 2.4.18 base kernels, as
> > well as one of those patched-up O(1) 2.4.18 kernels out there.
> 
> umm.  Are you not using block-highmem?  That is a must-have.
> 
> http://www.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.19pre9aa2/00_block-highmem-all-18b-12.gz

please use

http://www.kernel.org/pub/linux/kernel/people/axboe/patches/v2.4/2.4.19-pre10/block-highmem-all-19.bz2

-- 
Jens Axboe



* Re: [Lse-tech] Re: ext3 performance bottleneck as the number of spindles  gets large
  2002-06-20  6:03     ` Andreas Dilger
@ 2002-06-20  6:53       ` Andrew Morton
  0 siblings, 0 replies; 24+ messages in thread
From: Andrew Morton @ 2002-06-20  6:53 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Dave Hansen, mgross, Linux Kernel Mailing List, lse-tech,
	richard.a.griffiths

Andreas Dilger wrote:
> 
> On Jun 19, 2002  21:09 -0700, Dave Hansen wrote:
> > Andrew Morton wrote:
> > >The vague plan there is to replace lock_kernel with lock_journal
> > >where appropriate.  But ext3 scalability work of this nature
> > >will be targeted at the 2.5 kernel, most probably.
> >
> > I really doubt that dropping in lock_journal will help this case very
> > much.  Every single kernel_flag entry in the lockmeter output where
> > Util > 0.00% is caused by ext3.  The schedule entry is probably caused
> > by something in ext3 grabbing BKL, getting scheduled out for some
> > reason, then having it implicitly released in schedule().  The
> > schedule() contention comes from the reacquire_kernel_lock().
> >
> > We used to see plenty of ext2 BKL contention, but Al Viro did a good
> > job fixing that early in 2.5 using a per-inode rwlock.  I think that
> > this is the required level of lock granularity, another global lock
> > just won't cut it.
> > http://lse.sourceforge.net/lockhier/bkl_rollup.html#getblock
> 
> There are a variety of different efforts that could be made towards
> removing the BKL from ext2 and ext3.  The first, of course, would be
> to have a per-filesystem lock instead of taking the BKL (I don't know
> if Al has changed lock_super() in 2.5 to be a real semaphore or not).

lock_super() has been `down()' for a long time.  In 2.4, too.

> As Andrew mentioned, there would also need to be a per-journal lock to
> ensure coherency of the journal data.  Currently the per-filesystem and
> per-journal lock would be equivalent, but when a single journal device
> can be shared among multiple filesystems they would be different locks.

Well.  First I want to know if block-highmem is in there.  If not,
then yep, we'll spend ages spinning on the BKL.  Because ext3 _is_
BKL-happy, and if a CPU takes a disk interrupt while holding the BKL
and then sits there in interrupt context copying tons of cache-cold
memory around, guess what the other CPUs will be doing?

> I will leave it up to Andrew and Stephen to discuss locking scalability
> within the journal layer.

ext3 is about 700x as complex as ext2.  It will need to be done with
some care.
 
> Within the filesystem there can be a large number of increasingly fine
> locks added - a superblock-only lock with per-group locks, or even
> per-bitmap and per-inode-table(-block) locks if needed.  This would
> allow multi-threaded inode and block allocations, but a sane lock
> ranking strategy would have to be developed.  The bitmap locks would
> only need to be 2-state locks, because you only look at the bitmaps
> when you want to modify them.  The inode table locks would be read/write
> locks.

The next steps for ext2 are: stare at Anton's next set of graphs and
then, I expect, removal of the fs-private bitmap LRUs, per-cpu buffer
LRUs to avoid blockdev mapping lock contention,  per-blockgroup locks
and removal of lock_super from the block allocator.

But there's no point in doing that while zone->lock and pagemap_lru_lock
are top of the list.  Fixes for both of those are in progress.

ext2 is bog-simple.  It will scale up the wazoo in 2.6.
 
> If there is a try-writelock mechanism for the individual inode table
> blocks you can avoid write lock contention for creations by simply
> finding the first un-write-locked block in the target group's inode table
> (usually in the hundreds of blocks per group for default parameters).

Depends on what the profiles say, Andreas.  And I mean profiles - lockmeter
tends to tell you "what", not "why".   Start at the top of the list.  Fix
them by design if possible.  If not, tweak it!


-


* Re: ext3 performance bottleneck as the number of spindles gets large
  2002-06-20  0:54 ` Andrew Morton
  2002-06-20  4:09   ` [Lse-tech] " Dave Hansen
@ 2002-06-20  9:54   ` Stephen C. Tweedie
  1 sibling, 0 replies; 24+ messages in thread
From: Stephen C. Tweedie @ 2002-06-20  9:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: mgross, Linux Kernel Mailing List, lse-tech, richard.a.griffiths,
	ext2-devel

Hi,

On Wed, Jun 19, 2002 at 05:54:46PM -0700, Andrew Morton wrote:

> The vague plan there is to replace lock_kernel with lock_journal
> where appropriate.  But ext3 scalability work of this nature
> will be targeted at the 2.5 kernel, most probably.

I think we can do better than that, with care.  lock_journal could
easily become a read/write lock to protect the transaction state
machine, as there's really only one place --- the commit thread ---
where we end up changing the state of a transaction itself (eg. from
running to committing).  For short-lived buffer transformations, we
already have the datalist spinlock.

There are a few intermediate types of operation, such as the
do_get_write_access.  That's a buffer operation, but it relies on us
being able to allocate memory for the old version of the buffer if we
happen to be committing the bh to disk already.  All of those cases
are already prepared to accept BKL being dropped during the memory
allocation, so there's no problem with doing the same for a short-term
buffer spinlock; and if the journal_lock is only taken shared in such
places, then there's no urgent need to drop that over the malloc.

Even the commit thread can probably avoid taking the journal lock in
many cases --- it would need it exclusively while changing a
transaction's global state, but while it's just manipulating blocks on
the committing transaction it can probably get away with much less
locking.
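The shape of that read/write journal lock, as an illustrative user-space
sketch (Python; the RWLock here is hand-rolled and simplified, since the
point is the shared-vs-exclusive split, not the primitive itself):

```python
import threading

class RWLock:
    """Minimal reader/writer lock (Python's stdlib has none)."""
    def __init__(self):
        self._readers = 0
        self._mutex = threading.Lock()
        self._writer = threading.Lock()

    def acquire_read(self):
        with self._mutex:
            self._readers += 1
            if self._readers == 1:
                self._writer.acquire()   # first reader blocks writers

    def release_read(self):
        with self._mutex:
            self._readers -= 1
            if self._readers == 0:
                self._writer.release()   # last reader lets writers in

    def acquire_write(self):
        self._writer.acquire()

    def release_write(self):
        self._writer.release()

class Journal:
    """Buffer-level operations take the journal lock shared; only the
    commit thread takes it exclusively, to change transaction state."""
    RUNNING, COMMITTING = "running", "committing"

    def __init__(self):
        self.lock = RWLock()
        self.state = self.RUNNING

    def get_write_access(self, bh):
        self.lock.acquire_read()         # shared: many buffer ops in parallel
        try:
            bh["journaled"] = True       # short-lived buffer transformation
        finally:
            self.lock.release_read()

    def commit(self):
        self.lock.acquire_write()        # exclusive: transaction state change
        try:
            self.state = self.COMMITTING
        finally:
            self.lock.release_write()
```

Buffer operations from many CPUs then proceed concurrently, and the
commit thread only serializes against them at the state transition.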

Cheers,
 Stephen


* Re: [Lse-tech] Re: ext3 performance bottleneck as the number of spindles gets large
       [not found] <59885C5E3098D511AD690002A5072D3C057B499E@orsmsx111.jf.intel.com>
@ 2002-06-20 16:10 ` Dave Hansen
  2002-06-20 20:47   ` John Hawkes
  0 siblings, 1 reply; 24+ messages in thread
From: Dave Hansen @ 2002-06-20 16:10 UTC (permalink / raw)
  To: Gross, Mark
  Cc: 'Russell Leighton', Andrew Morton, mgross,
	Linux Kernel Mailing List, lse-tech, Griffiths, Richard A

Gross, Mark wrote:
> We will get around to reformatting our spindles to some other FS after 
> we get as much data and analysis out of our current configuration as we 
> can get. 
>  
> We'll report out our findings on the lock contention, and throughput 
> data for some other FS then.  I'd like recommendations on what file 
> systems to try, besides ext2.

Do you really need a journaling FS?  If not, I think ext2 is a sure 
bet to be the fastest.  If you do need journaling, try reiserfs and jfs.

BTW, what kind of workload are you running under?

-- 
Dave Hansen
haveblue@us.ibm.com



* Re: [Lse-tech] Re: ext3 performance bottleneck as the number of spindles gets large
  2002-06-20 16:10 ` [Lse-tech] " Dave Hansen
@ 2002-06-20 20:47   ` John Hawkes
  0 siblings, 0 replies; 24+ messages in thread
From: John Hawkes @ 2002-06-20 20:47 UTC (permalink / raw)
  To: Dave Hansen, Gross, Mark
  Cc: 'Russell Leighton', Andrew Morton, mgross,
	Linux Kernel Mailing List, lse-tech, Griffiths, Richard A

From: "Dave Hansen" <haveblue@us.ibm.com>
> > We'll report out our findings on the lock contention, and throughput
> > data for some other FS then.  I'd like recommendations on what file
> > systems to try, besides ext2.
>
> Do you really need a journaling FS?  If not, I think ext2 is a sure
> bet to be the fastest.  If you do need journaling, try reiserfs and
jfs.

XFS in 2.4.x scales much better on larger CPU counts than do ext3 or
ReiserFS.  That's because XFS is a much lighter user of the BKL in 2.4.x
than ext3, ReiserFS, or ext2.

John Hawkes
hawkes@sgi.com


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [Lse-tech] Re: ext3 performance bottleneck as the number of spindles  gets large
  2002-06-20 16:24 [Lse-tech] Re: ext3 performance bottleneck as the number of spindles " Gross, Mark
@ 2002-06-20 21:11 ` Andrew Morton
  0 siblings, 0 replies; 24+ messages in thread
From: Andrew Morton @ 2002-06-20 21:11 UTC (permalink / raw)
  To: Gross, Mark
  Cc: 'Dave Hansen', 'Russell Leighton', mgross,
	Linux Kernel Mailing List, lse-tech, Griffiths, Richard A

"Gross, Mark" wrote:
> 
> ...
> The workload is http://www.coker.com.au/bonnie++/ (one of the newer versions
> ;)
>

Please tell me exactly how you're using it: how many filesystems, how
many controllers, disk topology, physical memory, size of filesystems,
etc.  Sufficient for me to be able to reproduce it and find out what
is happening.

Also: what is your best-case aggregate bandwidth?  Platter-speed of disks
multiplied by number of disks, please?

Thanks to the BKL you've effectively got 1.3 to 1.5 CPUs, but we should be
able to saturate six or eight disks on a uniprocessor kernel.  It's
possible that we're looking at the wrong thing.
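The best-case aggregate asked for here is simple arithmetic: per-disk media
rate times disk count.  A quick sketch, assuming a hypothetical ~30 MB/s
sustained platter speed (that figure is an illustration, not from this thread):

```python
# Best-case aggregate bandwidth: platter speed x number of disks.
# The per-disk media rate below is an assumed figure for a 2002-era
# SCSI drive; substitute the real spec for your disks.
PLATTER_MB_S = 30          # assumed sustained media rate per disk (MB/s)
ADAPTERS = 4               # SCSI adapters in the worst-case run
DISKS_PER_ADAPTER = 6      # drives per adapter

disks = ADAPTERS * DISKS_PER_ADAPTER
aggregate = disks * PLATTER_MB_S
print(f"{disks} disks -> best-case {aggregate} MB/s aggregate")
```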

-

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [Lse-tech] Re: ext3 performance bottleneck as the number of spindles  gets large
@ 2002-06-21 22:03 Duc Vianney
  2002-06-21 23:11 ` Andrew Morton
  2002-06-22  0:19 ` kwijibo
  0 siblings, 2 replies; 24+ messages in thread
From: Duc Vianney @ 2002-06-21 22:03 UTC (permalink / raw)
  To: Andrew Morton, mgross, Griffiths, Richard A, Jens Axboe,
	Linux Kernel Mailing List, lse-tech

Andrew Morton wrote:
>If you have time, please test ext2 and/or reiserfs and/or ext3
>in writeback mode.
I ran IOzone on ext2fs, ext3fs, JFS, and Reiserfs on a 4-way SMP box:
500MHz CPUs, 2.5GB RAM, two 9.1GB SCSI drives. The test partition is
1GB, the test file size is 128MB, the test block size is 4KB, and the
number of IO threads varies from 1 to 6. Compared with the other file
systems in this test environment, the results on a 2.5.19 SMP kernel
show ext3fs has a performance problem with writes, particularly with
random writes. I think the BKL contention patch would help ext3fs, but
I need to verify that first.

The following data are throughput in MB/sec obtained from IOzone
benchmark running on all file systems installed with default options.


Kernels           2519smp4   2519smp4   2519smp4   2519smp4
No of threads=1   ext2-1t    jfs-1t     ext3-1t    reiserfs-1t

Initial write     138010     111023      29808      48170
Rewrite           205736     204538     119543     142765
Read              236500     237235     231860     236959
Re-read           242927     243577     240284     242776
Random read       204292     206010     201664     207219
Random write      180144     180461       1090     121676

No of threads=2  ext2-2t     jfs-2t     ext3-2t    reiserfs-2t

Initial write     196477     143395      62248      55260
Rewrite           261641     261441     126604     205076
Read              292566     292796     313562     291434
Re-read           302239     306423     341416     303424
Random read       296152     295430     316966     288584
Random write      253026     251013        958     203358

No of threads=4  ext2-4t     jfs-4t    ext3-4t     reiserfs-4t

Initial write      79513     172302      42051      48782
Rewrite           256568     269840     124912     231395
Read              290599     303669     327066     283793
Re-read           289578     303644     327362     287531
Random read       354011     353455     353806     351671
Random write      279704     279922       2482     250498

No of threads=6  ext2-6t     jfs-6t    ext3-6t     reiserfs-6t

Initial write      98559      69825      59728      15576
Rewrite           274993     286987     126048     232193
Read              330522     326143     332147     326163
Re-read           339672     328890     333094     326725
Random read       348059     346154     347901     344927
Random write      281613     280213       3659     227579

Cheers,
Duc J Vianney, dvianney@us.ibm.com
home page: http://www-124.ibm.com/developerworks/opensource/linuxperf/
project page: http://www-124.ibm.com/developerworks/projects/linuxperf



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [Lse-tech] Re: ext3 performance bottleneck as the number of spindles  gets large
  2002-06-21 22:03 Duc Vianney
@ 2002-06-21 23:11 ` Andrew Morton
  2002-06-22  0:19 ` kwijibo
  1 sibling, 0 replies; 24+ messages in thread
From: Andrew Morton @ 2002-06-21 23:11 UTC (permalink / raw)
  To: Duc Vianney
  Cc: mgross, Griffiths, Richard A, Jens Axboe,
	Linux Kernel Mailing List, lse-tech

Duc Vianney wrote:
> 
> Andrew Morton wrote:
> >If you have time, please test ext2 and/or reiserfs and/or ext3
> >in writeback mode.
> I ran IOzone on ext2fs, ext3fs, JFS, and Reiserfs on an SMP 4-way
> 500MHz, 2.5GB RAM, two 9.1GB SCSI drives. The test partition is 1GB,
> test file size is 128MB, test block size is 4KB, and IO threads varies
> from 1 to 6. When comparing with other file system for this test
> environment, the results on a 2.5.19 SMP kernel show ext3fs is having
> performance problem with Writes and in particularly, with Random Write.
> I think the BKL contention patch would help ext3fs, but I need to verify
> it first.
> 
> The following data are throughput in MB/sec obtained from IOzone
> benchmark running on all file systems installed with default options.
> 
> Kernels           2519smp4   2519smp4   2519smp4   2519smp4
> No of threads=1   ext2-1t    jfs-1t     ext3-1t    reiserfs-1t
> 
> Initial write     138010     111023      29808      48170
> Rewrite           205736     204538     119543     142765
> Read              236500     237235     231860     236959
> Re-read           242927     243577     240284     242776
> Random read       204292     206010     201664     207219
> Random write      180144     180461       1090     121676

ext3 only allows dirty data to remain in memory for five seconds,
whereas the other filesystems allow it for thirty.  This is
a reasonable thing to do, but it hurts badly in benchmarks.

If you run a benchmark which takes ext2 ten seconds to
complete, ext2 will do it all in-RAM.  But after five
seconds, ext3 will go to disk and the test takes vastly longer.
I suspect that is what is happening here - we're seeing the
difference between disk bandwidth and memory bandwidth.

If you choose a larger file, a smaller file, or a longer-running
test, then the difference will not be so gross.

You can confirm this by trying a one-gigabyte file instead.
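A back-of-envelope sketch of that effect (the bandwidth figures below are
illustrative assumptions, not measurements from this thread; only the
5-second and 30-second intervals come from the explanation above):

```python
# Toy model: a benchmark short enough to fit inside the 30-second
# dirty-data window runs at memory speed, while ext3's 5-second
# commit interval forces most of the data to disk speed.
MEM_MB_S = 200    # assumed in-RAM write rate (MB/s)
DISK_MB_S = 20    # assumed disk write rate (MB/s)
COMMIT_EXT3 = 5   # ext3 journal commit interval (seconds)
COMMIT_EXT2 = 30  # default dirty-data writeback age (seconds)

def apparent_rate(total_mb, commit):
    """MB/s the benchmark appears to achieve for a given dirty lifetime."""
    in_ram = min(total_mb, MEM_MB_S * commit)  # absorbed before flushing
    to_disk = total_mb - in_ram                # forced out at disk speed
    elapsed = in_ram / MEM_MB_S + to_disk / DISK_MB_S
    return total_mb / elapsed

for fs, commit in (("ext2-like", COMMIT_EXT2), ("ext3-like", COMMIT_EXT3)):
    print(f"{fs}: 1800 MB file -> {apparent_rate(1800, commit):.0f} MB/s")
```

With these assumed numbers the same workload reports memory-speed throughput
under the 30-second window and a fraction of it under the 5-second one,
which is the shape of the gap in the tables above.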

The "Initial write" is fishy.  I wonder if the same thing
is happening here - there may have been lots of dirty memory
left in-core (and unaccounted for) after the test completed.
iozone has a `-e' option which causes it to include the fsync()
time in the timing calculations.  Using that would give a
better comparison, unless you are specifically trying to test
in-memory performance.  And we're not doing that here.

-

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [Lse-tech] Re: ext3 performance bottleneck as the number of spindles gets large
  2002-06-21 22:03 Duc Vianney
  2002-06-21 23:11 ` Andrew Morton
@ 2002-06-22  0:19 ` kwijibo
  2002-06-22  8:10   ` kwijibo
  1 sibling, 1 reply; 24+ messages in thread
From: kwijibo @ 2002-06-22  0:19 UTC (permalink / raw)
  To: Duc Vianney
  Cc: Andrew Morton, mgross, Griffiths, Richard A, Jens Axboe,
	Linux Kernel Mailing List, lse-tech

This web site may be of interest for this discussion:
http://labs.zianet.com.  I have benchmarks using NFS
with ext3 there.  It also compares ext3 with ReiserFS.
The page is not quite complete but it has the
benchmarks up.

Steven

Duc Vianney wrote:

>Andrew Morton wrote:
>  
>
>>If you have time, please test ext2 and/or reiserfs and/or ext3
>>in writeback mode.
>>    
>>
>I ran IOzone on ext2fs, ext3fs, JFS, and Reiserfs on an SMP 4-way
>500MHz, 2.5GB RAM, two 9.1GB SCSI drives. The test partition is 1GB,
>test file size is 128MB, test block size is 4KB, and IO threads varies
>from 1 to 6. When comparing with other file system for this test
>environment, the results on a 2.5.19 SMP kernel show ext3fs is having
>performance problem with Writes and in particularly, with Random Write.
>I think the BKL contention patch would help ext3fs, but I need to verify
>it first.
>
>The following data are throughput in MB/sec obtained from IOzone
>benchmark running on all file systems installed with default options.
>
>
>Kernels           2519smp4   2519smp4   2519smp4   2519smp4
>No of threads=1   ext2-1t    jfs-1t     ext3-1t    reiserfs-1t
>
>Initial write     138010     111023      29808      48170
>Rewrite           205736     204538     119543     142765
>Read              236500     237235     231860     236959
>Re-read           242927     243577     240284     242776
>Random read       204292     206010     201664     207219
>Random write      180144     180461       1090     121676
>
>No of threads=2  ext2-2t     jfs-2t     ext3-2t    reiserfs-2t
>
>Initial write     196477     143395      62248      55260
>Rewrite           261641     261441     126604     205076
>Read              292566     292796     313562     291434
>Re-read           302239     306423     341416     303424
>Random read       296152     295430     316966     288584
>Random write      253026     251013        958     203358
>
>No of threads=4  ext2-4t     jfs-4t    ext3-4t     reiserfs-4t
>
>Initial write      79513     172302      42051      48782
>Rewrite           256568     269840     124912     231395
>Read              290599     303669     327066     283793
>Re-read           289578     303644     327362     287531
>Random read       354011     353455     353806     351671
>Random write      279704     279922       2482     250498
>
>No of threads=6  ext2-6t     jfs-6t    ext3-6t     reiserfs-6t
>
>Initial write      98559      69825      59728      15576
>Rewrite           274993     286987     126048     232193
>Read              330522     326143     332147     326163
>Re-read           339672     328890     333094     326725
>Random read       348059     346154     347901     344927
>Random write      281613     280213       3659     227579
>
>Cheers,
>Duc J Vianney, dvianney@us.ibm.com
>home page: http://www-124.ibm.com/developerworks/opensource/linuxperf/
>project page: http://www-124.ibm.com/developerworks/projects/linuxperf
>
>
>-
>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>the body of a message to majordomo@vger.kernel.org
>More majordomo info at  http://vger.kernel.org/majordomo-info.html
>Please read the FAQ at  http://www.tux.org/lkml/
>
>
>  
>




^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [Lse-tech] Re: ext3 performance bottleneck as the number of spindles gets large
  2002-06-22  0:19 ` kwijibo
@ 2002-06-22  8:10   ` kwijibo
  0 siblings, 0 replies; 24+ messages in thread
From: kwijibo @ 2002-06-22  8:10 UTC (permalink / raw)
  To: kwijibo
  Cc: Duc Vianney, Andrew Morton, mgross, Griffiths, Richard A,
	Jens Axboe, Linux Kernel Mailing List, lse-tech

If you tried the link earlier and it didn't work, I'm sorry;
I had a brain fart with the web server.  It should
work now.

Steven

kwijibo@zianet.com wrote:

> This web site may be of interest for this discussion:
> http://labs.zianet.com.  I have benchmarks using NFS
> with ext3 there.  It also compares ext3 with ReiserFS.
> The page is not quite complete but it has the
> benchmarks up.
>
> Steven
>
> Duc Vianney wrote:
>
>> Andrew Morton wrote:
>>  
>>
>>> If you have time, please test ext2 and/or reiserfs and/or ext3
>>> in writeback mode.
>>>   
>>
>> I ran IOzone on ext2fs, ext3fs, JFS, and Reiserfs on an SMP 4-way
>> 500MHz, 2.5GB RAM, two 9.1GB SCSI drives. The test partition is 1GB,
>> test file size is 128MB, test block size is 4KB, and IO threads varies
>> from 1 to 6. When comparing with other file system for this test
>> environment, the results on a 2.5.19 SMP kernel show ext3fs is having
>> performance problem with Writes and in particularly, with Random Write.
>> I think the BKL contention patch would help ext3fs, but I need to verify
>> it first.
>>
>> The following data are throughput in MB/sec obtained from IOzone
>> benchmark running on all file systems installed with default options.
>>
>>
>> Kernels           2519smp4   2519smp4   2519smp4   2519smp4
>> No of threads=1   ext2-1t    jfs-1t     ext3-1t    reiserfs-1t
>>
>> Initial write     138010     111023      29808      48170
>> Rewrite           205736     204538     119543     142765
>> Read              236500     237235     231860     236959
>> Re-read           242927     243577     240284     242776
>> Random read       204292     206010     201664     207219
>> Random write      180144     180461       1090     121676
>>
>> No of threads=2  ext2-2t     jfs-2t     ext3-2t    reiserfs-2t
>>
>> Initial write     196477     143395      62248      55260
>> Rewrite           261641     261441     126604     205076
>> Read              292566     292796     313562     291434
>> Re-read           302239     306423     341416     303424
>> Random read       296152     295430     316966     288584
>> Random write      253026     251013        958     203358
>>
>> No of threads=4  ext2-4t     jfs-4t    ext3-4t     reiserfs-4t
>>
>> Initial write      79513     172302      42051      48782
>> Rewrite           256568     269840     124912     231395
>> Read              290599     303669     327066     283793
>> Re-read           289578     303644     327362     287531
>> Random read       354011     353455     353806     351671
>> Random write      279704     279922       2482     250498
>>
>> No of threads=6  ext2-6t     jfs-6t    ext3-6t     reiserfs-6t
>>
>> Initial write      98559      69825      59728      15576
>> Rewrite           274993     286987     126048     232193
>> Read              330522     326143     332147     326163
>> Re-read           339672     328890     333094     326725
>> Random read       348059     346154     347901     344927
>> Random write      281613     280213       3659     227579
>>
>> Cheers,
>> Duc J Vianney, dvianney@us.ibm.com
>> home page: http://www-124.ibm.com/developerworks/opensource/linuxperf/
>> project page: http://www-124.ibm.com/developerworks/projects/linuxperf
>>
>>
>
>
>




^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [Lse-tech] Re: ext3 performance bottleneck as the number of spindles gets large
  2002-06-23  6:00 ` Christopher E. Brown
@ 2002-06-23  6:35   ` William Lee Irwin III
  2002-06-23  7:29     ` Dave Hansen
  2002-06-23 17:06     ` Eric W. Biederman
  0 siblings, 2 replies; 24+ messages in thread
From: William Lee Irwin III @ 2002-06-23  6:35 UTC (permalink / raw)
  To: Christopher E. Brown
  Cc: Andreas Dilger, Griffiths, Richard A, 'Andrew Morton',
	mgross, 'Jens Axboe', Linux Kernel Mailing List, lse-tech

On Sun, Jun 23, 2002 at 12:00:01AM -0600, Christopher E. Brown wrote:
> However, multiple busses are *rare* on x86.  There are alot of chained
> busses via PCI to PCI bridge, but few systems with 2 or more PCI
> busses of any type with parallel access to the CPU.

NUMA-Q has them.


Cheers,
Bill

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [Lse-tech] Re: ext3 performance bottleneck as the number of spindles gets large
  2002-06-23  6:35   ` [Lse-tech] " William Lee Irwin III
@ 2002-06-23  7:29     ` Dave Hansen
  2002-06-23  7:36       ` William Lee Irwin III
  2002-06-23 17:06     ` Eric W. Biederman
  1 sibling, 1 reply; 24+ messages in thread
From: Dave Hansen @ 2002-06-23  7:29 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Christopher E. Brown, Andreas Dilger, Griffiths, Richard A,
	'Andrew Morton', mgross, 'Jens Axboe',
	Linux Kernel Mailing List, lse-tech

William Lee Irwin III wrote:
> On Sun, Jun 23, 2002 at 12:00:01AM -0600, Christopher E. Brown wrote:
> 
>>However, multiple busses are *rare* on x86.  There are alot of chained
>>busses via PCI to PCI bridge, but few systems with 2 or more PCI
>>busses of any type with parallel access to the CPU.
> 
> NUMA-Q has them.
> 

Yep, 2 independent busses per quad.  That's a _lot_ of busses when you 
have an 8 or 16 quad system.  (I wonder who has one of those... ;)

Almost all of the server-type boxes that we play with have multiple 
PCI busses.  Even my old dual-PPro has 2.

-- 
Dave Hansen
haveblue@us.ibm.com


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [Lse-tech] Re: ext3 performance bottleneck as the number of spindles gets large
  2002-06-23  7:29     ` Dave Hansen
@ 2002-06-23  7:36       ` William Lee Irwin III
  2002-06-23  7:45         ` Dave Hansen
  0 siblings, 1 reply; 24+ messages in thread
From: William Lee Irwin III @ 2002-06-23  7:36 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Christopher E. Brown, Andreas Dilger, Griffiths, Richard A,
	'Andrew Morton', mgross, 'Jens Axboe',
	Linux Kernel Mailing List, lse-tech

>> On Sun, Jun 23, 2002 at 12:00:01AM -0600, Christopher E. Brown wrote:
>>> However, multiple busses are *rare* on x86.  There are alot of chained
>>> busses via PCI to PCI bridge, but few systems with 2 or more PCI
>>> busses of any type with parallel access to the CPU.

William Lee Irwin III wrote:
>> NUMA-Q has them.


On Sun, Jun 23, 2002 at 12:29:23AM -0700, Dave Hansen wrote:
> Yep, 2 independent busses per quad.  That's a _lot_ of busses when you 
> have an 8 or 16 quad system.  (I wonder who has one of those... ;)
> Almost all of the server-type boxes that we play with have multiple 
> PCI busses.  Even my old dual-PPro has 2.

I thought I saw 3 PCI and 1 ISA per-quad, but maybe that's the
"independent" bit coming into play.


Cheers,
Bill

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [Lse-tech] Re: ext3 performance bottleneck as the number of spindles gets large
  2002-06-23  7:36       ` William Lee Irwin III
@ 2002-06-23  7:45         ` Dave Hansen
  2002-06-23  7:55           ` Christopher E. Brown
  2002-06-23 16:21           ` Martin J. Bligh
  0 siblings, 2 replies; 24+ messages in thread
From: Dave Hansen @ 2002-06-23  7:45 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Christopher E. Brown, Andreas Dilger, Griffiths, Richard A,
	'Andrew Morton', mgross, 'Jens Axboe',
	Linux Kernel Mailing List, lse-tech

William Lee Irwin III wrote:
 > On Sun, Jun 23, 2002 at 12:29:23AM -0700, Dave Hansen wrote:
 >> Yep, 2 independent busses per quad.  That's a _lot_ of busses
 >> when you have an 8 or 16 quad system.  (I wonder who has one of
 >> those... ;) Almost all of the server-type boxes that we play with
 >>  have multiple PCI busses.  Even my old dual-PPro has 2.
 >
 > I thought I saw 3 PCI and 1 ISA per-quad., but maybe that's the
 > "independent" bit coming into play.
 >
Hmmmm.  Maybe there is another one for the onboard devices.  I thought
that there were 8 slots and 4 per bus.  I could
be wrong.  BTW, the ISA slot is EISA and as far as I can tell is only
used for the MDC.


-- 
Dave Hansen
haveblue@us.ibm.com


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [Lse-tech] Re: ext3 performance bottleneck as the number of spindles gets large
  2002-06-23  7:45         ` Dave Hansen
@ 2002-06-23  7:55           ` Christopher E. Brown
  2002-06-23  8:11             ` David Lang
  2002-06-23  8:31             ` Dave Hansen
  2002-06-23 16:21           ` Martin J. Bligh
  1 sibling, 2 replies; 24+ messages in thread
From: Christopher E. Brown @ 2002-06-23  7:55 UTC (permalink / raw)
  To: Dave Hansen
  Cc: William Lee Irwin III, Andreas Dilger, Griffiths, Richard A,
	'Andrew Morton', mgross, 'Jens Axboe',
	Linux Kernel Mailing List, lse-tech

On Sun, 23 Jun 2002, Dave Hansen wrote:

> William Lee Irwin III wrote:
>  > On Sun, Jun 23, 2002 at 12:29:23AM -0700, Dave Hansen wrote:
>  >> Yep, 2 independent busses per quad.  That's a _lot_ of busses
>  >> when you have an 8 or 16 quad system.  (I wonder who has one of
>  >> those... ;) Almost all of the server-type boxes that we play with
>  >>  have multiple PCI busses.  Even my old dual-PPro has 2.
>  >
>  > I thought I saw 3 PCI and 1 ISA per-quad., but maybe that's the
>  > "independent" bit coming into play.
>  >
> Hmmmm.  Maybe there is another one for the onboard devices.  I thought
> that there were 8 slots and 4 per bus.  I could
> be wrong.  BTW, the ISA slot is EISA and as far as I can tell is only
> used for the MDC.


Do you mean independent in that there are 2 sets of 4 slots, each
detected as a separate PCI bus, or independent in that each set of 4
has *direct* access to the CPU side and *does not* go through a
PCI:PCI bridge?



I have stacks of PPro/PII/Xeon boards around, but 9 out of 10 have
chained buses.  Even the old PPro x 6 (Avion 6600/ALR 6x6/Unisys
HR/HS6000) had 2 PCI buses; however, the second bus hung off a
PCI:PCI bridge.


-- 
I route, therefore you are.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [Lse-tech] Re: ext3 performance bottleneck as the number of spindles gets large
  2002-06-23  7:55           ` Christopher E. Brown
@ 2002-06-23  8:11             ` David Lang
  2002-06-23  8:31             ` Dave Hansen
  1 sibling, 0 replies; 24+ messages in thread
From: David Lang @ 2002-06-23  8:11 UTC (permalink / raw)
  To: Christopher E. Brown
  Cc: Dave Hansen, William Lee Irwin III, Andreas Dilger,
	Griffiths, Richard A, 'Andrew Morton', mgross,
	'Jens Axboe', Linux Kernel Mailing List, lse-tech

Most chipsets have only one PCI bus on them, so any others need to
be bridged to that one.

David Lang

On Sun, 23 Jun 2002, Christopher E. Brown wrote:

> Date: Sun, 23 Jun 2002 01:55:28 -0600 (MDT)
> From: Christopher E. Brown <cbrown@woods.net>
> To: Dave Hansen <haveblue@us.ibm.com>
> Cc: William Lee Irwin III <wli@holomorphy.com>,
>      Andreas Dilger <adilger@clusterfs.com>,
>      "Griffiths, Richard A" <richard.a.griffiths@intel.com>,
>      'Andrew Morton' <akpm@zip.com.au>, mgross@unix-os.sc.intel.com,
>      'Jens Axboe' <axboe@suse.de>,
>      Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
>      lse-tech@lists.sourceforge.net
> Subject: Re: [Lse-tech] Re: ext3 performance bottleneck as the number of
>     spindles gets large
>
> On Sun, 23 Jun 2002, Dave Hansen wrote:
>
> > William Lee Irwin III wrote:
> >  > On Sun, Jun 23, 2002 at 12:29:23AM -0700, Dave Hansen wrote:
> >  >> Yep, 2 independent busses per quad.  That's a _lot_ of busses
> >  >> when you have an 8 or 16 quad system.  (I wonder who has one of
> >  >> those... ;) Almost all of the server-type boxes that we play with
> >  >>  have multiple PCI busses.  Even my old dual-PPro has 2.
> >  >
> >  > I thought I saw 3 PCI and 1 ISA per-quad., but maybe that's the
> >  > "independent" bit coming into play.
> >  >
> > Hmmmm.  Maybe there is another one for the onboard devices.  I thought
> > that there were 8 slots and 4 per bus.  I could
> > be wrong.  BTW, the ISA slot is EISA and as far as I can tell is only
> > used for the MDC.
>
>
> Do you mean independent in that there are 2 sets of 4 slots each
> detected as a seperate PCI bus, or independent in that each set of 4
> had *direct* access to the cpu side, and *does not* access via a
> PCI:PCI bridge?
>
>
>
> I have stacks of PPro/PII/Xeon boards around, but 9 out of 10 have
> chianed buses.  Even the old PPro x 6 (Avion 6600/ALR 6x6/Unisys
> HR/HS6000) had 2 PCI buses, however the second BUS hung off of a
> PCI:PCI bridge.
>
>
> --
> I route, therefore you are.
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [Lse-tech] Re: ext3 performance bottleneck as the number of spindles gets large
  2002-06-23  7:55           ` Christopher E. Brown
  2002-06-23  8:11             ` David Lang
@ 2002-06-23  8:31             ` Dave Hansen
  1 sibling, 0 replies; 24+ messages in thread
From: Dave Hansen @ 2002-06-23  8:31 UTC (permalink / raw)
  To: Christopher E. Brown
  Cc: William Lee Irwin III, Andreas Dilger, Griffiths, Richard A,
	'Andrew Morton', mgross, 'Jens Axboe',
	Linux Kernel Mailing List, lse-tech

Christopher E. Brown wrote:
> Do you mean independent in that there are 2 sets of 4 slots each
> detected as a seperate PCI bus, or independent in that each set of 4
> had *direct* access to the cpu side, and *does not* access via a
> PCI:PCI bridge?

No PCI:PCI bridges, at least for NUMA-Q.
http://telia.dl.sourceforge.net/sourceforge/lse/linux_on_numaq.pdf

-- 
Dave Hansen
haveblue@us.ibm.com


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [Lse-tech] Re: ext3 performance bottleneck as the number of spindles gets large
  2002-06-23  7:45         ` Dave Hansen
  2002-06-23  7:55           ` Christopher E. Brown
@ 2002-06-23 16:21           ` Martin J. Bligh
  1 sibling, 0 replies; 24+ messages in thread
From: Martin J. Bligh @ 2002-06-23 16:21 UTC (permalink / raw)
  To: Dave Hansen, William Lee Irwin III
  Cc: Christopher E. Brown, Andreas Dilger, Griffiths, Richard A,
	'Andrew Morton', mgross, 'Jens Axboe',
	Linux Kernel Mailing List, lse-tech

>  >> Yep, 2 independent busses per quad.  That's a _lot_ of busses
>  >> when you have an 8 or 16 quad system.  (I wonder who has one of
>  >> those... ;) Almost all of the server-type boxes that we play with
>  >>  have multiple PCI busses.  Even my old dual-PPro has 2.
>  >
>  > I thought I saw 3 PCI and 1 ISA per-quad., but maybe that's the
>  > "independent" bit coming into play.
>  >
> Hmmmm.  Maybe there is another one for the onboard devices.  I thought
> that there were 8 slots and 4 per bus.  I could
> be wrong.  BTW, the ISA slot is EISA and as far as I can tell is only
> used for the MDC.

NUMA-Q has 2 PCI buses per quad, 3 slots in one, 4 in the other,
plus the EISA slots.

Multiple independent PCI buses are also available on other, more
common architectures, e.g. the Netfinity 8500R, x360, x440, etc.

Anything with the Intel Profusion chipset will have this feature;
the bottleneck becomes the "P6 system bus" backplane they're all
connected to, which has a theoretical limit of 800Mb/s IIRC, though
nobody's been able to get more than 420Mb/s out of it in practice,
as far as I know.

The thing that makes the NUMA-Q a massive IO shovelling engine is
having one of these IO backplanes per quad too ... 16 x 800Mb/s
= 12.8Gb/s ;-)
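The same arithmetic with the observed per-backplane figure applied as well
(both the 800Mb/s theoretical and 420Mb/s observed numbers are the ones
quoted above):

```python
# Aggregate IO across 16 quads, theoretical vs. best-observed
# per-backplane figures as quoted in this thread.
THEORETICAL_MBIT_S = 800   # per P6-bus backplane, IIRC figure
OBSERVED_MBIT_S = 420      # best seen in practice
QUADS = 16

print(f"theoretical: {QUADS * THEORETICAL_MBIT_S / 1000:.1f} Gb/s")
print(f"observed:    {QUADS * OBSERVED_MBIT_S / 1000:.1f} Gb/s")
```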

M.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [Lse-tech] Re: ext3 performance bottleneck as the number of spindles gets large
  2002-06-23  6:35   ` [Lse-tech] " William Lee Irwin III
  2002-06-23  7:29     ` Dave Hansen
@ 2002-06-23 17:06     ` Eric W. Biederman
  1 sibling, 0 replies; 24+ messages in thread
From: Eric W. Biederman @ 2002-06-23 17:06 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Christopher E. Brown, Andreas Dilger, Griffiths, Richard A,
	'Andrew Morton', mgross, 'Jens Axboe',
	Linux Kernel Mailing List, lse-tech

William Lee Irwin III <wli@holomorphy.com> writes:

> On Sun, Jun 23, 2002 at 12:00:01AM -0600, Christopher E. Brown wrote:
> > However, multiple busses are *rare* on x86.  There are alot of chained
> > busses via PCI to PCI bridge, but few systems with 2 or more PCI
> > busses of any type with parallel access to the CPU.
> 
> NUMA-Q has them.

As do the latest round of dual P4 Xeon chipsets.  The Intel E7500 and
the Serverworks Grand Champion.  

So on new systems this is easy to get if you want it.

Eric

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2002-06-23 17:17 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-06-19 21:29 ext3 performance bottleneck as the number of spindles gets large mgross
2002-06-20  0:54 ` Andrew Morton
2002-06-20  4:09   ` [Lse-tech] " Dave Hansen
2002-06-20  6:03     ` Andreas Dilger
2002-06-20  6:53       ` Andrew Morton
2002-06-20  9:54   ` Stephen C. Tweedie
2002-06-20  1:55 ` Andrew Morton
2002-06-20  6:05   ` Jens Axboe
     [not found] <59885C5E3098D511AD690002A5072D3C057B499E@orsmsx111.jf.intel.com>
2002-06-20 16:10 ` [Lse-tech] " Dave Hansen
2002-06-20 20:47   ` John Hawkes
  -- strict thread matches above, loose matches on Subject: below --
2002-06-20 16:24 [Lse-tech] Re: ext3 performance bottleneck as the number of spindles " Gross, Mark
2002-06-20 21:11 ` [Lse-tech] Re: ext3 performance bottleneck as the number of spindles " Andrew Morton
2002-06-21 22:03 Duc Vianney
2002-06-21 23:11 ` Andrew Morton
2002-06-22  0:19 ` kwijibo
2002-06-22  8:10   ` kwijibo
2002-06-23  4:33 Andreas Dilger
2002-06-23  6:00 ` Christopher E. Brown
2002-06-23  6:35   ` [Lse-tech] " William Lee Irwin III
2002-06-23  7:29     ` Dave Hansen
2002-06-23  7:36       ` William Lee Irwin III
2002-06-23  7:45         ` Dave Hansen
2002-06-23  7:55           ` Christopher E. Brown
2002-06-23  8:11             ` David Lang
2002-06-23  8:31             ` Dave Hansen
2002-06-23 16:21           ` Martin J. Bligh
2002-06-23 17:06     ` Eric W. Biederman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox