Re: [PATCH] remove 2TB block device limit


From: Andrew Morton <akpm@zip.com.au>
To: Daniel Phillips <phillips@bonn-fries.net>
Cc: Jesse Pollard <pollard@tomcat.admin.navo.hpc.mil>,
	Peter Chubb <peter@chubb.wattle.id.au>,
	Anton Altaparmakov <aia21@cantab.net>,
	Christoph Hellwig <hch@infradead.org>,
	linux-kernel@vger.kernel.org, axboe@suse.de, martin@dalecki.de,
	neilb@cse.unsw.edu.au
Subject: Re: [PATCH] remove 2TB block device limit
Date: Fri, 17 May 2002 13:25:53 -0700
Message-ID: <3CE56751.D71C84E9@zip.com.au>
In-Reply-To: <200205171332.IAA93516@tomcat.admin.navo.hpc.mil> <E178nm3-000074-00@starship>

Daniel Phillips wrote:
> 
> On Friday 17 May 2002 15:32, Jesse Pollard wrote:
> > Note - most of these really large filesystems allow the inode tables and
> > bitmaps to be stored on disks with a relatively small blocksize (raid 5),
> > and the data on different drives (striped) with a large block size (I believe
> > ours is 64K to 128K sized data blocks, inode/bitmaps are 16K-32K.) This is
> > done for two reasons:
> 
> Since we're on this subject, and you have experience with these large block
> sizes, where exactly do you see the large savings?
> 
>   - setup cost of the disk transfer?
>   - rotational latency of small transfers?
>   - setup cost of the network transfer?
>   - interrupt processing overhead?
>   - other?

If you surf on over to
http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.15/ you'll see
some code which performs 64k I/Os.  Reads direct into pagecache.
It reduces the cost of reading from disk by 25% in my testing.
(That code is ready to go - just waiting for Linus to rematerialise).

The remaining profile is interesting.  The workload is simply
`cat large_file > /dev/null':

c012b448 33       0.200877    kmem_cache_free         
c0131af8 33       0.200877    flush_all_zero_pkmaps   
c01e51bc 33       0.200877    blk_recount_segments    
c01f9aec 34       0.206964    hpt374_udma_stop        
c016eb80 36       0.219138    ext2_get_block          
c0133320 37       0.225225    page_cache_readahead    
c013740c 37       0.225225    __getblk                
c0131ba0 41       0.249574    kmap_high               
c01fa1c4 41       0.249574    ata_start_dma           
c016e7dc 46       0.28001     ext2_block_to_path      
c01e5320 48       0.292184    blk_rq_map_sg           
c01c65d0 50       0.304358    radix_tree_reserve      
c014bfb0 53       0.32262     do_mpage_bio_readpage   
c01f4d88 54       0.328707    ata_irq_request         
c0136b34 64       0.389579    __get_hash_table        
c0126a00 72       0.438276    do_generic_file_read    
c016e910 82       0.499148    ext2_get_branch         
c0126610 88       0.535671    unlock_page             
c0106df4 91       0.553932    system_call             
c012b04c 94       0.572194    kmem_cache_alloc        
c01f2494 126      0.766983    ata_taskfile            
c01c66e8 163      0.992208    radix_tree_lookup       
c012d250 165      1.00438     rmqueue                 
c0105274 2781     16.9284     default_idle            
c0126e48 11009    67.0136     file_read_actor         

That's a single 500MHz PIII Xeon, reading at 35 megabytes/sec.

There's 17% "overhead" here.  Going to a larger filesystem
blocksize would provide almost zero benefit in the I/O layers.
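
As a rough check on that figure, here is a minimal sketch that recomputes it from the profile above. It assumes (my reading, not stated in the mail) that "overhead" means every sample outside file_read_actor (the copy to userspace) and default_idle. Because the listing shows only the top entries, the total sample count is inferred from the file_read_actor row rather than summed.

```python
# Rows copied from the profile above: (samples, percent of all samples).
FILE_READ_ACTOR = (11009, 67.0136)
DEFAULT_IDLE = (2781, 16.9284)


def total_samples(samples, percent):
    """Infer the total sample count from one row's count and percentage."""
    return round(samples * 100.0 / percent)


def overhead_percent(total, copy_samples, idle_samples):
    """Percentage of samples that are neither the user copy nor idle time.

    Assumption: this is what the mail calls "overhead".
    """
    return 100.0 * (total - copy_samples - idle_samples) / total


total = total_samples(*FILE_READ_ACTOR)
overhead = overhead_percent(total, FILE_READ_ACTOR[0], DEFAULT_IDLE[0])
print(f"total samples ~ {total}, overhead ~ {overhead:.1f}%")
```

Under that assumption the result comes out at about 16%, in line with the roughly 17% quoted.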

Savings from larger blocks and larger pages would come from
the radix tree operations, get_block and a few other places.
At a guess, 8k blocks would cut the overhead to 10-12%.

And larger block size significantly penalises bandwidth for
the many-small-file case.  The larger the blocks, the worse
it gets.  You end up having to implement complexities such
as tail-merging to get around the inefficiency which the
workaround for your other inefficiency caused.
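
To make the many-small-file penalty concrete, here is a minimal sketch under simple assumptions: each file occupies a whole number of blocks, reading a file transfers all of its blocks, and there is no tail-merging. The file sizes are hypothetical, chosen only for illustration.

```python
def bytes_transferred(file_size, block_size):
    """Bytes moved to read a file stored in whole blocks (no tail-merging)."""
    blocks = -(-file_size // block_size)  # ceiling division
    return blocks * block_size


def useful_fraction(file_sizes, block_size):
    """Share of the transferred bytes that is real file data."""
    moved = sum(bytes_transferred(size, block_size) for size in file_sizes)
    return sum(file_sizes) / moved


# Hypothetical small files, sizes in bytes.
small_files = [300, 1500, 2500, 5000, 9000]

for block_size in (4096, 8192, 65536):
    share = useful_fraction(small_files, block_size)
    print(f"{block_size:6d}-byte blocks: {share:.0%} of transferred bytes are file data")
```

The useful share falls as the block size grows, which is the inefficiency that tail-merging is meant to claw back.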

And larger pages with small blocks isn't an answer - CPU load
and seek costs from 2 blocks-per-page are measurable.  At
4 blocks-per-page it's getting serious.

Small pages and pagesize=blocksize are good.  I see no point in
going to larger pages or blocks until the current scheme is 
working efficiently and has been *proven* to still be unfixably
inadequate.

The current code sucks.  Simply amortising that suckiness across
larger blocks is not the right thing to do.
