From: Andrew Morton <akpm@zip.com.au>
To: Daniel Phillips <phillips@bonn-fries.net>
Cc: Jesse Pollard <pollard@tomcat.admin.navo.hpc.mil>,
Peter Chubb <peter@chubb.wattle.id.au>,
Anton Altaparmakov <aia21@cantab.net>,
Christoph Hellwig <hch@infradead.org>,
linux-kernel@vger.kernel.org, axboe@suse.de, martin@dalecki.de,
neilb@cse.unsw.edu.au
Subject: Re: [PATCH] remove 2TB block device limit
Date: Fri, 17 May 2002 13:25:53 -0700
Message-ID: <3CE56751.D71C84E9@zip.com.au>
In-Reply-To: <200205171332.IAA93516@tomcat.admin.navo.hpc.mil> <E178nm3-000074-00@starship>

Daniel Phillips wrote:
>
> On Friday 17 May 2002 15:32, Jesse Pollard wrote:
> > Note - most of these really large filesystems allow the inode tables and
> > bitmaps to be stored on disks with a relatively small blocksize (raid 5),
> > and the data on different drives (striped) with a large block size (I believe
> > ours is 64K to 128K sized data blocks, inode/bitmaps are 16K-32K.) This is
> > done for two reasons:
>
> Since we're on this subject, and you have experience with these large block
> sizes, where exactly do you see the large savings?
>
> - setup cost of the disk transfer?
> - rotational latency of small transfers?
> - setup cost of the network transfer?
> - interrupt processing overhead?
> - other?

If you surf on over to
http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.15/ you'll see
some code which performs 64k I/Os. Reads direct into pagecache.
It reduces the cost of reading from disk by 25% in my testing.
(That code is ready to go - just waiting for Linus to rematerialise).

The remaining profile is interesting. The workload is simply
`cat large_file > /dev/null':

c012b448     33  0.200877  kmem_cache_free
c0131af8     33  0.200877  flush_all_zero_pkmaps
c01e51bc     33  0.200877  blk_recount_segments
c01f9aec     34  0.206964  hpt374_udma_stop
c016eb80     36  0.219138  ext2_get_block
c0133320     37  0.225225  page_cache_readahead
c013740c     37  0.225225  __getblk
c0131ba0     41  0.249574  kmap_high
c01fa1c4     41  0.249574  ata_start_dma
c016e7dc     46  0.28001   ext2_block_to_path
c01e5320     48  0.292184  blk_rq_map_sg
c01c65d0     50  0.304358  radix_tree_reserve
c014bfb0     53  0.32262   do_mpage_bio_readpage
c01f4d88     54  0.328707  ata_irq_request
c0136b34     64  0.389579  __get_hash_table
c0126a00     72  0.438276  do_generic_file_read
c016e910     82  0.499148  ext2_get_branch
c0126610     88  0.535671  unlock_page
c0106df4     91  0.553932  system_call
c012b04c     94  0.572194  kmem_cache_alloc
c01f2494    126  0.766983  ata_taskfile
c01c66e8    163  0.992208  radix_tree_lookup
c012d250    165  1.00438   rmqueue
c0105274   2781  16.9284   default_idle
c0126e48  11009  67.0136   file_read_actor

That's a single 500MHz PIII Xeon, reading at 35 megabytes/sec.

There's 17% "overhead" here. Going to a larger filesystem
blocksize would provide almost zero benefit in the I/O layers.
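
For readers who want to check the arithmetic, here is a rough sketch (not part of the original mail) that recomputes the split between the copy to userspace, idle time, and everything else from the rows listed above. It assumes the columns are address, sample count, percent of all samples, and symbol name, and it infers the total sample count from the largest row; the helper names `parse` and `breakdown` are made up for illustration. The listing appears to show only the top entries, so "everything else" includes samples in functions that are not listed.

```python
# Rows copied from the profile above.
# Assumed column meaning: address, samples, percent of total, symbol.
PROFILE = """\
c012b448 33 0.200877 kmem_cache_free
c0131af8 33 0.200877 flush_all_zero_pkmaps
c01e51bc 33 0.200877 blk_recount_segments
c01f9aec 34 0.206964 hpt374_udma_stop
c016eb80 36 0.219138 ext2_get_block
c0133320 37 0.225225 page_cache_readahead
c013740c 37 0.225225 __getblk
c0131ba0 41 0.249574 kmap_high
c01fa1c4 41 0.249574 ata_start_dma
c016e7dc 46 0.28001 ext2_block_to_path
c01e5320 48 0.292184 blk_rq_map_sg
c01c65d0 50 0.304358 radix_tree_reserve
c014bfb0 53 0.32262 do_mpage_bio_readpage
c01f4d88 54 0.328707 ata_irq_request
c0136b34 64 0.389579 __get_hash_table
c0126a00 72 0.438276 do_generic_file_read
c016e910 82 0.499148 ext2_get_branch
c0126610 88 0.535671 unlock_page
c0106df4 91 0.553932 system_call
c012b04c 94 0.572194 kmem_cache_alloc
c01f2494 126 0.766983 ata_taskfile
c01c66e8 163 0.992208 radix_tree_lookup
c012d250 165 1.00438 rmqueue
c0105274 2781 16.9284 default_idle
c0126e48 11009 67.0136 file_read_actor
"""


def parse(text):
    """Return a list of (symbol, samples, percent) tuples."""
    rows = []
    for line in text.strip().splitlines():
        _addr, samples, percent, symbol = line.split()
        rows.append((symbol, int(samples), float(percent)))
    return rows


def breakdown(rows):
    """Split the inferred total into copy, idle, and everything else."""
    # Infer the total from the biggest row, where rounding error is smallest.
    _sym, samples, percent = max(rows, key=lambda r: r[1])
    total = round(samples * 100 / percent)
    by_symbol = {sym: n for sym, n, _pct in rows}
    copy = by_symbol["file_read_actor"]   # copying data out to userspace
    idle = by_symbol["default_idle"]      # CPU waiting on the disk
    other = total - copy - idle           # listed and unlisted remainder
    return {"total": total, "copy": copy, "idle": idle, "other": other}


result = breakdown(parse(PROFILE))
for key in ("copy", "idle", "other"):
    share = 100 * result[key] / result["total"]
    print(f"{key:5s} {result[key]:6d} samples  {share:5.1f}%")
```

Under these assumptions the "other" share comes out at about 16%, which is in line with the roughly 17% quoted above.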

Savings from larger blocks and larger pages would come into
the radix tree operations, get_block, a few other places.
At a guess, 8k blocks would cut the overhead to 10-12%.

And larger block size significantly penalises bandwidth for
the many-small-file case. The larger the blocks, the worse
it gets. You end up having to implement complexities such
as tail-merging to get around the inefficiency which the
workaround for your other inefficiency caused.
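
To put a rough number on the many-small-file penalty, the sketch below (an illustration, not something from the original mail) rounds each file up to whole blocks and compares useful bytes with bytes that have to be transferred. The file sizes are invented for the example, the function names are hypothetical, and the model assumes no tail-merging and that every block of a file is read in full.

```python
def blocks_needed(size, block_size):
    """Number of whole blocks required to hold `size` bytes."""
    return -(-size // block_size)  # ceiling division


def transfer_efficiency(file_sizes, block_size):
    """Useful bytes divided by bytes actually moved for whole blocks."""
    useful = sum(file_sizes)
    moved = sum(blocks_needed(s, block_size) * block_size for s in file_sizes)
    return useful / moved


# Hypothetical small files, sizes in bytes.
small_files = [1000, 3000, 5000, 12000]

for block_size in (4096, 8192, 65536):
    eff = transfer_efficiency(small_files, block_size)
    print(f"{block_size:6d}-byte blocks: {100 * eff:5.1f}% of transferred bytes are useful")
```

With these made-up sizes the useful fraction drops sharply as the block size grows, which is the effect the paragraph above describes.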

And larger pages with small blocks isn't an answer - CPU load
and seek costs from 2-blocks-per-page are measurable. At
4 blocks-per-page it's getting serious.

Small pages and pagesize=blocksize are good. I see no point in
going to larger pages or blocks until the current scheme is
working efficiently and has been *proven* to still be unfixably
inadequate.

The current code sucks. Simply amortising that suckiness across
larger blocks is not the right thing to do.