From: Andrew Morton <akpm@zip.com.au>
To: Daniel Phillips <phillips@bonn-fries.net>
Cc: Jesse Pollard <pollard@tomcat.admin.navo.hpc.mil>,
Peter Chubb <peter@chubb.wattle.id.au>,
Anton Altaparmakov <aia21@cantab.net>,
Christoph Hellwig <hch@infradead.org>,
linux-kernel@vger.kernel.org, axboe@suse.de, martin@dalecki.de,
neilb@cse.unsw.edu.au
Subject: Re: [PATCH] remove 2TB block device limit
Date: Fri, 17 May 2002 13:25:53 -0700
Message-ID: <3CE56751.D71C84E9@zip.com.au>
In-Reply-To: <200205171332.IAA93516@tomcat.admin.navo.hpc.mil> <E178nm3-000074-00@starship>
Daniel Phillips wrote:
>
> On Friday 17 May 2002 15:32, Jesse Pollard wrote:
> > Note - most of these really large filesystems allow the inode tables and
> > bitmaps to be stored on disks with a relatively small blocksize (raid 5),
> > and the data on different drives (striped) with a large block size (I believe
> > ours is 64K to 128K sized data blocks, inode/bitmaps are 16K-32K.) This is
> > done for two reasons:
>
> Since we're on this subject, and you have experience with these large block
> sizes, where exactly do you see the large savings?
>
> - setup cost of the disk transfer?
> - rotational latency of small transfers?
> - setup cost of the network transfer?
> - interrupt processing overhead?
> - other?

If you surf on over to
http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.15/ you'll see
some code which performs 64k I/Os, reading directly into pagecache.
It reduces the cost of reading from disk by 25% in my testing.
(That code is ready to go - just waiting for Linus to rematerialise).

The remaining profile is interesting. The workload is simply
`cat large_file > /dev/null':

c012b448     33  0.200877  kmem_cache_free
c0131af8     33  0.200877  flush_all_zero_pkmaps
c01e51bc     33  0.200877  blk_recount_segments
c01f9aec     34  0.206964  hpt374_udma_stop
c016eb80     36  0.219138  ext2_get_block
c0133320     37  0.225225  page_cache_readahead
c013740c     37  0.225225  __getblk
c0131ba0     41  0.249574  kmap_high
c01fa1c4     41  0.249574  ata_start_dma
c016e7dc     46  0.28001   ext2_block_to_path
c01e5320     48  0.292184  blk_rq_map_sg
c01c65d0     50  0.304358  radix_tree_reserve
c014bfb0     53  0.32262   do_mpage_bio_readpage
c01f4d88     54  0.328707  ata_irq_request
c0136b34     64  0.389579  __get_hash_table
c0126a00     72  0.438276  do_generic_file_read
c016e910     82  0.499148  ext2_get_branch
c0126610     88  0.535671  unlock_page
c0106df4     91  0.553932  system_call
c012b04c     94  0.572194  kmem_cache_alloc
c01f2494    126  0.766983  ata_taskfile
c01c66e8    163  0.992208  radix_tree_lookup
c012d250    165  1.00438   rmqueue
c0105274   2781  16.9284   default_idle
c0126e48  11009  67.0136   file_read_actor

That's a single 500MHz PIII Xeon, reading at 35 megabytes/sec.

There's 17% "overhead" here. Going to a larger filesystem
blocksize would provide almost zero benefit in the I/O layers.
Savings from larger blocks and larger pages would show up in
the radix tree operations, get_block and a few other places.
At a guess, 8k blocks would cut the overhead to 10-12%.
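
For anyone wanting to redo the arithmetic, here is a rough sketch in Python (not part of the patches). It recovers the total sample count from one profile row and then computes the share of samples that are neither file_read_actor nor default_idle. Treating "overhead" as exactly that remainder is an assumption on my part, and the helper names are made up for illustration. It comes out at about 16%, close to the 17% quoted above; the exact accounting behind that figure is not spelled out here.

```python
# Rows copied from the profile above: (ticks, percent, symbol).
# Only the two largest rows are needed to recover the totals.
COPY = (11009, 67.0136, "file_read_actor")
IDLE = (2781, 16.9284, "default_idle")


def total_ticks(ticks, percent):
    """Recover the total sample count from one row's ticks and percentage."""
    return round(ticks * 100.0 / percent)


def overhead_share(copy_row, idle_row):
    """Fraction of all samples that are neither the copy row nor the idle row.

    Assumption: "overhead" means everything except those two rows.
    """
    total = total_ticks(copy_row[0], copy_row[1])
    other = total - copy_row[0] - idle_row[0]
    return other / total


if __name__ == "__main__":
    total = total_ticks(COPY[0], COPY[1])
    share = overhead_share(COPY, IDLE)
    print(f"total samples: {total}")
    print(f"everything except copy and idle: {share:.1%}")
```

Note that the listing above shows only the largest entries, so summing the visible rows gives a smaller number than this remainder.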

And a larger block size significantly penalises bandwidth in
the many-small-files case. The larger the blocks, the worse
it gets. You end up having to implement complexities such
as tail-merging to get around the inefficiency which the
workaround for your other inefficiency introduced.
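
To put rough numbers on the small-file penalty, here is a minimal sketch in Python. It assumes a file always occupies a whole number of blocks and that whole blocks are what gets allocated and transferred; it ignores metadata, readahead and tail-merging. The 1500-byte file size and the function names are hypothetical, picked only for illustration.

```python
def blocks_needed(file_size, block_size):
    """Number of whole blocks a file of file_size bytes occupies."""
    return -(-file_size // block_size)  # ceiling division


def allocated_bytes(file_sizes, block_size):
    """Total bytes allocated when every file is rounded up to whole blocks."""
    return sum(blocks_needed(s, block_size) * block_size for s in file_sizes)


def efficiency(file_sizes, block_size):
    """Useful file bytes divided by the bytes that must be allocated."""
    return sum(file_sizes) / allocated_bytes(file_sizes, block_size)


if __name__ == "__main__":
    # Hypothetical workload: 1000 files of 1500 bytes each.
    files = [1500] * 1000
    for bs in (4096, 8192, 65536):
        share = efficiency(files, bs)
        print(f"{bs:6d}-byte blocks: {share:.1%} of allocated bytes are file data")
```

Under these assumptions each doubling of the block size halves the useful fraction for files smaller than one block, which is the effect that tail-merging tries to claw back.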

And larger pages with small blocks aren't an answer - the CPU load
and seek costs from 2 blocks-per-page are measurable. At
4 blocks-per-page it's getting serious.

Small pages and pagesize=blocksize are good. I see no point in
going to larger pages or blocks until the current scheme is
working efficiently and has been *proven* to still be unfixably
inadequate.

The current code sucks. Simply amortising that suckiness across
larger blocks is not the right thing to do.