From: Andrea Arcangeli <andrea@suse.de>
To: David Chinner <dgc@sgi.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
Nathan Scott <nscott@aconex.com>,
Nick Piggin <nickpiggin@yahoo.com.au>,
Christoph Lameter <clameter@sgi.com>, Mel Gorman <mel@skynet.ie>,
linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
Christoph Hellwig <hch@lst.de>,
William Lee Irwin III <wli@holomorphy.com>,
Jens Axboe <jens.axboe@oracle.com>,
Badari Pulavarty <pbadari@gmail.com>,
Maxim Levitsky <maximlevitsky@gmail.com>,
Fengguang Wu <fengguang.wu@gmail.com>,
swin wang <wangswin@gmail.com>,
totty.lu@gmail.com, hugh@veritas.com, joern@lazybastard.org
Subject: Re: [00/41] Large Blocksize Support V7 (adds memmap support)
Date: Wed, 19 Sep 2007 16:04:30 +0200 [thread overview]
Message-ID: <20070919140430.GJ4608@v2.random> (raw)
In-Reply-To: <20070919050910.GK995458@sgi.com>
On Wed, Sep 19, 2007 at 03:09:10PM +1000, David Chinner wrote:
> Ok, let's step back for a moment and look at a basic, fundamental
> constraint of disks - seek capacity. A decade ago, a terabyte of
> filesystem had 30 disks behind it - a seek capacity of about
> 6000 seeks/s. Nowdays, that's a single disk with a seek
> capacity of about 200/s. We're going *rapidly* backwards in
> terms of seek capacity per terabyte of storage.
>
> Now fill that terabyte of storage and index it in the most efficient
> way - let's say btrees are used because lots of filesystems use
> them. Hence the depth of the tree is roughly O((log n)/m) where m is
> a factor of the btree block size. Effectively, btree depth = seek
> count on lookup of any object.
I agree. btrees will clearly benefit if the nodes are larger. We've an
excess of disk capacity and an huge gap between seeking and contiguous
bandwidth.
You don't need largepages for this, fsblocks is enough.
Largepages for you are a further improvement to avoid reducing the SG
entries and potentially reducing the cpu utilization a bit (not much
though, only the pagecache works with largepages and especially with
small sized random I/O you'll be taking the radix tree lock the same
number of times...).
Plus of course you don't like fsblock because it requires work to
adapt a fs to it, I can't argue about that.
> Ok, so let's set the record straight. There were 3 justifications
> for using *large pages* to *support* large filesystem block sizes
> The justifications for the variable order page cache with large
> pages were:
>
> 1. little code change needed in the filesystems
> -> still true
Disagree, the mmap side is not a little change. If you do it just for
the not-mmapped I/O that truly is an hack, but then frankly I would
prefer only the read/write hack (without mmap) so it will not reject
heavily with my stuff and it'll be quicker to nuke it out of the
kernel later.
> 3. avoiding the need for vmap() as it has great
> overhead and does not scale
> -> Nick is starting to work on that and has
> already had good results.
Frankly I don't follow this vmap thing. Can you elaborate? Is this
about allowing the blkdev pagecache for metadata to go in highmemory?
Is that the kmap thing? I think we can stick to a direct mapped b_data
and avoid all overhead of converting a struct page to a virtual
address. It takes the same 64bit size anyway in ram and we avoid one
layer of indirection and many modifications. If we wanted to switch to
kmap for blkdev pagecache we should have done years ago, now it's far
too late to worry about it.
> Everyone seems to be focussing on #2 as the entire justification for
> large block sizes in filesystems and that this is an "SGI" problem.
I agree it's not a SGI problem and this is why I want a design that
has a _slight chance_ to improve performance on x86-64 too. If
variable order page cache will provide any further improvement on top
of fsblock will be only because your I/O device isn't fast with small
sg entries.
For the I/O layout the fsblock is more than enough, but I don't think
your variable order page cache will help in any significant way on
x86-64. Furthermore the complexity of handle page faults on largepages
is almost equivalent to the complexity of config-page-shift, but
config-page-shift gives you the whole cpu-saving benefits that you can
never remotely hope to achieve with variable order page cache.
config-page-shift + fsblock IMHO is the way to go for x86-64, with one
additional 64k PAGE_SIZE rpm. config-page-shift will stack nicely on
top of fsblocks.
fsblock will provide the guarantee of "mounting" all fs anywhere no
matter which config-page-shift you selected at compile time, as well
as dvd writing. Then config-page-shift will provide the cpu
optimization on all fronts, not just for the pagecache I/O for the
large ram systems, without fragmentation issues and with 100%
reliability in the "free" numbers (not working by luck). That's all we
need as far as I can tell.
next prev parent reply other threads:[~2007-09-19 14:04 UTC|newest]
Thread overview: 187+ messages / expand[flat|nested] mbox.gz Atom feed top
2007-09-11 6:03 [00/41] Large Blocksize Support V7 (adds memmap support) Christoph Lameter
2007-09-10 18:52 ` Nick Piggin
2007-09-11 12:05 ` Andrea Arcangeli
2007-09-11 20:03 ` Christoph Lameter
2007-09-11 12:12 ` Jörn Engel
2007-09-10 21:13 ` Nick Piggin
2007-09-11 16:02 ` Goswin von Brederlow
2007-09-11 20:07 ` Christoph Lameter
2007-09-11 20:29 ` Jörn Engel
2007-09-11 20:41 ` Christoph Lameter
2007-09-11 23:26 ` Andrea Arcangeli
2007-09-12 0:04 ` Christoph Lameter
2007-09-12 8:20 ` Andrea Arcangeli
2007-09-15 8:44 ` Andrew Morton
2007-09-15 12:14 ` Goswin von Brederlow
2007-09-15 15:51 ` Andrea Arcangeli
2007-09-15 20:14 ` Goswin von Brederlow
2007-09-15 22:30 ` Andrea Arcangeli
2007-09-16 13:54 ` Goswin von Brederlow
2007-09-16 15:08 ` Andrea Arcangeli
2007-09-16 21:08 ` Mel Gorman
2007-09-16 22:48 ` Goswin von Brederlow
2007-09-17 9:30 ` Mel Gorman
2007-09-16 17:46 ` Jörn Engel
2007-09-16 18:15 ` Linus Torvalds
2007-09-16 18:21 ` Jörn Engel
2007-09-16 18:44 ` Linus Torvalds
2007-09-16 22:51 ` Goswin von Brederlow
2007-09-23 17:44 ` Jörn Engel
2007-09-16 22:06 ` Goswin von Brederlow
2007-09-16 22:40 ` Jörn Engel
2007-09-16 18:15 ` Mel Gorman
2007-09-16 18:50 ` Andrea Arcangeli
2007-09-16 20:54 ` Mel Gorman
2007-09-16 21:31 ` Andrea Arcangeli
2007-09-17 10:13 ` Mel Gorman
2007-09-23 5:50 ` Goswin von Brederlow
2007-09-16 22:56 ` Goswin von Brederlow
2007-09-18 19:31 ` Andrea Arcangeli
2007-09-23 6:56 ` Goswin von Brederlow
2007-09-24 15:39 ` Andrea Arcangeli
2007-09-16 18:13 ` Mel Gorman
2007-09-16 9:03 ` Nick Piggin
2007-09-17 22:00 ` Christoph Lameter
2007-09-18 0:11 ` Nick Piggin
2007-09-18 20:36 ` Christoph Lameter
2007-09-18 10:00 ` Mel Gorman
2007-09-18 10:49 ` Jörn Engel
2007-09-18 12:31 ` David Chinner
2007-09-16 21:58 ` Goswin von Brederlow
2007-09-17 10:03 ` Mel Gorman
2007-09-23 6:22 ` Goswin von Brederlow
2007-09-24 12:32 ` Kyle Moffett
2007-09-16 17:53 ` Jörn Engel
2007-09-16 21:31 ` Mel Gorman
2007-09-17 22:03 ` Christoph Lameter
2007-09-11 15:36 ` Mel Gorman
2007-09-11 1:44 ` Nick Piggin
2007-09-11 20:11 ` Christoph Lameter
2007-09-11 4:53 ` Nick Piggin
2007-09-11 20:42 ` Christoph Lameter
2007-09-11 5:30 ` Nick Piggin
2007-09-11 21:41 ` Christoph Lameter
2007-09-11 6:06 ` Nick Piggin
2007-09-11 21:52 ` Christoph Lameter
2007-09-11 18:07 ` Nick Piggin
2007-09-12 23:06 ` Christoph Lameter
2007-09-13 20:51 ` Nick Piggin
2007-09-14 17:52 ` Christoph Lameter
2007-09-16 8:22 ` Nick Piggin
2007-09-17 22:05 ` Christoph Lameter
2007-09-18 0:10 ` Nick Piggin
2007-09-18 20:42 ` Christoph Lameter
2007-09-17 11:10 ` Bernd Schmidt
2007-09-17 22:10 ` Christoph Lameter
2007-09-14 16:10 ` Goswin von Brederlow
2007-09-14 17:42 ` Mel Gorman
2007-09-15 0:31 ` Goswin von Brederlow
2007-09-16 21:16 ` Mel Gorman
2007-09-16 22:38 ` Goswin von Brederlow
2007-09-17 8:57 ` Mel Gorman
2007-09-23 6:49 ` Goswin von Brederlow
2007-09-11 20:53 ` Mel Gorman
2007-09-11 6:00 ` Nick Piggin
2007-09-11 21:48 ` Christoph Lameter
2007-09-11 6:17 ` Nick Piggin
2007-09-12 0:00 ` Christoph Lameter
2007-09-12 2:46 ` Nick Piggin
2007-09-12 23:17 ` Christoph Lameter
2007-09-13 9:40 ` Mel Gorman
2007-09-14 2:38 ` Christoph Lameter
2007-09-13 21:20 ` Nick Piggin
2007-09-14 18:08 ` Christoph Lameter
2007-09-14 18:15 ` Christoph Lameter
2007-09-15 0:33 ` Goswin von Brederlow
2007-09-16 8:53 ` Nick Piggin
2007-09-17 22:21 ` Christoph Lameter
2007-09-18 1:16 ` Nick Piggin
2007-09-18 18:30 ` Linus Torvalds
2007-09-18 17:53 ` Nick Piggin
2007-09-18 19:18 ` Andrea Arcangeli
2007-09-18 19:44 ` Linus Torvalds
2007-09-19 0:58 ` Nathan Scott
2007-09-19 1:06 ` Linus Torvalds
2007-09-19 2:45 ` Nathan Scott
2007-09-19 5:09 ` David Chinner
2007-09-19 9:41 ` Alex Tomas
2007-09-19 14:04 ` Andrea Arcangeli [this message]
2007-09-20 1:38 ` David Chinner
2007-09-20 14:54 ` Andrea Arcangeli
2007-09-20 18:11 ` Christoph Lameter
2007-09-20 18:07 ` Christoph Lameter
2007-09-21 20:41 ` Hugh Dickins
2007-09-24 21:13 ` Christoph Lameter
2007-09-28 2:46 ` Nick Piggin
2007-09-19 3:41 ` Rene Herman
2007-09-19 3:50 ` Linus Torvalds
2007-09-19 4:26 ` Rene Herman
2007-09-19 4:33 ` Linus Torvalds
2007-09-19 4:56 ` Rene Herman
2007-09-11 21:54 ` Mel Gorman
2007-09-12 14:29 ` Martin J. Bligh
2007-09-12 1:49 ` David Chinner
2007-09-11 15:27 ` Nick Piggin
2007-09-13 1:49 ` David Chinner
2007-09-12 17:23 ` Nick Piggin
2007-09-13 13:03 ` David Chinner
2007-09-13 2:01 ` Nick Piggin
2007-09-13 20:48 ` Nick Piggin
2007-09-17 4:07 ` David Chinner
2007-09-16 21:13 ` Nick Piggin
2007-09-12 2:01 ` Nick Piggin
2007-09-11 21:35 ` Christoph Lameter
2007-09-11 16:47 ` Andrea Arcangeli
2007-09-11 18:31 ` Mel Gorman
2007-09-11 2:26 ` Nick Piggin
2007-09-11 18:25 ` Maxim Levitsky
2007-09-11 3:05 ` Nick Piggin
2007-09-11 21:03 ` Mel Gorman
2007-09-11 19:20 ` Andrea Arcangeli
2007-09-11 20:19 ` Jörn Engel
2007-09-11 20:13 ` Christoph Lameter
2007-09-11 20:01 ` Christoph Lameter
2007-09-11 4:43 ` Nick Piggin
2007-09-11 5:17 ` Nick Piggin
2007-09-11 21:27 ` Mel Gorman
2007-09-11 6:03 ` [01/41] Pagecache zeroing: zero_user_segment, zero_user_segments and zero_user Christoph Lameter
2007-09-11 6:03 ` [02/41] Define functions for page cache handling Christoph Lameter
2007-09-11 6:03 ` [03/41] Use page_cache_xxx functions in mm/filemap.c Christoph Lameter
2007-09-11 6:03 ` [04/41] Use page_cache_xxx in mm/page-writeback.c Christoph Lameter
2007-09-11 6:03 ` [05/41] Use page_cache_xxx in mm/truncate.c Christoph Lameter
2007-09-11 6:03 ` [06/41] Use page_cache_xxx in mm/rmap.c Christoph Lameter
2007-09-11 6:03 ` [07/41] Use page_cache_xxx in mm/filemap_xip.c Christoph Lameter
2007-09-11 6:03 ` [08/41] Use page_cache_xxx in mm/migrate.c Christoph Lameter
2007-09-11 6:03 ` [09/41] Use page_cache_xxx in fs/libfs.c Christoph Lameter
2007-09-11 6:04 ` [10/41] Use page_cache_xxx in fs/sync Christoph Lameter
2007-09-11 6:04 ` [11/41] Use page_cache_xxx in fs/buffer.c Christoph Lameter
2007-09-11 6:04 ` [12/41] Use page_cache_xxx in mm/mpage.c Christoph Lameter
2007-09-11 6:04 ` [13/41] Use page_cache_xxx in mm/fadvise.c Christoph Lameter
2007-09-11 6:04 ` [14/41] Use page_cache_xxx in fs/splice.c Christoph Lameter
2007-09-11 6:04 ` [15/41] Use page_cache_xxx in ext2 Christoph Lameter
2007-09-11 6:04 ` [16/41] Use page_cache_xxx in fs/ext3 Christoph Lameter
2007-09-11 6:04 ` [17/41] Use page_cache_xxx in fs/ext4 Christoph Lameter
2007-09-11 6:04 ` [18/41] Use page_cache_xxx in fs/reiserfs Christoph Lameter
2007-09-11 6:04 ` [19/41] Use page_cache_xxx for fs/xfs Christoph Lameter
2007-09-11 6:04 ` [20/41] Use page_cache_xxx in drivers/block/rd.c Christoph Lameter
2007-09-11 6:04 ` [21/41] compound pages: Better PageHead/PageTail handling Christoph Lameter
2007-09-11 6:04 ` [22/41] compound pages: Add new support functions Christoph Lameter
2007-09-11 6:04 ` [23/41] compound pages: vmstat support Christoph Lameter
2007-09-11 6:04 ` [24/41] compound pages: Use new compound vmstat functions in SLUB Christoph Lameter
2007-09-11 6:04 ` [25/41] compound pages: Allow use of get_page_unless_zero with compound pages Christoph Lameter
2007-09-11 6:04 ` [26/41] compound pages: Allow freeing of compound pages via pagevec Christoph Lameter
2007-09-11 6:04 ` [27/41] Large page order operations, zeroing and flushing Christoph Lameter
2007-09-11 6:04 ` [28/41] Futex: Fix PAGE SIZE assumption Christoph Lameter
2007-09-11 6:04 ` [29/41] Fix up reclaim counters Christoph Lameter
2007-09-11 6:04 ` [30/41] Add VM_BUG_ONs to check for correct page order Christoph Lameter
2007-09-11 6:04 ` [31/41] Large Blocksize: Core piece Christoph Lameter
2007-09-11 6:04 ` [32/41] Readahead changes to support large blocksize Christoph Lameter
2007-09-11 6:04 ` [33/41] Large blocksize support in ramfs Christoph Lameter
2007-09-11 6:04 ` [34/41] Large blocksize support for XFS Christoph Lameter
2007-09-11 6:04 ` [35/41] Reiserfs: Fix up mapping_set_gfp_mask() Christoph Lameter
2007-09-11 6:04 ` [36/41] 64k block size support for Ext2/3/4 Christoph Lameter
2007-09-11 6:04 ` [37/41] ext2: fix rec_len overflow for 64KB block size Christoph Lameter
2007-09-11 6:04 ` [38/41] ext3: fix rec_len overflow with " Christoph Lameter
2007-09-11 6:04 ` [39/41] ext4: fix rec_len overflow for " Christoph Lameter
2007-09-11 6:04 ` [40/41] Do not use f_mapping in simple_prepare_write() Christoph Lameter
2007-09-11 6:04 ` [41/41] Mmap support using pte PAGE_SIZE mappings Christoph Lameter
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20070919140430.GJ4608@v2.random \
--to=andrea@suse.de \
--cc=clameter@sgi.com \
--cc=dgc@sgi.com \
--cc=fengguang.wu@gmail.com \
--cc=hch@lst.de \
--cc=hugh@veritas.com \
--cc=jens.axboe@oracle.com \
--cc=joern@lazybastard.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=maximlevitsky@gmail.com \
--cc=mel@skynet.ie \
--cc=nickpiggin@yahoo.com.au \
--cc=nscott@aconex.com \
--cc=pbadari@gmail.com \
--cc=torvalds@linux-foundation.org \
--cc=totty.lu@gmail.com \
--cc=wangswin@gmail.com \
--cc=wli@holomorphy.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).