From: Andrea Arcangeli <andrea@suse.de>
To: David Chinner <dgc@sgi.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Nathan Scott <nscott@aconex.com>,
	Nick Piggin <nickpiggin@yahoo.com.au>,
	Christoph Lameter <clameter@sgi.com>, Mel Gorman <mel@skynet.ie>,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	Christoph Hellwig <hch@lst.de>,
	William Lee Irwin III <wli@holomorphy.com>,
	Jens Axboe <jens.axboe@oracle.com>,
	Badari Pulavarty <pbadari@gmail.com>,
	Maxim Levitsky <maximlevitsky@gmail.com>,
	Fengguang Wu <fengguang.wu@gmail.com>,
	swin wang <wangswin@gmail.com>,
	totty.lu@gmail.com, hugh@veritas.com, joern@lazybastard.org
Subject: Re: [00/41] Large Blocksize Support V7 (adds memmap support)
Date: Thu, 20 Sep 2007 16:54:07 +0200
Message-ID: <20070920145407.GY4608@v2.random>
In-Reply-To: <20070920013821.GR995458@sgi.com>

On Thu, Sep 20, 2007 at 11:38:21AM +1000, David Chinner wrote:
> Sure, and that's what I meant when I said VPC + large pages was
> a means to the end, not the only solution to the problem.

The whole point is that it's not the end; it's the end only from your
own fs-centric view (which is fair enough), but I watch the whole VM,
not just the pagecache...

The same way the fs-centric view hopes to get this little bit of
further optimization from largepages to reach "the end", my VM-wide
view wants the same little bit of optimization for *everything*,
including tmpfs, anonymous memory, slab, etc.! This is clearly why
config-page-shift is better...

If you're ok with not being on the edge and you want a generic rpm
image that runs quite optimally for any workload, then 4k+fsblock is
just fine of course. But if we go on the edge we should aim for the
_very_ end for the whole VM, not just for "the end of the pagecache on
certain files". Especially when the complexity involved in the mmap
code is similar, and config-page-shift will reject heavily if we merge
this not-very-end solution that only reaches "the end" for the
pagecache.

> No, I don't like fsblock because it is inherently a "structure
> per filesystem block" construct, just like buggerheads. You
> still need to allocate millions of them when you have millions
> dirty pages around. Rather than type it all out again, read
> the fsblocks thread from here:
> 
> http://marc.info/?l=linux-fsdevel&m=118284983925719&w=2

Thanks for the pointer!

> FWIW, with Chris Mason's extent-based block mapping (which btrfs
> is using and Christoph Hellwig is porting XFS over to) we completely
> remove buggerheads from XFS and so fsblock would be a pretty
> major step backwards for us if Chris's work goes into mainline.

I tend to agree; if we change it, fsblock should support extents if
that's what you need on xfs to support range-locking etc... Whatever
happens in vfs should please all existing fs without people needing to
go their own way again... Or replace fsblock with Chris's block
mapping. Frankly I haven't seen Chris's code so I cannot comment
further. But your complaints sound sensible. We certainly want to
avoid the lowlevel fs getting smarter than the vfs again. The brainier
stuff should be in vfs!

> That's not in the filesystem, though. ;)
> 
> However, I agree that if you don't have mmap then it's not
> worthwhile and the changes for VPC aren't trivial.

Yep.

> 
> > > 	3. avoiding the need for vmap() as it has great
> > > 	   overhead and does not scale
> > > 	   	-> Nick is starting to work on that and has
> > > 		   already had good results.
> > 
> > Frankly I don't follow this vmap thing. Can you elaborate?
> 
> We currently support metadata blocks larger than page size for
> certain types of metadata in XFS. e.g. directory blocks.
> This however, requires vmap()ing a bunch of individual,
> non-contiguous pages out of a block device address space
> in exactly the fashion that was proposed by Nick with fsblock
> originally.
> 
> vmap() has severe scalability problems - read this subthread
> of this discussion between Nick and myself:
> 
> http://lkml.org/lkml/2007/9/11/508

So the idea of vmap is that it's much simpler to have a contiguous
virtual address range as large as the blocksize than to find the right
b_data[index] once you exceed PAGE_SIZE...
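
To make the difference concrete, here is a minimal userspace sketch of
the two access styles. It is not kernel code: the SKETCH_ macros and
both helper names are made up for illustration, and a 4k page is only
an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical values for illustration; not taken from any kernel tree. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  ((size_t)1 << SKETCH_PAGE_SHIFT)

/*
 * Block backed by separate, non-contiguous pages: every access has to
 * pick the page first and then the offset inside that page.
 */
static unsigned char *block_byte_paged(unsigned char **pages, size_t npages,
                                       size_t offset)
{
    size_t index = offset >> SKETCH_PAGE_SHIFT;        /* which page */
    size_t in_page = offset & (SKETCH_PAGE_SIZE - 1);  /* where in it */

    assert(index < npages);
    return pages[index] + in_page;
}

/*
 * Block mapped at one contiguous virtual address: plain pointer
 * arithmetic, no per-access page lookup.
 */
static unsigned char *block_byte_contig(unsigned char *base, size_t blocksize,
                                        size_t offset)
{
    assert(offset < blocksize);
    return base + offset;
}
```

With the paged variant any structure that straddles a page boundary
also has to be handled piecewise by the caller, which is the part a
contiguous mapping avoids.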

The global tlb flush with ipi would kill performance, you can forget
any global mapping here. The only chance to do this would be like we
do with kmap_atomic per-cpu on highmem, with preempt_disable (for the
enjoyment of the rt folks out there ;). What's the problem with having
it per-cpu? Is this what fsblock already does? You just have to
allocate a new virtual range of size numberofentriesinvmap*blocksize
every time you mount a new fs. Then instead of calling kmap you call
vmap and vunmap when you're finished. That should provide decent
performance, especially with physically indexed caches.
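
The sizing of such a per-mount range can be sketched as below. This
only shows the address arithmetic, not any real mapping: struct
vmap_window and both helpers are hypothetical names, and one slot per
CPU is an assumption (numberofentriesinvmap could be larger).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor of a per-mount virtual window; illustration only. */
struct vmap_window {
    uintptr_t base;       /* start of the reserved virtual range */
    size_t blocksize;     /* filesystem block size in bytes */
    unsigned int nslots;  /* assumed: one slot per CPU */
};

/* Total virtual space to reserve at mount time: nslots * blocksize. */
static size_t vmap_window_size(const struct vmap_window *w)
{
    return (size_t)w->nslots * w->blocksize;
}

/* Virtual address at which a given CPU would map its current block. */
static uintptr_t vmap_slot_addr(const struct vmap_window *w, unsigned int cpu)
{
    assert(cpu < w->nslots);
    return w->base + (uintptr_t)cpu * w->blocksize;
}
```

Because each CPU only ever touches its own slot, unmapping would need
a local tlb flush only, which is the point of keeping it per-cpu.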

Anything more heavyweight than what I suggested is probably overkill,
even vmalloc_to_page.

> Hmm - so you'll need page cache tail packing as well in that case
> to prevent memory being wasted on small files. That means any way
> we look at it (VPC+mmap or config-page-shift+fsblock+pctails)
> we've got some non-trivial VM modifications to make.

Hmm no, the point of config-page-shift is that if you really need to
reach "the very end", you probably don't care about wasting some
memory, because either your workload can't fit in cache, or it fits in
cache regardless, or you're not wasting memory because you work with
large files...
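
For a feel of how much memory is at stake, the tail waste per file can
be worked out as in this small sketch. The helper name is made up and
the page shifts used in the note below are only example values.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/*
 * Bytes left unused in the last page of a cached file when the page
 * size is 1 << page_shift. An empty file uses no page and wastes
 * nothing.
 */
static size_t pagecache_tail_waste(size_t file_size, unsigned int page_shift)
{
    size_t page_size, rem;

    assert(page_shift < sizeof(size_t) * CHAR_BIT);
    page_size = (size_t)1 << page_shift;
    rem = file_size & (page_size - 1);   /* bytes used in the last page */
    return rem ? page_size - rem : 0;
}
```

A 1000 byte file wastes 3096 bytes with a page shift of 12 and 64536
bytes with a page shift of 16, while a file that is an exact multiple
of the page size wastes nothing at either setting.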

The only point of this largepage stuff is to go an extra mile to save
a bit more cpu vs a strict vmap based solution (fsblock of course
will be smart enough that if it notices PAGE_SIZE is >= blocksize
it doesn't need to run any vmap at all and can just use the direct
mapping, so vmap translates into only 1 branch to check the blocksize
variable; PAGE_SIZE is an immediate in the .text at compile time). But
if you care about that tiny bit of performance during I/O operations
(variable order page cache only gives the tiny bit of performance
during read/write syscalls!!!), then it means you actually want to
save CPU _everywhere_, not just in read/write and while mangling
metadata in the lowlevel fs. And that's what config-page-shift should
provide...
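
The single branch mentioned above would look roughly like this. It is
a userspace sketch with made-up names (CFG_PAGE_SHIFT stands in for a
compile-time page shift), not what fsblock actually implements.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical compile-time page shift; a real build would configure it. */
#ifndef CFG_PAGE_SHIFT
#define CFG_PAGE_SHIFT 12
#endif
#define CFG_PAGE_SIZE ((size_t)1 << CFG_PAGE_SHIFT)

/*
 * One comparison against a compile-time constant: a block that fits in
 * a single page can use the direct mapping, only larger blocks need a
 * vmap style contiguous mapping.
 */
static bool block_needs_vmap(size_t blocksize)
{
    return blocksize > CFG_PAGE_SIZE;
}

/* Number of pages backing one block, rounded up. */
static size_t pages_per_block(size_t blocksize)
{
    assert(blocksize > 0);
    return (blocksize + CFG_PAGE_SIZE - 1) >> CFG_PAGE_SHIFT;
}
```

Raising CFG_PAGE_SHIFT until CFG_PAGE_SIZE reaches the blocksize makes
block_needs_vmap() false for every block, so the vmap path is never
taken at all.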

This is my whole argument for preferring config-page-shift+fsblock (or
whatever other fsblock replacement, but Nick's design looked quite
sensible to me if integrated with extent based locking, without
having seen Chris's yet). That holds regardless of the fact that
config-page-shift also has the other benefit of providing guarantees
for meminfo levels, and the other fact that it doesn't strictly
require defrag heuristics to avoid hitting worst case huge-ram-waste
scenarios.

> But, I'm not going to argue endlessly for one solution or another;
> I'm happy to see different solutions being chased, so may the
> best VM win ;)

;)
