linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Mel Gorman <mel@csn.ul.ie>
To: Andrea Arcangeli <andrea@suse.de>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>,
	Christoph Lameter <clameter@sgi.com>,
	torvalds@linux-foundation.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, Christoph Hellwig <hch@lst.de>,
	Mel Gorman <mel@skynet.ie>,
	William Lee Irwin III <wli@holomorphy.com>,
	David Chinner <dgc@sgi.com>, Jens Axboe <jens.axboe@oracle.com>,
	Badari Pulavarty <pbadari@gmail.com>,
	Maxim Levitsky <maximlevitsky@gmail.com>,
	Fengguang Wu <fengguang.wu@gmail.com>,
	swin wang <wangswin@gmail.com>,
	totty.lu@gmail.com, hugh@veritas.com, joern@lazybastard.org
Subject: Re: [00/41] Large Blocksize Support V7 (adds memmap support)
Date: Tue, 11 Sep 2007 19:31:01 +0100	[thread overview]
Message-ID: <1189535461.32731.75.camel@localhost> (raw)
In-Reply-To: <20070911164702.GD10831@v2.random>

On Tue, 2007-09-11 at 18:47 +0200, Andrea Arcangeli wrote:
> Hi Mel,
> 

Hi,

> On Tue, Sep 11, 2007 at 04:36:07PM +0100, Mel Gorman wrote:
> > that increasing the pagesize like what Andrea suggested would lead to
> > internal fragmentation problems. Regrettably we didn't discuss Andrea's
> 
> The config_page_shift guarantees the kernel stacks or whatever not
> defragmentable allocation other allocation goes into the same 64k "not
> defragmentable" page. Not like with SGI design that a 8k kernel stack
> could be allocated in the first 64k page, and then another 8k stack
> could be allocated in the next 64k page, effectively pinning all 64k
> pages until Nick worst case scenario triggers.
> 

In practice, it's pretty difficult to trigger. Buddy allocators always
try and use the smallest possible sized buddy to split. Once a 64K is
split for a 4K or 8K allocation, the remainder of that block will be
used for other 4K, 8K, 16K, 32K allocations. The situation where
multiple 64K blocks gets split does not occur.

Now, the worst case scenario for your patch is that a hostile process
allocates large amount of memory and mlocks() one 4K page per 64K chunk
(this is unlikely in practice I know). The end result is you have many
64KB regions that are now unusable because 4K is pinned in each of them.
Your approach is not immune from problems either. To me, only Nicks
approach is bullet-proof in the long run.

> What I said at the VM summit is that your reclaim-defrag patch in the
> slub isn't necessarily entirely useless with config_page_shift,
> because the larger the software page_size, the more partial pages we
> could find in the slab, so to save some memory if there are tons of
> pages very partially used, we could free some of them.
> 

This is true. Slub targetted reclaim (Chrisophs work) is useful
independent of this current problem.

> But the whole point is that with the config_page_shift, Nick's worst
> case scenario can't happen by design regardless of defrag or not
> defrag.

I agree with this. It's why I thought Nick's approach was where we were
going to finish up ultimately.

>  While it can _definitely_ happen with SGI design (regardless
> of any defrag thing).

I have never stated that the SGI design is immune from this problem.

>  We can still try to save some memory by
> defragging the slab a bit, but it's by far *not* required with
> config_page_shift. No defrag at all is required infact.
> 

You will need to take some sort of defragmentation to deal with internal
fragmentation. It's a very similar problem to blasting away at slab
pages and still not being able to free them because objects are in use.
Replace "slab" with "large page" and "object" with "4k page" and the
issues are similar.

> Plus there's a cost in defragging and freeing cache... the more you
> need defrag, the slower the kernel will be.
> 
> > approach in depth.
> 
> Well it wasn't my fault if we didn't discuss it in depth though.

If it's my fault, sorry about that. It wasn't my intention.

>  I
> tried to discuss it in all possible occasions where I was suggested to
> talk about it and where it was somewhat on topic.

Who said it was off-topic? Again, if this was me, sorry - you should
have chucked something at my head to shut me up.

>  Given I wasn't even
> invited at the KS, I felt it would not be appropriate for me to try to
> monopolize the VM summit according to my agenda. So I happily listened
> to what the top kernel developers are planning ;), while giving
> some hints on what I think the right direction is instead.
> 

Right, clearly we failed or at least had sub-optimal results dicussion
this one at VM Summit. Good job we have mail to pick up the stick with.

> > I *thought* that the end conclusion was that we would go with
> 
> Frankly I don't care what the end conclusion was.
> 

heh. Well we need to come to some sort of conclusion here or this will
go around the merri-go-round till we're all bald.

> > Christoph's approach pending two things being resolved;
> > 
> > o mmap() support that we agreed on is good
> 
> Let's see how good the mmap support for variable order page size will
> work after the 2 weeks...
> 

Ok, I'm ok with that.

> > o A clear statement, with logging maybe for users that mounted a large 
> >   block filesystem that it might blow up and they get to keep both parts
> >   when it does. Basically, for now it's only suitable in specialised
> >   environments.
> 
> Yes, but perhaps you missed that such printk is needed exactly to
> provide proof that SGI design is the wrong way and it needs to be
> dumped. If that printk ever triggers it means you were totally wrong.
> 

heh, I suggested printing the warning because I knew it had this
problem. The purpose in my mind was to see how far the design could be
brought before fs-block had to fill in the holes.

> > I also thought there was an acknowledgement that long-term, fs-block was
> > the way to go - possibly using contiguous pages optimistically instead
> > of virtual mapping the pages. At that point, it would be a general
> > solution and we could remove the warnings.
> 
> fsblock should stack on top of config_page_shift simply.

It should be able to stack on top of either approach and arguably
setting slub_min_order=large_block_order with large block filesystems is
90% of your approach anyway.

>  Both are
> needed. You don't want to use 64k pages on a laptop but you may want a
> larger blocksize for the btrees etc... if you've a large harddisk and
> not much ram.

I am still failing to see what happens when there are pagetable pages,
slab objects or mlocked 4k pages pinning the 64K pages and you need to
allocate another 64K page for the filesystem. I *think* you deadlock in
a similar fashion to Christoph's approach but the shape of the problem
is different because we are dealing with internal instead of external
fragmentation. Am I wrong?

> > That's the absolute worst case but yes, in theory this can occur and
> > it's safest to assume the situation will occur somewhere to someone. It
> 
> Do you agree this worst case can't happen with config_page_shift?
> 

Yes. I just think you have a different worst case that is just as bad.

> > Where we expected to see the the use of this patchset was in specialised
> > environments *only*. The SGI people can mitigate their mixed
> > fragmentation problems somewhat by setting slub_min_order ==
> > large_block_order so that blocks get allocated and freed at the same
> > size. This is partial way towards Andrea's solution of raising the size
> > of an order-0 allocation. The point of printing out the warnings at
> 
> Except you don't get all the full benefits of it...
> 
> Even if I could end up mapping 4k kmalloced entries in userland for
> the tail packing, that IMHO would still be a preferable solution than
> to keep the base-page small and to make an hard effort to create large
> pages out of small pages. The approach I advocate keeps the base page
> big and the fast path fast, and it rather does some work to split the
> base pages outside the buddy for the small files.
> 

small files (you need something like Shaggy's page tail packing),
pagetable pages, pte pages all have to be dealt with. These are the
things I think will cause us internal fragmentaiton problems.

> All your defrag work is still good to have, like I said at the VM
> summit if you remember, to grow the hugetlbfs at runtime etc... I just
> rather avoid to depend on it to avoid I/O failure in presence of
> mlocked pagecache for example.
> 

I'd rather avoid depending on it for the system to work 100% of the
same. Hence I've been saying that we need fsblock ultimately for this to
be a 100% supported feature.

> > Independently of that, we would work on order-0 scalability,
> > particularly readahead and batching operations on ranges of pages as
> > much as possible.
> 
> That's pretty much an unnecessary logic, if the order0 pages become
> larger.
> 

Quite possibly.

-- 
Mel Gorman


  reply	other threads:[~2007-09-11 17:31 UTC|newest]

Thread overview: 187+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-09-11  6:03 [00/41] Large Blocksize Support V7 (adds memmap support) Christoph Lameter
2007-09-10 18:52 ` Nick Piggin
2007-09-11 12:05   ` Andrea Arcangeli
2007-09-11 20:03     ` Christoph Lameter
2007-09-11 12:12   ` Jörn Engel
2007-09-10 21:13     ` Nick Piggin
2007-09-11 16:02       ` Goswin von Brederlow
2007-09-11 20:07     ` Christoph Lameter
2007-09-11 20:29       ` Jörn Engel
2007-09-11 20:41         ` Christoph Lameter
2007-09-11 23:26           ` Andrea Arcangeli
2007-09-12  0:04             ` Christoph Lameter
2007-09-12  8:20               ` Andrea Arcangeli
2007-09-15  8:44     ` Andrew Morton
2007-09-15 12:14       ` Goswin von Brederlow
2007-09-15 15:51         ` Andrea Arcangeli
2007-09-15 20:14           ` Goswin von Brederlow
2007-09-15 22:30             ` Andrea Arcangeli
2007-09-16 13:54               ` Goswin von Brederlow
2007-09-16 15:08                 ` Andrea Arcangeli
2007-09-16 21:08                   ` Mel Gorman
2007-09-16 22:48                     ` Goswin von Brederlow
2007-09-17  9:30                       ` Mel Gorman
2007-09-16 17:46               ` Jörn Engel
2007-09-16 18:15                 ` Linus Torvalds
2007-09-16 18:21                   ` Jörn Engel
2007-09-16 18:44                     ` Linus Torvalds
2007-09-16 22:51                       ` Goswin von Brederlow
2007-09-23 17:44                       ` Jörn Engel
2007-09-16 22:06                 ` Goswin von Brederlow
2007-09-16 22:40                   ` Jörn Engel
2007-09-16 18:15           ` Mel Gorman
2007-09-16 18:50             ` Andrea Arcangeli
2007-09-16 20:54               ` Mel Gorman
2007-09-16 21:31                 ` Andrea Arcangeli
2007-09-17 10:13                   ` Mel Gorman
2007-09-23  5:50                     ` Goswin von Brederlow
2007-09-16 22:56               ` Goswin von Brederlow
2007-09-18 19:31                 ` Andrea Arcangeli
2007-09-23  6:56                   ` Goswin von Brederlow
2007-09-24 15:39                     ` Andrea Arcangeli
2007-09-16 18:13         ` Mel Gorman
2007-09-16  9:03           ` Nick Piggin
2007-09-17 22:00             ` Christoph Lameter
2007-09-18  0:11               ` Nick Piggin
2007-09-18 20:36                 ` Christoph Lameter
2007-09-18 10:00               ` Mel Gorman
2007-09-18 10:49                 ` Jörn Engel
2007-09-18 12:31                 ` David Chinner
2007-09-16 21:58           ` Goswin von Brederlow
2007-09-17 10:03             ` Mel Gorman
2007-09-23  6:22               ` Goswin von Brederlow
2007-09-24 12:32                 ` Kyle Moffett
2007-09-16 17:53       ` Jörn Engel
2007-09-16 21:31         ` Mel Gorman
2007-09-17 22:03         ` Christoph Lameter
2007-09-11 15:36   ` Mel Gorman
2007-09-11  1:44     ` Nick Piggin
2007-09-11 20:11       ` Christoph Lameter
2007-09-11  4:53         ` Nick Piggin
2007-09-11 20:42           ` Christoph Lameter
2007-09-11  5:30             ` Nick Piggin
2007-09-11 21:41               ` Christoph Lameter
2007-09-11  6:06                 ` Nick Piggin
2007-09-11 21:52                   ` Christoph Lameter
2007-09-11 18:07                     ` Nick Piggin
2007-09-12 23:06                       ` Christoph Lameter
2007-09-13 20:51                         ` Nick Piggin
2007-09-14 17:52                           ` Christoph Lameter
2007-09-16  8:22                             ` Nick Piggin
2007-09-17 22:05                               ` Christoph Lameter
2007-09-18  0:10                                 ` Nick Piggin
2007-09-18 20:42                                   ` Christoph Lameter
2007-09-17 11:10                         ` Bernd Schmidt
2007-09-17 22:10                           ` Christoph Lameter
2007-09-14 16:10                       ` Goswin von Brederlow
2007-09-14 17:42                         ` Mel Gorman
2007-09-15  0:31                           ` Goswin von Brederlow
2007-09-16 21:16                             ` Mel Gorman
2007-09-16 22:38                               ` Goswin von Brederlow
2007-09-17  8:57                                 ` Mel Gorman
2007-09-23  6:49                                   ` Goswin von Brederlow
2007-09-11 20:53       ` Mel Gorman
2007-09-11  6:00         ` Nick Piggin
2007-09-11 21:48           ` Christoph Lameter
2007-09-11  6:17             ` Nick Piggin
2007-09-12  0:00               ` Christoph Lameter
2007-09-12  2:46                 ` Nick Piggin
2007-09-12 23:17                   ` Christoph Lameter
2007-09-13  9:40                     ` Mel Gorman
2007-09-14  2:38                       ` Christoph Lameter
2007-09-13 21:20                     ` Nick Piggin
2007-09-14 18:08                       ` Christoph Lameter
2007-09-14 18:15                         ` Christoph Lameter
2007-09-15  0:33                           ` Goswin von Brederlow
2007-09-16  8:53                         ` Nick Piggin
2007-09-17 22:21                           ` Christoph Lameter
2007-09-18  1:16                             ` Nick Piggin
2007-09-18 18:30                               ` Linus Torvalds
2007-09-18 17:53                                 ` Nick Piggin
2007-09-18 19:18                                 ` Andrea Arcangeli
2007-09-18 19:44                                   ` Linus Torvalds
2007-09-19  0:58                                     ` Nathan Scott
2007-09-19  1:06                                       ` Linus Torvalds
2007-09-19  2:45                                         ` Nathan Scott
2007-09-19  5:09                                         ` David Chinner
2007-09-19  9:41                                           ` Alex Tomas
2007-09-19 14:04                                           ` Andrea Arcangeli
2007-09-20  1:38                                             ` David Chinner
2007-09-20 14:54                                               ` Andrea Arcangeli
2007-09-20 18:11                                                 ` Christoph Lameter
2007-09-20 18:07                                               ` Christoph Lameter
2007-09-21 20:41                                                 ` Hugh Dickins
2007-09-24 21:13                                                   ` Christoph Lameter
2007-09-28  2:46                                               ` Nick Piggin
2007-09-19  3:41                                     ` Rene Herman
2007-09-19  3:50                                       ` Linus Torvalds
2007-09-19  4:26                                         ` Rene Herman
2007-09-19  4:33                                           ` Linus Torvalds
2007-09-19  4:56                                             ` Rene Herman
2007-09-11 21:54             ` Mel Gorman
2007-09-12 14:29             ` Martin J. Bligh
2007-09-12  1:49           ` David Chinner
2007-09-11 15:27             ` Nick Piggin
2007-09-13  1:49               ` David Chinner
2007-09-12 17:23                 ` Nick Piggin
2007-09-13 13:03                   ` David Chinner
2007-09-13  2:01                     ` Nick Piggin
2007-09-13 20:48                       ` Nick Piggin
2007-09-17  4:07                         ` David Chinner
2007-09-16 21:13                           ` Nick Piggin
2007-09-12  2:01             ` Nick Piggin
2007-09-11 21:35         ` Christoph Lameter
2007-09-11 16:47     ` Andrea Arcangeli
2007-09-11 18:31       ` Mel Gorman [this message]
2007-09-11  2:26         ` Nick Piggin
2007-09-11 18:25           ` Maxim Levitsky
2007-09-11  3:05             ` Nick Piggin
2007-09-11 21:03           ` Mel Gorman
2007-09-11 19:20         ` Andrea Arcangeli
2007-09-11 20:19           ` Jörn Engel
2007-09-11 20:13       ` Christoph Lameter
2007-09-11 20:01   ` Christoph Lameter
2007-09-11  4:43     ` Nick Piggin
2007-09-11  5:17     ` Nick Piggin
2007-09-11 21:27       ` Mel Gorman
2007-09-11  6:03 ` [01/41] Pagecache zeroing: zero_user_segment, zero_user_segments and zero_user Christoph Lameter
2007-09-11  6:03 ` [02/41] Define functions for page cache handling Christoph Lameter
2007-09-11  6:03 ` [03/41] Use page_cache_xxx functions in mm/filemap.c Christoph Lameter
2007-09-11  6:03 ` [04/41] Use page_cache_xxx in mm/page-writeback.c Christoph Lameter
2007-09-11  6:03 ` [05/41] Use page_cache_xxx in mm/truncate.c Christoph Lameter
2007-09-11  6:03 ` [06/41] Use page_cache_xxx in mm/rmap.c Christoph Lameter
2007-09-11  6:03 ` [07/41] Use page_cache_xxx in mm/filemap_xip.c Christoph Lameter
2007-09-11  6:03 ` [08/41] Use page_cache_xxx in mm/migrate.c Christoph Lameter
2007-09-11  6:03 ` [09/41] Use page_cache_xxx in fs/libfs.c Christoph Lameter
2007-09-11  6:04 ` [10/41] Use page_cache_xxx in fs/sync Christoph Lameter
2007-09-11  6:04 ` [11/41] Use page_cache_xxx in fs/buffer.c Christoph Lameter
2007-09-11  6:04 ` [12/41] Use page_cache_xxx in mm/mpage.c Christoph Lameter
2007-09-11  6:04 ` [13/41] Use page_cache_xxx in mm/fadvise.c Christoph Lameter
2007-09-11  6:04 ` [14/41] Use page_cache_xxx in fs/splice.c Christoph Lameter
2007-09-11  6:04 ` [15/41] Use page_cache_xxx in ext2 Christoph Lameter
2007-09-11  6:04 ` [16/41] Use page_cache_xxx in fs/ext3 Christoph Lameter
2007-09-11  6:04 ` [17/41] Use page_cache_xxx in fs/ext4 Christoph Lameter
2007-09-11  6:04 ` [18/41] Use page_cache_xxx in fs/reiserfs Christoph Lameter
2007-09-11  6:04 ` [19/41] Use page_cache_xxx for fs/xfs Christoph Lameter
2007-09-11  6:04 ` [20/41] Use page_cache_xxx in drivers/block/rd.c Christoph Lameter
2007-09-11  6:04 ` [21/41] compound pages: Better PageHead/PageTail handling Christoph Lameter
2007-09-11  6:04 ` [22/41] compound pages: Add new support functions Christoph Lameter
2007-09-11  6:04 ` [23/41] compound pages: vmstat support Christoph Lameter
2007-09-11  6:04 ` [24/41] compound pages: Use new compound vmstat functions in SLUB Christoph Lameter
2007-09-11  6:04 ` [25/41] compound pages: Allow use of get_page_unless_zero with compound pages Christoph Lameter
2007-09-11  6:04 ` [26/41] compound pages: Allow freeing of compound pages via pagevec Christoph Lameter
2007-09-11  6:04 ` [27/41] Large page order operations, zeroing and flushing Christoph Lameter
2007-09-11  6:04 ` [28/41] Futex: Fix PAGE SIZE assumption Christoph Lameter
2007-09-11  6:04 ` [29/41] Fix up reclaim counters Christoph Lameter
2007-09-11  6:04 ` [30/41] Add VM_BUG_ONs to check for correct page order Christoph Lameter
2007-09-11  6:04 ` [31/41] Large Blocksize: Core piece Christoph Lameter
2007-09-11  6:04 ` [32/41] Readahead changes to support large blocksize Christoph Lameter
2007-09-11  6:04 ` [33/41] Large blocksize support in ramfs Christoph Lameter
2007-09-11  6:04 ` [34/41] Large blocksize support for XFS Christoph Lameter
2007-09-11  6:04 ` [35/41] Reiserfs: Fix up mapping_set_gfp_mask() Christoph Lameter
2007-09-11  6:04 ` [36/41] 64k block size support for Ext2/3/4 Christoph Lameter
2007-09-11  6:04 ` [37/41] ext2: fix rec_len overflow for 64KB block size Christoph Lameter
2007-09-11  6:04 ` [38/41] ext3: fix rec_len overflow with " Christoph Lameter
2007-09-11  6:04 ` [39/41] ext4: fix rec_len overflow for " Christoph Lameter
2007-09-11  6:04 ` [40/41] Do not use f_mapping in simple_prepare_write() Christoph Lameter
2007-09-11  6:04 ` [41/41] Mmap support using pte PAGE_SIZE mappings Christoph Lameter

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1189535461.32731.75.camel@localhost \
    --to=mel@csn.ul.ie \
    --cc=andrea@suse.de \
    --cc=clameter@sgi.com \
    --cc=dgc@sgi.com \
    --cc=fengguang.wu@gmail.com \
    --cc=hch@lst.de \
    --cc=hugh@veritas.com \
    --cc=jens.axboe@oracle.com \
    --cc=joern@lazybastard.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=maximlevitsky@gmail.com \
    --cc=mel@skynet.ie \
    --cc=nickpiggin@yahoo.com.au \
    --cc=pbadari@gmail.com \
    --cc=torvalds@linux-foundation.org \
    --cc=totty.lu@gmail.com \
    --cc=wangswin@gmail.com \
    --cc=wli@holomorphy.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).