From: Nick Piggin <nickpiggin@yahoo.com.au>
To: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <clameter@sgi.com>,
	andrea@suse.de, torvalds@linux-foundation.org,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	Christoph Hellwig <hch@lst.de>, Mel Gorman <mel@skynet.ie>,
	William Lee Irwin III <wli@holomorphy.com>,
	David Chinner <dgc@sgi.com>, Jens Axboe <jens.axboe@oracle.com>,
	Badari Pulavarty <pbadari@gmail.com>,
	Maxim Levitsky <maximlevitsky@gmail.com>,
	Fengguang Wu <fengguang.wu@gmail.com>,
	swin wang <wangswin@gmail.com>,
	totty.lu@gmail.com, hugh@veritas.com, joern@lazybastard.org
Subject: Re: [00/41] Large Blocksize Support V7 (adds memmap support)
Date: Tue, 11 Sep 2007 11:44:47 +1000
Message-ID: <200709111144.48743.nickpiggin@yahoo.com.au>
In-Reply-To: <1189524967.32731.58.camel@localhost>

On Wednesday 12 September 2007 01:36, Mel Gorman wrote:
> On Tue, 2007-09-11 at 04:52 +1000, Nick Piggin wrote:
> > On Tuesday 11 September 2007 16:03, Christoph Lameter wrote:
> > > 5. VM scalability
> > >    Large block sizes mean less state keeping for the information being
> > >    transferred. For a 1TB file one needs to handle 256 million page
> > >    structs in the VM if one uses 4k page size. A 64k page size reduces
> > >    that amount to 16 million. If the limitations in existing filesystems
> > >    are removed then even higher reductions become possible. For very
> > >    large files like that a page size of 2 MB may be beneficial which
> > >    will reduce the number of page structs to handle to 512k. The
> > >    variable nature of the block size means that the size can be tuned at
> > >    file system creation time for the anticipated needs on a volume.
> >
> > There is a limitation in the VM. Fragmentation. You keep saying this
> > is a solved issue and just assuming you'll be able to fix any cases
> > that come up as they happen.
> >
> > I still don't get the feeling you realise that there is a fundamental
> > fragmentation issue that is unsolvable with Mel's approach.
>
> I thought we had discussed this already at VM and reached something
> resembling a conclusion. It was acknowledged that depending on
> contiguous allocations to always succeed will get a caller into trouble
> and they need to deal with fallback - whether the problem was
> theoretical or not. It was also strongly pointed out that the large
> block patches as presented would be vulnerable to that problem.

Well Christoph seems to still be spinning them as a solution for VM
scalability and first class support for making contiguous IOs, large
filesystem block sizes etc.
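
For reference, the page-struct counts quoted above for a 1TB file are
just the file size divided by the page size. A minimal sketch of that
arithmetic, assuming binary units (1TB = 2^40 bytes); the helper name
page_structs is made up for illustration and is not a kernel interface:

```python
TIB = 1024 ** 4  # assumed: "1TB" in the quoted text means 2**40 bytes


def page_structs(file_size, page_size):
    """Number of page structs needed to cover file_size bytes of pagecache."""
    return file_size // page_size


for label, page_size in (("4k", 4 * 1024), ("64k", 64 * 1024), ("2MB", 2 * 1024 ** 2)):
    # 4k -> 268435456, 64k -> 16777216, 2MB -> 524288
    print(label, page_structs(TIB, page_size))
```

These are the "256 million", "16 million" and "512k" figures, read as
multiples of 2^20 and 2^10 rather than exact decimal counts.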

At the VM summit I think the conclusion was that grouping by
mobility could be merged. I'm still not thrilled by that, but I was
going to get steamrolled[*] anyway... and seeing as the userspace
hugepages is a relatively demanded workload and can be
implemented in this way with basically no other changes to the
kernel and already must have fallbacks.... then that's actually a
reasonable case for it.

The higher order pagecache, again I'm just going to get steamrolled
on, and it actually isn't so intrusive minus the mmap changes, so I
didn't have much to reasonably say there.

And I would have kept quiet this time too, were it not for the worrying
idea of using higher order pages to fix the SLUB vs SLAB regression, and
had the rationale for this patchset been more realistic.

[*] And I don't say steamrolled because I'm bitter and twisted :) I
personally want the kernel to be perfect. But I realise it already isn't
and for practical purposes people want these things, so I accept
being overruled, no problem. The fact simply is -- I would have been
steamrolled I think :P

> The alternatives were fs-block and increasing the size of order-0. It
> was felt that fs-block was far away because it's complex and I thought
> that increasing the pagesize like what Andrea suggested would lead to
> internal fragmentation problems. Regrettably we didn't discuss Andrea's
> approach in depth.

Sure. And some people run workloads where fragmentation is likely never
going to be a problem, they are shipping this poorly configured hardware
now or soon, so they don't have too much interest in doing it right at this
point, rather than doing it *now*. OK, that's a valid reason which is why I
don't use the argument that we should do it correctly or never at all.


> I *thought* that the end conclusion was that we would go with
> Christoph's approach pending two things being resolved;
>
> o mmap() support that we agreed on is good

In theory (and again for the filesystem guys who don't have to worry about
it). In practice after seeing the patch it's not a nice thing for the VM to
have to do.


> I also thought there was an acknowledgement that long-term, fs-block was
> the way to go - possibly using contiguous pages optimistically instead
> of virtual mapping the pages. At that point, it would be a general
> solution and we could remove the warnings.

I guess it is still in the air. I personally think a vmapping approach and/or
teaching filesystems to do some nonlinear block metadata access is the
way to go (strangely, this happens to be one of the fsblock paradigms!).
OTOH, I'm not sure how much buy-in there was from the filesystems guys.
Particularly Christoph H and XFS (which is strange because they already do
vmapping in places).

That's understandable though. It is a lot of work for filesystems. But the
reason I think it is the correct approach for larger block than soft-page
size is that it doesn't have fundamental issues (assuming that virtually
mapping the entire kernel is off the table).
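
The "contiguous pages optimistically, otherwise fall back" idea in the
quoted paragraph can be illustrated with a toy userspace model. This is
only a sketch under stated assumptions: memory is a list of booleans, one
per base page, and alloc_block is a made-up name. It is not how fsblock
or the kernel page allocator is implemented.

```python
def alloc_block(free, order):
    """Allocate 2**order base pages from `free` (True means the page is free).

    Returns ("contiguous", pages) if a naturally aligned free run exists,
    ("scattered", pages) if only non-contiguous pages are available,
    or None if there are not enough free pages at all.
    """
    n = 1 << order
    # Optimistic path: look for an aligned, completely free run of n pages.
    for start in range(0, len(free) - n + 1, n):
        if all(free[start:start + n]):
            pages = list(range(start, start + n))
            for p in pages:
                free[p] = False
            return ("contiguous", pages)
    # Fallback path: take any n free pages. A real implementation would have
    # to map these virtually, or access them non-linearly, to use them as
    # one block.
    pages = [i for i, is_free in enumerate(free) if is_free][:n]
    if len(pages) < n:
        return None
    for p in pages:
        free[p] = False
    return ("scattered", pages)


# Eight base pages with pages 1 and 5 pinned: no aligned 4-page run is free,
# yet six pages are, so the request succeeds only through the fallback.
memory = [True] * 8
memory[1] = memory[5] = False
print(alloc_block(memory, 2))
```

Without the fallback branch, the request in this example would fail with
three quarters of the toy memory still free.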


> Basically, to start out with, this was going to be an SGI-only thing so
> they get to rattle out the issues we expect to encounter with large
> blocks and help steer the direction of the
> more-complex-but-safer-overall fs-block.

That's what I expected, but it seems from the descriptions in the patches
that it is also supposed to cure cancer :)


> > The idea that there even _is_ a bug to fail when higher order pages
> > cannot be allocated was also brushed aside by some people at the
> > vm/fs summit.
>
> When that brushing occurred, I thought I made it very clear what the
> expectations were and that without fallback they would be taking a risk.
> I am not sure if that message actually sank in or not.

No, you have been good about that aspect. I wasn't trying to point to you
at all here.


> >  I don't know if those people had gone through the
> > math about this, but it goes somewhat like this: if you use a 64K
> > page size, you can "run out of memory" with 93% of your pages free.
> > If you use a 2MB page size, you can fail with 99.8% of your pages
> > still free. That's 64GB of memory used on a 32TB Altix.
>
> That's the absolute worst case but yes, in theory this can occur and
> it's safest to assume the situation will occur somewhere to someone. It
> would be difficult to craft an attack to do it but conceivably a machine
> running for a long enough time would trigger it particularly if the
> large block allocations are GFP_NOIO or GFP_NOFS.

It would be interesting to craft an attack. If you knew roughly the layout
and size of your dentry slab for example... maybe you could stat a whole
lot of files, then open one and keep it open (maybe post the fd to a unix
socket or something crazy!) when you think you have filled up a couple
of MB worth of them. Repeat the process until your movable zone is
gone. Or do the same things with pagetables, or task structs, or radix
tree nodes, etc.. these are the kinds of things I worry about (as well as
just the gradual natural degradation).

Yeah, it might be reasonably possible to make an attack that would
deplete most of higher order allocations while pinning somewhat close
to just the theoretical minimum required.
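
The worst-case figures quoted earlier (93%, 99.8%, and 64GB on a 32TB
machine) follow from pinning exactly one base page in every large block.
A minimal sketch of that arithmetic, assuming a 4k base page and binary
units; the function names are made up for illustration:

```python
BASE_PAGE = 4 * 1024  # assumed 4k base page size


def worst_case_free_fraction(block_size, base_page=BASE_PAGE):
    """Fraction of memory still free when every block holds one pinned base page."""
    return 1 - base_page / block_size


def pinned_bytes(total_memory, block_size, base_page=BASE_PAGE):
    """Memory that must be pinned to spoil every block: one base page per block."""
    return (total_memory // block_size) * base_page


print(worst_case_free_fraction(64 * 1024))            # 0.9375
print(worst_case_free_fraction(2 * 1024 ** 2))        # 0.998046875
print(pinned_bytes(32 * 1024 ** 4, 2 * 1024 ** 2))    # 68719476736, i.e. 64GB
```

This is the theoretical minimum referred to above: 1/512 of memory is
enough to leave no free 2MB block anywhere.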

[snip]

Thanks Mel. Fairly good summary I think.


> > Basically, if you're placing your hopes for VM and IO scalability on
> > this, then I think that's a totally broken thing to do and will end up
> > making the kernel worse in the years to come (except maybe on some poor
> > configurations of bad hardware).
>
> My magic 8-ball is in the garage.
>
> I thought the following plan was sane but I could be la-la
>
> 1. Go with large block + explosions to start with
>    - Second class feature at this point, not fully supported
>    - Experiment in different places to see what it gains (if anything)
> 2. Get fs-block in slowly over time with the fallback options replacing
>    Christoph's patches bit by bit
> 3. Kick away warnings
>    - First class feature at this point, fully supported

I guess that was my hope. The only *practical* technical problem I have
with a 2nd class higher order pagecache is that it introduces
more complexity in the VM for mmap. Andrea and Hugh are probably
more guardians of that area of code than I, so if they're happy with the
mmap stuff then again I can accept being overruled on this ;)

Then I would love to say #2 will go ahead (and I hope it would), but I
can't force it down the throat of the filesystem maintainers just like I
feel they can't force vm devs (me) to do a virtually mapped and
defrag-able kernel :) Basically I'm trying to practice what I preach and
I don't want to force fsblock onto anyone.

Maybe when ext2 is converted and if I can show it isn't a performance
problem / too much complexity then I'll have another leg to stand on
here... I don't know.

> Independently of that, we would work on order-0 scalability,
> particularly readahead and batching operations on ranges of pages as
> much as possible.

Definitely. Also, aops capable of spanning multiple pages, batching of
large write(2) pagecache insertion, etc. are all things we must go after,
regardless of the large page and/or block size work.
