From: Goswin von Brederlow <brederlo@informatik.uni-tuebingen.de>
To: mel@skynet.ie (Mel Gorman)
Cc: Goswin von Brederlow <brederlo@informatik.uni-tuebingen.de>,
Nick Piggin <nickpiggin@yahoo.com.au>,
Christoph Lameter <clameter@sgi.com>,
andrea@suse.de, torvalds@linux-foundation.org,
linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
Christoph Hellwig <hch@lst.de>,
William Lee Irwin III <wli@holomorphy.com>,
David Chinner <dgc@sgi.com>, Jens Axboe <jens.axboe@oracle.com>,
Badari Pulavarty <pbadari@gmail.com>,
Maxim Levitsky <maximlevitsky@gmail.com>,
Fengguang Wu <fengguang.wu@gmail.com>,
swin wang <wangswin@gmail.com>,
totty.lu@gmail.com, hugh@veritas.com, joern@lazybastard.org
Subject: Re: [00/41] Large Blocksize Support V7 (adds memmap support)
Date: Mon, 17 Sep 2007 00:38:23 +0200 [thread overview]
Message-ID: <87zlzm44o0.fsf@informatik.uni-tuebingen.de> (raw)
In-Reply-To: <20070916211627.GE16406@skynet.ie> (Mel Gorman's message of "Sun, 16 Sep 2007 22:16:27 +0100")
mel@skynet.ie (Mel Gorman) writes:
> On (15/09/07 02:31), Goswin von Brederlow didst pronounce:
>> Mel Gorman <mel@csn.ul.ie> writes:
>>
>> > On Fri, 2007-09-14 at 18:10 +0200, Goswin von Brederlow wrote:
>> >> Nick Piggin <nickpiggin@yahoo.com.au> writes:
>> >>
>> >> > In my attack, I cause the kernel to allocate lots of unmovable allocations
>> >> > and deplete movable groups. I theoretically then only need to keep a
>> >> > small number (1/2^N) of these allocations around in order to DoS a
>> >> > page allocation of order N.
>> >>
>> >> I'm assuming that when an unmovable allocation hijacks a movable group
>> >> any further unmovable alloc will evict movable objects out of that
>> >> group before hijacking another one. right?
>> >>
>> >
>> > No eviction takes place. If an unmovable allocation gets placed in a
>> > movable group, then steps are taken to ensure that future unmovable
>> > allocations will take place in the same range (these decisions take
>> > place in __rmqueue_fallback()). When choosing a movable block to
>> > pollute, it will also choose the lowest possible block in PFN terms to
>> > steal so that fragmentation pollution will be as confined as possible.
>> > Evicting the unmovable pages would be one of those expensive steps that
>> > have been avoided to date.
>>
>> But then you can have all blocks filled with movable data, free 4K in
>> one group, allocate 4K unmovable to take over the group, free 4k in
>> the next group, take that group and so on. You can end with 4k
>> unmovable in every 64k easily by accident.
>>
>
> As the mixing takes place at the lowest possible block, it's
> exceptionally difficult to trigger this. Possible, but exceptionally
> difficult.
Why is it difficult?
When user space allocates memory wouldn't it get it contiously? I mean
that is one of the goals, to use larger continious allocations and map
them with a single page table entry where possible, right? And then
you can roughly predict where an munmap() would free a page.
Say the application does map a few GB of file, uses madvice to tell
the kernel it needs a 2MB block (to get a continious 2MB chunk
mapped), waits for it and then munmaps 4K in there. A 4k hole for some
unmovable object to fill. If you can then trigger the creation of an
unmovable object as well (stat some file?) and loop you will fill the
ram quickly. Maybe it only works in 10% but then you just do it 10
times as often.
Over long times it could occur naturally. This is just to demonstrate
it with malice.
> As I have stated repeatedly, the guarantees can be made but potential
> hugepage allocation did not justify it. Large blocks might.
>
>> There should be a lot of preassure for movable objects to vacate a
>> mixed group or you do get fragmentation catastrophs.
>
> We (Andy Whitcroft and I) did implement something like that. It hooked into
> kswapd to clean mixed blocks. If the caller could do the cleaning, it
> did the work instead of kswapd.
Do you have a graphic like
http://www.skynet.ie/~mel/anti-frag/2007-02-28/page_type_distribution.jpg
for that case?
>> Looking at my
>> little test program evicting movable objects from a mixed group should
>> not be that expensive as it doesn't happen often.
>
> It happens regularly if the size of the block you need to keep clean is
> lower than min_free_kbytes. In the case of hugepages, that was always
> the case.
That assumes that the number of groups allocated for unmovable objects
will continiously grow and shrink. I'm assuming it will level off at
some size for long times (hours) under normal operations. There should
be some buffering of a few groups to be held back in reserve when it
shrinks to prevent the scenario that the size is just at a group
boundary and always grows/shrinks by 1 group.
>> The cost of it
>> should be freeing some pages (or finding free ones in a movable group)
>> and then memcpy.
>
> Freeing pages is not cheap. Copying pages is cheaper but not cheap.
To copy you need a free page as destination. Thats all I
ment. Hopefully there will always be a free one and the actual freeing
is done asynchronously from the copying.
>> So if
>> you evict movable objects from mixed group when needed all the
>> pagetable pages would end up in the same mixed group slowly taking it
>> over completly. No fragmentation at all. See how essential that
>> feature is. :)
>>
>
> To move pages, there must be enough blocks free. That is where
> min_free_kbytes had to come in. If you cared only about keeping 64KB
> chunks free, it makes sense but it didn't in the context of hugepages.
I'm more concerned with keeping the little unmovable things out of the
way. Those are the things that will fragment the memory and prevent
any huge pages to be available even with moving other stuff out of the
way.
It would also already be a big plus to have 64k continious chunks for
many operations. Guaranty that the filesystem and block layers can
always get such a page (by means of copying pages out of the way when
needed) and do even larger pages speculative.
But as you say that is where min_free_kbytes comes in. To have the
chance of a 2MB continious free chunk you need to have much more
reserved free. Something I would certainly do on a 16GB ram
server. Note that the buddy system will be better at having large free
chunks if user space allocates and frees large chunks as well. It is
much easier to make an 128K chunk out of 64K chunks than out of 4K
chunks. Much more probable to get 2 64k chunks adjacent than 32 4K
chunks in series.
>> > These type of pictures feel somewhat familiar
>> > (http://www.skynet.ie/~mel/anti-frag/2007-02-28/page_type_distribution.jpg).
>>
>> Those look a lot better and look like they are actually real kernel
>> data. How did you make them and can one create them in real-time (say
>> once a second or so)?
>>
>
> It's from a real kernel. When I was measuring this stuff, I took a
> sample every 2 seconds.
Can you tell me how? I would like to do the same.
>> There seem to be an awfull lot of pinned pages inbetween the movable.
>
> It wasn't grouping by mobility at the time.
That might explain at lot. I was thinking that was with grouping. But
without grouping such a rather random distribution of unmovable
objects doesn't sound uncommon,.
>> I would verry much like to see the same data with evicting of movable
>> pages out of mixed groups. I see not a single movable group while with
>> strict eviction there could be at most one mixed group per order.
>>
>
> With strict eviction, there would be no mixed blocks period.
I ment on demand eviction. Not evicting the whole group just because
we need 4k of it. Even looser would be to allow say 16 mixed groups
before moving pages out of the way when needed for an alloc. Best to
make it a proc entry and then change the value once a day and see how
it behaves. :)
MfG
Goswin
next prev parent reply other threads:[~2007-09-16 22:38 UTC|newest]
Thread overview: 187+ messages / expand[flat|nested] mbox.gz Atom feed top
2007-09-11 6:03 [00/41] Large Blocksize Support V7 (adds memmap support) Christoph Lameter
2007-09-10 18:52 ` Nick Piggin
2007-09-11 12:05 ` Andrea Arcangeli
2007-09-11 20:03 ` Christoph Lameter
2007-09-11 12:12 ` Jörn Engel
2007-09-10 21:13 ` Nick Piggin
2007-09-11 16:02 ` Goswin von Brederlow
2007-09-11 20:07 ` Christoph Lameter
2007-09-11 20:29 ` Jörn Engel
2007-09-11 20:41 ` Christoph Lameter
2007-09-11 23:26 ` Andrea Arcangeli
2007-09-12 0:04 ` Christoph Lameter
2007-09-12 8:20 ` Andrea Arcangeli
2007-09-15 8:44 ` Andrew Morton
2007-09-15 12:14 ` Goswin von Brederlow
2007-09-15 15:51 ` Andrea Arcangeli
2007-09-15 20:14 ` Goswin von Brederlow
2007-09-15 22:30 ` Andrea Arcangeli
2007-09-16 13:54 ` Goswin von Brederlow
2007-09-16 15:08 ` Andrea Arcangeli
2007-09-16 21:08 ` Mel Gorman
2007-09-16 22:48 ` Goswin von Brederlow
2007-09-17 9:30 ` Mel Gorman
2007-09-16 17:46 ` Jörn Engel
2007-09-16 18:15 ` Linus Torvalds
2007-09-16 18:21 ` Jörn Engel
2007-09-16 18:44 ` Linus Torvalds
2007-09-16 22:51 ` Goswin von Brederlow
2007-09-23 17:44 ` Jörn Engel
2007-09-16 22:06 ` Goswin von Brederlow
2007-09-16 22:40 ` Jörn Engel
2007-09-16 18:15 ` Mel Gorman
2007-09-16 18:50 ` Andrea Arcangeli
2007-09-16 20:54 ` Mel Gorman
2007-09-16 21:31 ` Andrea Arcangeli
2007-09-17 10:13 ` Mel Gorman
2007-09-23 5:50 ` Goswin von Brederlow
2007-09-16 22:56 ` Goswin von Brederlow
2007-09-18 19:31 ` Andrea Arcangeli
2007-09-23 6:56 ` Goswin von Brederlow
2007-09-24 15:39 ` Andrea Arcangeli
2007-09-16 18:13 ` Mel Gorman
2007-09-16 9:03 ` Nick Piggin
2007-09-17 22:00 ` Christoph Lameter
2007-09-18 0:11 ` Nick Piggin
2007-09-18 20:36 ` Christoph Lameter
2007-09-18 10:00 ` Mel Gorman
2007-09-18 10:49 ` Jörn Engel
2007-09-18 12:31 ` David Chinner
2007-09-16 21:58 ` Goswin von Brederlow
2007-09-17 10:03 ` Mel Gorman
2007-09-23 6:22 ` Goswin von Brederlow
2007-09-24 12:32 ` Kyle Moffett
2007-09-16 17:53 ` Jörn Engel
2007-09-16 21:31 ` Mel Gorman
2007-09-17 22:03 ` Christoph Lameter
2007-09-11 15:36 ` Mel Gorman
2007-09-11 1:44 ` Nick Piggin
2007-09-11 20:11 ` Christoph Lameter
2007-09-11 4:53 ` Nick Piggin
2007-09-11 20:42 ` Christoph Lameter
2007-09-11 5:30 ` Nick Piggin
2007-09-11 21:41 ` Christoph Lameter
2007-09-11 6:06 ` Nick Piggin
2007-09-11 21:52 ` Christoph Lameter
2007-09-11 18:07 ` Nick Piggin
2007-09-12 23:06 ` Christoph Lameter
2007-09-13 20:51 ` Nick Piggin
2007-09-14 17:52 ` Christoph Lameter
2007-09-16 8:22 ` Nick Piggin
2007-09-17 22:05 ` Christoph Lameter
2007-09-18 0:10 ` Nick Piggin
2007-09-18 20:42 ` Christoph Lameter
2007-09-17 11:10 ` Bernd Schmidt
2007-09-17 22:10 ` Christoph Lameter
2007-09-14 16:10 ` Goswin von Brederlow
2007-09-14 17:42 ` Mel Gorman
2007-09-15 0:31 ` Goswin von Brederlow
2007-09-16 21:16 ` Mel Gorman
2007-09-16 22:38 ` Goswin von Brederlow [this message]
2007-09-17 8:57 ` Mel Gorman
2007-09-23 6:49 ` Goswin von Brederlow
2007-09-11 20:53 ` Mel Gorman
2007-09-11 6:00 ` Nick Piggin
2007-09-11 21:48 ` Christoph Lameter
2007-09-11 6:17 ` Nick Piggin
2007-09-12 0:00 ` Christoph Lameter
2007-09-12 2:46 ` Nick Piggin
2007-09-12 23:17 ` Christoph Lameter
2007-09-13 9:40 ` Mel Gorman
2007-09-14 2:38 ` Christoph Lameter
2007-09-13 21:20 ` Nick Piggin
2007-09-14 18:08 ` Christoph Lameter
2007-09-14 18:15 ` Christoph Lameter
2007-09-15 0:33 ` Goswin von Brederlow
2007-09-16 8:53 ` Nick Piggin
2007-09-17 22:21 ` Christoph Lameter
2007-09-18 1:16 ` Nick Piggin
2007-09-18 18:30 ` Linus Torvalds
2007-09-18 17:53 ` Nick Piggin
2007-09-18 19:18 ` Andrea Arcangeli
2007-09-18 19:44 ` Linus Torvalds
2007-09-19 0:58 ` Nathan Scott
2007-09-19 1:06 ` Linus Torvalds
2007-09-19 2:45 ` Nathan Scott
2007-09-19 5:09 ` David Chinner
2007-09-19 9:41 ` Alex Tomas
2007-09-19 14:04 ` Andrea Arcangeli
2007-09-20 1:38 ` David Chinner
2007-09-20 14:54 ` Andrea Arcangeli
2007-09-20 18:11 ` Christoph Lameter
2007-09-20 18:07 ` Christoph Lameter
2007-09-21 20:41 ` Hugh Dickins
2007-09-24 21:13 ` Christoph Lameter
2007-09-28 2:46 ` Nick Piggin
2007-09-19 3:41 ` Rene Herman
2007-09-19 3:50 ` Linus Torvalds
2007-09-19 4:26 ` Rene Herman
2007-09-19 4:33 ` Linus Torvalds
2007-09-19 4:56 ` Rene Herman
2007-09-11 21:54 ` Mel Gorman
2007-09-12 14:29 ` Martin J. Bligh
2007-09-12 1:49 ` David Chinner
2007-09-11 15:27 ` Nick Piggin
2007-09-13 1:49 ` David Chinner
2007-09-12 17:23 ` Nick Piggin
2007-09-13 13:03 ` David Chinner
2007-09-13 2:01 ` Nick Piggin
2007-09-13 20:48 ` Nick Piggin
2007-09-17 4:07 ` David Chinner
2007-09-16 21:13 ` Nick Piggin
2007-09-12 2:01 ` Nick Piggin
2007-09-11 21:35 ` Christoph Lameter
2007-09-11 16:47 ` Andrea Arcangeli
2007-09-11 18:31 ` Mel Gorman
2007-09-11 2:26 ` Nick Piggin
2007-09-11 18:25 ` Maxim Levitsky
2007-09-11 3:05 ` Nick Piggin
2007-09-11 21:03 ` Mel Gorman
2007-09-11 19:20 ` Andrea Arcangeli
2007-09-11 20:19 ` Jörn Engel
2007-09-11 20:13 ` Christoph Lameter
2007-09-11 20:01 ` Christoph Lameter
2007-09-11 4:43 ` Nick Piggin
2007-09-11 5:17 ` Nick Piggin
2007-09-11 21:27 ` Mel Gorman
2007-09-11 6:03 ` [01/41] Pagecache zeroing: zero_user_segment, zero_user_segments and zero_user Christoph Lameter
2007-09-11 6:03 ` [02/41] Define functions for page cache handling Christoph Lameter
2007-09-11 6:03 ` [03/41] Use page_cache_xxx functions in mm/filemap.c Christoph Lameter
2007-09-11 6:03 ` [04/41] Use page_cache_xxx in mm/page-writeback.c Christoph Lameter
2007-09-11 6:03 ` [05/41] Use page_cache_xxx in mm/truncate.c Christoph Lameter
2007-09-11 6:03 ` [06/41] Use page_cache_xxx in mm/rmap.c Christoph Lameter
2007-09-11 6:03 ` [07/41] Use page_cache_xxx in mm/filemap_xip.c Christoph Lameter
2007-09-11 6:03 ` [08/41] Use page_cache_xxx in mm/migrate.c Christoph Lameter
2007-09-11 6:03 ` [09/41] Use page_cache_xxx in fs/libfs.c Christoph Lameter
2007-09-11 6:04 ` [10/41] Use page_cache_xxx in fs/sync Christoph Lameter
2007-09-11 6:04 ` [11/41] Use page_cache_xxx in fs/buffer.c Christoph Lameter
2007-09-11 6:04 ` [12/41] Use page_cache_xxx in mm/mpage.c Christoph Lameter
2007-09-11 6:04 ` [13/41] Use page_cache_xxx in mm/fadvise.c Christoph Lameter
2007-09-11 6:04 ` [14/41] Use page_cache_xxx in fs/splice.c Christoph Lameter
2007-09-11 6:04 ` [15/41] Use page_cache_xxx in ext2 Christoph Lameter
2007-09-11 6:04 ` [16/41] Use page_cache_xxx in fs/ext3 Christoph Lameter
2007-09-11 6:04 ` [17/41] Use page_cache_xxx in fs/ext4 Christoph Lameter
2007-09-11 6:04 ` [18/41] Use page_cache_xxx in fs/reiserfs Christoph Lameter
2007-09-11 6:04 ` [19/41] Use page_cache_xxx for fs/xfs Christoph Lameter
2007-09-11 6:04 ` [20/41] Use page_cache_xxx in drivers/block/rd.c Christoph Lameter
2007-09-11 6:04 ` [21/41] compound pages: Better PageHead/PageTail handling Christoph Lameter
2007-09-11 6:04 ` [22/41] compound pages: Add new support functions Christoph Lameter
2007-09-11 6:04 ` [23/41] compound pages: vmstat support Christoph Lameter
2007-09-11 6:04 ` [24/41] compound pages: Use new compound vmstat functions in SLUB Christoph Lameter
2007-09-11 6:04 ` [25/41] compound pages: Allow use of get_page_unless_zero with compound pages Christoph Lameter
2007-09-11 6:04 ` [26/41] compound pages: Allow freeing of compound pages via pagevec Christoph Lameter
2007-09-11 6:04 ` [27/41] Large page order operations, zeroing and flushing Christoph Lameter
2007-09-11 6:04 ` [28/41] Futex: Fix PAGE SIZE assumption Christoph Lameter
2007-09-11 6:04 ` [29/41] Fix up reclaim counters Christoph Lameter
2007-09-11 6:04 ` [30/41] Add VM_BUG_ONs to check for correct page order Christoph Lameter
2007-09-11 6:04 ` [31/41] Large Blocksize: Core piece Christoph Lameter
2007-09-11 6:04 ` [32/41] Readahead changes to support large blocksize Christoph Lameter
2007-09-11 6:04 ` [33/41] Large blocksize support in ramfs Christoph Lameter
2007-09-11 6:04 ` [34/41] Large blocksize support for XFS Christoph Lameter
2007-09-11 6:04 ` [35/41] Reiserfs: Fix up mapping_set_gfp_mask() Christoph Lameter
2007-09-11 6:04 ` [36/41] 64k block size support for Ext2/3/4 Christoph Lameter
2007-09-11 6:04 ` [37/41] ext2: fix rec_len overflow for 64KB block size Christoph Lameter
2007-09-11 6:04 ` [38/41] ext3: fix rec_len overflow with " Christoph Lameter
2007-09-11 6:04 ` [39/41] ext4: fix rec_len overflow for " Christoph Lameter
2007-09-11 6:04 ` [40/41] Do not use f_mapping in simple_prepare_write() Christoph Lameter
2007-09-11 6:04 ` [41/41] Mmap support using pte PAGE_SIZE mappings Christoph Lameter
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87zlzm44o0.fsf@informatik.uni-tuebingen.de \
--to=brederlo@informatik.uni-tuebingen.de \
--cc=andrea@suse.de \
--cc=clameter@sgi.com \
--cc=dgc@sgi.com \
--cc=fengguang.wu@gmail.com \
--cc=hch@lst.de \
--cc=hugh@veritas.com \
--cc=jens.axboe@oracle.com \
--cc=joern@lazybastard.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=maximlevitsky@gmail.com \
--cc=mel@skynet.ie \
--cc=nickpiggin@yahoo.com.au \
--cc=pbadari@gmail.com \
--cc=torvalds@linux-foundation.org \
--cc=totty.lu@gmail.com \
--cc=wangswin@gmail.com \
--cc=wli@holomorphy.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).