From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
To: Nick Piggin <npiggin@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>,
Christoph Lameter <clameter@engr.sgi.com>,
Linux Memory Management List <linux-mm@kvack.org>,
Andrew Morton <akpm@linux-foundation.org>,
Christoph Hellwig <hch@infradead.org>
Subject: Re: [rfc][patch 2/2] mm: mlocked pages off LRU
Date: Tue, 06 Mar 2007 17:23:55 -0500 [thread overview]
Message-ID: <1173219835.20580.15.camel@localhost> (raw)
In-Reply-To: <20070306143045.GA28629@wotan.suse.de>
On Tue, 2007-03-06 at 15:30 +0100, Nick Piggin wrote:
> New core patch. This one is actually tested and works, and you can see
> the mlocked pages being accounted.
>
> Same basic idea. Too many fixes and changes to list. Haven't taken up
> Christoph's idea to do a union in struct page, but it could be a followup.
>
> Most importantly (aside from crashes and obvious bugs), it should correctly
> synchronise munlock vs vmscan lazy mlock now. Before this, it was possible
> to have pages leak. This took me a bit of thinking to get right, but was
> rather simple in the end.
>
> Memory migration should work now, too, but not tested.
>
> What do people think? Yes? No?
Nick: I've grabbed your 2 patches in this series and rebased them to
21-rc2-mm2 so I can test them and compare with Christoph's [which I've
also rebased to -mm2]. I had to fix up the ia32_setup_arg_pages() for
ia64 to track the change you made to install_new_arg_page. Patch
included below. Some comments in-line below, as well.
Now builds, boots, and successfully builds a kernel with Christoph's
series. Some basic testing with memtoy [see link below] shows pages
being locked according to the /proc/meminfo stats, but the counts don't
decrease when I unmap the segment nor when I exit the task. I'll
investigate why and let you know how further testing goes. After that,
I plan to merge both series with my page migration series and your page
cache replication patch to test the effects there. Should be
"interesting".
If you're interested, I have a little tool/toy for testing mm stuff at:
http://free.linux.hp.com/~lts/Tools/memtoy-latest.tar.gz
I recently added a lock()/unlock() command for testing locking of memory
regions. It could use more [a lot more] documentation, but there it
does have a README and internal help.
Lee
>
> --
>
> Index: linux-2.6/mm/mlock.c
> ===================================================================
> --- linux-2.6.orig/mm/mlock.c
> +++ linux-2.6/mm/mlock.c
> @@ -8,17 +8,204 @@
> #include <linux/capability.h>
> #include <linux/mman.h>
> #include <linux/mm.h>
> +#include <linux/swap.h>
> +#include <linux/pagemap.h>
> #include <linux/mempolicy.h>
> #include <linux/syscalls.h>
>
> +#include "internal.h"
> +
> +#define page_mlock_count(page) (*(unsigned long *)&(page)->lru.next)
> +#define set_page_mlock_count(page, v) (page_mlock_count(page) = (v))
> +#define inc_page_mlock_count(page) (page_mlock_count(page)++)
> +#define dec_page_mlock_count(page) (page_mlock_count(page)--)
> +
> +/*
> + * A page's mlock_count is kept in page->lru.next as an unsigned long.
> + * Access to this count is serialised with the page lock (or, in the
> + * case of mlock_page, virtue that there are no other references to
> + * the page).
> + *
> + * mlock counts are incremented at mlock, mmap, mremap, and new anon page
> + * faults, and lazily via vmscan. Decremented at munlock, munmap, and exit.
> + * mlock is not inherited across fork or exec, so we're safe there.
> + *
> + * If PageMLock is set, then the page is removed from the LRU list, and
> + * has its refcount incremented. This increment prevents the page from being
> + * freed until the mlock_count is decremented to zero and PageMLock is cleared.
> + *
> + * When lazy incrementing via vmscan, it is important to ensure that the
> + * vma's VM_LOCKED status is not concurrently being modified, otherwise we
> + * may have elevated mlock_count of a page that is being munlocked. So lazy
> + * mlocked must take the mmap_sem for read, and verify that the vma really
> + * is locked (see mm/rmap.c).
> + */
> +
> +/*
> + * Marks a page, belonging to the given mlocked vma, as mlocked.
> + *
> + * The page must be either locked or new, and must not be on the LRU.
> + */
> +static void __set_page_mlock(struct page *page)
> +{
> + BUG_ON(PageLRU(page));
> + BUG_ON(PageMLock(page));
> + /* BUG_ON(!list_empty(&page->lru)); -- if we always did list_del_init */
> +
> + SetPageMLock(page);
> + get_page(page);
> + inc_zone_page_state(page, NR_MLOCK);
> + set_page_mlock_count(page, 1);
> +}
> +
> +static void __clear_page_mlock(struct page *page)
> +{
> + BUG_ON(!PageMLock(page));
> + BUG_ON(PageLRU(page));
> + BUG_ON(page_mlock_count(page));
> +
> + dec_zone_page_state(page, NR_MLOCK);
> + ClearPageMLock(page);
> + lru_cache_add_active(page);
> + put_page(page);
> +}
> +
> +/*
> + * Zero the page's mlock_count. This can be useful in a situation where
> + * we want to unconditionally remove a page from the pagecache.
> + *
> + * It is not illegal to call this function for any page, mlocked or not.
Maybe "It is legal ..." ???
> + * If called for a page that is still mapped by mlocked vmas, all we do
> + * is revert to lazy LRU behaviour -- semantics are not broken.
> + */
> +void clear_page_mlock(struct page *page)
> +{
> + BUG_ON(!PageLocked(page));
> +
> + if (likely(!PageMLock(page)))
> + return;
> + BUG_ON(!page_mlock_count(page));
> + set_page_mlock_count(page, 0);
> + __clear_page_mlock(page);
> +}
> +
<snip>
> Index: linux-2.6/include/linux/page-flags.h
> ===================================================================
> --- linux-2.6.orig/include/linux/page-flags.h
> +++ linux-2.6/include/linux/page-flags.h
> @@ -91,6 +91,7 @@
> #define PG_nosave_free 18 /* Used for system suspend/resume */
> #define PG_buddy 19 /* Page is free, on buddy lists */
>
> +#define PG_mlock 20 /* Page has mlocked vmas */
Conflicts with PG_readahead in 21-rc2-mm2. I temporarily used bit
30--valid only for 64-bit systems. [Same in Christoph's series.]
>
> #if (BITS_PER_LONG > 32)
> /*
> @@ -247,6 +248,10 @@ static inline void SetPageUptodate(struc
> #define PageSwapCache(page) 0
> #endif
>
> +#define PageMLock(page) test_bit(PG_mlock, &(page)->flags)
> +#define SetPageMLock(page) set_bit(PG_mlock, &(page)->flags)
> +#define ClearPageMLock(page) clear_bit(PG_mlock, &(page)->flags)
> +
> #define PageUncached(page) test_bit(PG_uncached, &(page)->flags)
> #define SetPageUncached(page) set_bit(PG_uncached, &(page)->flags)
> #define ClearPageUncached(page) clear_bit(PG_uncached, &(page)->flags)
> Index: linux-2.6/mm/page_alloc.c
> ===================================================================
> --- linux-2.6.orig/mm/page_alloc.c
> +++ linux-2.6/mm/page_alloc.c
> @@ -203,7 +203,8 @@ static void bad_page(struct page *page)
> 1 << PG_slab |
> 1 << PG_swapcache |
> 1 << PG_writeback |
> - 1 << PG_buddy );
> + 1 << PG_buddy |
> + 1 << PG_mlock );
> set_page_count(page, 0);
> reset_page_mapcount(page);
> page->mapping = NULL;
> @@ -438,7 +439,8 @@ static inline int free_pages_check(struc
> 1 << PG_swapcache |
> 1 << PG_writeback |
> 1 << PG_reserved |
> - 1 << PG_buddy ))))
> + 1 << PG_buddy |
> + 1 << PG_mlock ))))
> bad_page(page);
> if (PageDirty(page))
> __ClearPageDirty(page);
> @@ -588,7 +590,8 @@ static int prep_new_page(struct page *pa
> 1 << PG_swapcache |
> 1 << PG_writeback |
> 1 << PG_reserved |
> - 1 << PG_buddy ))))
> + 1 << PG_buddy |
> + 1 << PG_mlock ))))
> bad_page(page);
>
> /*
> Index: linux-2.6/fs/exec.c
> ===================================================================
> --- linux-2.6.orig/fs/exec.c
> +++ linux-2.6/fs/exec.c
> @@ -297,44 +297,6 @@ int copy_strings_kernel(int argc,char **
> EXPORT_SYMBOL(copy_strings_kernel);
>
> #ifdef CONFIG_MMU
> -/*
> - * This routine is used to map in a page into an address space: needed by
> - * execve() for the initial stack and environment pages.
> - *
> - * vma->vm_mm->mmap_sem is held for writing.
> - */
> -void install_arg_page(struct vm_area_struct *vma,
> - struct page *page, unsigned long address)
> -{
> - struct mm_struct *mm = vma->vm_mm;
> - pte_t * pte;
> - spinlock_t *ptl;
> -
> - if (unlikely(anon_vma_prepare(vma)))
> - goto out;
> -
> - flush_dcache_page(page);
> - pte = get_locked_pte(mm, address, &ptl);
> - if (!pte)
> - goto out;
> - if (!pte_none(*pte)) {
> - pte_unmap_unlock(pte, ptl);
> - goto out;
> - }
> - inc_mm_counter(mm, anon_rss);
> - lru_cache_add_active(page);
> - set_pte_at(mm, address, pte, pte_mkdirty(pte_mkwrite(mk_pte(
> - page, vma->vm_page_prot))));
> - page_add_new_anon_rmap(page, vma, address);
> - pte_unmap_unlock(pte, ptl);
> -
> - /* no need for flush_tlb */
> - return;
> -out:
> - __free_page(page);
> - force_sig(SIGKILL, current);
> -}
> -
> #define EXTRA_STACK_VM_PAGES 20 /* random */
>
> int setup_arg_pages(struct linux_binprm *bprm,
> @@ -438,17 +400,25 @@ int setup_arg_pages(struct linux_binprm
> mm->stack_vm = mm->total_vm = vma_pages(mpnt);
> }
>
> + ret = 0;
> for (i = 0 ; i < MAX_ARG_PAGES ; i++) {
> struct page *page = bprm->page[i];
> if (page) {
> bprm->page[i] = NULL;
> - install_arg_page(mpnt, page, stack_base);
> + if (!ret)
> + ret = install_new_anon_page(mpnt, page,
> + stack_base);
> + if (ret)
> + put_page(page);
Need similar mod in arch/ia64/ia32/binfmt_elf32.c:ia32_setup_arg_pages()
Patch included below.
> }
> stack_base += PAGE_SIZE;
> }
> up_write(&mm->mmap_sem);
> -
> - return 0;
> +
> + if (ret)
> + do_munmap(mm, mpnt->vm_start, mpnt->vm_start - mpnt->vm_end);
> +
> + return ret;
> }
>
> EXPORT_SYMBOL(setup_arg_pages);
> Index: linux-2.6/mm/migrate.c
> ===================================================================
> --- linux-2.6.orig/mm/migrate.c
> +++ linux-2.6/mm/migrate.c
> @@ -272,6 +272,8 @@ static int migrate_page_move_mapping(str
> return 0;
> }
>
> + clear_page_mlock(page);
> +
> write_lock_irq(&mapping->tree_lock);
>
> pslot = radix_tree_lookup_slot(&mapping->page_tree,
> @@ -775,6 +777,17 @@ static int do_move_pages(struct mm_struc
> !migrate_all)
> goto put_and_set;
>
> + /*
> + * Just do the simple thing and put back mlocked pages onto
> + * the LRU list so they can be taken off again (inefficient
> + * but not a big deal).
> + */
> + if (PageMLock(page)) {
> + lock_page(page);
> + clear_page_mlock(page);
Note that this will put the page into the lru pagevec cache
[__clear_page_mlock() above] where isolate_lru_page(), called from
migrate_page_add(), is unlikely to find it. do_move_pages() has already
called migrate_prep() to drain the lru caches so that it is more likely
to find the pages, as does check_range() when called to collect pages
for migration. Yes, this is already racy--the target task or other
threads therein can fault additional pages into the lru cache after call
to migrate_prep(). But this almost guarantees we'll miss ~ the last
PAGEVEC_SIZE pages.
> + unlock_page(page);
> + }
> +
> err = isolate_lru_page(page);
> if (err) {
> put_and_set:
> Index: linux-2.6/mm/mempolicy.c
> ===================================================================
> --- linux-2.6.orig/mm/mempolicy.c
> +++ linux-2.6/mm/mempolicy.c
> @@ -89,6 +89,7 @@
> #include <linux/migrate.h>
> #include <linux/rmap.h>
> #include <linux/security.h>
> +#include <linux/pagemap.h>
>
> #include <asm/tlbflush.h>
> #include <asm/uaccess.h>
> @@ -224,7 +225,10 @@ static int check_pte_range(struct vm_are
> pte_t *orig_pte;
> pte_t *pte;
> spinlock_t *ptl;
> + struct page *mlocked;
>
> +resume:
> + mlocked = NULL;
> orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> do {
> struct page *page;
> @@ -254,12 +258,24 @@ static int check_pte_range(struct vm_are
>
> if (flags & MPOL_MF_STATS)
> gather_stats(page, private, pte_dirty(*pte));
> - else if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
> + else if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
> + if (PageMLock(page) && !mlocked) {
> + mlocked = page;
> + break;
> + }
> migrate_page_add(page, private, flags);
> - else
> + } else
> break;
> } while (pte++, addr += PAGE_SIZE, addr != end);
> pte_unmap_unlock(orig_pte, ptl);
> +
> + if (mlocked) {
> + lock_page(mlocked);
> + clear_page_mlock(mlocked);
Same comment as for do_move_pages() above.
> + unlock_page(mlocked);
> + goto resume;
> + }
> +
> return addr != end;
> }
>
> @@ -372,6 +388,7 @@ check_range(struct mm_struct *mm, unsign
> endvma = end;
> if (vma->vm_start > start)
> start = vma->vm_start;
> +
> err = check_pgd_range(vma, start, endvma, nodes,
> flags, private);
> if (err) {
Here's the patch mentioned above:
Need to replace call to install_arg_page() in ia64's
ia32 version of setup_arg_pages() to build 21-rc2-mm2
with Nick's "mlocked pages off LRU" patch on ia64.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
arch/ia64/ia32/binfmt_elf32.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
Index: Linux/arch/ia64/ia32/binfmt_elf32.c
===================================================================
--- Linux.orig/arch/ia64/ia32/binfmt_elf32.c 2007-03-06 12:16:33.000000000 -0500
+++ Linux/arch/ia64/ia32/binfmt_elf32.c 2007-03-06 15:19:02.000000000 -0500
@@ -240,7 +240,11 @@ ia32_setup_arg_pages (struct linux_binpr
struct page *page = bprm->page[i];
if (page) {
bprm->page[i] = NULL;
- install_arg_page(mpnt, page, stack_base);
+ if (!ret)
+ ret = install_new_anon_page(mpnt, page,
+ stack_base);
+ if (ret)
+ put_page(page);
}
stack_base += PAGE_SIZE;
}
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2007-03-06 22:23 UTC|newest]
Thread overview: 19+ messages / expand[flat|nested] mbox.gz Atom feed top
2007-03-05 16:17 [rfc][patch 2/2] mm: mlocked pages off LRU Nick Piggin
2007-03-05 16:40 ` Nick Piggin
2007-03-05 17:12 ` Christoph Hellwig
2007-03-05 18:17 ` Christoph Lameter
2007-03-05 18:14 ` Christoph Lameter
2007-03-05 19:26 ` Rik van Riel
2007-03-06 1:05 ` Nick Piggin
2007-03-06 1:27 ` Christoph Lameter
2007-03-06 1:44 ` Nick Piggin
2007-03-06 1:55 ` Christoph Lameter
2007-03-06 2:13 ` Nick Piggin
2007-03-06 2:46 ` Christoph Lameter
2007-03-06 2:50 ` Nick Piggin
2007-03-06 14:30 ` Nick Piggin
2007-03-06 18:30 ` Christoph Lameter
2007-03-07 3:07 ` Nick Piggin
2007-03-06 22:23 ` Lee Schermerhorn [this message]
2007-03-07 3:52 ` Nick Piggin
2007-03-06 15:59 ` Rik van Riel
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1173219835.20580.15.camel@localhost \
--to=lee.schermerhorn@hp.com \
--cc=akpm@linux-foundation.org \
--cc=clameter@engr.sgi.com \
--cc=clameter@sgi.com \
--cc=hch@infradead.org \
--cc=linux-mm@kvack.org \
--cc=npiggin@suse.de \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox