All of lore.kernel.org
 help / color / mirror / Atom feed
From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
To: Nick Piggin <npiggin@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>,
	Christoph Lameter <clameter@engr.sgi.com>,
	Linux Memory Management List <linux-mm@kvack.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Christoph Hellwig <hch@infradead.org>
Subject: Re: [rfc][patch 2/2] mm: mlocked pages off LRU
Date: Tue, 06 Mar 2007 17:23:55 -0500	[thread overview]
Message-ID: <1173219835.20580.15.camel@localhost> (raw)
In-Reply-To: <20070306143045.GA28629@wotan.suse.de>

On Tue, 2007-03-06 at 15:30 +0100, Nick Piggin wrote: 
> New core patch. This one is actually tested and works, and you can see
> the mlocked pages being accounted.
> 
> Same basic idea. Too many fixes and changes to list. Haven't taken up
> Christoph's idea to do a union in struct page, but it could be a followup.
> 
> Most importantly (aside from crashes and obvious bugs), it should correctly
> synchronise munlock vs vmscan lazy mlock now. Before this, it was possible
> to have pages leak. This took me a bit of thinking to get right, but was
> rather simple in the end.
> 
> Memory migration should work now, too, but not tested.
> 
> What do people think? Yes? No?

Nick:  I've grabbed your 2 patches in this series and rebased them to
21-rc2-mm2 so I can test them and compare with Christoph's [which I've
also rebased to -mm2].  I had to fix up the ia32_setup_arg_pages() for
ia64 to track the change you made to install_new_arg_page.  Patch
included below.  Some comments in-line below, as well.

Now builds, boots, and successfully builds a kernel with Christoph's
series.  Some basic testing with memtoy [see link below] shows pages
being locked according to the /proc/meminfo stats, but the counts don't
decrease when I unmap the segment nor when I exit the task.  I'll
investigate why and let you know how further testing goes.  After that,
I plan to merge both series with my page migration series and your page
cache replication patch to test the effects there.  Should be
"interesting".

If you're interested, I have a little tool/toy for testing mm stuff at:
http://free.linux.hp.com/~lts/Tools/memtoy-latest.tar.gz
I recently added a lock()/unlock() command for testing locking of memory
regions.  It could use more [a lot more] documentation, but there it
does have a README and internal help.

Lee


> 
> --
> 
> Index: linux-2.6/mm/mlock.c
> ===================================================================
> --- linux-2.6.orig/mm/mlock.c
> +++ linux-2.6/mm/mlock.c
> @@ -8,17 +8,204 @@
>  #include <linux/capability.h>
>  #include <linux/mman.h>
>  #include <linux/mm.h>
> +#include <linux/swap.h>
> +#include <linux/pagemap.h>
>  #include <linux/mempolicy.h>
>  #include <linux/syscalls.h>
>  
> +#include "internal.h"
> +
> +#define page_mlock_count(page)		(*(unsigned long *)&(page)->lru.next)
> +#define set_page_mlock_count(page, v)	(page_mlock_count(page) = (v))
> +#define inc_page_mlock_count(page)	(page_mlock_count(page)++)
> +#define dec_page_mlock_count(page)	(page_mlock_count(page)--)
> +
> +/*
> + * A page's mlock_count is kept in page->lru.next as an unsigned long.
> + * Access to this count is serialised with the page lock (or, in the
> + * case of mlock_page, virtue that there are no other references to
> + * the page).
> + *
> + * mlock counts are incremented at mlock, mmap, mremap, and new anon page
> + * faults, and lazily via vmscan. Decremented at munlock, munmap, and exit.
> + * mlock is not inherited across fork or exec, so we're safe there.
> + *
> + * If PageMLock is set, then the page is removed from the LRU list, and
> + * has its refcount incremented. This increment prevents the page from being
> + * freed until the mlock_count is decremented to zero and PageMLock is cleared.
> + *
> + * When lazy incrementing via vmscan, it is important to ensure that the
> + * vma's VM_LOCKED status is not concurrently being modified, otherwise we
> + * may have elevated mlock_count of a page that is being munlocked. So lazy
> + * mlocked must take the mmap_sem for read, and verify that the vma really
> + * is locked (see mm/rmap.c).
> + */
> +
> +/*
> + * Marks a page, belonging to the given mlocked vma, as mlocked.
> + *
> + * The page must be either locked or new, and must not be on the LRU.
> + */
> +static void __set_page_mlock(struct page *page)
> +{
> +	BUG_ON(PageLRU(page));
> +	BUG_ON(PageMLock(page));
> +	/* BUG_ON(!list_empty(&page->lru)); -- if we always did list_del_init */
> +
> +	SetPageMLock(page);
> +	get_page(page);
> +	inc_zone_page_state(page, NR_MLOCK);
> +	set_page_mlock_count(page, 1);
> +}
> +
> +static void __clear_page_mlock(struct page *page)
> +{
> +	BUG_ON(!PageMLock(page));
> +	BUG_ON(PageLRU(page));
> +	BUG_ON(page_mlock_count(page));
> +
> +	dec_zone_page_state(page, NR_MLOCK);
> +	ClearPageMLock(page);
> +	lru_cache_add_active(page);
> +	put_page(page);
> +}
> +
> +/*
> + * Zero the page's mlock_count. This can be useful in a situation where
> + * we want to unconditionally remove a page from the pagecache.
> + *
> + * It is not illegal to call this function for any page, mlocked or not.
Maybe "It is legal ..."  ???

> + * If called for a page that is still mapped by mlocked vmas, all we do
> + * is revert to lazy LRU behaviour -- semantics are not broken.
> + */
> +void clear_page_mlock(struct page *page)
> +{
> +	BUG_ON(!PageLocked(page));
> +
> +	if (likely(!PageMLock(page)))
> +		return;
> +	BUG_ON(!page_mlock_count(page));
> +	set_page_mlock_count(page, 0);
> +	__clear_page_mlock(page);
> +}
> +
<snip>

> Index: linux-2.6/include/linux/page-flags.h
> ===================================================================
> --- linux-2.6.orig/include/linux/page-flags.h
> +++ linux-2.6/include/linux/page-flags.h
> @@ -91,6 +91,7 @@
>  #define PG_nosave_free		18	/* Used for system suspend/resume */
>  #define PG_buddy		19	/* Page is free, on buddy lists */
>  
> +#define PG_mlock		20	/* Page has mlocked vmas */

Conflicts with PG_readahead in 21-rc2-mm2.  I temporarily used bit
30--valid only for 64-bit systems.  [Same in Christoph's series.]

>  
>  #if (BITS_PER_LONG > 32)
>  /*
> @@ -247,6 +248,10 @@ static inline void SetPageUptodate(struc
>  #define PageSwapCache(page)	0
>  #endif
>  
> +#define PageMLock(page)		test_bit(PG_mlock, &(page)->flags)
> +#define SetPageMLock(page)	set_bit(PG_mlock, &(page)->flags)
> +#define ClearPageMLock(page)	clear_bit(PG_mlock, &(page)->flags)
> +
>  #define PageUncached(page)	test_bit(PG_uncached, &(page)->flags)
>  #define SetPageUncached(page)	set_bit(PG_uncached, &(page)->flags)
>  #define ClearPageUncached(page)	clear_bit(PG_uncached, &(page)->flags)
> Index: linux-2.6/mm/page_alloc.c
> ===================================================================
> --- linux-2.6.orig/mm/page_alloc.c
> +++ linux-2.6/mm/page_alloc.c
> @@ -203,7 +203,8 @@ static void bad_page(struct page *page)
>  			1 << PG_slab    |
>  			1 << PG_swapcache |
>  			1 << PG_writeback |
> -			1 << PG_buddy );
> +			1 << PG_buddy |
> +			1 << PG_mlock );
>  	set_page_count(page, 0);
>  	reset_page_mapcount(page);
>  	page->mapping = NULL;
> @@ -438,7 +439,8 @@ static inline int free_pages_check(struc
>  			1 << PG_swapcache |
>  			1 << PG_writeback |
>  			1 << PG_reserved |
> -			1 << PG_buddy ))))
> +			1 << PG_buddy |
> +			1 << PG_mlock ))))
>  		bad_page(page);
>  	if (PageDirty(page))
>  		__ClearPageDirty(page);
> @@ -588,7 +590,8 @@ static int prep_new_page(struct page *pa
>  			1 << PG_swapcache |
>  			1 << PG_writeback |
>  			1 << PG_reserved |
> -			1 << PG_buddy ))))
> +			1 << PG_buddy |
> +			1 << PG_mlock ))))
>  		bad_page(page);
>  
>  	/*
> Index: linux-2.6/fs/exec.c
> ===================================================================
> --- linux-2.6.orig/fs/exec.c
> +++ linux-2.6/fs/exec.c
> @@ -297,44 +297,6 @@ int copy_strings_kernel(int argc,char **
>  EXPORT_SYMBOL(copy_strings_kernel);
>  
>  #ifdef CONFIG_MMU
> -/*
> - * This routine is used to map in a page into an address space: needed by
> - * execve() for the initial stack and environment pages.
> - *
> - * vma->vm_mm->mmap_sem is held for writing.
> - */
> -void install_arg_page(struct vm_area_struct *vma,
> -			struct page *page, unsigned long address)
> -{
> -	struct mm_struct *mm = vma->vm_mm;
> -	pte_t * pte;
> -	spinlock_t *ptl;
> -
> -	if (unlikely(anon_vma_prepare(vma)))
> -		goto out;
> -
> -	flush_dcache_page(page);
> -	pte = get_locked_pte(mm, address, &ptl);
> -	if (!pte)
> -		goto out;
> -	if (!pte_none(*pte)) {
> -		pte_unmap_unlock(pte, ptl);
> -		goto out;
> -	}
> -	inc_mm_counter(mm, anon_rss);
> -	lru_cache_add_active(page);
> -	set_pte_at(mm, address, pte, pte_mkdirty(pte_mkwrite(mk_pte(
> -					page, vma->vm_page_prot))));
> -	page_add_new_anon_rmap(page, vma, address);
> -	pte_unmap_unlock(pte, ptl);
> -
> -	/* no need for flush_tlb */
> -	return;
> -out:
> -	__free_page(page);
> -	force_sig(SIGKILL, current);
> -}
> -
>  #define EXTRA_STACK_VM_PAGES	20	/* random */
>  
>  int setup_arg_pages(struct linux_binprm *bprm,
> @@ -438,17 +400,25 @@ int setup_arg_pages(struct linux_binprm 
>  		mm->stack_vm = mm->total_vm = vma_pages(mpnt);
>  	}
>  
> +	ret = 0;
>  	for (i = 0 ; i < MAX_ARG_PAGES ; i++) {
>  		struct page *page = bprm->page[i];
>  		if (page) {
>  			bprm->page[i] = NULL;
> -			install_arg_page(mpnt, page, stack_base);
> +			if (!ret)
> +				ret = install_new_anon_page(mpnt, page,
> +								stack_base);
> +			if (ret)
> +				put_page(page);

Need similar mod in arch/ia64/ia32/binfmt_elf32.c:ia32_setup_arg_pages()
Patch included below.

>  		}
>  		stack_base += PAGE_SIZE;
>  	}
>  	up_write(&mm->mmap_sem);
> -	
> -	return 0;
> +
> +	if (ret)
> +		do_munmap(mm, mpnt->vm_start, mpnt->vm_start - mpnt->vm_end);
> +
> +	return ret;
>  }
>  
>  EXPORT_SYMBOL(setup_arg_pages);

> Index: linux-2.6/mm/migrate.c
> ===================================================================
> --- linux-2.6.orig/mm/migrate.c
> +++ linux-2.6/mm/migrate.c
> @@ -272,6 +272,8 @@ static int migrate_page_move_mapping(str
>  		return 0;
>  	}
>  
> +	clear_page_mlock(page);
> +
>  	write_lock_irq(&mapping->tree_lock);
>  
>  	pslot = radix_tree_lookup_slot(&mapping->page_tree,
> @@ -775,6 +777,17 @@ static int do_move_pages(struct mm_struc
>  				!migrate_all)
>  			goto put_and_set;
>  
> +		/*
> +		 * Just do the simple thing and put back mlocked pages onto
> +		 * the LRU list so they can be taken off again (inefficient
> +		 * but not a big deal).
> +		 */
> +		if (PageMLock(page)) {
> +			lock_page(page);
> +			clear_page_mlock(page);
Note that this will put the page into the lru pagevec cache
[__clear_page_mlock() above] where isolate_lru_page(), called from
migrate_page_add(), is unlikely to find it.  do_move_pages() has already
called migrate_prep() to drain the lru caches so that it is more likely
to find the pages, as does check_range() when called to collect pages
for migration.  Yes, this is already racy--the target task or other
threads therein can fault additional pages into the lru cache after call
to migrate_prep().  But this almost guarantees we'll miss ~ the last
PAGEVEC_SIZE pages.

> +			unlock_page(page);
> +		}
> +
>  		err = isolate_lru_page(page);
>  		if (err) {
>  put_and_set:
> Index: linux-2.6/mm/mempolicy.c
> ===================================================================
> --- linux-2.6.orig/mm/mempolicy.c
> +++ linux-2.6/mm/mempolicy.c
> @@ -89,6 +89,7 @@
>  #include <linux/migrate.h>
>  #include <linux/rmap.h>
>  #include <linux/security.h>
> +#include <linux/pagemap.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/uaccess.h>
> @@ -224,7 +225,10 @@ static int check_pte_range(struct vm_are
>  	pte_t *orig_pte;
>  	pte_t *pte;
>  	spinlock_t *ptl;
> +	struct page *mlocked;
>  
> +resume:
> +	mlocked = NULL;
>  	orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>  	do {
>  		struct page *page;
> @@ -254,12 +258,24 @@ static int check_pte_range(struct vm_are
>  
>  		if (flags & MPOL_MF_STATS)
>  			gather_stats(page, private, pte_dirty(*pte));
> -		else if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
> +		else if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
> +			if (PageMLock(page) && !mlocked) {
> +				mlocked = page;
> +				break;
> +			}
>  			migrate_page_add(page, private, flags);
> -		else
> +		} else
>  			break;
>  	} while (pte++, addr += PAGE_SIZE, addr != end);
>  	pte_unmap_unlock(orig_pte, ptl);
> +
> +	if (mlocked) {
> +		lock_page(mlocked);
> +		clear_page_mlock(mlocked);

Same comment as for do_move_pages() above.

> +		unlock_page(mlocked);
> +		goto resume;
> +	}
> +
>  	return addr != end;
>  }
>  
> @@ -372,6 +388,7 @@ check_range(struct mm_struct *mm, unsign
>  				endvma = end;
>  			if (vma->vm_start > start)
>  				start = vma->vm_start;
> +
>  			err = check_pgd_range(vma, start, endvma, nodes,
>  						flags, private);
>  			if (err) {

Here's the patch mentioned above:

Need to replace call to install_arg_page() in ia64's
ia32 version of setup_arg_pages() to build 21-rc2-mm2
with Nick's "mlocked pages off LRU" patch on ia64. 

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

 arch/ia64/ia32/binfmt_elf32.c |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

Index: Linux/arch/ia64/ia32/binfmt_elf32.c
===================================================================
--- Linux.orig/arch/ia64/ia32/binfmt_elf32.c	2007-03-06 12:16:33.000000000 -0500
+++ Linux/arch/ia64/ia32/binfmt_elf32.c	2007-03-06 15:19:02.000000000 -0500
@@ -240,7 +240,11 @@ ia32_setup_arg_pages (struct linux_binpr
 		struct page *page = bprm->page[i];
 		if (page) {
 			bprm->page[i] = NULL;
-			install_arg_page(mpnt, page, stack_base);
+			if (!ret)
+				ret = install_new_anon_page(mpnt, page,
+								stack_base);
+			if (ret)
+				put_page(page);
 		}
 		stack_base += PAGE_SIZE;
 	}


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  parent reply	other threads:[~2007-03-06 22:23 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-03-05 16:17 [rfc][patch 2/2] mm: mlocked pages off LRU Nick Piggin
2007-03-05 16:40 ` Nick Piggin
2007-03-05 17:12 ` Christoph Hellwig
2007-03-05 18:17   ` Christoph Lameter
2007-03-05 18:14 ` Christoph Lameter
2007-03-05 19:26   ` Rik van Riel
2007-03-06  1:05   ` Nick Piggin
2007-03-06  1:27     ` Christoph Lameter
2007-03-06  1:44       ` Nick Piggin
2007-03-06  1:55         ` Christoph Lameter
2007-03-06  2:13           ` Nick Piggin
2007-03-06  2:46             ` Christoph Lameter
2007-03-06  2:50               ` Nick Piggin
2007-03-06 14:30                 ` Nick Piggin
2007-03-06 18:30                   ` Christoph Lameter
2007-03-07  3:07                     ` Nick Piggin
2007-03-06 22:23                   ` Lee Schermerhorn [this message]
2007-03-07  3:52                     ` Nick Piggin
2007-03-06 15:59               ` Rik van Riel

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1173219835.20580.15.camel@localhost \
    --to=lee.schermerhorn@hp.com \
    --cc=akpm@linux-foundation.org \
    --cc=clameter@engr.sgi.com \
    --cc=clameter@sgi.com \
    --cc=hch@infradead.org \
    --cc=linux-mm@kvack.org \
    --cc=npiggin@suse.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.