Re: Can get_user_pages( ,write=1, force=1, ) result in a read-only pte and _count=2?

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Robin Holt <holt@sgi.com>
To: Hugh Dickins <hugh@veritas.com>
Cc: Robin Holt <holt@sgi.com>, Nick Piggin <nickpiggin@yahoo.com.au>,
	Ingo Molnar <mingo@elte.hu>, Christoph Lameter <clameter@sgi.com>,
	Jack Steiner <steiner@sgi.com>, Rik van Riel <riel@redhat.com>,
	Lee Schermerhorn <Lee.Schermerhorn@hp.com>,
	linux-mm@kvack.org
Subject: Re: Can get_user_pages( ,write=1, force=1, ) result in a read-only pte and _count=2?
Date: Tue, 24 Jun 2008 06:56:07 -0500	[thread overview]
Message-ID: <20080624115606.GJ10062@sgi.com> (raw)
In-Reply-To: <Pine.LNX.4.64.0806232134330.19691@blonde.site>

Argh, the current kernel does not boot for ia64.  I will look into
that first.  It fails even without your patch applied.

Thanks,
Robin

On Mon, Jun 23, 2008 at 09:58:42PM +0100, Hugh Dickins wrote:
> On Mon, 23 Jun 2008, Robin Holt wrote:
> > On Mon, Jun 23, 2008 at 05:48:17PM +0100, Hugh Dickins wrote:
> > 
> > > reuse test in do_wp_page that I'm still working on - of which Nick sent
> > > a lock_page approximation for you to try?  Would you still be able to
> > > try mine when I'm ready, or does it now appear irrelevant to you?
> > 
> > Before your response, I had convinced myself my problem was specific to
> > XPMEM, but I see your point and may agree that it is a problem for all
> > get_user_pages() users.
> > 
> > I can certainly test when you have it ready.
> 
> Thanks a lot, Robin.  Here it is below.
> 
> > 
> > I had confused myself about Nick's first patch.  I will give that
> > another look over and see if it fixes the problem.
> 
> Nick's _first_ patch?  The one I was thinking of was the one
> with lock_page in do_wp_page, but it shouldn't be necessary now if
> what's below works - though his is a much smaller and less worrying
> patch, so anyone looking for a quick fix to the issue might well
> prefer his.
> 
> > 
> > > 	http://lkml.org/lkml/2006/9/14/384
> > > 
> > > but it's a broken thread, with misunderstanding on all sides,
> > > so rather hard to get a grasp of it.
> > 
> > That is extremely similar to the issue I am seeing.  I think that if
> > Infiniband were using the mmu_notifier stuff, they would be closer, but
> > IIRC, there are significant hardware restrictions which prevent demand
> > paging for working on some IB devices.
> 
> Ah, I'm glad you've managed to glean something from it, good.
> 
> Here's the rollup of the patches I'm proposing for two issues:
> it doesn't get my signoff yet because I'll want to split it into
> little stages properly commented, and I'll want to do more strenuous
> testing; but this shouldn't blow up in your face.  Against 2.6.26-rc7,
> should apply quite easily to earlier, but a little more work against
> 2.6.26-rc5-mm3 - though it simplifies some of that too (Rik+Lee Cced).
> 
> I say two issues, two competing issues.  One is the issue which may be
> your issue, that we want to decide COW in do_wp_page without needing
> PageLocked, so a concurrent shrink_page_list() doesn't occasionally
> force an unwanted COW, messing up some get_user_pages() uses.  The
> other, of no interest to you, is that we do want PageLocked when
> deciding COW in do_wp_page, because that's a good moment to free
> up the swap space - leaving the modified page with swap just makes
> a big seek likely when it's next written to swap.
> 
> Hugh
> 
>  include/linux/swap.h |   21 ++++--
>  mm/memory.c          |   15 +++-
>  mm/migrate.c         |    5 -
>  mm/page_io.c         |    2 
>  mm/swap_state.c      |   12 ++-
>  mm/swapfile.c        |  128 ++++++++++++++++++++++-------------------
>  6 files changed, 105 insertions(+), 78 deletions(-)
> 
> --- 2.6.26-rc7/include/linux/swap.h	2008-05-03 21:55:11.000000000 +0100
> +++ linux/include/linux/swap.h	2008-06-23 18:12:47.000000000 +0100
> @@ -250,8 +250,9 @@ extern unsigned int count_swap_pages(int
>  extern sector_t map_swap_page(struct swap_info_struct *, pgoff_t);
>  extern sector_t swapdev_block(int, pgoff_t);
>  extern struct swap_info_struct *get_swap_info_struct(unsigned);
> -extern int can_share_swap_page(struct page *);
> -extern int remove_exclusive_swap_page(struct page *);
> +extern int can_reuse_swap_page_unlocked(struct page *);
> +extern int can_reuse_swap_page_locked(struct page *);
> +extern int try_to_free_swap(struct page *);
>  struct backing_dev_info;
>  
>  extern spinlock_t swap_lock;
> @@ -319,8 +320,6 @@ static inline struct page *lookup_swap_c
>  	return NULL;
>  }
>  
> -#define can_share_swap_page(p)			(page_mapcount(p) == 1)
> -
>  static inline int add_to_swap_cache(struct page *page, swp_entry_t entry,
>  							gfp_t gfp_mask)
>  {
> @@ -337,9 +336,19 @@ static inline void delete_from_swap_cach
>  
>  #define swap_token_default_timeout		0
>  
> -static inline int remove_exclusive_swap_page(struct page *p)
> +static inline int can_reuse_swap_page_unlocked(struct page *page)
>  {
> -	return 0;
> +	return 0;	/* irrelevant: never called  */
> +}
> +
> +static inline int can_reuse_swap_page_locked(struct page *page)
> +{
> +	return 0;	/* irrelevant: never called */
> +}
> +
> +static inline int try_to_free_swap(struct page *page)
> +{
> +	return 0;	/* irrelevant: never called */
>  }
>  
>  static inline swp_entry_t get_swap_page(void)
> --- 2.6.26-rc7/mm/memory.c	2008-06-21 08:41:19.000000000 +0100
> +++ linux/mm/memory.c	2008-06-23 18:12:47.000000000 +0100
> @@ -1686,9 +1686,14 @@ static int do_wp_page(struct mm_struct *
>  	 * not dirty accountable.
>  	 */
>  	if (PageAnon(old_page)) {
> -		if (!TestSetPageLocked(old_page)) {
> -			reuse = can_share_swap_page(old_page);
> -			unlock_page(old_page);
> +		if (page_mapcount(old_page) == 1) {
> +			if (!PageSwapCache(old_page))
> +				reuse = 1;
> +			else if (!TestSetPageLocked(old_page)) {
> +				reuse = can_reuse_swap_page_locked(old_page);
> +				unlock_page(old_page);
> +			} else
> +				reuse = can_reuse_swap_page_unlocked(old_page);
>  		}
>  	} else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
>  					(VM_WRITE|VM_SHARED))) {
> @@ -2185,7 +2190,7 @@ static int do_swap_page(struct mm_struct
>  
>  	inc_mm_counter(mm, anon_rss);
>  	pte = mk_pte(page, vma->vm_page_prot);
> -	if (write_access && can_share_swap_page(page)) {
> +	if (write_access && can_reuse_swap_page_locked(page)) {
>  		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
>  		write_access = 0;
>  	}
> @@ -2196,7 +2201,7 @@ static int do_swap_page(struct mm_struct
>  
>  	swap_free(entry);
>  	if (vm_swap_full())
> -		remove_exclusive_swap_page(page);
> +		try_to_free_swap(page);
>  	unlock_page(page);
>  
>  	if (write_access) {
> --- 2.6.26-rc7/mm/migrate.c	2008-06-21 08:41:19.000000000 +0100
> +++ linux/mm/migrate.c	2008-06-23 18:12:47.000000000 +0100
> @@ -330,8 +330,10 @@ static int migrate_page_move_mapping(str
>  	get_page(newpage);	/* add cache reference */
>  #ifdef CONFIG_SWAP
>  	if (PageSwapCache(page)) {
> -		SetPageSwapCache(newpage);
>  		set_page_private(newpage, page_private(page));
> +		/* page_swapcount() relies on private whenever PageSwapCache */
> +		smp_wmb();
> +		SetPageSwapCache(newpage);
>  	}
>  #endif
>  
> @@ -398,7 +400,6 @@ static void migrate_page_copy(struct pag
>  #endif
>  	ClearPageActive(page);
>  	ClearPagePrivate(page);
> -	set_page_private(page, 0);
>  	page->mapping = NULL;
>  
>  	/*
> --- 2.6.26-rc7/mm/page_io.c	2008-04-17 03:49:44.000000000 +0100
> +++ linux/mm/page_io.c	2008-06-23 18:12:47.000000000 +0100
> @@ -98,7 +98,7 @@ int swap_writepage(struct page *page, st
>  	struct bio *bio;
>  	int ret = 0, rw = WRITE;
>  
> -	if (remove_exclusive_swap_page(page)) {
> +	if (try_to_free_swap(page)) {
>  		unlock_page(page);
>  		goto out;
>  	}
> --- 2.6.26-rc7/mm/swap_state.c	2008-05-03 21:55:12.000000000 +0100
> +++ linux/mm/swap_state.c	2008-06-23 18:12:47.000000000 +0100
> @@ -76,13 +76,15 @@ int add_to_swap_cache(struct page *page,
>  	BUG_ON(PagePrivate(page));
>  	error = radix_tree_preload(gfp_mask);
>  	if (!error) {
> +		set_page_private(page, entry.val);
> +		/* page_swapcount() relies on private whenever PageSwapCache */
> +		smp_wmb();
>  		write_lock_irq(&swapper_space.tree_lock);
>  		error = radix_tree_insert(&swapper_space.page_tree,
>  						entry.val, page);
>  		if (!error) {
>  			page_cache_get(page);
>  			SetPageSwapCache(page);
> -			set_page_private(page, entry.val);
>  			total_swapcache_pages++;
>  			__inc_zone_page_state(page, NR_FILE_PAGES);
>  			INC_CACHE_INFO(add_total);
> @@ -105,7 +107,6 @@ void __delete_from_swap_cache(struct pag
>  	BUG_ON(PagePrivate(page));
>  
>  	radix_tree_delete(&swapper_space.page_tree, page_private(page));
> -	set_page_private(page, 0);
>  	ClearPageSwapCache(page);
>  	total_swapcache_pages--;
>  	__dec_zone_page_state(page, NR_FILE_PAGES);
> @@ -188,13 +189,14 @@ void delete_from_swap_cache(struct page 
>   * 
>   * Its ok to check for PageSwapCache without the page lock
>   * here because we are going to recheck again inside 
> - * exclusive_swap_page() _with_ the lock. 
> + * try_to_free_swap() _with_ the lock. 
>   * 					- Marcelo
>   */
>  static inline void free_swap_cache(struct page *page)
>  {
> -	if (PageSwapCache(page) && !TestSetPageLocked(page)) {
> -		remove_exclusive_swap_page(page);
> +	if (PageSwapCache(page) && !page_mapped(page) &&
> +	    !TestSetPageLocked(page)) {
> +		try_to_free_swap(page);
>  		unlock_page(page);
>  	}
>  }
> --- 2.6.26-rc7/mm/swapfile.c	2008-05-03 21:55:12.000000000 +0100
> +++ linux/mm/swapfile.c	2008-06-23 18:12:47.000000000 +0100
> @@ -251,7 +251,6 @@ static struct swap_info_struct * swap_in
>  		goto bad_offset;
>  	if (!p->swap_map[offset])
>  		goto bad_free;
> -	spin_lock(&swap_lock);
>  	return p;
>  
>  bad_free:
> @@ -300,90 +299,104 @@ void swap_free(swp_entry_t entry)
>  
>  	p = swap_info_get(entry);
>  	if (p) {
> +		spin_lock(&swap_lock);
>  		swap_entry_free(p, swp_offset(entry));
>  		spin_unlock(&swap_lock);
>  	}
>  }
>  
>  /*
> - * How many references to page are currently swapped out?
> + * How many page table references are there to this page's swap entry?
> + * including references to the page itself if add_mapcount.
> + *
> + * page_swapcount is only called on a page which was recently PageSwapCache
> + * (but might have been deleted from swap cache just before getting here).
> + *
> + * When called with PageLocked, the swapcount is stable, and the mapcount
> + * cannot rise (but may fall if the page is concurrently unmapped from a
> + * page table whose lock we don't hold).
> + *
> + * When called without PageLocked, the swapcount is stable while we hold
> + * swap_lock, and the mapcount cannot rise from 1 while we hold that page
> + * table lock (but may fall if the page is concurrently unmapped from a
> + * page table whose lock we don't hold).
> + *
> + * do_swap_page and unuse_pte call page_add_anon_rmap before swap_free,
> + * try_to_unmap_one calls swap_duplicate before page_remove_rmap:
> + * so in general, swapcount+mapcount should never be seen too low -
> + * but you need to consider more memory barriers if extending its use.
>   */
> -static inline int page_swapcount(struct page *page)
> +static int page_swapcount(struct page *page, int add_mapcount)
>  {
>  	int count = 0;
>  	struct swap_info_struct *p;
>  	swp_entry_t entry;
>  
> -	entry.val = page_private(page);
> -	p = swap_info_get(entry);
> -	if (p) {
> -		/* Subtract the 1 for the swap cache itself */
> -		count = p->swap_map[swp_offset(entry)] - 1;
> -		spin_unlock(&swap_lock);
> +	spin_lock(&swap_lock);
> +	if (add_mapcount)
> +		count = page_mapcount(page);
> +	if (PageSwapCache(page)) {
> +		/* We can rely on page_private once PageSwapCache is visible */
> +		smp_rmb();
> +		entry.val = page_private(page);
> +		p = swap_info_get(entry);
> +		if (p) {
> +			/* Subtract the 1 for the swap cache itself */
> +			count += p->swap_map[swp_offset(entry)] - 1;
> +		}
>  	}
> +	spin_unlock(&swap_lock);
>  	return count;
>  }
>  
>  /*
> - * We can use this swap cache entry directly
> - * if there are no other references to it.
> + * Can do_wp_page() make the faulting swapcache page writable without COW?
> + * But something else, probably shrink_page_list(), already has PageLocked.
> + * Don't be misled to COW the page unnecessarily: check swapcount+mapcount.
>   */
> -int can_share_swap_page(struct page *page)
> +int can_reuse_swap_page_unlocked(struct page *page)
>  {
> -	int count;
> -
> -	BUG_ON(!PageLocked(page));
> -	count = page_mapcount(page);
> -	if (count <= 1 && PageSwapCache(page))
> -		count += page_swapcount(page);
> -	return count == 1;
> +	return page_swapcount(page, 1) == 1;
>  }
>  
>  /*
> - * Work out if there are any other processes sharing this
> - * swap cache page. Free it if you can. Return success.
> + * Can do_wp_page() make the faulting swapcache page writable without COW?
> + * having acquiring PageLocked.  In this case, since we have that lock and
> + * are about to modify the page, we'd better free its swap space - it won't
> + * be read again, and writing there later would probably require an extra seek.
>   */
> -int remove_exclusive_swap_page(struct page *page)
> +int can_reuse_swap_page_locked(struct page *page)
>  {
> -	int retval;
> -	struct swap_info_struct * p;
> -	swp_entry_t entry;
> +	VM_BUG_ON(!PageLocked(page));
> +	if (page_swapcount(page, 1) != 1)
> +		return 0;
> +	if (!PageSwapCache(page))
> +		return 1;
> +	if (PageWriteback(page))
> +		return 1;
>  
> -	BUG_ON(PagePrivate(page));
> -	BUG_ON(!PageLocked(page));
> +	delete_from_swap_cache(page);
> +	SetPageDirty(page);
> +	return 1;
> +}
>  
> +/*
> + * If swap is getting full, or if there are no more mappings of this page,
> + * then try_to_free_swap is called to free its swap space.
> + */
> +int try_to_free_swap(struct page *page)
> +{
> +	VM_BUG_ON(!PageLocked(page));
>  	if (!PageSwapCache(page))
>  		return 0;
>  	if (PageWriteback(page))
>  		return 0;
> -	if (page_count(page) != 2) /* 2: us + cache */
> +	if (page_swapcount(page, 0))	/* Here we don't care about mapcount */
>  		return 0;
>  
> -	entry.val = page_private(page);
> -	p = swap_info_get(entry);
> -	if (!p)
> -		return 0;
> -
> -	/* Is the only swap cache user the cache itself? */
> -	retval = 0;
> -	if (p->swap_map[swp_offset(entry)] == 1) {
> -		/* Recheck the page count with the swapcache lock held.. */
> -		write_lock_irq(&swapper_space.tree_lock);
> -		if ((page_count(page) == 2) && !PageWriteback(page)) {
> -			__delete_from_swap_cache(page);
> -			SetPageDirty(page);
> -			retval = 1;
> -		}
> -		write_unlock_irq(&swapper_space.tree_lock);
> -	}
> -	spin_unlock(&swap_lock);
> -
> -	if (retval) {
> -		swap_free(entry);
> -		page_cache_release(page);
> -	}
> -
> -	return retval;
> +	delete_from_swap_cache(page);
> +	SetPageDirty(page);
> +	return 1;
>  }
>  
>  /*
> @@ -400,6 +413,7 @@ void free_swap_and_cache(swp_entry_t ent
>  
>  	p = swap_info_get(entry);
>  	if (p) {
> +		spin_lock(&swap_lock);
>  		if (swap_entry_free(p, swp_offset(entry)) == 1) {
>  			page = find_get_page(&swapper_space, entry.val);
>  			if (page && unlikely(TestSetPageLocked(page))) {
> @@ -410,14 +424,10 @@ void free_swap_and_cache(swp_entry_t ent
>  		spin_unlock(&swap_lock);
>  	}
>  	if (page) {
> -		int one_user;
> -
> -		BUG_ON(PagePrivate(page));
> -		one_user = (page_count(page) == 2);
> -		/* Only cache user (+us), or swap space full? Free it! */
> +		/* Not mapped elsewhere, or swap space full? Free it! */
>  		/* Also recheck PageSwapCache after page is locked (above) */
>  		if (PageSwapCache(page) && !PageWriteback(page) &&
> -					(one_user || vm_swap_full())) {
> +				(!page_mapped(page) || vm_swap_full())) {
>  			delete_from_swap_cache(page);
>  			SetPageDirty(page);
>  		}

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2008-06-24 11:56 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-06-18 16:41 Can get_user_pages( ,write=1, force=1, ) result in a read-only pte and _count=2? Robin Holt
2008-06-18 17:29 ` Nick Piggin
2008-06-18 19:01   ` Hugh Dickins
2008-06-18 20:33     ` Robin Holt
2008-06-18 21:46       ` Hugh Dickins
2008-06-19  3:31         ` Nick Piggin
2008-06-19  3:34           ` Nick Piggin
2008-06-19 11:39           ` Hugh Dickins
2008-06-19 12:07             ` Nick Piggin
2008-06-19 12:21               ` Nick Piggin
2008-06-19 17:48                 ` Christoph Lameter
2008-06-19 12:34               ` Hugh Dickins
2008-06-19 12:53                 ` Nick Piggin
2008-06-19 13:25                   ` Hugh Dickins
2008-06-19 13:35                     ` Robin Holt
2008-06-19 16:32         ` Robin Holt
2008-06-20  9:23           ` Nick Piggin
2008-06-19  3:07     ` Nick Piggin
2008-06-19 11:09       ` Hugh Dickins
2008-06-19 13:38         ` Robin Holt
2008-06-19 13:49           ` Hugh Dickins
2008-06-23 15:54             ` Robin Holt
2008-06-23 16:48               ` Hugh Dickins
2008-06-23 17:52                 ` Robin Holt
2008-06-23 20:58                   ` Hugh Dickins
2008-06-24 11:56                     ` Robin Holt [this message]
2008-06-24 15:19                     ` Robin Holt
2008-06-24 20:19                       ` Hugh Dickins
2008-06-23 19:11             ` Robin Holt
2008-06-23 19:12               ` Robin Holt

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20080624115606.GJ10062@sgi.com \
    --to=holt@sgi.com \
    --cc=Lee.Schermerhorn@hp.com \
    --cc=clameter@sgi.com \
    --cc=hugh@veritas.com \
    --cc=linux-mm@kvack.org \
    --cc=mingo@elte.hu \
    --cc=nickpiggin@yahoo.com.au \
    --cc=riel@redhat.com \
    --cc=steiner@sgi.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.