Re: [RFC] mm:prototype for the updated swapoff implementation

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Rik van Riel <riel@surriel.com>
To: Kelley Nielsen <kelleynnn@gmail.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	opw-kernel@googlegroups.com, jamieliu@google.com,
	sjenning@linux.vnet.ibm.com
Subject: Re: [RFC] mm:prototype for the updated swapoff implementation
Date: Tue, 25 Feb 2014 15:14:48 -0500	[thread overview]
Message-ID: <530CF9B8.1020008@surriel.com> (raw)
In-Reply-To: <20140219003522.GA8887@kelleynnn-virtual-machine>

On 02/18/2014 07:35 PM, Kelley Nielsen wrote:
> The function try_to_unuse() is of quadratic complexity, with a lot of
> wasted effort. It unuses swap entries one by one, potentially iterating
> over all the page tables for all the processes in the system for each
> one.
> 
> This new proposed implementation of try_to_unuse simplifies its
> complexity to linear. It iterates over the system's mms once, unusing
> all the affected entries as it walks each set of page tables. It also
> makes similar changes to shmem_unuse.

Nice work.  After reading some related code, I think I may have found
one of the issues you are running into...

> TODO
> 
> * Find and correct where swap entries are being left behind
> * Probably related: handle case of remaining reference in try_to_unuse

It looks like the way you iterate over the mmlist may not be safe.

I believe you need the mm / prev_mm juggling, to make sure that you
never call mmput on an mm before dereferencing mm->mmlist.next.

Otherwise, the process holding the mm can exit during swapoff, and
swapoff will be the last user of the mm. Calling mmput will free
the mm_struct, after which the memory can be re-used for something
else.

In the best case, this can lead to swapoff continuing the scan at
another spot in the mmlist. At the worst case, the code can follow
a pointer to la-la land.

> * Remove find_next_to_unuse, and the call to it in try_to_unuse,
>   when the previous item has been resolved
> * Handle the failure of swapin_readahead in unuse_pte_range

When swapin_readahead fails to allocate memory, the swapoff operation
can be aborted.

When it runs into a swap slot that is no longer in use, swapoff can
continue.

> * make sure unuse_pte_range is handling multiple ptes in the best way

I would not worry about that for now. The reduction from quadratic
to linear complexity should be quite a performance boost :)

> * clean up after failure of unuse_pte in unuse_pte_range
> * Determine the proper place for the mmlist locks in try_to_unuse

The mmlist lock needs to be held until after mm->mmlist.next has
been dereferenced.

> * Handle count of unused pages for frontswap

This can probably be deferred till later.

> * Determine what kind of housekeeping shmem_unuse needs
> * Tighten up the access control for all the various data structures
>   (for instance, the mutex on shmem_swaplist is held throughout the
>   entire process, which is probably not only unneccesary but problematic)
> * Prevent radix entries with zero indices from being passed to
>   shmem_unuse_inode_index

Can you pass only exceptional radix tree entries?

> * Decide if shmem_unuse_inode* should be combined into one function

Probably :)

> * Find cases in which the errors returned from shmem_getpage_gfp can be
>   gracefully handled in shmem_unuse_inode_index, instead of just failing
> * determine when old comments and housekeeping are no longer needed
>   (there are still some to serve as reminders of the housekeeping that
>    needs to be accounted for)

It probably makes sense to not try to do everything at once, but send
in incrementally improved patches.

> ---
>  include/linux/shmem_fs.h |   2 +-
>  mm/shmem.c               | 146 +++++++-------------
>  mm/swapfile.c            | 352 ++++++++++++++++++++---------------------------
>  3 files changed, 206 insertions(+), 294 deletions(-)
> 
> diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> index 9d55438..af78151 100644
> --- a/include/linux/shmem_fs.h
> +++ b/include/linux/shmem_fs.h
> @@ -55,7 +55,7 @@ extern void shmem_unlock_mapping(struct address_space *mapping);
>  extern struct page *shmem_read_mapping_page_gfp(struct address_space *mapping,
>  					pgoff_t index, gfp_t gfp_mask);
>  extern void shmem_truncate_range(struct inode *inode, loff_t start, loff_t end);
> -extern int shmem_unuse(swp_entry_t entry, struct page *page);
> +extern int shmem_unuse(unsigned int type);
>  
>  static inline struct page *shmem_read_mapping_page(
>  				struct address_space *mapping, pgoff_t index)
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 1f18c9d..802456e 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -650,127 +650,87 @@ static void shmem_evict_inode(struct inode *inode)
>  /*
>   * If swap found in inode, free it and move page from swapcache to filecache.
>   */
> -static int shmem_unuse_inode(struct shmem_inode_info *info,
> -			     swp_entry_t swap, struct page **pagep)
> +/* TODO Since there's hardly anything left of this function
> + * now that the things shmem_getpage_gfp does have been removed,
> + * just incorporate its actions into shmem_unuse_inode?
> + */
> +static int shmem_unuse_inode_index(struct shmem_inode_info *info,
> +			     pgoff_t index)
>  {
>  	struct address_space *mapping = info->vfs_inode.i_mapping;
> -	void *radswap;
> -	pgoff_t index;
> +	struct page *pagep;
>  	gfp_t gfp;
>  	int error = 0;
>  
> -	radswap = swp_to_radix_entry(swap);
> -	index = radix_tree_locate_item(&mapping->page_tree, radswap);
> -	if (index == -1)
> -		return 0;
> -
> -	/*
> -	 * Move _head_ to start search for next from here.
> -	 * But be careful: shmem_evict_inode checks list_empty without taking
> -	 * mutex, and there's an instant in list_move_tail when info->swaplist
> -	 * would appear empty, if it were the only one on shmem_swaplist.
> +	gfp = mapping_gfp_mask(mapping);
> +	error = shmem_getpage_gfp(&info->vfs_inode, index, &pagep, SGP_CACHE,
> +			gfp, NULL);
> +	/* TODO: go through all the possible error returns
> +	 * in shmem_getpage_gfp, and determine whether
> +	 * we need to fail, or whether we can gracefully recover.
> +	 * (for instance, if the page was swapped in from somewhere
> +	 * else in the kernel between the start of swapoff and now,
> +	 * and can be safely let go.)
> +	 * For now, send failure up the call chain for all errors.
>  	 */
> -	if (shmem_swaplist.next != &info->swaplist)
> -		list_move_tail(&shmem_swaplist, &info->swaplist);
> +	return error;
> +}
>  
> -	gfp = mapping_gfp_mask(mapping);
> -	if (shmem_should_replace_page(*pagep, gfp)) {
> -		mutex_unlock(&shmem_swaplist_mutex);
> -		error = shmem_replace_page(pagep, gfp, info, index);
> -		mutex_lock(&shmem_swaplist_mutex);
> -		/*
> -		 * We needed to drop mutex to make that restrictive page
> -		 * allocation, but the inode might have been freed while we
> -		 * dropped it: although a racing shmem_evict_inode() cannot
> -		 * complete without emptying the radix_tree, our page lock
> -		 * on this swapcache page is not enough to prevent that -
> -		 * free_swap_and_cache() of our swap entry will only
> -		 * trylock_page(), removing swap from radix_tree whatever.
> -		 *
> -		 * We must not proceed to shmem_add_to_page_cache() if the
> -		 * inode has been freed, but of course we cannot rely on
> -		 * inode or mapping or info to check that.  However, we can
> -		 * safely check if our swap entry is still in use (and here
> -		 * it can't have got reused for another page): if it's still
> -		 * in use, then the inode cannot have been freed yet, and we
> -		 * can safely proceed (if it's no longer in use, that tells
> -		 * nothing about the inode, but we don't need to unuse swap).
> -		 */
> -		if (!page_swapcount(*pagep))
> -			error = -ENOENT;
> -	}
> +/* TODO some pages with a null index are slipping through
> + * and being passed to shmem_unuse_inode_index
> + */
> +static int shmem_unuse_inode(struct shmem_inode_info *info, unsigned int type){
> +	struct address_space *mapping = info->vfs_inode.i_mapping;
> +	void **slot;
> +	struct radix_tree_iter iter;
> +	int error = 0;
>  
> -	/*
> -	 * We rely on shmem_swaplist_mutex, not only to protect the swaplist,
> -	 * but also to hold up shmem_evict_inode(): so inode cannot be freed
> -	 * beneath us (pagelock doesn't help until the page is in pagecache).
> -	 */
> -	if (!error)
> -		error = shmem_add_to_page_cache(*pagep, mapping, index,
> -						GFP_NOWAIT, radswap);
> -	if (error != -ENOMEM) {
> -		/*
> -		 * Truncation and eviction use free_swap_and_cache(), which
> -		 * only does trylock page: if we raced, best clean up here.
> -		 */
> -		delete_from_swap_cache(*pagep);
> -		set_page_dirty(*pagep);
> -		if (!error) {
> -			spin_lock(&info->lock);
> -			info->swapped--;
> -			spin_unlock(&info->lock);
> -			swap_free(swap);
> +	rcu_read_lock();
> +	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, 0){
> +		struct page *page;
> +		pgoff_t index;
> +		swp_entry_t entry;
> +		unsigned int stype;
> +
> +		index = iter.index;
> +		page = radix_tree_deref_slot(slot);
> +		if (unlikely(!page))
> +			continue;
> +		if (radix_tree_exceptional_entry(page)) {
> +			entry = radix_to_swp_entry(page);
> +			stype = swp_type(entry);
> +			if (stype == type){
> +				error = shmem_unuse_inode_index(info, index);
> +			}
>  		}
> -		error = 1;	/* not an error, but entry was found */
> +		if (error)
> +		break;
>  	}
> +	rcu_read_unlock();
>  	return error;
>  }
>  
> -/*
> - * Search through swapped inodes to find and replace swap by page.
> +/* unuse all the shared memory swap entries that
> + * have backing store in the designated swap type.
>   */
> -int shmem_unuse(swp_entry_t swap, struct page *page)
> +int shmem_unuse(unsigned int type)
>  {
>  	struct list_head *this, *next;
>  	struct shmem_inode_info *info;
> -	int found = 0;
>  	int error = 0;
>  
> -	/*
> -	 * There's a faint possibility that swap page was replaced before
> -	 * caller locked it: caller will come back later with the right page.
> -	 */
> -	if (unlikely(!PageSwapCache(page) || page_private(page) != swap.val))
> -		goto out;
> -
> -	/*
> -	 * Charge page using GFP_KERNEL while we can wait, before taking
> -	 * the shmem_swaplist_mutex which might hold up shmem_writepage().
> -	 * Charged back to the user (not to caller) when swap account is used.
> -	 */
> -	error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL);
> -	if (error)
> -		goto out;
> -	/* No radix_tree_preload: swap entry keeps a place for page in tree */
> -
>  	mutex_lock(&shmem_swaplist_mutex);
>  	list_for_each_safe(this, next, &shmem_swaplist) {
>  		info = list_entry(this, struct shmem_inode_info, swaplist);
>  		if (info->swapped)
> -			found = shmem_unuse_inode(info, swap, &page);
> +			error = shmem_unuse_inode(info, type);
>  		else
>  			list_del_init(&info->swaplist);
>  		cond_resched();
> -		if (found)
> +		if (error)
>  			break;
>  	}
>  	mutex_unlock(&shmem_swaplist_mutex);
> -
> -	if (found < 0)
> -		error = found;
> -out:
> -	unlock_page(page);
> -	page_cache_release(page);
>  	return error;
>  }
>  
> @@ -2873,7 +2833,7 @@ int __init shmem_init(void)
>  	return 0;
>  }
>  
> -int shmem_unuse(swp_entry_t swap, struct page *page)
> +int shmem_unuse(unsigned int type)
>  {
>  	return 0;
>  }
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 4a7f7e6..b69e319 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -68,6 +68,9 @@ static DECLARE_WAIT_QUEUE_HEAD(proc_poll_wait);
>  /* Activity counter to indicate that a swapon or swapoff has occurred */
>  static atomic_t proc_poll_event = ATOMIC_INIT(0);
>  
> +/* count instances of unuse_pte for the changelog */
> +static long unusepte_calls = 0;
> +
>  static inline unsigned char swap_count(unsigned char ent)
>  {
>  	return ent & ~SWAP_HAS_CACHE;	/* may include SWAP_HAS_CONT flag */
> @@ -1167,13 +1170,18 @@ out_nolock:
>  
>  static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>  				unsigned long addr, unsigned long end,
> -				swp_entry_t entry, struct page *page)
> +				unsigned int type)
>  {
> -	pte_t swp_pte = swp_entry_to_pte(entry);
> +	struct page * page;
> +	swp_entry_t entry;
> +	unsigned int found_type;
>  	pte_t *pte;
>  	int ret = 0;
>  
> +	unusepte_calls++;
> +
>  	/*
> +	 * TODO comment left from original:
>  	 * We don't actually need pte lock while scanning for swp_pte: since
>  	 * we hold page lock and mmap_sem, swp_pte cannot be inserted into the
>  	 * page table while we're scanning; though it could get zapped, and on
> @@ -1184,16 +1192,71 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>  	 */
>  	pte = pte_offset_map(pmd, addr);
>  	do {
> +		if (is_swap_pte(*pte)){
> +			entry = pte_to_swp_entry(*pte);
> +			found_type = swp_type(entry);
> +		}
> +		else {
> +			continue;
> +		}
> +		if (found_type == type){
> +			entry = pte_to_swp_entry(*pte);
> +			page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> +					vma, addr);
> +			if (!page){
> +				/* TODO not sure yet what to do here, or
> +				 * how great the chance of the page
> +				 * not existing actually is.
> +				 * There is a comment in try_to_unuse
> +				 * about the page possibly being freed
> +				 * independently, etc
> +				 */
> +				printk("unuse_pte tried to swap in an invalid page\n");
> +				continue;
> +			}
>  		/*
> -		 * swapoff spends a _lot_ of time in this loop!
> -		 * Test inline before going to call unuse_pte.
> +		 * Wait for and lock page.  When do_swap_page races with
> +		 * try_to_unuse, do_swap_page can handle the fault much
> +		 * faster than try_to_unuse can locate the entry.  This
> +		 * apparently redundant "wait_on_page_locked" lets try_to_unuse
> +		 * defer to do_swap_page in such a case - in some tests,
> +		 * do_swap_page and try_to_unuse repeatedly compete.
>  		 */
> -		if (unlikely(maybe_same_pte(*pte, swp_pte))) {
> +			wait_on_page_locked(page);
> +			wait_on_page_writeback(page);
> +			lock_page(page);
> +			wait_on_page_writeback(page);
>  			pte_unmap(pte);
>  			ret = unuse_pte(vma, pmd, addr, entry, page);
> -			if (ret)
> -				goto out;
> -			pte = pte_offset_map(pmd, addr);
> +		/* TODO fix
> +		 * in the new way, we unuse
> +		 * all ptes in the range or fail before returning.
> +		 * For now, leave the return from unuse_pte as is,
> +		 * move on and unuse the next pte.
> +		 */
> +		if (ret < 1){
> +			/* TODO for now, we're just returning
> +			 * the error if unuse_pte fails.
> +			 * we need to clean up the allocated page,
> +			 * plus all the rest of the mess
> +			 */
> +			unlock_page(page);
> +			goto out;
> +		}
> +		/*
> +		 * TODO moved here from try_to_unuse--still relevant?:
> +		 * It is conceivable that a racing task removed this page from
> +		 * swap cache just before we acquired the page lock at the top,
> +		 * or while we dropped it in unuse_mm().  The page might even
> +		 * be back in swap cache on another swap area: that we must not
> +		 * delete, since it may not have been written out to swap yet.
> +		 */
> +		if (PageSwapCache(page) &&
> +		    likely(page_private(page) == entry.val))
> +			delete_from_swap_cache(page);
> +		SetPageDirty(page);
> +		unlock_page(page);
> +		page_cache_release(page);
>  		}
>  	} while (pte++, addr += PAGE_SIZE, addr != end);
>  	pte_unmap(pte - 1);
> @@ -1203,7 +1266,7 @@ out:
>  
>  static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
>  				unsigned long addr, unsigned long end,
> -				swp_entry_t entry, struct page *page)
> +				unsigned int type)
>  {
>  	pmd_t *pmd;
>  	unsigned long next;
> @@ -1214,8 +1277,8 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
>  		next = pmd_addr_end(addr, end);
>  		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
>  			continue;
> -		ret = unuse_pte_range(vma, pmd, addr, next, entry, page);
> -		if (ret)
> +		ret = unuse_pte_range(vma, pmd, addr, next, type);
> +		if (ret < 0)
>  			return ret;
>  	} while (pmd++, addr = next, addr != end);
>  	return 0;
> @@ -1223,7 +1286,7 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
>  
>  static inline int unuse_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
>  				unsigned long addr, unsigned long end,
> -				swp_entry_t entry, struct page *page)
> +				unsigned int type)
>  {
>  	pud_t *pud;
>  	unsigned long next;
> @@ -1234,67 +1297,52 @@ static inline int unuse_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
>  		next = pud_addr_end(addr, end);
>  		if (pud_none_or_clear_bad(pud))
>  			continue;
> -		ret = unuse_pmd_range(vma, pud, addr, next, entry, page);
> -		if (ret)
> +		ret = unuse_pmd_range(vma, pud, addr, next, type);
> +		if (ret < 0)
>  			return ret;
>  	} while (pud++, addr = next, addr != end);
>  	return 0;
>  }
>  
> -static int unuse_vma(struct vm_area_struct *vma,
> -				swp_entry_t entry, struct page *page)
> +static int unuse_vma(struct vm_area_struct *vma, unsigned int type)
>  {
>  	pgd_t *pgd;
>  	unsigned long addr, end, next;
>  	int ret;
>  
> -	if (page_anon_vma(page)) {
> -		addr = page_address_in_vma(page, vma);
> -		if (addr == -EFAULT)
> -			return 0;
> -		else
> -			end = addr + PAGE_SIZE;
> -	} else {
> -		addr = vma->vm_start;
> -		end = vma->vm_end;
> -	}
> +	addr = vma->vm_start;
> +	end = vma->vm_end;
>  
>  	pgd = pgd_offset(vma->vm_mm, addr);
>  	do {
>  		next = pgd_addr_end(addr, end);
>  		if (pgd_none_or_clear_bad(pgd))
>  			continue;
> -		ret = unuse_pud_range(vma, pgd, addr, next, entry, page);
> -		if (ret)
> +		ret = unuse_pud_range(vma, pgd, addr, next, type);
> +		if (ret < 0)
>  			return ret;
>  	} while (pgd++, addr = next, addr != end);
>  	return 0;
>  }
>  
> -static int unuse_mm(struct mm_struct *mm,
> -				swp_entry_t entry, struct page *page)
> +static int unuse_mm(struct mm_struct *mm, unsigned int type)
>  {
>  	struct vm_area_struct *vma;
>  	int ret = 0;
>  
> -	if (!down_read_trylock(&mm->mmap_sem)) {
> -		/*
> -		 * Activate page so shrink_inactive_list is unlikely to unmap
> -		 * its ptes while lock is dropped, so swapoff can make progress.
> -		 */
> -		activate_page(page);
> -		unlock_page(page);
> -		down_read(&mm->mmap_sem);
> -		lock_page(page);
> -	}
> +	down_read(&mm->mmap_sem);
>  	for (vma = mm->mmap; vma; vma = vma->vm_next) {
> -		if (vma->anon_vma && (ret = unuse_vma(vma, entry, page)))
> +		if (vma->anon_vma && (ret = unuse_vma(vma, type)))
>  			break;
>  	}
>  	up_read(&mm->mmap_sem);
>  	return (ret < 0)? ret: 0;
>  }
>  
> +/* TODO: this whole function is no longer necessary 
> + * useful for checking that the swap area is clean,
> + * so leaving until these changes are submitted
> + */
>  /*
>   * Scan swap_map (or frontswap_map if frontswap parameter is true)
>   * from current position to next entry still in use.
> @@ -1341,29 +1389,31 @@ static unsigned int find_next_to_unuse(struct swap_info_struct *si,
>  }
>  
>  /*
> - * We completely avoid races by reading each swap page in advance,
> - * and then search for the process using it.  All the necessary
> - * page table adjustments can then be made atomically.
> - *
>   * if the boolean frontswap is true, only unuse pages_to_unuse pages;
>   * pages_to_unuse==0 means all pages; ignored if frontswap is false
>   */
>  int try_to_unuse(unsigned int type, bool frontswap,
>  		 unsigned long pages_to_unuse)
>  {
> -	struct swap_info_struct *si = swap_info[type];
>  	struct mm_struct *start_mm;
> -	volatile unsigned char *swap_map; /* swap_map is accessed without
> -					   * locking. Mark it as volatile
> -					   * to prevent compiler doing
> -					   * something odd.
> -					   */
> -	unsigned char swcount;
> -	struct page *page;
> -	swp_entry_t entry;
> -	unsigned int i = 0;
> +	struct mm_struct *mm;
> +	struct list_head *p;
>  	int retval = 0;
>  
> +	/* TODO for checking if any entries are left
> +	 * after swapoff finishes
> +	 * for debug purposes, remove before submitting */
> +	struct swap_info_struct *si = swap_info[type];
> +	int i = 0;
> +
> +	/* TODO shmem_unuse needs its housekeeping
> +	 * exactly what needs to be done is not yet
> +	 * determined
> +	 */
> +	retval = shmem_unuse(type);
> +	if (retval)
> +		goto out;
> +
>  	/*
>  	 * When searching mms for an entry, a good strategy is to
>  	 * start at the first mm we freed the previous entry from
> @@ -1380,48 +1430,17 @@ int try_to_unuse(unsigned int type, bool frontswap,
>  	 */
>  	start_mm = &init_mm;
>  	atomic_inc(&init_mm.mm_users);
> +	p = &start_mm->mmlist;
>  
> -	/*
> -	 * Keep on scanning until all entries have gone.  Usually,
> -	 * one pass through swap_map is enough, but not necessarily:
> -	 * there are races when an instance of an entry might be missed.
> +	/* TODO: why do we protect the mmlist? (noob QUESTION)
> +	 * Where should the locks actually go?
>  	 */
> -	while ((i = find_next_to_unuse(si, i, frontswap)) != 0) {
> +	spin_lock(&mmlist_lock);
> +	while (!retval && (p = p->next) != &start_mm->mmlist) {
>  		if (signal_pending(current)) {
>  			retval = -EINTR;
>  			break;
>  		}
> -
> -		/*
> -		 * Get a page for the entry, using the existing swap
> -		 * cache page if there is one.  Otherwise, get a clean
> -		 * page and read the swap into it.
> -		 */
> -		swap_map = &si->swap_map[i];
> -		entry = swp_entry(type, i);
> -		page = read_swap_cache_async(entry,
> -					GFP_HIGHUSER_MOVABLE, NULL, 0);
> -		if (!page) {
> -			/*
> -			 * Either swap_duplicate() failed because entry
> -			 * has been freed independently, and will not be
> -			 * reused since sys_swapoff() already disabled
> -			 * allocation from here, or alloc_page() failed.
> -			 */
> -			swcount = *swap_map;
> -			/*
> -			 * We don't hold lock here, so the swap entry could be
> -			 * SWAP_MAP_BAD (when the cluster is discarding).
> -			 * Instead of fail out, We can just skip the swap
> -			 * entry because swapoff will wait for discarding
> -			 * finish anyway.
> -			 */
> -			if (!swcount || swcount == SWAP_MAP_BAD)
> -				continue;
> -			retval = -ENOMEM;
> -			break;
> -		}
> -
>  		/*
>  		 * Don't hold on to start_mm if it looks like exiting.
>  		 */
> @@ -1431,80 +1450,36 @@ int try_to_unuse(unsigned int type, bool frontswap,
>  			atomic_inc(&init_mm.mm_users);
>  		}
>  
> -		/*
> -		 * Wait for and lock page.  When do_swap_page races with
> -		 * try_to_unuse, do_swap_page can handle the fault much
> -		 * faster than try_to_unuse can locate the entry.  This
> -		 * apparently redundant "wait_on_page_locked" lets try_to_unuse
> -		 * defer to do_swap_page in such a case - in some tests,
> -		 * do_swap_page and try_to_unuse repeatedly compete.
> -		 */
> -		wait_on_page_locked(page);
> -		wait_on_page_writeback(page);
> -		lock_page(page);
> -		wait_on_page_writeback(page);
> +		mm = list_entry(p, struct mm_struct, mmlist);
> +		if (!atomic_inc_not_zero(&mm->mm_users))
> +			continue;
> +		spin_unlock(&mmlist_lock);
> +
> +		cond_resched();
> +
> +		retval = unuse_mm(mm, type);
> +		mmput(mm);
> +		if (retval)
> +			break;
>  
>  		/*
> -		 * Remove all references to entry.
> +		 * Make sure that we aren't completely killing
> +		 * interactive performance.
>  		 */
> -		swcount = *swap_map;
> -		if (swap_count(swcount) == SWAP_MAP_SHMEM) {
> -			retval = shmem_unuse(entry, page);
> -			/* page has already been unlocked and released */
> -			if (retval < 0)
> +		cond_resched();
> +		/* TODO we need another way to count these,
> +		 * because we will now be unusing all an mm's pages
> +		 * on each pass through the loop
> +		 * Ignoring frontswap for now
> +		 */
> +		if (frontswap && pages_to_unuse > 0) {
> +			if (!--pages_to_unuse)
>  				break;
> -			continue;
>  		}
> -		if (swap_count(swcount) && start_mm != &init_mm)
> -			retval = unuse_mm(start_mm, entry, page);
> -
> -		if (swap_count(*swap_map)) {
> -			int set_start_mm = (*swap_map >= swcount);
> -			struct list_head *p = &start_mm->mmlist;
> -			struct mm_struct *new_start_mm = start_mm;
> -			struct mm_struct *prev_mm = start_mm;
> -			struct mm_struct *mm;
> -
> -			atomic_inc(&new_start_mm->mm_users);
> -			atomic_inc(&prev_mm->mm_users);
> -			spin_lock(&mmlist_lock);
> -			while (swap_count(*swap_map) && !retval &&
> -					(p = p->next) != &start_mm->mmlist) {
> -				mm = list_entry(p, struct mm_struct, mmlist);
> -				if (!atomic_inc_not_zero(&mm->mm_users))
> -					continue;
> -				spin_unlock(&mmlist_lock);
> -				mmput(prev_mm);
> -				prev_mm = mm;
>  
> -				cond_resched();
> -
> -				swcount = *swap_map;
> -				if (!swap_count(swcount)) /* any usage ? */
> -					;
> -				else if (mm == &init_mm)
> -					set_start_mm = 1;
> -				else
> -					retval = unuse_mm(mm, entry, page);
> -
> -				if (set_start_mm && *swap_map < swcount) {
> -					mmput(new_start_mm);
> -					atomic_inc(&mm->mm_users);
> -					new_start_mm = mm;
> -					set_start_mm = 0;
> -				}
> -				spin_lock(&mmlist_lock);
> -			}
> -			spin_unlock(&mmlist_lock);
> -			mmput(prev_mm);
> -			mmput(start_mm);
> -			start_mm = new_start_mm;
> -		}
> -		if (retval) {
> -			unlock_page(page);
> -			page_cache_release(page);
> -			break;
> -		}
> +		spin_lock(&mmlist_lock);
> +	}
> +	spin_unlock(&mmlist_lock);
>  
>  		/*
>  		 * If a reference remains (rare), we would like to leave
> @@ -1524,50 +1499,27 @@ int try_to_unuse(unsigned int type, bool frontswap,
>  		 * this splitting happens to be just what is needed to
>  		 * handle where KSM pages have been swapped out: re-reading
>  		 * is unnecessarily slow, but we can fix that later on.
> +		 * TODO move this to unuse_pte_range?
>  		 */
> -		if (swap_count(*swap_map) &&
> -		     PageDirty(page) && PageSwapCache(page)) {
> -			struct writeback_control wbc = {
> -				.sync_mode = WB_SYNC_NONE,
> -			};
> -
> -			swap_writepage(page, &wbc);
> -			lock_page(page);
> -			wait_on_page_writeback(page);
> -		}
> -
> -		/*
> -		 * It is conceivable that a racing task removed this page from
> -		 * swap cache just before we acquired the page lock at the top,
> -		 * or while we dropped it in unuse_mm().  The page might even
> -		 * be back in swap cache on another swap area: that we must not
> -		 * delete, since it may not have been written out to swap yet.
> -		 */
> -		if (PageSwapCache(page) &&
> -		    likely(page_private(page) == entry.val))
> -			delete_from_swap_cache(page);
> -
> -		/*
> -		 * So we could skip searching mms once swap count went
> -		 * to 1, we did not mark any present ptes as dirty: must
> -		 * mark page dirty so shrink_page_list will preserve it.
> -		 */
> -		SetPageDirty(page);
> -		unlock_page(page);
> -		page_cache_release(page);
> -
> -		/*
> -		 * Make sure that we aren't completely killing
> -		 * interactive performance.
> -		 */
> -		cond_resched();
> -		if (frontswap && pages_to_unuse > 0) {
> -			if (!--pages_to_unuse)
> -				break;
> -		}
> -	}
> +/*		if (swap_count(*swap_map) &&
> +*		     PageDirty(page) && PageSwapCache(page)) {
> +*			struct writeback_control wbc = {
> +*				.sync_mode = WB_SYNC_NONE,
> +*			};
> +*
> +*			swap_writepage(page, &wbc);
> +*			lock_page(page);
> +*			wait_on_page_writeback(page);
> +*		}
> +*/
>  
> +	/* TODO check if there are any swap entries we failed to clean up. */
> +	if ((i = find_next_to_unuse(si, i, frontswap)) != 0)
> +		printk("swap entries remain, type not clean\n");
> +	printk("Leaving try_to_unuse\n");
> +	printk("Calls made to unuse_pte: %lu\n", unusepte_calls);
>  	mmput(start_mm);
> +out:
>  	return retval;
>  }
>  
> 


-- 
All rights reversed.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

WARNING: multiple messages have this Message-ID (diff)

From: Rik van Riel <riel@surriel.com>
To: Kelley Nielsen <kelleynnn@gmail.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	opw-kernel@googlegroups.com, jamieliu@google.com,
	sjenning@linux.vnet.ibm.com
Subject: Re: [RFC] mm:prototype for the updated swapoff implementation
Date: Tue, 25 Feb 2014 15:14:48 -0500	[thread overview]
Message-ID: <530CF9B8.1020008@surriel.com> (raw)
In-Reply-To: <20140219003522.GA8887@kelleynnn-virtual-machine>

On 02/18/2014 07:35 PM, Kelley Nielsen wrote:
> The function try_to_unuse() is of quadratic complexity, with a lot of
> wasted effort. It unuses swap entries one by one, potentially iterating
> over all the page tables for all the processes in the system for each
> one.
> 
> This new proposed implementation of try_to_unuse simplifies its
> complexity to linear. It iterates over the system's mms once, unusing
> all the affected entries as it walks each set of page tables. It also
> makes similar changes to shmem_unuse.

Nice work.  After reading some related code, I think I may have found
one of the issues you are running into...

> TODO
> 
> * Find and correct where swap entries are being left behind
> * Probably related: handle case of remaining reference in try_to_unuse

It looks like the way you iterate over the mmlist may not be safe.

I believe you need the mm / prev_mm juggling, to make sure that you
never call mmput on an mm before dereferencing mm->mmlist.next.

Otherwise, the process holding the mm can exit during swapoff, and
swapoff will be the last user of the mm. Calling mmput will free
the mm_struct, after which the memory can be re-used for something
else.

In the best case, this can lead to swapoff continuing the scan at
another spot in the mmlist. At the worst case, the code can follow
a pointer to la-la land.

> * Remove find_next_to_unuse, and the call to it in try_to_unuse,
>   when the previous item has been resolved
> * Handle the failure of swapin_readahead in unuse_pte_range

When swapin_readahead fails to allocate memory, the swapoff operation
can be aborted.

When it runs into a swap slot that is no longer in use, swapoff can
continue.

> * make sure unuse_pte_range is handling multiple ptes in the best way

I would not worry about that for now. The reduction from quadratic
to linear complexity should be quite a performance boost :)

> * clean up after failure of unuse_pte in unuse_pte_range
> * Determine the proper place for the mmlist locks in try_to_unuse

The mmlist lock needs to be held until after mm->mmlist.next has
been dereferenced.

> * Handle count of unused pages for frontswap

This can probably be deferred till later.

> * Determine what kind of housekeeping shmem_unuse needs
> * Tighten up the access control for all the various data structures
>   (for instance, the mutex on shmem_swaplist is held throughout the
>   entire process, which is probably not only unneccesary but problematic)
> * Prevent radix entries with zero indices from being passed to
>   shmem_unuse_inode_index

Can you pass only exceptional radix tree entries?

> * Decide if shmem_unuse_inode* should be combined into one function

Probably :)

> * Find cases in which the errors returned from shmem_getpage_gfp can be
>   gracefully handled in shmem_unuse_inode_index, instead of just failing
> * determine when old comments and housekeeping are no longer needed
>   (there are still some to serve as reminders of the housekeeping that
>    needs to be accounted for)

It probably makes sense to not try to do everything at once, but send
in incrementally improved patches.

> ---
>  include/linux/shmem_fs.h |   2 +-
>  mm/shmem.c               | 146 +++++++-------------
>  mm/swapfile.c            | 352 ++++++++++++++++++++---------------------------
>  3 files changed, 206 insertions(+), 294 deletions(-)
> 
> diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> index 9d55438..af78151 100644
> --- a/include/linux/shmem_fs.h
> +++ b/include/linux/shmem_fs.h
> @@ -55,7 +55,7 @@ extern void shmem_unlock_mapping(struct address_space *mapping);
>  extern struct page *shmem_read_mapping_page_gfp(struct address_space *mapping,
>  					pgoff_t index, gfp_t gfp_mask);
>  extern void shmem_truncate_range(struct inode *inode, loff_t start, loff_t end);
> -extern int shmem_unuse(swp_entry_t entry, struct page *page);
> +extern int shmem_unuse(unsigned int type);
>  
>  static inline struct page *shmem_read_mapping_page(
>  				struct address_space *mapping, pgoff_t index)
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 1f18c9d..802456e 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -650,127 +650,87 @@ static void shmem_evict_inode(struct inode *inode)
>  /*
>   * If swap found in inode, free it and move page from swapcache to filecache.
>   */
> -static int shmem_unuse_inode(struct shmem_inode_info *info,
> -			     swp_entry_t swap, struct page **pagep)
> +/* TODO Since there's hardly anything left of this function
> + * now that the things shmem_getpage_gfp does have been removed,
> + * just incorporate its actions into shmem_unuse_inode?
> + */
> +static int shmem_unuse_inode_index(struct shmem_inode_info *info,
> +			     pgoff_t index)
>  {
>  	struct address_space *mapping = info->vfs_inode.i_mapping;
> -	void *radswap;
> -	pgoff_t index;
> +	struct page *pagep;
>  	gfp_t gfp;
>  	int error = 0;
>  
> -	radswap = swp_to_radix_entry(swap);
> -	index = radix_tree_locate_item(&mapping->page_tree, radswap);
> -	if (index == -1)
> -		return 0;
> -
> -	/*
> -	 * Move _head_ to start search for next from here.
> -	 * But be careful: shmem_evict_inode checks list_empty without taking
> -	 * mutex, and there's an instant in list_move_tail when info->swaplist
> -	 * would appear empty, if it were the only one on shmem_swaplist.
> +	gfp = mapping_gfp_mask(mapping);
> +	error = shmem_getpage_gfp(&info->vfs_inode, index, &pagep, SGP_CACHE,
> +			gfp, NULL);
> +	/* TODO: go through all the possible error returns
> +	 * in shmem_getpage_gfp, and determine whether
> +	 * we need to fail, or whether we can gracefully recover.
> +	 * (for instance, if the page was swapped in from somewhere
> +	 * else in the kernel between the start of swapoff and now,
> +	 * and can be safely let go.)
> +	 * For now, send failure up the call chain for all errors.
>  	 */
> -	if (shmem_swaplist.next != &info->swaplist)
> -		list_move_tail(&shmem_swaplist, &info->swaplist);
> +	return error;
> +}
>  
> -	gfp = mapping_gfp_mask(mapping);
> -	if (shmem_should_replace_page(*pagep, gfp)) {
> -		mutex_unlock(&shmem_swaplist_mutex);
> -		error = shmem_replace_page(pagep, gfp, info, index);
> -		mutex_lock(&shmem_swaplist_mutex);
> -		/*
> -		 * We needed to drop mutex to make that restrictive page
> -		 * allocation, but the inode might have been freed while we
> -		 * dropped it: although a racing shmem_evict_inode() cannot
> -		 * complete without emptying the radix_tree, our page lock
> -		 * on this swapcache page is not enough to prevent that -
> -		 * free_swap_and_cache() of our swap entry will only
> -		 * trylock_page(), removing swap from radix_tree whatever.
> -		 *
> -		 * We must not proceed to shmem_add_to_page_cache() if the
> -		 * inode has been freed, but of course we cannot rely on
> -		 * inode or mapping or info to check that.  However, we can
> -		 * safely check if our swap entry is still in use (and here
> -		 * it can't have got reused for another page): if it's still
> -		 * in use, then the inode cannot have been freed yet, and we
> -		 * can safely proceed (if it's no longer in use, that tells
> -		 * nothing about the inode, but we don't need to unuse swap).
> -		 */
> -		if (!page_swapcount(*pagep))
> -			error = -ENOENT;
> -	}
> +/* TODO some pages with a null index are slipping through
> + * and being passed to shmem_unuse_inode_index
> + */
> +static int shmem_unuse_inode(struct shmem_inode_info *info, unsigned int type){
> +	struct address_space *mapping = info->vfs_inode.i_mapping;
> +	void **slot;
> +	struct radix_tree_iter iter;
> +	int error = 0;
>  
> -	/*
> -	 * We rely on shmem_swaplist_mutex, not only to protect the swaplist,
> -	 * but also to hold up shmem_evict_inode(): so inode cannot be freed
> -	 * beneath us (pagelock doesn't help until the page is in pagecache).
> -	 */
> -	if (!error)
> -		error = shmem_add_to_page_cache(*pagep, mapping, index,
> -						GFP_NOWAIT, radswap);
> -	if (error != -ENOMEM) {
> -		/*
> -		 * Truncation and eviction use free_swap_and_cache(), which
> -		 * only does trylock page: if we raced, best clean up here.
> -		 */
> -		delete_from_swap_cache(*pagep);
> -		set_page_dirty(*pagep);
> -		if (!error) {
> -			spin_lock(&info->lock);
> -			info->swapped--;
> -			spin_unlock(&info->lock);
> -			swap_free(swap);
> +	rcu_read_lock();
> +	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, 0){
> +		struct page *page;
> +		pgoff_t index;
> +		swp_entry_t entry;
> +		unsigned int stype;
> +
> +		index = iter.index;
> +		page = radix_tree_deref_slot(slot);
> +		if (unlikely(!page))
> +			continue;
> +		if (radix_tree_exceptional_entry(page)) {
> +			entry = radix_to_swp_entry(page);
> +			stype = swp_type(entry);
> +			if (stype == type){
> +				error = shmem_unuse_inode_index(info, index);
> +			}
>  		}
> -		error = 1;	/* not an error, but entry was found */
> +		if (error)
> +		break;
>  	}
> +	rcu_read_unlock();
>  	return error;
>  }
>  
> -/*
> - * Search through swapped inodes to find and replace swap by page.
> +/* unuse all the shared memory swap entries that
> + * have backing store in the designated swap type.
>   */
> -int shmem_unuse(swp_entry_t swap, struct page *page)
> +int shmem_unuse(unsigned int type)
>  {
>  	struct list_head *this, *next;
>  	struct shmem_inode_info *info;
> -	int found = 0;
>  	int error = 0;
>  
> -	/*
> -	 * There's a faint possibility that swap page was replaced before
> -	 * caller locked it: caller will come back later with the right page.
> -	 */
> -	if (unlikely(!PageSwapCache(page) || page_private(page) != swap.val))
> -		goto out;
> -
> -	/*
> -	 * Charge page using GFP_KERNEL while we can wait, before taking
> -	 * the shmem_swaplist_mutex which might hold up shmem_writepage().
> -	 * Charged back to the user (not to caller) when swap account is used.
> -	 */
> -	error = mem_cgroup_cache_charge(page, current->mm, GFP_KERNEL);
> -	if (error)
> -		goto out;
> -	/* No radix_tree_preload: swap entry keeps a place for page in tree */
> -
>  	mutex_lock(&shmem_swaplist_mutex);
>  	list_for_each_safe(this, next, &shmem_swaplist) {
>  		info = list_entry(this, struct shmem_inode_info, swaplist);
>  		if (info->swapped)
> -			found = shmem_unuse_inode(info, swap, &page);
> +			error = shmem_unuse_inode(info, type);
>  		else
>  			list_del_init(&info->swaplist);
>  		cond_resched();
> -		if (found)
> +		if (error)
>  			break;
>  	}
>  	mutex_unlock(&shmem_swaplist_mutex);
> -
> -	if (found < 0)
> -		error = found;
> -out:
> -	unlock_page(page);
> -	page_cache_release(page);
>  	return error;
>  }
>  
> @@ -2873,7 +2833,7 @@ int __init shmem_init(void)
>  	return 0;
>  }
>  
> -int shmem_unuse(swp_entry_t swap, struct page *page)
> +int shmem_unuse(unsigned int type)
>  {
>  	return 0;
>  }
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 4a7f7e6..b69e319 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -68,6 +68,9 @@ static DECLARE_WAIT_QUEUE_HEAD(proc_poll_wait);
>  /* Activity counter to indicate that a swapon or swapoff has occurred */
>  static atomic_t proc_poll_event = ATOMIC_INIT(0);
>  
> +/* count instances of unuse_pte for the changelog */
> +static long unusepte_calls = 0;
> +
>  static inline unsigned char swap_count(unsigned char ent)
>  {
>  	return ent & ~SWAP_HAS_CACHE;	/* may include SWAP_HAS_CONT flag */
> @@ -1167,13 +1170,18 @@ out_nolock:
>  
>  static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>  				unsigned long addr, unsigned long end,
> -				swp_entry_t entry, struct page *page)
> +				unsigned int type)
>  {
> -	pte_t swp_pte = swp_entry_to_pte(entry);
> +	struct page * page;
> +	swp_entry_t entry;
> +	unsigned int found_type;
>  	pte_t *pte;
>  	int ret = 0;
>  
> +	unusepte_calls++;
> +
>  	/*
> +	 * TODO comment left from original:
>  	 * We don't actually need pte lock while scanning for swp_pte: since
>  	 * we hold page lock and mmap_sem, swp_pte cannot be inserted into the
>  	 * page table while we're scanning; though it could get zapped, and on
> @@ -1184,16 +1192,71 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>  	 */
>  	pte = pte_offset_map(pmd, addr);
>  	do {
> +		if (is_swap_pte(*pte)){
> +			entry = pte_to_swp_entry(*pte);
> +			found_type = swp_type(entry);
> +		}
> +		else {
> +			continue;
> +		}
> +		if (found_type == type){
> +			entry = pte_to_swp_entry(*pte);
> +			page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> +					vma, addr);
> +			if (!page){
> +				/* TODO not sure yet what to do here, or
> +				 * how great the chance of the page
> +				 * not existing actually is.
> +				 * There is a comment in try_to_unuse
> +				 * about the page possibly being freed
> +				 * independently, etc
> +				 */
> +				printk("unuse_pte tried to swap in an invalid page\n");
> +				continue;
> +			}
>  		/*
> -		 * swapoff spends a _lot_ of time in this loop!
> -		 * Test inline before going to call unuse_pte.
> +		 * Wait for and lock page.  When do_swap_page races with
> +		 * try_to_unuse, do_swap_page can handle the fault much
> +		 * faster than try_to_unuse can locate the entry.  This
> +		 * apparently redundant "wait_on_page_locked" lets try_to_unuse
> +		 * defer to do_swap_page in such a case - in some tests,
> +		 * do_swap_page and try_to_unuse repeatedly compete.
>  		 */
> -		if (unlikely(maybe_same_pte(*pte, swp_pte))) {
> +			wait_on_page_locked(page);
> +			wait_on_page_writeback(page);
> +			lock_page(page);
> +			wait_on_page_writeback(page);
>  			pte_unmap(pte);
>  			ret = unuse_pte(vma, pmd, addr, entry, page);
> -			if (ret)
> -				goto out;
> -			pte = pte_offset_map(pmd, addr);
> +		/* TODO fix
> +		 * in the new way, we unuse
> +		 * all ptes in the range or fail before returning.
> +		 * For now, leave the return from unuse_pte as is,
> +		 * move on and unuse the next pte.
> +		 */
> +		if (ret < 1){
> +			/* TODO for now, we're just returning
> +			 * the error if unuse_pte fails.
> +			 * we need to clean up the allocated page,
> +			 * plus all the rest of the mess
> +			 */
> +			unlock_page(page);
> +			goto out;
> +		}
> +		/*
> +		 * TODO moved here from try_to_unuse--still relevant?:
> +		 * It is conceivable that a racing task removed this page from
> +		 * swap cache just before we acquired the page lock at the top,
> +		 * or while we dropped it in unuse_mm().  The page might even
> +		 * be back in swap cache on another swap area: that we must not
> +		 * delete, since it may not have been written out to swap yet.
> +		 */
> +		if (PageSwapCache(page) &&
> +		    likely(page_private(page) == entry.val))
> +			delete_from_swap_cache(page);
> +		SetPageDirty(page);
> +		unlock_page(page);
> +		page_cache_release(page);
>  		}
>  	} while (pte++, addr += PAGE_SIZE, addr != end);
>  	pte_unmap(pte - 1);
> @@ -1203,7 +1266,7 @@ out:
>  
>  static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
>  				unsigned long addr, unsigned long end,
> -				swp_entry_t entry, struct page *page)
> +				unsigned int type)
>  {
>  	pmd_t *pmd;
>  	unsigned long next;
> @@ -1214,8 +1277,8 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
>  		next = pmd_addr_end(addr, end);
>  		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
>  			continue;
> -		ret = unuse_pte_range(vma, pmd, addr, next, entry, page);
> -		if (ret)
> +		ret = unuse_pte_range(vma, pmd, addr, next, type);
> +		if (ret < 0)
>  			return ret;
>  	} while (pmd++, addr = next, addr != end);
>  	return 0;
> @@ -1223,7 +1286,7 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
>  
>  static inline int unuse_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
>  				unsigned long addr, unsigned long end,
> -				swp_entry_t entry, struct page *page)
> +				unsigned int type)
>  {
>  	pud_t *pud;
>  	unsigned long next;
> @@ -1234,67 +1297,52 @@ static inline int unuse_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
>  		next = pud_addr_end(addr, end);
>  		if (pud_none_or_clear_bad(pud))
>  			continue;
> -		ret = unuse_pmd_range(vma, pud, addr, next, entry, page);
> -		if (ret)
> +		ret = unuse_pmd_range(vma, pud, addr, next, type);
> +		if (ret < 0)
>  			return ret;
>  	} while (pud++, addr = next, addr != end);
>  	return 0;
>  }
>  
> -static int unuse_vma(struct vm_area_struct *vma,
> -				swp_entry_t entry, struct page *page)
> +static int unuse_vma(struct vm_area_struct *vma, unsigned int type)
>  {
>  	pgd_t *pgd;
>  	unsigned long addr, end, next;
>  	int ret;
>  
> -	if (page_anon_vma(page)) {
> -		addr = page_address_in_vma(page, vma);
> -		if (addr == -EFAULT)
> -			return 0;
> -		else
> -			end = addr + PAGE_SIZE;
> -	} else {
> -		addr = vma->vm_start;
> -		end = vma->vm_end;
> -	}
> +	addr = vma->vm_start;
> +	end = vma->vm_end;
>  
>  	pgd = pgd_offset(vma->vm_mm, addr);
>  	do {
>  		next = pgd_addr_end(addr, end);
>  		if (pgd_none_or_clear_bad(pgd))
>  			continue;
> -		ret = unuse_pud_range(vma, pgd, addr, next, entry, page);
> -		if (ret)
> +		ret = unuse_pud_range(vma, pgd, addr, next, type);
> +		if (ret < 0)
>  			return ret;
>  	} while (pgd++, addr = next, addr != end);
>  	return 0;
>  }
>  
> -static int unuse_mm(struct mm_struct *mm,
> -				swp_entry_t entry, struct page *page)
> +static int unuse_mm(struct mm_struct *mm, unsigned int type)
>  {
>  	struct vm_area_struct *vma;
>  	int ret = 0;
>  
> -	if (!down_read_trylock(&mm->mmap_sem)) {
> -		/*
> -		 * Activate page so shrink_inactive_list is unlikely to unmap
> -		 * its ptes while lock is dropped, so swapoff can make progress.
> -		 */
> -		activate_page(page);
> -		unlock_page(page);
> -		down_read(&mm->mmap_sem);
> -		lock_page(page);
> -	}
> +	down_read(&mm->mmap_sem);
>  	for (vma = mm->mmap; vma; vma = vma->vm_next) {
> -		if (vma->anon_vma && (ret = unuse_vma(vma, entry, page)))
> +		if (vma->anon_vma && (ret = unuse_vma(vma, type)))
>  			break;
>  	}
>  	up_read(&mm->mmap_sem);
>  	return (ret < 0)? ret: 0;
>  }
>  
> +/* TODO: this whole function is no longer necessary 
> + * useful for checking that the swap area is clean,
> + * so leaving until these changes are submitted
> + */
>  /*
>   * Scan swap_map (or frontswap_map if frontswap parameter is true)
>   * from current position to next entry still in use.
> @@ -1341,29 +1389,31 @@ static unsigned int find_next_to_unuse(struct swap_info_struct *si,
>  }
>  
>  /*
> - * We completely avoid races by reading each swap page in advance,
> - * and then search for the process using it.  All the necessary
> - * page table adjustments can then be made atomically.
> - *
>   * if the boolean frontswap is true, only unuse pages_to_unuse pages;
>   * pages_to_unuse==0 means all pages; ignored if frontswap is false
>   */
>  int try_to_unuse(unsigned int type, bool frontswap,
>  		 unsigned long pages_to_unuse)
>  {
> -	struct swap_info_struct *si = swap_info[type];
>  	struct mm_struct *start_mm;
> -	volatile unsigned char *swap_map; /* swap_map is accessed without
> -					   * locking. Mark it as volatile
> -					   * to prevent compiler doing
> -					   * something odd.
> -					   */
> -	unsigned char swcount;
> -	struct page *page;
> -	swp_entry_t entry;
> -	unsigned int i = 0;
> +	struct mm_struct *mm;
> +	struct list_head *p;
>  	int retval = 0;
>  
> +	/* TODO for checking if any entries are left
> +	 * after swapoff finishes
> +	 * for debug purposes, remove before submitting */
> +	struct swap_info_struct *si = swap_info[type];
> +	int i = 0;
> +
> +	/* TODO shmem_unuse needs its housekeeping
> +	 * exactly what needs to be done is not yet
> +	 * determined
> +	 */
> +	retval = shmem_unuse(type);
> +	if (retval)
> +		goto out;
> +
>  	/*
>  	 * When searching mms for an entry, a good strategy is to
>  	 * start at the first mm we freed the previous entry from
> @@ -1380,48 +1430,17 @@ int try_to_unuse(unsigned int type, bool frontswap,
>  	 */
>  	start_mm = &init_mm;
>  	atomic_inc(&init_mm.mm_users);
> +	p = &start_mm->mmlist;
>  
> -	/*
> -	 * Keep on scanning until all entries have gone.  Usually,
> -	 * one pass through swap_map is enough, but not necessarily:
> -	 * there are races when an instance of an entry might be missed.
> +	/* TODO: why do we protect the mmlist? (noob QUESTION)
> +	 * Where should the locks actually go?
>  	 */
> -	while ((i = find_next_to_unuse(si, i, frontswap)) != 0) {
> +	spin_lock(&mmlist_lock);
> +	while (!retval && (p = p->next) != &start_mm->mmlist) {
>  		if (signal_pending(current)) {
>  			retval = -EINTR;
>  			break;
>  		}
> -
> -		/*
> -		 * Get a page for the entry, using the existing swap
> -		 * cache page if there is one.  Otherwise, get a clean
> -		 * page and read the swap into it.
> -		 */
> -		swap_map = &si->swap_map[i];
> -		entry = swp_entry(type, i);
> -		page = read_swap_cache_async(entry,
> -					GFP_HIGHUSER_MOVABLE, NULL, 0);
> -		if (!page) {
> -			/*
> -			 * Either swap_duplicate() failed because entry
> -			 * has been freed independently, and will not be
> -			 * reused since sys_swapoff() already disabled
> -			 * allocation from here, or alloc_page() failed.
> -			 */
> -			swcount = *swap_map;
> -			/*
> -			 * We don't hold lock here, so the swap entry could be
> -			 * SWAP_MAP_BAD (when the cluster is discarding).
> -			 * Instead of fail out, We can just skip the swap
> -			 * entry because swapoff will wait for discarding
> -			 * finish anyway.
> -			 */
> -			if (!swcount || swcount == SWAP_MAP_BAD)
> -				continue;
> -			retval = -ENOMEM;
> -			break;
> -		}
> -
>  		/*
>  		 * Don't hold on to start_mm if it looks like exiting.
>  		 */
> @@ -1431,80 +1450,36 @@ int try_to_unuse(unsigned int type, bool frontswap,
>  			atomic_inc(&init_mm.mm_users);
>  		}
>  
> -		/*
> -		 * Wait for and lock page.  When do_swap_page races with
> -		 * try_to_unuse, do_swap_page can handle the fault much
> -		 * faster than try_to_unuse can locate the entry.  This
> -		 * apparently redundant "wait_on_page_locked" lets try_to_unuse
> -		 * defer to do_swap_page in such a case - in some tests,
> -		 * do_swap_page and try_to_unuse repeatedly compete.
> -		 */
> -		wait_on_page_locked(page);
> -		wait_on_page_writeback(page);
> -		lock_page(page);
> -		wait_on_page_writeback(page);
> +		mm = list_entry(p, struct mm_struct, mmlist);
> +		if (!atomic_inc_not_zero(&mm->mm_users))
> +			continue;
> +		spin_unlock(&mmlist_lock);
> +
> +		cond_resched();
> +
> +		retval = unuse_mm(mm, type);
> +		mmput(mm);
> +		if (retval)
> +			break;
>  
>  		/*
> -		 * Remove all references to entry.
> +		 * Make sure that we aren't completely killing
> +		 * interactive performance.
>  		 */
> -		swcount = *swap_map;
> -		if (swap_count(swcount) == SWAP_MAP_SHMEM) {
> -			retval = shmem_unuse(entry, page);
> -			/* page has already been unlocked and released */
> -			if (retval < 0)
> +		cond_resched();
> +		/* TODO we need another way to count these,
> +		 * because we will now be unusing all an mm's pages
> +		 * on each pass through the loop
> +		 * Ignoring frontswap for now
> +		 */
> +		if (frontswap && pages_to_unuse > 0) {
> +			if (!--pages_to_unuse)
>  				break;
> -			continue;
>  		}
> -		if (swap_count(swcount) && start_mm != &init_mm)
> -			retval = unuse_mm(start_mm, entry, page);
> -
> -		if (swap_count(*swap_map)) {
> -			int set_start_mm = (*swap_map >= swcount);
> -			struct list_head *p = &start_mm->mmlist;
> -			struct mm_struct *new_start_mm = start_mm;
> -			struct mm_struct *prev_mm = start_mm;
> -			struct mm_struct *mm;
> -
> -			atomic_inc(&new_start_mm->mm_users);
> -			atomic_inc(&prev_mm->mm_users);
> -			spin_lock(&mmlist_lock);
> -			while (swap_count(*swap_map) && !retval &&
> -					(p = p->next) != &start_mm->mmlist) {
> -				mm = list_entry(p, struct mm_struct, mmlist);
> -				if (!atomic_inc_not_zero(&mm->mm_users))
> -					continue;
> -				spin_unlock(&mmlist_lock);
> -				mmput(prev_mm);
> -				prev_mm = mm;
>  
> -				cond_resched();
> -
> -				swcount = *swap_map;
> -				if (!swap_count(swcount)) /* any usage ? */
> -					;
> -				else if (mm == &init_mm)
> -					set_start_mm = 1;
> -				else
> -					retval = unuse_mm(mm, entry, page);
> -
> -				if (set_start_mm && *swap_map < swcount) {
> -					mmput(new_start_mm);
> -					atomic_inc(&mm->mm_users);
> -					new_start_mm = mm;
> -					set_start_mm = 0;
> -				}
> -				spin_lock(&mmlist_lock);
> -			}
> -			spin_unlock(&mmlist_lock);
> -			mmput(prev_mm);
> -			mmput(start_mm);
> -			start_mm = new_start_mm;
> -		}
> -		if (retval) {
> -			unlock_page(page);
> -			page_cache_release(page);
> -			break;
> -		}
> +		spin_lock(&mmlist_lock);
> +	}
> +	spin_unlock(&mmlist_lock);
>  
>  		/*
>  		 * If a reference remains (rare), we would like to leave
> @@ -1524,50 +1499,27 @@ int try_to_unuse(unsigned int type, bool frontswap,
>  		 * this splitting happens to be just what is needed to
>  		 * handle where KSM pages have been swapped out: re-reading
>  		 * is unnecessarily slow, but we can fix that later on.
> +		 * TODO move this to unuse_pte_range?
>  		 */
> -		if (swap_count(*swap_map) &&
> -		     PageDirty(page) && PageSwapCache(page)) {
> -			struct writeback_control wbc = {
> -				.sync_mode = WB_SYNC_NONE,
> -			};
> -
> -			swap_writepage(page, &wbc);
> -			lock_page(page);
> -			wait_on_page_writeback(page);
> -		}
> -
> -		/*
> -		 * It is conceivable that a racing task removed this page from
> -		 * swap cache just before we acquired the page lock at the top,
> -		 * or while we dropped it in unuse_mm().  The page might even
> -		 * be back in swap cache on another swap area: that we must not
> -		 * delete, since it may not have been written out to swap yet.
> -		 */
> -		if (PageSwapCache(page) &&
> -		    likely(page_private(page) == entry.val))
> -			delete_from_swap_cache(page);
> -
> -		/*
> -		 * So we could skip searching mms once swap count went
> -		 * to 1, we did not mark any present ptes as dirty: must
> -		 * mark page dirty so shrink_page_list will preserve it.
> -		 */
> -		SetPageDirty(page);
> -		unlock_page(page);
> -		page_cache_release(page);
> -
> -		/*
> -		 * Make sure that we aren't completely killing
> -		 * interactive performance.
> -		 */
> -		cond_resched();
> -		if (frontswap && pages_to_unuse > 0) {
> -			if (!--pages_to_unuse)
> -				break;
> -		}
> -	}
> +/*		if (swap_count(*swap_map) &&
> +*		     PageDirty(page) && PageSwapCache(page)) {
> +*			struct writeback_control wbc = {
> +*				.sync_mode = WB_SYNC_NONE,
> +*			};
> +*
> +*			swap_writepage(page, &wbc);
> +*			lock_page(page);
> +*			wait_on_page_writeback(page);
> +*		}
> +*/
>  
> +	/* TODO check if there are any swap entries we failed to clean up. */
> +	if ((i = find_next_to_unuse(si, i, frontswap)) != 0)
> +		printk("swap entries remain, type not clean\n");
> +	printk("Leaving try_to_unuse\n");
> +	printk("Calls made to unuse_pte: %lu\n", unusepte_calls);
>  	mmput(start_mm);
> +out:
>  	return retval;
>  }
>  
> 


-- 
All rights reversed.

next prev parent reply	other threads:[~2014-02-25 20:14 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-02-19  0:35 [RFC] mm:prototype for the updated swapoff implementation Kelley Nielsen
2014-02-19  0:35 ` Kelley Nielsen
2014-02-19 21:27 ` Andrew Morton
2014-02-19 21:27   ` Andrew Morton
2014-02-19 21:39   ` Rik van Riel
2014-02-19 21:39     ` Rik van Riel
2014-02-19 23:08     ` [OPW kernel] " Josh Triplett
2014-02-19 23:08       ` Josh Triplett
2014-02-19 23:42   ` One Thousand Gnomes
2014-02-19 23:42     ` One Thousand Gnomes
2014-02-25 20:14 ` Rik van Riel [this message]
2014-02-25 20:14   ` Rik van Riel
2014-02-28  0:33 ` Hugh Dickins
2014-02-28  0:33   ` Hugh Dickins
2014-02-28 18:07   ` Rik van Riel
2014-02-28 18:07     ` Rik van Riel

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=530CF9B8.1020008@surriel.com \
    --to=riel@surriel.com \
    --cc=jamieliu@google.com \
    --cc=kelleynnn@gmail.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=opw-kernel@googlegroups.com \
    --cc=sjenning@linux.vnet.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.