linux-mm.kvack.org archive mirror
* [PATCH v3 2/3] mm/zswap: bugfix: memory leak when invalidate and reclaim occur concurrently
@ 2013-09-23  8:21 Weijie Yang
  2013-09-24  1:03 ` Minchan Kim
  0 siblings, 1 reply; 14+ messages in thread
From: Weijie Yang @ 2013-09-23  8:21 UTC (permalink / raw)
  To: akpm
  Cc: sjenning, bob.liu, minchan, weijie.yang.kh, linux-mm,
	linux-kernel, stable, d.j.shin, heesub.shin, kyungmin.park,
	hau.chen, bifeng.tong, rui.xie

Consider the following scenario:
thread 0: reclaim entry x (gets the refcount, but has not yet called zswap_get_swap_cache_page)
thread 1: call zswap_frontswap_invalidate_page to invalidate entry x.
	when it finishes, entry x and its zbud are not freed because its refcount != 0;
	now swap_map[x] = 0
thread 0: now calls zswap_get_swap_cache_page
	swapcache_prepare returns -ENOENT because entry x is no longer in use
	zswap_get_swap_cache_page returns ZSWAP_SWAPCACHE_NOMEM
	zswap_writeback_entry does nothing except put the refcount
Now the memory of zswap_entry x and its zpage is leaked.

Modify:
 - check the refcount in the fail path, and free the memory if it is no longer referenced.
 - use ZSWAP_SWAPCACHE_FAIL instead of ZSWAP_SWAPCACHE_NOMEM, as the fail path
can be caused not only by nomem but also by invalidate.

Signed-off-by: Weijie Yang <weijie.yang@samsung.com>
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: stable@vger.kernel.org
Acked-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
---
 mm/zswap.c |   21 +++++++++++++--------
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index cbd9578..1be7b90 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -387,7 +387,7 @@ static void zswap_free_entry(struct zswap_tree *tree, struct zswap_entry *entry)
 enum zswap_get_swap_ret {
 	ZSWAP_SWAPCACHE_NEW,
 	ZSWAP_SWAPCACHE_EXIST,
-	ZSWAP_SWAPCACHE_NOMEM
+	ZSWAP_SWAPCACHE_FAIL,
 };
 
 /*
@@ -401,9 +401,9 @@ enum zswap_get_swap_ret {
  * added to the swap cache, and returned in retpage.
  *
  * If success, the swap cache page is returned in retpage
- * Returns 0 if page was already in the swap cache, page is not locked
- * Returns 1 if the new page needs to be populated, page is locked
- * Returns <0 on error
+ * Returns ZSWAP_SWAPCACHE_EXIST if page was already in the swap cache
+ * Returns ZSWAP_SWAPCACHE_NEW if the new page needs to be populated, page is locked
+ * Returns ZSWAP_SWAPCACHE_FAIL on error
  */
 static int zswap_get_swap_cache_page(swp_entry_t entry,
 				struct page **retpage)
@@ -475,7 +475,7 @@ static int zswap_get_swap_cache_page(swp_entry_t entry,
 	if (new_page)
 		page_cache_release(new_page);
 	if (!found_page)
-		return ZSWAP_SWAPCACHE_NOMEM;
+		return ZSWAP_SWAPCACHE_FAIL;
 	*retpage = found_page;
 	return ZSWAP_SWAPCACHE_EXIST;
 }
@@ -529,11 +529,11 @@ static int zswap_writeback_entry(struct zbud_pool *pool, unsigned long handle)
 
 	/* try to allocate swap cache page */
 	switch (zswap_get_swap_cache_page(swpentry, &page)) {
-	case ZSWAP_SWAPCACHE_NOMEM: /* no memory */
+	case ZSWAP_SWAPCACHE_FAIL: /* no memory or invalidate happened */
 		ret = -ENOMEM;
 		goto fail;
 
-	case ZSWAP_SWAPCACHE_EXIST: /* page is unlocked */
+	case ZSWAP_SWAPCACHE_EXIST:
 		/* page is already in the swap cache, ignore for now */
 		page_cache_release(page);
 		ret = -EEXIST;
@@ -591,7 +591,12 @@ static int zswap_writeback_entry(struct zbud_pool *pool, unsigned long handle)
 
 fail:
 	spin_lock(&tree->lock);
-	zswap_entry_put(entry);
+	refcount = zswap_entry_put(entry);
+	if (refcount <= 0) {
+		/* invalidate happened, consider writeback as success */
+		zswap_free_entry(tree, entry);
+		ret = 0;
+	}
 	spin_unlock(&tree->lock);
 	return ret;
 }
-- 
1.7.10.4



^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 2/3] mm/zswap: bugfix: memory leak when invalidate and reclaim occur concurrently
  2013-09-23  8:21 [PATCH v3 2/3] mm/zswap: bugfix: memory leak when invalidate and reclaim occur concurrently Weijie Yang
@ 2013-09-24  1:03 ` Minchan Kim
  2013-09-26  3:42   ` Weijie Yang
  0 siblings, 1 reply; 14+ messages in thread
From: Minchan Kim @ 2013-09-24  1:03 UTC (permalink / raw)
  To: Weijie Yang
  Cc: akpm, sjenning, bob.liu, weijie.yang.kh, linux-mm, linux-kernel,
	stable, d.j.shin, heesub.shin, kyungmin.park, hau.chen,
	bifeng.tong, rui.xie

On Mon, Sep 23, 2013 at 04:21:49PM +0800, Weijie Yang wrote:
> Consider the following scenario:
> thread 0: reclaim entry x (get refcount, but not call zswap_get_swap_cache_page)
> thread 1: call zswap_frontswap_invalidate_page to invalidate entry x.
> 	finished, entry x and its zbud is not freed as its refcount != 0
> 	now, the swap_map[x] = 0
> thread 0: now call zswap_get_swap_cache_page
> 	swapcache_prepare return -ENOENT because entry x is not used any more
> 	zswap_get_swap_cache_page return ZSWAP_SWAPCACHE_NOMEM
> 	zswap_writeback_entry do nothing except put refcount
> Now, the memory of zswap_entry x and its zpage leak.
> 
> Modify:
>  - check the refcount in fail path, free memory if it is not referenced.

Hmm, I don't like this because the zswap refcount routine is already a mess for me.
I'm not sure why it was designed that way from the beginning. I hope we can fix it first.

1. zswap_rb_search could include the zswap_entry_get semantic if it finds an entry in
   the tree. Of course, we should rename it to find_get_zswap_entry, like find_get_page.
2. zswap_entry_put could hide the resource-freeing function, like zswap_free_entry, so that
   all callers can easily use the following pattern.

  find_get_zswap_entry
  ...
  ...
  zswap_entry_put

Of course, zswap_entry_put has to check whether the entry is in the tree or not,
so that if someone has already removed it from the tree, it avoids a double remove.

One concern I can think of is that this approach extends the critical section,
but I think it would be no problem because the bigger bottleneck would be the [de]compress
functions. If it were really a problem, we could mitigate it by moving
unnecessary work out of zswap_free_entry, but that seems to be rather
over-engineering.
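
(For illustration only: a minimal sketch of the find/get ... put pattern described
above. The helper names follow this suggestion and the follow-up patch later in the
thread; the exact placement of the lock/unlock calls is an assumption of this sketch,
not code from the thread.)

	/* illustrative sketch only -- not the actual zswap code */
	spin_lock(&tree->lock);
	entry = zswap_entry_find_get(&tree->rbroot, offset);	/* search + take ref */
	spin_unlock(&tree->lock);
	if (!entry)
		return 0;

	/* ... work on the entry without holding the tree lock ... */

	spin_lock(&tree->lock);
	zswap_entry_put(tree, entry);	/* drop ref; frees the entry if it was the last */
	spin_unlock(&tree->lock);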

>  - use ZSWAP_SWAPCACHE_FAIL instead of ZSWAP_SWAPCACHE_NOMEM as the fail path
> can be not only caused by nomem but also by invalidate.
> 
> Signed-off-by: Weijie Yang <weijie.yang@samsung.com>
> Reviewed-by: Bob Liu <bob.liu@oracle.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: stable@vger.kernel.org
> Acked-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
> ---
>  mm/zswap.c |   21 +++++++++++++--------
>  1 file changed, 13 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/zswap.c b/mm/zswap.c
> index cbd9578..1be7b90 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -387,7 +387,7 @@ static void zswap_free_entry(struct zswap_tree *tree, struct zswap_entry *entry)
>  enum zswap_get_swap_ret {
>  	ZSWAP_SWAPCACHE_NEW,
>  	ZSWAP_SWAPCACHE_EXIST,
> -	ZSWAP_SWAPCACHE_NOMEM
> +	ZSWAP_SWAPCACHE_FAIL,
>  };
>  
>  /*
> @@ -401,9 +401,9 @@ enum zswap_get_swap_ret {
>   * added to the swap cache, and returned in retpage.
>   *
>   * If success, the swap cache page is returned in retpage
> - * Returns 0 if page was already in the swap cache, page is not locked
> - * Returns 1 if the new page needs to be populated, page is locked
> - * Returns <0 on error
> + * Returns ZSWAP_SWAPCACHE_EXIST if page was already in the swap cache
> + * Returns ZSWAP_SWAPCACHE_NEW if the new page needs to be populated, page is locked
> + * Returns ZSWAP_SWAPCACHE_FAIL on error
>   */
>  static int zswap_get_swap_cache_page(swp_entry_t entry,
>  				struct page **retpage)
> @@ -475,7 +475,7 @@ static int zswap_get_swap_cache_page(swp_entry_t entry,
>  	if (new_page)
>  		page_cache_release(new_page);
>  	if (!found_page)
> -		return ZSWAP_SWAPCACHE_NOMEM;
> +		return ZSWAP_SWAPCACHE_FAIL;
>  	*retpage = found_page;
>  	return ZSWAP_SWAPCACHE_EXIST;
>  }
> @@ -529,11 +529,11 @@ static int zswap_writeback_entry(struct zbud_pool *pool, unsigned long handle)
>  
>  	/* try to allocate swap cache page */
>  	switch (zswap_get_swap_cache_page(swpentry, &page)) {
> -	case ZSWAP_SWAPCACHE_NOMEM: /* no memory */
> +	case ZSWAP_SWAPCACHE_FAIL: /* no memory or invalidate happened */
>  		ret = -ENOMEM;
>  		goto fail;
>  
> -	case ZSWAP_SWAPCACHE_EXIST: /* page is unlocked */
> +	case ZSWAP_SWAPCACHE_EXIST:
>  		/* page is already in the swap cache, ignore for now */
>  		page_cache_release(page);
>  		ret = -EEXIST;
> @@ -591,7 +591,12 @@ static int zswap_writeback_entry(struct zbud_pool *pool, unsigned long handle)
>  
>  fail:
>  	spin_lock(&tree->lock);
> -	zswap_entry_put(entry);
> +	refcount = zswap_entry_put(entry);
> +	if (refcount <= 0) {
> +		/* invalidate happened, consider writeback as success */
> +		zswap_free_entry(tree, entry);
> +		ret = 0;
> +	}
>  	spin_unlock(&tree->lock);
>  	return ret;
>  }
> -- 
> 1.7.10.4
> 
> 

-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: [PATCH v3 2/3] mm/zswap: bugfix: memory leak when invalidate and reclaim occur concurrently
  2013-09-24  1:03 ` Minchan Kim
@ 2013-09-26  3:42   ` Weijie Yang
  2013-10-10 19:54     ` Andrew Morton
                       ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Weijie Yang @ 2013-09-26  3:42 UTC (permalink / raw)
  To: 'Minchan Kim'
  Cc: akpm, sjenning, bob.liu, weijie.yang.kh, linux-mm, linux-kernel,
	stable, d.j.shin, heesub.shin, kyungmin.park, hau.chen,
	bifeng.tong, rui.xie

On Tue, Sep 24, 2013 at 9:03 AM, Minchan Kim <minchan@kernel.org> wrote: 
> On Mon, Sep 23, 2013 at 04:21:49PM +0800, Weijie Yang wrote:
> >
> > Modify:
> >  - check the refcount in fail path, free memory if it is not referenced.
> 
> Hmm, I don't like this because zswap refcount routine is already mess for me.
> I'm not sure why it was designed from the beginning. I hope we should fix it first.
> 
> 1. zswap_rb_serach could include zswap_entry_get semantic if it founds a entry from
>    the tree. Of course, we should ranme it as find_get_zswap_entry like find_get_page.
> 2. zswap_entry_put could hide resource free function like zswap_free_entry so that
>    all of caller can use it easily following pattern.
> 
>   find_get_zswap_entry
>   ...
>   ...
>   zswap_entry_put
> 
> Of course, zswap_entry_put have to check the entry is in the tree or not
> so if someone already removes it from the tree, it should avoid double remove.
> 
> One of the concern I can think is that approach extends critical section
> but I think it would be no problem because more bottleneck would be [de]compress
> functions. If it were really problem, we can mitigate a problem with moving
> unnecessary functions out of zswap_free_entry because it seem to be rather
> over-enginnering.

I refactored the zswap refcount routine according to Minchan's idea.
Here is the new patch; any suggestions are welcome.

To Seth and Bob, would you please review it again?

 mm/zswap.c |  116 ++++++++++++++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 52 insertions(+), 64 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
old mode 100644
new mode 100755
index deda2b6..bd04910
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -217,6 +217,7 @@ static struct zswap_entry *zswap_entry_cache_alloc(gfp_t gfp)
 	if (!entry)
 		return NULL;
 	entry->refcount = 1;
+	RB_CLEAR_NODE(&entry->rbnode);
 	return entry;
 }
 
@@ -232,10 +233,20 @@ static void zswap_entry_get(struct zswap_entry *entry)
 }
 
 /* caller must hold the tree lock */
-static int zswap_entry_put(struct zswap_entry *entry)
+static int zswap_entry_put(struct zswap_tree *tree, struct zswap_entry *entry)
 {
-	entry->refcount--;
-	return entry->refcount;
+	int refcount = --entry->refcount;
+
+	if (refcount <= 0) {
+		if (!RB_EMPTY_NODE(&entry->rbnode)) {
+			rb_erase(&entry->rbnode, &tree->rbroot);
+			RB_CLEAR_NODE(&entry->rbnode);
+		}
+
+		zswap_free_entry(tree, entry);
+	}
+
+	return refcount;
 }
 
 /*********************************
@@ -258,6 +269,17 @@ static struct zswap_entry *zswap_rb_search(struct rb_root *root, pgoff_t offset)
 	return NULL;
 }
 
+static struct zswap_entry *zswap_entry_find_get(struct rb_root *root, pgoff_t offset)
+{
+	struct zswap_entry *entry = NULL;
+
+	entry = zswap_rb_search(root, offset);
+	if (entry)
+		zswap_entry_get(entry);
+
+	return entry;
+}
+
 /*
  * In the case that a entry with the same offset is found, a pointer to
  * the existing entry is stored in dupentry and the function returns -EEXIST
@@ -387,7 +409,7 @@ static void zswap_free_entry(struct zswap_tree *tree, struct zswap_entry *entry)
 enum zswap_get_swap_ret {
 	ZSWAP_SWAPCACHE_NEW,
 	ZSWAP_SWAPCACHE_EXIST,
-	ZSWAP_SWAPCACHE_NOMEM
+	ZSWAP_SWAPCACHE_FAIL,
 };
 
 /*
@@ -401,9 +423,9 @@ enum zswap_get_swap_ret {
  * added to the swap cache, and returned in retpage.
  *
  * If success, the swap cache page is returned in retpage
- * Returns 0 if page was already in the swap cache, page is not locked
- * Returns 1 if the new page needs to be populated, page is locked
- * Returns <0 on error
+ * Returns ZSWAP_SWAPCACHE_EXIST if page was already in the swap cache
+ * Returns ZSWAP_SWAPCACHE_NEW if the new page needs to be populated, page is locked
+ * Returns ZSWAP_SWAPCACHE_FAIL on error
  */
 static int zswap_get_swap_cache_page(swp_entry_t entry,
 				struct page **retpage)
@@ -475,7 +497,7 @@ static int zswap_get_swap_cache_page(swp_entry_t entry,
 	if (new_page)
 		page_cache_release(new_page);
 	if (!found_page)
-		return ZSWAP_SWAPCACHE_NOMEM;
+		return ZSWAP_SWAPCACHE_FAIL;
 	*retpage = found_page;
 	return ZSWAP_SWAPCACHE_EXIST;
 }
@@ -517,23 +539,22 @@ static int zswap_writeback_entry(struct zbud_pool *pool, unsigned long handle)
 
 	/* find and ref zswap entry */
 	spin_lock(&tree->lock);
-	entry = zswap_rb_search(&tree->rbroot, offset);
+	entry = zswap_entry_find_get(&tree->rbroot, offset);
 	if (!entry) {
 		/* entry was invalidated */
 		spin_unlock(&tree->lock);
 		return 0;
 	}
-	zswap_entry_get(entry);
 	spin_unlock(&tree->lock);
 	BUG_ON(offset != entry->offset);
 
 	/* try to allocate swap cache page */
 	switch (zswap_get_swap_cache_page(swpentry, &page)) {
-	case ZSWAP_SWAPCACHE_NOMEM: /* no memory */
+	case ZSWAP_SWAPCACHE_FAIL: /* no memory or invalidate happened */
 		ret = -ENOMEM;
 		goto fail;
 
-	case ZSWAP_SWAPCACHE_EXIST: /* page is unlocked */
+	case ZSWAP_SWAPCACHE_EXIST:
 		/* page is already in the swap cache, ignore for now */
 		page_cache_release(page);
 		ret = -EEXIST;
@@ -562,38 +583,28 @@ static int zswap_writeback_entry(struct zbud_pool *pool, unsigned long handle)
 	zswap_written_back_pages++;
 
 	spin_lock(&tree->lock);
-
 	/* drop local reference */
-	zswap_entry_put(entry);
+	refcount = zswap_entry_put(tree, entry);
 	/* drop the initial reference from entry creation */
-	refcount = zswap_entry_put(entry);
-
-	/*
-	 * There are three possible values for refcount here:
-	 * (1) refcount is 1, load is in progress, unlink from rbtree,
-	 *     load will free
-	 * (2) refcount is 0, (normal case) entry is valid,
-	 *     remove from rbtree and free entry
-	 * (3) refcount is -1, invalidate happened during writeback,
-	 *     free entry
-	 */
-	if (refcount >= 0) {
-		/* no invalidate yet, remove from rbtree */
+	if (refcount > 0) {
 		rb_erase(&entry->rbnode, &tree->rbroot);
+		RB_CLEAR_NODE(&entry->rbnode);
+		refcount = zswap_entry_put(tree, entry);
 	}
 	spin_unlock(&tree->lock);
-	if (refcount <= 0) {
-		/* free the entry */
-		zswap_free_entry(tree, entry);
-		return 0;
-	}
-	return -EAGAIN;
+
+	goto end;
 
 fail:
 	spin_lock(&tree->lock);
-	zswap_entry_put(entry);
+	refcount = zswap_entry_put(tree, entry);
 	spin_unlock(&tree->lock);
-	return ret;
+
+end:
+	if (refcount <= 0)
+		return 0;
+	else
+		return -EAGAIN;
 }
 
 /*********************************
@@ -677,10 +688,8 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
 			zswap_duplicate_entry++;
 			/* remove from rbtree */
 			rb_erase(&dupentry->rbnode, &tree->rbroot);
-			if (!zswap_entry_put(dupentry)) {
-				/* free */
-				zswap_free_entry(tree, dupentry);
-			}
+			RB_CLEAR_NODE(&dupentry->rbnode);
+			zswap_entry_put(tree, dupentry);
 		}
 	} while (ret == -EEXIST);
 	spin_unlock(&tree->lock);
@@ -713,13 +722,12 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset,
 
 	/* find */
 	spin_lock(&tree->lock);
-	entry = zswap_rb_search(&tree->rbroot, offset);
+	entry = zswap_entry_find_get(&tree->rbroot, offset);
 	if (!entry) {
 		/* entry was written back */
 		spin_unlock(&tree->lock);
 		return -1;
 	}
-	zswap_entry_get(entry);
 	spin_unlock(&tree->lock);
 
 	/* decompress */
@@ -734,22 +742,9 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset,
 	BUG_ON(ret);
 
 	spin_lock(&tree->lock);
-	refcount = zswap_entry_put(entry);
-	if (likely(refcount)) {
-		spin_unlock(&tree->lock);
-		return 0;
-	}
+	zswap_entry_put(tree, entry);
 	spin_unlock(&tree->lock);
 
-	/*
-	 * We don't have to unlink from the rbtree because
-	 * zswap_writeback_entry() or zswap_frontswap_invalidate page()
-	 * has already done this for us if we are the last reference.
-	 */
-	/* free */
-
-	zswap_free_entry(tree, entry);
-
 	return 0;
 }
 
@@ -771,19 +766,12 @@ static void zswap_frontswap_invalidate_page(unsigned type, pgoff_t offset)
 
 	/* remove from rbtree */
 	rb_erase(&entry->rbnode, &tree->rbroot);
+	RB_CLEAR_NODE(&entry->rbnode);
 
 	/* drop the initial reference from entry creation */
-	refcount = zswap_entry_put(entry);
+	zswap_entry_put(tree, entry);
 
 	spin_unlock(&tree->lock);
-
-	if (refcount) {
-		/* writeback in progress, writeback will free */
-		return;
-	}
-
-	/* free */
-	zswap_free_entry(tree, entry);
 }
 
 /* frees all zswap entries for the given swap type */


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 2/3] mm/zswap: bugfix: memory leak when invalidate and reclaim occur concurrently
  2013-09-26  3:42   ` Weijie Yang
@ 2013-10-10 19:54     ` Andrew Morton
  2013-10-11  7:13     ` Minchan Kim
  2013-10-12  2:32     ` Bob Liu
  2 siblings, 0 replies; 14+ messages in thread
From: Andrew Morton @ 2013-10-10 19:54 UTC (permalink / raw)
  To: Weijie Yang
  Cc: 'Minchan Kim', sjenning, bob.liu, weijie.yang.kh,
	linux-mm, linux-kernel, stable, d.j.shin, heesub.shin,
	kyungmin.park, hau.chen, bifeng.tong, rui.xie

On Thu, 26 Sep 2013 11:42:17 +0800 Weijie Yang <weijie.yang@samsung.com> wrote:

> On Tue, Sep 24, 2013 at 9:03 AM, Minchan Kim <minchan@kernel.org> wrote: 
> > On Mon, Sep 23, 2013 at 04:21:49PM +0800, Weijie Yang wrote:
> > >
> > > Modify:
> > >  - check the refcount in fail path, free memory if it is not referenced.
> > 
> > Hmm, I don't like this because zswap refcount routine is already mess for me.
> > I'm not sure why it was designed from the beginning. I hope we should fix it first.
> > 
> > 1. zswap_rb_serach could include zswap_entry_get semantic if it founds a entry from
> >    the tree. Of course, we should ranme it as find_get_zswap_entry like find_get_page.
> > 2. zswap_entry_put could hide resource free function like zswap_free_entry so that
> >    all of caller can use it easily following pattern.
> > 
> >   find_get_zswap_entry
> >   ...
> >   ...
> >   zswap_entry_put
> > 
> > Of course, zswap_entry_put have to check the entry is in the tree or not
> > so if someone already removes it from the tree, it should avoid double remove.
> > 
> > One of the concern I can think is that approach extends critical section
> > but I think it would be no problem because more bottleneck would be [de]compress
> > functions. If it were really problem, we can mitigate a problem with moving
> > unnecessary functions out of zswap_free_entry because it seem to be rather
> > over-enginnering.
> 
> I refactor the zswap refcount routine according to Minchan's idea.
> Here is the new patch, Any suggestion is welcomed.
> 
> To Seth and Bob, would you please review it again?

Yes, please let's re-review this patch.  It is very different from its
predecessor.


From: Weijie Yang <weijie.yang@samsung.com>
Subject: mm/zswap: bugfix: memory leak when invalidate and reclaim occur concurrently

Consider the following scenario:

thread 0: reclaim entry x (gets the refcount, but has not yet called
          zswap_get_swap_cache_page)

thread 1: call zswap_frontswap_invalidate_page to invalidate
          entry x.  When it finishes, entry x and its zbud are not freed
          because its refcount != 0; now swap_map[x] = 0

thread 0: now calls zswap_get_swap_cache_page.  swapcache_prepare
          returns -ENOENT because entry x is no longer in use,
          zswap_get_swap_cache_page returns ZSWAP_SWAPCACHE_NOMEM, and
          zswap_writeback_entry does nothing except put the refcount

Now the memory of zswap_entry x and its zpage is leaked.

Modify:
 - check the refcount in the fail path, and free the memory if it is no longer referenced.
 - use ZSWAP_SWAPCACHE_FAIL instead of ZSWAP_SWAPCACHE_NOMEM, as the fail path
can be caused not only by nomem but also by invalidate.

Signed-off-by: Weijie Yang <weijie.yang@samsung.com>
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Cc: Minchan Kim <minchan@kernel.org>
Acked-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/zswap.c |  116 ++++++++++++++++++++++-----------------------------
 1 file changed, 52 insertions(+), 64 deletions(-)

diff -puN mm/zswap.c~mm-zswap-bugfix-memory-leak-when-invalidate-and-reclaim-occur-concurrently mm/zswap.c
--- a/mm/zswap.c~mm-zswap-bugfix-memory-leak-when-invalidate-and-reclaim-occur-concurrently
+++ a/mm/zswap.c
@@ -217,6 +217,7 @@ static struct zswap_entry *zswap_entry_c
 	if (!entry)
 		return NULL;
 	entry->refcount = 1;
+	RB_CLEAR_NODE(&entry->rbnode);
 	return entry;
 }
 
@@ -232,10 +233,20 @@ static void zswap_entry_get(struct zswap
 }
 
 /* caller must hold the tree lock */
-static int zswap_entry_put(struct zswap_entry *entry)
+static int zswap_entry_put(struct zswap_tree *tree, struct zswap_entry *entry)
 {
-	entry->refcount--;
-	return entry->refcount;
+	int refcount = --entry->refcount;
+
+	if (refcount <= 0) {
+		if (!RB_EMPTY_NODE(&entry->rbnode)) {
+			rb_erase(&entry->rbnode, &tree->rbroot);
+			RB_CLEAR_NODE(&entry->rbnode);
+		}
+
+		zswap_free_entry(tree, entry);
+	}
+
+	return refcount;
 }
 
 /*********************************
@@ -258,6 +269,17 @@ static struct zswap_entry *zswap_rb_sear
 	return NULL;
 }
 
+static struct zswap_entry *zswap_entry_find_get(struct rb_root *root, pgoff_t offset)
+{
+	struct zswap_entry *entry = NULL;
+
+	entry = zswap_rb_search(root, offset);
+	if (entry)
+		zswap_entry_get(entry);
+
+	return entry;
+}
+
 /*
  * In the case that a entry with the same offset is found, a pointer to
  * the existing entry is stored in dupentry and the function returns -EEXIST
@@ -387,7 +409,7 @@ static void zswap_free_entry(struct zswa
 enum zswap_get_swap_ret {
 	ZSWAP_SWAPCACHE_NEW,
 	ZSWAP_SWAPCACHE_EXIST,
-	ZSWAP_SWAPCACHE_NOMEM
+	ZSWAP_SWAPCACHE_FAIL,
 };
 
 /*
@@ -401,9 +423,9 @@ enum zswap_get_swap_ret {
  * added to the swap cache, and returned in retpage.
  *
  * If success, the swap cache page is returned in retpage
- * Returns 0 if page was already in the swap cache, page is not locked
- * Returns 1 if the new page needs to be populated, page is locked
- * Returns <0 on error
+ * Returns ZSWAP_SWAPCACHE_EXIST if page was already in the swap cache
+ * Returns ZSWAP_SWAPCACHE_NEW if the new page needs to be populated, page is locked
+ * Returns ZSWAP_SWAPCACHE_FAIL on error
  */
 static int zswap_get_swap_cache_page(swp_entry_t entry,
 				struct page **retpage)
@@ -475,7 +497,7 @@ static int zswap_get_swap_cache_page(swp
 	if (new_page)
 		page_cache_release(new_page);
 	if (!found_page)
-		return ZSWAP_SWAPCACHE_NOMEM;
+		return ZSWAP_SWAPCACHE_FAIL;
 	*retpage = found_page;
 	return ZSWAP_SWAPCACHE_EXIST;
 }
@@ -517,23 +539,22 @@ static int zswap_writeback_entry(struct
 
 	/* find and ref zswap entry */
 	spin_lock(&tree->lock);
-	entry = zswap_rb_search(&tree->rbroot, offset);
+	entry = zswap_entry_find_get(&tree->rbroot, offset);
 	if (!entry) {
 		/* entry was invalidated */
 		spin_unlock(&tree->lock);
 		return 0;
 	}
-	zswap_entry_get(entry);
 	spin_unlock(&tree->lock);
 	BUG_ON(offset != entry->offset);
 
 	/* try to allocate swap cache page */
 	switch (zswap_get_swap_cache_page(swpentry, &page)) {
-	case ZSWAP_SWAPCACHE_NOMEM: /* no memory */
+	case ZSWAP_SWAPCACHE_FAIL: /* no memory or invalidate happened */
 		ret = -ENOMEM;
 		goto fail;
 
-	case ZSWAP_SWAPCACHE_EXIST: /* page is unlocked */
+	case ZSWAP_SWAPCACHE_EXIST:
 		/* page is already in the swap cache, ignore for now */
 		page_cache_release(page);
 		ret = -EEXIST;
@@ -562,38 +583,28 @@ static int zswap_writeback_entry(struct
 	zswap_written_back_pages++;
 
 	spin_lock(&tree->lock);
-
 	/* drop local reference */
-	zswap_entry_put(entry);
+	refcount = zswap_entry_put(tree, entry);
 	/* drop the initial reference from entry creation */
-	refcount = zswap_entry_put(entry);
-
-	/*
-	 * There are three possible values for refcount here:
-	 * (1) refcount is 1, load is in progress, unlink from rbtree,
-	 *     load will free
-	 * (2) refcount is 0, (normal case) entry is valid,
-	 *     remove from rbtree and free entry
-	 * (3) refcount is -1, invalidate happened during writeback,
-	 *     free entry
-	 */
-	if (refcount >= 0) {
-		/* no invalidate yet, remove from rbtree */
+	if (refcount > 0) {
 		rb_erase(&entry->rbnode, &tree->rbroot);
+		RB_CLEAR_NODE(&entry->rbnode);
+		refcount = zswap_entry_put(tree, entry);
 	}
 	spin_unlock(&tree->lock);
-	if (refcount <= 0) {
-		/* free the entry */
-		zswap_free_entry(tree, entry);
-		return 0;
-	}
-	return -EAGAIN;
+
+	goto end;
 
 fail:
 	spin_lock(&tree->lock);
-	zswap_entry_put(entry);
+	refcount = zswap_entry_put(tree, entry);
 	spin_unlock(&tree->lock);
-	return ret;
+
+end:
+	if (refcount <= 0)
+		return 0;
+	else
+		return -EAGAIN;
 }
 
 /*********************************
@@ -677,10 +688,8 @@ static int zswap_frontswap_store(unsigne
 			zswap_duplicate_entry++;
 			/* remove from rbtree */
 			rb_erase(&dupentry->rbnode, &tree->rbroot);
-			if (!zswap_entry_put(dupentry)) {
-				/* free */
-				zswap_free_entry(tree, dupentry);
-			}
+			RB_CLEAR_NODE(&dupentry->rbnode);
+			zswap_entry_put(tree, dupentry);
 		}
 	} while (ret == -EEXIST);
 	spin_unlock(&tree->lock);
@@ -713,13 +722,12 @@ static int zswap_frontswap_load(unsigned
 
 	/* find */
 	spin_lock(&tree->lock);
-	entry = zswap_rb_search(&tree->rbroot, offset);
+	entry = zswap_entry_find_get(&tree->rbroot, offset);
 	if (!entry) {
 		/* entry was written back */
 		spin_unlock(&tree->lock);
 		return -1;
 	}
-	zswap_entry_get(entry);
 	spin_unlock(&tree->lock);
 
 	/* decompress */
@@ -734,22 +742,9 @@ static int zswap_frontswap_load(unsigned
 	BUG_ON(ret);
 
 	spin_lock(&tree->lock);
-	refcount = zswap_entry_put(entry);
-	if (likely(refcount)) {
-		spin_unlock(&tree->lock);
-		return 0;
-	}
+	zswap_entry_put(tree, entry);
 	spin_unlock(&tree->lock);
 
-	/*
-	 * We don't have to unlink from the rbtree because
-	 * zswap_writeback_entry() or zswap_frontswap_invalidate page()
-	 * has already done this for us if we are the last reference.
-	 */
-	/* free */
-
-	zswap_free_entry(tree, entry);
-
 	return 0;
 }
 
@@ -771,19 +766,12 @@ static void zswap_frontswap_invalidate_p
 
 	/* remove from rbtree */
 	rb_erase(&entry->rbnode, &tree->rbroot);
+	RB_CLEAR_NODE(&entry->rbnode);
 
 	/* drop the initial reference from entry creation */
-	refcount = zswap_entry_put(entry);
+	zswap_entry_put(tree, entry);
 
 	spin_unlock(&tree->lock);
-
-	if (refcount) {
-		/* writeback in progress, writeback will free */
-		return;
-	}
-
-	/* free */
-	zswap_free_entry(tree, entry);
 }
 
 /* frees all zswap entries for the given swap type */
_


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 2/3] mm/zswap: bugfix: memory leak when invalidate and reclaim occur concurrently
  2013-09-26  3:42   ` Weijie Yang
  2013-10-10 19:54     ` Andrew Morton
@ 2013-10-11  7:13     ` Minchan Kim
  2013-10-12  2:50       ` Bob Liu
                         ` (2 more replies)
  2013-10-12  2:32     ` Bob Liu
  2 siblings, 3 replies; 14+ messages in thread
From: Minchan Kim @ 2013-10-11  7:13 UTC (permalink / raw)
  To: Weijie Yang
  Cc: akpm, sjenning, bob.liu, weijie.yang.kh, linux-mm, linux-kernel,
	stable, d.j.shin, heesub.shin, kyungmin.park, hau.chen,
	bifeng.tong, rui.xie

On Thu, Sep 26, 2013 at 11:42:17AM +0800, Weijie Yang wrote:
> On Tue, Sep 24, 2013 at 9:03 AM, Minchan Kim <minchan@kernel.org> wrote: 
> > On Mon, Sep 23, 2013 at 04:21:49PM +0800, Weijie Yang wrote:
> > >
> > > Modify:
> > >  - check the refcount in fail path, free memory if it is not referenced.
> > 
> > Hmm, I don't like this because zswap refcount routine is already mess for me.
> > I'm not sure why it was designed from the beginning. I hope we should fix it first.
> > 
> > 1. zswap_rb_serach could include zswap_entry_get semantic if it founds a entry from
> >    the tree. Of course, we should ranme it as find_get_zswap_entry like find_get_page.
> > 2. zswap_entry_put could hide resource free function like zswap_free_entry so that
> >    all of caller can use it easily following pattern.
> > 
> >   find_get_zswap_entry
> >   ...
> >   ...
> >   zswap_entry_put
> > 
> > Of course, zswap_entry_put have to check the entry is in the tree or not
> > so if someone already removes it from the tree, it should avoid double remove.
> > 
> > One of the concern I can think is that approach extends critical section
> > but I think it would be no problem because more bottleneck would be [de]compress
> > functions. If it were really problem, we can mitigate a problem with moving
> > unnecessary functions out of zswap_free_entry because it seem to be rather
> > over-enginnering.
> 
> I refactor the zswap refcount routine according to Minchan's idea.
> Here is the new patch, Any suggestion is welcomed.
> 
> To Seth and Bob, would you please review it again?

Yeah, Seth, Bob. You guys are the right persons to review this, because this
scheme was suggested by me, and I am biased, so it couldn't be fair. ;-)
But anyway, I will review the code itself.

> 
> mm/zswap.c |  116
> ++++++++++++++++++++++++++++++++++++++++++++++++++++----------------------------------------------------------------
>  1 file changed, 52 insertions(+), 64 deletions(-)
> 
> diff --git a/mm/zswap.c b/mm/zswap.c
> old mode 100644
> new mode 100755
> index deda2b6..bd04910
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -217,6 +217,7 @@ static struct zswap_entry *zswap_entry_cache_alloc(gfp_t gfp)
>  	if (!entry)
>  		return NULL;
>  	entry->refcount = 1;
> +	RB_CLEAR_NODE(&entry->rbnode);
>  	return entry;
>  }
>  
> @@ -232,10 +233,20 @@ static void zswap_entry_get(struct zswap_entry *entry)
>  }
>  
>  /* caller must hold the tree lock */
> -static int zswap_entry_put(struct zswap_entry *entry)
> +static int zswap_entry_put(struct zswap_tree *tree, struct zswap_entry *entry)

Why should we have a return value? If we really need it, it undermines
the whole point of the get/put semantics, so I'd like to just return void.

Let me see.

>  {
> -	entry->refcount--;
> -	return entry->refcount;
> +	int refcount = --entry->refcount;
> +
> +	if (refcount <= 0) {

Hmm, I really don't like a negative refcount.
I hope we could do the following:

        BUG_ON(refcount < 0);
        if (refcount == 0) {
                ... 
        }



> +		if (!RB_EMPTY_NODE(&entry->rbnode)) {
> +			rb_erase(&entry->rbnode, &tree->rbroot);
> +			RB_CLEAR_NODE(&entry->rbnode);

Minor:
you could make a new function, zswap_rb_del or zswap_rb_remove, which detaches the node
from the rb tree and clears it, since we already have zswap_rb_insert.
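
(For illustration only: a minimal sketch of such a helper. The name zswap_rb_erase is
an assumption; any of the names suggested above would do.)

	/* illustrative sketch only */
	static void zswap_rb_erase(struct rb_root *root, struct zswap_entry *entry)
	{
		if (!RB_EMPTY_NODE(&entry->rbnode)) {
			rb_erase(&entry->rbnode, root);
			RB_CLEAR_NODE(&entry->rbnode);
		}
	}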


> +		}
> +
> +		zswap_free_entry(tree, entry);
> +	}
> +
> +	return refcount;
>  }
>  
>  /*********************************
> @@ -258,6 +269,17 @@ static struct zswap_entry *zswap_rb_search(struct rb_root *root, pgoff_t offset)
>  	return NULL;
>  }
>  

Add function description.
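
(For example, something along these lines; the wording is only an illustrative
suggestion for the requested description.)

	/*
	 * Search the rb tree for an entry at @offset and, if found, take a
	 * reference on it.  The caller must hold the tree lock and must drop
	 * the reference with zswap_entry_put() when done with the entry.
	 */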

> +static struct zswap_entry *zswap_entry_find_get(struct rb_root *root, pgoff_t offset)
> +{
> +	struct zswap_entry *entry = NULL;
> +
> +	entry = zswap_rb_search(root, offset);
> +	if (entry)
> +		zswap_entry_get(entry);
> +
> +	return entry;
> +}
> +
>  /*
>   * In the case that a entry with the same offset is found, a pointer to
>   * the existing entry is stored in dupentry and the function returns -EEXIST
> @@ -387,7 +409,7 @@ static void zswap_free_entry(struct zswap_tree *tree, struct zswap_entry *entry)
>  enum zswap_get_swap_ret {
>  	ZSWAP_SWAPCACHE_NEW,
>  	ZSWAP_SWAPCACHE_EXIST,
> -	ZSWAP_SWAPCACHE_NOMEM
> +	ZSWAP_SWAPCACHE_FAIL,
>  };
>  
>  /*
> @@ -401,9 +423,9 @@ enum zswap_get_swap_ret {
>   * added to the swap cache, and returned in retpage.
>   *
>   * If success, the swap cache page is returned in retpage
> - * Returns 0 if page was already in the swap cache, page is not locked
> - * Returns 1 if the new page needs to be populated, page is locked
> - * Returns <0 on error
> + * Returns ZSWAP_SWAPCACHE_EXIST if page was already in the swap cache
> + * Returns ZSWAP_SWAPCACHE_NEW if the new page needs to be populated, page is locked
> + * Returns ZSWAP_SWAPCACHE_FAIL on error
>   */
>  static int zswap_get_swap_cache_page(swp_entry_t entry,
>  				struct page **retpage)
> @@ -475,7 +497,7 @@ static int zswap_get_swap_cache_page(swp_entry_t entry,
>  	if (new_page)
>  		page_cache_release(new_page);
>  	if (!found_page)
> -		return ZSWAP_SWAPCACHE_NOMEM;
> +		return ZSWAP_SWAPCACHE_FAIL;
>  	*retpage = found_page;
>  	return ZSWAP_SWAPCACHE_EXIST;
>  }
> @@ -517,23 +539,22 @@ static int zswap_writeback_entry(struct zbud_pool *pool, unsigned long handle)
>  
>  	/* find and ref zswap entry */
>  	spin_lock(&tree->lock);
> -	entry = zswap_rb_search(&tree->rbroot, offset);
> +	entry = zswap_entry_find_get(&tree->rbroot, offset);
>  	if (!entry) {
>  		/* entry was invalidated */
>  		spin_unlock(&tree->lock);
>  		return 0;
>  	}
> -	zswap_entry_get(entry);
>  	spin_unlock(&tree->lock);
>  	BUG_ON(offset != entry->offset);
>  
>  	/* try to allocate swap cache page */
>  	switch (zswap_get_swap_cache_page(swpentry, &page)) {
> -	case ZSWAP_SWAPCACHE_NOMEM: /* no memory */
> +	case ZSWAP_SWAPCACHE_FAIL: /* no memory or invalidate happened */
>  		ret = -ENOMEM;
>  		goto fail;
>  
> -	case ZSWAP_SWAPCACHE_EXIST: /* page is unlocked */
> +	case ZSWAP_SWAPCACHE_EXIST:

Why did you remove the comment?

>  		/* page is already in the swap cache, ignore for now */
>  		page_cache_release(page);
>  		ret = -EEXIST;
> @@ -562,38 +583,28 @@ static int zswap_writeback_entry(struct zbud_pool *pool, unsigned long handle)
>  	zswap_written_back_pages++;
>  
>  	spin_lock(&tree->lock);
> -
>  	/* drop local reference */
> -	zswap_entry_put(entry);
> +	refcount = zswap_entry_put(tree, entry);
>  	/* drop the initial reference from entry creation */
> -	refcount = zswap_entry_put(entry);
> -
> -	/*
> -	 * There are three possible values for refcount here:
> -	 * (1) refcount is 1, load is in progress, unlink from rbtree,
> -	 *     load will free
> -	 * (2) refcount is 0, (normal case) entry is valid,
> -	 *     remove from rbtree and free entry
> -	 * (3) refcount is -1, invalidate happened during writeback,
> -	 *     free entry
> -	 */
> -	if (refcount >= 0) {
> -		/* no invalidate yet, remove from rbtree */
> +	if (refcount > 0) {
>  		rb_erase(&entry->rbnode, &tree->rbroot);
> +		RB_CLEAR_NODE(&entry->rbnode);
> +		refcount = zswap_entry_put(tree, entry);

Now I see why you need the return value in zswap_entry_put, but let's reconsider,
because it's really messy to me and it hurts the get/put semantics a lot.
How about this?

        spin_lock(&tree->lock);
        /* drop local reference */
        zswap_entry_put(tree, entry);
        /*
         * In here, we want to free entry but invalidation may free earlier
         * under us so that we should check it again
         */
        if (entry == zswap_rb_search(&tree->rb_root, offset))
                /* Yes, it's stable so we should free it */
                zswap_entry_put(tree, entry);
                
        /*
         * Whether it would be freed by invalidation or writeback, it doesn't
         * matter. Important thing is that it will be freed so there
         * is no point to return -EAGAIN.
         */
        spin_unlock(&tree->lock);
        return 0;

>  	}
>  	spin_unlock(&tree->lock);
> -	if (refcount <= 0) {
> -		/* free the entry */
> -		zswap_free_entry(tree, entry);
> -		return 0;
> -	}
> -	return -EAGAIN;
> +
> +	goto end;
>  
>  fail:
>  	spin_lock(&tree->lock);
> -	zswap_entry_put(entry);
> +	refcount = zswap_entry_put(tree, entry);
>  	spin_unlock(&tree->lock);
> -	return ret;
> +
> +end:
> +	if (refcount <= 0)
> +		return 0;
> +	else
> +		return -EAGAIN;
>  }
>  
>  /*********************************
> @@ -677,10 +688,8 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
>  			zswap_duplicate_entry++;
>  			/* remove from rbtree */
>  			rb_erase(&dupentry->rbnode, &tree->rbroot);
> -			if (!zswap_entry_put(dupentry)) {
> -				/* free */
> -				zswap_free_entry(tree, dupentry);
> -			}
> +			RB_CLEAR_NODE(&dupentry->rbnode);
> +			zswap_entry_put(tree, dupentry);
>  		}
>  	} while (ret == -EEXIST);
>  	spin_unlock(&tree->lock);
> @@ -713,13 +722,12 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset,
>  
>  	/* find */
>  	spin_lock(&tree->lock);
> -	entry = zswap_rb_search(&tree->rbroot, offset);
> +	entry = zswap_entry_find_get(&tree->rbroot, offset);
>  	if (!entry) {
>  		/* entry was written back */
>  		spin_unlock(&tree->lock);
>  		return -1;
>  	}
> -	zswap_entry_get(entry);
>  	spin_unlock(&tree->lock);
>  
>  	/* decompress */
> @@ -734,22 +742,9 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset,
>  	BUG_ON(ret);
>  
>  	spin_lock(&tree->lock);
> -	refcount = zswap_entry_put(entry);
> -	if (likely(refcount)) {
> -		spin_unlock(&tree->lock);
> -		return 0;
> -	}
> +	zswap_entry_put(tree, entry);
>  	spin_unlock(&tree->lock);
>  
> -	/*
> -	 * We don't have to unlink from the rbtree because
> -	 * zswap_writeback_entry() or zswap_frontswap_invalidate page()
> -	 * has already done this for us if we are the last reference.
> -	 */
> -	/* free */
> -
> -	zswap_free_entry(tree, entry);
> -
>  	return 0;
>  }
>  
> @@ -771,19 +766,12 @@ static void zswap_frontswap_invalidate_page(unsigned type, pgoff_t offset)
>  
>  	/* remove from rbtree */
>  	rb_erase(&entry->rbnode, &tree->rbroot);
> +	RB_CLEAR_NODE(&entry->rbnode);
>  
>  	/* drop the initial reference from entry creation */
> -	refcount = zswap_entry_put(entry);
> +	zswap_entry_put(tree, entry);
>  
>  	spin_unlock(&tree->lock);
> -
> -	if (refcount) {
> -		/* writeback in progress, writeback will free */
> -		return;
> -	}
> -
> -	/* free */
> -	zswap_free_entry(tree, entry);
>  }
>  
>  /* frees all zswap entries for the given swap type */
> 

-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 2/3] mm/zswap: bugfix: memory leak when invalidate and reclaim occur concurrently
  2013-09-26  3:42   ` Weijie Yang
  2013-10-10 19:54     ` Andrew Morton
  2013-10-11  7:13     ` Minchan Kim
@ 2013-10-12  2:32     ` Bob Liu
  2013-10-12  8:45       ` Weijie Yang
  2 siblings, 1 reply; 14+ messages in thread
From: Bob Liu @ 2013-10-12  2:32 UTC (permalink / raw)
  To: Weijie Yang
  Cc: Minchan Kim, Andrew Morton, Seth Jennings, Bob Liu, Weijie Yang,
	Linux-MM, Linux-Kernel, stable, d.j.shin, heesub.shin,
	Kyungmin Park, hau.chen, bifeng.tong, rui.xie

On Thu, Sep 26, 2013 at 11:42 AM, Weijie Yang <weijie.yang@samsung.com> wrote:
> On Tue, Sep 24, 2013 at 9:03 AM, Minchan Kim <minchan@kernel.org> wrote:
>> On Mon, Sep 23, 2013 at 04:21:49PM +0800, Weijie Yang wrote:
>> >
>> > Modify:
>> >  - check the refcount in fail path, free memory if it is not referenced.
>>
>> Hmm, I don't like this because zswap refcount routine is already mess for me.
>> I'm not sure why it was designed from the beginning. I hope we should fix it first.
>>
>> 1. zswap_rb_serach could include zswap_entry_get semantic if it founds a entry from
>>    the tree. Of course, we should ranme it as find_get_zswap_entry like find_get_page.
>> 2. zswap_entry_put could hide resource free function like zswap_free_entry so that
>>    all of caller can use it easily following pattern.
>>
>>   find_get_zswap_entry
>>   ...
>>   ...
>>   zswap_entry_put
>>
>> Of course, zswap_entry_put have to check the entry is in the tree or not
>> so if someone already removes it from the tree, it should avoid double remove.
>>
>> One of the concern I can think is that approach extends critical section
>> but I think it would be no problem because more bottleneck would be [de]compress
>> functions. If it were really problem, we can mitigate a problem with moving
>> unnecessary functions out of zswap_free_entry because it seem to be rather
>> over-enginnering.
>
> I refactor the zswap refcount routine according to Minchan's idea.
> Here is the new patch, Any suggestion is welcomed.
>
> To Seth and Bob, would you please review it again?
>

I have nothing to add to Minchan's review.

Since the code is a bit complex, I'd suggest you split it into two patches:
[1/2]: fix the memory leak
[2/2]: clean up the entry_put

And run some testing.

Thanks,
-Bob


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 2/3] mm/zswap: bugfix: memory leak when invalidate and reclaim occur concurrently
  2013-10-11  7:13     ` Minchan Kim
@ 2013-10-12  2:50       ` Bob Liu
  2013-10-12  9:14         ` Weijie Yang
  2013-10-12  8:41       ` Weijie Yang
  2013-10-12  9:37       ` Weijie Yang
  2 siblings, 1 reply; 14+ messages in thread
From: Bob Liu @ 2013-10-12  2:50 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Weijie Yang, Andrew Morton, Seth Jennings, Bob Liu, Weijie Yang,
	Linux-MM, Linux-Kernel, stable, d.j.shin, heesub.shin,
	Kyungmin Park, hau.chen, bifeng.tong, rui.xie

On Fri, Oct 11, 2013 at 3:13 PM, Minchan Kim <minchan@kernel.org> wrote:
> On Thu, Sep 26, 2013 at 11:42:17AM +0800, Weijie Yang wrote:
>> On Tue, Sep 24, 2013 at 9:03 AM, Minchan Kim <minchan@kernel.org> wrote:
>> > On Mon, Sep 23, 2013 at 04:21:49PM +0800, Weijie Yang wrote:
>> > >
>> > > Modify:
>> > >  - check the refcount in fail path, free memory if it is not referenced.
>> >
>> > Hmm, I don't like this because zswap refcount routine is already mess for me.
>> > I'm not sure why it was designed from the beginning. I hope we should fix it first.
>> >
>> > 1. zswap_rb_serach could include zswap_entry_get semantic if it founds a entry from
>> >    the tree. Of course, we should ranme it as find_get_zswap_entry like find_get_page.
>> > 2. zswap_entry_put could hide resource free function like zswap_free_entry so that
>> >    all of caller can use it easily following pattern.
>> >
>> >   find_get_zswap_entry
>> >   ...
>> >   ...
>> >   zswap_entry_put
>> >
>> > Of course, zswap_entry_put have to check the entry is in the tree or not
>> > so if someone already removes it from the tree, it should avoid double remove.
>> >
>> > One of the concern I can think is that approach extends critical section
>> > but I think it would be no problem because more bottleneck would be [de]compress
>> > functions. If it were really problem, we can mitigate a problem with moving
>> > unnecessary functions out of zswap_free_entry because it seem to be rather
>> > over-enginnering.
>>
>> I refactor the zswap refcount routine according to Minchan's idea.
>> Here is the new patch, Any suggestion is welcomed.
>>
>> To Seth and Bob, would you please review it again?
>
> Yeah, Seth, Bob. You guys are right persons to review this because this
> scheme was suggested by me who is biased so it couldn't be a fair. ;-)
> But anyway, I will review code itself.
>
>>
>> mm/zswap.c |  116
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++----------------------------------------------------------------
>>  1 file changed, 52 insertions(+), 64 deletions(-)
>>
>> diff --git a/mm/zswap.c b/mm/zswap.c
>> old mode 100644
>> new mode 100755
>> index deda2b6..bd04910
>> --- a/mm/zswap.c
>> +++ b/mm/zswap.c
>> @@ -217,6 +217,7 @@ static struct zswap_entry *zswap_entry_cache_alloc(gfp_t gfp)
>>       if (!entry)
>>               return NULL;
>>       entry->refcount = 1;
>> +     RB_CLEAR_NODE(&entry->rbnode);
>>       return entry;
>>  }
>>
>> @@ -232,10 +233,20 @@ static void zswap_entry_get(struct zswap_entry *entry)
>>  }
>>
>>  /* caller must hold the tree lock */
>> -static int zswap_entry_put(struct zswap_entry *entry)
>> +static int zswap_entry_put(struct zswap_tree *tree, struct zswap_entry *entry)
>
> Why should we have return value? If we really need it, it mitigates
> get/put semantic's whole point so I'd like to just return void.
>
> Let me see.
>
>>  {
>> -     entry->refcount--;
>> -     return entry->refcount;
>> +     int refcount = --entry->refcount;
>> +
>> +     if (refcount <= 0) {
>
> Hmm, I don't like minus refcount, really.
> I hope we could do following as
>
>         BUG_ON(refcount < 0);
>         if (refcount == 0) {
>                 ...
>         }
>
>
>
>> +             if (!RB_EMPTY_NODE(&entry->rbnode)) {
>> +                     rb_erase(&entry->rbnode, &tree->rbroot);
>> +                     RB_CLEAR_NODE(&entry->rbnode);
>
> Minor,
> You could make new function zswap_rb_del or zswap_rb_remove which detach the node
> from rb tree and clear node because we have already zswap_rb_insert.
>
>
>> +             }
>> +
>> +             zswap_free_entry(tree, entry);
>> +     }
>> +
>> +     return refcount;
>>  }
>>
>>  /*********************************
>> @@ -258,6 +269,17 @@ static struct zswap_entry *zswap_rb_search(struct rb_root *root, pgoff_t offset)
>>       return NULL;
>>  }
>>
>
> Add function description.
>
>> +static struct zswap_entry *zswap_entry_find_get(struct rb_root *root, pgoff_t offset)
>> +{
>> +     struct zswap_entry *entry = NULL;
>> +
>> +     entry = zswap_rb_search(root, offset);
>> +     if (entry)
>> +             zswap_entry_get(entry);
>> +
>> +     return entry;
>> +}
>> +
>>  /*
>>   * In the case that a entry with the same offset is found, a pointer to
>>   * the existing entry is stored in dupentry and the function returns -EEXIST
>> @@ -387,7 +409,7 @@ static void zswap_free_entry(struct zswap_tree *tree, struct zswap_entry *entry)
>>  enum zswap_get_swap_ret {
>>       ZSWAP_SWAPCACHE_NEW,
>>       ZSWAP_SWAPCACHE_EXIST,
>> -     ZSWAP_SWAPCACHE_NOMEM
>> +     ZSWAP_SWAPCACHE_FAIL,
>>  };
>>
>>  /*
>> @@ -401,9 +423,9 @@ enum zswap_get_swap_ret {
>>   * added to the swap cache, and returned in retpage.
>>   *
>>   * If success, the swap cache page is returned in retpage
>> - * Returns 0 if page was already in the swap cache, page is not locked
>> - * Returns 1 if the new page needs to be populated, page is locked
>> - * Returns <0 on error
>> + * Returns ZSWAP_SWAPCACHE_EXIST if page was already in the swap cache
>> + * Returns ZSWAP_SWAPCACHE_NEW if the new page needs to be populated, page is locked
>> + * Returns ZSWAP_SWAPCACHE_FAIL on error
>>   */
>>  static int zswap_get_swap_cache_page(swp_entry_t entry,
>>                               struct page **retpage)
>> @@ -475,7 +497,7 @@ static int zswap_get_swap_cache_page(swp_entry_t entry,
>>       if (new_page)
>>               page_cache_release(new_page);
>>       if (!found_page)
>> -             return ZSWAP_SWAPCACHE_NOMEM;
>> +             return ZSWAP_SWAPCACHE_FAIL;
>>       *retpage = found_page;
>>       return ZSWAP_SWAPCACHE_EXIST;
>>  }
>> @@ -517,23 +539,22 @@ static int zswap_writeback_entry(struct zbud_pool *pool, unsigned long handle)
>>
>>       /* find and ref zswap entry */
>>       spin_lock(&tree->lock);
>> -     entry = zswap_rb_search(&tree->rbroot, offset);
>> +     entry = zswap_entry_find_get(&tree->rbroot, offset);
>>       if (!entry) {
>>               /* entry was invalidated */
>>               spin_unlock(&tree->lock);
>>               return 0;
>>       }
>> -     zswap_entry_get(entry);
>>       spin_unlock(&tree->lock);
>>       BUG_ON(offset != entry->offset);
>>
>>       /* try to allocate swap cache page */
>>       switch (zswap_get_swap_cache_page(swpentry, &page)) {
>> -     case ZSWAP_SWAPCACHE_NOMEM: /* no memory */
>> +     case ZSWAP_SWAPCACHE_FAIL: /* no memory or invalidate happened */
>>               ret = -ENOMEM;
>>               goto fail;
>>
>> -     case ZSWAP_SWAPCACHE_EXIST: /* page is unlocked */
>> +     case ZSWAP_SWAPCACHE_EXIST:
>
> Why did you remove comment?
>
>>               /* page is already in the swap cache, ignore for now */
>>               page_cache_release(page);
>>               ret = -EEXIST;
>> @@ -562,38 +583,28 @@ static int zswap_writeback_entry(struct zbud_pool *pool, unsigned long handle)
>>       zswap_written_back_pages++;
>>
>>       spin_lock(&tree->lock);
>> -
>>       /* drop local reference */
>> -     zswap_entry_put(entry);
>> +     refcount = zswap_entry_put(tree, entry);
>>       /* drop the initial reference from entry creation */
>> -     refcount = zswap_entry_put(entry);
>> -
>> -     /*
>> -      * There are three possible values for refcount here:
>> -      * (1) refcount is 1, load is in progress, unlink from rbtree,
>> -      *     load will free
>> -      * (2) refcount is 0, (normal case) entry is valid,
>> -      *     remove from rbtree and free entry
>> -      * (3) refcount is -1, invalidate happened during writeback,
>> -      *     free entry
>> -      */
>> -     if (refcount >= 0) {
>> -             /* no invalidate yet, remove from rbtree */
>> +     if (refcount > 0) {
>>               rb_erase(&entry->rbnode, &tree->rbroot);
>> +             RB_CLEAR_NODE(&entry->rbnode);
>> +             refcount = zswap_entry_put(tree, entry);
>
> Now, I see why you need return in zswap_entry_put but let's consider again
> because it's really mess to me and it hurts get/put semantic a lot so
> How about this?
>
>         spin_lock(&tree->lock);
>         /* drop local reference */
>         zswap_entry_put(tree, entry);
>         /*
>          * In here, we want to free entry but invalidation may free earlier
>          * under us so that we should check it again
>          */
>         if (entry == zswap_rb_search(&tree->rb_root, offset))

Then where is the place to unlink the entry from the rbtree if a load was in progress?

And in the following fail path, the return value from zswap_entry_put() is
also used.

>                 /* Yes, it's stable so we should free it */
>                 zswap_entry_put(tree, entry);
>
>         /*
>          * Whether it would be freed by invalidation or writeback, it doesn't
>          * matter. Important thing is that it will be freed so there
>          * is no point to return -EAGAIN.
>          */
>         spin_unlock(&tree->lock);
>         return 0;
>

-- 
Regards,
--Bob


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 2/3] mm/zswap: bugfix: memory leak when invalidate and reclaim occur concurrently
  2013-10-11  7:13     ` Minchan Kim
  2013-10-12  2:50       ` Bob Liu
@ 2013-10-12  8:41       ` Weijie Yang
  2013-10-12  9:37       ` Weijie Yang
  2 siblings, 0 replies; 14+ messages in thread
From: Weijie Yang @ 2013-10-12  8:41 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Weijie Yang, akpm, sjenning, bob.liu, linux-mm, linux-kernel,
	stable, d.j.shin, heesub.shin, kyungmin.park, hau.chen,
	bifeng.tong, rui.xie

On Fri, Oct 11, 2013 at 3:13 PM, Minchan Kim <minchan@kernel.org> wrote:
> On Thu, Sep 26, 2013 at 11:42:17AM +0800, Weijie Yang wrote:
>> On Tue, Sep 24, 2013 at 9:03 AM, Minchan Kim <minchan@kernel.org> wrote:
>> > On Mon, Sep 23, 2013 at 04:21:49PM +0800, Weijie Yang wrote:
>> > >
>> > > Modify:
>> > >  - check the refcount in fail path, free memory if it is not referenced.
>> >
>> > Hmm, I don't like this because zswap refcount routine is already mess for me.
>> > I'm not sure why it was designed from the beginning. I hope we should fix it first.
>> >
>> > 1. zswap_rb_serach could include zswap_entry_get semantic if it founds a entry from
>> >    the tree. Of course, we should ranme it as find_get_zswap_entry like find_get_page.
>> > 2. zswap_entry_put could hide resource free function like zswap_free_entry so that
>> >    all of caller can use it easily following pattern.
>> >
>> >   find_get_zswap_entry
>> >   ...
>> >   ...
>> >   zswap_entry_put
>> >
>> > Of course, zswap_entry_put have to check the entry is in the tree or not
>> > so if someone already removes it from the tree, it should avoid double remove.
>> >
>> > One of the concern I can think is that approach extends critical section
>> > but I think it would be no problem because more bottleneck would be [de]compress
>> > functions. If it were really problem, we can mitigate a problem with moving
>> > unnecessary functions out of zswap_free_entry because it seem to be rather
>> > over-enginnering.
>>
>> I refactor the zswap refcount routine according to Minchan's idea.
>> Here is the new patch, Any suggestion is welcomed.
>>
>> To Seth and Bob, would you please review it again?
>
> Yeah, Seth, Bob. You guys are right persons to review this because this
> scheme was suggested by me who is biased so it couldn't be a fair. ;-)
> But anyway, I will review code itself.

Thanks for your careful review and suggestion.

>>
>> mm/zswap.c |  116
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++----------------------------------------------------------------
>>  1 file changed, 52 insertions(+), 64 deletions(-)
>>
>> diff --git a/mm/zswap.c b/mm/zswap.c
>> old mode 100644
>> new mode 100755
>> index deda2b6..bd04910
>> --- a/mm/zswap.c
>> +++ b/mm/zswap.c
>> @@ -217,6 +217,7 @@ static struct zswap_entry *zswap_entry_cache_alloc(gfp_t gfp)
>>       if (!entry)
>>               return NULL;
>>       entry->refcount = 1;
>> +     RB_CLEAR_NODE(&entry->rbnode);
>>       return entry;
>>  }
>>
>> @@ -232,10 +233,20 @@ static void zswap_entry_get(struct zswap_entry *entry)
>>  }
>>
>>  /* caller must hold the tree lock */
>> -static int zswap_entry_put(struct zswap_entry *entry)
>> +static int zswap_entry_put(struct zswap_tree *tree, struct zswap_entry *entry)
>
> Why should we have return value? If we really need it, it mitigates
> get/put semantic's whole point so I'd like to just return void.
>
> Let me see.
>
>>  {
>> -     entry->refcount--;
>> -     return entry->refcount;
>> +     int refcount = --entry->refcount;
>> +
>> +     if (refcount <= 0) {
>
> Hmm, I don't like minus refcount, really.
> I hope we could do following as

It is not quite the common get/put semantic.
As invalidate and reclaim can be called concurrently,
the refcount can become negative, so we have to check
the refcount and, at the same time, handle whether the
entry is still on the tree.
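
For reference, the interleaving that produced the negative count under the
old calling convention (writeback dropping both its local and the initial
reference unconditionally, as the removed comment further below shows) can
be traced with a trivial model; this is illustrative C only, not kernel code:

	#include <stdio.h>

	int main(void)
	{
		int refcount = 1; /* initial reference from entry creation   */

		refcount++;       /* writeback takes its local reference     */
		refcount--;       /* invalidate drops the initial reference  */
		refcount--;       /* writeback drops its local reference     */
		refcount--;       /* writeback drops the "initial" reference */

		printf("refcount = %d\n", refcount); /* prints -1 */
		return 0;
	}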

>         BUG_ON(refcount < 0);
>         if (refcount == 0) {
>                 ...
>         }
>
>
>
>> +             if (!RB_EMPTY_NODE(&entry->rbnode)) {
>> +                     rb_erase(&entry->rbnode, &tree->rbroot);
>> +                     RB_CLEAR_NODE(&entry->rbnode);
>
> Minor,
> You could make new function zswap_rb_del or zswap_rb_remove which detach the node
> from rb tree and clear node because we have already zswap_rb_insert.

yes.
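
For example, something like this (a sketch only; Minchan suggested the name
zswap_rb_del or zswap_rb_remove, the caller must hold tree->lock, and rbnode
is assumed to be cleared at allocation as in the patch above):

	/* detach the entry from the rbtree if it is still linked */
	static void zswap_rb_remove(struct rb_root *root,
				    struct zswap_entry *entry)
	{
		if (!RB_EMPTY_NODE(&entry->rbnode)) {
			rb_erase(&entry->rbnode, root);
			RB_CLEAR_NODE(&entry->rbnode);
		}
	}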

>
>> +             }
>> +
>> +             zswap_free_entry(tree, entry);
>> +     }
>> +
>> +     return refcount;
>>  }
>>
>>  /*********************************
>> @@ -258,6 +269,17 @@ static struct zswap_entry *zswap_rb_search(struct rb_root *root, pgoff_t offset)
>>       return NULL;
>>  }
>>
>
> Add function description.

ok.

>> +static struct zswap_entry *zswap_entry_find_get(struct rb_root *root, pgoff_t offset)
>> +{
>> +     struct zswap_entry *entry = NULL;
>> +
>> +     entry = zswap_rb_search(root, offset);
>> +     if (entry)
>> +             zswap_entry_get(entry);
>> +
>> +     return entry;
>> +}
>> +
>>  /*
>>   * In the case that a entry with the same offset is found, a pointer to
>>   * the existing entry is stored in dupentry and the function returns -EEXIST
>> @@ -387,7 +409,7 @@ static void zswap_free_entry(struct zswap_tree *tree, struct zswap_entry *entry)
>>  enum zswap_get_swap_ret {
>>       ZSWAP_SWAPCACHE_NEW,
>>       ZSWAP_SWAPCACHE_EXIST,
>> -     ZSWAP_SWAPCACHE_NOMEM
>> +     ZSWAP_SWAPCACHE_FAIL,
>>  };
>>
>>  /*
>> @@ -401,9 +423,9 @@ enum zswap_get_swap_ret {
>>   * added to the swap cache, and returned in retpage.
>>   *
>>   * If success, the swap cache page is returned in retpage
>> - * Returns 0 if page was already in the swap cache, page is not locked
>> - * Returns 1 if the new page needs to be populated, page is locked
>> - * Returns <0 on error
>> + * Returns ZSWAP_SWAPCACHE_EXIST if page was already in the swap cache
>> + * Returns ZSWAP_SWAPCACHE_NEW if the new page needs to be populated, page is locked
>> + * Returns ZSWAP_SWAPCACHE_FAIL on error
>>   */
>>  static int zswap_get_swap_cache_page(swp_entry_t entry,
>>                               struct page **retpage)
>> @@ -475,7 +497,7 @@ static int zswap_get_swap_cache_page(swp_entry_t entry,
>>       if (new_page)
>>               page_cache_release(new_page);
>>       if (!found_page)
>> -             return ZSWAP_SWAPCACHE_NOMEM;
>> +             return ZSWAP_SWAPCACHE_FAIL;
>>       *retpage = found_page;
>>       return ZSWAP_SWAPCACHE_EXIST;
>>  }
>> @@ -517,23 +539,22 @@ static int zswap_writeback_entry(struct zbud_pool *pool, unsigned long handle)
>>
>>       /* find and ref zswap entry */
>>       spin_lock(&tree->lock);
>> -     entry = zswap_rb_search(&tree->rbroot, offset);
>> +     entry = zswap_entry_find_get(&tree->rbroot, offset);
>>       if (!entry) {
>>               /* entry was invalidated */
>>               spin_unlock(&tree->lock);
>>               return 0;
>>       }
>> -     zswap_entry_get(entry);
>>       spin_unlock(&tree->lock);
>>       BUG_ON(offset != entry->offset);
>>
>>       /* try to allocate swap cache page */
>>       switch (zswap_get_swap_cache_page(swpentry, &page)) {
>> -     case ZSWAP_SWAPCACHE_NOMEM: /* no memory */
>> +     case ZSWAP_SWAPCACHE_FAIL: /* no memory or invalidate happened */
>>               ret = -ENOMEM;
>>               goto fail;
>>
>> -     case ZSWAP_SWAPCACHE_EXIST: /* page is unlocked */
>> +     case ZSWAP_SWAPCACHE_EXIST:
>
> Why did you remove comment?

Sorry for that, I intended to move them to zswap_entry_put(), but I forgot.

>>               /* page is already in the swap cache, ignore for now */
>>               page_cache_release(page);
>>               ret = -EEXIST;
>> @@ -562,38 +583,28 @@ static int zswap_writeback_entry(struct zbud_pool *pool, unsigned long handle)
>>       zswap_written_back_pages++;
>>
>>       spin_lock(&tree->lock);
>> -
>>       /* drop local reference */
>> -     zswap_entry_put(entry);
>> +     refcount = zswap_entry_put(tree, entry);
>>       /* drop the initial reference from entry creation */
>> -     refcount = zswap_entry_put(entry);
>> -
>> -     /*
>> -      * There are three possible values for refcount here:
>> -      * (1) refcount is 1, load is in progress, unlink from rbtree,
>> -      *     load will free
>> -      * (2) refcount is 0, (normal case) entry is valid,
>> -      *     remove from rbtree and free entry
>> -      * (3) refcount is -1, invalidate happened during writeback,
>> -      *     free entry
>> -      */
>> -     if (refcount >= 0) {
>> -             /* no invalidate yet, remove from rbtree */
>> +     if (refcount > 0) {
>>               rb_erase(&entry->rbnode, &tree->rbroot);
>> +             RB_CLEAR_NODE(&entry->rbnode);
>> +             refcount = zswap_entry_put(tree, entry);
>
> Now, I see why you need return in zswap_entry_put but let's consider again
> because it's really mess to me and it hurts get/put semantic a lot so
> How about this?

It is a better way to avoid a negative refcount!

>         spin_lock(&tree->lock);
>         /* drop local reference */
>         zswap_entry_put(tree, entry);
>         /*
>          * In here, we want to free entry but invalidation may free earlier
>          * under us so that we should check it again
>          */
>         if (entry == zswap_rb_search(&tree->rb_root, offset))
>                 /* Yes, it's stable so we should free it */
>                 zswap_entry_put(tree, entry);
>
>         /*
>          * Whether it would be freed by invalidation or writeback, it doesn't
>          * matter. Important thing is that it will be freed so there
>          * is no point to return -EAGAIN.
>          */
>         spin_unlock(&tree->lock);
>         return 0;
>
>>       }
>>       spin_unlock(&tree->lock);
>> -     if (refcount <= 0) {
>> -             /* free the entry */
>> -             zswap_free_entry(tree, entry);
>> -             return 0;
>> -     }
>> -     return -EAGAIN;
>> +
>> +     goto end;
>>
>>  fail:
>>       spin_lock(&tree->lock);
>> -     zswap_entry_put(entry);
>> +     refcount = zswap_entry_put(tree, entry);
>>       spin_unlock(&tree->lock);
>> -     return ret;
>> +
>> +end:
>> +     if (refcount <= 0)
>> +             return 0;
>> +     else
>> +             return -EAGAIN;
>>  }
>>
>>  /*********************************
>> @@ -677,10 +688,8 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
>>                       zswap_duplicate_entry++;
>>                       /* remove from rbtree */
>>                       rb_erase(&dupentry->rbnode, &tree->rbroot);
>> -                     if (!zswap_entry_put(dupentry)) {
>> -                             /* free */
>> -                             zswap_free_entry(tree, dupentry);
>> -                     }
>> +                     RB_CLEAR_NODE(&dupentry->rbnode);
>> +                     zswap_entry_put(tree, dupentry);
>>               }
>>       } while (ret == -EEXIST);
>>       spin_unlock(&tree->lock);
>> @@ -713,13 +722,12 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset,
>>
>>       /* find */
>>       spin_lock(&tree->lock);
>> -     entry = zswap_rb_search(&tree->rbroot, offset);
>> +     entry = zswap_entry_find_get(&tree->rbroot, offset);
>>       if (!entry) {
>>               /* entry was written back */
>>               spin_unlock(&tree->lock);
>>               return -1;
>>       }
>> -     zswap_entry_get(entry);
>>       spin_unlock(&tree->lock);
>>
>>       /* decompress */
>> @@ -734,22 +742,9 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset,
>>       BUG_ON(ret);
>>
>>       spin_lock(&tree->lock);
>> -     refcount = zswap_entry_put(entry);
>> -     if (likely(refcount)) {
>> -             spin_unlock(&tree->lock);
>> -             return 0;
>> -     }
>> +     zswap_entry_put(tree, entry);
>>       spin_unlock(&tree->lock);
>>
>> -     /*
>> -      * We don't have to unlink from the rbtree because
>> -      * zswap_writeback_entry() or zswap_frontswap_invalidate page()
>> -      * has already done this for us if we are the last reference.
>> -      */
>> -     /* free */
>> -
>> -     zswap_free_entry(tree, entry);
>> -
>>       return 0;
>>  }
>>
>> @@ -771,19 +766,12 @@ static void zswap_frontswap_invalidate_page(unsigned type, pgoff_t offset)
>>
>>       /* remove from rbtree */
>>       rb_erase(&entry->rbnode, &tree->rbroot);
>> +     RB_CLEAR_NODE(&entry->rbnode);
>>
>>       /* drop the initial reference from entry creation */
>> -     refcount = zswap_entry_put(entry);
>> +     zswap_entry_put(tree, entry);
>>
>>       spin_unlock(&tree->lock);
>> -
>> -     if (refcount) {
>> -             /* writeback in progress, writeback will free */
>> -             return;
>> -     }
>> -
>> -     /* free */
>> -     zswap_free_entry(tree, entry);
>>  }
>>
>>  /* frees all zswap entries for the given swap type */
>>
>
> --
> Kind regards,
> Minchan Kim


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 2/3] mm/zswap: bugfix: memory leak when invalidate and reclaim occur concurrently
  2013-10-12  2:32     ` Bob Liu
@ 2013-10-12  8:45       ` Weijie Yang
  0 siblings, 0 replies; 14+ messages in thread
From: Weijie Yang @ 2013-10-12  8:45 UTC (permalink / raw)
  To: Bob Liu
  Cc: Weijie Yang, Minchan Kim, Andrew Morton, Seth Jennings, Bob Liu,
	Linux-MM, Linux-Kernel, stable, d.j.shin, heesub.shin,
	Kyungmin Park, hau.chen, bifeng.tong, rui.xie

On Sat, Oct 12, 2013 at 10:32 AM, Bob Liu <lliubbo@gmail.com> wrote:
> On Thu, Sep 26, 2013 at 11:42 AM, Weijie Yang <weijie.yang@samsung.com> wrote:
>> On Tue, Sep 24, 2013 at 9:03 AM, Minchan Kim <minchan@kernel.org> wrote:
>>> On Mon, Sep 23, 2013 at 04:21:49PM +0800, Weijie Yang wrote:
>>> >
>>> > Modify:
>>> >  - check the refcount in fail path, free memory if it is not referenced.
>>>
>>> Hmm, I don't like this because zswap refcount routine is already mess for me.
>>> I'm not sure why it was designed from the beginning. I hope we should fix it first.
>>>
>>> 1. zswap_rb_serach could include zswap_entry_get semantic if it founds a entry from
>>>    the tree. Of course, we should ranme it as find_get_zswap_entry like find_get_page.
>>> 2. zswap_entry_put could hide resource free function like zswap_free_entry so that
>>>    all of caller can use it easily following pattern.
>>>
>>>   find_get_zswap_entry
>>>   ...
>>>   ...
>>>   zswap_entry_put
>>>
>>> Of course, zswap_entry_put have to check the entry is in the tree or not
>>> so if someone already removes it from the tree, it should avoid double remove.
>>>
>>> One of the concern I can think is that approach extends critical section
>>> but I think it would be no problem because more bottleneck would be [de]compress
>>> functions. If it were really problem, we can mitigate a problem with moving
>>> unnecessary functions out of zswap_free_entry because it seem to be rather
>>> over-enginnering.
>>
>> I refactor the zswap refcount routine according to Minchan's idea.
>> Here is the new patch, Any suggestion is welcomed.
>>
>> To Seth and Bob, would you please review it again?
>>
>
> I have nothing in addition to Minchan's review.
>
> Since the code is a bit complex, I'd suggest you to split it into two patches.
> [1/2]: fix the memory leak
> [2/2]: clean up the entry_put
>
> And run some testing..

I will split and test it.

Thanks.

> Thanks,
> -Bob


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 2/3] mm/zswap: bugfix: memory leak when invalidate and reclaim occur concurrently
  2013-10-12  2:50       ` Bob Liu
@ 2013-10-12  9:14         ` Weijie Yang
  2013-10-12  9:29           ` Bob Liu
  0 siblings, 1 reply; 14+ messages in thread
From: Weijie Yang @ 2013-10-12  9:14 UTC (permalink / raw)
  To: Bob Liu
  Cc: Minchan Kim, Weijie Yang, Andrew Morton, Seth Jennings, Bob Liu,
	Linux-MM, Linux-Kernel, stable, d.j.shin, heesub.shin,
	Kyungmin Park, hau.chen, bifeng.tong, rui.xie

On Sat, Oct 12, 2013 at 10:50 AM, Bob Liu <lliubbo@gmail.com> wrote:
> On Fri, Oct 11, 2013 at 3:13 PM, Minchan Kim <minchan@kernel.org> wrote:
>> On Thu, Sep 26, 2013 at 11:42:17AM +0800, Weijie Yang wrote:
>>> On Tue, Sep 24, 2013 at 9:03 AM, Minchan Kim <minchan@kernel.org> wrote:
>>> > On Mon, Sep 23, 2013 at 04:21:49PM +0800, Weijie Yang wrote:
>>> > >
>>> > > Modify:
>>> > >  - check the refcount in fail path, free memory if it is not referenced.
>>> >
>>> > Hmm, I don't like this because zswap refcount routine is already mess for me.
>>> > I'm not sure why it was designed from the beginning. I hope we should fix it first.
>>> >
>>> > 1. zswap_rb_serach could include zswap_entry_get semantic if it founds a entry from
>>> >    the tree. Of course, we should ranme it as find_get_zswap_entry like find_get_page.
>>> > 2. zswap_entry_put could hide resource free function like zswap_free_entry so that
>>> >    all of caller can use it easily following pattern.
>>> >
>>> >   find_get_zswap_entry
>>> >   ...
>>> >   ...
>>> >   zswap_entry_put
>>> >
>>> > Of course, zswap_entry_put have to check the entry is in the tree or not
>>> > so if someone already removes it from the tree, it should avoid double remove.
>>> >
>>> > One of the concern I can think is that approach extends critical section
>>> > but I think it would be no problem because more bottleneck would be [de]compress
>>> > functions. If it were really problem, we can mitigate a problem with moving
>>> > unnecessary functions out of zswap_free_entry because it seem to be rather
>>> > over-enginnering.
>>>
>>> I refactor the zswap refcount routine according to Minchan's idea.
>>> Here is the new patch, Any suggestion is welcomed.
>>>
>>> To Seth and Bob, would you please review it again?
>>
>> Yeah, Seth, Bob. You guys are right persons to review this because this
>> scheme was suggested by me who is biased so it couldn't be a fair. ;-)
>> But anyway, I will review code itself.
>>
>>>
>>> mm/zswap.c |  116
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++----------------------------------------------------------------
>>>  1 file changed, 52 insertions(+), 64 deletions(-)
>>>
>>> diff --git a/mm/zswap.c b/mm/zswap.c
>>> old mode 100644
>>> new mode 100755
>>> index deda2b6..bd04910
>>> --- a/mm/zswap.c
>>> +++ b/mm/zswap.c
>>> @@ -217,6 +217,7 @@ static struct zswap_entry *zswap_entry_cache_alloc(gfp_t gfp)
>>>       if (!entry)
>>>               return NULL;
>>>       entry->refcount = 1;
>>> +     RB_CLEAR_NODE(&entry->rbnode);
>>>       return entry;
>>>  }
>>>
>>> @@ -232,10 +233,20 @@ static void zswap_entry_get(struct zswap_entry *entry)
>>>  }
>>>
>>>  /* caller must hold the tree lock */
>>> -static int zswap_entry_put(struct zswap_entry *entry)
>>> +static int zswap_entry_put(struct zswap_tree *tree, struct zswap_entry *entry)
>>
>> Why should we have return value? If we really need it, it mitigates
>> get/put semantic's whole point so I'd like to just return void.
>>
>> Let me see.
>>
>>>  {
>>> -     entry->refcount--;
>>> -     return entry->refcount;
>>> +     int refcount = --entry->refcount;
>>> +
>>> +     if (refcount <= 0) {
>>
>> Hmm, I don't like minus refcount, really.
>> I hope we could do following as
>>
>>         BUG_ON(refcount < 0);
>>         if (refcount == 0) {
>>                 ...
>>         }
>>
>>
>>
>>> +             if (!RB_EMPTY_NODE(&entry->rbnode)) {
>>> +                     rb_erase(&entry->rbnode, &tree->rbroot);
>>> +                     RB_CLEAR_NODE(&entry->rbnode);
>>
>> Minor,
>> You could make new function zswap_rb_del or zswap_rb_remove which detach the node
>> from rb tree and clear node because we have already zswap_rb_insert.
>>
>>
>>> +             }
>>> +
>>> +             zswap_free_entry(tree, entry);
>>> +     }
>>> +
>>> +     return refcount;
>>>  }
>>>
>>>  /*********************************
>>> @@ -258,6 +269,17 @@ static struct zswap_entry *zswap_rb_search(struct rb_root *root, pgoff_t offset)
>>>       return NULL;
>>>  }
>>>
>>
>> Add function description.
>>
>>> +static struct zswap_entry *zswap_entry_find_get(struct rb_root *root, pgoff_t offset)
>>> +{
>>> +     struct zswap_entry *entry = NULL;
>>> +
>>> +     entry = zswap_rb_search(root, offset);
>>> +     if (entry)
>>> +             zswap_entry_get(entry);
>>> +
>>> +     return entry;
>>> +}
>>> +
>>>  /*
>>>   * In the case that a entry with the same offset is found, a pointer to
>>>   * the existing entry is stored in dupentry and the function returns -EEXIST
>>> @@ -387,7 +409,7 @@ static void zswap_free_entry(struct zswap_tree *tree, struct zswap_entry *entry)
>>>  enum zswap_get_swap_ret {
>>>       ZSWAP_SWAPCACHE_NEW,
>>>       ZSWAP_SWAPCACHE_EXIST,
>>> -     ZSWAP_SWAPCACHE_NOMEM
>>> +     ZSWAP_SWAPCACHE_FAIL,
>>>  };
>>>
>>>  /*
>>> @@ -401,9 +423,9 @@ enum zswap_get_swap_ret {
>>>   * added to the swap cache, and returned in retpage.
>>>   *
>>>   * If success, the swap cache page is returned in retpage
>>> - * Returns 0 if page was already in the swap cache, page is not locked
>>> - * Returns 1 if the new page needs to be populated, page is locked
>>> - * Returns <0 on error
>>> + * Returns ZSWAP_SWAPCACHE_EXIST if page was already in the swap cache
>>> + * Returns ZSWAP_SWAPCACHE_NEW if the new page needs to be populated, page is locked
>>> + * Returns ZSWAP_SWAPCACHE_FAIL on error
>>>   */
>>>  static int zswap_get_swap_cache_page(swp_entry_t entry,
>>>                               struct page **retpage)
>>> @@ -475,7 +497,7 @@ static int zswap_get_swap_cache_page(swp_entry_t entry,
>>>       if (new_page)
>>>               page_cache_release(new_page);
>>>       if (!found_page)
>>> -             return ZSWAP_SWAPCACHE_NOMEM;
>>> +             return ZSWAP_SWAPCACHE_FAIL;
>>>       *retpage = found_page;
>>>       return ZSWAP_SWAPCACHE_EXIST;
>>>  }
>>> @@ -517,23 +539,22 @@ static int zswap_writeback_entry(struct zbud_pool *pool, unsigned long handle)
>>>
>>>       /* find and ref zswap entry */
>>>       spin_lock(&tree->lock);
>>> -     entry = zswap_rb_search(&tree->rbroot, offset);
>>> +     entry = zswap_entry_find_get(&tree->rbroot, offset);
>>>       if (!entry) {
>>>               /* entry was invalidated */
>>>               spin_unlock(&tree->lock);
>>>               return 0;
>>>       }
>>> -     zswap_entry_get(entry);
>>>       spin_unlock(&tree->lock);
>>>       BUG_ON(offset != entry->offset);
>>>
>>>       /* try to allocate swap cache page */
>>>       switch (zswap_get_swap_cache_page(swpentry, &page)) {
>>> -     case ZSWAP_SWAPCACHE_NOMEM: /* no memory */
>>> +     case ZSWAP_SWAPCACHE_FAIL: /* no memory or invalidate happened */
>>>               ret = -ENOMEM;
>>>               goto fail;
>>>
>>> -     case ZSWAP_SWAPCACHE_EXIST: /* page is unlocked */
>>> +     case ZSWAP_SWAPCACHE_EXIST:
>>
>> Why did you remove comment?
>>
>>>               /* page is already in the swap cache, ignore for now */
>>>               page_cache_release(page);
>>>               ret = -EEXIST;
>>> @@ -562,38 +583,28 @@ static int zswap_writeback_entry(struct zbud_pool *pool, unsigned long handle)
>>>       zswap_written_back_pages++;
>>>
>>>       spin_lock(&tree->lock);
>>> -
>>>       /* drop local reference */
>>> -     zswap_entry_put(entry);
>>> +     refcount = zswap_entry_put(tree, entry);
>>>       /* drop the initial reference from entry creation */
>>> -     refcount = zswap_entry_put(entry);
>>> -
>>> -     /*
>>> -      * There are three possible values for refcount here:
>>> -      * (1) refcount is 1, load is in progress, unlink from rbtree,
>>> -      *     load will free
>>> -      * (2) refcount is 0, (normal case) entry is valid,
>>> -      *     remove from rbtree and free entry
>>> -      * (3) refcount is -1, invalidate happened during writeback,
>>> -      *     free entry
>>> -      */
>>> -     if (refcount >= 0) {
>>> -             /* no invalidate yet, remove from rbtree */
>>> +     if (refcount > 0) {
>>>               rb_erase(&entry->rbnode, &tree->rbroot);
>>> +             RB_CLEAR_NODE(&entry->rbnode);
>>> +             refcount = zswap_entry_put(tree, entry);
>>
>> Now, I see why you need return in zswap_entry_put but let's consider again
>> because it's really mess to me and it hurts get/put semantic a lot so
>> How about this?
>>
>>         spin_lock(&tree->lock);
>>         /* drop local reference */
>>         zswap_entry_put(tree, entry);
>>         /*
>>          * In here, we want to free entry but invalidation may free earlier
>>          * under us so that we should check it again
>>          */
>>         if (entry == zswap_rb_search(&tree->rb_root, offset))
>
> Then where is the place unlink entry from rbtree if load was in progress ?

zswap_entry_put() has the unlink handling logic

> And in the following fail path, return value from zswap_entry_put() is
> also used.

It is okay even if we return -EAGAIN in the fail path

>>                 /* Yes, it's stable so we should free it */
>>                 zswap_entry_put(tree, entry);
>>
>>         /*
>>          * Whether it would be freed by invalidation or writeback, it doesn't
>>          * matter. Important thing is that it will be freed so there
>>          * is no point to return -EAGAIN.
>>          */
>>         spin_unlock(&tree->lock);
>>         return 0;
>>
>
> --
> Regards,
> --Bob


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 2/3] mm/zswap: bugfix: memory leak when invalidate and reclaim occur concurrently
  2013-10-12  9:14         ` Weijie Yang
@ 2013-10-12  9:29           ` Bob Liu
  0 siblings, 0 replies; 14+ messages in thread
From: Bob Liu @ 2013-10-12  9:29 UTC (permalink / raw)
  To: Weijie Yang
  Cc: Minchan Kim, Weijie Yang, Andrew Morton, Seth Jennings, Bob Liu,
	Linux-MM, Linux-Kernel, stable, d.j.shin, heesub.shin,
	Kyungmin Park, hau.chen, bifeng.tong, rui.xie

On Sat, Oct 12, 2013 at 5:14 PM, Weijie Yang <weijie.yang.kh@gmail.com> wrote:
> On Sat, Oct 12, 2013 at 10:50 AM, Bob Liu <lliubbo@gmail.com> wrote:
>> On Fri, Oct 11, 2013 at 3:13 PM, Minchan Kim <minchan@kernel.org> wrote:
>>> On Thu, Sep 26, 2013 at 11:42:17AM +0800, Weijie Yang wrote:
>>>> On Tue, Sep 24, 2013 at 9:03 AM, Minchan Kim <minchan@kernel.org> wrote:
>>>> > On Mon, Sep 23, 2013 at 04:21:49PM +0800, Weijie Yang wrote:
>>>> > >
>>>> > > Modify:
>>>> > >  - check the refcount in fail path, free memory if it is not referenced.
>>>> >
>>>> > Hmm, I don't like this because zswap refcount routine is already mess for me.
>>>> > I'm not sure why it was designed from the beginning. I hope we should fix it first.
>>>> >
>>>> > 1. zswap_rb_serach could include zswap_entry_get semantic if it founds a entry from
>>>> >    the tree. Of course, we should ranme it as find_get_zswap_entry like find_get_page.
>>>> > 2. zswap_entry_put could hide resource free function like zswap_free_entry so that
>>>> >    all of caller can use it easily following pattern.
>>>> >
>>>> >   find_get_zswap_entry
>>>> >   ...
>>>> >   ...
>>>> >   zswap_entry_put
>>>> >
>>>> > Of course, zswap_entry_put have to check the entry is in the tree or not
>>>> > so if someone already removes it from the tree, it should avoid double remove.
>>>> >
>>>> > One of the concern I can think is that approach extends critical section
>>>> > but I think it would be no problem because more bottleneck would be [de]compress
>>>> > functions. If it were really problem, we can mitigate a problem with moving
>>>> > unnecessary functions out of zswap_free_entry because it seem to be rather
>>>> > over-enginnering.
>>>>
>>>> I refactor the zswap refcount routine according to Minchan's idea.
>>>> Here is the new patch, Any suggestion is welcomed.
>>>>
>>>> To Seth and Bob, would you please review it again?
>>>
>>> Yeah, Seth, Bob. You guys are right persons to review this because this
>>> scheme was suggested by me who is biased so it couldn't be a fair. ;-)
>>> But anyway, I will review code itself.
>>>
>>>>
>>>> mm/zswap.c |  116
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++----------------------------------------------------------------
>>>>  1 file changed, 52 insertions(+), 64 deletions(-)
>>>>
>>>> diff --git a/mm/zswap.c b/mm/zswap.c
>>>> old mode 100644
>>>> new mode 100755
>>>> index deda2b6..bd04910
>>>> --- a/mm/zswap.c
>>>> +++ b/mm/zswap.c
>>>> @@ -217,6 +217,7 @@ static struct zswap_entry *zswap_entry_cache_alloc(gfp_t gfp)
>>>>       if (!entry)
>>>>               return NULL;
>>>>       entry->refcount = 1;
>>>> +     RB_CLEAR_NODE(&entry->rbnode);
>>>>       return entry;
>>>>  }
>>>>
>>>> @@ -232,10 +233,20 @@ static void zswap_entry_get(struct zswap_entry *entry)
>>>>  }
>>>>
>>>>  /* caller must hold the tree lock */
>>>> -static int zswap_entry_put(struct zswap_entry *entry)
>>>> +static int zswap_entry_put(struct zswap_tree *tree, struct zswap_entry *entry)
>>>
>>> Why should we have return value? If we really need it, it mitigates
>>> get/put semantic's whole point so I'd like to just return void.
>>>
>>> Let me see.
>>>
>>>>  {
>>>> -     entry->refcount--;
>>>> -     return entry->refcount;
>>>> +     int refcount = --entry->refcount;
>>>> +
>>>> +     if (refcount <= 0) {
>>>
>>> Hmm, I don't like minus refcount, really.
>>> I hope we could do following as
>>>
>>>         BUG_ON(refcount < 0);
>>>         if (refcount == 0) {
>>>                 ...
>>>         }
>>>
>>>
>>>
>>>> +             if (!RB_EMPTY_NODE(&entry->rbnode)) {
>>>> +                     rb_erase(&entry->rbnode, &tree->rbroot);
>>>> +                     RB_CLEAR_NODE(&entry->rbnode);
>>>
>>> Minor,
>>> You could make new function zswap_rb_del or zswap_rb_remove which detach the node
>>> from rb tree and clear node because we have already zswap_rb_insert.
>>>
>>>
>>>> +             }
>>>> +
>>>> +             zswap_free_entry(tree, entry);
>>>> +     }
>>>> +
>>>> +     return refcount;
>>>>  }
>>>>
>>>>  /*********************************
>>>> @@ -258,6 +269,17 @@ static struct zswap_entry *zswap_rb_search(struct rb_root *root, pgoff_t offset)
>>>>       return NULL;
>>>>  }
>>>>
>>>
>>> Add function description.
>>>
>>>> +static struct zswap_entry *zswap_entry_find_get(struct rb_root *root, pgoff_t offset)
>>>> +{
>>>> +     struct zswap_entry *entry = NULL;
>>>> +
>>>> +     entry = zswap_rb_search(root, offset);
>>>> +     if (entry)
>>>> +             zswap_entry_get(entry);
>>>> +
>>>> +     return entry;
>>>> +}
>>>> +
>>>>  /*
>>>>   * In the case that a entry with the same offset is found, a pointer to
>>>>   * the existing entry is stored in dupentry and the function returns -EEXIST
>>>> @@ -387,7 +409,7 @@ static void zswap_free_entry(struct zswap_tree *tree, struct zswap_entry *entry)
>>>>  enum zswap_get_swap_ret {
>>>>       ZSWAP_SWAPCACHE_NEW,
>>>>       ZSWAP_SWAPCACHE_EXIST,
>>>> -     ZSWAP_SWAPCACHE_NOMEM
>>>> +     ZSWAP_SWAPCACHE_FAIL,
>>>>  };
>>>>
>>>>  /*
>>>> @@ -401,9 +423,9 @@ enum zswap_get_swap_ret {
>>>>   * added to the swap cache, and returned in retpage.
>>>>   *
>>>>   * If success, the swap cache page is returned in retpage
>>>> - * Returns 0 if page was already in the swap cache, page is not locked
>>>> - * Returns 1 if the new page needs to be populated, page is locked
>>>> - * Returns <0 on error
>>>> + * Returns ZSWAP_SWAPCACHE_EXIST if page was already in the swap cache
>>>> + * Returns ZSWAP_SWAPCACHE_NEW if the new page needs to be populated, page is locked
>>>> + * Returns ZSWAP_SWAPCACHE_FAIL on error
>>>>   */
>>>>  static int zswap_get_swap_cache_page(swp_entry_t entry,
>>>>                               struct page **retpage)
>>>> @@ -475,7 +497,7 @@ static int zswap_get_swap_cache_page(swp_entry_t entry,
>>>>       if (new_page)
>>>>               page_cache_release(new_page);
>>>>       if (!found_page)
>>>> -             return ZSWAP_SWAPCACHE_NOMEM;
>>>> +             return ZSWAP_SWAPCACHE_FAIL;
>>>>       *retpage = found_page;
>>>>       return ZSWAP_SWAPCACHE_EXIST;
>>>>  }
>>>> @@ -517,23 +539,22 @@ static int zswap_writeback_entry(struct zbud_pool *pool, unsigned long handle)
>>>>
>>>>       /* find and ref zswap entry */
>>>>       spin_lock(&tree->lock);
>>>> -     entry = zswap_rb_search(&tree->rbroot, offset);
>>>> +     entry = zswap_entry_find_get(&tree->rbroot, offset);
>>>>       if (!entry) {
>>>>               /* entry was invalidated */
>>>>               spin_unlock(&tree->lock);
>>>>               return 0;
>>>>       }
>>>> -     zswap_entry_get(entry);
>>>>       spin_unlock(&tree->lock);
>>>>       BUG_ON(offset != entry->offset);
>>>>
>>>>       /* try to allocate swap cache page */
>>>>       switch (zswap_get_swap_cache_page(swpentry, &page)) {
>>>> -     case ZSWAP_SWAPCACHE_NOMEM: /* no memory */
>>>> +     case ZSWAP_SWAPCACHE_FAIL: /* no memory or invalidate happened */
>>>>               ret = -ENOMEM;
>>>>               goto fail;
>>>>
>>>> -     case ZSWAP_SWAPCACHE_EXIST: /* page is unlocked */
>>>> +     case ZSWAP_SWAPCACHE_EXIST:
>>>
>>> Why did you remove comment?
>>>
>>>>               /* page is already in the swap cache, ignore for now */
>>>>               page_cache_release(page);
>>>>               ret = -EEXIST;
>>>> @@ -562,38 +583,28 @@ static int zswap_writeback_entry(struct zbud_pool *pool, unsigned long handle)
>>>>       zswap_written_back_pages++;
>>>>
>>>>       spin_lock(&tree->lock);
>>>> -
>>>>       /* drop local reference */
>>>> -     zswap_entry_put(entry);
>>>> +     refcount = zswap_entry_put(tree, entry);
>>>>       /* drop the initial reference from entry creation */
>>>> -     refcount = zswap_entry_put(entry);
>>>> -
>>>> -     /*
>>>> -      * There are three possible values for refcount here:
>>>> -      * (1) refcount is 1, load is in progress, unlink from rbtree,
>>>> -      *     load will free
>>>> -      * (2) refcount is 0, (normal case) entry is valid,
>>>> -      *     remove from rbtree and free entry
>>>> -      * (3) refcount is -1, invalidate happened during writeback,
>>>> -      *     free entry
>>>> -      */
>>>> -     if (refcount >= 0) {
>>>> -             /* no invalidate yet, remove from rbtree */
>>>> +     if (refcount > 0) {
>>>>               rb_erase(&entry->rbnode, &tree->rbroot);
>>>> +             RB_CLEAR_NODE(&entry->rbnode);
>>>> +             refcount = zswap_entry_put(tree, entry);
>>>
>>> Now, I see why you need return in zswap_entry_put but let's consider again
>>> because it's really mess to me and it hurts get/put semantic a lot so
>>> How about this?
>>>
>>>         spin_lock(&tree->lock);
>>>         /* drop local reference */
>>>         zswap_entry_put(tree, entry);
>>>         /*
>>>          * In here, we want to free entry but invalidation may free earlier
>>>          * under us so that we should check it again
>>>          */
>>>         if (entry == zswap_rb_search(&tree->rb_root, offset))
>>
>> Then where is the place unlink entry from rbtree if load was in progress ?
>
> zswap_entry_put() have the unlink handle logic
>

I see. Then how about using if (!RB_EMPTY_NODE(&entry->rbnode)) to
replace the rbtree search?
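
That is, something like this at the end of the writeback success path (a
sketch only; the membership check is done before either put, while tree->lock
is held, so the entry cannot be freed under us, and rbnode is assumed to be
cleared whenever the entry is off the tree):

	spin_lock(&tree->lock);
	/*
	 * If the entry is still linked, no invalidation ran under us,
	 * so the initial reference from entry creation is still held.
	 */
	if (!RB_EMPTY_NODE(&entry->rbnode))
		zswap_entry_put(tree, entry);	/* drop initial reference */
	zswap_entry_put(tree, entry);		/* drop local reference */
	spin_unlock(&tree->lock);
	return 0;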

>> And in the following fail path, return value from zswap_entry_put() is
>> also used.
>
> It is okay even if we return -EAGAIN in the fail path
>

-- 
Regards,
--Bob


^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: [PATCH v3 2/3] mm/zswap: bugfix: memory leak when invalidate and reclaim occur concurrently
  2013-10-11  7:13     ` Minchan Kim
  2013-10-12  2:50       ` Bob Liu
  2013-10-12  8:41       ` Weijie Yang
@ 2013-10-12  9:37       ` Weijie Yang
  2013-10-14  0:31         ` Minchan Kim
  2 siblings, 1 reply; 14+ messages in thread
From: Weijie Yang @ 2013-10-12  9:37 UTC (permalink / raw)
  To: 'Minchan Kim'
  Cc: akpm, sjenning, bob.liu, weijie.yang.kh, linux-mm, linux-kernel,
	stable

Hi, all

I have thought of a new way to resolve this problem: use CAS instead of a refcount.

It not only resolves this problem but also fixes another issue, and it can
easily be expanded to support a frontswap get_and_free mode. I already use it
in a zswap variant which writes back zbud pages directly.

If it is accepted, I will resend it instead of the previous refactor patch.

Please see below; request for comments. Thanks.

On Fri, Oct 11, 2013 at 3:13 PM, Minchan Kim <minchan@kernel.org> wrote:
> On Thu, Sep 26, 2013 at 11:42:17AM +0800, Weijie Yang wrote:
> > On Tue, Sep 24, 2013 at 9:03 AM, Minchan Kim <minchan@kernel.org> wrote:
> > > On Mon, Sep 23, 2013 at 04:21:49PM +0800, Weijie Yang wrote:
> > > >
> > > > Modify:
> > > >  - check the refcount in fail path, free memory if it is not referenced.
> > >
> > > Hmm, I don't like this because zswap refcount routine is already mess for me.
> > > I'm not sure why it was designed from the beginning. I hope we should fix it first.
> > >
> > > 1. zswap_rb_serach could include zswap_entry_get semantic if it founds a entry from
> > >    the tree. Of course, we should ranme it as find_get_zswap_entry like find_get_page.
> > > 2. zswap_entry_put could hide resource free function like zswap_free_entry so that
> > >    all of caller can use it easily following pattern.
> > >
> > >   find_get_zswap_entry
> > >   ...
> > >   ...
> > >   zswap_entry_put
> > >
> > > Of course, zswap_entry_put have to check the entry is in the tree or not
> > > so if someone already removes it from the tree, it should avoid double remove.
> > >
> > > One of the concern I can think is that approach extends critical section
> > > but I think it would be no problem because more bottleneck would be [de]compress
> > > functions. If it were really problem, we can mitigate a problem with moving
> > > unnecessary functions out of zswap_free_entry because it seem to be rather
> > > over-enginnering.
> >
> > I refactor the zswap refcount routine according to Minchan's idea.
> > Here is the new patch, Any suggestion is welcomed.
> >
> > To Seth and Bob, would you please review it again?
> 
> Yeah, Seth, Bob. You guys are right persons to review this because this
> scheme was suggested by me who is biased so it couldn't be a fair. ;-)
> But anyway, I will review code itself.

Consider the following scenario:
thread 0: reclaim entry x(get refcount, but not call zswap_get_swap_cache_page)
thread 1: call zswap_frontswap_invalidate_page to invalidate entry x.
	finished, entry x and its zbud is not freed as its refcount != 0
	now, the swap_map[x] = 0
thread 0: now call zswap_get_swap_cache_page
	swapcache_prepare return -ENOENT because entry x is not used any more
	zswap_get_swap_cache_page return ZSWAP_SWAPCACHE_NOMEM
	zswap_writeback_entry do nothing except put refcount
Now, the memory of zswap_entry x and its zpage leak.

Consider another scenario:
zswap entry x with offset A is already stored in zswap backend.
thread 0: reclaim entry x(get refcount, but not call zswap_get_swap_cache_page)
thread 1: store new page with the same offset A, alloc a new zswap entry y.
	This is a duplicate store and finished.
	shrink_page_list() finish, now the new page is not in swap_cache
thread 0: zswap_get_swap_cache_page get called.
	old page data is added to swap_cache
Now, swap cache has old data rather than new data for offset A.
Error will happen if do_swap_page() gets the page from swap_cache.

Modify:
 - re-design the zswap_entry concurrent protection by using a CAS-access flag
   instead of the refcount get/put semantic

 - use ZSWAP_SWAPCACHE_FAIL instead of ZSWAP_SWAPCACHE_NOMEM as the fail path
   can be not only caused by nomem but also by invalidate.

Signed-off-by: Weijie Yang <weijie.yang@samsung.com>
---
 mm/zswap.c |  177 +++++++++++++++++++++++++++---------------------------------
 1 file changed, 80 insertions(+), 97 deletions(-)
 mode change 100644 => 100755 mm/zswap.c

diff --git a/mm/zswap.c b/mm/zswap.c
old mode 100644
new mode 100755
index cbd9578..fcaaecb
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -158,21 +158,23 @@ static void zswap_comp_exit(void)
  * page within zswap.
  *
  * rbnode - links the entry into red-black tree for the appropriate swap type
- * refcount - the number of outstanding reference to the entry. This is needed
- *            to protect against premature freeing of the entry by code
- *            concurent calls to load, invalidate, and writeback.  The lock
- *            for the zswap_tree structure that contains the entry must
- *            be held while changing the refcount.  Since the lock must
- *            be held, there is no reason to also make refcount atomic.
  * offset - the swap offset for the entry.  Index into the red-black tree.
+ * pos - the position or status of this entry. see below macro definition
+ *    change it by CAS, we hold this entry on success or retry if fail
+ *    protect against concurrent access by load, invalidate or writeback
+ *    even though the probability of concurrent access is very small
  * handle - zsmalloc allocation handle that stores the compressed page data
  * length - the length in bytes of the compressed page data.  Needed during
  *           decompression
  */
+#define POS_POOL      1  /* stay in pool still */
+#define POS_LOAD      2  /* is loading */
+#define POS_RECLAIM   3  /* is reclaiming */
+#define POS_FLUSH     4  /* is invalidating */
 struct zswap_entry {
 	struct rb_node rbnode;
 	pgoff_t offset;
-	int refcount;
+	atomic_t pos;
 	unsigned int length;
 	unsigned long handle;
 };
@@ -216,7 +218,6 @@ static struct zswap_entry *zswap_entry_cache_alloc(gfp_t gfp)
 	entry = kmem_cache_alloc(zswap_entry_cache, gfp);
 	if (!entry)
 		return NULL;
-	entry->refcount = 1;
 	return entry;
 }
 
@@ -225,18 +226,6 @@ static void zswap_entry_cache_free(struct zswap_entry *entry)
 	kmem_cache_free(zswap_entry_cache, entry);
 }
 
-/* caller must hold the tree lock */
-static void zswap_entry_get(struct zswap_entry *entry)
-{
-	entry->refcount++;
-}
-
-/* caller must hold the tree lock */
-static int zswap_entry_put(struct zswap_entry *entry)
-{
-	entry->refcount--;
-	return entry->refcount;
-}
 
 /*********************************
 * rbtree functions
@@ -380,6 +369,23 @@ static void zswap_free_entry(struct zswap_tree *tree, struct zswap_entry *entry)
 	zswap_pool_pages = zbud_get_pool_size(tree->pool);
 }
 
+/* caller must hold the tree lock, entry is on the tree
+* seldom called fortunately
+* return 0: free this entry successfully
+* return < 0: fail
+*/
+static int zswap_invalidate_entry(struct zswap_tree *tree, struct zswap_entry *entry)
+{
+	if (atomic_cmpxchg(&entry->pos, POS_POOL, POS_FLUSH) == POS_POOL) {
+		rb_erase(&entry->rbnode, &tree->rbroot);
+		zswap_free_entry(tree, entry);
+		return 0;
+	}
+
+	/* encounter reclaim called by another thread */
+	return -1;
+}
+
 /*********************************
 * writeback code
 **********************************/
@@ -387,7 +393,7 @@ static void zswap_free_entry(struct zswap_tree *tree, struct zswap_entry *entry)
 enum zswap_get_swap_ret {
 	ZSWAP_SWAPCACHE_NEW,
 	ZSWAP_SWAPCACHE_EXIST,
-	ZSWAP_SWAPCACHE_NOMEM
+	ZSWAP_SWAPCACHE_FAIL,
 };
 
 /*
@@ -401,9 +407,10 @@ enum zswap_get_swap_ret {
  * added to the swap cache, and returned in retpage.
  *
  * If success, the swap cache page is returned in retpage
- * Returns 0 if page was already in the swap cache, page is not locked
- * Returns 1 if the new page needs to be populated, page is locked
- * Returns <0 on error
+ * Returns ZSWAP_SWAPCACHE_EXIST if page was already in the swap cache
+ * Returns ZSWAP_SWAPCACHE_NEW if the new page needs to be populated,
+ * 	the new page is added to swap cache and locked
+ * Returns ZSWAP_SWAPCACHE_FAIL on error
  */
 static int zswap_get_swap_cache_page(swp_entry_t entry,
 				struct page **retpage)
@@ -475,7 +482,7 @@ static int zswap_get_swap_cache_page(swp_entry_t entry,
 	if (new_page)
 		page_cache_release(new_page);
 	if (!found_page)
-		return ZSWAP_SWAPCACHE_NOMEM;
+		return ZSWAP_SWAPCACHE_FAIL;
 	*retpage = found_page;
 	return ZSWAP_SWAPCACHE_EXIST;
 }
@@ -502,7 +509,7 @@ static int zswap_writeback_entry(struct zbud_pool *pool, unsigned long handle)
 	struct page *page;
 	u8 *src, *dst;
 	unsigned int dlen;
-	int ret, refcount;
+	int ret;
 	struct writeback_control wbc = {
 		.sync_mode = WB_SYNC_NONE,
 	};
@@ -523,17 +530,23 @@ static int zswap_writeback_entry(struct zbud_pool *pool, unsigned long handle)
 		spin_unlock(&tree->lock);
 		return 0;
 	}
-	zswap_entry_get(entry);
+	/* maybe encounter load or invalidate called by another thread
+	* hold or give up this entry
+	*/
+	if (atomic_cmpxchg(&entry->pos, POS_POOL, POS_RECLAIM) != POS_POOL) {
+		spin_unlock(&tree->lock);
+		return -EAGAIN;
+	}
 	spin_unlock(&tree->lock);
 	BUG_ON(offset != entry->offset);
 
 	/* try to allocate swap cache page */
 	switch (zswap_get_swap_cache_page(swpentry, &page)) {
-	case ZSWAP_SWAPCACHE_NOMEM: /* no memory */
+	case ZSWAP_SWAPCACHE_FAIL:
 		ret = -ENOMEM;
 		goto fail;
 
-	case ZSWAP_SWAPCACHE_EXIST: /* page is unlocked */
+	case ZSWAP_SWAPCACHE_EXIST:
 		/* page is already in the swap cache, ignore for now */
 		page_cache_release(page);
 		ret = -EEXIST;
@@ -561,38 +574,23 @@ static int zswap_writeback_entry(struct zbud_pool *pool, unsigned long handle)
 	page_cache_release(page);
 	zswap_written_back_pages++;
 
+	/* once we hold this entry and get here, we can free entry safely
+	* during the period which we hold this entry:
+	* 1. if a thread call do_swap_page concurrently
+	* we will get ZSWAP_SWAPCACHE_EXIST returned or
+	* do_swap_page will hit page we added in the swap cache
+	* 2. if a thread call invalidate concurrently
+	* invalidate thread should wait until this function end
+	*/
 	spin_lock(&tree->lock);
-
-	/* drop local reference */
-	zswap_entry_put(entry);
-	/* drop the initial reference from entry creation */
-	refcount = zswap_entry_put(entry);
-
-	/*
-	 * There are three possible values for refcount here:
-	 * (1) refcount is 1, load is in progress, unlink from rbtree,
-	 *     load will free
-	 * (2) refcount is 0, (normal case) entry is valid,
-	 *     remove from rbtree and free entry
-	 * (3) refcount is -1, invalidate happened during writeback,
-	 *     free entry
-	 */
-	if (refcount >= 0) {
-		/* no invalidate yet, remove from rbtree */
-		rb_erase(&entry->rbnode, &tree->rbroot);
-	}
+	rb_erase(&entry->rbnode, &tree->rbroot);
 	spin_unlock(&tree->lock);
-	if (refcount <= 0) {
-		/* free the entry */
-		zswap_free_entry(tree, entry);
-		return 0;
-	}
-	return -EAGAIN;
+	zswap_free_entry(tree, entry);
+
+	return 0;
 
 fail:
-	spin_lock(&tree->lock);
-	zswap_entry_put(entry);
-	spin_unlock(&tree->lock);
+	BUG_ON(atomic_cmpxchg(&entry->pos, POS_RECLAIM, POS_POOL) != POS_RECLAIM);
 	return ret;
 }
 
@@ -668,19 +666,15 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
 	entry->offset = offset;
 	entry->handle = handle;
 	entry->length = dlen;
+	atomic_set(&entry->pos, POS_POOL);
 
 	/* map */
 	spin_lock(&tree->lock);
 	do {
 		ret = zswap_rb_insert(&tree->rbroot, entry, &dupentry);
-		if (ret == -EEXIST) {
-			zswap_duplicate_entry++;
-			/* remove from rbtree */
-			rb_erase(&dupentry->rbnode, &tree->rbroot);
-			if (!zswap_entry_put(dupentry)) {
-				/* free */
-				zswap_free_entry(tree, dupentry);
-			}
+		if (unlikely(ret == -EEXIST)) {
+			if (zswap_invalidate_entry(tree, dupentry) == 0)
+				zswap_duplicate_entry++;
 		}
 	} while (ret == -EEXIST);
 	spin_unlock(&tree->lock);
@@ -709,9 +703,10 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset,
 	struct zswap_entry *entry;
 	u8 *src, *dst;
 	unsigned int dlen;
-	int refcount, ret;
+	int ret;
 
 	/* find */
+find:
 	spin_lock(&tree->lock);
 	entry = zswap_rb_search(&tree->rbroot, offset);
 	if (!entry) {
@@ -719,7 +714,14 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset,
 		spin_unlock(&tree->lock);
 		return -1;
 	}
-	zswap_entry_get(entry);
+	/* encounter entry reclaimed by another thread
+	* just retry until that thread get ZSWAP_SWAPCACHE_EXIST returned
+	* as we hold the entry, we can do anything: get or get_and_free
+	*/
+	if (atomic_cmpxchg(&entry->pos, POS_POOL, POS_LOAD) != POS_POOL) {
+		spin_unlock(&tree->lock);
+		goto find;
+	}
 	spin_unlock(&tree->lock);
 
 	/* decompress */
@@ -733,22 +735,7 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset,
 	zbud_unmap(tree->pool, entry->handle);
 	BUG_ON(ret);
 
-	spin_lock(&tree->lock);
-	refcount = zswap_entry_put(entry);
-	if (likely(refcount)) {
-		spin_unlock(&tree->lock);
-		return 0;
-	}
-	spin_unlock(&tree->lock);
-
-	/*
-	 * We don't have to unlink from the rbtree because
-	 * zswap_writeback_entry() or zswap_frontswap_invalidate page()
-	 * has already done this for us if we are the last reference.
-	 */
-	/* free */
-
-	zswap_free_entry(tree, entry);
+	BUG_ON(atomic_cmpxchg(&entry->pos, POS_LOAD, POS_POOL) != POS_LOAD);
 
 	return 0;
 }
@@ -758,9 +745,9 @@ static void zswap_frontswap_invalidate_page(unsigned type, pgoff_t offset)
 {
 	struct zswap_tree *tree = zswap_trees[type];
 	struct zswap_entry *entry;
-	int refcount;
 
 	/* find */
+find:
 	spin_lock(&tree->lock);
 	entry = zswap_rb_search(&tree->rbroot, offset);
 	if (!entry) {
@@ -769,19 +756,18 @@ static void zswap_frontswap_invalidate_page(unsigned type, pgoff_t offset)
 		return;
 	}
 
+	/* encounter entry reclaimed by another thread
+	* just retry until that reclaim end.
+	*/
+	if (atomic_cmpxchg(&entry->pos, POS_POOL, POS_FLUSH) != POS_POOL) {
+		spin_unlock(&tree->lock);
+		goto find;
+	}
+
 	/* remove from rbtree */
 	rb_erase(&entry->rbnode, &tree->rbroot);
-
-	/* drop the initial reference from entry creation */
-	refcount = zswap_entry_put(entry);
-
 	spin_unlock(&tree->lock);
 
-	if (refcount) {
-		/* writeback in progress, writeback will free */
-		return;
-	}
-
 	/* free */
 	zswap_free_entry(tree, entry);
 }
@@ -809,10 +795,7 @@ static void zswap_frontswap_invalidate_area(unsigned type)
 	 */
 	while ((node = rb_first(&tree->rbroot))) {
 		entry = rb_entry(node, struct zswap_entry, rbnode);
-		rb_erase(&entry->rbnode, &tree->rbroot);
-		zbud_free(tree->pool, entry->handle);
-		zswap_entry_cache_free(entry);
-		atomic_dec(&zswap_stored_pages);
+		zswap_invalidate_entry(tree, entry);
 	}
 	tree->rbroot = RB_ROOT;
 	spin_unlock(&tree->lock);
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 2/3] mm/zswap: bugfix: memory leak when invalidate and reclaim occur concurrently
  2013-10-12  9:37       ` Weijie Yang
@ 2013-10-14  0:31         ` Minchan Kim
  2013-10-14  8:05           ` Weijie Yang
  0 siblings, 1 reply; 14+ messages in thread
From: Minchan Kim @ 2013-10-14  0:31 UTC (permalink / raw)
  To: Weijie Yang
  Cc: akpm, sjenning, bob.liu, weijie.yang.kh, linux-mm, linux-kernel,
	stable

Hello,

On Sat, Oct 12, 2013 at 05:37:55PM +0800, Weijie Yang wrote:
> Hi, all
> 
> I thought out a new way to resolve this problem: use CAS instead of refcount.
> 
> It not only resolve this problem, but also fix another issue,
> can be expanded to support frontswap get_and_free mode easily.
> And I use it in an zswap variant which writeback zbud page directly.
> 
> If it is accepted, I will resent it instead of the previous refactor patch

You're adding atomic operations and making the code more complicated to me,
rather than the simple get/put scheme. Please write a good description of why
it's better, not vague words like "fix another issue, get_and_free and zswap
variant".

Please convince us.

> 
> please see below, Request For Comments, Thanks.
> 
> On Fri, Oct 11, 2013 at 3:13 PM, Minchan Kim <minchan@kernel.org> wrote:
> > On Thu, Sep 26, 2013 at 11:42:17AM +0800, Weijie Yang wrote:
> > > On Tue, Sep 24, 2013 at 9:03 AM, Minchan Kim <minchan@kernel.org> wrote:
> > > > On Mon, Sep 23, 2013 at 04:21:49PM +0800, Weijie Yang wrote:
> > > > >
> > > > > Modify:
> > > > >  - check the refcount in fail path, free memory if it is not referenced.
> > > >
> > > > Hmm, I don't like this because zswap refcount routine is already mess for me.
> > > > I'm not sure why it was designed from the beginning. I hope we should fix it first.
> > > >
> > > > 1. zswap_rb_serach could include zswap_entry_get semantic if it founds a entry from
> > > >    the tree. Of course, we should ranme it as find_get_zswap_entry like find_get_page.
> > > > 2. zswap_entry_put could hide resource free function like zswap_free_entry so that
> > > >    all of caller can use it easily following pattern.
> > > >
> > > >   find_get_zswap_entry
> > > >   ...
> > > >   ...
> > > >   zswap_entry_put
> > > >
> > > > Of course, zswap_entry_put have to check the entry is in the tree or not
> > > > so if someone already removes it from the tree, it should avoid double remove.
> > > >
> > > > One of the concern I can think is that approach extends critical section
> > > > but I think it would be no problem because more bottleneck would be [de]compress
> > > > functions. If it were really problem, we can mitigate a problem with moving
> > > > unnecessary functions out of zswap_free_entry because it seem to be rather
> > > > over-enginnering.
> > >
> > > I refactor the zswap refcount routine according to Minchan's idea.
> > > Here is the new patch, Any suggestion is welcomed.
> > >
> > > To Seth and Bob, would you please review it again?
> > 
> > Yeah, Seth, Bob. You guys are right persons to review this because this
> > scheme was suggested by me who is biased so it couldn't be a fair. ;-)
> > But anyway, I will review code itself.
> 
> Consider the following scenario:
> thread 0: reclaim entry x(get refcount, but not call zswap_get_swap_cache_page)
> thread 1: call zswap_frontswap_invalidate_page to invalidate entry x.
> 	finished, entry x and its zbud is not freed as its refcount != 0
> 	now, the swap_map[x] = 0
> thread 0: now call zswap_get_swap_cache_page
> 	swapcache_prepare return -ENOENT because entry x is not used any more
> 	zswap_get_swap_cache_page return ZSWAP_SWAPCACHE_NOMEM
> 	zswap_writeback_entry do nothing except put refcount
> Now, the memory of zswap_entry x and its zpage leak.
> 
> Consider the another scenario:
> zswap entry x with offset A is already stored in zswap backend.
> thread 0: reclaim entry x(get refcount, but not call zswap_get_swap_cache_page)
> thread 1: store new page with the same offset A, alloc a new zswap entry y.
> 	This is a duplicate store and finished.
> 	shrink_page_list() finish, now the new page is not in swap_cache
> thread 0: zswap_get_swap_cache_page get called.
> 	old page data is added to swap_cache
> Now, swap cache has old data rather than new data for offset A.
> Error will happen If do_swap_page() get page from swap_cache.
> 
> Modify:
>  - re-design the zswap_entry concurrent protection by using a CAS-access flag
>    instead of the refcount get/put semantic
> 
>  - use ZSWAP_SWAPCACHE_FAIL instead of ZSWAP_SWAPCACHE_NOMEM as the fail path
>    can be not only caused by nomem but also by invalidate.
> 
> Signed-off-by: Weijie Yang <weijie.yang@samsung.com>
> ---
>  mm/zswap.c |  177 +++++++++++++++++++++++++++---------------------------------
>  1 file changed, 80 insertions(+), 97 deletions(-)
>  mode change 100644 => 100755 mm/zswap.c
> 
> diff --git a/mm/zswap.c b/mm/zswap.c
> old mode 100644
> new mode 100755
> index cbd9578..fcaaecb
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -158,21 +158,23 @@ static void zswap_comp_exit(void)
>   * page within zswap.
>   *
>   * rbnode - links the entry into red-black tree for the appropriate swap type
> - * refcount - the number of outstanding reference to the entry. This is needed
> - *            to protect against premature freeing of the entry by code
> - *            concurent calls to load, invalidate, and writeback.  The lock
> - *            for the zswap_tree structure that contains the entry must
> - *            be held while changing the refcount.  Since the lock must
> - *            be held, there is no reason to also make refcount atomic.
>   * offset - the swap offset for the entry.  Index into the red-black tree.
> + * pos - the position or status of this entry. see below macro definition
> + *    change it by CAS, we hold this entry on success or retry if fail
> + *    protect against concurrent access by load, invalidate or writeback
> + *    even though the probability of concurrent access is very small
>   * handle - zsmalloc allocation handle that stores the compressed page data
>   * length - the length in bytes of the compressed page data.  Needed during
>   *           decompression
>   */
> +#define POS_POOL      1  /* stay in pool still */
> +#define POS_LOAD      2  /* is loading */
> +#define POS_RECLAIM   3  /* is reclaiming */
> +#define POS_FLUSH     4  /* is invalidating */
>  struct zswap_entry {
>  	struct rb_node rbnode;
>  	pgoff_t offset;
> -	int refcount;
> +	atomic_t pos;
>  	unsigned int length;
>  	unsigned long handle;
>  };
> @@ -216,7 +218,6 @@ static struct zswap_entry *zswap_entry_cache_alloc(gfp_t gfp)
>  	entry = kmem_cache_alloc(zswap_entry_cache, gfp);
>  	if (!entry)
>  		return NULL;
> -	entry->refcount = 1;
>  	return entry;
>  }
>  
> @@ -225,18 +226,6 @@ static void zswap_entry_cache_free(struct zswap_entry *entry)
>  	kmem_cache_free(zswap_entry_cache, entry);
>  }
>  
> -/* caller must hold the tree lock */
> -static void zswap_entry_get(struct zswap_entry *entry)
> -{
> -	entry->refcount++;
> -}
> -
> -/* caller must hold the tree lock */
> -static int zswap_entry_put(struct zswap_entry *entry)
> -{
> -	entry->refcount--;
> -	return entry->refcount;
> -}
>  
>  /*********************************
>  * rbtree functions
> @@ -380,6 +369,23 @@ static void zswap_free_entry(struct zswap_tree *tree, struct zswap_entry *entry)
>  	zswap_pool_pages = zbud_get_pool_size(tree->pool);
>  }
>  
> +/* caller must hold the tree lock, entry is on the tree
> +* seldom called fortunately
> +* return 0: free this entry successfully
> +* return < 0: fail
> +*/
> +static int zswap_invalidate_entry(struct zswap_tree *tree, struct zswap_entry *entry)
> +{
> +	if (atomic_cmpxchg(&entry->pos, POS_POOL, POS_FLUSH) == POS_POOL) {
> +		rb_erase(&entry->rbnode, &tree->rbroot);
> +		zswap_free_entry(tree, entry);
> +		return 0;
> +	}
> +
> +	/* encounter reclaim called by another thread */
> +	return -1;
> +}
> +
>  /*********************************
>  * writeback code
>  **********************************/
> @@ -387,7 +393,7 @@ static void zswap_free_entry(struct zswap_tree *tree, struct zswap_entry *entry)
>  enum zswap_get_swap_ret {
>  	ZSWAP_SWAPCACHE_NEW,
>  	ZSWAP_SWAPCACHE_EXIST,
> -	ZSWAP_SWAPCACHE_NOMEM
> +	ZSWAP_SWAPCACHE_FAIL,
>  };
>  
>  /*
> @@ -401,9 +407,10 @@ enum zswap_get_swap_ret {
>   * added to the swap cache, and returned in retpage.
>   *
>   * If success, the swap cache page is returned in retpage
> - * Returns 0 if page was already in the swap cache, page is not locked
> - * Returns 1 if the new page needs to be populated, page is locked
> - * Returns <0 on error
> + * Returns ZSWAP_SWAPCACHE_EXIST if page was already in the swap cache
> + * Returns ZSWAP_SWAPCACHE_NEW if the new page needs to be populated,
> + * 	the new page is added to swap cache and locked
> + * Returns ZSWAP_SWAPCACHE_FAIL on error
>   */
>  static int zswap_get_swap_cache_page(swp_entry_t entry,
>  				struct page **retpage)
> @@ -475,7 +482,7 @@ static int zswap_get_swap_cache_page(swp_entry_t entry,
>  	if (new_page)
>  		page_cache_release(new_page);
>  	if (!found_page)
> -		return ZSWAP_SWAPCACHE_NOMEM;
> +		return ZSWAP_SWAPCACHE_FAIL;
>  	*retpage = found_page;
>  	return ZSWAP_SWAPCACHE_EXIST;
>  }
> @@ -502,7 +509,7 @@ static int zswap_writeback_entry(struct zbud_pool *pool, unsigned long handle)
>  	struct page *page;
>  	u8 *src, *dst;
>  	unsigned int dlen;
> -	int ret, refcount;
> +	int ret;
>  	struct writeback_control wbc = {
>  		.sync_mode = WB_SYNC_NONE,
>  	};
> @@ -523,17 +530,23 @@ static int zswap_writeback_entry(struct zbud_pool *pool, unsigned long handle)
>  		spin_unlock(&tree->lock);
>  		return 0;
>  	}
> -	zswap_entry_get(entry);
> +	/* we may race with a load or invalidate from another thread:
> +	 * hold the entry on success, or give it up on failure
> +	 */
> +	if (atomic_cmpxchg(&entry->pos, POS_POOL, POS_RECLAIM) != POS_POOL) {
> +		spin_unlock(&tree->lock);
> +		return -EAGAIN;
> +	}
>  	spin_unlock(&tree->lock);
>  	BUG_ON(offset != entry->offset);
>  
>  	/* try to allocate swap cache page */
>  	switch (zswap_get_swap_cache_page(swpentry, &page)) {
> -	case ZSWAP_SWAPCACHE_NOMEM: /* no memory */
> +	case ZSWAP_SWAPCACHE_FAIL:
>  		ret = -ENOMEM;
>  		goto fail;
>  
> -	case ZSWAP_SWAPCACHE_EXIST: /* page is unlocked */
> +	case ZSWAP_SWAPCACHE_EXIST:
>  		/* page is already in the swap cache, ignore for now */
>  		page_cache_release(page);
>  		ret = -EEXIST;
> @@ -561,38 +574,23 @@ static int zswap_writeback_entry(struct zbud_pool *pool, unsigned long handle)
>  	page_cache_release(page);
>  	zswap_written_back_pages++;
>  
> +	/* once we hold this entry and get here, we can free it safely.
> +	 * During the period in which we hold this entry:
> +	 * 1. if a thread calls do_swap_page concurrently,
> +	 *    we will get ZSWAP_SWAPCACHE_EXIST returned, or
> +	 *    do_swap_page will hit the page we added to the swap cache
> +	 * 2. if a thread calls invalidate concurrently,
> +	 *    the invalidating thread must wait (retry) until this function ends
> +	 */
>  	spin_lock(&tree->lock);
> -
> -	/* drop local reference */
> -	zswap_entry_put(entry);
> -	/* drop the initial reference from entry creation */
> -	refcount = zswap_entry_put(entry);
> -
> -	/*
> -	 * There are three possible values for refcount here:
> -	 * (1) refcount is 1, load is in progress, unlink from rbtree,
> -	 *     load will free
> -	 * (2) refcount is 0, (normal case) entry is valid,
> -	 *     remove from rbtree and free entry
> -	 * (3) refcount is -1, invalidate happened during writeback,
> -	 *     free entry
> -	 */
> -	if (refcount >= 0) {
> -		/* no invalidate yet, remove from rbtree */
> -		rb_erase(&entry->rbnode, &tree->rbroot);
> -	}
> +	rb_erase(&entry->rbnode, &tree->rbroot);
>  	spin_unlock(&tree->lock);
> -	if (refcount <= 0) {
> -		/* free the entry */
> -		zswap_free_entry(tree, entry);
> -		return 0;
> -	}
> -	return -EAGAIN;
> +	zswap_free_entry(tree, entry);
> +
> +	return 0;
>  
>  fail:
> -	spin_lock(&tree->lock);
> -	zswap_entry_put(entry);
> -	spin_unlock(&tree->lock);
> +	BUG_ON(atomic_cmpxchg(&entry->pos, POS_RECLAIM, POS_POOL) != POS_RECLAIM);
>  	return ret;
>  }
>  
> @@ -668,19 +666,15 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
>  	entry->offset = offset;
>  	entry->handle = handle;
>  	entry->length = dlen;
> +	atomic_set(&entry->pos, POS_POOL);
>  
>  	/* map */
>  	spin_lock(&tree->lock);
>  	do {
>  		ret = zswap_rb_insert(&tree->rbroot, entry, &dupentry);
> -		if (ret == -EEXIST) {
> -			zswap_duplicate_entry++;
> -			/* remove from rbtree */
> -			rb_erase(&dupentry->rbnode, &tree->rbroot);
> -			if (!zswap_entry_put(dupentry)) {
> -				/* free */
> -				zswap_free_entry(tree, dupentry);
> -			}
> +		if (unlikely(ret == -EEXIST)) {
> +			if (zswap_invalidate_entry(tree, dupentry) == 0)
> +				zswap_duplicate_entry++;
>  		}
>  	} while (ret == -EEXIST);
>  	spin_unlock(&tree->lock);
> @@ -709,9 +703,10 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset,
>  	struct zswap_entry *entry;
>  	u8 *src, *dst;
>  	unsigned int dlen;
> -	int refcount, ret;
> +	int ret;
>  
>  	/* find */
> +find:
>  	spin_lock(&tree->lock);
>  	entry = zswap_rb_search(&tree->rbroot, offset);
>  	if (!entry) {
> @@ -719,7 +714,14 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset,
>  		spin_unlock(&tree->lock);
>  		return -1;
>  	}
> -	zswap_entry_get(entry);
> +	/* the entry is being reclaimed by another thread:
> +	 * just retry until that thread gets ZSWAP_SWAPCACHE_EXIST returned;
> +	 * once we hold the entry, we can do anything: get or get_and_free
> +	 */
> +	if (atomic_cmpxchg(&entry->pos, POS_POOL, POS_LOAD) != POS_POOL) {
> +		spin_unlock(&tree->lock);
> +		goto find;
> +	}
>  	spin_unlock(&tree->lock);
>  
>  	/* decompress */
> @@ -733,22 +735,7 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset,
>  	zbud_unmap(tree->pool, entry->handle);
>  	BUG_ON(ret);
>  
> -	spin_lock(&tree->lock);
> -	refcount = zswap_entry_put(entry);
> -	if (likely(refcount)) {
> -		spin_unlock(&tree->lock);
> -		return 0;
> -	}
> -	spin_unlock(&tree->lock);
> -
> -	/*
> -	 * We don't have to unlink from the rbtree because
> -	 * zswap_writeback_entry() or zswap_frontswap_invalidate page()
> -	 * has already done this for us if we are the last reference.
> -	 */
> -	/* free */
> -
> -	zswap_free_entry(tree, entry);
> +	BUG_ON(atomic_cmpxchg(&entry->pos, POS_LOAD, POS_POOL) != POS_LOAD);
>  
>  	return 0;
>  }
> @@ -758,9 +745,9 @@ static void zswap_frontswap_invalidate_page(unsigned type, pgoff_t offset)
>  {
>  	struct zswap_tree *tree = zswap_trees[type];
>  	struct zswap_entry *entry;
> -	int refcount;
>  
>  	/* find */
> +find:
>  	spin_lock(&tree->lock);
>  	entry = zswap_rb_search(&tree->rbroot, offset);
>  	if (!entry) {
> @@ -769,19 +756,18 @@ static void zswap_frontswap_invalidate_page(unsigned type, pgoff_t offset)
>  		return;
>  	}
>  
> +	/* the entry is being reclaimed by another thread:
> +	 * just retry until that reclaim ends.
> +	 */
> +	if (atomic_cmpxchg(&entry->pos, POS_POOL, POS_FLUSH) != POS_POOL) {
> +		spin_unlock(&tree->lock);
> +		goto find;
> +	}
> +
>  	/* remove from rbtree */
>  	rb_erase(&entry->rbnode, &tree->rbroot);
> -
> -	/* drop the initial reference from entry creation */
> -	refcount = zswap_entry_put(entry);
> -
>  	spin_unlock(&tree->lock);
>  
> -	if (refcount) {
> -		/* writeback in progress, writeback will free */
> -		return;
> -	}
> -
>  	/* free */
>  	zswap_free_entry(tree, entry);
>  }
> @@ -809,10 +795,7 @@ static void zswap_frontswap_invalidate_area(unsigned type)
>  	 */
>  	while ((node = rb_first(&tree->rbroot))) {
>  		entry = rb_entry(node, struct zswap_entry, rbnode);
> -		rb_erase(&entry->rbnode, &tree->rbroot);
> -		zbud_free(tree->pool, entry->handle);
> -		zswap_entry_cache_free(entry);
> -		atomic_dec(&zswap_stored_pages);
> +		zswap_invalidate_entry(tree, entry);
>  	}
>  	tree->rbroot = RB_ROOT;
>  	spin_unlock(&tree->lock);
> -- 
> 1.7.10.4
> 

-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 2/3] mm/zswap: bugfix: memory leak when invalidate and reclaim occur concurrently
  2013-10-14  0:31         ` Minchan Kim
@ 2013-10-14  8:05           ` Weijie Yang
  0 siblings, 0 replies; 14+ messages in thread
From: Weijie Yang @ 2013-10-14  8:05 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Weijie Yang, akpm, sjenning, bob.liu, linux-mm, linux-kernel,
	stable, spartacus06

Hi,

In order to make things clear and simple,
I will re-send this v3 2/3 patch, split into two patches:

1. just fix the bug, the same as the first v3 2/3 patch in this email thread

2. refactor the get/put routine based on Minchan's suggestion.

As for the last CAS patch, I will add more description and resend it
as an RFC patch.
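
To give a rough idea of the RFC before it is resent: every entry carries an
atomic state, and load, writeback and invalidate each claim the entry by
flipping that state away from POS_POOL before touching it, so whoever loses
the cmpxchg retries or backs off instead of balancing a refcount. Below is
only a userspace sketch of that concept, with made-up names and C11 atomics
standing in for the kernel's atomic_cmpxchg(); it is not the zswap code
itself.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* entry states, mirroring the POS_* values in the RFC patch */
enum entry_pos { POS_POOL = 1, POS_LOAD, POS_RECLAIM, POS_FLUSH };

struct entry {
	_Atomic int pos;	/* which path, if any, currently owns the entry */
	/* offset, handle, length ... */
};

/*
 * Claim the entry for one operation (load, reclaim or invalidate).
 * Only a transition away from POS_POOL can succeed, so at most one
 * path owns the entry at a time; the losers retry or back off.
 */
static bool entry_claim(struct entry *e, int new_pos)
{
	int expected = POS_POOL;

	return atomic_compare_exchange_strong(&e->pos, &expected, new_pos);
}

/* return the entry to the pool after a non-destructive operation (load) */
static void entry_release(struct entry *e)
{
	atomic_store(&e->pos, POS_POOL);
}

int main(void)
{
	struct entry e = { .pos = POS_POOL };

	if (entry_claim(&e, POS_LOAD)) {
		/* a second claim (e.g. a concurrent flush) fails while the entry is owned */
		printf("flush while loading: %d\n", entry_claim(&e, POS_FLUSH));
		entry_release(&e);
	}
	return 0;
}

The failure paths then only have to flip the state back to POS_POOL, as the
quoted patch below does at its fail: label, instead of dropping references.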

Thanks,

On Mon, Oct 14, 2013 at 8:31 AM, Minchan Kim <minchan@kernel.org> wrote:
> Hello,
>
> On Sat, Oct 12, 2013 at 05:37:55PM +0800, Weijie Yang wrote:
>> Hi, all
>>
>> I thought of a new way to resolve this problem: use CAS instead of a refcount.
>>
>> It not only resolves this problem, but also fixes another issue, and
>> can be expanded to support the frontswap get_and_free mode easily.
>> I also use it in a zswap variant which writes back zbud pages directly.
>>
>> If it is accepted, I will resend it instead of the previous refactor patch.
>
> You're adding atomic operations and, to me, making the code more complicated
> than the simple get/put scheme. Please write a good description of why it's better,
> not vague words like "fix another issue, get_and_free and zswap variant".
>
> Please convince us.
>
>>
>> Please see below, Request For Comments. Thanks.
>>
>> On Fri, Oct 11, 2013 at 3:13 PM, Minchan Kim <minchan@kernel.org> wrote:
>> > On Thu, Sep 26, 2013 at 11:42:17AM +0800, Weijie Yang wrote:
>> > > On Tue, Sep 24, 2013 at 9:03 AM, Minchan Kim <minchan@kernel.org> wrote:
>> > > > On Mon, Sep 23, 2013 at 04:21:49PM +0800, Weijie Yang wrote:
>> > > > >
>> > > > > Modify:
>> > > > >  - check the refcount in fail path, free memory if it is not referenced.
>> > > >
>> > > > Hmm, I don't like this because the zswap refcount routine is already a mess to me.
>> > > > I'm not sure why it was designed this way from the beginning. I hope we can fix it first.
>> > > >
>> > > > 1. zswap_rb_search could include the zswap_entry_get semantic if it finds an entry in
>> > > >    the tree. Of course, we should rename it to find_get_zswap_entry, like find_get_page.
>> > > > 2. zswap_entry_put could hide the resource-freeing function, like zswap_free_entry, so that
>> > > >    all callers can easily use the following pattern.
>> > > >
>> > > >   find_get_zswap_entry
>> > > >   ...
>> > > >   ...
>> > > >   zswap_entry_put
>> > > >
>> > > > Of course, zswap_entry_put has to check whether the entry is in the tree,
>> > > > so that if someone has already removed it from the tree, it avoids a double remove.
>> > > >
>> > > > One concern I can think of is that this approach extends the critical section,
>> > > > but I think it would be no problem because the bigger bottleneck would be the [de]compress
>> > > > functions. If it were really a problem, we could mitigate it by moving
>> > > > unnecessary work out of zswap_free_entry, though that seems to be rather
>> > > > over-engineering.
>> > >
>> > > I refactored the zswap refcount routine according to Minchan's idea.
>> > > Here is the new patch; any suggestions are welcome.
>> > >
>> > > To Seth and Bob, would you please review it again?
>> >
>> > Yeah, Seth, Bob. You guys are the right persons to review this, because this
>> > scheme was suggested by me, who is biased, so my review couldn't be fair. ;-)
>> > But anyway, I will review the code itself.
>>
>> Consider the following scenario:
>> thread 0: reclaim entry x(get refcount, but not call zswap_get_swap_cache_page)
>> thread 1: call zswap_frontswap_invalidate_page to invalidate entry x.
>>       finished, entry x and its zbud is not freed as its refcount != 0
>>       now, the swap_map[x] = 0
>> thread 0: now call zswap_get_swap_cache_page
>>       swapcache_prepare return -ENOENT because entry x is not used any more
>>       zswap_get_swap_cache_page return ZSWAP_SWAPCACHE_NOMEM
>>       zswap_writeback_entry do nothing except put refcount
>> Now, the memory of zswap_entry x and its zpage leak.
>>
>> Consider another scenario:
>> zswap entry x with offset A is already stored in the zswap backend.
>> thread 0: reclaim entry x (get refcount, but not call zswap_get_swap_cache_page)
>> thread 1: store a new page with the same offset A, alloc a new zswap entry y.
>>       This is a duplicate store and it finishes.
>>       shrink_page_list() finishes; now the new page is not in the swap cache
>> thread 0: zswap_get_swap_cache_page gets called.
>>       old page data is added to the swap cache
>> Now, the swap cache has old data rather than new data for offset A.
>> An error will happen if do_swap_page() gets the page from the swap cache.
>>
>> Modify:
>>  - re-design the zswap_entry concurrency protection by using a CAS-accessed flag
>>    instead of the refcount get/put semantics
>>
>>  - use ZSWAP_SWAPCACHE_FAIL instead of ZSWAP_SWAPCACHE_NOMEM, as the fail path
>>    can be caused not only by nomem but also by invalidate.
>>
>> Signed-off-by: Weijie Yang <weijie.yang@samsung.com>
>> ---
>>  mm/zswap.c |  177 +++++++++++++++++++++++++++---------------------------------
>>  1 file changed, 80 insertions(+), 97 deletions(-)
>>  mode change 100644 => 100755 mm/zswap.c
>>
>> diff --git a/mm/zswap.c b/mm/zswap.c
>> old mode 100644
>> new mode 100755
>> index cbd9578..fcaaecb
>> --- a/mm/zswap.c
>> +++ b/mm/zswap.c
>> @@ -158,21 +158,23 @@ static void zswap_comp_exit(void)
>>   * page within zswap.
>>   *
>>   * rbnode - links the entry into red-black tree for the appropriate swap type
>> - * refcount - the number of outstanding reference to the entry. This is needed
>> - *            to protect against premature freeing of the entry by code
>> - *            concurent calls to load, invalidate, and writeback.  The lock
>> - *            for the zswap_tree structure that contains the entry must
>> - *            be held while changing the refcount.  Since the lock must
>> - *            be held, there is no reason to also make refcount atomic.
>>   * offset - the swap offset for the entry.  Index into the red-black tree.
>> + * pos - the position or status of this entry; see the macro definitions below.
>> + *    Changed by CAS: we hold the entry on success, or retry/give up on failure.
>> + *    Protects against concurrent access by load, invalidate or writeback,
>> + *    even though the probability of concurrent access is very small.
>>   * handle - zsmalloc allocation handle that stores the compressed page data
>>   * length - the length in bytes of the compressed page data.  Needed during
>>   *           decompression
>>   */
>> +#define POS_POOL      1  /* still in the pool, not held by any path */
>> +#define POS_LOAD      2  /* being loaded */
>> +#define POS_RECLAIM   3  /* being reclaimed */
>> +#define POS_FLUSH     4  /* being invalidated */
>>  struct zswap_entry {
>>       struct rb_node rbnode;
>>       pgoff_t offset;
>> -     int refcount;
>> +     atomic_t pos;
>>       unsigned int length;
>>       unsigned long handle;
>>  };
>> @@ -216,7 +218,6 @@ static struct zswap_entry *zswap_entry_cache_alloc(gfp_t gfp)
>>       entry = kmem_cache_alloc(zswap_entry_cache, gfp);
>>       if (!entry)
>>               return NULL;
>> -     entry->refcount = 1;
>>       return entry;
>>  }
>>
>> @@ -225,18 +226,6 @@ static void zswap_entry_cache_free(struct zswap_entry *entry)
>>       kmem_cache_free(zswap_entry_cache, entry);
>>  }
>>
>> -/* caller must hold the tree lock */
>> -static void zswap_entry_get(struct zswap_entry *entry)
>> -{
>> -     entry->refcount++;
>> -}
>> -
>> -/* caller must hold the tree lock */
>> -static int zswap_entry_put(struct zswap_entry *entry)
>> -{
>> -     entry->refcount--;
>> -     return entry->refcount;
>> -}
>>
>>  /*********************************
>>  * rbtree functions
>> @@ -380,6 +369,23 @@ static void zswap_free_entry(struct zswap_tree *tree, struct zswap_entry *entry)
>>       zswap_pool_pages = zbud_get_pool_size(tree->pool);
>>  }
>>
>> +/* caller must hold the tree lock; the entry must be on the tree.
>> + * Fortunately this is seldom called.
>> + * return 0: the entry was freed successfully
>> + * return < 0: failure, the entry is held by a concurrent reclaim
>> + */
>> +static int zswap_invalidate_entry(struct zswap_tree *tree, struct zswap_entry *entry)
>> +{
>> +     if (atomic_cmpxchg(&entry->pos, POS_POOL, POS_FLUSH) == POS_POOL) {
>> +             rb_erase(&entry->rbnode, &tree->rbroot);
>> +             zswap_free_entry(tree, entry);
>> +             return 0;
>> +     }
>> +
>> +     /* lost the race: reclaim by another thread holds the entry */
>> +     return -1;
>> +}
>> +
>>  /*********************************
>>  * writeback code
>>  **********************************/
>> @@ -387,7 +393,7 @@ static void zswap_free_entry(struct zswap_tree *tree, struct zswap_entry *entry)
>>  enum zswap_get_swap_ret {
>>       ZSWAP_SWAPCACHE_NEW,
>>       ZSWAP_SWAPCACHE_EXIST,
>> -     ZSWAP_SWAPCACHE_NOMEM
>> +     ZSWAP_SWAPCACHE_FAIL,
>>  };
>>
>>  /*
>> @@ -401,9 +407,10 @@ enum zswap_get_swap_ret {
>>   * added to the swap cache, and returned in retpage.
>>   *
>>   * If success, the swap cache page is returned in retpage
>> - * Returns 0 if page was already in the swap cache, page is not locked
>> - * Returns 1 if the new page needs to be populated, page is locked
>> - * Returns <0 on error
>> + * Returns ZSWAP_SWAPCACHE_EXIST if page was already in the swap cache
>> + * Returns ZSWAP_SWAPCACHE_NEW if the new page needs to be populated,
>> + *   the new page is added to swap cache and locked
>> + * Returns ZSWAP_SWAPCACHE_FAIL on error
>>   */
>>  static int zswap_get_swap_cache_page(swp_entry_t entry,
>>                               struct page **retpage)
>> @@ -475,7 +482,7 @@ static int zswap_get_swap_cache_page(swp_entry_t entry,
>>       if (new_page)
>>               page_cache_release(new_page);
>>       if (!found_page)
>> -             return ZSWAP_SWAPCACHE_NOMEM;
>> +             return ZSWAP_SWAPCACHE_FAIL;
>>       *retpage = found_page;
>>       return ZSWAP_SWAPCACHE_EXIST;
>>  }
>> @@ -502,7 +509,7 @@ static int zswap_writeback_entry(struct zbud_pool *pool, unsigned long handle)
>>       struct page *page;
>>       u8 *src, *dst;
>>       unsigned int dlen;
>> -     int ret, refcount;
>> +     int ret;
>>       struct writeback_control wbc = {
>>               .sync_mode = WB_SYNC_NONE,
>>       };
>> @@ -523,17 +530,23 @@ static int zswap_writeback_entry(struct zbud_pool *pool, unsigned long handle)
>>               spin_unlock(&tree->lock);
>>               return 0;
>>       }
>> -     zswap_entry_get(entry);
>> +     /* we may race with a load or invalidate from another thread:
>> +      * hold the entry on success, or give it up on failure
>> +      */
>> +     if (atomic_cmpxchg(&entry->pos, POS_POOL, POS_RECLAIM) != POS_POOL) {
>> +             spin_unlock(&tree->lock);
>> +             return -EAGAIN;
>> +     }
>>       spin_unlock(&tree->lock);
>>       BUG_ON(offset != entry->offset);
>>
>>       /* try to allocate swap cache page */
>>       switch (zswap_get_swap_cache_page(swpentry, &page)) {
>> -     case ZSWAP_SWAPCACHE_NOMEM: /* no memory */
>> +     case ZSWAP_SWAPCACHE_FAIL:
>>               ret = -ENOMEM;
>>               goto fail;
>>
>> -     case ZSWAP_SWAPCACHE_EXIST: /* page is unlocked */
>> +     case ZSWAP_SWAPCACHE_EXIST:
>>               /* page is already in the swap cache, ignore for now */
>>               page_cache_release(page);
>>               ret = -EEXIST;
>> @@ -561,38 +574,23 @@ static int zswap_writeback_entry(struct zbud_pool *pool, unsigned long handle)
>>       page_cache_release(page);
>>       zswap_written_back_pages++;
>>
>> +     /* once we hold this entry and get here, we can free it safely.
>> +      * During the period in which we hold this entry:
>> +      * 1. if a thread calls do_swap_page concurrently,
>> +      *    we will get ZSWAP_SWAPCACHE_EXIST returned, or
>> +      *    do_swap_page will hit the page we added to the swap cache
>> +      * 2. if a thread calls invalidate concurrently,
>> +      *    the invalidating thread must wait (retry) until this function ends
>> +      */
>>       spin_lock(&tree->lock);
>> -
>> -     /* drop local reference */
>> -     zswap_entry_put(entry);
>> -     /* drop the initial reference from entry creation */
>> -     refcount = zswap_entry_put(entry);
>> -
>> -     /*
>> -      * There are three possible values for refcount here:
>> -      * (1) refcount is 1, load is in progress, unlink from rbtree,
>> -      *     load will free
>> -      * (2) refcount is 0, (normal case) entry is valid,
>> -      *     remove from rbtree and free entry
>> -      * (3) refcount is -1, invalidate happened during writeback,
>> -      *     free entry
>> -      */
>> -     if (refcount >= 0) {
>> -             /* no invalidate yet, remove from rbtree */
>> -             rb_erase(&entry->rbnode, &tree->rbroot);
>> -     }
>> +     rb_erase(&entry->rbnode, &tree->rbroot);
>>       spin_unlock(&tree->lock);
>> -     if (refcount <= 0) {
>> -             /* free the entry */
>> -             zswap_free_entry(tree, entry);
>> -             return 0;
>> -     }
>> -     return -EAGAIN;
>> +     zswap_free_entry(tree, entry);
>> +
>> +     return 0;
>>
>>  fail:
>> -     spin_lock(&tree->lock);
>> -     zswap_entry_put(entry);
>> -     spin_unlock(&tree->lock);
>> +     BUG_ON(atomic_cmpxchg(&entry->pos, POS_RECLAIM, POS_POOL) != POS_RECLAIM);
>>       return ret;
>>  }
>>
>> @@ -668,19 +666,15 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
>>       entry->offset = offset;
>>       entry->handle = handle;
>>       entry->length = dlen;
>> +     atomic_set(&entry->pos, POS_POOL);
>>
>>       /* map */
>>       spin_lock(&tree->lock);
>>       do {
>>               ret = zswap_rb_insert(&tree->rbroot, entry, &dupentry);
>> -             if (ret == -EEXIST) {
>> -                     zswap_duplicate_entry++;
>> -                     /* remove from rbtree */
>> -                     rb_erase(&dupentry->rbnode, &tree->rbroot);
>> -                     if (!zswap_entry_put(dupentry)) {
>> -                             /* free */
>> -                             zswap_free_entry(tree, dupentry);
>> -                     }
>> +             if (unlikely(ret == -EEXIST)) {
>> +                     if (zswap_invalidate_entry(tree, dupentry) == 0)
>> +                             zswap_duplicate_entry++;
>>               }
>>       } while (ret == -EEXIST);
>>       spin_unlock(&tree->lock);
>> @@ -709,9 +703,10 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset,
>>       struct zswap_entry *entry;
>>       u8 *src, *dst;
>>       unsigned int dlen;
>> -     int refcount, ret;
>> +     int ret;
>>
>>       /* find */
>> +find:
>>       spin_lock(&tree->lock);
>>       entry = zswap_rb_search(&tree->rbroot, offset);
>>       if (!entry) {
>> @@ -719,7 +714,14 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset,
>>               spin_unlock(&tree->lock);
>>               return -1;
>>       }
>> -     zswap_entry_get(entry);
>> +     /* the entry is being reclaimed by another thread:
>> +      * just retry until that thread gets ZSWAP_SWAPCACHE_EXIST returned;
>> +      * once we hold the entry, we can do anything: get or get_and_free
>> +      */
>> +     if (atomic_cmpxchg(&entry->pos, POS_POOL, POS_LOAD) != POS_POOL) {
>> +             spin_unlock(&tree->lock);
>> +             goto find;
>> +     }
>>       spin_unlock(&tree->lock);
>>
>>       /* decompress */
>> @@ -733,22 +735,7 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset,
>>       zbud_unmap(tree->pool, entry->handle);
>>       BUG_ON(ret);
>>
>> -     spin_lock(&tree->lock);
>> -     refcount = zswap_entry_put(entry);
>> -     if (likely(refcount)) {
>> -             spin_unlock(&tree->lock);
>> -             return 0;
>> -     }
>> -     spin_unlock(&tree->lock);
>> -
>> -     /*
>> -      * We don't have to unlink from the rbtree because
>> -      * zswap_writeback_entry() or zswap_frontswap_invalidate page()
>> -      * has already done this for us if we are the last reference.
>> -      */
>> -     /* free */
>> -
>> -     zswap_free_entry(tree, entry);
>> +     BUG_ON(atomic_cmpxchg(&entry->pos, POS_LOAD, POS_POOL) != POS_LOAD);
>>
>>       return 0;
>>  }
>> @@ -758,9 +745,9 @@ static void zswap_frontswap_invalidate_page(unsigned type, pgoff_t offset)
>>  {
>>       struct zswap_tree *tree = zswap_trees[type];
>>       struct zswap_entry *entry;
>> -     int refcount;
>>
>>       /* find */
>> +find:
>>       spin_lock(&tree->lock);
>>       entry = zswap_rb_search(&tree->rbroot, offset);
>>       if (!entry) {
>> @@ -769,19 +756,18 @@ static void zswap_frontswap_invalidate_page(unsigned type, pgoff_t offset)
>>               return;
>>       }
>>
>> +     /* the entry is being reclaimed by another thread:
>> +      * just retry until that reclaim ends.
>> +      */
>> +     if (atomic_cmpxchg(&entry->pos, POS_POOL, POS_FLUSH) != POS_POOL) {
>> +             spin_unlock(&tree->lock);
>> +             goto find;
>> +     }
>> +
>>       /* remove from rbtree */
>>       rb_erase(&entry->rbnode, &tree->rbroot);
>> -
>> -     /* drop the initial reference from entry creation */
>> -     refcount = zswap_entry_put(entry);
>> -
>>       spin_unlock(&tree->lock);
>>
>> -     if (refcount) {
>> -             /* writeback in progress, writeback will free */
>> -             return;
>> -     }
>> -
>>       /* free */
>>       zswap_free_entry(tree, entry);
>>  }
>> @@ -809,10 +795,7 @@ static void zswap_frontswap_invalidate_area(unsigned type)
>>        */
>>       while ((node = rb_first(&tree->rbroot))) {
>>               entry = rb_entry(node, struct zswap_entry, rbnode);
>> -             rb_erase(&entry->rbnode, &tree->rbroot);
>> -             zbud_free(tree->pool, entry->handle);
>> -             zswap_entry_cache_free(entry);
>> -             atomic_dec(&zswap_stored_pages);
>> +             zswap_invalidate_entry(tree, entry);
>>       }
>>       tree->rbroot = RB_ROOT;
>>       spin_unlock(&tree->lock);
>> --
>> 1.7.10.4
>>
>
> --
> Kind regards,
> Minchan Kim


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2013-10-14  8:05 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-09-23  8:21 [PATCH v3 2/3] mm/zswap: bugfix: memory leak when invalidate and reclaim occur concurrently Weijie Yang
2013-09-24  1:03 ` Minchan Kim
2013-09-26  3:42   ` Weijie Yang
2013-10-10 19:54     ` Andrew Morton
2013-10-11  7:13     ` Minchan Kim
2013-10-12  2:50       ` Bob Liu
2013-10-12  9:14         ` Weijie Yang
2013-10-12  9:29           ` Bob Liu
2013-10-12  8:41       ` Weijie Yang
2013-10-12  9:37       ` Weijie Yang
2013-10-14  0:31         ` Minchan Kim
2013-10-14  8:05           ` Weijie Yang
2013-10-12  2:32     ` Bob Liu
2013-10-12  8:45       ` Weijie Yang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).