linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] memcg: fix stale swap cache leak v5
@ 2009-04-30  7:16 KAMEZAWA Hiroyuki
  2009-04-30  7:35 ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 12+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-04-30  7:16 UTC (permalink / raw)
  To: linux-kernel@vger.kernel.org
  Cc: linux-mm@kvack.org, nishimura@mxp.nes.nec.co.jp,
	balbir@linux.vnet.ibm.com, akpm@linux-foundation.org,
	hugh@veritas.com

This is v5 but all codes are rewritten.

After this patch, when memcg is used,
 1. page's swapcount is checked after I/O (without locks). If the page is
    stale swap cache, freeing routine will be scheduled.
 2. vmscan.c calls try_to_free_swap() when __remove_mapping() fails.

Works well for me. no extra resources and no races.

Because my office will be closed until May/7, I'll not be able to make a
response. Posting this for showing what I think of now.

This should be fixed before posting softlimit etc...
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

In general, Linux's swp_entry handling is done by combination of lazy techniques
and global LRU. It works well but when we use mem+swap controller, some more
strict cotroll is appropriate. Otherwise, swp_entry used by a cgroup will be
never freed until global LRU works. In a system where memcg is well-configured,
global LRU doesn't work frequently.

  Example A) Assume a swap cache which is not mapped.
              CPU0                            CPU1
	   zap_pte()....                  shrink_page_list()
	    free_swap_and_cache()           lock_page()
		page seems busy.

  Example B) Assume swapin-readahed.
	      CPU0			      CPU1
	   zap_pte()			  read_swap_cache_async()
					  swap_duplicate().
           swap_entry_free() = 1
	   find_get_page()=> NULL.
					  add_to_swap_cache().
					  issue swap I/O. 

There are many patterns of this kind of race (but no problems).

free_swap_and_cache() is called for freeing swp_entry. But it is a best-effort
function. If the swp_entry/page seems busy, swp_entry is not freed.
This is not a problem because global-LRU will find SwapCache at page reclaim.
But...

If memcg is used, on the other hand, global LRU may not work. Then, above
unused SwapCache will not be freed.
(unmapped SwapCache occupy swp_entry but never be freed if not on memcg's LRU)

So, even if there are no tasks in a cgroup, swp_entry usage still remains.
In bad case, OOM by mem+swap controller is triggerred by this "leak" of
swp_entry as Nishimura repoted.

This patch tries to fix racy case of free_swap_and_cache() and I/O by checking
swap's refnct again after I/O. And add a hook to vmscan.c.
After this patch applied, follwoing test works well.

  # echo 1-2M > ../memory.limit_in_bytes
  # run tasks under memcg.
  # kill all tasks and make memory.tasks empty
  # check memory.memsw.usage_in_bytes == memory.usage_in_bytes and
    there is no _used_ swp_entry.

Changelog: v4->v5
 - completely new design.
 - added nolock page_swapcount.
 - checks all swap I/O.

Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/swap.h |    1 
 mm/page_io.c         |  120 +++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/swapfile.c        |   37 +++++++++++++++
 mm/vmscan.c          |   31 ++++++++++++-
 4 files changed, 187 insertions(+), 2 deletions(-)

Index: mmotm-2.6.30-Apr24/mm/page_io.c
===================================================================
--- mmotm-2.6.30-Apr24.orig/mm/page_io.c
+++ mmotm-2.6.30-Apr24/mm/page_io.c
@@ -19,6 +19,123 @@
 #include <linux/writeback.h>
 #include <asm/pgtable.h>
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+/*
+ * When memory cgroup is used, race between read/write-swap and zap-swap can
+ * be a leak of swp_entry accounted. So we have to check the status at
+ * the end of swap-io. If memory cgroup is not used, Global LRU will find
+ * unused swap-cache finally. (But this is too lazy for memcg.)
+ */
+
+struct swapio_check {
+	spinlock_t	lock;
+	void		*swap_bio_list;
+	struct delayed_work work;
+} stale_swap_check;
+
+
+static void mem_cgroup_check_stale_swap(struct work_struct *work)
+{
+	struct bio *bio;
+	struct page *page;
+	struct swapio_check *sc;
+	int nr = SWAP_CLUSTER_MAX;
+	swp_entry_t entry;
+
+	sc = &stale_swap_check;
+
+	while (nr--) {
+		cond_resched();
+		spin_lock_irq(&sc->lock);
+		bio = sc->swap_bio_list;
+		if (bio)
+			sc->swap_bio_list = bio->bi_next;
+		spin_unlock_irq(&sc->lock);
+		if (!bio)
+			break;
+		entry.val = (unsigned long)bio->bi_private;
+		bio_put(bio);
+
+		page = find_get_page(&swapper_space, entry.val);
+		if (!page || page_mapped(page))
+			continue;
+		lock_page(page);
+		/*
+		 * When it's mapped, this page passed checks in do_swap_page()
+		 * and we don't have to do any more. All other necessary checks
+		 * will be done in try_to_free_swap().
+		 */
+		if (!page_mapped(page))
+			try_to_free_swap(page);
+		unlock_page(page);
+		put_page(page);
+	}
+	if (sc->swap_bio_list)
+		schedule_delayed_work(&sc->work, HZ/10);
+}
+
+/*
+ * We can't call try_to_free_swap directly here because of caller's context.
+ */
+static void mem_cgroup_swapio_check_again(struct bio *bio, struct page *page)
+{
+	unsigned long flags;
+	struct swapio_check *sc;
+	swp_entry_t entry;
+	int ret;
+
+	/* check swap count here. If swp_entry is stable, nothing to do.*/
+	if (likely(mem_cgroup_staleswap_hint(page)))
+		return;
+	/* reuse bio if this bio is ready to be freed. */
+	ret = atomic_inc_return(&bio->bi_cnt);
+	/* Any other reference other than us ? */
+	if (unlikely(ret > 2)) {
+		bio_put(bio);
+		return;
+	}
+	/*
+	 * We don't want to grab this page....record swp_entry instead of page.
+	 */
+	entry.val = page_private(page);
+	bio->bi_private = (void *)entry.val;
+
+	sc = &stale_swap_check;
+	spin_lock_irqsave(&sc->lock, flags);
+	/* link bio */
+	bio->bi_next = sc->swap_bio_list;
+	sc->swap_bio_list = bio;
+	spin_unlock_irqrestore(&sc->lock, flags);
+	/*
+	 * Swap I/O is tend to be countinous. Do check in batched manner.
+	 */
+	if (!delayed_work_pending(&sc->work))
+		schedule_delayed_work(&sc->work, HZ/10);
+}
+
+static int __init setup_stale_swap_check(void)
+{
+	struct swapio_check *sc;
+
+	sc = &stale_swap_check;
+	spin_lock_init(&sc->lock);
+	sc->swap_bio_list = NULL;
+	INIT_DELAYED_WORK(&sc->work, mem_cgroup_check_stale_swap);
+	return 0;
+}
+late_initcall(setup_stale_swap_check);
+
+
+#else /* CONFIG_CGROUP_MEM_RES_CTRL */
+
+static inline
+void mem_cgroup_swapio_check_again(struct bio *bio, struct page *page)
+{
+}
+#endif
+
+
+
 static struct bio *get_swap_bio(gfp_t gfp_flags, pgoff_t index,
 				struct page *page, bio_end_io_t end_io)
 {
@@ -66,6 +183,8 @@ static void end_swap_bio_write(struct bi
 				(unsigned long long)bio->bi_sector);
 		ClearPageReclaim(page);
 	}
+	/* While PG_writeback, this page is stable ...then, call this here */
+	mem_cgroup_swapio_check_again(bio, page);
 	end_page_writeback(page);
 	bio_put(bio);
 }
@@ -85,6 +204,7 @@ void end_swap_bio_read(struct bio *bio, 
 	} else {
 		SetPageUptodate(page);
 	}
+	mem_cgroup_swapio_check_again(bio, page);
 	unlock_page(page);
 	bio_put(bio);
 }
Index: mmotm-2.6.30-Apr24/mm/vmscan.c
===================================================================
--- mmotm-2.6.30-Apr24.orig/mm/vmscan.c
+++ mmotm-2.6.30-Apr24/mm/vmscan.c
@@ -586,6 +586,30 @@ void putback_lru_page(struct page *page)
 }
 #endif /* CONFIG_UNEVICTABLE_LRU */
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTRL
+/*
+ * Even if we don't call this, global LRU will finally find this SwapCache and
+ * free swap entry in the next loop. But, when memcg is used, we may have
+ * smaller chance to call global LRU's memory reclaim code.
+ * Freeing unused swap entry in aggresive way is good for avoid "leak" of swap
+ * entry accounting.
+ */
+static inline void unuse_swapcache_check_again(struct page *page)
+{
+	/*
+	 * The page is locked, but have extra reference from somewhere.
+	 * In typical case, rotate_reclaimable_page()'s extra refcnt makes
+	 * __remove_mapping fail. (see mm/swap.c)
+	 */
+	if (PageSwapCache(page))
+		try_to_free_swap(page);
+}
+#else
+static inline void unuse_swapcache_check_again(struct page *page)
+{
+}
+#endif
+
 
 /*
  * shrink_page_list() returns the number of reclaimed pages
@@ -758,9 +782,12 @@ static unsigned long shrink_page_list(st
 			}
 		}
 
-		if (!mapping || !__remove_mapping(mapping, page))
+		if (!mapping)
 			goto keep_locked;
-
+		if (!__remove_mapping(mapping, page)) {
+			unuse_swapcache_check_again(page);
+			goto keep_locked;
+		}
 		/*
 		 * At this point, we have no other references and there is
 		 * no way to pick any more up (removed from LRU, removed
Index: mmotm-2.6.30-Apr24/mm/swapfile.c
===================================================================
--- mmotm-2.6.30-Apr24.orig/mm/swapfile.c
+++ mmotm-2.6.30-Apr24/mm/swapfile.c
@@ -528,6 +528,43 @@ static inline int page_swapcount(struct 
 	return count;
 }
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+static inline int page_swapused_nolock(struct page *page)
+{
+	swp_entry_t entry;
+	unsigned long type, offset;
+	struct swap_info_struct *p;
+
+	entry.val = page_private(page);
+	type = swp_type(entry);
+	VM_BUG_ON(type >= nr_swapfiles);
+
+	offset = swp_offset(entry);
+	p = &swap_info[type];
+	VM_BUG_ON(!(p->flags & SWP_USED));
+	VM_BUG_ON(!(p->swap_map[offset]));
+
+	smp_rmb();
+	return p->swap_map[offset] != 1;
+}
+/*
+ * Use a lapping function not to allow reuse this function other than memcg.
+ */
+int mem_cgroup_staleswap_hint(struct page *page)
+{
+	/*
+	 * The page may not under lock_page() but Writeback is set in that case.
+	 * Then, swap_map is stable when this is called.
+	 * Very terrible troube will not occur even if page_swapused_nolock()
+	 * returns wrong value.
+	 * Because this can be called via interrupt context, we use nolock
+	 * version of swap's refcnt check.
+	 */
+	if (!PageSwapCache(page) || page_mapped(page))
+		return 1;
+	return page_swapused_nolock(page);
+}
+#endif
 /*
  * We can write to an anon page without COW if there are no other references
  * to it.  And as a side-effect, free up its swap: because the old content
Index: mmotm-2.6.30-Apr24/include/linux/swap.h
===================================================================
--- mmotm-2.6.30-Apr24.orig/include/linux/swap.h
+++ mmotm-2.6.30-Apr24/include/linux/swap.h
@@ -336,6 +336,7 @@ static inline void disable_swap_token(vo
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 extern void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent);
+extern int mem_cgroup_staleswap_hint(struct page *page);
 #else
 static inline void
 mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] memcg: fix stale swap cache leak v5
  2009-04-30  7:16 [PATCH] memcg: fix stale swap cache leak v5 KAMEZAWA Hiroyuki
@ 2009-04-30  7:35 ` KAMEZAWA Hiroyuki
  2009-04-30  9:04   ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 12+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-04-30  7:35 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	nishimura@mxp.nes.nec.co.jp, balbir@linux.vnet.ibm.com,
	akpm@linux-foundation.org, hugh@veritas.com

On Thu, 30 Apr 2009 16:16:27 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> This is v5 but all codes are rewritten.
> 
> After this patch, when memcg is used,
>  1. page's swapcount is checked after I/O (without locks). If the page is
>     stale swap cache, freeing routine will be scheduled.
>  2. vmscan.c calls try_to_free_swap() when __remove_mapping() fails.
> 
> Works well for me. no extra resources and no races.
> 
> Because my office will be closed until May/7, I'll not be able to make a
> response. Posting this for showing what I think of now.
> 
I found a hole immediately after posted this...sorry. plz ignore this patch/
see you again in the next month.

-Kame

> This should be fixed before posting softlimit etc...
> ==
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> In general, Linux's swp_entry handling is done by combination of lazy techniques
> and global LRU. It works well but when we use mem+swap controller, some more
> strict cotroll is appropriate. Otherwise, swp_entry used by a cgroup will be
> never freed until global LRU works. In a system where memcg is well-configured,
> global LRU doesn't work frequently.
> 
>   Example A) Assume a swap cache which is not mapped.
>               CPU0                            CPU1
> 	   zap_pte()....                  shrink_page_list()
> 	    free_swap_and_cache()           lock_page()
> 		page seems busy.
> 
>   Example B) Assume swapin-readahed.
> 	      CPU0			      CPU1
> 	   zap_pte()			  read_swap_cache_async()
> 					  swap_duplicate().
>            swap_entry_free() = 1
> 	   find_get_page()=> NULL.
> 					  add_to_swap_cache().
> 					  issue swap I/O. 
> 
> There are many patterns of this kind of race (but no problems).
> 
> free_swap_and_cache() is called for freeing swp_entry. But it is a best-effort
> function. If the swp_entry/page seems busy, swp_entry is not freed.
> This is not a problem because global-LRU will find SwapCache at page reclaim.
> But...
> 
> If memcg is used, on the other hand, global LRU may not work. Then, above
> unused SwapCache will not be freed.
> (unmapped SwapCache occupy swp_entry but never be freed if not on memcg's LRU)
> 
> So, even if there are no tasks in a cgroup, swp_entry usage still remains.
> In bad case, OOM by mem+swap controller is triggerred by this "leak" of
> swp_entry as Nishimura repoted.
> 
> This patch tries to fix racy case of free_swap_and_cache() and I/O by checking
> swap's refnct again after I/O. And add a hook to vmscan.c.
> After this patch applied, follwoing test works well.
> 
>   # echo 1-2M > ../memory.limit_in_bytes
>   # run tasks under memcg.
>   # kill all tasks and make memory.tasks empty
>   # check memory.memsw.usage_in_bytes == memory.usage_in_bytes and
>     there is no _used_ swp_entry.
> 
> Changelog: v4->v5
>  - completely new design.
>  - added nolock page_swapcount.
>  - checks all swap I/O.
> 
> Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  include/linux/swap.h |    1 
>  mm/page_io.c         |  120 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  mm/swapfile.c        |   37 +++++++++++++++
>  mm/vmscan.c          |   31 ++++++++++++-
>  4 files changed, 187 insertions(+), 2 deletions(-)
> 
> Index: mmotm-2.6.30-Apr24/mm/page_io.c
> ===================================================================
> --- mmotm-2.6.30-Apr24.orig/mm/page_io.c
> +++ mmotm-2.6.30-Apr24/mm/page_io.c
> @@ -19,6 +19,123 @@
>  #include <linux/writeback.h>
>  #include <asm/pgtable.h>
>  
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +/*
> + * When memory cgroup is used, race between read/write-swap and zap-swap can
> + * be a leak of swp_entry accounted. So we have to check the status at
> + * the end of swap-io. If memory cgroup is not used, Global LRU will find
> + * unused swap-cache finally. (But this is too lazy for memcg.)
> + */
> +
> +struct swapio_check {
> +	spinlock_t	lock;
> +	void		*swap_bio_list;
> +	struct delayed_work work;
> +} stale_swap_check;
> +
> +
> +static void mem_cgroup_check_stale_swap(struct work_struct *work)
> +{
> +	struct bio *bio;
> +	struct page *page;
> +	struct swapio_check *sc;
> +	int nr = SWAP_CLUSTER_MAX;
> +	swp_entry_t entry;
> +
> +	sc = &stale_swap_check;
> +
> +	while (nr--) {
> +		cond_resched();
> +		spin_lock_irq(&sc->lock);
> +		bio = sc->swap_bio_list;
> +		if (bio)
> +			sc->swap_bio_list = bio->bi_next;
> +		spin_unlock_irq(&sc->lock);
> +		if (!bio)
> +			break;
> +		entry.val = (unsigned long)bio->bi_private;
> +		bio_put(bio);
> +
> +		page = find_get_page(&swapper_space, entry.val);
> +		if (!page || page_mapped(page))
> +			continue;
> +		lock_page(page);
> +		/*
> +		 * When it's mapped, this page passed checks in do_swap_page()
> +		 * and we don't have to do any more. All other necessary checks
> +		 * will be done in try_to_free_swap().
> +		 */
> +		if (!page_mapped(page))
> +			try_to_free_swap(page);
> +		unlock_page(page);
> +		put_page(page);
> +	}
> +	if (sc->swap_bio_list)
> +		schedule_delayed_work(&sc->work, HZ/10);
> +}
> +
> +/*
> + * We can't call try_to_free_swap directly here because of caller's context.
> + */
> +static void mem_cgroup_swapio_check_again(struct bio *bio, struct page *page)
> +{
> +	unsigned long flags;
> +	struct swapio_check *sc;
> +	swp_entry_t entry;
> +	int ret;
> +
> +	/* check swap count here. If swp_entry is stable, nothing to do.*/
> +	if (likely(mem_cgroup_staleswap_hint(page)))
> +		return;
> +	/* reuse bio if this bio is ready to be freed. */
> +	ret = atomic_inc_return(&bio->bi_cnt);
> +	/* Any other reference other than us ? */
> +	if (unlikely(ret > 2)) {
> +		bio_put(bio);
> +		return;
> +	}
> +	/*
> +	 * We don't want to grab this page....record swp_entry instead of page.
> +	 */
> +	entry.val = page_private(page);
> +	bio->bi_private = (void *)entry.val;
> +
> +	sc = &stale_swap_check;
> +	spin_lock_irqsave(&sc->lock, flags);
> +	/* link bio */
> +	bio->bi_next = sc->swap_bio_list;
> +	sc->swap_bio_list = bio;
> +	spin_unlock_irqrestore(&sc->lock, flags);
> +	/*
> +	 * Swap I/O is tend to be countinous. Do check in batched manner.
> +	 */
> +	if (!delayed_work_pending(&sc->work))
> +		schedule_delayed_work(&sc->work, HZ/10);
> +}
> +
> +static int __init setup_stale_swap_check(void)
> +{
> +	struct swapio_check *sc;
> +
> +	sc = &stale_swap_check;
> +	spin_lock_init(&sc->lock);
> +	sc->swap_bio_list = NULL;
> +	INIT_DELAYED_WORK(&sc->work, mem_cgroup_check_stale_swap);
> +	return 0;
> +}
> +late_initcall(setup_stale_swap_check);
> +
> +
> +#else /* CONFIG_CGROUP_MEM_RES_CTRL */
> +
> +static inline
> +void mem_cgroup_swapio_check_again(struct bio *bio, struct page *page)
> +{
> +}
> +#endif
> +
> +
> +
>  static struct bio *get_swap_bio(gfp_t gfp_flags, pgoff_t index,
>  				struct page *page, bio_end_io_t end_io)
>  {
> @@ -66,6 +183,8 @@ static void end_swap_bio_write(struct bi
>  				(unsigned long long)bio->bi_sector);
>  		ClearPageReclaim(page);
>  	}
> +	/* While PG_writeback, this page is stable ...then, call this here */
> +	mem_cgroup_swapio_check_again(bio, page);
>  	end_page_writeback(page);
>  	bio_put(bio);
>  }
> @@ -85,6 +204,7 @@ void end_swap_bio_read(struct bio *bio, 
>  	} else {
>  		SetPageUptodate(page);
>  	}
> +	mem_cgroup_swapio_check_again(bio, page);
>  	unlock_page(page);
>  	bio_put(bio);
>  }
> Index: mmotm-2.6.30-Apr24/mm/vmscan.c
> ===================================================================
> --- mmotm-2.6.30-Apr24.orig/mm/vmscan.c
> +++ mmotm-2.6.30-Apr24/mm/vmscan.c
> @@ -586,6 +586,30 @@ void putback_lru_page(struct page *page)
>  }
>  #endif /* CONFIG_UNEVICTABLE_LRU */
>  
> +#ifdef CONFIG_CGROUP_MEM_RES_CTRL
> +/*
> + * Even if we don't call this, global LRU will finally find this SwapCache and
> + * free swap entry in the next loop. But, when memcg is used, we may have
> + * smaller chance to call global LRU's memory reclaim code.
> + * Freeing unused swap entry in aggresive way is good for avoid "leak" of swap
> + * entry accounting.
> + */
> +static inline void unuse_swapcache_check_again(struct page *page)
> +{
> +	/*
> +	 * The page is locked, but have extra reference from somewhere.
> +	 * In typical case, rotate_reclaimable_page()'s extra refcnt makes
> +	 * __remove_mapping fail. (see mm/swap.c)
> +	 */
> +	if (PageSwapCache(page))
> +		try_to_free_swap(page);
> +}
> +#else
> +static inline void unuse_swapcache_check_again(struct page *page)
> +{
> +}
> +#endif
> +
>  
>  /*
>   * shrink_page_list() returns the number of reclaimed pages
> @@ -758,9 +782,12 @@ static unsigned long shrink_page_list(st
>  			}
>  		}
>  
> -		if (!mapping || !__remove_mapping(mapping, page))
> +		if (!mapping)
>  			goto keep_locked;
> -
> +		if (!__remove_mapping(mapping, page)) {
> +			unuse_swapcache_check_again(page);
> +			goto keep_locked;
> +		}
>  		/*
>  		 * At this point, we have no other references and there is
>  		 * no way to pick any more up (removed from LRU, removed
> Index: mmotm-2.6.30-Apr24/mm/swapfile.c
> ===================================================================
> --- mmotm-2.6.30-Apr24.orig/mm/swapfile.c
> +++ mmotm-2.6.30-Apr24/mm/swapfile.c
> @@ -528,6 +528,43 @@ static inline int page_swapcount(struct 
>  	return count;
>  }
>  
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +static inline int page_swapused_nolock(struct page *page)
> +{
> +	swp_entry_t entry;
> +	unsigned long type, offset;
> +	struct swap_info_struct *p;
> +
> +	entry.val = page_private(page);
> +	type = swp_type(entry);
> +	VM_BUG_ON(type >= nr_swapfiles);
> +
> +	offset = swp_offset(entry);
> +	p = &swap_info[type];
> +	VM_BUG_ON(!(p->flags & SWP_USED));
> +	VM_BUG_ON(!(p->swap_map[offset]));
> +
> +	smp_rmb();
> +	return p->swap_map[offset] != 1;
> +}
> +/*
> + * Use a lapping function not to allow reuse this function other than memcg.
> + */
> +int mem_cgroup_staleswap_hint(struct page *page)
> +{
> +	/*
> +	 * The page may not under lock_page() but Writeback is set in that case.
> +	 * Then, swap_map is stable when this is called.
> +	 * Very terrible troube will not occur even if page_swapused_nolock()
> +	 * returns wrong value.
> +	 * Because this can be called via interrupt context, we use nolock
> +	 * version of swap's refcnt check.
> +	 */
> +	if (!PageSwapCache(page) || page_mapped(page))
> +		return 1;
> +	return page_swapused_nolock(page);
> +}
> +#endif
>  /*
>   * We can write to an anon page without COW if there are no other references
>   * to it.  And as a side-effect, free up its swap: because the old content
> Index: mmotm-2.6.30-Apr24/include/linux/swap.h
> ===================================================================
> --- mmotm-2.6.30-Apr24.orig/include/linux/swap.h
> +++ mmotm-2.6.30-Apr24/include/linux/swap.h
> @@ -336,6 +336,7 @@ static inline void disable_swap_token(vo
>  
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR
>  extern void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent);
> +extern int mem_cgroup_staleswap_hint(struct page *page);
>  #else
>  static inline void
>  mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] memcg: fix stale swap cache leak v5
  2009-04-30  7:35 ` KAMEZAWA Hiroyuki
@ 2009-04-30  9:04   ` KAMEZAWA Hiroyuki
  2009-04-30  9:42     ` Balbir Singh
  0 siblings, 1 reply; 12+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-04-30  9:04 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	nishimura@mxp.nes.nec.co.jp, balbir@linux.vnet.ibm.com,
	akpm@linux-foundation.org, hugh@veritas.com

On Thu, 30 Apr 2009 16:35:39 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Thu, 30 Apr 2009 16:16:27 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > This is v5 but all codes are rewritten.
> > 
> > After this patch, when memcg is used,
> >  1. page's swapcount is checked after I/O (without locks). If the page is
> >     stale swap cache, freeing routine will be scheduled.
> >  2. vmscan.c calls try_to_free_swap() when __remove_mapping() fails.
> > 
> > Works well for me. no extra resources and no races.
> > 
> > Because my office will be closed until May/7, I'll not be able to make a
> > response. Posting this for showing what I think of now.
> > 
> I found a hole immediately after posted this...sorry. plz ignore this patch/
> see you again in the next month.
> 
I'm now wondering to disable "swapin readahed" completely when memcg is used...
Then, half of the problems will go away immediately.
And it's not so bad to try to free swapcache if swap writeback ends. Then, another
half will go away...

Regards,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] memcg: fix stale swap cache leak v5
  2009-04-30  9:04   ` KAMEZAWA Hiroyuki
@ 2009-04-30  9:42     ` Balbir Singh
  2009-04-30  9:47       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 12+ messages in thread
From: Balbir Singh @ 2009-04-30  9:42 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	nishimura@mxp.nes.nec.co.jp, akpm@linux-foundation.org,
	hugh@veritas.com

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-04-30 18:04:26]:

> On Thu, 30 Apr 2009 16:35:39 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > On Thu, 30 Apr 2009 16:16:27 +0900
> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > 
> > > This is v5 but all codes are rewritten.
> > > 
> > > After this patch, when memcg is used,
> > >  1. page's swapcount is checked after I/O (without locks). If the page is
> > >     stale swap cache, freeing routine will be scheduled.
> > >  2. vmscan.c calls try_to_free_swap() when __remove_mapping() fails.
> > > 
> > > Works well for me. no extra resources and no races.
> > > 
> > > Because my office will be closed until May/7, I'll not be able to make a
> > > response. Posting this for showing what I think of now.
> > > 
> > I found a hole immediately after posted this...sorry. plz ignore this patch/
> > see you again in the next month.
> > 
> I'm now wondering to disable "swapin readahed" completely when memcg is used...
> Then, half of the problems will go away immediately.
> And it's not so bad to try to free swapcache if swap writeback ends. Then, another
> half will go away...
>

Could you clarify? Will memcg not account for swapin readahead pages?
 

-- 
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] memcg: fix stale swap cache leak v5
  2009-04-30  9:42     ` Balbir Singh
@ 2009-04-30  9:47       ` KAMEZAWA Hiroyuki
  2009-04-30 18:12         ` Balbir Singh
  0 siblings, 1 reply; 12+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-04-30  9:47 UTC (permalink / raw)
  To: balbir
  Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	nishimura@mxp.nes.nec.co.jp, akpm@linux-foundation.org,
	hugh@veritas.com

On Thu, 30 Apr 2009 15:12:52 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-04-30 18:04:26]:
> 
> > On Thu, 30 Apr 2009 16:35:39 +0900
> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > 
> > > On Thu, 30 Apr 2009 16:16:27 +0900
> > > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > 
> > > > This is v5 but all codes are rewritten.
> > > > 
> > > > After this patch, when memcg is used,
> > > >  1. page's swapcount is checked after I/O (without locks). If the page is
> > > >     stale swap cache, freeing routine will be scheduled.
> > > >  2. vmscan.c calls try_to_free_swap() when __remove_mapping() fails.
> > > > 
> > > > Works well for me. no extra resources and no races.
> > > > 
> > > > Because my office will be closed until May/7, I'll not be able to make a
> > > > response. Posting this for showing what I think of now.
> > > > 
> > > I found a hole immediately after posted this...sorry. plz ignore this patch/
> > > see you again in the next month.
> > > 
> > I'm now wondering to disable "swapin readahed" completely when memcg is used...
> > Then, half of the problems will go away immediately.
> > And it's not so bad to try to free swapcache if swap writeback ends. Then, another
> > half will go away...
> >
> 
> Could you clarify? Will memcg not account for swapin readahead pages?
>  
swapin-readahead pages are _not_ accounted now. (And I think _never_)
But has race and leak swp_entry account until global LRU runs.

"Don't do swapin-readahead, at all" will remove following race completely.
==
         CPU0                  CPU1
 free_swap_and_cache()
                        read_swapcache_async()
==
swp_entry to be freed will not be read-in.

I think there will no performance regression in _usual_ case even if no readahead.
But has no number yet.


Thanks,
-Kame



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] memcg: fix stale swap cache leak v5
  2009-04-30  9:47       ` KAMEZAWA Hiroyuki
@ 2009-04-30 18:12         ` Balbir Singh
  2009-05-01  4:33           ` Daisuke Nishimura
  0 siblings, 1 reply; 12+ messages in thread
From: Balbir Singh @ 2009-04-30 18:12 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	nishimura@mxp.nes.nec.co.jp, akpm@linux-foundation.org,
	hugh@veritas.com

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-04-30 18:47:38]:

> On Thu, 30 Apr 2009 15:12:52 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-04-30 18:04:26]:
> > 
> > > On Thu, 30 Apr 2009 16:35:39 +0900
> > > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > 
> > > > On Thu, 30 Apr 2009 16:16:27 +0900
> > > > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > > 
> > > > > This is v5 but all codes are rewritten.
> > > > > 
> > > > > After this patch, when memcg is used,
> > > > >  1. page's swapcount is checked after I/O (without locks). If the page is
> > > > >     stale swap cache, freeing routine will be scheduled.
> > > > >  2. vmscan.c calls try_to_free_swap() when __remove_mapping() fails.
> > > > > 
> > > > > Works well for me. no extra resources and no races.
> > > > > 
> > > > > Because my office will be closed until May/7, I'll not be able to make a
> > > > > response. Posting this for showing what I think of now.
> > > > > 
> > > > I found a hole immediately after posted this...sorry. plz ignore this patch/
> > > > see you again in the next month.
> > > > 
> > > I'm now wondering to disable "swapin readahed" completely when memcg is used...
> > > Then, half of the problems will go away immediately.
> > > And it's not so bad to try to free swapcache if swap writeback ends. Then, another
> > > half will go away...
> > >
> > 
> > Could you clarify? Will memcg not account for swapin readahead pages?
> >  
> swapin-readahead pages are _not_ accounted now. (And I think _never_)
> But has race and leak swp_entry account until global LRU runs.
> 
> "Don't do swapin-readahead, at all" will remove following race completely.
> ==
>          CPU0                  CPU1
>  free_swap_and_cache()
>                         read_swapcache_async()
> ==
> swp_entry to be freed will not be read-in.
> 
> I think there will no performance regression in _usual_ case even if no readahead.
> But has no number yet.
>

Kamezawa, Daisuke,

Can't we just correct the accounting and leave the page on the global
LRU?

Daisuke in the race conditions mentioned is (2) significant? Since the
accounting is already fixed during mem_cgroup_uncharge_page()?

 

-- 
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] memcg: fix stale swap cache leak v5
  2009-04-30 18:12         ` Balbir Singh
@ 2009-05-01  4:33           ` Daisuke Nishimura
  2009-05-01 18:32             ` Balbir Singh
  2009-05-04 16:38             ` Balbir Singh
  0 siblings, 2 replies; 12+ messages in thread
From: Daisuke Nishimura @ 2009-05-01  4:33 UTC (permalink / raw)
  To: balbir
  Cc: KAMEZAWA Hiroyuki, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, akpm@linux-foundation.org, hugh@veritas.com,
	Daisuke Nishimura, d-nishimura

On Thu, 30 Apr 2009 23:42:46 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-04-30 18:47:38]:
> 
> > On Thu, 30 Apr 2009 15:12:52 +0530
> > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > 
> > > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-04-30 18:04:26]:
> > > 
> > > > On Thu, 30 Apr 2009 16:35:39 +0900
> > > > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > > 
> > > > > On Thu, 30 Apr 2009 16:16:27 +0900
> > > > > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > > > 
> > > > > > This is v5 but all codes are rewritten.
> > > > > > 
> > > > > > After this patch, when memcg is used,
> > > > > >  1. page's swapcount is checked after I/O (without locks). If the page is
> > > > > >     stale swap cache, freeing routine will be scheduled.
> > > > > >  2. vmscan.c calls try_to_free_swap() when __remove_mapping() fails.
> > > > > > 
> > > > > > Works well for me. no extra resources and no races.
> > > > > > 
> > > > > > Because my office will be closed until May/7, I'll not be able to make a
> > > > > > response. Posting this for showing what I think of now.
> > > > > > 
> > > > > I found a hole immediately after posted this...sorry. plz ignore this patch/
> > > > > see you again in the next month.
> > > > > 
> > > > I'm now wondering to disable "swapin readahed" completely when memcg is used...
> > > > Then, half of the problems will go away immediately.
> > > > And it's not so bad to try to free swapcache if swap writeback ends. Then, another
> > > > half will go away...
> > > >
> > > 
> > > Could you clarify? Will memcg not account for swapin readahead pages?
> > >  
> > swapin-readahead pages are _not_ accounted now. (And I think _never_)
> > But has race and leak swp_entry account until global LRU runs.
> > 
> > "Don't do swapin-readahead, at all" will remove following race completely.
> > ==
> >          CPU0                  CPU1
> >  free_swap_and_cache()
> >                         read_swapcache_async()
> > ==
> > swp_entry to be freed will not be read-in.
> > 
> > I think there will no performance regression in _usual_ case even if no readahead.
> > But has no number yet.
> >
> 
> Kamezawa, Daisuke,
> 
> Can't we just correct the accounting and leave the page on the global
> LRU?
> 
> Daisuke in the race conditions mentioned is (2) significant? Since the
> accounting is already fixed during mem_cgroup_uncharge_page()?
> 
Do you mean type-2 stale swap caches I described before ?

They doesn't pressure mem.usage nor memsw.usage as you say,
but consumes swp_entry(of cource, type-1 has this problem too).
As a result, all the swap space can be used up and causes OOM.

I've verified it long ago by:

- make swap space small(50MB).
- set mem.limit(32MB).
- run some programs(allocate, touch sometimes, exit) enough to
  exceed mem.limit repeatedly(I used page01 included in ltp and run
  5 instances 8MB per each in cpuset with 4cpus.).
- wait for a very long time :) (2,30 hours IIRC)
  You can see the usage of swap cache(grep SwapCached /proc/meminfo)
  increasing gradually.


BTW, I'm now testing a attached patch to fix type-2 with setting page-cluster
to 0 to aboid type-1, and seeing what happens in the usage of swap cache.
(I can't test it in large box though, because my office is closed till May 06.)

Thanks,
Daisuke Nishimura.
===
From: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>

memcg: free unused swapcache on swapout path

memcg cannot handle !PageCgroupUsed swapcache the owner process of which
has been exited.

This patch is for handling such swap caches created by a race like below:

    Assume processA is exiting and pte points to a page(!PageSwapCache).
    And processB is trying reclaim the page.

              processA                   |           processB
    -------------------------------------+-------------------------------------
      (page_remove_rmap())               |  (shrink_page_list())
         mem_cgroup_uncharge_page()      |
            ->uncharged because it's not |
              PageSwapCache yet.         |
              So, both mem/memsw.usage   |
              are decremented.           |
                                         |    add_to_swap() -> added to swap cache.

    If this page goes thorough without being freed for some reason, this page
    doesn't goes back to memcg's LRU because of !PageCgroupUsed.

These swap cache cannot be freed in memcg's LRU scanning, and swp_entry cannot
be freed properly as a result.
This patch adds a hook after add_to_swap() to check the page is mapped by a
process or not, and frees it if it has been unmapped already.

If a page has been on swap cache already when the owner process calls
page_remove_rmap() -> mem_cgroup_uncharge_page(), the page is not uncharged.
It goes back to memcg's LRU even if it goes through shrink_page_list()
without being freed, so this patch ignores these case.

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
---
 include/linux/swap.h |   12 ++++++++++++
 mm/memcontrol.c      |   14 ++++++++++++++
 mm/vmscan.c          |    8 ++++++++
 3 files changed, 34 insertions(+), 0 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index caf0767..8e75d7a 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -336,11 +336,17 @@ static inline void disable_swap_token(void)
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 extern void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent);
+extern int memcg_free_unused_swapcache(struct page *page);
 #else
 static inline void
 mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
 {
 }
+static inline int
+memcg_free_unused_swapcache(struct page *page)
+{
+	return 0;
+}
 #endif
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
 extern void mem_cgroup_uncharge_swap(swp_entry_t ent);
@@ -431,6 +437,12 @@ static inline swp_entry_t get_swap_page(void)
 #define has_swap_token(x) 0
 #define disable_swap_token() do { } while(0)
 
+static inline int
+memcg_free_unused_swapcache(struct page *page)
+{
+	return 0;
+}
+
 #endif /* CONFIG_SWAP */
 #endif /* __KERNEL__*/
 #endif /* _LINUX_SWAP_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 01c2d8f..4f7e5b6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1488,6 +1488,7 @@ void mem_cgroup_uncharge_cache_page(struct page *page)
 	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
 }
 
+#ifdef CONFIG_SWAP
 /*
  * called from __delete_from_swap_cache() and drop "page" account.
  * memcg information is recorded to swap_cgroup of "ent"
@@ -1507,6 +1508,19 @@ void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
 		css_put(&memcg->css);
 }
 
+int memcg_free_unused_swapcache(struct page *page)
+{
+	VM_BUG_ON(!PageLocked(page));
+	VM_BUG_ON(!PageSwapCache(page));
+
+	if (mem_cgroup_disabled())
+		return 0;
+	if (!PageAnon(page) || page_mapped(page))
+		return 0;
+	return try_to_free_swap(page);	/* checks page_swapcount */
+}
+#endif /* CONFIG_SWAP */
+
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
 /*
  * called from swap_entry_free(). remove record in swap_cgroup and
diff --git a/mm/vmscan.c b/mm/vmscan.c
index eac9577..c1a7a6f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -656,6 +656,14 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 				goto keep_locked;
 			if (!add_to_swap(page))
 				goto activate_locked;
+			/*
+			 * The owner process might have uncharged the page
+			 * (by page_remove_rmap()) before it has been added
+			 * to swap cache.
+			 * Check it here to avoid making it stale.
+			 */
+			if (memcg_free_unused_swapcache(page))
+				goto keep_locked;
 			may_enter_fs = 1;
 		}
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH] memcg: fix stale swap cache leak v5
  2009-05-01  4:33           ` Daisuke Nishimura
@ 2009-05-01 18:32             ` Balbir Singh
  2009-05-02  2:56               ` Daisuke Nishimura
  2009-05-04 16:38             ` Balbir Singh
  1 sibling, 1 reply; 12+ messages in thread
From: Balbir Singh @ 2009-05-01 18:32 UTC (permalink / raw)
  To: nishimura
  Cc: KAMEZAWA Hiroyuki, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, akpm@linux-foundation.org, hugh@veritas.com,
	d-nishimura

* Daisuke Nishimura <d-nishimura@mtf.biglobe.ne.jp> [2009-05-01 13:33:17]:

> On Thu, 30 Apr 2009 23:42:46 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-04-30 18:47:38]:
> > 
> > > On Thu, 30 Apr 2009 15:12:52 +0530
> > > Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > > 
> > > > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-04-30 18:04:26]:
> > > > 
> > > > > On Thu, 30 Apr 2009 16:35:39 +0900
> > > > > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > > > 
> > > > > > On Thu, 30 Apr 2009 16:16:27 +0900
> > > > > > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > > > > 
> > > > > > > This is v5 but all codes are rewritten.
> > > > > > > 
> > > > > > > After this patch, when memcg is used,
> > > > > > >  1. page's swapcount is checked after I/O (without locks). If the page is
> > > > > > >     stale swap cache, freeing routine will be scheduled.
> > > > > > >  2. vmscan.c calls try_to_free_swap() when __remove_mapping() fails.
> > > > > > > 
> > > > > > > Works well for me. no extra resources and no races.
> > > > > > > 
> > > > > > > Because my office will be closed until May/7, I'll not be able to make a
> > > > > > > response. Posting this for showing what I think of now.
> > > > > > > 
> > > > > > I found a hole immediately after posted this...sorry. plz ignore this patch/
> > > > > > see you again in the next month.
> > > > > > 
> > > > > I'm now wondering to disable "swapin readahed" completely when memcg is used...
> > > > > Then, half of the problems will go away immediately.
> > > > > And it's not so bad to try to free swapcache if swap writeback ends. Then, another
> > > > > half will go away...
> > > > >
> > > > 
> > > > Could you clarify? Will memcg not account for swapin readahead pages?
> > > >  
> > > swapin-readahead pages are _not_ accounted now. (And I think _never_)
> > > But has race and leak swp_entry account until global LRU runs.
> > > 
> > > "Don't do swapin-readahead, at all" will remove following race completely.
> > > ==
> > >          CPU0                  CPU1
> > >  free_swap_and_cache()
> > >                         read_swapcache_async()
> > > ==
> > > swp_entry to be freed will not be read-in.
> > > 
> > > I think there will no performance regression in _usual_ case even if no readahead.
> > > But has no number yet.
> > >
> > 
> > Kamezawa, Daisuke,
> > 
> > Can't we just correct the accounting and leave the page on the global
> > LRU?
> > 
> > Daisuke in the race conditions mentioned is (2) significant? Since the
> > accounting is already fixed during mem_cgroup_uncharge_page()?
> > 
> Do you mean type-2 stale swap caches I described before ?
> 
> They doesn't pressure mem.usage nor memsw.usage as you say,
> but consumes swp_entry(of cource, type-1 has this problem too).
> As a result, all the swap space can be used up and causes OOM.
> 

Good point..

> I've verified it long ago by:
> 
> - make swap space small(50MB).
> - set mem.limit(32MB).
> - run some programs(allocate, touch sometimes, exit) enough to
>   exceed mem.limit repeatedly(I used page01 included in ltp and run
>   5 instances 8MB per each in cpuset with 4cpus.).
> - wait for a very long time :) (2,30 hours IIRC)
>   You can see the usage of swap cache(grep SwapCached /proc/meminfo)
>   increasing gradually.
> 
> 
> BTW, I'm now testing a attached patch to fix type-2 with setting page-cluster
> to 0 to aboid type-1, and seeing what happens in the usage of swap cache.
> (I can't test it in large box though, because my office is closed till May 06.)
> 
> Thanks,
> Daisuke Nishimura.
> ===
> From: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> 
> memcg: free unused swapcache on swapout path
> 
> memcg cannot handle !PageCgroupUsed swapcache the owner process of which
> has been exited.
> 
> This patch is for handling such swap caches created by a race like below:
> 
>     Assume processA is exiting and pte points to a page(!PageSwapCache).
>     And processB is trying reclaim the page.
> 
>               processA                   |           processB
>     -------------------------------------+-------------------------------------
>       (page_remove_rmap())               |  (shrink_page_list())
>          mem_cgroup_uncharge_page()      |
>             ->uncharged because it's not |
>               PageSwapCache yet.         |
>               So, both mem/memsw.usage   |
>               are decremented.           |
>                                          |    add_to_swap() -> added to swap cache.
> 
>     If this page goes thorough without being freed for some reason, this page
>     doesn't goes back to memcg's LRU because of !PageCgroupUsed.
> 
> These swap cache cannot be freed in memcg's LRU scanning, and swp_entry cannot
> be freed properly as a result.
> This patch adds a hook after add_to_swap() to check the page is mapped by a
> process or not, and frees it if it has been unmapped already.
> 
> If a page has been on swap cache already when the owner process calls
> page_remove_rmap() -> mem_cgroup_uncharge_page(), the page is not uncharged.
> It goes back to memcg's LRU even if it goes through shrink_page_list()
> without being freed, so this patch ignores these case.
> 
> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> ---
>  include/linux/swap.h |   12 ++++++++++++
>  mm/memcontrol.c      |   14 ++++++++++++++
>  mm/vmscan.c          |    8 ++++++++
>  3 files changed, 34 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index caf0767..8e75d7a 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -336,11 +336,17 @@ static inline void disable_swap_token(void)
> 
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR
>  extern void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent);
> +extern int memcg_free_unused_swapcache(struct page *page);
>  #else
>  static inline void
>  mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
>  {
>  }
> +static inline int
> +memcg_free_unused_swapcache(struct page *page)
> +{
> +	return 0;
> +}
>  #endif
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
>  extern void mem_cgroup_uncharge_swap(swp_entry_t ent);
> @@ -431,6 +437,12 @@ static inline swp_entry_t get_swap_page(void)
>  #define has_swap_token(x) 0
>  #define disable_swap_token() do { } while(0)
> 
> +static inline int
> +memcg_free_unused_swapcache(struct page *page)
> +{
> +	return 0;
> +}
> +
>  #endif /* CONFIG_SWAP */
>  #endif /* __KERNEL__*/
>  #endif /* _LINUX_SWAP_H */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 01c2d8f..4f7e5b6 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1488,6 +1488,7 @@ void mem_cgroup_uncharge_cache_page(struct page *page)
>  	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
>  }
> 
> +#ifdef CONFIG_SWAP
>  /*
>   * called from __delete_from_swap_cache() and drop "page" account.
>   * memcg information is recorded to swap_cgroup of "ent"
> @@ -1507,6 +1508,19 @@ void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
>  		css_put(&memcg->css);
>  }
> 
> +int memcg_free_unused_swapcache(struct page *page)
> +{
> +	VM_BUG_ON(!PageLocked(page));
> +	VM_BUG_ON(!PageSwapCache(page));
> +
> +	if (mem_cgroup_disabled())
> +		return 0;
> +	if (!PageAnon(page) || page_mapped(page))
> +		return 0;
> +	return try_to_free_swap(page);	/* checks page_swapcount */
> +}
> +#endif /* CONFIG_SWAP */
> +
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
>  /*
>   * called from swap_entry_free(). remove record in swap_cgroup and
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index eac9577..c1a7a6f 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -656,6 +656,14 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  				goto keep_locked;
>  			if (!add_to_swap(page))
>  				goto activate_locked;
> +			/*
> +			 * The owner process might have uncharged the page
> +			 * (by page_remove_rmap()) before it has been added
> +			 * to swap cache.
> +			 * Check it here to avoid making it stale.
> +			 */
> +			if (memcg_free_unused_swapcache(page))
> +				goto keep_locked;
>  			may_enter_fs = 1;
>  		}
> 
> 
Looking through the patch, I have my doubts

 shrink_page_list() will catch the page - how? It is not on memcg's
LRU, so if we have a small cgroup with lots of memory, when the cgroup
is running out of memory, will this page show up in
shrink_page_list()?

-- 
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] memcg: fix stale swap cache leak v5
  2009-05-01 18:32             ` Balbir Singh
@ 2009-05-02  2:56               ` Daisuke Nishimura
  2009-05-02  3:09                 ` Daisuke Nishimura
  0 siblings, 1 reply; 12+ messages in thread
From: Daisuke Nishimura @ 2009-05-02  2:56 UTC (permalink / raw)
  To: balbir
  Cc: KAMEZAWA Hiroyuki, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, akpm@linux-foundation.org, hugh@veritas.com,
	d-nishimura, nishimura

> > > Daisuke in the race conditions mentioned is (2) significant? Since the
> > > accounting is already fixed during mem_cgroup_uncharge_page()?
> > > 
> > Do you mean type-2 stale swap caches I described before ?
> > 
> > They doesn't pressure mem.usage nor memsw.usage as you say,
> > but consumes swp_entry(of cource, type-1 has this problem too).
> > As a result, all the swap space can be used up and causes OOM.
> > 
> 
> Good point..
> 
> > I've verified it long ago by:
> > 
> > - make swap space small(50MB).
> > - set mem.limit(32MB).
> > - run some programs(allocate, touch sometimes, exit) enough to
> >   exceed mem.limit repeatedly(I used page01 included in ltp and run
> >   5 instances 8MB per each in cpuset with 4cpus.).
> > - wait for a very long time :) (2,30 hours IIRC)
> >   You can see the usage of swap cache(grep SwapCached /proc/meminfo)
> >   increasing gradually.
> > 
> > 
> > BTW, I'm now testing a attached patch to fix type-2 with setting page-cluster
> > to 0 to aboid type-1, and seeing what happens in the usage of swap cache.
> > (I can't test it in large box though, because my office is closed till May 06.)
> > 
In my small box(i386/2CPU/2GB mem), a similar test shows after 12H
about 600MB leak of swap cache before applying this patch,
while no outstanding leak can be seen after it.

(snip)

> Looking through the patch, I have my doubts
> 
>  shrink_page_list() will catch the page - how? It is not on memcg's
> LRU, so if we have a small cgroup with lots of memory, when the cgroup
> is running out of memory, will this page show up in
> shrink_page_list()?
> 
It's just because the page has not been uncharged yet when shrink_page_list()
is called(,more precicely, when shrink_inactive_list() isolates the page),
so it can be handled in memcg's LRU scanning.


Thanks,
Daisuke Nishimura.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] memcg: fix stale swap cache leak v5
  2009-05-02  2:56               ` Daisuke Nishimura
@ 2009-05-02  3:09                 ` Daisuke Nishimura
  0 siblings, 0 replies; 12+ messages in thread
From: Daisuke Nishimura @ 2009-05-02  3:09 UTC (permalink / raw)
  To: nishimura
  Cc: balbir, KAMEZAWA Hiroyuki, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, akpm@linux-foundation.org, hugh@veritas.com,
	d-nishimura

> In my small box(i386/2CPU/2GB mem), a similar test shows after 12H
> about 600MB leak of swap cache before applying this patch,
Oops, not 600MB but 600KB.

Sorry,
Daisuke Nishimura.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] memcg: fix stale swap cache leak v5
  2009-05-01  4:33           ` Daisuke Nishimura
  2009-05-01 18:32             ` Balbir Singh
@ 2009-05-04 16:38             ` Balbir Singh
  2009-05-07  6:59               ` Daisuke Nishimura
  1 sibling, 1 reply; 12+ messages in thread
From: Balbir Singh @ 2009-05-04 16:38 UTC (permalink / raw)
  To: nishimura
  Cc: KAMEZAWA Hiroyuki, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, akpm@linux-foundation.org, hugh@veritas.com,
	d-nishimura

* Daisuke Nishimura <d-nishimura@mtf.biglobe.ne.jp> [2009-05-01 13:33:17]:

>               processA                   |           processB
>     -------------------------------------+-------------------------------------
>       (page_remove_rmap())               |  (shrink_page_list())
>          mem_cgroup_uncharge_page()      |
>             ->uncharged because it's not |
>               PageSwapCache yet.         |
>               So, both mem/memsw.usage   |
>               are decremented.           |
>                                          |    add_to_swap() -> added to swap cache.
> 
>     If this page goes thorough without being freed for some reason, this page
>     doesn't goes back to memcg's LRU because of !PageCgroupUsed.

For some reason could use some clarification.

> 
> These swap cache cannot be freed in memcg's LRU scanning, and swp_entry cannot
> be freed properly as a result.
> This patch adds a hook after add_to_swap() to check the page is mapped by a
> process or not, and frees it if it has been unmapped already.
> 
> If a page has been on swap cache already when the owner process calls
> page_remove_rmap() -> mem_cgroup_uncharge_page(), the page is not uncharged.
> It goes back to memcg's LRU even if it goes through shrink_page_list()
> without being freed, so this patch ignores these case.
> 
> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> ---
>  include/linux/swap.h |   12 ++++++++++++
>  mm/memcontrol.c      |   14 ++++++++++++++
>  mm/vmscan.c          |    8 ++++++++
>  3 files changed, 34 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index caf0767..8e75d7a 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -336,11 +336,17 @@ static inline void disable_swap_token(void)
> 
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR
>  extern void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent);
> +extern int memcg_free_unused_swapcache(struct page *page);
>  #else
>  static inline void
>  mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
>  {
>  }
> +static inline int
> +memcg_free_unused_swapcache(struct page *page)
> +{
> +	return 0;
> +}
>  #endif
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
>  extern void mem_cgroup_uncharge_swap(swp_entry_t ent);
> @@ -431,6 +437,12 @@ static inline swp_entry_t get_swap_page(void)
>  #define has_swap_token(x) 0
>  #define disable_swap_token() do { } while(0)
> 
> +static inline int
> +memcg_free_unused_swapcache(struct page *page)
> +{
> +	return 0;
> +}
> +
>  #endif /* CONFIG_SWAP */
>  #endif /* __KERNEL__*/
>  #endif /* _LINUX_SWAP_H */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 01c2d8f..4f7e5b6 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1488,6 +1488,7 @@ void mem_cgroup_uncharge_cache_page(struct page *page)
>  	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
>  }
> 
> +#ifdef CONFIG_SWAP
>  /*
>   * called from __delete_from_swap_cache() and drop "page" account.
>   * memcg information is recorded to swap_cgroup of "ent"
> @@ -1507,6 +1508,19 @@ void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
>  		css_put(&memcg->css);
>  }
> 
> +int memcg_free_unused_swapcache(struct page *page)
> +{
> +	VM_BUG_ON(!PageLocked(page));
> +	VM_BUG_ON(!PageSwapCache(page));
> +
> +	if (mem_cgroup_disabled())
> +		return 0;
> +	if (!PageAnon(page) || page_mapped(page))
> +		return 0;

Do we need these checks? Isn't PageSwapCache() check and
page_swapcount() check enough in try_to_free_swap()?

> +	return try_to_free_swap(page);	/* checks page_swapcount */

try_to_free_swap() marks the page as dirty, do you know why?

> +}
> +#endif /* CONFIG_SWAP */
> +
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
>  /*
>   * called from swap_entry_free(). remove record in swap_cgroup and
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index eac9577..c1a7a6f 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -656,6 +656,14 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  				goto keep_locked;
>  			if (!add_to_swap(page))
>  				goto activate_locked;
> +			/*
> +			 * The owner process might have uncharged the page
> +			 * (by page_remove_rmap()) before it has been added
> +			 * to swap cache.
> +			 * Check it here to avoid making it stale.
> +			 */
> +			if (memcg_free_unused_swapcache(page))
> +				goto keep_locked;

Seems reasonable, but I think it is better to check for
scan_global_lru().. no?

>  			may_enter_fs = 1;
>  		}
> 
> 

-- 
	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] memcg: fix stale swap cache leak v5
  2009-05-04 16:38             ` Balbir Singh
@ 2009-05-07  6:59               ` Daisuke Nishimura
  0 siblings, 0 replies; 12+ messages in thread
From: Daisuke Nishimura @ 2009-05-07  6:59 UTC (permalink / raw)
  To: balbir
  Cc: KAMEZAWA Hiroyuki, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, akpm@linux-foundation.org, hugh@veritas.com,
	d-nishimura, Daisuke Nishimura

On Mon, 4 May 2009 22:08:06 +0530, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> * Daisuke Nishimura <d-nishimura@mtf.biglobe.ne.jp> [2009-05-01 13:33:17]:
> 
> >               processA                   |           processB
> >     -------------------------------------+-------------------------------------
> >       (page_remove_rmap())               |  (shrink_page_list())
> >          mem_cgroup_uncharge_page()      |
> >             ->uncharged because it's not |
> >               PageSwapCache yet.         |
> >               So, both mem/memsw.usage   |
> >               are decremented.           |
> >                                          |    add_to_swap() -> added to swap cache.
> > 
> >     If this page goes thorough without being freed for some reason, this page
> >     doesn't goes back to memcg's LRU because of !PageCgroupUsed.
> 
> For some reason could use some clarification.
> 
If swap_writepage() is called(via pageout()), try_to_free_swap() in swap_writepage()
will free this unused swap cache.
But there are many checks before swap_writepage() is called.

For example, if page_referenced() returned true, "referenced" will be set,
so we hit the check:

	If (PageDirty(page)) {
		if (sc->order <= PAGE_ALLOC_COSTLY_ORDER && referenced)
			goto keep_locked;

I think there are other cases.

> > 
> > These swap cache cannot be freed in memcg's LRU scanning, and swp_entry cannot
> > be freed properly as a result.
> > This patch adds a hook after add_to_swap() to check the page is mapped by a
> > process or not, and frees it if it has been unmapped already.
> > 
> > If a page has been on swap cache already when the owner process calls
> > page_remove_rmap() -> mem_cgroup_uncharge_page(), the page is not uncharged.
> > It goes back to memcg's LRU even if it goes through shrink_page_list()
> > without being freed, so this patch ignores these case.
> > 
> > Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> > ---
> >  include/linux/swap.h |   12 ++++++++++++
> >  mm/memcontrol.c      |   14 ++++++++++++++
> >  mm/vmscan.c          |    8 ++++++++
> >  3 files changed, 34 insertions(+), 0 deletions(-)
> > 
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index caf0767..8e75d7a 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -336,11 +336,17 @@ static inline void disable_swap_token(void)
> > 
> >  #ifdef CONFIG_CGROUP_MEM_RES_CTLR
> >  extern void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent);
> > +extern int memcg_free_unused_swapcache(struct page *page);
> >  #else
> >  static inline void
> >  mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
> >  {
> >  }
> > +static inline int
> > +memcg_free_unused_swapcache(struct page *page)
> > +{
> > +	return 0;
> > +}
> >  #endif
> >  #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> >  extern void mem_cgroup_uncharge_swap(swp_entry_t ent);
> > @@ -431,6 +437,12 @@ static inline swp_entry_t get_swap_page(void)
> >  #define has_swap_token(x) 0
> >  #define disable_swap_token() do { } while(0)
> > 
> > +static inline int
> > +memcg_free_unused_swapcache(struct page *page)
> > +{
> > +	return 0;
> > +}
> > +
> >  #endif /* CONFIG_SWAP */
> >  #endif /* __KERNEL__*/
> >  #endif /* _LINUX_SWAP_H */
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 01c2d8f..4f7e5b6 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1488,6 +1488,7 @@ void mem_cgroup_uncharge_cache_page(struct page *page)
> >  	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
> >  }
> > 
> > +#ifdef CONFIG_SWAP
> >  /*
> >   * called from __delete_from_swap_cache() and drop "page" account.
> >   * memcg information is recorded to swap_cgroup of "ent"
> > @@ -1507,6 +1508,19 @@ void mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent)
> >  		css_put(&memcg->css);
> >  }
> > 
> > +int memcg_free_unused_swapcache(struct page *page)
> > +{
> > +	VM_BUG_ON(!PageLocked(page));
> > +	VM_BUG_ON(!PageSwapCache(page));
> > +
> > +	if (mem_cgroup_disabled())
> > +		return 0;
> > +	if (!PageAnon(page) || page_mapped(page))
> > +		return 0;
> 
> Do we need these checks? Isn't PageSwapCache() check and
> page_swapcount() check enough in try_to_free_swap()?
> 
I think this is needed.
page_swapcount() will return true, because this page has just been added
to swap cache(try_to_unmap() has not been called yet).

> > +	return try_to_free_swap(page);	/* checks page_swapcount */
> 
> try_to_free_swap() marks the page as dirty, do you know why?
> 
I'm not sure.
Other callers of delete_from_swap_cache() also set the page dirty.

> > +}
> > +#endif /* CONFIG_SWAP */
> > +
> >  #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> >  /*
> >   * called from swap_entry_free(). remove record in swap_cgroup and
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index eac9577..c1a7a6f 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -656,6 +656,14 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> >  				goto keep_locked;
> >  			if (!add_to_swap(page))
> >  				goto activate_locked;
> > +			/*
> > +			 * The owner process might have uncharged the page
> > +			 * (by page_remove_rmap()) before it has been added
> > +			 * to swap cache.
> > +			 * Check it here to avoid making it stale.
> > +			 */
> > +			if (memcg_free_unused_swapcache(page))
> > +				goto keep_locked;
> 
> Seems reasonable, but I think it is better to check for
> scan_global_lru().. no?
> 
hmm, we cannot ensure that additional global LRU scanning will happen after this,
so I think it would be better to free this swap cache if possible
instead of making it stale.


Thanks,
Daisuke Nishimura.

> >  			may_enter_fs = 1;
> >  		}
> > 
> > 
> 
> -- 
> 	Balbir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2009-05-07  7:10 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-04-30  7:16 [PATCH] memcg: fix stale swap cache leak v5 KAMEZAWA Hiroyuki
2009-04-30  7:35 ` KAMEZAWA Hiroyuki
2009-04-30  9:04   ` KAMEZAWA Hiroyuki
2009-04-30  9:42     ` Balbir Singh
2009-04-30  9:47       ` KAMEZAWA Hiroyuki
2009-04-30 18:12         ` Balbir Singh
2009-05-01  4:33           ` Daisuke Nishimura
2009-05-01 18:32             ` Balbir Singh
2009-05-02  2:56               ` Daisuke Nishimura
2009-05-02  3:09                 ` Daisuke Nishimura
2009-05-04 16:38             ` Balbir Singh
2009-05-07  6:59               ` Daisuke Nishimura

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).