linux-mm.kvack.org archive mirror
* [PATCH 0/2] fix stale swap cache account leak in memcg v6
@ 2009-05-08  5:05 KAMEZAWA Hiroyuki
  2009-05-08  5:07 ` [PATCH 1/2] add mem cgroup is activated check KAMEZAWA Hiroyuki
  2009-05-08  5:09 ` [PATCH 2/2] memcg fix stale swap cache account leak v6 KAMEZAWA Hiroyuki
  0 siblings, 2 replies; 13+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-05-08  5:05 UTC (permalink / raw)
  To: linux-mm@kvack.org
  Cc: nishimura@mxp.nes.nec.co.jp, balbir@linux.vnet.ibm.com,
	hugh@veritas.com, akpm@linux-foundation.org,
	linux-kernel@vger.kernel.org

As Nishimura reported, there is a race in handling swap cache.

Typical cases are the following (from Nishimura's mail):


== Type-1 ==
  If some pages of processA have been swapped out, it calls free_swap_and_cache().
  And if, at the same time, processB is calling read_swap_cache_async() on
  a swap entry *that is used by processA*, a race like the one below can happen.

            processA                   |           processB
  -------------------------------------+-------------------------------------
    (free_swap_and_cache())            |  (read_swap_cache_async())
                                       |    swap_duplicate()
                                       |    __set_page_locked()
                                       |    add_to_swap_cache()
      swap_entry_free() == 0           |
      find_get_page() -> found         |
      try_lock_page() -> fail & return |
                                       |    lru_cache_add_anon()
                                       |      doesn't link this page to memcg's
                                       |      LRU, because of !PageCgroupUsed.

  This type of leak can be avoided by setting /proc/sys/vm/page-cluster to 0.


== Type-2 ==
    Assume processA is exiting and a pte points to a page (!PageSwapCache),
    and processB is trying to reclaim the page.

              processA                   |           processB
    -------------------------------------+-------------------------------------
      (page_remove_rmap())               |  (shrink_page_list())
         mem_cgroup_uncharge_page()      |
            ->uncharged because it's not |
              PageSwapCache yet.         |
              So, both mem/memsw.usage   |
              are decremented.           |
                                         |    add_to_swap() -> added to swap cache.

    If this page gets through here without being freed for some reason, it
    doesn't go back to memcg's LRU because of !PageCgroupUsed.


Considering Type-1, it's better to avoid swapin-readahead when memcg is used.
Swapin-readahead just reads swp_entries that are near the requested entry, so
pages that will never be used can end up in memory (on the global LRU). When
memcg is used, this is not good behavior anyway.
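
(For reference: the readahead window is derived from the page-cluster sysctl,
which is why setting /proc/sys/vm/page-cluster to 0 avoids the Type-1 leak.
A rough sketch of the relation -- simplified, not the exact valid_swaphandles()
code; the function name below exists only for this sketch:

	/*
	 * Sketch only: swapin readahead window size.  page_cluster is the
	 * /proc/sys/vm/page-cluster value.  With page_cluster == 0 there is
	 * no readahead, so only the faulting entry is read and no
	 * "never to be used" swap cache is created.
	 */
	static unsigned long swapin_readahead_window(void)
	{
		if (!page_cluster)
			return 0;			/* readahead disabled */
		return 1UL << page_cluster;		/* default 3 -> 8 pages */
	}
)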

Considering Type-2, the page should be freed from the swap cache right after
writeback. Freeing swapped-out pages as soon as possible is good behavior for
memcg anyway.
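
The core of that check, as implemented in patch 2/2 of this series
(mem_cgroup_check_stale_swap()), boils down to the following, quoted here for
readability:

	lock_page(page);
	if (PageSwapCache(page) && !PageWriteback(page) &&
	    !page_mapped(page))
		delete_from_swap_cache(page);
	unlock_page(page);
	page_cache_release(page);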

The patch set includes the following:
 [1/2] add a mem_cgroup_activated() function, which tells us whether memcg is _really_ used.
 [2/2] fix swap cache handling.


Test results are good in my testing. Nishimura, could you give it a try?

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/2] add mem cgroup is activated check
  2009-05-08  5:05 [PATCH 0/2] fix stale swap cache account leak in memcg v6 KAMEZAWA Hiroyuki
@ 2009-05-08  5:07 ` KAMEZAWA Hiroyuki
  2009-05-08 14:46   ` Balbir Singh
  2009-05-08  5:09 ` [PATCH 2/2] memcg fix stale swap cache account leak v6 KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 13+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-05-08  5:07 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, nishimura@mxp.nes.nec.co.jp,
	balbir@linux.vnet.ibm.com, hugh@veritas.com,
	akpm@linux-foundation.org, linux-kernel@vger.kernel.org

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

There is a function "mem_cgroup_disabled()" which returns true when
  - memcg is not configured, or
  - memcg is disabled by boot option.
This check is useful to decide whether we have to call memcg's hooks or not.

But even when memcg is configured (and not disabled), it's not really used
until it is mounted. This patch adds mem_cgroup_activated() to check whether
memcg has been mounted at least once.
(It will be used in a later patch.)

IIUC, only very careful users set the boot option to disable memcg, and most
people will not be aware that memcg is enabled by default. So, when memcg wants
to affect global VM behavior or add some overhead, there are cases where this
check is better than mem_cgroup_disabled().
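
A minimal usage sketch (hypothetical caller, not part of this patch;
do_strict_thing()/do_cheap_thing() are made-up names): hooks that must always
run when memcg is compiled in keep using mem_cgroup_disabled(), while optional
behavior changes can key off mem_cgroup_activated():

	if (mem_cgroup_disabled())
		return;			/* not compiled in, or boot-disabled */

	if (mem_cgroup_activated())
		do_strict_thing();	/* memcg has actually been mounted */
	else
		do_cheap_thing();	/* compiled in but never mounted */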


Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/memcontrol.h |    7 +++++++
 mm/memcontrol.c            |   12 ++++++++++++
 2 files changed, 19 insertions(+)

Index: mmotm-2.6.30-May05/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.30-May05.orig/include/linux/memcontrol.h
+++ mmotm-2.6.30-May05/include/linux/memcontrol.h
@@ -115,6 +115,8 @@ static inline bool mem_cgroup_disabled(v
 		return true;
 	return false;
 }
+/* Returns true if mem_cgroup is enabled and really used (mounted). */
+bool mem_cgroup_activated(void);
 
 extern bool mem_cgroup_oom_called(struct task_struct *task);
 void mem_cgroup_update_mapped_file_stat(struct page *page, int val);
@@ -229,6 +231,11 @@ static inline bool mem_cgroup_disabled(v
 	return true;
 }
 
+static inline bool mem_cgroup_activated(void)
+{
+	return false;
+}
+
 static inline bool mem_cgroup_oom_called(struct task_struct *task)
 {
 	return false;
Index: mmotm-2.6.30-May05/mm/memcontrol.c
===================================================================
--- mmotm-2.6.30-May05.orig/mm/memcontrol.c
+++ mmotm-2.6.30-May05/mm/memcontrol.c
@@ -2577,6 +2577,17 @@ static void mem_cgroup_move_task(struct 
 	mutex_unlock(&memcg_tasklist);
 }
 
+static bool __mem_cgroup_activated = false;
+bool mem_cgroup_activated(void)
+{
+	return __mem_cgroup_activated;
+}
+
+static void mem_cgroup_bind(struct cgroup_subsys *ss, struct cgroup *root)
+{
+	__mem_cgroup_activated = true;
+}
+
 struct cgroup_subsys mem_cgroup_subsys = {
 	.name = "memory",
 	.subsys_id = mem_cgroup_subsys_id,
@@ -2585,6 +2596,7 @@ struct cgroup_subsys mem_cgroup_subsys =
 	.destroy = mem_cgroup_destroy,
 	.populate = mem_cgroup_populate,
 	.attach = mem_cgroup_move_task,
+	.bind = mem_cgroup_bind,
 	.early_init = 0,
 	.use_id = 1,
 };


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 2/2] memcg fix stale swap cache account leak v6
  2009-05-08  5:05 [PATCH 0/2] fix stale swap cache account leak in memcg v6 KAMEZAWA Hiroyuki
  2009-05-08  5:07 ` [PATCH 1/2] add mem cgroup is activated check KAMEZAWA Hiroyuki
@ 2009-05-08  5:09 ` KAMEZAWA Hiroyuki
  2009-05-08 11:38   ` Ingo Molnar
                     ` (2 more replies)
  1 sibling, 3 replies; 13+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-05-08  5:09 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, nishimura@mxp.nes.nec.co.jp,
	balbir@linux.vnet.ibm.com, hugh@veritas.com,
	akpm@linux-foundation.org, linux-kernel@vger.kernel.org

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

In general, Linux's swp_entry handling is done by a combination of lazy
techniques and the global LRU. It works well, but when we use the mem+swap
controller, stricter control is appropriate. Otherwise, a swp_entry used by a
cgroup will never be freed until the global LRU runs. In a system where memcg
is well configured, the global LRU doesn't run frequently.

  Example A) Assume a swap cache which is not mapped.
              CPU0                            CPU1
	   zap_pte()....                  shrink_page_list()
	    free_swap_and_cache()           lock_page()
		page seems busy.

  Example B) Assume swapin-readahead.
	      CPU0			      CPU1
	   zap_pte()			  read_swap_cache_async()
					  swap_duplicate().
           swap_entry_free() = 1
	   find_get_page()=> NULL.
					  add_to_swap_cache().
					  issue swap I/O. 

There are many patterns of this kind of race (but they are not bugs in themselves).

free_swap_and_cache() is called to free a swp_entry, but it is a best-effort
function. If the swp_entry/page seems busy, the swp_entry is not freed.
This is not a problem because the global LRU will find the SwapCache at page reclaim.

If memcg is used, on the other hand, the global LRU may not run. Then, the
unused SwapCache above will not be freed.
(Unmapped SwapCache occupies a swp_entry but is never freed if it is not on memcg's LRU.)
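
(A rough sketch of the best-effort behavior described above, following the
Type-1 diagram in patch 0/2 -- simplified, not the exact 2.6.30 code:

	/*
	 * Sketch (fragment): 'p' is the swap_info_struct and 'entry' the
	 * swp_entry_t being freed.
	 */
	struct page *page = NULL;

	if (swap_entry_free(p, entry) == 0) {
		/* last swap reference gone; swap cache may still hold the page */
		page = find_get_page(&swapper_space, entry.val);
		if (page && !trylock_page(page)) {
			/* page busy: give up, leave it to (global) LRU reclaim */
			page_cache_release(page);
			page = NULL;
		}
	}
	if (page) {
		if (PageSwapCache(page) && !PageWriteback(page))
			delete_from_swap_cache(page);
		unlock_page(page);
		page_cache_release(page);
	}
)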

So, even if there are no tasks in a cgroup, swp_entry usage still remains.
In a bad case, an OOM by the mem+swap controller is triggered by this "leak" of
swp_entries, as Nishimura reported.

Considering this issue, swapin-readahead itself is not very good for memcg.
It reads swap cache that will not be used (and _unused_ swapcache will
not be accounted). Even if we account swap cache at add_to_swap_cache(),
we would need to charge pages to several _unrelated_ memcgs. This is bad.

This patch tries to fix the racy cases between free_swap_and_cache() and page status.

After this patch is applied, the following test works well.

  # echo 1-2M > ../memory.limit_in_bytes
  # run tasks under memcg.
  # kill all tasks and make memory.tasks empty
  # check memory.memsw.usage_in_bytes == memory.usage_in_bytes and
    there is no _used_ swp_entry.

What this patch does:
 - avoid swapin-readahead when memcg is activated.
 - try to free swap cache immediately after writeback is done.
 - handle the racy case of __remove_mapping() in vmscan.c.

TODO:
 - tmpfs should use real readahead rather than swapin readahead...

Changelog: v5 -> v6
 - works only when memcg is activated.
 - the check after I/O runs only after writeback.
 - avoid swapin-readahead when memcg is activated.
 - fixed page refcnt issue.
Changelog: v4->v5
 - completely new design.

Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/page_io.c    |  125 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/swap_state.c |   20 +++++++-
 mm/vmscan.c     |   31 ++++++++++++-
 3 files changed, 171 insertions(+), 5 deletions(-)

Index: mmotm-2.6.30-May05/mm/page_io.c
===================================================================
--- mmotm-2.6.30-May05.orig/mm/page_io.c
+++ mmotm-2.6.30-May05/mm/page_io.c
@@ -19,6 +19,130 @@
 #include <linux/writeback.h>
 #include <asm/pgtable.h>
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+/*
+ * When memory cgroup is used, a race between swap writeback and swap zapping
+ * can leak a swp_entry. So we have to check the status of the swap entry at
+ * the end of swap I/O. If memory cgroup is not used, the global LRU will find
+ * unused swap cache eventually, in a lazy way (which is too lazy for memcg).
+ */
+
+struct swapio_check {
+	spinlock_t	lock;
+	void		*swap_bio_list;
+	struct delayed_work work;
+} stale_swap_check;
+
+/*
+ * Check whether the swap entry is really used after writeback, and free the
+ * swp_entry or drop the swap cache if we can.
+ */
+static void mem_cgroup_check_stale_swap(struct work_struct *work)
+{
+	struct bio *bio;
+	struct page *page;
+	struct swapio_check *sc;
+	int nr = SWAP_CLUSTER_MAX;
+	swp_entry_t entry;
+
+	sc = &stale_swap_check;
+
+	while (nr--) {
+		cond_resched();
+		spin_lock_irq(&sc->lock);
+		bio = sc->swap_bio_list;
+		if (bio)
+			sc->swap_bio_list = bio->bi_next;
+		spin_unlock_irq(&sc->lock);
+		if (!bio)
+			break;
+		entry.val = (unsigned long)bio->bi_private;
+		bio_put(bio);
+		/* The page is not found if it's already reclaimed */
+		page = find_get_page(&swapper_space, entry.val);
+		if (!page)
+			continue;
+		if (!PageSwapCache(page) || page_mapped(page)) {
+			page_cache_release(page);
+			continue;
+		}
+		/*
+		 * "synchronous" freeing of swap cache after write back.
+		 */
+		lock_page(page);
+		if (PageSwapCache(page) && !PageWriteback(page) &&
+		    !page_mapped(page))
+			delete_from_swap_cache(page);
+		unlock_page(page);
+		page_cache_release(page);
+	}
+	if (sc->swap_bio_list)
+		schedule_delayed_work(&sc->work, HZ/10);
+}
+
+/*
+ * We can't call try_to_free_swap() directly here because of the caller's context.
+ */
+static void mem_cgroup_swapio_check_again(struct bio *bio, struct page *page)
+{
+	unsigned long flags;
+	struct swapio_check *sc;
+	swp_entry_t entry;
+	int ret;
+
+	/* If memcg is not mounted, global LRU will work fine. */
+	if (!mem_cgroup_activated())
+		return;
+	/* reuse bio if this bio is ready to be freed. */
+	ret = atomic_inc_return(&bio->bi_cnt);
+	/* Is there any reference other than ours? */
+	if (unlikely(ret > 2)) {
+		bio_put(bio);
+		return;
+	}
+	/*
+	 * We don't grab this page; remember the swp_entry instead of the page.
+	 * This way, aggressive memory-freeing routines can still free the page.
+	 */
+	entry.val = page_private(page);
+	bio->bi_private = (void *)entry.val;
+
+	sc = &stale_swap_check;
+	spin_lock_irqsave(&sc->lock, flags);
+	/* link bio to remember swp_entry */
+	bio->bi_next = sc->swap_bio_list;
+	sc->swap_bio_list = bio;
+	spin_unlock_irqrestore(&sc->lock, flags);
+	/*
+	 * Swap writeback tends to be continuous. Do the check in a batched manner.
+	 */
+	if (!delayed_work_pending(&sc->work))
+		schedule_delayed_work(&sc->work, HZ/10);
+}
+
+static int __init setup_stale_swap_check(void)
+{
+	struct swapio_check *sc;
+
+	sc = &stale_swap_check;
+	spin_lock_init(&sc->lock);
+	sc->swap_bio_list = NULL;
+	INIT_DELAYED_WORK(&sc->work, mem_cgroup_check_stale_swap);
+	return 0;
+}
+late_initcall(setup_stale_swap_check);
+
+
+#else /* CONFIG_CGROUP_MEM_RES_CTLR */
+
+static inline
+void mem_cgroup_swapio_check_again(struct bio *bio, struct page *page)
+{
+}
+#endif
+
+
+
 static struct bio *get_swap_bio(gfp_t gfp_flags, pgoff_t index,
 				struct page *page, bio_end_io_t end_io)
 {
@@ -66,6 +190,7 @@ static void end_swap_bio_write(struct bi
 				(unsigned long long)bio->bi_sector);
 		ClearPageReclaim(page);
 	}
+	mem_cgroup_swapio_check_again(bio, page);
 	end_page_writeback(page);
 	bio_put(bio);
 }
Index: mmotm-2.6.30-May05/mm/vmscan.c
===================================================================
--- mmotm-2.6.30-May05.orig/mm/vmscan.c
+++ mmotm-2.6.30-May05/mm/vmscan.c
@@ -586,6 +586,30 @@ void putback_lru_page(struct page *page)
 }
 #endif /* CONFIG_UNEVICTABLE_LRU */
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+/*
+ * Even if we don't call this, the global LRU will eventually find this
+ * SwapCache and free the swap entry in a later pass. But when memcg is used,
+ * there is a smaller chance that the global LRU's memory reclaim code runs.
+ * Freeing unused swap entries aggressively is good for avoiding a "leak" of
+ * swap entry accounting.
+ */
+static inline void unuse_swapcache_check_again(struct page *page)
+{
+	/*
+	 * The page is locked but has an extra reference from somewhere.
+	 * In the typical case, rotate_reclaimable_page()'s extra refcount
+	 * makes __remove_mapping() fail (see mm/swap.c).
+	 */
+	if (PageSwapCache(page))
+		try_to_free_swap(page);
+}
+#else
+static inline void unuse_swapcache_check_again(struct page *page)
+{
+}
+#endif
+
 
 /*
  * shrink_page_list() returns the number of reclaimed pages
@@ -758,9 +782,12 @@ static unsigned long shrink_page_list(st
 			}
 		}
 
-		if (!mapping || !__remove_mapping(mapping, page))
+		if (!mapping)
 			goto keep_locked;
-
+		if (!__remove_mapping(mapping, page)) {
+			unuse_swapcache_check_again(page);
+			goto keep_locked;
+		}
 		/*
 		 * At this point, we have no other references and there is
 		 * no way to pick any more up (removed from LRU, removed
Index: mmotm-2.6.30-May05/mm/swap_state.c
===================================================================
--- mmotm-2.6.30-May05.orig/mm/swap_state.c
+++ mmotm-2.6.30-May05/mm/swap_state.c
@@ -349,9 +349,9 @@ struct page *read_swap_cache_async(swp_e
 struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 			struct vm_area_struct *vma, unsigned long addr)
 {
-	int nr_pages;
+	int nr_pages = 1;
 	struct page *page;
-	unsigned long offset;
+	unsigned long offset = 0;
 	unsigned long end_offset;
 
 	/*
@@ -360,8 +360,22 @@ struct page *swapin_readahead(swp_entry_
 	 * No, it's very unlikely that swap layout would follow vma layout,
 	 * more likely that neighbouring swap pages came from the same node:
 	 * so use the same "addr" to choose the same node for each swap read.
+	 *
+	 * But when memcg is used, swapin readahead gives us some bad
+	 * effects. There are 2 big problems in general.
+	 * 1. Swapin readahead tends to use/read _not required_ memory,
+	 *    and _not required_ memory is only freed by the global LRU.
+	 * 2. We can't charge pages for swap-cache readahead because
+	 *    we should avoid accounting memory to a cgroup which the
+	 *    thread calling this function is not related to.
+	 * Also, swapin-readahead has a racy condition with
+	 * free_swap_and_cache(). This also annoys memcg.
+	 * So, if memcg is really used, we avoid readahead.
 	 */
-	nr_pages = valid_swaphandles(entry, &offset);
+
+	if (!mem_cgroup_activated())
+		nr_pages = valid_swaphandles(entry, &offset);
+
 	for (end_offset = offset + nr_pages; offset < end_offset; offset++) {
 		/* Ok, do the async read-ahead now */
 		page = read_swap_cache_async(swp_entry(swp_type(entry), offset),


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] memcg fix stale swap cache account leak v6
  2009-05-08  5:09 ` [PATCH 2/2] memcg fix stale swap cache account leak v6 KAMEZAWA Hiroyuki
@ 2009-05-08 11:38   ` Ingo Molnar
  2009-05-08 16:26     ` KAMEZAWA Hiroyuki
  2009-05-08 14:01   ` Daisuke Nishimura
  2009-05-08 16:56   ` Balbir Singh
  2 siblings, 1 reply; 13+ messages in thread
From: Ingo Molnar @ 2009-05-08 11:38 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, nishimura@mxp.nes.nec.co.jp,
	balbir@linux.vnet.ibm.com, hugh@veritas.com,
	akpm@linux-foundation.org, linux-kernel@vger.kernel.org

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> +struct swapio_check {
> +	spinlock_t	lock;
> +	void		*swap_bio_list;
> +	struct delayed_work work;
> +} stale_swap_check;

Small nit. It's nice that you lined up the first two fields, but it 
would be nice to line up the third one too:

struct swapio_check {
	spinlock_t		lock;
	void			*swap_bio_list;
	struct delayed_work	work;
} stale_swap_check;

> +	while (nr--) {
> +		cond_resched();
> +		spin_lock_irq(&sc->lock);
> +		bio = sc->swap_bio_list;

> @@ -66,6 +190,7 @@ static void end_swap_bio_write(struct bi
>  				(unsigned long long)bio->bi_sector);
>  		ClearPageReclaim(page);
>  	}
> +	mem_cgroup_swapio_check_again(bio, page);

Hm, this patch adds quite a bit of scanning overhead to 
end_swap_bio_write(), to work around artifacts of a global LRU not 
working well with a partitioned system's per-partition LRU needs.

Isn't the right solution to have a better LRU that is aware of this, 
instead of polling around in the hope of cleaning up stale entries?

	Ingo


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] memcg fix stale swap cache account leak v6
  2009-05-08  5:09 ` [PATCH 2/2] memcg fix stale swap cache account leak v6 KAMEZAWA Hiroyuki
  2009-05-08 11:38   ` Ingo Molnar
@ 2009-05-08 14:01   ` Daisuke Nishimura
  2009-05-08 16:29     ` KAMEZAWA Hiroyuki
  2009-05-08 16:56   ` Balbir Singh
  2 siblings, 1 reply; 13+ messages in thread
From: Daisuke Nishimura @ 2009-05-08 14:01 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, nishimura@mxp.nes.nec.co.jp,
	balbir@linux.vnet.ibm.com, hugh@veritas.com,
	akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
	d-nishimura

On Fri, 8 May 2009 14:09:10 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> In general, Linux's swp_entry handling is done by combination of lazy techniques
> and global LRU. It works well but when we use mem+swap controller, some more
> strict control is appropriate. Otherwise, swp_entry used by a cgroup will be
> never freed until global LRU works. In a system where memcg is well-configured,
> global LRU doesn't work frequently.
> 
>   Example A) Assume a swap cache which is not mapped.
>               CPU0                            CPU1
> 	   zap_pte()....                  shrink_page_list()
> 	    free_swap_and_cache()           lock_page()
> 		page seems busy.
> 
>   Example B) Assume swapin-readahead.
> 	      CPU0			      CPU1
> 	   zap_pte()			  read_swap_cache_async()
> 					  swap_duplicate().
>            swap_entry_free() = 1
> 	   find_get_page()=> NULL.
> 					  add_to_swap_cache().
> 					  issue swap I/O. 
> 
> There are many patterns of this kind of race (but no problems).
> 
> free_swap_and_cache() is called for freeing swp_entry. But it is a best-effort
> function. If the swp_entry/page seems busy, swp_entry is not freed.
> This is not a problem because global-LRU will find SwapCache at page reclaim.
> 
> If memcg is used, on the other hand, global LRU may not work. Then, above
> unused SwapCache will not be freed.
> (unmapped SwapCache occupy swp_entry but never be freed if not on memcg's LRU)
> 
> So, even if there are no tasks in a cgroup, swp_entry usage still remains.
> In bad case, OOM by mem+swap controller is triggered by this "leak" of
> swp_entry as Nishimura reported.
> 
> Considering this issue, swapin-readahead itself is not very good for memcg.
> It read swap cache which will not be used. (and _unused_ swapcache will
> not be accounted.) Even if we account swap cache at add_to_swap_cache(),
> we need to account page to several _unrelated_ memcg. This is bad.
> 
> This patch tries to fix racy case of free_swap_and_cache() and page status.
> 
> After this patch applied, following test works well.
> 
>   # echo 1-2M > ../memory.limit_in_bytes
>   # run tasks under memcg.
>   # kill all tasks and make memory.tasks empty
>   # check memory.memsw.usage_in_bytes == memory.usage_in_bytes and
>     there is no _used_ swp_entry.
> 
> What this patch does is
>  - avoid swapin-readahead when memcg is activated.
I agree that disabling readahead would be the easiest way to avoid type-1,
and this patch looks good to me in that regard.

But if we go this way to avoid type-1, I think my patch (*1) would be
enough to avoid type-2 and is simpler than this one.
I've confirmed in my test that no leak can be seen with my patch when
page-cluster is set to 0.

*1 http://marc.info/?l=linux-kernel&m=124115252607665&w=2

>  - try to free swapcache immediately after Writeback is done.
>  - Handle racy case of __remove_mapping() in vmscan.c
> 
And IIUC, this patch still cannot avoid type-2 in some cases
(for example, if a stale swap cache page hits "goto keep_locked" before pageout()
is called in shrink_page_list()).

In fact, I can still see some leak in my test.


Thanks,
Daisuke Nishimura.

> TODO:
>  - tmpfs should use real readahead rather than swapin readahead...
> 
> Changelog: v5 -> v6
>  - works only when memcg is activated.
>  - check after I/O works only after writeback.
>  - avoid swapin-readahead when memcg is activated.
>  - fixed page refcnt issue.
> Changelog: v4->v5
>  - completely new design.
> 
> Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  mm/page_io.c    |  125 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  mm/swap_state.c |   20 +++++++-
>  mm/vmscan.c     |   31 ++++++++++++-
>  3 files changed, 171 insertions(+), 5 deletions(-)
> 
> Index: mmotm-2.6.30-May05/mm/page_io.c
> ===================================================================
> --- mmotm-2.6.30-May05.orig/mm/page_io.c
> +++ mmotm-2.6.30-May05/mm/page_io.c
> @@ -19,6 +19,130 @@
>  #include <linux/writeback.h>
>  #include <asm/pgtable.h>
>  
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> +/*
> + * When memory cgroup is used, race between writeback-swap and zap-swap can
> + * be a leak of swp_entry. So we have to check the status of swap at the end of
> + * swap-io. If memory cgroup is not used, Global LRU will find unused swap
> + * finally, in lazy way. (this is too lazy for memcg.)
> + */
> +
> +struct swapio_check {
> +	spinlock_t	lock;
> +	void		*swap_bio_list;
> +	struct delayed_work work;
> +} stale_swap_check;
> +
> +/*
> + * Check swap is really used or not after Writeback. And free swp_entry or
> + * drop swap cache if we can.
> + */
> +static void mem_cgroup_check_stale_swap(struct work_struct *work)
> +{
> +	struct bio *bio;
> +	struct page *page;
> +	struct swapio_check *sc;
> +	int nr = SWAP_CLUSTER_MAX;
> +	swp_entry_t entry;
> +
> +	sc = &stale_swap_check;
> +
> +	while (nr--) {
> +		cond_resched();
> +		spin_lock_irq(&sc->lock);
> +		bio = sc->swap_bio_list;
> +		if (bio)
> +			sc->swap_bio_list = bio->bi_next;
> +		spin_unlock_irq(&sc->lock);
> +		if (!bio)
> +			break;
> +		entry.val = (unsigned long)bio->bi_private;
> +		bio_put(bio);
> +		/* The page is not found if it's already reclaimed */
> +		page = find_get_page(&swapper_space, entry.val);
> +		if (!page)
> +			continue;
> +		if (!PageSwapCache(page) || page_mapped(page)) {
> +			page_cache_release(page);
> +			continue;
> +		}
> +		/*
> +		 * "synchronous" freeing of swap cache after write back.
> +		 */
> +		lock_page(page);
> +		if (PageSwapCache(page) && !PageWriteback(page) &&
> +		    !page_mapped(page))
> +			delete_from_swap_cache(page);
> +		unlock_page(page);
> +		page_cache_release(page);
> +	}
> +	if (sc->swap_bio_list)
> +		schedule_delayed_work(&sc->work, HZ/10);
> +}
> +
> +/*
> + * We can't call try_to_free_swap directly here because of caller's context.
> + */
> +static void mem_cgroup_swapio_check_again(struct bio *bio, struct page *page)
> +{
> +	unsigned long flags;
> +	struct swapio_check *sc;
> +	swp_entry_t entry;
> +	int ret;
> +
> +	/* If memcg is not mounted, global LRU will work fine. */
> +	if (!mem_cgroup_activated())
> +		return;
> +	/* reuse bio if this bio is ready to be freed. */
> +	ret = atomic_inc_return(&bio->bi_cnt);
> +	/* Any other reference other than us ? */
> +	if (unlikely(ret > 2)) {
> +		bio_put(bio);
> +		return;
> +	}
> +	/*
> +	 * We don't grab this page....remember swp_entry instead of page. By
> +	 * this, aggressive memory freeing routine can free this page.
> +	 */
> +	entry.val = page_private(page);
> +	bio->bi_private = (void *)entry.val;
> +
> +	sc = &stale_swap_check;
> +	spin_lock_irqsave(&sc->lock, flags);
> +	/* link bio to remember swp_entry */
> +	bio->bi_next = sc->swap_bio_list;
> +	sc->swap_bio_list = bio;
> +	spin_unlock_irqrestore(&sc->lock, flags);
> +	/*
> +	 * Swap Writeback is tend to be continuous. Do check in batched manner.
> +	 */
> +	if (!delayed_work_pending(&sc->work))
> +		schedule_delayed_work(&sc->work, HZ/10);
> +}
> +
> +static int __init setup_stale_swap_check(void)
> +{
> +	struct swapio_check *sc;
> +
> +	sc = &stale_swap_check;
> +	spin_lock_init(&sc->lock);
> +	sc->swap_bio_list = NULL;
> +	INIT_DELAYED_WORK(&sc->work, mem_cgroup_check_stale_swap);
> +	return 0;
> +}
> +late_initcall(setup_stale_swap_check);
> +
> +
> +#else /* CONFIG_CGROUP_MEM_RES_CTRL */
> +
> +static inline
> +void mem_cgroup_swapio_check_again(struct bio *bio, struct page *page)
> +{
> +}
> +#endif
> +
> +
> +
>  static struct bio *get_swap_bio(gfp_t gfp_flags, pgoff_t index,
>  				struct page *page, bio_end_io_t end_io)
>  {
> @@ -66,6 +190,7 @@ static void end_swap_bio_write(struct bi
>  				(unsigned long long)bio->bi_sector);
>  		ClearPageReclaim(page);
>  	}
> +	mem_cgroup_swapio_check_again(bio, page);
>  	end_page_writeback(page);
>  	bio_put(bio);
>  }
> Index: mmotm-2.6.30-May05/mm/vmscan.c
> ===================================================================
> --- mmotm-2.6.30-May05.orig/mm/vmscan.c
> +++ mmotm-2.6.30-May05/mm/vmscan.c
> @@ -586,6 +586,30 @@ void putback_lru_page(struct page *page)
>  }
>  #endif /* CONFIG_UNEVICTABLE_LRU */
>  
> +#ifdef CONFIG_CGROUP_MEM_RES_CTRL
> +/*
> + * Even if we don't call this, global LRU will finally find this SwapCache and
> + * free swap entry in the next loop. But, when memcg is used, we may have
> + * smaller chance to call global LRU's memory reclaim code.
> + * Freeing unused swap entry in aggressive way is good for avoid "leak" of swap
> + * entry accounting.
> + */
> +static inline void unuse_swapcache_check_again(struct page *page)
> +{
> +	/*
> +	 * The page is locked, but have extra reference from somewhere.
> +	 * In typical case, rotate_reclaimable_page()'s extra refcnt makes
> +	 * __remove_mapping fail. (see mm/swap.c)
> +	 */
> +	if (PageSwapCache(page))
> +		try_to_free_swap(page);
> +}
> +#else
> +static inline void unuse_swapcache_check_again(struct page *page)
> +{
> +}
> +#endif
> +
>  
>  /*
>   * shrink_page_list() returns the number of reclaimed pages
> @@ -758,9 +782,12 @@ static unsigned long shrink_page_list(st
>  			}
>  		}
>  
> -		if (!mapping || !__remove_mapping(mapping, page))
> +		if (!mapping)
>  			goto keep_locked;
> -
> +		if (!__remove_mapping(mapping, page)) {
> +			unuse_swapcache_check_again(page);
> +			goto keep_locked;
> +		}
>  		/*
>  		 * At this point, we have no other references and there is
>  		 * no way to pick any more up (removed from LRU, removed
> Index: mmotm-2.6.30-May05/mm/swap_state.c
> ===================================================================
> --- mmotm-2.6.30-May05.orig/mm/swap_state.c
> +++ mmotm-2.6.30-May05/mm/swap_state.c
> @@ -349,9 +349,9 @@ struct page *read_swap_cache_async(swp_e
>  struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
>  			struct vm_area_struct *vma, unsigned long addr)
>  {
> -	int nr_pages;
> +	int nr_pages = 1;
>  	struct page *page;
> -	unsigned long offset;
> +	unsigned long offset = 0;
>  	unsigned long end_offset;
>  
>  	/*
> @@ -360,8 +360,22 @@ struct page *swapin_readahead(swp_entry_
>  	 * No, it's very unlikely that swap layout would follow vma layout,
>  	 * more likely that neighbouring swap pages came from the same node:
>  	 * so use the same "addr" to choose the same node for each swap read.
> +	 *
> +	 * But, when memcg is used, swapin readahead give us some bad
> +	 * effects. There are 2 big problems in general.
> +	 * 1. Swapin readahead tend to use/read _not required_ memory.
> +	 *    And _not required_ memory is only freed by global LRU.
> +	 * 2. We can't charge pages for swap-cache readahead because
> +	 *    we should avoid account memory in a cgroup which a
> +	 *    thread call this function is not related to.
> +	 * And swapin-readahead have racy condition with
> +	 * free_swap_and_cache(). This also annoys memcg.
> +	 * Then, if memcg is really used, we avoid readahead.
>  	 */
> -	nr_pages = valid_swaphandles(entry, &offset);
> +
> +	if (!mem_cgroup_activated())
> +		nr_pages = valid_swaphandles(entry, &offset);
> +
>  	for (end_offset = offset + nr_pages; offset < end_offset; offset++) {
>  		/* Ok, do the async read-ahead now */
>  		page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] add mem cgroup is activated check
  2009-05-08  5:07 ` [PATCH 1/2] add mem cgroup is activated check KAMEZAWA Hiroyuki
@ 2009-05-08 14:46   ` Balbir Singh
  0 siblings, 0 replies; 13+ messages in thread
From: Balbir Singh @ 2009-05-08 14:46 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, nishimura@mxp.nes.nec.co.jp, hugh@veritas.com,
	akpm@linux-foundation.org, linux-kernel@vger.kernel.org

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-05-08 14:07:13]:

> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> There is a function "mem_cgroup_disabled()" which returns
>   - memcg is configured ?
>   - disabled by boot option ?
> This is check is useful to confirm whether we have to call memcg's hook or not.
> 
> But, even when memcg is configured (and not disabled), it's not really used
> until mounted. This patch adds mem_cgroup_activated() to check memcg is
> mounted or not at least once.
> (Will be used in later patch.)
> 
> IIUC, only very careful users set boot option of memcg to be disabled and
> most of people will not be aware of that memcg is enabled at default.
> So, if memcg wants to affect to global VM behavior or to add some overheads,
> there are cases that this check is better than mem_cgroup_disabled().
>

Agreed
 
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
 
-- 
	Balbir


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] memcg fix stale swap cache account leak v6
  2009-05-08 11:38   ` Ingo Molnar
@ 2009-05-08 16:26     ` KAMEZAWA Hiroyuki
  2009-05-11 12:27       ` Ingo Molnar
  0 siblings, 1 reply; 13+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-05-08 16:26 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org,
	nishimura@mxp.nes.nec.co.jp, balbir@linux.vnet.ibm.com,
	hugh@veritas.com, akpm@linux-foundation.org,
	linux-kernel@vger.kernel.org

Thank you for the review.

Ingo Molnar wrote:
> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
>> +struct swapio_check {
>> +	spinlock_t	lock;
>> +	void		*swap_bio_list;
>> +	struct delayed_work work;
>> +} stale_swap_check;
>
> Small nit. It's nice that you lined up the first two fields, but it
> would be nice to line up the third one too:
>
> struct swapio_check {
> 	spinlock_t		lock;
> 	void			*swap_bio_list;
> 	struct delayed_work	work;
> } stale_swap_check;
>
ok.

>> +	while (nr--) {
>> +		cond_resched();
>> +		spin_lock_irq(&sc->lock);
>> +		bio = sc->swap_bio_list;
>
>> @@ -66,6 +190,7 @@ static void end_swap_bio_write(struct bi
>>  				(unsigned long long)bio->bi_sector);
>>  		ClearPageReclaim(page);
>>  	}
>> +	mem_cgroup_swapio_check_again(bio, page);
>
> Hm, this patch adds quite a bit of scanning overhead to
> end_swap_bio_write(), to work around artifacts of a global LRU not
> working well with a partitioned system's per-partition LRU needs.
>
I'm not sure what the "scanning" overhead is. But OK, this is not very lightweight.

> Isnt the right solution to have a better LRU that is aware of this,
> instead of polling around in the hope of cleaning up stale entries?
>
I tried to modify the LRU last month, but I found it difficult.

Hmm, maybe this patch's method is overkill. I have another option
(used in v1-v2) for fixing writeback. I'll try the following again.

== add the following code to vmscan.c ==

   shrink_list()
      add_to_swap()
      memcg_confirm_swapcache_valid()
   -> We have a race with zap_pte() here.
      After add_to_swap(), check memcg's accounting information for this page.
      If memcg has no charge on this page, the page may be unused and not
      worth doing I/O for. Check the usage again and try to free it.
==
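
(Roughly, in C -- just a sketch of that idea, not tested code;
mem_cgroup_page_is_charged() is a hypothetical helper standing in for the
memcg_confirm_swapcache_valid() named above:

	/* in shrink_page_list(), right after add_to_swap() succeeds */
	if (PageSwapCache(page) && !mem_cgroup_page_is_charged(page)) {
		/*
		 * Nobody charges this page any more (zap_pte() raced with
		 * us), so writing it out would be wasted I/O: try to drop it.
		 */
		try_to_free_swap(page);
		goto keep_locked;
	}
)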

The difficult part is how to fix the race in swapin-readahead; we have
several options for fixing writeback, I think.

I'll retry.

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] memcg fix stale swap cache account leak v6
  2009-05-08 14:01   ` Daisuke Nishimura
@ 2009-05-08 16:29     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 13+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-05-08 16:29 UTC (permalink / raw)
  To: nishimura
  Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org, balbir@linux.vnet.ibm.com,
	hugh@veritas.com, akpm@linux-foundation.org,
	linux-kernel@vger.kernel.org, d-nishimura

Daisuke Nishimura wrote:
> On Fri, 8 May 2009 14:09:10 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

>>  - avoid swapin-readahead when memcg is activated.
> I agree that disabling readahead would be the easiest way to avoid type-1.
> And this patch looks good to me about it.
>
Thanks.

> But if we go in this way to avoid type-1, I think my patch(*1) would be
> enough to avoid type-2 and is simpler than this one.
> I've confirmed in my test that no leak can be seen with my patch and
> with setting page-cluster to 0.
>
> *1 http://marc.info/?l=linux-kernel&m=124115252607665&w=2
>
OK, I'll merge yours into my set.
The whole patch set will be:
  [1/3]  memcg_activated()
  [2/3]  avoid readahead
  [3/3]  your fix.

I'll post in the next week.

Thanks,
-Kame




^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] memcg fix stale swap cache account leak v6
  2009-05-08  5:09 ` [PATCH 2/2] memcg fix stale swap cache account leak v6 KAMEZAWA Hiroyuki
  2009-05-08 11:38   ` Ingo Molnar
  2009-05-08 14:01   ` Daisuke Nishimura
@ 2009-05-08 16:56   ` Balbir Singh
  2009-05-11  0:22     ` Daisuke Nishimura
  2 siblings, 1 reply; 13+ messages in thread
From: Balbir Singh @ 2009-05-08 16:56 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, nishimura@mxp.nes.nec.co.jp, hugh@veritas.com,
	akpm@linux-foundation.org, linux-kernel@vger.kernel.org

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-05-08 14:09:10]:

> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> In general, Linux's swp_entry handling is done by combination of lazy techniques
> and global LRU. It works well but when we use mem+swap controller, some more
> strict control is appropriate. Otherwise, swp_entry used by a cgroup will be
> never freed until global LRU works. In a system where memcg is well-configured,
> global LRU doesn't work frequently.
> 
>   Example A) Assume a swap cache which is not mapped.
>               CPU0                            CPU1
> 	   zap_pte()....                  shrink_page_list()
> 	    free_swap_and_cache()           lock_page()
> 		page seems busy.
> 
>   Example B) Assume swapin-readahead.
> 	      CPU0			      CPU1
> 	   zap_pte()			  read_swap_cache_async()
> 					  swap_duplicate().
>            swap_entry_free() = 1
> 	   find_get_page()=> NULL.
> 					  add_to_swap_cache().
> 					  issue swap I/O. 
> 
> There are many patterns of this kind of race (but no problems).
> 
> free_swap_and_cache() is called for freeing swp_entry. But it is a best-effort
> function. If the swp_entry/page seems busy, swp_entry is not freed.
> This is not a problem because global-LRU will find SwapCache at page reclaim.
> 
> If memcg is used, on the other hand, global LRU may not work. Then, above
> unused SwapCache will not be freed.
> (unmapped SwapCache occupy swp_entry but never be freed if not on memcg's LRU)
> 
> So, even if there are no tasks in a cgroup, swp_entry usage still remains.
> In bad case, OOM by mem+swap controller is triggered by this "leak" of
> swp_entry as Nishimura reported.
> 
> Considering this issue, swapin-readahead itself is not very good for memcg.
> It read swap cache which will not be used. (and _unused_ swapcache will
> not be accounted.) Even if we account swap cache at add_to_swap_cache(),
> we need to account page to several _unrelated_ memcg. This is bad.
> 
> This patch tries to fix racy case of free_swap_and_cache() and page status.
> 
> After this patch applied, following test works well.
> 
>   # echo 1-2M > ../memory.limit_in_bytes
>   # run tasks under memcg.
>   # kill all tasks and make memory.tasks empty
>   # check memory.memsw.usage_in_bytes == memory.usage_in_bytes and
>     there is no _used_ swp_entry.
> 
> What this patch does is
>  - avoid swapin-readahead when memcg is activated.
>  - try to free swapcache immediately after Writeback is done.
>  - Handle racy case of __remove_mapping() in vmscan.c
> 
> TODO:
>  - tmpfs should use real readahead rather than swapin readahead...
> 
> Changelog: v5 -> v6
>  - works only when memcg is activated.
>  - check after I/O works only after writeback.
>  - avoid swapin-readahead when memcg is activated.
>  - fixed page refcnt issue.
> Changelog: v4->v5
>  - completely new design.
> 
> Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

I know we discussed readahead changes in the past.

1. The memcg_activated() check should be memcg_swap_activated(), no?
   In type 1, the problem can be solved by unaccounting the pages
   in swap_entry_free().
   Type 2 is not a problem, since the accounting is already correct.
   Hence my assertion that this problem occurs only when swapaccount
   is enabled.
2. I don't mind adding space overhead to swap_cgroup, if this problem
   can be fought that way. The approaches so far have made my head go
   round.
3. Disabling readahead is a big decision and will need loads of
   review/data before we can decide to go this route.


-- 
	Balbir


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] memcg fix stale swap cache account leak v6
  2009-05-08 16:56   ` Balbir Singh
@ 2009-05-11  0:22     ` Daisuke Nishimura
  2009-05-11  6:50       ` Balbir Singh
  0 siblings, 1 reply; 13+ messages in thread
From: Daisuke Nishimura @ 2009-05-11  0:22 UTC (permalink / raw)
  To: balbir
  Cc: nishimura, KAMEZAWA Hiroyuki, linux-mm@kvack.org,
	hugh@veritas.com, akpm@linux-foundation.org,
	linux-kernel@vger.kernel.org

On Fri, 8 May 2009 22:26:36 +0530, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-05-08 14:09:10]:
> 
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> > In general, Linux's swp_entry handling is done by combination of lazy techniques
> > and global LRU. It works well but when we use mem+swap controller, some more
> > strict control is appropriate. Otherwise, swp_entry used by a cgroup will be
> > never freed until global LRU works. In a system where memcg is well-configured,
> > global LRU doesn't work frequently.
> > 
> >   Example A) Assume a swap cache which is not mapped.
> >               CPU0                            CPU1
> > 	   zap_pte()....                  shrink_page_list()
> > 	    free_swap_and_cache()           lock_page()
> > 		page seems busy.
> > 
> >   Example B) Assume swapin-readahead.
> > 	      CPU0			      CPU1
> > 	   zap_pte()			  read_swap_cache_async()
> > 					  swap_duplicate().
> >            swap_entry_free() = 1
> > 	   find_get_page()=> NULL.
> > 					  add_to_swap_cache().
> > 					  issue swap I/O. 
> > 
> > There are many patterns of this kind of race (but no problems).
> > 
> > free_swap_and_cache() is called for freeing swp_entry. But it is a best-effort
> > function. If the swp_entry/page seems busy, swp_entry is not freed.
> > This is not a problem because global-LRU will find SwapCache at page reclaim.
> > 
> > If memcg is used, on the other hand, global LRU may not work. Then, above
> > unused SwapCache will not be freed.
> > (unmapped SwapCache occupy swp_entry but never be freed if not on memcg's LRU)
> > 
> > So, even if there are no tasks in a cgroup, swp_entry usage still remains.
> > In bad case, OOM by mem+swap controller is triggered by this "leak" of
> > swp_entry as Nishimura reported.
> > 
> > Considering this issue, swapin-readahead itself is not very good for memcg.
> > It read swap cache which will not be used. (and _unused_ swapcache will
> > not be accounted.) Even if we account swap cache at add_to_swap_cache(),
> > we need to account page to several _unrelated_ memcg. This is bad.
> > 
> > This patch tries to fix racy case of free_swap_and_cache() and page status.
> > 
> > After this patch applied, following test works well.
> > 
> >   # echo 1-2M > ../memory.limit_in_bytes
> >   # run tasks under memcg.
> >   # kill all tasks and make memory.tasks empty
> >   # check memory.memsw.usage_in_bytes == memory.usage_in_bytes and
> >     there is no _used_ swp_entry.
> > 
> > What this patch does is
> >  - avoid swapin-readahead when memcg is activated.
> >  - try to free swapcache immediately after Writeback is done.
> >  - Handle racy case of __remove_mapping() in vmscan.c
> > 
> > TODO:
> >  - tmpfs should use real readahead rather than swapin readahead...
> > 
> > Changelog: v5 -> v6
> >  - works only when memcg is activated.
> >  - check after I/O works only after writeback.
> >  - avoid swapin-readahead when memcg is activated.
> >  - fixed page refcnt issue.
> > Changelog: v4->v5
> >  - completely new design.
> > 
> > Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> I know we discussed readahead changes this in the past
> 
> 1. the memcg_activated() check should be memcg_swap_activated(), no?
>    In type 1, the problem can be solved by unaccounting the pages
>    in swap_entry_free
>    Type 2 is not a problem, since the accounting is already correct
>    Hence my assertion that this problem occurs only when swapaccount
>    is enabled.
No.
Both type-1 and type-2 have the problem that the swp_entry is not freed correctly.
This problem has nothing to do with whether the mem+swap controller is enabled or not.

Thanks,
Daisuke Nishimura.

> 2. I don't mind adding space overhead to swap_cgroup, if this problem
>    can be fought that way. The approaches so far have made my head go
>    round.
> 3. Disabling readahead is a big decision and will need loads of
>    review/data before we can decide to go this route.
> 
> 
> -- 
> 	Balbir


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] memcg fix stale swap cache account leak v6
  2009-05-11  0:22     ` Daisuke Nishimura
@ 2009-05-11  6:50       ` Balbir Singh
  0 siblings, 0 replies; 13+ messages in thread
From: Balbir Singh @ 2009-05-11  6:50 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: KAMEZAWA Hiroyuki, linux-mm@kvack.org, hugh@veritas.com,
	akpm@linux-foundation.org, linux-kernel@vger.kernel.org

* nishimura@mxp.nes.nec.co.jp <nishimura@mxp.nes.nec.co.jp> [2009-05-11 09:22:41]:

> On Fri, 8 May 2009 22:26:36 +0530, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-05-08 14:09:10]:
> > 
> > > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > 
> > > In general, Linux's swp_entry handling is done by combination of lazy techniques
> > > and global LRU. It works well but when we use mem+swap controller, some more
> > > strict control is appropriate. Otherwise, swp_entry used by a cgroup will be
> > > never freed until global LRU works. In a system where memcg is well-configured,
> > > global LRU doesn't work frequently.
> > > 
> > >   Example A) Assume a swap cache which is not mapped.
> > >               CPU0                            CPU1
> > > 	   zap_pte()....                  shrink_page_list()
> > > 	    free_swap_and_cache()           lock_page()
> > > 		page seems busy.
> > > 
> > >   Example B) Assume swapin-readahead.
> > > 	      CPU0			      CPU1
> > > 	   zap_pte()			  read_swap_cache_async()
> > > 					  swap_duplicate().
> > >            swap_entry_free() = 1
> > > 	   find_get_page()=> NULL.
> > > 					  add_to_swap_cache().
> > > 					  issue swap I/O. 
> > > 
> > > There are many patterns of this kind of race (but no problems).
> > > 
> > > free_swap_and_cache() is called for freeing swp_entry. But it is a best-effort
> > > function. If the swp_entry/page seems busy, swp_entry is not freed.
> > > This is not a problem because global-LRU will find SwapCache at page reclaim.
> > > 
> > > If memcg is used, on the other hand, global LRU may not work. Then, above
> > > unused SwapCache will not be freed.
> > > (unmapped SwapCache occupy swp_entry but never be freed if not on memcg's LRU)
> > > 
> > > So, even if there are no tasks in a cgroup, swp_entry usage still remains.
> > > In bad case, OOM by mem+swap controller is triggered by this "leak" of
> > > swp_entry as Nishimura reported.
> > > 
> > > Considering this issue, swapin-readahead itself is not very good for memcg.
> > > It read swap cache which will not be used. (and _unused_ swapcache will
> > > not be accounted.) Even if we account swap cache at add_to_swap_cache(),
> > > we need to account page to several _unrelated_ memcg. This is bad.
> > > 
> > > This patch tries to fix racy case of free_swap_and_cache() and page status.
> > > 
> > > After this patch applied, following test works well.
> > > 
> > >   # echo 1-2M > ../memory.limit_in_bytes
> > >   # run tasks under memcg.
> > >   # kill all tasks and make memory.tasks empty
> > >   # check memory.memsw.usage_in_bytes == memory.usage_in_bytes and
> > >     there is no _used_ swp_entry.
> > > 
> > > What this patch does is
> > >  - avoid swapin-readahead when memcg is activated.
> > >  - try to free swapcache immediately after Writeback is done.
> > >  - Handle racy case of __remove_mapping() in vmscan.c
> > > 
> > > TODO:
> > >  - tmpfs should use real readahead rather than swapin readahead...
> > > 
> > > Changelog: v5 -> v6
> > >  - works only when memcg is activated.
> > >  - check after I/O works only after writeback.
> > >  - avoid swapin-readahead when memcg is activated.
> > >  - fixed page refcnt issue.
> > > Changelog: v4->v5
> > >  - completely new design.
> > > 
> > > Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> > > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> > I know we discussed readahead changes this in the past
> > 
> > 1. the memcg_activated() check should be memcg_swap_activated(), no?
> >    In type 1, the problem can be solved by unaccounting the pages
> >    in swap_entry_free
> >    Type 2 is not a problem, since the accounting is already correct
> >    Hence my assertion that this problem occurs only when swapaccount
> >    is enabled.
> No.
> Both type-1 and type-2 have the problem that swp_entry is not freed correctly.
> This problem has nothing to do with whether mem+swap controller is enabled or not.
>

Thanks, I was under the impression that we were leaking entries from
swap_cgroup_ctrl. So here is what we have so far:

1. We leak swap_entries during the type 1 and type 2 races. Basically we
   start occupying more swap_info space than required.
2. The pages are not on the memcg LRU, but on the global LRU, hence memcg
   reclaim does not catch them.
3. They are accounted to the memcg in the type 1 leak.


 
> 
> > 2. I don't mind adding space overhead to swap_cgroup, if this problem
> >    can be fought that way. The approaches so far have made my head go
> >    round.
> > 3. Disabling readahead is a big decision and will need loads of
> >    review/data before we can decide to go this route.
> > 
> > 
> > -- 
> > 	Balbir

-- 
	Balbir


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] memcg fix stale swap cache account leak v6
  2009-05-08 16:26     ` KAMEZAWA Hiroyuki
@ 2009-05-11 12:27       ` Ingo Molnar
  2009-05-12  0:44         ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 13+ messages in thread
From: Ingo Molnar @ 2009-05-11 12:27 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm@kvack.org, nishimura@mxp.nes.nec.co.jp,
	balbir@linux.vnet.ibm.com, hugh@veritas.com,
	akpm@linux-foundation.org, linux-kernel@vger.kernel.org


* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> > Isnt the right solution to have a better LRU that is aware of 
> > this, instead of polling around in the hope of cleaning up stale 
> > entries?
>
> I tried to modify LRU in the last month but I found it's 
> difficult.

But your patch makes such a correct solution even more difficult to 
achieve, so in that sense it might be a step backwards, right?

	Ingo


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] memcg fix stale swap cache account leak v6
  2009-05-11 12:27       ` Ingo Molnar
@ 2009-05-12  0:44         ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 13+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-05-12  0:44 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-mm@kvack.org, nishimura@mxp.nes.nec.co.jp,
	balbir@linux.vnet.ibm.com, hugh@veritas.com,
	akpm@linux-foundation.org, linux-kernel@vger.kernel.org

On Mon, 11 May 2009 14:27:11 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> 
> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > > Isnt the right solution to have a better LRU that is aware of 
> > > this, instead of polling around in the hope of cleaning up stale 
> > > entries?
> >
> > I tried to modify LRU in the last month but I found it's 
> > difficult.
> 
> But your patch makes such a correct solution even more difficult to 
> achieve, so in that sense it might be a step backwards, right?
> 
Hmm... maybe, maybe not. This leak comes from the laziness of the LRU and the
trylock() swap handling used to avoid contention. So even if we add a new LRU,
we need some trick to do the synchronous work.

Anyway, I'll use Nishimura's patch, which is simple.
This is a bug fix and I don't want a patch which needs half a year of testing ;)

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2009-05-12  0:45 UTC | newest]

Thread overview: 13+ messages (links below jump to the message on this page):
2009-05-08  5:05 [PATCH 0/2] fix stale swap cache account leak in memcg v6 KAMEZAWA Hiroyuki
2009-05-08  5:07 ` [PATCH 1/2] add mem cgroup is activated check KAMEZAWA Hiroyuki
2009-05-08 14:46   ` Balbir Singh
2009-05-08  5:09 ` [PATCH 2/2] memcg fix stale swap cache account leak v6 KAMEZAWA Hiroyuki
2009-05-08 11:38   ` Ingo Molnar
2009-05-08 16:26     ` KAMEZAWA Hiroyuki
2009-05-11 12:27       ` Ingo Molnar
2009-05-12  0:44         ` KAMEZAWA Hiroyuki
2009-05-08 14:01   ` Daisuke Nishimura
2009-05-08 16:29     ` KAMEZAWA Hiroyuki
2009-05-08 16:56   ` Balbir Singh
2009-05-11  0:22     ` Daisuke Nishimura
2009-05-11  6:50       ` Balbir Singh
