From: Ning Zhang <ningzhang@linux.alibaba.com>
To: linux-mm@kvack.org
Cc: Andrew Morton <akpm@linux-foundation.org>,
Johannes Weiner <hannes@cmpxchg.org>,
Michal Hocko <mhocko@kernel.org>,
Vladimir Davydov <vdavydov.dev@gmail.com>,
Yu Zhao <yuzhao@google.com>
Subject: [RFC 1/6] mm, thp: introduce thp zero subpages reclaim
Date: Thu, 28 Oct 2021 19:56:50 +0800 [thread overview]
Message-ID: <1635422215-99394-2-git-send-email-ningzhang@linux.alibaba.com> (raw)
In-Reply-To: <1635422215-99394-1-git-send-email-ningzhang@linux.alibaba.com>
Transparent huge pages could reduce the number of tlb misses, which
could improve performance for the applications. But one concern is
that thp may lead to memory bloat which may cause OOM. The reason
is that a huge page may contain some zero subpages which the user
never really accessed.
This patch introduces a mechanism to reclaim these zero subpages, it
works when memory pressure is high. We'll estimate whether a huge
page contains enough zero subpages at first, then try to split it and
reclaim the zero subpages.
Through testing with some apps, we found that the zero subpages tend
to be concentrated in a few huge pages. Following is a
text_classification_rnn case for tensorflow:
zero_subpages huge_pages waste
[ 0, 1) 186 0.00%
[ 1, 2) 23 0.01%
[ 2, 4) 36 0.02%
[ 4, 8) 67 0.08%
[ 8, 16) 80 0.23%
[ 16, 32) 109 0.61%
[ 32, 64) 44 0.49%
[ 64, 128) 12 0.30%
[ 128, 256) 28 1.54%
[ 256, 513) 159 18.03%
In the case, a lot of zero subpages are concentrated into 187(28+159)
huge pages, which leads to 19.57% waste of the total rss. It means we
can reclaim 19.57% memory by splitting the 187 huge pages and reclaiming
the zero subpages.
We store the huge pages to a new list in order to find them quickly. And
add an interface 'thp_reclaim' to control on or off in memory cgroup:
echo 1 > memory.thp_reclaim to enable.
echo 0 > memory.thp_reclaim to disable.
Signed-off-by: Ning Zhang <ningzhang@linux.alibaba.com>
Signed-off-by: Gang Deng <gavin.dg@linux.alibaba.com>
---
include/linux/huge_mm.h | 9 ++
include/linux/memcontrol.h | 15 +++
include/linux/mm.h | 1 +
include/linux/mm_types.h | 6 +
include/linux/mmzone.h | 6 +
mm/huge_memory.c | 296 ++++++++++++++++++++++++++++++++++++++++++++-
mm/memcontrol.c | 107 ++++++++++++++++
mm/vmscan.c | 59 ++++++++-
8 files changed, 496 insertions(+), 3 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index f123e15..e1b3bf9 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -185,6 +185,15 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
void free_transhuge_page(struct page *page);
bool is_transparent_hugepage(struct page *page);
+#ifdef CONFIG_MEMCG
+int zsr_get_hpage(struct hpage_reclaim *hr_queue, struct page **reclaim_page);
+unsigned long zsr_reclaim_hpage(struct lruvec *lruvec, struct page *page);
+static inline struct list_head *hpage_reclaim_list(struct page *page)
+{
+ return &page[3].hpage_reclaim_list;
+}
+#endif
+
bool can_split_huge_page(struct page *page, int *pextra_pins);
int split_huge_page_to_list(struct page *page, struct list_head *list);
static inline int split_huge_page(struct page *page)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 3096c9a..502a6ab 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -150,6 +150,9 @@ struct mem_cgroup_per_node {
unsigned long usage_in_excess;/* Set to the value by which */
/* the soft limit is exceeded*/
bool on_tree;
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ struct hpage_reclaim hpage_reclaim_queue;
+#endif
struct mem_cgroup *memcg; /* Back pointer, we cannot */
/* use container_of */
};
@@ -228,6 +231,13 @@ struct obj_cgroup {
};
};
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+enum thp_reclaim_state {
+ THP_RECLAIM_DISABLE,
+ THP_RECLAIM_ENABLE,
+ THP_RECLAIM_MEMCG, /* For global configure*/
+};
+#endif
/*
* The memory controller data structure. The memory controller controls both
* page cache and RSS per cgroup. We would eventually like to provide
@@ -345,6 +355,7 @@ struct mem_cgroup {
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
struct deferred_split deferred_split_queue;
+ int thp_reclaim;
#endif
struct mem_cgroup_per_node *nodeinfo[];
@@ -1110,6 +1121,10 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
gfp_t gfp_mask,
unsigned long *total_scanned);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+void del_hpage_from_queue(struct page *page);
+#endif
+
#else /* CONFIG_MEMCG */
#define MEM_CGROUP_ID_SHIFT 0
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 73a52ab..39676f9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3061,6 +3061,7 @@ int drop_caches_sysctl_handler(struct ctl_table *, int, void *, size_t *,
void drop_slab(void);
void drop_slab_node(int nid);
+unsigned int move_pages_to_lru(struct lruvec *lruvec, struct list_head *list);
#ifndef CONFIG_MMU
#define randomize_va_space 0
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 7f8ee09..9433987 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -159,6 +159,12 @@ struct page {
/* For both global and memcg */
struct list_head deferred_list;
};
+ struct { /* Third tail page of compound page */
+ unsigned long _compound_pad_2;
+ unsigned long _compound_pad_3;
+ /* For zero subpages reclaim */
+ struct list_head hpage_reclaim_list;
+ };
struct { /* Page table pages */
unsigned long _pt_pad_1; /* compound_head */
pgtable_t pmd_huge_pte; /* protected by page->ptl */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6a1d79d..222cd4f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -787,6 +787,12 @@ struct deferred_split {
struct list_head split_queue;
unsigned long split_queue_len;
};
+
+struct hpage_reclaim {
+ spinlock_t reclaim_queue_lock;
+ struct list_head reclaim_queue;
+ unsigned long reclaim_queue_len;
+};
#endif
/*
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5e9ef0f..21e3c01 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -526,6 +526,9 @@ void prep_transhuge_page(struct page *page)
*/
INIT_LIST_HEAD(page_deferred_list(page));
+#ifdef CONFIG_MEMCG
+ INIT_LIST_HEAD(hpage_reclaim_list(page));
+#endif
set_compound_page_dtor(page, TRANSHUGE_PAGE_DTOR);
}
@@ -2367,7 +2370,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
(1L << PG_dirty)));
/* ->mapping in first tail page is compound_mapcount */
- VM_BUG_ON_PAGE(tail > 2 && page_tail->mapping != TAIL_MAPPING,
+ VM_BUG_ON_PAGE(tail > 3 && page_tail->mapping != TAIL_MAPPING,
page_tail);
page_tail->mapping = head->mapping;
page_tail->index = head->index + tail;
@@ -2620,6 +2623,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
VM_BUG_ON_PAGE(!PageLocked(head), head);
VM_BUG_ON_PAGE(!PageCompound(head), head);
+ del_hpage_from_queue(page);
if (PageWriteback(head))
return -EBUSY;
@@ -2779,6 +2783,7 @@ void deferred_split_huge_page(struct page *page)
set_shrinker_bit(memcg, page_to_nid(page),
deferred_split_shrinker.id);
#endif
+ del_hpage_from_queue(page);
}
spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
}
@@ -3203,3 +3208,292 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
update_mmu_cache_pmd(vma, address, pvmw->pmd);
}
#endif
+
+#ifdef CONFIG_MEMCG
+static inline bool is_zero_page(struct page *page)
+{
+ void *addr = kmap(page);
+ bool ret = true;
+
+ if (memchr_inv(addr, 0, PAGE_SIZE))
+ ret = false;
+ kunmap(page);
+
+ return ret;
+}
+
+/*
+ * We'll split the huge page iff it contains at least 1/32 zeros,
+ * estimate it by checking some discrete unsigned long values.
+ */
+static bool hpage_estimate_zero(struct page *page)
+{
+ unsigned int i, maybe_zero_pages = 0, offset = 0;
+ void *addr;
+
+#define BYTES_PER_LONG (BITS_PER_LONG / BITS_PER_BYTE)
+ for (i = 0; i < HPAGE_PMD_NR; i++, page++, offset++) {
+ addr = kmap(page);
+ if (unlikely((offset + 1) * BYTES_PER_LONG > PAGE_SIZE))
+ offset = 0;
+ if (*(const unsigned long *)(addr + offset) == 0UL) {
+ if (++maybe_zero_pages == HPAGE_PMD_NR >> 5) {
+ kunmap(page);
+ return true;
+ }
+ }
+ kunmap(page);
+ }
+
+ return false;
+}
+
+static bool replace_zero_pte(struct page *page, struct vm_area_struct *vma,
+ unsigned long addr, void *zero_page)
+{
+ struct page_vma_mapped_walk pvmw = {
+ .page = page,
+ .vma = vma,
+ .address = addr,
+ .flags = PVMW_SYNC | PVMW_MIGRATION,
+ };
+ pte_t pte;
+
+ VM_BUG_ON_PAGE(PageTail(page), page);
+
+ while (page_vma_mapped_walk(&pvmw)) {
+ pte = pte_mkspecial(
+ pfn_pte(page_to_pfn((struct page *)zero_page),
+ vma->vm_page_prot));
+ dec_mm_counter(vma->vm_mm, MM_ANONPAGES);
+ set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
+ update_mmu_cache(vma, pvmw.address, pvmw.pte);
+ }
+
+ return true;
+}
+
+static void replace_zero_ptes_locked(struct page *page)
+{
+ struct page *zero_page = ZERO_PAGE(0);
+ struct rmap_walk_control rwc = {
+ .rmap_one = replace_zero_pte,
+ .arg = zero_page,
+ };
+
+ rmap_walk_locked(page, &rwc);
+}
+
+static bool replace_zero_page(struct page *page)
+{
+ struct anon_vma *anon_vma = NULL;
+ bool unmap_success;
+ bool ret = true;
+
+ anon_vma = page_get_anon_vma(page);
+ if (!anon_vma)
+ return false;
+
+ anon_vma_lock_write(anon_vma);
+ try_to_migrate(page, TTU_RMAP_LOCKED);
+ unmap_success = !page_mapped(page);
+
+ if (!unmap_success || !is_zero_page(page)) {
+ /* remap the page */
+ remove_migration_ptes(page, page, true);
+ ret = false;
+ } else
+ replace_zero_ptes_locked(page);
+
+ anon_vma_unlock_write(anon_vma);
+ put_anon_vma(anon_vma);
+
+ return ret;
+}
+
+/*
+ * reclaim_zero_subpages - reclaim the zero subpages and putback the non-zero
+ * subpages.
+ *
+ * The non-zero subpages are putback to the keep_list, and will be putback to
+ * the lru list.
+ *
+ * Return the number of reclaimed zero subpages.
+ */
+static unsigned long reclaim_zero_subpages(struct list_head *list,
+ struct list_head *keep_list)
+{
+ LIST_HEAD(zero_list);
+ struct page *page;
+ unsigned long reclaimed = 0;
+
+ while (!list_empty(list)) {
+ page = lru_to_page(list);
+ list_del_init(&page->lru);
+ if (is_zero_page(page)) {
+ if (!trylock_page(page))
+ goto keep;
+
+ if (!replace_zero_page(page)) {
+ unlock_page(page);
+ goto keep;
+ }
+
+ __ClearPageActive(page);
+ unlock_page(page);
+ if (put_page_testzero(page)) {
+ list_add(&page->lru, &zero_list);
+ reclaimed++;
+ }
+
+ /* someone may hold the zero page, we just skip it. */
+
+ continue;
+ }
+keep:
+ list_add(&page->lru, keep_list);
+ }
+
+ mem_cgroup_uncharge_list(&zero_list);
+ free_unref_page_list(&zero_list);
+
+ return reclaimed;
+
+}
+
+#ifdef CONFIG_MMU
+#define ZSR_PG_MLOCK(flag) (1UL << flag)
+#else
+#define ZSR_PG_MLOCK(flag) 0
+#endif
+
+#ifdef CONFIG_ARCH_USES_PG_UNCACHED
+#define ZSR_PG_UNCACHED(flag) (1UL << flag)
+#else
+#define ZSR_PG_UNCACHED(flag) 0
+#endif
+
+#ifdef CONFIG_MEMORY_FAILURE
+#define ZSR_PG_HWPOISON(flag) (1UL << flag)
+#else
+#define ZSR_PG_HWPOISON(flag) 0
+#endif
+
+/* Filter unsupported page flags. */
+#define ZSR_FLAG_CHECK \
+ ((1UL << PG_error) | \
+ (1UL << PG_owner_priv_1) | \
+ (1UL << PG_arch_1) | \
+ (1UL << PG_reserved) | \
+ (1UL << PG_private) | \
+ (1UL << PG_private_2) | \
+ (1UL << PG_writeback) | \
+ (1UL << PG_swapcache) | \
+ (1UL << PG_mappedtodisk) | \
+ (1UL << PG_reclaim) | \
+ (1UL << PG_unevictable) | \
+ ZSR_PG_MLOCK(PG_mlocked) | \
+ ZSR_PG_UNCACHED(PG_uncached) | \
+ ZSR_PG_HWPOISON(PG_hwpoison))
+
+#define hpage_can_reclaim(page) \
+ (PageAnon(page) && !PageKsm(page) && !(page->flags & ZSR_FLAG_CHECK))
+
+#define hr_queue_list_to_page(head) \
+ compound_head(list_entry((head)->prev, struct page,\
+ hpage_reclaim_list))
+
+/*
+ * zsr_get_hpage - get one huge page from huge page reclaim queue
+ *
+ * Return -EINVAL if the queue is empty; otherwise, return 0.
+ * If the queue is not empty, it will check whether the tail page of the
+ * queue can be reclaimed or not. If the page can be reclaimed, it will
+ * be stored in reclaim_page; otherwise, just delete the page from the
+ * queue.
+ */
+int zsr_get_hpage(struct hpage_reclaim *hr_queue, struct page **reclaim_page)
+{
+ struct page *page = NULL;
+ unsigned long flags;
+ int ret = 0;
+
+ spin_lock_irqsave(&hr_queue->reclaim_queue_lock, flags);
+ if (list_empty(&hr_queue->reclaim_queue)) {
+ ret = -EINVAL;
+ goto unlock;
+ }
+
+ page = hr_queue_list_to_page(&hr_queue->reclaim_queue);
+ list_del_init(hpage_reclaim_list(page));
+ hr_queue->reclaim_queue_len--;
+
+ if (!hpage_can_reclaim(page) || !get_page_unless_zero(page))
+ goto unlock;
+
+ if (!trylock_page(page)) {
+ put_page(page);
+ goto unlock;
+ }
+
+ spin_unlock_irqrestore(&hr_queue->reclaim_queue_lock, flags);
+
+ if (hpage_can_reclaim(page) && hpage_estimate_zero(page) &&
+ !isolate_lru_page(page)) {
+ __mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON,
+ HPAGE_PMD_NR);
+ /*
+ * dec the reference added in
+ * isolate_lru_page
+ */
+ page_ref_dec(page);
+ *reclaim_page = page;
+ } else {
+ unlock_page(page);
+ put_page(page);
+ }
+
+ return ret;
+
+unlock:
+ spin_unlock_irqrestore(&hr_queue->reclaim_queue_lock, flags);
+ return ret;
+
+}
+
+unsigned long zsr_reclaim_hpage(struct lruvec *lruvec, struct page *page)
+{
+ struct pglist_data *pgdat = page_pgdat(page);
+ unsigned long reclaimed;
+ unsigned long flags;
+ LIST_HEAD(split_list);
+ LIST_HEAD(keep_list);
+
+ /*
+ * Split the huge page and reclaim the zero subpages.
+ * And putback the non-zero subpages to the lru list.
+ */
+ if (split_huge_page_to_list(page, &split_list)) {
+ unlock_page(page);
+ putback_lru_page(page);
+ mod_node_page_state(pgdat, NR_ISOLATED_ANON,
+ -HPAGE_PMD_NR);
+ return 0;
+ }
+
+ unlock_page(page);
+ list_add_tail(&page->lru, &split_list);
+ reclaimed = reclaim_zero_subpages(&split_list, &keep_list);
+
+ spin_lock_irqsave(&lruvec->lru_lock, flags);
+ move_pages_to_lru(lruvec, &keep_list);
+ spin_unlock_irqrestore(&lruvec->lru_lock, flags);
+ mod_node_page_state(pgdat, NR_ISOLATED_ANON,
+ -HPAGE_PMD_NR);
+
+ mem_cgroup_uncharge_list(&keep_list);
+ free_unref_page_list(&keep_list);
+
+ return reclaimed;
+}
+#endif
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b762215..5df1cdd 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2739,6 +2739,56 @@ static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
}
#endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+/* Need the page lock if the page is not a newly allocated page. */
+static void add_hpage_to_queue(struct page *page, struct mem_cgroup *memcg)
+{
+ struct hpage_reclaim *hr_queue;
+ unsigned long flags;
+
+ if (READ_ONCE(memcg->thp_reclaim) == THP_RECLAIM_DISABLE)
+ return;
+
+ page = compound_head(page);
+ /*
+ * we just want to add the anon page to the queue, but it is not sure
+ * the page is anon or not when charging to memcg.
+ * page_mapping return NULL if the page is a anon page or the mapping
+ * is not yet set.
+ */
+ if (!is_transparent_hugepage(page) || page_mapping(page))
+ return;
+
+ hr_queue = &memcg->nodeinfo[page_to_nid(page)]->hpage_reclaim_queue;
+ spin_lock_irqsave(&hr_queue->reclaim_queue_lock, flags);
+ if (list_empty(hpage_reclaim_list(page))) {
+ list_add(hpage_reclaim_list(page), &hr_queue->reclaim_queue);
+ hr_queue->reclaim_queue_len++;
+ }
+ spin_unlock_irqrestore(&hr_queue->reclaim_queue_lock, flags);
+}
+
+void del_hpage_from_queue(struct page *page)
+{
+ struct mem_cgroup *memcg;
+ struct hpage_reclaim *hr_queue;
+ unsigned long flags;
+
+ page = compound_head(page);
+ memcg = page_memcg(page);
+ if (!memcg || !is_transparent_hugepage(page))
+ return;
+
+ hr_queue = &memcg->nodeinfo[page_to_nid(page)]->hpage_reclaim_queue;
+ spin_lock_irqsave(&hr_queue->reclaim_queue_lock, flags);
+ if (!list_empty(hpage_reclaim_list(page))) {
+ list_del_init(hpage_reclaim_list(page));
+ hr_queue->reclaim_queue_len--;
+ }
+ spin_unlock_irqrestore(&hr_queue->reclaim_queue_lock, flags);
+}
+#endif
+
static void commit_charge(struct page *page, struct mem_cgroup *memcg)
{
VM_BUG_ON_PAGE(page_memcg(page), page);
@@ -2751,6 +2801,10 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg)
* - exclusive reference
*/
page->memcg_data = (unsigned long)memcg;
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ add_hpage_to_queue(page, memcg);
+#endif
}
static struct mem_cgroup *get_mem_cgroup_from_objcg(struct obj_cgroup *objcg)
@@ -4425,6 +4479,26 @@ static int mem_cgroup_oom_control_write(struct cgroup_subsys_state *css,
return 0;
}
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static u64 mem_cgroup_thp_reclaim_read(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ return READ_ONCE(mem_cgroup_from_css(css)->thp_reclaim);
+}
+
+static int mem_cgroup_thp_reclaim_write(struct cgroup_subsys_state *css,
+ struct cftype *cft, u64 val)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+ if (val != THP_RECLAIM_DISABLE && val != THP_RECLAIM_ENABLE)
+ return -EINVAL;
+
+ WRITE_ONCE(memcg->thp_reclaim, val);
+
+ return 0;
+}
+#endif
#ifdef CONFIG_CGROUP_WRITEBACK
@@ -4988,6 +5062,13 @@ static ssize_t memcg_write_event_control(struct kernfs_open_file *of,
.write = mem_cgroup_reset,
.read_u64 = mem_cgroup_read_u64,
},
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ {
+ .name = "thp_reclaim",
+ .read_u64 = mem_cgroup_thp_reclaim_read,
+ .write_u64 = mem_cgroup_thp_reclaim_write,
+ },
+#endif
{ }, /* terminate */
};
@@ -5088,6 +5169,12 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
pn->on_tree = false;
pn->memcg = memcg;
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ spin_lock_init(&pn->hpage_reclaim_queue.reclaim_queue_lock);
+ INIT_LIST_HEAD(&pn->hpage_reclaim_queue.reclaim_queue);
+ pn->hpage_reclaim_queue.reclaim_queue_len = 0;
+#endif
+
memcg->nodeinfo[node] = pn;
return 0;
}
@@ -5176,6 +5263,8 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
spin_lock_init(&memcg->deferred_split_queue.split_queue_lock);
INIT_LIST_HEAD(&memcg->deferred_split_queue.split_queue);
memcg->deferred_split_queue.split_queue_len = 0;
+
+ memcg->thp_reclaim = THP_RECLAIM_DISABLE;
#endif
idr_replace(&mem_cgroup_idr, memcg, memcg->id.id);
return memcg;
@@ -5209,6 +5298,9 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
page_counter_init(&memcg->swap, &parent->swap);
page_counter_init(&memcg->kmem, &parent->kmem);
page_counter_init(&memcg->tcpmem, &parent->tcpmem);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ memcg->thp_reclaim = parent->thp_reclaim;
+#endif
} else {
page_counter_init(&memcg->memory, NULL);
page_counter_init(&memcg->swap, NULL);
@@ -5654,6 +5746,10 @@ static int mem_cgroup_move_account(struct page *page,
__mod_lruvec_state(to_vec, NR_WRITEBACK, nr_pages);
}
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ del_hpage_from_queue(page);
+#endif
+
/*
* All state has been migrated, let's switch to the new memcg.
*
@@ -5674,6 +5770,10 @@ static int mem_cgroup_move_account(struct page *page,
page->memcg_data = (unsigned long)to;
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ add_hpage_to_queue(page, to);
+#endif
+
__unlock_page_memcg(from);
ret = 0;
@@ -6850,6 +6950,9 @@ static void uncharge_page(struct page *page, struct uncharge_gather *ug)
VM_BUG_ON_PAGE(PageLRU(page), page);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ del_hpage_from_queue(page);
+#endif
/*
* Nobody should be changing or seriously looking at
* page memcg or objcg at this point, we have fully
@@ -7196,6 +7299,10 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
VM_BUG_ON_PAGE(oldid, page);
mod_memcg_state(swap_memcg, MEMCG_SWAP, nr_entries);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ del_hpage_from_queue(page);
+#endif
+
page->memcg_data = 0;
if (!mem_cgroup_is_root(memcg))
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 74296c2..9be136f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2151,8 +2151,7 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
*
* Returns the number of pages moved to the given lruvec.
*/
-static unsigned int move_pages_to_lru(struct lruvec *lruvec,
- struct list_head *list)
+unsigned int move_pages_to_lru(struct lruvec *lruvec, struct list_head *list)
{
int nr_pages, nr_moved = 0;
LIST_HEAD(pages_to_free);
@@ -2783,6 +2782,57 @@ static bool can_age_anon_pages(struct pglist_data *pgdat,
return can_demote(pgdat->node_id, sc);
}
+#ifdef CONFIG_MEMCG
+#define MAX_SCAN_HPAGE 32UL
+/*
+ * Try to reclaim the zero subpages for the transparent huge page.
+ */
+static unsigned long reclaim_hpage_zero_subpages(struct lruvec *lruvec,
+ int priority,
+ unsigned long nr_to_reclaim)
+{
+ struct mem_cgroup *memcg;
+ struct hpage_reclaim *hr_queue;
+ int nid = lruvec->pgdat->node_id;
+ unsigned long nr_reclaimed = 0, nr_scanned = 0, nr_to_scan;
+
+ memcg = lruvec_memcg(lruvec);
+ if (!memcg)
+ goto out;
+
+ hr_queue = &memcg->nodeinfo[nid]->hpage_reclaim_queue;
+ if (!READ_ONCE(memcg->thp_reclaim))
+ goto out;
+
+ /* The last scan loop will scan all the huge pages.*/
+ nr_to_scan = priority == 0 ? 0 : MAX_SCAN_HPAGE;
+
+ do {
+ struct page *page = NULL;
+
+ if (zsr_get_hpage(hr_queue, &page))
+ break;
+
+ if (!page)
+ continue;
+
+ nr_reclaimed += zsr_reclaim_hpage(lruvec, page);
+
+ cond_resched();
+
+ } while ((nr_reclaimed < nr_to_reclaim) && (++nr_scanned != nr_to_scan));
+out:
+ return nr_reclaimed;
+}
+#else
+static unsigned long reclaim_hpage_zero_subpages(struct lruvec *lruvec,
+ int priority,
+ unsigned long nr_to_reclaim)
+{
+ return 0;
+}
+#endif
+
static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
{
unsigned long nr[NR_LRU_LISTS];
@@ -2886,6 +2936,11 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
scan_adjusted = true;
}
blk_finish_plug(&plug);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ if (nr_reclaimed < nr_to_reclaim)
+ nr_reclaimed += reclaim_hpage_zero_subpages(lruvec,
+ sc->priority, nr_to_reclaim - nr_reclaimed);
+#endif
sc->nr_reclaimed += nr_reclaimed;
/*
--
1.8.3.1
next prev parent reply other threads:[~2021-10-28 11:57 UTC|newest]
Thread overview: 17+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-10-28 11:56 [RFC 0/6] Reclaim zero subpages of thp to avoid memory bloat Ning Zhang
2021-10-28 11:56 ` Ning Zhang [this message]
2021-10-28 12:53 ` [RFC 1/6] mm, thp: introduce thp zero subpages reclaim Matthew Wilcox
2021-10-29 12:16 ` ning zhang
2021-10-28 11:56 ` [RFC 2/6] mm, thp: add a global interface for zero subapges reclaim Ning Zhang
2021-10-28 11:56 ` [RFC 3/6] mm, thp: introduce zero subpages reclaim threshold Ning Zhang
2021-10-28 11:56 ` [RFC 4/6] mm, thp: introduce a controller to trigger zero subpages reclaim Ning Zhang
2021-10-28 11:56 ` [RFC 5/6] mm, thp: add some statistics for " Ning Zhang
2021-10-28 11:56 ` [RFC 6/6] mm, thp: add document " Ning Zhang
2021-10-28 14:13 ` [RFC 0/6] Reclaim zero subpages of thp to avoid memory bloat Kirill A. Shutemov
2021-10-29 12:07 ` ning zhang
2021-10-29 16:56 ` Yang Shi
2021-11-01 2:50 ` ning zhang
2021-10-29 13:38 ` Michal Hocko
2021-10-29 16:12 ` ning zhang
2021-11-01 9:20 ` Michal Hocko
2021-11-08 3:24 ` ning zhang
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1635422215-99394-2-git-send-email-ningzhang@linux.alibaba.com \
--to=ningzhang@linux.alibaba.com \
--cc=akpm@linux-foundation.org \
--cc=hannes@cmpxchg.org \
--cc=linux-mm@kvack.org \
--cc=mhocko@kernel.org \
--cc=vdavydov.dev@gmail.com \
--cc=yuzhao@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox