linux-mm.kvack.org archive mirror
* [RFC][PATCH -mm -v2 0/4] mm,vmscan: reclaim from highest score cgroup
@ 2012-08-16 15:34 Rik van Riel
  2012-08-16 15:35 ` [RFC][PATCH -mm -v2 1/4] mm,vmscan: track recent pressure on each LRU set Rik van Riel
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: Rik van Riel @ 2012-08-16 15:34 UTC (permalink / raw)
  To: linux-mm; +Cc: yinghan, aquini, hannes, mhocko, Mel Gorman

Instead of doing round robin reclaim over all the cgroups in a zone, we
reclaim from the highest score cgroup first.

Factors in the scoring are the use ratio of pages in the lruvec
(recent_rotated / recent_scanned), the size of the lru, the recent amount
of pressure applied to each lru, whether the cgroup is over its soft limit
and whether the cgroup has lots of inactive file pages.
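
Roughly, the score computed in patch 3/4 behaves like the user-space
sketch below (made-up structure and field names; the real code computes
an anon and a file score per lruvec and takes the maximum):

#include <stdio.h>

/* Illustrative stand-in for the per-LRU statistics used by the series. */
struct lru_stats {
	unsigned long long size;            /* active + inactive pages      */
	unsigned long long recent_scanned;  /* pages scanned recently       */
	unsigned long long recent_rotated;  /* scanned pages still in use   */
	unsigned long long recent_pressure; /* globally aged scan pressure  */
};

/* score = size * recent_scanned / recent_rotated / recent_pressure,
 * boosted when the owning cgroup is over its soft limit.  The +1 terms
 * mirror the division-by-zero guards in the real patch. */
static unsigned long long lru_score(const struct lru_stats *s,
				    int over_soft_limit)
{
	unsigned long long score = s->size * s->recent_scanned;

	score /= s->recent_rotated + 1;
	score /= s->recent_pressure + 1;
	if (over_soft_limit)
		score *= 10000;
	return score;
}

int main(void)
{
	struct lru_stats cold = { 100000, 512,  16, 0 };  /* mostly unused */
	struct lru_stats hot  = { 100000, 512, 480, 0 };  /* mostly in use */

	printf("cold list score: %llu\n", lru_score(&cold, 0));
	printf("hot list score:  %llu\n", lru_score(&hot, 0));
	return 0;
}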

This patch series is on top of a recent mmotm with Ying's memcg softreclaim
patches [2/2] applied.  Unfortunately it turns out that mmotm tree with
Ying's patches does not compile with CONFIG_MEMCG=y, so I am throwing these
patches over the wall untested, as inspiration for others (hi Ying).

This still suffers from the same scalability issue the current code has,
namely a round robin iteration over all the lruvecs in a zone. We may want
to fix that in the future by sorting the memcgs/lruvecs in some sort of
tree, allowing us to find the high priority ones more easily and doing the
recalculation asynchronously and less often.



* [RFC][PATCH -mm -v2 1/4] mm,vmscan: track recent pressure on each LRU set
  2012-08-16 15:34 [RFC][PATCH -mm -v2 0/4] mm,vmscan: reclaim from highest score cgroup Rik van Riel
@ 2012-08-16 15:35 ` Rik van Riel
  2012-08-16 15:36 ` [RFC][PATCH -mm -v2 2/4] mm,memcontrol: export mem_cgroup_get/put Rik van Riel
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 11+ messages in thread
From: Rik van Riel @ 2012-08-16 15:35 UTC (permalink / raw)
  To: linux-mm; +Cc: yinghan, aquini, hannes, mhocko, Mel Gorman

Keep track of the recent amount of pressure applied to each LRU list.

This statistic is incremented simultaneously with ->recent_scanned,
but it is aged in a different way. Recent_scanned and recent_rotated
are aged locally for each list, to estimate the fraction of objects
on each list that are in active use.

The recent_pressure statistic is aged globally for all lists. We
can use this to figure out which LRUs we should reclaim from.
Because this figure is only used at reclaim time, we can lazily
age it whenever we consider an lruvec for reclaiming.
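
As a stand-alone illustration of the lazy aging (toy code, made-up names
and numbers, no locking): each set of statistics remembers which global
aging interval it last caught up with, and is shifted right once for
every interval it missed.

#include <stdio.h>

static unsigned int global_pressure_seq;	/* bumped once per aging interval */

struct pressure_stats {
	unsigned long pressure[2];	/* [0] = anon, [1] = file */
	unsigned int seq;		/* last interval we caught up with */
};

/* Lazily age one set of counters: one divide-by-two per missed interval. */
static void catch_up(struct pressure_stats *s)
{
	unsigned int shift = global_pressure_seq - s->seq;

	if (shift > 63)
		shift = 63;
	s->pressure[0] >>= shift;
	s->pressure[1] >>= shift;
	s->seq = global_pressure_seq;
}

int main(void)
{
	struct pressure_stats s = { { 4096, 1024 }, 0 };

	global_pressure_seq = 3;	/* three aging intervals have passed */
	catch_up(&s);			/* 4096 -> 512, 1024 -> 128 */
	printf("anon %lu, file %lu\n", s.pressure[0], s.pressure[1]);
	return 0;
}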

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 include/linux/mmzone.h |   10 ++++++++-
 mm/memcontrol.c        |    5 ++++
 mm/swap.c              |    1 +
 mm/vmscan.c            |   51 ++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 66 insertions(+), 1 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index f222e06..be93e7e 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -189,12 +189,20 @@ struct zone_reclaim_stat {
 	 * The pageout code in vmscan.c keeps track of how many of the
 	 * mem/swap backed and file backed pages are referenced.
 	 * The higher the rotated/scanned ratio, the more valuable
-	 * that cache is.
+	 * that cache is. These numbers are aged separately for each LRU.
 	 *
 	 * The anon LRU stats live in [0], file LRU stats in [1]
 	 */
 	unsigned long		recent_rotated[2];
 	unsigned long		recent_scanned[2];
+	/*
+	 * This number is incremented together with recent_scanned,
+	 * but is aged simultaneously for all LRUs. This allows the
+	 * system to determine which LRUs have already been scanned
+	 * enough, and which should be scanned next.
+	 */
+	unsigned long		recent_pressure[2];
+	unsigned long		recent_pressure_seq;
 };
 
 struct lruvec {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d906b43..a18a0d5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3852,6 +3852,7 @@ static int memcg_stat_show(struct cgroup *cont, struct cftype *cft,
 		struct zone_reclaim_stat *rstat;
 		unsigned long recent_rotated[2] = {0, 0};
 		unsigned long recent_scanned[2] = {0, 0};
+		unsigned long recent_pressure[2] = {0, 0};
 
 		for_each_online_node(nid)
 			for (zid = 0; zid < MAX_NR_ZONES; zid++) {
@@ -3862,11 +3863,15 @@ static int memcg_stat_show(struct cgroup *cont, struct cftype *cft,
 				recent_rotated[1] += rstat->recent_rotated[1];
 				recent_scanned[0] += rstat->recent_scanned[0];
 				recent_scanned[1] += rstat->recent_scanned[1];
+				recent_pressure[0] += rstat->recent_pressure[0];
+				recent_pressure[1] += rstat->recent_pressure[1];
 			}
 		seq_printf(m, "recent_rotated_anon %lu\n", recent_rotated[0]);
 		seq_printf(m, "recent_rotated_file %lu\n", recent_rotated[1]);
 		seq_printf(m, "recent_scanned_anon %lu\n", recent_scanned[0]);
 		seq_printf(m, "recent_scanned_file %lu\n", recent_scanned[1]);
+		seq_printf(m, "recent_pressure_anon %lu\n", recent_pressure[0]);
+		seq_printf(m, "recent_pressure_file %lu\n", recent_pressure[1]);
 	}
 #endif
 
diff --git a/mm/swap.c b/mm/swap.c
index 4e7e2ec..0cca972 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -316,6 +316,7 @@ static void update_page_reclaim_stat(struct lruvec *lruvec,
 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
 
 	reclaim_stat->recent_scanned[file]++;
+	reclaim_stat->recent_pressure[file]++;
 	if (rotated)
 		reclaim_stat->recent_rotated[file]++;
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a779b03..b0e5495 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1282,6 +1282,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	spin_lock_irq(&zone->lru_lock);
 
 	reclaim_stat->recent_scanned[file] += nr_taken;
+	reclaim_stat->recent_pressure[file] += nr_taken;
 
 	if (global_reclaim(sc)) {
 		if (current_is_kswapd())
@@ -1426,6 +1427,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 		zone->pages_scanned += nr_scanned;
 
 	reclaim_stat->recent_scanned[file] += nr_taken;
+	reclaim_stat->recent_pressure[file] += nr_taken;
 
 	__count_zone_vm_events(PGREFILL, zone, nr_scanned);
 	__mod_zone_page_state(zone, NR_LRU_BASE + lru, -nr_taken);
@@ -1852,6 +1854,53 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 	throttle_vm_writeout(sc->gfp_mask);
 }
 
+/*
+ * Ensure that the ->recent_pressure statistics for this lruvec are
+ * aged to the same degree as those elsewhere in the system, before
+ * we do reclaim on this lruvec or evaluate its reclaim priority.
+ */
+static DEFINE_SPINLOCK(recent_pressure_lock);
+static int recent_pressure_seq;
+static void age_recent_pressure(struct lruvec *lruvec, struct zone *zone)
+{
+	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
+	unsigned long anon  = get_lru_size(lruvec, LRU_ACTIVE_ANON) +
+			      get_lru_size(lruvec, LRU_INACTIVE_ANON);
+	unsigned long file  = get_lru_size(lruvec, LRU_ACTIVE_FILE) +
+			      get_lru_size(lruvec, LRU_INACTIVE_FILE);
+	int shift;
+
+	/*
+	 * Do not bother recalculating unless we are behind with the
+	 * system wide statistics, or our local recent_pressure numbers
+	 * have grown too large. We have to keep the number somewhat
+	 * small, to ensure that reclaim_score returns non-zero.
+	 */
+	if (reclaim_stat->recent_pressure_seq != recent_pressure_seq &&
+			reclaim_stat->recent_pressure[0] < anon / 4 &&
+			reclaim_stat->recent_pressure[1] < file / 4)
+		return;
+
+	spin_lock(&recent_pressure_lock);
+	/*
+	 * If we are aging due to local activity, increment the global
+	 * sequence counter. Leave the global counter alone if we are
+	 * merely playing catchup.
+	 */
+	if (reclaim_stat->recent_pressure_seq == recent_pressure_seq)
+		recent_pressure_seq++;
+	shift = recent_pressure_seq - reclaim_stat->recent_pressure_seq;
+	shift = min(shift, (BITS_PER_LONG-1));
+	reclaim_stat->recent_pressure_seq = recent_pressure_seq;
+	spin_unlock(&recent_pressure_lock);
+
+	/* For every aging interval, do one division by two. */
+	spin_lock_irq(&zone->lru_lock);
+	reclaim_stat->recent_pressure[0] >>= shift;
+	reclaim_stat->recent_pressure[1] >>= shift;
+	spin_unlock_irq(&zone->lru_lock);
+}
+
 static void shrink_zone(struct zone *zone, struct scan_control *sc)
 {
 	struct mem_cgroup *root = sc->target_mem_cgroup;
@@ -1869,6 +1918,8 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
 	do {
 		struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);
 
+		age_recent_pressure(lruvec, zone);
+
 		/*
 		 * Reclaim from mem_cgroup if any of these conditions are met:
 		 * - this is a targetted reclaim ( not global reclaim)



* [RFC][PATCH -mm -v2 2/4] mm,memcontrol: export mem_cgroup_get/put
  2012-08-16 15:34 [RFC][PATCH -mm -v2 0/4] mm,vmscan: reclaim from highest score cgroup Rik van Riel
  2012-08-16 15:35 ` [RFC][PATCH -mm -v2 1/4] mm,vmscan: track recent pressure on each LRU set Rik van Riel
@ 2012-08-16 15:36 ` Rik van Riel
  2012-08-16 15:37 ` [RFC][PATCH -mm -v2 3/4] mm,vmscan: reclaim from the highest score cgroups Rik van Riel
  2012-08-16 15:38 ` [RFC][PATCH -mm -v2 4/4] mm,vmscan: evict inactive file pages first Rik van Riel
  3 siblings, 0 replies; 11+ messages in thread
From: Rik van Riel @ 2012-08-16 15:36 UTC (permalink / raw)
  To: linux-mm; +Cc: yinghan, aquini, hannes, mhocko, Mel Gorman

The page reclaim code should keep a reference to a cgroup while
reclaiming from that cgroup.  In order to do this when selecting
the highest score cgroup for reclaim, the VM code needs access
to refcounting functions for the memory cgroup code.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 include/linux/memcontrol.h |   11 +++++++++++
 mm/memcontrol.c            |    6 ++----
 2 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 65538f9..c4cc64c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -65,6 +65,9 @@ extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
 struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
 
+extern void mem_cgroup_get(struct mem_cgroup *memcg);
+extern void mem_cgroup_put(struct mem_cgroup *memcg);
+
 /* For coalescing uncharge for reducing memcg' overhead*/
 extern void mem_cgroup_uncharge_start(void);
 extern void mem_cgroup_uncharge_end(void);
@@ -298,6 +301,14 @@ static inline void mem_cgroup_iter_break(struct mem_cgroup *root,
 {
 }
 
+static inline void mem_cgroup_get(struct mem_cgroup *memcg)
+{
+}
+
+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+}
+
 static inline bool mem_cgroup_disabled(void)
 {
 	return true;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a18a0d5..376f680 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -368,8 +368,6 @@ enum charge_type {
 #define MEM_CGROUP_RECLAIM_SHRINK_BIT	0x1
 #define MEM_CGROUP_RECLAIM_SHRINK	(1 << MEM_CGROUP_RECLAIM_SHRINK_BIT)
 
-static void mem_cgroup_get(struct mem_cgroup *memcg);
-static void mem_cgroup_put(struct mem_cgroup *memcg);
 static bool mem_cgroup_is_root(struct mem_cgroup *memcg);
 
 /* Writing them here to avoid exposing memcg's inner layout */
@@ -4492,7 +4490,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
 	call_rcu(&memcg->rcu_freeing, free_rcu);
 }
 
-static void mem_cgroup_get(struct mem_cgroup *memcg)
+void mem_cgroup_get(struct mem_cgroup *memcg)
 {
 	atomic_inc(&memcg->refcnt);
 }
@@ -4507,7 +4505,7 @@ static void __mem_cgroup_put(struct mem_cgroup *memcg, int count)
 	}
 }
 
-static void mem_cgroup_put(struct mem_cgroup *memcg)
+void mem_cgroup_put(struct mem_cgroup *memcg)
 {
 	__mem_cgroup_put(memcg, 1);
 }



* [RFC][PATCH -mm -v2 3/4] mm,vmscan: reclaim from the highest score cgroups
  2012-08-16 15:34 [RFC][PATCH -mm -v2 0/4] mm,vmscan: reclaim from highest score cgroup Rik van Riel
  2012-08-16 15:35 ` [RFC][PATCH -mm -v2 1/4] mm,vmscan: track recent pressure on each LRU set Rik van Riel
  2012-08-16 15:36 ` [RFC][PATCH -mm -v2 2/4] mm,memcontrol: export mem_cgroup_get/put Rik van Riel
@ 2012-08-16 15:37 ` Rik van Riel
  2012-08-17 23:34   ` Ying Han
  2012-08-16 15:38 ` [RFC][PATCH -mm -v2 4/4] mm,vmscan: evict inactive file pages first Rik van Riel
  3 siblings, 1 reply; 11+ messages in thread
From: Rik van Riel @ 2012-08-16 15:37 UTC (permalink / raw)
  To: linux-mm; +Cc: yinghan, aquini, hannes, mhocko, Mel Gorman

Instead of doing a round robin reclaim over all the cgroups in a
zone, we pick the lruvec with the top score and reclaim from that.

We keep reclaiming from that lruvec until we have reclaimed enough
pages (common for direct reclaim), or that lruvec's score drops in
half. We keep reclaiming from the zone until we have reclaimed enough
pages, or have scanned more than the number of reclaimable pages shifted
by the reclaim priority.

As an additional change, targeted cgroup reclaim now reclaims from
the highest priority lruvec. This is because when a cgroup hierarchy
hits its limit, the best lruvec to reclaim from may be different than
whatever lruvec is the first we run into iterating from the hierarchy's
"root".

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 mm/vmscan.c |  137 ++++++++++++++++++++++++++++++++++++++++++----------------
 1 files changed, 99 insertions(+), 38 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index b0e5495..769fdcd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1901,6 +1901,57 @@ static void age_recent_pressure(struct lruvec *lruvec, struct zone *zone)
 	spin_unlock_irq(&zone->lru_lock);
 }
 
+/*
+ * The higher the LRU score, the more desirable it is to reclaim
+ * from this LRU set first. The score is a function of the fraction
+ * of recently scanned pages on the LRU that are in active use,
+ * as well as the size of the list and the amount of memory pressure
+ * that has been put on this LRU recently.
+ *
+ *          recent_scanned        size
+ * score =  -------------- x --------------- x adjustment
+ *          recent_rotated   recent_pressure
+ *
+ * The maximum score of the anon and file list in this lruvec
+ * is returned. Adjustments are made for the file LRU having
+ * lots of inactive pages (mostly streaming IO), or the memcg
+ * being over its soft limit.
+ *
+ * This function should return a positive number for any lruvec
+ * with more than a handful of resident pages, because recent_scanned
+ * should always be larger than recent_rotated, and the size should
+ * always be larger than recent_pressure.
+ */
+static u64 reclaim_score(struct mem_cgroup *memcg,
+			 struct lruvec *lruvec)
+{
+	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
+	u64 anon, file;
+
+	anon  = get_lru_size(lruvec, LRU_ACTIVE_ANON) +
+	        get_lru_size(lruvec, LRU_INACTIVE_ANON);
+	anon *= reclaim_stat->recent_scanned[0];
+	anon /= (reclaim_stat->recent_rotated[0] + 1);
+	anon /= (reclaim_stat->recent_pressure[0] + 1);
+
+	file = get_lru_size(lruvec, LRU_ACTIVE_FILE) +
+	       get_lru_size(lruvec, LRU_INACTIVE_FILE);
+	file *= reclaim_stat->recent_scanned[1];
+	file /= (reclaim_stat->recent_rotated[1] + 1);
+	file /= (reclaim_stat->recent_pressure[1] + 1);
+
+	/*
+	 * Give a STRONG preference to reclaiming memory from lruvecs
+	 * that belong to a cgroup that is over its soft limit.
+	 */
+	if (mem_cgroup_over_soft_limit(memcg)) {
+		file *= 10000;
+		anon *= 10000;
+	}
+
+	return max(anon, file);
+}
+
 static void shrink_zone(struct zone *zone, struct scan_control *sc)
 {
 	struct mem_cgroup *root = sc->target_mem_cgroup;
@@ -1908,11 +1959,17 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
 		.zone = zone,
 		.priority = sc->priority,
 	};
-	struct mem_cgroup *memcg;
-	bool over_softlimit, ignore_softlimit = false;
+	unsigned long nr_scanned = sc->nr_scanned;
+	unsigned long nr_scanned_this_round;
+	struct mem_cgroup *memcg, *victim_memcg;
+	struct lruvec *victim_lruvec;
+	u64 score, max_score;
 
 restart:
-	over_softlimit = false;
+	nr_scanned_this_round = sc->nr_scanned;
+	victim_lruvec = NULL;
+	victim_memcg = NULL;
+	max_score = 0;
 
 	memcg = mem_cgroup_iter(root, NULL, &reclaim);
 	do {
@@ -1920,48 +1977,52 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
 
 		age_recent_pressure(lruvec, zone);
 
-		/*
-		 * Reclaim from mem_cgroup if any of these conditions are met:
-		 * - this is a targetted reclaim ( not global reclaim)
-		 * - reclaim priority is less than  DEF_PRIORITY - 2
-		 * - mem_cgroup or its ancestor ( not including root cgroup)
-		 * exceeds its soft limit
-		 *
-		 * Note: The priority check is a balance of how hard to
-		 * preserve the pages under softlimit. If the memcgs of the
-		 * zone having trouble to reclaim pages above their softlimit,
-		 * we have to reclaim under softlimit instead of burning more
-		 * cpu cycles.
-		 */
-		if (ignore_softlimit || !global_reclaim(sc) ||
-				sc->priority < DEF_PRIORITY - 2 ||
-				mem_cgroup_over_soft_limit(memcg)) {
-			shrink_lruvec(lruvec, sc);
+		score = reclaim_score(memcg, lruvec);
 
-			over_softlimit = true;
+		/* Pick the lruvec with the highest score. */
+		if (score > max_score) {
+			max_score = score;
+			if (victim_memcg)
+				mem_cgroup_put(victim_memcg);
+			mem_cgroup_get(memcg);
+			victim_lruvec = lruvec;
+			victim_memcg = memcg;
 		}
 
-		/*
-		 * Limit reclaim has historically picked one memcg and
-		 * scanned it with decreasing priority levels until
-		 * nr_to_reclaim had been reclaimed.  This priority
-		 * cycle is thus over after a single memcg.
-		 *
-		 * Direct reclaim and kswapd, on the other hand, have
-		 * to scan all memory cgroups to fulfill the overall
-		 * scan target for the zone.
-		 */
-		if (!global_reclaim(sc)) {
-			mem_cgroup_iter_break(root, memcg);
-			break;
-		}
 		memcg = mem_cgroup_iter(root, memcg, &reclaim);
 	} while (memcg);
 
-	if (!over_softlimit) {
-		ignore_softlimit = true;
+	/* No lruvec in our set is suitable for reclaiming. */
+	if (!victim_lruvec)
+		return;
+
+	/*
+	 * Reclaim from the top scoring lruvec until we freed enough
+	 * pages, or its reclaim priority has halved.
+	 */
+	do {
+		shrink_lruvec(victim_lruvec, sc);
+		score = reclaim_score(memcg, victim_lruvec);
+	} while (sc->nr_to_reclaim > 0 && score > max_score / 2);
+
+	mem_cgroup_put(victim_memcg);
+
+	/*
+	 * The shrinking code increments sc->nr_scanned for every
+	 * page scanned. If we failed to scan any pages from the
+	 * top reclaim victim, bail out to prevent a livelock.
+	 */
+	if (sc->nr_scanned == nr_scanned_this_round)
+		return;
+
+	/*
+	 * Do we need to reclaim more pages?
+	 * Did we scan fewer pages than the current priority allows?
+	 */
+	if (sc->nr_to_reclaim > 0 &&
+			sc->nr_scanned - nr_scanned <
+			zone_reclaimable_pages(zone) >> sc->priority)
 		goto restart;
-	}
 }
 
 /* Returns true if compaction should go ahead for a high-order request */



* [RFC][PATCH -mm -v2 4/4] mm,vmscan: evict inactive file pages first
  2012-08-16 15:34 [RFC][PATCH -mm -v2 0/4] mm,vmscan: reclaim from highest score cgroup Rik van Riel
                   ` (2 preceding siblings ...)
  2012-08-16 15:37 ` [RFC][PATCH -mm -v2 3/4] mm,vmscan: reclaim from the highest score cgroups Rik van Riel
@ 2012-08-16 15:38 ` Rik van Riel
  2012-08-23 23:07   ` Ying Han
  3 siblings, 1 reply; 11+ messages in thread
From: Rik van Riel @ 2012-08-16 15:38 UTC (permalink / raw)
  To: linux-mm; +Cc: yinghan, aquini, hannes, mhocko, Mel Gorman

When a lot of streaming file IO is happening, it makes sense to
evict just the inactive file pages and leave the other LRU lists
alone.

Likewise, when driving a cgroup hierarchy into its hard limit,
or over its soft limit, it makes sense to pick a child cgroup
that has lots of inactive file pages, and evict those first.

Being over its soft limit is considered a stronger preference
than just having a lot of inactive file pages, so a well behaved
cgroup is allowed to keep its file cache when there is a "badly
behaving" one in the same hierarchy.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 mm/vmscan.c |   37 +++++++++++++++++++++++++++++++++----
 1 files changed, 33 insertions(+), 4 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 769fdcd..2884b4f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1576,6 +1576,19 @@ static int inactive_list_is_low(struct lruvec *lruvec, enum lru_list lru)
 		return inactive_anon_is_low(lruvec);
 }
 
+/* If this lruvec has lots of inactive file pages, reclaim those only. */
+static bool reclaim_file_only(struct lruvec *lruvec, struct scan_control *sc,
+			      unsigned long anon, unsigned long file)
+{
+	if (inactive_file_is_low(lruvec))
+		return false;
+
+	if (file > (anon + file) >> sc->priority)
+		return true;
+
+	return false;
+}
+
 static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
 				 struct lruvec *lruvec, struct scan_control *sc)
 {
@@ -1658,6 +1671,14 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 		}
 	}
 
+	/* Lots of inactive file pages? Reclaim those only. */
+	if (reclaim_file_only(lruvec, sc, anon, file)) {
+		fraction[0] = 0;
+		fraction[1] = 1;
+		denominator = 1;
+		goto out;
+	}
+
 	/*
 	 * With swappiness at 100, anonymous and file have the same priority.
 	 * This scanning priority is essentially the inverse of IO cost.
@@ -1922,8 +1943,8 @@ static void age_recent_pressure(struct lruvec *lruvec, struct zone *zone)
  * should always be larger than recent_rotated, and the size should
  * always be larger than recent_pressure.
  */
-static u64 reclaim_score(struct mem_cgroup *memcg,
-			 struct lruvec *lruvec)
+static u64 reclaim_score(struct mem_cgroup *memcg, struct lruvec *lruvec,
+			 struct scan_control *sc)
 {
 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
 	u64 anon, file;
@@ -1949,6 +1970,14 @@ static u64 reclaim_score(struct mem_cgroup *memcg,
 		anon *= 10000;
 	}
 
+	/*
+	 * Prefer reclaiming from an lruvec with lots of inactive file
+	 * pages. Once those have been reclaimed, the score will drop so
+	 * far we will pick another lruvec to reclaim from.
+	 */
+	if (reclaim_file_only(lruvec, sc, anon, file))
+		file *= 100;
+
 	return max(anon, file);
 }
 
@@ -1977,7 +2006,7 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
 
 		age_recent_pressure(lruvec, zone);
 
-		score = reclaim_score(memcg, lruvec);
+		score = reclaim_score(memcg, lruvec, sc);
 
 		/* Pick the lruvec with the highest score. */
 		if (score > max_score) {
@@ -2002,7 +2031,7 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
 	 */
 	do {
 		shrink_lruvec(victim_lruvec, sc);
-		score = reclaim_score(memcg, victim_lruvec);
+		score = reclaim_score(memcg, victim_lruvec, sc);
 	} while (sc->nr_to_reclaim > 0 && score > max_score / 2);
 
 	mem_cgroup_put(victim_memcg);



* Re: [RFC][PATCH -mm -v2 3/4] mm,vmscan: reclaim from the highest score cgroups
  2012-08-16 15:37 ` [RFC][PATCH -mm -v2 3/4] mm,vmscan: reclaim from the highest score cgroups Rik van Riel
@ 2012-08-17 23:34   ` Ying Han
  2012-08-17 23:41     ` Rik van Riel
  0 siblings, 1 reply; 11+ messages in thread
From: Ying Han @ 2012-08-17 23:34 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-mm, aquini, hannes, mhocko, Mel Gorman

On Thu, Aug 16, 2012 at 8:37 AM, Rik van Riel <riel@redhat.com> wrote:
> Instead of doing a round robin reclaim over all the cgroups in a
> zone, we pick the lruvec with the top score and reclaim from that.
>
> We keep reclaiming from that lruvec until we have reclaimed enough
> pages (common for direct reclaim), or that lruvec's score drops in
> half. We keep reclaiming from the zone until we have reclaimed enough
> pages, or have scanned more than the number of reclaimable pages shifted
> by the reclaim priority.
>
> As an additional change, targeted cgroup reclaim now reclaims from
> the highest priority lruvec. This is because when a cgroup hierarchy
> hits its limit, the best lruvec to reclaim from may be different than
> whatever lruvec is the first we run into iterating from the hierarchy's
> "root".
>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> ---
>  mm/vmscan.c |  137 ++++++++++++++++++++++++++++++++++++++++++----------------
>  1 files changed, 99 insertions(+), 38 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b0e5495..769fdcd 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1901,6 +1901,57 @@ static void age_recent_pressure(struct lruvec *lruvec, struct zone *zone)
>         spin_unlock_irq(&zone->lru_lock);
>  }
>
> +/*
> + * The higher the LRU score, the more desirable it is to reclaim
> + * from this LRU set first. The score is a function of the fraction
> + * of recently scanned pages on the LRU that are in active use,
> + * as well as the size of the list and the amount of memory pressure
> + * that has been put on this LRU recently.
> + *
> + *          recent_scanned        size
> + * score =  -------------- x --------------- x adjustment
> + *          recent_rotated   recent_pressure
> + *
> + * The maximum score of the anon and file list in this lruvec
> + * is returned. Adjustments are made for the file LRU having
> + * lots of inactive pages (mostly streaming IO), or the memcg
> + * being over its soft limit.
> + *
> + * This function should return a positive number for any lruvec
> + * with more than a handful of resident pages, because recent_scanned
> + * should always be larger than recent_rotated, and the size should
> + * always be larger than recent_pressure.
> + */
> +static u64 reclaim_score(struct mem_cgroup *memcg,
> +                        struct lruvec *lruvec)
> +{
> +       struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> +       u64 anon, file;
> +
> +       anon  = get_lru_size(lruvec, LRU_ACTIVE_ANON) +
> +               get_lru_size(lruvec, LRU_INACTIVE_ANON);
> +       anon *= reclaim_stat->recent_scanned[0];
> +       anon /= (reclaim_stat->recent_rotated[0] + 1);
> +       anon /= (reclaim_stat->recent_pressure[0] + 1);
> +
> +       file = get_lru_size(lruvec, LRU_ACTIVE_FILE) +
> +              get_lru_size(lruvec, LRU_INACTIVE_FILE);
> +       file *= reclaim_stat->recent_scanned[1];
> +       file /= (reclaim_stat->recent_rotated[1] + 1);
> +       file /= (reclaim_stat->recent_pressure[1] + 1);
> +
> +       /*
> +        * Give a STRONG preference to reclaiming memory from lruvecs
> +        * that belong to a cgroup that is over its soft limit.
> +        */
> +       if (mem_cgroup_over_soft_limit(memcg)) {
> +               file *= 10000;
> +               anon *= 10000;
> +       }
> +
> +       return max(anon, file);
> +}
> +
>  static void shrink_zone(struct zone *zone, struct scan_control *sc)
>  {
>         struct mem_cgroup *root = sc->target_mem_cgroup;
> @@ -1908,11 +1959,17 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
>                 .zone = zone,
>                 .priority = sc->priority,
>         };
> -       struct mem_cgroup *memcg;
> -       bool over_softlimit, ignore_softlimit = false;
> +       unsigned long nr_scanned = sc->nr_scanned;
> +       unsigned long nr_scanned_this_round;
> +       struct mem_cgroup *memcg, *victim_memcg;
> +       struct lruvec *victim_lruvec;
> +       u64 score, max_score;
>
>  restart:
> -       over_softlimit = false;
> +       nr_scanned_this_round = sc->nr_scanned;
> +       victim_lruvec = NULL;
> +       victim_memcg = NULL;
> +       max_score = 0;
>
>         memcg = mem_cgroup_iter(root, NULL, &reclaim);
>         do {
> @@ -1920,48 +1977,52 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
>
>                 age_recent_pressure(lruvec, zone);
>
> -               /*
> -                * Reclaim from mem_cgroup if any of these conditions are met:
> -                * - this is a targetted reclaim ( not global reclaim)
> -                * - reclaim priority is less than  DEF_PRIORITY - 2
> -                * - mem_cgroup or its ancestor ( not including root cgroup)
> -                * exceeds its soft limit
> -                *
> -                * Note: The priority check is a balance of how hard to
> -                * preserve the pages under softlimit. If the memcgs of the
> -                * zone having trouble to reclaim pages above their softlimit,
> -                * we have to reclaim under softlimit instead of burning more
> -                * cpu cycles.
> -                */
> -               if (ignore_softlimit || !global_reclaim(sc) ||
> -                               sc->priority < DEF_PRIORITY - 2 ||
> -                               mem_cgroup_over_soft_limit(memcg)) {
> -                       shrink_lruvec(lruvec, sc);
> +               score = reclaim_score(memcg, lruvec);
>
> -                       over_softlimit = true;
> +               /* Pick the lruvec with the highest score. */
> +               if (score > max_score) {
> +                       max_score = score;
> +                       if (victim_memcg)
> +                               mem_cgroup_put(victim_memcg);
> +                       mem_cgroup_get(memcg);
> +                       victim_lruvec = lruvec;
> +                       victim_memcg = memcg;
>                 }
>
> -               /*
> -                * Limit reclaim has historically picked one memcg and
> -                * scanned it with decreasing priority levels until
> -                * nr_to_reclaim had been reclaimed.  This priority
> -                * cycle is thus over after a single memcg.
> -                *
> -                * Direct reclaim and kswapd, on the other hand, have
> -                * to scan all memory cgroups to fulfill the overall
> -                * scan target for the zone.
> -                */
> -               if (!global_reclaim(sc)) {
> -                       mem_cgroup_iter_break(root, memcg);
> -                       break;
> -               }
>                 memcg = mem_cgroup_iter(root, memcg, &reclaim);
>         } while (memcg);
>
> -       if (!over_softlimit) {
> -               ignore_softlimit = true;
> +       /* No lruvec in our set is suitable for reclaiming. */
> +       if (!victim_lruvec)
> +               return;
> +
> +       /*
> +        * Reclaim from the top scoring lruvec until we freed enough
> +        * pages, or its reclaim priority has halved.
> +        */
> +       do {
> +               shrink_lruvec(victim_lruvec, sc);
> +               score = reclaim_score(memcg, victim_lruvec);
> +       } while (sc->nr_to_reclaim > 0 && score > max_score / 2);

This would violate the user expectation of soft_limit badly,
especially for background reclaim where nr_to_reclaim equals
ULONG_MAX.

Here we keep hitting cgroup A and potentially push it down to its
softlimit until the score drops to a certain level. That is bad since
it causes "hot" memory (under the softlimit) of A to be reclaimed
while other cgroups have plenty of "cold" memory (above the softlimit)
to give out.

In general, picking one cgroup to reclaim from instead of doing
round-robin is ok as long as we don't reclaim it all the way down to
the softlimit. The next question then is which cgroup to reclaim from
next if that doesn't give us enough.

--Ying

> +
> +       mem_cgroup_put(victim_memcg);
> +
> +       /*
> +        * The shrinking code increments sc->nr_scanned for every
> +        * page scanned. If we failed to scan any pages from the
> +        * top reclaim victim, bail out to prevent a livelock.
> +        */
> +       if (sc->nr_scanned == nr_scanned_this_round)
> +               return;
> +
> +       /*
> +        * Do we need to reclaim more pages?
> +        * Did we scan fewer pages than the current priority allows?
> +        */
> +       if (sc->nr_to_reclaim > 0 &&
> +                       sc->nr_scanned - nr_scanned <
> +                       zone_reclaimable_pages(zone) >> sc->priority)
>                 goto restart;
> -       }
>  }
>
>  /* Returns true if compaction should go ahead for a high-order request */



* Re: [RFC][PATCH -mm -v2 3/4] mm,vmscan: reclaim from the highest score cgroups
  2012-08-17 23:34   ` Ying Han
@ 2012-08-17 23:41     ` Rik van Riel
  2012-08-18  0:26       ` Ying Han
  0 siblings, 1 reply; 11+ messages in thread
From: Rik van Riel @ 2012-08-17 23:41 UTC (permalink / raw)
  To: Ying Han; +Cc: linux-mm, aquini, hannes, mhocko, Mel Gorman

On 08/17/2012 07:34 PM, Ying Han wrote:
> On Thu, Aug 16, 2012 at 8:37 AM, Rik van Riel <riel@redhat.com> wrote:

>> +       /*
>> +        * Reclaim from the top scoring lruvec until we freed enough
>> +        * pages, or its reclaim priority has halved.
>> +        */
>> +       do {
>> +               shrink_lruvec(victim_lruvec, sc);
>> +               score = reclaim_score(memcg, victim_lruvec);
>> +       } while (sc->nr_to_reclaim > 0 && score > max_score / 2);
>
> This would violate the user expectation of soft_limit badly,
> especially for background reclaim where nr_to_reclaim equals to
> ULONG_MAX.
>
> Here we keep hitting cgroup A and potentially push it down to
> softlimit until the score drops to certain level. It is bad since it
> causes "hot" memory (under softlimit) of A being reclaimed while other
> cgroups has plenty of "cold" (above softlimit) to give out.

Look at the function reclaim_score().

Once a group drops below its soft limit, its score will
be a factor 10000 smaller, making sure we hit the second
exit condition.

After that, we will pick another group.

> In general, pick one cgroup to reclaim instead of round-robin is ok as
> long as we don't reclaim further down to the softlimit. The next
> question then is what's the next cgroup to reclaim if that doesn't
> give us enough.

Again, look at the function reclaim_score().

If there is a group above the softlimit, we pretty much
guarantee we will reclaim from that group.  If any reclaim
happens from another group at all, it will be absolutely
minimal (taking recent_pressure from 0 to SWAP_CLUSTER_MAX,
and then moving on to another group).

-- 
All rights reversed



* Re: [RFC][PATCH -mm -v2 3/4] mm,vmscan: reclaim from the highest score cgroups
  2012-08-17 23:41     ` Rik van Riel
@ 2012-08-18  0:26       ` Ying Han
  2012-08-18  4:02         ` Rik van Riel
  0 siblings, 1 reply; 11+ messages in thread
From: Ying Han @ 2012-08-18  0:26 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-mm, aquini, hannes, mhocko, Mel Gorman

On Fri, Aug 17, 2012 at 4:41 PM, Rik van Riel <riel@redhat.com> wrote:
> On 08/17/2012 07:34 PM, Ying Han wrote:
>>
>> On Thu, Aug 16, 2012 at 8:37 AM, Rik van Riel <riel@redhat.com> wrote:
>
>
>>> +       /*
>>> +        * Reclaim from the top scoring lruvec until we freed enough
>>> +        * pages, or its reclaim priority has halved.
>>> +        */
>>> +       do {
>>> +               shrink_lruvec(victim_lruvec, sc);
>>> +               score = reclaim_score(memcg, victim_lruvec);
>>> +       } while (sc->nr_to_reclaim > 0 && score > max_score / 2);
>>
>>
>> This would violate the user expectation of soft_limit badly,
>> especially for background reclaim where nr_to_reclaim equals to
>> ULONG_MAX.
>>
>> Here we keep hitting cgroup A and potentially push it down to
>> softlimit until the score drops to certain level. It is bad since it
>> causes "hot" memory (under softlimit) of A being reclaimed while other
>> cgroups has plenty of "cold" (above softlimit) to give out.
>
>
> Look at the function reclaim_score().
>
> Once a group drops below its soft limit, its score will
> be a factor 10000 smaller, making sure we hit the second
> exit condition.
>
> After that, we will pick another group.
>
>
>> In general, pick one cgroup to reclaim instead of round-robin is ok as
>> long as we don't reclaim further down to the softlimit. The next
>> question then is what's the next cgroup to reclaim if that doesn't
>> give us enough.
>
>
> Again, look at the function reclaim_score().
>
> If there is a group above the softlimit, we pretty much
> guarantee we will reclaim from that group.  If any reclaim
> will happen from another group, it will be absolutely
> minimal (taking recent_pressure from 0 to SWAP_CLUSTER_MAX,
> and then moving on to another group).

Seems I should really look into the numbers, which I tried to avoid at
the beginning... :(

Another way of teaching myself how it works is to run a sanity
test. Let's say I have two cgroups under root, and they are running
different workloads:

root
   ->A ( mem_alloc which keep touching its working set)
   ->B ( stream IO, like dd )

Here are the test cases off the top of my head as well as the expected
output; forget about the root cgroup for now:

case 1. A & B above softlimit
    a) score(B) > score(A), and keep reclaiming from B
    b) as long as usage(B) > softlimit(B), no reclaim on A
    c) until B under softlimit, reclaim from A

case 2. A above softlimit and B under softlimit
    a) score(A) > score(B), and keep reclaiming from A
    b) as long as usage (A) > softlimit (A), no reclaim on B
    c) until A under softlimit, then reclaim on both as case 3

case 3. A & B under softlimit
    a) score(B) > score(A), and keep reclaiming from B
    b) there should be no reclaim happen on A.

My patch delivers the functionality of case 2, but does not distribute
the pressure across memcgs as this patch does (cases 1 & 3).  Also, in
case 3 my patch would scan all the memcgs for nothing, whereas this
patch will eventually pick a memcg to reclaim. Not sure if that saves
a lot though.

Over the three cases, I would say case 2 is the basic functionality we
want to guarantee, and cases 1 and 3 are optimizations on top of that.

I would like to run the tests above; please help clarify whether they
make sense.

Thanks

--Ying


>
> --
> All rights reversed



* Re: [RFC][PATCH -mm -v2 3/4] mm,vmscan: reclaim from the highest score cgroups
  2012-08-18  0:26       ` Ying Han
@ 2012-08-18  4:02         ` Rik van Riel
  0 siblings, 0 replies; 11+ messages in thread
From: Rik van Riel @ 2012-08-18  4:02 UTC (permalink / raw)
  To: Ying Han; +Cc: linux-mm, aquini, hannes, mhocko, Mel Gorman

On 08/17/2012 08:26 PM, Ying Han wrote:

> Seems I should really look into the numbers, which i tried to avoid at
> the beginning... :(

It comes down to the same drawings we made on the white board
back in April :)

> Here are the test cases on top of my head as well as the expected
> output, forget about root cgroup for now:
>
> case 1. A & B above softlimit
>      a) score(B) > score(A), and keep reclaiming from B
>      b) as long as usage(B) > softlimit(B), no reclaim on A
>      c) until B under softlimit, reclaim from A

By reclaiming from (B), it is very possible (and likely) that
the score of (B) will be depressed below that of (A), after
which we will start reclaiming from (A).

This could happen even while both (A) and (B) are still over
their soft limits.

> case 2. A above softlimit and B under softlimit
>      a) score(A) > score(B), and keep reclaiming from A
>      b) as long as usage (A) > softlimit (A), no reclaim on B
>      c) until A under softlimit, then reclaim on both as case 3

Pretty much, yes.

If we have not scanned anything at all in (B), we might scan
SWAP_CLUSTER_MAX (32) pages in B, but that will instantly reduce
B's score by a factor 33 and get us to reclaim from (A) again.

That is 33 because we do a +1 in the calculation to avoid
division by zero :)
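
Plugging made-up numbers into the formula (a million page lruvec with a
modest scan history):

#include <stdio.h>

int main(void)
{
	unsigned long long size = 1000000, scanned = 10000, rotated = 1000;

	/* One SWAP_CLUSTER_MAX batch takes recent_pressure from 0 to 32
	 * (and recent_scanned up by the same 32 pages). */
	unsigned long long before = size * scanned / (rotated + 1) / (0 + 1);
	unsigned long long after = size * (scanned + 32)
					/ (rotated + 1) / (32 + 1);

	printf("score before: %llu\n", before);
	printf("score after:  %llu\n", after);	/* ~33x lower */
	return 0;
}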

> case 3. A & B under softlimit
>      a) score(B) > score(A), and keep reclaiming from B
>      b) there should be no reclaim happen on A.

Reclaiming from (B) will reduce B's score, so eventually we will
end up reclaiming from (A) again.

The more memory pressure one lruvec gets, the lower its score,
and the more likely that somebody else has a higher score.

> My patch delivers the functionality of case 2, but not distributing
> the pressure across memcgs as this patch does (case 1 & 3).  Also, on
> case3 where in my patch I would scan all the memcgs for nothing where
> in this patch it will eventually pick a memcg to reclaim. Not sure if
> it is a lot save though.
>
> Over the three cases, I would say case 2 is the basic functionality we
> want to guarantee and the case 1 and case 3 are optimizations on top
> of that.

There is an additional optimization that becomes possible
with my approach, and not with round robin.

Some people want to run systems with hundreds, or even
thousands of memory cgroups. Having direct reclaim iterate
over all those cgroups could have a really bad impact on
direct reclaim latency.

Once we have a scoring mechanism, we can implement a further
optimization where we sort the lruvecs, adjusting their
priority as things happen (pages get allocated, freed or
scanned), instead of every time we go through the reclaim
code.

That way it will become possible to have a system that
truly scales to large numbers of cgroups.
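
To make that concrete, something along these lines could work (purely
illustrative user-space sketch, not a proposal for the actual kernel
data structure): keep the lruvecs in a max-heap or rbtree keyed by
score, fix up a node's position whenever its score changes, and have
reclaim just look at the top.

#include <stdio.h>

#define MAX_VECS 8

/* Toy stand-in for an lruvec: an id and its current reclaim score. */
struct vec {
	int id;
	unsigned long score;
};

static struct vec heap[MAX_VECS];
static int nr_vecs;

static void swap_vec(int a, int b)
{
	struct vec t = heap[a];

	heap[a] = heap[b];
	heap[b] = t;
}

/* Restore the max-heap property after heap[i]'s score has changed. */
static void resort(int i)
{
	while (i > 0 && heap[i].score > heap[(i - 1) / 2].score) {
		swap_vec(i, (i - 1) / 2);
		i = (i - 1) / 2;
	}
	for (;;) {
		int l = 2 * i + 1, r = 2 * i + 2, big = i;

		if (l < nr_vecs && heap[l].score > heap[big].score)
			big = l;
		if (r < nr_vecs && heap[r].score > heap[big].score)
			big = r;
		if (big == i)
			break;
		swap_vec(i, big);
		i = big;
	}
}

int main(void)
{
	unsigned long scores[] = { 40, 900, 10, 300 };
	int i;

	for (i = 0; i < 4; i++) {
		heap[i].id = i;
		heap[i].score = scores[i];
		nr_vecs = i + 1;
		resort(i);
	}
	/* Reclaim always looks at heap[0]; rescoring is O(log n). */
	printf("victim: vec %d (score %lu)\n", heap[0].id, heap[0].score);
	heap[0].score /= 10;	/* pressure applied, score collapses */
	resort(0);
	printf("next victim: vec %d (score %lu)\n", heap[0].id, heap[0].score);
	return 0;
}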

> I would like to run the test above and please help to clarify if they
> make sense.

The test makes sense to me.

-- 
All rights reversed



* Re: [RFC][PATCH -mm -v2 4/4] mm,vmscan: evict inactive file pages first
  2012-08-16 15:38 ` [RFC][PATCH -mm -v2 4/4] mm,vmscan: evict inactive file pages first Rik van Riel
@ 2012-08-23 23:07   ` Ying Han
  2012-08-24  3:00     ` Rik van Riel
  0 siblings, 1 reply; 11+ messages in thread
From: Ying Han @ 2012-08-23 23:07 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-mm, aquini, hannes, mhocko, Mel Gorman

On Thu, Aug 16, 2012 at 8:38 AM, Rik van Riel <riel@redhat.com> wrote:

> When a lot of streaming file IO is happening, it makes sense to
> evict just the inactive file pages and leave the other LRU lists
> alone.
>
> Likewise, when driving a cgroup hierarchy into its hard limit,
> or over its soft limit, it makes sense to pick a child cgroup
> that has lots of inactive file pages, and evict those first.
>
> Being over its soft limit is considered a stronger preference
> than just having a lot of inactive file pages, so a well behaved
> cgroup is allowed to keep its file cache when there is a "badly
> behaving" one in the same hierarchy.
>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> ---
>  mm/vmscan.c |   37 +++++++++++++++++++++++++++++++++----
>  1 files changed, 33 insertions(+), 4 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 769fdcd..2884b4f 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1576,6 +1576,19 @@ static int inactive_list_is_low(struct lruvec
> *lruvec, enum lru_list lru)
>                 return inactive_anon_is_low(lruvec);
>  }
>
> +/* If this lruvec has lots of inactive file pages, reclaim those only. */
> +static bool reclaim_file_only(struct lruvec *lruvec, struct scan_control
> *sc,
> +                             unsigned long anon, unsigned long file)
> +{
> +       if (inactive_file_is_low(lruvec))
> +               return false;
> +
> +       if (file > (anon + file) >> sc->priority)
> +               return true;
> +
> +       return false;
> +}
> +
>  static unsigned long shrink_list(enum lru_list lru, unsigned long
> nr_to_scan,
>                                  struct lruvec *lruvec, struct
> scan_control *sc)
>  {
> @@ -1658,6 +1671,14 @@ static void get_scan_count(struct lruvec *lruvec,
> struct scan_control *sc,
>                 }
>         }
>
> +       /* Lots of inactive file pages? Reclaim those only. */
> +       if (reclaim_file_only(lruvec, sc, anon, file)) {
> +               fraction[0] = 0;
> +               fraction[1] = 1;
> +               denominator = 1;
> +               goto out;
> +       }
> +
>         /*
>          * With swappiness at 100, anonymous and file have the same
> priority.
>          * This scanning priority is essentially the inverse of IO cost.
> @@ -1922,8 +1943,8 @@ static void age_recent_pressure(struct lruvec
> *lruvec, struct zone *zone)
>   * should always be larger than recent_rotated, and the size should
>   * always be larger than recent_pressure.
>   */
> -static u64 reclaim_score(struct mem_cgroup *memcg,
> -                        struct lruvec *lruvec)
> +static u64 reclaim_score(struct mem_cgroup *memcg, struct lruvec *lruvec,
> +                        struct scan_control *sc)
>  {
>         struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
>         u64 anon, file;
> @@ -1949,6 +1970,14 @@ static u64 reclaim_score(struct mem_cgroup *memcg,
>                 anon *= 10000;
>         }
>
> +       /*
> +        * Prefer reclaiming from an lruvec with lots of inactive file
> +        * pages. Once those have been reclaimed, the score will drop so
> +        * far we will pick another lruvec to reclaim from.
> +        */
> +       if (reclaim_file_only(lruvec, sc, anon, file))
> +               file *= 100;
> +
>         return max(anon, file);
>  }
>
> @@ -1977,7 +2006,7 @@ static void shrink_zone(struct zone *zone, struct
> scan_control *sc)
>
>                 age_recent_pressure(lruvec, zone);
>
> -               score = reclaim_score(memcg, lruvec);
> +               score = reclaim_score(memcg, lruvec, sc);
>
>                 /* Pick the lruvec with the highest score. */
>                 if (score > max_score) {
> @@ -2002,7 +2031,7 @@ static void shrink_zone(struct zone *zone, struct
> scan_control *sc)
>          */
>         do {
>                 shrink_lruvec(victim_lruvec, sc);
> -               score = reclaim_score(memcg, victim_lruvec);
> +               score = reclaim_score(memcg, victim_lruvec, sc);
>

I wonder if you meant s/memcg/victim_memcg here.

--Ying


>         } while (sc->nr_to_reclaim > 0 && score > max_score / 2);
>
>         mem_cgroup_put(victim_memcg);
>
>


* Re: [RFC][PATCH -mm -v2 4/4] mm,vmscan: evict inactive file pages first
  2012-08-23 23:07   ` Ying Han
@ 2012-08-24  3:00     ` Rik van Riel
  0 siblings, 0 replies; 11+ messages in thread
From: Rik van Riel @ 2012-08-24  3:00 UTC (permalink / raw)
  To: Ying Han; +Cc: linux-mm, aquini, hannes, mhocko, Mel Gorman

On 08/23/2012 07:07 PM, Ying Han wrote:
>
>
> On Thu, Aug 16, 2012 at 8:38 AM, Rik van Riel <riel@redhat.com
> <mailto:riel@redhat.com>> wrote:
>
>     When a lot of streaming file IO is happening, it makes sense to
>     evict just the inactive file pages and leave the other LRU lists
>     alone.
>
>     Likewise, when driving a cgroup hierarchy into its hard limit,
>     or over its soft limit, it makes sense to pick a child cgroup
>     that has lots of inactive file pages, and evict those first.
>
>     Being over its soft limit is considered a stronger preference
>     than just having a lot of inactive file pages, so a well behaved
>     cgroup is allowed to keep its file cache when there is a "badly
>     behaving" one in the same hierarchy.
>
>     Signed-off-by: Rik van Riel <riel@redhat.com <mailto:riel@redhat.com>>
>     ---
>       mm/vmscan.c |   37 +++++++++++++++++++++++++++++++++----
>       1 files changed, 33 insertions(+), 4 deletions(-)
>
>     diff --git a/mm/vmscan.c b/mm/vmscan.c
>     index 769fdcd..2884b4f 100644
>     --- a/mm/vmscan.c
>     +++ b/mm/vmscan.c
>     @@ -1576,6 +1576,19 @@ static int inactive_list_is_low(struct lruvec
>     *lruvec, enum lru_list lru)
>                      return inactive_anon_is_low(lruvec);
>       }
>
>     +/* If this lruvec has lots of inactive file pages, reclaim those
>     only. */
>     +static bool reclaim_file_only(struct lruvec *lruvec, struct
>     scan_control *sc,
>     +                             unsigned long anon, unsigned long file)
>     +{
>     +       if (inactive_file_is_low(lruvec))
>     +               return false;
>     +
>     +       if (file > (anon + file) >> sc->priority)
>     +               return true;
>     +
>     +       return false;
>     +}
>     +
>       static unsigned long shrink_list(enum lru_list lru, unsigned long
>     nr_to_scan,
>                                       struct lruvec *lruvec, struct
>     scan_control *sc)
>       {
>     @@ -1658,6 +1671,14 @@ static void get_scan_count(struct lruvec
>     *lruvec, struct scan_control *sc,
>                      }
>              }
>
>     +       /* Lots of inactive file pages? Reclaim those only. */
>     +       if (reclaim_file_only(lruvec, sc, anon, file)) {
>     +               fraction[0] = 0;
>     +               fraction[1] = 1;
>     +               denominator = 1;
>     +               goto out;
>     +       }
>     +
>              /*
>               * With swappiness at 100, anonymous and file have the same
>     priority.
>               * This scanning priority is essentially the inverse of IO
>     cost.
>     @@ -1922,8 +1943,8 @@ static void age_recent_pressure(struct lruvec
>     *lruvec, struct zone *zone)
>        * should always be larger than recent_rotated, and the size should
>        * always be larger than recent_pressure.
>        */
>     -static u64 reclaim_score(struct mem_cgroup *memcg,
>     -                        struct lruvec *lruvec)
>     +static u64 reclaim_score(struct mem_cgroup *memcg, struct lruvec
>     *lruvec,
>     +                        struct scan_control *sc)
>       {
>              struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
>              u64 anon, file;
>     @@ -1949,6 +1970,14 @@ static u64 reclaim_score(struct mem_cgroup
>     *memcg,
>                      anon *= 10000;
>              }
>
>     +       /*
>     +        * Prefer reclaiming from an lruvec with lots of inactive file
>     +        * pages. Once those have been reclaimed, the score will drop so
>     +        * far we will pick another lruvec to reclaim from.
>     +        */
>     +       if (reclaim_file_only(lruvec, sc, anon, file))
>     +               file *= 100;
>     +
>              return max(anon, file);
>       }
>
>     @@ -1977,7 +2006,7 @@ static void shrink_zone(struct zone *zone,
>     struct scan_control *sc)
>
>                      age_recent_pressure(lruvec, zone);
>
>     -               score = reclaim_score(memcg, lruvec);
>     +               score = reclaim_score(memcg, lruvec, sc);
>
>                      /* Pick the lruvec with the highest score. */
>                      if (score > max_score) {
>     @@ -2002,7 +2031,7 @@ static void shrink_zone(struct zone *zone,
>     struct scan_control *sc)
>               */
>              do {
>                      shrink_lruvec(victim_lruvec, sc);
>     -               score = reclaim_score(memcg, victim_lruvec);
>     +               score = reclaim_score(memcg, victim_lruvec, sc);
>
>
> I wonder if you meant s/memcg/victim_memcg here.

You are totally right, that should be victim_memcg.

Time for me to get a tree that works here, and where my patches
will apply. I got the c-state governor patches sent out for KS;
now I should be able to get some time again for cgroups stuff :)

-- 
All rights reversed



Thread overview: 11+ messages
2012-08-16 15:34 [RFC][PATCH -mm -v2 0/4] mm,vmscan: reclaim from highest score cgroup Rik van Riel
2012-08-16 15:35 ` [RFC][PATCH -mm -v2 1/4] mm,vmscan: track recent pressure on each LRU set Rik van Riel
2012-08-16 15:36 ` [RFC][PATCH -mm -v2 2/4] mm,memcontrol: export mem_cgroup_get/put Rik van Riel
2012-08-16 15:37 ` [RFC][PATCH -mm -v2 3/4] mm,vmscan: reclaim from the highest score cgroups Rik van Riel
2012-08-17 23:34   ` Ying Han
2012-08-17 23:41     ` Rik van Riel
2012-08-18  0:26       ` Ying Han
2012-08-18  4:02         ` Rik van Riel
2012-08-16 15:38 ` [RFC][PATCH -mm -v2 4/4] mm,vmscan: evict inactive file pages first Rik van Riel
2012-08-23 23:07   ` Ying Han
2012-08-24  3:00     ` Rik van Riel
