linux-mm.kvack.org archive mirror
* [PATCH] mm: make kswapd try harder to keep active pages in cache
@ 2017-05-23 14:23 Josef Bacik
  2017-05-23 17:14 ` Rik van Riel
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Josef Bacik @ 2017-05-23 14:23 UTC (permalink / raw)
  To: akpm, kernel-team, riel, hannes, linux-mm

When testing a slab heavy workload I noticed that we often would barely
reclaim anything at all from slab when kswapd started doing reclaim.
This is because we use the ratio of nr_scanned / nr_lru to determine how
much of slab we should reclaim.  But in a slab only/mostly workload we
will not have much page cache to reclaim, and thus our ratio will be
really low and not at all related to where the memory on the system is.
Instead we want to use a ratio of the reclaimable slab to the actual
reclaimable space on the system.  That way if we are slab heavy we work
harder to reclaim slab.

The other part of this that hurts is when we are running close to full
memory with our working set.  If we start putting a lot of reclaimable
slab pressure on the system (think find /, or some other silliness), we
will happily evict the active pages over the slab cache.  This is kind
of backwards as we want to do all that we can to keep the active working
set in memory and instead evict these short-lived objects.  The same
thing occurs when, say, you do a yum update of a few packages while your
working set takes up most of RAM: you end up with relatively small
inactive lists, so we reclaim active pages even though we could have
reclaimed these short-lived inactive pages.

My approach here is twofold.  First, keep track of the difference in
inactive and slab pages since the last time kswapd ran.  In the first
run this will just be the overall counts of inactive and slab, but for
each subsequent run we'll have a good idea of where the memory pressure
is coming from.  Then we use this information to put pressure on either
the inactive lists or the slab caches, depending on where the pressure
is coming from.

If this optimization does not work, then we fall back to the previous
methods of reclaiming space with a slight adjustment.  Instead of using
the overall scan rate of page cache to determine the scan rate for slab,
we instead use the total usage of slab compared to the reclaimable page
cache on the box.  This will allow us to put an appropriate amount of
pressure on the slab shrinkers if we are a mostly slab workload.
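
Concretely, do_shrink_slab() sizes each shrinker's scan target roughly
in proportion to nr_scanned / nr_eligible, so with this change the
pressure on a cache works out to approximately the following
(illustration only, using the constants from the current
do_shrink_slab()):

	delta = (4 * nr_slab) / shrinker->seeks;
	delta *= freeable;
	do_div(delta, nr_reclaim + 1);

On a box whose reclaimable memory is mostly slab this scans a
correspondingly large share of each cache, instead of the almost
nothing we get from the nr_scanned / nr_lru ratio today.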

I have two tests I was using to watch either side of this problem.  The
first test kept 2 files that took up 3/4 of the memory, and then started
creating a bunch of empty files.  Without this patch we would have to
re-read both files in their entirety at least 3 times during the run.
With this patch the active pages are never evicted.

The second test was a test that would read and stat all the files in a
directory, which again would take up about 3/4 of the memory with slab
cache.  Then I cat'ed a 100GiB file into /dev/null and verified that
none of the previously cached files were evicted.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 mm/vmscan.c | 194 +++++++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 171 insertions(+), 23 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8ad39bb..481e14e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -110,11 +110,20 @@ struct scan_control {
 	/* One of the zones is ready for compaction */
 	unsigned int compaction_ready:1;
 
+	/* Only reclaim inactive page cache or slab. */
+	unsigned int inactive_only:1;
+
 	/* Incremented by the number of inactive pages that were scanned */
 	unsigned long nr_scanned;
 
 	/* Number of pages freed so far during a call to shrink_zones() */
 	unsigned long nr_reclaimed;
+
+	/* Number of inactive pages added since last kswapd run. */
+	unsigned long inactive_diff;
+
+	/* Number of slab pages added since last kswapd run. */
+	unsigned long slab_diff;
 };
 
 #ifdef ARCH_HAS_PREFETCH
@@ -308,7 +317,8 @@ EXPORT_SYMBOL(unregister_shrinker);
 static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 				    struct shrinker *shrinker,
 				    unsigned long nr_scanned,
-				    unsigned long nr_eligible)
+				    unsigned long nr_eligible,
+				    unsigned long *slab_scanned)
 {
 	unsigned long freed = 0;
 	unsigned long long delta;
@@ -409,6 +419,9 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 		next_deferred -= scanned;
 	else
 		next_deferred = 0;
+	if (slab_scanned)
+		(*slab_scanned) += scanned;
+
 	/*
 	 * move the unused scan count back into the shrinker in a
 	 * manner that handles concurrent updates. If we exhausted the
@@ -455,7 +468,8 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
 static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
 				 struct mem_cgroup *memcg,
 				 unsigned long nr_scanned,
-				 unsigned long nr_eligible)
+				 unsigned long nr_eligible,
+				 unsigned long *slab_scanned)
 {
 	struct shrinker *shrinker;
 	unsigned long freed = 0;
@@ -463,9 +477,6 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
 	if (memcg && (!memcg_kmem_enabled() || !mem_cgroup_online(memcg)))
 		return 0;
 
-	if (nr_scanned == 0)
-		nr_scanned = SWAP_CLUSTER_MAX;
-
 	if (!down_read_trylock(&shrinker_rwsem)) {
 		/*
 		 * If we would return 0, our callers would understand that we
@@ -496,7 +507,8 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
 		if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
 			sc.nid = 0;
 
-		freed += do_shrink_slab(&sc, shrinker, nr_scanned, nr_eligible);
+		freed += do_shrink_slab(&sc, shrinker, nr_scanned, nr_eligible,
+					slab_scanned);
 	}
 
 	up_read(&shrinker_rwsem);
@@ -515,7 +527,7 @@ void drop_slab_node(int nid)
 		freed = 0;
 		do {
 			freed += shrink_slab(GFP_KERNEL, nid, memcg,
-					     1000, 1000);
+					     1000, 1000, NULL);
 		} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
 	} while (freed > 10);
 }
@@ -2102,6 +2114,7 @@ enum scan_balance {
 	SCAN_FRACT,
 	SCAN_ANON,
 	SCAN_FILE,
+	SCAN_INACTIVE,
 };
 
 /*
@@ -2128,6 +2141,11 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 	unsigned long ap, fp;
 	enum lru_list lru;
 
+	if (sc->inactive_only) {
+		scan_balance = SCAN_INACTIVE;
+		goto out;
+	}
+
 	/* If we have no swap space, do not bother scanning anon pages. */
 	if (!sc->may_swap || mem_cgroup_get_nr_swap_pages(memcg) <= 0) {
 		scan_balance = SCAN_FILE;
@@ -2292,6 +2310,15 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 				scan = 0;
 			}
 			break;
+		case SCAN_INACTIVE:
+			if (file && !is_active_lru(lru)) {
+				if (scan && size > sc->nr_to_reclaim)
+					scan = sc->nr_to_reclaim;
+			} else {
+				size = 0;
+				scan = 0;
+			}
+			break;
 		default:
 			/* Look ma, no brain */
 			BUG();
@@ -2509,8 +2536,62 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 {
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	unsigned long nr_reclaimed, nr_scanned;
+	unsigned long nr_reclaim, nr_slab, total_high_wmark = 0, nr_inactive;
+	int z;
 	bool reclaimable = false;
+	bool skip_slab = false;
+
+	nr_slab = sum_zone_node_page_state(pgdat->node_id,
+					   NR_SLAB_RECLAIMABLE);
+	nr_inactive = node_page_state(pgdat, NR_INACTIVE_FILE);
+	nr_reclaim = pgdat_reclaimable_pages(pgdat);
+
+	for (z = 0; z < MAX_NR_ZONES; z++) {
+		struct zone *zone = &pgdat->node_zones[z];
+		if (!managed_zone(zone))
+			continue;
+		total_high_wmark += high_wmark_pages(zone);
+	}
+
+	/*
+	 * If we don't have a lot of inactive or slab pages then there's no
+	 * point in trying to free them exclusively; do the normal scan stuff.
+	 */
+	if (nr_inactive < total_high_wmark && nr_slab < total_high_wmark)
+		sc->inactive_only = 0;
 
+	/*
+	 * Without historical information we can't make good decisions about
+	 * ratios and where we should put pressure, so just apply pressure
+	 * based on overall consumption ratios.
+	 */
+	if (!sc->slab_diff && !sc->inactive_diff)
+		sc->inactive_only = 0;
+
+	/*
+	 * We still want to slightly prefer slab over inactive, so if the
+	 * inactive list on this node is large enough and is what is pushing
+	 * us into reclaim territory, then limit our flushing to the inactive
+	 * list for the first go around.
+	 *
+	 * The idea is that on a memcg-configured system we will still run the
+	 * memcg-aware shrinkers, which include the super block shrinkers.  So
+	 * if our steady state is keeping fs objects in cache for our workload
+	 * we'll still put a certain amount of pressure on them anyway.  To
+	 * avoid evicting things we actually care about we want to skip slab
+	 * reclaim altogether.
+	 *
+	 * However we still want to account for slab and inactive growing at the
+	 * same rate, so if that is the case just carry on shrinking inactive
+	 * and slab together.
+	 */
+	if (nr_inactive > total_high_wmark &&
+	    sc->inactive_diff > sc->slab_diff) {
+		unsigned long tmp = sc->inactive_diff >> 1;
+
+		if (tmp >= sc->slab_diff)
+			skip_slab = true;
+	}
 	do {
 		struct mem_cgroup *root = sc->target_mem_cgroup;
 		struct mem_cgroup_reclaim_cookie reclaim = {
@@ -2518,6 +2599,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 			.priority = sc->priority,
 		};
 		unsigned long node_lru_pages = 0;
+		unsigned long slab_reclaimed = 0;
+		unsigned long slab_scanned = 0;
 		struct mem_cgroup *memcg;
 
 		nr_reclaimed = sc->nr_reclaimed;
@@ -2543,10 +2626,27 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 			shrink_node_memcg(pgdat, memcg, sc, &lru_pages);
 			node_lru_pages += lru_pages;
 
-			if (memcg)
-				shrink_slab(sc->gfp_mask, pgdat->node_id,
-					    memcg, sc->nr_scanned - scanned,
-					    lru_pages);
+			/*
+			 * We don't want to put a lot of pressure on all of the
+			 * slabs if a memcg is mostly full, so use the ratio of
+			 * the lru size to the total reclaimable space on the
+			 * system.  If we have sc->inactive_only set then we
+			 * want to use the ratio of the difference between the
+			 * two since the last kswapd run so we apply pressure to
+			 * the consumer appropriately.
+			 */
+			if (memcg && !skip_slab) {
+				unsigned long numerator = lru_pages;
+				unsigned long denominator = nr_reclaim;
+				if (sc->inactive_only) {
+					numerator = sc->slab_diff;
+					denominator = sc->inactive_diff;
+				}
+				slab_reclaimed +=
+					shrink_slab(sc->gfp_mask, pgdat->node_id,
+						    memcg, numerator, denominator,
+						    &slab_scanned);
+			}
 
 			/* Record the group's reclaim efficiency */
 			vmpressure(sc->gfp_mask, memcg, false,
@@ -2570,14 +2670,18 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 			}
 		} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
 
-		/*
-		 * Shrink the slab caches in the same proportion that
-		 * the eligible LRU pages were scanned.
-		 */
-		if (global_reclaim(sc))
-			shrink_slab(sc->gfp_mask, pgdat->node_id, NULL,
-				    sc->nr_scanned - nr_scanned,
-				    node_lru_pages);
+		if (!skip_slab && global_reclaim(sc)) {
+			unsigned long numerator = nr_slab;
+			unsigned long denominator = nr_reclaim;
+			if (sc->inactive_only) {
+				numerator = sc->slab_diff;
+				denominator = sc->inactive_diff;
+			}
+			slab_reclaimed += shrink_slab(sc->gfp_mask,
+						      pgdat->node_id, NULL,
+						      numerator, denominator,
+						      &slab_scanned);
+		}
 
 		if (reclaim_state) {
 			sc->nr_reclaimed += reclaim_state->reclaimed_slab;
@@ -2589,9 +2693,27 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 			   sc->nr_scanned - nr_scanned,
 			   sc->nr_reclaimed - nr_reclaimed);
 
-		if (sc->nr_reclaimed - nr_reclaimed)
+		if (sc->nr_reclaimed - nr_reclaimed) {
 			reclaimable = true;
+		} else if (sc->inactive_only && !skip_slab) {
+			unsigned long percent;
 
+			/*
+			 * We didn't reclaim anything this go around, so the
+			 * inactive list is likely spent.  If we're reclaiming
+			 * less than half of the objects in slab that we're
+			 * scanning then just stop doing the inactive only scan.
+			 * Otherwise ramp up the pressure on the slab caches
+			 * hoping that eventually we'll start freeing enough
+			 * objects to reclaim space.
+			 */
+			percent = (slab_reclaimed * 100 / slab_scanned);
+			if (percent < 50)
+				sc->inactive_only = 0;
+			else
+				nr_slab <<= 1;
+		}
+		skip_slab = false;
 	} while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
 					 sc->nr_scanned - nr_scanned, sc));
 
@@ -3234,7 +3356,8 @@ static bool kswapd_shrink_node(pg_data_t *pgdat,
  * or lower is eligible for reclaim until at least one usable zone is
  * balanced.
  */
-static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
+static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx,
+			 unsigned long inactive_diff, unsigned long slab_diff)
 {
 	int i;
 	unsigned long nr_soft_reclaimed;
@@ -3247,6 +3370,9 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		.may_writepage = !laptop_mode,
 		.may_unmap = 1,
 		.may_swap = 1,
+		.inactive_only = 1,
+		.inactive_diff = inactive_diff,
+		.slab_diff = slab_diff,
 	};
 	count_vm_event(PAGEOUTRUN);
 
@@ -3466,7 +3592,7 @@ static int kswapd(void *p)
 	unsigned int classzone_idx = MAX_NR_ZONES - 1;
 	pg_data_t *pgdat = (pg_data_t*)p;
 	struct task_struct *tsk = current;
-
+	unsigned long nr_slab = 0, nr_inactive = 0;
 	struct reclaim_state reclaim_state = {
 		.reclaimed_slab = 0,
 	};
@@ -3496,6 +3622,7 @@ static int kswapd(void *p)
 	pgdat->kswapd_order = 0;
 	pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
 	for ( ; ; ) {
+		unsigned long slab_diff, inactive_diff;
 		bool ret;
 
 		alloc_order = reclaim_order = pgdat->kswapd_order;
@@ -3523,6 +3650,23 @@ static int kswapd(void *p)
 			continue;
 
 		/*
+		 * We want to know where we're adding pages so we can make
+		 * smarter decisions about where we're going to put pressure
+		 * when shrinking.
+		 */
+		slab_diff = sum_zone_node_page_state(pgdat->node_id,
+						     NR_SLAB_RECLAIMABLE);
+		inactive_diff = node_page_state(pgdat, NR_INACTIVE_FILE);
+		if (nr_slab > slab_diff)
+			slab_diff = 0;
+		else
+			slab_diff -= nr_slab;
+		if (inactive_diff < nr_inactive)
+			inactive_diff = 0;
+		else
+			inactive_diff -= nr_inactive;
+
+		/*
 		 * Reclaim begins at the requested order but if a high-order
 		 * reclaim fails then kswapd falls back to reclaiming for
 		 * order-0. If that happens, kswapd will consider sleeping
@@ -3532,7 +3676,11 @@ static int kswapd(void *p)
 		 */
 		trace_mm_vmscan_kswapd_wake(pgdat->node_id, classzone_idx,
 						alloc_order);
-		reclaim_order = balance_pgdat(pgdat, alloc_order, classzone_idx);
+		reclaim_order = balance_pgdat(pgdat, alloc_order, classzone_idx,
+					      inactive_diff, slab_diff);
+		nr_inactive = node_page_state(pgdat, NR_INACTIVE_FILE);
+		nr_slab = sum_zone_node_page_state(pgdat->node_id,
+						   NR_SLAB_RECLAIMABLE);
 		if (reclaim_order < alloc_order)
 			goto kswapd_try_sleep;
 	}
-- 
2.7.4

* Re: [PATCH] mm: make kswapd try harder to keep active pages in cache
  2017-05-23 14:23 [PATCH] mm: make kswapd try harder to keep active pages in cache Josef Bacik
@ 2017-05-23 17:14 ` Rik van Riel
  2017-05-23 21:57 ` Andrew Morton
  2017-05-24 17:46 ` Johannes Weiner
  2 siblings, 0 replies; 6+ messages in thread
From: Rik van Riel @ 2017-05-23 17:14 UTC (permalink / raw)
  To: Josef Bacik, akpm, kernel-team, hannes, linux-mm

On Tue, 2017-05-23 at 10:23 -0400, Josef Bacik wrote:

> My approach here is twofold.  First, keep track of the difference in
> inactive and slab pages since the last time kswapd ran.  In the first
> run this will just be the overall counts of inactive and slab, but
> for
> each subsequent run we'll have a good idea of where the memory
> pressure
> is coming from.  Then we use this information to put pressure on
> either
> the inactive lists or the slab caches, depending on where the
> pressure
> is coming from.

> Signed-off-by: Josef Bacik <jbacik@fb.com>

This looks totally reasonable to me.

Acked-by: Rik van Riel <riel@redhat.com>

* Re: [PATCH] mm: make kswapd try harder to keep active pages in cache
  2017-05-23 14:23 [PATCH] mm: make kswapd try harder to keep active pages in cache Josef Bacik
  2017-05-23 17:14 ` Rik van Riel
@ 2017-05-23 21:57 ` Andrew Morton
  2017-05-24 17:46 ` Johannes Weiner
  2 siblings, 0 replies; 6+ messages in thread
From: Andrew Morton @ 2017-05-23 21:57 UTC (permalink / raw)
  To: Josef Bacik; +Cc: kernel-team, riel, hannes, linux-mm

On Tue, 23 May 2017 10:23:23 -0400 Josef Bacik <josef@toxicpanda.com> wrote:

> When testing a slab heavy workload I noticed that we often would barely
> reclaim anything at all from slab when kswapd started doing reclaim.
> This is because we use the ratio of nr_scanned / nr_lru to determine how
> much of slab we should reclaim.  But in a slab only/mostly workload we
> will not have much page cache to reclaim, and thus our ratio will be
> really low and not at all related to where the memory on the system is.
> Instead we want to use a ratio of the reclaimable slab to the actual
> reclaimable space on the system.  That way if we are slab heavy we work
> harder to reclaim slab.
> 
> The other part of this that hurts is when we are running close to full
> memory with our working set.  If we start putting a lot of reclaimable
> slab pressure on the system (think find /, or some other silliness), we
> will happily evict the active pages over the slab cache.  This is kind
> of backwards as we want to do all that we can to keep the active working
> set in memory, and instead evict these short lived objects.  The same
> thing occurs when say you do a yum update of a few packages while your
> working set takes up most of RAM, you end up with inactive lists being
> relatively small and so we reclaim active pages even though we could
> reclaim these short lived inactive pages.
> 
> My approach here is twofold.  First, keep track of the difference in
> inactive and slab pages since the last time kswapd ran.  In the first
> run this will just be the overall counts of inactive and slab, but for
> each subsequent run we'll have a good idea of where the memory pressure
> is coming from.  Then we use this information to put pressure on either
> the inactive lists or the slab caches, depending on where the pressure
> is coming from.
>
> ...
>

hm, that's a pretty big change.  I took it, but it will require quite
some reviewing and testing to get further, please.

* Re: [PATCH] mm: make kswapd try harder to keep active pages in cache
  2017-05-23 14:23 [PATCH] mm: make kswapd try harder to keep active pages in cache Josef Bacik
  2017-05-23 17:14 ` Rik van Riel
  2017-05-23 21:57 ` Andrew Morton
@ 2017-05-24 17:46 ` Johannes Weiner
  2017-05-24 18:50   ` Josef Bacik
  2 siblings, 1 reply; 6+ messages in thread
From: Johannes Weiner @ 2017-05-24 17:46 UTC (permalink / raw)
  To: Josef Bacik; +Cc: akpm, kernel-team, riel, linux-mm

Hi Josef,

On Tue, May 23, 2017 at 10:23:23AM -0400, Josef Bacik wrote:
> @@ -308,7 +317,8 @@ EXPORT_SYMBOL(unregister_shrinker);
>  static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>  				    struct shrinker *shrinker,
>  				    unsigned long nr_scanned,
> -				    unsigned long nr_eligible)
> +				    unsigned long nr_eligible,
> +				    unsigned long *slab_scanned)

Once you pass in pool size ratios here, nr_scanned and nr_eligible
become confusing. Can you update the names?

> @@ -2292,6 +2310,15 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>  				scan = 0;
>  			}
>  			break;
> +		case SCAN_INACTIVE:
> +			if (file && !is_active_lru(lru)) {
> +				if (scan && size > sc->nr_to_reclaim)
> +					scan = sc->nr_to_reclaim;

Why is the scan target different than with regular cache reclaim? I'd
expect that we only need to zero the active list sizes here, not that
we'd also need any further updates to 'scan'.

> @@ -2509,8 +2536,62 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  {
>  	struct reclaim_state *reclaim_state = current->reclaim_state;
>  	unsigned long nr_reclaimed, nr_scanned;
> +	unsigned long nr_reclaim, nr_slab, total_high_wmark = 0, nr_inactive;
> +	int z;
>  	bool reclaimable = false;
> +	bool skip_slab = false;
> +
> +	nr_slab = sum_zone_node_page_state(pgdat->node_id,
> +					   NR_SLAB_RECLAIMABLE);
> +	nr_inactive = node_page_state(pgdat, NR_INACTIVE_FILE);
> +	nr_reclaim = pgdat_reclaimable_pages(pgdat);
> +
> +	for (z = 0; z < MAX_NR_ZONES; z++) {
> +		struct zone *zone = &pgdat->node_zones[z];
> +		if (!managed_zone(zone))
> +			continue;
> +		total_high_wmark += high_wmark_pages(zone);
> +	}

This function is used for memcg target reclaim, in which case you're
only looking at a subset of the pgdats and zones. Any pgdat or zone
state read here would be scoped incorrectly; and the ratios on the
node level are independent from ratios on the cgroup level and can
diverge heavily from each other.

These size inquiries to drive the balancing will have to be made
inside the memcg iteration loop further down with per-cgroup numbers.

> +	/*
> +	 * If we don't have a lot of inactive or slab pages then there's no
> +	 * point in trying to free them exclusively, do the normal scan stuff.
> +	 */
> +	if (nr_inactive < total_high_wmark && nr_slab < total_high_wmark)
> +		sc->inactive_only = 0;

Yes, we need something like this, to know when to fall back to full
reclaim. Cgroups don't have high watermarks, but I guess some magic
number for "too few pages" could do the trick.

> +	/*
> +	 * We don't have historical information, we can't make good decisions
> +	 * about ratio's and where we should put pressure, so just apply
> +	 * pressure based on overall consumption ratios.
> +	 */
> +	if (!sc->slab_diff && !sc->inactive_diff)
> +		sc->inactive_only = 0;

This one I'm not sure about. If we have enough slabs and inactive
pages, why shouldn't we go for them first anyway - regardless of
whether they have grown since the last reclaim invocation?

> @@ -2543,10 +2626,27 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  			shrink_node_memcg(pgdat, memcg, sc, &lru_pages);
>  			node_lru_pages += lru_pages;
>  
> -			if (memcg)
> -				shrink_slab(sc->gfp_mask, pgdat->node_id,
> -					    memcg, sc->nr_scanned - scanned,
> -					    lru_pages);
> +			/*
> +			 * We don't want to put a lot of pressure on all of the
> +			 * slabs if a memcg is mostly full, so use the ratio of
> +			 * the lru size to the total reclaimable space on the
> +			 * system.  If we have sc->inactive_only set then we
> +			 * want to use the ratio of the difference between the
> +			 * two since the last kswapd run so we apply pressure to
> +			 * the consumer appropriately.
> +			 */
> +			if (memcg && !skip_slab) {
> +				unsigned long numerator = lru_pages;
> +				unsigned long denominator = nr_reclaim;

I don't quite follow this.

It calculates the share of this memcg's pages on the node, which is
the ratio we should apply to the global slab pool to have equivalent
pressure. However, it's being applied to the *memcg's* share of slab
pages. This means that the smaller the memcg relative to the node, the
less of its tiny share of slab objects we reclaim.

We're not translating from fraction to total, we're translating from
fraction to fraction. Shouldn't the ratio be always 1:1?

For example, if there is only one cgroup on the node, the ratio would
be 1 share of LRU pages and 1 share of slab pages. But if there are
two cgroups, we still scan one share of each cgroup's LRU pages but
only half a share of each cgroup's slab pages. That doesn't add up.

Am I missing something?

> +				if (sc->inactive_only) {
> +					numerator = sc->slab_diff;
> +					denominator = sc->inactive_diff;
> +				}

Shouldn't these diffs already be reflected in the pool sizes? If we
scan pools proportional to their size, we also go more aggressively
for the one that grows faster relative to the other one, right?

I guess this *could* be more adaptive to fluctuations, but I wonder if
that's an optimization that could be split out into a separate patch,
to make it easier to review on its own merit. Start with a simple size
based balancing in the first patch, add improved adaptiveness after.

As mentioned above, this function is used not just by kswapd but also
by direct reclaim, which doesn't initialize these fields and so always
passes 0:0. We should be able to retain sensible balancing for them as
well, but it would require moving the diff sampling.

To make it work for cgroup reclaim, it would have to look at the
growths not on the node level, but on the lruvec level in
get_scan_count() or thereabouts.

Anyway, I think it might be less confusing to nail down the size-based
pressure balancing for slab caches first, and then do the recent diff
balancing on top of it as an optimization.

* Re: [PATCH] mm: make kswapd try harder to keep active pages in cache
  2017-05-24 17:46 ` Johannes Weiner
@ 2017-05-24 18:50   ` Josef Bacik
  2017-05-30 18:03     ` Johannes Weiner
  0 siblings, 1 reply; 6+ messages in thread
From: Josef Bacik @ 2017-05-24 18:50 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Josef Bacik, akpm, kernel-team, riel, linux-mm

On Wed, May 24, 2017 at 01:46:10PM -0400, Johannes Weiner wrote:
> Hi Josef,
> 
> On Tue, May 23, 2017 at 10:23:23AM -0400, Josef Bacik wrote:
> > @@ -308,7 +317,8 @@ EXPORT_SYMBOL(unregister_shrinker);
> >  static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> >  				    struct shrinker *shrinker,
> >  				    unsigned long nr_scanned,
> > -				    unsigned long nr_eligible)
> > +				    unsigned long nr_eligible,
> > +				    unsigned long *slab_scanned)
> 
> Once you pass in pool size ratios here, nr_scanned and nr_eligible
> become confusing. Can you update the names?
> 

Yeah I kept changing them and eventually decided my names were equally as
shitty, so I just left them.  I'll change them to something useful.

> > @@ -2292,6 +2310,15 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
> >  				scan = 0;
> >  			}
> >  			break;
> > +		case SCAN_INACTIVE:
> > +			if (file && !is_active_lru(lru)) {
> > +				if (scan && size > sc->nr_to_reclaim)
> > +					scan = sc->nr_to_reclaim;
> 
> Why is the scan target different than with regular cache reclaim? I'd
> expect that we only need to zero the active list sizes here, not that
> we'd also need any further updates to 'scan'.
> 

Huh I actually screwed this up slightly from what I wanted.  Since

scan = size >> sc->priority

we'd sometimes end up with scan < nr_to_reclaim, but since we're only scanning
inactive we really want to try as hard as possible to reclaim what we need from
inactive.  What I should have done is something like

scan = max(sc->nr_to_reclaim, scan);

instead, I'll fix that.

> > @@ -2509,8 +2536,62 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> >  {
> >  	struct reclaim_state *reclaim_state = current->reclaim_state;
> >  	unsigned long nr_reclaimed, nr_scanned;
> > +	unsigned long nr_reclaim, nr_slab, total_high_wmark = 0, nr_inactive;
> > +	int z;
> >  	bool reclaimable = false;
> > +	bool skip_slab = false;
> > +
> > +	nr_slab = sum_zone_node_page_state(pgdat->node_id,
> > +					   NR_SLAB_RECLAIMABLE);
> > +	nr_inactive = node_page_state(pgdat, NR_INACTIVE_FILE);
> > +	nr_reclaim = pgdat_reclaimable_pages(pgdat);
> > +
> > +	for (z = 0; z < MAX_NR_ZONES; z++) {
> > +		struct zone *zone = &pgdat->node_zones[z];
> > +		if (!managed_zone(zone))
> > +			continue;
> > +		total_high_wmark += high_wmark_pages(zone);
> > +	}
> 
> This function is used for memcg target reclaim, in which case you're
> only looking at a subset of the pgdats and zones. Any pgdat or zone
> state read here would be scoped incorrectly; and the ratios on the
> node level are independent from ratios on the cgroup level and can
> diverge heavily from each other.
> 
> These size inquiries to drive the balancing will have to be made
> inside the memcg iteration loop further down with per-cgroup numbers.
> 

Ok so I suppose I need to look at the actual lru list sizes instead for these
numbers for !global_reclaim(sc)?
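
Something along these lines maybe, just as an untested sketch:

	/* memcg reclaim: size things per-lruvec instead of per-node */
	struct lruvec *lruvec = mem_cgroup_lruvec(pgdat, memcg);
	unsigned long nr_inactive = lruvec_lru_size(lruvec, LRU_INACTIVE_FILE,
						    MAX_NR_ZONES);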

> > +	/*
> > +	 * If we don't have a lot of inactive or slab pages then there's no
> > +	 * point in trying to free them exclusively, do the normal scan stuff.
> > +	 */
> > +	if (nr_inactive < total_high_wmark && nr_slab < total_high_wmark)
> > +		sc->inactive_only = 0;
> 
> Yes, we need something like this, to know when to fall back to full
> reclaim. Cgroups don't have high watermarks, but I guess some magic
> number for "too few pages" could do the trick.
> 
> > +	/*
> > +	 * We don't have historical information, we can't make good decisions
> > +	 * about ratio's and where we should put pressure, so just apply
> > +	 * pressure based on overall consumption ratios.
> > +	 */
> > +	if (!sc->slab_diff && !sc->inactive_diff)
> > +		sc->inactive_only = 0;
> 
> This one I'm not sure about. If we have enough slabs and inactive
> pages, why shouldn't we go for them first anyway - regardless of
> whether they have grown since the last reclaim invocation?
> 

Because we use them for the ratio of where to put pressure, but I suppose I
could just drop this and do

foo = max(sc->slab_diff, 1);
bar = max(sc->inactive_diff, 1);

so if we have no historical information we just equally scan both.  I'll do that
instead.

> > @@ -2543,10 +2626,27 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> >  			shrink_node_memcg(pgdat, memcg, sc, &lru_pages);
> >  			node_lru_pages += lru_pages;
> >  
> > -			if (memcg)
> > -				shrink_slab(sc->gfp_mask, pgdat->node_id,
> > -					    memcg, sc->nr_scanned - scanned,
> > -					    lru_pages);
> > +			/*
> > +			 * We don't want to put a lot of pressure on all of the
> > +			 * slabs if a memcg is mostly full, so use the ratio of
> > +			 * the lru size to the total reclaimable space on the
> > +			 * system.  If we have sc->inactive_only set then we
> > +			 * want to use the ratio of the difference between the
> > +			 * two since the last kswapd run so we apply pressure to
> > +			 * the consumer appropriately.
> > +			 */
> > +			if (memcg && !skip_slab) {
> > +				unsigned long numerator = lru_pages;
> > +				unsigned long denominator = nr_reclaim;
> 
> I don't quite follow this.
> 
> It calculates the share of this memcg's pages on the node, which is
> the ratio we should apply to the global slab pool to have equivalent
> pressure. However, it's being applied to the *memcg's* share of slab
> pages. This means that the smaller the memcg relative to the node, the
> less of its tiny share of slab objects we reclaim.
> 
> We're not translating from fraction to total, we're translating from
> fraction to fraction. Shouldn't the ratio be always 1:1?
> 
> For example, if there is only one cgroup on the node, the ratio would
> be 1 share of LRU pages and 1 share of slab pages. But if there are
> two cgroups, we still scan one share of each cgroup's LRU pages but
> only half a share of each cgroup's slab pages. That doesn't add up.
> 
> Am I missing something?
> 

We hashed this out offline, but basically we concluded to add a memcg specific
slab reclaimable counter so we can make these ratios be consistent with the
global ratios.

> > +				if (sc->inactive_only) {
> > +					numerator = sc->slab_diff;
> > +					denominator = sc->inactive_diff;
> > +				}
> 
> Shouldn't these diffs already be reflected in the pool sizes? If we
> scan pools proportional to their size, we also go more aggressively
> for the one that grows faster relative to the other one, right?
> 

Sure, unless the aggressive growth is from a different cgroup; we want to apply
proportional pressure everywhere.  I suppose that should only be done in the
global reclaim case.

> I guess this *could* be more adaptive to fluctuations, but I wonder if
> that's an optimization that could be split out into a separate patch,
> to make it easier to review on its own merit. Start with a simple size
> based balancing in the first patch, add improved adaptiveness after.
> 
> As mentioned above, this function is used not just by kswapd but also
> by direct reclaim, which doesn't initialize these fields and so always
> passes 0:0. We should be able to retain sensible balancing for them as
> well, but it would require moving the diff sampling.
> 
> To make it work for cgroup reclaim, it would have to look at the
> growths not on the node level, but on the lruvec level in
> get_scan_count() or thereabouts.
> 
> Anyway, I think it might be less confusing to nail down the size-based
> pressure balancing for slab caches first, and then do the recent diff
> balancing on top of it as an optimization.

Yeah I had it all separate but it got kind of weird and hard to tell which part
was needed where.  Now that I've taken a step back I see where I can split it
up, so I'll fix these things and split the patches up.  Thanks!

Josef

* Re: [PATCH] mm: make kswapd try harder to keep active pages in cache
  2017-05-24 18:50   ` Josef Bacik
@ 2017-05-30 18:03     ` Johannes Weiner
  0 siblings, 0 replies; 6+ messages in thread
From: Johannes Weiner @ 2017-05-30 18:03 UTC (permalink / raw)
  To: Josef Bacik; +Cc: akpm, kernel-team, riel, linux-mm

On Wed, May 24, 2017 at 02:50:42PM -0400, Josef Bacik wrote:
> On Wed, May 24, 2017 at 01:46:10PM -0400, Johannes Weiner wrote:
> > > @@ -2292,6 +2310,15 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
> > >  				scan = 0;
> > >  			}
> > >  			break;
> > > +		case SCAN_INACTIVE:
> > > +			if (file && !is_active_lru(lru)) {
> > > +				if (scan && size > sc->nr_to_reclaim)
> > > +					scan = sc->nr_to_reclaim;
> > 
> > Why is the scan target different than with regular cache reclaim? I'd
> > expect that we only need to zero the active list sizes here, not that
> > we'd also need any further updates to 'scan'.
> 
> Huh I actually screwed this up slightly from what I wanted.  Since
> 
> scan = size >> sc->priority
> 
> we'd sometimes end up with scan < nr_to_reclaim, but since we're only scanning
> inactive we really want to try as hard as possible to reclaim what we need from
> inactive.  What I should have done is something like
> 
> scan = max(sc->nr_to_reclaim, scan);
> 
> instead, I'll fix that.

I see what you're saying.

But why not leave it to the priority level going up until some groups'
inactive lists make the cut? If this one node or cgroup doesn't have
size >> priority number of inactive pages, the next one might. And we
should reclaim the bigger group's inactive pages first rather than
putting disproportionately high pressure on the smaller ones.
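
(To make that concrete: with scan = size >> sc->priority and DEF_PRIORITY
at 12, a 100k-page inactive list is asked for ~24 pages at priority 12,
~97 at priority 10 and ~390 at priority 8, so the bigger lists naturally
start contributing first as the priority drops.)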

> > > @@ -2509,8 +2536,62 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> > >  {
> > >  	struct reclaim_state *reclaim_state = current->reclaim_state;
> > >  	unsigned long nr_reclaimed, nr_scanned;
> > > +	unsigned long nr_reclaim, nr_slab, total_high_wmark = 0, nr_inactive;
> > > +	int z;
> > >  	bool reclaimable = false;
> > > +	bool skip_slab = false;
> > > +
> > > +	nr_slab = sum_zone_node_page_state(pgdat->node_id,
> > > +					   NR_SLAB_RECLAIMABLE);
> > > +	nr_inactive = node_page_state(pgdat, NR_INACTIVE_FILE);
> > > +	nr_reclaim = pgdat_reclaimable_pages(pgdat);
> > > +
> > > +	for (z = 0; z < MAX_NR_ZONES; z++) {
> > > +		struct zone *zone = &pgdat->node_zones[z];
> > > +		if (!managed_zone(zone))
> > > +			continue;
> > > +		total_high_wmark += high_wmark_pages(zone);
> > > +	}
> > 
> > This function is used for memcg target reclaim, in which case you're
> > only looking at a subset of the pgdats and zones. Any pgdat or zone
> > state read here would be scoped incorrectly; and the ratios on the
> > node level are independent from ratios on the cgroup level and can
> > diverge heavily from each other.
> > 
> > These size inquiries to drive the balancing will have to be made
> > inside the memcg iteration loop further down with per-cgroup numbers.
> > 
> 
> Ok so I suppose I need to look at the actual lru list sizes instead for these
> numbers for !global_reclaim(sc)?

Yeah, you have to work against the lruvecs. That is the right scope
for both global reclaim and cgroup limit reclaim.

> > > +	/*
> > > +	 * We don't have historical information, we can't make good decisions
> > > +	 * about ratio's and where we should put pressure, so just apply
> > > +	 * pressure based on overall consumption ratios.
> > > +	 */
> > > +	if (!sc->slab_diff && !sc->inactive_diff)
> > > +		sc->inactive_only = 0;
> > 
> > This one I'm not sure about. If we have enough slabs and inactive
> > pages, why shouldn't we go for them first anyway - regardless of
> > whether they have grown since the last reclaim invocation?
> > 
> 
> Because we use them for the ratio of where to put pressure, but I suppose I
> could just drop this and do
> 
> foo = max(sc->slab_diff, 1);
> bar = max(sc->inactive_diff, 1);
> 
> so if we have no historical information we just equally scan both.  I'll do that
> instead.

Okay, makes sense.

> > > @@ -2543,10 +2626,27 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> > >  			shrink_node_memcg(pgdat, memcg, sc, &lru_pages);
> > >  			node_lru_pages += lru_pages;
> > >  
> > > -			if (memcg)
> > > -				shrink_slab(sc->gfp_mask, pgdat->node_id,
> > > -					    memcg, sc->nr_scanned - scanned,
> > > -					    lru_pages);
> > > +			/*
> > > +			 * We don't want to put a lot of pressure on all of the
> > > +			 * slabs if a memcg is mostly full, so use the ratio of
> > > +			 * the lru size to the total reclaimable space on the
> > > +			 * system.  If we have sc->inactive_only set then we
> > > +			 * want to use the ratio of the difference between the
> > > +			 * two since the last kswapd run so we apply pressure to
> > > +			 * the consumer appropriately.
> > > +			 */
> > > +			if (memcg && !skip_slab) {
> > > +				unsigned long numerator = lru_pages;
> > > +				unsigned long denominator = nr_reclaim;
> > 
> > I don't quite follow this.
> > 
> > It calculates the share of this memcg's pages on the node, which is
> > the ratio we should apply to the global slab pool to have equivalent
> > pressure. However, it's being applied to the *memcg's* share of slab
> > pages. This means that the smaller the memcg relative to the node, the
> > less of its tiny share of slab objects we reclaim.
> > 
> > We're not translating from fraction to total, we're translating from
> > fraction to fraction. Shouldn't the ratio be always 1:1?
> > 
> > For example, if there is only one cgroup on the node, the ratio would
> > be 1 share of LRU pages and 1 share of slab pages. But if there are
> > two cgroups, we still scan one share of each cgroup's LRU pages but
> > only half a share of each cgroup's slab pages. That doesn't add up.
> > 
> > Am I missing something?
> > 
> 
> We hashed this out offline, but basically we concluded to add a memcg specific
> slab reclaimable counter so we can make these ratios be consistent with the
> global ratios.

Yes, that should make things a lot easier. I'm going to send out
patches for this after this email. They'll add lruvec_page_state() and
a NR_SLAB_RECLAIMABLE counter which allows you to directly compare the
slab cache and the page cache for any given lruvec you're reclaiming.
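
Once those land, the balancing here could be driven by something like
this (untested sketch, using the names from those patches):

	unsigned long slab = lruvec_page_state(lruvec, NR_SLAB_RECLAIMABLE);
	unsigned long file = lruvec_lru_size(lruvec, LRU_INACTIVE_FILE, MAX_NR_ZONES) +
			     lruvec_lru_size(lruvec, LRU_ACTIVE_FILE, MAX_NR_ZONES);

	shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, slab, slab + file,
		    &slab_scanned);

i.e. scan slab in proportion to its share of the lruvec's reclaimable
memory, analogous to the nr_slab / nr_reclaim ratio the patch uses
globally.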
