All of lore.kernel.org
 help / color / mirror / Atom feed
From: Josef Bacik <josef@toxicpanda.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <josef@toxicpanda.com>,
	akpm@linux-foundation.org, kernel-team@fb.com, riel@redhat.com,
	linux-mm@kvack.org
Subject: Re: [PATCH] mm: make kswapd try harder to keep active pages in cache
Date: Wed, 24 May 2017 14:50:42 -0400	[thread overview]
Message-ID: <20170524185040.GA14869@destiny> (raw)
In-Reply-To: <20170524174610.GB22174@cmpxchg.org>

On Wed, May 24, 2017 at 01:46:10PM -0400, Johannes Weiner wrote:
> Hi Josef,
> 
> On Tue, May 23, 2017 at 10:23:23AM -0400, Josef Bacik wrote:
> > @@ -308,7 +317,8 @@ EXPORT_SYMBOL(unregister_shrinker);
> >  static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> >  				    struct shrinker *shrinker,
> >  				    unsigned long nr_scanned,
> > -				    unsigned long nr_eligible)
> > +				    unsigned long nr_eligible,
> > +				    unsigned long *slab_scanned)
> 
> Once you pass in pool size ratios here, nr_scanned and nr_eligible
> become confusing. Can you update the names?
> 

Yeah I kept changing them and eventually decided my names were equally as
shitty, so I just left them.  I'll change them to something useful.

> > @@ -2292,6 +2310,15 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
> >  				scan = 0;
> >  			}
> >  			break;
> > +		case SCAN_INACTIVE:
> > +			if (file && !is_active_lru(lru)) {
> > +				if (scan && size > sc->nr_to_reclaim)
> > +					scan = sc->nr_to_reclaim;
> 
> Why is the scan target different than with regular cache reclaim? I'd
> expect that we only need to zero the active list sizes here, not that
> we'd also need any further updates to 'scan'.
> 

Huh I actually screwed this up slightly from what I wanted.  Since

scan = size >> sc->priority

we'd sometimes end up with scan < nr_to_reclaim, but since we're only scanning
inactive we really want to try as hard as possible to reclaim what we need from
inactive.  What I should have done is something like

scan = max(sc->nr_to_reclaim, scan);

instead, I'll fix that.

> > @@ -2509,8 +2536,62 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> >  {
> >  	struct reclaim_state *reclaim_state = current->reclaim_state;
> >  	unsigned long nr_reclaimed, nr_scanned;
> > +	unsigned long nr_reclaim, nr_slab, total_high_wmark = 0, nr_inactive;
> > +	int z;
> >  	bool reclaimable = false;
> > +	bool skip_slab = false;
> > +
> > +	nr_slab = sum_zone_node_page_state(pgdat->node_id,
> > +					   NR_SLAB_RECLAIMABLE);
> > +	nr_inactive = node_page_state(pgdat, NR_INACTIVE_FILE);
> > +	nr_reclaim = pgdat_reclaimable_pages(pgdat);
> > +
> > +	for (z = 0; z < MAX_NR_ZONES; z++) {
> > +		struct zone *zone = &pgdat->node_zones[z];
> > +		if (!managed_zone(zone))
> > +			continue;
> > +		total_high_wmark += high_wmark_pages(zone);
> > +	}
> 
> This function is used for memcg target reclaim, in which case you're
> only looking at a subset of the pgdats and zones. Any pgdat or zone
> state read here would be scoped incorrectly; and the ratios on the
> node level are independent from ratios on the cgroup level and can
> diverge heavily from each other.
> 
> These size inquiries to drive the balancing will have to be made
> inside the memcg iteration loop further down with per-cgroup numbers.
> 

Ok so I suppose I need to look at the actual lru list sizes instead for these
numbers for !global_reclaim(sc)?

> > +	/*
> > +	 * If we don't have a lot of inactive or slab pages then there's no
> > +	 * point in trying to free them exclusively, do the normal scan stuff.
> > +	 */
> > +	if (nr_inactive < total_high_wmark && nr_slab < total_high_wmark)
> > +		sc->inactive_only = 0;
> 
> Yes, we need something like this, to know when to fall back to full
> reclaim. Cgroups don't have high watermarks, but I guess some magic
> number for "too few pages" could do the trick.
> 
> > +	/*
> > +	 * We don't have historical information, we can't make good decisions
> > +	 * about ratio's and where we should put pressure, so just apply
> > +	 * pressure based on overall consumption ratios.
> > +	 */
> > +	if (!sc->slab_diff && !sc->inactive_diff)
> > +		sc->inactive_only = 0;
> 
> This one I'm not sure about. If we have enough slabs and and inactive
> pages why shouldn't we go for them first anyway - regardless of
> whether they have grown since the last reclaim invocation?
> 

Because we use them for the ratio of where to put pressure, but I suppose I
could just drop this and do

foo = max(sc->slab_diff, 1);
bar = max(sc->inactive_diff, 1);

so if we have no historical information we just equally scan both.  I'll do that
instead.

> > @@ -2543,10 +2626,27 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> >  			shrink_node_memcg(pgdat, memcg, sc, &lru_pages);
> >  			node_lru_pages += lru_pages;
> >  
> > -			if (memcg)
> > -				shrink_slab(sc->gfp_mask, pgdat->node_id,
> > -					    memcg, sc->nr_scanned - scanned,
> > -					    lru_pages);
> > +			/*
> > +			 * We don't want to put a lot of pressure on all of the
> > +			 * slabs if a memcg is mostly full, so use the ratio of
> > +			 * the lru size to the total reclaimable space on the
> > +			 * system.  If we have sc->inactive_only set then we
> > +			 * want to use the ratio of the difference between the
> > +			 * two since the last kswapd run so we apply pressure to
> > +			 * the consumer appropriately.
> > +			 */
> > +			if (memcg && !skip_slab) {
> > +				unsigned long numerator = lru_pages;
> > +				unsigned long denominator = nr_reclaim;
> 
> I don't quite follow this.
> 
> It calculates the share of this memcg's pages on the node, which is
> the ratio we should apply to the global slab pool to have equivalent
> pressure. However, it's being applied to the *memcg's* share of slab
> pages. This means that the smaller the memcg relative to the node, the
> less of its tiny share of slab objects we reclaim.
> 
> We're not translating from fraction to total, we're translating from
> fraction to fraction. Shouldn't the ratio be always 1:1?
> 
> For example, if there is only one cgroup on the node, the ratio would
> be 1 share of LRU pages and 1 share of slab pages. But if there are
> two cgroups, we still scan one share of each cgroup's LRU pages but
> only half a share of each cgroup's slab pages. That doesn't add up.
> 
> Am I missing something?
> 

We hashed this out offline, but basically we concluded to add a memcg specific
slab reclaimable counter so we can make these ratios be consistent with the
global ratios.

> > +				if (sc->inactive_only) {
> > +					numerator = sc->slab_diff;
> > +					denominator = sc->inactive_diff;
> > +				}
> 
> Shouldn't these diffs already be reflected in the pool sizes? If we
> scan pools proportional to their size, we also go more aggressively
> for the one that grows faster relative to the other one, right?
> 

Sure unless the aggressive growth is from a different cgroup, we want to apply
proportional pressure everywhere.  I suppose that should only be done in the
global reclaim case.

> I guess this *could* be more adaptive to fluctuations, but I wonder if
> that's an optimization that could be split out into a separate patch,
> to make it easier to review on its own merit. Start with a simple size
> based balancing in the first patch, add improved adaptiveness after.
> 
> As mentioned above, this function is used not just by kswapd but also
> by direct reclaim, which doesn't initialize these fields and so always
> passes 0:0. We should be able to retain sensible balancing for them as
> well, but it would require moving the diff sampling.
> 
> To make it work for cgroup reclaim, it would have to look at the
> growths not on the node level, but on the lruvec level in
> get_scan_count() or thereabouts.
> 
> Anyway, I think it might be less confusing to nail down the size-based
> pressure balancing for slab caches first, and then do the recent diff
> balancing on top of it as an optimization.

Yeah I had it all separate but it got kind of weird and hard to tell which part
was needed where.  Now that I've taken a step back I see where I can split it
up, so I'll fix these things and split the patches up.  Thanks!

Josef

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2017-05-24 18:50 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-05-23 14:23 [PATCH] mm: make kswapd try harder to keep active pages in cache Josef Bacik
2017-05-23 17:14 ` Rik van Riel
2017-05-23 21:57 ` Andrew Morton
2017-05-24 17:46 ` Johannes Weiner
2017-05-24 18:50   ` Josef Bacik [this message]
2017-05-30 18:03     ` Johannes Weiner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170524185040.GA14869@destiny \
    --to=josef@toxicpanda.com \
    --cc=akpm@linux-foundation.org \
    --cc=hannes@cmpxchg.org \
    --cc=kernel-team@fb.com \
    --cc=linux-mm@kvack.org \
    --cc=riel@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.