From: Josef Bacik <josef@toxicpanda.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <josef@toxicpanda.com>,
akpm@linux-foundation.org, kernel-team@fb.com, riel@redhat.com,
linux-mm@kvack.org
Subject: Re: [PATCH] mm: make kswapd try harder to keep active pages in cache
Date: Wed, 24 May 2017 14:50:42 -0400 [thread overview]
Message-ID: <20170524185040.GA14869@destiny> (raw)
In-Reply-To: <20170524174610.GB22174@cmpxchg.org>
On Wed, May 24, 2017 at 01:46:10PM -0400, Johannes Weiner wrote:
> Hi Josef,
>
> On Tue, May 23, 2017 at 10:23:23AM -0400, Josef Bacik wrote:
> > @@ -308,7 +317,8 @@ EXPORT_SYMBOL(unregister_shrinker);
> > static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> > struct shrinker *shrinker,
> > unsigned long nr_scanned,
> > - unsigned long nr_eligible)
> > + unsigned long nr_eligible,
> > + unsigned long *slab_scanned)
>
> Once you pass in pool size ratios here, nr_scanned and nr_eligible
> become confusing. Can you update the names?
>
Yeah I kept changing them and eventually decided my names were equally as
shitty, so I just left them. I'll change them to something useful.
> > @@ -2292,6 +2310,15 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
> > scan = 0;
> > }
> > break;
> > + case SCAN_INACTIVE:
> > + if (file && !is_active_lru(lru)) {
> > + if (scan && size > sc->nr_to_reclaim)
> > + scan = sc->nr_to_reclaim;
>
> Why is the scan target different than with regular cache reclaim? I'd
> expect that we only need to zero the active list sizes here, not that
> we'd also need any further updates to 'scan'.
>
Huh I actually screwed this up slightly from what I wanted. Since
scan = size >> sc->priority
we'd sometimes end up with scan < nr_to_reclaim, but since we're only scanning
inactive we really want to try as hard as possible to reclaim what we need from
inactive. What I should have done is something like
scan = max(sc->nr_to_reclaim, scan);
instead, I'll fix that.
> > @@ -2509,8 +2536,62 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> > {
> > struct reclaim_state *reclaim_state = current->reclaim_state;
> > unsigned long nr_reclaimed, nr_scanned;
> > + unsigned long nr_reclaim, nr_slab, total_high_wmark = 0, nr_inactive;
> > + int z;
> > bool reclaimable = false;
> > + bool skip_slab = false;
> > +
> > + nr_slab = sum_zone_node_page_state(pgdat->node_id,
> > + NR_SLAB_RECLAIMABLE);
> > + nr_inactive = node_page_state(pgdat, NR_INACTIVE_FILE);
> > + nr_reclaim = pgdat_reclaimable_pages(pgdat);
> > +
> > + for (z = 0; z < MAX_NR_ZONES; z++) {
> > + struct zone *zone = &pgdat->node_zones[z];
> > + if (!managed_zone(zone))
> > + continue;
> > + total_high_wmark += high_wmark_pages(zone);
> > + }
>
> This function is used for memcg target reclaim, in which case you're
> only looking at a subset of the pgdats and zones. Any pgdat or zone
> state read here would be scoped incorrectly; and the ratios on the
> node level are independent from ratios on the cgroup level and can
> diverge heavily from each other.
>
> These size inquiries to drive the balancing will have to be made
> inside the memcg iteration loop further down with per-cgroup numbers.
>
Ok so I suppose I need to look at the actual lru list sizes instead for these
numbers for !global_reclaim(sc)?
> > + /*
> > + * If we don't have a lot of inactive or slab pages then there's no
> > + * point in trying to free them exclusively, do the normal scan stuff.
> > + */
> > + if (nr_inactive < total_high_wmark && nr_slab < total_high_wmark)
> > + sc->inactive_only = 0;
>
> Yes, we need something like this, to know when to fall back to full
> reclaim. Cgroups don't have high watermarks, but I guess some magic
> number for "too few pages" could do the trick.
>
> > + /*
> > + * We don't have historical information, we can't make good decisions
> > + * about ratio's and where we should put pressure, so just apply
> > + * pressure based on overall consumption ratios.
> > + */
> > + if (!sc->slab_diff && !sc->inactive_diff)
> > + sc->inactive_only = 0;
>
> This one I'm not sure about. If we have enough slabs and and inactive
> pages why shouldn't we go for them first anyway - regardless of
> whether they have grown since the last reclaim invocation?
>
Because we use them for the ratio of where to put pressure, but I suppose I
could just drop this and do
foo = max(sc->slab_diff, 1);
bar = max(sc->inactive_diff, 1);
so if we have no historical information we just equally scan both. I'll do that
instead.
> > @@ -2543,10 +2626,27 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> > shrink_node_memcg(pgdat, memcg, sc, &lru_pages);
> > node_lru_pages += lru_pages;
> >
> > - if (memcg)
> > - shrink_slab(sc->gfp_mask, pgdat->node_id,
> > - memcg, sc->nr_scanned - scanned,
> > - lru_pages);
> > + /*
> > + * We don't want to put a lot of pressure on all of the
> > + * slabs if a memcg is mostly full, so use the ratio of
> > + * the lru size to the total reclaimable space on the
> > + * system. If we have sc->inactive_only set then we
> > + * want to use the ratio of the difference between the
> > + * two since the last kswapd run so we apply pressure to
> > + * the consumer appropriately.
> > + */
> > + if (memcg && !skip_slab) {
> > + unsigned long numerator = lru_pages;
> > + unsigned long denominator = nr_reclaim;
>
> I don't quite follow this.
>
> It calculates the share of this memcg's pages on the node, which is
> the ratio we should apply to the global slab pool to have equivalent
> pressure. However, it's being applied to the *memcg's* share of slab
> pages. This means that the smaller the memcg relative to the node, the
> less of its tiny share of slab objects we reclaim.
>
> We're not translating from fraction to total, we're translating from
> fraction to fraction. Shouldn't the ratio be always 1:1?
>
> For example, if there is only one cgroup on the node, the ratio would
> be 1 share of LRU pages and 1 share of slab pages. But if there are
> two cgroups, we still scan one share of each cgroup's LRU pages but
> only half a share of each cgroup's slab pages. That doesn't add up.
>
> Am I missing something?
>
We hashed this out offline, but basically we concluded to add a memcg specific
slab reclaimable counter so we can make these ratios be consistent with the
global ratios.
> > + if (sc->inactive_only) {
> > + numerator = sc->slab_diff;
> > + denominator = sc->inactive_diff;
> > + }
>
> Shouldn't these diffs already be reflected in the pool sizes? If we
> scan pools proportional to their size, we also go more aggressively
> for the one that grows faster relative to the other one, right?
>
Sure unless the aggressive growth is from a different cgroup, we want to apply
proportional pressure everywhere. I suppose that should only be done in the
global reclaim case.
> I guess this *could* be more adaptive to fluctuations, but I wonder if
> that's an optimization that could be split out into a separate patch,
> to make it easier to review on its own merit. Start with a simple size
> based balancing in the first patch, add improved adaptiveness after.
>
> As mentioned above, this function is used not just by kswapd but also
> by direct reclaim, which doesn't initialize these fields and so always
> passes 0:0. We should be able to retain sensible balancing for them as
> well, but it would require moving the diff sampling.
>
> To make it work for cgroup reclaim, it would have to look at the
> growths not on the node level, but on the lruvec level in
> get_scan_count() or thereabouts.
>
> Anyway, I think it might be less confusing to nail down the size-based
> pressure balancing for slab caches first, and then do the recent diff
> balancing on top of it as an optimization.
Yeah I had it all separate but it got kind of weird and hard to tell which part
was needed where. Now that I've taken a step back I see where I can split it
up, so I'll fix these things and split the patches up. Thanks!
Josef
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2017-05-24 18:50 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-05-23 14:23 [PATCH] mm: make kswapd try harder to keep active pages in cache Josef Bacik
2017-05-23 17:14 ` Rik van Riel
2017-05-23 21:57 ` Andrew Morton
2017-05-24 17:46 ` Johannes Weiner
2017-05-24 18:50 ` Josef Bacik [this message]
2017-05-30 18:03 ` Johannes Weiner
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20170524185040.GA14869@destiny \
--to=josef@toxicpanda.com \
--cc=akpm@linux-foundation.org \
--cc=hannes@cmpxchg.org \
--cc=kernel-team@fb.com \
--cc=linux-mm@kvack.org \
--cc=riel@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).