linux-mm.kvack.org archive mirror
From: Yang Shi <shy828301@gmail.com>
To: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@kernel.org>,
	Shakeel Butt <shakeelb@google.com>,
	 Yang Shi <yang.shi@linux.alibaba.com>,
	Roman Gushchin <guro@fb.com>,  Greg Thelen <gthelen@google.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	 Vladimir Davydov <vdavydov.dev@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	 Cgroups <cgroups@vger.kernel.org>, Linux MM <linux-mm@kvack.org>
Subject: Re: Memcg stat for available memory
Date: Fri, 10 Jul 2020 14:04:57 -0700
Message-ID: <CAHbLzkoCNt7GPrwN1uPEvd==-Lz9-j6-2RS0CCL0s2e-M_omiw@mail.gmail.com>
In-Reply-To: <alpine.DEB.2.23.453.2007101223470.1178541@chino.kir.corp.google.com>

On Fri, Jul 10, 2020 at 12:49 PM David Rientjes <rientjes@google.com> wrote:
>
> On Tue, 7 Jul 2020, David Rientjes wrote:
>
> > Another use case would be motivated by exactly the MemAvailable use case:
> > when bound to a memcg hierarchy, how much memory is available without
> > substantial swap or risk of oom for starting a new process or service?
> > This would not trigger any memory.low or PSI notification but is a
> > heuristic that can be used to determine what can and cannot be started
> > without incurring substantial memory reclaim.
> >
> > I'm indifferent to whether this would be a "reclaimable" or "available"
> > metric, with a slight preference toward making it as similar in
> > calculation to MemAvailable as possible, so I think the question is
> > whether this is something the user should be deriving themselves based on
> > memcg stats that are exported or whether we should solidify this based on
> > how the kernel handles reclaim as a metric that will carry over across
> > kernel versions?
> >
>
> To try to get more discussion on the subject, consider a malloc
> implementation, like tcmalloc, that does MADV_DONTNEED to free memory back
> to the system and how this freed memory is then described to userspace
> depending on the kernel implementation.
>
>  [ For the sake of this discussion, consider we have precise memcg stats
>    available to us although the actual implementation allows for some
>    variance (MEMCG_CHARGE_BATCH). ]
>
> With a 64MB heap backed by thp on x86, for example, the vma starts with an
> rss of 64MB, all of which is anon and backed by hugepages.  Imagine some
> aggressive MADV_DONTNEED freeing that ends up with only a single 4KB page
> mapped in each 2MB aligned range.  The rss is now 32 * 4KB = 128KB.
>
> Before freeing, anon, anon_thp, and active_anon in memory.stat would all
> be the same for this vma (64MB).  64MB would also be charged to
> memory.current.  That's all working as intended and to the expectation of
> userspace.
>
> After freeing, however, we have the kernel implementation specific detail
> of how huge pmd splitting is handled (rss) in comparison to the underlying
> split of the compound page (deferred split queue).  The huge pmd is always
> split synchronously after MADV_DONTNEED so, as mentioned, the rss is 128KB
> for this vma and none of it is backed by thp.
>
> What is charged to the memcg (memory.current) and what is on active_anon
> is unchanged, however, because the underlying compound pages are still
> charged to the memcg.  The amount of anon and anon_thp are decreased
> in compliance with the splitting of the page tables, however.
>
> So after freeing, for this vma: anon = 128KB, anon_thp = 0,
> active_anon = 64MB, memory.current = 64MB.
>
> In this case, because of the deferred split queue, which is a kernel
> implementation detail, userspace may be unclear on what is actually
> reclaimable -- and this memory is reclaimable under memory pressure.  For
> the motivation of MemAvailable (what amount of memory is available for
> starting new work), userspace *could* determine this through the
> aforementioned active_anon - anon (or some combination of
> memory.current - anon - file - slab), but I think it's a fair point that
> userspace's view of reclaimable memory as the kernel implementation
> changes is something that can and should remain consistent between
> versions.
>
> Otherwise, an earlier implementation before deferred split queues could
> have safely assumed that active_anon was unreclaimable unless swap were
> enabled.  It doesn't have the foresight based on future kernel
> implementation detail to reconcile what the amount of reclaimable memory
> actually is.
>
> Same discussion could happen for lazy free memory which is anon but now
> appears on the file lru stats and not the anon lru stats: it's easily
> reclaimable under memory pressure but you need to reconcile the difference
> between the anon metric and what is revealed in the anon lru stats.
>
> That gave way to my original thought of a si_mem_available()-like
> calculation ("avail") by doing
>
>         free = memory.high - memory.current

I'm wondering what happens if memory.high (or memory.max) is left at
"max", i.e. no limit. Wouldn't you end up seeing a super large memavail?
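
To make the concern concrete, here is a minimal Python sketch of the
"free" term as userspace would compute it. It is not kernel code. It
assumes the cgroup v2 convention that memory.high reads back as the
literal string "max" when no limit is set; the helper name headroom()
and the sample sizes are made up for illustration.

```python
MB = 1024 * 1024

def headroom(high_text, current_bytes):
    """Return memory.high - memory.current, or None if memory.high is unlimited."""
    high_text = high_text.strip()
    if high_text == "max":
        # No limit configured: the subtraction has no meaningful result,
        # so report "unknown" instead of a huge number.
        return None
    return int(high_text) - current_bytes

print(headroom("max\n", 64 * MB))          # None
print(headroom(str(256 * MB), 64 * MB))    # 201326592 (192MB)
```

What to report in the unlimited case (a parent's limit, the host's
MemAvailable, or nothing) is left open here.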

>         lazyfree = file - (active_file + inactive_file)

Shouldn't that be (active_file + inactive_file) - file? It looks like
MADV_FREE just updates the inactive file lru size.
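
A toy illustration of the sign question, in Python. It assumes that
MADV_FREE pages are counted on the inactive file lru but not in the
"file" counter; all numbers are invented.

```python
MB = 1024 * 1024

# Hypothetical memcg: 100MB of page cache plus 16MB of MADV_FREE'd anon memory.
stat = {
    "file": 100 * MB,
    "active_file": 40 * MB,
    "inactive_file": 60 * MB + 16 * MB,  # lazy free pages land on this lru
}

lru_file = stat["active_file"] + stat["inactive_file"]
as_proposed = stat["file"] - lru_file   # file - (active_file + inactive_file)
swapped = lru_file - stat["file"]       # (active_file + inactive_file) - file

print(as_proposed // MB, swapped // MB)  # -16 16
```

Under that assumption only the second form yields the 16MB of lazy
free memory.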

>         deferred = active_anon - anon
>
>         avail = free + lazyfree + deferred +
>                 (active_file + inactive_file + slab_reclaimable) / 2
>
> And we have the ability to change this formula based on kernel
> implementation details as they evolve.  Idea is to provide a consistent
> field that userspace can use to determine the rough amount of reclaimable
> memory in a MemAvailable-like way.
>
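
For reference, a rough userspace sketch of the quoted calculation in
Python, fed with the 64MB thp heap example from earlier in the mail. It
is only an illustration, not a proposed implementation. Assumptions not
in the original: the helper names, a finite memory.high of 256MB, the
clamping of negative terms to zero, and the use of the
(active_file + inactive_file) - file form for lazyfree raised in the
reply above.

```python
MB = 1024 * 1024

def parse_stat(text):
    """Parse 'key value' lines in the style of memory.stat into a dict of ints."""
    return {k: int(v) for k, v in
            (line.split() for line in text.splitlines() if line.strip())}

def avail(stat, current, high):
    """MemAvailable-like estimate; 'high' must be a finite byte count."""
    free = high - current
    lru_file = stat["active_file"] + stat["inactive_file"]
    lazyfree = max(0, lru_file - stat["file"])               # sign as in the reply
    deferred = max(0, stat["active_anon"] - stat["anon"])    # deferred split queue
    cache = (lru_file + stat["slab_reclaimable"]) // 2
    return free + lazyfree + deferred + cache

# After the MADV_DONTNEED pass: anon = 128KB, active_anon = 64MB.
sample = """\
anon 131072
file 0
active_anon 67108864
active_file 0
inactive_file 0
slab_reclaimable 0
"""

result = avail(parse_stat(sample), current=64 * MB, high=256 * MB)
print(result)  # 268304384, i.e. 256MB - 128KB
```

With these inputs nearly the whole 64MB charged to memory.current is
counted as available again, which is the intent of the deferred term.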



Thread overview: 7+ messages
2020-06-28 22:15 Memcg stat for available memory David Rientjes
2020-07-02 15:22 ` Shakeel Butt
2020-07-03  8:15   ` Michal Hocko
2020-07-07 19:58     ` David Rientjes
2020-07-10 19:47       ` David Rientjes
2020-07-10 21:04         ` Yang Shi [this message]
2020-07-12 22:02           ` David Rientjes
