From: Yang Shi <shy828301@gmail.com>
To: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@kernel.org>,
Shakeel Butt <shakeelb@google.com>,
Yang Shi <yang.shi@linux.alibaba.com>,
Roman Gushchin <guro@fb.com>, Greg Thelen <gthelen@google.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Vladimir Davydov <vdavydov.dev@gmail.com>,
Andrew Morton <akpm@linux-foundation.org>,
Cgroups <cgroups@vger.kernel.org>, Linux MM <linux-mm@kvack.org>
Subject: Re: Memcg stat for available memory
Date: Fri, 10 Jul 2020 14:04:57 -0700
Message-ID: <CAHbLzkoCNt7GPrwN1uPEvd==-Lz9-j6-2RS0CCL0s2e-M_omiw@mail.gmail.com>
In-Reply-To: <alpine.DEB.2.23.453.2007101223470.1178541@chino.kir.corp.google.com>
On Fri, Jul 10, 2020 at 12:49 PM David Rientjes <rientjes@google.com> wrote:
>
> On Tue, 7 Jul 2020, David Rientjes wrote:
>
> > Another use case would be motivated by exactly the MemAvailable use case:
> > when bound to a memcg hierarchy, how much memory is available without
> > substantial swap or risk of oom for starting a new process or service?
> > This would not trigger any memory.low or PSI notification but is a
> > heuristic that can be used to determine what can and cannot be started
> > without incurring substantial memory reclaim.
> >
> > I'm indifferent to whether this would be a "reclaimable" or "available"
> > metric, with a slight preference toward making it as similar in
> > calculation to MemAvailable as possible, so I think the question is
> > whether this is something the user should be deriving themselves based on
> > memcg stats that are exported or whether we should solidify this based on
> > how the kernel handles reclaim as a metric that will carry over across
> > kernel versions?
> >
>
> To try to get more discussion on the subject, consider a malloc
> implementation, like tcmalloc, that does MADV_DONTNEED to free memory back
> to the system and how this freed memory is then described to userspace
> depending on the kernel implementation.
>
> [ For the sake of this discussion, consider we have precise memcg stats
> available to us although the actual implementation allows for some
> variance (MEMCG_CHARGE_BATCH). ]
>
> With a 64MB heap backed by thp on x86, for example, the vma starts with an
> rss of 64MB, all of which is anon and backed by hugepages. Imagine some
> aggressive MADV_DONTNEED freeing that ends up with only a single 4KB page
> mapped in each 2MB aligned range. The rss is now 32 * 4KB = 128KB.
>
> Before freeing, anon, anon_thp, and active_anon in memory.stat would all
> be the same for this vma (64MB). 64MB would also be charged to
> memory.current. That's all working as intended and to the expectation of
> userspace.
>
> After freeing, however, we have the kernel implementation specific detail
> of how huge pmd splitting is handled (rss) in comparison to the underlying
> split of the compound page (deferred split queue). The huge pmd is always
> split synchronously after MADV_DONTNEED so, as mentioned, the rss is 128KB
> for this vma and none of it is backed by thp.
>
> What is charged to the memcg (memory.current) and what is on active_anon
> is unchanged, however, because the underlying compound pages are still
> charged to the memcg. The amount of anon and anon_thp are decreased
> in compliance with the splitting of the page tables, however.
>
> So after freeing, for this vma: anon = 128KB, anon_thp = 0,
> active_anon = 64MB, memory.current = 64MB.
>
> In this case, because of the deferred split queue, which is a kernel
> implementation detail, userspace may be unclear on what is actually
> reclaimable -- and this memory is reclaimable under memory pressure. For
> the motivation of MemAvailable (what amount of memory is available for
> starting new work), userspace *could* determine this through the
> aforementioned active_anon - anon (or some combination of
> memory.current - anon - file - slab), but I think it's a fair point that
> userspace's view of reclaimable memory as the kernel implementation
> changes is something that can and should remain consistent between
> versions.
>
> Otherwise, an earlier implementation before deferred split queues could
> have safely assumed that active_anon was unreclaimable unless swap were
> enabled. It doesn't have the foresight based on future kernel
> implementation detail to reconcile what the amount of reclaimable memory
> actually is.
>
> Same discussion could happen for lazy free memory which is anon but now
> appears on the file lru stats and not the anon lru stats: it's easily
> reclaimable under memory pressure but you need to reconcile the difference
> between the anon metric and what is revealed in the anon lru stats.
>
> That gave way to my original thought of a si_mem_available()-like
> calculation ("avail") by doing
>
> free = memory.high - memory.current
I'm wondering what happens if high or max is set to "max" (i.e. no
limit). Wouldn't you end up seeing a super large memavail?
> lazyfree = file - (active_file + inactive_file)
Shouldn't it be (active_file + inactive_file) - file? It looks like
MADV_FREE just updates the inactive lru size.
> deferred = active_anon - anon
>
> avail = free + lazyfree + deferred +
> (active_file + inactive_file + slab_reclaimable) / 2
>
> And we have the ability to change this formula based on kernel
> implementation details as they evolve. Idea is to provide a consistent
> field that userspace can use to determine the rough amount of reclaimable
> memory in a MemAvailable-like way.
>
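For illustration only, here is a minimal userspace sketch of the quoted
"avail" formula in Python. It is not an existing kernel or cgroup
interface. The function names (parse_stat, parse_limit, memcg_avail) are
made up for this example. Two assumptions are baked in: lazyfree uses the
sign suggested in the reply above, (active_file + inactive_file) - file,
and an unlimited memory.high ("max") yields no estimate instead of a
huge number.

```python
from typing import Dict, Mapping, Optional


def parse_stat(text: str) -> Dict[str, int]:
    """Parse the flat "key value" lines of a cgroup v2 memory.stat file."""
    stat = {}
    for line in text.splitlines():
        if line.strip():
            key, value = line.split()
            stat[key] = int(value)
    return stat


def parse_limit(text: str) -> Optional[int]:
    """Parse memory.high or memory.max; "max" means no limit (None)."""
    text = text.strip()
    return None if text == "max" else int(text)


def memcg_avail(stat: Mapping[str, int], current: int,
                high: Optional[int]) -> Optional[int]:
    """Rough "avail" estimate in bytes, following the quoted formula."""
    if high is None:
        # Assumption: with no limit, "high - current" is meaningless, so
        # return nothing and let the caller fall back to some other bound.
        return None
    free = max(high - current, 0)
    file_lru = stat["active_file"] + stat["inactive_file"]
    # Lazy-free pages are anon but sit on the file LRU, so the LRU total
    # can exceed "file". Sign per the reply above (an assumption here);
    # clamp at zero because the counters are only approximate.
    lazyfree = max(file_lru - stat["file"], 0)
    # Memory still charged via the deferred split queue but no longer mapped.
    deferred = max(stat["active_anon"] - stat["anon"], 0)
    return free + lazyfree + deferred + (file_lru + stat["slab_reclaimable"]) // 2


if __name__ == "__main__":
    KB, MB = 1 << 10, 1 << 20
    # The 64MB THP heap after MADV_DONTNEED, as described in the thread.
    # The 1GB memory.high value is hypothetical.
    stat = {"anon": 128 * KB, "active_anon": 64 * MB, "file": 0,
            "active_file": 0, "inactive_file": 0, "slab_reclaimable": 0}
    print(memcg_avail(stat, current=64 * MB, high=1024 * MB))
```

With these inputs, free is 960MB and deferred is 64MB - 128KB, so the
estimate is 1GB - 128KB: everything under the limit except the 128KB
that is still mapped.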