From: Vladimir Davydov <vdavydov@parallels.com>
To: Michal Hocko <mhocko@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Andres Lagar-Cavilla <andreslc@google.com>,
Minchan Kim <minchan@kernel.org>,
Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Greg Thelen <gthelen@google.com>,
Michel Lespinasse <walken@google.com>,
David Rientjes <rientjes@google.com>,
Pavel Emelyanov <xemul@parallels.com>,
Cyrill Gorcunov <gorcunov@openvz.org>,
Jonathan Corbet <corbet@lwn.net>,
linux-api@vger.kernel.org, linux-doc@vger.kernel.org,
linux-mm@kvack.org, cgroups@vger.kernel.org,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH -mm v9 0/8] idle memory tracking
Date: Wed, 29 Jul 2015 19:29:08 +0300 [thread overview]
Message-ID: <20150729162908.GY8100@esperanza> (raw)
In-Reply-To: <20150729154718.GN15801@dhcp22.suse.cz>
On Wed, Jul 29, 2015 at 05:47:18PM +0200, Michal Hocko wrote:
> On Wed 29-07-15 18:28:17, Vladimir Davydov wrote:
> > On Wed, Jul 29, 2015 at 04:26:19PM +0200, Michal Hocko wrote:
> > > On Wed 29-07-15 16:59:07, Vladimir Davydov wrote:
> > > > On Wed, Jul 29, 2015 at 02:36:30PM +0200, Michal Hocko wrote:
> > > > > On Sun 19-07-15 15:31:09, Vladimir Davydov wrote:
> > > > > [...]
> > > > > > ---- USER API ----
> > > > > >
> > > > > > The user API consists of two new proc files:
> > > > >
> > > > > I was thinking about this for a while. I dislike the interface. It is
> > > > > quite awkward to use - e.g. you have to read the full memory to check a
> > > > > single memcg idleness. This might turn out being a problem especially on
> > > > > large machines.
> > > >
> > > > Yes, with this API estimating the wss of a single memory cgroup will
> > > > cost almost as much as doing this for the whole system.
> > > >
> > > > Come to think of it, does anyone really need to estimate idleness of one
> > > > particular cgroup?
> > >
> > > It is certainly interesting for setting the low limit.
> >
> > Yes, but IMO there is no point in setting the low limit for one
> > particular cgroup w/o considering what's going on with the rest of the
> > system.
>
> If you use the low limit for isolating an important load then you do not
> have to care about the others that much. All you care about is to set
> the reasonable protection level and let others to compete for the rest.
That's a use case, you're right. Well, it's a natural limitation of this
API - you just have to perform a full PFN scan then. You can avoid
costly rmap walks for the cgroups you are not interested in by filtering
them out using /proc/kpagecgroup though.
>
> [...]
> > > > > I would assume that most users are interested only in a single number
> > > > > which tells the idleness of the system/memcg.
> > > >
> > > > Yes, that's what I need it for - estimating containers' wss for setting
> > > > their limits accordingly.
> > >
> > > So why don't we export the single per memcg and global knobs then?
> > > This would have few advantages. First of all it would be much easier to
> > > use, you wouldn't have to export memcg ids and finally the implementation
> > > could be changed without any user visible changes (e.g. lru vs. pfn walks),
> > > potential caching and who knows what. In other words. Michel had a
> > > single number interface AFAIR, what was the primary reason to move away
> > > from that API?
> >
> > Because there is too much to be taken care of in the kernel with such an
> > approach and chances are high that it won't satisfy everyone. What
> > should the scan period be equal too?
>
> No, just gather the data on the read request and let the userspace
> to decide when/how often etc. If we are clever enough we can cache
> the numbers and prevent from the walk. Write to the file and do the
> mark_idle stuff.
Still, scan rate limiting would be an issue IMO.
>
> > Knob. How many kthreads do we want?
> > Knob. I want to keep history for last N intervals (this was a part of
> > Michel's implementation), what should N be equal to? Knob.
>
> This all relates to the kernel thread implementation which I wasn't
> suggesting. I was referring to Michel's work which might induce that.
> I was merely referring to a single number output. Sorry about the
> confusion.
Still, what about idle stats history? I mean having info about how many
pages were idle for N scans. It might be useful for more robust/accurate
wss estimation.
>
> > I want to be
> > able to choose between an instant scan and a scan distributed in time.
> > Knob. I want to see stats for anon/locked/file/dirty memory separately,
>
> Why is this useful for the memcg limits setting or the wss estimation? I
> can imagine that a further drop down numbers might be interesting
> from the debugging POV but I fail to see what kind of decisions from
> userspace you would do based on them.
A couple examples that pop up in my mind:
It's difficult to make wss estimation perfect. By mlocking pages, a
workload might give a hint to the system that it will be really unhappy
if they are evicted.
One might want to consider anon pages and/or dirty pages as not idle in
order to protect them and hence avoid expensive pageout/swapout.
>
> [...]
> > > Yes this is really tricky with the current LRU implementation. I
> > > was playing with some ideas (do some checkpoints on the way) but
> > > none of them was really working out on a busy systems. But the LRU
> > > implementation might change in the future.
> >
> > It might. Then we could come up with a new /proc or /sys file which
> > would do the same as /proc/kpageidle, but on per LRU^w whatever-it-is
> > basis, and give people a choice which one to use.
>
> This just leads to proc files count explosion we are seeing
> already... Proc ended up in dump ground for different things which
> didn't fit elsewhere and I am not very much happy about it to be honest.
Moving the API to memcg is not a good idea either IMO, because the
feature can actually be useful with memcg disabled, e.g. it might help
estimate if the system is over- or underloaded.
/proc/kpageidle should probably live somewhere in /sys/kernel/mm, but I
added it where similar files are located (kpagecount, kpageflags) to
keep things consistent.
Thanks,
Vladimir
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2015-07-29 16:29 UTC|newest]
Thread overview: 57+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-07-19 12:31 [PATCH -mm v9 0/8] idle memory tracking Vladimir Davydov
2015-07-19 12:31 ` [PATCH -mm v9 1/8] memcg: add page_cgroup_ino helper Vladimir Davydov
2015-07-21 23:34 ` Andrew Morton
2015-07-22 9:21 ` Vladimir Davydov
2015-07-19 12:31 ` [PATCH -mm v9 2/8] hwpoison: use page_cgroup_ino for filtering by memcg Vladimir Davydov
2015-07-21 23:34 ` Andrew Morton
2015-07-22 9:45 ` Vladimir Davydov
2015-07-19 12:31 ` [PATCH -mm v9 3/8] memcg: zap try_get_mem_cgroup_from_page Vladimir Davydov
2015-07-19 12:31 ` [PATCH -mm v9 4/8] proc: add kpagecgroup file Vladimir Davydov
2015-07-21 23:34 ` Andrew Morton
2015-07-22 10:33 ` Vladimir Davydov
2015-07-19 12:31 ` [PATCH -mm v9 5/8] mmu-notifier: add clear_young callback Vladimir Davydov
2015-07-20 18:34 ` Andres Lagar-Cavilla
2015-07-21 8:51 ` Vladimir Davydov
2015-07-22 16:33 ` Vladimir Davydov
2015-07-19 12:31 ` [PATCH -mm v9 6/8] proc: add kpageidle file Vladimir Davydov
2015-07-21 23:34 ` Andrew Morton
2015-07-22 15:20 ` Vladimir Davydov
2015-07-24 14:08 ` Paul Gortmaker
2015-07-24 14:17 ` Vladimir Davydov
2015-07-19 12:31 ` [PATCH -mm v9 7/8] proc: export idle flag via kpageflags Vladimir Davydov
2015-07-21 23:35 ` Andrew Morton
2015-07-22 16:25 ` Vladimir Davydov
2015-07-22 19:44 ` Andrew Morton
2015-07-22 20:46 ` Andres Lagar-Cavilla
2015-07-23 7:57 ` Vladimir Davydov
2015-07-19 12:31 ` [PATCH -mm v9 8/8] proc: add cond_resched to /proc/kpage* read/write loop Vladimir Davydov
2015-07-19 12:37 ` [PATCH -mm v9 0/8] idle memory tracking Vladimir Davydov
2015-07-21 21:39 ` Andres Lagar-Cavilla
2015-07-21 23:34 ` Andrew Morton
2015-07-22 16:23 ` Vladimir Davydov
2015-07-25 16:24 ` Vladimir Davydov
2015-07-27 19:18 ` Kees Cook
2015-07-27 19:25 ` Andrew Morton
2015-07-29 12:36 ` Michal Hocko
2015-07-29 13:59 ` Vladimir Davydov
2015-07-29 14:12 ` Michel Lespinasse
2015-07-29 14:13 ` Michel Lespinasse
2015-07-29 14:45 ` Vladimir Davydov
2015-07-29 15:08 ` Michel Lespinasse
2015-07-29 15:31 ` Vladimir Davydov
2015-07-29 15:34 ` Michel Lespinasse
2015-07-29 15:08 ` Michal Hocko
2015-07-29 15:36 ` Vladimir Davydov
2015-07-29 15:58 ` Michal Hocko
2015-07-29 14:26 ` Michal Hocko
2015-07-29 15:28 ` Vladimir Davydov
2015-07-29 15:47 ` Michal Hocko
2015-07-29 16:29 ` Vladimir Davydov [this message]
2015-07-29 21:30 ` Andrew Morton
2015-07-30 9:12 ` Vladimir Davydov
2015-07-30 13:01 ` Vladimir Davydov
2015-07-31 9:34 ` Vladimir Davydov
2015-07-30 9:07 ` Michal Hocko
2015-07-30 9:31 ` Vladimir Davydov
2015-07-29 15:55 ` Andres Lagar-Cavilla
2015-07-29 16:37 ` Vladimir Davydov
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20150729162908.GY8100@esperanza \
--to=vdavydov@parallels.com \
--cc=akpm@linux-foundation.org \
--cc=andreslc@google.com \
--cc=cgroups@vger.kernel.org \
--cc=corbet@lwn.net \
--cc=gorcunov@openvz.org \
--cc=gthelen@google.com \
--cc=hannes@cmpxchg.org \
--cc=linux-api@vger.kernel.org \
--cc=linux-doc@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mhocko@kernel.org \
--cc=minchan@kernel.org \
--cc=raghavendra.kt@linux.vnet.ibm.com \
--cc=rientjes@google.com \
--cc=walken@google.com \
--cc=xemul@parallels.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).