From: Vladimir Davydov <vdavydov@parallels.com>
To: Michal Hocko <mhocko@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Andres Lagar-Cavilla <andreslc@google.com>,
	Minchan Kim <minchan@kernel.org>,
	Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Greg Thelen <gthelen@google.com>,
	Michel Lespinasse <walken@google.com>,
	David Rientjes <rientjes@google.com>,
	Pavel Emelyanov <xemul@parallels.com>,
	Cyrill Gorcunov <gorcunov@openvz.org>,
	Jonathan Corbet <corbet@lwn.net>,
	linux-api@vger.kernel.org, linux-doc@vger.kernel.org,
	linux-mm@kvack.org, cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH -mm v9 0/8] idle memory tracking
Date: Wed, 29 Jul 2015 18:28:17 +0300
Message-ID: <20150729152817.GV8100@esperanza>
In-Reply-To: <20150729142618.GJ15801@dhcp22.suse.cz>

On Wed, Jul 29, 2015 at 04:26:19PM +0200, Michal Hocko wrote:
> On Wed 29-07-15 16:59:07, Vladimir Davydov wrote:
> > On Wed, Jul 29, 2015 at 02:36:30PM +0200, Michal Hocko wrote:
> > > On Sun 19-07-15 15:31:09, Vladimir Davydov wrote:
> > > [...]
> > > > ---- USER API ----
> > > > 
> > > > The user API consists of two new proc files:
> > > 
> > > I was thinking about this for a while. I dislike the interface.  It is
> > > quite awkward to use - e.g. you have to read the full memory to check a
> > > single memcg idleness. This might turn out being a problem especially on
> > > large machines.
> > 
> > Yes, with this API estimating the wss of a single memory cgroup will
> > cost almost as much as doing this for the whole system.
> > 
> > Come to think of it, does anyone really need to estimate idleness of one
> > particular cgroup?
> 
> It is certainly interesting for setting the low limit.

Yes, but IMO there is no point in setting the low limit for one
particular cgroup w/o considering what's going on with the rest of the
system.

> 
> > If we are doing this for finding an optimal memcg
> > limits configuration or while considering a load move within a cluster
> > (which I think are the primary use cases for the feature), we must do it
> > system-wide to see the whole picture.
> > 
> > > It also provides very low-level information (per-pfn idleness) which
> > > is inherently racy. Does anybody really require this level of detail?
> > 
> > Well, one might want to do it per-process, obtaining PFNs from
> > /proc/pid/pagemap.
> 
> Sure once the interface is exported you can do whatever ;) But my
> question is whether any real usecase _requires_ it. 

I only know/care about my use case, which is memcg configuration, but I
want to make the API as reusable as possible.
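
For the per-process case mentioned above, here is a minimal sketch of how
a userspace tool might turn /proc/pid/pagemap entries into PFNs. It
assumes the documented pagemap layout (one native-endian 64-bit entry per
virtual page, bit 63 = present, bits 0-54 = PFN) and a 4 KiB page size;
the helper names are hypothetical and not part of this patch set.

```python
import struct

PAGEMAP_ENTRY_SIZE = 8        # one 64-bit entry per virtual page
PFN_MASK = (1 << 55) - 1      # bits 0-54 hold the PFN when the page is present
PRESENT_BIT = 1 << 63         # bit 63: page is present in RAM


def pagemap_offset(vaddr, page_size=4096):
    """Byte offset in /proc/pid/pagemap of the entry covering vaddr."""
    return (vaddr // page_size) * PAGEMAP_ENTRY_SIZE


def decode_pagemap_entry(raw):
    """Return the PFN stored in an 8-byte pagemap entry, or None if the
    page is not present in RAM."""
    (entry,) = struct.unpack("=Q", raw)   # native byte order
    if not entry & PRESENT_BIT:
        return None
    return entry & PFN_MASK
```

A real tool would seek to pagemap_offset(vaddr) in the pagemap file, read
8 bytes, and feed them to decode_pagemap_entry(); reading PFNs this way
generally needs elevated privileges.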

> 
> > > I would assume that most users are interested only in a single number
> > > which tells the idleness of the system/memcg.
> > 
> > Yes, that's what I need it for - estimating containers' wss for setting
> > their limits accordingly.
> 
> So why don't we export the single per memcg and global knobs then?
> This would have a few advantages. First of all it would be much easier to
> use, you wouldn't have to export memcg ids and finally the implementation
> could be changed without any user visible changes (e.g. lru vs. pfn walks),
> potential caching and who knows what. In other words. Michel had a
> single number interface AFAIR, what was the primary reason to move away
> from that API?

Because there is too much to be taken care of in the kernel with such an
approach and chances are high that it won't satisfy everyone. What
should the scan period be equal to? Knob. How many kthreads do we want?
Knob. I want to keep history for last N intervals (this was a part of
Michel's implementation), what should N be equal to? Knob. I want to be
able to choose between an instant scan and a scan distributed in time.
Knob. I want to see stats for anon/locked/file/dirty memory separately,
please add them to the API. You see the scale of the problem with doing
it in the kernel?

The API this patch set introduces is simple and fair. It only defines
what the "idle" flag means and gives you a way to flip it. That's it. You
want history? DIY. You want periodic scans? DIY. Etc.
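
To illustrate the DIY part, here is a rough sketch of the bookkeeping a
userspace scanner could do on its own. This message does not spell out
the file layout, so the sketch assumes /proc/kpageidle is a bitmap of
8-byte words in which PFN i maps to bit i % 64 of word i / 64. The
function and class names are hypothetical.

```python
from collections import deque

WORD_BITS = 64    # assumed: pages tracked per bitmap word
WORD_BYTES = 8    # assumed: size of one bitmap word in bytes


def bit_location(pfn):
    """Return (byte offset of the word, bit index within it) for a PFN."""
    return (pfn // WORD_BITS) * WORD_BYTES, pfn % WORD_BITS


def count_idle(bitmap_bytes):
    """Count the set (idle) bits in a chunk read from the bitmap."""
    return sum(bin(b).count("1") for b in bitmap_bytes)


class IdleHistory:
    """Remember the idle-page counts of the last n scans; n is entirely
    the tool's choice, no kernel knob involved."""

    def __init__(self, n):
        self.samples = deque(maxlen=n)

    def record(self, idle_pages):
        self.samples.append(idle_pages)

    def average(self):
        if not self.samples:
            return 0.0
        return sum(self.samples) / len(self.samples)
```

A periodic scanner would mark pages idle, sleep for whatever interval it
likes, read the bitmap back, and pass count_idle() of each chunk to
IdleHistory.record().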

> 
> > > Well, you have mentioned a per-process reclaim but I am quite
> > > skeptical about this.
> > 
> > This is what Minchan mentioned initially. Personally, I'm not going to
> > use it per-process, but I wouldn't rule out this use case either.
> 
> Considering how many times we have been bitten by too broad interfaces I
> would rather be conservative.

I consider an API "broad" when it tries to do a lot of different things.
sys_prctl is a good example of a broad API.

/proc/kpageidle is not broad, because it does just one thing (I hope it
does it well :). If we attempted to implement the scanner in the kernel
with all those tunables I mentioned above, then we would get a broad API
IMO.

> 
> > > I guess the primary reason to rely on the pfn rather than the LRU walk,
> > > which would be more targeted (especially for memcg cases), is that we
> > > cannot hold lru lock for the whole LRU walk and we cannot continue
> > > walking after the lock is dropped. Maybe we can try to address that
> > > instead? I do not think this is easy to achieve but have you considered
> > > that as an option?
> > 
> > Yes, I have, and I've come to a conclusion it's not doable, because LRU
> > lists can be constantly rotating at an arbitrary rate. If you have an
> > idea in mind how this could be done, please share.
> 
> Yes this is really tricky with the current LRU implementation. I
> was playing with some ideas (do some checkpoints on the way) but
> none of them was really working out on busy systems. But the LRU
> implementation might change in the future.

It might. Then we could come up with a new /proc or /sys file which
would do the same as /proc/kpageidle, but on per LRU^w whatever-it-is
basis, and give people a choice which one to use.

> I didn't mean this as a hard requirement it just sounds that the
> current implementation restrictions shape the user visible API which
> is a good sign to think twice about it.

Agreed. That's why we are discussing it now :-)

> 
> > Speaking of LRU-vs-PFN walk, iterating over PFNs has its own advantages:
> >  - You can distribute a walk in time to avoid CPU bursts.
> 
> This would make the information even more volatile. I am not sure how
> helpful it would be in the end.

If you do it periodically, it is quite accurate.

> 
> >  - You are free to parallelize the scanner as you wish to decrease the
> >    scan time.
> 
> This is true but you could argue similarly with per-node/lru threads if this
> was implemented in the kernel and really needed. I am not sure it would
> be really needed though. I would expect this would be a low priority
> thing.

But if you needed it one day, you'd have to extend the kernel API. With
/proc/kpageidle, you just go and fix your program.
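
As a sketch of what "fixing your program" could look like, the PFN range
can be cut into chunks that are either scanned one per timer tick (to
spread the walk in time) or handed out to several workers (to shorten
it). The helpers below are hypothetical; the only assumption about the
file is that chunk boundaries aligned to 64 pages are convenient.

```python
def pfn_chunks(start_pfn, end_pfn, chunk_pages):
    """Yield (first_pfn, end_pfn_exclusive) pairs covering
    [start_pfn, end_pfn) in pieces of at most chunk_pages pages."""
    if chunk_pages <= 0 or chunk_pages % 64:
        raise ValueError("chunk_pages must be a positive multiple of 64")
    pfn = start_pfn
    while pfn < end_pfn:
        nxt = min(pfn + chunk_pages, end_pfn)
        yield pfn, nxt
        pfn = nxt


def assign_round_robin(chunks, n_workers):
    """Distribute chunks over n_workers buckets in round-robin order."""
    buckets = [[] for _ in range(n_workers)]
    for i, chunk in enumerate(chunks):
        buckets[i % n_workers].append(chunk)
    return buckets
```

Changing the chunk size or the number of workers is then a one-line
change in the tool rather than a new kernel tunable.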

Thanks,
Vladimir

