linux-api.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Vladimir Davydov <vdavydov@parallels.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Andres Lagar-Cavilla <andreslc@google.com>,
	Minchan Kim <minchan@kernel.org>,
	Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Michal Hocko <mhocko@suse.cz>, Greg Thelen <gthelen@google.com>,
	Michel Lespinasse <walken@google.com>,
	David Rientjes <rientjes@google.com>,
	Pavel Emelyanov <xemul@parallels.com>,
	Cyrill Gorcunov <gorcunov@openvz.org>,
	Jonathan Corbet <corbet@lwn.net>,
	linux-api@vger.kernel.org, linux-doc@vger.kernel.org,
	linux-mm@kvack.org, cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org, Kees Cook <keescook@chromium.org>
Subject: Re: [PATCH -mm v9 0/8] idle memory tracking
Date: Wed, 22 Jul 2015 19:23:53 +0300	[thread overview]
Message-ID: <20150722162353.GM23374@esperanza> (raw)
In-Reply-To: <20150721163402.43ad2527d9b8caa476a1c9e1@linux-foundation.org>

On Tue, Jul 21, 2015 at 04:34:02PM -0700, Andrew Morton wrote:
> On Sun, 19 Jul 2015 15:31:09 +0300 Vladimir Davydov <vdavydov@parallels.com> wrote:
> 
> > Hi,
> > 
> > This patch set introduces a new user API for tracking user memory pages
> > that have not been used for a given period of time. The purpose of this
> > is to provide the userspace with the means of tracking a workload's
> > working set, i.e. the set of pages that are actively used by the
> > workload. Knowing the working set size can be useful for partitioning
> > the system more efficiently, e.g. by tuning memory cgroup limits
> > appropriately, or for job placement within a compute cluster.
> > 
> > It is based on top of v4.2-rc2-mmotm-2015-07-15-16-46
> > It applies without conflicts to v4.2-rc2-mmotm-2015-07-17-16-04 as well
> > 
> > ---- USE CASES ----
> > 
> > The unified cgroup hierarchy has memory.low and memory.high knobs, which
> > are defined as the low and high boundaries for the workload working set
> > size. However, the working set size of a workload may be unknown or
> > change in time. With this patch set, one can periodically estimate the
> > amount of memory unused by each cgroup and tune their memory.low and
> > memory.high parameters accordingly, therefore optimizing the overall
> > memory utilization.
> > 
> > Another use case is balancing workloads within a compute cluster.
> > Knowing how much memory is not really used by a workload unit may help
> > take a more optimal decision when considering migrating the unit to
> > another node within the cluster.
> > 
> > Also, as noted by Minchan, this would be useful for per-process reclaim
> > (https://lwn.net/Articles/545668/). With idle tracking, we could reclaim idle
> > pages only by smart user memory manager.
> > 
> > ---- USER API ----
> > 
> > The user API consists of two new proc files:
> > 
> >  * /proc/kpageidle.  This file implements a bitmap where each bit corresponds
> >    to a page, indexed by PFN.
> 
> What are the bit mappings?  If I read the first byte of /proc/kpageidle
> I get PFN #0 in bit zero of that byte?  And the second byte of
> /proc/kpageidle contains PFN #8 in its LSB, etc?

The bit mapping is an array of u64 elements. Page at pfn #i corresponds
to bit #i%64 of element #i/64. Byte order is native.

Will add this to docs.

> 
> Maybe this is covered in the documentation file.
> 
> > When the bit is set, the corresponding page is
> >    idle. A page is considered idle if it has not been accessed since it was
> >    marked idle.
> 
> Perhaps we can spell out in some detail what "accessed" means?  I see
> you've hooked into mark_page_accessed(), so a read from disk is an
> access.  What about a write to disk?  And what about a page being
> accessed from some random device (could hook into get_user_pages()?) Is
> getting written to swap an access?  When a dirty pagecache page is
> written out by kswapd or direct reclaim?
> 
> This also should be in the permanent documentation.

OK, will add.

> 
> > To mark a page idle one should set the bit corresponding to the
> >    page by writing to the file. A value written to the file is OR-ed with the
> >    current bitmap value. Only user memory pages can be marked idle, for other
> >    page types input is silently ignored. Writing to this file beyond max PFN
> >    results in the ENXIO error. Only available when CONFIG_IDLE_PAGE_TRACKING is
> >    set.
> > 
> >    This file can be used to estimate the amount of pages that are not
> >    used by a particular workload as follows:
> > 
> >    1. mark all pages of interest idle by setting corresponding bits in the
> >       /proc/kpageidle bitmap
> >    2. wait until the workload accesses its working set
> >    3. read /proc/kpageidle and count the number of bits set
> 
> Security implications.  This interface could be used to learn about a
> sensitive application by poking data at it and then observing its
> memory access patterns.  Perhaps this is why the proc files are
> root-only (whcih I assume is sufficient). 

That's one point. Another point is that if we allow unprivileged users
to access it, they may interfere with the system-wide daemon doing the
regular scan and estimating the system wss.

> Some words here about the security side of things and the reasoning
> behind the chosen permissions would be good to have.
> 
> >  * /proc/kpagecgroup.  This file contains a 64-bit inode number of the
> >    memory cgroup each page is charged to, indexed by PFN.
> 
> Actually "closest online ancestor".  This also should be in the
> interface documentation.

Actually, the userspace knows nothing about online/offline cgroups,
because all cgroups used to be online and charge re-parenting was used
to forcibly empty a memcg on deletion. Anyways, I'll add a note.

> 
> > Only available when CONFIG_MEMCG is set.
> 
> CONFIG_MEMCG and CONFIG_IDLE_PAGE_TRACKING I assume?

No, it's present iff CONFIG_PROC_PAGE_MONITOR && CONFIG_MEMCG, because
it might be useful even w/o CONFIG_IDLE_PAGE_TRACKING, e.g. in order to
find out which memcg  pages of a particular process are accounted to.

> 
> > 
> >    This file can be used to find all pages (including unmapped file
> >    pages) accounted to a particular cgroup. Using /proc/kpageidle, one
> >    can then estimate the cgroup working set size.
> > 
> > For an example of using these files for estimating the amount of unused
> > memory pages per each memory cgroup, please see the script attached
> > below.
> 
> Why were these put in /proc anyway?  Rather than under /sys/fs/cgroup
> somewhere?  Presumably because /proc/kpageidle is useful in non-memcg
> setups.

Yes, one might use it for estimating active wss of a single process or
the whole system.

> 
> > ---- PERFORMANCE EVALUATION ----
> 
> "^___" means "end of changelog".  Perhaps that should have been
> "^---\n" - unclear.

Sorry :-/

> 
> > Documentation/vm/pagemap.txt           |  22 ++-
> 
> I think we'll need quite a lot more than this to fully describe the
> interface?

Agree, the documentation sucks :-( Will try to forge something more
thorough.

Thanks,
Vladimir

  reply	other threads:[~2015-07-22 16:23 UTC|newest]

Thread overview: 57+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-07-19 12:31 [PATCH -mm v9 0/8] idle memory tracking Vladimir Davydov
2015-07-19 12:31 ` [PATCH -mm v9 1/8] memcg: add page_cgroup_ino helper Vladimir Davydov
2015-07-21 23:34   ` Andrew Morton
2015-07-22  9:21     ` Vladimir Davydov
2015-07-19 12:31 ` [PATCH -mm v9 2/8] hwpoison: use page_cgroup_ino for filtering by memcg Vladimir Davydov
2015-07-21 23:34   ` Andrew Morton
     [not found]     ` <20150721163412.1b44e77f5ac3b742734d1ce6-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
2015-07-22  9:45       ` Vladimir Davydov
2015-07-19 12:31 ` [PATCH -mm v9 3/8] memcg: zap try_get_mem_cgroup_from_page Vladimir Davydov
2015-07-19 12:31 ` [PATCH -mm v9 4/8] proc: add kpagecgroup file Vladimir Davydov
2015-07-21 23:34   ` Andrew Morton
     [not found]     ` <20150721163433.618855e1f61536a09dfac30b-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
2015-07-22 10:33       ` Vladimir Davydov
2015-07-19 12:31 ` [PATCH -mm v9 5/8] mmu-notifier: add clear_young callback Vladimir Davydov
2015-07-20 18:34   ` Andres Lagar-Cavilla
2015-07-21  8:51     ` Vladimir Davydov
2015-07-22 16:33       ` Vladimir Davydov
2015-07-19 12:31 ` [PATCH -mm v9 6/8] proc: add kpageidle file Vladimir Davydov
2015-07-21 23:34   ` Andrew Morton
     [not found]     ` <20150721163452.c1e4075a2b193bcd325fad56-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
2015-07-22 15:20       ` Vladimir Davydov
     [not found]   ` <d7a78b72053cf529c0c9ff6cbc02ffbb3d58fe35.1437303956.git.vdavydov-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2015-07-24 14:08     ` Paul Gortmaker
     [not found]       ` <CAP=VYLqiNfQJ6oyQg2GszeHwdOmeY_uD3XPvw=++weJOKdx4_g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-07-24 14:17         ` Vladimir Davydov
2015-07-19 12:31 ` [PATCH -mm v9 7/8] proc: export idle flag via kpageflags Vladimir Davydov
2015-07-21 23:35   ` Andrew Morton
     [not found]     ` <20150721163500.528bd39bbbc71abc3c8d429b-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
2015-07-22 16:25       ` Vladimir Davydov
2015-07-22 19:44         ` Andrew Morton
2015-07-22 20:46           ` Andres Lagar-Cavilla
2015-07-23  7:57             ` Vladimir Davydov
2015-07-19 12:31 ` [PATCH -mm v9 8/8] proc: add cond_resched to /proc/kpage* read/write loop Vladimir Davydov
     [not found] ` <cover.1437303956.git.vdavydov-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2015-07-19 12:37   ` [PATCH -mm v9 0/8] idle memory tracking Vladimir Davydov
2015-07-21 21:39 ` Andres Lagar-Cavilla
2015-07-21 23:34 ` Andrew Morton
2015-07-22 16:23   ` Vladimir Davydov [this message]
2015-07-25 16:24     ` Vladimir Davydov
2015-07-27 19:18   ` Kees Cook
2015-07-27 19:25     ` Andrew Morton
2015-07-29 12:36 ` Michal Hocko
     [not found]   ` <20150729123629.GI15801-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2015-07-29 13:59     ` Vladimir Davydov
2015-07-29 14:12       ` Michel Lespinasse
     [not found]         ` <CANN689HJX2ZL891uOd8TW9ct4PNH9d5odQZm86WMxkpkCWhA-w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-07-29 14:13           ` Michel Lespinasse
2015-07-29 14:45           ` Vladimir Davydov
2015-07-29 15:08             ` Michel Lespinasse
     [not found]               ` <CANN689Euq3Y-CHQo8q88vzFAYZX4S6rK+rZRfbuSKfS74u=gcg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-07-29 15:31                 ` Vladimir Davydov
2015-07-29 15:34                   ` Michel Lespinasse
2015-07-29 15:08             ` Michal Hocko
     [not found]               ` <20150729150855.GM15801-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2015-07-29 15:36                 ` Vladimir Davydov
2015-07-29 15:58                   ` Michal Hocko
2015-07-29 14:26       ` Michal Hocko
2015-07-29 15:28         ` Vladimir Davydov
2015-07-29 15:47           ` Michal Hocko
     [not found]             ` <20150729154718.GN15801-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2015-07-29 16:29               ` Vladimir Davydov
2015-07-29 21:30                 ` Andrew Morton
2015-07-30  9:12                   ` Vladimir Davydov
2015-07-30 13:01                     ` Vladimir Davydov
2015-07-31  9:34                       ` Vladimir Davydov
2015-07-30  9:07                 ` Michal Hocko
     [not found]                   ` <20150730090708.GE9387-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2015-07-30  9:31                     ` Vladimir Davydov
2015-07-29 15:55           ` Andres Lagar-Cavilla
     [not found]             ` <CAJu=L59RdowYjTyVM0Vhz79A4d=d8=ZmU7PB59CmEj5B0_c48Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-07-29 16:37               ` Vladimir Davydov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150722162353.GM23374@esperanza \
    --to=vdavydov@parallels.com \
    --cc=akpm@linux-foundation.org \
    --cc=andreslc@google.com \
    --cc=cgroups@vger.kernel.org \
    --cc=corbet@lwn.net \
    --cc=gorcunov@openvz.org \
    --cc=gthelen@google.com \
    --cc=hannes@cmpxchg.org \
    --cc=keescook@chromium.org \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.cz \
    --cc=minchan@kernel.org \
    --cc=raghavendra.kt@linux.vnet.ibm.com \
    --cc=rientjes@google.com \
    --cc=walken@google.com \
    --cc=xemul@parallels.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).