Re: [PATCH 2/2] mm: add a field to store names for private anonymous memory

From: Minchan Kim <minchan@kernel.org>
To: Colin Cross <ccross@android.com>
Cc: lkml <linux-kernel@vger.kernel.org>,
	Pekka Enberg <penberg@kernel.org>,
	Dave Hansen <dave.hansen@intel.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@kernel.org>, Oleg Nesterov <oleg@redhat.com>,
	"Eric W. Biederman" <ebiederm@xmission.com>,
	Jan Glauber <jan.glauber@gmail.com>,
	John Stultz <john.stultz@linaro.org>,
	Rob Landley <rob@landley.net>,
	Andrew Morton <akpm@linux-foundation.org>,
	Cyrill Gorcunov <gorcunov@openvz.org>,
	Kees Cook <keescook@chromium.org>,
	"Serge E. Hallyn" <serge.hallyn@ubuntu.com>,
	David Rientjes <rientjes@google.com>,
	Al Viro <viro@zeniv.linux.org.uk>,
	Hugh Dickins <hughd@google.com>, Rik van Riel <riel@redhat.com>,
	Mel Gorman <mgorman@suse.de>,
	Michel Lespinasse <walken@google.com>,
	Tang Chen <tangchen@cn.fujitsu.com>, Robin Holt <holt@sgi.com>,
	Shaohua Li <shli@fusionio.com>,
	Sasha Levin <sasha.levin@oracle.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	"open list:DOCUMENTATION" <linux-doc@vger.kernel.org>,
	"open list:MEMORY MANAGEMENT" <linux-mm@kvack.org>
Subject: Re: [PATCH 2/2] mm: add a field to store names for private anonymous memory
Date: Fri, 1 Nov 2013 10:30:09 +0900
Message-ID: <20131101013009.GE26080@bbox>
In-Reply-To: <CAMbhsRRqP+RHx9wRhaO-Q44mZ3_777ZuZNcGdBSXxUHPz_ne+w@mail.gmail.com>

Hello,

On Wed, Oct 30, 2013 at 02:15:37PM -0700, Colin Cross wrote:
> On Wed, Oct 16, 2013 at 7:47 PM, Minchan Kim <minchan@kernel.org> wrote:
> > On Wed, Oct 16, 2013 at 01:00:03PM -0700, Colin Cross wrote:
> >> On Tue, Oct 15, 2013 at 5:33 PM, Minchan Kim <minchan@kernel.org> wrote:
> >> > Hello,
> >> >
> >> > On Mon, Oct 14, 2013 at 06:31:17PM -0700, Colin Cross wrote:
> >> >> In many userspace applications, and especially in VM based
> >> >> applications like Android uses heavily, there are multiple different
> >> >> allocators in use.  At a minimum there is libc malloc and the stack,
> >> >> and in many cases there are libc malloc, the stack, direct syscalls to
> >> >> mmap anonymous memory, and multiple VM heaps (one for small objects,
> >> >> one for big objects, etc.).  Each of these layers usually has its own
> >> >> tools to inspect its usage; malloc by compiling a debug version, the
> >> >> VM through heap inspection tools, and for direct syscalls there is
> >> >> usually no way to track them.
> >> >>
> >> >> On Android we heavily use a set of tools that use an extended version
> >> >> of the logic covered in Documentation/vm/pagemap.txt to walk all pages
> >> >> mapped in userspace and slice their usage by process, shared (COW) vs.
> >> >> unique mappings, backing, etc.  This can account for real physical
> >> >> memory usage even in cases like fork without exec (which Android uses
> >> >> heavily to share as many private COW pages as possible between
> >> >> processes), Kernel SamePage Merging, and clean zero pages.  It
> >> >> produces a measurement of the pages that only exist in that process
> >> >> (USS, for unique), and a measurement of the physical memory usage of
> >> >> that process with the cost of shared pages being evenly split between
> >> >> processes that share them (PSS).
> >> >>
> >> >> If all anonymous memory is indistinguishable then figuring out the
> >> >> real physical memory usage (PSS) of each heap requires either a pagemap
> >> >> walking tool that can understand the heap debugging of every layer, or
> >> >> for every layer's heap debugging tools to implement the pagemap
> >> >> walking logic, in which case it is hard to get a consistent view of
> >> >> memory across the whole system.
> >> >>
> >> >> This patch adds a field to /proc/pid/maps and /proc/pid/smaps to
> >> >> show a userspace-provided name for anonymous vmas.  The names of
> >> >> named anonymous vmas are shown in /proc/pid/maps and /proc/pid/smaps
> >> >> as [anon:<name>].
> >> >>
> >> >> Userspace can set the name for a region of memory by calling
> >> >> prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name);
> >> >> Setting the name to NULL clears it.
> >> >>
> >> >> The name is stored in a user pointer in the shared union in
> >> >> vm_area_struct that points to a null terminated string inside
> >> >> the user process.  vmas that point to the same address and are
> >> >> otherwise mergeable will be merged, but vmas that point to
> >> >> equivalent strings at different addresses will not be merged.
> >> >>
> >> >> The idea to store a userspace pointer to reduce the complexity
> >> >> within mm (at the expense of the complexity of reading
> >> >> /proc/pid/mem) came from Dave Hansen.  This results in no
> >> >> runtime overhead in the mm subsystem other than comparing
> >> >> the anon_name pointers when considering vma merging.  The pointer
> >> >> is stored in a union with fields that are only used on file-backed
> >> >> mappings, so it does not increase memory usage.
> >> >
> >> > I'm not against this idea, although I haven't reviewed it in detail,
> >> > but we need a description that explains convincingly why it is hard
> >> > to do in userspace.
> >>
> >> I covered the reasoning in more detail at
> >> http://permalink.gmane.org/gmane.linux.kernel.mm/103228.  The short
> >> version is that this is useful for a system-wide look at memory,
> >> combining all processes with the kernel's knowledge of map counts and
> >> page flags to produce a measurement of what a process' actual impact
> >> on physical memory usage is.  Doing it in userspace would require
> >> collating data from every allocator in every process on the system,
> >> requiring every process to export it somehow, and then reading the
> >> kernel information anyways to get the mapping info.
> >
> > I agree that the kernel approach would be a performance win and would make
> > it easy to collect system-wide information. That's why I am not against the
> > idea: I think it would be very useful on contemporary platforms.
> > But I doubt that a vma operation is the proper way to do it.
> >
> > BTW, as Peter and I already asked, other developers will probably have the
> > same question in the future, so let's record the answer in the git log:
> > "Tracking information in userspace leads to all sorts of problems.
> > ...
> > ...
> > "
> >
> >>
> >> > I guess this feature would be used closely together with allocators,
> >> > so my concern with a kernel approach like this is that it needs the
> >> > mmap_sem write-side lock to split/merge vmas. That is exactly what
> >> > allocators (e.g. tcmalloc, jemalloc) want to avoid for performance:
> >> > they have lots of complicated logic to avoid munmap, which needs the
> >> > mmap_sem write-side lock, and this feature would defeat that logic.
> >>
> >> My expected use case is that the allocator will mmap a new large chunk
> >> of anonymous memory, and then immediately name it, resulting in taking
> >
> > That makes the new system call very limited.
> > Are you assuming that this new system call should be used very carefully
> > inside a newly invented allocator which is aware of naming? So it allocates
> > a large chunk per name, and the user has to request memory with a naming tag
> > to allocate an object from the chunk reserved for that name? Otherwise, a
> > large chunk would be split for every differently named object and allocator
> > performance would drop.
> 
> I'm not sure I understand your question.
> 
> It is normal for allocators to mmap a large chunk of anonymous memory
> and then suballocate out of it to amortize the cost of the mmap across
> multiple smaller allocations.  I'm proposing adding a second
> syscall/grabbing the mmap_sem to this already slow path.  If a
> particular allocator is limited by the mmap_sem, it can conditionally
> skip the second syscall unless a "name memory" flag is set.  I expect
> an allocator to have a single name that it always uses.  It would be

I think that's very limited.
My requirement is that I'd like to name any anon object in a process so that
a daemon in the platform could easily gather statistics for all important
objects from all of the processes which share some libraries.
For that, I don't want to replace my allocator (e.g. jemalloc) with a
naming-aware allocator like malloc(sizeofobject, "name") which could mmap a
large chunk of anonymous memory per name.

> nice to avoid having to take the mmap_sem twice either by atomically
> mmaping and naming a region of memory or by protecting the names with
> something besides mmap_sem, but I can't think of a good way to
> accomplish either.

Yes, this is allocator-related, so it is very sensitive to alloc/fault
performance. If we really care about that, we would need another data
structure to avoid the performance loss.
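
For the collecting daemon mentioned above, the reading side stays simple
whichever data structure ends up holding the names. A minimal sketch, assuming
the names are exposed in the [anon:<name>] form given in the patch description;
parse_anon_name() is a hypothetical helper and not part of the patch:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the <name> part of an "[anon:<name>]" field found in one line of
 * /proc/<pid>/maps into out (always NUL-terminated, truncated if needed).
 * Returns 1 if the line carries a name, 0 otherwise.
 */
static int parse_anon_name(const char *line, char *out, size_t outlen)
{
        static const char tag[] = "[anon:";
        const char *start, *end;
        size_t n;

        assert(line != NULL && out != NULL);

        start = strstr(line, tag);
        if (!start || outlen == 0)
                return 0;
        start += sizeof(tag) - 1;       /* skip past "[anon:" */

        end = strrchr(start, ']');      /* closing bracket of the field */
        if (!end)
                return 0;

        n = (size_t)(end - start);
        if (n >= outlen)
                n = outlen - 1;         /* truncate rather than overflow */
        memcpy(out, start, n);
        out[n] = '\0';
        return 1;
}
```

A daemon would call this on each line read from the maps or smaps file and
group the per-mapping statistics by the returned name.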

> 
> > Why couldn't we use it in the application layer rather than in the allocator
> > itself? I mean we could use it as follows:
> >
> > struct js_object *alloc_js_object(void) {
> >         /* The pool must outlive this call, so keep the pointer static. */
> >         static struct js_object *obj_pool;
> >
> >         if (pool_is_empty) {
> >                 size_t size = sizeof(*obj_pool) * POOL_SIZE;
> >
> >                 obj_pool = malloc(size);
> >                 prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, obj_pool, size, js_name);
> >         }
> >
> >         return get_a_object_from_pool(obj_pool);
> > }
> >
> > It could work with any allocator, including ones which are not aware of naming.
> > And if the pool size is bigger than a chunk, the performance loss would be small.
> >
> > Some other insane user might want to call it frequently, per object, even for
> > small objects under 4K. Why not? The result is that the vma scheme couldn't work.
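
A slightly more complete sketch of the application-layer pattern quoted above,
mapping the pool directly so that the named range is page-aligned.
alloc_named_pool() is a hypothetical helper. The two constants are the values
found in the mainline uapi header and are an assumption with respect to this
patch series. Naming is treated as best effort, so the pool remains usable on
kernels without the feature:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/prctl.h>

/* Assumed values; defined only if the installed headers lack them. */
#ifndef PR_SET_VMA
#define PR_SET_VMA 0x53564d41
#endif
#ifndef PR_SET_VMA_ANON_NAME
#define PR_SET_VMA_ANON_NAME 0
#endif

/*
 * Map len bytes of private anonymous memory and try to name the mapping.
 * With the user-pointer scheme described in this thread, name must stay
 * valid at the same address for as long as the mapping exists, so pass a
 * string literal or other static storage.
 * Returns NULL only if mmap fails; *named reports whether naming worked.
 */
static void *alloc_named_pool(size_t len, const char *name, int *named)
{
        void *pool;

        assert(named != NULL);

        pool = mmap(NULL, len, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (pool == MAP_FAILED)
                return NULL;

        /* Second syscall on the slow path; failure here is not fatal. */
        *named = prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME,
                       (unsigned long)pool, (unsigned long)len,
                       (unsigned long)name) == 0;
        return pool;
}
```

The caller pays for mmap plus one prctl per pool, which is the "mmap_sem twice
in a row" cost discussed in this thread.
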
> 
> I guess what I'm really trying to accomplish here is to name physical
> pages, which is something only the kernel can track.  Naming every

That seems to be the difference between you and me. You want to tag pages
but I want to tag objects, and objects include pages.

> page would be costly, and cause problems when different processes
> wanted different names, so the closest I can get to that is to name a
> process' view of physical pages, with the assumption that processes
> that share a page will be using it for the same thing and so won't
> name them differently.  Physical pages are a very kernel-y thing to

If the page is shared, that does make sense, but it also makes the new
system call too limited.

> track, whereas virtual address space, especially non-page-aligned
> virtual address space, is a little more nebulous on the
> kernel/userspace boundary.  Naming pages makes it clear who will name
> them - whoever requested them from the kernel.  Naming address space
> is less clear, what if the allocator names them and then the caller
> also wants to name them?

In that case, the caller comes first, because the upper layer has a clearer view.
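
To make the object-versus-page difference concrete, here is a small sketch of
why a vma-based name cannot distinguish two small objects that sit in the same
page. It assumes 4K pages, as elsewhere in this thread; struct page_span,
covering_pages() and spans_overlap() are hypothetical names used only for
illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096UL         /* assumed page size */

struct page_span {
        uintptr_t start;                /* inclusive, page-aligned */
        uintptr_t end;                  /* exclusive, page-aligned */
};

/* Smallest page-aligned range that covers [addr, addr + len). */
static struct page_span covering_pages(uintptr_t addr, size_t len)
{
        struct page_span s;

        assert(len > 0);
        s.start = addr & ~(SKETCH_PAGE_SIZE - 1);
        s.end = (addr + len + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);
        return s;
}

/* Two objects would compete for one vma name if their pages overlap. */
static int spans_overlap(struct page_span a, struct page_span b)
{
        return a.start < b.end && b.start < a.end;
}
```

A 100-byte object at 0x10010 and a 200-byte object at 0x10800 both round out to
the page at 0x10000, so only one of the two names could be recorded for it.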

> 
> >> the mmap_sem twice in a row.  This is the same pattern required for
> >> example by KSM to mark malloc'd memory as mergeable.  The avoid-munmap
> >
> > I guess the KSM use case would be very rare compared to the naming API,
> > because I expect this feature to be very useful and popular on lots of
> > platforms. Actually, our platform is considering such features, and some
> > of the stacks in our platform already have such profiling of their own,
> > although it's not system-wide.
> >
> > Why should we bind the feature to the vma? At a glance, vma binding looks
> > good, but the result is
> >
> > 1) We couldn't avoid taking mmap_sem for write.
> > 2) We couldn't represent small objects under 4K.
> >
> > Couldn't we use another data structure which represents ranges, like the
> > vrange interval tree John and I are implementing?
> >
> > So the result would be /proc/<pid>/named_anon
> >
> > It could solve both of the above problems, but it needs one more system call
> > to read /proc/<pid>/maps if you need the maps information. I imagine that
> > gathering isn't frequent, though, so it's not a big concern.
> 
> I chose to put it in the vma because the vmas cover exactly the right
> area that I want to name for my use case, and because when determining
> real system-wide memory usage only 4k aligned chunks matter.  An
> anonymous memory mmap normally results in a new vma covering exactly
> the allocation (ignoring merging with an adjacent anonymous mmap),
> which means there is normally zero memory cost to my naming.  Your
> proposal would require a vrange object for every named region.  I can
> see how it would be useful, but it would increase the cost of naming
> page-aligned regions significantly.  As an example, on one of my
> devices I have over 11,000 named regions.  Using a range_tree_node +
> userspace pointer for each one is already 500KB of memory.

On 32-bit it would be about 300K. Either way, that could be huge for an
embedded device, but your approach could need additional vm_area_structs
whenever the new system call has to split a vma, so its memory cost could be
even more significant.
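
For reference, the per-region cost implied by the figures in this exchange
(11,000 named regions, 500KB, and about 300K on 32-bit) can be worked out as
below. This is only arithmetic on the numbers quoted here; treating 1K as 1024
bytes is an assumption, and bytes_per_region() is a hypothetical helper:

```c
#include <assert.h>
#include <stddef.h>

/* Bytes per named region implied by a total cost given in KB (1024 bytes). */
static size_t bytes_per_region(size_t total_kb, size_t regions)
{
        assert(regions > 0);
        return total_kb * 1024 / regions;       /* integer division */
}
```

With these inputs the result is roughly 46 bytes per region for the 500KB
figure and roughly 27 bytes per region for the 300K figure.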

> 
> >> optimization is actually even more important if the allocator names
> >> memory, creating a new mapping + name would require the mmap_sem
> >> twice, although the total number of mmap_sem write locks is still
> >> increased with naming.
> >
> >>
> >> --
> >> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> >> the body to majordomo@kvack.org.  For more info on Linux MM,
> >> see: http://www.linux-mm.org/ .
> >> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> >
> > --
> > Kind regards,
> > Minchan Kim
> 

-- 
Kind regards,
Minchan Kim

