All of lore.kernel.org
 help / color / mirror / Atom feed
From: Vladimir Davydov <vdavydov@parallels.com>
To: Greg Thelen <gthelen@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Michal Hocko <mhocko@suse.cz>, Glauber Costa <glommer@gmail.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	devel@openvz.org, Christoph Lameter <cl@linux-foundation.org>,
	Pekka Enberg <penberg@kernel.org>
Subject: Re: [PATCH -mm 1/4] sl[au]b: do not charge large allocations to memcg
Date: Fri, 28 Mar 2014 11:56:29 +0400	[thread overview]
Message-ID: <53352B2D.4000306@parallels.com> (raw)
In-Reply-To: <CAHH2K0YFB9yXF_oyxhQt9EiD_kuBuK7py6ah8YEy2H70P8SC_A@mail.gmail.com>

On 03/28/2014 12:42 AM, Greg Thelen wrote:
> On Thu, Mar 27, 2014 at 12:37 AM, Vladimir Davydov
> <vdavydov@parallels.com> wrote:
>> On 03/27/2014 08:31 AM, Greg Thelen wrote:
>>> Before this change both of the following allocations are charged to
>>> memcg (assuming kmem accounting is enabled):
>>>  a = kmalloc(KMALLOC_MAX_CACHE_SIZE, GFP_KERNEL)
>>>  b = kmalloc(KMALLOC_MAX_CACHE_SIZE + 1, GFP_KERNEL)
>>>
>>> After this change only 'a' is charged; 'b' goes directly to page
>>> allocator which no longer does accounting.
>>
>> Why do we need to charge 'b' in the first place? Can the userspace
>> trigger such allocations massively? If there can only be one or two such
>> allocations from a cgroup, is there any point in charging them?
> 
> Of the top of my head I don't know of any >8KIB kmalloc()s so I can't
> say if they're directly triggerable by user space en masse.  But we
> recently ran into some order:3 allocations in networking.  The
> networking allocations used a non-generic kmem_cache (rather than
> kmalloc which started this discussion).  For details, see ed98df3361f0
> ("net: use __GFP_NORETRY for high order allocations").  I can't say if
> such allocations exist in device drivers, but given the networking
> example, it's conceivable that they may (or will) exist.

Hmm, also not sure about device drivers, but the sock frag pages you
mentioned are worth charging I guess.

For such non-generic kmem allocations, we have two options: either go
with __GFP_KMEMCG, or introduce special alloc/free_kmem_pages methods,
which would be used instead of kmalloc for large allocations (e.g.
threadinfo, sock frags). I vote for the second, because I dislike having
kmemcg charging in the general allocation path.

Anyway, that brings us back to the necessity to reliably track arbitrary
pages in kmemcg to allow reparenting.

> With slab this isn't a problem because sla has kmalloc kmem_caches for
> all supported allocation sizes.  However, slub shows this issue for
> any kmalloc() allocations larger than 8KIB (at least on x86_64).  It
> seems like a strange directly to take kmem accounting to say that
> kmalloc allocations are kmem limited, but only if they are either less
> than a threshold size or done with slab.  Simply increasing the size
> of a data structure doesn't seem like it should automatically cause
> the memory to become exempt from kmem limits.

Sounds fair.

>> In fact, do we actually need to charge every random kmem allocation? I
>> guess not. For instance, filesystems often allocate data shared among
>> all the FS users. It's wrong to charge such allocations to a particular
>> memcg, IMO. That said the next step is going to be adding a per kmem
>> cache flag specifying if allocations from this cache should be charged
>> so that accounting will work only for those caches that are marked so
>> explicitly.
> 
> It's a question of what direction to approach kmem slab accounting
> from: either opt-out (as the code currently is), or opt-in (with per
> kmem_cache flags as you suggest).  I agree that some structures end up
> being shared (e.g. filesystem block bit map structures).  In an
> opt-out system these are charged to a memcg initially and remain
> charged there until the memcg is deleted at which point the shared
> objects are reparented to a shared location.  While this isn't
> perfect, it's unclear if it's better or worse than analyzing each
> class of allocation and deciding if they should be opt'd-in.  One
> could (though I'm not) make the case that even dentries are easily
> shareable between containers and thus shouldn't be accounted to a
> single memcg.  But given user space's ability to DoS a machine with
> dentires, they should be accounted.

Again, you must be right. After a bit of thinking I agree that deciding
which caches should be accounted and which shouldn't would be
cumbersome. Opt-out would be clearer, I guess.

>> There is one more argument for removing kmalloc_large accounting - we
>> don't have an easy way to track such allocations, which prevents us from
>> reparenting kmemcg charges on css offline. Of course, we could link
>> kmalloc_large pages in some sort of per-memcg list which would allow us
>> to find them on css offline, but I don't think such a complication is
>> justified.
> 
> I assume that reparenting of such non kmem_cache allocations (e.g.
> large kmalloc) is difficult because such pages refer to the memcg,
> which we're trying to delete and the memcg has no index of such pages.
>  If such zombie memcg are undesirable, then an alternative to indexing
> the pages is to define a kmem context object which such large pages
> point to.  The kmem context would be reparented without needing to
> adjust the individual large pages.  But there are plenty of options.

I like the idea about the context object. For usual kmalloc'ed data, we
already have one - the kmem_cache itself. For non-generic kmem (e.g.
threadinfo pages), we could easily introduce a separate one with the
pointer to the owning memcg on it. Reparenting wouldn't be a problem at
all then.

I guess I'll give it a try in the next iteration. Thank you!

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

WARNING: multiple messages have this Message-ID (diff)
From: Vladimir Davydov <vdavydov@parallels.com>
To: Greg Thelen <gthelen@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Michal Hocko <mhocko@suse.cz>, Glauber Costa <glommer@gmail.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>, <devel@openvz.org>,
	Christoph Lameter <cl@linux-foundation.org>,
	Pekka Enberg <penberg@kernel.org>
Subject: Re: [PATCH -mm 1/4] sl[au]b: do not charge large allocations to memcg
Date: Fri, 28 Mar 2014 11:56:29 +0400	[thread overview]
Message-ID: <53352B2D.4000306@parallels.com> (raw)
In-Reply-To: <CAHH2K0YFB9yXF_oyxhQt9EiD_kuBuK7py6ah8YEy2H70P8SC_A@mail.gmail.com>

On 03/28/2014 12:42 AM, Greg Thelen wrote:
> On Thu, Mar 27, 2014 at 12:37 AM, Vladimir Davydov
> <vdavydov@parallels.com> wrote:
>> On 03/27/2014 08:31 AM, Greg Thelen wrote:
>>> Before this change both of the following allocations are charged to
>>> memcg (assuming kmem accounting is enabled):
>>>  a = kmalloc(KMALLOC_MAX_CACHE_SIZE, GFP_KERNEL)
>>>  b = kmalloc(KMALLOC_MAX_CACHE_SIZE + 1, GFP_KERNEL)
>>>
>>> After this change only 'a' is charged; 'b' goes directly to page
>>> allocator which no longer does accounting.
>>
>> Why do we need to charge 'b' in the first place? Can the userspace
>> trigger such allocations massively? If there can only be one or two such
>> allocations from a cgroup, is there any point in charging them?
> 
> Of the top of my head I don't know of any >8KIB kmalloc()s so I can't
> say if they're directly triggerable by user space en masse.  But we
> recently ran into some order:3 allocations in networking.  The
> networking allocations used a non-generic kmem_cache (rather than
> kmalloc which started this discussion).  For details, see ed98df3361f0
> ("net: use __GFP_NORETRY for high order allocations").  I can't say if
> such allocations exist in device drivers, but given the networking
> example, it's conceivable that they may (or will) exist.

Hmm, also not sure about device drivers, but the sock frag pages you
mentioned are worth charging I guess.

For such non-generic kmem allocations, we have two options: either go
with __GFP_KMEMCG, or introduce special alloc/free_kmem_pages methods,
which would be used instead of kmalloc for large allocations (e.g.
threadinfo, sock frags). I vote for the second, because I dislike having
kmemcg charging in the general allocation path.

Anyway, that brings us back to the necessity to reliably track arbitrary
pages in kmemcg to allow reparenting.

> With slab this isn't a problem because sla has kmalloc kmem_caches for
> all supported allocation sizes.  However, slub shows this issue for
> any kmalloc() allocations larger than 8KIB (at least on x86_64).  It
> seems like a strange directly to take kmem accounting to say that
> kmalloc allocations are kmem limited, but only if they are either less
> than a threshold size or done with slab.  Simply increasing the size
> of a data structure doesn't seem like it should automatically cause
> the memory to become exempt from kmem limits.

Sounds fair.

>> In fact, do we actually need to charge every random kmem allocation? I
>> guess not. For instance, filesystems often allocate data shared among
>> all the FS users. It's wrong to charge such allocations to a particular
>> memcg, IMO. That said the next step is going to be adding a per kmem
>> cache flag specifying if allocations from this cache should be charged
>> so that accounting will work only for those caches that are marked so
>> explicitly.
> 
> It's a question of what direction to approach kmem slab accounting
> from: either opt-out (as the code currently is), or opt-in (with per
> kmem_cache flags as you suggest).  I agree that some structures end up
> being shared (e.g. filesystem block bit map structures).  In an
> opt-out system these are charged to a memcg initially and remain
> charged there until the memcg is deleted at which point the shared
> objects are reparented to a shared location.  While this isn't
> perfect, it's unclear if it's better or worse than analyzing each
> class of allocation and deciding if they should be opt'd-in.  One
> could (though I'm not) make the case that even dentries are easily
> shareable between containers and thus shouldn't be accounted to a
> single memcg.  But given user space's ability to DoS a machine with
> dentires, they should be accounted.

Again, you must be right. After a bit of thinking I agree that deciding
which caches should be accounted and which shouldn't would be
cumbersome. Opt-out would be clearer, I guess.

>> There is one more argument for removing kmalloc_large accounting - we
>> don't have an easy way to track such allocations, which prevents us from
>> reparenting kmemcg charges on css offline. Of course, we could link
>> kmalloc_large pages in some sort of per-memcg list which would allow us
>> to find them on css offline, but I don't think such a complication is
>> justified.
> 
> I assume that reparenting of such non kmem_cache allocations (e.g.
> large kmalloc) is difficult because such pages refer to the memcg,
> which we're trying to delete and the memcg has no index of such pages.
>  If such zombie memcg are undesirable, then an alternative to indexing
> the pages is to define a kmem context object which such large pages
> point to.  The kmem context would be reparented without needing to
> adjust the individual large pages.  But there are plenty of options.

I like the idea about the context object. For usual kmalloc'ed data, we
already have one - the kmem_cache itself. For non-generic kmem (e.g.
threadinfo pages), we could easily introduce a separate one with the
pointer to the owning memcg on it. Reparenting wouldn't be a problem at
all then.

I guess I'll give it a try in the next iteration. Thank you!

  reply	other threads:[~2014-03-28  7:56 UTC|newest]

Thread overview: 38+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-03-26 15:28 [PATCH -mm 0/4] kmemcg: get rid of __GFP_KMEMCG Vladimir Davydov
2014-03-26 15:28 ` Vladimir Davydov
2014-03-26 15:28 ` [PATCH -mm 1/4] sl[au]b: do not charge large allocations to memcg Vladimir Davydov
2014-03-26 15:28   ` Vladimir Davydov
2014-03-26 21:53   ` Michal Hocko
2014-03-26 21:53     ` Michal Hocko
2014-03-27  7:34     ` Vladimir Davydov
2014-03-27  7:34       ` Vladimir Davydov
2014-03-27 20:40       ` Michal Hocko
2014-03-27 20:40         ` Michal Hocko
2014-03-27  4:31   ` Greg Thelen
2014-03-27  4:31     ` Greg Thelen
2014-03-27  7:37     ` Vladimir Davydov
2014-03-27  7:37       ` Vladimir Davydov
2014-03-27 20:42       ` Greg Thelen
2014-03-27 20:42         ` Greg Thelen
2014-03-28  7:56         ` Vladimir Davydov [this message]
2014-03-28  7:56           ` Vladimir Davydov
2014-03-27 20:43       ` Michal Hocko
2014-03-27 20:43         ` Michal Hocko
2014-03-28  7:58         ` Vladimir Davydov
2014-03-28  7:58           ` Vladimir Davydov
2014-03-26 15:28 ` [PATCH -mm 2/4] sl[au]b: charge slabs to memcg explicitly Vladimir Davydov
2014-03-26 15:28   ` Vladimir Davydov
2014-03-26 21:58   ` Michal Hocko
2014-03-26 21:58     ` Michal Hocko
2014-03-27  7:38     ` Vladimir Davydov
2014-03-27  7:38       ` Vladimir Davydov
2014-03-27 20:38       ` Michal Hocko
2014-03-27 20:38         ` Michal Hocko
2014-03-26 15:28 ` [PATCH -mm 3/4] fork: charge threadinfo " Vladimir Davydov
2014-03-26 15:28   ` Vladimir Davydov
2014-03-26 22:00   ` Michal Hocko
2014-03-26 22:00     ` Michal Hocko
2014-03-27  7:39     ` Vladimir Davydov
2014-03-27  7:39       ` Vladimir Davydov
2014-03-26 15:28 ` [PATCH -mm 4/4] mm: kill __GFP_KMEMCG Vladimir Davydov
2014-03-26 15:28   ` Vladimir Davydov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=53352B2D.4000306@parallels.com \
    --to=vdavydov@parallels.com \
    --cc=akpm@linux-foundation.org \
    --cc=cl@linux-foundation.org \
    --cc=devel@openvz.org \
    --cc=glommer@gmail.com \
    --cc=gthelen@google.com \
    --cc=hannes@cmpxchg.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.cz \
    --cc=penberg@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.