From: Greg Thelen <gthelen-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
To: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: Konstantin Khlebnikov
	<khlebnikov-XoJtRXgx1JseBXzfvpsJ4g@public.gmane.org>,
	Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>,
	Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>,
	Cgroups <cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
	"linux-mm@kvack.org"
	<linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org>,
	"linux-kernel@vger.kernel.org"
	<linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
	Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>,
	Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org>,
	Jens Axboe <axboe-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>,
	Christoph Hellwig <hch-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>,
	Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>,
	Hugh Dickins <hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Subject: Re: [RFC] Making memcg track ownership per address_space or anon_vma
Date: Wed, 04 Feb 2015 15:51:01 -0800
Message-ID: <xr93zj8ti6ca.fsf@gthelen.mtv.corp.google.com>
In-Reply-To: <20150204170656.GA18858-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>


On Wed, Feb 04 2015, Tejun Heo wrote:

> Hello,
>
> On Tue, Feb 03, 2015 at 03:30:31PM -0800, Greg Thelen wrote:
>> If a machine has several top-level memcgs trying to get some form of
>> isolation (using low, min, or soft limits) then a shared libc will be
>> moved to the root memcg where it's not protected from global memory
>> pressure.  At least with the current per-page accounting, such shared
>> pages often land in some protected memcg.
>
> Yes, it becomes interesting with the low limit as the pressure
> direction is reversed, but at the same time overcommitting low limits
> doesn't lead to a sane setup to begin with, as it's asking for global
> OOMs anyway.  That means things like libc would end up competing at
> least fairly with other pages under global pressure and should stay
> in memory under most circumstances, which may or may not be
> sufficient.

I agree.  One clarification: I don't plan to overcommit low or min
limits.  On machines without overcommitted min limits, the existing
system offers some protection for shared libs from global reclaim;
pushing them to root doesn't.

> Hmm.... I need to think more about it, but this only becomes a problem
> with the root cgroup because it doesn't have a min setting, which is
> expected to be inclusive of all descendants, right?  Maybe the right
> thing to do here is to treat the inodes which get pushed to the root
> as a special case: we can implement a mechanism where the root is
> effectively borrowing from the mins of its children, which doesn't have
> to be completely correct - e.g. just charge it against all children
> repeatedly and, if any has min protection, put it under min protection.
> IOW, make it the baseload for all of them.

I think the linux-next low (and the TBD min) limits also have this
problem for more than just the root memcg.  Consider a 2M file shared
between C and D below.  The file will be charged to the common parent
B.

	A
	+-B    (usage=2M lim=3M min=2M)
	  +-C  (usage=0  lim=2M min=1M shared_usage=2M)
	  +-D  (usage=0  lim=2M min=1M shared_usage=2M)
	  \-E  (usage=0  lim=2M min=0)

The problem arises if A/B/E allocates more than 1M of private
reclaimable file data.  This pushes A/B into reclaim, which will reclaim
both the shared file from A/B and the private file data from A/B/E.  In
contrast, the current per-page memcg would've protected the shared file
in either C or D, leaving A/B reclaim to attack only A/B/E.
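
To spell out the arithmetic, here's a toy model of the two charging
schemes (plain Python, purely illustrative; not kernel code):

# Reclaim may only take what exceeds a memcg's min protection.
def reclaimable(usage_mb, min_mb):
    return max(0, usage_mb - min_mb)

# Inode-granularity charging: the 2M shared file is charged to B itself,
# so no child's min shields it.  Once E's >1M of private data pushes B
# past its 3M limit, the shared file competes head to head with E's data.
print("at B level:", reclaimable(2, 0), "M of the shared file exposed")

# Per-page charging: the shared pages would sit in C (min=1M), keeping
# at least part of the file protected while E (min=0) is fully exposed.
print("in C:", reclaimable(2, 1), "M exposed;",
      "in E:", reclaimable(2, 0), "M exposed")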

Pinning the shared file to either C or D, using a TBD policy such as a
mount option, would solve this for tightly shared files.  But for a
wide-fanout file (e.g. libc) the admin would need to assign a global
bucket, and this would be a pain to size due to varying job
requirements.

>> If two cgroups collude they can use more memory than their limit and
>> OOM the entire machine.  Admittedly the current per-page system isn't
>> perfect, because deleting a memcg which contains mlocked memory
>> (referenced by a remote memcg) moves the mlocked memory to root,
>> resulting in the same issue.  But I'd argue this is more likely with
>
> Hmmm... why does it do that?  Can you point me to where it's
> happening?

My mistake, I was thinking of older kernels which reparent memory.
Though I can't say v3.19-rc7 handles this collusion any better: instead
of reparenting the mlocked memory, it's left in an invisible (offline)
memcg.  Unlike older kernels, the memory doesn't appear in
root/memory.stat[unevictable]; instead it's buried in
root/memory.stat[total_unevictable], which includes mlocked memory in
visible (online) and invisible (offline) children.
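
For anyone who wants to see this on a live system, the gap shows up as
the difference between the local and hierarchical counters (a sketch,
assuming a v1 memory controller mounted at the conventional
/sys/fs/cgroup/memory):

# Compare the root memcg's local vs. hierarchical unevictable counters.
# The difference is unevictable memory charged to descendants, which
# includes invisible (offline) memcgs.
stats = {}
with open("/sys/fs/cgroup/memory/memory.stat") as f:
    for line in f:
        key, val = line.split()
        stats[key] = int(val)

hidden = stats["total_unevictable"] - stats["unevictable"]
print("unevictable bytes charged to descendants:", hidden)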

>> the RFC because it doesn't involve the cgroup deletion/reparenting.  A
>
> One approach could be expanding on the aforementioned scheme and
> making all sharing cgroups get charged for the shared inodes they're
> using, which should render such collusions entirely pointless.
> e.g. let's say we start with the following.
>
> 	A   (usage=48M)
> 	+-B (usage=16M)
> 	\-C (usage=32M)
>
> And let's say C starts accessing an inode which is 8M and currently
> associated with B.
>
> 	A   (usage=48M, hosted= 8M)
> 	+-B (usage= 8M, shared= 8M)
> 	\-C (usage=32M, shared= 8M)
>
> The only extra charging that we'd be doing is charging C with an
> extra 8M.  Let's say another cgroup D gets created and uses 4M.
>
> 	A   (usage=56M, hosted= 8M)
> 	+-B (usage= 8M, shared= 8M)
> 	+-C (usage=32M, shared= 8M)
> 	\-D (usage= 8M)
>
> and it also accesses the inode.
>
> 	A   (usage=56M, hosted= 8M)
> 	+-B (usage= 8M, shared= 8M)
> 	+-C (usage=32M, shared= 8M)
> 	\-D (usage= 8M, shared= 8M)
>
> We'd need to track the shared charges separately as they should count
> only once in the parent but that shouldn't be too hard.  The problem
> here is that we'd need to track which inodes are being accessed by
> which children, which can get painful for things like libc.  Maybe we
> can limit it to be level-by-level - track sharing only from the
> immediate children and always move a shared inode at one level at a
> time.  That would lose some ability to track the sharing beyond the
> immediate children but it should be enough to solve the root case and
> allow us to adapt to changing usage pattern over time.  Given that
> sharing is mostly a corner case, this could be good enough.
>
> Now, suppose D accesses a 4M area of the inode which hasn't been
> accessed by others yet.  We'd want it to look like the following.
>
> 	A   (usage=64M, hosted=16M)
> 	+-B (usage= 8M, shared=16M)
> 	+-C (usage=32M, shared=16M)
> 	\-D (usage= 8M, shared=16M)
>
> But charging it to B and C at the same time probably wouldn't be
> particularly convenient.  We can probably just do D -> A charging and
> let B and C sort themselves out later.  Note that such charging would
> still maintain the overall integrity of memory limits.  The only thing
> which may overflow is the pseudo shared charges used to keep sharing
> in check, and dealing with them later, when B and C try to create
> further charges, should be completely fine.
>
> Note that we can also try to split the shared charge across the users;
> however, charging the full amount seems like the better approach to
> me.  We don't have any way to tell how the usage is distributed
> anyway.  For use cases where this sort of sharing is expected, I think
> it's perfectly reasonable to provision the sharing children to have
> enough to accommodate the possible full size of the shared resource.
>
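
To make sure I follow the bookkeeping, here's a toy model of the first
transition above (plain Python, illustrative only; "shared" and
"hosted" mirror the labels in the diagrams):

# A shared inode is charged once at the hosting parent ("hosted"), while
# every sharing child carries a pseudo charge for the full inode size
# ("shared"), so colluding children gain nothing by sharing a file.
from dataclasses import dataclass

@dataclass
class Memcg:
    name: str
    usage: int = 0    # MB charged here (for A, the subtree total)
    shared: int = 0   # MB of pseudo charges for inodes shared w/ siblings
    hosted: int = 0   # MB of shared inodes hosted on behalf of children

a = Memcg("A", usage=48)
b = Memcg("B", usage=16)
c = Memcg("C", usage=32)

def start_sharing(host, inode_mb, owner, new_sharer):
    """Move an inode from `owner` up to `host` once `new_sharer` uses it."""
    owner.usage -= inode_mb        # no longer a private charge of the owner
    host.hosted += inode_mb        # counted exactly once, at the parent
    owner.shared += inode_mb       # ...but both sharers carry the full
    new_sharer.shared += inode_mb  # size as a pseudo charge

start_sharing(a, 8, owner=b, new_sharer=c)
# A's subtree usage is unchanged: the 8M just moved from B's private
# charges to A's hosted charges, and it still counts only once.
assert (a.usage, a.hosted) == (48, 8)   # A: usage=48M, hosted=8M
assert (b.usage, b.shared) == (8, 8)    # B: usage=8M,  shared=8M
assert c.shared == 8                    # C: shared=8M
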
>> possible tweak to shore up the current system is to move such mlocked
>> pages to the memcg of the surviving locker.  When the machine is OOM,
>> it's often nice to examine memcg state to determine which container is
>> using the memory.  Tracking down who's contributing to a shared
>> container is non-trivial.
>> 
>> I actually have a set of patches which add a memcg=M mount option to
>> memory-backed filesystems.  I was planning on proposing them
>> regardless of this RFC, and this discussion makes them even more
>> appealing.  If we go in this direction, then we'd need a similar
>> notion for disk-based filesystems.  As Konstantin suggested, it'd be
>> really nice to specify the charge policy on a per-file, per-directory,
>> or per-bind-mount basis.  This allows shared files to be deterministically
>
> I'm not too sure about that.  We might add that later if absolutely
> justifiable, but designing while assuming that level of intervention
> from userland may not be such a good idea.
>
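
For concreteness, what I have in mind would look something along these
lines (hypothetical syntax, since the patches aren't posted; the option
name, memcg path, and mount point are all placeholders):

# Hypothetical illustration only: mount a tmpfs whose pages are always
# charged to a designated memcg, regardless of which cgroup touches them.
import subprocess

subprocess.run(
    ["mount", "-t", "tmpfs",
     "-o", "memcg=/sys/fs/cgroup/memory/shared_bucket",
     "tmpfs", "/mnt/shared"],
    check=True)
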
>> When there's large incidental sharing, things get sticky.  A
>> periodic filesystem scanner (e.g. a virus scanner, or grep foo -r /)
>> in a small container would pull all pages to the root memcg, where
>> they are exposed to root pressure, which breaks isolation.  This is
>> concerning.  Perhaps such accesses could be decorated with something
>> like O_NO_MOVEMEM.
>
> If such a thing is really necessary, FADV_NOREUSE would be a better
> indicator; however, yes, such incidental sharing is easier to handle
> with a per-page scheme, as such a scanner can be limited in the number
> of pages it can carry throughout its operation regardless of which
> cgroup it's looking at.  It still has the nasty corner case where
> random target cgroups can latch onto pages faulted in by the scanner
> and keep accessing them, though, so, even now, FADV_NOREUSE would be a
> good idea.  Note that such scanning, if repeated on cgroups under high
> memory pressure, is *likely* to accumulate a residue of escaped pages,
> and if such a management cgroup is transient, those escaped pages will
> accumulate over time outside any limit in a way which is unpredictable
> and invisible.
>
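
For what it's worth, a scanner can already decorate its reads from user
space today; a rough sketch (Python; the target path is just an example,
and note the kernel has historically treated POSIX_FADV_NOREUSE as a
no-op, so DONTNEED is the blunt fallback):

# Scan a file while hinting that its pages should not be kept: NOREUSE
# declares the intent up front, DONTNEED drops whatever the scan pulled
# in before another cgroup can latch onto it.
import os

def scan_file(path):
    fd = os.open(path, os.O_RDONLY)
    try:
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_NOREUSE)
        while os.read(fd, 1 << 20):      # read in 1M chunks
            pass
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)

scan_file("/usr/lib/libc.so.6")          # example target
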
>> So this RFC change will introduce significant change to user-space
>> machine managers and perturb isolation.  Is the resulting system
>> better?  It's not clear; it's the devil known vs. the devil unknown.
>> Maybe it'd be easier if the memcgs I'm talking about were not allowed
>> to share page cache (aka copy-on-read) even for files which are
>> jointly visible.  That would provide today's interface while avoiding
>> the problematic sharing.
>
> Yeah, compatibility would be the stickiest part.
>
> Thanks.
