From: Vladimir Davydov <vdavydov@virtuozzo.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>,
Andrew Morton <akpm@linux-foundation.org>,
Michal Hocko <mhocko@suse.cz>, Li Zefan <lizefan@huawei.com>,
linux-mm@kvack.org, cgroups@vger.kernel.org,
linux-kernel@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH 3/3] mm: memcontrol: fix cgroup creation failure after many small jobs
Date: Tue, 21 Jun 2016 13:16:51 +0300
Message-ID: <20160621101650.GD15970@esperanza>
In-Reply-To: <20160617162516.GD19084@cmpxchg.org>

On Fri, Jun 17, 2016 at 12:25:16PM -0400, Johannes Weiner wrote:
> The memory controller has quite a bit of state that usually outlives
> the cgroup and pins its CSS until said state disappears. At the same
> time it imposes a 16-bit limit on the CSS ID space to economically
> store IDs in the wild. Consequently, when we use cgroups to contain
> frequent but small and short-lived jobs that leave behind some page
> cache, we quickly run into the 64k limitations of outstanding CSSs.
> Creating a new cgroup fails with -ENOSPC while there are only a few,
> or even no user-visible cgroups in existence.
>
> Although pinning CSSs past cgroup removal is common, there are only
> two instances that actually need an ID after a cgroup is deleted:
> cache shadow entries and swapout records.
>
> Cache shadow entries reference the ID weakly and can deal with the CSS
> having disappeared when it's looked up later. They pose no hurdle.
>
> Swap-out records do need to pin the css to hierarchically attribute
> swapins after the cgroup has been deleted; though the only pages that
> remain swapped out after offlining are tmpfs/shmem pages. And those
> references are under the user's control, so they are manageable.
>
> This patch introduces a private 16-bit memcg ID and switches swap and
> cache shadow entries over to using that. This ID can then be recycled
> after offlining when the CSS remains pinned only by objects that don't
> specifically need it.
>
> This script demonstrates the problem by faulting one cache page in a
> new cgroup and deleting it again:
>
>   set -e
>   mkdir -p pages
>   for x in `seq 128000`; do
>     [ $((x % 1000)) -eq 0 ] && echo $x
>     mkdir /cgroup/foo
>     echo $$ >/cgroup/foo/cgroup.procs
>     echo trex >pages/$x
>     echo $$ >/cgroup/cgroup.procs
>     rmdir /cgroup/foo
>   done
>
> When run on an unpatched kernel, we eventually run out of possible IDs
> even though there are no visible cgroups:
>
> [root@ham ~]# ./cssidstress.sh
> [...]
> 65000
> mkdir: cannot create directory '/cgroup/foo': No space left on device
>
> After this patch, the IDs get released upon cgroup destruction and the
> cache and css objects get released once memory reclaim kicks in.

With 65K cgroups, the reclaimer will need a substantial amount of time
to iterate over all of them, which might result in latency spikes.
To avoid that, we could probably move pages from a dead cgroup's lru to
its parent's on offline while still leaving the dead cgroup pinned,
like we do for list_lru entries.

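A rough userspace sketch of the reparenting idea, just to show the list
manipulation. This is not kernel code: struct group, struct page_node and
group_offline_reparent() are made-up names, and the sketch ignores the
fact that real pages carry a back-pointer to their memcg that would have
to be dealt with as well.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the kernel's structures. */
struct page_node {
	struct page_node *prev, *next;
};

struct group {
	struct group *parent;
	struct page_node lru;	/* head of a circular doubly linked list */
	size_t nr_pages;
};

static void lru_init(struct page_node *head)
{
	head->prev = head->next = head;
}

static void group_init(struct group *g, struct group *parent)
{
	g->parent = parent;
	lru_init(&g->lru);
	g->nr_pages = 0;
}

/* Add a page at the head of the group's lru. */
static void lru_add(struct group *g, struct page_node *p)
{
	p->next = g->lru.next;
	p->prev = &g->lru;
	g->lru.next->prev = p;
	g->lru.next = p;
	g->nr_pages++;
}

/*
 * On offline, splice all pages of the dead group onto the tail of the
 * parent's lru. The dead group stays allocated (pinned), but reclaim no
 * longer has to visit it to find these pages.
 */
static void group_offline_reparent(struct group *dead)
{
	struct group *parent = dead->parent;
	struct page_node *first, *last;

	assert(parent != NULL);
	if (dead->lru.next == &dead->lru)
		return;	/* nothing to move */

	first = dead->lru.next;
	last = dead->lru.prev;

	/* Link [first..last] in just before the parent's head. */
	first->prev = parent->lru.prev;
	parent->lru.prev->next = first;
	last->next = &parent->lru;
	parent->lru.prev = last;

	parent->nr_pages += dead->nr_pages;
	dead->nr_pages = 0;
	lru_init(&dead->lru);
}
```

With something like this, the reclaimer would only walk live cgroups,
and the dead ones would hold no pages of their own.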
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
One nit below.
...
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 75e74408cc8f..dc92b2df2585 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4057,6 +4057,60 @@ static struct cftype mem_cgroup_legacy_files[] = {
> { }, /* terminate */
> };
>
> +/*
> + * Private memory cgroup IDR
> + *
> + * Swap-out records and page cache shadow entries need to store memcg
> + * references in constrained space, so we maintain an ID space that is
> + * limited to 16 bit (MEM_CGROUP_ID_MAX), limiting the total number of
> + * memory-controlled cgroups to 64k.
> + *
> + * However, there usually are many references to the offline CSS after
> + * the cgroup has been destroyed, such as page cache or reclaimable
> + * slab objects, that don't need to hang on to the ID. We want to keep
> + * those dead CSS from occupying IDs, or we might quickly exhaust the
> + * relatively small ID space and prevent the creation of new cgroups
> + * even when there are much fewer than 64k cgroups - possibly none.
> + *
> + * Maintain a private 16-bit ID space for memcg, and allow the ID to
> + * be freed and recycled when it's no longer needed, which is usually
> + * when the CSS is offlined.
> + *
> + * The only exception to that are records of swapped out tmpfs/shmem
> + * pages that need to be attributed to live ancestors on swapin. But
> + * those references are manageable from userspace.
> + */
> +
> +static struct idr mem_cgroup_idr;
static DEFINE_IDR(mem_cgroup_idr);
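
For what it's worth, the recycling described in the quoted comment can be
modelled in userspace roughly as below. This is only a sketch under
stated assumptions: the in-use table, the allocation hint and run_jobs()
are invented for illustration, ID 0 is assumed to be reserved, and the
patch itself uses the kernel IDR rather than anything like this.

```c
#include <assert.h>
#include <stdbool.h>

#define ID_MAX 65535	/* 16-bit space; ID 0 assumed reserved as "no ID" */

/* Hypothetical userspace stand-in for an IDR: a simple in-use table. */
static bool id_in_use[ID_MAX + 1];

/* Return a free ID in [1, ID_MAX], or -1 when the space is exhausted. */
static int id_alloc(void)
{
	static int hint = 1;	/* where to start looking; always in [1, ID_MAX] */

	for (int n = 0; n < ID_MAX; n++) {
		int id = (hint - 1 + n) % ID_MAX + 1;

		if (!id_in_use[id]) {
			id_in_use[id] = true;
			hint = id % ID_MAX + 1;
			return id;
		}
	}
	return -1;
}

static void id_free(int id)
{
	assert(id >= 1 && id <= ID_MAX && id_in_use[id]);
	id_in_use[id] = false;
}

/*
 * Create and delete `jobs` short-lived groups, one at a time. With
 * release_on_offline the ID is returned at deletion; without it the ID
 * stays taken, as if leftover page cache were still pinning it.
 * Returns the number of groups that could be created.
 */
static int run_jobs(int jobs, bool release_on_offline)
{
	int created = 0;

	for (int i = 0; i < jobs; i++) {
		int id = id_alloc();

		if (id < 0)
			break;	/* the analogue of mkdir failing with ENOSPC */
		created++;
		if (release_on_offline)
			id_free(id);
	}
	return created;
}
```

In this model, run_jobs(128000, true) gets through all 128000
iterations, whereas run_jobs(128000, false) stops once the ID space is
used up, even though no group is visible anymore at that point.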