From: Vladimir Davydov <vdavydov@parallels.com>
To: Dave Chinner <david@fromorbit.com>
Cc: glommer@gmail.com, Balbir Singh <bsingharora@gmail.com>,
dchinner@redhat.com,
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
linux-kernel@vger.kernel.org, glommer@openvz.org, mhocko@suse.cz,
linux-mm@kvack.org, Al Viro <viro@zeniv.linux.org.uk>,
hannes@cmpxchg.org, cgroups@vger.kernel.org,
akpm@linux-foundation.org, devel@openvz.org
Subject: Re: [PATCH v13 11/16] mm: list_lru: add per-memcg lists
Date: Fri, 13 Dec 2013 00:24:08 +0400 [thread overview]
Message-ID: <52AA1B68.80302@parallels.com> (raw)
In-Reply-To: <52A986D2.6010606@parallels.com>
On 12/12/2013 01:50 PM, Vladimir Davydov wrote:
>>>>> +int memcg_list_lru_init(struct list_lru *lru)
>>>>> +{
>>>>> + int err = 0;
>>>>> + int i;
>>>>> + struct mem_cgroup *memcg;
>>>>> +
>>>>> + lru->memcg = NULL;
>>>>> + lru->memcg_old = NULL;
>>>>> +
>>>>> + mutex_lock(&memcg_create_mutex);
>>>>> + if (!memcg_kmem_enabled())
>>>>> + goto out_list_add;
>>>>> +
>>>>> + lru->memcg = kcalloc(memcg_limited_groups_array_size,
>>>>> + sizeof(*lru->memcg), GFP_KERNEL);
>>>>> + if (!lru->memcg) {
>>>>> + err = -ENOMEM;
>>>>> + goto out;
>>>>> + }
>>>>> +
>>>>> + for_each_mem_cgroup(memcg) {
>>>>> + int memcg_id;
>>>>> +
>>>>> + memcg_id = memcg_cache_id(memcg);
>>>>> + if (memcg_id < 0)
>>>>> + continue;
>>>>> +
>>>>> + err = list_lru_memcg_alloc(lru, memcg_id);
>>>>> + if (err) {
>>>>> + mem_cgroup_iter_break(NULL, memcg);
>>>>> + goto out_free_lru_memcg;
>>>>> + }
>>>>> + }
>>>>> +out_list_add:
>>>>> + list_add(&lru->list, &all_memcg_lrus);
>>>>> +out:
>>>>> + mutex_unlock(&memcg_create_mutex);
>>>>> + return err;
>>>>> +
>>>>> +out_free_lru_memcg:
>>>>> + for (i = 0; i < memcg_limited_groups_array_size; i++)
>>>>> + list_lru_memcg_free(lru, i);
>>>>> + kfree(lru->memcg);
>>>>> + goto out;
>>>>> +}
>>>> That will probably scale even worse. Think about what happens when we
>>>> try to mount a bunch of filesystems in parallel - they will now
>>>> serialise completely on this memcg_create_mutex instantiating memcg
>>>> lists inside list_lru_init().
>>> Yes, the scalability seems to be the main problem here. I have a couple
>>> of thoughts on how it could be improved. Here they go:
>>>
>>> 1) We can turn memcg_create_mutex to rw-semaphore (or introduce an
>>> rw-semaphore, which we would take for modifying list_lru's) and take it
>>> for reading in memcg_list_lru_init() and for writing when we create a
>>> new memcg (memcg_init_all_lrus()).
>>> This would remove the bottleneck from the mount path, but every memcg
>>> creation would still iterate over all LRUs under a memcg mutex. So I
>>> guess it is not an option, isn't it?
>> Right - it's not so much that there is a mutex to protect the init,
>> it's how long it's held that will be the issue. I mean, we don't
>> need to hold the memcg_create_mutex until we've completely
>> initialised the lru structure and are ready to add it to the
>> all_memcg_lrus list, right?
>>
>> i.e. restructing it so that you don't need to hold the mutex until
>> you make the LRU list globally visible would solve the problem just
>> as well. if we can iterate the memcgs lists without holding a lock,
>> then we can init the per-memcg lru lists without holding a lock
>> because nobody will access them through the list_lru structure
>> because it's not yet been published.
>>
>> That keeps the locking simple, and we get scalability because we've
>> reduced the lock's scope to just a few instructures instead of a
>> memcg iteration and a heap of memory allocation....
> Unfortunately that's not that easy as it seems to be :-(
>
> Currently I hold the memcg_create_mutex while initializing per-memcg
> LRUs in memcg_list_lru_init() in order to be sure that I won't miss a
> memcg that is added during initialization.
>
> I mean, let's try to move per-memcg LRUs allocation outside the lock and
> only register the LRU there:
>
> memcg_list_lru_init():
> 1) allocate lru->memcg array
> 2) for_each_kmem_active_memcg(m)
> allocate lru->memcg[m]
> 3) lock memcg_create_mutex
> add lru to all_memcg_lrus_list
> unlock memcg_create_mutex
>
> Then if a new kmem-active memcg is added after step 2 and before step 3,
> it won't see the new lru, because it has not been registered yet, and
> thus won't initialize its list in this lru. As a result, we will end up
> with a partially initialized list_lru. Note that this will happen even
> if the whole memcg initialization proceeds under the memcg_create_mutex.
>
> Provided we could freeze memcg_limited_groups_array_size, it would be
> possible to fix this problem by swapping steps 2 and 3 and making step 2
> initialize lru->memcg[m] using cmpxchg() only if it was not initialized.
> However we still have to hold the memcg_create_mutex during the whole
> kmemcg activation path (memcg_init_all_lrus()).
>
> Let's see if we can get rid of the lock in memcg_init_all_lrus() by
> making the all_memcg_lrus RCU-protected so that we could iterate over
> all list_lrus w/o holding any locks and turn memcg_init_all_lrus() to
> something like this:
>
> memcg_init_all_lrus():
> 1) for_each_list_lru_rcu(lru)
> allocate lru->memcg[new_memcg_id]
> 2) mark new_memcg as kmem-active
>
> The problem is that if memcg_list_lru_init(new_lru) starts and completes
> between steps 1 and 2, we will not initialize
> new_lru->memcg[new_memcg_id] neither in memcg_init_all_lrus() nor in
> memcg_list_lru_init().
>
> The problem here is that on kmemcg creation (memcg_init_all_lrus()) we
> have to iterate over all list_lrus while on list_lru creation
> (memcg_list_lru_init()) we have to iterate over all memcgs. Currently I
> can't figure out how we can do it w/o holding any mutexes at least while
> calling one of these functions, but I'm keep thinking on it.
>
Seems I got it. We could add a memcg state bit, say "activating",
meaning that a memcg is going to become kmem-active, but it is not yet,
so it should not be accounted to; keep all list_lrus in a rcu-protected
list; and implement memcg_init_all_lrus() and memcg_list_lru_init() as
follows:
memcg_init_all_lrus():
set activating
for_each_list_lru_rcu(lru)
cmpxchg(&lru->memcg[new_memcg_id], NULL, new list_lru_node);
set active
memcg_list_lru_init():
add the new_lru to the all_memcg_lrus list
for_each_memcg(memcg):
if memcg is activating or active:
cmpxchg(&new_lru->memcg[memcg_id], NULL, new list_lru_node)
At first glance, it looks correct, because:
If we skip an lru while iterating over all_memcg_lrus in
memcg_init_all_lrus(), it means it was created after the "activating"
bit had been set and thus the per-memcg lru will be initialized in
memcg_list_lru_init(). If we skip a memcg in memcg_list_lru_init(), this
memg will "see" this lru while iterating over the all_memcg_lrus list,
because the lru must have been created before the "activating" bit set.
Although it is to be elaborated, because I haven't examined the
destruction paths yet, and this doesn't take into account the fact that
memcg_limited_groups_array_size is not a constant (I guess I'll have to
introduce an rw semaphore for it, but it'll be OK, because its updates
are very-very rare), I guess I am on the right way.
Thanks.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
WARNING: multiple messages have this Message-ID (diff)
From: Vladimir Davydov <vdavydov@parallels.com>
To: Dave Chinner <david@fromorbit.com>
Cc: <glommer@gmail.com>, Balbir Singh <bsingharora@gmail.com>,
<dchinner@redhat.com>,
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
<linux-kernel@vger.kernel.org>, <glommer@openvz.org>,
<mhocko@suse.cz>, <linux-mm@kvack.org>,
Al Viro <viro@zeniv.linux.org.uk>, <hannes@cmpxchg.org>,
<cgroups@vger.kernel.org>, <akpm@linux-foundation.org>,
<devel@openvz.org>
Subject: Re: [PATCH v13 11/16] mm: list_lru: add per-memcg lists
Date: Fri, 13 Dec 2013 00:24:08 +0400 [thread overview]
Message-ID: <52AA1B68.80302@parallels.com> (raw)
In-Reply-To: <52A986D2.6010606@parallels.com>
On 12/12/2013 01:50 PM, Vladimir Davydov wrote:
>>>>> +int memcg_list_lru_init(struct list_lru *lru)
>>>>> +{
>>>>> + int err = 0;
>>>>> + int i;
>>>>> + struct mem_cgroup *memcg;
>>>>> +
>>>>> + lru->memcg = NULL;
>>>>> + lru->memcg_old = NULL;
>>>>> +
>>>>> + mutex_lock(&memcg_create_mutex);
>>>>> + if (!memcg_kmem_enabled())
>>>>> + goto out_list_add;
>>>>> +
>>>>> + lru->memcg = kcalloc(memcg_limited_groups_array_size,
>>>>> + sizeof(*lru->memcg), GFP_KERNEL);
>>>>> + if (!lru->memcg) {
>>>>> + err = -ENOMEM;
>>>>> + goto out;
>>>>> + }
>>>>> +
>>>>> + for_each_mem_cgroup(memcg) {
>>>>> + int memcg_id;
>>>>> +
>>>>> + memcg_id = memcg_cache_id(memcg);
>>>>> + if (memcg_id < 0)
>>>>> + continue;
>>>>> +
>>>>> + err = list_lru_memcg_alloc(lru, memcg_id);
>>>>> + if (err) {
>>>>> + mem_cgroup_iter_break(NULL, memcg);
>>>>> + goto out_free_lru_memcg;
>>>>> + }
>>>>> + }
>>>>> +out_list_add:
>>>>> + list_add(&lru->list, &all_memcg_lrus);
>>>>> +out:
>>>>> + mutex_unlock(&memcg_create_mutex);
>>>>> + return err;
>>>>> +
>>>>> +out_free_lru_memcg:
>>>>> + for (i = 0; i < memcg_limited_groups_array_size; i++)
>>>>> + list_lru_memcg_free(lru, i);
>>>>> + kfree(lru->memcg);
>>>>> + goto out;
>>>>> +}
>>>> That will probably scale even worse. Think about what happens when we
>>>> try to mount a bunch of filesystems in parallel - they will now
>>>> serialise completely on this memcg_create_mutex instantiating memcg
>>>> lists inside list_lru_init().
>>> Yes, the scalability seems to be the main problem here. I have a couple
>>> of thoughts on how it could be improved. Here they go:
>>>
>>> 1) We can turn memcg_create_mutex to rw-semaphore (or introduce an
>>> rw-semaphore, which we would take for modifying list_lru's) and take it
>>> for reading in memcg_list_lru_init() and for writing when we create a
>>> new memcg (memcg_init_all_lrus()).
>>> This would remove the bottleneck from the mount path, but every memcg
>>> creation would still iterate over all LRUs under a memcg mutex. So I
>>> guess it is not an option, isn't it?
>> Right - it's not so much that there is a mutex to protect the init,
>> it's how long it's held that will be the issue. I mean, we don't
>> need to hold the memcg_create_mutex until we've completely
>> initialised the lru structure and are ready to add it to the
>> all_memcg_lrus list, right?
>>
>> i.e. restructing it so that you don't need to hold the mutex until
>> you make the LRU list globally visible would solve the problem just
>> as well. if we can iterate the memcgs lists without holding a lock,
>> then we can init the per-memcg lru lists without holding a lock
>> because nobody will access them through the list_lru structure
>> because it's not yet been published.
>>
>> That keeps the locking simple, and we get scalability because we've
>> reduced the lock's scope to just a few instructures instead of a
>> memcg iteration and a heap of memory allocation....
> Unfortunately that's not that easy as it seems to be :-(
>
> Currently I hold the memcg_create_mutex while initializing per-memcg
> LRUs in memcg_list_lru_init() in order to be sure that I won't miss a
> memcg that is added during initialization.
>
> I mean, let's try to move per-memcg LRUs allocation outside the lock and
> only register the LRU there:
>
> memcg_list_lru_init():
> 1) allocate lru->memcg array
> 2) for_each_kmem_active_memcg(m)
> allocate lru->memcg[m]
> 3) lock memcg_create_mutex
> add lru to all_memcg_lrus_list
> unlock memcg_create_mutex
>
> Then if a new kmem-active memcg is added after step 2 and before step 3,
> it won't see the new lru, because it has not been registered yet, and
> thus won't initialize its list in this lru. As a result, we will end up
> with a partially initialized list_lru. Note that this will happen even
> if the whole memcg initialization proceeds under the memcg_create_mutex.
>
> Provided we could freeze memcg_limited_groups_array_size, it would be
> possible to fix this problem by swapping steps 2 and 3 and making step 2
> initialize lru->memcg[m] using cmpxchg() only if it was not initialized.
> However we still have to hold the memcg_create_mutex during the whole
> kmemcg activation path (memcg_init_all_lrus()).
>
> Let's see if we can get rid of the lock in memcg_init_all_lrus() by
> making the all_memcg_lrus RCU-protected so that we could iterate over
> all list_lrus w/o holding any locks and turn memcg_init_all_lrus() to
> something like this:
>
> memcg_init_all_lrus():
> 1) for_each_list_lru_rcu(lru)
> allocate lru->memcg[new_memcg_id]
> 2) mark new_memcg as kmem-active
>
> The problem is that if memcg_list_lru_init(new_lru) starts and completes
> between steps 1 and 2, we will not initialize
> new_lru->memcg[new_memcg_id] neither in memcg_init_all_lrus() nor in
> memcg_list_lru_init().
>
> The problem here is that on kmemcg creation (memcg_init_all_lrus()) we
> have to iterate over all list_lrus while on list_lru creation
> (memcg_list_lru_init()) we have to iterate over all memcgs. Currently I
> can't figure out how we can do it w/o holding any mutexes at least while
> calling one of these functions, but I'm keep thinking on it.
>
Seems I got it. We could add a memcg state bit, say "activating",
meaning that a memcg is going to become kmem-active, but it is not yet,
so it should not be accounted to; keep all list_lrus in a rcu-protected
list; and implement memcg_init_all_lrus() and memcg_list_lru_init() as
follows:
memcg_init_all_lrus():
set activating
for_each_list_lru_rcu(lru)
cmpxchg(&lru->memcg[new_memcg_id], NULL, new list_lru_node);
set active
memcg_list_lru_init():
add the new_lru to the all_memcg_lrus list
for_each_memcg(memcg):
if memcg is activating or active:
cmpxchg(&new_lru->memcg[memcg_id], NULL, new list_lru_node)
At first glance, it looks correct, because:
If we skip an lru while iterating over all_memcg_lrus in
memcg_init_all_lrus(), it means it was created after the "activating"
bit had been set and thus the per-memcg lru will be initialized in
memcg_list_lru_init(). If we skip a memcg in memcg_list_lru_init(), this
memg will "see" this lru while iterating over the all_memcg_lrus list,
because the lru must have been created before the "activating" bit set.
Although it is to be elaborated, because I haven't examined the
destruction paths yet, and this doesn't take into account the fact that
memcg_limited_groups_array_size is not a constant (I guess I'll have to
introduce an rw semaphore for it, but it'll be OK, because its updates
are very-very rare), I guess I am on the right way.
Thanks.
next prev parent reply other threads:[~2013-12-12 20:24 UTC|newest]
Thread overview: 81+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-12-09 8:05 [PATCH v13 00/16] kmemcg shrinkers Vladimir Davydov
2013-12-09 8:05 ` Vladimir Davydov
2013-12-09 8:05 ` [PATCH v13 01/16] memcg: make cache index determination more robust Vladimir Davydov
2013-12-09 8:05 ` Vladimir Davydov
2013-12-09 8:05 ` [PATCH v13 02/16] memcg: consolidate callers of memcg_cache_id Vladimir Davydov
2013-12-09 8:05 ` Vladimir Davydov
2013-12-09 8:05 ` [PATCH v13 03/16] memcg: move initialization to memcg creation Vladimir Davydov
2013-12-09 8:05 ` Vladimir Davydov
2013-12-09 8:05 ` [PATCH v13 04/16] memcg: move memcg_caches_array_size() function Vladimir Davydov
2013-12-09 8:05 ` Vladimir Davydov
2013-12-10 8:04 ` Glauber Costa
2013-12-10 8:04 ` Glauber Costa
[not found] ` <cover.1386571280.git.vdavydov-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2013-12-09 8:05 ` [PATCH v13 05/16] vmscan: move call to shrink_slab() to shrink_zones() Vladimir Davydov
2013-12-09 8:05 ` Vladimir Davydov
2013-12-09 8:05 ` Vladimir Davydov
2013-12-10 8:10 ` Glauber Costa
2013-12-10 8:10 ` Glauber Costa
2013-12-09 8:05 ` [PATCH v13 06/16] vmscan: remove shrink_control arg from do_try_to_free_pages() Vladimir Davydov
2013-12-09 8:05 ` Vladimir Davydov
2013-12-09 8:05 ` [PATCH v13 07/16] vmscan: call NUMA-unaware shrinkers irrespective of nodemask Vladimir Davydov
2013-12-09 8:05 ` Vladimir Davydov
2013-12-09 8:05 ` Vladimir Davydov
2013-12-09 8:05 ` [PATCH v13 08/16] mm: list_lru: require shrink_control in count, walk functions Vladimir Davydov
2013-12-09 8:05 ` Vladimir Davydov
2013-12-10 1:36 ` Dave Chinner
2013-12-10 1:36 ` Dave Chinner
2013-12-09 8:05 ` [PATCH v13 09/16] fs: consolidate {nr,free}_cached_objects args in shrink_control Vladimir Davydov
2013-12-09 8:05 ` Vladimir Davydov
[not found] ` <43660b83b58531ccf4d45f626283484441441943.1386571280.git.vdavydov-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2013-12-10 1:38 ` Dave Chinner
2013-12-10 1:38 ` Dave Chinner
2013-12-10 1:38 ` Dave Chinner
2013-12-09 8:05 ` [PATCH v13 10/16] vmscan: shrink slab on memcg pressure Vladimir Davydov
2013-12-09 8:05 ` Vladimir Davydov
[not found] ` <24314b9f3b299bac988ea3570f71f9e6919bbc4e.1386571280.git.vdavydov-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2013-12-10 2:11 ` Dave Chinner
2013-12-10 2:11 ` Dave Chinner
2013-12-10 2:11 ` Dave Chinner
2013-12-09 8:05 ` [PATCH v13 11/16] mm: list_lru: add per-memcg lists Vladimir Davydov
2013-12-09 8:05 ` Vladimir Davydov
[not found] ` <0ca62dbfbf545edb22b86bd11c50e9017a3dc4db.1386571280.git.vdavydov-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2013-12-10 5:00 ` Dave Chinner
2013-12-10 5:00 ` Dave Chinner
2013-12-10 5:00 ` Dave Chinner
2013-12-10 10:05 ` Vladimir Davydov
2013-12-10 10:05 ` Vladimir Davydov
2013-12-10 10:05 ` Vladimir Davydov
2013-12-12 1:40 ` Dave Chinner
2013-12-12 1:40 ` Dave Chinner
2013-12-12 9:50 ` Vladimir Davydov
2013-12-12 9:50 ` Vladimir Davydov
2013-12-12 9:50 ` Vladimir Davydov
2013-12-12 20:24 ` Vladimir Davydov [this message]
2013-12-12 20:24 ` Vladimir Davydov
2013-12-14 20:03 ` Vladimir Davydov
2013-12-14 20:03 ` Vladimir Davydov
2013-12-12 20:48 ` Glauber Costa
2013-12-12 20:48 ` Glauber Costa
2013-12-09 8:05 ` [PATCH v13 12/16] fs: mark list_lru based shrinkers memcg aware Vladimir Davydov
2013-12-09 8:05 ` Vladimir Davydov
2013-12-10 4:17 ` Dave Chinner
2013-12-10 4:17 ` Dave Chinner
2013-12-11 11:08 ` Steven Whitehouse
2013-12-11 11:08 ` Steven Whitehouse
2013-12-09 8:05 ` [PATCH v13 13/16] vmscan: take at least one pass with shrinkers Vladimir Davydov
2013-12-09 8:05 ` Vladimir Davydov
2013-12-10 4:18 ` Dave Chinner
2013-12-10 4:18 ` Dave Chinner
2013-12-10 11:50 ` Vladimir Davydov
2013-12-10 11:50 ` Vladimir Davydov
2013-12-10 11:50 ` Vladimir Davydov
[not found] ` <52A6FFF0.6080207-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2013-12-10 12:38 ` Glauber Costa
2013-12-10 12:38 ` Glauber Costa
2013-12-10 12:38 ` Glauber Costa
2013-12-09 8:05 ` [PATCH v13 14/16] vmpressure: in-kernel notifications Vladimir Davydov
2013-12-09 8:05 ` Vladimir Davydov
2013-12-10 8:12 ` Glauber Costa
2013-12-10 8:12 ` Glauber Costa
2013-12-09 8:05 ` [PATCH v13 15/16] memcg: reap dead memcgs upon global memory pressure Vladimir Davydov
2013-12-09 8:05 ` Vladimir Davydov
2013-12-09 8:05 ` [PATCH v13 16/16] memcg: flush memcg items upon memcg destruction Vladimir Davydov
2013-12-09 8:05 ` Vladimir Davydov
2013-12-10 8:02 ` [PATCH v13 00/16] kmemcg shrinkers Glauber Costa
2013-12-10 8:02 ` Glauber Costa
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=52AA1B68.80302@parallels.com \
--to=vdavydov@parallels.com \
--cc=akpm@linux-foundation.org \
--cc=bsingharora@gmail.com \
--cc=cgroups@vger.kernel.org \
--cc=david@fromorbit.com \
--cc=dchinner@redhat.com \
--cc=devel@openvz.org \
--cc=glommer@gmail.com \
--cc=glommer@openvz.org \
--cc=hannes@cmpxchg.org \
--cc=kamezawa.hiroyu@jp.fujitsu.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mhocko@suse.cz \
--cc=viro@zeniv.linux.org.uk \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.