From: Glauber Costa <glommer@parallels.com>
To: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <bsingharora@gmail.com>,
Greg Thelen <gthelen@google.com>, <linux-kernel@vger.kernel.org>,
<paul@paulmenage.org>, <lizf@cn.fujitsu.com>,
<ebiederm@xmission.com>, <davem@davemloft.net>,
<netdev@vger.kernel.org>, <linux-mm@kvack.org>,
<kirill@shutemov.name>
Subject: Re: [PATCH v3 2/7] socket: initial cgroup code.
Date: Tue, 27 Sep 2011 17:43:29 -0300
Message-ID: <4E823571.6060001@parallels.com>
In-Reply-To: <20110926195213.12da87b4.kamezawa.hiroyu@jp.fujitsu.com>
[-- Attachment #1: Type: text/plain, Size: 2758 bytes --]
On 09/26/2011 07:52 AM, KAMEZAWA Hiroyuki wrote:
> On Sat, 24 Sep 2011 11:45:04 -0300
> Glauber Costa<glommer@parallels.com> wrote:
>
>> On 09/22/2011 12:09 PM, Balbir Singh wrote:
>>> On Thu, Sep 22, 2011 at 11:30 AM, Greg Thelen<gthelen@google.com> wrote:
>>>> On Wed, Sep 21, 2011 at 11:59 AM, Glauber Costa<glommer@parallels.com> wrote:
>>>>> Right now I am working under the assumption that tasks are long lived inside
>>>>> the cgroup. Migration potentially introduces some nasty locking problems in
>>>>> the mem_schedule path.
>>>>>
>>>>> Also, unless I am missing something, the memcg already has the policy of
>>>>> not carrying charges around, probably because of this very same complexity.
>>>>>
>>>>> True that at least it won't EBUSY you... But I think this is at least a way
>>>>> to guarantee that the cgroup under our nose won't disappear in the middle of
>>>>> our allocations.
>>>>
>>>> Here's the memcg user page behavior using the same pattern:
>>>>
>>>> 1. user page P is allocated by task T in memcg M1
>>>> 2. T is moved to memcg M2. The P charge is left behind still charged
>>>> to M1 if memory.move_charge_at_immigrate=0; or the charge is moved to
>>>> M2 if memory.move_charge_at_immigrate=1.
>>>> 3. rmdir M1 will try to reclaim P (if P was left in M1). If unable to
>>>> reclaim, then P is recharged to parent(M1).
>>>>
>>>
>>> We also have some magic in page_referenced() to remove pages
>>> referenced from different containers. What we do is try not to
>>> penalize a cgroup if another cgroup is referencing this page and the
>>> page under consideration is being reclaimed from the cgroup that
>>> touched it.
>>>
>>> Balbir Singh
>> Do you guys see it as a showstopper for this series to be merged, or can
>> we just TODO it ?
>>
>
> In my experience, 'I can't rmdir cgroup.' is always an important/difficult
> problem. The users cannot know where the accounting is leaking other than
> through kmem.usage_in_bytes or memory.usage_in_bytes, and can't fix the issue.
>
> Please add EXPERIMENTAL to Kconfig until this is fixed.
>
>> I can push a proposal for it, but it would be done in a separate patch
>> anyway. Also, we may be in better conditions to fix this when the slab
>> part is merged - since it will likely have the same problems...
>>
>
> Yes. Considering sockets, which can be shared between tasks (cgroups),
> you'll finally need
> - owner task of socket
> - account moving callback
>
> Or disallow task moving once accounted.
>
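So,
I tried to come up with proper task charge moving here, and the locking
quickly gets quite complicated. (But I have the feeling I am overlooking
something...) So I think I'll really need more time for that.
In the meantime, the attached patch takes the "disallow task moving once
accounted" route: a per-cgroup refcnt counts outstanding socket memory
charges, and can_attach refuses the move while it is nonzero.
What do you guys think of this patch, plus EXPERIMENTAL?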
[-- Attachment #2: foo.patch --]
[-- Type: text/plain, Size: 3232 bytes --]
diff --git a/include/net/tcp.h b/include/net/tcp.h
index f784cb7..684c090 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -257,6 +257,7 @@ struct mem_cgroup;
struct tcp_memcontrol {
/* per-cgroup tcp memory pressure knobs */
int tcp_max_memory;
+ atomic_t refcnt;
atomic_long_t tcp_memory_allocated;
struct percpu_counter tcp_sockets_allocated;
/* those two are read-mostly, leave them at the end */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6937f20..b594a9a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -361,34 +361,21 @@ static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
void sock_update_memcg(struct sock *sk)
{
- /* right now a socket spends its whole life in the same cgroup */
- BUG_ON(sk->sk_cgrp);
-
rcu_read_lock();
sk->sk_cgrp = mem_cgroup_from_task(current);
-
- /*
- * We don't need to protect against anything task-related, because
- * we are basically stuck with the sock pointer that won't change,
- * even if the task that originated the socket changes cgroups.
- *
- * What we do have to guarantee, is that the chain leading us to
- * the top level won't change under our noses. Incrementing the
- * reference count via cgroup_exclude_rmdir guarantees that.
- */
- cgroup_exclude_rmdir(mem_cgroup_css(sk->sk_cgrp));
rcu_read_unlock();
}
void sock_release_memcg(struct sock *sk)
{
- cgroup_release_and_wakeup_rmdir(mem_cgroup_css(sk->sk_cgrp));
}
void memcg_sock_mem_alloc(struct mem_cgroup *mem, struct proto *prot,
int amt, int *parent_failure)
{
+ atomic_inc(&mem->tcp.refcnt);
mem = parent_mem_cgroup(mem);
+
for (; mem != NULL; mem = parent_mem_cgroup(mem)) {
long alloc;
long *prot_mem = prot->prot_mem(mem);
@@ -406,9 +393,12 @@ EXPORT_SYMBOL(memcg_sock_mem_alloc);
void memcg_sock_mem_free(struct mem_cgroup *mem, struct proto *prot, int amt)
{
- mem = parent_mem_cgroup(mem);
- for (; mem != NULL; mem = parent_mem_cgroup(mem))
- atomic_long_sub(amt, prot->memory_allocated(mem));
+ struct mem_cgroup *parent;
+ parent = parent_mem_cgroup(mem);
+ for (; parent != NULL; parent = parent_mem_cgroup(parent))
+ atomic_long_sub(amt, prot->memory_allocated(parent));
+
+ atomic_dec(&mem->tcp.refcnt);
}
EXPORT_SYMBOL(memcg_sock_mem_free);
@@ -541,6 +531,7 @@ int tcp_init_cgroup(struct proto *prot, struct cgroup *cgrp,
cg->tcp.tcp_memory_pressure = 0;
atomic_long_set(&cg->tcp.tcp_memory_allocated, 0);
+ atomic_set(&cg->tcp.refcnt, 0);
percpu_counter_init(&cg->tcp.tcp_sockets_allocated, 0);
limit = nr_free_buffer_pages() / 8;
@@ -5787,6 +5778,9 @@ static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
int ret = 0;
struct mem_cgroup *mem = mem_cgroup_from_cont(cgroup);
+ if (atomic_read(&mem->tcp.refcnt))
+ return 1;
+
if (mem->move_charge_at_immigrate) {
struct mm_struct *mm;
struct mem_cgroup *from = mem_cgroup_from_task(p);
@@ -5957,6 +5951,11 @@ static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
struct cgroup *cgroup,
struct task_struct *p)
{
+ struct mem_cgroup *mem = mem_cgroup_from_cont(cgroup);
+
+ if (atomic_read(&mem->tcp.refcnt))
+ return 1;
+
return 0;
}
static void mem_cgroup_cancel_attach(struct cgroup_subsys *ss,
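For readers following the patch, here is a minimal user-space sketch of the
scheme it appears to implement: a per-cgroup counter is bumped on every socket
memory charge, dropped on every uncharge, and task attach is refused while the
counter is nonzero. This is not kernel code. struct memcg_model and the
model_* functions are hypothetical names made up for illustration, C11 atomics
stand in for atomic_t/atomic_long_t, the limit checks done in
memcg_sock_mem_alloc are left out, and the sketch returns -EBUSY where the
patch returns 1 (that substitution is an assumption, not what the patch does).

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for struct mem_cgroup (illustration only). */
struct memcg_model {
	struct memcg_model *parent;
	atomic_int refcnt;     /* outstanding charges made through this group */
	atomic_long allocated; /* amount charged to this group by descendants */
};

void model_init(struct memcg_model *mem, struct memcg_model *parent)
{
	mem->parent = parent;
	atomic_init(&mem->refcnt, 0);
	atomic_init(&mem->allocated, 0);
}

/* Charge: pin the charging group, then account amt in every ancestor. */
void model_alloc(struct memcg_model *mem, long amt)
{
	struct memcg_model *p;

	atomic_fetch_add(&mem->refcnt, 1);
	for (p = mem->parent; p != NULL; p = p->parent)
		atomic_fetch_add(&p->allocated, amt);
}

/*
 * Uncharge: walk the ancestors with a separate cursor so that mem still
 * points at the charging group when its counter is dropped.
 */
void model_free(struct memcg_model *mem, long amt)
{
	struct memcg_model *p;

	for (p = mem->parent; p != NULL; p = p->parent)
		atomic_fetch_sub(&p->allocated, amt);
	atomic_fetch_sub(&mem->refcnt, 1);
}

/* Attach gate: refuse while the group has outstanding charges. */
int model_can_attach(struct memcg_model *mem)
{
	return atomic_load(&mem->refcnt) ? -EBUSY : 0;
}
```

With a root group and one child, a charge made through the child shows up in
the root's counter and blocks model_can_attach() on the child until the
matching model_free() runs. The gate only stays balanced if every charge is
paired with exactly one uncharge on the same group.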