* [PATCH] introduce res_counter_charge_nofail() for socket allocations
@ 2012-01-18 15:15 Glauber Costa
[not found] ` <1326899758-9013-1-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
0 siblings, 1 reply; 8+ messages in thread
From: Glauber Costa @ 2012-01-18 15:15 UTC (permalink / raw)
To: cgroups-u79uwXL29TY76Z2rM5mHXA
Cc: Glauber Costa, KAMEZAWA Hiroyuki, Johannes Weiner, Michal Hocko
There is a case in __sk_mem_schedule() where an allocation
exceeds the maximum but we are still allowed to proceed.
It happens under the following condition:

	sk->sk_wmem_queued + size >= sk->sk_sndbuf

The network code won't revert the allocation in this case,
meaning that at some point later it will try to do so. Since
this is never communicated to the underlying res_counter
code, there is an imbalance in the res_counter uncharge
operation.
I see two ways of fixing this:
1) storing the information about those allocations somewhere
in memcg, and then deducting from that first, before
we start draining the res_counter,
2) providing a slightly different allocation function for
the res_counter, that matches the original behavior of
the network code more closely.
I decided to go for #2 here, believing it to be more elegant,
since #1 would require us to do basically that, but in a more
obscure way.
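To make the difference between the two charging policies concrete, here is a minimal userspace model (plain C, a single flat counter, hypothetical names; not the kernel code):

```c
#include <assert.h>

struct counter { unsigned long usage, limit; };

/* fail-able charge: refuses and leaves usage untouched when over limit,
 * so the caller must NOT issue a matching uncharge on failure */
static int charge(struct counter *c, unsigned long val)
{
	if (c->usage + val > c->limit)
		return -1;
	c->usage += val;
	return 0;
}

/* nofail charge: always charges, merely reports that we went over,
 * so every allocation can be paired with an uncharge later */
static int charge_nofail(struct counter *c, unsigned long val)
{
	int ret = (c->usage + val > c->limit) ? -1 : 0;

	c->usage += val;
	return ret;
}
```

With charge(), a caller that proceeds despite the failure (as __sk_mem_schedule() does when sk_wmem_queued + size >= sk_sndbuf) later uncharges memory that was never charged, which is exactly the imbalance described above; with charge_nofail() the eventual uncharge always has a matching charge.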
I will eventually submit it through Dave for the -net tree,
but I wanted to query you guys first, to see if this approach
is acceptable or if you'd prefer me to try something else.
Thanks
Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
Cc: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
Cc: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
---
include/linux/res_counter.h | 6 ++++++
include/net/sock.h | 10 ++++------
kernel/res_counter.c | 25 +++++++++++++++++++++++++
net/core/sock.c | 4 ++--
4 files changed, 37 insertions(+), 8 deletions(-)
diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index c9d625c..32a7b02 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -109,12 +109,18 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent);
*
* returns 0 on success and <0 if the counter->usage will exceed the
* counter->limit _locked call expects the counter->lock to be taken
+ *
+ * charge_nofail works the same, except that it charges the resource
+ * counter unconditionally, and returns < 0 if, after the current
+ * charge, we are over the limit.
*/
int __must_check res_counter_charge_locked(struct res_counter *counter,
unsigned long val);
int __must_check res_counter_charge(struct res_counter *counter,
unsigned long val, struct res_counter **limit_fail_at);
+int __must_check res_counter_charge_nofail(struct res_counter *counter,
+ unsigned long val, struct res_counter **limit_fail_at);
/*
* uncharge - tell that some portion of the resource is released
diff --git a/include/net/sock.h b/include/net/sock.h
index bb972d2..3269085 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1007,9 +1007,8 @@ static inline void memcg_memory_allocated_add(struct cg_proto *prot,
struct res_counter *fail;
int ret;
- ret = res_counter_charge(prot->memory_allocated,
- amt << PAGE_SHIFT, &fail);
-
+ ret = res_counter_charge_nofail(prot->memory_allocated,
+ amt << PAGE_SHIFT, &fail);
if (ret < 0)
*parent_status = OVER_LIMIT;
}
@@ -1053,12 +1052,11 @@ sk_memory_allocated_add(struct sock *sk, int amt, int *parent_status)
}
static inline void
-sk_memory_allocated_sub(struct sock *sk, int amt, int parent_status)
+sk_memory_allocated_sub(struct sock *sk, int amt)
{
struct proto *prot = sk->sk_prot;
- if (mem_cgroup_sockets_enabled && sk->sk_cgrp &&
- parent_status != OVER_LIMIT) /* Otherwise was uncharged already */
+ if (mem_cgroup_sockets_enabled && sk->sk_cgrp)
memcg_memory_allocated_sub(sk->sk_cgrp, amt);
atomic_long_sub(amt, prot->memory_allocated);
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 6d269cc..d508363 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -66,6 +66,31 @@ done:
return ret;
}
+int res_counter_charge_nofail(struct res_counter *counter, unsigned long val,
+ struct res_counter **limit_fail_at)
+{
+ int ret, r;
+ unsigned long flags;
+ struct res_counter *c;
+
+ r = ret = 0;
+ *limit_fail_at = NULL;
+ local_irq_save(flags);
+ for (c = counter; c != NULL; c = c->parent) {
+ spin_lock(&c->lock);
+ r = res_counter_charge_locked(c, val);
+ if (r)
+ c->usage += val;
+ spin_unlock(&c->lock);
+ if (r < 0 && ret == 0) {
+ *limit_fail_at = c;
+ ret = r;
+ }
+ }
+ local_irq_restore(flags);
+
+ return ret;
+}
void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val)
{
if (WARN_ON(counter->usage < val))
diff --git a/net/core/sock.c b/net/core/sock.c
index 5c5af998..3e81fd2 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1827,7 +1827,7 @@ suppress_allocation:
/* Alas. Undo changes. */
sk->sk_forward_alloc -= amt * SK_MEM_QUANTUM;
- sk_memory_allocated_sub(sk, amt, parent_status);
+ sk_memory_allocated_sub(sk, amt);
return 0;
}
@@ -1840,7 +1840,7 @@ EXPORT_SYMBOL(__sk_mem_schedule);
void __sk_mem_reclaim(struct sock *sk)
{
sk_memory_allocated_sub(sk,
- sk->sk_forward_alloc >> SK_MEM_QUANTUM_SHIFT, 0);
+ sk->sk_forward_alloc >> SK_MEM_QUANTUM_SHIFT);
sk->sk_forward_alloc &= SK_MEM_QUANTUM - 1;
if (sk_under_memory_pressure(sk) &&
--
1.7.7.4
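As a sanity check on the intended semantics of the new helper, here is a userspace sketch of the hierarchical walk (no locking, hypothetical type and function names; not the kernel code): every counter from the leaf up to the root is charged unconditionally, and *limit_fail_at reports the first counter that ended up over its limit.

```c
#include <assert.h>
#include <stddef.h>

struct res_ctr {
	unsigned long usage, limit;
	struct res_ctr *parent;
};

/* charge the whole ancestry unconditionally; report the first
 * counter pushed over its limit through *limit_fail_at */
static int ctr_charge_nofail(struct res_ctr *counter, unsigned long val,
			     struct res_ctr **limit_fail_at)
{
	struct res_ctr *c;
	int ret = 0;

	*limit_fail_at = NULL;
	for (c = counter; c != NULL; c = c->parent) {
		c->usage += val;
		if (c->usage > c->limit && ret == 0) {
			*limit_fail_at = c;
			ret = -1;
		}
	}
	return ret;
}
```

The kernel version additionally takes each counter's spinlock with interrupts disabled; the model above only captures the charge-and-report logic.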
* Re: [PATCH] introduce res_counter_charge_nofail() for socket allocations
[not found] ` <1326899758-9013-1-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2012-01-19 12:48 ` Johannes Weiner
[not found] ` <20120119124841.GL24386-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
2012-01-19 14:36 ` Michal Hocko
1 sibling, 1 reply; 8+ messages in thread
From: Johannes Weiner @ 2012-01-19 12:48 UTC (permalink / raw)
To: Glauber Costa
Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, KAMEZAWA Hiroyuki, Michal Hocko
On Wed, Jan 18, 2012 at 07:15:58PM +0400, Glauber Costa wrote:
> [...]
> + *
> + * charge_nofail works the same, except that it charges the resource
> + * counter unconditionally, and returns < 0 if, after the current
> + * charge, we are over the limit.
> */
res_counter_margin() assumes usage <= limit is always true. Just make
sure you return 0 if that is not the case, or the charge path can get
confused, thinking there is enough room and retry needlessly.
Otherwise, looks good.
* Re: [PATCH] introduce res_counter_charge_nofail() for socket allocations
[not found] ` <20120119124841.GL24386-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
@ 2012-01-19 12:51 ` Glauber Costa
[not found] ` <4F1811C4.50204-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
0 siblings, 1 reply; 8+ messages in thread
From: Glauber Costa @ 2012-01-19 12:51 UTC (permalink / raw)
To: Johannes Weiner
Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, KAMEZAWA Hiroyuki, Michal Hocko
On 01/19/2012 04:48 PM, Johannes Weiner wrote:
> On Wed, Jan 18, 2012 at 07:15:58PM +0400, Glauber Costa wrote:
>> [...]
>
> res_counter_margin() assumes usage <= limit is always true. Just make
> sure you return 0 if that is not the case, or the charge path can get
> confused, thinking there is enough room and retry needlessly.
>
> Otherwise, looks good.
You mean return zero in res_counter_charge_nofail() if we exceeded the limit?

I can't do that, since one needs to know the allocation was supposed to fail.
Or are you talking about something else?
* Re: [PATCH] introduce res_counter_charge_nofail() for socket allocations
[not found] ` <4F1811C4.50204-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2012-01-19 13:12 ` Johannes Weiner
[not found] ` <20120119131217.GN24386-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
0 siblings, 1 reply; 8+ messages in thread
From: Johannes Weiner @ 2012-01-19 13:12 UTC (permalink / raw)
To: Glauber Costa
Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, KAMEZAWA Hiroyuki, Michal Hocko
On Thu, Jan 19, 2012 at 04:51:16PM +0400, Glauber Costa wrote:
> On 01/19/2012 04:48 PM, Johannes Weiner wrote:
> >On Wed, Jan 18, 2012 at 07:15:58PM +0400, Glauber Costa wrote:
> >> [...]
> >
> >res_counter_margin() assumes usage <= limit is always true. Just make
> >sure you return 0 if that is not the case, or the charge path can get
> >confused, thinking there is enough room and retry needlessly.
> >
> >Otherwise, looks good.
>
> You mean return zero in res_counter_charge_nofail() if we exceeded the limit?
>
> I can't do that, since one needs to know the allocation was supposed to fail.
> Or are you talking about something else?
Yes, I mean the calculation in res_counter_margin(), which is supposed
to tell the margin for new charges. Its current code will underflow
and return something large when the usage exceeds the limit, which is
not possible before your patch, so I think you need to include this:
diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index c9d625c..d06d014 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -142,7 +142,10 @@ static inline unsigned long long res_counter_margin(struct res_counter *cnt)
unsigned long flags;
spin_lock_irqsave(&cnt->lock, flags);
- margin = cnt->limit - cnt->usage;
+ if (cnt->limit > cnt->usage)
+ margin = cnt->limit - cnt->usage;
+ else
+ margin = 0;
spin_unlock_irqrestore(&cnt->lock, flags);
return margin;
}
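The underflow in question is ordinary unsigned wraparound; a tiny userspace check (hypothetical numbers, mirroring the guarded hunk above) shows why callers would otherwise see a huge bogus margin:

```c
#include <assert.h>

/* guarded margin, as in the proposed res_counter_margin() hunk */
static unsigned long long margin(unsigned long long limit,
				 unsigned long long usage)
{
	return limit > usage ? limit - usage : 0;
}

/* unguarded: wraps around to a huge value once usage exceeds limit */
static unsigned long long bad_margin(unsigned long long limit,
				     unsigned long long usage)
{
	return limit - usage;
}
```

Before charge_nofail, usage > limit was impossible, so the unguarded form was safe.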
* Re: [PATCH] introduce res_counter_charge_nofail() for socket allocations
[not found] ` <20120119131217.GN24386-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
@ 2012-01-19 13:13 ` Glauber Costa
[not found] ` <4F1816E1.2050506-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
0 siblings, 1 reply; 8+ messages in thread
From: Glauber Costa @ 2012-01-19 13:13 UTC (permalink / raw)
To: Johannes Weiner
Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, KAMEZAWA Hiroyuki, Michal Hocko
On 01/19/2012 05:12 PM, Johannes Weiner wrote:
> On Thu, Jan 19, 2012 at 04:51:16PM +0400, Glauber Costa wrote:
>> On 01/19/2012 04:48 PM, Johannes Weiner wrote:
>>> On Wed, Jan 18, 2012 at 07:15:58PM +0400, Glauber Costa wrote:
>>>> [...]
>>>
>>> res_counter_margin() assumes usage <= limit is always true. Just make
>>> sure you return 0 if that is not the case, or the charge path can get
>>> confused, thinking there is enough room and retry needlessly.
>>>
>>> Otherwise, looks good.
>>
>> You mean return zero in res_counter_charge_nofail() if we exceeded the limit?
>>
>> I can't do that, since one needs to know the allocation was supposed to fail.
>> Or are you talking about something else?
>
> Yes, I mean the calculation in res_counter_margin(), which is supposed
> to tell the margin for new charges. Its current code will underflow
> and return something large when the usage exceeds the limit, which is
> not possible before your patch, so I think you need to include this:
>
> diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
> index c9d625c..d06d014 100644
> --- a/include/linux/res_counter.h
> +++ b/include/linux/res_counter.h
> @@ -142,7 +142,10 @@ static inline unsigned long long res_counter_margin(struct res_counter *cnt)
> unsigned long flags;
>
> spin_lock_irqsave(&cnt->lock, flags);
> - margin = cnt->limit - cnt->usage;
> >+ if (cnt->limit > cnt->usage)
> + margin = cnt->limit - cnt->usage;
> + else
> + margin = 0;
> spin_unlock_irqrestore(&cnt->lock, flags);
> return margin;
> }
Ah, I see.
Okay, I will update my patch to include this. (Or do you think it
would be better to add this as a preparation patch?)
* Re: [PATCH] introduce res_counter_charge_nofail() for socket allocations
[not found] ` <4F1816E1.2050506-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2012-01-19 13:38 ` Johannes Weiner
0 siblings, 0 replies; 8+ messages in thread
From: Johannes Weiner @ 2012-01-19 13:38 UTC (permalink / raw)
To: Glauber Costa
Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, KAMEZAWA Hiroyuki, Michal Hocko
On Thu, Jan 19, 2012 at 05:13:05PM +0400, Glauber Costa wrote:
> On 01/19/2012 05:12 PM, Johannes Weiner wrote:
> >On Thu, Jan 19, 2012 at 04:51:16PM +0400, Glauber Costa wrote:
> >>On 01/19/2012 04:48 PM, Johannes Weiner wrote:
> >>>On Wed, Jan 18, 2012 at 07:15:58PM +0400, Glauber Costa wrote:
> >>>> [...]
> >>>
> >>>res_counter_margin() assumes usage <= limit is always true. Just make
> >>>sure you return 0 if that is not the case, or the charge path can get
> >>>confused, thinking there is enough room and retry needlessly.
> >>>
> >>>Otherwise, looks good.
> >>
> >>You mean return zero in res_counter_charge_nofail() if we exceeded the limit?
> >>
> >>I can't do that, since one needs to know the allocation was supposed to fail.
> >>Or are you talking about something else?
> >
> >Yes, I mean the calculation in res_counter_margin(), which is supposed
> >to tell the margin for new charges. Its current code will underflow
> >and return something large when the usage exceeds the limit, which is
> >not possible before your patch, so I think you need to include this:
> >
> >diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
> >index c9d625c..d06d014 100644
> >--- a/include/linux/res_counter.h
> >+++ b/include/linux/res_counter.h
> >@@ -142,7 +142,10 @@ static inline unsigned long long res_counter_margin(struct res_counter *cnt)
> > unsigned long flags;
> >
> > spin_lock_irqsave(&cnt->lock, flags);
> >- margin = cnt->limit - cnt->usage;
> >+ if (cnt->limit > cnt->usage)
> >+ margin = cnt->limit - cnt->usage;
> >+ else
> >+ margin = 0;
> > spin_unlock_irqrestore(&cnt->lock, flags);
> > return margin;
> > }
> Ah, I see.
>
> Okay, I will update my patch to include this. (Or do you think it
> would be better to add this as a preparation patch?)
Probably better to just include it as we wouldn't need this check
stand-alone.
* Re: [PATCH] introduce res_counter_charge_nofail() for socket allocations
[not found] ` <1326899758-9013-1-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2012-01-19 12:48 ` Johannes Weiner
@ 2012-01-19 14:36 ` Michal Hocko
[not found] ` <20120119143607.GF13932-VqjxzfR4DlwKmadIfiO5sKVXKuFTiq87@public.gmane.org>
1 sibling, 1 reply; 8+ messages in thread
From: Michal Hocko @ 2012-01-19 14:36 UTC (permalink / raw)
To: Glauber Costa
Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, KAMEZAWA Hiroyuki,
Johannes Weiner, Tejun Heo, Li Zefan
I guess you should have CCed cgroups maintainers as well.
Let's do that now.
On Wed 18-01-12 19:15:58, Glauber Costa wrote:
> [...]
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic
* Re: [PATCH] introduce res_counter_charge_nofail() for socket allocations
[not found] ` <20120119143607.GF13932-VqjxzfR4DlwKmadIfiO5sKVXKuFTiq87@public.gmane.org>
@ 2012-01-19 20:57 ` James Bottomley
0 siblings, 0 replies; 8+ messages in thread
From: James Bottomley @ 2012-01-19 20:57 UTC (permalink / raw)
To: Michal Hocko
Cc: Glauber Costa, cgroups-u79uwXL29TY76Z2rM5mHXA, KAMEZAWA Hiroyuki,
Johannes Weiner, Tejun Heo, Li Zefan
On Thu, 2012-01-19 at 15:36 +0100, Michal Hocko wrote:
> I guess you should have CCed cgroups maintainers as well.
> Let's do that now
So this is just my preference in SCSI, and cgroup maintainers get to
specify their own preferences, but as a maintainer, I subscribe to the
mailing lists, so there's no need to cc me directly. In fact, I prefer
that you don't cc me directly since it would just clutter up my inbox
and wouldn't be really helpful because all my scripts operate on my
linux-scsi folder (however, I also have a procmail rule that trashes an
email directly to me that also went to linux-scsi, so cc'ing me directly
is usually harmless).
James