Re: [RFC] per-containers tcp buffer limitation

From: Glauber Costa <glommer@parallels.com>
To: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>,
	Linux Containers <containers@lists.osdl.org>,
	netdev@vger.kernel.org, David Miller <davem@davemloft.net>,
	Pavel Emelyanov <xemul@parallels.com>
Subject: Re: [RFC] per-containers tcp buffer limitation
Date: Thu, 25 Aug 2011 15:05:03 -0300
Message-ID: <4E568ECF.1030302@parallels.com>
In-Reply-To: <20110825104956.41c4b60e.kamezawa.hiroyu@jp.fujitsu.com>

On 08/24/2011 10:49 PM, KAMEZAWA Hiroyuki wrote:
> On Wed, 24 Aug 2011 22:28:59 -0300
> Glauber Costa<glommer@parallels.com>  wrote:
>
>> On 08/24/2011 09:35 PM, Eric W. Biederman wrote:
>>> Glauber Costa<glommer@parallels.com>   writes:
>>>
>>>> Hello,
>>>>
>>>> This is a proof of concept of some code I have here to limit tcp send and
>>>> receive buffers per-container (in our case). At this phase, I am more concerned
>>>> with discussing my approach, so please curse my family no further than the 3rd
>>>> generation.
>>>>
>>>> The problem we're trying to attack here is that buffers can grow and fill
>>>> non-reclaimable kernel memory. When doing containers, we can't afford having a
>>>> malicious container pinning kernel memory at will, thereby starving all the
>>>> others.
>>>>
>>>> So here a container will be seen in the host system as a group of tasks, grouped
>>>> in a cgroup. This cgroup will have files allowing us to specify global
>>>> per-cgroup limits on buffers. For that purpose, I created a new sockets cgroup -
>>>> didn't really think any other one of the existing would do here.
>>>>
>>>> As for the network code per-se, I tried to keep the same code that deals with
>>>> memory schedule as a basis and make it per-cgroup.
>>>> You will notice that struct proto now takes function pointers for the values
>>>> controlling memory pressure, and these return per-cgroup data instead of global
>>>> ones. So the current behavior is maintained: after the first threshold is hit,
>>>> we enter memory pressure. After that, allocations are suppressed.
>>>>
>>>> Only tcp code was really touched here. udp had the pointers filled, but we're
>>>> not really controlling anything. But the fact that this lives in generic code,
>>>> makes it easier to do the same for other protocols in the future.
>>>>
>>>> For this patch specifically, I am not touching - just provisioning -
>>>> rmem and wmem specific knobs. I should also #ifdef a lot of this, but hey,
>>>> remember: rfc...
>>>>
>>>> One drawback I found with this approach is that cgroups do not really work well
>>>> with modules. A lot of the network code is modularized, so this would have to be
>>>> fixed somehow.
>>>>
>>>> Let me know what you think.
>>>
>>> Can you implement this by making the existing network sysctls per
>>> network namespace?
>>>
>>> At a quick skim it looks to me like you can make the existing sysctls
>>> per network namespace and solve the issues you are aiming at solving and
>>> that should make the code much simpler, than your proof of concept code.
>>>
>>> Any implementation of this needs to answer the question how much
>>> overhead does this extra accounting add.  I don't have a clue how much
>>> overhead you are adding but you are making structures larger and I
>>> suspect adding at least another cache line miss, so I suspect your
>>> changes will impact real world socket performance.
>>
>> Hi Eric,
>>
>> Thanks for your attention.
>>
>> So, what you propose was my first implementation. I ended up
>> throwing it away after playing with it for a while.
>>
>> One of the first problems that arises from that is that the sysctls are
>> tunables visible from inside the container. Those limits, however, are
>> to be set from the outside world. The code is not much better that way
>> either: instead of creating new cgroup structures and linking them
>> to the protocol, we end up doing it for net ns. We end up increasing
>> structures just the same...
>>
>> Also, since we're doing resource control, it seems more natural to use
>> cgroups. Now, the fact that there is no correlation whatsoever between
>> cgroups and namespaces does bother me. But that's another story, much
>> broader and more general than this patch.
>>
>
> I think using cgroup makes sense. A question in mind is whether it is
> better to integrate this kind of 'memory usage' control into memcg or not.
>
> What do you think? IMHO, having a cgroup per class of object is messy.
> ...
> How about adding
> 	memory.tcp_mem
> to memcg ?
>
> Or, adding kmem cgroup ?

I don't really care which cgroup we use. I chose a new socket one
because sockets are usually not like other objects. People love tweaking
network aspects, and it is not hard to imagine people wanting to extend it.

Whether all of this will ever belong in a cgroup is, of course, a different
matter.

Between your two suggestions, I like kmem better. It then makes it
absolutely clear that we will handle kernel objects only...

>> About overhead, since this is the first RFC, I did not care about
>> measuring. However, it seems trivial to me to guarantee at least
>> that it won't impose a significant performance penalty when it is
>> compiled out. If we're moving forward with this implementation, I will
>> include data in the next release so we can discuss on this basis.
>>
>
> IMHO, you should show performance numbers even for an RFC. Then people will
> look at the patch with more interest.

Let's call this one pre-RFC then.
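For readers following the thread, here is a minimal user-space sketch of the
idea described in the quoted RFC text: a proto-like structure whose function
pointers return per-cgroup values instead of global ones. It is not the patch
itself. The names struct sock_cgroup, struct proto_sketch and mem_schedule()
are hypothetical, and the three-value limit array is assumed to mirror the
min/pressure/max layout of the tcp_mem sysctl. The charging logic is
deliberately simplified compared with what the kernel does.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-cgroup socket memory state (all names are illustrative). */
struct sock_cgroup {
	long allocated;   /* pages currently charged to this cgroup */
	long limit[3];    /* min, pressure, max: same shape as tcp_mem */
	bool pressure;    /* set once the pressure threshold is crossed */
};

/*
 * Simplified stand-in for struct proto: each accessor takes the cgroup
 * and returns a pointer to per-cgroup data instead of a global variable.
 */
struct proto_sketch {
	long *(*memory_allocated)(struct sock_cgroup *cg);
	long *(*prot_mem)(struct sock_cgroup *cg);
	bool *(*memory_pressure)(struct sock_cgroup *cg);
};

static long *cg_allocated(struct sock_cgroup *cg) { return &cg->allocated; }
static long *cg_mem(struct sock_cgroup *cg) { return cg->limit; }
static bool *cg_pressure(struct sock_cgroup *cg) { return &cg->pressure; }

static const struct proto_sketch tcp_sketch = {
	.memory_allocated = cg_allocated,
	.prot_mem         = cg_mem,
	.memory_pressure  = cg_pressure,
};

/*
 * Try to charge `pages` to the cgroup. Returns true if the charge is
 * accepted, false if it would exceed the hard limit (nothing is charged
 * in that case).
 */
static bool mem_schedule(const struct proto_sketch *prot,
			 struct sock_cgroup *cg, long pages)
{
	long *allocated = prot->memory_allocated(cg);
	const long *mem = prot->prot_mem(cg);
	bool *pressure = prot->memory_pressure(cg);
	long total;

	assert(pages > 0);
	total = *allocated + pages;

	if (total > mem[2])
		return false;          /* above max: suppress the allocation */

	*allocated = total;
	if (total > mem[1])
		*pressure = true;      /* above the pressure threshold */
	else if (total <= mem[0])
		*pressure = false;     /* back under the low mark */
	return true;
}
```

Because every accessor receives the cgroup, two containers with separate
struct sock_cgroup instances are charged independently: one hitting its max
does not affect the other. The extra pointer indirection on each charge is
the kind of overhead the performance question in this thread is about.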
