Containers and /proc/sys/vm/drop

* Containers and /proc/sys/vm/drop_caches
@ 2011-01-05  9:40 Mike Hommey
       [not found] ` <20110105094022.GA5366-YmoObPS1fuhg9hUCZPvPmw@public.gmane.org>
  0 siblings, 1 reply; 14+ messages in thread
From: Mike Hommey @ 2011-01-05  9:40 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

[Copy/pasted from a previous message to lkml, where it was suggested to
 try containers@]

Hi,

I noticed that from within a lxc container, writing "3" to
/proc/sys/vm/drop_caches would flush the host page cache. That sounds a
little dangerous for VPS offerings that would be based on lxc, as in one
VPS instance root user could impact the overall performance of the host.
I don't know about other containers but I've been told openvz isn't
subject to this problem.
I only tested the current Debian Squeeze kernel, which is based on
2.6.32.27.

Cheers,

Mike


* Re: Containers and /proc/sys/vm/drop_caches
       [not found] ` <20110105094022.GA5366-YmoObPS1fuhg9hUCZPvPmw@public.gmane.org>
@ 2011-01-05  9:49   ` Daniel Lezcano
       [not found]     ` <4D243EC3.1050101-GANU6spQydw@public.gmane.org>
  0 siblings, 1 reply; 14+ messages in thread
From: Daniel Lezcano @ 2011-01-05  9:49 UTC (permalink / raw)
  To: Mike Hommey; +Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

On 01/05/2011 10:40 AM, Mike Hommey wrote:
> [Copy/pasted from a previous message to lkml, where it was suggested to
>   try containers@]
>
> Hi,
>
> I noticed that from within a lxc container, writing "3" to
> /proc/sys/vm/drop_caches would flush the host page cache. That sounds a
> little dangerous for VPS offerings that would be based on lxc, as in one
> VPS instance root user could impact the overall performance of the host.
> I don't know about other containers but I've been told openvz isn't
> subject to this problem.
> I only tested the current Debian Squeeze kernel, which is based on
> 2.6.32.27.

There is definitely a lot of work to do with /proc.

Some files should not be accessible (/proc/sys/vm/drop_caches,
/proc/sys/kernel/sysrq, ...) and some others should be virtualized
(/proc/meminfo, /proc/cpuinfo, ...).

Serge suggested creating something similar to the cgroup device
whitelist but for /proc; maybe it is a good approach for denying access
to specific /proc files.
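
As a rough illustration of the whitelist idea, here is a minimal sketch
in Python. The rule table, the "r"/"w" access letters and the
is_allowed() name are all hypothetical; nothing like this exists in the
kernel, and a real implementation would sit in fs/proc or a cgroup
subsystem rather than in userspace.

```python
from fnmatch import fnmatch

# Hypothetical per-container policy, loosely modelled on the cgroup device
# whitelist: each rule is (glob pattern, permitted access letters).
# "r" = read, "w" = write. The first matching rule wins.
DEFAULT_RULES = [
    ("/proc/sys/vm/drop_caches", ""),   # no access at all
    ("/proc/sys/kernel/sysrq", ""),     # no access at all
    ("/proc/sys/*", "r"),               # remaining sysctls are read-only
    ("/proc/*", "rw"),                  # everything else under /proc
]


def is_allowed(path, access, rules=DEFAULT_RULES):
    """Return True if `access` ("r" or "w") to `path` is permitted.

    Paths that match no rule are denied, which is what makes this a
    whitelist rather than a blacklist.
    """
    if access not in ("r", "w"):
        raise ValueError("access must be 'r' or 'w'")
    for pattern, perms in rules:
        # fnmatch's "*" also matches "/", so "/proc/sys/*" covers subdirs.
        if fnmatch(path, pattern):
            return access in perms
    return False
```

Rule order matters here: the specific deny entries have to come before
the broader patterns that would otherwise match the same path.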


* Re: Containers and /proc/sys/vm/drop_caches
       [not found]     ` <4D243EC3.1050101-GANU6spQydw@public.gmane.org>
@ 2011-01-05 14:01       ` Serge Hallyn
       [not found]         ` <20110105140159.GC2718-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 14+ messages in thread
From: Serge Hallyn @ 2011-01-05 14:01 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Mike Hommey,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Quoting Daniel Lezcano (daniel.lezcano-GANU6spQydw@public.gmane.org):
> On 01/05/2011 10:40 AM, Mike Hommey wrote:
> >[Copy/pasted from a previous message to lkml, where it was suggested to
> >  try containers@]
> >
> >Hi,
> >
> >I noticed that from within a lxc container, writing "3" to
> >/proc/sys/vm/drop_caches would flush the host page cache. That sounds a
> >little dangerous for VPS offerings that would be based on lxc, as in one
> >VPS instance root user could impact the overall performance of the host.
> >I don't know about other containers but I've been told openvz isn't
> >subject to this problem.
> >I only tested the current Debian Squeeze kernel, which is based on
> >2.6.32.27.
> 
> There is definitively a big work to do with /proc.
> 
> Some files should be not accessible (/proc/sys/vm/drop_caches,
> /proc/sys/kernel/sysrq, ...) and some other should be virtualized
> (/proc/meminfo, /proc/cpuinfo, ...).
> 
> Serge suggested to create something similar to the cgroup device
> whitelist but for /proc, maybe it is a good approach for denying
> access a specific proc's file.

Long-term, user namespaces should fix this - /proc will be owned
by the user namespace which mounted it, but we can tell proc to
always have some files (like drop_caches) be owned by init_user_ns.

I'm hoping to push my final targeted capabilities prototype in the
next few weeks, and after that I start seriously attacking VFS
interaction.

In the meantime, though, you can use SELinux/Smack, or a custom
cgroup file does sound useful.  Can cgroups be modules nowadays?
(I can't keep up)  If so, an out of tree proc-cgroup module seems
like a good interim solution.

-serge


* Re: Containers and /proc/sys/vm/drop_caches
       [not found]         ` <20110105140159.GC2718-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>
@ 2011-01-05 14:16           ` Balbir Singh
       [not found]             ` <AANLkTi=x=6gUZTxJC8LXxYNu029+firyzKqjMa6m+R-x-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 14+ messages in thread
From: Balbir Singh @ 2011-01-05 14:16 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: Mike Hommey,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

On Wed, Jan 5, 2011 at 7:31 PM, Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org> wrote:
> Quoting Daniel Lezcano (daniel.lezcano-GANU6spQydw@public.gmane.org):
>> On 01/05/2011 10:40 AM, Mike Hommey wrote:
>> >[Copy/pasted from a previous message to lkml, where it was suggested to
>> >  try containers@]
>> >
>> >Hi,
>> >
>> >I noticed that from within a lxc container, writing "3" to
>> >/proc/sys/vm/drop_caches would flush the host page cache. That sounds a
>> >little dangerous for VPS offerings that would be based on lxc, as in one
>> >VPS instance root user could impact the overall performance of the host.
>> >I don't know about other containers but I've been told openvz isn't
>> >subject to this problem.
>> >I only tested the current Debian Squeeze kernel, which is based on
>> >2.6.32.27.
>>
>> There is definitively a big work to do with /proc.
>>
>> Some files should be not accessible (/proc/sys/vm/drop_caches,
>> /proc/sys/kernel/sysrq, ...) and some other should be virtualized
>> (/proc/meminfo, /proc/cpuinfo, ...).
>>
>> Serge suggested to create something similar to the cgroup device
>> whitelist but for /proc, maybe it is a good approach for denying
>> access a specific proc's file.
>
> Long-term, user namespaces should fix this - /proc will be owned
> by the user namespace which mounted it, but we can tell proc to
> always have some files (like drop_caches) be owned by init_user_ns.
>
> I'm hoping to push my final targeted capabilities prototype in the
> next few weeks, and after that I start seriously attacking VFS
> interaction.
>
> In the meantime, though, you can use SELinux/Smack, or a custom
> cgroup file does sound useful.  Can cgroups be modules nowadays?
> (I can't keep up)  If so, an out of tree proc-cgroup module seems
> like a good interim solution.
>

Ideally, drop_caches should drop only the page cache belonging to that
container, but given that containers have a lot of shared page cache,
what is suggested might be a good way to work around the problem.

Balbir


* Re: Containers and /proc/sys/vm/drop_caches
       [not found]             ` <AANLkTi=x=6gUZTxJC8LXxYNu029+firyzKqjMa6m+R-x-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2011-01-06 21:43               ` Matt Helsley
       [not found]                 ` <20110106214315.GJ29064-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
  0 siblings, 1 reply; 14+ messages in thread
From: Matt Helsley @ 2011-01-06 21:43 UTC (permalink / raw)
  To: Balbir Singh
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Mike Hommey

On Wed, Jan 05, 2011 at 07:46:17PM +0530, Balbir Singh wrote:
> On Wed, Jan 5, 2011 at 7:31 PM, Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org> wrote:
> > Quoting Daniel Lezcano (daniel.lezcano-GANU6spQydw@public.gmane.org):
> >> On 01/05/2011 10:40 AM, Mike Hommey wrote:
> >> >[Copy/pasted from a previous message to lkml, where it was suggested to
> >> >  try containers@]
> >> >
> >> >Hi,
> >> >
> >> >I noticed that from within a lxc container, writing "3" to
> >> >/proc/sys/vm/drop_caches would flush the host page cache. That sounds a
> >> >little dangerous for VPS offerings that would be based on lxc, as in one
> >> >VPS instance root user could impact the overall performance of the host.
> >> >I don't know about other containers but I've been told openvz isn't
> >> >subject to this problem.
> >> >I only tested the current Debian Squeeze kernel, which is based on
> >> >2.6.32.27.
> >>
> >> There is definitively a big work to do with /proc.
> >>
> >> Some files should be not accessible (/proc/sys/vm/drop_caches,
> >> /proc/sys/kernel/sysrq, ...) and some other should be virtualized
> >> (/proc/meminfo, /proc/cpuinfo, ...).
> >>
> >> Serge suggested to create something similar to the cgroup device
> >> whitelist but for /proc, maybe it is a good approach for denying
> >> access a specific proc's file.
> >
> > Long-term, user namespaces should fix this - /proc will be owned
> > by the user namespace which mounted it, but we can tell proc to
> > always have some files (like drop_caches) be owned by init_user_ns.
> >
> > I'm hoping to push my final targeted capabilities prototype in the
> > next few weeks, and after that I start seriously attacking VFS
> > interaction.
> >
> > In the meantime, though, you can use SELinux/Smack, or a custom
> > cgroup file does sound useful.  Can cgroups be modules nowadays?
> > (I can't keep up)  If so, an out of tree proc-cgroup module seems
> > like a good interim solution.
> >
> 
> Ideally a drop_cache should drop page cache in that container, but
> given container have a lot of shared page cache, what is suggested
> might be a good way to work around the problem

One gross hack that comes to mind: instead of a hard permission model,
limit the frequency with which the container can actually drop caches.
Then the container's ability to interfere with host performance is more
limited (but still non-zero). Or limit the frequency on a per-user basis
(more like Serge's design), because running more containers from a
compromised user account shouldn't allow more frequent cache dropping.
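
A minimal sketch of that frequency limit, in Python for brevity. The
DropCachesLimiter class and its parameters are hypothetical names for
illustration only; the point is just that the limit is keyed on something
(a container ID or, per the second variant, a user ID) and that requests
arriving too soon are turned away.

```python
import time


class DropCachesLimiter:
    """Allow at most one cache drop per `min_interval` seconds for each key.

    The key could be a container ID or a user ID. Keying on the user means
    that starting more containers does not buy more frequent drops.
    """

    def __init__(self, min_interval, clock=time.monotonic):
        self.min_interval = min_interval
        self.clock = clock          # injectable so the logic can be tested
        self.last_drop = {}         # key -> time of last permitted drop

    def allow(self, key):
        """Return True and record the time if `key` may drop caches now."""
        now = self.clock()
        last = self.last_drop.get(key)
        if last is not None and now - last < self.min_interval:
            return False
        self.last_drop[key] = now
        return True
```

A refused request does not reset the timer, so a caller hammering the
file cannot push its own next permitted drop further into the future.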

That said, the more important question is why should we provide
drop_caches inside a container? My understanding is it's largely a
workload-debugging tool and not something meant to truly solve
problems. If that's the case then we shouldn't provide it at all or it
should actually interfere with the host cache.

Cheers,
	-Matt Helsley


* Re: Containers and /proc/sys/vm/drop_caches
       [not found]                 ` <20110106214315.GJ29064-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
@ 2011-01-06 21:50                   ` Dave Hansen
  2011-01-06 22:08                     ` Matt Helsley
  2011-01-07 13:03                   ` Rob Landley
  1 sibling, 1 reply; 14+ messages in thread
From: Dave Hansen @ 2011-01-06 21:50 UTC (permalink / raw)
  To: Matt Helsley
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Mike Hommey, Balbir Singh

On Thu, 2011-01-06 at 13:43 -0800, Matt Helsley wrote:
> That said, the more important question is why should we provide
> drop_caches inside a container? My understanding is it's largely a
> workload-debugging tool and not something meant to truly solve
> problems. If that's the case then we shouldn't provide it at all or it
> should actually interfere with the host cache. 

Yeah, what's the problem that you're solving with drop_caches?  The odds
are, there's a better way.

That said, it _might_ be worth doing things like dropping (inode or
dentry) caches per-sb.  That's a much better fit than using big, ugly,
loosely-defined, system-wide knobs like drop_caches.

Also, unless we start giving containers real ownership of devices or
partitions, it's going to be pretty darn hard to let things clear caches
in a meaningful way.  What if one container wants an object cleared
while another doesn't?

-- Dave

* Re: Containers and /proc/sys/vm/drop_caches
  2011-01-06 21:50                   ` Dave Hansen
@ 2011-01-06 22:08                     ` Matt Helsley
       [not found]                       ` <20110106220841.GK29064-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
  0 siblings, 1 reply; 14+ messages in thread
From: Matt Helsley @ 2011-01-06 22:08 UTC (permalink / raw)
  To: Dave Hansen
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Mike Hommey, Balbir Singh

On Thu, Jan 06, 2011 at 01:50:05PM -0800, Dave Hansen wrote:
> On Thu, 2011-01-06 at 13:43 -0800, Matt Helsley wrote:
> > That said, the more important question is why should we provide
> > drop_caches inside a container? My understanding is it's largely a
> > workload-debugging tool and not something meant to truly solve
> > problems. If that's the case then we shouldn't provide it at all or it
> > should actually interfere with the host cache. 
> 
> Yeah, what's the problem that you're solving with drop_caches?  The odds
> are, there's a better way.
> 
> That said, it _might_ be worth doing things like dropping (inode or
> dentry) caches per-sb.  That's a much better fit than using big, ugly,
> loosely-defined, system-wide knobs like drop_caches.

Yup. Since many containers will have their own mount namespaces with
separate sbs it's a more reasonable approximation of per-container
dropping of caches.

> 
> Also, unless we start giving containers real ownership of devices or
> partitions, it's going to be pretty darn hard to let things clear caches
> in a meaningful way.  What if one container wants an object cleared
> while another doesn't?

Good point. First reaction: we'd want to keep it cached if any of the
containers want it. But even that's a bad policy under certain
circumstances in which containers (aka VPS) might be used.

Is drop_caches well-defined? IOW would it be permissible to
not actually drop all or any of the cache entries or to do nothing and
still report success instead of, say, EPERM, to a container?

Cheers,
	-Matt Helsley


* Re: Containers and /proc/sys/vm/drop_caches
       [not found]                       ` <20110106220841.GK29064-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
@ 2011-01-06 22:15                         ` Dave Hansen
  0 siblings, 0 replies; 14+ messages in thread
From: Dave Hansen @ 2011-01-06 22:15 UTC (permalink / raw)
  To: Matt Helsley
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Mike Hommey, Balbir Singh

On Thu, 2011-01-06 at 14:08 -0800, Matt Helsley wrote:
> Is drop_caches well-defined? IOW would it be permissible to
> not actually drop all or any of the cache entries or to do nothing and
> still report success instead of, say, EPERM, to a container?

It's really just a hint or a request.  It's possible that an

	echo 3 > /proc/sys/vm/drop_caches

returns '2' (for the two bytes written), indicating success and yet, not
a single object was freed.  There's currently no way to tell how much
work it did, or to figure out why it did a certain amount of work.
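
About the best userspace can do is compare /proc/meminfo before and after
the write. A minimal sketch in Python, assuming the usual
"Name:   value kB" layout; parse_meminfo(), read_meminfo() and
reclaimed_kb() are hypothetical helper names, and the result is only a
rough estimate because other activity changes the counters as well.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are
    plain counts. Lines without a colon are skipped.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the live meminfo file."""
    with open(path) as f:
        return parse_meminfo(f.read())


def reclaimed_kb(before, after, keys=("Cached", "Buffers", "Slab")):
    """Return how much each selected field shrank between two snapshots."""
    return {k: before.get(k, 0) - after.get(k, 0) for k in keys}
```

Taking one read_meminfo() snapshot before the echo and one after, then
passing both to reclaimed_kb(), gives an approximate answer; a negative
number just means the cache grew in the meantime.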

Frankly, in a container, it probably just shouldn't even show up
in /proc.

-- Dave

* Re: Containers and /proc/sys/vm/drop_caches
       [not found]                 ` <20110106214315.GJ29064-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
  2011-01-06 21:50                   ` Dave Hansen
@ 2011-01-07 13:03                   ` Rob Landley
       [not found]                     ` <4D270F34.8080305-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
  1 sibling, 1 reply; 14+ messages in thread
From: Rob Landley @ 2011-01-07 13:03 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

On 01/06/2011 03:43 PM, Matt Helsley wrote:
> On Wed, Jan 05, 2011 at 07:46:17PM +0530, Balbir Singh wrote:
>> On Wed, Jan 5, 2011 at 7:31 PM, Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org> wrote:
>>> Quoting Daniel Lezcano (daniel.lezcano-GANU6spQydw@public.gmane.org):
>>>> On 01/05/2011 10:40 AM, Mike Hommey wrote:
>>>>> [Copy/pasted from a previous message to lkml, where it was suggested to
>>>>>  try containers@]
>>>>>
>>>>> Hi,
>>>>>
>>>>> I noticed that from within a lxc container, writing "3" to
>>>>> /proc/sys/vm/drop_caches would flush the host page cache. That sounds a
>>>>> little dangerous for VPS offerings that would be based on lxc, as in one
>>>>> VPS instance root user could impact the overall performance of the host.
>>>>> I don't know about other containers but I've been told openvz isn't
>>>>> subject to this problem.
>>>>> I only tested the current Debian Squeeze kernel, which is based on
>>>>> 2.6.32.27.
>>>>
>>>> There is definitively a big work to do with /proc.
>>>>
>>>> Some files should be not accessible (/proc/sys/vm/drop_caches,
>>>> /proc/sys/kernel/sysrq, ...) and some other should be virtualized
>>>> (/proc/meminfo, /proc/cpuinfo, ...).
>>>>
>>>> Serge suggested to create something similar to the cgroup device
>>>> whitelist but for /proc, maybe it is a good approach for denying
>>>> access a specific proc's file.
>>>
>>> Long-term, user namespaces should fix this - /proc will be owned
>>> by the user namespace which mounted it, but we can tell proc to
>>> always have some files (like drop_caches) be owned by init_user_ns.

Changing ownership so a script can't open a file that it otherwise
could may cause scripts to fail when run in a container.  Makes the
containers less transparent.

>>> I'm hoping to push my final targeted capabilities prototype in the
>>> next few weeks, and after that I start seriously attacking VFS
>>> interaction.
>>>
>>> In the meantime, though, you can use SELinux/Smack, or a custom
>>> cgroup file does sound useful.  Can cgroups be modules nowadays?
>>> (I can't keep up)  If so, an out of tree proc-cgroup module seems
>>> like a good interim solution.
>>>
>>
>> Ideally a drop_cache should drop page cache in that container, but
>> given container have a lot of shared page cache, what is suggested
>> might be a good way to work around the problem
> 
> One gross hack that comes to mind: Instead of a hard permission model
> limit the frequency with which the container could actually drop caches.
> Then the container's ability to interfere with host performance is more
> limited (but still non-zero). Or limit frequency on a per-user basis
> (more like Serge's design) because running more containers by a
> compromised user account shouldn't allow more frequent cache dropping.

Disk access causes at best multi-millisecond latency spikes, which can cause
a heavily loaded server to go into thrashing meltdown.  So a container
could screw up another container with this pretty badly.

The easy short-term fix is to make containers silently ignore writes to
drop_caches.

> That said, the more important question is why should we provide
> drop_caches inside a container? My understanding is it's largely a
> workload-debugging tool and not something meant to truly solve
> problems.

A heavily loaded system that goes deep into swap without triggering
the OOM killer can become pretty useless.  My home laptop with 2 gigs
of ram gets so sluggish whenever I compile something that you can't
use the touchpad anymore because hitting the boundary of a widget
with the mouse pointer causes a 5 second freeze while it bounces
off three or four processes to handle the message, evicting yet more
pages to fault in the pages to handle the X events.  By the time
the pointer moves again it's way overshot.  (Ok, having firefox,
chrome, and kmail open with several dozen tabs open in each may have
something to do with this.)

When it does this, ctrl-alt-f1 echo 1 > /proc/sys/vm/drop_caches
is just about the only thing that will snap it out of it short of
killing processes.  The system has ~600 megs of ram tied up in
disk cache while being so short of anonymous pages the mouse is
useless.

That doesn't necessarily apply to containers but that's one use case
of using it as a stick to hit the darn overburdened machine when it's
making stupid memory allocation decisions.  (Playing with swappiness
puts the OOM killer on a hair trigger, depending on kernel version
du jour.)

However, it's not guaranteed to do anything (the cached data could
be dirty, mmaped by some process, immediately faulted back in
by some other process), so ignoring writes to drop_caches from a
container is probably legal behavior anyway.

Rob


* Re: Containers and /proc/sys/vm/drop_caches
       [not found]                     ` <4D270F34.8080305-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2011-01-07 15:12                       ` Serge Hallyn
       [not found]                         ` <20110107151241.GB4962-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 14+ messages in thread
From: Serge Hallyn @ 2011-01-07 15:12 UTC (permalink / raw)
  To: Rob Landley; +Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Quoting Rob Landley (rlandley-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org):
> On 01/06/2011 03:43 PM, Matt Helsley wrote:
> > On Wed, Jan 05, 2011 at 07:46:17PM +0530, Balbir Singh wrote:
> >> On Wed, Jan 5, 2011 at 7:31 PM, Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org> wrote:
> >>> Quoting Daniel Lezcano (daniel.lezcano-GANU6spQydw@public.gmane.org):
> >>>> On 01/05/2011 10:40 AM, Mike Hommey wrote:
> >>>>> [Copy/pasted from a previous message to lkml, where it was suggested to
> >>>>>  try containers@]
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> I noticed that from within a lxc container, writing "3" to
> >>>>> /proc/sys/vm/drop_caches would flush the host page cache. That sounds a
> >>>>> little dangerous for VPS offerings that would be based on lxc, as in one
> >>>>> VPS instance root user could impact the overall performance of the host.
> >>>>> I don't know about other containers but I've been told openvz isn't
> >>>>> subject to this problem.
> >>>>> I only tested the current Debian Squeeze kernel, which is based on
> >>>>> 2.6.32.27.
> >>>>
> >>>> There is definitively a big work to do with /proc.
> >>>>
> >>>> Some files should be not accessible (/proc/sys/vm/drop_caches,
> >>>> /proc/sys/kernel/sysrq, ...) and some other should be virtualized
> >>>> (/proc/meminfo, /proc/cpuinfo, ...).
> >>>>
> >>>> Serge suggested to create something similar to the cgroup device
> >>>> whitelist but for /proc, maybe it is a good approach for denying
> >>>> access a specific proc's file.
> >>>
> >>> Long-term, user namespaces should fix this - /proc will be owned
> >>> by the user namespace which mounted it, but we can tell proc to
> >>> always have some files (like drop_caches) be owned by init_user_ns.
> 
> Changing ownership so a script can't open a file that it otherwise
> could may cause scripts to fail when run in a container.  Makes the
> containers less transparent.

While my goal next week is to make containers more transparent, the
official stance from kernel summit a few years ago was:  transparent
containers are not a valid goal (as seen from kernel).

Not saying that what you're saying above is wrong, but I *do* argue
that 'silently ignoring the write' is more wrong than refusing the
write :)  Fooling userspace is a lose, imo.
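
To make the difference concrete from the userspace side, here is a
minimal Python sketch of a script that asks for a cache drop and reports
what it was told. request_drop_caches() and its return strings are
hypothetical names for illustration; the errno values are the standard
ones a refused or hidden file would be expected to produce.

```python
import errno


def request_drop_caches(value="3", path="/proc/sys/vm/drop_caches"):
    """Write `value` to `path` and report how the request was received.

    Returns "accepted" if the write succeeded (which does not prove that
    anything was freed), "refused" if permission was denied, and
    "missing" if the file is not visible at all.
    """
    try:
        # The write may only fail when the buffer is flushed on close,
        # so the whole "with" block sits inside the try.
        with open(path, "w") as f:
            f.write(value + "\n")
    except OSError as e:
        if e.errno in (errno.EPERM, errno.EACCES, errno.EROFS):
            return "refused"
        if e.errno == errno.ENOENT:
            return "missing"
        raise
    return "accepted"
```

If the container silently ignores the write, this returns "accepted" and
the caller cannot tell that nothing happened; with a refusal it at least
gets an explicit answer it can act on.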

Also, we can use a FUSE fs over proc to hide the files.  Doing that
now is insufficient because root in the container can just remount
proc over the filter.  But after user namespaces, root in the container
has the choice of leaving the filter in place for the sake of his own
userspace, or removing it and getting a bunch of files he can't use.

...

> A heavily loaded system that goes deep into swap without triggering
> the OOM killer can become pretty useless.  My home laptop with 2 gigs

Isn't a cgroup that controls both memory and swap access the right
answer to this?  (And do we have that now, btw?)

(I'm doing too many things at once so probably not thinking this
through enough)

-serge


* Re: Containers and /proc/sys/vm/drop_caches
       [not found]                         ` <20110107151241.GB4962-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>
@ 2011-01-08 12:39                           ` Rob Landley
       [not found]                             ` <4D285B03.6050708-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
  0 siblings, 1 reply; 14+ messages in thread
From: Rob Landley @ 2011-01-08 12:39 UTC (permalink / raw)
  To: Serge Hallyn; +Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

On 01/07/2011 09:12 AM, Serge Hallyn wrote:
>> Changing ownership so a script can't open a file that it otherwise
>>  could may cause scripts to fail when run in a container.  Makes
>> the containers less transparent.
> 
> While my goal next week is to make containers more transparent, the 
> official stance from kernel summit a few years ago was:  transparent
>  containers are not a valid goal (as seen from kernel).

Do you have a reference for that?  I'm still coming up to speed on all
this.  Trying to collect documentation...

>> A heavily loaded system that goes deep into swap without triggering
>> the OOM killer can become pretty useless.  My home laptop with 2
>> gigs
> 
> Isn't a cgroup that controls both memory and swap access the right 
> answer to this?

There are other ways to work around it, sure.  (It's yet to be proven that
they do actually work better in resource constrained desktop environments
under real-world load, but they seem very promising.)

I was just pointing out that this has seen some use as a recovery
mechanism, slightly less drastic than the OOM killer.  (Didn't say it was
a _good_ use.  Also, error avoidance and error recovery are different
issues, and virtual memory is an inherently overcommitted resource domain.)

> (And do we have that now, btw?)

I think it's coming, rather than actually here.  (I thought the
beancounters stuff was OpenVZ, controlled by syscalls that the kernel
developers rejected.  Have resource constraints on anything other than
scheduler made it into vanilla yet?  If so, what's the UI to control them?)

By the way, from a UI perspective, most of the containers stuff I've seen
so far is apparently aimed at big iron deployments (or attempts to make PC
clusters look like mainframes, I.E. this "cloud" stuff).  I'm glad to see
more diverse uses of it, but one of the downsides of cobbling together a
mechanism from a dozen different unrelated pieces of infrastructure (clone
flags, cgroup filesystem, extra mount flags on proc and such so they behave
differently) is that we need a lot of documentation/example code/libraries
to make it easy to use.  "You can do X" and "it's easy to reliably do X"
have a gap that may take a while to close...

Rob


* Re: Containers and /proc/sys/vm/drop_caches
       [not found]                             ` <4D285B03.6050708-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2011-01-11 16:28                               ` Serge Hallyn
  0 siblings, 0 replies; 14+ messages in thread
From: Serge Hallyn @ 2011-01-11 16:28 UTC (permalink / raw)
  To: Rob Landley; +Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Quoting Rob Landley (rlandley-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org):
> On 01/07/2011 09:12 AM, Serge Hallyn wrote:
> >> Changing ownership so a script can't open a file that it otherwise
> >>  could may cause scripts to fail when run in a container.  Makes
> >> the containers less transparent.
> > 
> > While my goal next week is to make containers more transparent, the 
> > official stance from kernel summit a few years ago was:  transparent
> >  containers are not a valid goal (as seen from kernel).
> 
> Do you have a reference for that?  I'm still coming up to speed on all this.  Trying to collect documentation...

Sorry, I don't offhand, and a quick google search wasn't helpful.  I think
it was from the very first containers discussion at ksummit, but not sure.
There is http://lwn.net/Articles/191923/.  Toward the bottom it claims that
no one thought it would be a problem to tweak distros to run in containers
without /sys and /proc.

But this was 2006, when pid namespaces were still a new idea, and no one
was actually using containers.  It certainly is possible that sentiment
has changed, which is why I do feel that it's worth it for someone to
try some native containerization inside fs/proc/*.c.  While user namespaces
should make it possible to make fuse proc filtering less wishy-washy, they
won't make it any less ugly :)

-serge

* Containers and /proc/sys/vm/drop_caches
@ 2010-12-30  7:59 Mike Hommey
  2010-12-30  8:57 ` Rob Landley
  0 siblings, 1 reply; 14+ messages in thread
From: Mike Hommey @ 2010-12-30  7:59 UTC (permalink / raw)
  To: linux-kernel

Hi,

I noticed that from within a lxc container, writing "3" to
/proc/sys/vm/drop_caches would flush the host page cache. That sounds a
little dangerous for VPS offerings that would be based on lxc, as in one
VPS instance root user could impact the overall performance of the host.
I don't know about other containers but I've been told openvz isn't
subject to this problem.
I only tested the current Debian Squeeze kernel, which is based on
2.6.32.27.

Cheers,

Mike

* Re: Containers and /proc/sys/vm/drop_caches
  2010-12-30  7:59 Mike Hommey
@ 2010-12-30  8:57 ` Rob Landley
  0 siblings, 0 replies; 14+ messages in thread
From: Rob Landley @ 2010-12-30  8:57 UTC (permalink / raw)
  To: Mike Hommey; +Cc: linux-kernel

On Thu, Dec 30, 2010 at 1:59 AM, Mike Hommey <mh@glandium.org> wrote:
> Hi,
>
> I noticed that from within a lxc container, writing "3" to
> /proc/sys/vm/drop_caches would flush the host page cache. That sounds a
> little dangerous for VPS offerings that would be based on lxc, as in one
> VPS instance root user could impact the overall performance of the host.

There's a containers@vger mailing list for this stuff, you might have better
luck asking there.

> I don't know about other containers but I've been told openvz isn't
> subject to this problem.

I've been coming up to speed on this area recently: openvz has a lot of stuff
that isn't in the main kernel, but it's based on an approach that didn't get
merged into the kernel (using new syscalls to control container stuff).

Instead Google's rewrite of sgi's cgroup stuff went in for process grouping
(based on the cgroup filesystem), and a half-dozen different types of
namespaces are based on flags to clone(), and various other filesystems
(proc, sys, devpts) grew some kind of -o newinstance flag (see
http://lkml.indiana.edu/hypermail//linux/kernel/1012.3/00777.html for a
pending example, although why they can't detect they're the first instance
in the current container rather than containers having to be specially set
up by the host, I still don't understand yet)... and so on.

The rest of the stuff openvz does is still being redesigned to go into
vanilla based on those mechanisms.  It seems a bit like squashfs: vanilla
should be able to do this someday, but when it gets merged it may not be
compatible with the out of tree version.  LXC is an attempt to make a
userspace tool to drive containers in the vanilla kernel.  It doesn't do
half of what openvz does yet, but they're working on it.

Rob
