* Re: [RFC PATCH 0/4] Make inotify instance/watches be accounted per userns
[not found] ` <1464767580-22732-1-git-send-email-kernel-6AxghH7DbtA@public.gmane.org>
@ 2016-06-01 16:00 ` Eric W. Biederman
0 siblings, 0 replies; 9+ messages in thread
From: Eric W. Biederman @ 2016-06-01 16:00 UTC (permalink / raw)
To: Nikolay Borisov
Cc: jack-AlSwsSmVLrQ, avagin-GEFAQzZX7r8dnm+yROfE0A,
netdev-u79uwXL29TY76Z2rM5mHXA, Linux Containers,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
eparis-H+wXaHxf7aLQT0dZR+AlfA, operations-/eCPMmvKun9pLGFMi4vTTA,
gorcunov-GEFAQzZX7r8dnm+yROfE0A,
john-jueV0HHMeujJJrXXpGQQMAC/G2K4zDHf
Cc'd the containers list.
Nikolay Borisov <kernel-6AxghH7DbtA@public.gmane.org> writes:
> Currently the inotify instances/watches are being accounted in the
> user_struct structure. This means that in setups where multiple
> users in unprivileged containers map to the same underlying
> real user (e.g. user_struct) the inotify limits are going to be
> shared as well which can lead to unplesantries. This is a problem
> since any user inside any of the containers can potentially exhaust
> the instance/watches limit which in turn might prevent certain
> services from other containers from starting.
At a high level this is a bit problematic, as it appears to escape the
current limits and allows anyone creating a user namespace to have their
own fresh set of limits. Given that anyone should be able to create a
user namespace whenever they feel like it, escaping limits is a problem.
That, however, is solvable.
A practical question. What kind of limits are we looking at here?
Are these loose limits for detecting buggy programs that have gone
off their rails?
Are these tight limits to ensure multitasking is possible?
For tight limits, where something is actively controlling the limits, you
probably want a cgroup-based solution.
For loose limits, the kind where you set a good default and forget about
it, I think a user namespace based solution is reasonable.
> The solution I propose is rather simple, instead of accounting the
> watches/instances per user_struct, start accounting them in a hashtable,
> where the index used is the hashed pointer of the userns. This way
> the administrator needn't set the inotify limits very high and also
> the risk of one container breaching the limits and affecting every
> other container is alleviated.
I don't think this is the right data structure for a user namespace
based solution, at least in part because it does not account for users
escaping.
> I have performed functional testing to validate that limits in
> different namespaces are indeed separate, as well as running
> multiple inotify stressers from stress-ng to ensure I haven't
> introduced any race conditions.
>
> This series is based on 4.7-rc1 (and applies cleanly on 4.4.10) and
> consist of the following 4 patches:
>
> Patch 1: This introduces the necessary structure and code changes. Including
> hashtable.h to sched.h causes some warnings in files which define HAS_SIZE macro,
> patch 3 fixes this by doing mechanical rename.
>
> Patch 2: This patch flips the inotify code to user the new infrastructure.
>
> Patch 3: This is a simple mechanical rename of conflicting definitions with
> hashtable.h's HASH_SIZE macro. I'm happy about comments how I should go
> about this.
>
> Patch 4: This is a rather self-container patch and can go irrespective of
> whether the series is accepted, it's needed so that building the kernel
> with !CONFIG_INOTIFY_USER doesn't fail (with patch 1 being applied).
> However, fdinfo.c doesn't really need inotify.h
>
> Nikolay Borisov (4):
> inotify: Add infrastructure to account inotify limits per-namespace
> inotify: Convert inotify limits to be accounted
> per-realuser/per-namespace
> misc: Rename the HASH_SIZE macro
> inotify: Don't include inotify.h when !CONFIG_INOTIFY_USER
>
> fs/logfs/dir.c | 6 +--
> fs/notify/fdinfo.c | 3 ++
> fs/notify/inotify/inotify.h | 68 ++++++++++++++++++++++++++++++++
> fs/notify/inotify/inotify_fsnotify.c | 14 ++++++-
> fs/notify/inotify/inotify_user.c | 57 ++++++++++++++++++++++----
> include/linux/fsnotify_backend.h | 1 +
> include/linux/sched.h | 5 ++-
> kernel/user.c | 13 ++++++
> net/ipv6/ip6_gre.c | 8 ++--
> net/ipv6/ip6_tunnel.c | 10 ++---
> net/ipv6/ip6_vti.c | 10 ++---
> net/ipv6/sit.c | 10 ++---
> security/keys/encrypted-keys/encrypted.c | 32 +++++++--------
> 13 files changed, 189 insertions(+), 48 deletions(-)
Eric
* Re: [RFC PATCH 0/4] Make inotify instance/watches be accounted per userns
[not found] ` <8737ow7vcp.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2016-06-02 6:27 ` Nikolay Borisov
[not found] ` <574FD1E4.8090109-6AxghH7DbtA@public.gmane.org>
2016-06-02 7:49 ` Jan Kara
1 sibling, 1 reply; 9+ messages in thread
From: Nikolay Borisov @ 2016-06-02 6:27 UTC (permalink / raw)
To: Eric W. Biederman
Cc: jack-AlSwsSmVLrQ, avagin-GEFAQzZX7r8dnm+yROfE0A,
netdev-u79uwXL29TY76Z2rM5mHXA, Linux Containers,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
eparis-H+wXaHxf7aLQT0dZR+AlfA, operations-/eCPMmvKun9pLGFMi4vTTA,
gorcunov-GEFAQzZX7r8dnm+yROfE0A,
john-jueV0HHMeujJJrXXpGQQMAC/G2K4zDHf
On 06/01/2016 07:00 PM, Eric W. Biederman wrote:
> Cc'd the containers list.
>
>
> Nikolay Borisov <kernel-6AxghH7DbtA@public.gmane.org> writes:
>
>> Currently the inotify instances/watches are being accounted in the
>> user_struct structure. This means that in setups where multiple
>> users in unprivileged containers map to the same underlying
>> real user (e.g. user_struct) the inotify limits are going to be
>> shared as well which can lead to unplesantries. This is a problem
>> since any user inside any of the containers can potentially exhaust
>> the instance/watches limit which in turn might prevent certain
>> services from other containers from starting.
>
> On a high level this is a bit problematic as it appears to escapes the
> current limits and allows anyone creating a user namespace to have their
> own fresh set of limits. Given that anyone should be able to create a
> user namespace whenever they feel like escaping limits is a problem.
> That however is solvable.
This is indeed a problem and the presented solution is rather dumb in
that regard. I'm happy to work with you on suggestions so that I arrive
at a solution that is upstreamable.
>
> A practical question. What kind of limits are we looking at here?
>
> Are these loose limits for detecting buggy programs that have gone
> off their rails?
Loose limits.
>
> Are these tight limits to ensure multitasking is possible?
>
>
>
> For tight limits where something is actively controlling the limits you
> probably want a cgroup base solution.
>
> For loose limits that are the kind where you set a good default and
> forget about I think a user namespace based solution is reasonable.
That's exactly the use case I had in mind.
>
>> The solution I propose is rather simple, instead of accounting the
>> watches/instances per user_struct, start accounting them in a hashtable,
>> where the index used is the hashed pointer of the userns. This way
>> the administrator needn't set the inotify limits very high and also
>> the risk of one container breaching the limits and affecting every
>> other container is alleviated.
>
> I don't think this is the right data structure for a user namespace
> based solution, at least in part because it does not account for users
> escaping.
Admittedly this is a naive solution. What are your ideas on something
which achieves my initial aim of having limits per user, yet does not
allow users to just create another namespace and escape them? The
current namespace code has a hard-coded limit of 32 for nesting user
namespaces. So currently, in the worst case, one can escape the limits
up to 32 * current_limits.
* Re: [RFC PATCH 0/4] Make inotify instance/watches be accounted per userns
[not found] ` <8737ow7vcp.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2016-06-02 6:27 ` Nikolay Borisov
@ 2016-06-02 7:49 ` Jan Kara
[not found] ` <20160602074920.GG19636-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org>
1 sibling, 1 reply; 9+ messages in thread
From: Jan Kara @ 2016-06-02 7:49 UTC (permalink / raw)
To: Eric W. Biederman
Cc: jack-AlSwsSmVLrQ, avagin-GEFAQzZX7r8dnm+yROfE0A,
netdev-u79uwXL29TY76Z2rM5mHXA, Linux Containers,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
eparis-H+wXaHxf7aLQT0dZR+AlfA, operations-/eCPMmvKun9pLGFMi4vTTA,
Nikolay Borisov, gorcunov-GEFAQzZX7r8dnm+yROfE0A,
john-jueV0HHMeujJJrXXpGQQMAC/G2K4zDHf
On Wed 01-06-16 11:00:06, Eric W. Biederman wrote:
> Cc'd the containers list.
>
> Nikolay Borisov <kernel-6AxghH7DbtA@public.gmane.org> writes:
>
> > Currently the inotify instances/watches are being accounted in the
> > user_struct structure. This means that in setups where multiple
> > users in unprivileged containers map to the same underlying
> > real user (e.g. user_struct) the inotify limits are going to be
> > shared as well which can lead to unplesantries. This is a problem
> > since any user inside any of the containers can potentially exhaust
> > the instance/watches limit which in turn might prevent certain
> > services from other containers from starting.
>
> On a high level this is a bit problematic as it appears to escapes the
> current limits and allows anyone creating a user namespace to have their
> own fresh set of limits. Given that anyone should be able to create a
> user namespace whenever they feel like escaping limits is a problem.
> That however is solvable.
>
> A practical question. What kind of limits are we looking at here?
>
> Are these loose limits for detecting buggy programs that have gone
> off their rails?
>
> Are these tight limits to ensure multitasking is possible?
The original motivation for these limits is to limit resource usage. There
is an in-kernel data structure associated with each notification mark
you create, and we don't want users to be able to DoS the system by creating
too many of them. Thus we limit the number of notification marks for each user.
There is also a limit on the number of notification instances - those are
naturally limited by the number of open file descriptors but the admin may
want to limit them more...
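For reference, the two per-user limits described above (marks and
instances) are exposed as sysctls under /proc/sys/fs/inotify. A minimal
sketch of reading them follows; the helper name read_inotify_limits and
the dictionary layout are only illustrative, not part of any kernel or
library API.

```python
from pathlib import Path

# Standard procfs location of the inotify sysctls on Linux.
INOTIFY_SYSCTL_DIR = Path("/proc/sys/fs/inotify")
KNOBS = ("max_user_instances", "max_user_watches", "max_queued_events")


def read_inotify_limits(base=INOTIFY_SYSCTL_DIR):
    """Return {knob: int or None}; None means the file could not be read.

    Hypothetical helper for illustration only.
    """
    limits = {}
    for knob in KNOBS:
        try:
            limits[knob] = int((Path(base) / knob).read_text().strip())
        except (OSError, ValueError):
            # Not Linux, inotify not built in, or unexpected file content.
            limits[knob] = None
    return limits


if __name__ == "__main__":
    for knob, value in read_inotify_limits().items():
        print(f"{knob}: {value if value is not None else 'unavailable'}")
```

On a system without these files the helper simply reports every knob as
unavailable rather than failing.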
So cgroups would probably be the best fit for this, but I'm not sure whether
that would be overkill...
Honza
--
Jan Kara <jack-IBi9RG/b67k@public.gmane.org>
SUSE Labs, CR
* Re: [RFC PATCH 0/4] Make inotify instance/watches be accounted per userns
[not found] ` <574FD1E4.8090109-6AxghH7DbtA@public.gmane.org>
@ 2016-06-02 16:19 ` Eric W. Biederman
0 siblings, 0 replies; 9+ messages in thread
From: Eric W. Biederman @ 2016-06-02 16:19 UTC (permalink / raw)
To: Nikolay Borisov
Cc: jack-AlSwsSmVLrQ, avagin-GEFAQzZX7r8dnm+yROfE0A,
netdev-u79uwXL29TY76Z2rM5mHXA, Linux Containers,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
eparis-H+wXaHxf7aLQT0dZR+AlfA, operations-/eCPMmvKun9pLGFMi4vTTA,
gorcunov-GEFAQzZX7r8dnm+yROfE0A,
john-jueV0HHMeujJJrXXpGQQMAC/G2K4zDHf
Nikolay Borisov <kernel-6AxghH7DbtA@public.gmane.org> writes:
> On 06/01/2016 07:00 PM, Eric W. Biederman wrote:
>> Cc'd the containers list.
>>
>>
>> Nikolay Borisov <kernel-6AxghH7DbtA@public.gmane.org> writes:
>>
>>> Currently the inotify instances/watches are being accounted in the
>>> user_struct structure. This means that in setups where multiple
>>> users in unprivileged containers map to the same underlying
>>> real user (e.g. user_struct) the inotify limits are going to be
>>> shared as well which can lead to unplesantries. This is a problem
>>> since any user inside any of the containers can potentially exhaust
>>> the instance/watches limit which in turn might prevent certain
>>> services from other containers from starting.
>>
>> On a high level this is a bit problematic as it appears to escapes the
>> current limits and allows anyone creating a user namespace to have their
>> own fresh set of limits. Given that anyone should be able to create a
>> user namespace whenever they feel like escaping limits is a problem.
>> That however is solvable.
>
> This is indeed a problem and the presented solution is rather dumb in
> that regard. I'm happy to work with you on suggestions so that I arrive
> at a solution that is upstreamable.
The one in-kernel solution to hierarchical resource limits that I am
aware of is the current include/linux/page_counter.h, which evolved from
include/linux/res_counter.h.
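To make the idea of a hierarchical limit concrete, here is a small
userspace model, loosely inspired by the charge-and-roll-back approach
of a hierarchical counter. It is a sketch under stated assumptions, not
the kernel's page_counter code: the class name NsCounter, its methods
and the numbers used are all hypothetical. The point it illustrates is
that a charge has to fit in a namespace's own limit and in every
ancestor's limit, so creating more namespaces does not yield a fresh
budget.

```python
class NsCounter:
    """Model of one node in a tree of limits (hypothetical, userspace only)."""

    def __init__(self, limit, parent=None):
        self.limit = limit
        self.usage = 0
        self.parent = parent

    def try_charge(self, n=1):
        """Charge n units here and in all ancestors, or charge nothing."""
        charged = []
        node = self
        while node is not None:
            if node.usage + n > node.limit:
                # Roll back the levels that were already charged.
                for done in charged:
                    done.usage -= n
                return False
            node.usage += n
            charged.append(node)
            node = node.parent
        return True

    def uncharge(self, n=1):
        """Release n units from this node and from every ancestor."""
        node = self
        while node is not None:
            node.usage -= n
            node = node.parent


if __name__ == "__main__":
    root = NsCounter(limit=100)             # stands in for the initial namespace
    child = NsCounter(limit=30, parent=root)
    nested = NsCounter(limit=30, parent=child)

    # Nesting deeper does not help: the branch is capped by child's limit.
    print(sum(nested.try_charge() for _ in range(50)), "charges granted in branch")

    # Going wide does not help either: siblings share root's limit.
    siblings = [NsCounter(limit=30, parent=root) for _ in range(3)]
    granted = sum(s.try_charge() for s in siblings for _ in range(30))
    print(granted, "charges granted across siblings; root usage", root.usage)
```

The cost of a charge in this model is proportional to the nesting depth,
which is bounded, so the per-operation overhead stays small.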
>> A practical question. What kind of limits are we looking at here?
>>
>> Are these loose limits for detecting buggy programs that have gone
>> off their rails?
>
> Loose limits.
>
>>
>> Are these tight limits to ensure multitasking is possible?
>>
>>
>>
>> For tight limits where something is actively controlling the limits you
>> probably want a cgroup base solution.
>>
>> For loose limits that are the kind where you set a good default and
>> forget about I think a user namespace based solution is reasonable.
>
> That's exactly the use case I had in mind.
>
>>
>>> The solution I propose is rather simple, instead of accounting the
>>> watches/instances per user_struct, start accounting them in a hashtable,
>>> where the index used is the hashed pointer of the userns. This way
>>> the administrator needn't set the inotify limits very high and also
>>> the risk of one container breaching the limits and affecting every
>>> other container is alleviated.
>>
>> I don't think this is the right data structure for a user namespace
>> based solution, at least in part because it does not account for users
>> escaping.
>
> Admittedly this is a naive solution, what are you ideas on something
> which achieves my initial aim of having limits per users, yet not
> allowing them to just create another namespace and escape them. The
> current namespace code has a hard-coded limit of 32 for nesting user
> namespaces. So currently at the worst case one can escape the limits up
> to 32 * current_limits.
32 is the nesting depth, not the width of the tree. But see above.
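A quick back-of-the-envelope sketch of why depth is not the relevant
bound: if every namespace gets its own fresh limit, the total a single
host user can hold scales with the number of namespaces created, and
the number of siblings is not capped by the nesting depth. The function
names and the value 8192 are purely illustrative assumptions.

```python
def namespaces_in_tree(depth, width):
    """Namespaces below the root when each node has `width` children
    and the tree is `depth` levels deep (root itself excluded)."""
    return sum(width ** level for level in range(1, depth + 1))


def worst_case_total(per_ns_limit, namespaces):
    """Total objects one user could hold if each namespace has a fresh limit."""
    return per_ns_limit * namespaces


if __name__ == "__main__":
    limit = 8192  # illustrative per-namespace watch limit
    # A single chain, 32 levels deep: 32 namespaces.
    print(worst_case_total(limit, namespaces_in_tree(depth=32, width=1)))
    # 1000 siblings directly under the root: depth 1, but 1000 namespaces.
    print(worst_case_total(limit, namespaces_in_tree(depth=1, width=1000)))
```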
Eric
* Re: [RFC PATCH 0/4] Make inotify instance/watches be accounted per userns
[not found] ` <20160602074920.GG19636-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org>
@ 2016-06-02 16:58 ` Eric W. Biederman
[not found] ` <87bn3jy1cd.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
0 siblings, 1 reply; 9+ messages in thread
From: Eric W. Biederman @ 2016-06-02 16:58 UTC (permalink / raw)
To: Jan Kara
Cc: avagin-GEFAQzZX7r8dnm+yROfE0A, netdev-u79uwXL29TY76Z2rM5mHXA,
Linux Containers, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
eparis-H+wXaHxf7aLQT0dZR+AlfA, operations-/eCPMmvKun9pLGFMi4vTTA,
Nikolay Borisov, gorcunov-GEFAQzZX7r8dnm+yROfE0A,
john-jueV0HHMeujJJrXXpGQQMAC/G2K4zDHf
Nikolay please see my question for you at the end.
Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> writes:
> On Wed 01-06-16 11:00:06, Eric W. Biederman wrote:
>> Cc'd the containers list.
>>
>> Nikolay Borisov <kernel-6AxghH7DbtA@public.gmane.org> writes:
>>
>> > Currently the inotify instances/watches are being accounted in the
>> > user_struct structure. This means that in setups where multiple
>> > users in unprivileged containers map to the same underlying
>> > real user (e.g. user_struct) the inotify limits are going to be
>> > shared as well which can lead to unplesantries. This is a problem
>> > since any user inside any of the containers can potentially exhaust
>> > the instance/watches limit which in turn might prevent certain
>> > services from other containers from starting.
>>
>> On a high level this is a bit problematic as it appears to escapes the
>> current limits and allows anyone creating a user namespace to have their
>> own fresh set of limits. Given that anyone should be able to create a
>> user namespace whenever they feel like escaping limits is a problem.
>> That however is solvable.
>>
>> A practical question. What kind of limits are we looking at here?
>>
>> Are these loose limits for detecting buggy programs that have gone
>> off their rails?
>>
>> Are these tight limits to ensure multitasking is possible?
>
> The original motivation for these limits is to limit resource usage. There
> is in-kernel data structure that is associated with each notification mark
> you create and we don't want users to be able to DoS the system by creating
> too many of them. Thus we limit number of notification marks for each user.
> There is also a limit on the number of notification instances - those are
> naturally limited by the number of open file descriptors but admin may want
> to limit them more...
>
> So cgroups would be probably the best fit for this but I'm not sure whether
> it is not an overkill...
There is some level of kernel memory accounting in the memory cgroup.
That said my experience with cgroups is that while they are good for
some things the semantics that derive from the userspace API are
problematic.
In the cgroup model, objects in the kernel don't belong to a cgroup; they
belong to a task/process. Those processes belong to a cgroup.
Processes under control of a sufficiently privileged parent are allowed
to switch cgroups. This causes implementation challenges and a semantic
mismatch in a world where things are typically considered to have an
owner.
Right now fs_notify groups (upon which all of the rest of the inotify
accounting is built) belong to a user. So there is a semantic
mismatch with cgroups right out of the gate.
Given that cgroups have not chosen to account for individual kernel
objects or give that level of control, I think it is reasonable to look
to other possible solutions, assuming the overhead can be kept under
control.
The implementation of a hierarchical counter in mm/page_counter.c
strongly suggests to me that the overhead can be kept under control.
And yes, I am thinking of the problem space where you have a limit
based on the problem domain, where if an application consumes more than
the limit, the application is likely bonkers. That does prevent a DoS
situation in kernel memory, but it is different from the problem I have
seen cgroups solve.
The problem I have seen cgroups solve looks like this: I have 8GB of
RAM and 3 containers. Container A can have 4GB, container B can
have 1GB, and container C can have 3GB. Then I know one container won't
push the other containers into swap.
Perhaps that is a top-down vs. a bottom-up approach to coming up with
limits. DoS-prevention limits like the inotify ones are generally
written from the perspective of "if you have more than X you are
crazy", while cgroup limits tend to be thought about top down, from a
total system management point of view.
So I think there is definitely something to look at.
All of that said, there is definitely a practical question that needs to
be asked. Nikolay, how did you get into this situation? A typical user
namespace configuration will set up uid and gid maps with the help of a
privileged program and not map the uid of the user who created the user
namespace, thus avoiding exhausting the limits of the user who created
the container.
Which makes me personally more worried about escaping the existing
limits than exhausting the limits of a particular user.
Eric
* Re: [RFC PATCH 0/4] Make inotify instance/watches be accounted per userns
[not found] ` <87bn3jy1cd.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2016-06-03 11:14 ` Nikolay Borisov
[not found] ` <5751667D.7010207-6AxghH7DbtA@public.gmane.org>
0 siblings, 1 reply; 9+ messages in thread
From: Nikolay Borisov @ 2016-06-03 11:14 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Jan Kara, avagin-GEFAQzZX7r8dnm+yROfE0A,
netdev-u79uwXL29TY76Z2rM5mHXA, Linux Containers,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
eparis-H+wXaHxf7aLQT0dZR+AlfA, operations-/eCPMmvKun9pLGFMi4vTTA,
gorcunov-GEFAQzZX7r8dnm+yROfE0A,
john-jueV0HHMeujJJrXXpGQQMAC/G2K4zDHf
On 06/02/2016 07:58 PM, Eric W. Biederman wrote:
>
> Nikolay please see my question for you at the end.
>
> Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> writes:
>
>> On Wed 01-06-16 11:00:06, Eric W. Biederman wrote:
>>> Cc'd the containers list.
>>>
>>> Nikolay Borisov <kernel-6AxghH7DbtA@public.gmane.org> writes:
>>>
>>>> Currently the inotify instances/watches are being accounted in the
>>>> user_struct structure. This means that in setups where multiple
>>>> users in unprivileged containers map to the same underlying
>>>> real user (e.g. user_struct) the inotify limits are going to be
>>>> shared as well which can lead to unplesantries. This is a problem
>>>> since any user inside any of the containers can potentially exhaust
>>>> the instance/watches limit which in turn might prevent certain
>>>> services from other containers from starting.
>>>
>>> On a high level this is a bit problematic as it appears to escapes the
>>> current limits and allows anyone creating a user namespace to have their
>>> own fresh set of limits. Given that anyone should be able to create a
>>> user namespace whenever they feel like escaping limits is a problem.
>>> That however is solvable.
>>>
>>> A practical question. What kind of limits are we looking at here?
>>>
>>> Are these loose limits for detecting buggy programs that have gone
>>> off their rails?
>>>
>>> Are these tight limits to ensure multitasking is possible?
>>
>> The original motivation for these limits is to limit resource usage. There
>> is in-kernel data structure that is associated with each notification mark
>> you create and we don't want users to be able to DoS the system by creating
>> too many of them. Thus we limit number of notification marks for each user.
>> There is also a limit on the number of notification instances - those are
>> naturally limited by the number of open file descriptors but admin may want
>> to limit them more...
>>
>> So cgroups would be probably the best fit for this but I'm not sure whether
>> it is not an overkill...
>
> There is some level of kernel memory accounting in the memory cgroup.
>
> That said my experience with cgroups is that while they are good for
> some things the semantics that derive from the userspace API are
> problematic.
>
> In the cgroup model objects in the kernel don't belong to a cgroup they
> belong to a task/process. Those processes belong to a cgroup.
> Processes under control of a sufficiently privileged parent are allowed
> to switch cgroups. This causes implementation challenges and sematic
> mismatch in a world where things are typically considered to have an
> owner.
>
> Right now fs_notify groups (upon which all of the rest of the inotify
> accounting is built upon) belong to a user. So there is a semantic
> mismatch with cgroups right out of the gate.
>
> Given that cgroups have not choosen to account for individual kernel
> objects or give that level of control, I think it reasonable to look to
> other possible solutions. Assuming the overhead can be kept under
> control.
>
> The implementation of a hierarchical counter in mm/page_counter.c
> strongly suggests to me that the overhead can be kept under control.
>
> And yes. I am thinking of the problem space where you have a limit
> based on the problem domain where if an application consumes more than
> the limit, the application is likely bonkers. Which does prevent a DOS
> situation in kernel memory. But is different from the problem I have
> seen cgroups solve.
>
> The problem I have seen cgroups solve looks like. Hmm. I have 8GB of
> ram. I have 3 containers. Container A can have 4GB, Container B can
> have 1GB and container C can have 3GB. Then I know one container won't
> push the other containers into swap.
>
> Perhaps that would tend to be a top down/vs a bottom up approach to
> coming up with limits. As DOS preventions limits like the inotify ones
> are generally written from the perspective of if you have more than X
> you are crazy. While cgroup limits tend to be thought about top down
> from a total system management point of view.
>
> So I think there is definitely something to look at.
>
>
> All of that said there is definitely a practical question that needs to
> be asked. Nikolay how did you get into this situation? A typical user
> namespace configuration will set up uid and gid maps with the help of a
> privileged program and not map the uid of the user who created the user
> namespace. Thus avoiding exhausting the limits of the user who created
> the container.
Right, but imagine having multiple containers with identical uid/gid
maps. For LXC-based setups, imagine this:
lxc.id_map = u 0 1337 65536
Now all processes which are running as the same user in different
containers will actually share the underlying user_struct, and thus the
inotify limits. In such cases even running multiple instances of 'tail'
in one container will eventually use up all allowed inotify instances/marks.
For this to happen you needn't have complete overlap of the uid maps
either; it's enough for at least one UID to overlap between 2 containers.
So the risk of exhaustion doesn't apply to the privileged user that
created the container and the uid mapping, but rather to the users under
which the various processes in the container are running. Does that make
it clear?
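To spell out the mapping arithmetic: an entry of the form
"u <container-start> <host-start> <count>" shifts container uids by a
fixed offset onto host uids. The sketch below uses hypothetical helper
names (IdMap, parse_lxc_id_map, to_host_id) and shows that two containers
with the same map, or with merely overlapping host ranges, end up with
processes running as the same host uid and therefore sharing one set of
per-user limits.

```python
from typing import NamedTuple, Optional


class IdMap(NamedTuple):
    kind: str        # "u" for uids, "g" for gids
    ns_start: int    # first ID as seen inside the container
    host_start: int  # first ID as seen on the host
    count: int       # number of consecutive IDs mapped


def parse_lxc_id_map(value: str) -> IdMap:
    """Parse the right-hand side of an lxc.id_map line, e.g. 'u 0 1337 65536'."""
    kind, ns_start, host_start, count = value.split()
    return IdMap(kind, int(ns_start), int(host_start), int(count))


def to_host_id(id_map: IdMap, ns_id: int) -> Optional[int]:
    """Translate a container ID to a host ID, or None if it is unmapped."""
    offset = ns_id - id_map.ns_start
    if 0 <= offset < id_map.count:
        return id_map.host_start + offset
    return None


if __name__ == "__main__":
    container_a = parse_lxc_id_map("u 0 1337 65536")
    container_b = parse_lxc_id_map("u 0 1337 65536")   # identical map
    container_c = parse_lxc_id_map("u 0 2000 65536")   # partial overlap only

    # uid 1000 in A and in B is the same host uid, hence the same user_struct.
    print(to_host_id(container_a, 1000), to_host_id(container_b, 1000))
    # Partial overlap is enough: uid 663 in A and uid 0 in C collide on the host.
    print(to_host_id(container_a, 663), to_host_id(container_c, 0))
```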
>
> Which makes me personally more worried about escaping the existing
> limits than exhausting the limits of a particular user.
So I thought a bit about it and I guess a solution can be concocted which
utilizes the hierarchical nature of page counter, where the inotify limits
are set per namespace if you have capable(CAP_SYS_ADMIN). That way the
admin can set one fairly large limit on the init_user_ns and then in every
namespace created one can set smaller limits. That way, for a branch in
the tree (in the nomenclature you used in your previous reply to me), you
will really be upper-bounded by the limit set in the namespace which has
->level = 1. For the width of the tree, you will be bound by the
"global" init_user_ns limits. How does that sound?
>
> Eric
>
* Re: [RFC PATCH 0/4] Make inotify instance/watches be accounted per userns
[not found] ` <5751667D.7010207-6AxghH7DbtA@public.gmane.org>
@ 2016-06-03 20:41 ` Eric W. Biederman
[not found] ` <87inxqovho.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
0 siblings, 1 reply; 9+ messages in thread
From: Eric W. Biederman @ 2016-06-03 20:41 UTC (permalink / raw)
To: Nikolay Borisov
Cc: Jan Kara, avagin-GEFAQzZX7r8dnm+yROfE0A,
netdev-u79uwXL29TY76Z2rM5mHXA, Linux Containers,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
eparis-H+wXaHxf7aLQT0dZR+AlfA, operations-/eCPMmvKun9pLGFMi4vTTA,
gorcunov-GEFAQzZX7r8dnm+yROfE0A,
john-jueV0HHMeujJJrXXpGQQMAC/G2K4zDHf
Nikolay Borisov <kernel-6AxghH7DbtA@public.gmane.org> writes:
> On 06/02/2016 07:58 PM, Eric W. Biederman wrote:
>>
>> Nikolay please see my question for you at the end.
[snip]
>> All of that said there is definitely a practical question that needs to
>> be asked. Nikolay how did you get into this situation? A typical user
>> namespace configuration will set up uid and gid maps with the help of a
>> privileged program and not map the uid of the user who created the user
>> namespace. Thus avoiding exhausting the limits of the user who created
>> the container.
>
> Right but imagine having multiple containers with identical uid/gid maps
> for LXC-based setups imagine this:
>
> lxc.id_map = u 0 1337 65536
So I am only moderately concerned when the containers have overlapping
ids. Because at some level overlapping ids means they are the same
user. This is certainly true for file permissions and for other
permissions. To isolate one container from another it fundamentally
needs to have separate uids and gids on the host system.
> Now all processes which are running with the same user on different
> containers will actually share the underlying user_struct thus the
> inotify limits. In such cases even running multiple instances of 'tail'
> in one container will eventually use all allowed inotify/mark instances.
> For this to happen you needn't also have complete overlap of the uid
> map, it's enough to have at least one UID between 2 containers overlap.
>
>
> So the risk of exhaustion doesn't apply to the privileged user that
> created the container and the uid mapping, but rather the users under
> which the various processes in the container are running. Does that make
> it clear?
Yes. That is clear.
>> Which makes me personally more worried about escaping the existing
>> limits than exhausting the limits of a particular user.
>
> So I thought bit about it and I guess a solution can be concocted which
> utilize the hierarchical nature of page counter, and the inotify limits
> are set per namespace if you have capable(CAP_SYS_ADMIN). That way the
> admin can set one fairly large on the init_user_ns and then in every
> namespace created one can set smaller limits. That way for a branch in
> the tree (in the nomenclature you used in your previous reply to me) you
> will really be upper-bound to the limit set in the namespace which have
> ->level = 1. For the width of the tree, you will be bound by the
> "global" init_user_ns limits. How does that sound?
As an addendum to that design: I think there should be an additional
sysctl or two that specifies how much the limit decreases when creating
a new user namespace and when creating a new user in that user
namespace. That way, with a good selection of limits and a limit
decrease, people can use the kernel defaults without needing to change
them.
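As a rough illustration of such a decrease, assume (this is only one
possible reading) that the knob is a percentage applied each time a new
user namespace is created. The function names and the numbers below are
hypothetical; no such sysctl is defined anywhere in this thread.

```python
def child_limit(parent_limit, decrease_pct):
    """Limit a new namespace would inherit, given a percentage decrease.

    Assumes a hypothetical percentage-style knob; integer arithmetic only.
    """
    if not 0 <= decrease_pct <= 100:
        raise ValueError("decrease_pct must be between 0 and 100")
    return parent_limit * (100 - decrease_pct) // 100


def limits_by_level(root_limit, decrease_pct, max_level):
    """Default limits for nesting levels 0..max_level, root first."""
    limits = [root_limit]
    for _ in range(max_level):
        limits.append(child_limit(limits[-1], decrease_pct))
    return limits


if __name__ == "__main__":
    # Illustrative values: root limit 8192, halved at each nesting level.
    print(limits_by_level(root_limit=8192, decrease_pct=50, max_level=3))
```

With defaults derived this way, nobody has to tune per-namespace values
by hand unless the derived defaults turn out to be too small.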
Having default settings that are good enough 99% of the time and that
people don't need to tune, would be my biggest requirement (aside from
being light-weight) for merging something like this.
If things are set-and-forget and even the container case does not need to
be aware, then I think we have a design sufficiently robust and different
from what cgroups is doing to make it worthwhile to have a userns based
solution.
I can see a lot of different limits implemented this way.
Eric
* Re: [RFC PATCH 0/4] Make inotify instance/watches be accounted per userns
[not found] ` <87inxqovho.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
@ 2016-06-06 6:41 ` Nikolay Borisov
[not found] ` <57551B10.6080505-6AxghH7DbtA@public.gmane.org>
0 siblings, 1 reply; 9+ messages in thread
From: Nikolay Borisov @ 2016-06-06 6:41 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Jan Kara, avagin-GEFAQzZX7r8dnm+yROfE0A,
netdev-u79uwXL29TY76Z2rM5mHXA, Linux Containers,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
eparis-H+wXaHxf7aLQT0dZR+AlfA, operations-/eCPMmvKun9pLGFMi4vTTA,
gorcunov-GEFAQzZX7r8dnm+yROfE0A,
john-jueV0HHMeujJJrXXpGQQMAC/G2K4zDHf
On 06/03/2016 11:41 PM, Eric W. Biederman wrote:
> Nikolay Borisov <kernel-6AxghH7DbtA@public.gmane.org> writes:
>
>> On 06/02/2016 07:58 PM, Eric W. Biederman wrote:
>>>
>>> Nikolay please see my question for you at the end.
> [snip]
>>> All of that said there is definitely a practical question that needs to
>>> be asked. Nikolay how did you get into this situation? A typical user
>>> namespace configuration will set up uid and gid maps with the help of a
>>> privileged program and not map the uid of the user who created the user
>>> namespace. Thus avoiding exhausting the limits of the user who created
>>> the container.
>>
>> Right but imagine having multiple containers with identical uid/gid maps
>> for LXC-based setups imagine this:
>>
>> lxc.id_map = u 0 1337 65536
>
> So I am only moderately concerned when the containers have overlapping
> ids. Because at some level overlapping ids means they are the same
> user. This is certainly true for file permissions and for other
> permissions. To isolate one container from another it fundamentally
> needs to have separate uids and gids on the host system.
>
>> Now all processes which are running with the same user on different
>> containers will actually share the underlying user_struct thus the
>> inotify limits. In such cases even running multiple instances of 'tail'
>> in one container will eventually use all allowed inotify/mark instances.
>> For this to happen you needn't also have complete overlap of the uid
>> map, it's enough to have at least one UID between 2 containers overlap.
>>
>>
>> So the risk of exhaustion doesn't apply to the privileged user that
>> created the container and the uid mapping, but rather the users under
>> which the various processes in the container are running. Does that make
>> it clear?
>
> Yes. That is clear.
>
>>> Which makes me personally more worried about escaping the existing
>>> limits than exhausting the limits of a particular user.
>>
>> So I thought bit about it and I guess a solution can be concocted which
>> utilize the hierarchical nature of page counter, and the inotify limits
>> are set per namespace if you have capable(CAP_SYS_ADMIN). That way the
>> admin can set one fairly large on the init_user_ns and then in every
>> namespace created one can set smaller limits. That way for a branch in
>> the tree (in the nomenclature you used in your previous reply to me) you
>> will really be upper-bound to the limit set in the namespace which have
>> ->level = 1. For the width of the tree, you will be bound by the
>> "global" init_user_ns limits. How does that sound?
>
> As a addendum to that design. I think there should be an additional
> sysctl or two that specifies how much the limit decreases when creating
> a new user namespace and when creating a new user in that user
> namespace. That way with a good selection of limits and a limit
> decrease people can use the kernel defaults without needing to change
> them.
I agree that a sysctl which controls how the limits are set for new
namespaces is a good idea. I think it's best if this is in % rather than
some absolute value. Also I'm not sure about the sysctl when a user is
added in a namespace since just adding a new user should fall under the
limits of the current userns.
Also should those sysctls be global or should they be per-namespace? At
this point I'm more inclined to have global sysctl and maybe refine it
in the future if the need arises?
>
> Having default settings that are good enough 99% of the time and that
> people don't need to tune, would be my biggest requirement (aside from
> being light-weight) for merging something like this.
>
> If things are set and forget and even the continer case does not need to
> be aware then I think we have a design sufficiently robust and different
> from what cgroups is doing to make it worth while to have a userns based
> solution.
Provided that we agree on the overall design (so far it seems we just
need to iron out the details of the sysctl), I'll be happy to implement
this.
>
> I can see a lot of different limits implemented this way.
>
> Eric
* Re: [RFC PATCH 0/4] Make inotify instance/watches be accounted per userns
[not found] ` <57551B10.6080505-6AxghH7DbtA@public.gmane.org>
@ 2016-06-06 20:00 ` Eric W. Biederman
0 siblings, 0 replies; 9+ messages in thread
From: Eric W. Biederman @ 2016-06-06 20:00 UTC (permalink / raw)
To: Nikolay Borisov
Cc: Jan Kara, avagin-GEFAQzZX7r8dnm+yROfE0A,
netdev-u79uwXL29TY76Z2rM5mHXA, Linux Containers,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
eparis-H+wXaHxf7aLQT0dZR+AlfA, operations-/eCPMmvKun9pLGFMi4vTTA,
gorcunov-GEFAQzZX7r8dnm+yROfE0A,
john-jueV0HHMeujJJrXXpGQQMAC/G2K4zDHf
Nikolay Borisov <kernel-6AxghH7DbtA@public.gmane.org> writes:
> On 06/03/2016 11:41 PM, Eric W. Biederman wrote:
>> Nikolay Borisov <kernel-6AxghH7DbtA@public.gmane.org> writes:
>>
>>> On 06/02/2016 07:58 PM, Eric W. Biederman wrote:
>>>>
>>>> Nikolay please see my question for you at the end.
>> [snip]
>>>> All of that said there is definitely a practical question that needs to
>>>> be asked. Nikolay how did you get into this situation? A typical user
>>>> namespace configuration will set up uid and gid maps with the help of a
>>>> privileged program and not map the uid of the user who created the user
>>>> namespace. Thus avoiding exhausting the limits of the user who created
>>>> the container.
>>>
>>> Right but imagine having multiple containers with identical uid/gid maps
>>> for LXC-based setups imagine this:
>>>
>>> lxc.id_map = u 0 1337 65536
>>
>> So I am only moderately concerned when the containers have overlapping
>> ids. Because at some level overlapping ids means they are the same
>> user. This is certainly true for file permissions and for other
>> permissions. To isolate one container from another it fundamentally
>> needs to have separate uids and gids on the host system.
>>
>>> Now all processes which are running with the same user on different
>>> containers will actually share the underlying user_struct thus the
>>> inotify limits. In such cases even running multiple instances of 'tail'
>>> in one container will eventually use all allowed inotify/mark instances.
>>> For this to happen you needn't also have complete overlap of the uid
>>> map, it's enough to have at least one UID between 2 containers overlap.
>>>
>>>
>>> So the risk of exhaustion doesn't apply to the privileged user that
>>> created the container and the uid mapping, but rather the users under
>>> which the various processes in the container are running. Does that make
>>> it clear?
>>
>> Yes. That is clear.
>>
>>>> Which makes me personally more worried about escaping the existing
>>>> limits than exhausting the limits of a particular user.
>>>
>>> So I thought bit about it and I guess a solution can be concocted which
>>> utilize the hierarchical nature of page counter, and the inotify limits
>>> are set per namespace if you have capable(CAP_SYS_ADMIN). That way the
>>> admin can set one fairly large on the init_user_ns and then in every
>>> namespace created one can set smaller limits. That way for a branch in
>>> the tree (in the nomenclature you used in your previous reply to me) you
>>> will really be upper-bound to the limit set in the namespace which have
>>> ->level = 1. For the width of the tree, you will be bound by the
>>> "global" init_user_ns limits. How does that sound?
>>
>> As a addendum to that design. I think there should be an additional
>> sysctl or two that specifies how much the limit decreases when creating
>> a new user namespace and when creating a new user in that user
>> namespace. That way with a good selection of limits and a limit
>> decrease people can use the kernel defaults without needing to change
>> them.
>
> I agree that a sysctl which controls how the limits are set for new
> namespaces is a good idea. I think it's best if this is in % rather than
> some absolute value. Also I'm not sure about the sysctl when a user is
> added in a namespace since just adding a new user should fall under the
> limits of the current userns.
My hunch is that a reserve per namespace as an absolute number will be
easier to implement and analyze but I don't much care.
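To make the two options concrete, here is a minimal sketch of how a child
namespace's limit could be derived from its parent's under either policy.
The function names and the idea of computing the limit at namespace
creation time are assumptions for illustration only; nothing here is
existing kernel code, and pct is assumed to be at most 100.

```c
/*
 * Hypothetical helpers, not kernel API: derive the limit of a newly
 * created user namespace from the limit of its parent.
 */

/* Percentage policy: the child gets pct% of the parent's limit.
 * The split into quotient and remainder avoids overflowing on
 * parent * pct for large limits. Assumes pct <= 100. */
static unsigned long child_limit_percent(unsigned long parent,
                                         unsigned int pct)
{
    return parent / 100 * pct + parent % 100 * pct / 100;
}

/* Absolute policy: the child gets the parent's limit minus a fixed
 * reserve, and never less than zero. */
static unsigned long child_limit_reserve(unsigned long parent,
                                         unsigned long reserve)
{
    return parent > reserve ? parent - reserve : 0;
}
```

With a percentage the limit shrinks geometrically with nesting depth and
never quite reaches zero until integer rounding takes over; with an
absolute reserve it shrinks linearly and reaches zero after a bounded
number of levels, which is arguably the easier behaviour to analyze.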
I meant that we have a tree where we track created inotify things
that looks like:
                  uns0:
      +-------------//\\----------+
     /      /------/    \----\     \
  user1   user2          user3   user4
      +-------//\\--------+
     /    /--/    \---\    \
  uns1   uns2        uns3   uns4
      +-------//\\---------+
     /   /---/    \---\     \
  user5  user6       user7  user8
Allowing a hierarchical tracking of things per user and per user
namespace.
The limits programmed with the sysctl would look something like they do
today.
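The hierarchical tracking described above can be sketched roughly as
follows. This is only an illustration under stated assumptions: the
struct and function names are hypothetical, each node stands for either
a user namespace or a user within one, and a real implementation would
need atomic counters and proper lifetime handling, which are omitted
here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical node in the uns -> user -> uns -> user tree. */
struct limit_node {
    struct limit_node *parent;  /* NULL at the root (uns0) */
    long count;                 /* objects charged at or below this node */
    long limit;                 /* maximum allowed at or below this node */
};

/*
 * Charge one object to @node and to every ancestor up to the root.
 * If any level is already at its limit, undo the partial charge and
 * fail. Not thread-safe; this sketch ignores concurrency entirely.
 */
static bool limit_try_charge(struct limit_node *node)
{
    struct limit_node *pos, *undo;

    for (pos = node; pos; pos = pos->parent) {
        if (pos->count >= pos->limit)
            goto rollback;
        pos->count++;
    }
    return true;

rollback:
    /* Only the levels below the failing one were incremented. */
    for (undo = node; undo != pos; undo = undo->parent)
        undo->count--;
    return false;
}

/* Release one previously charged object at every level. */
static void limit_uncharge(struct limit_node *node)
{
    for (; node; node = node->parent)
        node->count--;
}
```

In this shape a single branch is bounded by the tightest limit on its
path to the root, while the tree as a whole is bounded by the limit on
the root node.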
> Also should those sysctls be global or should they be per-namespace? At
> this point I'm more inclined to have global sysctl and maybe refine it
> in the future if the need arises?
I think at the end of the day per-namespace is interesting. We
certainly need to track the values as if they were per namespace.
However, given that this should be a set-and-forget kind of operation,
we don't need to worry about how to implement the sysctl settings as
per-namespace until everything else is sorted.
>> Having default settings that are good enough 99% of the time and that
>> people don't need to tune, would be my biggest requirement (aside from
>> being light-weight) for merging something like this.
>>
>> If things are set and forget and even the continer case does not need to
>> be aware then I think we have a design sufficiently robust and different
>> from what cgroups is doing to make it worth while to have a userns based
>> solution.
>
> Provided that we agree on the overall design, so far it seems we just
> need to iron out the details with the sysctl I'll be happy to implement
> this.
Thanks. There are some other limits that need to be implemented in this
style that are more important to me: maximum number of user namespaces,
max number of pid namespaces, max number of mount namespaces, etc.
Those limits I will gladly implement, as I can finally see how to make
all of this just work. Which is to say the per-userns, per-user data
structures that hold the counts will be worth creating generically.
No need to generalize the code prematurely. I think it makes sense to sort
out the logic on whichever we implement first, and then the rest of the
interesting limits can just follow the pattern that gets laid down.
Eric
end of thread, other threads:[~2016-06-06 20:00 UTC | newest]
Thread overview: 9+ messages
[not found] <1464767580-22732-1-git-send-email-kernel@kyup.com>
[not found] ` <1464767580-22732-1-git-send-email-kernel-6AxghH7DbtA@public.gmane.org>
2016-06-01 16:00 ` [RFC PATCH 0/4] Make inotify instance/watches be accounted per userns Eric W. Biederman
[not found] ` <8737ow7vcp.fsf@x220.int.ebiederm.org>
[not found] ` <8737ow7vcp.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2016-06-02 6:27 ` Nikolay Borisov
[not found] ` <574FD1E4.8090109-6AxghH7DbtA@public.gmane.org>
2016-06-02 16:19 ` Eric W. Biederman
2016-06-02 7:49 ` Jan Kara
[not found] ` <20160602074920.GG19636-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org>
2016-06-02 16:58 ` Eric W. Biederman
[not found] ` <87bn3jy1cd.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2016-06-03 11:14 ` Nikolay Borisov
[not found] ` <5751667D.7010207-6AxghH7DbtA@public.gmane.org>
2016-06-03 20:41 ` Eric W. Biederman
[not found] ` <87inxqovho.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2016-06-06 6:41 ` Nikolay Borisov
[not found] ` <57551B10.6080505-6AxghH7DbtA@public.gmane.org>
2016-06-06 20:00 ` Eric W. Biederman