* Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns
From: Andy Lutomirski @ 2014-11-04 0:17 UTC (permalink / raw)
To: Aditya Kali
Cc: Linux API, Linux Containers, Serge Hallyn,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Eric W. Biederman, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA,
Ingo Molnar
In-Reply-To: <CAGr1F2EV4p_nJP_oMe3N8pBPedAZHbdB=XCMPjSEZTC9jmZoAg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
On Mon, Nov 3, 2014 at 4:12 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
> On Mon, Nov 3, 2014 at 3:48 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>> On Mon, Nov 3, 2014 at 3:23 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>> On Mon, Nov 3, 2014 at 3:15 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>>> On Mon, Nov 3, 2014 at 3:12 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>>>> On Fri, Oct 31, 2014 at 5:07 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>>>>> On Fri, Oct 31, 2014 at 12:19 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>>>>>> if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
>>>>>>> pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
>>>>>>> - if (nr_opts != 1) {
>>>>>>> + if (nr_opts > 1) {
>>>>>>> pr_err("sane_behavior: no other mount options allowed\n");
>>>>>>> return -EINVAL;
>>>>>>
>>>>>> This looks wrong. But, if you make the change above, then it'll be right.
>>>>>>
>>>>>
>>>>> It would have been nice if simple 'mount -t cgroup cgroup <mnt>' from
>>>>> cgroupns does the right thing automatically.
>>>>>
>>>>
>>>> This is a debatable point, but it's not what I meant. Won't your code
>>>> let 'mount -t cgroup -o one_evil_flag cgroup mountpoint' through?
>>>>
>>>
>>> I don't think so. This check "if (nr_opts > 1)" is nested under "if
>>> (opts->flags & CGRP_ROOT_SANE_BEHAVIOR)". So we know that there is
>>> atleast 1 option ('__DEVEL__sane_behavior') present (implicit or not).
>>> Addition of 'one_evil_flag' will make nr_opts = 2 and result in EINVAL
>>> here.
>>
>> But the implicit __DEVEL__sane_behavior doesn't increment nr_opts, right?
>>
>
> Yes. Hence this change makes sure that we don't return EINVAL when
> nr_opts == 0 or nr_opts == 1 :)
> That way, both of the following are equivalent when inside non-init cgroupns:
>
> (1) $ mount -t cgroup -o __DEVEL__sane_behavior cgroup mountpoint
> (2) $ mount -t cgroup cgroup mountpoint
>
> Any other mount option will trigger the error here.
I still don't get it. Can you walk me through why mount -o
some_other_option -t cgroup cgroup mountpoint causes -EINVAL?
--Andy
^ permalink raw reply
* Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns
From: Aditya Kali @ 2014-11-04 0:12 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Tejun Heo, Li Zefan, Serge Hallyn, Eric W. Biederman,
cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API,
Ingo Molnar, Linux Containers, Rohit Jnagal
In-Reply-To: <CALCETrUB_xx5zno26k5UjAFt77nZTpgyndD4AuBSZxiZBNjXSw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
On Mon, Nov 3, 2014 at 3:48 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
> On Mon, Nov 3, 2014 at 3:23 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>> On Mon, Nov 3, 2014 at 3:15 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>> On Mon, Nov 3, 2014 at 3:12 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>>> On Fri, Oct 31, 2014 at 5:07 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>>>> On Fri, Oct 31, 2014 at 12:19 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>>>>> if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
>>>>>> pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
>>>>>> - if (nr_opts != 1) {
>>>>>> + if (nr_opts > 1) {
>>>>>> pr_err("sane_behavior: no other mount options allowed\n");
>>>>>> return -EINVAL;
>>>>>
>>>>> This looks wrong. But, if you make the change above, then it'll be right.
>>>>>
>>>>
>>>> It would have been nice if simple 'mount -t cgroup cgroup <mnt>' from
>>>> cgroupns does the right thing automatically.
>>>>
>>>
>>> This is a debatable point, but it's not what I meant. Won't your code
>>> let 'mount -t cgroup -o one_evil_flag cgroup mountpoint' through?
>>>
>>
>> I don't think so. This check "if (nr_opts > 1)" is nested under "if
>> (opts->flags & CGRP_ROOT_SANE_BEHAVIOR)". So we know that there is
>> atleast 1 option ('__DEVEL__sane_behavior') present (implicit or not).
>> Addition of 'one_evil_flag' will make nr_opts = 2 and result in EINVAL
>> here.
>
> But the implicit __DEVEL__sane_behavior doesn't increment nr_opts, right?
>
Yes. Hence this change makes sure that we don't return EINVAL when
nr_opts == 0 or nr_opts == 1 :)
That way, both of the following are equivalent when inside non-init cgroupns:
(1) $ mount -t cgroup -o __DEVEL__sane_behavior cgroup mountpoint
(2) $ mount -t cgroup cgroup mountpoint
Any other mount option will trigger the error here.
> --Andy
--
Aditya
^ permalink raw reply
* Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns
From: Andy Lutomirski @ 2014-11-03 23:48 UTC (permalink / raw)
To: Aditya Kali
Cc: Linux API, Linux Containers, Serge Hallyn,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Eric W. Biederman, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA,
Ingo Molnar
In-Reply-To: <CAGr1F2GX45gC-V7kEzVjp-EiYfdPDVBRs+99nASpgFVAdYX+1w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
On Mon, Nov 3, 2014 at 3:23 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
> On Mon, Nov 3, 2014 at 3:15 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>> On Mon, Nov 3, 2014 at 3:12 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>> On Fri, Oct 31, 2014 at 5:07 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>>> On Fri, Oct 31, 2014 at 12:19 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>>>> if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
>>>>> pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
>>>>> - if (nr_opts != 1) {
>>>>> + if (nr_opts > 1) {
>>>>> pr_err("sane_behavior: no other mount options allowed\n");
>>>>> return -EINVAL;
>>>>
>>>> This looks wrong. But, if you make the change above, then it'll be right.
>>>>
>>>
>>> It would have been nice if simple 'mount -t cgroup cgroup <mnt>' from
>>> cgroupns does the right thing automatically.
>>>
>>
>> This is a debatable point, but it's not what I meant. Won't your code
>> let 'mount -t cgroup -o one_evil_flag cgroup mountpoint' through?
>>
>
> I don't think so. This check "if (nr_opts > 1)" is nested under "if
> (opts->flags & CGRP_ROOT_SANE_BEHAVIOR)". So we know that there is
> atleast 1 option ('__DEVEL__sane_behavior') present (implicit or not).
> Addition of 'one_evil_flag' will make nr_opts = 2 and result in EINVAL
> here.
But the implicit __DEVEL__sane_behavior doesn't increment nr_opts, right?
--Andy
^ permalink raw reply
* Re: [PATCHv2 5/7] cgroup: introduce cgroup namespaces
From: Aditya Kali @ 2014-11-03 23:42 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Linux API, Linux Containers, Serge Hallyn,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Andy Lutomirski, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA,
Ingo Molnar
In-Reply-To: <87y4rvspnd.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
On Fri, Oct 31, 2014 at 5:58 PM, Eric W. Biederman
<ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
> Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes:
>
>> On Fri, Oct 31, 2014 at 12:18 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>
> <snip>
>
>>> +static void *cgroupns_get(struct task_struct *task)
>>> +{
>>> + struct cgroup_namespace *ns = NULL;
>>> + struct nsproxy *nsproxy;
>>> +
>>> + rcu_read_lock();
>>> + nsproxy = task->nsproxy;
>>> + if (nsproxy) {
>>> + ns = nsproxy->cgroup_ns;
>>> + get_cgroup_ns(ns);
>>> + }
>>> + rcu_read_unlock();
>>
>> How is this correct? Other namespaces do it too, so it Must Be
>> Correct (tm), but I don't understand. What is RCU protecting?
>
> The code is not correct. The code needs to use task_lock.
>
> RCU used to protect nsproxy, and now task_lock protects nsproxy.
> For the reasons of of all of this I refer you to the commit
> that changed this, and the comment in nsproxy.h
>
My bad. This should be under task_lock. I will fix it.
> commit 728dba3a39c66b3d8ac889ddbe38b5b1c264aec3
> Author: Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
> Date: Mon Feb 3 19:13:49 2014 -0800
>
> namespaces: Use task_lock and not rcu to protect nsproxy
>
> The synchronous syncrhonize_rcu in switch_task_namespaces makes setns
> a sufficiently expensive system call that people have complained.
>
> Upon inspect nsproxy no longer needs rcu protection for remote reads.
> remote reads are rare. So optimize for same process reads and write
> by switching using rask_lock instead.
>
> This yields a simpler to understand lock, and a faster setns system call.
>
> In particular this fixes a performance regression observed
> by Rafael David Tinoco <rafael.tinoco-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>.
>
> This is effectively a revert of Pavel Emelyanov's commit
> cf7b708c8d1d7a27736771bcf4c457b332b0f818 Make access to task's nsproxy lighter
> from 2007. The race this originialy fixed no longer exists as
> do_notify_parent uses task_active_pid_ns(parent) instead of
> parent->nsproxy.
>
> Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
>
> Eric
--
Aditya
^ permalink raw reply
* Re: [PATCHv2 5/7] cgroup: introduce cgroup namespaces
From: Aditya Kali @ 2014-11-03 23:40 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Linux API, Linux Containers, Serge Hallyn,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Eric W. Biederman, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA,
Ingo Molnar
In-Reply-To: <CALCETrWzYPngmWPMWnSFyiTPDwNJYPpXUj1C-294uQgjvp9wcA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
On Fri, Oct 31, 2014 at 5:02 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
> On Fri, Oct 31, 2014 at 12:18 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>> Introduce the ability to create new cgroup namespace. The newly created
>> cgroup namespace remembers the cgroup of the process at the point
>> of creation of the cgroup namespace (referred as cgroupns-root).
>> The main purpose of cgroup namespace is to virtualize the contents
>> of /proc/self/cgroup file. Processes inside a cgroup namespace
>> are only able to see paths relative to their namespace root
>> (unless they are moved outside of their cgroupns-root, at which point
>> they will see a relative path from their cgroupns-root).
>> For a correctly setup container this enables container-tools
>> (like libcontainer, lxc, lmctfy, etc.) to create completely virtualized
>> containers without leaking system level cgroup hierarchy to the task.
>> This patch only implements the 'unshare' part of the cgroupns.
>>
>
>> + /* Prevent cgroup changes for this task. */
>> + threadgroup_lock(current);
>
> This could just be me being dense, but what is the lock for?
>
threadgroup_lock() is there to prevent the task from changing cgroups
while we are unsharing cgroupns.
But it seems that this might be unnecessary now because we have
removed the pinning restriction. Without pinning, we don't care if the
task cgroup changes underneath us. I will remove it from here as well
as from cgroupns_install().
>> +
>> + /* CGROUPNS only virtualizes the cgroup path on the unified hierarchy.
>> + */
>> + cgrp = get_task_cgroup(current);
>> +
>> + err = -ENOMEM;
>> + new_ns = alloc_cgroup_ns();
>> + if (!new_ns)
>> + goto err_out_unlock;
>> +
>> + err = proc_alloc_inum(&new_ns->proc_inum);
>> + if (err)
>> + goto err_out_unlock;
>> +
>> + new_ns->user_ns = get_user_ns(user_ns);
>> + new_ns->root_cgrp = cgrp;
>> +
>> + threadgroup_unlock(current);
>> +
>> + return new_ns;
>> +
>> +err_out_unlock:
>> + threadgroup_unlock(current);
>> +err_out:
>> + if (cgrp)
>> + cgroup_put(cgrp);
>> + kfree(new_ns);
>> + return ERR_PTR(err);
>> +}
>> +
>> +static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
>> +{
>> + pr_info("setns not supported for cgroup namespace");
>> + return -EINVAL;
>> +}
>> +
>> +static void *cgroupns_get(struct task_struct *task)
>> +{
>> + struct cgroup_namespace *ns = NULL;
>> + struct nsproxy *nsproxy;
>> +
>> + rcu_read_lock();
>> + nsproxy = task->nsproxy;
>> + if (nsproxy) {
>> + ns = nsproxy->cgroup_ns;
>> + get_cgroup_ns(ns);
>> + }
>> + rcu_read_unlock();
>
> How is this correct? Other namespaces do it too, so it Must Be
> Correct (tm), but I don't understand. What is RCU protecting?
>
> --Andy
--
Aditya
^ permalink raw reply
* Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns
From: Aditya Kali @ 2014-11-03 23:23 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Tejun Heo, Li Zefan, Serge Hallyn, Eric W. Biederman,
cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API,
Ingo Molnar, Linux Containers, Rohit Jnagal
In-Reply-To: <CALCETrW64-6xC6psP-8k0H-1GfVnWBTeEBNSrE_sH+-DFtuZQQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
On Mon, Nov 3, 2014 at 3:15 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
> On Mon, Nov 3, 2014 at 3:12 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>> On Fri, Oct 31, 2014 at 5:07 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>> On Fri, Oct 31, 2014 at 12:19 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>>> This patch enables cgroup mounting inside userns when a process
>>>> as appropriate privileges. The cgroup filesystem mounted is
>>>> rooted at the cgroupns-root. Thus, in a container-setup, only
>>>> the hierarchy under the cgroupns-root is exposed inside the container.
>>>> This allows container management tools to run inside the containers
>>>> without depending on any global state.
>>>> In order to support this, a new kernfs api is added to lookup the
>>>> dentry for the cgroupns-root.
>>>>
>>>> Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
>>>> ---
>>>> fs/kernfs/mount.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
>>>> include/linux/kernfs.h | 2 ++
>>>> kernel/cgroup.c | 47 +++++++++++++++++++++++++++++++++++++++++++++--
>>>> 3 files changed, 95 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
>>>> index f973ae9..e334f45 100644
>>>> --- a/fs/kernfs/mount.c
>>>> +++ b/fs/kernfs/mount.c
>>>> @@ -62,6 +62,54 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
>>>> return NULL;
>>>> }
>>>>
>>>> +/**
>>>> + * kernfs_make_root - create new root dentry for the given kernfs_node.
>>>> + * @sb: the kernfs super_block
>>>> + * @kn: kernfs_node for which a dentry is needed
>>>> + *
>>>> + * This can used used by callers which want to mount only a part of the kernfs
>>>> + * as root of the filesystem.
>>>> + */
>>>> +struct dentry *kernfs_obtain_root(struct super_block *sb,
>>>> + struct kernfs_node *kn)
>>>> +{
>>>
>>> I can't usefully review this, but kernfs_make_root and
>>> kernfs_obtain_root aren't the same string...
>>>
>>>> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
>>>> index 7e5d597..250aaec 100644
>>>> --- a/kernel/cgroup.c
>>>> +++ b/kernel/cgroup.c
>>>> @@ -1302,6 +1302,13 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
>>>>
>>>> memset(opts, 0, sizeof(*opts));
>>>>
>>>> + /* Implicitly add CGRP_ROOT_SANE_BEHAVIOR if inside a non-init cgroup
>>>> + * namespace.
>>>> + */
>>>> + if (current->nsproxy->cgroup_ns != &init_cgroup_ns) {
>>>> + opts->flags |= CGRP_ROOT_SANE_BEHAVIOR;
>>>> + }
>>>> +
>>>
>>> I don't like this implicit stuff. Can you just return -EINVAL if sane
>>> behavior isn't requested?
>>>
>>
>> I think the sane-behavior flag is only temporary and will be removed
>> anyways, right? So I didn't bother asking user to supply it. But I can
>> make the change as you suggested. We just have to make sure that tasks
>> inside cgroupns cannot mount non-default hierarchies as it would be a
>> regression.
>>
>>>> while ((token = strsep(&o, ",")) != NULL) {
>>>> nr_opts++;
>>>>
>>>> @@ -1391,7 +1398,7 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
>>>>
>>>> if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
>>>> pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
>>>> - if (nr_opts != 1) {
>>>> + if (nr_opts > 1) {
>>>> pr_err("sane_behavior: no other mount options allowed\n");
>>>> return -EINVAL;
>>>
>>> This looks wrong. But, if you make the change above, then it'll be right.
>>>
>>
>> It would have been nice if simple 'mount -t cgroup cgroup <mnt>' from
>> cgroupns does the right thing automatically.
>>
>
> This is a debatable point, but it's not what I meant. Won't your code
> let 'mount -t cgroup -o one_evil_flag cgroup mountpoint' through?
>
I don't think so. This check "if (nr_opts > 1)" is nested under "if
(opts->flags & CGRP_ROOT_SANE_BEHAVIOR)". So we know that there is
atleast 1 option ('__DEVEL__sane_behavior') present (implicit or not).
Addition of 'one_evil_flag' will make nr_opts = 2 and result in EINVAL
here.
>>
>>>> @@ -1685,6 +1701,14 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
>>>> int ret;
>>>> int i;
>>>> bool new_sb;
>>>> + struct cgroup_namespace *ns =
>>>> + get_cgroup_ns(current->nsproxy->cgroup_ns);
>>>> +
>>>> + /* Check if the caller has permission to mount. */
>>>> + if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
>>>> + put_cgroup_ns(ns);
>>>> + return ERR_PTR(-EPERM);
>>>> + }
>>>
>>> Why is this necessary?
>>>
>>
>> Without this, if I unshare userns and mntns (but no cgroupns), I will
>> be able to mount my parent's cgroupfs hierarchy. This is deviation
>> from whats allowed today (i.e., today I can't mount cgroupfs even
>> after unsharing userns & mntns). This check is there to prevent the
>> unintended effect of cgroupns feature.
>
> Oh, I get it. I misunderstood the code.
>
> I guess this is reasonable. If it annoys anyone, it can be reverted
> or weakened.
>
> --Andy
--
Aditya
^ permalink raw reply
* Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns
From: Andy Lutomirski @ 2014-11-03 23:15 UTC (permalink / raw)
To: Aditya Kali
Cc: Tejun Heo, Li Zefan, Serge Hallyn, Eric W. Biederman,
cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API,
Ingo Molnar, Linux Containers, Rohit Jnagal
In-Reply-To: <CAGr1F2FuPQxLraYv7PstJ9c8H-XQsgawaAtj4AS77B+_0k2o+A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
On Mon, Nov 3, 2014 at 3:12 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
> On Fri, Oct 31, 2014 at 5:07 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>> On Fri, Oct 31, 2014 at 12:19 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>> This patch enables cgroup mounting inside userns when a process
>>> as appropriate privileges. The cgroup filesystem mounted is
>>> rooted at the cgroupns-root. Thus, in a container-setup, only
>>> the hierarchy under the cgroupns-root is exposed inside the container.
>>> This allows container management tools to run inside the containers
>>> without depending on any global state.
>>> In order to support this, a new kernfs api is added to lookup the
>>> dentry for the cgroupns-root.
>>>
>>> Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
>>> ---
>>> fs/kernfs/mount.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
>>> include/linux/kernfs.h | 2 ++
>>> kernel/cgroup.c | 47 +++++++++++++++++++++++++++++++++++++++++++++--
>>> 3 files changed, 95 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
>>> index f973ae9..e334f45 100644
>>> --- a/fs/kernfs/mount.c
>>> +++ b/fs/kernfs/mount.c
>>> @@ -62,6 +62,54 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
>>> return NULL;
>>> }
>>>
>>> +/**
>>> + * kernfs_make_root - create new root dentry for the given kernfs_node.
>>> + * @sb: the kernfs super_block
>>> + * @kn: kernfs_node for which a dentry is needed
>>> + *
>>> + * This can used used by callers which want to mount only a part of the kernfs
>>> + * as root of the filesystem.
>>> + */
>>> +struct dentry *kernfs_obtain_root(struct super_block *sb,
>>> + struct kernfs_node *kn)
>>> +{
>>
>> I can't usefully review this, but kernfs_make_root and
>> kernfs_obtain_root aren't the same string...
>>
>>> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
>>> index 7e5d597..250aaec 100644
>>> --- a/kernel/cgroup.c
>>> +++ b/kernel/cgroup.c
>>> @@ -1302,6 +1302,13 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
>>>
>>> memset(opts, 0, sizeof(*opts));
>>>
>>> + /* Implicitly add CGRP_ROOT_SANE_BEHAVIOR if inside a non-init cgroup
>>> + * namespace.
>>> + */
>>> + if (current->nsproxy->cgroup_ns != &init_cgroup_ns) {
>>> + opts->flags |= CGRP_ROOT_SANE_BEHAVIOR;
>>> + }
>>> +
>>
>> I don't like this implicit stuff. Can you just return -EINVAL if sane
>> behavior isn't requested?
>>
>
> I think the sane-behavior flag is only temporary and will be removed
> anyways, right? So I didn't bother asking user to supply it. But I can
> make the change as you suggested. We just have to make sure that tasks
> inside cgroupns cannot mount non-default hierarchies as it would be a
> regression.
>
>>> while ((token = strsep(&o, ",")) != NULL) {
>>> nr_opts++;
>>>
>>> @@ -1391,7 +1398,7 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
>>>
>>> if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
>>> pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
>>> - if (nr_opts != 1) {
>>> + if (nr_opts > 1) {
>>> pr_err("sane_behavior: no other mount options allowed\n");
>>> return -EINVAL;
>>
>> This looks wrong. But, if you make the change above, then it'll be right.
>>
>
> It would have been nice if simple 'mount -t cgroup cgroup <mnt>' from
> cgroupns does the right thing automatically.
>
This is a debatable point, but it's not what I meant. Won't your code
let 'mount -t cgroup -o one_evil_flag cgroup mountpoint' through?
>
>>> @@ -1685,6 +1701,14 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
>>> int ret;
>>> int i;
>>> bool new_sb;
>>> + struct cgroup_namespace *ns =
>>> + get_cgroup_ns(current->nsproxy->cgroup_ns);
>>> +
>>> + /* Check if the caller has permission to mount. */
>>> + if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
>>> + put_cgroup_ns(ns);
>>> + return ERR_PTR(-EPERM);
>>> + }
>>
>> Why is this necessary?
>>
>
> Without this, if I unshare userns and mntns (but no cgroupns), I will
> be able to mount my parent's cgroupfs hierarchy. This is deviation
> from whats allowed today (i.e., today I can't mount cgroupfs even
> after unsharing userns & mntns). This check is there to prevent the
> unintended effect of cgroupns feature.
Oh, I get it. I misunderstood the code.
I guess this is reasonable. If it annoys anyone, it can be reverted
or weakened.
--Andy
^ permalink raw reply
* Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns
From: Aditya Kali @ 2014-11-03 23:12 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Linux API, Linux Containers, Serge Hallyn,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Eric W. Biederman, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA,
Ingo Molnar
In-Reply-To: <CALCETrXTaZ3SJ_t-gnbc93BVZXg-912NqO78kFd0Tpi-5-dZoQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
On Fri, Oct 31, 2014 at 5:07 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
> On Fri, Oct 31, 2014 at 12:19 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>> This patch enables cgroup mounting inside userns when a process
>> as appropriate privileges. The cgroup filesystem mounted is
>> rooted at the cgroupns-root. Thus, in a container-setup, only
>> the hierarchy under the cgroupns-root is exposed inside the container.
>> This allows container management tools to run inside the containers
>> without depending on any global state.
>> In order to support this, a new kernfs api is added to lookup the
>> dentry for the cgroupns-root.
>>
>> Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
>> ---
>> fs/kernfs/mount.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
>> include/linux/kernfs.h | 2 ++
>> kernel/cgroup.c | 47 +++++++++++++++++++++++++++++++++++++++++++++--
>> 3 files changed, 95 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
>> index f973ae9..e334f45 100644
>> --- a/fs/kernfs/mount.c
>> +++ b/fs/kernfs/mount.c
>> @@ -62,6 +62,54 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
>> return NULL;
>> }
>>
>> +/**
>> + * kernfs_make_root - create new root dentry for the given kernfs_node.
>> + * @sb: the kernfs super_block
>> + * @kn: kernfs_node for which a dentry is needed
>> + *
>> + * This can used used by callers which want to mount only a part of the kernfs
>> + * as root of the filesystem.
>> + */
>> +struct dentry *kernfs_obtain_root(struct super_block *sb,
>> + struct kernfs_node *kn)
>> +{
>
> I can't usefully review this, but kernfs_make_root and
> kernfs_obtain_root aren't the same string...
>
>> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
>> index 7e5d597..250aaec 100644
>> --- a/kernel/cgroup.c
>> +++ b/kernel/cgroup.c
>> @@ -1302,6 +1302,13 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
>>
>> memset(opts, 0, sizeof(*opts));
>>
>> + /* Implicitly add CGRP_ROOT_SANE_BEHAVIOR if inside a non-init cgroup
>> + * namespace.
>> + */
>> + if (current->nsproxy->cgroup_ns != &init_cgroup_ns) {
>> + opts->flags |= CGRP_ROOT_SANE_BEHAVIOR;
>> + }
>> +
>
> I don't like this implicit stuff. Can you just return -EINVAL if sane
> behavior isn't requested?
>
I think the sane-behavior flag is only temporary and will be removed
anyways, right? So I didn't bother asking user to supply it. But I can
make the change as you suggested. We just have to make sure that tasks
inside cgroupns cannot mount non-default hierarchies as it would be a
regression.
>> while ((token = strsep(&o, ",")) != NULL) {
>> nr_opts++;
>>
>> @@ -1391,7 +1398,7 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
>>
>> if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
>> pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
>> - if (nr_opts != 1) {
>> + if (nr_opts > 1) {
>> pr_err("sane_behavior: no other mount options allowed\n");
>> return -EINVAL;
>
> This looks wrong. But, if you make the change above, then it'll be right.
>
It would have been nice if simple 'mount -t cgroup cgroup <mnt>' from
cgroupns does the right thing automatically.
>> @@ -1685,6 +1701,14 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
>> int ret;
>> int i;
>> bool new_sb;
>> + struct cgroup_namespace *ns =
>> + get_cgroup_ns(current->nsproxy->cgroup_ns);
>> +
>> + /* Check if the caller has permission to mount. */
>> + if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
>> + put_cgroup_ns(ns);
>> + return ERR_PTR(-EPERM);
>> + }
>
> Why is this necessary?
>
Without this, if I unshare userns and mntns (but no cgroupns), I will
be able to mount my parent's cgroupfs hierarchy. This is deviation
from whats allowed today (i.e., today I can't mount cgroupfs even
after unsharing userns & mntns). This check is there to prevent the
unintended effect of cgroupns feature.
>> @@ -1862,6 +1904,7 @@ static struct file_system_type cgroup_fs_type = {
>> .name = "cgroup",
>> .mount = cgroup_mount,
>> .kill_sb = cgroup_kill_sb,
>> + .fs_flags = FS_USERNS_MOUNT,
>
> Aargh, another one! Eric, can you either ack or nack my patch?
> Because if my patch goes in, then this line may need to change. Or
> not, but if a stable release with cgroupfs and without my patch
> happens, then we'll have an ABI break.
>
> --Andy
--
Aditya
^ permalink raw reply
* Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns
From: Andy Lutomirski @ 2014-11-03 22:56 UTC (permalink / raw)
To: Aditya Kali
Cc: Linux API, Linux Containers, Serge Hallyn,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Eric W. Biederman, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA,
Ingo Molnar
In-Reply-To: <CAGr1F2Hd_PS_AscBGMXdZC9qkHGRUp-MeQvJksDOQkRBB3RGoA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
On Mon, Nov 3, 2014 at 2:43 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>
>
> On Fri, Oct 31, 2014 at 6:09 PM, Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
> wrote:
>>
>> Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> writes:
>>
>> > This patch enables cgroup mounting inside userns when a process
>> > as appropriate privileges. The cgroup filesystem mounted is
>> > rooted at the cgroupns-root. Thus, in a container-setup, only
>> > the hierarchy under the cgroupns-root is exposed inside the container.
>> > This allows container management tools to run inside the containers
>> > without depending on any global state.
>> > In order to support this, a new kernfs api is added to lookup the
>> > dentry for the cgroupns-root.
>>
>> There is a misdesign in this. Because files already exist we need the
>> protections that are present in proc and sysfs that only allow you to
>> mount the filesystem if it is already mounted. Otherwise you can wind
>> up mounting this cgroupfs in a chroot jail when the global root would
>> not like you to see it. cgroupfs isn't as bad as proc and sys but there
>> is at the very least an information leak in mounting it.
>>
>
> I think simply mounting the cgroupfs doesn't give you any more information
> than what you don't already know about the system ; specially if the
> visibility is restricted within the process's cgroupns-root. The cgroups
> still wont be writable by the user, so I think it should be fine to allow
> mounting?
>
Can we try to figure out a better way to do this than checking at
mount time for a fully-visible procfs/sysfs/cgroupfs? The current
approach is unpleasant to deal with.
For example, we could check the equivalent conditions when the userns
is created and store then in a per-userns flags field.
>
>>
>> Given that we are effectively performing a bind mount in this patch, and
>> that we need to require cgroupfs be mounted anyway (to be safe).
>>
>> I don't see the point of this change.
>>
>> If we could change the set of cgroups or visible in cgroupfs I could
>> probably see the point. But as it is this change seems to be pointless.
>>
>
> I agree that this is effectively bind-mounting, but doing this in kernel
> makes it really convenient for the userspace. The process that sets up the
> container doesn't need to care whether it should bind-mount cgroupfs inside
> the container or not. The tasks inside the container can mount cgroupfs on
> as-needed basis. The root container manager can simply unshare cgroupns and
> forget about the internal setup. I think this is useful just for the reason
> that it makes life much simpler for userspace.
>
If we add the fully-visible check at mount time, then I almost agree
with Eric. I say almost because fs_fully_visible isn't checking
whether the superblock root is the thing that's mounted, and, if we
fix that, then bind-mounting like this becomes impossible and we'd
have to refine the check.
But if we come up with something less gross than checking for fs
visibility at mount time, then I agree with Aditya: let's let mount do
the right thing, since there may be nothing there to bind mount. If
we go that route, then I think we might want to make it explicit:
require a mount flag like root=. to indicate that we want to be rooted
at our cgroupns's root cgroup.
--Andy
^ permalink raw reply
* Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns
From: Aditya Kali @ 2014-11-03 22:46 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Tejun Heo, Li Zefan, Serge Hallyn, Andy Lutomirski,
cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API,
Ingo Molnar, Linux Containers, Rohit Jnagal
In-Reply-To: <87y4rvrakn.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
(sorry for accidental non-plain-text response earlier).
On Fri, Oct 31, 2014 at 6:09 PM, Eric W. Biederman
<ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
> Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> writes:
>
>> This patch enables cgroup mounting inside userns when a process
>> as appropriate privileges. The cgroup filesystem mounted is
>> rooted at the cgroupns-root. Thus, in a container-setup, only
>> the hierarchy under the cgroupns-root is exposed inside the container.
>> This allows container management tools to run inside the containers
>> without depending on any global state.
>> In order to support this, a new kernfs api is added to lookup the
>> dentry for the cgroupns-root.
>
> There is a misdesign in this. Because files already exist we need the
> protections that are present in proc and sysfs that only allow you to
> mount the filesystem if it is already mounted. Otherwise you can wind
> up mounting this cgroupfs in a chroot jail when the global root would
> not like you to see it. cgroupfs isn't as bad as proc and sys but there
> is at the very least an information leak in mounting it.
>
I think simply mounting the cgroupfs doesn't give you any more
information than what you don't already know about the system ;
specially if the visibility is restricted within the process's
cgroupns-root. The cgroups still wont be writable by the user, so I
think it should be fine to allow mounting?
> Given that we are effectively performing a bind mount in this patch, and
> that we need to require cgroupfs be mounted anyway (to be safe).
>
> I don't see the point of this change.
>
> If we could change the set of cgroups or visible in cgroupfs I could
> probably see the point. But as it is this change seems to be pointless.
>
I agree that this is effectively bind-mounting, but doing this in
kernel makes it really convenient for the userspace. The process that
sets up the container doesn't need to care whether it should
bind-mount cgroupfs inside the container or not. The tasks inside the
container can mount cgroupfs on as-needed basis. The root container
manager can simply unshare cgroupns and forget about the internal
setup. I think this is useful just for the reason that it makes life
much simpler for userspace.
> Eric
>
>
>> Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
>> ---
>> fs/kernfs/mount.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
>> include/linux/kernfs.h | 2 ++
>> kernel/cgroup.c | 47 +++++++++++++++++++++++++++++++++++++++++++++--
>> 3 files changed, 95 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
>> index f973ae9..e334f45 100644
>> --- a/fs/kernfs/mount.c
>> +++ b/fs/kernfs/mount.c
>> @@ -62,6 +62,54 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
>> return NULL;
>> }
>>
>> +/**
>> + * kernfs_make_root - create new root dentry for the given kernfs_node.
>> + * @sb: the kernfs super_block
>> + * @kn: kernfs_node for which a dentry is needed
>> + *
>> + * This can used used by callers which want to mount only a part of the kernfs
>> + * as root of the filesystem.
>> + */
>> +struct dentry *kernfs_obtain_root(struct super_block *sb,
>> + struct kernfs_node *kn)
>> +{
>> + struct dentry *dentry;
>> + struct inode *inode;
>> +
>> + BUG_ON(sb->s_op != &kernfs_sops);
>> +
>> + /* inode for the given kernfs_node should already exist. */
>> + inode = ilookup(sb, kn->ino);
>> + if (!inode) {
>> + pr_debug("kernfs: could not get inode for '");
>> + pr_cont_kernfs_path(kn);
>> + pr_cont("'.\n");
>> + return ERR_PTR(-EINVAL);
>> + }
>> +
>> + /* instantiate and link root dentry */
>> + dentry = d_obtain_root(inode);
>> + if (!dentry) {
>> + pr_debug("kernfs: could not get dentry for '");
>> + pr_cont_kernfs_path(kn);
>> + pr_cont("'.\n");
>> + return ERR_PTR(-ENOMEM);
>> + }
>> +
>> + /* If this is a new dentry, set it up. We need kernfs_mutex because this
>> + * may be called by callers other than kernfs_fill_super. */
>> + mutex_lock(&kernfs_mutex);
>> + if (!dentry->d_fsdata) {
>> + kernfs_get(kn);
>> + dentry->d_fsdata = kn;
>> + } else {
>> + WARN_ON(dentry->d_fsdata != kn);
>> + }
>> + mutex_unlock(&kernfs_mutex);
>> +
>> + return dentry;
>> +}
>> +
>> static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
>> {
>> struct kernfs_super_info *info = kernfs_info(sb);
>> diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
>> index 3c2be75..b9538e0 100644
>> --- a/include/linux/kernfs.h
>> +++ b/include/linux/kernfs.h
>> @@ -274,6 +274,8 @@ void kernfs_put(struct kernfs_node *kn);
>> struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry);
>> struct kernfs_root *kernfs_root_from_sb(struct super_block *sb);
>>
>> +struct dentry *kernfs_obtain_root(struct super_block *sb,
>> + struct kernfs_node *kn);
>> struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
>> unsigned int flags, void *priv);
>> void kernfs_destroy_root(struct kernfs_root *root);
>> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
>> index 7e5d597..250aaec 100644
>> --- a/kernel/cgroup.c
>> +++ b/kernel/cgroup.c
>> @@ -1302,6 +1302,13 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
>>
>> memset(opts, 0, sizeof(*opts));
>>
>> + /* Implicitly add CGRP_ROOT_SANE_BEHAVIOR if inside a non-init cgroup
>> + * namespace.
>> + */
>> + if (current->nsproxy->cgroup_ns != &init_cgroup_ns) {
>> + opts->flags |= CGRP_ROOT_SANE_BEHAVIOR;
>> + }
>> +
>> while ((token = strsep(&o, ",")) != NULL) {
>> nr_opts++;
>>
>> @@ -1391,7 +1398,7 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
>>
>> if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
>> pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
>> - if (nr_opts != 1) {
>> + if (nr_opts > 1) {
>> pr_err("sane_behavior: no other mount options allowed\n");
>> return -EINVAL;
>> }
>> @@ -1581,6 +1588,15 @@ static void init_cgroup_root(struct cgroup_root *root,
>> set_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags);
>> }
>>
>> +struct dentry *cgroupns_get_root(struct super_block *sb,
>> + struct cgroup_namespace *ns)
>> +{
>> + struct dentry *nsdentry;
>> +
>> + nsdentry = kernfs_obtain_root(sb, ns->root_cgrp->kn);
>> + return nsdentry;
>> +}
>> +
>> static int cgroup_setup_root(struct cgroup_root *root, unsigned int ss_mask)
>> {
>> LIST_HEAD(tmp_links);
>> @@ -1685,6 +1701,14 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
>> int ret;
>> int i;
>> bool new_sb;
>> + struct cgroup_namespace *ns =
>> + get_cgroup_ns(current->nsproxy->cgroup_ns);
>> +
>> + /* Check if the caller has permission to mount. */
>> + if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
>> + put_cgroup_ns(ns);
>> + return ERR_PTR(-EPERM);
>> + }
>>
>> /*
>> * The first time anyone tries to mount a cgroup, enable the list
>> @@ -1817,11 +1841,28 @@ out_free:
>> kfree(opts.release_agent);
>> kfree(opts.name);
>>
>> - if (ret)
>> + if (ret) {
>> + put_cgroup_ns(ns);
>> return ERR_PTR(ret);
>> + }
>>
>> dentry = kernfs_mount(fs_type, flags, root->kf_root,
>> CGROUP_SUPER_MAGIC, &new_sb);
>> +
>> + if (!IS_ERR(dentry) && (root == &cgrp_dfl_root)) {
>> + /* If this mount is for the default hierarchy in non-init cgroup
>> + * namespace, then instead of root cgroup's dentry, we return
>> + * the dentry corresponding to the cgroupns->root_cgrp.
>> + */
>> + if (ns != &init_cgroup_ns) {
>> + struct dentry *nsdentry;
>> +
>> + nsdentry = cgroupns_get_root(dentry->d_sb, ns);
>> + dput(dentry);
>> + dentry = nsdentry;
>> + }
>> + }
>> +
>> if (IS_ERR(dentry) || !new_sb)
>> cgroup_put(&root->cgrp);
>>
>> @@ -1834,6 +1875,7 @@ out_free:
>> deactivate_super(pinned_sb);
>> }
>>
>> + put_cgroup_ns(ns);
>> return dentry;
>> }
>>
>> @@ -1862,6 +1904,7 @@ static struct file_system_type cgroup_fs_type = {
>> .name = "cgroup",
>> .mount = cgroup_mount,
>> .kill_sb = cgroup_kill_sb,
>> + .fs_flags = FS_USERNS_MOUNT,
>> };
>>
>> static struct kobject *cgroup_kobj;
--
Aditya
^ permalink raw reply
* Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns
From: Aditya Kali @ 2014-11-03 22:43 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Linux API, Linux Containers, Serge Hallyn,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Andy Lutomirski, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA,
Ingo Molnar
In-Reply-To: <87y4rvrakn.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
On Fri, Oct 31, 2014 at 6:09 PM, Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
wrote:
> Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> writes:
>
> > This patch enables cgroup mounting inside userns when a process
> > as appropriate privileges. The cgroup filesystem mounted is
> > rooted at the cgroupns-root. Thus, in a container-setup, only
> > the hierarchy under the cgroupns-root is exposed inside the container.
> > This allows container management tools to run inside the containers
> > without depending on any global state.
> > In order to support this, a new kernfs api is added to lookup the
> > dentry for the cgroupns-root.
>
> There is a misdesign in this. Because files already exist we need the
> protections that are present in proc and sysfs that only allow you to
> mount the filesystem if it is already mounted. Otherwise you can wind
> up mounting this cgroupfs in a chroot jail when the global root would
> not like you to see it. cgroupfs isn't as bad as proc and sys but there
> is at the very least an information leak in mounting it.
>
>
I think simply mounting the cgroupfs doesn't give you any more information
than what you don't already know about the system ; specially if the
visibility is restricted within the process's cgroupns-root. The cgroups
still wont be writable by the user, so I think it should be fine to allow
mounting?
> Given that we are effectively performing a bind mount in this patch, and
> that we need to require cgroupfs be mounted anyway (to be safe).
>
> I don't see the point of this change.
>
> If we could change the set of cgroups or visible in cgroupfs I could
> probably see the point. But as it is this change seems to be pointless.
>
>
I agree that this is effectively bind-mounting, but doing this in kernel
makes it really convenient for the userspace. The process that sets up the
container doesn't need to care whether it should bind-mount cgroupfs inside
the container or not. The tasks inside the container can mount cgroupfs on
as-needed basis. The root container manager can simply unshare cgroupns and
forget about the internal setup. I think this is useful just for the reason
that it makes life much simpler for userspace.
> Eric
>
>
> > Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> > ---
> > fs/kernfs/mount.c | 48
> ++++++++++++++++++++++++++++++++++++++++++++++++
> > include/linux/kernfs.h | 2 ++
> > kernel/cgroup.c | 47
> +++++++++++++++++++++++++++++++++++++++++++++--
> > 3 files changed, 95 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
> > index f973ae9..e334f45 100644
> > --- a/fs/kernfs/mount.c
> > +++ b/fs/kernfs/mount.c
> > @@ -62,6 +62,54 @@ struct kernfs_root *kernfs_root_from_sb(struct
> super_block *sb)
> > return NULL;
> > }
> >
> > +/**
> > + * kernfs_make_root - create new root dentry for the given kernfs_node.
> > + * @sb: the kernfs super_block
> > + * @kn: kernfs_node for which a dentry is needed
> > + *
> > + * This can used used by callers which want to mount only a part of the
> kernfs
> > + * as root of the filesystem.
> > + */
> > +struct dentry *kernfs_obtain_root(struct super_block *sb,
> > + struct kernfs_node *kn)
> > +{
> > + struct dentry *dentry;
> > + struct inode *inode;
> > +
> > + BUG_ON(sb->s_op != &kernfs_sops);
> > +
> > + /* inode for the given kernfs_node should already exist. */
> > + inode = ilookup(sb, kn->ino);
> > + if (!inode) {
> > + pr_debug("kernfs: could not get inode for '");
> > + pr_cont_kernfs_path(kn);
> > + pr_cont("'.\n");
> > + return ERR_PTR(-EINVAL);
> > + }
> > +
> > + /* instantiate and link root dentry */
> > + dentry = d_obtain_root(inode);
> > + if (!dentry) {
> > + pr_debug("kernfs: could not get dentry for '");
> > + pr_cont_kernfs_path(kn);
> > + pr_cont("'.\n");
> > + return ERR_PTR(-ENOMEM);
> > + }
> > +
> > + /* If this is a new dentry, set it up. We need kernfs_mutex
> because this
> > + * may be called by callers other than kernfs_fill_super. */
> > + mutex_lock(&kernfs_mutex);
> > + if (!dentry->d_fsdata) {
> > + kernfs_get(kn);
> > + dentry->d_fsdata = kn;
> > + } else {
> > + WARN_ON(dentry->d_fsdata != kn);
> > + }
> > + mutex_unlock(&kernfs_mutex);
> > +
> > + return dentry;
> > +}
> > +
> > static int kernfs_fill_super(struct super_block *sb, unsigned long
> magic)
> > {
> > struct kernfs_super_info *info = kernfs_info(sb);
> > diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
> > index 3c2be75..b9538e0 100644
> > --- a/include/linux/kernfs.h
> > +++ b/include/linux/kernfs.h
> > @@ -274,6 +274,8 @@ void kernfs_put(struct kernfs_node *kn);
> > struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry);
> > struct kernfs_root *kernfs_root_from_sb(struct super_block *sb);
> >
> > +struct dentry *kernfs_obtain_root(struct super_block *sb,
> > + struct kernfs_node *kn);
> > struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
> > unsigned int flags, void *priv);
> > void kernfs_destroy_root(struct kernfs_root *root);
> > diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> > index 7e5d597..250aaec 100644
> > --- a/kernel/cgroup.c
> > +++ b/kernel/cgroup.c
> > @@ -1302,6 +1302,13 @@ static int parse_cgroupfs_options(char *data,
> struct cgroup_sb_opts *opts)
> >
> > memset(opts, 0, sizeof(*opts));
> >
> > + /* Implicitly add CGRP_ROOT_SANE_BEHAVIOR if inside a non-init
> cgroup
> > + * namespace.
> > + */
> > + if (current->nsproxy->cgroup_ns != &init_cgroup_ns) {
> > + opts->flags |= CGRP_ROOT_SANE_BEHAVIOR;
> > + }
> > +
> > while ((token = strsep(&o, ",")) != NULL) {
> > nr_opts++;
> >
> > @@ -1391,7 +1398,7 @@ static int parse_cgroupfs_options(char *data,
> struct cgroup_sb_opts *opts)
> >
> > if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
> > pr_warn("sane_behavior: this is still under development
> and its behaviors will change, proceed at your own risk\n");
> > - if (nr_opts != 1) {
> > + if (nr_opts > 1) {
> > pr_err("sane_behavior: no other mount options
> allowed\n");
> > return -EINVAL;
> > }
> > @@ -1581,6 +1588,15 @@ static void init_cgroup_root(struct cgroup_root
> *root,
> > set_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags);
> > }
> >
> > +struct dentry *cgroupns_get_root(struct super_block *sb,
> > + struct cgroup_namespace *ns)
> > +{
> > + struct dentry *nsdentry;
> > +
> > + nsdentry = kernfs_obtain_root(sb, ns->root_cgrp->kn);
> > + return nsdentry;
> > +}
> > +
> > static int cgroup_setup_root(struct cgroup_root *root, unsigned int
> ss_mask)
> > {
> > LIST_HEAD(tmp_links);
> > @@ -1685,6 +1701,14 @@ static struct dentry *cgroup_mount(struct
> file_system_type *fs_type,
> > int ret;
> > int i;
> > bool new_sb;
> > + struct cgroup_namespace *ns =
> > + get_cgroup_ns(current->nsproxy->cgroup_ns);
> > +
> > + /* Check if the caller has permission to mount. */
> > + if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
> > + put_cgroup_ns(ns);
> > + return ERR_PTR(-EPERM);
> > + }
> >
> > /*
> > * The first time anyone tries to mount a cgroup, enable the list
> > @@ -1817,11 +1841,28 @@ out_free:
> > kfree(opts.release_agent);
> > kfree(opts.name);
> >
> > - if (ret)
> > + if (ret) {
> > + put_cgroup_ns(ns);
> > return ERR_PTR(ret);
> > + }
> >
> > dentry = kernfs_mount(fs_type, flags, root->kf_root,
> > CGROUP_SUPER_MAGIC, &new_sb);
> > +
> > + if (!IS_ERR(dentry) && (root == &cgrp_dfl_root)) {
> > + /* If this mount is for the default hierarchy in non-init
> cgroup
> > + * namespace, then instead of root cgroup's dentry, we
> return
> > + * the dentry corresponding to the cgroupns->root_cgrp.
> > + */
> > + if (ns != &init_cgroup_ns) {
> > + struct dentry *nsdentry;
> > +
> > + nsdentry = cgroupns_get_root(dentry->d_sb, ns);
> > + dput(dentry);
> > + dentry = nsdentry;
> > + }
> > + }
> > +
> > if (IS_ERR(dentry) || !new_sb)
> > cgroup_put(&root->cgrp);
> >
> > @@ -1834,6 +1875,7 @@ out_free:
> > deactivate_super(pinned_sb);
> > }
> >
> > + put_cgroup_ns(ns);
> > return dentry;
> > }
> >
> > @@ -1862,6 +1904,7 @@ static struct file_system_type cgroup_fs_type = {
> > .name = "cgroup",
> > .mount = cgroup_mount,
> > .kill_sb = cgroup_kill_sb,
> > + .fs_flags = FS_USERNS_MOUNT,
> > };
> >
> > static struct kobject *cgroup_kobj;
>
--
Aditya
^ permalink raw reply
* pdf
From: peterh.ans-RkryiyCqeDI @ 2014-11-03 20:47 UTC (permalink / raw)
Dear Sir/Madam, Here is a pdf attachment of my proposal to you. Please
read and reply I would be grateful. Peter Hans
reply to E-mail: peterh.ans-YDxpq3io04c@public.gmane.org
^ permalink raw reply
* Re: [PATCH] uapi: resort Kbuild entries
From: David Miller @ 2014-11-03 20:45 UTC (permalink / raw)
To: stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ
Cc: gregkh-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
netdev-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20141103124234.77377274@urahara>
From: Stephen Hemminger <stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ@public.gmane.org>
Date: Mon, 3 Nov 2014 12:42:34 -0800
> The entries in the Kbuild files are incorrectly sorted.
> Matters for aesthetics only.
>
> Signed-off-by: Stephen Hemminger <stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ@public.gmane.org>
>
> ---
> Patch against -net tree since that is where my last change was.
As such, I'll apply to this to net-next after my next merge.
^ permalink raw reply
* [PATCH] uapi: resort Kbuild entries
From: Stephen Hemminger @ 2014-11-03 20:42 UTC (permalink / raw)
To: David Miller, Greg KH; +Cc: netdev, linux-api
The entries in the Kbuild files are incorrectly sorted.
Matters for aesthetics only.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
Patch against -net tree since that is where my last change was.
--- a/include/uapi/linux/Kbuild 2014-11-02 11:24:32.304658688 -0800
+++ b/include/uapi/linux/Kbuild 2014-11-02 11:27:09.017509917 -0800
@@ -37,27 +37,27 @@ header-y += aio_abi.h
header-y += apm_bios.h
header-y += arcfb.h
header-y += atalk.h
-header-y += atm.h
-header-y += atm_eni.h
-header-y += atm_he.h
-header-y += atm_idt77105.h
-header-y += atm_nicstar.h
-header-y += atm_tcp.h
-header-y += atm_zatm.h
header-y += atmapi.h
header-y += atmarp.h
header-y += atmbr2684.h
header-y += atmclip.h
header-y += atmdev.h
+header-y += atm_eni.h
+header-y += atm.h
+header-y += atm_he.h
+header-y += atm_idt77105.h
header-y += atmioc.h
header-y += atmlec.h
header-y += atmmpc.h
+header-y += atm_nicstar.h
header-y += atmppp.h
header-y += atmsap.h
header-y += atmsvc.h
+header-y += atm_tcp.h
+header-y += atm_zatm.h
header-y += audit.h
-header-y += auto_fs.h
header-y += auto_fs4.h
+header-y += auto_fs.h
header-y += auxvec.h
header-y += ax25.h
header-y += b1lli.h
@@ -67,8 +67,8 @@ header-y += bfs_fs.h
header-y += binfmts.h
header-y += blkpg.h
header-y += blktrace_api.h
-header-y += bpf.h
header-y += bpf_common.h
+header-y += bpf.h
header-y += bpqether.h
header-y += bsg.h
header-y += btrfs.h
@@ -93,21 +93,21 @@ header-y += cyclades.h
header-y += cycx_cfm.h
header-y += dcbnl.h
header-y += dccp.h
-header-y += dlm.h
+header-y += dlmconstants.h
header-y += dlm_device.h
+header-y += dlm.h
header-y += dlm_netlink.h
header-y += dlm_plock.h
-header-y += dlmconstants.h
header-y += dm-ioctl.h
header-y += dm-log-userspace.h
header-y += dn.h
header-y += dqblk_xfs.h
header-y += edd.h
header-y += efs_fs_sb.h
+header-y += elfcore.h
header-y += elf-em.h
header-y += elf-fdpic.h
header-y += elf.h
-header-y += elfcore.h
header-y += errno.h
header-y += errqueue.h
header-y += ethtool.h
@@ -131,15 +131,15 @@ header-y += fsl_hypervisor.h
header-y += fuse.h
header-y += futex.h
header-y += gameport.h
-header-y += gen_stats.h
header-y += genetlink.h
+header-y += gen_stats.h
header-y += gfs2_ondisk.h
header-y += gigaset_dev.h
-header-y += hdlc.h
header-y += hdlcdrv.h
+header-y += hdlc.h
header-y += hdreg.h
-header-y += hid.h
header-y += hiddev.h
+header-y += hid.h
header-y += hidraw.h
header-y += hpet.h
header-y += hsr_netlink.h
@@ -151,7 +151,6 @@ header-y += i2o-dev.h
header-y += i8k.h
header-y += icmp.h
header-y += icmpv6.h
-header-y += if.h
header-y += if_addr.h
header-y += if_addrlabel.h
header-y += if_alg.h
@@ -165,6 +164,7 @@ header-y += if_ether.h
header-y += if_fc.h
header-y += if_fddi.h
header-y += if_frad.h
+header-y += if.h
header-y += if_hippi.h
header-y += if_infiniband.h
header-y += if_link.h
@@ -182,40 +182,40 @@ header-y += if_tunnel.h
header-y += if_vlan.h
header-y += if_x25.h
header-y += igmp.h
-header-y += in.h
header-y += in6.h
-header-y += in_route.h
header-y += inet_diag.h
+header-y += in.h
header-y += inotify.h
header-y += input.h
+header-y += in_route.h
header-y += ioctl.h
-header-y += ip.h
header-y += ip6_tunnel.h
-header-y += ip_vs.h
header-y += ipc.h
+header-y += ip.h
header-y += ipmi.h
header-y += ipmi_msgdefs.h
header-y += ipsec.h
header-y += ipv6.h
header-y += ipv6_route.h
+header-y += ip_vs.h
header-y += ipx.h
header-y += irda.h
header-y += irqnr.h
-header-y += isdn.h
header-y += isdn_divertif.h
-header-y += isdn_ppp.h
+header-y += isdn.h
header-y += isdnif.h
+header-y += isdn_ppp.h
header-y += iso_fs.h
-header-y += ivtv.h
header-y += ivtvfb.h
+header-y += ivtv.h
header-y += ixjuser.h
header-y += jffs2.h
header-y += joystick.h
-header-y += kd.h
header-y += kdev_t.h
-header-y += kernel-page-flags.h
-header-y += kernel.h
+header-y += kd.h
header-y += kernelcapi.h
+header-y += kernel.h
+header-y += kernel-page-flags.h
header-y += kexec.h
header-y += keyboard.h
header-y += keyctl.h
@@ -231,6 +231,7 @@ ifneq ($(wildcard $(srctree)/arch/$(SRCA
header-y += kvm_para.h
endif
+header-y += hw_breakpoint.h
header-y += l2tp.h
header-y += libc-compat.h
header-y += limits.h
@@ -255,43 +256,43 @@ header-y += mman.h
header-y += mmtimer.h
header-y += mpls.h
header-y += mqueue.h
-header-y += mroute.h
header-y += mroute6.h
+header-y += mroute.h
header-y += msdos_fs.h
header-y += msg.h
header-y += mtio.h
-header-y += n_r3964.h
header-y += nbd.h
-header-y += ncp.h
header-y += ncp_fs.h
+header-y += ncp.h
header-y += ncp_mount.h
header-y += ncp_no.h
header-y += neighbour.h
-header-y += net.h
-header-y += net_dropmon.h
-header-y += net_tstamp.h
header-y += netconf.h
header-y += netdevice.h
-header-y += netlink_diag.h
-header-y += netfilter.h
+header-y += net_dropmon.h
header-y += netfilter_arp.h
header-y += netfilter_bridge.h
header-y += netfilter_decnet.h
+header-y += netfilter.h
header-y += netfilter_ipv4.h
header-y += netfilter_ipv6.h
+header-y += net.h
+header-y += netlink_diag.h
header-y += netlink.h
header-y += netrom.h
+header-y += net_tstamp.h
header-y += nfc.h
-header-y += nfs.h
header-y += nfs2.h
header-y += nfs3.h
header-y += nfs4.h
header-y += nfs4_mount.h
+header-y += nfsacl.h
header-y += nfs_fs.h
+header-y += nfs.h
header-y += nfs_idmap.h
header-y += nfs_mount.h
-header-y += nfsacl.h
header-y += nl80211.h
+header-y += n_r3964.h
header-y += nubus.h
header-y += nvme.h
header-y += nvram.h
@@ -311,16 +312,16 @@ header-y += pfkeyv2.h
header-y += pg.h
header-y += phantom.h
header-y += phonet.h
+header-y += pktcdvd.h
header-y += pkt_cls.h
header-y += pkt_sched.h
-header-y += pktcdvd.h
header-y += pmu.h
header-y += poll.h
header-y += posix_types.h
header-y += ppdev.h
header-y += ppp-comp.h
-header-y += ppp-ioctl.h
header-y += ppp_defs.h
+header-y += ppp-ioctl.h
header-y += pps.h
header-y += prctl.h
header-y += psci.h
@@ -352,13 +353,13 @@ header-y += seccomp.h
header-y += securebits.h
header-y += selinux_netlink.h
header-y += sem.h
-header-y += serial.h
header-y += serial_core.h
+header-y += serial.h
header-y += serial_reg.h
header-y += serio.h
header-y += shm.h
-header-y += signal.h
header-y += signalfd.h
+header-y += signal.h
header-y += smiapp.h
header-y += snmp.h
header-y += sock_diag.h
@@ -367,8 +368,8 @@ header-y += sockios.h
header-y += som.h
header-y += sonet.h
header-y += sonypi.h
-header-y += sound.h
header-y += soundcard.h
+header-y += sound.h
header-y += stat.h
header-y += stddef.h
header-y += string.h
@@ -387,11 +388,11 @@ header-y += time.h
header-y += times.h
header-y += timex.h
header-y += tiocl.h
-header-y += tipc.h
header-y += tipc_config.h
+header-y += tipc.h
header-y += toshiba.h
-header-y += tty.h
header-y += tty_flags.h
+header-y += tty.h
header-y += types.h
header-y += udf_fs_i.h
header-y += udp.h
@@ -437,6 +438,5 @@ header-y += wireless.h
header-y += x25.h
header-y += xattr.h
header-y += xfrm.h
-header-y += hw_breakpoint.h
header-y += zorro.h
header-y += zorro_ids.h
^ permalink raw reply
* Re: rt-tester - move under selftests
From: Thomas Gleixner @ 2014-11-03 19:54 UTC (permalink / raw)
To: Shuah Khan
Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
mingo-H+wXaHxf7aLQT0dZR+AlfA, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <5446BDC6.4040502-JPH+aEBZ4P+UEJcrhfAQsw@public.gmane.org>
On Tue, 21 Oct 2014, Shuah Khan wrote:
> Andrew/Ingo/Thomas,
>
> You are all on the signed-off list for rt-tester. I have been poking
> around the source tree looking for existing test and came across rt-tester.
>
> What are your thoughts on rt-tester suitability to reside under
> tools/testing/selftest and becoming part of kselftest target?
> Would you like see it moved under the kselftest umbrella?
Sure you can move it, but it is broken and needs a complete
rewrite. So the proper solution is to remove it. I kept it there as a
reminder to fix it, but obviously I never managed to do so.
Thanks,
tglx
^ permalink raw reply
* Re: [PATCH 1/3] fs: add O_BENEATH flag to openat(2)
From: Andy Lutomirski @ 2014-11-03 18:29 UTC (permalink / raw)
To: Julien Tinnes
Cc: David Drysdale, Al Viro,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Kees Cook,
Greg Kroah-Hartman, Meredydd Luff, Will Drewry,
Jorge Lucangeli Obes, Ricky Zhou, Lee Campbell, Mike Depinet,
James Morris, Paolo Bonzini, Paul Moore, Christoph Hellwig,
Eric W. Biederman, Linux API, LSM List
In-Reply-To: <CAKyRK=hRX1xk_0cRNhZ341HwU9Nim5_vhpM5twJHUOt8fH29=w@mail.gmail.com>
On Mon, Nov 3, 2014 at 10:25 AM, Julien Tinnes <jln-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org> wrote:
> On Mon, Nov 3, 2014 at 9:37 AM, David Drysdale <drysdale-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>
>> On Mon, Nov 3, 2014 at 3:42 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>
>> wrote:
>> > On Mon, Nov 3, 2014 at 7:20 AM, Al Viro <viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn@public.gmane.org> wrote:
>> >> On Mon, Nov 03, 2014 at 11:48:23AM +0000, David Drysdale wrote:
>> >>> Add a new O_BENEATH flag for openat(2) which restricts the
>> >>> provided path, rejecting (with -EACCES) paths that are not beneath
>> >>> the provided dfd. In particular, reject:
>> >>> - paths that contain .. components
>> >>> - paths that begin with /
>> >>> - symlinks that have paths as above.
>> >>
>> >> Yecch... The degree of usefulness aside (and I'm not convinced that it
>> >> is non-zero),
>> >
>> > This is extremely useful in conjunction with seccomp.
>>
>> Yes, that was my understanding of how the Chrome[OS] folk wanted
>> to use it.
>
>
> Yes, exactly. Without this, if we want to give a sandboxed process A access
> to a directory, we need to:
> 1. Create a new 'broker" process B
> 2. Make sure to have an IPC channel between A and B.
> 3. SIGSYS open() and openat() in A via seccomp-bpf
> 4. Have an async-signal-safe handler that can IPC open / openat.
You can do this with user namespaces, too. But this is way more
complicated than it should be, and it has a lot more overhead.
--Andy
>
> There is a lot of hidden complexity in such a set-up. For instance, if you
> need to prevent contention, the number of threads in the broker B should
> scale automatically.
>
> This is 'fine' (but undesirable) for a big beast such as Chromium which
> needs such a complex set-ups anyways, but David's patch would make it a lot
> easier to build a sandbox and whitelist directories for everyone, simply by
> enforcing O_BENEATH in seccomp and whitelisting open directory file
> descriptors in the sandboxed process.
>
> Julien
--
Andy Lutomirski
AMA Capital Management, LLC
^ permalink raw reply
* Re: [PATCH 1/3] fs: add O_BENEATH flag to openat(2)
From: Julien Tinnes @ 2014-11-03 18:26 UTC (permalink / raw)
To: David Drysdale
Cc: Andy Lutomirski, Al Viro, linux-kernel@vger.kernel.org, Kees Cook,
Greg Kroah-Hartman, Meredydd Luff, Will Drewry,
Jorge Lucangeli Obes, Ricky Zhou, Lee Campbell, Mike Depinet,
James Morris, Paolo Bonzini, Paul Moore, Christoph Hellwig,
Eric W. Biederman, Linux API, LSM List
In-Reply-To: <CAHse=S9AGNF84cLYszPSM2q7-78+KfuTCoap7eYyMEcbx97Zyw@mail.gmail.com>
On Mon, Nov 3, 2014 at 9:37 AM, David Drysdale <drysdale@google.com> wrote:
> On Mon, Nov 3, 2014 at 3:42 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> This is extremely useful in conjunction with seccomp.
>
> Yes, that was my understanding of how the Chrome[OS] folk wanted
> to use it.
Yes, exactly. Without this, if we want to give a sandboxed process A
access to a directory, we need to:
1. Create a new 'broker" process B
2. Make sure to have an IPC channel between A and B.
3. SIGSYS open() and openat() in A via seccomp-bpf
4. Have an async-signal-safe handler that can IPC open / openat.
There is a lot of hidden complexity in such a set-up. For instance, if
you need to prevent contention, the number of threads in the broker B
should scale automatically.
This is 'fine' (but undesirable) for a big beast such as Chromium
which needs such a complex set-ups anyways, but David's patch would
make it a lot easier to build a sandbox and whitelist directories for
everyone, simply by enforcing O_BENEATH in seccomp and whitelisting
open directory file descriptors in the sandboxed process.
Julien
^ permalink raw reply
* Re: [PATCH 1/3] fs: add O_BENEATH flag to openat(2)
From: David Drysdale @ 2014-11-03 17:37 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Al Viro, linux-kernel@vger.kernel.org, Kees Cook,
Greg Kroah-Hartman, Meredydd Luff, Will Drewry,
Jorge Lucangeli Obes, Ricky Zhou, Lee Campbell, Julien Tinnes,
Mike Depinet, James Morris, Paolo Bonzini, Paul Moore,
Christoph Hellwig, Eric W. Biederman, Linux API, LSM List
In-Reply-To: <CALCETrWmivJCvAAZPQo9Jf5gDiSbpWKesjZWfqEQRXPOBSPY9g@mail.gmail.com>
On Mon, Nov 3, 2014 at 3:42 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Mon, Nov 3, 2014 at 7:20 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>> On Mon, Nov 03, 2014 at 11:48:23AM +0000, David Drysdale wrote:
>>> Add a new O_BENEATH flag for openat(2) which restricts the
>>> provided path, rejecting (with -EACCES) paths that are not beneath
>>> the provided dfd. In particular, reject:
>>> - paths that contain .. components
>>> - paths that begin with /
>>> - symlinks that have paths as above.
>>
>> Yecch... The degree of usefulness aside (and I'm not convinced that it
>> is non-zero),
>
> This is extremely useful in conjunction with seccomp.
Yes, that was my understanding of how the Chrome[OS] folk wanted
to use it.
>> WTF pass one bit out of nameidata->flags in a separate argument?
I'll shift to using nd->flags; not sure what I was thinking of there.
(It *might* have made more sense in the full patchset this was extracted
from but it certainly doesn't look sensible in this narrower context.)
>> Through the mutual recursion, no less... And then you are not even attempting
>> to detect symlinks that are not followed by interpretation of _any_ pathname.
>
> How many symlinks like that are there? Is there anything except
> nd_jump_link users? All of those are in /proc. Arguably O_BENEATH
> should prevent traversal of all of those links.
>
> --Andy
On a quick search, the 2 users of nd_jump_link (namely proc_pid_follow_link
and proc_ns_follow_link) seem to be the only implementations of
inode_operations->follow_link that don't just call nd_set_link(). So
disallowing that for O_BENEATH might give sensible behaviour.
^ permalink raw reply
* Re: [PATCH 1/3] fs: add O_BENEATH flag to openat(2)
From: Eric W.Biederman @ 2014-11-03 17:22 UTC (permalink / raw)
To: Andy Lutomirski, Al Viro
Cc: David Drysdale, linux-kernel@vger.kernel.org, Kees Cook,
Greg Kroah-Hartman, Meredydd Luff, Will Drewry,
Jorge Lucangeli Obes, Ricky Zhou, Lee Campbell, Julien Tinnes,
Mike Depinet, James Morris, Paolo Bonzini, Paul Moore,
Christoph Hellwig, Linux API, LSM List
In-Reply-To: <CALCETrWmivJCvAAZPQo9Jf5gDiSbpWKesjZWfqEQRXPOBSPY9g@mail.gmail.com>
On November 3, 2014 7:42:58 AM PST, Andy Lutomirski <luto@amacapital.net> wrote:
>On Mon, Nov 3, 2014 at 7:20 AM, Al Viro <viro@zeniv.linux.org.uk>
>wrote:
>> On Mon, Nov 03, 2014 at 11:48:23AM +0000, David Drysdale wrote:
>>> Add a new O_BENEATH flag for openat(2) which restricts the
>>> provided path, rejecting (with -EACCES) paths that are not beneath
>>> the provided dfd. In particular, reject:
>>> - paths that contain .. components
>>> - paths that begin with /
>>> - symlinks that have paths as above.
>>
>> Yecch... The degree of usefulness aside (and I'm not convinced that
>it
>> is non-zero),
>
>This is extremely useful in conjunction with seccomp.
>
>> WTF pass one bit out of nameidata->flags in a separate argument?
>> Through the mutual recursion, no less... And then you are not even
>attempting
>> to detect symlinks that are not followed by interpretation of _any_
>pathname.
>
>How many symlinks like that are there? Is there anything except
>nd_jump_link users? All of those are in /proc. Arguably O_BENEATH
>should prevent traversal of all of those links.
Not commenting on the sanity of this one way or another, and I haven't read the patch. There is an absolutely trivial implementation of this.
After the path is resolved, walk backwards along d_parent and the mount tree, and see if you come to the file or directory dfd refers to.
That can handle magic proc symlinks, and does not need to disallow .. or / explicitly so it should be much simpler code.
My gut says that if Al says blech when looking at your code it is too complex to give you a security guarantee.
Eric
^ permalink raw reply
* Re: kdbus: add code to gather metadata
From: Andy Lutomirski @ 2014-11-03 17:05 UTC (permalink / raw)
To: Simon McVittie
Cc: Daniel Mack, Greg Kroah-Hartman, Linux API,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, John Stultz,
Arnd Bergmann, Tejun Heo, Marcel Holtmann, Ryan Lortie,
Bastien Nocera, David Herrmann, Djalal Harouni, alban.crequy,
Javier Martinez Canillas, Tom Gundersen
In-Reply-To: <54576E48.40800-ZGY8ohtN/8pPYcu2f3hruQ@public.gmane.org>
On Mon, Nov 3, 2014 at 4:00 AM, Simon McVittie
<simon.mcvittie-ZGY8ohtN/8pPYcu2f3hruQ@public.gmane.org> wrote:
> On 01/11/14 16:19, Andy Lutomirski wrote:
>> You can't justify logging fundamentally unverifiable things like the
>> command line by saying that you want to know if someone tries to play
>> (impossible-to-reliably-detect) games to obscure their command line.
>
> I think kdbus might be mixing up two orthogonal things here.
>
> It has an easy, kernel-checked, race-free way to determine
> kernel-mediated credential-like information that cannot be faked or
> interfered with (uid, primary gid, other gids?, security label,
> capabilities) because these are usable for security decisions, but if
> they are *not* received in a kernel-checked, race-free way, then they
> are useless.
>
> One concrete example of using non-ucred credential-like information is
> that traditional D-Bus can only restrict sysadmin tasks to uid 0 (or a
> root-equivalent uid in group sudo/admin/whatever), whereas when systemd
> and systemd-logind are run on kdbus, many of their D-Bus methods require
> specific capabilities(7): KillUser requires CAP_KILL, PowerOff requires
> CAP_SYS_BOOT, and so on. If capabilities(7) are a good thing, then
> that's surely a good thing too. (On the other hand, if you think
> capabilities(7) are a waste of time, then so is this.)
I think that capabilities(7) is largely a disaster. That aside, I
don't think that all of these capabilities should refer to the
*kernel* privileges. For example, CAP_SYS_BOOT should be, and remain,
the ability to reboot the system *yourself*. For example, if someone
wants to implement Windows 7 (or Visa? Or 2003? I forget.) style
reboot auditing, then userspace specifically does not want programs
that can ask for a reboot to hold CAP_SYS_BOOT.
>
> It also uses the same mechanism as an easy, race-free, but *not*
> kernel-checked way to determine bits and pieces that are valuable for
> debugging (dbus-monitor etc.), but unsuitable for security decisions,
> such as cmdline.
>
> In traditional D-Bus, you can get the uid and pid of a remote process,
> but in a debug log you would probably actually prefer to log the cmdline
> in addition; yes a malicious user could fake the cmdline, but when
> debugging a system problem, information that is known to be forgeable
> seems better than no information at all. After all, ps(1) shows the
> forgeable cmdline, not just the executable. You can get that by
> rummaging in /proc/$pid, but there is a race: if the remote process
> exits too soon (a "fire and forget" method call) then you'll never know
> who it was. kdbus solves that race, but does not make cmdline unforgeable.
>
I would love to fix this race. That opens the door to a lot more
debugging (maps, status, etc) while not pretending to offer a kernel
check that doesn't, and can't, exist.
Materializing some sort of pid fd on the receiving end would be one
solution. Another would be to give the receiver some token that can
be used to check for pid recycling.
Can we give each task a pid-namespace-local 64-bit (or even longer)
number that is mostly guaranteed not to be reused for the lifetime of
the namespace?
For example, give each task a per-pidns tid_unique, and give each
pidns a next_tid_unique. Creating a task assigns it
next_tid_unique++. (On overflow, clone fails.) For CRIU's benefit,
if you have CAP_SYS_ADMIN over the pidns's userns and you are not in
the pidns, then you can change tid_unique and next_tid_unique. Doing
that carelessly will introduce races, but that's fine -- it should
only be done on restore when everything's frozen.
> If client libraries wishing to attach their cmdline (or other debug
> info) to messages for debugging were required to add it as an
> out-of-band KDBUS_ITEM, or as a D-Bus message header inside the payload,
> then that would be duplicating work in client libraries that could have
> been done centrally, but would still solve the race.
>
I much prefer that approach. If a kernel feature is being added to
just to avoid duplication of a debugging aid in user code, then let's
leave it to user code.
--Andy
^ permalink raw reply
* Re: [PATCH 1/3] fs: add O_BENEATH flag to openat(2)
From: Andy Lutomirski @ 2014-11-03 15:42 UTC (permalink / raw)
To: Al Viro
Cc: David Drysdale,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Kees Cook,
Greg Kroah-Hartman, Meredydd Luff, Will Drewry,
Jorge Lucangeli Obes, Ricky Zhou, Lee Campbell, Julien Tinnes,
Mike Depinet, James Morris, Paolo Bonzini, Paul Moore,
Christoph Hellwig, Eric W. Biederman, Linux API, LSM List
In-Reply-To: <20141103152036.GA7996-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
On Mon, Nov 3, 2014 at 7:20 AM, Al Viro <viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn@public.gmane.org> wrote:
> On Mon, Nov 03, 2014 at 11:48:23AM +0000, David Drysdale wrote:
>> Add a new O_BENEATH flag for openat(2) which restricts the
>> provided path, rejecting (with -EACCES) paths that are not beneath
>> the provided dfd. In particular, reject:
>> - paths that contain .. components
>> - paths that begin with /
>> - symlinks that have paths as above.
>
> Yecch... The degree of usefulness aside (and I'm not convinced that it
> is non-zero),
This is extremely useful in conjunction with seccomp.
> WTF pass one bit out of nameidata->flags in a separate argument?
> Through the mutual recursion, no less... And then you are not even attempting
> to detect symlinks that are not followed by interpretation of _any_ pathname.
How many symlinks like that are there? Is there anything except
nd_jump_link users? All of those are in /proc. Arguably O_BENEATH
should prevent traversal of all of those links.
--Andy
^ permalink raw reply
* Re: [PATCH v6] kernel, add panic_on_warn
From: Vivek Goyal @ 2014-11-03 15:21 UTC (permalink / raw)
To: Prarit Bhargava
Cc: Andi Kleen, hedi-sJ/iWh9BUns, Jonathan Corbet,
jbaron-JqFfY2XvxFXQT0dZR+AlfA,
kexec-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Rusty Russell,
linux-doc-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA, Fabian Frederick,
isimatu.yasuaki-+CUm20s59erQFUHtdCDX3A, H. Peter Anvin,
Masami Hiramatsu, Andrew Morton, linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1415025143-4345-1-git-send-email-prarit-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
On Mon, Nov 03, 2014 at 09:32:23AM -0500, Prarit Bhargava wrote:
[..]
> +
> +static int __init panic_on_warn_setup(char *s)
> +{
> + /* Enabling this on a kdump kernel could cause a bogus panic. */
> + if (!is_kdump_kernel())
> + panic_on_warn = 1;
I think it would be better if we leave it to user space to remove
panic_on_warn from command line of second kernel.
Thanks
Vivek
^ permalink raw reply
* Re: [PATCH 1/3] fs: add O_BENEATH flag to openat(2)
From: Al Viro @ 2014-11-03 15:20 UTC (permalink / raw)
To: David Drysdale
Cc: linux-kernel, Kees Cook, Greg Kroah-Hartman, Meredydd Luff,
Will Drewry, Jorge Lucangeli Obes, Ricky Zhou, Lee Campbell,
Julien Tinnes, Mike Depinet, James Morris, Andy Lutomirski,
Paolo Bonzini, Paul Moore, Christoph Hellwig, Eric W. Biederman,
linux-api, linux-security-module
In-Reply-To: <1415015305-15494-2-git-send-email-drysdale@google.com>
On Mon, Nov 03, 2014 at 11:48:23AM +0000, David Drysdale wrote:
> Add a new O_BENEATH flag for openat(2) which restricts the
> provided path, rejecting (with -EACCES) paths that are not beneath
> the provided dfd. In particular, reject:
> - paths that contain .. components
> - paths that begin with /
> - symlinks that have paths as above.
Yecch... The degree of usefulness aside (and I'm not convinced that it
is non-zero), WTF pass one bit out of nameidata->flags in a separate argument?
Through the mutual recursion, no less... And then you are not even attempting
to detect symlinks that are not followed by interpretation of _any_ pathname.
^ permalink raw reply
* Re: [PATCH] kernel, add panic_on_warn
From: Vivek Goyal @ 2014-11-03 15:18 UTC (permalink / raw)
To: Prarit Bhargava
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andi Kleen, Jonathan Corbet,
kexec-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Rusty Russell,
linux-doc-u79uwXL29TY76Z2rM5mHXA, jbaron-JqFfY2XvxFXQT0dZR+AlfA,
Fabian Frederick, isimatu.yasuaki-+CUm20s59erQFUHtdCDX3A,
H. Peter Anvin, Masami Hiramatsu, Andrew Morton,
linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <545783FA.4050508-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
On Mon, Nov 03, 2014 at 08:32:42AM -0500, Prarit Bhargava wrote:
>
>
> On 10/30/2014 09:58 PM, Hedi Berriche wrote:
> > On Thu, Oct 30, 2014 at 17:06 Prarit Bhargava wrote:
> > | There have been several times where I have had to rebuild a kernel to
> > | cause a panic when hitting a WARN() in the code in order to get a crash
> > | dump from a system. Sometimes this is easy to do, other times (such as
> > | in the case of a remote admin) it is not trivial to send new images to the
> > | user.
> > |
> > | A much easier method would be a switch to change the WARN() over to a
> > | panic. This makes debugging easier in that I can now test the actual
> > | image the WARN() was seen on and I do not have to engage in remote
> > | debugging.
> >
> > Do we want to leave it to usersspace[1] to ensure panic_on_warn is out
> > of the way in when the kdump kernel boots? or would a self-contained
> > approach be more preferable i.e. test whether we're a kdump kernel
> > before bothering with panic_on_warn?
>
> Hmm ... this is a good point. Vivek, do you have a preference? I'm willing to
> code it either way. I should be able to put in a is_kdump_kernel() check
> without any problems but I'm not sure if that is the right thing to do here.
>
I think it will make sense to modify user space scripts to get rid of
panic_on_warn for kdump kernel.
Thanks
Vivek
^ permalink raw reply
* Re: [PATCH 2/2] perf: Userspace software event and ioctl
From: Pawel Moll @ 2014-11-03 15:04 UTC (permalink / raw)
To: Tomeu Vizoso
Cc: Peter Zijlstra, Richard Cochran, Steven Rostedt, Ingo Molnar,
Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz,
linux-kernel@vger.kernel.org, linux-api@vger.kernel.org
In-Reply-To: <CAAObsKBw9VU3oZZSz5hV-rNXK2WAkjXeOgY_X1AF1Lpp9iLk7g@mail.gmail.com>
On Mon, 2014-11-03 at 14:48 +0000, Tomeu Vizoso wrote:
> On 29 September 2014 17:53, Pawel Moll <pawel.moll@arm.com> wrote:
> > On Mon, 2014-09-29 at 16:32 +0100, Peter Zijlstra wrote:
> >> Also none of the many words above describe
> >> PERF_SAMPLE_USERSPACE_EVENT(), wth is that about?
> >
> > Hopefully description of the v2 makes better job in this:
> >
> > http://thread.gmane.org/gmane.linux.kernel/1793272/focus=4813
> >
> > where it's already called "UEVENT" and was generated by write().
> >
> > Before you get into this, though, the most important outcomes of both v1
> > and v2 discussions:
> >
> > * Ingo suggested prctl(PR_TRACE_UEVENT, type, size, data, 0) as the way
> > of generating such events (so the tracee doesn't have to know the fd to
> > do ioctl); Frederic seems to have the same on his mind.
> >
> > * Namhyung proposed sticking the userspace-originating events into the
> > buffer as PERF_RECORD_UEVENT rather then PERF_SAMPLE_UEVENT.
> >
> > Working on making both happen now.
>
> are you still working on this? Would be happy to lend a hand if that
> can speed things up.
By all means! In fact I'm typing commit messages right now and will post
the patches later today. Stay tuned and I'm looking forward to all
suggestions, reviews etc.
Cheers!
Pawel
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox