From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1758626Ab3KMHZb (ORCPT <rfc822;w@1wt.eu>);
	Wed, 13 Nov 2013 02:25:31 -0500
Received: from cn.fujitsu.com ([222.73.24.84]:58268 "EHLO song.cn.fujitsu.com"
	rhost-flags-OK-FAIL-OK-OK) by vger.kernel.org with ESMTP
	id S1758023Ab3KMHZW (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
	Wed, 13 Nov 2013 02:25:22 -0500
X-IronPort-AV: E=Sophos;i="4.93,690,1378828800"; 
   d="scan'208";a="9022471"
Message-ID: <5283299B.8080702@cn.fujitsu.com>
Date: Wed, 13 Nov 2013 15:26:19 +0800
From: Gao feng <gaofeng@cn.fujitsu.com>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.0
MIME-Version: 1.0
To: "Eric W. Biederman" <ebiederm@xmission.com>
CC: Linux Containers <containers@lists.linux-foundation.org>,
        "Serge E. Hallyn" <serge@hallyn.com>, linux-fsdevel@vger.kernel.org,
        linux-kernel@vger.kernel.org, Andy Lutomirski <luto@amacapital.net>
Subject: Re: [REVIEW][PATCH 1/2] userns: Better restrictions on when proc
 and sysfs can be mounted
References: <878uzmhkqg.fsf@xmission.com> <52749663.2000701@cn.fujitsu.com>	<527C4D88.10907@cn.fujitsu.com> <87k3gigmgj.fsf@xmission.com>
In-Reply-To: <87k3gigmgj.fsf@xmission.com>
X-MIMETrack: Itemize by SMTP Server on mailserver/fnst(Release 8.5.3|September 15, 2011) at
 2013/11/13 15:23:33,
	Serialize by Router on mailserver/fnst(Release 8.5.3|September 15, 2011) at
 2013/11/13 15:23:34,
	Serialize complete at 2013/11/13 15:23:34
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On 11/09/2013 01:42 PM, Eric W. Biederman wrote:
> Gao feng <gaofeng@cn.fujitsu.com> writes:
> 
>> On 11/02/2013 02:06 PM, Gao feng wrote:
>>> Hi Eric,
>>>
>>> On 08/28/2013 05:44 AM, Eric W. Biederman wrote:
>>>>
>>>> Rely on the fact that another flavor of the filesystem is already
>>>> mounted and do not rely on state in the user namespace.
>>>>
>>>> Verify that the mounted filesystem is not covered in any significant
>>>> way.  I would love to verify that the previously mounted filesystem
>>>> has no mounts on top but there are at least the directories
>>>> /proc/sys/fs/binfmt_misc and /sys/fs/cgroup/ that exist explicitly
>>>> for other filesystems to mount on top of.
>>>>
>>>> Refactor the test into a function named fs_fully_visible and call that
>>>> function from the mount routines of proc and sysfs.  This makes this
>>>> test local to the filesystems involved and the results current of when
>>>> the mounts take place, removing a weird threading of the user
>>>> namespace, the mount namespace and the filesystems themselves.
>>>>
>>>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>>>> ---
>>>>  fs/namespace.c                 |   37 +++++++++++++++++++++++++------------
>>>>  fs/proc/root.c                 |    7 +++++--
>>>>  fs/sysfs/mount.c               |    3 ++-
>>>>  include/linux/fs.h             |    1 +
>>>>  include/linux/user_namespace.h |    4 ----
>>>>  kernel/user.c                  |    2 --
>>>>  kernel/user_namespace.c        |    2 --
>>>>  7 files changed, 33 insertions(+), 23 deletions(-)
>>>>
>>>> diff --git a/fs/namespace.c b/fs/namespace.c
>>>> index 64627f8..877e427 100644
>>>> --- a/fs/namespace.c
>>>> +++ b/fs/namespace.c
>>>> @@ -2867,25 +2867,38 @@ bool current_chrooted(void)
>>>>  	return chrooted;
>>>>  }
>>>>  
>>>> -void update_mnt_policy(struct user_namespace *userns)
>>>> +bool fs_fully_visible(struct file_system_type *type)
>>>>  {
>>>>  	struct mnt_namespace *ns = current->nsproxy->mnt_ns;
>>>>  	struct mount *mnt;
>>>> +	bool visible = false;
>>>>  
>>>> -	down_read(&namespace_sem);
>>>> +	if (unlikely(!ns))
>>>> +		return false;
>>>> +
>>>> +	namespace_lock();
>>>>  	list_for_each_entry(mnt, &ns->list, mnt_list) {
>>>> -		switch (mnt->mnt.mnt_sb->s_magic) {
>>>> -		case SYSFS_MAGIC:
>>>> -			userns->may_mount_sysfs = true;
>>>> -			break;
>>>> -		case PROC_SUPER_MAGIC:
>>>> -			userns->may_mount_proc = true;
>>>> -			break;
>>>> +		struct mount *child;
>>>> +		if (mnt->mnt.mnt_sb->s_type != type)
>>>> +			continue;
>>>> +
>>>> +		/* This mount is not fully visible if there are any child mounts
>>>> +		 * that cover anything except for empty directories.
>>>> +		 */
>>>> +		list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
>>>> +			struct inode *inode = child->mnt_mountpoint->d_inode;
>>>> +			if (!S_ISDIR(inode->i_mode))
>>>> +				goto next;
>>>> +			if (inode->i_nlink != 2)
>>>> +				goto next;
>>>
>>>
>>> I hit a problem: the proc filesystem fails to mount in a user namespace.
>>> The cause is that the i_nlink of sysctl entries under proc is not 2. It
>>> is always 1, even for a directory; see proc_sys_make_inode. And for
>>> btrfs, the i_nlink of an empty dir is 2 too. So the value seems to depend
>>> on the filesystem itself, not on the VFS. On my system binfmt_misc is
>>> mounted on /proc/sys/fs/binfmt_misc, and the i_nlink of that directory's
>>> inode is 1.
> 
> Yes. 1 is what filesystems that are too lazy to count the number of
> links to a directory return, and /proc/sys is currently such a
> filesystem.
> 
> Ordinarily nlink == 2 means a directory does not have any subdirectories.
> 
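
To make the nlink point concrete, here is a minimal userspace sketch (not
kernel code). It creates a scratch directory, puts one regular file in it,
and prints st_nlink next to the answer readdir() gives. The names
dir_is_empty and nlink_vs_readdir_demo and the /tmp path are made up for
illustration, and the nlink values printed depend on the filesystem that
backs /tmp.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Return 1 if the directory holds nothing besides "." and "..",
 * 0 if it holds at least one entry, -1 on error. */
int dir_is_empty(const char *path)
{
	DIR *d = opendir(path);
	struct dirent *e;
	int empty = 1;

	if (!d)
		return -1;
	while ((e = readdir(d)) != NULL) {
		if (strcmp(e->d_name, ".") && strcmp(e->d_name, "..")) {
			empty = 0;
			break;
		}
	}
	closedir(d);
	return empty;
}

/* Returns 0 if readdir() saw the directory go from empty to non-empty
 * after one regular file was created in it, -1 otherwise. */
int nlink_vs_readdir_demo(void)
{
	char dir[] = "/tmp/nlink-demo-XXXXXX";	/* assumed scratch location */
	char file[64];
	struct stat before, after;
	int fd, n, was_empty, is_empty, ret = -1;

	if (!mkdtemp(dir))
		return -1;
	n = snprintf(file, sizeof(file), "%s/f", dir);
	assert(n > 0 && (size_t)n < sizeof(file));

	if (stat(dir, &before) < 0)
		goto out_rmdir;
	was_empty = dir_is_empty(dir);

	fd = open(file, O_CREAT | O_WRONLY, 0600);
	if (fd < 0)
		goto out_rmdir;
	close(fd);

	if (stat(dir, &after) < 0)
		goto out_unlink;
	is_empty = dir_is_empty(dir);

	/* A regular file does not add a link to its parent directory, so
	 * the two nlink values are typically equal on Linux filesystems. */
	printf("nlink before=%lu after=%lu, empty before=%d after=%d\n",
	       (unsigned long)before.st_nlink, (unsigned long)after.st_nlink,
	       was_empty, is_empty);
	ret = (was_empty == 1 && is_empty == 0) ? 0 : -1;

out_unlink:
	unlink(file);
out_rmdir:
	rmdir(dir);
	return ret;
}
```

Calling nlink_vs_readdir_demo() from a small main() is enough to try it.
Only the readdir()-based check reliably notices the new file, which is why
an i_nlink test alone cannot tell whether a mountpoint is empty.
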
>>> BTW, I don't quite understand what the inode->i_nlink != 2 check means
>>> here. Is it testing whether the directory is empty? As far as I know,
>>> creating a file (not a dir) under a dir does not increase the dir's
>>> i_nlink.
>>>
>>> Another question: it looks like if we don't already have proc/sys
>>> mounted, then mounting proc/sys will fail?
>>>
>>
>> Any ideas? Or should we revert this patch?
> 
> The patch is mostly doing what it is supposed to be doing.
> 
> Now the code is slightly buggy.  inode->i_nlink will test to see if a
> directory has subdirectories but it won't test to see if a directory is
> empty.  Where did my brain go when I was writing that test?
> 
> Right now I would rather not have the empty directory exception than
> remove this code.
> 
> The test is a little trickier to write than it might otherwise be
> because /proc and /sys tend to be slightly imperfect filesystems.
> 
> I think the only way to really test that is to call readdir on the
> directory itself :(  I don't like that thought.
> 
> I don't know what I was thinking when I wrote that test but I definitely
> goofed up.  Grr!
> 
> I can certainly filter out any directory with nlink > 2.  That would be
> an easy partial step forward.
> 
> The real question though is how do I detect directories it is safe to
> mount on where there will not be files in them.  I can't call iterate
> with the namespace_lock held so things are a bit tricky.
> 

I know this problem is not easy to solve. Why not let the user make the
decision? Maybe we can introduce a new mount option, MS_LOCK: if a user
wants to use a mount to hide something, he should mount it with MS_LOCK.
Then an unprivileged user can't umount that filesystem, and his attempt to
mount a filesystem fails if one of its child mounts is mounted with
MS_LOCK, unless he uses MS_REC too.
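
A rough userspace model of the policy I have in mind, only to pin down the
semantics. Nothing here is kernel code: struct toy_mount, TOY_MS_LOCK,
TOY_MS_REC and the toy_* helpers are all hypothetical names, and MS_LOCK
itself does not exist today. The sketch only looks at direct child mounts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MS_LOCK 0x1u	/* hypothetical: "this mount hides something" */
#define TOY_MS_REC  0x2u	/* stands in for the real MS_REC */

struct toy_mount {
	unsigned int flags;			/* flags the mount was made with */
	const struct toy_mount *children;	/* mounts directly on top of it */
	size_t nr_children;
};

/* True if any direct child of @mnt was mounted with TOY_MS_LOCK. */
bool toy_has_locked_child(const struct toy_mount *mnt)
{
	size_t i;

	assert(mnt != NULL);
	for (i = 0; i < mnt->nr_children; i++) {
		if (mnt->children[i].flags & TOY_MS_LOCK)
			return true;
	}
	return false;
}

/* Unprivileged mount request against an @existing mount of the same
 * filesystem: refused when a locked child covers part of it, unless the
 * request carries the children along (TOY_MS_REC), so the cover stays. */
bool toy_may_mount(const struct toy_mount *existing, unsigned int req_flags)
{
	if (req_flags & TOY_MS_REC)
		return true;
	return !toy_has_locked_child(existing);
}

/* Unprivileged users may not umount a locked mount. */
bool toy_may_umount(const struct toy_mount *mnt, bool privileged)
{
	assert(mnt != NULL);
	return privileged || !(mnt->flags & TOY_MS_LOCK);
}
```

With this model the kernel would not have to guess whether a covered
directory is empty; the person who created the cover states the intent.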

Thanks