Re: [REVIEW][PATCH 1/2] userns: Better restrictions on when proc and sysfs can be mounted

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Gao feng <gaofeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
To: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Linux Containers
	<containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org>,
	Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Subject: Re: [REVIEW][PATCH 1/2] userns: Better restrictions on when proc and sysfs can be mounted
Date: Wed, 13 Nov 2013 15:26:19 +0800	[thread overview]
Message-ID: <5283299B.8080702@cn.fujitsu.com> (raw)
In-Reply-To: <87k3gigmgj.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>

On 11/09/2013 01:42 PM, Eric W. Biederman wrote:
> Gao feng <gaofeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> writes:
> 
>> On 11/02/2013 02:06 PM, Gao feng wrote:
>>> Hi Eric,
>>>
>>> On 08/28/2013 05:44 AM, Eric W. Biederman wrote:
>>>>
>>>> Rely on the fact that another flavor of the filesystem is already
>>>> mounted and do not rely on state in the user namespace.
>>>>
>>>> Verify that the mounted filesystem is not covered in any significant
>>>> way.  I would love to verify that the previously mounted filesystem
>>>> has no mounts on top but there are at least the directories
>>>> /proc/sys/fs/binfmt_misc and /sys/fs/cgroup/ that exist explicitly
>>>> for other filesystems to mount on top of.
>>>>
>>>> Refactor the test into a function named fs_fully_visible and call that
>>>> function from the mount routines of proc and sysfs.  This makes this
>>>> test local to the filesystems involved and the results current of when
>>>> the mounts take place, removing a weird threading of the user
>>>> namespace, the mount namespace and the filesystems themselves.
>>>>
>>>> Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
>>>> ---
>>>>  fs/namespace.c                 |   37 +++++++++++++++++++++++++------------
>>>>  fs/proc/root.c                 |    7 +++++--
>>>>  fs/sysfs/mount.c               |    3 ++-
>>>>  include/linux/fs.h             |    1 +
>>>>  include/linux/user_namespace.h |    4 ----
>>>>  kernel/user.c                  |    2 --
>>>>  kernel/user_namespace.c        |    2 --
>>>>  7 files changed, 33 insertions(+), 23 deletions(-)
>>>>
>>>> diff --git a/fs/namespace.c b/fs/namespace.c
>>>> index 64627f8..877e427 100644
>>>> --- a/fs/namespace.c
>>>> +++ b/fs/namespace.c
>>>> @@ -2867,25 +2867,38 @@ bool current_chrooted(void)
>>>>  	return chrooted;
>>>>  }
>>>>  
>>>> -void update_mnt_policy(struct user_namespace *userns)
>>>> +bool fs_fully_visible(struct file_system_type *type)
>>>>  {
>>>>  	struct mnt_namespace *ns = current->nsproxy->mnt_ns;
>>>>  	struct mount *mnt;
>>>> +	bool visible = false;
>>>>  
>>>> -	down_read(&namespace_sem);
>>>> +	if (unlikely(!ns))
>>>> +		return false;
>>>> +
>>>> +	namespace_lock();
>>>>  	list_for_each_entry(mnt, &ns->list, mnt_list) {
>>>> -		switch (mnt->mnt.mnt_sb->s_magic) {
>>>> -		case SYSFS_MAGIC:
>>>> -			userns->may_mount_sysfs = true;
>>>> -			break;
>>>> -		case PROC_SUPER_MAGIC:
>>>> -			userns->may_mount_proc = true;
>>>> -			break;
>>>> +		struct mount *child;
>>>> +		if (mnt->mnt.mnt_sb->s_type != type)
>>>> +			continue;
>>>> +
>>>> +		/* This mount is not fully visible if there are any child mounts
>>>> +		 * that cover anything except for empty directories.
>>>> +		 */
>>>> +		list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
>>>> +			struct inode *inode = child->mnt_mountpoint->d_inode;
>>>> +			if (!S_ISDIR(inode->i_mode))
>>>> +				goto next;
>>>> +			if (inode->i_nlink != 2)
>>>> +				goto next;
>>>
>>>
>>> I met a problem that proc filesystem failed to mount in user namespace,
>>> The problem is the i_nlink of sysctl entries under proc filesystem is not
>>> 2. it always is 1 even it's a directory, see proc_sys_make_inode. and for
>>> btrfs, the i_nlink for an empty dir is 2 too. it seems like depends on the
>>> filesystem itself,not depends on vfs. In my system binfmt_misc is mounted
>>> on /proc/sys/fs/binfmt_misc, and the i_nlink of this directory's inode is
>>> 1.
> 
> Yes. 1 is what filesystems that are too lazy to count the number of
> links to a directory return, and /proc/sys is currently such a
> filesystem.
> 
> Ordinarily nlink == 2 means a directory does not have any subdirectories.
> 
>>> btw, I'm not quite understand what's the inode->i_nlink != 2 here means?
>>> is this directory empty? as I know, when we create a file(not dir) under
>>> a dir, the i_nlink of this dir will not increase.
>>>
>>> And another question, it looks like if we don't have proc/sys fs mounted,
>>> then proc/sys will be failed to be mounted?
>>>
>>
>> Any Idea?? or should we need to revert this patch??
> 
> The patch is mostly doing what it is supposed to be doing.
> 
> Now the code is slightly buggy.  inode->i_nlink will test to see if a
> directory has subdirectories but it won't test to see if a directory is
> empty.  Where did my brain go when I was writing that test?
> 
> Right now I would rather not have the empty directory exception than
> remove this code.
> 
> The test is a little trickier to write than it might otherwise be
> because /proc and /sys tend to be slightly imperfect filesystems.
> 
> I think the only way to really test that is to call readdir on the
> directory itself :(  I don't like that thought.
> 
> I don't know what I was thinking when I wrote that test but I definitely
> goofed up.  Grr!
> 
> I can certainly filter out any directory with nlink > 2.  That would be
> an easy partial step forward.
> 
> The real question though is how do I detect directories it is safe to
> mount on where there will not be files in them.  I can't call iterate
> with the namespace_lock held so things are a bit tricky.
> 

I know this problem is not easy to be resolved. why not let the user
make the decision?  maybe we can introduce a new mount option MS_LOCK,
if user wants to use mount to hide something, he should use mount with
option MS_LOCK. so the unpriviged user can't umount this filesystem and
fail to mount the filesystem if one of it's child mount is mounted with
MS_LOCK option otherwise he use MS_REC too.

Thanks

WARNING: multiple messages have this Message-ID (diff)

From: Gao feng <gaofeng@cn.fujitsu.com>
To: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Linux Containers <containers@lists.linux-foundation.org>,
	"Serge E. Hallyn" <serge@hallyn.com>,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	Andy Lutomirski <luto@amacapital.net>
Subject: Re: [REVIEW][PATCH 1/2] userns: Better restrictions on when proc and sysfs can be mounted
Date: Wed, 13 Nov 2013 15:26:19 +0800	[thread overview]
Message-ID: <5283299B.8080702@cn.fujitsu.com> (raw)
In-Reply-To: <87k3gigmgj.fsf@xmission.com>

On 11/09/2013 01:42 PM, Eric W. Biederman wrote:
> Gao feng <gaofeng@cn.fujitsu.com> writes:
> 
>> On 11/02/2013 02:06 PM, Gao feng wrote:
>>> Hi Eric,
>>>
>>> On 08/28/2013 05:44 AM, Eric W. Biederman wrote:
>>>>
>>>> Rely on the fact that another flavor of the filesystem is already
>>>> mounted and do not rely on state in the user namespace.
>>>>
>>>> Verify that the mounted filesystem is not covered in any significant
>>>> way.  I would love to verify that the previously mounted filesystem
>>>> has no mounts on top but there are at least the directories
>>>> /proc/sys/fs/binfmt_misc and /sys/fs/cgroup/ that exist explicitly
>>>> for other filesystems to mount on top of.
>>>>
>>>> Refactor the test into a function named fs_fully_visible and call that
>>>> function from the mount routines of proc and sysfs.  This makes this
>>>> test local to the filesystems involved and the results current of when
>>>> the mounts take place, removing a weird threading of the user
>>>> namespace, the mount namespace and the filesystems themselves.
>>>>
>>>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>>>> ---
>>>>  fs/namespace.c                 |   37 +++++++++++++++++++++++++------------
>>>>  fs/proc/root.c                 |    7 +++++--
>>>>  fs/sysfs/mount.c               |    3 ++-
>>>>  include/linux/fs.h             |    1 +
>>>>  include/linux/user_namespace.h |    4 ----
>>>>  kernel/user.c                  |    2 --
>>>>  kernel/user_namespace.c        |    2 --
>>>>  7 files changed, 33 insertions(+), 23 deletions(-)
>>>>
>>>> diff --git a/fs/namespace.c b/fs/namespace.c
>>>> index 64627f8..877e427 100644
>>>> --- a/fs/namespace.c
>>>> +++ b/fs/namespace.c
>>>> @@ -2867,25 +2867,38 @@ bool current_chrooted(void)
>>>>  	return chrooted;
>>>>  }
>>>>  
>>>> -void update_mnt_policy(struct user_namespace *userns)
>>>> +bool fs_fully_visible(struct file_system_type *type)
>>>>  {
>>>>  	struct mnt_namespace *ns = current->nsproxy->mnt_ns;
>>>>  	struct mount *mnt;
>>>> +	bool visible = false;
>>>>  
>>>> -	down_read(&namespace_sem);
>>>> +	if (unlikely(!ns))
>>>> +		return false;
>>>> +
>>>> +	namespace_lock();
>>>>  	list_for_each_entry(mnt, &ns->list, mnt_list) {
>>>> -		switch (mnt->mnt.mnt_sb->s_magic) {
>>>> -		case SYSFS_MAGIC:
>>>> -			userns->may_mount_sysfs = true;
>>>> -			break;
>>>> -		case PROC_SUPER_MAGIC:
>>>> -			userns->may_mount_proc = true;
>>>> -			break;
>>>> +		struct mount *child;
>>>> +		if (mnt->mnt.mnt_sb->s_type != type)
>>>> +			continue;
>>>> +
>>>> +		/* This mount is not fully visible if there are any child mounts
>>>> +		 * that cover anything except for empty directories.
>>>> +		 */
>>>> +		list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
>>>> +			struct inode *inode = child->mnt_mountpoint->d_inode;
>>>> +			if (!S_ISDIR(inode->i_mode))
>>>> +				goto next;
>>>> +			if (inode->i_nlink != 2)
>>>> +				goto next;
>>>
>>>
>>> I met a problem that proc filesystem failed to mount in user namespace,
>>> The problem is the i_nlink of sysctl entries under proc filesystem is not
>>> 2. it always is 1 even it's a directory, see proc_sys_make_inode. and for
>>> btrfs, the i_nlink for an empty dir is 2 too. it seems like depends on the
>>> filesystem itself,not depends on vfs. In my system binfmt_misc is mounted
>>> on /proc/sys/fs/binfmt_misc, and the i_nlink of this directory's inode is
>>> 1.
> 
> Yes. 1 is what filesystems that are too lazy to count the number of
> links to a directory return, and /proc/sys is currently such a
> filesystem.
> 
> Ordinarily nlink == 2 means a directory does not have any subdirectories.
> 
>>> btw, I'm not quite understand what's the inode->i_nlink != 2 here means?
>>> is this directory empty? as I know, when we create a file(not dir) under
>>> a dir, the i_nlink of this dir will not increase.
>>>
>>> And another question, it looks like if we don't have proc/sys fs mounted,
>>> then proc/sys will be failed to be mounted?
>>>
>>
>> Any Idea?? or should we need to revert this patch??
> 
> The patch is mostly doing what it is supposed to be doing.
> 
> Now the code is slightly buggy.  inode->i_nlink will test to see if a
> directory has subdirectories but it won't test to see if a directory is
> empty.  Where did my brain go when I was writing that test?
> 
> Right now I would rather not have the empty directory exception than
> remove this code.
> 
> The test is a little trickier to write than it might otherwise be
> because /proc and /sys tend to be slightly imperfect filesystems.
> 
> I think the only way to really test that is to call readdir on the
> directory itself :(  I don't like that thought.
> 
> I don't know what I was thinking when I wrote that test but I definitely
> goofed up.  Grr!
> 
> I can certainly filter out any directory with nlink > 2.  That would be
> an easy partial step forward.
> 
> The real question though is how do I detect directories it is safe to
> mount on where there will not be files in them.  I can't call iterate
> with the namespace_lock held so things are a bit tricky.
> 

I know this problem is not easy to be resolved. why not let the user
make the decision?  maybe we can introduce a new mount option MS_LOCK,
if user wants to use mount to hide something, he should use mount with
option MS_LOCK. so the unpriviged user can't umount this filesystem and
fail to mount the filesystem if one of it's child mount is mounted with
MS_LOCK option otherwise he use MS_REC too.

Thanks

next prev parent reply	other threads:[~2013-11-13  7:26 UTC|newest]

Thread overview: 45+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-08-27 21:44 [REVIEW][PATCH 1/2] userns: Better restrictions on when proc and sysfs can be mounted Eric W. Biederman
2013-08-27 21:44 ` Eric W. Biederman
     [not found] ` <878uzmhkqg.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2013-08-27 21:46   ` [REVIEW][PATCH 2/2] sysfs: Restrict mounting sysfs Eric W. Biederman
2013-08-27 21:46     ` Eric W. Biederman
     [not found]     ` <874naahkng.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2013-08-28 19:00       ` Greg Kroah-Hartman
2013-08-28 19:00         ` Greg Kroah-Hartman
2013-09-23 10:33       ` James Hogan
2013-09-23 10:33         ` James Hogan
     [not found]         ` <524018EA.9070202-1AXoQHu6uovQT0dZR+AlfA@public.gmane.org>
2013-09-23 21:41           ` [PATCH] sysfs: Allow mounting without CONFIG_NET Eric W. Biederman
2013-09-23 21:41             ` Eric W. Biederman
     [not found]             ` <87ioxrrzb6.fsf_-_-HxuHnoDHeQZYhcs0q7wBk77fW72O3V7zAL8bYrjMMd8@public.gmane.org>
2013-09-24 11:25               ` James Hogan
2013-09-24 11:25                 ` James Hogan
2013-08-27 21:47   ` [REVIEW][PATCH 1/2] userns: Better restrictions on when proc and sysfs can be mounted Andy Lutomirski
2013-08-27 21:47     ` Andy Lutomirski
     [not found]     ` <CALCETrWPDzuoaJp2ko5jAbwYUBqSdPjvO5uGo-gZVsS4Wm1PKQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-08-27 21:57       ` Eric W. Biederman
2013-08-27 21:57         ` Eric W. Biederman
     [not found]         ` <87a9k2g5la.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2013-09-01  4:45           ` Eric W. Biederman
2013-09-01  4:45         ` Eric W. Biederman
     [not found]           ` <87eh99noa0.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2013-09-03 17:40             ` Andy Lutomirski
2013-09-03 17:40               ` Andy Lutomirski
2013-11-02  6:06   ` Gao feng
2013-11-02  6:06     ` Gao feng
     [not found]     ` <52749663.2000701-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2013-11-04  7:00       ` Janne Karhunen
2013-11-04  7:00         ` Janne Karhunen
     [not found]         ` <CAE=NcrY+CzX+H4XQTdGj7CSZ98a5T=bNgT6=jGZzcjyaHb-ttw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-11-09  5:22           ` Eric W. Biederman
2013-11-09  5:22             ` Eric W. Biederman
2013-11-08  2:33       ` Gao feng
2013-11-08  2:33         ` Gao feng
     [not found]         ` <527C4D88.10907-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2013-11-09  5:42           ` Eric W. Biederman
2013-11-09  5:42             ` Eric W. Biederman
     [not found]             ` <87k3gigmgj.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2013-11-13  7:26               ` Gao feng [this message]
2013-11-13  7:26                 ` Gao feng
     [not found]                 ` <5283299B.8080702-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2013-11-14 11:10                   ` Gao feng
2013-11-14 11:10                     ` Gao feng
2013-11-14 11:10                     ` Gao feng
     [not found]                     ` <5284AF90.7060506-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2013-11-14 16:54                       ` Andy Lutomirski
2013-11-14 16:54                         ` Andy Lutomirski
     [not found]                         ` <CALCETrXtWtF=JgiwENNzh7UZKnXijHauOQ5ZjHYxYJC-BAU5Aw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-11-15  1:16                           ` Gao feng
2013-11-15  1:16                         ` Gao feng
     [not found]                           ` <528575EC.2030309-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2013-11-15  4:54                             ` Eric W. Biederman
2013-11-15  4:54                               ` Eric W. Biederman
     [not found]                               ` <87txfexo25.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2013-11-15  6:14                                 ` Gao feng
2013-11-15  6:14                                   ` Gao feng
     [not found]                                   ` <5285BBE2.7010001-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>
2013-11-15  8:37                                     ` Eric W. Biederman
2013-11-15  8:37                                       ` Eric W. Biederman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5283299B.8080702@cn.fujitsu.com \
    --to=gaofeng-bthxqxjhjhxqfuhtdcdx3a@public.gmane.org \
    --cc=containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org \
    --cc=ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org \
    --cc=linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
    --cc=linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
    --cc=luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.