linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [patch 0/2] symlinks for mapped files in proc
@ 2011-09-13 21:13 Cyrill Gorcunov
  2011-09-13 21:14 ` [patch 1/2] fs, proc: Make proc_get_link to use dentry instead of inode Cyrill Gorcunov
  2011-09-13 21:14 ` [patch 2/2] fs, proc: Introduce the /proc/<pid>/map_files/ directory v12 Cyrill Gorcunov
  0 siblings, 2 replies; 28+ messages in thread
From: Cyrill Gorcunov @ 2011-09-13 21:13 UTC (permalink / raw)
  To: linux-kernel, containers, linux-fsdevel
  Cc: Kirill Shutemov, Andrew Morton, Pavel Emelyanov, James Bottomley,
	Nathan Lynch, Zan Lynx, Daniel Lezcano, Vasiliy Kulikov

Hi, since due to discussion on the patch "fs, proc: Introduce the
/proc/<pid>/map_files/ directory" it was found that better to split
it into two patches (in a sake of bisectability and non-intersection
in code touched) -- this series does it. Please review. I hope all
concerns are addressed this time. Complains are still welcome.

	Cyrill

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [patch 1/2] fs, proc: Make proc_get_link to use dentry instead of inode
  2011-09-13 21:13 [patch 0/2] symlinks for mapped files in proc Cyrill Gorcunov
@ 2011-09-13 21:14 ` Cyrill Gorcunov
  2011-09-14  1:37   ` Kirill A. Shutemov
  2011-09-13 21:14 ` [patch 2/2] fs, proc: Introduce the /proc/<pid>/map_files/ directory v12 Cyrill Gorcunov
  1 sibling, 1 reply; 28+ messages in thread
From: Cyrill Gorcunov @ 2011-09-13 21:14 UTC (permalink / raw)
  To: linux-kernel, containers, linux-fsdevel
  Cc: Kirill Shutemov, Andrew Morton, Pavel Emelyanov, James Bottomley,
	Nathan Lynch, Zan Lynx, Daniel Lezcano, Vasiliy Kulikov,
	Cyrill Gorcunov, Tejun Heo, Alexey Dobriyan, Al Viro

[-- Attachment #1: fs-proc-switch-to-dentry --]
[-- Type: text/plain, Size: 3573 bytes --]

This patch prepares the ground for the next "map_files"
patch which needs a name of a link file to analyse.

So instead of squashing this change into one big
patch the separate one is done.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
CC: Pavel Emelyanov <xemul@parallels.com>
CC: Tejun Heo <tj@kernel.org>
CC: Vasiliy Kulikov <segoon@openwall.com>
CC: "Kirill A. Shutemov" <kirill@shutemov.name>
CC: Alexey Dobriyan <adobriyan@gmail.com>
CC: Al Viro <viro@ZenIV.linux.org.uk>
CC: Andrew Morton <akpm@linux-foundation.org>
---
 fs/proc/base.c          |   20 ++++++++++----------
 include/linux/proc_fs.h |    2 +-
 2 files changed, 11 insertions(+), 11 deletions(-)

Index: linux-2.6.git/fs/proc/base.c
===================================================================
--- linux-2.6.git.orig/fs/proc/base.c
+++ linux-2.6.git/fs/proc/base.c
@@ -165,9 +165,9 @@ static int get_task_root(struct task_str
 	return result;
 }
 
-static int proc_cwd_link(struct inode *inode, struct path *path)
+static int proc_cwd_link(struct dentry *dentry, struct path *path)
 {
-	struct task_struct *task = get_proc_task(inode);
+	struct task_struct *task = get_proc_task(dentry->d_inode);
 	int result = -ENOENT;
 
 	if (task) {
@@ -182,9 +182,9 @@ static int proc_cwd_link(struct inode *i
 	return result;
 }
 
-static int proc_root_link(struct inode *inode, struct path *path)
+static int proc_root_link(struct dentry *dentry, struct path *path)
 {
-	struct task_struct *task = get_proc_task(inode);
+	struct task_struct *task = get_proc_task(dentry->d_inode);
 	int result = -ENOENT;
 
 	if (task) {
@@ -1580,13 +1580,13 @@ static const struct file_operations proc
 	.release	= single_release,
 };
 
-static int proc_exe_link(struct inode *inode, struct path *exe_path)
+static int proc_exe_link(struct dentry *dentry, struct path *exe_path)
 {
 	struct task_struct *task;
 	struct mm_struct *mm;
 	struct file *exe_file;
 
-	task = get_proc_task(inode);
+	task = get_proc_task(dentry->d_inode);
 	if (!task)
 		return -ENOENT;
 	mm = get_task_mm(task);
@@ -1616,7 +1616,7 @@ static void *proc_pid_follow_link(struct
 	if (!proc_fd_access_allowed(inode))
 		goto out;
 
-	error = PROC_I(inode)->op.proc_get_link(inode, &nd->path);
+	error = PROC_I(inode)->op.proc_get_link(dentry, &nd->path);
 out:
 	return ERR_PTR(error);
 }
@@ -1655,7 +1655,7 @@ static int proc_pid_readlink(struct dent
 	if (!proc_fd_access_allowed(inode))
 		goto out;
 
-	error = PROC_I(inode)->op.proc_get_link(inode, &path);
+	error = PROC_I(inode)->op.proc_get_link(dentry, &path);
 	if (error)
 		goto out;
 
@@ -1947,9 +1947,9 @@ static int proc_fd_info(struct inode *in
 	return -ENOENT;
 }
 
-static int proc_fd_link(struct inode *inode, struct path *path)
+static int proc_fd_link(struct dentry *dentry, struct path *path)
 {
-	return proc_fd_info(inode, path, NULL);
+	return proc_fd_info(dentry->d_inode, path, NULL);
 }
 
 static int tid_fd_revalidate(struct dentry *dentry, struct nameidata *nd)
Index: linux-2.6.git/include/linux/proc_fs.h
===================================================================
--- linux-2.6.git.orig/include/linux/proc_fs.h
+++ linux-2.6.git/include/linux/proc_fs.h
@@ -253,7 +253,7 @@ extern const struct proc_ns_operations u
 extern const struct proc_ns_operations ipcns_operations;
 
 union proc_op {
-	int (*proc_get_link)(struct inode *, struct path *);
+	int (*proc_get_link)(struct dentry *, struct path *);
 	int (*proc_read)(struct task_struct *task, char *page);
 	int (*proc_show)(struct seq_file *m,
 		struct pid_namespace *ns, struct pid *pid,

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [patch 2/2] fs, proc: Introduce the /proc/<pid>/map_files/ directory v12
  2011-09-13 21:13 [patch 0/2] symlinks for mapped files in proc Cyrill Gorcunov
  2011-09-13 21:14 ` [patch 1/2] fs, proc: Make proc_get_link to use dentry instead of inode Cyrill Gorcunov
@ 2011-09-13 21:14 ` Cyrill Gorcunov
       [not found]   ` <20110914023428.GA4034@shutemov.name>
  2011-09-14  6:52   ` Andrew Morton
  1 sibling, 2 replies; 28+ messages in thread
From: Cyrill Gorcunov @ 2011-09-13 21:14 UTC (permalink / raw)
  To: linux-kernel, containers, linux-fsdevel
  Cc: Kirill Shutemov, Andrew Morton, Pavel Emelyanov, James Bottomley,
	Nathan Lynch, Zan Lynx, Daniel Lezcano, Vasiliy Kulikov,
	Cyrill Gorcunov, Tejun Heo, Alexey Dobriyan, Al Viro

[-- Attachment #1: cr-proc-map-files-19 --]
[-- Type: text/plain, Size: 14498 bytes --]

From: Pavel Emelyanov <xemul@parallels.com>

This one behaves similarly to the /proc/<pid>/fd/ one - it contains symlinks
one for each mapping with file, the name of a symlink is "vma->vm_start-vma->vm_end",
the target is the file. Opening a symlink results in a file that point exactly
to the same inode as them vma's one.

For example the ls -l of some arbitrary /proc/<pid>/map_files/

 | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80403000-7f8f80404000 -> /lib64/libc-2.5.so
 | lr-x------ 1 root root 64 Aug 26 06:40 7f8f8061e000-7f8f80620000 -> /lib64/libselinux.so.1
 | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80826000-7f8f80827000 -> /lib64/libacl.so.1.1.0
 | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a2f000-7f8f80a30000 -> /lib64/librt-2.5.so
 | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a30000-7f8f80a4c000 -> /lib64/ld-2.5.so

This *helps* checkpointing process in three ways:

1. When dumping a task mappings we do know exact file that is mapped by particular
   region. We do this by opening /proc/$pid/map_files/address symlink the way we do
   with file descriptors.

2. This also helps in determining which anonymous shared mappings are shared with
   each other by comparing the inodes of them.

3. When restoring a set of process in case two of them has a mapping shared, we map
   the memory by the 1st one and then open its /proc/$pid/map_files/address file and
   map it by the 2nd task.

Using /proc/$pid/maps for this is quite inconvenient since it brings repeatable
re-reading and reparsing for this text file which slows down restore procesure
significantly. Also as being pointed in (3) it is a way easier to use top level
shared mapping in children as /proc/$pid/map_files/address when needed.

v2: (spotted by Tejun Heo)
 - /proc/<pid>/mfd changed to /proc/<pid>/map_files
 - find_vma helper is used instead of linear search
 - routines are re-grouped
 - d_revalidate is set now

v3:
 - d_revalidate reworked, now it should drops no longer valid dentries (Tejun Heo)
 - ptrace_may_access added into proc_map_files_lookup (Vasiliy Kulikov)
 - because of filldir (which eventually might need to lock mmap_sem)
   the proc_map_files_readdir() was reworked to call proc_fill_cache()
   with unlocked mmap_sem

v4: (feedback by Tejun Heo and Vasiliy Kulikov)
 - instead of saving data in proc_inode we rather make a dentry name
   to keep both vm_start and vm_end accordingly
 - d_revalidate now honor task credentials

v5: (feedback by Kirill A. Shutemov)
 - don't forget to release mmap_sem on error path

v6:
 - sizeof get used in map_files_info which shrink member a bit on
   x86-32 (by Kirill A. Shutemov)
 - map_name_to_addr returns -EINVAL instead of -1
   which is more appropriate (by Tejun Heo)

v7:
 - add [get/set]attr handlers for
   proc_map_files_inode_operations (by Vasiliy Kulikov)

v8:
 - Kirill A. Shutemov spotted a parasite semicolon
   which ruined the ptrace_check call, fixed.

v9: (feedback by Andrew Morton)
 - find_exact_vma moved into include/linux/mm.h as an inline helper
 - proc_map_files_setattr uses either kmalloc or vmalloc depending
   on how many ojects are to be allocated
 - no more map_name_to_addr but dname_to_vma_addr introduced instead
   and it uses sscanf because in one case the find_exact_vma() is used
   only to confirm existence of vma area the boolean flag is used
 - fancy justification dropped
 - still the proc_map_files_get/setattr leaved untouched
   until additional fd/ patches applied first.

v10: (feedback by Andrew Morton)
 - flex_arrays are used instead of kmalloc/vmalloc calls
 - map_files_d_revalidate use ptrace_may_access for
   security reason (by Vasiliy Kulikov)

v11:
 - should use fput and drop !ret test from a loop code
   (feedback by Andrew Morton)
 - no need for 'used' variable, use existing
   nr_files with file->pos predicate
 - if preallocation fails no need to go further,
   simply release mmap semaphore and jump out

v12:
 - rework map_files_d_revalidate to make sure
   the task get released on return (by Vasiliy Kulikov)

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
CC: Tejun Heo <tj@kernel.org>
CC: Vasiliy Kulikov <segoon@openwall.com>
CC: "Kirill A. Shutemov" <kirill@shutemov.name>
CC: Alexey Dobriyan <adobriyan@gmail.com>
CC: Al Viro <viro@ZenIV.linux.org.uk>
CC: Andrew Morton <akpm@linux-foundation.org>
---
 fs/proc/base.c     |  365 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/mm.h |   12 +
 2 files changed, 377 insertions(+)

Index: linux-2.6.git/fs/proc/base.c
===================================================================
--- linux-2.6.git.orig/fs/proc/base.c
+++ linux-2.6.git/fs/proc/base.c
@@ -83,6 +83,7 @@
 #include <linux/pid_namespace.h>
 #include <linux/fs_struct.h>
 #include <linux/slab.h>
+#include <linux/flex_array.h>
 #ifdef CONFIG_HARDWALL
 #include <asm/hardwall.h>
 #endif
@@ -2171,6 +2172,369 @@ static const struct file_operations proc
 };
 
 /*
+ * dname_to_vma_addr - maps a dentry name into two unsigned longs
+ * which represent vma start and end addresses.
+ */
+static int dname_to_vma_addr(struct dentry *dentry,
+			     unsigned long *start, unsigned long *end)
+{
+	if (sscanf(dentry->d_name.name, "%lx-%lx", start, end) != 2)
+		return -EINVAL;
+
+	return 0;
+}
+
+static int map_files_d_revalidate(struct dentry *dentry, struct nameidata *nd)
+{
+	unsigned long vm_start, vm_end;
+	bool exact_vma_exists = false;
+	struct mm_struct *mm = NULL;
+	struct task_struct *task;
+	const struct cred *cred;
+	struct inode *inode;
+	int status = 0;
+
+	if (nd && nd->flags & LOOKUP_RCU)
+		return -ECHILD;
+
+	inode = dentry->d_inode;
+	task = get_proc_task(inode);
+	if (!task)
+		goto out_notask;
+
+	if (!ptrace_may_access(task, PTRACE_MODE_READ))
+		goto out;
+
+	mm = get_task_mm(task);
+	if (!mm)
+		goto out;
+
+	if (!dname_to_vma_addr(dentry, &vm_start, &vm_end)) {
+		down_read(&mm->mmap_sem);
+		exact_vma_exists = !!find_exact_vma(mm, vm_start, vm_end);
+		up_read(&mm->mmap_sem);
+	}
+
+	mmput(mm);
+
+	if (exact_vma_exists) {
+		if (task_dumpable(task)) {
+			rcu_read_lock();
+			cred = __task_cred(task);
+			inode->i_uid = cred->euid;
+			inode->i_gid = cred->egid;
+			rcu_read_unlock();
+		} else {
+			inode->i_uid = 0;
+			inode->i_gid = 0;
+		}
+		security_task_to_inode(task, inode);
+		status = 1;
+	}
+
+out:
+	put_task_struct(task);
+
+out_notask:
+	if (status <= 0)
+		d_drop(dentry);
+
+	return status;
+}
+
+static const struct dentry_operations tid_map_files_dentry_operations = {
+	.d_revalidate	= map_files_d_revalidate,
+	.d_delete	= pid_delete_dentry,
+};
+
+static int proc_map_files_get_link(struct dentry *dentry, struct path *path)
+{
+	unsigned long vm_start, vm_end;
+	struct vm_area_struct *vma;
+	struct task_struct *task;
+	struct mm_struct *mm;
+	int rc = -ENOENT;
+
+	task = get_proc_task(dentry->d_inode);
+	if (!task)
+		goto out;
+
+	mm = get_task_mm(task);
+	put_task_struct(task);
+	if (!mm)
+		goto out;
+
+	rc = dname_to_vma_addr(dentry, &vm_start, &vm_end);
+	if (rc)
+		goto out_mmput;
+
+	down_read(&mm->mmap_sem);
+	vma = find_exact_vma(mm, vm_start, vm_end);
+	if (vma && vma->vm_file) {
+		*path = vma->vm_file->f_path;
+		path_get(path);
+		rc = 0;
+	}
+	up_read(&mm->mmap_sem);
+
+out_mmput:
+	mmput(mm);
+out:
+	return rc;
+}
+
+struct map_files_info {
+	struct file	*file;
+	unsigned long	len;
+	unsigned char	name[4*sizeof(long)+2]; /* max: %lx-%lx\0 */
+};
+
+static struct dentry *
+proc_map_files_instantiate(struct inode *dir, struct dentry *dentry,
+			   struct task_struct *task, const void *ptr)
+{
+	const struct file *file = ptr;
+	struct proc_inode *ei;
+	struct inode *inode;
+
+	if (!file)
+		return ERR_PTR(-ENOENT);
+
+	inode = proc_pid_make_inode(dir->i_sb, task);
+	if (!inode)
+		return ERR_PTR(-ENOENT);
+
+	ei = PROC_I(inode);
+	ei->op.proc_get_link = proc_map_files_get_link;
+
+	inode->i_op = &proc_pid_link_inode_operations;
+	inode->i_size = 64;
+	inode->i_mode = S_IFLNK;
+
+	if (file->f_mode & FMODE_READ)
+		inode->i_mode |= S_IRUSR | S_IXUSR;
+	if (file->f_mode & FMODE_WRITE)
+		inode->i_mode |= S_IWUSR | S_IXUSR;
+
+	d_set_d_op(dentry, &tid_map_files_dentry_operations);
+	d_add(dentry, inode);
+
+	return NULL;
+}
+
+static struct dentry *proc_map_files_lookup(struct inode *dir,
+		struct dentry *dentry, struct nameidata *nd)
+{
+	unsigned long vm_start, vm_end;
+	struct task_struct *task;
+	struct vm_area_struct *vma;
+	struct mm_struct *mm;
+	struct dentry *result;
+
+	result = ERR_PTR(-ENOENT);
+	task = get_proc_task(dir);
+	if (!task)
+		goto out_no_task;
+
+	result = ERR_PTR(-EPERM);
+	if (!ptrace_may_access(task, PTRACE_MODE_READ))
+		goto out_no_mm;
+
+	result = ERR_PTR(-ENOENT);
+	if (dname_to_vma_addr(dentry, &vm_start, &vm_end))
+		goto out_no_mm;
+
+	mm = get_task_mm(task);
+	if (!mm)
+		goto out_no_mm;
+
+	down_read(&mm->mmap_sem);
+	vma = find_exact_vma(mm, vm_start, vm_end);
+	if (!vma)
+		goto out_no_vma;
+
+	result = proc_map_files_instantiate(dir, dentry, task, vma->vm_file);
+
+out_no_vma:
+	up_read(&mm->mmap_sem);
+	mmput(mm);
+out_no_mm:
+	put_task_struct(task);
+out_no_task:
+	return result;
+}
+
+static int proc_map_files_setattr(struct dentry *dentry, struct iattr *attr)
+{
+	struct inode *inode = dentry->d_inode;
+	struct task_struct *task;
+	int ret = -EACCES;
+
+	task = get_proc_task(inode);
+	if (!task)
+		return -ESRCH;
+
+	if (!lock_trace(task)) {
+		ret = proc_setattr(dentry, attr);
+		unlock_trace(task);
+	}
+
+	put_task_struct(task);
+	return ret;
+}
+
+static int proc_map_files_getattr(struct vfsmount *mnt, struct dentry *dentry,
+				  struct kstat *stat)
+{
+	struct inode *inode = dentry->d_inode;
+	struct task_struct *task;
+	int ret = -EACCES;
+
+	task = get_proc_task(inode);
+	if (!task)
+		return -ESRCH;
+
+	if (!lock_trace(task)) {
+		generic_fillattr(inode, stat);
+		unlock_trace(task);
+		ret = 0;
+	}
+
+	put_task_struct(task);
+	return ret;
+}
+
+static const struct inode_operations proc_map_files_inode_operations = {
+	.lookup		= proc_map_files_lookup,
+	.setattr	= proc_map_files_setattr,
+	.getattr	= proc_map_files_getattr,
+};
+
+static int proc_map_files_readdir(struct file *filp, void *dirent, filldir_t filldir)
+{
+	struct dentry *dentry = filp->f_path.dentry;
+	struct inode *inode = dentry->d_inode;
+	struct vm_area_struct *vma;
+	struct task_struct *task;
+	struct mm_struct *mm;
+	ino_t ino;
+	int ret;
+
+	ret = -ENOENT;
+	task = get_proc_task(inode);
+	if (!task)
+		goto out_no_task;
+
+	ret = -EPERM;
+	if (!ptrace_may_access(task, PTRACE_MODE_READ))
+		goto out;
+
+	ret = 0;
+	switch (filp->f_pos) {
+	case 0:
+		ino = inode->i_ino;
+		if (filldir(dirent, ".", 1, 0, ino, DT_DIR) < 0)
+			goto out;
+		filp->f_pos++;
+	case 1:
+		ino = parent_ino(dentry);
+		if (filldir(dirent, "..", 2, 1, ino, DT_DIR) < 0)
+			goto out;
+		filp->f_pos++;
+	default:
+	{
+		unsigned long nr_files, pos, i;
+		struct flex_array *fa = NULL;
+		struct map_files_info info;
+		struct map_files_info *p;
+
+		mm = get_task_mm(task);
+		if (!mm)
+			goto out;
+		down_read(&mm->mmap_sem);
+
+		nr_files = 0;
+
+		/*
+		 * We need two passes here:
+		 *
+		 *  1) Collect vmas of mapped files with mmap_sem taken
+		 *  2) Release mmap_sem and instantiate entries
+		 *
+		 * otherwise we get lockdep complained, since filldir()
+		 * routine might require mmap_sem taken in might_fault().
+		 */
+
+		for (vma = mm->mmap, pos = 2; vma; vma = vma->vm_next) {
+			if (vma->vm_file && ++pos > filp->f_pos)
+				nr_files++;
+		}
+
+		if (nr_files) {
+			fa = flex_array_alloc(sizeof(info), nr_files, GFP_KERNEL);
+			if (!fa || flex_array_prealloc(fa, 0, nr_files, GFP_KERNEL)) {
+				ret = -ENOMEM;
+				if (fa)
+					flex_array_free(fa);
+				up_read(&mm->mmap_sem);
+				mmput(mm);
+				goto out;
+			}
+			for (i = 0, vma = mm->mmap, pos = 2; vma; vma = vma->vm_next) {
+				if (!vma->vm_file)
+					continue;
+				if (++pos <= filp->f_pos)
+					continue;
+
+				get_file(vma->vm_file);
+				info.file = vma->vm_file;
+				info.len = snprintf(info.name, sizeof(info.name),
+						    "%lx-%lx", vma->vm_start,
+						    vma->vm_end);
+				if (flex_array_put(fa, i++, &info, GFP_KERNEL))
+					BUG();
+			}
+		}
+		up_read(&mm->mmap_sem);
+
+		for (i = 0; i < nr_files; i++) {
+			p = flex_array_get(fa, i);
+			ret = proc_fill_cache(filp, dirent, filldir,
+					      p->name, p->len,
+					      proc_map_files_instantiate,
+					      task, p->file);
+			if (ret)
+				break;
+			filp->f_pos++;
+			fput(p->file);
+		}
+		for (; i < nr_files; i++) {
+			/*
+			 * In case of error don't forget
+			 * to put rest of file refs.
+			 */
+			p = flex_array_get(fa, i);
+			fput(p->file);
+		}
+		if (fa)
+			flex_array_free(fa);
+		mmput(mm);
+	}
+	}
+
+out:
+	put_task_struct(task);
+out_no_task:
+	return ret;
+}
+
+static const struct file_operations proc_map_files_operations = {
+	.read		= generic_read_dir,
+	.readdir	= proc_map_files_readdir,
+	.llseek		= default_llseek,
+};
+
+/*
  * /proc/pid/fd needs a special permission handler so that a process can still
  * access /proc/self/fd after it has executed a setuid().
  */
@@ -2785,6 +3149,7 @@ static const struct inode_operations pro
 static const struct pid_entry tgid_base_stuff[] = {
 	DIR("task",       S_IRUGO|S_IXUGO, proc_task_inode_operations, proc_task_operations),
 	DIR("fd",         S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations),
+	DIR("map_files",  S_IRUSR|S_IXUSR, proc_map_files_inode_operations, proc_map_files_operations),
 	DIR("fdinfo",     S_IRUSR|S_IXUSR, proc_fdinfo_inode_operations, proc_fdinfo_operations),
 	DIR("ns",	  S_IRUSR|S_IXUGO, proc_ns_dir_inode_operations, proc_ns_dir_operations),
 #ifdef CONFIG_NET
Index: linux-2.6.git/include/linux/mm.h
===================================================================
--- linux-2.6.git.orig/include/linux/mm.h
+++ linux-2.6.git/include/linux/mm.h
@@ -1491,6 +1491,18 @@ static inline unsigned long vma_pages(st
 	return (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
 }
 
+/* Look up the first VMA which exactly match the interval vm_start ... vm_end */
+static inline struct vm_area_struct *
+find_exact_vma(struct mm_struct *mm, unsigned long vm_start, unsigned long vm_end)
+{
+	struct vm_area_struct *vma = find_vma(mm, vm_start);
+
+	if (vma && (vma->vm_start != vm_start || vma->vm_end != vm_end))
+		vma = NULL;
+
+	return vma;
+}
+
 #ifdef CONFIG_MMU
 pgprot_t vm_get_page_prot(unsigned long vm_flags);
 #else

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 1/2] fs, proc: Make proc_get_link to use dentry instead of inode
  2011-09-13 21:14 ` [patch 1/2] fs, proc: Make proc_get_link to use dentry instead of inode Cyrill Gorcunov
@ 2011-09-14  1:37   ` Kirill A. Shutemov
  0 siblings, 0 replies; 28+ messages in thread
From: Kirill A. Shutemov @ 2011-09-14  1:37 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: linux-kernel, containers, linux-fsdevel, Andrew Morton,
	Pavel Emelyanov, James Bottomley, Nathan Lynch, Zan Lynx,
	Daniel Lezcano, Vasiliy Kulikov, Tejun Heo, Alexey Dobriyan,
	Al Viro

On Wed, Sep 14, 2011 at 01:14:00AM +0400, Cyrill Gorcunov wrote:
> This patch prepares the ground for the next "map_files"
> patch which needs a name of a link file to analyse.
> 
> So instead of squashing this change into one big
> patch the separate one is done.
> 
> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
> CC: Pavel Emelyanov <xemul@parallels.com>
> CC: Tejun Heo <tj@kernel.org>
> CC: Vasiliy Kulikov <segoon@openwall.com>
> CC: "Kirill A. Shutemov" <kirill@shutemov.name>
> CC: Alexey Dobriyan <adobriyan@gmail.com>
> CC: Al Viro <viro@ZenIV.linux.org.uk>
> CC: Andrew Morton <akpm@linux-foundation.org>

Reviewed-by: Kirill Shutemov <kirill@shutemov.name>

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 2/2] fs, proc: Introduce the /proc/<pid>/map_files/ directory v12
       [not found]   ` <20110914023428.GA4034@shutemov.name>
@ 2011-09-14  5:54     ` Cyrill Gorcunov
  0 siblings, 0 replies; 28+ messages in thread
From: Cyrill Gorcunov @ 2011-09-14  5:54 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: linux-kernel, containers, linux-fsdevel, Andrew Morton,
	Pavel Emelyanov, James Bottomley, Nathan Lynch, Zan Lynx,
	Daniel Lezcano, Vasiliy Kulikov, Tejun Heo, Alexey Dobriyan,
	Al Viro

On Wed, Sep 14, 2011 at 05:34:28AM +0300, Kirill A. Shutemov wrote:
...
> > 
> > Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
> > Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
> > CC: Tejun Heo <tj@kernel.org>
> > CC: Vasiliy Kulikov <segoon@openwall.com>
> > CC: "Kirill A. Shutemov" <kirill@shutemov.name>
> > CC: Alexey Dobriyan <adobriyan@gmail.com>
> > CC: Al Viro <viro@ZenIV.linux.org.uk>
> > CC: Andrew Morton <akpm@linux-foundation.org>
> 
> Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name>
>

Thanks for review, Kirill!

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 2/2] fs, proc: Introduce the /proc/<pid>/map_files/ directory v12
  2011-09-13 21:14 ` [patch 2/2] fs, proc: Introduce the /proc/<pid>/map_files/ directory v12 Cyrill Gorcunov
       [not found]   ` <20110914023428.GA4034@shutemov.name>
@ 2011-09-14  6:52   ` Andrew Morton
  2011-09-14 10:56     ` Cyrill Gorcunov
  1 sibling, 1 reply; 28+ messages in thread
From: Andrew Morton @ 2011-09-14  6:52 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: linux-kernel, containers, linux-fsdevel, Kirill Shutemov,
	Pavel Emelyanov, James Bottomley, Nathan Lynch, Zan Lynx,
	Daniel Lezcano, Vasiliy Kulikov, Tejun Heo, Alexey Dobriyan,
	Al Viro, Andrew Morton

On Wed, 14 Sep 2011 01:14:01 +0400 Cyrill Gorcunov <gorcunov@openvz.org> wrote:

> This one behaves similarly to the /proc/<pid>/fd/ one - it contains symlinks
> one for each mapping with file, the name of a symlink is "vma->vm_start-vma->vm_end",
> the target is the file. Opening a symlink results in a file that point exactly
> to the same inode as them vma's one.

We should fully work through Pavel Machek's comments, please.  For some
reason I'm a bit paranoid about security lately :(

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 2/2] fs, proc: Introduce the /proc/<pid>/map_files/ directory v12
  2011-09-14  6:52   ` Andrew Morton
@ 2011-09-14 10:56     ` Cyrill Gorcunov
  2011-09-14 11:14       ` Pavel Machek
  0 siblings, 1 reply; 28+ messages in thread
From: Cyrill Gorcunov @ 2011-09-14 10:56 UTC (permalink / raw)
  To: Andrew Morton, Pavel Machek
  Cc: linux-kernel, containers, linux-fsdevel, Kirill Shutemov,
	Pavel Emelyanov, James Bottomley, Nathan Lynch, Zan Lynx,
	Daniel Lezcano, Vasiliy Kulikov, Tejun Heo, Alexey Dobriyan,
	Al Viro, Andrew Morton

On Tue, Sep 13, 2011 at 11:52:22PM -0700, Andrew Morton wrote:
> On Wed, 14 Sep 2011 01:14:01 +0400 Cyrill Gorcunov <gorcunov@openvz.org> wrote:
> 
> > This one behaves similarly to the /proc/<pid>/fd/ one - it contains symlinks
> > one for each mapping with file, the name of a symlink is "vma->vm_start-vma->vm_end",
> > the target is the file. Opening a symlink results in a file that point exactly
> > to the same inode as them vma's one.
> 
> We should fully work through Pavel Machek's comments, please.  For some
> reason I'm a bit paranoid about security lately :(

Pavel, I somehow lost. What exactly the security issue here? There are a few
patches from Vasiliy in -mm queue at moment. In particular one includes
.permission set for fd/ handling. So I've updated the map_files as well
(it's below). So please review and point me where the problem is. Thanks!

	Cyrill
---
fs, proc: Introduce the /proc/<pid>/map_files/ directory v13

From: Pavel Emelyanov <xemul@parallels.com>

This one behaves similarly to the /proc/<pid>/fd/ one - it contains symlinks
one for each mapping with file, the name of a symlink is "vma->vm_start-vma->vm_end",
the target is the file. Opening a symlink results in a file that point exactly
to the same inode as them vma's one.

For example the ls -l of some arbitrary /proc/<pid>/map_files/

 | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80403000-7f8f80404000 -> /lib64/libc-2.5.so
 | lr-x------ 1 root root 64 Aug 26 06:40 7f8f8061e000-7f8f80620000 -> /lib64/libselinux.so.1
 | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80826000-7f8f80827000 -> /lib64/libacl.so.1.1.0
 | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a2f000-7f8f80a30000 -> /lib64/librt-2.5.so
 | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a30000-7f8f80a4c000 -> /lib64/ld-2.5.so

This *helps* checkpointing process in three ways:

1. When dumping a task mappings we do know exact file that is mapped by particular
   region. We do this by opening /proc/$pid/map_files/address symlink the way we do
   with file descriptors.

2. This also helps in determining which anonymous shared mappings are shared with
   each other by comparing the inodes of them.

3. When restoring a set of process in case two of them has a mapping shared, we map
   the memory by the 1st one and then open its /proc/$pid/map_files/address file and
   map it by the 2nd task.

Using /proc/$pid/maps for this is quite inconvenient since it brings repeatable
re-reading and reparsing for this text file which slows down restore procesure
significantly. Also as being pointed in (3) it is a way easier to use top level
shared mapping in children as /proc/$pid/map_files/address when needed.

v2: (spotted by Tejun Heo)
 - /proc/<pid>/mfd changed to /proc/<pid>/map_files
 - find_vma helper is used instead of linear search
 - routines are re-grouped
 - d_revalidate is set now

v3:
 - d_revalidate reworked, now it should drops no longer valid dentries (Tejun Heo)
 - ptrace_may_access added into proc_map_files_lookup (Vasiliy Kulikov)
 - because of filldir (which eventually might need to lock mmap_sem)
   the proc_map_files_readdir() was reworked to call proc_fill_cache()
   with unlocked mmap_sem

v4: (feedback by Tejun Heo and Vasiliy Kulikov)
 - instead of saving data in proc_inode we rather make a dentry name
   to keep both vm_start and vm_end accordingly
 - d_revalidate now honor task credentials

v5: (feedback by Kirill A. Shutemov)
 - don't forget to release mmap_sem on error path

v6:
 - sizeof get used in map_files_info which shrink member a bit on
   x86-32 (by Kirill A. Shutemov)
 - map_name_to_addr returns -EINVAL instead of -1
   which is more appropriate (by Tejun Heo)

v7:
 - add [get/set]attr handlers for
   proc_map_files_inode_operations (by Vasiliy Kulikov)

v8:
 - Kirill A. Shutemov spotted a parasite semicolon
   which ruined the ptrace_check call, fixed.

v9: (feedback by Andrew Morton)
 - find_exact_vma moved into include/linux/mm.h as an inline helper
 - proc_map_files_setattr uses either kmalloc or vmalloc depending
   on how many ojects are to be allocated
 - no more map_name_to_addr but dname_to_vma_addr introduced instead
   and it uses sscanf because in one case the find_exact_vma() is used
   only to confirm existence of vma area the boolean flag is used
 - fancy justification dropped
 - still the proc_map_files_get/setattr leaved untouched
   until additional fd/ patches applied first.

v10: (feedback by Andrew Morton)
 - flex_arrays are used instead of kmalloc/vmalloc calls
 - map_files_d_revalidate use ptrace_may_access for
   security reason (by Vasiliy Kulikov)

v11:
 - should use fput and drop !ret test from a loop code
   (feedback by Andrew Morton)
 - no need for 'used' variable, use existing
   nr_files with file->pos predicate
 - if preallocation fails no need to go further,
   simply release mmap semaphore and jump out

v12:
 - rework map_files_d_revalidate to make sure
   the task get released on return (by Vasiliy Kulikov)

v13:
 - proc_map_files_inode_operations are set to be the same
   as proc_fd_inode_operations, ie to include .permission
   pointing to proc_fd_permission

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
CC: Tejun Heo <tj@kernel.org>
CC: Vasiliy Kulikov <segoon@openwall.com>
CC: "Kirill A. Shutemov" <kirill@shutemov.name>
CC: Alexey Dobriyan <adobriyan@gmail.com>
CC: Al Viro <viro@ZenIV.linux.org.uk>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Pavel Machek <pavel@ucw.cz>
---
 fs/proc/base.c     |  329 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/mm.h |   12 +
 2 files changed, 341 insertions(+)

Index: linux-2.6.git/fs/proc/base.c
===================================================================
--- linux-2.6.git.orig/fs/proc/base.c
+++ linux-2.6.git/fs/proc/base.c
@@ -83,6 +83,7 @@
 #include <linux/pid_namespace.h>
 #include <linux/fs_struct.h>
 #include <linux/slab.h>
+#include <linux/flex_array.h>
 #ifdef CONFIG_HARDWALL
 #include <asm/hardwall.h>
 #endif
@@ -2201,6 +2202,333 @@ static const struct file_operations proc
 };
 
 /*
+ * dname_to_vma_addr - maps a dentry name into two unsigned longs
+ * which represent vma start and end addresses.
+ */
+static int dname_to_vma_addr(struct dentry *dentry,
+			     unsigned long *start, unsigned long *end)
+{
+	if (sscanf(dentry->d_name.name, "%lx-%lx", start, end) != 2)
+		return -EINVAL;
+
+	return 0;
+}
+
+static int map_files_d_revalidate(struct dentry *dentry, struct nameidata *nd)
+{
+	unsigned long vm_start, vm_end;
+	bool exact_vma_exists = false;
+	struct mm_struct *mm = NULL;
+	struct task_struct *task;
+	const struct cred *cred;
+	struct inode *inode;
+	int status = 0;
+
+	if (nd && nd->flags & LOOKUP_RCU)
+		return -ECHILD;
+
+	inode = dentry->d_inode;
+	task = get_proc_task(inode);
+	if (!task)
+		goto out_notask;
+
+	if (!ptrace_may_access(task, PTRACE_MODE_READ))
+		goto out;
+
+	mm = get_task_mm(task);
+	if (!mm)
+		goto out;
+
+	if (!dname_to_vma_addr(dentry, &vm_start, &vm_end)) {
+		down_read(&mm->mmap_sem);
+		exact_vma_exists = !!find_exact_vma(mm, vm_start, vm_end);
+		up_read(&mm->mmap_sem);
+	}
+
+	mmput(mm);
+
+	if (exact_vma_exists) {
+		if (task_dumpable(task)) {
+			rcu_read_lock();
+			cred = __task_cred(task);
+			inode->i_uid = cred->euid;
+			inode->i_gid = cred->egid;
+			rcu_read_unlock();
+		} else {
+			inode->i_uid = 0;
+			inode->i_gid = 0;
+		}
+		security_task_to_inode(task, inode);
+		status = 1;
+	}
+
+out:
+	put_task_struct(task);
+
+out_notask:
+	if (status <= 0)
+		d_drop(dentry);
+
+	return status;
+}
+
+static const struct dentry_operations tid_map_files_dentry_operations = {
+	.d_revalidate	= map_files_d_revalidate,
+	.d_delete	= pid_delete_dentry,
+};
+
+static int proc_map_files_get_link(struct dentry *dentry, struct path *path)
+{
+	unsigned long vm_start, vm_end;
+	struct vm_area_struct *vma;
+	struct task_struct *task;
+	struct mm_struct *mm;
+	int rc = -ENOENT;
+
+	task = get_proc_task(dentry->d_inode);
+	if (!task)
+		goto out;
+
+	mm = get_task_mm(task);
+	put_task_struct(task);
+	if (!mm)
+		goto out;
+
+	rc = dname_to_vma_addr(dentry, &vm_start, &vm_end);
+	if (rc)
+		goto out_mmput;
+
+	down_read(&mm->mmap_sem);
+	vma = find_exact_vma(mm, vm_start, vm_end);
+	if (vma && vma->vm_file) {
+		*path = vma->vm_file->f_path;
+		path_get(path);
+		rc = 0;
+	}
+	up_read(&mm->mmap_sem);
+
+out_mmput:
+	mmput(mm);
+out:
+	return rc;
+}
+
+struct map_files_info {
+	struct file	*file;
+	unsigned long	len;
+	unsigned char	name[4*sizeof(long)+2]; /* max: %lx-%lx\0 */
+};
+
+static struct dentry *
+proc_map_files_instantiate(struct inode *dir, struct dentry *dentry,
+			   struct task_struct *task, const void *ptr)
+{
+	const struct file *file = ptr;
+	struct proc_inode *ei;
+	struct inode *inode;
+
+	if (!file)
+		return ERR_PTR(-ENOENT);
+
+	inode = proc_pid_make_inode(dir->i_sb, task);
+	if (!inode)
+		return ERR_PTR(-ENOENT);
+
+	ei = PROC_I(inode);
+	ei->op.proc_get_link = proc_map_files_get_link;
+
+	inode->i_op = &proc_pid_link_inode_operations;
+	inode->i_size = 64;
+	inode->i_mode = S_IFLNK;
+
+	if (file->f_mode & FMODE_READ)
+		inode->i_mode |= S_IRUSR | S_IXUSR;
+	if (file->f_mode & FMODE_WRITE)
+		inode->i_mode |= S_IWUSR | S_IXUSR;
+
+	d_set_d_op(dentry, &tid_map_files_dentry_operations);
+	d_add(dentry, inode);
+
+	return NULL;
+}
+
+static struct dentry *proc_map_files_lookup(struct inode *dir,
+		struct dentry *dentry, struct nameidata *nd)
+{
+	unsigned long vm_start, vm_end;
+	struct task_struct *task;
+	struct vm_area_struct *vma;
+	struct mm_struct *mm;
+	struct dentry *result;
+
+	result = ERR_PTR(-ENOENT);
+	task = get_proc_task(dir);
+	if (!task)
+		goto out;
+
+	result = ERR_PTR(-EACCES);
+	if (lock_trace(task))
+		goto out_put_task;
+
+	result = ERR_PTR(-ENOENT);
+	if (dname_to_vma_addr(dentry, &vm_start, &vm_end))
+		goto out_unlock;
+
+	mm = get_task_mm(task);
+	if (!mm)
+		goto out_unlock;
+
+	down_read(&mm->mmap_sem);
+	vma = find_exact_vma(mm, vm_start, vm_end);
+	if (!vma)
+		goto out_no_vma;
+
+	result = proc_map_files_instantiate(dir, dentry, task, vma->vm_file);
+
+out_no_vma:
+	up_read(&mm->mmap_sem);
+	mmput(mm);
+out_unlock:
+	unlock_trace(task);
+out_put_task:
+	put_task_struct(task);
+out:
+	return result;
+}
+
+static const struct inode_operations proc_map_files_inode_operations = {
+	.lookup		= proc_map_files_lookup,
+	.permission	= proc_fd_permission,
+	.setattr	= proc_setattr,
+};
+
+static int proc_map_files_readdir(struct file *filp, void *dirent, filldir_t filldir)
+{
+	struct dentry *dentry = filp->f_path.dentry;
+	struct inode *inode = dentry->d_inode;
+	struct vm_area_struct *vma;
+	struct task_struct *task;
+	struct mm_struct *mm;
+	ino_t ino;
+	int ret;
+
+	ret = -ENOENT;
+	task = get_proc_task(inode);
+	if (!task)
+		goto out;
+
+	retval = -EACCES;
+	if (lock_trace(p))
+		goto out_put_task;
+
+	ret = 0;
+	switch (filp->f_pos) {
+	case 0:
+		ino = inode->i_ino;
+		if (filldir(dirent, ".", 1, 0, ino, DT_DIR) < 0)
+			goto out_unlock;
+		filp->f_pos++;
+	case 1:
+		ino = parent_ino(dentry);
+		if (filldir(dirent, "..", 2, 1, ino, DT_DIR) < 0)
+			goto out_unlock;
+		filp->f_pos++;
+	default:
+	{
+		unsigned long nr_files, pos, i;
+		struct flex_array *fa = NULL;
+		struct map_files_info info;
+		struct map_files_info *p;
+
+		mm = get_task_mm(task);
+		if (!mm)
+			goto out_unlock;
+		down_read(&mm->mmap_sem);
+
+		nr_files = 0;
+
+		/*
+		 * We need two passes here:
+		 *
+		 *  1) Collect vmas of mapped files with mmap_sem taken
+		 *  2) Release mmap_sem and instantiate entries
+		 *
+		 * otherwise we get lockdep complained, since filldir()
+		 * routine might require mmap_sem taken in might_fault().
+		 */
+
+		for (vma = mm->mmap, pos = 2; vma; vma = vma->vm_next) {
+			if (vma->vm_file && ++pos > filp->f_pos)
+				nr_files++;
+		}
+
+		if (nr_files) {
+			fa = flex_array_alloc(sizeof(info), nr_files, GFP_KERNEL);
+			if (!fa || flex_array_prealloc(fa, 0, nr_files, GFP_KERNEL)) {
+				ret = -ENOMEM;
+				if (fa)
+					flex_array_free(fa);
+				up_read(&mm->mmap_sem);
+				mmput(mm);
+				goto out_unlock;
+			}
+			for (i = 0, vma = mm->mmap, pos = 2; vma; vma = vma->vm_next) {
+				if (!vma->vm_file)
+					continue;
+				if (++pos <= filp->f_pos)
+					continue;
+
+				get_file(vma->vm_file);
+				info.file = vma->vm_file;
+				info.len = snprintf(info.name, sizeof(info.name),
+						    "%lx-%lx", vma->vm_start,
+						    vma->vm_end);
+				if (flex_array_put(fa, i++, &info, GFP_KERNEL))
+					BUG();
+			}
+		}
+		up_read(&mm->mmap_sem);
+
+		for (i = 0; i < nr_files; i++) {
+			p = flex_array_get(fa, i);
+			ret = proc_fill_cache(filp, dirent, filldir,
+					      p->name, p->len,
+					      proc_map_files_instantiate,
+					      task, p->file);
+			if (ret)
+				break;
+			filp->f_pos++;
+			fput(p->file);
+		}
+		for (; i < nr_files; i++) {
+			/*
+			 * In case of error don't forget
+			 * to put rest of file refs.
+			 */
+			p = flex_array_get(fa, i);
+			fput(p->file);
+		}
+		if (fa)
+			flex_array_free(fa);
+		mmput(mm);
+	}
+	}
+
+out_unlock:
+	unlock_trace(p);
+out_put_task:
+	put_task_struct(task);
+out:
+	return ret;
+}
+
+static const struct file_operations proc_map_files_operations = {
+	.read		= generic_read_dir,
+	.readdir	= proc_map_files_readdir,
+	.llseek		= default_llseek,
+};
+
+/*
  * /proc/pid/fd needs a special permission handler so that a process can still
  * access /proc/self/fd after it has executed a setuid().
  */
@@ -2815,6 +3143,7 @@ static const struct inode_operations pro
 static const struct pid_entry tgid_base_stuff[] = {
 	DIR("task",       S_IRUGO|S_IXUGO, proc_task_inode_operations, proc_task_operations),
 	DIR("fd",         S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations),
+	DIR("map_files",  S_IRUSR|S_IXUSR, proc_map_files_inode_operations, proc_map_files_operations),
 	DIR("fdinfo",     S_IRUSR|S_IXUSR, proc_fdinfo_inode_operations, proc_fdinfo_operations),
 	DIR("ns",	  S_IRUSR|S_IXUGO, proc_ns_dir_inode_operations, proc_ns_dir_operations),
 #ifdef CONFIG_NET
Index: linux-2.6.git/include/linux/mm.h
===================================================================
--- linux-2.6.git.orig/include/linux/mm.h
+++ linux-2.6.git/include/linux/mm.h
@@ -1491,6 +1491,18 @@ static inline unsigned long vma_pages(st
 	return (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
 }
 
+/* Look up the first VMA which exactly match the interval vm_start ... vm_end */
+static inline struct vm_area_struct *
+find_exact_vma(struct mm_struct *mm, unsigned long vm_start, unsigned long vm_end)
+{
+	struct vm_area_struct *vma = find_vma(mm, vm_start);
+
+	if (vma && (vma->vm_start != vm_start || vma->vm_end != vm_end))
+		vma = NULL;
+
+	return vma;
+}
+
 #ifdef CONFIG_MMU
 pgprot_t vm_get_page_prot(unsigned long vm_flags);
 #else

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 2/2] fs, proc: Introduce the /proc/<pid>/map_files/ directory v12
  2011-09-14 10:56     ` Cyrill Gorcunov
@ 2011-09-14 11:14       ` Pavel Machek
  2011-09-14 11:39         ` Cyrill Gorcunov
  0 siblings, 1 reply; 28+ messages in thread
From: Pavel Machek @ 2011-09-14 11:14 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Andrew Morton, linux-kernel, containers, linux-fsdevel,
	Kirill Shutemov, Pavel Emelyanov, James Bottomley, Nathan Lynch,
	Zan Lynx, Daniel Lezcano, Vasiliy Kulikov, Tejun Heo,
	Alexey Dobriyan, Al Viro, Andrew Morton

Hi!

> > 
> > > This one behaves similarly to the /proc/<pid>/fd/ one - it contains symlinks
> > > one for each mapping with file, the name of a symlink is "vma->vm_start-vma->vm_end",
> > > the target is the file. Opening a symlink results in a file that point exactly
> > > to the same inode as them vma's one.
> > 
> > We should fully work through Pavel Machek's comments, please.  For some
> > reason I'm a bit paranoid about security lately :(
> 
> Pavel, I somehow lost. What exactly the security issue here? There are a few
> patches from Vasiliy in -mm queue at moment. In particular one includes
> .permission set for fd/ handling. So I've updated the map_files as well
> (it's below). So please review and point me where the problem
> is. Thanks!

AFAICT, this recreates existing problem with /proc/<pid>/fd (see
discussion at 

http://www.securityfocus.com/archive/1/507386/30/0/threaded

). It creates object that looks like symlink, but does not behave as
one, and permissions of directories are not checked as they would be
if it was a symlink.

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 2/2] fs, proc: Introduce the /proc/<pid>/map_files/ directory v12
  2011-09-14 11:14       ` Pavel Machek
@ 2011-09-14 11:39         ` Cyrill Gorcunov
  2011-09-14 13:44           ` Cyrill Gorcunov
  0 siblings, 1 reply; 28+ messages in thread
From: Cyrill Gorcunov @ 2011-09-14 11:39 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Andrew Morton, linux-kernel, containers, linux-fsdevel,
	Kirill Shutemov, Pavel Emelyanov, James Bottomley, Nathan Lynch,
	Zan Lynx, Daniel Lezcano, Vasiliy Kulikov, Tejun Heo,
	Alexey Dobriyan, Al Viro, Andrew Morton

On Wed, Sep 14, 2011 at 01:14:38PM +0200, Pavel Machek wrote:
> Hi!
> 
...
> > 
> > Pavel, I somehow lost. What exactly the security issue here? There are a few
> > patches from Vasiliy in -mm queue at moment. In particular one includes
> > .permission set for fd/ handling. So I've updated the map_files as well
> > (it's below). So please review and point me where the problem
> > is. Thanks!
> 
> AFAICT, this recreates existing problem with /proc/<pid>/fd (see
> discussion at 
> 
> http://www.securityfocus.com/archive/1/507386/30/0/threaded
> 
> ). It creates object that looks like symlink, but does not behave as
> one, and permissions of directories are not checked as they would be
> if it was a symlink.
> 
> 									Pavel

OK, so the problem is that we might return a path to the file mapped
even if the directory which consists the former file has its permission
changed, right? For example, lets say we have

| lr-x------ 1 cyrill cyrill 64 Sep 14 19:35 /proc/self/map_files/3d73a00000-3d73a1c000 -> /lib64/ld-2.5.so

and once /lib64 became unreadable we should not return the path to
3d73a00000-3d73a1c000 as well, that is what you mean, right?

	Cyrill

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 2/2] fs, proc: Introduce the /proc/<pid>/map_files/ directory v12
  2011-09-14 11:39         ` Cyrill Gorcunov
@ 2011-09-14 13:44           ` Cyrill Gorcunov
  2011-09-14 14:48             ` Vasiliy Kulikov
  0 siblings, 1 reply; 28+ messages in thread
From: Cyrill Gorcunov @ 2011-09-14 13:44 UTC (permalink / raw)
  To: Pavel Machek, Andrew Morton, Vasiliy Kulikov
  Cc: linux-kernel, containers, linux-fsdevel, Kirill Shutemov,
	Pavel Emelyanov, James Bottomley, Nathan Lynch, Zan Lynx,
	Daniel Lezcano, Tejun Heo, Alexey Dobriyan, Al Viro,
	Andrew Morton

On Wed, Sep 14, 2011 at 03:39:12PM +0400, Cyrill Gorcunov wrote:
...
> > 
> > AFAICT, this recreates existing problem with /proc/<pid>/fd (see
> > discussion at 
> > 
> > http://www.securityfocus.com/archive/1/507386/30/0/threaded
> > 
> > ). It creates object that looks like symlink, but does not behave as
> > one, and permissions of directories are not checked as they would be
> > if it was a symlink.
> > 
> > 									Pavel
> 
> OK, so the problem is that we might return a path to the file mapped
> even if the directory which consists the former file has its permission
> changed, right? For example, lets say we have
> 
> | lr-x------ 1 cyrill cyrill 64 Sep 14 19:35 /proc/self/map_files/3d73a00000-3d73a1c000 -> /lib64/ld-2.5.so
> 
> and once /lib64 became unreadable we should not return the path to
> 3d73a00000-3d73a1c000 as well, that is what you mean, right?
> 

So, there is no *new* hole. And from conversation you've pointed
I actually don't get what the final solution was taken?

Since I still can read file from fd/ even if the file keeper
directory (say 'fd/7 --> /home/cyrill/t/testee-shared.img' where
't' directory has changed to '-x' after the process opened
it) I assume -- not that much.

Both fd/ and map_files/ have ptrace_may_access checks, which
(as you pointed) might be not enough, but squashing all changes
into one big path seems to be not that good idea.

And what is more important -- would the change of the current
behaviour break some user-space programs?

Vasiliy, as far as I remember you had something in mind on
fd/ additional fixups, no?

A bit updated/tested patch below.

	Cyrill
---
fs, proc: Introduce the /proc/<pid>/map_files/ directory v13

From: Pavel Emelyanov <xemul@parallels.com>

This one behaves similarly to the /proc/<pid>/fd/ one - it contains symlinks
one for each mapping with file, the name of a symlink is "vma->vm_start-vma->vm_end",
the target is the file. Opening a symlink results in a file that point exactly
to the same inode as them vma's one.

For example the ls -l of some arbitrary /proc/<pid>/map_files/

 | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80403000-7f8f80404000 -> /lib64/libc-2.5.so
 | lr-x------ 1 root root 64 Aug 26 06:40 7f8f8061e000-7f8f80620000 -> /lib64/libselinux.so.1
 | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80826000-7f8f80827000 -> /lib64/libacl.so.1.1.0
 | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a2f000-7f8f80a30000 -> /lib64/librt-2.5.so
 | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a30000-7f8f80a4c000 -> /lib64/ld-2.5.so

This *helps* checkpointing process in three ways:

1. When dumping a task mappings we do know exact file that is mapped by particular
   region. We do this by opening /proc/$pid/map_files/address symlink the way we do
   with file descriptors.

2. This also helps in determining which anonymous shared mappings are shared with
   each other by comparing the inodes of them.

3. When restoring a set of process in case two of them has a mapping shared, we map
   the memory by the 1st one and then open its /proc/$pid/map_files/address file and
   map it by the 2nd task.

Using /proc/$pid/maps for this is quite inconvenient since it brings repeatable
re-reading and reparsing for this text file which slows down restore procesure
significantly. Also as being pointed in (3) it is a way easier to use top level
shared mapping in children as /proc/$pid/map_files/address when needed.

v2: (spotted by Tejun Heo)
 - /proc/<pid>/mfd changed to /proc/<pid>/map_files
 - find_vma helper is used instead of linear search
 - routines are re-grouped
 - d_revalidate is set now

v3:
 - d_revalidate reworked, now it should drops no longer valid dentries (Tejun Heo)
 - ptrace_may_access added into proc_map_files_lookup (Vasiliy Kulikov)
 - because of filldir (which eventually might need to lock mmap_sem)
   the proc_map_files_readdir() was reworked to call proc_fill_cache()
   with unlocked mmap_sem

v4: (feedback by Tejun Heo and Vasiliy Kulikov)
 - instead of saving data in proc_inode we rather make a dentry name
   to keep both vm_start and vm_end accordingly
 - d_revalidate now honor task credentials

v5: (feedback by Kirill A. Shutemov)
 - don't forget to release mmap_sem on error path

v6:
 - sizeof get used in map_files_info which shrink member a bit on
   x86-32 (by Kirill A. Shutemov)
 - map_name_to_addr returns -EINVAL instead of -1
   which is more appropriate (by Tejun Heo)

v7:
 - add [get/set]attr handlers for
   proc_map_files_inode_operations (by Vasiliy Kulikov)

v8:
 - Kirill A. Shutemov spotted a parasite semicolon
   which ruined the ptrace_check call, fixed.

v9: (feedback by Andrew Morton)
 - find_exact_vma moved into include/linux/mm.h as an inline helper
 - proc_map_files_setattr uses either kmalloc or vmalloc depending
   on how many ojects are to be allocated
 - no more map_name_to_addr but dname_to_vma_addr introduced instead
   and it uses sscanf because in one case the find_exact_vma() is used
   only to confirm existence of vma area the boolean flag is used
 - fancy justification dropped
 - still the proc_map_files_get/setattr leaved untouched
   until additional fd/ patches applied first.

v10: (feedback by Andrew Morton)
 - flex_arrays are used instead of kmalloc/vmalloc calls
 - map_files_d_revalidate use ptrace_may_access for
   security reason (by Vasiliy Kulikov)

v11:
 - should use fput and drop !ret test from a loop code
   (feedback by Andrew Morton)
 - no need for 'used' variable, use existing
   nr_files with file->pos predicate
 - if preallocation fails no need to go further,
   simply release mmap semaphore and jump out

v12:
 - rework map_files_d_revalidate to make sure
   the task get released on return (by Vasiliy Kulikov)

v13:
 - proc_map_files_inode_operations are set to be the same
   as proc_fd_inode_operations, ie to include .permission
   pointing to proc_fd_permission

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
CC: Tejun Heo <tj@kernel.org>
CC: Vasiliy Kulikov <segoon@openwall.com>
CC: "Kirill A. Shutemov" <kirill@shutemov.name>
CC: Alexey Dobriyan <adobriyan@gmail.com>
CC: Al Viro <viro@ZenIV.linux.org.uk>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Pavel Machek <pavel@ucw.cz>
---
 fs/proc/base.c     |  331 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/mm.h |   12 +
 2 files changed, 343 insertions(+)

Index: linux-2.6.git/fs/proc/base.c
===================================================================
--- linux-2.6.git.orig/fs/proc/base.c
+++ linux-2.6.git/fs/proc/base.c
@@ -83,6 +83,7 @@
 #include <linux/pid_namespace.h>
 #include <linux/fs_struct.h>
 #include <linux/slab.h>
+#include <linux/flex_array.h>
 #ifdef CONFIG_HARDWALL
 #include <asm/hardwall.h>
 #endif
@@ -133,6 +134,8 @@ struct pid_entry {
 		NULL, &proc_single_file_operations,	\
 		{ .proc_show = show } )
 
+static int proc_fd_permission(struct inode *inode, int mask);
+
 /*
  * Count the number of hardlinks for the pid_entry table, excluding the .
  * and .. links.
@@ -2201,6 +2204,333 @@ static const struct file_operations proc
 };
 
 /*
+ * dname_to_vma_addr - maps a dentry name into two unsigned longs
+ * which represent vma start and end addresses.
+ */
+static int dname_to_vma_addr(struct dentry *dentry,
+			     unsigned long *start, unsigned long *end)
+{
+	if (sscanf(dentry->d_name.name, "%lx-%lx", start, end) != 2)
+		return -EINVAL;
+
+	return 0;
+}
+
+static int map_files_d_revalidate(struct dentry *dentry, struct nameidata *nd)
+{
+	unsigned long vm_start, vm_end;
+	bool exact_vma_exists = false;
+	struct mm_struct *mm = NULL;
+	struct task_struct *task;
+	const struct cred *cred;
+	struct inode *inode;
+	int status = 0;
+
+	if (nd && nd->flags & LOOKUP_RCU)
+		return -ECHILD;
+
+	inode = dentry->d_inode;
+	task = get_proc_task(inode);
+	if (!task)
+		goto out_notask;
+
+	if (!ptrace_may_access(task, PTRACE_MODE_READ))
+		goto out;
+
+	mm = get_task_mm(task);
+	if (!mm)
+		goto out;
+
+	if (!dname_to_vma_addr(dentry, &vm_start, &vm_end)) {
+		down_read(&mm->mmap_sem);
+		exact_vma_exists = !!find_exact_vma(mm, vm_start, vm_end);
+		up_read(&mm->mmap_sem);
+	}
+
+	mmput(mm);
+
+	if (exact_vma_exists) {
+		if (task_dumpable(task)) {
+			rcu_read_lock();
+			cred = __task_cred(task);
+			inode->i_uid = cred->euid;
+			inode->i_gid = cred->egid;
+			rcu_read_unlock();
+		} else {
+			inode->i_uid = 0;
+			inode->i_gid = 0;
+		}
+		security_task_to_inode(task, inode);
+		status = 1;
+	}
+
+out:
+	put_task_struct(task);
+
+out_notask:
+	if (status <= 0)
+		d_drop(dentry);
+
+	return status;
+}
+
+static const struct dentry_operations tid_map_files_dentry_operations = {
+	.d_revalidate	= map_files_d_revalidate,
+	.d_delete	= pid_delete_dentry,
+};
+
+static int proc_map_files_get_link(struct dentry *dentry, struct path *path)
+{
+	unsigned long vm_start, vm_end;
+	struct vm_area_struct *vma;
+	struct task_struct *task;
+	struct mm_struct *mm;
+	int rc = -ENOENT;
+
+	task = get_proc_task(dentry->d_inode);
+	if (!task)
+		goto out;
+
+	mm = get_task_mm(task);
+	put_task_struct(task);
+	if (!mm)
+		goto out;
+
+	rc = dname_to_vma_addr(dentry, &vm_start, &vm_end);
+	if (rc)
+		goto out_mmput;
+
+	down_read(&mm->mmap_sem);
+	vma = find_exact_vma(mm, vm_start, vm_end);
+	if (vma && vma->vm_file) {
+		*path = vma->vm_file->f_path;
+		path_get(path);
+		rc = 0;
+	}
+	up_read(&mm->mmap_sem);
+
+out_mmput:
+	mmput(mm);
+out:
+	return rc;
+}
+
+struct map_files_info {
+	struct file	*file;
+	unsigned long	len;
+	unsigned char	name[4*sizeof(long)+2]; /* max: %lx-%lx\0 */
+};
+
+static struct dentry *
+proc_map_files_instantiate(struct inode *dir, struct dentry *dentry,
+			   struct task_struct *task, const void *ptr)
+{
+	const struct file *file = ptr;
+	struct proc_inode *ei;
+	struct inode *inode;
+
+	if (!file)
+		return ERR_PTR(-ENOENT);
+
+	inode = proc_pid_make_inode(dir->i_sb, task);
+	if (!inode)
+		return ERR_PTR(-ENOENT);
+
+	ei = PROC_I(inode);
+	ei->op.proc_get_link = proc_map_files_get_link;
+
+	inode->i_op = &proc_pid_link_inode_operations;
+	inode->i_size = 64;
+	inode->i_mode = S_IFLNK;
+
+	if (file->f_mode & FMODE_READ)
+		inode->i_mode |= S_IRUSR | S_IXUSR;
+	if (file->f_mode & FMODE_WRITE)
+		inode->i_mode |= S_IWUSR | S_IXUSR;
+
+	d_set_d_op(dentry, &tid_map_files_dentry_operations);
+	d_add(dentry, inode);
+
+	return NULL;
+}
+
+static struct dentry *proc_map_files_lookup(struct inode *dir,
+		struct dentry *dentry, struct nameidata *nd)
+{
+	unsigned long vm_start, vm_end;
+	struct task_struct *task;
+	struct vm_area_struct *vma;
+	struct mm_struct *mm;
+	struct dentry *result;
+
+	result = ERR_PTR(-ENOENT);
+	task = get_proc_task(dir);
+	if (!task)
+		goto out;
+
+	result = ERR_PTR(-EACCES);
+	if (lock_trace(task))
+		goto out_put_task;
+
+	result = ERR_PTR(-ENOENT);
+	if (dname_to_vma_addr(dentry, &vm_start, &vm_end))
+		goto out_unlock;
+
+	mm = get_task_mm(task);
+	if (!mm)
+		goto out_unlock;
+
+	down_read(&mm->mmap_sem);
+	vma = find_exact_vma(mm, vm_start, vm_end);
+	if (!vma)
+		goto out_no_vma;
+
+	result = proc_map_files_instantiate(dir, dentry, task, vma->vm_file);
+
+out_no_vma:
+	up_read(&mm->mmap_sem);
+	mmput(mm);
+out_unlock:
+	unlock_trace(task);
+out_put_task:
+	put_task_struct(task);
+out:
+	return result;
+}
+
+static const struct inode_operations proc_map_files_inode_operations = {
+	.lookup		= proc_map_files_lookup,
+	.permission	= proc_fd_permission,
+	.setattr	= proc_setattr,
+};
+
+static int proc_map_files_readdir(struct file *filp, void *dirent, filldir_t filldir)
+{
+	struct dentry *dentry = filp->f_path.dentry;
+	struct inode *inode = dentry->d_inode;
+	struct vm_area_struct *vma;
+	struct task_struct *task;
+	struct mm_struct *mm;
+	ino_t ino;
+	int ret;
+
+	ret = -ENOENT;
+	task = get_proc_task(inode);
+	if (!task)
+		goto out;
+
+	ret = -EACCES;
+	if (lock_trace(task))
+		goto out_put_task;
+
+	ret = 0;
+	switch (filp->f_pos) {
+	case 0:
+		ino = inode->i_ino;
+		if (filldir(dirent, ".", 1, 0, ino, DT_DIR) < 0)
+			goto out_unlock;
+		filp->f_pos++;
+	case 1:
+		ino = parent_ino(dentry);
+		if (filldir(dirent, "..", 2, 1, ino, DT_DIR) < 0)
+			goto out_unlock;
+		filp->f_pos++;
+	default:
+	{
+		unsigned long nr_files, pos, i;
+		struct flex_array *fa = NULL;
+		struct map_files_info info;
+		struct map_files_info *p;
+
+		mm = get_task_mm(task);
+		if (!mm)
+			goto out_unlock;
+		down_read(&mm->mmap_sem);
+
+		nr_files = 0;
+
+		/*
+		 * We need two passes here:
+		 *
+		 *  1) Collect vmas of mapped files with mmap_sem taken
+		 *  2) Release mmap_sem and instantiate entries
+		 *
+		 * otherwise we get lockdep complained, since filldir()
+		 * routine might require mmap_sem taken in might_fault().
+		 */
+
+		for (vma = mm->mmap, pos = 2; vma; vma = vma->vm_next) {
+			if (vma->vm_file && ++pos > filp->f_pos)
+				nr_files++;
+		}
+
+		if (nr_files) {
+			fa = flex_array_alloc(sizeof(info), nr_files, GFP_KERNEL);
+			if (!fa || flex_array_prealloc(fa, 0, nr_files, GFP_KERNEL)) {
+				ret = -ENOMEM;
+				if (fa)
+					flex_array_free(fa);
+				up_read(&mm->mmap_sem);
+				mmput(mm);
+				goto out_unlock;
+			}
+			for (i = 0, vma = mm->mmap, pos = 2; vma; vma = vma->vm_next) {
+				if (!vma->vm_file)
+					continue;
+				if (++pos <= filp->f_pos)
+					continue;
+
+				get_file(vma->vm_file);
+				info.file = vma->vm_file;
+				info.len = snprintf(info.name, sizeof(info.name),
+						    "%lx-%lx", vma->vm_start,
+						    vma->vm_end);
+				if (flex_array_put(fa, i++, &info, GFP_KERNEL))
+					BUG();
+			}
+		}
+		up_read(&mm->mmap_sem);
+
+		for (i = 0; i < nr_files; i++) {
+			p = flex_array_get(fa, i);
+			ret = proc_fill_cache(filp, dirent, filldir,
+					      p->name, p->len,
+					      proc_map_files_instantiate,
+					      task, p->file);
+			if (ret)
+				break;
+			filp->f_pos++;
+			fput(p->file);
+		}
+		for (; i < nr_files; i++) {
+			/*
+			 * In case of error don't forget
+			 * to put rest of file refs.
+			 */
+			p = flex_array_get(fa, i);
+			fput(p->file);
+		}
+		if (fa)
+			flex_array_free(fa);
+		mmput(mm);
+	}
+	}
+
+out_unlock:
+	unlock_trace(task);
+out_put_task:
+	put_task_struct(task);
+out:
+	return ret;
+}
+
+static const struct file_operations proc_map_files_operations = {
+	.read		= generic_read_dir,
+	.readdir	= proc_map_files_readdir,
+	.llseek		= default_llseek,
+};
+
+/*
  * /proc/pid/fd needs a special permission handler so that a process can still
  * access /proc/self/fd after it has executed a setuid().
  */
@@ -2815,6 +3145,7 @@ static const struct inode_operations pro
 static const struct pid_entry tgid_base_stuff[] = {
 	DIR("task",       S_IRUGO|S_IXUGO, proc_task_inode_operations, proc_task_operations),
 	DIR("fd",         S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations),
+	DIR("map_files",  S_IRUSR|S_IXUSR, proc_map_files_inode_operations, proc_map_files_operations),
 	DIR("fdinfo",     S_IRUSR|S_IXUSR, proc_fdinfo_inode_operations, proc_fdinfo_operations),
 	DIR("ns",	  S_IRUSR|S_IXUGO, proc_ns_dir_inode_operations, proc_ns_dir_operations),
 #ifdef CONFIG_NET
Index: linux-2.6.git/include/linux/mm.h
===================================================================
--- linux-2.6.git.orig/include/linux/mm.h
+++ linux-2.6.git/include/linux/mm.h
@@ -1491,6 +1491,18 @@ static inline unsigned long vma_pages(st
 	return (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
 }
 
+/* Look up the first VMA which exactly match the interval vm_start ... vm_end */
+static inline struct vm_area_struct *
+find_exact_vma(struct mm_struct *mm, unsigned long vm_start, unsigned long vm_end)
+{
+	struct vm_area_struct *vma = find_vma(mm, vm_start);
+
+	if (vma && (vma->vm_start != vm_start || vma->vm_end != vm_end))
+		vma = NULL;
+
+	return vma;
+}
+
 #ifdef CONFIG_MMU
 pgprot_t vm_get_page_prot(unsigned long vm_flags);
 #else

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 2/2] fs, proc: Introduce the /proc/<pid>/map_files/ directory v12
  2011-09-14 13:44           ` Cyrill Gorcunov
@ 2011-09-14 14:48             ` Vasiliy Kulikov
  2011-09-14 14:57               ` Vasiliy Kulikov
  2011-09-14 16:00               ` Cyrill Gorcunov
  0 siblings, 2 replies; 28+ messages in thread
From: Vasiliy Kulikov @ 2011-09-14 14:48 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Pavel Machek, Andrew Morton, linux-kernel, containers,
	linux-fsdevel, Kirill Shutemov, Pavel Emelyanov, James Bottomley,
	Nathan Lynch, Zan Lynx, Daniel Lezcano, Tejun Heo,
	Alexey Dobriyan, Al Viro, Andrew Morton

Hi,

On Wed, Sep 14, 2011 at 17:44 +0400, Cyrill Gorcunov wrote:
> On Wed, Sep 14, 2011 at 03:39:12PM +0400, Cyrill Gorcunov wrote:
> ...
> > > 
> > > AFAICT, this recreates existing problem with /proc/<pid>/fd (see
> > > discussion at 
> > > 
> > > http://www.securityfocus.com/archive/1/507386/30/0/threaded
> > > 
> > > ). It creates object that looks like symlink, but does not behave as
> > > one, and permissions of directories are not checked as they would be
> > > if it was a symlink.

The only difference between fd/X and dup(X) was the ability to write to
an fd opened as RO.  Now it is fixed:

$ ls -l 123
-rw-r--r-- 1 vasya vasya 0 Sep 14 18:21 123
$ id
uid=1008(new1) gid=1008(new1) groups=1008(new1)
$ bash 4< /tmp/123
new1@albatros:/tmp$ echo bla >&4
bash: 4: Bad file descriptor
new1@albatros:/tmp$ echo bla >/proc/$$/fd/4
bash: /proc/8527/fd/4: Permission denied

I don't really see any difference between opening fd/* and dup'ing file
descriptors with the current code.


> So, there is no *new* hole.

Actually now I see the difference between having something mapped and
having an _fd_ of this something.

Relevant code:

+static struct dentry *
+proc_map_files_instantiate(struct inode *dir, struct dentry *dentry,
+              struct task_struct *task, const void *ptr)
+{
...
+   inode->i_mode = S_IFLNK;
+
+   if (file->f_mode & FMODE_READ)
+       inode->i_mode |= S_IRUSR | S_IXUSR;
+   if (file->f_mode & FMODE_WRITE)
+       inode->i_mode |= S_IWUSR | S_IXUSR;


If you have a write mmap area, but no fd, you may not trunc a file; with
map_files/ you may get an fd and ftrunc it.


> Both fd/ and map_files/ have ptrace_may_access checks, which
> (as you pointed) might be not enough, but squashing all changes
> into one big path seems to be not that good idea.

ptrace() check is irrelevant to the access bypasses by the task owner.

> Vasiliy, as far as I remember you had something in mind on
> fd/ additional fixups, no?

Only closing fd presense leak:

http://www.openwall.com/lists/kernel-hardening/2011/09/10/3
http://www.openwall.com/lists/kernel-hardening/2011/09/10/4

Unfortunatelly, not yet applied/commented :(

Thanks,

-- 
Vasiliy Kulikov
http://www.openwall.com - bringing security into open computing environments

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 2/2] fs, proc: Introduce the /proc/<pid>/map_files/ directory v12
  2011-09-14 14:48             ` Vasiliy Kulikov
@ 2011-09-14 14:57               ` Vasiliy Kulikov
  2011-09-14 16:00               ` Cyrill Gorcunov
  1 sibling, 0 replies; 28+ messages in thread
From: Vasiliy Kulikov @ 2011-09-14 14:57 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Pavel Machek, Andrew Morton, linux-kernel, containers,
	linux-fsdevel, Kirill Shutemov, Pavel Emelyanov, James Bottomley,
	Nathan Lynch, Zan Lynx, Daniel Lezcano, Tejun Heo,
	Alexey Dobriyan, Al Viro, Andrew Morton

On Wed, Sep 14, 2011 at 18:48 +0400, Vasiliy Kulikov wrote:
> > So, there is no *new* hole.
> 
> Actually now I see the difference between having something mapped and
> having an _fd_ of this something.
> 
> Relevant code:
> 
> +static struct dentry *
> +proc_map_files_instantiate(struct inode *dir, struct dentry *dentry,
> +              struct task_struct *task, const void *ptr)
> +{
> ...
> +   inode->i_mode = S_IFLNK;
> +
> +   if (file->f_mode & FMODE_READ)
> +       inode->i_mode |= S_IRUSR | S_IXUSR;
> +   if (file->f_mode & FMODE_WRITE)
> +       inode->i_mode |= S_IWUSR | S_IXUSR;
> 
> 
> If you have a write mmap area, but no fd, you may not trunc a file; with
> map_files/ you may get an fd and ftrunc it.

Also it unconditionally adds +x, but I don't think it breaks any
security assumption as (1) there is no +s and (2) fd is not a directory.

Thanks,

-- 
Vasiliy Kulikov
http://www.openwall.com - bringing security into open computing environments

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 2/2] fs, proc: Introduce the /proc/<pid>/map_files/ directory v12
  2011-09-14 14:48             ` Vasiliy Kulikov
  2011-09-14 14:57               ` Vasiliy Kulikov
@ 2011-09-14 16:00               ` Cyrill Gorcunov
  2011-09-14 16:07                 ` Vasiliy Kulikov
  1 sibling, 1 reply; 28+ messages in thread
From: Cyrill Gorcunov @ 2011-09-14 16:00 UTC (permalink / raw)
  To: Vasiliy Kulikov
  Cc: Pavel Machek, Andrew Morton, linux-kernel, containers,
	linux-fsdevel, Kirill Shutemov, Pavel Emelyanov, James Bottomley,
	Nathan Lynch, Zan Lynx, Daniel Lezcano, Tejun Heo,
	Alexey Dobriyan, Al Viro, Andrew Morton

On Wed, Sep 14, 2011 at 06:48:41PM +0400, Vasiliy Kulikov wrote:
...
> 
> Actually now I see the difference between having something mapped and
> having an _fd_ of this something.
> 
> Relevant code:
> 
> +static struct dentry *
> +proc_map_files_instantiate(struct inode *dir, struct dentry *dentry,
> +              struct task_struct *task, const void *ptr)
> +{
> ...
> +   inode->i_mode = S_IFLNK;
> +
> +   if (file->f_mode & FMODE_READ)
> +       inode->i_mode |= S_IRUSR | S_IXUSR;
> +   if (file->f_mode & FMODE_WRITE)
> +       inode->i_mode |= S_IWUSR | S_IXUSR;
> 
> 
> If you have a write mmap area, but no fd, you may not trunc a file; with
> map_files/ you may get an fd and ftrunc it.
> 

This stands for anonymous memory, and if you have enough rights to
access the task this ftruncate is the last problem (since having ptrace
access to the task I _aready_ can trash various stuff inside, i dont need
even bother to look into map_files/ or whatever). So I don't see how
ftruncate migh harm you here?

	Cyrill

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 2/2] fs, proc: Introduce the /proc/<pid>/map_files/ directory v12
  2011-09-14 16:00               ` Cyrill Gorcunov
@ 2011-09-14 16:07                 ` Vasiliy Kulikov
  2011-09-14 16:13                   ` Pavel Emelyanov
  2011-09-15  9:14                   ` Cyrill Gorcunov
  0 siblings, 2 replies; 28+ messages in thread
From: Vasiliy Kulikov @ 2011-09-14 16:07 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Pavel Machek, Andrew Morton, linux-kernel, containers,
	linux-fsdevel, Kirill Shutemov, Pavel Emelyanov, James Bottomley,
	Nathan Lynch, Zan Lynx, Daniel Lezcano, Tejun Heo,
	Alexey Dobriyan, Al Viro, Andrew Morton

On Wed, Sep 14, 2011 at 20:00 +0400, Cyrill Gorcunov wrote:
> On Wed, Sep 14, 2011 at 06:48:41PM +0400, Vasiliy Kulikov wrote:
> ...
> > 
> > Actually now I see the difference between having something mapped and
> > having an _fd_ of this something.
> > 
> > Relevant code:
> > 
> > +static struct dentry *
> > +proc_map_files_instantiate(struct inode *dir, struct dentry *dentry,
> > +              struct task_struct *task, const void *ptr)
> > +{
> > ...
> > +   inode->i_mode = S_IFLNK;
> > +
> > +   if (file->f_mode & FMODE_READ)
> > +       inode->i_mode |= S_IRUSR | S_IXUSR;
> > +   if (file->f_mode & FMODE_WRITE)
> > +       inode->i_mode |= S_IWUSR | S_IXUSR;
> > 
> > 
> > If you have a write mmap area, but no fd, you may not trunc a file; with
> > map_files/ you may get an fd and ftrunc it.
> > 
> 
> This stands for anonymous memory, and if you have enough rights to
> access the task this ftruncate is the last problem (since having ptrace
> access to the task I _aready_ can trash various stuff inside, i dont need
> even bother to look into map_files/ or whatever). So I don't see how
> ftruncate migh harm you here?

No, I mean something else.  Assume you have a task, which does the
steps:

1) opens some sensitive file as root.  This file is e.g. 0700.

2) mmaps the file via opened fd, either RO or RW.

3) closes fd.

4) drops root.

Now it has a mapping of a privileged file, but cannot get fd of it
anyhow.  With map_files/ he may open his own /proc/$$/map_files/, pass
ptrace() check, and get fd of the privileged file.  He cannot explicitly
open it as it is 0700, but he may open it via map_files/ and get RO/RW
fd.

-- 
Vasiliy Kulikov
http://www.openwall.com - bringing security into open computing environments

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 2/2] fs, proc: Introduce the /proc/<pid>/map_files/ directory v12
  2011-09-14 16:07                 ` Vasiliy Kulikov
@ 2011-09-14 16:13                   ` Pavel Emelyanov
  2011-09-14 16:21                     ` Vasiliy Kulikov
  2011-09-15  9:14                   ` Cyrill Gorcunov
  1 sibling, 1 reply; 28+ messages in thread
From: Pavel Emelyanov @ 2011-09-14 16:13 UTC (permalink / raw)
  To: Vasiliy Kulikov
  Cc: Cyrill Gorcunov, Pavel Machek, Andrew Morton,
	linux-kernel@vger.kernel.org, containers@lists.osdl.org,
	linux-fsdevel@vger.kernel.org, Kirill Shutemov, James Bottomley,
	Nathan Lynch, Zan Lynx, Daniel Lezcano, Tejun Heo,
	Alexey Dobriyan, Al Viro, Andrew Morton

> No, I mean something else.  Assume you have a task, which does the
> steps:
> 
> 1) opens some sensitive file as root.  This file is e.g. 0700.
> 
> 2) mmaps the file via opened fd, either RO or RW.
> 
> 3) closes fd.
> 
> 4) drops root.
> 
> Now it has a mapping of a privileged file, but cannot get fd of it
> anyhow.  With map_files/ he may open his own /proc/$$/map_files/, pass
> ptrace() check, and get fd of the privileged file.  He cannot explicitly
> open it as it is 0700, but he may open it via map_files/ and get RO/RW
> fd.
> 

What is the problem here - the fact that we have some file considered to
be private be open-able by somebody else, or the fact that we can truncate
the file being mapped?

If the fist issue stands, then it also stands for /proc/pid/fd and thus we
don't introduce the new problem.

If the second, then it's not a problem as mm can handle this already.

Thanks,
Pavel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 2/2] fs, proc: Introduce the /proc/<pid>/map_files/ directory v12
  2011-09-14 16:13                   ` Pavel Emelyanov
@ 2011-09-14 16:21                     ` Vasiliy Kulikov
  0 siblings, 0 replies; 28+ messages in thread
From: Vasiliy Kulikov @ 2011-09-14 16:21 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Cyrill Gorcunov, Pavel Machek, Andrew Morton,
	linux-kernel@vger.kernel.org, containers@lists.osdl.org,
	linux-fsdevel@vger.kernel.org, Kirill Shutemov, James Bottomley,
	Nathan Lynch, Zan Lynx, Daniel Lezcano, Tejun Heo,
	Alexey Dobriyan, Al Viro, Andrew Morton

On Wed, Sep 14, 2011 at 20:13 +0400, Pavel Emelyanov wrote:
> > No, I mean something else.  Assume you have a task, which does the
> > steps:
> > 
> > 1) opens some sensitive file as root.  This file is e.g. 0700.
> > 
> > 2) mmaps the file via opened fd, either RO or RW.
> > 
> > 3) closes fd.
> > 
> > 4) drops root.
> > 
> > Now it has a mapping of a privileged file, but cannot get fd of it
> > anyhow.  With map_files/ he may open his own /proc/$$/map_files/, pass
> > ptrace() check, and get fd of the privileged file.  He cannot explicitly
> > open it as it is 0700, but he may open it via map_files/ and get RO/RW
> > fd.
> > 
> 
> What is the problem here - the fact that we have some file considered to
> be private be open-able by somebody else, or the fact that we can truncate
> the file being mapped?

The latter - the file, which is considered to be restricted to a process
as W only without ability to truncate it, now can be truncated.  The
process after (4) had no such ability without map_files/ with current
permission model of mmap'ed files.  Or I am missing something?

FWIW, ftruncate() might be not the only syscall which makes sense to use
in this case, I just thought about it.

Thanks,

-- 
Vasiliy Kulikov
http://www.openwall.com - bringing security into open computing environments

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 2/2] fs, proc: Introduce the /proc/<pid>/map_files/ directory v12
  2011-09-14 16:07                 ` Vasiliy Kulikov
  2011-09-14 16:13                   ` Pavel Emelyanov
@ 2011-09-15  9:14                   ` Cyrill Gorcunov
  2011-09-15  9:27                     ` Vasiliy Kulikov
  1 sibling, 1 reply; 28+ messages in thread
From: Cyrill Gorcunov @ 2011-09-15  9:14 UTC (permalink / raw)
  To: Vasiliy Kulikov
  Cc: Pavel Machek, Andrew Morton, linux-kernel, containers,
	linux-fsdevel, Kirill Shutemov, Pavel Emelyanov, James Bottomley,
	Nathan Lynch, Zan Lynx, Daniel Lezcano, Tejun Heo,
	Alexey Dobriyan, Al Viro, Andrew Morton

On Wed, Sep 14, 2011 at 08:07:25PM +0400, Vasiliy Kulikov wrote:
...
> 
> No, I mean something else.  Assume you have a task, which does the
> steps:
> 
> 1) opens some sensitive file as root.  This file is e.g. 0700.
> 
> 2) mmaps the file via opened fd, either RO or RW.
> 
> 3) closes fd.
> 
> 4) drops root.
> 
> Now it has a mapping of a privileged file, but cannot get fd of it
> anyhow.  With map_files/ he may open his own /proc/$$/map_files/, pass
> ptrace() check, and get fd of the privileged file.  He cannot explicitly
> open it as it is 0700, but he may open it via map_files/ and get RO/RW
> fd.
> 

Hi Vasiliy, could you please check if the update below address all your
concerns? Note that we still need at least RO access on such files.

	Cyrill
---
fs, proc: Introduce the /proc/<pid>/map_files/ directory v14

From: Pavel Emelyanov <xemul@parallels.com>

This one behaves similarly to the /proc/<pid>/fd/ one - it contains symlinks
one for each mapping with file, the name of a symlink is "vma->vm_start-vma->vm_end",
the target is the file. Opening a symlink results in a file that point exactly
to the same inode as them vma's one.

For example the ls -l of some arbitrary /proc/<pid>/map_files/

 | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80403000-7f8f80404000 -> /lib64/libc-2.5.so
 | lr-x------ 1 root root 64 Aug 26 06:40 7f8f8061e000-7f8f80620000 -> /lib64/libselinux.so.1
 | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80826000-7f8f80827000 -> /lib64/libacl.so.1.1.0
 | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a2f000-7f8f80a30000 -> /lib64/librt-2.5.so
 | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a30000-7f8f80a4c000 -> /lib64/ld-2.5.so

This *helps* checkpointing process in three ways:

1. When dumping a task mappings we do know exact file that is mapped by particular
   region. We do this by opening /proc/$pid/map_files/address symlink the way we do
   with file descriptors.

2. This also helps in determining which anonymous shared mappings are shared with
   each other by comparing the inodes of them.

3. When restoring a set of process in case two of them has a mapping shared, we map
   the memory by the 1st one and then open its /proc/$pid/map_files/address file and
   map it by the 2nd task.

Using /proc/$pid/maps for this is quite inconvenient since it brings repeatable
re-reading and reparsing for this text file which slows down restore procesure
significantly. Also as being pointed in (3) it is a way easier to use top level
shared mapping in children as /proc/$pid/map_files/address when needed.

v2: (spotted by Tejun Heo)
 - /proc/<pid>/mfd changed to /proc/<pid>/map_files
 - find_vma helper is used instead of linear search
 - routines are re-grouped
 - d_revalidate is set now

v3:
 - d_revalidate reworked, now it should drops no longer valid dentries (Tejun Heo)
 - ptrace_may_access added into proc_map_files_lookup (Vasiliy Kulikov)
 - because of filldir (which eventually might need to lock mmap_sem)
   the proc_map_files_readdir() was reworked to call proc_fill_cache()
   with unlocked mmap_sem

v4: (feedback by Tejun Heo and Vasiliy Kulikov)
 - instead of saving data in proc_inode we rather make a dentry name
   to keep both vm_start and vm_end accordingly
 - d_revalidate now honor task credentials

v5: (feedback by Kirill A. Shutemov)
 - don't forget to release mmap_sem on error path

v6:
 - sizeof get used in map_files_info which shrink member a bit on
   x86-32 (by Kirill A. Shutemov)
 - map_name_to_addr returns -EINVAL instead of -1
   which is more appropriate (by Tejun Heo)

v7:
 - add [get/set]attr handlers for
   proc_map_files_inode_operations (by Vasiliy Kulikov)

v8:
 - Kirill A. Shutemov spotted a parasite semicolon
   which ruined the ptrace_check call, fixed.

v9: (feedback by Andrew Morton)
 - find_exact_vma moved into include/linux/mm.h as an inline helper
 - proc_map_files_setattr uses either kmalloc or vmalloc depending
   on how many ojects are to be allocated
 - no more map_name_to_addr but dname_to_vma_addr introduced instead
   and it uses sscanf because in one case the find_exact_vma() is used
   only to confirm existence of vma area the boolean flag is used
 - fancy justification dropped
 - still the proc_map_files_get/setattr leaved untouched
   until additional fd/ patches applied first.

v10: (feedback by Andrew Morton)
 - flex_arrays are used instead of kmalloc/vmalloc calls
 - map_files_d_revalidate use ptrace_may_access for
   security reason (by Vasiliy Kulikov)

v11:
 - should use fput and drop !ret test from a loop code
   (feedback by Andrew Morton)
 - no need for 'used' variable, use existing
   nr_files with file->pos predicate
 - if preallocation fails no need to go further,
   simply release mmap semaphore and jump out

v12:
 - rework map_files_d_revalidate to make sure
   the task get released on return (by Vasiliy Kulikov)

v13:
 - proc_map_files_inode_operations are set to be the same
   as proc_fd_inode_operations, ie to include .permission
   pointing to proc_fd_permission

v14: (by Vasiliy Kulikov)
 - for security reason the links are created with FMODE_READ mode
   only even if the former file has FMODE_WRITE
 - proc_map_files_lookup fails on any non-read-only queries.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
CC: Tejun Heo <tj@kernel.org>
CC: Vasiliy Kulikov <segoon@openwall.com>
CC: "Kirill A. Shutemov" <kirill@shutemov.name>
CC: Alexey Dobriyan <adobriyan@gmail.com>
CC: Al Viro <viro@ZenIV.linux.org.uk>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Pavel Machek <pavel@ucw.cz>
---
 fs/proc/base.c     |  336 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/mm.h |   12 +
 2 files changed, 348 insertions(+)

Index: linux-2.6.git/fs/proc/base.c
===================================================================
--- linux-2.6.git.orig/fs/proc/base.c
+++ linux-2.6.git/fs/proc/base.c
@@ -83,6 +83,7 @@
 #include <linux/pid_namespace.h>
 #include <linux/fs_struct.h>
 #include <linux/slab.h>
+#include <linux/flex_array.h>
 #ifdef CONFIG_HARDWALL
 #include <asm/hardwall.h>
 #endif
@@ -133,6 +134,8 @@ struct pid_entry {
 		NULL, &proc_single_file_operations,	\
 		{ .proc_show = show } )
 
+static int proc_fd_permission(struct inode *inode, int mask);
+
 /*
  * Count the number of hardlinks for the pid_entry table, excluding the .
  * and .. links.
@@ -2201,6 +2204,338 @@ static const struct file_operations proc
 };
 
 /*
+ * dname_to_vma_addr - maps a dentry name into two unsigned longs
+ * which represent vma start and end addresses.
+ */
+static int dname_to_vma_addr(struct dentry *dentry,
+			     unsigned long *start, unsigned long *end)
+{
+	if (sscanf(dentry->d_name.name, "%lx-%lx", start, end) != 2)
+		return -EINVAL;
+
+	return 0;
+}
+
+static int map_files_d_revalidate(struct dentry *dentry, struct nameidata *nd)
+{
+	unsigned long vm_start, vm_end;
+	bool exact_vma_exists = false;
+	struct mm_struct *mm = NULL;
+	struct task_struct *task;
+	const struct cred *cred;
+	struct inode *inode;
+	int status = 0;
+
+	if (nd && nd->flags & LOOKUP_RCU)
+		return -ECHILD;
+
+	inode = dentry->d_inode;
+	task = get_proc_task(inode);
+	if (!task)
+		goto out_notask;
+
+	if (!ptrace_may_access(task, PTRACE_MODE_READ))
+		goto out;
+
+	mm = get_task_mm(task);
+	if (!mm)
+		goto out;
+
+	if (!dname_to_vma_addr(dentry, &vm_start, &vm_end)) {
+		down_read(&mm->mmap_sem);
+		exact_vma_exists = !!find_exact_vma(mm, vm_start, vm_end);
+		up_read(&mm->mmap_sem);
+	}
+
+	mmput(mm);
+
+	if (exact_vma_exists) {
+		if (task_dumpable(task)) {
+			rcu_read_lock();
+			cred = __task_cred(task);
+			inode->i_uid = cred->euid;
+			inode->i_gid = cred->egid;
+			rcu_read_unlock();
+		} else {
+			inode->i_uid = 0;
+			inode->i_gid = 0;
+		}
+		security_task_to_inode(task, inode);
+		status = 1;
+	}
+
+out:
+	put_task_struct(task);
+
+out_notask:
+	if (status <= 0)
+		d_drop(dentry);
+
+	return status;
+}
+
+static const struct dentry_operations tid_map_files_dentry_operations = {
+	.d_revalidate	= map_files_d_revalidate,
+	.d_delete	= pid_delete_dentry,
+};
+
+static int proc_map_files_get_link(struct dentry *dentry, struct path *path)
+{
+	unsigned long vm_start, vm_end;
+	struct vm_area_struct *vma;
+	struct task_struct *task;
+	struct mm_struct *mm;
+	int rc = -ENOENT;
+
+	task = get_proc_task(dentry->d_inode);
+	if (!task)
+		goto out;
+
+	mm = get_task_mm(task);
+	put_task_struct(task);
+	if (!mm)
+		goto out;
+
+	rc = dname_to_vma_addr(dentry, &vm_start, &vm_end);
+	if (rc)
+		goto out_mmput;
+
+	down_read(&mm->mmap_sem);
+	vma = find_exact_vma(mm, vm_start, vm_end);
+	if (vma && vma->vm_file) {
+		*path = vma->vm_file->f_path;
+		path_get(path);
+		rc = 0;
+	}
+	up_read(&mm->mmap_sem);
+
+out_mmput:
+	mmput(mm);
+out:
+	return rc;
+}
+
+struct map_files_info {
+	struct file	*file;
+	unsigned long	len;
+	unsigned char	name[4*sizeof(long)+2]; /* max: %lx-%lx\0 */
+};
+
+static struct dentry *
+proc_map_files_instantiate(struct inode *dir, struct dentry *dentry,
+			   struct task_struct *task, const void *ptr)
+{
+	const struct file *file = ptr;
+	struct proc_inode *ei;
+	struct inode *inode;
+
+	if (!file)
+		return ERR_PTR(-ENOENT);
+
+	inode = proc_pid_make_inode(dir->i_sb, task);
+	if (!inode)
+		return ERR_PTR(-ENOENT);
+
+	ei = PROC_I(inode);
+	ei->op.proc_get_link = proc_map_files_get_link;
+
+	inode->i_op = &proc_pid_link_inode_operations;
+	inode->i_size = 64;
+	inode->i_mode = S_IFLNK;
+
+	/* Read only access even if the former is writable */
+	if (file->f_mode & FMODE_READ)
+		inode->i_mode |= S_IRUSR;
+
+	d_set_d_op(dentry, &tid_map_files_dentry_operations);
+	d_add(dentry, inode);
+
+	return NULL;
+}
+
+static struct dentry *proc_map_files_lookup(struct inode *dir,
+		struct dentry *dentry, struct nameidata *nd)
+{
+	unsigned long vm_start, vm_end;
+	struct task_struct *task;
+	struct vm_area_struct *vma;
+	struct mm_struct *mm;
+	struct dentry *result;
+
+	/* An attempt to write */
+	if (nd->intent.open.flags & (O_CREAT | O_TRUNC | O_ACCMODE)) {
+		result = ERR_PTR(-EACCES);
+		goto out;
+	}
+
+	result = ERR_PTR(-ENOENT);
+	task = get_proc_task(dir);
+	if (!task)
+		goto out;
+
+	result = ERR_PTR(-EACCES);
+	if (lock_trace(task))
+		goto out_put_task;
+
+	result = ERR_PTR(-ENOENT);
+	if (dname_to_vma_addr(dentry, &vm_start, &vm_end))
+		goto out_unlock;
+
+	mm = get_task_mm(task);
+	if (!mm)
+		goto out_unlock;
+
+	down_read(&mm->mmap_sem);
+	vma = find_exact_vma(mm, vm_start, vm_end);
+	if (!vma)
+		goto out_no_vma;
+
+	result = proc_map_files_instantiate(dir, dentry, task, vma->vm_file);
+
+out_no_vma:
+	up_read(&mm->mmap_sem);
+	mmput(mm);
+out_unlock:
+	unlock_trace(task);
+out_put_task:
+	put_task_struct(task);
+out:
+	return result;
+}
+
+static const struct inode_operations proc_map_files_inode_operations = {
+	.lookup		= proc_map_files_lookup,
+	.permission	= proc_fd_permission,
+	.setattr	= proc_setattr,
+};
+
+static int proc_map_files_readdir(struct file *filp, void *dirent, filldir_t filldir)
+{
+	struct dentry *dentry = filp->f_path.dentry;
+	struct inode *inode = dentry->d_inode;
+	struct vm_area_struct *vma;
+	struct task_struct *task;
+	struct mm_struct *mm;
+	ino_t ino;
+	int ret;
+
+	ret = -ENOENT;
+	task = get_proc_task(inode);
+	if (!task)
+		goto out;
+
+	ret = -EACCES;
+	if (lock_trace(task))
+		goto out_put_task;
+
+	ret = 0;
+	switch (filp->f_pos) {
+	case 0:
+		ino = inode->i_ino;
+		if (filldir(dirent, ".", 1, 0, ino, DT_DIR) < 0)
+			goto out_unlock;
+		filp->f_pos++;
+	case 1:
+		ino = parent_ino(dentry);
+		if (filldir(dirent, "..", 2, 1, ino, DT_DIR) < 0)
+			goto out_unlock;
+		filp->f_pos++;
+	default:
+	{
+		unsigned long nr_files, pos, i;
+		struct flex_array *fa = NULL;
+		struct map_files_info info;
+		struct map_files_info *p;
+
+		mm = get_task_mm(task);
+		if (!mm)
+			goto out_unlock;
+		down_read(&mm->mmap_sem);
+
+		nr_files = 0;
+
+		/*
+		 * We need two passes here:
+		 *
+		 *  1) Collect vmas of mapped files with mmap_sem taken
+		 *  2) Release mmap_sem and instantiate entries
+		 *
+		 * otherwise we get lockdep complained, since filldir()
+		 * routine might require mmap_sem taken in might_fault().
+		 */
+
+		for (vma = mm->mmap, pos = 2; vma; vma = vma->vm_next) {
+			if (vma->vm_file && ++pos > filp->f_pos)
+				nr_files++;
+		}
+
+		if (nr_files) {
+			fa = flex_array_alloc(sizeof(info), nr_files, GFP_KERNEL);
+			if (!fa || flex_array_prealloc(fa, 0, nr_files, GFP_KERNEL)) {
+				ret = -ENOMEM;
+				if (fa)
+					flex_array_free(fa);
+				up_read(&mm->mmap_sem);
+				mmput(mm);
+				goto out_unlock;
+			}
+			for (i = 0, vma = mm->mmap, pos = 2; vma; vma = vma->vm_next) {
+				if (!vma->vm_file)
+					continue;
+				if (++pos <= filp->f_pos)
+					continue;
+
+				get_file(vma->vm_file);
+				info.file = vma->vm_file;
+				info.len = snprintf(info.name, sizeof(info.name),
+						    "%lx-%lx", vma->vm_start,
+						    vma->vm_end);
+				if (flex_array_put(fa, i++, &info, GFP_KERNEL))
+					BUG();
+			}
+		}
+		up_read(&mm->mmap_sem);
+
+		for (i = 0; i < nr_files; i++) {
+			p = flex_array_get(fa, i);
+			ret = proc_fill_cache(filp, dirent, filldir,
+					      p->name, p->len,
+					      proc_map_files_instantiate,
+					      task, p->file);
+			if (ret)
+				break;
+			filp->f_pos++;
+			fput(p->file);
+		}
+		for (; i < nr_files; i++) {
+			/*
+			 * In case of error don't forget
+			 * to put rest of file refs.
+			 */
+			p = flex_array_get(fa, i);
+			fput(p->file);
+		}
+		if (fa)
+			flex_array_free(fa);
+		mmput(mm);
+	}
+	}
+
+out_unlock:
+	unlock_trace(task);
+out_put_task:
+	put_task_struct(task);
+out:
+	return ret;
+}
+
+static const struct file_operations proc_map_files_operations = {
+	.read		= generic_read_dir,
+	.readdir	= proc_map_files_readdir,
+	.llseek		= default_llseek,
+};
+
+/*
  * /proc/pid/fd needs a special permission handler so that a process can still
  * access /proc/self/fd after it has executed a setuid().
  */
@@ -2815,6 +3150,7 @@ static const struct inode_operations pro
 static const struct pid_entry tgid_base_stuff[] = {
 	DIR("task",       S_IRUGO|S_IXUGO, proc_task_inode_operations, proc_task_operations),
 	DIR("fd",         S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations),
+	DIR("map_files",  S_IRUSR|S_IXUSR, proc_map_files_inode_operations, proc_map_files_operations),
 	DIR("fdinfo",     S_IRUSR|S_IXUSR, proc_fdinfo_inode_operations, proc_fdinfo_operations),
 	DIR("ns",	  S_IRUSR|S_IXUGO, proc_ns_dir_inode_operations, proc_ns_dir_operations),
 #ifdef CONFIG_NET
Index: linux-2.6.git/include/linux/mm.h
===================================================================
--- linux-2.6.git.orig/include/linux/mm.h
+++ linux-2.6.git/include/linux/mm.h
@@ -1491,6 +1491,18 @@ static inline unsigned long vma_pages(st
 	return (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
 }
 
+/* Look up the first VMA which exactly match the interval vm_start ... vm_end */
+static inline struct vm_area_struct *
+find_exact_vma(struct mm_struct *mm, unsigned long vm_start, unsigned long vm_end)
+{
+	struct vm_area_struct *vma = find_vma(mm, vm_start);
+
+	if (vma && (vma->vm_start != vm_start || vma->vm_end != vm_end))
+		vma = NULL;
+
+	return vma;
+}
+
 #ifdef CONFIG_MMU
 pgprot_t vm_get_page_prot(unsigned long vm_flags);
 #else

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 2/2] fs, proc: Introduce the /proc/<pid>/map_files/ directory v12
  2011-09-15  9:14                   ` Cyrill Gorcunov
@ 2011-09-15  9:27                     ` Vasiliy Kulikov
  2011-09-15 10:29                       ` Cyrill Gorcunov
  0 siblings, 1 reply; 28+ messages in thread
From: Vasiliy Kulikov @ 2011-09-15  9:27 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Pavel Machek, Andrew Morton, linux-kernel, containers,
	linux-fsdevel, Kirill Shutemov, Pavel Emelyanov, James Bottomley,
	Nathan Lynch, Zan Lynx, Daniel Lezcano, Tejun Heo,
	Alexey Dobriyan, Al Viro, Andrew Morton

Hi,

On Thu, Sep 15, 2011 at 13:14 +0400, Cyrill Gorcunov wrote:
> On Wed, Sep 14, 2011 at 08:07:25PM +0400, Vasiliy Kulikov wrote:
> ...
> > 
> > No, I mean something else.  Assume you have a task, which does the
> > steps:
> > 
> > 1) opens some sensitive file as root.  This file is e.g. 0700.
> > 
> > 2) mmaps the file via opened fd, either RO or RW.
> > 
> > 3) closes fd.
> > 
> > 4) drops root.
> > 
> > Now it has a mapping of a privileged file, but cannot get fd of it
> > anyhow.  With map_files/ he may open his own /proc/$$/map_files/, pass
> > ptrace() check, and get fd of the privileged file.  He cannot explicitly
> > open it as it is 0700, but he may open it via map_files/ and get RO/RW
> > fd.
> > 
> 
> Hi Vasiliy, could you please check if the update below address all your
> concerns? Note that we still need at least RO access on such files.
> 
> 	Cyrill
> ---
> fs, proc: Introduce the /proc/<pid>/map_files/ directory v14
> 
> From: Pavel Emelyanov <xemul@parallels.com>
> 
> This one behaves similarly to the /proc/<pid>/fd/ one - it contains symlinks
> one for each mapping with file, the name of a symlink is "vma->vm_start-vma->vm_end",
> the target is the file. Opening a symlink results in a file that point exactly
> to the same inode as them vma's one.
> 
> For example the ls -l of some arbitrary /proc/<pid>/map_files/
> 
>  | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80403000-7f8f80404000 -> /lib64/libc-2.5.so
>  | lr-x------ 1 root root 64 Aug 26 06:40 7f8f8061e000-7f8f80620000 -> /lib64/libselinux.so.1
>  | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80826000-7f8f80827000 -> /lib64/libacl.so.1.1.0
>  | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a2f000-7f8f80a30000 -> /lib64/librt-2.5.so
>  | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a30000-7f8f80a4c000 -> /lib64/ld-2.5.so
> 
> This *helps* checkpointing process in three ways:
> 
> 1. When dumping a task mappings we do know exact file that is mapped by particular
>    region. We do this by opening /proc/$pid/map_files/address symlink the way we do
>    with file descriptors.

s/address/$address/ for consistency.

> 
> 2. This also helps in determining which anonymous shared mappings are shared with
>    each other by comparing the inodes of them.
> 
> 3. When restoring a set of process

s/process/processes/

> in case two of them has a mapping shared, we map
>    the memory by the 1st one and then open its /proc/$pid/map_files/address file and
>    map it by the 2nd task.

How can you restore a set of processes in case they share an RW mapping
as RW in both tasks if you deny opening /proc/$pid/map_files/$address as W?

> Using /proc/$pid/maps for this is quite inconvenient since it brings repeatable
> re-reading and reparsing for this text file which slows down restore procesure
> significantly. Also as being pointed in (3) it is a way easier to use top level
> shared mapping in children as /proc/$pid/map_files/address when needed.
[...]
> v14: (by Vasiliy Kulikov)
>  - for security reason the links are created with FMODE_READ mode
>    only even if the former file has FMODE_WRITE
>  - proc_map_files_lookup fails on any non-read-only queries.

Do you have a PoC of the dumper?  At least without the restorer.  If we
see an implementation of map_files/ user we probably identify what
operation it needs and what security restrictions we have to define.

Thanks,

-- 
Vasiliy Kulikov
http://www.openwall.com - bringing security into open computing environments

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 2/2] fs, proc: Introduce the /proc/<pid>/map_files/ directory v12
  2011-09-15  9:27                     ` Vasiliy Kulikov
@ 2011-09-15 10:29                       ` Cyrill Gorcunov
  2011-09-15 10:56                         ` Vasiliy Kulikov
  0 siblings, 1 reply; 28+ messages in thread
From: Cyrill Gorcunov @ 2011-09-15 10:29 UTC (permalink / raw)
  To: Vasiliy Kulikov
  Cc: Pavel Machek, Andrew Morton, linux-kernel, containers,
	linux-fsdevel, Kirill Shutemov, Pavel Emelyanov, James Bottomley,
	Nathan Lynch, Zan Lynx, Daniel Lezcano, Tejun Heo,
	Alexey Dobriyan, Al Viro, Andrew Morton

On Thu, Sep 15, 2011 at 01:27:57PM +0400, Vasiliy Kulikov wrote:
...
> 
> > in case two of them has a mapping shared, we map
> >    the memory by the 1st one and then open its /proc/$pid/map_files/address file and
> >    map it by the 2nd task.
> 
> How can you restore a set of processes in case they share an RW mapping
> as RW in both tasks if you deny opening /proc/$pid/map_files/$address as W?

I can read the link first to figure out the file path and re-open it as rw via
path itself (which implies the restorer still must have enough rights to open
it as rw).

> 
> > Using /proc/$pid/maps for this is quite inconvenient since it brings repeatable
> > re-reading and reparsing for this text file which slows down restore procesure
> > significantly. Also as being pointed in (3) it is a way easier to use top level
> > shared mapping in children as /proc/$pid/map_files/address when needed.
> [...]
> > v14: (by Vasiliy Kulikov)
> >  - for security reason the links are created with FMODE_READ mode
> >    only even if the former file has FMODE_WRITE
> >  - proc_map_files_lookup fails on any non-read-only queries.
> 
> Do you have a PoC of the dumper?  At least without the restorer.  If we
> see an implementation of map_files/ user we probably identify what
> operation it needs and what security restrictions we have to define.

Yeah, i'll ping you the link (while beign trying various approaches
the code end up in being a pure mess, so until all cleaned up and
works as expected i'll not public it).

	Cyrill

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 2/2] fs, proc: Introduce the /proc/<pid>/map_files/ directory v12
  2011-09-15 10:29                       ` Cyrill Gorcunov
@ 2011-09-15 10:56                         ` Vasiliy Kulikov
  2011-09-15 11:00                           ` Cyrill Gorcunov
  2011-09-15 20:19                           ` Cyrill Gorcunov
  0 siblings, 2 replies; 28+ messages in thread
From: Vasiliy Kulikov @ 2011-09-15 10:56 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Pavel Machek, Andrew Morton, linux-kernel, containers,
	linux-fsdevel, Kirill Shutemov, Pavel Emelyanov, James Bottomley,
	Nathan Lynch, Zan Lynx, Daniel Lezcano, Tejun Heo,
	Alexey Dobriyan, Al Viro, Andrew Morton

On Thu, Sep 15, 2011 at 14:29 +0400, Cyrill Gorcunov wrote:
> On Thu, Sep 15, 2011 at 01:27:57PM +0400, Vasiliy Kulikov wrote:
> ...
> > 
> > > in case two of them has a mapping shared, we map
> > >    the memory by the 1st one and then open its /proc/$pid/map_files/address file and
> > >    map it by the 2nd task.
> > 
> > How can you restore a set of processes in case they share an RW mapping
> > as RW in both tasks if you deny opening /proc/$pid/map_files/$address as W?
> 
> I can read the link first to figure out the file path and re-open it as rw via
> path itself (which implies the restorer still must have enough rights to open
> it as rw).

And what about RW mapping of unlinked files?

-- 
Vasiliy Kulikov
http://www.openwall.com - bringing security into open computing environments

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 2/2] fs, proc: Introduce the /proc/<pid>/map_files/ directory v12
  2011-09-15 10:56                         ` Vasiliy Kulikov
@ 2011-09-15 11:00                           ` Cyrill Gorcunov
  2011-09-15 20:19                           ` Cyrill Gorcunov
  1 sibling, 0 replies; 28+ messages in thread
From: Cyrill Gorcunov @ 2011-09-15 11:00 UTC (permalink / raw)
  To: Vasiliy Kulikov
  Cc: Pavel Machek, Andrew Morton, linux-kernel, containers,
	linux-fsdevel, Kirill Shutemov, Pavel Emelyanov, James Bottomley,
	Nathan Lynch, Zan Lynx, Daniel Lezcano, Tejun Heo,
	Alexey Dobriyan, Al Viro, Andrew Morton

On Thu, Sep 15, 2011 at 02:56:51PM +0400, Vasiliy Kulikov wrote:
...
> > > 
> > > How can you restore a set of processes in case they share an RW mapping
> > > as RW in both tasks if you deny opening /proc/$pid/map_files/$address as W?
> > 
> > I can read the link first to figure out the file path and re-open it as rw via
> > path itself (which implies the restorer still must have enough rights to open
> > it as rw).
> 
> And what about RW mapping of unlinked files?
> 

Need to figure some workaround

	Cyrill

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 2/2] fs, proc: Introduce the /proc/<pid>/map_files/ directory v12
  2011-09-15 10:56                         ` Vasiliy Kulikov
  2011-09-15 11:00                           ` Cyrill Gorcunov
@ 2011-09-15 20:19                           ` Cyrill Gorcunov
  2011-09-16 17:56                             ` Vasiliy Kulikov
  1 sibling, 1 reply; 28+ messages in thread
From: Cyrill Gorcunov @ 2011-09-15 20:19 UTC (permalink / raw)
  To: Vasiliy Kulikov
  Cc: Pavel Machek, Andrew Morton, linux-kernel, containers,
	linux-fsdevel, Kirill Shutemov, Pavel Emelyanov, James Bottomley,
	Nathan Lynch, Zan Lynx, Daniel Lezcano, Tejun Heo,
	Alexey Dobriyan, Al Viro, Andrew Morton

On Thu, Sep 15, 2011 at 02:56:51PM +0400, Vasiliy Kulikov wrote:
...
> > > 
> > > How can you restore a set of processes in case they share an RW mapping
> > > as RW in both tasks if you deny opening /proc/$pid/map_files/$address as W?
> > 
> > I can read the link first to figure out the file path and re-open it as rw via
> > path itself (which implies the restorer still must have enough rights to open
> > it as rw).
> 
> And what about RW mapping of unlinked files?
> 

So we indeed still need RW-capable links there. To break a tie the
CAP_SYS_ADMIN requirement being put into, so without such capabilities
granted there should be no way to mis-use this iterface (still there
is an opportunity to enhance/relax permissions if we ever need).

Vasiliy, check it please. Restored unlinking files is a different target
not addressed by this patch.

	Cyrill
---
fs, proc: Introduce the /proc/<pid>/map_files/ directory v14

From: Pavel Emelyanov <xemul@parallels.com>

This one behaves similarly to the /proc/<pid>/fd/ one - it contains symlinks
one for each mapping with file, the name of a symlink is "vma->vm_start-vma->vm_end",
the target is the file. Opening a symlink results in a file that point exactly
to the same inode as them vma's one.

For example the ls -l of some arbitrary /proc/<pid>/map_files/

 | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80403000-7f8f80404000 -> /lib64/libc-2.5.so
 | lr-x------ 1 root root 64 Aug 26 06:40 7f8f8061e000-7f8f80620000 -> /lib64/libselinux.so.1
 | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80826000-7f8f80827000 -> /lib64/libacl.so.1.1.0
 | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a2f000-7f8f80a30000 -> /lib64/librt-2.5.so
 | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a30000-7f8f80a4c000 -> /lib64/ld-2.5.so

This *helps* checkpointing process in three ways:

1. When dumping a task mappings we do know exact file that is mapped by particular
   region. We do this by opening /proc/$pid/map_files/$address symlink the way we do
   with file descriptors.

2. This also helps in determining which anonymous shared mappings are shared with
   each other by comparing the inodes of them.

3. When restoring a set of processes in case two of them has a mapping shared, we map
   the memory by the 1st one and then open its /proc/$pid/map_files/$address file and
   map it by the 2nd task.

Using /proc/$pid/maps for this is quite inconvenient since it brings repeatable
re-reading and reparsing for this text file which slows down restore procedure
significantly. Also as being pointed in (3) it is a way easier to use top level
shared mapping in children as /proc/$pid/map_files/$address when needed.

v2: (spotted by Tejun Heo)
 - /proc/<pid>/mfd changed to /proc/<pid>/map_files
 - find_vma helper is used instead of linear search
 - routines are re-grouped
 - d_revalidate is set now

v3:
 - d_revalidate reworked, now it should drops no longer valid dentries (Tejun Heo)
 - ptrace_may_access added into proc_map_files_lookup (Vasiliy Kulikov)
 - because of filldir (which eventually might need to lock mmap_sem)
   the proc_map_files_readdir() was reworked to call proc_fill_cache()
   with unlocked mmap_sem

v4: (feedback by Tejun Heo and Vasiliy Kulikov)
 - instead of saving data in proc_inode we rather make a dentry name
   to keep both vm_start and vm_end accordingly
 - d_revalidate now honor task credentials

v5: (feedback by Kirill A. Shutemov)
 - don't forget to release mmap_sem on error path

v6:
 - sizeof get used in map_files_info which shrink member a bit on
   x86-32 (by Kirill A. Shutemov)
 - map_name_to_addr returns -EINVAL instead of -1
   which is more appropriate (by Tejun Heo)

v7:
 - add [get/set]attr handlers for
   proc_map_files_inode_operations (by Vasiliy Kulikov)

v8:
 - Kirill A. Shutemov spotted a parasite semicolon
   which ruined the ptrace_check call, fixed.

v9: (feedback by Andrew Morton)
 - find_exact_vma moved into include/linux/mm.h as an inline helper
 - proc_map_files_setattr uses either kmalloc or vmalloc depending
   on how many objects are to be allocated
 - no more map_name_to_addr but dname_to_vma_addr introduced instead
   and it uses sscanf because in one case the find_exact_vma() is used
   only to confirm existence of vma area the boolean flag is used
 - fancy justification dropped
 - still the proc_map_files_get/setattr leaved untouched
   until additional fd/ patches applied first.

v10: (feedback by Andrew Morton)
 - flex_arrays are used instead of kmalloc/vmalloc calls
 - map_files_d_revalidate use ptrace_may_access for
   security reason (by Vasiliy Kulikov)

v11:
 - should use fput and drop !ret test from a loop code
   (feedback by Andrew Morton)
 - no need for 'used' variable, use existing
   nr_files with file->pos predicate
 - if preallocation fails no need to go further,
   simply release mmap semaphore and jump out

v12:
 - rework map_files_d_revalidate to make sure
   the task get released on return (by Vasiliy Kulikov)

v13:
 - proc_map_files_inode_operations are set to be the same
   as proc_fd_inode_operations, ie to include .permission
   pointing to proc_fd_permission

v14: (by Vasiliy Kulikov)
 - for security reason map_files/ entries are allowed for
   readers with CAP_SYS_ADMIN credentials granted only

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
CC: Tejun Heo <tj@kernel.org>
CC: Vasiliy Kulikov <segoon@openwall.com>
CC: "Kirill A. Shutemov" <kirill@shutemov.name>
CC: Alexey Dobriyan <adobriyan@gmail.com>
CC: Al Viro <viro@ZenIV.linux.org.uk>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Pavel Machek <pavel@ucw.cz>
---
 fs/proc/base.c     |  343 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/mm.h |   12 +
 2 files changed, 355 insertions(+)

Index: linux-2.6.git/fs/proc/base.c
===================================================================
--- linux-2.6.git.orig/fs/proc/base.c
+++ linux-2.6.git/fs/proc/base.c
@@ -83,6 +83,7 @@
 #include <linux/pid_namespace.h>
 #include <linux/fs_struct.h>
 #include <linux/slab.h>
+#include <linux/flex_array.h>
 #ifdef CONFIG_HARDWALL
 #include <asm/hardwall.h>
 #endif
@@ -133,6 +134,8 @@ struct pid_entry {
 		NULL, &proc_single_file_operations,	\
 		{ .proc_show = show } )
 
+static int proc_fd_permission(struct inode *inode, int mask);
+
 /*
  * Count the number of hardlinks for the pid_entry table, excluding the .
  * and .. links.
@@ -2201,6 +2204,345 @@ static const struct file_operations proc
 };
 
 /*
+ * dname_to_vma_addr - maps a dentry name into two unsigned longs
+ * which represent vma start and end addresses.
+ */
+static int dname_to_vma_addr(struct dentry *dentry,
+			     unsigned long *start, unsigned long *end)
+{
+	if (sscanf(dentry->d_name.name, "%lx-%lx", start, end) != 2)
+		return -EINVAL;
+
+	return 0;
+}
+
+static int map_files_d_revalidate(struct dentry *dentry, struct nameidata *nd)
+{
+	unsigned long vm_start, vm_end;
+	bool exact_vma_exists = false;
+	struct mm_struct *mm = NULL;
+	struct task_struct *task;
+	const struct cred *cred;
+	struct inode *inode;
+	int status = 0;
+
+	if (nd && nd->flags & LOOKUP_RCU)
+		return -ECHILD;
+
+	if (!capable(CAP_SYS_ADMIN))
+		goto out_notask;
+
+	inode = dentry->d_inode;
+	task = get_proc_task(inode);
+	if (!task)
+		goto out_notask;
+
+	if (!ptrace_may_access(task, PTRACE_MODE_READ))
+		goto out;
+
+	mm = get_task_mm(task);
+	if (!mm)
+		goto out;
+
+	if (!dname_to_vma_addr(dentry, &vm_start, &vm_end)) {
+		down_read(&mm->mmap_sem);
+		exact_vma_exists = !!find_exact_vma(mm, vm_start, vm_end);
+		up_read(&mm->mmap_sem);
+	}
+
+	mmput(mm);
+
+	if (exact_vma_exists) {
+		if (task_dumpable(task)) {
+			rcu_read_lock();
+			cred = __task_cred(task);
+			inode->i_uid = cred->euid;
+			inode->i_gid = cred->egid;
+			rcu_read_unlock();
+		} else {
+			inode->i_uid = 0;
+			inode->i_gid = 0;
+		}
+		security_task_to_inode(task, inode);
+		status = 1;
+	}
+
+out:
+	put_task_struct(task);
+
+out_notask:
+	if (status <= 0)
+		d_drop(dentry);
+
+	return status;
+}
+
+static const struct dentry_operations tid_map_files_dentry_operations = {
+	.d_revalidate	= map_files_d_revalidate,
+	.d_delete	= pid_delete_dentry,
+};
+
+static int proc_map_files_get_link(struct dentry *dentry, struct path *path)
+{
+	unsigned long vm_start, vm_end;
+	struct vm_area_struct *vma;
+	struct task_struct *task;
+	struct mm_struct *mm;
+	int rc;
+
+	rc = -ENOENT;
+	task = get_proc_task(dentry->d_inode);
+	if (!task)
+		goto out;
+
+	mm = get_task_mm(task);
+	put_task_struct(task);
+	if (!mm)
+		goto out;
+
+	rc = dname_to_vma_addr(dentry, &vm_start, &vm_end);
+	if (rc)
+		goto out_mmput;
+
+	down_read(&mm->mmap_sem);
+	vma = find_exact_vma(mm, vm_start, vm_end);
+	if (vma && vma->vm_file) {
+		*path = vma->vm_file->f_path;
+		path_get(path);
+		rc = 0;
+	}
+	up_read(&mm->mmap_sem);
+
+out_mmput:
+	mmput(mm);
+out:
+	return rc;
+}
+
+struct map_files_info {
+	struct file	*file;
+	unsigned long	len;
+	unsigned char	name[4*sizeof(long)+2]; /* max: %lx-%lx\0 */
+};
+
+static struct dentry *
+proc_map_files_instantiate(struct inode *dir, struct dentry *dentry,
+			   struct task_struct *task, const void *ptr)
+{
+	const struct file *file = ptr;
+	struct proc_inode *ei;
+	struct inode *inode;
+
+	if (!file)
+		return ERR_PTR(-ENOENT);
+
+	inode = proc_pid_make_inode(dir->i_sb, task);
+	if (!inode)
+		return ERR_PTR(-ENOENT);
+
+	ei = PROC_I(inode);
+	ei->op.proc_get_link = proc_map_files_get_link;
+
+	inode->i_op = &proc_pid_link_inode_operations;
+	inode->i_size = 64;
+	inode->i_mode = S_IFLNK;
+
+	if (file->f_mode & FMODE_READ)
+		inode->i_mode |= S_IRUSR;
+	if (file->f_mode & FMODE_WRITE)
+		inode->i_mode |= S_IWUSR;
+
+	d_set_d_op(dentry, &tid_map_files_dentry_operations);
+	d_add(dentry, inode);
+
+	return NULL;
+}
+
+static struct dentry *proc_map_files_lookup(struct inode *dir,
+		struct dentry *dentry, struct nameidata *nd)
+{
+	unsigned long vm_start, vm_end;
+	struct vm_area_struct *vma;
+	struct task_struct *task;
+	struct dentry *result;
+	struct mm_struct *mm;
+
+	result = ERR_PTR(-EACCES);
+	if (!capable(CAP_SYS_ADMIN))
+		goto out;
+
+	result = ERR_PTR(-ENOENT);
+	task = get_proc_task(dir);
+	if (!task)
+		goto out;
+
+	result = ERR_PTR(-EACCES);
+	if (lock_trace(task))
+		goto out_put_task;
+
+	result = ERR_PTR(-ENOENT);
+	if (dname_to_vma_addr(dentry, &vm_start, &vm_end))
+		goto out_unlock;
+
+	mm = get_task_mm(task);
+	if (!mm)
+		goto out_unlock;
+
+	down_read(&mm->mmap_sem);
+	vma = find_exact_vma(mm, vm_start, vm_end);
+	if (!vma)
+		goto out_no_vma;
+
+	result = proc_map_files_instantiate(dir, dentry, task, vma->vm_file);
+
+out_no_vma:
+	up_read(&mm->mmap_sem);
+	mmput(mm);
+out_unlock:
+	unlock_trace(task);
+out_put_task:
+	put_task_struct(task);
+out:
+	return result;
+}
+
+static const struct inode_operations proc_map_files_inode_operations = {
+	.lookup		= proc_map_files_lookup,
+	.permission	= proc_fd_permission,
+	.setattr	= proc_setattr,
+};
+
+static int proc_map_files_readdir(struct file *filp, void *dirent, filldir_t filldir)
+{
+	struct dentry *dentry = filp->f_path.dentry;
+	struct inode *inode = dentry->d_inode;
+	struct vm_area_struct *vma;
+	struct task_struct *task;
+	struct mm_struct *mm;
+	ino_t ino;
+	int ret;
+
+	ret = -EACCES;
+	if (!capable(CAP_SYS_ADMIN))
+		goto out;
+
+	ret = -ENOENT;
+	task = get_proc_task(inode);
+	if (!task)
+		goto out;
+
+	ret = -EACCES;
+	if (lock_trace(task))
+		goto out_put_task;
+
+	ret = 0;
+	switch (filp->f_pos) {
+	case 0:
+		ino = inode->i_ino;
+		if (filldir(dirent, ".", 1, 0, ino, DT_DIR) < 0)
+			goto out_unlock;
+		filp->f_pos++;
+	case 1:
+		ino = parent_ino(dentry);
+		if (filldir(dirent, "..", 2, 1, ino, DT_DIR) < 0)
+			goto out_unlock;
+		filp->f_pos++;
+	default:
+	{
+		unsigned long nr_files, pos, i;
+		struct flex_array *fa = NULL;
+		struct map_files_info info;
+		struct map_files_info *p;
+
+		mm = get_task_mm(task);
+		if (!mm)
+			goto out_unlock;
+		down_read(&mm->mmap_sem);
+
+		nr_files = 0;
+
+		/*
+		 * We need two passes here:
+		 *
+		 *  1) Collect vmas of mapped files with mmap_sem taken
+		 *  2) Release mmap_sem and instantiate entries
+		 *
+		 * otherwise we get lockdep complained, since filldir()
+		 * routine might require mmap_sem taken in might_fault().
+		 */
+
+		for (vma = mm->mmap, pos = 2; vma; vma = vma->vm_next) {
+			if (vma->vm_file && ++pos > filp->f_pos)
+				nr_files++;
+		}
+
+		if (nr_files) {
+			fa = flex_array_alloc(sizeof(info), nr_files, GFP_KERNEL);
+			if (!fa || flex_array_prealloc(fa, 0, nr_files, GFP_KERNEL)) {
+				ret = -ENOMEM;
+				if (fa)
+					flex_array_free(fa);
+				up_read(&mm->mmap_sem);
+				mmput(mm);
+				goto out_unlock;
+			}
+			for (i = 0, vma = mm->mmap, pos = 2; vma; vma = vma->vm_next) {
+				if (!vma->vm_file)
+					continue;
+				if (++pos <= filp->f_pos)
+					continue;
+
+				get_file(vma->vm_file);
+				info.file = vma->vm_file;
+				info.len = snprintf(info.name, sizeof(info.name),
+						    "%lx-%lx", vma->vm_start,
+						    vma->vm_end);
+				if (flex_array_put(fa, i++, &info, GFP_KERNEL))
+					BUG();
+			}
+		}
+		up_read(&mm->mmap_sem);
+
+		for (i = 0; i < nr_files; i++) {
+			p = flex_array_get(fa, i);
+			ret = proc_fill_cache(filp, dirent, filldir,
+					      p->name, p->len,
+					      proc_map_files_instantiate,
+					      task, p->file);
+			if (ret)
+				break;
+			filp->f_pos++;
+			fput(p->file);
+		}
+		for (; i < nr_files; i++) {
+			/*
+			 * In case of error don't forget
+			 * to put rest of file refs.
+			 */
+			p = flex_array_get(fa, i);
+			fput(p->file);
+		}
+		if (fa)
+			flex_array_free(fa);
+		mmput(mm);
+	}
+	}
+
+out_unlock:
+	unlock_trace(task);
+out_put_task:
+	put_task_struct(task);
+out:
+	return ret;
+}
+
+static const struct file_operations proc_map_files_operations = {
+	.read		= generic_read_dir,
+	.readdir	= proc_map_files_readdir,
+	.llseek		= default_llseek,
+};
+
+/*
  * /proc/pid/fd needs a special permission handler so that a process can still
  * access /proc/self/fd after it has executed a setuid().
  */
@@ -2815,6 +3157,7 @@ static const struct inode_operations pro
 static const struct pid_entry tgid_base_stuff[] = {
 	DIR("task",       S_IRUGO|S_IXUGO, proc_task_inode_operations, proc_task_operations),
 	DIR("fd",         S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations),
+	DIR("map_files",  S_IRUSR|S_IXUSR, proc_map_files_inode_operations, proc_map_files_operations),
 	DIR("fdinfo",     S_IRUSR|S_IXUSR, proc_fdinfo_inode_operations, proc_fdinfo_operations),
 	DIR("ns",	  S_IRUSR|S_IXUGO, proc_ns_dir_inode_operations, proc_ns_dir_operations),
 #ifdef CONFIG_NET
Index: linux-2.6.git/include/linux/mm.h
===================================================================
--- linux-2.6.git.orig/include/linux/mm.h
+++ linux-2.6.git/include/linux/mm.h
@@ -1491,6 +1491,18 @@ static inline unsigned long vma_pages(st
 	return (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
 }
 
+/* Look up the first VMA which exactly match the interval vm_start ... vm_end */
+static inline struct vm_area_struct *
+find_exact_vma(struct mm_struct *mm, unsigned long vm_start, unsigned long vm_end)
+{
+	struct vm_area_struct *vma = find_vma(mm, vm_start);
+
+	if (vma && (vma->vm_start != vm_start || vma->vm_end != vm_end))
+		vma = NULL;
+
+	return vma;
+}
+
 #ifdef CONFIG_MMU
 pgprot_t vm_get_page_prot(unsigned long vm_flags);
 #else

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 2/2] fs, proc: Introduce the /proc/<pid>/map_files/ directory v12
  2011-09-15 20:19                           ` Cyrill Gorcunov
@ 2011-09-16 17:56                             ` Vasiliy Kulikov
  2011-09-16 18:07                               ` Cyrill Gorcunov
  0 siblings, 1 reply; 28+ messages in thread
From: Vasiliy Kulikov @ 2011-09-16 17:56 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Pavel Machek, Andrew Morton, linux-kernel, containers,
	linux-fsdevel, Kirill Shutemov, Pavel Emelyanov, James Bottomley,
	Nathan Lynch, Zan Lynx, Daniel Lezcano, Tejun Heo,
	Alexey Dobriyan, Al Viro, Andrew Morton

Hi Cyrill,

On Fri, Sep 16, 2011 at 00:19 +0400, Cyrill Gorcunov wrote:
> On Thu, Sep 15, 2011 at 02:56:51PM +0400, Vasiliy Kulikov wrote:
> ...
> > > > 
> > > > How can you restore a set of processes in case they share an RW mapping
> > > > as RW in both tasks if you deny opening /proc/$pid/map_files/$address as W?
> > > 
> > > I can read the link first to figure out the file path and re-open it as rw via
> > > path itself (which implies the restorer still must have enough rights to open
> > > it as rw).
> > 
> > And what about RW mapping of unlinked files?
> > 
> 
> So we indeed still need RW-capable links there. To break a tie the
> CAP_SYS_ADMIN requirement being put into, so without such capabilities
> granted there should be no way to mis-use this iterface (still there
> is an opportunity to enhance/relax permissions if we ever need).
> 
> Vasiliy, check it please. Restored unlinking files is a different target
> not addressed by this patch.
> 
> 	Cyrill
> ---
> fs, proc: Introduce the /proc/<pid>/map_files/ directory v14
> 
> From: Pavel Emelyanov <xemul@parallels.com>
> 
> This one behaves similarly to the /proc/<pid>/fd/ one - it contains symlinks
> one for each mapping with file, the name of a symlink is "vma->vm_start-vma->vm_end",
> the target is the file. Opening a symlink results in a file that point exactly
> to the same inode as them vma's one.
> 
> For example the ls -l of some arbitrary /proc/<pid>/map_files/
> 
>  | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80403000-7f8f80404000 -> /lib64/libc-2.5.so
>  | lr-x------ 1 root root 64 Aug 26 06:40 7f8f8061e000-7f8f80620000 -> /lib64/libselinux.so.1
>  | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80826000-7f8f80827000 -> /lib64/libacl.so.1.1.0
>  | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a2f000-7f8f80a30000 -> /lib64/librt-2.5.so
>  | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a30000-7f8f80a4c000 -> /lib64/ld-2.5.so
> 
> This *helps* checkpointing process in three ways:
> 
> 1. When dumping a task mappings we do know exact file that is mapped by particular
>    region. We do this by opening /proc/$pid/map_files/$address symlink the way we do
>    with file descriptors.
> 
> 2. This also helps in determining which anonymous shared mappings are shared with
>    each other by comparing the inodes of them.
> 
> 3. When restoring a set of processes in case two of them has a mapping shared, we map
>    the memory by the 1st one and then open its /proc/$pid/map_files/$address file and
>    map it by the 2nd task.
> 
> Using /proc/$pid/maps for this is quite inconvenient since it brings repeatable
> re-reading and reparsing for this text file which slows down restore procedure
> significantly. Also as being pointed in (3) it is a way easier to use top level
> shared mapping in children as /proc/$pid/map_files/$address when needed.
> 
> v2: (spotted by Tejun Heo)
>  - /proc/<pid>/mfd changed to /proc/<pid>/map_files
>  - find_vma helper is used instead of linear search
>  - routines are re-grouped
>  - d_revalidate is set now
> 
> v3:
>  - d_revalidate reworked, now it should drops no longer valid dentries (Tejun Heo)
>  - ptrace_may_access added into proc_map_files_lookup (Vasiliy Kulikov)
>  - because of filldir (which eventually might need to lock mmap_sem)
>    the proc_map_files_readdir() was reworked to call proc_fill_cache()
>    with unlocked mmap_sem
> 
> v4: (feedback by Tejun Heo and Vasiliy Kulikov)
>  - instead of saving data in proc_inode we rather make a dentry name
>    to keep both vm_start and vm_end accordingly
>  - d_revalidate now honor task credentials
> 
> v5: (feedback by Kirill A. Shutemov)
>  - don't forget to release mmap_sem on error path
> 
> v6:
>  - sizeof get used in map_files_info which shrink member a bit on
>    x86-32 (by Kirill A. Shutemov)
>  - map_name_to_addr returns -EINVAL instead of -1
>    which is more appropriate (by Tejun Heo)
> 
> v7:
>  - add [get/set]attr handlers for
>    proc_map_files_inode_operations (by Vasiliy Kulikov)
> 
> v8:
>  - Kirill A. Shutemov spotted a parasite semicolon
>    which ruined the ptrace_check call, fixed.
> 
> v9: (feedback by Andrew Morton)
>  - find_exact_vma moved into include/linux/mm.h as an inline helper
>  - proc_map_files_setattr uses either kmalloc or vmalloc depending
>    on how many objects are to be allocated
>  - no more map_name_to_addr but dname_to_vma_addr introduced instead
>    and it uses sscanf because in one case the find_exact_vma() is used
>    only to confirm existence of vma area the boolean flag is used
>  - fancy justification dropped
>  - still the proc_map_files_get/setattr leaved untouched
>    until additional fd/ patches applied first.
> 
> v10: (feedback by Andrew Morton)
>  - flex_arrays are used instead of kmalloc/vmalloc calls
>  - map_files_d_revalidate use ptrace_may_access for
>    security reason (by Vasiliy Kulikov)
> 
> v11:
>  - should use fput and drop !ret test from a loop code
>    (feedback by Andrew Morton)
>  - no need for 'used' variable, use existing
>    nr_files with file->pos predicate
>  - if preallocation fails no need to go further,
>    simply release mmap semaphore and jump out
> 
> v12:
>  - rework map_files_d_revalidate to make sure
>    the task get released on return (by Vasiliy Kulikov)
> 
> v13:
>  - proc_map_files_inode_operations are set to be the same
>    as proc_fd_inode_operations, ie to include .permission
>    pointing to proc_fd_permission
> 
> v14: (by Vasiliy Kulikov)
>  - for security reason map_files/ entries are allowed for
>    readers with CAP_SYS_ADMIN credentials granted only

This changelog is currently much longer than the commit description text ;)

> Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
> CC: Tejun Heo <tj@kernel.org>
> CC: Vasiliy Kulikov <segoon@openwall.com>
> CC: "Kirill A. Shutemov" <kirill@shutemov.name>
> CC: Alexey Dobriyan <adobriyan@gmail.com>
> CC: Al Viro <viro@ZenIV.linux.org.uk>
> CC: Andrew Morton <akpm@linux-foundation.org>
> CC: Pavel Machek <pavel@ucw.cz>
> ---
>  fs/proc/base.c     |  343 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/mm.h |   12 +
>  2 files changed, 355 insertions(+)
> 
> Index: linux-2.6.git/fs/proc/base.c
> ===================================================================
> --- linux-2.6.git.orig/fs/proc/base.c
> +++ linux-2.6.git/fs/proc/base.c
> @@ -83,6 +83,7 @@
>  #include <linux/pid_namespace.h>
>  #include <linux/fs_struct.h>
>  #include <linux/slab.h>
> +#include <linux/flex_array.h>
>  #ifdef CONFIG_HARDWALL
>  #include <asm/hardwall.h>
>  #endif
> @@ -133,6 +134,8 @@ struct pid_entry {
>  		NULL, &proc_single_file_operations,	\
>  		{ .proc_show = show } )
>  
> +static int proc_fd_permission(struct inode *inode, int mask);
> +
>  /*
>   * Count the number of hardlinks for the pid_entry table, excluding the .
>   * and .. links.
> @@ -2201,6 +2204,345 @@ static const struct file_operations proc
>  };
>  
>  /*
> + * dname_to_vma_addr - maps a dentry name into two unsigned longs
> + * which represent vma start and end addresses.
> + */
> +static int dname_to_vma_addr(struct dentry *dentry,
> +			     unsigned long *start, unsigned long *end)
> +{
> +	if (sscanf(dentry->d_name.name, "%lx-%lx", start, end) != 2)
> +		return -EINVAL;
> +
> +	return 0;
> +}
> +
> +static int map_files_d_revalidate(struct dentry *dentry, struct nameidata *nd)
> +{
> +	unsigned long vm_start, vm_end;
> +	bool exact_vma_exists = false;
> +	struct mm_struct *mm = NULL;
> +	struct task_struct *task;
> +	const struct cred *cred;
> +	struct inode *inode;
> +	int status = 0;
> +
> +	if (nd && nd->flags & LOOKUP_RCU)
> +		return -ECHILD;
> +
> +	if (!capable(CAP_SYS_ADMIN))
> +		goto out_notask;

As I said off-list, it's a pitty that only a root dumper process may
dump a task.  However, for the specific usecase - C/R - it should be OK.

> +
> + inode = dentry->d_inode;
> + task = get_proc_task(inode);
> + if (!task)
> +         goto out_notask;
> +
> + if (!ptrace_may_access(task, PTRACE_MODE_READ))
> +         goto out;

While this is not needed with capable() check, it's OK to keep it for
the future more finegranted access checks.

BTW, not a big deal, but probably you should return -EACCES on
!capable() as file presence is not an issue in this case.

    if (!ptrace_may_access(task, PTRACE_MODE_READ))
        goto out_notask;

    status = -EACCES;
    if (!capable(CAP_SYS_ADMIN))
        goto out_notask;

    status = 0;


Thanks,

-- 
Vasiliy Kulikov
http://www.openwall.com - bringing security into open computing environments

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 2/2] fs, proc: Introduce the /proc/<pid>/map_files/ directory v12
  2011-09-16 17:56                             ` Vasiliy Kulikov
@ 2011-09-16 18:07                               ` Cyrill Gorcunov
  2011-09-16 18:11                                 ` Vasiliy Kulikov
  0 siblings, 1 reply; 28+ messages in thread
From: Cyrill Gorcunov @ 2011-09-16 18:07 UTC (permalink / raw)
  To: Vasiliy Kulikov
  Cc: Pavel Machek, Andrew Morton, linux-kernel, containers,
	linux-fsdevel, Kirill Shutemov, Pavel Emelyanov, James Bottomley,
	Nathan Lynch, Zan Lynx, Daniel Lezcano, Tejun Heo,
	Alexey Dobriyan, Al Viro, Andrew Morton

On Fri, Sep 16, 2011 at 09:56:18PM +0400, Vasiliy Kulikov wrote:
...
> > 
> > v14: (by Vasiliy Kulikov)
> >  - for security reason map_files/ entries are allowed for
> >    readers with CAP_SYS_ADMIN credentials granted only
> 
> This changelog is currently much longer than the commit description text ;)

Yes, I know ;) I would like to keep it (to appreciate everyone who spent time
in review and feedback).
...
> 
> > +
> > + inode = dentry->d_inode;
> > + task = get_proc_task(inode);
> > + if (!task)
> > +         goto out_notask;
> > +
> > + if (!ptrace_may_access(task, PTRACE_MODE_READ))
> > +         goto out;
> 
> While this is not needed with capable() check, it's OK to keep it for
> the future more finegranted access checks.

yeah

> 
> BTW, not a big deal, but probably you should return -EACCES on
> !capable() as file presence is not an issue in this case.
> 
>     if (!ptrace_may_access(task, PTRACE_MODE_READ))
>         goto out_notask;
> 
>     status = -EACCES;
>     if (!capable(CAP_SYS_ADMIN))
>         goto out_notask;
> 
>     status = 0;
> 
> 

That's not a proble to fix it actually. So can I fix it and
put some tage here (Reviewed or something?).

	Cyrill

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 2/2] fs, proc: Introduce the /proc/<pid>/map_files/ directory v12
  2011-09-16 18:07                               ` Cyrill Gorcunov
@ 2011-09-16 18:11                                 ` Vasiliy Kulikov
  2011-09-16 18:26                                   ` Cyrill Gorcunov
  0 siblings, 1 reply; 28+ messages in thread
From: Vasiliy Kulikov @ 2011-09-16 18:11 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Pavel Machek, Andrew Morton, linux-kernel, containers,
	linux-fsdevel, Kirill Shutemov, Pavel Emelyanov, James Bottomley,
	Nathan Lynch, Zan Lynx, Daniel Lezcano, Tejun Heo,
	Alexey Dobriyan, Al Viro, Andrew Morton

On Fri, Sep 16, 2011 at 22:07 +0400, Cyrill Gorcunov wrote:
> > BTW, not a big deal, but probably you should return -EACCES on
> > !capable() as file presence is not an issue in this case.
> > 
> >     if (!ptrace_may_access(task, PTRACE_MODE_READ))
> >         goto out_notask;
> > 
> >     status = -EACCES;
> >     if (!capable(CAP_SYS_ADMIN))
> >         goto out_notask;
> > 
> >     status = 0;
> > 
> > 
> 
> That's not a proble to fix it actually. So can I fix it and
> put some tage here (Reviewed or something?).

Yep, with CAP_SYS_ADMIN check there should be no issues here.

Reviewed-by: Vasiliy Kulikov <segoon@openwall.com>

Thanks,

-- 
Vasiliy Kulikov
http://www.openwall.com - bringing security into open computing environments

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 2/2] fs, proc: Introduce the /proc/<pid>/map_files/ directory v12
  2011-09-16 18:11                                 ` Vasiliy Kulikov
@ 2011-09-16 18:26                                   ` Cyrill Gorcunov
  2011-09-16 18:31                                     ` Kirill A. Shutemov
  0 siblings, 1 reply; 28+ messages in thread
From: Cyrill Gorcunov @ 2011-09-16 18:26 UTC (permalink / raw)
  To: Vasiliy Kulikov, Andrew Morton
  Cc: Pavel Machek, Andrew Morton, linux-kernel, containers,
	linux-fsdevel, Kirill Shutemov, Pavel Emelyanov, James Bottomley,
	Nathan Lynch, Zan Lynx, Daniel Lezcano, Tejun Heo,
	Alexey Dobriyan, Al Viro

On Fri, Sep 16, 2011 at 10:11:46PM +0400, Vasiliy Kulikov wrote:
...
> 
> Yep, with CAP_SYS_ADMIN check there should be no issues here.
> 
> Reviewed-by: Vasiliy Kulikov <segoon@openwall.com>
> 

Here we go. Andrew, should I resend the whole series (ie two patches?)

	Cyrill
---
From: Pavel Emelyanov <xemul@parallels.com>
Subject: [PATCH] fs, proc: Introduce the /proc/<pid>/map_files/ directory v14

This one behaves similarly to the /proc/<pid>/fd/ one - it contains symlinks
one for each mapping with file, the name of a symlink is "vma->vm_start-vma->vm_end",
the target is the file. Opening a symlink results in a file that point exactly
to the same inode as them vma's one.

For example the ls -l of some arbitrary /proc/<pid>/map_files/

 | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80403000-7f8f80404000 -> /lib64/libc-2.5.so
 | lr-x------ 1 root root 64 Aug 26 06:40 7f8f8061e000-7f8f80620000 -> /lib64/libselinux.so.1
 | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80826000-7f8f80827000 -> /lib64/libacl.so.1.1.0
 | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a2f000-7f8f80a30000 -> /lib64/librt-2.5.so
 | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a30000-7f8f80a4c000 -> /lib64/ld-2.5.so

This *helps* checkpointing process in three ways:

1. When dumping a task mappings we do know exact file that is mapped by particular
   region. We do this by opening /proc/$pid/map_files/$address symlink the way we do
   with file descriptors.

2. This also helps in determining which anonymous shared mappings are shared with
   each other by comparing the inodes of them.

3. When restoring a set of processes in case two of them has a mapping shared, we map
   the memory by the 1st one and then open its /proc/$pid/map_files/$address file and
   map it by the 2nd task.

Using /proc/$pid/maps for this is quite inconvenient since it brings repeatable
re-reading and reparsing for this text file which slows down restore procedure
significantly. Also as being pointed in (3) it is a way easier to use top level
shared mapping in children as /proc/$pid/map_files/$address when needed.

v2: (spotted by Tejun Heo)
 - /proc/<pid>/mfd changed to /proc/<pid>/map_files
 - find_vma helper is used instead of linear search
 - routines are re-grouped
 - d_revalidate is set now

v3:
 - d_revalidate reworked, now it should drops no longer valid dentries (Tejun Heo)
 - ptrace_may_access added into proc_map_files_lookup (Vasiliy Kulikov)
 - because of filldir (which eventually might need to lock mmap_sem)
   the proc_map_files_readdir() was reworked to call proc_fill_cache()
   with unlocked mmap_sem

v4: (feedback by Tejun Heo and Vasiliy Kulikov)
 - instead of saving data in proc_inode we rather make a dentry name
   to keep both vm_start and vm_end accordingly
 - d_revalidate now honor task credentials

v5: (feedback by Kirill A. Shutemov)
 - don't forget to release mmap_sem on error path

v6:
 - sizeof get used in map_files_info which shrink member a bit on
   x86-32 (by Kirill A. Shutemov)
 - map_name_to_addr returns -EINVAL instead of -1
   which is more appropriate (by Tejun Heo)

v7:
 - add [get/set]attr handlers for
   proc_map_files_inode_operations (by Vasiliy Kulikov)

v8:
 - Kirill A. Shutemov spotted a parasite semicolon
   which ruined the ptrace_check call, fixed.

v9: (feedback by Andrew Morton)
 - find_exact_vma moved into include/linux/mm.h as an inline helper
 - proc_map_files_setattr uses either kmalloc or vmalloc depending
   on how many objects are to be allocated
 - no more map_name_to_addr but dname_to_vma_addr introduced instead
   and it uses sscanf because in one case the find_exact_vma() is used
   only to confirm existence of vma area the boolean flag is used
 - fancy justification dropped
 - still the proc_map_files_get/setattr leaved untouched
   until additional fd/ patches applied first.

v10: (feedback by Andrew Morton)
 - flex_arrays are used instead of kmalloc/vmalloc calls
 - map_files_d_revalidate use ptrace_may_access for
   security reason (by Vasiliy Kulikov)

v11:
 - should use fput and drop !ret test from a loop code
   (feedback by Andrew Morton)
 - no need for 'used' variable, use existing
   nr_files with file->pos predicate
 - if preallocation fails no need to go further,
   simply release mmap semaphore and jump out

v12:
 - rework map_files_d_revalidate to make sure
   the task get released on return (by Vasiliy Kulikov)

v13:
 - proc_map_files_inode_operations are set to be the same
   as proc_fd_inode_operations, ie to include .permission
   pointing to proc_fd_permission

v14: (by Vasiliy Kulikov)
 - for security reason map_files/ entries are allowed for
   readers with CAP_SYS_ADMIN credentials granted only

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Reviewed-by: Vasiliy Kulikov <segoon@openwall.com>
CC: Tejun Heo <tj@kernel.org>
CC: Vasiliy Kulikov <segoon@openwall.com>
CC: "Kirill A. Shutemov" <kirill@shutemov.name>
CC: Alexey Dobriyan <adobriyan@gmail.com>
CC: Al Viro <viro@ZenIV.linux.org.uk>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Pavel Machek <pavel@ucw.cz>
---
 fs/proc/base.c     |  345 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/mm.h |   12 +
 2 files changed, 357 insertions(+)

Index: linux-2.6.git/fs/proc/base.c
===================================================================
--- linux-2.6.git.orig/fs/proc/base.c
+++ linux-2.6.git/fs/proc/base.c
@@ -83,6 +83,7 @@
 #include <linux/pid_namespace.h>
 #include <linux/fs_struct.h>
 #include <linux/slab.h>
+#include <linux/flex_array.h>
 #ifdef CONFIG_HARDWALL
 #include <asm/hardwall.h>
 #endif
@@ -133,6 +134,8 @@ struct pid_entry {
 		NULL, &proc_single_file_operations,	\
 		{ .proc_show = show } )
 
+static int proc_fd_permission(struct inode *inode, int mask);
+
 /*
  * Count the number of hardlinks for the pid_entry table, excluding the .
  * and .. links.
@@ -2201,6 +2204,347 @@ static const struct file_operations proc
 };
 
 /*
+ * dname_to_vma_addr - maps a dentry name into two unsigned longs
+ * which represent vma start and end addresses.
+ */
+static int dname_to_vma_addr(struct dentry *dentry,
+			     unsigned long *start, unsigned long *end)
+{
+	if (sscanf(dentry->d_name.name, "%lx-%lx", start, end) != 2)
+		return -EINVAL;
+
+	return 0;
+}
+
+static int map_files_d_revalidate(struct dentry *dentry, struct nameidata *nd)
+{
+	unsigned long vm_start, vm_end;
+	bool exact_vma_exists = false;
+	struct mm_struct *mm = NULL;
+	struct task_struct *task;
+	const struct cred *cred;
+	struct inode *inode;
+	int status = 0;
+
+	if (nd && nd->flags & LOOKUP_RCU)
+		return -ECHILD;
+
+	if (!capable(CAP_SYS_ADMIN)) {
+		status = -EACCES;
+		goto out_notask;
+	}
+
+	inode = dentry->d_inode;
+	task = get_proc_task(inode);
+	if (!task)
+		goto out_notask;
+
+	if (!ptrace_may_access(task, PTRACE_MODE_READ))
+		goto out;
+
+	mm = get_task_mm(task);
+	if (!mm)
+		goto out;
+
+	if (!dname_to_vma_addr(dentry, &vm_start, &vm_end)) {
+		down_read(&mm->mmap_sem);
+		exact_vma_exists = !!find_exact_vma(mm, vm_start, vm_end);
+		up_read(&mm->mmap_sem);
+	}
+
+	mmput(mm);
+
+	if (exact_vma_exists) {
+		if (task_dumpable(task)) {
+			rcu_read_lock();
+			cred = __task_cred(task);
+			inode->i_uid = cred->euid;
+			inode->i_gid = cred->egid;
+			rcu_read_unlock();
+		} else {
+			inode->i_uid = 0;
+			inode->i_gid = 0;
+		}
+		security_task_to_inode(task, inode);
+		status = 1;
+	}
+
+out:
+	put_task_struct(task);
+
+out_notask:
+	if (status <= 0)
+		d_drop(dentry);
+
+	return status;
+}
+
+static const struct dentry_operations tid_map_files_dentry_operations = {
+	.d_revalidate	= map_files_d_revalidate,
+	.d_delete	= pid_delete_dentry,
+};
+
+static int proc_map_files_get_link(struct dentry *dentry, struct path *path)
+{
+	unsigned long vm_start, vm_end;
+	struct vm_area_struct *vma;
+	struct task_struct *task;
+	struct mm_struct *mm;
+	int rc;
+
+	rc = -ENOENT;
+	task = get_proc_task(dentry->d_inode);
+	if (!task)
+		goto out;
+
+	mm = get_task_mm(task);
+	put_task_struct(task);
+	if (!mm)
+		goto out;
+
+	rc = dname_to_vma_addr(dentry, &vm_start, &vm_end);
+	if (rc)
+		goto out_mmput;
+
+	down_read(&mm->mmap_sem);
+	vma = find_exact_vma(mm, vm_start, vm_end);
+	if (vma && vma->vm_file) {
+		*path = vma->vm_file->f_path;
+		path_get(path);
+		rc = 0;
+	}
+	up_read(&mm->mmap_sem);
+
+out_mmput:
+	mmput(mm);
+out:
+	return rc;
+}
+
+struct map_files_info {
+	struct file	*file;
+	unsigned long	len;
+	unsigned char	name[4*sizeof(long)+2]; /* max: %lx-%lx\0 */
+};
+
+static struct dentry *
+proc_map_files_instantiate(struct inode *dir, struct dentry *dentry,
+			   struct task_struct *task, const void *ptr)
+{
+	const struct file *file = ptr;
+	struct proc_inode *ei;
+	struct inode *inode;
+
+	if (!file)
+		return ERR_PTR(-ENOENT);
+
+	inode = proc_pid_make_inode(dir->i_sb, task);
+	if (!inode)
+		return ERR_PTR(-ENOENT);
+
+	ei = PROC_I(inode);
+	ei->op.proc_get_link = proc_map_files_get_link;
+
+	inode->i_op = &proc_pid_link_inode_operations;
+	inode->i_size = 64;
+	inode->i_mode = S_IFLNK;
+
+	if (file->f_mode & FMODE_READ)
+		inode->i_mode |= S_IRUSR;
+	if (file->f_mode & FMODE_WRITE)
+		inode->i_mode |= S_IWUSR;
+
+	d_set_d_op(dentry, &tid_map_files_dentry_operations);
+	d_add(dentry, inode);
+
+	return NULL;
+}
+
+static struct dentry *proc_map_files_lookup(struct inode *dir,
+		struct dentry *dentry, struct nameidata *nd)
+{
+	unsigned long vm_start, vm_end;
+	struct vm_area_struct *vma;
+	struct task_struct *task;
+	struct dentry *result;
+	struct mm_struct *mm;
+
+	result = ERR_PTR(-EACCES);
+	if (!capable(CAP_SYS_ADMIN))
+		goto out;
+
+	result = ERR_PTR(-ENOENT);
+	task = get_proc_task(dir);
+	if (!task)
+		goto out;
+
+	result = ERR_PTR(-EACCES);
+	if (lock_trace(task))
+		goto out_put_task;
+
+	result = ERR_PTR(-ENOENT);
+	if (dname_to_vma_addr(dentry, &vm_start, &vm_end))
+		goto out_unlock;
+
+	mm = get_task_mm(task);
+	if (!mm)
+		goto out_unlock;
+
+	down_read(&mm->mmap_sem);
+	vma = find_exact_vma(mm, vm_start, vm_end);
+	if (!vma)
+		goto out_no_vma;
+
+	result = proc_map_files_instantiate(dir, dentry, task, vma->vm_file);
+
+out_no_vma:
+	up_read(&mm->mmap_sem);
+	mmput(mm);
+out_unlock:
+	unlock_trace(task);
+out_put_task:
+	put_task_struct(task);
+out:
+	return result;
+}
+
+static const struct inode_operations proc_map_files_inode_operations = {
+	.lookup		= proc_map_files_lookup,
+	.permission	= proc_fd_permission,
+	.setattr	= proc_setattr,
+};
+
+static int proc_map_files_readdir(struct file *filp, void *dirent, filldir_t filldir)
+{
+	struct dentry *dentry = filp->f_path.dentry;
+	struct inode *inode = dentry->d_inode;
+	struct vm_area_struct *vma;
+	struct task_struct *task;
+	struct mm_struct *mm;
+	ino_t ino;
+	int ret;
+
+	ret = -EACCES;
+	if (!capable(CAP_SYS_ADMIN))
+		goto out;
+
+	ret = -ENOENT;
+	task = get_proc_task(inode);
+	if (!task)
+		goto out;
+
+	ret = -EACCES;
+	if (lock_trace(task))
+		goto out_put_task;
+
+	ret = 0;
+	switch (filp->f_pos) {
+	case 0:
+		ino = inode->i_ino;
+		if (filldir(dirent, ".", 1, 0, ino, DT_DIR) < 0)
+			goto out_unlock;
+		filp->f_pos++;
+	case 1:
+		ino = parent_ino(dentry);
+		if (filldir(dirent, "..", 2, 1, ino, DT_DIR) < 0)
+			goto out_unlock;
+		filp->f_pos++;
+	default:
+	{
+		unsigned long nr_files, pos, i;
+		struct flex_array *fa = NULL;
+		struct map_files_info info;
+		struct map_files_info *p;
+
+		mm = get_task_mm(task);
+		if (!mm)
+			goto out_unlock;
+		down_read(&mm->mmap_sem);
+
+		nr_files = 0;
+
+		/*
+		 * We need two passes here:
+		 *
+		 *  1) Collect vmas of mapped files with mmap_sem taken
+		 *  2) Release mmap_sem and instantiate entries
+		 *
+		 * otherwise we get lockdep complained, since filldir()
+		 * routine might require mmap_sem taken in might_fault().
+		 */
+
+		for (vma = mm->mmap, pos = 2; vma; vma = vma->vm_next) {
+			if (vma->vm_file && ++pos > filp->f_pos)
+				nr_files++;
+		}
+
+		if (nr_files) {
+			fa = flex_array_alloc(sizeof(info), nr_files, GFP_KERNEL);
+			if (!fa || flex_array_prealloc(fa, 0, nr_files, GFP_KERNEL)) {
+				ret = -ENOMEM;
+				if (fa)
+					flex_array_free(fa);
+				up_read(&mm->mmap_sem);
+				mmput(mm);
+				goto out_unlock;
+			}
+			for (i = 0, vma = mm->mmap, pos = 2; vma; vma = vma->vm_next) {
+				if (!vma->vm_file)
+					continue;
+				if (++pos <= filp->f_pos)
+					continue;
+
+				get_file(vma->vm_file);
+				info.file = vma->vm_file;
+				info.len = snprintf(info.name, sizeof(info.name),
+						    "%lx-%lx", vma->vm_start,
+						    vma->vm_end);
+				if (flex_array_put(fa, i++, &info, GFP_KERNEL))
+					BUG();
+			}
+		}
+		up_read(&mm->mmap_sem);
+
+		for (i = 0; i < nr_files; i++) {
+			p = flex_array_get(fa, i);
+			ret = proc_fill_cache(filp, dirent, filldir,
+					      p->name, p->len,
+					      proc_map_files_instantiate,
+					      task, p->file);
+			if (ret)
+				break;
+			filp->f_pos++;
+			fput(p->file);
+		}
+		for (; i < nr_files; i++) {
+			/*
+			 * In case of error don't forget
+			 * to put rest of file refs.
+			 */
+			p = flex_array_get(fa, i);
+			fput(p->file);
+		}
+		if (fa)
+			flex_array_free(fa);
+		mmput(mm);
+	}
+	}
+
+out_unlock:
+	unlock_trace(task);
+out_put_task:
+	put_task_struct(task);
+out:
+	return ret;
+}
+
+static const struct file_operations proc_map_files_operations = {
+	.read		= generic_read_dir,
+	.readdir	= proc_map_files_readdir,
+	.llseek		= default_llseek,
+};
+
+/*
  * /proc/pid/fd needs a special permission handler so that a process can still
  * access /proc/self/fd after it has executed a setuid().
  */
@@ -2815,6 +3159,7 @@ static const struct inode_operations pro
 static const struct pid_entry tgid_base_stuff[] = {
 	DIR("task",       S_IRUGO|S_IXUGO, proc_task_inode_operations, proc_task_operations),
 	DIR("fd",         S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations),
+	DIR("map_files",  S_IRUSR|S_IXUSR, proc_map_files_inode_operations, proc_map_files_operations),
 	DIR("fdinfo",     S_IRUSR|S_IXUSR, proc_fdinfo_inode_operations, proc_fdinfo_operations),
 	DIR("ns",	  S_IRUSR|S_IXUGO, proc_ns_dir_inode_operations, proc_ns_dir_operations),
 #ifdef CONFIG_NET
Index: linux-2.6.git/include/linux/mm.h
===================================================================
--- linux-2.6.git.orig/include/linux/mm.h
+++ linux-2.6.git/include/linux/mm.h
@@ -1491,6 +1491,18 @@ static inline unsigned long vma_pages(st
 	return (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
 }
 
+/* Look up the first VMA which exactly match the interval vm_start ... vm_end */
+static inline struct vm_area_struct *
+find_exact_vma(struct mm_struct *mm, unsigned long vm_start, unsigned long vm_end)
+{
+	struct vm_area_struct *vma = find_vma(mm, vm_start);
+
+	if (vma && (vma->vm_start != vm_start || vma->vm_end != vm_end))
+		vma = NULL;
+
+	return vma;
+}
+
 #ifdef CONFIG_MMU
 pgprot_t vm_get_page_prot(unsigned long vm_flags);
 #else

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 2/2] fs, proc: Introduce the /proc/<pid>/map_files/ directory v12
  2011-09-16 18:26                                   ` Cyrill Gorcunov
@ 2011-09-16 18:31                                     ` Kirill A. Shutemov
  2011-09-16 18:40                                       ` Cyrill Gorcunov
  0 siblings, 1 reply; 28+ messages in thread
From: Kirill A. Shutemov @ 2011-09-16 18:31 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Vasiliy Kulikov, Andrew Morton, Pavel Machek, Andrew Morton,
	linux-kernel, containers, linux-fsdevel, Pavel Emelyanov,
	James Bottomley, Nathan Lynch, Zan Lynx, Daniel Lezcano,
	Tejun Heo, Alexey Dobriyan, Al Viro

On Fri, Sep 16, 2011 at 10:26:54PM +0400, Cyrill Gorcunov wrote:
> On Fri, Sep 16, 2011 at 10:11:46PM +0400, Vasiliy Kulikov wrote:
> ...
> > 
> > Yep, with CAP_SYS_ADMIN check there should be no issues here.
> > 
> > Reviewed-by: Vasiliy Kulikov <segoon@openwall.com>
> > 
> 
> Here we go. Andrew, should I resend the whole series (ie two patches?)
> 
> 	Cyrill
> ---
> From: Pavel Emelyanov <xemul@parallels.com>
> Subject: [PATCH] fs, proc: Introduce the /proc/<pid>/map_files/ directory v14
> 
> This one behaves similarly to the /proc/<pid>/fd/ one - it contains symlinks
> one for each mapping with file, the name of a symlink is "vma->vm_start-vma->vm_end",
> the target is the file. Opening a symlink results in a file that point exactly
> to the same inode as them vma's one.
> 
> For example the ls -l of some arbitrary /proc/<pid>/map_files/
> 
>  | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80403000-7f8f80404000 -> /lib64/libc-2.5.so
>  | lr-x------ 1 root root 64 Aug 26 06:40 7f8f8061e000-7f8f80620000 -> /lib64/libselinux.so.1
>  | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80826000-7f8f80827000 -> /lib64/libacl.so.1.1.0
>  | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a2f000-7f8f80a30000 -> /lib64/librt-2.5.so
>  | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a30000-7f8f80a4c000 -> /lib64/ld-2.5.so
> 
> This *helps* checkpointing process in three ways:
> 
> 1. When dumping a task mappings we do know exact file that is mapped by particular
>    region. We do this by opening /proc/$pid/map_files/$address symlink the way we do
>    with file descriptors.
> 
> 2. This also helps in determining which anonymous shared mappings are shared with
>    each other by comparing the inodes of them.
> 
> 3. When restoring a set of processes in case two of them has a mapping shared, we map
>    the memory by the 1st one and then open its /proc/$pid/map_files/$address file and
>    map it by the 2nd task.
> 
> Using /proc/$pid/maps for this is quite inconvenient since it brings repeatable
> re-reading and reparsing for this text file which slows down restore procedure
> significantly. Also as being pointed in (3) it is a way easier to use top level
> shared mapping in children as /proc/$pid/map_files/$address when needed.
> 
> v2: (spotted by Tejun Heo)
>  - /proc/<pid>/mfd changed to /proc/<pid>/map_files
>  - find_vma helper is used instead of linear search
>  - routines are re-grouped
>  - d_revalidate is set now
> 
> v3:
>  - d_revalidate reworked, now it should drops no longer valid dentries (Tejun Heo)
>  - ptrace_may_access added into proc_map_files_lookup (Vasiliy Kulikov)
>  - because of filldir (which eventually might need to lock mmap_sem)
>    the proc_map_files_readdir() was reworked to call proc_fill_cache()
>    with unlocked mmap_sem
> 
> v4: (feedback by Tejun Heo and Vasiliy Kulikov)
>  - instead of saving data in proc_inode we rather make a dentry name
>    to keep both vm_start and vm_end accordingly
>  - d_revalidate now honor task credentials
> 
> v5: (feedback by Kirill A. Shutemov)
>  - don't forget to release mmap_sem on error path
> 
> v6:
>  - sizeof get used in map_files_info which shrink member a bit on
>    x86-32 (by Kirill A. Shutemov)
>  - map_name_to_addr returns -EINVAL instead of -1
>    which is more appropriate (by Tejun Heo)
> 
> v7:
>  - add [get/set]attr handlers for
>    proc_map_files_inode_operations (by Vasiliy Kulikov)
> 
> v8:
>  - Kirill A. Shutemov spotted a parasite semicolon
>    which ruined the ptrace_check call, fixed.
> 
> v9: (feedback by Andrew Morton)
>  - find_exact_vma moved into include/linux/mm.h as an inline helper
>  - proc_map_files_setattr uses either kmalloc or vmalloc depending
>    on how many objects are to be allocated
>  - no more map_name_to_addr but dname_to_vma_addr introduced instead
>    and it uses sscanf because in one case the find_exact_vma() is used
>    only to confirm existence of vma area the boolean flag is used
>  - fancy justification dropped
>  - still the proc_map_files_get/setattr leaved untouched
>    until additional fd/ patches applied first.
> 
> v10: (feedback by Andrew Morton)
>  - flex_arrays are used instead of kmalloc/vmalloc calls
>  - map_files_d_revalidate use ptrace_may_access for
>    security reason (by Vasiliy Kulikov)
> 
> v11:
>  - should use fput and drop !ret test from a loop code
>    (feedback by Andrew Morton)
>  - no need for 'used' variable, use existing
>    nr_files with file->pos predicate
>  - if preallocation fails no need to go further,
>    simply release mmap semaphore and jump out
> 
> v12:
>  - rework map_files_d_revalidate to make sure
>    the task get released on return (by Vasiliy Kulikov)
> 
> v13:
>  - proc_map_files_inode_operations are set to be the same
>    as proc_fd_inode_operations, ie to include .permission
>    pointing to proc_fd_permission
> 
> v14: (by Vasiliy Kulikov)
>  - for security reason map_files/ entries are allowed for
>    readers with CAP_SYS_ADMIN credentials granted only
> 
> Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
> Reviewed-by: Vasiliy Kulikov <segoon@openwall.com>
> CC: Tejun Heo <tj@kernel.org>
> CC: Vasiliy Kulikov <segoon@openwall.com>
> CC: "Kirill A. Shutemov" <kirill@shutemov.name>
> CC: Alexey Dobriyan <adobriyan@gmail.com>
> CC: Al Viro <viro@ZenIV.linux.org.uk>
> CC: Andrew Morton <akpm@linux-foundation.org>
> CC: Pavel Machek <pavel@ucw.cz>

My Reviewed-by is in force.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [patch 2/2] fs, proc: Introduce the /proc/<pid>/map_files/ directory v12
  2011-09-16 18:31                                     ` Kirill A. Shutemov
@ 2011-09-16 18:40                                       ` Cyrill Gorcunov
  0 siblings, 0 replies; 28+ messages in thread
From: Cyrill Gorcunov @ 2011-09-16 18:40 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Vasiliy Kulikov, Andrew Morton, Pavel Machek, Andrew Morton,
	linux-kernel, containers, linux-fsdevel, Pavel Emelyanov,
	James Bottomley, Nathan Lynch, Zan Lynx, Daniel Lezcano,
	Tejun Heo, Alexey Dobriyan, Al Viro

On Fri, Sep 16, 2011 at 09:31:50PM +0300, Kirill A. Shutemov wrote:
...
> > 
> > Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
> > Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
> > Reviewed-by: Vasiliy Kulikov <segoon@openwall.com>
> > CC: Tejun Heo <tj@kernel.org>
> > CC: Vasiliy Kulikov <segoon@openwall.com>
> > CC: "Kirill A. Shutemov" <kirill@shutemov.name>
> > CC: Alexey Dobriyan <adobriyan@gmail.com>
> > CC: Al Viro <viro@ZenIV.linux.org.uk>
> > CC: Andrew Morton <akpm@linux-foundation.org>
> > CC: Pavel Machek <pavel@ucw.cz>
> 
> My Reviewed-by is in force.
> 

Thanks, Kirill! (I didn't put it in first place since we tuned
code a bit and I didn't know your opinion on new changes)

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2011-09-16 18:40 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-09-13 21:13 [patch 0/2] symlinks for mapped files in proc Cyrill Gorcunov
2011-09-13 21:14 ` [patch 1/2] fs, proc: Make proc_get_link to use dentry instead of inode Cyrill Gorcunov
2011-09-14  1:37   ` Kirill A. Shutemov
2011-09-13 21:14 ` [patch 2/2] fs, proc: Introduce the /proc/<pid>/map_files/ directory v12 Cyrill Gorcunov
     [not found]   ` <20110914023428.GA4034@shutemov.name>
2011-09-14  5:54     ` Cyrill Gorcunov
2011-09-14  6:52   ` Andrew Morton
2011-09-14 10:56     ` Cyrill Gorcunov
2011-09-14 11:14       ` Pavel Machek
2011-09-14 11:39         ` Cyrill Gorcunov
2011-09-14 13:44           ` Cyrill Gorcunov
2011-09-14 14:48             ` Vasiliy Kulikov
2011-09-14 14:57               ` Vasiliy Kulikov
2011-09-14 16:00               ` Cyrill Gorcunov
2011-09-14 16:07                 ` Vasiliy Kulikov
2011-09-14 16:13                   ` Pavel Emelyanov
2011-09-14 16:21                     ` Vasiliy Kulikov
2011-09-15  9:14                   ` Cyrill Gorcunov
2011-09-15  9:27                     ` Vasiliy Kulikov
2011-09-15 10:29                       ` Cyrill Gorcunov
2011-09-15 10:56                         ` Vasiliy Kulikov
2011-09-15 11:00                           ` Cyrill Gorcunov
2011-09-15 20:19                           ` Cyrill Gorcunov
2011-09-16 17:56                             ` Vasiliy Kulikov
2011-09-16 18:07                               ` Cyrill Gorcunov
2011-09-16 18:11                                 ` Vasiliy Kulikov
2011-09-16 18:26                                   ` Cyrill Gorcunov
2011-09-16 18:31                                     ` Kirill A. Shutemov
2011-09-16 18:40                                       ` Cyrill Gorcunov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).