All of lore.kernel.org
 help / color / mirror / Atom feed
From: Andrew Morton <akpm@linux-foundation.org>
To: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>,
	Vasiliy Kulikov <segoon@openwall.com>,
	containers@lists.osdl.org, linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, Nathan Lynch <ntl@pobox.com>,
	Oren Laadan <orenl@cs.columbia.edu>,
	Daniel Lezcano <dlezcano@fr.ibm.com>,
	Glauber Costa <glommer@parallels.com>,
	James Bottomley <jbottomley@parallels.com>,
	Tejun Heo <tj@kernel.org>, Alexey Dobriyan <adobriyan@gmail.com>,
	Al Viro <viro@ZenIV.linux.org.uk>,
	Pavel Emelyanov <xemul@parallels.com>
Subject: Re: [patch 2/2] fs, proc: Introduce the /proc/<pid>/map_files/ directory v6
Date: Wed, 31 Aug 2011 15:50:18 -0700	[thread overview]
Message-ID: <20110831155018.fba63834.akpm@linux-foundation.org> (raw)
In-Reply-To: <20110831142622.GB30615@sun>

On Wed, 31 Aug 2011 18:26:22 +0400
Cyrill Gorcunov <gorcunov@gmail.com> wrote:

> fs, proc: Introduce the /proc/<pid>/map_files/ directory v8
> 
> From: Pavel Emelyanov <xemul@parallels.com>
> 
> This one behaves similarly to the /proc/<pid>/fd/ one - it contains symlinks
> one for each mapping with file, the name of a symlink is "vma->vm_start-vma->vm_end",
> the target is the file. Opening a symlink results in a file that point exactly
> to the same inode as them vma's one.
> 
> For example the ls -l of some arbitrary /proc/<pid>/map_files/
> 
>  | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80403000-7f8f80404000 -> /lib64/libc-2.5.so
>  | lr-x------ 1 root root 64 Aug 26 06:40 7f8f8061e000-7f8f80620000 -> /lib64/libselinux.so.1
>  | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80826000-7f8f80827000 -> /lib64/libacl.so.1.1.0
>  | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a2f000-7f8f80a30000 -> /lib64/librt-2.5.so
>  | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a30000-7f8f80a4c000 -> /lib64/ld-2.5.so
> 
> This helps checkpointing process in three ways:
> 
> 1. When dumping a task mappings we do know exact file that is mapped by particular
>    region. We do this by opening /proc/pid/map_files/address symlink the way we do
>    with file descriptors.
> 
> 2. This also helps in determining which anonymous shared mappings are shared with
>    each other by comparing the inodes of them.
> 
> 3. When restoring a set of process in case two of them has a mapping shared, we map
>    the memory by the 1st one and then open its /proc/pid/map_files/address file and
>    map it by the 2nd task.

I'm reluctant to merge large hunks c/r infrastructure before knowing
that it will actually be useful and used.

What it the state of your c/r effort?

What additional kernel patches will be needed to enable it?

Where are those patches?



This particular patch is distressingly similar to /proc/pid/maps and
smaps.  To justify a merge the changelog should clearly describe why
maps/smaps are unsuitable for this application and why they are
unfixable.

>
> ...
>
> --- linux-2.6.git.orig/fs/proc/base.c
> +++ linux-2.6.git/fs/proc/base.c
> @@ -2170,6 +2170,373 @@ static const struct file_operations proc
>  	.llseek		= default_llseek,
>  };
>  
> +static struct vm_area_struct *
> +find_exact_vma(struct mm_struct *mm, unsigned long vm_start, unsigned long vm_end)
> +{
> +	struct vm_area_struct *vma = find_vma(mm, vm_start);
> +	if (vma && (vma->vm_start != vm_start || vma->vm_end != vm_end))
> +		vma = NULL;
> +	return vma;
> +}

This function would be better with a little documentation.

It is quite generic and it may be the case that other parts of the
kernel are already doing this in an open-coded fashion.  Perhaps it
should be placed in mm/mmap.c along with the other vma manipulation
library code.  This would add a small amount of dead code to
CONFIG_PROC_FS=n builds.

> +static int map_name_to_addr(const unsigned char *name, unsigned long *start, unsigned long *end)
> +{
> +	int ret = -EINVAL;
> +	char *endp;
> +
> +	if (unlikely(!name))
> +		goto err;
> +
> +	*start = simple_strtoul(name, &endp, 16);
> +	if (*endp != '-')
> +		goto err;
> +	*end = simple_strtoul(endp + 1, &endp, 16);
> +	if (*endp != 0)
> +		goto err;
> +
> +	ret = 0;
> +
> +err:
> +	return ret;
> +}

Again, documentation would improve this code.  At least describe the
parsed input format.

simple_strtoul() is obsolete.  Use kstrto*().  There is a new
checkpatch rule pending for this.

> +static int map_files_d_revalidate(struct dentry *dentry, struct nameidata *nd)
> +{
> +	struct vm_area_struct *vma = NULL;
> +	unsigned long vm_start, vm_end;
> +	struct task_struct *task;
> +	const struct cred *cred;
> +	struct mm_struct *mm;
> +	struct inode *inode;
> +
> +	if (nd && nd->flags & LOOKUP_RCU)
> +		return -ECHILD;
> +
> +	inode = dentry->d_inode;
> +	task = get_proc_task(inode);
> +	if (!task)
> +		goto out;
> +
> +	mm = get_task_mm(task);
> +	put_task_struct(task);
> +	if (!mm)
> +		goto out;
> +
> +	if (!map_name_to_addr(dentry->d_name.name, &vm_start, &vm_end)) {
> +		down_read(&mm->mmap_sem);
> +		vma = find_exact_vma(mm, vm_start, vm_end);

This is nasty.  We have a pointer to a vma, but we drop the locks and
refcounts which protect that vma.  So we have a local pointer which we
cannot dereference.  It's a bit of a hand-grenade.

> +		up_read(&mm->mmap_sem);
> +	}
> +
> +	mmput(mm);
> +
> +	if (vma) {
> +		if (task_dumpable(task)) {
> +			rcu_read_lock();
> +			cred = __task_cred(task);
> +			inode->i_uid = cred->euid;
> +			inode->i_gid = cred->egid;
> +			rcu_read_unlock();
> +		} else {
> +			inode->i_uid = 0;
> +			inode->i_gid = 0;
> +		}
> +		security_task_to_inode(task, inode);
> +		return 1;

And indeed it was not dereferenced (yet).  It is treated as a bool.

Would it not be nicer, safer and clearer to turn the pointer-as-a-bool
into a bool?

	bool matching_vma_found = !!find_exact_vma(mm, vm_start, vm_end);

> +	}
> +out:
> +	d_drop(dentry);
> +	return 0;
> +}
> +
>
> ...
>
> +static struct dentry *
> +proc_map_files_instantiate(struct inode *dir, struct dentry *dentry,
> +			   struct task_struct *task, const void *ptr)
> +{
> +	const struct file *file = ptr;
> +	struct proc_inode *ei;
> +	struct inode *inode;
> +
> +	if (!file)
> +		return ERR_PTR(-ENOENT);
> +
> +	inode = proc_pid_make_inode(dir->i_sb, task);
> +	if (!inode)
> +		return ERR_PTR(-ENOENT);
> +
> +	ei			= PROC_I(inode);
> +	ei->op.proc_get_link	= proc_map_files_get_link;
> +
> +	inode->i_op	= &proc_pid_link_inode_operations;
> +	inode->i_size	= 64;
> +	inode->i_mode	= S_IFLNK;

The fancy indenting is atypical for kernel code.

> +	if (file->f_mode & FMODE_READ)
> +		inode->i_mode |= S_IRUSR | S_IXUSR;
> +	if (file->f_mode & FMODE_WRITE)
> +		inode->i_mode |= S_IWUSR | S_IXUSR;
> +
> +	d_set_d_op(dentry, &tid_map_files_dentry_operations);
> +	d_add(dentry, inode);
> +
> +	return NULL;
> +}
> +
>
> ...
>
> +/*
> + * NOTE: The getattr/setattr for both /proc/$pid/map_files and
> + * /proc/$pid/fd seems to have share the code, so need to be
> + * unified and code duplication eliminated!

Why not do this now?

> + */
> +
> +int proc_map_files_setattr(struct dentry *dentry, struct iattr *attr)

static

> +{
> +	struct inode *inode = dentry->d_inode;
> +	struct task_struct *task;
> +	int ret = -EACCES;
> +
> +	task = get_proc_task(inode);
> +	if (!task)
> +		return -ESRCH;
> +
> +	if (!lock_trace(task)) {
> +		ret = proc_setattr(dentry, attr);
> +		unlock_trace(task);
> +	}
> +
> +	put_task_struct(task);
> +	return ret;
> +}
> +
>
> ...
>
> +static int proc_map_files_readdir(struct file *filp, void *dirent, filldir_t filldir)
> +{
> +	struct dentry *dentry = filp->f_path.dentry;
> +	struct inode *inode = dentry->d_inode;
> +	struct vm_area_struct *vma;
> +	struct task_struct *task;
> +	struct mm_struct *mm;
> +	unsigned int vmai;
> +	ino_t ino;
> +	int ret;
> +
> +	ret = -ENOENT;
> +	task = get_proc_task(inode);
> +	if (!task)
> +		goto out_no_task;
> +
> +	ret = -EPERM;
> +	if (!ptrace_may_access(task, PTRACE_MODE_READ))
> +		goto out;
> +
> +	ret = 0;
> +	switch (filp->f_pos) {
> +	case 0:
> +		ino = inode->i_ino;
> +		if (filldir(dirent, ".", 1, 0, ino, DT_DIR) < 0)
> +			goto out;
> +		filp->f_pos++;
> +	case 1:
> +		ino = parent_ino(dentry);
> +		if (filldir(dirent, "..", 2, 1, ino, DT_DIR) < 0)
> +			goto out;
> +		filp->f_pos++;
> +	default:
> +	{
> +		struct map_files_info *info = NULL;
> +		unsigned long nr_files, used, i;
> +
> +		mm = get_task_mm(task);
> +		if (!mm)
> +			goto out;
> +		down_read(&mm->mmap_sem);
> +
> +		nr_files = 0;
> +		used = 0;
> +
> +		/*
> +		 * We need two passes here:
> +		 *  1) Collect vmas of mapped files with mmap_sem taken
> +		 *  2) Release mmap_sem and instantiate entries
> +		 * othrewise we get lockdep complains since filldir
> +		 * might sleep.

Why would a sleep-inside-mmap_sem cause a lockdep warning?

> +		 */
> +
> +		for (vma = mm->mmap; vma; vma = vma->vm_next) {
> +			if (vma->vm_file)
> +				nr_files++;
> +		}
> +
> +		if (nr_files) {
> +			info = kmalloc(nr_files * sizeof(*info), GFP_KERNEL);

I calculate sizeof(*info) to be 50 bytes.  nr_files can easily be in
the thousands.

So this kmalloc will be far too large on large applications.  This is a
must-fix.  Fixing it with vmalloc() is lame.

> +			if (!info)
> +				ret = -ENOMEM;
> +			for (vma = mm->mmap, vmai = 2; vma && info; vma = vma->vm_next) {

What does "vmai" mean?  "vma index"?  If so, call it vma_index.  If
not, call it something else which is meaningful.

vmai/vma_index could be made local to this code block.

> +				if (!vma->vm_file)
> +					continue;
> +				vmai++;
> +				if (vmai <= filp->f_pos)
> +					continue;
> +
> +				get_file(vma->vm_file);
> +				info[used].file	= vma->vm_file;
> +				info[used].len	= snprintf(info[used].name,
> +							   sizeof(info[used].name),
> +							   "%lx-%lx",
> +							   vma->vm_start,
> +							   vma->vm_end);
> +				used++;
> +			}
> +		}
> +
> +		up_read(&mm->mmap_sem);
> +
> +		for (i = 0; i < used; i++) {
> +			ret = proc_fill_cache(filp, dirent, filldir,
> +					      info[i].name, info[i].len,
> +					      proc_map_files_instantiate,
> +					      task, info[i].file);
> +			if (ret)
> +				break;
> +			filp->f_pos++;
> +		}
> +
> +		for (i = 0; i < used; i++)
> +			put_filp(info[i].file);

Why not do the put_filp() in the previous loop and avoid some cache misses?

> +		kfree(info);
> +		mmput(mm);
> +	}
> +	}
> +
> +out:
> +	put_task_struct(task);
> +out_no_task:
> +	return ret;
> +}
> +
> +static const struct file_operations proc_map_files_operations = {
> +	.read		= generic_read_dir,
> +	.readdir	= proc_map_files_readdir,
> +	.llseek		= default_llseek,
> +};
>
> ...
>


  parent reply	other threads:[~2011-08-31 22:50 UTC|newest]

Thread overview: 81+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-08-31  7:58 [patch 0/2] Introduce /proc/pid/map_files v6 Cyrill Gorcunov
2011-08-31  7:58 ` [patch 1/2] fs, proc: Make proc_get_link to use dentry instead of inode Cyrill Gorcunov
2011-08-31  7:58 ` [patch 2/2] fs, proc: Introduce the /proc/<pid>/map_files/ directory v6 Cyrill Gorcunov
2011-08-31  9:06   ` Vasiliy Kulikov
2011-08-31 10:12     ` Cyrill Gorcunov
2011-08-31 11:26     ` Cyrill Gorcunov
2011-08-31 14:04       ` Kirill A. Shutemov
2011-08-31 14:09         ` Cyrill Gorcunov
2011-08-31 14:26         ` Cyrill Gorcunov
2011-08-31 22:10           ` Andrew Morton
2011-09-01  3:07             ` Kyle Moffett
2011-09-01  3:07               ` Kyle Moffett
2011-09-01  7:58             ` Pavel Emelyanov
2011-09-01 11:50               ` Tejun Heo
2011-09-01 12:13                 ` Pavel Emelyanov
2011-09-01 17:13                   ` Tejun Heo
2011-09-02 19:15                     ` Matt Helsley
2011-09-02  0:09               ` Matt Helsley
2011-09-01  8:05             ` Cyrill Gorcunov
2011-09-02 16:37               ` Vasiliy Kulikov
2011-09-02 16:37                 ` [kernel-hardening] " Vasiliy Kulikov
2011-09-05 18:53                 ` Vasiliy Kulikov
2011-09-05 18:53                   ` [kernel-hardening] " Vasiliy Kulikov
2011-09-05 19:20                   ` Cyrill Gorcunov
2011-09-05 19:20                     ` [kernel-hardening] " Cyrill Gorcunov
2011-09-05 19:49                     ` Vasiliy Kulikov
2011-09-05 19:49                       ` [kernel-hardening] " Vasiliy Kulikov
2011-09-05 20:36                       ` Cyrill Gorcunov
2011-09-05 20:36                         ` [kernel-hardening] " Cyrill Gorcunov
2011-09-06 10:15                         ` Vasiliy Kulikov
2011-09-06 10:15                           ` [kernel-hardening] " Vasiliy Kulikov
2011-09-06 16:51                           ` Tejun Heo
2011-09-06 16:51                             ` [kernel-hardening] " Tejun Heo
2011-09-06 17:29                             ` Vasiliy Kulikov
2011-09-06 17:29                               ` [kernel-hardening] " Vasiliy Kulikov
2011-09-06 17:33                               ` Tejun Heo
2011-09-06 17:33                                 ` [kernel-hardening] " Tejun Heo
2011-09-06 18:15                                 ` Cyrill Gorcunov
2011-09-06 18:15                                   ` [kernel-hardening] " Cyrill Gorcunov
     [not found]                                 ` <20110906173341.GM18425-9pTldWuhBndy/B6EtB590w@public.gmane.org>
2011-09-07 11:23                                   ` Vasiliy Kulikov
2011-09-07 11:23                                     ` [kernel-hardening] " Vasiliy Kulikov
2011-09-07 21:53                                     ` Cyrill Gorcunov
2011-09-07 21:53                                       ` [kernel-hardening] " Cyrill Gorcunov
2011-09-07 22:13                                       ` Andrew Morton
2011-09-07 22:13                                         ` Andrew Morton
2011-09-07 22:13                                         ` [kernel-hardening] " Andrew Morton
2011-09-07 22:42                                         ` Cyrill Gorcunov
2011-09-07 22:42                                           ` [kernel-hardening] " Cyrill Gorcunov
2011-09-07 22:53                                           ` Andrew Morton
2011-09-07 22:53                                             ` Andrew Morton
2011-09-07 22:53                                             ` [kernel-hardening] " Andrew Morton
2011-09-08  5:48                                             ` Cyrill Gorcunov
2011-09-08  5:48                                               ` [kernel-hardening] " Cyrill Gorcunov
2011-09-08  5:50                                               ` Cyrill Gorcunov
2011-09-08  5:50                                                 ` [kernel-hardening] " Cyrill Gorcunov
2011-09-08  6:04                                                 ` Cyrill Gorcunov
2011-09-08  6:04                                                   ` [kernel-hardening] " Cyrill Gorcunov
2011-09-08 23:52                                                   ` Andrew Morton
2011-09-08 23:52                                                     ` Andrew Morton
2011-09-08 23:52                                                     ` [kernel-hardening] " Andrew Morton
2011-09-09  0:24                                                     ` Pavel Emelyanov
2011-09-09  0:24                                                       ` [kernel-hardening] " Pavel Emelyanov
2011-09-09  5:48                                                     ` Cyrill Gorcunov
2011-09-09  5:48                                                       ` [kernel-hardening] " Cyrill Gorcunov
2011-09-09  6:00                                                       ` Andrew Morton
2011-09-09  6:00                                                         ` [kernel-hardening] " Andrew Morton
2011-09-09  6:22                                                         ` Cyrill Gorcunov
2011-09-09  6:22                                                           ` [kernel-hardening] " Cyrill Gorcunov
2011-09-10 13:21                                                   ` Vasiliy Kulikov
2011-09-10 13:49                                                     ` Cyrill Gorcunov
2011-09-01 10:46             ` Cyrill Gorcunov
2011-09-01 22:49               ` Andrew Morton
2011-09-01 23:04                 ` Tejun Heo
2011-09-02  5:54                   ` Cyrill Gorcunov
2011-09-02  5:53                 ` Cyrill Gorcunov
2011-08-31 22:50           ` Andrew Morton [this message]
2011-09-02  1:54   ` Nicholas Miell
2011-09-02  1:58     ` Tejun Heo
2011-09-02  2:04       ` Nicholas Miell
2011-09-02  2:29         ` Tejun Heo
2011-09-02  8:07           ` Kirill A. Shutemov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20110831155018.fba63834.akpm@linux-foundation.org \
    --to=akpm@linux-foundation.org \
    --cc=adobriyan@gmail.com \
    --cc=containers@lists.osdl.org \
    --cc=dlezcano@fr.ibm.com \
    --cc=glommer@parallels.com \
    --cc=gorcunov@gmail.com \
    --cc=jbottomley@parallels.com \
    --cc=kirill@shutemov.name \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=ntl@pobox.com \
    --cc=orenl@cs.columbia.edu \
    --cc=segoon@openwall.com \
    --cc=tj@kernel.org \
    --cc=viro@ZenIV.linux.org.uk \
    --cc=xemul@parallels.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.