All of lore.kernel.org
 help / color / mirror / Atom feed
From: Andrew Morton <akpm@linux-foundation.org>
To: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>,
	Vasiliy Kulikov <segoon@openwall.com>,
	containers@lists.osdl.org, linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, Nathan Lynch <ntl@pobox.com>,
	Oren Laadan <orenl@cs.columbia.edu>,
	Daniel Lezcano <dlezcano@fr.ibm.com>,
	Glauber Costa <glommer@parallels.com>,
	James Bottomley <jbottomley@parallels.com>,
	Tejun Heo <tj@kernel.org>, Alexey Dobriyan <adobriyan@gmail.com>,
	Al Viro <viro@ZenIV.linux.org.uk>,
	Pavel Emelyanov <xemul@parallels.com>
Subject: Re: [patch 2/2] fs, proc: Introduce the /proc/<pid>/map_files/ directory v6
Date: Wed, 31 Aug 2011 15:10:23 -0700	[thread overview]
Message-ID: <20110831151023.5b7e12da.akpm@linux-foundation.org> (raw)
In-Reply-To: <20110831142622.GB30615@sun>

On Wed, 31 Aug 2011 18:26:22 +0400
Cyrill Gorcunov <gorcunov@gmail.com> wrote:

> On Wed, Aug 31, 2011 at 05:04:16PM +0300, Kirill A. Shutemov wrote:
> ...
> > > 
> > > OK, here is an updated one. Thanks for feedback. Hope this time
> > > all nits are addressed. Still reviews/complains are *very* appreciated.
> > 
> > Please run checkpatch. It points several warnings and one dangerous error.
> > 
> 
> This one passes checkpatch without error (thanks again Kirill for spotting
> the parasite semicolon there!).
> 
> 	Cyrill
> ---
> fs, proc: Introduce the /proc/<pid>/map_files/ directory v8
> 
> From: Pavel Emelyanov <xemul@parallels.com>
> 
> This one behaves similarly to the /proc/<pid>/fd/ one - it contains symlinks
> one for each mapping with file, the name of a symlink is "vma->vm_start-vma->vm_end",
> the target is the file. Opening a symlink results in a file that point exactly
> to the same inode as them vma's one.
> 
> For example the ls -l of some arbitrary /proc/<pid>/map_files/
> 
>  | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80403000-7f8f80404000 -> /lib64/libc-2.5.so
>  | lr-x------ 1 root root 64 Aug 26 06:40 7f8f8061e000-7f8f80620000 -> /lib64/libselinux.so.1
>  | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80826000-7f8f80827000 -> /lib64/libacl.so.1.1.0
>  | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a2f000-7f8f80a30000 -> /lib64/librt-2.5.so
>  | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a30000-7f8f80a4c000 -> /lib64/ld-2.5.so
> 
> This helps checkpointing process in three ways:
> 
> 1. When dumping a task mappings we do know exact file that is mapped by particular
>    region. We do this by opening /proc/pid/map_files/address symlink the way we do
>    with file descriptors.
> 
> 2. This also helps in determining which anonymous shared mappings are shared with
>    each other by comparing the inodes of them.
> 
> 3. When restoring a set of process in case two of them has a mapping shared, we map
>    the memory by the 1st one and then open its /proc/pid/map_files/address file and
>    map it by the 2nd task.

I'm reluctant to merge something like this unless/until it has real
use-cases.

What is the status of your c/r effort?

What additional kernel patches are required to bring that effort to a
usable state and where are those patches?

IOW, before starting to merge things I'd like to get an understanding
of what *all* the patches look like and of what level of c/r
functionality they provide.


This particular patch introduces a distressing amount of duplication of
/proc/pid/maps.  The changelog should provide a really good
justification for doing this: why is /proc/pid/maps (and smaps!)
unsuitable and why cannot maps/smaps be fixed to be suitable?

>
> ...
>
> --- linux-2.6.git.orig/fs/proc/base.c
> +++ linux-2.6.git/fs/proc/base.c
> @@ -2170,6 +2170,373 @@ static const struct file_operations proc
>  	.llseek		= default_llseek,
>  };
>  
> +static struct vm_area_struct *
> +find_exact_vma(struct mm_struct *mm, unsigned long vm_start, unsigned long vm_end)
> +{
> +	struct vm_area_struct *vma = find_vma(mm, vm_start);
> +	if (vma && (vma->vm_start != vm_start || vma->vm_end != vm_end))
> +		vma = NULL;
> +	return vma;
> +}

This function would benefit from a code comment.

Given that it's pretty generic (indeed there might be open-coded code
which already does this elsewhere), perhaps it should be in mm/mmap.c
as a kernel-wide utility function.  That will add a little overhead to
CONFIG_PROC_FS=n builds, which doesn't seem terribly important.

> +static int map_name_to_addr(const unsigned char *name, unsigned long *start, unsigned long *end)
> +{
> +	int ret = -EINVAL;
> +	char *endp;
> +
> +	if (unlikely(!name))
> +		goto err;
> +
> +	*start = simple_strtoul(name, &endp, 16);
> +	if (*endp != '-')
> +		goto err;
> +	*end = simple_strtoul(endp + 1, &endp, 16);
> +	if (*endp != 0)
> +		goto err;
> +
> +	ret = 0;
> +
> +err:
> +	return ret;
> +}

Again, a little bit of interface documentation would be nice.  Explain
what the parsed input format is, at least.

simple_strtoul() is obsolete - use kstrto*().  A checkpatch rule for
this is queued.

> +static int map_files_d_revalidate(struct dentry *dentry, struct nameidata *nd)
> +{
> +	struct vm_area_struct *vma = NULL;
> +	unsigned long vm_start, vm_end;
> +	struct task_struct *task;
> +	const struct cred *cred;
> +	struct mm_struct *mm;
> +	struct inode *inode;
> +
> +	if (nd && nd->flags & LOOKUP_RCU)
> +		return -ECHILD;
> +
> +	inode = dentry->d_inode;
> +	task = get_proc_task(inode);
> +	if (!task)
> +		goto out;
> +
> +	mm = get_task_mm(task);
> +	put_task_struct(task);
> +	if (!mm)
> +		goto out;
> +
> +	if (!map_name_to_addr(dentry->d_name.name, &vm_start, &vm_end)) {
> +		down_read(&mm->mmap_sem);
> +		vma = find_exact_vma(mm, vm_start, vm_end);

OK, this is nasty.  We have a local variable which points to a vma but
then we release the locks and refcounts which protect that vma.  So we
have a pointer which we cannot dereference.  That's dangerous.

> +		up_read(&mm->mmap_sem);
> +	}
> +
> +	mmput(mm);
> +
> +	if (vma) {
> +		if (task_dumpable(task)) {
> +			rcu_read_lock();
> +			cred = __task_cred(task);
> +			inode->i_uid = cred->euid;
> +			inode->i_gid = cred->egid;
> +			rcu_read_unlock();
> +		} else {
> +			inode->i_uid = 0;
> +			inode->i_gid = 0;
> +		}
> +		security_task_to_inode(task, inode);
> +		return 1;
> +	}

And we don't actually dereference it - at present.  We use it as a bool.

Would it not be nicer, safer and clearer to turn this pointer-as-a-bool
into a bool?

	bool matching_vma_exists = !!find_exact_vma(mm, vm_start, vm_end);

> +out:
> +	d_drop(dentry);
> +	return 0;
> +}
> +
>
> ...
>
> +static struct dentry *
> +proc_map_files_instantiate(struct inode *dir, struct dentry *dentry,
> +			   struct task_struct *task, const void *ptr)
> +{
> +	const struct file *file = ptr;
> +	struct proc_inode *ei;
> +	struct inode *inode;
> +
> +	if (!file)
> +		return ERR_PTR(-ENOENT);
> +
> +	inode = proc_pid_make_inode(dir->i_sb, task);
> +	if (!inode)
> +		return ERR_PTR(-ENOENT);
> +
> +	ei			= PROC_I(inode);
> +	ei->op.proc_get_link	= proc_map_files_get_link;
> +
> +	inode->i_op	= &proc_pid_link_inode_operations;
> +	inode->i_size	= 64;
> +	inode->i_mode	= S_IFLNK;

The fancy indenting is not a thing we usually do in the kernel.

> +	if (file->f_mode & FMODE_READ)
> +		inode->i_mode |= S_IRUSR | S_IXUSR;
> +	if (file->f_mode & FMODE_WRITE)
> +		inode->i_mode |= S_IWUSR | S_IXUSR;
> +
> +	d_set_d_op(dentry, &tid_map_files_dentry_operations);
> +	d_add(dentry, inode);
> +
> +	return NULL;
> +}
> +
>
> ...
>
> +/*
> + * NOTE: The getattr/setattr for both /proc/$pid/map_files and
> + * /proc/$pid/fd seems to have share the code, so need to be
> + * unified and code duplication eliminated!

Why not do this now?

> + */
> +
> +int proc_map_files_setattr(struct dentry *dentry, struct iattr *attr)

static

> +{
> +	struct inode *inode = dentry->d_inode;
> +	struct task_struct *task;
> +	int ret = -EACCES;
> +
> +	task = get_proc_task(inode);
> +	if (!task)
> +		return -ESRCH;
> +
> +	if (!lock_trace(task)) {
> +		ret = proc_setattr(dentry, attr);
> +		unlock_trace(task);
> +	}
> +
> +	put_task_struct(task);
> +	return ret;
> +}
> +
>
> ...
>
> +static int proc_map_files_readdir(struct file *filp, void *dirent, filldir_t filldir)
> +{
> +	struct dentry *dentry = filp->f_path.dentry;
> +	struct inode *inode = dentry->d_inode;
> +	struct vm_area_struct *vma;
> +	struct task_struct *task;
> +	struct mm_struct *mm;
> +	unsigned int vmai;
> +	ino_t ino;
> +	int ret;
> +
> +	ret = -ENOENT;
> +	task = get_proc_task(inode);
> +	if (!task)
> +		goto out_no_task;
> +
> +	ret = -EPERM;
> +	if (!ptrace_may_access(task, PTRACE_MODE_READ))
> +		goto out;
> +
> +	ret = 0;
> +	switch (filp->f_pos) {
> +	case 0:
> +		ino = inode->i_ino;
> +		if (filldir(dirent, ".", 1, 0, ino, DT_DIR) < 0)
> +			goto out;
> +		filp->f_pos++;
> +	case 1:
> +		ino = parent_ino(dentry);
> +		if (filldir(dirent, "..", 2, 1, ino, DT_DIR) < 0)
> +			goto out;
> +		filp->f_pos++;
> +	default:
> +	{
> +		struct map_files_info *info = NULL;
> +		unsigned long nr_files, used, i;
> +
> +		mm = get_task_mm(task);
> +		if (!mm)
> +			goto out;
> +		down_read(&mm->mmap_sem);
> +
> +		nr_files = 0;
> +		used = 0;
> +
> +		/*
> +		 * We need two passes here:
> +		 *  1) Collect vmas of mapped files with mmap_sem taken
> +		 *  2) Release mmap_sem and instantiate entries
> +		 * othrewise we get lockdep complains since filldir

typo

> +		 * might sleep.
> +		 */

Why would lockdep complain about sleep-inside-mmap_sem?

> +		for (vma = mm->mmap; vma; vma = vma->vm_next) {
> +			if (vma->vm_file)
> +				nr_files++;
> +		}
> +
> +		if (nr_files) {
> +			info = kmalloc(nr_files * sizeof(*info), GFP_KERNEL);

I figure sizeof(*info) = 50 bytes.  nr_files can easily be >1000.

On large some applications the kmalloc attempt will be too large.  This
is a must-fix.  Using vmalloc() would be very lame.

> +			if (!info)
> +				ret = -ENOMEM;
> +			for (vma = mm->mmap, vmai = 2; vma && info; vma = vma->vm_next) {
> +				if (!vma->vm_file)
> +					continue;
> +				vmai++;

What is "vmai"?  "vma index"?  If so, please call it vma_index.  If
not, please call it something better anyway.

vmai/vma_index could be made local to this code block.

> +				if (vmai <= filp->f_pos)
> +					continue;
> +
> +				get_file(vma->vm_file);
> +				info[used].file	= vma->vm_file;
> +				info[used].len	= snprintf(info[used].name,
> +							   sizeof(info[used].name),
> +							   "%lx-%lx",
> +							   vma->vm_start,
> +							   vma->vm_end);
> +				used++;
> +			}
> +		}
> +
> +		up_read(&mm->mmap_sem);
> +
> +		for (i = 0; i < used; i++) {
> +			ret = proc_fill_cache(filp, dirent, filldir,
> +					      info[i].name, info[i].len,
> +					      proc_map_files_instantiate,
> +					      task, info[i].file);
> +			if (ret)
> +				break;
> +			filp->f_pos++;
> +		}
> +
> +		for (i = 0; i < used; i++)
> +			put_filp(info[i].file);

Why not do the put_filp() in the previous loop and avoid some cache
misses?

> +		kfree(info);
> +		mmput(mm);
> +	}
> +	}
> +
> +out:
> +	put_task_struct(task);
> +out_no_task:
> +	return ret;
> +}
>
> ...
>


  reply	other threads:[~2011-08-31 22:10 UTC|newest]

Thread overview: 81+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-08-31  7:58 [patch 0/2] Introduce /proc/pid/map_files v6 Cyrill Gorcunov
2011-08-31  7:58 ` [patch 1/2] fs, proc: Make proc_get_link to use dentry instead of inode Cyrill Gorcunov
2011-08-31  7:58 ` [patch 2/2] fs, proc: Introduce the /proc/<pid>/map_files/ directory v6 Cyrill Gorcunov
2011-08-31  9:06   ` Vasiliy Kulikov
2011-08-31 10:12     ` Cyrill Gorcunov
2011-08-31 11:26     ` Cyrill Gorcunov
2011-08-31 14:04       ` Kirill A. Shutemov
2011-08-31 14:09         ` Cyrill Gorcunov
2011-08-31 14:26         ` Cyrill Gorcunov
2011-08-31 22:10           ` Andrew Morton [this message]
2011-09-01  3:07             ` Kyle Moffett
2011-09-01  3:07               ` Kyle Moffett
2011-09-01  7:58             ` Pavel Emelyanov
2011-09-01 11:50               ` Tejun Heo
2011-09-01 12:13                 ` Pavel Emelyanov
2011-09-01 17:13                   ` Tejun Heo
2011-09-02 19:15                     ` Matt Helsley
2011-09-02  0:09               ` Matt Helsley
2011-09-01  8:05             ` Cyrill Gorcunov
2011-09-02 16:37               ` Vasiliy Kulikov
2011-09-02 16:37                 ` [kernel-hardening] " Vasiliy Kulikov
2011-09-05 18:53                 ` Vasiliy Kulikov
2011-09-05 18:53                   ` [kernel-hardening] " Vasiliy Kulikov
2011-09-05 19:20                   ` Cyrill Gorcunov
2011-09-05 19:20                     ` [kernel-hardening] " Cyrill Gorcunov
2011-09-05 19:49                     ` Vasiliy Kulikov
2011-09-05 19:49                       ` [kernel-hardening] " Vasiliy Kulikov
2011-09-05 20:36                       ` Cyrill Gorcunov
2011-09-05 20:36                         ` [kernel-hardening] " Cyrill Gorcunov
2011-09-06 10:15                         ` Vasiliy Kulikov
2011-09-06 10:15                           ` [kernel-hardening] " Vasiliy Kulikov
2011-09-06 16:51                           ` Tejun Heo
2011-09-06 16:51                             ` [kernel-hardening] " Tejun Heo
2011-09-06 17:29                             ` Vasiliy Kulikov
2011-09-06 17:29                               ` [kernel-hardening] " Vasiliy Kulikov
2011-09-06 17:33                               ` Tejun Heo
2011-09-06 17:33                                 ` [kernel-hardening] " Tejun Heo
2011-09-06 18:15                                 ` Cyrill Gorcunov
2011-09-06 18:15                                   ` [kernel-hardening] " Cyrill Gorcunov
     [not found]                                 ` <20110906173341.GM18425-9pTldWuhBndy/B6EtB590w@public.gmane.org>
2011-09-07 11:23                                   ` Vasiliy Kulikov
2011-09-07 11:23                                     ` [kernel-hardening] " Vasiliy Kulikov
2011-09-07 21:53                                     ` Cyrill Gorcunov
2011-09-07 21:53                                       ` [kernel-hardening] " Cyrill Gorcunov
2011-09-07 22:13                                       ` Andrew Morton
2011-09-07 22:13                                         ` Andrew Morton
2011-09-07 22:13                                         ` [kernel-hardening] " Andrew Morton
2011-09-07 22:42                                         ` Cyrill Gorcunov
2011-09-07 22:42                                           ` [kernel-hardening] " Cyrill Gorcunov
2011-09-07 22:53                                           ` Andrew Morton
2011-09-07 22:53                                             ` Andrew Morton
2011-09-07 22:53                                             ` [kernel-hardening] " Andrew Morton
2011-09-08  5:48                                             ` Cyrill Gorcunov
2011-09-08  5:48                                               ` [kernel-hardening] " Cyrill Gorcunov
2011-09-08  5:50                                               ` Cyrill Gorcunov
2011-09-08  5:50                                                 ` [kernel-hardening] " Cyrill Gorcunov
2011-09-08  6:04                                                 ` Cyrill Gorcunov
2011-09-08  6:04                                                   ` [kernel-hardening] " Cyrill Gorcunov
2011-09-08 23:52                                                   ` Andrew Morton
2011-09-08 23:52                                                     ` Andrew Morton
2011-09-08 23:52                                                     ` [kernel-hardening] " Andrew Morton
2011-09-09  0:24                                                     ` Pavel Emelyanov
2011-09-09  0:24                                                       ` [kernel-hardening] " Pavel Emelyanov
2011-09-09  5:48                                                     ` Cyrill Gorcunov
2011-09-09  5:48                                                       ` [kernel-hardening] " Cyrill Gorcunov
2011-09-09  6:00                                                       ` Andrew Morton
2011-09-09  6:00                                                         ` [kernel-hardening] " Andrew Morton
2011-09-09  6:22                                                         ` Cyrill Gorcunov
2011-09-09  6:22                                                           ` [kernel-hardening] " Cyrill Gorcunov
2011-09-10 13:21                                                   ` Vasiliy Kulikov
2011-09-10 13:49                                                     ` Cyrill Gorcunov
2011-09-01 10:46             ` Cyrill Gorcunov
2011-09-01 22:49               ` Andrew Morton
2011-09-01 23:04                 ` Tejun Heo
2011-09-02  5:54                   ` Cyrill Gorcunov
2011-09-02  5:53                 ` Cyrill Gorcunov
2011-08-31 22:50           ` Andrew Morton
2011-09-02  1:54   ` Nicholas Miell
2011-09-02  1:58     ` Tejun Heo
2011-09-02  2:04       ` Nicholas Miell
2011-09-02  2:29         ` Tejun Heo
2011-09-02  8:07           ` Kirill A. Shutemov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20110831151023.5b7e12da.akpm@linux-foundation.org \
    --to=akpm@linux-foundation.org \
    --cc=adobriyan@gmail.com \
    --cc=containers@lists.osdl.org \
    --cc=dlezcano@fr.ibm.com \
    --cc=glommer@parallels.com \
    --cc=gorcunov@gmail.com \
    --cc=jbottomley@parallels.com \
    --cc=kirill@shutemov.name \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=ntl@pobox.com \
    --cc=orenl@cs.columbia.edu \
    --cc=segoon@openwall.com \
    --cc=tj@kernel.org \
    --cc=viro@ZenIV.linux.org.uk \
    --cc=xemul@parallels.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.