All of lore.kernel.org
 help / color / mirror / Atom feed
From: Andrew Morton <akpm@linux-foundation.org>
To: Benjamin LaHaise <bcrl@kvack.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	linux-aio@kvack.org, Linux Kernel <linux-kernel@vger.kernel.org>,
	linux-mm <linux-mm@kvack.org>
Subject: Re: [GIT PULL] aio: changes for 3.19
Date: Sun, 14 Dec 2014 14:13:36 -0800	[thread overview]
Message-ID: <20141214141336.a0267e95.akpm@linux-foundation.org> (raw)
In-Reply-To: <20141214215221.GI2672@kvack.org>

On Sun, 14 Dec 2014 16:52:21 -0500 Benjamin LaHaise <bcrl@kvack.org> wrote:

> On Sun, Dec 14, 2014 at 01:47:32PM -0800, Linus Torvalds wrote:
> > On Sun, Dec 14, 2014 at 12:22 PM, Benjamin LaHaise <bcrl@kvack.org> wrote:
> > >
> > > Pavel Emelyanov (1):
> > >       aio: Make it possible to remap aio ring
> > 
> > So quite frankly, I think this should have had more acks from VM
> > people. The patch looks ok to me, but it took me by surprise, and I
> > don't see much any discussion about it on linux-mm either..
> 
> Sadly, nobody responded.  Maybe akpm can chime in on this change (included 
> below for ease of review and akpm added to the To:)?
> 
> 		-ben
> -- 
> "Thought is the essence of where you are now."
> 
> commit e4a0d3e720e7e508749c1439b5ba3aff56c92976
> Author: Pavel Emelyanov <xemul@parallels.com>
> Date:   Thu Sep 18 19:56:17 2014 +0400
> 
>     aio: Make it possible to remap aio ring

The patch appears to be a bugfix which coincidentally helps CRIU?

If it weren't for the bugfix part, I'd be asking "why not pass the
desired virtual address into io_setup()?".

>     There are actually two issues this patch addresses. Let me start with
>     the one I tried to solve in the beginning.
>     
>     So, in the checkpoint-restore project (criu) we try to dump tasks'
>     state and restore one back exactly as it was. One of the tasks' state
>     bits is rings set up with io_setup() call. There's (almost) no problems
>     in dumping them, there's a problem restoring them -- if I dump a task
>     with aio ring originally mapped at address A, I want to restore one
>     back at exactly the same address A. Unfortunately, the io_setup() does
>     not allow for that -- it mmaps the ring at whatever place mm finds
>     appropriate (it calls do_mmap_pgoff() with zero address and without
>     the MAP_FIXED flag).
>     
>     To make restore possible I'm going to mremap() the freshly created ring
>     into the address A (under which it was seen before dump). The problem is
>     that the ring's virtual address is passed back to the user-space as the
>     context ID and this ID is then used as search key by all the other io_foo()
>     calls. Reworking this ID to be just some integer doesn't seem to work, as
>     this value is already used by libaio as a pointer using which this library
>     accesses memory for aio meta-data.
>     
>     So, to make restore work we need to make sure that
>     
>     a) ring is mapped at desired virtual address
>     b) kioctx->user_id matches this value
>     
>     Having said that, the patch makes mremap() on aio region update the
>     kioctx's user_id and mmap_base values.
>     
>     Here appears the 2nd issue I mentioned in the beginning of this mail.
>     If (regardless of the C/R dances I do) someone creates an io context
>     with io_setup(), then mremap()-s the ring and then destroys the context,
>     the kill_ioctx() routine will call munmap() on wrong (old) address.
>     This will result in a) aio ring remaining in memory and b) some other
>     vma get unexpectedly unmapped.
>     
>     What do you think?
>     
> ...
>
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -286,12 +286,37 @@ static void aio_free_ring(struct kioctx *ctx)
>  
>  static int aio_ring_mmap(struct file *file, struct vm_area_struct *vma)
>  {
> +	vma->vm_flags |= VM_DONTEXPAND;

I don't think this was changelogged?  A comment here would be best.

>  	vma->vm_ops = &generic_file_vm_ops;
>  	return 0;
>  }
>  
> +static void aio_ring_remap(struct file *file, struct vm_area_struct *vma)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	struct kioctx_table *table;
> +	int i;
> +
> +	spin_lock(&mm->ioctx_lock);
> +	rcu_read_lock();
> +	table = rcu_dereference(mm->ioctx_table);
> +	for (i = 0; i < table->nr; i++) {
> +		struct kioctx *ctx;
> +
> +		ctx = table->table[i];
> +		if (ctx && ctx->aio_ring_file == file) {
> +			ctx->user_id = ctx->mmap_base = vma->vm_start;
> +			break;
> +		}
> +	}
> +
> +	rcu_read_unlock();
> +	spin_unlock(&mm->ioctx_lock);
> +}

Looks simple enough.  Possibly so simple that it doesn't need commenting ;)

>  static const struct file_operations aio_ring_fops = {
>  	.mmap = aio_ring_mmap,
> +	.mremap = aio_ring_remap,
>  };
>  
>  #if IS_ENABLED(CONFIG_MIGRATION)
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 9ab779e..85f378c 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1497,6 +1497,7 @@ struct file_operations {
>  	long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
>  	long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
>  	int (*mmap) (struct file *, struct vm_area_struct *);
> +	void (*mremap)(struct file *, struct vm_area_struct *);

Documentation/filesystems/vfs.txt needs updating.  Note that it says
"all methods are called without any locks being held, unless otherwise
noted", but ->mremap() is called under mmap_sem (at least) so otherwise
noting will be needed.

It might make sense for ->mremap() to return an errno, but we can
change that later if needed.

>  	int (*open) (struct inode *, struct file *);
>  	int (*flush) (struct file *, fl_owner_t id);
>  	int (*release) (struct inode *, struct file *);
> diff --git a/mm/mremap.c b/mm/mremap.c
> index b147f66..c855922 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -288,7 +288,8 @@ static unsigned long move_vma(struct vm_area_struct *vma,
>  		old_len = new_len;
>  		old_addr = new_addr;
>  		new_addr = -ENOMEM;
> -	}
> +	} else if (vma->vm_file && vma->vm_file->f_op->mremap)
> +		vma->vm_file->f_op->mremap(vma->vm_file, new_vma);
>  

The patch overall is a no-op from an MM perspective and seems OK to me.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

WARNING: multiple messages have this Message-ID (diff)
From: Andrew Morton <akpm@linux-foundation.org>
To: Benjamin LaHaise <bcrl@kvack.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	linux-aio@kvack.org, Linux Kernel <linux-kernel@vger.kernel.org>,
	linux-mm <linux-mm@kvack.org>
Subject: Re: [GIT PULL] aio: changes for 3.19
Date: Sun, 14 Dec 2014 14:13:36 -0800	[thread overview]
Message-ID: <20141214141336.a0267e95.akpm@linux-foundation.org> (raw)
In-Reply-To: <20141214215221.GI2672@kvack.org>

On Sun, 14 Dec 2014 16:52:21 -0500 Benjamin LaHaise <bcrl@kvack.org> wrote:

> On Sun, Dec 14, 2014 at 01:47:32PM -0800, Linus Torvalds wrote:
> > On Sun, Dec 14, 2014 at 12:22 PM, Benjamin LaHaise <bcrl@kvack.org> wrote:
> > >
> > > Pavel Emelyanov (1):
> > >       aio: Make it possible to remap aio ring
> > 
> > So quite frankly, I think this should have had more acks from VM
> > people. The patch looks ok to me, but it took me by surprise, and I
> > don't see much any discussion about it on linux-mm either..
> 
> Sadly, nobody responded.  Maybe akpm can chime in on this change (included 
> below for ease of review and akpm added to the To:)?
> 
> 		-ben
> -- 
> "Thought is the essence of where you are now."
> 
> commit e4a0d3e720e7e508749c1439b5ba3aff56c92976
> Author: Pavel Emelyanov <xemul@parallels.com>
> Date:   Thu Sep 18 19:56:17 2014 +0400
> 
>     aio: Make it possible to remap aio ring

The patch appears to be a bugfix which coincidentally helps CRIU?

If it weren't for the bugfix part, I'd be asking "why not pass the
desired virtual address into io_setup()?".

>     There are actually two issues this patch addresses. Let me start with
>     the one I tried to solve in the beginning.
>     
>     So, in the checkpoint-restore project (criu) we try to dump tasks'
>     state and restore one back exactly as it was. One of the tasks' state
>     bits is rings set up with io_setup() call. There's (almost) no problems
>     in dumping them, there's a problem restoring them -- if I dump a task
>     with aio ring originally mapped at address A, I want to restore one
>     back at exactly the same address A. Unfortunately, the io_setup() does
>     not allow for that -- it mmaps the ring at whatever place mm finds
>     appropriate (it calls do_mmap_pgoff() with zero address and without
>     the MAP_FIXED flag).
>     
>     To make restore possible I'm going to mremap() the freshly created ring
>     into the address A (under which it was seen before dump). The problem is
>     that the ring's virtual address is passed back to the user-space as the
>     context ID and this ID is then used as search key by all the other io_foo()
>     calls. Reworking this ID to be just some integer doesn't seem to work, as
>     this value is already used by libaio as a pointer using which this library
>     accesses memory for aio meta-data.
>     
>     So, to make restore work we need to make sure that
>     
>     a) ring is mapped at desired virtual address
>     b) kioctx->user_id matches this value
>     
>     Having said that, the patch makes mremap() on aio region update the
>     kioctx's user_id and mmap_base values.
>     
>     Here appears the 2nd issue I mentioned in the beginning of this mail.
>     If (regardless of the C/R dances I do) someone creates an io context
>     with io_setup(), then mremap()-s the ring and then destroys the context,
>     the kill_ioctx() routine will call munmap() on wrong (old) address.
>     This will result in a) aio ring remaining in memory and b) some other
>     vma get unexpectedly unmapped.
>     
>     What do you think?
>     
> ...
>
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -286,12 +286,37 @@ static void aio_free_ring(struct kioctx *ctx)
>  
>  static int aio_ring_mmap(struct file *file, struct vm_area_struct *vma)
>  {
> +	vma->vm_flags |= VM_DONTEXPAND;

I don't think this was changelogged?  A comment here would be best.

>  	vma->vm_ops = &generic_file_vm_ops;
>  	return 0;
>  }
>  
> +static void aio_ring_remap(struct file *file, struct vm_area_struct *vma)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	struct kioctx_table *table;
> +	int i;
> +
> +	spin_lock(&mm->ioctx_lock);
> +	rcu_read_lock();
> +	table = rcu_dereference(mm->ioctx_table);
> +	for (i = 0; i < table->nr; i++) {
> +		struct kioctx *ctx;
> +
> +		ctx = table->table[i];
> +		if (ctx && ctx->aio_ring_file == file) {
> +			ctx->user_id = ctx->mmap_base = vma->vm_start;
> +			break;
> +		}
> +	}
> +
> +	rcu_read_unlock();
> +	spin_unlock(&mm->ioctx_lock);
> +}

Looks simple enough.  Possibly so simple that it doesn't need commenting ;)

>  static const struct file_operations aio_ring_fops = {
>  	.mmap = aio_ring_mmap,
> +	.mremap = aio_ring_remap,
>  };
>  
>  #if IS_ENABLED(CONFIG_MIGRATION)
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 9ab779e..85f378c 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1497,6 +1497,7 @@ struct file_operations {
>  	long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
>  	long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
>  	int (*mmap) (struct file *, struct vm_area_struct *);
> +	void (*mremap)(struct file *, struct vm_area_struct *);

Documentation/filesystems/vfs.txt needs updating.  Note that it says
"all methods are called without any locks being held, unless otherwise
noted", but ->mremap() is called under mmap_sem (at least) so otherwise
noting will be needed.

It might make sense for ->mremap() to return an errno, but we can
change that later if needed.

>  	int (*open) (struct inode *, struct file *);
>  	int (*flush) (struct file *, fl_owner_t id);
>  	int (*release) (struct inode *, struct file *);
> diff --git a/mm/mremap.c b/mm/mremap.c
> index b147f66..c855922 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -288,7 +288,8 @@ static unsigned long move_vma(struct vm_area_struct *vma,
>  		old_len = new_len;
>  		old_addr = new_addr;
>  		new_addr = -ENOMEM;
> -	}
> +	} else if (vma->vm_file && vma->vm_file->f_op->mremap)
> +		vma->vm_file->f_op->mremap(vma->vm_file, new_vma);
>  

The patch overall is a no-op from an MM perspective and seems OK to me.


  parent reply	other threads:[~2014-12-14 22:13 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-12-14 20:22 [GIT PULL] aio: changes for 3.19 Benjamin LaHaise
2014-12-14 20:22 ` Benjamin LaHaise
2014-12-14 21:47 ` Linus Torvalds
2014-12-14 21:47   ` Linus Torvalds
2014-12-14 21:52   ` Benjamin LaHaise
2014-12-14 21:52     ` Benjamin LaHaise
2014-12-14 22:09     ` Linus Torvalds
2014-12-14 22:09       ` Linus Torvalds
2014-12-14 22:13     ` Andrew Morton [this message]
2014-12-14 22:13       ` Andrew Morton
2014-12-14 22:39       ` Benjamin LaHaise
2014-12-14 22:39         ` Benjamin LaHaise
2014-12-15  5:53         ` Andrew Morton
2014-12-15  5:53           ` Andrew Morton
2014-12-14 23:02       ` Kirill A. Shutemov
2014-12-14 23:02         ` Kirill A. Shutemov
2014-12-14 23:08         ` Benjamin LaHaise
2014-12-14 23:08           ` Benjamin LaHaise
2014-12-14 23:11         ` Linus Torvalds
2014-12-14 23:11           ` Linus Torvalds

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20141214141336.a0267e95.akpm@linux-foundation.org \
    --to=akpm@linux-foundation.org \
    --cc=bcrl@kvack.org \
    --cc=linux-aio@kvack.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.