* git pull -- [PATCH] aio: v2 ensure access to ctx->ring_pages is correctly serialised @ 2014-03-27 13:46 Benjamin LaHaise 2014-03-27 18:16 ` Linus Torvalds 0 siblings, 1 reply; 11+ messages in thread From: Benjamin LaHaise @ 2014-03-27 13:46 UTC (permalink / raw) To: Linus Torvalds Cc: Yasuaki Ishimatsu, Sasha Levin, Tang Chen, Gu Zheng, linux-kernel, stable, linux-aio, linux-mm Hello Linus and everyone, Please pull the change below from my aio-fixes git repository at git://git.kvack.org/~bcrl/aio-fixes.git which fixes a couple of issues found in the aio page migration code. This patch is applicable to the 3.12 and later stable trees as well. Thanks to all the folks involved in reporting & testing. -ben ----snip---- As reported by Tang Chen, Gu Zheng and Yasuaki Isimatsu, the following issues exist in the aio ring page migration support. As a result, for example, we have the following problem: thread 1 | thread 2 | aio_migratepage() | |-> take ctx->completion_lock | |-> migrate_page_copy(new, old) | | *NOW*, ctx->ring_pages[idx] == old | | | *NOW*, ctx->ring_pages[idx] == old | aio_read_events_ring() | |-> ring = kmap_atomic(ctx->ring_pages[0]) | |-> ring->head = head; *HERE, write to the old ring page* | |-> kunmap_atomic(ring); | |-> ctx->ring_pages[idx] = new | | *BUT NOW*, the content of | | ring_pages[idx] is old. | |-> release ctx->completion_lock | As above, the new ring page will not be updated. Fix this issue, as well as prevent races in aio_ring_setup() by taking the ring_lock mutex during page migration and where otherwise applicable. This avoids the overhead of taking another spinlock in aio_read_events_ring() as Tang's and Gu's original fix did, pushing the overhead into the migration code. Note that to handle the nesting of ring_lock inside of mmap_sem, the migratepage operation uses mutex_trylock(). Page migration is not a 100% critical operation in this case, so the ocassional failure can be tolerated. This issue was reported by Sasha Levin. Reported-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Reported-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Gu Zheng <guz.fnst@cn.fujitsu.com> Cc: stable@vger.kernel.org --- fs/aio.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++++------- 1 file changed, 48 insertions(+), 7 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index 062a5f6..bfe1497 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -241,8 +241,10 @@ static void put_aio_ring_file(struct kioctx *ctx) static void aio_free_ring(struct kioctx *ctx) { + unsigned long flags; int i; + spin_lock_irqsave(&ctx->completion_lock, flags); for (i = 0; i < ctx->nr_pages; i++) { struct page *page; pr_debug("pid(%d) [%d] page->count=%d\n", current->pid, i, @@ -253,6 +255,7 @@ static void aio_free_ring(struct kioctx *ctx) ctx->ring_pages[i] = NULL; put_page(page); } + spin_unlock_irqrestore(&ctx->completion_lock, flags); put_aio_ring_file(ctx); @@ -287,9 +290,29 @@ static int aio_migratepage(struct address_space *mapping, struct page *new, rc = 0; - /* Make sure the old page hasn't already been changed */ + /* Get a reference on the ioctx so we can take the ring_lock mutex. 
*/ spin_lock(&mapping->private_lock); ctx = mapping->private_data; + if (ctx) + percpu_ref_get(&ctx->users); + spin_unlock(&mapping->private_lock); + + if (!ctx) + return -EINVAL; + + /* We use mutex_trylock() here as the callers of migratepage may + * already be holding current->mm->mmap_sem, and ->ring_lock must be + * outside of mmap_sem due to its usage in aio_read_events_ring(). + * Since page migration is not an absolutely critical operation, the + * occasional failure here is acceptable. + */ + if (!mutex_trylock(&ctx->ring_lock)) { + percpu_ref_put(&ctx->users); + return -EAGAIN; + } + + /* Make sure the old page hasn't already been changed */ + spin_lock(&mapping->private_lock); if (ctx) { pgoff_t idx; spin_lock_irqsave(&ctx->completion_lock, flags); @@ -305,7 +328,7 @@ static int aio_migratepage(struct address_space *mapping, struct page *new, spin_unlock(&mapping->private_lock); if (rc != 0) - return rc; + goto out_unlock; /* Writeback must be complete */ BUG_ON(PageWriteback(old)); @@ -314,7 +337,7 @@ static int aio_migratepage(struct address_space *mapping, struct page *new, rc = migrate_page_move_mapping(mapping, new, old, NULL, mode, 1); if (rc != MIGRATEPAGE_SUCCESS) { put_page(new); - return rc; + goto out_unlock; } /* We can potentially race against kioctx teardown here. Use the @@ -346,6 +369,9 @@ static int aio_migratepage(struct address_space *mapping, struct page *new, else put_page(new); +out_unlock: + mutex_unlock(&ctx->ring_lock); + percpu_ref_put(&ctx->users); return rc; } #endif @@ -380,7 +406,7 @@ static int aio_setup_ring(struct kioctx *ctx) file = aio_private_file(ctx, nr_pages); if (IS_ERR(file)) { ctx->aio_ring_file = NULL; - return -EAGAIN; + return -ENOMEM; } ctx->aio_ring_file = file; @@ -415,7 +441,7 @@ static int aio_setup_ring(struct kioctx *ctx) if (unlikely(i != nr_pages)) { aio_free_ring(ctx); - return -EAGAIN; + return -ENOMEM; } ctx->mmap_size = nr_pages * PAGE_SIZE; @@ -429,7 +455,7 @@ static int aio_setup_ring(struct kioctx *ctx) if (IS_ERR((void *)ctx->mmap_base)) { ctx->mmap_size = 0; aio_free_ring(ctx); - return -EAGAIN; + return -ENOMEM; } pr_debug("mmap address: 0x%08lx\n", ctx->mmap_base); @@ -556,9 +582,17 @@ static int ioctx_add_table(struct kioctx *ctx, struct mm_struct *mm) rcu_read_unlock(); spin_unlock(&mm->ioctx_lock); + /* + * Accessing ring pages must be done + * holding ctx->completion_lock to + * prevent aio ring page migration + * procedure from migrating ring pages. + */ + spin_lock_irq(&ctx->completion_lock); ring = kmap_atomic(ctx->ring_pages[0]); ring->id = ctx->id; kunmap_atomic(ring); + spin_unlock_irq(&ctx->completion_lock); return 0; } @@ -657,7 +691,13 @@ static struct kioctx *ioctx_alloc(unsigned nr_events) if (!ctx->cpu) goto err; - if (aio_setup_ring(ctx) < 0) + /* Prevent races with page migration in aio_setup_ring() by holding + * the ring_lock mutex. + */ + mutex_lock(&ctx->ring_lock); + err = aio_setup_ring(ctx); + mutex_unlock(&ctx->ring_lock); + if (err < 0) goto err; atomic_set(&ctx->reqs_available, ctx->nr_events - 1); @@ -1024,6 +1064,7 @@ static long aio_read_events_ring(struct kioctx *ctx, mutex_lock(&ctx->ring_lock); + /* Access to ->ring_pages here is protected by ctx->ring_lock. */ ring = kmap_atomic(ctx->ring_pages[0]); head = ring->head; tail = ring->tail; -- 1.8.2.1 -- "Thought is the essence of where you are now." -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
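The lock-nesting constraint that motivates the mutex_trylock() in the patch above (ctx->ring_lock must be taken outside mmap_sem, yet the migratepage callers may already hold mmap_sem) can be sketched with a small userspace analogue. This is purely illustrative and uses invented names; pthread mutexes stand in for the kernel's mutex and rw_semaphore, and it is not the fs/aio.c code:

#include <errno.h>
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t mmap_sem_analogue = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t ring_lock_analogue = PTHREAD_MUTEX_INITIALIZER;

/* Reader path analogue: the ring lock is taken first and the caller-side
 * lock may be needed while it is held (the way aio_read_events_ring() can
 * fault on user memory under ctx->ring_lock), so the established order is
 * ring_lock -> mmap_sem. */
static void reader_path(void)
{
        pthread_mutex_lock(&ring_lock_analogue);
        pthread_mutex_lock(&mmap_sem_analogue);
        /* ... read events ... */
        pthread_mutex_unlock(&mmap_sem_analogue);
        pthread_mutex_unlock(&ring_lock_analogue);
}

/* Migration-like path: entered with the caller-side lock already held, so
 * blocking on the ring lock could deadlock against the reader above.
 * Trylock and report EAGAIN instead; the occasional failure is tolerable. */
static int migrate_like_path(void)
{
        int rc = 0;

        pthread_mutex_lock(&mmap_sem_analogue);  /* "caller" holds this */
        if (pthread_mutex_trylock(&ring_lock_analogue) != 0) {
                rc = EAGAIN;                     /* back off, don't deadlock */
        } else {
                /* ... safe to touch the "ring pages" here ... */
                pthread_mutex_unlock(&ring_lock_analogue);
        }
        pthread_mutex_unlock(&mmap_sem_analogue);
        return rc;
}

int main(void)
{
        reader_path();
        printf("migrate-like path returned %d\n", migrate_like_path());
        return 0;
}

In the real patch the trylock failure is reported as -EAGAIN, so the migration core can retry or give up on that page later; that is acceptable precisely because ring-page migration is not a critical operation.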
* Re: git pull -- [PATCH] aio: v2 ensure access to ctx->ring_pages is correctly serialised 2014-03-27 13:46 git pull -- [PATCH] aio: v2 ensure access to ctx->ring_pages is correctly serialised Benjamin LaHaise @ 2014-03-27 18:16 ` Linus Torvalds 2014-03-27 19:43 ` Linus Torvalds 0 siblings, 1 reply; 11+ messages in thread From: Linus Torvalds @ 2014-03-27 18:16 UTC (permalink / raw) To: Benjamin LaHaise Cc: Yasuaki Ishimatsu, Sasha Levin, Tang Chen, Gu Zheng, Linux Kernel Mailing List, stable, linux-aio, linux-mm On Thu, Mar 27, 2014 at 6:46 AM, Benjamin LaHaise <bcrl@kvack.org> wrote: > > Please pull the change below from my aio-fixes git repository Ugh. This is way late in the release, and the patch makes me go: "This is completely insane", which doesn't really help. > > static void aio_free_ring(struct kioctx *ctx) > { > + unsigned long flags; > int i; > > + spin_lock_irqsave(&ctx->completion_lock, flags); > for (i = 0; i < ctx->nr_pages; i++) { > struct page *page; > pr_debug("pid(%d) [%d] page->count=%d\n", current->pid, i, > @@ -253,6 +255,7 @@ static void aio_free_ring(struct kioctx *ctx) > ctx->ring_pages[i] = NULL; > put_page(page); > } > + spin_unlock_irqrestore(&ctx->completion_lock, flags); This is just pure bullshit. aio_free_ring() is called before ctx is free'd, so if something can race with it, then you have some seriously big problems. So the above locking change is at least terminally stupid, and at most a sign of something much much worse. And quite frankly, in either case it sounds like a bad idea for me to pull this change after rc8. >From what I can tell, the *sane* solution is to just move the "put_aio_ring_file()" to *before* this whole loop, which should mean that migratepages cannot possible access the context any more (and dammit, exactly because 'ctx' is generally immediately free'd after this function, that had *better* be true). So taking the completion_lock here really screams to me: "This patch is broken". > @@ -556,9 +582,17 @@ static int ioctx_add_table(struct kioctx *ctx, struct mm_struct *mm) > rcu_read_unlock(); > spin_unlock(&mm->ioctx_lock); > > + /* > + * Accessing ring pages must be done > + * holding ctx->completion_lock to > + * prevent aio ring page migration > + * procedure from migrating ring pages. > + */ > + spin_lock_irq(&ctx->completion_lock); > ring = kmap_atomic(ctx->ring_pages[0]); > ring->id = ctx->id; > kunmap_atomic(ring); > + spin_unlock_irq(&ctx->completion_lock); > return 0; > } And quite frankly, this smells too. Question to people following around: Why does this particular access to the first page need the completion_lock(), but the earlier accesses in "aio_setup_ring()" does not? Answer: the locking is broken, idiotic, and totally not architected. The caller (which is the same in both cases: ioctx_alloc()) randomly takes one lock in one case but not the other. > - if (aio_setup_ring(ctx) < 0) > + /* Prevent races with page migration in aio_setup_ring() by holding > + * the ring_lock mutex. > + */ > + mutex_lock(&ctx->ring_lock); > + err = aio_setup_ring(ctx); > + mutex_unlock(&ctx->ring_lock); > + if (err < 0) > goto err; > > atomic_set(&ctx->reqs_available, ctx->nr_events - 1); but a few lines later when we call ioctx_add_table(), we don't do it under that lock. It would all be cleaner if all the setup was done with the ctx->ring_lock held (you can even *initialize* it to the locked state, since this is the function that allocates it!) and then it would just be unlocked when done. But no. 
The locking is some ad-hoc random stuff that makes no sense. So to make a long sad story short: there is no way in hell I will apply this obviously crap patch this late in the game. Because this patch is just inexcusable crap, and it should *not* have been sent to me in this state. If I can see these kinds of obvious problems with it, it damn well shouldn't go into stable or into rc8. And if I'm wrong, and the "obvious problems" are actually due to subtle effects that are so subtle that I can't see them, they need bigger explanations in the commit message. But from what I can see, it's really just the patch being stupid and nobody spent the time asking themselves: "What should the locking really look like?". Because here's a hint: if you need locks at tear-down, you're almost certainly doing something wrong, because you are tearing down data structures that shouldn't be reachable. That rule really is a good rule to follow. It's why I started looking at the patch and going "this can't be right". Learn it. Please think about locking, and make it make sense - don't just add random locks to random places to "fix bugs". Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 11+ messages in thread
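The two rules being argued here, do all ring setup with the object's lock held (it can effectively start out locked) and make the object unreachable before teardown so that teardown needs no locking, can be illustrated with a hypothetical userspace sketch. All names are invented, pthread primitives stand in for the kernel ones (pthreads has no "initialise locked" helper, so init followed immediately by lock approximates it), and this is not the fs/aio.c code:

#include <pthread.h>
#include <stdlib.h>

struct obj {
        pthread_mutex_t lock;
        int *pages;                     /* stands in for ctx->ring_pages */
};

static struct obj *published;           /* stands in for mapping->private_data */
static pthread_mutex_t publish_lock = PTHREAD_MUTEX_INITIALIZER;

static struct obj *obj_alloc(void)
{
        struct obj *o = calloc(1, sizeof(*o));

        if (!o)
                return NULL;
        pthread_mutex_init(&o->lock, NULL);
        pthread_mutex_lock(&o->lock);   /* held across the whole setup */

        o->pages = calloc(8, sizeof(*o->pages));
        if (!o->pages) {
                pthread_mutex_unlock(&o->lock);  /* unlock before freeing */
                pthread_mutex_destroy(&o->lock);
                free(o);
                return NULL;
        }

        /* Publish only once setup is complete, then drop the setup lock. */
        pthread_mutex_lock(&publish_lock);
        published = o;
        pthread_mutex_unlock(&publish_lock);

        pthread_mutex_unlock(&o->lock);
        return o;
}

static void obj_free(struct obj *o)
{
        /* Unpublish first: after this nothing else can find 'o', so the
         * rest of the teardown does not need to take o->lock at all. */
        pthread_mutex_lock(&publish_lock);
        if (published == o)
                published = NULL;
        pthread_mutex_unlock(&publish_lock);

        free(o->pages);
        pthread_mutex_destroy(&o->lock);
        free(o);
}

int main(void)
{
        struct obj *o = obj_alloc();

        if (o)
                obj_free(o);
        return 0;
}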
* Re: git pull -- [PATCH] aio: v2 ensure access to ctx->ring_pages is correctly serialised 2014-03-27 18:16 ` Linus Torvalds @ 2014-03-27 19:43 ` Linus Torvalds 2014-03-27 20:08 ` Benjamin LaHaise 0 siblings, 1 reply; 11+ messages in thread From: Linus Torvalds @ 2014-03-27 19:43 UTC (permalink / raw) To: Benjamin LaHaise Cc: Yasuaki Ishimatsu, Sasha Levin, Tang Chen, Gu Zheng, Linux Kernel Mailing List, stable, linux-aio, linux-mm [-- Attachment #1: Type: text/plain, Size: 1572 bytes --] On Thu, Mar 27, 2014 at 11:16 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > It would all be cleaner if all the setup was done with the > ctx->ring_lock held (you can even *initialize* it to the locked state, > since this is the function that allocates it!) and then it would just > be unlocked when done. Attached is a TOTALLY UNTESTED patch based on Ben's one that does at least this minimal cleanup, in addition to dropping the completion_lock in aio_free_ring() in favor of instead just doing the "put_aio_ring_file()" first. I do want to stress that "untested" part. I'm not going to commit this, because I don't have any real test-cases even if it were to boot and work for me otherwise. I can't say that I like the locking. It really seems totally mis-designed to me. For example, the first completion_lock in aio_migratepage() seems to be total BS - it's locking against other migrations, but that's what "mapping->private_lock" (and now "ctx->ring_lock") protect against. The *second* completion_lock use in aio_migratepage() is actually valid: we can't copy the page contents to a new one when a completion might change the ring tail data, because then the change might be done to the old page but not the new page. But there the "check if things haven't changed" is bogus, since we've held the ring_lock. I did *not* clean up that part. But it's an example of how the locking here seems to be more "voodoo programming" than actually thought about and designed. Please, somebody who has test-cases look at this, ok? Linus [-- Attachment #2: patch.diff --] [-- Type: text/plain, Size: 4455 bytes --] fs/aio.c | 53 +++++++++++++++++++++++++++++++++++++++++++---------- 1 file changed, 43 insertions(+), 10 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index 062a5f6a1448..758430665b3a 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -243,6 +243,9 @@ static void aio_free_ring(struct kioctx *ctx) { int i; + /* This makes the ctx unreachable */ + put_aio_ring_file(ctx); + for (i = 0; i < ctx->nr_pages; i++) { struct page *page; pr_debug("pid(%d) [%d] page->count=%d\n", current->pid, i, @@ -254,8 +257,6 @@ static void aio_free_ring(struct kioctx *ctx) put_page(page); } - put_aio_ring_file(ctx); - if (ctx->ring_pages && ctx->ring_pages != ctx->internal_pages) { kfree(ctx->ring_pages); ctx->ring_pages = NULL; @@ -287,9 +288,29 @@ static int aio_migratepage(struct address_space *mapping, struct page *new, rc = 0; - /* Make sure the old page hasn't already been changed */ + /* Get a reference on the ioctx so we can take the ring_lock mutex. */ spin_lock(&mapping->private_lock); ctx = mapping->private_data; + if (ctx) + percpu_ref_get(&ctx->users); + spin_unlock(&mapping->private_lock); + + if (!ctx) + return -EINVAL; + + /* We use mutex_trylock() here as the callers of migratepage may + * already be holding current->mm->mmap_sem, and ->ring_lock must be + * outside of mmap_sem due to its usage in aio_read_events_ring(). 
+ * Since page migration is not an absolutely critical operation, the + * occasional failure here is acceptable. + */ + if (!mutex_trylock(&ctx->ring_lock)) { + percpu_ref_put(&ctx->users); + return -EAGAIN; + } + + /* Make sure the old page hasn't already been changed */ + spin_lock(&mapping->private_lock); if (ctx) { pgoff_t idx; spin_lock_irqsave(&ctx->completion_lock, flags); @@ -305,7 +326,7 @@ static int aio_migratepage(struct address_space *mapping, struct page *new, spin_unlock(&mapping->private_lock); if (rc != 0) - return rc; + goto out_unlock; /* Writeback must be complete */ BUG_ON(PageWriteback(old)); @@ -314,7 +335,7 @@ static int aio_migratepage(struct address_space *mapping, struct page *new, rc = migrate_page_move_mapping(mapping, new, old, NULL, mode, 1); if (rc != MIGRATEPAGE_SUCCESS) { put_page(new); - return rc; + goto out_unlock; } /* We can potentially race against kioctx teardown here. Use the @@ -346,6 +367,9 @@ static int aio_migratepage(struct address_space *mapping, struct page *new, else put_page(new); +out_unlock: + mutex_unlock(&ctx->ring_lock); + percpu_ref_put(&ctx->users); return rc; } #endif @@ -380,7 +404,7 @@ static int aio_setup_ring(struct kioctx *ctx) file = aio_private_file(ctx, nr_pages); if (IS_ERR(file)) { ctx->aio_ring_file = NULL; - return -EAGAIN; + return -ENOMEM; } ctx->aio_ring_file = file; @@ -415,7 +439,7 @@ static int aio_setup_ring(struct kioctx *ctx) if (unlikely(i != nr_pages)) { aio_free_ring(ctx); - return -EAGAIN; + return -ENOMEM; } ctx->mmap_size = nr_pages * PAGE_SIZE; @@ -429,7 +453,7 @@ static int aio_setup_ring(struct kioctx *ctx) if (IS_ERR((void *)ctx->mmap_base)) { ctx->mmap_size = 0; aio_free_ring(ctx); - return -EAGAIN; + return -ENOMEM; } pr_debug("mmap address: 0x%08lx\n", ctx->mmap_base); @@ -657,8 +681,13 @@ static struct kioctx *ioctx_alloc(unsigned nr_events) if (!ctx->cpu) goto err; - if (aio_setup_ring(ctx) < 0) - goto err; + /* Prevent races with page migration during setup by holding + * the ring_lock mutex. + */ + mutex_lock(&ctx->ring_lock); + err = aio_setup_ring(ctx); + if (err < 0) + goto err_unlock; atomic_set(&ctx->reqs_available, ctx->nr_events - 1); ctx->req_batch = (ctx->nr_events - 1) / (num_possible_cpus() * 4); @@ -683,6 +712,7 @@ static struct kioctx *ioctx_alloc(unsigned nr_events) if (err) goto err_cleanup; + mutex_unlock(&ctx->ring_lock); pr_debug("allocated ioctx %p[%ld]: mm=%p mask=0x%x\n", ctx, ctx->user_id, mm, ctx->nr_events); return ctx; @@ -691,6 +721,8 @@ err_cleanup: aio_nr_sub(ctx->max_reqs); err_ctx: aio_free_ring(ctx); +err_unlock: + mutex_unlock(&ctx->ring_lock); err: free_percpu(ctx->cpu); free_percpu(ctx->reqs.pcpu_count); @@ -1024,6 +1056,7 @@ static long aio_read_events_ring(struct kioctx *ctx, mutex_lock(&ctx->ring_lock); + /* Access to ->ring_pages here is protected by ctx->ring_lock. */ ring = kmap_atomic(ctx->ring_pages[0]); head = ring->head; tail = ring->tail; ^ permalink raw reply related [flat|nested] 11+ messages in thread
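The point above about the second completion_lock use, that the copy to the new page and the ring_pages[] switch must happen atomically with respect to completion writes, can be reduced to a toy userspace analogue. A pthread mutex stands in for the irq-disabled spinlock and the names are made up; this is not the fs/aio.c code:

#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define NEVENTS 64

static pthread_mutex_t completion_lock = PTHREAD_MUTEX_INITIALIZER;
static int buf_a[NEVENTS], buf_b[NEVENTS];
static int *ring = buf_a;               /* ~ctx->ring_pages[idx] */
static unsigned tail;

/* Completion path analogue: append an event under the lock. */
static void add_event(int ev)
{
        pthread_mutex_lock(&completion_lock);
        ring[tail++ % NEVENTS] = ev;
        pthread_mutex_unlock(&completion_lock);
}

/* Migration analogue: copy old -> new and switch the pointer while holding
 * the same lock, so no event can be written to the old buffer after it has
 * been copied and thereby be lost. */
static void migrate_ring(int *new_buf)
{
        pthread_mutex_lock(&completion_lock);
        memcpy(new_buf, ring, sizeof(buf_a));
        ring = new_buf;
        pthread_mutex_unlock(&completion_lock);
}

int main(void)
{
        add_event(1);
        migrate_ring(buf_b);
        add_event(2);
        printf("events: %d %d (tail=%u)\n", ring[0], ring[1], tail);
        return 0;
}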
* Re: git pull -- [PATCH] aio: v2 ensure access to ctx->ring_pages is correctly serialised 2014-03-27 19:43 ` Linus Torvalds @ 2014-03-27 20:08 ` Benjamin LaHaise 2014-03-27 20:22 ` Linus Torvalds 0 siblings, 1 reply; 11+ messages in thread From: Benjamin LaHaise @ 2014-03-27 20:08 UTC (permalink / raw) To: Linus Torvalds Cc: Yasuaki Ishimatsu, Sasha Levin, Tang Chen, Gu Zheng, Linux Kernel Mailing List, stable, linux-aio, linux-mm On Thu, Mar 27, 2014 at 12:43:13PM -0700, Linus Torvalds wrote: > On Thu, Mar 27, 2014 at 11:16 AM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > It would all be cleaner if all the setup was done with the > > ctx->ring_lock held (you can even *initialize* it to the locked state, > > since this is the function that allocates it!) and then it would just > > be unlocked when done. > > Attached is a TOTALLY UNTESTED patch based on Ben's one that does at > least this minimal cleanup, in addition to dropping the > completion_lock in aio_free_ring() in favor of instead just doing the > "put_aio_ring_file()" first. I was just about to send a slightly different variant (see below). Your rant was entirely justified. > I do want to stress that "untested" part. I'm not going to commit > this, because I don't have any real test-cases even if it were to boot > and work for me otherwise. The patch below is lightly tested -- my migration test case is able to successfully move the aio ring around multiple times. It still needs to be run through some more thorough tests (like Trinity). See below for the differences relative to your patch. > I can't say that I like the locking. It really seems totally > mis-designed to me. For example, the first completion_lock in > aio_migratepage() seems to be total BS - it's locking against other > migrations, but that's what "mapping->private_lock" (and now > "ctx->ring_lock") protect against. What I did instead is to just hold mapping->private_lock over the entire operation of aio_migratepage. That gets rid of the probably broken attempt to take a reference count on the kioctx within aio_migratepage(), and makes it completely clear that migration won't touch a kioctx after put_aio_ring_file() returns. It also cleans up much of the error handling in aio_migratepage() since things cannot change unexpectedly. > The *second* completion_lock use in aio_migratepage() is actually > valid: we can't copy the page contents to a new one when a completion > might change the ring tail data, because then the change might be done > to the old page but not the new page. But there the "check if things > haven't changed" is bogus, since we've held the ring_lock. Yes, that should have been cleaned up since it was no longer needed. > I did *not* clean up that part. But it's an example of how the locking > here seems to be more "voodoo programming" than actually thought about > and designed. I also added a few comments to help explain the locking. > Please, somebody who has test-cases look at this, ok? How does this version look? -ben aio.c | 107 +++++++++++++++++++++++++++++++++++++----------------------------- 1 file changed, 60 insertions(+), 47 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index 062a5f6..cdec97c 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -52,7 +52,8 @@ struct aio_ring { unsigned id; /* kernel internal index number */ unsigned nr; /* number of io_events */ - unsigned head; + unsigned head; /* Written to by userland or under ring_lock + * mutex by aio_read_events_ring(). 
*/ unsigned tail; unsigned magic; @@ -243,6 +244,11 @@ static void aio_free_ring(struct kioctx *ctx) { int i; + /* Disconnect the kiotx from the ring file. This prevents future + * accesses to the kioctx from page migration. + */ + put_aio_ring_file(ctx); + for (i = 0; i < ctx->nr_pages; i++) { struct page *page; pr_debug("pid(%d) [%d] page->count=%d\n", current->pid, i, @@ -254,8 +260,6 @@ static void aio_free_ring(struct kioctx *ctx) put_page(page); } - put_aio_ring_file(ctx); - if (ctx->ring_pages && ctx->ring_pages != ctx->internal_pages) { kfree(ctx->ring_pages); ctx->ring_pages = NULL; @@ -283,29 +287,38 @@ static int aio_migratepage(struct address_space *mapping, struct page *new, { struct kioctx *ctx; unsigned long flags; + pgoff_t idx; int rc; rc = 0; - /* Make sure the old page hasn't already been changed */ + /* mapping->private_lock here protects against the kioctx teardown. */ spin_lock(&mapping->private_lock); ctx = mapping->private_data; - if (ctx) { - pgoff_t idx; - spin_lock_irqsave(&ctx->completion_lock, flags); - idx = old->index; - if (idx < (pgoff_t)ctx->nr_pages) { - if (ctx->ring_pages[idx] != old) - rc = -EAGAIN; - } else - rc = -EINVAL; - spin_unlock_irqrestore(&ctx->completion_lock, flags); + if (!ctx) { + rc = -EINVAL; + goto out; + } + + /* The ring_lock mutex. The prevents aio_read_events() from writing + * to the ring's head, and prevents page migration from mucking in + * a partially initialized kiotx. + */ + if (!mutex_trylock(&ctx->ring_lock)) { + rc = -EAGAIN; + goto out; + } + + idx = old->index; + if (idx < (pgoff_t)ctx->nr_pages) { + /* Make sure the old page hasn't already been changed */ + if (ctx->ring_pages[idx] != old) + rc = -EAGAIN; } else rc = -EINVAL; - spin_unlock(&mapping->private_lock); if (rc != 0) - return rc; + goto out_unlock; /* Writeback must be complete */ BUG_ON(PageWriteback(old)); @@ -314,38 +327,26 @@ static int aio_migratepage(struct address_space *mapping, struct page *new, rc = migrate_page_move_mapping(mapping, new, old, NULL, mode, 1); if (rc != MIGRATEPAGE_SUCCESS) { put_page(new); - return rc; + goto out_unlock; } - /* We can potentially race against kioctx teardown here. Use the - * address_space's private data lock to protect the mapping's - * private_data. + /* Take completion_lock to prevent other writes to the ring buffer + * while the old page is copied to the new. This prevents new + * events from being lost. */ - spin_lock(&mapping->private_lock); - ctx = mapping->private_data; - if (ctx) { - pgoff_t idx; - spin_lock_irqsave(&ctx->completion_lock, flags); - migrate_page_copy(new, old); - idx = old->index; - if (idx < (pgoff_t)ctx->nr_pages) { - /* And only do the move if things haven't changed */ - if (ctx->ring_pages[idx] == old) - ctx->ring_pages[idx] = new; - else - rc = -EAGAIN; - } else - rc = -EINVAL; - spin_unlock_irqrestore(&ctx->completion_lock, flags); - } else - rc = -EBUSY; - spin_unlock(&mapping->private_lock); + spin_lock_irqsave(&ctx->completion_lock, flags); + migrate_page_copy(new, old); + BUG_ON(ctx->ring_pages[idx] != old); + ctx->ring_pages[idx] = new; + spin_unlock_irqrestore(&ctx->completion_lock, flags); - if (rc == MIGRATEPAGE_SUCCESS) - put_page(old); - else - put_page(new); + /* The old page is no longer accessible. 
*/ + put_page(old); +out_unlock: + mutex_unlock(&ctx->ring_lock); +out: + spin_unlock(&mapping->private_lock); return rc; } #endif @@ -380,7 +381,7 @@ static int aio_setup_ring(struct kioctx *ctx) file = aio_private_file(ctx, nr_pages); if (IS_ERR(file)) { ctx->aio_ring_file = NULL; - return -EAGAIN; + return -ENOMEM; } ctx->aio_ring_file = file; @@ -415,7 +416,7 @@ static int aio_setup_ring(struct kioctx *ctx) if (unlikely(i != nr_pages)) { aio_free_ring(ctx); - return -EAGAIN; + return -ENOMEM; } ctx->mmap_size = nr_pages * PAGE_SIZE; @@ -429,7 +430,7 @@ static int aio_setup_ring(struct kioctx *ctx) if (IS_ERR((void *)ctx->mmap_base)) { ctx->mmap_size = 0; aio_free_ring(ctx); - return -EAGAIN; + return -ENOMEM; } pr_debug("mmap address: 0x%08lx\n", ctx->mmap_base); @@ -556,6 +557,10 @@ static int ioctx_add_table(struct kioctx *ctx, struct mm_struct *mm) rcu_read_unlock(); spin_unlock(&mm->ioctx_lock); + /* While kioctx setup is in progress, + * we are protected from page migration + * changes ring_pages by ->ring_lock. + */ ring = kmap_atomic(ctx->ring_pages[0]); ring->id = ctx->id; kunmap_atomic(ring); @@ -649,6 +654,9 @@ static struct kioctx *ioctx_alloc(unsigned nr_events) spin_lock_init(&ctx->ctx_lock); spin_lock_init(&ctx->completion_lock); mutex_init(&ctx->ring_lock); + /* Protect against page migration throughout kiotx setup by keeping + * the ring_lock mutex held until setup is complete. */ + mutex_lock(&ctx->ring_lock); init_waitqueue_head(&ctx->wait); INIT_LIST_HEAD(&ctx->active_reqs); @@ -657,7 +665,8 @@ static struct kioctx *ioctx_alloc(unsigned nr_events) if (!ctx->cpu) goto err; - if (aio_setup_ring(ctx) < 0) + err = aio_setup_ring(ctx); + if (err < 0) goto err; atomic_set(&ctx->reqs_available, ctx->nr_events - 1); @@ -683,6 +692,9 @@ static struct kioctx *ioctx_alloc(unsigned nr_events) if (err) goto err_cleanup; + /* Release the ring_lock mutex now that all setup is complete. */ + mutex_unlock(&ctx->ring_lock); + pr_debug("allocated ioctx %p[%ld]: mm=%p mask=0x%x\n", ctx, ctx->user_id, mm, ctx->nr_events); return ctx; @@ -1024,6 +1036,7 @@ static long aio_read_events_ring(struct kioctx *ctx, mutex_lock(&ctx->ring_lock); + /* Access to ->ring_pages here is protected by ctx->ring_lock. */ ring = kmap_atomic(ctx->ring_pages[0]); head = ring->head; tail = ring->tail; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: git pull -- [PATCH] aio: v2 ensure access to ctx->ring_pages is correctly serialised 2014-03-27 20:08 ` Benjamin LaHaise @ 2014-03-27 20:22 ` Linus Torvalds 2014-03-27 20:57 ` Benjamin LaHaise 0 siblings, 1 reply; 11+ messages in thread From: Linus Torvalds @ 2014-03-27 20:22 UTC (permalink / raw) To: Benjamin LaHaise Cc: Yasuaki Ishimatsu, Sasha Levin, Tang Chen, Gu Zheng, Linux Kernel Mailing List, stable, linux-aio, linux-mm On Thu, Mar 27, 2014 at 1:08 PM, Benjamin LaHaise <bcrl@kvack.org> wrote: > > The patch below is lightly tested -- my migration test case is able to > successfully move the aio ring around multiple times. It still needs to > be run through some more thorough tests (like Trinity). See below for > the differences relative to your patch. Ok, from a quick glance-through this fixes my big complaints (not unsurprisingly, similarly to my patch), and seems to fix a few of the smaller ones that I didn't bother with. However, I think you missed the mutex_unlock() in the error paths of ioctx_alloc(). > What I did instead is to just hold mapping->private_lock over the entire > operation of aio_migratepage. That gets rid of the probably broken attempt > to take a reference count on the kioctx within aio_migratepage(), and makes > it completely clear that migration won't touch a kioctx after > put_aio_ring_file() returns. It also cleans up much of the error handling > in aio_migratepage() since things cannot change unexpectedly. Yes, that looks simpler. I don't know what the latency implications are, but the expensive part (the actual page migration) was and continues to be under the completion lock with interrupts disabled, so I guess it's not worse. It would be good to try to get rid of the completion lock irq thing, but that looks complex. It would likely require a two-phase migration model, where phase one is "unmap page from user space and copy it to new page", and phase two would be "insert new page into mapping". Then we could have just a "update the tail pointer and the kernel mapping under the completion lock" thing with interrupts disabled in between. But that's a much bigger change and requires help from the migration people. Maybe we don't care about latency here. > I also added a few comments to help explain the locking. > > How does this version look? Looks ok, except for the error handling wrt mutex_unlock. Did I miss it? Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: git pull -- [PATCH] aio: v2 ensure access to ctx->ring_pages is correctly serialised 2014-03-27 20:22 ` Linus Torvalds @ 2014-03-27 20:57 ` Benjamin LaHaise 2014-03-27 21:34 ` Linus Torvalds 0 siblings, 1 reply; 11+ messages in thread From: Benjamin LaHaise @ 2014-03-27 20:57 UTC (permalink / raw) To: Linus Torvalds Cc: Yasuaki Ishimatsu, Sasha Levin, Tang Chen, Gu Zheng, Linux Kernel Mailing List, stable, linux-aio, linux-mm On Thu, Mar 27, 2014 at 01:22:11PM -0700, Linus Torvalds wrote: > On Thu, Mar 27, 2014 at 1:08 PM, Benjamin LaHaise <bcrl@kvack.org> wrote: > > > > The patch below is lightly tested -- my migration test case is able to > > successfully move the aio ring around multiple times. It still needs to > > be run through some more thorough tests (like Trinity). See below for > > the differences relative to your patch. > > Ok, from a quick glance-through this fixes my big complaints (not > unrurprisingly, similarly to my patch), and seems to fix few of the > smaller ones that I didn't bother with. > > However, I think you missed the mutex_unlock() in the error paths of > ioctx_alloc(). *nod* -- I added that to the below variant. > > What I did instead is to just hold mapping->private_lock over the entire > > operation of aio_migratepage. That gets rid of the probably broken attempt > > to take a reference count on the kioctx within aio_migratepage(), and makes > > it completely clear that migration won't touch a kioctx after > > put_aio_ring_file() returns. It also cleans up much of the error handling > > in aio_migratepage() since things cannot change unexpectedly. > > Yes, that looks simpler. I don't know what the latency implications > are, but the expensive part (the actual page migration) was and > continues to be under the completion lock with interrupts disabled, so > I guess it's not worse. The only reason this was implemented is due to the hotplug CPU folks complaining about not being able to unpin memory held by aio when they offline entire nodes, so the expectation is that this occurs rarely. > It would be good to try to get rid of the completion lock irq thing, > but that looks complex. It would likely require a two-phase migration > model, where phase one is "unmap page from user space and copy it to > new page", and phase two would be "insert new page into mapping". Then > we could have just a "update the tail pointer and the kernel mapping > under the completion lock" thing with interrupts disabled in between. > > But that's a much bigger change and requires help from the migration > people. Maybe we don't care about latency here. That sounds like a better way of dealing with the latency issue. I'll think about that for the next round of changes. > > I also added a few comments to help explain the locking. > > > > How does this version look? > > Looks ok, except for the error handling wrt mutex_unlock. Did I miss it? It probably wouldn't have caused any problems since nothing else would've taken the lock, but I suspect some of the debugging code that complains about held locks being freed would eventually complain. Thanks for taking the time to look at this. -ben -- "Thought is the essence of where you are now." commit 8a620cad5db25f0c6bd5e72cae2f1c93d720842b Author: Benjamin LaHaise <bcrl@kvack.org> Date: Thu Mar 27 16:46:58 2014 -0400 aio: v3 ensure access to ctx->ring_pages is correctly serialised for migration As reported by Tang Chen, Gu Zheng and Yasuaki Isimatsu, the following issues exist in the aio ring page migration support. 
As a result, for example, we have the following problem: thread 1 | thread 2 | aio_migratepage() | |-> take ctx->completion_lock | |-> migrate_page_copy(new, old) | | *NOW*, ctx->ring_pages[idx] == old | | | *NOW*, ctx->ring_pages[idx] == old | aio_read_events_ring() | |-> ring = kmap_atomic(ctx->ring_pages[0]) | |-> ring->head = head; *HERE, write to the old ring page* | |-> kunmap_atomic(ring); | |-> ctx->ring_pages[idx] = new | | *BUT NOW*, the content of | | ring_pages[idx] is old. | |-> release ctx->completion_lock | As above, the new ring page will not be updated. Fix this issue, as well as prevent races in aio_ring_setup() by holding the ring_lock mutex during kioctx setup and page migration. This avoids the overhead of taking another spinlock in aio_read_events_ring() as Tang's and Gu's original fix did, pushing the overhead into the migration code. Note that to handle the nesting of ring_lock inside of mmap_sem, the migratepage operation uses mutex_trylock(). Page migration is not a 100% critical operation in this case, so the ocassional failure can be tolerated. This issue was reported by Sasha Levin. Based on feedback from Linus, avoid the extra taking of ctx->completion_lock. Instead, make page migration fully serialised by mapping->private_lock, and have aio_free_ring() simply disconnect the kioctx from the mapping by calling put_aio_ring_file() before touching ctx->ring_pages[]. This simplifies the error handling logic in aio_migratepage(), and should improve robustness. Reported-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Reported-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Gu Zheng <guz.fnst@cn.fujitsu.com> Cc: stable@vger.kernel.org diff --git a/fs/aio.c b/fs/aio.c index 062a5f6..eb4d75e 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -52,7 +52,8 @@ struct aio_ring { unsigned id; /* kernel internal index number */ unsigned nr; /* number of io_events */ - unsigned head; + unsigned head; /* Written to by userland or under ring_lock + * mutex by aio_read_events_ring(). */ unsigned tail; unsigned magic; @@ -243,6 +244,11 @@ static void aio_free_ring(struct kioctx *ctx) { int i; + /* Disconnect the kiotx from the ring file. This prevents future + * accesses to the kioctx from page migration. + */ + put_aio_ring_file(ctx); + for (i = 0; i < ctx->nr_pages; i++) { struct page *page; pr_debug("pid(%d) [%d] page->count=%d\n", current->pid, i, @@ -254,8 +260,6 @@ static void aio_free_ring(struct kioctx *ctx) put_page(page); } - put_aio_ring_file(ctx); - if (ctx->ring_pages && ctx->ring_pages != ctx->internal_pages) { kfree(ctx->ring_pages); ctx->ring_pages = NULL; @@ -283,29 +287,38 @@ static int aio_migratepage(struct address_space *mapping, struct page *new, { struct kioctx *ctx; unsigned long flags; + pgoff_t idx; int rc; rc = 0; - /* Make sure the old page hasn't already been changed */ + /* mapping->private_lock here protects against the kioctx teardown. */ spin_lock(&mapping->private_lock); ctx = mapping->private_data; - if (ctx) { - pgoff_t idx; - spin_lock_irqsave(&ctx->completion_lock, flags); - idx = old->index; - if (idx < (pgoff_t)ctx->nr_pages) { - if (ctx->ring_pages[idx] != old) - rc = -EAGAIN; - } else - rc = -EINVAL; - spin_unlock_irqrestore(&ctx->completion_lock, flags); + if (!ctx) { + rc = -EINVAL; + goto out; + } + + /* The ring_lock mutex. 
The prevents aio_read_events() from writing + * to the ring's head, and prevents page migration from mucking in + * a partially initialized kiotx. + */ + if (!mutex_trylock(&ctx->ring_lock)) { + rc = -EAGAIN; + goto out; + } + + idx = old->index; + if (idx < (pgoff_t)ctx->nr_pages) { + /* Make sure the old page hasn't already been changed */ + if (ctx->ring_pages[idx] != old) + rc = -EAGAIN; } else rc = -EINVAL; - spin_unlock(&mapping->private_lock); if (rc != 0) - return rc; + goto out_unlock; /* Writeback must be complete */ BUG_ON(PageWriteback(old)); @@ -314,38 +327,26 @@ static int aio_migratepage(struct address_space *mapping, struct page *new, rc = migrate_page_move_mapping(mapping, new, old, NULL, mode, 1); if (rc != MIGRATEPAGE_SUCCESS) { put_page(new); - return rc; + goto out_unlock; } - /* We can potentially race against kioctx teardown here. Use the - * address_space's private data lock to protect the mapping's - * private_data. + /* Take completion_lock to prevent other writes to the ring buffer + * while the old page is copied to the new. This prevents new + * events from being lost. */ - spin_lock(&mapping->private_lock); - ctx = mapping->private_data; - if (ctx) { - pgoff_t idx; - spin_lock_irqsave(&ctx->completion_lock, flags); - migrate_page_copy(new, old); - idx = old->index; - if (idx < (pgoff_t)ctx->nr_pages) { - /* And only do the move if things haven't changed */ - if (ctx->ring_pages[idx] == old) - ctx->ring_pages[idx] = new; - else - rc = -EAGAIN; - } else - rc = -EINVAL; - spin_unlock_irqrestore(&ctx->completion_lock, flags); - } else - rc = -EBUSY; - spin_unlock(&mapping->private_lock); + spin_lock_irqsave(&ctx->completion_lock, flags); + migrate_page_copy(new, old); + BUG_ON(ctx->ring_pages[idx] != old); + ctx->ring_pages[idx] = new; + spin_unlock_irqrestore(&ctx->completion_lock, flags); - if (rc == MIGRATEPAGE_SUCCESS) - put_page(old); - else - put_page(new); + /* The old page is no longer accessible. */ + put_page(old); +out_unlock: + mutex_unlock(&ctx->ring_lock); +out: + spin_unlock(&mapping->private_lock); return rc; } #endif @@ -380,7 +381,7 @@ static int aio_setup_ring(struct kioctx *ctx) file = aio_private_file(ctx, nr_pages); if (IS_ERR(file)) { ctx->aio_ring_file = NULL; - return -EAGAIN; + return -ENOMEM; } ctx->aio_ring_file = file; @@ -415,7 +416,7 @@ static int aio_setup_ring(struct kioctx *ctx) if (unlikely(i != nr_pages)) { aio_free_ring(ctx); - return -EAGAIN; + return -ENOMEM; } ctx->mmap_size = nr_pages * PAGE_SIZE; @@ -429,7 +430,7 @@ static int aio_setup_ring(struct kioctx *ctx) if (IS_ERR((void *)ctx->mmap_base)) { ctx->mmap_size = 0; aio_free_ring(ctx); - return -EAGAIN; + return -ENOMEM; } pr_debug("mmap address: 0x%08lx\n", ctx->mmap_base); @@ -556,6 +557,10 @@ static int ioctx_add_table(struct kioctx *ctx, struct mm_struct *mm) rcu_read_unlock(); spin_unlock(&mm->ioctx_lock); + /* While kioctx setup is in progress, + * we are protected from page migration + * changes ring_pages by ->ring_lock. + */ ring = kmap_atomic(ctx->ring_pages[0]); ring->id = ctx->id; kunmap_atomic(ring); @@ -649,6 +654,9 @@ static struct kioctx *ioctx_alloc(unsigned nr_events) spin_lock_init(&ctx->ctx_lock); spin_lock_init(&ctx->completion_lock); mutex_init(&ctx->ring_lock); + /* Protect against page migration throughout kiotx setup by keeping + * the ring_lock mutex held until setup is complete. 
*/ + mutex_lock(&ctx->ring_lock); init_waitqueue_head(&ctx->wait); INIT_LIST_HEAD(&ctx->active_reqs); @@ -657,7 +665,8 @@ static struct kioctx *ioctx_alloc(unsigned nr_events) if (!ctx->cpu) goto err; - if (aio_setup_ring(ctx) < 0) + err = aio_setup_ring(ctx); + if (err < 0) goto err; atomic_set(&ctx->reqs_available, ctx->nr_events - 1); @@ -683,6 +692,9 @@ static struct kioctx *ioctx_alloc(unsigned nr_events) if (err) goto err_cleanup; + /* Release the ring_lock mutex now that all setup is complete. */ + mutex_unlock(&ctx->ring_lock); + pr_debug("allocated ioctx %p[%ld]: mm=%p mask=0x%x\n", ctx, ctx->user_id, mm, ctx->nr_events); return ctx; @@ -691,6 +703,7 @@ err_cleanup: aio_nr_sub(ctx->max_reqs); err_ctx: aio_free_ring(ctx); + mutex_unlock(&ctx->ring_lock); err: free_percpu(ctx->cpu); free_percpu(ctx->reqs.pcpu_count); @@ -1024,6 +1037,7 @@ static long aio_read_events_ring(struct kioctx *ctx, mutex_lock(&ctx->ring_lock); + /* Access to ->ring_pages here is protected by ctx->ring_lock. */ ring = kmap_atomic(ctx->ring_pages[0]); head = ring->head; tail = ring->tail; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: git pull -- [PATCH] aio: v2 ensure access to ctx->ring_pages is correctly serialised 2014-03-27 20:57 ` Benjamin LaHaise @ 2014-03-27 21:34 ` Linus Torvalds 2014-03-28 14:58 ` Benjamin LaHaise 0 siblings, 1 reply; 11+ messages in thread From: Linus Torvalds @ 2014-03-27 21:34 UTC (permalink / raw) To: Benjamin LaHaise Cc: Yasuaki Ishimatsu, Sasha Levin, Tang Chen, Gu Zheng, Linux Kernel Mailing List, stable, linux-aio, linux-mm On Thu, Mar 27, 2014 at 1:57 PM, Benjamin LaHaise <bcrl@kvack.org> wrote: > > *nod* -- I added that to the below variant. You still have "goto err" for cases that have the ctx locked. Which means that the thing gets free'd while still locked, which causes problems for lockdep etc, so don't do it. Do what I did: add a "err_unlock" label, and make anybody after the mutex_lock() call it. No broken shortcuts. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: git pull -- [PATCH] aio: v2 ensure access to ctx->ring_pages is correctly serialised 2014-03-27 21:34 ` Linus Torvalds @ 2014-03-28 14:58 ` Benjamin LaHaise 2014-03-28 17:08 ` Linus Torvalds 0 siblings, 1 reply; 11+ messages in thread From: Benjamin LaHaise @ 2014-03-28 14:58 UTC (permalink / raw) To: Linus Torvalds Cc: Yasuaki Ishimatsu, Sasha Levin, Tang Chen, Gu Zheng, Linux Kernel Mailing List, stable, linux-aio, linux-mm On Thu, Mar 27, 2014 at 02:34:08PM -0700, Linus Torvalds wrote: > On Thu, Mar 27, 2014 at 1:57 PM, Benjamin LaHaise <bcrl@kvack.org> wrote: > > > > *nod* -- I added that to the below variant. > > You still have "goto err" for cases that have the ctx locked. Which > means that the thing gets free'd while still locked, which causes > problems for lockdep etc, so don't do it. > > Do what I did: add a "err_unlock" label, and make anybody after the > mutex_lock() call it. No broken shortcuts. Here's a respin of that part. I just moved the mutex initialization up so that it's always valid in the err path. I have also run this version of the patch through xfstests and my migration test programs and it looks okay. -ben -- "Thought is the essence of where you are now." commit fa8a53c39f3fdde98c9eace6a9b412143f0f6ed6 Author: Benjamin LaHaise <bcrl@kvack.org> Date: Fri Mar 28 10:14:45 2014 -0400 aio: v4 ensure access to ctx->ring_pages is correctly serialised for migration As reported by Tang Chen, Gu Zheng and Yasuaki Isimatsu, the following issues exist in the aio ring page migration support. As a result, for example, we have the following problem: thread 1 | thread 2 | aio_migratepage() | |-> take ctx->completion_lock | |-> migrate_page_copy(new, old) | | *NOW*, ctx->ring_pages[idx] == old | | | *NOW*, ctx->ring_pages[idx] == old | aio_read_events_ring() | |-> ring = kmap_atomic(ctx->ring_pages[0]) | |-> ring->head = head; *HERE, write to the old ring page* | |-> kunmap_atomic(ring); | |-> ctx->ring_pages[idx] = new | | *BUT NOW*, the content of | | ring_pages[idx] is old. | |-> release ctx->completion_lock | As above, the new ring page will not be updated. Fix this issue, as well as prevent races in aio_ring_setup() by holding the ring_lock mutex during kioctx setup and page migration. This avoids the overhead of taking another spinlock in aio_read_events_ring() as Tang's and Gu's original fix did, pushing the overhead into the migration code. Note that to handle the nesting of ring_lock inside of mmap_sem, the migratepage operation uses mutex_trylock(). Page migration is not a 100% critical operation in this case, so the ocassional failure can be tolerated. This issue was reported by Sasha Levin. Based on feedback from Linus, avoid the extra taking of ctx->completion_lock. Instead, make page migration fully serialised by mapping->private_lock, and have aio_free_ring() simply disconnect the kioctx from the mapping by calling put_aio_ring_file() before touching ctx->ring_pages[]. This simplifies the error handling logic in aio_migratepage(), and should improve robustness. v4: always do mutex_unlock() in cases when kioctx setup fails. 
Reported-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Reported-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Gu Zheng <guz.fnst@cn.fujitsu.com> Cc: stable@vger.kernel.org diff --git a/fs/aio.c b/fs/aio.c index 062a5f6..12a3de0e 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -52,7 +52,8 @@ struct aio_ring { unsigned id; /* kernel internal index number */ unsigned nr; /* number of io_events */ - unsigned head; + unsigned head; /* Written to by userland or under ring_lock + * mutex by aio_read_events_ring(). */ unsigned tail; unsigned magic; @@ -243,6 +244,11 @@ static void aio_free_ring(struct kioctx *ctx) { int i; + /* Disconnect the kiotx from the ring file. This prevents future + * accesses to the kioctx from page migration. + */ + put_aio_ring_file(ctx); + for (i = 0; i < ctx->nr_pages; i++) { struct page *page; pr_debug("pid(%d) [%d] page->count=%d\n", current->pid, i, @@ -254,8 +260,6 @@ static void aio_free_ring(struct kioctx *ctx) put_page(page); } - put_aio_ring_file(ctx); - if (ctx->ring_pages && ctx->ring_pages != ctx->internal_pages) { kfree(ctx->ring_pages); ctx->ring_pages = NULL; @@ -283,29 +287,38 @@ static int aio_migratepage(struct address_space *mapping, struct page *new, { struct kioctx *ctx; unsigned long flags; + pgoff_t idx; int rc; rc = 0; - /* Make sure the old page hasn't already been changed */ + /* mapping->private_lock here protects against the kioctx teardown. */ spin_lock(&mapping->private_lock); ctx = mapping->private_data; - if (ctx) { - pgoff_t idx; - spin_lock_irqsave(&ctx->completion_lock, flags); - idx = old->index; - if (idx < (pgoff_t)ctx->nr_pages) { - if (ctx->ring_pages[idx] != old) - rc = -EAGAIN; - } else - rc = -EINVAL; - spin_unlock_irqrestore(&ctx->completion_lock, flags); + if (!ctx) { + rc = -EINVAL; + goto out; + } + + /* The ring_lock mutex. The prevents aio_read_events() from writing + * to the ring's head, and prevents page migration from mucking in + * a partially initialized kiotx. + */ + if (!mutex_trylock(&ctx->ring_lock)) { + rc = -EAGAIN; + goto out; + } + + idx = old->index; + if (idx < (pgoff_t)ctx->nr_pages) { + /* Make sure the old page hasn't already been changed */ + if (ctx->ring_pages[idx] != old) + rc = -EAGAIN; } else rc = -EINVAL; - spin_unlock(&mapping->private_lock); if (rc != 0) - return rc; + goto out_unlock; /* Writeback must be complete */ BUG_ON(PageWriteback(old)); @@ -314,38 +327,26 @@ static int aio_migratepage(struct address_space *mapping, struct page *new, rc = migrate_page_move_mapping(mapping, new, old, NULL, mode, 1); if (rc != MIGRATEPAGE_SUCCESS) { put_page(new); - return rc; + goto out_unlock; } - /* We can potentially race against kioctx teardown here. Use the - * address_space's private data lock to protect the mapping's - * private_data. + /* Take completion_lock to prevent other writes to the ring buffer + * while the old page is copied to the new. This prevents new + * events from being lost. 
*/ - spin_lock(&mapping->private_lock); - ctx = mapping->private_data; - if (ctx) { - pgoff_t idx; - spin_lock_irqsave(&ctx->completion_lock, flags); - migrate_page_copy(new, old); - idx = old->index; - if (idx < (pgoff_t)ctx->nr_pages) { - /* And only do the move if things haven't changed */ - if (ctx->ring_pages[idx] == old) - ctx->ring_pages[idx] = new; - else - rc = -EAGAIN; - } else - rc = -EINVAL; - spin_unlock_irqrestore(&ctx->completion_lock, flags); - } else - rc = -EBUSY; - spin_unlock(&mapping->private_lock); + spin_lock_irqsave(&ctx->completion_lock, flags); + migrate_page_copy(new, old); + BUG_ON(ctx->ring_pages[idx] != old); + ctx->ring_pages[idx] = new; + spin_unlock_irqrestore(&ctx->completion_lock, flags); - if (rc == MIGRATEPAGE_SUCCESS) - put_page(old); - else - put_page(new); + /* The old page is no longer accessible. */ + put_page(old); +out_unlock: + mutex_unlock(&ctx->ring_lock); +out: + spin_unlock(&mapping->private_lock); return rc; } #endif @@ -380,7 +381,7 @@ static int aio_setup_ring(struct kioctx *ctx) file = aio_private_file(ctx, nr_pages); if (IS_ERR(file)) { ctx->aio_ring_file = NULL; - return -EAGAIN; + return -ENOMEM; } ctx->aio_ring_file = file; @@ -415,7 +416,7 @@ static int aio_setup_ring(struct kioctx *ctx) if (unlikely(i != nr_pages)) { aio_free_ring(ctx); - return -EAGAIN; + return -ENOMEM; } ctx->mmap_size = nr_pages * PAGE_SIZE; @@ -429,7 +430,7 @@ static int aio_setup_ring(struct kioctx *ctx) if (IS_ERR((void *)ctx->mmap_base)) { ctx->mmap_size = 0; aio_free_ring(ctx); - return -EAGAIN; + return -ENOMEM; } pr_debug("mmap address: 0x%08lx\n", ctx->mmap_base); @@ -556,6 +557,10 @@ static int ioctx_add_table(struct kioctx *ctx, struct mm_struct *mm) rcu_read_unlock(); spin_unlock(&mm->ioctx_lock); + /* While kioctx setup is in progress, + * we are protected from page migration + * changes ring_pages by ->ring_lock. + */ ring = kmap_atomic(ctx->ring_pages[0]); ring->id = ctx->id; kunmap_atomic(ring); @@ -640,24 +645,28 @@ static struct kioctx *ioctx_alloc(unsigned nr_events) ctx->max_reqs = nr_events; - if (percpu_ref_init(&ctx->users, free_ioctx_users)) - goto err; - - if (percpu_ref_init(&ctx->reqs, free_ioctx_reqs)) - goto err; - spin_lock_init(&ctx->ctx_lock); spin_lock_init(&ctx->completion_lock); mutex_init(&ctx->ring_lock); + /* Protect against page migration throughout kiotx setup by keeping + * the ring_lock mutex held until setup is complete. */ + mutex_lock(&ctx->ring_lock); init_waitqueue_head(&ctx->wait); INIT_LIST_HEAD(&ctx->active_reqs); + if (percpu_ref_init(&ctx->users, free_ioctx_users)) + goto err; + + if (percpu_ref_init(&ctx->reqs, free_ioctx_reqs)) + goto err; + ctx->cpu = alloc_percpu(struct kioctx_cpu); if (!ctx->cpu) goto err; - if (aio_setup_ring(ctx) < 0) + err = aio_setup_ring(ctx); + if (err < 0) goto err; atomic_set(&ctx->reqs_available, ctx->nr_events - 1); @@ -683,6 +692,9 @@ static struct kioctx *ioctx_alloc(unsigned nr_events) if (err) goto err_cleanup; + /* Release the ring_lock mutex now that all setup is complete. */ + mutex_unlock(&ctx->ring_lock); + pr_debug("allocated ioctx %p[%ld]: mm=%p mask=0x%x\n", ctx, ctx->user_id, mm, ctx->nr_events); return ctx; @@ -692,6 +704,7 @@ err_cleanup: err_ctx: aio_free_ring(ctx); err: + mutex_unlock(&ctx->ring_lock); free_percpu(ctx->cpu); free_percpu(ctx->reqs.pcpu_count); free_percpu(ctx->users.pcpu_count); @@ -1024,6 +1037,7 @@ static long aio_read_events_ring(struct kioctx *ctx, mutex_lock(&ctx->ring_lock); + /* Access to ->ring_pages here is protected by ctx->ring_lock. 
*/ ring = kmap_atomic(ctx->ring_pages[0]); head = ring->head; tail = ring->tail; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: git pull -- [PATCH] aio: v2 ensure access to ctx->ring_pages is correctly serialised 2014-03-28 14:58 ` Benjamin LaHaise @ 2014-03-28 17:08 ` Linus Torvalds 2014-03-28 17:22 ` Benjamin LaHaise 0 siblings, 1 reply; 11+ messages in thread From: Linus Torvalds @ 2014-03-28 17:08 UTC (permalink / raw) To: Benjamin LaHaise Cc: Yasuaki Ishimatsu, Sasha Levin, Tang Chen, Gu Zheng, Linux Kernel Mailing List, stable, linux-aio, linux-mm On Fri, Mar 28, 2014 at 7:58 AM, Benjamin LaHaise <bcrl@kvack.org> wrote: > > Here's a respin of that part. I just moved the mutex initialization up so > that it's always valid in the err path. I have also run this version of > the patch through xfstests and my migration test programs and it looks > okay. Ok. I can't find any issues with this version. How critical is this? Should I take it now, or with more testing during the 3.15 merge window and then just have it picked up from stable? Do people actually trigger the bug in real life, or has this been more of a trinity/source code analysis thing? Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: git pull -- [PATCH] aio: v2 ensure access to ctx->ring_pages is correctly serialised 2014-03-28 17:08 ` Linus Torvalds @ 2014-03-28 17:22 ` Benjamin LaHaise 2014-03-28 17:31 ` Linus Torvalds 0 siblings, 1 reply; 11+ messages in thread From: Benjamin LaHaise @ 2014-03-28 17:22 UTC (permalink / raw) To: Linus Torvalds Cc: Yasuaki Ishimatsu, Sasha Levin, Tang Chen, Gu Zheng, Linux Kernel Mailing List, stable, linux-aio, linux-mm On Fri, Mar 28, 2014 at 10:08:47AM -0700, Linus Torvalds wrote: > On Fri, Mar 28, 2014 at 7:58 AM, Benjamin LaHaise <bcrl@kvack.org> wrote: > > > > Here's a respin of that part. I just moved the mutex initialization up so > > that it's always valid in the err path. I have also run this version of > > the patch through xfstests and my migration test programs and it looks > > okay. > > Ok. I can't find any issues with this version. How critical is this? > Should I take it now, or with more testing during the 3.15 merge > window and then just have it picked up from stable? Do people actually > trigger the bug in real life, or has this been more of a > trinity/source code analysis thing? I believe it was found by analysis. I'd be okay with it baking for another week or two before pushing it to stable, as that gives the folks running various regression tests some more time to chime in. Better to get it right than to have to push another fix. -ben -- "Thought is the essence of where you are now." -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: git pull -- [PATCH] aio: v2 ensure access to ctx->ring_pages is correctly serialised 2014-03-28 17:22 ` Benjamin LaHaise @ 2014-03-28 17:31 ` Linus Torvalds 0 siblings, 0 replies; 11+ messages in thread From: Linus Torvalds @ 2014-03-28 17:31 UTC (permalink / raw) To: Benjamin LaHaise Cc: Yasuaki Ishimatsu, Sasha Levin, Tang Chen, Gu Zheng, Linux Kernel Mailing List, stable, linux-aio, linux-mm On Fri, Mar 28, 2014 at 10:22 AM, Benjamin LaHaise <bcrl@kvack.org> wrote: > > I believe it was found by analysis. I'd be okay with it baking for another > week or two before pushing it to stable, as that gives the folks running > various regression tests some more time to chime in. Better to get it > right than to have to push another fix. Ok. If we have no actual user problem reports due to this, I'll just expect to get a pull request later, and ignore it for 3.14 since the risk (even though small) seems unnecessary. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 11+ messages in thread