* git pull -- [PATCH] aio: v2 ensure access to ctx->ring_pages is correctly serialised @ 2014-03-27 13:46 Benjamin LaHaise 2014-03-27 18:16 ` Linus Torvalds 0 siblings, 1 reply; 11+ messages in thread From: Benjamin LaHaise @ 2014-03-27 13:46 UTC (permalink / raw) To: Linus Torvalds Cc: Yasuaki Ishimatsu, Sasha Levin, Tang Chen, Gu Zheng, linux-kernel, stable, linux-aio, linux-mm Hello Linus and everyone, Please pull the change below from my aio-fixes git repository at git://git.kvack.org/~bcrl/aio-fixes.git which fixes a couple of issues found in the aio page migration code. This patch is applicable to the 3.12 and later stable trees as well. Thanks to all the folks involved in reporting & testing. -ben ----snip---- As reported by Tang Chen, Gu Zheng and Yasuaki Isimatsu, the following issues exist in the aio ring page migration support. As a result, for example, we have the following problem: thread 1 | thread 2 | aio_migratepage() | |-> take ctx->completion_lock | |-> migrate_page_copy(new, old) | | *NOW*, ctx->ring_pages[idx] == old | | | *NOW*, ctx->ring_pages[idx] == old | aio_read_events_ring() | |-> ring = kmap_atomic(ctx->ring_pages[0]) | |-> ring->head = head; *HERE, write to the old ring page* | |-> kunmap_atomic(ring); | |-> ctx->ring_pages[idx] = new | | *BUT NOW*, the content of | | ring_pages[idx] is old. | |-> release ctx->completion_lock | As above, the new ring page will not be updated. Fix this issue, as well as prevent races in aio_ring_setup() by taking the ring_lock mutex during page migration and where otherwise applicable. This avoids the overhead of taking another spinlock in aio_read_events_ring() as Tang's and Gu's original fix did, pushing the overhead into the migration code. Note that to handle the nesting of ring_lock inside of mmap_sem, the migratepage operation uses mutex_trylock(). Page migration is not a 100% critical operation in this case, so the ocassional failure can be tolerated. This issue was reported by Sasha Levin. Reported-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Reported-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Gu Zheng <guz.fnst@cn.fujitsu.com> Cc: stable@vger.kernel.org --- fs/aio.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++++------- 1 file changed, 48 insertions(+), 7 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index 062a5f6..bfe1497 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -241,8 +241,10 @@ static void put_aio_ring_file(struct kioctx *ctx) static void aio_free_ring(struct kioctx *ctx) { + unsigned long flags; int i; + spin_lock_irqsave(&ctx->completion_lock, flags); for (i = 0; i < ctx->nr_pages; i++) { struct page *page; pr_debug("pid(%d) [%d] page->count=%d\n", current->pid, i, @@ -253,6 +255,7 @@ static void aio_free_ring(struct kioctx *ctx) ctx->ring_pages[i] = NULL; put_page(page); } + spin_unlock_irqrestore(&ctx->completion_lock, flags); put_aio_ring_file(ctx); @@ -287,9 +290,29 @@ static int aio_migratepage(struct address_space *mapping, struct page *new, rc = 0; - /* Make sure the old page hasn't already been changed */ + /* Get a reference on the ioctx so we can take the ring_lock mutex. 
*/ spin_lock(&mapping->private_lock); ctx = mapping->private_data; + if (ctx) + percpu_ref_get(&ctx->users); + spin_unlock(&mapping->private_lock); + + if (!ctx) + return -EINVAL; + + /* We use mutex_trylock() here as the callers of migratepage may + * already be holding current->mm->mmap_sem, and ->ring_lock must be + * outside of mmap_sem due to its usage in aio_read_events_ring(). + * Since page migration is not an absolutely critical operation, the + * occasional failure here is acceptable. + */ + if (!mutex_trylock(&ctx->ring_lock)) { + percpu_ref_put(&ctx->users); + return -EAGAIN; + } + + /* Make sure the old page hasn't already been changed */ + spin_lock(&mapping->private_lock); if (ctx) { pgoff_t idx; spin_lock_irqsave(&ctx->completion_lock, flags); @@ -305,7 +328,7 @@ static int aio_migratepage(struct address_space *mapping, struct page *new, spin_unlock(&mapping->private_lock); if (rc != 0) - return rc; + goto out_unlock; /* Writeback must be complete */ BUG_ON(PageWriteback(old)); @@ -314,7 +337,7 @@ static int aio_migratepage(struct address_space *mapping, struct page *new, rc = migrate_page_move_mapping(mapping, new, old, NULL, mode, 1); if (rc != MIGRATEPAGE_SUCCESS) { put_page(new); - return rc; + goto out_unlock; } /* We can potentially race against kioctx teardown here. Use the @@ -346,6 +369,9 @@ static int aio_migratepage(struct address_space *mapping, struct page *new, else put_page(new); +out_unlock: + mutex_unlock(&ctx->ring_lock); + percpu_ref_put(&ctx->users); return rc; } #endif @@ -380,7 +406,7 @@ static int aio_setup_ring(struct kioctx *ctx) file = aio_private_file(ctx, nr_pages); if (IS_ERR(file)) { ctx->aio_ring_file = NULL; - return -EAGAIN; + return -ENOMEM; } ctx->aio_ring_file = file; @@ -415,7 +441,7 @@ static int aio_setup_ring(struct kioctx *ctx) if (unlikely(i != nr_pages)) { aio_free_ring(ctx); - return -EAGAIN; + return -ENOMEM; } ctx->mmap_size = nr_pages * PAGE_SIZE; @@ -429,7 +455,7 @@ static int aio_setup_ring(struct kioctx *ctx) if (IS_ERR((void *)ctx->mmap_base)) { ctx->mmap_size = 0; aio_free_ring(ctx); - return -EAGAIN; + return -ENOMEM; } pr_debug("mmap address: 0x%08lx\n", ctx->mmap_base); @@ -556,9 +582,17 @@ static int ioctx_add_table(struct kioctx *ctx, struct mm_struct *mm) rcu_read_unlock(); spin_unlock(&mm->ioctx_lock); + /* + * Accessing ring pages must be done + * holding ctx->completion_lock to + * prevent aio ring page migration + * procedure from migrating ring pages. + */ + spin_lock_irq(&ctx->completion_lock); ring = kmap_atomic(ctx->ring_pages[0]); ring->id = ctx->id; kunmap_atomic(ring); + spin_unlock_irq(&ctx->completion_lock); return 0; } @@ -657,7 +691,13 @@ static struct kioctx *ioctx_alloc(unsigned nr_events) if (!ctx->cpu) goto err; - if (aio_setup_ring(ctx) < 0) + /* Prevent races with page migration in aio_setup_ring() by holding + * the ring_lock mutex. + */ + mutex_lock(&ctx->ring_lock); + err = aio_setup_ring(ctx); + mutex_unlock(&ctx->ring_lock); + if (err < 0) goto err; atomic_set(&ctx->reqs_available, ctx->nr_events - 1); @@ -1024,6 +1064,7 @@ static long aio_read_events_ring(struct kioctx *ctx, mutex_lock(&ctx->ring_lock); + /* Access to ->ring_pages here is protected by ctx->ring_lock. */ ring = kmap_atomic(ctx->ring_pages[0]); head = ring->head; tail = ring->tail; -- 1.8.2.1 -- "Thought is the essence of where you are now." -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
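The lock-nesting constraint that motivates the mutex_trylock() in the patch above (ctx->ring_lock must be taken outside mmap_sem, yet the migratepage callers may already hold mmap_sem) can be sketched with a small userspace analogue. This is purely illustrative and uses invented names; pthread mutexes stand in for the kernel's mutex and rw_semaphore, and it is not the fs/aio.c code:

#include <errno.h>
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t mmap_sem_analogue = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t ring_lock_analogue = PTHREAD_MUTEX_INITIALIZER;

/* Reader path analogue: the ring lock is taken first and the caller-side
 * lock may be needed while it is held (the way aio_read_events_ring() can
 * fault on user memory under ctx->ring_lock), so the established order is
 * ring_lock -> mmap_sem. */
static void reader_path(void)
{
        pthread_mutex_lock(&ring_lock_analogue);
        pthread_mutex_lock(&mmap_sem_analogue);
        /* ... read events ... */
        pthread_mutex_unlock(&mmap_sem_analogue);
        pthread_mutex_unlock(&ring_lock_analogue);
}

/* Migration-like path: entered with the caller-side lock already held, so
 * blocking on the ring lock could deadlock against the reader above.
 * Trylock and report EAGAIN instead; the occasional failure is tolerable. */
static int migrate_like_path(void)
{
        int rc = 0;

        pthread_mutex_lock(&mmap_sem_analogue);  /* "caller" holds this */
        if (pthread_mutex_trylock(&ring_lock_analogue) != 0) {
                rc = EAGAIN;                     /* back off, don't deadlock */
        } else {
                /* ... safe to touch the "ring pages" here ... */
                pthread_mutex_unlock(&ring_lock_analogue);
        }
        pthread_mutex_unlock(&mmap_sem_analogue);
        return rc;
}

int main(void)
{
        reader_path();
        printf("migrate-like path returned %d\n", migrate_like_path());
        return 0;
}

In the real patch the trylock failure is reported as -EAGAIN, so the migration core can retry or give up on that page later; that is acceptable precisely because ring-page migration is not a critical operation.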
* Re: git pull -- [PATCH] aio: v2 ensure access to ctx->ring_pages is correctly serialised 2014-03-27 13:46 git pull -- [PATCH] aio: v2 ensure access to ctx->ring_pages is correctly serialised Benjamin LaHaise @ 2014-03-27 18:16 ` Linus Torvalds 2014-03-27 19:43 ` Linus Torvalds 0 siblings, 1 reply; 11+ messages in thread From: Linus Torvalds @ 2014-03-27 18:16 UTC (permalink / raw) To: Benjamin LaHaise Cc: Yasuaki Ishimatsu, Sasha Levin, Tang Chen, Gu Zheng, Linux Kernel Mailing List, stable, linux-aio, linux-mm On Thu, Mar 27, 2014 at 6:46 AM, Benjamin LaHaise <bcrl@kvack.org> wrote: > > Please pull the change below from my aio-fixes git repository Ugh. This is way late in the release, and the patch makes me go: "This is completely insane", which doesn't really help. > > static void aio_free_ring(struct kioctx *ctx) > { > + unsigned long flags; > int i; > > + spin_lock_irqsave(&ctx->completion_lock, flags); > for (i = 0; i < ctx->nr_pages; i++) { > struct page *page; > pr_debug("pid(%d) [%d] page->count=%d\n", current->pid, i, > @@ -253,6 +255,7 @@ static void aio_free_ring(struct kioctx *ctx) > ctx->ring_pages[i] = NULL; > put_page(page); > } > + spin_unlock_irqrestore(&ctx->completion_lock, flags); This is just pure bullshit. aio_free_ring() is called before ctx is free'd, so if something can race with it, then you have some seriously big problems. So the above locking change is at least terminally stupid, and at most a sign of something much much worse. And quite frankly, in either case it sounds like a bad idea for me to pull this change after rc8. >From what I can tell, the *sane* solution is to just move the "put_aio_ring_file()" to *before* this whole loop, which should mean that migratepages cannot possible access the context any more (and dammit, exactly because 'ctx' is generally immediately free'd after this function, that had *better* be true). So taking the completion_lock here really screams to me: "This patch is broken". > @@ -556,9 +582,17 @@ static int ioctx_add_table(struct kioctx *ctx, struct mm_struct *mm) > rcu_read_unlock(); > spin_unlock(&mm->ioctx_lock); > > + /* > + * Accessing ring pages must be done > + * holding ctx->completion_lock to > + * prevent aio ring page migration > + * procedure from migrating ring pages. > + */ > + spin_lock_irq(&ctx->completion_lock); > ring = kmap_atomic(ctx->ring_pages[0]); > ring->id = ctx->id; > kunmap_atomic(ring); > + spin_unlock_irq(&ctx->completion_lock); > return 0; > } And quite frankly, this smells too. Question to people following around: Why does this particular access to the first page need the completion_lock(), but the earlier accesses in "aio_setup_ring()" does not? Answer: the locking is broken, idiotic, and totally not architected. The caller (which is the same in both cases: ioctx_alloc()) randomly takes one lock in one case but not the other. > - if (aio_setup_ring(ctx) < 0) > + /* Prevent races with page migration in aio_setup_ring() by holding > + * the ring_lock mutex. > + */ > + mutex_lock(&ctx->ring_lock); > + err = aio_setup_ring(ctx); > + mutex_unlock(&ctx->ring_lock); > + if (err < 0) > goto err; > > atomic_set(&ctx->reqs_available, ctx->nr_events - 1); but a few lines later when we call ioctx_add_table(), we don't do it under that lock. It would all be cleaner if all the setup was done with the ctx->ring_lock held (you can even *initialize* it to the locked state, since this is the function that allocates it!) and then it would just be unlocked when done. But no. 
The locking is some ad-hoc random stuff that makes no sense. So to make a long sad story short: there is no way in hell I will apply this obviously crap patch this late in the game. Because this patch is just inexcusable crap, and it should *not* have been sent to me in this state. If I can see these kinds of obvious problems with it, it damn well shouldn't go into stable or into rc8. And if I'm wrong, and the "obvious problems" are actually due to subtle effects that are so subtle that I can't see them, they need bigger explanations in the commit message. But from what I can see, it's really just the patch being stupid and nobody spent the time asking themselves: "What should the locking really look like?". Because here's a hint: if you need locks at tear-down, you're almost certainly doing something wrong, because you are tearing down data structures that shouldn't be reachable. That rule really is a good rule to follow. It's why I started looking at the patch and going "this can't be right". Learn it. Please think about locking, and make it make sense - don't just add random locks to random places to "fix bugs". Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 11+ messages in thread
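The two rules being argued here, do all ring setup with the object's lock held (it can effectively start out locked) and make the object unreachable before teardown so that teardown needs no locking, can be illustrated with a hypothetical userspace sketch. All names are invented, pthread primitives stand in for the kernel ones (pthreads has no "initialise locked" helper, so init followed immediately by lock approximates it), and this is not the fs/aio.c code:

#include <pthread.h>
#include <stdlib.h>

struct obj {
        pthread_mutex_t lock;
        int *pages;                     /* stands in for ctx->ring_pages */
};

static struct obj *published;           /* stands in for mapping->private_data */
static pthread_mutex_t publish_lock = PTHREAD_MUTEX_INITIALIZER;

static struct obj *obj_alloc(void)
{
        struct obj *o = calloc(1, sizeof(*o));

        if (!o)
                return NULL;
        pthread_mutex_init(&o->lock, NULL);
        pthread_mutex_lock(&o->lock);   /* held across the whole setup */

        o->pages = calloc(8, sizeof(*o->pages));
        if (!o->pages) {
                pthread_mutex_unlock(&o->lock);  /* unlock before freeing */
                pthread_mutex_destroy(&o->lock);
                free(o);
                return NULL;
        }

        /* Publish only once setup is complete, then drop the setup lock. */
        pthread_mutex_lock(&publish_lock);
        published = o;
        pthread_mutex_unlock(&publish_lock);

        pthread_mutex_unlock(&o->lock);
        return o;
}

static void obj_free(struct obj *o)
{
        /* Unpublish first: after this nothing else can find 'o', so the
         * rest of the teardown does not need to take o->lock at all. */
        pthread_mutex_lock(&publish_lock);
        if (published == o)
                published = NULL;
        pthread_mutex_unlock(&publish_lock);

        free(o->pages);
        pthread_mutex_destroy(&o->lock);
        free(o);
}

int main(void)
{
        struct obj *o = obj_alloc();

        if (o)
                obj_free(o);
        return 0;
}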
* Re: git pull -- [PATCH] aio: v2 ensure access to ctx->ring_pages is correctly serialised 2014-03-27 18:16 ` Linus Torvalds @ 2014-03-27 19:43 ` Linus Torvalds 2014-03-27 20:08 ` Benjamin LaHaise 0 siblings, 1 reply; 11+ messages in thread From: Linus Torvalds @ 2014-03-27 19:43 UTC (permalink / raw) To: Benjamin LaHaise Cc: Yasuaki Ishimatsu, Sasha Levin, Tang Chen, Gu Zheng, Linux Kernel Mailing List, stable, linux-aio, linux-mm [-- Attachment #1: Type: text/plain, Size: 1572 bytes --] On Thu, Mar 27, 2014 at 11:16 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > It would all be cleaner if all the setup was done with the > ctx->ring_lock held (you can even *initialize* it to the locked state, > since this is the function that allocates it!) and then it would just > be unlocked when done. Attached is a TOTALLY UNTESTED patch based on Ben's one that does at least this minimal cleanup, in addition to dropping the completion_lock in aio_free_ring() in favor of instead just doing the "put_aio_ring_file()" first. I do want to stress that "untested" part. I'm not going to commit this, because I don't have any real test-cases even if it were to boot and work for me otherwise. I can't say that I like the locking. It really seems totally mis-designed to me. For example, the first completion_lock in aio_migratepage() seems to be total BS - it's locking against other migrations, but that's what "mapping->private_lock" (and now "ctx->ring_lock") protect against. The *second* completion_lock use in aio_migratepage() is actually valid: we can't copy the page contents to a new one when a completion might change the ring tail data, because then the change might be done to the old page but not the new page. But there the "check if things haven't changed" is bogus, since we've held the ring_lock. I did *not* clean up that part. But it's an example of how the locking here seems to be more "voodoo programming" than actually thought about and designed. Please, somebody who has test-cases look at this, ok? Linus [-- Attachment #2: patch.diff --] [-- Type: text/plain, Size: 4455 bytes --] fs/aio.c | 53 +++++++++++++++++++++++++++++++++++++++++++---------- 1 file changed, 43 insertions(+), 10 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index 062a5f6a1448..758430665b3a 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -243,6 +243,9 @@ static void aio_free_ring(struct kioctx *ctx) { int i; + /* This makes the ctx unreachable */ + put_aio_ring_file(ctx); + for (i = 0; i < ctx->nr_pages; i++) { struct page *page; pr_debug("pid(%d) [%d] page->count=%d\n", current->pid, i, @@ -254,8 +257,6 @@ static void aio_free_ring(struct kioctx *ctx) put_page(page); } - put_aio_ring_file(ctx); - if (ctx->ring_pages && ctx->ring_pages != ctx->internal_pages) { kfree(ctx->ring_pages); ctx->ring_pages = NULL; @@ -287,9 +288,29 @@ static int aio_migratepage(struct address_space *mapping, struct page *new, rc = 0; - /* Make sure the old page hasn't already been changed */ + /* Get a reference on the ioctx so we can take the ring_lock mutex. */ spin_lock(&mapping->private_lock); ctx = mapping->private_data; + if (ctx) + percpu_ref_get(&ctx->users); + spin_unlock(&mapping->private_lock); + + if (!ctx) + return -EINVAL; + + /* We use mutex_trylock() here as the callers of migratepage may + * already be holding current->mm->mmap_sem, and ->ring_lock must be + * outside of mmap_sem due to its usage in aio_read_events_ring(). 
+ * Since page migration is not an absolutely critical operation, the + * occasional failure here is acceptable. + */ + if (!mutex_trylock(&ctx->ring_lock)) { + percpu_ref_put(&ctx->users); + return -EAGAIN; + } + + /* Make sure the old page hasn't already been changed */ + spin_lock(&mapping->private_lock); if (ctx) { pgoff_t idx; spin_lock_irqsave(&ctx->completion_lock, flags); @@ -305,7 +326,7 @@ static int aio_migratepage(struct address_space *mapping, struct page *new, spin_unlock(&mapping->private_lock); if (rc != 0) - return rc; + goto out_unlock; /* Writeback must be complete */ BUG_ON(PageWriteback(old)); @@ -314,7 +335,7 @@ static int aio_migratepage(struct address_space *mapping, struct page *new, rc = migrate_page_move_mapping(mapping, new, old, NULL, mode, 1); if (rc != MIGRATEPAGE_SUCCESS) { put_page(new); - return rc; + goto out_unlock; } /* We can potentially race against kioctx teardown here. Use the @@ -346,6 +367,9 @@ static int aio_migratepage(struct address_space *mapping, struct page *new, else put_page(new); +out_unlock: + mutex_unlock(&ctx->ring_lock); + percpu_ref_put(&ctx->users); return rc; } #endif @@ -380,7 +404,7 @@ static int aio_setup_ring(struct kioctx *ctx) file = aio_private_file(ctx, nr_pages); if (IS_ERR(file)) { ctx->aio_ring_file = NULL; - return -EAGAIN; + return -ENOMEM; } ctx->aio_ring_file = file; @@ -415,7 +439,7 @@ static int aio_setup_ring(struct kioctx *ctx) if (unlikely(i != nr_pages)) { aio_free_ring(ctx); - return -EAGAIN; + return -ENOMEM; } ctx->mmap_size = nr_pages * PAGE_SIZE; @@ -429,7 +453,7 @@ static int aio_setup_ring(struct kioctx *ctx) if (IS_ERR((void *)ctx->mmap_base)) { ctx->mmap_size = 0; aio_free_ring(ctx); - return -EAGAIN; + return -ENOMEM; } pr_debug("mmap address: 0x%08lx\n", ctx->mmap_base); @@ -657,8 +681,13 @@ static struct kioctx *ioctx_alloc(unsigned nr_events) if (!ctx->cpu) goto err; - if (aio_setup_ring(ctx) < 0) - goto err; + /* Prevent races with page migration during setup by holding + * the ring_lock mutex. + */ + mutex_lock(&ctx->ring_lock); + err = aio_setup_ring(ctx); + if (err < 0) + goto err_unlock; atomic_set(&ctx->reqs_available, ctx->nr_events - 1); ctx->req_batch = (ctx->nr_events - 1) / (num_possible_cpus() * 4); @@ -683,6 +712,7 @@ static struct kioctx *ioctx_alloc(unsigned nr_events) if (err) goto err_cleanup; + mutex_unlock(&ctx->ring_lock); pr_debug("allocated ioctx %p[%ld]: mm=%p mask=0x%x\n", ctx, ctx->user_id, mm, ctx->nr_events); return ctx; @@ -691,6 +721,8 @@ err_cleanup: aio_nr_sub(ctx->max_reqs); err_ctx: aio_free_ring(ctx); +err_unlock: + mutex_unlock(&ctx->ring_lock); err: free_percpu(ctx->cpu); free_percpu(ctx->reqs.pcpu_count); @@ -1024,6 +1056,7 @@ static long aio_read_events_ring(struct kioctx *ctx, mutex_lock(&ctx->ring_lock); + /* Access to ->ring_pages here is protected by ctx->ring_lock. */ ring = kmap_atomic(ctx->ring_pages[0]); head = ring->head; tail = ring->tail; ^ permalink raw reply related [flat|nested] 11+ messages in thread
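The point above about the second completion_lock use, that the copy to the new page and the ring_pages[] switch must happen atomically with respect to completion writes, can be reduced to a toy userspace analogue. A pthread mutex stands in for the irq-disabled spinlock and the names are made up; this is not the fs/aio.c code:

#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define NEVENTS 64

static pthread_mutex_t completion_lock = PTHREAD_MUTEX_INITIALIZER;
static int buf_a[NEVENTS], buf_b[NEVENTS];
static int *ring = buf_a;               /* ~ctx->ring_pages[idx] */
static unsigned tail;

/* Completion path analogue: append an event under the lock. */
static void add_event(int ev)
{
        pthread_mutex_lock(&completion_lock);
        ring[tail++ % NEVENTS] = ev;
        pthread_mutex_unlock(&completion_lock);
}

/* Migration analogue: copy old -> new and switch the pointer while holding
 * the same lock, so no event can be written to the old buffer after it has
 * been copied and thereby be lost. */
static void migrate_ring(int *new_buf)
{
        pthread_mutex_lock(&completion_lock);
        memcpy(new_buf, ring, sizeof(buf_a));
        ring = new_buf;
        pthread_mutex_unlock(&completion_lock);
}

int main(void)
{
        add_event(1);
        migrate_ring(buf_b);
        add_event(2);
        printf("events: %d %d (tail=%u)\n", ring[0], ring[1], tail);
        return 0;
}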
* Re: git pull -- [PATCH] aio: v2 ensure access to ctx->ring_pages is correctly serialised 2014-03-27 19:43 ` Linus Torvalds @ 2014-03-27 20:08 ` Benjamin LaHaise 2014-03-27 20:22 ` Linus Torvalds 0 siblings, 1 reply; 11+ messages in thread From: Benjamin LaHaise @ 2014-03-27 20:08 UTC (permalink / raw) To: Linus Torvalds Cc: Yasuaki Ishimatsu, Sasha Levin, Tang Chen, Gu Zheng, Linux Kernel Mailing List, stable, linux-aio, linux-mm On Thu, Mar 27, 2014 at 12:43:13PM -0700, Linus Torvalds wrote: > On Thu, Mar 27, 2014 at 11:16 AM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > It would all be cleaner if all the setup was done with the > > ctx->ring_lock held (you can even *initialize* it to the locked state, > > since this is the function that allocates it!) and then it would just > > be unlocked when done. > > Attached is a TOTALLY UNTESTED patch based on Ben's one that does at > least this minimal cleanup, in addition to dropping the > completion_lock in aio_free_ring() in favor of instead just doing the > "put_aio_ring_file()" first. I was just about to send a slightly different variant (see below). Your rant was entirely justified. > I do want to stress that "untested" part. I'm not going to commit > this, because I don't have any real test-cases even if it were to boot > and work for me otherwise. The patch below is lightly tested -- my migration test case is able to successfully move the aio ring around multiple times. It still needs to be run through some more thorough tests (like Trinity). See below for the differences relative to your patch. > I can't say that I like the locking. It really seems totally > mis-designed to me. For example, the first completion_lock in > aio_migratepage() seems to be total BS - it's locking against other > migrations, but that's what "mapping->private_lock" (and now > "ctx->ring_lock") protect against. What I did instead is to just hold mapping->private_lock over the entire operation of aio_migratepage. That gets rid of the probably broken attempt to take a reference count on the kioctx within aio_migratepage(), and makes it completely clear that migration won't touch a kioctx after put_aio_ring_file() returns. It also cleans up much of the error handling in aio_migratepage() since things cannot change unexpectedly. > The *second* completion_lock use in aio_migratepage() is actually > valid: we can't copy the page contents to a new one when a completion > might change the ring tail data, because then the change might be done > to the old page but not the new page. But there the "check if things > haven't changed" is bogus, since we've held the ring_lock. Yes, that should have been cleaned up since it was no longer needed. > I did *not* clean up that part. But it's an example of how the locking > here seems to be more "voodoo programming" than actually thought about > and designed. I also added a few comments to help explain the locking. > Please, somebody who has test-cases look at this, ok? How does this version look? -ben aio.c | 107 +++++++++++++++++++++++++++++++++++++----------------------------- 1 file changed, 60 insertions(+), 47 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index 062a5f6..cdec97c 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -52,7 +52,8 @@ struct aio_ring { unsigned id; /* kernel internal index number */ unsigned nr; /* number of io_events */ - unsigned head; + unsigned head; /* Written to by userland or under ring_lock + * mutex by aio_read_events_ring(). 
*/ unsigned tail; unsigned magic; @@ -243,6 +244,11 @@ static void aio_free_ring(struct kioctx *ctx) { int i; + /* Disconnect the kiotx from the ring file. This prevents future + * accesses to the kioctx from page migration. + */ + put_aio_ring_file(ctx); + for (i = 0; i < ctx->nr_pages; i++) { struct page *page; pr_debug("pid(%d) [%d] page->count=%d\n", current->pid, i, @@ -254,8 +260,6 @@ static void aio_free_ring(struct kioctx *ctx) put_page(page); } - put_aio_ring_file(ctx); - if (ctx->ring_pages && ctx->ring_pages != ctx->internal_pages) { kfree(ctx->ring_pages); ctx->ring_pages = NULL; @@ -283,29 +287,38 @@ static int aio_migratepage(struct address_space *mapping, struct page *new, { struct kioctx *ctx; unsigned long flags; + pgoff_t idx; int rc; rc = 0; - /* Make sure the old page hasn't already been changed */ + /* mapping->private_lock here protects against the kioctx teardown. */ spin_lock(&mapping->private_lock); ctx = mapping->private_data; - if (ctx) { - pgoff_t idx; - spin_lock_irqsave(&ctx->completion_lock, flags); - idx = old->index; - if (idx < (pgoff_t)ctx->nr_pages) { - if (ctx->ring_pages[idx] != old) - rc = -EAGAIN; - } else - rc = -EINVAL; - spin_unlock_irqrestore(&ctx->completion_lock, flags); + if (!ctx) { + rc = -EINVAL; + goto out; + } + + /* The ring_lock mutex. The prevents aio_read_events() from writing + * to the ring's head, and prevents page migration from mucking in + * a partially initialized kiotx. + */ + if (!mutex_trylock(&ctx->ring_lock)) { + rc = -EAGAIN; + goto out; + } + + idx = old->index; + if (idx < (pgoff_t)ctx->nr_pages) { + /* Make sure the old page hasn't already been changed */ + if (ctx->ring_pages[idx] != old) + rc = -EAGAIN; } else rc = -EINVAL; - spin_unlock(&mapping->private_lock); if (rc != 0) - return rc; + goto out_unlock; /* Writeback must be complete */ BUG_ON(PageWriteback(old)); @@ -314,38 +327,26 @@ static int aio_migratepage(struct address_space *mapping, struct page *new, rc = migrate_page_move_mapping(mapping, new, old, NULL, mode, 1); if (rc != MIGRATEPAGE_SUCCESS) { put_page(new); - return rc; + goto out_unlock; } - /* We can potentially race against kioctx teardown here. Use the - * address_space's private data lock to protect the mapping's - * private_data. + /* Take completion_lock to prevent other writes to the ring buffer + * while the old page is copied to the new. This prevents new + * events from being lost. */ - spin_lock(&mapping->private_lock); - ctx = mapping->private_data; - if (ctx) { - pgoff_t idx; - spin_lock_irqsave(&ctx->completion_lock, flags); - migrate_page_copy(new, old); - idx = old->index; - if (idx < (pgoff_t)ctx->nr_pages) { - /* And only do the move if things haven't changed */ - if (ctx->ring_pages[idx] == old) - ctx->ring_pages[idx] = new; - else - rc = -EAGAIN; - } else - rc = -EINVAL; - spin_unlock_irqrestore(&ctx->completion_lock, flags); - } else - rc = -EBUSY; - spin_unlock(&mapping->private_lock); + spin_lock_irqsave(&ctx->completion_lock, flags); + migrate_page_copy(new, old); + BUG_ON(ctx->ring_pages[idx] != old); + ctx->ring_pages[idx] = new; + spin_unlock_irqrestore(&ctx->completion_lock, flags); - if (rc == MIGRATEPAGE_SUCCESS) - put_page(old); - else - put_page(new); + /* The old page is no longer accessible. 
*/ + put_page(old); +out_unlock: + mutex_unlock(&ctx->ring_lock); +out: + spin_unlock(&mapping->private_lock); return rc; } #endif @@ -380,7 +381,7 @@ static int aio_setup_ring(struct kioctx *ctx) file = aio_private_file(ctx, nr_pages); if (IS_ERR(file)) { ctx->aio_ring_file = NULL; - return -EAGAIN; + return -ENOMEM; } ctx->aio_ring_file = file; @@ -415,7 +416,7 @@ static int aio_setup_ring(struct kioctx *ctx) if (unlikely(i != nr_pages)) { aio_free_ring(ctx); - return -EAGAIN; + return -ENOMEM; } ctx->mmap_size = nr_pages * PAGE_SIZE; @@ -429,7 +430,7 @@ static int aio_setup_ring(struct kioctx *ctx) if (IS_ERR((void *)ctx->mmap_base)) { ctx->mmap_size = 0; aio_free_ring(ctx); - return -EAGAIN; + return -ENOMEM; } pr_debug("mmap address: 0x%08lx\n", ctx->mmap_base); @@ -556,6 +557,10 @@ static int ioctx_add_table(struct kioctx *ctx, struct mm_struct *mm) rcu_read_unlock(); spin_unlock(&mm->ioctx_lock); + /* While kioctx setup is in progress, + * we are protected from page migration + * changes ring_pages by ->ring_lock. + */ ring = kmap_atomic(ctx->ring_pages[0]); ring->id = ctx->id; kunmap_atomic(ring); @@ -649,6 +654,9 @@ static struct kioctx *ioctx_alloc(unsigned nr_events) spin_lock_init(&ctx->ctx_lock); spin_lock_init(&ctx->completion_lock); mutex_init(&ctx->ring_lock); + /* Protect against page migration throughout kiotx setup by keeping + * the ring_lock mutex held until setup is complete. */ + mutex_lock(&ctx->ring_lock); init_waitqueue_head(&ctx->wait); INIT_LIST_HEAD(&ctx->active_reqs); @@ -657,7 +665,8 @@ static struct kioctx *ioctx_alloc(unsigned nr_events) if (!ctx->cpu) goto err; - if (aio_setup_ring(ctx) < 0) + err = aio_setup_ring(ctx); + if (err < 0) goto err; atomic_set(&ctx->reqs_available, ctx->nr_events - 1); @@ -683,6 +692,9 @@ static struct kioctx *ioctx_alloc(unsigned nr_events) if (err) goto err_cleanup; + /* Release the ring_lock mutex now that all setup is complete. */ + mutex_unlock(&ctx->ring_lock); + pr_debug("allocated ioctx %p[%ld]: mm=%p mask=0x%x\n", ctx, ctx->user_id, mm, ctx->nr_events); return ctx; @@ -1024,6 +1036,7 @@ static long aio_read_events_ring(struct kioctx *ctx, mutex_lock(&ctx->ring_lock); + /* Access to ->ring_pages here is protected by ctx->ring_lock. */ ring = kmap_atomic(ctx->ring_pages[0]); head = ring->head; tail = ring->tail; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: git pull -- [PATCH] aio: v2 ensure access to ctx->ring_pages is correctly serialised 2014-03-27 20:08 ` Benjamin LaHaise @ 2014-03-27 20:22 ` Linus Torvalds 2014-03-27 20:57 ` Benjamin LaHaise 0 siblings, 1 reply; 11+ messages in thread From: Linus Torvalds @ 2014-03-27 20:22 UTC (permalink / raw) To: Benjamin LaHaise Cc: Yasuaki Ishimatsu, Sasha Levin, Tang Chen, Gu Zheng, Linux Kernel Mailing List, stable, linux-aio, linux-mm On Thu, Mar 27, 2014 at 1:08 PM, Benjamin LaHaise <bcrl@kvack.org> wrote: > > The patch below is lightly tested -- my migration test case is able to > successfully move the aio ring around multiple times. It still needs to > be run through some more thorough tests (like Trinity). See below for > the differences relative to your patch. Ok, from a quick glance-through this fixes my big complaints (not unsurprisingly, similarly to my patch), and seems to fix a few of the smaller ones that I didn't bother with. However, I think you missed the mutex_unlock() in the error paths of ioctx_alloc(). > What I did instead is to just hold mapping->private_lock over the entire > operation of aio_migratepage. That gets rid of the probably broken attempt > to take a reference count on the kioctx within aio_migratepage(), and makes > it completely clear that migration won't touch a kioctx after > put_aio_ring_file() returns. It also cleans up much of the error handling > in aio_migratepage() since things cannot change unexpectedly. Yes, that looks simpler. I don't know what the latency implications are, but the expensive part (the actual page migration) was and continues to be under the completion lock with interrupts disabled, so I guess it's not worse. It would be good to try to get rid of the completion lock irq thing, but that looks complex. It would likely require a two-phase migration model, where phase one is "unmap page from user space and copy it to new page", and phase two would be "insert new page into mapping". Then we could have just a "update the tail pointer and the kernel mapping under the completion lock" thing with interrupts disabled in between. But that's a much bigger change and requires help from the migration people. Maybe we don't care about latency here. > I also added a few comments to help explain the locking. > > How does this version look? Looks ok, except for the error handling wrt mutex_unlock. Did I miss it? Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: git pull -- [PATCH] aio: v2 ensure access to ctx->ring_pages is correctly serialised 2014-03-27 20:22 ` Linus Torvalds @ 2014-03-27 20:57 ` Benjamin LaHaise 2014-03-27 21:34 ` Linus Torvalds 0 siblings, 1 reply; 11+ messages in thread From: Benjamin LaHaise @ 2014-03-27 20:57 UTC (permalink / raw) To: Linus Torvalds Cc: Yasuaki Ishimatsu, Sasha Levin, Tang Chen, Gu Zheng, Linux Kernel Mailing List, stable, linux-aio, linux-mm On Thu, Mar 27, 2014 at 01:22:11PM -0700, Linus Torvalds wrote: > On Thu, Mar 27, 2014 at 1:08 PM, Benjamin LaHaise <bcrl@kvack.org> wrote: > > > > The patch below is lightly tested -- my migration test case is able to > > successfully move the aio ring around multiple times. It still needs to > > be run through some more thorough tests (like Trinity). See below for > > the differences relative to your patch. > > Ok, from a quick glance-through this fixes my big complaints (not > unrurprisingly, similarly to my patch), and seems to fix few of the > smaller ones that I didn't bother with. > > However, I think you missed the mutex_unlock() in the error paths of > ioctx_alloc(). *nod* -- I added that to the below variant. > > What I did instead is to just hold mapping->private_lock over the entire > > operation of aio_migratepage. That gets rid of the probably broken attempt > > to take a reference count on the kioctx within aio_migratepage(), and makes > > it completely clear that migration won't touch a kioctx after > > put_aio_ring_file() returns. It also cleans up much of the error handling > > in aio_migratepage() since things cannot change unexpectedly. > > Yes, that looks simpler. I don't know what the latency implications > are, but the expensive part (the actual page migration) was and > continues to be under the completion lock with interrupts disabled, so > I guess it's not worse. The only reason this was implemented is due to the hotplug CPU folks complaining about not being able to unpin memory held by aio when they offline entire nodes, so the expectation is that this occurs rarely. > It would be good to try to get rid of the completion lock irq thing, > but that looks complex. It would likely require a two-phase migration > model, where phase one is "unmap page from user space and copy it to > new page", and phase two would be "insert new page into mapping". Then > we could have just a "update the tail pointer and the kernel mapping > under the completion lock" thing with interrupts disabled in between. > > But that's a much bigger change and requires help from the migration > people. Maybe we don't care about latency here. That sounds like a better way of dealing with the latency issue. I'll think about that for the next round of changes. > > I also added a few comments to help explain the locking. > > > > How does this version look? > > Looks ok, except for the error handling wrt mutex_unlock. Did I miss it? It probably wouldn't have caused any problems since nothing else would've taken the lock, but I suspect some of the debugging code that complains about held locks being freed would eventually complain. Thanks for taking the time to look at this. -ben -- "Thought is the essence of where you are now." commit 8a620cad5db25f0c6bd5e72cae2f1c93d720842b Author: Benjamin LaHaise <bcrl@kvack.org> Date: Thu Mar 27 16:46:58 2014 -0400 aio: v3 ensure access to ctx->ring_pages is correctly serialised for migration As reported by Tang Chen, Gu Zheng and Yasuaki Isimatsu, the following issues exist in the aio ring page migration support. 
As a result, for example, we have the following problem: thread 1 | thread 2 | aio_migratepage() | |-> take ctx->completion_lock | |-> migrate_page_copy(new, old) | | *NOW*, ctx->ring_pages[idx] == old | | | *NOW*, ctx->ring_pages[idx] == old | aio_read_events_ring() | |-> ring = kmap_atomic(ctx->ring_pages[0]) | |-> ring->head = head; *HERE, write to the old ring page* | |-> kunmap_atomic(ring); | |-> ctx->ring_pages[idx] = new | | *BUT NOW*, the content of | | ring_pages[idx] is old. | |-> release ctx->completion_lock | As above, the new ring page will not be updated. Fix this issue, as well as prevent races in aio_ring_setup() by holding the ring_lock mutex during kioctx setup and page migration. This avoids the overhead of taking another spinlock in aio_read_events_ring() as Tang's and Gu's original fix did, pushing the overhead into the migration code. Note that to handle the nesting of ring_lock inside of mmap_sem, the migratepage operation uses mutex_trylock(). Page migration is not a 100% critical operation in this case, so the ocassional failure can be tolerated. This issue was reported by Sasha Levin. Based on feedback from Linus, avoid the extra taking of ctx->completion_lock. Instead, make page migration fully serialised by mapping->private_lock, and have aio_free_ring() simply disconnect the kioctx from the mapping by calling put_aio_ring_file() before touching ctx->ring_pages[]. This simplifies the error handling logic in aio_migratepage(), and should improve robustness. Reported-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Reported-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Gu Zheng <guz.fnst@cn.fujitsu.com> Cc: stable@vger.kernel.org diff --git a/fs/aio.c b/fs/aio.c index 062a5f6..eb4d75e 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -52,7 +52,8 @@ struct aio_ring { unsigned id; /* kernel internal index number */ unsigned nr; /* number of io_events */ - unsigned head; + unsigned head; /* Written to by userland or under ring_lock + * mutex by aio_read_events_ring(). */ unsigned tail; unsigned magic; @@ -243,6 +244,11 @@ static void aio_free_ring(struct kioctx *ctx) { int i; + /* Disconnect the kiotx from the ring file. This prevents future + * accesses to the kioctx from page migration. + */ + put_aio_ring_file(ctx); + for (i = 0; i < ctx->nr_pages; i++) { struct page *page; pr_debug("pid(%d) [%d] page->count=%d\n", current->pid, i, @@ -254,8 +260,6 @@ static void aio_free_ring(struct kioctx *ctx) put_page(page); } - put_aio_ring_file(ctx); - if (ctx->ring_pages && ctx->ring_pages != ctx->internal_pages) { kfree(ctx->ring_pages); ctx->ring_pages = NULL; @@ -283,29 +287,38 @@ static int aio_migratepage(struct address_space *mapping, struct page *new, { struct kioctx *ctx; unsigned long flags; + pgoff_t idx; int rc; rc = 0; - /* Make sure the old page hasn't already been changed */ + /* mapping->private_lock here protects against the kioctx teardown. */ spin_lock(&mapping->private_lock); ctx = mapping->private_data; - if (ctx) { - pgoff_t idx; - spin_lock_irqsave(&ctx->completion_lock, flags); - idx = old->index; - if (idx < (pgoff_t)ctx->nr_pages) { - if (ctx->ring_pages[idx] != old) - rc = -EAGAIN; - } else - rc = -EINVAL; - spin_unlock_irqrestore(&ctx->completion_lock, flags); + if (!ctx) { + rc = -EINVAL; + goto out; + } + + /* The ring_lock mutex. 
The prevents aio_read_events() from writing + * to the ring's head, and prevents page migration from mucking in + * a partially initialized kiotx. + */ + if (!mutex_trylock(&ctx->ring_lock)) { + rc = -EAGAIN; + goto out; + } + + idx = old->index; + if (idx < (pgoff_t)ctx->nr_pages) { + /* Make sure the old page hasn't already been changed */ + if (ctx->ring_pages[idx] != old) + rc = -EAGAIN; } else rc = -EINVAL; - spin_unlock(&mapping->private_lock); if (rc != 0) - return rc; + goto out_unlock; /* Writeback must be complete */ BUG_ON(PageWriteback(old)); @@ -314,38 +327,26 @@ static int aio_migratepage(struct address_space *mapping, struct page *new, rc = migrate_page_move_mapping(mapping, new, old, NULL, mode, 1); if (rc != MIGRATEPAGE_SUCCESS) { put_page(new); - return rc; + goto out_unlock; } - /* We can potentially race against kioctx teardown here. Use the - * address_space's private data lock to protect the mapping's - * private_data. + /* Take completion_lock to prevent other writes to the ring buffer + * while the old page is copied to the new. This prevents new + * events from being lost. */ - spin_lock(&mapping->private_lock); - ctx = mapping->private_data; - if (ctx) { - pgoff_t idx; - spin_lock_irqsave(&ctx->completion_lock, flags); - migrate_page_copy(new, old); - idx = old->index; - if (idx < (pgoff_t)ctx->nr_pages) { - /* And only do the move if things haven't changed */ - if (ctx->ring_pages[idx] == old) - ctx->ring_pages[idx] = new; - else - rc = -EAGAIN; - } else - rc = -EINVAL; - spin_unlock_irqrestore(&ctx->completion_lock, flags); - } else - rc = -EBUSY; - spin_unlock(&mapping->private_lock); + spin_lock_irqsave(&ctx->completion_lock, flags); + migrate_page_copy(new, old); + BUG_ON(ctx->ring_pages[idx] != old); + ctx->ring_pages[idx] = new; + spin_unlock_irqrestore(&ctx->completion_lock, flags); - if (rc == MIGRATEPAGE_SUCCESS) - put_page(old); - else - put_page(new); + /* The old page is no longer accessible. */ + put_page(old); +out_unlock: + mutex_unlock(&ctx->ring_lock); +out: + spin_unlock(&mapping->private_lock); return rc; } #endif @@ -380,7 +381,7 @@ static int aio_setup_ring(struct kioctx *ctx) file = aio_private_file(ctx, nr_pages); if (IS_ERR(file)) { ctx->aio_ring_file = NULL; - return -EAGAIN; + return -ENOMEM; } ctx->aio_ring_file = file; @@ -415,7 +416,7 @@ static int aio_setup_ring(struct kioctx *ctx) if (unlikely(i != nr_pages)) { aio_free_ring(ctx); - return -EAGAIN; + return -ENOMEM; } ctx->mmap_size = nr_pages * PAGE_SIZE; @@ -429,7 +430,7 @@ static int aio_setup_ring(struct kioctx *ctx) if (IS_ERR((void *)ctx->mmap_base)) { ctx->mmap_size = 0; aio_free_ring(ctx); - return -EAGAIN; + return -ENOMEM; } pr_debug("mmap address: 0x%08lx\n", ctx->mmap_base); @@ -556,6 +557,10 @@ static int ioctx_add_table(struct kioctx *ctx, struct mm_struct *mm) rcu_read_unlock(); spin_unlock(&mm->ioctx_lock); + /* While kioctx setup is in progress, + * we are protected from page migration + * changes ring_pages by ->ring_lock. + */ ring = kmap_atomic(ctx->ring_pages[0]); ring->id = ctx->id; kunmap_atomic(ring); @@ -649,6 +654,9 @@ static struct kioctx *ioctx_alloc(unsigned nr_events) spin_lock_init(&ctx->ctx_lock); spin_lock_init(&ctx->completion_lock); mutex_init(&ctx->ring_lock); + /* Protect against page migration throughout kiotx setup by keeping + * the ring_lock mutex held until setup is complete. 
*/ + mutex_lock(&ctx->ring_lock); init_waitqueue_head(&ctx->wait); INIT_LIST_HEAD(&ctx->active_reqs); @@ -657,7 +665,8 @@ static struct kioctx *ioctx_alloc(unsigned nr_events) if (!ctx->cpu) goto err; - if (aio_setup_ring(ctx) < 0) + err = aio_setup_ring(ctx); + if (err < 0) goto err; atomic_set(&ctx->reqs_available, ctx->nr_events - 1); @@ -683,6 +692,9 @@ static struct kioctx *ioctx_alloc(unsigned nr_events) if (err) goto err_cleanup; + /* Release the ring_lock mutex now that all setup is complete. */ + mutex_unlock(&ctx->ring_lock); + pr_debug("allocated ioctx %p[%ld]: mm=%p mask=0x%x\n", ctx, ctx->user_id, mm, ctx->nr_events); return ctx; @@ -691,6 +703,7 @@ err_cleanup: aio_nr_sub(ctx->max_reqs); err_ctx: aio_free_ring(ctx); + mutex_unlock(&ctx->ring_lock); err: free_percpu(ctx->cpu); free_percpu(ctx->reqs.pcpu_count); @@ -1024,6 +1037,7 @@ static long aio_read_events_ring(struct kioctx *ctx, mutex_lock(&ctx->ring_lock); + /* Access to ->ring_pages here is protected by ctx->ring_lock. */ ring = kmap_atomic(ctx->ring_pages[0]); head = ring->head; tail = ring->tail; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: git pull -- [PATCH] aio: v2 ensure access to ctx->ring_pages is correctly serialised 2014-03-27 20:57 ` Benjamin LaHaise @ 2014-03-27 21:34 ` Linus Torvalds 2014-03-28 14:58 ` Benjamin LaHaise 0 siblings, 1 reply; 11+ messages in thread From: Linus Torvalds @ 2014-03-27 21:34 UTC (permalink / raw) To: Benjamin LaHaise Cc: Yasuaki Ishimatsu, Sasha Levin, Tang Chen, Gu Zheng, Linux Kernel Mailing List, stable, linux-aio, linux-mm On Thu, Mar 27, 2014 at 1:57 PM, Benjamin LaHaise <bcrl@kvack.org> wrote: > > *nod* -- I added that to the below variant. You still have "goto err" for cases that have the ctx locked. Which means that the thing gets free'd while still locked, which causes problems for lockdep etc, so don't do it. Do what I did: add a "err_unlock" label, and make anybody after the mutex_lock() call it. No broken shortcuts. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: git pull -- [PATCH] aio: v2 ensure access to ctx->ring_pages is correctly serialised 2014-03-27 21:34 ` Linus Torvalds @ 2014-03-28 14:58 ` Benjamin LaHaise 2014-03-28 17:08 ` Linus Torvalds 0 siblings, 1 reply; 11+ messages in thread From: Benjamin LaHaise @ 2014-03-28 14:58 UTC (permalink / raw) To: Linus Torvalds Cc: Yasuaki Ishimatsu, Sasha Levin, Tang Chen, Gu Zheng, Linux Kernel Mailing List, stable, linux-aio, linux-mm On Thu, Mar 27, 2014 at 02:34:08PM -0700, Linus Torvalds wrote: > On Thu, Mar 27, 2014 at 1:57 PM, Benjamin LaHaise <bcrl@kvack.org> wrote: > > > > *nod* -- I added that to the below variant. > > You still have "goto err" for cases that have the ctx locked. Which > means that the thing gets free'd while still locked, which causes > problems for lockdep etc, so don't do it. > > Do what I did: add a "err_unlock" label, and make anybody after the > mutex_lock() call it. No broken shortcuts. Here's a respin of that part. I just moved the mutex initialization up so that it's always valid in the err path. I have also run this version of the patch through xfstests and my migration test programs and it looks okay. -ben -- "Thought is the essence of where you are now." commit fa8a53c39f3fdde98c9eace6a9b412143f0f6ed6 Author: Benjamin LaHaise <bcrl@kvack.org> Date: Fri Mar 28 10:14:45 2014 -0400 aio: v4 ensure access to ctx->ring_pages is correctly serialised for migration As reported by Tang Chen, Gu Zheng and Yasuaki Isimatsu, the following issues exist in the aio ring page migration support. As a result, for example, we have the following problem: thread 1 | thread 2 | aio_migratepage() | |-> take ctx->completion_lock | |-> migrate_page_copy(new, old) | | *NOW*, ctx->ring_pages[idx] == old | | | *NOW*, ctx->ring_pages[idx] == old | aio_read_events_ring() | |-> ring = kmap_atomic(ctx->ring_pages[0]) | |-> ring->head = head; *HERE, write to the old ring page* | |-> kunmap_atomic(ring); | |-> ctx->ring_pages[idx] = new | | *BUT NOW*, the content of | | ring_pages[idx] is old. | |-> release ctx->completion_lock | As above, the new ring page will not be updated. Fix this issue, as well as prevent races in aio_ring_setup() by holding the ring_lock mutex during kioctx setup and page migration. This avoids the overhead of taking another spinlock in aio_read_events_ring() as Tang's and Gu's original fix did, pushing the overhead into the migration code. Note that to handle the nesting of ring_lock inside of mmap_sem, the migratepage operation uses mutex_trylock(). Page migration is not a 100% critical operation in this case, so the ocassional failure can be tolerated. This issue was reported by Sasha Levin. Based on feedback from Linus, avoid the extra taking of ctx->completion_lock. Instead, make page migration fully serialised by mapping->private_lock, and have aio_free_ring() simply disconnect the kioctx from the mapping by calling put_aio_ring_file() before touching ctx->ring_pages[]. This simplifies the error handling logic in aio_migratepage(), and should improve robustness. v4: always do mutex_unlock() in cases when kioctx setup fails. 
Reported-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Reported-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Gu Zheng <guz.fnst@cn.fujitsu.com> Cc: stable@vger.kernel.org diff --git a/fs/aio.c b/fs/aio.c index 062a5f6..12a3de0e 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -52,7 +52,8 @@ struct aio_ring { unsigned id; /* kernel internal index number */ unsigned nr; /* number of io_events */ - unsigned head; + unsigned head; /* Written to by userland or under ring_lock + * mutex by aio_read_events_ring(). */ unsigned tail; unsigned magic; @@ -243,6 +244,11 @@ static void aio_free_ring(struct kioctx *ctx) { int i; + /* Disconnect the kiotx from the ring file. This prevents future + * accesses to the kioctx from page migration. + */ + put_aio_ring_file(ctx); + for (i = 0; i < ctx->nr_pages; i++) { struct page *page; pr_debug("pid(%d) [%d] page->count=%d\n", current->pid, i, @@ -254,8 +260,6 @@ static void aio_free_ring(struct kioctx *ctx) put_page(page); } - put_aio_ring_file(ctx); - if (ctx->ring_pages && ctx->ring_pages != ctx->internal_pages) { kfree(ctx->ring_pages); ctx->ring_pages = NULL; @@ -283,29 +287,38 @@ static int aio_migratepage(struct address_space *mapping, struct page *new, { struct kioctx *ctx; unsigned long flags; + pgoff_t idx; int rc; rc = 0; - /* Make sure the old page hasn't already been changed */ + /* mapping->private_lock here protects against the kioctx teardown. */ spin_lock(&mapping->private_lock); ctx = mapping->private_data; - if (ctx) { - pgoff_t idx; - spin_lock_irqsave(&ctx->completion_lock, flags); - idx = old->index; - if (idx < (pgoff_t)ctx->nr_pages) { - if (ctx->ring_pages[idx] != old) - rc = -EAGAIN; - } else - rc = -EINVAL; - spin_unlock_irqrestore(&ctx->completion_lock, flags); + if (!ctx) { + rc = -EINVAL; + goto out; + } + + /* The ring_lock mutex. The prevents aio_read_events() from writing + * to the ring's head, and prevents page migration from mucking in + * a partially initialized kiotx. + */ + if (!mutex_trylock(&ctx->ring_lock)) { + rc = -EAGAIN; + goto out; + } + + idx = old->index; + if (idx < (pgoff_t)ctx->nr_pages) { + /* Make sure the old page hasn't already been changed */ + if (ctx->ring_pages[idx] != old) + rc = -EAGAIN; } else rc = -EINVAL; - spin_unlock(&mapping->private_lock); if (rc != 0) - return rc; + goto out_unlock; /* Writeback must be complete */ BUG_ON(PageWriteback(old)); @@ -314,38 +327,26 @@ static int aio_migratepage(struct address_space *mapping, struct page *new, rc = migrate_page_move_mapping(mapping, new, old, NULL, mode, 1); if (rc != MIGRATEPAGE_SUCCESS) { put_page(new); - return rc; + goto out_unlock; } - /* We can potentially race against kioctx teardown here. Use the - * address_space's private data lock to protect the mapping's - * private_data. + /* Take completion_lock to prevent other writes to the ring buffer + * while the old page is copied to the new. This prevents new + * events from being lost. 
*/ - spin_lock(&mapping->private_lock); - ctx = mapping->private_data; - if (ctx) { - pgoff_t idx; - spin_lock_irqsave(&ctx->completion_lock, flags); - migrate_page_copy(new, old); - idx = old->index; - if (idx < (pgoff_t)ctx->nr_pages) { - /* And only do the move if things haven't changed */ - if (ctx->ring_pages[idx] == old) - ctx->ring_pages[idx] = new; - else - rc = -EAGAIN; - } else - rc = -EINVAL; - spin_unlock_irqrestore(&ctx->completion_lock, flags); - } else - rc = -EBUSY; - spin_unlock(&mapping->private_lock); + spin_lock_irqsave(&ctx->completion_lock, flags); + migrate_page_copy(new, old); + BUG_ON(ctx->ring_pages[idx] != old); + ctx->ring_pages[idx] = new; + spin_unlock_irqrestore(&ctx->completion_lock, flags); - if (rc == MIGRATEPAGE_SUCCESS) - put_page(old); - else - put_page(new); + /* The old page is no longer accessible. */ + put_page(old); +out_unlock: + mutex_unlock(&ctx->ring_lock); +out: + spin_unlock(&mapping->private_lock); return rc; } #endif @@ -380,7 +381,7 @@ static int aio_setup_ring(struct kioctx *ctx) file = aio_private_file(ctx, nr_pages); if (IS_ERR(file)) { ctx->aio_ring_file = NULL; - return -EAGAIN; + return -ENOMEM; } ctx->aio_ring_file = file; @@ -415,7 +416,7 @@ static int aio_setup_ring(struct kioctx *ctx) if (unlikely(i != nr_pages)) { aio_free_ring(ctx); - return -EAGAIN; + return -ENOMEM; } ctx->mmap_size = nr_pages * PAGE_SIZE; @@ -429,7 +430,7 @@ static int aio_setup_ring(struct kioctx *ctx) if (IS_ERR((void *)ctx->mmap_base)) { ctx->mmap_size = 0; aio_free_ring(ctx); - return -EAGAIN; + return -ENOMEM; } pr_debug("mmap address: 0x%08lx\n", ctx->mmap_base); @@ -556,6 +557,10 @@ static int ioctx_add_table(struct kioctx *ctx, struct mm_struct *mm) rcu_read_unlock(); spin_unlock(&mm->ioctx_lock); + /* While kioctx setup is in progress, + * we are protected from page migration + * changes ring_pages by ->ring_lock. + */ ring = kmap_atomic(ctx->ring_pages[0]); ring->id = ctx->id; kunmap_atomic(ring); @@ -640,24 +645,28 @@ static struct kioctx *ioctx_alloc(unsigned nr_events) ctx->max_reqs = nr_events; - if (percpu_ref_init(&ctx->users, free_ioctx_users)) - goto err; - - if (percpu_ref_init(&ctx->reqs, free_ioctx_reqs)) - goto err; - spin_lock_init(&ctx->ctx_lock); spin_lock_init(&ctx->completion_lock); mutex_init(&ctx->ring_lock); + /* Protect against page migration throughout kiotx setup by keeping + * the ring_lock mutex held until setup is complete. */ + mutex_lock(&ctx->ring_lock); init_waitqueue_head(&ctx->wait); INIT_LIST_HEAD(&ctx->active_reqs); + if (percpu_ref_init(&ctx->users, free_ioctx_users)) + goto err; + + if (percpu_ref_init(&ctx->reqs, free_ioctx_reqs)) + goto err; + ctx->cpu = alloc_percpu(struct kioctx_cpu); if (!ctx->cpu) goto err; - if (aio_setup_ring(ctx) < 0) + err = aio_setup_ring(ctx); + if (err < 0) goto err; atomic_set(&ctx->reqs_available, ctx->nr_events - 1); @@ -683,6 +692,9 @@ static struct kioctx *ioctx_alloc(unsigned nr_events) if (err) goto err_cleanup; + /* Release the ring_lock mutex now that all setup is complete. */ + mutex_unlock(&ctx->ring_lock); + pr_debug("allocated ioctx %p[%ld]: mm=%p mask=0x%x\n", ctx, ctx->user_id, mm, ctx->nr_events); return ctx; @@ -692,6 +704,7 @@ err_cleanup: err_ctx: aio_free_ring(ctx); err: + mutex_unlock(&ctx->ring_lock); free_percpu(ctx->cpu); free_percpu(ctx->reqs.pcpu_count); free_percpu(ctx->users.pcpu_count); @@ -1024,6 +1037,7 @@ static long aio_read_events_ring(struct kioctx *ctx, mutex_lock(&ctx->ring_lock); + /* Access to ->ring_pages here is protected by ctx->ring_lock. 
*/ ring = kmap_atomic(ctx->ring_pages[0]); head = ring->head; tail = ring->tail; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: git pull -- [PATCH] aio: v2 ensure access to ctx->ring_pages is correctly serialised 2014-03-28 14:58 ` Benjamin LaHaise @ 2014-03-28 17:08 ` Linus Torvalds 2014-03-28 17:22 ` Benjamin LaHaise 0 siblings, 1 reply; 11+ messages in thread From: Linus Torvalds @ 2014-03-28 17:08 UTC (permalink / raw) To: Benjamin LaHaise Cc: Yasuaki Ishimatsu, Sasha Levin, Tang Chen, Gu Zheng, Linux Kernel Mailing List, stable, linux-aio, linux-mm On Fri, Mar 28, 2014 at 7:58 AM, Benjamin LaHaise <bcrl@kvack.org> wrote: > > Here's a respin of that part. I just moved the mutex initialization up so > that it's always valid in the err path. I have also run this version of > the patch through xfstests and my migration test programs and it looks > okay. Ok. I can't find any issues with this version. How critical is this? Should I take it now, or with more testing during the 3.15 merge window and then just have it picked up from stable? Do people actually trigger the bug in real life, or has this been more of a trinity/source code analysis thing? Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: git pull -- [PATCH] aio: v2 ensure access to ctx->ring_pages is correctly serialised 2014-03-28 17:08 ` Linus Torvalds @ 2014-03-28 17:22 ` Benjamin LaHaise 2014-03-28 17:31 ` Linus Torvalds 0 siblings, 1 reply; 11+ messages in thread From: Benjamin LaHaise @ 2014-03-28 17:22 UTC (permalink / raw) To: Linus Torvalds Cc: Yasuaki Ishimatsu, Sasha Levin, Tang Chen, Gu Zheng, Linux Kernel Mailing List, stable, linux-aio, linux-mm On Fri, Mar 28, 2014 at 10:08:47AM -0700, Linus Torvalds wrote: > On Fri, Mar 28, 2014 at 7:58 AM, Benjamin LaHaise <bcrl@kvack.org> wrote: > > > > Here's a respin of that part. I just moved the mutex initialization up so > > that it's always valid in the err path. I have also run this version of > > the patch through xfstests and my migration test programs and it looks > > okay. > > Ok. I can't find any issues with this version. How critical is this? > Should I take it now, or with more testing during the 3.15 merge > window and then just have it picked up from stable? Do people actually > trigger the bug in real life, or has this been more of a > trinity/source code analysis thing? I believe it was found by analysis. I'd be okay with it baking for another week or two before pushing it to stable, as that gives the folks running various regression tests some more time to chime in. Better to get it right than to have to push another fix. -ben -- "Thought is the essence of where you are now." -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: git pull -- [PATCH] aio: v2 ensure access to ctx->ring_pages is correctly serialised 2014-03-28 17:22 ` Benjamin LaHaise @ 2014-03-28 17:31 ` Linus Torvalds 0 siblings, 0 replies; 11+ messages in thread From: Linus Torvalds @ 2014-03-28 17:31 UTC (permalink / raw) To: Benjamin LaHaise Cc: Yasuaki Ishimatsu, Sasha Levin, Tang Chen, Gu Zheng, Linux Kernel Mailing List, stable, linux-aio, linux-mm On Fri, Mar 28, 2014 at 10:22 AM, Benjamin LaHaise <bcrl@kvack.org> wrote: > > I believe it was found by analysis. I'd be okay with it baking for another > week or two before pushing it to stable, as that gives the folks running > various regression tests some more time to chime in. Better to get it > right than to have to push another fix. Ok. If we have no actual user problem reports due to this, I'll just expect to get a pull request later, and ignore it for 3.14 since the risk (even though small) seems unnecessary. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 11+ messages in thread