linux-fsdevel.vger.kernel.org archive mirror
From: Kent Overstreet <kmo@daterainc.com>
To: Chris Mason <clm@fb.com>,
	linux-fsdevel@vger.kernel.org, linux-aio@kvack.org,
	Peter Zijlstra <peterz@infradead.org>
Subject: Re: [PATCH RFC] fs/aio: fix sleeping while TASK_INTERRUPTIBLE
Date: Wed, 24 Dec 2014 18:56:41 -0800
Message-ID: <20141225025641.GC29607@moria.home.lan>
In-Reply-To: <20141223001619.GA26385@ret.masoncoding.com>

On Mon, Dec 22, 2014 at 07:16:25PM -0500, Chris Mason wrote:
> The 3.19 merge window brought in a great new warning to catch someone
> calling might_sleep with their state != TASK_RUNNING.  The idea was to
> find buggy code locking mutexes after calling prepare_to_wait(), kind
> of like this:

Ben just told me about this issue.

IMO, the way the code is structured now is correct. I would argue the problem is
with the way wait_event() works - the way it has to mess with the global-ish
task state when adding a wait_queue_t to a wait_queue_head (who came up with
these names?)
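
To make the complaint concrete, here is a toy userspace model of the
wait_event() shape. Everything named toy_* is made up for illustration and is
not a kernel API; the model also assumes the condition always blocks, which the
real aio_read_events() only does sometimes.

```c
#include <stdbool.h>

/* Toy stand-ins for the kernel's task states; not the real definitions. */
enum toy_state { TOY_RUNNING, TOY_INTERRUPTIBLE };

struct toy_task {
	enum toy_state state;
	int warnings;	/* times a blocking call ran while not TOY_RUNNING */
};

/* Models might_sleep(): complain if the caller already changed its state. */
static void toy_might_sleep(struct toy_task *t)
{
	if (t->state != TOY_RUNNING)
		t->warnings++;
}

/* Models a blocking primitive such as mutex_lock() or copy_to_user(). */
static void toy_blocking_call(struct toy_task *t)
{
	toy_might_sleep(t);
	/* Assumption: it really blocked, so it comes back in TOY_RUNNING. */
	t->state = TOY_RUNNING;
}

/* Models prepare_to_wait(): queueing and the state change are one step. */
static void toy_prepare_to_wait(struct toy_task *t)
{
	t->state = TOY_INTERRUPTIBLE;
}

/*
 * The wait_event() shape: the condition is evaluated after the state change.
 * Returns true if the might_sleep() check fired.
 */
static bool toy_wait_event_once(struct toy_task *t)
{
	toy_prepare_to_wait(t);
	toy_blocking_call(t);	/* the condition, e.g. aio_read_events() */
	return t->warnings > 0;
}
```

In this model the warning fires on every pass, and the task also leaves the
condition in TOY_RUNNING, so it would not actually sleep afterwards.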

Bcache's closures don't have this problem; a closure being on a waitlist has
nothing to do with task state - instead, closures keep a counter of the number
of things they're waiting on. You can add a closure to a waitlist and then
separately, later, do a closure_sync() to wait on the closure's remaining count
to hit 0.
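
A rough single-threaded sketch of that counting model follows. The toy_* names
and struct layouts are hypothetical and are not bcache's actual closure code;
where the real closure_sync() would sleep, the toy only reports whether it
would have to.

```c
#include <stdbool.h>
#include <stddef.h>

struct toy_closure {
	int remaining;			/* things still being waited on */
	struct toy_closure *next;	/* link in a waitlist */
};

struct toy_waitlist {
	struct toy_closure *head;
};

/* Put the closure on a waitlist; no task state is involved. */
static void toy_closure_wait(struct toy_waitlist *wl, struct toy_closure *cl)
{
	cl->remaining++;
	cl->next = wl->head;
	wl->head = cl;
}

/* Wake everything on the list: each waiter has one less thing pending. */
static void toy_closure_wake_up(struct toy_waitlist *wl)
{
	struct toy_closure *cl = wl->head;

	wl->head = NULL;
	while (cl) {
		struct toy_closure *next = cl->next;

		cl->next = NULL;
		cl->remaining--;
		cl = next;
	}
}

/* Stand-in for the check closure_sync() makes before sleeping. */
static bool toy_closure_must_sleep(const struct toy_closure *cl)
{
	return cl->remaining != 0;
}
```

Between toy_closure_wait() and the later check the caller is free to take
mutexes or touch user memory, because nothing about the task has changed yet.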

Bcache in fact used to have a closure_wait_event() macro that was exactly
analogous to wait_event() but using a closure - I forget what it was used for,
but at some point it wasn't used by bcache anymore and got deleted.

I just cooked up closure_sync_interruptible_hrtimeout() and the corresponding
wait_event macro and then converted aio to use it. This would IMO be a much
cleaner solution to the original problem.

The one disadvantage I know of, with the current code, is that closure waitlists
are singly linked - so they can be lockless, but that means you can't wake
up/remove a single closure from a waitlist, you have to do wake_up_all() - which
is an obvious disadvantage w.r.t. spurious wakeups. If people like this approach
though I'll just make closure waitlists doubly linked with a lock (which is
something I'd been considering doing anyways).
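
The tradeoff can be sketched in userspace with C11 atomics. This is only an
illustration with hypothetical toy_* names, not the closure waitlist code: a
push is one compare-and-swap on the head, and the only removal that stays
lockless is taking the whole list at once.

```c
#include <stdatomic.h>
#include <stddef.h>

struct toy_waiter {
	struct toy_waiter *next;
	int woken;
};

struct toy_lockless_list {
	_Atomic(struct toy_waiter *) head;
};

/* Lockless push: retry the compare-and-swap until our node is the head. */
static void toy_list_add(struct toy_lockless_list *l, struct toy_waiter *w)
{
	struct toy_waiter *old = atomic_load(&l->head);

	do {
		w->next = old;	/* old is refreshed when the swap fails */
	} while (!atomic_compare_exchange_weak(&l->head, &old, w));
}

/* Lockless removal only works wholesale: swap the head for NULL. */
static struct toy_waiter *toy_list_del_all(struct toy_lockless_list *l)
{
	return atomic_exchange(&l->head, (struct toy_waiter *)NULL);
}

/* So a wakeup has to wake every waiter; returns how many were woken. */
static int toy_wake_up_all(struct toy_lockless_list *l)
{
	struct toy_waiter *w = toy_list_del_all(l);
	int n = 0;

	while (w) {
		struct toy_waiter *next = w->next;

		w->woken = 1;
		n++;
		w = next;
	}
	return n;
}
```

Unlinking one node from the middle would mean editing its predecessor's next
pointer while other threads may be pushing or draining, which is where the lock
and the doubly linked list would come in.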

Here's the patch to the aio code - the rest of the series is in a branch at:
http://evilpiepirate.org/git/linux-bcache.git/log/?h=aio_ring_fix

Disclaimer: code has only been _lightly_ tested so far, the closure hrtimer
stuff was somewhat nontrivial

commit c91f0de111da37581709f7d201793a88c6993188
Author: Kent Overstreet <kmo@daterainc.com>
Date:   Wed Dec 24 17:20:32 2014 -0800

    aio: Convert to closure waitlist for aio ring buffer
    
    Advantage of closure waitlists is that we don't have to muck with the task state
    before we actually sleep; instead of prepare_to_wait() we do closure_wait(),
    which like prepare_to_wait() adds an object to a waitlist but unlike
    prepare_to_wait it's the closure that's doing the waiting, not the task.
    
    This fixes the issue with doing copy_to_user() after modifying the task state.
    
    Change-Id: Ifc75123d5bb620277d1e78dd5102e5d8bead1add

diff --git a/fs/aio.c b/fs/aio.c
index 1b7893ecc2..284c74e624 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -40,6 +40,7 @@
 #include <linux/ramfs.h>
 #include <linux/percpu-refcount.h>
 #include <linux/mount.h>
+#include <linux/closure.h>
 
 #include <asm/kmap_types.h>
 #include <asm/uaccess.h>
@@ -136,7 +137,7 @@ struct kioctx {
 
 	struct {
 		struct mutex	ring_lock;
-		wait_queue_head_t wait;
+		struct closure_waitlist wait;
 	} ____cacheline_aligned_in_smp;
 
 	struct {
@@ -689,7 +690,6 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
 	/* Protect against page migration throughout kiotx setup by keeping
 	 * the ring_lock mutex held until setup is complete. */
 	mutex_lock(&ctx->ring_lock);
-	init_waitqueue_head(&ctx->wait);
 
 	INIT_LIST_HEAD(&ctx->active_reqs);
 
@@ -772,7 +772,7 @@ static int kill_ioctx(struct mm_struct *mm, struct kioctx *ctx,
 	spin_unlock(&mm->ioctx_lock);
 
 	/* percpu_ref_kill() will do the necessary call_rcu() */
-	wake_up_all(&ctx->wait);
+	closure_wake_up(&ctx->wait);
 
 	/*
 	 * It'd be more correct to do this in free_ioctx(), after all
@@ -1121,8 +1121,7 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
 	 */
 	smp_mb();
 
-	if (waitqueue_active(&ctx->wait))
-		wake_up(&ctx->wait);
+	closure_wake_up(&ctx->wait);
 
 	percpu_ref_put(&ctx->reqs);
 }
@@ -1237,26 +1236,15 @@ static long read_events(struct kioctx *ctx, long min_nr, long nr,
 			return -EFAULT;
 
 		until = timespec_to_ktime(ts);
+
+		if (until.tv64)
+			until = ktime_add(ktime_get(), until);
 	}
 
-	/*
-	 * Note that aio_read_events() is being called as the conditional - i.e.
-	 * we're calling it after prepare_to_wait() has set task state to
-	 * TASK_INTERRUPTIBLE.
-	 *
-	 * But aio_read_events() can block, and if it blocks it's going to flip
-	 * the task state back to TASK_RUNNING.
-	 *
-	 * This should be ok, provided it doesn't flip the state back to
-	 * TASK_RUNNING and return 0 too much - that causes us to spin. That
-	 * will only happen if the mutex_lock() call blocks, and we then find
-	 * the ringbuffer empty. So in practice we should be ok, but it's
-	 * something to be aware of when touching this code.
-	 */
 	if (until.tv64 == 0)
 		aio_read_events(ctx, min_nr, nr, event, &ret);
 	else
-		wait_event_interruptible_hrtimeout(ctx->wait,
+		closure_wait_event_hrtimeout(&ctx->wait,
 				aio_read_events(ctx, min_nr, nr, event, &ret),
 				until);
 
