All of lore.kernel.org
 help / color / mirror / Atom feed
From: Kent Overstreet <kmo@daterainc.com>
To: Chris Mason <clm@fb.com>
Cc: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org,
	Peter Zijlstra <peterz@infradead.org>
Subject: Re: [PATCH RFC] fs/aio: fix sleeping while TASK_INTERRUPTIBLE
Date: Mon, 29 Dec 2014 14:08:10 -0800	[thread overview]
Message-ID: <20141229220810.GG31102@daterainc.com> (raw)
In-Reply-To: <1419865694.13012.17@mail.thefacebook.com>

On Mon, Dec 29, 2014 at 10:08:14AM -0500, Chris Mason wrote:
> 
> 
> On Wed, Dec 24, 2014 at 9:56 PM, Kent Overstreet <kmo@daterainc.com> wrote:
> >On Mon, Dec 22, 2014 at 07:16:25PM -0500, Chris Mason wrote:
> >> The 3.19 merge window brought in a great new warning to catch someone
> >> calling might_sleep with their state != TASK_RUNNING.  The idea was to
> >> find buggy code locking mutexes after calling prepare_to_wait(), kind
> >> of like this:
> >
> >Ben just told me about this issue.
> >
> >IMO, the way the code is structured now is correct, I would argue the
> >problem is
> >with the way wait_event() works - they way they have to mess with the
> >global-ish
> >task state when adding a wait_queue_t to a wait_queue_head (who came up
> >with
> >these names?)
> 
> Grin, probably related to the guy who made closure_wait() not actually wait.

Hah, touche :)

> The advantage to the waitqueue head _t setup is its a very well understood
> mechanism for sleeping on something without missing wakeups. The locking
> overhead for the waitqueues can be a problem for lots of waiters on the same
> queue, but otherwise the overhead is low.

Well, unfortunately I'm not sure it's possible to sanely implement wake_one()
with a lockless singly linked list...

although, maybe we can do it with a lock that's only used for wakups (removing
entries from the list), not adding entries to the list... that would still be
nice...

> I think closures are too big a hammer for this problem, unless benchmarks
> show we need the lockless lists (I really like that part).  I do hesitate to
> make big changes here because debugging AIO hangs is horrible.  The code is
> only tested by a few workloads, and we can go a long time before problems
> are noticed.  When people do hit bugs, we only notice the ones where
> applications pile up in getevents.  Otherwise it's just strange performance
> changes that we can't explain because they are hidden in the app's AIO state
> machine.

Eh - OTOH, closures are old code that have been quite solid and has been
changing very little over the past several years, and they're used very heavily
in bcache (and some other out of tree code). The company I'm at now also has a
pretty extensive nightly test suite for bcache now that's by extension testing
closures too.

And IMO doing this with closures makes the AIO code much cleaner and saner -
reading this version, I'm _much_ more confident the AIO code is correct vs.
having to fiddle with the task state.

> When I first looked at the warning, I didn't realize that might_sleep and
> friends were setting a preempted flag to make sure the task wasn't removed
> from the runqueue.  So I thought we'd potentially sleep forever (thanks
> Peter for details++).  The real risk here is burning CPU in the running
> state, potentially a lot of it if the mutex is highly contended. We've
> probably been hitting this for a while, but since we test AIO performance
> with fast storage, the burning just made us look faster.

This issue btw did come up in reviews when I first wrote this code, people
definitely saw what was going on at the time (Andrew Morton, among others). The
reasoning was that another that's going to sleep is going to just be squashing
our task state setting before they sleep, so it wasn't going to deadlock. And
then if the code does sleep and sets task state back to TASK_RUNNING - because
we just slept we're probably going to want to recheck the conditional again,
which here is pretty cheap.

So I'm still not sure it's a real issue (at least with the mutex_lock() usage;
what Ben was telling me about copy_to_user() sounded more concerning).

OTOH, we recently ran into this same kind of bug in some (out of tree) code of
our own, where the conditional passed to wait_event() was sending RPCs which was
then _always_ setting the task state back to TASK_RUNNING - so a few weeks ago a
coworker brought back closure_wait_event() for that. That bug also made me a bit
more concerned about this type of issue.

I do also thing it's worth having a generic version of wait_event() that anyone
can use with arbitrary conditional expressions without worrying about the safety
of things that might muck with task state - I do feel pretty strongly that that
would find good uses outside of just the AIO code. Lately I've been able to
improve a lot of code by restructuring it to use wait_event(), even when the
conditional is another large function - it a big win to be able to use standard
infrastructure that solves a pretty subtle issue. Just think of it as a better
wait_event() :)

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

  reply	other threads:[~2014-12-29 22:08 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-12-23  0:16 [PATCH RFC] fs/aio: fix sleeping while TASK_INTERRUPTIBLE Chris Mason
2014-12-23 18:43 ` Benjamin LaHaise
2014-12-23 18:55   ` Chris Mason
2014-12-23 21:58     ` Benjamin LaHaise
2014-12-25  2:59       ` Kent Overstreet
2014-12-25  3:11         ` Benjamin LaHaise
2014-12-25  3:29           ` Kent Overstreet
2014-12-29  1:24           ` Chris Mason
2014-12-25  2:56 ` Kent Overstreet
2014-12-25 14:27   ` Sedat Dilek
2015-01-04 10:16     ` Sedat Dilek
2014-12-29 15:08   ` Chris Mason
2014-12-29 22:08     ` Kent Overstreet [this message]
2015-01-13 16:06 ` Benjamin LaHaise
2015-01-13 16:20   ` Chris Mason
2015-01-21 10:13 ` Dave Chinner
2015-01-21 21:42   ` Chris Mason
2015-02-03  9:14     ` Sedat Dilek
2015-02-03  9:54       ` Sedat Dilek
2015-02-09  3:08     ` Sedat Dilek
2015-02-09  4:21       ` Sedat Dilek

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20141229220810.GG31102@daterainc.com \
    --to=kmo@daterainc.com \
    --cc=clm@fb.com \
    --cc=linux-aio@kvack.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=peterz@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.