From: Dave Chinner <david@fromorbit.com>
To: Brian Foster <bfoster@redhat.com>
Cc: Alex Lyakas <alex@zadarastorage.com>, xfs@oss.sgi.com
Subject: Re: use-after-free on log replay failure
Date: Thu, 14 Aug 2014 16:14:44 +1000
Message-ID: <20140814061444.GH20518@dastard>
In-Reply-To: <20140813232135.GB8456@laptop.bfoster>

On Wed, Aug 13, 2014 at 07:21:35PM -0400, Brian Foster wrote:
> On Thu, Aug 14, 2014 at 06:59:29AM +1000, Dave Chinner wrote:
> > On Wed, Aug 13, 2014 at 08:59:32AM -0400, Brian Foster wrote:
> > > Perhaps I'm missing some context... I don't follow how removing the
> > > error check doesn't solve the problem. It clearly closes the race and
> > > perhaps there are other means of doing the same thing, but what part of
> > > the problem does that leave unresolved?
> > 
> > Anything that does:
> > 
> > 	xfs_buf_iorequest(bp);
> > 	if (bp->b_error)
> > 		xfs_buf_relse(bp);
> > 
> > is susceptible to the same race condition, based on bp->b_error
> > being set asynchronously and before the buffer IO completion
> > processing is complete.
> > 
> 
> Understood, but why would anything do that (as opposed to
> xfs_buf_iowait())? I don't see that we do that anywhere today
> (the check buried within xfs_buf_iowait() notwithstanding of course).

"Why" is not important - the fact is the caller *owns* the buffer
and so the above fragment of code is valid behaviour. If there is
an error on the buffer after xfs_buf_iorequest() returns on
a synchronous IO, then it's a bug if there is still IO in progress
on that buffer.

We can't run IO completion synchronously from xfs_buf_bio_end_io in
this async dispatch error case - we cannot detect it as any
different from IO completion in interrupt context - and so we need
to have some kind of reference protecting the buffer from being
freed from under the completion.

i.e. the bug is that a synchronous buffer has no active reference
while it is sitting on the completion workqueue - its references
are owned by other contexts that can drop them without regard to
the completion status of the buffer.

For async IO we transfer a reference and the lock to the IO context,
which gets dropped in xfs_buf_iodone_work when all the IO is
complete. Synchronous IO needs this protection, too.

As a proof of concept, adding this to the start of
xfs_buf_iorequest():

+	/*
+	 * synchronous IO needs its own reference count. async IO
+	 * inherits the submitter's reference count.
+	 */
+	if (!(bp->b_flags & XBF_ASYNC))
+		xfs_buf_hold(bp);

And this to the synchronous IO completion case for
xfs_buf_iodone_work():

	else {
		ASSERT(read && bp->b_ops);
		complete(&bp->b_iowait);
+		xfs_buf_rele(bp);
	}

Should ensure that all IO carries a reference count and the buffer
cannot be freed until all IO processing has been completed.

This means it does not matter what the buffer owner does after
xfs_buf_iorequest() - even unconditionally calling xfs_buf_relse()
will not result in use-after-free as the b_hold count will not go to
zero until the IO completion processing has been finalised.
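
To make the reference counting argument concrete, here is a minimal
user-space sketch in C. It is not kernel code: struct demo_buf and all
of the demo_* helpers are hypothetical stand-ins for struct xfs_buf,
xfs_buf_hold(), xfs_buf_rele(), xfs_buf_iorequest() and
xfs_buf_iodone_work(), and the race is only simulated by running the
completion step after the owner has already dropped its reference.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct xfs_buf; only what the sketch needs. */
struct demo_buf {
	int	hold;	/* reference count, plays the role of b_hold */
	int	error;	/* plays the role of b_error */
	bool	async;	/* plays the role of XBF_ASYNC in b_flags */
	bool	*freed;	/* set to true when the last reference is dropped */
};

static struct demo_buf *demo_buf_alloc(bool async, bool *freed)
{
	struct demo_buf *bp = calloc(1, sizeof(*bp));

	if (!bp)
		return NULL;
	bp->hold = 1;		/* the caller's reference */
	bp->async = async;
	bp->freed = freed;
	*freed = false;
	return bp;
}

static void demo_buf_hold(struct demo_buf *bp)
{
	bp->hold++;
}

static void demo_buf_rele(struct demo_buf *bp)
{
	assert(bp->hold > 0);
	if (--bp->hold == 0) {
		*bp->freed = true;
		free(bp);
	}
}

/* Submission: sync IO takes its own reference, async IO inherits the caller's. */
static void demo_iorequest(struct demo_buf *bp, int error)
{
	if (!bp->async)
		demo_buf_hold(bp);
	bp->error = error;	/* error is visible before completion has run */
}

/* Deferred completion: drops the reference the IO was carrying. */
static void demo_iodone_work(struct demo_buf *bp)
{
	/* a real implementation would wake the waiter here for sync IO */
	demo_buf_rele(bp);
}

/* Owner sees the error and releases before completion processing runs. */
static bool demo_sync_error_path(void)
{
	bool freed;
	struct demo_buf *bp = demo_buf_alloc(false, &freed);
	bool alive_at_completion;

	if (!bp)
		return false;
	demo_iorequest(bp, -5);		/* -5 is just a stand-in error value */
	if (bp->error)
		demo_buf_rele(bp);	/* owner drops its reference early */
	alive_at_completion = !freed;	/* IO reference keeps the buffer alive */
	demo_iodone_work(bp);		/* last reference goes away here */
	return alive_at_completion && freed;
}

/* Async IO: the submitter's reference travels with the IO. */
static bool demo_async_path(void)
{
	bool freed;
	struct demo_buf *bp = demo_buf_alloc(true, &freed);
	bool alive_at_completion;

	if (!bp)
		return false;
	demo_iorequest(bp, 0);
	alive_at_completion = !freed;
	demo_iodone_work(bp);
	return alive_at_completion && freed;
}
```

In this sketch both demo_sync_error_path() and demo_async_path() return
true: the buffer is still allocated when the completion step runs and
is freed only once that step drops the final reference. Without the
demo_buf_hold() call in demo_iorequest(), the owner's early release in
the synchronous case would free the buffer before completion touched it.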

Fixing the rest of the mess (i.e. determining how to deal with
submission/completion races) is going to require more effort and
thought. For the moment, though, correctly reference counting
buffers will solve the use-after-free without changing any
other behaviour.

> From what I can see, all it really guarantees is that the submission has
> either passed/failed the write verifier, yes?

No.  It can also mean it wasn't rejected by the lower layers as
they process the bio passed by submit_bio(). e.g.  ENODEV, because
the underlying device has been hot-unplugged, EIO because the
buffer is beyond the end of the device, etc.

> > > It looks like submit_bio() manages this by providing the error through
> > > the callback (always). It also doesn't look like submission path is
> > > guaranteed to be synchronous either (consider md, which appears to use
> > > workqueues and kernel threads), so I'm not sure that '...;
> > > xfs_buf_iorequest(bp); if (bp->b_error)' is really safe anywhere unless
> > > you're explicitly looking for a write verifier error or something and
> > > do nothing further on the buf contingent on completion (e.g., freeing it
> > > or something it depends on).
> > 
> > My point remains that it *should be safe*, and the intent is that
> > the caller should be able to check for submission errors without
> > being exposed to a use after free situation. That's the bug we need
> > to fix, not say "you can't check for submission errors on
> > synchronous IO" to avoid the race condition.....
> > 
> 
> Well, technically you can check for submission errors on sync I/O, just
> use the code you posted above. :) What we can't currently do is find out
> when the I/O subsystem is done with the buffer.

By definition, processing of a buffer that is marked with an error
after submission is complete. It should not need to be waited on, and
therein lies the bug.

> Perhaps the point here is around the semantics of xfs_buf_iowait(). With
> a mechanism that is fundamentally async, the sync variant obviously
> becomes the async mechanism + some kind of synchronization. I'd expect
> that synchronization to not necessarily just tell me whether an error
> occurred, but also tell me when the I/O subsystem is done with the
> object I've passed (e.g., so I'm free to chuck it, scribble over it, put
> it back where I got it, whatever).
>
> My impression is that's the purpose of the b_iowait mechanism.
> Otherwise, what's the point of the whole
> bio_end_io->buf_ioend->b_iodone->buf_ioend round trip dance?

Yes, that's exactly what xfs_buf_iorequest/xfs_buf_iowait() provides
and the b_error indication is an integral part of that
synchronisation mechanism.  Unfortunately, that is also the part of
the mechanism that is racy and causing problems.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
