From: Brian Foster <bfoster@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Alex Lyakas <alex@zadarastorage.com>, xfs@oss.sgi.com
Subject: Re: use-after-free on log replay failure
Date: Tue, 12 Aug 2014 08:03:41 -0400
Message-ID: <20140812120341.GA46654@bfoster.bfoster>
In-Reply-To: <20140811215207.GS20518@dastard>
On Tue, Aug 12, 2014 at 07:52:07AM +1000, Dave Chinner wrote:
> On Mon, Aug 11, 2014 at 09:20:57AM -0400, Brian Foster wrote:
> > On Sun, Aug 10, 2014 at 03:20:50PM +0300, Alex Lyakas wrote:
> > > On Wed, Aug 6, 2014 at 6:20 PM, Brian Foster <bfoster@redhat.com> wrote:
> > > > On Wed, Aug 06, 2014 at 03:52:03PM +0300, Alex Lyakas wrote:
> .....
> > > >> But I believe my analysis shows that during the mount sequence XFS does not
> > > >> wait properly for all the bios to complete, before failing the mount
> > > >> sequence back to the caller.
> > > >>
> > > >
> > > > As an experiment, what about the following? Compile tested only and not
> > > > safe for general use.
> ...
> > > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > > > index cd7b8ca..fbcf524 100644
> > > > --- a/fs/xfs/xfs_buf.c
> > > > +++ b/fs/xfs/xfs_buf.c
> > > > @@ -1409,19 +1409,27 @@ xfs_buf_iorequest(
> > > > * case nothing will ever complete. It returns the I/O error code, if any, or
> > > > * 0 if there was no error.
> > > > */
> > > > -int
> > > > -xfs_buf_iowait(
> > > > - xfs_buf_t *bp)
> > > > +static int
> > > > +__xfs_buf_iowait(
> > > > + struct xfs_buf *bp,
> > > > + bool skip_error)
> > > > {
> > > > trace_xfs_buf_iowait(bp, _RET_IP_);
> > > >
> > > > - if (!bp->b_error)
> > > > + if (skip_error || !bp->b_error)
> > > > wait_for_completion(&bp->b_iowait);
> > > >
> > > > trace_xfs_buf_iowait_done(bp, _RET_IP_);
> > > > return bp->b_error;
> > > > }
> > > >
> > > > +int
> > > > +xfs_buf_iowait(
> > > > + struct xfs_buf *bp)
> > > > +{
> > > > + return __xfs_buf_iowait(bp, false);
> > > > +}
> > > > +
> > > > xfs_caddr_t
> > > > xfs_buf_offset(
> > > > xfs_buf_t *bp,
> > > > @@ -1866,7 +1874,7 @@ xfs_buf_delwri_submit(
> > > > bp = list_first_entry(&io_list, struct xfs_buf, b_list);
> > > >
> > > > list_del_init(&bp->b_list);
> > > > - error2 = xfs_buf_iowait(bp);
> > > > + error2 = __xfs_buf_iowait(bp, true);
> > > > xfs_buf_relse(bp);
> > > > if (!error)
> > > > error = error2;
>
> Not waiting here on buffer error should not matter. Any buffer that
> is under IO and requires completion should be referenced, and that
> means it should be caught and waited on by xfs_wait_buftarg() in the
> mount failure path after log recovery fails.
>
I think that assumes the I/O is successful. Looking through
xlog_recover_buffer_pass2() as an example, we read the buffer, which
should return with b_hold == 1. The delwri queue increments the hold and
we xfs_buf_relse() in the return path (i.e., the buffer is now held by
the delwri queue awaiting submission).

Some time later we delwri_submit()... xfs_buf_iorequest() does an
xfs_buf_hold() and xfs_buf_rele() within that single function.
delwri_submit() releases its hold after xfs_buf_iowait(), at which point
I guess bp should go onto the LRU (b_hold back to 1 in xfs_buf_rele()).
Indeed, the caller has lost scope of the buffer at this point.
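
To make the reference counting concrete, here's roughly how I read the
hold count through that path (3.8-era code, from memory, so the exact
entry points and counts may be off):

    xfs_buf_read()                    b_hold == 1 (caller)
    xfs_buf_delwri_queue()            b_hold == 2 (queue takes a hold)
    xfs_buf_relse()                   b_hold == 1 (delwri queue only)
    ...
    xfs_buf_delwri_submit()
        xfs_buf_iorequest()           transient hold/rele internally
        xfs_buf_iowait()              <- the only synchronization point
        xfs_buf_relse()               count drops, LRU re-takes a hold:
                                      b_hold == 1 (LRU only)
    xfs_buftarg_wait_rele()           disposes of buffers with b_hold == 1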
So unless I'm missing something or have the lifecycle wrong here, which
is easily possible ;), this all hinges on xfs_buf_iowait(). That's where
the last hold forcing the buffer to stay around goes away.
xfs_buftarg_wait_rele() will dispose of the buffer if b_hold == 1. If
xfs_buf_iowait() is racy in the event of I/O errors via the bio
callback, I think this path is susceptible just the same.
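
If I have that right, the window I'm worried about looks something like
this (hand-drawn against my reading of the 3.8-era code, so take it
with a grain of salt):

    mount/recovery thread             bio completion side
    ---------------------             --------------------
    xfs_buf_iorequest(bp)
                                      I/O fails:
                                          xfs_buf_ioerror(bp, error)
                                          (b_error now set)
    xfs_buf_iowait(bp)
        sees bp->b_error != 0,
        skips wait_for_completion()
    xfs_buf_relse(bp)
    xfs_buftarg_wait_rele() finds
    b_hold == 1, disposes of bp
                                      iodone processing (wq or bio
                                      context) completes b_iowait on
                                      a freed buffer: use-after-free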
> > > > ---
> > > I think that this patch fixes the problem. I tried reproducing it like
> > > 30 times, and it doesn't happen with this patch. Dropping this patch
> > > reproduces the problem within 1 or 2 tries. Thanks!
> > > What are next steps? How to make it "safe for general use"?
> > >
> >
> > Ok, thanks for testing. I think that implicates the caller bypassing the
> > expected blocking, given the right sequence of log recovery I/Os and
> > device failure. TBH, I'd still like to see the specifics, if possible.
> > Could you come up with a generic reproducer for this? I think a metadump
> > of the fs with the dirty log plus whatever device failure simulation
> > hack you're using would suffice.
>
> The real issue is we don't know exactly what code is being tested
> (it's 3.8 + random bug fix backports + custom code). Even if we have
> a reproducer there's no guarantee it will reproduce on a current
> kernel. IOWs, we are stumbling around in the dark bashing our heads
> against everything in the room, and that just wastes everyone's
> time.
>
> We need a reproducer that triggers on a current, unmodified
> kernel release. You can use dm-faulty to error out all writes just
> like you are doing with your custom code. See
> xfstests::tests/generic/321 and common/dmflakey for how to do this.
> Ideally the reproducer is in a form that xfstests can use....
>
> If you can't reproduce it on an upstream kernel, then git bisect is
> your friend. It will find the commit that fixed the problem you are
> seeing....
>
Ugh, yeah. The fact that this code was customized as such apparently
went over my head. I agree completely. This needs a generic reproducer
against a pristine, preferably current, kernel. The experiment patch
could be papering over something completely different.
> > The ideal fix is not yet clear to me.
>
> We are not even that far along - the root cause of the bug is not at
> all clear to me. :/
>
Yeah.. the above was just the theory that motivated the experiment in
the previously posted patch. It of course remains a theory until we can
see the race in action. I was referring to the potential fix for the
raciness of xfs_buf_iowait() with regard to bio errors and the wq iodone
handling, while still asking for a reproducer to confirm the actual
problem. FWIW, I'm not too high on changes in the buf management code,
even a smallish behavior change, without a real trace of some sort that
documents the problem and justifies the change.
Brian
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com