From: Brian Foster <bfoster@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Alex Lyakas <alex@zadarastorage.com>, xfs@oss.sgi.com
Subject: Re: use-after-free on log replay failure
Date: Wed, 13 Aug 2014 08:59:32 -0400 [thread overview]
Message-ID: <20140813125932.GA4605@bfoster.bfoster> (raw)
In-Reply-To: <20140812235615.GB20518@dastard>
On Wed, Aug 13, 2014 at 09:56:15AM +1000, Dave Chinner wrote:
> On Tue, Aug 12, 2014 at 03:39:02PM +0300, Alex Lyakas wrote:
> > Hello Dave, Brian,
> > I will describe a generic reproduction that you ask for.
> >
> > It was performed on pristine XFS code from 3.8.13, taken from here:
> > git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
> ....
> > I mounted XFS with the following options:
> > rw,sync,noatime,wsync,attr2,inode64,noquota 0 0
> >
> > I started a couple of processes writing files sequentially onto this
> > mount point, and after few seconds crashed the VM.
> > When the VM came up, I took the metadump file and placed it in:
> > https://drive.google.com/file/d/0ByBy89zr3kJNa0ZpdmZFS242RVU/edit?usp=sharing
> >
> > Then I set up the following Device Mapper target onto /dev/vde:
> > dmsetup create VDE --table "0 41943040 linear-custom /dev/vde 0"
> > I am attaching the code (and Makefile) of dm-linear-custom target.
> > It is exact copy of dm-linear, except that it has a module
> > parameter. With the parameter set to 0, this is an identity mapping
> > onto /dev/vde. If the parameter is set to non-0, all WRITE bios are
> > failed with ENOSPC. There is a workqueue to fail them in a different
> > context (not sure if really needed, but that's what our "real"
> > custom
> > block device does).
>
> Well, they you go. That explains it - an asynchronous dispatch error
> happening fast enough to race with the synchronous XFS dispatch
> processing.
>
> dispatch thread device workqueue
> xfs_buf_hold();
> atomic_set(b_io_remaining, 1)
> atomic_inc(b_io_remaining)
> submit_bio(bio)
> queue_work(bio)
> xfs_buf_ioend(bp, ....);
> atomic_dec(b_io_remaining)
> xfs_buf_rele()
> bio error set to ENOSPC
> bio->end_io()
> xfs_buf_bio_endio()
> bp->b_error = ENOSPC
> _xfs_buf_ioend(bp, 1);
> atomic_dec(b_io_remaining)
> xfs_buf_ioend(bp, 1);
> queue_work(bp)
> xfs_buf_iowait()
> if (bp->b_error) return error;
> if (error)
> xfs_buf_relse()
> xfs_buf_rele()
> xfs_buf_free()
>
> And now we have a freed buffer that is queued on the io completion
> queue. Basically, it requires the buffer error to be set
> asynchronously *between* the dispatch decrementing it's I/O count
> after dispatch, but before we wait on the IO.
>
That's basically the theory I wanted to test with the experimental
patch. E.g., the error check races with the iodone workqueue item.
> Not sure what the right fix is yet - removing the bp->b_error check
> from xfs_buf_iowait() doesn't solve the problem - it just prevents
> this code path from being tripped over by the race condition.
>
Perhaps I'm missing some context... I don't follow how removing the
error check doesn't solve the problem. It clearly closes the race and
perhaps there are other means of doing the same thing, but what part of
the problem does that leave unresolved? E.g., we provide a
synchronization mechanism for an async submission path and an object
(xfs_buf) that is involved with potentially multiple such async (I/O)
operations. The async callback side manages the counts of outstanding
bios etc. to set the state of the buf object correctly and fires a
completion when everything is done. The calling side simply waits on the
completion before it can analyze state of the object. Referring to
anything inside that object that happens to be managed by the buffer I/O
mechanism before the buffer is considered complete just seems generally
racy.
It looks like submit_bio() manages this by providing the error through
the callback (always). It also doesn't look like submission path is
guaranteed to be synchronous either (consider md, which appears to use
workqueues and kernel threads)), so I'm not sure that '...;
xfs_buf_iorequest(bp); if (bp->b_error)' is really safe anywhere unless
you're explicitly looking for a write verifier error or something and
do nothing further on the buf contingent on completion (e.g., freeing it
or something it depends on).
Brian
> But, just to validate this is the problem, you should be able to
> reproduce this on a 3.16 kernel. Can you try that, Alex?
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
next prev parent reply other threads:[~2014-08-13 12:59 UTC|newest]
Thread overview: 47+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-12-18 18:37 Questions about XFS discard and xfs_free_extent() code (newbie) Alex Lyakas
2013-12-18 23:06 ` Dave Chinner
2013-12-19 9:24 ` Alex Lyakas
2013-12-19 10:55 ` Dave Chinner
2013-12-19 19:24 ` Alex Lyakas
2013-12-21 17:03 ` Chris Murphy
2013-12-24 18:21 ` Alex Lyakas
2013-12-26 23:00 ` Dave Chinner
2014-01-08 18:13 ` Alex Lyakas
2014-01-13 3:02 ` Dave Chinner
2014-01-13 17:44 ` Alex Lyakas
2014-01-13 20:43 ` Dave Chinner
2014-01-14 13:48 ` Alex Lyakas
2014-01-15 1:45 ` Dave Chinner
2014-01-19 9:38 ` Alex Lyakas
2014-01-19 23:17 ` Dave Chinner
2014-07-01 15:06 ` xfs_growfs_data_private memory leak Alex Lyakas
2014-07-01 21:56 ` Dave Chinner
2014-07-02 12:27 ` Alex Lyakas
2014-08-04 18:15 ` Eric Sandeen
2014-08-06 8:56 ` Alex Lyakas
2014-08-04 11:00 ` use-after-free on log replay failure Alex Lyakas
2014-08-04 14:12 ` Brian Foster
2014-08-04 23:07 ` Dave Chinner
2014-08-06 10:05 ` Alex Lyakas
2014-08-06 12:32 ` Dave Chinner
2014-08-06 14:43 ` Alex Lyakas
2014-08-10 16:26 ` Alex Lyakas
2014-08-06 12:52 ` Alex Lyakas
2014-08-06 15:20 ` Brian Foster
2014-08-06 15:28 ` Alex Lyakas
2014-08-10 12:20 ` Alex Lyakas
2014-08-11 13:20 ` Brian Foster
2014-08-11 21:52 ` Dave Chinner
2014-08-12 12:03 ` Brian Foster
2014-08-12 12:39 ` Alex Lyakas
2014-08-12 19:31 ` Brian Foster
2014-08-12 23:56 ` Dave Chinner
2014-08-13 12:59 ` Brian Foster [this message]
2014-08-13 20:59 ` Dave Chinner
2014-08-13 23:21 ` Brian Foster
2014-08-14 6:14 ` Dave Chinner
2014-08-14 19:05 ` Brian Foster
2014-08-14 22:27 ` Dave Chinner
2014-08-13 17:07 ` Alex Lyakas
2014-08-13 0:03 ` Dave Chinner
2014-08-13 13:11 ` Brian Foster
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20140813125932.GA4605@bfoster.bfoster \
--to=bfoster@redhat.com \
--cc=alex@zadarastorage.com \
--cc=david@fromorbit.com \
--cc=xfs@oss.sgi.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).