From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Jeff Moyer <jmoyer@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>,
Christoph Hellwig <hch@infradead.org>,
xfs <linux-xfs@vger.kernel.org>
Subject: Re: Broken dio refcounting leads to livelock?
Date: Tue, 8 Jan 2019 13:07:16 -0800
Message-ID: <20190108210716.GV12689@magnolia>
In-Reply-To: <x49tvii7tvk.fsf@segfault.boston.devel.redhat.com>
On Tue, Jan 08, 2019 at 03:54:23PM -0500, Jeff Moyer wrote:
> "Darrick J. Wong" <darrick.wong@oracle.com> writes:
>
> > It looks like we're issuing two bios to satisfy a particular dio.
> > However, the first bio completes before the submitter calls finish_plug??
> >
> > I guess this means that blk_start_plug no longer plugs up io requests,
> > which means that the end_io function tries to wake up the submitter
> > before it's even had a chance to issue the second bio.
>
> blk_start_plug was always a hint. If you exceed a certain number of
> requests (BLK_MAX_REQUEST_COUNT, which is 16) or a certain size of
> request (BLK_PLUG_FLUSH_SIZE, which is 128k), the block layer will flush
> the plug list.
>
> > This is surprising to me, because I was under the impression that
> > blk_{start,finish}_plug held all the bios so there was no chance that
> > any of them would issue (let alone call end_io) before finish_plug.
>
> Definitely not the case. The plug list is just a performance
> optimization.
Ahh, fun! I had not realized that it was merely a suggestion.
$ git grep blk_start_plug Documentation/
$ echo $?
1
In that case we definitely need something to prevent the endio from
trying to wake up a submitter that's still submitting. Below is the
amended patch I'm using to run generic/013 and generic/323 in a loop.
No crashes so far.
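Roughly, the protocol the reworked patch implements can be modelled in
userspace like so -- made-up names throughout, with pthreads standing in
for bios and sched_yield() standing in for io_schedule()/blk_poll(); a
sketch of the idea, not the kernel code:

#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define NR_BIOS 2

struct dio_model {
	atomic_int ref;
	_Atomic bool waiter_present;	/* stands in for dio->submit.waiter */
};

/* end_io: wake the waiter only when ours is the last IO reference. */
static void *endio(void *arg)
{
	struct dio_model *dio = arg;

	if (atomic_load(&dio->ref) == 2)
		atomic_store(&dio->waiter_present, false);	/* + wakeup */
	atomic_fetch_sub(&dio->ref, 1);
	return NULL;
}

int main(void)
{
	struct dio_model dio = { .ref = 1, .waiter_present = true };
	pthread_t bio[NR_BIOS];

	atomic_fetch_add(&dio.ref, 1);		/* held across the plug window */
	for (int i = 0; i < NR_BIOS; i++) {
		atomic_fetch_add(&dio.ref, 1);	/* one reference per bio */
		pthread_create(&bio[i], NULL, endio, &dio);
		/* a bio may complete right here, before the next submit */
	}
	atomic_fetch_sub(&dio.ref, 1);		/* drop the plug-window ref */

	/* sync wait loop: stay until only our reference remains */
	while (atomic_load(&dio.ref) > 1) {
		if (!atomic_load(&dio.waiter_present))
			break;
		sched_yield();			/* io_schedule()/blk_poll() */
	}
	atomic_fetch_sub(&dio.ref, 1);		/* our ref; completion follows */

	for (int i = 0; i < NR_BIOS; i++)
		pthread_join(bio[i], NULL);
	printf("final refcount: %d\n", atomic_load(&dio.ref));
	return 0;
}

(Builds with plain gcc -pthread.)  The point is just that the extra
reference held across the plug window keeps an early end_io from ever
seeing the "two references left" state while we're still submitting, and
the ref > 1 check in the wait loop covers the case where the last end_io
already ran before we dropped that extra reference.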
--D
From: Dave Chinner <dchinner@redhat.com>
Subject: [PATCH] iomap: fix iomap dio reference counts
The iomap dio structure could be referenced after contexts had
dropped their reference, resulting in use-after-free conditions.
Rework the reference counting to ensure that we never access the
dio structure without holding a valid reference, or without having
guaranteed that the context holding the last reference will free it.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
[darrick: elevate dio refcount during block plug]
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
fs/iomap.c | 82 +++++++++++++++++++++++++++++++++++++++++++++---------------
1 file changed, 62 insertions(+), 20 deletions(-)
diff --git a/fs/iomap.c b/fs/iomap.c
index a3088fae567b..8e7eea407a0a 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -1539,12 +1539,31 @@ static void iomap_dio_bio_end_io(struct bio *bio)
if (bio->bi_status)
iomap_dio_set_error(dio, blk_status_to_errno(bio->bi_status));
- if (atomic_dec_and_test(&dio->ref)) {
- if (dio->wait_for_completion) {
+ /*
+ * If we have a waiter, then this is a sync IO and the waiter will still
+ * hold a reference to the dio. If this is the last IO reference, we
+ * cannot reference the dio after we drop the reference as the waiter
+ * may be woken immediately and free the dio before we are done with it.
+ * Hence we check for two references, do the wakeup, then drop the final
+ * IO reference. The ordering of clearing the submit.waiter, waking the
+ * waiter and then dropping the reference matches the order of checks in
+ * the wait loop to avoid wakeup races.
+ */
+ if (dio->wait_for_completion) {
+ if (atomic_read(&dio->ref) == 2) {
struct task_struct *waiter = dio->submit.waiter;
WRITE_ONCE(dio->submit.waiter, NULL);
blk_wake_io_task(waiter);
- } else if (dio->flags & IOMAP_DIO_WRITE) {
+ }
+ atomic_dec(&dio->ref);
+ } else if (atomic_dec_and_test(&dio->ref)) {
+ /*
+ * There is no waiter so we are finishing async IO. If the
+ * refcount drops to zero then we are responsible for running
+ * the appropriate IO completion to finish and free the dio
+ * context.
+ */
+ if (dio->flags & IOMAP_DIO_WRITE) {
struct inode *inode = file_inode(dio->iocb->ki_filp);
INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
@@ -1553,6 +1572,7 @@ static void iomap_dio_bio_end_io(struct bio *bio)
iomap_dio_complete_work(&dio->aio.work);
}
}
+ /* not safe to use dio past this point */
if (should_dirty) {
bio_check_pages_dirty(bio);
@@ -1887,6 +1907,12 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
inode_dio_begin(inode);
+ /*
+ * A plug does not prevent bios from being issued, so elevate
+ * the refcount on the dio so that a fast end_io can't try to wake us
+ * up before we're even done issuing IO.
+ */
+ atomic_inc(&dio->ref);
blk_start_plug(&plug);
do {
ret = iomap_apply(inode, pos, count, flags, ops, dio,
@@ -1905,6 +1931,7 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
break;
} while ((count = iov_iter_count(iter)) > 0);
blk_finish_plug(&plug);
+ atomic_dec(&dio->ref);
if (ret < 0)
iomap_dio_set_error(dio, ret);
@@ -1916,27 +1943,42 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
if (dio->flags & IOMAP_DIO_WRITE_FUA)
dio->flags &= ~IOMAP_DIO_NEED_SYNC;
- if (!atomic_dec_and_test(&dio->ref)) {
- if (!dio->wait_for_completion)
- return -EIOCBQUEUED;
+ /*
+ * If this is async IO and we drop the last reference here, we need
+ * to complete the IO, otherwise we return EIOCBQUEUED. The order of
+ * checks is important here because we can be racing with Io completion
+ * to drop the last reference and free the dio, so we must check the dio
+ * state before we drop the reference to avoid use-after-free
+ * situations.
+ */
+ if (!dio->wait_for_completion) {
+ if (atomic_dec_and_test(&dio->ref))
+ return iomap_dio_complete(dio);
+ return -EIOCBQUEUED;
+ }
- for (;;) {
- set_current_state(TASK_UNINTERRUPTIBLE);
- if (!READ_ONCE(dio->submit.waiter))
- break;
+ /*
+ * This is sync IO and we have to access the dio structure to determine
+ * if we need to wait and/or poll the block device for completion. Hence
+ * we need to hold on to our reference until IO completion has dropped
+ * all the IO references and woken us before we drop our reference and
+ * complete the IO.
+ */
+ while (atomic_read(&dio->ref) > 1) {
+ set_current_state(TASK_UNINTERRUPTIBLE);
+ if (!READ_ONCE(dio->submit.waiter))
+ break;
- if (!(iocb->ki_flags & IOCB_HIPRI) ||
- !dio->submit.last_queue ||
- !blk_poll(dio->submit.last_queue,
- dio->submit.cookie, true))
- io_schedule();
- }
- __set_current_state(TASK_RUNNING);
+ if (!(iocb->ki_flags & IOCB_HIPRI) ||
+ !dio->submit.last_queue ||
+ !blk_poll(dio->submit.last_queue,
+ dio->submit.cookie, true))
+ io_schedule();
}
+ __set_current_state(TASK_RUNNING);
+ atomic_dec(&dio->ref);
- ret = iomap_dio_complete(dio);
-
- return ret;
+ return iomap_dio_complete(dio);
out_free_dio:
kfree(dio);
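One more toy sketch, this time of the async side of the rework, since
the ordering comment above is easy to gloss over: the submitter has to
read wait_for_completion while it still holds a reference, because the
instant that reference is dropped the end_io side may already have
completed and freed the dio.  Again, made-up names; EIOCBQUEUED is
defined locally only as a stand-in for the kernel-internal value:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define EIOCBQUEUED 529		/* stand-in for the kernel-internal value */

struct dio_model {
	atomic_int ref;
	bool wait_for_completion;
};

static int dio_complete_model(struct dio_model *dio)
{
	free(dio);		/* completion frees the dio */
	return 0;
}

/* Tail of the submit path, once all bios are issued and the plug ref is dropped. */
static int submit_tail(struct dio_model *dio)
{
	/* Read dio state while our reference still pins the structure. */
	bool sync = dio->wait_for_completion;

	if (!sync) {
		/* If ours was the last reference, completion falls to us... */
		if (atomic_fetch_sub(&dio->ref, 1) == 1)
			return dio_complete_model(dio);
		/* ...otherwise the dio may already be gone: don't touch it. */
		return -EIOCBQUEUED;
	}
	/* sync case: wait as in the earlier model until ref == 1, then: */
	atomic_fetch_sub(&dio->ref, 1);
	return dio_complete_model(dio);
}

int main(void)
{
	struct dio_model *dio = malloc(sizeof(*dio));

	if (!dio)
		return 1;
	atomic_init(&dio->ref, 1);		/* only the submitter's ref left */
	dio->wait_for_completion = false;	/* model the async case */
	printf("submit_tail returned %d\n", submit_tail(dio));
	return 0;
}

Reordering those two steps -- dropping the reference first and only then
looking at dio->wait_for_completion -- is exactly the use-after-free the
comment in the patch warns about.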