* Broken dio refcounting leads to livelock?
@ 2019-01-08 0:26 Darrick J. Wong
2019-01-08 1:16 ` Darrick J. Wong
2019-01-08 7:46 ` Dave Chinner
0 siblings, 2 replies; 9+ messages in thread
From: Darrick J. Wong @ 2019-01-08 0:26 UTC (permalink / raw)
To: Christoph Hellwig, Dave Chinner; +Cc: xfs
Hi Christoph,
I have a question about refcounting in struct iomap_dio:
As I mentioned late last year, I've been seeing a livelock since the
early 4.20-rcX era in iomap_dio_rw when running generic/323. The
relevant part of the function starts around line 1910:
	blk_finish_plug(&plug);

	if (ret < 0)
		iomap_dio_set_error(dio, ret);

	/*
	 * If all the writes we issued were FUA, we don't need to flush the
	 * cache on IO completion. Clear the sync flag for this case.
	 */
	if (dio->flags & IOMAP_DIO_WRITE_FUA)
		dio->flags &= ~IOMAP_DIO_NEED_SYNC;

	if (!atomic_dec_and_test(&dio->ref)) {
		int clang = 0;

		if (!dio->wait_for_completion)
			return -EIOCBQUEUED;

		for (;;) {
			set_current_state(TASK_UNINTERRUPTIBLE);
			if (!READ_ONCE(dio->submit.waiter))
				break;

			if (!(iocb->ki_flags & IOCB_HIPRI) ||
			    !dio->submit.last_queue ||
			    !blk_poll(dio->submit.last_queue,
					 dio->submit.cookie, true))
				io_schedule();   <--------- here
		}
		__set_current_state(TASK_RUNNING);
	}

	ret = iomap_dio_complete(dio);
I finally had time to look into this, and I noticed garbage values for
dio->submit.waiter in *dio right before we call io_schedule. I suspect
this is a use-after-free, so I altered iomap_dio_complete to memset the
entire structure to 0x5C just prior to kfreeing the *dio, and sure
enough I saw 0x5C in all the dio fields right before the io_schedule
call.
Next I instrumented all the places where we access dio->ref to see
what's going on, and observed the following sequence:
1. ref == 2 just before the blk_finish_plug call.
dio->wait_for_completion == false.
2. ref == 2 just before the "if (!atomic_dec_and_test(...))"
3. ref == 1 just after the "if (!atomic_dec_and_test(...))"
Next we apparently end up in iomap_dio_bio_end_io:
4. ref == 1 just before the "if (atomic_dec_and_test(...))"
5. ref == 0 just after the "if (atomic_dec_and_test(...))"
6. iomap_dio_bio_end_io calls iomap_dio_complete, frees the dio after
poisoning it with 0x5C as described above.
Then we jump back to iomap_dio_rw, and:
7. ref = 0x5c5c5c5c just before the "if (!(iocb->ki_flags & IOCB_HIPRI)...)"
However, dio->wait_for_completion is 0x5c5c5c5c so we incorrectly
call io_schedule and never wake up.
KASAN confirms this diagnosis. I noticed that only ~100us elapse
between the unplug and the bio endio function being called; I've found
that I can reproduce this in a qemu vm with a virtio-scsi device backed
by a tmpfs file on the host.  My guess is that with a slower scsi
device the dispatch would take long enough that iomap_dio_rw would have
returned -EIOCBQUEUED long before the bio end_io function gets called
and frees the dio?
Anyhow, I'm not sure how to fix this. It's clear that iomap_dio_rw
can't drop its ref to the dio until it's done using the dio fields.
It's tempting to change the if (!atomic_dec_and_test()) to a if
(atomic_read() > 1) and only drop the ref just before returning
-EIOCBQUEUED or just before calling iomap_dio_complete... but that just
pushes the livelock to kill_ioctx.
Brain scrambled, kicking to the list to see if anyone has something
smarter to say. ;)
--D
==================================================================
BUG: KASAN: use-after-free in iomap_dio_rw+0x1197/0x1200
Read of size 1 at addr ffff88807780e0ec by task aio-last-ref-he/2527
CPU: 2 PID: 2527 Comm: aio-last-ref-he Not tainted 5.0.0-rc1-xfsx #rc1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-1ubuntu1 04/01/2014
Call Trace:
dump_stack+0x7c/0xc0
print_address_description+0x6c/0x23c
? iomap_dio_rw+0x1197/0x1200
kasan_report.cold.3+0x1c/0x3a
? iomap_dio_rw+0x1197/0x1200
? iomap_dio_rw+0x1197/0x1200
iomap_dio_rw+0x1197/0x1200
? iomap_seek_data+0x130/0x130
? lock_contended+0xd70/0xd70
? lock_acquire+0x102/0x2f0
? xfs_ilock+0x137/0x3e0 [xfs]
? xfs_file_dio_aio_read+0x10f/0x360 [xfs]
? down_read_nested+0x78/0xc0
? xfs_file_dio_aio_read+0x123/0x360 [xfs]
xfs_file_dio_aio_read+0x123/0x360 [xfs]
? __lock_acquire+0x645/0x44c0
xfs_file_read_iter+0x41a/0x7a0 [xfs]
? aio_setup_rw+0xb4/0x170
aio_read+0x29f/0x3d0
? mark_held_locks+0x130/0x130
? aio_prep_rw+0x960/0x960
? kvm_clock_read+0x14/0x30
? kvm_clock_read+0x14/0x30
? kvm_sched_clock_read+0x5/0x10
? sched_clock+0x5/0x10
? sched_clock_cpu+0x18/0x1b0
? find_held_lock+0x33/0x1c0
? lock_acquire+0x102/0x2f0
? lock_downgrade+0x590/0x590
io_submit_one+0xf4e/0x17e0
? __x64_sys_io_setup+0x2f0/0x2f0
? sched_clock_cpu+0x18/0x1b0
? find_held_lock+0x33/0x1c0
? lock_downgrade+0x590/0x590
? __x64_sys_io_submit+0x199/0x430
__x64_sys_io_submit+0x199/0x430
? __ia32_compat_sys_io_submit+0x410/0x410
? trace_hardirqs_on_thunk+0x1a/0x1c
? do_syscall_64+0x8f/0x3c0
do_syscall_64+0x8f/0x3c0
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7f3589847687
Code: 00 00 00 49 83 38 00 75 ed 49 83 78 08 00 75 e6 8b 47 0c 39 47 08 75 de 31 c0 c3 0f 1f 84 00 00 00 00 00 b8 d1 00 00 00 0f 05 <c3> 0f 1f 84 00 00 00 00 00 b8 d2 00 00 00 0f 05 c3 0f 1f 84 00 00
RSP: 002b:00007f356d7f9818 EFLAGS: 00000202 ORIG_RAX: 00000000000000d1
RAX: ffffffffffffffda RBX: 000000000000000b RCX: 00007f3589847687
RDX: 00007f356d7f98c0 RSI: 0000000000000001 RDI: 00007f3589b50000
RBP: 0000000000000018 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000202 R12: 00007f356d7f98c0
R13: 0000000000000002 R14: 00007f356d7f9d40 R15: 00000000000a0000
Allocated by task 2527:
kasan_kmalloc+0xc6/0xd0
iomap_dio_rw+0x261/0x1200
xfs_file_dio_aio_read+0x123/0x360 [xfs]
xfs_file_read_iter+0x41a/0x7a0 [xfs]
aio_read+0x29f/0x3d0
io_submit_one+0xf4e/0x17e0
__x64_sys_io_submit+0x199/0x430
do_syscall_64+0x8f/0x3c0
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Freed by task 2519:
__kasan_slab_free+0x133/0x180
kfree+0xe4/0x2d0
iomap_dio_complete+0x37b/0x6e0
iomap_dio_complete_work+0x52/0x80
iomap_dio_bio_end_io+0x50e/0x7a0
blk_update_request+0x234/0x8b0
scsi_end_request+0x7f/0x6b0
scsi_io_completion+0x18c/0x14a0
blk_done_softirq+0x258/0x390
__do_softirq+0x228/0x8de
The buggy address belongs to the object at ffff88807780e0c0
which belongs to the cache kmalloc-128 of size 128
The buggy address is located 44 bytes inside of
128-byte region [ffff88807780e0c0, ffff88807780e140)
The buggy address belongs to the page:
page:ffffea0001de0380 count:1 mapcount:0 mapping:ffff88802dc03500 index:0x0
flags: 0x4fff80000000200(slab)
raw: 04fff80000000200 dead000000000100 dead000000000200 ffff88802dc03500
raw: 0000000000000000 0000000080150015 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff88807780df80: fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc
ffff88807780e000: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff88807780e080: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
^
ffff88807780e100: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
ffff88807780e180: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
==================================================================
Disabling lock debugging due to kernel taint
^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Broken dio refcounting leads to livelock?
  2019-01-08  0:26 Broken dio refcounting leads to livelock? Darrick J. Wong
@ 2019-01-08  1:16 ` Darrick J. Wong
  2019-01-08  3:03   ` Darrick J. Wong
  2019-01-08  7:46 ` Dave Chinner
  1 sibling, 1 reply; 9+ messages in thread
From: Darrick J. Wong @ 2019-01-08 1:16 UTC (permalink / raw)
To: Christoph Hellwig, Dave Chinner; +Cc: xfs

On Mon, Jan 07, 2019 at 04:26:33PM -0800, Darrick J. Wong wrote:
> Hi Christoph,
>
> [... full quote of the original report snipped; see above ...]
>
> Anyhow, I'm not sure how to fix this.  It's clear that iomap_dio_rw
> can't drop its ref to the dio until it's done using the dio fields.
> It's tempting to change the if (!atomic_dec_and_test()) to an if
> (atomic_read() > 1) and only drop the ref just before returning
> -EIOCBQUEUED or just before calling iomap_dio_complete... but that just
> pushes the livelock to kill_ioctx.

No it doesn't, silly me just had debugging crapola all over iomap.c.

> Brain scrambled, kicking to the list to see if anyone has something
> smarter to say. ;)

I have a gross brainscrambled patch that makes the regression go away.
No idea what other subtle breakage this might cause...

--D

---
 fs/iomap.c |   22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/fs/iomap.c b/fs/iomap.c
index a3088fae567b..cdea2a83ef78 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -1535,16 +1535,18 @@ static void iomap_dio_bio_end_io(struct bio *bio)
 {
 	struct iomap_dio *dio = bio->bi_private;
 	bool should_dirty = (dio->flags & IOMAP_DIO_DIRTY);
+	int dio_ref = atomic_dec_return(&dio->ref);
 
 	if (bio->bi_status)
 		iomap_dio_set_error(dio, blk_status_to_errno(bio->bi_status));
 
-	if (atomic_dec_and_test(&dio->ref)) {
-		if (dio->wait_for_completion) {
-			struct task_struct *waiter = dio->submit.waiter;
-			WRITE_ONCE(dio->submit.waiter, NULL);
-			blk_wake_io_task(waiter);
-		} else if (dio->flags & IOMAP_DIO_WRITE) {
+	if (dio_ref == 1 && dio->wait_for_completion) {
+		struct task_struct *waiter = dio->submit.waiter;
+
+		WRITE_ONCE(dio->submit.waiter, NULL);
+		blk_wake_io_task(waiter);
+	} else if (dio_ref == 0) {
+		if (dio->flags & IOMAP_DIO_WRITE) {
 			struct inode *inode = file_inode(dio->iocb->ki_filp);
 
 			INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
@@ -1916,9 +1918,12 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	if (dio->flags & IOMAP_DIO_WRITE_FUA)
 		dio->flags &= ~IOMAP_DIO_NEED_SYNC;
 
-	if (!atomic_dec_and_test(&dio->ref)) {
-		if (!dio->wait_for_completion)
+	if (atomic_read(&dio->ref) > 1) {
+		if (!dio->wait_for_completion) {
+			if (atomic_dec_and_test(&dio->ref))
+				return iomap_dio_complete(dio);
 			return -EIOCBQUEUED;
+		}
 
 		for (;;) {
 			set_current_state(TASK_UNINTERRUPTIBLE);
@@ -1934,6 +1939,7 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		__set_current_state(TASK_RUNNING);
 	}
 
+	atomic_dec(&dio->ref);
 	ret = iomap_dio_complete(dio);
 
 	return ret;

^ permalink raw reply related	[flat|nested] 9+ messages in thread
* Re: Broken dio refcounting leads to livelock?
  2019-01-08  1:16 ` Darrick J. Wong
@ 2019-01-08  3:03   ` Darrick J. Wong
  0 siblings, 0 replies; 9+ messages in thread
From: Darrick J. Wong @ 2019-01-08 3:03 UTC (permalink / raw)
To: Christoph Hellwig, Dave Chinner; +Cc: xfs

On Mon, Jan 07, 2019 at 05:16:05PM -0800, Darrick J. Wong wrote:
> On Mon, Jan 07, 2019 at 04:26:33PM -0800, Darrick J. Wong wrote:
> > [... full quote of the original report snipped; see above ...]
>
> No it doesn't, silly me just had debugging crapola all over iomap.c.
>
> I have a gross brainscrambled patch that makes the regression go away.
> No idea what other subtle breakage this might cause...
>
> --D
>
> [... start of quoted patch snipped ...]
>
> +	if (dio_ref == 1 && dio->wait_for_completion) {
> +		struct task_struct *waiter = dio->submit.waiter;
> +
> +		WRITE_ONCE(dio->submit.waiter, NULL);
> +		blk_wake_io_task(waiter);
> +	} else if (dio_ref == 0) {
> +		if (dio->flags & IOMAP_DIO_WRITE) {

Heh, no, this doesn't actually fix the data race... :P

...I'm stopping for the night, maybe one of you have fresher eyes.

--D

> [... rest of quoted patch snipped ...]

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: Broken dio refcounting leads to livelock?
  2019-01-08  0:26 Broken dio refcounting leads to livelock? Darrick J. Wong
  2019-01-08  1:16 ` Darrick J. Wong
@ 2019-01-08  7:46 ` Dave Chinner
  2019-01-08 20:08   ` Darrick J. Wong
  1 sibling, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2019-01-08 7:46 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Christoph Hellwig, xfs

On Mon, Jan 07, 2019 at 04:26:33PM -0800, Darrick J. Wong wrote:
> Hi Christoph,
>
> I have a question about refcounting in struct iomap_dio:
>
> As I mentioned late last year, I've been seeing a livelock since the
> early 4.20-rcX era in iomap_dio_rw when running generic/323.  The
> relevant part of the function starts around line 1910:

Looking at iomap_dio_bio_end_io(), it has completion code that runs
only when dio->ref hits zero. i.e. it assumes that if it sees a zero
refcount, it holds the last reference and can free dio. Otherwise it
doesn't touch dio.

> 	blk_finish_plug(&plug);
>
> 	if (ret < 0)
> 		iomap_dio_set_error(dio, ret);
>
> 	/*
> 	 * If all the writes we issued were FUA, we don't need to flush the
> 	 * cache on IO completion. Clear the sync flag for this case.
> 	 */
> 	if (dio->flags & IOMAP_DIO_WRITE_FUA)
> 		dio->flags &= ~IOMAP_DIO_NEED_SYNC;
>
> 	if (!atomic_dec_and_test(&dio->ref)) {

So we've dropped the IO submission reference to the dio here, which
means the only remaining references are the bio references. We run
this code if there are remaining bio references, which given the
above means that dio can be freed at any time after this by IO
completion. That means it's never safe to reference dio after this.

Yup, that's a use after free.

> 		int clang = 0;
>
> 		if (!dio->wait_for_completion)
> 			return -EIOCBQUEUED;

Ok, dio->wait_for_completion is unchanging at this point, so it
doesn't matter if we do this check before or after we drop the
dio->ref. That simplifies things.

> 		for (;;) {
> 			set_current_state(TASK_UNINTERRUPTIBLE);
> 			if (!READ_ONCE(dio->submit.waiter))
> 				break;
>
> 			if (!(iocb->ki_flags & IOCB_HIPRI) ||
> 			    !dio->submit.last_queue ||
> 			    !blk_poll(dio->submit.last_queue,
> 					 dio->submit.cookie, true))
> 				io_schedule();   <--------- here
> 		}
> 		__set_current_state(TASK_RUNNING);

This is an IO wait, which means we do not want to free the dio
structure until after we've been woken. And that also means we
are going to be running iomap_dio_complete() below, so we /must/
hold on to the dio reference here. Which means we need to change
the iomap_dio_bio_end_io() code because it only triggers a wakeup
if it's dropping the last dio->ref.

OK, so something like the patch below?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

iomap: fix iomap dio reference counts

From: Dave Chinner <dchinner@redhat.com>

The iomap dio structure could be referenced after contexts had
dropped their reference, resulting in use-after-free conditions
being created. Rework the reference counting to ensure that we never
access the dio structure without holding a valid reference count or
having guaranteed that the context holding the last reference will
free it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/iomap.c |   75 +++++++++++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 55 insertions(+), 20 deletions(-)

diff --git a/fs/iomap.c b/fs/iomap.c
index a3088fae567b..ee71e2ec93d4 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -1539,12 +1539,31 @@ static void iomap_dio_bio_end_io(struct bio *bio)
 	if (bio->bi_status)
 		iomap_dio_set_error(dio, blk_status_to_errno(bio->bi_status));
 
-	if (atomic_dec_and_test(&dio->ref)) {
-		if (dio->wait_for_completion) {
+	/*
+	 * If we have a waiter, then this is a sync IO and the waiter will still
+	 * hold a reference to the dio. If this is the last IO reference, we
+	 * cannot reference the dio after we drop the reference as the waiter
+	 * may be woken immediately and free the dio before we are done with it.
+	 * Hence we check for two references, do the wakeup, then drop the final
+	 * IO reference. The ordering of clearing the submit.waiter, waking the
+	 * waiter and then dropping the reference matches the order of checks in
+	 * the wait loop to avoid wakeup races.
+	 */
+	if (dio->wait_for_completion) {
+		if (atomic_read(&dio->ref) == 2) {
 			struct task_struct *waiter = dio->submit.waiter;
 			WRITE_ONCE(dio->submit.waiter, NULL);
 			blk_wake_io_task(waiter);
-		} else if (dio->flags & IOMAP_DIO_WRITE) {
+		}
+		atomic_dec(&dio->ref);
+	} else if (atomic_dec_and_test(&dio->ref)) {
+		/*
+		 * There is no waiter so we are finishing async IO. If the
+		 * refcount drops to zero then we are responsible for running
+		 * the appropriate IO completion to finish and free the dio
+		 * context.
+		 */
+		if (dio->flags & IOMAP_DIO_WRITE) {
 			struct inode *inode = file_inode(dio->iocb->ki_filp);
 
 			INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
@@ -1553,6 +1572,7 @@ static void iomap_dio_bio_end_io(struct bio *bio)
 			iomap_dio_complete_work(&dio->aio.work);
 		}
 	}
+	/* not safe to use dio past this point */
 
 	if (should_dirty) {
 		bio_check_pages_dirty(bio);
@@ -1916,27 +1936,42 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	if (dio->flags & IOMAP_DIO_WRITE_FUA)
 		dio->flags &= ~IOMAP_DIO_NEED_SYNC;
 
-	if (!atomic_dec_and_test(&dio->ref)) {
-		if (!dio->wait_for_completion)
-			return -EIOCBQUEUED;
+	/*
+	 * If this is async IO, and we drop the last reference here we need
+	 * to complete the IO, otherwise we return EIOCBQUEUED. The order of
+	 * checks is important here because we can be racing with IO completion
+	 * to drop the last reference and free the dio, so we must check the dio
+	 * state before we drop the reference to avoid use-after-free
+	 * situations.
+	 */
+	if (!dio->wait_for_completion) {
+		if (atomic_dec_and_test(&dio->ref))
+			return iomap_dio_complete(dio);
+		return -EIOCBQUEUED;
+	}
 
-		for (;;) {
-			set_current_state(TASK_UNINTERRUPTIBLE);
-			if (!READ_ONCE(dio->submit.waiter))
-				break;
+	/*
+	 * This is sync IO and we have to access the dio structure to determine
+	 * if we need to wait and/or poll the block device for completion. Hence
+	 * we need to hold on to our reference until IO completion has dropped
+	 * all the IO references and woken us before we drop our reference and
+	 * complete the IO.
+	 */
+	while (atomic_read(&dio->ref) > 1) {
+		set_current_state(TASK_UNINTERRUPTIBLE);
+		if (!READ_ONCE(dio->submit.waiter))
+			break;
 
-			if (!(iocb->ki_flags & IOCB_HIPRI) ||
-			    !dio->submit.last_queue ||
-			    !blk_poll(dio->submit.last_queue,
-					 dio->submit.cookie, true))
-				io_schedule();
-		}
-		__set_current_state(TASK_RUNNING);
+		if (!(iocb->ki_flags & IOCB_HIPRI) ||
+		    !dio->submit.last_queue ||
+		    !blk_poll(dio->submit.last_queue,
+				 dio->submit.cookie, true))
+			io_schedule();
 	}
+	__set_current_state(TASK_RUNNING);
+	atomic_dec(&dio->ref);
 
-	ret = iomap_dio_complete(dio);
-
-	return ret;
+	return iomap_dio_complete(dio);
 
  out_free_dio:
 	kfree(dio);

^ permalink raw reply related	[flat|nested] 9+ messages in thread
* Re: Broken dio refcounting leads to livelock?
  2019-01-08  7:46 ` Dave Chinner
@ 2019-01-08 20:08   ` Darrick J. Wong
  2019-01-08 20:54     ` Jeff Moyer
  0 siblings, 1 reply; 9+ messages in thread
From: Darrick J. Wong @ 2019-01-08 20:08 UTC (permalink / raw)
To: Dave Chinner; +Cc: Christoph Hellwig, xfs

On Tue, Jan 08, 2019 at 06:46:44PM +1100, Dave Chinner wrote:
> On Mon, Jan 07, 2019 at 04:26:33PM -0800, Darrick J. Wong wrote:
> > [... original report snipped; quoted in full above ...]
>
> [... analysis and start of patch snipped; see above ...]
>
> +	if (dio->wait_for_completion) {
> +		if (atomic_read(&dio->ref) == 2) {
> 			struct task_struct *waiter = dio->submit.waiter;
> 			WRITE_ONCE(dio->submit.waiter, NULL);
> 			blk_wake_io_task(waiter);

This blows up because waiter == NULL, which implies that this is the
second time through this code block for a given dio.  That doesn't
sound right... (more after the traceback)

BUG: unable to handle kernel NULL pointer dereference at 0000000000000800
#PF error: [normal kernel read fault]
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 1 PID: 3438 Comm: fsstress Not tainted 5.0.0-rc1-xfsx #rc1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-1ubuntu1 04/01/2014
RIP: 0010:__lock_acquire+0x40b/0x15f0
Code: d2 48 8b bc 24 98 00 00 00 65 48 33 3c 25 28 00 00 00 44 89 d0 0f 85 83 0f 00 00 48 8d 65 d8 5b 41 5c 41 5d 41 5e 41 5f 5d c3 <49> 81 3b a0 63 70 82 41 b8 00 00 00 00 44 0f 45 c0 83 fe 01 0f 87
RSP: 0018:ffff88803ea03c80 EFLAGS: 00010002
RAX: 0000000000000001 RBX: 0000000000000800 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000001
RBP: ffff88803ea03d50 R08: 0000000000000001 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000800 R12: ffff88803cd18000
R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000000
FS:  00007f508846cb80(0000) GS:ffff88803ea00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000800 CR3: 000000002f9fa001 CR4: 00000000001606a0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <IRQ>
 ? __lock_is_held+0x51/0x90
 ? __lock_is_held+0x51/0x90
 ? kvm_clock_read+0x14/0x30
 ? __lock_acquire+0x26c/0x15f0
 ? kvm_clock_read+0x14/0x30
 lock_acquire+0x90/0x180
 ? try_to_wake_up+0x3f/0x560
 ? ring_buffer_unlock_commit+0x21/0x120
 _raw_spin_lock_irqsave+0x3e/0x80
 ? try_to_wake_up+0x3f/0x560
 try_to_wake_up+0x3f/0x560
 ? __trace_bprintk+0x6e/0x80
 iomap_dio_bio_end_io+0x1a6/0x1d0
 blk_update_request+0xd7/0x2f0
 scsi_end_request+0x32/0x3a0
 scsi_io_completion+0x79/0x610
 blk_done_softirq+0xb1/0xe0
 __do_softirq+0xe5/0x4b2
 ? kvm_sched_clock_read+0x5/0x10
 irq_exit+0xbc/0xe0
 call_function_single_interrupt+0xf/0x20
 </IRQ>
RIP: 0010:__list_del_entry_valid+0x3d/0x4c
Code: c8 0f 84 5a 00 00 00 48 b9 00 02 00 00 00 00 ad de 48 39 ca 0f 84 7d 00 00 00 48 8b 32 48 39 fe 0f 85 5d 00 00 00 48 8b 50 08 <48> 39 f2 0f 85 42 00 00 00 b8 01 00 00 00 c3 48 89 d1 48 c7 c7 e8
RSP: 0018:ffffc90005f2f9f0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff04
RAX: ffff88802f9da110 RBX: ffff888078313a00 RCX: dead000000000200
RDX: ffff88803dbd3eb0 RSI: ffff88803dbd3eb0 RDI: ffff88803dbd3eb0
RBP: ffff8880370c7800 R08: 0000000000003bac R09: 000000000000000c
R10: 0000000000000000 R11: ffff88803c586c00 R12: ffff888078942c00
R13: ffff88803dbd3eb0 R14: ffff88803c586c00 R15: ffff88803dbd3e40
 xfs_log_commit_cil+0x554/0x880 [xfs]
 ? xfs_trans_roll+0x7e/0x190 [xfs]
 __xfs_trans_commit+0xd3/0x630 [xfs]
 ? xfs_defer_trans_roll+0x16e/0x5b0 [xfs]
 ? xfs_defer_finish_noroll+0xf1/0x7e0 [xfs]
 xfs_trans_roll+0x7e/0x190 [xfs]
 xfs_defer_trans_roll+0x16e/0x5b0 [xfs]
 ? __xfs_trans_commit+0x1c3/0x630 [xfs]
 xfs_defer_finish_noroll+0xf1/0x7e0 [xfs]
 ? xfs_iomap_write_unwritten+0xf4/0x390 [xfs]
 __xfs_trans_commit+0x1c3/0x630 [xfs]
 xfs_iomap_write_unwritten+0xf4/0x390 [xfs]
 iomap_dio_complete+0x3c/0x130
 ? iomap_dio_rw+0x404/0x520
 iomap_dio_rw+0x3e7/0x520
 ? xfs_file_dio_aio_write+0x132/0x3d0 [xfs]
 xfs_file_dio_aio_write+0x132/0x3d0 [xfs]
 xfs_file_write_iter+0xf2/0x1d0 [xfs]
 __vfs_write+0x1a1/0x200
 vfs_write+0xba/0x1c0
 ksys_write+0x52/0xc0
 do_syscall_64+0x4a/0x150
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7f5087c3c281
Code: c3 0f 1f 84 00 00 00 00 00 48 8b 05 59 8d 20 00 c3 0f 1f 84 00 00 00 00 00 8b 05 8a d1 20 00 85 c0 75 16 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 57 f3 c3 0f 1f 44 00 00 41 54 55 49 89 d4 53
RSP: 002b:00007fff0879c828 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f5087c3c281
RDX: 0000000000019000 RSI: 00005585ac58a200 RDI: 0000000000000003
RBP: 00000000001dd000 R08: 0000000000000007 R09: 0000000000000000
R10: 00005585ac54a010 R11: 0000000000000246 R12: 0000000000000254
R13: 0000000000019000 R14: 00005585ac58a200 R15: 0000000000000000
Modules linked in: xfs libcrc32c bfq dax_pmem device_dax nd_pmem sch_fq_codel ip_tables x_tables nfsv4 af_packet

Oh sh*t, we're committing the unwritten extent conversion before the
disk write even finishes!

I added some tracepoints to see what's going on with the dio here:

The iomap_dio_bio_end_io event is right at the top of the function.
dio0 is right after we set dio->submit.waiter = current
dio1 is right before we call blk_finish_plug
dio2 is right before we check !dio->wait_for_completion
dio3 is just before the final iomap_dio_complete

fsstress-3438 1.... 5060291us : iomap_dio_rw: dio0 xffff88802a88cc00 ref 1 wfc 1 waiter xffff88803cd18000
fsstress-3437 1..s1 5061648us : iomap_dio_bio_end_io: dio xffff88802a88cc00 ref 2 wfc 1 waiter xffff88803cd18000
fsstress-3438 1.... 5062366us : iomap_dio_rw: dio1 xffff88802a88cc00 ref 3 wfc 1 waiter x0
fsstress-3438 1.... 5062388us : iomap_dio_rw: dio2 xffff88802a88cc00 ref 3 wfc 1 waiter x0
fsstress-3438 1.... 5062388us : iomap_dio_rw: dio3 xffff88802a88cc00 ref 3 wfc 1 waiter x0
fsstress-3438 1..s1 5062494us : iomap_dio_bio_end_io: dio xffff88802a88cc00 ref 2 wfc 1 waiter x0

It looks like we're issuing two bios to satisfy a particular dio.
However, the first bio completes before the submitter calls
finish_plug??  I guess this means that blk_start_plug no longer plugs
up io requests, which means that the end_io function tries to wake up
the submitter before it's even had a chance to issue the second bio.

This is surprising to me, because I was under the impression that
blk_{start,finish}_plug held all the bios so there was no chance that
any of them would issue (let alone call end_io) before finish_plug.

/sys/block/sda/queue/scheduler = [bfq] none
/sys/block/sdf/queue/scheduler = [bfq] none

Changing the scheduler to none doesn't help.  However, elevating
&dio->ref for the duration of the plug does seem to work around this
problem...

--D

> [... rest of quoted patch snipped ...]
> +		 */
> +		if (dio->flags & IOMAP_DIO_WRITE) {
> 			struct inode *inode = file_inode(dio->iocb->ki_filp);
>
> 			INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
> @@ -1553,6 +1572,7 @@ static void iomap_dio_bio_end_io(struct bio *bio)
> 			iomap_dio_complete_work(&dio->aio.work);
> 		}
> 	}
> +	/* not safe to use dio past this point */
>
> 	if (should_dirty) {
> 		bio_check_pages_dirty(bio);
> @@ -1916,27 +1936,42 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> 	if (dio->flags & IOMAP_DIO_WRITE_FUA)
> 		dio->flags &= ~IOMAP_DIO_NEED_SYNC;
>
> -	if (!atomic_dec_and_test(&dio->ref)) {
> -		if (!dio->wait_for_completion)
> -			return -EIOCBQUEUED;
> +	/*
> +	 * If this is async IO, and we drop the last reference here we need
> +	 * to complete the IO, otherwise we return EIOCBQUEUED. The order of
> +	 * checks is important here because we can be racing with IO completion
> +	 * to drop the last reference and free the dio, so we must check the dio
> +	 * state before we drop the reference to avoid use-after-free
> +	 * situations.
> +	 */
> +	if (!dio->wait_for_completion) {
> +		if (atomic_dec_and_test(&dio->ref))
> +			return iomap_dio_complete(dio);
> +		return -EIOCBQUEUED;
> +	}
>
> -	for (;;) {
> -		set_current_state(TASK_UNINTERRUPTIBLE);
> -		if (!READ_ONCE(dio->submit.waiter))
> -			break;
> +	/*
> +	 * This is sync IO and we have to access the dio structure to determine
> +	 * if we need to wait and/or poll the block device for completion. Hence
> +	 * we need to hold on to our reference until IO completion has dropped
> +	 * all the IO references and woken us before we drop our reference and
> +	 * complete the IO.
> +	 */
> +	while (atomic_read(&dio->ref) > 1) {
> +		set_current_state(TASK_UNINTERRUPTIBLE);
> +		if (!READ_ONCE(dio->submit.waiter))
> +			break;
>
> -		if (!(iocb->ki_flags & IOCB_HIPRI) ||
> -		    !dio->submit.last_queue ||
> -		    !blk_poll(dio->submit.last_queue,
> -				dio->submit.cookie, true))
> -			io_schedule();
> -	}
> -	__set_current_state(TASK_RUNNING);
> +		if (!(iocb->ki_flags & IOCB_HIPRI) ||
> +		    !dio->submit.last_queue ||
> +		    !blk_poll(dio->submit.last_queue,
> +				dio->submit.cookie, true))
> +			io_schedule();
> 	}
> +	__set_current_state(TASK_RUNNING);
> +	atomic_dec(&dio->ref);
>
> -	ret = iomap_dio_complete(dio);
> -
> -	return ret;
> +	return iomap_dio_complete(dio);
>
> out_free_dio:
> 	kfree(dio);

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: Broken dio refcounting leads to livelock?
  2019-01-08 20:08           ` Darrick J. Wong
@ 2019-01-08 20:54             ` Jeff Moyer
  2019-01-08 21:07               ` Darrick J. Wong
  0 siblings, 1 reply; 9+ messages in thread
From: Jeff Moyer @ 2019-01-08 20:54 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Dave Chinner, Christoph Hellwig, xfs

"Darrick J. Wong" <darrick.wong@oracle.com> writes:

> It looks like we're issuing two bios to satisfy a particular dio.
> However, the first bio completes before the submitter calls finish_plug??
>
> I guess this means that blk_start_plug no longer plugs up io requests,
> which means that the end_io function tries to wake up the submitter
> before it's even had a chance to issue the second bio.

blk_start_plug was always a hint.  If you exceed a certain number of
requests (BLK_MAX_REQUEST_COUNT, which is 16) or a certain size of
request (BLK_PLUG_FLUSH_SIZE, which is 128k), the block layer will flush
the plug list.

> This is surprising to me, because I was under the impression that
> blk_{start,finish}_plug held all the bios so there was no chance that
> any of them would issue (let alone call end_io) before finish_plug.

Definitely not the case.  The plug list is just a performance
optimization.

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: Broken dio refcounting leads to livelock?
  2019-01-08 20:54             ` Jeff Moyer
@ 2019-01-08 21:07               ` Darrick J. Wong
  2019-01-08 21:30                 ` Jeff Moyer
  2019-01-08 23:12                 ` Dave Chinner
  0 siblings, 2 replies; 9+ messages in thread
From: Darrick J. Wong @ 2019-01-08 21:07 UTC (permalink / raw)
To: Jeff Moyer; +Cc: Dave Chinner, Christoph Hellwig, xfs

On Tue, Jan 08, 2019 at 03:54:23PM -0500, Jeff Moyer wrote:
> "Darrick J. Wong" <darrick.wong@oracle.com> writes:
>
> > It looks like we're issuing two bios to satisfy a particular dio.
> > However, the first bio completes before the submitter calls finish_plug??
> >
> > I guess this means that blk_start_plug no longer plugs up io requests,
> > which means that the end_io function tries to wake up the submitter
> > before it's even had a chance to issue the second bio.
>
> blk_start_plug was always a hint.  If you exceed a certain number of
> requests (BLK_MAX_REQUEST_COUNT, which is 16) or a certain size of
> request (BLK_PLUG_FLUSH_SIZE, which is 128k), the block layer will flush
> the plug list.
>
> > This is surprising to me, because I was under the impression that
> > blk_{start,finish}_plug held all the bios so there was no chance that
> > any of them would issue (let alone call end_io) before finish_plug.
>
> Definitely not the case.  The plug list is just a performance
> optimization.

Ahh, fun!  I had not realized that it was merely a suggestion.

$ git grep blk_start_plug Documentation/
$ echo $?
1

In that case we definitely need something to prevent the endio from
trying to wake up a submitter that's still submitting.  Below is the
amended patch I'm using to run generic/013 and generic/323 in a loop.
No crashes so far.

--D

From: Dave Chinner <dchinner@redhat.com>
Subject: [PATCH] iomap: fix iomap dio reference counts

The iomap dio structure could be referenced after contexts had dropped
their reference, resulting in use-after-free conditions being created.
Rework the reference counting to ensure that we never access the dio
structure without holding a valid reference, or without having
guaranteed that the context holding the last reference will free it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
[darrick: elevate dio refcount during block plug]
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/iomap.c |   82 +++++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 62 insertions(+), 20 deletions(-)

diff --git a/fs/iomap.c b/fs/iomap.c
index a3088fae567b..8e7eea407a0a 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -1539,12 +1539,31 @@ static void iomap_dio_bio_end_io(struct bio *bio)
 	if (bio->bi_status)
 		iomap_dio_set_error(dio, blk_status_to_errno(bio->bi_status));

-	if (atomic_dec_and_test(&dio->ref)) {
-		if (dio->wait_for_completion) {
+	/*
+	 * If we have a waiter, then this is a sync IO and the waiter will still
+	 * hold a reference to the dio. If this is the last IO reference, we
+	 * cannot reference the dio after we drop the reference as the waiter
+	 * may be woken immediately and free the dio before we are done with it.
+	 * Hence we check for two references, do the wakeup, then drop the final
+	 * IO reference. The ordering of clearing the submit.waiter, waking the
+	 * waiter and then dropping the reference matches the order of checks in
+	 * the wait loop to avoid wakeup races.
+	 */
+	if (dio->wait_for_completion) {
+		if (atomic_read(&dio->ref) == 2) {
 			struct task_struct *waiter = dio->submit.waiter;
 			WRITE_ONCE(dio->submit.waiter, NULL);
 			blk_wake_io_task(waiter);
-		} else if (dio->flags & IOMAP_DIO_WRITE) {
+		}
+		atomic_dec(&dio->ref);
+	} else if (atomic_dec_and_test(&dio->ref)) {
+		/*
+		 * There is no waiter so we are finishing async IO. If the
+		 * refcount drops to zero then we are responsible for running
+		 * the appropriate IO completion to finish and free the dio
+		 * context.
+		 */
+		if (dio->flags & IOMAP_DIO_WRITE) {
 			struct inode *inode = file_inode(dio->iocb->ki_filp);

 			INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
@@ -1553,6 +1572,7 @@ static void iomap_dio_bio_end_io(struct bio *bio)
 			iomap_dio_complete_work(&dio->aio.work);
 		}
 	}
+	/* not safe to use dio past this point */

 	if (should_dirty) {
 		bio_check_pages_dirty(bio);
@@ -1887,6 +1907,12 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,

 	inode_dio_begin(inode);

+	/*
+	 * The presence of a plug does not prevent IOs from issuing, so elevate
+	 * the refcount on the dio so that a fast end_io can't try to wake us
+	 * up before we're even done issuing IO.
+	 */
+	atomic_inc(&dio->ref);
 	blk_start_plug(&plug);
 	do {
 		ret = iomap_apply(inode, pos, count, flags, ops, dio,
@@ -1905,6 +1931,7 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 			break;
 	} while ((count = iov_iter_count(iter)) > 0);
 	blk_finish_plug(&plug);
+	atomic_dec(&dio->ref);

 	if (ret < 0)
 		iomap_dio_set_error(dio, ret);
@@ -1916,27 +1943,42 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	if (dio->flags & IOMAP_DIO_WRITE_FUA)
 		dio->flags &= ~IOMAP_DIO_NEED_SYNC;

-	if (!atomic_dec_and_test(&dio->ref)) {
-		if (!dio->wait_for_completion)
-			return -EIOCBQUEUED;
+	/*
+	 * If this is async IO, and we drop the last reference here we need
+	 * to complete the IO, otherwise we return EIOCBQUEUED. The order of
+	 * checks is important here because we can be racing with IO completion
+	 * to drop the last reference and free the dio, so we must check the dio
+	 * state before we drop the reference to avoid use-after-free
+	 * situations.
+	 */
+	if (!dio->wait_for_completion) {
+		if (atomic_dec_and_test(&dio->ref))
+			return iomap_dio_complete(dio);
+		return -EIOCBQUEUED;
+	}

-	for (;;) {
-		set_current_state(TASK_UNINTERRUPTIBLE);
-		if (!READ_ONCE(dio->submit.waiter))
-			break;
+	/*
+	 * This is sync IO and we have to access the dio structure to determine
+	 * if we need to wait and/or poll the block device for completion. Hence
+	 * we need to hold on to our reference until IO completion has dropped
+	 * all the IO references and woken us before we drop our reference and
+	 * complete the IO.
+	 */
+	while (atomic_read(&dio->ref) > 1) {
+		set_current_state(TASK_UNINTERRUPTIBLE);
+		if (!READ_ONCE(dio->submit.waiter))
+			break;

-		if (!(iocb->ki_flags & IOCB_HIPRI) ||
-		    !dio->submit.last_queue ||
-		    !blk_poll(dio->submit.last_queue,
-				dio->submit.cookie, true))
-			io_schedule();
-	}
-	__set_current_state(TASK_RUNNING);
+		if (!(iocb->ki_flags & IOCB_HIPRI) ||
+		    !dio->submit.last_queue ||
+		    !blk_poll(dio->submit.last_queue,
+				dio->submit.cookie, true))
+			io_schedule();
 	}
+	__set_current_state(TASK_RUNNING);
+	atomic_dec(&dio->ref);

-	ret = iomap_dio_complete(dio);
-
-	return ret;
+	return iomap_dio_complete(dio);

 out_free_dio:
 	kfree(dio);

^ permalink raw reply related	[flat|nested] 9+ messages in thread
* Re: Broken dio refcounting leads to livelock?
  2019-01-08 21:07               ` Darrick J. Wong
@ 2019-01-08 21:30                 ` Jeff Moyer
  2019-01-08 23:12                 ` Dave Chinner
  1 sibling, 0 replies; 9+ messages in thread
From: Jeff Moyer @ 2019-01-08 21:30 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Dave Chinner, Christoph Hellwig, xfs

"Darrick J. Wong" <darrick.wong@oracle.com> writes:

> On Tue, Jan 08, 2019 at 03:54:23PM -0500, Jeff Moyer wrote:
>> "Darrick J. Wong" <darrick.wong@oracle.com> writes:
>>
>> > It looks like we're issuing two bios to satisfy a particular dio.
>> > However, the first bio completes before the submitter calls finish_plug??
>> >
>> > I guess this means that blk_start_plug no longer plugs up io requests,
>> > which means that the end_io function tries to wake up the submitter
>> > before it's even had a chance to issue the second bio.
>>
>> blk_start_plug was always a hint.  If you exceed a certain number of
>> requests (BLK_MAX_REQUEST_COUNT, which is 16) or a certain size of
>> request (BLK_PLUG_FLUSH_SIZE, which is 128k), the block layer will flush
>> the plug list.
>>
>> > This is surprising to me, because I was under the impression that
>> > blk_{start,finish}_plug held all the bios so there was no chance that
>> > any of them would issue (let alone call end_io) before finish_plug.
>>
>> Definitely not the case.  The plug list is just a performance
>> optimization.
>
> Ahh, fun!  I had not realized that it was merely a suggestion.
>
> $ git grep blk_start_plug Documentation/
> $ echo $?
> 1

Point taken. ;-)  I'll write some kernel-doc text.  Hopefully
documentation in the header file will be enough?

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: Broken dio refcounting leads to livelock?
  2019-01-08 21:07               ` Darrick J. Wong
  2019-01-08 21:30                 ` Jeff Moyer
@ 2019-01-08 23:12                 ` Dave Chinner
  1 sibling, 0 replies; 9+ messages in thread
From: Dave Chinner @ 2019-01-08 23:12 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: Jeff Moyer, Christoph Hellwig, xfs

On Tue, Jan 08, 2019 at 01:07:16PM -0800, Darrick J. Wong wrote:
> On Tue, Jan 08, 2019 at 03:54:23PM -0500, Jeff Moyer wrote:
> > "Darrick J. Wong" <darrick.wong@oracle.com> writes:
> >
> > > It looks like we're issuing two bios to satisfy a particular dio.
> > > However, the first bio completes before the submitter calls finish_plug??
> > >
> > > I guess this means that blk_start_plug no longer plugs up io requests,
> > > which means that the end_io function tries to wake up the submitter
> > > before it's even had a chance to issue the second bio.
> >
> > blk_start_plug was always a hint.  If you exceed a certain number of
> > requests (BLK_MAX_REQUEST_COUNT, which is 16) or a certain size of
> > request (BLK_PLUG_FLUSH_SIZE, which is 128k), the block layer will flush
> > the plug list.
> >
> > > This is surprising to me, because I was under the impression that
> > > blk_{start,finish}_plug held all the bios so there was no chance that
> > > any of them would issue (let alone call end_io) before finish_plug.
> >
> > Definitely not the case.  The plug list is just a performance
> > optimization.
>
> Ahh, fun!  I had not realized that it was merely a suggestion.
>
> $ git grep blk_start_plug Documentation/
> $ echo $?
> 1
>
> In that case we definitely need something to prevent the endio from
> trying to wake up a submitter that's still submitting.  Below is the
> amended patch I'm using to run generic/013 and generic/323 in a loop.
> No crashes so far.

Bloody ugly having to take /two/ submission context reference counts,
but....
/me sees this in the block device dio path:

	if (!dio->multi_bio) {
		/*
		 * AIO needs an extra reference to ensure the dio
		 * structure which is embedded into the first bio
		 * stays around.
		 */
		if (!is_sync)
			bio_get(bio);
		dio->multi_bio = true;
		atomic_set(&dio->ref, 2);
	} else {
		atomic_inc(&dio->ref);
	}

This code was derived from the iomap direct IO code, but unlike the
iomap code it takes two references to the first bio in the submission
set and sets a flag to indicate there are multiple bios in the set.
Because the extra bio reference is not dropped until all the IO is
completed, the struct dio is valid until everything is done and it is
no longer being referenced.

The other thing to note here is that "dio->ref" is actually an "io
remaining" counter, and it is /pre-incremented/.  That is, when we
submit the first bio, the ref count is set to 2 so that if the IO
completes before the second bio is submitted the count will not go to
zero and the IO won't be completed.

So, this is /quite different/ to the iomap code it was derived from.
i.e. dio->ref is an IO counter, not a structure lifetime controller,
and the struct dio lifetime is handled by a bio reference counter
rather than an internal counter.  That makes a bunch of stuff much
simpler.

OK, I'll rework the patch with this in mind....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 9+ messages in thread
end of thread, other threads: [~2019-01-08 23:13 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2019-01-08  0:26 Broken dio refcounting leads to livelock? Darrick J. Wong
2019-01-08  1:16 ` Darrick J. Wong
2019-01-08  3:03   ` Darrick J. Wong
2019-01-08  7:46 ` Dave Chinner
2019-01-08 20:08   ` Darrick J. Wong
2019-01-08 20:54     ` Jeff Moyer
2019-01-08 21:07       ` Darrick J. Wong
2019-01-08 21:30         ` Jeff Moyer
2019-01-08 23:12         ` Dave Chinner