linux-block.vger.kernel.org archive mirror
* [RFC PATCH] blk-mq: Fix lost request during timeout
@ 2017-09-18 22:03 Keith Busch
  2017-09-18 22:07 ` Bart Van Assche
  0 siblings, 1 reply; 14+ messages in thread
From: Keith Busch @ 2017-09-18 22:03 UTC (permalink / raw)
  To: linux-block, Jens Axboe, Christoph Hellwig; +Cc: Keith Busch

I think we've always known it's possible to lose a request during timeout
handling, but just accepted that possibility. It seems to be causing
problems, though, leading to unnecessary error escalation and IO failures.

The possibility arises when the block layer marks the request complete
prior to running the timeout handler. If that request happens to complete
while the handler is running, the request will be lost, inevitably
triggering a second timeout.

This patch attempts to shorten the window for this race condition by
clearing the started flag when the driver completes a request. The block
layer's timeout handler will then complete the command if it observes
the started flag is no longer set.

Note it's possible to lose the command even with this patch. It's just
less likely to happen.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 block/blk-mq.c | 37 +++++++++++++++++++------------------
 1 file changed, 19 insertions(+), 18 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 98a1860..37144ef 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -566,6 +566,7 @@ void blk_mq_complete_request(struct request *rq)
 
 	if (unlikely(blk_should_fake_timeout(q)))
 		return;
+	clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
 	if (!blk_mark_rq_complete(rq))
 		__blk_mq_complete_request(rq);
 }
@@ -605,19 +606,19 @@ void blk_mq_start_request(struct request *rq)
 	 * complete. So be sure to clear complete again when we start
 	 * the request, otherwise we'll ignore the completion event.
 	 */
-	if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
+	if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
 		set_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
+		if (q->dma_drain_size && blk_rq_bytes(rq)) {
+			/*
+			 * Make sure space for the drain appears.  We know we can do
+			 * this because max_hw_segments has been adjusted to be one
+			 * fewer than the device can handle.
+			 */
+			rq->nr_phys_segments++;
+		}
+	}
 	if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags))
 		clear_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
-
-	if (q->dma_drain_size && blk_rq_bytes(rq)) {
-		/*
-		 * Make sure space for the drain appears.  We know we can do
-		 * this because max_hw_segments has been adjusted to be one
-		 * fewer than the device can handle.
-		 */
-		rq->nr_phys_segments++;
-	}
 }
 EXPORT_SYMBOL(blk_mq_start_request);
 
@@ -637,11 +638,6 @@ static void __blk_mq_requeue_request(struct request *rq)
 	trace_block_rq_requeue(q, rq);
 	wbt_requeue(q->rq_wb, &rq->issue_stat);
 	blk_mq_sched_requeue_request(rq);
-
-	if (test_and_clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
-		if (q->dma_drain_size && blk_rq_bytes(rq))
-			rq->nr_phys_segments--;
-	}
 }
 
 void blk_mq_requeue_request(struct request *rq, bool kick_requeue_list)
@@ -763,10 +759,15 @@ void blk_mq_rq_timed_out(struct request *req, bool reserved)
 		__blk_mq_complete_request(req);
 		break;
 	case BLK_EH_RESET_TIMER:
-		blk_add_timer(req);
-		blk_clear_rq_complete(req);
-		break;
+		if (test_bit(REQ_ATOM_STARTED, &req->atomic_flags)) {
+			blk_add_timer(req);
+			blk_clear_rq_complete(req);
+			break;
+		}
+		/* Fall through */
 	case BLK_EH_NOT_HANDLED:
+		if (!test_bit(REQ_ATOM_STARTED, &req->atomic_flags))
+			__blk_mq_complete_request(req);
 		break;
 	default:
 		printk(KERN_ERR "block: bad eh return: %d\n", ret);
-- 
2.5.5


* Re: [RFC PATCH] blk-mq: Fix lost request during timeout
  2017-09-18 22:03 [RFC PATCH] blk-mq: Fix lost request during timeout Keith Busch
@ 2017-09-18 22:07 ` Bart Van Assche
  2017-09-18 22:39   ` Keith Busch
  0 siblings, 1 reply; 14+ messages in thread
From: Bart Van Assche @ 2017-09-18 22:07 UTC (permalink / raw)
  To: hch@lst.de, keith.busch@intel.com, linux-block@vger.kernel.org,
	axboe@kernel.dk

On Mon, 2017-09-18 at 18:03 -0400, Keith Busch wrote:
> I think we've always known it's possible to lose a request during timeout
> handling, but just accepted that possibility. It seems to be causing
> problems, though, leading to unnecessary error escalation and IO failures.
> 
> The possibility arises when the block layer marks the request complete
> prior to running the timeout handler. If that request happens to complete
> while the handler is running, the request will be lost, inevitably
> triggering a second timeout.
> 
> This patch attempts to shorten the window for this race condition by
> clearing the started flag when the driver completes a request. The block
> layer's timeout handler will then complete the command if it observes
> the started flag is no longer set.
> 
> Note it's possible to lose the command even with this patch. It's just
> less likely to happen.

Hello Keith,

Are you sure the root cause of this race condition is in the blk-mq core?
I've never observed such behavior in any of my numerous scsi-mq tests (which
trigger timeouts). Are you sure the race you observed is not caused by a
blk_mq_reinit_tagset() call, a function that is only used by the NVMe driver
and not by any other block driver?

Bart.


* Re: [RFC PATCH] blk-mq: Fix lost request during timeout
  2017-09-18 22:07 ` Bart Van Assche
@ 2017-09-18 22:39   ` Keith Busch
  2017-09-18 22:53     ` Bart Van Assche
  0 siblings, 1 reply; 14+ messages in thread
From: Keith Busch @ 2017-09-18 22:39 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: hch@lst.de, linux-block@vger.kernel.org, axboe@kernel.dk

On Mon, Sep 18, 2017 at 10:07:58PM +0000, Bart Van Assche wrote:
> On Mon, 2017-09-18 at 18:03 -0400, Keith Busch wrote:
> > I think we've always known it's possible to lose a request during timeout
> > handling, but just accepted that possibility. It seems to be causing
> > problems, though, leading to unnecessary error escalation and IO failures.
> > 
> > The possibility arises when the block layer marks the request complete
> > prior to running the timeout handler. If that request happens to complete
> > while the handler is running, the request will be lost, inevitably
> > triggering a second timeout.
> > 
> > This patch attempts to shorten the window for this race condition by
> > clearing the started flag when the driver completes a request. The block
> > layer's timeout handler will then complete the command if it observes
> > the started flag is no longer set.
> > 
> > Note it's possible to lose the command even with this patch. It's just
> > less likely to happen.
> 
> Hello Keith,
> 
> Are you sure the root cause of this race condition is in the blk-mq core?
> I've never observed such behavior in any of my numerous scsi-mq tests (which
> trigger timeouts). Are you sure the race you observed is not caused by a
> blk_mq_reinit_tagset() call, a function that is only used by the NVMe driver
> and not by any other block driver?

Hi Bart,

The nvme driver's use of blk_mq_reinit_tagset only happens during
controller initialisation, but I'm seeing lost commands well after that
during normal and stable running.

The timing is pretty narrow to hit, but I'm pretty sure this is what's
happening. For nvme, this occurs when nvme_timeout() runs concurrently
with nvme_handle_cqe() for the same struct request. For scsi-mq,
the same situation may arise if scsi_mq_done() runs concurrently with
scsi_times_out().

I don't really like the proposed "fix" though, since it only makes the
race less likely, but I didn't see a way to close the window without
introducing locks. If someone knows of a better way, that would be awesome.

Thanks,
Keith


* Re: [RFC PATCH] blk-mq: Fix lost request during timeout
  2017-09-18 22:39   ` Keith Busch
@ 2017-09-18 22:53     ` Bart Van Assche
  2017-09-18 23:08       ` Keith Busch
  0 siblings, 1 reply; 14+ messages in thread
From: Bart Van Assche @ 2017-09-18 22:53 UTC (permalink / raw)
  To: keith.busch@intel.com
  Cc: hch@lst.de, linux-block@vger.kernel.org, axboe@kernel.dk

On Mon, 2017-09-18 at 18:39 -0400, Keith Busch wrote:
> The nvme driver's use of blk_mq_reinit_tagset only happens during
> controller initialisation, but I'm seeing lost commands well after that
> during normal and stable running.
> 
> The timing is pretty narrow to hit, but I'm pretty sure this is what's
> happening. For nvme, this occurs when nvme_timeout() runs concurrently
> with nvme_handle_cqe() for the same struct request. For scsi-mq,
> the same situation may arise if scsi_mq_done() runs concurrently with
> scsi_times_out().

Hello Keith,

Are you sure that scenario can happen? The blk-mq core calls test_and_set_bit()
for the REQ_ATOM_COMPLETE flag before any completion or timeout handler is
called. See also blk_mark_rq_complete(). This avoids that the .complete() and
.timeout() functions run concurrently.

In case this wouldn't be clear, a block driver must not call
blk_mq_end_request() after the timeout handler has finished because that would
trigger a use-after-free of a request structure.

I noticed that your patch includes changes for blk_mq_start_request(). No
timeout or completion handler should be running while blk_mq_start_request() is
being executed. If these changes make a difference in your tests then I think
that means that there is something wrong with the NVMe driver.

Bart.


* Re: [RFC PATCH] blk-mq: Fix lost request during timeout
  2017-09-18 22:53     ` Bart Van Assche
@ 2017-09-18 23:08       ` Keith Busch
  2017-09-18 23:14         ` Bart Van Assche
  2017-09-19  4:16         ` Ming Lei
  0 siblings, 2 replies; 14+ messages in thread
From: Keith Busch @ 2017-09-18 23:08 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: hch@lst.de, linux-block@vger.kernel.org, axboe@kernel.dk

On Mon, Sep 18, 2017 at 10:53:12PM +0000, Bart Van Assche wrote:
> On Mon, 2017-09-18 at 18:39 -0400, Keith Busch wrote:
> > The nvme driver's use of blk_mq_reinit_tagset only happens during
> > controller initialisation, but I'm seeing lost commands well after that
> > during normal and stable running.
> > 
> > The timing is pretty narrow to hit, but I'm pretty sure this is what's
> > happening. For nvme, this occurs when nvme_timeout() runs concurrently
> > with nvme_handle_cqe() for the same struct request. For scsi-mq,
> > the same situation may arise if scsi_mq_done() runs concurrently with
> > scsi_times_out().
> 
> Hello Keith,
> 
> Are you sure that scenario can happen? The blk-mq core calls test_and_set_bit()
> for the REQ_ATOM_COMPLETE flag before any completion or timeout handler is
> called. See also blk_mark_rq_complete(). This avoids that the .complete() and
> .timeout() functions run concurrently.

Indeed that prevents .complete from running concurrently with the
timeout handler, but scsi_mq_done and nvme_handle_cqe are not .complete
callbacks. These are the LLD functions that call blk_mq_complete_request
well before .complete. If the driver calls blk_mq_complete_request on
a request that blk-mq is timing out, the request is lost because blk-mq
already called blk_mark_rq_complete. Nothing prevents these LLD functions
from running at the same time as the timeout handler.
 
> In case this wouldn't be clear, a block driver must not call
> blk_mq_end_request() after the timeout handler has finished because that would
> trigger a use-after-free of a request structure.
>
> I noticed that your patch includes changes for blk_mq_start_request(). No
> timeout or completion handler should be running while blk_mq_start_request() is
> being executed. If these changes make a difference in your tests then I think
> that means that there is something wrong with the NVMe driver.

The reason for changing blk_mq_start_request was that the requeue path
was clearing STARTED. I had to remove that, since otherwise it would have
been impossible for blk_mq_rq_timed_out to know whether the request was
requeued or completed.


* Re: [RFC PATCH] blk-mq: Fix lost request during timeout
  2017-09-18 23:08       ` Keith Busch
@ 2017-09-18 23:14         ` Bart Van Assche
  2017-09-19  1:55           ` Keith Busch
  2017-09-19  4:16         ` Ming Lei
  1 sibling, 1 reply; 14+ messages in thread
From: Bart Van Assche @ 2017-09-18 23:14 UTC (permalink / raw)
  To: keith.busch@intel.com
  Cc: hch@lst.de, linux-block@vger.kernel.org, axboe@kernel.dk

On Mon, 2017-09-18 at 19:08 -0400, Keith Busch wrote:
> On Mon, Sep 18, 2017 at 10:53:12PM +0000, Bart Van Assche wrote:
> > Are you sure that scenario can happen? The blk-mq core calls test_and_set_bit()
> > for the REQ_ATOM_COMPLETE flag before any completion or timeout handler is
> > called. See also blk_mark_rq_complete(). This avoids that the .complete() and
> > .timeout() functions run concurrently.
> 
> Indeed that prevents .complete from running concurrently with the
> timeout handler, but scsi_mq_done and nvme_handle_cqe are not .complete
> callbacks. These are the LLD functions that call blk_mq_complete_request
> well before .complete. If the driver calls blk_mq_complete_request on
> a request that blk-mq is timing out, the request is lost because blk-mq
> already called blk_mark_rq_complete. Nothing prevents these LLD functions
> from running at the same time as the timeout handler.

Can you explain how you define "request is lost"? If a timeout occurs for a
SCSI request then scsi_times_out() calls scsi_abort_command() (if no
.eh_timed_out() callback has been defined by the LLD). It is the responsibility
of the SCSI LLD to call .scsi_done() before its .eh_abort_handler() returns
SUCCESS. If .eh_abort_handler() returns a value other than SUCCESS then the
SCSI core will escalate the error further until .scsi_done() has been called for
the command that timed out. See also scsi_abort_eh_cmnd(). So I think what you
wrote is not correct for the SCSI core and a properly implemented SCSI LLD.

Bart.


* Re: [RFC PATCH] blk-mq: Fix lost request during timeout
  2017-09-18 23:14         ` Bart Van Assche
@ 2017-09-19  1:55           ` Keith Busch
  2017-09-19 15:05             ` Bart Van Assche
  0 siblings, 1 reply; 14+ messages in thread
From: Keith Busch @ 2017-09-19  1:55 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: hch@lst.de, linux-block@vger.kernel.org, axboe@kernel.dk

On Mon, Sep 18, 2017 at 11:14:38PM +0000, Bart Van Assche wrote:
> On Mon, 2017-09-18 at 19:08 -0400, Keith Busch wrote:
> > On Mon, Sep 18, 2017 at 10:53:12PM +0000, Bart Van Assche wrote:
> > > Are you sure that scenario can happen? The blk-mq core calls test_and_set_bit()
> > > for the REQ_ATOM_COMPLETE flag before any completion or timeout handler is
> > > called. See also blk_mark_rq_complete(). This avoids that the .complete() and
> > > .timeout() functions run concurrently.
> > 
> > Indeed that prevents .complete from running concurrently with the
> > timeout handler, but scsi_mq_done and nvme_handle_cqe are not .complete
> > callbacks. These are the LLD functions that call blk_mq_complete_request
> > well before .complete. If the driver calls blk_mq_complete_request on
> > a request that blk-mq is timing out, the request is lost because blk-mq
> > already called blk_mark_rq_complete. Nothing prevents these LLD functions
> > from running at the same time as the timeout handler.
> 
> Can you explain how you define "request is lost"? 

Sure, what I mean by "lost" is that nothing will call
__blk_mq_complete_request, which is required to make forward progress on
that request.

If a driver calls blk_mq_complete_request on a request being checked by
the timeout handler, blk-mq will return immediately instead of making
progress toward completion, since blk-mq already set REQ_ATOM_COMPLETE
while running the timeout handler.

> If a timeout occurs for a
> SCSI request then scsi_times_out() calls scsi_abort_command() (if no
> .eh_timed_out() callback has been defined by the LLD). It is the responsibility
> of the SCSI LLD to call .scsi_done() before its .eh_abort_handler() returns
> SUCCESS. If .eh_abort_handler() returns a value other than SUCCESS then the
> SCSI core will escalate the error further until .scsi_done() has been called for
> the command that timed out. See also scsi_abort_eh_cmnd(). So I think what you
> wrote is not correct for the SCSI core and a properly implemented SCSI LLD. 

Once blk-mq's blk_mq_check_expired calls blk_mark_rq_complete, an
LLD calling blk_mq_complete_request does absolutely nothing because
REQ_ATOM_COMPLETE is already set.

There's nothing stopping scsi_mq_done from running at the same time as
blk-mq's timeout handler.

It doesn't matter which path calls .scsi_done now. That will just
call scsi_mq_done -> blk_mq_complete_request, and since
REQ_ATOM_COMPLETE is already set, the command won't be completed.

The only way to complete that request now is if the timeout
handler returns BLK_EH_HANDLED, but the scsi-mq abort path returns
BLK_EH_NOT_HANDLED on success (very few drivers actually return
BLK_EH_HANDLED).

After the timeout handler runs, it's once again possible for the driver to
complete the request if it returned BLK_EH_RESET_TIMER, but if the driver
already completed the command during the timeout handler, how is the
driver supposed to know it needs to complete the request a second time?


* Re: [RFC PATCH] blk-mq: Fix lost request during timeout
  2017-09-18 23:08       ` Keith Busch
  2017-09-18 23:14         ` Bart Van Assche
@ 2017-09-19  4:16         ` Ming Lei
  2017-09-19 15:07           ` Keith Busch
  1 sibling, 1 reply; 14+ messages in thread
From: Ming Lei @ 2017-09-19  4:16 UTC (permalink / raw)
  To: Keith Busch
  Cc: Bart Van Assche, hch@lst.de, linux-block@vger.kernel.org,
	axboe@kernel.dk

On Tue, Sep 19, 2017 at 7:08 AM, Keith Busch <keith.busch@intel.com> wrote:
> On Mon, Sep 18, 2017 at 10:53:12PM +0000, Bart Van Assche wrote:
>> On Mon, 2017-09-18 at 18:39 -0400, Keith Busch wrote:
>> > The nvme driver's use of blk_mq_reinit_tagset only happens during
>> > controller initialisation, but I'm seeing lost commands well after that
>> > during normal and stable running.
>> >
>> > The timing is pretty narrow to hit, but I'm pretty sure this is what's
>> > happening. For nvme, this occurs when nvme_timeout() runs concurrently
>> > with nvme_handle_cqe() for the same struct request. For scsi-mq,
>> > the same situation may arise if scsi_mq_done() runs concurrently with
>> > scsi_times_out().
>>
>> Hello Keith,
>>
>> Are you sure that scenario can happen? The blk-mq core calls test_and_set_bit()
>> for the REQ_ATOM_COMPLETE flag before any completion or timeout handler is
>> called. See also blk_mark_rq_complete(). This avoids that the .complete() and
>> .timeout() functions run concurrently.
>
> Indeed that prevents .complete from running concurrently with the
> timeout handler, but scsi_mq_done and nvme_handle_cqe are not .complete
> callbacks. These are the LLD functions that call blk_mq_complete_request
> well before .complete. If the driver calls blk_mq_complete_request on
> a request that blk-mq is timing out, the request is lost because blk-mq
> already called blk_mark_rq_complete. Nothing prevents these LLD functions

That shouldn't happen because only one blk_mark_rq_complete() can win,
and it is the winner's responsibility to complete the request, so no
request should be lost. In your case especially, it is really the timeout
handler's responsibility to complete the request.

Also, your patch removes test_and_clear_bit(REQ_ATOM_STARTED) from
__blk_mq_requeue_request(), which will make timeouts easier to trigger:
the period between requeueing a request and restarting it may be long
enough that a timeout fires because STARTED was never cleared.

-- 
Ming Lei


* Re: [RFC PATCH] blk-mq: Fix lost request during timeout
  2017-09-19  1:55           ` Keith Busch
@ 2017-09-19 15:05             ` Bart Van Assche
  0 siblings, 0 replies; 14+ messages in thread
From: Bart Van Assche @ 2017-09-19 15:05 UTC (permalink / raw)
  To: keith.busch@intel.com
  Cc: hch@lst.de, linux-block@vger.kernel.org, axboe@kernel.dk

On Mon, 2017-09-18 at 21:55 -0400, Keith Busch wrote:
> The only way to complete that request now is if the timeout
> handler returns BLK_EH_HANDLED, but the scsi-mq abort path returns
> BLK_EH_NOT_HANDLED on success (very few drivers actually return
> BLK_EH_HANDLED).
> 
> After the timeout handler runs, it's once again possible for the driver to
> complete the request if it returned BLK_EH_RESET_TIMER, but if the driver
> already completed the command during the timeout handler, how is the
> driver supposed to know it needs to complete the request a second time?

Hello Keith,

If you have a look at scmd_eh_abort_handler() then you will see that it either
calls scsi_queue_insert(), scsi_finish_command() or scsi_eh_scmd_add().
scsi_queue_insert() reinserts the request. scsi_eh_scmd_add() wakes up the
error handler thread to perform further error handling escalation.
scsi_finish_command() calls scsi_end_request(). That last function in turn
calls __blk_mq_end_request(). __blk_mq_end_request() does not check
REQ_ATOM_COMPLETE. Sorry but the summary is that I do not agree with you
that the SCSI core can lose requests.

Bart.


* Re: [RFC PATCH] blk-mq: Fix lost request during timeout
  2017-09-19  4:16         ` Ming Lei
@ 2017-09-19 15:07           ` Keith Busch
  2017-09-19 15:18             ` Bart Van Assche
  2017-09-19 15:22             ` Ming Lei
  0 siblings, 2 replies; 14+ messages in thread
From: Keith Busch @ 2017-09-19 15:07 UTC (permalink / raw)
  To: Ming Lei
  Cc: Bart Van Assche, hch@lst.de, linux-block@vger.kernel.org,
	axboe@kernel.dk

On Tue, Sep 19, 2017 at 12:16:31PM +0800, Ming Lei wrote:
> On Tue, Sep 19, 2017 at 7:08 AM, Keith Busch <keith.busch@intel.com> wrote:
> >
> > Indeed that prevents .complete from running concurrently with the
> > timeout handler, but scsi_mq_done and nvme_handle_cqe are not .complete
> > callbacks. These are the LLD functions that call blk_mq_complete_request
> > well before .complete. If the driver calls blk_mq_complete_request on
> > a request that blk-mq is timing out, the request is lost because blk-mq
> > already called blk_mark_rq_complete. Nothing prevents these LLD functions
> 
> That shouldn't happen because only one blk_mark_rq_complete() can win,
> and it is the winner's responsibility to complete the request, so no
> request should be lost. In your case especially, it is really the timeout
> handler's responsibility to complete the request.

Hm, either I'm explaining this poorly, or I'm missing something that's
obvious to everyone else.

The driver's IRQ handler has no idea it's racing with the blk-mq timeout
handler, and there's nothing indicating it lost the race. The IRQ handler
just calls blk_mq_complete_request. As far as the driver is concerned,
it has done its part to complete the request at that point.

The problem is when blk-mq's timeout handler prevents the request from
completing, and doesn't leave any indication the driver requested to
complete it. Who is responsible for completing that request now?


* Re: [RFC PATCH] blk-mq: Fix lost request during timeout
  2017-09-19 15:07           ` Keith Busch
@ 2017-09-19 15:18             ` Bart Van Assche
  2017-09-19 16:39               ` Keith Busch
  2017-09-19 15:22             ` Ming Lei
  1 sibling, 1 reply; 14+ messages in thread
From: Bart Van Assche @ 2017-09-19 15:18 UTC (permalink / raw)
  To: keith.busch@intel.com, tom.leiming@gmail.com
  Cc: hch@lst.de, linux-block@vger.kernel.org, axboe@kernel.dk

On Tue, 2017-09-19 at 11:07 -0400, Keith Busch wrote:
> The problem is when blk-mq's timeout handler prevents the request from
> completing, and doesn't leave any indication the driver requested to
> complete it. Who is responsible for completing that request now?

Hello Keith,

My understanding is that block drivers are responsible for completing timed
out requests by using one of the following approaches:
* By returning BLK_EH_HANDLED or BLK_EH_RESET_TIMER from inside the timeout
  handler.
* By returning BLK_EH_NOT_HANDLED and by calling blk_mq_end_request() or
  __blk_mq_end_request() for the request that timed out.

Bart.


* Re: [RFC PATCH] blk-mq: Fix lost request during timeout
  2017-09-19 15:07           ` Keith Busch
  2017-09-19 15:18             ` Bart Van Assche
@ 2017-09-19 15:22             ` Ming Lei
  2017-09-19 16:00               ` Keith Busch
  1 sibling, 1 reply; 14+ messages in thread
From: Ming Lei @ 2017-09-19 15:22 UTC (permalink / raw)
  To: Keith Busch
  Cc: Bart Van Assche, hch@lst.de, linux-block@vger.kernel.org,
	axboe@kernel.dk

On Tue, Sep 19, 2017 at 11:07 PM, Keith Busch <keith.busch@intel.com> wrote:
> On Tue, Sep 19, 2017 at 12:16:31PM +0800, Ming Lei wrote:
>> On Tue, Sep 19, 2017 at 7:08 AM, Keith Busch <keith.busch@intel.com> wrote:
>> >
>> > Indeed that prevents .complete from running concurrently with the
>> > timeout handler, but scsi_mq_done and nvme_handle_cqe are not .complete
>> > callbacks. These are the LLD functions that call blk_mq_complete_request
>> > well before .complete. If the driver calls blk_mq_complete_request on
>> > a request that blk-mq is timing out, the request is lost because blk-mq
>> > already called blk_mark_rq_complete. Nothing prevents these LLD functions
>>
>> That shouldn't happen because only one blk_mark_rq_complete() can win,
>> and it is the winner's responsibility to complete the request, so no
>> request should be lost. In your case especially, it is really the timeout
>> handler's responsibility to complete the request.
>
> Hm, either I'm explaining this poorly, or I'm missing something that's
> obvious to everyone else.
>
> The driver's IRQ handler has no idea it's racing with the blk-mq timeout
> handler, and there's nothing indicating it lost the race. The IRQ handler
> just calls blk_mq_complete_request. As far as the driver is concerned,
> it has done its part to complete the request at that point.

Both blk_mq_complete_request() and blk_mq_check_expired() call
blk_mark_rq_complete() to try to complete the request, but only
one of them can actually do that, right?

>
> The problem is when blk-mq's timeout handler prevents the request from
> completing, and doesn't leave any indication the driver requested to
> complete it. Who is responsible for completing that request now?

Whoever sets ATOM_COMPLETE successfully is responsible for completing
the request. In this case that is the timeout handler, and the IRQ handler
shouldn't touch the request any more; otherwise a use-after-free may
happen.


-- 
Ming Lei


* Re: [RFC PATCH] blk-mq: Fix lost request during timeout
  2017-09-19 15:22             ` Ming Lei
@ 2017-09-19 16:00               ` Keith Busch
  0 siblings, 0 replies; 14+ messages in thread
From: Keith Busch @ 2017-09-19 16:00 UTC (permalink / raw)
  To: Ming Lei
  Cc: Bart Van Assche, hch@lst.de, linux-block@vger.kernel.org,
	axboe@kernel.dk

On Tue, Sep 19, 2017 at 11:22:20PM +0800, Ming Lei wrote:
> On Tue, Sep 19, 2017 at 11:07 PM, Keith Busch <keith.busch@intel.com> wrote:
> > On Tue, Sep 19, 2017 at 12:16:31PM +0800, Ming Lei wrote:
> >> On Tue, Sep 19, 2017 at 7:08 AM, Keith Busch <keith.busch@intel.com> wrote:
> >> >
> >> > Indeed that prevents .complete from running concurrently with the
> >> > timeout handler, but scsi_mq_done and nvme_handle_cqe are not .complete
> >> > callbacks. These are the LLD functions that call blk_mq_complete_request
> >> > well before .complete. If the driver calls blk_mq_complete_request on
> >> > a request that blk-mq is timing out, the request is lost because blk-mq
> >> > already called blk_mark_rq_complete. Nothing prevents these LLD functions
> >>
> >> That shouldn't happen because only one blk_mark_rq_complete() can win,
> >> and it is the winner's responsibility to complete the request, so no
> >> request should be lost. In your case especially, it is really the timeout
> >> handler's responsibility to complete the request.
> >
> > Hm, either I'm explaining this poorly, or I'm missing something that's
> > obvious to everyone else.
> >
> > The driver's IRQ handler has no idea it's racing with the blk-mq timeout
> > handler, and there's nothing indicating it lost the race. The IRQ handler
> > just calls blk_mq_complete_request. As far as the driver is concerned,
> > it has done its part to complete the request at that point.
> 
> Both blk_mq_complete_request() and blk_mq_check_expired() call
> blk_mark_rq_complete() to try to complete the request, but only
> one of them can actually do that, right?

Yes, only one can successfully call that.

The problem I'm asking about is that the driver's IRQ handler doesn't know
it lost the race. Only blk-mq knows that. By the time the driver's timeout
handler runs, the driver believes it has already completed the request.
 
> > The problem is when blk-mq's timeout handler prevents the request from
> > completing, and doesn't leave any indication the driver requested to
> > complete it. Who is responsible for completing that request now?
> 
> Whoever sets ATOM_COMPLETE successfully is responsible for completing
> the request. In this case that is the timeout handler, and the IRQ handler
> shouldn't touch the request any more; otherwise a use-after-free may
> happen.

Blk-mq sets ATOM_COMPLETE well before the driver's timeout handler is
executed. The IRQ handler doesn't know blk-mq did that, so suggesting
the IRQ handler can't touch the request doesn't make sense.


* Re: [RFC PATCH] blk-mq: Fix lost request during timeout
  2017-09-19 15:18             ` Bart Van Assche
@ 2017-09-19 16:39               ` Keith Busch
  0 siblings, 0 replies; 14+ messages in thread
From: Keith Busch @ 2017-09-19 16:39 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: tom.leiming@gmail.com, hch@lst.de, linux-block@vger.kernel.org,
	axboe@kernel.dk

On Tue, Sep 19, 2017 at 03:18:45PM +0000, Bart Van Assche wrote:
> On Tue, 2017-09-19 at 11:07 -0400, Keith Busch wrote:
> > The problem is when blk-mq's timeout handler prevents the request from
> > completing, and doesn't leave any indication the driver requested to
> > complete it. Who is responsible for completing that request now?
> 
> Hello Keith,
> 
> My understanding is that block drivers are responsible for completing timed
> out requests by using one of the following approaches:
> * By returning BLK_EH_HANDLED or BLK_EH_RESET_TIMER from inside the timeout
>   handler.

No problem with BLK_EH_HANDLED when timeout handler completes the
request. That usage at least makes sense.

In NVMe, we use BLK_EH_RESET_TIMER if the driver takes an asynchronous
action to reclaim the request. If the request is returned very soon though
(before blk-mq clears ATOM_COMPLETE), blk_mq_complete_request will still
do nothing.

> * By returning BLK_EH_NOT_HANDLED and by calling blk_mq_end_request() or
>   __blk_mq_end_request() for the request that timed out.

You want to bypass __blk_mq_complete_request? Doesn't that actually do
important things with queue and scheduler stats? If it's not important,
then this sounds like the piece I'm looking for, but it also puts a
burden on the driver to track the state of its requests, something blk-mq
could do for all drivers.


end of thread

Thread overview: 14+ messages
2017-09-18 22:03 [RFC PATCH] blk-mq: Fix lost request during timeout Keith Busch
2017-09-18 22:07 ` Bart Van Assche
2017-09-18 22:39   ` Keith Busch
2017-09-18 22:53     ` Bart Van Assche
2017-09-18 23:08       ` Keith Busch
2017-09-18 23:14         ` Bart Van Assche
2017-09-19  1:55           ` Keith Busch
2017-09-19 15:05             ` Bart Van Assche
2017-09-19  4:16         ` Ming Lei
2017-09-19 15:07           ` Keith Busch
2017-09-19 15:18             ` Bart Van Assche
2017-09-19 16:39               ` Keith Busch
2017-09-19 15:22             ` Ming Lei
2017-09-19 16:00               ` Keith Busch
