linux-nvme.lists.infradead.org archive mirror
* [PATCH] nvme-fabrics -in case of REQ_NVME_MPATH we should return BLK_STS_RESOURCE
       [not found] <CY4PR17MB1368DE7C1E52DF30E17060BBE61D0@CY4PR17MB1368.namprd17.prod.outlook.com>
@ 2018-09-18 16:25 ` hch
  2018-09-18 18:37   ` James Smart
  0 siblings, 1 reply; 9+ messages in thread
From: hch @ 2018-09-18 16:25 UTC (permalink / raw)


Can you resend this with a proper signoff so that we can apply it?

While you're at it please also make sure you have a line break after
73 chars for the commit message.
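For reference, the expected shape is roughly the following (the subject
is taken from this thread; the body text and the name/address in the
trailer are placeholders, not actual content):

    nvme-fabrics: in case of REQ_NVME_MPATH we should return BLK_STS_RESOURCE

    <commit message body, each line broken before ~73 characters,
     explaining why the multipath case should be requeued rather than
     failed immediately>

    Signed-off-by: Your Name <you@example.com>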


* [PATCH] nvme-fabrics -in case of REQ_NVME_MPATH we should return BLK_STS_RESOURCE
  2018-09-18 16:25 ` [PATCH] nvme-fabrics -in case of REQ_NVME_MPATH we should return BLK_STS_RESOURCE hch
@ 2018-09-18 18:37   ` James Smart
  2018-09-18 19:08     ` Keith Busch
  0 siblings, 1 reply; 9+ messages in thread
From: James Smart @ 2018-09-18 18:37 UTC (permalink / raw)



On 9/18/2018 9:25 AM, hch@lst.de wrote:
> Can you resend this with a proper signoff so that we can apply it?
>
> While you're at it please also make sure you have a line break after
> 73 chars for the commit message.

where is this patch? I can't find it on the reflector.

-- james


* [PATCH] nvme-fabrics -in case of REQ_NVME_MPATH we should return BLK_STS_RESOURCE
  2018-09-18 18:37   ` James Smart
@ 2018-09-18 19:08     ` Keith Busch
  2018-09-18 19:43       ` James Smart
  0 siblings, 1 reply; 9+ messages in thread
From: Keith Busch @ 2018-09-18 19:08 UTC (permalink / raw)



On Tue, Sep 18, 2018 at 11:37:50AM -0700, James Smart wrote:
> 
> On 9/18/2018 9:25 AM, hch@lst.de wrote:
> > Can you resend this with a proper signoff so that we can apply it?
> > 
> > While you're at it please also make sure you have a line break after
> > 73 chars for the commit message.
> 
> where is this patch? I can't find it on the reflector.

The list rejected the message because it was not sent in plain text.
This was the attachment:

---
--- a/drivers/nvme/host/fabrics.c	2018-09-18 02:14:25.590722134 -0700
+++ b/drivers/nvme/host/fabrics.c	2018-09-18 03:18:11.663131011 -0700
@@ -539,8 +539,13 @@ static struct nvmf_transport_ops *nvmf_l
 /*
  * For something we're not in a state to send to the device the default action
  * is to busy it and retry it after the controller state is recovered.  However,
- * if the controller is deleting or if anything is marked for failfast or
- * nvme multipath it is immediately failed.
+ * if the controller is deleting or if anything is marked for failfast.
+ * In the multipath case the controller state may move to resetting/connecting
+ * just after a path has been chosen, either from a keep-alive timeout or an
+ * IO failure followed by failover.  So there is a race: the controller state
+ * was LIVE when the path was selected, but by the time the queue check is
+ * done in rdma.c it has changed to resetting/connecting.  We have to retry
+ * the same IO on a different path and it will succeed.
  *
  * Note: commands used to initialize the controller will be marked for failfast.
  * Note: nvme cli/ioctl commands are marked for failfast.
@@ -550,7 +555,7 @@ blk_status_t nvmf_fail_nonready_command(
 {
 	if (ctrl->state != NVME_CTRL_DELETING &&
 	    ctrl->state != NVME_CTRL_DEAD &&
-	    !blk_noretry_request(rq) && !(rq->cmd_flags & REQ_NVME_MPATH))
+	    !blk_noretry_request(rq))
 		return BLK_STS_RESOURCE;
 	nvme_req(rq)->status = NVME_SC_ABORT_REQ;
 	return BLK_STS_IOERR;
--
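For readers following the thread: with the hunk applied, the helper
would read roughly as follows (reconstructed from the diff context
above; the full parameter list is truncated in the hunk header, so the
signature shown here is an assumption):

blk_status_t nvmf_fail_nonready_command(struct nvme_ctrl *ctrl,
		struct request *rq)
{
	/* Busy/requeue unless the controller is going away or the
	 * request is marked as not retryable. */
	if (ctrl->state != NVME_CTRL_DELETING &&
	    ctrl->state != NVME_CTRL_DEAD &&
	    !blk_noretry_request(rq))
		return BLK_STS_RESOURCE;

	/* Anything else is failed immediately. */
	nvme_req(rq)->status = NVME_SC_ABORT_REQ;
	return BLK_STS_IOERR;
}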


* [PATCH] nvme-fabrics -in case of REQ_NVME_MPATH we should return BLK_STS_RESOURCE
  2018-09-18 19:08     ` Keith Busch
@ 2018-09-18 19:43       ` James Smart
  2018-09-20  6:39         ` hch
  0 siblings, 1 reply; 9+ messages in thread
From: James Smart @ 2018-09-18 19:43 UTC (permalink / raw)




On 9/18/2018 12:08 PM, Keith Busch wrote:
> On Tue, Sep 18, 2018 at 11:37:50AM -0700, James Smart wrote:
>> On 9/18/2018 9:25 AM, hch@lst.de wrote:
>>> Can you resend this with a proper signoff so that we can apply it?
>>>
>>> While you're at it please also make sure you have a line break after
>>> 73 chars for the commit message.
>> where is this patch? I can't find it on the reflector.
> The list rejected the message because it was not sent in plain text.
> This was the attachment:
>
> ---
> --- a/drivers/nvme/host/fabrics.c	2018-09-18 02:14:25.590722134 -0700
> +++ b/drivers/nvme/host/fabrics.c	2018-09-18 03:18:11.663131011 -0700
> @@ -539,8 +539,13 @@ static struct nvmf_transport_ops *nvmf_l
>   /*
>    * For something we're not in a state to send to the device the default action
>    * is to busy it and retry it after the controller state is recovered.  However,
> - * if the controller is deleting or if anything is marked for failfast or
> - * nvme multipath it is immediately failed.
> + * if the controller is deleting or if anything is marked for failfast.
> + * In the multipath case the controller state may move to resetting/connecting
> + * just after a path has been chosen, either from a keep-alive timeout or an
> + * IO failure followed by failover.  So there is a race: the controller state
> + * was LIVE when the path was selected, but by the time the queue check is
> + * done in rdma.c it has changed to resetting/connecting.  We have to retry
> + * the same IO on a different path and it will succeed.
>    *
>    * Note: commands used to initialize the controller will be marked for failfast.
>    * Note: nvme cli/ioctl commands are marked for failfast.
> @@ -550,7 +555,7 @@ blk_status_t nvmf_fail_nonready_command(
>   {
>   	if (ctrl->state != NVME_CTRL_DELETING &&
>   	    ctrl->state != NVME_CTRL_DEAD &&
> -	    !blk_noretry_request(rq) && !(rq->cmd_flags & REQ_NVME_MPATH))
> +	    !blk_noretry_request(rq))
>   		return BLK_STS_RESOURCE;
>   	nvme_req(rq)->status = NVME_SC_ABORT_REQ;
>   	return BLK_STS_IOERR;
> --

ok - so this was what I thought it was - and we've hit this on FC as
well. However, I don't believe it should be solved this way. I believe
we do want it to return an error, as it does now, so blk-mq will
release/fail it back to the caller, which should be the multipather, so
it can move it over to a different path - the whole point of "fast
failover".

The issue that I believe exists is that the multipath driver/core knows
how to retry on a different path when the request is accepted and
nvme_complete_rq() is then called. But it doesn't know how to retry on a
different path when the initial transport->queue_rq() fails - the io
ends up failing completely. I think we need to fix that code path, not
"busy" requeue the mpath io.

-- james


* [PATCH] nvme-fabrics -in case of REQ_NVME_MPATH we should return BLK_STS_RESOURCE
  2018-09-18 19:43       ` James Smart
@ 2018-09-20  6:39         ` hch
  2018-09-20 23:39           ` James Smart
       [not found]           ` <CY4PR17MB1368B68A2FC010C1A8CED289E6120@CY4PR17MB1368.namprd17.prod.outlook.com>
  0 siblings, 2 replies; 9+ messages in thread
From: hch @ 2018-09-20  6:39 UTC (permalink / raw)


On Tue, Sep 18, 2018 at 12:43:30PM -0700, James Smart wrote:
> The issue that I believe exists is that the multipath driver/core knows
> how to retry on a different path when the request is accepted and
> nvme_complete_rq() is then called. But it doesn't know how to retry on a
> different path when the initial transport->queue_rq() fails - the io
> ends up failing completely. I think we need to fix that code path, not
> "busy" requeue the mpath io.

The multipath code handles failures from nvme_complete_rq just fine,
in fact even with this patch we still don't accept the command into
queue_rq.  It is just that BLK_STS_RESOURCE is a magic indicator
for the blk-mq core to retry internally and not hand it back to the
next higher level (which would be the multipath code, either nvme
or dm for that matter).
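
Condensed into a fragment (illustrative only, not the literal blk-mq
source), the dispatch-side handling being described is:

	blk_status_t ret = q->mq_ops->queue_rq(hctx, &bd);

	switch (ret) {
	case BLK_STS_RESOURCE:
		/* the request stays inside blk-mq: it goes back on the
		 * dispatch list and the hw queue is rerun later, so the
		 * multipath layer above never sees it */
		break;
	case BLK_STS_IOERR:
		/* the request is ended with an error, which propagates
		 * up to the submitter - here the nvme multipath device */
		blk_mq_end_request(rq, ret);
		break;
	default:
		break;
	}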


* [PATCH] nvme-fabrics -in case of REQ_NVME_MPATH we should return BLK_STS_RESOURCE
  2018-09-20  6:39         ` hch
@ 2018-09-20 23:39           ` James Smart
  2018-09-21  7:05             ` hch
       [not found]           ` <CY4PR17MB1368B68A2FC010C1A8CED289E6120@CY4PR17MB1368.namprd17.prod.outlook.com>
  1 sibling, 1 reply; 9+ messages in thread
From: James Smart @ 2018-09-20 23:39 UTC (permalink / raw)




On 9/19/2018 11:39 PM, hch@lst.de wrote:
> On Tue, Sep 18, 2018 at 12:43:30PM -0700, James Smart wrote:
>> The issue that I believe exists is that the multipath driver/core knows
>> how to retry on a different path when the request is accepted and
>> nvme_complete_rq() is then called. But it doesn't know how to retry on a
>> different path when the initial transport->queue_rq() fails - the io
>> ends up failing completely. I think we need to fix that code path, not
>> "busy" requeue the mpath io.
> The multipath code handles failures from nvme_complete_rq just fine,
> in fact even with this patch we still don't accept the command into
> queue_rq.  It is just that BLK_STS_RESOURCE is a magic indicator
> for the blk-mq core to retry internally and not hand it back to the
> next higher level (which would be the multipath code, either nvme
> or dm for that matter).

your response is a bit cryptic

I agree with you - that nvme_complete_rq() handles it fine. But in the
path where the transport->queue_rq() is called, and the io is bounced
due to controller state checks, nvme_complete_rq() isn't being called.
If queue_rq() returns BLK_STS_RESOURCE, it's ok as the io gets requeued
in blk-mq, but it could sit there for 60s or so. But if queue_rq() sees
the io marked NVME_MPATH, it returns BLK_STS_IOERR
(blk_mq_complete_request() isn't called), and the blk-mq layer ends up
calling blk_mq_end_request().

-- james


* [PATCH] nvme-fabrics -in case of REQ_NVME_MPATH we should return BLK_STS_RESOURCE
  2018-09-20 23:39           ` James Smart
@ 2018-09-21  7:05             ` hch
  2018-09-21 18:21               ` James Smart
  0 siblings, 1 reply; 9+ messages in thread
From: hch @ 2018-09-21  7:05 UTC (permalink / raw)


On Thu, Sep 20, 2018 at 04:39:24PM -0700, James Smart wrote:
>> The multipath code handles failures from nvme_complete_rq just fine,
>> in fact even with this patch we still don't accept the command into
>> queue_rq.  It is just that BLK_STS_RESOURCE is a magic indicator
>> for the blk-mq core to retry internally and not hand it back to the
>> next higher level (which would be the multipath code, either nvme
>> or dm for that matter).
>
> your response is a bit cryptic

Or confused.  I meant it handles failures from nvme_*queue_rq just fine
above.

> I agree with you - that nvme_complete_rq() handles it fine. But in the
> path where the transport->queue_rq() is called, and the io is bounced
> due to controller state checks, nvme_complete_rq() isn't being called.
> If queue_rq() returns BLK_STS_RESOURCE, it's ok as the io gets requeued
> in blk-mq, but it could sit there for 60s or so. But if queue_rq() sees
> the io marked NVME_MPATH, it returns BLK_STS_IOERR
> (blk_mq_complete_request() isn't called), and the blk-mq layer ends up
> calling blk_mq_end_request().

Yes.  And that is probably why the current code is doing the right(-ish)
thing.


* [PATCH] nvme-fabrics -in case of REQ_NVME_MPATH we should return BLK_STS_RESOURCE
       [not found]           ` <CY4PR17MB1368B68A2FC010C1A8CED289E6120@CY4PR17MB1368.namprd17.prod.outlook.com>
@ 2018-09-21  8:19             ` hch
  0 siblings, 0 replies; 9+ messages in thread
From: hch @ 2018-09-21  8:19 UTC (permalink / raw)


On Fri, Sep 21, 2018 at 08:15:14AM +0000, Susobhan Dey wrote:
> I am trying to figure out whether there is any impact if we return BLK_STS_RESOURCE, as ultimately it will land in the multipath make_request and eventually get completed through the multipath completion handler.

If we return BLK_STS_RESOURCE the block layer will keep retrying it
on the same device until we hit the max retries count.  So yes,
eventually it will go up to the multipath layer, but only after a
significant delay.


* [PATCH] nvme-fabrics -in case of REQ_NVME_MPATH we should return BLK_STS_RESOURCE
  2018-09-21  7:05             ` hch
@ 2018-09-21 18:21               ` James Smart
  0 siblings, 0 replies; 9+ messages in thread
From: James Smart @ 2018-09-21 18:21 UTC (permalink / raw)


On 9/21/2018 12:05 AM, hch@lst.de wrote:
>> I agree with you - that nvme_complete_rq() handles it fine. But in the
>> path where the transport->queue_rq() is called, and the io is bounced
>> due to controller state checks, nvme_complete_rq() isn't being called.
>> If queue_rq() returns BLK_STS_RESOURCE, it's ok as the io gets requeued
>> in blk-mq, but it could sit there for 60s or so. But if queue_rq() sees
>> the io marked NVME_MPATH, it returns BLK_STS_IOERR
>> (blk_mq_complete_request() isn't called), and the blk-mq layer ends up
>> calling blk_mq_end_request().
> Yes.  And that is probably why the current code is doing the right(-ish)
> thing.

So I agree that what we're doing currently is correct - except that the
IOERR bounces all the way up to the caller above multipath - so they get
an erroneous EIO rather than having us insulate it (waiting somewhere,
but not on blk-mq for a long time) over a very short reset/reconnect
window.

I'll post something shortly that seems to resolve it, although maybe not
optimally.

And as I look at this - I think we need to invent a
linux-nvme-stack-specific status code to indicate a path-connectivity
error, like we did on SCSI. I/Os are currently returning
NVME_SC_ABORT_REQ status. We could cheat and use an ANA_TRANSITIONING
status, but that probably isn't the smartest thing to do. The new
status would give nvme multipath a better hint about what to do next.
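
As a purely hypothetical illustration of that idea (the name and value
below are made up for this sketch, not existing definitions), the
nonready helper could then report a host-side path error instead of an
abort status:

/* hypothetical host-internal "no usable path right now" status */
#define NVME_SC_HOST_PATH_ERROR	0x370	/* illustrative value */

	/* ... in nvmf_fail_nonready_command(): ... */
	nvme_req(rq)->status = NVME_SC_HOST_PATH_ERROR;
	return BLK_STS_IOERR;

nvme multipath could then key its retry-on-another-path decision off
that status instead of treating the io as a hard failure.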

-- james
