[PATCH RFC] nvme-rdma: Queue ns scanning after a sucessful reconnection

linux-nvme.lists.infradead.org archive mirror
 help / color / mirror / Atom feed

* [PATCH RFC] nvme-rdma: Queue ns scanning after a sucessful reconnection
@ 2016-07-31 15:55 Sagi Grimberg
  2016-08-01 11:14 ` Christoph Hellwig
  2016-08-04 17:08 ` J Freyensee
  0 siblings, 2 replies; 5+ messages in thread
From: Sagi Grimberg @ 2016-07-31 15:55 UTC (permalink / raw)


On an ordered target shutdown, the target can send a AEN on a namespace
removal, this will trigger the host to queue ns-list query. The shutdown
will trigger error recovery which will attepmt periodic reconnect.

We can hit a race where the ns rescanning fails (error recovery kicked
in and we're not connected) causing removing all the namespaces and when
we reconnect we won't see any namespaces for this controller.

So, queue a namespace rescan after we successfully reconnected to the target.

Note, that unlike user initiated controller reset, we don't need to trigger
namespace scanning (until the point I noticed the above at least) because we
reconnect to an existing controller. However due to the interaction with
the aen mechanism we queue ns scan here as well.

Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
---
I'm open to other suggestions if anyone has any...

 drivers/nvme/host/rdma.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index f8539dd75504..5cb069ab27ed 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -743,8 +743,10 @@ static void nvme_rdma_reconnect_ctrl_work(struct work_struct *work)
 	changed = nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_LIVE);
 	WARN_ON_ONCE(!changed);
 
-	if (ctrl->queue_count > 1)
+	if (ctrl->queue_count > 1) {
 		nvme_start_queues(&ctrl->ctrl);
+		nvme_queue_scan(&ctrl->ctrl);
+	}
 
 	dev_info(ctrl->ctrl.device, "Successfully reconnected\n");
 
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH RFC] nvme-rdma: Queue ns scanning after a sucessful reconnection
  2016-07-31 15:55 [PATCH RFC] nvme-rdma: Queue ns scanning after a sucessful reconnection Sagi Grimberg
@ 2016-08-01 11:14 ` Christoph Hellwig
  2016-08-04 17:08 ` J Freyensee
  1 sibling, 0 replies; 5+ messages in thread
From: Christoph Hellwig @ 2016-08-01 11:14 UTC (permalink / raw)


Looks fine,

Reviewed-by: Christoph Hellwig <hch at lst.de>

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH RFC] nvme-rdma: Queue ns scanning after a sucessful reconnection
  2016-07-31 15:55 [PATCH RFC] nvme-rdma: Queue ns scanning after a sucessful reconnection Sagi Grimberg
  2016-08-01 11:14 ` Christoph Hellwig
@ 2016-08-04 17:08 ` J Freyensee
  2016-08-07  9:12   ` Sagi Grimberg
  1 sibling, 1 reply; 5+ messages in thread
From: J Freyensee @ 2016-08-04 17:08 UTC (permalink / raw)


On Sun, 2016-07-31@18:55 +0300, Sagi Grimberg wrote:
> On an ordered target shutdown, the target can send a AEN on a
> namespace
> removal, this will trigger the host to queue ns-list query. The
> shutdown
> will trigger error recovery which will attepmt periodic reconnect.
> 
> We can hit a race where the ns rescanning fails (error recovery
> kicked
> in and we're not connected) causing removing all the namespaces and
> when
> we reconnect we won't see any namespaces for this controller.
> 
> So, queue a namespace rescan after we successfully reconnected to the
> target.
> 
> Note, that unlike user initiated controller reset, we don't need to
> trigger
> namespace scanning (until the point I noticed the above at least)
> because we
> reconnect to an existing controller. However due to the interaction
> with
> the aen mechanism we queue ns scan here as well.
> 
> Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
> ---
> I'm open to other suggestions if anyone has any...

this sounds like a fix that should really go in the core target code
instead of RDMA code as this could affect any implementation layer.

If the target is shutting down I'm not sure why it would enter error
recovery which would attempt a reconnect. ?If the target is shutting
down, shut down. ?Maybe the keep-alive timer needs to stop
(nvmet_stop_keep_alive_timer()???). I could see the benefit of the
target emitting an AEN to tell the host to rescan for namespaces so it
doesn't keep a stale list of namespaces after shutdown.

My 2.5 cents...

> 
> ?drivers/nvme/host/rdma.c | 4 +++-
> ?1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index f8539dd75504..5cb069ab27ed 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -743,8 +743,10 @@ static void nvme_rdma_reconnect_ctrl_work(struct
> work_struct *work)
> ?	changed = nvme_change_ctrl_state(&ctrl->ctrl,
> NVME_CTRL_LIVE);
> ?	WARN_ON_ONCE(!changed);
> ?
> -	if (ctrl->queue_count > 1)
> +	if (ctrl->queue_count > 1) {
> ?		nvme_start_queues(&ctrl->ctrl);
> +		nvme_queue_scan(&ctrl->ctrl);
> +	}
> ?
> ?	dev_info(ctrl->ctrl.device, "Successfully reconnected\n");
> ?

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH RFC] nvme-rdma: Queue ns scanning after a sucessful reconnection
  2016-08-04 17:08 ` J Freyensee
@ 2016-08-07  9:12   ` Sagi Grimberg
  2016-08-08 19:43     ` J Freyensee
  0 siblings, 1 reply; 5+ messages in thread
From: Sagi Grimberg @ 2016-08-07  9:12 UTC (permalink / raw)



>> On an ordered target shutdown, the target can send a AEN on a
>> namespace
>> removal, this will trigger the host to queue ns-list query. The
>> shutdown
>> will trigger error recovery which will attepmt periodic reconnect.
>>
>> We can hit a race where the ns rescanning fails (error recovery
>> kicked
>> in and we're not connected) causing removing all the namespaces and
>> when
>> we reconnect we won't see any namespaces for this controller.
>>
>> So, queue a namespace rescan after we successfully reconnected to the
>> target.
>>
>> Note, that unlike user initiated controller reset, we don't need to
>> trigger
>> namespace scanning (until the point I noticed the above at least)
>> because we
>> reconnect to an existing controller. However due to the interaction
>> with
>> the aen mechanism we queue ns scan here as well.
>>
>> Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
>> ---
>> I'm open to other suggestions if anyone has any...
>
> this sounds like a fix that should really go in the core target code
> instead of RDMA code as this could affect any implementation layer.

But it fixes the host behavior (nvme-rdma).

> If the target is shutting down I'm not sure why it would enter error
> recovery which would attempt a reconnect.  If the target is shutting
> down, shut down.  Maybe the keep-alive timer needs to stop
> (nvmet_stop_keep_alive_timer()???). I could see the benefit of the
> target emitting an AEN to tell the host to rescan for namespaces so it
> doesn't keep a stale list of namespaces after shutdown.

Its not the target entering error recovery, its the host. What would
you expect from a host that suddenly was disconnected by the target?
Not sure what other than periodic reconnects would be adequate...

The patch here just requeue ns scanning when we successfully reconnected
to the target that was restored.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH RFC] nvme-rdma: Queue ns scanning after a sucessful reconnection
  2016-08-07  9:12   ` Sagi Grimberg
@ 2016-08-08 19:43     ` J Freyensee
  0 siblings, 0 replies; 5+ messages in thread
From: J Freyensee @ 2016-08-08 19:43 UTC (permalink / raw)


On Sun, 2016-08-07@12:12 +0300, Sagi Grimberg wrote:
> > 
> > > 
> > > On an ordered target shutdown, the target can send a AEN on a
> > > namespace
> > > removal, this will trigger the host to queue ns-list query. The
> > > shutdown
> > > will trigger error recovery which will attepmt periodic
> > > reconnect.
> > > 
> > > We can hit a race where the ns rescanning fails (error recovery
> > > kicked
> > > in and we're not connected) causing removing all the namespaces
> > > and
> > > when
> > > we reconnect we won't see any namespaces for this controller.
> > > 
> > > So, queue a namespace rescan after we successfully reconnected to
> > > the
> > > target.
> > > 
> > > Note, that unlike user initiated controller reset, we don't need
> > > to
> > > trigger
> > > namespace scanning (until the point I noticed the above at least)
> > > because we
> > > reconnect to an existing controller. However due to the
> > > interaction
> > > with
> > > the aen mechanism we queue ns scan here as well.
> > > 
> > > Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
> > > ---
> > > I'm open to other suggestions if anyone has any...
> > 
> > this sounds like a fix that should really go in the core target
> > code
> > instead of RDMA code as this could affect any implementation layer.
> 
> But it fixes the host behavior (nvme-rdma).

Actually I think I meant host but didn't have enough coffee for my
brain when I looked at this :-/...

OK, makes sense.

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2016-08-08 19:43 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-07-31 15:55 [PATCH RFC] nvme-rdma: Queue ns scanning after a sucessful reconnection Sagi Grimberg
2016-08-01 11:14 ` Christoph Hellwig
2016-08-04 17:08 ` J Freyensee
2016-08-07  9:12   ` Sagi Grimberg
2016-08-08 19:43     ` J Freyensee

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).