From: swise@opengridcomputing.com (Steve Wise)
Subject: nvmet_rdma crash - DISCONNECT event with NULL queue
Date: Tue, 1 Nov 2016 11:49:03 -0500 [thread overview]
Message-ID: <01d901d2345f$da0d2e00$8e278a00$@opengridcomputing.com> (raw)
In-Reply-To: <dbe5f18d-7928-f065-920f-753b30fb99a2@grimberg.me>
> >> So I think that the patch from Bart a few weeks ago was correct:
> >>
> >
> > Not quite. It just guards against a null queue for TIMEWAIT_EXIT, which is
only
> > generated by the IB_CM.
>
> Yes, this is why we need ADDR_CHANGE and DISCONNECTED too
> "(and include all the relevant cases around it)"
>
> The other events we don't get to LIVE state and we don't have
> other error flows that will trigger queue teardown sequence.
>
> --
> nvmet-rdma: Fix possible NULL deref when handling rdma cm
> events
>
> When we initiate queue teardown sequence we call rdma_destroy_qp
> which clears cm_id->qp, afterwards we call rdma_destroy_id, but
> we might see a rdma_cm event in between with a cleared cm_id->qp
> so watch out for that and silently ignore the event because this
> means that the queue teardown sequence is in progress.
>
> Signed-off-by: Bart Van Assche <bart.vanassche at sandisk.com>
> Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
> ---
> drivers/nvme/target/rdma.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
> index b4d648536c3e..240888efd920 100644
> --- a/drivers/nvme/target/rdma.c
> +++ b/drivers/nvme/target/rdma.c
> @@ -1351,7 +1351,13 @@ static int nvmet_rdma_cm_handler(struct
> rdma_cm_id *cm_id,
> case RDMA_CM_EVENT_ADDR_CHANGE:
> case RDMA_CM_EVENT_DISCONNECTED:
> case RDMA_CM_EVENT_TIMEWAIT_EXIT:
> - nvmet_rdma_queue_disconnect(queue);
> + /*
> + * We might end up here when we already freed the qp
> + * which means queue release sequence is in progress,
> + * so don't get in the way...
> + */
> + if (!queue)
> + nvmet_rdma_queue_disconnect(queue);
> break;
> case RDMA_CM_EVENT_DEVICE_REMOVAL:
> ret = nvmet_rdma_device_removal(cm_id, queue);
> --
This looks good. But you mentioned the 2 rapid-fire keep alive timeout logs for
the same controller as being seriously broken. Perhaps that is another problem?
Maybe keep alives aren't getting stopped correctly or something...
But: I'll try this patch and run for a few hours and see what happens. I
believe regardless of a keep alive issue, the above patch is still needed.
Steve.
WARNING: multiple messages have this Message-ID (diff)
From: "Steve Wise" <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
To: 'Sagi Grimberg' <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>,
'Christoph Hellwig' <hch-jcswGhMUV9g@public.gmane.org>
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org
Subject: RE: nvmet_rdma crash - DISCONNECT event with NULL queue
Date: Tue, 1 Nov 2016 11:49:03 -0500 [thread overview]
Message-ID: <01d901d2345f$da0d2e00$8e278a00$@opengridcomputing.com> (raw)
In-Reply-To: <dbe5f18d-7928-f065-920f-753b30fb99a2-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
> >> So I think that the patch from Bart a few weeks ago was correct:
> >>
> >
> > Not quite. It just guards against a null queue for TIMEWAIT_EXIT, which is
only
> > generated by the IB_CM.
>
> Yes, this is why we need ADDR_CHANGE and DISCONNECTED too
> "(and include all the relevant cases around it)"
>
> The other events we don't get to LIVE state and we don't have
> other error flows that will trigger queue teardown sequence.
>
> --
> nvmet-rdma: Fix possible NULL deref when handling rdma cm
> events
>
> When we initiate queue teardown sequence we call rdma_destroy_qp
> which clears cm_id->qp, afterwards we call rdma_destroy_id, but
> we might see a rdma_cm event in between with a cleared cm_id->qp
> so watch out for that and silently ignore the event because this
> means that the queue teardown sequence is in progress.
>
> Signed-off-by: Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
> Signed-off-by: Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
> ---
> drivers/nvme/target/rdma.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
> index b4d648536c3e..240888efd920 100644
> --- a/drivers/nvme/target/rdma.c
> +++ b/drivers/nvme/target/rdma.c
> @@ -1351,7 +1351,13 @@ static int nvmet_rdma_cm_handler(struct
> rdma_cm_id *cm_id,
> case RDMA_CM_EVENT_ADDR_CHANGE:
> case RDMA_CM_EVENT_DISCONNECTED:
> case RDMA_CM_EVENT_TIMEWAIT_EXIT:
> - nvmet_rdma_queue_disconnect(queue);
> + /*
> + * We might end up here when we already freed the qp
> + * which means queue release sequence is in progress,
> + * so don't get in the way...
> + */
> + if (!queue)
> + nvmet_rdma_queue_disconnect(queue);
> break;
> case RDMA_CM_EVENT_DEVICE_REMOVAL:
> ret = nvmet_rdma_device_removal(cm_id, queue);
> --
This looks good. But you mentioned the 2 rapid-fire keep alive timeout logs for
the same controller as being seriously broken. Perhaps that is another problem?
Maybe keep alives aren't getting stopped correctly or something...
But: I'll try this patch and run for a few hours and see what happens. I
believe regardless of a keep alive issue, the above patch is still needed.
Steve.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
next prev parent reply other threads:[~2016-11-01 16:49 UTC|newest]
Thread overview: 40+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-11-01 15:57 nvmet_rdma crash - DISCONNECT event with NULL queue Steve Wise
2016-11-01 15:57 ` Steve Wise
2016-11-01 16:15 ` Sagi Grimberg
2016-11-01 16:15 ` Sagi Grimberg
2016-11-01 16:20 ` Steve Wise
2016-11-01 16:20 ` Steve Wise
2016-11-01 16:34 ` Sagi Grimberg
2016-11-01 16:34 ` Sagi Grimberg
2016-11-01 16:37 ` Steve Wise
2016-11-01 16:37 ` Steve Wise
2016-11-01 16:44 ` Sagi Grimberg
2016-11-01 16:44 ` Sagi Grimberg
2016-11-01 16:49 ` Steve Wise [this message]
2016-11-01 16:49 ` Steve Wise
2016-11-01 17:41 ` Sagi Grimberg
2016-11-01 17:41 ` Sagi Grimberg
[not found] ` <025201d23476$66812290$338367b0$@opengridcomputing.com>
2016-11-01 19:42 ` Steve Wise
2016-11-01 19:42 ` Steve Wise
[not found] ` <024e01d23476$6668b890$333a29b0$@opengridcomputing.com>
2016-11-01 22:34 ` Sagi Grimberg
2016-11-01 22:34 ` Sagi Grimberg
2016-11-02 15:07 ` Steve Wise
2016-11-02 15:07 ` Steve Wise
2016-11-02 15:15 ` 'Christoph Hellwig'
2016-11-02 15:15 ` 'Christoph Hellwig'
2016-11-06 7:35 ` Sagi Grimberg
2016-11-06 7:35 ` Sagi Grimberg
2016-11-07 18:29 ` J Freyensee
2016-11-07 18:29 ` J Freyensee
2016-11-07 18:41 ` 'Christoph Hellwig'
2016-11-07 18:41 ` 'Christoph Hellwig'
2016-11-07 18:50 ` J Freyensee
2016-11-07 18:50 ` J Freyensee
2016-11-07 18:51 ` 'Christoph Hellwig'
2016-11-07 18:51 ` 'Christoph Hellwig'
[not found] ` <004701d2351a$d9e4ad70$8dae0850$@opengridcomputing.com>
2016-11-02 19:18 ` Steve Wise
2016-11-02 19:18 ` Steve Wise
2016-11-06 8:51 ` Sagi Grimberg
2016-11-06 8:51 ` Sagi Grimberg
2016-11-08 20:45 ` Steve Wise
2016-11-08 20:45 ` Steve Wise
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to='01d901d2345f$da0d2e00$8e278a00$@opengridcomputing.com' \
--to=swise@opengridcomputing.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.