From: Patrisious Haddad <phaddad@nvidia.com>
To: Sagi Grimberg <sagi@grimberg.me>, Christoph Hellwig <hch@lst.de>
Cc: Israel Rukshin <israelr@nvidia.com>,
Leon Romanovsky <leonro@nvidia.com>,
Linux-nvme <linux-nvme@lists.infradead.org>,
<linux-rdma@vger.kernel.org>,
Michael Guralnik <michaelgur@nvidia.com>,
Maor Gottlieb <maorg@nvidia.com>,
Max Gurtovoy <mgurtovoy@nvidia.com>
Subject: [PATCH rdma-next 4/4] nvme-rdma: add more error details when a QP moves to an error state
Date: Wed, 7 Sep 2022 14:38:00 +0300
Message-ID: <20220907113800.22182-5-phaddad@nvidia.com>
In-Reply-To: <20220907113800.22182-1-phaddad@nvidia.com>
From: Israel Rukshin <israelr@nvidia.com>
Add debug prints for fatal QP events to help find the root cause of
such errors. ib_get_qp_err_syndrome() is called from a work queue
because the QP event callback runs in interrupt context and therefore
cannot sleep.
Signed-off-by: Israel Rukshin <israelr@nvidia.com>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/nvme/host/rdma.c | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 3100643be299..7e56c0dbe8ea 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -99,6 +99,7 @@ struct nvme_rdma_queue {
bool pi_support;
int cq_size;
struct mutex queue_lock;
+ struct work_struct qp_err_work;
};
struct nvme_rdma_ctrl {
@@ -237,11 +238,31 @@ static struct nvme_rdma_qe *nvme_rdma_alloc_ring(struct ib_device *ibdev,
return NULL;
}
+static void nvme_rdma_qp_error_work(struct work_struct *work)
+{
+ struct nvme_rdma_queue *queue = container_of(work,
+ struct nvme_rdma_queue, qp_err_work);
+ int ret;
+ char err[IB_ERR_SYNDROME_LENGTH];
+
+ ret = ib_get_qp_err_syndrome(queue->qp, err);
+ if (ret)
+ return;
+
+ pr_err("Queue %d got QP error syndrome %s\n",
+ nvme_rdma_queue_idx(queue), err);
+}
+
static void nvme_rdma_qp_event(struct ib_event *event, void *context)
{
+ struct nvme_rdma_queue *queue = context;
+
pr_debug("QP event %s (%d)\n",
ib_event_msg(event->event), event->event);
+ if (event->event == IB_EVENT_QP_FATAL ||
+ event->event == IB_EVENT_QP_ACCESS_ERR)
+ queue_work(nvme_wq, &queue->qp_err_work);
}
static int nvme_rdma_wait_for_cm(struct nvme_rdma_queue *queue)
@@ -261,7 +282,9 @@ static int nvme_rdma_create_qp(struct nvme_rdma_queue *queue, const int factor)
struct ib_qp_init_attr init_attr;
int ret;
+ INIT_WORK(&queue->qp_err_work, nvme_rdma_qp_error_work);
memset(&init_attr, 0, sizeof(init_attr));
+ init_attr.qp_context = queue;
init_attr.event_handler = nvme_rdma_qp_event;
/* +1 for drain */
init_attr.cap.max_send_wr = factor * queue->queue_size + 1;
@@ -434,6 +457,7 @@ static void nvme_rdma_destroy_queue_ib(struct nvme_rdma_queue *queue)
ib_mr_pool_destroy(queue->qp, &queue->qp->sig_mrs);
ib_mr_pool_destroy(queue->qp, &queue->qp->rdma_mrs);
+ flush_work(&queue->qp_err_work);
/*
* The cm_id object might have been destroyed during RDMA connection
* establishment error flow to avoid getting other cma events, thus
--
2.18.1
Thread overview: 16+ messages
2022-09-07 11:37 [PATCH rdma-next 0/4] Provide more error details when a QP moves to an error state Patrisious Haddad
2022-09-07 11:37 ` [PATCH rdma-next 1/4] net/mlx5: Introduce CQE error syndrome Patrisious Haddad
2022-09-07 11:37 ` [PATCH rdma-next 2/4] RDMA/core: Introduce ib_get_qp_err_syndrome function Patrisious Haddad
2022-09-07 11:37 ` [PATCH rdma-next 3/4] RDMA/mlx5: Implement ib_get_qp_err_syndrome Patrisious Haddad
2022-09-07 11:38 ` Patrisious Haddad [this message]
2022-09-07 12:02 ` [PATCH rdma-next 4/4] nvme-rdma: add more error details when a QP moves to an error state Christoph Hellwig
2022-09-07 12:11 ` Leon Romanovsky
2022-09-07 12:34 ` Sagi Grimberg
2022-09-07 12:51 ` Leon Romanovsky
2022-09-07 15:16 ` Sagi Grimberg
2022-09-07 15:18 ` Christoph Hellwig
2022-09-07 17:39 ` Leon Romanovsky
2022-11-01 9:12 ` Mark Zhang
2022-11-02 1:56 ` Mark Zhang
2022-09-08 7:55 ` Patrisious Haddad
2022-09-07 17:29 ` Leon Romanovsky