public inbox for linux-nvme@lists.infradead.org
 help / color / mirror / Atom feed
* [PATCHv2] nvme-pci: enhance timeout kernel log
@ 2023-12-18 22:46 Keith Busch
  2023-12-19  5:01 ` Christoph Hellwig
  0 siblings, 1 reply; 6+ messages in thread
From: Keith Busch @ 2023-12-18 22:46 UTC (permalink / raw)
  To: linux-nvme; +Cc: hch, sagi, axboe, chaitanyak, Keith Busch

From: Keith Busch <kbusch@kernel.org>

Kernel configs don't necessarily have opcode decoding, and some opcodes
are not even decodable. It is still interesting for debugging SSD issues
to know what opcode is timing out, what request type it came from, and
the data size (if applicable).

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
v1->v2:

  Even more verbose annotations in the messages (Jens and Chaitanya)

 drivers/nvme/host/pci.c | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 507bc149046dc..f71ec41489a73 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1284,6 +1284,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req)
 	struct request *abort_req;
 	struct nvme_command cmd = { };
 	u32 csts = readl(dev->bar + NVME_REG_CSTS);
+	u8 opcode;
 
 	/* If PCI error recovery process is happening, we cannot reset or
 	 * the recovery mechanism will surely fail.
@@ -1310,7 +1311,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req)
 
 	if (blk_mq_rq_state(req) != MQ_RQ_IN_FLIGHT) {
 		dev_warn(dev->ctrl.device,
-			 "I/O %d QID %d timeout, completion polled\n",
+			 "I/O tag %d QID %d timeout, completion polled\n",
 			 req->tag, nvmeq->qid);
 		return BLK_EH_DONE;
 	}
@@ -1327,7 +1328,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req)
 		fallthrough;
 	case NVME_CTRL_DELETING:
 		dev_warn_ratelimited(dev->ctrl.device,
-			 "I/O %d QID %d timeout, disable controller\n",
+			 "I/O tag %d QID %d timeout, disable controller\n",
 			 req->tag, nvmeq->qid);
 		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
 		nvme_dev_disable(dev, true);
@@ -1343,10 +1344,12 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req)
 	 * command was already aborted once before and still hasn't been
 	 * returned to the driver, or if this is the admin queue.
 	 */
+	opcode = nvme_req(req)->cmd->common.opcode;
 	if (!nvmeq->qid || iod->aborted) {
 		dev_warn(dev->ctrl.device,
-			 "I/O %d QID %d timeout, reset controller\n",
-			 req->tag, nvmeq->qid);
+			 "I/O tag %d (opcode:%s %d) QID %d timeout, reset controller\n",
+			 req->tag, nvme_opcode_str(nvmeq->qid, opcode, 0),
+			 opcode, nvmeq->qid);
 		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
 		goto disable;
 	}
@@ -1362,10 +1365,9 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req)
 	cmd.abort.sqid = cpu_to_le16(nvmeq->qid);
 
 	dev_warn(nvmeq->dev->ctrl.device,
-		"I/O %d (%s) QID %d timeout, aborting\n",
-		 req->tag,
-		 nvme_get_opcode_str(nvme_req(req)->cmd->common.opcode),
-		 nvmeq->qid);
+		 "I/O tag %d (opcode:%s %d) QID %d timeout, aborting req_op:%s(%u) size:%u\n",
+		 req->tag, nvme_get_opcode_str(opcode), opcode, nvmeq->qid,
+		 blk_op_str(req_op(req)), req_op(req), blk_rq_bytes(req));
 
 	abort_req = blk_mq_alloc_request(dev->ctrl.admin_q, nvme_req_op(&cmd),
 					 BLK_MQ_REQ_NOWAIT);
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCHv2] nvme-pci: enhance timeout kernel log
  2023-12-18 22:46 [PATCHv2] nvme-pci: enhance timeout kernel log Keith Busch
@ 2023-12-19  5:01 ` Christoph Hellwig
  2023-12-19 14:15   ` Sagi Grimberg
  0 siblings, 1 reply; 6+ messages in thread
From: Christoph Hellwig @ 2023-12-19  5:01 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-nvme, hch, sagi, axboe, chaitanyak, Keith Busch

On Mon, Dec 18, 2023 at 02:46:33PM -0800, Keith Busch wrote:
> -			 "I/O %d QID %d timeout, completion polled\n",
> +			 "I/O tag %d QID %d timeout, completion polled\n",

Q: should we use the Linux term tag or the NVMe term command ID
(or command_id) ?  I don't really care, just thinking out loud.

> +	opcode = nvme_req(req)->cmd->common.opcode;
>  	if (!nvmeq->qid || iod->aborted) {
>  		dev_warn(dev->ctrl.device,
> -			 "I/O %d QID %d timeout, reset controller\n",
> -			 req->tag, nvmeq->qid);
> +			 "I/O tag %d (opcode:%s %d) QID %d timeout, reset controller\n",
> +			 req->tag, nvme_opcode_str(nvmeq->qid, opcode, 0),
> +			 opcode, nvmeq->qid);

space after the ":" ?

> +		 "I/O tag %d (opcode:%s %d) QID %d timeout, aborting req_op:%s(%u) size:%u\n",
> +		 req->tag, nvme_get_opcode_str(opcode), opcode, nvmeq->qid,
> +		 blk_op_str(req_op(req)), req_op(req), blk_rq_bytes(req));

Same.



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCHv2] nvme-pci: enhance timeout kernel log
  2023-12-19  5:01 ` Christoph Hellwig
@ 2023-12-19 14:15   ` Sagi Grimberg
  2023-12-19 16:09     ` Max Gurtovoy
  0 siblings, 1 reply; 6+ messages in thread
From: Sagi Grimberg @ 2023-12-19 14:15 UTC (permalink / raw)
  To: Christoph Hellwig, Keith Busch; +Cc: linux-nvme, axboe, chaitanyak, Keith Busch


>> -			 "I/O %d QID %d timeout, completion polled\n",
>> +			 "I/O tag %d QID %d timeout, completion polled\n",
> 
> Q: should we use the Linux term tag or the NVMe term command ID
> (or command_id) ?  I don't really care, just thinking out loud.

I was fine with neither tbh, but tag is better than command_id I
think.


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCHv2] nvme-pci: enhance timeout kernel log
  2023-12-19 14:15   ` Sagi Grimberg
@ 2023-12-19 16:09     ` Max Gurtovoy
  2023-12-19 16:15       ` Keith Busch
  0 siblings, 1 reply; 6+ messages in thread
From: Max Gurtovoy @ 2023-12-19 16:09 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig, Keith Busch
  Cc: linux-nvme, axboe, chaitanyak, Keith Busch



On 19/12/2023 16:15, Sagi Grimberg wrote:
> 
>>> -             "I/O %d QID %d timeout, completion polled\n",
>>> +             "I/O tag %d QID %d timeout, completion polled\n",
>>
>> Q: should we use the Linux term tag or the NVMe term command ID
>> (or command_id) ?  I don't really care, just thinking out loud.
> 
> I was fine with neither tbh, but tag is better than command_id I
> think.
> 

In the current Keith code it is actually the tag.
If you would like to print the cid please use: nvme_cid(req).

I guess the cid will be the useful value that we can compare to the 
wireshark or the target prints.


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCHv2] nvme-pci: enhance timeout kernel log
  2023-12-19 16:09     ` Max Gurtovoy
@ 2023-12-19 16:15       ` Keith Busch
  2023-12-19 16:21         ` Max Gurtovoy
  0 siblings, 1 reply; 6+ messages in thread
From: Keith Busch @ 2023-12-19 16:15 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: Sagi Grimberg, Christoph Hellwig, Keith Busch, linux-nvme, axboe,
	chaitanyak

On Tue, Dec 19, 2023 at 06:09:23PM +0200, Max Gurtovoy wrote:
> 
> 
> On 19/12/2023 16:15, Sagi Grimberg wrote:
> > 
> > > > -             "I/O %d QID %d timeout, completion polled\n",
> > > > +             "I/O tag %d QID %d timeout, completion polled\n",
> > > 
> > > Q: should we use the Linux term tag or the NVMe term command ID
> > > (or command_id) ?  I don't really care, just thinking out loud.
> > 
> > I was fine with neither tbh, but tag is better than command_id I
> > think.
> > 
> 
> In the current Keith code it is actually the tag.
> If you would like to print the cid please use: nvme_cid(req).
> 
> I guess the cid will be the useful value that we can compare to the
> wireshark or the target prints.

Yes! This currently prints the "tag" from blk-mq. I can change it to
print the encoded command_id in the SQE. Or both!


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCHv2] nvme-pci: enhance timeout kernel log
  2023-12-19 16:15       ` Keith Busch
@ 2023-12-19 16:21         ` Max Gurtovoy
  0 siblings, 0 replies; 6+ messages in thread
From: Max Gurtovoy @ 2023-12-19 16:21 UTC (permalink / raw)
  To: Keith Busch
  Cc: Sagi Grimberg, Christoph Hellwig, Keith Busch, linux-nvme, axboe,
	chaitanyak



On 19/12/2023 18:15, Keith Busch wrote:
> On Tue, Dec 19, 2023 at 06:09:23PM +0200, Max Gurtovoy wrote:
>>
>>
>> On 19/12/2023 16:15, Sagi Grimberg wrote:
>>>
>>>>> -             "I/O %d QID %d timeout, completion polled\n",
>>>>> +             "I/O tag %d QID %d timeout, completion polled\n",
>>>>
>>>> Q: should we use the Linux term tag or the NVMe term command ID
>>>> (or command_id) ?  I don't really care, just thinking out loud.
>>>
>>> I was fine with neither tbh, but tag is better than command_id I
>>> think.
>>>
>>
>> In the current Keith code it is actually the tag.
>> If you would like to print the cid please use: nvme_cid(req).
>>
>> I guess the cid will be the useful value that we can compare to the
>> wireshark or the target prints.
> 
> Yes! This currently prints the "tag" from blk-mq. I can change it to
> print the encoded command_id in the SQE. Or both!

I tend to say command_id is more interesting. So better use the option 
of cid or both.

Can you please align all the timeout functions in different transports 
to print same value (tag or cid) ? each transport today prints different 
values. For example tcp prints cid and rdma prints tag :)



^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2023-12-19 16:21 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-12-18 22:46 [PATCHv2] nvme-pci: enhance timeout kernel log Keith Busch
2023-12-19  5:01 ` Christoph Hellwig
2023-12-19 14:15   ` Sagi Grimberg
2023-12-19 16:09     ` Max Gurtovoy
2023-12-19 16:15       ` Keith Busch
2023-12-19 16:21         ` Max Gurtovoy

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox