public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Oded Gabbay <ogabbay@kernel.org>
To: linux-kernel@vger.kernel.org
Cc: Tal Cohen <talcohen@habana.ai>
Subject: [PATCH 02/17] habanalabs: expose undefined opcode status via info ioctl
Date: Mon, 20 Jun 2022 16:04:17 +0300	[thread overview]
Message-ID: <20220620130432.1180451-2-ogabbay@kernel.org> (raw)
In-Reply-To: <20220620130432.1180451-1-ogabbay@kernel.org>

From: Tal Cohen <talcohen@habana.ai>

The info ioctl retrieves information on the last undefined opcode
occurred.

Signed-off-by: Tal Cohen <talcohen@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
---
 .../misc/habanalabs/common/habanalabs_ioctl.c | 25 ++++++++++++++++
 include/uapi/misc/habanalabs.h                | 30 +++++++++++++++++++
 2 files changed, 55 insertions(+)

diff --git a/drivers/misc/habanalabs/common/habanalabs_ioctl.c b/drivers/misc/habanalabs/common/habanalabs_ioctl.c
index c7864d6bb0a1..fe7ed46cd1c5 100644
--- a/drivers/misc/habanalabs/common/habanalabs_ioctl.c
+++ b/drivers/misc/habanalabs/common/habanalabs_ioctl.c
@@ -610,6 +610,28 @@ static int razwi_info(struct hl_fpriv *hpriv, struct hl_info_args *args)
 	return copy_to_user(out, &info, min_t(size_t, max_size, sizeof(info))) ? -EFAULT : 0;
 }
 
+static int undefined_opcode_info(struct hl_fpriv *hpriv, struct hl_info_args *args)
+{
+	struct hl_device *hdev = hpriv->hdev;
+	u32 max_size = args->return_size;
+	struct hl_info_undefined_opcode_event info = {0};
+	void __user *out = (void __user *) (uintptr_t) args->return_pointer;
+
+	if ((!max_size) || (!out))
+		return -EINVAL;
+
+	info.timestamp = ktime_to_ns(hdev->last_error.undef_opcode.timestamp);
+	info.engine_id = hdev->last_error.undef_opcode.engine_id;
+	info.cq_addr = hdev->last_error.undef_opcode.cq_addr;
+	info.cq_size = hdev->last_error.undef_opcode.cq_size;
+	info.stream_id = hdev->last_error.undef_opcode.stream_id;
+	info.cb_addr_streams_len = hdev->last_error.undef_opcode.cb_addr_streams_len;
+	memcpy(info.cb_addr_streams, hdev->last_error.undef_opcode.cb_addr_streams,
+			sizeof(info.cb_addr_streams));
+
+	return copy_to_user(out, &info, min_t(size_t, max_size, sizeof(info))) ? -EFAULT : 0;
+}
+
 static int dev_mem_alloc_page_sizes_info(struct hl_fpriv *hpriv, struct hl_info_args *args)
 {
 	void __user *out = (void __user *) (uintptr_t) args->return_pointer;
@@ -718,6 +740,9 @@ static int _hl_info_ioctl(struct hl_fpriv *hpriv, void *data,
 	case HL_INFO_RAZWI_EVENT:
 		return razwi_info(hpriv, args);
 
+	case HL_INFO_UNDEFINED_OPCODE_EVENT:
+		return undefined_opcode_info(hpriv, args);
+
 	case HL_INFO_DEV_MEM_ALLOC_PAGE_SIZES:
 		return dev_mem_alloc_page_sizes_info(hpriv, args);
 
diff --git a/include/uapi/misc/habanalabs.h b/include/uapi/misc/habanalabs.h
index c94b89cf1ec1..5f9a6097f5f3 100644
--- a/include/uapi/misc/habanalabs.h
+++ b/include/uapi/misc/habanalabs.h
@@ -352,6 +352,7 @@ enum hl_server_type {
  * HL_INFO_REGISTER_EVENTFD   - Register eventfd for event notifications.
  * HL_INFO_UNREGISTER_EVENTFD - Unregister eventfd
  * HL_INFO_GET_EVENTS         - Retrieve the last occurred events
+ * HL_INFO_UNDEFINED_OPCODE_EVENT - Retrieve last undefined opcode error information.
  */
 #define HL_INFO_HW_IP_INFO			0
 #define HL_INFO_HW_EVENTS			1
@@ -380,6 +381,7 @@ enum hl_server_type {
 #define HL_INFO_REGISTER_EVENTFD		28
 #define HL_INFO_UNREGISTER_EVENTFD		29
 #define HL_INFO_GET_EVENTS			30
+#define HL_INFO_UNDEFINED_OPCODE_EVENT		31
 
 #define HL_INFO_VERSION_MAX_LEN			128
 #define HL_INFO_CARD_NAME_MAX_LEN		16
@@ -656,6 +658,34 @@ struct hl_info_razwi_event {
 	__u8 pad[2];
 };
 
+#define MAX_QMAN_STREAMS_INFO		4
+#define OPCODE_INFO_MAX_ADDR_SIZE	8
+/**
+ * struct hl_info_undefined_opcode_event - info about last undefined opcode error
+ * @timestamp: timestamp of the undefined opcode error
+ * @cb_addr_streams: CB addresses (per stream) that are currently exists in the PQ
+ *                   entiers. In case all streams array entries are
+ *                   filled with values, it means the execution was in Lower-CP.
+ * @cq_addr: the address of the current handled command buffer
+ * @cq_size: the size of the current handled command buffer
+ * @cb_addr_streams_len: num of streams - actual len of cb_addr_streams array.
+ *                       should be equal to 1 incase of undefined opcode
+ *                       in Upper-CP (specific stream) and equal to 4 incase
+ *                       of undefined opcode in Lower-CP.
+ * @engine_id: engine-id that the error occurred on
+ * @stream_id: the stream id the error occurred on. In case the stream equals to
+ *             MAX_QMAN_STREAMS_INFO it means the error occurred on a Lower-CP.
+ */
+struct hl_info_undefined_opcode_event {
+	__s64 timestamp;
+	__u64 cb_addr_streams[MAX_QMAN_STREAMS_INFO][OPCODE_INFO_MAX_ADDR_SIZE];
+	__u64 cq_addr;
+	__u32 cq_size;
+	__u32 cb_addr_streams_len;
+	__u32 engine_id;
+	__u32 stream_id;
+};
+
 /**
  * struct hl_info_dev_memalloc_page_sizes - valid page sizes in device mem alloc information.
  * @page_order_bitmask: bitmap in which a set bit represents the order of the supported page size
-- 
2.25.1


  reply	other threads:[~2022-06-20 13:16 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-06-20 13:04 [PATCH 01/17] habanalabs/gaudi: collect undefined opcode error info Oded Gabbay
2022-06-20 13:04 ` Oded Gabbay [this message]
2022-06-20 13:04 ` [PATCH 03/17] habanalabs/gaudi: invoke device reset from one code block Oded Gabbay
2022-06-20 13:04 ` [PATCH 04/17] habanalabs/gaudi: send device reset notification Oded Gabbay
2022-06-20 13:04 ` [PATCH 05/17] habanalabs: send an event notification when CS timeout occurs Oded Gabbay
2022-06-20 13:04 ` [PATCH 06/17] habanalabs: avoid unnecessary error print Oded Gabbay
2022-06-20 13:04 ` [PATCH 07/17] habanalabs/gaudi: fix incorrect MME offset calculation Oded Gabbay
2022-06-20 13:04 ` [PATCH 08/17] habanalabs: add validity check for cq counter offset Oded Gabbay
2022-06-20 13:04 ` [PATCH 09/17] habanalabs/gaudi: fix shift out of bounds Oded Gabbay
2022-06-20 13:04 ` [PATCH 10/17] habanalabs: fix NULL dereference on cs timeout Oded Gabbay
2022-06-20 13:04 ` [PATCH 11/17] habanalabs: remove unused get_dma_desc_list_size Oded Gabbay
2022-06-20 13:04 ` [PATCH 12/17] habanalabs/gaudi: notify user process on device unavailable Oded Gabbay
2022-06-20 13:04 ` [PATCH 13/17] habanalabs: add critical indication in sram ecc Oded Gabbay
2022-06-20 13:04 ` [PATCH 14/17] habanalabs: check fence pointer before use Oded Gabbay
2022-06-20 13:04 ` [PATCH 15/17] habanalabs: print pointer with correct modifier Oded Gabbay
2022-06-20 13:04 ` [PATCH 16/17] habanalabs: use kvcalloc when possible Oded Gabbay
2022-06-20 13:04 ` [PATCH 17/17] habanalabs: fix comment style Oded Gabbay

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220620130432.1180451-2-ogabbay@kernel.org \
    --to=ogabbay@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=talcohen@habana.ai \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox