All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 1/6] accel/habanalabs: add missing tpc interrupt info
@ 2023-05-01  9:47 Oded Gabbay
  2023-05-01  9:47 ` [PATCH 2/6] accel/habanalabs: add pci health check during heartbeat Oded Gabbay
                   ` (4 more replies)
  0 siblings, 5 replies; 6+ messages in thread
From: Oded Gabbay @ 2023-05-01  9:47 UTC (permalink / raw)
  To: dri-devel; +Cc: Dafna Hirschfeld

From: Dafna Hirschfeld <dhirschfeld@habana.ai>

For some reason the last possible tpc interrupt cause in
gaudi2_tpc_interrupts_cause is missing from the code.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
---
 drivers/accel/habanalabs/gaudi2/gaudi2.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/accel/habanalabs/gaudi2/gaudi2.c b/drivers/accel/habanalabs/gaudi2/gaudi2.c
index 5c80e7af5b78..058741698d71 100644
--- a/drivers/accel/habanalabs/gaudi2/gaudi2.c
+++ b/drivers/accel/habanalabs/gaudi2/gaudi2.c
@@ -63,7 +63,7 @@
 #define GAUDI2_NUM_OF_CPU_SEI_ERR_CAUSE		3
 #define GAUDI2_NUM_OF_QM_SEI_ERR_CAUSE		2
 #define GAUDI2_NUM_OF_ROT_ERR_CAUSE		22
-#define GAUDI2_NUM_OF_TPC_INTR_CAUSE		30
+#define GAUDI2_NUM_OF_TPC_INTR_CAUSE		31
 #define GAUDI2_NUM_OF_DEC_ERR_CAUSE		25
 #define GAUDI2_NUM_OF_MME_ERR_CAUSE		16
 #define GAUDI2_NUM_OF_MME_SBTE_ERR_CAUSE	5
@@ -891,6 +891,7 @@ static const char * const gaudi2_tpc_interrupts_cause[GAUDI2_NUM_OF_TPC_INTR_CAU
 	"invalid_lock_access",
 	"LD_L protection violation",
 	"ST_L protection violation",
+	"D$ L0CS mismatch",
 };
 
 static const char * const guadi2_mme_error_cause[GAUDI2_NUM_OF_MME_ERR_CAUSE] = {
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH 2/6] accel/habanalabs: add pci health check during heartbeat
  2023-05-01  9:47 [PATCH 1/6] accel/habanalabs: add missing tpc interrupt info Oded Gabbay
@ 2023-05-01  9:47 ` Oded Gabbay
  2023-05-01  9:47 ` [PATCH 3/6] accel/habanalabs: expose debugfs files later Oded Gabbay
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: Oded Gabbay @ 2023-05-01  9:47 UTC (permalink / raw)
  To: dri-devel; +Cc: Ofir Bitton

From: Ofir Bitton <obitton@habana.ai>

Currently upon a heartbeat failure, we don't know if the failure
is due to firmware hang or due to a bad PCI link. Hence, we
are reading a PCI config space register with a known value (vendor ID)
so we will know which of the two possibilities caused the heartbeat
failure.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
---
 drivers/accel/habanalabs/common/device.c         | 15 ++++++++++++++-
 drivers/accel/habanalabs/common/habanalabs.h     |  2 ++
 drivers/accel/habanalabs/common/habanalabs_drv.c |  2 --
 3 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/drivers/accel/habanalabs/common/device.c b/drivers/accel/habanalabs/common/device.c
index c8f4a34d2831..98b46b2e1898 100644
--- a/drivers/accel/habanalabs/common/device.c
+++ b/drivers/accel/habanalabs/common/device.c
@@ -981,6 +981,18 @@ static void device_early_fini(struct hl_device *hdev)
 		hdev->asic_funcs->early_fini(hdev);
 }
 
+static bool is_pci_link_healthy(struct hl_device *hdev)
+{
+	u16 vendor_id;
+
+	if (!hdev->pdev)
+		return false;
+
+	pci_read_config_word(hdev->pdev, PCI_VENDOR_ID, &vendor_id);
+
+	return (vendor_id == PCI_VENDOR_ID_HABANALABS);
+}
+
 static void hl_device_heartbeat(struct work_struct *work)
 {
 	struct hl_device *hdev = container_of(work, struct hl_device,
@@ -995,7 +1007,8 @@ static void hl_device_heartbeat(struct work_struct *work)
 		goto reschedule;
 
 	if (hl_device_operational(hdev, NULL))
-		dev_err(hdev->dev, "Device heartbeat failed!\n");
+		dev_err(hdev->dev, "Device heartbeat failed! PCI link is %s\n",
+			is_pci_link_healthy(hdev) ? "healthy" : "broken");
 
 	info.err_type = HL_INFO_FW_HEARTBEAT_ERR;
 	info.event_mask = &event_mask;
diff --git a/drivers/accel/habanalabs/common/habanalabs.h b/drivers/accel/habanalabs/common/habanalabs.h
index f83ea96c6530..ad412cc01aba 100644
--- a/drivers/accel/habanalabs/common/habanalabs.h
+++ b/drivers/accel/habanalabs/common/habanalabs.h
@@ -36,6 +36,8 @@
 struct hl_device;
 struct hl_fpriv;
 
+#define PCI_VENDOR_ID_HABANALABS	0x1da3
+
 /* Use upper bits of mmap offset to store habana driver specific information.
  * bits[63:59] - Encode mmap type
  * bits[45:0]  - mmap offset value
diff --git a/drivers/accel/habanalabs/common/habanalabs_drv.c b/drivers/accel/habanalabs/common/habanalabs_drv.c
index a4b3f50f1cba..ea80638fc6b9 100644
--- a/drivers/accel/habanalabs/common/habanalabs_drv.c
+++ b/drivers/accel/habanalabs/common/habanalabs_drv.c
@@ -54,8 +54,6 @@ module_param(boot_error_status_mask, ulong, 0444);
 MODULE_PARM_DESC(boot_error_status_mask,
 	"Mask of the error status during device CPU boot (If bitX is cleared then error X is masked. Default all 1's)");
 
-#define PCI_VENDOR_ID_HABANALABS	0x1da3
-
 #define PCI_IDS_GOYA			0x0001
 #define PCI_IDS_GAUDI			0x1000
 #define PCI_IDS_GAUDI_SEC		0x1010
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH 3/6] accel/habanalabs: expose debugfs files later
  2023-05-01  9:47 [PATCH 1/6] accel/habanalabs: add missing tpc interrupt info Oded Gabbay
  2023-05-01  9:47 ` [PATCH 2/6] accel/habanalabs: add pci health check during heartbeat Oded Gabbay
@ 2023-05-01  9:47 ` Oded Gabbay
  2023-05-01  9:47 ` [PATCH 4/6] accel/habanalabs: poll for device status update following WFE cmd Oded Gabbay
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: Oded Gabbay @ 2023-05-01  9:47 UTC (permalink / raw)
  To: dri-devel; +Cc: Tomer Tayar

From: Tomer Tayar <ttayar@habana.ai>

Currently the debugfs root folder and files for a device are created at
an early step, before the device initialization and before the char
device and sysfs files are exposed to user.
As there is no real reason not to do it together with the device
creation, postpone it to be done right afterwards.

The initialization of the debugfs entry structure is left in its
current position because it is used before creating the files.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
---
 drivers/accel/habanalabs/common/debugfs.c    | 37 +++++++-----
 drivers/accel/habanalabs/common/device.c     | 62 +++++++++++---------
 drivers/accel/habanalabs/common/habanalabs.h |  6 +-
 3 files changed, 61 insertions(+), 44 deletions(-)

diff --git a/drivers/accel/habanalabs/common/debugfs.c b/drivers/accel/habanalabs/common/debugfs.c
index 22dd17c077c0..29b78c7cf5de 100644
--- a/drivers/accel/habanalabs/common/debugfs.c
+++ b/drivers/accel/habanalabs/common/debugfs.c
@@ -1756,17 +1756,15 @@ static void add_files_to_device(struct hl_device *hdev, struct hl_dbg_device_ent
 	}
 }
 
-void hl_debugfs_add_device(struct hl_device *hdev)
+int hl_debugfs_device_init(struct hl_device *hdev)
 {
 	struct hl_dbg_device_entry *dev_entry = &hdev->hl_debugfs;
 	int count = ARRAY_SIZE(hl_debugfs_list);
 
 	dev_entry->hdev = hdev;
-	dev_entry->entry_arr = kmalloc_array(count,
-					sizeof(struct hl_debugfs_entry),
-					GFP_KERNEL);
+	dev_entry->entry_arr = kmalloc_array(count, sizeof(struct hl_debugfs_entry), GFP_KERNEL);
 	if (!dev_entry->entry_arr)
-		return;
+		return -ENOMEM;
 
 	dev_entry->data_dma_blob_desc.size = 0;
 	dev_entry->data_dma_blob_desc.data = NULL;
@@ -1787,21 +1785,14 @@ void hl_debugfs_add_device(struct hl_device *hdev)
 	spin_lock_init(&dev_entry->userptr_spinlock);
 	mutex_init(&dev_entry->ctx_mem_hash_mutex);
 
-	dev_entry->root = debugfs_create_dir(dev_name(hdev->dev),
-						hl_debug_root);
-
-	add_files_to_device(hdev, dev_entry, dev_entry->root);
-	if (!hdev->asic_prop.fw_security_enabled)
-		add_secured_nodes(dev_entry, dev_entry->root);
+	return 0;
 }
 
-void hl_debugfs_remove_device(struct hl_device *hdev)
+void hl_debugfs_device_fini(struct hl_device *hdev)
 {
 	struct hl_dbg_device_entry *entry = &hdev->hl_debugfs;
 	int i;
 
-	debugfs_remove_recursive(entry->root);
-
 	mutex_destroy(&entry->ctx_mem_hash_mutex);
 	mutex_destroy(&entry->file_mutex);
 
@@ -1814,6 +1805,24 @@ void hl_debugfs_remove_device(struct hl_device *hdev)
 	kfree(entry->entry_arr);
 }
 
+void hl_debugfs_add_device(struct hl_device *hdev)
+{
+	struct hl_dbg_device_entry *dev_entry = &hdev->hl_debugfs;
+
+	dev_entry->root = debugfs_create_dir(dev_name(hdev->dev), hl_debug_root);
+
+	add_files_to_device(hdev, dev_entry, dev_entry->root);
+	if (!hdev->asic_prop.fw_security_enabled)
+		add_secured_nodes(dev_entry, dev_entry->root);
+}
+
+void hl_debugfs_remove_device(struct hl_device *hdev)
+{
+	struct hl_dbg_device_entry *entry = &hdev->hl_debugfs;
+
+	debugfs_remove_recursive(entry->root);
+}
+
 void hl_debugfs_add_file(struct hl_fpriv *hpriv)
 {
 	struct hl_dbg_device_entry *dev_entry = &hpriv->hdev->hl_debugfs;
diff --git a/drivers/accel/habanalabs/common/device.c b/drivers/accel/habanalabs/common/device.c
index 98b46b2e1898..cab5a63db8c1 100644
--- a/drivers/accel/habanalabs/common/device.c
+++ b/drivers/accel/habanalabs/common/device.c
@@ -674,7 +674,7 @@ static int device_init_cdev(struct hl_device *hdev, struct class *class,
 	return 0;
 }
 
-static int device_cdev_sysfs_add(struct hl_device *hdev)
+static int cdev_sysfs_debugfs_add(struct hl_device *hdev)
 {
 	int rc;
 
@@ -699,7 +699,9 @@ static int device_cdev_sysfs_add(struct hl_device *hdev)
 		goto delete_ctrl_cdev_device;
 	}
 
-	hdev->cdev_sysfs_created = true;
+	hl_debugfs_add_device(hdev);
+
+	hdev->cdev_sysfs_debugfs_created = true;
 
 	return 0;
 
@@ -710,11 +712,12 @@ static int device_cdev_sysfs_add(struct hl_device *hdev)
 	return rc;
 }
 
-static void device_cdev_sysfs_del(struct hl_device *hdev)
+static void cdev_sysfs_debugfs_remove(struct hl_device *hdev)
 {
-	if (!hdev->cdev_sysfs_created)
+	if (!hdev->cdev_sysfs_debugfs_created)
 		goto put_devices;
 
+	hl_debugfs_remove_device(hdev);
 	hl_sysfs_fini(hdev);
 	cdev_device_del(&hdev->cdev_ctrl, hdev->dev_ctrl);
 	cdev_device_del(&hdev->cdev, hdev->dev);
@@ -2054,7 +2057,7 @@ static int create_cdev(struct hl_device *hdev)
 int hl_device_init(struct hl_device *hdev)
 {
 	int i, rc, cq_cnt, user_interrupt_cnt, cq_ready_cnt;
-	bool add_cdev_sysfs_on_err = false;
+	bool expose_interfaces_on_err = false;
 
 	rc = create_cdev(hdev);
 	if (rc)
@@ -2170,16 +2173,22 @@ int hl_device_init(struct hl_device *hdev)
 	hdev->device_release_watchdog_timeout_sec = HL_DEVICE_RELEASE_WATCHDOG_TIMEOUT_SEC;
 
 	hdev->memory_scrub_val = MEM_SCRUB_DEFAULT_VAL;
-	hl_debugfs_add_device(hdev);
 
-	/* debugfs nodes are created in hl_ctx_init so it must be called after
-	 * hl_debugfs_add_device.
+	rc = hl_debugfs_device_init(hdev);
+	if (rc) {
+		dev_err(hdev->dev, "failed to initialize debugfs entry structure\n");
+		kfree(hdev->kernel_ctx);
+		goto mmu_fini;
+	}
+
+	/* The debugfs entry structure is accessed in hl_ctx_init(), so it must be called after
+	 * hl_debugfs_device_init().
 	 */
 	rc = hl_ctx_init(hdev, hdev->kernel_ctx, true);
 	if (rc) {
 		dev_err(hdev->dev, "failed to initialize kernel context\n");
 		kfree(hdev->kernel_ctx);
-		goto remove_device_from_debugfs;
+		goto debugfs_device_fini;
 	}
 
 	rc = hl_cb_pool_init(hdev);
@@ -2195,11 +2204,10 @@ int hl_device_init(struct hl_device *hdev)
 	}
 
 	/*
-	 * From this point, override rc (=0) in case of an error to allow
-	 * debugging (by adding char devices and create sysfs nodes as part of
-	 * the error flow).
+	 * From this point, override rc (=0) in case of an error to allow debugging
+	 * (by adding char devices and creating sysfs/debugfs files as part of the error flow).
 	 */
-	add_cdev_sysfs_on_err = true;
+	expose_interfaces_on_err = true;
 
 	/* Device is now enabled as part of the initialization requires
 	 * communication with the device firmware to get information that
@@ -2241,15 +2249,13 @@ int hl_device_init(struct hl_device *hdev)
 	}
 
 	/*
-	 * Expose devices and sysfs nodes to user.
-	 * From here there is no need to add char devices and create sysfs nodes
-	 * in case of an error.
+	 * Expose devices and sysfs/debugfs files to user.
+	 * From here there is no need to expose them in case of an error.
 	 */
-	add_cdev_sysfs_on_err = false;
-	rc = device_cdev_sysfs_add(hdev);
+	expose_interfaces_on_err = false;
+	rc = cdev_sysfs_debugfs_add(hdev);
 	if (rc) {
-		dev_err(hdev->dev,
-			"Failed to add char devices and sysfs nodes\n");
+		dev_err(hdev->dev, "Failed to add char devices and sysfs/debugfs files\n");
 		rc = 0;
 		goto out_disabled;
 	}
@@ -2295,8 +2301,8 @@ int hl_device_init(struct hl_device *hdev)
 	if (hl_ctx_put(hdev->kernel_ctx) != 1)
 		dev_err(hdev->dev,
 			"kernel ctx is still alive on initialization failure\n");
-remove_device_from_debugfs:
-	hl_debugfs_remove_device(hdev);
+debugfs_device_fini:
+	hl_debugfs_device_fini(hdev);
 mmu_fini:
 	hl_mmu_fini(hdev);
 eq_fini:
@@ -2320,8 +2326,8 @@ int hl_device_init(struct hl_device *hdev)
 	put_device(hdev->dev);
 out_disabled:
 	hdev->disabled = true;
-	if (add_cdev_sysfs_on_err)
-		device_cdev_sysfs_add(hdev);
+	if (expose_interfaces_on_err)
+		cdev_sysfs_debugfs_add(hdev);
 	if (hdev->pdev)
 		dev_err(&hdev->pdev->dev,
 			"Failed to initialize hl%d. Device %s is NOT usable !\n",
@@ -2447,8 +2453,6 @@ void hl_device_fini(struct hl_device *hdev)
 	if ((hdev->kernel_ctx) && (hl_ctx_put(hdev->kernel_ctx) != 1))
 		dev_err(hdev->dev, "kernel ctx is still alive\n");
 
-	hl_debugfs_remove_device(hdev);
-
 	hl_dec_fini(hdev);
 
 	hl_vm_fini(hdev);
@@ -2473,8 +2477,10 @@ void hl_device_fini(struct hl_device *hdev)
 
 	device_early_fini(hdev);
 
-	/* Hide devices and sysfs nodes from user */
-	device_cdev_sysfs_del(hdev);
+	/* Hide devices and sysfs/debugfs files from user */
+	cdev_sysfs_debugfs_remove(hdev);
+
+	hl_debugfs_device_fini(hdev);
 
 	pr_info("removed device successfully\n");
 }
diff --git a/drivers/accel/habanalabs/common/habanalabs.h b/drivers/accel/habanalabs/common/habanalabs.h
index ad412cc01aba..ea0914d08bdc 100644
--- a/drivers/accel/habanalabs/common/habanalabs.h
+++ b/drivers/accel/habanalabs/common/habanalabs.h
@@ -3292,7 +3292,7 @@ struct hl_reset_info {
  * @in_debug: whether the device is in a state where the profiling/tracing infrastructure
  *            can be used. This indication is needed because in some ASICs we need to do
  *            specific operations to enable that infrastructure.
- * @cdev_sysfs_created: were char devices and sysfs nodes created.
+ * @cdev_sysfs_debugfs_created: were char devices and sysfs/debugfs files created.
  * @stop_on_err: true if engines should stop on error.
  * @supports_sync_stream: is sync stream supported.
  * @sync_stream_queue_idx: helper index for sync stream queues initialization.
@@ -3459,7 +3459,7 @@ struct hl_device {
 	u8				init_done;
 	u8				device_cpu_disabled;
 	u8				in_debug;
-	u8				cdev_sysfs_created;
+	u8				cdev_sysfs_debugfs_created;
 	u8				stop_on_err;
 	u8				supports_sync_stream;
 	u8				sync_stream_queue_idx;
@@ -3978,6 +3978,8 @@ void hl_handle_fw_err(struct hl_device *hdev, struct hl_info_fw_err_info *info);
 
 void hl_debugfs_init(void);
 void hl_debugfs_fini(void);
+int hl_debugfs_device_init(struct hl_device *hdev);
+void hl_debugfs_device_fini(struct hl_device *hdev);
 void hl_debugfs_add_device(struct hl_device *hdev);
 void hl_debugfs_remove_device(struct hl_device *hdev);
 void hl_debugfs_add_file(struct hl_fpriv *hpriv);
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH 4/6] accel/habanalabs: poll for device status update following WFE cmd
  2023-05-01  9:47 [PATCH 1/6] accel/habanalabs: add missing tpc interrupt info Oded Gabbay
  2023-05-01  9:47 ` [PATCH 2/6] accel/habanalabs: add pci health check during heartbeat Oded Gabbay
  2023-05-01  9:47 ` [PATCH 3/6] accel/habanalabs: expose debugfs files later Oded Gabbay
@ 2023-05-01  9:47 ` Oded Gabbay
  2023-05-01  9:47 ` [PATCH 5/6] accel/habanalabs: fix a static warning - 'dubious: x & !y' Oded Gabbay
  2023-05-01  9:47 ` [PATCH 6/6] accel/habanalabs: always fetch pci addr_dec error info Oded Gabbay
  4 siblings, 0 replies; 6+ messages in thread
From: Oded Gabbay @ 2023-05-01  9:47 UTC (permalink / raw)
  To: dri-devel; +Cc: Koby Elbaz

From: Koby Elbaz <kelbaz@habana.ai>

Currently, we rely on COMMS protocol's ack to verify that WFE command
has been acknowledged by the FW. However, this does not guarantee that
the device status has been updated.
Although unlikely, this could trigger a race since the driver expects
the device to be halted at that stage, but it might not be.
Therefore, we increase WFE's robustness by polling on the status
register that will be updated once the device is actually halted.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
---
 drivers/accel/habanalabs/common/firmware_if.c | 28 +++++++++++++++----
 1 file changed, 23 insertions(+), 5 deletions(-)

diff --git a/drivers/accel/habanalabs/common/firmware_if.c b/drivers/accel/habanalabs/common/firmware_if.c
index e48f024d8649..eb51d7f70aec 100644
--- a/drivers/accel/habanalabs/common/firmware_if.c
+++ b/drivers/accel/habanalabs/common/firmware_if.c
@@ -1368,8 +1368,10 @@ void hl_fw_ask_hard_reset_without_linux(struct hl_device *hdev)
 
 void hl_fw_ask_halt_machine_without_linux(struct hl_device *hdev)
 {
-	struct static_fw_load_mgr *static_loader =
-			&hdev->fw_loader.static_loader;
+	struct fw_load_mgr *fw_loader = &hdev->fw_loader;
+	u32 status, cpu_boot_status_reg, cpu_timeout;
+	struct static_fw_load_mgr *static_loader;
+	struct pre_fw_load_props *pre_fw_load;
 	int rc;
 
 	if (hdev->device_cpu_is_halted)
@@ -1377,12 +1379,28 @@ void hl_fw_ask_halt_machine_without_linux(struct hl_device *hdev)
 
 	/* Stop device CPU to make sure nothing bad happens */
 	if (hdev->asic_prop.dynamic_fw_load) {
+		pre_fw_load = &fw_loader->pre_fw_load;
+		cpu_timeout = fw_loader->cpu_timeout;
+		cpu_boot_status_reg = pre_fw_load->cpu_boot_status_reg;
+
 		rc = hl_fw_dynamic_send_protocol_cmd(hdev, &hdev->fw_loader,
-				COMMS_GOTO_WFE, 0, false,
-				hdev->fw_loader.cpu_timeout);
-		if (rc)
+				COMMS_GOTO_WFE, 0, false, cpu_timeout);
+		if (rc) {
 			dev_err(hdev->dev, "Failed sending COMMS_GOTO_WFE\n");
+		} else {
+			rc = hl_poll_timeout(
+				hdev,
+				cpu_boot_status_reg,
+				status,
+				status == CPU_BOOT_STATUS_IN_WFE,
+				hdev->fw_poll_interval_usec,
+				cpu_timeout);
+			if (rc)
+				dev_err(hdev->dev, "Current status=%u. Timed-out updating to WFE\n",
+						status);
+		}
 	} else {
+		static_loader = &hdev->fw_loader.static_loader;
 		WREG32(static_loader->kmd_msg_to_cpu_reg, KMD_MSG_GOTO_WFE);
 		msleep(static_loader->cpu_reset_wait_msec);
 
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH 5/6] accel/habanalabs: fix a static warning - 'dubious: x & !y'
  2023-05-01  9:47 [PATCH 1/6] accel/habanalabs: add missing tpc interrupt info Oded Gabbay
                   ` (2 preceding siblings ...)
  2023-05-01  9:47 ` [PATCH 4/6] accel/habanalabs: poll for device status update following WFE cmd Oded Gabbay
@ 2023-05-01  9:47 ` Oded Gabbay
  2023-05-01  9:47 ` [PATCH 6/6] accel/habanalabs: always fetch pci addr_dec error info Oded Gabbay
  4 siblings, 0 replies; 6+ messages in thread
From: Oded Gabbay @ 2023-05-01  9:47 UTC (permalink / raw)
  To: dri-devel; +Cc: Koby Elbaz

From: Koby Elbaz <kelbaz@habana.ai>

Use a straight forward approach to get a conditional result.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
---
 drivers/accel/habanalabs/gaudi2/gaudi2.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/accel/habanalabs/gaudi2/gaudi2.c b/drivers/accel/habanalabs/gaudi2/gaudi2.c
index 058741698d71..240fecfab608 100644
--- a/drivers/accel/habanalabs/gaudi2/gaudi2.c
+++ b/drivers/accel/habanalabs/gaudi2/gaudi2.c
@@ -4527,7 +4527,7 @@ static int gaudi2_set_tpc_engine_mode(struct hl_device *hdev, u32 engine_id, u32
 	reg_base = gaudi2_tpc_cfg_blocks_bases[tpc_id];
 	reg_addr = reg_base + TPC_CFG_STALL_OFFSET;
 	reg_val = FIELD_PREP(DCORE0_TPC0_CFG_TPC_STALL_V_MASK,
-			!!(engine_command == HL_ENGINE_STALL));
+			(engine_command == HL_ENGINE_STALL) ? 1 : 0);
 	WREG32(reg_addr, reg_val);
 
 	if (engine_command == HL_ENGINE_RESUME) {
@@ -4551,7 +4551,7 @@ static int gaudi2_set_mme_engine_mode(struct hl_device *hdev, u32 engine_id, u32
 	reg_base = gaudi2_mme_ctrl_lo_blocks_bases[mme_id];
 	reg_addr = reg_base + MME_CTRL_LO_QM_STALL_OFFSET;
 	reg_val = FIELD_PREP(DCORE0_MME_CTRL_LO_QM_STALL_V_MASK,
-			!!(engine_command == HL_ENGINE_STALL));
+			(engine_command == HL_ENGINE_STALL) ? 1 : 0);
 	WREG32(reg_addr, reg_val);
 
 	return 0;
@@ -4572,7 +4572,7 @@ static int gaudi2_set_edma_engine_mode(struct hl_device *hdev, u32 engine_id, u3
 	reg_base = gaudi2_dma_core_blocks_bases[edma_id];
 	reg_addr = reg_base + EDMA_CORE_CFG_STALL_OFFSET;
 	reg_val = FIELD_PREP(DCORE0_EDMA0_CORE_CFG_1_HALT_MASK,
-			!!(engine_command == HL_ENGINE_STALL));
+			(engine_command == HL_ENGINE_STALL) ? 1 : 0);
 	WREG32(reg_addr, reg_val);
 
 	if (engine_command == HL_ENGINE_STALL) {
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH 6/6] accel/habanalabs: always fetch pci addr_dec error info
  2023-05-01  9:47 [PATCH 1/6] accel/habanalabs: add missing tpc interrupt info Oded Gabbay
                   ` (3 preceding siblings ...)
  2023-05-01  9:47 ` [PATCH 5/6] accel/habanalabs: fix a static warning - 'dubious: x & !y' Oded Gabbay
@ 2023-05-01  9:47 ` Oded Gabbay
  4 siblings, 0 replies; 6+ messages in thread
From: Oded Gabbay @ 2023-05-01  9:47 UTC (permalink / raw)
  To: dri-devel; +Cc: Ofir Bitton

From: Ofir Bitton <obitton@habana.ai>

Due to missing indication of address decode source (LBW/HBW bus),
we should always try and fetch extended information.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
---
 drivers/accel/habanalabs/gaudi2/gaudi2.c | 14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/drivers/accel/habanalabs/gaudi2/gaudi2.c b/drivers/accel/habanalabs/gaudi2/gaudi2.c
index 240fecfab608..d21ef9997d05 100644
--- a/drivers/accel/habanalabs/gaudi2/gaudi2.c
+++ b/drivers/accel/habanalabs/gaudi2/gaudi2.c
@@ -8892,14 +8892,12 @@ static int gaudi2_print_pcie_addr_dec_info(struct hl_device *hdev, u16 event_typ
 			"err cause: %s", gaudi2_pcie_addr_dec_error_cause[i]);
 		error_count++;
 
-		switch (intr_cause_data & BIT_ULL(i)) {
-		case PCIE_WRAP_PCIE_IC_SEI_INTR_IND_AXI_LBW_ERR_INTR_MASK:
-			hl_check_for_glbl_errors(hdev);
-			break;
-		case PCIE_WRAP_PCIE_IC_SEI_INTR_IND_BAD_ACCESS_INTR_MASK:
-			gaudi2_print_pcie_mstr_rr_mstr_if_razwi_info(hdev, event_mask);
-			break;
-		}
+		/*
+		 * Always check for LBW and HBW additional info as the indication itself is
+		 * sometimes missing
+		 */
+		hl_check_for_glbl_errors(hdev);
+		gaudi2_print_pcie_mstr_rr_mstr_if_razwi_info(hdev, event_mask);
 	}
 
 	return error_count;
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2023-05-01  9:48 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-05-01  9:47 [PATCH 1/6] accel/habanalabs: add missing tpc interrupt info Oded Gabbay
2023-05-01  9:47 ` [PATCH 2/6] accel/habanalabs: add pci health check during heartbeat Oded Gabbay
2023-05-01  9:47 ` [PATCH 3/6] accel/habanalabs: expose debugfs files later Oded Gabbay
2023-05-01  9:47 ` [PATCH 4/6] accel/habanalabs: poll for device status update following WFE cmd Oded Gabbay
2023-05-01  9:47 ` [PATCH 5/6] accel/habanalabs: fix a static warning - 'dubious: x & !y' Oded Gabbay
2023-05-01  9:47 ` [PATCH 6/6] accel/habanalabs: always fetch pci addr_dec error info Oded Gabbay

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.