public inbox for linux-kernel@vger.kernel.org
* [PATCH] habanalabs: fetch hard reset capability from FW
@ 2020-11-11 19:59 Oded Gabbay
  2020-11-11 19:59 ` [PATCH] habanalabs/gaudi: fetch HBM ecc info " Oded Gabbay
  2020-11-11 19:59 ` [PATCH] habanalabs: print message with correct device Oded Gabbay
  0 siblings, 2 replies; 3+ messages in thread
From: Oded Gabbay @ 2020-11-11 19:59 UTC (permalink / raw)
  To: linux-kernel; +Cc: SW_Drivers, Ofir Bitton

From: Ofir Bitton <obitton@habana.ai>

The driver must fetch the FW hard-reset capability at boot time, so that
it can skip its own hard-reset flow when the firmware handles that flow.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
---
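Note for readers: the capability check in hl_fw_init_cpu() can be
sketched in user space as follows. This is only an illustration under
stated assumptions: struct fw_props and fetch_hard_reset_cap() are
hypothetical stand-ins for the relevant asic_fixed_properties fields and
the driver code, and the sts0 argument stands in for the value that
RREG32(cpu_security_boot_status_reg) would return. The bit position is
the one defined in hl_boot_if.h by this patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit position as defined in hl_boot_if.h by this patch. */
#define CPU_BOOT_DEV_STS0_FW_HARD_RST_EN	(1u << 10)

/* Hypothetical stand-in for the relevant asic_fixed_properties fields. */
struct fw_props {
	bool fw_security_status_valid;
	uint32_t fw_app_security_map;
	bool hard_reset_done_by_fw;
};

/*
 * Hypothetical helper mirroring the logic of the hunk below.
 * sts0 is an assumption: it replaces the register read done by the driver.
 */
static void fetch_hard_reset_cap(struct fw_props *prop, uint32_t sts0)
{
	/* Default set by the ASIC-specific get_fixed_properties(). */
	prop->hard_reset_done_by_fw = false;

	/* The status bits are only meaningful when flagged as valid. */
	if (!prop->fw_security_status_valid)
		return;

	prop->fw_app_security_map = sts0;

	if (sts0 & CPU_BOOT_DEV_STS0_FW_HARD_RST_EN)
		prop->hard_reset_done_by_fw = true;
}
```

With fw_security_status_valid left at false, as both gaudi.c and goya.c
set it in this patch, the flag stays false regardless of the register
contents.
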
 drivers/misc/habanalabs/common/firmware_if.c  | 10 ++++++-
 drivers/misc/habanalabs/common/habanalabs.h   |  2 ++
 drivers/misc/habanalabs/gaudi/gaudi.c         |  1 +
 drivers/misc/habanalabs/goya/goya.c           |  1 +
 .../habanalabs/include/common/hl_boot_if.h    | 30 ++++++++++++-------
 5 files changed, 32 insertions(+), 12 deletions(-)

diff --git a/drivers/misc/habanalabs/common/firmware_if.c b/drivers/misc/habanalabs/common/firmware_if.c
index 2fc12e529241..b5464cd34071 100644
--- a/drivers/misc/habanalabs/common/firmware_if.c
+++ b/drivers/misc/habanalabs/common/firmware_if.c
@@ -784,10 +784,18 @@ int hl_fw_init_cpu(struct hl_device *hdev, u32 cpu_boot_status_reg,
 	}
 
 	/* Read FW application security bits */
-	if (hdev->asic_prop.fw_security_status_valid)
+	if (hdev->asic_prop.fw_security_status_valid) {
 		hdev->asic_prop.fw_app_security_map =
 				RREG32(cpu_security_boot_status_reg);
 
+		if (hdev->asic_prop.fw_app_security_map &
+				CPU_BOOT_DEV_STS0_FW_HARD_RST_EN)
+			hdev->asic_prop.hard_reset_done_by_fw = true;
+	}
+
+	dev_info(hdev->dev, "Firmware hard-reset is %s\n",
+		hdev->asic_prop.hard_reset_done_by_fw ? "enabled" : "disabled");
+
 	dev_info(hdev->dev, "Successfully loaded firmware to device\n");
 
 out:
diff --git a/drivers/misc/habanalabs/common/habanalabs.h b/drivers/misc/habanalabs/common/habanalabs.h
index a1d82de60ef6..eeb78381177b 100644
--- a/drivers/misc/habanalabs/common/habanalabs.h
+++ b/drivers/misc/habanalabs/common/habanalabs.h
@@ -412,6 +412,7 @@ struct hl_mmu_properties {
  * @fw_security_status_valid: security status bits are valid and can be fetched
  *                            from BOOT_DEV_STS0
  * @dram_supports_virtual_memory: is there an MMU towards the DRAM
+ * @hard_reset_done_by_fw: true if firmware is handling hard reset flow
  */
 struct asic_fixed_properties {
 	struct hw_queue_properties	*hw_queues_props;
@@ -469,6 +470,7 @@ struct asic_fixed_properties {
 	u8				fw_security_disabled;
 	u8				fw_security_status_valid;
 	u8				dram_supports_virtual_memory;
+	u8				hard_reset_done_by_fw;
 };
 
 /**
diff --git a/drivers/misc/habanalabs/gaudi/gaudi.c b/drivers/misc/habanalabs/gaudi/gaudi.c
index 8d6cffd28381..6d54a4574284 100644
--- a/drivers/misc/habanalabs/gaudi/gaudi.c
+++ b/drivers/misc/habanalabs/gaudi/gaudi.c
@@ -518,6 +518,7 @@ static int gaudi_get_fixed_properties(struct hl_device *hdev)
 	/* disable fw security for now, set it in a later stage */
 	prop->fw_security_disabled = true;
 	prop->fw_security_status_valid = false;
+	prop->hard_reset_done_by_fw = false;
 
 	return 0;
 }
diff --git a/drivers/misc/habanalabs/goya/goya.c b/drivers/misc/habanalabs/goya/goya.c
index bf21f05f7849..3398b4cc1298 100644
--- a/drivers/misc/habanalabs/goya/goya.c
+++ b/drivers/misc/habanalabs/goya/goya.c
@@ -458,6 +458,7 @@ int goya_get_fixed_properties(struct hl_device *hdev)
 	/* disable fw security for now, set it in a later stage */
 	prop->fw_security_disabled = true;
 	prop->fw_security_status_valid = false;
+	prop->hard_reset_done_by_fw = false;
 
 	return 0;
 }
diff --git a/drivers/misc/habanalabs/include/common/hl_boot_if.h b/drivers/misc/habanalabs/include/common/hl_boot_if.h
index d928ad93cd80..60916780df35 100644
--- a/drivers/misc/habanalabs/include/common/hl_boot_if.h
+++ b/drivers/misc/habanalabs/include/common/hl_boot_if.h
@@ -84,45 +84,52 @@
  *					device is indicated as security enabled,
  *					registers are protected, and device
  *					uses keys for image verification.
- *					Initialized at: preboot
+ *					Initialized in: preboot
  *
  * CPU_BOOT_DEV_STS0_DEBUG_EN		Debug is enabled.
  *					Enabled when JTAG or DEBUG is enabled
  *					in FW.
- *					Initialized at: preboot
+ *					Initialized in: preboot
  *
  * CPU_BOOT_DEV_STS0_WATCHDOG_EN	Watchdog is enabled.
  *					Watchdog is enabled in FW.
- *					Initialized at: preboot
+ *					Initialized in: preboot
  *
  * CPU_BOOT_DEV_STS0_DRAM_INIT_EN	DRAM initialization is enabled.
  *					DRAM initialization has been done in FW.
- *					Initialized at: u-boot
+ *					Initialized in: u-boot
  *
  * CPU_BOOT_DEV_STS0_BMC_WAIT_EN	Waiting for BMC data enabled.
  *					If set, it means that during boot,
  *					FW waited for BMC data.
- *					Initialized at: u-boot
+ *					Initialized in: u-boot
  *
  * CPU_BOOT_DEV_STS0_E2E_CRED_EN	E2E credits initialized.
  *					FW initialized E2E credits.
- *					Initialized at: u-boot
+ *					Initialized in: u-boot
  *
  * CPU_BOOT_DEV_STS0_HBM_CRED_EN	HBM credits initialized.
  *					FW initialized HBM credits.
- *					Initialized at: u-boot
+ *					Initialized in: u-boot
  *
  * CPU_BOOT_DEV_STS0_RL_EN		Rate limiter initialized.
  *					FW initialized rate limiter.
- *					Initialized at: u-boot
+ *					Initialized in: u-boot
  *
  * CPU_BOOT_DEV_STS0_SRAM_SCR_EN	SRAM scrambler enabled.
  *					FW initialized SRAM scrambler.
- *					Initialized at: linux
+ *					Initialized in: linux
  *
  * CPU_BOOT_DEV_STS0_DRAM_SCR_EN	DRAM scrambler enabled.
  *					FW initialized DRAM scrambler.
- *					Initialized at: u-boot
+ *					Initialized in: u-boot
+ *
+ * CPU_BOOT_DEV_STS0_FW_HARD_RST_EN	FW hard reset procedure is enabled.
+ *					FW has the hard reset procedure
+ *					implemented. This means that FW will
+ *					perform hard reset procedure on
+ *					receiving the halt-machine event.
+ *					Initialized in: linux
  *
  * CPU_BOOT_DEV_STS0_ENABLED		Device status register enabled.
  *					This is a main indication that the
@@ -130,7 +137,7 @@
  *					register. Meaning the device status
  *					bits are not garbage, but actual
  *					statuses.
- *					Initialized at: preboot
+ *					Initialized in: preboot
  */
 #define CPU_BOOT_DEV_STS0_SECURITY_EN			(1 << 0)
 #define CPU_BOOT_DEV_STS0_DEBUG_EN			(1 << 1)
@@ -142,6 +149,7 @@
 #define CPU_BOOT_DEV_STS0_RL_EN				(1 << 7)
 #define CPU_BOOT_DEV_STS0_SRAM_SCR_EN			(1 << 8)
 #define CPU_BOOT_DEV_STS0_DRAM_SCR_EN			(1 << 9)
+#define CPU_BOOT_DEV_STS0_FW_HARD_RST_EN		(1 << 10)
 #define CPU_BOOT_DEV_STS0_ENABLED			(1 << 31)
 
 enum cpu_boot_status {
-- 
2.17.1



* [PATCH] habanalabs/gaudi: fetch HBM ecc info from FW
  2020-11-11 19:59 [PATCH] habanalabs: fetch hard reset capability from FW Oded Gabbay
@ 2020-11-11 19:59 ` Oded Gabbay
  2020-11-11 19:59 ` [PATCH] habanalabs: print message with correct device Oded Gabbay
  1 sibling, 0 replies; 3+ messages in thread
From: Oded Gabbay @ 2020-11-11 19:59 UTC (permalink / raw)
  To: linux-kernel; +Cc: SW_Drivers, Ofir Bitton

From: Ofir Bitton <obitton@habana.ai>

Once FW security is enabled, the driver has no access to the HBM ECC
registers and must read the values from the FW through a dedicated
interface.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
---
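Note for readers: the layout of hbm_ecc_info can be illustrated with a
small user-space decoder. This is a sketch under assumptions, not the
driver code: field_get(), struct hbm_ecc_fields and
decode_hbm_ecc_info() are hypothetical names, the helper replaces the
kernel's FIELD_GET() by using the explicit shift constants, and the
input is assumed to be already in CPU byte order (the field is declared
__le32 in cpucp_if.h). The mask and shift values are copied from this
patch.

```c
#include <stdint.h>

/* Mask and shift values copied from cpucp_if.h in this patch. */
#define CPUCP_PKT_HBM_ECC_INFO_WR_PAR_SHIFT	0
#define CPUCP_PKT_HBM_ECC_INFO_WR_PAR_MASK	0x00000001
#define CPUCP_PKT_HBM_ECC_INFO_RD_PAR_SHIFT	1
#define CPUCP_PKT_HBM_ECC_INFO_RD_PAR_MASK	0x00000002
#define CPUCP_PKT_HBM_ECC_INFO_CA_PAR_SHIFT	2
#define CPUCP_PKT_HBM_ECC_INFO_CA_PAR_MASK	0x00000004
#define CPUCP_PKT_HBM_ECC_INFO_DERR_SHIFT	3
#define CPUCP_PKT_HBM_ECC_INFO_DERR_MASK	0x00000008
#define CPUCP_PKT_HBM_ECC_INFO_SERR_SHIFT	4
#define CPUCP_PKT_HBM_ECC_INFO_SERR_MASK	0x00000010
#define CPUCP_PKT_HBM_ECC_INFO_TYPE_SHIFT	5
#define CPUCP_PKT_HBM_ECC_INFO_TYPE_MASK	0x00000020
#define CPUCP_PKT_HBM_ECC_INFO_HBM_CH_SHIFT	6
#define CPUCP_PKT_HBM_ECC_INFO_HBM_CH_MASK	0x000007C0

/* Hypothetical container for the decoded fields. */
struct hbm_ecc_fields {
	uint32_t wr_par, rd_par, ca_par, derr, serr, type, ch;
};

/* Hypothetical user-space replacement for the kernel's FIELD_GET(). */
static uint32_t field_get(uint32_t mask, unsigned int shift, uint32_t val)
{
	return (val & mask) >> shift;
}

/* info is assumed to be in CPU byte order already. */
static struct hbm_ecc_fields decode_hbm_ecc_info(uint32_t info)
{
	struct hbm_ecc_fields f;

	f.wr_par = field_get(CPUCP_PKT_HBM_ECC_INFO_WR_PAR_MASK,
			CPUCP_PKT_HBM_ECC_INFO_WR_PAR_SHIFT, info);
	f.rd_par = field_get(CPUCP_PKT_HBM_ECC_INFO_RD_PAR_MASK,
			CPUCP_PKT_HBM_ECC_INFO_RD_PAR_SHIFT, info);
	f.ca_par = field_get(CPUCP_PKT_HBM_ECC_INFO_CA_PAR_MASK,
			CPUCP_PKT_HBM_ECC_INFO_CA_PAR_SHIFT, info);
	f.derr = field_get(CPUCP_PKT_HBM_ECC_INFO_DERR_MASK,
			CPUCP_PKT_HBM_ECC_INFO_DERR_SHIFT, info);
	f.serr = field_get(CPUCP_PKT_HBM_ECC_INFO_SERR_MASK,
			CPUCP_PKT_HBM_ECC_INFO_SERR_SHIFT, info);
	f.type = field_get(CPUCP_PKT_HBM_ECC_INFO_TYPE_MASK,
			CPUCP_PKT_HBM_ECC_INFO_TYPE_SHIFT, info);
	f.ch = field_get(CPUCP_PKT_HBM_ECC_INFO_HBM_CH_MASK,
			CPUCP_PKT_HBM_ECC_INFO_HBM_CH_SHIFT, info);

	return f;
}
```

For example, an info word of 0xD1 decodes to WR_PAR=1, SERR=1 and
channel 3, with all other fields zero.
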
 drivers/misc/habanalabs/gaudi/gaudi.c         | 47 ++++++++++++++++---
 .../misc/habanalabs/include/common/cpucp_if.h | 32 +++++++++++++
 2 files changed, 72 insertions(+), 7 deletions(-)

diff --git a/drivers/misc/habanalabs/gaudi/gaudi.c b/drivers/misc/habanalabs/gaudi/gaudi.c
index 6d54a4574284..bf34ca29e42b 100644
--- a/drivers/misc/habanalabs/gaudi/gaudi.c
+++ b/drivers/misc/habanalabs/gaudi/gaudi.c
@@ -6850,10 +6850,41 @@ static int gaudi_soft_reset_late_init(struct hl_device *hdev)
 	return hl_fw_unmask_irq_arr(hdev, gaudi->events, sizeof(gaudi->events));
 }
 
-static int gaudi_hbm_read_interrupts(struct hl_device *hdev, int device)
+static int gaudi_hbm_read_interrupts(struct hl_device *hdev, int device,
+			struct hl_eq_hbm_ecc_data *hbm_ecc_data)
 {
-	int ch, err = 0;
-	u32 base, val, val2;
+	u32 base, val, val2, wr_par, rd_par, ca_par, derr, serr, type, ch;
+	int err = 0;
+
+	if (!hdev->asic_prop.fw_security_disabled) {
+		if (!hbm_ecc_data) {
+			dev_err(hdev->dev, "No FW ECC data\n");
+			return 0;
+		}
+
+		wr_par = FIELD_GET(CPUCP_PKT_HBM_ECC_INFO_WR_PAR_MASK,
+				hbm_ecc_data->hbm_ecc_info);
+		rd_par = FIELD_GET(CPUCP_PKT_HBM_ECC_INFO_RD_PAR_MASK,
+				hbm_ecc_data->hbm_ecc_info);
+		ca_par = FIELD_GET(CPUCP_PKT_HBM_ECC_INFO_CA_PAR_MASK,
+				hbm_ecc_data->hbm_ecc_info);
+		derr = FIELD_GET(CPUCP_PKT_HBM_ECC_INFO_DERR_MASK,
+				hbm_ecc_data->hbm_ecc_info);
+		serr = FIELD_GET(CPUCP_PKT_HBM_ECC_INFO_SERR_MASK,
+				hbm_ecc_data->hbm_ecc_info);
+		type = FIELD_GET(CPUCP_PKT_HBM_ECC_INFO_TYPE_MASK,
+				hbm_ecc_data->hbm_ecc_info);
+		ch = FIELD_GET(CPUCP_PKT_HBM_ECC_INFO_HBM_CH_MASK,
+				hbm_ecc_data->hbm_ecc_info);
+
+		dev_err(hdev->dev,
+			"HBM%d pc%d interrupts info: WR_PAR=%d, RD_PAR=%d, CA_PAR=%d, SERR=%d, DERR=%d\n",
+			device, ch, wr_par, rd_par, ca_par, serr, derr);
+
+		err = 1;
+
+		return 0;
+	}
 
 	base = GAUDI_HBM_CFG_BASE + device * GAUDI_HBM_CFG_OFFSET;
 	for (ch = 0 ; ch < GAUDI_HBM_CHANNELS ; ch++) {
@@ -6869,7 +6900,7 @@ static int gaudi_hbm_read_interrupts(struct hl_device *hdev, int device)
 
 			val2 = RREG32(base + ch * 0x1000 + 0x060);
 			dev_err(hdev->dev,
-				"HBM%d pc%d ECC info: 1ST_ERR_ADDR=0x%x, 1ST_ERR_TYPE=%d, SEC_CONT_CNT=%d, SEC_CNT=%d, DED_CNT=%d\n",
+				"HBM%d pc%d ECC info: 1ST_ERR_ADDR=0x%x, 1ST_ERR_TYPE=%d, SEC_CONT_CNT=%d, SEC_CNT=%d, DEC_CNT=%d\n",
 				device, ch * 2,
 				RREG32(base + ch * 0x1000 + 0x064),
 				(val2 & 0x200) >> 9, (val2 & 0xFC00) >> 10,
@@ -6889,7 +6920,7 @@ static int gaudi_hbm_read_interrupts(struct hl_device *hdev, int device)
 
 			val2 = RREG32(base + ch * 0x1000 + 0x070);
 			dev_err(hdev->dev,
-				"HBM%d pc%d ECC info: 1ST_ERR_ADDR=0x%x, 1ST_ERR_TYPE=%d, SEC_CONT_CNT=%d, SEC_CNT=%d, DED_CNT=%d\n",
+				"HBM%d pc%d ECC info: 1ST_ERR_ADDR=0x%x, 1ST_ERR_TYPE=%d, SEC_CONT_CNT=%d, SEC_CNT=%d, DEC_CNT=%d\n",
 				device, ch * 2 + 1,
 				RREG32(base + ch * 0x1000 + 0x074),
 				(val2 & 0x200) >> 9, (val2 & 0xFC00) >> 10,
@@ -7090,7 +7121,8 @@ static void gaudi_handle_eqe(struct hl_device *hdev,
 	case GAUDI_EVENT_HBM3_SPI_0:
 		gaudi_print_irq_info(hdev, event_type, false);
 		gaudi_hbm_read_interrupts(hdev,
-					  gaudi_hbm_event_to_dev(event_type));
+				gaudi_hbm_event_to_dev(event_type),
+				&eq_entry->hbm_ecc_data);
 		if (hdev->hard_reset_on_fw_events)
 			hl_device_reset(hdev, true, false);
 		break;
@@ -7101,7 +7133,8 @@ static void gaudi_handle_eqe(struct hl_device *hdev,
 	case GAUDI_EVENT_HBM3_SPI_1:
 		gaudi_print_irq_info(hdev, event_type, false);
 		gaudi_hbm_read_interrupts(hdev,
-					  gaudi_hbm_event_to_dev(event_type));
+				gaudi_hbm_event_to_dev(event_type),
+				&eq_entry->hbm_ecc_data);
 		break;
 
 	case GAUDI_EVENT_TPC0_DEC:
diff --git a/drivers/misc/habanalabs/include/common/cpucp_if.h b/drivers/misc/habanalabs/include/common/cpucp_if.h
index 1c1e2b394457..759c068b2b7a 100644
--- a/drivers/misc/habanalabs/include/common/cpucp_if.h
+++ b/drivers/misc/habanalabs/include/common/cpucp_if.h
@@ -11,6 +11,37 @@
 #include <linux/types.h>
 #include <linux/if_ether.h>
 
+#define NUM_HBM_PSEUDO_CH				2
+#define NUM_HBM_CH_PER_DEV				8
+#define CPUCP_PKT_HBM_ECC_INFO_WR_PAR_SHIFT		0
+#define CPUCP_PKT_HBM_ECC_INFO_WR_PAR_MASK		0x00000001
+#define CPUCP_PKT_HBM_ECC_INFO_RD_PAR_SHIFT		1
+#define CPUCP_PKT_HBM_ECC_INFO_RD_PAR_MASK		0x00000002
+#define CPUCP_PKT_HBM_ECC_INFO_CA_PAR_SHIFT		2
+#define CPUCP_PKT_HBM_ECC_INFO_CA_PAR_MASK		0x00000004
+#define CPUCP_PKT_HBM_ECC_INFO_DERR_SHIFT		3
+#define CPUCP_PKT_HBM_ECC_INFO_DERR_MASK		0x00000008
+#define CPUCP_PKT_HBM_ECC_INFO_SERR_SHIFT		4
+#define CPUCP_PKT_HBM_ECC_INFO_SERR_MASK		0x00000010
+#define CPUCP_PKT_HBM_ECC_INFO_TYPE_SHIFT		5
+#define CPUCP_PKT_HBM_ECC_INFO_TYPE_MASK		0x00000020
+#define CPUCP_PKT_HBM_ECC_INFO_HBM_CH_SHIFT		6
+#define CPUCP_PKT_HBM_ECC_INFO_HBM_CH_MASK		0x000007C0
+
+struct hl_eq_hbm_ecc_data {
+	/* SERR counter */
+	__le32 sec_cnt;
+	/* DERR counter */
+	__le32 dec_cnt;
+	/* Supplemental Information according to the mask bits */
+	__le32 hbm_ecc_info;
+	/* Address in hbm where the ecc happened */
+	__le32 first_addr;
+	/* SERR continuous address counter */
+	__le32 sec_cont_cnt;
+	__le32 pad;
+};
+
 /*
  * EVENT QUEUE
  */
@@ -31,6 +62,7 @@ struct hl_eq_entry {
 	struct hl_eq_header hdr;
 	union {
 		struct hl_eq_ecc_data ecc_data;
+		struct hl_eq_hbm_ecc_data hbm_ecc_data;
 		__le64 data[7];
 	};
 };
-- 
2.17.1



* [PATCH] habanalabs: print message with correct device
  2020-11-11 19:59 [PATCH] habanalabs: fetch hard reset capability from FW Oded Gabbay
  2020-11-11 19:59 ` [PATCH] habanalabs/gaudi: fetch HBM ecc info " Oded Gabbay
@ 2020-11-11 19:59 ` Oded Gabbay
  1 sibling, 0 replies; 3+ messages in thread
From: Oded Gabbay @ 2020-11-11 19:59 UTC (permalink / raw)
  To: linux-kernel; +Cc: SW_Drivers

During hard-reset, the driver rejects further IOCTL calls and prints
an error message. That error message should be printed with the correct
device instead of using only the control device.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
---
 drivers/misc/habanalabs/common/habanalabs_ioctl.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/misc/habanalabs/common/habanalabs_ioctl.c b/drivers/misc/habanalabs/common/habanalabs_ioctl.c
index 0729cd43f297..ba8217fc9425 100644
--- a/drivers/misc/habanalabs/common/habanalabs_ioctl.c
+++ b/drivers/misc/habanalabs/common/habanalabs_ioctl.c
@@ -573,7 +573,7 @@ static long _hl_ioctl(struct file *filep, unsigned int cmd, unsigned long arg,
 	int retcode;
 
 	if (hdev->hard_reset_pending) {
-		dev_crit_ratelimited(hdev->dev_ctrl,
+		dev_crit_ratelimited(dev,
 			"Device HARD reset pending! Please close FD\n");
 		return -ENODEV;
 	}
-- 
2.17.1

