public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Oded Gabbay <ogabbay@kernel.org>
To: linux-kernel@vger.kernel.org
Subject: [PATCH 5/9] habanalabs/gaudi: increase default cs timeout to 10 minutes
Date: Wed, 20 Jul 2022 14:15:19 +0300	[thread overview]
Message-ID: <20220720111523.4069830-5-ogabbay@kernel.org> (raw)
In-Reply-To: <20220720111523.4069830-1-ogabbay@kernel.org>

In order to improve scalability and reduce host overhead, it is better
to increase the default TDR timeout of Gaudi1 from 30 seconds to
10 minutes.

This will allow the DL Framework (e.g. PyTorch, TensorFlow) to remove
the host sync they are using now and improve overall performance on
scaleout training.

Note that one can always set the timeout to a custom value via
a kernel module parameter given during driver load.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
---
 .../misc/habanalabs/common/habanalabs_drv.c   | 23 +++++++++++++++----
 1 file changed, 18 insertions(+), 5 deletions(-)

diff --git a/drivers/misc/habanalabs/common/habanalabs_drv.c b/drivers/misc/habanalabs/common/habanalabs_drv.c
index f733ead605e7..d59d8cdf33e6 100644
--- a/drivers/misc/habanalabs/common/habanalabs_drv.c
+++ b/drivers/misc/habanalabs/common/habanalabs_drv.c
@@ -27,7 +27,10 @@ static struct class *hl_class;
 static DEFINE_IDR(hl_devs_idr);
 static DEFINE_MUTEX(hl_devs_idr_lock);
 
-static int timeout_locked = 30;
+#define HL_DEFAULT_TIMEOUT_LOCKED	30	/* 30 seconds */
+#define GAUDI_DEFAULT_TIMEOUT_LOCKED	600	/* 10 minutes */
+
+static int timeout_locked = HL_DEFAULT_TIMEOUT_LOCKED;
 static int reset_on_lockup = 1;
 static int memory_scrub;
 static ulong boot_error_status_mask = ULONG_MAX;
@@ -314,12 +317,22 @@ static void copy_kernel_module_params_to_device(struct hl_device *hdev)
 	hdev->boot_error_status_mask = boot_error_status_mask;
 }
 
-static void fixup_device_params_per_asic(struct hl_device *hdev)
+static void fixup_device_params_per_asic(struct hl_device *hdev, int timeout)
 {
 	switch (hdev->asic_type) {
-	case ASIC_GOYA:
 	case ASIC_GAUDI:
 	case ASIC_GAUDI_SEC:
+		/* If user didn't request a different timeout than the default one, we have
+		 * a different default timeout for Gaudi
+		 */
+		if (timeout == HL_DEFAULT_TIMEOUT_LOCKED)
+			hdev->timeout_jiffies = msecs_to_jiffies(GAUDI_DEFAULT_TIMEOUT_LOCKED *
+										MSEC_PER_SEC);
+
+		hdev->reset_upon_device_release = 0;
+		break;
+
+	case ASIC_GOYA:
 		hdev->reset_upon_device_release = 0;
 		break;
 
@@ -339,7 +352,7 @@ static int fixup_device_params(struct hl_device *hdev)
 	hdev->fw_comms_poll_interval_usec = HL_FW_STATUS_POLL_INTERVAL_USEC;
 
 	if (tmp_timeout)
-		hdev->timeout_jiffies = msecs_to_jiffies(tmp_timeout * 1000);
+		hdev->timeout_jiffies = msecs_to_jiffies(tmp_timeout * MSEC_PER_SEC);
 	else
 		hdev->timeout_jiffies = MAX_SCHEDULE_TIMEOUT;
 
@@ -360,7 +373,7 @@ static int fixup_device_params(struct hl_device *hdev)
 	if (!hdev->cpu_queues_enable)
 		hdev->heartbeat = 0;
 
-	fixup_device_params_per_asic(hdev);
+	fixup_device_params_per_asic(hdev, tmp_timeout);
 
 	return 0;
 }
-- 
2.25.1


  parent reply	other threads:[~2022-07-20 11:15 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-07-20 11:15 [PATCH 1/9] habanalabs: rename non_hard_reset to compute_reset Oded Gabbay
2022-07-20 11:15 ` [PATCH 2/9] habanalabs/gaudi2: enable all MMU SPI/SEI interrupts Oded Gabbay
2022-07-20 11:15 ` [PATCH 3/9] habanalabs: set gnu_printf attribute for hl_engine_data_sprintf Oded Gabbay
2022-07-20 11:15 ` [PATCH 4/9] habanalabs: add return code field to module iterator Oded Gabbay
2022-07-20 11:15 ` Oded Gabbay [this message]
2022-07-20 11:15 ` [PATCH 6/9] habanalabs/gaudi2: remove old interrupt mappings Oded Gabbay
2022-07-20 11:15 ` [PATCH 7/9] habanalabs: fix spelling mistakes Oded Gabbay
2022-07-20 11:15 ` [PATCH 8/9] habanalabs: wrap macro arg with parentheses Oded Gabbay
2022-07-20 11:15 ` [PATCH 9/9] habanalabs: remove all kdma locks Oded Gabbay

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220720111523.4069830-5-ogabbay@kernel.org \
    --to=ogabbay@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox