[PATCH 1/2] drm/amdgpu: Warn when bad pages approaches 90% threshold

AMD-GFX Archive on lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 1/2] drm/amdgpu: Warn when bad pages approaches 90% threshold
@ 2021-10-21 15:57 Kent Russell
  2021-10-21 15:57 ` [PATCH 2/2] drm/amdgpu: Add kernel parameter support for ignoring bad page threshold Kent Russell
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Kent Russell @ 2021-10-21 15:57 UTC (permalink / raw)
  To: amd-gfx; +Cc: Kent Russell, Luben Tuikov, Mukul Joshi

dmesg doesn't warn when the number of bad pages approaches the
threshold for page retirement. WARN when the number of bad pages
is at 90% or greater for easier checks and planning, instead of waiting
until the GPU is full of bad pages.

Cc: Luben Tuikov <luben.tuikov@amd.com>
Cc: Mukul Joshi <Mukul.Joshi@amd.com>
Signed-off-by: Kent Russell <kent.russell@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index f4c05ff4b26c..ce5089216474 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -1077,6 +1077,12 @@ int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
 		if (res)
 			DRM_ERROR("RAS table incorrect checksum or error:%d\n",
 				  res);
+
+		/* Warn if we are at 90% of the threshold or above */
+		if ((10 * control->ras_num_recs) >= (ras->bad_page_cnt_threshold * 9))
+			DRM_WARN("RAS records:%u exceeds 90%% of threshold:%d",
+					control->ras_num_recs,
+					ras->bad_page_cnt_threshold);
 	} else if (hdr->header == RAS_TABLE_HDR_BAD &&
 		   amdgpu_bad_page_threshold != 0) {
 		res = __verify_ras_table_checksum(control);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 2/2] drm/amdgpu: Add kernel parameter support for ignoring bad page threshold
  2021-10-21 15:57 [PATCH 1/2] drm/amdgpu: Warn when bad pages approaches 90% threshold Kent Russell
@ 2021-10-21 15:57 ` Kent Russell
  2021-10-21 16:21   ` Luben Tuikov
  2021-10-21 16:31 ` [PATCH 1/2] drm/amdgpu: Warn when bad pages approaches 90% threshold Lazar, Lijo
  2021-10-21 16:35 ` Luben Tuikov
  2 siblings, 1 reply; 14+ messages in thread
From: Kent Russell @ 2021-10-21 15:57 UTC (permalink / raw)
  To: amd-gfx; +Cc: Kent Russell, Luben Tuikov, Mukul Joshi, Felix Kuehling

When a GPU hits the bad_page_threshold, it will not be initialized by
the amdgpu driver. This means that the table cannot be cleared, nor can
information gathering be performed (getting serial number, BDF, etc).

If the bad_page_threshold kernel parameter is set to -2,
continue to initialize the GPU, while printing a warning to dmesg that
this action has been done

Cc: Luben Tuikov <luben.tuikov@amd.com>
Cc: Mukul Joshi <Mukul.Joshi@amd.com>
Signed-off-by: Kent Russell <kent.russell@amd.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h            |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c        |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 12 ++++++++----
 3 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index d58e37fd01f4..b85b67a88a3d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -205,6 +205,7 @@ extern struct amdgpu_mgpu_info mgpu_info;
 extern int amdgpu_ras_enable;
 extern uint amdgpu_ras_mask;
 extern int amdgpu_bad_page_threshold;
+extern bool amdgpu_ignore_bad_page_threshold;
 extern struct amdgpu_watchdog_timer amdgpu_watchdog_timer;
 extern int amdgpu_async_gfx_ring;
 extern int amdgpu_mcbp;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 96bd63aeeddd..eee3cf874e7a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -877,7 +877,7 @@ module_param_named(reset_method, amdgpu_reset_method, int, 0444);
  * result in the GPU entering bad status when the number of total
  * faulty pages by ECC exceeds the threshold value.
  */
-MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 = auto(default value), 0 = disable bad page retirement)");
+MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 = auto(default value), 0 = disable bad page retirement, -2 = ignore bad page threshold)");
 module_param_named(bad_page_threshold, amdgpu_bad_page_threshold, int, 0444);
 
 MODULE_PARM_DESC(num_kcq, "number of kernel compute queue user want to setup (8 if set to greater than 8 or less than 0, only affect gfx 8+)");
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index ce5089216474..bd6ed43b0df2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -1104,11 +1104,15 @@ int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
 			res = amdgpu_ras_eeprom_correct_header_tag(control,
 								   RAS_TABLE_HDR_VAL);
 		} else {
-			*exceed_err_limit = true;
-			dev_err(adev->dev,
-				"RAS records:%d exceed threshold:%d, "
-				"GPU will not be initialized. Replace this GPU or increase the threshold",
+			dev_err(adev->dev, "RAS records:%d exceed threshold:%d",
 				control->ras_num_recs, ras->bad_page_cnt_threshold);
+			if (amdgpu_bad_page_threshold == -2) {
+				dev_warn(adev->dev, "GPU will be initialized due to bad_page_threshold = -2.");
+				res = 0;
+			} else {
+				*exceed_err_limit = true;
+				dev_err(adev->dev, "GPU will not be initialized. Replace this GPU or increase the threshold.");
+			}
 		}
 	} else {
 		DRM_INFO("Creating a new EEPROM table");
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] drm/amdgpu: Add kernel parameter support for ignoring bad page threshold
  2021-10-21 15:57 ` [PATCH 2/2] drm/amdgpu: Add kernel parameter support for ignoring bad page threshold Kent Russell
@ 2021-10-21 16:21   ` Luben Tuikov
  2021-10-21 16:42     ` Russell, Kent
  0 siblings, 1 reply; 14+ messages in thread
From: Luben Tuikov @ 2021-10-21 16:21 UTC (permalink / raw)
  To: Kent Russell, amd-gfx; +Cc: Mukul Joshi, Felix Kuehling, Tuikov, Luben

On 2021-10-21 11:57, Kent Russell wrote:
> When a GPU hits the bad_page_threshold, it will not be initialized by
> the amdgpu driver. This means that the table cannot be cleared, nor can
> information gathering be performed (getting serial number, BDF, etc).
>
> If the bad_page_threshold kernel parameter is set to -2,
> continue to initialize the GPU, while printing a warning to dmesg that
> this action has been done
>
> Cc: Luben Tuikov <luben.tuikov@amd.com>
> Cc: Mukul Joshi <Mukul.Joshi@amd.com>
> Signed-off-by: Kent Russell <kent.russell@amd.com>
> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h            |  1 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c        |  2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 12 ++++++++----
>  3 files changed, 10 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index d58e37fd01f4..b85b67a88a3d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -205,6 +205,7 @@ extern struct amdgpu_mgpu_info mgpu_info;
>  extern int amdgpu_ras_enable;
>  extern uint amdgpu_ras_mask;
>  extern int amdgpu_bad_page_threshold;
> +extern bool amdgpu_ignore_bad_page_threshold;
>  extern struct amdgpu_watchdog_timer amdgpu_watchdog_timer;
>  extern int amdgpu_async_gfx_ring;
>  extern int amdgpu_mcbp;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 96bd63aeeddd..eee3cf874e7a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -877,7 +877,7 @@ module_param_named(reset_method, amdgpu_reset_method, int, 0444);
>   * result in the GPU entering bad status when the number of total
>   * faulty pages by ECC exceeds the threshold value.
>   */
> -MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 = auto(default value), 0 = disable bad page retirement)");
> +MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 = auto(default value), 0 = disable bad page retirement, -2 = ignore bad page threshold)");
>  module_param_named(bad_page_threshold, amdgpu_bad_page_threshold, int, 0444);
>  
>  MODULE_PARM_DESC(num_kcq, "number of kernel compute queue user want to setup (8 if set to greater than 8 or less than 0, only affect gfx 8+)");
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> index ce5089216474..bd6ed43b0df2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> @@ -1104,11 +1104,15 @@ int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
>  			res = amdgpu_ras_eeprom_correct_header_tag(control,
>  								   RAS_TABLE_HDR_VAL);
>  		} else {
> -			*exceed_err_limit = true;
> -			dev_err(adev->dev,
> -				"RAS records:%d exceed threshold:%d, "
> -				"GPU will not be initialized. Replace this GPU or increase the threshold",
> +			dev_err(adev->dev, "RAS records:%d exceed threshold:%d",
>  				control->ras_num_recs, ras->bad_page_cnt_threshold); 

I thought this would all go in a single set of patches. I wasn't aware a singleton patch went in already which changed just this line--this change was always a part of a patch set.

Regards,
Luben

> +			if (amdgpu_bad_page_threshold == -2) {
> +				dev_warn(adev->dev, "GPU will be initialized due to bad_page_threshold = -2.");
> +				res = 0;
> +			} else {
> +				*exceed_err_limit = true;
> +				dev_err(adev->dev, "GPU will not be initialized. Replace this GPU or increase the threshold.");
> +			}
>  		}
>  	} else {
>  		DRM_INFO("Creating a new EEPROM table");


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] drm/amdgpu: Warn when bad pages approaches 90% threshold
  2021-10-21 15:57 [PATCH 1/2] drm/amdgpu: Warn when bad pages approaches 90% threshold Kent Russell
  2021-10-21 15:57 ` [PATCH 2/2] drm/amdgpu: Add kernel parameter support for ignoring bad page threshold Kent Russell
@ 2021-10-21 16:31 ` Lazar, Lijo
  2021-10-21 16:35   ` Russell, Kent
  2021-10-21 16:35 ` Luben Tuikov
  2 siblings, 1 reply; 14+ messages in thread
From: Lazar, Lijo @ 2021-10-21 16:31 UTC (permalink / raw)
  To: Russell, Kent, amd-gfx@lists.freedesktop.org
  Cc: Russell, Kent, Tuikov, Luben, Joshi, Mukul

[-- Attachment #1: Type: text/plain, Size: 2096 bytes --]

[Public]

Nit pick - suggest to use dev_warn for easy identification of the device.

Thanks,
Lijo
________________________________
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> on behalf of Kent Russell <kent.russell@amd.com>
Sent: Thursday, October 21, 2021 9:27:10 PM
To: amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>
Cc: Russell, Kent <Kent.Russell@amd.com>; Tuikov, Luben <Luben.Tuikov@amd.com>; Joshi, Mukul <Mukul.Joshi@amd.com>
Subject: [PATCH 1/2] drm/amdgpu: Warn when bad pages approaches 90% threshold

dmesg doesn't warn when the number of bad pages approaches the
threshold for page retirement. WARN when the number of bad pages
is at 90% or greater for easier checks and planning, instead of waiting
until the GPU is full of bad pages.

Cc: Luben Tuikov <luben.tuikov@amd.com>
Cc: Mukul Joshi <Mukul.Joshi@amd.com>
Signed-off-by: Kent Russell <kent.russell@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index f4c05ff4b26c..ce5089216474 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -1077,6 +1077,12 @@ int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
                 if (res)
                         DRM_ERROR("RAS table incorrect checksum or error:%d\n",
                                   res);
+
+               /* Warn if we are at 90% of the threshold or above */
+               if ((10 * control->ras_num_recs) >= (ras->bad_page_cnt_threshold * 9))
+                       DRM_WARN("RAS records:%u exceeds 90%% of threshold:%d",
+                                       control->ras_num_recs,
+                                       ras->bad_page_cnt_threshold);
         } else if (hdr->header == RAS_TABLE_HDR_BAD &&
                    amdgpu_bad_page_threshold != 0) {
                 res = __verify_ras_table_checksum(control);
--
2.25.1


[-- Attachment #2: Type: text/html, Size: 4289 bytes --]

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] drm/amdgpu: Warn when bad pages approaches 90% threshold
  2021-10-21 15:57 [PATCH 1/2] drm/amdgpu: Warn when bad pages approaches 90% threshold Kent Russell
  2021-10-21 15:57 ` [PATCH 2/2] drm/amdgpu: Add kernel parameter support for ignoring bad page threshold Kent Russell
  2021-10-21 16:31 ` [PATCH 1/2] drm/amdgpu: Warn when bad pages approaches 90% threshold Lazar, Lijo
@ 2021-10-21 16:35 ` Luben Tuikov
  2021-10-21 16:44   ` Luben Tuikov
  2021-10-21 17:18   ` Felix Kuehling
  2 siblings, 2 replies; 14+ messages in thread
From: Luben Tuikov @ 2021-10-21 16:35 UTC (permalink / raw)
  To: Kent Russell, amd-gfx; +Cc: Mukul Joshi, Tuikov, Luben

On 2021-10-21 11:57, Kent Russell wrote:
> dmesg doesn't warn when the number of bad pages approaches the
> threshold for page retirement. WARN when the number of bad pages
> is at 90% or greater for easier checks and planning, instead of waiting
> until the GPU is full of bad pages.
>
> Cc: Luben Tuikov <luben.tuikov@amd.com>
> Cc: Mukul Joshi <Mukul.Joshi@amd.com>
> Signed-off-by: Kent Russell <kent.russell@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> index f4c05ff4b26c..ce5089216474 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> @@ -1077,6 +1077,12 @@ int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
>  		if (res)
>  			DRM_ERROR("RAS table incorrect checksum or error:%d\n",
>  				  res);
> +
> +		/* Warn if we are at 90% of the threshold or above */

The kernel uses a couple of styles, this is one of them:

/* Warn ...
 */
if (...)

Please use this style as it is used extensively in the amdgpu_ras_eeprom.c file.

> +		if ((10 * control->ras_num_recs) >= (ras->bad_page_cnt_threshold * 9))

You don't need the extra parenthesis around multiplication--it has higher precedence than relational operators--drop the extra parenthesis.

Regards,
Luben

> +			DRM_WARN("RAS records:%u exceeds 90%% of threshold:%d",
> +					control->ras_num_recs,
> +					ras->bad_page_cnt_threshold);
>  	} else if (hdr->header == RAS_TABLE_HDR_BAD &&
>  		   amdgpu_bad_page_threshold != 0) {
>  		res = __verify_ras_table_checksum(control);


^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: [PATCH 1/2] drm/amdgpu: Warn when bad pages approaches 90% threshold
  2021-10-21 16:31 ` [PATCH 1/2] drm/amdgpu: Warn when bad pages approaches 90% threshold Lazar, Lijo
@ 2021-10-21 16:35   ` Russell, Kent
  0 siblings, 0 replies; 14+ messages in thread
From: Russell, Kent @ 2021-10-21 16:35 UTC (permalink / raw)
  To: Lazar, Lijo, amd-gfx@lists.freedesktop.org; +Cc: Tuikov, Luben, Joshi, Mukul

[-- Attachment #1: Type: text/plain, Size: 3659 bytes --]

[Public]

I had noticed that all of these RAS messages use DRM instead of dev_warn. I wasn't sure if there was a reason for that or not. It's definitely inconsistent.

DRM_ERROR("Partial read for checksum, res:%d\n", res);
DRM_DEBUG_DRIVER("Found existing EEPROM table with %d records",
DRM_ERROR("RAS table incorrect checksum or error:%d\n",
DRM_WARN("RAS records:%u exceeds 90%% of threshold:%d",
DRM_ERROR("RAS Table incorrect checksum or error:%d\n",
dev_info(adev->dev,  "records:%d threshold:%d, resetting RAS table header signature",
dev_err(adev->dev, "RAS records:%d exceed threshold:%d",
dev_warn(adev->dev, "GPU will be initialized due to bad_page_threshold = -2.");
DRM_INFO("Creating a new EEPROM table");

Might be worth making a separate patch to handle those inconsistencies. I agree that device is useful for this kind of error/warning/info

Kent

From: Lazar, Lijo <Lijo.Lazar@amd.com>
Sent: Thursday, October 21, 2021 12:31 PM
To: Russell, Kent <Kent.Russell@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Russell, Kent <Kent.Russell@amd.com>; Tuikov, Luben <Luben.Tuikov@amd.com>; Joshi, Mukul <Mukul.Joshi@amd.com>
Subject: Re: [PATCH 1/2] drm/amdgpu: Warn when bad pages approaches 90% threshold


[Public]

Nit pick - suggest to use dev_warn for easy identification of the device.

Thanks,
Lijo
________________________________
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org<mailto:amd-gfx-bounces@lists.freedesktop.org>> on behalf of Kent Russell <kent.russell@amd.com<mailto:kent.russell@amd.com>>
Sent: Thursday, October 21, 2021 9:27:10 PM
To: amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> <amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>>
Cc: Russell, Kent <Kent.Russell@amd.com<mailto:Kent.Russell@amd.com>>; Tuikov, Luben <Luben.Tuikov@amd.com<mailto:Luben.Tuikov@amd.com>>; Joshi, Mukul <Mukul.Joshi@amd.com<mailto:Mukul.Joshi@amd.com>>
Subject: [PATCH 1/2] drm/amdgpu: Warn when bad pages approaches 90% threshold

dmesg doesn't warn when the number of bad pages approaches the
threshold for page retirement. WARN when the number of bad pages
is at 90% or greater for easier checks and planning, instead of waiting
until the GPU is full of bad pages.

Cc: Luben Tuikov <luben.tuikov@amd.com<mailto:luben.tuikov@amd.com>>
Cc: Mukul Joshi <Mukul.Joshi@amd.com<mailto:Mukul.Joshi@amd.com>>
Signed-off-by: Kent Russell <kent.russell@amd.com<mailto:kent.russell@amd.com>>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index f4c05ff4b26c..ce5089216474 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -1077,6 +1077,12 @@ int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
                 if (res)
                         DRM_ERROR("RAS table incorrect checksum or error:%d\n",
                                   res);
+
+               /* Warn if we are at 90% of the threshold or above */
+               if ((10 * control->ras_num_recs) >= (ras->bad_page_cnt_threshold * 9))
+                       DRM_WARN("RAS records:%u exceeds 90%% of threshold:%d",
+                                       control->ras_num_recs,
+                                       ras->bad_page_cnt_threshold);
         } else if (hdr->header == RAS_TABLE_HDR_BAD &&
                    amdgpu_bad_page_threshold != 0) {
                 res = __verify_ras_table_checksum(control);
--
2.25.1

[-- Attachment #2: Type: text/html, Size: 47293 bytes --]

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* RE: [PATCH 2/2] drm/amdgpu: Add kernel parameter support for ignoring bad page threshold
  2021-10-21 16:21   ` Luben Tuikov
@ 2021-10-21 16:42     ` Russell, Kent
  2021-10-21 16:46       ` Luben Tuikov
  0 siblings, 1 reply; 14+ messages in thread
From: Russell, Kent @ 2021-10-21 16:42 UTC (permalink / raw)
  To: Tuikov, Luben, amd-gfx@lists.freedesktop.org
  Cc: Joshi, Mukul, Kuehling, Felix

[AMD Official Use Only]



> -----Original Message-----
> From: Tuikov, Luben <Luben.Tuikov@amd.com>
> Sent: Thursday, October 21, 2021 12:21 PM
> To: Russell, Kent <Kent.Russell@amd.com>; amd-gfx@lists.freedesktop.org
> Cc: Joshi, Mukul <Mukul.Joshi@amd.com>; Kuehling, Felix <Felix.Kuehling@amd.com>;
> Tuikov, Luben <Luben.Tuikov@amd.com>
> Subject: Re: [PATCH 2/2] drm/amdgpu: Add kernel parameter support for ignoring bad page
> threshold
> 
> On 2021-10-21 11:57, Kent Russell wrote:
> > When a GPU hits the bad_page_threshold, it will not be initialized by
> > the amdgpu driver. This means that the table cannot be cleared, nor can
> > information gathering be performed (getting serial number, BDF, etc).
> >
> > If the bad_page_threshold kernel parameter is set to -2,
> > continue to initialize the GPU, while printing a warning to dmesg that
> > this action has been done
> >
> > Cc: Luben Tuikov <luben.tuikov@amd.com>
> > Cc: Mukul Joshi <Mukul.Joshi@amd.com>
> > Signed-off-by: Kent Russell <kent.russell@amd.com>
> > Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
> > Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu.h            |  1 +
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c        |  2 +-
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 12 ++++++++----
> >  3 files changed, 10 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > index d58e37fd01f4..b85b67a88a3d 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > @@ -205,6 +205,7 @@ extern struct amdgpu_mgpu_info mgpu_info;
> >  extern int amdgpu_ras_enable;
> >  extern uint amdgpu_ras_mask;
> >  extern int amdgpu_bad_page_threshold;
> > +extern bool amdgpu_ignore_bad_page_threshold;
> >  extern struct amdgpu_watchdog_timer amdgpu_watchdog_timer;
> >  extern int amdgpu_async_gfx_ring;
> >  extern int amdgpu_mcbp;
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > index 96bd63aeeddd..eee3cf874e7a 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > @@ -877,7 +877,7 @@ module_param_named(reset_method, amdgpu_reset_method,
> int, 0444);
> >   * result in the GPU entering bad status when the number of total
> >   * faulty pages by ECC exceeds the threshold value.
> >   */
> > -MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 = auto(default
> value), 0 = disable bad page retirement)");
> > +MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 = auto(default
> value), 0 = disable bad page retirement, -2 = ignore bad page threshold)");
> >  module_param_named(bad_page_threshold, amdgpu_bad_page_threshold, int, 0444);
> >
> >  MODULE_PARM_DESC(num_kcq, "number of kernel compute queue user want to setup
> (8 if set to greater than 8 or less than 0, only affect gfx 8+)");
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > index ce5089216474..bd6ed43b0df2 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> > @@ -1104,11 +1104,15 @@ int amdgpu_ras_eeprom_init(struct
> amdgpu_ras_eeprom_control *control,
> >  			res = amdgpu_ras_eeprom_correct_header_tag(control,
> >  								   RAS_TABLE_HDR_VAL);
> >  		} else {
> > -			*exceed_err_limit = true;
> > -			dev_err(adev->dev,
> > -				"RAS records:%d exceed threshold:%d, "
> > -				"GPU will not be initialized. Replace this GPU or increase the
> threshold",
> > +			dev_err(adev->dev, "RAS records:%d exceed threshold:%d",
> >  				control->ras_num_recs, ras->bad_page_cnt_threshold);
> 
> I thought this would all go in a single set of patches. I wasn't aware a singleton patch went
> in already which changed just this line--this change was always a part of a patch set.
> 

Ah sorry. When you reviewed the original patch2 clarifying the message, I merged it and then re-submitted the remaining 3 (which pared down to 2) for review. Sorry for the confusion, I was trying to minimize the number of moving parts.

 Kent

> Regards,
> Luben
> 
> > +			if (amdgpu_bad_page_threshold == -2) {
> > +				dev_warn(adev->dev, "GPU will be initialized due to
> bad_page_threshold = -2.");
> > +				res = 0;
> > +			} else {
> > +				*exceed_err_limit = true;
> > +				dev_err(adev->dev, "GPU will not be initialized. Replace this
> GPU or increase the threshold.");
> > +			}
> >  		}
> >  	} else {
> >  		DRM_INFO("Creating a new EEPROM table");

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] drm/amdgpu: Warn when bad pages approaches 90% threshold
  2021-10-21 16:35 ` Luben Tuikov
@ 2021-10-21 16:44   ` Luben Tuikov
  2021-10-21 16:47     ` Russell, Kent
  2021-10-21 17:18   ` Felix Kuehling
  1 sibling, 1 reply; 14+ messages in thread
From: Luben Tuikov @ 2021-10-21 16:44 UTC (permalink / raw)
  To: Kent Russell, amd-gfx; +Cc: Mukul Joshi

On 2021-10-21 12:35, Luben Tuikov wrote:
> On 2021-10-21 11:57, Kent Russell wrote:
>> dmesg doesn't warn when the number of bad pages approaches the
>> threshold for page retirement. WARN when the number of bad pages
>> is at 90% or greater for easier checks and planning, instead of waiting
>> until the GPU is full of bad pages.
>>
>> Cc: Luben Tuikov <luben.tuikov@amd.com>
>> Cc: Mukul Joshi <Mukul.Joshi@amd.com>
>> Signed-off-by: Kent Russell <kent.russell@amd.com>
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 6 ++++++
>>  1 file changed, 6 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
>> index f4c05ff4b26c..ce5089216474 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
>> @@ -1077,6 +1077,12 @@ int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
>>  		if (res)
>>  			DRM_ERROR("RAS table incorrect checksum or error:%d\n",
>>  				  res);
>> +
>> +		/* Warn if we are at 90% of the threshold or above */
> The kernel uses a couple of styles, this is one of them:
>
> /* Warn ...
>  */
> if (...)
>
> Please use this style as it is used extensively in the amdgpu_ras_eeprom.c file.
>
>> +		if ((10 * control->ras_num_recs) >= (ras->bad_page_cnt_threshold * 9))
> You don't need the extra parenthesis around multiplication--it has higher precedence than relational operators--drop the extra parenthesis.
>
> Regards,
> Luben
>
>> +			DRM_WARN("RAS records:%u exceeds 90%% of threshold:%d",
>> +					control->ras_num_recs,
>> +					ras->bad_page_cnt_threshold);

One more note: The code uses "dev_err()" for this very similar message:

            dev_err(adev->dev,
                "RAS records:%d exceed threshold:%d, "
                "GPU will not be initialized. Replace this GPU or increase the threshold",
                control->ras_num_recs, ras->bad_page_cnt_threshold);

Since your message is essentially the same, sans the "90% of threshold", perhaps you want to use dev_warn(), instead of "DRM_WARN()".

Regards,
Luben

>>  	} else if (hdr->header == RAS_TABLE_HDR_BAD &&
>>  		   amdgpu_bad_page_threshold != 0) {
>>  		res = __verify_ras_table_checksum(control);


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] drm/amdgpu: Add kernel parameter support for ignoring bad page threshold
  2021-10-21 16:42     ` Russell, Kent
@ 2021-10-21 16:46       ` Luben Tuikov
  2021-10-21 16:49         ` Russell, Kent
  0 siblings, 1 reply; 14+ messages in thread
From: Luben Tuikov @ 2021-10-21 16:46 UTC (permalink / raw)
  To: Russell, Kent, amd-gfx@lists.freedesktop.org
  Cc: Joshi, Mukul, Kuehling, Felix

On 2021-10-21 12:42, Russell, Kent wrote:
> [AMD Official Use Only]
>
>
>
>> -----Original Message-----
>> From: Tuikov, Luben <Luben.Tuikov@amd.com>
>> Sent: Thursday, October 21, 2021 12:21 PM
>> To: Russell, Kent <Kent.Russell@amd.com>; amd-gfx@lists.freedesktop.org
>> Cc: Joshi, Mukul <Mukul.Joshi@amd.com>; Kuehling, Felix <Felix.Kuehling@amd.com>;
>> Tuikov, Luben <Luben.Tuikov@amd.com>
>> Subject: Re: [PATCH 2/2] drm/amdgpu: Add kernel parameter support for ignoring bad page
>> threshold
>>
>> On 2021-10-21 11:57, Kent Russell wrote:
>>> When a GPU hits the bad_page_threshold, it will not be initialized by
>>> the amdgpu driver. This means that the table cannot be cleared, nor can
>>> information gathering be performed (getting serial number, BDF, etc).
>>>
>>> If the bad_page_threshold kernel parameter is set to -2,
>>> continue to initialize the GPU, while printing a warning to dmesg that
>>> this action has been done
>>>
>>> Cc: Luben Tuikov <luben.tuikov@amd.com>
>>> Cc: Mukul Joshi <Mukul.Joshi@amd.com>
>>> Signed-off-by: Kent Russell <kent.russell@amd.com>
>>> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
>>> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>
>>> ---
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu.h            |  1 +
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c        |  2 +-
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 12 ++++++++----
>>>  3 files changed, 10 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> index d58e37fd01f4..b85b67a88a3d 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> @@ -205,6 +205,7 @@ extern struct amdgpu_mgpu_info mgpu_info;
>>>  extern int amdgpu_ras_enable;
>>>  extern uint amdgpu_ras_mask;
>>>  extern int amdgpu_bad_page_threshold;
>>> +extern bool amdgpu_ignore_bad_page_threshold;
>>>  extern struct amdgpu_watchdog_timer amdgpu_watchdog_timer;
>>>  extern int amdgpu_async_gfx_ring;
>>>  extern int amdgpu_mcbp;
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>> index 96bd63aeeddd..eee3cf874e7a 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>> @@ -877,7 +877,7 @@ module_param_named(reset_method, amdgpu_reset_method,
>> int, 0444);
>>>   * result in the GPU entering bad status when the number of total
>>>   * faulty pages by ECC exceeds the threshold value.
>>>   */
>>> -MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 = auto(default
>> value), 0 = disable bad page retirement)");
>>> +MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 = auto(default
>> value), 0 = disable bad page retirement, -2 = ignore bad page threshold)");
>>>  module_param_named(bad_page_threshold, amdgpu_bad_page_threshold, int, 0444);
>>>
>>>  MODULE_PARM_DESC(num_kcq, "number of kernel compute queue user want to setup
>> (8 if set to greater than 8 or less than 0, only affect gfx 8+)");
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
>>> index ce5089216474..bd6ed43b0df2 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
>>> @@ -1104,11 +1104,15 @@ int amdgpu_ras_eeprom_init(struct
>> amdgpu_ras_eeprom_control *control,
>>>  			res = amdgpu_ras_eeprom_correct_header_tag(control,
>>>  								   RAS_TABLE_HDR_VAL);
>>>  		} else {
>>> -			*exceed_err_limit = true;
>>> -			dev_err(adev->dev,
>>> -				"RAS records:%d exceed threshold:%d, "
>>> -				"GPU will not be initialized. Replace this GPU or increase the
>> threshold",
>>> +			dev_err(adev->dev, "RAS records:%d exceed threshold:%d",
>>>  				control->ras_num_recs, ras->bad_page_cnt_threshold);
>> I thought this would all go in a single set of patches. I wasn't aware a singleton patch went
>> in already which changed just this line--this change was always a part of a patch set.
>>
> Ah sorry. When you reviewed the original patch2 clarifying the message, I merged it and then re-submitted the remaining 3 (which pared down to 2) for review. Sorry for the confusion, I was trying to minimize the number of moving parts.

Admittedly, now you have 3 patches, one singleton and two coming in. Would've probably be best to submit only the current two.

No worries for now--for the future.

Regards,
Luben

>
>  Kent
>
>> Regards,
>> Luben
>>
>>> +			if (amdgpu_bad_page_threshold == -2) {
>>> +				dev_warn(adev->dev, "GPU will be initialized due to
>> bad_page_threshold = -2.");
>>> +				res = 0;
>>> +			} else {
>>> +				*exceed_err_limit = true;
>>> +				dev_err(adev->dev, "GPU will not be initialized. Replace this
>> GPU or increase the threshold.");
>>> +			}
>>>  		}
>>>  	} else {
>>>  		DRM_INFO("Creating a new EEPROM table");


^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: [PATCH 1/2] drm/amdgpu: Warn when bad pages approaches 90% threshold
  2021-10-21 16:44   ` Luben Tuikov
@ 2021-10-21 16:47     ` Russell, Kent
  2021-10-21 16:49       ` Luben Tuikov
  0 siblings, 1 reply; 14+ messages in thread
From: Russell, Kent @ 2021-10-21 16:47 UTC (permalink / raw)
  To: Tuikov, Luben, amd-gfx@lists.freedesktop.org; +Cc: Joshi, Mukul

[AMD Official Use Only]



> -----Original Message-----
> From: Tuikov, Luben <Luben.Tuikov@amd.com>
> Sent: Thursday, October 21, 2021 12:45 PM
> To: Russell, Kent <Kent.Russell@amd.com>; amd-gfx@lists.freedesktop.org
> Cc: Joshi, Mukul <Mukul.Joshi@amd.com>
> Subject: Re: [PATCH 1/2] drm/amdgpu: Warn when bad pages approaches 90% threshold
> 
> On 2021-10-21 12:35, Luben Tuikov wrote:
> > On 2021-10-21 11:57, Kent Russell wrote:
> >> dmesg doesn't warn when the number of bad pages approaches the
> >> threshold for page retirement. WARN when the number of bad pages
> >> is at 90% or greater for easier checks and planning, instead of waiting
> >> until the GPU is full of bad pages.
> >>
> >> Cc: Luben Tuikov <luben.tuikov@amd.com>
> >> Cc: Mukul Joshi <Mukul.Joshi@amd.com>
> >> Signed-off-by: Kent Russell <kent.russell@amd.com>
> >> ---
> >>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 6 ++++++
> >>  1 file changed, 6 insertions(+)
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> >> index f4c05ff4b26c..ce5089216474 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> >> @@ -1077,6 +1077,12 @@ int amdgpu_ras_eeprom_init(struct
> amdgpu_ras_eeprom_control *control,
> >>  		if (res)
> >>  			DRM_ERROR("RAS table incorrect checksum or error:%d\n",
> >>  				  res);
> >> +
> >> +		/* Warn if we are at 90% of the threshold or above */
> > The kernel uses a couple of styles, this is one of them:
> >
> > /* Warn ...
> >  */
> > if (...)
> >
> > Please use this style as it is used extensively in the amdgpu_ras_eeprom.c file.
> >
> >> +		if ((10 * control->ras_num_recs) >= (ras->bad_page_cnt_threshold * 9))
> > You don't need the extra parenthesis around multiplication--it has higher precedence
> than relational operators--drop the extra parenthesis.
> >
> > Regards,
> > Luben
> >
> >> +			DRM_WARN("RAS records:%u exceeds 90%% of threshold:%d",
> >> +					control->ras_num_recs,
> >> +					ras->bad_page_cnt_threshold);
> 
> One more note: The code uses "dev_err()" for this very similar message:
> 
>             dev_err(adev->dev,
>                 "RAS records:%d exceed threshold:%d, "
>                 "GPU will not be initialized. Replace this GPU or increase the threshold",
>                 control->ras_num_recs, ras->bad_page_cnt_threshold);
> 
> Since your message is essentially the same, sans the "90% of threshold", perhaps you want
> to use dev_warn(), instead of "DRM_WARN()".

Agreed. Lijo had a similar comment. I may follow up with another patch to change all of these table-specific DRM_* messages to dev_*

 Kent

> 
> Regards,
> Luben
> 
> >>  	} else if (hdr->header == RAS_TABLE_HDR_BAD &&
> >>  		   amdgpu_bad_page_threshold != 0) {
> >>  		res = __verify_ras_table_checksum(control);

^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: [PATCH 2/2] drm/amdgpu: Add kernel parameter support for ignoring bad page threshold
  2021-10-21 16:46       ` Luben Tuikov
@ 2021-10-21 16:49         ` Russell, Kent
  2021-10-21 16:55           ` Luben Tuikov
  0 siblings, 1 reply; 14+ messages in thread
From: Russell, Kent @ 2021-10-21 16:49 UTC (permalink / raw)
  To: Tuikov, Luben, amd-gfx@lists.freedesktop.org
  Cc: Joshi, Mukul, Kuehling, Felix

[AMD Official Use Only]



> -----Original Message-----
> From: Tuikov, Luben <Luben.Tuikov@amd.com>
> Sent: Thursday, October 21, 2021 12:47 PM
> To: Russell, Kent <Kent.Russell@amd.com>; amd-gfx@lists.freedesktop.org
> Cc: Joshi, Mukul <Mukul.Joshi@amd.com>; Kuehling, Felix <Felix.Kuehling@amd.com>
> Subject: Re: [PATCH 2/2] drm/amdgpu: Add kernel parameter support for ignoring bad page
> threshold
> 
> On 2021-10-21 12:42, Russell, Kent wrote:
> > [AMD Official Use Only]
> >
> >
> >
> >> -----Original Message-----
> >> From: Tuikov, Luben <Luben.Tuikov@amd.com>
> >> Sent: Thursday, October 21, 2021 12:21 PM
> >> To: Russell, Kent <Kent.Russell@amd.com>; amd-gfx@lists.freedesktop.org
> >> Cc: Joshi, Mukul <Mukul.Joshi@amd.com>; Kuehling, Felix <Felix.Kuehling@amd.com>;
> >> Tuikov, Luben <Luben.Tuikov@amd.com>
> >> Subject: Re: [PATCH 2/2] drm/amdgpu: Add kernel parameter support for ignoring bad
> page
> >> threshold
> >>
> >> On 2021-10-21 11:57, Kent Russell wrote:
> >>> When a GPU hits the bad_page_threshold, it will not be initialized by
> >>> the amdgpu driver. This means that the table cannot be cleared, nor can
> >>> information gathering be performed (getting serial number, BDF, etc).
> >>>
> >>> If the bad_page_threshold kernel parameter is set to -2,
> >>> continue to initialize the GPU, while printing a warning to dmesg that
> >>> this action has been done
> >>>
> >>> Cc: Luben Tuikov <luben.tuikov@amd.com>
> >>> Cc: Mukul Joshi <Mukul.Joshi@amd.com>
> >>> Signed-off-by: Kent Russell <kent.russell@amd.com>
> >>> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
> >>> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>
> >>> ---
> >>>  drivers/gpu/drm/amd/amdgpu/amdgpu.h            |  1 +
> >>>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c        |  2 +-
> >>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 12 ++++++++----
> >>>  3 files changed, 10 insertions(+), 5 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> >> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> >>> index d58e37fd01f4..b85b67a88a3d 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> >>> @@ -205,6 +205,7 @@ extern struct amdgpu_mgpu_info mgpu_info;
> >>>  extern int amdgpu_ras_enable;
> >>>  extern uint amdgpu_ras_mask;
> >>>  extern int amdgpu_bad_page_threshold;
> >>> +extern bool amdgpu_ignore_bad_page_threshold;
> >>>  extern struct amdgpu_watchdog_timer amdgpu_watchdog_timer;
> >>>  extern int amdgpu_async_gfx_ring;
> >>>  extern int amdgpu_mcbp;
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >>> index 96bd63aeeddd..eee3cf874e7a 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >>> @@ -877,7 +877,7 @@ module_param_named(reset_method,
> amdgpu_reset_method,
> >> int, 0444);
> >>>   * result in the GPU entering bad status when the number of total
> >>>   * faulty pages by ECC exceeds the threshold value.
> >>>   */
> >>> -MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 = auto(default
> >> value), 0 = disable bad page retirement)");
> >>> +MODULE_PARM_DESC(bad_page_threshold, "Bad page threshold(-1 = auto(default
> >> value), 0 = disable bad page retirement, -2 = ignore bad page threshold)");
> >>>  module_param_named(bad_page_threshold, amdgpu_bad_page_threshold, int,
> 0444);
> >>>
> >>>  MODULE_PARM_DESC(num_kcq, "number of kernel compute queue user want to
> setup
> >> (8 if set to greater than 8 or less than 0, only affect gfx 8+)");
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> >>> index ce5089216474..bd6ed43b0df2 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
> >>> @@ -1104,11 +1104,15 @@ int amdgpu_ras_eeprom_init(struct
> >> amdgpu_ras_eeprom_control *control,
> >>>  			res = amdgpu_ras_eeprom_correct_header_tag(control,
> >>>  								   RAS_TABLE_HDR_VAL);
> >>>  		} else {
> >>> -			*exceed_err_limit = true;
> >>> -			dev_err(adev->dev,
> >>> -				"RAS records:%d exceed threshold:%d, "
> >>> -				"GPU will not be initialized. Replace this GPU or increase the
> >> threshold",
> >>> +			dev_err(adev->dev, "RAS records:%d exceed threshold:%d",
> >>>  				control->ras_num_recs, ras->bad_page_cnt_threshold);
> >> I thought this would all go in a single set of patches. I wasn't aware a singleton patch
> went
> >> in already which changed just this line--this change was always a part of a patch set.
> >>
> > Ah sorry. When you reviewed the original patch2 clarifying the message, I merged it and
> then re-submitted the remaining 3 (which pared down to 2) for review. Sorry for the
> confusion, I was trying to minimize the number of moving parts.
> 
> Admittedly, now you have 3 patches, one singleton and two coming in. Would've probably
> be best to submit only the current two.
> 
> No worries for now--for the future.

Thanks. For the most part, all of my previous multi-patch sets have been all-or-nothing. This is the first time that a patch in the middle got an RB before the rest, so I did the wrong thing and merged it while the rest was still moving. Thanks for bearing with me!

 Kent

> 
> Regards,
> Luben
> 
> >
> >  Kent
> >
> >> Regards,
> >> Luben
> >>
> >>> +			if (amdgpu_bad_page_threshold == -2) {
> >>> +				dev_warn(adev->dev, "GPU will be initialized due to
> >> bad_page_threshold = -2.");
> >>> +				res = 0;
> >>> +			} else {
> >>> +				*exceed_err_limit = true;
> >>> +				dev_err(adev->dev, "GPU will not be initialized. Replace this
> >> GPU or increase the threshold.");
> >>> +			}
> >>>  		}
> >>>  	} else {
> >>>  		DRM_INFO("Creating a new EEPROM table");

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] drm/amdgpu: Warn when bad pages approaches 90% threshold
  2021-10-21 16:47     ` Russell, Kent
@ 2021-10-21 16:49       ` Luben Tuikov
  0 siblings, 0 replies; 14+ messages in thread
From: Luben Tuikov @ 2021-10-21 16:49 UTC (permalink / raw)
  To: Russell, Kent, amd-gfx@lists.freedesktop.org; +Cc: Joshi, Mukul

On 2021-10-21 12:47, Russell, Kent wrote:
> [AMD Official Use Only]
>
>
>
>> -----Original Message-----
>> From: Tuikov, Luben <Luben.Tuikov@amd.com>
>> Sent: Thursday, October 21, 2021 12:45 PM
>> To: Russell, Kent <Kent.Russell@amd.com>; amd-gfx@lists.freedesktop.org
>> Cc: Joshi, Mukul <Mukul.Joshi@amd.com>
>> Subject: Re: [PATCH 1/2] drm/amdgpu: Warn when bad pages approaches 90% threshold
>>
>> On 2021-10-21 12:35, Luben Tuikov wrote:
>>> On 2021-10-21 11:57, Kent Russell wrote:
>>>> dmesg doesn't warn when the number of bad pages approaches the
>>>> threshold for page retirement. WARN when the number of bad pages
>>>> is at 90% or greater for easier checks and planning, instead of waiting
>>>> until the GPU is full of bad pages.
>>>>
>>>> Cc: Luben Tuikov <luben.tuikov@amd.com>
>>>> Cc: Mukul Joshi <Mukul.Joshi@amd.com>
>>>> Signed-off-by: Kent Russell <kent.russell@amd.com>
>>>> ---
>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 6 ++++++
>>>>  1 file changed, 6 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
>>>> index f4c05ff4b26c..ce5089216474 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
>>>> @@ -1077,6 +1077,12 @@ int amdgpu_ras_eeprom_init(struct
>> amdgpu_ras_eeprom_control *control,
>>>>  		if (res)
>>>>  			DRM_ERROR("RAS table incorrect checksum or error:%d\n",
>>>>  				  res);
>>>> +
>>>> +		/* Warn if we are at 90% of the threshold or above */
>>> The kernel uses a couple of styles, this is one of them:
>>>
>>> /* Warn ...
>>>  */
>>> if (...)
>>>
>>> Please use this style as it is used extensively in the amdgpu_ras_eeprom.c file.
>>>
>>>> +		if ((10 * control->ras_num_recs) >= (ras->bad_page_cnt_threshold * 9))
>>> You don't need the extra parenthesis around multiplication--it has higher precedence
>> than relational operators--drop the extra parenthesis.
>>> Regards,
>>> Luben
>>>
>>>> +			DRM_WARN("RAS records:%u exceeds 90%% of threshold:%d",
>>>> +					control->ras_num_recs,
>>>> +					ras->bad_page_cnt_threshold);
>> One more note: The code uses "dev_err()" for this very similar message:
>>
>>             dev_err(adev->dev,
>>                 "RAS records:%d exceed threshold:%d, "
>>                 "GPU will not be initialized. Replace this GPU or increase the threshold",
>>                 control->ras_num_recs, ras->bad_page_cnt_threshold);
>>
>> Since your message is essentially the same, sans the "90% of threshold", perhaps you want
>> to use dev_warn(), instead of "DRM_WARN()".
> Agreed. Lijo had a similar comment. I may follow up with another patch to change all of these table-specific DRM_* messages to dev_*

You can still do that, but for this patch, make the changes I requested and change this to dev_warn().

Regards,
Luben

>
>  Kent
>
>> Regards,
>> Luben
>>
>>>>  	} else if (hdr->header == RAS_TABLE_HDR_BAD &&
>>>>  		   amdgpu_bad_page_threshold != 0) {
>>>>  		res = __verify_ras_table_checksum(control);


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] drm/amdgpu: Add kernel parameter support for ignoring bad page threshold
  2021-10-21 16:49         ` Russell, Kent
@ 2021-10-21 16:55           ` Luben Tuikov
  0 siblings, 0 replies; 14+ messages in thread
From: Luben Tuikov @ 2021-10-21 16:55 UTC (permalink / raw)
  To: Russell, Kent, amd-gfx@lists.freedesktop.org
  Cc: Joshi, Mukul, Kuehling, Felix

[-- Attachment #1: Type: text/html, Size: 11681 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] drm/amdgpu: Warn when bad pages approaches 90% threshold
  2021-10-21 16:35 ` Luben Tuikov
  2021-10-21 16:44   ` Luben Tuikov
@ 2021-10-21 17:18   ` Felix Kuehling
  1 sibling, 0 replies; 14+ messages in thread
From: Felix Kuehling @ 2021-10-21 17:18 UTC (permalink / raw)
  To: Luben Tuikov, Kent Russell, amd-gfx; +Cc: Mukul Joshi

Am 2021-10-21 um 12:35 p.m. schrieb Luben Tuikov:
> On 2021-10-21 11:57, Kent Russell wrote:
>> dmesg doesn't warn when the number of bad pages approaches the
>> threshold for page retirement. WARN when the number of bad pages
>> is at 90% or greater for easier checks and planning, instead of waiting
>> until the GPU is full of bad pages.
>>
>> Cc: Luben Tuikov <luben.tuikov@amd.com>
>> Cc: Mukul Joshi <Mukul.Joshi@amd.com>
>> Signed-off-by: Kent Russell <kent.russell@amd.com>
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 6 ++++++
>>  1 file changed, 6 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
>> index f4c05ff4b26c..ce5089216474 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
>> @@ -1077,6 +1077,12 @@ int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
>>  		if (res)
>>  			DRM_ERROR("RAS table incorrect checksum or error:%d\n",
>>  				  res);
>> +
>> +		/* Warn if we are at 90% of the threshold or above */
> The kernel uses a couple of styles, this is one of them:
>
> /* Warn ...
>  */

That's a waste of space. That means there can be no single-line comments?

checkpatch.pl complains about multi-line comments that don't have the
final */ on their own line. But a single line comment is OK. Let's stick
with the coding style recommended by checkpatch.pl. If we make up our
own arbitrary rules for different reviewers and different parts of the
code, I think it becomes a mine-field of pointless cosmetic fixes for
anyone trespassing on unfamiliar code.


> if (...)
>
> Please use this style as it is used extensively in the amdgpu_ras_eeprom.c file.
>
>> +		if ((10 * control->ras_num_recs) >= (ras->bad_page_cnt_threshold * 9))
> You don't need the extra parenthesis around multiplication--it has higher precedence than relational operators--drop the extra parenthesis.

I agree. With that fixed, the patch is

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>


>
> Regards,
> Luben
>
>> +			DRM_WARN("RAS records:%u exceeds 90%% of threshold:%d",
>> +					control->ras_num_recs,
>> +					ras->bad_page_cnt_threshold);
>>  	} else if (hdr->header == RAS_TABLE_HDR_BAD &&
>>  		   amdgpu_bad_page_threshold != 0) {
>>  		res = __verify_ras_table_checksum(control);

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2021-10-21 17:19 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2021-10-21 15:57 [PATCH 1/2] drm/amdgpu: Warn when bad pages approaches 90% threshold Kent Russell
2021-10-21 15:57 ` [PATCH 2/2] drm/amdgpu: Add kernel parameter support for ignoring bad page threshold Kent Russell
2021-10-21 16:21   ` Luben Tuikov
2021-10-21 16:42     ` Russell, Kent
2021-10-21 16:46       ` Luben Tuikov
2021-10-21 16:49         ` Russell, Kent
2021-10-21 16:55           ` Luben Tuikov
2021-10-21 16:31 ` [PATCH 1/2] drm/amdgpu: Warn when bad pages approaches 90% threshold Lazar, Lijo
2021-10-21 16:35   ` Russell, Kent
2021-10-21 16:35 ` Luben Tuikov
2021-10-21 16:44   ` Luben Tuikov
2021-10-21 16:47     ` Russell, Kent
2021-10-21 16:49       ` Luben Tuikov
2021-10-21 17:18   ` Felix Kuehling

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox