AMD-GFX Archive on lore.kernel.org
* [PATCH 1/3] drm/amdgpu: Warn when bad pages approaches 90% threshold
@ 2021-10-20 16:35 Kent Russell
  2021-10-20 16:35 ` [PATCH 2/3] drm/amdgpu: Add kernel parameter support for ignoring bad page threshold Kent Russell
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Kent Russell @ 2021-10-20 16:35 UTC
  To: amd-gfx; +Cc: Kent Russell, Luben Tuikov, Mukul Joshi

Currently, dmesg doesn't warn when the number of bad pages approaches the
threshold for page retirement. Warn when the number of bad pages is at
90% of the threshold or greater, for easier checks and planning, instead
of waiting until the GPU is full of bad pages.

Cc: Luben Tuikov <luben.tuikov@amd.com>
Cc: Mukul Joshi <Mukul.Joshi@amd.com>
Signed-off-by: Kent Russell <kent.russell@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index f4c05ff4b26c..1ede0f0d6f55 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -1071,12 +1071,29 @@ int amdgpu_ras_eeprom_init(struct amdgpu_ras_eeprom_control *control,
 	control->ras_fri = RAS_OFFSET_TO_INDEX(control, hdr->first_rec_offset);
 
 	if (hdr->header == RAS_TABLE_HDR_VAL) {
+		int threshold = 0;
 		DRM_DEBUG_DRIVER("Found existing EEPROM table with %d records",
 				 control->ras_num_recs);
 		res = __verify_ras_table_checksum(control);
 		if (res)
 			DRM_ERROR("RAS table incorrect checksum or error:%d\n",
 				  res);
+
+		/* threshold = 0 means that page retirement is disabled, while
+		 * threshold = -1 means default behaviour
+		 */
+		if (amdgpu_bad_page_threshold == -1)
+			threshold = ras->bad_page_cnt_threshold;
+		else if (amdgpu_bad_page_threshold > 0)
+			threshold = amdgpu_bad_page_threshold;
+
+		/* Multiplying both sides by 10, a >= 9b/10 is the same
+		 * as 10a >= 9b. Use this for our 90% limit to avoid rounding
+		 */
+		if (threshold > 0 && ((control->ras_num_recs * 10) >= (threshold * 9)))
+			DRM_WARN("RAS records:%u exceeds 90%% of threshold:%d\n",
+				 control->ras_num_recs,
+				 threshold);
 	} else if (hdr->header == RAS_TABLE_HDR_BAD &&
 		   amdgpu_bad_page_threshold != 0) {
 		res = __verify_ras_table_checksum(control);
-- 
2.25.1
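
The threshold selection and the 90% comparison in the hunk above can be tried outside the kernel. The following is a minimal userspace sketch, not the driver code: the helper names effective_threshold() and at_or_above_90_percent() are hypothetical, and the widening to 64 bits is an added precaution that the patch itself does not take.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper mirroring the patch's parameter handling:
 *   -1  -> use the default threshold
 *   > 0 -> use the user-supplied value
 *   anything else (including 0, "retirement disabled") -> 0, no warning
 */
static int effective_threshold(int param, int default_threshold)
{
	if (param == -1)
		return default_threshold;
	if (param > 0)
		return param;
	return 0;
}

/*
 * Integer-only 90% check: num_recs >= 0.9 * threshold is rewritten as
 * 10 * num_recs >= 9 * threshold, so no division (and no rounding) occurs.
 * Both sides are widened to 64 bits here so the products cannot overflow;
 * that widening is an assumption of this sketch, not part of the patch.
 */
static bool at_or_above_90_percent(uint32_t num_recs, int threshold)
{
	if (threshold <= 0)
		return false;
	return (uint64_t)num_recs * 10 >= (uint64_t)threshold * 9;
}
```

With a threshold of 11, for example, 90% is 9.9 records: 9 records do not trigger the check (90 < 99) while 10 records do (100 >= 99), which a truncating 9 * threshold / 10 would get wrong.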



Thread overview: 14+ messages
2021-10-20 16:35 [PATCH 1/3] drm/amdgpu: Warn when bad pages approaches 90% threshold Kent Russell
2021-10-20 16:35 ` [PATCH 2/3] drm/amdgpu: Add kernel parameter support for ignoring bad page threshold Kent Russell
2021-10-20 16:35 ` [PATCH 3/3] drm/amdgpu: Implement bad_page_threshold = -2 case Kent Russell
2021-10-20 21:54   ` Felix Kuehling
2021-10-20 22:01     ` Luben Tuikov
2021-10-21 13:57       ` Russell, Kent
2021-10-21  5:24   ` Lazar, Lijo
2021-10-21 13:56     ` Russell, Kent
2021-10-22 11:26       ` Lazar, Lijo
2021-10-20 21:47 ` [PATCH 1/3] drm/amdgpu: Warn when bad pages approaches 90% threshold Luben Tuikov
2021-10-21 14:04   ` Russell, Kent
2021-10-20 21:50 ` Felix Kuehling
2021-10-20 22:09   ` Luben Tuikov
2021-10-20 22:31   ` Felix Kuehling