[PATCH 0/1] amdgpu fix for gfx1103 queue evict/restore crash v2

public inbox for linux-kernel@vger.kernel.org

From: Mika Laitio <lamikr@gmail.com>
To: christian.koenig@amd.com, Xinhui.Pan@amd.com, airlied@gmail.com,
	simona@ffwll.ch, Hawking.Zhang@amd.com, sunil.khatri@amd.com,
	lijo.lazar@amd.com, kevinyang.wang@amd.com,
	amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	linux-kernel@vger.kernel.org, lamikr@gmail.com
Subject: [PATCH 0/1] amdgpu fix for gfx1103 queue evict/restore crash v2
Date: Wed, 27 Nov 2024 03:46:37 -0800
Message-ID: <20241127114638.11216-1-lamikr@gmail.com>

This is the corrected v2 of the patch that was sent earlier.
Changes since v1:
- add cover letter
- use "goto out_unlock" instead of "goto out" in the
  restore_process_queues_cpsch function after the mutex has been acquired
- fix typo in the patch subject line and improve the patch description

This patch fixes the queue evict/restore problem on the AMD
gfx1103 iGPU. The problem has not been seen on the other AMD GPUs tested:
- gfx1010 (RX 5700)
- gfx1030 (RX 6800)
- gfx1035 (680M iGPU)
- gfx1102 (RX 7700S)

Of these devices, gfx1102 uses the same code path
as gfx1103 and calls the evict and restore queue methods, which
then call the MES firmware.

The fix skips the evict/restore calls to MES when the device is an iGPU.
Queues that have been added are still removed normally when the program closes.

An easy way to trigger the problem is to build the
ML/AI support for the gfx1103 (780M iGPU) with the
rocm sdk builder and then run the test application in a loop.

Most of the testing has been done on 6.13 devel and 6.12 final kernels
but the same problem can also be triggered at least with the 6.8
and 6.11 kernels.

Adding delays either to the test application between calls
(tested with 1 second) or to the loop inside the kernel that removes the
queues (tested with mdelay(10)) did not help to avoid the crash.

After applying the kernel fix, I and others have executed
the test loop thousands of times without seeing the error again.
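
Repeated runs of this kind can also be driven from a small harness that
starts the test in a fresh process each time and stops at the first failure
or hang. This is only a sketch: run_repeatedly, the test_app.py file name and
the 300 second timeout are illustrative assumptions, not part of the patch or
of the original test case, which loops inside a single process.

```python
import subprocess
import sys


def run_repeatedly(cmd, iterations, timeout_s=300.0):
    """Run cmd up to `iterations` times in fresh processes.

    Returns the number of runs that passed before the first failure
    (non-zero exit code or timeout), or `iterations` if all passed.
    Hypothetical helper for illustration only.
    """
    for i in range(iterations):
        try:
            proc = subprocess.run(cmd, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            # A run that never returns is treated as a failure.
            print(f"iteration {i + 1}: timed out after {timeout_s} s")
            return i
        if proc.returncode != 0:
            print(f"iteration {i + 1}: exit code {proc.returncode}")
            return i
    return iterations


# Example call (test_app.py is a placeholder for a single-pass test script):
#   passed = run_repeatedly([sys.executable, "test_app.py"], 1000)
```

Starting a new process per iteration means the queues are created and removed
on every run, and the timeout keeps the harness from blocking forever if a
run never returns.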

On multi-GPU systems, the gfx1103 must be selected explicitly by exporting
the environment variable HIP_VISIBLE_DEVICES=<gpu-index>
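
For a Python test, the variable can also be set from inside the script,
provided this happens before the GPU runtime is initialized; doing it before
importing torch is the safe choice. A minimal sketch follows, where the
helper name select_hip_device and the index 1 are only illustrations and the
real index depends on the system:

```python
import os


def select_hip_device(gpu_index: int) -> str:
    """Restrict the HIP runtime to a single GPU index.

    Hypothetical helper, not part of the patch. It only sets the
    environment variable for the current process and its children.
    """
    value = str(gpu_index)
    os.environ["HIP_VISIBLE_DEVICES"] = value
    return value


# Example index only; use the index that the gfx1103 has on your system.
# Call this before "import torch" so the setting is seen at GPU init time.
select_hip_device(1)
```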

The original bug report and test case were made by jrl290 in rocm sdk builder issue 141.
The test app below triggers the problem.

import torch
import numpy as np
from onnx import load
from onnx2pytorch import ConvertModel
import time

if __name__ == "__main__":
    ii = 0
    while True:
        ii = ii + 1
        print("Loop Start")
        model_path = "model.onnx"
        device = 'cuda'
        model_run = ConvertModel(load(model_path))
        model_run.to(device).eval()

        #This code causes the crash. Comment out to remove the crash
        random = np.random.rand(1, 4, 3072, 256)
        tensor = torch.tensor(random, dtype=torch.float32, device=device)

        #This code doesn't cause a crash
        tensor = torch.randn(1, 4, 3072, 256, dtype=torch.float32, device=device)

        print("[" + str(ii) + "], the crash happens here:")
        time.sleep(0.5)
        result = model_run(tensor).numpy(force=True)
        print(result.shape)

Mika Laitio (1):
  amdgpu fix for gfx1103 queue evict/restore crash

 .../drm/amd/amdkfd/kfd_device_queue_manager.c | 24 ++++++++++++-------
 1 file changed, 16 insertions(+), 8 deletions(-)

-- 
2.43.0

Thread overview: 5+ messages
2024-11-27 11:46 Mika Laitio [this message]
2024-11-27 11:46 ` [PATCH 1/1] amdgpu fix for gfx1103 queue evict/restore crash Mika Laitio
2024-11-27 11:51   ` Christian König
2024-11-27 23:50     ` Felix Kuehling
     [not found]       ` <CAJ+8kEYDRozboMpybdqMVZx+S77s_zHNXURJ-pp_Lrx_fESkgA@mail.gmail.com>
2024-11-29 17:21         ` Felix Kuehling
