From: "Lazar, Lijo" <lijo.lazar@amd.com>
To: "Khatri, Sunil" <sukhatri@amd.com>,
Alex Deucher <alexdeucher@gmail.com>,
Sunil Khatri <sunil.khatri@amd.com>
Cc: "Alex Deucher" <alexander.deucher@amd.com>,
"Liu Leo" <Leo.Liu@amd.com>,
"Christian König" <christian.koenig@amd.com>,
amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 2/2] drm/amdgpu: trigger ip dump before suspend of IP's
Date: Fri, 26 Jul 2024 20:36:44 +0530 [thread overview]
Message-ID: <77bb2c37-d906-4c76-b87d-effbbe6064e4@amd.com> (raw)
In-Reply-To: <d49c682a-57ea-4061-8b42-59764f603164@amd.com>
On 7/26/2024 8:11 PM, Khatri, Sunil wrote:
>
> On 7/26/2024 7:53 PM, Khatri, Sunil wrote:
>>
>> On 7/26/2024 7:18 PM, Lazar, Lijo wrote:
>>>
>>> On 7/26/2024 6:42 PM, Alex Deucher wrote:
>>>> On Fri, Jul 26, 2024 at 8:48 AM Sunil Khatri <sunil.khatri@amd.com>
>>>> wrote:
>>>>> Problem:
>>>>> IP dump right now is done post suspend of
>>>>> all IP's which for some IP's could change power
>>>>> state and software state too which we do not want
>>>>> to reflect in the dump as it might not be same at
>>>>> the time of hang.
>>>>>
>>>>> Solution:
>>>>> IP should be dumped as close to the HW state when
>>>>> the GPU was in hung state without trying to reinitialize
>>>>> any resource.
>>>>>
>>>>> Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
>>>> Acked-by: Alex Deucher <alexander.deucher@amd.com>
>>>>
>>>>> ---
>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 60
>>>>> +++++++++++-----------
>>>>> 1 file changed, 30 insertions(+), 30 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> index 730dae77570c..74f6f15e73b5 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> @@ -5277,11 +5277,29 @@ int amdgpu_device_mode1_reset(struct
>>>>> amdgpu_device *adev)
>>>>> return ret;
>>>>> }
>>>>>
>>>>> +static int amdgpu_reset_reg_dumps(struct amdgpu_device *adev)
>>>>> +{
>>>>> + int i;
>>>>> +
>>>>> + lockdep_assert_held(&adev->reset_domain->sem);
>>>>> +
>>>>> + for (i = 0; i < adev->reset_info.num_regs; i++) {
>>>>> + adev->reset_info.reset_dump_reg_value[i] =
>>>>> +
>>>>> RREG32(adev->reset_info.reset_dump_reg_list[i]);
>>> A suspend also involves power/clock ungate. When reg dump is moved
>>> earlier, I'm not sure if this read works for all. If it's left to
>>> individual IP call backs, they could just do the same or better to move
>>> these up before a dump.
>> Suspend also put the status.hw = false and each IP in their respective
>> suspend state which i feel does change the state of the HW.
>> To get the correct snapshot of the GPU register we should not be
>> fiddling with the HW IP at least till we capture the dump and that is
>> the intention behind the change.
>>
>> Do you think there is a problem in this approach?
>>>
>>> amdgpu_device_set_pg_state(adev, AMD_PG_STATE_UNGATE);
>>> amdgpu_device_set_cg_state(adev, AMD_CG_STATE_UNGATE);
>> Adding this does sounds better to enable just before we dump the
>> registers but i am not very sure if ungating would be clean here or
>> not. i Could try quickly adding these two calls just before dump.
>>
>> All i am worried if it does change some register reflecting the
>> original state of registers at dump.
>>
I was thinking that if it includes some GFX regs and the hang happened
because of some SDMA/VCN jobs which somehow keeps GFXOFF state intact.
BTW, since there is already dump_ip state which could capture IP regs
separately and handle their power/clock gate situations, do you think
this generic one is still needed?
Thanks,
Lijo
>> What u suggest ?
>> Regards
>> Sunil
> I quickly validated on Navi22 by adding below call just before we dump
> registers
> if(!test_bit(AMDGPU_SKIP_COREDUMP, &reset_context->flags)) {
>
> amdgpu_device_set_pg_state(adev, AMD_PG_STATE_UNGATE);
> amdgpu_device_set_cg_state(adev, AMD_CG_STATE_UNGATE);
>
> amdgpu_reset_reg_dumps(tmp_adev);
> dev_info(tmp_adev->dev, "Dumping IP State\n");
> /* Trigger ip dump before we reset the asic */
> for(i=0; i<tmp_adev->num_ip_blocks; i++)
> if(tmp_adev->ip_blocks[i].version->funcs->dump_ip_state)
> tmp_adev->ip_blocks[i].version->funcs->dump_ip_state(
> (void*)tmp_adev);
> dev_info(tmp_adev->dev, "Dumping IP State Completed\n");
> }
> If this sounds fine with you i am update that. Regards Sunil Khatri
>>
>>>
>>> Thanks,
>>> Lijo
>>>
>>>>> +
>>>>> +
>>>>> trace_amdgpu_reset_reg_dumps(adev->reset_info.reset_dump_reg_list[i],
>>>>> +
>>>>> adev->reset_info.reset_dump_reg_value[i]);
>>>>> + }
>>>>> +
>>>>> + return 0;
>>>>> +}
>>>>> +
>>>>> int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
>>>>> struct amdgpu_reset_context
>>>>> *reset_context)
>>>>> {
>>>>> int i, r = 0;
>>>>> struct amdgpu_job *job = NULL;
>>>>> + struct amdgpu_device *tmp_adev = reset_context->reset_req_dev;
>>>>> bool need_full_reset =
>>>>> test_bit(AMDGPU_NEED_FULL_RESET,
>>>>> &reset_context->flags);
>>>>>
>>>>> @@ -5340,6 +5358,18 @@ int amdgpu_device_pre_asic_reset(struct
>>>>> amdgpu_device *adev,
>>>>> }
>>>>> }
>>>>>
>>>>> + if (!test_bit(AMDGPU_SKIP_COREDUMP,
>>>>> &reset_context->flags)) {
>>>>> + amdgpu_reset_reg_dumps(tmp_adev);
>>>>> +
>>>>> + dev_info(tmp_adev->dev, "Dumping IP State\n");
>>>>> + /* Trigger ip dump before we reset the asic */
>>>>> + for (i = 0; i < tmp_adev->num_ip_blocks; i++)
>>>>> + if
>>>>> (tmp_adev->ip_blocks[i].version->funcs->dump_ip_state)
>>>>> +
>>>>> tmp_adev->ip_blocks[i].version->funcs->dump_ip_state(
>>>>> + (void
>>>>> *)tmp_adev);
>>>>> + dev_info(tmp_adev->dev, "Dumping IP State
>>>>> Completed\n");
>>>>> + }
>>>>> +
>>>>> if (need_full_reset)
>>>>> r = amdgpu_device_ip_suspend(adev);
>>>>> if (need_full_reset)
>>>>> @@ -5352,47 +5382,17 @@ int amdgpu_device_pre_asic_reset(struct
>>>>> amdgpu_device *adev,
>>>>> return r;
>>>>> }
>>>>>
>>>>> -static int amdgpu_reset_reg_dumps(struct amdgpu_device *adev)
>>>>> -{
>>>>> - int i;
>>>>> -
>>>>> - lockdep_assert_held(&adev->reset_domain->sem);
>>>>> -
>>>>> - for (i = 0; i < adev->reset_info.num_regs; i++) {
>>>>> - adev->reset_info.reset_dump_reg_value[i] =
>>>>> -
>>>>> RREG32(adev->reset_info.reset_dump_reg_list[i]);
>>>>> -
>>>>> -
>>>>> trace_amdgpu_reset_reg_dumps(adev->reset_info.reset_dump_reg_list[i],
>>>>> -
>>>>> adev->reset_info.reset_dump_reg_value[i]);
>>>>> - }
>>>>> -
>>>>> - return 0;
>>>>> -}
>>>>> -
>>>>> int amdgpu_do_asic_reset(struct list_head *device_list_handle,
>>>>> struct amdgpu_reset_context *reset_context)
>>>>> {
>>>>> struct amdgpu_device *tmp_adev = NULL;
>>>>> bool need_full_reset, skip_hw_reset, vram_lost = false;
>>>>> int r = 0;
>>>>> - uint32_t i;
>>>>>
>>>>> /* Try reset handler method first */
>>>>> tmp_adev = list_first_entry(device_list_handle, struct
>>>>> amdgpu_device,
>>>>> reset_list);
>>>>>
>>>>> - if (!test_bit(AMDGPU_SKIP_COREDUMP, &reset_context->flags)) {
>>>>> - amdgpu_reset_reg_dumps(tmp_adev);
>>>>> -
>>>>> - dev_info(tmp_adev->dev, "Dumping IP State\n");
>>>>> - /* Trigger ip dump before we reset the asic */
>>>>> - for (i = 0; i < tmp_adev->num_ip_blocks; i++)
>>>>> - if
>>>>> (tmp_adev->ip_blocks[i].version->funcs->dump_ip_state)
>>>>> - tmp_adev->ip_blocks[i].version->funcs
>>>>> - ->dump_ip_state((void *)tmp_adev);
>>>>> - dev_info(tmp_adev->dev, "Dumping IP State
>>>>> Completed\n");
>>>>> - }
>>>>> -
>>>>> reset_context->reset_device_list = device_list_handle;
>>>>> r = amdgpu_reset_perform_reset(tmp_adev, reset_context);
>>>>> /* If reset handler not implemented, continue; otherwise
>>>>> return */
>>>>> --
>>>>> 2.34.1
>>>>>
next prev parent reply other threads:[~2024-07-26 15:07 UTC|newest]
Thread overview: 15+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-07-26 12:47 [PATCH 1/2] drm/amdgpu: print VCN instance dump for valid instance Sunil Khatri
2024-07-26 12:47 ` [PATCH 2/2] drm/amdgpu: trigger ip dump before suspend of IP's Sunil Khatri
2024-07-26 13:12 ` Alex Deucher
2024-07-26 13:48 ` Lazar, Lijo
2024-07-26 14:23 ` Khatri, Sunil
2024-07-26 14:41 ` Khatri, Sunil
2024-07-26 15:06 ` Lazar, Lijo [this message]
2024-07-26 17:16 ` Khatri, Sunil
2024-07-26 18:43 ` Alex Deucher
2024-07-26 19:21 ` Khatri, Sunil
2024-07-29 4:38 ` Lazar, Lijo
2024-07-29 5:38 ` Khatri, Sunil
2024-07-29 5:47 ` Lazar, Lijo
2024-07-29 8:27 ` Khatri, Sunil
2024-07-26 13:12 ` [PATCH 1/2] drm/amdgpu: print VCN instance dump for valid instance Alex Deucher
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=77bb2c37-d906-4c76-b87d-effbbe6064e4@amd.com \
--to=lijo.lazar@amd.com \
--cc=Leo.Liu@amd.com \
--cc=alexander.deucher@amd.com \
--cc=alexdeucher@gmail.com \
--cc=amd-gfx@lists.freedesktop.org \
--cc=christian.koenig@amd.com \
--cc=sukhatri@amd.com \
--cc=sunil.khatri@amd.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox