public inbox for linux-kernel@vger.kernel.org
From: Mario Limonciello <mario.limonciello@amd.com>
To: Lu Yao <yaolu@kylinos.cn>,
	alexander.deucher@amd.com, christian.koenig@amd.com,
	Xinhui.Pan@amd.com, kenneth.feng@amd.com
Cc: lijo.lazar@amd.com, Hawking.Zhang@amd.com,
	andrealmeid@igalia.com, hamza.mahfooz@amd.com,
	candice.li@amd.com, victorchengchi.lu@amd.com,
	sunil.khatri@amd.com, Jun.Ma2@amd.com, kevinyang.wang@amd.com,
	Tim.Huang@amd.com, jesse.zhang@amd.com,
	amd-gfx@lists.freedesktop.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] drm/amdgpu: fix OLAND card ip_init failed during kdump caputrue kernel boot
Date: Thu, 22 Aug 2024 09:05:53 -0500
Message-ID: <5dcd603a-7d62-439d-9a07-9d7d9324e0b6@amd.com>
In-Reply-To: <20240723094232.162319-1-yaolu@kylinos.cn>

On 7/23/2024 04:42, Lu Yao wrote:
> [Why]
> When running kdump test on a machine with R7340 card, a hang is caused due
> to the failure of 'amdgpu_device_ip_init()', error message as follows:
> 
>    '[drm:amdgpu_device_ip_init [amdgpu]] *ERROR* hw_init of IP block <si_dpm> failed -22'
>    '[drm:uvd_v3_1_hw_init [amdgpu]] *ERROR* amdgpu: UVD Firmware validate fail (-22).'
>    '[drm:amdgpu_device_ip_init [amdgpu]] *ERROR* hw_init of IP block <uvd_v3_1> failed -22'
>    'amdgpu 0000:01:00.0: amdgpu: amdgpu_device_ip_init failed'
>    'amdgpu 0000:01:00.0: amdgpu: Fatal error during GPU init'
> 
> This is because the caputrue kernel does not power off when it starts,

Presumably you mean:
s/caputrue/capture/

> cause hardware status does not reset.
> 
> [How]
> Add 'is_kdump_kernel()' judgment.
> For 'si_dpm' block, use disable and then enable.
> For 'uvd_v3_1' block, skip loading during the initialization phase.
> 
> Signed-off-by: Lu Yao <yaolu@kylinos.cn>
> ---
> During test, I first modified the 'amdgpu_device_ip_hw_init_phase*', make
> it does not end directly when a block hw_init failed.
> 
> After analysis, 'si_dpm' block failed at 'si_dpm_enable()->
> amdgpu_si_is_smc_running()', calling 'si_dpm_disable()' before can resolve.
> 'uvd_v3_1' block failed at 'uvd_v3_1_hw_init()->uvd_v3_1_fw_validate()',
> read mmUVD_FW_STATUS value is 0x27220102, I didn't find out why. But for
> caputrue kernel, UVD is not required. Therefore, don't added this block.

Hmm, a few thoughts.

1) Although you hit this on the R7340, the issues you're identifying
probably apply to most AMD GPUs.  Such checks might be better upleveled
to earlier in the IP discovery code.

2) I'd actually argue we don't want to have the kdump capture kernel do 
ANY hardware init.  You're going to lose hardware state which "could" be 
valuable information for debugging a problem that caused a panic.

That being said, I'm not really sure what framebuffer can drive the 
display across a kexec if you don't load amdgpu.  What actually happens 
if you blacklist amdgpu in the capture kernel?

What happens with your patch in place?

I'd at least like to see a kernel log from both cases.
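
As a rough illustration of point 1), here is a userspace mock of doing
the kdump check once in a common "add IP block" helper instead of per
ASIC in si_set_ip_blocks().  This is only a sketch, not amdgpu code:
'struct ip_block', the 'needed_in_kdump' flag, 'ip_block_add()',
'setup_blocks()' and 'is_kdump_kernel_mock()' are all hypothetical names
made up for the example; only the block names come from the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's is_kdump_kernel(). */
static bool mock_is_kdump;

static bool is_kdump_kernel_mock(void)
{
	return mock_is_kdump;
}

/* Hypothetical, simplified IP block descriptor (not the amdgpu struct). */
struct ip_block {
	const char *name;
	bool needed_in_kdump;	/* assumed flag: keep this block in a capture kernel */
};

#define MAX_BLOCKS 8

struct device {
	const struct ip_block *blocks[MAX_BLOCKS];
	size_t num_blocks;
};

/*
 * Common add helper: the kdump decision is made here once, so the
 * per-ASIC setup code needs no is_kdump_kernel() checks of its own.
 * Returns true if the block was added.
 */
static bool ip_block_add(struct device *dev, const struct ip_block *blk)
{
	if (is_kdump_kernel_mock() && !blk->needed_in_kdump)
		return false;
	if (dev->num_blocks >= MAX_BLOCKS)
		return false;
	dev->blocks[dev->num_blocks++] = blk;
	return true;
}

/* Display is assumed to be wanted in the capture kernel, UVD is not. */
static const struct ip_block dce_block = { "dce_v6_4", true };
static const struct ip_block uvd_block = { "uvd_v3_1", false };

/* Mimics a per-ASIC setup routine; returns how many blocks were added. */
static size_t setup_blocks(struct device *dev, bool kdump)
{
	mock_is_kdump = kdump;
	memset(dev, 0, sizeof(*dev));
	ip_block_add(dev, &dce_block);
	ip_block_add(dev, &uvd_block);
	return dev->num_blocks;
}
```

With something along these lines, a normal boot would register both
blocks and a capture kernel would register only the display block,
whichever ASIC-specific setup routine ran.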

> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        | 1 +
>   drivers/gpu/drm/amd/amdgpu/si.c            | 6 ++++--
>   drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c | 6 ++++++
>   3 files changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 137a88b8de45..52ebc24561c4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -50,6 +50,7 @@
>   #include <linux/hashtable.h>
>   #include <linux/dma-fence.h>
>   #include <linux/pci.h>
> +#include <linux/crash_dump.h>
>   
>   #include <drm/ttm/ttm_bo.h>
>   #include <drm/ttm/ttm_placement.h>
> diff --git a/drivers/gpu/drm/amd/amdgpu/si.c b/drivers/gpu/drm/amd/amdgpu/si.c
> index 85235470e872..fc0daed1b829 100644
> --- a/drivers/gpu/drm/amd/amdgpu/si.c
> +++ b/drivers/gpu/drm/amd/amdgpu/si.c
> @@ -2739,7 +2739,8 @@ int si_set_ip_blocks(struct amdgpu_device *adev)
>   #endif
>   		else
>   			amdgpu_device_ip_block_add(adev, &dce_v6_0_ip_block);
> -		amdgpu_device_ip_block_add(adev, &uvd_v3_1_ip_block);
> +		if (!is_kdump_kernel())
> +			amdgpu_device_ip_block_add(adev, &uvd_v3_1_ip_block);
>   		/* amdgpu_device_ip_block_add(adev, &vce_v1_0_ip_block); */
>   		break;
>   	case CHIP_OLAND:
> @@ -2757,7 +2758,8 @@ int si_set_ip_blocks(struct amdgpu_device *adev)
>   #endif
>   		else
>   			amdgpu_device_ip_block_add(adev, &dce_v6_4_ip_block);
> -		amdgpu_device_ip_block_add(adev, &uvd_v3_1_ip_block);
> +		if (!is_kdump_kernel())
> +			amdgpu_device_ip_block_add(adev, &uvd_v3_1_ip_block);
>   		/* amdgpu_device_ip_block_add(adev, &vce_v1_0_ip_block); */
>   		break;
>   	case CHIP_HAINAN:
> diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c b/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c
> index a1baa13ab2c2..8700a22ba809 100644
> --- a/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c
> +++ b/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c
> @@ -1848,6 +1848,7 @@ static int si_calculate_sclk_params(struct amdgpu_device *adev,
>   static void si_thermal_start_smc_fan_control(struct amdgpu_device *adev);
>   static void si_fan_ctrl_set_default_mode(struct amdgpu_device *adev);
>   static void si_dpm_set_irq_funcs(struct amdgpu_device *adev);
> +static void si_dpm_disable(struct amdgpu_device *adev);
>   
>   static struct si_power_info *si_get_pi(struct amdgpu_device *adev)
>   {
> @@ -6811,6 +6812,11 @@ static int si_dpm_enable(struct amdgpu_device *adev)
>   	struct amdgpu_ps *boot_ps = adev->pm.dpm.boot_ps;
>   	int ret;
>   
> +	if (is_kdump_kernel()) {
> +		si_dpm_disable(adev);
> +		udelay(50);
> +	}
> +
>   	if (amdgpu_si_is_smc_running(adev))
>   		return -EINVAL;
>   	if (pi->voltage_control || si_pi->voltage_control_svi2)

