Re: [PATCH 1/1] drm/amdgpu: Use device wedged event

All of lore.kernel.org
 help / color / mirror / Atom feed

From: kernel test robot <lkp@intel.com>
To: "André Almeida" <andrealmeid@igalia.com>,
	"Raag Jadav" <raag.jadav@intel.com>,
	airlied@gmail.com, simona@ffwll.ch, lucas.demarchi@intel.com,
	rodrigo.vivi@intel.com, jani.nikula@linux.intel.com,
	andriy.shevchenko@linux.intel.com, lina@asahilina.net,
	michal.wajdeczko@intel.com, christian.koenig@amd.com
Cc: oe-kbuild-all@lists.linux.dev, intel-gfx@lists.freedesktop.org,
	dri-devel@lists.freedesktop.org, himal.prasad.ghimiray@intel.com,
	aravind.iddamsetty@linux.intel.com, anshuman.gupta@intel.com,
	alexander.deucher@amd.com, andrealmeid@igalia.com,
	amd-gfx@lists.freedesktop.org, kernel-dev@igalia.com
Subject: Re: [PATCH 1/1] drm/amdgpu: Use device wedged event
Date: Fri, 20 Dec 2024 22:06:33 +0800	[thread overview]
Message-ID: <202412202144.SalviKDh-lkp@intel.com> (raw)
In-Reply-To: <20241212190909.28559-2-andrealmeid@igalia.com>

Hi André,

kernel test robot noticed the following build errors:

[auto build test ERROR on linus/master]
[also build test ERROR on drm-intel/for-linux-next drm-intel/for-linux-next-fixes drm-tip/drm-tip v6.13-rc3 next-20241220]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Andr-Almeida/drm-amdgpu-Use-device-wedged-event/20241213-031134
base:   linus/master
patch link:    https://lore.kernel.org/r/20241212190909.28559-2-andrealmeid%40igalia.com
patch subject: [PATCH 1/1] drm/amdgpu: Use device wedged event
config: s390-randconfig-001-20241220 (https://download.01.org/0day-ci/archive/20241220/202412202144.SalviKDh-lkp@intel.com/config)
compiler: s390-linux-gcc (GCC) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241220/202412202144.SalviKDh-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202412202144.SalviKDh-lkp@intel.com/

All errors (new ones prefixed by >>):

   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c: In function 'amdgpu_device_gpu_recover':
>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:6061:9: error: implicit declaration of function 'drm_dev_wedged_event' [-Wimplicit-function-declaration]
    6061 |         drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
         |         ^~~~~~~~~~~~~~~~~~~~
>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:6061:49: error: 'DRM_WEDGE_RECOVERY_NONE' undeclared (first use in this function); did you mean 'DRM_MODE_ENCODER_NONE'?
    6061 |         drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
         |                                                 ^~~~~~~~~~~~~~~~~~~~~~~
         |                                                 DRM_MODE_ENCODER_NONE
   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:6061:49: note: each undeclared identifier is reported only once for each function it appears in


vim +/drm_dev_wedged_event +6061 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

  5794	
  5795	/**
  5796	 * amdgpu_device_gpu_recover - reset the asic and recover scheduler
  5797	 *
  5798	 * @adev: amdgpu_device pointer
  5799	 * @job: which job trigger hang
  5800	 * @reset_context: amdgpu reset context pointer
  5801	 *
  5802	 * Attempt to reset the GPU if it has hung (all asics).
  5803	 * Attempt to do soft-reset or full-reset and reinitialize Asic
  5804	 * Returns 0 for success or an error on failure.
  5805	 */
  5806	
  5807	int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
  5808				      struct amdgpu_job *job,
  5809				      struct amdgpu_reset_context *reset_context)
  5810	{
  5811		struct list_head device_list, *device_list_handle =  NULL;
  5812		bool job_signaled = false;
  5813		struct amdgpu_hive_info *hive = NULL;
  5814		struct amdgpu_device *tmp_adev = NULL;
  5815		int i, r = 0;
  5816		bool need_emergency_restart = false;
  5817		bool audio_suspended = false;
  5818		int retry_limit = AMDGPU_MAX_RETRY_LIMIT;
  5819	
  5820		/*
  5821		 * Special case: RAS triggered and full reset isn't supported
  5822		 */
  5823		need_emergency_restart = amdgpu_ras_need_emergency_restart(adev);
  5824	
  5825		/*
  5826		 * Flush RAM to disk so that after reboot
  5827		 * the user can read log and see why the system rebooted.
  5828		 */
  5829		if (need_emergency_restart && amdgpu_ras_get_context(adev) &&
  5830			amdgpu_ras_get_context(adev)->reboot) {
  5831			DRM_WARN("Emergency reboot.");
  5832	
  5833			ksys_sync_helper();
  5834			emergency_restart();
  5835		}
  5836	
  5837		dev_info(adev->dev, "GPU %s begin!\n",
  5838			need_emergency_restart ? "jobs stop":"reset");
  5839	
  5840		if (!amdgpu_sriov_vf(adev))
  5841			hive = amdgpu_get_xgmi_hive(adev);
  5842		if (hive)
  5843			mutex_lock(&hive->hive_lock);
  5844	
  5845		reset_context->job = job;
  5846		reset_context->hive = hive;
  5847		/*
  5848		 * Build list of devices to reset.
  5849		 * In case we are in XGMI hive mode, resort the device list
  5850		 * to put adev in the 1st position.
  5851		 */
  5852		INIT_LIST_HEAD(&device_list);
  5853		if (!amdgpu_sriov_vf(adev) && (adev->gmc.xgmi.num_physical_nodes > 1) && hive) {
  5854			list_for_each_entry(tmp_adev, &hive->device_list, gmc.xgmi.head) {
  5855				list_add_tail(&tmp_adev->reset_list, &device_list);
  5856				if (adev->shutdown)
  5857					tmp_adev->shutdown = true;
  5858			}
  5859			if (!list_is_first(&adev->reset_list, &device_list))
  5860				list_rotate_to_front(&adev->reset_list, &device_list);
  5861			device_list_handle = &device_list;
  5862		} else {
  5863			list_add_tail(&adev->reset_list, &device_list);
  5864			device_list_handle = &device_list;
  5865		}
  5866	
  5867		if (!amdgpu_sriov_vf(adev)) {
  5868			r = amdgpu_device_health_check(device_list_handle);
  5869			if (r)
  5870				goto end_reset;
  5871		}
  5872	
  5873		/* We need to lock reset domain only once both for XGMI and single device */
  5874		tmp_adev = list_first_entry(device_list_handle, struct amdgpu_device,
  5875					    reset_list);
  5876		amdgpu_device_lock_reset_domain(tmp_adev->reset_domain);
  5877	
  5878		/* block all schedulers and reset given job's ring */
  5879		list_for_each_entry(tmp_adev, device_list_handle, reset_list) {
  5880	
  5881			amdgpu_device_set_mp1_state(tmp_adev);
  5882	
  5883			/*
  5884			 * Try to put the audio codec into suspend state
  5885			 * before gpu reset started.
  5886			 *
  5887			 * Due to the power domain of the graphics device
  5888			 * is shared with AZ power domain. Without this,
  5889			 * we may change the audio hardware from behind
  5890			 * the audio driver's back. That will trigger
  5891			 * some audio codec errors.
  5892			 */
  5893			if (!amdgpu_device_suspend_display_audio(tmp_adev))
  5894				audio_suspended = true;
  5895	
  5896			amdgpu_ras_set_error_query_ready(tmp_adev, false);
  5897	
  5898			cancel_delayed_work_sync(&tmp_adev->delayed_init_work);
  5899	
  5900			amdgpu_amdkfd_pre_reset(tmp_adev, reset_context);
  5901	
  5902			/*
  5903			 * Mark these ASICs to be reseted as untracked first
  5904			 * And add them back after reset completed
  5905			 */
  5906			amdgpu_unregister_gpu_instance(tmp_adev);
  5907	
  5908			drm_client_dev_suspend(adev_to_drm(tmp_adev), false);
  5909	
  5910			/* disable ras on ALL IPs */
  5911			if (!need_emergency_restart &&
  5912			      amdgpu_device_ip_need_full_reset(tmp_adev))
  5913				amdgpu_ras_suspend(tmp_adev);
  5914	
  5915			for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
  5916				struct amdgpu_ring *ring = tmp_adev->rings[i];
  5917	
  5918				if (!amdgpu_ring_sched_ready(ring))
  5919					continue;
  5920	
  5921				drm_sched_stop(&ring->sched, job ? &job->base : NULL);
  5922	
  5923				if (need_emergency_restart)
  5924					amdgpu_job_stop_all_jobs_on_sched(&ring->sched);
  5925			}
  5926			atomic_inc(&tmp_adev->gpu_reset_counter);
  5927		}
  5928	
  5929		if (need_emergency_restart)
  5930			goto skip_sched_resume;
  5931	
  5932		/*
  5933		 * Must check guilty signal here since after this point all old
  5934		 * HW fences are force signaled.
  5935		 *
  5936		 * job->base holds a reference to parent fence
  5937		 */
  5938		if (job && dma_fence_is_signaled(&job->hw_fence)) {
  5939			job_signaled = true;
  5940			dev_info(adev->dev, "Guilty job already signaled, skipping HW reset");
  5941			goto skip_hw_reset;
  5942		}
  5943	
  5944	retry:	/* Rest of adevs pre asic reset from XGMI hive. */
  5945		list_for_each_entry(tmp_adev, device_list_handle, reset_list) {
  5946			r = amdgpu_device_pre_asic_reset(tmp_adev, reset_context);
  5947			/*TODO Should we stop ?*/
  5948			if (r) {
  5949				dev_err(tmp_adev->dev, "GPU pre asic reset failed with err, %d for drm dev, %s ",
  5950					  r, adev_to_drm(tmp_adev)->unique);
  5951				tmp_adev->asic_reset_res = r;
  5952			}
  5953		}
  5954	
  5955		/* Actual ASIC resets if needed.*/
  5956		/* Host driver will handle XGMI hive reset for SRIOV */
  5957		if (amdgpu_sriov_vf(adev)) {
  5958			if (amdgpu_ras_get_fed_status(adev) || amdgpu_virt_rcvd_ras_interrupt(adev)) {
  5959				dev_dbg(adev->dev, "Detected RAS error, wait for FLR completion\n");
  5960				amdgpu_ras_set_fed(adev, true);
  5961				set_bit(AMDGPU_HOST_FLR, &reset_context->flags);
  5962			}
  5963	
  5964			r = amdgpu_device_reset_sriov(adev, reset_context);
  5965			if (AMDGPU_RETRY_SRIOV_RESET(r) && (retry_limit--) > 0) {
  5966				amdgpu_virt_release_full_gpu(adev, true);
  5967				goto retry;
  5968			}
  5969			if (r)
  5970				adev->asic_reset_res = r;
  5971		} else {
  5972			r = amdgpu_do_asic_reset(device_list_handle, reset_context);
  5973			if (r && r == -EAGAIN)
  5974				goto retry;
  5975		}
  5976	
  5977		list_for_each_entry(tmp_adev, device_list_handle, reset_list) {
  5978			/*
  5979			 * Drop any pending non scheduler resets queued before reset is done.
  5980			 * Any reset scheduled after this point would be valid. Scheduler resets
  5981			 * were already dropped during drm_sched_stop and no new ones can come
  5982			 * in before drm_sched_start.
  5983			 */
  5984			amdgpu_device_stop_pending_resets(tmp_adev);
  5985		}
  5986	
  5987	skip_hw_reset:
  5988	
  5989		/* Post ASIC reset for all devs .*/
  5990		list_for_each_entry(tmp_adev, device_list_handle, reset_list) {
  5991	
  5992			for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
  5993				struct amdgpu_ring *ring = tmp_adev->rings[i];
  5994	
  5995				if (!amdgpu_ring_sched_ready(ring))
  5996					continue;
  5997	
  5998				drm_sched_start(&ring->sched, 0);
  5999			}
  6000	
  6001			if (!drm_drv_uses_atomic_modeset(adev_to_drm(tmp_adev)) && !job_signaled)
  6002				drm_helper_resume_force_mode(adev_to_drm(tmp_adev));
  6003	
  6004			if (tmp_adev->asic_reset_res)
  6005				r = tmp_adev->asic_reset_res;
  6006	
  6007			tmp_adev->asic_reset_res = 0;
  6008	
  6009			if (r) {
  6010				/* bad news, how to tell it to userspace ?
  6011				 * for ras error, we should report GPU bad status instead of
  6012				 * reset failure
  6013				 */
  6014				if (reset_context->src != AMDGPU_RESET_SRC_RAS ||
  6015				    !amdgpu_ras_eeprom_check_err_threshold(tmp_adev))
  6016					dev_info(tmp_adev->dev, "GPU reset(%d) failed\n",
  6017						atomic_read(&tmp_adev->gpu_reset_counter));
  6018				amdgpu_vf_error_put(tmp_adev, AMDGIM_ERROR_VF_GPU_RESET_FAIL, 0, r);
  6019			} else {
  6020				dev_info(tmp_adev->dev, "GPU reset(%d) succeeded!\n", atomic_read(&tmp_adev->gpu_reset_counter));
  6021				if (amdgpu_acpi_smart_shift_update(adev_to_drm(tmp_adev), AMDGPU_SS_DEV_D0))
  6022					DRM_WARN("smart shift update failed\n");
  6023			}
  6024		}
  6025	
  6026	skip_sched_resume:
  6027		list_for_each_entry(tmp_adev, device_list_handle, reset_list) {
  6028			/* unlock kfd: SRIOV would do it separately */
  6029			if (!need_emergency_restart && !amdgpu_sriov_vf(tmp_adev))
  6030				amdgpu_amdkfd_post_reset(tmp_adev);
  6031	
  6032			/* kfd_post_reset will do nothing if kfd device is not initialized,
  6033			 * need to bring up kfd here if it's not be initialized before
  6034			 */
  6035			if (!adev->kfd.init_complete)
  6036				amdgpu_amdkfd_device_init(adev);
  6037	
  6038			if (audio_suspended)
  6039				amdgpu_device_resume_display_audio(tmp_adev);
  6040	
  6041			amdgpu_device_unset_mp1_state(tmp_adev);
  6042	
  6043			amdgpu_ras_set_error_query_ready(tmp_adev, true);
  6044		}
  6045	
  6046		tmp_adev = list_first_entry(device_list_handle, struct amdgpu_device,
  6047						    reset_list);
  6048		amdgpu_device_unlock_reset_domain(tmp_adev->reset_domain);
  6049	
  6050	end_reset:
  6051		if (hive) {
  6052			mutex_unlock(&hive->hive_lock);
  6053			amdgpu_put_xgmi_hive(hive);
  6054		}
  6055	
  6056		if (r)
  6057			dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
  6058	
  6059		atomic_set(&adev->reset_domain->reset_res, r);
  6060	
> 6061		drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
  6062	
  6063		return r;
  6064	}
  6065	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

next prev parent reply	other threads:[~2024-12-20 14:07 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-12-12 19:09 [PATCH 0/1] drm/amdgpu: Use device wedged event André Almeida
2024-12-12 19:09 ` [PATCH 1/1] " André Almeida
2024-12-13  7:34   ` Christian König
2024-12-13  7:46     ` Sharma, Shashank
2024-12-13 14:15     ` André Almeida
2024-12-13 14:36       ` Raag Jadav
2024-12-13 15:56         ` André Almeida
2024-12-16 10:18           ` Christian König
2024-12-16 10:38             ` Lazar, Lijo
2024-12-16 13:04               ` André Almeida
2024-12-16 13:10                 ` Christian König
2024-12-16 13:15                   ` André Almeida
2024-12-16 13:36                     ` Lazar, Lijo
2024-12-16 13:39                       ` Christian König
2024-12-16 13:44                         ` Lazar, Lijo
2024-12-16 13:36                     ` Christian König
2024-12-16 13:57                     ` Raag Jadav
2024-12-13 15:31   ` Lazar, Lijo
2024-12-20 13:31   ` kernel test robot
2024-12-20 14:06   ` kernel test robot [this message]
2024-12-13 14:10 ` [PATCH 0/1] " Lucas De Marchi
2024-12-13 14:25 ` ✗ Fi.CI.BUILD: failure for " Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=202412202144.SalviKDh-lkp@intel.com \
    --to=lkp@intel.com \
    --cc=airlied@gmail.com \
    --cc=alexander.deucher@amd.com \
    --cc=amd-gfx@lists.freedesktop.org \
    --cc=andrealmeid@igalia.com \
    --cc=andriy.shevchenko@linux.intel.com \
    --cc=anshuman.gupta@intel.com \
    --cc=aravind.iddamsetty@linux.intel.com \
    --cc=christian.koenig@amd.com \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=himal.prasad.ghimiray@intel.com \
    --cc=intel-gfx@lists.freedesktop.org \
    --cc=jani.nikula@linux.intel.com \
    --cc=kernel-dev@igalia.com \
    --cc=lina@asahilina.net \
    --cc=lucas.demarchi@intel.com \
    --cc=michal.wajdeczko@intel.com \
    --cc=oe-kbuild-all@lists.linux.dev \
    --cc=raag.jadav@intel.com \
    --cc=rodrigo.vivi@intel.com \
    --cc=simona@ffwll.ch \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.