From: Felix Kuehling <felix.kuehling@amd.com>
To: Jonathan Kim <jonathan.kim@amd.com>,
amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org
Subject: Re: [PATCH 07/32] drm/amdgpu: add gfx9.4.1 hw debug mode enable and disable calls
Date: Thu, 16 Feb 2023 18:01:40 -0500 [thread overview]
Message-ID: <cd1b640c-ba59-4627-ad2f-b9593833598e@amd.com> (raw)
In-Reply-To: <20230125195401.4183544-8-jonathan.kim@amd.com>
On 2023-01-25 14:53, Jonathan Kim wrote:
> On GFX9.4.1, the implicit wait count instruction on s_barrier is
> disabled by default in the driver during normal operation for
> performance requirements.
>
> There is a hardware bug in GFX9.4.1 where if the implicit wait count
> instruction after an s_barrier instruction is disabled, any wave that
> hits an exception may step over the s_barrier when returning from the
> trap handler with the barrier logic having no ability to be
> aware of this, thereby causing other waves to wait at the barrier
> indefinitely resulting in a shader hang. This bug has been corrected
> for GFX9.4.2 and onward.
>
> Since the debugger subscribes to hardware exceptions, in order to avoid
> this bug, the debugger must enable implicit wait count on s_barrier
> for a debug session and disable it on detach.
>
> In order to change this setting in the in the device global SQ_CONFIG
> register, the GFX pipeline must be idle. GFX9.4.1 as a compute device
> will either dispatch work through the compute ring buffers used for
> image post processing or through the hardware scheduler by the KFD.
>
> Have the KGD suspend and drain the compute ring buffer, then suspend the
> hardware scheduler and block any future KFD process job requests before
> changing the implicit wait count setting. Once set, resume all work.
>
> v2: remove flush on kfd suspend as that will be a general fix required
> outside of this patch series.
> comment on trap enable/disable ignored variables.
>
> Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu.h | 3 +
> .../drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c | 118 +++++++++++++++++-
> drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 4 +-
> 3 files changed, 122 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 872450a3a164..3c03e34c194c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -1041,6 +1041,9 @@ struct amdgpu_device {
> struct pci_saved_state *pci_state;
> pci_channel_state_t pci_channel_state;
>
> + /* Track auto wait count on s_barrier settings */
> + bool barrier_has_auto_waitcnt;
> +
> struct amdgpu_reset_control *reset_cntl;
> uint32_t ip_versions[MAX_HWIP][HWIP_MAX_INSTANCE];
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c
> index 4191af5a3f13..d5bb86ccd617 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_arcturus.c
> @@ -26,6 +26,7 @@
> #include "amdgpu.h"
> #include "amdgpu_amdkfd.h"
> #include "amdgpu_amdkfd_arcturus.h"
> +#include "amdgpu_reset.h"
> #include "sdma0/sdma0_4_2_2_offset.h"
> #include "sdma0/sdma0_4_2_2_sh_mask.h"
> #include "sdma1/sdma1_4_2_2_offset.h"
> @@ -48,6 +49,8 @@
> #include "amdgpu_amdkfd_gfx_v9.h"
> #include "gfxhub_v1_0.h"
> #include "mmhub_v9_4.h"
> +#include "gc/gc_9_0_offset.h"
> +#include "gc/gc_9_0_sh_mask.h"
>
> #define HQD_N_REGS 56
> #define DUMP_REG(addr) do { \
> @@ -276,6 +279,117 @@ int kgd_arcturus_hqd_sdma_destroy(struct amdgpu_device *adev, void *mqd,
> return 0;
> }
>
> +/*
> + * Helper used to suspend/resume gfx pipe for image post process work to set
> + * barrier behaviour.
> + */
> +static int suspend_resume_compute_scheduler(struct amdgpu_device *adev, bool suspend)
> +{
> + int i, r = 0;
> +
> + for (i = 0; i < adev->gfx.num_compute_rings; i++) {
> + struct amdgpu_ring *ring = &adev->gfx.compute_ring[i];
> +
> + if (!(ring && ring->sched.thread))
> + continue;
> +
> + /* stop secheduler and drain ring. */
> + if (suspend) {
> + drm_sched_stop(&ring->sched, NULL);
> + r = amdgpu_fence_wait_empty(ring);
> + if (r)
> + goto out;
> + } else {
> + drm_sched_start(&ring->sched, false);
> + }
> + }
> +
> +out:
> + /* return on resume or failure to drain rings. */
> + if (!suspend || r)
> + return r;
> +
> + return amdgpu_device_ip_wait_for_idle(adev, GC_HWIP);
> +}
> +
> +static void set_barrier_auto_waitcnt(struct amdgpu_device *adev, bool enable_waitcnt)
> +{
> + uint32_t data;
> +
> + WRITE_ONCE(adev->barrier_has_auto_waitcnt, enable_waitcnt);
> +
> + if (!down_read_trylock(&adev->reset_domain->sem))
> + return;
> +
> + amdgpu_amdkfd_suspend(adev, false);
> +
> + if (suspend_resume_compute_scheduler(adev, true))
> + goto out;
> +
> + data = RREG32(SOC15_REG_OFFSET(GC, 0, mmSQ_CONFIG));
> + data = REG_SET_FIELD(data, SQ_CONFIG, DISABLE_BARRIER_WAITCNT,
> + enable_waitcnt ? 0 : 1);
This could be ..., !enable_waitcnt);
> + WREG32(SOC15_REG_OFFSET(GC, 0, mmSQ_CONFIG), data);
> +
> +out:
> + suspend_resume_compute_scheduler(adev, false);
> +
> + amdgpu_amdkfd_resume(adev, false);
> +
> + up_read(&adev->reset_domain->sem);
> +}
> +
> +/**
Use /* here, since this is not a doc comment.
> + * restore_dbg_reisters is ignored here but is a general interface requirement
Typo: registers
> + * for devices that support GFXOFF and where the RLC save/restore list
> + * does not support hw registers for debugging i.e. the driver has to manually
> + * initialize the debug mode registers after it has disabled GFX off during the
> + * debug session.
> + */
> +static uint32_t kgd_arcturus_enable_debug_trap(struct amdgpu_device *adev,
> + bool restore_dbg_registers,
> + uint32_t vmid)
> +{
> + mutex_lock(&adev->grbm_idx_mutex);
> +
> + kgd_gfx_v9_set_wave_launch_stall(adev, vmid, true);
> +
> + set_barrier_auto_waitcnt(adev, true);
> +
> + WREG32(SOC15_REG_OFFSET(GC, 0, mmSPI_GDBG_TRAP_MASK), 0);
> +
> + kgd_gfx_v9_set_wave_launch_stall(adev, vmid, false);
> +
> + mutex_unlock(&adev->grbm_idx_mutex);
> +
> + return 0;
> +}
> +
> +/**
/*
> + * keep_trap_enabled is ignored here but is a general interface requirement
> + * for devices that support multi-process debugging where the performance
> + * overhead from trap temporary setup needs to be bypassed when the debug
> + * session has ended.
> + */
> +static uint32_t kgd_arcturus_disable_debug_trap(struct amdgpu_device *adev,
> + bool keep_trap_enabled,
> + uint32_t vmid)
> +{
> +
> + mutex_lock(&adev->grbm_idx_mutex);
> +
> + kgd_gfx_v9_set_wave_launch_stall(adev, vmid, true);
> +
> + set_barrier_auto_waitcnt(adev, false);
> +
> + WREG32(SOC15_REG_OFFSET(GC, 0, mmSPI_GDBG_TRAP_MASK), 0);
> +
> + kgd_gfx_v9_set_wave_launch_stall(adev, vmid, false);
> +
> + mutex_unlock(&adev->grbm_idx_mutex);
> +
> + return 0;
> +}
> const struct kfd2kgd_calls arcturus_kfd2kgd = {
> .program_sh_mem_settings = kgd_gfx_v9_program_sh_mem_settings,
> .set_pasid_vmid_mapping = kgd_gfx_v9_set_pasid_vmid_mapping,
> @@ -294,6 +408,8 @@ const struct kfd2kgd_calls arcturus_kfd2kgd = {
> kgd_gfx_v9_get_atc_vmid_pasid_mapping_info,
> .set_vm_context_page_table_base =
> kgd_gfx_v9_set_vm_context_page_table_base,
> + .enable_debug_trap = kgd_arcturus_enable_debug_trap,
> + .disable_debug_trap = kgd_arcturus_disable_debug_trap,
> .get_cu_occupancy = kgd_gfx_v9_get_cu_occupancy,
> - .program_trap_handler_settings = kgd_gfx_v9_program_trap_handler_settings
> + .program_trap_handler_settings = kgd_gfx_v9_program_trap_handler_settings,
> };
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> index 222fe87161b7..56d25a6f1da9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> @@ -2376,8 +2376,8 @@ static void gfx_v9_0_init_sq_config(struct amdgpu_device *adev)
> switch (adev->ip_versions[GC_HWIP][0]) {
> case IP_VERSION(9, 4, 1):
> tmp = RREG32_SOC15(GC, 0, mmSQ_CONFIG);
> - tmp = REG_SET_FIELD(tmp, SQ_CONFIG,
> - DISABLE_BARRIER_WAITCNT, 1);
> + tmp = REG_SET_FIELD(tmp, SQ_CONFIG, DISABLE_BARRIER_WAITCNT,
> + READ_ONCE(adev->barrier_has_auto_waitcnt) ? 0 : 1);
This could be ..., !READ_ONCE(adev->barrier_has_auto_waitcnt));
With those nit-picks fixed, the patch is
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
> WREG32_SOC15(GC, 0, mmSQ_CONFIG, tmp);
> break;
> default:
next prev parent reply other threads:[~2023-02-16 23:45 UTC|newest]
Thread overview: 68+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-01-25 19:53 [PATCH 00/32] Upstream of kernel support for AMDGPU ISA debugging Jonathan Kim
2023-01-25 19:53 ` [PATCH 01/32] drm/amdkfd: add debug and runtime enable interface Jonathan Kim
2023-02-16 22:16 ` Felix Kuehling
2023-01-25 19:53 ` [PATCH 02/32] drm/amdkfd: display debug capabilities Jonathan Kim
2023-02-16 22:24 ` Felix Kuehling
2023-01-25 19:53 ` [PATCH 03/32] drm/amdkfd: prepare per-process debug enable and disable Jonathan Kim
2023-02-16 23:44 ` Felix Kuehling
2023-03-23 19:12 ` Kim, Jonathan
2023-03-23 20:08 ` Felix Kuehling
2023-01-25 19:53 ` [PATCH 04/32] drm/amdgpu: add kgd hw debug mode setting interface Jonathan Kim
2023-01-25 19:53 ` [PATCH 05/32] drm/amdgpu: setup hw debug registers on driver initialization Jonathan Kim
2023-02-16 22:39 ` Felix Kuehling
2023-01-25 19:53 ` [PATCH 06/32] drm/amdgpu: add gfx9 hw debug mode enable and disable calls Jonathan Kim
2023-01-29 5:12 ` kernel test robot
2023-02-16 22:54 ` Felix Kuehling
2023-01-25 19:53 ` [PATCH 07/32] drm/amdgpu: add gfx9.4.1 " Jonathan Kim
2023-01-29 6:34 ` kernel test robot
2023-02-16 23:01 ` Felix Kuehling [this message]
2023-01-25 19:53 ` [PATCH 08/32] drm/amdgpu: add gfx10 " Jonathan Kim
2023-01-29 7:55 ` kernel test robot
2023-02-16 23:11 ` Felix Kuehling
2023-01-25 19:53 ` [PATCH 09/32] drm/amdgpu: add gfx9.4.2 " Jonathan Kim
2023-02-16 23:14 ` Felix Kuehling
2023-01-25 19:53 ` [PATCH 10/32] drm/amdgpu: add gfx11 " Jonathan Kim
2023-02-16 23:19 ` Felix Kuehling
2023-01-25 19:53 ` [PATCH 11/32] drm/amdgpu: add configurable grace period for unmap queues Jonathan Kim
2023-03-20 19:19 ` Felix Kuehling
2023-01-25 19:53 ` [PATCH 12/32] drm/amdkfd: prepare map process for single process debug devices Jonathan Kim
2023-03-20 20:06 ` Felix Kuehling
2023-01-25 19:53 ` [PATCH 13/32] drm/amdgpu: prepare map process for multi-process " Jonathan Kim
2023-03-20 20:16 ` Felix Kuehling
2023-01-25 19:53 ` [PATCH 14/32] drm/amdgpu: expose debug api for mes Jonathan Kim
2023-03-20 20:47 ` Felix Kuehling
2023-01-25 19:53 ` [PATCH 15/32] drm/amdkfd: prepare trap workaround for gfx11 Jonathan Kim
2023-03-20 21:49 ` Felix Kuehling
2023-03-23 13:50 ` Kim, Jonathan
2023-03-23 14:00 ` Felix Kuehling
2023-01-25 19:53 ` [PATCH 16/32] drm/amdkfd: add per process hw trap enable and disable functions Jonathan Kim
2023-03-20 23:06 ` Felix Kuehling
2023-01-25 19:53 ` [PATCH 17/32] drm/amdkfd: add raise exception event function Jonathan Kim
2023-03-20 23:18 ` Felix Kuehling
2023-01-25 19:53 ` [PATCH 18/32] drm/amdkfd: add send exception operation Jonathan Kim
2023-03-20 23:26 ` Felix Kuehling
2023-01-25 19:53 ` [PATCH 19/32] drm/amdkfd: add runtime enable operation Jonathan Kim
2023-03-21 0:31 ` Felix Kuehling
2023-03-23 19:45 ` Kim, Jonathan
2023-01-25 19:53 ` [PATCH 20/32] drm/amdkfd: add debug trap enabled flag to tma Jonathan Kim
2023-01-25 19:53 ` [PATCH 21/32] drm/amdkfd: update process interrupt handling for debug events Jonathan Kim
2023-03-21 21:07 ` Felix Kuehling
2023-01-25 19:53 ` [PATCH 22/32] drm/amdkfd: add debug set exceptions enabled operation Jonathan Kim
2023-01-25 19:53 ` [PATCH 23/32] drm/amdkfd: add debug wave launch override operation Jonathan Kim
2023-03-21 21:37 ` Felix Kuehling
2023-01-25 19:53 ` [PATCH 24/32] drm/amdkfd: add debug wave launch mode operation Jonathan Kim
2023-03-21 21:42 ` Felix Kuehling
2023-01-25 19:53 ` [PATCH 25/32] drm/amdkfd: add debug suspend and resume process queues operation Jonathan Kim
2023-03-21 22:16 ` Felix Kuehling
2023-01-25 19:53 ` [PATCH 26/32] drm/amdkfd: add debug set and clear address watch points operation Jonathan Kim
2023-03-22 21:38 ` Felix Kuehling
2023-01-25 19:53 ` [PATCH 27/32] drm/amdkfd: add debug set flags operation Jonathan Kim
2023-03-22 21:47 ` Felix Kuehling
2023-01-25 19:53 ` [PATCH 28/32] drm/amdkfd: add debug query event operation Jonathan Kim
2023-01-25 19:53 ` [PATCH 29/32] drm/amdkfd: add debug query exception info operation Jonathan Kim
2023-01-25 19:53 ` [PATCH 30/32] drm/amdkfd: add debug queue snapshot operation Jonathan Kim
2023-03-22 21:52 ` Felix Kuehling
2023-01-25 19:54 ` [PATCH 31/32] drm/amdkfd: add debug device " Jonathan Kim
2023-03-22 21:54 ` Felix Kuehling
2023-01-25 19:54 ` [PATCH 32/32] drm/amdkfd: bump kfd ioctl minor version for debug api availability Jonathan Kim
2023-03-22 21:56 ` Felix Kuehling
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=cd1b640c-ba59-4627-ad2f-b9593833598e@amd.com \
--to=felix.kuehling@amd.com \
--cc=amd-gfx@lists.freedesktop.org \
--cc=dri-devel@lists.freedesktop.org \
--cc=jonathan.kim@amd.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox