* [PATCH v2] drm/amdgpu: introduce a kind of halt state for amdgpu device
From: Lang Yu @ 2021-12-10 9:35 UTC
To: amd-gfx, Andrey Grodzovsky
Cc: Alex Deucher, Huang Rui, Lang Yu, Christian Koenig
It is useful to preserve error context when debugging
SW/FW issues. Introduce amdgpu_device_halt() for this
purpose. It brings the hardware to a kind of halt state,
so that nothing can touch it any more.

Compared to a simple hang, the system will remain stable,
at least for SSH access. It should then be trivial to
inspect the hardware state and see what's going on.

Suggested-by: Christian Koenig <christian.koenig@amd.com>
Suggested-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Lang Yu <lang.yu@amd.com>
v2:
- Set adev->no_hw_access earlier to avoid crashes. (Christian)
---
drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 ++
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 39 ++++++++++++++++++++++
2 files changed, 41 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index c5cfe2926ca1..3f5f8f62aa5c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1317,6 +1317,8 @@ void amdgpu_device_flush_hdp(struct amdgpu_device *adev,
void amdgpu_device_invalidate_hdp(struct amdgpu_device *adev,
struct amdgpu_ring *ring);
+void amdgpu_device_halt(struct amdgpu_device *adev);
+
/* atpx handler */
#if defined(CONFIG_VGA_SWITCHEROO)
void amdgpu_register_atpx_handler(void);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index a1c14466f23d..8fe7ea6cee18 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5634,3 +5634,42 @@ void amdgpu_device_invalidate_hdp(struct amdgpu_device *adev,
amdgpu_asic_invalidate_hdp(adev, ring);
}
+
+/**
+ * amdgpu_device_halt() - bring hardware to some kind of halt state
+ *
+ * @adev: amdgpu_device pointer
+ *
+ * Bring hardware to some kind of halt state so that no one can touch it
+ * any more. This helps to preserve error context when an error occurs.
+ * Compared to a simple hang, the system will remain stable at least for
+ * SSH access. Then it should be trivial to inspect the hardware state and
+ * see what's going on. Implemented as follows:
+ *
+ * 1. drm_dev_unplug() makes device inaccessible to user space (IOCTLs, etc.),
+ * clears all CPU mappings to device, disallows remappings through page faults
+ * 2. amdgpu_irq_disable_all() disables all interrupts
+ * 3. amdgpu_fence_driver_hw_fini() signals all HW fences
+ * 4. set adev->no_hw_access to avoid potential crashes after step 5
+ * 5. amdgpu_device_unmap_mmio() clears all MMIO mappings
+ * 6. pci_disable_device() and pci_wait_for_pending_transaction()
+ * flush any in-flight DMA operations
+ */
+void amdgpu_device_halt(struct amdgpu_device *adev)
+{
+ struct pci_dev *pdev = adev->pdev;
+ struct drm_device *ddev = &adev->ddev;
+
+ drm_dev_unplug(ddev);
+
+ amdgpu_irq_disable_all(adev);
+
+ amdgpu_fence_driver_hw_fini(adev);
+
+ adev->no_hw_access = true;
+
+ amdgpu_device_unmap_mmio(adev);
+
+ pci_disable_device(pdev);
+ pci_wait_for_pending_transaction(pdev);
+}
--
2.25.1
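
Reader's note on the ordering of steps 4 and 5: the following is a minimal
user-space sketch, not driver code. It assumes that register accessors check
a flag like adev->no_hw_access before dereferencing the MMIO mapping. Under
that assumption, setting the flag before unmapping means a late register read
returns a dummy value instead of touching a mapping that no longer exists.
All names here (mock_dev, mock_init, mock_rreg, mock_unmap_mmio, mock_halt)
are hypothetical stand-ins and do not exist in amdgpu.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct amdgpu_device; not the real layout. */
struct mock_dev {
	bool no_hw_access;	/* plays the role of adev->no_hw_access */
	uint32_t *mmio;		/* fake register file, NULL once "unmapped" */
	size_t nregs;
};

/* Allocate a zeroed fake register file. Returns 0 on success, -1 on failure. */
int mock_init(struct mock_dev *dev, size_t nregs)
{
	dev->no_hw_access = false;
	dev->mmio = calloc(nregs, sizeof(*dev->mmio));
	dev->nregs = dev->mmio ? nregs : 0;
	return dev->mmio ? 0 : -1;
}

/* Register read that consults the flag before touching the mapping. */
uint32_t mock_rreg(const struct mock_dev *dev, size_t reg)
{
	if (dev->no_hw_access)
		return 0;	/* assumed behaviour: skip the access entirely */
	assert(dev->mmio && reg < dev->nregs);
	return dev->mmio[reg];
}

/* Models step 5: the mapping goes away. */
void mock_unmap_mmio(struct mock_dev *dev)
{
	free(dev->mmio);
	dev->mmio = NULL;
	dev->nregs = 0;
}

/* Models steps 4 and 5 in the order used by the patch. */
void mock_halt(struct mock_dev *dev)
{
	dev->no_hw_access = true;	/* step 4: block further accesses */
	mock_unmap_mmio(dev);		/* step 5: now safe to drop the mapping */
}
```

With the two statements in mock_halt() swapped, a reader running between them
would pass the flag check and dereference a NULL mapping, which is the kind of
crash the v2 change is meant to avoid.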
* Re: [PATCH v2] drm/amdgpu: introduce a kind of halt state for amdgpu device
From: Christian König @ 2021-12-10 9:57 UTC
To: Lang Yu, amd-gfx, Andrey Grodzovsky; +Cc: Alex Deucher, Huang Rui
On 10.12.21 at 10:35, Lang Yu wrote:
> It is useful to maintain error context when debugging
> SW/FW issues. Introduce amdgpu_device_halt() for this
> purpose. It will bring hardware to a kind of halt state,
> so that no one can touch it any more.
>
> Compared to a simple hang, the system will remain stable,
> at least for SSH access. It should then be trivial to
> inspect the hardware state and see what's going on.
>
> Suggested-by: Christian Koenig <christian.koenig@amd.com>
> Suggested-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> Signed-off-by: Lang Yu <lang.yu@amd.com>
I'm not 100% sure if the amdgpu_device_unmap_mmio() is necessary here,
but for now that patch is
Reviewed-by: Christian König <christian.koenig@amd.com>
Thanks,
Christian.
>
> v2:
> - Set adev->no_hw_access earlier to avoid crashes. (Christian)
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 ++
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 39 ++++++++++++++++++++++
> 2 files changed, 41 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index c5cfe2926ca1..3f5f8f62aa5c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -1317,6 +1317,8 @@ void amdgpu_device_flush_hdp(struct amdgpu_device *adev,
> void amdgpu_device_invalidate_hdp(struct amdgpu_device *adev,
> struct amdgpu_ring *ring);
>
> +void amdgpu_device_halt(struct amdgpu_device *adev);
> +
> /* atpx handler */
> #if defined(CONFIG_VGA_SWITCHEROO)
> void amdgpu_register_atpx_handler(void);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index a1c14466f23d..8fe7ea6cee18 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5634,3 +5634,42 @@ void amdgpu_device_invalidate_hdp(struct amdgpu_device *adev,
>
> amdgpu_asic_invalidate_hdp(adev, ring);
> }
> +
> +/**
> + * amdgpu_device_halt() - bring hardware to some kind of halt state
> + *
> + * @adev: amdgpu_device pointer
> + *
> + * Bring hardware to some kind of halt state so that no one can touch it
> + * any more. This helps to preserve error context when an error occurs.
> + * Compared to a simple hang, the system will remain stable at least for
> + * SSH access. Then it should be trivial to inspect the hardware state and
> + * see what's going on. Implemented as follows:
> + *
> + * 1. drm_dev_unplug() makes device inaccessible to user space (IOCTLs, etc.),
> + * clears all CPU mappings to device, disallows remappings through page faults
> + * 2. amdgpu_irq_disable_all() disables all interrupts
> + * 3. amdgpu_fence_driver_hw_fini() signals all HW fences
> + * 4. set adev->no_hw_access to avoid potential crashes after step 5
> + * 5. amdgpu_device_unmap_mmio() clears all MMIO mappings
> + * 6. pci_disable_device() and pci_wait_for_pending_transaction()
> + * flush any in-flight DMA operations
> + */
> +void amdgpu_device_halt(struct amdgpu_device *adev)
> +{
> + struct pci_dev *pdev = adev->pdev;
> + struct drm_device *ddev = &adev->ddev;
> +
> + drm_dev_unplug(ddev);
> +
> + amdgpu_irq_disable_all(adev);
> +
> + amdgpu_fence_driver_hw_fini(adev);
> +
> + adev->no_hw_access = true;
> +
> + amdgpu_device_unmap_mmio(adev);
> +
> + pci_disable_device(pdev);
> + pci_wait_for_pending_transaction(pdev);
> +}