From: Nirmoy <nirmodas@amd.com>
To: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com>,
Nirmoy Das <nirmoy.das@amd.com>,
amd-gfx@lists.freedesktop.org
Cc: alexander.deucher@amd.com, christian.koenig@amd.com
Subject: Re: [RFC PATCH 1/1] drm/amdgpu: add initial support for pci error handler
Date: Thu, 13 Aug 2020 17:06:45 +0200
Message-ID: <50cab62a-e8d4-0e0d-438b-b274c09d0972@amd.com>
In-Reply-To: <b6e878c0-bade-77a5-fda7-6be234c2f6b7@amd.com>
On 8/13/20 3:38 PM, Andrey Grodzovsky wrote:
>
> On 8/13/20 7:09 AM, Nirmoy wrote:
>>
>> On 8/12/20 4:52 PM, Andrey Grodzovsky wrote:
>>>
>>> On 8/11/20 9:30 AM, Nirmoy Das wrote:
>>>> This patch will ignore non-fatal errors and try to
>>>> stop amdgpu's sw stack on fatal errors.
>>>>
>>>> Signed-off-by: Nirmoy Das <nirmoy.das@amd.com>
>>>> ---
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 56 ++++++++++++++++++++++++-
>>>> 1 file changed, 54 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>> index c1219af2e7d6..2b9ede3000ee 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>>>> @@ -35,6 +35,7 @@
>>>> #include <linux/pm_runtime.h>
>>>> #include <linux/vga_switcheroo.h>
>>>> #include <drm/drm_probe_helper.h>
>>>> +#include <drm/drm_atomic_helper.h>
>>>> #include <linux/mmu_notifier.h>
>>>> #include "amdgpu.h"
>>>> @@ -1516,6 +1517,58 @@ static struct drm_driver kms_driver = {
>>>>         .patchlevel = KMS_DRIVER_PATCHLEVEL,
>>>> };
>>>> +static pci_ers_result_t amdgpu_pci_err_detected(struct pci_dev *pdev,
>>>> +                                                pci_channel_state_t state)
>>>> +{
>>>> +        struct drm_device *dev = pci_get_drvdata(pdev);
>>>> +        struct amdgpu_device *adev = dev->dev_private;
>>>> +        int i;
>>>> +        int ret = PCI_ERS_RESULT_DISCONNECT;
>>>> +
>>>> +        switch (state) {
>>>> +        case pci_channel_io_normal:
>>>> +                ret = PCI_ERS_RESULT_CAN_RECOVER;
>>>> +                break;
>>>> +        default:
>>>> +                /* Disable power management */
>>>> +                adev->runpm = 0;
>>>> +                /* Suspend all IO operations */
>>>> +                amdgpu_fbdev_set_suspend(adev, 1);
>>>> +                cancel_delayed_work_sync(&adev->delayed_init_work);
>>>> +                for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>> +                        struct amdgpu_ring *ring = adev->rings[i];
>>>> +
>>>> +                        if (!ring || !ring->sched.thread)
>>>> +                                continue;
>>>> +
>>>> +                        amdgpu_job_stop_all_jobs_on_sched(&ring->sched);
>>>
>>>
>>> You need to call drm_sched_stop first before calling this
>>>
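>>> E.g. something like this in the ring loop (untested sketch;
>>> drm_sched_stop() takes the scheduler and an optional guilty job,
>>> NULL here):
>>>
>>>         drm_sched_stop(&ring->sched, NULL);
>>>         amdgpu_job_stop_all_jobs_on_sched(&ring->sched);
>>>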
>>>> +                }
>>>> +
>>>> +                if (adev->mode_info.mode_config_initialized) {
>>>> +                        if (!amdgpu_device_has_dc_support(adev))
>>>> +                                drm_helper_force_disable_all(adev->ddev);
>>>> +                        else
>>>> +                                drm_atomic_helper_shutdown(adev->ddev);
>>>> +                }
>>>> +
>>>> +                amdgpu_fence_driver_fini(adev);
>>>> +                amdgpu_fbdev_fini(adev);
>>>> +                /* Try to close the drm device to stop applications
>>>> +                 * from opening dri files for further IO operations.
>>>> +                 * TODO: This will throw a warning as ttm is not
>>>> +                 * cleaned up properly */
>>>> +                drm_dev_fini(dev);
>>>
>>>
>>> I think user-mode applications might still hold a reference to the drm
>>> device through drm_dev_get, either by directly opening the device file
>>> or indirectly by importing a DMA buffer. If so, when the last of them
>>> terminates, drm_dev_put->drm_dev_release->...->drm_dev_fini
>>> might get called again, causing use-after-free and similar issues. Maybe
>>> it's better to call drm_dev_put here instead, so that drm_dev_fini gets
>>> called only when that last user client releases its reference.
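>>>
>>> I.e. roughly (untested sketch of that suggestion, assuming the reference
>>> dropped here is the initial one taken at device allocation):
>>>
>>> -        drm_dev_fini(dev);
>>> +        drm_dev_put(dev);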
>>
>>
>> drm_dev_fini() seems to be cleaner. The problem is that the window
>> manager (sway) never gets terminated after the AER error, and the drm
>> files remain active. A simple cat on the dri files still goes through
>> amdgpu and spits out more errors.
>
>
> What happens if you kill the window manager after you have closed the
> drm device with your original code applied? I would expect drm_dev_fini
> to be called again for the reason I explained above, and that would
> obviously be wrong.
Hi Andrey,

Hmm, I quickly tried that. The kernel crashed and later rebooted after
some time. I don't have a serial console to check the logs, and there
were no logs afterwards in journalctl.

drm_dev_put() had similar behavior; the kernel/machine was inaccessible
over ssh.

Did you face the same behavior while testing gpu hotplug?
Nirmoy
>
> Andrey
>
>
>>
>>
>>>
>>> Also a general question - in my work on the DPC recovery feature, which
>>> tries to recover after a PCIe error - once the PCI error has happened,
>>> MMIO registers become inaccessible for r/w because the PCI link is dead
>>> until it is reset by the DPC driver (see
>>> https://www.kernel.org/doc/html/latest/PCI/pci-error-recovery.html
>>> section 6.1.4).
>>> Your case tries to gracefully close the drm device once a fatal error
>>> has happened - didn't you encounter errors or warnings when accessing
>>> HW registers during any of the operations above?
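>>>
>>> FWIW, once the link is dead a driver can at least guard its MMIO paths
>>> by checking the channel state. A minimal sketch (pci_channel_offline()
>>> tests pdev->error_state != pci_channel_io_normal; 'reg' and 'val' are
>>> made up here, and RREG32 is the amdgpu register-read macro):
>>>
>>>         if (pci_channel_offline(adev->pdev))
>>>                 return; /* link is down, skip the register access */
>>>         val = RREG32(reg);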
>>
>>
>> As discussed over chat, it seems an AER generated with the aer-inject
>> tool just triggers the kernel's PCI error APIs while the device remains
>> active, so I didn't encounter any errors when accessing HW registers.
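>>
>> For reference, an aer-inject config for this looks roughly like the
>> following (quoting the format from memory, so the exact keywords may be
>> off; the device address is made up):
>>
>>         AER
>>         DOMAIN 0000
>>         BUS 3
>>         DEV 0
>>         FN 0
>>         UNCOR_STATUS COMP_TIME
>>         HEADER_LOG 0 1 2 3
>>
>> and it is fed to the tool with 'aer-inject <config-file>'.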
>>
>>
>> Nirmoy
>>
>>
>>>
>>> Andrey
>>>
>>>
>>>> +                break;
>>>> +        }
>>>> +
>>>> +        return ret;
>>>> +}
>>>> +
>>>> +static const struct pci_error_handlers amdgpu_err_handler = {
>>>> +        .error_detected = amdgpu_pci_err_detected,
>>>> +};
>>>> +
>>>> +
>>>> static struct pci_driver amdgpu_kms_pci_driver = {
>>>>         .name = DRIVER_NAME,
>>>>         .id_table = pciidlist,
>>>> @@ -1523,10 +1576,9 @@ static struct pci_driver amdgpu_kms_pci_driver = {
>>>>         .remove = amdgpu_pci_remove,
>>>>         .shutdown = amdgpu_pci_shutdown,
>>>>         .driver.pm = &amdgpu_pm_ops,
>>>> +        .err_handler = &amdgpu_err_handler,
>>>> };
>>>> -
>>>> -
>>>> static int __init amdgpu_init(void)
>>>> {
>>>>         int r;