Re: [PATCH 1/2] drm/amdgpu: introduce a kind of halt state for amdgpu device


From: Lang Yu <Lang.Yu@amd.com>
To: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>,
	Andrey Grodzovsky <andrey.grodzovsky@amd.com>,
	Lijo Lazar <lijo.lazar@amd.com>, Huang Rui <ray.huang@amd.com>,
	amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 1/2] drm/amdgpu: introduce a kind of halt state for amdgpu device
Date: Fri, 10 Dec 2021 11:47:00 +0800
Message-ID: <YbLNtFgoxa66ZV7N@lang-desktop>
In-Reply-To: <983b6cd3-57ac-a797-170c-2c43b86b67fc@amd.com>

On 12/09/ , Christian König wrote:
> Am 09.12.21 um 16:38 schrieb Andrey Grodzovsky:
> > 
> > On 2021-12-09 4:00 a.m., Christian König wrote:
> > > 
> > > 
> > > Am 09.12.21 um 09:49 schrieb Lang Yu:
> > > > It is useful to maintain error context when debugging
> > > > SW/FW issues. We introduce amdgpu_device_halt() for this
> > > > purpose. It will bring hardware to a kind of halt state,
> > > > so that no one can touch it any more.
> > > > 
> > > > Compare to a simple hang, the system will keep stable
> > > > at least for SSH access. Then it should be trivial to
> > > > inspect the hardware state and see what's going on.
> > > > 
> > > > Suggested-by: Christian Koenig <christian.koenig@amd.com>
> > > > Suggested-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> > > > Signed-off-by: Lang Yu <lang.yu@amd.com>
> > > > ---
> > > >   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  2 ++
> > > >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 39
> > > > ++++++++++++++++++++++
> > > >   2 files changed, 41 insertions(+)
> > > > 
> > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > > > index c5cfe2926ca1..3f5f8f62aa5c 100644
> > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > > > @@ -1317,6 +1317,8 @@ void amdgpu_device_flush_hdp(struct
> > > > amdgpu_device *adev,
> > > >   void amdgpu_device_invalidate_hdp(struct amdgpu_device *adev,
> > > >           struct amdgpu_ring *ring);
> > > >   +void amdgpu_device_halt(struct amdgpu_device *adev);
> > > > +
> > > >   /* atpx handler */
> > > >   #if defined(CONFIG_VGA_SWITCHEROO)
> > > >   void amdgpu_register_atpx_handler(void);
> > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > index a1c14466f23d..62216627cc83 100644
> > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > @@ -5634,3 +5634,42 @@ void amdgpu_device_invalidate_hdp(struct
> > > > amdgpu_device *adev,
> > > >         amdgpu_asic_invalidate_hdp(adev, ring);
> > > >   }
> > > > +
> > > > +/**
> > > > + * amdgpu_device_halt() - bring hardware to some kind of halt state
> > > > + *
> > > > + * @adev: amdgpu_device pointer
> > > > + *
> > > > + * Bring hardware to some kind of halt state so that no one can
> > > > touch it
> > > > + * any more. It will help to maintain error context when error
> > > > occurred.
> > > > + * Compare to a simple hang, the system will keep stable at
> > > > least for SSH
> > > > + * access. Then it should be trivial to inspect the hardware state and
> > > > + * see what's going on. Implemented as following:
> > > > + *
> > > > + * 1. drm_dev_unplug() makes device inaccessible to user
> > > > space(IOCTLs, etc),
> > > > + *    clears all CPU mappings to device, disallows remappings through
> > > > page faults
> > > > + * 2. amdgpu_irq_disable_all() disables all interrupts
> > > > + * 3. amdgpu_fence_driver_hw_fini() signals all HW fences
> > > > + * 4. amdgpu_device_unmap_mmio() clears all MMIO mappings
> > > > + * 5. pci_disable_device() and pci_wait_for_pending_transaction()
> > > > + *    flush any in flight DMA operations
> > > > + * 6. set adev->no_hw_access to true
> > > > + */
> > > > +void amdgpu_device_halt(struct amdgpu_device *adev)
> > > > +{
> > > > +    struct pci_dev *pdev = adev->pdev;
> > > > +    struct drm_device *ddev = &adev->ddev;
> > > > +
> > > > +    drm_dev_unplug(ddev);
> > > > +
> > > > +    amdgpu_irq_disable_all(adev);
> > > > +
> > > > +    amdgpu_fence_driver_hw_fini(adev);
> > > > +
> > > > +    amdgpu_device_unmap_mmio(adev);
> > 
> > 
> > Note that this one will cause a page fault on any subsequent MMIO access
> > (through registers or by direct VRAM access)
> > 
> > 
> > > > 
> > > > +
> > > > +    pci_disable_device(pdev);
> > > > +    pci_wait_for_pending_transaction(pdev);
> > > > +
> > > > +    adev->no_hw_access = true;
> > > 
> > > I think we need to reorder this, e.g. set adev->no_hw_access much
> > > earlier for example. Andrey what do you think?
> > 
> > 
> > Earlier can be OK, but at least after the last HW configuration we
> > actually want to do, like disabling IRQs.
> 
> My thinking was to at least do this before we unmap the MMIO to avoid the
> crash.
> 
> In addition, maybe we don't even want to do this in this case.

So we just set "adev->no_hw_access = true;" before calling
"amdgpu_device_unmap_mmio(adev);".

That avoids potential page faults from register accesses,
but direct VRAM access will still trigger page faults.

For example,
"cat /sys/class/drm/card0/device/pp_od_clk_voltage"
calls smu_cmn_update_table(), which writes to the table's
CPU mapping directly and can still trigger a page fault.

smu_cmn_update_table()
{
	...
	if (drv2smu) {
		memcpy(table->cpu_addr, table_data, table_size);
		...
	}
	...
}
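
To make the ordering question concrete, here is a rough user-space
model, not driver code. All mock_* names and the access_result enum
are made up for illustration. It assumes the register accessors check
adev->no_hw_access before touching MMIO, while the memcpy() path above
does not check it.

```c
#include <assert.h>
#include <stdbool.h>

/* Outcome of a simulated CPU access to the device. */
enum access_result { ACCESS_OK, ACCESS_SKIPPED, ACCESS_FAULT };

/* Hypothetical stand-in for the two pieces of amdgpu_device state involved. */
struct mock_adev {
	bool no_hw_access; /* mirrors adev->no_hw_access */
	bool mapped;       /* false once the MMIO/VRAM mappings are gone */
};

/* Register path: assumed to check no_hw_access before touching MMIO. */
static enum access_result mock_reg_access(const struct mock_adev *adev)
{
	if (adev->no_hw_access)
		return ACCESS_SKIPPED;
	return adev->mapped ? ACCESS_OK : ACCESS_FAULT;
}

/* Table path: models a plain memcpy() to table->cpu_addr, no flag check. */
static enum access_result mock_table_copy(const struct mock_adev *adev)
{
	return adev->mapped ? ACCESS_OK : ACCESS_FAULT;
}

/* Ordering from the patch as posted: unmap first, set the flag last. */
static enum access_result mock_halt_as_posted(struct mock_adev *adev)
{
	enum access_result in_window;

	adev->mapped = false;
	in_window = mock_reg_access(adev); /* access racing with the halt */
	adev->no_hw_access = true;
	return in_window;
}

/* Ordering discussed here: set the flag before unmapping. */
static enum access_result mock_halt_reordered(struct mock_adev *adev)
{
	enum access_result in_window;

	adev->no_hw_access = true;
	adev->mapped = false;
	in_window = mock_reg_access(adev); /* same racing access */
	return in_window;
}
```

In this model the as-posted order lets a register access land on an
unmapped address, the reordered one skips it, and mock_table_copy()
faults after the halt in both cases, which is the remaining problem
described above.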

Regards,
Lang

> Christian.
> 
> > 
> > 
> > Andrey
> > 
> > > 
> > > Apart from that sounds like the right idea to me.
> > > 
> > > Regards,
> > > Christian.
> > > 
> > > > +}
> > > 
>
