From: Christian König
Subject: Re: Regression on gfx8 with ring init
Date: Fri, 21 Sep 2018 19:53:26 +0200
Message-ID: <04944e7b-044b-4b16-3d2f-e760eedcee9a@gmail.com>
Reply-To: christian.koenig-5C7GfCeVMHo@public.gmane.org
List-Id: Discussion list for AMD gfx
To: Andrey Grodzovsky, Christian König, "Deucher, Alexander", "StDenis, Tom", amd-gfx mailing list, "Zhou, David(ChunMing)"

I unfortunately don't have a Polaris to test this myself.

But please give me until Monday so that I can at least try one more thing to fix it.

Christian.

On 21.09.2018 at 19:11, Andrey Grodzovsky wrote:

Ping...


Andrey


On 09/20/2018 04:35 PM, Andrey Grodzovsky wrote:

What's the status of this error and the suggested patch to fix it? It impacts GPU reset on Polaris11.

Do we want to investigate why the original patch breaks it, or just disable the test with the proposed patch?


P.S. Suspend/resume also stopped working on the latest branch; I will bisect it later today or tomorrow.


Andrey


On 09/18/2018 11:00 AM, Christian König wrote:
Tom,

can you check whether the following makes it work again?

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
index b6160de70d12..d65f5ba92fc5 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
@@ -937,6 +937,10 @@ static int gfx_v8_0_ring_test_ib(struct amdgpu_ring *ring, long timeout)
        return r;
 }
 
+static int gfx_v8_0_kiq_ring_test_ib(struct amdgpu_ring *ring, long timeout)
+{
+       return 0;
+}
 
 static void gfx_v8_0_free_microcode(struct amdgpu_device *adev)
 {
@@ -7174,7 +7178,7 @@ static const struct amdgpu_ring_funcs gfx_v8_0_ring_funcs_kiq = {
        .emit_ib = gfx_v8_0_ring_emit_ib_compute,
        .emit_fence = gfx_v8_0_ring_emit_fence_kiq,
        .test_ring = gfx_v8_0_ring_test_ring,
-       .test_ib = gfx_v8_0_ring_test_ib,
+       .test_ib = gfx_v8_0_kiq_ring_test_ib,
        .insert_nop = amdgpu_ring_insert_nop,
        .pad_ib = amdgpu_ring_generic_pad_ib,
        .emit_rreg = gfx_v8_0_ring_emit_rreg,


Thanks,
Christian.
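
For readers following the thread, the patch above works because the IB test is reached through a per-ring function table, so swapping one pointer is enough to skip the test for the KIQ ring. The sketch below is only an illustration under stated assumptions: `struct ring`, `struct ring_funcs`, `generic_test_ib`, `kiq_test_ib`, `run_ib_test` and the `irq_works` flag are all hypothetical stand-ins and not the real amdgpu definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver types; not the real amdgpu structs. */
struct ring;

struct ring_funcs {
	int (*test_ib)(struct ring *ring, long timeout);
};

struct ring {
	const char *name;
	const struct ring_funcs *funcs;
	int irq_works;	/* toy flag: does the completion interrupt arrive? */
};

/* Toy IB test: returns -110, the code seen in the log, when no interrupt arrives. */
static int generic_test_ib(struct ring *ring, long timeout)
{
	(void)timeout;
	return ring->irq_works ? 0 : -110;
}

/* Stub in the spirit of the patch: always report success, i.e. skip the test. */
static int kiq_test_ib(struct ring *ring, long timeout)
{
	(void)ring;
	(void)timeout;
	return 0;
}

static const struct ring_funcs compute_funcs = { .test_ib = generic_test_ib };
static const struct ring_funcs kiq_funcs = { .test_ib = kiq_test_ib };

/* Callers only go through the table, so nothing else has to change. */
static int run_ib_test(struct ring *ring, long timeout)
{
	assert(ring != NULL && ring->funcs != NULL && ring->funcs->test_ib != NULL);
	return ring->funcs->test_ib(ring, timeout);
}
```

Note that a stub like this hides the failure rather than explaining it, which is why the question of the root cause stays open further up in the thread.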

On 18.09.2018 at 16:41, Christian König wrote:
CRTC and GFX interrupts seem to be working perfectly fine.

It looks like the only problem is that EOP interrupts from the compute queue are not handled correctly.

Most likely a bug somewhere in gfx_v8_0_eop_irq().

Christian.
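
As background for the suspicion above: an EOP interrupt handler has to map each interrupt back to the ring that raised it, and an interrupt that matches no ring is silently dropped, which from the outside looks like a timeout. The sketch below shows that kind of lookup. The bit layout of `ring_id` (pipe in bits 0-1, ME in bits 2-3, queue in bits 4-6) is written from memory of the gfx8 code and is an assumption to verify against the driver source; `struct eop_source`, `struct toy_ring`, `decode_ring_id` and `find_ring` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Where an EOP interrupt came from: micro engine, pipe and queue. */
struct eop_source {
	unsigned int me;
	unsigned int pipe;
	unsigned int queue;
};

/* ASSUMPTION: field layout recalled from the gfx8 handler, not verified here. */
static struct eop_source decode_ring_id(uint8_t ring_id)
{
	struct eop_source src;

	src.pipe = ring_id & 0x03;
	src.me = (ring_id & 0x0c) >> 2;
	src.queue = (ring_id & 0x70) >> 4;
	return src;
}

/* Hypothetical ring descriptor holding just the fields needed for matching. */
struct toy_ring {
	const char *name;
	unsigned int me, pipe, queue;
};

/* Return the ring matching the interrupt source, or NULL if none matches. */
static const struct toy_ring *find_ring(const struct toy_ring *rings,
					size_t count, uint8_t ring_id)
{
	struct eop_source src = decode_ring_id(ring_id);
	size_t i;

	assert(rings != NULL || count == 0);
	for (i = 0; i < count; i++) {
		if (rings[i].me == src.me && rings[i].pipe == src.pipe &&
		    rings[i].queue == src.queue)
			return &rings[i];
	}
	return NULL;	/* nobody gets woken up for this interrupt */
}
```

If the ME, pipe or queue recorded for a ring does not match what the hardware reports, the NULL path is taken and the waiting fence is never signalled.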

On 18.09.2018 at 16:36, Deucher, Alexander wrote:

FWIW, a number of consumer Raven boards have bad IVRS tables (Windows doesn't use interrupt remapping, so they are sometimes wrong and probably not validated).  There are a number of workarounds to manually override the IVRS tables to make interrupts work.  I think specifying pci=noacpi is also a possible workaround.


Alex


From: amd-gfx <amd-gfx-bounces-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org> on behalf of Christian König <christian.koenig-5C7GfCeVMHo@public.gmane.org>
Sent: Tuesday, September 18, 2018 10:31:16 AM
To: StDenis, Tom; amd-gfx mailing list; Zhou, David(ChunMing)
Subject: Re: Regression on gfx8 with ring init
 
Well looks like interrupt processing is working perfectly fine.

But looking at the error message once more I see that this actually
affects ring number 9 and not the GFX ring.

Can you fix amdgpu_ib_ring_tests() to print ring->name instead of the
number?

That must be one of the compute rings.

Thanks,
Christian.
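
The request above, printing the ring name instead of its index, could look roughly like the sketch below. It is a standalone illustration only: `struct named_ring` and `format_ib_failure` are hypothetical, the ring name is made up, and the real driver would use its own logging macro rather than `snprintf`.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical ring descriptor carrying both the index and a readable name. */
struct named_ring {
	const char *name;
	unsigned int idx;
};

/*
 * Build the failure message with the ring name rather than ring->idx,
 * so the log says which ring failed without looking up the index.
 */
static int format_ib_failure(char *buf, size_t len,
			     const struct named_ring *ring, int err)
{
	assert(buf != NULL && ring != NULL && ring->name != NULL);
	return snprintf(buf, len, "amdgpu: failed testing IB on ring %s (%d).",
			ring->name, err);
}
```

With a name in the message, a line like "failed testing IB on ring 9 (-110)" no longer needs a lookup to tell whether it refers to the GFX ring or to one of the others.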

On 18.09.2018 at 16:20, Tom St Denis wrote:
> On 2018-09-18 10:13 a.m., Christian König wrote:
>> Mhm, there is no failed IB test in there any more, is there?
>
> Oh sorry, I thought you wanted me to test HEAD~ ... Attached is a log from
> the tip of drm-next.
>
> Tom
>
>>
>> Christian.
>>
>> On 18.09.2018 at 16:09, Tom St Denis wrote:
>>> Disabling IOMMU in the BIOS resulted in a correct boot up...
>>>
>>> Here's the log.
>>>
>>> Tom
>>>
>>> On 2018-09-18 9:58 a.m., Tom St Denis wrote:
>>>> Odd, I couldn't even boot my system with the dGPU as primary after
>>>> rebuilding the kernel.  It got hung up in the IOMMU driver (loads
>>>> of AMD-Vi IOMMU errors), which I wasn't able to capture because it
>>>> panicked before loading the network stack.
>>>>
>>>> Bizarre.
>>>>
>>>> I'll keep trying.
>>>>
>>>> Tom
>>>>
>>>> On 2018-09-18 9:35 a.m., Christian König wrote:
>>>>> On 18.09.2018 at 15:32, Tom St Denis wrote:
>>>>>> On 2018-09-18 9:30 a.m., Christian König wrote:
>>>>>>> Great, not sure if that is good or bad news.
>>>>>>>
>>>>>>> Anyway, I'm going to revert the change for now. Does anybody
>>>>>>> volunteer to figure out why interrupts sometimes don't work
>>>>>>> correctly on Raven?
>>>>>>
>>>>>> What does "doesn't work correctly" mean?  My workstation is a Raven1
>>>>>> (Ryzen 2400G) and other than the TTM bulk move issue has been
>>>>>> perfectly stable (through suspend/resumes too I might add).
>>>>>>
>>>>>> Anything I could test with my devel raven?
>>>>>
>>>>> The problem seems to be that on some boards IH handling doesn't
>>>>> work as it should.
>>>>>
>>>>> Can you try to disable the onboard graphics and try again?
>>>>>
>>>>> If that still doesn't work there is a DRM_DEBUG in
>>>>> amdgpu_ih_process(), make that a DRM_ERROR and send me the
>>>>> resulting dmesg of loading amdgpu (but don't start any UMD).
>>>>>
>>>>> Thanks,
>>>>> Christian.
>>>>>
>>>>>>
>>>>>>
>>>>>> Tom
>>>>>>
>>>>>>>
>>>>>>> Christian.
>>>>>>>
>>>>>>> On 18.09.2018 at 15:27, Tom St Denis wrote:
>>>>>>>> This commit:
>>>>>>>>
>>>>>>>> [root@raven linux]# git bisect good
>>>>>>>> 9b0df0937a852d299fbe42a5939c9a8a4cc83c55 is the first bad commit
>>>>>>>> commit 9b0df0937a852d299fbe42a5939c9a8a4cc83c55
>>>>>>>> Author: Christian König <christian.koenig-5C7GfCeVMHo@public.gmane.org>
>>>>>>>> Date:   Tue Sep 18 10:38:09 2018 +0200
>>>>>>>>
>>>>>>>>     drm/amdgpu: remove fence fallback
>>>>>>>>
>>>>>>>>     DC doesn't seem to have a fallback path either.
>>>>>>>>
>>>>>>>>     So when interrupts doesn't work any more we are pretty much busted no
>>>>>>>>     matter what.
>>>>>>>>
>>>>>>>>     Signed-off-by: Christian König <christian.koenig-5C7GfCeVMHo@public.gmane.org>
>>>>>>>>     Reviewed-by: Chunming Zhou <david1.zhou-5C7GfCeVMHo@public.gmane.org>
>>>>>>>>
>>>>>>>> Results in this:
>>>>>>>>
>>>>>>>> [   24.334025] [drm] Initialized amdgpu 3.27.0 20150101 for 0000:07:00.0 on minor 1
>>>>>>>> [   24.335674] modprobe (3895) used greatest stack depth: 12600 bytes left
>>>>>>>> [   26.272358] [drm:gfx_v8_0_ring_test_ib [amdgpu]] *ERROR* amdgpu: IB test timed out.
>>>>>>>> [   26.272460] [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* amdgpu: failed testing IB on ring 9 (-110).
>>>>>>>> [   26.407885] [drm:process_one_work] *ERROR* ib ring test failed (-110).
>>>>>>>> [   28.506708] fuse init (API version 7.27)
>>>>>>>>
>>>>>>>> On init with my polaris/raven1 system.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Tom
>>>>>>>> _______________________________________________
>>>>>>>> amd-gfx mailing list
>>>>>>>> amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
