Linux PCI subsystem development
From: "Christian König" <christian.koenig@amd.com>
To: Bjorn Helgaas <helgaas@kernel.org>,
	Alex Deucher <alexander.deucher@amd.com>,
	Xinhui Pan <Xinhui.Pan@amd.com>
Cc: David Airlie <airlied@linux.ie>, Daniel Vetter <daniel@ffwll.ch>,
	Tom Seewald <tseewald@gmail.com>, Stefan Roese <sr@denx.de>,
	Kai-Heng Feng <kai.heng.feng@canonical.com>,
	regressions@lists.linux.dev, linux-pci@vger.kernel.org,
	amd-gfx@lists.freedesktop.org
Subject: Re: [Bug 216373] New: Uncorrected errors reported for AMD GPU
Date: Fri, 19 Aug 2022 09:05:47 +0200
Message-ID: <c1a4da18-a6e1-c633-a585-1b4940a5de59@amd.com>
In-Reply-To: <20220818203812.GA2381243@bhelgaas>

Hi Bjorn,

On 18.08.22 at 22:38, Bjorn Helgaas wrote:
> [Adding amdgpu folks]
>
> On Wed, Aug 17, 2022 at 11:45:15PM +0000, bugzilla-daemon@kernel.org wrote:
>> https://bugzilla.kernel.org/show_bug.cgi?id=216373
>>
>>              Bug ID: 216373
>>             Summary: Uncorrected errors reported for AMD GPU
>>      Kernel Version: v6.0-rc1
>>          Regression: No
>> ...
> I marked this as a regression in bugzilla.
>
>> Hardware:
>> CPU: Intel i7-12700K (Alder Lake)
>> GPU: AMD RX 6700 XT [1002:73df]
>> Motherboard: ASUS Prime Z690-A
>>
>> Problem:
>> After upgrading to v6.0-rc1 the kernel is now reporting uncorrected PCI errors
>> for my GPU.
> Thank you very much for the report and for taking the trouble to
> bisect it and test Kai-Heng's patch!
>
> I suspect that booting with "pci=noaer" should be a temporary
> workaround for this issue.  If it is, can you add that to the bugzilla
> for anybody else who trips over this?
>
>> I have bisected this issue to: [8795e182b02dc87e343c79e73af6b8b7f9c5e635]
>> PCI/portdrv: Don't disable AER reporting in get_port_device_capability()
>> Reverting that commit causes the errors to cease.
> I suspect the errors still occur, but we just don't notice and log
> them.
>
>> I have also tried Kai-Heng Feng's patch[1] which seems to resolve a similar
>> problem, but it did not fix my issue.
>>
>> [1]
>> https://lore.kernel.org/linux-pci/20220706123244.18056-1-kai.heng.feng@canonical.com/
>>
>> dmesg snippet:
>>
>> pcieport 0000:00:01.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:03:00.0
>> amdgpu 0000:03:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
>> amdgpu 0000:03:00.0:   device [1002:73df] error status/mask=00100000/00000000
>> amdgpu 0000:03:00.0:    [20] UnsupReq               (First)
>> amdgpu 0000:03:00.0: AER:   TLP Header: 40000001 0000000f 95e7f000 00000000
> I think the TLP header decodes to:
>
>    0x40000001 = 0100 0000 ... 0000 0001 binary
>    0x0000000f = 0000 0000 ... 0000 1111 binary
>
>    Fmt           010b                 3 DW header with data
>    Type          0000b  010 0 0000    MWr Memory Write Request
>    Length        00 0000 0001b        1 DW
>    Requester ID  0x0000               00:00.0
>    Tag           0x00
>    Last DW BE    0000b                must be zero for 1 DW write
>    First DW BE   1111b                all 4 bytes in DW enabled
>    Address       0x95e7f000
>    Data          0x00000000
>
> So I think this is a 32-bit write of zero to PCI bus address
> 0x95e7f000.
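
The decode quoted above can be reproduced with a few shifts and masks.
This is only a minimal sketch: it assumes the standard PCIe layout of a
3 DW memory request header, handles nothing else, and the name
decode_tlp_header is hypothetical (it is not a kernel or library API).

```python
# Sketch: decode a 3 DW PCIe memory request header as logged by AER.
# Bit positions are the standard TLP header layout, which is an
# assumption of this sketch rather than something stated in the thread.

KIND_BY_FMT = {
    0b000: "MRd (3 DW header, no data)",
    0b010: "MWr (3 DW header with data)",
}


def decode_tlp_header(dw0, dw1, dw2, dw3):
    fmt = (dw0 >> 29) & 0x7        # DW0 bits 31:29
    tlp_type = (dw0 >> 24) & 0x1F  # DW0 bits 28:24
    if tlp_type != 0 or fmt not in KIND_BY_FMT:
        raise ValueError("sketch only handles 3 DW memory requests")
    req_id = (dw1 >> 16) & 0xFFFF  # DW1 bits 31:16, bus/device/function
    return {
        "kind": KIND_BY_FMT[fmt],
        "length_dw": dw0 & 0x3FF,  # DW0 bits 9:0 (raw field value)
        "requester": "%02x:%02x.%d"
        % (req_id >> 8, (req_id >> 3) & 0x1F, req_id & 0x7),
        "tag": (dw1 >> 8) & 0xFF,
        "last_dw_be": (dw1 >> 4) & 0xF,
        "first_dw_be": dw1 & 0xF,
        "address": dw2 & 0xFFFFFFFC,  # low two bits are not address bits
        # With a 3 DW header, the fourth logged DW is the first data DW.
        "data": dw3 if fmt == 0b010 else None,
    }


# The four DWs from the "TLP Header:" line in the dmesg snippet.
hdr = decode_tlp_header(0x40000001, 0x0000000F, 0x95E7F000, 0x00000000)
for name, value in hdr.items():
    print(name, hex(value) if isinstance(value, int) else value)
```

Fed with the logged header, the fields should match the table above: a
1 DW memory write from 00:00.0 to 0x95e7f000 with all four byte enables
set in the first DW.
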
>
> Your dmesg log says:
>
>    pci 0000:02:00.0: PCI bridge to [bus 03]
>    pci 0000:02:00.0:   bridge window [mem 0x95e00000-0x95ffffff]
>    pci 0000:03:00.0: reg 0x24: [mem 0x95e00000-0x95efffff]
>    [drm] register mmio base: 0x95E00000
>
> So this looks like a write to the device's BAR 5.  I don't see a PCI
> reason why this should fail.  Maybe there's some amdgpu reason?
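
The BAR 5 conclusion above follows from two small checks: config
offset 0x24 is the sixth 32-bit BAR slot, and the faulting address lies
inside both the bridge window and that BAR. A minimal sketch, with the
ranges copied from the quoted dmesg lines and the helper names
(bar_index, contains) purely hypothetical:

```python
# Sketch: map a "reg 0x24" config offset to a BAR number and check that
# the faulting address falls inside the ranges quoted from dmesg.

FIRST_BAR_OFFSET = 0x10  # BAR 0 sits at config space offset 0x10


def bar_index(reg_offset):
    """BAR number for a 32-bit BAR register at the given config offset."""
    return (reg_offset - FIRST_BAR_OFFSET) // 4


def contains(window, addr):
    """True if addr lies in the inclusive (start, end) window."""
    start, end = window
    return start <= addr <= end


fault_addr = 0x95E7F000                  # address from the TLP header
bridge_window = (0x95E00000, 0x95FFFFFF)  # pci 0000:02:00.0 bridge window
gpu_bar = (0x95E00000, 0x95EFFFFF)        # pci 0000:03:00.0 reg 0x24

print("BAR index:", bar_index(0x24))
print("in bridge window:", contains(bridge_window, fault_addr))
print("in GPU BAR:", contains(gpu_bar, fault_addr))
print("offset into BAR:", hex(fault_addr - gpu_bar[0]))
```

So the write targets offset 0x7f000 into the GPU's MMIO register BAR,
which is why routing alone does not explain the Unsupported Request.
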

Well, I have seen a couple of boards where things like that happened,
but in my experience there was always an underlying hardware problem.

As I understand it, what essentially happens is that a write doesn't
make it to the device (e.g. because transmission errors can't be
corrected).

Most likely the write is then either dropped and doesn't matter much
(e.g. it was just clearing the framebuffer), or it is repeated, which is
why everything still seems to work fine.

Either way, I suggest trying this with a different hardware
configuration, e.g. put the GPU into another system and see if it still
shows the same issue, or put another GPU into this system.

Regards,
Christian.


>
> Bjorn

