From: bugzilla-daemon@freedesktop.org
To: dri-devel@lists.freedesktop.org
Subject: [Bug 91880] Radeonsi on Grenada cards (r9 390) exceptionally unstable and poorly performing
Date: Sun, 18 Mar 2018 20:31:41 +0000
List-Id: dri-devel@lists.freedesktop.org

https://bugs.freedesktop.org/show_bug.cgi?id=91880

Comment #186 on bug 91880 from Chris Heald

I've been doing a lot of experimentation, and I've found a few more things that
I feel are probably related:

* I can force a system hard-lock by doing anything which disables a monitor.
Notably, going full-screen under KDE/Xorg does this, but I can trigger it just
as easily by disabling a monitor with xrandr. Fullscreen under gnome doesn't
seem to trigger the issue, which I suspect is due to gnome's using mutter for
screen management.

* Occasionally, the system boots up and gets stuck with a 150MHz memory clock,
rather than clocking up to the 1500MHz state. This causes the display
corruption even if the sclk is set to 500MHz+. Setting the mclk mask manually
fixes the display corruption.

* I've been experimenting with different kernels ranging from 4.4 to 4.16rc5.
Earlier kernels feel more susceptible to hard-locking, though the later kernels
aren't immune to it.

* I tried a fresh Ubuntu 16.04 LTS install, and while it did NOT exhibit the
artifacting behavior, the system hard-locked within a few minutes of light
desktop usage.
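
To make the stuck-memory-clock case above easier to spot, here is a minimal sketch that parses the text of a DPM level table and reports which level is selected. It assumes the amdgpu `pp_dpm_mclk` layout of one `index: frequency` line per level with a `*` on the current one; the sample string is hypothetical and only reuses the 150MHz and 1500MHz figures from this report. The function names are made up for illustration.

```python
import re

# Assumed line format: "<index>: <freq>Mhz", with a trailing "*" marking
# the currently selected level.
LEVEL_RE = re.compile(r"^(\d+):\s*(\d+)\s*mhz\s*(\*?)\s*$", re.IGNORECASE)


def parse_dpm_levels(text):
    """Return (levels, active_index); levels maps level index -> clock in MHz."""
    levels = {}
    active = None
    for line in text.splitlines():
        match = LEVEL_RE.match(line.strip())
        if not match:
            continue  # ignore anything that does not look like a level line
        index = int(match.group(1))
        levels[index] = int(match.group(2))
        if match.group(3):
            active = index
    return levels, active


def at_lowest_level(text):
    """True if the selected level is the slowest of at least two levels."""
    levels, active = parse_dpm_levels(text)
    if active is None or len(levels) < 2:
        return False
    return levels[active] == min(levels.values())


# Hypothetical sample, not captured from the affected machine.
sample = "0: 150Mhz *\n1: 1500Mhz\n"
print(parse_dpm_levels(sample))  # ({0: 150, 1: 1500}, 0)
print(at_lowest_level(sample))   # True
```

A single reading at the lowest level is normal on an idle desktop, so this only suggests the stuck state if it keeps returning True while the GPU is under load.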

I've had a few classes of exceptions show up in kern.log:

On 4.4, my kde/wayland session hard-froze when moving a window, and produced a
log like this:

kernel: [  116.904013] radeon 0000:06:00.0: GPU fault detected: 146 0x0d8e040c
kernel: [  116.904017] radeon 0000:06:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0001776C
kernel: [  116.904019] radeon 0000:06:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0E10400C
kernel: [  116.904021] VM fault (0x0c, vmid 7) at page 96108, read from 'TC3' (0x54433300) (260)
kernel: [  127.306156] radeon 0000:06:00.0: ring 0 stalled for more than 10404msec
kernel: [  127.306164] radeon 0000:06:00.0: GPU lockup (current fence id 0x0000000000002419 last fence id 0x0000000000002431 on ring 0)
kernel: [  127.357942] radeon 0000:06:00.0: Saved 2200 dwords of commands on ring 0.
kernel: [  127.357961] radeon 0000:06:00.0: GPU softreset: 0x00000009
kernel: [  127.357963] radeon 0000:06:00.0:   GRBM_STATUS=0xF5D01028
kernel: [  127.357965] radeon 0000:06:00.0:   GRBM_STATUS2=0x50000008
kernel: [  127.357968] radeon 0000:06:00.0:   GRBM_STATUS_SE0=0xEC400002
kernel: [  127.357970] radeon 0000:06:00.0:   GRBM_STATUS_SE1=0xEC400002
kernel: [  127.357972] radeon 0000:06:00.0:   GRBM_STATUS_SE2=0x08000002
kernel: [  127.357974] radeon 0000:06:00.0:   GRBM_STATUS_SE3=0xEC000002
kernel: [  127.357976] radeon 0000:06:00.0:   SRBM_STATUS=0x20000040
kernel: [  127.357978] radeon 0000:06:00.0:   SRBM_STATUS2=0x00000000
kernel: [  127.357980] radeon 0000:06:00.0:   SDMA0_STATUS_REG   = 0x46CEE557
kernel: [  127.357982] radeon 0000:06:00.0:   SDMA1_STATUS_REG   = 0x46CEE557
kernel: [  127.357984] radeon 0000:06:00.0:   CP_STAT = 0x84228600
kernel: [  127.357986] radeon 0000:06:00.0:   CP_STALLED_STAT1 = 0x00000c00
kernel: [  127.357988] radeon 0000:06:00.0:   CP_STALLED_STAT2 = 0x40000000
kernel: [  127.357991] radeon 0000:06:00.0:   CP_STALLED_STAT3 = 0x00000400
kernel: [  127.357993] radeon 0000:06:00.0:   CP_CPF_BUSY_STAT = 0x00000006
kernel: [  127.357995] radeon 0000:06:00.0:   CP_CPF_STALLED_STAT1 = 0x00000003
kernel: [  127.357997] radeon 0000:06:00.0:   CP_CPF_STATUS = 0x80000063
kernel: [  127.357999] radeon 0000:06:00.0:   CP_CPC_BUSY_STAT = 0x00000000
kernel: [  127.358001] radeon 0000:06:00.0:   CP_CPC_STALLED_STAT1 = 0x00000000
kernel: [  127.358003] radeon 0000:06:00.0:   CP_CPC_STATUS = 0x00000000
kernel: [  127.358005] radeon 0000:06:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
kernel: [  127.358007] radeon 0000:06:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
kernel: [  127.404670] radeon 0000:06:00.0: GRBM_SOFT_RESET=0x00010001
kernel: [  127.404725] radeon 0000:06:00.0: SRBM_SOFT_RESET=0x00000100
kernel: [  127.405874] radeon 0000:06:00.0:   GRBM_STATUS=0x00003028
kernel: [  127.405876] radeon 0000:06:00.0:   GRBM_STATUS2=0x00000008
kernel: [  127.405878] radeon 0000:06:00.0:   GRBM_STATUS_SE0=0x00000006
kernel: [  127.405880] radeon 0000:06:00.0:   GRBM_STATUS_SE1=0x00000006
kernel: [  127.405882] radeon 0000:06:00.0:   GRBM_STATUS_SE2=0x00000006
kernel: [  127.405884] radeon 0000:06:00.0:   GRBM_STATUS_SE3=0x00000006
kernel: [  127.405885] radeon 0000:06:00.0:   SRBM_STATUS=0x20000A40
kernel: [  127.405887] radeon 0000:06:00.0:   SRBM_STATUS2=0x00000000
kernel: [  127.405889] radeon 0000:06:00.0:   SDMA0_STATUS_REG   = 0x46CEE557
kernel: [  127.405891] radeon 0000:06:00.0:   SDMA1_STATUS_REG   = 0x46CEE557
kernel: [  127.405893] radeon 0000:06:00.0:   CP_STAT = 0x00000000
kernel: [  127.405893] radeon 0000:06:00.0:   CP_STAT = 0x00000000
kernel: [  127.405895] radeon 0000:06:00.0:   CP_STALLED_STAT1 = 0x00000000
kernel: [  127.405896] radeon 0000:06:00.0:   CP_STALLED_STAT2 = 0x00000000
kernel: [  127.405898] radeon 0000:06:00.0:   CP_STALLED_STAT3 = 0x00000000
kernel: [  127.405900] radeon 0000:06:00.0:   CP_CPF_BUSY_STAT = 0x00000000
kernel: [  127.405902] radeon 0000:06:00.0:   CP_CPF_STALLED_STAT1 = 0x00000000
kernel: [  127.405903] radeon 0000:06:00.0:   CP_CPF_STATUS = 0x00000000
kernel: [  127.405905] radeon 0000:06:00.0:   CP_CPC_BUSY_STAT = 0x00000000
kernel: [  127.405907] radeon 0000:06:00.0:   CP_CPC_STALLED_STAT1 = 0x00000000
kernel: [  127.405909] radeon 0000:06:00.0:   CP_CPC_STATUS = 0x00000000
kernel: [  127.405929] radeon 0000:06:00.0: GPU reset succeeded, trying to resume
kernel: [  127.658172] [drm:ci_dpm_enable [radeon]] *ERROR* ci_start_dpm failed
kernel: [  127.658189] [drm:radeon_pm_resume [radeon]] *ERROR* radeon: dpm resume failed
kernel: [  127.658194] [drm] probing gen 2 caps for device 1022:1453 = 733903/e
kernel: [  127.658197] [drm] PCIE gen 3 link speeds already enabled
kernel: [  127.664213] [drm] PCIE GART of 2048M enabled (table at 0x0000000000326000).
kernel: [  127.664341] radeon 0000:06:00.0: WB enabled
kernel: [  127.664344] radeon 0000:06:00.0: fence driver on ring 0 use gpu addr 0x0000000200000c00 and cpu addr 0xffff8807f3799c00
kernel: [  127.664346] radeon 0000:06:00.0: fence driver on ring 1 use gpu addr 0x0000000200000c04 and cpu addr 0xffff8807f3799c04
kernel: [  127.664347] radeon 0000:06:00.0: fence driver on ring 2 use gpu addr 0x0000000200000c08 and cpu addr 0xffff8807f3799c08
kernel: [  127.664349] radeon 0000:06:00.0: fence driver on ring 3 use gpu addr 0x0000000200000c0c and cpu addr 0xffff8807f3799c0c
kernel: [  127.664350] radeon 0000:06:00.0: fence driver on ring 4 use gpu addr 0x0000000200000c10 and cpu addr 0xffff8807f3799c10
kernel: [  127.664772] radeon 0000:06:00.0: fence driver on ring 5 use gpu addr 0x0000000000078b30 and cpu addr 0xffffc90003c38b30
kernel: [  127.664933] radeon 0000:06:00.0: fence driver on ring 6 use gpu addr 0x0000000200000c18 and cpu addr 0xffff8807f3799c18
kernel: [  127.664934] radeon 0000:06:00.0: fence driver on ring 7 use gpu addr 0x0000000200000c1c and cpu addr 0xffff8807f3799c1c
kernel: [  127.666482] [drm] ring test on 0 succeeded in 2 usecs
kernel: [  127.666568] [drm] ring test on 1 succeeded in 2 usecs
kernel: [  127.666586] [drm] ring test on 2 succeeded in 2 usecs
kernel: [  127.666735] [drm] ring test on 3 succeeded in 3 usecs
kernel: [  127.666745] [drm] ring test on 4 succeeded in 3 usecs
kernel: [  127.692636] [drm] ring test on 5 succeeded in 1 usecs
kernel: [  127.712543] [drm] UVD initialized successfully.
kernel: [  127.813896] [drm] ring test on 6 succeeded in 708 usecs
kernel: [  127.813920] [drm] ring test on 7 succeeded in 3 usecs
kernel: [  127.813921] [drm] VCE initialized successfully.
kernel: [  127.814029] [drm:radeon_pm_resume [radeon]] *ERROR* radeon: dpm resume failed
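
As a rough cross-check of the numbers in the 4.4 log above, the following sketch redoes the arithmetic by hand. It shows that the fault address register and the reported page number are the same value, and that the client value in parentheses is just the ASCII tag with NUL padding. Reading the fence difference as "fences still outstanding on ring 0" is an assumption about what the two ids mean, and `decode_client_tag` is a hypothetical helper name.

```python
def decode_client_tag(value):
    """Interpret a 32-bit value as big-endian ASCII and drop NUL padding."""
    return value.to_bytes(4, "big").rstrip(b"\x00").decode("ascii")


# Values copied from the log lines above.
fault_addr = 0x0001776C          # VM_CONTEXT1_PROTECTION_FAULT_ADDR
client = 0x54433300              # value printed after 'TC3'
current_fence = 0x2419           # "current fence id"
last_fence = 0x2431              # "last fence id"

print(fault_addr)                  # 96108, the page in the "VM fault" line
print(decode_client_tag(client))   # TC3
# Assumption: the gap is the number of fences emitted but never signalled.
print(last_fence - current_fence)  # 24
```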

On 4.15.10-041510-generic, I left my computer running overnight and came back
to it frozen with this in kern.log:

Mar 18 04:25:10 Gaia kernel: [  559.092721] BUG: stack guard page was hit at 000000001ecd1fa8 (stack is 0000000020941864..00000000cf703fbf)
Mar 18 04:25:10 Gaia kernel: [  559.092729] kernel stack overflow (page fault): 0000 [#1] SMP NOPTI
Mar 18 04:25:10 Gaia kernel: [  559.092733] Modules linked in: nf_conntrack_netlink nfnetlink xt_addrtype br_netfilter overlay xfrm_user xfrm4_tunnel tunnel4 l2tp_ppp l2tp_netlink l2tp_core ip6_udp_tunnel ipcomp xfrm_ipcomp udp_tunnel esp4 pppox ah4 af_key xfrm_algo xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack libcrc32c ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables devlink iptable_filter binfmt_misc snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi snd_hda_intel edac_mce_amd snd_hda_codec snd_usb_audio snd_hda_core snd_usbmidi_lib kvm_amd snd_hwdep kvm uvcvideo snd_seq_midi irqbypass snd_seq_midi_event snd_rawmidi crct10dif_pclmul videobuf2_vmalloc crc32_pclmul
Mar 18 04:25:10 Gaia kernel: [  559.092784]  videobuf2_memops videobuf2_v4l2 snd_seq ghash_clmulni_intel videobuf2_core snd_pcm pcbc videodev snd_seq_device media snd_timer joydev aesni_intel aes_x86_64 snd crypto_simd input_leds glue_helper serio_raw soundcore cryptd ccp k10temp shpchp mac_hid wmi_bmof sch_fq_codel parport_pc ppdev lp parport ip_tables x_tables autofs4 hid_generic usbhid hid amdkfd amd_iommu_v2 amdgpu chash radeon i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm i2c_piix4 r8169 ahci mii libahci wmi gpio_amdpt gpio_generic
Mar 18 04:25:10 Gaia kernel: [  559.092832] CPU: 5 PID: 7352 Comm: tail Tainted: G        W        4.15.10-041510-generic #201803152130
Mar 18 04:25:10 Gaia kernel: [  559.092834] Hardware name: Gigabyte Technology Co., Ltd. AB350-Gaming 3/AB350-Gaming 3-CF, BIOS F10 12/01/2017
Mar 18 04:25:10 Gaia kernel: [  559.092881] RIP: 0010:amdgpu_get_pp_num_states+0x88/0x120 [amdgpu]
Mar 18 04:25:10 Gaia kernel: [  559.092884] RSP: 0018:ffffb3cb8a837ca8 EFLAGS: 00010282
Mar 18 04:25:10 Gaia kernel: [  559.092888] RAX: 00000000000000d4 RBX: ffffb3cb8a837cac RCX: 0000000000000001
Mar 18 04:25:10 Gaia kernel: [  559.092890] RDX: 0000000000000000 RSI: ffffffffc087a88c RDI: 0000000000000000
Mar 18 04:25:10 Gaia kernel: [  559.092893] RBP: ffffb3cb8a837d20 R08: ffffffffc087a865 R09: ffff88c9ecebd98b
Mar 18 04:25:10 Gaia kernel: [  559.092895] R10: 0000000000000000 R11: ffff88c9ecebd98a R12: ffff88c9ecebd000
Mar 18 04:25:10 Gaia kernel: [  559.092898] R13: ffffffffc087a858 R14: 00000000000000d4 R15: 0000000000000993
Mar 18 04:25:10 Gaia kernel: [  559.092901] FS:  00007fccb1787540(0000) GS:ffff88c9fe740000(0000) knlGS:0000000000000000
Mar 18 04:25:10 Gaia kernel: [  559.092904] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 18 04:25:10 Gaia kernel: [  559.092906] CR2: ffffb3cb8a838000 CR3: 00000004a30d0000 CR4: 00000000003406e0
Mar 18 04:25:10 Gaia kernel: [  559.092909] Call Trace:
Mar 18 04:25:10 Gaia kernel: [  559.092918]  ? tty_insert_flip_string_fixed_flag+0x86/0xe0
Mar 18 04:25:10 Gaia kernel: [  559.092925]  dev_attr_show+0x23/0x60
Mar 18 04:25:10 Gaia kernel: [  559.092931]  sysfs_kf_seq_show+0xa3/0x130
Mar 18 04:25:10 Gaia kernel: [  559.092935]  kernfs_seq_show+0x27/0x30
Mar 18 04:25:10 Gaia kernel: [  559.092939]  seq_read+0xe5/0x430
Mar 18 04:25:10 Gaia kernel: [  559.092943]  kernfs_fop_read+0x137/0x180
Mar 18 04:25:10 Gaia kernel: [  559.092948]  __vfs_read+0x3a/0x170
Mar 18 04:25:10 Gaia kernel: [  559.092954]  ? security_file_permission+0xa1/0xc0
Mar 18 04:25:10 Gaia kernel: [  559.092958]  vfs_read+0x8e/0x130
Mar 18 04:25:10 Gaia kernel: [  559.092962]  SyS_read+0x55/0xc0
Mar 18 04:25:10 Gaia kernel: [  559.092967]  do_syscall_64+0x73/0x130
Mar 18 04:25:10 Gaia kernel: [  559.092973]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
Mar 18 04:25:10 Gaia kernel: [  559.092976] RIP: 0033:0x7fccb12b5081
Mar 18 04:25:10 Gaia kernel: [  559.092978] RSP: 002b:00007ffc17d84d68 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
Mar 18 04:25:10 Gaia kernel: [  559.092982] RAX: ffffffffffffffda RBX: 0000000000002000 RCX: 00007fccb12b5081
Mar 18 04:25:10 Gaia kernel: [  559.092984] RDX: 0000000000002000 RSI: 00007ffc17d84db0 RDI: 0000000000000003
Mar 18 04:25:10 Gaia kernel: [  559.092986] RBP: 0000000000000000 R08: 0000000000000000 R09: 00007fccb1313b40
Mar 18 04:25:10 Gaia kernel: [  559.092988] R10: 00000000fffffff3 R11: 0000000000000246 R12: 00007ffc17d84db0
Mar 18 04:25:10 Gaia kernel: [  559.092991] R13: 0000000000000003 R14: ffffffffffffffff R15: 000055e8f3b747e0
Mar 18 04:25:10 Gaia kernel: [  559.092994] Code: c7 c2 7a a8 87 c0 be 00 10 00 00 4c 89 e7 e8 d0 08 90 d1 41 89 c7 8b 45 8c 85 c0 74 72 48 8d 5d 8c 45 31 f6 49 c7 c5 58 a8 87 c0 <42> 8b 44 b3 04 44 89 f1 4d 89 e8 83 f8 0a 74 2d 83 f8 02 49 c7
Mar 18 04:25:10 Gaia kernel: [  559.093080] RIP: amdgpu_get_pp_num_states+0x88/0x120 [amdgpu] RSP: ffffb3cb8a837ca8
Mar 18 04:25:10 Gaia kernel: [  559.093084] ---[ end trace dbba232a9ca4c5c7 ]---
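
One way to sanity-check the "stack guard page was hit" message is to recompute the faulting address from the register dump above. This is a sketch under an assumption: the marked bytes `<42> 8b 44 b3 04` are read here as a 32-bit load from `rbx + r14*4 + 4`, which is a hand decode of the instruction and not something the log states. If that reading is right, the computed address should equal CR2.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def effective_address(base, index, scale, disp):
    """x86-64 style address computation: base + index*scale + disp (mod 2^64)."""
    return (base + index * scale + disp) & MASK64


# Register values copied from the oops above.
rbx = 0xFFFFB3CB8A837CAC
rbp = 0xFFFFB3CB8A837D20
r14 = 0xD4
cr2 = 0xFFFFB3CB8A838000

# Assumed decode of the faulting instruction: load from rbx + r14*4 + 4.
addr = effective_address(rbx, r14, 4, 4)

print(hex(addr))        # 0xffffb3cb8a838000
print(addr == cr2)      # True
print(hex(rbp - rbx))   # 0x74, so rbx points into the current stack frame
```

Under that assumption the access is element 0xd4 (212) of a 4-byte array that starts 0x70 bytes below RBP, which lands exactly on the page boundary above the stack and would be consistent with a loop count that was never bounded.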

Possibly related, if I `cat pp_num_states` from a terminal, I get a
segmentation fault:

root@Gaia:~# cat /sys/class/drm/card0/device/pp_num_states
Segmentation fault

I'm going to continue to dig. Let me know what logs/tests/whatnot I can provide
that would be useful.


You are receiving this mail because:
  • You are the assignee for the bug.