From: bugzilla-daemon@freedesktop.org
To: dri-devel@lists.freedesktop.org
Subject: [Bug 93101] GPU Fault almost burned the CPU
Date: Wed, 25 Nov 2015 10:00:11 +0000
List-Id: dri-devel@lists.freedesktop.org

https://bugs.freedesktop.org/show_bug.cgi?id=93101

Bug ID: 93101
Summary: GPU Fault almost burned the CPU
Product: Mesa
Version: git
Hardware: Other
OS: All
Status: NEW
Severity: normal
Priority: medium
Component: Drivers/Gallium/radeonsi
Assignee: dri-devel@lists.freedesktop.org
Reporter: dev@illwieckz.net
QA Contact: dri-devel@lists.freedesktop.org

Created attachment 120103
  --> https://bugs.freedesktop.org/attachment.cgi?id=120103&action=edit
syslog (short)

Hi, this issue is about the fact that a GPU lockup can lead to a CPU burn
(for real).

A few hours ago I got a GPU lockup while trying to play a DVD with VLC.
Video rendering wasn't functional (no picture), then the GPU started to
display weird things (see attached photo), then locked up.

I've attached some logs: one very long syslog and a shortened excerpt of it
(easier to read, but I included the original in case I missed
something).

To summarize, the syslog contains lines like these:

```
Nov 24 22:58:18 gollum gnome-session[3720]: [00007f134c173c20] avcodec decoder: Using G3DVL VDPAU Driver Shared Library version 1.0 for hardware decoding
Nov 24 22:58:18 gollum kernel: [97035.599456] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00002126
Nov 24 22:58:18 gollum kernel: [97035.599460] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0408800C
Nov 24 22:58:18 gollum kernel: [97035.599465] VM fault (0x0c, vmid 2) at page 8486, read from 'TC4' (0x54433400) (136)
Nov 24 22:58:55 gollum kernel: [97072.747472] radeon 0000:01:00.0: ring 0 stalled for more than 10088msec
Nov 24 22:58:55 gollum kernel: [97072.747483] radeon 0000:01:00.0: GPU lockup (current fence id 0x000000000059fcff last fence id 0x000000000059fd12 on ring 0)
Nov 24 22:59:04 gollum kernel: [97081.259933] WARNING: CPU: 4 PID: 23502 at /home/kernel/COD/linux/drivers/gpu/drm/radeon/radeon_object.c:83 radeon_ttm_bo_destroy+0xe7/0xf0 [radeon]()
```
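
The numbers in these lines can be cross-checked against each other. The sketch below is only an illustration: the bit layout of the fault status word is an assumption, chosen because it reproduces the values the kernel printed above, and the function names are hypothetical.

```python
# Bit layout assumed for VM_CONTEXT1_PROTECTION_FAULT_STATUS. It is an
# assumption, picked because it reproduces the values printed in the log.
PROTECTIONS_MASK = 0xFF        # bits 0-7: protection fault bits
MEMORY_CLIENT_ID_SHIFT = 12    # bits 12-19: memory client id
MEMORY_CLIENT_ID_MASK = 0xFF
MEMORY_CLIENT_RW_SHIFT = 24    # bit 24: 0 = read, 1 = write
FAULT_VMID_SHIFT = 25          # bits 25-28: VM context id
FAULT_VMID_MASK = 0xF


def decode_fault_status(status: int) -> dict:
    """Split a fault status word into the fields shown in the 'VM fault' line."""
    return {
        "protections": status & PROTECTIONS_MASK,
        "mc_id": (status >> MEMORY_CLIENT_ID_SHIFT) & MEMORY_CLIENT_ID_MASK,
        "is_write": bool((status >> MEMORY_CLIENT_RW_SHIFT) & 1),
        "vmid": (status >> FAULT_VMID_SHIFT) & FAULT_VMID_MASK,
    }


def client_tag(word: int) -> str:
    """Read a 32-bit word as big-endian ASCII, dropping NUL padding."""
    return word.to_bytes(4, "big").rstrip(b"\x00").decode("ascii")


fault = decode_fault_status(0x0408800C)   # VM_CONTEXT1_PROTECTION_FAULT_STATUS
fault_page = 0x00002126                   # VM_CONTEXT1_PROTECTION_FAULT_ADDR
fence_gap = 0x59FD12 - 0x59FCFF           # last fence id minus current fence id

print(fault)
print(fault_page, client_tag(0x54433400), fence_gap)
```

With the values from the log, this gives protections 0x0c, vmid 2 and a read access from memory client 136 ('TC4') at page 8486, which matches the "VM fault" line. The gap between the two fence ids reported at lockup time is 19.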

My system is running:

vlc 3.0.0~~git20151123+r62463+34~ubuntu15.10.1
linux-image-4.3.0-040300-generic 4.3.0-040300.201511020949
libdrm-radeon1 2.4.65+git1511161830.8913cd~gd~w
xserver-xorg-video-radeon 7.6.99+git1511170732.10b7c3~gd~w
libgl1-mesa-dri 11.2~git1511231930.e4c122~gd~w
mesa-vdpau-drivers 11.2~git1511231930.e4c122~gd~w

That is a real issue but it's not the topic of this ticket.

The really big problem is that this bug almost burned my CPU. Let me explain.

When the bug occurred, I tried to track it down. Instead of rebooting my
computer, I started a laptop in order to connect to the computer over ssh and
diagnose the live system. While the laptop was booting, I took some
photos of my screen.

But suddenly my computer shut itself down: the CPU critical temperature had
been reached.

The normal operating temperature is between 30°C and 40°C on my system. In
case of emergency, I have two regulators running on my computer. The first one
raises the fan speed from 128 RPM to 1400 RPM when the temperature reaches
50°C, and the second one downclocks all 8 cores from 4.7 GHz to 1.4 GHz when
the temperature reaches 70°C.

Both regulators run in userspace. The first is the well-known
fancontrol, and the other one is my own. Both work well (when I run cpuburn,
for example).
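
A minimal sketch of the threshold policy described above, for illustration only: it is not fancontrol's configuration or the actual downclocking script. The thresholds and targets come from this report; every name in the code is hypothetical.

```python
from dataclasses import dataclass

# Thresholds and targets taken from the report; all names are hypothetical.
FAN_THRESHOLD_C = 50.0
THROTTLE_THRESHOLD_C = 70.0
FAN_IDLE_RPM, FAN_MAX_RPM = 128, 1400
CPU_FULL_GHZ, CPU_THROTTLED_GHZ = 4.7, 1.4


@dataclass(frozen=True)
class Targets:
    fan_rpm: int
    cpu_ghz: float


def targets_for(temp_c: float) -> Targets:
    """Return the fan speed and CPU clock the two regulators should request."""
    fan = FAN_MAX_RPM if temp_c >= FAN_THRESHOLD_C else FAN_IDLE_RPM
    clock = CPU_THROTTLED_GHZ if temp_c >= THROTTLE_THRESHOLD_C else CPU_FULL_GHZ
    return Targets(fan, clock)


for temp in (35, 50, 70, 90):
    print(temp, targets_for(temp))
```

Under this policy, any temperature of 70°C or above should result in 1400 RPM and 1.4 GHz.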

The fact is, when the GPU lockup occurred, something in the driver went wrong
on the CPU side. It looks like some infinite loop started on my cores, doing
intensive work, probably without touching external components
(like main memory), so nothing ever slowed the CPU down.

In fact, the computer behaved exactly as if I were running one cpuburn process
per core with the performance CPU governor at noon in summer. But there was
one exception: the fan never sped up (it was still running at 128 RPM
when the CPU reached 90°C), and the CPU was never downclocked either.

That's why I filed this issue. When this bug occurred, the system went so
wrong that the CPU was brought to its knees and no regulator was able to
control the CPU fan, so the CPU kept heating up.

Fortunately, the CPU's internal temperature protection automatically shut down
my computer to save itself. But if someone uses a CPU with a faulty temperature
safety mechanism, this GPU lockup can lead to a CPU burn for real!

