From mboxrd@z Thu Jan 1 00:00:00 1970
From: bugzilla-daemon@freedesktop.org
Subject: [Bug 105733] Amdgpu randomly hangs and only ssh works. Mouse cursor moves sometimes but does nothing. Keyboard stops working.
Date: Fri, 27 Apr 2018 12:41:52 +0000
To: dri-devel@lists.freedesktop.org
List-Id: dri-devel@lists.freedesktop.org

https://bugs.freedesktop.org/show_bug.cgi?id=105733

--- Comment #12 from Allan ---
My system started to power down for no apparent reason at times, even when
using the GTX1070 (nvidia|nouveau).
So I installed a Windows image to check whether the kernel was the problem.

Well, for now it *SEEMS* that it isn't *ONLY* the driver/kernel:
- The RX480 was freezing in the same way, so I sent it in for warranty.
- The RX580 ran with problems; almost always I got a message like
"DX11 : device disconnected" or "Mantle : Device lost".
- The GTX1070 ran fine for 1 day, then it behaved the same as the RX580 and,
to my bad luck, the system started to power down after a random time
(roughly 5 min to 2 hours).


The driver/kernel (amdgpu/linux) certainly has its faults here, and here's why:
- On Windows, the only card that hung the system was the RX480, and only
sometimes, because it was really broken.
- In the other cases, when a failure happened (with Nvidia or AMD), the system
was able to regain control of the device.
 - Maybe by doing a soft reset?
 - Maybe by just killing the driver and starting it again?
 - Maybe just by stopping the processes that were using the GPU, to avoid a
long chain of resulting problems?
- Neither the RX580 nor the GTX1070 has dual BIOS AFAIK. Maybe the RX480 does,
but I did not test it.

Then:
- Checked and replaced the PCI-E power cables: OK.
- Tested the power supply (luckily the AX860i has a self test): OK.
- Cleaned all slots with a brush: OK.
- Tested the CPU and RAM again: OK.

But I must be having very bad luck: the problems persisted.

I've sent the motherboard in for warranty. I'm waiting for the diagnosis and a
solution.

I'll report back here as soon as possible.

Thoughts in the meantime:
- Not being able to kill the processes *is* a problem that concerns only
amdgpu, and it is a problem either of the driver itself (most likely) or of
the kernel.
- The driver is not capable of regaining control of the device.
- It is impossible to kill child PIDs when something hangs while using amdgpu.
- Yes, it occurred once or twice with the proprietary nvidia driver too, but
that was probably caused by the faulty motherboard that I'm waiting to have
fixed.
- Using nouveau was the happiest path, but unfortunately nouveau does not
support Pascal at all yet. It keeps the card at the minimum clock (300 or
400 MHz) and it is not yet possible to increase the speed of the card. So it
is not a viable option.
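
A possible way to check the "cannot kill the processes" point: on Linux, a process that ignores SIGKILL is usually in uninterruptible sleep (state "D"), typically blocked inside a kernel or driver call. The sketch below lists such processes by reading /proc/<pid>/stat. It is only an illustration under that assumption; the function names are hypothetical and nothing in it is specific to amdgpu.

```python
import os


def parse_state(stat_text):
    """Return the one-letter state field from the text of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so the state is the first field after the LAST ')'.
    after_comm = stat_text[stat_text.rfind(")") + 1:]
    return after_comm.split()[0]


def parse_comm(stat_text):
    """Return the command name found between the first '(' and the last ')'."""
    return stat_text[stat_text.find("(") + 1:stat_text.rfind(")")]


def uninterruptible_processes(proc_root="/proc"):
    """Return (pid, command) pairs for processes currently in state 'D'."""
    found = []
    if not os.path.isdir(proc_root):
        return found
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                text = f.read()
        except OSError:
            continue  # the process exited while we were scanning
        if parse_state(text) == "D":
            found.append((int(entry), parse_comm(text)))
    return found


if __name__ == "__main__":
    for pid, comm in uninterruptible_processes():
        print(pid, comm)
```

Running it over ssh while the desktop is hung would show whether the stuck GPU clients are sitting in state D; a brief D state during normal disk I/O is expected and not a sign of a hang.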
