From mboxrd@z Thu Jan 1 00:00:00 1970 From: Matthew Dawson Subject: Fixing system lockup after GPU reset on SI (Radeon HD 7970 GHz edition) Date: Sun, 17 Jan 2016 02:10:06 -0500 Message-ID: <1947060.HExFUpHDR7@ring00> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="===============0892464721==" Return-path: Received: from scadrial.mjdsystems.ca (scadrial.mjdsystems.ca [198.100.154.185]) by gabe.freedesktop.org (Postfix) with ESMTP id 9D0FE6E287 for ; Sat, 16 Jan 2016 23:16:53 -0800 (PST) Received: from ring00.localnet (unknown [IPv6:2607:f090:63e5:1608:beae:c5ff:fe07:d22]) by scadrial.mjdsystems.ca (Postfix) with ESMTPSA id 58823F2AB8D for ; Sun, 17 Jan 2016 02:10:10 -0500 (EST) List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" To: dri-devel@lists.freedesktop.org List-Id: dri-devel@lists.freedesktop.org --===============0892464721== Content-Type: multipart/signed; boundary="nextPart4861507.kqZgTrl8sc"; micalg="sha256"; protocol="application/pkcs7-signature" --nextPart4861507.kqZgTrl8sc Content-Transfer-Encoding: 7Bit Content-Type: text/plain; charset="us-ascii"

Hi all,

I'm trying to work through this bug: https://bugs.freedesktop.org/show_bug.cgi?id=93649 . The main symptom is that the system locks up: some process tries to reset the GPU while a GPU reset is already in progress, and the two deadlock. The system still works over ssh; only the graphics get stuck. I'm trying to fix the kernel side of this first, so that my GPU can reliably reset when the game triggers the GPU lockup. After that I'll try tracking down the Mesa issue that causes the lockup in the first place. I've done some preliminary investigation, but I'm running out of ideas, as public documentation on some of the AMD hardware is currently not available.
As far as I can tell, when the radeon module tries to reset the GPU, it always fails to bring up the VCE (which I haven't looked at yet, as it doesn't seem to be involved in this issue) and the UVD. The VCE failure is caught early, so the kernel module just ignores the whole thing. The UVD, however, claims to initialize properly, but when the kernel module tries to run a test IB on the UVD ring, it stalls forever. Note: before any issues occur, the UVD works on my GPU, tested with a random media file and vlc.

I asked on IRC some time ago, where Dave Airlie suggested that the UVD is really unhappy with being reset, and to try disabling that as a test. Nothing I tried yielded any improvement.

I also noticed that the SMC (I assume that is some sort of power manager? I didn't find anything on it besides the source code) fails to initialize after a reset, with the error:

[drm:si_dpm_set_power_state [radeon]] *ERROR* si_set_sw_state failed

I'm wondering if this might be causing the issue instead, as the source code fiddles with the UVD after this error. Not knowing more, I can't say for sure.

Details on testing done:

For the UVD, I tried forcing it to be completely reset by setting the appropriate bit in SRBM_SOFT_RESET, but that still causes the failure to happen in the same place.

Based on the advice from IRC, I tried disabling large parts of the UVD startup and shutdown code, so that no part of the UVD gets disabled. Some of the initialization code also disables parts of the UVD, which is why I disabled it as well. There was no change. Note that the initial startup was never changed, and vlc was always able to play a video using it.

Suspecting the SMC, I captured the return code from the message sent in si_set_sw_state. It always returns 0x0, which doesn't have a name in the source code. From looking at the code, I guess this means a timeout. I have no idea where to look further, as I couldn't find any documentation. If there is any I missed, I'd be happy to take a look and see what is going on.
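To illustrate why an unnamed 0x0 result would be consistent with a timeout, here is a minimal userspace sketch. It assumes the driver posts a message and then polls a response register that reads zero until the SMC answers, returning whatever the register holds once the poll budget runs out. Every name below (`fake_smc`, `sketch_send_msg`, the `SKETCH_RESULT_*` values) is hypothetical and invented for illustration; this is not the radeon code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result codes for this sketch; 0x00 is deliberately unnamed. */
enum { SKETCH_RESULT_OK = 0x01, SKETCH_RESULT_FAILED = 0xFF };

/* Simulated mailbox: the response register stays 0 until the fake firmware
 * "answers" after answer_after_polls reads (a negative value means never). */
struct fake_smc {
	uint32_t resp;
	int answer_after_polls;
	int polls;
};

/* Simulated register read; the fake firmware answers once enough reads passed. */
static uint32_t fake_read_resp(struct fake_smc *smc)
{
	if (smc->answer_after_polls >= 0 && smc->polls >= smc->answer_after_polls)
		smc->resp = SKETCH_RESULT_OK;
	smc->polls++;
	return smc->resp;
}

/* Poll until the response is non-zero or max_polls reads have been made,
 * then return the register contents. On timeout that value is still 0x00. */
static uint32_t sketch_send_msg(struct fake_smc *smc, int max_polls)
{
	int i;

	assert(smc != NULL);
	smc->resp = 0;	/* assumption: the response reads 0 while a message is pending */
	smc->polls = 0;
	for (i = 0; i < max_polls; i++) {
		if (fake_read_resp(smc) != 0)
			break;
	}
	return smc->resp;
}
```

With a responsive `fake_smc`, `sketch_send_msg` returns `SKETCH_RESULT_OK`; with one that never answers, it returns 0x00, a value that matches neither named code. Under the stated assumption, a caller that only compares against the OK value would report a generic failure, which is at least compatible with the `si_set_sw_state failed` message.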
I also captured traces of every command sent to the SMC, if that would help. I haven't checked them much, other than to note that they are different from those sent on boot. Also, is there a bit in either GRBM_SOFT_RESET or SRBM_SOFT_RESET to reset the SMC? I'm just curious if that might help.

I've been using vlc playing a movie while forcing a GPU reset through debugfs to speed up testing, as it quickly and reliably causes this issue. I can also reproduce this reliably with TF2; it just takes 30-60 minutes per test. For fixes that looked promising, I'd use TF2 to confirm that the reset failure seen with vlc using the UVD wasn't different from the one TF2 triggers.

Any help in debugging this issue would be greatly appreciated. Any documentation I can review to better understand the GPU would be helpful. I already checked the documentation linked from the fdo wiki, but it didn't mention this part.

One last thing: I can partially work around the hang by allowing the IB test of the UVD to time out. I've used a long timeout (20 seconds) for testing. Would a patch limiting this be accepted? It might allow users who run into this to recover (sometimes TF2 will recover thanks to that workaround and continue playing. Sometimes the system still locks up due to other issues, but those don't seem to be hardware errors, so I'd rather work on that later). Right now I add a timeout to every call to radeon_fence_wait, but if that isn't a good idea I could add another similar function (radeon_fence_wait_timeout?) that takes a timeout, and update the ring tests appropriately.
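As a rough illustration of the second option (a separate wait function that takes a timeout, used only by the ring tests), here is a minimal userspace sketch. It is not kernel code: `sketch_fence`, `sketch_fence_wait_timeout`, and the tick callbacks are hypothetical stand-ins, time is counted in abstract ticks, and the return convention (ticks left on success, 0 on timeout) is an assumption borrowed from the style of the kernel's `wait_event_timeout`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a fence: it counts as signaled once the ring's
 * last completed sequence number reaches the fence's own number. */
struct sketch_fence {
	uint64_t seq;
};

/* Callback that simulates one tick of hardware progress (or lack of it). */
typedef void (*tick_fn)(uint64_t *completed_seq);

static void ring_advances(uint64_t *completed_seq) { (*completed_seq)++; }
static void ring_stalled(uint64_t *completed_seq) { (void)completed_seq; }

/* Wait for at most timeout_ticks ticks.
 * Returns 0 if the fence is still unsignaled when the budget runs out,
 * otherwise the number of ticks left (at least 1). */
static long sketch_fence_wait_timeout(const struct sketch_fence *f,
				      uint64_t *completed_seq,
				      long timeout_ticks, tick_fn tick)
{
	long left = timeout_ticks;

	assert(f != NULL && completed_seq != NULL && tick != NULL);
	while (*completed_seq < f->seq) {
		if (left == 0)
			return 0;	/* timed out: the caller can fail the ring test */
		tick(completed_seq);
		left--;
	}
	return left > 0 ? left : 1;
}
```

A ring test built on this shape would treat a return value of 0 as "the ring did not come back after reset" and report the failure instead of blocking forever, while ordinary callers could keep using the untimed wait.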
Thanks for reading my wall of text,
--
Matthew
--nextPart4861507.kqZgTrl8sc--
--===============0892464721==--