* Issues with AMD microcode updates
From: Henrique de Moraes Holschuh @ 2013-09-19 14:58 UTC
To: Jacob Shin; +Cc: Andreas Herrmann, linux-kernel

Jacob, Andreas,

I take care of the amd64 microcode update support for Debian, and I'm
receiving user reports of lockup issues with the AMD microcode driver in
several kernels.  This is about the runtime update interface,
/sys/devices/system/cpu/*/microcode/reload and
/sys/devices/system/cpu/microcode/reload.

Basically, the issue is that the process that tries to write "1" to the
reload node gets stuck in "D" state on several kernel versions.

I started by blacklisting several older kernels (e.g. I got a report of
2.6.38 locking up), but recently I got a report of a lockup with kernel
3.5.1.  Blacklisting everything before 3.10 is not exactly kosher, not when
I would have to blindly trust 3.0, 3.2 and 3.4 to not have whatever issue is
causing the lockups.

IMHO that's the point where it becomes interesting to actually track down
the bug even if it apparently doesn't exist anymore on the more recent
kernels, and ensure that the stable/long-term kernels have the fix.  That
would also help distros blacklist microcode update on the broken kernels.

Unfortunately, I don't own, or have access to, any boxes with an AMD
processor (let alone one with an AMD processor in need of a microcode
update) to bisect the problem.

I'd appreciate it if AMD (or anyone with an AMD processor, really) could
help me track this issue down.

Debian bug reports:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=717185
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=723081

-- 
  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique Holschuh
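The "D" state mentioned above is the uninterruptible-sleep state reported by ps. A minimal sketch of how a user might confirm that the writer is stuck there, assuming a procps-style ps; the helper name list_d_state is hypothetical and not part of any Debian script:

```shell
#!/bin/sh
# Hypothetical helper: reads "pid stat wchan comm" rows on stdin and
# prints only the rows whose state column starts with D
# (uninterruptible sleep). The ps header row is dropped because its
# second column is "STAT".
list_d_state() {
    awk '$2 ~ /^D/'
}

# Possible usage on an affected box:
#   ps -eo pid,stat,wchan,comm | list_d_state
```

The wchan column, where the kernel exposes it, hints at the function the task is sleeping in, which could be worth attaching to a bug report.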
* Re: Issues with AMD microcode updates
From: Borislav Petkov @ 2013-09-19 16:44 UTC
To: Henrique de Moraes Holschuh; +Cc: Jacob Shin, Andreas Herrmann, linux-kernel

On Thu, Sep 19, 2013 at 11:58:34AM -0300, Henrique de Moraes Holschuh wrote:
> Jacob, Andreas,
>
> I take care of the amd64 microcode update support for Debian, and I'm
> receiving user reports of lockup issues with the AMD microcode driver in
> several kernels.  This is about the runtime update interface,
> /sys/devices/system/cpu/*/microcode/reload and
> /sys/devices/system/cpu/microcode/reload.
>
> Basically, the issue is that the process that tries to write "1" to the
> reload node gets stuck in "D" state on several kernel versions.
>
> I started by blacklisting several older kernels (e.g. I got a report of
> 2.6.38 locking up), but recently I got a report of a lockup with kernel
> 3.5.1.  Blacklisting everything before 3.10 is not exactly kosher, not when
> I would have to blindly trust 3.0, 3.2 and 3.4 to not have whatever issue is
> causing the lockups.
>
> IMHO that's the point where it becomes interesting to actually track down
> the bug even if it apparently doesn't exist anymore on the more recent
> kernels, and ensure that the stable/long-term kernels have the fix.  That
> would also help distros blacklist microcode update on the broken kernels.
>
> Unfortunately, I don't own, or have access to, any boxes with an AMD
> processor (let alone one with an AMD processor in need of a microcode
> update) to bisect the problem.
>
> I'd appreciate it if AMD (or anyone with an AMD processor, really) could
> help me track this issue down.
>
> Debian bug reports:
> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=717185
> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=723081

Well, neither Andreas nor Jacob works for AMD anymore. I could try to
help with this but it'll be slow as I'm pretty busy with other stuff.

Anyway, I'd suggest we look only at the long-term kernels since they're
the only ones which can get updates/fixes anyway.

Now, how do I reproduce this? Writing 1 to .../reload on the latest kernel
works here. So I'd need a reproducer. Alternatively, I'd need a sysrq-l
and sysrq-w from those systems with hung processes.

Thanks.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
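For users who are asked for the sysrq-l and sysrq-w output, the following is a minimal sketch of one way to capture it without a keyboard, assuming root access and a kernel built with magic SysRq support. The function name dump_sysrq and the SYSRQ_TRIGGER override are hypothetical and exist only to make the sketch testable; the real node is /proc/sysrq-trigger.

```shell
#!/bin/sh
# SYSRQ_TRIGGER is a hypothetical override for dry runs; the default is
# the real trigger node.
SYSRQ_TRIGGER="${SYSRQ_TRIGGER:-/proc/sysrq-trigger}"

dump_sysrq() {
    # 'l' asks for a stack backtrace of all active CPUs,
    # 'w' asks for a dump of tasks in uninterruptible (blocked) state.
    for key in l w; do
        printf '%s' "$key" >"$SYSRQ_TRIGGER" || return 1
    done
}

# Possible usage while the writer is hung (as root):
#   dump_sysrq && dmesg > sysrq-dump.txt
```

The dumps land in the kernel log, so saving dmesg output right after the call should preserve them for the bug report.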
* Re: Issues with AMD microcode updates
From: Henrique de Moraes Holschuh @ 2013-09-19 18:15 UTC
To: Borislav Petkov; +Cc: Jacob Shin, Andreas Herrmann, linux-kernel

On Thu, 19 Sep 2013, Borislav Petkov wrote:
> On Thu, Sep 19, 2013 at 11:58:34AM -0300, Henrique de Moraes Holschuh wrote:
> > I take care of the amd64 microcode update support for Debian, and I'm
> > receiving user reports of lockup issues with the AMD microcode driver in
> > several kernels.  This is about the runtime update interface,
> > /sys/devices/system/cpu/*/microcode/reload and
> > /sys/devices/system/cpu/microcode/reload.
> >
> > Basically, the issue is that the process that tries to write "1" to the
> > reload node gets stuck in "D" state on several kernel versions.
> >
> > I started by blacklisting several older kernels (e.g. I got a report of
> > 2.6.38 locking up), but recently I got a report of a lockup with kernel
> > 3.5.1.  Blacklisting everything before 3.10 is not exactly kosher, not when

The kernels reported to be broken are 2.6.38 and 3.5.2; I got the last one
wrong.

> > I would have to blindly trust 3.0, 3.2 and 3.4 to not have whatever issue is
> > causing the lockups.

...

> > Debian bug reports:
> > http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=717185
> > http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=723081
>
> Well, both Andreas and Jacob don't work for AMD anymore. I could try to
> help with this but it'll be slow as I'm pretty busy with other stuff.

Well, if someone can give me suitable ssh and full root access to a small
AMD box anywhere in the world [with a suitably outdated BIOS/EFI that
doesn't have the latest microcode for the processor] so that I can bisect
this, I'm game.

Preferably, a box with a throw-away install of the latest Debian stable,
which might help track down the issue faster since it is what I am most
comfortable with.

> Anyway, I'd suggest we look only on the long term kernels since they're
> the only ones which can get updates/fixes anyway.

If I could get a confirmation that "it's good on latest 3.0, 3.2, 3.4, 3.10
and mainline", I'd at least be able to blacklist everything else.  But I'd
need at least a control test of 3.5.2 (which should fail) to make sure it is
easy to reproduce the bug on the test box...

I'm almost sure that the latest 3.2 and 3.10+ work just fine, otherwise I'd
have noticed it really fast...

> Now, how do I reproduce this? Writing 1 to .../reload on latest kernel
> works here. So I'd need a reproducer. Alternatively, I'd need a sysrq-l
> and sysrq-w from those systems with hung processes.

I can request help on debian-user or debian-devel to get someone with an AMD
box to help with bisection, but it is usually best if we don't ask general
users to bisect kernels (due to the non-zero risk of data corruption if the
bisect hits one of the problem spots that often show up during the
development window).

-- 
  Henrique Holschuh
* Re: Issues with AMD microcode updates
From: Borislav Petkov @ 2013-09-19 18:46 UTC
To: Henrique de Moraes Holschuh; +Cc: Jacob Shin, Andreas Herrmann, linux-kernel

On Thu, Sep 19, 2013 at 03:15:54PM -0300, Henrique de Moraes Holschuh wrote:
> I can request help on debian-user or debian-devel to get someone with
> an AMD box to help with bisection, but it is usually best if we don't
> ask general users to bisect kernels (due to non-zero risk of data
> corruption if the bisect hit one of the problem spots that often show
> up during the development window).

I have a couple of AMD boxes so I can bisect. I just need a reproducer
that shows how to trigger it.

Thanks.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
* Re: Issues with AMD microcode updates
From: Henrique de Moraes Holschuh @ 2013-09-19 19:26 UTC
To: Borislav Petkov; +Cc: Jacob Shin, Andreas Herrmann, linux-kernel

On Thu, 19 Sep 2013, Borislav Petkov wrote:
> On Thu, Sep 19, 2013 at 03:15:54PM -0300, Henrique de Moraes Holschuh wrote:
> > I can request help on debian-user or debian-devel to get someone with
> > an AMD box to help with bisection, but it is usually best if we don't
> > ask general users to bisect kernels (due to non-zero risk of data
> > corruption if the bisect hit one of the problem spots that often show
> > up during the development window).
>
> I have a couple of AMD boxes so I can bisect - I just need a reproducer
> how to trigger.

Sure.  There are two possibilities:

Possibility one (the most likely): hang on first microcode update:

1. Have the latest AMD microcode update (from linux-firmware) installed to
   the proper place under /lib/firmware, but NOT yet uploaded to the kernel.

2. Run this:

   find /sys/devices/system/cpu -noleaf -type f \
       -path '/sys/devices/system/cpu/cpu*/microcode/reload' | \
     while read i ; do echo -n 1 >"$i" || true ; done

If the kernel is buggy, it should hang.  If it doesn't hang for a
supposed-bad kernel (2.6.38 or 3.5.2), please check "possibility two" below.

Possibility two: hang on second microcode update in a row:

1. Install a previous version of the AMD microcode update (which must still
   be newer than what is in the processor) to /lib/firmware/...

2. Run the command (2) above.  It should not hang, and it should update the
   microcode in the processor.

3. Update /lib/firmware with the latest microcode from AMD (i.e. so that the
   processor will have its microcode updated TWICE).

4. Run the command (2) above.  It should hang if the kernel is buggy.

I do not have any reports of kernels 3.6 and later causing issues.  If they
do, the "reproducer" should be this, instead:

   echo -n 1 > /sys/devices/system/cpu/microcode/reload

You can get earlier versions of the AMD microcode to test "possibility two"
from the Debian package historical archive:

http://snapshot.debian.org/archive/debian/20120710T032858Z/pool/non-free/a/amd64-microcode/amd64-microcode_1.20120117.orig.tar.bz2
http://snapshot.debian.org/archive/debian/20120915T033250Z/pool/non-free/a/amd64-microcode/amd64-microcode_1.20120910.orig.tar.bz2

The latest version of the microcode is available in linux-firmware.

-- 
  Henrique Holschuh
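When running the reproducer on a test box, it may help to keep the calling shell usable if the write hangs. The following is a minimal sketch, not the Debian script: it performs the same per-CPU writes in a background subshell and only reports whether that writer is still alive after a grace period. The function names, the base-directory argument and the grace period are all hypothetical. A task stuck in "D" state cannot be killed, so the sketch deliberately only observes it.

```shell
#!/bin/sh
# Write "1" to every per-CPU reload node under the given base directory
# (/sys/devices/system/cpu on a real system), ignoring write errors.
write_reload_nodes() {
    base="$1"
    for node in "$base"/cpu*/microcode/reload; do
        [ -e "$node" ] || continue
        printf 1 2>/dev/null >"$node" || true
    done
}

# Run the writes in the background and poll once per second.
# Prints a message and returns 1 if the writer is still running after
# the grace period (in seconds), which suggests a hang.
run_with_watchdog() {
    base="$1"
    grace="$2"
    write_reload_nodes "$base" &
    pid=$!
    waited=0
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$grace" ]; then
            echo "writer $pid still running after ${grace}s (possible hang)"
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    wait "$pid"
    echo "reload writes completed"
}

# Possible usage (as root):
#   run_with_watchdog /sys/devices/system/cpu 30
```

If the watchdog reports a possible hang, the shell that called it stays free to collect process state and kernel logs from the same session.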
* Re: Issues with AMD microcode updates
From: Sherry Hurwitz @ 2013-09-24 23:35 UTC
To: Borislav Petkov
Cc: Henrique de Moraes Holschuh, Jacob Shin, Andreas Herrmann, linux-kernel

On 09/19/2013 11:44 AM, Borislav Petkov wrote:
> On Thu, Sep 19, 2013 at 11:58:34AM -0300, Henrique de Moraes Holschuh wrote:
>> Jacob, Andreas,
>>
>> I take care of the amd64 microcode update support for Debian, and I'm
>> receiving user reports of lockup issues with the AMD microcode driver in
>> several kernels.  This is about the runtime update interface,
>> /sys/devices/system/cpu/*/microcode/reload and
>> /sys/devices/system/cpu/microcode/reload.
>>
>> Basically, the issue is that the process that tries to write "1" to the
>> reload node gets stuck in "D" state on several kernel versions.
>>
>> I started by blacklisting several older kernels (e.g. I got a report of
>> 2.6.38 locking up), but recently I got a report of a lockup with kernel
>> 3.5.1.  Blacklisting everything before 3.10 is not exactly kosher, not when
>> I would have to blindly trust 3.0, 3.2 and 3.4 to not have whatever issue is
>> causing the lockups.
>>
>> IMHO that's the point where it becomes interesting to actually track down
>> the bug even if it apparently doesn't exist anymore on the more recent
>> kernels, and ensure that the stable/long-term kernels have the fix.  That
>> would also help distros blacklist microcode update on the broken kernels.
>>
>> Unfortunately, I don't own, or have access to, any boxes with an AMD
>> processor (let alone one with an AMD processor in need of a microcode
>> update) to bisect the problem.
>>
>> I'd appreciate it if AMD (or anyone with an AMD processor, really) could
>> help me track this issue down.
>>
>> Debian bug reports:
>> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=717185
>> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=723081
> Well, both Andreas and Jacob don't work for AMD anymore. I could try to
> help with this but it'll be slow as I'm pretty busy with other stuff.
>
> Anyway, I'd suggest we look only on the long term kernels since they're
> the only ones which can get updates/fixes anyway.
>
> Now, how do I reproduce this? Writing 1 to .../reload on latest kernel
> works here. So I'd need a reproducer. Alternatively, I'd need a sysrq-l
> and sysrq-w from those systems with hung processes.
>
> Thanks.
>
You can direct AMD microcode issues to me now.
We are setting up some systems in the lab and trying to duplicate
the problem now.

Thanks.
* Re: Issues with AMD microcode updates
From: Henrique de Moraes Holschuh @ 2013-09-25 13:49 UTC
To: Sherry Hurwitz
Cc: Borislav Petkov, Jacob Shin, Andreas Herrmann, linux-kernel

On Tue, 24 Sep 2013, Sherry Hurwitz wrote:
> You can direct AMD microcode issues to me now.
> We are setting up some systems in the lab and trying to duplicate
> the problem now.

Thank you!

If you're going to be taking care of AMD microcode update issues, maybe it
would be a good idea to add your name to the MAINTAINERS file for the "AMD
MICROCODE UPDATE SUPPORT", and remove the (dead for a while now)
amd64-microcode@amd64.org mailing list?

-- 
  Henrique Holschuh
* Re: Issues with AMD microcode updates
From: Sherry Hurwitz @ 2013-09-26 17:36 UTC
To: Henrique de Moraes Holschuh
Cc: Borislav Petkov, Jacob Shin, Andreas Herrmann, linux-kernel

On 09/25/2013 08:49 AM, Henrique de Moraes Holschuh wrote:
> On Tue, 24 Sep 2013, Sherry Hurwitz wrote:
>> You can direct AMD microcode issues to me now.
>> We are setting up some systems in the lab and trying to duplicate
>> the problem now.
> Thank you!
>
> If you're going to be taking care of AMD microcode update issues, maybe it
> would be a good idea to add your name to the MAINTAINERS file for the "AMD
> MICROCODE UPDATE SUPPORT", and remove the (dead for a while now)
> amd64-microcode@amd64.org mailing list?
>
We have failed to reproduce a hang while loading microcode.
We have tested several kernel and AMD family combinations under both
normal and error conditions, so the error paths were exercised.  Obviously
there are factors we are missing that the users are hitting.
Any suggestions on how we can improve the test matrix would be
helpful.  We will continue the investigation but any insights are appreciated.

NOTE: kernels before 3.0 only load 1 (2k) size of microcode patch and
therefore do not support microcode loading of family 14h, 15h, and 16h.
Also, in a test request on another thread you suggested someone with
family 15h revC0 to load microcode twice with an earlier patch and then
the latest, but there has only been 1 microcode patch level published for revB2
so that test won't work.

Test Matrix:
kernel    cpu family     results       conditions
---------------------------------------------------------------------------------
2.6.38    fam10h         load passed   normal
2.6.38    fam15h revC0   load failed   2.6.38 can not handle 4k patches
3.5.2     fam10h         load passed   normal
3.5.2     fam15h revB2   load passed   loaded 637 then second load 63d
3.5.2     fam15h revC0   load passed   normal
3.5.2     fam15h revC0   load failed   used a corrupted bin file
3.7       fam15h revC0   load passed   loaded 81c then second load 822
3.10      fam15h revC0   load passed   loaded 81c then second load 822
3.11rc7   fam15h revB2   load passed   BIOS loaded 637; test loaded 63d;
                                       sysfs info can be misleading
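To fill in the "conditions" column of a matrix like the one above, a tester might record the patch level each CPU reports before and after a reload. The following is a minimal sketch, assuming the microcode driver exposes a per-CPU microcode/version attribute in sysfs; the function name show_patch_levels and its optional base-directory argument are hypothetical. As the matrix itself notes, sysfs info can be misleading, so the output is only a hint.

```shell
#!/bin/sh
# Print "cpuN <version>" for every CPU that exposes a microcode/version
# attribute under the given base directory.
show_patch_levels() {
    base="${1:-/sys/devices/system/cpu}"
    for f in "$base"/cpu[0-9]*/microcode/version; do
        [ -r "$f" ] || continue
        cpu=${f#"$base"/}      # strip the base directory prefix
        cpu=${cpu%%/*}         # keep only the cpuN component
        printf '%s %s\n' "$cpu" "$(cat "$f")"
    done
}

# Possible usage:
#   show_patch_levels > before.txt
#   ... trigger the reload ...
#   show_patch_levels > after.txt
#   diff before.txt after.txt
```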
* Re: Issues with AMD microcode updates
From: Henrique de Moraes Holschuh @ 2013-09-27 19:36 UTC
To: Sherry Hurwitz
Cc: Borislav Petkov, Jacob Shin, Andreas Herrmann, linux-kernel

On Thu, 26 Sep 2013, Sherry Hurwitz wrote:
> We have failed to reproduce a hang while loading microcode.

I got an offer from a Debian user to test it over the weekend; let's hope
he will have more luck(?) at hitting the issue.  If he does, it should give
us sysrq+t dumps of the hung system.

> We have tested with kernel and AMD family combinations with
> normal and error condition so error paths were taken.  Obviously
> there are factors we are missing that the users are hitting.

Yeah, and it is not likely to be a kernel patch, as the users hit the issue
using non-distro kernels :-(

Maybe it is on the firmware-loader side, but one user did wait 1 hour for
the thing to get unstuck, and that would have taken care of any possible
firmware-loader timeouts.

> Any suggestions on how we improve the test matrix would be
> helpful.  We will continue the investigation but any insights are appreciated.
>
> NOTE: kernels before 3.0 only load 1 (2k) size of microcode patch and
> therefore do not support microcode loading of family 14h, 15h, and 16h.
> Also, in a test request on another thread you suggested someone with
> family 15h revC0 to load microcode twice with an earlier patch and then
> the latest, but there has only been 1 microcode patch level published for revB2
> so that test won't work.

Well, it is the only thing I could think of, other than some nasty race
condition...

> kernel    cpu family     results       conditions
> ---------------------------------------------------------------------------------
> 2.6.38    fam10h         load passed   normal
> 2.6.38    fam15h revC0   load failed   2.6.38 can not handle 4k patches
> 3.5.2     fam10h         load passed   normal
> 3.5.2     fam15h revB2   load passed   loaded 637 then second load 63d
> 3.5.2     fam15h revC0   load passed   normal
> 3.5.2     fam15h revC0   load failed   used a corrupted bin file

I just looked, and the 2.6.38 hang happened on i686 with an unidentified
3-core AMD processor, and the 3.5.2 hang on x86-64 PREEMPT, on a fam15h
model 2 stepping 0, 32-core AMD processor (Linux 3.5.2 (SMP w/32 CPU cores;
PREEMPT)).  No patterns there.

BTW, the userspace script that users reported to have hung is this:

grep -q "^vendor_id[[:blank:]]*:[[:blank:]]*.*AuthenticAMD" /proc/cpuinfo && {
    if modprobe -q --first-time microcode ; then
        echo "Updating microcode on all online processors..." >&2
    else
        # we have to trigger the microcode update manually
        if [ -e /sys/devices/system/cpu/microcode/reload ] ; then
            echo "Updating microcode on all online processors..." >&2
            echo 1 > /sys/devices/system/cpu/microcode/reload || {
                echo "Kernel reported failure while updating microcode!" >&2
            }
        else
            # Try all online processors, broken kernels need this,
            # fixed kernels will accept it only on the BSP and update
            # all processors anyway, and -EINVAL all others... but we
            # don't know which one is the BSP, so we try all of them
            # and hide errors, the kernel will log any real problem.
            echo "Using per-core interface to update microcode on online processors..." >&2
            find /sys/devices/system/cpu -noleaf -type f \
                -path '/sys/devices/system/cpu/cpu*/microcode/reload' | \
              while read i ; do echo -n 1 2>/dev/null >"$i" || true ; done
        fi
    fi
}

With the microcode driver already loaded (so, that modprobe line fails).

-- 
  Henrique Holschuh