From mboxrd@z Thu Jan 1 00:00:00 1970
From: Brad Campbell
Subject: Re: XP machine freeze
Date: Mon, 20 Apr 2015 00:50:40 +0800
Message-ID: <5533DCE0.4080606@fnarfbargle.com>
References: <009701d05ffb$5e37a740$1aa6f5c0$@astim.si>
 <550EE047.3030605@fnarfbargle.com>
 <5519BBF4.7080600@redhat.com>
 <552B40F7.5080107@fnarfbargle.com>
 <552BB8D5.7060200@redhat.com>
 <552BBA87.50109@fnarfbargle.com>
 <552BCC60.1000103@redhat.com>
 <5533C95E.5030707@fnarfbargle.com>
 <45D7B761-5F3B-4A1B-8057-6C77693A308B@gmail.com>
In-Reply-To: <45D7B761-5F3B-4A1B-8057-6C77693A308B@gmail.com>
To: Nadav Amit, Paolo Bonzini
Cc: Saso Slavicic, kvm list, Radim Krčmář
Sender: kvm-owner@vger.kernel.org

On 19/04/15 23:48, Nadav Amit wrote:
> Brad Campbell wrote:
>
>> On 13/04/15 22:02, Paolo Bonzini wrote:
>>> On 13/04/2015 14:45, Brad Campbell wrote:
>>>> G'day Paolo,
>>>>
>>>> Yes, on AMD and I've tried hard to reproduce it on Intel and been unable
>>>> to thus far.
>>>>
>>>> Now you mention it may be AMD specific, I have a spare motherboard and
>>>> processor sitting in a drawer. I'll bolt it together tomorrow and see if
>>>> I can reproduce it on another AMD machine. Two machines should let me
>>>> test it twice as fast.
>>>>
>>>> I got a fail this afternoon, so I'm due to reboot tonight. I'll just
>>>> revert that one suspect commit from a known bad kernel and see if that
>>>> cleans it up. If not then I'll work through the remainder of the
>>>> information in your mail. I really appreciate the attention you've paid
>>>> to this, it has been a frustrating bug for me because I'm in a position
>>>> of not knowing what I don't know, and obviously doing something wrong in
>>>> very long bisection processes.
>>> Actually, if you have time to change your course of action, please
>>> revert the one that Nadav pointed out (f210f7572bed, KVM: x86:
>>> Fix lost interrupt on irr_pending race) or cherry-pick it on top of 3.17.
>>>
>>> Paolo
>> Ok, I think we have a winner. Patch manually plopped on top of vanilla 3.17. It has never gone for anywhere near this long on a bad kernel.
>>
>> brad@srv:~$ uptime
>> 23:24:48 up 6 days, 1:01, 3 users, load average: 1.48, 1.95, 2.48
>>
>> So this patch went into the kernel during the 3.19 release cycle? Affected kernels 3.16-3.18?
> Actually, the original bug seemed to be introduced by commit
> 33e4c68656a2e461b296ce714ec322978de85412 "KVM: Optimize searching for
> highest IRR". So the bug goes all the way back to 2.6.32. The race that this
> patch fixes just became more apparent (i.e., likely to happen) on 3.16. It
> is fixed in 3.19.

And I can confidently state that over the years I've seen this happen a
number of times, but in each case I was using qemu with an SDL console
as a user-interactive VM, and moving the mouse would restore network
connectivity. It was obviously seriously exacerbated by something that
went into 3.16.

I really appreciate the assistance in pinning this down. At the next
excuse for a reboot I'll upgrade the server to a 3.19.x kernel and call
it done.

Regards,
Brad
-- 
Dolphins are so intelligent that within a few weeks they can
train Americans to stand at the edge of the pool and throw them
fish.