From mboxrd@z Thu Jan  1 00:00:00 1970
From: Brad Campbell <lists2009@fnarfbargle.com>
Subject: Re: XP machine freeze
Date: Mon, 20 Apr 2015 00:50:40 +0800
Message-ID: <5533DCE0.4080606@fnarfbargle.com>
References: <009701d05ffb$5e37a740$1aa6f5c0$@astim.si> <550EE047.3030605@fnarfbargle.com> <5519BBF4.7080600@redhat.com> <552B40F7.5080107@fnarfbargle.com> <552BB8D5.7060200@redhat.com> <552BBA87.50109@fnarfbargle.com> <552BCC60.1000103@redhat.com> <5533C95E.5030707@fnarfbargle.com> <45D7B761-5F3B-4A1B-8057-6C77693A308B@gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8;
	format=flowed
Content-Transfer-Encoding: QUOTED-PRINTABLE
Cc: Saso Slavicic <saso.linux@astim.si>,
	kvm list <kvm@vger.kernel.org>,
	Radim Krčmář <rkrcmar@redhat.com>
To: Nadav Amit <nadav.amit@gmail.com>,
	Paolo Bonzini <pbonzini@redhat.com>
Return-path: <kvm-owner@vger.kernel.org>
Received: from [103.4.17.7] ([103.4.17.7]:41700 "EHLO ns3.fnarfbargle.com"
	rhost-flags-FAIL-FAIL-OK-OK) by vger.kernel.org with ESMTP
	id S1751623AbbDSQvF (ORCPT <rfc822;kvm@vger.kernel.org>);
	Sun, 19 Apr 2015 12:51:05 -0400
In-Reply-To: <45D7B761-5F3B-4A1B-8057-6C77693A308B@gmail.com>
Sender: kvm-owner@vger.kernel.org
List-ID: <kvm.vger.kernel.org>


On 19/04/15 23:48, Nadav Amit wrote:
> Brad Campbell <lists2009@fnarfbargle.com> wrote:
>
>> On 13/04/15 22:02, Paolo Bonzini wrote:
>>> On 13/04/2015 14:45, Brad Campbell wrote:
>>>> G'day Paolo,
>>>>
>>>> Yes, on AMD and I've tried hard to reproduce it on Intel and been unable
>>>> to thus far.
>>>>
>>>> Now you mention it may be AMD specific, I have a spare motherboard and
>>>> processor sitting in a drawer. I'll bolt it together tomorrow and see if
>>>> I can reproduce it on another AMD machine. Two machines should let me
>>>> test it twice as fast.
>>>>
>>>> I got a fail this afternoon, so I'm due to reboot tonight. I'll just
>>>> revert that one suspect commit from a known bad kernel and see if that
>>>> cleans it up. If not then I'll work through the remainder of the
>>>> information in your mail. I really appreciate the attention you've paid
>>>> to this, it has been a frustrating bug for me because I'm in a position
>>>> of not knowing what I don't know, and obviously doing something wrong in
>>>> very long bisection processes.
>>> Actually, if you have time to change your course of action, please
>>> revert the one that Nadav pointed out (f210f7572bed, KVM: x86:
>>> Fix lost interrupt on irr_pending race) or cherry-pick it on top of 3.17.
>>>
>>> Paolo
>> Ok, I think we have a winner. Patch manually plopped on top of vanilla 3.17. It has never gone for anywhere near this long on a bad kernel.
>>
>> brad@srv:~$ uptime
>> 23:24:48 up 6 days,  1:01,  3 users,  load average: 1.48, 1.95, 2.48
>>
>> So this patch went into the kernel during the 3.19 release cycle? Affected kernels 3.16-3.18?
> Actually, the original bug seemed to be introduced by commit
> 33e4c68656a2e461b296ce714ec322978de85412 "KVM: Optimize searching for
> highest IRR". So the bug goes all the way back to 2.6.32. The race that this
> patch fixes just became more apparent (i.e., likely to happen) on 3.16. It
> is fixed in 3.19.

And I can confidently state that over the years I've seen this happen a
number of times, but in each case I was using qemu with an SDL console
as a user-interactive VM, and moving the mouse would restore network
connectivity. It was obviously seriously exacerbated by something that
went into 3.16.

I really appreciate the assistance in pinning this down. At the next
excuse for a reboot I'll upgrade the server to a 3.19.x kernel and call
it done.

Regards,
Brad

-- 
Dolphins are so intelligent that within a few weeks they can
train Americans to stand at the edge of the pool and throw them
fish.