From mboxrd@z Thu Jan 1 00:00:00 1970
From: Andreas Kinzler
Subject: Re: State of GPLPV tests - 28.11.11
Date: Tue, 29 Nov 2011 18:05:04 +0100
Message-ID: <4ED510C0.8000202@hfp.de>
References: <4ED39164.5040203@hfp.de>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; Format="flowed"
Content-Transfer-Encoding: 7bit
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: James Harper
Cc: xen-devel@lists.xensource.com
List-Id: xen-devel@lists.xenproject.org

On 29.11.2011 00:16, James Harper wrote:
>> I am still running tests 7 days a week on two test systems. Results are quite
>> discouraging though. After experiencing crash after crash I wanted to test if
>> the configuration I called "stable" (Xen 4.0.1, GPLPV 0.11.0.213, dom0 kernel
>> 2.6.32.18-pvops0-ak3) was stable indeed. But even that config crashed when
>> running my torture test. It is stable on our production systems - running
>> other workloads of course.
> What crash are you getting these days? Is it the same one as you used to
> get?

Yes, still exactly the same crashes. Good news: I think I have found the
bug. Since I am not really a Xen or Windows kernel developer, I cannot say
for sure, but here is what I found:

When the domU hung, I ran xentop and found that the number of vbd read
requests was a number like 0x7FFFzzzz in hex, which led me to a hypothesis:
GPLPV crashes as soon as the number of disk requests reaches 2^32. On my
hardware with 5000 IOPS this is reached in

  2^32 / 5000 IOPS / 3600 sec-per-hour / 24 hours-per-day = 9.94 days

And there we go: these are the 9-10 days I was always seeing.

I studied the source code of blkback/blktap/aio and found nothing.
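The 9.94-day estimate above can be checked with a few lines of C. This is only a sketch of the arithmetic: the function name days_until_wrap and its parameters are made up for illustration, and it assumes a constant request rate, which a real workload will not have.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: days until a free-running counter of the given
 * bit width wraps around, assuming a constant request rate. */
double days_until_wrap(unsigned counter_bits, double requests_per_sec)
{
    /* The shift below is only defined for widths under 64 bits. */
    assert(counter_bits < 64);

    double total_requests = (double)(UINT64_C(1) << counter_bits);
    double seconds = total_requests / requests_per_sec;

    return seconds / 3600.0 / 24.0; /* seconds -> hours -> days */
}
```

With counter_bits = 32 and requests_per_sec = 5000.0 this gives roughly 9.94 days, matching the figure above.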
But in GPLPV and its use of the ring macros I found suspicious code in
every version of GPLPV I have ever used:

  while (more_to_do)
  {
    rp = xvdd->ring.sring->rsp_prod;
    KeMemoryBarrier();
    for (i = xvdd->ring.rsp_cons; i < rp; i++)
    {
      rep = XenVbd_GetResponse(xvdd, i);

If rp is now, for example, 10 and xvdd->ring.rsp_cons is 0xFFFFFFF7, then
the condition i < rp is false from the start, so the for loop is skipped,
responses are not delivered, and we see the hang.

Regards
Andreas
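The comparison problem described above can be reproduced in isolation. The following is a minimal sketch, not GPLPV code: it assumes the ring indices are free-running 32-bit unsigned counters, and the names responses_seen_direct, responses_seen_wrapsafe and SKETCH_RING_SIZE are hypothetical. The second function shows one common wrap-safe idiom (loop until the index equals the producer index); whether that is the right fix for GPLPV is for the driver maintainers to judge.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical ring capacity, used only for the sanity check below. */
#define SKETCH_RING_SIZE 32u

/* Same loop shape as the excerpt: the index is compared directly
 * against the producer index, so a wrapped producer index looks
 * "smaller" and the loop body never runs. */
uint32_t responses_seen_direct(uint32_t rsp_cons, uint32_t rsp_prod)
{
    uint32_t seen = 0;
    for (uint32_t i = rsp_cons; i < rsp_prod; i++)
        seen++;
    return seen;
}

/* Wrap-safe variant: iterate until the index equals the producer
 * index. Unsigned overflow is well defined in C, so i++ simply
 * wraps from 0xFFFFFFFF to 0 and the loop continues. */
uint32_t responses_seen_wrapsafe(uint32_t rsp_cons, uint32_t rsp_prod)
{
    /* Outstanding responses, computed modulo 2^32, must fit the ring. */
    assert((uint32_t)(rsp_prod - rsp_cons) <= SKETCH_RING_SIZE);

    uint32_t seen = 0;
    for (uint32_t i = rsp_cons; i != rsp_prod; i++)
        seen++;
    return seen;
}
```

For rsp_cons = 0xFFFFFFF7 and rsp_prod = 10, responses_seen_direct returns 0 while responses_seen_wrapsafe returns 19, which is the number of responses actually outstanding (9 before the wrap plus 10 after it). For indices that have not wrapped, both functions agree.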