From mboxrd@z Thu Jan 1 00:00:00 1970
From: John Weekes
Subject: Re: [PATCH] Fix locking bug in vcpu_migrate
Date: Fri, 22 Apr 2011 15:33:26 -0700
Message-ID: <4DB20236.4020604@nuclearfallout.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: Keir Fraser
Cc: "xen-devel@lists.xensource.com", George Dunlap
List-Id: xen-devel@lists.xenproject.org

On 4/22/2011 11:43 AM, Keir Fraser wrote:
> It's odd that it seemed to lead to such a big difference for me, then.
> > I'll do some further tests -- maybe I changed something else to cause
> > the behavior, or the problem is more random than I thought and just
> > hasn't occurred for me yet in all the new tests.

I did further testing and determined that my domU had only appeared to
start properly because I had tested just once or twice with Debian
Squeeze after applying the patch; the more extensive testing I did
afterwards was all under a Win2k3 domU. It seems that Win2k3 domUs don't
have the same issue. Back on the Squeeze domU, I am reliably seeing the
BUG again, with either configuration.

I have rolled back schedule.c to pre-22948, when it was much simpler,
and that seems to have resolved this particular bug.

Now a different credit2 bug has occurred, though only once for me so
far: with the other bug I was seeing a panic every 1 or 2 domU startups,
but I have seen the new bug in one test out of 15. Specifically, I have
triggered the BUG_ON in csched_dom_cntl.
The line number is not the standard one because I have added further
debugging, but the BUG_ON is:

    BUG_ON(svc->rqd != RQD(ops, svc->vcpu->processor));

The backtrace is:

(XEN) [] csched_dom_cntl+0x11a/0x185
(XEN) [] sched_adjust+0x102/0x1f9
(XEN) [] do_domctl+0xb25/0x1250
(XEN) [] syscall_enter+0xc8/0x122

Also, in three of those last 15 startups, my domU froze (consuming no
CPU and seemingly doing nothing) somewhere in the following block of
code in ring_read in tools/firmware/hvmloader/xenbus.c -- I added debug
information that allowed me to narrow it down. This function is being
called while hvmloader is writing the SMBIOS tables. I can't tell
whether this is related to the credit2 problem. (The domU can be
"destroyed" to get out of it.)

    /* Don't overrun the producer pointer */
    while ( (part = MASK_XENSTORE_IDX(rings->rsp_prod -
                                      rings->rsp_cons)) == 0 )
        ring_wait();
    /* Don't overrun the end of the ring */
    if ( part > (XENSTORE_RING_SIZE - MASK_XENSTORE_IDX(rings->rsp_cons)) )
        part = XENSTORE_RING_SIZE - MASK_XENSTORE_IDX(rings->rsp_cons);
    /* Don't read more than we were asked for */
    if ( part > len )
        part = len;

Note that I am using stubdoms.

I would be happy to temporarily turn over the reins on this machine to
you or George, if you'd like to debug any of these issues directly. I
may not be able to continue experimenting here myself in the short term
due to time constraints.

-John