From mboxrd@z Thu Jan 1 00:00:00 1970
From: han
Subject: Re: analyze for the P1 bug 593(xensource bug tracker)
Date: Wed, 10 May 2006 21:00:52 +0800
Message-ID: <4461E404.4000607@gmail.com>
References: <0EBFB99D260C5B40AC33E0F807B1AD660E08EB@pdsmsx411.ccr.corp.intel.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=gb18030; format=flowed
Content-Transfer-Encoding: quoted-printable
In-Reply-To: <0EBFB99D260C5B40AC33E0F807B1AD660E08EB@pdsmsx411.ccr.corp.intel.com>
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: Xen-devel@lists.xensource.com
List-Id: xen-devel@lists.xenproject.org

Hi, Keir!

Your patch works quite well. We have created and destroyed the VMX more than 500 times, and everything went OK! I suppose the patch solves the race condition. You could put the fixes for VBD and VNIF together and send them to the mailing list. We can help you test them!

I prefer the wait_event and wake-up approach; it is clearer and more straightforward, just as you said! :-)

BTW: I'm out of the office right now, so I can't send the patch back to you. That is also why I switched to another mailbox to send this mail.

Thanks a lot for your help!

_______________________________________________________
Best Regards,
hanzhu

Han, Zhu wrote:
> Best Regards,
> hanzhu
>
> -----Original Message-----
> From: Han, Zhu
> Sent: May 10, 2006 14:27
> To: Yu, Ke; 'xen-devel@lists.xensource.com'
> Cc: Helix-vmm
> Subject: analyze for the P1 bug 593(xensource bug tracker)
>
> Hi, all!
> Our QA team submitted bug 593 to the xensource bug tracker one month ago, and it was raised to P1 several days ago. So I spent some time tracing this bug this week. Below is what I have found:
> 1) This bug is hard to reproduce on most of the platforms we own, especially the UP boxes.
> The platform on which we hit the bug and can reproduce it reliably is Paxville, which has 4 physical CPUs, with 2 cores and 2 hyperthreads for each CPU.
> 2) The root cause of this problem is that "losetup -d /dev/loop*" can fail with a rather low probability. "losetup -d /dev/loop*" is invoked by /etc/xen/scripts/block when the script processes the remove action. Once all the loop devices have been exhausted, the VMX cannot be initialized properly. That's why XEND complains "Error: Device creation failed for domain ****". However, if we remove the loop device manually, everything goes OK!
> 3) "losetup -d /dev/loop" fails because kernel/drivers/block/loop.c returns EBUSY for the LOOP_CLR_FD ioctl operation. The probable cause is that someone else has not yet closed the loop device when we try to delete it!
> 4) The program that still holds the loop device open could be the VBD device driver. It opens the loop device in vbd_create() through open_by_devnum. It closes the handle for the loop device in vbd_free, which is called by a schedulable work item, free_blkif. Is that true? If so, the problem could arise from a race condition between the work item and the hotplug script! When the xenbus driver is notified that the front-end device has been destroyed by the xenstore thread, it removes the back-end device and related resources, and then notifies the hotplug subsystem of the remove action. Because the code that closes the loop device's handle and the script that deletes the loop device can run concurrently, the script can fail when it tries to delete the loop device!
>
> My questions are:
> 1) Does this possible race condition exist?
> 2) Why was the code that closes the loop device put into a separate work item instead of finishing all the work directly in blkback_remove()? Can any operation in free_blkif() block? Which one?
>
> Since I'm really a newbie in this field, any tips and comments will be appreciated!
> Thanks a lot!
>
> Best Regards,
> hanzhu
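
For readers of the archive, the race described in point 4 and the wait_event / wake-up style of fix can be sketched in user space. This is only a small Python simulation under stated assumptions; it is not kernel code and it is not Keir's patch. The LoopDevice class, remove_backend(), and demo() are hypothetical names invented for illustration. An Event stands in for wait_event()/wake_up(), and a second Event forces the worst-case ordering (hotplug script first, deferred work item second) so the outcome is repeatable rather than timing dependent.

```python
import errno
import threading


class LoopDevice:
    """Hypothetical stand-in for a loop device that tracks its openers."""

    def __init__(self):
        self._lock = threading.Lock()
        self._openers = 0

    def open(self):
        with self._lock:
            self._openers += 1

    def close(self):
        with self._lock:
            self._openers -= 1

    def clear_fd(self):
        # Simplified model of the behaviour reported in point 3: the
        # request is refused with EBUSY while the backend still holds
        # the device open.
        with self._lock:
            if self._openers > 0:
                raise OSError(errno.EBUSY, "loop device still open")


def try_clear(dev):
    """Simulate the hotplug script's 'losetup -d'; return 0 or an errno."""
    try:
        dev.clear_fd()
        return 0
    except OSError as exc:
        return exc.errno


def remove_backend(dev, wait_for_close):
    """Simulate backend removal followed by the hotplug remove action."""
    scheduled = threading.Event()  # models when the work item gets to run
    closed = threading.Event()     # models the wake-up after the close

    def deferred_free():
        # Stands in for the deferred work item that closes the device.
        scheduled.wait()
        dev.close()
        closed.set()               # analogous to wake_up()

    worker = threading.Thread(target=deferred_free)
    worker.start()
    if wait_for_close:
        scheduled.set()
        closed.wait()              # analogous to wait_event()
        result = try_clear(dev)
    else:
        # Worst-case ordering: the script runs before the work item.
        result = try_clear(dev)
        scheduled.set()
    worker.join()
    return result


def demo():
    results = {}
    for wait in (False, True):
        dev = LoopDevice()
        dev.open()                 # the backend opens the device first
        results[wait] = remove_backend(dev, wait)
    return results
```

Calling demo() returns a dict keyed by whether the removal path waited: the unsynchronised case yields errno.EBUSY and the waiting case yields 0. The point of the sketch is only that the remove notification should not be issued until the close has actually happened.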