From mboxrd@z Thu Jan  1 00:00:00 1970
From: han <vanbas.han@gmail.com>
Subject: Re:  analyze for the P1 bug 593(xensource bug tracker)
Date: Wed, 10 May 2006 21:00:52 +0800
Message-ID: <4461E404.4000607@gmail.com>
References: <0EBFB99D260C5B40AC33E0F807B1AD660E08EB@pdsmsx411.ccr.corp.intel.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=gb18030; format=flowed
Content-Transfer-Encoding: quoted-printable
Return-path: <xen-devel-bounces@lists.xensource.com>
In-Reply-To: <0EBFB99D260C5B40AC33E0F807B1AD660E08EB@pdsmsx411.ccr.corp.intel.com>
List-Unsubscribe: <http://lists.xensource.com/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xensource.com>
List-Help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-Subscribe: <http://lists.xensource.com/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: Xen-devel@lists.xensource.com
List-Id: xen-devel@lists.xenproject.org

Hi, Keir!

Your patch works quite well. We have created and destroyed the VMX
domain more than 500 times, and everything went OK! I believe the patch
solves the race condition. You could put the correctness fixes for VBD
and VNIF together and send them to the mailing list; we can help you
test them!
I prefer the wait_event and wakeup approach; it is clearer and more
straightforward, just as you said! :-)
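
For readers following the thread, here is a minimal user-space sketch of
the wait/wakeup idea mentioned above. It is only an analogy written in
Python with threading.Condition, not Keir's patch and not kernel code;
the class and method names (Backend, deferred_free, remove) are
hypothetical and chosen purely for illustration.

```python
import threading


class Backend:
    """Hypothetical stand-in for a block backend that holds a device open."""

    def __init__(self):
        self._cond = threading.Condition()
        self.device_open = True  # analogue of the handle taken at create time

    def deferred_free(self):
        # Analogue of the deferred work item: close the handle, then wake
        # anyone waiting for the close to complete.
        with self._cond:
            self.device_open = False
            self._cond.notify_all()

    def remove(self, timeout=5.0):
        # Analogue of the remove path: wait until the handle is closed
        # before letting the hotplug "remove" action proceed.
        with self._cond:
            return self._cond.wait_for(lambda: not self.device_open,
                                       timeout=timeout)


def demo():
    backend = Backend()
    events = []
    # Run the deferred close slightly later, on another thread.
    worker = threading.Timer(0.05, backend.deferred_free)
    worker.start()
    closed = backend.remove()
    events.append("hotplug remove" if closed else "timed out")
    worker.join()
    return closed, backend.device_open, events
```

In this sketch the remove path cannot signal the hotplug step until the
deferred close has run, which is the ordering the wait/wakeup approach
is meant to guarantee.
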
BTW: I'm out of the office right now, so I can't send the patch back to
you. That's also why I switched to another mailbox to send this mail.

Thanks a lot for your help!

_______________________________________________________
Best Regards,
hanzhu


Han, Zhu wrote:
> Best Regards,
> hanzhu
>
> -----Original Message-----
> From: Han, Zhu
> Sent: May 10, 2006 14:27
> To: Yu, Ke; 'xen-devel@lists.xensource.com'
> Cc: Helix-vmm
> Subject: analyze for the P1 bug 593(xensource bug tracker)
>
> Hi, all!
> Our QA team submitted bug 593 to the xensource bug tracker one month
> ago, and it was raised to P1 several days ago, so I spent some time
> tracing this bug this week. Here is what I have found:
> 1) This bug is hard to reproduce on most of the platforms we own,
> especially the UP boxes. The platform on which we hit the bug and can
> reproduce it reliably is Paxville, which has 4 physical CPUs, with 2
> cores and 2 hyperthreads for each CPU.
> 2) The root cause of this problem is that "losetup -d /dev/loop*" can
> fail with a rather low probability. "losetup -d /dev/loop*" is invoked
> by /etc/xen/scripts/block when the script processes the remove action.
> Once we have exhausted all the loop devices, the VMX domain cannot be
> initialized properly. That's why XEND complains "Error: Device
> creation failed for domain ****". However, if we remove the loop
> device manually, everything goes OK!
> 3) "losetup -d /dev/loop" fails because kernel/drivers/block/loop.c
> returns EBUSY for the LOOP_CLR_FD ioctl operation. The probable cause
> is that someone else has not closed the loop device by the time we try
> to delete it!
> 4) The code that holds the loop device open could be the VBD device
> driver. It opens the loop device in vbd_create() through
> open_by_devnum. It closes the handle for the loop device in vbd_free,
> which is called by a schedulable work item, free_blkif. Is that true?
> If so, the problem could arise from a race condition between the work
> item and the hotplug script! When the xenbus driver is notified that
> the front end device has been destroyed by the xenstore thread, it
> removes the backend device and related resources, and then notifies
> the hotplug subsystem of the remove action. Because the code that
> closes the loop device's handle and the script that deletes the loop
> device can run concurrently, the script can fail when it tries to
> delete the loop device!
>
> My questions are:
> 1) Does this possible race condition exist?
> 2) Why is the code that closes the loop device put into a separate,
> deferred work item instead of finishing all the work directly in
> blkback_remove()? Could any operation in free_blkif() block? Which
> one?
>
> Since I'm really a newbie in this field, any tips and comments will be
> appreciated!
> Thanks a lot!
>
>
>
> Best Regards,
> hanzhu
>
>
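
To illustrate the race described in point 4 of the quoted message, here
is a toy Python model. It assumes, as a simplification, that detaching a
loop device fails with EBUSY whenever anyone other than the caller still
has it open. LoopDevice, run, and the step names are hypothetical and
exist only in this sketch; nothing here is taken from loop.c or from the
hotplug script.

```python
import errno


class LoopDevice:
    """Toy model: detaching fails with EBUSY while another opener exists."""

    def __init__(self):
        self.openers = 0
        self.bound = True

    def open(self):
        self.openers += 1

    def close(self):
        self.openers -= 1

    def clr_fd(self):
        # The caller holds one open handle itself; any additional opener
        # makes the device busy (assumption of this model).
        if self.openers > 1:
            raise OSError(errno.EBUSY, "device busy")
        self.bound = False


def run(order):
    """Replay the two racing steps in the given order; report the outcome."""
    dev = LoopDevice()
    dev.open()  # analogue of the backend opening the device at create time
    result = None
    for step in order:
        if step == "backend_close":
            dev.close()  # analogue of the deferred close in the work item
        elif step == "script_detach":
            dev.open()  # the detaching tool opens the device itself
            try:
                dev.clr_fd()
                result = "detached"
            except OSError as exc:
                result = errno.errorcode[exc.errno]
            finally:
                dev.close()
    return result
```

Under this model, run(["backend_close", "script_detach"]) detaches the
device, while run(["script_detach", "backend_close"]) reports EBUSY and
leaves the device bound, which matches the leaked loop devices described
in point 2.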