xen-devel.lists.xenproject.org archive mirror
 help / color / mirror / Atom feed
* Re: Basic blktap2 functionality issues.
@ 2012-04-07 15:59 Dr. Greg Wettstein
  2012-04-10 10:04 ` Ian Campbell
  0 siblings, 1 reply; 10+ messages in thread
From: Dr. Greg Wettstein @ 2012-04-07 15:59 UTC (permalink / raw)
  To: Ian Campbell; +Cc: xen-devel@lists.xen.org

On Mar 30, 12:23pm, Ian Campbell wrote:
} Subject: Re: [Xen-devel] Basic blktap2 functionality issues.

Hi, hope the weekend is going well for everyone.

Sorry for the delay in getting back to everyone on this, had a
deadline on another project.

> On Fri, 2012-03-30 at 09:17 +0100, Ian Campbell wrote:
> > I think an approach worth trying would be to have
> > tapdisk_control_detach_vbd respond to TAPDISK_MESSAGE_DETACH before
> > doing the actual detach. i.e. it would respond with "Yes, I will do
> > that" rather than "Yes, I have done that". My speculation is that this
> > will allow libxl to continue and hopefully avoid the deadlock.

> This seems to be the case as the following fixes things for
> me. Thanks very much for your analysis which lead me to this
> solution...

I ported your fix into 4.1.2 but I think we still have a problem, at
least in this codebase.

I no longer see the select timeout delay when xl shuts down but upon
shutdown the minor number is not freed.  A 'tap-ctl list' shows a
steadily increasing set of orphaned minor numbers as VM's are started
up and shutdown.

Are you seeing this in your development codebase?

The culprit is a failed ioctl call for BLKTAP2_IOCTL_FREE_TAP in
tap_ctl_free().  The underlying reason for the ioctl failure is the
check in [linuxsrc]:drivers/block/blktap/ring.c:blktap_ring_destroy()
for whether or not the task_struct pointer in the blktap_ring
structure has been NULLed.

Which certainly makes sense since there is a race between xl's call to
tap_ctl_free() and tapdisk2 getting to the point where it can close
its descriptor to the blktap instance and thus invoke the .release
method which translates into a call to blktap_ring_release() which
NULL's the task_struct pointer.

If you are not seeing the orphan minor numbers there must be ordering
changes in the unstable version of xl which eliminate or alters the
race timing.

For the sake of completeness of information for this thread I captured
the following stack trace of a tapdisk2 when it is deadlocked against
xl:

---------------------------------------------------------------------------
Apr  1 07:03:45 hooter kernel: Call Trace:
Apr  1 07:03:45 hooter kernel:  [<c10aa791>] ? blkdev_get_blocks+0xb4/0xb4
Apr  1 07:03:45 hooter kernel:  [<c109ab11>] ? iput+0x28/0x143
Apr  1 07:03:45 hooter kernel:  [<c10e9fec>] ? blk_peek_request+0x155/0x165
Apr  1 07:03:45 hooter kernel:  [<c128279c>] schedule+0x4d/0x4f
Apr  1 07:03:45 hooter kernel:  [<f7a9053b>] blktap_device_destroy_sync+0x63/0x76 [blktap]
Apr  1 07:03:45 hooter kernel:  [<c104204a>] ? wake_up_bit+0x61/0x61
Apr  1 07:03:45 hooter kernel:  [<f7a8f57a>] blktap_ring_release+0xe/0x29 [blktap]
Apr  1 07:03:45 hooter kernel:  [<c108aba7>] fput+0xce/0x167
Apr  1 07:03:45 hooter kernel:  [<c107944a>] remove_vma+0x28/0x47
Apr  1 07:03:45 hooter kernel:  [<c107a19a>] do_munmap+0x1e8/0x204
Apr  1 07:03:45 hooter kernel:  [<c107a1de>] sys_munmap+0x28/0x37
Apr  1 07:03:45 hooter kernel:  [<c1284adc>] sysenter_do_call+0x12/0x2c
Apr  1 07:03:45 hooter kernel:  [<c1280000>] ? migration_call+0x1d9/0x1f2
Apr  1 07:03:45 hooter kernel:  c686dad0 00000286 c1045bcd e86581c0 e84f6380 c13d1c80 c13d1c80 d753476c
Apr  1 07:03:45 hooter kernel:  e9185b00 e9185c78 00000000 d75346cb 00002ad2 d75346cb e8515ae4 e8515ae4
Apr  1 07:03:45 hooter kernel:  c1005597 00000000 ed7d80c8 c686dae4 c686da98 c1005d10 ed7d02c0 00000000
---------------------------------------------------------------------------

Let me know if you are seeing the issues I'm seeing, in the meantime I
will keep hunting to see if I can rundown the ultimate cause of the
deadlock.  Given the above trace it has to be an issue with xl
orchestrating the release of resources which reference the tapdev
block device.

> Ian.

Will look forward to your thoughts.

Have a good weekend.

}-- End of excerpt from Ian Campbell

As always,
Dr. G.W. Wettstein, Ph.D.   Enjellic Systems Development, LLC.
4206 N. 19th Ave.           Specializing in information infra-structure
Fargo, ND  58102            development.
PH: 701-281-1686
FAX: 701-281-3949           EMAIL: greg@enjellic.com
------------------------------------------------------------------------------
"According to the philosopher Ly Tin Wheedle, chaos is found in greatest
 abundance wherever order is being sought.  It always defeats order,
 because it is better organized."
                                -- Terry Pratchett
                                   Interesting Times

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: Basic blktap2 functionality issues.
@ 2012-04-17 15:55 Dr. Greg Wettstein
  0 siblings, 0 replies; 10+ messages in thread
From: Dr. Greg Wettstein @ 2012-04-17 15:55 UTC (permalink / raw)
  To: Ian Jackson, Ian Campbell; +Cc: greg@enjellic.com, xen-devel@lists.xen.org

On Apr 2,  4:51pm, Ian Jackson wrote:
} Subject: Re: [Xen-devel] Basic blktap2 functionality issues.

Hi Ian, et. al, hope this note finds your day going well.

> Ian Campbell writes ("Re: [Xen-devel] Basic blktap2 functionality issues."):
> > On Fri, 2012-03-30 at 09:17 +0100, Ian Campbell wrote:
> > > I think an approach worth trying would be to have
> > > tapdisk_control_detach_vbd respond to TAPDISK_MESSAGE_DETACH before
> > > doing the actual detach. i.e. it would respond with "Yes, I will do
> > > that" rather than "Yes, I have done that". My speculation is that this
> > > will allow libxl to continue and hopefully avoid the deadlock.
> > 
> > This seems to be the case as the following fixes things for me. Thanks
> > very much for your analysis which lead me to this solution...

> Greg, can you confirm whether this works for you ?

I've been grinding through the blktap2 issues and came up with some
additional findings I wanted to bounce off everyone.

The ability to reproduce this problem with the tap-ctl utility helped
confirm what the problem 'smelled like' from the beginning, ie. a hung
reference to the tapdev device instance which is ultimately preventing
the .release method from completing on the tapdisk2 side of things.

I started snooping around in xenstore and I believe I am seeing an
issue where xl does not release the the vbd instance in dom0.  This
causes a steady and persistent increase in the number of 'stale' vbd
instances held in dom0.  At a minimum this would suggest that xl is
not properly releasing resources properly which may help track the
whole issue down.

I replaced the tap: directive with a phy: device instance and saw the
orphan vbd instance as well so the problem doesn't seem related to a
blktap driver instance.

This is with the original patch applied which 'fixed' the problem with
the tapdisk2 process being left alive.  I'm going to back out that
patch to see if it reproduces with stock 4.1.2.

In the meantime let me know if anyone else is seeing this or if anyone
remembers a generic problem like this being fixed in xl in unstable.

> Ian.

Have a good afternoon.

}-- End of excerpt from Ian Jackson

As always,
Dr. G.W. Wettstein, Ph.D.   Enjellic Systems Development, LLC.
4206 N. 19th Ave.           Specializing in information infra-structure
Fargo, ND  58102            development.
PH: 701-281-1686
FAX: 701-281-3949           EMAIL: greg@enjellic.com
------------------------------------------------------------------------------
"We have more to fear from the bungling of the incompetent than from
 the machinations of the wicked."
                                -- Slashdot

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: Basic blktap2 functionality issues.
@ 2012-04-11 15:13 Dr. Greg Wettstein
  0 siblings, 0 replies; 10+ messages in thread
From: Dr. Greg Wettstein @ 2012-04-11 15:13 UTC (permalink / raw)
  To: Ian Campbell; +Cc: xen-devel@lists.xen.org

On Apr 10, 10:04am, Ian Campbell wrote:
} Subject: Re: [Xen-devel] Basic blktap2 functionality issues.

> On Sat, 2012-04-07 at 16:59 +0100, Dr. Greg Wettstein wrote:
> > On Mar 30, 12:23pm, Ian Campbell wrote:
> > } Subject: Re: [Xen-devel] Basic blktap2 functionality issues.

> > > On Fri, 2012-03-30 at 09:17 +0100, Ian Campbell wrote:
> > > > I think an approach worth trying would be to have
> > > > tapdisk_control_detach_vbd respond to TAPDISK_MESSAGE_DETACH before
> > > > doing the actual detach. i.e. it would respond with "Yes, I will do
> > > > that" rather than "Yes, I have done that". My speculation is that this
> > > > will allow libxl to continue and hopefully avoid the deadlock.
> > 
> > > This seems to be the case as the following fixes things for
> > > me. Thanks very much for your analysis which lead me to this
> > > solution...
> > 
> > I ported your fix into 4.1.2 but I think we still have a problem, at
> > least in this codebase.
> > 
> > I no longer see the select timeout delay when xl shuts down but upon
> > shutdown the minor number is not freed.  A 'tap-ctl list' shows a
> > steadily increasing set of orphaned minor numbers as VM's are started
> > up and shutdown.
> > 
> > Are you seeing this in your development codebase?

> It turns out that I am, yes. e.g. after starting/stopping a guest 3
> times:
> # tap-ctl list
>        -  0    -          - -
>        -  1    -          - -
>        -  2    -          - -

Yes, that is the same thing I'm seeing and would be expected.

> > The culprit is a failed ioctl call for BLKTAP2_IOCTL_FREE_TAP in
> > tap_ctl_free().  The underlying reason for the ioctl failure is the
> > check in [linuxsrc]:drivers/block/blktap/ring.c:blktap_ring_destroy()

> drivers/*xen*/... right? Or do you have a different blktap to me?

I believe the 'drivers/*xen*' was an old location.

You may have seen the note I sent to Tim Wood yesterday.  We isolated
the blktap2 code from the last GIT tree which Dan Stodden had
available and are carrying it as a standalone patch.  The version we
are testing against is available from the following location:

ftp://ftp.enjellic.com/pub/xen/blktap2-3.2.patch

Dan had the code in the following location:

	drivers/block/blktap

> > for whether or not the task_struct pointer in the blktap_ring
> > structure has been NULLed.
> > 
> > Which certainly makes sense since there is a race between xl's call to
> > tap_ctl_free() and tapdisk2 getting to the point where it can close
> > its descriptor to the blktap instance and thus invoke the .release
> > method which translates into a call to blktap_ring_release() which
> > NULL's the task_struct pointer.

> Sounds right, but then you would expect both the ioctl and release
> path to cleanup, depending on who loses the race?

The problem is the ioctl path can't clean up properly since it is
blocked from doing so by the check for a valid task_struct pointer.
The tapdisk2 side cleans up properly but with the 'acknowledge the
detach and then detach patch' the xl side is always guaranteed to win
the race.

So the patch moved the failure one ladder step downwards in:

[xensrc]/tools/blktap2/control/tap-ctl-destroy.c:tap_control_destroy()

But the effective result is the same, modulo the elimination of the
'hang' while the select call times out.

> There also seems to be a BLKTAP_SHUTDOWN_REQUESTED bit which looks like
> it is involved somehow...
>
> We've gotten way beyond my understanding of blktap internals though.

That bit is set by the kernel kobject infrastructure as part of the
teardown of the block device and I believe is implicated to the extent
that the xl<->tapdisk2 deadlock is caused by a reference still being
held to the block device when the the detach message is sent to
tapdisk2.

I can now reproduce the deadlock with a tap-ctl recipe which should
help in chasing things down.  CAUTION: *** the following will livelock
your kernel, you will need to have a shell available to force an
unmount of the tapdev device. ***

	1.) Create aio based device with tap-ctl.

	2.) Mount tapdev device.

	3.) Close device with tap-ctl.

	4.) Send detach message with tap-ctl.

	* Livelock *

	5.) Kill tapdisk2 process.

	6.) Unmount tapdev - kernel issues WARNING from fs/buffer.c.

Having this recipe should at least help run down which lock we are
wedging on.

> > Let me know if you are seeing the issues I'm seeing, in the meantime I
> > will keep hunting to see if I can rundown the ultimate cause of the
> > deadlock.  Given the above trace it has to be an issue with xl
> > orchestrating the release of resources which reference the tapdev
> > block device.

> I'm not convinced that the shutdown stuff on the kernel side isn't
> either horribly broken or at best fragile (perhaps xl just tickles
> it differently to xend).

I don't believe its fragile as much as the classic problem of having
the kernel dependent on userspace.

I think the underlying problem is the device release order between
xend and xl.  I believe xend is releasing all references to the tapdev
device before invoking userspace cleanup.

> Please do keep hunting and let us know what you find...

I will wade a bit deeper into xl to see if I can isolate the ordering
problem.

> Thanks,
> Ian

Thanks for the verification from your end on the problem.

Have a good day.

Greg

}-- End of excerpt from Ian Campbell

As always,
Dr. G.W. Wettstein, Ph.D.   Enjellic Systems Development, LLC.
4206 N. 19th Ave.           Specializing in information infra-structure
Fargo, ND  58102            development.
PH: 701-281-1686
FAX: 701-281-3949           EMAIL: greg@enjellic.com
------------------------------------------------------------------------------
"Inspite of all evidence to the contrary, the entire universe is
 composed of only two basic substances: magic and bullshit."
                                -- Ian Macdonald

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Basic blktap2 functionality issues.
@ 2012-03-29 14:39 greg
  2012-03-29 17:35 ` Stefano Stabellini
  2012-03-30  8:17 ` Ian Campbell
  0 siblings, 2 replies; 10+ messages in thread
From: greg @ 2012-03-29 14:39 UTC (permalink / raw)
  To: xen-devel

Hi, hope the day is going well for everyone.

I had posted a note about this issue two weeks ago and didn't get any
response.  I don't know if that indicates that people are not using
blktap or if the question stumped everyone.

First of all there is a documentable issue with blktap in 4.1.2 and
xl.  Anyone trying to use it to export files to a guest will be
affected.  Using xl to do 'block-attach' and 'block-detach' in dom0
also doesn't work in stock 4.1.2 so I suspect there are generic issues
with blktap and xl.

In stock 4.1.2 using xl results in the tapdisk2 server process being
orphaned.  There is a patch floating around from Ian Campbell which
fixes this problem and either creates or uncovers what may be a more
fundamental problem.

With Ian's patch applied the tapdisk2 process terminates but the
tapdisk device is not released resulting in a steady accumulation of
orphan minor numbers.  The underlying cause of this appears to be a
resource deadlock between xl requesting a detach of the VBD and the
tapdisk2 process.

Ian actually acknowledges the phenomenon in one of his posts saying
there appeared to be a 'hang' when the VBD is released with his
patches.  The 'hang' is actually a livelock where the entire kernel
blocks until the select call from xl to the tapdisk2 process times out
and rescues things.  The error return from the select call is what
causes the xl code to not complete the release of the tapdisk minor.

I'm still hunting through the code maze trying to find the underlying
problem.  The tapdisk2 process is hanging on the munmap of the ring
buffers.  It 'feels' as if the problem may be secondary to xl still
holding a reference to some resource which blocks the unmap of the
ring buffers.

I will continue to hunt but if anyone has any pointers send them my
way.  It takes a bit of time to come up to speed on all the code
paths.

In a broader context I think the XEN community needs to take a
reasoned look at how to handle the file issue.  It appears as if Dan
Stodden has disappeared so I get the sense the entire blktap2
architecture is a bit rudderless (no judgement just observation).  I'm
trying to put my money where my mouth is and actually hunt for the
problems which are there.

I've seen rumbles on LKML about discussions with regards to
modifications to loop in order to deal with the page cache issues
which hinder the reliability of delivering virtual block devices over
loop.  I don't know how far out that is or even if it will eventuate.
I've also heard rumbles about a mythical 'blktap3' which runs
completely in userspace.  If that is the direction things are going I
would certainly be willing to hammer on that rather then put more time
into blktap2 if there is 'blktap3' code someplace.

I will look forward to any comments or suggestions people may have.

Have a good weekend.

As always,
Dr. G.W. Wettstein, Ph.D.   Enjellic Systems Development, LLC.
4206 N. 19th Ave.           Specializing in information infra-structure
Fargo, ND  58102            development.
PH: 701-281-1686
FAX: 701-281-3949           EMAIL: greg@enjellic.com
------------------------------------------------------------------------------
"My thoughts on trusting Open-Source?  A quote I once saw said it
 best: 'Remember, Amateurs built the ark.  Professionals built the
 Titanic.'  Perhaps most significantly the ark was one guy, there were
 no doubt committees involved with the Titanic project."
                                -- Dr. G.W. Wettstein
                                   Resurrection

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2012-04-17 15:55 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-04-07 15:59 Basic blktap2 functionality issues Dr. Greg Wettstein
2012-04-10 10:04 ` Ian Campbell
  -- strict thread matches above, loose matches on Subject: below --
2012-04-17 15:55 Dr. Greg Wettstein
2012-04-11 15:13 Dr. Greg Wettstein
2012-03-29 14:39 greg
2012-03-29 17:35 ` Stefano Stabellini
2012-03-30  8:02   ` Ian Campbell
2012-03-30  8:17 ` Ian Campbell
2012-03-30 11:23   ` Ian Campbell
2012-04-02 15:51     ` Ian Jackson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).