* [Qemu-devel] 3.1: second invocation of migrate crashes qemu
@ 2019-01-12 17:11 Michael Tokarev
  2019-01-14 10:51 ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 8+ messages in thread

From: Michael Tokarev @ 2019-01-12 17:11 UTC
To: qemu-devel

$ qemu-system-x86_64 -monitor stdio -hda foo.img
QEMU 3.1.0 monitor - type 'help' for more information
(qemu) stop
(qemu) migrate "exec:cat >/dev/null"
(qemu) migrate "exec:cat >/dev/null"
qemu-system-x86_64: /build/qemu/qemu-3.1/block.c:4647: bdrv_inactivate_recurse: Assertion `!(bs->open_flags & BDRV_O_INACTIVE)' failed.
Aborted

(It is irrelevant what's in foo.img; it only needs to be initialized.)

If it is worth bisecting, I'll do that tomorrow.

Thanks,

/mjt
* Re: [Qemu-devel] 3.1: second invocation of migrate crashes qemu
  2019-01-12 17:11 [Qemu-devel] 3.1: second invocation of migrate crashes qemu Michael Tokarev
@ 2019-01-14 10:51 ` Dr. David Alan Gilbert
  2019-01-14 11:52 ` Kevin Wolf
  0 siblings, 1 reply; 8+ messages in thread

From: Dr. David Alan Gilbert @ 2019-01-14 10:51 UTC
To: Michael Tokarev, quintela, kwolf; +Cc: qemu-devel

* Michael Tokarev (mjt@tls.msk.ru) wrote:
> $ qemu-system-x86_64 -monitor stdio -hda foo.img
> QEMU 3.1.0 monitor - type 'help' for more information
> (qemu) stop
> (qemu) migrate "exec:cat >/dev/null"
> (qemu) migrate "exec:cat >/dev/null"
> qemu-system-x86_64: /build/qemu/qemu-3.1/block.c:4647: bdrv_inactivate_recurse: Assertion `!(bs->open_flags & BDRV_O_INACTIVE)' failed.
> Aborted

And on head as well; it only happens if the 1st migrate is successful;
if it got cancelled, the 2nd one works, so it's not too bad.

I suspect the problem here is all around locking/ownership - the block
devices get shut down at the end of migration, since the assumption is
that the other end has them open now and we had better release them.

Dave

> (It is irrelevant what's in foo.img; it only needs to be initialized.)
>
> If it is worth bisecting, I'll do that tomorrow.
>
> Thanks,
>
> /mjt

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
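[Editorial note: the assertion in the report fires because completing a migration marks every block node inactive, and a second successful migration tries to inactivate them again. A minimal stand-alone model of that flag lifecycle - not QEMU's actual code; the flag value and function shapes are simplified assumptions for illustration:]

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the BDRV_O_INACTIVE bit in
 * BlockDriverState.open_flags. The value is arbitrary here. */
#define BDRV_O_INACTIVE 0x0800

typedef struct {
    const char *name;
    int open_flags;
} FakeBDS;

/* Models bdrv_inactivate_recurse(): real QEMU assert()s that the node
 * is not already inactive, which is the abort seen in the report when
 * a second migration completes without a 'cont' in between. */
static void fake_inactivate(FakeBDS *bs)
{
    assert(!(bs->open_flags & BDRV_O_INACTIVE)); /* trips on 2nd migrate */
    bs->open_flags |= BDRV_O_INACTIVE;
}

/* True if a migration-completion style inactivation would be safe. */
static bool can_inactivate(const FakeBDS *bs)
{
    return !(bs->open_flags & BDRV_O_INACTIVE);
}
```

After the first successful migrate, `can_inactivate()` is false, so a second completion walks straight into the assertion.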
* Re: [Qemu-devel] 3.1: second invocation of migrate crashes qemu
  2019-01-14 10:51 ` Dr. David Alan Gilbert
@ 2019-01-14 11:52 ` Kevin Wolf
  2019-01-18 15:57 ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 8+ messages in thread

From: Kevin Wolf @ 2019-01-14 11:52 UTC
To: Dr. David Alan Gilbert; +Cc: Michael Tokarev, quintela, qemu-devel

On 14.01.2019 at 11:51, Dr. David Alan Gilbert wrote:
> * Michael Tokarev (mjt@tls.msk.ru) wrote:
> > $ qemu-system-x86_64 -monitor stdio -hda foo.img
> > QEMU 3.1.0 monitor - type 'help' for more information
> > (qemu) stop
> > (qemu) migrate "exec:cat >/dev/null"
> > (qemu) migrate "exec:cat >/dev/null"
> > qemu-system-x86_64: /build/qemu/qemu-3.1/block.c:4647: bdrv_inactivate_recurse: Assertion `!(bs->open_flags & BDRV_O_INACTIVE)' failed.
> > Aborted
>
> And on head as well; it only happens if the 1st migrate is successful;
> if it got cancelled, the 2nd one works, so it's not too bad.
>
> I suspect the problem here is all around locking/ownership - the block
> devices get shut down at the end of migration, since the assumption is
> that the other end has them open now and we had better release them.

Yes, only "cont" gets control back to the source VM.

I think we really should limit the possible monitor commands in the
postmigrate status, and possibly provide a way to get back to the
regular paused state (which means getting back control of the resources)
without resuming the VM first.

Kevin
* Re: [Qemu-devel] 3.1: second invocation of migrate crashes qemu
  2019-01-14 11:52 ` Kevin Wolf
@ 2019-01-18 15:57 ` Dr. David Alan Gilbert
  2019-01-21 15:55 ` Kevin Wolf
  0 siblings, 1 reply; 8+ messages in thread

From: Dr. David Alan Gilbert @ 2019-01-18 15:57 UTC
To: Kevin Wolf; +Cc: Michael Tokarev, quintela, qemu-devel

* Kevin Wolf (kwolf@redhat.com) wrote:
> On 14.01.2019 at 11:51, Dr. David Alan Gilbert wrote:
> > * Michael Tokarev (mjt@tls.msk.ru) wrote:
> > > $ qemu-system-x86_64 -monitor stdio -hda foo.img
> > > QEMU 3.1.0 monitor - type 'help' for more information
> > > (qemu) stop
> > > (qemu) migrate "exec:cat >/dev/null"
> > > (qemu) migrate "exec:cat >/dev/null"
> > > qemu-system-x86_64: /build/qemu/qemu-3.1/block.c:4647: bdrv_inactivate_recurse: Assertion `!(bs->open_flags & BDRV_O_INACTIVE)' failed.
> > > Aborted
> >
> > And on head as well; it only happens if the 1st migrate is successful;
> > if it got cancelled, the 2nd one works, so it's not too bad.
> >
> > I suspect the problem here is all around locking/ownership - the block
> > devices get shut down at the end of migration, since the assumption is
> > that the other end has them open now and we had better release them.
>
> Yes, only "cont" gets control back to the source VM.
>
> I think we really should limit the possible monitor commands in the
> postmigrate status, and possibly provide a way to get back to the
> regular paused state (which means getting back control of the resources)
> without resuming the VM first.

This error is a little interesting if you'd done something like:

src:
  stop
  migrate

dst:
  <kill qemu for some reason>
  start a new qemu

src:
  migrate

Now that used to work (safely) - note we've not started
a VM successfully anywhere else.

Now the source refuses to let that happen - with a rather
nasty abort.

Dave

> Kevin

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
* Re: [Qemu-devel] 3.1: second invocation of migrate crashes qemu
  2019-01-18 15:57 ` Dr. David Alan Gilbert
@ 2019-01-21 15:55 ` Kevin Wolf
  2019-01-21 16:05 ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 8+ messages in thread

From: Kevin Wolf @ 2019-01-21 15:55 UTC
To: Dr. David Alan Gilbert; +Cc: Michael Tokarev, quintela, qemu-devel

On 18.01.2019 at 16:57, Dr. David Alan Gilbert wrote:
> * Kevin Wolf (kwolf@redhat.com) wrote:
> > On 14.01.2019 at 11:51, Dr. David Alan Gilbert wrote:
> > > * Michael Tokarev (mjt@tls.msk.ru) wrote:
> > > > $ qemu-system-x86_64 -monitor stdio -hda foo.img
> > > > QEMU 3.1.0 monitor - type 'help' for more information
> > > > (qemu) stop
> > > > (qemu) migrate "exec:cat >/dev/null"
> > > > (qemu) migrate "exec:cat >/dev/null"
> > > > qemu-system-x86_64: /build/qemu/qemu-3.1/block.c:4647: bdrv_inactivate_recurse: Assertion `!(bs->open_flags & BDRV_O_INACTIVE)' failed.
> > > > Aborted
> > >
> > > And on head as well; it only happens if the 1st migrate is successful;
> > > if it got cancelled, the 2nd one works, so it's not too bad.
> > >
> > > I suspect the problem here is all around locking/ownership - the block
> > > devices get shut down at the end of migration, since the assumption is
> > > that the other end has them open now and we had better release them.
> >
> > Yes, only "cont" gets control back to the source VM.
> >
> > I think we really should limit the possible monitor commands in the
> > postmigrate status, and possibly provide a way to get back to the
> > regular paused state (which means getting back control of the resources)
> > without resuming the VM first.
>
> This error is a little interesting if you'd done something like:
>
> src:
>   stop
>   migrate
>
> dst:
>   <kill qemu for some reason>
>   start a new qemu
>
> src:
>   migrate
>
> Now that used to work (safely) - note we've not started
> a VM successfully anywhere else.
>
> Now the source refuses to let that happen - with a rather
> nasty abort.

Essentially it's another effect of the problem that migration has always
lacked a proper model of ownership transfer. And it's still treating
this as a block layer problem rather than making it a core concept of
migration, as it should be.

We can stack another one-off fix on top, and get back control of the
block devices automatically on a second 'migrate'. But it feels like a
hack, and not like VMs had a properly designed and respected state
machine.

Kevin
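[Editorial note: the "one-off fix" Kevin mentions would amount to clearing the inactive state again when a new migration starts from the postmigrate state. A sketch of that idea in isolation - hypothetical helper names, not the patch that actually went into QEMU:]

```c
#include <assert.h>
#include <stdbool.h>

#define BDRV_O_INACTIVE 0x0800   /* simplified stand-in flag */

typedef struct { int open_flags; } FakeBDS;

/* Hypothetical analogue of re-activating images (as "cont" does):
 * take back control of the image before inactivating it again. */
static void fake_activate(FakeBDS *bs)
{
    bs->open_flags &= ~BDRV_O_INACTIVE;
}

/* Inactivation still assert()s, as in the report. */
static void fake_inactivate(FakeBDS *bs)
{
    assert(!(bs->open_flags & BDRV_O_INACTIVE));
    bs->open_flags |= BDRV_O_INACTIVE;
}

/* 'migrate' entry point with the proposed band-aid: if the previous
 * migration left the node inactive, re-activate it first instead of
 * running into the assertion on completion. */
static void fake_migrate(FakeBDS *bs, bool in_postmigrate)
{
    if (in_postmigrate) {
        fake_activate(bs);
    }
    /* ... stream RAM and device state ... */
    fake_inactivate(bs);  /* completion hands the image to the dest */
}
```

This is exactly the "stack another fix on top" shape being criticized: it papers over the missing ownership model rather than expressing it.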
* Re: [Qemu-devel] 3.1: second invocation of migrate crashes qemu
  2019-01-21 15:55 ` Kevin Wolf
@ 2019-01-21 16:05 ` Dr. David Alan Gilbert
  2019-01-21 16:45 ` Kevin Wolf
  0 siblings, 1 reply; 8+ messages in thread

From: Dr. David Alan Gilbert @ 2019-01-21 16:05 UTC
To: Kevin Wolf; +Cc: Michael Tokarev, quintela, qemu-devel

* Kevin Wolf (kwolf@redhat.com) wrote:
> On 18.01.2019 at 16:57, Dr. David Alan Gilbert wrote:
> > * Kevin Wolf (kwolf@redhat.com) wrote:
> > > On 14.01.2019 at 11:51, Dr. David Alan Gilbert wrote:
> > > > * Michael Tokarev (mjt@tls.msk.ru) wrote:
> > > > > $ qemu-system-x86_64 -monitor stdio -hda foo.img
> > > > > QEMU 3.1.0 monitor - type 'help' for more information
> > > > > (qemu) stop
> > > > > (qemu) migrate "exec:cat >/dev/null"
> > > > > (qemu) migrate "exec:cat >/dev/null"
> > > > > qemu-system-x86_64: /build/qemu/qemu-3.1/block.c:4647: bdrv_inactivate_recurse: Assertion `!(bs->open_flags & BDRV_O_INACTIVE)' failed.
> > > > > Aborted
> > > >
> > > > And on head as well; it only happens if the 1st migrate is successful;
> > > > if it got cancelled, the 2nd one works, so it's not too bad.
> > > >
> > > > I suspect the problem here is all around locking/ownership - the block
> > > > devices get shut down at the end of migration, since the assumption is
> > > > that the other end has them open now and we had better release them.
> > >
> > > Yes, only "cont" gets control back to the source VM.
> > >
> > > I think we really should limit the possible monitor commands in the
> > > postmigrate status, and possibly provide a way to get back to the
> > > regular paused state (which means getting back control of the resources)
> > > without resuming the VM first.
> >
> > This error is a little interesting if you'd done something like:
> >
> > src:
> >   stop
> >   migrate
> >
> > dst:
> >   <kill qemu for some reason>
> >   start a new qemu
> >
> > src:
> >   migrate
> >
> > Now that used to work (safely) - note we've not started
> > a VM successfully anywhere else.
> >
> > Now the source refuses to let that happen - with a rather
> > nasty abort.
>
> Essentially it's another effect of the problem that migration has always
> lacked a proper model of ownership transfer. And it's still treating
> this as a block layer problem rather than making it a core concept of
> migration, as it should be.
>
> We can stack another one-off fix on top, and get back control of the
> block devices automatically on a second 'migrate'. But it feels like a
> hack, and not like VMs had a properly designed and respected state
> machine.

Hmm; I don't like to get back to this argument, because I think
we've got a perfectly serviceable model that's implemented at higher
levels outside qemu, and the real problem is that the block layer added
new assumptions about the semantics without checking they were really
true.

qemu only has the view from a single host; it takes the higher-level
view from something like libvirt, which can see across multiple hosts,
to understand who has the ownership when.

Dave

> Kevin

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
* Re: [Qemu-devel] 3.1: second invocation of migrate crashes qemu
  2019-01-21 16:05 ` Dr. David Alan Gilbert
@ 2019-01-21 16:45 ` Kevin Wolf
  2019-01-24 20:04 ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 8+ messages in thread

From: Kevin Wolf @ 2019-01-21 16:45 UTC
To: Dr. David Alan Gilbert; +Cc: Michael Tokarev, quintela, qemu-devel

On 21.01.2019 at 17:05, Dr. David Alan Gilbert wrote:
> * Kevin Wolf (kwolf@redhat.com) wrote:
> > On 18.01.2019 at 16:57, Dr. David Alan Gilbert wrote:
> > > * Kevin Wolf (kwolf@redhat.com) wrote:
> > > > On 14.01.2019 at 11:51, Dr. David Alan Gilbert wrote:
> > > > > * Michael Tokarev (mjt@tls.msk.ru) wrote:
> > > > > > $ qemu-system-x86_64 -monitor stdio -hda foo.img
> > > > > > QEMU 3.1.0 monitor - type 'help' for more information
> > > > > > (qemu) stop
> > > > > > (qemu) migrate "exec:cat >/dev/null"
> > > > > > (qemu) migrate "exec:cat >/dev/null"
> > > > > > qemu-system-x86_64: /build/qemu/qemu-3.1/block.c:4647: bdrv_inactivate_recurse: Assertion `!(bs->open_flags & BDRV_O_INACTIVE)' failed.
> > > > > > Aborted
> > > > >
> > > > > And on head as well; it only happens if the 1st migrate is successful;
> > > > > if it got cancelled, the 2nd one works, so it's not too bad.
> > > > >
> > > > > I suspect the problem here is all around locking/ownership - the block
> > > > > devices get shut down at the end of migration, since the assumption is
> > > > > that the other end has them open now and we had better release them.
> > > >
> > > > Yes, only "cont" gets control back to the source VM.
> > > >
> > > > I think we really should limit the possible monitor commands in the
> > > > postmigrate status, and possibly provide a way to get back to the
> > > > regular paused state (which means getting back control of the resources)
> > > > without resuming the VM first.
> > >
> > > This error is a little interesting if you'd done something like:
> > >
> > > src:
> > >   stop
> > >   migrate
> > >
> > > dst:
> > >   <kill qemu for some reason>
> > >   start a new qemu
> > >
> > > src:
> > >   migrate
> > >
> > > Now that used to work (safely) - note we've not started
> > > a VM successfully anywhere else.
> > >
> > > Now the source refuses to let that happen - with a rather
> > > nasty abort.
> >
> > Essentially it's another effect of the problem that migration has always
> > lacked a proper model of ownership transfer. And it's still treating
> > this as a block layer problem rather than making it a core concept of
> > migration, as it should be.
> >
> > We can stack another one-off fix on top, and get back control of the
> > block devices automatically on a second 'migrate'. But it feels like a
> > hack, and not like VMs had a properly designed and respected state
> > machine.
>
> Hmm; I don't like to get back to this argument, because I think
> we've got a perfectly serviceable model that's implemented at higher
> levels outside qemu, and the real problem is that the block layer added
> new assumptions about the semantics without checking they were really
> true.
>
> qemu only has the view from a single host; it takes the higher-level
> view from something like libvirt, which can see across multiple hosts,
> to understand who has the ownership when.

Obviously the upper layer is not handling this without the help of QEMU,
or we wouldn't have had bugs where images were accessed by two QEMU
processes at the same time. We didn't change the assumptions; we only
started to actually check the preconditions that have always been
necessary to perform live migration correctly.

But if you like to think the upper layer should handle all of this, then
it's on libvirt to handle the ownership transfer manually. If we really
want, we can add explicit QMP commands to activate and inactivate block
nodes.
This can be done, and requiring that the management layer do
all of this would be a consistent interface, too.

I just don't like this design much, for two reasons: the first is that
you can't migrate a VM that has disks with a simple 'migrate' command
any more. The second is that if you implement it consistently, this has
an impact on compatibility. I think it's a design that could be
considered if we were adding live migration as a new feature, but it's
probably hard to switch to it now.

In any case, I do think we should finally make a decision on how ownership
of resources should work in the context of migration, and then implement
that.

Kevin
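[Editorial note: if explicit activation control were exposed over QMP as Kevin suggests, the recovery flow for the failed-destination scenario could look roughly like this. The `blockdev-activate` command name and its arguments are invented for illustration - no such command existed in QEMU 3.1:]

```json
{ "execute": "query-status" }
{ "return": { "status": "postmigrate", "running": false } }

{ "execute": "blockdev-activate", "arguments": { "node-name": "disk0" } }
{ "return": {} }

{ "execute": "migrate", "arguments": { "uri": "tcp:otherhost:4444" } }
{ "return": {} }
```

The management layer, which knows the destination died, would take back ownership explicitly before retrying the migration - which is exactly the compatibility cost Kevin raises: a bare 'migrate' would no longer be enough.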
* Re: [Qemu-devel] 3.1: second invocation of migrate crashes qemu
  2019-01-21 16:45 ` Kevin Wolf
@ 2019-01-24 20:04 ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 8+ messages in thread

From: Dr. David Alan Gilbert @ 2019-01-24 20:04 UTC
To: Kevin Wolf; +Cc: Michael Tokarev, quintela, qemu-devel

* Kevin Wolf (kwolf@redhat.com) wrote:
> On 21.01.2019 at 17:05, Dr. David Alan Gilbert wrote:
> > * Kevin Wolf (kwolf@redhat.com) wrote:
> > > On 18.01.2019 at 16:57, Dr. David Alan Gilbert wrote:
> > > > * Kevin Wolf (kwolf@redhat.com) wrote:
> > > > > On 14.01.2019 at 11:51, Dr. David Alan Gilbert wrote:
> > > > > > * Michael Tokarev (mjt@tls.msk.ru) wrote:
> > > > > > > $ qemu-system-x86_64 -monitor stdio -hda foo.img
> > > > > > > QEMU 3.1.0 monitor - type 'help' for more information
> > > > > > > (qemu) stop
> > > > > > > (qemu) migrate "exec:cat >/dev/null"
> > > > > > > (qemu) migrate "exec:cat >/dev/null"
> > > > > > > qemu-system-x86_64: /build/qemu/qemu-3.1/block.c:4647: bdrv_inactivate_recurse: Assertion `!(bs->open_flags & BDRV_O_INACTIVE)' failed.
> > > > > > > Aborted
> > > > > >
> > > > > > And on head as well; it only happens if the 1st migrate is successful;
> > > > > > if it got cancelled, the 2nd one works, so it's not too bad.
> > > > > >
> > > > > > I suspect the problem here is all around locking/ownership - the block
> > > > > > devices get shut down at the end of migration, since the assumption is
> > > > > > that the other end has them open now and we had better release them.
> > > > >
> > > > > Yes, only "cont" gets control back to the source VM.
> > > > >
> > > > > I think we really should limit the possible monitor commands in the
> > > > > postmigrate status, and possibly provide a way to get back to the
> > > > > regular paused state (which means getting back control of the resources)
> > > > > without resuming the VM first.
> > > >
> > > > This error is a little interesting if you'd done something like:
> > > >
> > > > src:
> > > >   stop
> > > >   migrate
> > > >
> > > > dst:
> > > >   <kill qemu for some reason>
> > > >   start a new qemu
> > > >
> > > > src:
> > > >   migrate
> > > >
> > > > Now that used to work (safely) - note we've not started
> > > > a VM successfully anywhere else.
> > > >
> > > > Now the source refuses to let that happen - with a rather
> > > > nasty abort.
> > >
> > > Essentially it's another effect of the problem that migration has always
> > > lacked a proper model of ownership transfer. And it's still treating
> > > this as a block layer problem rather than making it a core concept of
> > > migration, as it should be.
> > >
> > > We can stack another one-off fix on top, and get back control of the
> > > block devices automatically on a second 'migrate'. But it feels like a
> > > hack, and not like VMs had a properly designed and respected state
> > > machine.
> >
> > Hmm; I don't like to get back to this argument, because I think
> > we've got a perfectly serviceable model that's implemented at higher
> > levels outside qemu, and the real problem is that the block layer added
> > new assumptions about the semantics without checking they were really
> > true.
> >
> > qemu only has the view from a single host; it takes the higher-level
> > view from something like libvirt, which can see across multiple hosts,
> > to understand who has the ownership when.
>
> Obviously the upper layer is not handling this without the help of QEMU,
> or we wouldn't have had bugs where images were accessed by two QEMU
> processes at the same time. We didn't change the assumptions; we only
> started to actually check the preconditions that have always been
> necessary to perform live migration correctly.
In this case there is a behaviour that was perfectly legal before and
fails now; further, the case is safe - the source hasn't accessed the
disks after the first migration and isn't trying to access them again
either.

> But if you like to think the upper layer should handle all of this,

I don't really want the upper layer to handle all of this; but I don't
think we can handle it all either - we've not got the higher-level view
of screwups that happen outside qemu.

> then
> it's on libvirt to handle the ownership transfer manually. If we really
> want, we can add explicit QMP commands to activate and inactivate block
> nodes. This can be done, and requiring that the management layer do
> all of this would be a consistent interface, too.
>
> I just don't like this design much, for two reasons: the first is that
> you can't migrate a VM that has disks with a simple 'migrate' command
> any more. The second is that if you implement it consistently, this has
> an impact on compatibility. I think it's a design that could be
> considered if we were adding live migration as a new feature, but it's
> probably hard to switch to it now.
>
> In any case, I do think we should finally make a decision on how ownership
> of resources should work in the context of migration, and then implement
> that.

I think we're mostly OK, but what I'd like would be:

a) I'd like things to fail gently rather than abort; so I'd either like
   the current functions to fail cleanly, so I can fail the migration, or
   add a check at the start of migration to tell the user they did
   something wrong.

b) I'd like commands that can tell me the current state, and a command to
   move it to the other state explicitly, so we've got a way to recover
   in weirder cases.

c) I'd like to document what the states should be before/after/in
   various middle states of migration.
I think the normal case is fine, and hence, as you say, I wouldn't want
to break a normal 'migrate' - I just want cleaner failures and ways to
do the more unusual things.

Dave

> Kevin

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
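[Editorial note: points (a) and (b) above - failing cleanly instead of aborting, and being able to query the activation state - could look something like this in spirit. This is a sketch with hypothetical names, not QEMU's real code paths, which propagate errors through Error ** pointers:]

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define BDRV_O_INACTIVE 0x0800   /* simplified stand-in flag */

typedef struct { int open_flags; } FakeBDS;

/* Point (b): a query that reports the node's activation state, so the
 * management layer can see who currently owns the image. */
static bool fake_is_active(const FakeBDS *bs)
{
    return !(bs->open_flags & BDRV_O_INACTIVE);
}

/* Point (a): inactivation that reports failure instead of assert()ing,
 * so 'migrate' can fail with a readable error in the postmigrate case. */
static int fake_inactivate(FakeBDS *bs)
{
    if (bs->open_flags & BDRV_O_INACTIVE) {
        return -EBUSY;  /* already handed over; refuse, don't abort */
    }
    bs->open_flags |= BDRV_O_INACTIVE;
    return 0;
}
```

With this shape, the second 'migrate' in the original report would fail with an error message at the monitor instead of killing the source QEMU.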
end of thread, other threads: [~2019-01-24 20:14 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
2019-01-12 17:11 [Qemu-devel] 3.1: second invocation of migrate crashes qemu Michael Tokarev
2019-01-14 10:51 ` Dr. David Alan Gilbert
2019-01-14 11:52   ` Kevin Wolf
2019-01-18 15:57     ` Dr. David Alan Gilbert
2019-01-21 15:55       ` Kevin Wolf
2019-01-21 16:05         ` Dr. David Alan Gilbert
2019-01-21 16:45           ` Kevin Wolf
2019-01-24 20:04             ` Dr. David Alan Gilbert