* Thoughts on VM fence infrastructure
From: Felipe Franciosi @ 2019-09-30 10:30 UTC
To: qemu-devel; +Cc: Aditya Ramesh
Heyall,
We have a use case where a host should self-fence (and all VMs should
die) if it doesn't hear back from a heartbeat within a certain time
period. Lots of ideas were floated around where libvirt could take
care of killing VMs or a separate service could do it. The concern
with those is that various failures could lead to _those_ services
being unavailable and the fencing wouldn't be enforced as it should.
Ultimately, it feels like Qemu should be responsible for this
heartbeat and exit (or execute a custom callback) on timeout.
Does something already exist for this purpose which could be used?
Would a generic Qemu-fencing infrastructure be something of interest?
Cheers,
F.
* Re: Thoughts on VM fence infrastructure
From: Dr. David Alan Gilbert @ 2019-09-30 14:29 UTC
To: Felipe Franciosi; +Cc: Aditya Ramesh, qemu-devel
* Felipe Franciosi (felipe@nutanix.com) wrote:
> Heyall,
>
> We have a use case where a host should self-fence (and all VMs should
> die) if it doesn't hear back from a heartbeat within a certain time
> period. Lots of ideas were floated around where libvirt could take
> care of killing VMs or a separate service could do it. The concern
> with those is that various failures could lead to _those_ services
> being unavailable and the fencing wouldn't be enforced as it should.
>
> Ultimately, it feels like Qemu should be responsible for this
> heartbeat and exit (or execute a custom callback) on timeout.
It doesn't feel like doing it inside qemu would be any safer; something
outside QEMU can still forcibly issue a kill -9 and qemu *will* stop.
> Does something already exist for this purpose which could be used?
> Would a generic Qemu-fencing infrastructure be something of interest?
Dave
> Cheers,
> F.
>
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
* Re: Thoughts on VM fence infrastructure
From: Felipe Franciosi @ 2019-09-30 15:46 UTC
To: Dr. David Alan Gilbert; +Cc: Aditya Ramesh, qemu-devel
Hi David,
> On Sep 30, 2019, at 3:29 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
>
> * Felipe Franciosi (felipe@nutanix.com) wrote:
>> Heyall,
>>
>> We have a use case where a host should self-fence (and all VMs should
>> die) if it doesn't hear back from a heartbeat within a certain time
>> period. Lots of ideas were floated around where libvirt could take
>> care of killing VMs or a separate service could do it. The concern
>> with those is that various failures could lead to _those_ services
>> being unavailable and the fencing wouldn't be enforced as it should.
>>
>> Ultimately, it feels like Qemu should be responsible for this
>> heartbeat and exit (or execute a custom callback) on timeout.
>
> It doesn't feel doing it inside qemu would be any safer; something
> outside QEMU can forcibly emit a kill -9 and qemu *will* stop.
The argument above is that we would have to rely on this external
service being functional. Consider the case where the host is
dysfunctional, with this service perhaps crashed and a corrupt
filesystem preventing it from restarting. The VMs would never die.
It feels like a Qemu timer-driven heartbeat check that calls abort() /
exit() on failure would be more reliable. Thoughts?
Felipe
>
>> Does something already exist for this purpose which could be used?
>> Would a generic Qemu-fencing infrastructure be something of interest?
> Dave
>
>
>> Cheers,
>> F.
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
* Re: Thoughts on VM fence infrastructure
From: Dr. David Alan Gilbert @ 2019-09-30 16:03 UTC
To: Felipe Franciosi; +Cc: Aditya Ramesh, qemu-devel
* Felipe Franciosi (felipe@nutanix.com) wrote:
> Hi David,
>
> > On Sep 30, 2019, at 3:29 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> >
> > * Felipe Franciosi (felipe@nutanix.com) wrote:
> >> Heyall,
> >>
> >> We have a use case where a host should self-fence (and all VMs should
> >> die) if it doesn't hear back from a heartbeat within a certain time
> >> period. Lots of ideas were floated around where libvirt could take
> >> care of killing VMs or a separate service could do it. The concern
> >> with those is that various failures could lead to _those_ services
> >> being unavailable and the fencing wouldn't be enforced as it should.
> >>
> >> Ultimately, it feels like Qemu should be responsible for this
> >> heartbeat and exit (or execute a custom callback) on timeout.
> >
> > It doesn't feel doing it inside qemu would be any safer; something
> > outside QEMU can forcibly emit a kill -9 and qemu *will* stop.
>
> The argument above is that we would have to rely on this external
> service being functional. Consider the case where the host is
> dysfunctional, with this service perhaps crashed and a corrupt
> filesystem preventing it from restarting. The VMs would never die.
Yeh that could fail.
> It feels like a Qemu timer-driven heartbeat check and calls abort() /
> exit() would be more reliable. Thoughts?
OK, yes; perhaps using timer_create() and telling it to send a fatal
signal is pretty solid; once it's set, it's the kernel that delivers the
signal.
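Something like this minimal sketch is what I mean (not existing QEMU
code; the helper names, the SIGKILL choice and the deadline are just
examples of the timer_create()/timer_settime() pattern, and it needs
-lrt on older glibc):

#include <signal.h>
#include <stdlib.h>
#include <time.h>

/* Arm a kernel-side timer that delivers a fatal signal to this process
 * if it is not re-armed before the deadline expires. */
static timer_t fence_timer;

static void fence_arm(time_t deadline_secs)
{
    struct sigevent sev = {
        .sigev_notify = SIGEV_SIGNAL,
        .sigev_signo  = SIGKILL,    /* uncatchable; SIGABRT would leave a core */
    };
    struct itimerspec its = {
        .it_value.tv_sec = deadline_secs,   /* one-shot unless re-armed */
    };

    if (timer_create(CLOCK_MONOTONIC, &sev, &fence_timer) < 0 ||
        timer_settime(fence_timer, 0, &its, NULL) < 0) {
        abort();
    }
}

/* Call again before the deadline expires to push it out. */
static void fence_rearm(time_t deadline_secs)
{
    struct itimerspec its = { .it_value.tv_sec = deadline_secs };
    timer_settime(fence_timer, 0, &its, NULL);
}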
IMHO the safer way is to kick the host off the network by reprogramming
switches; so even if the qemu is actually alive it can't get anywhere.
Dave
> Felipe
>
> >
> >> Does something already exist for this purpose which could be used?
> >> Would a generic Qemu-fencing infrastructure be something of interest?
> > Dave
> >
> >
> >> Cheers,
> >> F.
> >>
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
* Re: Thoughts on VM fence infrastructure
From: Felipe Franciosi @ 2019-09-30 16:59 UTC
To: Dr. David Alan Gilbert; +Cc: Aditya Ramesh, qemu-devel
> On Sep 30, 2019, at 5:03 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
>
> * Felipe Franciosi (felipe@nutanix.com) wrote:
>> Hi David,
>>
>>> On Sep 30, 2019, at 3:29 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
>>>
>>> * Felipe Franciosi (felipe@nutanix.com) wrote:
>>>> Heyall,
>>>>
>>>> We have a use case where a host should self-fence (and all VMs should
>>>> die) if it doesn't hear back from a heartbeat within a certain time
>>>> period. Lots of ideas were floated around where libvirt could take
>>>> care of killing VMs or a separate service could do it. The concern
>>>> with those is that various failures could lead to _those_ services
>>>> being unavailable and the fencing wouldn't be enforced as it should.
>>>>
>>>> Ultimately, it feels like Qemu should be responsible for this
>>>> heartbeat and exit (or execute a custom callback) on timeout.
>>>
>>> It doesn't feel doing it inside qemu would be any safer; something
>>> outside QEMU can forcibly emit a kill -9 and qemu *will* stop.
>>
>> The argument above is that we would have to rely on this external
>> service being functional. Consider the case where the host is
>> dysfunctional, with this service perhaps crashed and a corrupt
>> filesystem preventing it from restarting. The VMs would never die.
>
> Yeh that could fail.
>
>> It feels like a Qemu timer-driven heartbeat check and calls abort() /
>> exit() would be more reliable. Thoughts?
>
> OK, yes; perhaps using a timer_create and telling it to send a fatal
> signal is pretty solid; it would take the kernel to do that once it's
> set.
I'm confused about why the kernel needs to be involved. If this is a
timer off the Qemu main loop, it can just check on the heartbeat
condition (which should be customisable) and call abort() if that's
not satisfied. If you agree on that I'd like to talk about how that
check could be made customisable.
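To make that concrete, here is a rough sketch of what I have in mind off
the main loop (timer API names from memory of include/qemu/timer.h;
check_heartbeat() is a placeholder for whatever customisable condition
we agree on):

#include "qemu/osdep.h"
#include "qemu/timer.h"

#define FENCE_RECHECK_MS 5000   /* example interval */

static QEMUTimer *fence_timer;

static bool check_heartbeat(void);  /* hypothetical, user-defined condition */

static void fence_timer_cb(void *opaque)
{
    if (!check_heartbeat()) {
        abort();                    /* self-fence */
    }
    timer_mod(fence_timer,
              qemu_clock_get_ms(QEMU_CLOCK_REALTIME) + FENCE_RECHECK_MS);
}

static void fence_init(void)
{
    fence_timer = timer_new_ms(QEMU_CLOCK_REALTIME, fence_timer_cb, NULL);
    timer_mod(fence_timer,
              qemu_clock_get_ms(QEMU_CLOCK_REALTIME) + FENCE_RECHECK_MS);
}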
>
> IMHO the safer way is to kick the host off the network by reprogramming
> switches; so even if the qemu is actually alive it can't get anywhere.
>
> Dave
Naturally some off-host STONITH is preferable, but that's not always
available. A self-fencing mechanism right at the heart of the emulator
can do the job without external hardware dependencies.
Cheers,
Felipe
>
>
>> Felipe
>>
>>>
>>>> Does something already exist for this purpose which could be used?
>>>> Would a generic Qemu-fencing infrastructure be something of interest?
>>> Dave
>>>
>>>
>>>> Cheers,
>>>> F.
>>>>
>>> --
>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
* Re: Thoughts on VM fence infrastructure
From: Dr. David Alan Gilbert @ 2019-09-30 17:11 UTC
To: Felipe Franciosi; +Cc: Aditya Ramesh, qemu-devel
* Felipe Franciosi (felipe@nutanix.com) wrote:
>
>
> > On Sep 30, 2019, at 5:03 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> >
> > * Felipe Franciosi (felipe@nutanix.com) wrote:
> >> Hi David,
> >>
> >>> On Sep 30, 2019, at 3:29 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> >>>
> >>> * Felipe Franciosi (felipe@nutanix.com) wrote:
> >>>> Heyall,
> >>>>
> >>>> We have a use case where a host should self-fence (and all VMs should
> >>>> die) if it doesn't hear back from a heartbeat within a certain time
> >>>> period. Lots of ideas were floated around where libvirt could take
> >>>> care of killing VMs or a separate service could do it. The concern
> >>>> with those is that various failures could lead to _those_ services
> >>>> being unavailable and the fencing wouldn't be enforced as it should.
> >>>>
> >>>> Ultimately, it feels like Qemu should be responsible for this
> >>>> heartbeat and exit (or execute a custom callback) on timeout.
> >>>
> >>> It doesn't feel doing it inside qemu would be any safer; something
> >>> outside QEMU can forcibly emit a kill -9 and qemu *will* stop.
> >>
> >> The argument above is that we would have to rely on this external
> >> service being functional. Consider the case where the host is
> >> dysfunctional, with this service perhaps crashed and a corrupt
> >> filesystem preventing it from restarting. The VMs would never die.
> >
> > Yeh that could fail.
> >
> >> It feels like a Qemu timer-driven heartbeat check and calls abort() /
> >> exit() would be more reliable. Thoughts?
> >
> > OK, yes; perhaps using a timer_create and telling it to send a fatal
> > signal is pretty solid; it would take the kernel to do that once it's
> > set.
>
> I'm confused about why the kernel needs to be involved. If this is a
> timer off the Qemu main loop, it can just check on the heartbeat
> condition (which should be customisable) and call abort() if that's
> not satisfied. If you agree on that I'd like to talk about how that
> check could be made customisable.
There are times when the main loop can get blocked even though the CPU
threads are still running and, in some configurations, can perform IO
even without the main loop (I think!).
By setting a timer in the kernel that sends a signal to qemu, the kernel
will send that signal however broken qemu is.
>
> >
> > IMHO the safer way is to kick the host off the network by reprogramming
> > switches; so even if the qemu is actually alive it can't get anywhere.
> >
> > Dave
>
> Naturally some off-host STONITH is preferable, but that's not always
> available. A self-fencing mechanism right at the heart of the emulator
> can do the job without external hardware dependencies.
Dave
> Cheers,
> Felipe
>
> >
> >
> >> Felipe
> >>
> >>>
> >>>> Does something already exist for this purpose which could be used?
> >>>> Would a generic Qemu-fencing infrastructure be something of interest?
> >>> Dave
> >>>
> >>>
> >>>> Cheers,
> >>>> F.
> >>>>
> >>> --
> >>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >>
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
* Re: Thoughts on VM fence infrastructure
From: Felipe Franciosi @ 2019-09-30 17:33 UTC
To: Dr. David Alan Gilbert; +Cc: Aditya Ramesh, qemu-devel
> On Sep 30, 2019, at 6:11 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
>
> * Felipe Franciosi (felipe@nutanix.com) wrote:
>>
>>
>>> On Sep 30, 2019, at 5:03 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
>>>
>>> * Felipe Franciosi (felipe@nutanix.com) wrote:
>>>> Hi David,
>>>>
>>>>> On Sep 30, 2019, at 3:29 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
>>>>>
>>>>> * Felipe Franciosi (felipe@nutanix.com) wrote:
>>>>>> Heyall,
>>>>>>
>>>>>> We have a use case where a host should self-fence (and all VMs should
>>>>>> die) if it doesn't hear back from a heartbeat within a certain time
>>>>>> period. Lots of ideas were floated around where libvirt could take
>>>>>> care of killing VMs or a separate service could do it. The concern
>>>>>> with those is that various failures could lead to _those_ services
>>>>>> being unavailable and the fencing wouldn't be enforced as it should.
>>>>>>
>>>>>> Ultimately, it feels like Qemu should be responsible for this
>>>>>> heartbeat and exit (or execute a custom callback) on timeout.
>>>>>
>>>>> It doesn't feel doing it inside qemu would be any safer; something
>>>>> outside QEMU can forcibly emit a kill -9 and qemu *will* stop.
>>>>
>>>> The argument above is that we would have to rely on this external
>>>> service being functional. Consider the case where the host is
>>>> dysfunctional, with this service perhaps crashed and a corrupt
>>>> filesystem preventing it from restarting. The VMs would never die.
>>>
>>> Yeh that could fail.
>>>
>>>> It feels like a Qemu timer-driven heartbeat check and calls abort() /
>>>> exit() would be more reliable. Thoughts?
>>>
>>> OK, yes; perhaps using a timer_create and telling it to send a fatal
>>> signal is pretty solid; it would take the kernel to do that once it's
>>> set.
>>
>> I'm confused about why the kernel needs to be involved. If this is a
>> timer off the Qemu main loop, it can just check on the heartbeat
>> condition (which should be customisable) and call abort() if that's
>> not satisfied. If you agree on that I'd like to talk about how that
>> check could be made customisable.
>
> There are times when the main loop can get blocked even though the CPU
> threads can be running and can in some configurations perform IO
> even without the main loop (I think!).
Ah, that's a very good point. Indeed, you can perform IO in those
cases, especially when using vhost devices.
> By setting a timer in the kernel that sends a signal to qemu, the kernel
> will send that signal however broken qemu is.
Got you now. That's probably better. Do you reckon a signal is
preferable over SIGEV_THREAD?
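For reference, the SIGEV_THREAD variant of the earlier sketch would look
roughly like this (same headers as before; glibc runs the callback on a
thread it manages):

static void fence_expired(union sigval sv)
{
    /* a real implementation would log() the reason first */
    abort();
}

static void fence_arm_thread(time_t deadline_secs)
{
    timer_t t;
    struct sigevent sev = {
        .sigev_notify          = SIGEV_THREAD,
        .sigev_notify_function = fence_expired,
    };
    struct itimerspec its = { .it_value.tv_sec = deadline_secs };

    if (timer_create(CLOCK_MONOTONIC, &sev, &t) == 0) {
        timer_settime(t, 0, &its, NULL);
    }
}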
I'm still wondering how to make this customisable so that different
types of heartbeat could be implemented (preferably without creating
external dependencies per discussion above). Thoughts welcome.
F.
>
>>
>>>
>>> IMHO the safer way is to kick the host off the network by reprogramming
>>> switches; so even if the qemu is actually alive it can't get anywhere.
>>>
>>> Dave
>>
>> Naturally some off-host STONITH is preferable, but that's not always
>> available. A self-fencing mechanism right at the heart of the emulator
>> can do the job without external hardware dependencies.
>
> Dave
>
>> Cheers,
>> Felipe
>>
>>>
>>>
>>>> Felipe
>>>>
>>>>>
>>>>>> Does something already exist for this purpose which could be used?
>>>>>> Would a generic Qemu-fencing infrastructure be something of interest?
>>>>> Dave
>>>>>
>>>>>
>>>>>> Cheers,
>>>>>> F.
>>>>>>
>>>>> --
>>>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>>>
>>> --
>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
* Re: Thoughts on VM fence infrastructure
From: Dr. David Alan Gilbert @ 2019-09-30 17:59 UTC
To: Felipe Franciosi; +Cc: Aditya Ramesh, qemu-devel
* Felipe Franciosi (felipe@nutanix.com) wrote:
>
>
> > On Sep 30, 2019, at 6:11 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> >
> > * Felipe Franciosi (felipe@nutanix.com) wrote:
> >>
> >>
> >>> On Sep 30, 2019, at 5:03 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> >>>
> >>> * Felipe Franciosi (felipe@nutanix.com) wrote:
> >>>> Hi David,
> >>>>
> >>>>> On Sep 30, 2019, at 3:29 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> >>>>>
> >>>>> * Felipe Franciosi (felipe@nutanix.com) wrote:
> >>>>>> Heyall,
> >>>>>>
> >>>>>> We have a use case where a host should self-fence (and all VMs should
> >>>>>> die) if it doesn't hear back from a heartbeat within a certain time
> >>>>>> period. Lots of ideas were floated around where libvirt could take
> >>>>>> care of killing VMs or a separate service could do it. The concern
> >>>>>> with those is that various failures could lead to _those_ services
> >>>>>> being unavailable and the fencing wouldn't be enforced as it should.
> >>>>>>
> >>>>>> Ultimately, it feels like Qemu should be responsible for this
> >>>>>> heartbeat and exit (or execute a custom callback) on timeout.
> >>>>>
> >>>>> It doesn't feel doing it inside qemu would be any safer; something
> >>>>> outside QEMU can forcibly emit a kill -9 and qemu *will* stop.
> >>>>
> >>>> The argument above is that we would have to rely on this external
> >>>> service being functional. Consider the case where the host is
> >>>> dysfunctional, with this service perhaps crashed and a corrupt
> >>>> filesystem preventing it from restarting. The VMs would never die.
> >>>
> >>> Yeh that could fail.
> >>>
> >>>> It feels like a Qemu timer-driven heartbeat check and calls abort() /
> >>>> exit() would be more reliable. Thoughts?
> >>>
> >>> OK, yes; perhaps using a timer_create and telling it to send a fatal
> >>> signal is pretty solid; it would take the kernel to do that once it's
> >>> set.
> >>
> >> I'm confused about why the kernel needs to be involved. If this is a
> >> timer off the Qemu main loop, it can just check on the heartbeat
> >> condition (which should be customisable) and call abort() if that's
> >> not satisfied. If you agree on that I'd like to talk about how that
> >> check could be made customisable.
> >
> > There are times when the main loop can get blocked even though the CPU
> > threads can be running and can in some configurations perform IO
> > even without the main loop (I think!).
>
> Ah, that's a very good point. Indeed, you can perform IO in those
> cases specially when using vhost devices.
>
> > By setting a timer in the kernel that sends a signal to qemu, the kernel
> > will send that signal however broken qemu is.
>
> Got you now. That's probably better. Do you reckon a signal is
> preferable over SIGEV_THREAD?
Not sure; probably the safest is getting the kernel to SIGKILL it - but
that's a complete nightmare to debug - your process just goes *pop*
with no apparent reason why.
I've not used SIGEV_THREAD - it looks promising though.
> I'm still wondering how to make this customisable so that different
> types of heartbeat could be implemented (preferably without creating
> external dependencies per discussion above). Thoughts welcome.
Yes, you need something to enable it, and some safe way to retrigger
the timer. A qmp command marked as 'oob' might be the right way -
another qmp command can't block it.
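Something like this hypothetical command, say (its schema entry would be
declared with 'allow-oob': true so another blocked monitor command can't
starve it); all it would do is push the kernel timer out again:

#include "qemu/osdep.h"
#include "qapi/error.h"

#define FENCE_DEADLINE_SECS 60              /* example value */

void fence_rearm(time_t deadline_secs);     /* the kernel-timer helper sketched earlier */

void qmp_fence_heartbeat(Error **errp)
{
    fence_rearm(FENCE_DEADLINE_SECS);
}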
Dave
> F.
>
> >
> >>
> >>>
> >>> IMHO the safer way is to kick the host off the network by reprogramming
> >>> switches; so even if the qemu is actually alive it can't get anywhere.
> >>>
> >>> Dave
> >>
> >> Naturally some off-host STONITH is preferable, but that's not always
> >> available. A self-fencing mechanism right at the heart of the emulator
> >> can do the job without external hardware dependencies.
> >
> > Dave
> >
> >> Cheers,
> >> Felipe
> >>
> >>>
> >>>
> >>>> Felipe
> >>>>
> >>>>>
> >>>>>> Does something already exist for this purpose which could be used?
> >>>>>> Would a generic Qemu-fencing infrastructure be something of interest?
> >>>>> Dave
> >>>>>
> >>>>>
> >>>>>> Cheers,
> >>>>>> F.
> >>>>>>
> >>>>> --
> >>>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >>>>
> >>> --
> >>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >>
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
* Re: Thoughts on VM fence infrastructure
From: Felipe Franciosi @ 2019-09-30 19:23 UTC
To: Dr. David Alan Gilbert; +Cc: Aditya Ramesh, qemu-devel
> On Sep 30, 2019, at 6:59 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
>
> * Felipe Franciosi (felipe@nutanix.com) wrote:
>>
>>
>>> On Sep 30, 2019, at 6:11 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
>>>
>>> * Felipe Franciosi (felipe@nutanix.com) wrote:
>>>>
>>>>
>>>>> On Sep 30, 2019, at 5:03 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
>>>>>
>>>>> * Felipe Franciosi (felipe@nutanix.com) wrote:
>>>>>> Hi David,
>>>>>>
>>>>>>> On Sep 30, 2019, at 3:29 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
>>>>>>>
>>>>>>> * Felipe Franciosi (felipe@nutanix.com) wrote:
>>>>>>>> Heyall,
>>>>>>>>
>>>>>>>> We have a use case where a host should self-fence (and all VMs should
>>>>>>>> die) if it doesn't hear back from a heartbeat within a certain time
>>>>>>>> period. Lots of ideas were floated around where libvirt could take
>>>>>>>> care of killing VMs or a separate service could do it. The concern
>>>>>>>> with those is that various failures could lead to _those_ services
>>>>>>>> being unavailable and the fencing wouldn't be enforced as it should.
>>>>>>>>
>>>>>>>> Ultimately, it feels like Qemu should be responsible for this
>>>>>>>> heartbeat and exit (or execute a custom callback) on timeout.
>>>>>>>
>>>>>>> It doesn't feel doing it inside qemu would be any safer; something
>>>>>>> outside QEMU can forcibly emit a kill -9 and qemu *will* stop.
>>>>>>
>>>>>> The argument above is that we would have to rely on this external
>>>>>> service being functional. Consider the case where the host is
>>>>>> dysfunctional, with this service perhaps crashed and a corrupt
>>>>>> filesystem preventing it from restarting. The VMs would never die.
>>>>>
>>>>> Yeh that could fail.
>>>>>
>>>>>> It feels like a Qemu timer-driven heartbeat check and calls abort() /
>>>>>> exit() would be more reliable. Thoughts?
>>>>>
>>>>> OK, yes; perhaps using a timer_create and telling it to send a fatal
>>>>> signal is pretty solid; it would take the kernel to do that once it's
>>>>> set.
>>>>
>>>> I'm confused about why the kernel needs to be involved. If this is a
>>>> timer off the Qemu main loop, it can just check on the heartbeat
>>>> condition (which should be customisable) and call abort() if that's
>>>> not satisfied. If you agree on that I'd like to talk about how that
>>>> check could be made customisable.
>>>
>>> There are times when the main loop can get blocked even though the CPU
>>> threads can be running and can in some configurations perform IO
>>> even without the main loop (I think!).
>>
>> Ah, that's a very good point. Indeed, you can perform IO in those
>> cases specially when using vhost devices.
>>
>>> By setting a timer in the kernel that sends a signal to qemu, the kernel
>>> will send that signal however broken qemu is.
>>
>> Got you now. That's probably better. Do you reckon a signal is
>> preferable over SIGEV_THREAD?
>
> Not sure; probably the safest is getting the kernel to SIGKILL it - but
> that's a complete nightmare to debug - your process just goes *pop*
> with no apparent reason why.
> I've not used SIGEV_THREAD - it looks promising though.
I'm worried that SIGEV_THREAD could be a bit heavyweight (if it fires
up a new thread each time). On the other hand, as you said, SIGKILL
makes it harder to debug.
Also, asking the kernel to defer the SIGKILL (ie. updating the timer)
needs to come from Qemu itself (eg. a timer in the main loop,
something we already ruled unsuitable, or a qmp command which
constitutes an external dependency that we also ruled undesirable).
What if, when self-fencing is enabled, Qemu kicks off a new thread
from the start which does nothing but periodically wake up, verify the
heartbeat condition and log()+abort() if required? (Then we wouldn't
need the kernel timer.)
>
>> I'm still wondering how to make this customisable so that different
>> types of heartbeat could be implemented (preferably without creating
>> external dependencies per discussion above). Thoughts welcome.
>
> Yes, you need something to enable it, and some safe way to retrigger
> the timer. A qmp command marked as 'oob' might be the right way -
> another qm command can't block it.
This qmp approach is slightly different than the external dependency
that itself kills Qemu; if it doesn't run, then Qemu dies because the
kernel timer is not updated. But this is also a heavyweight approach.
We are talking about a service that needs to frequently connect to all
running VMs on a host to reset the timer.
But it does allow for the customisable heartbeat: the logic behind
what triggers the command is completely flexible.
Thinking about this idea of a separate Qemu thread, one thing that
came to mind is this:
qemu -fence heartbeat=/path/to/file,deadline=60[,recheck=5]
Qemu could fire up a thread that stat()s <file> (every <recheck>
seconds or on a default interval) and log()+abort() the whole process
if the last modification time of the file is older than <deadline>. If
<file> goes away (ie. stat() gives ENOENT), then it either fences
immediately or ignores it, not sure which is more sensible.
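A sketch of what that thread could look like (names and the option
plumbing are purely illustrative):

#include <sys/stat.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

struct fence_cfg {
    const char *path;       /* heartbeat=/path/to/file */
    time_t deadline;        /* deadline=60 */
    time_t recheck;         /* recheck=5 */
};

/* Started once at init, e.g. with qemu_thread_create() or pthread_create(). */
static void *fence_thread(void *opaque)
{
    struct fence_cfg *cfg = opaque;
    struct stat st;

    for (;;) {
        sleep(cfg->recheck);
        if (stat(cfg->path, &st) < 0) {
            continue;       /* or fence immediately on ENOENT; policy TBD */
        }
        if (time(NULL) - st.st_mtime > cfg->deadline) {
            /* log() the reason, then take the VMs down */
            abort();
        }
    }
    return NULL;
}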
Thoughts?
F.
>
> Dave
>
>
>> F.
>>
>>>
>>>>
>>>>>
>>>>> IMHO the safer way is to kick the host off the network by reprogramming
>>>>> switches; so even if the qemu is actually alive it can't get anywhere.
>>>>>
>>>>> Dave
>>>>
>>>> Naturally some off-host STONITH is preferable, but that's not always
>>>> available. A self-fencing mechanism right at the heart of the emulator
>>>> can do the job without external hardware dependencies.
>>>
>>> Dave
>>>
>>>> Cheers,
>>>> Felipe
>>>>
>>>>>
>>>>>
>>>>>> Felipe
>>>>>>
>>>>>>>
>>>>>>>> Does something already exist for this purpose which could be used?
>>>>>>>> Would a generic Qemu-fencing infrastructure be something of interest?
>>>>>>> Dave
>>>>>>>
>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> F.
>>>>>>>>
>>>>>>> --
>>>>>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>>>>>
>>>>> --
>>>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>>>
>>> --
>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
* Re: Thoughts on VM fence infrastructure
From: Rafael David Tinoco @ 2019-09-30 19:45 UTC
To: Dr. David Alan Gilbert, Felipe Franciosi; +Cc: Aditya Ramesh, qemu-devel
>>> There are times when the main loop can get blocked even though the CPU
>>> threads can be running and can in some configurations perform IO
>>> even without the main loop (I think!).
>> Ah, that's a very good point. Indeed, you can perform IO in those
>> cases specially when using vhost devices.
>>
>>> By setting a timer in the kernel that sends a signal to qemu, the kernel
>>> will send that signal however broken qemu is.
>> Got you now. That's probably better. Do you reckon a signal is
>> preferable over SIGEV_THREAD?
> Not sure; probably the safest is getting the kernel to SIGKILL it - but
> that's a complete nightmare to debug - your process just goes *pop*
> with no apparent reason why.
> I've not used SIGEV_THREAD - it looks promising though.
Sorry to "enter" the discussion, but in "real" HW it's not by accident
that a watchdog device timeout generates an NMI to the CPUs, causing the
kernel to handle the interrupt - and panic (or take another action set
by specific watchdog drivers that re-implement the default ones).
Can't you simply "inject" an NMI into all guest vCPUs BEFORE you take any
action in QEMU itself? Just like the virtual watchdog device would do
from inside the guest (/dev/watchdog), but capable of being updated from
the outside, in this case of yours (if I understood correctly).
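(For comparison, QEMU's existing guest-driven watchdog can already be
told to inject an NMI on timeout with something along the lines of:

    qemu-system-x86_64 ... -watchdog i6300esb -watchdog-action inject-nmi

though, if I follow the thread, in your case the heartbeat comes from
outside the guest rather than from /dev/watchdog inside it.)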
Possibly you would have to have a dedicated loop for this "watchdog
device" (AIO threads?) so it doesn't compete with existing coroutines/BH
tasks and their jitter against your "realtime watchdog needs".
Regarding any remaining I/Os for the guest's devices in question
(vhost/vhost-user etc.), it would be just like a real host where the "bus"
received commands but the sender died right after...
* Re: Thoughts on VM fence infrastructure
From: Felipe Franciosi @ 2019-09-30 20:24 UTC
To: Rafael David Tinoco; +Cc: Aditya Ramesh, Dr. David Alan Gilbert, qemu-devel
> On Sep 30, 2019, at 8:45 PM, Rafael David Tinoco <rafaeldtinoco@ubuntu.com> wrote:
>
>
>>>> There are times when the main loop can get blocked even though the CPU
>>>> threads can be running and can in some configurations perform IO
>>>> even without the main loop (I think!).
>>> Ah, that's a very good point. Indeed, you can perform IO in those
>>> cases specially when using vhost devices.
>>>
>>>> By setting a timer in the kernel that sends a signal to qemu, the kernel
>>>> will send that signal however broken qemu is.
>>> Got you now. That's probably better. Do you reckon a signal is
>>> preferable over SIGEV_THREAD?
>> Not sure; probably the safest is getting the kernel to SIGKILL it - but
>> that's a complete nightmare to debug - your process just goes *pop*
>> with no apparent reason why.
>> I've not used SIGEV_THREAD - it looks promising though.
>
> Sorry to "enter" the discussion, but, in "real" HW, its not by accident
> that watchdog devices timeout generates a NMI to CPUs, causing the
> kernel to handle the interrupt - and panic (or to take other action set
> by specific watchdog drivers that re-implements the default ones).
Not sure what you mean by "sorry"... thanks for joining. :)
> Can't you simple "inject" a NMI in all guest vCPUs BEFORE you take any
> action in QEMU itself? Just like the virtual watchdog device would do,
> from inside the guest (/dev/watchdog), but capable of being updated by
> outside, in this case of yours (if I understood correctly).
It's unclear to me how this relates to this use case; perhaps the use
case itself isn't clear, so let me elaborate. The idea is that on
various cloud deployments, a host could
be temporarily unavailable. Imagine that a network cable snapped. A
management layer could then restart the unreachable VMs elsewhere (as
part of High Availability offerings), but it needs to ensure that
disconnected host is not just going to come back from the dead with
older incarnations of the VMs running. (Imagine that someone replaced
the broken network cable.) That would result in lots of issues from
colliding IP addresses to different writers on shared storage leading
to data corruption.
The ask is for a mechanism to fence the host, essentially causing all
(or selected) VMs on that host to die. There are several mechanisms
for that, mostly requiring some sort of HW support (eg. STONITH).
Those are often focused on cases where the host requires manual
intervention to recover or at least a reset.
I'm looking to implement a mechanism for self-fencing, which doesn't
require external hardware and covers most failure scenarios (from
partially/totally broken hosts to simply a temporary network failure).
In several cases rebooting the host is unnecessary; just ensuring the
VMs are down is enough. That's almost always true on temporary network
unavailability (eg. split network).
> Possibly you would have to have a dedicated loop for this "watchdog
> device" (AIO threads ?) not to compete with existing coroutines/BH Tasks
> and their jittering on your "realtime watchdog needs".
Only when this feature is needed (which isn't the case for most
people) would there be an extra thread (per the latest proposal), and it
would be mostly idle. It would wake up every few seconds and
stat() a file, which is a very lightweight operation. That would not
measurably impact/jitter other work.
> Regarding remaining existing I/OS for the guest's devices in question
> (vhost/vhost-user etc), would be just like a real host where the "bus"
> received commands, but sender died right after...
I hope the above clarifies the idea.
Cheers,
F.
* Re: Thoughts on VM fence infrastructure
From: Dr. David Alan Gilbert @ 2019-10-01 8:23 UTC
To: Felipe Franciosi; +Cc: Aditya Ramesh, qemu-devel
* Felipe Franciosi (felipe@nutanix.com) wrote:
>
>
> > On Sep 30, 2019, at 6:59 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> >
> > * Felipe Franciosi (felipe@nutanix.com) wrote:
> >>
> >>
> >>> On Sep 30, 2019, at 6:11 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> >>>
> >>> * Felipe Franciosi (felipe@nutanix.com) wrote:
> >>>>
> >>>>
> >>>>> On Sep 30, 2019, at 5:03 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> >>>>>
> >>>>> * Felipe Franciosi (felipe@nutanix.com) wrote:
> >>>>>> Hi David,
> >>>>>>
> >>>>>>> On Sep 30, 2019, at 3:29 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> >>>>>>>
> >>>>>>> * Felipe Franciosi (felipe@nutanix.com) wrote:
> >>>>>>>> Heyall,
> >>>>>>>>
> >>>>>>>> We have a use case where a host should self-fence (and all VMs should
> >>>>>>>> die) if it doesn't hear back from a heartbeat within a certain time
> >>>>>>>> period. Lots of ideas were floated around where libvirt could take
> >>>>>>>> care of killing VMs or a separate service could do it. The concern
> >>>>>>>> with those is that various failures could lead to _those_ services
> >>>>>>>> being unavailable and the fencing wouldn't be enforced as it should.
> >>>>>>>>
> >>>>>>>> Ultimately, it feels like Qemu should be responsible for this
> >>>>>>>> heartbeat and exit (or execute a custom callback) on timeout.
> >>>>>>>
> >>>>>>> It doesn't feel doing it inside qemu would be any safer; something
> >>>>>>> outside QEMU can forcibly emit a kill -9 and qemu *will* stop.
> >>>>>>
> >>>>>> The argument above is that we would have to rely on this external
> >>>>>> service being functional. Consider the case where the host is
> >>>>>> dysfunctional, with this service perhaps crashed and a corrupt
> >>>>>> filesystem preventing it from restarting. The VMs would never die.
> >>>>>
> >>>>> Yeh that could fail.
> >>>>>
> >>>>>> It feels like a Qemu timer-driven heartbeat check and calls abort() /
> >>>>>> exit() would be more reliable. Thoughts?
> >>>>>
> >>>>> OK, yes; perhaps using a timer_create and telling it to send a fatal
> >>>>> signal is pretty solid; it would take the kernel to do that once it's
> >>>>> set.
> >>>>
> >>>> I'm confused about why the kernel needs to be involved. If this is a
> >>>> timer off the Qemu main loop, it can just check on the heartbeat
> >>>> condition (which should be customisable) and call abort() if that's
> >>>> not satisfied. If you agree on that I'd like to talk about how that
> >>>> check could be made customisable.
> >>>
> >>> There are times when the main loop can get blocked even though the CPU
> >>> threads can be running and can in some configurations perform IO
> >>> even without the main loop (I think!).
> >>
> >> Ah, that's a very good point. Indeed, you can perform IO in those
> >> cases specially when using vhost devices.
> >>
> >>> By setting a timer in the kernel that sends a signal to qemu, the kernel
> >>> will send that signal however broken qemu is.
> >>
> >> Got you now. That's probably better. Do you reckon a signal is
> >> preferable over SIGEV_THREAD?
> >
> > Not sure; probably the safest is getting the kernel to SIGKILL it - but
> > that's a complete nightmare to debug - your process just goes *pop*
> > with no apparent reason why.
> > I've not used SIGEV_THREAD - it looks promising though.
>
> I'm worried that SIGEV_THREAD could be a bit heavyweight (if it fires
> up a new thread each time). On the other hand, as you said, SIGKILL
> makes it harder to debug.
>
> Also, asking the kernel to defer the SIGKILL (ie. updating the timer)
> needs to come from Qemu itself (eg. a timer in the main loop,
> something we already ruled unsuitable, or a qmp command which
> constitutes an external dependency that we also ruled undesirable).
OK, there are two reasons I think this isn't that bad / is actually good:
a) It's an external dependency - but if it fails, the result is that the
system fails rather than keeps on running; so I think
that's the balance you were after; it's the opposite of
the external watchdog.
b) You need some external system anyway to tell QEMU when it's
OK - what's your definition of a failed system?
> What if, when self-fencing is enabled, Qemu kicks off a new thread
> from the start which does nothing but periodically wake up, verify the
> heartbeat condition and log()+abort() if required? (Then we wouldn't
> need the kernel timer.)
I'd make that thread bump the kernel timer along.
> >
> >> I'm still wondering how to make this customisable so that different
> >> types of heartbeat could be implemented (preferably without creating
> >> external dependencies per discussion above). Thoughts welcome.
> >
> > Yes, you need something to enable it, and some safe way to retrigger
> > the timer. A qmp command marked as 'oob' might be the right way -
> > another qm command can't block it.
>
> This qmp approach is slightly different than the external dependency
> that itself kills Qemu; if it doesn't run, then Qemu dies because the
> kernel timer is not updated. But this is also a heavyweight approach.
> We are talking about a service that needs to frequently connect to all
> running VMs on a host to reset the timer.
>
> But it does allow for the customisable heartbeat: the logic behind
> what triggers the command is completely flexible.
>
> Thinking about this idea of a separate Qemu thread, one thing that
> came to mind is this:
>
> qemu -fence heartbeat=/path/to/file,deadline=60[,recheck=5]
>
> Qemu could fire up a thread that stat()s <file> (every <recheck>
> seconds or on a default interval) and log()+abort() the whole process
> if the last modification time of the file is older than <deadline>. If
> <file> goes away (ie. stat() gives ENOENT), then it either fences
> immediately or ignores it, not sure which is more sensible.
>
> Thoughts?
As above; I'm OK with using a file with that; but I'd make that thread
bump the kernel timer along; if that thread gets stuck somehow the
kernel still nukes your process.
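Roughly, in the loop from your sketch (reusing the fence_rearm() helper
from before; values illustrative):

    for (;;) {
        sleep(cfg->recheck);

        /* prove this thread is alive: push the kernel SIGKILL timer out;
         * if the thread ever gets stuck, the kernel still nukes the process */
        fence_rearm(cfg->recheck * 2);

        if (stat(cfg->path, &st) == 0 &&
            time(NULL) - st.st_mtime > cfg->deadline) {
            abort();        /* soft path: log() + abort() */
        }
    }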
Dave
> F.
>
> >
> > Dave
> >
> >
> >> F.
> >>
> >>>
> >>>>
> >>>>>
> >>>>> IMHO the safer way is to kick the host off the network by reprogramming
> >>>>> switches; so even if the qemu is actually alive it can't get anywhere.
> >>>>>
> >>>>> Dave
> >>>>
> >>>> Naturally some off-host STONITH is preferable, but that's not always
> >>>> available. A self-fencing mechanism right at the heart of the emulator
> >>>> can do the job without external hardware dependencies.
> >>>
> >>> Dave
> >>>
> >>>> Cheers,
> >>>> Felipe
> >>>>
> >>>>>
> >>>>>
> >>>>>> Felipe
> >>>>>>
> >>>>>>>
> >>>>>>>> Does something already exist for this purpose which could be used?
> >>>>>>>> Would a generic Qemu-fencing infrastructure be something of interest?
> >>>>>>> Dave
> >>>>>>>
> >>>>>>>
> >>>>>>>> Cheers,
> >>>>>>>> F.
> >>>>>>>>
> >>>>>>> --
> >>>>>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >>>>>>
> >>>>> --
> >>>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >>>>
> >>> --
> >>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >>
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
* Re: Thoughts on VM fence infrastructure
From: Felipe Franciosi @ 2019-10-01 9:56 UTC
To: Dr. David Alan Gilbert; +Cc: Rafael David Tinoco, Aditya Ramesh, qemu-devel
> On Oct 1, 2019, at 9:23 AM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
>
> * Felipe Franciosi (felipe@nutanix.com) wrote:
>>
>>
>>> On Sep 30, 2019, at 6:59 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
>>>
>>> * Felipe Franciosi (felipe@nutanix.com) wrote:
>>>>
>>>>
>>>>> On Sep 30, 2019, at 6:11 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
>>>>>
>>>>> * Felipe Franciosi (felipe@nutanix.com) wrote:
>>>>>>
>>>>>>
>>>>>>> On Sep 30, 2019, at 5:03 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
>>>>>>>
>>>>>>> * Felipe Franciosi (felipe@nutanix.com) wrote:
>>>>>>>> Hi David,
>>>>>>>>
>>>>>>>>> On Sep 30, 2019, at 3:29 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
>>>>>>>>>
>>>>>>>>> * Felipe Franciosi (felipe@nutanix.com) wrote:
>>>>>>>>>> Heyall,
>>>>>>>>>>
>>>>>>>>>> We have a use case where a host should self-fence (and all VMs should
>>>>>>>>>> die) if it doesn't hear back from a heartbeat within a certain time
>>>>>>>>>> period. Lots of ideas were floated around where libvirt could take
>>>>>>>>>> care of killing VMs or a separate service could do it. The concern
>>>>>>>>>> with those is that various failures could lead to _those_ services
>>>>>>>>>> being unavailable and the fencing wouldn't be enforced as it should.
>>>>>>>>>>
>>>>>>>>>> Ultimately, it feels like Qemu should be responsible for this
>>>>>>>>>> heartbeat and exit (or execute a custom callback) on timeout.
>>>>>>>>>
>>>>>>>>> It doesn't feel doing it inside qemu would be any safer; something
>>>>>>>>> outside QEMU can forcibly emit a kill -9 and qemu *will* stop.
>>>>>>>>
>>>>>>>> The argument above is that we would have to rely on this external
>>>>>>>> service being functional. Consider the case where the host is
>>>>>>>> dysfunctional, with this service perhaps crashed and a corrupt
>>>>>>>> filesystem preventing it from restarting. The VMs would never die.
>>>>>>>
>>>>>>> Yeh that could fail.
>>>>>>>
>>>>>>>> It feels like a Qemu timer-driven heartbeat check and calls abort() /
>>>>>>>> exit() would be more reliable. Thoughts?
>>>>>>>
>>>>>>> OK, yes; perhaps using a timer_create and telling it to send a fatal
>>>>>>> signal is pretty solid; it would take the kernel to do that once it's
>>>>>>> set.
>>>>>>
>>>>>> I'm confused about why the kernel needs to be involved. If this is a
>>>>>> timer off the Qemu main loop, it can just check on the heartbeat
>>>>>> condition (which should be customisable) and call abort() if that's
>>>>>> not satisfied. If you agree on that I'd like to talk about how that
>>>>>> check could be made customisable.
>>>>>
>>>>> There are times when the main loop can get blocked even though the CPU
>>>>> threads can be running and can in some configurations perform IO
>>>>> even without the main loop (I think!).
>>>>
>>>> Ah, that's a very good point. Indeed, you can perform IO in those
>>>> cases specially when using vhost devices.
>>>>
>>>>> By setting a timer in the kernel that sends a signal to qemu, the kernel
>>>>> will send that signal however broken qemu is.
>>>>
>>>> Got you now. That's probably better. Do you reckon a signal is
>>>> preferable over SIGEV_THREAD?
>>>
>>> Not sure; probably the safest is getting the kernel to SIGKILL it - but
>>> that's a complete nightmare to debug - your process just goes *pop*
>>> with no apparent reason why.
>>> I've not used SIGEV_THREAD - it looks promising though.
>>
>> I'm worried that SIGEV_THREAD could be a bit heavyweight (if it fires
>> up a new thread each time). On the other hand, as you said, SIGKILL
>> makes it harder to debug.
>>
>> Also, asking the kernel to defer the SIGKILL (ie. updating the timer)
>> needs to come from Qemu itself (eg. a timer in the main loop,
>> something we already ruled unsuitable, or a qmp command which
>> constitutes an external dependency that we also ruled undesirable).
>
> OK, there's two reasons I think this isn't that bad/is good:
> a) It's an external dependency - but if it fails the result is the
> system fails, rather than the system keeps on running; so I think
> that's the balance you were after; it's the opposite from
> the external watchdog.
Right. I like where you are coming from. And I think a mix of these
may be the best way forwards. I'll elaborate on it below.
>
> b) You need some external system anyway to tell QEMU when it's
> OK - what's your definitino of a failed system?
The feature is targeted at providing a self-fencing mechanism for
Qemu. If a host is unreachable for whatever reason (eg. sshd down, ovs
died, oomkiller took services out, physical network failure), it
should guarantee that VMs won't be running after a certain amount of
time. To your point, if this external software doesn't come in and
touch the file, that's because it can't reach the host or it wants the
host to self-fence. The qualifying Qemus should therefore be
considered dead after a "deadline" period (since the last time the
control file was touched).
>
>> What if, when self-fencing is enabled, Qemu kicks off a new thread
>> from the start which does nothing but periodically wake up, verify the
>> heartbeat condition and log()+abort() if required? (Then we wouldn't
>> need the kernel timer.)
>
> I'd make that thread bump the kernel timer along.
I think combining the thread's logic with the kernel timer makes the
whole thing a lot more solid. See below.
>
>>>
>>>> I'm still wondering how to make this customisable so that different
>>>> types of heartbeat could be implemented (preferably without creating
>>>> external dependencies per discussion above). Thoughts welcome.
>>>
>>> Yes, you need something to enable it, and some safe way to retrigger
>>> the timer. A qmp command marked as 'oob' might be the right way -
>>> another qm command can't block it.
>>
>> This qmp approach is slightly different than the external dependency
>> that itself kills Qemu; if it doesn't run, then Qemu dies because the
>> kernel timer is not updated. But this is also a heavyweight approach.
>> We are talking about a service that needs to frequently connect to all
>> running VMs on a host to reset the timer.
>>
>> But it does allow for the customisable heartbeat: the logic behind
>> what triggers the command is completely flexible.
>>
>> Thinking about this idea of a separate Qemu thread, one thing that
>> came to mind is this:
>>
>> qemu -fence heartbeat=/path/to/file,deadline=60[,recheck=5]
>>
>> Qemu could fire up a thread that stat()s <file> (every <recheck>
>> seconds or on a default interval) and log()+abort() the whole process
>> if the last modification time of the file is older than <deadline>. If
>> <file> goes away (ie. stat() gives ENOENT), then it either fences
>> immediately or ignores it, not sure which is more sensible.
>>
>> Thoughts?
>
> As above; I'm OK with using a file with that; but I'd make that thread
> bump the kernel timer along; if that thread gets stuck somehow the
> kernel still nukes your process.
Awesome. So check this out:
qemu -fence heartbeat=/path/to/file,deadline=60[,recheck=5][,harddeadline=61]
We can default <harddeadline> to <deadline+1> and enforce that:
- <deadline> is a multiple of <recheck>.
- <harddeadline> is bigger than <deadline>
When <deadline> expires, we can log() + abort(), but if <harddeadline>
expires, we can rest assured the kernel will come around and SIGKILL
Qemu. If there's demand for it, this can later be enhanced by adding
more parameters which set the fence thread scheduling priority, &c.
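In other words, the option handling could boil down to something like
this (a self-contained sketch; none of these option names exist in QEMU
today):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

struct fence_cfg {
    const char *path;       /* heartbeat= */
    time_t deadline;        /* soft limit: fence thread log()s + abort()s */
    time_t recheck;         /* fence thread wake-up interval */
    time_t harddeadline;    /* hard limit: kernel timer SIGKILLs */
};

static void fence_validate(struct fence_cfg *cfg)
{
    if (cfg->harddeadline == 0) {
        cfg->harddeadline = cfg->deadline + 1;      /* default */
    }
    if (cfg->recheck == 0 || cfg->deadline % cfg->recheck != 0) {
        fprintf(stderr, "fence: deadline must be a multiple of recheck\n");
        exit(1);
    }
    if (cfg->harddeadline <= cfg->deadline) {
        fprintf(stderr, "fence: harddeadline must be greater than deadline\n");
        exit(1);
    }
}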
If that sounds ok I'll send an RFC as soon as I get a chance and we
can take it from there.
F.
>
> Dave
>
>> F.
>>
>>>
>>> Dave
>>>
>>>
>>>> F.
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> IMHO the safer way is to kick the host off the network by reprogramming
>>>>>>> switches; so even if the qemu is actually alive it can't get anywhere.
>>>>>>>
>>>>>>> Dave
>>>>>>
>>>>>> Naturally some off-host STONITH is preferable, but that's not always
>>>>>> available. A self-fencing mechanism right at the heart of the emulator
>>>>>> can do the job without external hardware dependencies.
>>>>>
>>>>> Dave
>>>>>
>>>>>> Cheers,
>>>>>> Felipe
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Felipe
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Does something already exist for this purpose which could be used?
>>>>>>>>>> Would a generic Qemu-fencing infrastructure be something of interest?
>>>>>>>>> Dave
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> F.
>>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>>>>>>>
>>>>>>> --
>>>>>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>>>>>
>>>>> --
>>>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>>>
>>> --
>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
* Re: Thoughts on VM fence infrastructure
From: Dr. David Alan Gilbert @ 2019-10-01 10:05 UTC
To: Felipe Franciosi, armbru, berrange
Cc: Rafael David Tinoco, Aditya Ramesh, qemu-devel
* Felipe Franciosi (felipe@nutanix.com) wrote:
>
>
> > On Oct 1, 2019, at 9:23 AM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> >
> > * Felipe Franciosi (felipe@nutanix.com) wrote:
> >>
> >>
> >>> On Sep 30, 2019, at 6:59 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> >>>
> >>> * Felipe Franciosi (felipe@nutanix.com) wrote:
> >>>>
> >>>>
> >>>>> On Sep 30, 2019, at 6:11 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> >>>>>
> >>>>> * Felipe Franciosi (felipe@nutanix.com) wrote:
> >>>>>>
> >>>>>>
> >>>>>>> On Sep 30, 2019, at 5:03 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> >>>>>>>
> >>>>>>> * Felipe Franciosi (felipe@nutanix.com) wrote:
> >>>>>>>> Hi David,
> >>>>>>>>
> >>>>>>>>> On Sep 30, 2019, at 3:29 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> >>>>>>>>>
> >>>>>>>>> * Felipe Franciosi (felipe@nutanix.com) wrote:
> >>>>>>>>>> Heyall,
> >>>>>>>>>>
> >>>>>>>>>> We have a use case where a host should self-fence (and all VMs should
> >>>>>>>>>> die) if it doesn't hear back from a heartbeat within a certain time
> >>>>>>>>>> period. Lots of ideas were floated around where libvirt could take
> >>>>>>>>>> care of killing VMs or a separate service could do it. The concern
> >>>>>>>>>> with those is that various failures could lead to _those_ services
> >>>>>>>>>> being unavailable and the fencing wouldn't be enforced as it should.
> >>>>>>>>>>
> >>>>>>>>>> Ultimately, it feels like Qemu should be responsible for this
> >>>>>>>>>> heartbeat and exit (or execute a custom callback) on timeout.
> >>>>>>>>>
> >>>>>>>>> It doesn't feel doing it inside qemu would be any safer; something
> >>>>>>>>> outside QEMU can forcibly emit a kill -9 and qemu *will* stop.
> >>>>>>>>
> >>>>>>>> The argument above is that we would have to rely on this external
> >>>>>>>> service being functional. Consider the case where the host is
> >>>>>>>> dysfunctional, with this service perhaps crashed and a corrupt
> >>>>>>>> filesystem preventing it from restarting. The VMs would never die.
> >>>>>>>
> >>>>>>> Yeh that could fail.
> >>>>>>>
> >>>>>>>> It feels like a Qemu timer-driven heartbeat check and calls abort() /
> >>>>>>>> exit() would be more reliable. Thoughts?
> >>>>>>>
> >>>>>>> OK, yes; perhaps using a timer_create and telling it to send a fatal
> >>>>>>> signal is pretty solid; it would take the kernel to do that once it's
> >>>>>>> set.
> >>>>>>
> >>>>>> I'm confused about why the kernel needs to be involved. If this is a
> >>>>>> timer off the Qemu main loop, it can just check on the heartbeat
> >>>>>> condition (which should be customisable) and call abort() if that's
> >>>>>> not satisfied. If you agree on that I'd like to talk about how that
> >>>>>> check could be made customisable.
> >>>>>
> >>>>> There are times when the main loop can get blocked even though the CPU
> >>>>> threads can be running and can in some configurations perform IO
> >>>>> even without the main loop (I think!).
> >>>>
> >>>> Ah, that's a very good point. Indeed, you can perform IO in those
> >>>> cases specially when using vhost devices.
> >>>>
> >>>>> By setting a timer in the kernel that sends a signal to qemu, the kernel
> >>>>> will send that signal however broken qemu is.
> >>>>
> >>>> Got you now. That's probably better. Do you reckon a signal is
> >>>> preferable over SIGEV_THREAD?
> >>>
> >>> Not sure; probably the safest is getting the kernel to SIGKILL it - but
> >>> that's a complete nightmare to debug - your process just goes *pop*
> >>> with no apparent reason why.
> >>> I've not used SIGEV_THREAD - it looks promising though.
> >>
> >> I'm worried that SIGEV_THREAD could be a bit heavyweight (if it fires
> >> up a new thread each time). On the other hand, as you said, SIGKILL
> >> makes it harder to debug.
> >>
> >> Also, asking the kernel to defer the SIGKILL (ie. updating the timer)
> >> needs to come from Qemu itself (eg. a timer in the main loop,
> >> something we already ruled unsuitable, or a qmp command which
> >> constitutes an external dependency that we also ruled undesirable).
> >
> > OK, there's two reasons I think this isn't that bad/is good:
> > a) It's an external dependency - but if it fails the result is the
> > system fails, rather than the system keeps on running; so I think
> > that's the balance you were after; it's the opposite from
> > the external watchdog.
>
> Right. I like where you are coming from. And I think a mix of these
> may be the best way forwards. I'll elaborate on it below.
>
> >
> > b) You need some external system anyway to tell QEMU when it's
> > OK - what's your definitino of a failed system?
>
> The feature is targeted at providing a self-fencing mechanism for
> Qemu. If a host is unreachable for whatever reason (eg. sshd down, ovs
> died, oomkiller took services out, physical network failure), it
> should guarantee that VMs won't be running after a certain amount of
> time. To your point, if this external software doesn't come in and
> touch the file, that's because it can't reach the host or it wants the
> host to self-fence. The qualifying Qemus should therefore be
> considered dead after a "deadline" period (since the last time the
> control file was touched).
>
> >
> >> What if, when self-fencing is enabled, Qemu kicks off a new thread
> >> from the start which does nothing but periodically wake up, verify the
> >> heartbeat condition and log()+abort() if required? (Then we wouldn't
> >> need the kernel timer.)
> >
> > I'd make that thread bump the kernel timer along.
>
> I think combining the thread's logic with the kernel timer makes the
> whole thing a lot more solid. See below.
>
> >
> >>>
> >>>> I'm still wondering how to make this customisable so that different
> >>>> types of heartbeat could be implemented (preferably without creating
> >>>> external dependencies per discussion above). Thoughts welcome.
> >>>
> >>> Yes, you need something to enable it, and some safe way to retrigger
> >>> the timer. A qmp command marked as 'oob' might be the right way -
> >>> another qmp command can't block it.
> >>
> >> This qmp approach is slightly different than the external dependency
> >> that itself kills Qemu; if it doesn't run, then Qemu dies because the
> >> kernel timer is not updated. But this is also a heavyweight approach.
> >> We are talking about a service that needs to frequently connect to all
> >> running VMs on a host to reset the timer.
> >>
> >> But it does allow for the customisable heartbeat: the logic behind
> >> what triggers the command is completely flexible.
> >>
> >> Thinking about this idea of a separate Qemu thread, one thing that
> >> came to mind is this:
> >>
> >> qemu -fence heartbeat=/path/to/file,deadline=60[,recheck=5]
> >>
> >> Qemu could fire up a thread that stat()s <file> (every <recheck>
> >> seconds or on a default interval) and log()+abort() the whole process
> >> if the last modification time of the file is older than <deadline>. If
> >> <file> goes away (ie. stat() gives ENOENT), then it either fences
> >> immediately or ignores it, not sure which is more sensible.
> >>
> >> Thoughts?
> >
> > As above; I'm OK with using a file with that; but I'd make that thread
> > bump the kernel timer along; if that thread gets stuck somehow the
> > kernel still nukes your process.
>
>
> Awesome. So check this out:
>
> qemu -fence heartbeat=/path/to/file,deadline=60[,recheck=5][,harddeadline=61]
>
> We can default <harddeadline> to <deadline+1> and enforce that:
> - <deadline> is a multiple of <recheck>.
> - <harddeadline> is bigger than <deadline>
>
> When <deadline> expires, we can log() + abort(), but if <harddeadline>
> expires, we can rest assured the kernel will come around and SIGKILL
> Qemu. If there's demand for it, this can later be enhanced by adding
> more parameters which set the fence thread scheduling priority, &c.
>
> If that sounds ok I'll send an RFC as soon as I get a chance and we
> can take it from there.
So I think I'm OK with that; but I've copied in Markus and Daniel who
normally have ideas on how the command line/libvirt interface should
look.
Dave
> F.
>
> >
> > Dave
> >
> >> F.
> >>
> >>>
> >>> Dave
> >>>
> >>>
> >>>> F.
> >>>>
> >>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> IMHO the safer way is to kick the host off the network by reprogramming
> >>>>>>> switches; so even if the qemu is actually alive it can't get anywhere.
> >>>>>>>
> >>>>>>> Dave
> >>>>>>
> >>>>>> Naturally some off-host STONITH is preferable, but that's not always
> >>>>>> available. A self-fencing mechanism right at the heart of the emulator
> >>>>>> can do the job without external hardware dependencies.
> >>>>>
> >>>>> Dave
> >>>>>
> >>>>>> Cheers,
> >>>>>> Felipe
> >>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>> Felipe
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> Does something already exist for this purpose which could be used?
> >>>>>>>>>> Would a generic Qemu-fencing infrastructure be something of interest?
> >>>>>>>>> Dave
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> Cheers,
> >>>>>>>>>> F.
> >>>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >>>>>>>>
> >>>>>>> --
> >>>>>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >>>>>>
> >>>>> --
> >>>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >>>>
> >>> --
> >>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >>
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Thoughts on VM fence infrastructure
2019-10-01 9:56 ` Felipe Franciosi
2019-10-01 10:05 ` Dr. David Alan Gilbert
@ 2019-10-01 10:31 ` Daniel P. Berrangé
2019-10-01 10:46 ` Felipe Franciosi
1 sibling, 1 reply; 19+ messages in thread
From: Daniel P. Berrangé @ 2019-10-01 10:31 UTC (permalink / raw)
To: Felipe Franciosi
Cc: Rafael David Tinoco, Aditya Ramesh, Dr. David Alan Gilbert,
qemu-devel
On Tue, Oct 01, 2019 at 09:56:17AM +0000, Felipe Franciosi wrote:
>
>
> > On Oct 1, 2019, at 9:23 AM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> >
> > * Felipe Franciosi (felipe@nutanix.com) wrote:
> >>
> >>
> >>> On Sep 30, 2019, at 6:59 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> >>>
> >>> * Felipe Franciosi (felipe@nutanix.com) wrote:
> >>>>
> >>>>
> >>>>> On Sep 30, 2019, at 6:11 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> >>>>>
> >>>>> * Felipe Franciosi (felipe@nutanix.com) wrote:
> >>>>>>
> >>>>>>
> >>>>>>> On Sep 30, 2019, at 5:03 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> >>>>>>>
> >>>>>>> * Felipe Franciosi (felipe@nutanix.com) wrote:
> >>>>>>>> Hi David,
> >>>>>>>>
> >>>>>>>>> On Sep 30, 2019, at 3:29 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> >>>>>>>>>
> >>>>>>>>> * Felipe Franciosi (felipe@nutanix.com) wrote:
> >>>>>>>>>> Heyall,
> >>>>>>>>>>
> >>>>>>>>>> We have a use case where a host should self-fence (and all VMs should
> >>>>>>>>>> die) if it doesn't hear back from a heartbeat within a certain time
> >>>>>>>>>> period. Lots of ideas were floated around where libvirt could take
> >>>>>>>>>> care of killing VMs or a separate service could do it. The concern
> >>>>>>>>>> with those is that various failures could lead to _those_ services
> >>>>>>>>>> being unavailable and the fencing wouldn't be enforced as it should.
> >>>>>>>>>>
> >>>>>>>>>> Ultimately, it feels like Qemu should be responsible for this
> >>>>>>>>>> heartbeat and exit (or execute a custom callback) on timeout.
> >>>>>>>>>
> >>>>>>>>> It doesn't feel doing it inside qemu would be any safer; something
> >>>>>>>>> outside QEMU can forcibly emit a kill -9 and qemu *will* stop.
> >>>>>>>>
> >>>>>>>> The argument above is that we would have to rely on this external
> >>>>>>>> service being functional. Consider the case where the host is
> >>>>>>>> dysfunctional, with this service perhaps crashed and a corrupt
> >>>>>>>> filesystem preventing it from restarting. The VMs would never die.
> >>>>>>>
> >>>>>>> Yeh that could fail.
> >>>>>>>
> >>>>>>>> It feels like a Qemu timer-driven heartbeat check and calls abort() /
> >>>>>>>> exit() would be more reliable. Thoughts?
> >>>>>>>
> >>>>>>> OK, yes; perhaps using a timer_create and telling it to send a fatal
> >>>>>>> signal is pretty solid; it would take the kernel to do that once it's
> >>>>>>> set.
> >>>>>>
> >>>>>> I'm confused about why the kernel needs to be involved. If this is a
> >>>>>> timer off the Qemu main loop, it can just check on the heartbeat
> >>>>>> condition (which should be customisable) and call abort() if that's
> >>>>>> not satisfied. If you agree on that I'd like to talk about how that
> >>>>>> check could be made customisable.
> >>>>>
> >>>>> There are times when the main loop can get blocked even though the CPU
> >>>>> threads can be running and can in some configurations perform IO
> >>>>> even without the main loop (I think!).
> >>>>
> >>>> Ah, that's a very good point. Indeed, you can perform IO in those
> >>>> cases especially when using vhost devices.
> >>>>
> >>>>> By setting a timer in the kernel that sends a signal to qemu, the kernel
> >>>>> will send that signal however broken qemu is.
> >>>>
> >>>> Got you now. That's probably better. Do you reckon a signal is
> >>>> preferable over SIGEV_THREAD?
> >>>
> >>> Not sure; probably the safest is getting the kernel to SIGKILL it - but
> >>> that's a complete nightmare to debug - your process just goes *pop*
> >>> with no apparent reason why.
> >>> I've not used SIGEV_THREAD - it looks promising though.
> >>
> >> I'm worried that SIGEV_THREAD could be a bit heavyweight (if it fires
> >> up a new thread each time). On the other hand, as you said, SIGKILL
> >> makes it harder to debug.
> >>
> >> Also, asking the kernel to defer the SIGKILL (ie. updating the timer)
> >> needs to come from Qemu itself (eg. a timer in the main loop,
> >> something we already ruled unsuitable, or a qmp command which
> >> constitutes an external dependency that we also ruled undesirable).
> >
> > OK, there's two reasons I think this isn't that bad/is good:
> > a) It's an external dependency - but if it fails the result is the
> > system fails, rather than the system keeps on running; so I think
> > that's the balance you were after; it's the opposite from
> > the external watchdog.
>
> Right. I like where you are coming from. And I think a mix of these
> may be the best way forwards. I'll elaborate on it below.
>
> >
> > b) You need some external system anyway to tell QEMU when it's
> > OK - what's your definition of a failed system?
>
> The feature is targeted at providing a self-fencing mechanism for
> Qemu. If a host is unreachable for whatever reason (eg. sshd down, ovs
> died, oomkiller took services out, physical network failure), it
> should guarantee that VMs won't be running after a certain amount of
> time. To your point, if this external software doesn't come in and
> touch the file, that's because it can't reach the host or it wants the
> host to self-fence. The qualifying Qemus should therefore be
> considered dead after a "deadline" period (since the last time the
> control file was touched).
This all sounds reasonable, but I don't see the value in doing this
work in QEMU.
>
> >
> >> What if, when self-fencing is enabled, Qemu kicks off a new thread
> >> from the start which does nothing but periodically wake up, verify the
> >> heartbeat condition and log()+abort() if required? (Then we wouldn't
> >> need the kernel timer.)
> >
> > I'd make that thread bump the kernel timer along.
>
> I think combining the thread's logic with the kernel timer makes the
> whole thing a lot more solid. See below.
>
> >
> >>>
> >>>> I'm still wondering how to make this customisable so that different
> >>>> types of heartbeat could be implemented (preferably without creating
> >>>> external dependencies per discussion above). Thoughts welcome.
> >>>
> >>> Yes, you need something to enable it, and some safe way to retrigger
> >>> the timer. A qmp command marked as 'oob' might be the right way -
> >>> another qmp command can't block it.
> >>
> >> This qmp approach is slightly different than the external dependency
> >> that itself kills Qemu; if it doesn't run, then Qemu dies because the
> >> kernel timer is not updated. But this is also a heavyweight approach.
> >> We are talking about a service that needs to frequently connect to all
> >> running VMs on a host to reset the timer.
> >>
> >> But it does allow for the customisable heartbeat: the logic behind
> >> what triggers the command is completely flexible.
> >>
> >> Thinking about this idea of a separate Qemu thread, one thing that
> >> came to mind is this:
> >>
> >> qemu -fence heartbeat=/path/to/file,deadline=60[,recheck=5]
> >>
> >> Qemu could fire up a thread that stat()s <file> (every <recheck>
> >> seconds or on a default interval) and log()+abort() the whole process
> >> if the last modification time of the file is older than <deadline>. If
> >> <file> goes away (ie. stat() gives ENOENT), then it either fences
> >> immediately or ignores it, not sure which is more sensible.
> >>
> >> Thoughts?
> >
> > As above; I'm OK with using a file with that; but I'd make that thread
> > bump the kernel timer along; if that thread gets stuck somehow the
> > kernel still nukes your process.
>
>
> Awesome. So check this out:
>
> qemu -fence heartbeat=/path/to/file,deadline=60[,recheck=5][,harddeadline=61]
>
> We can default <harddeadline> to <deadline+1> and enforce that:
> - <deadline> is a multiple of <recheck>.
> - <harddeadline> is bigger than <deadline>
>
> When <deadline> expires, we can log() + abort(), but if <harddeadline>
> expires, we can rest assured the kernel will come around and SIGKILL
> Qemu. If there's demand for it, this can later be enhanced by adding
> more parameters which set the fence thread scheduling priority, &c.
>
> If that sounds ok I'll send an RFC as soon as I get a chance and we
> can take it from there.
I don't really see the point in doing any of this in QEMU, as opposed to
using the general purpose self-fencing features of the host OS. As an
example, hardware watchdogs are a built-in feature of systemd
"To make use of the hardware watchdog it is sufficient to set the
RuntimeWatchdogSec= option in /etc/systemd/system.conf. It defaults
to 0 (i.e. no hardware watchdog use). Set it to a value like 20s
and the watchdog is enabled. After 20s of no keep-alive pings the
hardware will reset itself. Note that systemd will send a ping to
the hardware at half the specified interval, i.e. every 10s. And
that's already all there is to it. By enabling this single, simple
option you have turned on supervision by the hardware of systemd
and the kernel beneath it.[2]"
http://0pointer.de/blog/projects/watchdog.html
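For completeness, the configuration being described is just this (a sketch
only, assuming a stock systemd; the exact location of system.conf can vary
by distribution):

  # /etc/systemd/system.conf
  [Manager]
  RuntimeWatchdogSec=20s

A systemctl daemon-reexec (or a reboot) should be enough for PID 1 to pick
that up and start pinging /dev/watchdog at half the configured interval.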
When a host becomes non-responsive, for example, due to a network error
I would not have confidence in QEMU being reliable enough to trigger
any self-fencing code. I've seen many bug reports where QEMU has entirely
hung due to non-responsive network based storage.
IMHO doing this at the host OS level is going to be more reliable in
terms of detecting the problem in the first place, as well as more
reliable in taking the action - it's very difficult for a hardware CPU
reset to fail to work.
Regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Thoughts on VM fence infrastructure
2019-10-01 10:31 ` Daniel P. Berrangé
@ 2019-10-01 10:46 ` Felipe Franciosi
2019-10-01 11:10 ` Daniel P. Berrangé
0 siblings, 1 reply; 19+ messages in thread
From: Felipe Franciosi @ 2019-10-01 10:46 UTC (permalink / raw)
To: Daniel P. Berrangé
Cc: Rafael David Tinoco, Aditya Ramesh, Dr. David Alan Gilbert,
qemu-devel
Hi Daniel!
> On Oct 1, 2019, at 11:31 AM, Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Tue, Oct 01, 2019 at 09:56:17AM +0000, Felipe Franciosi wrote:
>>
>>
>>> On Oct 1, 2019, at 9:23 AM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
>>>
>>> * Felipe Franciosi (felipe@nutanix.com) wrote:
>>>>
>>>>
>>>>> On Sep 30, 2019, at 6:59 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
>>>>>
>>>>> * Felipe Franciosi (felipe@nutanix.com) wrote:
>>>>>>
>>>>>>
>>>>>>> On Sep 30, 2019, at 6:11 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
>>>>>>>
>>>>>>> * Felipe Franciosi (felipe@nutanix.com) wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>> On Sep 30, 2019, at 5:03 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
>>>>>>>>>
>>>>>>>>> * Felipe Franciosi (felipe@nutanix.com) wrote:
>>>>>>>>>> Hi David,
>>>>>>>>>>
>>>>>>>>>>> On Sep 30, 2019, at 3:29 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> * Felipe Franciosi (felipe@nutanix.com) wrote:
>>>>>>>>>>>> Heyall,
>>>>>>>>>>>>
>>>>>>>>>>>> We have a use case where a host should self-fence (and all VMs should
>>>>>>>>>>>> die) if it doesn't hear back from a heartbeat within a certain time
>>>>>>>>>>>> period. Lots of ideas were floated around where libvirt could take
>>>>>>>>>>>> care of killing VMs or a separate service could do it. The concern
>>>>>>>>>>>> with those is that various failures could lead to _those_ services
>>>>>>>>>>>> being unavailable and the fencing wouldn't be enforced as it should.
>>>>>>>>>>>>
>>>>>>>>>>>> Ultimately, it feels like Qemu should be responsible for this
>>>>>>>>>>>> heartbeat and exit (or execute a custom callback) on timeout.
>>>>>>>>>>>
>>>>>>>>>>> It doesn't feel doing it inside qemu would be any safer; something
>>>>>>>>>>> outside QEMU can forcibly emit a kill -9 and qemu *will* stop.
>>>>>>>>>>
>>>>>>>>>> The argument above is that we would have to rely on this external
>>>>>>>>>> service being functional. Consider the case where the host is
>>>>>>>>>> dysfunctional, with this service perhaps crashed and a corrupt
>>>>>>>>>> filesystem preventing it from restarting. The VMs would never die.
>>>>>>>>>
>>>>>>>>> Yeh that could fail.
>>>>>>>>>
>>>>>>>>>> It feels like a Qemu timer-driven heartbeat check and calls abort() /
>>>>>>>>>> exit() would be more reliable. Thoughts?
>>>>>>>>>
>>>>>>>>> OK, yes; perhaps using a timer_create and telling it to send a fatal
>>>>>>>>> signal is pretty solid; it would take the kernel to do that once it's
>>>>>>>>> set.
>>>>>>>>
>>>>>>>> I'm confused about why the kernel needs to be involved. If this is a
>>>>>>>> timer off the Qemu main loop, it can just check on the heartbeat
>>>>>>>> condition (which should be customisable) and call abort() if that's
>>>>>>>> not satisfied. If you agree on that I'd like to talk about how that
>>>>>>>> check could be made customisable.
>>>>>>>
>>>>>>> There are times when the main loop can get blocked even though the CPU
>>>>>>> threads can be running and can in some configurations perform IO
>>>>>>> even without the main loop (I think!).
>>>>>>
>>>>>> Ah, that's a very good point. Indeed, you can perform IO in those
>>>>>> cases especially when using vhost devices.
>>>>>>
>>>>>>> By setting a timer in the kernel that sends a signal to qemu, the kernel
>>>>>>> will send that signal however broken qemu is.
>>>>>>
>>>>>> Got you now. That's probably better. Do you reckon a signal is
>>>>>> preferable over SIGEV_THREAD?
>>>>>
>>>>> Not sure; probably the safest is getting the kernel to SIGKILL it - but
>>>>> that's a complete nightmare to debug - your process just goes *pop*
>>>>> with no apparent reason why.
>>>>> I've not used SIGEV_THREAD - it looks promising though.
>>>>
>>>> I'm worried that SIGEV_THREAD could be a bit heavyweight (if it fires
>>>> up a new thread each time). On the other hand, as you said, SIGKILL
>>>> makes it harder to debug.
>>>>
>>>> Also, asking the kernel to defer the SIGKILL (ie. updating the timer)
>>>> needs to come from Qemu itself (eg. a timer in the main loop,
>>>> something we already ruled unsuitable, or a qmp command which
>>>> constitutes an external dependency that we also ruled undesirable).
>>>
>>> OK, there's two reasons I think this isn't that bad/is good:
>>> a) It's an external dependency - but if it fails the result is the
>>> system fails, rather than the system keeps on running; so I think
>>> that's the balance you were after; it's the opposite from
>>> the external watchdog.
>>
>> Right. I like where you are coming from. And I think a mix of these
>> may be the best way forwards. I'll elaborate on it below.
>>
>>>
>>> b) You need some external system anyway to tell QEMU when it's
>>> OK - what's your definition of a failed system?
>>
>> The feature is targeted at providing a self-fencing mechanism for
>> Qemu. If a host is unreachable for whatever reason (eg. sshd down, ovs
>> died, oomkiller took services out, physical network failure), it
>> should guarantee that VMs won't be running after a certain amount of
>> time. To your point, if this external software doesn't come in and
>> touch the file, that's because it can't reach the host or it wants the
>> host to self-fence. The qualifying Qemus should therefore be
>> considered dead after a "deadline" period (since the last time the
>> control file was touched).
>
> This all sounds reasonable, but I don't see the value in doing this
> work in QEMU.
I'll elaborate below.
>
>
>>
>>>
>>>> What if, when self-fencing is enabled, Qemu kicks off a new thread
>>>> from the start which does nothing but periodically wake up, verify the
>>>> heartbeat condition and log()+abort() if required? (Then we wouldn't
>>>> need the kernel timer.)
>>>
>>> I'd make that thread bump the kernel timer along.
>>
>> I think combining the thread's logic with the kernel timer makes the
>> whole thing a lot more solid. See below.
>>
>>>
>>>>>
>>>>>> I'm still wondering how to make this customisable so that different
>>>>>> types of heartbeat could be implemented (preferably without creating
>>>>>> external dependencies per discussion above). Thoughts welcome.
>>>>>
>>>>> Yes, you need something to enable it, and some safe way to retrigger
>>>>> the timer. A qmp command marked as 'oob' might be the right way -
>>>>> another qmp command can't block it.
>>>>
>>>> This qmp approach is slightly different than the external dependency
>>>> that itself kills Qemu; if it doesn't run, then Qemu dies because the
>>>> kernel timer is not updated. But this is also a heavyweight approach.
>>>> We are talking about a service that needs to frequently connect to all
>>>> running VMs on a host to reset the timer.
>>>>
>>>> But it does allow for the customisable heartbeat: the logic behind
>>>> what triggers the command is completely flexible.
>>>>
>>>> Thinking about this idea of a separate Qemu thread, one thing that
>>>> came to mind is this:
>>>>
>>>> qemu -fence heartbeat=/path/to/file,deadline=60[,recheck=5]
>>>>
>>>> Qemu could fire up a thread that stat()s <file> (every <recheck>
>>>> seconds or on a default interval) and log()+abort() the whole process
>>>> if the last modification time of the file is older than <deadline>. If
>>>> <file> goes away (ie. stat() gives ENOENT), then it either fences
>>>> immediately or ignores it, not sure which is more sensible.
>>>>
>>>> Thoughts?
>>>
>>> As above; I'm OK with using a file with that; but I'd make that thread
>>> bump the kernel timer along; if that thread gets stuck somehow the
>>> kernel still nukes your process.
>>
>>
>> Awesome. So check this out:
>>
>> qemu -fence heartbeat=/path/to/file,deadline=60[,recheck=5][,harddeadline=61]
>>
>> We can default <harddeadline> to <deadline+1> and enforce that:
>> - <deadline> is a multiple of <recheck>.
>> - <harddeadline> is bigger than <deadline>
>>
>> When <deadline> expires, we can log() + abort(), but if <harddeadline>
>> expires, we can rest assured the kernel will come around and SIGKILL
>> Qemu. If there's demand for it, this can later be enhanced by adding
>> more parameters which set the fence thread scheduling priority, &c.
>>
>> If that sounds ok I'll send an RFC as soon as I get a chance and we
>> can take it from there.
>
> I don't really see the point in doing any of this in QEMU, as opposed to
> using the general purpose self-fencing features of the host OS. As an
> example, hardware watchdogs are a built-in feature of systemd
>
> "To make use of the hardware watchdog it is sufficient to set the
> RuntimeWatchdogSec= option in /etc/systemd/system.conf. It defaults
> to 0 (i.e. no hardware watchdog use). Set it to a value like 20s
> and the watchdog is enabled. After 20s of no keep-alive pings the
> hardware will reset itself. Note that systemd will send a ping to
> the hardware at half the specified interval, i.e. every 10s. And
> that's already all there is to it. By enabling this single, simple
> option you have turned on supervision by the hardware of systemd
> and the kernel beneath it.[2]"
>
> https://urldefense.proofpoint.com/v2/url?u=http-3A__0pointer.de_blog_projects_watchdog.html&d=DwIBaQ&c=s883GpUCOChKOHiocYtGcg&r=CCrJKVC5zGot8PrnI-iYe00MdX4pgdQfMRigp14Ptmk&m=fnS3qgbCpD600__bE6JI4UEHztt6cQXutpLsr0eVzF0&s=HiJorxNd588b8gTpUyQB2qbLPq4-6UMqyhcHAcwrRlE&e=
>
(Apologies for the mangled URL, nothing I can do about that.) :(
There are several points which favour adding this to Qemu:
- Not all environments use systemd.
- HW watchdogs always reboot the host, which is too drastic.
- You may not want to protect all VMs in the same way.
> When a host becomes non-responsive, for example, due to a network error
> I would not have confidence in QEMU being reliable enough to trigger
> any self-fencing code. I've seen many bug reports where QEMU has entirely
> hung due to non-responsive network based storage.
Completely agree with you. There are various failures where Qemu
itself wouldn't be able to self-fence, but there are many in which it
would. There's also the fact that you may not want to protect all VMs
equally. To that point, nothing stops "harder" deadlines from being used.
The idea being discussed already involves a two-level protection model
where Qemu tries to kill itself, but if that fails the kernel will do it.
With that in mind, the libvirt API could actually offer a third-level
protection which sets a HW watchdog (via systemd or otherwise). That
would be a host setting, though, but part of the same offering.
> IMHO doing this at the host OS level is going to be more reliable in
> terms of detecting the problem in the first place, as well as more
> reliable in taking the action - it's very difficult for a hardware CPU
> reset to fail to work.
Absolutely, but it's a very drastic measure that:
- May be unnecessary.
- Will fence everything even when perhaps only some VMs need protection.
What are your thoughts on this 3-level approach?
1) Qemu tries to log() + abort() (deadline)
2) Kernel sends SIGKILL (harddeadline)
3) HW watchdog kicks in (harderdeadline)
(Better names welcome.)
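To make (1) and (2) a bit more concrete, here's a rough sketch of what the
fence thread could look like (every name and the hard-coded options below
are made up for illustration only; this is not a patch, and it would need
-pthread and, on older glibc, -lrt):

  /* Illustrative only: a heartbeat thread combining levels (1) and (2). */
  #include <pthread.h>
  #include <signal.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/stat.h>
  #include <time.h>
  #include <unistd.h>

  typedef struct {
      const char *heartbeat;  /* file an external agent periodically touches */
      int deadline;           /* level 1: log() + abort() */
      int recheck;            /* polling interval */
      int harddeadline;       /* level 2: kernel-delivered SIGKILL */
  } FenceOpts;

  static void *fence_thread(void *opaque)
  {
      FenceOpts *opts = opaque;
      struct sigevent sev = { .sigev_notify = SIGEV_SIGNAL,
                              .sigev_signo  = SIGKILL };
      struct itimerspec its = { 0 };
      timer_t timerid;

      /* Level 2: the kernel kills us if we ever stop re-arming this timer. */
      if (timer_create(CLOCK_MONOTONIC, &sev, &timerid) < 0) {
          perror("timer_create");
          abort();
      }
      for (;;) {
          struct stat st;

          its.it_value.tv_sec = opts->harddeadline;
          timer_settime(timerid, 0, &its, NULL);  /* push the hard deadline out */

          /* Level 1: check the heartbeat file's mtime ourselves.
           * This sketch fences immediately on ENOENT; ignoring it would
           * be just as easy. */
          if (stat(opts->heartbeat, &st) < 0 ||
              time(NULL) - st.st_mtime > opts->deadline) {
              fprintf(stderr, "fence: heartbeat expired, aborting\n");
              abort();
          }
          sleep(opts->recheck);
      }
      return NULL;
  }

  int main(void)
  {
      /* Stands in for -fence heartbeat=...,deadline=60,recheck=5,harddeadline=61 */
      static FenceOpts opts = { "/path/to/file", 60, 5, 61 };
      pthread_t tid;

      pthread_create(&tid, NULL, fence_thread, &opts);
      pause();  /* stands in for the rest of Qemu */
      return 0;
  }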
F.
>
> Regards,
> Daniel
> --
> |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o- https://fstop138.berrange.com :|
> |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Thoughts on VM fence infrastructure
2019-09-30 19:23 ` Felipe Franciosi
2019-10-01 8:23 ` Dr. David Alan Gilbert
@ 2019-10-01 10:49 ` Daniel P. Berrangé
1 sibling, 0 replies; 19+ messages in thread
From: Daniel P. Berrangé @ 2019-10-01 10:49 UTC (permalink / raw)
To: Felipe Franciosi; +Cc: Aditya Ramesh, Dr. David Alan Gilbert, qemu-devel
On Mon, Sep 30, 2019 at 07:23:47PM +0000, Felipe Franciosi wrote:
>
>
> > On Sep 30, 2019, at 6:59 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> >
> > * Felipe Franciosi (felipe@nutanix.com) wrote:
> >>
> >>
> >>> On Sep 30, 2019, at 6:11 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> >>>
> >>> * Felipe Franciosi (felipe@nutanix.com) wrote:
> >>>>
> >>>>
> >>>>> On Sep 30, 2019, at 5:03 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> >>>>>
> >>>>> * Felipe Franciosi (felipe@nutanix.com) wrote:
> >>>>>> Hi David,
> >>>>>>
> >>>>>>> On Sep 30, 2019, at 3:29 PM, Dr. David Alan Gilbert <dgilbert@redhat.com> wrote:
> >>>>>>>
> >>>>>>> * Felipe Franciosi (felipe@nutanix.com) wrote:
> >>>>>>>> Heyall,
> >>>>>>>>
> >>>>>>>> We have a use case where a host should self-fence (and all VMs should
> >>>>>>>> die) if it doesn't hear back from a heartbeat within a certain time
> >>>>>>>> period. Lots of ideas were floated around where libvirt could take
> >>>>>>>> care of killing VMs or a separate service could do it. The concern
> >>>>>>>> with those is that various failures could lead to _those_ services
> >>>>>>>> being unavailable and the fencing wouldn't be enforced as it should.
> >>>>>>>>
> >>>>>>>> Ultimately, it feels like Qemu should be responsible for this
> >>>>>>>> heartbeat and exit (or execute a custom callback) on timeout.
> >>>>>>>
> >>>>>>> It doesn't feel doing it inside qemu would be any safer; something
> >>>>>>> outside QEMU can forcibly emit a kill -9 and qemu *will* stop.
> >>>>>>
> >>>>>> The argument above is that we would have to rely on this external
> >>>>>> service being functional. Consider the case where the host is
> >>>>>> dysfunctional, with this service perhaps crashed and a corrupt
> >>>>>> filesystem preventing it from restarting. The VMs would never die.
> >>>>>
> >>>>> Yeh that could fail.
> >>>>>
> >>>>>> It feels like a Qemu timer-driven heartbeat check and calls abort() /
> >>>>>> exit() would be more reliable. Thoughts?
> >>>>>
> >>>>> OK, yes; perhaps using a timer_create and telling it to send a fatal
> >>>>> signal is pretty solid; it would take the kernel to do that once it's
> >>>>> set.
> >>>>
> >>>> I'm confused about why the kernel needs to be involved. If this is a
> >>>> timer off the Qemu main loop, it can just check on the heartbeat
> >>>> condition (which should be customisable) and call abort() if that's
> >>>> not satisfied. If you agree on that I'd like to talk about how that
> >>>> check could be made customisable.
> >>>
> >>> There are times when the main loop can get blocked even though the CPU
> >>> threads can be running and can in some configurations perform IO
> >>> even without the main loop (I think!).
> >>
> >> Ah, that's a very good point. Indeed, you can perform IO in those
> >> cases especially when using vhost devices.
> >>
> >>> By setting a timer in the kernel that sends a signal to qemu, the kernel
> >>> will send that signal however broken qemu is.
> >>
> >> Got you now. That's probably better. Do you reckon a signal is
> >> preferable over SIGEV_THREAD?
> >
> > Not sure; probably the safest is getting the kernel to SIGKILL it - but
> > that's a complete nightmare to debug - your process just goes *pop*
> > with no apparent reason why.
> > I've not used SIGEV_THREAD - it looks promising though.
>
> I'm worried that SIGEV_THREAD could be a bit heavyweight (if it fires
> up a new thread each time). On the other hand, as you said, SIGKILL
> makes it harder to debug.
>
> Also, asking the kernel to defer the SIGKILL (ie. updating the timer)
> needs to come from Qemu itself (eg. a timer in the main loop,
> something we already ruled unsuitable, or a qmp command which
> constitutes an external dependency that we also ruled undesirable).
>
> What if, when self-fencing is enabled, Qemu kicks off a new thread
> from the start which does nothing but periodically wake up, verify the
> heartbeat condition and log()+abort() if required? (Then we wouldn't
> need the kernel timer.)
>
> >
> >> I'm still wondering how to make this customisable so that different
> >> types of heartbeat could be implemented (preferably without creating
> >> external dependencies per discussion above). Thoughts welcome.
> >
> > Yes, you need something to enable it, and some safe way to retrigger
> > the timer. A qmp command marked as 'oob' might be the right way -
> > another qmp command can't block it.
>
> This qmp approach is slightly different than the external dependency
> that itself kills Qemu; if it doesn't run, then Qemu dies because the
> kernel timer is not updated. But this is also a heavyweight approach.
> We are talking about a service that needs to frequently connect to all
> running VMs on a host to reset the timer.
>
> But it does allow for the customisable heartbeat: the logic behind
> what triggers the command is completely flexible.
>
> Thinking about this idea of a separate Qemu thread, one thing that
> came to mind is this:
>
> qemu -fence heartbeat=/path/to/file,deadline=60[,recheck=5]
>
> Qemu could fire up a thread that stat()s <file> (every <recheck>
> seconds or on a default interval) and log()+abort() the whole process
> if the last modification time of the file is older than <deadline>. If
> <file> goes away (ie. stat() gives ENOENT), then it either fences
> immediately or ignores it, not sure which is more sensible.
The architectural direction of QEMU is taking us towards a world
where "QEMU" is actually many processes. Having a thread which
just a log()+abort() the process is only going to take out one of
the QEMU processes associated with a VM. The other ones, which
are likely servicing I/O for the guest are still going to be running.
Even with monolithic single process QEMU we arguably have two processes,
the other being the kernel that is doing work on behalf of QEMU userspace.
In the event of a network outage, there can be I/O requests from QEMU
in the kernel which are stuck/stalled. If you kill QEMU I don't believe
there's a strong guarantee that those I/O requests will be cancelled. In
this case QEMU can be stuck in an uninterruptible sleep such that even
kill -9 won't make the process go away entirely; it can get into a zombie
state. If/when the network problem resolves itself, QEMU will finally get
cleaned up, but there's a good chance the I/O operations will get sent.
As mentioned earlier, off-host STONITH is the best, but IMHO the next
fallback should be self-fencing of the host, rather than trying to get
each QEMU to self-fence its own process.
Regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Thoughts on VM fence infrastructure
2019-10-01 10:46 ` Felipe Franciosi
@ 2019-10-01 11:10 ` Daniel P. Berrangé
2019-10-01 11:38 ` Felipe Franciosi
0 siblings, 1 reply; 19+ messages in thread
From: Daniel P. Berrangé @ 2019-10-01 11:10 UTC (permalink / raw)
To: Felipe Franciosi
Cc: Rafael David Tinoco, Aditya Ramesh, Dr. David Alan Gilbert,
qemu-devel
On Tue, Oct 01, 2019 at 10:46:24AM +0000, Felipe Franciosi wrote:
> Hi Daniel!
>
>
> > On Oct 1, 2019, at 11:31 AM, Daniel P. Berrangé <berrange@redhat.com> wrote:
> >
> > On Tue, Oct 01, 2019 at 09:56:17AM +0000, Felipe Franciosi wrote:
>
> (Apologies for the mangled URL, nothing I can do about that.) :(
>
> There are several points which favour adding this to Qemu:
> - Not all environments use systemd.
Sure, if you want to cope with that you can just use the HW watchdog
directly instead of via systemd.
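For reference, talking to the watchdog device directly is only a handful of
lines (a sketch, assuming a standard /dev/watchdog driver and the usual
WDIOC ioctls):

  #include <fcntl.h>
  #include <linux/watchdog.h>
  #include <sys/ioctl.h>
  #include <unistd.h>

  int main(void)
  {
      int timeout = 20;  /* mirrors the 20s systemd example */
      int fd = open("/dev/watchdog", O_WRONLY);

      if (fd < 0) {
          return 1;
      }
      ioctl(fd, WDIOC_SETTIMEOUT, &timeout);
      for (;;) {
          ioctl(fd, WDIOC_KEEPALIVE, 0);  /* stop pinging -> hardware reset */
          sleep(timeout / 2);
      }
  }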
> - HW watchdogs always reboot the host, which is too drastic.
> - You may not want to protect all VMs in the same way.
Same points repeated below, so I'll respond there....
> > IMHO doing this at the host OS level is going to be more reliable in
> > terms of detecting the problem in the first place, as well as more
> > reliable in taking the action - it's very difficult for a hardware CPU
> > reset to fail to work.
>
> Absolutely, but it's a very drastic measure that:
> - May be unnecessary.
Of course, the inability to predict future consequences is what
forces us into assuming the worst case & taking actions to
mitigate that. It will definitely result in unnecessary killing
of hosts, but that is what gives you the safety guarantees you
can't otherwise achieve.
I gave the example elsewhere that even if you kill QEMU, the kernel
can have pending I/O associated with QEMU that can be sent if the
host later recovers.
> - Will fence everything even when perhaps only some VMs need protection.
I don't believe it's viable to offer real protection to only
a subset of VMs, principally because the kernel is doing I/O work
on behalf of the VM, so to protect just 1 VM you must fence the
kernel.
> What are your thoughts on this 3-level approach?
> 1) Qemu tries to log() + abort() (deadline)
Just abort()'ing isn't going to be a viable strategy with QEMU's move
towards a multi-process architecture. This introduces the problem that
the "main" QEMU process has to enumerate all the helpers it is dealing
with and kill them all off in some way. This is non-trivial especially
if some of the helpers are running under different privilege levels.
You could declare that multi-process QEMU is out of scope, but I think
QEMU self-fencing would need to offer compelling benefits over host OS
self-fencing to justify that exception. Personally I'm not seeing it.
> 2) Kernel sends SIGKILL (harddeadline)
This is slightly easier to apply to multiple processes in that it
isn't restricted by the privileges of the main QEMU vs helpers, and
could perhaps take advantage of cgroups.
> 3) HW watchdog kicks in (harderdeadline)
Regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Thoughts on VM fence infrastructure
2019-10-01 11:10 ` Daniel P. Berrangé
@ 2019-10-01 11:38 ` Felipe Franciosi
0 siblings, 0 replies; 19+ messages in thread
From: Felipe Franciosi @ 2019-10-01 11:38 UTC (permalink / raw)
To: Daniel P. Berrangé
Cc: Rafael David Tinoco, Aditya Ramesh, Dr. David Alan Gilbert,
qemu-devel
> On Oct 1, 2019, at 12:10 PM, Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Tue, Oct 01, 2019 at 10:46:24AM +0000, Felipe Franciosi wrote:
>> Hi Daniel!
>>
>>
>>> On Oct 1, 2019, at 11:31 AM, Daniel P. Berrangé <berrange@redhat.com> wrote:
>>>
>>> On Tue, Oct 01, 2019 at 09:56:17AM +0000, Felipe Franciosi wrote:
>>
>> (Apologies for the mangled URL, nothing I can do about that.) :(
>>
>> There are several points which favour adding this to Qemu:
>> - Not all environments use systemd.
>
> Sure, if you want to cope with that you can just use the HW watchdog
> directly instead of via systemd.
>
>> - HW watchdogs always reboot the host, which is too drastic.
>> - You may not want to protect all VMs in the same way.
>
> Same points repeated below, so I'll respond there....
>
>>> IMHO doing this at the host OS level is going to be more reliable in
>>> terms of detecting the problem in the first place, as well as more
>>> reliable in taking the action - its very difficult for a hardware CPU
>>> reset to fail to work.
>>
>> Absolutely, but it's a very drastic measure that:
>> - May be unnecessary.
>
> Of course, the inability to predict future consequences is what
> forces us into assuming the worst case & taking actions to
> mitigate that. It will definitely result in unnecessary killing
> of hosts, but that is what gives you the safety guarantees you
> can't otherwise achieve.
The argument is that many configurations have controlled settings that
do not require that drastic level of protection. And the feature we're
discussing offers a softer way of dealing with these.
> I gave the example elsewhere that even if you kill QEMU, the kernel
> can have pending I/O associated with QEMU that can be sent if the
> host later recovers.
Even if an I/O is sent out of the host, there are no guarantees that
it isn't queued somewhere, from where it could still reach its destination
even after you've pulled the power on the host. Such discussions were held separately
a while back when we were talking about task cancellation.
To that point, I've personally seen corruption with network storage
which was debugged as:
1) write(lba=0, value='a')
2) host crashed (hard reset)
3) vm restarted elsewhere
4) write(lba=0, value='a') (resubmitted from "1")
5) write(lba=0, value='b')
6) I/O from step "1" reached controller
7) read(lba=0) == 'a'
My argument is that you need to look into protection where protection
is needed. Perhaps the example above could be avoided with a session
distinction so that once I/O from step "4" was seen (coming from the
new session), I/Os from older sessions should be rejected by the
storage controller.
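Purely to illustrate that idea, the controller-side check could be as
simple as the sketch below (every name here is hypothetical and not tied
to any real product):

  #include <stdbool.h>
  #include <stddef.h>
  #include <stdint.h>

  struct volume {
      uint64_t current_session;  /* bumped whenever the VM restarts elsewhere */
  };

  struct write_req {
      uint64_t session;          /* stamped by the issuing host */
      uint64_t lba;
      const void *data;
      size_t len;
  };

  /* Accept a write only if it does not come from a stale session. */
  bool admit_write(struct volume *vol, const struct write_req *req)
  {
      if (req->session < vol->current_session) {
          return false;          /* e.g. the late write from step "1" above */
      }
      vol->current_session = req->session;
      return true;
  }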
In any case, all I'm saying is that there are different levels of
protection. The feature we're discussing here offers one of them.
>
>> - Will fence everything even perhaps only some VMs need protection.
>
> I don't believe it's viable to offer real protection to only
> a subset of VMs, principally because the kernel is doing I/O work
> on behalf of the VM, so to protect just 1 VM you must fence the
> kernel.
You are assuming that all users of Qemu out there do I/O through the
kernel. What if they don't?
>
>> What are your thoughts on this 3-level approach?
>> 1) Qemu tries to log() + abort() (deadline)
>
> Just abort()'ing isn't going to be a viable strategy with QEMU's move
> towards a multi-process architecture. This introduces the problem that
> the "main" QEMU process has to enumerate all the helpers it is dealing
> with and kill them all off in some way. This is non-trivial especially
> if some of the helpers are running under different privilege levels.
If this is to extend to a multi-process model, I don't think it should
be one process killing others. It's called "self-fencing" because each
process should be responsible for killing itself based on a heartbeat.
(Or configuring the kernel to do it.)
>
> You could declare that multi-process QEMU is out of scope, but I think
> QEMU self-fencing would need to offer compelling benefits over host OS
> self-fencing to justify that exception. Personally I'm not seeing it.
I would limit the feature to a monolithic model to begin with, but
definitely keep an eye on ways to extend it to a multi-process model.
The benefit is as described before, with the added arguments in this e-mail:
- May not want to protect all VMs.
- May not want to kill the entire host for a temporary network outage.
- Killing Qemu is sufficient in various configurations.
>
>> 2) Kernel sends SIGKILL (harddeadline)
>
> This is slightly easier to apply to multiple processes in that it
> isn't restricted by the privileges of the main QEMU vs helpers, and
> could perhaps take advantage of cgroups.
Right, and it should be an option from the start. Thanks for weighing
in with extra ideas around cgroups.
F.
>
>> 3) HW watchdog kicks in (harderdeadline)
>
>
> Regards,
> Daniel
> --
> |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o- https://fstop138.berrange.com :|
> |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread
Thread overview: 19+ messages
2019-09-30 10:30 Thoughts on VM fence infrastructure Felipe Franciosi
2019-09-30 14:29 ` Dr. David Alan Gilbert
2019-09-30 15:46 ` Felipe Franciosi
2019-09-30 16:03 ` Dr. David Alan Gilbert
2019-09-30 16:59 ` Felipe Franciosi
2019-09-30 17:11 ` Dr. David Alan Gilbert
2019-09-30 17:33 ` Felipe Franciosi
2019-09-30 17:59 ` Dr. David Alan Gilbert
2019-09-30 19:23 ` Felipe Franciosi
2019-10-01 8:23 ` Dr. David Alan Gilbert
2019-10-01 9:56 ` Felipe Franciosi
2019-10-01 10:05 ` Dr. David Alan Gilbert
2019-10-01 10:31 ` Daniel P. Berrangé
2019-10-01 10:46 ` Felipe Franciosi
2019-10-01 11:10 ` Daniel P. Berrangé
2019-10-01 11:38 ` Felipe Franciosi
2019-10-01 10:49 ` Daniel P. Berrangé
2019-09-30 19:45 ` Rafael David Tinoco
2019-09-30 20:24 ` Felipe Franciosi