From: Peter Xu <peterx@redhat.com>
To: Joao Martins <joao.m.martins@oracle.com>
Cc: qemu-devel@nongnu.org, Juan Quintela <quintela@redhat.com>,
Leonardo Bras <leobras@redhat.com>,
Eric Blake <eblake@redhat.com>,
Markus Armbruster <armbru@redhat.com>,
Avihai Horon <avihaih@nvidia.com>,
Yishai Hadas <yishaih@nvidia.com>,
"Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
Subject: Re: [PATCH 4/5] migration: Provide QMP access to downtime stats
Date: Fri, 6 Oct 2023 10:27:06 -0400 [thread overview]
Message-ID: <ZSAZOhSL2X0AJckQ@x1n> (raw)
In-Reply-To: <f254478a-2e4d-4e6e-b19f-d5e56099f2a9@oracle.com>
On Fri, Oct 06, 2023 at 12:37:15PM +0100, Joao Martins wrote:
> I added the statistics mainly for observability (e.g. a non-developer
> could grep the libvirt logs and understand where the downtime comes from).
> I wasn't specifically thinking about a management app using this, just
> broad access to the metrics.
>
> One can get the same level of observability with a BPF/dtrace/systemtap script,
> albeit in a less obvious way.
Makes sense.
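
(For reference, the aggregate downtime is already queryable over QMP today;
below is a minimal stdlib-only sketch, assuming a QMP socket at
/tmp/qmp.sock -- presumably the per-stage stats from this series would be
exposed via a similar query.)

#!/usr/bin/env python3
# Minimal stdlib-only sketch: query migration downtime over QMP.
# Assumes QEMU was started with a QMP socket, e.g.:
#   -qmp unix:/tmp/qmp.sock,server=on,wait=off   (path is just an example)
import json
import socket
import sys

def read_reply(f):
    # Skip asynchronous QMP events; return the first command reply.
    while True:
        msg = json.loads(f.readline())
        if "event" not in msg:
            return msg

def qmp_command(sock_path, command):
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(sock_path)
        f = s.makefile("rw")
        json.loads(f.readline())                      # QMP greeting banner
        f.write(json.dumps({"execute": "qmp_capabilities"}) + "\n")
        f.flush()
        read_reply(f)                                 # capabilities ack
        f.write(json.dumps({"execute": command}) + "\n")
        f.flush()
        return read_reply(f)["return"]

if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "/tmp/qmp.sock"
    info = qmp_command(path, "query-migrate")
    # 'downtime' (in ms) only shows up once migration has completed.
    print("status:", info.get("status"))
    print("downtime (ms):", info.get("downtime"))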
>
> With respect to motivation: I am doing migration with VFs and sometimes
> vhost-net, and the downtime/switchover is the only thing that is either
> non-deterministic or not captured in the migration math. There are some
> things that aren't accounted for (e.g. vhost with enough queues will give
> you high downtimes),
Will this be relevant to the loading of the queues? There was some work on
greatly reducing downtime, especially for virtio scenarios with multiple
queues (and IIRC even a single queue benefits from it), which wasn't
merged, probably due to lack of review:
https://lore.kernel.org/r/20230317081904.24389-1-xuchuangxclwt@bytedance.com
Though personally I think that's a direction worth exploring, at least;
maybe a slight enhancement to that series would work for us.
> and algorithmically not really possible to account for, as one would need
> to account for every possible instruction while we quiesce the guest (or
> at least that's my understanding).
>
> Just having these metrics helps the developer *and* the user see why the
> downtime is high, and may open up a window for fixes/bug reports or show
> where to improve.
>
> Furthermore, hopefully these tracepoints or stats can be a starting point
> for developers to understand how much downtime is spent in a particular
> device in QEMU (as a follow-up to this series),
Yes, I was actually expecting that when reading the cover letter. :) This
also makes sense. One thing worth mentioning is that the real downtime
measured can, IMHO, differ between src/dst because "pre_save" and
"post_load" may not be doing similar amounts of work. IIUC it can happen
that some device sends fast but loads slow; I'm not sure whether there's a
reversed use case. Maybe we want to capture that on both sides with some
metrics?
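
Just to illustrate what I mean, a toy sketch below; the per-device dump
format here is entirely made up, and the series may end up exposing
something different:

# Toy sketch: correlate hypothetical per-device timing dumps from the
# source (save/pre_save time) and the destination (load/post_load time).
# The dict format is invented for illustration only.
def slowest_devices(src_ms, dst_ms, top=5):
    devices = set(src_ms) | set(dst_ms)
    merged = {dev: (src_ms.get(dev, 0), dst_ms.get(dev, 0)) for dev in devices}
    # Rank by whichever side dominates, so a device that saves fast but
    # loads slow still floats to the top.
    return sorted(merged.items(), key=lambda kv: max(kv[1]), reverse=True)[:top]

# Example numbers (milliseconds), invented:
print(slowest_devices({"vfio-pci": 120, "virtio-net": 40},
                      {"vfio-pci": 35, "virtio-net": 210}))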
> or allow implementing bounds checks on the switchover in a way that
> doesn't violate downtime-limit SLAs (I have a small set of patches
> for this).
I assume that decision will always be synchronized between src/dst in some
way, or guaranteed to be the same. But I can wait to read the series first.
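
To make sure we're thinking of the same thing, the kind of gate I'd imagine
looks roughly like the sketch below; the names and numbers are invented,
not taken from your patches (downtime-limit itself is the existing
migration parameter):

# Rough sketch of a switchover bounds check against the downtime-limit SLA.
# 'expected_device_downtime_ms' would come from the kind of per-device
# stats discussed above; both names here are illustrative only.
def allow_switchover(expected_device_downtime_ms, other_overhead_ms,
                     downtime_limit_ms):
    estimate = sum(expected_device_downtime_ms.values()) + other_overhead_ms
    # Refuse to stop the guest if the estimate would blow the SLA; the
    # caller can then keep iterating (or abort/report) instead.
    return estimate <= downtime_limit_ms

print(allow_switchover({"vfio-pci": 120, "virtio-net": 40}, 25, 300))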
Thanks,
--
Peter Xu
Thread overview: 17+ messages
2023-09-26 16:18 [PATCH 0/5] migration: Downtime observability improvements Joao Martins
2023-09-26 16:18 ` [PATCH 1/5] migration: Store downtime timestamps in an array Joao Martins
2023-09-28 1:55 ` Wang, Lei
2023-09-28 13:31 ` Joao Martins
2023-09-26 16:18 ` [PATCH 2/5] migration: Collect more timestamps during switchover Joao Martins
2023-09-26 16:18 ` [PATCH 3/5] migration: Add a tracepoint for the downtime stats Joao Martins
2023-09-26 16:18 ` [PATCH 4/5] migration: Provide QMP access to " Joao Martins
2023-10-04 17:10 ` Peter Xu
2023-10-06 11:37 ` Joao Martins
2023-10-06 14:27 ` Peter Xu [this message]
2023-09-26 16:18 ` [PATCH 5/5] migration: Print expected-downtime on completion Joao Martins
2023-10-04 19:33 ` Peter Xu
2023-10-06 11:45 ` Joao Martins
2023-10-31 13:14 ` Juan Quintela
2023-11-02 10:22 ` Joao Martins
2023-10-04 17:19 ` [PATCH 0/5] migration: Downtime observability improvements Peter Xu
2023-10-06 11:39 ` Joao Martins