Re: [PATCH 0/8] migration: Add precopy initial data capability and VFIO precopy support

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

From: Juan Quintela <quintela@redhat.com>
To: Avihai Horon <avihaih@nvidia.com>
Cc: "Peter Xu" <peterx@redhat.com>,
	qemu-devel@nongnu.org,
	"Alex Williamson" <alex.williamson@redhat.com>,
	"Cédric Le Goater" <clg@redhat.com>,
	"Leonardo Bras" <leobras@redhat.com>,
	"Eric Blake" <eblake@redhat.com>,
	"Markus Armbruster" <armbru@redhat.com>,
	"Thomas Huth" <thuth@redhat.com>,
	"Laurent Vivier" <lvivier@redhat.com>,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	"Yishai Hadas" <yishaih@nvidia.com>,
	"Jason Gunthorpe" <jgg@nvidia.com>,
	"Maor Gottlieb" <maorg@nvidia.com>,
	"Kirti Wankhede" <kwankhede@nvidia.com>,
	"Tarun Gupta" <targupta@nvidia.com>,
	"Joao Martins" <joao.m.martins@oracle.com>,
	"Daniel Berrange" <berrange@redhat.com>
Subject: Re: [PATCH 0/8] migration: Add precopy initial data capability and VFIO precopy support
Date: Wed, 10 May 2023 18:41:44 +0200	[thread overview]
Message-ID: <87jzxg6svr.fsf@secure.mitica> (raw)
In-Reply-To: <3bb652f6-4948-d6c2-fac5-e0a6c3edb62a@nvidia.com> (Avihai Horon's message of "Wed, 10 May 2023 19:01:03 +0300")

Avihai Horon <avihaih@nvidia.com> wrote:

>> You have a point here.
>> But I will approach this case in a different way:
>>
>> Destination QEMU needs to be older, because it don't have the feature.
>> So we need to NOT being able to do the switchover for older machine
>> types.
>> And have something like this is qemu/hw/machine.c
>>
>> GlobalProperty hw_compat_7_2[] = {
>>      { "our_device", "explicit-switchover", "off" },
>> };
>>
>> Or whatever we want to call the device and the property, and not use it
>> for older machine types to allow migration for that.
>
> Let me see if I get this straight (I'm not that familiar with
> hw_compat_x_y):
>
> You mean that device Y which adds support for explicit-switchover in
> QEMU version Z should add a property
> like you wrote above, and use it to disable explicit-switchover usage
> for Y devices when Y device
> from QEMU older than Z is migrated?

More that "from" "to"

Let me elaborate.  We have two QEMUs:

QEMU version X, has device dev. Let's call it qemu-X.
QEMU version Y (X+1) add feature foo to device dev.  Let's call it qemu-Y.

We have two machine types (for this exercise we don't care about
architectures)

PC-X.0
PC-Y.0

So, the possible combinations are:

First the easy cases, same qemu on both sides.  Different machine types.

$ qemu-X -M PC-X.0   -> to -> qemu-X -M PC-X.0

  good. neither guest use feature foo.

$ qemu-X -M PC-Y.0   -> to -> qemu-X -M PC-Y.0

  impossible. qemu-X don't have machine PC-Y.0.  So nothing to see here.

$ qemu-Y -M PC-X.0   -> to -> qemu-Y -M PC-X.0

  good.  We have feature foo in both sides. Notice that I recomend here
  to not use feature foo.  We will see on the difficult cases.

$ qemu-Y -M PC-Y.0   -> to -> qemu-Y -M PC-Y.0

  good.  Both sides use feature foo.

Difficult cases, when we mix qemu versions.

$ qemu-X -M PC-X.0  -> to -> qemu-Y -M PC-X.0

  source don't have feature foo.  Destination have feature foo.
  But if we disable it for machine PC-X.0, it will work.

$ qemu-Y -M PC-X.0  -> to -> qemu-X -M PC-X.0

  same than previous example.  But here we have feature foo on source
  and not on destination.  Disabling it for machine PC-X.0 fixes the
  problem.

We can't migrate a PC-Y.0 when one of the qemu's is qemu-X, so that case
is impossible.

Does this makes more sense?

And now, how hw_compat_X_Y works.

It is an array of registers with the format:

- name of device  (we give some rope here, for instance migration is a
  device in this context)

- name of property: self explanatory.  The important bit is that
  we can get the value of the property in the device driver.

- value of the property: self explanatory.

With this mechanism what we do when we add a feature to a device that
matters for migration is:
- for the machine type of the version that we are "developing" feature
  is enabled by default.  For whatever that enable means.

- for old machine types we disable the feature, so it can be migrate
  freely with old qemu. But using the old machine type.

- there is way to enable the feature on the command line even for old
  machine types on new qemu, but only developers use that for testing.
  Normal users/admins never do that.

what does hw_compat_7_2 means?

Ok, we need to know the versions.  New version is 8.0.

hw_compat_7_2 has all the properties represensing "features", defaults,
whatever that has changed since 7.2.  In other words, what features we
need to disable to get to the features that existed when 7.2 was
released.

To go to a real example.

In the development tree.  We have:

GlobalProperty hw_compat_8_0[] = {
    { "migration", "multifd-flush-after-each-section", "on"},
};
const size_t hw_compat_8_0_len = G_N_ELEMENTS(hw_compat_8_0);

Feature is implemented in the following commits:

77c259a4cb1c9799754b48f570301ebf1de5ded8
b05292c237030343516d073b1a1e5f49ffc017a8
294e5a4034e81b3d8db03b4e0f691386f20d6ed3

When we are doing migration with multifd and we pass the end of memory
(i.e. we end one iteration through all the RAM) we need to make sure
that we don't send the same page through two channels, i.e. contents of
the page at iteration 1 through channel 1 and contents of the page at
iteration 2 through channel 2.  The problem is that they could arrive
out of order and the of page of iteration 1 arrive later than iteration
2 and overwrite new data with old data.  Which is undesirable.
We could use complex algorithms to fix that, but one easy way of doing
it is:

- When we finish a run through all memory (i.e.) one iteration, we flush
  all channels and make sure that everything arrives to destination
  before starting sending data o the next iteration.  I call that
  synchronize all channels.

And that is what we *should* have done.  But when I implemented the
feature, I did this synchronization everytime that we finish a cycle
(around 100miliseconds).  i.e. 10 times per second. This is called a
section for historical reasons. And when you are migrating
multiterabytes RAM machines with very fast networking, we end waiting
too much on the synchronizations.

Once detected the problem and found the cause, we change that.  The
problem is that if we are running an old qemu against a new qemu (or
viceversa) we would not be able to migrate, because one send/expects
synchronizations at different points.

So we have to maintain the old algorithm and the new algoritm.  That is
what we did here.  For machines older than <current in development>,
i.e. 8.0 we use the old algorithm (multifd-flush-after-earch section is
"on").

But the default for new machine types is the new algorithm, much faster.

I know that the explanation has been quite long, but inventing an
example was going to be even more complex.

Does this makes sense?

Later, Juan.

next prev parent reply	other threads:[~2023-05-10 16:42 UTC|newest]

Thread overview: 42+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-05-01 14:01 [PATCH 0/8] migration: Add precopy initial data capability and VFIO precopy support Avihai Horon
2023-05-01 14:01 ` [PATCH 1/8] migration: Add precopy initial data capability Avihai Horon
2023-05-10  8:24   ` Juan Quintela
2023-05-17  9:17   ` Markus Armbruster
2023-05-17 10:16     ` Avihai Horon
2023-05-17 12:21       ` Markus Armbruster
2023-05-17 13:23         ` Avihai Horon
2023-05-01 14:01 ` [PATCH 2/8] migration: Add precopy initial data handshake Avihai Horon
2023-05-02 22:54   ` Peter Xu
2023-05-03 15:31     ` Avihai Horon
2023-05-10  8:40   ` Juan Quintela
2023-05-10 15:32     ` Avihai Horon
2023-05-14 16:42   ` Cédric Le Goater
2023-05-15  7:56     ` Avihai Horon
2023-05-01 14:01 ` [PATCH 3/8] migration: Add precopy initial data loaded ACK functionality Avihai Horon
2023-05-02 22:56   ` Peter Xu
2023-05-03 15:36     ` Avihai Horon
2023-05-10  8:54   ` Juan Quintela
2023-05-10 15:52     ` Avihai Horon
2023-05-10 15:59       ` Juan Quintela
2023-05-01 14:01 ` [PATCH 4/8] migration: Enable precopy initial data capability Avihai Horon
2023-05-10  8:55   ` Juan Quintela
2023-05-01 14:01 ` [PATCH 5/8] tests: Add migration precopy initial data capability test Avihai Horon
2023-05-10  8:55   ` Juan Quintela
2023-05-01 14:01 ` [PATCH 6/8] vfio/migration: Refactor vfio_save_block() to return saved data size Avihai Horon
2023-05-10  9:00   ` Juan Quintela
2023-05-01 14:01 ` [PATCH 7/8] vfio/migration: Add VFIO migration pre-copy support Avihai Horon
2023-05-01 14:01 ` [PATCH 8/8] vfio/migration: Add support for precopy initial data capability Avihai Horon
2023-05-02 22:49 ` [PATCH 0/8] migration: Add precopy initial data capability and VFIO precopy support Peter Xu
2023-05-03 15:22   ` Avihai Horon
2023-05-03 15:49     ` Peter Xu
2023-05-04 10:18       ` Avihai Horon
2023-05-04 15:50         ` Peter Xu
2023-05-07 12:54           ` Avihai Horon
2023-05-08  0:49             ` Peter Xu
2023-05-08 11:11               ` Avihai Horon
2023-05-10  9:12     ` Juan Quintela
2023-05-10 16:01       ` Avihai Horon
2023-05-10 16:41         ` Juan Quintela [this message]
2023-05-11 11:31           ` Avihai Horon
2023-05-11 13:09             ` Juan Quintela
2023-05-11 15:08               ` Avihai Horon

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87jzxg6svr.fsf@secure.mitica \
    --to=quintela@redhat.com \
    --cc=alex.williamson@redhat.com \
    --cc=armbru@redhat.com \
    --cc=avihaih@nvidia.com \
    --cc=berrange@redhat.com \
    --cc=clg@redhat.com \
    --cc=eblake@redhat.com \
    --cc=jgg@nvidia.com \
    --cc=joao.m.martins@oracle.com \
    --cc=kwankhede@nvidia.com \
    --cc=leobras@redhat.com \
    --cc=lvivier@redhat.com \
    --cc=maorg@nvidia.com \
    --cc=pbonzini@redhat.com \
    --cc=peterx@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=targupta@nvidia.com \
    --cc=thuth@redhat.com \
    --cc=yishaih@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).