qemu-devel.nongnu.org archive mirror
From: "Michael S. Tsirkin" <mst@redhat.com>
To: Juan Quintela <quintela@redhat.com>
Cc: qemu-devel@nongnu.org, "Leonardo Bras" <leobras@redhat.com>,
	"Peter Xu" <peterx@redhat.com>,
	"Jiri Denemark" <jdenemar@redhat.com>,
	"Avihai Horon" <avihaih@nvidia.com>,
	"Fiona Ebner" <f.ebner@proxmox.com>,
	"Daniel P . Berrangé" <berrange@redhat.com>,
	"Eric Blake" <eblake@redhat.com>,
	"Laine Stump" <laine@redhat.com>
Subject: Re: [PATCH v3 3/3] migration/doc: We broke backwards compatibility
Date: Wed, 31 May 2023 09:23:35 -0400
Message-ID: <20230531092046-mutt-send-email-mst@kernel.org>
In-Reply-To: <87lehnfaj1.fsf@secure.mitica>

On Wed, May 17, 2023 at 01:43:46PM +0200, Juan Quintela wrote:
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > On Mon, May 15, 2023 at 10:32:01AM +0200, Juan Quintela wrote:
> >> When we detect that we have broken backwards compatibility in a
> >> released version, we can't do anything for that version.  But once we
> >> fix that bug in the next released version, we can "mitigate" the
> >> problem when migrating to newer versions, giving that machine a way
> >> out until it does a hard reboot.
> >> 
> >> Signed-off-by: Juan Quintela <quintela@redhat.com>
> >> ---
> >>  docs/devel/migration.rst | 194 +++++++++++++++++++++++++++++++++++++++
> >>  1 file changed, 194 insertions(+)
> >> 
> >> diff --git a/docs/devel/migration.rst b/docs/devel/migration.rst
> >> index 95e797ee60..97b6f48474 100644
> >> --- a/docs/devel/migration.rst
> >> +++ b/docs/devel/migration.rst
> >> @@ -451,6 +451,200 @@ binary in both sides of the migration.  If we use different QEMU
> >>  versions process, then we need to have into account all other
> >>  differences and the examples become even more complicated.
> >>  
> >> +How to mitigate when we have a backward compatibility error
> >> +-----------------------------------------------------------
> >> +
> >> +We break migration for old machine types continuously during
> >> +development.  But as soon as we find that there is a problem, we fix
> >> +it.  The problem is what happens when we detect, after we have done a
> >> +release, that something has gone wrong.
> >> +
> >> +Let's see how this works with an example.
> >> +
> >> +After the release of qemu-8.0 we found a problem when doing migration
> >> +of the machine type pc-7.2.
> >> +
> >> +- $ qemu-7.2 -M pc-7.2  ->  qemu-7.2 -M pc-7.2
> >> +
> >> +  This migration works
> >> +
> >> +- $ qemu-8.0 -M pc-7.2  ->  qemu-8.0 -M pc-7.2
> >> +
> >> +  This migration works
> >> +
> >> +- $ qemu-8.0 -M pc-7.2  ->  qemu-7.2 -M pc-7.2
> >> +
> >> +  This migration fails
> >> +
> >> +- $ qemu-7.2 -M pc-7.2  ->  qemu-8.0 -M pc-7.2
> >> +
> >> +  This migration fails
> >> +
> >> +So clearly something fails when migrating between qemu-7.2 and
> >> +qemu-8.0 with machine type pc-7.2.  The error messages and git bisect
> >> +pointed to this commit.
> >> +
> >> +In qemu-8.0 we got this commit: ::
> >> +
> >> +    commit 9a6ef182c03eaa138bae553f0fbb5a123bef9a53
> >> +    Author: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> >> +    Date:   Thu Mar 2 13:37:03 2023 +0000
> >> +
> >> +        hw/pci/aer: Add missing routing for AER errors
> >> +
> >> +The relevant bits of the commit for our example are these: ::
> >> +
> >> +    --- a/hw/pci/pcie_aer.c
> >> +    +++ b/hw/pci/pcie_aer.c
> >> +    @@ -112,6 +112,10 @@ int pcie_aer_init(PCIDevice *dev,
> >> +
> >> +         pci_set_long(dev->w1cmask + offset + PCI_ERR_UNCOR_STATUS,
> >> +                      PCI_ERR_UNC_SUPPORTED);
> >> +    +    pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK,
> >> +    +                 PCI_ERR_UNC_MASK_DEFAULT);
> >> +    +    pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK,
> >> +    +                 PCI_ERR_UNC_SUPPORTED);
> >> +
> >> +         pci_set_long(dev->config + offset + PCI_ERR_UNCOR_SEVER,
> >> +                     PCI_ERR_UNC_SEVERITY_DEFAULT);
> >> +
> >> +The patch changes how we configure PCI config space for AER.  But QEMU
> >> +fails the migration when the PCI config space differs between source
> >> +and destination.
> >> +
> >> +The following commit shows how this got fixed:
> >> +
> >> +<put info of the commit once that it arrives upstream>
> >> +
> >> +The relevant parts of the fix are as follows:
> >> +
> >> +First, we create a new property for the device to be able to configure
> >> +the old behaviour or the new behaviour. ::
> >> +
> >> +    diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> >> +    index 8a87ccc8b0..5153ad63d6 100644
> >> +    --- a/hw/pci/pci.c
> >> +    +++ b/hw/pci/pci.c
> >> +    @@ -79,6 +79,8 @@ static Property pci_props[] = {
> >> +         DEFINE_PROP_STRING("failover_pair_id", PCIDevice,
> >> +                            failover_pair_id),
> >> +         DEFINE_PROP_UINT32("acpi-index",  PCIDevice, acpi_index, 0),
> >> +    +    DEFINE_PROP_BIT("x-pcie-err-unc-mask", PCIDevice, cap_present,
> >> +    +                    QEMU_PCIE_ERR_UNC_MASK_BITNR, true),
> >> +         DEFINE_PROP_END_OF_LIST()
> >> +     };
> >> +
> >> +Notice that the property defaults to on, so we enable the feature for
> >> +new machine types.
> >> +
> >> +Now we see how the fix is done.  This is going to depend on what kind
> >> +of breakage happens, but in this case it is quite simple. ::
> >> +
> >> +    diff --git a/hw/pci/pcie_aer.c b/hw/pci/pcie_aer.c
> >> +    index 103667c368..374d593ead 100644
> >> +    --- a/hw/pci/pcie_aer.c
> >> +    +++ b/hw/pci/pcie_aer.c
> >> +    @@ -112,10 +112,13 @@ int pcie_aer_init(PCIDevice *dev, uint8_t cap_ver, uint16_t offset,
> >> +
> >> +         pci_set_long(dev->w1cmask + offset + PCI_ERR_UNCOR_STATUS,
> >> +                      PCI_ERR_UNC_SUPPORTED);
> >> +    -    pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK,
> >> +    -                 PCI_ERR_UNC_MASK_DEFAULT);
> >> +    -    pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK,
> >> +    -                 PCI_ERR_UNC_SUPPORTED);
> >> +    +
> >> +    +    if (dev->cap_present & QEMU_PCIE_ERR_UNC_MASK) {
> >> +    +        pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK,
> >> +    +                     PCI_ERR_UNC_MASK_DEFAULT);
> >> +    +        pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK,
> >> +    +                     PCI_ERR_UNC_SUPPORTED);
> >> +    +    }
> >> +
> >> +         pci_set_long(dev->config + offset + PCI_ERR_UNCOR_SEVER,
> >> +                      PCI_ERR_UNC_SEVERITY_DEFAULT);
> >> +
> >> +I.e. if the property bit is enabled, we configure it as we did for
> >> +qemu-8.0.  If the property bit is not set, we configure it as it was in 7.2.
> >> +
> >> +And now, all that is missing is to disable the feature for old
> >> +machine types: ::
> >> +
> >> +    diff --git a/hw/core/machine.c b/hw/core/machine.c
> >> +    index 47a34841a5..07f763eb2e 100644
> >> +    --- a/hw/core/machine.c
> >> +    +++ b/hw/core/machine.c
> >> +    @@ -48,6 +48,7 @@ GlobalProperty hw_compat_7_2[] = {
> >> +         { "e1000e", "migrate-timadj", "off" },
> >> +         { "virtio-mem", "x-early-migration", "false" },
> >> +         { "migration", "x-preempt-pre-7-2", "true" },
> >> +    +    { TYPE_PCI_DEVICE, "x-pcie-err-unc-mask", "off" },
> >> +     };
> >> +     const size_t hw_compat_7_2_len = G_N_ELEMENTS(hw_compat_7_2);
> >> +
> >> +And now, when qemu-8.0.1 is released with this fix, all combinations
> >> +are going to work as expected.
> >> +
> >> +- $ qemu-7.2 -M pc-7.2  ->  qemu-7.2 -M pc-7.2 (works)
> >> +- $ qemu-8.0.1 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2 (works)
> >> +- $ qemu-8.0.1 -M pc-7.2  ->  qemu-7.2 -M pc-7.2 (works)
> >> +- $ qemu-7.2 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2 (works)
> >> +
> >> +So normality has been restored and everything is ok, no?
> >> +
> >> +Not really, now our matrix is much bigger.  We start with the easy
> >> +cases: migration from the same version to the same version always
> >> +works:
> >> +
> >> +- $ qemu-7.2 -M pc-7.2  ->  qemu-7.2 -M pc-7.2
> >> +- $ qemu-8.0 -M pc-7.2  ->  qemu-8.0 -M pc-7.2
> >> +- $ qemu-8.0.1 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2
> >> +
> >> +Now the interesting ones, where the QEMU versions of the two processes
> >> +are different.  The first set fails and we can do nothing about it:
> >> +both versions are released and we can't change anything.
> >> +
> >> +- $ qemu-7.2 -M pc-7.2  ->  qemu-8.0 -M pc-7.2
> >> +- $ qemu-8.0 -M pc-7.2  ->  qemu-7.2 -M pc-7.2
> >> +
> >> +These two are the ones that work.  The whole point of making the
> >> +change in the qemu-8.0.1 release was to fix this issue:
> >> +
> >> +- $ qemu-7.2 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2
> >> +- $ qemu-8.0.1 -M pc-7.2  ->  qemu-7.2 -M pc-7.2
> >> +
> >> +But now we find that qemu-8.0 can migrate neither to qemu-7.2 nor to
> >> +qemu-8.0.1.
> >> +
> >> +- $ qemu-8.0 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2
> >> +- $ qemu-8.0.1 -M pc-7.2  ->  qemu-8.0 -M pc-7.2
> >> +
> >> +So, if we start a pc-7.2 machine in qemu-8.0 we can't migrate it to
> >> +anything except to qemu-8.0.
> >> +
> >> +Can we do better?
> >> +
> >> +Yes.  If we know that we are going to do this migration:
> >> +
> >> +- $ qemu-8.0 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2
> >> +
> >> +We can launch the appropriate devices with
> >> +
> >> +--device...,x-pcie-err-unc-mask=on
> >> +
> >> +And now we can receive a migration from 8.0.  And from now on, we can
> >> +migrate to newer QEMU versions if we remember to enable that
> >> +property for pc-7.2.  Notice that we need to remember: it is not
> >> +enough to know that the source of the migration is qemu-8.0.  Think of
> >> +this example:
> >> +
> >> +$ qemu-8.0 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2 -> qemu-8.2 -M pc-7.2
> >> +
> >> +In the second migration, the source is not qemu-8.0, but we still have
> >> +that "problem" and must keep that property enabled.  Notice that we
> >> +need to keep this mark/property until the machine is rebooted.  But a
> >> +normal reboot (which doesn't reload QEMU) is not enough: we need the
> >> +machine to poweroff/poweron on a fixed QEMU.  And from then on we can
> >> +use the proper real machine.
> >> +
> >>  VMState
> >>  -------
> >
> > Can we release this list of things that need to be configured
> > somewhere? Maybe in a sane format that libvirt can parse?
> 
> What do you mean here?
> 
> the x-pcie-err-unc-mask=on?
> 
> The most similar thing that we have is hw/core/machine.c:hw_compat_x_y.
> 
> But that also includes the cases where we got things right.
> 
> Daniel, Jiri, what would you need and what would be useful to you?
> 
> Later, Juan.

Any input from anyone?  Cc'ing a couple more people.
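
To make the question concrete, here is a rough Python sketch of the
pc-7.2 matrix from the patch and of the kind of record a management
layer might need to keep.  Only the property name and the works/fails
outcomes come from the quoted text.  The function names, the "booted-on"
key and the JSON layout are made up for illustration, not an existing
QEMU or libvirt interface, and the model assumes a migration works
exactly when both sides set up the AER mask registers the same way.

```python
import json

PROP = "x-pcie-err-unc-mask"   # property name from the quoted patch

# Releases without the property: the behaviour is fixed and cannot be changed.
FIXED = {"7.2": False, "8.0": True}


def unc_mask_enabled(qemu, override=None):
    """Is the AER uncorrectable-error mask set up for a -M pc-7.2 guest?"""
    if qemu in FIXED:
        if override is not None:
            raise ValueError(f"qemu-{qemu} has no {PROP} property")
        return FIXED[qemu]
    # 8.0.1 and later: hw_compat_7_2 turns the property off unless overridden.
    return False if override is None else override


def migration_works(src, dst, src_override=None, dst_override=None):
    """Simplified model: works iff both sides configure the registers alike."""
    return (unc_mask_enabled(src, src_override)
            == unc_mask_enabled(dst, dst_override))


def required_props(booted_on):
    """Hypothetical record: properties to pin until the guest is power-cycled."""
    return {PROP: "on"} if booted_on == "8.0" else {}


if __name__ == "__main__":
    versions = ["7.2", "8.0", "8.0.1"]
    for src in versions:
        for dst in versions:
            result = "works" if migration_works(src, dst) else "fails"
            print(f"qemu-{src} -M pc-7.2 -> qemu-{dst} -M pc-7.2: {result}")
    # One possible machine-readable form of "what has to be configured".
    print(json.dumps({"machine": "pc-7.2", "booted-on": "8.0",
                      "device-props": required_props("8.0")}))
```

The record is keyed on where the guest was last powered on rather than
on the migration source, because in the 8.0 -> 8.0.1 -> 8.2 chain the
second hop still needs the property on both sides even though its
source is a fixed QEMU.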



-- 
MST


