qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH 0/2] Fix cross migration issue with missing features: pdcm, arch-capabilities
@ 2025-09-12 13:35 Hector Cao
  0 siblings, 0 replies; 6+ messages in thread
From: Hector Cao @ 2025-09-12 13:35 UTC (permalink / raw)
  To: qemu-devel; +Cc: Paolo Bonzini, Zhao Liu, peterx, farosas

[-- Attachment #1: Type: text/plain, Size: 230 bytes --]

Thanks Fiona Ebner for pointing out (in DM) that I did not CC to the
relevant maintainers.
Let me CC to maintainers that are listed by  the ./scripts/get_maintainer.pl
script on the submission changed files.

Kind regards,
Hector

[-- Attachment #2: Type: text/html, Size: 513 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: Issues with pdcm in qemu 10.1-rc on migration and save/restore
@ 2025-09-04 14:35 Hector Cao
  2025-09-10 11:57 ` [RFC PATCH 0/2] Fix cross migration issue with missing features: pdcm, arch-capabilities Hector Cao
  0 siblings, 1 reply; 6+ messages in thread
From: Hector Cao @ 2025-09-04 14:35 UTC (permalink / raw)
  To: Christian Ehrhardt
  Cc: Paolo Bonzini, Daniel P. Berrangé, Xiaoyao Li, Zhao Liu,
	qemu-devel

[-- Attachment #1: Type: text/plain, Size: 9654 bytes --]

Hello,

In addition to my previous mail describing the issue on different
Ubuntu releases,

I went further by testing directly qemu upstream at HEAD
(baa79455fa92984ff0f4b9ae94bed66823177a27)

As the start version for the migration, I take quite recent release
v10.0.x to make the version gap smaller.

I can reproduce the following migration failures:

v10.0.2 -> HEAD:
error: operation failed: guest CPU doesn't match specification:
missing features: pdcm,arch-capabilities

v10.0.3 -> HEAD:
error: operation failed: guest CPU doesn't match specification:
missing features: pdcm
The error arch-capabilities is no longer present because v10.0.3 also
has [2] like HEAD.

If I revert the two commits [1] and [2] in HEAD, the migration works fine:

v10.0.2 -> HEAD (+reverts):
OK

[1] Revert "i386/cpu: Move adjustment of CPUID_EXT_PDCM before
feature_dependencies[] check"
    This reverts commit e68ec2980901c8e7f948f3305770962806c53f0b.

[2] Revert "target/i386: do not expose ARCH_CAPABILITIES on AMD CPU"
    This reverts commit d3a24134e37d57abd3e7445842cda2717f49e96d.

Since this issue is blocking us for the Ubuntu 25.10 release, can you
please provide
feedback on the best path going forward ?


On Wed, Sep 3, 2025 at 10:38 AM Christian Ehrhardt <
christian.ehrhardt@canonical.com> wrote:

> On Wed, Aug 20, 2025 at 7:11 AM Christian Ehrhardt
> <christian.ehrhardt@canonical.com> wrote:
> >
> > On Tue, Aug 19, 2025 at 4:51 PM Paolo Bonzini <pbonzini@redhat.com>
> wrote:
> > >
> > > On 8/6/25 21:18, Daniel P. Berrangé wrote:
> > > > On Wed, Aug 06, 2025 at 07:57:34PM +0200, Christian Ehrhardt wrote:
> > > >> On Wed, Aug 6, 2025 at 2:00 PM Daniel P. Berrangé <
> berrange@redhat.com> wrote:
> > > >>>
> > > >>> On Wed, Aug 06, 2025 at 01:52:17PM +0200, Christian Ehrhardt wrote:
> > > >>>> Hi,
> > > >>>> I was unsure if this would be better sent to libvirt or qemu - the
> > > >>>> issue is somewhere between libvirt modelling CPUs and qemu 10.1
> > > >>>> behaving differently. I did not want to double post and gladly
> most of
> > > >>>> the people are on both lists - since the switch in/out of the
> problem
> > > >>>> is qemu 10.0 <-> 10.1 let me start here. I beg your pardon for
> not yet
> > > >>>> having all the answers, I'm sure I could find more with
> debugging, but
> > > >>>> I also wanted to report early for your awareness while we are
> still in
> > > >>>> the RC phase.
> > > >>>>
> > > >>>>
> > > >>>> # Problem
> > > >>>>
> > > >>>> What I found when testing migrations in Ubuntu with qemu 10.1-rc1
> was:
> > > >>>>    error: operation failed: guest CPU doesn't match specification:
> > > >>>> missing features: pdcm
> > > >>>>
> > > >>>> This is behaving the same with libvirt 11.4 or the more recent
> 11.6.
> > > >>>> But switching back to qemu 10.0 confirmed that this behavior is
> new
> > > >>>> with qemu 10.1-rc.
> > > >>>
> > > >>>
> > > >>>> Without yet having any hard evidence against them I found a few
> pdcm
> > > >>>> related commits between 10.0 and 10.1-rc1:
> > > >>>>    7ff24fb65 i386/tdx: Don't mask off CPUID_EXT_PDCM
> > > >>>>    00268e000 i386/cpu: Warn about why CPUID_EXT_PDCM is not
> available
> > > >>>>    e68ec2980 i386/cpu: Move adjustment of CPUID_EXT_PDCM before
> > > >>>> feature_dependencies[] check
> > > >>>>    0ba06e46d i386/tdx: Add TDX fixed1 bits to supported CPUIDs
> > > >>>>
> > > >>>>
> > > >>>> # Caveat
> > > >>>>
> > > >>>> My test environment is in LXD system containers, that gives me
> issues
> > > >>>> in the power management detection
> > > >>>>    libvirtd[406]: error from service:
> GDBus.Error:System.Error.EROFS:
> > > >>>> Read-only file system
> > > >>>>    libvirtd[406]: Failed to get host power management capabilities
> > > >>>
> > > >>> That's harmless.
> > > >>
> > > >> Yeah, it always was for me - thanks for confirming.
> > > >>
> > > >>>> And the resulting host-model on a  rather old test server will
> therefore have:
> > > >>>>    <cpu mode='custom' match='exact' check='full'>
> > > >>>>      <model fallback='forbid'>Haswell-noTSX-IBRS</model>
> > > >>>>      <vendor>Intel</vendor>
> > > >>>>      <feature policy='require' name='vmx'/>
> > > >>>>      <feature policy='disable' name='pdcm'/>
> > > >>>>       ...
> > > >>>>
> > > >>>> But that was fine in the past, and the behavior started to break
> > > >>>> save/restore or migrations just now with the new qemu 10.1-rc.
> > > >>>>
> > > >>>> # Next steps
> > > >>>>
> > > >>>> I'm soon overwhelmed by meetings for the rest of the day, but
> would be
> > > >>>> curious if one has a suggestion about what to look at next for
> > > >>>> debugging or a theory about what might go wrong. If nothing else
> comes
> > > >>>> up I'll try to set up a bisect run tomorrow.
> > > >>>
> > > >>> Yeah, git bisect is what I'd start with.
> > > >>
> > > >> Bisect complete, identified this commit
> > > >>
> > > >> commit 00268e00027459abede448662f8794d78eb4b0a4
> > > >> Author: Xiaoyao Li <xiaoyao.li@intel.com>
> > > >> Date:   Tue Mar 4 00:24:50 2025 -0500
> > > >>
> > > >>      i386/cpu: Warn about why CPUID_EXT_PDCM is not available
> > > >>
> > > >>      When user requests PDCM explicitly via "+pdcm" without PMU
> enabled, emit
> > > >>      a warning to inform the user.
> > > >>
> > > >>      Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> > > >>      Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
> > > >>      Link:
> https://lore.kernel.org/r/20250304052450.465445-3-xiaoyao.li@intel.com
> > > >>      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > > >>
> > > >>   target/i386/cpu.c | 3 +++
> > > >>   1 file changed, 3 insertions(+)
> > > >>
> > > >>
> > > >>
> > > >> Which is odd as it should only add a warning right?
> > > >
> > > > No, that commit message is misleading.
> > > >
> > > > IIUC mark_unavailable_features() actively blocks usage of the
> feature,
> > > > so it is a functional change, not merely a emitting warning.
> > > >
> > > > It makes me wonder if that commit was actually intended to block the
> > > > feature or not, vs merely warning ?  CC'ing those involved in the
> > > > commit.
> > > We can revert the commit.  I'll send the revert to Stefan and let him
> > > decide whether to include it in 10.1-rc4 or delay to 10.2 and 10.1.1.
> >
> > Thanks Paolo for considering that.
> >
> > My steps to reproduce seemed really clear and are 100% reproducible
> > for me, but no one so far said "yeah they see it too", so I'm getting
> > unsure if it was not tried by anyone else or if there is more to it
> > than we yet know.
> > Further I tested more with the commit reverted, and found that at
> > least cross version migrations (9.2 -> 10.1) still have issues that
> > seem related - complaining about pdcm as missing feature.
> > But that was in a log of a test system that went away and ... you know
> > how these things can sometimes be, that new result is not yet very
> > reliable.
> >
> > I intended to check the following matrix more deeply again with and
> > without the reverted change and then come back to this thread:
> >
> > #1 Compare platforms
> > - Migrating between non containerized hosts to verify if they are
> > affected as well
> > - Power management explicitly switched off/on (vs the auto detect of
> > host-model) in the guest XML
> > #2 Retest the different Use-cases I've seen this pop up
> > - 10.1 managed save (broken unless reverting the commit that was
> identified)
> > - 9.2 -> 10.1 migration (seems broken even with the revert)
>
> I need to come back to this aspect of it - the cross release or cross
> qemu version migrations.
>
> Hector (on CC) helps me on that now - sadly we were able to confirm
> that migrations from older qemu versions no longer work.
> Yep 10.1 is released by now so it might end up as "The problem is what
> happens when we detect after we have done a release that something has
> gone wrong" from [2].
> But I still can't believe only we see this and therefore for now want
> to believe I messed up on our side when merging 10.1 :-)
>
> For now this is a call if others have also seen any older release
> migrating to 10.1 to throw:
>   error: operation failed: guest CPU doesn't match specification:
> missing features: pdcm,arch-capabilities
>
> Hector will later today reply here with a summary of what we found so
> far, to provide you a more complete picture to think about, without
> having to read through all the messy interim steps in the Ubuntu bug.
>
> [1]: https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/2121787
> [2]:
> https://gitlab.com/qemu-project/qemu/-/blob/master/docs/devel/migration/compatibility.rst?plain=1#L322
>
> > The hope was that these will help to further identify what is going
> > on, but despite the urgency of the release being imminent I have not
> > yet managed to find the time in the last two days :-/
> >
> > > Sorry for the delay in answering (and thanks Daniel for bringing this
> to
> > > my attention).
> > >
> > > Thanks,
> > >
> > > Paolo
> > >
> >
> >
> > --
> > Christian Ehrhardt
> > Director of Engineering, Ubuntu Server
> > Canonical Ltd
>
>
>
> --
> Christian Ehrhardt
> Director of Engineering, Ubuntu Server
> Canonical Ltd
>


-- 
Hector CAO
Software Engineer – Partner Engineering Team
hector.cao@canonical.com
https://launc <https://launchpad.net/~hectorcao>hpad.net/~hectorcao
<https://launchpad.net/~hectorcao>

<https://launchpad.net/~hectorcao>

[-- Attachment #2: Type: text/html, Size: 13864 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2025-09-23 10:32 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-09-12 13:35 [RFC PATCH 0/2] Fix cross migration issue with missing features: pdcm, arch-capabilities Hector Cao
  -- strict thread matches above, loose matches on Subject: below --
2025-09-04 14:35 Issues with pdcm in qemu 10.1-rc on migration and save/restore Hector Cao
2025-09-10 11:57 ` [RFC PATCH 0/2] Fix cross migration issue with missing features: pdcm, arch-capabilities Hector Cao
2025-09-23  7:53   ` Paolo Bonzini
2025-09-23 10:08     ` Hector Cao
2025-09-23 10:15       ` Paolo Bonzini
2025-09-23 10:31         ` Hector Cao

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).