* Re: Issues with pdcm in qemu 10.1-rc on migration and save/restore
@ 2025-09-04 14:35 Hector Cao
  2025-09-10  8:24 ` [RFC PATCH 0/2] Fix cross migration issue with missing features: pdcm, arch-capabilities Hector Cao
From: Hector Cao @ 2025-09-04 14:35 UTC
  To: Christian Ehrhardt
  Cc: Paolo Bonzini, Daniel P. Berrangé, Xiaoyao Li, Zhao Liu,
	qemu-devel


Hello,

In addition to my previous mail describing the issue on different
Ubuntu releases, I went further and tested upstream qemu directly at
HEAD (baa79455fa92984ff0f4b9ae94bed66823177a27).

As the source version for the migration, I used the fairly recent
v10.0.x releases to keep the version gap small.

I can reproduce the following migration failures:

v10.0.2 -> HEAD:
error: operation failed: guest CPU doesn't match specification: missing features: pdcm,arch-capabilities

v10.0.3 -> HEAD:
error: operation failed: guest CPU doesn't match specification: missing features: pdcm
The arch-capabilities error is no longer present because v10.0.3, like
HEAD, already contains the change that [2] reverts.

If I apply the two reverts [1] and [2] on top of HEAD, the migration works fine:

v10.0.2 -> HEAD (+reverts):
OK

[1] Revert "i386/cpu: Move adjustment of CPUID_EXT_PDCM before
feature_dependencies[] check"
    This reverts commit e68ec2980901c8e7f948f3305770962806c53f0b.

[2] Revert "target/i386: do not expose ARCH_CAPABILITIES on AMD CPU"
    This reverts commit d3a24134e37d57abd3e7445842cda2717f49e96d.
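
A rough way to picture these failures: the guest CPU definition carried
over from the source lists the features the guest was running with, and
the destination reports as missing any of them that its qemu did not
enable for the same configuration. The Python sketch below models only
that comparison, under that assumption. The function names and feature
sets are hypothetical, chosen to mirror the results above; this is not
libvirt's or qemu's actual code.

```python
# Hypothetical model of the "missing features" comparison; NOT libvirt or qemu code.


def missing_features(source: list[str], dest: set[str]) -> list[str]:
    """Return the features the source guest ran with that the destination
    did not enable, keeping the source's ordering."""
    return [f for f in source if f not in dest]


def check_migration(source: list[str], dest: set[str]) -> str:
    """Mimic the error text seen in the reports when features are missing."""
    missing = missing_features(source, dest)
    if missing:
        return ("error: operation failed: guest CPU doesn't match "
                "specification: missing features: " + ",".join(missing))
    return "OK"


if __name__ == "__main__":
    # Illustrative (assumed) feature sets modeled on the reports above.
    src_10_0_2 = ["vmx", "pdcm", "arch-capabilities"]
    src_10_0_3 = ["vmx", "pdcm"]  # already lacks arch-capabilities
    head = {"vmx"}  # destination that no longer enables pdcm/arch-capabilities
    head_with_reverts = {"vmx", "pdcm", "arch-capabilities"}

    print(check_migration(src_10_0_2, head))
    print(check_migration(src_10_0_3, head))
    print(check_migration(src_10_0_2, head_with_reverts))
```

Under this model, any feature the destination stops enabling for an
unchanged configuration surfaces as "missing" on incoming migration,
which is consistent with the reported results, including the OK case
once the reverts restore the features.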

Since this issue is blocking us for the Ubuntu 25.10 release, could you
please provide feedback on the best path forward?


On Wed, Sep 3, 2025 at 10:38 AM Christian Ehrhardt <christian.ehrhardt@canonical.com> wrote:

> On Wed, Aug 20, 2025 at 7:11 AM Christian Ehrhardt
> <christian.ehrhardt@canonical.com> wrote:
> >
> > On Tue, Aug 19, 2025 at 4:51 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
> > >
> > > On 8/6/25 21:18, Daniel P. Berrangé wrote:
> > > > On Wed, Aug 06, 2025 at 07:57:34PM +0200, Christian Ehrhardt wrote:
> > > >> On Wed, Aug 6, 2025 at 2:00 PM Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > >>>
> > > >>> On Wed, Aug 06, 2025 at 01:52:17PM +0200, Christian Ehrhardt wrote:
> > > >>>> Hi,
> > > >>>> I was unsure whether this would be better sent to libvirt or qemu; the
> > > >>>> issue is somewhere between libvirt modelling CPUs and qemu 10.1
> > > >>>> behaving differently. I did not want to double post, and gladly most of
> > > >>>> the people are on both lists. Since the switch in/out of the problem
> > > >>>> is qemu 10.0 <-> 10.1, let me start here. I beg your pardon for not yet
> > > >>>> having all the answers; I'm sure I could find more with debugging, but
> > > >>>> I also wanted to report early for your awareness while we are still in
> > > >>>> the RC phase.
> > > >>>>
> > > >>>>
> > > >>>> # Problem
> > > >>>>
> > > >>>> What I found when testing migrations in Ubuntu with qemu 10.1-rc1 was:
> > > >>>>    error: operation failed: guest CPU doesn't match specification: missing features: pdcm
> > > >>>>
> > > >>>> This behaves the same with libvirt 11.4 and the more recent 11.6,
> > > >>>> but switching back to qemu 10.0 confirmed that this behavior is new
> > > >>>> with qemu 10.1-rc.
> > > >>>
> > > >>>
> > > >>>> Without yet having any hard evidence against them, I found a few
> > > >>>> pdcm-related commits between 10.0 and 10.1-rc1:
> > > >>>>    7ff24fb65 i386/tdx: Don't mask off CPUID_EXT_PDCM
> > > >>>>    00268e000 i386/cpu: Warn about why CPUID_EXT_PDCM is not available
> > > >>>>    e68ec2980 i386/cpu: Move adjustment of CPUID_EXT_PDCM before feature_dependencies[] check
> > > >>>>    0ba06e46d i386/tdx: Add TDX fixed1 bits to supported CPUIDs
> > > >>>>
> > > >>>>
> > > >>>> # Caveat
> > > >>>>
> > > >>>> My test environment runs in LXD system containers, which gives me
> > > >>>> issues with power management detection:
> > > >>>>    libvirtd[406]: error from service: GDBus.Error:System.Error.EROFS: Read-only file system
> > > >>>>    libvirtd[406]: Failed to get host power management capabilities
> > > >>>
> > > >>> That's harmless.
> > > >>
> > > >> Yeah, it always was for me - thanks for confirming.
> > > >>
> > > >>>> And the resulting host-model on a rather old test server will
> > > >>>> therefore have:
> > > >>>>    <cpu mode='custom' match='exact' check='full'>
> > > >>>>      <model fallback='forbid'>Haswell-noTSX-IBRS</model>
> > > >>>>      <vendor>Intel</vendor>
> > > >>>>      <feature policy='require' name='vmx'/>
> > > >>>>      <feature policy='disable' name='pdcm'/>
> > > >>>>       ...
> > > >>>>
> > > >>>> But that was fine in the past, and the behavior started to break
> > > >>>> save/restore or migrations just now with the new qemu 10.1-rc.
> > > >>>>
> > > >>>> # Next steps
> > > >>>>
> > > >>>> I'm about to be swamped by meetings for the rest of the day, but I
> > > >>>> would be curious whether anyone has a suggestion about what to look
> > > >>>> at next for debugging, or a theory about what might be going wrong.
> > > >>>> If nothing else comes up, I'll try to set up a bisect run tomorrow.
> > > >>>
> > > >>> Yeah, git bisect is what I'd start with.
> > > >>
> > > >> Bisect complete; it identified this commit:
> > > >>
> > > >> commit 00268e00027459abede448662f8794d78eb4b0a4
> > > >> Author: Xiaoyao Li <xiaoyao.li@intel.com>
> > > >> Date:   Tue Mar 4 00:24:50 2025 -0500
> > > >>
> > > >>      i386/cpu: Warn about why CPUID_EXT_PDCM is not available
> > > >>
> > > >>      When user requests PDCM explicitly via "+pdcm" without PMU enabled, emit
> > > >>      a warning to inform the user.
> > > >>
> > > >>      Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> > > >>      Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
> > > >>      Link: https://lore.kernel.org/r/20250304052450.465445-3-xiaoyao.li@intel.com
> > > >>      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > > >>
> > > >>   target/i386/cpu.c | 3 +++
> > > >>   1 file changed, 3 insertions(+)
> > > >>
> > > >>
> > > >>
> > > >> Which is odd, as it should only add a warning, right?
> > > >
> > > > No, that commit message is misleading.
> > > >
> > > > IIUC mark_unavailable_features() actively blocks usage of the feature,
> > > > so it is a functional change, not merely emitting a warning.
> > > >
> > > > It makes me wonder whether that commit was actually intended to block
> > > > the feature, or merely to warn. CC'ing those involved in the commit.
> > > We can revert the commit.  I'll send the revert to Stefan and let him
> > > decide whether to include it in 10.1-rc4 or delay to 10.2 and 10.1.1.
> >
> > Thanks Paolo for considering that.
> >
> > My steps to reproduce seemed really clear and are 100% reproducible
> > for me, but no one so far said "yeah they see it too", so I'm getting
> > unsure if it was not tried by anyone else or if there is more to it
> > than we yet know.
> > Further, I tested more with the commit reverted and found that at
> > least cross-version migrations (9.2 -> 10.1) still have issues that
> > seem related, complaining about pdcm as a missing feature.
> > But that was in a log of a test system that went away, and... you know
> > how these things can sometimes be; that new result is not yet very
> > reliable.
> >
> > I intended to check the following matrix more deeply again with and
> > without the reverted change and then come back to this thread:
> >
> > #1 Compare platforms
> > - Migrating between non-containerized hosts to verify whether they are
> > affected as well
> > - Power management explicitly switched off/on (vs the auto detect of
> > host-model) in the guest XML
> > #2 Retest the different use cases where I've seen this pop up
> > - 10.1 managed save (broken unless reverting the identified commit)
> > - 9.2 -> 10.1 migration (seems broken even with the revert)
>
> I need to come back to this aspect of it - the cross release or cross
> qemu version migrations.
>
> Hector (on CC) is helping me with that now; sadly, we were able to confirm
> that migrations from older qemu versions no longer work.
> Yes, 10.1 is released by now, so it might end up as "The problem is what
> happens when we detect after we have done a release that something has
> gone wrong" from [2].
> But I still can't believe only we see this and therefore for now want
> to believe I messed up on our side when merging 10.1 :-)
>
> For now, this is a call to ask whether others have also seen an older
> release throw the following when migrating to 10.1:
>   error: operation failed: guest CPU doesn't match specification: missing features: pdcm,arch-capabilities
>
> Hector will later today reply here with a summary of what we found so
> far, to provide you a more complete picture to think about, without
> having to read through all the messy interim steps in the Ubuntu bug.
>
> [1]: https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/2121787
> [2]: https://gitlab.com/qemu-project/qemu/-/blob/master/docs/devel/migration/compatibility.rst?plain=1#L322
>
> > The hope was that these would help to further identify what is going
> > on, but despite the urgency of the imminent release, I have not yet
> > managed to find the time in the last two days :-/
> >
> > > Sorry for the delay in answering (and thanks Daniel for bringing this to
> > > my attention).
> > >
> > > Thanks,
> > >
> > > Paolo
> > >
> >
> >
> > --
> > Christian Ehrhardt
> > Director of Engineering, Ubuntu Server
> > Canonical Ltd
>
>
>
> --
> Christian Ehrhardt
> Director of Engineering, Ubuntu Server
> Canonical Ltd
>


-- 
Hector CAO
Software Engineer – Partner Engineering Team
hector.cao@canonical.com
https://launchpad.net/~hectorcao

