From: bugzilla-daemon@kernel.org
To: kvm@vger.kernel.org
Subject: [Bug 218792] Guest call trace with mwait enabled
Date: Thu, 31 Jul 2025 08:59:52 +0000
Message-ID: <bug-218792-28872-QaPSh9KPXG@https.bugzilla.kernel.org/>
In-Reply-To: <bug-218792-28872@https.bugzilla.kernel.org/>

https://bugzilla.kernel.org/show_bug.cgi?id=218792

--- Comment #5 from chenyi.qiang@intel.com ---
On 5/1/2024 12:41 AM, Sean Christopherson wrote:
> On Tue, Apr 30, 2024, bugzilla-daemon@kernel.org wrote:
>> https://bugzilla.kernel.org/show_bug.cgi?id=218792
>>
>>             Bug ID: 218792
>>            Summary: Guest call trace with mwait enabled
>>            Product: Virtualization
>>            Version: unspecified
>>           Hardware: Intel
>>                 OS: Linux
>>             Status: NEW
>>           Severity: normal
>>           Priority: P3
>>          Component: kvm
>>           Assignee: virtualization_kvm@kernel-bugs.osdl.org
>>           Reporter: farrah.chen@intel.com
>>         Regression: No
>>
>> Environment:
>> host/guest kernel:
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
>> e67572cd220(v6.9-rc6)
>> QEMU: https://gitlab.com/qemu-project/qemu.git master 5c6528dce86d
>> Host/Guest OS: Centos stream9/Ubuntu24.04
>>
>> Bug detail description: 
>> Boot Guest with mwait enabled(-overcommit cpu-pm=on), guest call trace
>> "unchecked MSR access error"
>>
>> Reproduce steps:
>> img=centos9.qcow2
>> qemu-system-x86_64 \
>>     -name legacy,debug-threads=on \
>>     -overcommit cpu-pm=on \
>>     -accel kvm -smp 8 -m 8G -cpu host \
>>     -drive file=${img},if=none,id=virtio-disk0 \
>>     -device virtio-blk-pci,drive=virtio-disk0 \
>>     -device virtio-net-pci,netdev=nic0 \
>>     -netdev user,id=nic0,hostfwd=tcp::10023-:22 \
>>     -vnc :1 -serial stdio
>>
>> Guest boot with call trace:
>> [ 0.475344] unchecked MSR access error: RDMSR from 0xe2 at rIP:
> 
> MSR 0xE2 is MSR_PKG_CST_CONFIG_CONTROL, which hpet_is_pc10_damaged() assumes
> exists if PC10 substates are supported. KVM doesn't emulate/support
> MSR_PKG_CST_CONFIG_CONTROL, i.e. injects a #GP on the guest RDMSR, hence the
> splat.  This isn't a KVM bug as KVM explicitly advertises all zeros for the
> MWAIT CPUID leaf, i.e. QEMU is effectively telling the guest that PC10
> substates are supported without KVM's explicit blessing.
> 
> That said, this is arguably a kernel bug (guest side), as I don't see
> anything in the SDM that _requires_ MSR_PKG_CST_CONFIG_CONTROL to exist if
> PC10 substates are supported.
> 
> The issue is likely benign, other than the obvious WARN.  The kernel
> gracefully handles the #GP and zeros the result, i.e. will always think PC10
> is _disabled_, which may or may not be correct, but is functionally ok if the
> HPET is being emulated by the host, which it probably is.
> 
>       rdmsrl(MSR_PKG_CST_CONFIG_CONTROL, pcfg);
>       if ((pcfg & 0xF) < 8)
>               return false;
> 
> The most straightforward fix, and probably the most correct all around, would
> be to use rdmsrl_safe() to suppress the WARN, i.e. have the kernel not yell if
> MSR_PKG_CST_CONFIG_CONTROL doesn't exist.  Unless HPET is also being passed
> through, that'll do the right thing when Linux is a guest.  And if a setup
> also passes through HPET, then the VMM can also trap-and-emulate
> MSR_PKG_CST_CONFIG_CONTROL as appropriate (doing so in QEMU without KVM
> support might be impossible, though again it's unnecessary if QEMU is
> emulating the HPET).
> 
> diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
> index c96ae8fee95e..2afafff18f92 100644
> --- a/arch/x86/kernel/hpet.c
> +++ b/arch/x86/kernel/hpet.c
> @@ -980,7 +980,9 @@ static bool __init hpet_is_pc10_damaged(void)
>                 return false;
>  
>         /* Check whether PC10 is enabled in PKG C-state limit */
> -       rdmsrl(MSR_PKG_CST_CONFIG_CONTROL, pcfg);
> +       if (rdmsrl_safe(MSR_PKG_CST_CONFIG_CONTROL, &pcfg))
> +               return false;
> +
>         if ((pcfg & 0xF) < 8)
>                 return false;

There are three places that can access MSR_PKG_CST_CONFIG_CONTROL:
1. hpet_is_pc10_damaged() in hpet.c
2. *_idle_state_table_update() in intel_idle.c (the call trace in this bug
   comes from this path in VMs)
3. auto_demotion_disable() in intel_idle.c

This MSR appears to be CPU-model specific rather than architectural.

Besides case 1 mentioned above, the intel_idle driver also uses it to query
the lowest processor-specific C-state for the package (case 2) and to disable
auto demotion (case 3), depending on the specific model.

I assume both case 2 and case 3 are aimed at improving energy efficiency. For
example, spr_idle_state_table_update() adjusts exit_latency/target_residency
to hardcoded values based on the package C-state limit. That seems
unreasonable in VMs: the hardcoded values are measured on the host, and the
guest CPU model may not match the host one if we only pass through this MSR.
Similarly, for case 3, there is no guarantee that disabling auto demotion
improves energy efficiency on an emulated CPU model.

Since there is no such fine-grained power management virtualization support
yet, can we change all the rdmsr/wrmsr(MSR_PKG_CST_CONFIG_CONTROL) calls to
the *_safe() variants, so that the related operations are skipped in VMs?
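
To make the proposal concrete, here is a minimal userspace sketch (not kernel
code and not a tested patch) of the "try the safe access, skip on failure"
pattern for cases 2 and 3. Everything prefixed mock_/MOCK_ is hypothetical and
only imitates the shape of rdmsrl_safe()/wrmsrl_safe(); the threshold, the
latency numbers and the flag bits are placeholders for illustration, not
values taken from intel_idle.c.

```c
/* Userspace mock of the proposed pattern; nothing here touches real MSRs. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MSR_PKG_CST_CONFIG_CONTROL	0xe2
#define MOCK_EIO			5	/* placeholder error code */

/* Pretend hardware: whether the MSR exists, and its current contents. */
static bool mock_msr_present;
static uint64_t mock_msr_value;

/* Imitates rdmsrl_safe(): returns 0 on success, nonzero if the read faults. */
static int mock_rdmsrl_safe(uint32_t msr, uint64_t *val)
{
	assert(val != NULL);
	if (msr != MSR_PKG_CST_CONFIG_CONTROL || !mock_msr_present) {
		*val = 0;
		return -MOCK_EIO;
	}
	*val = mock_msr_value;
	return 0;
}

/* Imitates wrmsrl_safe(): returns 0 on success, nonzero if the write faults. */
static int mock_wrmsrl_safe(uint32_t msr, uint64_t val)
{
	if (msr != MSR_PKG_CST_CONFIG_CONTROL || !mock_msr_present)
		return -MOCK_EIO;
	mock_msr_value = val;
	return 0;
}

/* Hypothetical stand-in for one entry of a C-state table. */
struct mock_cstate {
	unsigned int exit_latency;
	unsigned int target_residency;
};

/*
 * Case 2: adjust the table only if the MSR can be read.
 * Returns true if the entry was changed, false if it was left alone.
 */
static bool mock_idle_state_table_update(struct mock_cstate *c6)
{
	uint64_t msr;

	/* MSR missing (e.g. in a VM): keep the defaults, no WARN. */
	if (mock_rdmsrl_safe(MSR_PKG_CST_CONFIG_CONTROL, &msr))
		return false;

	/* Placeholder limit check and placeholder values. */
	if ((msr & 0x7) < 2) {
		c6->exit_latency = 100;
		c6->target_residency = 300;
		return true;
	}
	return false;
}

/*
 * Case 3: read-modify-write, skipped entirely if either access faults.
 * Returns true only if the flags were actually cleared.
 */
static bool mock_auto_demotion_disable(uint64_t flags)
{
	uint64_t msr;

	if (mock_rdmsrl_safe(MSR_PKG_CST_CONFIG_CONTROL, &msr))
		return false;

	msr &= ~flags;
	return mock_wrmsrl_safe(MSR_PKG_CST_CONFIG_CONTROL, msr) == 0;
}
```

With mock_msr_present set to false, both helpers return false and leave their
inputs untouched, which is the behaviour I would expect in a guest where KVM
does not emulate the MSR.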

> 
>> 0xffffffffb5a966b8 (native_read_msr+0x8/0x40)
>> [ 0.476465] Call Trace:
>> [ 0.476763] <TASK>
>> [ 0.477027] ? ex_handler_msr+0x128/0x140
>> [ 0.477460] ? fixup_exception+0x166/0x3c0
>> [ 0.477934] ? exc_general_protection+0xdc/0x3c0
>> [ 0.478481] ? asm_exc_general_protection+0x26/0x30
>> [ 0.479052] ? __pfx_intel_idle_init+0x10/0x10
>> [ 0.479587] ? native_read_msr+0x8/0x40
>> [ 0.480057] intel_idle_init_cstates_icpu.constprop.0+0x5e/0x560
>> [ 0.480747] ? __pfx_intel_idle_init+0x10/0x10
>> [ 0.481275] intel_idle_init+0x161/0x360
>> [ 0.481742] do_one_initcall+0x45/0x220
>> [ 0.482209] do_initcalls+0xac/0x130
>> [ 0.482643] kernel_init_freeable+0x134/0x1e0
>> [ 0.483159] ? __pfx_kernel_init+0x10/0x10
>> [ 0.483648] kernel_init+0x1a/0x1c0
>> [ 0.484087] ret_from_fork+0x31/0x50
>> [ 0.484541] ? __pfx_kernel_init+0x10/0x10
>> [ 0.485030] ret_from_fork_asm+0x1a/0x30
>> [ 0.485462] </TASK>
>>
>> -- 
>> You may reply to this email to add a comment.
>>
>> You are receiving this mail because:
>> You are watching the assignee of the bug.
>

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

