netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Jürgen Groß" <jgross@suse.com>
To: Xin Li <xin@zytor.com>,
	linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
	linux-perf-users@vger.kernel.org, linux-hyperv@vger.kernel.org,
	virtualization@lists.linux.dev, linux-pm@vger.kernel.org,
	linux-edac@vger.kernel.org, xen-devel@lists.xenproject.org,
	linux-acpi@vger.kernel.org, linux-hwmon@vger.kernel.org,
	netdev@vger.kernel.org, platform-driver-x86@vger.kernel.org
Cc: tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
	acme@kernel.org, andrew.cooper3@citrix.com, peterz@infradead.org,
	namhyung@kernel.org, mark.rutland@arm.com,
	alexander.shishkin@linux.intel.com, jolsa@kernel.org,
	irogers@google.com, adrian.hunter@intel.com,
	kan.liang@linux.intel.com, wei.liu@kernel.org,
	ajay.kaher@broadcom.com, bcm-kernel-feedback-list@broadcom.com,
	tony.luck@intel.com, pbonzini@redhat.com, vkuznets@redhat.com,
	seanjc@google.com, luto@kernel.org, boris.ostrovsky@oracle.com,
	kys@microsoft.com, haiyangz@microsoft.com, decui@microsoft.com
Subject: Re: [RFC PATCH v2 21/34] x86/msr: Utilize the alternatives mechanism to write MSR
Date: Wed, 23 Apr 2025 18:05:19 +0200	[thread overview]
Message-ID: <37c88ea3-dd24-4607-9ee1-0f19025aaef3@suse.com> (raw)
In-Reply-To: <f7198308-e8f8-4cc5-b884-24bc5f408a2a@zytor.com>


[-- Attachment #1.1.1: Type: text/plain, Size: 6940 bytes --]

On 23.04.25 10:51, Xin Li wrote:
> On 4/22/2025 2:57 AM, Jürgen Groß wrote:
>> On 22.04.25 10:22, Xin Li (Intel) wrote:
>>> The story started from tglx's reply in [1]:
>>>
>>>    For actual performance relevant code the current PV ops mechanics
>>>    are a horrorshow when the op defaults to the native instruction.
>>>
>>>    look at wrmsrl():
>>>
>>>    wrmsrl(msr, val
>>>     wrmsr(msr, (u32)val, (u32)val >> 32))
>>>      paravirt_write_msr(msr, low, high)
>>>        PVOP_VCALL3(cpu.write_msr, msr, low, high)
>>>
>>>    Which results in
>>>
>>>     mov    $msr, %edi
>>>     mov    $val, %rdx
>>>     mov    %edx, %esi
>>>     shr    $0x20, %rdx
>>>     call    native_write_msr
>>>
>>>    and native_write_msr() does at minimum:
>>>
>>>     mov    %edi,%ecx
>>>     mov    %esi,%eax
>>>     wrmsr
>>>     ret
>>>
>>>    In the worst case 'ret' is going through the return thunk. Not to
>>>    talk about function prologues and whatever.
>>>
>>>    This becomes even more silly for trivial instructions like STI/CLI
>>>    or in the worst case paravirt_nop().
>>
>> This is nonsense.
>>
>> In the non-Xen case the initial indirect call is directly replaced with
>> STI/CLI via alternative patching, while for Xen it is replaced by a direct
>> call.
>>
>> The paravirt_nop() case is handled in alt_replace_call() by replacing the
>> indirect call with a nop in case the target of the call was paravirt_nop()
>> (which is in fact no_func()).
>>
>>>
>>>    The call makes only sense, when the native default is an actual
>>>    function, but for the trivial cases it's a blatant engineering
>>>    trainwreck.
>>
>> The trivial cases are all handled as stated above: a direct replacement
>> instruction is placed at the indirect call position.
> 
> The above comment was given in 2023 IIRC, and you have addressed it.
> 
>>
>>> Later a consensus was reached to utilize the alternatives mechanism to
>>> eliminate the indirect call overhead introduced by the pv_ops APIs:
>>>
>>>      1) When built with !CONFIG_XEN_PV, X86_FEATURE_XENPV becomes a
>>>         disabled feature, preventing the Xen code from being built
>>>         and ensuring the native code is executed unconditionally.
>>
>> This is the case today already. There is no need for any change to have
>> this in place.
>>
>>>
>>>      2) When built with CONFIG_XEN_PV:
>>>
>>>         2.1) If not running on the Xen hypervisor (!X86_FEATURE_XENPV),
>>>              the kernel runtime binary is patched to unconditionally
>>>              jump to the native MSR write code.
>>>
>>>         2.2) If running on the Xen hypervisor (X86_FEATURE_XENPV), the
>>>              kernel runtime binary is patched to unconditionally jump
>>>              to the Xen MSR write code.
>>
>> I can't see what is different here compared to today's state.
>>
>>>
>>> The alternatives mechanism is also used to choose the new immediate
>>> form MSR write instruction when it's available.
>>
>> Yes, this needs to be added.
>>
>>> Consequently, remove the pv_ops MSR write APIs and the Xen callbacks.
>>
>> I still don't see a major difference to today's solution.
> 
> The existing code generates:
> 
>      ...
>      bf e0 06 00 00          mov    $0x6e0,%edi
>      89 d6                   mov    %edx,%esi
>      48 c1 ea 20             shr    $0x20,%rdx
>      ff 15 07 48 8c 01       call   *0x18c4807(%rip)  # <pv_ops+0xb8>
>      31 c0                   xor    %eax,%eax
>      ...
> 
> And on native, the indirect call instruction is patched to a direct call
> as you mentioned:
> 
>      ...
>      bf e0 06 00 00          mov    $0x6e0,%edi
>      89 d6                   mov    %edx,%esi
>      48 c1 ea 20             shr    $0x20,%rdx
>      e8 60 3e 01 00          call   <{native,xen}_write_msr> # direct
>      90                      nop
>      31 c0                   xor    %eax,%eax
>      ...
> 
> 
> This patch set generates assembly w/o CALL on native:
> 
>      ...
>      e9 e6 22 c6 01          jmp    1f   # on native or nop on Xen
>      b9 e0 06 00 00          mov    $0x6e0,%ecx
>      e8 91 d4 fa ff          call   ffffffff8134ee80 <asm_xen_write_msr>
>      e9 a4 9f eb 00          jmp    ffffffff8225b9a0 <__x86_return_thunk>
>          ...
> 1:  b9 e0 06 00 00          mov    $0x6e0,%ecx   # immediate form here
>      48 89 c2                mov    %rax,%rdx
>      48 c1 ea 20             shr    $0x20,%rdx
>      3e 0f 30                ds wrmsr
>      ...
> 
> It's not a major change, but when it is patched to use the immediate form MSR 
> write instruction, it's straightforwardly streamlined.

It should be rather easy to switch the current wrmsr/rdmsr paravirt patching
locations to use the rdmsr/wrmsr instructions instead of doing a call to
native_*msr().

The case of the new immediate form could be handled the same way.

> 
>>
>> Only the "paravirt" term has been eliminated.
> 
> Yes.
> 
> But a PV guest doesn't operate at the highest privilege level, which
> means MSR instructions typically result in a #GP fault.  I actually think the 
> pv_ops MSR APIs are unnecessary because of this inherent
> limitation.
> 
> Looking at the Xen MSR code, except PMU and just a few MSRs, it falls
> back to executes native MSR instructions.  As MSR instructions trigger
> #GP, Xen takes control and handles them in 2 ways:
> 
>    1) emulate (or ignore) a MSR operation and skip the guest instruction.
> 
>    2) inject the #GP back to guest OS and let its #GP handler handle it.
>       But Linux MSR exception handler just ignores the MSR instruction
>       (MCE MSR exception will panic).
> 
> So why not let Xen handle all the details which it already tries to do?

Some MSRs are not handled that way, but via a kernel internal emulation.
And those are handled that way mostly due to performance reasons. And some
need special treatment.

> (Linux w/ such a change may not be able to run on old Xen hypervisors.)

Yes, and this is something to avoid.

And remember that Linux isn't the only PV-mode guest existing.

> BTW, if performance is a concern, writes to MSR_KERNEL_GS_BASE and
> MSR_GS_BASE anyway are hpyercalls into Xen.

Yes, and some other MSR writes are just NOPs with Xen-PV.


Juergen

[-- Attachment #1.1.2: OpenPGP public key --]
[-- Type: application/pgp-keys, Size: 3743 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 495 bytes --]

  reply	other threads:[~2025-04-23 16:05 UTC|newest]

Thread overview: 94+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-04-22  8:21 [RFC PATCH v2 00/34] MSR refactor with new MSR instructions support Xin Li (Intel)
2025-04-22  8:21 ` [RFC PATCH v2 01/34] x86/msr: Move rdtsc{,_ordered}() to <asm/tsc.h> Xin Li (Intel)
2025-04-23 14:13   ` Dave Hansen
2025-04-23 17:12     ` Xin Li
2025-04-22  8:21 ` [RFC PATCH v2 02/34] x86/msr: Remove rdpmc() Xin Li (Intel)
2025-04-23 14:23   ` Dave Hansen
2025-04-22  8:21 ` [RFC PATCH v2 03/34] x86/msr: Rename rdpmcl() to rdpmcq() Xin Li (Intel)
2025-04-23 14:24   ` Dave Hansen
2025-04-23 14:28   ` Sean Christopherson
2025-04-23 15:06     ` Dave Hansen
2025-04-23 17:23       ` Xin Li
2025-04-22  8:21 ` [RFC PATCH v2 04/34] x86/msr: Convert rdpmcq() into a function Xin Li (Intel)
2025-04-23 14:25   ` Dave Hansen
2025-04-22  8:21 ` [RFC PATCH v2 05/34] x86/msr: Return u64 consistently in Xen PMC read functions Xin Li (Intel)
2025-04-22  8:40   ` Jürgen Groß
2025-04-22  8:21 ` [RFC PATCH v2 06/34] x86/msr: Use the alternatives mechanism to read PMC Xin Li (Intel)
2025-04-22  8:38   ` Jürgen Groß
2025-04-22  9:12     ` Xin Li
2025-04-22  9:28       ` Juergen Gross
2025-04-23  7:40         ` Xin Li
2025-04-22  8:21 ` [RFC PATCH v2 07/34] x86/msr: Convert __wrmsr() uses to native_wrmsr{,q}() uses Xin Li (Intel)
2025-04-22  8:21 ` [RFC PATCH v2 08/34] x86/msr: Convert a native_wrmsr() use to native_wrmsrq() Xin Li (Intel)
2025-04-23 15:51   ` Dave Hansen
2025-04-23 17:27     ` Xin Li
2025-04-23 23:23     ` Xin Li
2025-04-22  8:21 ` [RFC PATCH v2 09/34] x86/msr: Add the native_rdmsrq() helper Xin Li (Intel)
2025-04-22  8:21 ` [RFC PATCH v2 10/34] x86/msr: Convert __rdmsr() uses to native_rdmsrq() uses Xin Li (Intel)
2025-04-22 15:09   ` Sean Christopherson
2025-04-23  9:27     ` Xin Li
2025-04-23 13:37       ` Sean Christopherson
2025-04-23 14:02       ` Dave Hansen
2025-04-22  8:21 ` [RFC PATCH v2 11/34] x86/msr: Remove calling native_{read,write}_msr{,_safe}() in pmu_msr_{read,write}() Xin Li (Intel)
2025-04-24  6:25   ` Mi, Dapeng
2025-04-24  7:16     ` Xin Li
2025-04-22  8:21 ` [RFC PATCH v2 12/34] x86/msr: Remove pmu_msr_{read,write}() Xin Li (Intel)
2025-04-24  6:33   ` Mi, Dapeng
2025-04-24  7:21     ` Xin Li
2025-04-24  7:43       ` Mi, Dapeng
2025-04-24  7:50         ` Xin Li
2025-04-24 10:05   ` Jürgen Groß
2025-04-24 17:49     ` Xin Li
2025-04-24 21:14       ` H. Peter Anvin
2025-04-24 22:24         ` Xin Li
2025-04-22  8:21 ` [RFC PATCH v2 13/34] x86/xen/msr: Remove the error pointer argument from set_reg() Xin Li (Intel)
2025-04-24 10:11   ` Jürgen Groß
2025-04-24 17:50     ` Xin Li
2025-04-22  8:21 ` [RFC PATCH v2 14/34] x86/msr: refactor pv_cpu_ops.write_msr{_safe}() Xin Li (Intel)
2025-04-24 10:16   ` Jürgen Groß
2025-04-22  8:21 ` [RFC PATCH v2 15/34] x86/msr: Replace wrmsr(msr, low, 0) with wrmsrq(msr, low) Xin Li (Intel)
2025-04-22  8:21 ` [RFC PATCH v2 16/34] x86/msr: Change function type of native_read_msr_safe() Xin Li (Intel)
2025-04-22  8:21 ` [RFC PATCH v2 17/34] x86/cpufeatures: Add a CPU feature bit for MSR immediate form instructions Xin Li (Intel)
2025-04-22  8:21 ` [RFC PATCH v2 18/34] x86/opcode: Add immediate form MSR instructions Xin Li (Intel)
2025-04-22  8:22 ` [RFC PATCH v2 19/34] x86/extable: Add support for " Xin Li (Intel)
2025-04-22  8:22 ` [RFC PATCH v2 20/34] x86/extable: Implement EX_TYPE_FUNC_REWIND Xin Li (Intel)
2025-04-22  8:22 ` [RFC PATCH v2 21/34] x86/msr: Utilize the alternatives mechanism to write MSR Xin Li (Intel)
2025-04-22  9:57   ` Jürgen Groß
2025-04-23  8:51     ` Xin Li
2025-04-23 16:05       ` Jürgen Groß [this message]
2025-04-24  8:06         ` Xin Li
2025-04-24  8:14           ` Jürgen Groß
2025-04-25  1:15             ` H. Peter Anvin
2025-04-25  3:44               ` H. Peter Anvin
2025-04-25  7:01                 ` Jürgen Groß
2025-04-25 15:28                   ` H. Peter Anvin
2025-04-25  6:51               ` Jürgen Groß
2025-04-25 12:33         ` Peter Zijlstra
2025-04-25 12:51           ` Jürgen Groß
2025-04-25 20:12             ` H. Peter Anvin
2025-04-25 15:29           ` H. Peter Anvin
2025-04-25  7:11     ` Peter Zijlstra
2025-04-22  8:22 ` [RFC PATCH v2 22/34] x86/msr: Utilize the alternatives mechanism to read MSR Xin Li (Intel)
2025-04-22  8:59   ` Jürgen Groß
2025-04-22  9:20     ` Xin Li
2025-04-22  9:57       ` Jürgen Groß
2025-04-22 11:12   ` Jürgen Groß
2025-04-23  9:03     ` Xin Li
2025-04-23 16:11       ` Jürgen Groß
2025-04-22  8:22 ` [RFC PATCH v2 23/34] x86/extable: Remove new dead code in ex_handler_msr() Xin Li (Intel)
2025-04-22  8:22 ` [RFC PATCH v2 24/34] x86/mce: Use native MSR API __native_{wr,rd}msrq() Xin Li (Intel)
2025-04-22  8:22 ` [RFC PATCH v2 25/34] x86/msr: Rename native_wrmsrq() to native_wrmsrq_no_trace() Xin Li (Intel)
2025-04-22  8:22 ` [RFC PATCH v2 26/34] x86/msr: Rename native_wrmsr() to native_wrmsr_no_trace() Xin Li (Intel)
2025-04-22  8:22 ` [RFC PATCH v2 27/34] x86/msr: Rename native_write_msr() to native_wrmsrq() Xin Li (Intel)
2025-04-22  8:22 ` [RFC PATCH v2 28/34] x86/msr: Rename native_write_msr_safe() to native_wrmsrq_safe() Xin Li (Intel)
2025-04-22  8:22 ` [RFC PATCH v2 29/34] x86/msr: Rename native_rdmsrq() to native_rdmsrq_no_trace() Xin Li (Intel)
2025-04-22  8:22 ` [RFC PATCH v2 30/34] x86/msr: Rename native_rdmsr() to native_rdmsr_no_trace() Xin Li (Intel)
2025-04-22  8:22 ` [RFC PATCH v2 31/34] x86/msr: Rename native_read_msr() to native_rdmsrq() Xin Li (Intel)
2025-04-22  8:22 ` [RFC PATCH v2 32/34] x86/msr: Rename native_read_msr_safe() to native_rdmsrq_safe() Xin Li (Intel)
2025-04-22  8:22 ` [RFC PATCH v2 33/34] x86/msr: Move the ARGS macros after the MSR read/write APIs Xin Li (Intel)
2025-04-22  8:22 ` [RFC PATCH v2 34/34] x86/msr: Convert native_rdmsr_no_trace() uses to native_rdmsrq_no_trace() uses Xin Li (Intel)
2025-04-22 15:03 ` [RFC PATCH v2 00/34] MSR refactor with new MSR instructions support Sean Christopherson
2025-04-22 17:51   ` Xin Li
2025-04-22 18:05     ` Luck, Tony
2025-04-22 19:44       ` Ingo Molnar
2025-04-22 19:51         ` Sean Christopherson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=37c88ea3-dd24-4607-9ee1-0f19025aaef3@suse.com \
    --to=jgross@suse.com \
    --cc=acme@kernel.org \
    --cc=adrian.hunter@intel.com \
    --cc=ajay.kaher@broadcom.com \
    --cc=alexander.shishkin@linux.intel.com \
    --cc=andrew.cooper3@citrix.com \
    --cc=bcm-kernel-feedback-list@broadcom.com \
    --cc=boris.ostrovsky@oracle.com \
    --cc=bp@alien8.de \
    --cc=dave.hansen@linux.intel.com \
    --cc=decui@microsoft.com \
    --cc=haiyangz@microsoft.com \
    --cc=hpa@zytor.com \
    --cc=irogers@google.com \
    --cc=jolsa@kernel.org \
    --cc=kan.liang@linux.intel.com \
    --cc=kvm@vger.kernel.org \
    --cc=kys@microsoft.com \
    --cc=linux-acpi@vger.kernel.org \
    --cc=linux-edac@vger.kernel.org \
    --cc=linux-hwmon@vger.kernel.org \
    --cc=linux-hyperv@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-perf-users@vger.kernel.org \
    --cc=linux-pm@vger.kernel.org \
    --cc=luto@kernel.org \
    --cc=mark.rutland@arm.com \
    --cc=mingo@redhat.com \
    --cc=namhyung@kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=pbonzini@redhat.com \
    --cc=peterz@infradead.org \
    --cc=platform-driver-x86@vger.kernel.org \
    --cc=seanjc@google.com \
    --cc=tglx@linutronix.de \
    --cc=tony.luck@intel.com \
    --cc=virtualization@lists.linux.dev \
    --cc=vkuznets@redhat.com \
    --cc=wei.liu@kernel.org \
    --cc=x86@kernel.org \
    --cc=xen-devel@lists.xenproject.org \
    --cc=xin@zytor.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).