public inbox for linux-acpi@vger.kernel.org
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: Kuppuswamy Sathyanarayanan 
	<sathyanarayanan.kuppuswamy@linux.intel.com>,
	Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	x86@kernel.org, "Rafael J . Wysocki" <rjw@rjwysocki.net>,
	"H . Peter Anvin" <hpa@zytor.com>,
	Tony Luck <tony.luck@intel.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Andi Kleen <ak@linux.intel.com>,
	Kuppuswamy Sathyanarayanan <knsathya@kernel.org>,
	linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org
Subject: Re: [PATCH v2] x86: Skip WBINVD instruction for VM guest
Date: Fri, 3 Dec 2021 01:21:09 +0300
Message-ID: <20211202222109.pcsgm2jska3obvmx@black.fi.intel.com>
In-Reply-To: <87pmqpjcef.ffs@tglx>

On Thu, Nov 25, 2021 at 01:40:24AM +0100, Thomas Gleixner wrote:
> Kuppuswamy,
> 
> On Thu, Nov 18 2021 at 20:03, Kuppuswamy Sathyanarayanan wrote:
> > ACPI mandates that CPU caches be flushed before entering any sleep
> > state. This ensures that the CPU and its caches can be powered down
> > without losing data.
> >
> > ACPI-based VMs have maintained this sleep-state-entry behavior.
> > However, cache flushing for VM sleep state entry is useless. Unlike on
> > bare metal, guest sleep states are not correlated with potential data
> > loss of any kind; the host is responsible for data preservation. In
> > fact, some KVM configurations simply skip the cache flushing
> > instruction (see need_emulate_wbinvd()).
> 
> KVM starts out with kvm->arch.noncoherent_dma_count = 0 which makes
> need_emulate_wbinvd() skip WBINVD emulation. So far so good.
> 
> VFIO has code to invoke kvm_arch_register_noncoherent_dma() which
> increments the count which will subsequently cause WBINVD emulation to
> be enabled. What now?
> 
> > Further, on TDX systems, the WBINVD instruction causes an
> > unconditional #VE exception.  If this cache flushing remained, it would
> > need extra code in the form of a #VE handler.
> >
> > All use of ACPI_FLUSH_CPU_CACHE() appears to be in sleep-state-related
> > code.
> 
> C3 is considered a sleep state nowadays? Also ACPI_FLUSH_CPU_CACHE() is
> used in other places which have nothing to do with sleep states.
> 
> git grep is not rocket science to use.
> 
> > This means that the ACPI use of WBINVD is at *best* superfluous.
> 
> Really? You probably meant to say:
> 
>   This means that the ACPI usage of WBINVD from within a guest is at
>   best superfluous.
> 
> No?
> 
> But aside of that this does not give any reasonable answers why
> disabling WBINVD for guests unconditionally in ACPI_FLUSH_CPU_CACHE()
> and the argumentation vs. need_emulate_wbinvd() are actually correct
> under all circumstances.
> 
> I'm neither going to do that analysis nor am I going to accept a patch
> which comes with 'appears' based arguments and some handwavy references
> to disabled WBINVD emulation code which can obviously be enabled for a
> reason.
> 
> The even more interesting question for me is how a TDX guest is dealing
> with all other potential invocations of WBINVD all over the place. Are
> they all going to get the same treatment or are those magically going to
> be never executed in TDX guests?
> 
> I really have to ask why SEV can deal with WBINVD and other things just
> nicely by implementing trivial #VC handler functions, but TDX has to
> prematurely optimize the kernel tree based on half baken arguments?
> 
> Having a few trivial #VE handlers is not the end of the world. You can
> revisit that once basic support for TDX is merged in order to gain
> performance or whatever.
> 
> Either that or you provide patches with arguments which are based on
> proper analysis and not on 'appears to' observations.

I think the right solution to the WBINVD problem would be to add a #VE
handler that does nothing. We don't have a reasonable way to handle it from
within the guest. We could call the VMM in the hope that it would handle it,
but the VMM is untrusted and can ignore the request.
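
To make the proposal concrete, here is a minimal user-space sketch of what a
do-nothing handler could look like. It is not the actual kernel code:
struct ve_info, struct regs and handle_ve() are hypothetical, simplified
stand-ins, and 54 is the VMX basic exit reason for WBINVD. All the handler
does for WBINVD is skip the instruction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* VMX basic exit reason for WBINVD. */
#define EXIT_REASON_WBINVD	54

/* Hypothetical, simplified model of the information delivered with a #VE. */
struct ve_info {
	uint64_t exit_reason;	/* why the #VE was raised */
	uint32_t instr_len;	/* length of the faulting instruction in bytes */
};

/* Hypothetical stand-in for the saved register state. */
struct regs {
	uint64_t ip;
};

/*
 * Returns true if the #VE was handled. WBINVD is treated as a no-op:
 * no cache flush is attempted and the VMM is not asked for help.
 */
static bool handle_ve(struct regs *regs, const struct ve_info *ve)
{
	assert(regs && ve);

	switch (ve->exit_reason) {
	case EXIT_REASON_WBINVD:
		/* Nothing to do from the guest's point of view. */
		break;
	default:
		/* Unknown exit reason: leave it to the caller to report. */
		return false;
	}

	/* Skip the faulting instruction so execution resumes after it. */
	regs->ip += ve->instr_len;
	return true;
}
```

With ip = 0x1000 and a 2-byte WBINVD (opcode 0F 09), handle_ve() returns true
and leaves ip at 0x1002. Any other exit reason is reported as unhandled and
ip is left untouched.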

Dave suggested that we need to do a code audit to make sure that there is no
user inside the TDX guest environment that relies on WBINVD to work correctly.

Below is the full call tree of WBINVD. It is substantially larger than I
anticipated from the initial grep.

Conclusions:

  - Most of the callers are in ACPI code that changes S-states. Ignoring the
    cache flush for an S-state change on a virtual machine should be safe.

  - The only WBINVD I was able to trigger is on poweroff from ACPI code.
    Reboot should also trigger it, but for some reason I don't see it.

  - A few callers are in CPU offline code. TDX does not allow offlining a CPU
    as we cannot bring it back -- we don't have SIPI. And even if offlining
    works for a vCPU, it should be safe to ignore WBINVD there.

  - NVDIMMs are not supported inside TDX. If that changes, we would need
    to deal with cache flushing for this case. Hopefully, we would be able
    to avoid WBINVD.

  - Cache QoS and MTRR use WBINVD. They are disabled in TDX, but whether
    the features are advertised is controlled by the VMM. We would need to
    filter CPUID/MSRs to make sure the VMM does not mess with them.
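
The CPUID side of such filtering might look roughly like the sketch below. It
is illustrative only: struct cpuid_regs and filter_cpuid() are hypothetical
names, the bit positions are the architectural ones for MTRR
(CPUID.1:EDX[12]) and for RDT monitoring and allocation (CPUID.7.0:EBX[12]
and EBX[15]), and how a real TDX guest would hook this in is left open. MSR
filtering is not shown.

```c
#include <assert.h>
#include <stdint.h>

/* Architectural CPUID feature bits for the WBINVD users in question. */
#define CPUID_1_EDX_MTRR	(1u << 12)	/* MTRR support */
#define CPUID_7_EBX_CQM		(1u << 12)	/* RDT monitoring (cache QoS) */
#define CPUID_7_EBX_RDT_A	(1u << 15)	/* RDT allocation */

/* Hypothetical container for the four CPUID output registers. */
struct cpuid_regs {
	uint32_t eax, ebx, ecx, edx;
};

/*
 * Clear the feature bits the guest must never see, regardless of what
 * the (untrusted) VMM reported for this leaf/subleaf.
 */
static void filter_cpuid(uint32_t leaf, uint32_t subleaf, struct cpuid_regs *r)
{
	assert(r);

	if (leaf == 1)
		r->edx &= ~CPUID_1_EDX_MTRR;
	else if (leaf == 7 && subleaf == 0)
		r->ebx &= ~(CPUID_7_EBX_CQM | CPUID_7_EBX_RDT_A);
}
```

The guest would apply filter_cpuid() to every CPUID result before feature
detection runs, so that the MTRR and resctrl code paths that call WBINVD are
never enabled even if the VMM advertises the features.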

Is this good enough justification for a do-nothing #VE WBINVD handler?

WBINVD
  native_wbinvd()
    wbinvd()
      ACPI_FLUSH_CPU_CACHE()
        acpi_hw_extended_sleep()
          acpi_enter_sleep_state()
            x86_acpi_enter_sleep_state()
              do_suspend_lowlevel()
                x86_acpi_suspend_lowlevel()
                  acpi_suspend_enter()
                    >>> On S3: No suspend-to-ram -- no problem
            acpi_db_do_one_sleep_state()
              acpi_db_sleep()
                acpi_db_command_dispatch()
                  >>> "SLEEP" command of the ACPI debugger. I guess it can trigger poweroff. WBINVD doesn't make any difference in TDX.
            acpi_hibernation_enter()
              >>> On S4. No hibernate -- no problem.
            acpi_power_off()
              >>> On S5. Triggerable on poweroff, but safe to ignore WBINVD here on TDX
            acpi_suspend_enter()
              >>> On S1. No S1 -- no problem.
            xen_acpi_suspend_lowlevel()
              >>> N/A to TDX.
        acpi_hw_legacy_sleep()
          acpi_enter_sleep_state()
            >>> See above. For ACPI_REDUCED_HARDWARE.
        acpi_enter_sleep_state_s4bios()
          >>> No users? Or I failed to decipher the ACPI code.
        acpi_idle_enter()
          acpi_processor_setup_cstates()
            acpi_processor_setup_cpuidle_states()
              acpi_processor_power_state_has_changed()
                acpi_processor_notify()
                  >>> Looks like the driver gets an event when the number of power states changes. But I may be mistaken. Anyway, skipping WBINVD is safe.
              acpi_processor_power_init()
                >>> Only applicable if acpi_idle_driver is in use. N/A to TDX.
        acpi_idle_enter_s2idle()
          acpi_processor_setup_cstates()
            >>> See above.
        acpi_idle_play_dead()
          acpi_processor_setup_cstates()
            >>> See above.
        acpi_sleep_prepare()
          >>> On the way to S3/S4/S5. Safe to ignore WBINVD
        acpi_suspend_enter()
          >>> On the way to S3/S4/S5. Safe to ignore WBINVD
        acpi_hibernation_enter()
          >>> On S4, No S4 -- no problem.
        <Bunch of callers in cpufreq/longhaul.c>
          >>> CPU frequency driver for VIA Cyrix CPU. N/A to TDX.
      flush_agp_cache()
        ipi_handler()
          global_cache_flush()
            >>> Used by a bunch of random AGP drivers. N/A to TDX: device passthrough is not supported.
      wbinvd_on_cpu()
        amd_l3_disable_index()
          >>> N/A to TDX
      gart_iommu_init()
        >>> N/A to TDX
      init_amd_k6()
        >>> N/A to TDX
      amd_set_mtrr()
        >>> N/A to TDX
      prepare_set() in mtrr/cyrix.c
        >>> N/A to TDX
      post_set() in mtrr/cyrix.c
        >>> N/A to TDX
      prepare_set() mtrr/generic.c
        >>> MTRR is disabled, but that is under the control of the VMM.
      mwait_play_dead()
        native_play_dead()
          sev_es_play_dead()
            >>> N/A to TDX.
          play_dead()
            arch_cpu_idle_enter()
              do_idle()
                >>> Only for offline CPUs. Offlining is disabled on TDX.
      hlt_play_dead()
        native_play_dead()
          >>> See above
        resume_play_dead()
          hibernate_resume_nonboot_cpu_disable()
            >>> No hibernate -- no problem.
      pseudo_lock_fn()
        rdtgroup_pseudo_lock_create()
          rdtgroup_schemata_write()
            res_common_files[]
              rdtgroup_init()
                resctrl_late_init()
                  >>> Depends on Cache QoS features that are configured by the VMM.
      wbinvd_ipi() in kvm/x86.c
        >>> KVM emulation of WBINVD. N/A for TDX guest.
      __wbinvd()
        wbinvd_on_cpu()
          >>> See above
        wbinvd_on_all_cpus()
          sev_flush_asids() and other users in kvm/svm/sev.c
            >>> N/A to TDX
          nvdimm_invalidate_cache()
            >>> No NVDIMMs in TDX
          i830_chipset_flush()
            >>> N/A to TDX
          __sev_platform_init_locked()
            >>> N/A to TDX
          drm_clflush_virt_range(), drm_clflush_pages(), drm_clflush_sg()
            >>> Only for !X86_FEATURE_CLFLUSH, N/A to TDX.
          Few callers in i915
            >>> N/A to TDX
      __sme_early_enc_dec()
        >>> N/A to TDX
      __cpa_flush_all()
        cpa_flush_all()
          cpa_flush()
            >>> Only for !X86_FEATURE_CLFLUSH. N/A to TDX.
      powernow_k6_set_cpu_multiplier()
        >>> N/A to TDX
      disable_caches()
        inject_write_store() in amd64_edac.c
          >>> N/A to TDX
      drm_ati_pcigart_init()
        >>> N/A to TDX
      nettel_init() and other nettel users
        >>> N/A to TDX
      atomisp_acc_start() and other atomisp users
        >>> N/A to TDX
    apply_microcode_intel()/apply_microcode_early()
      >>> N/A to TDX
  identity_mapped()
    >>> Only for AMD SME
  __enc_copy in /mem_encrypt_boot.S
    >>> N/A to TDX
  wakeup_start in platform/olpc/xo1-wakeup.S
    >>> N/A to TDX
  machine_real_restart_asm16 in realmode/rm/reboot.S
    >>> Safe to ignore WBINVD on TDX
  trampoline_start in realmode/rm/trampoline_64.S
    >>> TDX doesn't use realmode trampoline
  flush_cache() in i810_main.h
    >>> N/A to TDX
-- 
 Kirill A. Shutemov


Thread overview: 29+ messages
     [not found] <YZPbQVwWOJCrAH78@zn.tnic>
2021-11-19  4:03 ` [PATCH v2] x86: Skip WBINVD instruction for VM guest Kuppuswamy Sathyanarayanan
2021-11-25  0:40   ` Thomas Gleixner
2021-12-02 22:21     ` Kirill A. Shutemov [this message]
2021-12-02 22:38       ` Dave Hansen
2021-12-02 23:48       ` Thomas Gleixner
2021-12-03 23:49         ` Kirill A. Shutemov
2021-12-04  0:20           ` Dave Hansen
2021-12-04  0:54             ` Kirill A. Shutemov
2021-12-06 15:35               ` Dave Hansen
2021-12-06 16:39                 ` Dan Williams
2021-12-06 16:53                   ` Dave Hansen
2021-12-06 17:51                     ` Dan Williams
2021-12-04 20:27           ` Rafael J. Wysocki
2021-12-06 12:29             ` [PATCH 0/4] ACPI/ACPICA: Only flush caches on S1/S2/S3 and C3 Kirill A. Shutemov
2021-12-06 12:29               ` [PATCH 1/4] ACPICA: Do not flush cache for on entering S4 and S5 Kirill A. Shutemov
2021-12-08 14:58                 ` Rafael J. Wysocki
2021-12-06 12:29               ` [PATCH 2/4] ACPI: PM: Remove redundant cache flushing Kirill A. Shutemov
2021-12-07 16:35                 ` Rafael J. Wysocki
2021-12-09 13:32                   ` Kirill A. Shutemov
2021-12-17 18:04                     ` Rafael J. Wysocki
2021-12-06 12:29               ` [PATCH 3/4] ACPI: processor idle: Only flush cache on entering C3 Kirill A. Shutemov
2021-12-06 15:03                 ` Peter Zijlstra
2021-12-08 16:26                   ` Rafael J. Wysocki
2021-12-09 13:33                     ` Kirill A. Shutemov
2021-12-17 17:58                       ` Rafael J. Wysocki
2021-12-06 12:29               ` [PATCH 4/4] ACPI: PM: Avoid cache flush on entering S4 Kirill A. Shutemov
2021-12-08 15:10                 ` Rafael J. Wysocki
2021-12-08 16:04                   ` Kirill A. Shutemov
2021-12-08 16:16                     ` Rafael J. Wysocki
