From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=nLL1=3F=vger.kernel.org=kvm-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-1.0 required=3.0 tests=MAILING_LIST_MULTI,
	SPF_HELO_NONE,SPF_PASS autolearn=no autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 9257EC33CB6
	for <kvm@archiver.kernel.org>; Thu, 16 Jan 2020 18:08:55 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by mail.kernel.org (Postfix) with ESMTP id 6DC62214AF
	for <kvm@archiver.kernel.org>; Thu, 16 Jan 2020 18:08:55 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S2395047AbgAPSIy convert rfc822-to-8bit (ORCPT
        <rfc822;kvm@archiver.kernel.org>); Thu, 16 Jan 2020 13:08:54 -0500
Received: from mail.kernel.org ([198.145.29.99]:45272 "EHLO mail.kernel.org"
        rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
        id S2395046AbgAPSIu (ORCPT <rfc822;kvm@vger.kernel.org>);
        Thu, 16 Jan 2020 13:08:50 -0500
From:   bugzilla-daemon@bugzilla.kernel.org
To:     kvm@vger.kernel.org
Subject: [Bug 206215] QEMU guest crash due to random 'general protection
 fault' since kernel 5.2.5 on i7-3517UE
Date:   Thu, 16 Jan 2020 18:08:49 +0000
X-Bugzilla-Reason: None
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: AssignedTo virtualization_kvm@kernel-bugs.osdl.org
X-Bugzilla-Product: Virtualization
X-Bugzilla-Component: kvm
X-Bugzilla-Version: unspecified
X-Bugzilla-Keywords: 
X-Bugzilla-Severity: blocking
X-Bugzilla-Who: sean.j.christopherson@intel.com
X-Bugzilla-Status: NEW
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P1
X-Bugzilla-Assigned-To: virtualization_kvm@kernel-bugs.osdl.org
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: 
Message-ID: <bug-206215-28872-ppTHYf6H3E@https.bugzilla.kernel.org/>
In-Reply-To: <bug-206215-28872@https.bugzilla.kernel.org/>
References: <bug-206215-28872@https.bugzilla.kernel.org/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 8BIT
X-Bugzilla-URL: https://bugzilla.kernel.org/
Auto-Submitted: auto-generated
MIME-Version: 1.0
Sender: kvm-owner@vger.kernel.org
Precedence: bulk
List-ID: <kvm.vger.kernel.org>
X-Mailing-List: kvm@vger.kernel.org

https://bugzilla.kernel.org/show_bug.cgi?id=206215

--- Comment #6 from Sean Christopherson (sean.j.christopherson@intel.com) ---
On Thu, Jan 16, 2020 at 07:38:54AM -0800, Sean Christopherson wrote:
> On Wed, Jan 15, 2020 at 08:08:32PM -0500, Derek Yerger wrote:
> > On 1/15/20 4:52 PM, Sean Christopherson wrote:
> > >+cc Derek, who is hitting the same thing.
> > >
> > >On Wed, Jan 15, 2020 at 09:18:56PM +0000,
> bugzilla-daemon@bugzilla.kernel.org wrote:
> > >>https://bugzilla.kernel.org/show_bug.cgi?id=206215
> > >*snip*
> > >that's a big smoking gun pointing at commit ca7e6b286333 ("KVM: X86: Fix
> > >fpu state crash in kvm guest"), which is commit e751732486eb upstream.
> > >
> > >1. Can you verify reverting ca7e6b286333 (or e751732486eb in upstream)
> > >    solves the issue?
> > >
> > >2. Assuming the answer is yes, on a buggy kernel, can you run with the
> > >    attached patch to try get debug info?
> > I did these out of order since I had 5.3.11 built with the patch, ready to
> > go for weeks now, waiting for an opportunity to test.
> > 
> > Win10 guest immediately BSOD'ed with:
> > 
> > WARNING: CPU: 2 PID: 9296 at include/linux/thread_info.h:55
> > kernel_fpu_begin+0x6b/0xc0
> 
> Can you provide the full stack trace of the WARN?  I'm hoping that will
> provide a hint as to what's going wrong.

Aha!  I found at least two cases where TIF_NEED_FPU_LOAD could be set
without the vCPU being preempted.

The comment on fpregs_lock() states that softirq can set TIF_NEED_FPU_LOAD,
which would not be handled by the preempt notifier.

  /*
   * Use fpregs_lock() while editing CPU's FPU registers or fpu->state.
   * A context switch will (and softirq might) save CPU's FPU registers to
                           ^^^^^^^^^^^^^^^^^^^
   * fpu->state and set TIF_NEED_FPU_LOAD leaving CPU's FPU registers in
   * a random state.
   */
  static inline void fpregs_lock(void)

The other scenario is from a stack trace from commit f775b13eedee ("x86,kvm:
move qemu/guest FPU switching out to vcpu_run"), which clearly shows that
kernel_fpu_begin() can be invoked without KVM being preempted.

  __warn+0xcb/0xf0
  warn_slowpath_null+0x1d/0x20
  kernel_fpu_disable+0x3f/0x50
  __kernel_fpu_begin+0x49/0x100
  kernel_fpu_begin+0xe/0x10
  crc32c_pcl_intel_update+0x84/0xb0
  crypto_shash_update+0x3f/0x110
  crc32c+0x63/0x8a [libcrc32c]
  dm_bm_checksum+0x1b/0x20 [dm_persistent_data]
  node_prepare_for_write+0x44/0x70 [dm_persistent_data]
  dm_block_manager_write_callback+0x41/0x50 [dm_persistent_data]
  submit_io+0x170/0x1b0 [dm_bufio]
  __write_dirty_buffer+0x89/0x90 [dm_bufio]
  __make_buffer_clean+0x4f/0x80 [dm_bufio]
  __try_evict_buffer+0x42/0x60 [dm_bufio]
  dm_bufio_shrink_scan+0xc0/0x130 [dm_bufio]
  shrink_slab.part.40+0x1f5/0x420
  shrink_node+0x22c/0x320
  do_try_to_free_pages+0xf5/0x330
  try_to_free_pages+0xe9/0x190
  __alloc_pages_slowpath+0x40f/0xba0
  __alloc_pages_nodemask+0x209/0x260
  alloc_pages_vma+0x1f1/0x250
  do_huge_pmd_anonymous_page+0x123/0x660
  handle_mm_fault+0xfd3/0x1330
  __get_user_pages+0x113/0x640
  get_user_pages+0x4f/0x60
  __gfn_to_pfn_memslot+0x120/0x3f0 [kvm]
  try_async_pf+0x66/0x230 [kvm]
  tdp_page_fault+0x130/0x280 [kvm]
  kvm_mmu_page_fault+0x60/0x120 [kvm]
  handle_ept_violation+0x91/0x170 [kvm_intel]
  vmx_handle_exit+0x1ca/0x1400 [kvm_intel]

Either of the above explains why pre-e751732486eb code waited until IRQs
are disabled by vcpu_enter_guest() to do switch_fpu_return().

Properly fixing soley within KVM is going to be somewhat painful.  The
most common case, vcpu_enter_guest(), which is being hit here, is easy
to handle by restoring the switch_fpu_return() that was removed by commit
e751732486eb.  The other obvious case I see is emulator's access of guest
fpu state, which will effectively require reverting commit 6ab0b9feb82a
("x86,kvm: remove KVM emulator get_fpu / put_fpu") along with new
implementations of the hooks to handle TIF_NEED_FPU_LOAD.

> > Then stashed the patch, reverted ca7e6b286333, compile, reboot.
> > 
> > Guest is running stable now on 5.3.11. Did test my CAD under the guest, did
> > not experience the crashes that had me stuck at 5.1.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.