From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S932078AbbDBSve (ORCPT ); Thu, 2 Apr 2015 14:51:34 -0400
Received: from youngberry.canonical.com ([91.189.89.112]:52347 "EHLO
	youngberry.canonical.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1753300AbbDBSvb (ORCPT );
	Thu, 2 Apr 2015 14:51:31 -0400
Message-ID: <551D8FAF.5070805@canonical.com>
Date: Thu, 02 Apr 2015 13:51:27 -0500
From: Chris J Arges
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.5.0
MIME-Version: 1.0
To: Ingo Molnar, Linus Torvalds
CC: Rafael David Tinoco, Peter Anvin, Jiang Liu, Peter Zijlstra, LKML,
	Jens Axboe, Frederic Weisbecker, Gema Gomez,
	the arch/x86 maintainers
Subject: Re: smp_call_function_single lockups
References: <20150331031536.GA9303@canonical.com> <20150331222327.GA12512@canonical.com> <20150401124336.GB12841@gmail.com> <20150401161047.GD12730@canonical.com> <551C6A48.9060805@canonical.com> <20150402182607.GA8896@gmail.com>
In-Reply-To: <20150402182607.GA8896@gmail.com>
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: 8bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On 04/02/2015 01:26 PM, Ingo Molnar wrote:
>
> * Linus Torvalds wrote:
>
>> So unless we find a real clear signature of the bug (I was hoping
>> that the ISR bit would be that sign), I don't think trying to bisect
>> it based on how quickly you can reproduce things is worthwhile.
>
> So I'm wondering (and I might have missed some earlier report that
> outlines just that), now that the possible location of the bug is
> again sadly up to 15+ million lines of code, I have no better idea
> than to debug by symptoms again: what kind of effort was made to
> examine the locked up state itself?
>

Ingo,

Rafael did some analysis when I was out earlier here:
https://lkml.org/lkml/2015/2/23/234

My reproducer setup is as follows:
L0 - 8-way CPU, 48 GB memory
L1 - 2-way vCPU, 4 GB memory
L2 - 1-way vCPU, 1 GB memory

Stress is only run in the L2 VM, and running top on L0/L1 doesn't
show excessive load.

> Softlockups always have some direct cause, which task exactly causes
> scheduling to stop altogether, why does it lock up - or is it not a
> clear lockup, just a very slow system?
>
> Thanks,
>
> Ingo
>

Whenever we look through the crashdump we see csd_lock_wait waiting for
the CSD_FLAG_LOCK bit to be cleared. The signature leading up to that
usually looks like the following (in the openstack tempest on openstack
case and the nested VM stress case):

(qemu-system-x86 task)
kvm_sched_in
 -> kvm_arch_vcpu_load
 -> vmx_vcpu_load
 -> loaded_vmcs_clear
 -> smp_call_function_single

(ksmd task)
pmdp_clear_flush
 -> flush_tlb_mm_range
 -> native_flush_tlb_others
 -> smp_call_function_many

--chris
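
For anyone following along without the kernel source open, here is a rough
userspace model of the wait described above. It is only a sketch, not the
kernel code: the class, the thread standing in for the target CPU, the flag
value, and the timeout are all assumptions made for illustration. The caller
sets CSD_FLAG_LOCK, asks the other CPU to run a function, and spins until
that CPU clears the bit. If the request is never handled, the bit stays set
and the caller never makes progress, which matches what the crashdumps show.

```python
import threading
import time

CSD_FLAG_LOCK = 0x01  # bit name taken from the report; the value is assumed


class CallSingleData:
    """Toy stand-in for the per-call descriptor (hypothetical model)."""

    def __init__(self):
        self.flags = 0


def csd_lock_wait(csd, timeout):
    """Spin until CSD_FLAG_LOCK clears; return False if it never does.

    The timeout exists only so this model terminates and can report a
    stuck wait instead of hanging.
    """
    deadline = time.monotonic() + timeout
    while csd.flags & CSD_FLAG_LOCK:
        if time.monotonic() >= deadline:
            return False
        time.sleep(0.001)  # stands in for a CPU relax hint in the spin loop
    return True


def call_function_single(csd, target_handles_ipi, timeout=0.2):
    """Model a synchronous cross-CPU call: lock, notify target, wait."""
    csd.flags |= CSD_FLAG_LOCK

    def target_cpu():
        # The target clears the lock bit once it has run the request.
        csd.flags &= ~CSD_FLAG_LOCK

    worker = None
    if target_handles_ipi:
        worker = threading.Thread(target=target_cpu)
        worker.start()

    completed = csd_lock_wait(csd, timeout)
    if worker is not None:
        worker.join()
    return completed
```

Calling call_function_single(CallSingleData(), True) models the normal case
where the target responds; passing False models a request that is never
handled, leaving the waiter stuck on the lock bit.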