From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S932078AbbDBSve (ORCPT <rfc822;w@1wt.eu>);
	Thu, 2 Apr 2015 14:51:34 -0400
Received: from youngberry.canonical.com ([91.189.89.112]:52347 "EHLO
	youngberry.canonical.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1753300AbbDBSvb (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Thu, 2 Apr 2015 14:51:31 -0400
Message-ID: <551D8FAF.5070805@canonical.com>
Date: Thu, 02 Apr 2015 13:51:27 -0500
From: Chris J Arges <chris.j.arges@canonical.com>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.5.0
MIME-Version: 1.0
To: Ingo Molnar <mingo@kernel.org>,
        Linus Torvalds <torvalds@linux-foundation.org>
CC: Rafael David Tinoco <inaddy@ubuntu.com>, Peter Anvin <hpa@zytor.com>,
        Jiang Liu <jiang.liu@linux.intel.com>,
        Peter Zijlstra <peterz@infradead.org>,
        LKML <linux-kernel@vger.kernel.org>, Jens Axboe <axboe@kernel.dk>,
        Frederic Weisbecker <fweisbec@gmail.com>,
        Gema Gomez <gema.gomez-solano@canonical.com>,
        the arch/x86 maintainers <x86@kernel.org>
Subject: Re: smp_call_function_single lockups
References: <CA+55aFxd1WGNBzSHeOGiXXdUD1GqDYv9PUNGdrdiGFwaX7HYJQ@mail.gmail.com> <CA+55aFxFkw7cKu6R8-v9z=c+yG+jsPHyQKW5-yyn3+M0BuyvxA@mail.gmail.com> <20150331031536.GA9303@canonical.com> <CA+55aFykg3SAO16=NRiC+tP1gGj5hgbu+Y93ss4Qg30+qyZ=+w@mail.gmail.com> <20150331222327.GA12512@canonical.com> <20150401124336.GB12841@gmail.com> <20150401161047.GD12730@canonical.com> <CA+55aFxQ6q7MNS+4XWZ3=Xa0Hz6kumd84v_aEw3M4gBpXszTkQ@mail.gmail.com> <551C6A48.9060805@canonical.com> <CA+55aFw2Jb4ASOxckY1cwP23fAYv5dG1WYCkB6RyjjpP2hEQcw@mail.gmail.com> <20150402182607.GA8896@gmail.com>
In-Reply-To: <20150402182607.GA8896@gmail.com>
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: 8bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On 04/02/2015 01:26 PM, Ingo Molnar wrote:
> 
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
>> So unless we find a real clear signature of the bug (I was hoping 
>> that the ISR bit would be that sign), I don't think trying to bisect 
>> it based on how quickly you can reproduce things is worthwhile.
> 
> So I'm wondering (and I might have missed some earlier report that 
> outlines just that), now that the possible location of the bug is 
> again sadly up to 15+ million lines of code, I have no better idea 
> than to debug by symptoms again: what kind of effort was made to 
> examine the locked up state itself?
>

Ingo,

Rafael did some analysis earlier, while I was out. It is here:
https://lkml.org/lkml/2015/2/23/234

My reproducer setup is as follows:
L0 - 8-way CPU, 48 GB memory
L1 - 2-way vCPU, 4 GB memory
L2 - 1-way vCPU, 1 GB memory

Stress is only run in the L2 VM, and running top on L0/L1 doesn't show
excessive load.

> Softlockups always have some direct cause, which task exactly causes 
> scheduling to stop altogether, why does it lock up - or is it not a 
> clear lockup, just a very slow system?
> 
> Thanks,
> 
> 	Ingo
> 

Whenever we look through the crashdump we see csd_lock_wait waiting for
the CSD_FLAG_LOCK bit to be cleared.  Usually the signature leading up
to that looks like the following (in the openstack tempest on openstack
and nested VM stress cases):

(qemu-system-x86 task)
kvm_sched_in
 -> kvm_arch_vcpu_load
  -> vmx_vcpu_load
   -> loaded_vmcs_clear
    -> smp_call_function_single

(ksmd task)
pmdp_clear_flush
 -> flush_tlb_mm_range
  -> native_flush_tlb_others
   -> smp_call_function_many
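
For anyone following along, here is a rough userspace model of the wait
we see in the crashdumps. It is only a sketch and not the kernel code:
the csd_model_* names and the max_spins bound are made up for
illustration (as far as I know the real csd_lock_wait spins with no
bound), and C11 atomics stand in for the kernel's barriers. The point is
that the waiter can only make progress once the other side clears
CSD_FLAG_LOCK, so if the target CPU never runs the callback the caller
spins until the softlockup detector fires.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Same bit name as in the kernel; the value is assumed for this model. */
#define CSD_FLAG_LOCK 0x01u

/* Hypothetical stand-in for struct call_single_data: only the flags word. */
struct csd_model {
	atomic_uint flags;
};

/* Sender side: mark the csd as in flight before queueing it for the target. */
void csd_model_lock(struct csd_model *csd)
{
	atomic_fetch_or_explicit(&csd->flags, CSD_FLAG_LOCK,
				 memory_order_acquire);
}

/* Target side: clear the bit once the requested function has run. */
void csd_model_unlock(struct csd_model *csd)
{
	/* Unlocking a csd that was never locked would be a bug. */
	assert(atomic_load(&csd->flags) & CSD_FLAG_LOCK);
	atomic_fetch_and_explicit(&csd->flags, ~CSD_FLAG_LOCK,
				  memory_order_release);
}

/*
 * Sender side: spin until the target clears CSD_FLAG_LOCK.
 * Returns true if the bit was cleared, false if it was still set after
 * max_spins polls.  The bound exists only so the model terminates.
 */
bool csd_model_lock_wait(struct csd_model *csd, unsigned long max_spins)
{
	while (atomic_load_explicit(&csd->flags, memory_order_acquire) &
	       CSD_FLAG_LOCK) {
		if (max_spins-- == 0)
			return false;	/* still locked: the lockup case */
	}
	return true;
}
```

In this model, calling csd_model_lock() and then csd_model_lock_wait()
without anyone calling csd_model_unlock() returns false, which is the
state both traces above end up in.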

--chris