From mboxrd@z Thu Jan  1 00:00:00 1970
From: Jan Glauber <jan.glauber@caviumnetworks.com>
Subject: Re: RCU stall with high number of KVM vcpus
Date: Tue, 14 Nov 2017 15:19:36 +0100
Message-ID: <20171114141936.GA21650@hc>
References: <20171113131000.GA10546@hc>
 <2832f775-3cbe-d984-fe4f-e018642b6f1d@arm.com>
 <20171113173552.GA13282@hc>
 <7dda7be2-f392-8056-d4d3-372bb867729a@arm.com>
 <20171113184046.GA14678@hc>
 <e8e1af91-b755-e04e-6ab4-c47b570c9fe0@arm.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: kvm@vger.kernel.org, Paolo Bonzini <pbonzini@redhat.com>,
        Radim =?utf-8?B?S3LEjW3DocWZ?= <rkrcmar@redhat.com>,
        Christoffer Dall <christoffer.dall@linaro.org>,
        linux-arm-kernel@lists.infradead.org
To: Marc Zyngier <marc.zyngier@arm.com>
Return-path: <kvm-owner@vger.kernel.org>
Received: from mail-dm3nam03on0053.outbound.protection.outlook.com ([104.47.41.53]:48192
        "EHLO NAM03-DM3-obe.outbound.protection.outlook.com"
        rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP
        id S1754612AbdKNOTw (ORCPT <rfc822;kvm@vger.kernel.org>);
        Tue, 14 Nov 2017 09:19:52 -0500
Content-Disposition: inline
In-Reply-To: <e8e1af91-b755-e04e-6ab4-c47b570c9fe0@arm.com>
Sender: kvm-owner@vger.kernel.org
List-ID: <kvm.vger.kernel.org>

On Tue, Nov 14, 2017 at 01:30:07PM +0000, Marc Zyngier wrote:
> On 13/11/17 18:40, Jan Glauber wrote:
> > On Mon, Nov 13, 2017 at 06:11:19PM +0000, Marc Zyngier wrote:
> >> On 13/11/17 17:35, Jan Glauber wrote:
> >>> On Mon, Nov 13, 2017 at 01:47:38PM +0000, Marc Zyngier wrote:
> > 
> > [...]
> > 
> >>>> Please elaborate. Messed in what way? Corrupted? The guest crashing? Or
> >>>> is that a tooling issue?
> >>>
> >>> Every vcpu that oopses prints one line in parallel, so I get blocks like:
> >>> [58880.179814] [<ffff000008084b98>] ret_from_fork+0x10/0x18
> >>> [58880.179834] [<ffff000008084b98>] ret_from_fork+0x10/0x18
> >>> [58880.179847] [<ffff000008084b98>] ret_from_fork+0x10/0x18
> >>> [58880.179873] [<ffff000008084b98>] ret_from_fork+0x10/0x18
> >>> [58880.179893] [<ffff000008084b98>] ret_from_fork+0x10/0x18
> >>> [58880.179911] [<ffff000008084b98>] ret_from_fork+0x10/0x18
> >>> [58880.179917] [<ffff000008084b98>] ret_from_fork+0x10/0x18
> >>> [58880.180288] [<ffff000008084b98>] ret_from_fork+0x10/0x18
> >>> [58880.180303] [<ffff000008084b98>] ret_from_fork+0x10/0x18
> >>> [58880.180336] [<ffff000008084b98>] ret_from_fork+0x10/0x18
> >>> [58880.180363] [<ffff000008084b98>] ret_from_fork+0x10/0x18
> >>> [58880.180384] [<ffff000008084b98>] ret_from_fork+0x10/0x18
> >>> [58880.180415] [<ffff000008084b98>] ret_from_fork+0x10/0x18
> >>> [58880.180461] [<ffff000008084b98>] ret_from_fork+0x10/0x18
> >>>
> >>> I can send the full log if you want to have a look.
> >>
> >> Sure, send that over (maybe not over email though).
> > 
> > Here is the guest dmesg:
> > http://paste.ubuntu.com/25955682/
> 
> Yeah, that's because all the vcpus are getting starved at the same time,
> and spitting out interleaved traces... Not very useful anyway, as I
> think this is only a consequence of what's happening on the host.
> 
> > 
> > And the host dmesg as it might have been too big for the lists:
> > http://paste.ubuntu.com/25955699/
> 
> And that one doesn't show much either, apart from indicating that
> something is keeping the lock for itself. Drat.
> 
> We need to narrow down the problem, or make it appear on more common HW.
> Let me know if you've managed to reproduce it with non-VHE and/or on TX-1.

The stall also shows up when I disable VHE (CONFIG_ARM64_VHE unset). I'll
try enabling some tracepoints next.

--Jan