From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1756385AbcBDMVM (ORCPT <rfc822;w@1wt.eu>);
	Thu, 4 Feb 2016 07:21:12 -0500
Received: from mail-wm0-f43.google.com ([74.125.82.43]:38570 "EHLO
	mail-wm0-f43.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1752327AbcBDMVL (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Thu, 4 Feb 2016 07:21:11 -0500
Subject: Re: crash in 3.12.51 (likely in 3.12.52 as well) in timer code
To: Mike Galbraith <umgwanakikbuti@gmail.com>,
        "Linux-Kernel@Vger. Kernel. Org" <linux-kernel@vger.kernel.org>
References: <56B1DD62.9030900@kyup.com> <1454585550.3407.126.camel@gmail.com>
 <56B33B34.6090508@kyup.com> <1454588264.3407.142.camel@gmail.com>
Cc: Jiri Slaby <jslaby@suse.cz>, Oleg Nesterov <oleg@redhat.com>,
        tglx@linutronix.de, SiteGround Operations <operations@siteground.com>
From: Nikolay Borisov <kernel@kyup.com>
Message-ID: <56B34233.5010804@kyup.com>
Date: Thu, 4 Feb 2016 14:21:07 +0200
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101
 Thunderbird/38.1.0
MIME-Version: 1.0
In-Reply-To: <1454588264.3407.142.camel@gmail.com>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org


On 02/04/2016 02:17 PM, Mike Galbraith wrote:
> On Thu, 2016-02-04 at 13:51 +0200, Nikolay Borisov wrote:
>>
>> On 02/04/2016 01:32 PM, Mike Galbraith wrote:
>>> On Wed, 2016-02-03 at 12:58 +0200, Nikolay Borisov wrote:
>>>>
>>>> So in this case the prev/next entries do not look corrupted, whereas
>>>> when manipulating the list inside detach_timer they do. This is really
>>>> odd. Any ideas how to debug this further?
>>>
>>> Suspiciously similar to https://lkml.org/lkml/2016/2/4/247
>>
>> Right, I've been following that thread only cursorily, but I was left
>> with the impression that this only occurs on machines where a CPU can go
>> offline. The server on which this happened should never offline any of
>> its CPUs, since power management is disabled (though I will have to
>> double check this).
> 
> AFAIU, hotplug isn't required, only mod_delayed_work() being called
> from a different CPU than where the timer was born, migrating it at a
> bad time.

Right, in this case ib_addr was indeed using mod_delayed_work(), so
things line up so far.

> 
>> On a different note - is there a way to safely reproduce this so I can
>> test the suggested fix by Thomas?
> 
> Hm, write a module to beat mod_delayed_work() to pulp with a NR_CPUS
> horde, and run it in a vm where you don't care about shrapnel?

In other words, have multiple threads (NR_CPUS) that spin on
mod_delayed_work?
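
Roughly what I have in mind, as an untested sketch: one kthread bound to
each online CPU, all calling mod_delayed_work() on the same delayed_work,
so the timer keeps getting re-armed from a CPU other than the one that
last queued it. The mdw_stress_* names, the 1-jiffy delay and the use of
system_wq are my own assumptions, not taken from the ib_addr code, and I
have no evidence yet that this actually triggers the crash. Meant for a
throwaway VM only.

```c
/* mdw_stress.c: hypothetical stress module, untested sketch. */
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/kthread.h>
#include <linux/workqueue.h>
#include <linux/cpumask.h>
#include <linux/sched.h>
#include <linux/err.h>

/* A single delayed work item shared by every thread (assumption). */
static struct delayed_work mdw_stress_dwork;
static struct task_struct *mdw_stress_threads[NR_CPUS];

/* The work itself does nothing; only the timer handling matters here. */
static void mdw_stress_work_fn(struct work_struct *work)
{
}

/* Each thread re-arms the shared work as fast as it can. */
static int mdw_stress_thread_fn(void *unused)
{
	while (!kthread_should_stop()) {
		/* Delay of 1 jiffy is arbitrary (assumption). */
		mod_delayed_work(system_wq, &mdw_stress_dwork, 1);
		cond_resched();
	}
	return 0;
}

/* Stop every thread that was started, then flush the work item. */
static void mdw_stress_stop(void)
{
	int cpu;

	for_each_possible_cpu(cpu) {
		if (mdw_stress_threads[cpu]) {
			kthread_stop(mdw_stress_threads[cpu]);
			mdw_stress_threads[cpu] = NULL;
		}
	}
	cancel_delayed_work_sync(&mdw_stress_dwork);
}

static int __init mdw_stress_init(void)
{
	struct task_struct *t;
	int cpu;

	INIT_DELAYED_WORK(&mdw_stress_dwork, mdw_stress_work_fn);

	for_each_online_cpu(cpu) {
		t = kthread_create(mdw_stress_thread_fn, NULL,
				   "mdw_stress/%d", cpu);
		if (IS_ERR(t)) {
			mdw_stress_stop();
			return PTR_ERR(t);
		}
		/* Pin the thread so calls really come from distinct CPUs. */
		kthread_bind(t, cpu);
		mdw_stress_threads[cpu] = t;
		wake_up_process(t);
	}
	return 0;
}

static void __exit mdw_stress_exit(void)
{
	mdw_stress_stop();
}

module_init(mdw_stress_init);
module_exit(mdw_stress_exit);
MODULE_LICENSE("GPL");
```

If this turns out to reproduce the problem, the same module could be
rerun with Thomas's fix applied to see whether the corruption goes away.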


> 
> 	-Mike
>