From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1751674AbbCURp1 (ORCPT <rfc822;w@1wt.eu>);
	Sat, 21 Mar 2015 13:45:27 -0400
Received: from userp1040.oracle.com ([156.151.31.81]:36452 "EHLO
	userp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751527AbbCURpY (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Sat, 21 Mar 2015 13:45:24 -0400
Message-ID: <550DAE23.7030000@oracle.com>
Date: Sat, 21 Mar 2015 11:45:07 -0600
From: David Ahern <david.ahern@oracle.com>
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:31.0) Gecko/20100101 Thunderbird/31.5.0
MIME-Version: 1.0
To: Linus Torvalds <torvalds@linux-foundation.org>
CC: linux-mm <linux-mm@kvack.org>, LKML <linux-kernel@vger.kernel.org>
Subject: Re: 4.0.0-rc4: panic in free_block
References: <550C37C9.2060200@oracle.com>	<CA+55aFxoVPRuFJGuP_=0-NCiqx_NPeJBv+SAZqbAzeC9AhN+CA@mail.gmail.com>	<550CA3F9.9040201@oracle.com>	<550CB8D1.9030608@oracle.com> <CA+55aFwyuVWHMq_oc_hfwWcu6RaPGSifXD9-adX2_TOa-L+PHA@mail.gmail.com>
In-Reply-To: <CA+55aFwyuVWHMq_oc_hfwWcu6RaPGSifXD9-adX2_TOa-L+PHA@mail.gmail.com>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
X-Source-IP: userv0022.oracle.com [156.151.31.74]
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On 3/20/15 6:47 PM, Linus Torvalds wrote:
>
>> Here's another data point: If I disable NUMA I don't see the problem.
>> Performance drops, but no NULL pointer splats which would have been panics.
>
> So the NUMA case triggers the per-node "n->shared" logic, which
> *should* be protected by "n->list_lock". Maybe there is some bug there
> - but since that code seems to do ok on x86-64 (and apparently older
> sparc too), I really would look at arch-specific issues first.

You raise a lot of valid questions, and they are something to look into. 
But if the root cause were such a fundamental issue (CPU memory ordering, 
compiler bug, etc.), why would it occur only on this one code path -- free 
with SLAB and NUMA -- and so consistently?

I am continuing to poke around, but am open to any suggestions. I have 
enabled every DEBUG option I can find in the memory code, and nothing is 
popping out. If this were a race, wouldn't all the DEBUG checks affect 
timing? Yet I am still seeing the same stack traces due to the same root 
cause.

David