From mboxrd@z Thu Jan  1 00:00:00 1970
From: Waiman Long <waiman.long@hp.com>
Subject: Re: [PATCH v3 0/3] qrwlock: Introducing a queue read/write lock implementation
Date: Fri, 16 Aug 2013 18:47:35 -0400
Message-ID: <520EAC07.5050106@hp.com>
References: <1375315259-29392-1-git-send-email-Waiman.Long@hp.com> <520A811A.7080907@hp.com> <20130814102041.GG10849@gmail.com> <520BA109.1000501@hp.com> <20130814155751.GA17821@gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Return-path: <linux-kernel-owner@vger.kernel.org>
In-Reply-To: <20130814155751.GA17821@gmail.com>
Sender: linux-kernel-owner@vger.kernel.org
To: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>, "H. Peter Anvin" <hpa@zytor.com>, Arnd Bergmann <arnd@arndb.de>, linux-arch@vger.kernel.org, x86@kernel.org, linux-kernel@vger.kernel.org, Peter Zijlstra <peterz@infradead.org>, Steven Rostedt <rostedt@goodmis.org>, Andrew Morton <akpm@linux-foundation.org>, Greg Kroah-Hartman <gregkh@linuxfoundation.org>, Matt Fleming <matt.fleming@intel.com>, Michel Lespinasse <walken@google.com>, Andi Kleen <andi@firstfloor.org>, Rik van Riel <riel@redhat.com>, "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>, Linus Torvalds <torvalds@linux-foundation.org>, Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>, George Spelvin <linux@horizon.com>, Harvey Harrison <harvey.harrison@gmail.com>, "Chandramouleeswaran, Aswin" <aswin@hp.com>, "Norton, Scott J" <scott.norton@hp.com>
List-Id: linux-arch.vger.kernel.org

On 08/14/2013 11:57 AM, Ingo Molnar wrote:
> * Waiman Long<waiman.long@hp.com>  wrote:
>
>> On 08/14/2013 06:20 AM, Ingo Molnar wrote:
>>> * Waiman Long<waiman.long@hp.com>   wrote:
>>>
>>>> I would like to share with you a rwlock related system crash that I
>>>> encountered during my testing with hackbench on an 80-core DL980. The
>>>> kernel crashed because of a "watchdog detected hard lockup on cpu 79". The
>>>> crashing CPU was running "write_lock_irq(&tasklist_lock)" in
>>>> forget_original_parent() of the exit code path when I interrupted the
>>>> hackbench which was spawning thousands of processes. Apparently, the
>>>> remote CPU was not able to get the lock for a sufficiently long time due
>>>> to the unfairness of the rwlock, an issue that I think my version of
>>>> queue rwlock will be able to alleviate.
>>>>
>>>> So far, I was not able to reproduce the crash. I will try to see if I
>>>> could more consistently reproduce it.
>>> Was it an actual crash/lockup, or a longish hang followed by a lock
>>> detector splat followed by the system eventually recovering back to
>>> working order?
>>>
>>> Thanks,
>>>
>>> 	Ingo
>> It was an actual crash initiated by the NMI handler. I think the
>> system was in a halt state after that.
> Could be a CONFIG_BOOTPARAM_HARDLOCKUP_PANIC_VALUE=1 kernel?
>
> Thanks,
>
> 	Ingo

My test system was a RHEL6.4 system. The 3.10 kernel config file was 
based on the original RHEL6.4 config file. So yes, the 
CONFIG_BOOTPARAM_HARDLOCKUP_PANIC_VALUE parameter was set.

I also found that when I bump the process count up to the 30k range, 
interrupting the main hackbench process may not cause the other spawned 
processes to die out. I will investigate this phenomenon further 
next week.

Regards,
Longman
