From mboxrd@z Thu Jan 1 00:00:00 1970
From: Waiman Long
Subject: Re: [PATCH v3 0/3] qrwlock: Introducing a queue read/write lock implementation
Date: Fri, 16 Aug 2013 18:47:35 -0400
Message-ID: <520EAC07.5050106@hp.com>
References: <1375315259-29392-1-git-send-email-Waiman.Long@hp.com>
 <520A811A.7080907@hp.com>
 <20130814102041.GG10849@gmail.com>
 <520BA109.1000501@hp.com>
 <20130814155751.GA17821@gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <20130814155751.GA17821@gmail.com>
Sender: linux-kernel-owner@vger.kernel.org
To: Ingo Molnar
Cc: Thomas Gleixner, Ingo Molnar, "H. Peter Anvin", Arnd Bergmann,
 linux-arch@vger.kernel.org, x86@kernel.org, linux-kernel@vger.kernel.org,
 Peter Zijlstra, Steven Rostedt, Andrew Morton, Greg Kroah-Hartman,
 Matt Fleming, Michel Lespinasse, Andi Kleen, Rik van Riel,
 "Paul E. McKenney", Linus Torvalds, Raghavendra K T, George Spelvin,
 Harvey Harrison, "Chandramouleeswaran, Aswin", "Norton, Scott J"
List-Id: linux-arch.vger.kernel.org

On 08/14/2013 11:57 AM, Ingo Molnar wrote:
> * Waiman Long wrote:
>
>> On 08/14/2013 06:20 AM, Ingo Molnar wrote:
>>> * Waiman Long wrote:
>>>
>>>> I would like to share with you a rwlock related system crash that I
>>>> encountered during my testing with hackbench on an 80-core DL980. The
>>>> kernel crash because of a "watchdog detected hard lockup on cpu 79". The
>>>> crashing CPU was running "write_lock_irq(&tasklist_lock)" in
>>>> forget_original_parent() of the exit code path when I interrupted the
>>>> hackbench which was spawning thousands of processes. Apparently, the
>>>> remote CPU was not able to get the lock for a sufficient long time due
>>>> to the unfairness of the rwlock which I think my version of queue rwlock
>>>> will be able to alleviate this issue.
>>>>
>>>> So far, I was not able to reproduce the crash. I will try to see if I
>>>> could more consistently reproduce it.
>>> Was it an actual crash/lockup, or a longish hang followed by a lock
>>> detector splat followed by the system eventually recovering back to
>>> working order?
>>>
>>> Thanks,
>>>
>>> Ingo
>> It was an actual crash initiated by the NMI handler. I think the
>> system was in a halt state after that.
> Could be a CONFIG_BOOTPARAM_HARDLOCKUP_PANIC_VALUE=1 kernel?
>
> Thanks,
>
> Ingo

My test system was a RHEL6.4 system. The 3.10 kernel config file was
based on the original RHEL6.4 config file. So yes, the
CONFIG_BOOTPARAM_HARDLOCKUP_PANIC_VALUE parameter was set.

I also found that when I bump the process count up to around 30k,
interrupting the main hackbench process may not cause the other
spawned processes to die out. I will investigate this phenomenon
further next week.

Regards,
Longman