Date: Wed, 10 Aug 2016 12:57:25 +0200
From: Oleg Nesterov
To: Bart Van Assche
Cc: Bart Van Assche, Peter Zijlstra, "mingo@kernel.org", Andrew Morton,
	Johannes Weiner, Neil Brown, Michael Shaver,
	"linux-kernel@vger.kernel.org"
Subject: Re: [PATCH] sched: Avoid that __wait_on_bit_lock() hangs
Message-ID: <20160810105724.GA9389@redhat.com>
References: <20160803181128.GH6879@twins.programming.kicks-ass.net>
	<11007730-3fa5-139a-8091-655743894ae8@sandisk.com>
	<20160803213006.GA11712@redhat.com>
	<17b65ff9-215f-ab74-9f5f-15dbd308d054@sandisk.com>
	<20160804140938.GB24652@twins.programming.kicks-ass.net>
	<16207b90-2e6c-fe23-1b4b-3763e5cf0384@sandisk.com>
	<20160808102213.GA6879@twins.programming.kicks-ass.net>
	<4091e252-18d9-1795-de63-9fbc678aa6b1@acm.org>
	<20160808162038.GA25927@redhat.com>
	<8ca35562-a670-4fe5-fa46-7d1872d90299@sandisk.com>
In-Reply-To: <8ca35562-a670-4fe5-fa46-7d1872d90299@sandisk.com>

On 08/09, Bart Van Assche wrote:
>
> Hello Oleg,
>
> Something that puzzles me is that removing the "else" keyword from
> abort_exclusive_wait() is sufficient to avoid the hang.

Yes, we need to understand this.

> If there would
> be code that clears PG_locked without calling wake_up() this hang
> probably would also be triggered by workloads that do not wake up
> lock_page_killable() with a signal.

Yes, and I already have another debugging patch to test this... it simply
turns lock_page_killable() into lock_page().

But let's check __ClearPageLocked() first (the patch I sent a minute ago).

> BTW, the
> WARN_ONCE(!list_empty(&wait->task_list) && waitqueue_active(q), "mode =
> %#x\n", mode) statement that I added in abort_exclusive_wait() just
> produced the following call stack:

This condition is fine, and the trace is clear. This means that
lock_page_killable() was interrupted and wake_bit_function() was not
called. We do not need another wakeup in this case, but somehow it helps.
Again, I think because the necessary wakeup was already lost/missed.

Oleg.
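
For readers following the thread, the two abort paths discussed above can
be sketched as a small user-space model. This is not the kernel code: the
types and helpers (struct waiter, struct waitq, wq_add, wq_del,
wq_wake_one, abort_wait, second_waiter_woken) are hypothetical names made
up for illustration, and locking, task states and the bit key are left
out. It assumes that an exclusive wakeup dequeues the waiter it wakes, and
that with_else selects between the variant with the "else" and the one
where it has been removed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_WAITERS 4

/* Hypothetical model types; they only mimic the shape of a wait queue. */
struct waiter {
    bool queued;    /* still linked on the queue? */
    bool woken;     /* has received a wakeup */
};

struct waitq {
    struct waiter *slot[MAX_WAITERS];
    size_t n;
};

static void wq_add(struct waitq *q, struct waiter *w)
{
    assert(q->n < MAX_WAITERS);
    w->queued = true;
    w->woken = false;
    q->slot[q->n++] = w;
}

static void wq_del(struct waitq *q, struct waiter *w)
{
    for (size_t i = 0; i < q->n; i++) {
        if (q->slot[i] != w)
            continue;
        /* Close the gap left by the removed entry. */
        for (size_t j = i + 1; j < q->n; j++)
            q->slot[j - 1] = q->slot[j];
        q->n--;
        w->queued = false;
        return;
    }
}

/* Exclusive wakeup: wake and dequeue the first waiter, if there is one. */
static bool wq_wake_one(struct waitq *q)
{
    if (q->n == 0)
        return false;
    struct waiter *w = q->slot[0];
    wq_del(q, w);
    w->woken = true;
    return true;
}

/*
 * Abort path of an interrupted exclusive waiter.
 * with_else == true:  pass a wakeup on only if we were already dequeued,
 *                     i.e. we consumed a wakeup we will not act on.
 * with_else == false: the "else" is removed, so a wakeup is passed on
 *                     whenever other waiters remain.
 */
static void abort_wait(struct waitq *q, struct waiter *self, bool with_else)
{
    bool was_queued = self->queued;

    if (was_queued)
        wq_del(q, self);
    if (!was_queued || !with_else)
        wq_wake_one(q);
}

/* Two waiters; the first is interrupted. Was the second one woken? */
static bool second_waiter_woken(bool first_was_woken, bool with_else)
{
    struct waitq q = { .n = 0 };
    struct waiter a, b;

    wq_add(&q, &a);
    wq_add(&q, &b);
    if (first_was_woken)
        wq_wake_one(&q);    /* wakes and dequeues the first waiter */
    abort_wait(&q, &a, with_else);
    return b.woken;
}
```

In this model, when the first waiter is interrupted while still queued,
the second waiter is woken only in the variant without the "else". That
extra wakeup is not needed if nothing was lost earlier, which fits the
suspicion that it merely papers over a wakeup that was already missed.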