From mboxrd@z Thu Jan 1 00:00:00 1970 From: Asias He Subject: Re: [RFC PATCH 2/5] block: Do not stop draining if waitqueue is not empty. Date: Wed, 23 May 2012 22:54:11 +0800 Message-ID: <4FBCFA13.5030401@redhat.com> References: <1337591313-26333-1-git-send-email-asias@redhat.com> <1337591313-26333-2-git-send-email-asias@redhat.com> <20120521153922.GA6549@google.com> <4FBB36D7.9030202@redhat.com> <20120522150742.GA14339@google.com> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Cc: Jens Axboe , linux-fsdevel@vger.kernel.org To: Tejun Heo Return-path: Received: from mx1.redhat.com ([209.132.183.28]:18515 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933502Ab2EWOxU (ORCPT ); Wed, 23 May 2012 10:53:20 -0400 In-Reply-To: <20120522150742.GA14339@google.com> Sender: linux-fsdevel-owner@vger.kernel.org List-ID: On 05/22/2012 11:07 PM, Tejun Heo wrote: > Hello, Asias. > > On Tue, May 22, 2012 at 02:48:55PM +0800, Asias He wrote: >> I actually saw this happened though it should not happen. I have no >> idea why this happens. Maybe unbalanced prepare_to_wait_exclusive() >> in get_request_wait() and wake_up() in __freed_request()? > > Hmm.... unbalanced how? I think bugs in this area are much more > likely to show up as live queue hang rather than issues during queue > shutdown. I added some debug code to count the number of sleep and wakeup in get_request_wait() and __freed_request(). I found this after queue cleanup. rl->wait[] is not empty while rl->count[] == 0. There are exactly nr_sleep - nr_wakeup of process in D state. So missed wakeup happens? Any ideas to do more debug to find the root-cause? [ 52.917115] ---> nr_sleep=1046, nr_wakeup=873, delta=173 $ vmstat 1 1 173 0 712640 24292 96172 0 0 0 0 419 757 0 0 0 100 0 0 173 0 712764 24292 96180 0 0 0 0 472 725 0 1 0 97 2 >> With this happened, I saw some fio threads in D state which are >> sleeping on get_request_wait(). If I wake up the threads in the wait >> queue in q->abort_queue_fn() callback which i proposed in the 1/5 of >> this patch set, the queue cleanup and thus hot-unplug went pretty >> well. (Passed 3000~ rounds of test, without this 2~ round of test >> would fail). See this patch [RFC PATCH 4/5] virtio-blk: Use >> q->abort_queue_fn() to abort. > > If the problem is that easily reproducible (you mean that you can > reproduce it every other time, right?), it would be immensely helpful > if you can root cause the issue properly. As it currently stands, > this series seems to work around the problem by adding extra API > without properly root-causing it. Workarounds without proper > root-causing are already pretty bad and adding extra API for that is > rather silly, IMHO. Yes. it is very easy to reproduce. /me Trying to figure the root cause out. -- Asias