From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S932603AbbIUNiP (ORCPT <rfc822;w@1wt.eu>);
	Mon, 21 Sep 2015 09:38:15 -0400
Received: from mx1.redhat.com ([209.132.183.28]:60551 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S932302AbbIUNhz (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
	Mon, 21 Sep 2015 09:37:55 -0400
From: Vitaly Kuznetsov <vkuznets@redhat.com>
To: Olaf Hering <olaf@aepfle.de>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>,
        Greg KH <gregkh@linuxfoundation.org>, linux-kernel@vger.kernel.org,
        devel@linuxdriverproject.org, apw@canonical.com, jasowang@redhat.com
Subject: Re: [PATCH 2/5] hv: add helpers to handle hv_util device state
References: <1442363823-22428-1-git-send-email-kys@microsoft.com>
	<1442363874-22508-1-git-send-email-kys@microsoft.com>
	<1442363874-22508-2-git-send-email-kys@microsoft.com>
	<20150921052532.GA24350@kroah.com> <20150921102626.GB4252@aepfle.de>
	<87y4g0hv4d.fsf@vitty.brq.redhat.com>
	<20150921121706.GA9172@aepfle.de>
Date: Mon, 21 Sep 2015 15:37:51 +0200
In-Reply-To: <20150921121706.GA9172@aepfle.de> (Olaf Hering's message of "Mon,
	21 Sep 2015 14:17:06 +0200")
Message-ID: <8737y728r4.fsf@vitty.brq.redhat.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/24.5 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

Olaf Hering <olaf@aepfle.de> writes:

> On Mon, Sep 21, Vitaly Kuznetsov wrote:
>
>> I'd like to see a trace from the hang, it is not obvious to me how it
>> happened and what caused it. (or if you have such hang scenario in your
>> head, can you please reveal it?)
>
> There is no trace. I think fcopy_respond_to_host notifies the host,
> which in turn triggers an interrupt right away which is processed while
> fcopy_on_msg is executing somewhere between the return from
> fcopy_respond_to_host and the call into hv_fcopy_onchannelcallback.
>

I think it is fcopy_transaction.fcopy_context which gets out of sync.

When we're done processing some request we have the following code:

		fcopy_transaction.state = HVUTIL_USERSPACE_RECV;
		fcopy_respond_to_host(*val);
		fcopy_transaction.state = HVUTIL_READY;
		hv_poll_channel(fcopy_transaction.fcopy_context,
				hv_fcopy_onchannelcallback);

If an interrupt happens after we did fcopy_respond_to_host(),
fcopy_transaction.state will still be HVUTIL_USERSPACE_RECV or even its
previous HVUTIL_USERSPACE_REQ, but that's OK as we have the following in
hv_fcopy_onchannelcallback():

	if (fcopy_transaction.state > HVUTIL_READY) {
		/*
		 * We will defer processing this callback once
		 * the current transaction is complete.
		 */
		fcopy_transaction.fcopy_context = context;
		return;
	}

We're then supposed to process the deferred work with hv_poll_channel().
The problem is (I guess) that fcopy_transaction.fcopy_context gets out of
sync and still has its previous value (possibly NULL). We then call
hv_poll_channel() with NULL and everything gets stuck, as we'll never
process the request.

AFAICS proper locking is required here (and probably in all three
drivers): we need to protect not only .state but the whole transaction.
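
To illustrate what "protect the whole transaction" could mean, here is a
rough userspace sketch, not a patch. It assumes a pthread mutex standing
in for whatever kernel lock would actually be used, and every name in it
(struct txn, on_channel_callback, complete_transaction, the TXN_* states)
is made up for the example. The idea is only that the state check and the
context store happen under one lock, and that completion resets the state
and claims the deferred context in the same critical section.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical model; names loosely mirror the driver but are not its API. */
enum txn_state {
	TXN_READY,
	TXN_HOSTMSG_RECEIVED,
	TXN_USERSPACE_REQ,
	TXN_USERSPACE_RECV,
};

struct txn {
	pthread_mutex_t lock;	/* assumption: stands in for a kernel lock */
	enum txn_state state;
	void *context;		/* deferred channel, NULL if nothing is pending */
	int processed;		/* counts handled requests, for illustration */
};

static struct txn txn = { PTHREAD_MUTEX_INITIALIZER, TXN_READY, NULL, 0 };

/* Models the channel callback: defer while busy, otherwise start processing. */
static void on_channel_callback(void *context)
{
	pthread_mutex_lock(&txn.lock);
	if (txn.state > TXN_READY) {
		/* The state check and the context store form one atomic step. */
		txn.context = context;
		pthread_mutex_unlock(&txn.lock);
		return;
	}
	txn.state = TXN_HOSTMSG_RECEIVED;
	txn.processed++;
	pthread_mutex_unlock(&txn.lock);
}

/* Models completion: reset the state and claim the deferred context together. */
static void complete_transaction(void)
{
	void *deferred;

	pthread_mutex_lock(&txn.lock);
	assert(txn.state > TXN_READY);	/* only a busy transaction can complete */
	txn.state = TXN_READY;
	deferred = txn.context;
	txn.context = NULL;		/* a stale context can never be replayed */
	pthread_mutex_unlock(&txn.lock);

	/* Replay the deferred callback outside the lock, if there was one. */
	if (deferred)
		on_channel_callback(deferred);
}
```

With this shape a callback either sees the transaction as busy and
records its context, or sees it as ready and processes the request
directly; completion cannot observe a half-updated pair of .state and
.context.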

[...]

-- 
  Vitaly