From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S932603AbbIUNiP (ORCPT ); Mon, 21 Sep 2015 09:38:15 -0400
Received: from mx1.redhat.com ([209.132.183.28]:60551 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S932302AbbIUNhz (ORCPT ); Mon, 21 Sep 2015 09:37:55 -0400
From: Vitaly Kuznetsov
To: Olaf Hering
Cc: "K. Y. Srinivasan" , Greg KH ,
	linux-kernel@vger.kernel.org, devel@linuxdriverproject.org,
	apw@canonical.com, jasowang@redhat.com
Subject: Re: [PATCH 2/5] hv: add helpers to handle hv_util device state
References: <1442363823-22428-1-git-send-email-kys@microsoft.com>
	<1442363874-22508-1-git-send-email-kys@microsoft.com>
	<1442363874-22508-2-git-send-email-kys@microsoft.com>
	<20150921052532.GA24350@kroah.com>
	<20150921102626.GB4252@aepfle.de>
	<87y4g0hv4d.fsf@vitty.brq.redhat.com>
	<20150921121706.GA9172@aepfle.de>
Date: Mon, 21 Sep 2015 15:37:51 +0200
In-Reply-To: <20150921121706.GA9172@aepfle.de> (Olaf Hering's message of
	"Mon, 21 Sep 2015 14:17:06 +0200")
Message-ID: <8737y728r4.fsf@vitty.brq.redhat.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/24.5 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

Olaf Hering writes:

> On Mon, Sep 21, Vitaly Kuznetsov wrote:
>
>> I'd like to see a trace from the hang, it is not obvious to me how it
>> happened and what caused it. (or if you have such hang scenario in your
>> head, can you please reveal it?)
>
> There is no trace. I think fcopy_respond_to_host notifies the host,
> which in turn triggers an interrupt right away which is processed while
> fcopy_on_msg is executing somewhere between the return from
> fcopy_respond_to_host and the call into hv_fcopy_onchannelcallback.
>

I think it is fcopy_transaction.fcopy_context which gets out of sync.
When we're done processing some request we have the following code:

	fcopy_transaction.state = HVUTIL_USERSPACE_RECV;
	fcopy_respond_to_host(*val);
	fcopy_transaction.state = HVUTIL_READY;
	hv_poll_channel(fcopy_transaction.fcopy_context,
			hv_fcopy_onchannelcallback);

If an interrupt happens after we did fcopy_respond_to_host(),
fcopy_transaction.state will still be HVUTIL_USERSPACE_RECV or even its
previous value, HVUTIL_USERSPACE_REQ, but that's OK as we have the
following in hv_fcopy_onchannelcallback():

	if (fcopy_transaction.state > HVUTIL_READY) {
		/*
		 * We will defer processing this callback once
		 * the current transaction is complete.
		 */
		fcopy_transaction.fcopy_context = context;
		return;
	}

And we're supposed to process the work with hv_poll_channel(). The
problem is (I guess) that fcopy_transaction.fcopy_context gets out of
sync and still has its previous value (possibly NULL). We call
hv_poll_channel() with NULL and everything gets stuck as we'll never
process the request.

AFAICS proper locking is required here (and probably in all three
drivers): we need to protect not only .state but the whole transaction.

[...]

-- 
Vitaly
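
To illustrate the "protect the whole transaction" point, here is a minimal
user-space sketch; it is not the driver code and not a proposed patch. The
state enum is simplified (only the ordering READY < REQ < RECV is taken from
the "state > HVUTIL_READY" check quoted above), and struct txn, txn_init(),
txn_lock(), txn_unlock(), sketch_onchannelcallback() and sketch_finish() are
hypothetical names invented for this example. A C11 atomic_flag spinlock
stands in for whatever primitive the drivers would actually need.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Simplified, assumed ordering: anything above READY means "busy". */
enum hvutil_state { HVUTIL_READY, HVUTIL_USERSPACE_REQ, HVUTIL_USERSPACE_RECV };

/* Hypothetical transaction: one lock covers state AND deferred context. */
struct txn {
	atomic_flag lock;
	enum hvutil_state state;
	void *context;		/* deferred channel context, NULL if none */
	int processed;		/* how many callbacks actually started work */
};

static void txn_init(struct txn *t, enum hvutil_state state)
{
	atomic_flag_clear(&t->lock);
	t->state = state;
	t->context = NULL;
	t->processed = 0;
}

static void txn_lock(struct txn *t)
{
	while (atomic_flag_test_and_set_explicit(&t->lock, memory_order_acquire))
		;	/* spin until the flag is released */
}

static void txn_unlock(struct txn *t)
{
	atomic_flag_clear_explicit(&t->lock, memory_order_release);
}

/* Model of the channel callback: defer while a transaction is in flight. */
static void sketch_onchannelcallback(struct txn *t, void *context)
{
	assert(context != NULL);

	txn_lock(t);
	if (t->state > HVUTIL_READY) {
		/* Stored under the lock, so the finisher cannot miss it. */
		t->context = context;
		txn_unlock(t);
		return;
	}
	t->state = HVUTIL_USERSPACE_REQ;	/* start a new transaction */
	t->processed++;
	txn_unlock(t);
}

/* Model of the completion path, run after the host has been answered. */
static void sketch_finish(struct txn *t)
{
	void *deferred;

	txn_lock(t);
	t->state = HVUTIL_READY;
	deferred = t->context;	/* read in the same critical section */
	t->context = NULL;
	txn_unlock(t);

	if (deferred)
		sketch_onchannelcallback(t, deferred);
}
```

Because the state change and the read of the deferred context happen in one
critical section, a callback that arrives between the response to the host
and the completion either stores its context before the finisher looks, or
sees HVUTIL_READY and runs directly. In the kernel the choice of lock would
depend on the context the channel callback runs in, which this sketch does
not model.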