From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pa0-x234.google.com (mail-pa0-x234.google.com [IPv6:2607:f8b0:400e:c03::234])
    (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
    (No client certificate requested)
    by lists.ozlabs.org (Postfix) with ESMTPS id D85931A0EBD
    for ; Thu, 1 Oct 2015 09:53:13 +1000 (AEST)
Received: by pacex6 with SMTP id ex6so54765122pac.0
    for ; Wed, 30 Sep 2015 16:53:11 -0700 (PDT)
From: Daniel Axtens
To: "Matthew R. Ochs"
Cc: linux-scsi@vger.kernel.org, James Bottomley, "Nicholas A. Bellinger",
    Brian King, Ian Munsie, Andrew Donnellan, Tomas Henzl, David Laight,
    Michael Neuling, linuxppc-dev@lists.ozlabs.org, "Manoj N. Kumar"
Subject: Re: [PATCH v4 25/32] cxlflash: Fix to prevent EEH recovery failure
In-Reply-To:
References: <1443222593-8828-1-git-send-email-mrochs@linux.vnet.ibm.com>
    <1443223134-9886-1-git-send-email-mrochs@linux.vnet.ibm.com>
    <87612uuich.fsf@gamma.ozlabs.ibm.com>
Date: Thu, 01 Oct 2015 09:53:06 +1000
Message-ID: <87mvw3sbvh.fsf@gamma.ozlabs.ibm.com>
MIME-Version: 1.0
Content-Type: text/plain
List-Id: Linux on PowerPC Developers Mail List
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,

"Matthew R. Ochs" writes:

>>> The process_sense() routine can perform a read capacity which
>>> can take some time to complete. If an EEH occurs while waiting
>>> on the read capacity, the EEH handler is unable to obtain the
>>> context's mutex in order to put the context in an error state.
>>> The EEH handler will sit and wait until the context is free,
>>> but this wait can last longer than the EEH handler tolerates,
>>> leading to a failed recovery.
>>
>> I'm not quite clear on what you mean by the EEH handler timing
>> out. AFAIK there's nothing in eehd and the EEH core that times out if a
>> driver doesn't respond - indeed, it's pretty easy to hang eehd with a
>> misbehaving driver.
>>
>> Are you referring to your own internal timeouts?
>> cxlflash_wait_for_pci_err_recovery and anything else that uses
>> CXLFLASH_PCI_ERROR_RECOVERY_TIMEOUT?
>
> Reading through this again I can see how this is misleading. This is
> actually similar and related to the deadlock scenario described in
> "Fix to avoid potential deadlock on EEH". Without this fix, you'd end
> up in a similar situation but deadlocked on the context mutex instead
> of the ioctl semaphore.

That makes _much_ more sense. If you could please revise the commit
message to explain that, you can include this in the next version:

Reviewed-by: Daniel Axtens

Regards,
Daniel