From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-pl1-f174.google.com (mail-pl1-f174.google.com [209.85.214.174]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 0777F1DE3A5 for ; Wed, 19 Feb 2025 13:13:28 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.214.174 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1739970811; cv=none; b=G3fBZGT1L1V5ObMAZXDWLsOJhAKWUVBgqS75EEeJpTVxV/YXkwgg3Wssf6AZE13WMByKsbeu3RX4Hj6keVb9j+39FpxX2jrr5vlLa1T2u8Tu0YVJVxPCtxDOk9tgwB0PycTdcQCzoY1wZyUeFvCuINnxF+2OzFNT7LQ7pyRR3EY= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1739970811; c=relaxed/simple; bh=N1DHUdSnvmt36V2m5nNLQz+GAkBLimsA64IQtEB74XI=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=WRfGM/gCNI/eD7QLMl6bH1IZKG2G9wfLpANOJe5zNmD61cmVYgPl/W0NTmSN39mKxsc624UhvUJsyZh0602M0MM2E1ANGL1AZ+T+jKWpmQOdVyJ3/3FpTlsNJDrhTe6j0zZ4xb8ssST1iQAfUisos7f/VHD51QuG5ZbhsTOGDcc= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linaro.org; spf=pass smtp.mailfrom=linaro.org; dkim=pass (2048-bit key) header.d=linaro.org header.i=@linaro.org header.b=iiT9M1QW; arc=none smtp.client-ip=209.85.214.174 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linaro.org Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linaro.org Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=linaro.org header.i=@linaro.org header.b="iiT9M1QW" Received: by mail-pl1-f174.google.com with SMTP id d9443c01a7336-2211cd4463cso68894175ad.2 for ; Wed, 19 Feb 2025 05:13:28 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linaro.org; s=google; t=1739970808; x=1740575608; darn=vger.kernel.org; h=in-reply-to:content-transfer-encoding:content-disposition :mime-version:references:message-id:subject:cc:to:from:date:from:to :cc:subject:date:message-id:reply-to; bh=8+URBwMiBnKyGY5kPdMN3kkeBQNC7Y+f9MSb7kdc35Q=; b=iiT9M1QWVkFukFNm1xyjzR9+TX9qJMrwEOOnKzUo09FwBOYBWSzySYhKxWFRIzt/gZ IJkY4I51ys4OnKVJ22z8o1fu7f8daMVqITwZAkcO3wG8IlRlXaGTbnFAN9HjMLNqT8L1 lm20WLjV9f3TuX7sidxL8DVYQDgyWaF3DxJwhJVXsq8TmlscjGPiuPRwM6sUe54O96kw kqOOTpe+UAnSs/BzqtWwmeRfk9QKVmfzSOw88tKr7eVWoWsnS51faHrAPeWOexIPmCxH 6k6a2kT02PfJTIx1dvBdUGV/0lVLmmJ2xwnUe1DfKwOLjwu2UOscxabIlepYpA8b+Em8 dfIg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1739970808; x=1740575608; h=in-reply-to:content-transfer-encoding:content-disposition :mime-version:references:message-id:subject:cc:to:from:date :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=8+URBwMiBnKyGY5kPdMN3kkeBQNC7Y+f9MSb7kdc35Q=; b=CO5jblvu0+prYcozwtKSb+MCGaaTQH1OsazYkmmNIvicu1jnf9uBJXss6LypscTS+k nwuDISI3CRLt4kvm1U5OysK05R1WPDs0VOv2nqhL6gBZY/gikbuyue3+1lz9yfY/WLZh QSBECq/9DzLiHnNJiiV8Dj/KH5dbQhLbLK8j2PWyBK9xuyKWILuXO6MwZY7tosc6rAz2 PtbgkcrEGh/7golw2j0nSzK4r0oY8w97KpVl6ZZwmx9Z/DSyZdYqDMUU/bqLiAB5XyHd qU7WG+11yFAb4Q/H2vls9Dli84wYd63tHu5QXDtjLKekk1N5WNnp1/2WCls6RNyHF2+i A8ow== X-Forwarded-Encrypted: i=1; AJvYcCWlS8YTC0b3tzUmYIeFpd5XNiRthAUIGjq9nwtu72qBEnQsZvYk81d+LsqEzcN6ksuCKMNZy+Srax2S5ORV@vger.kernel.org X-Gm-Message-State: AOJu0YxT+nAYV9JSXbeeGuph9amFv9wfSHGEFJQUKS1cdFkYP5SmTLme 5RX48oDAq57/MyxKv+y/DiZOW/DgUlNYfJ6CPe6SoYfHqv9ad8dulhiq5pMoBw== X-Gm-Gg: ASbGnctmM7yqz7DjXL9Vn9y4nsIBGCxDqgXAtwaHFeJIunOpAx/q1YydUKR4DShHI7T BmbfJrbUyx368yXQtrNSrivm/my767WT9PjNxfPwRhitC60U9R5q01bRVtFsoxvW0+DbSi/hsI2 nN9NcZylKhFP0S33Yt6vtchlSbjGEcx5FnZ0YwS3rOrWwxh81qT7+zrzoEeKlYn4vTEtbabyooW fKLVWS5Dgv71OzKKODdufUNuvQdw7ImRfjOdYNCL8Az4BxQtAHmhbiWR1pmKZmcGJQJDmTeaZfj kF5p2Dx/hV14Db7wnBXe9rbiqX0= X-Google-Smtp-Source: AGHT+IHkVE5PfzgLunRYfW7LOverW0PdjTWo0Wwm0ksUnlxjHfmqNgrs9+yjaxVIVMFyRcmK9zw8KQ== X-Received: by 2002:a17:902:ef49:b0:216:6769:9ed7 with SMTP id d9443c01a7336-221040b131dmr298362555ad.40.1739970808214; Wed, 19 Feb 2025 05:13:28 -0800 (PST) Received: from thinkpad ([120.56.197.245]) by smtp.gmail.com with ESMTPSA id d9443c01a7336-220d537c921sm104190715ad.104.2025.02.19.05.13.25 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 19 Feb 2025 05:13:27 -0800 (PST) Date: Wed, 19 Feb 2025 18:43:24 +0530 From: Manivannan Sadhasivam To: Johan Hovold Cc: mhi@lists.linux.dev, Loic Poulain , linux-arm-msm@vger.kernel.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org Subject: Re: [PATCH 1/2] bus: mhi: host: pci_generic: Use pci_try_reset_function() to avoid deadlock Message-ID: <20250219131324.ohfrkuj32fifkmkt@thinkpad> References: <20250108-mhi_recovery_fix-v1-0-a0a00a17da46@linaro.org> <20250108-mhi_recovery_fix-v1-1-a0a00a17da46@linaro.org> Precedence: bulk X-Mailing-List: linux-arm-msm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: On Wed, Jan 22, 2025 at 04:11:41PM +0100, Johan Hovold wrote: > On Wed, Jan 08, 2025 at 07:09:27PM +0530, Manivannan Sadhasivam via B4 Relay wrote: > > From: Manivannan Sadhasivam > > > > There are multiple places from where the recovery work gets scheduled > > asynchronously. Also, there are multiple places where the caller waits > > synchronously for the recovery to be completed. One such place is during > > the PM shutdown() callback. > > > > If the device is not alive during recovery_work, it will try to reset the > > device using pci_reset_function(). This function internally will take the > > device_lock() first before resetting the device. By this time, if the lock > > has already been acquired, then recovery_work will get stalled while > > waiting for the lock. And if the lock was already acquired by the caller > > which waits for the recovery_work to be completed, it will lead to > > deadlock. > > > > This is what happened on the X1E80100 CRD device when the device died > > before shutdown() callback. Driver core calls the driver's shutdown() > > callback while holding the device_lock() leading to deadlock. > > > > And this deadlock scenario can occur on other paths as well, like during > > the PM suspend() callback, where the driver core would hold the > > device_lock() before calling driver's suspend() callback. And if the > > recovery_work was already started, it could lead to deadlock. This is also > > observed on the X1E80100 CRD. > > > > So to fix both issues, use pci_try_reset_function() in recovery_work. This > > function first checks for the availability of the device_lock() before > > trying to reset the device. If the lock is available, it will acquire it > > and reset the device. Otherwise, it will return -EAGAIN. If that happens, > > recovery_work will fail with the error message "Recovery failed" as not > > much could be done. > > I can confirm that this patch (alone) fixes the deadlock on shutdown > and suspend as expected, but it does leave the system state that blocks > further suspend (this is with patches that tear down the PCI link). Yeah, we wouldn't know when not to return an error to unblock the suspend. So this is acceptable. > Looks like the modem is also hosed. > > [ 267.454616] mhi-pci-generic 0005:01:00.0: mhi_pci_runtime_resume > [ 267.461165] mhi mhi0: Resuming from non M3 state (SYS ERROR) > [ 267.467102] mhi-pci-generic 0005:01:00.0: failed to resume device: -22 > [ 267.473968] mhi-pci-generic 0005:01:00.0: device recovery started > [ 267.477460] mhi-pci-generic 0005:01:00.0: mhi_pci_suspend > [ 267.480331] mhi-pci-generic 0005:01:00.0: __mhi_power_down > [ 267.485917] mhi-pci-generic 0005:01:00.0: mhi_pci_runtime_suspend > [ 267.498339] mhi-pci-generic 0005:01:00.0: __mhi_power_down - pm mutex taken > [ 267.505618] mhi-pci-generic 0005:01:00.0: __mhi_power_down - pm lock taken > [ 267.513372] wwan wwan0: port wwan0qcdm0 disconnected > [ 267.519556] wwan wwan0: port wwan0mbim0 disconnected > [ 267.525544] wwan wwan0: port wwan0qmi0 disconnected > [ 267.573773] mhi-pci-generic 0005:01:00.0: __mhi_power_down - returns > [ 267.591199] mhi mhi0: Requested to power ON > [ 267.914688] mhi mhi0: Power on setup success > [ 267.914875] mhi mhi0: Wait for device to enter SBL or Mission mode > [ 267.919179] mhi-pci-generic 0005:01:00.0: mhi_sync_power_up - wait event timeout_ms = 8000 > [ 276.189399] mhi-pci-generic 0005:01:00.0: mhi_sync_power_up - wait event returns, ret = -110 > [ 276.198240] mhi-pci-generic 0005:01:00.0: __mhi_power_down > [ 276.203990] mhi-pci-generic 0005:01:00.0: __mhi_power_down - pm mutex taken > [ 276.211269] mhi-pci-generic 0005:01:00.0: __mhi_power_down - pm lock taken > [ 276.220024] mhi-pci-generic 0005:01:00.0: __mhi_power_down - returns > [ 276.226680] mhi-pci-generic 0005:01:00.0: mhi_sync_power_up - returns > [ 276.233417] mhi-pci-generic 0005:01:00.0: mhi_pci_recovery_work - mhi unprepare after power down > [ 276.242604] mhi-pci-generic 0005:01:00.0: mhi_pci_recovery_work - pci reset > [ 276.249881] mhi-pci-generic 0005:01:00.0: Recovery failed > [ 276.255536] mhi-pci-generic 0005:01:00.0: mhi_pci_recovery_work - returns > [ 276.328878] qcom-pcie 1bf8000.pci: qcom_pcie_suspend_noirq > [ 276.368851] qcom-pcie 1c00000.pci: qcom_pcie_suspend_noirq > [ 276.477900] qcom-pcie 1c00000.pci: Failed to enter L2/L3 > [ 276.483535] qcom-pcie 1c00000.pci: PM: dpm_run_callback(): genpd_suspend_noirq returns -110 > [ 276.492292] qcom-pcie 1c00000.pci: PM: failed to suspend noirq: error -110 > [ 276.500218] qcom-pcie 1bf8000.pci: qcom_pcie_resume_noirq > [ 276.721401] qcom-pcie 1bf8000.pci: PCIe Gen.4 x4 link up > [ 276.730639] PM: noirq suspend of devices failed > [ 276.818943] mhi-pci-generic 0005:01:00.0: mhi_pci_resume > [ 276.824582] mhi-pci-generic 0005:01:00.0: mhi_pci_runtime_resume > > Still better than hanging the machine, but perhaps not much point in > having recovery code that can't recover the device. > Unfortunately, we cannot know if we could not recover the device. > We obviously need to track down what is causing the modem to fail to > resume since 6.13-rc1 too. > Yeah, this is worth tracing down. > > Cc: stable@vger.kernel.org # 5.12 > > Reported-by: Johan Hovold > > Closes: https://lore.kernel.org/mhi/Z1me8iaK7cwgjL92@hovoldconsulting.com > > Reviewed-by: Johan Hovold > Tested-by: Johan Hovold > > And since I've spent way to much time debugging this and provided the > analysis of the deadlock: > > Analyzed-by: Johan Hovold > Link: https://lore.kernel.org/mhi/Z2KKjWY2mPen6GPL@hovoldconsulting.com/ > Sure. > > Fixes: 7389337f0a78 ("mhi: pci_generic: Add suspend/resume/recovery procedure") > > Signed-off-by: Manivannan Sadhasivam > > --- > > drivers/bus/mhi/host/pci_generic.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/drivers/bus/mhi/host/pci_generic.c b/drivers/bus/mhi/host/pci_generic.c > > index 07645ce2119a..e92df380c785 100644 > > --- a/drivers/bus/mhi/host/pci_generic.c > > +++ b/drivers/bus/mhi/host/pci_generic.c > > @@ -1040,7 +1040,7 @@ static void mhi_pci_recovery_work(struct work_struct *work) > > err_unprepare: > > mhi_unprepare_after_power_down(mhi_cntrl); > > err_try_reset: > > - if (pci_reset_function(pdev)) > > + if (pci_try_reset_function(pdev)) > > dev_err(&pdev->dev, "Recovery failed\n"); > > Perhaps you should log the returned error here as a part of this patch > so we can tell when the recovery code failed due to the device lock > being held. > Makes sense. Added the errno to the log and applied to patch to mhi/next with your tags. Thanks a lot! - Mani -- மணிவண்ணன் சதாசிவம்