From mboxrd@z Thu Jan 1 00:00:00 1970
From: Gavin Shan
Subject: Re: [PATCH] net/mlx4: Fix EEH recovery failure
Date: Tue, 25 Nov 2014 08:55:55 +1100
Message-ID: <20141124215555.GA6970@shangw>
References:
Reply-To: Gavin Shan
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: Gavin Shan, Linux Netdev List, Amir Vadai, David Miller
To: Or Gerlitz
Return-path:
Received: from e23smtp09.au.ibm.com ([202.81.31.142]:45039 "EHLO
	e23smtp09.au.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1750726AbaKXV4F (ORCPT );
	Mon, 24 Nov 2014 16:56:05 -0500
Received: from /spool/local by e23smtp09.au.ibm.com with IBM ESMTP SMTP
	Gateway: Authorized Use Only! Violators will be prosecuted for from ;
	Tue, 25 Nov 2014 07:56:00 +1000
Received: from d23relay09.au.ibm.com (d23relay09.au.ibm.com [9.185.63.181])
	by d23dlp02.au.ibm.com (Postfix) with ESMTP id 516DF2BB0059 for ;
	Tue, 25 Nov 2014 08:55:58 +1100 (EST)
Received: from d23av01.au.ibm.com (d23av01.au.ibm.com [9.190.234.96])
	by d23relay09.au.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP
	id sAOLtvPD4456506 for ; Tue, 25 Nov 2014 08:55:58 +1100
Received: from d23av01.au.ibm.com (localhost [127.0.0.1])
	by d23av01.au.ibm.com (8.14.4/8.14.4/NCO v10.0 AVout) with ESMTP
	id sAOLtudO028508 for ; Tue, 25 Nov 2014 08:55:57 +1100
Content-Disposition: inline
In-Reply-To:
Sender: netdev-owner@vger.kernel.org
List-ID:

On Mon, Nov 24, 2014 at 11:17:55PM +0200, Or Gerlitz wrote:
>On Sat, Nov 22, 2014 at 12:56 PM, Gavin Shan wrote:
>> The patch fixes couple of EEH recovery failures on PPC PowerNV
>> platform:
>
>> * Don't clear struct mlx4_priv instance in mlx4_pci_err_detected().
>>   Otherwise, __mlx4_init_one() runs into kernel crash because
>>   of dereferencing to NULL pointer.
>
>I don't see this change in the patch, I see no-clearing of mlx4_priv
>in __mlx4_unload_one - please clarify, also is this patch
>based/targeted on the net or net-next tree?
>

Yes, it should read: Don't clear the struct mlx4_priv instance in
mlx4_unload_one(), which is called by mlx4_pci_err_detected().

The patch is based on 3.18-rc5, with a couple of my EEH fixes applied on
top of it. I hit the issue when testing EEH on that tree.

Thanks,
Gavin

>
>
>> With the patch applied, EEH recovery for mlx4 adapter succeeds on PPC
>> PowerNV platform.
>>
>> # lspci
>> 0003:0f:00.0 Network controller: Mellanox Technologies \
>>              MT27500 Family [ConnectX-3]
>>
>> Signed-off-by: Gavin Shan
>> ---
>>  drivers/net/ethernet/mellanox/mlx4/main.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c
>> index 90de6e1..e118ac9 100644
>> --- a/drivers/net/ethernet/mellanox/mlx4/main.c
>> +++ b/drivers/net/ethernet/mellanox/mlx4/main.c
>> @@ -2809,7 +2809,6 @@ static void mlx4_unload_one(struct pci_dev *pdev)
>>         kfree(dev->caps.qp1_proxy);
>>         kfree(dev->dev_vfs);
>>
>> -       memset(priv, 0, sizeof(*priv));
>>         priv->pci_dev_data = pci_dev_data;
>>         priv->removed = 1;
>>  }
>> @@ -2900,6 +2899,8 @@ static pci_ers_result_t mlx4_pci_err_detected(struct pci_dev *pdev,
>>                                               pci_channel_state_t state)
>>  {
>>         mlx4_unload_one(pdev);
>> +       pci_release_regions(pdev);
>> +       pci_disable_device(pdev);
>>
>>         return state == pci_channel_io_perm_failure ?
>>                 PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_NEED_RESET;
>> --
>> 1.8.3.2
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>