* [PATCH] net/mlx4: Fix EEH recovery failure
@ 2014-11-22 10:56 Gavin Shan
2014-11-23 16:21 ` Amir Vadai
2014-12-05 4:28 ` Gavin Shan
0 siblings, 2 replies; 8+ messages in thread
From: Gavin Shan @ 2014-11-22 10:56 UTC (permalink / raw)
To: netdev; +Cc: amirv, davem, Gavin Shan
This patch fixes a couple of EEH recovery failures on the PPC PowerNV
platform:
* Release the reserved memory regions in mlx4_pci_err_detected().
Otherwise, __mlx4_init_one() fails because it tries to reserve
the same memory regions a second time.
* Disable the PCI device in mlx4_pci_err_detected(). Otherwise,
pci_enable_device() in __mlx4_init_one() doesn't enable
the PCI device because it is already marked as enabled
by struct pci_dev::enable_cnt.
* Don't clear the struct mlx4_priv instance in mlx4_pci_err_detected().
Otherwise, __mlx4_init_one() crashes the kernel by
dereferencing a NULL pointer.
With the patch applied, EEH recovery for the mlx4 adapter succeeds on the
PPC PowerNV platform.
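To illustrate the first two points, here is a minimal userspace sketch in C.
It is not kernel code. It assumes only that enabling is reference counted, as
struct pci_dev::enable_cnt is, and that reserving already-reserved regions is
refused with -EBUSY. struct model_dev, the model_*() helpers and
simulate_recovery() are hypothetical names made up for this illustration, and
the idea that the slot reset wipes the hardware enable state is an assumption
of the model.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the few pci_dev fields involved. */
struct model_dev {
	int  enable_cnt;        /* mirrors the role of pci_dev::enable_cnt */
	bool hw_enabled;        /* whether the "hardware" is really enabled */
	bool regions_reserved;  /* whether the BARs are currently reserved */
};

/* Like pci_enable_device(): only the 0 -> 1 transition touches hardware. */
static int model_enable(struct model_dev *d)
{
	if (++d->enable_cnt > 1)
		return 0;       /* already counted as enabled: nothing is done */
	d->hw_enabled = true;
	return 0;
}

/* Like pci_disable_device(): only the 1 -> 0 transition touches hardware. */
static void model_disable(struct model_dev *d)
{
	assert(d->enable_cnt > 0);
	if (--d->enable_cnt != 0)
		return;
	d->hw_enabled = false;
}

/* Like pci_request_regions(): a second reservation is refused. */
static int model_request_regions(struct model_dev *d)
{
	if (d->regions_reserved)
		return -EBUSY;
	d->regions_reserved = true;
	return 0;
}

static void model_release_regions(struct model_dev *d)
{
	d->regions_reserved = false;
}

/* Assumption of this model: the reset clears hardware state only. */
static void model_slot_reset(struct model_dev *d)
{
	d->hw_enabled = false;
}

/*
 * Probe, hit an error, reset, then re-initialise.
 * Returns the result of the second region reservation and reports
 * through *hw_enabled whether the second enable reached the hardware.
 */
static int simulate_recovery(bool with_fix, bool *hw_enabled)
{
	struct model_dev d = { 0 };
	int err;

	/* Initial probe. */
	model_enable(&d);
	model_request_regions(&d);

	/* Error detected: the fix undoes both steps before the reset. */
	if (with_fix) {
		model_release_regions(&d);
		model_disable(&d);
	}
	model_slot_reset(&d);

	/* Re-initialisation after the reset. */
	model_enable(&d);
	*hw_enabled = d.hw_enabled;
	err = model_request_regions(&d);
	return err;
}
```

In this model, simulate_recovery(false, &hw) returns -EBUSY and leaves hw
false, while simulate_recovery(true, &hw) returns 0 with hw true. The real
driver has far more state than this; the sketch only shows why both calls
have to be undone before re-initialisation.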
# lspci
0003:0f:00.0 Network controller: Mellanox Technologies \
MT27500 Family [ConnectX-3]
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
drivers/net/ethernet/mellanox/mlx4/main.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c
index 90de6e1..e118ac9 100644
--- a/drivers/net/ethernet/mellanox/mlx4/main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/main.c
@@ -2809,7 +2809,6 @@ static void mlx4_unload_one(struct pci_dev *pdev)
kfree(dev->caps.qp1_proxy);
kfree(dev->dev_vfs);
- memset(priv, 0, sizeof(*priv));
priv->pci_dev_data = pci_dev_data;
priv->removed = 1;
}
@@ -2900,6 +2899,8 @@ static pci_ers_result_t mlx4_pci_err_detected(struct pci_dev *pdev,
pci_channel_state_t state)
{
mlx4_unload_one(pdev);
+ pci_release_regions(pdev);
+ pci_disable_device(pdev);
return state == pci_channel_io_perm_failure ?
PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_NEED_RESET;
--
1.8.3.2
^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: [PATCH] net/mlx4: Fix EEH recovery failure
2014-11-22 10:56 [PATCH] net/mlx4: Fix EEH recovery failure Gavin Shan
@ 2014-11-23 16:21 ` Amir Vadai
2014-11-24 21:42 ` Gavin Shan
2014-12-05 4:28 ` Gavin Shan
1 sibling, 1 reply; 8+ messages in thread
From: Amir Vadai @ 2014-11-23 16:21 UTC (permalink / raw)
To: Gavin Shan, netdev, Or Gerlitz; +Cc: davem, yishaih
On 11/22/2014 12:56 PM, Gavin Shan wrote:
> The patch fixes couple of EEH recovery failures on PPC PowerNV
> platform:
>
> * Release reserved memory regions in mlx4_pci_err_detected().
> Otherwise, __mlx4_init_one() fails because of reserving
> same memory regions recursively.
> * Disable PCI device in mlx4_pci_err_detected(). Otherwise,
> pci_enable_device() in __mlx4_init_one() doesn't enable
> the PCI device because it's already in enabled state indicated
> by struct pci_dev::enable_cnt.
> * Don't clear struct mlx4_priv instance in mlx4_pci_err_detected().
> Otherwise, __mlx4_init_one() runs into kernel crash because
> of dereferencing to NULL pointer.
>
> With the patch applied, EEH recovery for mlx4 adapter succeeds on PPC
> PowerNV platform.
>
> # lspci
> 0003:0f:00.0 Network controller: Mellanox Technologies \
> MT27500 Family [ConnectX-3]
>
> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Hi Gavin,
Yishai (added to the CC) is a few days away from sending a patchset that
fixes the reset flow, and it includes a fix for EEH recovery.
I would be happy if you could wait for Yishai's complete reset flow fix.
If you'd like, I can send you the patchset to try. It is currently under
review inside Mellanox before being sent to the mailing list.
Thanks,
Amir
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] net/mlx4: Fix EEH recovery failure
2014-11-23 16:21 ` Amir Vadai
@ 2014-11-24 21:42 ` Gavin Shan
0 siblings, 0 replies; 8+ messages in thread
From: Gavin Shan @ 2014-11-24 21:42 UTC (permalink / raw)
To: Amir Vadai; +Cc: Gavin Shan, netdev, Or Gerlitz, davem, yishaih
On Sun, Nov 23, 2014 at 06:21:47PM +0200, Amir Vadai wrote:
>On 11/22/2014 12:56 PM, Gavin Shan wrote:
>> The patch fixes couple of EEH recovery failures on PPC PowerNV
>> platform:
>>
>> * Release reserved memory regions in mlx4_pci_err_detected().
>> Otherwise, __mlx4_init_one() fails because of reserving
>> same memory regions recursively.
>> * Disable PCI device in mlx4_pci_err_detected(). Otherwise,
>> pci_enable_device() in __mlx4_init_one() doesn't enable
>> the PCI device because it's already in enabled state indicated
>> by struct pci_dev::enable_cnt.
>> * Don't clear struct mlx4_priv instance in mlx4_pci_err_detected().
>> Otherwise, __mlx4_init_one() runs into kernel crash because
>> of dereferencing to NULL pointer.
>>
>> With the patch applied, EEH recovery for mlx4 adapter succeeds on PPC
>> PowerNV platform.
>>
>> # lspci
>> 0003:0f:00.0 Network controller: Mellanox Technologies \
>> MT27500 Family [ConnectX-3]
>>
>> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>
>Hi Gavin,
>
>Yishai (added to the CC) is few days before sending a patchset to fix
>the reset flow and inside it there is a fix to EEH recovery.
>I would be happy if you could wait for the whole reset flow fix by Yishai.
>
Yes, it's not urgent and I can wait. Thanks for the info.
>If you'd like, I can send you the patchset to try. Currently it is under
>review inside Mellanox before being sent to the mailing list.
>
It would be nice if you could send me the patchset so I can give it a try.
Thanks,
Gavin
>Thanks,
>Amir
>
>
>
>
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] net/mlx4: Fix EEH recovery failure
2014-11-22 10:56 [PATCH] net/mlx4: Fix EEH recovery failure Gavin Shan
2014-11-23 16:21 ` Amir Vadai
@ 2014-12-05 4:28 ` Gavin Shan
1 sibling, 0 replies; 8+ messages in thread
From: Gavin Shan @ 2014-12-05 4:28 UTC (permalink / raw)
To: Gavin Shan; +Cc: netdev, amirv, davem, yishaih
On Sat, Nov 22, 2014 at 09:56:47PM +1100, Gavin Shan wrote:
Yishai already had patches fixing the issue. So please ignore
this patch and drop it.
Thanks,
Gavin
>The patch fixes couple of EEH recovery failures on PPC PowerNV
>platform:
>
> * Release reserved memory regions in mlx4_pci_err_detected().
> Otherwise, __mlx4_init_one() fails because of reserving
> same memory regions recursively.
> * Disable PCI device in mlx4_pci_err_detected(). Otherwise,
> pci_enable_device() in __mlx4_init_one() doesn't enable
> the PCI device because it's already in enabled state indicated
> by struct pci_dev::enable_cnt.
> * Don't clear struct mlx4_priv instance in mlx4_pci_err_detected().
> Otherwise, __mlx4_init_one() runs into kernel crash because
> of dereferencing to NULL pointer.
>
>With the patch applied, EEH recovery for mlx4 adapter succeeds on PPC
>PowerNV platform.
>
> # lspci
> 0003:0f:00.0 Network controller: Mellanox Technologies \
> MT27500 Family [ConnectX-3]
>
>Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>---
> drivers/net/ethernet/mellanox/mlx4/main.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
>diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c
>index 90de6e1..e118ac9 100644
>--- a/drivers/net/ethernet/mellanox/mlx4/main.c
>+++ b/drivers/net/ethernet/mellanox/mlx4/main.c
>@@ -2809,7 +2809,6 @@ static void mlx4_unload_one(struct pci_dev *pdev)
> kfree(dev->caps.qp1_proxy);
> kfree(dev->dev_vfs);
>
>- memset(priv, 0, sizeof(*priv));
> priv->pci_dev_data = pci_dev_data;
> priv->removed = 1;
> }
>@@ -2900,6 +2899,8 @@ static pci_ers_result_t mlx4_pci_err_detected(struct pci_dev *pdev,
> pci_channel_state_t state)
> {
> mlx4_unload_one(pdev);
>+ pci_release_regions(pdev);
>+ pci_disable_device(pdev);
>
> return state == pci_channel_io_perm_failure ?
> PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_NEED_RESET;
>--
>1.8.3.2
>
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] net/mlx4: Fix EEH recovery failure
@ 2014-11-24 21:17 Or Gerlitz
2014-11-24 21:55 ` Gavin Shan
0 siblings, 1 reply; 8+ messages in thread
From: Or Gerlitz @ 2014-11-24 21:17 UTC (permalink / raw)
To: Gavin Shan; +Cc: Linux Netdev List, Amir Vadai, David Miller
On Sat, Nov 22, 2014 at 12:56 PM, Gavin Shan <gwshan@linux.vnet.ibm.com> wrote:
> The patch fixes couple of EEH recovery failures on PPC PowerNV
> platform:
> * Don't clear struct mlx4_priv instance in mlx4_pci_err_detected().
> Otherwise, __mlx4_init_one() runs into kernel crash because
> of dereferencing to NULL pointer.
I don't see this change in the patch. What I see is that the clearing of
mlx4_priv is removed from __mlx4_unload_one. Please clarify. Also, is this
patch based on / targeted at the net or the net-next tree?
> With the patch applied, EEH recovery for mlx4 adapter succeeds on PPC
> PowerNV platform.
>
> # lspci
> 0003:0f:00.0 Network controller: Mellanox Technologies \
> MT27500 Family [ConnectX-3]
>
> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
> ---
> drivers/net/ethernet/mellanox/mlx4/main.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c
> index 90de6e1..e118ac9 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/main.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/main.c
> @@ -2809,7 +2809,6 @@ static void mlx4_unload_one(struct pci_dev *pdev)
> kfree(dev->caps.qp1_proxy);
> kfree(dev->dev_vfs);
>
> - memset(priv, 0, sizeof(*priv));
> priv->pci_dev_data = pci_dev_data;
> priv->removed = 1;
> }
> @@ -2900,6 +2899,8 @@ static pci_ers_result_t mlx4_pci_err_detected(struct pci_dev *pdev,
> pci_channel_state_t state)
> {
> mlx4_unload_one(pdev);
> + pci_release_regions(pdev);
> + pci_disable_device(pdev);
>
> return state == pci_channel_io_perm_failure ?
> PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_NEED_RESET;
> --
> 1.8.3.2
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] net/mlx4: Fix EEH recovery failure
2014-11-24 21:17 Or Gerlitz
@ 2014-11-24 21:55 ` Gavin Shan
2014-11-25 22:00 ` Or Gerlitz
0 siblings, 1 reply; 8+ messages in thread
From: Gavin Shan @ 2014-11-24 21:55 UTC (permalink / raw)
To: Or Gerlitz; +Cc: Gavin Shan, Linux Netdev List, Amir Vadai, David Miller
On Mon, Nov 24, 2014 at 11:17:55PM +0200, Or Gerlitz wrote:
>On Sat, Nov 22, 2014 at 12:56 PM, Gavin Shan <gwshan@linux.vnet.ibm.com> wrote:
>> The patch fixes couple of EEH recovery failures on PPC PowerNV
>> platform:
>
>> * Don't clear struct mlx4_priv instance in mlx4_pci_err_detected().
>> Otherwise, __mlx4_init_one() runs into kernel crash because
>> of dereferencing to NULL pointer.
>
>I don't see this change in the patch, I see no-clearing of mlx4_priv
>in __mlx4_unload_one - please clarify, also is this patch
>based/targeted on the net or net-next tree?
>
Yes, it should read: don't clear the struct mlx4_priv instance in
mlx4_unload_one(), which is called by mlx4_pci_err_detected().
It's based on 3.18-rc5, with a couple of my EEH fixes on top of it.
I hit the issue when testing EEH with it.
Thanks,
Gavin
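For readers following along, the wipe-and-restore pattern in question can be
sketched in userspace C as below. This is only an illustration: struct
model_priv and its name field are hypothetical, and the thread does not say
which pointer __mlx4_init_one() actually dereferences. The sketch shows only
that a memset() over the whole structure clears every pointer that is not
explicitly saved and restored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a driver-private structure. */
struct model_priv {
	int         pci_dev_data;  /* saved and restored across the wipe */
	bool        removed;       /* set after unload */
	const char *name;          /* hypothetical pointer set up at probe time */
};

/* Mirrors the shape of the unload tail: optional wipe, restore, mark. */
static void model_unload(struct model_priv *priv, bool wipe)
{
	int pci_dev_data;

	assert(priv != NULL);
	pci_dev_data = priv->pci_dev_data;

	if (wipe)
		memset(priv, 0, sizeof(*priv));  /* zeroes name as well */

	priv->pci_dev_data = pci_dev_data;
	priv->removed = true;
}

/* Returns true if the probe-time pointer is still usable after unload. */
static bool model_name_survives(bool wipe)
{
	struct model_priv priv = {
		.pci_dev_data = 7,
		.removed = false,
		.name = "mlx4",
	};

	model_unload(&priv, wipe);
	return priv.name != NULL;
}
```

With the wipe, model_name_survives(true) is false, so any later code that
reuses the structure without setting the pointer again would dereference
NULL. Without the wipe the pointer survives, but so does every other stale
field, which is the trade-off to keep in mind.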
>
>
>> With the patch applied, EEH recovery for mlx4 adapter succeeds on PPC
>> PowerNV platform.
>>
>> # lspci
>> 0003:0f:00.0 Network controller: Mellanox Technologies \
>> MT27500 Family [ConnectX-3]
>>
>> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>> ---
>> drivers/net/ethernet/mellanox/mlx4/main.c | 3 ++-
>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c
>> index 90de6e1..e118ac9 100644
>> --- a/drivers/net/ethernet/mellanox/mlx4/main.c
>> +++ b/drivers/net/ethernet/mellanox/mlx4/main.c
>> @@ -2809,7 +2809,6 @@ static void mlx4_unload_one(struct pci_dev *pdev)
>> kfree(dev->caps.qp1_proxy);
>> kfree(dev->dev_vfs);
>>
>> - memset(priv, 0, sizeof(*priv));
>> priv->pci_dev_data = pci_dev_data;
>> priv->removed = 1;
>> }
>> @@ -2900,6 +2899,8 @@ static pci_ers_result_t mlx4_pci_err_detected(struct pci_dev *pdev,
>> pci_channel_state_t state)
>> {
>> mlx4_unload_one(pdev);
>> + pci_release_regions(pdev);
>> + pci_disable_device(pdev);
>>
>> return state == pci_channel_io_perm_failure ?
>> PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_NEED_RESET;
>> --
>> 1.8.3.2
>>
>
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] net/mlx4: Fix EEH recovery failure
2014-11-24 21:55 ` Gavin Shan
@ 2014-11-25 22:00 ` Or Gerlitz
2014-11-25 22:21 ` Gavin Shan
0 siblings, 1 reply; 8+ messages in thread
From: Or Gerlitz @ 2014-11-25 22:00 UTC (permalink / raw)
To: Gavin Shan
Cc: Linux Netdev List, Amir Vadai, David Miller, Wei Yang,
Yishai Hadas, Jack Morgenstein
On Mon, Nov 24, 2014 at 11:55 PM, Gavin Shan <gwshan@linux.vnet.ibm.com> wrote:
> On Mon, Nov 24, 2014 at 11:17:55PM +0200, Or Gerlitz wrote:
>>On Sat, Nov 22, 2014 at 12:56 PM, Gavin Shan <gwshan@linux.vnet.ibm.com> wrote:
>>> The patch fixes couple of EEH recovery failures on PPC PowerNV
>>> platform:
>>
>>> * Don't clear struct mlx4_priv instance in mlx4_pci_err_detected().
>>> Otherwise, __mlx4_init_one() runs into kernel crash because
>>> of dereferencing to NULL pointer.
>>
>>I don't see this change in the patch, I see no-clearing of mlx4_priv
>>in __mlx4_unload_one - please clarify, also is this patch
>>based/targeted on the net or net-next tree?
>>
>
> Yes, It would be: Don't clear struct mlx4_priv instance in mlx4_unload_one(),
> which is called by mlx4_pci_err_detected().
But the struct mlx4_priv instance is cleared in mlx4_unload_one() for
a reason. I suspect that you may have made the EEH callback work but
broken something else. For example, did you make sure that kexec still
works after your changes as it did before?
> It's based on 3.18.rc5, where I had couple of EEH fixes on top of it.
> When testing EEH with it, I hit the issue.
>>> With the patch applied, EEH recovery for mlx4 adapter succeeds on PPC
>>> PowerNV platform.
>>>
>>> # lspci
>>> 0003:0f:00.0 Network controller: Mellanox Technologies \
>>> MT27500 Family [ConnectX-3]
>>>
>>> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>>> ---
>>> drivers/net/ethernet/mellanox/mlx4/main.c | 3 ++-
>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c
>>> index 90de6e1..e118ac9 100644
>>> --- a/drivers/net/ethernet/mellanox/mlx4/main.c
>>> +++ b/drivers/net/ethernet/mellanox/mlx4/main.c
>>> @@ -2809,7 +2809,6 @@ static void mlx4_unload_one(struct pci_dev *pdev)
>>> kfree(dev->caps.qp1_proxy);
>>> kfree(dev->dev_vfs);
>>>
>>> - memset(priv, 0, sizeof(*priv));
>>> priv->pci_dev_data = pci_dev_data;
>>> priv->removed = 1;
>>> }
>>> @@ -2900,6 +2899,8 @@ static pci_ers_result_t mlx4_pci_err_detected(struct pci_dev *pdev,
>>> pci_channel_state_t state)
>>> {
>>> mlx4_unload_one(pdev);
>>> + pci_release_regions(pdev);
>>> + pci_disable_device(pdev);
>>>
>>> return state == pci_channel_io_perm_failure ?
>>> PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_NEED_RESET;
>>> --
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] net/mlx4: Fix EEH recovery failure
2014-11-25 22:00 ` Or Gerlitz
@ 2014-11-25 22:21 ` Gavin Shan
0 siblings, 0 replies; 8+ messages in thread
From: Gavin Shan @ 2014-11-25 22:21 UTC (permalink / raw)
To: Or Gerlitz
Cc: Gavin Shan, Linux Netdev List, Amir Vadai, David Miller, Wei Yang,
Yishai Hadas, Jack Morgenstein
On Wed, Nov 26, 2014 at 12:00:31AM +0200, Or Gerlitz wrote:
>On Mon, Nov 24, 2014 at 11:55 PM, Gavin Shan <gwshan@linux.vnet.ibm.com> wrote:
>> On Mon, Nov 24, 2014 at 11:17:55PM +0200, Or Gerlitz wrote:
>>>On Sat, Nov 22, 2014 at 12:56 PM, Gavin Shan <gwshan@linux.vnet.ibm.com> wrote:
>>>> The patch fixes couple of EEH recovery failures on PPC PowerNV
>>>> platform:
>>>
>>>> * Don't clear struct mlx4_priv instance in mlx4_pci_err_detected().
>>>> Otherwise, __mlx4_init_one() runs into kernel crash because
>>>> of dereferencing to NULL pointer.
>>>
>>>I don't see this change in the patch, I see no-clearing of mlx4_priv
>>>in __mlx4_unload_one - please clarify, also is this patch
>>>based/targeted on the net or net-next tree?
>>>
>>
>> Yes, It would be: Don't clear struct mlx4_priv instance in mlx4_unload_one(),
>> which is called by mlx4_pci_err_detected().
>
>
>But the struct mlx4_priv instance is cleared in mlx4_unload_one() for
>a reason, I suspect that you might made the EEH callback to work, but
>broke something else... e.g did you made sure that kexec works after
>your changes as it did before?
>
No, I haven't tried kexec yet. I'll give it a try, thanks!
Gavin
>> It's based on 3.18.rc5, where I had couple of EEH fixes on top of it.
>> When testing EEH with it, I hit the issue.
>
>>>> With the patch applied, EEH recovery for mlx4 adapter succeeds on PPC
>>>> PowerNV platform.
>>>>
>>>> # lspci
>>>> 0003:0f:00.0 Network controller: Mellanox Technologies \
>>>> MT27500 Family [ConnectX-3]
>>>>
>>>> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>>>> ---
>>>> drivers/net/ethernet/mellanox/mlx4/main.c | 3 ++-
>>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c
>>>> index 90de6e1..e118ac9 100644
>>>> --- a/drivers/net/ethernet/mellanox/mlx4/main.c
>>>> +++ b/drivers/net/ethernet/mellanox/mlx4/main.c
>>>> @@ -2809,7 +2809,6 @@ static void mlx4_unload_one(struct pci_dev *pdev)
>>>> kfree(dev->caps.qp1_proxy);
>>>> kfree(dev->dev_vfs);
>>>>
>>>> - memset(priv, 0, sizeof(*priv));
>>>> priv->pci_dev_data = pci_dev_data;
>>>> priv->removed = 1;
>>>> }
>>>> @@ -2900,6 +2899,8 @@ static pci_ers_result_t mlx4_pci_err_detected(struct pci_dev *pdev,
>>>> pci_channel_state_t state)
>>>> {
>>>> mlx4_unload_one(pdev);
>>>> + pci_release_regions(pdev);
>>>> + pci_disable_device(pdev);
>>>>
>>>> return state == pci_channel_io_perm_failure ?
>>>> PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_NEED_RESET;
>>>> --
>
^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2014-12-05 4:28 UTC | newest]
Thread overview: 8+ messages
2014-11-22 10:56 [PATCH] net/mlx4: Fix EEH recovery failure Gavin Shan
2014-11-23 16:21 ` Amir Vadai
2014-11-24 21:42 ` Gavin Shan
2014-12-05 4:28 ` Gavin Shan
-- strict thread matches above, loose matches on Subject: below --
2014-11-24 21:17 Or Gerlitz
2014-11-24 21:55 ` Gavin Shan
2014-11-25 22:00 ` Or Gerlitz
2014-11-25 22:21 ` Gavin Shan