netdev.vger.kernel.org archive mirror
* Re: EEH recovery failing on mlx5 card
       [not found] <c13fa245-64ed-f87c-fd1e-e618fe017359@linux.ibm.com>
@ 2023-07-17 13:10 ` Leon Romanovsky
  2023-07-17 16:18   ` Moshe Shemesh
  0 siblings, 1 reply; 3+ messages in thread
From: Leon Romanovsky @ 2023-07-17 13:10 UTC (permalink / raw)
  To: Ganesh G R, Moshe Shemesh; +Cc: saeedm, netdev, oohall, Mahesh Salgaonkar

+ Moshe

On Mon, Jul 17, 2023 at 12:48:37PM +0530, Ganesh G R wrote:
> Hi,
> 
> mlx5 cards are failing to recover from PCI errors. Upon investigation, we found that the
> driver is trying to do MMIO in the middle of EEH error handling.
> The following fix in mlx5_pci_err_detected() resolves the issue. Do you think it's the right fix?
> 
> @@ -1847,6 +1847,7 @@ static pci_ers_result_t mlx5_pci_err_detected(struct pci_dev *pdev,
>         mlx5_unload_one(dev, true);
>         mlx5_drain_health_wq(dev);
>         mlx5_pci_disable_device(dev);
> +       cancel_delayed_work_sync(&clock->timer.overflow_work);
>         res = state == pci_channel_io_perm_failure ?
>                 PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_NEED_RESET;
> 
> Regards
> Ganesh


* Re: EEH recovery failing on mlx5 card
  2023-07-17 13:10 ` EEH recovery failing on mlx5 card Leon Romanovsky
@ 2023-07-17 16:18   ` Moshe Shemesh
       [not found]     ` <b7ad516f-da16-2ba0-98e3-4f16f47e0fc8@linux.ibm.com>
  0 siblings, 1 reply; 3+ messages in thread
From: Moshe Shemesh @ 2023-07-17 16:18 UTC (permalink / raw)
  To: Leon Romanovsky, Ganesh G R; +Cc: saeedm, netdev, oohall, Mahesh Salgaonkar



On 7/17/2023 4:10 PM, Leon Romanovsky wrote:
> 
> + Moshe
> 
> On Mon, Jul 17, 2023 at 12:48:37PM +0530, Ganesh G R wrote:
>> Hi,
>>
>> mlx5 cards are failing to recover from PCI errors. Upon investigation, we found that the
>> driver is trying to do MMIO in the middle of EEH error handling.
>> The following fix in mlx5_pci_err_detected() resolves the issue. Do you think it's the right fix?
>>
>> @@ -1847,6 +1847,7 @@ static pci_ers_result_t mlx5_pci_err_detected(struct pci_dev *pdev,
>>          mlx5_unload_one(dev, true);
>>          mlx5_drain_health_wq(dev);
>>          mlx5_pci_disable_device(dev);
>> +       cancel_delayed_work_sync(&clock->timer.overflow_work);
>>          res = state == pci_channel_io_perm_failure ?
>>                  PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_NEED_RESET;
>>

Hi Ganesh,
Thanks for pointing out this issue and a possible solution!
I would rather fix it in the work itself.
Please test this fix instead:

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c
index 973babfaff25..2ad0bcc0f1b1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c
@@ -227,10 +227,15 @@ static void mlx5_timestamp_overflow(struct work_struct *work)
         clock = container_of(timer, struct mlx5_clock, timer);
         mdev = container_of(clock, struct mlx5_core_dev, clock);

+        if (mdev->state == MLX5_DEVICE_STATE_INTERNAL_ERROR)
+                goto out;
+
         write_seqlock_irqsave(&clock->lock, flags);
         timecounter_read(&timer->tc);
         mlx5_update_clock_info_page(mdev);
         write_sequnlock_irqrestore(&clock->lock, flags);
+
+out:
         schedule_delayed_work(&timer->overflow_work, timer->overflow_period);
 }
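
The idea behind the patch above can be illustrated outside the kernel. What follows is only a minimal userspace sketch, not driver code: `struct fake_dev`, `enum dev_state`, `fake_read_clock()` and the counters are hypothetical stand-ins invented for illustration, and the "MMIO" is just a counter increment. It shows a periodic work body that skips the hardware access while the device is flagged as being in an error state, yet still re-arms itself so that it resumes once the device has recovered.

```c
/* Hypothetical stand-ins; these are NOT the real mlx5 definitions. */
enum dev_state { DEV_STATE_UP, DEV_STATE_INTERNAL_ERROR };

struct fake_dev {
    enum dev_state state;
    unsigned int mmio_reads;   /* number of simulated register accesses */
    unsigned int reschedules;  /* number of times the work re-armed itself */
};

/* Simulated MMIO read. On real hardware this access must not happen
 * while the PCI channel is frozen for error recovery. */
static unsigned long fake_read_clock(struct fake_dev *dev)
{
    dev->mmio_reads++;
    return 0UL;
}

/* Periodic work body: skip the hardware access while the device is in
 * the error state, but always fall through to the re-arm step. */
static void overflow_work(struct fake_dev *dev)
{
    if (dev->state == DEV_STATE_INTERNAL_ERROR)
        goto out;

    (void)fake_read_clock(dev);

out:
    dev->reschedules++;  /* stands in for schedule_delayed_work() */
}
```

Calling `overflow_work()` once while healthy, once in the error state, and once after recovery should leave the sketch with two simulated reads and three re-arms. Compared with cancelling the work in the error handler, this keeps the check next to the code that touches the hardware, and nothing has to restart the work after recovery.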


>> Regards
>> Ganesh


* Re: EEH recovery failing on mlx5 card
       [not found]     ` <b7ad516f-da16-2ba0-98e3-4f16f47e0fc8@linux.ibm.com>
@ 2023-07-19  7:46       ` Moshe Shemesh
  0 siblings, 0 replies; 3+ messages in thread
From: Moshe Shemesh @ 2023-07-19  7:46 UTC (permalink / raw)
  To: Ganesh G R, Leon Romanovsky; +Cc: saeedm, netdev, oohall, Mahesh Salgaonkar



On 7/18/2023 9:47 PM, Ganesh G R wrote:
> 
> On 7/17/23 9:48 PM, Moshe Shemesh wrote:
> 
>> On 7/17/2023 4:10 PM, Leon Romanovsky wrote:
>>> + Moshe
>>> On Mon, Jul 17, 2023 at 12:48:37PM +0530, Ganesh G R wrote:
>>>> Hi,
>>>> mlx5 cards are failing to recover from PCI errors. Upon investigation, we found that the
>>>> driver is trying to do MMIO in the middle of EEH error handling.
>>>> The following fix in mlx5_pci_err_detected() resolves the issue. Do you think it's the right fix?
>>>> @@ -1847,6 +1847,7 @@ static pci_ers_result_t mlx5_pci_err_detected(struct pci_dev *pdev,
>>>>           mlx5_unload_one(dev, true);
>>>>           mlx5_drain_health_wq(dev);
>>>>           mlx5_pci_disable_device(dev);
>>>> +       cancel_delayed_work_sync(&clock->timer.overflow_work);
>>>>           res = state == pci_channel_io_perm_failure ?
>>>>                   PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_NEED_RESET;
>> Hi Ganesh,
>> Thanks for pointing to this issue and its solution!
>> I would rather fix it in the work itself.
>> Please test this fix instead :
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c
>> index 973babfaff25..2ad0bcc0f1b1 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c
>> @@ -227,10 +227,15 @@ static void mlx5_timestamp_overflow(struct work_struct *work)
>>          clock = container_of(timer, struct mlx5_clock, timer);
>>          mdev = container_of(clock, struct mlx5_core_dev, clock);
>> +        if (mdev->state == MLX5_DEVICE_STATE_INTERNAL_ERROR)
>> +                goto out;
>> +
>>          write_seqlock_irqsave(&clock->lock, flags);
>>          timecounter_read(&timer->tc);
>>          mlx5_update_clock_info_page(mdev);
>>          write_sequnlock_irqrestore(&clock->lock, flags);
>> +
>> +out:
>>          schedule_delayed_work(&timer->overflow_work, timer->overflow_period);
>>  }
> Thanks Moshe, the fix looks clean and it worked.
> 

Great, thank you!

