* Re: EEH recovery failing on mlx5 card
       [not found] <c13fa245-64ed-f87c-fd1e-e618fe017359@linux.ibm.com>
@ 2023-07-17 13:10 ` Leon Romanovsky
  2023-07-17 16:18   ` Moshe Shemesh
  0 siblings, 1 reply; 3+ messages in thread
From: Leon Romanovsky @ 2023-07-17 13:10 UTC
To: Ganesh G R, Moshe Shemesh; +Cc: saeedm, netdev, oohall, Mahesh Salgaonkar

+ Moshe

On Mon, Jul 17, 2023 at 12:48:37PM +0530, Ganesh G R wrote:
> Hi,
>
> mlx5 cards are failing to recover from PCI errors. Upon investigation, we found that the
> driver is trying to do MMIO in the middle of EEH error handling.
> The following change in mlx5_pci_err_detected() fixes the issue. Do you think it's the right fix?
>
> @@ -1847,6 +1847,7 @@ static pci_ers_result_t mlx5_pci_err_detected(struct pci_dev *pdev,
>  	mlx5_unload_one(dev, true);
>  	mlx5_drain_health_wq(dev);
>  	mlx5_pci_disable_device(dev);
> +	cancel_delayed_work_sync(&clock->timer.overflow_work);
>  	res = state == pci_channel_io_perm_failure ?
>  		PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_NEED_RESET;
>
> Regards
> Ganesh
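The ordering proposed in the message above (disable the device, then cancel the pending overflow work so it cannot touch the hardware) can be modelled with a small userspace sketch. Every name here (`mock_dev`, `mock_err_detected`, `mock_run_pending`, and so on) is hypothetical and exists only for illustration; this is not mlx5 or kernel code. The real `cancel_delayed_work_sync()` also waits for an already running instance to finish, which this single-threaded model does not attempt to show.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the device; not a real mlx5 structure. */
struct mock_dev {
	bool pci_enabled;        /* false once the device is disabled for recovery */
	bool work_pending;       /* models a queued delayed work item */
	int mmio_while_disabled; /* counts register accesses made after disable */
};

/* Models the periodic overflow work: it touches the device, then re-arms. */
static void mock_overflow_work(struct mock_dev *d)
{
	assert(d != NULL);
	if (!d->pci_enabled)
		d->mmio_while_disabled++; /* the access the thread wants to avoid */
	d->work_pending = true;           /* the work always reschedules itself */
}

/* Models the workqueue firing whatever is currently queued. */
static void mock_run_pending(struct mock_dev *d)
{
	assert(d != NULL);
	if (d->work_pending) {
		d->work_pending = false;
		mock_overflow_work(d);
	}
}

/* Models cancelling the queued work (assumed idle in this sketch). */
static void mock_cancel_work(struct mock_dev *d)
{
	assert(d != NULL);
	d->work_pending = false;
}

/* Models the error-detected callback, with and without the proposed cancel. */
static void mock_err_detected(struct mock_dev *d, bool cancel_work)
{
	assert(d != NULL);
	d->pci_enabled = false;
	if (cancel_work)
		mock_cancel_work(d);
}
```

Without the cancel, the queued work still fires after the device is disabled and records one stray access; with it, nothing is left to fire. Note that in this model the cancelled work stays cancelled, so something would have to queue it again once recovery completes.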
* Re: EEH recovery failing on mlx5 card
  2023-07-17 13:10 ` EEH recovery failing on mlx5 card Leon Romanovsky
@ 2023-07-17 16:18   ` Moshe Shemesh
       [not found]     ` <b7ad516f-da16-2ba0-98e3-4f16f47e0fc8@linux.ibm.com>
  0 siblings, 1 reply; 3+ messages in thread
From: Moshe Shemesh @ 2023-07-17 16:18 UTC
To: Leon Romanovsky, Ganesh G R; +Cc: saeedm, netdev, oohall, Mahesh Salgaonkar

On 7/17/2023 4:10 PM, Leon Romanovsky wrote:
>
> + Moshe
>
> On Mon, Jul 17, 2023 at 12:48:37PM +0530, Ganesh G R wrote:
>> Hi,
>>
>> mlx5 cards are failing to recover from PCI errors. Upon investigation, we found that the
>> driver is trying to do MMIO in the middle of EEH error handling.
>> The following change in mlx5_pci_err_detected() fixes the issue. Do you think it's the right fix?
>>
>> @@ -1847,6 +1847,7 @@ static pci_ers_result_t mlx5_pci_err_detected(struct pci_dev *pdev,
>>  	mlx5_unload_one(dev, true);
>>  	mlx5_drain_health_wq(dev);
>>  	mlx5_pci_disable_device(dev);
>> +	cancel_delayed_work_sync(&clock->timer.overflow_work);
>>  	res = state == pci_channel_io_perm_failure ?
>>  		PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_NEED_RESET;
>>

Hi Ganesh,

Thanks for pointing out this issue and its solution!
I would rather fix it in the work itself.
Please test this fix instead:

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c
index 973babfaff25..2ad0bcc0f1b1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c
@@ -227,10 +227,15 @@ static void mlx5_timestamp_overflow(struct work_struct *work)
 	clock = container_of(timer, struct mlx5_clock, timer);
 	mdev = container_of(clock, struct mlx5_core_dev, clock);
 
+	if (mdev->state == MLX5_DEVICE_STATE_INTERNAL_ERROR)
+		goto out;
+
 	write_seqlock_irqsave(&clock->lock, flags);
 	timecounter_read(&timer->tc);
 	mlx5_update_clock_info_page(mdev);
 	write_sequnlock_irqrestore(&clock->lock, flags);
+
+out:
 	schedule_delayed_work(&timer->overflow_work, timer->overflow_period);
 }

>> Regards
>> Ganesh
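The pattern in the patch above, where the periodic work skips the hardware access while the device is in an error state but still re-arms itself, can be sketched in plain userspace C. The types and names below (`mock_clock_dev`, `MOCK_STATE_INTERNAL_ERROR`, `mock_timestamp_overflow`) are hypothetical stand-ins, not the mlx5 definitions, and the counters merely replace the real register read and the `schedule_delayed_work()` call.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical device states, loosely mirroring the check in the patch. */
enum mock_state {
	MOCK_STATE_UP,
	MOCK_STATE_INTERNAL_ERROR
};

/* Hypothetical device model; the counters stand in for real side effects. */
struct mock_clock_dev {
	enum mock_state state;
	unsigned int hw_reads;    /* simulated register reads */
	unsigned int reschedules; /* times the work re-armed itself */
};

/* Periodic work: skip the hardware when in error, but always re-arm. */
static void mock_timestamp_overflow(struct mock_clock_dev *d)
{
	assert(d != NULL);

	if (d->state == MOCK_STATE_INTERNAL_ERROR)
		goto out;

	d->hw_reads++; /* stands in for the timecounter read */

out:
	d->reschedules++; /* stands in for re-queuing the delayed work */
}
```

Because the work keeps rescheduling itself in both branches, it resumes touching the hardware on its own once the state leaves the error value, and no separate code path has to queue it again after recovery.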
[parent not found: <b7ad516f-da16-2ba0-98e3-4f16f47e0fc8@linux.ibm.com>]
* Re: EEH recovery failing on mlx5 card
       [not found]     ` <b7ad516f-da16-2ba0-98e3-4f16f47e0fc8@linux.ibm.com>
@ 2023-07-19  7:46       ` Moshe Shemesh
  0 siblings, 0 replies; 3+ messages in thread
From: Moshe Shemesh @ 2023-07-19 7:46 UTC
To: Ganesh G R, Leon Romanovsky; +Cc: saeedm, netdev, oohall, Mahesh Salgaonkar

On 7/18/2023 9:47 PM, Ganesh G R wrote:
>
> On 7/17/23 9:48 PM, Moshe Shemesh wrote:
>
>> On 7/17/2023 4:10 PM, Leon Romanovsky wrote:
>>> + Moshe
>>> On Mon, Jul 17, 2023 at 12:48:37PM +0530, Ganesh G R wrote:
>>>> Hi,
>>>> mlx5 cards are failing to recover from PCI errors. Upon investigation, we found that the
>>>> driver is trying to do MMIO in the middle of EEH error handling.
>>>> The following change in mlx5_pci_err_detected() fixes the issue. Do you think it's the right fix?
>>>> @@ -1847,6 +1847,7 @@ static pci_ers_result_t mlx5_pci_err_detected(struct pci_dev *pdev,
>>>>  	mlx5_unload_one(dev, true);
>>>>  	mlx5_drain_health_wq(dev);
>>>>  	mlx5_pci_disable_device(dev);
>>>> +	cancel_delayed_work_sync(&clock->timer.overflow_work);
>>>>  	res = state == pci_channel_io_perm_failure ?
>>>>  		PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_NEED_RESET;
>> Hi Ganesh,
>> Thanks for pointing out this issue and its solution!
>> I would rather fix it in the work itself.
>> Please test this fix instead:
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c
>> b/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c
>> index 973babfaff25..2ad0bcc0f1b1 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c
>> @@ -227,10 +227,15 @@ static void mlx5_timestamp_overflow(struct
>> work_struct *work)
>>  	clock = container_of(timer, struct mlx5_clock, timer);
>>  	mdev = container_of(clock, struct mlx5_core_dev, clock);
>> +	if (mdev->state == MLX5_DEVICE_STATE_INTERNAL_ERROR)
>> +		goto out;
>> +
>>  	write_seqlock_irqsave(&clock->lock, flags);
>>  	timecounter_read(&timer->tc);
>>  	mlx5_update_clock_info_page(mdev);
>>  	write_sequnlock_irqrestore(&clock->lock, flags);
>> +
>> +out:
>>  	schedule_delayed_work(&timer->overflow_work,
>> timer->overflow_period);
>>  }
> Thanks Moshe, the fix looks clean and it worked.
>

Great, thank you!