netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Kernel crash after FLR reset of a ConnectX-5 PF in switchdev mode
@ 2023-04-11 15:11 Niklas Schnelle
  2023-04-13 11:02 ` Leon Romanovsky
  0 siblings, 1 reply; 6+ messages in thread
From: Niklas Schnelle @ 2023-04-11 15:11 UTC (permalink / raw)
  To: Saeed Mahameed, Leon Romanovsky
  Cc: Gerd Bayer, alexander.sschmidt, Alexandra Winter, netdev

[-- Attachment #1: Type: text/plain, Size: 3638 bytes --]

Hi Saeed, Hi Leon,

While testing PCI recovery with a ConnectX-5 card (MT28800, fw
16.35.1012) and vanilla 6.3-rc4/5/6 on s390 I've run into a kernel
crash (stacktrace attached) when the card is in switchdev mode. No
crash occurs and the recovery succeeds in legacy mode (with VFs). I
found that the same crash occurs also with a simple Function Level
Reset instead of the s390 specific PCI recovery, see instructions
below. From the looks of it I think this could affect non-s390 too but
I don't have a proper x86 test system with a ConnectX card to test
with.

Anyway, I tried to analyze further but got stuck after figuring out
that in mlx5e_remove() deep down from mlx5_fw_fatal_reporter_err_work()
(see trace) the mlx5e_dev->priv pointer is valid but the pointed to
struct only contains zeros as it was previously zeroed by
mlx5_mdev_uninit() which then leads to a NULL pointer access.

The crash itself can be prevented by the following debug patch though
clearly this is not a proper fix:

--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -6012,6 +6012,10 @@ static void mlx5e_remove(struct auxiliary_device
*adev)
        struct mlx5e_priv *priv = mlx5e_dev->priv;
        pm_message_t state = {};

+       if (!priv->mdev) {
+               pr_err("%s with zeroed mlx5e_dev->priv\n", __func__);
+               return;
+       }
        mlx5_core_uplink_netdev_set(priv->mdev, NULL);
        mlx5e_dcbnl_delete_app(priv);
        unregister_netdev(priv->netdev);

With that I then tried to track down why mlx5_mdev_uninit() is called
and this might actually be s390 specific in that this happens during
the removal of the VF which on s390 causes extra hot unplug events for
the VFs (our virtualized PCI hotplug is per-PCI function) resulting in
the following call trace:

...
zpci_bus_remove_device()
   zpci_iov_remove_virtfn()
      pci_iov_remove_virtfn()
         pci_stop_and_remove_bus_device()
            pci_stop_bus_device()
               device_release_driver_internal()
                  pci_device_remove()
                     remove_one()
                        mlx5_mdev_uninit()

Then again I would expect that on other architectures VFs become at
leastunresponsive during a FLR of the PF not sure if that also lead to
calls to remove_one() though.

As another data point I tried the same on the default Ubuntu 22.04
generic 5.15 kernel and there no crash occurs so this might be a newer
issue.

Also, I did test with and without the patch I sent recently for
skipping the wait in mlx5_health_wait_pci_up() but that made no
difference.

Any hints on how to debug this further and could you try to see if this
occurs on other architectures as well?

Thanks,
Niklas

Reproduced with (0004:00:00.0 being the first PF of a ConnectX-5):

$ devlink dev eswitch set pci/0004:00:00.0 mode switchdev
$ devlink dev eswitch set pci/0004:00:00.1 mode switchdev

The next 2 lines are needed on s390 due to an unrelated issue with smsf mode.

$ devlink dev param set pci/0004:00:00.0 name flow_steering_mode value dmfs cmode runtime
$ devlink dev param set pci/0004:00:00.1 name flow_steering_mode value dmfs cmode runtime
$ echo 1 > /sys/bus/pci/devices/0004:00:00.0/sriov_numvfs
# Check the reset method is FLR though others might also cause this
$ cat /sys/bus/pci/devices/0004:00:00.0/reset_method
flr

Then to trigger the crash (after a bit of recovery, it may be racy though it hits pretty
consistently for me)

$ echo 1 > /sys/bus/pci/devices/0004:00:00.0/reset

[-- Attachment #2: mlx5_switchdev_reset_crash_backtrace.txt --]
[-- Type: text/plain, Size: 11932 bytes --]

[ 1346.987414] mlx5_core 0003:00:00.0 ens8704f0np0: Link up
[ 1347.006845] IPv6: ADDRCONF(NETDEV_CHANGE): ens8704f0np0: link becomes ready
[ 1356.518669] mlx5_core 0004:00:00.0: E-Switch: Disable: mode(LEGACY), nvfs(0), active vports(0)
[ 1359.192049] mlx5_core 0004:00:00.0 ens8832f0np0: Dropping C-tag vlan stripping offload due to S-tag vlan
[ 1359.192152] mlx5_core 0004:00:00.0 ens8832f0np0: Disabling HW_VLAN CTAG FILTERING, not supported in switchdev mode
[ 1359.365370] mlx5_core 0004:00:00.0: E-Switch: Enable: mode(OFFLOADS), nvfs(0), active vports(1)
[ 1359.365952] devlink (1586) used greatest stack depth: 7552 bytes left
[ 1359.448638] mlx5_core 0004:00:00.1: E-Switch: Disable: mode(LEGACY), nvfs(0), active vports(0)
[ 1361.256671] debugfs: Directory 'mlx5_dr_chunks' with parent 'slab' already present!
[ 1361.257295] debugfs: Directory 'mlx5_dr_htbls' with parent 'slab' already present!
[ 1362.063433] mlx5_core 0004:00:00.1 ens8896f1np1: Dropping C-tag vlan stripping offload due to S-tag vlan
[ 1362.063560] mlx5_core 0004:00:00.1 ens8896f1np1: Disabling HW_VLAN CTAG FILTERING, not supported in switchdev mode
[ 1362.240899] mlx5_core 0004:00:00.1: E-Switch: Enable: mode(OFFLOADS), nvfs(0), active vports(1)
[ 1362.241527] devlink (1595) used greatest stack depth: 7520 bytes left
[ 1362.327301] mlx5_core 0004:00:00.0 ens8832f0np0: Link up
[ 1362.345353] IPv6: ADDRCONF(NETDEV_CHANGE): ens8832f0np0: link becomes ready
[ 1362.346169] ip (1603) used greatest stack depth: 7368 bytes left
[ 1362.437044] pci 0004:00:00.2: [15b3:101a] type 00 class 0x020000
[ 1362.437277] pci 0004:00:00.2: reg 0x10: [mem 0xffffc00002000000-0xffffc000021fffff 64bit pref]
[ 1362.437423] pci 0004:00:00.2: enabling Extended Tags
[ 1362.438619] pci 0004:00:00.2: Adding to iommu group 30
[ 1362.440640] mlx5_core 0004:00:00.2: enabling device (0000 -> 0002)
[ 1362.440810] mlx5_core 0004:00:00.2: firmware version: 16.35.1012
[ 1362.675795] mlx5_core 0004:00:00.2: Rate limit: 127 rates are supported, range: 0Mbps to 97656Mbps
[ 1362.676711] debugfs: Directory 'mlx5_fs_fgs' with parent 'slab' already present!
[ 1362.677278] debugfs: Directory 'mlx5_fs_ftes' with parent 'slab' already present!
[ 1362.696847] mlx5_core 0004:00:00.2: Assigned random MAC address 02:0e:8a:c7:93:9a
[ 1362.830432] mlx5_core 0004:00:00.2: MLX5E: StrdRq(1) RqSz(16) StrdSz(4096) RxCqeCmprss(0 basic)
[ 1362.835203] mlx5_core 0004:00:00.2 ens8832f0v0: renamed from eth1
[ 1372.768645] mlx5_core 0004:00:00.2: poll_health:818:(pid 0): Fatal error 1 detected
[ 1372.768791] mlx5_core 0004:00:00.2: print_health_info:423:(pid 0): PCI slot is unavailable
[ 1373.168544] mlx5_core 0004:00:00.0: poll_health:818:(pid 0): Fatal error 3 detected
[ 1375.434387] mlx5_core 0004:00:00.0: mlx5_health_try_recover:335:(pid 1505): handling bad device here
[ 1375.434502] mlx5_core 0004:00:00.0: mlx5_handle_bad_state:292:(pid 1505): starting teardown
[ 1375.434545] mlx5_core 0004:00:00.0: mlx5_error_sw_reset:243:(pid 1505): start
[ 1375.434586] mlx5_core 0004:00:00.0: mlx5_error_sw_reset:276:(pid 1505): end
[ 1375.771395] mlx5_core 0004:00:00.0 eth0 (unregistering): vport 1 error -67 reading stats
[ 1376.151345] mlx5_core 0004:00:00.0: mlx5e_init_nic_tx:5376:(pid 1505): create tises failed, -67
[ 1376.238808] mlx5_core 0004:00:00.0 ens8832f0np0: mlx5e_netdev_change_profile: new profile init failed, -67
[ 1376.243746] mlx5_core 0004:00:00.0: mlx5e_init_rep_tx:1101:(pid 1505): create tises failed, -67
[ 1376.328623] mlx5_core 0004:00:00.0 ens8832f0np0: mlx5e_netdev_change_profile: failed to rollback to orig profile, -67
[ 1376.329159] ------------[ cut here ]------------
[ 1376.329161] DEBUG_LOCKS_WARN_ON(lock->magic != lock)
[ 1376.329171] WARNING: CPU: 6 PID: 1505 at kernel/locking/mutex.c:582 __mutex_lock+0xbc6/0xfa8
[ 1376.329182] Modules linked in: kvm nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink uvdevice s390_trng sunrpc mlx5_ib ib_uverbs ib_core ism eadm_sch vfio_ccw mdev v
fio_iommu_type1 vfio sch_fq_codel loop configfs ghash_s390 prng chacha_s390 libchacha aes_s390 des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common nvme mlx5_core nvme_core pkey zcrypt rng_core autofs4
[ 1376.329230] CPU: 6 PID: 1505 Comm: kworker/u800:3 Not tainted 6.3.0-rc4 #49
[ 1376.329233] Hardware name: IBM 3931 A01 704 (LPAR)
[ 1376.329235] Workqueue: mlx5_health0004:00:00.0 mlx5_fw_fatal_reporter_err_work [mlx5_core]
[ 1376.329345] Krnl PSW : 0704d00180000000 00000000ee827992 (__mutex_lock+0xbca/0xfa8)
[ 1376.329351]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
[ 1376.329354] Krnl GPRS: c0000000fffeffff 0000000080000001 0000000000000028 00000000eec90198
[ 1376.329357]            0000038004af36b8 0000038004af36b0 000003ff7fbcdfd8 0000000000000080
[ 1376.329359]            00000000f03839a8 0000000000000000 0000000000000000 0000000000059fe8
[ 1376.329361]            000000009fa58100 0000000000000000 00000000ee82798e 0000038004af3848
[ 1376.329368] Krnl Code: 00000000ee827982: c020001ef4a1        larl    %r2,00000000eec062c4
[ 1376.329368]            00000000ee827988: c0e5ff90c318        brasl   %r14,00000000eda3ffb8
[ 1376.329368]           #00000000ee82798e: af000000            mc      0,0
[ 1376.329368]           >00000000ee827992: a7f4fa4d            brc     15,00000000ee826e2c
[ 1376.329368]            00000000ee827996: af000000            mc      0,0
[ 1376.329368]            00000000ee82799a: a7f4faf4            brc     15,00000000ee826f82
[ 1376.329368]            00000000ee82799e: e31003400004        lg      %r1,832
[ 1376.329368]            00000000ee8279a4: e32010000004        lg      %r2,0(%r1)
[ 1376.329405] Call Trace:
[ 1376.329406]  [<00000000ee827992>] __mutex_lock+0xbca/0xfa8 
[ 1376.329409] ([<00000000ee82798e>] __mutex_lock+0xbc6/0xfa8)
[ 1376.329412]  [<00000000ee827da2>] mutex_lock_nested+0x32/0x40 
[ 1376.329416]  [<000003ff7fbcdfd8>] mlx5_core_uplink_netdev_set+0x38/0x60 [mlx5_core] 
[ 1376.329469]  [<000003ff7fc1100c>] mlx5e_remove+0x3c/0xa0 [mlx5_core] 
[ 1376.329521]  [<00000000ee42fa7c>] device_release_driver_internal+0x1c4/0x270 
[ 1376.329526]  [<00000000ee42d788>] bus_remove_device+0x100/0x188 
[ 1376.329528]  [<00000000ee42724e>] device_del+0x186/0x3b8 
[ 1376.329533]  [<000003ff7fbf6134>] mlx5_detach_device+0xac/0x110 [mlx5_core] 
[ 1376.329583]  [<000003ff7fbcf812>] mlx5_unload_one_devl_locked+0x52/0xc8 [mlx5_core] 
[ 1376.329635]  [<000003ff7fc777e8>] mlx5_health_try_recover+0x168/0x240 [mlx5_core] 
[ 1376.329687]  [<00000000ee72b9d2>] devlink_health_reporter_recover+0x3a/0x98 
[ 1376.329690]  [<00000000ee72bb3c>] devlink_health_report+0x10c/0x300 
[ 1376.329693]  [<000003ff7fbdd0ac>] mlx5_fw_fatal_reporter_err_work+0xc4/0x1d8 [mlx5_core] 
[ 1376.329744]  [<00000000eda6b934>] process_one_work+0x30c/0x688 
[ 1376.329749]  [<00000000eda6bd12>] worker_thread+0x62/0x438 
[ 1376.329752]  [<00000000eda77400>] kthread+0x138/0x150 
[ 1376.329756]  [<00000000ed9eb894>] __ret_from_fork+0x3c/0x58 
[ 1376.329759]  [<00000000ee8309da>] ret_from_fork+0xa/0x40 
[ 1376.329762] INFO: lockdep is turned off.
[ 1376.329764] Last Breaking-Event-Address:
[ 1376.329765]  [<00000000eda400e4>] __warn_printk+0x12c/0x138
[ 1376.329770] irq event stamp: 987413
[ 1376.329772] hardirqs last  enabled at (987413): [<00000000ee82fab6>] _raw_spin_unlock_irqrestore+0x76/0xc0
[ 1376.329774] hardirqs last disabled at (987412): [<00000000ee82f6b6>] _raw_spin_lock_irqsave+0x9e/0xd8
[ 1376.329779] softirqs last  enabled at (986754): [<00000000ee4ec40e>] netif_set_real_num_tx_queues+0x116/0x270
[ 1376.329784] softirqs last disabled at (986752): [<00000000ee4ec3fa>] netif_set_real_num_tx_queues+0x102/0x270
[ 1376.329787] ---[ end trace 0000000000000000 ]---
[ 1376.329792] Unable to handle kernel pointer dereference in virtual kernel address space
[ 1376.329794] Failing address: a7190001eb112000 TEID: a7190001eb112803
[ 1376.329796] Fault in home space mode while using kernel ASCE.
[ 1376.329801] AS:00000000ef868007 R3:0000000000000024 
[ 1376.329831] Oops: 0038 ilc:2 [#1] PREEMPT SMP 
[ 1376.329834] Modules linked in: kvm nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink uvdevice s390_trng sunrpc mlx5_ib ib_uverbs ib_core ism eadm_sch vfio_ccw mdev v
fio_iommu_type1 vfio sch_fq_codel loop configfs ghash_s390 prng chacha_s390 libchacha aes_s390 des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common nvme mlx5_core nvme_core pkey zcrypt rng_core autofs4
[ 1376.329876] CPU: 6 PID: 1505 Comm: kworker/u800:3 Tainted: G        W          6.3.0-rc4 #49
[ 1376.329878] Hardware name: IBM 3931 A01 704 (LPAR)
[ 1376.329880] Workqueue: mlx5_health0004:00:00.0 mlx5_fw_fatal_reporter_err_work [mlx5_core]
[ 1376.329933] Krnl PSW : 0704d00180000000 00000000ee8277c4 (__mutex_lock+0x9fc/0xfa8)
[ 1376.329938]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
[ 1376.329941] Krnl GPRS: c0000000fffeffff 000000009fa58100 a7190001eb112000 a7190001eb112000
[ 1376.329944]            0000000000000000 0000000000000000 00000000eee2cd60 a7190001eb112000
[ 1376.329946]            00000000f03839a8 00000000eee2ca48 000000000005a058 0000000000059fe8
[ 1376.329949]            000000009fa58100 000000009fa58100 00000000ee827396 0000038004af3848
[ 1376.329953] Krnl Code: 00000000ee8277b6: c0e5ffffe97d        brasl   %r14,00000000ee824ab0
[ 1376.329953]            00000000ee8277bc: a7f4fc36            brc     15,00000000ee827028
[ 1376.329953]           #00000000ee8277c0: a527fff8            nill    %r2,65528
[ 1376.329953]           >00000000ee8277c4: 58302034            l       %r3,52(%r2)
[ 1376.329953]            00000000ee8277c8: ec38fc66007e        cij     %r3,0,8,00000000ee827094
[ 1376.329953]            00000000ee8277ce: 58202010            l       %r2,16(%r2)
[ 1376.329953]            00000000ee8277d2: b9140022            lgfr    %r2,%r2
[ 1376.329953]            00000000ee8277d6: c0e5ff8ecdfd        brasl   %r14,00000000eda013d0
[ 1376.329969] Call Trace:
[ 1376.329971]  [<00000000ee8277c4>] __mutex_lock+0x9fc/0xfa8 
[ 1376.329974] ([<00000000ee827396>] __mutex_lock+0x5ce/0xfa8)
[ 1376.329977]  [<00000000ee827da2>] mutex_lock_nested+0x32/0x40 
[ 1376.329980]  [<000003ff7fbcdfd8>] mlx5_core_uplink_netdev_set+0x38/0x60 [mlx5_core] 
[ 1376.330032]  [<000003ff7fc1100c>] mlx5e_remove+0x3c/0xa0 [mlx5_core] 
[ 1376.330085]  [<00000000ee42fa7c>] device_release_driver_internal+0x1c4/0x270 
[ 1376.330088]  [<00000000ee42d788>] bus_remove_device+0x100/0x188 
[ 1376.330090]  [<00000000ee42724e>] device_del+0x186/0x3b8 
[ 1376.330093]  [<000003ff7fbf6134>] mlx5_detach_device+0xac/0x110 [mlx5_core] 
[ 1376.330143]  [<000003ff7fbcf812>] mlx5_unload_one_devl_locked+0x52/0xc8 [mlx5_core] 
[ 1376.330195]  [<000003ff7fc777e8>] mlx5_health_try_recover+0x168/0x240 [mlx5_core] 
[ 1376.330246]  [<00000000ee72b9d2>] devlink_health_reporter_recover+0x3a/0x98 
[ 1376.330249]  [<00000000ee72bb3c>] devlink_health_report+0x10c/0x300 
[ 1376.330251]  [<000003ff7fbdd0ac>] mlx5_fw_fatal_reporter_err_work+0xc4/0x1d8 [mlx5_core] 
[ 1376.330303]  [<00000000eda6b934>] process_one_work+0x30c/0x688 
[ 1376.330305]  [<00000000eda6bd12>] worker_thread+0x62/0x438 
[ 1376.330308]  [<00000000eda77400>] kthread+0x138/0x150 
[ 1376.330311]  [<00000000ed9eb894>] __ret_from_fork+0x3c/0x58 
[ 1376.330314]  [<00000000ee8309da>] ret_from_fork+0xa/0x40 
[ 1376.330316] INFO: lockdep is turned off.
[ 1376.330317] Last Breaking-Event-Address:
[ 1376.330319]  [<00000000ee826f9c>] __mutex_lock+0x1d4/0xfa8
[ 1376.330323] Kernel panic - not syncing: Fatal exception: panic_on_oops

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2023-04-19 11:47 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-04-11 15:11 Kernel crash after FLR reset of a ConnectX-5 PF in switchdev mode Niklas Schnelle
2023-04-13 11:02 ` Leon Romanovsky
2023-04-13 22:02   ` Saeed Mahameed
2023-04-14  7:12     ` Niklas Schnelle
2023-04-14 22:27       ` Saeed Mahameed
2023-04-19 11:47         ` Niklas Schnelle

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).