public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org,
	David Christensen <drc@linux.vnet.ibm.com>,
	Manish Chopra <manishc@marvell.com>,
	Ariel Elior <aelior@marvell.com>,
	Jakub Kicinski <kuba@kernel.org>, Sasha Levin <sashal@kernel.org>
Subject: [PATCH 4.14 40/78] bnx2x: fix napi API usage sequence
Date: Tue, 10 May 2022 15:07:26 +0200	[thread overview]
Message-ID: <20220510130733.721181840@linuxfoundation.org> (raw)
In-Reply-To: <20220510130732.522479698@linuxfoundation.org>

From: Manish Chopra <manishc@marvell.com>

[ Upstream commit af68656d66eda219b7f55ce8313a1da0312c79e1 ]

While handling PCI errors (AER flow) driver tries to
disable NAPI [napi_disable()] after NAPI is deleted
[__netif_napi_del()] which causes unexpected system
hang/crash.

System message log shows the following:
=======================================
[ 3222.537510] EEH: Detected PCI bus error on PHB#384-PE#800000 [ 3222.537511] EEH: This PCI device has failed 2 times in the last hour and will be permanently disabled after 5 failures.
[ 3222.537512] EEH: Notify device drivers to shutdown [ 3222.537513] EEH: Beginning: 'error_detected(IO frozen)'
[ 3222.537514] EEH: PE#800000 (PCI 0384:80:00.0): Invoking
bnx2x->error_detected(IO frozen)
[ 3222.537516] bnx2x: [bnx2x_io_error_detected:14236(eth14)]IO error detected [ 3222.537650] EEH: PE#800000 (PCI 0384:80:00.0): bnx2x driver reports:
'need reset'
[ 3222.537651] EEH: PE#800000 (PCI 0384:80:00.1): Invoking
bnx2x->error_detected(IO frozen)
[ 3222.537651] bnx2x: [bnx2x_io_error_detected:14236(eth13)]IO error detected [ 3222.537729] EEH: PE#800000 (PCI 0384:80:00.1): bnx2x driver reports:
'need reset'
[ 3222.537729] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'need reset'
[ 3222.537890] EEH: Collect temporary log [ 3222.583481] EEH: of node=0384:80:00.0 [ 3222.583519] EEH: PCI device/vendor: 168e14e4 [ 3222.583557] EEH: PCI cmd/status register: 00100140 [ 3222.583557] EEH: PCI-E capabilities and status follow:
[ 3222.583744] EEH: PCI-E 00: 00020010 012c8da2 00095d5e 00455c82 [ 3222.583892] EEH: PCI-E 10: 10820000 00000000 00000000 00000000 [ 3222.583893] EEH: PCI-E 20: 00000000 [ 3222.583893] EEH: PCI-E AER capability register set follows:
[ 3222.584079] EEH: PCI-E AER 00: 13c10001 00000000 00000000 00062030 [ 3222.584230] EEH: PCI-E AER 10: 00002000 000031c0 000001e0 00000000 [ 3222.584378] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000 [ 3222.584416] EEH: PCI-E AER 30: 00000000 00000000 [ 3222.584416] EEH: of node=0384:80:00.1 [ 3222.584454] EEH: PCI device/vendor: 168e14e4 [ 3222.584491] EEH: PCI cmd/status register: 00100140 [ 3222.584492] EEH: PCI-E capabilities and status follow:
[ 3222.584677] EEH: PCI-E 00: 00020010 012c8da2 00095d5e 00455c82 [ 3222.584825] EEH: PCI-E 10: 10820000 00000000 00000000 00000000 [ 3222.584826] EEH: PCI-E 20: 00000000 [ 3222.584826] EEH: PCI-E AER capability register set follows:
[ 3222.585011] EEH: PCI-E AER 00: 13c10001 00000000 00000000 00062030 [ 3222.585160] EEH: PCI-E AER 10: 00002000 000031c0 000001e0 00000000 [ 3222.585309] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000 [ 3222.585347] EEH: PCI-E AER 30: 00000000 00000000 [ 3222.586872] RTAS: event: 5, Type: Platform Error (224), Severity: 2 [ 3222.586873] EEH: Reset without hotplug activity [ 3224.762767] EEH: Beginning: 'slot_reset'
[ 3224.762770] EEH: PE#800000 (PCI 0384:80:00.0): Invoking
bnx2x->slot_reset()
[ 3224.762771] bnx2x: [bnx2x_io_slot_reset:14271(eth14)]IO slot reset initializing...
[ 3224.762887] bnx2x 0384:80:00.0: enabling device (0140 -> 0142) [ 3224.768157] bnx2x: [bnx2x_io_slot_reset:14287(eth14)]IO slot reset
--> driver unload

Uninterruptible tasks
=====================
crash> ps | grep UN
     213      2  11  c000000004c89e00  UN   0.0       0      0  [eehd]
     215      2   0  c000000004c80000  UN   0.0       0      0
[kworker/0:2]
    2196      1  28  c000000004504f00  UN   0.1   15936  11136  wickedd
    4287      1   9  c00000020d076800  UN   0.0    4032   3008  agetty
    4289      1  20  c00000020d056680  UN   0.0    7232   3840  agetty
   32423      2  26  c00000020038c580  UN   0.0       0      0
[kworker/26:3]
   32871   4241  27  c0000002609ddd00  UN   0.1   18624  11648  sshd
   32920  10130  16  c00000027284a100  UN   0.1   48512  12608  sendmail
   33092  32987   0  c000000205218b00  UN   0.1   48512  12608  sendmail
   33154   4567  16  c000000260e51780  UN   0.1   48832  12864  pickup
   33209   4241  36  c000000270cb6500  UN   0.1   18624  11712  sshd
   33473  33283   0  c000000205211480  UN   0.1   48512  12672  sendmail
   33531   4241  37  c00000023c902780  UN   0.1   18624  11648  sshd

EEH handler hung while bnx2x sleeping and holding RTNL lock
===========================================================
crash> bt 213
PID: 213    TASK: c000000004c89e00  CPU: 11  COMMAND: "eehd"
  #0 [c000000004d477e0] __schedule at c000000000c70808
  #1 [c000000004d478b0] schedule at c000000000c70ee0
  #2 [c000000004d478e0] schedule_timeout at c000000000c76dec
  #3 [c000000004d479c0] msleep at c0000000002120cc
  #4 [c000000004d479f0] napi_disable at c000000000a06448
                                        ^^^^^^^^^^^^^^^^
  #5 [c000000004d47a30] bnx2x_netif_stop at c0080000018dba94 [bnx2x]
  #6 [c000000004d47a60] bnx2x_io_slot_reset at c0080000018a551c [bnx2x]
  #7 [c000000004d47b20] eeh_report_reset at c00000000004c9bc
  #8 [c000000004d47b90] eeh_pe_report at c00000000004d1a8
  #9 [c000000004d47c40] eeh_handle_normal_event at c00000000004da64

And the sleeping source code
============================
crash> dis -ls c000000000a06448
FILE: ../net/core/dev.c
LINE: 6702

   6697  {
   6698          might_sleep();
   6699          set_bit(NAPI_STATE_DISABLE, &n->state);
   6700
   6701          while (test_and_set_bit(NAPI_STATE_SCHED, &n->state))
* 6702                  msleep(1);
   6703          while (test_and_set_bit(NAPI_STATE_NPSVC, &n->state))
   6704                  msleep(1);
   6705
   6706          hrtimer_cancel(&n->timer);
   6707
   6708          clear_bit(NAPI_STATE_DISABLE, &n->state);
   6709  }

EEH calls into bnx2x twice based on the system log above, first through
bnx2x_io_error_detected() and then bnx2x_io_slot_reset(), and executes
the following call chains:

bnx2x_io_error_detected()
  +-> bnx2x_eeh_nic_unload()
       +-> bnx2x_del_all_napi()
            +-> __netif_napi_del()

bnx2x_io_slot_reset()
  +-> bnx2x_netif_stop()
       +-> bnx2x_napi_disable()
            +->napi_disable()

Fix this by correcting the sequence of NAPI APIs usage,
that is delete the NAPI after disabling it.

Fixes: 7fa6f34081f1 ("bnx2x: AER revised")
Reported-by: David Christensen <drc@linux.vnet.ibm.com>
Tested-by: David Christensen <drc@linux.vnet.ibm.com>
Signed-off-by: Manish Chopra <manishc@marvell.com>
Signed-off-by: Ariel Elior <aelior@marvell.com>
Link: https://lore.kernel.org/r/20220426153913.6966-1-manishc@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index b0ada7eac652..7925c40c0062 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -14317,10 +14317,6 @@ static int bnx2x_eeh_nic_unload(struct bnx2x *bp)
 
 	/* Stop Tx */
 	bnx2x_tx_disable(bp);
-	/* Delete all NAPI objects */
-	bnx2x_del_all_napi(bp);
-	if (CNIC_LOADED(bp))
-		bnx2x_del_all_napi_cnic(bp);
 	netdev_reset_tc(bp->dev);
 
 	del_timer_sync(&bp->timer);
@@ -14425,6 +14421,11 @@ static pci_ers_result_t bnx2x_io_slot_reset(struct pci_dev *pdev)
 		bnx2x_drain_tx_queues(bp);
 		bnx2x_send_unload_req(bp, UNLOAD_RECOVERY);
 		bnx2x_netif_stop(bp, 1);
+		bnx2x_del_all_napi(bp);
+
+		if (CNIC_LOADED(bp))
+			bnx2x_del_all_napi_cnic(bp);
+
 		bnx2x_free_irq(bp);
 
 		/* Report UNLOAD_DONE to MCP */
-- 
2.35.1




  parent reply	other threads:[~2022-05-10 13:29 UTC|newest]

Thread overview: 81+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-05-10 13:06 [PATCH 4.14 00/78] 4.14.278-rc1 review Greg Kroah-Hartman
2022-05-10 13:06 ` [PATCH 4.14 01/78] floppy: disable FDRAWCMD by default Greg Kroah-Hartman
2022-05-10 13:06 ` [PATCH 4.14 02/78] hamradio: defer 6pack kfree after unregister_netdev Greg Kroah-Hartman
2022-05-10 13:06 ` [PATCH 4.14 03/78] hamradio: remove needs_free_netdev to avoid UAF Greg Kroah-Hartman
2022-05-10 13:06 ` [PATCH 4.14 04/78] net/sched: cls_u32: fix netns refcount changes in u32_change() Greg Kroah-Hartman
2022-05-10 13:06 ` [PATCH 4.14 05/78] Revert "net: ethernet: stmmac: fix altr_tse_pcs function when using a fixed-link" Greg Kroah-Hartman
2022-05-10 13:06 ` [PATCH 4.14 06/78] lightnvm: disable the subsystem Greg Kroah-Hartman
2022-05-10 13:06 ` [PATCH 4.14 07/78] usb: mtu3: fix USB 3.0 dual-role-switch from device to host Greg Kroah-Hartman
2022-05-10 13:06 ` [PATCH 4.14 08/78] USB: quirks: add a Realtek card reader Greg Kroah-Hartman
2022-05-10 13:06 ` [PATCH 4.14 09/78] USB: quirks: add STRING quirk for VCOM device Greg Kroah-Hartman
2022-05-10 13:06 ` [PATCH 4.14 10/78] USB: serial: whiteheat: fix heap overflow in WHITEHEAT_GET_DTR_RTS Greg Kroah-Hartman
2022-05-10 13:06 ` [PATCH 4.14 11/78] USB: serial: cp210x: add PIDs for Kamstrup USB Meter Reader Greg Kroah-Hartman
2022-05-10 13:06 ` [PATCH 4.14 12/78] USB: serial: option: add support for Cinterion MV32-WA/MV32-WB Greg Kroah-Hartman
2022-05-10 13:06 ` [PATCH 4.14 13/78] USB: serial: option: add Telit 0x1057, 0x1058, 0x1075 compositions Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 14/78] xhci: stop polling roothubs after shutdown Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 15/78] iio: dac: ad5592r: Fix the missing return value Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 16/78] iio: dac: ad5446: Fix read_raw not returning set value Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 17/78] iio: magnetometer: ak8975: Fix the error handling in ak8975_power_on() Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 18/78] usb: misc: fix improper handling of refcount in uss720_probe() Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 19/78] usb: gadget: uvc: Fix crash when encoding data for usb request Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 20/78] usb: gadget: configfs: clear deactivation flag in configfs_composite_unbind() Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 21/78] serial: 8250: Also set sticky MCR bits in console restoration Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 22/78] serial: 8250: Correct the clock for EndRun PTP/1588 PCIe device Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 23/78] hex2bin: make the function hex_to_bin constant-time Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 24/78] hex2bin: fix access beyond string end Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 25/78] USB: Fix xhci event ring dequeue pointer ERDP update issue Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 26/78] ARM: dts: imx6qdl-apalis: Fix sgtl5000 detection issue Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 27/78] phy: samsung: Fix missing of_node_put() in exynos_sata_phy_probe Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 28/78] phy: samsung: exynos5250-sata: fix missing device put in probe error paths Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 29/78] ARM: OMAP2+: Fix refcount leak in omap_gic_of_init Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 30/78] ARM: dts: Fix mmc order for omap3-gta04 Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 31/78] ipvs: correctly print the memory size of ip_vs_conn_tab Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 32/78] mtd: rawnand: Fix return value check of wait_for_completion_timeout Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 33/78] sctp: check asoc strreset_chunk in sctp_generate_reconf_event Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 34/78] pinctrl: pistachio: fix use of irq_of_parse_and_map() Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 35/78] ip_gre: Make o_seqno start from 0 in native mode Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 36/78] tcp: fix potential xmit stalls caused by TCP_NOTSENT_LOWAT Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 37/78] bus: sunxi-rsb: Fix the return value of sunxi_rsb_device_create() Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 38/78] clk: sunxi: sun9i-mmc: check return value after calling platform_get_resource() Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 39/78] net: bcmgenet: hide status block before TX timestamping Greg Kroah-Hartman
2022-05-10 13:07 ` Greg Kroah-Hartman [this message]
2022-05-10 13:07 ` [PATCH 4.14 41/78] ASoC: wm8731: Disable the regulator when probing fails Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 42/78] x86: __memcpy_flushcache: fix wrong alignment if size > 2^32 Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 43/78] cifs: destage any unwritten data to the server before calling copychunk_write Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 44/78] drivers: net: hippi: Fix deadlock in rr_close() Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 45/78] x86/cpu: Load microcode during restore_processor_state() Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 46/78] tty: n_gsm: fix wrong signal octet encoding in convergence layer type 2 Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 47/78] tty: n_gsm: fix malformed counter for out of frame data Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 48/78] tty: n_gsm: fix insufficient txframe size Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 49/78] tty: n_gsm: fix missing explicit ldisc flush Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 50/78] tty: n_gsm: fix wrong command retry handling Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 51/78] tty: n_gsm: fix wrong command frame length field encoding Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 52/78] tty: n_gsm: fix incorrect UA handling Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 53/78] drm/vgem: Close use-after-free race in vgem_gem_create Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 54/78] MIPS: Fix CP0 counter erratum detection for R4k CPUs Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 55/78] parisc: Merge model and model name into one line in /proc/cpuinfo Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 56/78] ALSA: fireworks: fix wrong return count shorter than expected by 4 bytes Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 57/78] Revert "SUNRPC: attempt AF_LOCAL connect on setup" Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 58/78] firewire: fix potential uaf in outbound_phy_packet_callback() Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 59/78] firewire: remove check of list iterator against head past the loop body Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 60/78] firewire: core: extend card->lock in fw_core_handle_bus_reset Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 61/78] ASoC: wm8958: Fix change notifications for DSP controls Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 62/78] can: grcan: grcan_close(): fix deadlock Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 63/78] can: grcan: use ofdev->dev when allocating DMA memory Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 64/78] nfc: replace improper check device_is_registered() in netlink related functions Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 65/78] nfc: nfcmrvl: main: reorder destructive operations in nfcmrvl_nci_unregister_dev to avoid bugs Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 66/78] NFC: netlink: fix sleep in atomic bug when firmware download timeout Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 67/78] hwmon: (adt7470) Fix warning on module removal Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 68/78] ASoC: dmaengine: Restore NULL prepare_slave_config() callback Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 69/78] net: emaclite: Add error handling for of_address_to_resource() Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 70/78] smsc911x: allow using IRQ0 Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 71/78] btrfs: always log symlinks in full mode Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 72/78] net: igmp: respect RCU rules in ip_mc_source() and ip_mc_msfilter() Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.14 73/78] kvm: x86/cpuid: Only provide CPUID leaf 0xA if host has architectural PMU Greg Kroah-Hartman
2022-05-10 13:08 ` [PATCH 4.14 74/78] net: ipv6: ensure we call ipv6_mc_down() at most once Greg Kroah-Hartman
2022-05-10 13:08 ` [PATCH 4.14 75/78] dm: fix mempool NULL pointer race when completing IO Greg Kroah-Hartman
2022-05-10 13:08 ` [PATCH 4.14 76/78] dm: interlock pending dm_io and dm_wait_for_bios_completion Greg Kroah-Hartman
2022-05-10 13:08 ` [PATCH 4.14 77/78] PCI: aardvark: Clear all MSIs at setup Greg Kroah-Hartman
2022-05-10 13:08 ` [PATCH 4.14 78/78] PCI: aardvark: Fix reading MSI interrupt number Greg Kroah-Hartman
2022-05-11  1:10 ` [PATCH 4.14 00/78] 4.14.278-rc1 review Guenter Roeck
2022-05-11 11:03 ` Naresh Kamboju

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220510130733.721181840@linuxfoundation.org \
    --to=gregkh@linuxfoundation.org \
    --cc=aelior@marvell.com \
    --cc=drc@linux.vnet.ibm.com \
    --cc=kuba@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=manishc@marvell.com \
    --cc=sashal@kernel.org \
    --cc=stable@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox