public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org,
	David Christensen <drc@linux.vnet.ibm.com>,
	Manish Chopra <manishc@marvell.com>,
	Ariel Elior <aelior@marvell.com>,
	Jakub Kicinski <kuba@kernel.org>, Sasha Levin <sashal@kernel.org>
Subject: [PATCH 4.9 32/66] bnx2x: fix napi API usage sequence
Date: Tue, 10 May 2022 15:07:22 +0200	[thread overview]
Message-ID: <20220510130730.708432233@linuxfoundation.org> (raw)
In-Reply-To: <20220510130729.762341544@linuxfoundation.org>

From: Manish Chopra <manishc@marvell.com>

[ Upstream commit af68656d66eda219b7f55ce8313a1da0312c79e1 ]

While handling PCI errors (AER flow) driver tries to
disable NAPI [napi_disable()] after NAPI is deleted
[__netif_napi_del()] which causes unexpected system
hang/crash.

System message log shows the following:
=======================================
[ 3222.537510] EEH: Detected PCI bus error on PHB#384-PE#800000 [ 3222.537511] EEH: This PCI device has failed 2 times in the last hour and will be permanently disabled after 5 failures.
[ 3222.537512] EEH: Notify device drivers to shutdown [ 3222.537513] EEH: Beginning: 'error_detected(IO frozen)'
[ 3222.537514] EEH: PE#800000 (PCI 0384:80:00.0): Invoking
bnx2x->error_detected(IO frozen)
[ 3222.537516] bnx2x: [bnx2x_io_error_detected:14236(eth14)]IO error detected [ 3222.537650] EEH: PE#800000 (PCI 0384:80:00.0): bnx2x driver reports:
'need reset'
[ 3222.537651] EEH: PE#800000 (PCI 0384:80:00.1): Invoking
bnx2x->error_detected(IO frozen)
[ 3222.537651] bnx2x: [bnx2x_io_error_detected:14236(eth13)]IO error detected [ 3222.537729] EEH: PE#800000 (PCI 0384:80:00.1): bnx2x driver reports:
'need reset'
[ 3222.537729] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'need reset'
[ 3222.537890] EEH: Collect temporary log [ 3222.583481] EEH: of node=0384:80:00.0 [ 3222.583519] EEH: PCI device/vendor: 168e14e4 [ 3222.583557] EEH: PCI cmd/status register: 00100140 [ 3222.583557] EEH: PCI-E capabilities and status follow:
[ 3222.583744] EEH: PCI-E 00: 00020010 012c8da2 00095d5e 00455c82 [ 3222.583892] EEH: PCI-E 10: 10820000 00000000 00000000 00000000 [ 3222.583893] EEH: PCI-E 20: 00000000 [ 3222.583893] EEH: PCI-E AER capability register set follows:
[ 3222.584079] EEH: PCI-E AER 00: 13c10001 00000000 00000000 00062030 [ 3222.584230] EEH: PCI-E AER 10: 00002000 000031c0 000001e0 00000000 [ 3222.584378] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000 [ 3222.584416] EEH: PCI-E AER 30: 00000000 00000000 [ 3222.584416] EEH: of node=0384:80:00.1 [ 3222.584454] EEH: PCI device/vendor: 168e14e4 [ 3222.584491] EEH: PCI cmd/status register: 00100140 [ 3222.584492] EEH: PCI-E capabilities and status follow:
[ 3222.584677] EEH: PCI-E 00: 00020010 012c8da2 00095d5e 00455c82 [ 3222.584825] EEH: PCI-E 10: 10820000 00000000 00000000 00000000 [ 3222.584826] EEH: PCI-E 20: 00000000 [ 3222.584826] EEH: PCI-E AER capability register set follows:
[ 3222.585011] EEH: PCI-E AER 00: 13c10001 00000000 00000000 00062030 [ 3222.585160] EEH: PCI-E AER 10: 00002000 000031c0 000001e0 00000000 [ 3222.585309] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000 [ 3222.585347] EEH: PCI-E AER 30: 00000000 00000000 [ 3222.586872] RTAS: event: 5, Type: Platform Error (224), Severity: 2 [ 3222.586873] EEH: Reset without hotplug activity [ 3224.762767] EEH: Beginning: 'slot_reset'
[ 3224.762770] EEH: PE#800000 (PCI 0384:80:00.0): Invoking
bnx2x->slot_reset()
[ 3224.762771] bnx2x: [bnx2x_io_slot_reset:14271(eth14)]IO slot reset initializing...
[ 3224.762887] bnx2x 0384:80:00.0: enabling device (0140 -> 0142) [ 3224.768157] bnx2x: [bnx2x_io_slot_reset:14287(eth14)]IO slot reset
--> driver unload

Uninterruptible tasks
=====================
crash> ps | grep UN
     213      2  11  c000000004c89e00  UN   0.0       0      0  [eehd]
     215      2   0  c000000004c80000  UN   0.0       0      0
[kworker/0:2]
    2196      1  28  c000000004504f00  UN   0.1   15936  11136  wickedd
    4287      1   9  c00000020d076800  UN   0.0    4032   3008  agetty
    4289      1  20  c00000020d056680  UN   0.0    7232   3840  agetty
   32423      2  26  c00000020038c580  UN   0.0       0      0
[kworker/26:3]
   32871   4241  27  c0000002609ddd00  UN   0.1   18624  11648  sshd
   32920  10130  16  c00000027284a100  UN   0.1   48512  12608  sendmail
   33092  32987   0  c000000205218b00  UN   0.1   48512  12608  sendmail
   33154   4567  16  c000000260e51780  UN   0.1   48832  12864  pickup
   33209   4241  36  c000000270cb6500  UN   0.1   18624  11712  sshd
   33473  33283   0  c000000205211480  UN   0.1   48512  12672  sendmail
   33531   4241  37  c00000023c902780  UN   0.1   18624  11648  sshd

EEH handler hung while bnx2x sleeping and holding RTNL lock
===========================================================
crash> bt 213
PID: 213    TASK: c000000004c89e00  CPU: 11  COMMAND: "eehd"
  #0 [c000000004d477e0] __schedule at c000000000c70808
  #1 [c000000004d478b0] schedule at c000000000c70ee0
  #2 [c000000004d478e0] schedule_timeout at c000000000c76dec
  #3 [c000000004d479c0] msleep at c0000000002120cc
  #4 [c000000004d479f0] napi_disable at c000000000a06448
                                        ^^^^^^^^^^^^^^^^
  #5 [c000000004d47a30] bnx2x_netif_stop at c0080000018dba94 [bnx2x]
  #6 [c000000004d47a60] bnx2x_io_slot_reset at c0080000018a551c [bnx2x]
  #7 [c000000004d47b20] eeh_report_reset at c00000000004c9bc
  #8 [c000000004d47b90] eeh_pe_report at c00000000004d1a8
  #9 [c000000004d47c40] eeh_handle_normal_event at c00000000004da64

And the sleeping source code
============================
crash> dis -ls c000000000a06448
FILE: ../net/core/dev.c
LINE: 6702

   6697  {
   6698          might_sleep();
   6699          set_bit(NAPI_STATE_DISABLE, &n->state);
   6700
   6701          while (test_and_set_bit(NAPI_STATE_SCHED, &n->state))
* 6702                  msleep(1);
   6703          while (test_and_set_bit(NAPI_STATE_NPSVC, &n->state))
   6704                  msleep(1);
   6705
   6706          hrtimer_cancel(&n->timer);
   6707
   6708          clear_bit(NAPI_STATE_DISABLE, &n->state);
   6709  }

EEH calls into bnx2x twice based on the system log above, first through
bnx2x_io_error_detected() and then bnx2x_io_slot_reset(), and executes
the following call chains:

bnx2x_io_error_detected()
  +-> bnx2x_eeh_nic_unload()
       +-> bnx2x_del_all_napi()
            +-> __netif_napi_del()

bnx2x_io_slot_reset()
  +-> bnx2x_netif_stop()
       +-> bnx2x_napi_disable()
            +->napi_disable()

Fix this by correcting the sequence of NAPI APIs usage,
that is delete the NAPI after disabling it.

Fixes: 7fa6f34081f1 ("bnx2x: AER revised")
Reported-by: David Christensen <drc@linux.vnet.ibm.com>
Tested-by: David Christensen <drc@linux.vnet.ibm.com>
Signed-off-by: Manish Chopra <manishc@marvell.com>
Signed-off-by: Ariel Elior <aelior@marvell.com>
Link: https://lore.kernel.org/r/20220426153913.6966-1-manishc@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index 8d17d464c067..398928642a97 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -14314,10 +14314,6 @@ static int bnx2x_eeh_nic_unload(struct bnx2x *bp)
 
 	/* Stop Tx */
 	bnx2x_tx_disable(bp);
-	/* Delete all NAPI objects */
-	bnx2x_del_all_napi(bp);
-	if (CNIC_LOADED(bp))
-		bnx2x_del_all_napi_cnic(bp);
 	netdev_reset_tc(bp->dev);
 
 	del_timer_sync(&bp->timer);
@@ -14422,6 +14418,11 @@ static pci_ers_result_t bnx2x_io_slot_reset(struct pci_dev *pdev)
 		bnx2x_drain_tx_queues(bp);
 		bnx2x_send_unload_req(bp, UNLOAD_RECOVERY);
 		bnx2x_netif_stop(bp, 1);
+		bnx2x_del_all_napi(bp);
+
+		if (CNIC_LOADED(bp))
+			bnx2x_del_all_napi_cnic(bp);
+
 		bnx2x_free_irq(bp);
 
 		/* Report UNLOAD_DONE to MCP */
-- 
2.35.1




  parent reply	other threads:[~2022-05-10 13:19 UTC|newest]

Thread overview: 72+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-05-10 13:06 [PATCH 4.9 00/66] 4.9.313-rc1 review Greg Kroah-Hartman
2022-05-10 13:06 ` [PATCH 4.9 01/66] floppy: disable FDRAWCMD by default Greg Kroah-Hartman
2022-05-10 13:06 ` [PATCH 4.9 02/66] Revert "net: ethernet: stmmac: fix altr_tse_pcs function when using a fixed-link" Greg Kroah-Hartman
2022-05-10 13:06 ` [PATCH 4.9 03/66] lightnvm: disable the subsystem Greg Kroah-Hartman
2022-05-10 13:06 ` [PATCH 4.9 04/66] USB: quirks: add a Realtek card reader Greg Kroah-Hartman
2022-05-10 13:06 ` [PATCH 4.9 05/66] USB: quirks: add STRING quirk for VCOM device Greg Kroah-Hartman
2022-05-10 13:06 ` [PATCH 4.9 06/66] USB: serial: whiteheat: fix heap overflow in WHITEHEAT_GET_DTR_RTS Greg Kroah-Hartman
2022-05-10 13:06 ` [PATCH 4.9 07/66] USB: serial: cp210x: add PIDs for Kamstrup USB Meter Reader Greg Kroah-Hartman
2022-05-10 13:06 ` [PATCH 4.9 08/66] USB: serial: option: add support for Cinterion MV32-WA/MV32-WB Greg Kroah-Hartman
2022-05-10 13:06 ` [PATCH 4.9 09/66] USB: serial: option: add Telit 0x1057, 0x1058, 0x1075 compositions Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.9 10/66] xhci: stop polling roothubs after shutdown Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.9 11/66] iio: dac: ad5592r: Fix the missing return value Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.9 12/66] iio: dac: ad5446: Fix read_raw not returning set value Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.9 13/66] iio: magnetometer: ak8975: Fix the error handling in ak8975_power_on() Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.9 14/66] usb: misc: fix improper handling of refcount in uss720_probe() Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.9 15/66] usb: gadget: uvc: Fix crash when encoding data for usb request Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.9 16/66] usb: gadget: configfs: clear deactivation flag in configfs_composite_unbind() Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.9 17/66] serial: 8250: Also set sticky MCR bits in console restoration Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.9 18/66] serial: 8250: Correct the clock for EndRun PTP/1588 PCIe device Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.9 19/66] hex2bin: make the function hex_to_bin constant-time Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.9 20/66] hex2bin: fix access beyond string end Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.9 21/66] ARM: dts: imx6qdl-apalis: Fix sgtl5000 detection issue Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.9 22/66] phy: samsung: Fix missing of_node_put() in exynos_sata_phy_probe Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.9 23/66] phy: samsung: exynos5250-sata: fix missing device put in probe error paths Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.9 24/66] ARM: OMAP2+: Fix refcount leak in omap_gic_of_init Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.9 25/66] ARM: dts: Fix mmc order for omap3-gta04 Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.9 26/66] mtd: rawnand: Fix return value check of wait_for_completion_timeout Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.9 27/66] pinctrl: pistachio: fix use of irq_of_parse_and_map() Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.9 28/66] ip_gre: Make o_seqno start from 0 in native mode Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.9 29/66] tcp: fix potential xmit stalls caused by TCP_NOTSENT_LOWAT Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.9 30/66] bus: sunxi-rsb: Fix the return value of sunxi_rsb_device_create() Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.9 31/66] clk: sunxi: sun9i-mmc: check return value after calling platform_get_resource() Greg Kroah-Hartman
2022-05-10 13:07 ` Greg Kroah-Hartman [this message]
2022-05-10 13:07 ` [PATCH 4.9 33/66] ASoC: wm8731: Disable the regulator when probing fails Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.9 34/66] drivers: net: hippi: Fix deadlock in rr_close() Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.9 35/66] x86/cpu: Load microcode during restore_processor_state() Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.9 36/66] tty: n_gsm: fix wrong signal octet encoding in convergence layer type 2 Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.9 37/66] tty: n_gsm: fix malformed counter for out of frame data Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.9 38/66] tty: n_gsm: fix insufficient txframe size Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.9 39/66] tty: n_gsm: fix missing explicit ldisc flush Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.9 40/66] tty: n_gsm: fix wrong command retry handling Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.9 41/66] tty: n_gsm: fix wrong command frame length field encoding Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.9 42/66] tty: n_gsm: fix incorrect UA handling Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.9 43/66] MIPS: Fix CP0 counter erratum detection for R4k CPUs Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.9 44/66] parisc: Merge model and model name into one line in /proc/cpuinfo Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.9 45/66] ALSA: fireworks: fix wrong return count shorter than expected by 4 bytes Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.9 46/66] Revert "SUNRPC: attempt AF_LOCAL connect on setup" Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.9 47/66] firewire: fix potential uaf in outbound_phy_packet_callback() Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.9 48/66] firewire: remove check of list iterator against head past the loop body Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.9 49/66] firewire: core: extend card->lock in fw_core_handle_bus_reset Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.9 50/66] ASoC: wm8958: Fix change notifications for DSP controls Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.9 51/66] can: grcan: grcan_close(): fix deadlock Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.9 52/66] can: grcan: use ofdev->dev when allocating DMA memory Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.9 53/66] nfc: replace improper check device_is_registered() in netlink related functions Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.9 54/66] nfc: nfcmrvl: main: reorder destructive operations in nfcmrvl_nci_unregister_dev to avoid bugs Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.9 55/66] NFC: netlink: fix sleep in atomic bug when firmware download timeout Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.9 56/66] hwmon: (adt7470) Fix warning on module removal Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.9 57/66] ASoC: dmaengine: Restore NULL prepare_slave_config() callback Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.9 58/66] net: emaclite: Add error handling for of_address_to_resource() Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.9 59/66] smsc911x: allow using IRQ0 Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.9 60/66] btrfs: always log symlinks in full mode Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.9 61/66] net: igmp: respect RCU rules in ip_mc_source() and ip_mc_msfilter() Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.9 62/66] kvm: x86/cpuid: Only provide CPUID leaf 0xA if host has architectural PMU Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.9 63/66] net: sched: prevent UAF on tc_ctl_tfilter when temporarily dropping rtnl_lock Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.9 64/66] net: ipv6: ensure we call ipv6_mc_down() at most once Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.9 65/66] dm: fix mempool NULL pointer race when completing IO Greg Kroah-Hartman
2022-05-10 13:07 ` [PATCH 4.9 66/66] dm: interlock pending dm_io and dm_wait_for_bios_completion Greg Kroah-Hartman
2022-05-10 16:46 ` [PATCH 4.9 00/66] 4.9.313-rc1 review Florian Fainelli
2022-05-10 18:07 ` Pavel Machek
2022-05-10 22:43 ` Shuah Khan
2022-05-11  1:10 ` Guenter Roeck
2022-05-11 11:15 ` Naresh Kamboju

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220510130730.708432233@linuxfoundation.org \
    --to=gregkh@linuxfoundation.org \
    --cc=aelior@marvell.com \
    --cc=drc@linux.vnet.ibm.com \
    --cc=kuba@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=manishc@marvell.com \
    --cc=sashal@kernel.org \
    --cc=stable@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox