All of lore.kernel.org
 help / color / mirror / Atom feed
From: Stephen Hemminger <stephen@networkplumber.org>
To: intel-wired-lan@osuosl.org
Subject: [Intel-wired-lan] [PATCH] igb: Free IRQs when device is hotplugged
Date: Mon, 11 Dec 2017 16:34:22 -0800	[thread overview]
Message-ID: <20171211163422.07d538fb@xeon-e3> (raw)
In-Reply-To: <20171211234502.27445-1-lyude@redhat.com>

On Mon, 11 Dec 2017 18:45:02 -0500
Lyude Paul <lyude@redhat.com> wrote:

> Recently I got a Caldigit TS3 Thunderbolt 3 dock, and noticed that upon
> hotplugging my kernel would immediately crash due to igb:
> 
> [  680.825801] kernel BUG at drivers/pci/msi.c:352!
> [  680.828388] invalid opcode: 0000 [#1] SMP
> [  680.829194] Modules linked in: igb(O) thunderbolt i2c_algo_bit joydev vfat fat btusb btrtl btbcm btintel bluetooth ecdh_generic hp_wmi sparse_keymap rfkill wmi_bmof iTCO_wdt intel_rapl x86_pkg_temp_thermal coretemp crc32_pclmul snd_pcm rtsx_pci_ms mei_me snd_timer memstick snd pcspkr mei soundcore i2c_i801 tpm_tis psmouse shpchp wmi tpm_tis_core tpm video hp_wireless acpi_pad rtsx_pci_sdmmc mmc_core crc32c_intel serio_raw rtsx_pci mfd_core xhci_pci xhci_hcd i2c_hid i2c_core [last unloaded: igb]
> [  680.831085] CPU: 1 PID: 78 Comm: kworker/u16:1 Tainted: G           O     4.15.0-rc3Lyude-Test+ #6
> [  680.831596] Hardware name: HP HP ZBook Studio G4/826B, BIOS P71 Ver. 01.03 06/09/2017
> [  680.832168] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
> [  680.832687] RIP: 0010:free_msi_irqs+0x180/0x1b0
> [  680.833271] RSP: 0018:ffffc9000030fbf0 EFLAGS: 00010286
> [  680.833761] RAX: ffff8803405f9c00 RBX: ffff88033e3d2e40 RCX: 000000000000002c
> [  680.834278] RDX: 0000000000000000 RSI: 00000000000000ac RDI: ffff880340be2178
> [  680.834832] RBP: 0000000000000000 R08: ffff880340be1ff0 R09: ffff8803405f9c00
> [  680.835342] R10: 0000000000000000 R11: 0000000000000040 R12: ffff88033d63a298
> [  680.835822] R13: ffff88033d63a000 R14: 0000000000000060 R15: ffff880341959000
> [  680.836332] FS:  0000000000000000(0000) GS:ffff88034f440000(0000) knlGS:0000000000000000
> [  680.836817] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  680.837360] CR2: 000055e64044afdf CR3: 0000000001c09002 CR4: 00000000003606e0
> [  680.837954] Call Trace:
> [  680.838853]  pci_disable_msix+0xce/0xf0
> [  680.839616]  igb_reset_interrupt_capability+0x5d/0x60 [igb]
> [  680.840278]  igb_remove+0x9d/0x110 [igb]
> [  680.840764]  pci_device_remove+0x36/0xb0
> [  680.841279]  device_release_driver_internal+0x157/0x220
> [  680.841739]  pci_stop_bus_device+0x7d/0xa0
> [  680.842255]  pci_stop_bus_device+0x2b/0xa0
> [  680.842722]  pci_stop_bus_device+0x3d/0xa0
> [  680.843189]  pci_stop_and_remove_bus_device+0xe/0x20
> [  680.843627]  trim_stale_devices+0xf3/0x140
> [  680.844086]  trim_stale_devices+0x94/0x140
> [  680.844532]  trim_stale_devices+0xa6/0x140
> [  680.845031]  ? get_slot_status+0x90/0xc0
> [  680.845536]  acpiphp_check_bridge.part.5+0xfe/0x140
> [  680.846021]  acpiphp_hotplug_notify+0x175/0x200
> [  680.846581]  ? free_bridge+0x100/0x100
> [  680.847113]  acpi_device_hotplug+0x8a/0x490
> [  680.847535]  acpi_hotplug_work_fn+0x1a/0x30
> [  680.848076]  process_one_work+0x182/0x3a0
> [  680.848543]  worker_thread+0x2e/0x380
> [  680.848963]  ? process_one_work+0x3a0/0x3a0
> [  680.849373]  kthread+0x111/0x130
> [  680.849776]  ? kthread_create_worker_on_cpu+0x50/0x50
> [  680.850188]  ret_from_fork+0x1f/0x30
> [  680.850601] Code: 43 14 85 c0 0f 84 d5 fe ff ff 31 ed eb 0f 83 c5 01 39 6b 14 0f 86 c5 fe ff ff 8b 7b 10 01 ef e8 b7 e4 d2 ff 48 83 78 70 00 74 e3 <0f> 0b 49 8d b5 a0 00 00 00 e8 62 6f d3 ff e9 c7 fe ff ff 48 8b
> [  680.851497] RIP: free_msi_irqs+0x180/0x1b0 RSP: ffffc9000030fbf0
> 
> As it turns out, normally the freeing of IRQs that would fix this is called
> inside of the scope of __igb_close(). However, since the device is
> already gone by the point we try to unregister the netdevice from the
> driver due to a hotplug we end up seeing that the netif isn't present
> and thus, forget to free any of the device IRQs.
> 
> So: after unregistering the netdev in igb_remove() check whether the PCI
> device is stale and if so, free it's IRQs and tx/rx resources.
> 
> Signed-off-by: Lyude Paul <lyude@redhat.com>
> Fixes: 9474933caf21 ("igb: close/suspend race in netif_device_detach")
> Cc: Todd Fujinaka <todd.fujinaka@intel.com>
> Cc: stable at vger.kernel.org
> ---
>  drivers/net/ethernet/intel/igb/igb_main.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
> index c208753ff5b7..e650348b4bd7 100644
> --- a/drivers/net/ethernet/intel/igb/igb_main.c
> +++ b/drivers/net/ethernet/intel/igb/igb_main.c
> @@ -3325,6 +3325,16 @@ static void igb_remove(struct pci_dev *pdev)
>  
>  	unregister_netdev(netdev);
>  
> +	/* If the PCI device has already been physically removed (e.g. user
> +	 * unplugged a thunderbolt dock containing our hw) then the netif will
> +	 * already be down, so unregistering the netdev won't free the IRQs
> +	 */
> +	if (!pci_device_is_present(pdev)) {
> +		igb_free_irq(adapter);
> +		igb_free_all_tx_resources(adapter);
> +		igb_free_all_rx_resources(adapter);
> +	}
> +
>  	igb_clear_interrupt_scheme(adapter);
>  
>  	pci_iounmap(pdev, adapter->io_addr);

This looks like you are making a special case out of something that should
be fixed in igb_close instead.

Most likely something is wrong with earlier commit.

commit 9474933caf21a4cb5147223dca1551f527aaac36
Author: Todd Fujinaka <todd.fujinaka@intel.com>
Date:   Tue Nov 15 08:54:26 2016 -0800

    igb: close/suspend race in netif_device_detach
    
    Similar to ixgbe, when an interface is part of a namespace it is
    possible that igb_close() may be called while __igb_shutdown() is
    running which ends up in a double free WARN and/or a BUG in
    free_msi_irqs().
    
    Extend the rtnl_lock() to protect the call to netif_device_detach() and
    igb_clear_interrupt_scheme() in __igb_shutdown() and check for
    netif_device_present() to avoid calling igb_clear_interrupt_scheme() a
    second time in igb_close().
    
    Also extend the rtnl lock in igb_resume() to netif_device_attach().
    
    Signed-off-by: Todd Fujinaka <todd.fujinaka@intel.com>
    Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
    Tested-by: Aaron Brown <aaron.f.brown@intel.com>
    Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>


Layers on layers of workarounds does not lead to stability.


WARNING: multiple messages have this Message-ID (diff)
From: Stephen Hemminger <stephen@networkplumber.org>
To: Lyude Paul <lyude@redhat.com>
Cc: intel-wired-lan@lists.osuosl.org,
	Todd Fujinaka <todd.fujinaka@intel.com>,
	stable@vger.kernel.org,
	Jeff Kirsher <jeffrey.t.kirsher@intel.com>,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] igb: Free IRQs when device is hotplugged
Date: Mon, 11 Dec 2017 16:34:22 -0800	[thread overview]
Message-ID: <20171211163422.07d538fb@xeon-e3> (raw)
In-Reply-To: <20171211234502.27445-1-lyude@redhat.com>

On Mon, 11 Dec 2017 18:45:02 -0500
Lyude Paul <lyude@redhat.com> wrote:

> Recently I got a Caldigit TS3 Thunderbolt 3 dock, and noticed that upon
> hotplugging my kernel would immediately crash due to igb:
> 
> [  680.825801] kernel BUG at drivers/pci/msi.c:352!
> [  680.828388] invalid opcode: 0000 [#1] SMP
> [  680.829194] Modules linked in: igb(O) thunderbolt i2c_algo_bit joydev vfat fat btusb btrtl btbcm btintel bluetooth ecdh_generic hp_wmi sparse_keymap rfkill wmi_bmof iTCO_wdt intel_rapl x86_pkg_temp_thermal coretemp crc32_pclmul snd_pcm rtsx_pci_ms mei_me snd_timer memstick snd pcspkr mei soundcore i2c_i801 tpm_tis psmouse shpchp wmi tpm_tis_core tpm video hp_wireless acpi_pad rtsx_pci_sdmmc mmc_core crc32c_intel serio_raw rtsx_pci mfd_core xhci_pci xhci_hcd i2c_hid i2c_core [last unloaded: igb]
> [  680.831085] CPU: 1 PID: 78 Comm: kworker/u16:1 Tainted: G           O     4.15.0-rc3Lyude-Test+ #6
> [  680.831596] Hardware name: HP HP ZBook Studio G4/826B, BIOS P71 Ver. 01.03 06/09/2017
> [  680.832168] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
> [  680.832687] RIP: 0010:free_msi_irqs+0x180/0x1b0
> [  680.833271] RSP: 0018:ffffc9000030fbf0 EFLAGS: 00010286
> [  680.833761] RAX: ffff8803405f9c00 RBX: ffff88033e3d2e40 RCX: 000000000000002c
> [  680.834278] RDX: 0000000000000000 RSI: 00000000000000ac RDI: ffff880340be2178
> [  680.834832] RBP: 0000000000000000 R08: ffff880340be1ff0 R09: ffff8803405f9c00
> [  680.835342] R10: 0000000000000000 R11: 0000000000000040 R12: ffff88033d63a298
> [  680.835822] R13: ffff88033d63a000 R14: 0000000000000060 R15: ffff880341959000
> [  680.836332] FS:  0000000000000000(0000) GS:ffff88034f440000(0000) knlGS:0000000000000000
> [  680.836817] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  680.837360] CR2: 000055e64044afdf CR3: 0000000001c09002 CR4: 00000000003606e0
> [  680.837954] Call Trace:
> [  680.838853]  pci_disable_msix+0xce/0xf0
> [  680.839616]  igb_reset_interrupt_capability+0x5d/0x60 [igb]
> [  680.840278]  igb_remove+0x9d/0x110 [igb]
> [  680.840764]  pci_device_remove+0x36/0xb0
> [  680.841279]  device_release_driver_internal+0x157/0x220
> [  680.841739]  pci_stop_bus_device+0x7d/0xa0
> [  680.842255]  pci_stop_bus_device+0x2b/0xa0
> [  680.842722]  pci_stop_bus_device+0x3d/0xa0
> [  680.843189]  pci_stop_and_remove_bus_device+0xe/0x20
> [  680.843627]  trim_stale_devices+0xf3/0x140
> [  680.844086]  trim_stale_devices+0x94/0x140
> [  680.844532]  trim_stale_devices+0xa6/0x140
> [  680.845031]  ? get_slot_status+0x90/0xc0
> [  680.845536]  acpiphp_check_bridge.part.5+0xfe/0x140
> [  680.846021]  acpiphp_hotplug_notify+0x175/0x200
> [  680.846581]  ? free_bridge+0x100/0x100
> [  680.847113]  acpi_device_hotplug+0x8a/0x490
> [  680.847535]  acpi_hotplug_work_fn+0x1a/0x30
> [  680.848076]  process_one_work+0x182/0x3a0
> [  680.848543]  worker_thread+0x2e/0x380
> [  680.848963]  ? process_one_work+0x3a0/0x3a0
> [  680.849373]  kthread+0x111/0x130
> [  680.849776]  ? kthread_create_worker_on_cpu+0x50/0x50
> [  680.850188]  ret_from_fork+0x1f/0x30
> [  680.850601] Code: 43 14 85 c0 0f 84 d5 fe ff ff 31 ed eb 0f 83 c5 01 39 6b 14 0f 86 c5 fe ff ff 8b 7b 10 01 ef e8 b7 e4 d2 ff 48 83 78 70 00 74 e3 <0f> 0b 49 8d b5 a0 00 00 00 e8 62 6f d3 ff e9 c7 fe ff ff 48 8b
> [  680.851497] RIP: free_msi_irqs+0x180/0x1b0 RSP: ffffc9000030fbf0
> 
> As it turns out, normally the freeing of IRQs that would fix this is called
> inside of the scope of __igb_close(). However, since the device is
> already gone by the point we try to unregister the netdevice from the
> driver due to a hotplug we end up seeing that the netif isn't present
> and thus, forget to free any of the device IRQs.
> 
> So: after unregistering the netdev in igb_remove() check whether the PCI
> device is stale and if so, free it's IRQs and tx/rx resources.
> 
> Signed-off-by: Lyude Paul <lyude@redhat.com>
> Fixes: 9474933caf21 ("igb: close/suspend race in netif_device_detach")
> Cc: Todd Fujinaka <todd.fujinaka@intel.com>
> Cc: stable@vger.kernel.org
> ---
>  drivers/net/ethernet/intel/igb/igb_main.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
> index c208753ff5b7..e650348b4bd7 100644
> --- a/drivers/net/ethernet/intel/igb/igb_main.c
> +++ b/drivers/net/ethernet/intel/igb/igb_main.c
> @@ -3325,6 +3325,16 @@ static void igb_remove(struct pci_dev *pdev)
>  
>  	unregister_netdev(netdev);
>  
> +	/* If the PCI device has already been physically removed (e.g. user
> +	 * unplugged a thunderbolt dock containing our hw) then the netif will
> +	 * already be down, so unregistering the netdev won't free the IRQs
> +	 */
> +	if (!pci_device_is_present(pdev)) {
> +		igb_free_irq(adapter);
> +		igb_free_all_tx_resources(adapter);
> +		igb_free_all_rx_resources(adapter);
> +	}
> +
>  	igb_clear_interrupt_scheme(adapter);
>  
>  	pci_iounmap(pdev, adapter->io_addr);

This looks like you are making a special case out of something that should
be fixed in igb_close instead.

Most likely something is wrong with earlier commit.

commit 9474933caf21a4cb5147223dca1551f527aaac36
Author: Todd Fujinaka <todd.fujinaka@intel.com>
Date:   Tue Nov 15 08:54:26 2016 -0800

    igb: close/suspend race in netif_device_detach
    
    Similar to ixgbe, when an interface is part of a namespace it is
    possible that igb_close() may be called while __igb_shutdown() is
    running which ends up in a double free WARN and/or a BUG in
    free_msi_irqs().
    
    Extend the rtnl_lock() to protect the call to netif_device_detach() and
    igb_clear_interrupt_scheme() in __igb_shutdown() and check for
    netif_device_present() to avoid calling igb_clear_interrupt_scheme() a
    second time in igb_close().
    
    Also extend the rtnl lock in igb_resume() to netif_device_attach().
    
    Signed-off-by: Todd Fujinaka <todd.fujinaka@intel.com>
    Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
    Tested-by: Aaron Brown <aaron.f.brown@intel.com>
    Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>


Layers on layers of workarounds does not lead to stability.

  reply	other threads:[~2017-12-12  0:34 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-12-11 23:45 [Intel-wired-lan] [PATCH] igb: Free IRQs when device is hotplugged Lyude Paul
2017-12-11 23:45 ` Lyude Paul
2017-12-12  0:34 ` Stephen Hemminger [this message]
2017-12-12  0:34   ` Stephen Hemminger
2017-12-12  0:37   ` [Intel-wired-lan] " Lyude Paul
2017-12-12  0:37     ` Lyude Paul

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20171211163422.07d538fb@xeon-e3 \
    --to=stephen@networkplumber.org \
    --cc=intel-wired-lan@osuosl.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.