Re: [v1] e1000e: EEH on e1000e adapter detects io perm failure can trigger crash

From: "David Z. Dai" <zdai@linux.vnet.ibm.com>
To: Alexander Duyck <alexander.duyck@gmail.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>,
	David Miller <davem@davemloft.net>,
	intel-wired-lan <intel-wired-lan@lists.osuosl.org>,
	Netdev <netdev@vger.kernel.org>,
	LKML <linux-kernel@vger.kernel.org>,
	zdai@us.ibm.com
Subject: Re: [v1] e1000e: EEH on e1000e adapter detects io perm failure can trigger crash
Date: Thu, 03 Oct 2019 13:50:58 -0500
Message-ID: <1570128658.1250.8.camel@oc5348122405>
In-Reply-To: <CAKgT0Udz7vt5C=+6vpFPbys4sODAZtCjrkSvOdgP80rX7Ww+Ng@mail.gmail.com>

On Thu, 2019-10-03 at 10:39 -0700, Alexander Duyck wrote:
> On Thu, Oct 3, 2019 at 9:59 AM David Dai <zdai@linux.vnet.ibm.com> wrote:
> >
> > When EEH detects a permanent I/O failure on an e1000e adapter, the
> > kernel crashes with this stack:
> > EEH: Beginning: 'error_detected(permanent failure)'
> > EEH: PE#900000 (PCI 0115:90:00.1): Invoking e1000e->error_detected(permanent failure)
> > EEH: PE#900000 (PCI 0115:90:00.1): e1000e driver reports: 'disconnect'
> > EEH: PE#900000 (PCI 0115:90:00.0): Invoking e1000e->error_detected(permanent failure)
> > EEH: PE#900000 (PCI 0115:90:00.0): e1000e driver reports: 'disconnect'
> > EEH: Finished:'error_detected(permanent failure)'
> > Oops: Exception in kernel mode, sig: 5 [#1]
> > NIP [c0000000007b1be0] free_msi_irqs+0xa0/0x280
> >  LR [c0000000007b1bd0] free_msi_irqs+0x90/0x280
> > Call Trace:
> > [c0000004f491ba10] [c0000000007b1bd0] free_msi_irqs+0x90/0x280 (unreliable)
> > [c0000004f491ba70] [c0000000007b260c] pci_disable_msi+0x13c/0x180
> > [c0000004f491bab0] [d0000000046381ac] e1000_remove+0x234/0x2a0 [e1000e]
> > [c0000004f491baf0] [c000000000783cec] pci_device_remove+0x6c/0x120
> > [c0000004f491bb30] [c00000000088da6c] device_release_driver_internal+0x2bc/0x3f0
> > [c0000004f491bb80] [c00000000076f5a8] pci_stop_and_remove_bus_device+0xb8/0x110
> > [c0000004f491bbc0] [c00000000006e890] pci_hp_remove_devices+0x90/0x130
> > [c0000004f491bc50] [c00000000004ad34] eeh_handle_normal_event+0x1d4/0x660
> > [c0000004f491bd10] [c00000000004bf10] eeh_event_handler+0x1c0/0x1e0
> > [c0000004f491bdc0] [c00000000017c4ac] kthread+0x1ac/0x1c0
> > [c0000004f491be30] [c00000000000b75c] ret_from_kernel_thread+0x5c/0x80
> >
> > Basically, the e1000e IRQs haven't been freed at the time EEH tries to
> > remove the e1000e device.
> > When e1000e_close is called to bring down the NIC and the adapter's
> > error_state is pci_channel_io_perm_failure, it should also bring down
> > the link and free the IRQs.
> >
> > Reported-by: Morumuri Srivalli  <smorumu1@in.ibm.com>
> > Signed-off-by: David Dai <zdai@linux.vnet.ibm.com>
> > ---
> >  drivers/net/ethernet/intel/e1000e/netdev.c |    3 ++-
> >  1 files changed, 2 insertions(+), 1 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
> > index d7d56e4..cf618e1 100644
> > --- a/drivers/net/ethernet/intel/e1000e/netdev.c
> > +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
> > @@ -4715,7 +4715,8 @@ int e1000e_close(struct net_device *netdev)
> >
> >         pm_runtime_get_sync(&pdev->dev);
> >
> > -       if (!test_bit(__E1000_DOWN, &adapter->state)) {
> > +       if (!test_bit(__E1000_DOWN, &adapter->state) ||
> > +           (adapter->pdev->error_state == pci_channel_io_perm_failure)) {
> >                 e1000e_down(adapter, true);
> >                 e1000_free_irq(adapter);
> 
> It seems like the issue is that e1000_io_error_detected is calling
> e1000e_down without the e1000_free_irq() bit. Instead of doing this,
> couldn't you simply add the following to e1000_io_slot_reset in the
> "result = PCI_ERS_RESULT_DISCONNECT" case:
>     if (netif_running(netdev))
>         e1000_free_irq(adapter);
> 
> Alternatively we could look at freeing and reallocating the IRQs in
> the event of an error like we do for the e1000e_pm_freeze and
> e1000e_pm_thaw cases. That might make more sense since we are dealing
> with an error we might want to free and reallocate the IRQ resources
> assigned to the device.
> 
> Thanks.
> 
> - Alex

Thanks for the quick reply and comments!
I looked at the e1000_io_slot_reset() routine:
        err = pci_enable_device_mem(pdev);
        if (err) {
                dev_err(&pdev->dev,
                        "Cannot re-enable PCI device after reset.\n");
                result = PCI_ERS_RESULT_DISCONNECT;
        } else {
I didn't see the log message "Cannot re-enable PCI device after reset" at
the time of the crash, so this branch was not taken.

I can still apply the same logic in the e1000_io_error_detected() routine:
    if (state == pci_channel_io_perm_failure) {
+       if (netif_running(netdev))
+           e1000_free_irq(adapter);
        return PCI_ERS_RESULT_DISCONNECT;
    }
Will test this once the test hardware is available again.
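
To make the intent concrete, here is a small userspace model of the sequence
described above. It is only a sketch, not driver code: every type and
function name (model_adapter, model_error_detected, and so on) is
hypothetical, and it assumes the adapter has already been marked down, with
its IRQs still allocated, by the time the permanent failure is reported.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for kernel types; not the real e1000e structures. */
enum channel_state { IO_NORMAL, IO_FROZEN, IO_PERM_FAILURE };
enum ers_result { ERS_NEED_RESET, ERS_DISCONNECT };

struct model_adapter {
	bool running;       /* models netif_running(netdev) */
	bool down;          /* models the __E1000_DOWN state bit */
	bool irq_allocated; /* models IRQs still held by the driver */
};

static void model_down(struct model_adapter *a)
{
	a->down = true;
}

static void model_free_irq(struct model_adapter *a)
{
	a->irq_allocated = false;
}

/*
 * Models the error_detected callback. With free_on_perm set, the IRQs are
 * released before reporting disconnect, as in the change proposed above.
 */
static enum ers_result model_error_detected(struct model_adapter *a,
					    enum channel_state state,
					    bool free_on_perm)
{
	if (state == IO_PERM_FAILURE) {
		if (free_on_perm && a->running)
			model_free_irq(a);
		return ERS_DISCONNECT;
	}
	if (a->running)
		model_down(a);
	return ERS_NEED_RESET;
}

/* Models the close path: teardown is skipped when the down bit is set. */
static void model_close(struct model_adapter *a)
{
	if (!a->down) {
		model_down(a);
		model_free_irq(a);
	}
	a->running = false;
}

/*
 * Models device removal. Returns true if IRQs are still allocated after
 * close, which is the condition associated with the free_msi_irqs crash.
 */
static bool model_remove_leaks_irq(struct model_adapter *a)
{
	model_close(a);
	return a->irq_allocated;
}
```

Starting from { .running = true, .down = true, .irq_allocated = true }, the
model reports a leaked IRQ at removal when free_on_perm is false and none
when it is true. Whether the real driver reaches that starting state the
same way is an assumption here and still needs the hardware test.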

Thanks! - David


Thread overview: 18+ messages
2019-10-03 16:54 [v1] e1000e: EEH on e1000e adapter detects io perm failure can trigger crash David Dai
2019-10-03 17:39 ` Alexander Duyck
2019-10-03 18:50   ` David Z. Dai [this message]
2019-10-03 20:39     ` Alexander Duyck
2019-10-04  0:02       ` David Z. Dai
2019-10-04 14:35         ` Alexander Duyck
2019-10-04 17:04           ` David Z. Dai
2019-10-04 23:36             ` [RFC PATCH] e1000e: Use rtnl_lock to prevent race conditions between net and pci/pm Alexander Duyck
2019-10-05  2:18               ` David Z. Dai
2019-10-05 17:22                 ` Alexander Duyck
2019-10-07 15:50                   ` David Z. Dai
2019-10-07 17:02                     ` Alexander Duyck
2019-10-07 17:12                       ` David Z. Dai
2019-10-07 17:23                         ` Alexander Duyck
2019-10-07 17:27                           ` [RFC PATCH v2] " Alexander Duyck
2019-10-08 20:49                             ` David Z. Dai
2020-02-25  9:42                             ` Kai-Heng Feng
2020-02-25 20:46                               ` Greg Kroah-Hartman
