All of lore.kernel.org
 help / color / mirror / Atom feed
From: Simon Horman <horms@kernel.org>
To: Mohamed Khalfella <mkhalfella@purestorage.com>
Cc: Auke Kok <auke-jan.h.kok@intel.com>,
	Przemek Kitszel <przemyslaw.kitszel@intel.com>,
	Ying Hsu <yinghsu@chromium.org>,
	linux-kernel@vger.kernel.org, Jeff Garzik <jgarzik@redhat.com>,
	Yuanyuan Zhong <yzhong@purestorage.com>,
	Eric Dumazet <edumazet@google.com>,
	netdev@vger.kernel.org, Tony Nguyen <anthony.l.nguyen@intel.com>,
	intel-wired-lan@lists.osuosl.org,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
	"David S. Miller" <davem@davemloft.net>
Subject: Re: [Intel-wired-lan] [PATCH v2 1/1] igb: Do not bring the device up after non-fatal error
Date: Wed, 25 Sep 2024 19:30:35 +0100	[thread overview]
Message-ID: <20240925183035.GV4029621@kernel.org> (raw)
In-Reply-To: <20240924210604.123175-2-mkhalfella@purestorage.com>

On Tue, Sep 24, 2024 at 03:06:01PM -0600, Mohamed Khalfella wrote:
> Commit 004d25060c78 ("igb: Fix igb_down hung on surprise removal")
> changed igb_io_error_detected() to ignore non-fatal pcie errors in order
> to avoid hung task that can happen when igb_down() is called multiple
> times. This caused an issue when processing transient non-fatal errors.
> igb_io_resume(), which is called after igb_io_error_detected(), assumes
> that device is brought down by igb_io_error_detected() if the interface
> is up. This resulted in panic with stacktrace below.
> 
> [ T3256] igb 0000:09:00.0 haeth0: igb: haeth0 NIC Link is Down
> [  T292] pcieport 0000:00:1c.5: AER: Uncorrected (Non-Fatal) error received: 0000:09:00.0
> [  T292] igb 0000:09:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
> [  T292] igb 0000:09:00.0:   device [8086:1537] error status/mask=00004000/00000000
> [  T292] igb 0000:09:00.0:    [14] CmpltTO [  200.105524,009][  T292] igb 0000:09:00.0: AER:   TLP Header: 00000000 00000000 00000000 00000000
> [  T292] pcieport 0000:00:1c.5: AER: broadcast error_detected message
> [  T292] igb 0000:09:00.0: Non-correctable non-fatal error reported.
> [  T292] pcieport 0000:00:1c.5: AER: broadcast mmio_enabled message
> [  T292] pcieport 0000:00:1c.5: AER: broadcast resume message
> [  T292] ------------[ cut here ]------------
> [  T292] kernel BUG at net/core/dev.c:6539!
> [  T292] invalid opcode: 0000 [#1] PREEMPT SMP
> [  T292] RIP: 0010:napi_enable+0x37/0x40
> [  T292] Call Trace:
> [  T292]  <TASK>
> [  T292]  ? die+0x33/0x90
> [  T292]  ? do_trap+0xdc/0x110
> [  T292]  ? napi_enable+0x37/0x40
> [  T292]  ? do_error_trap+0x70/0xb0
> [  T292]  ? napi_enable+0x37/0x40
> [  T292]  ? napi_enable+0x37/0x40
> [  T292]  ? exc_invalid_op+0x4e/0x70
> [  T292]  ? napi_enable+0x37/0x40
> [  T292]  ? asm_exc_invalid_op+0x16/0x20
> [  T292]  ? napi_enable+0x37/0x40
> [  T292]  igb_up+0x41/0x150
> [  T292]  igb_io_resume+0x25/0x70
> [  T292]  report_resume+0x54/0x70
> [  T292]  ? report_frozen_detected+0x20/0x20
> [  T292]  pci_walk_bus+0x6c/0x90
> [  T292]  ? aer_print_port_info+0xa0/0xa0
> [  T292]  pcie_do_recovery+0x22f/0x380
> [  T292]  aer_process_err_devices+0x110/0x160
> [  T292]  aer_isr+0x1c1/0x1e0
> [  T292]  ? disable_irq_nosync+0x10/0x10
> [  T292]  irq_thread_fn+0x1a/0x60
> [  T292]  irq_thread+0xe3/0x1a0
> [  T292]  ? irq_set_affinity_notifier+0x120/0x120
> [  T292]  ? irq_affinity_notify+0x100/0x100
> [  T292]  kthread+0xe2/0x110
> [  T292]  ? kthread_complete_and_exit+0x20/0x20
> [  T292]  ret_from_fork+0x2d/0x50
> [  T292]  ? kthread_complete_and_exit+0x20/0x20
> [  T292]  ret_from_fork_asm+0x11/0x20
> [  T292]  </TASK>
> 
> To fix this issue igb_io_resume() checks if the interface is running and
> the device is not down this means igb_io_error_detected() did not bring
> the device down and there is no need to bring it up.
> 
> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> Reviewed-by: Yuanyuan Zhong<yzhong@purestorage.com>
> Fixes: 004d25060c78 ("igb: Fix igb_down hung on surprise removal")

Thanks for the update, this version looks good to me.

Reviewed-by: Simon Horman <horms@kernel.org>

WARNING: multiple messages have this Message-ID (diff)
From: Simon Horman <horms@kernel.org>
To: Mohamed Khalfella <mkhalfella@purestorage.com>
Cc: Tony Nguyen <anthony.l.nguyen@intel.com>,
	Przemek Kitszel <przemyslaw.kitszel@intel.com>,
	"David S. Miller" <davem@davemloft.net>,
	Eric Dumazet <edumazet@google.com>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
	Auke Kok <auke-jan.h.kok@intel.com>,
	Yuanyuan Zhong <yzhong@purestorage.com>,
	Jeff Garzik <jgarzik@redhat.com>, Ying Hsu <yinghsu@chromium.org>,
	intel-wired-lan@lists.osuosl.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2 1/1] igb: Do not bring the device up after non-fatal error
Date: Wed, 25 Sep 2024 19:30:35 +0100	[thread overview]
Message-ID: <20240925183035.GV4029621@kernel.org> (raw)
In-Reply-To: <20240924210604.123175-2-mkhalfella@purestorage.com>

On Tue, Sep 24, 2024 at 03:06:01PM -0600, Mohamed Khalfella wrote:
> Commit 004d25060c78 ("igb: Fix igb_down hung on surprise removal")
> changed igb_io_error_detected() to ignore non-fatal pcie errors in order
> to avoid hung task that can happen when igb_down() is called multiple
> times. This caused an issue when processing transient non-fatal errors.
> igb_io_resume(), which is called after igb_io_error_detected(), assumes
> that device is brought down by igb_io_error_detected() if the interface
> is up. This resulted in panic with stacktrace below.
> 
> [ T3256] igb 0000:09:00.0 haeth0: igb: haeth0 NIC Link is Down
> [  T292] pcieport 0000:00:1c.5: AER: Uncorrected (Non-Fatal) error received: 0000:09:00.0
> [  T292] igb 0000:09:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
> [  T292] igb 0000:09:00.0:   device [8086:1537] error status/mask=00004000/00000000
> [  T292] igb 0000:09:00.0:    [14] CmpltTO [  200.105524,009][  T292] igb 0000:09:00.0: AER:   TLP Header: 00000000 00000000 00000000 00000000
> [  T292] pcieport 0000:00:1c.5: AER: broadcast error_detected message
> [  T292] igb 0000:09:00.0: Non-correctable non-fatal error reported.
> [  T292] pcieport 0000:00:1c.5: AER: broadcast mmio_enabled message
> [  T292] pcieport 0000:00:1c.5: AER: broadcast resume message
> [  T292] ------------[ cut here ]------------
> [  T292] kernel BUG at net/core/dev.c:6539!
> [  T292] invalid opcode: 0000 [#1] PREEMPT SMP
> [  T292] RIP: 0010:napi_enable+0x37/0x40
> [  T292] Call Trace:
> [  T292]  <TASK>
> [  T292]  ? die+0x33/0x90
> [  T292]  ? do_trap+0xdc/0x110
> [  T292]  ? napi_enable+0x37/0x40
> [  T292]  ? do_error_trap+0x70/0xb0
> [  T292]  ? napi_enable+0x37/0x40
> [  T292]  ? napi_enable+0x37/0x40
> [  T292]  ? exc_invalid_op+0x4e/0x70
> [  T292]  ? napi_enable+0x37/0x40
> [  T292]  ? asm_exc_invalid_op+0x16/0x20
> [  T292]  ? napi_enable+0x37/0x40
> [  T292]  igb_up+0x41/0x150
> [  T292]  igb_io_resume+0x25/0x70
> [  T292]  report_resume+0x54/0x70
> [  T292]  ? report_frozen_detected+0x20/0x20
> [  T292]  pci_walk_bus+0x6c/0x90
> [  T292]  ? aer_print_port_info+0xa0/0xa0
> [  T292]  pcie_do_recovery+0x22f/0x380
> [  T292]  aer_process_err_devices+0x110/0x160
> [  T292]  aer_isr+0x1c1/0x1e0
> [  T292]  ? disable_irq_nosync+0x10/0x10
> [  T292]  irq_thread_fn+0x1a/0x60
> [  T292]  irq_thread+0xe3/0x1a0
> [  T292]  ? irq_set_affinity_notifier+0x120/0x120
> [  T292]  ? irq_affinity_notify+0x100/0x100
> [  T292]  kthread+0xe2/0x110
> [  T292]  ? kthread_complete_and_exit+0x20/0x20
> [  T292]  ret_from_fork+0x2d/0x50
> [  T292]  ? kthread_complete_and_exit+0x20/0x20
> [  T292]  ret_from_fork_asm+0x11/0x20
> [  T292]  </TASK>
> 
> To fix this issue igb_io_resume() checks if the interface is running and
> the device is not down this means igb_io_error_detected() did not bring
> the device down and there is no need to bring it up.
> 
> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> Reviewed-by: Yuanyuan Zhong<yzhong@purestorage.com>
> Fixes: 004d25060c78 ("igb: Fix igb_down hung on surprise removal")

Thanks for the update, this version looks good to me.

Reviewed-by: Simon Horman <horms@kernel.org>

  reply	other threads:[~2024-09-25 18:30 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-09-24 21:06 [Intel-wired-lan] [PATCH v2 0/1] igb: Do not bring the device up after non-fatal error Mohamed Khalfella
2024-09-24 21:06 ` Mohamed Khalfella
2024-09-24 21:06 ` [Intel-wired-lan] [PATCH v2 1/1] " Mohamed Khalfella
2024-09-24 21:06   ` Mohamed Khalfella
2024-09-25 18:30   ` Simon Horman [this message]
2024-09-25 18:30     ` Simon Horman
2024-09-28 14:40   ` [Intel-wired-lan] " Pucha, HimasekharX Reddy
2024-09-28 14:40     ` Pucha, HimasekharX Reddy
2024-09-30 18:27     ` Mohamed Khalfella
2024-09-30 18:27       ` Mohamed Khalfella
2024-10-02 12:53   ` Pucha, HimasekharX Reddy
2024-10-02 12:53     ` Pucha, HimasekharX Reddy

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240925183035.GV4029621@kernel.org \
    --to=horms@kernel.org \
    --cc=anthony.l.nguyen@intel.com \
    --cc=auke-jan.h.kok@intel.com \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=intel-wired-lan@lists.osuosl.org \
    --cc=jgarzik@redhat.com \
    --cc=kuba@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mkhalfella@purestorage.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=przemyslaw.kitszel@intel.com \
    --cc=yinghsu@chromium.org \
    --cc=yzhong@purestorage.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.