[Intel-wired-lan] e1000e driver - hang after 4 hours of uptime - finally bisected!

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
To: intel-wired-lan@osuosl.org
Subject: [Intel-wired-lan] e1000e driver - hang after 4 hours of uptime - finally bisected!
Date: Thu, 18 Jun 2015 15:34:23 -0700	[thread overview]
Message-ID: <1434666863.3530.74.camel@intel.com> (raw)
In-Reply-To: <5284.1434646014@turing-police.cc.vt.edu>

On Thu, 2015-06-18 at 12:46 -0400, Valdis Kletnieks wrote:
> (follow up to a report from last week - bisecting took a while as I could
> only do 1 or 2 tests an evening)
> 
> My Dell Latitude E6530 crashes with a specific kernel lockup almost
> exactly 4 hours after boot if there isn't a cable connected to the
> Ethernet port:
> 
> [14508.846327] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0
> [14468.229720] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0
> [14463.254791] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0
> [14491.134413] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 1
> [14463.396593] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 2
> [14490.390223] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 1
> [14494.680591] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0
> [14513.365378] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 1
> [14482.271716] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 3
> [14479.906820] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0
> 
> As far as I can tell, the timestamp jitter is just how long it takes me to
> enter the cryptLUKS passphrase for the hard drive at boot...
> 
> lspci tells me:
> 
> lspci -vvv -s "00:19.0"
> 00:19.0 Ethernet controller: Intel Corporation 82579LM Gigabit Network Connection (rev 04)
>         DeviceName:  Onboard LAN
>         Subsystem: Dell Device 0535
>         Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
>         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>         Latency: 0
>         Interrupt: pin A routed to IRQ 28
>         Region 0: Memory at f7700000 (32-bit, non-prefetchable) [size=128K]
>         Region 1: Memory at f7739000 (32-bit, non-prefetchable) [size=4K]
>         Region 2: I/O ports at f040 [size=32]
>         Capabilities: [c8] Power Management version 2
>                 Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
>                 Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
>         Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
>                 Address: 00000000fee00318  Data: 0000
>         Capabilities: [e0] PCI Advanced Features
>                 AFCap: TP+ FLR+
>                 AFCtrl: FLR-
>                 AFStatus: TP-
>         Kernel driver in use: e1000e
> 
> 
> The traceback always looks like:
> 
> [14479.906820] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0
> 
> [14479.906908] Call Trace:
> [14479.906914]  <NMI>  [<ffffffffba94db16>] dump_stack+0x50/0xa8
> [14479.906930]  [<ffffffffba948bb9>] panic+0xcd/0x1e4
> [14479.906940]  [<ffffffffba166a60>] ? perf_event_task_disable+0xc0/0xc0
> [14479.906952]  [<ffffffffba125d8b>] watchdog_overflow_callback+0x9b/0xa0
> [14479.906959]  [<ffffffffba16a684>] __perf_event_overflow+0xc4/0x1f0
> [14479.906968]  [<ffffffffba16b3a4>] perf_event_overflow+0x14/0x20
> [14479.906976]  [<ffffffffba022271>] intel_pmu_handle_irq+0x1e1/0x430
> [14479.906990]  [<ffffffffba01a0f6>] perf_event_nmi_handler+0x26/0x40
> [14479.906999]  [<ffffffffba0085b3>] nmi_handle+0x103/0x340
> [14479.907005]  [<ffffffffba0084b5>] ? nmi_handle+0x5/0x340
> [14479.907017]  [<ffffffffba008a53>] default_do_nmi+0xc3/0x120
> [14479.907032]  [<ffffffffba008b98>] do_nmi+0xe8/0x130
> [14479.907044]  [<ffffffffba95c9a8>] end_repeat_nmi+0x1e/0x2e
> [14479.907055]  [<ffffffffba529886>] ? e1000e_cyclecounter_read+0x16/0xc0
> [14479.907061]  [<ffffffffba529886>] ? e1000e_cyclecounter_read+0x16/0xc0
> [14479.907069]  [<ffffffffba529886>] ? e1000e_cyclecounter_read+0x16/0xc0
> [14479.907075]  <<EOE>>  [<ffffffffba0e9529>] timecounter_read+0x19/0x60
> [14479.907088]  [<ffffffffba53687e>] e1000e_phc_gettime+0x2e/0x60
> [14479.907098]  [<ffffffffba536a31>] e1000e_systim_overflow_work+0x31/0x70
> [14479.907105]  [<ffffffffba07ad19>] process_one_work+0x3c9/0x980
> [14479.907115]  [<ffffffffba07ac62>] ? process_one_work+0x312/0x980
> [14479.907125]  [<ffffffffba07b348>] ? worker_thread+0x78/0x760
> [14479.907134]  [<ffffffffba07b59c>] worker_thread+0x2cc/0x760
> [14479.907144]  [<ffffffffba07b2d0>] ? process_one_work+0x980/0x980
> [14479.907154]  [<ffffffffba082a5e>] kthread+0xfe/0x120
> [14479.907163]  [<ffffffffba08ca50>] ? finish_task_switch+0x50/0x1c0
> [14479.907173]  [<ffffffffba082960>] ? kthread_create_on_node+0x270/0x270
> [14479.907179]  [<ffffffffba95ae4f>] ret_from_fork+0x3f/0x70
> [14479.907188]  [<ffffffffba082960>] ? kthread_create_on_node+0x270/0x270
> [14479.907243] Kernel Offset: 0x39000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> 
> Bisection tells me it's this commit:
> 
> commit 83129b37ef35bb6a7f01c060129736a8db5d31c4
> Author: Yanir Lubetkin <yanirx.lubetkin@intel.com>
> Date:   Tue Jun 2 17:05:45 2015 +0300
> 
>     e1000e: fix systim issues
> 
>     Two issues involving systim were reported.
>     1. Clock is not running in the correct frequency
>     2. In some situations, systim values were not incremented linearly
>     This patch fixes the hardware clock configuration and the spurious
>     non-linear increment.

Thanks Valdis!  I will have Yanir look into it and hopefully we should
have a fix here soon for you to verify.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 819 bytes
Desc: This is a digitally signed message part
URL: <http://lists.osuosl.org/pipermail/intel-wired-lan/attachments/20150618/3c52a5e2/attachment.asc>

WARNING: multiple messages have this Message-ID (diff)

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
To: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Yanir Lubetkin <yanirx.lubetkin@intel.com>,
	intel-wired-lan@lists.osuosl.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: e1000e driver - hang after 4 hours of uptime - finally bisected!
Date: Thu, 18 Jun 2015 15:34:23 -0700	[thread overview]
Message-ID: <1434666863.3530.74.camel@intel.com> (raw)
In-Reply-To: <5284.1434646014@turing-police.cc.vt.edu>

[-- Attachment #1: Type: text/plain, Size: 5555 bytes --]

On Thu, 2015-06-18 at 12:46 -0400, Valdis Kletnieks wrote:
> (follow up to a report from last week - bisecting took a while as I could
> only do 1 or 2 tests an evening)
> 
> My Dell Latitude E6530 crashes with a specific kernel lockup almost
> exactly 4 hours after boot if there isn't a cable connected to the
> Ethernet port:
> 
> [14508.846327] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0
> [14468.229720] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0
> [14463.254791] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0
> [14491.134413] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 1
> [14463.396593] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 2
> [14490.390223] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 1
> [14494.680591] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0
> [14513.365378] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 1
> [14482.271716] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 3
> [14479.906820] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0
> 
> As far as I can tell, the timestamp jitter is just how long it takes me to
> enter the cryptLUKS passphrase for the hard drive at boot...
> 
> lspci tells me:
> 
> lspci -vvv -s "00:19.0"
> 00:19.0 Ethernet controller: Intel Corporation 82579LM Gigabit Network Connection (rev 04)
>         DeviceName:  Onboard LAN
>         Subsystem: Dell Device 0535
>         Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
>         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>         Latency: 0
>         Interrupt: pin A routed to IRQ 28
>         Region 0: Memory at f7700000 (32-bit, non-prefetchable) [size=128K]
>         Region 1: Memory at f7739000 (32-bit, non-prefetchable) [size=4K]
>         Region 2: I/O ports at f040 [size=32]
>         Capabilities: [c8] Power Management version 2
>                 Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
>                 Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
>         Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
>                 Address: 00000000fee00318  Data: 0000
>         Capabilities: [e0] PCI Advanced Features
>                 AFCap: TP+ FLR+
>                 AFCtrl: FLR-
>                 AFStatus: TP-
>         Kernel driver in use: e1000e
> 
> 
> The traceback always looks like:
> 
> [14479.906820] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0
> 
> [14479.906908] Call Trace:
> [14479.906914]  <NMI>  [<ffffffffba94db16>] dump_stack+0x50/0xa8
> [14479.906930]  [<ffffffffba948bb9>] panic+0xcd/0x1e4
> [14479.906940]  [<ffffffffba166a60>] ? perf_event_task_disable+0xc0/0xc0
> [14479.906952]  [<ffffffffba125d8b>] watchdog_overflow_callback+0x9b/0xa0
> [14479.906959]  [<ffffffffba16a684>] __perf_event_overflow+0xc4/0x1f0
> [14479.906968]  [<ffffffffba16b3a4>] perf_event_overflow+0x14/0x20
> [14479.906976]  [<ffffffffba022271>] intel_pmu_handle_irq+0x1e1/0x430
> [14479.906990]  [<ffffffffba01a0f6>] perf_event_nmi_handler+0x26/0x40
> [14479.906999]  [<ffffffffba0085b3>] nmi_handle+0x103/0x340
> [14479.907005]  [<ffffffffba0084b5>] ? nmi_handle+0x5/0x340
> [14479.907017]  [<ffffffffba008a53>] default_do_nmi+0xc3/0x120
> [14479.907032]  [<ffffffffba008b98>] do_nmi+0xe8/0x130
> [14479.907044]  [<ffffffffba95c9a8>] end_repeat_nmi+0x1e/0x2e
> [14479.907055]  [<ffffffffba529886>] ? e1000e_cyclecounter_read+0x16/0xc0
> [14479.907061]  [<ffffffffba529886>] ? e1000e_cyclecounter_read+0x16/0xc0
> [14479.907069]  [<ffffffffba529886>] ? e1000e_cyclecounter_read+0x16/0xc0
> [14479.907075]  <<EOE>>  [<ffffffffba0e9529>] timecounter_read+0x19/0x60
> [14479.907088]  [<ffffffffba53687e>] e1000e_phc_gettime+0x2e/0x60
> [14479.907098]  [<ffffffffba536a31>] e1000e_systim_overflow_work+0x31/0x70
> [14479.907105]  [<ffffffffba07ad19>] process_one_work+0x3c9/0x980
> [14479.907115]  [<ffffffffba07ac62>] ? process_one_work+0x312/0x980
> [14479.907125]  [<ffffffffba07b348>] ? worker_thread+0x78/0x760
> [14479.907134]  [<ffffffffba07b59c>] worker_thread+0x2cc/0x760
> [14479.907144]  [<ffffffffba07b2d0>] ? process_one_work+0x980/0x980
> [14479.907154]  [<ffffffffba082a5e>] kthread+0xfe/0x120
> [14479.907163]  [<ffffffffba08ca50>] ? finish_task_switch+0x50/0x1c0
> [14479.907173]  [<ffffffffba082960>] ? kthread_create_on_node+0x270/0x270
> [14479.907179]  [<ffffffffba95ae4f>] ret_from_fork+0x3f/0x70
> [14479.907188]  [<ffffffffba082960>] ? kthread_create_on_node+0x270/0x270
> [14479.907243] Kernel Offset: 0x39000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> 
> Bisection tells me it's this commit:
> 
> commit 83129b37ef35bb6a7f01c060129736a8db5d31c4
> Author: Yanir Lubetkin <yanirx.lubetkin@intel.com>
> Date:   Tue Jun 2 17:05:45 2015 +0300
> 
>     e1000e: fix systim issues
> 
>     Two issues involving systim were reported.
>     1. Clock is not running in the correct frequency
>     2. In some situations, systim values were not incremented linearly
>     This patch fixes the hardware clock configuration and the spurious
>     non-linear increment.

Thanks Valdis!  I will have Yanir look into it and hopefully we should
have a fix here soon for you to verify.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

next prev parent reply	other threads:[~2015-06-18 22:34 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-06-18 16:46 [Intel-wired-lan] e1000e driver - hang after 4 hours of uptime - finally bisected! Valdis Kletnieks
2015-06-18 16:46 ` Valdis Kletnieks
2015-06-18 22:34 ` Jeff Kirsher [this message]
2015-06-18 22:34   ` Jeff Kirsher

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1434666863.3530.74.camel@intel.com \
    --to=jeffrey.t.kirsher@intel.com \
    --cc=intel-wired-lan@osuosl.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.