From: Per Oberg <pero@wolfram.com>
To: xenomai <xenomai@xenomai.org>
Subject: Re: Cyclic hardware reset for e1000e
Date: Mon, 18 Mar 2019 03:29:13 -0500 (CDT)
Message-ID: <570821601.340329.1552897753017.JavaMail.zimbra@wolfram.com>
In-Reply-To: <544217707.857557.1552467189711.JavaMail.zimbra@wolfram.com>

----- On 13 Mar 2019, at 9:53, Per Öberg pero@wolfram.com wrote:

> > ----- On 18 Feb 2019, at 13:43, Jan Kiszka jan.kiszka@siemens.com wrote:

> > > On 18.02.19 13:36, Per Oberg via Xenomai wrote:
> > > > Hello list

> > > > I have this issue where my e1000e network card gets into some kind of cyclic
> > > > hardware reset during operation. The weird thing is that this only happens when
> > > > I let systemd start the application. If it's started manually it always works
> > > > as intended.

> > > > I am running xenomai 3.0.7 with a linux-4.9.38 kernel and I use the network
> > > > connection in Linux non-rt mode. I use systemd and NetworkManager.

> > > > I do realize that once I get into the reset it will continue resetting because I
> > > > keep flooding the buffers. My issue is that it -never- happens when I start my
> > > > process manually, only when systemd starts it. Because the network goes down
> > > > quite badly I cannot log in and disable the service once it happens and
> > > > therefore I cannot really try starting it manually after letting the network
> > > > recover.

> > > > There is some information from Intel in [1] below. There is talk about a power
> > > > management function, the EEPROM, etc. They specifically write:

> > > > "82573(V/L/E) TX Unit Hang Messages
> > > > Several adapters with the 82573 chipset display "TX unit hang" messages during
> > > > normal operation with the e1000 driver. The issue appears both with TSO enabled
> > > > and disabled, and is caused by a power management function that is enabled in
> > > > the EEPROM. Early releases of the chipsets to vendors had the EEPROM bit that
> > > > enabled the feature. After the issue was discovered newer adapters were
> > > > released with the feature disabled in the EEPROM."

> > > > I also read something about disabling GRO/TSO/GSO that helped some people.

> > > > My questions to the list are:

> > > > 1. Have you guys any experience with this?
> > > > 2. Would I be better off using the RTnet drivers?
> > > > 3. What could cause the issue to trigger only when run by systemd? (I thought
> > > > about timing issues and NetworkManager, but how do I debug this?)

> > > > [1]
> > > > https://serverfault.com/questions/193114/linux-e1000e-intel-networking-driver-problems-galore-where-do-i-start

> > > > Thoughts anyone?

> > > Are you giving Linux enough time to work (no 100% RT domination of any core for
> > > hundreds of milliseconds or longer)?

> > I am not sure yet. I have a logging function that reports back to me when I
> > lose samples. Losing samples currently makes the software try to catch
> > up, which means 100% CPU until it does. I do see this being logged around
> > the time it resets, but I'm not sure if it's much worse than "usual". If for
> > some reason the hardware reset happens because Linux gets starved, I can easily
> > see this going cyclic.

> > Per Öberg

> So, I have managed to do some checking.

> It looks like the cyclic resets are about 80-100 seconds apart.
> Before the first reset we are most likely holding the CPUs for about 3-4ms.

> I managed to get hold of a kernel message saying:
> [...] WARNING: CPU: 0 PID: 3 at net/sched/sch_generic.c:316 dev_watchdog+0x215/0x220
> [...] NETDEV WATCHDOG: enp0s31f6 (e1000e): transmit queue 0 timed out

> The full trace is shown below.

> One difference that I have found is that I am running with "--cpu-affinity=2,3"
> when running manually, but not when using systemd to start the program. Can
> this have an impact?

> -------------------- DMESG TRACE -----------------------------------------

> [31865.706967] ------------[ cut here ]------------
> [31865.706973] WARNING: CPU: 0 PID: 3 at net/sched/sch_generic.c:316 dev_watchdog+0x215/0x220
> [31865.706974] NETDEV WATCHDOG: enp0s31f6 (e1000e): transmit queue 0 timed out
> [31865.706974] Modules linked in: iTCO_wdt iTCO_vendor_support ppdev i915
> intel_rapl intel_powerclamp coretemp kvm_intel kvm drm_kms_helper irqbypass
> crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel drm intel_gtt
> aesni_intel agpgart aes_x86_64 fb_sys_fops lrw gf128mul glue_helper e1000e
> ablk_helper syscopyarea cryptd sysfillrect sysimgblt efi_pstore igb xhci_pci
> psmouse xhci_hcd dca pcspkr i2c_algo_bit serio_raw ptp efivars pps_core
> xeno_can_peak_pci xeno_can_sja1000 xeno_can i2c_i801 shpchp i2c_smbus hci_uart
> btbcm btintel bluetooth parport_pc parport pinctrl_sunrisepoint pinctrl_intel
> i2c_hid tpm_tis tpm_tis_core tpm sch_fq_codel efivarfs ipv6 crc_ccitt
> [31865.707329] CPU: 0 PID: 3 Comm: ksoftirqd/0 Not tainted 4.9.38-xenomai+ #6
> [31865.707330] Hardware name: Default string Default string/SKYBAY, BIOS 5.11 09/22/2016
> [31865.707331] I-pipe domain: Linux
> [31865.707333] ffffc90000033c80 ffffffff813e0324 ffffc90000033cd0 0000000000000000
> [31865.707336] ffffc90000033cc0 ffffffff81054b67 0000013c6dc2eb00 0000000000000000
> [31865.707517] ffff88026048fc80 0000000000000000 ffff88025ed74000 0000000000000001
> [31865.707520] Call Trace:
> [31865.707524] [<ffffffff813e0324>] dump_stack+0x96/0xc2
> [31865.707526] [<ffffffff81054b67>] __warn+0xc7/0xf0
> [31865.707527] [<ffffffff81054bda>] warn_slowpath_fmt+0x4a/0x50
> [31865.707529] [<ffffffff81a04be0>] ? dev_graft_qdisc+0x70/0x70
> [31865.707568] [<ffffffff81a04df5>] dev_watchdog+0x215/0x220
> [31865.707569] [<ffffffff81a04be0>] ? dev_graft_qdisc+0x70/0x70
> [31865.707571] [<ffffffff81a04be0>] ? dev_graft_qdisc+0x70/0x70
> [31865.707573] [<ffffffff810a6d47>] call_timer_fn.isra.25+0x17/0x70
> [31865.707575] [<ffffffff810a6e47>] expire_timers+0xa7/0xd0
> [31865.707576] [<ffffffff810a6eec>] run_timer_softirq+0x7c/0x160
> [31865.707578] [<ffffffff81aae546>] ? _raw_spin_unlock_irq+0x16/0x30
> [31865.707581] [<ffffffff810595b6>] __do_softirq+0xe6/0x1e0
> [31865.707583] [<ffffffff810596e2>] run_ksoftirqd+0x32/0x40
> [31865.707584] [<ffffffff81073ff5>] smpboot_thread_fn+0x165/0x230
> [31865.707611] [<ffffffff81073e90>] ? sort_range+0x20/0x20
> [31865.707827] [<ffffffff81070962>] kthread+0xd2/0xf0
> [31865.707829] [<ffffffff81070890>] ? kthread_park+0x60/0x60
> [31865.707831] [<ffffffff81aaed33>] ret_from_fork+0x23/0x30
> [31865.707834] ---[ end trace 111a72a07d1d2f26 ]---
> [31865.743096] e1000e 0000:00:1f.6 enp0s31f6: Reset adapter unexpectedly
> [31867.827820] e1000e: enp0s31f6 NIC Link is Up 100 Mbps Full Duplex, Flow Control: Rx/Tx


Does anyone know what causes this message?
"NETDEV WATCHDOG: enp0s31f6 (e1000e): transmit queue 0 timed out"

Is it only me hogging all resources, or are there other possibilities?
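
To put numbers on the "80-100 seconds apart" observation, something like the following could be used. It is only a minimal sketch, assuming the kernel log has been saved as text with the usual "[seconds.micros]" timestamps; the function name reset_intervals and its marker parameter are made up for illustration and are not part of any driver or tool.

```python
import re

# Matches a dmesg timestamp such as "[31865.743096]" at the start of a line,
# optionally preceded by email quote markers ("> ").
_TS = re.compile(r"^[>\s]*\[\s*(\d+\.\d+)\]\s*(.*)$")


def reset_intervals(dmesg_text, marker="Reset adapter unexpectedly"):
    """Return the seconds elapsed between consecutive lines containing marker.

    Hypothetical helper: the marker defaults to the e1000e message seen in
    the trace above, but any substring of a log line can be passed instead.
    """
    stamps = []
    for line in dmesg_text.splitlines():
        match = _TS.match(line)
        if match and marker in match.group(2):
            stamps.append(float(match.group(1)))
    # Pair each timestamp with the next one and take the difference.
    return [round(later - earlier, 6)
            for earlier, later in zip(stamps, stamps[1:])]
```

Feeding it the saved output of dmesg as one string gives one interval per pair of consecutive resets, which makes it easy to see whether the period is stable or drifting.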


Does anyone know if I would benefit from using "--cpu-affinity=2,3"? The warning reports "CPU: 0", so my assumption is that scheduling my tasks on cores that are not used for handling interrupts might help.
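
One way to check that assumption would be to look at which CPUs actually service the NIC's interrupts. The following is a rough sketch, assuming the interrupt is listed in /proc/interrupts under the interface name (enp0s31f6 here), which may not hold on every system; irq_counts_per_cpu is a hypothetical helper name.

```python
def irq_counts_per_cpu(interrupts_text, name):
    """Sum per-CPU interrupt counts for /proc/interrupts rows mentioning name.

    interrupts_text is the content of /proc/interrupts; name is a substring
    of the row to look for (assumed here to be the interface name).
    """
    lines = interrupts_text.splitlines()
    if not lines:
        return {}
    cpus = lines[0].split()              # header row: CPU0 CPU1 ...
    totals = dict.fromkeys(cpus, 0)
    for line in lines[1:]:
        if name not in line:
            continue
        fields = line.split()
        # fields[0] is the IRQ label ("126:"); the next len(cpus) are counts.
        for cpu, count in zip(cpus, fields[1:1 + len(cpus)]):
            if count.isdigit():
                totals[cpu] += int(count)
    return totals
```

Reading /proc/interrupts and calling irq_counts_per_cpu(text, "enp0s31f6") would show whether the counts are concentrated on CPU0. On Linux, os.sched_getaffinity(0) in Python reports the CPU set of the calling process, which could be compared between the manual and the systemd start.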


Per Öberg

